I was recently reading this article from ClickHouse demonstrating ClickHouse’s integration with Postgres and its performance characteristics. I followed along. Here are my notes.
My experimental setup was slightly different. Rather than using hosted services (Supabase and ClickHouse Cloud), I ran everything locally on my M2 MacBook Air. Versions in question:
- PostgreSQL 15.6 (Postgres.app) on aarch64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.102), 64-bit
- ClickHouse server version 23.12.4.15 (official build).
Dataset
The dataset used for the blog post is downloadable from this link. It is provided as an SQL file inside a tarball. This seems silly – it’s a single file, so just GZIP it, no TAR necessary. But alas. Also, the format seems to be many `INSERT`s, each `INSERT` containing 10000 rows. This is fine, but I think we could do better by using `COPY`. I’ll present the time to insert with both approaches. I have a separate post drafted but not yet published on different ways to convert the SQL data into `COPY`-able data.
In any case, I initialized the database like this:
- Create the database with `createdb`
- Create the table
- Insert the data (using either the SQL file or `COPY`)
- Create the indices
This differs slightly from the order in the post, which creates the indices first. I generally like to create indices after inserting a bunch of data, but YMMV. I also thought about running `ANALYZE`, but I didn’t want to venture too far from what the post did.
Timing for `INSERT` (starting from a freshly-created table):
% time psql housetest < uk_prices.sql
INSERT 0 10000
INSERT 0 10000
INSERT 0 10000
...
INSERT 0 10000
INSERT 0 10000
INSERT 0 4966
psql housetest < uk_prices.sql 71.28s user 6.23s system 20% cpu 6:18.56 total
Timing for `COPY` (starting from a freshly-created table – i.e., I dropped and recreated the database and table before trying this):
% time psql housetest -c "COPY uk_price_paid (price, date, postcode1, postcode2, type, is_new, duration, addr1, addr2, street, locality, town, district, county) FROM '$PWD/uk_prices.tsv' HEADER"
COPY 27734966
psql housetest -c 0.01s user 0.01s system 0% cpu 56.382 total
Almost 7x faster. Worth it? You decide.
Oh, and not to forget the indices:
% psql housetest
psql (15.6 (Postgres.app))
Type "help" for help.
housetest=# \timing on
Timing is on.
housetest=# CREATE INDEX ON uk_price_paid (type);
CREATE INDEX
Time: 50008.346 ms (00:50.008)
housetest=# CREATE INDEX ON uk_price_paid (town);
CREATE INDEX
Time: 11661.637 ms (00:11.662)
housetest=# CREATE INDEX ON uk_price_paid (extract(year from date));
CREATE INDEX
Time: 23902.013 ms (00:23.902)
ClickHouse setup
Buried in the blog post is this tidbit:
> Note we use the setting `external_table_functions_use_nulls = 0`, to ensure Null values are represented as their default values (instead of `Null`).
It was not obvious how to do this at first. To do so:
- Start `clickhouse server` in a desired data directory (in order to create the skeleton `config.xml`)
- Interrupt it after it starts
- Copy `preprocessed_configs/config.xml` to just `config.xml` and remove the “do not edit” comment at the top
- Find this block:

      <profiles>
          <default/>
      </profiles>

- Change it to:

      <profiles>
          <default>
              <external_table_functions_use_nulls>0</external_table_functions_use_nulls>
          </default>
      </profiles>

- Start `clickhouse server` again
- Verify by running the query `SELECT name, value FROM system.settings WHERE changed` and ensure you see a row saying `external_table_functions_use_nulls` is 0
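As an aside, I believe the same thing can be done per session rather than via the profile – the setting needs to be in effect when the proxy table is created, since that’s when the schema gets inferred. I stuck with the config-file approach to match the post, so treat this as an untested sketch:

SET external_table_functions_use_nulls = 0
-- then create the postgresql() proxy table in the same clickhouse-client session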
Additionally, we’re going to start out with the CREATE TABLE
alias off the bat:
CREATE TABLE uk_price_paid_postgresql AS postgresql('localhost', 'housetest', 'uk_price_paid', '<username>', '')
SHOW CREATE TABLE uk_price_paid_postgresql
Ensure there is no `Nullable` in the `SHOW CREATE TABLE` result.
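As a quick sanity check that the proxy table actually reaches Postgres (not from the post, just something I like to do), a simple count should match the ~27.7 million rows loaded earlier – and since it round-trips to Postgres, expect it to be slower than a native ClickHouse count:

SELECT count() FROM uk_price_paid_postgresql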
Lastly, we’ll set up the MergeTree copy of the data. I had to correct the query as the INSERT
provided by the blog post had an extraneous _v2
, but the rest of it worked fine. For posterity, the time taken to copy the data into a MergeTree:
INSERT INTO uk_price_paid SELECT * EXCEPT id
FROM uk_price_paid_postgresql
0 rows in set. Elapsed: 29.022 sec. Processed 27.73 million rows, 5.13 GB (955.65 thousand rows/s., 176.78 MB/s.)
Peak memory usage: 273.53 MiB.
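For reference, the MergeTree table itself is defined in the blog post; I won’t copy its DDL verbatim, but it roughly follows the schema from ClickHouse’s UK property prices example – the column types and ORDER BY here are my reconstruction and may differ slightly from the post:

CREATE TABLE uk_price_paid
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)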
Results by query
Average price per year for flats in the UK
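For context, the Postgres version of this query is roughly the following – a reconstruction of the shape rather than a verbatim copy from the post, and presumably why it creates the expression index on extract(year from date):

SELECT extract(year FROM date) AS year, round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'  -- the dataset's property-type strings; the post's exact query may differ
GROUP BY year
ORDER BY year;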
| Engine | Published figure | My measurement |
|---|---|---|
| Postgres | 28.535 s | 0.896 s |
| ClickHouse + Postgres | 26.408 s | 2.972 s |
| ClickHouse + MergeTree | 0.079 s | 0.029 s |
- My Postgres performance degraded if I ran a `VACUUM ANALYZE` prior to the test. This seems counterintuitive. After discovering that this harmed performance, I deleted everything and started over without this step.
- My Postgres query plan is slightly different from the blog post’s. Mine seems to use a bitmap heap scan on the inner loop rather than a parallel seq scan. My “gather merge” node also uses 2 workers instead of 1, and the estimated costs/rows are different. (The result set is the same, however.)
Most expensive postcodes in a city
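Schematically, this query filters to a single town and aggregates by postcode – something like the following, where the specific city, column aliases, and LIMIT are my placeholders rather than what the post uses:

SELECT postcode1, postcode2, count(*) AS n, round(avg(price)) AS avg_price
FROM uk_price_paid
WHERE town = 'LONDON'  -- placeholder city
GROUP BY postcode1, postcode2
ORDER BY avg_price DESC
LIMIT 10;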
| Engine | Published figure | My measurement |
|---|---|---|
| Postgres | 0.543 s | 0.052 s |
| ClickHouse + Postgres | 2.362 s | 0.123 s |
| ClickHouse + MergeTree | 0.077 s | 0.022 s |
- My Postgres query plan for this one also uses a bitmap index scan instead of a parallel index scan.
Postcodes in London with the largest percentage price change in the last 20 yrs
| Engine | Published figure | My measurement |
|---|---|---|
| Postgres | 8.904 s | 0.332 s |
| ClickHouse + Postgres | 28.531 s | 3.067 s |
| ClickHouse + MergeTree | 0.062 s | 0.032 s |
- I tested only the faster ClickHouse query with `medianIf` (a sketch of its shape is below).
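`medianIf` is ClickHouse’s median aggregate with the `-If` combinator, which aggregates only the rows matching a condition, so the per-postcode medians for the two endpoint years can be computed in a single pass. A rough sketch of the shape – my reconstruction, with placeholder years, filters, and LIMIT rather than the post’s exact query:

SELECT
    postcode1,
    postcode2,
    medianIf(price, toYear(date) = 2003) AS median_2003,  -- median price ~20 years ago
    medianIf(price, toYear(date) = 2023) AS median_2023,  -- median price recently
    round(100 * (median_2023 - median_2003) / median_2003) AS pct_change
FROM uk_price_paid
WHERE town = 'LONDON'
GROUP BY postcode1, postcode2
ORDER BY pct_change DESC
LIMIT 10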
Conclusions
- My laptop is way faster than whatever their cloud environment is set up with.
- ClickHouse + Postgres is nearly always slower (in these examples, at least) than Postgres would be on its own.
- However, when ClickHouse owns the data in a MergeTree, it provides solid performance wins. ClickHouse’s Postgres interop seems great for bulk data transfer and one-off queries, but you probably don’t want to use it for anything too heavy.
Tags: ClickHouse, PostgreSQL