I was recently reading this article from ClickHouse demonstrating ClickHouse’s integration with Postgres and its performance characteristics. I followed along. Here are my notes.
My experimental setup was slightly different. Rather than using hosted services (Supabase and ClickHouse Cloud), I ran everything locally on my M2 MacBook Air. Versions in question:
- PostgreSQL 15.6 (Postgres.app) on aarch64-apple-darwin21.6.0, compiled by Apple clang version 14.0.0 (clang-1400.0.29.102), 64-bit
- ClickHouse server version 23.12.4.15 (official build).
Dataset
The dataset used for the blog post is downloadable from this link. It is provided as an SQL file inside a tarball. This seems silly – it’s a single file, so just GZIP it, no TAR necessary. But alas. Also, the format seems to be many `INSERT`s, each `INSERT` containing 10000 rows. This is fine, but I think we could do better by using `COPY`. I’ll present the time to insert with both approaches. I have a separate post drafted but not yet published on different ways to convert the SQL data into `COPY`-able data.
In any case, I initialized the database like this:
- Create the database with `createdb`
- Create the table
- Insert the data (using either the SQL file or `COPY`)
- Create the indices
This differs slightly from the order in the post, which creates the indices first. I generally like to create indices after inserting a bunch of data, but YMMV. I also thought about running `ANALYZE`, but I didn’t want to venture too far from what the post did.
Timing for `INSERT` (starting from a freshly-created table):
% time psql housetest < uk_prices.sql
INSERT 0 10000
INSERT 0 10000
INSERT 0 10000
...
INSERT 0 10000
INSERT 0 10000
INSERT 0 4966
psql housetest < uk_prices.sql 71.28s user 6.23s system 20% cpu 6:18.56 total
Timing for `COPY` (starting from a freshly-created table – i.e., I dropped and recreated the database and table before trying this):
% time psql housetest -c "COPY uk_price_paid (price, date, postcode1, postcode2, type, is_new, duration, addr1, addr2, street, locality, town, district, county) FROM '$PWD/uk_prices.tsv' HEADER"
COPY 27734966
psql housetest -c 0.01s user 0.01s system 0% cpu 56.382 total
Almost 7x faster. Worth it? You decide.
Oh, and not to forget the indices:
% psql housetest
psql (15.6 (Postgres.app))
Type "help" for help.
housetest=# \timing on
Timing is on.
housetest=# CREATE INDEX ON uk_price_paid (type);
CREATE INDEX
Time: 50008.346 ms (00:50.008)
housetest=# CREATE INDEX ON uk_price_paid (town);
CREATE INDEX
Time: 11661.637 ms (00:11.662)
housetest=# CREATE INDEX ON uk_price_paid (extract(year from date));
CREATE INDEX
Time: 23902.013 ms (00:23.902)
ClickHouse setup
Buried in the blog post is this tidbit:
> Note we use the setting `external_table_functions_use_nulls = 0`, to ensure Null values are represented as their default values (instead of `Null`).
It was not obvious how to do this at first. To do so:
- Start `clickhouse server` in a desired data directory (in order to create the skeleton `config.xml`)
- Interrupt it after it starts
- Copy `preprocessed_configs/config.xml` to just `config.xml` and remove the “do not edit” comment at the top
- Find this block:

      <profiles>
          <default/>
      </profiles>

- Change it to:

      <profiles>
          <default>
              <external_table_functions_use_nulls>0</external_table_functions_use_nulls>
          </default>
      </profiles>

- Start `clickhouse server` again
- Verify by running the query `SELECT name, value FROM system.settings WHERE changed` and ensure you see a row saying `external_table_functions_use_nulls` is 0
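As an aside, I believe the same thing can be done per session rather than via the profile – the setting needs to be in effect when the proxy table is created, since that’s when the schema gets inferred. I stuck with the config-file approach to match the post, so treat this as an untested sketch:

SET external_table_functions_use_nulls = 0
-- then create the postgresql() proxy table in the same clickhouse-client session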
Additionally, we’re going to start out with the CREATE TABLE
alias off the bat:
CREATE TABLE uk_price_paid_postgresql AS postgresql('localhost', 'housetest', 'uk_price_paid', '<username>', '')
SHOW CREATE TABLE uk_price_paid_postgresql
Ensure there is no `Nullable` in the `SHOW CREATE TABLE` result.
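As a quick sanity check that the proxy table actually reaches Postgres (not from the post, just something I like to do), a simple count should match the ~27.7 million rows loaded earlier – and since it round-trips to Postgres, expect it to be slower than a native ClickHouse count:

SELECT count() FROM uk_price_paid_postgresql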
Lastly, we’ll set up the MergeTree copy of the data. I had to correct the query as the INSERT
provided by the blog post had an extraneous _v2
, but the rest of it worked fine. For posterity, the time taken to copy the data into a MergeTree:
INSERT INTO uk_price_paid SELECT * EXCEPT id
FROM uk_price_paid_postgresql
0 rows in set. Elapsed: 29.022 sec. Processed 27.73 million rows, 5.13 GB (955.65 thousand rows/s., 176.78 MB/s.)
Peak memory usage: 273.53 MiB.
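For reference, the MergeTree table itself is defined in the blog post; I won’t copy its DDL verbatim, but it roughly follows the schema from ClickHouse’s UK property prices example – the column types and ORDER BY here are my reconstruction and may differ slightly from the post:

CREATE TABLE uk_price_paid
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2)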
Results by query
Average price per year for flats in the UK
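For context, the Postgres version of this query is roughly the following – a reconstruction of the shape rather than a verbatim copy from the post, and presumably why it creates the expression index on extract(year from date):

SELECT extract(year FROM date) AS year, round(avg(price)) AS price
FROM uk_price_paid
WHERE type = 'flat'  -- the dataset's property-type strings; the post's exact query may differ
GROUP BY year
ORDER BY year;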
| Engine | Published figure | My measurement |
|---|---|---|
| Postgres | 28.535 s | 0.896 s |
| ClickHouse + Postgres | 26.408 s | 2.972 s |
| ClickHouse + MergeTree | 0.079 s | 0.029 s |
- My Postgres performance degraded if I ran a `VACUUM ANALYZE` prior to the test. This seems counterintuitive. After discovering that this harmed performance, I deleted everything and started over without this step.
- My Postgres query plan is slightly different from the blog post’s. Mine seems to use a bitmap heap scan on the inner loop rather than a parallel seq scan. My “gather merge” node also uses 2 workers instead of 1, and the estimated costs/rows are different. (The result set is the same, however.)
Most expensive postcodes in a city
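Schematically, this query filters to a single town and aggregates by postcode – something like the following, where the specific city, column aliases, and LIMIT are my placeholders rather than what the post uses:

SELECT postcode1, postcode2, count(*) AS n, round(avg(price)) AS avg_price
FROM uk_price_paid
WHERE town = 'LONDON'  -- placeholder city
GROUP BY postcode1, postcode2
ORDER BY avg_price DESC
LIMIT 10;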
| Engine | Published figure | My measurement |
|---|---|---|
| Postgres | 0.543 s | 0.052 s |
| ClickHouse + Postgres | 2.362 s | 0.123 s |
| ClickHouse + MergeTree | 0.077 s | 0.022 s |
- My Postgres query plan for this one also uses a bitmap index scan instead of a parallel index scan.
Postcodes in London with the largest percentage price change in the last 20 yrs
| Engine | Published figure | My measurement |
|---|---|---|
| Postgres | 8.904 s | 0.332 s |
| ClickHouse + Postgres | 28.531 s | 3.067 s |
| ClickHouse + MergeTree | 0.062 s | 0.032 s |
- I tested only the faster ClickHouse query with `medianIf` (a sketch of its shape is below).
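`medianIf` is ClickHouse’s median aggregate with the `-If` combinator, which aggregates only the rows matching a condition, so the per-postcode medians for the two endpoint years can be computed in a single pass. A rough sketch of the shape – my reconstruction, with placeholder years, filters, and LIMIT rather than the post’s exact query:

SELECT
    postcode1,
    postcode2,
    medianIf(price, toYear(date) = 2003) AS median_2003,  -- median price ~20 years ago
    medianIf(price, toYear(date) = 2023) AS median_2023,  -- median price recently
    round(100 * (median_2023 - median_2003) / median_2003) AS pct_change
FROM uk_price_paid
WHERE town = 'LONDON'
GROUP BY postcode1, postcode2
ORDER BY pct_change DESC
LIMIT 10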
Conclusions
- My laptop is way faster than whatever their cloud environment is set up with.
- ClickHouse + Postgres is nearly always slower (in these examples, at least) than Postgres would be on its own.
- However, when ClickHouse owns the data in a MergeTree, it provides solid performance wins. ClickHouse’s Postgres interop seems great for bulk data transfer and one-off queries, but you probably don’t want to use it for anything too heavy.
Tags: ClickHouse, PostgreSQL