Optimizing ClickHouse Queries

A Step by Step Troubleshooting Journey

Posted by pinylin on January 15, 2025

Previously, a product line used the ELK stack for its logging system, processing approximately 1.5 TB of data daily. After migrating to ClickHouse, the daily data volume dropped to around 0.5 TB (excluding replicas). However, after the migration we hit a critical issue: keyword search had become painfully slow, and we were unable to effectively retrieve historical data.

Initially, no one suspected that the issue was the query statements; the consensus was that the large volume of data was the primary culprit for the low query efficiency.

So, the first step was to test the performance of the full-text index.

## Test Case

- full_text

  ```
  ┌─database─┬─table─────────┬─compressed_size─┬─uncompressed_size─┬─ratio─┐
  │ logs     │ hackernews_ft │ 6.24 GiB        │ 11.00 GiB         │  1.76 │
  └──────────┴───────────────┴─────────────────┴───────────────────┴───────┘
  ```

- tokenbf

  ```
  ┌─database─┬─table──────────┬─compressed_size─┬─uncompressed_size─┬─ratio─┐
  │ logs     │ hackernews_tbf │ 6.24 GiB        │ 11.00 GiB         │  1.76 │
  └──────────┴────────────────┴─────────────────┴───────────────────┴───────┘
  ```
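
For reference, size figures like the ones above can be pulled from `system.parts`; this is a minimal sketch, since the exact query used for the tests is not shown in the post:

```sql
-- Sketch: per-table compressed/uncompressed size and compression ratio,
-- summed over active parts. Table names are the ones used in this post.
SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes))   AS compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND database = 'logs' AND table LIKE 'hackernews_%'
GROUP BY database, table;
```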

`count(id)` query times in seconds for each token, with the number of matching rows (`cnt`):

| token     | tokenbf_v1 | full_text | t&f    | cnt  |
|-----------|-----------:|----------:|-------:|-----:|
| kilobyte  | 7.218      | 5.327     | 7.515  | 459  |
| megabyte  | 8.835      | 8.994     | 7.283  | 1741 |
| gigabyte  | 11.473     | 6.708     | 9.463  | 2648 |
| terabyte  | 10.705     | 6.856     | 11.727 | 1646 |
| petabyte  | 8.535      | 8.197     | 7.348  | 929  |
| exabyte   | 3.066      | 2.029     | 2.138  | 191  |
| zettabyte | 0.478      | 0.406     | 1.283  | 53   |
| yottabyte | 0.673      | 0.502     | 1.019  | 68   |
| assessmen | 0.609      | 0.163     | 0.271  | 15   |
| gatewa    | 0.553      | 0.180     | 0.311  | 13   |
| clickhous | 0.286      | 0.063     | 0.059  | 2    |
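
The timings above presumably come from token-search counts along the following lines (a sketch; the post does not show the exact statements):

```sql
-- Sketch: time a count over rows whose comment contains the token,
-- once per table / index variant.
SELECT count(id) FROM hackernews_tbf WHERE hasToken(lower(comment), 'gigabyte');
SELECT count(id) FROM hackernews_ft  WHERE hasToken(lower(comment), 'gigabyte');
```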

With ==index_granularity = 8192; GRANULARITY = 1==, and assuming each log entry has about 150 tokens, let's do the math with a Bloom filter calculator: the false-positive rate is absolutely fine, so the index parameters are not the problem.
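
The same back-of-the-envelope check can be done in ClickHouse itself with the standard Bloom filter formula p = (1 - e^(-kn/m))^k; the parameter values below are illustrative assumptions, not the actual index settings:

```sql
-- Sketch: Bloom filter false-positive rate for one index granule.
-- n, m and k are assumed values, not taken from the real table definition.
SELECT
    100000                        AS n,  -- assumed distinct tokens per granule
    256 * 1024 * 8                AS m,  -- assumed filter size: 256 KiB in bits
    3                             AS k,  -- assumed number of hash functions
    pow(1 - exp(-(k * n) / m), k) AS false_positive_rate;
```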

So what was the issue? Then I suddenly recalled that some of the query statements had several sort keys after the ORDER BY.

## SELECT LIMIT

`count(gigabyte) = 2648` and `count(this) = 7198701` both take around 12 s, so they are roughly on the same level; that duration is also roughly equivalent to the time it takes to scan the entire table.
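
To confirm that such a query really reads the whole table, one option (an illustrative sketch, not taken from the post) is to compare `read_rows` in `system.query_log` against the table's total row count:

```sql
-- Sketch: rows read and duration of recent hasToken queries, from the query log.
SELECT query, read_rows, query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%hasToken(lower(comment)%'
ORDER BY event_time DESC
LIMIT 10;
```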

- With the table's sort key `ORDER BY (type, author)`:

  ```sql
  SELECT * FROM hackernews_ft WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type limit 50;
  -- 3.153 sec

  SELECT * FROM hackernews_tbf WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type limit 50;
  -- 2.599 sec
  ```
- Wrong ORDER BY case, which essentially does not use the primary index:

  ```sql
  SELECT * FROM hackernews_ft WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type, url desc limit 50;
  -- 10.285 sec

  SELECT * FROM hackernews_tbf WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type, url desc limit 50;
  -- 8.699 sec
  ```

It's evident that in scenarios with a large result set, full-text search offers no advantage.
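
To see why the two sort keys behave so differently, one option (a sketch, not something shown in the original post) is to compare the query plans: when the ORDER BY matches a prefix of the table's sort key, ClickHouse can read the data in order and stop as soon as the LIMIT is satisfied.

```sql
-- Sketch: compare the plans of the fast and slow variants. ORDER BY type is a
-- prefix of the table sort key (type, author), so the read-in-order
-- optimization can apply; ORDER BY type, url DESC is not, so the whole
-- filtered result has to be sorted before LIMIT 50 takes effect.
EXPLAIN PLAN
SELECT * FROM hackernews_ft WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type LIMIT 50;

EXPLAIN PLAN
SELECT * FROM hackernews_ft WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type, url DESC LIMIT 50;
```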

Having ruled out index selection, I became convinced that the query statements themselves were the problem: they were essentially not using the primary index at all.

Let's create a new table to verify this.

## Test Case (Order)


```sql
CREATE TABLE hackernews_ord (
    id UInt64, deleted UInt8, type String, author String, timestamp DateTime,
    comment String, dead UInt8, parent UInt64, poll UInt64, children Array(UInt32),
    url String, score UInt32, title String, parts Array(UInt32), descendants UInt32
)
ENGINE = MergeTree
ORDER BY (toYYYYMMDD(timestamp), type, author, poll);
```

Then add a `tokenbf_v1` index on `comment` (see the sketch below) and start querying.
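
A minimal sketch of how such an index might be added; the `tokenbf_v1` parameters (filter size, hash count, seed) and the index name are assumptions, since the post does not list them:

```sql
-- Assumed index definition; parameter values are illustrative only.
ALTER TABLE hackernews_ord
    ADD INDEX comment_idx lower(comment) TYPE tokenbf_v1(30720, 3, 0) GRANULARITY 1;

-- Build the index for data that was inserted before the index existed.
ALTER TABLE hackernews_ord MATERIALIZE INDEX comment_idx;
```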

==The results are obvious at a glance:==

```sql
SELECT * FROM hackernews_ord WHERE hasToken(lower(comment), 'gigabyte') ORDER BY type limit 50;
-- 12.349 sec  (does not follow the table's sort-key prefix)

SELECT * FROM hackernews_ord WHERE hasToken(lower(comment), 'gigabyte') ORDER BY timestamp limit 50;
-- 13.433 sec

SELECT * FROM hackernews_ord WHERE hasToken(lower(comment), 'gigabyte') ORDER BY toYYYYMMDD(timestamp) limit 50;
-- 2.223 sec

SELECT * FROM hackernews_ord WHERE hasToken(lower(comment), 'gigabyte') ORDER BY toYYYYMMDD(timestamp), poll limit 50;
-- 6.785 sec
```

## Conclusion

So the conclusion is that the query statements were not even making use of the primary index, while everyone had been doubting the performance of the secondary index.
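
For completeness, a hedged way to check this directly (not part of the original write-up): `EXPLAIN indexes = 1` reports how many parts and granules the primary key and the skip indexes select for a given query.

```sql
-- Sketch: inspect primary-key and skip-index selectivity for one of the
-- test queries on the hackernews_ord table.
EXPLAIN indexes = 1
SELECT *
FROM hackernews_ord
WHERE hasToken(lower(comment), 'gigabyte')
ORDER BY toYYYYMMDD(timestamp)
LIMIT 50;
```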