新年快乐,今年的跨年我选择在酒店构建ClickHouse。顺便总结一下最近用到的ClickHouse full_text index

构建arm binary
有些信创项目用的arm服务器,由于没有SSE4_2 指令,所以需要自己构建ClickHouse
As far as I know, ARM v8.0 builds are only provided for the master branch.
It is really simple to compile the ClickHouse code by yourself. Here is a tutorial. Just make sure to check out v24.8 (the last LTS version) and pass -DNO_ARMV81_OR_HIGHER=1 to CMake.
build
git submodule status git submodule update –init –recursive
1
2
mkdir build
cd build
1
2
export CC=clang-18 CXX=clang++-18
cmake -DNO_ARMV81_OR_HIGHER=1 ..
1
ninja clickhouse-server clickhouse-client
full_text index
- full_text
1 2
ALTER TABLE hackernews ADD INDEX comment_lowercase(lower(comment)) TYPE full_text; ALTER TABLE hackernews MATERIALIZE INDEX comment_lowercase;
- tokenbf index
1 2
ALTER TABLE hackernews ADD INDEX token_lowercase(lower(comment)) TYPE tokenbf_v1(30720, 2, 0); ALTER TABLE hackernews MATERIALIZE INDEX token_lowercase;
测试数据
测试环境是2C4G的轻量云,所以以下性能请相对参考
- no index
1 2
┌─database─┬─table──────┬─compressed_size─┬─uncompressed_size─┬─disk_size─┐ │ logs │ hackernews │ 6.24 GiB │ 11.00 GiB │ 6.24 GiB │ └──────────┴────────────┴─────────────────┴───────────────────┴───────────┘
- tokenbf
1 2
┌─database─┬─table──────┬─compressed_size─┬─uncompressed_size─┬─disk_size─┐ │ logs │ hackernews │ 6.24 GiB │ 11.00 GiB │ 6.34 GiB │ └──────────┴────────────┴─────────────────┴───────────────────┴───────────┘
- full-text
查表跟没有索引一致, 可能跟实验特性有关,
1 2
┌─database─┬─table──────┬─compressed_size─┬─uncompressed_size─┬─disk_size─┐ │ logs │ hackernews │ 6.24 GiB │ 11.00 GiB │ 6.24 GiB │ └──────────┴────────────┴─────────────────┴───────────────────┴───────────┘
1
2
EXPLAIN indexes=1
SELECT count()FROM hackernews WHERE hasToken(lower(comment), 'clickhouse');
case 1:
1
2
3
4
5
6
7
8
9
SELECT count()FROM hackernews WHERE hasToken(lower(comment), 'clickhouse');
-- 无索引
-- 1 row in set. Elapsed: 28.060 sec. Processed 28.74 million rows, 9.75 GB (1.02 million rows/s., 347.60 MB/s.)
-- full_text
-- 1 row in set. Elapsed: 2.214 sec. Processed 4.39 million rows, 1.73 GB (1.98 million rows/s., 780.76 MB/s.)
-- tokenbf
-- 1 row in set. Elapsed: 2.462 sec. Processed 4.46 million rows, 1.76 GB (1.81 million rows/s., 715.81 MB/s.)
case2:
1
2
3
4
5
6
7
8
SELECT count()FROM hackernews WHERE hasToken(lower(comment), 'clickhous');
-- 无索引 29s
-- full_text
-- 1 row in set. Elapsed: 0.063 sec. Processed 16.38 thousand rows, 5.81 MB (260.68 thousand rows/s., 92.43 MB/s.)
-- tokenbf
-- 1 row in set. Elapsed: 0.328 sec. Processed 450.56 thousand rows, 185.22 MB (1.38 million rows/s., 565.39 MB/s.)
正则匹配
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
EXPLAIN indexes=1
SELECT count()FROM logs.hackernews WHERE hasToken(lower(comment), 'clickhouse');
-- 走索引
EXPLAIN indexes=1
SELECT count()FROM logs.hackernews WHERE match(lower(comment), 'clickhou.{1}e');
-- RE2 正则表达式; EXPLAI显示走索引,SELECT好像真不走
EXPLAIN indexes=1
SELECT count()FROM logs.hackernews WHERE REGEXP_MATCHES(lower(comment), 'clickhou.e');
-- RE2 正则表达式; EXPLAI显示走索引,SELECT好像真不走
EXPLAIN indexes=1
SELECT count()FROM logs.hackernews WHERE comment regexp 'click.*ouse';
-- 不走索引 RE2
EXPLAIN indexes=1
SELECT count()FROM logs.hackernews WHERE multiMatchAny(lower(comment), ['clickhou.{1}e', 'flow']);
-- 不走索引 RE2
结论
- 实际上只有hasToken 走索引
- 正则查询不会走索引,开发的话建议使用 使用 match, multiMatchAny(因为match的 EXPLAI显示走索引,后面这个实验特性release后可能会支持索引)