ClickHouse multi-path storage policies (repost)

Original: https://www.cnblogs.com/wshenjin/p/14583918.html
ClickHouse official site: https://clickhouse.com/

Since version 19.15, MergeTree supports custom storage policies:

  • JBOD policy: suited to servers with multiple disks but no RAID. JBOD is a round-robin policy: each INSERT or MERGE produces a new part, and new parts are written to the disks in rotation. This reduces the load on any single disk and, under the right conditions, improves parallel read/write performance.
  • HOT/COLD policy: suited to servers with disks of different types. Disks are divided into a HOT area and a COLD area: the HOT area uses small, fast storage and favors access performance; the COLD area uses large, slow storage and favors capacity. When MergeTree writes data, part directories are created in the HOT area first; once a part exceeds the size threshold, it is automatically moved to the COLD area. Each area can itself contain multiple disks and use the JBOD policy internally.
## Connect with clickhouse-client
clickhouse-client --host 127.0.0.1 --port 9000 --database test_db --user default --password '123456'

## Reload the configuration after editing it (no server restart required)
clickhouse-client --query "SYSTEM RELOAD CONFIG"
clickhouse-client --query "SYSTEM RELOAD CONFIG" --config-file /etc/clickhouse-server/users.xml
clickhouse-client --query "SYSTEM RELOAD CONFIG" --host shard01-01.example.com

## Related housekeeping statements
SHOW PROCESSLIST;
EXCHANGE TABLES default.table1 AND default.table12;
RENAME TABLE table1 TO log_test.tabl2;

Configuration syntax

  • The storage_configuration tag defines the multi-path storage configuration
  • The disks tag defines the disks
  • The policies tag defines the storage policies
<storage_configuration>
    <!-- Define disks -->
    <disks>                                        
        <disk_name_01>   <!-- Custom disk name, globally unique -->
            <path>/path/to/disk_name_01/</path> <!-- Disk path -->
            <keep_free_space_bytes>21474836480</keep_free_space_bytes>
        </disk_name_01>
        <disk_name_02>
            <path>/path/to/disk_name_02/</path>
            <keep_free_space_bytes>21474836480</keep_free_space_bytes>
        </disk_name_02>
    </disks>

    <!-- Define policies -->
    <policies>  
        <policy_name>  <!-- Custom policy name, globally unique -->
            <!-- Define volumes -->
            <volumes>   
                <volume_name_01>  <!-- Custom volume name, unique within the policy -->
                   <disk>disk_name_01</disk>   <!-- Disks used by this volume; the names must match those defined above -->
                   <disk>disk_name_02</disk>
                   <max_data_part_size_bytes>1073741824</max_data_part_size_bytes>
                </volume_name_01>
            </volumes>
            <move_factor>0.2</move_factor>
        </policy_name>
    </policies>
</storage_configuration>
  • keep_free_space_bytes: optional; the amount of disk space to reserve, in bytes (21474836480 bytes = 20 GiB in the example above).
  • max_data_part_size_bytes: optional, in bytes; the maximum size of a single data part that this volume's disks may hold. A part exceeding the threshold is written to the next volume instead; if omitted, part size on the volume is not limited.
  • move_factor: optional, default 0.1; when the fraction of free space on the current volume drops below this factor and more than one volume is defined, data is automatically moved to the next volume.
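As a quick sanity check on these values, here is a small Python sketch; the `needs_move` helper is my own illustration of the move_factor rule, not ClickHouse code:

```python
# Interpret the example storage-policy thresholds.

GIB = 1024 ** 3

keep_free_space_bytes = 21474836480      # space reserved per disk
max_data_part_size_bytes = 1073741824    # per-part cap on the volume
move_factor = 0.2                        # free-space fraction that triggers moves

def needs_move(free_bytes: int, total_bytes: int, factor: float) -> bool:
    """Data starts moving to the next volume when free/total < move_factor."""
    return free_bytes / total_bytes < factor

print(keep_free_space_bytes / GIB)      # 20.0  -> 20 GiB reserved
print(max_data_part_size_bytes / GIB)   # 1.0   -> 1 GiB part cap
print(needs_move(5 * GIB, 30 * GIB, move_factor))   # True: only ~16.7% free
print(needs_move(10 * GIB, 30 * GIB, move_factor))  # False: ~33% free
```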

JBOD

Configuration:

<storage_configuration>
    <disks>
        <disk_01>
            <path>/data/chbase/data_01/</path>
        </disk_01>
        <disk_02>
            <path>/data/chbase/data_02/</path>
        </disk_02>
        <disk_03>
            <path>/data/chbase/data_03/</path>
        </disk_03>
    </disks>
    <policies>
        <jbod_policies>
            <volumes>
                <jbod_volume>
                   <disk>disk_01</disk>
                   <disk>disk_02</disk>
                </jbod_volume>
            </volumes>
        </jbod_policies>
    </policies>
</storage_configuration>

Check the disk configuration:

localhost :) SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total, formatReadableSize(keep_free_space) AS reserved FROM system.disks;

┌─name────┬─path────────────────────────────┬─free──────┬─total─────┬─reserved─┐
│ default │ /data/database/clickhouse/data/ │ 13.19 GiB │ 29.98 GiB │ 0.00 B   │
│ disk_01 │ /data/chbase/data_01/           │ 13.19 GiB │ 29.98 GiB │ 0.00 B   │
│ disk_02 │ /data/chbase/data_02/           │ 13.19 GiB │ 29.98 GiB │ 0.00 B   │
│ disk_03 │ /data/chbase/data_03/           │ 13.19 GiB │ 29.98 GiB │ 0.00 B   │
└─────────┴─────────────────────────────────┴───────────┴───────────┴──────────┘

Check the storage policies:

localhost :) SELECT policy_name, volume_name, volume_priority, disks, formatReadableSize(max_data_part_size) AS max_data_part_size, move_factor FROM system.storage_policies;

┌─policy_name───┬─volume_name─┬─volume_priority─┬─disks─────────────────┬─max_data_part_size─┬─move_factor─┐
│ default       │ default     │               1 │ ['default']           │ 0.00 B             │           0 │
│ jbod_policies │ jbod_volume │               1 │ ['disk_01','disk_02'] │ 0.00 B             │         0.1 │
└───────────────┴─────────────┴─────────────────┴───────────────────────┴────────────────────┴─────────────┘

Create a table to test:

##Use SETTINGS storage_policy = 'jbod_policies' to specify the policy
localhost :) CREATE TABLE jbod_table_v1
(
    `id` UInt64
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS storage_policy = 'jbod_policies';

##Insert the first batch of data, creating the first part directory
localhost :) INSERT INTO jbod_table_v1 SELECT rand() FROM numbers(10);

##Check system.parts: the first part, all_1_1_0, was written to disk_01
localhost :) SELECT name, disk_name FROM system.parts WHERE table='jbod_table_v1';
┌─name──────┬─disk_name─┐
│ all_1_1_0 │ disk_01   │
└───────────┴───────────┘

##Insert a second batch of data, creating a second part directory
localhost :) INSERT INTO jbod_table_v1 SELECT rand() FROM numbers(10);

##The second part, all_2_2_0, was written to disk_02
localhost :) SELECT name, disk_name FROM system.parts WHERE table='jbod_table_v1';
┌─name──────┬─disk_name─┐
│ all_1_1_0 │ disk_01   │
│ all_2_2_0 │ disk_02   │
└───────────┴───────────┘

##Repeat a few more times
localhost :) SELECT name, disk_name FROM system.parts WHERE table='jbod_table_v1';
┌─name──────┬─disk_name─┐
│ all_1_1_0 │ disk_01   │
│ all_2_2_0 │ disk_02   │
│ all_3_3_0 │ disk_01   │
│ all_4_4_0 │ disk_02   │
└───────────┴───────────┘

Under the JBOD policy, every time a new data part is generated, its directory is written to the volume's disks in round-robin order, following the order in which the disks are defined.
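The round-robin placement can be sketched as follows; this is a toy model of the rotation, not ClickHouse's actual scheduler (which also accounts for free space):

```python
from itertools import cycle

# Disks of the volume, in the order they appear in the policy.
jbod_volume = ["disk_01", "disk_02"]

# Each new part goes to the next disk in rotation.
next_disk = cycle(jbod_volume)
placement = {f"all_{i}_{i}_0": next(next_disk) for i in range(1, 5)}

for part, disk in placement.items():
    print(part, "->", disk)
# all_1_1_0 -> disk_01
# all_2_2_0 -> disk_02
# all_3_3_0 -> disk_01
# all_4_4_0 -> disk_02
```

This reproduces the part-to-disk mapping seen in the system.parts output above.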

HOT/COLD

Configuration:

<storage_configuration>
    <disks>
        <disk_01>
            <path>/data/chbase/data_01/</path>
        </disk_01>
        <disk_02>
            <path>/data/chbase/data_02/</path>
        </disk_02>
        <clod_disk>
            <path>/data/chbase/cold_data/</path>
            <keep_free_space_bytes>21474836480</keep_free_space_bytes>
        </clod_disk>
    </disks>
    <policies>
        <hot_to_cold>
            <volumes>
                <hot_volume>
                   <disk>disk_01</disk>
                   <disk>disk_02</disk>
                   <max_data_part_size_bytes>1048576</max_data_part_size_bytes>
                </hot_volume>
                <cold_volume>
                    <disk>clod_disk</disk>
                </cold_volume>
            </volumes>
            <move_factor>0.2</move_factor>
        </hot_to_cold>
    </policies>
</storage_configuration>

Check the disk configuration:

localhost :) SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total, formatReadableSize(keep_free_space) AS reserved FROM system.disks;

┌─name──────┬─path────────────────────────────┬─free──────┬─total─────┬─reserved──┐
│ clod_disk │ /data/chbase/cold_data/         │ 29.94 GiB │ 29.97 GiB │ 20.00 GiB │
│ default   │ /data/database/clickhouse/data/ │ 16.49 GiB │ 27.94 GiB │ 0.00 B    │
│ disk_01   │ /data/chbase/data_01/           │ 9.96 GiB  │ 9.99 GiB  │ 0.00 B    │
│ disk_02   │ /data/chbase/data_02/           │ 9.96 GiB  │ 9.99 GiB  │ 0.00 B    │
└───────────┴─────────────────────────────────┴───────────┴───────────┴───────────┘

Check the storage policies:

localhost :) SELECT policy_name, volume_name, volume_priority, disks, formatReadableSize(max_data_part_size) AS max_data_part_size, move_factor FROM system.storage_policies;

┌─policy_name─┬─volume_name──┬─volume_priority─┬─disks─────────────────┬─max_data_part_size─┬─move_factor─┐
│ default     │ default      │               1 │ ['default']           │ 0.00 B             │           0 │
│ hot_to_cold │ hot_volume   │               1 │ ['disk_01','disk_02'] │ 1.00 MiB           │         0.2 │
│ hot_to_cold │ cold_volume  │               2 │ ['clod_disk']         │ 0.00 B             │         0.2 │
└─────────────┴──────────────┴─────────────────┴───────────────────────┴────────────────────┴─────────────┘

As shown, the hot_to_cold policy has two volumes, hot_volume and cold_volume, and hot_volume contains two disks: disk_01 and disk_02.

Create a table to test:

localhost :) CREATE TABLE htc_table_1
(
    `id` UInt64
)
ENGINE = MergeTree()
ORDER BY id
SETTINGS storage_policy = 'hot_to_cold';

##Insert two batches of data (run the statement twice), creating two parts
localhost :) INSERT INTO htc_table_1 SELECT rand() FROM numbers(100000);

##The two resulting parts are balanced across disk_01 and disk_02 by the JBOD policy
localhost :) SELECT name, disk_name FROM system.parts WHERE table='htc_table_1';

┌─name──────┬─disk_name─┐
│ all_1_1_0 │ disk_01   │
│ all_2_2_0 │ disk_02   │
└───────────┴───────────┘

##Since max_data_part_size_bytes is 1 MiB, insert a batch that produces a part larger than 1 MiB
localhost :) INSERT INTO htc_table_1 SELECT rand() FROM numbers(300000);

##The third part was written to clod_disk
localhost :) SELECT name, disk_name FROM system.parts WHERE table='htc_table_1';

┌─name──────┬─disk_name─┐
│ all_1_1_0 │ disk_01   │
│ all_2_2_0 │ disk_02   │
│ all_3_3_0 │ clod_disk │
└───────────┴───────────┘

Under the HOT/COLD policy, disks are grouped into volumes. Whenever a new data part is generated, the volumes are tried in their defined order, and the part is written to the first volume whose max_data_part_size_bytes threshold it does not exceed.

Parts produced by merges, or by a single large insert, that exceed max_data_part_size_bytes are likewise written to the COLD volume.
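The selection rule can be sketched like this; `pick_volume` is a simplified illustration and ignores the free-space checks the real server also performs:

```python
# Volumes in policy order; None means "no per-part size cap".
volumes = [
    ("hot_volume",  ["disk_01", "disk_02"], 1_048_576),  # 1 MiB cap
    ("cold_volume", ["clod_disk"],          None),
]

def pick_volume(part_size_bytes: int) -> str:
    """Return the first volume whose size cap accommodates the part."""
    for name, disks, cap in volumes:
        if cap is None or part_size_bytes <= cap:
            return name
    raise RuntimeError("no volume can store this part")

print(pick_volume(200_000))     # hot_volume: under the 1 MiB cap
print(pick_volume(3_000_000))   # cold_volume: exceeds the hot cap
```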

Moving parts

Although a MergeTree table's storage policy cannot be modified once defined, individual parts can still be moved:

## Move a part to another disk of the same volume
localhost :) ALTER TABLE htc_table_1 MOVE PART 'all_1_1_0' TO DISK 'disk_02';

localhost :) SELECT name, disk_name FROM system.parts WHERE table='htc_table_1';

┌─name──────┬─disk_name─┐
│ all_1_1_0 │ disk_02   │
│ all_2_2_0 │ disk_02   │
│ all_3_3_0 │ clod_disk │
└───────────┴───────────┘


##Move a part to a different volume
localhost :) ALTER TABLE htc_table_1 MOVE PART 'all_1_1_0' TO VOLUME 'cold_volume';

localhost :) SELECT name, disk_name FROM system.parts WHERE table='htc_table_1';

┌─name──────┬─disk_name─┐
│ all_1_1_0 │ clod_disk │
│ all_2_2_0 │ disk_02   │
│ all_3_3_0 │ clod_disk │
└───────────┴───────────┘
## Aggregate slow queries from system.query_log
select
  left(query, 150) as sql,
  count() as queryNum,
  sum(query_duration_ms) as totalTime,
  totalTime / queryNum as avgTime
from
  system.query_log ql
where
  event_time > toDateTime('2022-04-23 13:19:00')
  and event_time < toDateTime('2022-04-23 13:23:00')
group by
  sql
order by
  totalTime desc

## Inspect raw query_log entries for a specific query pattern
SELECT
  *
FROM
  system.query_log
WHERE
  event_time > toDateTime ('2022-04-23 13:19:00')
  and event_time < toDateTime ('2022-04-23 13:23:00')
  AND query LIKE '%xxxxxxxpt%'
     ORDER BY read_rows  DESC 
limit
  20


## Per-table row counts and compression ratio
SELECT
table AS `表名`,
sum(rows) AS `总行数`,
formatReadableSize (sum(data_uncompressed_bytes)) AS `原始大小`,
formatReadableSize (sum(data_compressed_bytes)) AS `压缩大小`,
round(
  (
    sum(data_compressed_bytes) / sum(data_uncompressed_bytes)
  ) * 100,
  0
) AS `压缩率`
FROM
  system.parts
GROUP BY
table;

## Example: tiered storage with TTL ... TO DISK
CREATE TABLE default.test_table
(
    `id` String,
    `timestamp` DateTime,
    `distinct_id` String,
    `time` UInt64,
    `undertime` UInt64 MATERIALIZED 9223372036854775807 - time
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (distinct_id, cityHash64(distinct_id))
SAMPLE BY cityHash64(distinct_id)
TTL timestamp TO DISK 'default', timestamp + toIntervalMonth(2) TO DISK 'ck_vdc1', timestamp + toIntervalMonth(4) TO DISK 'ck_vdd1'
SETTINGS index_granularity = 8192
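The TTL clauses above tier rows by age. A toy Python model of that rule follows; `disk_for` is my own illustration, and it approximates toIntervalMonth with 30-day blocks, whereas ClickHouse uses calendar months:

```python
from datetime import datetime, timedelta

# Tier boundaries mirroring the table's TTL clauses (months ~ 30 days here).
TIERS = [
    (timedelta(days=0),   "default"),   # TTL timestamp TO DISK 'default'
    (timedelta(days=60),  "ck_vdc1"),   # timestamp + 2 months
    (timedelta(days=120), "ck_vdd1"),   # timestamp + 4 months
]

def disk_for(timestamp: datetime, now: datetime) -> str:
    """Return the last disk whose TTL boundary the row's age has passed."""
    age = now - timestamp
    disk = TIERS[0][1]
    for boundary, name in TIERS:
        if age >= boundary:
            disk = name
    return disk

now = datetime(2024, 6, 1)
print(disk_for(datetime(2024, 5, 20), now))  # default: 12 days old
print(disk_for(datetime(2024, 3, 1), now))   # ck_vdc1: ~3 months old
print(disk_for(datetime(2024, 1, 1), now))   # ck_vdd1: ~5 months old
```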

# Set ownership
chown -R clickhouse:clickhouse /data/clickhouse
chown clickhouse:clickhouse /etc/clickhouse-server/config.d/default_policies.xml

# Check the disk configuration:
SELECT name, path, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total, formatReadableSize(keep_free_space) AS reserved FROM system.disks;

# Check the storage policies:
SELECT policy_name, volume_name, volume_priority, disks, formatReadableSize(max_data_part_size) AS max_data_part_size, move_factor FROM system.storage_policies;

# Check the CREATE statement
SHOW CREATE TABLE newtestdb_1.visits_v1;

# Check which disks the parts are on (cannot be checked while the table has no data)
SELECT table, disk_name, path FROM system.parts WHERE table = 'visits_v1';
SELECT table, disk_name, path FROM system.parts WHERE database = 'newtestdb_1';

# Modify the storage policy (move data via TTL)

ALTER TABLE newtestdb_1.visits_v1 MODIFY TTL StartTime TO DISK 'default', StartTime + INTERVAL 7 YEAR TO DISK 'disknew1';

ALTER TABLE newtestdb_1.visits_v1 MATERIALIZE TTL;


# Disk space used per table
SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS disk_space FROM system.parts GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;