Intro to ClickHouse® Internals
ClickHouse stores data in immutable parts organized by partitions. Parts are merged in the background for optimization. Sorting keys enable sparse indexes for fast data location.
Parts
Data in ClickHouse is stored in parts: immutable chunks written to disk during inserts. Each insert creates at least one part for every partition its rows fall into, so an insert that spans several partitions creates several parts.
Key points:
- Parts are immutable (never modified in place)
- More parts = slower queries (>100 parts per partition degrades performance)
- Parts are merged together in the background
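The relationship between inserts, partitions, and part counts can be illustrated with a toy model. This is a minimal sketch, not ClickHouse's implementation: the `partition_id` and `insert` helpers and the monthly partition key are assumptions made for illustration only.

```python
from collections import defaultdict
from datetime import date

def partition_id(row_date: date) -> str:
    """Hypothetical monthly partition key (illustrative only)."""
    return f"{row_date.year}{row_date.month:02d}"

def insert(parts: dict, rows: list) -> int:
    """Append one new immutable part per partition touched by this insert.

    Returns the number of parts the insert created.
    """
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[partition_id(row["date"])].append(row)
    for pid, part_rows in by_partition.items():
        # Existing parts are never modified; a new tuple is appended instead.
        parts[pid].append(tuple(sorted(part_rows, key=lambda r: r["date"])))
    return len(by_partition)

parts = defaultdict(list)
# First insert touches only January, second touches January and February.
created_a = insert(parts, [{"date": date(2024, 1, 5)}, {"date": date(2024, 1, 9)}])
created_b = insert(parts, [{"date": date(2024, 1, 20)}, {"date": date(2024, 2, 1)}])
parts_per_partition = {pid: len(p) for pid, p in parts.items()}
print(created_a, created_b, parts_per_partition)
```

In this model, many small inserts steadily raise the part count of each partition they touch until a merge combines the parts, which is why batching inserts helps keep part counts low.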
Partitions
Data is divided into partitions based on a partition key. Partitions enable fast data management and partition pruning.
Key points:
- Partition by time only (daily or monthly)
- Partition pruning skips irrelevant partitions
- The sorting key has a bigger impact on query performance than the partition key
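Partition pruning can be sketched as a simple range check over partition identifiers. The snippet below is a simplified model that assumes monthly partitions encoded as integers such as 202401; the `month_id` and `prune_partitions` names are hypothetical and do not correspond to any ClickHouse API.

```python
from datetime import date

def month_id(d: date) -> int:
    """Encode a date as a YYYYMM integer (assumed monthly partitioning)."""
    return d.year * 100 + d.month

def prune_partitions(partitions, start, end):
    """Return the monthly partitions that can contain rows in [start, end]."""
    lo, hi = month_id(start), month_id(end)
    return [p for p in partitions if lo <= p <= hi]

partitions = [202311, 202312, 202401, 202402, 202403]
# A query filtering on mid-January to mid-February only needs two partitions.
selected = prune_partitions(partitions, date(2024, 1, 15), date(2024, 2, 10))
print(selected)
```

Pruning only helps when the query filters on the partition key. Within the partitions that remain, the sorting key decides how much data is actually read.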
Merges
ClickHouse continuously merges smaller parts into larger ones in the background.
Key points:
- Merges consume CPU, memory, and disk I/O
- Creating parts faster than merges can combine them causes a backlog
- Merges compete with queries and ingestion for shared resources
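The backlog effect can be shown with a small simulation. This is a rough sketch under stated assumptions: time advances in discrete ticks, a fixed number of merges runs per tick, and each merge combines up to `merge_width` parts into one. The real merge scheduler is considerably more sophisticated, and all names here are hypothetical.

```python
def simulate_part_count(new_parts_per_tick, merges_per_tick, merge_width, ticks):
    """Return the active part count after each tick of a toy insert/merge loop."""
    active = 0
    history = []
    for _ in range(ticks):
        active += new_parts_per_tick      # inserts create new parts
        for _ in range(merges_per_tick):  # background merges run afterwards
            if active < 2:
                break                     # nothing left to merge
            width = min(merge_width, active)
            active -= width - 1           # `width` parts become one larger part
        history.append(active)
    return history

# Inserts that merges can keep up with: the part count stays flat.
steady = simulate_part_count(new_parts_per_tick=4, merges_per_tick=1, merge_width=5, ticks=5)
# Inserts that outpace merges: the part count grows every tick.
backlog = simulate_part_count(new_parts_per_tick=10, merges_per_tick=1, merge_width=5, ticks=5)
print(steady, backlog)
```

In the second run the part count rises by six every tick, which is the pattern to watch for when monitoring part counts: growth that never levels off means part creation is outrunning merge capacity.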
Sorting Keys and Sparse Indexes
The sorting key (ORDER BY) determines how data is physically stored and enables sparse primary indexes for fast data location.
Key points:
- The sparse index stores one entry per granule (8192 rows by default): the sorting key values of the granule's first row
- Queries can skip entire granules that don't match
- Filtering by columns not in the sorting key cannot use the primary index, so the query typically scans every granule in the selected partitions
- Put frequently filtered columns first in sorting key
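Granule skipping can be sketched with a binary search over one index entry per granule. This is a simplified model, assuming a single-column sorting key and a tiny granule size of 4 rows for readability; `build_sparse_index` and `granules_to_read` are hypothetical helpers, not ClickHouse functions.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # tiny for illustration; the text above cites 8192 rows

def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep the key of the first row of each granule (one mark per granule)."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]

def granules_to_read(marks, lo, hi):
    """Return indexes of granules that may contain keys in [lo, hi]."""
    if not marks:
        return []
    # The granule before the first mark >= lo may still end with keys equal
    # to lo, so start one granule earlier (clamped to 0).
    first = max(bisect_left(marks, lo) - 1, 0)
    # Granules whose first key is greater than hi cannot match.
    last = bisect_right(marks, hi) - 1
    return list(range(first, last + 1))

keys = list(range(0, 40, 2))      # 20 rows, already sorted by the key
marks = build_sparse_index(keys)  # first key of each of the 5 granules
to_read = granules_to_read(marks, 10, 18)
print(marks, to_read)
```

Only two of the five granules are read for this range. A filter on a column that is not part of the sorting key has no such marks to search, so no granule can be ruled out this way.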
Replication
In distributed setups, replicas maintain copies of data for high availability. Replication uses ZooKeeper (or ClickHouse Keeper) for coordination.
Key points:
- Replicas store copies of data on different servers
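- ZooKeeper (or ClickHouse Keeper) coordinates DDL operations and part replication
- The replication queue can grow if parts are created faster than they are replicated
- Zero-copy replication (experimental and disabled by default in open-source ClickHouse) replicates only metadata, while the data itself stays in shared storage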
Key Takeaways
- Parts are immutable chunks: Monitor part counts (>100 per partition degrades performance) and understand how merges optimize them.
- Partition by time only: Avoid over-partitioning. Partition pruning helps, but the sorting key has a bigger impact on query performance.
- Merges compete for resources: Balance part creation rate with merge capacity to avoid backlog and performance degradation.
- Design sorting keys for queries: Put frequently filtered columns first. Sparse indexes enable fast data skipping within partitions.
Learn More
For detailed information, see:
- Parts, Partitions, Merges, and Indexes - Deep dive into parts, partitions, merges, and indexes with practical monitoring queries