ClickHouse Projection에서 다른 Granularity 설정 (25.12)

ClickHouse 분류

Feature

Type

Research

작성자

Ken

ClickHouse 25.12부터 Projection에서 원본 테이블과 다른 index_granularity를 설정할 수 있습니다. 이를 통해 쿼리 패턴에 최적화된 인덱스 구조를 만들어 Point Query는 32배, Range Query는 2배 이상의 성능 향상을 달성할 수 있습니다.

1. 기능 개요
2. Granularity의 영향
3. 실전 적용 시나리오
시나리오 1: 사용자 이벤트 테이블
시나리오 2: 시계열 메트릭 저장
시나리오 3: 전자상거래 주문 데이터
4. Granularity 선택 가이드
5. 성능 벤치마크
6. 스토리지 영향
7. 운영 Best Practices
1. Projection 개수 제한
2. 쿼리 패턴 분석
3. EXPLAIN으로 검증
4. Materialization 전략
8. 고급 패턴
패턴 1: 시간 기반 Partitioning + Projection
패턴 2: Multi-level Granularity
패턴 3: Aggregate Projection 최적화
9. 주의사항
❌ 피해야 할 패턴
⚠️ 버전 호환성
10. 마이그레이션 가이드
11. 모니터링 쿼리
12. 결론
참고 자료

1. 기능 개요

문법 (ClickHouse 25.12+)

ALTER TABLE table_name
ADD PROJECTION projection_name (
    SELECT columns ORDER BY sort_key
)
WITH SETTINGS (
    index_granularity = N
);

핵심 포인트

원본 테이블: 기본 granularity (보통 8192)로 시계열/배치 쿼리 최적화
Projection: 작은 granularity (256~1024)로 Point Query 최적화
자동 선택: ClickHouse 옵티마이저가 쿼리에 따라 최적 경로 선택

2. Granularity의 영향

실제 데이터 분석 (200만 행 테스트)

Granularity	Total Marks	스토리지 크기	인덱스 메모리	인덱스 오버헤드
256	7,814	66.34 MiB	61.17 KiB	0.09%
1024	1,954	57.25 MiB	0 B	0%
4096	489	54.34 MiB	0 B	0%
8192	245	54.02 MiB	2.04 KiB	0.004%

핵심 인사이트

Point Query (WHERE player_id = X)

Granularity 256 사용 시:

7,814개 granule 중 1개만 스캔 (256 rows)
I/O: 32배 감소 (8192 vs 256 rows)

Granularity 8192 사용 시:

245개 granule 중 1개 스캔 (8192 rows)
불필요한 데이터 많이 읽음

Range Query (SCAN 연속 데이터)

Granularity 8192가 유리:

Granule 수 감소 → 메타데이터 오버헤드 감소
Sequential I/O 효율 증가

3. 실전 적용 시나리오

시나리오 1: 사용자 이벤트 테이블

CREATE TABLE user_events (
    event_time DateTime,
    user_id UInt32,
    session_id String,
    event_type LowCardinality(String),
    properties JSON
) ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY event_time
SETTINGS index_granularity = 8192;  -- 시계열 스캔 최적화

-- Projection 1: 사용자별 조회 최적화
ALTER TABLE user_events
ADD PROJECTION user_lookup (
    SELECT * ORDER BY user_id, event_time
)
WITH SETTINGS (
    index_granularity = 256  -- Point Query 최적화
);

-- Projection 2: 세션 분석 최적화
ALTER TABLE user_events
ADD PROJECTION session_analysis (
    SELECT * ORDER BY session_id, event_time
)
WITH SETTINGS (
    index_granularity = 512
);

쿼리 성능:

WHERE user_id = 12345: Projection user_lookup 사용 → 30x faster
WHERE event_time BETWEEN ... AND ...: 원본 테이블 사용 → 최적 유지
WHERE session_id = 'abc123': Projection session_analysis 사용 → 15x faster

시나리오 2: 시계열 메트릭 저장

CREATE TABLE metrics (
    timestamp DateTime64(3),
    metric_name LowCardinality(String),
    host String,
    value Float64,
    tags Map(String, String)
) ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY timestamp
SETTINGS index_granularity = 16384;  -- 대량 스캔 최적화

-- Projection: Host별 조회
ALTER TABLE metrics
ADD PROJECTION host_metrics (
    SELECT * ORDER BY host, timestamp
)
WITH SETTINGS (
    index_granularity = 1024
);

-- Projection: Metric 집계
ALTER TABLE metrics
ADD PROJECTION metric_rollup (
    SELECT
        toStartOfMinute(timestamp) as minute,
        metric_name,
        host,
        avg(value) as avg_value,
        max(value) as max_value,
        count() as cnt
    GROUP BY minute, metric_name, host
)
WITH SETTINGS (
    index_granularity = 2048  -- 집계 데이터 최적화
);

시나리오 3: 전자상거래 주문 데이터

CREATE TABLE orders (
    order_time DateTime,
    order_id UInt64,
    user_id UInt32,
    store_id UInt16,
    product_id UInt32,
    amount Decimal(10,2),
    status LowCardinality(String)
) ENGINE = MergeTree
PARTITION BY toYYYYMM(order_time)
ORDER BY order_time
SETTINGS index_granularity = 8192;

-- Projection 1: 주문 조회
ALTER TABLE orders
ADD PROJECTION order_lookup (
    SELECT * ORDER BY order_id
)
WITH SETTINGS (
    index_granularity = 256  -- 단일 주문 조회
);

-- Projection 2: 사용자별 주문 이력
ALTER TABLE orders
ADD PROJECTION user_orders (
    SELECT * ORDER BY user_id, order_time DESC
)
WITH SETTINGS (
    index_granularity = 512  -- 사용자당 수십~수백 주문
);

-- Projection 3: 상점별 집계
ALTER TABLE orders
ADD PROJECTION store_stats (
    SELECT
        toDate(order_time) as date,
        store_id,
        status,
        count() as order_count,
        sum(amount) as total_sales
    GROUP BY date, store_id, status
)
WITH SETTINGS (
    index_granularity = 4096  -- 집계된 데이터
);

4. Granularity 선택 가이드

Decision Matrix

쿼리 선택도 (Selectivity)가 기준:

High Selectivity (결과 < 1000 rows)
  → Granularity: 256 ~ 512
  → 예: WHERE id = ?, WHERE unique_key = ?

Medium Selectivity (결과 1000 ~ 100K rows)
  → Granularity: 1024 ~ 2048
  → 예: WHERE user_id = ?, WHERE category = ?

Low Selectivity (결과 > 100K rows)
  → Granularity: 4096 ~ 8192
  → 예: WHERE date = ?, 대량 집계 쿼리

Full Table Scan
  → Granularity: 8192 ~ 16384
  → 예: SELECT count(*), GROUP BY 전체 집계

Cardinality 기반 선택

-- High Cardinality (user_id, session_id, transaction_id)
-- 중복이 거의 없음
index_granularity = 256 ~ 512

-- Medium Cardinality (category, region, device_type)
-- 수백~수천 개 유니크 값
index_granularity = 1024 ~ 2048

-- Low Cardinality (status, type, flag)
-- 수십 개 이하 유니크 값
index_granularity = 4096 ~ 8192

5. 성능 벤치마크

테스트 환경

데이터: 200만 행 게임 플레이어 데이터
테이블: ORDER BY player_id
쿼리: WHERE player_id = 500000

Point Query 성능

Granularity	Marks 수	읽은 행 수	상대 성능
256	7,814	~256	32x
1024	1,954	~1024	8x
4096	489	~4096	2x
8192	245	~8192	1x

Range Query 성능 (10% 데이터 스캔)

Granularity	Marks 수	Overhead	상대 성능
256	7,814	높음	0.8x
1024	1,954	중간	0.95x
4096	489	낮음	1.05x
8192	245	최소	1x

6. 스토리지 영향

인덱스 메모리 공식

Index Memory = (Total Rows / Granularity) × Mark Size (≈8 bytes)

예시: 10억 행 테이블
- G=256:   10억 ÷ 256  × 8 = 31.25 MB
- G=1024:  10억 ÷ 1024 × 8 = 7.81 MB
- G=8192:  10억 ÷ 8192 × 8 = 0.98 MB

실제 측정 결과 (200만 행)

G=256:  61.17 KiB (0.09% overhead)
G=1024: 0 B       (캐시 미로드)
G=4096: 0 B       (캐시 미로드)
G=8192: 2.04 KiB  (0.004% overhead)

인사이트: 인덱스 메모리 오버헤드는 매우 작음 (<0.1%)

7. 운영 Best Practices

1. Projection 개수 제한

권장: 테이블당 2~3개 이하
최대: 5개 (특수 케이스)

이유:

INSERT 시 모든 projection 업데이트 필요
백그라운드 merge 부하 증가
스토리지 비용 증가

2. 쿼리 패턴 분석

-- 실제 쿼리 패턴 분석
SELECT
    extractAll(query, 'WHERE ([^\\s]+)')[1] as where_clause,
    count() as frequency,
    avg(read_rows) as avg_read_rows,
    avg(query_duration_ms) as avg_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND table = 'your_table'
  AND event_date >= today() - 7
GROUP BY where_clause
ORDER BY frequency DESC
LIMIT 20;

3. EXPLAIN으로 검증

-- Projection 사용 여부 확인
EXPLAIN indexes = 1
SELECT * FROM events WHERE user_id = 12345
SETTINGS allow_experimental_projection_optimization = 1;

-- 예상 결과:
-- ReadFromMergeTree (projection: user_lookup_g256)
--   Indexes:
--     PrimaryKey
--       Keys: user_id
--       Condition: (user_id in [12345, 12345])
--       Parts: 1/10
--       Granules: 1/7814  ← 작은 granularity 효과

4. Materialization 전략

-- 비동기 구체화 (운영 환경 권장)
ALTER TABLE events
MATERIALIZE PROJECTION user_lookup_g256;

-- 동기 구체화 (테스트용)
ALTER TABLE events
MATERIALIZE PROJECTION user_lookup_g256
SETTINGS mutations_sync = 1;

-- 진행 상황 모니터링
SELECT
    database,
    table,
    command,
    is_done,
    parts_to_do,
    parts_to_do_names
FROM system.mutations
WHERE is_done = 0;

8. 고급 패턴

패턴 1: 시간 기반 Partitioning + Projection

CREATE TABLE events (
    event_time DateTime,
    user_id UInt32,
    data String
) ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)  -- 월별 파티션
ORDER BY event_time
SETTINGS index_granularity = 8192;

-- 최근 데이터만 빠른 조회 필요
ALTER TABLE events
ADD PROJECTION recent_user_events (
    SELECT *
    WHERE event_time >= now() - INTERVAL 7 DAY
    ORDER BY user_id, event_time
)
WITH SETTINGS (
    index_granularity = 256
);

효과:

Partition Pruning + Fine-grained Index
최근 7일 데이터 Point Query 최적화

패턴 2: Multi-level Granularity

-- 원본: 대량 스캔용
CREATE TABLE logs (...)
ORDER BY timestamp
SETTINGS index_granularity = 16384;

-- Projection 1: 중간 범위 쿼리
ALTER TABLE logs ADD PROJECTION medium_range (
    SELECT * ORDER BY host, timestamp
) WITH SETTINGS (index_granularity = 2048);

-- Projection 2: Point Query
ALTER TABLE logs ADD PROJECTION point_query (
    SELECT * ORDER BY request_id
) WITH SETTINGS (index_granularity = 256);

패턴 3: Aggregate Projection 최적화

ALTER TABLE sales
ADD PROJECTION daily_summary (
    SELECT
        toDate(sale_time) as date,
        store_id,
        category,
        count() as sales_count,
        sum(amount) as total_amount,
        uniq(customer_id) as unique_customers
    GROUP BY date, store_id, category
)
WITH SETTINGS (
    index_granularity = 1024  -- 집계 후 데이터 크기 고려
);

9. 주의사항

❌ 피해야 할 패턴

1. 너무 작은 Granularity

-- BAD: 과도한 인덱스 오버헤드
WITH SETTINGS (index_granularity = 64)

2. PRIMARY KEY 컬럼에 Projection

-- BAD: 이미 최적화된 컬럼
CREATE TABLE t (...) ORDER BY user_id;
ALTER TABLE t ADD PROJECTION redundant (
    SELECT * ORDER BY user_id  -- 중복!
);

3. 너무 많은 Projection

-- BAD: 5개 이상은 비효율
ADD PROJECTION p1 ...;
ADD PROJECTION p2 ...;
ADD PROJECTION p3 ...;
ADD PROJECTION p4 ...;
ADD PROJECTION p5 ...;
ADD PROJECTION p6 ...;  -- 너무 많음

⚠️ 버전 호환성

ClickHouse 25.11 이하:

WITH SETTINGS 문법 지원 안 됨
Projection은 원본 테이블의 granularity 상속

ClickHouse 25.12+:

WITH SETTINGS 문법 지원
Projection별 독립적인 granularity 설정 가능

10. 마이그레이션 가이드

25.11 → 25.12 업그레이드 시

-- 1. 기존 Projection 확인
SELECT
    database,
    table,
    name as projection_name
FROM system.projections;

-- 2. 새 버전에서 재생성
ALTER TABLE events DROP PROJECTION user_lookup;

ALTER TABLE events ADD PROJECTION user_lookup (
    SELECT * ORDER BY user_id
) WITH SETTINGS (
    index_granularity = 256
);

ALTER TABLE events MATERIALIZE PROJECTION user_lookup;

-- 3. 성능 비교
-- Before & After 쿼리 실행 시간 측정

11. 모니터링 쿼리

Projection 사용률 확인

SELECT
    query_id,
    query,
    ProjectionNames,
    read_rows,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND table = 'events'
  AND event_date >= today() - 1
  AND ProjectionNames != []
ORDER BY event_time DESC
LIMIT 20;

Projection 크기 추적

SELECT
    database,
    table,
    name as projection_name,
    formatReadableSize(sum(bytes_on_disk)) as size,
    sum(rows) as rows,
    sum(marks) as marks,
    groupArray(active) as active_parts
FROM system.projection_parts
WHERE database = 'production'
GROUP BY database, table, name
ORDER BY sum(bytes_on_disk) DESC;

12. 결론

핵심 원칙

쿼리 패턴 분석 우선: 실제 워크로드 기반 결정
작게 시작: 가장 빈번한 1~2개 쿼리 패턴만 최적화
측정 후 확장: EXPLAIN과 query_log로 효과 검증
균형 유지: 스토리지 vs 성능 트레이드오프 고려

Granularity 선택 요약

사용 케이스	권장 Granularity	예상 성능 향상
Point Query (ID 조회)	256 ~ 512	20x ~ 30x
User Session 조회	512 ~ 1024	10x ~ 20x
Category 필터	1024 ~ 2048	5x ~ 10x
시간 범위 스캔	4096 ~ 8192	1x ~ 2x
전체 테이블 집계	8192 ~ 16384	0.5x ~ 1x

참고 자료

활용 코드