
Migrate to ClickHouse #1108

Draft
skyfallwastaken wants to merge 24 commits into main from beta

Conversation

@skyfallwastaken (Member)

DO NOT MERGE UNTIL FURTHER TESTING IS DONE.

@skyfallwastaken (Member, Author)

@greptileai review and scan for regressions

greptile-apps bot (Contributor) commented Mar 24, 2026

Greptile Summary

This PR is a major infrastructure migration, moving the heartbeats table from PostgreSQL to ClickHouse via the clickhouse-activerecord gem. A new ClickhouseRecord base class is introduced, all SQL window functions are ported from LAG/LEAD to ClickHouse's lagInFrame/leadInFrame (with correct 3-argument forms to avoid first-row inflation), and a ReplacingMergeTree-backed heartbeat_user_daily_summary table plus a Refreshable Materialized View are added to power home stats.

Key changes include:

  • All cross-database JOINs (Postgres users ↔ ClickHouse heartbeats) are decomposed into separate queries, with eligible_user_ids fetched from Postgres first.
  • A new HeartbeatCacheInvalidator replaces the old maximum(:time) cache key with a versioned key bumped on every write.
  • Heartbeat submission in the API controller moves from find_or_create_by to a batched insert_all, with cache invalidation called once per batch.
  • The streak calculation is moved to Ruby-side aggregation (fetching raw timestamps) to correctly handle per-user timezone day boundaries.

Outstanding concerns (some carried from previous review rounds):

  • current_user.timezone is interpolated directly into a ClickHouse SQL string in dashboard_data.rb (line 103) without quoting — the same SQL injection pattern that was identified and fixed in WeeklySummaryMailer but not addressed here.
  • The heartbeat_user_daily_summary_mv materialized view's inner window function partitions by user_id only, not by (user_id, day), so lagInFrame can diff across day boundaries; additionally it uses UTC dates while all per-user queries use the user's local timezone, creating a systematic discrepancy.
  • fields_hash is no longer computed anywhere — neither on API submission nor on import — meaning ReplacingMergeTree's deduplication based on the id-inclusive ORDER BY key will not collapse re-submitted or re-imported heartbeats, and the hash-based dedup in the import service is now effectively dead code.
  • The controller's insert_all and the import service's flush_batch both suffer from the race condition and re-import duplication issues flagged in earlier review rounds.

Confidence Score: 1/5

  • This PR should not be merged yet — it carries multiple data integrity and security issues that could silently inflate user stats, allow SQL injection, and make re-imported data impossible to deduplicate.
  • The PR author explicitly marked it "DO NOT MERGE UNTIL FURTHER TESTING IS DONE". Beyond that, several concrete bugs remain unresolved from prior rounds (fields_hash deduplication gone, duplicate heartbeat accumulation, lagInFrame first-row default in some paths, unquoted timezone injection in the mailer since fixed but recurring in dashboard_data.rb) plus two new issues found in this round (dashboard_data.rb timezone injection, materialized view cross-day window bleed). Any one of these could cause silent data corruption or stat inflation in production.
  • app/controllers/concerns/dashboard_data.rb (timezone SQL injection), app/services/heartbeat_import_service.rb (fields_hash + dedup), app/controllers/api/hackatime/v1/hackatime_controller.rb (insert_all without dedup), db/migrate_clickhouse/20260324000003_create_heartbeat_user_daily_summary_refreshable_view.rb (cross-day window partition)

Important Files Changed

Filename Overview
app/models/concerns/heartbeatable.rb Core ClickHouse query migration — all LAG/LEAD window functions ported to lagInFrame/leadInFrame with 3-arg form (fixing the first-row default); streak calculation moved to Ruby-side aggregation to support per-user timezones; daily_durations now uses ClickHouse timezone-aware date truncation.
app/models/heartbeat.rb Migrated from ApplicationRecord (Postgres) to ClickhouseRecord; soft-delete and fields_hash callback removed; custom timestamp-based ID generation added; after_create/update cache invalidation wired up via HeartbeatCacheInvalidator.
app/controllers/concerns/dashboard_data.rb Dashboard queries migrated to ClickHouse with batched weekly stats and cached filter options; contains unquoted timezone SQL injection risk at line 103 (same pattern already fixed in the mailer).
db/migrate_clickhouse/20260324000003_create_heartbeat_user_daily_summary_refreshable_view.rb Creates a Refreshable Materialized View populating heartbeat_user_daily_summary every 10 minutes; window function partitions by user_id only (not day), causing cross-midnight duration bleed; uses UTC dates while per-user queries use local timezones.
app/services/heartbeat_import_service.rb Import pipeline switched from upsert_all with fields_hash dedup to plain insert_all; fields_hash computation was dropped entirely, so re-imports produce duplicate rows that ReplacingMergeTree won't merge (id is part of ORDER BY).
app/services/heartbeat_cache_invalidator.rb New lightweight cache version service replacing the old maximum(:time) approach; version bumped on every heartbeat create/import; 30-day TTL is well chosen.
app/controllers/api/hackatime/v1/hackatime_controller.rb Bulk insert_all replaces per-heartbeat find_or_create_by; fields_hash not computed so all inserted heartbeats have empty fields_hash, breaking deduplication; race condition on concurrent submissions still present.
app/jobs/cache/home_stats_job.rb Now reads from the heartbeat_user_daily_summary materialized view (with FINAL for dedup); adds a cache-busting version key tied to the view's last_success_time from system.view_refreshes.
app/controllers/api/admin/v1/admin_controller.rb Queries ported to ClickHouse syntax; IS NOT NULL replaced with > 0 for non-nullable columns; ARRAY_AGG(DISTINCT) replaced with groupUniqArray; DISTINCT ON replaced with argMax grouping.
test/test_helper.rb Added before_setup hook to TRUNCATE ClickHouse tables before each test; WebMock configured to allow localhost + clickhouse host.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as WakaTime Client
    participant API as HackatimeController
    participant CH as ClickHouse (heartbeats)
    participant Cache as Rails.cache
    participant PG as PostgreSQL (users)
    participant MV as heartbeat_user_daily_summary_mv

    Client->>API: POST /heartbeats (batch)
    API->>API: prepare_heartbeat_attrs()
    API->>API: assign IDs (timestamp << 10 | rand)
    API->>CH: insert_all(records)
    API->>Cache: HeartbeatCacheInvalidator.bump_for(user_id)

    Note over MV,CH: Every 10 minutes (async)
    CH-->>MV: REFRESH MATERIALIZED VIEW
    MV->>CH: heartbeat_user_daily_summary updated

    Client->>API: GET /dashboard
    API->>Cache: fetch(dashboard_cache_version)
    Cache-->>API: miss
    API->>PG: User.where(id:).pluck(:id, :timezone)
    API->>CH: SELECT groupUniqArray(project), ... FROM heartbeats
    API->>CH: SELECT weekly project durations (batched)
    API->>CH: SELECT duration_seconds (lagInFrame window)
    API-->>Client: dashboard JSON

    Note over API,CH: Home stats path
    API->>CH: SELECT uniq(user_id), sum(duration_s) FROM heartbeat_user_daily_summary FINAL
```

Reviews (3): Last reviewed commit: "Claude-slop"

@skyfallwastaken (Member, Author)

@greptileai re-review and scan for regressions

```ruby
user_id = conn.quote(current_user.id)
filters = %i[project language operating_system editor category]
sql = filters.map { |f| "groupUniqArray(#{f}) AS #{f}_values" }.join(", ")
row = conn.select_one("SELECT #{sql} FROM heartbeats WHERE user_id = #{user_id}")
```

Code scanning / Brakeman warning: possible SQL injection.

```ruby
# tail of a raw SQL heredoc
    ORDER BY user_id ASC, time ASC
SQL

rows = connection.select_all(raw_sql)
```

Code scanning / Brakeman warning: possible SQL injection.
@skyfallwastaken (Member, Author)

@greptileai review and scan for regressions. review the time calculation code and all ClickHouse code, then review whether or not it makes sense, has regressions from the main branch, duplicated clickhouse-activerecord features, can be made more efficient, etc. Also check to see whether or not a piece of code can be shortened WITHOUT LOSING PERFORMANCE OR FUNCTIONALITY. PAY CLOSE ATTENTION TO TIME TRACKING CODE. ALSO LOOK FOR POTENTIAL PERFORMANCE IMPROVEMENTS

Comment on lines +101 to +103:

```ruby
tz = current_user.timezone
weekly_sql = weekly_hb
  .select("toStartOfWeek(toDateTime(toUInt32(time), '#{tz}'), 1) as week_start, `project` as grouped_time, least(greatest(time - lagInFrame(time, 1, time) OVER (PARTITION BY `project`, toStartOfWeek(toDateTime(toUInt32(time), '#{tz}'), 1) ORDER BY time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 0), #{timeout}) as diff")
```

P0 Unquoted timezone interpolation — SQL injection risk

`current_user.timezone` is interpolated directly into the ClickHouse SQL string at line 103 (twice in the `select`) without any quoting or sanitization. A crafted timezone value stored in the user's profile (e.g. `UTC') UNION SELECT ...--`) would be executed verbatim.

The identical vulnerability was already identified and fixed in WeeklySummaryMailer (using connection.quote), but the same pattern here was not addressed:

Suggested change:

```diff
-tz = current_user.timezone
-weekly_sql = weekly_hb
-  .select("toStartOfWeek(toDateTime(toUInt32(time), '#{tz}'), 1) as week_start, `project` as grouped_time, least(greatest(time - lagInFrame(time, 1, time) OVER (PARTITION BY `project`, toStartOfWeek(toDateTime(toUInt32(time), '#{tz}'), 1) ORDER BY time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 0), #{timeout}) as diff")
+tz_quoted = Heartbeat.connection.quote(current_user.timezone)
+weekly_sql = weekly_hb
+  .select("toStartOfWeek(toDateTime(toUInt32(time), #{tz_quoted}), 1) as week_start, `project` as grouped_time, least(greatest(time - lagInFrame(time, 1, time) OVER (PARTITION BY `project`, toStartOfWeek(toDateTime(toUInt32(time), #{tz_quoted}), 1) ORDER BY time ASC ROWS BETWEEN 1 PRECEDING AND CURRENT ROW), 0), #{timeout}) as diff")
```

Comment on lines +17 to +35:

```sql
SELECT
  user_id,
  time,
  least(
    greatest(
      time - lagInFrame(time, 1, time) OVER (
        PARTITION BY user_id
        ORDER BY time ASC
        ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
      ),
      0
    ),
    120
  ) AS diff
FROM heartbeats
WHERE time IS NOT NULL
  AND time >= 0
  AND time <= 253402300799
)
```
P1 Window partitions by user_id only, bleeds across day boundaries

The inner window function partitions by user_id alone:

```sql
lagInFrame(time, 1, time) OVER (
  PARTITION BY user_id
  ORDER BY time ASC
  ...
)
```

This means the lag for the first heartbeat of a new day is computed against the last heartbeat of the previous day. If the user was coding right up to midnight, that cross-midnight gap (which can be up to 120 seconds) is added to the new day's duration_s, inflating it incorrectly.

The ad-hoc query path in heartbeatable.rb correctly partitions by day:

```sql
PARTITION BY toDate(toDateTime(toUInt32(time), '#{timezone}'))
```

The materialized view should use the same approach, partitioning by (user_id, toDate(toDateTime(toUInt32(time)))), to stay consistent with individual user stat calculations.

Note: the current structure also partitions by UTC day while the per-user queries use the user's local timezone, so the totals in heartbeat_user_daily_summary will diverge from those shown on user dashboards for non-UTC users.


Comment on lines 94 to 119:

```diff
   total_count
 end

-def self.flush_batch(seen_hashes)
-  return 0 if seen_hashes.empty?
+def self.flush_batch(user_id, records)
+  return 0 if records.empty?

-  records = seen_hashes.values
-  records.each do |r|
-    timestamp = Time.current
-    r[:created_at] = timestamp
-    r[:updated_at] = timestamp
+  now = Time.current
+  base_us = (now.to_r * 1_000_000).to_i
+  records.each_with_index do |r, i|
+    r[:id] = ((base_us + i) << Heartbeat::CLICKHOUSE_ID_RANDOM_BITS) | SecureRandom.random_number(Heartbeat::CLICKHOUSE_ID_RANDOM_MAX)
+    r[:created_at] = now
+    r[:updated_at] = now
   end

   ActiveRecord::Base.logger.silence do
-    result = Heartbeat.upsert_all(records, unique_by: [ :fields_hash ])
-    result.length
+    Heartbeat.connection.with_settings(async_insert: 0, wait_for_async_insert: 1) do
+      Heartbeat.insert_all(records)
+    end
+    Heartbeat.connection.clear_query_cache
   end
+
+  HeartbeatCacheInvalidator.bump_for(user_id)
+  records.length
 end

 class HeartbeatSaxHandler < Oj::Saj
```
P1 fields_hash no longer computed on import

generate_fields_hash was removed from Heartbeat and before_save :set_fields_hash! was also removed. The import service previously set attrs[:fields_hash] before inserting, but that line is now gone. All imported heartbeats will have fields_hash = '' (the ClickHouse column default).

This means existing_hashes deduplication in any future code path, the find_or_create_by(fields_hash:) guard added in the controller, and the UNIQUE semantics of ReplacingMergeTree on id—none of these will work correctly for imported data. Two re-imports of the same file will produce duplicate rows that the engine won't merge (since id is part of ORDER BY and IDs are freshly generated each time).

Consider either restoring fields_hash computation in flush_batch, or tracking seen external timestamps/entities per-user to prevent re-import duplicates.

