
Transforming data across timezones in Analytics platform: a ClickHouse case study

  • Writer: Pevatrons Engineer
  • Jul 16, 2024
  • 5 min read


Transforming timestamped data across timezones seems straightforward, but only until you begin doing it. Take the case where raw data is generated in the timezone of the generator, stored in UTC, and aggregations must be produced in the user’s (browser) timezone, which is not fixed. When multiple components are involved in a data analytics platform with ClickHouse at its heart, the design must be elegant, lest you get caught in a wild goose chase while troubleshooting.



In the telco-grade analytics platform that we built with ClickHouse (more details here), the components are:

  1. Nodes that generate data in files; FluentD runs on these nodes and pushes the data from the files to the ClickHouse server.

  2. ClickHouse server to store the data and generate reports.

  3. API that exposes the report data from ClickHouse to the front-end component for analysis.


As usual, the devil is in the details, so let’s dive deeper:

  1. The data generation nodes can run in any timezone.

  2. The user can request the generated reports in any timezone.


The data generation nodes can run in any timezone

Let’s start here: each data generation node can run in any timezone, and the timestamps in the generated files follow the timezone of the node.

When storing, we wanted a uniform timezone regardless of where the data originated, so we decided to store the timestamps in UTC.

Luckily, we did not have to do much here, as FluentD and ClickHouse handle it between themselves. How, you ask?


  1. By default, FluentD assumes the data is generated in the local timezone of the node, which means the timestamps it reads carry this timezone information.

  2. ClickHouse picks up this timezone information from the data and automatically stores the timestamps in UTC; you can check the ClickHouse docs for more details.
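
To make this concrete, here is a minimal sketch with hypothetical table and column names (report_1_raw, event_time and bytes_sent are ours, not from the actual platform). A ClickHouse DateTime column is stored internally as a Unix timestamp, so what is stored is effectively UTC; an explicit source timezone only affects how a value is parsed or displayed.

-- Hypothetical raw table that the nodes' data lands in (names are illustrative only)
CREATE TABLE report_1_raw
(
    event_time DateTime('UTC'),   -- stored internally as a Unix timestamp
    bytes_sent UInt64
)
ENGINE = MergeTree
ORDER BY event_time;

-- A timestamp parsed with an explicit source timezone is the same instant in UTC:
SELECT
    toDateTime('2024-01-01 05:30:00', 'Asia/Kolkata') AS parsed_in_ist,
    toDateTime(parsed_in_ist, 'UTC')                  AS same_instant_utc;
-- same_instant_utc is 2024-01-01 00:00:00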


The user can request the generated reports in any timezone


So now that the timestamps are stored correctly in UTC, we convert them on the fly at query time from UTC to the timezone the user specifies in the request. This is easy with the toDateTime function in ClickHouse:

toDateTime({timestamp field}, '{your timezone}')
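
For example, against the hypothetical raw table sketched earlier, a request from a user in Asia/Kolkata could be served with:

SELECT
    toDateTime(event_time, 'Asia/Kolkata') AS event_time_local,   -- same instant, rendered in the user's timezone
    bytes_sent
FROM report_1_raw;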

That’s it, right? Unfortunately not; there are some more complexities.


The reports can be viewed in minute, hour, day, week and month breakdowns

Why is this complex? Generating reports on the fly over a dataset with billions of records, while expecting query latencies of a few milliseconds to at most a few seconds, is impossible. To optimise for latency, some kind of precomputation is needed, and ClickHouse makes this easy to solve with materialized views.


Let’s create a materialized view for each breakdown of the report, i.e. one materialized view each for minute, hour, day, week and month. Unfortunately, this does not work. Take the hourly breakdown as an example: as mentioned earlier, we store the data in UTC, so the hourly view will have pre-computed data for, say, 2024-01-01 00:00:00, midnight of Jan 1st 2024 in UTC. Now imagine the user asks for the data in the Asia/Kolkata timezone: that hour starts at 2024-01-01 05:30:00, which is not on an hour boundary in Asia/Kolkata, so this approach does not work.
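
The mismatch is easy to see directly in ClickHouse: the UTC hour bucket does not line up with an hour boundary in Asia/Kolkata.

SELECT
    toDateTime('2024-01-01 00:00:00', 'UTC')    AS utc_hour_bucket,
    toDateTime(utc_hour_bucket, 'Asia/Kolkata') AS same_instant_in_ist;
-- same_instant_in_ist is 2024-01-01 05:30:00, i.e. not the start of an hour in IST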


What to do next? Create a materialised view for every possible timezone? That would work, but it is a very bad approach: ClickHouse would have to compute and store every combination, and imagine the space that would consume.


We decided to go with a minute-based materialised view for each report, and to aggregate the data from that view into all the other time breakdowns on the fly.
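
Here is a sketch of what such a minute-level view could look like, assuming the hypothetical raw table from earlier and aggregate-function states so that coarser breakdowns can be merged later (the real view definitions differ):

CREATE MATERIALIZED VIEW minute_based_mv_for_report_1
ENGINE = AggregatingMergeTree
ORDER BY minute_based_timestamp
AS SELECT
    toStartOfMinute(event_time) AS minute_based_timestamp,
    countState()                AS metric_1,   -- partial count state, read later with countMerge
    sumState(bytes_sent)        AS metric_2    -- partial sum state, read later with sumMerge
FROM report_1_raw
GROUP BY minute_based_timestamp;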

So for an hourly breakdown, our SQL query would look like this:


SELECT
    date_trunc('hour', toDateTime(minute_based_timestamp, '{user timezone}')) as hour_based_timestamp,
    countMerge(metric_1) as aggregate_1,
    sumMerge(metric_2) as aggregate_2
FROM minute_based_mv_for_report_1
{where conditions as needed by user request}
GROUP BY ALL

Based on the user request, we had to generate this query on the fly with the user’s filters. This would make our API layer complex, and it would also make life difficult for analyst users, who would have to write this kind of query every time for their needs.


We decided that this query generation needed simplification, so we turned to SQL views. Now you might be wondering: how can we handle the timezone if we use views, given that the user passes it on the fly? It would be nice if there were some way to pass the timezone as a parameter to a view. Luckily, ClickHouse has recently introduced parameterized views.


  1. When the report is created, along with the materialised views we also create parameterised views (this is a one-time thing):

CREATE VIEW hourly_view_for_report_1
(
    hour_based_timestamp DateTime,
    aggregate_1 Int64,
    aggregate_2 Int64
) AS
SELECT
    date_trunc('hour', toDateTime(minute_based_timestamp, {timezone:String})) as hour_based_timestamp,
    countMerge(metric_1) as aggregate_1,
    sumMerge(metric_2) as aggregate_2
FROM minute_based_mv_for_report_1
GROUP BY ALL

  2. Pass the timezone as a parameter to this view when the user requests the report with their filters:

SELECT hour_based_timestamp, aggregate_1, aggregate_2
FROM hourly_view_for_report_1(timezone='{user_timezone}')
{where conditions as needed by user request}

As you can see, now both the API and the analyst users (who query ClickHouse directly) just need to remember this simple query.


We also noticed one small but very important thing while dealing with timezones. Imagine the user applies a time filter on the above query; we would normally write a query like the one below.


SELECT hour_based_timestamp, aggregate_1, aggregate_2
FROM hourly_view_for_report_1(timezone='{user_timezone}')
WHERE hour_based_timestamp BETWEEN timestamp '{user_start_time}' AND timestamp '{user_end_time}'

ClickHouse treats the user-provided timestamp literals here as UTC and gives wrong results if you expect them to be interpreted otherwise. In our case the times provided by the user are also in the user’s timezone, so we changed the query to:


SELECT hour_based_timestamp, aggregate_1, aggregate_2
FROM hourly_view_for_report_1(timezone='{user_timezone}')
WHERE hour_based_timestamp BETWEEN toDateTime('{user_start_time}', '{user_timezone}')
                               AND toDateTime('{user_end_time}', '{user_timezone}')
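
To make the pitfall concrete, here is a minimal illustration (assuming the server timezone is UTC): the same timestamp string refers to two different instants depending on whether the parse is pinned to the user’s timezone.

SELECT
    toDateTime('2024-01-01 09:00:00')                 AS parsed_in_server_tz,   -- parsed in the server timezone
    toDateTime('2024-01-01 09:00:00', 'Asia/Kolkata') AS parsed_in_user_tz,     -- parsed in the user's timezone
    toUnixTimestamp(parsed_in_server_tz) - toUnixTimestamp(parsed_in_user_tz) AS diff_seconds;
-- diff_seconds is 19800, i.e. the 5 hour 30 minute offset of Asia/Kolkata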

After this, everything worked fine in standalone mode. But when we deployed it in cluster mode, there was suddenly a problem: the API showed all results in UTC regardless of the timezone the user passed. This left us scratching our heads about where the problem was; the deployment works standalone, but the same setup with replication does not. After a lot of effort we found the cause.



Believe it or not, ch-proxy, the reverse proxy we use to distribute load across the ClickHouse servers, was causing the issue. After some debugging we saw that the data we got through the clickhouse-connect library differed depending on whether we went through ch-proxy or connected directly to a ClickHouse node. Connecting directly gave the right result, which is why standalone mode worked, but going through the proxy returned datetimes in UTC.


We are not completely sure why this is the case; we have two hypotheses:

  1. ch-proxy is not fully compatible with the ClickHouse server, at least for the date-time handling.

  2. User-forwarded settings are not passed on by ch-proxy to ClickHouse; perhaps the timezone is a kind of user setting from the ClickHouse client?


(If any readers know more about this, we are really curious to understand it.)


So how did we solve this? In clickhouse-connect you can pass a query timezone as a parameter; we passed the user-requested timezone as the query timezone and everything worked fine.


client.query_row_block_stream(sql, query_tz=tz)  # tz is the user-requested timezone, e.g. 'Asia/Kolkata'

Conclusion

ClickHouse by itself is sturdy in handling timezones. However, designing a system where data is generated in one timezone, stored in another and reported in yet another needs a durable design like the one described in this case study.

We at Pevatrons build data-intensive applications that handle a variety of data at huge volumes, and we do it in a way that keeps the design maintainable. Ping us if you are facing trouble with your data pipelines and timezones; we can help.


