Telco-grade near-real-time analytics with ClickHouse and FluentD
- Pevatrons Engineer
- Apr 9, 2024
- 4 min read

A telecom service provider has a solution that helps carriers deliver text messages for enterprise customers. The solution works just fine, but it lacks (near) real-time analytics: what happened in the last hour could only be known three to six hours later, and that meant valuable time was lost in rectifying problems before they became customer escalations.
The service provider wanted to build a (near) real-time analytics service whose overarching goals were:
- Ingest records at a rate of 28k transactions per second
- Store more than 50 billion raw records (occupying > 50 TB of disk space)
- Produce analytics in less than 5 minutes after ingestion
As if this were not complex enough, everything had to run on-prem, sometimes on old versions of operating systems and databases. And before we forget to add: no Docker containers. Now you know the complexity.
Selecting the right database
The structure of the records was such that most of the fields had low cardinality. We needed a columnar data-warehousing solution and zeroed in on Apache Doris and ClickHouse; we had to choose one of them. ClickHouse emerged as the preferred choice due to its efficient use of resources, low latency, and real-time support.
| | ClickHouse | Apache Doris |
| --- | --- | --- |
| Machine | c6a.4xlarge, 500 GB gp2 | c6a.4xlarge, 500 GB gp2 |
| Ingestion rate per second | 450k | 238k |
| Query latency | 0.40 s | 0.75 s |
| Concurrent queries | 2400 TPS | 3600 TPS |
Apart from the metrics above, key factors influencing our decision to choose ClickHouse included its vibrant open-source community, Time-To-Live (TTL) management for data, high-availability features, seamless support for real-time data ingestion, a user-friendly interface, and support for Materialised Views.
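To give a flavour of the TTL management mentioned above, here is a minimal sketch. The table and column names (sms_raw, carrier_id, and so on) and the 90-day retention are illustrative assumptions, not the actual schema:

```sql
-- Hypothetical raw-record table; TTL drops rows 90 days after event_time.
CREATE TABLE sms_raw
(
    event_time DateTime,
    carrier_id UInt32,
    status     LowCardinality(String),  -- most fields had low cardinality
    msg_id     String
)
ENGINE = MergeTree
ORDER BY (event_time, carrier_id)
TTL event_time + INTERVAL 90 DAY;
```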
System for data collection
With ClickHouse chosen, we needed a mechanism to pump data into it from the sources where the records were generated. The main constraints were:
- Detect log rotation, so that reading continues from the new file without missing data from the existing log files
- Insert data into the ClickHouse cluster in batches rather than record by record
- Low resource consumption (CPU and memory)
The open-source log collection tool FluentD fits the bill, as it offers:
- Emulation of a Unix tail -f
- An in-built retry mechanism
- A buffering mechanism
So, with ClickHouse and FluentD, our system looked like this: simple, by some stretch of the imagination.
Read our detailed blog on how we built the data transfer mechanism using FluentD here.
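For illustration, this is the shape of what a flushed FluentD buffer becomes on the ClickHouse side: one multi-row INSERT against the hypothetical sms_raw table from earlier, rather than one INSERT per record. The values are made up:

```sql
-- One batched INSERT per buffer flush, not one INSERT per record.
INSERT INTO sms_raw (event_time, carrier_id, status, msg_id) VALUES
    ('2024-04-09 10:00:01', 42, 'DELIVERED', 'a-001'),
    ('2024-04-09 10:00:01', 17, 'SUBMITTED', 'a-002'),
    ('2024-04-09 10:00:02', 42, 'FAILED',    'a-003');
```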
Generating analytical reports
Analytical reports are aggregations of raw records, ranging from the simplest (the number of records received in a given timeframe) to really complex ones. The main goals were:
- Reports should be available in (near) real time
- Reports should be available by the minute, hour, and day
Querying the ClickHouse raw table each time would mean not only huge query latency but also high resource utilisation.
ClickHouse's Materialised Views were a fundamental game-changer. At periodic intervals, the aggregation is stored on disk for easy retrieval. When a user queries a report, the query reads these Materialised Views and never touches the raw table.
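A minimal sketch of the pattern, continuing the hypothetical sms_raw table from earlier: a per-minute aggregate kept in its own target table, with a Materialised View feeding it on every insert. All names are illustrative:

```sql
-- Target table holding the per-minute aggregate; merges sum `records`.
CREATE TABLE sms_per_minute
(
    minute     DateTime,
    carrier_id UInt32,
    status     LowCardinality(String),
    records    UInt64
)
ENGINE = SummingMergeTree
ORDER BY (minute, carrier_id, status);

-- The MV aggregates each inserted block into the target table,
-- so report queries never have to touch sms_raw.
CREATE MATERIALIZED VIEW sms_per_minute_mv
TO sms_per_minute AS
SELECT
    toStartOfMinute(event_time) AS minute,
    carrier_id,
    status,
    count() AS records
FROM sms_raw
GROUP BY minute, carrier_id, status;
```

Writing the view with an explicit TO target table (rather than letting ClickHouse create a hidden inner table) makes the aggregate easier to query and to alter later.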
ClickHouse has an intelligent feature known as Projections, which can eliminate the need to define Materialised Views. However, we discovered a bug in ClickHouse Projections; we will write a detailed post on that disappointment.
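For comparison, the Projections route would look roughly like this on the same hypothetical table. Since this is the feature that disappointed us, treat it as a sketch of the idea rather than a recommendation:

```sql
-- Store a pre-aggregated per-minute projection inside the table's parts.
ALTER TABLE sms_raw ADD PROJECTION per_minute_proj
(
    SELECT
        toStartOfMinute(event_time) AS minute,
        carrier_id,
        status,
        count()
    GROUP BY minute, carrier_id, status
);
-- Build the projection for parts that already exist.
ALTER TABLE sms_raw MATERIALIZE PROJECTION per_minute_proj;
```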
Integration with MySQL
Metadata is stored in a MySQL database. When executing queries on aggregated tables for reports, we retrieve meta information from the MySQL database using dictionaries in ClickHouse. The dictionaries feature has been exceptionally beneficial for us, becoming as integral to our operations as Materialised Views (MVs). Practically all our database queries leverage dictionaries.
The primary advantage lies in not needing to store all potential information in the raw table. Instead, you can store just an ID and subsequently retrieve the additional information from dictionaries within your queries. This approach incurs virtually negligible cost in terms of query performance.
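A minimal sketch of the pattern, with hypothetical names and credentials: a dictionary backed by a MySQL metadata table, and a report query that stores only carrier_id in ClickHouse and resolves the carrier name at query time.

```sql
-- Dictionary backed by a MySQL metadata table, refreshed every 5-10 minutes.
CREATE DICTIONARY carrier_dict
(
    carrier_id   UInt64,
    carrier_name String
)
PRIMARY KEY carrier_id
SOURCE(MYSQL(
    host 'mysql.internal' port 3306      -- hypothetical host
    user 'ch_reader' password 'secret'   -- hypothetical credentials
    db 'meta' table 'carriers'
))
LAYOUT(HASHED())
LIFETIME(MIN 300 MAX 600);

-- The aggregate stores only the ID; the name comes from the dictionary.
SELECT
    dictGet('carrier_dict', 'carrier_name', toUInt64(carrier_id)) AS carrier,
    sum(records) AS total
FROM sms_per_minute
GROUP BY carrier;
```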
Replication for High Availability

To ensure high availability of our system, we opted for a single-shard, two-replica ClickHouse cluster topology, which effectively meets our current scalability needs. Our configuration comprises five servers: two serve as hosts for data storage in ClickHouse, while the remaining three function as ClickHouse Keeper nodes responsible for coordinating data replication. This setup enables us to replicate databases and tables across both data nodes using the ReplicatedMergeTree table engine.
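In sketch form, the replicated version of the hypothetical raw table looks like this. The cluster name and the Keeper path are illustrative; {shard} and {replica} expand from each server's macros configuration:

```sql
-- Same definition on both data nodes; the keeper nodes coordinate
-- replication of every inserted part between the two replicas.
CREATE TABLE sms_raw ON CLUSTER sms_cluster
(
    event_time DateTime,
    carrier_id UInt32,
    status     LowCardinality(String),
    msg_id     String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sms_raw', '{replica}')
ORDER BY (event_time, carrier_id)
TTL event_time + INTERVAL 90 DAY;
```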
Challenges encountered
Here is an incomplete list of various issues we faced, not tied to specific versions:
- Our assumptions regarding the Projections feature did not hold, as noted above
- ClickHouse replication demands high network bandwidth for effective coordination between nodes, particularly in multi-data-centre deployments
- Monitoring tools should be activated for replicated clusters to ensure smooth operation
- Selecting the appropriate aggregation algorithm is crucial for both tables and Materialised Views
- Remember that a Materialised View must be detached from its physical table before executing an ALTER on that table (one possible sequence is sketched below)
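One possible sequence for evolving the aggregate's schema under that constraint, assuming the hypothetical tables from earlier; treat it as a sketch and rehearse it on a test cluster first:

```sql
-- Take the MV offline so no inserts flow into the target mid-change.
DETACH TABLE sms_per_minute_mv;
-- Alter the physical target table while the view is detached.
ALTER TABLE sms_per_minute ADD COLUMN region LowCardinality(String) DEFAULT '';
-- Re-attach the view; the new column is filled by its DEFAULT.
ATTACH TABLE sms_per_minute_mv;
```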
We want to emphasise that these are just the issues we faced at various stages of the project. While we cannot predict whether you will encounter them, we can say one thing for sure: migrations in ClickHouse are a lengthy and labour-intensive process that is difficult to automate (if it is possible at all). Worse, the process is complicated by issues and bugs in ClickHouse itself, some of which we mentioned above.
Conclusion
Reflecting on our project's technology choices, we remain steadfast in our decision to opt for ClickHouse and the corresponding architectural solutions. ClickHouse seamlessly integrates with our project's requirements, and as of now, we haven't come across any superior alternatives.
Nevertheless, we advise caution to those considering ClickHouse for their projects. It serves a very specific purpose and may present challenges along the way, though these hurdles are typically surmountable. If you are uncertain about ClickHouse's suitability for your project, chances are you do not need it; in such cases, a traditional database might be the better choice.
We would be failing in our duty if we did not express our gratitude to the people behind the documentation at ClickHouse; no other tool's documentation is this impressive.