Efficient Data Transportation: From Log Files to ClickHouse with Fluentd
- Pevatrons Engineer
- Jan 15, 2024
- 4 min read

Recently we at Pevatrons have been working on an end-to-end data pipeline for near real-time analytics. To give a high-level overview of the project: the core software generates up to 28,000 events every second, which translates to north of 2 billion events per day. Our customer needs to view near real-time analytics of these logs in a dashboard with a maximum delay of 3 minutes. After a lot of research, we decided to use ClickHouse to store the logs and generate the analytical reports; we will talk more about ClickHouse in another blog post.
Just as important as analyzing the data with ClickHouse is getting the data into ClickHouse on time, so that all the analytics are possible. In this blog post we will talk about the journey of data from log files to ClickHouse.
High-level architecture of the current system that generates events
Let's talk about the high-level architecture of the software that generates these log events.
The Components of the System
Nodes that Generate the Logs (already implemented by the client)
Data Collector that collects logs from the nodes and inserts them into ClickHouse in a near real-time manner (to be designed and implemented by Pevatrons)
ClickHouse Cluster used to store the logs and generate analytical reports and dashboards (to be designed and implemented by Pevatrons)
Dashboard and Reporting (the backend to be designed and implemented by Pevatrons, the frontend by another team working closely with the client)
Nodes that Generate the Logs
This component generates the log events.
The log events are written to a file.
The log files are rotated every 1-5 minutes.
The Data Collector Component
The data collector component had the following requirements:
Read logs from the log files in a near real-time manner and apply some basic transformations.
Detect log rotation, so that it starts reading the new file without missing data from the existing log files.
Have a mechanism to store the data temporarily, so that we can insert data into the ClickHouse cluster in batches instead of record by record.
Handle failures with a retry mechanism.
Low resource consumption (CPU and memory), as this component would mostly be colocated with the data-generating node.
Be easily configurable, instead of us having to write code for every small thing in the tool.
Now that the requirements were clear, our team started researching tools that could satisfy all of the above, and we found Fluentd.
Features of Fluentd
Low resource consumption: about 1 CPU core and 300 MB of memory.
Open source and easily configurable via a config file.
Can read from a log file using the tail input plugin, which also detects log rotation.
Fluentd also has buffer plugins that store log events temporarily; the buffer size, the flush time to the output, and so on are easily configurable (see the sketch after this list).
It has a retry mechanism enabled by default, so it can wait and retry when the output destination is down.
It has a large number of output (data destination) plugins, so Fluentd can send the collected data to the desired destination.
Custom plugins can be written for any of these components, provided you are familiar with the Ruby programming language.
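To illustrate how buffering and retries are configured, here is a minimal sketch of a buffered output section; the destination, tag, and parameter values are illustrative placeholders rather than our production settings.

```
<match app.**>
  @type forward                  # any buffered output plugin accepts a <buffer> section
  <server>
    host backend.example.local   # illustrative destination
    port 24224
  </server>
  <buffer>
    @type memory
    chunk_limit_size 16m         # maximum size of a buffered chunk
    flush_interval 30s           # how often buffered events are flushed to the output
    retry_wait 5s                # wait before retrying when the destination is down
    retry_max_times 10           # give up after this many retries
  </buffer>
</match>
```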
Limitations
Not everything was perfect; we faced the following problems:
We tried most of the available ClickHouse output plugins, but unfortunately none of them worked for us, which led us to write our own custom ClickHouse plugin. Given that our team did not specialise in Ruby, it took a lot of time to understand and implement. We will talk more about this further down.
Since the tail input plugin does not support multiple workers (multi-process) to scale, we would need to implement a more complex Fluentd architecture in the future if the log generation throughput increases (see the sketch below).
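For context, when Fluentd runs multiple workers, an input plugin that is not multi-worker ready has to be pinned to a single worker with the <worker> directive. The sketch below only illustrates that mechanism (the worker count and source details are placeholders); how we actually scaled ingestion is the subject of a follow-up post.

```
<system>
  workers 2                      # run two worker processes
</system>

# in_tail is not multi-worker ready, so it must be confined to one worker
<worker 0>
  <source>
    @type tail
    path /var/log/app/events.log
    pos_file /var/log/fluent/events.log.pos
    tag app.events
    <parse>
      @type json
    </parse>
  </source>
</worker>
```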
Architecture with Fluentd
Stages of Processing
Input: Reading from the file; this also includes parsing and detecting log rotation.
Transform: The log files contained many more details than were necessary for the analytics, so we had to drop some of them; for this we used Fluentd's record_modifier filter plugin.
Output: In this stage we buffer the data in memory so that we can batch-insert records into ClickHouse optimally. We also set flush_interval to 1 minute, which flushes all the records held in memory every minute. This is important in case the process generates logs more slowly than we anticipated: even in the worst case, the delay from Fluentd to ClickHouse stays close to a minute.
This led to a Fluentd configuration along the lines of the sketch below.
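The exact production configuration is not reproduced here; the following is a minimal sketch of the shape it took, where the file paths, tag, field names, and plugin parameters are illustrative placeholders.

```
<source>
  @type tail
  path /var/log/app/events.log
  pos_file /var/log/fluent/events.log.pos   # tracks the read position across restarts and rotations
  tag app.events
  <parse>
    @type json
  </parse>
</source>

<filter app.events>
  @type record_modifier
  # Drop fields that are not needed for analytics (field names are illustrative)
  remove_keys debug_payload,internal_trace
</filter>

<match app.events>
  @type clickhouse               # our custom output plugin; connection parameters omitted here
  <buffer>
    @type memory
    chunk_limit_size 32m         # batch size for inserts into ClickHouse
    flush_interval 1m            # flush everything in memory at least once a minute
  </buffer>
</match>
```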
As we said before, we needed a custom output plugin to send the data to ClickHouse, because none of the other Fluentd plugins for ClickHouse worked for us. We are open-sourcing the plugin with this blog post in case you are in a similar situation; place the plugin in Fluentd's plugin directory and it should work for you too.
The plugin would not have been possible without this ClickHouse driver in Ruby.
Revised Architecture with Fluentd as Data Collector
Conclusion
In legacy systems where log files are already being generated, we found Fluentd to be an excellent tool for capturing the logs, transforming them, and sending them to a destination; in our case the destination was ClickHouse, and with our Ruby plugin we were able to make ClickHouse ingest the log data. However, while Fluentd met the performance requirements of this project, it is not fully scalable if the log generation rate suddenly doubles. We will have another blog post that talks extensively about how we optimised the ingestion rate using multiple workers (even though Fluentd does not natively support them for the tail plugin).