Revolutionizing Ad-Hoc Processing: Transforming Data for Performance
- Pevatrons Engineer
- Mar 26, 2023
- 2 min read

Introduction
In this data use case, we'll dive into how a company overcame its data processing and analytics obstacles by transforming its data files from one format to another. With their data pipeline clogged due to the text-based format, the company encountered delays and incurred elevated processing and storage expenses. Recognizing the criticality of data being in the appropriate format, they took action to streamline their processes, resulting in substantial time and cost savings.
Issues Faced
The company in question was a data analytics firm that received over 1 billion records daily from four or more data partners. It generated insights from this data and sold them to clients. The data arrived in text formats, including CSV, TSV, and JSON. The company needed to query this data to produce ad-hoc reports based on client needs, covering time frames ranging from a day to a year.
To process this data on an ad-hoc basis, they compressed the files to reduce their size, partitioned the data by date, and used AWS Athena for querying. Despite these measures, the company still faced significant query latency and high query costs due to the large volume of data and the long time frames being queried. For instance, generating a report for a year requires scanning far more data than generating one for a week, which led to time-consuming query optimization and retries when queries failed.
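To see why longer time ranges were so much more expensive, consider how date partitioning interacts with a query's date range: the engine must scan every partition inside the range. The sketch below illustrates this with made-up S3 paths and a `dt=` partition key (both are assumptions for illustration, not the company's actual layout).

```python
# Rough sketch: with date-based partitioning, the number of partitions
# scanned grows linearly with the query's date range.
from datetime import date, timedelta

def partitions_scanned(start: date, end: date) -> list[str]:
    """List the dt= partitions a query over [start, end] must read."""
    days = (end - start).days + 1
    return [f"s3://bucket/events/dt={start + timedelta(d)}/" for d in range(days)]

week = partitions_scanned(date(2023, 3, 1), date(2023, 3, 7))
year = partitions_scanned(date(2022, 3, 1), date(2023, 2, 28))

print(len(week), len(year))  # 7 vs 365: a yearly report touches ~52x more partitions
```

Partition pruning limits *which* files are read, but within the matching partitions a text format still forces the engine to read every byte of every record.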
What was done?

The previous architecture
During our analysis of their architecture, we observed that most of the reports they generated only required a specific subset of columns from their data, rather than all of them. Additionally, we knew from prior experience that binary data formats are processed more efficiently by machines than text data formats.
To address their data processing and reporting challenges, we recommended Apache Parquet, a binary data format that stores data in a columnar manner and has built-in compression support. With this change, the query engine reads only the columns a report needs rather than entire records, reducing both query processing time and the amount of data scanned. This is particularly relevant for tools like AWS Athena, which bills customers by the amount of data scanned. As a result, the company also achieved a 27% reduction in storage costs within AWS S3.
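A toy byte count makes the columnar advantage concrete. This is a simplified model, not Parquet itself (Parquet adds encodings, compression, and statistics on top), with invented field names and values:

```python
# Row storage: every query touches whole records.
# Columnar storage: only the requested columns are read.
records = [
    {"date": "2023-03-01", "partner": "p1", "impressions": "10432",
     "clicks": "120", "country": "US", "device": "mobile"},
] * 1000

# Re-lay the same data out column by column.
columns = {k: [r[k] for r in records] for k in records[0]}

def size(values):
    """Rough serialized size in bytes."""
    return sum(len(v) for v in values)

row_scan = sum(size(r.values()) for r in records)           # all six fields
col_scan = size(columns["date"]) + size(columns["clicks"])  # 2 of 6 columns

print(f"row scan: {row_scan} B, columnar scan: {col_scan} B")
```

A report needing two of six columns reads less than half the bytes in the columnar layout, and on Athena that translates directly into lower cost.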

Transform the data from Text to Parquet
The initial step involved transforming the data from text to Parquet format, which can be accomplished using AWS Athena or AWS Glue. Although data transformation involves some cost and time, the benefits of enhanced performance and long-term cost savings make it a worthwhile investment. Additionally, the transformation only needs to run once, when the data is received, after which the old text files can be archived or purged as they are no longer needed.

Generate reports from the Parquet data, accessible through AWS Athena
What Improved?
Upon implementing the recommended architecture changes, the company achieved significant improvements, including:
- 63% improvement in query performance
- 38% reduction in AWS Athena costs
- 27% reduction in AWS S3 storage costs
Conclusion
In ad-hoc query processing, a solid understanding of data formats is crucial, as this company's experience demonstrates: choosing the right format can substantially improve query performance while also reducing processing and storage costs.