Transforming CSV files to more efficient Apache Parquet in a few minutes using AWS Athena

Apache Parquet is an open source, columnar data file format designed for efficient data storage and retrieval.

Before we go into our topic, let us see why it makes sense to use Apache Parquet compared to CSV/TSV files.

Advantages of Apache Parquet

It is a binary format, so programs can read it in less time than a text format, which has to be parsed first.
It is a columnar format, so if you only need a few columns, only those columns have to be read. That saves a lot of query time, and because less data is scanned, services such as AWS Athena, AWS Redshift Spectrum, and Google BigQuery can take advantage of it and save you money too.
Parquet supports data compression and many different encodings, so you can cut your storage costs as well.
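To make the columnar point concrete, here is a toy sketch in plain Python. It is not how Parquet actually stores data on disk; it only assumes a made-up three-column CSV sample and counts how many fields have to be touched to sum one column in a row layout versus a column layout.

```python
import csv
import io

# Hypothetical sample data, used only for this illustration.
CSV_TEXT = "id,city,amount\n1,Pune,10\n2,Oslo,20\n3,Lima,30\n"


def sum_amount_row_oriented(csv_text):
    """Row layout: every field of every row is parsed to reach one column."""
    fields_touched = 0
    total = 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields_touched += len(row)
        total += int(row["amount"])
    return total, fields_touched


def to_columns(csv_text):
    """Columnar layout: the values of each column are stored together."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_columnar(columns):
    """Only the 'amount' column is read; the other columns are never touched."""
    values = columns["amount"]
    return sum(int(v) for v in values), len(values)


row_total, row_fields = sum_amount_row_oriented(CSV_TEXT)
col_total, col_fields = sum_amount_columnar(to_columns(CSV_TEXT))
print(row_total, row_fields)  # 60 9
print(col_total, col_fields)  # 60 3
```

Both approaches get the same total, but the row layout touches every field while the column layout touches only the one column it needs. With wide tables and millions of rows, that gap is where the time and scan savings come from.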

This blog post focuses on converting data from regular text-based CSV/TSV files to Apache Parquet. You can read more about the cost savings and performance improvements in a nice blog post by the AWS team.

In this post we will use a CSV file of 100 million records, approximately 3 GB in size, available on Kaggle. Feel free to use your own set of files for this.

Brief steps to achieve this

Understand your CSV Data Structure.
Load the CSV data into AWS S3.
Create the table in AWS Athena to be able to view it.
Use AWS Athena to convert your data from CSV to Parquet.
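The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact statements for any particular dataset: the database, table, and bucket names (demo_db, sales_csv, sales_parquet, example-bucket) and the sample columns are all made up for illustration, and the type guessing looks only at the first data row. The table definition assumes a simple comma-delimited file with a header row and no quoted fields. The optional run_query helper assumes boto3 is installed and AWS credentials are configured.

```python
import csv
import io


def guess_type(value):
    """Guess an Athena column type from a single sample value."""
    try:
        int(value)
        return "bigint"
    except ValueError:
        pass
    try:
        float(value)
        return "double"
    except ValueError:
        return "string"


def infer_columns(sample_text):
    """Step 1: read the header and first data row to understand the structure."""
    reader = csv.reader(io.StringIO(sample_text))
    header = next(reader)
    first_row = next(reader)
    return [(name.strip().lower(), guess_type(value))
            for name, value in zip(header, first_row)]


def build_csv_table_ddl(database, table, columns, s3_location):
    """Step 3: external table over the CSV files already uploaded to S3."""
    column_sql = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_sql}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count' = '1')"
    )


def build_parquet_ctas(database, source_table, target_table, s3_location):
    """Step 4: CREATE TABLE AS SELECT that writes the same data as Parquet."""
    return (
        f"CREATE TABLE {database}.{target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{s3_location}'\n"
        ")\n"
        f"AS SELECT * FROM {database}.{source_table}"
    )


def run_query(sql, database, output_location):
    """Optionally submit a statement to Athena (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the builders work without it

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical sample and names, for illustration only.
SAMPLE = "order_id,amount,country\n1001,25.5,IN\n"
columns = infer_columns(SAMPLE)
print(build_csv_table_ddl("demo_db", "sales_csv", columns,
                          "s3://example-bucket/csv/"))
print(build_parquet_ctas("demo_db", "sales_csv", "sales_parquet",
                         "s3://example-bucket/parquet/"))
```

Step 2, loading the CSV into S3, is just an upload to the location the first statement points at. You can paste the two printed statements into the Athena query editor, or pass them to run_query. If your CSV has quoted fields, the plain delimited row format will likely not be enough and a CSV SerDe would be the better fit.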

Conclusion

After seeing the advantages of Parquet over CSV/TSV, it should be easy to decide to switch to Apache Parquet, and AWS Athena makes it super easy to do so. Do let us know how your experience with this went, and whether you have any techniques or challenges to share. Keep loading!!
