Data Lakes: Understanding File Format and Table Format

Rajesh Shah
Mar 6, 2022

Overview

It has been about a year since my data engineering journey started. For enterprises, data engineering is instrumental in the successful execution of ML/AI use cases, and a strong data engineering team is an integral part of any ML/AI platform.

Enterprises have created “Data Lakes” to store the vast amounts of data generated every day. This data can be structured or unstructured (for example audio, images and video). In today’s story I will try to explain two important concepts, or design decisions, behind Data Lakes: File Format and Table Format. My experience has been with structured data, so that will be the focus of this story.

File Format: Data at rest in a “Data Lake” is stored in an object store such as “Amazon S3”, “Google Cloud Storage” or “Azure Blob Storage”. The format in which those objects are written is called the File Format. Storing data as “json” or “xml” is not recommended, since both are verbose text formats that are costly to store and scan at scale. Choosing a purpose-built data storage file format such as “Avro” or “Parquet” is important for the success of a data lake. Such a format lets you define a schema for structured data, and it stores the data in either a row layout (for example Avro) or a columnar layout (for example Parquet).
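To make the row versus columnar distinction concrete, here is a minimal sketch in plain Python. It does not use Avro or Parquet, and the schema, column names and helper functions are all made up for illustration. It only shows the idea that a schema is checked up front and that a columnar layout keeps each column's values together.

```python
from typing import Any, Dict, List

# Hypothetical schema for illustration: column name -> expected Python type.
SCHEMA = {"user_id": int, "country": str, "amount": float}

# Row layout: each record keeps all of its fields together.
rows = [
    {"user_id": 1, "country": "IN", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "IN", "amount": 7.25},
]


def validate(row: Dict[str, Any], schema: Dict[str, type]) -> None:
    """Raise if the row does not match the schema (toy check, not a real format)."""
    if set(row) != set(schema):
        raise ValueError(f"unexpected columns: {sorted(row)}")
    for name, expected in schema.items():
        if not isinstance(row[name], expected):
            raise TypeError(f"column {name!r} expects {expected.__name__}")


def to_columnar(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> Dict[str, List[Any]]:
    """Pivot validated rows into a columnar layout: one list of values per column."""
    for row in rows:
        validate(row, schema)
    return {name: [row[name] for row in rows] for name in schema}


columns = to_columnar(rows, SCHEMA)

# An aggregate over one column only needs that column's values.
total_amount = sum(columns["amount"])
```

In the row layout, summing “amount” means walking every record; in the columnar layout the same query reads a single list. That is roughly why columnar formats tend to suit analytical queries, while row formats tend to suit record-at-a-time writes.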

Table Format: Once we have decided on a “File Format”, remember that we can’t put all the data in a single file. Hence we need to decide on a “Table Format” to organize the files in a “Data Lake”. This decision is important because it drives how efficiently you can select/insert/update/delete data. Some of the popular Table Formats are “Hive”, “Iceberg”, etc. You can also think of this as the Metadata Layer of a Data Lake.
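The “Metadata Layer” idea can be sketched as a small in-memory catalog that records which data files make up a table. This is a toy model under my own assumptions and not how Hive or Iceberg are actually implemented. The class names, the bucket path and the partition column are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical object-store location of one data file
    partition: str   # value of the partition column shared by all rows in the file
    row_count: int


@dataclass
class ToyTable:
    # Each snapshot is the complete list of files that make up the table at
    # that point in time. The table starts with one empty snapshot.
    snapshots: List[List[DataFile]] = field(default_factory=lambda: [[]])

    def current_files(self) -> List[DataFile]:
        return self.snapshots[-1]

    def append(self, new_file: DataFile) -> None:
        """Insert: record a new snapshot that also includes the new file."""
        self.snapshots.append(self.current_files() + [new_file])

    def delete_partition(self, partition: str) -> None:
        """Delete: record a new snapshot without the files of that partition."""
        kept = [f for f in self.current_files() if f.partition != partition]
        self.snapshots.append(kept)

    def plan_scan(self, partition: str) -> List[str]:
        """Select: use metadata only to decide which files need to be read."""
        return [f.path for f in self.current_files() if f.partition == partition]


base = "s3://example-bucket/sales"  # made-up location
table = ToyTable()
table.append(DataFile(f"{base}/country=IN/part-0.parquet", "IN", 2))
table.append(DataFile(f"{base}/country=US/part-0.parquet", "US", 1))
table.append(DataFile(f"{base}/country=IN/part-1.parquet", "IN", 5))

files_for_in = table.plan_scan("IN")   # only the two IN files are selected
table.delete_partition("US")           # metadata-only change in this toy model
```

A query for one country never opens the other country's files, and a delete here only changes the list of files. Real table formats are far more involved, but the efficiency of select/insert/update/delete depends on this kind of bookkeeping.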

Using a standard File Format and Table Format for storing data helps in building BI/ML/AI applications around a “Data Lake”. A lot of open source tooling is built around these standards.

Popular File Formats

Apache Avro

Apache Parquet

Apache ORC

Popular Table Formats

Apache Hive

Apache Iceberg

Delta Lake

Apache Hudi

Final Thoughts

This story is a quick attempt to explain two high-level concepts of Data Lakes, “File Format” and “Table Format”, and the technologies behind them. When I started my data engineering journey, I was perplexed by so many technologies. Even though I had prior experience working with RDBMS, understanding big data technologies was initially a challenge. I think you have to step back and understand the challenges of managing big data to appreciate all the standards that have been developed in this space.

This is the best time to be in data engineering, especially to make the ML/AI dream successful!

Disclaimer: This is a personal blog. The opinions expressed here represent my own and not those of my current or any previous employers.

Rajesh Shah

Software Engineer with 15+ years of experience (interested in Cloud Computing, Kubernetes, Docker, Serverless Computing and Blockchain Technologies)