By Allison Foster
Managing large datasets can be difficult. One frequent problem is moving data around, and for that, the data needs to be in a format that compresses well.
The data should also be searchable, and in a format that maintains the relationships between records and fields.
There are many formats to choose from. Comma-separated values (CSV) is a tried-and-true format that works with any language or platform, but it lacks a data dictionary and data types, so you’ll need to supply your own. Even then, CSV is usually unwieldy with large data sets.
JSON addresses many of CSV’s shortcomings, but it’s not designed for large-scale data storage. The files are large and searching them is still slow.
Fortunately, there are data formats designed for large data sets. Two of the most popular formats are Apache’s Avro and Parquet. Both of these were created as part of the Hadoop project, but they handle data in very different ways. Let’s look at the similarities and tradeoffs.
Avro is a data serialization system that provides an open-source format for serializing and exchanging data. Although it is part of the Hadoop project, it can be used in any application, regardless of whether Hadoop libraries are present. Avro’s serialization works effectively for both data files and messages.
Being row-based, Avro stores all fields of each record together, making it ideal for situations where all the fields for a record need to be accessed simultaneously.
Avro stores data in a binary format and represents data definitions in a JSON dictionary. It combines both the data and its schema into a single file or message, ensuring that everything needed to process the data is located in one place. Unlike similar systems, such as Protocol Buffers, Avro clients do not require generated code to read messages, making it an excellent choice for scripted languages.
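For example, here is a minimal sketch of writing and reading an Avro file with the fastavro Python library; the schema, field names, and sample values are all illustrative, not part of the original post.

    from fastavro import writer, reader, parse_schema

    # The schema is plain JSON (here a Python dict) and travels with the data file.
    schema = parse_schema({
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": "string"},
        ],
    })

    records = [
        {"id": 1, "name": "John Doe", "email": "john@example.com"},
        {"id": 2, "name": "Jane Doe", "email": "jane@example.com"},
    ]

    # The schema is embedded in the file header, so any reader can decode
    # the records without generated code.
    with open("users.avro", "wb") as out:
        writer(out, schema, records)

    with open("users.avro", "rb") as fo:
        for record in reader(fo):
            print(record)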
One of Avro’s key advantages is its robust support for schema changes through schema evolution. This feature allows Avro to handle missing fields, new fields, and altered fields gracefully, ensuring that Avro data is both backward and forward compatible.
When Avro decodes data, it can utilize two different schemas: one is from the encoded data, reflecting what the publisher encoded, and the other is from the reader, indicating their expectations. Avro reconciles the differences to provide the reader with usable results.
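Here is a hedged sketch of that reconciliation using fastavro’s schemaless writer and reader; the added “email” field and its default value are assumptions made for the example.

    import io
    from fastavro import schemaless_writer, schemaless_reader

    # Schema the publisher used to encode the record.
    writer_schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
        ],
    }

    # Schema the reader expects: it adds an "email" field with a default,
    # so records written with the older schema still decode cleanly.
    reader_schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "email", "type": "string", "default": ""},
        ],
    }

    buf = io.BytesIO()
    schemaless_writer(buf, writer_schema, {"id": 1, "name": "John Doe"})
    buf.seek(0)

    # fastavro reconciles the two schemas while decoding.
    record = schemaless_reader(buf, writer_schema, reader_schema)
    print(record)  # {'id': 1, 'name': 'John Doe', 'email': ''}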
Avro supports various data types, including primitive types (such as boolean, int, long, and string), complex types (like enumerations, maps, and arrays), and user-defined records. The data format features APIs for several programming languages, including Java, Python, Ruby, C, C++, Perl, and PHP. Additionally, data serialized in one language can be used in another, and Avro’s C interface allows it to be called from many other languages.
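As an illustration, an Avro schema mixing primitive, complex, and nested types might look like the following (expressed as a Python dict; every field and type name here is made up for the example).

    # An illustrative Avro schema combining primitive, complex, and nested types.
    user_schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
            {"name": "active", "type": "boolean"},
            {"name": "role", "type": {"type": "enum", "name": "Role",
                                      "symbols": ["ADMIN", "EDITOR", "VIEWER"]}},
            {"name": "tags", "type": {"type": "array", "items": "string"}},
            {"name": "preferences", "type": {"type": "map", "values": "string"}},
            {"name": "address", "type": {
                "type": "record", "name": "Address",
                "fields": [
                    {"name": "city", "type": "string"},
                    {"name": "country", "type": "string"},
                ],
            }},
        ],
    }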
The two primary uses for Avro are data serialization and remote procedure calls.
Avro’s speed, flexibility, and compact size make it a popular choice for storing data and for transmitting it over messaging systems like Kafka. In both cases, programs can encode and decode data quickly, while easily managing versioning differences. It’s therefore easier to deploy new code versions with Avro than with primitive formats like CSV or JSON. You can also use Avro with RPC services like gRPC, and use its powerful versioning system with remote procedures.
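A minimal sketch of that messaging pattern with the kafka-python client: encode one record to compact Avro bytes and publish it. The broker address, topic name, and schema are assumptions for illustration, and a real deployment would typically manage schemas separately (for example, with a schema registry).

    import io
    from fastavro import schemaless_writer
    from kafka import KafkaProducer  # kafka-python; assumes a broker is reachable

    user_schema = {
        "type": "record",
        "name": "User",
        "fields": [
            {"name": "id", "type": "long"},
            {"name": "name", "type": "string"},
        ],
    }

    # Encode a single record to compact Avro bytes; no file header is needed
    # for a message payload.
    buf = io.BytesIO()
    schemaless_writer(buf, user_schema, {"id": 1, "name": "John Doe"})

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("users", value=buf.getvalue())
    producer.flush()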
Apache Parquet is a free and open-source data format for data storage and retrieval. It’s also a product of the Hadoop project, but it differs from Avro in very important ways.
Parquet’s language support differs from Avro’s. The core Parquet project only releases Java jars, but C, C++, and Python support is available via the Arrow project. There is also a Python library for reading Parquet files, and you can process them with Pandas.
The biggest difference between Avro and Parquet is that Parquet is a column-oriented data format, meaning Parquet stores data by column instead of row. This makes Parquet a good choice when you only need to access specific fields. It also makes reading Parquet files very fast in search situations.
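For instance, with the pyarrow library you can write a small Parquet file and then read back only the column you need; the file name and sample data are illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Build a small table and write it as a Parquet file.
    table = pa.table({
        "id": [1, 2, 3],
        "name": ["John Doe", "Jane Doe", "Alfred Neumann"],
        "email": ["john@example.com", "jane@example.com", "alfred@example.com"],
    })
    pq.write_table(table, "users.parquet")

    # Read back only the column we need; the other columns are never touched.
    names = pq.read_table("users.parquet", columns=["name"])
    print(names.column("name").to_pylist())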
This difference also means that Parquet is not a good choice for network messages, since streaming data and column formats don’t work well together. While it’s possible to use a column-oriented format for streaming data, it often eliminates many of the performance benefits.
Parquet’s schema support is similar to Avro’s. It supports primitive types like boolean, int, long, and string, and offers robust support for complex and user-defined data types. But schema evolution is expensive with columnar data, since changes require reprocessing entire data sets rather than individual records, as row-oriented formats allow.
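As a sketch, a Parquet-compatible schema with nested and list types can be declared with pyarrow like this; the field names are illustrative.

    import pyarrow as pa

    # An illustrative Parquet-compatible schema with primitive, list, and nested types.
    schema = pa.schema([
        ("id", pa.int64()),
        ("name", pa.string()),
        ("active", pa.bool_()),
        ("tags", pa.list_(pa.string())),
        ("address", pa.struct([
            ("city", pa.string()),
            ("country", pa.string()),
        ])),
    ])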
Applications that need rapid access to specific fields in a large dataset use Parquet. The format works remarkably well for read-intensive applications and low-latency data storage and retrieval.
When you want to aggregate a few columns in a large data set, Parquet is your best bet. Writing files in Parquet is more compute-intensive than Avro, but querying is faster.
Many Python data analysis applications use Parquet files with the Pandas library.
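A typical pattern, sketched here with hypothetical file and column names, is to load only the columns an aggregation needs and let the rest of the file go unread.

    import pandas as pd

    # Load only the columns needed for the aggregation; pandas (via pyarrow
    # or fastparquet) skips the rest of the file.
    df = pd.read_parquet("orders.parquet", columns=["customer_id", "amount"])
    print(df.groupby("customer_id")["amount"].sum())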
Based on what we’ve covered, it’s clear that Avro and Parquet are different data formats intended for very different applications.
The biggest difference is row- vs. column-oriented data. How you need to access your data set is your guide to the format you need.
A quick example of how they handle data will make this clearer.
Let’s imagine a set of website users.
Here’s how they look in a row-oriented data format:
ID, name, email
1, John Doe, [email protected]
2, Jane Doe, [email protected]
3, Alfred Neumann, [email protected]
This could be a CSV file. Avro and CSV are both row-oriented, with the primary differences between them being how Avro stores the rows and its data dictionary.
Here’s the same data in a column-oriented file:
ID, 1, 2, 3
name, John Doe, Jane Doe, Alfred Neumann
email, [email protected], [email protected], [email protected]
Why would an application want to use column-oriented data? Imagine a query for all user names – that’s the second line in the column-oriented example. With three records, the difference seems almost academic, but imagine a more realistic data set, like 10,000 records. Then add 20 or 30 more fields to the user record. With row-oriented data, the program needs to process every record to retrieve each name. With column-oriented data, it’s still just one line in the file.
Even searching for a user with a specific email address is faster, since you only need to search that column and use the offset to get the rest of the data.
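The following plain-Python sketch, with made-up records, shows why: in a row layout every record must be visited to collect one field, while in a column layout that field is already a single list.

    # Row-oriented: one dict per record; every field of a record sits together.
    rows = [
        {"id": 1, "name": "John Doe", "email": "john@example.com"},
        {"id": 2, "name": "Jane Doe", "email": "jane@example.com"},
    ]

    # Column-oriented: one list per field; all values of a field sit together.
    columns = {
        "id": [1, 2],
        "name": ["John Doe", "Jane Doe"],
        "email": ["john@example.com", "jane@example.com"],
    }

    # "All user names" touches every record in the row layout...
    names_from_rows = [row["name"] for row in rows]
    # ...but is a single lookup in the column layout.
    names_from_columns = columns["name"]

    # A lookup by email scans only one list, then uses the position
    # to pull the matching values from the other columns.
    i = columns["email"].index("jane@example.com")
    user = {field: values[i] for field, values in columns.items()}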
As we covered above, Avro is more tolerant of schema changes, and is also better for passing messages between services.
We’ve covered the Apache Avro and Parquet data formats. Both are products of the Apache Hadoop project, and are designed to solve data storage and transfer problems. While they both have robust support for schemas, complex data, and nested data types, they solve different issues. Avro is a row-oriented format that works for files and messages, while Parquet is columnar.
The format best suited for your application depends on your data and how you will use it. And, of course, you’re not limited to only one. If you process more than one data set, you may find you need both!
Whether the files are stored in the data lake or inside the data warehouse, SQreamDB & SQream Blue support querying both Avro and Parquet files.
This post was written by Eric Goebelbecker. Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).