Managing large datasets presents a unique set of challenges. One frequent problem is moving data around, and for that, the data needs to be in a format that compresses well.
The data should also be searchable, and in a format that maintains the relationships between records and fields.
There are many formats to choose from. Comma-separated values (CSV) is a tried-and-true format that works with any language or platform, but it lacks a data dictionary and data types, so you’ll need to manage those yourself. Even then, CSV is usually unwieldy with large data sets.
JSON addresses many of CSV’s shortcomings, but it’s not designed for large-scale data storage. The files are large and searching them is still slow.
Fortunately, there are data formats designed for large data sets. Two of the most popular formats are Apache’s Avro and Parquet. Both of these were created as part of the Hadoop project, but they handle data in very different ways. Let’s look at the similarities and tradeoffs.
What Is Avro?
Avro bills itself as a data serialization system. It’s an open-source data format for serializing and exchanging data. Even though it’s part of the Hadoop project, you can use it in any application, with or without Hadoop libraries. Avro’s serialization works for both data files and messages.
Avro is row-based, so it stores all the fields for each record together. This makes it the best choice for situations where all the fields for a record need to be accessed together.
Avro stores data in a binary format and data definitions in a JSON dictionary. It puts both the data and the schema together in a single file or message, so everything a program needs to process the data is in one place. Unlike similar systems such as Protocol Buffers, Avro clients don’t need generated code to read messages. This makes Avro an excellent choice for scripting languages.
One of Avro’s key benefits is robust support for schema changes via a mechanism called schema evolution. This feature gives Avro the ability to gracefully handle missing, new, and altered fields, making Avro data both backward and forward compatible.
When Avro decodes data, it can use two different schemas. One comes from the encoded data and reflects what the publisher encoded, and the other is from the reader and indicates what they expect. Avro will work out the differences so the reader has a usable result.
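The reader/writer resolution described above can be sketched in plain Python. This is a toy illustration of the idea, not the real Avro library: the reader’s schema decides which fields the consumer sees, defaults fill in fields the writer never wrote, and fields added by a newer writer are simply dropped. The field names and defaults here are made up for the example.

```python
def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    reader_schema is a list of (field_name, default) pairs -- a
    simplified stand-in for a real Avro schema, for illustration only.
    """
    # Keep only the fields the reader expects; fall back to the
    # reader's default when the writer's data lacks a field.
    return {name: record.get(name, default) for name, default in reader_schema}


# The writer used a newer schema that added "signup_date"; the reader
# still expects an "email" field, with a default for older data.
decoded = {"id": 1, "name": "John Doe", "signup_date": "2024-01-01"}
reader_schema = [("id", None), ("name", None), ("email", "unknown")]

print(resolve(decoded, reader_schema))
# {'id': 1, 'name': 'John Doe', 'email': 'unknown'}
```

The unexpected `signup_date` field is ignored and the missing `email` field gets its default, so old readers and new writers can coexist.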
Avro supports primitive (boolean, int, long, and string), complex (enumerations, maps, arrays, and user-defined records), and nested data types.
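Avro schemas themselves are written in JSON. As an illustration (the record and field names are invented for this example), a schema for a simple user record could look like this, with a union type making the optional field nullable:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```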
The data format has APIs for Java, Python, Ruby, C, C++, Perl, and PHP. Data serialized by one language can be used by another, and Avro’s C interface means it’s callable from many other languages.
What Is Avro Used For?
The two primary uses for Avro are data serialization and remote procedure calls.
Avro’s speed, flexibility, and compact size make it a popular choice for storing data and for transmitting it over messaging systems like Kafka. In both cases, programs can encode and decode data quickly, while easily managing versioning differences. It’s therefore easier to deploy new code versions with Avro than with primitive formats like CSV or JSON. You can also use Avro with RPC services like gRPC, and use its powerful versioning system with remote procedures.
What Is Parquet?
Apache Parquet is a free and open-source data format for data storage and retrieval. It’s also a product of the Hadoop project, but it differs from Avro in very important ways.
Parquet’s language support differs from Avro’s. The core Parquet project only releases Java jars, but C, C++, and Python support is available via the Arrow project. There is also a standalone Python library for reading Parquet files, and you can process them with Pandas.
The biggest difference is that Parquet is a column-oriented data format, meaning Parquet stores data by column instead of row. This makes Parquet a good choice when you only need to access specific fields. It also makes reading Parquet files very fast in search situations.
This difference also means that Parquet is not a good choice for network messages, since streaming data and column formats don’t work well together. While it’s possible to use a column-oriented format for streaming data, it often eliminates many of the performance benefits.
Parquet’s schema support is like Avro’s. It supports primitive types like boolean, int, long, and string, and offers robust support for complex and user-defined data types. But schema evolution is expensive with column data, since changes require reprocessing entire data sets instead of proceeding record by record as in row-oriented data.
What Is Parquet Used For?
Applications that need rapid access to specific fields in a large dataset use Parquet. The format works remarkably well for read-intensive applications and low-latency data storage and retrieval.
When you want to aggregate a few columns in a large data set, Parquet is your best bet. Writing files in Parquet is more compute-intensive than Avro, but querying is faster.
Many Python data analysis applications use Parquet files with the Pandas library.
Avro vs Parquet: So Which One?
Based on what we’ve covered, it’s clear that Avro and Parquet are different data formats and intended for very different applications.
The biggest difference is row-oriented vs. column-oriented data. How you need to access your data set is your guide to the format you need.
A quick example of how each format handles the data will make this clearer.
Let’s imagine a set of website users.
Here’s how they look in a row-oriented data format:
ID, name, email
1, John Doe, [email protected]
2, Jane Doe, [email protected]
3, Alfred Neumann, [email protected]
This could be a CSV file. Avro and CSV are both row-oriented, with the primary differences between them being how Avro stores the rows and its data dictionary.
Here’s the same data in a column-oriented file:
ID, 1, 2, 3
name, John Doe, Jane Doe, Alfred Neumann
Why would an application want to use column-oriented data? Imagine a query for all user names – that’s line #2 in the example. With three records, the difference seems almost academic, but imagine a more realistic data set, like 10k records. Then add 20 or 30 more fields to the user record. With row-oriented data, the program needs to process every record to retrieve each name. With column-oriented data, it’s still just one line in the file.
Even searching for a user with a specific email address is faster, since you only need to search that column and use the offset to get the rest of the data.
As we covered above, Avro is more tolerant of schema changes, and is also better for passing messages between services.
Let Your Data Decide
We’ve covered the Apache Avro and Parquet data formats. Both are products of the Apache Hadoop project, and are designed to solve data storage and transfer problems. While they both have robust support for schemas, complex data, and nested data types, they solve different issues. Avro is a row-oriented format that works for files and messages, while Parquet is columnar.
The format best suited for your application depends on your data and how you will use it. And, of course, you’re not limited to only one. If you process more than one data set, you may find you need both!
Whether the files are stored in the data lake or inside the data warehouse, SQreamDB supports querying both Avro and Parquet files.
This post was written by Eric Goebelbecker. Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).