A Detailed Introduction to the Avro Data Format

By Oren Askarov

11.21.2024

Introduction

In this article, we’re going to take a look at the Apache Avro data format, a data serialization system that converts an object in memory into a stream of bytes.

Serialization

The world of online shopping changed dramatically after COVID-19. Let’s examine the technology that enabled these changes.
Imagine a software application that manages orders. Each order contains data such as the ID, date, customer ID, type, and more. Depending on the programming language used, this information is typically loaded in memory as an object or structure. Below are a few scenarios in which we might want to serialize this object:

  1. **Storing the Object in a File**: Imagine that we want to share a list of orders with our suppliers without granting them direct access to our system. In this case, we could export the orders to a file by serializing each order into a stream of characters or bytes. Furthermore, if the number of orders exceeds our memory capacity, the application may temporarily store them in a file and only load the relevant ones for the current page.
  2. **Sending Over a Network Connection**: Suppose we need to share the orders with another system within our company, like the delivery system that operates its own web service. To transmit an order object to that service over the network, we would need to serialize it into a stream of bytes. After serialization, we will eventually need to deserialize the data, converting it back from the stream of bytes into an object. This process requires that the serialization format be structured to allow for accurate deserialization. There are several formats currently in use, each with its own drawbacks:

– **CSV**: Lacks support for complex data types such as structures and arrays.
– **XML**: Can be verbose and cumbersome.
– **JSON**: Still somewhat verbose, as including field names and string representations for various data types can take up significant space.
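
To make the JSON point concrete, here is a tiny sketch in Python (the order fields are simply the ones from the scenario above); note how every field name is repeated in every record we serialize:

import json

order = {"orderId": 11, "orderDate": "2022-01-27", "customerId": "CST223", "orderType": "INTERNAL"}

# Every serialized order repeats the field names and stores numbers as text,
# which is exactly the overhead a compact binary format avoids.
encoded = json.dumps(order).encode("utf-8")
print(len(encoded), encoded)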

What Is the Avro Data Format?

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON to define data types and protocols and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. It has two different types of schema languages: one for human editing (Avro IDL) and another which is more machine-readable based on JSON.

It is similar to Thrift and Protocol Buffers but does not require running a code-generation program when a schema changes (unless desired for statically typed languages).

Apache Spark SQL can access Avro as a data source.

 

Avro Schema

The schema specifies the structure of the serialized format. It defines the fields that the serialized data contains, their data types, whether they are optional, their default values, and so on. The schema itself is written in JSON. A schema can be any of the following:

  • A JSON string containing a type name, e.g., "string" or "int"
  • Or a JSON object that defines a new type, in the form { "type": "typeName", attributes }
  • Or a JSON array specifying a union of multiple types, e.g., ["null", "string", "int"] is a union of three types and means that a field using this type can be either null, a string, or an int
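
For illustration (these snippets are toy examples, not part of the order schema used later), the three forms might look like this:

"string"

{ "type": "record", "name": "point",
  "fields": [ {"name": "x", "type": "int"}, {"name": "y", "type": "int"} ] }

["null", "string", "int"]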

The built-in data types are as follows:

Primitive
  • boolean: a binary value
  • int: 32-bit signed integer
  • long: 64-bit signed integer
  • float: single precision (32-bit) IEEE 754 floating-point number
  • double: double precision (64-bit) IEEE 754 floating-point number
  • bytes: sequence of 8-bit unsigned bytes
  • string: Unicode character sequence
Complex
  • record: a type that contains named fields, each having a name, a type, an optional default value, etc.
  • enum: an enumeration of given values; other values aren’t allowed
  • array: an array of items
  • map: a map/dictionary of key-value pairs, where the keys must be strings
  • fixed: a fixed number of bytes
  • union: an array containing type names; the resulting type can be any one of these types

The complex types each have their own attributes (e.g., record has fields, array has items), but some attributes are common to all of them:

  • The doc attribute is used to document what the type is for.
  • The namespace attribute helps to avoid name collisions among types with the same name. For example, there could be multiple types named “document.” Each would have a different namespace (e.g., “com.company1” and “org.org2”). And we would then refer to a type by its full name to get the correct one (e.g., “com.company1.document,” instead of just “document”).
  • The default attribute specifies the default value for a type.

The above is a simplified explanation of the schema. For the full details, see the official Avro format specification.

Example

As an example, the schema for the order data that we saw above would be as follows:


{
  "namespace": "org.example",
  "type": "record",
  "name": "order",
  "fields": [
    {"name": "orderId", "type": "long"},
    {"name": "orderDate", "type": "int", "logicalType": "date"},
    {"name": "customerId", "type": "string"},
    {"name": "orderType", "type": { "type": "enum", "name": "orderType",
        "symbols": ["NORMAL", "INTERNAL", "CUSTOM"] }},
    {"name": "notes", "type": ["null", "string"], "default": null}
  ]
}

Here are some things to note about the above example:

  • The date type is not supported out of the box; it is represented as an int holding the number of days since the start of the Unix epoch. An extra attribute, logicalType, is supported so that applications handling this data can interpret or convert it further.
  • When we use a complex type in a field, it needs the object syntax, e.g., {"type": "enum", "name": "orderType", ...}, rather than just a type name.
  • The “notes” string field can have a null value. We express this by using the union of null and string as its type. It also has a default value of null; the default value will be used, when reading the serialized data, if this field is missing from the data.

Avro Data

The serialized data can be written in three formats/encodings (see the Avro specification for details):

  1. JSON format: This is the verbose, human-readable format, useful for debugging. But it’s not performant for parsing and takes up more space.
  2. Binary format: This is the compact, performant default format. Strings are encoded as UTF-8, and the other types are serialized to an optimized binary encoding. As a simplified example, the value 65535 takes five characters (five bytes) to encode as a string, but fits into just two bytes as a 16-bit binary integer. One important point is that this format doesn’t store the field names/IDs, so the schema used to write an object is always required when reading back the serialized data in order to identify the fields. (A small sketch of Avro’s integer encoding follows this list.)
  3. Single object encoding: In addition to the binary format, this also contains a fingerprint or hash of the schema with which it was created. This is useful for scenarios where the schema has changed a few times. The serialized data that we’re reading may belong to any of those schemas, and we want to know which data/message belongs to which schema. In this case, the fingerprint/hash can be used to find the matching schema. This could also be achieved by introducing a custom schema field in each message.
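
To illustrate why the binary format is so compact, below is a minimal sketch (based on the encoding rules in the Avro specification, not code taken from the avro library) of the variable-length zig-zag encoding Avro uses for int and long values; values close to zero, positive or negative, need only one or two bytes:

def zigzag_varint(n):
    # Zig-zag maps signed integers to unsigned ones so that values near zero
    # (positive or negative) become small: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    # The shift by 63 mimics the arithmetic shift used for Avro's 64-bit long type.
    z = (n << 1) ^ (n >> 63)
    # Variable-length encoding: 7 bits per byte, high bit set on all but the last byte.
    out = bytearray()
    while True:
        if z < 0x80:
            out.append(z)
            return bytes(out)
        out.append((z & 0x7F) | 0x80)
        z >>= 7

print(len(zigzag_varint(1)))       # 1 byte
print(len(zigzag_varint(-150)))    # 2 bytes
print(len(zigzag_varint(100000)))  # 3 bytes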

Object Container Files

Avro also provides a format for storing Avro records as a file. The records are stored in the binary format. The schema used to write the records is also included in the file. The records are stored in blocks and can be compressed. Also, the blocks make the file easier to split, which is useful for distributed processing like Map-Reduce. This format is supported by many tools/frameworks like Hadoop, Spark, Pig, and Hive.
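
For instance, a minimal sketch of writing and reading an object container file with the Python avro package might look like the following (the file name and the trimmed-down schema are just for illustration):

import avro.schema
from avro.datafile import DataFileWriter, DataFileReader
from avro.io import DatumWriter, DatumReader

# A trimmed-down order schema, just for this illustration.
schema = avro.schema.parse('''
{ "type": "record", "name": "order",
  "fields": [ {"name": "orderId", "type": "long"},
              {"name": "customerId", "type": "string"} ] }
''')

# The schema is embedded in the file, and records are stored in (optionally compressed) blocks.
writer = DataFileWriter(open("orders.avro", "wb"), DatumWriter(), schema, codec="deflate")
writer.append({"orderId": 11, "customerId": "CST223"})
writer.append({"orderId": 12, "customerId": "MST001"})
writer.close()

# The reader picks up the writer schema from the file itself.
reader = DataFileReader(open("orders.avro", "rb"), DatumReader())
for order in reader:
    print(order)
reader.close()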

Writing to Avro

Many popular languages have APIs for working with Avro. Here, we’re going to see an example of writing Avro in Python. Note how the objects we’re writing are just generic dictionaries, not classes with strongly typed methods like getOrderId(). We need to convert the date into an int containing the number of days since the epoch, since that is how Avro represents dates. At the end, we print the number of bytes used to encode the objects.

Here we’re using the DatumWriter to write individual records; we’re not using the Avro object container file format that also contains the schema. This will be the case when we’re writing individual items to a message queue or sending them as params to a service. The DataFileWriter should be used if we want to write an Avro object container file.


from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder
import avro.schema
from io import BytesIO
from datetime import datetime

schemaStr = '''
{ "type": "record",
  "name": "order",
  "fields" : [
    {"name": "orderId", "type": "long"},
    {"name": "orderDate", "type": "int", "logicalType" : "date"},
    {"name": "customerId", "type": "string"},
    {"name": "orderType", "type": { "type" : "enum", "name" : "orderType", "symbols" : ["NORMAL", "INTERNAL", "CUSTOM"] } },
    {"name": "notes", "type": [ "null", "string"], "default" : null }
  ]
}
'''

def daysSinceEpoch(dateStr):
    inpDate = datetime.strptime(dateStr, '%Y-%m-%d')
    epochDate = datetime.utcfromtimestamp(0)
    return (inpDate - epochDate).days

schema = avro.schema.parse(schemaStr)

wbuff = BytesIO()
avroEncoder = BinaryEncoder(wbuff)
avroWriter = DatumWriter(schema)
avroWriter.write({"orderId": 11, "orderDate": daysSinceEpoch("2022-01-27"), "customerId": "CST223",
    "orderType": "INTERNAL"}, avroEncoder)
avroWriter.write({"orderId": 12, "orderDate": daysSinceEpoch("2022-02-27"), "customerId": "MST001",
    "orderType": "NORMAL", "notes": "Urgent"}, avroEncoder)
print(wbuff.tell())

 

Reading From Avro Data Format

Writing to Avro is straightforward. We have the schema and data to be written. Reading Avro, however, may involve two schemas: the writer schema, which was used to write the message, and the reader schema that is going to read the message. This is necessary if there are different schema versions (maybe fields have been added or removed), and so the writer’s schema is different from the reader’s schema. It could also happen that the reader needs to read only a subset of the fields written. Below is the code in Python, appended to the one we already have for the writer. In this case, we’ve used a different schema for the reader, which contains only the orderId and a new field, extraInfo. Since extraInfo wasn’t in the written data, its default value will be used by the reader.


schemaStrRdr = '''
{ "type": "record",
  "name": "order",
  "fields" : [
    {"name": "orderId", "type": "long"},
    {"name": "extraInfo", "type": "string", "default" : "NA"}
  ]
}
'''
schemaRdr = avro.schema.parse(schemaStrRdr)
rbuff = BytesIO(wbuff.getvalue())
avroDecoder = BinaryDecoder(rbuff)
avroReader = DatumReader(schema, schemaRdr)  # Use both the writer and the reader schema

msg = None
while True:
    try:
        msg = avroReader.read(avroDecoder)
    except Exception as ex:
        msg = None
        print(ex)  # No proper EOF handling for DatumReader
    if msg is None:
        break
    print(msg, rbuff.tell())

Schema Evolution

By using both the writer and reader schemas when reading data, Avro allows the schema to evolve in a limited way, for example when fields are added or removed, or when a field’s type is changed in a way that is allowed (e.g., int promoted to long, float, or double; string converted to and from bytes). We’ve already seen an example of this in the reader code above. Note that if a field is missing from the data being read and does not have a default value in the reader’s schema, an error will be thrown. The full resolution rules are in the Avro specification.
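
As another small sketch of this (a made-up example, not taken from the article’s order schema), the following writes a record whose qty field is an int and reads it back with a schema in which qty has been promoted to long and a new field with a default has been added:

import avro.schema
from avro.io import DatumWriter, DatumReader, BinaryEncoder, BinaryDecoder
from io import BytesIO

# Writer schema: the original version of the record.
writerSchema = avro.schema.parse('''
{ "type": "record", "name": "item",
  "fields": [ {"name": "qty", "type": "int"} ] }
''')

# Reader schema: qty promoted from int to long, plus a new field with a default.
readerSchema = avro.schema.parse('''
{ "type": "record", "name": "item",
  "fields": [ {"name": "qty", "type": "long"},
              {"name": "unit", "type": "string", "default": "pcs"} ] }
''')

buff = BytesIO()
DatumWriter(writerSchema).write({"qty": 3}, BinaryEncoder(buff))

buff.seek(0)
decoded = DatumReader(writerSchema, readerSchema).read(BinaryDecoder(buff))
print(decoded)  # {'qty': 3, 'unit': 'pcs'}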

Avro Schema Registry

Avro messages may be consumed by many different applications; each needs a schema to read or write the messages; schemas change over time; and several schema versions may be in use at once. For these reasons, it makes sense to keep the schemas versioned and stored in a central registry, the Schema Registry. Typically, this is a distributed REST API service that manages and returns schema definitions.
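
As a rough sketch (assuming a Confluent-compatible Schema Registry at a hypothetical http://localhost:8081, a made-up subject name, and the third-party requests library), registering a schema and fetching its latest version over REST might look like this:

import json
import requests

REGISTRY = "http://localhost:8081"   # hypothetical registry address
SUBJECT = "orders-value"             # made-up subject name

orderSchema = { "type": "record", "name": "order",
                "fields": [ {"name": "orderId", "type": "long"} ] }

# Register a new schema version under the subject.
resp = requests.post(
    f"{REGISTRY}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(orderSchema)}))
print(resp.json())  # e.g. {"id": 1}

# Fetch the latest registered version of the schema.
latest = requests.get(f"{REGISTRY}/subjects/{SUBJECT}/versions/latest").json()
print(latest["version"], latest["schema"])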

 

Why Avro Data Format?

Why use Avro when there are already other formats present? There are quite a few reasons. Avro:

  • Has a compact and fast binary data format
  • Is a documented format that makes use of schemas for correctness
  • Has rich data types (e.g., arrays, maps, enumerations, records)
  • Provides a container file format that is splittable into chunks for distributed processing and contains the schema along with the data
  • Ships with integration with popular languages like Python, Java, C++, etc. and can work with map/dictionary-like objects to represent records; does not need to have strongly typed classes generated from schemas
  • Supports schema evolution (i.e., schema changes); for example, if a new field is added later to the schema, reading serialized data without the field won’t cause a problem

 

Conclusion

We’ve seen what the Avro data format is, how to define schemas, and how to read and write objects to and from Avro. If you want a fast, compact, and well-supported format that’s easy to use and validate, you should consider Avro. And if you’re looking for some useful tools to help you with Avro, check out this Avro schema validator.

This post was written by Manoj Mokashi. Manoj has more than 25 years of experience as a developer, mostly in Java, web technologies, and databases. He has also worked on PHP, Python, Spark, AI/ML, and Solr, and he really enjoys learning new things.

 

SQream Avro Support

 

Avro is a popular binary, row-based serialization format. It can be seen as a binary alternative to JSON, drawing inspiration from JSON’s flexibility and nesting while offering a much more efficient storage method. Avro data consists of a JSON-formatted schema and a set of data payloads, which are serialized either in the Avro binary format or in JSON; the latter is mostly used for debugging and is not supported on its own by all Avro-consuming products. SQream supports Avro with simple data ingestion capabilities for both batch and streaming (for example, Kafka). Once ingested, Avro objects are mapped into rows. All existing data types can be represented in JSON fields, including JSONPath support and encoding. It is a simple plug-and-play solution for Avro objects and files that reduces complexity and accelerates total time to insight (TTTI).