We’ve seen the rise of social media applications like WhatsApp, Instagram, and TikTok over the last two decades.
This has resulted in the generation of an unthinkable amount of data daily. As of 2022, WhatsApp alone recorded over 2.44 billion active users who exchanged over 100 billion messages daily.
Traditional databases and data warehouses are inefficient for analyzing, storing, and manipulating datasets this large, so many organizations have turned to data lakes. The question then becomes how you'll manage the data in your data lake.
In this post, we’ll discuss Apache Iceberg, its features, and how it can help in big data analysis.
What Is Apache Iceberg?
Apache Iceberg is an open-source table format that enables big data analysts to efficiently manage and query data in a data lake. Iceberg creates metadata around your files, allowing tools to treat them as tables with the functionality of SQL tables in traditional databases (ACID transactions, time travel, and more).
Apache Iceberg was developed to address some of the limitations of traditional catalogs and table formats like Apache Hive, such as poor query performance and high storage costs.
It introduced a lot of interesting features that help users more effectively manage data.
One of the many features it has is the support for expressive SQL. Iceberg supports SQL-like commands to carry out operations like merging new data, updating rows, and performing delete operations.
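To make this concrete, here's a minimal Python sketch of what a merge (upsert) does at the row level. This is an illustration of the semantics only, not the Iceberg API, and the table and column names are made up; against a real Iceberg table in Spark SQL, the same operation would be a `MERGE INTO … WHEN MATCHED THEN UPDATE … WHEN NOT MATCHED THEN INSERT` statement.

```python
# Illustrative only: simulates the row-level effect of a MERGE (upsert)
# on plain Python dicts -- not the Iceberg API. In Spark SQL against an
# Iceberg table, the equivalent would look like:
#   MERGE INTO db.events t USING updates u ON t.id = u.id
#   WHEN MATCHED THEN UPDATE SET *
#   WHEN NOT MATCHED THEN INSERT *

def merge_rows(target, updates, key="id"):
    """Upsert `updates` into `target`, matching rows on `key`."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # update if matched, insert if not
    return sorted(merged.values(), key=lambda r: r[key])

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
updates = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
print(merge_rows(table, updates))
```

The point is that the engine, not the user, figures out which rows to rewrite; Iceberg's metadata tells it which files those rows live in.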
Iceberg is also an improvement over Hive, which tracks data at the folder level. Apache Iceberg tracks records at the file level instead, which makes it easy to perform atomic changes to a single record without rewriting the whole directory, a practice that can cause performance issues.
Apache Iceberg's features fall into three categories.
1. User-Experience Features
- Schema evolution—With Iceberg's support for schema evolution, users can add, drop, rename, update, and reorder columns in a table without rewriting the table. Iceberg keeps these changes free of side effects by assigning a unique ID to every newly created column and recording that ID in the column metadata. Because columns are resolved by ID rather than by name or position, operations like adding, renaming, or updating a column never read the existing values of another column, and dropping a column can't accidentally delete data from a different one. Each column always remains distinct.
- Hidden partitioning—Users no longer need to know the structural layout of the files in a table before querying it. Iceberg hides the table's partitioning and finds the exact records matching a query without requiring extra partition filters. It can also evolve the partition layout of your table as your data scales. This is unlike Hive, which requires you to supply exact partition filters to return correct results.
- Time travel—Iceberg's versioning saves every change to your data for future reference, whether it's an add, an update, or a delete. That way, if you have any problems with the current version of your data, you can easily roll back to a more stable, older version, ensuring you never lose all your data. It also means you can compare your current data to previous versions.
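The schema evolution point above hinges on column IDs. Here's a small Python sketch of that idea, a simplified model of my own rather than Iceberg's actual internals: columns are resolved by a stable ID, so renaming one never touches the existing data files.

```python
# Illustrative sketch of ID-based schema evolution (not Iceberg's real
# metadata format): every column gets a unique, stable ID when created,
# and readers resolve columns by ID, so a rename is metadata-only.

class Schema:
    def __init__(self):
        self._next_id = 1
        self.columns = {}  # column ID -> current column name

    def add_column(self, name):
        col_id = self._next_id  # fresh, never-reused ID per new column
        self._next_id += 1
        self.columns[col_id] = name
        return col_id

    def rename_column(self, col_id, new_name):
        # data files reference the ID, so they are untouched by a rename
        self.columns[col_id] = new_name

schema = Schema()
ts_id = schema.add_column("ts")
schema.add_column("user")
schema.rename_column(ts_id, "event_time")
print(schema.columns)
```

Because a dropped column's ID is never reused, a later column with the same name can't accidentally resurrect the old data.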
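Hidden partitioning can also be sketched in a few lines of Python. This is a toy model, not Iceberg's implementation: the table applies a partition transform like `day(ts)` itself, so queries filter on the raw timestamp column and still read only the matching partition.

```python
# Illustrative sketch of hidden partitioning (not Iceberg internals):
# the table, not the user, applies a partition transform such as
# day(ts), so a query filtering on the raw `ts` column lands in the
# right partition without any explicit partition predicate.

from datetime import datetime

def day_transform(ts):
    # simplified stand-in for Iceberg's day() partition transform
    return ts.date()

partitions = {}  # partition value -> rows

def write(row):
    partitions.setdefault(day_transform(row["ts"]), []).append(row)

def query(ts_filter):
    # the engine derives the partition from the filter on `ts` itself
    return partitions.get(day_transform(ts_filter), [])

write({"ts": datetime(2023, 5, 1, 9, 30), "event": "login"})
write({"ts": datetime(2023, 5, 2, 11, 0), "event": "logout"})
print(query(datetime(2023, 5, 1, 23, 59)))
```

In Hive, by contrast, the user would have to filter on a separate partition column (like `dt = '2023-05-01'`) and get it exactly right.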
2. Reliability Features
- Snapshot isolation—This guarantees that any read of the dataset sees a consistent snapshot. In essence, a read uses the last snapshot that was committed at the time the read started. For example, say writers X and Y are simultaneously updating a record, and Y commits its changes before X. Y's commit creates a new snapshot, which is attached to the table's metadata. When X finishes its update and is ready to commit, a check is carried out to see whether X's changes were based on the latest snapshot (the one containing Y's updates). If they weren't, X retries against that snapshot so that no conflicting changes are committed. After both commits, the record is updated and a new snapshot reflects the changes from both X and Y.
- Atomic commits—This ensures data remains consistent across every query. Either an update to the dataset completes fully or none of it is saved, so partial changes are impossible. That way, queries return only correct data, and users never see incomplete or inconsistent results.
- Reliable reads—Every transaction (update, add, drop) on Iceberg creates a new snapshot, so readers always query a complete, committed version of the table and can use the most recent snapshots to run reliable queries.
- File-level operations—Atomic changes are either impossible or very tedious with traditional catalogs because they track data by position or by name, requiring you to read through a directory and its partition(s) before you can update a single record. With Iceberg, you can target a single record directly and update it without touching the rest of the folder. This is possible because of the file-level records stored in its metadata.
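The snapshot isolation and atomic commit bullets together describe optimistic concurrency. Here's a minimal Python sketch of that check, a simplified model rather than Iceberg's actual commit protocol: a commit succeeds only if the writer started from the table's current snapshot, and the commit itself is a single atomic swap.

```python
# Illustrative sketch of optimistic, snapshot-based commits (a toy
# model, not Iceberg's real commit protocol): each writer records the
# snapshot it started from, and a commit only succeeds if that snapshot
# is still current; otherwise the writer must retry from the latest one.

class Table:
    def __init__(self):
        self.snapshot_id = 0  # current committed snapshot

    def begin(self):
        return self.snapshot_id

    def commit(self, base_snapshot):
        if base_snapshot != self.snapshot_id:
            return False       # another writer committed first: retry
        self.snapshot_id += 1  # atomic swap to the new snapshot
        return True

t = Table()
x_base = t.begin()           # writer X starts
y_base = t.begin()           # writer Y starts concurrently
assert t.commit(y_base)      # Y commits first and wins
assert not t.commit(x_base)  # X's base is stale: conflict detected
assert t.commit(t.begin())   # X retries from the latest snapshot
```

Readers are never blocked by any of this: they simply keep reading whichever snapshot was current when their query started.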
3. Performance Features
Apache Iceberg holds some interesting performance features that make working with it really easy. In Iceberg, every file that belongs to a table has some metadata stored for any transaction that occurs, along with some extra statistics for each file. This enables users to carry out “scan planning,” which refers to the process of finding only the files that match the query.
Iceberg uses two levels of metadata to effectively track the files in a snapshot: manifest files and the manifest list. Manifest files track the data files, along with their partition data and column-level stats, while the manifest list stores the snapshot's manifests, with a value range for each partition field.
To achieve fast scan planning, Iceberg uses the partition min and max values in the manifest list to filter out manifests, then reads only the remaining manifests to find the matching data files. This makes it possible to plan a scan without reading every manifest file, because the manifest list narrows down the number of manifest files you need to read. The result is efficient, cost-effective queries and better overall query performance.
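Here's a small Python sketch of that two-level pruning. It's a toy model with made-up file names, not the real manifest format, but it shows how min/max ranges at each level cut down the work:

```python
# Illustrative sketch of two-level scan planning (a toy model, not the
# real Iceberg manifest format): the manifest list's per-partition
# min/max values prune whole manifests, and only surviving manifests
# are read to find the matching data files.

manifest_list = [
    {"manifest": "m1", "min": 1, "max": 100,
     "files": [{"path": "a.parquet", "min": 1, "max": 50},
               {"path": "b.parquet", "min": 51, "max": 100}]},
    {"manifest": "m2", "min": 101, "max": 200,
     "files": [{"path": "c.parquet", "min": 101, "max": 200}]},
]

def plan_scan(value):
    # level 1: prune manifests using the manifest list's value ranges
    candidates = [m for m in manifest_list
                  if m["min"] <= value <= m["max"]]
    # level 2: read only the surviving manifests to pick data files
    return [f["path"] for m in candidates for f in m["files"]
            if f["min"] <= value <= f["max"]]

print(plan_scan(120))
```

A query for `value = 120` never opens manifest `m1` at all, which is exactly why planning stays fast even when a table has thousands of manifests.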
Apache Iceberg Pros and Cons
Let’s have a look at some of the advantages of using a table format like Apache Iceberg to manage data querying and manipulation in data lakes.
- Efficient for small updates—Because Apache Iceberg pulls records from the file level rather than the folder level, updating a single record is easy: you no longer need to go through folders and partitions to make a small change. Instead, you can target and update that single record without rewriting the whole directory.
- Faster execution—Each file carries metadata that query engines can use to make faster, more effective queries. This reduces query execution time and prevents the engine from making expensive reads.
- Multiple choices of tools and engines—Apache Iceberg is open source by design, so it's independent of any particular tool or query engine. Users can choose whichever engine they want to work with, so there's no vendor lock-in. Iceberg also supports multiple query engines operating on the same data simultaneously without risking corruption. Engines and platforms that support Iceberg include Dremio, Spark, Flink, Snowflake, Cloudera, and Trino.
- Excels with big data—Apache Iceberg shines at managing big data. Thanks to intuitive features like schema evolution, snapshot isolation, and data compaction, querying tables with tens of petabytes of data spread across many partitions can be easy and effective.
For new users who aren’t particularly familiar with a table format or Apache Iceberg for big data management, using Apache Iceberg may be relatively complex, as making use of it requires some level of expertise and experience. One way of overcoming this hurdle is to work with a managed version of Apache Iceberg, like Tabular.
Even though Apache Iceberg has a strong community, it’s still a “new kid on the block.” Therefore, there are few third-party resources available.
But in no way does this affect its performance.
When Not to Use Iceberg
Here are a few cases where Apache Iceberg might not be a fit for your data manipulation and management needs.
- Small data—If you have a small dataset that doesn’t require you to make use of a data lake, using Iceberg might be overkill.
- Real-time ingestion—Apache Iceberg doesn't support ingesting real-time data out of the box because it relies on batch processing.
- Distributed framework—If you don't intend to use a distributed computing framework, then Apache Iceberg isn't the right choice for you, as it was designed for distributed engines that process data in parallel.
Essentially, you need to understand your organization's needs and weigh the pros and cons of Apache Iceberg to decide whether it suits your organization's need for managing data at scale.
In this post, we learned what Apache Iceberg is and some of its incredible features. We also went through some advantages and disadvantages of using Apache Iceberg, as well as when it might not be the best option to use.
To learn more about Apache Iceberg, check out this blog.
This post was written by Suleiman Abubakar Sadeeq. Suleiman Abubakar Sadeeq is an ambitious react developer learning and helping to build enterprise apps. In his free time, he plays football, watches soccer, and enjoys playing video games.