By Ohad Shalev
Social media applications like WhatsApp, Instagram, and TikTok have risen to prominence over the last two decades.
As a result, an enormous amount of data is generated every day. As of 2022, WhatsApp alone recorded over 2.44 billion active users who exchanged over 100 billion messages daily (equivalent to a total of 4.32 TB of data).
Traditional databases and data warehouses have become less efficient for storing, analyzing, and manipulating datasets of this size, so many organizations have turned to data lakes. The question then becomes how to manage the data in your data lake.
In this post, we’ll discuss Apache Iceberg, its features, and how it can help in big data analysis.
Apache Iceberg is an open-source table format designed to help big data analysts efficiently manage and query data in a data lake. By creating metadata around your files, Iceberg allows tools to treat the files as tables, leveraging features similar to those found in traditional databases, such as ACID transactions and time travel.
Apache Iceberg was developed to address some limitations of traditional catalogs and table formats like Apache Hive, particularly in query performance and storage costs.
Iceberg introduces several valuable features that empower users to manage data more effectively. One notable feature is its support for expressive SQL. Iceberg allows for SQL-like commands to merge new data, update rows, and perform delete operations.
Additionally, Iceberg improves on Hive by tracking tables at the file level rather than the folder level. This allows atomic changes to a single record without rewriting an entire directory, an operation that can cause performance issues.
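To make file-level tracking more concrete, here is a minimal Python sketch. It is a toy model, not Iceberg's actual API or on-disk format: the names `Snapshot`, `Table`, `commit`, and `as_of`, as well as the file names, are all hypothetical. The idea it illustrates is that each table version is simply a list of files, so changing one record means swapping one file in that list, and older lists remain available for time travel.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable set of the data files that make up one table version."""
    snapshot_id: int
    data_files: frozenset


@dataclass
class Table:
    """Toy table (hypothetical): a history of snapshots, newest last."""
    snapshots: list = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, added=(), removed=()) -> Snapshot:
        # Build the next file list from the current one, then publish it
        # in a single step by appending the new snapshot.
        base = self.current().data_files if self.snapshots else frozenset()
        files = (base - frozenset(removed)) | frozenset(added)
        snap = Snapshot(len(self.snapshots) + 1, files)
        self.snapshots.append(snap)
        return snap

    def as_of(self, snapshot_id: int) -> Snapshot:
        # "Time travel": look up the file list of an earlier snapshot.
        return next(s for s in self.snapshots if s.snapshot_id == snapshot_id)


table = Table()
table.commit(added=["part-a.parquet", "part-b.parquet"])
# Updating one record replaces only the file that holds it;
# part-a.parquet is left untouched.
table.commit(added=["part-b-v2.parquet"], removed=["part-b.parquet"])

print(sorted(table.current().data_files))
print(sorted(table.as_of(1).data_files))
```

Readers always see either the old file list or the new one, never a half-updated directory, which is the property the paragraph above describes.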
Apache Iceberg has several types of features.
Apache Iceberg has several performance features that make it efficient to work with. For every data file that belongs to a table, Iceberg stores metadata about the transactions that affect it, along with extra statistics for each file. This enables “scan planning,” the process of finding only the files that match a query.
Iceberg uses two levels of metadata to track the files in a snapshot: manifest files and the manifest list. Manifest files, the first level, store a list of data files along with each file’s partition data and column-level stats. The manifest list, the second level, stores the snapshot’s list of manifests along with a range of values for each partition.
To achieve fast scan planning, Iceberg first uses the partition min and max values in the manifest list to filter the manifests. It then reads each remaining manifest to get the data files. Because the manifest list narrows down which manifest files need to be read, planning doesn’t require reading every manifest file.
As a result, queries on your files are more efficient and cost-effective.
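The two-level filtering described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Iceberg's real planner or metadata layout: `DataFile`, `Manifest`, `plan_scan`, the `id` column, and the sample dates and file names are all hypothetical. Partition values are ISO date strings, which compare correctly as plain strings.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str
    min_id: int  # column-level stats: smallest "id" value in the file
    max_id: int  # column-level stats: largest "id" value in the file


@dataclass
class Manifest:
    name: str
    part_min: str  # lowest partition value covered by this manifest
    part_max: str  # highest partition value covered by this manifest
    files: list


def plan_scan(manifest_list, day, wanted_id):
    """Return (candidate file paths, names of manifests that were opened)."""
    opened, matches = [], []
    for m in manifest_list:
        # Level 1: skip whole manifests using the partition range alone.
        if not (m.part_min <= day <= m.part_max):
            continue
        opened.append(m.name)
        # Level 2: inside an opened manifest, prune files by column stats.
        for f in m.files:
            if f.min_id <= wanted_id <= f.max_id:
                matches.append(f.path)
    return matches, opened


manifest_list = [
    Manifest("m1", "2024-01-01", "2024-01-31",
             [DataFile("jan-0.parquet", 1, 500),
              DataFile("jan-1.parquet", 501, 900)]),
    Manifest("m2", "2024-02-01", "2024-02-29",
             [DataFile("feb-0.parquet", 1, 400),
              DataFile("feb-1.parquet", 401, 950)]),
]

files, opened = plan_scan(manifest_list, day="2024-02-10", wanted_id=420)
print(files, opened)
```

In this example only the February manifest is opened, and only one of its two files survives the stats check, so three of the four data files are never touched.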
Let’s have a look at some of the advantages of using a table format like Apache Iceberg to manage data querying and manipulation in data lakes.
For new users who aren’t familiar with table formats or with Apache Iceberg for big data management, Apache Iceberg can be relatively complex, because using it requires some expertise and experience. One way of overcoming this hurdle is to work with a managed version of Apache Iceberg, like Tabular.
Even though Apache Iceberg has a strong community, it’s still a “new kid on the block.” Therefore, there are few third-party resources available.
But in no way does this affect its performance.
Here are a few instances where Apache Iceberg might not be a good fit for your data manipulation and management needs.
Essentially, you need to understand your organization’s needs and weigh the pros and cons of Apache Iceberg to decide whether it suits the way your organization manages data at scale.
In this post, we learned what Apache Iceberg is and some of its incredible features. We also went through some advantages and disadvantages of using Apache Iceberg, as well as when it might not be the best option to use.
To learn more about Apache Iceberg, check out this blog.
This post was written by Suleiman Abubakar Sadeeq. Suleiman Abubakar Sadeeq is an ambitious React developer learning and helping to build enterprise apps. In his free time, he plays football, watches soccer, and enjoys playing video games.