By SQream
We’ve seen the rise of social media applications like WhatsApp, Instagram, and TikTok over the last two decades.
This has resulted in the generation of an unthinkable amount of data daily. As of 2022, WhatsApp alone recorded over 2.44 billion active users who exchanged over 100 billion messages daily.
Traditional databases and data warehouses have become less efficient at storing, analyzing, and manipulating datasets of this size, so many organizations have had to adopt data lakes. The question then becomes how you’ll manage the data in your data lake.
In this post, we’ll discuss Apache Iceberg, its features, and how it can help in big data analysis.
Apache Iceberg is an open-source table format that enables big data analysts to efficiently manage and query data in a data lake. Iceberg creates metadata around your files so that tools can treat them as tables and use the capabilities of an SQL table in a traditional database (ACID transactions, time travel, and more).
Apache Iceberg was developed to address some of the limitations of traditional catalogs and table formats like Apache Hive, such as slow query performance and high storage costs.
It introduced a lot of interesting features that help users more effectively manage data.
One of these features is support for expressive SQL. Iceberg supports SQL commands for operations like merging new data, updating rows, and deleting rows.
Iceberg is also an improvement over Hive, which tracks data at the folder level. With folder-level tracking, changing a single record can mean rewriting the whole directory, which can lead to performance issues. Apache Iceberg instead tracks data at the file level, which makes it easy to apply an atomic change to a single record by replacing only the file that contains it.
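The idea of file-level tracking can be illustrated with a toy model. The sketch below is not the Iceberg API; `ToyTable`, `Snapshot`, and the file paths are hypothetical names made up for illustration. It assumes a table is simply a list of immutable snapshots, each recording the exact set of data files it contains, so that updating one record swaps a single file and leaves every other file untouched.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: the exact set of data files it contains."""
    snapshot_id: int
    data_files: frozenset


@dataclass
class ToyTable:
    """Hypothetical stand-in for a file-level table format (not the Iceberg API)."""
    snapshots: list = field(default_factory=list)

    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit(self, added=(), removed=()) -> Snapshot:
        """Create a new snapshot by swapping individual files; other files are reused."""
        base = self.current().data_files if self.snapshots else frozenset()
        missing = set(removed) - base
        if missing:
            raise ValueError(f"cannot remove files not in the table: {sorted(missing)}")
        new_files = (base - frozenset(removed)) | frozenset(added)
        snap = Snapshot(snapshot_id=len(self.snapshots) + 1, data_files=new_files)
        # In this toy model, one list append stands in for the atomic commit.
        self.snapshots.append(snap)
        return snap


table = ToyTable()
table.commit(added=["day=01/a.parquet", "day=01/b.parquet", "day=02/c.parquet"])
# Update one record: rewrite only the file that holds it.
table.commit(added=["day=01/b-v2.parquet"], removed=["day=01/b.parquet"])

print(sorted(table.current().data_files))
# ['day=01/a.parquet', 'day=01/b-v2.parquet', 'day=02/c.parquet']
print(sorted(table.snapshots[0].data_files))
# ['day=01/a.parquet', 'day=01/b.parquet', 'day=02/c.parquet']
```

Because earlier snapshots are never modified, the first snapshot remains readable after the update, which is the same idea that makes time travel possible.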
Apache Iceberg has several types of features.
Apache Iceberg holds some interesting performance features that make working with it really easy. In Iceberg, every file that belongs to a table has some metadata stored for any transaction that occurs, along with some extra statistics for each file. This enables users to carry out “scan planning,” which refers to the process of finding only the files that match the query.
Iceberg uses two levels of metadata to effectively track the files in a snapshot: manifest files and a manifest list. Manifest files list the data files, along with each file’s partition data and column-level stats, while the manifest list stores the snapshot’s list of manifests with a value range for each partition.
To achieve fast scan planning, Iceberg first uses the partition min and max values in the manifest list to filter the manifests. It then reads only the manifests that remain to find the matching data files. That way, it’s possible to plan a scan without reading every manifest file, because the manifest list narrows down the number of manifest files you need to read.
As a result, queries touch fewer files, which makes them faster and more cost-effective.
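The two-level pruning described above can be sketched in a few lines. This is a simplified model under stated assumptions, not Iceberg’s implementation: the `DataFile`, `Manifest`, and `plan_scan` names are hypothetical, and a single integer `day` column stands in for the partition and column statistics.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataFile:
    path: str
    min_day: int  # per-file stats for the hypothetical "day" column
    max_day: int


@dataclass
class Manifest:
    name: str
    files: List[DataFile]

    def day_range(self) -> Tuple[int, int]:
        """Range summary that the manifest list would store for this manifest."""
        return (min(f.min_day for f in self.files), max(f.max_day for f in self.files))


def overlaps(lo: int, hi: int, q_lo: int, q_hi: int) -> bool:
    """True if the closed ranges [lo, hi] and [q_lo, q_hi] intersect."""
    return lo <= q_hi and hi >= q_lo


def plan_scan(manifests: List[Manifest], q_lo: int, q_hi: int):
    """Return (matching file paths, names of manifests that had to be opened)."""
    # Level 1: the manifest list holds one value range per manifest.
    manifest_list = [(m, *m.day_range()) for m in manifests]
    opened, matches = [], []
    for manifest, lo, hi in manifest_list:
        if not overlaps(lo, hi, q_lo, q_hi):
            continue  # pruned without opening the manifest
        opened.append(manifest.name)
        # Level 2: per-file stats inside the manifest.
        for f in manifest.files:
            if overlaps(f.min_day, f.max_day, q_lo, q_hi):
                matches.append(f.path)
    return matches, opened


manifests = [
    Manifest("m1", [DataFile("a.parquet", 1, 5), DataFile("b.parquet", 6, 10)]),
    Manifest("m2", [DataFile("c.parquet", 11, 15), DataFile("d.parquet", 16, 20)]),
    Manifest("m3", [DataFile("e.parquet", 21, 31)]),
]

files, opened = plan_scan(manifests, 12, 14)
print(files)   # ['c.parquet']
print(opened)  # ['m2']
```

For a query on days 12 to 14, only one of the three manifests is opened and only one of the five data files is selected for reading.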
Let’s have a look at some of the advantages of using a table format like Apache Iceberg to manage data querying and manipulation in data lakes.
For new users who aren’t particularly familiar with a table format or Apache Iceberg for big data management, using Apache Iceberg may be relatively complex, as making use of it requires some level of expertise and experience. One way of overcoming this hurdle is to work with a managed version of Apache Iceberg, like Tabular.
Even though Apache Iceberg has a strong community, it’s still a “new kid on the block.” Therefore, there are few third-party resources available.
But in no way does this affect its performance.
Here are a few instances where Apache Iceberg might not be a fit for your data manipulation and management needs.
Essentially, you need to understand your organization’s needs and consider comparing the pros and cons of Apache Iceberg to decide if it suits your organization’s need for managing data at scale.
In this post, we learned what Apache Iceberg is and some of its incredible features. We also went through some advantages and disadvantages of using Apache Iceberg, as well as when it might not be the best option to use.
To learn more about Apache Iceberg, check out this blog.
This post was written by Suleiman Abubakar Sadeeq. Suleiman Abubakar Sadeeq is an ambitious React developer learning and helping to build enterprise apps. In his free time, he plays football, watches soccer, and enjoys playing video games.