I was an Oracle DBA for 20 years. As a DBA, we were taught that the most compact and straightforward way to store and analyze our data is to normalize it, to save disk space and reduce mistakes. Normalization allows us to keep our analysis flexible. It’s the basis for well-known techniques like star schemas, upon which many applications and reports are built. This allowed us to use the fantastically useful SQL JOIN operation.
However, when data started growing and storage got cheaper, DBAs like me were no longer concerned with the data size from normalization. Because data was now mostly distributed, we became terrified of the dreaded, computationally expensive JOIN operation, and the shuffle it often caused. We were taught that “it’s best if you don’t do JOINs” on big data. We started de-normalizing everything we could. We needed to make sure the data distribution is done extremely carefully, and we had to pre-arrange it to make sure that our systems could handle it.
Arranging data for quick analysis meant we were forced into solutions like materialized views and projections. We started duplicating the data in unified, de-normalized tables, and then we duplicated them again for different distribution keys. We even de-normalized our indexes by using prebuilt joined key indexes. De-normalizing everything would make our database grow by almost 10x. 1 TB of data would become 9 TB after we were done with the process. And all this, just so that we could analyze it quickly enough.
The SQL JOIN has become synonymous with poor technique. The word signifies restrictions – an operation that requires you to engineer your data in advance for the database to support it. Just like that, the flexibility of normalization is gone.
Why are JOINs so scary?
The main reason big data systems either don’t support JOINs or discourage you from performing them is that they were not designed to work well in distributed systems. In fact, many big data platforms are distributed, and the JOIN operation doesn’t scale well. It’s worst-case compute complexity of O(n2) makes the JOIN one of the most compute-intense operations in Relational Algebra (the underlying implementation behind a clear majority of relational databases). But the fact that JOINs are computationally expensive doesn’t mean we should shy away from them.
I’ve heard many engineers and DBAs say, “we don’t use JOINs here” – simply because the current data platform isn’t designed to handle them. There IS another way.
Embrace the JOIN
JOINs have incredibly powerful potential. They allow us to combine data sets, correlate data, and reduce the complexity of the database. They facilitate a higher degree of flexibility and referential integrity without having to redesign the database every time we want to make a change.
And so, in my current role as Product Manager at SQream we’ve gone the other route and encourage our customers to perform as many JOINs as they’d like.
A natural, flexible way to perform JOINs
From the moment we set out to create SQream DB, it was clear to us that a relational database can only be successful if it implements the full set of JOIN operations that SQL users are accustomed to.
We went further and developed one of the first on-GPU JOIN implementations for a variety of algorithms – loop join, merge-sort join, and more. By making use of advanced metadata techniques as well as the immense raw-parallelism of the GPU, SQream DB can not only join, but also aggregate, filter, transform, run window functions, sort, and perform a variety of analytical functions on the data as it streams into the GPU’s architecture. Because all of this is done automatically and on-the-fly, it doesn’t really require any DBA intervention. This makes it not just fast, but also saves human time and resources.
While at SQream we consider JOINs to be a core feature, it shouldn’t be taken for granted. We aren’t content with joining just two tables as some other databases do, with a bunch of limitations. SQream DB doesn’t limit the size or number of tables in the JOIN operation, nor does it limit the size of ORDER BYs or nested queries. And it has been this way since day one.
Fast JOINs on any number of tables and keys
One of SQream DB’s most useful capabilities is the ability to join any number of tables on any number of keys, without having to pre-index or even match the data-types. The significance of this capability is that your tables can be unlimited in size, with any level of (de-)normalization, and still maintain flexibility. Most impressively – SQream can join eight tables with hundreds of billions of rows in just a few seconds, faster than any other system tested.
Results of comparison between SQream DB and 3 leading MPP relational databases, on an 8-way star-schema inner join.
With SQream DB, you can get the most out of your big data while maintaining total flexibility. Enjoy the advantages of normalization, keep your data compact and relevant, and at the same time – empower data engineers, analysts and BI with multi-tables JOINs and maximum flexibility & performance.
We are not afraid of JOINs, we embrace them!