“Relational databases don’t scale, SQL doesn’t scale” exclaim some blog-posts and marketing material. “SQL is dead, and RDBMSs aren’t working – here’s why!” they tell you, in conveniently arranged “top 5 reasons why”-style posts. They’ll tell you how there are zettabytes of data, and that Facebook has 900,000 logins every minute, and there are 108 million Snapchats sent every hour.
Great stats. Really, they’re fun. They catch your attention, but they’re irrelevant. They’re not talking about actual problems for small, medium or even large companies. They’re talking about how difficult scaling Google might be. But you’re not Google (probably). Your needs are vastly different, because you can’t spend $5B on computer hardware every quarter.
What they’re really trying to tell you, is your old single-node MySQL (with MyISAM no less!) doesn’t scale well. What they’re really doing is creating a false equivalency between the implementation of one specific RDBMS, and the proven mathematical theory of relational databases. And that’s just plain bad science.
Scalability is a confusing buzzword
Buzzwords are all around, but none are more confusing than Scalability. In fact, it can actually mean a variety of things:
- Scalability might mean how the database grows to use available resources. So, if you got a 2x more expensive computer, with 2x as many CPU cores, you’d expect the database to run about twice as fast.
- Scalability might mean how the database can be ‘clustered’ – by adding more computers of the same type. So, with 2x as many computers, you’d expect the database to (once again) run faster. Maybe not twice as fast, but faster.
- Scalability might mean how the database can grow to analyze more data by increasing storage size.
However, I want to talk about a different type of scalability. Specifically, performance and data size scalability. I’ll explain:
A database that scales well with performance, will be able run 10 simultaneous queries with the same velocity as 1 stand-alone query.
A database that scales well with data size, will run a query on a 20GB data set with the same velocity as on a 10GB data set.
With good scaling, it doesn’t matter how much data is in the system, performance is consistent.
Scalability is about making sure that the performance is consistent, no matter how much data or how many people are using the system. Now, it’d be really really good if our systems always performed well not just at the same velocity, but also at the same exact speed. This means a query on 1GB will takes just as long as on 100GB of data. However, this is typically impossible, except in some very trivial cases (things like “SELECT 1” or “SELECT TOP 1 * FROM table“).
It’s not just about fast GPUs
If you read some of the marketing around GPU databases, you might think that it’s just about writing GPU code. I think this is doing a disservice to the very smart people writing GPU databases. I’m sure they’ll agree that their hard work is much more than just handing the work over to the GPU. In SQream DB, It’s not just the GPU. It’s actually mostly about the software architecture. We built our data management components specifically to utilize the processing power available, not just from the GPU but from all the resources available in the machine.
If data isn’t arranged properly, it doesn’t matter how fast the parts are because they’ll be bottle-necked by the weakest link, a lot before the PCIe I/O bottleneck. And indeed, bringing large volumes of data under management and making them exploitable by many users are complex challenges. At SQream, we’ve crafted our own storage and data organization methods for that reason exactly.
We realized early on, that the GPU can’t be ‘properly fed’ if the data is just handled naively. Recently, with new projects like Apache Arrow seeing adoption in some GPU databases, it seems like others have taken notice. What Apache Arrow calls “Zero copy reads” has been a part of SQream DB from day one.
The non-indexed approach helps scaling
At SQream, we have a customer with 400-500 billion row fact tables, joined by a big bunch of auxiliary dimension tables, ranging from 200m to 800m rows. These tables store about 140 TB of data. Most queries return on these tables within 10-20 seconds. For a traditional RDBMS, it would take longer than that just to get the indexes in order, and this is why:
In SQream DB, we specifically avoid indexing, and replace it with a transparent metadata tagging system, that uses the GPU to tag and compress incoming data, resulting in fast compressed write speeds, but also fast querying. By tagging the data, we can replace the typical “where-to-look” indexing scheme with “where-not-to-look”. In other words, SQream DB knows where the data is not.
Contrast this to a standard indexed database which uses the opposite approach, and hunts for the key in the index. If SQream DB had used indexed structures, they wouldn’t scale well for large datasets. We can double, triple and quadruple the amount of data in the system, and the queries would still return results in a consistent time-frame, because SQream DB still knows where the data is not.
An added benefit is that without indexes on the columns, all columns can be used for searching, joining and aggregating. Without indexes in the way for loading, we can ingest data into SQream DB, with no penalty from the use of indexes.
A graph for good measure
I explained why a database might scale well so far, and even why arranging the data is more important than using fast hardware. But now let’s look at what this means in actual user scenarios.
In the graph below (provided to us by one of our customers), 5 different real SQL queries were run on 1B, 12B and 300B row FACT tables. It’s easy to see that query time is linear across the wide range of data sizes.
In different customer scenario, the customer wanted to calculate ACV across all retail stores nation-wide. In this ACV calculation on 24 TB of data, 300B rows in a main fact table, 8 auxiliary dimension tables – with complex, nested joins and aggregations.
Once again, we can see SQream DB performs linearly with the number of users concurrently querying.
So, you see – linear scalability is not a myth. We have a linearly scaling GPU database, and we’re extremely proud of it.