By Jake Wheat
The idea of running SQL on GPUs has been around for more than a decade. But there are many questions about the viability of this. Is it possible? Is it too difficult? Can it be more than a technology demo? What about the PCI bus transfer bottleneck, or the relatively limited amount of memory on graphics cards? In this first blog post, I will share some ideas about running SQL on GPUs, to give a little more of an idea of how it can be done. I will focus on queries which are typical for ‘analytic columnar databases’, such as those in the well known and sometimes-liked TPC-H benchmark (you can see the TPC-H queries here). Our second blog post will focus on optimising this engine further.
These sorts of queries are interesting to run on GPUs for a few reasons. Here are two: analytic columnar workloads are naturally data-parallel, processing whole columns of values at a time, and GPUs offer much higher memory bandwidth and arithmetic throughput than CPUs.
I will start with a few ideas about a toy engine which works on data which fits into GPU memory, and extend it with some ideas on how to operate on much bigger data sizes. In a follow-up blog post, I will finish with a few ideas about how to optimise a system like this to run a bit faster.
Here is a table of some simple SQL queries:

| SQL query | Primitive operator | Thrust function |
| --- | --- | --- |
| select a + b from t | map | thrust::transform |
| select a from t where a > 5 | filter | thrust::copy_if |
| select sum(a) from t | reduce | thrust::reduce |
| select b, sum(a) from t group by b | group by | thrust::reduce_by_key |
| select a from t order by a | sort | thrust::sort |
| select * from t, u where t.a = u.b | sort merge join | (a few Thrust functions combined) |
In the middle column, you can see the so-called ‘primitive operators’ that implement the main operation for each query. In the right column, you can see the corresponding Thrust function which could be used to implement that physical operator on the GPU. Thrust is a high level CUDA library from NVIDIA which is very easy to use for C++ programmers who are familiar with the STL. The only item in this list which isn’t trivial is the sort merge join. However, this can still be implemented quite easily with a few Thrust functions put together in a simple way. This is a good illustration of several things, including how directly the core operations of an analytic SQL engine map onto a small set of standard data-parallel primitives.
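To make this concrete, here is a minimal sketch of the map, filter and reduce rows of the table above, using Thrust. The column names and values are invented for illustration:

```cpp
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>

// predicate for the filter primitive: where a > 5
struct greater_than_five {
    __host__ __device__ bool operator()(int x) const { return x > 5; }
};

int main() {
    // two toy integer columns, copied to the GPU
    int ha[] = {3, 7, 1, 9, 4, 6};
    int hb[] = {1, 1, 2, 2, 3, 3};
    thrust::device_vector<int> a(ha, ha + 6);
    thrust::device_vector<int> b(hb, hb + 6);

    // map: select a + b from t
    thrust::device_vector<int> c(a.size());
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<int>());

    // filter: select a from t where a > 5
    thrust::device_vector<int> filtered(a.size());
    auto filtered_end = thrust::copy_if(a.begin(), a.end(), filtered.begin(),
                                        greater_than_five());

    // reduce: select sum(a) from t where a > 5
    int total = thrust::reduce(filtered.begin(), filtered_end, 0);
    std::cout << "sum = " << total << std::endl; // 7 + 9 + 6 = 22
    return 0;
}
```

Each call runs as one or more data-parallel kernels over the whole column, so there is no per-row loop to write at all.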
There are some (minor) downsides to Thrust, but it certainly gives you the ability to put together prototype code very quickly, without spending a lot of time either learning GPGPU programming or writing a lot of complex low level code. The list of operators above is a good set for a demo engine, but it is a bit limited in SQL features for a good production engine.
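To show what “a few Thrust functions put together in a simple way” can look like, here is a sketch of the counting phase of a sort merge join on single integer keys. The key values are invented; a full join would go on to lay out the matching row pairs using a prefix sum (thrust::exclusive_scan) over these counts:

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>

int main() {
    // toy join keys for the left and right tables
    int hl[] = {4, 1, 2, 2, 7};
    int hr[] = {2, 2, 3, 4, 4, 9};
    thrust::device_vector<int> l(hl, hl + 5), r(hr, hr + 6);

    // sort both sides on the join key
    thrust::sort(l.begin(), l.end());
    thrust::sort(r.begin(), r.end());

    // for every left key, binary search the start and end of its
    // matching range in the right keys (vectorised searches)
    thrust::device_vector<int> lo(l.size()), hi(l.size());
    thrust::lower_bound(r.begin(), r.end(), l.begin(), l.end(), lo.begin());
    thrust::upper_bound(r.begin(), r.end(), l.begin(), l.end(), hi.begin());

    // matches per left row, then the total size of the join output
    thrust::device_vector<int> counts(l.size());
    thrust::transform(hi.begin(), hi.end(), lo.begin(), counts.begin(),
                      thrust::minus<int>());
    int total = thrust::reduce(counts.begin(), counts.end());

    // two left 2s each match two right 2s, and the left 4 matches
    // two right 4s, so the join produces 6 rows
    std::cout << "join output rows: " << total << std::endl;
    return 0;
}
```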
There is an additional set of operators, beyond these basics, which would give you an engine able to run a lot of useful real world queries. Most of these operators are of similar difficulty to implement as the sort merge join in the previous table.
If we implement all of this, then we have a pretty decent SQL engine, but with one unfortunate downside: we are limited to operating on data which can fit in GPU memory. What happens when we have 12GB of memory on our graphics card, but our table is 10TB? The first thing we can do is break the data up into chunks: read one chunk from disk at a time, process it on the GPU, then send the results to the client.
Chunk the data when it’s bigger than GPU memory
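Here is a minimal sketch of that loop. A std::vector stands in for the table on disk, a simple filter stands in for the query, and the chunk size and predicate are invented for illustration:

```cpp
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <algorithm>
#include <vector>
#include <iostream>

// the query: select x from t where x > 100
struct greater_than_100 {
    __host__ __device__ bool operator()(int x) const { return x > 100; }
};

int main() {
    // stand-in for a table on disk, far bigger than one chunk
    std::vector<int> table(1000000);
    for (std::size_t i = 0; i < table.size(); ++i) table[i] = i % 200;

    const std::size_t chunk_rows = 64 * 1024; // sized to fit in GPU memory
    thrust::device_vector<int> chunk(chunk_rows);
    std::vector<int> result;

    for (std::size_t off = 0; off < table.size(); off += chunk_rows) {
        std::size_t n = std::min(chunk_rows, table.size() - off);

        // 'read a chunk from disk' and copy it to the GPU over PCI
        thrust::copy(table.begin() + off, table.begin() + off + n,
                     chunk.begin());

        // process it on the GPU
        thrust::device_vector<int> out(n);
        auto out_end = thrust::copy_if(chunk.begin(), chunk.begin() + n,
                                       out.begin(), greater_than_100());

        // 'send the results to the client'
        std::size_t m = out_end - out.begin();
        std::size_t old = result.size();
        result.resize(old + m);
        thrust::copy(out.begin(), out_end, result.begin() + old);
    }
    std::cout << result.size() << " rows passed the filter\n"; // 495000
    return 0;
}
```

The GPU only ever holds one chunk at a time, so the same loop works no matter how big the table is.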
For some operators, such as sort, we will need to do some additional work. In the case of sort, one way to implement it is to take an existing external sort algorithm which does multiple passes over the data and buffers intermediate results to disk. A naive way to use GPU acceleration with such an algorithm is to sort each chunk on the GPU and leave the disk-based merge passes as they are, as in the sketch below. A number of other operators need similar ‘external’ versions, for instance group by and join.
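Here is a sketch of that naive GPU-accelerated external sort, with std::vectors standing in for the input on disk and for the buffered intermediate runs. Pass 1 sorts each chunk on the GPU; pass 2 merges the runs on the CPU (a real engine would do a k-way merge streaming from disk rather than the repeated two-way merge used here):

```cpp
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <vector>
#include <cstdlib>
#include <iostream>

int main() {
    // stand-in for an unsorted table on disk
    std::vector<int> table(1000000);
    for (auto& x : table) x = std::rand();

    const std::size_t chunk_rows = 64 * 1024;
    std::vector<std::vector<int>> runs; // stand-in for runs buffered to disk

    // pass 1: cut the input into chunks and sort each one on the GPU
    for (std::size_t off = 0; off < table.size(); off += chunk_rows) {
        std::size_t n = std::min(chunk_rows, table.size() - off);
        thrust::device_vector<int> chunk(table.begin() + off,
                                         table.begin() + off + n);
        thrust::sort(chunk.begin(), chunk.end());
        std::vector<int> run(n);
        thrust::copy(chunk.begin(), chunk.end(), run.begin());
        runs.push_back(std::move(run));
    }

    // pass 2: merge the sorted runs into the final output
    std::vector<int> sorted;
    for (const auto& run : runs) {
        std::vector<int> merged(sorted.size() + run.size());
        std::merge(sorted.begin(), sorted.end(),
                   run.begin(), run.end(), merged.begin());
        sorted.swap(merged);
    }
    std::cout << std::is_sorted(sorted.begin(), sorted.end()) << "\n"; // 1
    return 0;
}
```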
We now have a basic GPU SQL engine which can handle data much bigger than GPU memory. In part 2 of this blog post, we’ll talk about optimising this engine to run faster.