SQL on GPU: Implementing SQL on GPUs

12.11.2017

The idea of running SQL on GPUs has been around for more than a decade. But there are many questions about the viability of this. Is it possible? Is it too difficult? Can it be more than a technology demo? What about the PCI bus transfer bottleneck, or the relatively limited about of memory on graphics cards?
In this first blog post, I will share some ideas about running SQL on GPUs, to give a little more of an idea of how it can be done. I will focus on queries which are typical for ‘analytic columnar databases’, such as the well known and sometimes-liked TPC-H benchmark (you can see the TPC-H queries here).
Our second blog post will focus on optimising this engine further.

SQL on GPUs

These sorts of queries are interesting to run on GPUs for a few reasons. Here are a two:

It’s true that some of these kinds of queries are IO bound, but many are CPU bound, even when running on clusters with a lot of expensive CPUs.
The implementation of many of these queries matches the kind of code which runs well on GPUs.

I will start with a few ideas about a toy engine which works on data which fits into the GPU memory, and extend it with some ideas on how to be able to operate on much bigger data sizes. In a follow up blog post, I will finish with a few ideas about how to optimize a system like this run a bit faster.

Getting started

Here is a table of some simple SQL queries:

Example SQL	Physical Operator	Implementation
`select a+b, c * 5 from t`	select (a.k.a project/extend/rename)	`thrust::transform`
`select a, count(*), sum(b), avg(b) from t group by a`	stream aggregate	`thrust::reduce_by_key`
`select a, b from t where a > 0.5`	filter	`thrust::remove_if`
`select distinct a from t`	stream distinct	`thrust::unique`
`select a, b, c, d from t order by a,b`	sort	`thrust::sort`
`select * from t union all select * from u`	union all
`select * from t inner join u using (a)`	sort merge join (smj)	simple implementation: `thrust::upper_bounds`, `lower_bounds`, expand indexes, `gather`

In the middle column, you can see the so-called ‘primitive operators’ that implement the main operation for the query.
In the right column, you can see the corresponding Thrust function which could be used to implement that physical operator on the GPU. Thrust is a high level CUDA library from NVIDIA which is very easily to use for C++ programmers who are familiar with the STL.
The only item in this list which isn’t trivial is the sort merge join. However, this can still be implemented quite easily with a few Thrust functions put together in a simple way.
This is a good illustration of several things, including:

• The close correspondence of some SQL to very straightforward GPU code
• How easy it is to write prototype code using the Thrust library

There are some (minor) downsides to Thrust, but it certainly gives you the ability to put together prototype code very quickly without spending a lot of time either learning GPGPU programming or writing a lot of complex low level code.
The list of operators above is a good set for a demo engine, but it is a bit limited in SQL features to be a good production engine.

More operators

Here is an additional list of operators which would give you an engine which could run a lot of useful real world queries:

• non indexed nested loop join (NINLJ)
• not matching
• outer join
• stream union distinct
• distinct aggregates
• grouping sets
• window functions

Most of these operators are of similar difficulty to implement as the sort merge join in the previous table.

Bigger Data

If we implement all this then we have a pretty decent SQL engine, but with one unfortunate downside: we are limited to operating on data which can fit in GPU memory. What happens when we have 12GB of memory on our graphics card, but our table is 10TB?
The first thing we can do is break the data up into chunks. Then we read one chunk from disk at a time, process it on the GPU, then send it to the client.

Chunk the data when it’s bigger than GPU memory

External algorithms

For some operators, such as sort, we will need to do some additional work. In the case of sort, one way to implement it is to take an existing external sort algorithm which does multiple passes over the data and buffers intermediate results to disk.
Here is a simple diagram of a naive way to use GPU acceleration with an existing external sort algorithm:
GPU External Sort
There are a number of other operators which need similar ‘external’ versions. For instance: