Implementing SQL on GPUs 2/2

By Jake Wheat

12.17.2017 twitter linkedin facebook

This is the second in a two part blog post about implementing SQL queries on GPUs. In part 1 of this post, we talked about implementing a simple SQL engine using GPUs and running on big data in a very simple and direct way. An engine like this won’t be as fast as we want, and there are many ways to optimise it for better performance.
Let’s see what options we have to optimise:

Optimising, non-GPU specific

Here are some non-GPU specific ideas which you should follow more or less the same for a GPU database as a regular CPU database:

  • • Use good benchmarking and profiling tools
  • • Take advantage of columnar – don’t read columns which aren’t needed
  • • Compressed data – helps with both disk I/O and PCI bus I/O
  • • Make sure the engine uses good I/O patterns: big reads and writes, work with the Linux FS cache effectively
  • • Use large host memory buffers
  • • Make the best of AST optimisations (e.g. push predicates closer to the tablescans)
  • • Implement a cost based optimizer

There are many other ideas which translate directly from advanced columnar SQL engines to GPU implementations.
The rest of this post will outline a few GPU specific ideas.

The GPU task queue

A simple idea for working with a GPU is to use a task queue. A GPU worker thread will take a tasks from the queue, load the data onto the GPU, run some kernels, then download the results to host memory.
This isn’t exactly an optimisation. In fact, it’s possible you would have already implemented exactly this in an engine like the one described earlier.
Here is a diagram of the idea

Short GPU task concept for concurrency

Short GPU task concept for concurrency

There are a lot of benefits to starting with this approach, such as:

  • • We can build on it to overlap multiple GPU operations in a single query. This can be used to hide PCI transfer time.
  • • It can be used to multiplex a GPU between multiple concurrent queries, with a very simple way to avoid one query starving out another query. This can help us implement concurrent queries without increasing the peak GPU memory use.
  • • This system can also be used to distribute tasks from a query to multiple GPUs in the same host machine, which is useful when we are bottlenecked on the GPU execution.

Combining tasks

Here is a very simple idea: combining multiple GPU operations without round-tripping the intermediate results to host memory.
Consider the following SQL query –

select a, b+c from t where b > 5 order by a

The logical operators to implement this query are shown in the following diagram:

The logical operators to implement this query

The logical operators to implement this query

In a CPU database, this is how it’d be run. But because we’re running these SQL queries on GPUs, we will make the data transfers to and from GPU explicit, to show how a naive direct implementation might work:

Trivial tasks for a simple SQL query with explicit GPU copy operations

Trivial tasks for a simple SQL query with explicit GPU copy operations

If we combine the GPU operations, we can get this approach, which should run much faster:

Combining tasks on the GPU where possible to reduce device-to-host and host-to-device copying

Combining tasks on the GPU where possible to reduce device-to-host and host-to-device copying

By reducing the amount of copies, we significantly reduce overhead, by avoiding round-tripping the intermediate results to host memory.

Use larger chunks

Another simple idea is to use larger chunks of data. In this example, by sorting the data up to 10,000,000 rows at once, instead of up to 10,000 rows at once, the total query time should be reduced, sometimes significantly.

Small chunks Large chunks
  1. Transfer 10,000 rows to device (GPU)
  2. transform d := b + c
  3. remove_if b > 5
  4. sort by a
  5. Transfer results to host (CPU)
  1. Transfer 100,000 rows to device (GPU)
  2. transform d := b + c
  3. remove_if b > 5
  4. sort by a
  5. Transfer results to host (CPU)

For best performance, you usually want to make the chunks as big as you can, as long as they still fit in the GPU memory. With GPU memory being quite small (compared to host memory), this is an important factor.

Rechunking 1

If we want to optimise bulk IO reads, we want to do large transfers from disk at once.

Rechunking table scans, optimizing disk I/O sizes

Rechunking table scans, optimizing disk I/O sizes

In the first line, we read chunks of data from disk which will fit on the GPU. But this size of disk reads won’t perform that well. To get the disk transfer rate closer to the hardware maximum, we want to read much bigger sizes.
In the second line, we read a large amount of data from the disk, but then this is too big to load to the GPU.
In the third line, we load the big chunks of data from disk, then each time we split this big chunk into small chunks to be processed one at a time on the GPU.

Rechunking 2

Some of the physical operators on the GPU can input a large amount of rows, then output a small amount of rows, depending on the specific query and the data distribution. We can boost the performance of subsequent operators by collecting these small outputs into bigger chunks.


Rechunking after remove_if

The example in the diagram above is a query with a where and an order by. The where is implemented using the remove if operator, and the order by partially by the sort operator. If the where has a high selectivity, as we see in the first line, then the sort will perform well.
In the second line, we see low selectivity. This means that the where predicate only matches e.g. 20% of the rows. By running the sort operator on chunks which are only 20% full, we will perform badly relative to full chunks .
In the third line, we add the optimisation. After we get the small chunks from the remove if operator, we combine them back into big chunks, then run the sort operator on these big chunks and get a nice performance improvement.

Rechunking 3

Here is another example of the impact of chunk sizes. In this case, we are running a non indexed nested loop join. We need to load every chunk from the A input with every chunk of the B input. Here is the pseudo code for the nested loop:

for each chunk in table A:
  for each chunk in table B:
    load A,B to the GPU, join them

Some example numbers:
Table A size: 10,000,000,000 rows
Table B size: 10,000,000 rows
At a chunk size of 1 million rows you will upload 110 billion rows to the device. If you increase it to 20 million rows, you will upload only 15 billion rows to the device. It could run seven times as fast with this change.

Rechunking to reduce PCI transfers in NINLJ

Reducing PCI transfer in NINLJ

Reducing PCI transfers

We want to try to avoid transferring data from host to device wherever possible.
For instance, in the previous join example, if we can fit a chunk from table A on GPU at the same time as the whole table B, we can leave all the table B data on the GPU for the whole join.
Another example: a common situation is that data is inserted in order of a timestamp field. It’s trivial to keep this data ordered by this field in the storage. If the table contains data for 24 months, but most queries have a where predicate which selects an individual day, we can completely skip loading most of the data very easily.

Optimising kernels

Sometimes, you find your bottleneck on the GPU itself (yes, it really happens sometimes even on big data queries). Here are three quick examples which came up at SQream:


We found our implementation of reduce_by_key (we replaced the Thrust one early on for various reasons) to perform particularly badly when most of the keys where distinct (i.e. the number of output rows is not much less than the number of input rows). We optimised this by implementing specialised code paths for the case where the keys are mostly distinct, and the case where they are not mostly distinct.


Originally, to implement multikey sort we used an idea called ‘lexicographical sort’ which was found on the Thrust mailing list. The naive approach performs badly when the total key size is quite big, and there are many loops through the first stage of the algorithm. One possible way to improve this is to use a version of segmented sort.


We found a big bottleneck in some joins in our implemention which uses lower_bounds and upper_bounds from Thrust. We are looking at a few different ways to improve this performance since trying to optimise these functions themselves to run faster than the Thrust implementations has been difficult.


One final issue we have: strings. There are lots of performance and functionality issues with strings on the GPU. Matching the performance of CPU based compression in general is still an unsolved problem on GPUs. There are also issues with operating on very large string columns on big data on the GPU, because of GPU memory constraints.
There are other issues with strings on the GPU even when you are not running on big data, such as finding ways to deal with variable length strings on the GPU, using collations (which means you can sort and compare the data properly), and supporting unicode.
Good solutions: we’re still looking …


No big conclusion, but I hope this has filled in a few blanks on how you go about implementing ‘big data SQL’ on GPUs if you haven’t thought about it much before, and given you some interesting things to think about.

Some further reading

We’re proud of our Big Data GPU Engine

We’re proud of our GPU database engine. We’ve spent a lot of time and work making it highly optimised for a variety of use-cases, from telecoms to ad-tech, retail and finance.
Read more about SQream DB, or book a GPU Database demo to see for yourself what SQream’s GPU database can do for your data analytics.