By Jake Wheat
This is the second in a two-part blog post about implementing SQL queries on GPUs. In part 1, we talked about implementing a simple SQL engine that runs on GPUs over big data in a very direct way. An engine like this won't be as fast as we want, and there are many ways to optimise it for better performance. Let's see what options we have.
Some ideas are not GPU specific, and apply more or less unchanged whether you are building a GPU database or a regular CPU database.
There are many other ideas which translate directly from advanced columnar SQL engines to GPU implementations. The rest of this post will outline a few GPU specific ideas.
A simple idea for working with a GPU is to use a task queue. A GPU worker thread will take a task from the queue, load the data onto the GPU, run some kernels, then download the results to host memory. This isn't exactly an optimisation. In fact, it's possible you would have already implemented exactly this in an engine like the one described earlier. Here is a diagram of the idea:
Short GPU task concept for concurrency
There are a lot of benefits to starting with this approach, one being that short tasks give you a simple way to run queries concurrently.
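To make the shape of this concrete, here is a minimal sketch of such a worker loop in C++. The Task struct, the queue wiring, and the commented GPU steps are illustrative assumptions, not a description of SQream's actual engine:

```cpp
// Minimal sketch of a GPU worker thread pulling tasks from a shared queue.
// Task and the queue wiring are hypothetical; the GPU steps are comments.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>
#include <vector>

struct Task {
    std::vector<char> input;   // chunk of column data in host memory
    std::vector<char> output;  // result, written back to host memory
};

std::queue<Task> task_queue;
std::mutex queue_mutex;
std::condition_variable queue_cv;
bool shutting_down = false;

void gpu_worker() {
    for (;;) {
        std::unique_lock<std::mutex> lock(queue_mutex);
        queue_cv.wait(lock, [] { return shutting_down || !task_queue.empty(); });
        if (task_queue.empty()) return;  // shutting down and drained
        Task task = std::move(task_queue.front());
        task_queue.pop();
        lock.unlock();

        // 1. copy task.input to the device (host-to-device transfer)
        // 2. run the kernels for this task
        // 3. copy the result back into task.output (device-to-host transfer)
    }
}
```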
Here is a very simple idea: combining multiple GPU operations without round-tripping the intermediate results to host memory. Consider the following SQL query:
select a, b+c from t where b > 5 order by a
The logical operators to implement this query are shown in the following diagram:
The logical operators to implement this query
In a CPU database, this is how it'd be run. But because we're running these SQL queries on GPUs, we will make the data transfers to and from the GPU explicit, to show how a naive direct implementation might work:
Trivial tasks for a simple SQL query with explicit GPU copy operations
If we combine the GPU operations, we can get this approach, which should run much faster:
Combining tasks on the GPU where possible to reduce device-to-host and host-to-device copying
By avoiding round-tripping the intermediate results to host memory, we cut the number of copies and significantly reduce overhead.
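As a rough illustration of what combining the operations looks like, here is a sketch of the whole pipeline for this query using Thrust, assuming plain integer columns. Everything between the single upload and the single download runs on the device; the function name and column types are ours, not SQream's:

```cpp
// select a, b+c from t where b > 5 order by a -- entirely on the device:
// one upload, no intermediate host round trips, one download.
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

void run_query(const thrust::host_vector<int>& host_a,
               const thrust::host_vector<int>& host_b,
               const thrust::host_vector<int>& host_c,
               thrust::host_vector<int>& out_a,
               thrust::host_vector<int>& out_bc) {
    // one host-to-device transfer of the input columns
    thrust::device_vector<int> a = host_a, b = host_b, c = host_c;

    // projection: bc = b + c
    thrust::device_vector<int> bc(b.size());
    thrust::transform(b.begin(), b.end(), c.begin(), bc.begin(),
                      thrust::plus<int>());

    // filter: remove rows where b <= 5, using b as a stencil
    using namespace thrust::placeholders;
    auto first = thrust::make_zip_iterator(
        thrust::make_tuple(a.begin(), bc.begin()));
    auto last = thrust::remove_if(first, first + a.size(), b.begin(), _1 <= 5);
    size_t n = last - first;
    a.resize(n);
    bc.resize(n);

    // order by a, keeping bc aligned with a
    thrust::sort_by_key(a.begin(), a.end(), bc.begin());

    // one device-to-host transfer of the result
    out_a = a;
    out_bc = bc;
}
```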
Another simple idea is to use larger chunks of data. In this example, by sorting up to 10,000,000 rows at a time instead of up to 10,000, the total query time should be reduced, sometimes significantly.
For best performance, you usually want to make the chunks as big as you can, as long as they still fit in the GPU memory. With GPU memory being quite small (compared to host memory), this is an important factor.
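As a sketch of how that sizing might be done, the free device memory can be queried with cudaMemGetInfo; the headroom factor below is an assumed fudge to leave room for an operator's intermediate results:

```cpp
// Pick a chunk size (in rows) from the currently free device memory.
#include <cuda_runtime.h>
#include <cstddef>

size_t max_chunk_rows(size_t bytes_per_row, double headroom = 0.5) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    // leave room for intermediates: a sort, for example, may need scratch
    // space roughly proportional to its input
    return static_cast<size_t>(free_bytes * headroom) / bytes_per_row;
}
```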
To optimise bulk IO, we want to do large reads from disk in one go.
Rechunking table scans, optimizing disk I/O sizes
In the first line, we read chunks of data from disk which will fit on the GPU. But disk reads of this size won't perform that well. To get the disk transfer rate closer to the hardware maximum, we want to read in much bigger sizes. In the second line, we read a large amount of data from disk, but then it is too big to load onto the GPU. In the third line, we read big chunks from disk, then split each big chunk into small chunks which are processed one at a time on the GPU.
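In sketch form, with hypothetical read_from_disk and process_on_gpu helpers (the sizes are illustrative, not tuning advice):

```cpp
// Large reads from disk, split into device-sized sub-chunks before upload.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<char> read_from_disk(std::size_t offset, std::size_t bytes);  // hypothetical
void process_on_gpu(const char* data, std::size_t bytes);                 // hypothetical

const std::size_t DISK_CHUNK_BYTES = std::size_t(1) << 30;  // e.g. 1 GiB per read
const std::size_t GPU_CHUNK_BYTES  = std::size_t(64) << 20; // e.g. 64 MiB per upload

void scan_table(std::size_t table_bytes) {
    for (std::size_t off = 0; off < table_bytes; off += DISK_CHUNK_BYTES) {
        std::size_t n = std::min(DISK_CHUNK_BYTES, table_bytes - off);
        std::vector<char> big = read_from_disk(off, n);  // one big transfer
        // split the big chunk into pieces that fit on the device
        for (std::size_t p = 0; p < n; p += GPU_CHUNK_BYTES) {
            std::size_t m = std::min(GPU_CHUNK_BYTES, n - p);
            process_on_gpu(big.data() + p, m);  // upload + run kernels
        }
    }
}
```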
Some of the physical operators on the GPU can take a large number of rows as input and output a small number of rows, depending on the specific query and the data distribution. We can boost the performance of subsequent operators by collecting these small outputs into bigger chunks.
Rechunking after remove_if
The example in the diagram above is a query with a where and an order by. The where is implemented using the remove_if operator, and the order by partially by the sort operator. If the where has a high selectivity, as in the first line, then the sort will perform well. In the second line, we see low selectivity: the where predicate only matches, say, 20% of the rows. Running the sort operator on chunks which are only 20% full performs badly relative to running it on full chunks. In the third line, we add the optimisation: after we get the small chunks from the remove_if operator, we combine them back into big chunks, then run the sort operator on these big chunks and get a nice performance improvement.
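A sketch of that rechunking for a single integer column; the chunk size and the downstream handoff are assumptions:

```cpp
// Accumulate small post-filter chunks on the device; sort only full chunks.
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

const size_t FULL_CHUNK_ROWS = 10000000;  // rows per "big" chunk

struct Rechunker {
    thrust::device_vector<int> buffer;  // stays on the device between pushes

    // append one (possibly small) chunk that survived remove_if
    void push(const thrust::device_vector<int>& filtered) {
        size_t old_size = buffer.size();
        buffer.resize(old_size + filtered.size());
        thrust::copy(filtered.begin(), filtered.end(),
                     buffer.begin() + old_size);
        if (buffer.size() >= FULL_CHUNK_ROWS) flush();
    }

    // sort a full chunk and hand it to the next operator
    void flush() {
        thrust::sort(buffer.begin(), buffer.end());
        // ... pass the sorted chunk downstream, then:
        buffer.clear();
    }
};
```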
Here is another example of the impact of chunk sizes. In this case, we are running a non-indexed nested loop join. We need to load every chunk of the A input together with every chunk of the B input. Here is the pseudocode for the nested loop:
```
for each chunk in table A:
    for each chunk in table B:
        load A and B to the GPU, join them
```
Some example numbers: table A has 10,000,000,000 rows and table B has 10,000,000 rows. At a chunk size of 1 million rows, you will upload 110 billion rows to the device. If you increase it to 20 million rows, you will upload only 15 billion rows. The join could run seven times as fast with this change.
Reducing PCI transfer in NINLJ
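The arithmetic is easy to reproduce: each chunk of A goes up once, and all of B goes up again for every chunk of A. A small calculation (the helper is ours):

```cpp
// Rows uploaded by the chunked nested loop join above.
#include <cstdint>
#include <cstdio>

std::uint64_t rows_uploaded(std::uint64_t a_rows, std::uint64_t b_rows,
                            std::uint64_t chunk_rows) {
    std::uint64_t a_chunks = (a_rows + chunk_rows - 1) / chunk_rows;
    return a_rows + a_chunks * b_rows;  // A once + B once per A chunk
}

int main() {
    // 1 million row chunks: 110 billion rows uploaded
    std::printf("%llu\n", (unsigned long long)
                rows_uploaded(10000000000ull, 10000000ull, 1000000ull));
    // 20 million row chunks: 15 billion rows uploaded
    std::printf("%llu\n", (unsigned long long)
                rows_uploaded(10000000000ull, 10000000ull, 20000000ull));
}
```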
We want to avoid transferring data from host to device wherever possible. For instance, in the previous join example, if a chunk of table A fits on the GPU at the same time as the whole of table B, we can keep all of table B resident on the GPU for the duration of the join. Another example: a common situation is that data is inserted in order of a timestamp field, and it's trivial to keep the data ordered by this field in storage. If the table contains data for 24 months, but most queries have a where predicate which selects an individual day, we can very easily skip loading most of the data.
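A sketch of the first idea, assuming the whole of table B fits on the device next to one chunk of A; join_on_gpu stands in for the actual join kernels:

```cpp
// Upload B once and keep it resident; only the A chunks stream through.
#include <thrust/device_vector.h>
#include <vector>

void join_on_gpu(const thrust::device_vector<int>& a,
                 const thrust::device_vector<int>& b);  // hypothetical

void nested_loop_join(const std::vector<std::vector<int>>& a_chunks,
                      const std::vector<int>& b) {
    thrust::device_vector<int> d_b(b.begin(), b.end());  // uploaded once
    for (const std::vector<int>& a_chunk : a_chunks) {
        thrust::device_vector<int> d_a(a_chunk.begin(), a_chunk.end());
        join_on_gpu(d_a, d_b);  // B is already on the device
    }
}
```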
Sometimes the bottleneck is on the GPU itself (yes, it really happens, even on big data queries). Here are three quick examples which came up at SQream:
We found our implementation of reduce_by_key (we replaced the Thrust one early on for various reasons) performed particularly badly when most of the keys were distinct (i.e. the number of output rows is not much less than the number of input rows). We optimised this by implementing specialised code paths for the case where the keys are mostly distinct, and the case where they are not.
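The specialised paths themselves are beyond a blog post, but the dispatch can be sketched with a standard Thrust idiom for counting runs in sorted keys; the 50% threshold here is an illustrative assumption:

```cpp
// Decide which reduce_by_key code path to take, given sorted keys.
#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/inner_product.h>

bool keys_mostly_distinct(const thrust::device_vector<int>& keys) {
    if (keys.size() < 2) return true;
    // number of runs = 1 + number of adjacent pairs that differ
    std::size_t runs = 1 + thrust::inner_product(
        keys.begin(), keys.end() - 1, keys.begin() + 1, std::size_t(0),
        thrust::plus<std::size_t>(), thrust::not_equal_to<int>());
    return 2 * runs > keys.size();  // assumed threshold: more than half
}
```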
Originally, to implement multikey sort we used an idea called 'lexicographical sort', found on the Thrust mailing list. The naive approach performs badly when the total key size is big, because there are many loops through the first stage of the algorithm. One possible way to improve this is to use a version of segmented sort.
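For reference, a minimal sketch of the lexicographical sort idea for two integer keys: one stable sort pass per key, least significant key first, applied to a row permutation. With more or wider keys, the number of passes grows, which is exactly the weakness described above:

```cpp
// order by k1, k2: stable sort the permutation by k2, then by k1.
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

thrust::device_vector<int> lexicographic_order(
    const thrust::device_vector<int>& k1,   // most significant key
    const thrust::device_vector<int>& k2) { // least significant key
    size_t n = k1.size();
    thrust::device_vector<int> perm(n), tmp(n);
    thrust::sequence(perm.begin(), perm.end());  // 0, 1, 2, ...

    // pass 1: least significant key
    thrust::gather(perm.begin(), perm.end(), k2.begin(), tmp.begin());
    thrust::stable_sort_by_key(tmp.begin(), tmp.end(), perm.begin());

    // pass 2: most significant key; stability preserves the k2 order
    thrust::gather(perm.begin(), perm.end(), k1.begin(), tmp.begin());
    thrust::stable_sort_by_key(tmp.begin(), tmp.end(), perm.begin());
    return perm;
}
```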
We found a big bottleneck in some joins in our implementation which uses lower_bound and upper_bound from Thrust. We are looking at a few different ways to improve this, since trying to optimise these functions themselves to run faster than the Thrust implementations has proved difficult.
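For context, this is the kind of probe step in question. Thrust's vectorized lower_bound and upper_bound compute, for each probe key, its matching index range in a sorted build-side column:

```cpp
// For every probe key, find the [begin, end) match range in sorted_build.
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>

void probe_ranges(const thrust::device_vector<int>& sorted_build,
                  const thrust::device_vector<int>& probe,
                  thrust::device_vector<int>& match_begin,
                  thrust::device_vector<int>& match_end) {
    match_begin.resize(probe.size());
    match_end.resize(probe.size());
    // one binary search per probe key, all run in parallel on the device
    thrust::lower_bound(sorted_build.begin(), sorted_build.end(),
                        probe.begin(), probe.end(), match_begin.begin());
    thrust::upper_bound(sorted_build.begin(), sorted_build.end(),
                        probe.begin(), probe.end(), match_end.begin());
}
```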
One final issue: strings. There are lots of performance and functionality issues with strings on the GPU. Matching the performance of CPU-based compression is, in general, still an unsolved problem on GPUs. Operating on very large string columns is also problematic because of GPU memory constraints. And there are issues with strings on the GPU even when you are not running on big data, such as finding ways to deal with variable-length strings, using collations (which means you can sort and compare the data properly), and supporting Unicode. Good solutions: we're still looking …
No big conclusion, but I hope this has filled in a few blanks on how you go about implementing ‘big data SQL’ on GPUs if you haven’t thought about it much before, and given you some interesting things to think about.
We’re proud of our GPU database engine. We’ve spent a lot of time and work making it highly optimised for a variety of use-cases, from telecoms to ad-tech, retail and finance. Read more about SQream DB, or book a GPU Database demo to see for yourself what SQream’s GPU database can do for your data analytics.