By Eliyahu Klatis
When a data analyst runs a query with SQream DB, the query compiler interprets the query into instructions. These instructions include information about what data needs to be read from our persistent store in order to satisfy the query. SQream DB persists data to disk, so read performance is important for query performance – primarily sequential read speed. When we initially designed our system, most of the data we dealt with was stored on spinning disks, and the system was optimized for them. If the data was arranged to be read sequentially, then whenever a query demanded a full scan of the tables (for example, when there are no WHERE predicates), read performance became the biggest factor in the query’s runtime. This also made it the biggest bottleneck.

The question, therefore, is: how do we challenge the storage, and ourselves, to get the highest read performance we can, so that our queries can run as fast as possible? We recently revisited the level of I/O concurrency we use, and found that adding more concurrency brings big performance benefits. But first…
On Linux, the basic way to access data on storage is the ‘read’ system call. On spinning disk systems with sequential data, sending multiple read requests at once caused most queries to run slower, so even though SQream DB’s table readers are multi-threaded, most queries would only launch a small number of workers. With modern storage systems that handle random access much better, we can actually get a lot more throughput by launching several read workers in parallel, with the main limitation being the amount of main system memory. At the scales at which SQream DB works (many terabytes to petabytes), reading entire tables into main system memory is not feasible, so we can’t simply launch an unlimited number of workers.
Low parallelism I/O is suitable for systems with limited disk throughput
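To make that concrete, here is a minimal sketch of the basic pattern described above: a single thread pulling a file through the ‘read’ system call, one buffer at a time. The file path and buffer size are illustrative assumptions, not SQream DB code.

```cpp
// Minimal sketch of a single-threaded read() loop -- illustrative only.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    const char* path = "/data/column_chunk.bin";  // hypothetical column file
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    std::vector<char> buf(1 << 20);               // 1 MiB per request (assumed)
    ssize_t n;
    while ((n = read(fd, buf.data(), buf.size())) > 0) {
        // ... hand the buffer to decompression / the rest of the query pipeline ...
    }
    if (n < 0) perror("read");

    close(fd);
    return 0;
}
```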
SQream DB has many ways to reduce I/O bottlenecks, mostly centered around GPU compression and clever filtering, but at its core it’s a columnar database. Like many other columnar databases, SQream DB handles hundreds of read requests, each one related to a different column of the table, with some tables having many hundreds of columns. The approach that worked best for us on spinning disks was iterating over groups of columns, and only issuing the next bulk of read requests once the previous one had finished. This was good enough to saturate older systems with read requests. But is it enough in 2019?
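As a rough illustration of that older pattern, the sketch below issues one group of column reads at a time and waits for the whole group to finish before starting the next. The helper, file names, and group size are invented for the example; this is not SQream DB’s actual reader.

```cpp
// Illustrative sketch of batched, low-parallelism column reads: issue one
// group of requests, wait for all of them, then move on to the next group.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for reading one column file end to end.
static void read_column(const std::string& column_path) {
    std::printf("reading %s\n", column_path.c_str());
}

static void read_table_in_groups(const std::vector<std::string>& columns,
                                 std::size_t group_size) {
    for (std::size_t i = 0; i < columns.size(); i += group_size) {
        const std::size_t end = std::min(i + group_size, columns.size());
        std::vector<std::thread> batch;
        for (std::size_t j = i; j < end; ++j)
            batch.emplace_back(read_column, columns[j]);
        // The next group is not issued until every read in this one has
        // returned, which kept the request depth low on spinning disks.
        for (auto& t : batch) t.join();
    }
}

int main() {
    read_table_in_groups({"c0.col", "c1.col", "c2.col", "c3.col"}, 2);
    return 0;
}
```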
With more modern systems, it wasn’t. A modern storage device that several of our customers use is often capable of achieving around 12GB/s of throughput when connected over a single 100GbE network link – close to the link’s 12.5GB/s line rate. So we did what we know how to do everywhere else in the system: we started launching more workers to read more data in parallel. This means a single SQream DB instance can now read much more data in one go. The actual numbers vary with query type and workload, but SQream DB can now read about 6GB/s per instance.
High parallelism I/O can deliver much faster read throughput on newer storage systems
The biggest change was the number of worker threads performing these requests in parallel. Our system can now launch just the right number of workers. Most query workloads we tested seem to peak at around 20 workers, though the number can be a bit lower on high-occupancy machines with more than 4 GPUs, and it depends heavily on the actual storage unit.
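To picture what the higher-parallelism version looks like, here is a simplified sketch: a capped pool of worker threads, each issuing pread() on its own slice of a file. The worker count, path, and buffer sizes are assumptions for illustration; SQream DB’s actual reader and its tuning are more involved.

```cpp
// Simplified sketch of a higher-parallelism reader: a capped pool of worker
// threads, each streaming its own slice of a file with pread(). All numbers
// and paths are illustrative, not SQream DB defaults.
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// pread() carries its own offset, so workers never contend on a shared file position.
static void read_slice(int fd, off_t offset, off_t length) {
    std::vector<char> buf(1 << 20);                        // 1 MiB per request (assumed)
    off_t done = 0;
    while (done < length) {
        size_t chunk = static_cast<size_t>(std::min<off_t>(buf.size(), length - done));
        ssize_t n = pread(fd, buf.data(), chunk, offset + done);
        if (n <= 0) break;                                 // EOF or error
        done += n;
        // ... hand the buffer to decompression / the GPU pipeline ...
    }
}

int main() {
    const char* path = "/data/column_chunk.bin";           // hypothetical column file
    const int workers = 20;                                // roughly where our workloads peaked
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st{};
    fstat(fd, &st);
    const off_t slice = (st.st_size + workers - 1) / workers;  // split the file evenly

    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i) {
        const off_t off = static_cast<off_t>(i) * slice;
        if (off >= st.st_size) break;
        pool.emplace_back(read_slice, fd, off, std::min(slice, st.st_size - off));
    }
    for (auto& t : pool) t.join();
    close(fd);
    return 0;
}
```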
My colleague Eyal Hirsch and I did some extensive research to identify which workloads would benefit from this improvement. What we found most interesting wasn’t any single workload, but the TPC-H benchmark queries. These queries are quite varied – some read a lot of data, others are CPU or GPU bound – but all of them begin by reading data off disk. It’s worth noting that we don’t believe the TPC-H benchmark accurately reflects a typical customer workload, but it does give some insight into how different types of queries run on our system.

Here are some results from the TPC-H benchmark at a scale factor of 10,000 – meaning a 10TB dataset. Our tests show that the average speedup is about 2x at this scale. Some queries are more than twice as fast: Query 19, for example, went from about 39 minutes on the sequential version of SQream DB down to just under 16 minutes, while its average read throughput more than doubled from 451MB/s to 1049MB/s. That’s a lot of time saved when you’re waiting for a query to finish.
Some of our current work to further improve performance is centered around handling more bulk requests simultaneously. This requires more system memory, however, so finding the sweet spot is critical – memory is still quite expensive (and small) compared to the on-disk storage used in customer scenarios.
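As a back-of-the-envelope illustration of that trade-off (the numbers are invented, not SQream DB defaults), the host memory set aside for I/O buffers effectively caps how many bulk requests can be in flight at once:

```cpp
// Back-of-the-envelope sketch: bound in-flight bulk reads by a host-memory
// budget. All figures are illustrative assumptions, not SQream DB defaults.
#include <cstdio>

int main() {
    const double budget_gib    = 64.0;   // host memory reserved for I/O buffers (assumed)
    const double buffer_mib    = 256.0;  // memory pinned per in-flight bulk request (assumed)
    const int    max_in_flight = static_cast<int>((budget_gib * 1024.0) / buffer_mib);

    // 64 GiB / 256 MiB = 256 concurrent bulk requests; a smaller budget or a
    // larger per-request buffer lowers that ceiling, which is why the sweet
    // spot depends on how much host memory the machine has to spare.
    std::printf("memory budget allows up to %d concurrent bulk requests\n", max_in_flight);
    return 0;
}
```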