What is the Fastest Way to Scan 100 TB of Data?

By Galit Koka Elad

4.26.2017

Say you have 100 TB of data in a table holding a year’s worth of information. Your data is partitioned by a key and ordered by date. Your manager asks you to quickly answer a whole lot of questions about last year across all regions, regardless of that key. And it has to be on his desk, fast.

Any thoughts?

How do you approach this?

Partitions won’t do the trick here, since your data is partitioned by a different key. Ordering won’t help either. Your first option is to “full scan” it all, but that’s time-consuming, and you don’t have the time. Does this sound familiar?

You might be tempted to index the ‘region’ column and query the data through that index. But indexing costs time, not only at first-time creation but also during inserts and often during querying itself. Indexing also requires pre-modeling, and it only helps when you query by the indexed column. Not a great option either.

So back to our question – how do you do that? How do you scan 100 TB of data to find only the relevant rows, aggregate, summarize – and all within seconds?

SQream DB, the GPU database we developed, transparently gathers metadata during data ingest, which allows it to support full data skipping across all columns and tables. Data skipping works like a negative index: it lets SQream DB avoid loading and processing data that isn’t necessary for the query results.
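To make the idea concrete, here’s a minimal sketch in Python of one common way data skipping can work: per-chunk min/max statistics collected at ingest. The names and structure are invented for illustration; SQream DB’s actual metadata format isn’t public.

```python
from dataclasses import dataclass

@dataclass
class ChunkMetadata:
    """Per-chunk statistics, collected once at ingest and stored with the chunk."""
    min_value: object
    max_value: object

def collect_metadata(chunk):
    # One cheap pass over the chunk as it is written; never recomputed.
    return ChunkMetadata(min(chunk), max(chunk))

def can_skip(meta, lo, hi):
    # The "negative index": if the chunk's value range cannot possibly
    # overlap the query predicate's range, the chunk is never loaded,
    # decompressed, or processed.
    return meta.max_value < lo or meta.min_value > hi
```

At query time, a predicate such as `WHERE sale_date BETWEEN '2016-01-01' AND '2016-12-31'` would load only the chunks for which `can_skip` returns False.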

Assume you were asked for a summary of all of the past year’s EMEA sales. Data skipping lets SQream DB avoid reading irrelevant columns and column ‘chunks’ entirely, skipping their processing and speeding up the query dramatically. Data skipping works wonders in columnar databases like SQream DB, which read only the columns a query actually touches. The result is slicing and dicing that improves performance by orders of magnitude, as the sketch below shows.
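Continuing the toy model above (the column layout here is invented for illustration, reusing `collect_metadata` and `can_skip` from the previous sketch), here’s how the EMEA query could prune chunks in a columnar store:

```python
# Each column is stored as its own list of chunks, so a query touching
# only 'region' and 'amount' never reads the table's other columns at all.
regions = [["APAC", "APAC"], ["EMEA", "EMEA"], ["AMER", "EMEA"]]
amounts = [[10, 20], [30, 40], [50, 60]]
region_meta = [collect_metadata(c) for c in regions]

# Summarize EMEA sales: chunks whose min..max range cannot contain "EMEA"
# are pruned outright and never loaded or decompressed.
total = sum(
    amount
    for r_chunk, a_chunk, meta in zip(regions, amounts, region_meta)
    if not can_skip(meta, "EMEA", "EMEA")
    for region, amount in zip(r_chunk, a_chunk)
    if region == "EMEA"
)
print(total)  # 130: the first chunk was skipped without ever being read
```

On real data, where a year of sales spans many thousands of chunks and most regions cluster together, this kind of pruning eliminates the bulk of the I/O before any row is examined.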

Now, I hear you. You’re saying, “But other databases also do data skipping! What’s so special about SQream DB?” To that, I say: “At SQream we support data skipping via the GPU, across all query layers.” The GPU’s available computing power lets the collection and tagging of data during ingest be “always on” and essentially free. Together with SQream’s patented, trade-secret algorithms, the GPU also gathers metadata on intermediate results, letting the database avoid joins and other expensive processing operations. The result is smart, efficient processing.
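SQream’s actual algorithms are proprietary, so the following is only an illustration of why GPU-side metadata collection can be nearly free. It uses CuPy as a stand-in GPU array library and the `ChunkMetadata` type from the first sketch; none of this reflects SQream DB’s real implementation:

```python
import cupy as cp  # GPU array library, used here purely for illustration

def ingest_chunk(host_values):
    # The chunk travels to the GPU for encoding/compression anyway, so a
    # min/max reduction over data already resident in GPU memory adds
    # almost nothing to the ingest cost.
    device_chunk = cp.asarray(host_values)
    return ChunkMetadata(
        min_value=device_chunk.min().item(),
        max_value=device_chunk.max().item(),
    )
```

The design point is that reductions like min/max are exactly the kind of work GPUs do fastest, which is what makes “always on” statistics gathering plausible at ingest speed.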

So what’s the fastest way to scan 100 TB of data? Don’t scan all of it… Instead, focus only on what you need! SQream DB knows how to do that, always, on all tables and on any type of column, exactly when you need it – which is pretty cool, wouldn’t you agree?