A common building block of many data operations is the “string gather” operation. It is defined as follows: given one “index array” and several “string arrays”, we use the index array to reorder the strings, as in the example below.

Before string gather:

indices: 0 2 1 2 0 1
strings1_in: foo   bar   tin
strings2_in: danny hadar motti

After string gather:

strings1_out: foo   tin   bar   tin   foo   bar
strings2_out: danny motti hadar motti danny hadar

Let me explain this example. Since the indices array starts with a 0, the outputs start with the 1st string of each string input (we use 0-based indexing). Since a 2 appears next, the corresponding position in the output contains the 3rd string of each string input, and so on. Note that the size of the indices array is independent of the size of the string input arrays (we may repeat a string several times, or omit it entirely).

Doing string gathers on the GPU can be problematic due to GPU memory limitations, so we sometimes need to do them on the CPU, as fast as possible. In this post I will start with a naive implementation and gradually speed it up. Let us look at the naive implementation first:

    void gather1(const vector<char*> &strs
                ,const vector<size_t> &sizes
                ,const size_t rows
                ,const vector<int> &idxs
                ,vector<char*> &outs)
    {
        for (size_t r = 0; r < rows; ++r)
            for (size_t c = 0; c < sizes.size(); ++c)
                memcpy(outs[c] + r       * sizes[c]
                      ,strs[c] + idxs[r] * sizes[c]
                      ,sizes[c]);
    }

The gather1 function accepts 5 parameters:

  • strs – Vector of char pointers, each char pointer points to one of the string input arrays. The strings in each array are of fixed length (which may be different in each array).

  • sizes – Vector of integers, for each array pointer in strs, the corresponding element in sizes contains the fixed length of its strings.

  • rows – Integer, the number of entries in the idxs array, which is also the number of strings written to each output array.

  • idxs – Vector of integers, the indices array.

  • outs – Vector of char pointers, each char pointer points to one of the string output arrays to which the function will write its output.
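To make the expected memory layout concrete, here is a minimal sketch of how such buffers could be set up and gather1 called, using the example from the beginning of the post. The column widths and helper variables are my own illustration and are not part of the original code:

    #include <cstring>
    #include <string>
    #include <vector>
    using namespace std;

    // Assumes the gather1 function shown above is declared before this point.
    int main() {
        // Two input columns, three fixed-width strings each, packed back to back
        // (the trailing spaces are the padding that makes every entry fixed-width).
        const string col1_data = "foo bar tin ";        // three 4-byte strings
        const string col2_data = "danny hadar motti ";  // three 6-byte strings
        vector<char> col1(col1_data.begin(), col1_data.end());
        vector<char> col2(col2_data.begin(), col2_data.end());

        vector<int> idxs = {0, 2, 1, 2, 0, 1};          // six output rows
        const size_t rows = idxs.size();

        // Each output buffer must hold one fixed-width string per index.
        vector<char> out1(4 * rows), out2(6 * rows);

        vector<char*>  strs  = {col1.data(), col2.data()};
        vector<size_t> sizes = {4, 6};
        vector<char*>  outs  = {out1.data(), out2.data()};

        gather1(strs, sizes, rows, idxs, outs);
        // out1 now holds "foo tin bar tin foo bar " and out2 the matching names.
        return 0;
    }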

This implementation is easy to read and does the work, but its performance can be improved:

  • The first thing we notice is that on every iteration we call vector::size() once and vector::operator[] several times. We can get rid of these calls by extracting the vector base pointer with vector::data() and keeping the size in a local variable (this is possible because neither of them changes during the loop).

  • Furthermore, gather1 writes to a different column on every iteration of the inner loop. Since the data are stored in columnar order, this means the writes are spread all over memory. By swapping the order of the loops, the writes become sequential and adjacent to each other. This property is called ‘spatial locality’ and improves memory I/O performance.

After these two improvements, the function looks as follows:

    void gather2(const vector<char*> &strs_
                ,const vector<size_t> &sizes_
                ,const size_t rows
                ,const vector<int> &idxs_
                ,vector<char*> &outs_)
    {
        auto strs = strs_.data();
        auto sizes = sizes_.data();
        auto idxs = idxs_.data();
        auto outs = outs_.data();
        auto sizessize = sizes_.size();
        for (size_t c = 0; c < sizessize; ++c) {
            auto sz = sizes[c];
            for (size_t r = 0; r < rows; ++r) 
                memcpy(outs[c] + r       * sz
                      ,strs[c] + idxs[r] * sz
                      ,sz);
        }
    }

The next step is to improve the speed of the memcpy itself. The memcpy function in the standard library is tuned for copying large amounts of memory. While this is good for most cases, we are doing many small copies, so it hurts our performance. If the compiler knows the size of a small memcpy at compile time, it can replace the call with inline code that performs the small copy directly, instead of calling the memcpy function. Here is pseudo code for how we can take advantage of this:

    labels = {dummy,memcpy1,memcpy2,memcpy3,...,memcpy256};
    for each c in columns
        col_sz = fixed length of strings in col c
        for each r in rows
            goto labels[col_sz]
          memcpy1:
            memcpy(dst, src, 1);
            goto done;
          memcpy2:
            memcpy(dst, src, 2);
            goto done;
          .
          .
          .
          memcpy256:
            memcpy(dst, src, 256);
            goto done;
          done:
            continue the loop;

The actual code uses C macros to unroll a big switch statement with a case for each size from 1 to 256. Since most compilers compile such switch statements into jump tables, the compiler is likely to produce machine code very similar to the pseudo code above. Here is the code (you don’t have to understand the macros, we just use them to write fewer lines; writing the switch out directly yields the same performance):

    void gather3(const vector<char*> &strs_
                ,const vector<size_t> &sizes_
                ,const size_t rows
                ,const vector<int> &idxs_
                ,vector<char*> &outs_)
    {
        auto strs = strs_.data();
        auto sizes = sizes_.data();
        auto idxs = idxs_.data();
        auto outs = outs_.data();
        auto sizessize = sizes_.size();
        for (size_t c = 0; c < sizessize; ++c) {
            auto sz = sizes[c];
            for (size_t r = 0; r < rows; ++r) {
    #define CASE(x) \
        case (x): memcpy(outs[c] + r * (x), strs[c] + idxs[r] * (x), (x)); break;

    #define CASE2(x) CASE(x) CASE(x+1)
    #define CASE4(x) CASE2(x) CASE2(x+2)
    #define CASE8(x) CASE4(x) CASE4(x+4)
    #define CASE16(x) CASE8(x) CASE8(x+8)
    #define CASE32(x) CASE16(x) CASE16(x+16)
    #define CASE64(x) CASE32(x) CASE32(x+32)
    #define CASE128(x) CASE64(x) CASE64(x+64)
    #define CASE256(x) CASE128(x) CASE128(x+128)

                switch (sz) { CASE256(1) }

    #undef CASE256
    #undef CASE128
    #undef CASE64
    #undef CASE32
    #undef CASE16
    #undef CASE8
    #undef CASE4
    #undef CASE2
    #undef CASE
            }
        }
    }
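For readers who want to see what the macros actually generate, here is a hand expansion of the first few cases produced by CASE256(1) (the compiler of course sees all 256 of them):

    switch (sz) {
    case (1):   memcpy(outs[c] + r * (1),   strs[c] + idxs[r] * (1),   (1));   break;
    case (1+1): memcpy(outs[c] + r * (1+1), strs[c] + idxs[r] * (1+1), (1+1)); break;
    case (1+2): memcpy(outs[c] + r * (1+2), strs[c] + idxs[r] * (1+2), (1+2)); break;
    // ... and so on, up to the case for size 256
    }

Because every memcpy size is a compile-time constant, the compiler is free to replace each call with a few inline move instructions, as described above.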

The gather3 function keeps ‘spatial locality’ for its output, and also copies the strings efficiently. One problem remains: the access to the input memory is random. There is no way to prevent this, since we have no guarantees about the content of the indices array. We can, however, reduce the damage caused by this random access.

To do that, we must first understand what advantage ‘spatial locality’ gives us. The main advantage is better cache performance. Since accessing RAM takes time, there exists a layer of memory that is faster than normal RAM, called cache memory. Data from RAM are copied to the cache in blocks, so that when one memory location is accessed, the block surrounding it is copied to the cache. While this improves performance for sequential memory reads, random memory reads benefit much less from it.

But what if we could predict in advance where our next read is going to be, and tell the memory system to asynchronously load that block into the cache? That way, while we are copying one string from the cache, the string for an upcoming iteration is already being loaded. Intel processors provide a ‘PREFETCH’ instruction, to which gcc conveniently gives us access, so we can integrate cache pre-fetching into our copy loop. Note that deciding whether to prefetch the next string or the one after it, and which prefetch parameters to use, can change from processor to processor and from one fixed string length to another. In the current example, these prefetching functions are implemented only for the sizes 50, 112 and 208. The code looks as follows:

    #include <emmintrin.h>  // for __m128d, _mm_prefetch and _MM_HINT_NTA

    struct block112 { __m128d d[7]; };
    struct block208 { __m128d d[13]; };

    // Note: the prefetches below look a few rows ahead (arr[i + 3] / arr[i + 2]),
    // so they assume the index array has a few extra readable entries past cnt.
    void gather_50_fast(void *out, void *in, int *arr, size_t cnt) {
        char *a = (char *)in;
        char *b = (char *)out;
        size_t i;
        for (i = 0; i < cnt; ++i) {
            // prefetch the string we will copy three iterations from now
            _mm_prefetch(a + arr[i + 3] * 50, _MM_HINT_NTA);
            memcpy(b + i * 50, a + arr[i] * 50, 50);
        }
    }

    void gather_112_fast(void *out, void *in, int *arr, size_t cnt) {
        struct block112 *a = (struct block112 *)in;
        struct block112 *b = (struct block112 *)out;
        size_t i;
        for (i = 0; i < cnt; ++i) {
            // prefetch the whole 112-byte block, 64 bytes at a time
            _mm_prefetch((const char *)(a[arr[i + 3]].d), _MM_HINT_NTA);
            _mm_prefetch((const char *)(a[arr[i + 3]].d + 4), _MM_HINT_NTA);
            memcpy(b + i, a + arr[i], 112);
        }
    }

    void gather_208_fast(void *out, void *in, int *arr, size_t cnt) {
        struct block208 *a = (struct block208 *)in;
        struct block208 *b = (struct block208 *)out;
        size_t i;
        for (i = 0; i < cnt; ++i) {
            // prefetch the whole 208-byte block, 64 bytes at a time
            _mm_prefetch((const char *)(a[arr[i + 2]].d), _MM_HINT_NTA);
            _mm_prefetch((const char *)(a[arr[i + 2]].d + 4), _MM_HINT_NTA);
            _mm_prefetch((const char *)(a[arr[i + 2]].d + 8), _MM_HINT_NTA);
            _mm_prefetch((const char *)(a[arr[i + 2]].d + 12), _MM_HINT_NTA);
            memcpy(b + i, a + arr[i], 208);
        }
    }

    void gather4(const vector<char*> &strs_
                ,const vector<size_t> &sizes_
                ,const size_t rows
                ,const vector<int> &idxs_
                ,vector<char*> &outs_)
    {
        auto strs = strs_.data();
        auto sizes = sizes_.data();
        auto idxs = const_cast<int*>(idxs_.data()); // note: we need to cast away
                                                    // the const here because our
                                                    // fast gather functions take
                                                    // a plain int*
        auto outs = outs_.data();
        auto sizessize = sizes_.size();
        for (size_t c = 0; c < sizessize; ++c) {
            auto sz = sizes[c];
            switch (sz) {
            case 50:  gather_50_fast (outs[c], strs[c], idxs, rows); break;
            case 112: gather_112_fast(outs[c], strs[c], idxs, rows); break;
            case 208: gather_208_fast(outs[c], strs[c], idxs, rows); break;
            }
        }
    }

The block112 and block208 structs represent strings of lengths 112 and 208 as arrays of 128-bit SSE values (__m128d). This may become useful if we later use the Intel SSE instruction set to improve performance further; for now, however, we do not use any SSE operations on the __m128d type. The _mm_prefetch intrinsic is translated into the ‘PREFETCH’ instruction. It takes two parameters: an address to prefetch from, and a “cache hint”. The placement of the ‘PREFETCH’ calls in the loop is based on experiments we ran to find out how to distribute them properly.
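As a rough idea of what such experiments might look like, here is a hypothetical sketch that makes the look-ahead distance of the 50-byte gather a parameter, so different values can be timed on the target machine. The template parameter and the function name are inventions for illustration and are not part of the benchmark code:

    #include <cstring>
    #include <xmmintrin.h>   // _mm_prefetch, _MM_HINT_NTA

    // Hypothetical variant of gather_50_fast with a tunable look-ahead distance.
    // Timing it with LOOKAHEAD = 1, 2, 3, ... is one way to choose the value that
    // is hard-coded in the functions above.
    template <size_t LOOKAHEAD>
    void gather_50_lookahead(void *out, const void *in, const int *arr, size_t cnt) {
        const char *a = (const char *)in;
        char *b = (char *)out;
        for (size_t i = 0; i < cnt; ++i) {
            if (i + LOOKAHEAD < cnt)   // stay inside the index array
                _mm_prefetch(a + (size_t)arr[i + LOOKAHEAD] * 50, _MM_HINT_NTA);
            memcpy(b + i * 50, a + (size_t)arr[i] * 50, 50);
        }
    }

The bounds check also avoids the reads of arr[i + 3] past the end of the array that the fixed versions above perform near the end of their loops; those reads are harmless only if the index array is allocated with a few spare entries.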

After writing these 4 gather functions, it is time to benchmark them. Our benchmark program creates a randomized indices array and several input string arrays with fixed sizes supported by gather4. It then runs one of the gather implementations above and times it. It also supports a ‘compare’ mode in which we make sure that all gathers give the same output; this way, the naive implementation acts as a unit test for the optimized ones. One last note: our benchmark code also fills the output memory before running gather on it. This is important on the Linux operating system, since Linux allocates memory lazily; until memory has been written to, it is not physically allocated. If we ran gather on untouched output memory, we would also be timing how long it takes Linux to allocate that memory, which is not relevant to our benchmark.
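The benchmark program itself is not listed in this post, but a minimal sketch of the harness described above could look like the following. The seed, fill values and the choice of gather3 are assumptions for illustration; the row count and sizes match the run reported below:

    #include <chrono>
    #include <cstdio>
    #include <random>
    #include <vector>
    using namespace std;

    // Assumes gather1..gather4 from above are declared before this point.
    int main() {
        const size_t rows = 1000000;
        const vector<size_t> sizes = {50, 50, 112, 112, 208, 112, 50, 50, 50, 50};

        // Randomized indices array, with a few spare entries at the end because
        // gather4's prefetches look a few rows ahead.
        mt19937 gen(42);
        uniform_int_distribution<int> dist(0, (int)rows - 1);
        vector<int> idxs(rows + 3);
        for (size_t i = 0; i < rows; ++i) idxs[i] = dist(gen);

        // Input and output columns. Filling the buffers up front forces Linux to
        // actually map the pages, so the timing below measures only the gather.
        vector<vector<char>> in_bufs, out_bufs;
        in_bufs.reserve(sizes.size());
        out_bufs.reserve(sizes.size());
        vector<char*> strs, outs;
        for (size_t sz : sizes) {
            in_bufs.emplace_back(sz * rows, 'x');
            out_bufs.emplace_back(sz * rows, 'y');
            strs.push_back(in_bufs.back().data());
            outs.push_back(out_bufs.back().data());
        }

        auto t0 = chrono::steady_clock::now();
        gather3(strs, sizes, rows, idxs, outs);   // swap in gather1/2/4 to compare
        auto t1 = chrono::steady_clock::now();
        printf("%.2fs\n", chrono::duration<double>(t1 - t0).count());
        return 0;
    }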

After benchmarking the gather implementations on an Intel i5-3470 3.20GHz CPU, with rows=1000000 and sizes={50,50,112,112,208,112,50,50,50,50}, we got the following results:

    gather1 - 0.51s
    gather2 - 0.42s
    gather3 - 0.33s
    gather4 - 0.27s

Note how after all optimizations, the run time of gather went down by roughly 45%. 

Want to know more about SQream Technologies? Be sure to follow us on Twitter.
Leave your comments below – we can’t wait to hear what you have to say!