
Vector Compression in Interleaved Space on ARM

2024-09-09

In my last article, Use interleaved vectors for parsing on ARM, I covered the three main algorithms we need to support on interleaved vectors for high-performance parsing of UTF-8, JSON, or Zig on aarch64/ARM architectures: the movemask and unmovemask routines, as well as the elementwise shift replacement, all of which are more efficient when performed on interleaved vectors (except when shifting by a multiple of 16). I also briefly explained that we can perform prefix-sum operations using these elementwise shifts, but I did not explain why we might want to. The answer is vector compression.

Vector Compression§

In high-performance parsers, we sometimes produce a vector bytemask and/or a bitmask indicating which elements we want to keep and which should be thrown out. On the latest and greatest x86-64 hardware (with AVX-512), we have the VPCOMPRESSB instruction, which extracts just the bytes of a 64-byte vector that correspond to set bits in a bitmask we pass in.

On Arm, however, there is no VPCOMPRESSB, nor is there an interleaved equivalent. So, per usual, we have to roll our own.

Luckily, we can reproduce the semantics of VPCOMPRESSB by:

  1. Finding the (exclusive) prefix sum of the bytemask.
  2. Adding the prefix sum to the identity counting vector.
  3. Simultaneously shifting each element left by its corresponding amount, calculated in step 2 (see the sketch after this list).
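
For concreteness, here is a minimal sketch of those three steps on a small, normal-ordered vector, using only std.simd. This compressSketch helper is purely illustrative and is not part of the linked implementation; the interleaved version is developed below.

const std = @import("std");

fn compressSketch(data: @Vector(8, u8), keep: @Vector(8, bool)) @Vector(8, u8) {
    const zero: @Vector(8, u8) = @splat(0);

    // 1. Bytemask of -1 (0xff) per kept element, then its exclusive prefix sum.
    var mask = @select(u8, keep, @as(@Vector(8, u8), @splat(0xff)), zero);
    inline for (0..3) |i| mask +%= std.simd.shiftElementsRight(mask, 1 << i, 0);
    const exclusive = std.simd.shiftElementsRight(mask, 1, 0);

    // 2. Add to the identity counting vector: each slot now holds how many discarded
    //    elements precede it, i.e. how far left its element must travel.
    var travel = std.simd.iota(u8, 8) +% exclusive;

    // 3. Shift elements left by 1, 2, and 4, keeping a shifted element whenever the
    //    corresponding bit of its (shifted-in) travel distance is set.
    var out = data;
    inline for (0..3) |x| {
        const i = 1 << x;
        const t = std.simd.shiftElementsLeft(travel, i, 0);
        const d = std.simd.shiftElementsLeft(out, i, 0);
        const sel = (t & @as(@Vector(8, u8), @splat(i))) != zero;
        travel = @select(u8, sel, t, travel);
        out = @select(u8, sel, d, out);
    }
    return out; // kept bytes are packed at the front; trailing lanes hold leftovers
}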

Prefix Sum§

As in the previous article, the prefix-sum of a vector can be computed like so:

fn prefixSum(vec_: @Vector(64, u8)) @Vector(64, u8) {
    var vec = vec_;
    inline for (0..6) |i| { // iterates from 0 to 5
        vec += std.simd.shiftElementsRight(vec, 1 << i, 0);
    }
    return vec;
}

And, as first shown in the previous article, we can abstract away the elementwise vector shifts with a helper function. That way, our interleaved version looks like so:

fn prefixSum(vec_: @Vector(64, u8)) @Vector(64, u8) {
    var vec = vec_;
    inline for (0..6) |i| { // iterates from 0 to 5
        vec += shiftInterleavedElementsRight(vec, 1 << i, 0);
    }
    return vec;
}

It looks the same!

Since we abstracted away the interleaved shift, we can think of it as though it's just a normal-ordered vector. Here is how the prefix sum unfolds for a normally-ordered vector.

We first shift the vector right by one, then add that to itself. Then we shift the result of that addition right by two and add it to the previous result. If we continue this pattern, every column ends up holding the sum of itself and every byte that came before it.
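
For instance, starting from an all-ones eight-byte vector, the rounds proceed like this:

start:                    1 1 1 1 1 1 1 1
+= (shifted right by 1):  1 2 2 2 2 2 2 2
+= (shifted right by 2):  1 2 3 4 4 4 4 4
+= (shifted right by 4):  1 2 3 4 5 6 7 8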

To apply this to vector compression, start with a -1 in each slot you intend to keep, and a 0 otherwise.

Next, shift that result to the right by 1, because we want the exclusive prefix sum (if possible, it's more efficient to do this shift on the bytemask beforehand). Then add the result to the identity counting vector.

The resulting travel_distances vector tells us, for each element we want to keep, how many slots it needs to be shifted leftwards.
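
For example, with eight lanes (wrapping arithmetic, so -1 is stored as 255):

bytemask (keep?):       0 -1 -1  0  0 -1  0 -1
exclusive prefix sum:   0  0 -1 -2 -2 -2 -3 -3
identity counting:      0  1  2  3  4  5  6  7
travel_distances:       0  1  1  1  2  3  3  4

The elements we keep sit at indices 1, 2, 5, and 7 with travel distances 1, 1, 3, and 4, so after moving left by those amounts they land in slots 0 through 3.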

Compression via prefix-sum§

Next, we shift each element left by the travel_distances we calculated from the prefix_sums vector. To accomplish this, we shift the vector by successive powers of 2, each time keeping only values whose binary representation has a 1 bit in the place value corresponding to the current power of 2, otherwise keeping the previous value. E.g., if we want to shift an element by a total of 5 slots, we will shift it during the 1-shift and 4-shift stage, because the binary representation of 5 is 0b101 (i.e. 1+4).

// Compresses the elements in `data` according to `prefix_sums`, the (exclusive) prefix sum of the
// bytemask computed as described above. Note this returns a compressed 64-byte vector in interleaved
// space, meaning that if you want to write it out to memory, you need to use the `st4` instruction.
fn compress(data: @Vector(64, u8), prefix_sums: @Vector(64, u8)) @Vector(64, u8) {
    const indices = comptime @as(@Vector(64, u8), @bitCast(std.simd.deinterlace(4, std.simd.iota(u8, 64))));
    var travel_distances = indices +% prefix_sums;
    var compressed_data = data;

    inline for (0..6) |x| {
        const i = 1 << x;
        const shifted_travel_distances = shiftInterleavedElementsLeft(travel_distances, i, 0);
        const shifted_compressed_data = shiftInterleavedElementsLeft(compressed_data, i, 0);
        const selector = cmtst(shifted_travel_distances, @splat(i));
        travel_distances = bsl(selector, shifted_travel_distances, travel_distances);
        compressed_data = bsl(selector, shifted_compressed_data, compressed_data);
    }

    return compressed_data;
}

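// Emulates AArch64 NEON's CMTST (test bits nonzero): 0xff wherever (a & b) != 0, else 0.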
fn cmtst(a: anytype, comptime b: @TypeOf(a)) @TypeOf(a) {
    return @select(u8, (a & b) != @as(@TypeOf(a), @splat(0)), @as(@TypeOf(a), @splat(0xff)), @as(@TypeOf(a), @splat(0)));
}

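// Emulates AArch64 NEON's BSL (bitwise select): bits from `a` where `selector` is set, from `b` elsewhere.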
fn bsl(selector: anytype, a: @TypeOf(selector), b: @TypeOf(selector)) @TypeOf(selector) {
    return (a & selector) | (b & ~selector);
}

Unfortunately, this method requires a logarithmic number of steps, both in the prefix sum and the compression stage. It also has a strict serial dependency chain from start to finish.

We can do better by finding the prefix sum of each group of 8 or 16 elements, and writing out to memory 8 or 4 times, respectively, instead of once.

Breaking it up§

In order to prefix-sum and vector-compress groups of 8, 16, or 32, we need to be careful that our shiftInterleavedElementsLeft and shiftInterleavedElementsRight functions do not go across the boundaries we set. If we want to vector-compress groups of 8, we want our elementwise shift emulation to not add element 7 to element 8, 15 to 16, 23 to 24, etc. Luckily, we have instructions that put barriers between these groups: the shl/shr instructions! In interleaved space, a group of 8 consecutive original elements occupies 2 adjacent lanes in each of the four registers, a group of 16 occupies 4, and a group of 32 occupies 8. So for groups of 8 we use 2-byte-granularity shifts, for groups of 16 we use 4-byte-granularity shifts, and for groups of 32 we would use 8-byte-granularity shifts.

I define a custom shiftElementsLeft, upon which shiftInterleavedElementsLeft is built (see the definition of shiftInterleavedElementsRight here), which has a comptime boundary parameter. When the boundary is smaller than a u128, we use a 16-, 32-, or 64-bit integer shift instead:

fn shiftElementsLeft(vec: @Vector(16, u8), comptime amount: std.simd.VectorCount(@Vector(64, u8)), comptime boundary: type) @Vector(16, u8) {
    return if (boundary == u128)
        std.simd.shiftElementsLeft(vec, amount, 0)
    else
        @bitCast(@as(@Vector(16 / @sizeOf(boundary), boundary), @bitCast(vec)) >> @splat(8*amount));
}
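
As a quick sanity check, here is a hypothetical test (assuming the shiftElementsLeft above, const std = @import("std"); in scope, and a little-endian target such as aarch64): shifting an iota vector left by one element with a u32 boundary leaves a zero in the top lane of every 4-byte group.

test "shiftElementsLeft stops at u32 boundaries" {
    const v = std.simd.iota(u8, 16);
    const shifted = shiftElementsLeft(v, 1, u32);
    const expected = @Vector(16, u8){ 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 0, 13, 14, 15, 0 };
    try std.testing.expect(@reduce(.And, shifted == expected));
}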

This allows us to get our st4 instructions started earlier, with a lot more parallelism:

inline for (0..64 / WIDTH) |i| {
    st4(
        dest[if (i == 0) 0 else prefix_sum_of_offsets[i*(WIDTH / 8) - 1]..],
        shiftInterleavedElementsLeft(compressed_data, WIDTH*i, u128)
    );
}

Where prefix_sum_of_offsets is defined like so:

comptime var prefix_sum_multiplier = 0;
inline for (0..64 / WIDTH) |i| prefix_sum_multiplier |= 1 << i*WIDTH;
const prefix_sum_of_offsets: [8]u8 = @bitCast(
    @as([2]u64, @bitCast(
        uzp2(
            neg(
                @as([4]@Vector(16, u8), @bitCast(prefix_sums))[3]
            )
        )
    ))[0] *% prefix_sum_multiplier
);

This takes the bottommost prefix-sums vector, corresponding to what were originally elements 3, 7, 11, 15, 19, etc., then takes the arithmetic negation of each element, then extracts the odd bytes into a u64, then multiplies by a constant prefix_sum_multiplier that has a single bit set in every WIDTH-bit chunk. E.g. when WIDTH is 8, it multiplies by 0x0101010101010101, which computes the byte-wise prefix sum. When WIDTH is 16, it multiplies by 0x0001000100010001, computing the 2-byte-wise prefix sum. Quickly producing prefix_sum_of_offsets with a multiply, instead of a serial dependency chain of add instructions, allows us to calculate the destination pointers in parallel.
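
To see the multiply-as-prefix-sum trick in isolation, here is a hypothetical standalone test (little-endian byte order assumed) for the WIDTH = 8 case, where each byte of the product becomes a running total of the per-group counts:

const std = @import("std");

test "prefix sum of per-group counts via a single multiply" {
    // Groups 0, 1, and 2 keep 2, 3, and 1 elements respectively; the rest keep none.
    const totals: u64 = 0x0000000000010302;
    const sums: [8]u8 = @bitCast(totals *% 0x0101010101010101);
    // Byte i = total number of kept elements in groups 0 through i.
    try std.testing.expectEqualSlices(u8, &[_]u8{ 2, 5, 6, 6, 6, 6, 6, 6 }, &sums);
}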

Putting it all together: (full code here)

const std = @import("std");

const WIDTH = 16; // How many elements to operate on at once

/// Compresses the elements in `data` corresponding to the `condition` vector.
/// Writes to `dest`, including a number of undefined bytes.
/// In total, this expression gives the number of bytes written past `dest`:
/// switch (WIDTH) {
///    8, 16, 32 => (64 - WIDTH) + 32,
///    64 => 64,
/// }
export fn compress(data: @Vector(64, u8), condition: @Vector(64, u8), dest: [*]u8) u8 {
    const U = std.meta.Int(.unsigned, WIDTH*2);
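    // Identity counting vector in interleaved order; masking iota with (WIDTH - 1) restarts the count at every group of WIDTH elements.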
    const indices = comptime @as(@Vector(64, u8), @bitCast(std.simd.deinterlace(4, std.simd.iota(u8, 64) & @as(@Vector(64, u8), @splat(WIDTH - 1)))));

    var prefix_sums = @select(u8, condition != @as(@Vector(64, u8), @splat(0)),
        @as(@Vector(64, u8), @splat(255)),
        @as(@Vector(64, u8), @splat(0)),
    );

    // Next, shift elements right by successive powers of two (1, 2, ..., WIDTH/2), accumulating at each step
    inline for (0..std.math.log2(WIDTH)) |i| {
        prefix_sums +%= shiftInterleavedElementsRight(prefix_sums, 1 << i, U);
    }

    comptime var prefix_sum_multiplier = 0;
    inline for (0..64 / WIDTH) |i| prefix_sum_multiplier |= 1 << i*WIDTH;
    const prefix_sum_of_offsets: [8]u8 = @bitCast(
        @as([2]u64, @bitCast(
            uzp2(
                neg(
                    @as([4]@Vector(16, u8), @bitCast(prefix_sums))[3]
                )
            )
        ))[0] *% prefix_sum_multiplier
    );

    // Now take the identity indices and add it to the prefix_sums.
    // This value tells us how far each value should be left-shifted
    var travel_distances = indices +% shiftInterleavedElementsRight(prefix_sums, 1, U);
    var compressed_data = data;

    inline for (0..std.math.log2(WIDTH)) |x| {
        const i = 1 << x;
        const shifted_left = shiftInterleavedElementsLeft(travel_distances, i, U);
        const shifted_compressed_data = shiftInterleavedElementsLeft(compressed_data, i, U);
        const selector = cmtst(shifted_left, @splat(i));
        travel_distances = bsl(selector, shifted_left, travel_distances);
        compressed_data = bsl(selector, shifted_compressed_data, compressed_data);
    }

    inline for (0..64 / WIDTH) |i| {
        (if (WIDTH == 64) st4 else st4_first_32)(
            dest[if (i == 0) 0 else prefix_sum_of_offsets[i*(WIDTH / 8) - 1]..],
            shiftInterleavedElementsLeft(compressed_data, WIDTH*i, u128)
        );
    }

    return prefix_sum_of_offsets[7];
}

This approach is subject to a number of issues.

Technique II: Lookup table§

Unfortunately, it seems that even with those issues fixed, it’s still going to be more efficient to use a lookup table, if you can afford to consume 2KiB of your precious cache. (Godbolt link)

export fn compress(interleaved_data: @Vector(64, u8), bitstring: u64, dest: [*]u8) void {
    comptime var lookups: [256]@Vector(8, u8) = undefined;
    comptime {
        @setEvalBranchQuota(100000);
        for (&lookups, 0..) |*slot, i| {
            var pos: u8 = 0;
            for (0..8) |j| {
                const bit: u1 = @truncate(i >> j);
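                // Within one 8-byte group, original byte j lives at index (j & 3) * 16 + j / 4 of the
                // interleaved tbl4 table; the per-group offset of 2 * group_index is added at lookup
                // time below. E.g. lookups[0b00010110] begins { 16, 32, 1, 255, ... }.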
                slot[pos] = j / 4 + (j & 3) * 16;
                pos += bit;
            }

            for (pos..8) |j| {
                slot[j] = 255;
            }
        }
    }

    const chunks: [4]@Vector(16, u8) = @bitCast(interleaved_data);

    // Exclusive prefix sum of the per-byte popcounts: byte i holds the number of kept bytes
    // in groups 0..i-1 (the multiplier's low byte is zero, so a group's own count is excluded),
    // i.e. the offset into `dest` at which group i's compressed bytes should start.
    const prefix_sum_of_popcounts =
        @as(u64, @bitCast(@as(@Vector(8, u8), @popCount(@as(@Vector(8, u8), @bitCast(bitstring))))))
            *% 0x0101010101010100;

    inline for (@as([8]u8, @bitCast(bitstring)), @as([8]u8, @bitCast(prefix_sum_of_popcounts)), 0..)
    |byte, pos, i| {
        dest[pos..][0..8].* = tbl4(
            chunks[0],
            chunks[1],
            chunks[2],
            chunks[3],
            lookups[byte] +| @as(@Vector(8, u8), @splat(2*i))
        );
    }
}

fn tbl4(
    table_part_1: @Vector(16, u8),
    table_part_2: @Vector(16, u8),
    table_part_3: @Vector(16, u8),
    table_part_4: @Vector(16, u8),
    indices: @Vector(8, u8)
) @TypeOf(indices) {
    return struct {
        extern fn @"llvm.aarch64.neon.tbl4"(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(table_part_3), @TypeOf(table_part_4), @TypeOf(indices)) @TypeOf(indices);
    }.@"llvm.aarch64.neon.tbl4"(table_part_1, table_part_2, table_part_3, table_part_4, indices);
}

This technique also has to use tbl4 because it is deinterleaving the data at the same time as compressing. In normal space, you would just use tbl1. But hey, as long as there is no serial dependency, you only eat the latency once.

Now go compress your interleaved vectors, you glorious vectorizers!

‒ Validark
