Bit-permuting 16 u32s at once with AVX-512
The basic trick for applying the same bit-permutation to each of the u32s is to view them as a matrix of 16 rows by 32 columns, transpose it into 32 u16s, permute those u16s in the same way that we wanted to permute the bits of the u32s [1], then transpose back to 16 u32s. Easy:
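#include <immintrin.h>

// _mm512_permutexvar_epi16 requires AVX-512BW.
__m512i permbits_16x32(__m512i data, __m512i indices)
{
    __m512i x = data;
    // 16 u32 rows -> 32 u16 rows: word j holds bit j of every u32
    x = transpose_16_dwords_to_32_words(x);
    // moving whole words now moves bit positions: result word i = word indices[i]
    x = _mm512_permutexvar_epi16(indices, x);
    x = transpose_32_words_to_16_dwords(x);
    return x;
}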
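To check the idea without AVX-512 hardware, here is a minimal scalar sketch of the same transpose, permute, transpose-back sequence. It is not the post's implementation: every name ending in _ref is hypothetical and introduced only for illustration, the 32 indices are assumed to be stored as bytes, and the transposes are naive bit loops rather than the vectorized ones. Like _mm512_permutexvar_epi16, it uses only the low 5 bits of each index, so bit j of every result comes from bit idx[j] of the corresponding input.

```c
#include <stdint.h>

/* Hypothetical scalar model: word j collects bit j of each of the 16 dwords. */
static void transpose_16x32_ref(const uint32_t in[16], uint16_t out[32])
{
    for (int j = 0; j < 32; j++) {
        uint16_t w = 0;
        for (int i = 0; i < 16; i++)
            w |= (uint16_t)(((in[i] >> j) & 1u) << i);
        out[j] = w;
    }
}

/* Inverse of the above: dword i collects bit i of each of the 32 words. */
static void transpose_32x16_ref(const uint16_t in[32], uint32_t out[16])
{
    for (int i = 0; i < 16; i++) {
        uint32_t d = 0;
        for (int j = 0; j < 32; j++)
            d |= ((uint32_t)(in[j] >> i) & 1u) << j;
        out[i] = d;
    }
}

/* Direct definition for one u32: result bit j = input bit idx[j]. */
static uint32_t permbits_u32_ref(uint32_t x, const uint8_t idx[32])
{
    uint32_t r = 0;
    for (int j = 0; j < 32; j++)
        r |= ((x >> (idx[j] & 31)) & 1u) << j;
    return r;
}

/* Same result for 16 u32s at once via transpose, word permute, transpose back. */
static void permbits_16x32_ref(const uint32_t in[16], const uint8_t idx[32],
                               uint32_t out[16])
{
    uint16_t t[32], p[32];
    transpose_16x32_ref(in, t);
    for (int j = 0; j < 32; j++)
        p[j] = t[idx[j] & 31];   /* mirrors the word permute: dst[j] = src[idx[j]] */
    transpose_32x16_ref(p, out);
}
```

Comparing permbits_16x32_ref against permbits_u32_ref applied to each element separately is a cheap way to validate a vectorized version on random inputs.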
transpose_16_dwo...
Read more at bitmath.blogspot.com