I have so many good ideas for polyfilling SIMD instructions on older instruction sets, and I don't know how to put them into my library properly.
You want PCMPEQQ but don't have SSE4.1? No worries: do a PCMPEQD, use PSHUFD to swap the two dwords within each qword, then PAND that with the original result (3 dependent instructions, so about 3 cycles of latency).
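A minimal sketch of that with SSE2 intrinsics, assuming GCC or Clang on x86-64. `cmpeq_epi64_sse2` is a made-up name, not an existing intrinsic:

```c
#include <emmintrin.h> /* SSE2 intrinsics (x86 / x86-64) */

/* PCMPEQQ stand-in for SSE2-only targets (the real _mm_cmpeq_epi64 needs SSE4.1).
   Hypothetical helper name. */
static __m128i cmpeq_epi64_sse2(__m128i a, __m128i b)
{
    /* PCMPEQD: each 32-bit lane becomes all-ones where a and b match. */
    __m128i eq32 = _mm_cmpeq_epi32(a, b);
    /* PSHUFD with 0xB1: dword order 1,0,3,2, i.e. swap the halves of each qword. */
    __m128i swapped = _mm_shuffle_epi32(eq32, _MM_SHUFFLE(2, 3, 0, 1));
    /* PAND: a qword stays all-ones only if both of its dwords matched. */
    return _mm_and_si128(eq32, swapped);
}
```

For a library, one option is to call the real _mm_cmpeq_epi64 behind `#ifdef __SSE4_1__` and fall back to this otherwise.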
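PSRLB (x86 has no byte-granularity shifts at all)? Just PSRLD by the same count n and PAND each byte with 0xFF >> n to clear the bits that leaked in from the neighbouring byte (2 cycles).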
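Roughly, in SSE2 intrinsics (GCC/Clang, x86-64). The function name is hypothetical, and the sketch assumes a shift count of 0..7:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <assert.h>

/* Stand-in for a byte-wise logical right shift ("PSRLB"), which x86 lacks.
   Hypothetical name; assumes 0 <= n <= 7. */
static __m128i srl_epi8_sse2(__m128i v, int n)
{
    assert(n >= 0 && n <= 7);
    /* PSRLD: shift each 32-bit lane. The top n bits of every byte except the
       highest one in each dword now hold bits from the byte above it. */
    __m128i shifted = _mm_srl_epi32(v, _mm_cvtsi32_si128(n));
    /* PAND with 0xFF >> n in every byte clears those leaked bits. */
    return _mm_and_si128(shifted, _mm_set1_epi8((char)(0xFF >> n)));
}
```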
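PSRAQ (AVX-512 only)? Use PSRLQ, then OR the result with the negation of the shifted sign bit: PSRLQ, PAND to isolate the old MSB (now at bit 63 - n), PSUBQ from zero (which sets bits 63 - n through 63 if the sign was 1), POR. That's 4 cycles. For PSRAB, do the same with PSUBB so the negation stays inside each byte, plus one extra PAND with 0xFF >> n (running in parallel with the PAND / PSUB chain) to mask out the high bits that leaked in from the neighbouring byte.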
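A sketch of both with SSE2 intrinsics, again with made-up function names and assuming the shift count is in range (0..63 for qwords, 0..7 for bytes). The real VPSRAQ fills with sign bits for counts of 64 or more, which this doesn't handle. The byte version uses PSRLQ for the wide shift, but any width works:

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <assert.h>

/* "PSRAQ" on SSE2. Hypothetical name; assumes 0 <= n <= 63. */
static __m128i sra_epi64_sse2(__m128i v, int n)
{
    assert(n >= 0 && n <= 63);
    __m128i shifted = _mm_srl_epi64(v, _mm_cvtsi32_si128(n));          /* PSRLQ */
    /* PAND: isolate the old sign bit, which now sits at bit 63 - n. */
    __m128i sign = _mm_and_si128(shifted, _mm_set1_epi64x((long long)(1ULL << (63 - n))));
    /* PSUBQ from zero: 0 - 2^(63-n) sets bits 63-n..63; 0 stays 0. */
    __m128i fill = _mm_sub_epi64(_mm_setzero_si128(), sign);
    return _mm_or_si128(shifted, fill);                                 /* POR */
}

/* "PSRAB" on SSE2. Hypothetical name; assumes 0 <= n <= 7. */
static __m128i sra_epi8_sse2(__m128i v, int n)
{
    assert(n >= 0 && n <= 7);
    __m128i shifted = _mm_srl_epi64(v, _mm_cvtsi32_si128(n));          /* wide shift */
    /* Extra PAND: clear the top n bits of each byte (leaked from the byte above).
       It doesn't depend on the sign chain below, so it can run in parallel. */
    __m128i low = _mm_and_si128(shifted, _mm_set1_epi8((char)(0xFF >> n)));
    /* Each byte's old MSB is now at bit 7 - n of that same byte. */
    __m128i sign = _mm_and_si128(shifted, _mm_set1_epi8((char)(0x80 >> n)));
    /* PSUBB from zero negates per byte, setting bits 7-n..7 where the sign was 1. */
    __m128i fill = _mm_sub_epi8(_mm_setzero_si128(), sign);
    return _mm_or_si128(low, fill);                                     /* POR */
}
```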
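Want a VPOPCNTB for cheap? Split each byte into nibbles (PAND for the low half, PSRLW + PAND for the high half), use each as a PSHUFB index into a 16-entry popcount table, and PADDB the two results. Anything with SSSE3 can do that in about 4 cycles of latency (shift, mask, shuffle, add on the high-nibble path). For VPLZCNTB / VPTZCNTB, use PMINUB instead of PADDB, with each table biased for its nibble's position and returning 8 for a zero nibble. For example, for lzcnt the high table holds lzcnt(h << 4) and the low table holds 4 + lzcnt4(l). PMAXUB works too if the tables store 8 minus the count, followed by a final PSUBB from 8.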
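A minimal sketch of the nibble-table approach, assuming GCC or Clang on x86-64. The `target("ssse3")` attribute lets it compile without -mssse3, but the CPU still needs SSSE3 at runtime. All function names, and the exact table layouts, are illustrative choices rather than anything standard:

```c
#include <emmintrin.h> /* SSE2 */
#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 (PSHUFB) */

/* GCC/Clang attribute so these compile without -mssse3. */
#define TARGET_SSSE3 __attribute__((target("ssse3")))

/* Look up every byte's low and high nibble in its own 16-entry table.
   The two PSHUFBs are independent of each other. */
static TARGET_SSSE3 void nibble_lookup(__m128i v, __m128i lut_lo, __m128i lut_hi,
                                       __m128i *from_lo, __m128i *from_hi)
{
    const __m128i nib = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(v, nib);                    /* PAND */
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nib); /* PSRLW + PAND */
    *from_lo = _mm_shuffle_epi8(lut_lo, lo);               /* PSHUFB */
    *from_hi = _mm_shuffle_epi8(lut_hi, hi);               /* PSHUFB */
}

/* VPOPCNTB stand-in: popcount(low nibble) + popcount(high nibble). */
static TARGET_SSSE3 __m128i popcnt_epi8_ssse3(__m128i v)
{
    const __m128i pop4 = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
    __m128i a, b;
    nibble_lookup(v, pop4, pop4, &a, &b);
    return _mm_add_epi8(a, b);                             /* PADDB */
}

/* Per-byte lzcnt: min(lzcnt of byte l, lzcnt of byte h << 4), 8 for a zero nibble. */
static TARGET_SSSE3 __m128i lzcnt_epi8_ssse3(__m128i v)
{
    const __m128i lz_lo = _mm_setr_epi8(8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4);
    const __m128i lz_hi = _mm_setr_epi8(8, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0);
    __m128i a, b;
    nibble_lookup(v, lz_lo, lz_hi, &a, &b);
    return _mm_min_epu8(a, b);                             /* PMINUB */
}

/* Per-byte tzcnt: min(tzcnt4(l), 4 + tzcnt4(h)), again 8 for a zero nibble. */
static TARGET_SSSE3 __m128i tzcnt_epi8_ssse3(__m128i v)
{
    const __m128i tz_lo = _mm_setr_epi8(8, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0);
    const __m128i tz_hi = _mm_setr_epi8(8, 4, 5, 4, 6, 4, 5, 4, 7, 4, 5, 4, 6, 4, 5, 4);
    __m128i a, b;
    nibble_lookup(v, tz_lo, tz_hi, &a, &b);
    return _mm_min_epu8(a, b);                             /* PMINUB */
}
```

The three ops differ only in their tables and in the instruction that combines the two lookups. In a library it probably makes sense to keep a single nibble-lookup helper and pass the tables in.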