Awesome addition! Would it make sense to use the PDEP instruction from x86's BMI2 extension, or is the interleave computation too small a share of the runtime to justify code that is harder to port? Also, I think it needs a bit more documentation to explain the logic, e.g. a link to https://stackoverflow.com/questions/39490345/interleave-bits-efficiently. Thanks for making it faster :)
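
To make the PDEP question concrete, here is a rough sketch of what I have in mind. It is not taken from this PR: the function names (`spread_bits16`, `interleave16_portable`, `interleave16`), the 16-bit input width, and the compile-time `__BMI2__` check are all my own assumptions, so treat it as an illustration rather than a drop-in patch. The portable path is the usual shift-and-mask approach, and the PDEP path is only compiled when the compiler is told BMI2 is available (e.g. `-mbmi2`).

```c
#include <stdint.h>
#if defined(__BMI2__)
#include <immintrin.h>  /* _pdep_u32 */
#endif

/* Hypothetical helper: spread the low 16 bits of x into the even bit
 * positions of a 32-bit word (bit i moves to bit 2*i). */
static uint32_t spread_bits16(uint32_t x)
{
    x &= 0x0000FFFFu;
    x = (x | (x << 8)) & 0x00FF00FFu;
    x = (x | (x << 4)) & 0x0F0F0F0Fu;
    x = (x | (x << 2)) & 0x33333333u;
    x = (x | (x << 1)) & 0x55555555u;
    return x;
}

/* Portable interleave: bits of x land on even positions, bits of y on odd. */
static uint32_t interleave16_portable(uint16_t x, uint16_t y)
{
    return spread_bits16(x) | (spread_bits16(y) << 1);
}

/* Same result, using PDEP when the build targets BMI2 (assumption: the
 * decision is made at compile time, not via runtime CPU detection). */
static uint32_t interleave16(uint16_t x, uint16_t y)
{
#if defined(__BMI2__)
    return _pdep_u32(x, 0x55555555u) | _pdep_u32(y, 0xAAAAAAAAu);
#else
    return interleave16_portable(x, y);
#endif
}
```

If the interleave turns out to be a negligible part of the profile, the portable version alone is probably enough and the `#if` can go.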