This sounds familiar. If you search the archives from around 2004, I
think you'll find a similar discussion from when we replaced the
crc32 implementation with what we have now. We put a fair amount of
effort into searching for faster implementations, so if you've found
one that is 3x faster I'm pretty startled. Are you sure it's faster on all
architectures, rather than a win on some and a loss on others? And are
you sure it's faster in our use case, where we often compute CRCs over
small sequences of data rather than over one large block?
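
As a rough illustration of that last distinction, here is a minimal
sketch of a timing harness that checksums the same bytes two ways: as
many short records and as one large block. It uses Python's zlib.crc32
purely as a stand-in, not the implementation under discussion. The
64-byte record size, the 1 MiB total, and all function names are
assumptions made up for the example. Python's per-call overhead inflates
the small-record case, so this only shows the shape of the comparison; a
real test would call the candidate C routines directly on each
architecture of interest.

```python
import time
import zlib


def crc_many_small(chunks):
    """Checksum each short record on its own (the many-small-inputs case)."""
    return [zlib.crc32(chunk) for chunk in chunks]


def crc_one_large(block):
    """Checksum a single large buffer in one call."""
    return zlib.crc32(block)


def best_time(fn, arg, repeats=5):
    """Return the fastest of several runs to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical sizes chosen only for illustration.
TOTAL_BYTES = 1 << 20   # 1 MiB of input in total
RECORD_SIZE = 64        # length of each short record

block = bytes(range(256)) * (TOTAL_BYTES // 256)
chunks = [block[i:i + RECORD_SIZE] for i in range(0, TOTAL_BYTES, RECORD_SIZE)]

small_time = best_time(crc_many_small, chunks)
large_time = best_time(crc_one_large, block)

print(f"{len(chunks)} records of {RECORD_SIZE} bytes: {small_time:.6f} s")
print(f"1 block of {len(block)} bytes: {large_time:.6f} s")
```

An implementation that wins on the large block can still lose on the
short records if it has a higher fixed cost per call, which is why both
numbers need to be reported.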