Calculate the checksum 8 bytes at a time using the slicing-by-8 algorithm.
This is the fastest implementation, but it requires an 8 KiB lookup table.
Most modern processors have enough cache to hold this table without
thrashing.
This is the default implementation choice. Choose this one unless
you have a good reason not to.
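The slicing-by-8 approach described above can be sketched as follows. This is a minimal illustration, not this library's implementation: it assumes a reflected CRC-32 with the IEEE polynomial `0xEDB88320` (the text does not say which CRC is used), and the names `make_tables`, `TABLES`, and `crc32_slice8` are hypothetical. Under that 32-bit assumption, the lookup table is eight 256-entry tables of 4-byte values, i.e. 8 × 256 × 4 = 8192 bytes = 8 KiB. Table `k` holds the CRC of byte `i` followed by `k` zero bytes, so all eight bytes of a block can be folded in with eight independent lookups instead of eight dependent ones.

```python
# Assumption: reflected CRC-32 (IEEE 802.3) polynomial; the source does not name the CRC.
POLY = 0xEDB88320


def make_tables(poly: int = POLY) -> list[list[int]]:
    """Build 8 lookup tables of 256 32-bit entries each (8 KiB when stored as uint32)."""
    tables = [[0] * 256 for _ in range(8)]
    # Table 0: ordinary byte-at-a-time CRC table.
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        tables[0][i] = crc
    # Table k: CRC of byte i followed by k zero bytes.
    for i in range(256):
        crc = tables[0][i]
        for k in range(1, 8):
            crc = (crc >> 8) ^ tables[0][crc & 0xFF]
            tables[k][i] = crc
    return tables


TABLES = make_tables()


def crc32_slice8(data: bytes, crc: int = 0) -> int:
    """Hypothetical slicing-by-8 CRC-32; `crc` chains like zlib.crc32's second argument."""
    t = TABLES
    crc ^= 0xFFFFFFFF
    i, n = 0, len(data)
    # Main loop: consume 8 bytes per iteration.
    while n - i >= 8:
        # Fold the running CRC into the first 4 bytes (little-endian word).
        one = crc ^ int.from_bytes(data[i:i + 4], "little")
        two = int.from_bytes(data[i + 4:i + 8], "little")
        # The earliest byte needs the most zero-byte shifts, so it uses table 7.
        crc = (t[7][one & 0xFF] ^ t[6][(one >> 8) & 0xFF]
               ^ t[5][(one >> 16) & 0xFF] ^ t[4][one >> 24]
               ^ t[3][two & 0xFF] ^ t[2][(two >> 8) & 0xFF]
               ^ t[1][(two >> 16) & 0xFF] ^ t[0][two >> 24])
        i += 8
    # Tail: fewer than 8 bytes left, fall back to byte-at-a-time.
    while i < n:
        crc = (crc >> 8) ^ t[0][(crc ^ data[i]) & 0xFF]
        i += 1
    return crc ^ 0xFFFFFFFF
```

For example, `hex(crc32_slice8(b"123456789"))` gives `0xcbf43926`, the standard CRC-32 check value. The byte-at-a-time tail loop is also the whole algorithm for the smaller 1 KiB single-table variant; slicing-by-8 trades seven extra tables for fewer dependent steps per byte.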