0.9.6.14:
Faster allocation on x86-64 (25% speedup on memory-intensive
CL-BENCH tests, 5% on more general workloads such as COMPILER):
* Inline allocation was using a memory-to-register XCHG
(16-cycle latency on Athlon 64) on the fast path. Use a temporary
register instead.
* Change the temp-tn from r13 to r11, which has a shorter
encoding (resulting in a smaller core and better icache behaviour).
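
A rough illustration of the first change, written in C rather than as
the actual VOP code. The struct and function names are hypothetical,
and the real fast path may differ in detail. The idea is that the free
pointer is loaded into a temporary, bounds-checked, and the bumped
value stored back with plain moves, so no XCHG against memory is needed:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocation region; the names are illustrative only. */
struct region {
    uintptr_t free_pointer; /* next unallocated byte */
    uintptr_t end_addr;     /* one past the last usable byte */
};

/* Fast path of a pointer-bump allocator.
 * Returns the start address of the new object, or 0 when the region
 * is exhausted and the caller has to fall back to the slow path. */
static uintptr_t alloc_fast(struct region *r, size_t nbytes)
{
    uintptr_t start = r->free_pointer;  /* plain load into a temporary */

    if (nbytes > r->end_addr - start)   /* not enough room left */
        return 0;

    r->free_pointer = start + nbytes;   /* plain store, no XCHG */
    return start;                       /* old free pointer is the object */
}
```

The temporary plays the role that the XCHG previously filled: it keeps
the old free pointer available as the result while the new value is
written back to memory.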
TODO: Check whether the XCHG issue also caused the bizarre P4
performance problems with the (disabled) x86 inline
allocation support, and whether anything can be done to fix
them. Using the same solution on x86 is probably impossible due to
the lack of extra registers.