Micro-optimize DOUBLE-FLOAT-LOW-BITS on x68-64.
Instead of loading a 64-bit register from memory and zeroing the upper
32 bits of it by the sequence SHL reg, 32; SHR reg, 32 simply load the
corresponding 32-bit register from memory, relying on the implicit
zero-extension to 64 bits this does. This is smaller and faster.
For example, if the input to the VOP is a descriptor register, the old
instruction sequence is:
MOV RDX, [RDX-7]
SHL RDX, 32
SHR RDX, 32
and the new one:
MOV EDX, [RDX-7]
Regarding store-to-load forwarding this change should make no
difference: Most current processors can forward a 64-bit store to a
32-bit load from the same address. The exception is Intel's Atom which
can forward only to a load of the same size as the store; but it also
supports this only between integer registers, and DOUBLE-FLOAT-LOW-BITS
mostly or even always acts on memory slots written from an XMM register
(of the three storage classes it supports as input, for the first it
does the store itself from an XMM register; for the other two I have
investigated some disassemblies and always found the prior store to be
from am XMM register).