repo.macrolet.net Git - sbcl.git/log

Complete SSE instruction definitions for x86-64

* New instruction formats:
  - 2-byte instructions with GP/mem source and XMM destination.
  - 1- and 2-byte instructions with XMM source and GP/mem destination.
  - F3-escape instructions with GP/mem source and GP destination.
  - 2-byte instructions with GP/mem source and GP destination.

* Complete support for SSE instruction sets:
  - SSE3
  - SSSE3
  - SSE4.1
  - SSE4.2

* Fix definition of pblendvb, blendvps, blendvpd: These require a third operand,
   implicitly in XMM0.

* PEXTRW has a new 2-byte encoding in SSE4.1 which allows a memory address as
   the destination operand. The new encoding is only used when dst is a memory
   address, otherwise the old backward-compatible encoding is used.

* Fix 64-bit popcnt (F3 still comes before REX.W), and make it check for operand sizes,
   like the new CRC32.

* Slightly adapted from Jonathan Armond to work with Douglas Katzman's F3-specific
   r, r/m instruction format.

Export SB-SIMD-PACK symbols from SB-EXT

Export the SIMD-PACK type, the SIMD-PACK-P predicate,
%make-simd-pack-{ub32,ub64,single,double}, and
%simd-pack-{ub32s,ub64s,singles,doubles}.

These are far from useful yet, but at least future extensions
can work with SB-EXT instead of SB-KERNEL.

Also, say so in NEWS.

SB-SIMD-PACK on x86-64

* Enable them by default on x86-64;

* And run some smoke tests, at least.

Additional niceties and middle end support for short vector SIMD packs

* Allow FASL loading/dumping of (boxed) SIMD packs, and mark them as
   trivially (i.e. without going through make-load-form) dumpable.

* SIMD packs print nicely, and take the element type into account while
   doing so.

* (C)TYPE-OF is more accurate for SIMD packs; this enables IR2 conversion
   to choose the right primitive type and storage class for constants.

The FASL code was kept on life support by Alexander Gavrilov for too many years,
and the printing logic is a very light adaptation of the output code he developed
for his branch.

Back end work for short vector SIMD packs

* Platform-agnostic changes:
   - Declare type testing/checking routines.
   - Define three primitive types: simd-pack-double for packs
     of doubles, simd-pack-single for packs of singles, and
     simd-pack-int for packs of integer/unknown.
   - Define a heap-representation for 128-bit SIMD packs,
     along with reserving a widetag and filling the corresponding
     entries in gencgc's tables.
   - Make the simd-pack class definition fully concrete.
   - Teach IR1 how to expand SIMD-PACK type checks.
   - IR2-conversion maps SIMD-PACK types to the right primitive type.
   - Increase the limit on the number of storage classes: SIMD packs
     went way past the previous (arbitrary?) limit of 40.

* Platform-specific changes, in src/compiler/target/simd-pack:
   - Create new storage classes (that are backed by the float-reg [i.e. SSE]
     storage base): one for each of double, single and integer sse packs.
   - Also create the corresponding immediate-constant and stack storage
     classes.
   - Teach the assembler and the inline constant code about this new kind
     of registers/constants, and how to map constant SIMD-PACKs to which SC.
   - Define movement/conversion VOPs for SSE packs, along with VOP routines
     needed for basic creation/manipulation of SSE packs.
   - The type-checking VOP in generic/late-type-vops is extremely
     x86-64-specific... IIRC, there are ordering issues I do not
     want to tangle with.

* Implementation idiosyncrasy: while type *tests* (i.e. TYPEP calls) consider
   the element type, type *checks* (e.g. THE or DECLARE) only check for
   SIMD-PACKness, without looking at the element type.  This is allowed by the
   standard, is similar to what Python does for FUNCTION types, and helps
   code remain efficient even when type checks can't be fully elided.

The vast majority of the code is verbatim or heavily inspired by Alexander
Gavrilov's branch.

Front end infrastructure for short vector SIMD packs

* new feature, sb-simd-pack.

* define a new IR1 type for SIMD packs:
   - (SB!KERNEL:SIMD-PACK [eltype]), where [eltype] is a subtype
     of the platform-specific SIMD element type universe, or * (default),
     the union of all these possibilities;
   - Element types are always upgraded to the platform's element type
     (small) universe, so we can easily manipulate unions of SIMD-PACK
     types by working in terms of the element types.

* immediately specify the universe of SIMD pack element types
   (sb!kernel:*simd-pack-element-types*) for x86-64, to ensure
   #!+sb-simd-pack buildability.

* declare basic functions to create/manipulate SIMD packs:
   - simd-pack-p is the basic type predicate;
   - %simd-pack-tag returns a fixnum tag associated with each SIMD-PACK;
     currently, we suppose it only encodes the element type, as the
     position of the element type in *simd-pack-element-types*;
   - %make-simd-pack creates a 128-bit SIMD pack from a tag and two
     64 bit integers;
   - %make-simd-pack-double creates an appropriately-tagged pack from
     two double floats;
   - %make-simd-pack-single creates a tagged pack from four single
     floats;
   - %make-simd-pack-ub{32,64} creates a tagged pack from four 32 bit
     or two 64 bit integers;
   - %simd-pack-{low,high} returns the low/high integer half of a
     128 bit pack;
   - %simd-pack-ub{32,64}s returns the four integer quarters or two
     integer halves of a 128 bit pack;
   - %simd-pack-singles returns the four singles in a 128 bit pack;
   - %simd-pack-doubles returns the two doubles in a 128 bit pack.
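As a rough illustration of how the integer constructors and accessors
listed above relate to one another, the following Python sketch models a
128-bit pack as two 64-bit halves. The function names are hypothetical
stand-ins, not SBCL's API, and the lane order (element 0 in the least
significant bits) is an assumption based on x86-64 being little-endian.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def make_pack_ub32(a, b, c, d):
    """Model of building a pack from four 32-bit integers.

    Returns the pack as a (low, high) pair of 64-bit integers.
    Assumption: element 0 occupies the lowest 32 bits.
    """
    for value in (a, b, c, d):
        if not 0 <= value <= MASK32:
            raise ValueError("element does not fit in 32 bits")
    low = a | (b << 32)
    high = c | (d << 32)
    return low, high

def pack_ub32s(low, high):
    """Return the four 32-bit quarters of a pack."""
    return (low & MASK32, low >> 32, high & MASK32, high >> 32)

def pack_ub64s(low, high):
    """Return the two 64-bit halves of a pack."""
    return (low & MASK64, high & MASK64)
```

In this model, make_pack_ub32(1, 2, 3, 4) gives the halves
#x0000000200000001 and #x0000000400000003, and pack_ub32s recovers the
four original elements.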

Alexander Gavrilov kept a branch alive for the last couple years. The
creation/manipulation primitives are largely taken from that branch,
or informed by the branch's usage.

Fix foreign-symbol-address transform on +sb-dynamic-core.

A badly placed ` was producing a wrong result.

Make some instances of IF/IF conversion more direct

When faced with CFGs that look like (if (if ...) ...), we duplicate
the outer NULL test forward in the branches (and jump to the correct
branch, so very little code is duplicated). However, this transform
depends on later ir1 optimisation to handle patterns like
(if (if ... nil t) ...). Try and get them right with a specialised
rewrite to get good code even when ir1opt doesn't run until fixpoint.

Also, refactored the code a bit while working on it.

Exploit specialised VOPs for EQL of anything/constant fixnum

By swapping constant arguments to the right ourselves before
strength reducing EQL into EQ, rather than erroneously using
commutative-arg-swap.

Spotted by Douglas Katzman.

More efficient integer=>word conversion and fixnump tests on x86-64

* Special-case on 63-bit fixnums to detect non-zero fixnum tag bits
with a shift right when converting fixnum-or-bignum to ub64.

* In fixnump/unsigned-byte-64, use MOVE to avoid useless mov x, x.

* In fixnump/signed-byte-64, use the conversion's left shift to
detect overflows.

* Based on a patch by Douglas Katzman.
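The two shift tricks above can be modelled in a few lines of Python. This
is only a sketch under stated assumptions: 64-bit words, a single fixnum
tag bit whose value is zero for fixnums, and helper names that are
invented for illustration. A right shift both untags the word and exposes
the tag bit (as the carry), and a left shift both tags the value and
reveals whether it overflowed the fixnum range.

```python
WORD_BITS = 64
TAG_BITS = 1  # assumption: 63-bit fixnums, tag in the lowest bit, fixnum tag = 0
WORD_MASK = (1 << WORD_BITS) - 1

def to_signed(word):
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return word - (1 << WORD_BITS) if word >> (WORD_BITS - 1) else word

def shift_right_with_carry(word):
    """Arithmetic shift right by one; returns (result, bit shifted out)."""
    carry = word & 1
    result = (to_signed(word) >> 1) & WORD_MASK
    return result, carry

def untag_if_fixnum(word):
    """Return the fixnum's integer value, or None if the tag bit was set."""
    result, carry = shift_right_with_carry(word)
    return None if carry else to_signed(result)

def tag_with_overflow(value):
    """Shift a signed 64-bit value left by one; returns (word, overflowed)."""
    word = (value << TAG_BITS) & WORD_MASK
    overflowed = (to_signed(word) >> TAG_BITS) != value
    return word, overflowed
```

For example, untag_if_fixnum(10) yields 5 while untag_if_fixnum(11)
yields None, and tag_with_overflow(2**62) reports an overflow because
2**62 is just outside the 63-bit fixnum range.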

Cleverer handling of medium (32 < bit width <= 64) constants on x86-64

* Exploit sign-extension for large unsigned constants.

* Always force the remaining operand and the result into a register:
in the worst case, we use a RIP-relative unboxed constant.

* Based on a patch by Douglas Katzman.
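To illustrate the sign-extension point: most x86-64 instructions take a
32-bit immediate that the processor sign-extends to 64 bits, so a large
unsigned 64-bit constant whose upper 33 bits are all ones can still be
encoded as a negative 32-bit immediate. The Python sketch below shows the
test; the function name is hypothetical and this is not the compiler's
actual code.

```python
def as_sign_extended_imm32(constant):
    """Return the signed 32-bit immediate whose sign extension equals the
    64-bit pattern of `constant`, or None if no such immediate exists."""
    if not 0 <= constant < (1 << 64):
        raise ValueError("expected an unsigned 64-bit constant")
    # Reinterpret the unsigned 64-bit pattern as a signed value.
    signed = constant - (1 << 64) if constant >> 63 else constant
    if -(1 << 31) <= signed < (1 << 31):
        return signed
    return None
```

Here #xFFFFFFFFFFFFFFFF maps to the immediate -1, whereas #x80000000 has
no such encoding and would need a register or a RIP-relative constant.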

POPCNT instruction on x86-64

Patch by Douglas Katzman.

Fix disassembly for BT* instructions on x86oids

* A dedicated instruction format gets the details right.

* Patch by Douglas Katzman.

Annotate disassembly with unboxed constant values

* Only on x86-64, for qword-sized values.

* Patch by Douglas Katzman.

Improved local call analysis for inlined higher-order functions

Locall analysis greatly benefits from forwarding function arguments to
their use site. Do that in locall and hopefully trigger further rewrites,
rather than waiting for a separate ir1opt phase to do its magic.

Constant-fold backquote of constant expressions

* There is no guarantee that backquote expressions cons up fresh
   storage, so we are free to allocate (sub)lists or vectors at
   compile-time. In addition to regular constant-folding, perform
   part of LIST/LIST*/APPEND at compile-time.

* Fix one instance of CL:SORT of now-literal data.

* Implement SB!IMPL:PROPER-LIST-P because BACKQ-APPEND needed that.

* Based on a patch by James Y Knight; closes lp#1026439.

Enable (type-directed) constant folding for LOGTEST on x86oids and PPC

* COMBINATION-IMPLEMENTATION-STYLE can return :maybe. Like :default,
   it enables transforms, but transforms can call C-I-S themselves to
   selectively disable rewrites.

* Implement type-directed constant folding for LOGTEST. !x86oids/PPC
   platforms get that for free via inlining.

* Use :maybe to enable all LOGTEST transforms except inlining.

Exploit associativity to fold more constants

* Implement transforms for logand, logior, logxor and logtest to
detect patterns like (f (f x k1) k2) => (f x (f k1 k2)).

* Same for + and * of rational values.

* Similar logic for mask-signed-field: we only need to keep the
narrowest width.
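The rewrite pattern above can be sketched on a toy expression tree. In
this Python model, expressions are tuples of the form (op, argument,
constant); the representation and function names are invented for
illustration and do not mirror the compiler's IR. The second function
shows why only the narrowest width matters for nested mask-signed-field.

```python
import operator

# Associative operators for which (f (f x k1) k2) => (f x (f k1 k2)).
ASSOCIATIVE_OPS = {
    "logand": operator.and_,
    "logior": operator.or_,
    "logxor": operator.xor,
    "+": operator.add,
    "*": operator.mul,
}

def fold_nested(expr):
    """Fold the two constants of a nested call to the same operator."""
    op, inner, k2 = expr
    if (op in ASSOCIATIVE_OPS
            and isinstance(inner, tuple)
            and inner[0] == op
            and isinstance(inner[2], int)
            and isinstance(k2, int)):
        _, x, k1 = inner
        return (op, x, ASSOCIATIVE_OPS[op](k1, k2))
    return expr

def mask_signed_field(size, integer):
    """Low `size` bits of `integer`, reinterpreted as a signed value."""
    if size == 0:
        return 0
    low = integer & ((1 << size) - 1)
    return low - (1 << size) if low >> (size - 1) else low
```

For instance, fold_nested(("logand", ("logand", "x", 0xFF), 0x0F))
returns ("logand", "x", 0x0F), and applying mask_signed_field with widths
w1 and then w2 gives the same value as applying it once with the smaller
of the two.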

room: Fix reconstituting CONS cells with unbound-marker in the CAR.

  * When I originally rewrote ROOM in terms of RECONSTITUTE-OBJECT,
I looked at what constitutes a valid CONS according to the runtime.
I noticed that one of the immediate types was an unbound marker
and said to myself "nobody's going to put one of those in a list".
This turned out to be a mistake.

  * x86 systems (and plausibly not any others) put unbound-markers
in lists when loading FASLs.  I have no real idea how or why, but
they do.  This would lead to an error, "Unrecognized widetag #x4A
in reconstitute-object".

  * Fix, by recording unbound-marker-widetag as being valid as the
first word of a CONS cell.

  * Issue reported by "scymtym" on #sbcl.

gencgc: Decide earlier about pinning large object pages.

  * The old logic here called maybe_adjust_large_object(), and
then re-checked the pointer to preserve for validity.  This is
non-optimal, as it means that maybe_adjust_large_object can't
promote pages to newspace directly, it instead merely adjusts the
page allocation to fit the possibly-shrunken object.

  * It turns out that large_object pages can contain bignums,
vectors, code-objects, or in unusual cases instances.  Neither
bignums, vectors, nor instances can contain embedded objects.
Code-objects can contain only functions or LRAs.  None of these
objects have list-pointer-lowtag on their references.  The "tail"
of a shrunken object is comprised of conses with both cells as
fixnum zero.  The minor catch is that we allow untagged pointers
to pin code-allocated pages, but the saving grace here is that
code-objects don't shrink.

  * Alter preserve_pointer() to test the lowtag and page type to
check for invalid pointers to large-object pages before calling
maybe_adjust_large_object() instead of bounds-checking the pointer
after the fact.

gencgc: Fix potential out-of-bounds access in page_ends_contiguous_block_p().

  * If we're testing to see if the LAST page in dynamic space is
the end of a contiguous block, and it is a full page (bytes_used
is GENCGC_CARD_BYTES), we turn around and start investigating the
next page table entry... but there isn't one, it's beyond the end
of the allocation.

  * Fix, by bounds-testing the page index against the index of the
high-water mark for dynamic space.  This is guaranteed to be no
more than the total maximum for the page table, and is slightly
more micro-efficient than using the actual maximum, as any page
after the high-water mark will be page_free_p().
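The shape of the fix can be shown with a simplified Python model of the
page table. This is a sketch, not the C code from gencgc: the real
predicate tests more conditions, the card size below is illustrative, and
the parameter name used for the high-water mark index is an assumption.
The point is that the bounds test happens before the next page table
entry is touched.

```python
from dataclasses import dataclass

GENCGC_CARD_BYTES = 4096  # illustrative value only

@dataclass
class Page:
    bytes_used: int = 0
    scan_start_offset: int = 0

def page_ends_contiguous_block_p(page_table, index, last_free_page):
    """Simplified model: does page `index` end a contiguous block?

    `last_free_page` is the index of the high-water mark: every page at
    or beyond it is free.
    """
    if page_table[index].bytes_used < GENCGC_CARD_BYTES:
        return True  # page is not full, so nothing continues past it
    if index + 1 >= last_free_page:
        return True  # bounds test: never read past the high-water mark
    next_page = page_table[index + 1]
    # An empty next page, or one that starts its own block, ends this one.
    return next_page.bytes_used == 0 or next_page.scan_start_offset == 0
```

With a table of exactly two full pages and a high-water mark of 2, asking
about page 1 returns true without ever indexing entry 2.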

gencgc: Introduce a new predicate, page_ends_contiguous_block_p().

  * There are a number of places in gencgc where a number of
attributes of a page and possibly the subsequent page are tested
for various values.  Invariably, this is actually testing to see
if a page ends a contiguous block.

  * Extract the various tests to a new inlined predicate function,
page_ends_contiguous_block_p(), thus revealing the intent of
what's going on far better than the bare tests, and coalescing the
code to a single copy to make it easier to fix if there is a bug
in it (and there is, but this is a refactoring commit, not a
behavior change commit).

gencgc: Introduce a new predicate, page_starts_contiguous_block_p().

  * There are a number of places in gencgc where scan_start_offset
for a page is tested for zero.  Invariably, this is actually
testing to see if a page starts a contiguous block...  Or starts
on an object boundary.

  * Extract the various tests for a zero scan_start_offset to a
new inlined predicate function, page_starts_contiguous_block_p(),
thus revealing the intent of what's going on far better than the
bare test.

gencgc: Rename page_table field region_start_offset to scan_start_offset.

* Let's call it what it is: The offset from where to start any
scan through the page to the start of the page. The only relation
this field has to an alloc_region is the way it is initialized.

gencgc: Commentary fix for struct page, field region_start_offset.

  * Simply describing region_start_offset as being related to an
allocation region which contains a page is disingenuous at best,
and misleading at worst.  Its relation with an alloc_region is due
to its initialization strategy, and has nothing to do with what
the value is for.

  * Say it like it is, it's an offset to a known object boundary,
from where we can start a call to gc_search_space() or scavenge().
That's what it's for, not for keeping track of alloc_regions.

gencgc: Defer moving pinned pages to newspace as late as possible.

  * Rather than moving pinned pages to newspace immediately, defer
moving them until just before we start to scavenge (evacuate) all
of the oldspace pages.

  * This, in theory, makes it easier to move pages to newspace if
they are mostly-live, rather than having to allocate new pages for
the data (increasing peak address-space use during GC), assuming
that we know that some page meets such criteria.

  * While we're here, commentary updates also replace an "XX I'd
rather not do this but the GC logic can't cope with not doing it"
with an actual explanation of WHY it needs to be done.  In fact,
commentary updates explain it twice, in two different locations.

gencgc: Fix commentary for page table allocation field.

* The commentary for the page table allocation field was
misleading, presumably not updated when the definitions for the
constants used for its actual contents were last changed, and cost
me a bit of surprise and time spent trying to figure out why core
file saving and loading worked at all.

* Updated the commentary on the allocation field to match
current reality, and added cross-references between the field
itself and the definitions for its contents, so that a future
desync between commentary and reality is less likely.

More robust function-name testing in CUT-TO-WIDTH

Let's use lvar-fun-name instead of replicating half the logic; as
a bonus, modularity transforms now heed NOTINLINE.

Fix (CONCATENATE 'null ...) for generic sequences

* (CONCATENATE 'NULL SEQUENCE1 SEQUENCE2 ...) ensures that SEQUENCE1,
   SEQUENCE2, ... are empty, but only did so for lists and
   vectors. Instead, use new function EMPTYP which works for all
   sequences. EMPTYP is not exported.

* Add generic function SEQUENCE:EMPTYP to which EMPTYP dispatches for
   generic sequences. Methods for lists, vectors and generic sequences
   use NULL or (ZEROP (LENGTH ...)).

* Test cases in seq.impure.lisp.

* Patch by Jan Moringen; fixes lp#1162301.

Print intermediate evaluation results for some ASSERTed expressions

* The reports of errors signaled by ASSERT now print intermediate
  evaluation results under the following conditions:
   1. The ASSERTed expression is known to be a function call.
   2. Arguments in the call are not constants.

* Test the new feature in condition.impure.lisp.

* Original patch from Alexandra Barchunova; closes lp#789497.

Take bitwidth into account in BOOLEAN alien type

Some ABIs (x86/x86-64) allow garbage in unused upper bits of return
values; take that into account when converting BOOLEAN return types
to CL BOOLEANs.
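A minimal Python sketch of the idea, assuming an 8-bit C boolean returned
in a 64-bit register (the width and the function name are assumptions for
illustration): only the bits that belong to the declared type take part
in the truth test.

```python
def alien_boolean_to_lisp(raw_register, bits=8):
    """Truth value of a C boolean returned in a wider register.

    Bits above `bits` may hold garbage and are masked off first.
    """
    mask = (1 << bits) - 1
    return (raw_register & mask) != 0
```

A raw value such as #xDEADBEEF00 is non-zero as a whole register, yet its
low 8 bits are zero, so it converts to false here.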

Declare the argument type for float-radix

Otherwise, inlined copies sometimes skip the type check.

Enable dumping huge (> 64k) pages in genesis

bvectors for a page can be smaller than the page size; zero-padding
explicitly as needed lets the build proceed further along before
failing, when configured with large GC card size.

Make ir1-convert-hairy-lambda safe for non-local exits.

The function it calls may throw a tag, locall-already-let-converted,
which will leave a partially initialized optional-dispatch structure
in new-functionals of the current component, which may cause problems
down the line.

Fixes lp#1180992.

Also add a test-case for f3a2cd.. "Add a stub for %other-pointer-p.".

Free-er form FILTER-LVAR

The DUMMY argument can now be in any argument position. Use that
in CUT-TO-WIDTH instead of ((lambda (...) ...) ...) hack.

More robust FILTER-LVAR through CASTs

* IR1-conversion can insert casts between a combination and its
arguments. Handle that case via principal-lvar{-use}.

* Fixes a regression in b111015 (lp#1181684).

NEWS entries for Unicode normalization work

implement primary and canonical composition, and hence NFC/NFKC

Read in the non-algorithmically-specified composition exclusions from
Unicode's CompositionExclusions.txt file, and generate a hash table
using the concatenated 42 bits of code points. This is a bit of a
sucky hash-table key, particularly on 32-bit platforms; I have a plan
to reduce the key to 24 bits (using some auxiliary information in ucd)
but the advantage of getting this try in is...

... hook in NFC/NFKC into normalization tests, and check that tests
pass.
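For reference, a 42-bit key of the kind described above can be formed by
placing two 21-bit code points side by side. The Python sketch below is
an assumption about the layout (first code point in the high bits) and
uses hypothetical names; the single table entry is the well-known
composition of A with a combining grave accent.

```python
CODE_POINT_BITS = 21  # Unicode code points fit in 21 bits
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1

def composition_key(first, second):
    """Concatenate two code points into one 42-bit integer key."""
    return (first << CODE_POINT_BITS) | second

def split_key(key):
    """Recover the two code points from a 42-bit key."""
    return key >> CODE_POINT_BITS, key & CODE_POINT_MASK

# Toy composition table: U+0041 followed by U+0300 composes to U+00C0.
COMPOSITIONS = {composition_key(0x41, 0x300): 0xC0}
```

Keys built this way exceed 32 bits whenever the first code point is
2048 or larger, which is why they are awkward on 32-bit platforms.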

actually run Part3 of Unicode Normalization tests

better UCD treatment of characters not allocated by Unicode

fixes lp#1178038 (reported by Ken Harris)

finish handling NormalizationTest test vectors

NFC/NFKC still not hooked in, but otherwise complete.

first cut at testing unicode normalization

Parts 0 and 1 from Unicode NormalizationTest.txt, fully tested for
NFD and NFKD.

add a comment about one-basing the character tables

apply recursive decomposition in DECOMPOSE-STRING

We should really precompute the result of the recursion during the build;
working on getting tests up and running so that we can check whether
we've done that correctly.

fix test for Blocked condition in canonical normalization

Would most likely otherwise fail in Jamo with combining characters in
between.

improve normalize-string

* now works on non-simple strings;
* more likely to be correct under #!-sb-unicode

comment on LSTRING implementation

handle Hangul syllable decomposition

Entries for the codepoint range (#xac00 -- #xd7a3) have 1 for
their decomposition-info, a decomposition length of 2 or 3, but
a zero decomposition index (the decomposition is handled
algorithmically instead).
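The algorithmic decomposition referred to here is the arithmetic one from
the Unicode Standard. The Python sketch below implements that standard
algorithm (it is not SBCL's code) and cross-checks it against Python's
own unicodedata module; the function names are chosen for illustration.

```python
import unicodedata

# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588
S_COUNT = L_COUNT * N_COUNT  # 11172 syllables: U+AC00 through U+D7A3

def decompose_hangul(code_point):
    """Return the 2 or 3 Jamo code points a Hangul syllable decomposes to."""
    s_index = code_point - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    leading = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trailing = T_BASE + s_index % T_COUNT
    if trailing == T_BASE:  # no trailing consonant: length-2 decomposition
        return [leading, vowel]
    return [leading, vowel, trailing]

def matches_reference(code_point):
    """Compare against the NFD form computed by Python's unicodedata."""
    expected = unicodedata.normalize("NFD", chr(code_point))
    return "".join(map(chr, decompose_hangul(code_point))) == expected
```

The decomposition length is 2 when the syllable has no trailing
consonant and 3 otherwise, which matches the lengths stored in the table
entries described above.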

work-in-progress towards full normalization support

beginnings of decomposition

Store enough information in output from ucd.lisp to be able to actually
decompose individual characters. Include proof-of-concept implementation
of decomposition, not hooked into anything yet.

delete now-unused code from ucd.dat

Incorporate some decomposition information in ucd table

Oh boy.  This one is quite intricate.  We have two bytes free in
the 8-byte entries for information about characters, so use one of
them to indicate if the character has a decomposition, and if so of
what kind it is.  Adapt the ucd.lisp tools-for-build code to
parse and preserve that information.

However, this causes there to be more than 256 distinct possible
classes of character known to the system: not a problem in principle,
but Teemu Kalvas' implementation of the double indirection depended on
having a one-byte index.  But since Unicode characters are limited to
21 bits, with a careful packing scheme we can in fact steal 3 more bits
for the index, at the cost of needing to do an extra memory reference
and some arithmetic to reconstruct the index.  (In the process, change
the endianness of the ucd.dat filesystem representation, because it's
easier that way).

But wait, there's more.  Before, there were only two kinds of
lower-case characters: those whose upper-case transformation
lowercases back to the original character, and those where there is
no round-trip.  (The former are cl:lower-case-p, the latter aren't).
This gave rise to straightforward implementations of lower-case-p
and friends; in the new world, where there are multiple different
kinds of lower-case characters (with various decomposition classes)
we need to adjust the implementations, still fairly straightforward,
of lower-case-p and related functions.

The extra information provided in the ucd table by this commit
is largely useless on its own; the next step is to incorporate
the actual decomposition data.  Stay tuned.

MORE COMMENT regarding the careful format of the encoded UCD data

update to unicode 6.2

Complete cut-to-width

* Insert logand/mask-signed-field even around references to variables
   in modular arithmetic: avoid recursive rewriting by disabling the
   transform when the destination is a direct logand/mask-signed-field
   combination.

* Fixes lp#1026634 (reported by Anton Marsden on sbcl-devel).

More efficient MASK-SIGNED-FIELD

Word => signed-word and {word, signed-word} => fixnum conversions
are implemented with unchecked move VOPs.

Insert typechecks before RAW-INSTANCE-INIT in structure constructors

* Usually, FTYPE declarations ensure that happens, but multiple
inlining of the same structure constructor causes strangeness.

* Fixed lp#1177703, reported by Jan Moringen.

More robust erroneous local call detection

* When possible, convert known bad calls into calls to error-signaling
stubs.

* Fixes lp#504121 (and likely other occurrences of
"failed AVER (ZEROP (HASH-TABLE-COUNT ...))").

COMPILE-FILE shouldn't "attempt to dump invalid structure" anymore

* When CAST nodes detect definite type mismatch, they are replaced
   with debugging instrumentation to provide source locations at
   compile- and run-time. When code is generated internally, the
   source can include literal internal data structures. Skip those
   when recovering source locations.

* Fixes lp#943953 and a bunch of equally baffling duplicates.

Recover full backtraces with generic arithmetic on x86 and x86-64

* Errors in generic arithmetic (or comparisons) used to hide the caller
   in the backtrace: it was replaced with a frame in the anonymous
   assembly stub.

* Regression since 1.0.24.35, fixes lp#800343.

* Also remove a misleading FIXME in typed-accessor-definitions
   (reported by Matt Novenstern in lp#1171646).

Add a stub for %other-pointer-p.

Otherwise the VOP isn't translated on literal objects, and
sb-sequence:do-sequence stops working on literal vectors. Having a
stub allows constant folding to work.

Reported by adeht on #lisp.

loop: remove code size-estimation.

Loop has a facility to determine whether it's ok to duplicate variable
initialization and stepping code when the variable preceding it has
different initialization and stepping forms. The code which determines
code size is quite strange and it may have been relevant 20 years ago
on primitive implementations, but not anymore, and people who really
care about code size would use functions, which will also improve code
readability.

As a side effect, it fixes a bug which was present in the
estimate-code-size function.

Fixes lp#1178989.

Fix describe-object for characters.

Don't prefix lines with ":_".

early-alieneval: Fix package-related thinko with saved-fp-and-pc logic.

* Forgot a package prefix for GET-LISP-OBJ-ADDRESS, because the
function symbol was exported from SB!ALIEN-INTERNALS yet defined
in SB!ALIEN, and the prefix wouldn't have been necessary from
SB!ALIEN-INTERNALS.

* Thanks to Stas Boukarev for the heads-up.

code/room: Completely rewrite MAP-ALLOCATED-OBJECTS.

  * The old version of M-A-O consisted of bizarre toplevel logic,
a scheme for figuring out what each heap object was and its size
that did not parallel what the garbage collector used and may or
may not have been correct, and relied heavily on inlining to
reduce consing.

  * This new version of M-A-O uses straightforward toplevel logic,
a scheme for figuring out what each heap object is and its size
that directly parallels what the garbage collector uses and is
verifiably correct, and relies heavily on the aligned unboxed
pointer to fixnum equivalence to reduce consing.

  * The new interface to M-A-O no longer includes the optional
"careful" argument, as it gains us nothing once the underlying
mechanism is so obviously correct.  sb-introspect has been updated
appropriately.

  * The way the new implementation walks the heap and page table
requires direct access to a "static" global variable in gencgc.c,
so the "static" attribute has been removed.

  * This implementation has been lightly tested on an x86-64 and
PPC, and it seems to work quite well, but there are still some
fairly obvious non-optimalities in terms of generated code (as
seen in the trace-file output from the cross compiler).  It does
pass the two test cases that exhausted the heap on PPC with the
previous implementation.

code/room: Improve type-format database initialization for simple vector types.

  * There has been a longstanding FIXME comment on a piece of code
which contains a hand-maintained list of specialized vector types
and the shift count for converting the length from elements to
octets.

  * It turns out that all of this information, plus the type names
that we currently do a song-and-dance with INTERN, SUBSEQ, and
MISMATCH to obtain, plus information for the string types, is
available from *SPECIALIZED-ARRAY-ELEMENT-TYPE-PROPERTIES*.  And
*S-A-E-T-P* is guaranteed to be up-to-date, as it's too central to
our implementation of UPGRADED-ARRAY-ELEMENT-TYPE and MAKE-ARRAY
for it to be allowed to break.

  * So, replace nasty KLUDGE of an initialization for simple
vector types with something more principled, making it explicit
which properties need to be derived and which are simply already
available, and picking off the one specialized array type that
needs to be handled differently (SIMPLE-ARRAY-NIL).

NEWS updates.

* Forgot to add a NEWS update to my recent commit involving the
internal-error logic.

* And clarify that only vectors of boxed items may be stack-
allocated on PPC.

code/interr: Hook internal error contexts into the saved-fp-and-pc mechanism.

* This covers the unfortunate case of a signal handler not
having an unbroken stack frame chain to the interrupted context,
which actually occurs on threaded x86-64 FreeBSD systems.

* Use the existing saved-fp-and-pc mechanism (used by
ALIEN-FUNCALL to cover for code compiled with -fomit-frame-pointer) to
treat the internal error context as an alien funcall point.

Allow inlining more calls to INVOKE-WITH-SAVED-FP-AND-PC during XC.

  * The INVOKE-WITH-SAVED-FP-AND-PC mechanism was defined in
ALIENCOMP, which occurs well after the first uses of ALIEN-FUNCALL,
thus preventing it from being inlined when used during XC (by
default, only on x86).

  * Fix, by relocating the mechanism from SB!C to
SB!ALIEN-INTERNALS and from COMPILER;ALIENCOMP to
CODE;EARLY-ALIENEVAL.

  * Also relocate and publish symbols for all of the magic from
SB!ALIEN-INTERNALS.

sb-introspect:find-definition-sources-by-name: more defoptimizer types.

Look for sb-c:ir2-convert and sb-c::stack-allocate-result defoptimizer types.

Make CONTAINING-INTEGER-TYPE take N-WORD-BITS into account.

Replace the hardcoded 32s in the function with N-WORD-BITS. This in turn
allows SOURCE-TRANSFORM-NUMERIC-TYPEP to find better transformations
for type checks for types larger than fixnum but smaller than a machine
word also under 64-bit word size. The most important improvement this
achieves is to avoid generic arithmetic for the bounds tests in these
cases.

For example, the test (TYPEP X '(UNSIGNED-BYTE 63)) runs about four
times as fast on x86-64 with this change.

(Tests for the exact types (SIGNED-BYTE 64) and (UNSIGNED-BYTE 64) are
unaffected as compiling them takes another code path, which already
generates well optimized code.)

Make %EMIT-ALIGNMENT be more friendly to multi-byte NOPs.

When %EMIT-ALIGNMENT needs to tighten the alignment, it used to emit
a fixed-size skip first and an alignment note afterwards. On x86-64,
where block headers are aligned using multi-byte NOPs, this could
lead to emitting one more such NOP than needed to span the desired
range, unnecessarily increasing the number of machine instructions
the processor needs to decode.

To avoid that, change %EMIT-ALIGNMENT to only emit an alignment note
(covering both the fixed-size skip and the alignment note from the
original version) in this situation.

An example of the difference, from the disassembly of
SB-C::FLATTEN-LIST:

Before:

  896: L0:   8F4508           POP QWORD PTR [RBP+8]
  899:       0F1F00           NOP
  89C:       0F1F4000         NOP
  8A0: L1:   4881F917001020   CMP RCX, 537919511

Afterwards:

  896: L0:   8F4508           POP QWORD PTR [RBP+8]
  899:       0F1F8000000000   NOP
  8A0: L1:   4881F917001020   CMP RCX, 537919511
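The effect can be reproduced with a small Python model of NOP padding.
It is a sketch only: the maximum NOP length is an assumption, the
function names are hypothetical, and the 3-byte/4-byte split is read off
the disassembly above rather than taken from the assembler source.

```python
MAX_NOP_BYTES = 9  # assumption: longest single multi-byte NOP available

def nop_sizes(padding):
    """Cover `padding` bytes greedily with as few NOPs as possible."""
    sizes = []
    while padding > 0:
        size = min(padding, MAX_NOP_BYTES)
        sizes.append(size)
        padding -= size
    return sizes

def old_padding(fixed_skip, alignment_fill):
    """Old behaviour: the fixed skip and the alignment are padded separately."""
    return nop_sizes(fixed_skip) + nop_sizes(alignment_fill)

def new_padding(fixed_skip, alignment_fill):
    """New behaviour: one alignment note covers the whole range."""
    return nop_sizes(fixed_skip + alignment_fill)
```

For the 7 bytes between offsets 899 and 8A0, old_padding(3, 4) yields two
NOPs of 3 and 4 bytes, while new_padding(3, 4) yields a single 7-byte NOP.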

Better type derivation for APPEND, NCONC, LIST.

The result types of APPEND/NCONC depend on the last argument and the
presence of conses in the middle.
For example (append 42) => 42, (append nil nil 42) => 42,
(append (list 1) 42) => (1 . 42), etc.
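These results follow from how APPEND is specified: every argument but the
last is copied, and the last argument becomes the tail as-is. The Python
model below illustrates that behaviour with conses as 2-tuples and NIL as
None; it is a toy model of the semantics, not of the type deriver, and
all names are invented for the example.

```python
NIL = None

def cons(car, cdr):
    return (car, cdr)

def lisp_list(*items):
    """Build a proper list of conses ending in NIL."""
    result = NIL
    for item in reversed(items):
        result = cons(item, result)
    return result

def lisp_append(*args):
    """Model of APPEND: copy all but the last argument, share the last."""
    if not args:
        return NIL
    result = args[-1]  # the last argument is returned as the tail, any type
    for lst in reversed(args[:-1]):
        items = []
        while lst is not NIL:
            items.append(lst[0])
            lst = lst[1]
        for item in reversed(items):
            result = cons(item, result)
    return result
```

In this model, lisp_append(NIL, NIL, 42) is 42 because no conses precede
the last argument, while lisp_append(lisp_list(1), 42) is the dotted pair
(1 . 42), so the result is a cons as soon as any earlier argument is
non-empty.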

LIST returns NIL in case of no arguments and a cons in other
cases. That fact required an adjustment for a values-list optimizer,
which removed all arguments from a LIST call making it change the type
from LIST to NULL and confusing things.

Closes lp#538957

Micro-optimize values-list.

Compare a register with nil-value directly, without going through a
temporary register.

sb-introspect:find-definition-sources-by-name: find VOPs by name.

(sb-introspect:find-definition-sources-by-name x :vop) now
also returns VOPs which do not translate any functions.

Committing fix by Doug Katzman: disassembler missing ",8" on SHLD

NEWS: Updates for recent PPC changes.

* I recently committed a small batch of PPC changes, but none of
them had NEWS entries. This was not quite an oversight, as the
original changes were written prior to the 1.1.7 release, but now
that they've been committed they should at least be mentioned as
NEWS.

Correct integer-length on fixnums on x86-64 when n-fixnum-tag-bits > 1.

Use SAR, not SHR for untagging, to preserve the sign.
Thanks to Paul Khuong.
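The difference between the two shifts is easy to see in a Python model of
64-bit words. This sketch assumes three fixnum tag bits purely for the
example, and the helper names are hypothetical: a logical shift (SHR)
fills with zeros and turns a negative fixnum into a huge positive number,
while an arithmetic shift (SAR) keeps the sign.

```python
WORD_BITS = 64
WORD_MASK = (1 << WORD_BITS) - 1

def to_signed(word):
    """Interpret a 64-bit pattern as a two's-complement integer."""
    return word - (1 << WORD_BITS) if word >> (WORD_BITS - 1) else word

def shr(word, count):
    """Logical shift right: vacated high bits are filled with zeros."""
    return word >> count

def sar(word, count):
    """Arithmetic shift right: vacated high bits copy the sign bit."""
    return (to_signed(word) >> count) & WORD_MASK

def integer_length(word):
    """INTEGER-LENGTH of the signed value held in a 64-bit word."""
    value = to_signed(word)
    return value.bit_length() if value >= 0 else (~value).bit_length()
```

With the fixnum -8 tagged as (-8 << 3) masked to 64 bits, untagging with
sar gives an integer length of 3, whereas untagging with shr gives 61.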

tests/dynamic-extent.impure.lisp: One of the dx-vector test terms was misplaced.

* MAKE-ARRAY-ON-STACK-1 tries to create a specialized vector,
but was being called from a test that only claims to handle
vectors suitable for a precisely-scavenged control stack.

* Fix, by moving the call to the next test, which is for
specialized vectors (and thus only runs on conservative-stack
systems).

ppc support for stack-allocatable-vectors

  * This turned out to be fairly straightforward.  Unlike in a
heap-allocation-only regime, where a VOP is required to :TRANSLATE
ALLOCATE-VECTOR, the :STACK-ALLOCATABLE-VECTORS feature enables an
LTN-ANNOTATE optimizer for ALLOCATE-VECTOR that substitutes
invocations of one of two named VOPs.

  * To convert from the old regime to the new, rename the old VOP
to fit the new naming scheme, and write a new VOP to do the stack
allocation.

  * As a cleaning-up-a-loose-end matter, lose the :TRANSLATE
option for the old VOP.

  * And as a "being somewhat cute about things" matter, make the
support for stack-allocatable-vectors selectable at build time,
which should provide a quick overview of how to make this work on
some other platform, should anyone else be interested later on.

gencgc: Compute bytes_allocated correctly during dynamic space pickup.

* Rather than computing bytes_allocated based on the number of
pages prior to the "alloc pointer" (really the heap high-water
mark), accumulate it based on the number of ALLOCATED pages.

* Fixes some lossage in write_generation_stats(), which seems to
be the only place where this value is checked against reality.
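
A minimal sketch of the difference between the two accounting schemes, assuming a page table reduced to one "allocated" flag per page. `PAGE_BYTES` and both function names are hypothetical, and the log does not say whether the real code charges a whole page or a per-page usage figure for each allocated page; whole pages are assumed here.

```python
PAGE_BYTES = 32768  # hypothetical page size, for illustration only


def bytes_allocated_from_high_water_mark(page_allocated):
    """Old scheme: every page below the high-water mark is counted."""
    high_water = max(
        (i + 1 for i, allocated in enumerate(page_allocated) if allocated),
        default=0,
    )
    return high_water * PAGE_BYTES


def bytes_allocated_from_pages(page_allocated):
    """New scheme: only pages actually marked allocated are counted."""
    return sum(PAGE_BYTES for allocated in page_allocated if allocated)


# Two free pages sit below the high-water mark, so the schemes disagree.
pages = [True, False, False, True]
old_estimate = bytes_allocated_from_high_water_mark(pages)
new_estimate = bytes_allocated_from_pages(pages)
```

The two results only differ when there are free pages below the high-water mark.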

Add test cases for non-consing WITHOUT-GCING and WITH-PINNED-OBJECTS.

* Neither of these two constructs should cons under normal
circumstances, but WITH-PINNED-OBJECTS is occasionally broken in
this respect on some backends. May as well make it explicit and
official.

compiler/{sparc,ppc}/macros: with-pinned-objects improvements.

  * For all precise gencgc backends, with-pinned-objects uses an
explicit "pin list".  This pin list should be stack-allocated.

  * Declare the pin list to be TRULY-DYNAMIC-EXTENT, for both
backends.  This won't actually do anything unless the backend
also supports :stack-allocatable-fixed-objects or more than two
objects are to be pinned at once (one-arg LIST and two-arg LIST*
are both converted to CONS by the compiler, and CONS falls under
:stack-allocatable-fixed-objects rather than
:stack-allocatable-lists).

ppc: Implement :stack-allocatable-fixed-objects

  * Alter SYS:SRC;COMPILER;PPC;MACROS.LISP, WITH-FIXED-ALLOCATION
to accept a parameter for requesting stack allocation instead of
heap allocation.

  * Alter SYS:SRC;COMPILER;PPC;ALLOC.LISP, VOP FIXED-ALLOC to pass
the new stack-allocation parameter.

  * And add :stack-allocatable-fixed-objects to the PPC section in
make-config.sh.

backtrace-interrupted-condition-wait now passes on darwin.

Micro-optimize integer-length on fixnums on x86-64.

INTEGER-LENGTH is implemented by using the BSR instruction, which
returns the position of the most significant 1-bit. That position needs
to be incremented to get the width of the integer, and BSR doesn't
work on 0, so it needs a branch to handle 0.

But fixnums are tagged by being shifted left n-fixnum-tag-bits times,
so untagging by shifting right only n-fixnum-tag-bits-1 times (if
n-fixnum-tag-bits = 1, no shifting is required) makes the
resulting integer one bit wider, making the increment unnecessary.
Then, to avoid calling BSR on 0, OR the result with 1. That sets the
lowest bit to 1, and if all other bits are 0, BSR will return 0,
which is the correct value for INTEGER-LENGTH.
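
The arithmetic can be checked with a small model in Python. This is a sketch, not the VOP: `N_FIXNUM_TAG_BITS = 3` is an assumed value, `bsr` models the instruction on arbitrary-precision integers, and complementing negative values before the scan is an assumption about how negatives are handled. Python's `>>` on integers is an arithmetic shift, so it preserves the sign.

```python
N_FIXNUM_TAG_BITS = 3  # assumed value, for illustration only


def bsr(x):
    """Model of BSR: index of the most significant 1-bit; undefined for 0."""
    assert x > 0
    return x.bit_length() - 1


def integer_length_reference(n):
    """INTEGER-LENGTH as defined for Common Lisp integers."""
    return (n if n >= 0 else ~n).bit_length()


def integer_length_fixnum(tagged):
    """INTEGER-LENGTH of a tagged fixnum without increment or zero branch."""
    # Shift right by one bit less than the tag width: the result is 2*n.
    x = tagged >> (N_FIXNUM_TAG_BITS - 1)
    if x < 0:
        x = ~x  # assumption: negatives are complemented before scanning
    # OR with 1 so that BSR never sees 0; the extra low bit replaces the +1.
    return bsr(x | 1)


def tag(n):
    return n << N_FIXNUM_TAG_BITS
```

For example, `integer_length_fixnum(tag(0))` is 0 and `integer_length_fixnum(tag(5))` is 3.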

Document the new :directory argument for run-program.

Convert the MOVE macro on x86-64 into a function.

This is possible as the macro is used just to simulate an inline
function. Converting MOVE into a true function shrinks the core by
448 KiB and may even make the compiler run faster due to reduced
instruction cache pressure.

Some background: Only on x86-64 is MOVE sometimes used with float SCs.
It therefore needs to select different machine instructions depending on
the SC of its destination argument. This compiles to so much code that
inlining it can't be justified, especially given that MOVE is used in
several hundred VOPs.

While at it, correct the comment at the top of the file for 64-bitness.

Faster ISQRT on small (about fixnum sized) numbers.

ISQRT is implemented using a recursive algorithm for arguments above 24
which is compiled using generic arithmetic only (as it must support both
fixnums and bignums).

Improve this by compiling this recursive part twice, once using generic
and once fixnum-only arithmetic, and dispatching on function entry into
the applicable part. For maximum speed, the fixnum part recurs directly
into itself, thereby avoiding further type dispatching.
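
The dispatch structure can be sketched as follows. Python has no fixnum/bignum split, so this only models the control flow: `FIXNUM_LIMIT` is an assumed bound, all names are hypothetical, and the Newton-style recursive step is an assumption about the algorithm, which the log does not spell out.

```python
FIXNUM_LIMIT = 1 << 62  # assumed bound, for illustration only


def _small_isqrt(n):
    """Direct answers for arguments up to 24."""
    if n > 15:
        return 4
    if n > 8:
        return 3
    if n > 3:
        return 2
    return 1 if n > 0 else 0


def _recursive_part(n, recur):
    """Shared body; `recur` selects which variant handles the smaller input."""
    if n <= 24:
        return _small_isqrt(n)
    quarter_len = n.bit_length() >> 2
    # Root of the top half of the bits gives a starting value above sqrt(n).
    guess = (recur(n >> (2 * quarter_len)) + 1) << quarter_len
    while True:
        better = (guess + n // guess) >> 1
        if better >= guess:
            return guess
        guess = better


def _isqrt_fixnum(n):
    # Recurs directly into itself: the argument only shrinks.
    return _recursive_part(n, _isqrt_fixnum)


def isqrt(n):
    """Entry point: dispatch once on the size of the argument."""
    if n < FIXNUM_LIMIT:
        return _isqrt_fixnum(n)
    # The generic variant recurs through the dispatcher, so it drops
    # into the fixnum variant as soon as the argument is small enough.
    return _recursive_part(n, isqrt)
```

In this model the "fixnum" and "generic" variants share one body; in compiled Lisp the point of the change is that the same body is compiled twice with different arithmetic.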

This makes ISQRT run about three times as fast on fixnum inputs while
the generated code is about 40 percent larger (both measured on x86-64).
For bignums a speedup can be seen, too, as ISQRT always recurs into
fixnum territory eventually, but the relative gain obviously becomes
smaller very fast with increasing size of the argument.

I have changed the variable names in the recursive part; they no longer
have an "n-" prefix, as by convention this means "number of" in SBCL and
as the argument of the recursive part is no longer visibly "n".

Slightly augment the test case.

Improve scaling of type derivation for LOG{AND,IOR,XOR}.

If the types of the arguments of LOG{AND,IOR,XOR} are known to be ranges
of non-negative integers the compiler currently derives the range of the
result using straightforward implementations of algorithms from
"Hacker's Delight". These take quadratic time in the number of bits of
the inputs in the worst case, potentially leading to unacceptably long
compilation times. (The algorithms are based on loops over the bits of
the inputs, doing calculations during each iteration that are themselves
linear in the number of bits of their operands.)

Instead implement bit-parallel algorithms I have found that take linear
time in all cases. Their runtime is therefore much smaller for large
inputs, and it is comparable to that of the current algorithms for small
inputs, too; the new deriver for LOGXOR is in fact already faster than
the old one by a factor of two to ten in the latter case.

The (existing) test for these derivers compares their results with those
from a brute-force algorithm for all O(N^4) many pairs of input ranges
with endpoints from the set of N-bit unsigned integers. The brute-force
algorithm needs to consider O(N^2) input pairs for each pair of ranges,
making the total runtime O(N^6). Therefore the test normally runs with
N = 5. I have tested all three new derivers successfully with N = 7.
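
As an illustration of the kind of check described, here is a sketch in Python. It is not SBCL's deriver or its test: `max_or` is the textbook loop-over-bits upper bound for OR from "Hacker's Delight" (the style of algorithm being replaced, not the new bit-parallel one), and the function names are hypothetical.

```python
def max_or(a, b, c, d):
    """Upper bound of x | y for a <= x <= b and c <= y <= d (unsigned)."""
    m = 1 << max(b.bit_length(), d.bit_length())
    while m:
        if b & d & m:
            # Both upper bounds have this bit set: try clearing it in one
            # of them and setting all lower bits, if that stays in range.
            t = (b - m) | (m - 1)
            if t >= a:
                b = t
                break
            t = (d - m) | (m - 1)
            if t >= c:
                d = t
                break
        m >>= 1
    return b | d


def brute_force_max_or(a, b, c, d):
    """Reference result: try every pair of inputs from the two ranges."""
    return max(x | y for x in range(a, b + 1) for y in range(c, d + 1))


def check_all_ranges(n_bits):
    """Compare both functions on every pair of ranges of n_bits-bit integers."""
    limit = 1 << n_bits
    for a in range(limit):
        for b in range(a, limit):
            for c in range(limit):
                for d in range(c, limit):
                    if max_or(a, b, c, d) != brute_force_max_or(a, b, c, d):
                        return False
    return True
```

The nested loops make the cost grow very quickly with `n_bits`, which is why such a test is only practical for a handful of bits.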

Replace LOG{AND,IOR,XOR}-DERIVE-UNSIGNED-{LOW,HIGH}-BOUND with
LOG{AND,IOR,XOR}-DERIVE-UNSIGNED-BOUNDS so that expressions shared by
the calculations for the low and the high bound are evaluated only
once. The callers always need both bounds anyway.

Adapt the test to this change. (It runs twice as fast now due to the
brute force loop calculating both bounds in one go.)

Add a test for the scaling behaviour. This needs a function to measure
runtimes over potentially large ranges; add this to test-util.lisp.

Fixes lp#1096444.

Split bitops-derive-type.lisp out of srctran.lisp.

The moved part contains DERIVE-TYPE methods for LOGAND, LOGIOR, and
friends. The split is motivated by srctran.lisp being too large and
by planned changes to these type derivers.

Fix init-var-ignoring-errors.

Actually set the variable to the default value in case of an error.

Caught by Nikodemus Siivola.

Add :directory argument to sb-ext:run-program.

The implementation uses chdir(2) on Unices, the lpCurrentDirectory
argument to CreateProcessW on Windows.
Slightly adapted from the patch by Matthias Benkard.
Closes lp#791800
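
The Unix mechanism can be sketched in Python: the child changes directory between fork and exec, so the parent's working directory is untouched. This is an illustration of the chdir(2) approach only, not SBCL's code; `run_in_directory` is a hypothetical name and the sketch is Unix-only.

```python
import os


def run_in_directory(argv, directory):
    """Run argv with `directory` as its working directory.

    Returns (exit_status, captured_stdout_bytes).
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: redirect stdout, change directory, then exec.
        try:
            os.close(read_fd)
            os.dup2(write_fd, 1)
            os.chdir(directory)
            os.execvp(argv[0], argv)
        finally:
            # Only reached if chdir or exec failed.
            os._exit(127)
    # Parent: collect the child's output, then reap it.
    os.close(write_fd)
    chunks = []
    while True:
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(read_fd)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status), b"".join(chunks)
```

A failure to change directory shows up here as exit status 127 from the child; how SBCL reports that case is not stated in the log.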

Handle environment initialization better.

Don't fail with mysterious errors and memory faults on startup during
initialization of *default-pathname-defaults* when the current
directory contains undecodable characters or is deleted. Similarly
catch decoding errors for things like *runtime-pathname* and
*posix-argv*.
Turn the errors into warnings, and ensure that streams are initialized
and the error messages can be printed.
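
The pattern of turning an initialization error into a warning plus a fallback value can be sketched like this. The helper name and its arguments are hypothetical and only model the behaviour described; they are not SBCL's interface.

```python
import warnings


def init_ignoring_errors(name, compute, default):
    """Return compute(); on any error, warn and fall back to `default`."""
    try:
        return compute()
    except Exception as error:
        warnings.warn(f"Error initializing {name}: {error}")
        # The fallback must actually be returned, otherwise the
        # variable would be left uninitialized after the warning.
        return default


def decode_cwd(raw_bytes):
    """Stand-in for decoding a directory name received from the OS."""
    return raw_bytes.decode("utf-8")
```

With an undecodable name such as `b"\xff"`, `init_ignoring_errors("cwd", lambda: decode_cwd(b"\xff"), "/")` warns and returns `"/"` instead of raising.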

1.1.7: will be tagged as "sbcl-1.1.7"

fix formatting of most recent "changes" line in NEWS

sort NEWS into enhancement/bug fix/optimization order

Trivial code cleanups

Declare a variable as ignored, and descriptors are 64 bit on
x86-64. The latter was brought to my attention by Douglas Katzman.

Substitute constants with modular equivalents more safely

* Modular arithmetic sometimes lets us narrow constants down,
  especially with signed arithmetic. We now update the receiving
  LVAR's type conservatively when there are multiple uses; otherwise,
  conflicting type information results in spurious dead code
  elimination.

* Test case by Eric Marsden.

* Reported by Eric Marsden on sbcl-devel (2013-04-18).
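
A small Python sketch of what "narrowing a constant" means under modular arithmetic. It does not model LVARs or type propagation, and `narrow_constant` is a hypothetical helper: it only shows that a constant can be swapped for a smaller equivalent without changing the result modulo 2^width.

```python
def narrow_constant(constant, width, signed=False):
    """Smallest representative of `constant` modulo 2**width.

    With signed=True the representative is taken from the signed range
    [-2**(width-1), 2**(width-1)).
    """
    modulus = 1 << width
    reduced = constant % modulus
    if signed and reduced >= modulus >> 1:
        reduced -= modulus
    return reduced


def add_mod(x, constant, width):
    """(x + constant) truncated to `width` bits."""
    return (x + constant) & ((1 << width) - 1)
```

For 8-bit arithmetic, adding 255 is the same as adding -1, so `narrow_constant(255, 8, signed=True)` is -1. The substituted constant has a different type than the original, which is why the type recorded for its uses has to be updated with care.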

Fix the build on OS X 10.8.0

It seems our exception handler can be called before it's fully set up.
Handle that case without potentially leaking too many ports.

Reported by Gabriel Dos Reis on sbcl-devel.