doc/cmucl/internals/back.tex

   1 % -*- Dictionary: design -*-
   2 \f
   3 \chapter{Copy propagation}
   4
   5 File: {\tt copyprop}
   6
   7 This phase is optional, but should be done whenever speed or space is more
   8 important than compile speed.  We use global flow analysis to find the reaching
   9 definitions for each TN.  This information is used here to eliminate
  10 unnecessary TNs, and is also used later on by loop invariant optimization.
  11
  12 In some cases, VMR conversion will unnecessarily copy the value of a TN into
  13 another TN, since it may not be able to tell that the initial TN has the same
  14 value at the time the second TN is referenced.  This can happen when ICR
  15 optimize is unable to eliminate a trivial variable binding, or when the user
  16 does a setq, or may also result from creation of expression evaluation
  17 temporaries during VMR conversion.  Whatever the cause, we would like to avoid
  18 the unnecessary creation and assignment of these TNs.
  19
  20 What we do is replace TN references whose only reaching definition is a Move
  21 VOP with a reference to the TN moved from, and then delete the Move VOP if the
  22 copy TN has no remaining references.  There are several restrictions on copy
  23 propagation:
  24 \begin{itemize}
  25 \item The TNs must be ``ordinary'' TNs, not restricted or otherwise
  26 unusual.  Extending the life of restricted (or wired) TNs can make register
  27 allocation impossible.  Some other TN kinds have hidden references.
  28
  29 \item We don't want to defeat source-level debugging by replacing named
  30 variables with anonymous temporaries.
  31
  32 \item We can't delete moves that representation selected might want to change
  33 into a representation conversion, since we need the primitive types of both TNs
  34 to select a conversion.
  35 \end{itemize}
  36
  37 Some cleverness reduces the cost of flow analysis.  As for lifetime analysis,
  38 we only need to do flow analysis on global packed TNs.  We can't do the real
  39 local TN assignment pass before this, since we allocate TNs afterward, so we do
  40 a pre-pass that marks the TNs that are local for our purposes.  We don't care
  41 if block splitting eventually causes some of them to be considered global.
  42
  43 Note also that we are really only are interested in knowing if there is a
  44 unique reaching definition, which we can mash into our flow analysis rules by
  45 doing an intersection.  Then a definition only appears in the set when it is
  46 unique.  We then propagate only definitions of TNs with only one write, which
  47 allows the TN to stand for the definition.
  48
  49 \f
  50 \chapter{Representation selection}
  51
  52 File: {\tt represent}
  53
  54 Some types of object (such as {\tt single-float}) have multiple possible
  55 representations.  Multiple representations are useful mainly when there is a
  56 particularly efficient non-descriptor representation.  In this case, there is
  57 the normal descriptor representation, and an alternate non-descriptor
  58 representation.
  59
  60 This possibility brings up two major issues:
  61 \begin{itemize}
  62 \item The compiler must decide which representation will be most efficient for
  63 any given value, and
  64
  65 \item Representation conversion code must be inserted where the representation
  66 of a value is changed.
  67 \end{itemize}
  68 First, the representations for TNs are selected by examining all the TN
  69 references and attempting to minimize reference costs.  Then representation
  70 conversion code is introduced.
  71
  72 This phase is in effect a pre-pass to register allocation.  The main reason for
  73 its existence is that representation conversions may be farily complex (e.g.
  74 involving memory allocation), and thus must be discovered before register
  75 allocation.
  76
  77
  78 VMR conversion leaves stubs for representation specific move operations.
  79 Representation selection recognizes {\tt move} by name.  Argument and return
  80 value passing for call VOPs is controlled by the {\tt :move-arguments} option
  81 to {\tt define-vop}.
  82
  83 Representation selection is also responsible for determining what functions use
  84 the number stack.  If any representation is chosen which could involve packing
  85 into the {\tt non-descriptor-stack} SB, then we allocate the NFP register
  86 throughout the component.  As an optimization, permit the decision of whether a
  87 number stack frame needs to be allocated to be made on a per-function basis.
  88 If a function doesn't use the number stack, and isn't in the same tail-set as
  89 any function that uses the number stack, then it doesn't need a number stack
  90 frame, even if other functions in the component do.
  91
  92 \f
  93 \chapter{Lifetime analysis}
  94
  95 File: {\tt life}
  96
  97 This phase is a preliminary to Pack.  It involves three passes:
  98  -- A pre-pass that computes the DEF and USE sets for live TN analysis, while
  99     also assigning local TN numbers, splitting blocks if necessary.  \#\#\# But
 100 not really...
 101  -- A flow analysis pass that does backward flow analysis on the
 102     component to find the live TNs at each block boundary.
 103  -- A post-pass that finds the conflict set for each TN.
 104
 105 \#|
 106 Exploit the fact that a single VOP can only exhaust LTN numbers when there are
 107 large more operands.  Since more operand reference cannot be interleaved with
 108 temporary reference, the references all effectively occur at the same time.
 109 This means that we can assign all the more args and all the more results the
 110 same LTN number and the same lifetime info.
 111 |\#
 112
 113 \f
 114 \section{Flow analysis}
 115
 116 It seems we could use the global-conflicts structures during compute the
 117 inter-block lifetime information.  The pre-pass creates all the
 118 global-conflicts for blocks that global TNs are referenced in.  The flow
 119 analysis pass just adds always-live global-conflicts for the other blocks the
 120 TNs are live in.  In addition to possibly being more efficient than SSets, this
 121 would directly result in the desired global-conflicts information, rather that
 122 having to create it from another representation.
 123
 124 The DFO sorted per-TN global-conflicts thread suggests some kind of algorithm
 125 based on the manipulation of the sets of blocks each TN is live in (which is
 126 what we really want), rather than the set of TNs live in each block.
 127
 128 If we sorted the per-TN global-conflicts in reverse DFO (which is just as good
 129 for determining conflicts between TNs), then it seems we could scan though the
 130 conflicts simultaneously with our flow-analysis scan through the blocks.
 131
 132 The flow analysis step is the following:
 133     If a TN is always-live or read-before-written in a successor block, then we
 134     make it always-live in the current block unless there are already
 135     global-conflicts recorded for that TN in this block.
 136
 137 The iteration terminates when we don't add any new global-conflicts during a
 138 pass.
 139
 140 We may also want to promote TNs only read within a block to always-live when
 141 the TN is live in a successor.  This should be easy enough as long as the
 142 global-conflicts structure contains this kind of info.
 143
 144 The critical operation here is determining whether a given global TN has global
 145 conflicts in a given block.  Note that since we scan the blocks in DFO, and the
 146 global-conflicts are sorted in DFO, if we give each global TN a pointer to the
 147 global-conflicts for the last block we checked the TN was in, then we can
 148 guarantee that the global-conflicts we are looking for are always at or after
 149 that pointer.  If we need to insert a new structure, then the pointer will help
 150 us rapidly find the place to do the insertion.]
 151
 152 \f
 153 \section{Conflict detection}
 154
 155 [\#\#\# Environment, :more TNs.]
 156
 157 This phase makes use of the results of lifetime analysis to find the set of TNs
 158 that have lifetimes overlapping with those of each TN.  We also annotate call
 159 VOPs with information about the live TNs so that code generation knows which
 160 registers need to be saved.
 161
 162 The basic action is a backward scan of each block, looking at each TN-Ref and
 163 maintaining a set of the currently live TNs.  When we see a read, we check if
 164 the TN is in the live set.  If not, we:
 165  -- Add the TN to the conflict set for every currently live TN,
 166  -- Union the set of currently live TNs with the conflict set for the TN, and
 167  -- Add the TN to the set of live TNs.
 168
 169 When we see a write for a live TN, we just remove it from the live set.  If we
 170 see a write to a dead TN, then we update the conflicts sets as for a read, but
 171 don't add the TN to the live set.  We have to do this so that the bogus write
 172 doesn't clobber anything.
 173
 174 [We don't consider always-live TNs at all in this process, since the conflict
 175 of always-live TNs with other TNs in the block is implicit in the
 176 global-conflicts structures.
 177
 178 Before we do the scan on a block, we go through the global-conflicts structures
 179 of TNs that change liveness in the block, assigning the recorded LTN number to
 180 the TN's LTN number for the duration of processing of that block.]
 181
 182
 183 Efficiently computing and representing this information calls for some
 184 cleverness.  It would be prohibitively expensive to represent the full conflict
 185 set for every TN with sparse sets, as is done at the block-level.  Although it
 186 wouldn't cause non-linear behavior, it would require a complex linked structure
 187 containing tens of elements to be created for every TN.  Fortunately we can
 188 improve on this if we take into account the fact that most TNs are "local" TNs:
 189 TNs which have all their uses in one block.
 190
 191 First, many global TNs will be either live or dead for the entire duration of a
 192 given block.  We can represent the conflict between global TNs live throughout
 193 the block and TNs local to the block by storing the set of always-live global
 194 TNs in the block.  This reduces the number of global TNs that must be
 195 represented in the conflicts for local TNs.
 196
 197 Second, we can represent conflicts within a block using bit-vectors.  Each TN
 198 that changes liveness within a block is assigned a local TN number.  Local
 199 conflicts are represented using a fixed-size bit-vector of 64 elements or so
 200 which has a 1 for the local TN number of every TN live at that time.  The block
 201 has a simple-vector which maps from local TN numbers to TNs.  Fixed-size
 202 vectors reduce the hassle of doing allocations and allow operations to be
 203 open-coded in a maximally tense fashion.
 204
 205 We can represent the conflicts for a local TN by a single bit-vector indexed by
 206 the local TN numbers for that block, but in the global TN case, we need to be
 207 able to represent conflicts with arbitrary TNs.  We could use a list-like
 208 sparse set representation, but then we would have to either special-case global
 209 TNs by using the sparse representation within the block, or convert the local
 210 conflicts bit-vector to the sparse representation at the block end.  Instead,
 211 we give each global TN a list of the local conflicts bit-vectors for each block
 212 that the TN is live in.  If the TN is always-live in a block, then we record
 213 that fact instead.  This gives us a major reduction in the amount of work we
 214 have to do in lifetime analysis at the cost of some increase in the time to
 215 iterate over the set during Pack.
 216
 217 Since we build the lists of local conflict vectors a block at a time, the
 218 blocks in the lists for each TN will be sorted by the block number.  The
 219 structure also contains the local TN number for the TN in that block.  These
 220 features allow pack to efficiently determine whether two arbitrary TNs
 221 conflict.  You just scan the lists in order, skipping blocks that are in only
 222 one list by using the block numbers.  When we find a block that both TNs are
 223 live in, we just check the local TN number of one TN in the local conflicts
 224 vector of the other.
 225
 226 In order to do these optimizations, we must do a pre-pass that finds the
 227 always-live TNs and breaks blocks up into small enough pieces so that we don't
 228 run out of local TN numbers.  If we can make a block arbitrarily small, then we
 229 can guarantee that an arbitrarily small number of TNs change liveness within
 230 the block.  We must be prepared to make the arguments to unbounded arg count
 231 VOPs (such as function call) always-live even when they really aren't.  This is
 232 enabled by a panic mode in the block splitter: if we discover that the block
 233 only contains one VOP and there are still too many TNs that aren't always-live,
 234 then we promote the arguments (which we'd better be able to do...).
 235
 236 This is done during the pre-scan in lifetime analysis.  We can do this because
 237 all TNs that change liveness within a block can be found by examining that
 238 block: the flow analysis only adds always-live TNs.
 239
 240
 241 When we are doing the conflict detection pass, we set the LTN number of global
 242 TNs.  We can easily detect global TNs that have not been locally mapped because
 243 this slot is initially null for global TNs and we null it out after processing
 244 each block.  We assign all Always-Live TNs to the same local number so that we
 245 don't need to treat references to them specially when making the scan.
 246
 247 We also annotate call VOPs that do register saving with the TNs that are live
 248 during the call, and thus would need to be saved if they are packed in
 249 registers.
 250
 251 We adjust the costs for TNs that need to be saved so that TNs costing more to
 252 save and restore than to reference get packed on the stack.  We would also like
 253 more often saved TNs to get higher costs so that they are packed in more
 254 savable locations.
 255
 256 \f
 257 \chapter{Packing}
 258
 259 File: {\tt pack}
 260
 261 \#|
 262
 263 Add lifetime/pack support for pre-packed save TNs.
 264
 265 Fix GTN/VMR conversion to use pre-packed save TNs for old-cont and return-PC.
 266 (Will prevent preference from passing location to save location from ever being
 267 honored?)
 268
 269 We will need to make packing of passing locations smarter before we will be
 270 able to target the passing location on the stack in a tail call (when that is
 271 where the callee wants it.)  Currently, we will almost always pack the passing
 272 location in a register without considering whether that is really a good idea.
 273 Maybe we should consider schemes that explicitly understand the parallel
 274 assignment semantics, and try to do the assignment with a minimum number of
 275 temporaries.  We only need assignment temps for TNs that appear both as an
 276 actual argument value and as a formal parameter of the called function.  This
 277 only happens in self-recursive functions.
 278
 279 Could be a problem with lifetime analysis, though.  The write by a move-arg VOP
 280 would look like a write in the current env, when it really isn't.  If this is a
 281 problem, then we might want to make the result TN be an info arg rather than a
 282 real operand.  But this would only be a problem in recursive calls, anyway.
 283 [This would prevent targeting, but targeting across passing locations rarely
 284 seems to work anyway.]  [\#\#\# But the :ENVIRONMENT TN mechanism would get
 285 confused.  Maybe put env explicitly in TN, and have it only always-live in that
 286 env, and normal in other envs (or blocks it is written in.)  This would allow
 287 targeting into environment TNs.
 288
 289 I guess we would also want the env/PC save TNs normal in the return block so
 290 that we can target them.  We could do this by considering env TNs normal in
 291 read blocks with no successors.
 292
 293 ENV TNs would be treated totally normally in non-env blocks, so we don't have
 294 to worry about lifetime analysis getting confused by variable initializations.
 295 Do some kind of TN costing to determine when it is more trouble than it is
 296 worth to allocate TNs in registers.
 297
 298 Change pack ordering to be less pessimal.  Pack TNs as they are seen in the LTN
 299 map in DFO, which at least in non-block compilations has an effect something
 300 like packing main trace TNs first, since control analysis tries to put the good
 301 code first.  This could also reduce spilling, since it makes it less likely we
 302 will clog all registers with global TNs.
 303
 304 If we pack a TN with a specified save location on the stack, pack in the
 305 specified location.
 306
 307 Allow old-cont and return-pc to be kept in registers by adding a new "keep
 308 around" kind of TN.  These are kind of like environment live, but are only
 309 always-live in blocks that they weren't referenced in.  Lifetime analysis does
 310 a post-pass adding always-live conflicts for each "keep around" TN to those
 311 blocks with no conflict for that TN.  The distinction between always-live and
 312 keep-around allows us to successfully target old-cont and return-pc to passing
 313 locations.  MAKE-KEEP-AROUND-TN (ptype), PRE-PACK-SAVE-TN (tn scn offset).
 314 Environment needs a KEEP-AROUND-TNS slot so that conflict analysis can find
 315 them (no special casing is needed after then, they can be made with :NORMAL
 316 kind).  VMR-component needs PRE-PACKED-SAVE-TNS so that conflict analysis or
 317 somebody can copy conflict info from the saved TN.
 318
 319
 320
 321 Note that having block granularity in the conflict information doesn't mean
 322 that a localized packing scheme would have to do all moves at block boundaries
 323 (which would clash with the desire the have saving done as part of this
 324 mechanism.)  All that it means is that if we want to do a move within the
 325 block, we would need to allocate both locations throughout that block (or
 326 something).
 327
 328
 329
 330
 331
 332 Load TN pack:
 333
 334 A location is out for load TN packing if:
 335
 336 The location has TN live in it after the VOP for a result, or before the VOP
 337 for an argument, or
 338
 339 The location is used earlier in the TN-ref list (after) the saved results ref
 340 or later in the TN-Ref list (before) the loaded argument's ref.
 341
 342 To pack load TNs, we advance the live-tns to the interesting VOP, then
 343 repeatedly scan the vop-refs to find vop-local conflicts for each needed load
 344 TN.  We insert move VOPs and change over the TN-Ref-TNs as we go so the TN-Refs
 345 will reflect conflicts with already packed load-TNs.
 346
 347 If we fail to pack a load-TN in the desired SC, then we scan the Live-TNs for
 348 the SB, looking for a TN that can be packed in an unbounded SB.  This TN must
 349 then be repacked in the unbounded SB.  It is important the load-TNs are never
 350 packed in unbounded SBs, since that would invalidate the conflicts info,
 351 preventing us from repacking TNs in unbounded SBs.  We can't repack in a finite
 352 SB, since there might have been load TNs packed in that SB which aren't
 353 represented in the original conflict structures.
 354
 355 Is it permissible to "restrict" an operand to an unbounded SC?  Not impossible
 356 to satisfy as long as a finite SC is also allowed.  But in practice, no
 357 restriction would probably be as good.
 358
 359 We assume all locations can be used when an sc is based on an unbounded sb.
 360
 361 ]
 362
 363
 364 TN-Refs are be convenient structures to build the target graph out of.  If we
 365 allocated space in every TN-Ref, then there would certainly be enough to
 366 represent arbitrary target graphs.  Would it be enough to allocate a single
 367 Target slot?  If there is a target path though a given VOP, then the Target of
 368 the write ref would be the read, and vice-versa.  To find all the TNs that
 369 target us, we look at the TN for the target of all our write refs.
 370
 371 We separately chain together the read refs and the write refs for a TN,
 372 allowing easy determination of things such as whether a TN has only a single
 373 definition or has no reads.  It would also allow easier traversal of the target
 374 graph.
 375
 376 Represent per-location conflicts as vectors indexed by block number of
 377 per-block conflict info.  To test whether a TN conflicts on a location, we
 378 would then have to iterate over the TNs global-conflicts, using the block
 379 number and LTN number to check for a conflict in that block.  But since most
 380 TNs are local, this test actually isn't much more expensive than indexing into
 381 a bit-vector by GTN numbers.
 382
 383 The big win of this scheme is that it is much cheaper to add conflicts into the
 384 conflict set for a location, since we never need to actually compute the
 385 conflict set in a list-like representation (which requires iterating over the
 386 LTN conflicts vectors and unioning in the always-live TNs).  Instead, we just
 387 iterate over the global-conflicts for the TN, using BIT-IOR to combine the
 388 conflict set with the bit-vector for that block in that location, or marking
 389 that block/location combination as being always-live if the conflict is
 390 always-live.
 391
 392 Generating the conflict set is inherently more costly, since although we
 393 believe the conflict set size to be roughly constant, it can easily contain
 394 tens of elements.  We would have to generate these moderately large lists for
 395 all TNs, including local TNs.  In contrast, the proposed scheme does work
 396 proportional to the number of blocks the TN is live in, which is small on
 397 average (1 for local TNs).  This win exists independently from the win of not
 398 having to iterate over LTN conflict vectors.
 399
 400
 401 [\#\#\# Note that since we never do bitwise iteration over the LTN conflict
 402 vectors, part of the motivation for keeping these a small fixed size has been
 403 removed.  But it would still be useful to keep the size fixed so that we can
 404 easily recycle the bit-vectors, and so that we could potentially have maximally
 405 tense special primitives for doing clear and bit-ior on these vectors.]
 406
 407 This scheme is somewhat more space-intensive than having a per-location
 408 bit-vector.  Each vector entry would be something like 150 bits rather than one
 409 bit, but this is mitigated by the number of blocks being 5-10x smaller than the
 410 number of TNs.  This seems like an acceptable overhead, a small fraction of the
 411 total VMR representation.
 412
 413 The space overhead could also be reduced by using something equivalent to a
 414 two-dimensional bit array, indexed first by LTN numbers, and then block numbers
 415 (instead of using a simple-vector of separate bit-vectors.)  This would
 416 eliminate space wastage due to bit-vector overheads, which might be 50% or
 417 more, and would also make efficient zeroing of the vectors more
 418 straightforward.  We would then want efficient operations for OR'ing LTN
 419 conflict vectors with rows in the array.
 420
 421 This representation also opens a whole new range of allocation algorithms: ones
 422 that store allocate TNs in different locations within different portions of the
 423 program.  This is because we can now represent a location being used to hold a
 424 certain TN within an arbitrary subset of the blocks the TN is referenced in.
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434 Pack goals:
 435
 436 Pack should:
 437
 438 Subject to resource constraints:
 439  -- Minimize use costs
 440      -- "Register allocation"
 441          Allocate as many values as possible in scarce "good" locations,
 442          attempting to minimize the aggregate use cost for the entire program.
 443      -- "Save optimization"
 444          Don't allocate values in registers when the save/restore costs exceed
 445          the expected gain for keeping the value in a register.  (Similar to
 446          "opening costs" in RAOC.)  [Really just a case of representation
 447          selection.]
 448
 449  -- Minimize preference costs
 450     Eliminate as many moves as possible.
 451
 452
 453 "Register allocation" is basically an attempt to eliminate moves between
 454 registers and memory.  "Save optimization" counterbalances "register
 455 allocation" to prevent it from becoming a pessimization, since saves can
 456 introduce register/memory moves.
 457
 458 Preference optimization reduces the number of moves within an SC.  Doing a good
 459 job of honoring preferences is important to the success of the compiler, since
 460 we have assumed in many places that moves will usually be optimized away.
 461
 462 The scarcity-oriented aspect of "register allocation" is handled by a greedy
 463 algorithm in pack.  We try to pack the "most important" TNs first, under the
 464 theory that earlier packing is more likely to succeed due to fewer constraints.
 465
 466 The drawback of greedy algorithms is their inability to look ahead.  Packing a
 467 TN may mess up later "register allocation" by precluding packing of TNs that
 468 are individually "less important", but more important in aggregate.  Packing a
 469 TN may also prevent preferences from being honored.
 470
 471
 472 \f
 473 Initial packing:
 474
 475
 476 Pack all TNs restricted to a finite SC first, before packing any other TNs.
 477
 478 One might suppose that Pack would have to treat TNs in different environments
 479 differently, but this is not the case.  Pack simply assigns TNs to locations so
 480 that no two conflicting TNs are in the same location.  In the process of
 481 implementing call semantics in conflict analysis, we cause TNs in different
 482 environments not to conflict.  In the case of passing TNs, cross environment
 483 conflicts do exist, but this reflects reality, since the passing TNs are
 484 live in both the caller and the callee.  Environment semantics has already been
 485 implemented at this point.
 486
 487 This means that Pack can pack all TNs simultaneously, using one data structure
 488 to represent the conflicts for each location.  So we have only one conflict set
 489 per SB location, rather than separating this information by environment
 490 environment.
 491
 492 \f
 493 Load TN packing:
 494
 495 We create load TNs as needed in a post-pass to the initial packing.  After TNs
 496 are packed, it may be that some references to a TN will require it to be in a
 497 SC other than the one it was packed in.  We create load-TNs and pack them on
 498 the fly during this post-pass.
 499
 500 What we do is have an optional SC restriction associated with TN-refs.  If we
 501 pack the TN in an SC which is different from the required SC for the reference,
 502 then we create a TN for each such reference, and pack it into the required SC.
 503
 504 In many cases we will be able to pack the load TN with no hassle, but in
 505 general we may need to spill a TN that has already been packed.  We choose a
 506 TN that isn't in use by the offending VOP, and then spill that TN onto the
 507 stack for the duration of that VOP.  If the VOP is a conditional, then we must
 508 insert a new block interposed before the branch target so that the value TN
 509 value is restored regardless of which branch is taken.
 510
 511 Instead of remembering lifetime information from conflict analysis, we rederive
 512 it.  We scan each block backward while keeping track of which locations have
 513 live TNs in them.  When we find a reference that needs a load TN packed, we try
 514 to pack it in an unused location.  If we can't, we unpack the currently live TN
 515 with the lowest cost and force it into an unbounded SC.
 516
 517 The per-location and per-TN conflict information used by pack doesn't
 518 need to be updated when we pack a load TN, since we are done using those data
 519 structures.
 520
 521 We also don't need to create any TN-Refs for load TNs.  [??? How do we keep
 522 track of load-tn lifetimes?  It isn't really that hard, I guess.  We just
 523 remember which load TNs we created at each VOP, killing them when we pass the
 524 loading (or saving) step.  This suggests we could flush the Refs thread if we
 525 were willing to sacrifice some flexibility in explicit temporary lifetimes.
 526 Flushing the Refs would make creating the VMR representation easier.]
 527
 528 The lifetime analysis done during load-TN packing doubles as a consistency
 529 check.  If we see a read of a TN packed in a location which has a different TN
 530 currently live, then there is a packing bug.  If any of the TNs recorded as
 531 being live at the block beginning are packed in a scarce SB, but aren't current
 532 in that location, then we also have a problem.
 533
 534 The conflict structure for load TNs is fairly simple, the load TNs for
 535 arguments and results all conflict with each other, and don't conflict with
 536 much else.  We just try packing in targeted locations before trying at random.
 537
 538
 539 \f
 540 \chapter{Code generation}
 541
 542 This is fairly straightforward.  We translate VOPs into instruction sequences
 543 on a per-block basis.
 544
 545 After code generation, the VMR representation is gone.  Everything is
 546 represented by the assembler data structures.
 547
 548 \f
 549 \chapter{Assembly}
 550
 551 In effect, we do much of the work of assembly when the compiler is compiled.
 552
 553 The assembler makes one pass fixing up branch offsets, then squeezes out the
 554 space left by branch shortening and dumps out the code along with the load-time
 555 fixup information.  The assembler also deals with dumping unboxed non-immediate
 556 constants and symbols.  Boxed constants are created by explicit constructor
 557 code in the top-level form, while immediate constants are generated using
 558 inline code.
 559
 560 [\#\#\# The basic output of the assembler is:
 561     A code vector
 562     A representation of the fixups along with indices into the code vector for
 563       the fixup locations
 564     A PC map translating PCs into source paths
 565
 566 This information can then be used to build an output file or an in-core
 567 function object.
 568 ]
 569
 570 The assembler is table-driven and supports arbitrary instruction formats.  As
 571 far as the assembler is concerned, an instruction is a bit sequence that is
 572 broken down into subsequences.  Some of the subsequences are constant in value,
 573 while others can be determined at assemble or load time.
 574
 575 Assemble Node Form*
 576     Allow instructions to be emitted during the evaluation of the Forms by
 577     defining Inst as a local macro.  This macro caches various global
 578     information in local variables.  Node tells the assembler what node
 579     ultimately caused this code to be generated.  This is used to create the
 580     pc=>source map for the debugger.
 581
 582 Assemble-Elsewhere Node Form*
 583     Similar to Assemble, but the current assembler location is changed to
 584     somewhere else.  This is useful for generating error code and similar
 585     things.  Assemble-Elsewhere may not be nested.
 586
 587 Inst Name Arg*
 588     Emit the instruction Name with the specified arguments.
 589
 590 Gen-Label
 591 Emit-Label (Label)
 592     Gen-Label returns a Label object, which describes a place in the code.
 593     Emit-Label marks the current position as being the location of Label.
 594
 595
 596 \f
 597 \chapter{Dumping}
 598
 599 So far as input to the dumper/loader, how about having a list of Entry-Info
 600 structures in the VMR-Component?  These structures contain all information
 601 needed to dump the associated function objects, and are only implicitly
 602 associated with the functional/XEP data structures.  Load-time constants that
 603 reference these function objects should specify the Entry-Info, rather than the
 604 functional (or something).  We would then need to maintain some sort of
 605 association so VMR conversion can find the appropriate Entry-Info.
 606 Alternatively, we could initially reference the functional, and then later
 607 clobber the reference to the Entry-Info.
 608
 609 We have some kind of post-pass that runs after assembly, going through the
 610 functions and constants, annotating the VMR-Component for the benefit of the
 611 dumper:
 612     Resolve :Label load-time constants.
 613     Make the debug info.
 614     Make the entry-info structures.
 615
 616 Fasl dumper and in-core loader are implementation (but not instruction set)
 617 dependent, so we want to give them a clear interface.
 618
 619 open-fasl-file name => fasl-file
 620     Returns a "fasl-file" object representing all state needed by the dumper.
 621     We objectify the state, since the fasdumper should be reentrant.  (but
 622     could fail to be at first.)
 623
 624 close-fasl-file fasl-file abort-p
 625     Close the specified fasl-file.
 626
 627 fasl-dump-component component code-vector length fixups fasl-file
 628     Dump the code, constants, etc. for component.  Code-Vector is a vector
 629     holding the assembled code.  Length is the number of elements of Vector
 630     that are actually in use.  Fixups is a list of conses (offset . fixup)
 631     describing the locations and things that need to be fixed up at load time.
 632     If the component is a top-level component, then the top-level lambda will
 633     be called after the component is loaded.
 634
 635 load-component component code-vector length fixups
 636     Like Fasl-Dump-Component, but directly installs the code in core, running
 637     any top-level code immediately.  (???) but we need some way to glue
 638     together the componenents, since we don't have a fasl table.
 639
 640
 641
 642 Dumping:
 643
 644 Dump code for each component after compiling that component, but defer dumping
 645 of other stuff.  We do the fixups on the code vectors, and accumulate them in
 646 the table.
 647
 648 We have to grovel the constants for each component after compiling that
 649 component so that we can fix up load-time constants.  Load-time constants are
 650 values needed my the code that are computed after code generation/assembly
 651 time.  Since the code is fixed at this point, load-time constants are always
 652 represented as non-immediate constants in the constant pool.  A load-time
 653 constant is distinguished by being a cons (Kind . What), instead of a Constant
 654 leaf.  Kind is a keyword indicating how the constant is computed, and What is
 655 some context.
 656
 657 Some interesting load-time constants:
 658
 659     (:label . <label>)
 660         Is replaced with the byte offset of the label within the code-vector.
 661
 662     (:code-vector . <component>)
 663         Is replaced by the component's code-vector.
 664
 665     (:entry . <function>)
 666     (:closure-entry . <function>)
 667         Is replaced by the function-entry structure for the specified function.
 668         :Entry is how the top-level component gets a handle on the function
 669         definitions so that it can set them up.
 670
 671 We also need to remember the starting offset for each entry, although these
 672 don't in general appear as explicit constants.
 673
 674 We then dump out all the :Entry and :Closure-Entry objects, leaving any
 675 constant-pool pointers uninitialized.  After dumping each :Entry, we dump some
 676 stuff to let genesis know that this is a function definition.  Then we dump all
 677 the constant pools, fixing up any constant-pool pointers in the already-dumped
 678 function entry structures.
 679
 680 The debug-info *is* a constant: the first constant in every constant pool.  But
 681 the creation of this constant must be deferred until after the component is
 682 compiled, so we leave a (:debug-info) placeholder.  [Or maybe this is
 683 implicitly added in by the dumper, being supplied in a VMR-component slot.]
 684
 685
 686     Work out details of the interface between the back-end and the
 687     assembler/dumper.
 688
 689     Support for multiple assemblers concurrently loaded?  (for byte code)
 690
 691     We need various mechanisms for getting information out of the assembler.
 692
 693     We can get entry PCs and similar things into function objects by making a
 694     Constant leaf, specifying that it goes in the closure, and then
 695     setting the value after assembly.
 696
 697     We have an operation Label-Value which can be used to get the value of a
 698     label after assembly and before the assembler data structures are
 699     deallocated.
 700
 701     The function map can be constructed without any special help from the
 702     assembler.  Codegen just has to note the current label when the function
 703     changes from one block to the next, and then use the final value of these
 704     labels to make the function map.
 705
 706     Probably we want to do the source map this way too.  Although this will
 707     make zillions of spurious labels, we would have to effectively do that
 708     anyway.
 709
 710     With both the function map and the source map, getting the locations right
 711     for uses of Elsewhere will be a bit tricky.  Users of Elsewhere will need
 712     to know about how these maps are being built, since they must record the
 713     labels and corresponding information for the elsewhere range.  It would be
 714     nice to have some cooperation from Elsewhere so that this isn't necessary,
 715     otherwise some VOP writer will break the rules, resulting in code that is
 716     nowhere.
 717
 718     The Debug-Info and related structures are dumped by consing up the
 719     structure and making it be the value of a constant.
 720
 721     Getting the code vector and fixups dumped may be a bit more interesting.  I
 722     guess we want a Dump-Code-Vector function which dumps the code and fixups
 723     accumulated by the current assembly, returning a magic object that will
 724     become the code vector when it is dumped as a constant.
 725 ]