#1 (defun mysl (s) (declare (simple-string s)) (declare (optimize (speed 3) (safety 0) (debug 0))) (let ((c 0)) (declare (fixnum c)) (dotimes (i (length s)) (when (eql (aref s i) #\1) (incf c))) c)) * On X86 I is represented as a tagged integer. * Unnecessary move: 3: SLOT S!11[EDX] {SB-C::VECTOR-LENGTH 1 7} => t23[EAX] 4: MOVE t23[EAX] => t24[EBX] -------------------------------------------------------------------------------- #2 (defun quux (v) (declare (optimize (speed 3) (safety 0) (space 2) (debug 0))) (declare (type (simple-array double-float 1) v)) (let ((s 0d0)) (declare (type double-float s)) (dotimes (i (length v)) (setq s (+ s (aref v i)))) s)) * Python does not combine + with AREF, so generates extra move and allocates a register. * On X86 Python thinks that all FP registers are directly accessible and emits costy MOVE ... => FR1. -------------------------------------------------------------------------------- #3 (defun bar (n) (declare (optimize (speed 3) (safety 0) (space 2)) (type fixnum n)) (let ((v (make-list n))) (setq v (make-array n)) (length v))) * IR1 does not optimize away (MAKE-LIST N). -------------------------------------------------------------------------------- #4 (defun bar (v1 v2) (declare (optimize (speed 3) (safety 0) (space 2)) (type (simple-array base-char 1) v1 v2)) (dotimes (i (length v1)) (setf (aref v2 i) (aref v1 i)))) VOP DATA-VECTOR-SET/SIMPLE-STRING V2!14[EDI] t32[EAX] t30[S2]>t33[CL] => t34[S2], # MOV BYTE PTR [EDI+EAX+1], # MOV #, # MOV #, # * The value of DATA-VECTOR-SET is not used, so there is no need in the last two moves. * And why two moves? -------------------------------------------------------------------------------- #8 (defun foo (d) (declare (optimize (speed 3) (safety 0) (debug 0))) (declare (type (double-float 0d0 1d0) d)) (loop for i fixnum from 1 to 5 for x1 double-float = (sin d) ;;; !!! do (loop for j fixnum from 1 to 4 sum x1 double-float))) Without the marked declaration Python will use boxed representation for X1. This is equivalent to (let ((x nil)) (setq x 0d0) ;; use of X as DOUBLE-FLOAT ) The initial binding is effectless, and without it X is of type DOUBLE-FLOAT. Unhopefully, IR1 does not optimize away effectless SETs/bindings, and IR2 does not perform type inference. -------------------------------------------------------------------------------- #9 "Multi-path constant folding" (defun foo (x) (if (= (cond ((irgh x) 0) ((buh x) 1) (t 2)) 0) :yes :no)) This code could be optimized to (defun foo (x) (cond ((irgh x) :yes) ((buh x) :no) (t :no))) -------------------------------------------------------------------------------- #11 (inverted variant of #9) (lambda (x) (let ((y (sap-alien x c-string))) (list (alien-sap y) (alien-sap y)))) It could be optimized to (lambda (x) (list x x)) (if Y were used only once, the current compiler would optimize it) -------------------------------------------------------------------------------- #12 (typep (truly-the (simple-array * (*)) x) 'simple-vector) tests lowtag. -------------------------------------------------------------------------------- #13 FAST-+/FIXNUM and similar should accept unboxed arguments in interests of representation selection. Problem: inter-TN dependencies. -------------------------------------------------------------------------------- #14 The derived type of (/ (THE (DOUBLE-FLOAT (0D0)) X) (THE (DOUBLE-FLOAT 1D0) Y)) is (DOUBLE-FLOAT 0.0d0). While it might be reasonable, it is better to derive (OR (MEMBER 0.0d0) (DOUBLE-FLOAT (0.0d0))). -------------------------------------------------------------------------------- #15 On the alpha, the system is reluctant to refer directly to a constant bignum, preferring to load a large constant through a slow sequence of instructions, then cons up a bignum for it: (LAMBDA (A) (DECLARE (OPTIMIZE (SAFETY 1) (SPEED 3) (DEBUG 1)) (TYPE (INTEGER -10000 10000) A) (IGNORABLE A)) (CASE A ((89 125 16) (ASH A (MIN 18 -706))) (T (DPB -3 (BYTE 30 30) -1)))) -------------------------------------------------------------------------------- #16 (do ((i 0 (1+ i))) ((= i (the (integer 0 100) n))) ...) It is commonly expected for Python to derive (FIXNUMP I). (If ``='' is replaced with ``>='', Python will do.) -------------------------------------------------------------------------------- #17 Type tests for (ARRAY BIT), (ARRAY T) and similar go through full %TYPEP, even though it is relatively simple to establish the arrayness of an object and also to obtain the element type of an array. As of sbcl-0.8.12.30, this affects at least DUMP-OBJECT through COMPOUND-OBJECT-P, and (LABELS MAYBE-EMIT-MAKE-LOAD-FORMS GROVEL) through TYPEP UNBOXED-ARRAY, within the compiler itself. -------------------------------------------------------------------------------- #18 (lambda (x) (declare (null x)) (sxhash x)) goes through SYMBOL-HASH rather than either constant-folding or manipulating NIL-VALUE or NULL-TN directly. -------------------------------------------------------------------------------- #19 (let ((dx (if (foo) (list x) (list y z)))) (declare (dynamic-extent dx)) ...) DX is not allocated on stack. -------------------------------------------------------------------------------- #20 (defun-with-dx foo (x) (flet ((make (x) (let ((l (list nil nil))) (setf (first l) x) (setf (second l) (1- x)) l))) (let ((l (make x))) (declare (dynamic-extent l)) (mapc #'print l)))) Result of MAKE is not stack allocated, which means that stack-allocation of structures is impossible. -------------------------------------------------------------------------------- #21 (defun-with-dx foo () (let ((dx (list (list 1 2) (list 3 4)))) (declare (dynamic-extent dx)) ...)) External list in DX is allocated on stack, but internal are not. -------------------------------------------------------------------------------- #22 IR2 does not perform unused code flushing. -------------------------------------------------------------------------------- #23 Python does not know that &REST lists are LISTs (and cannot derive it). -------------------------------------------------------------------------------- #24 a. Iterations on &REST lists, returning them as VALUES could be rewritten with &MORE vectors. b. Implement local unknown-values mv-call (useful for fast type checking). -------------------------------------------------------------------------------- #26 SBCL cannot derive upper bound for I and uses generic arithmetic here: (defun foo (l) (declare (vector l)) (dotimes (i (length l)) (if (block nil (map-foo (lambda (x) (if x (return t))) l)) t nil))) (So the constraint propagator or a possible future SSA-convertor should know the connection between an NLE and its CLEANUP.) -------------------------------------------------------------------------------- #27 Initialization of stack-allocated arrays is inefficient: we always fill the vector with zeroes, even when it is not needed (as for platforms with conservative GC or for arrays of unboxed objectes) and is performed later explicitely. -------------------------------------------------------------------------------- #28 a. Accessing raw slots in structure instances is more inefficient than it could be; if we placed raw slots before the header word, we would not need to do arithmetic at runtime to access them. (But beware: this would complicate handling of the interior pointer). b. (Also note that raw slots are currently disabled on HPPA) -------------------------------------------------------------------------------- #29 Python is overly zealous when converting high-level CL functions, such as MIN/MAX, LOGBITP, and LOGTEST, to low-level CL functions. Reducing Python's aggressiveness would make it easier to effect changes such as x86-64: * direct MIN/MAX on {SINGLE,DOUBLE}-FLOATs ({MIN,MAX}S{S,D}) x86-64: * direct LOGBITP on word-sized integers and fixnums (BT + JC) x86{,-64}/PPC: * branch-free MIN/MAX on word-sized integers and fixnums (floats could be handled too, modulo safety considerations on the PPC) x86-64: * efficient LOGTESTs on word-sized integers and fixnums (TEST) etc., etc. (The framework for this has been implemented as of 0.9.9.18; see the vm-support-routine COMBINATION-IMPLEMENTATION-STYLE and its use in src/compiler/ir1opt.lisp, IR1-OPTIMIZE-COMBINATION. The above optimizations are left as an exercise for the reader.) -------------------------------------------------------------------------------- #30 (defun foo (x y) (< x y)) FOO's IR1 representation is roughly: (defun foo (x y) (if (< x y) T NIL)) However, if a full call is generated for < (and similarly for other predicate functions), then the IF is unnecessary, since the return value of (< x y) is already T or NIL. -------------------------------------------------------------------------------- #31 The typecheck generated for a declaration like (integer 0 45) on x86 looks like: ; 12B: F6C203 TEST DL, 3 ; 12E: 753B JNE L1 ; 130: 8BC2 MOV EAX, EDX ; 132: 83F800 CMP EAX, 0 ; 135: 7C34 JL L1 ; 137: 8BC2 MOV EAX, EDX ; 139: 3DB4000000 CMP EAX, 180 ; 13E: 7F2B JNLE L1 A better code sequence for this would be: TEST DL, 3 JNE L1 MOV EAX, EDX CMP EAX, 180 JBE L1 Doing an unsigned comparison means that, similarly to %CHECK-BOUND, we can combine the <0 and >=bound tests. This sort of test is generated often in SBCL and any array-based code that's serious about type-checking its indices. -------------------------------------------------------------------------------- #32 The code for a vector bounds check on x86 (similarly on x86-64) where the vector is in EDX and the index in EAX looks like: ; 49: L0: 8B5AFD MOV EBX, [EDX-3] ; 4C: 39C3 CMP EBX, EAX ; 4E: 7632 JBE L2 because %CHECK-BOUND is used for bounds-checking any array dimension. A more efficient specialization (%CHECK-BOUND/VECTOR) would produce: CMP [EDX-3], EAX JBE L2 Which is slightly shorter and avoids using a register. -------------------------------------------------------------------------------- #33 Reports from the Java camp indicate that using an SSE2-based floating-point backend on x86 when possible is highly preferable to using the x86 FP stack. It would be nice if SBCL included an SSE2-based floating point backend with a compile-time option to switch between the two. -------------------------------------------------------------------------------- #34 Compiling (defun foo (x y) (declare (type (integer 0 45) x y)) (+ x y)) results in the following error trapping code for type-checking the arguments: ; 424: L0: 8B058CE31812 MOV EAX, [#x1218E38C] ; '(MOD 46) ; 42A: 0F0B0A BREAK 10 ; error trap ; 42D: 05 BYTE #X05 ; 42E: 1F BYTE #X1F ; OBJECT-NOT-TYPE-ERROR ; 42F: FECE01 BYTE #XFE, #XCE, #X01 ; EDI ; 432: 0E BYTE #X0E ; EAX ; 433: L1: 8B0590E31812 MOV EAX, [#x1218E390] ; '(MOD 46) ; 439: 0F0B0A BREAK 10 ; error trap ; 43C: 03 BYTE #X03 ; 43D: 1F BYTE #X1F ; OBJECT-NOT-TYPE-ERROR ; 43E: 8E BYTE #X8E ; EDX ; 43F: 0E BYTE #X0E ; EAX Notice that '(MOD 46) has two entries in the constant vector. Having one would be preferable.