#1 (defun mysl (s) (declare (simple-string s)) (declare (optimize (speed 3) (safety 0) (debug 0))) (let ((c 0)) (declare (fixnum c)) (dotimes (i (length s)) (when (eql (aref s i) #\1) (incf c))) c)) * On X86 I is represented as a tagged integer. * Unnecessary move: 3: SLOT S!11[EDX] {SB-C::VECTOR-LENGTH 1 7} => t23[EAX] 4: MOVE t23[EAX] => t24[EBX] -------------------------------------------------------------------------------- #2 (defun quux (v) (declare (optimize (speed 3) (safety 0) (space 2) (debug 0))) (declare (type (simple-array double-float 1) v)) (let ((s 0d0)) (declare (type double-float s)) (dotimes (i (length v)) (setq s (+ s (aref v i)))) s)) * Python does not combine + with AREF, so generates extra move and allocates a register. * On X86 Python thinks that all FP registers are directly accessible and emits costy MOVE ... => FR1. -------------------------------------------------------------------------------- #3 (defun bar (n) (declare (optimize (speed 3) (safety 0) (space 2)) (type fixnum n)) (let ((v (make-list n))) (setq v (make-array n)) (length v))) * IR1 does not optimize away (MAKE-LIST N). -------------------------------------------------------------------------------- #4 (defun bar (v1 v2) (declare (optimize (speed 3) (safety 0) (space 2)) (type (simple-array base-char 1) v1 v2)) (dotimes (i (length v1)) (setf (aref v2 i) (aref v1 i)))) VOP DATA-VECTOR-SET/SIMPLE-STRING V2!14[EDI] t32[EAX] t30[S2]>t33[CL] => t34[S2], # MOV BYTE PTR [EDI+EAX+1], # MOV #, # MOV #, # * The value of DATA-VECTOR-SET is not used, so there is no need in the last two moves. * And why two moves? -------------------------------------------------------------------------------- #8 (defun foo (d) (declare (optimize (speed 3) (safety 0) (debug 0))) (declare (type (double-float 0d0 1d0) d)) (loop for i fixnum from 1 to 5 for x1 double-float = (sin d) ;;; !!! do (loop for j fixnum from 1 to 4 sum x1 double-float))) Without the marked declaration Python will use boxed representation for X1. This is equivalent to (let ((x nil)) (setq x 0d0) ;; use of X as DOUBLE-FLOAT ) The initial binding is effectless, and without it X is of type DOUBLE-FLOAT. Unhopefully, IR1 does not optimize away effectless SETs/bindings, and IR2 does not perform type inference. -------------------------------------------------------------------------------- #9 "Multi-path constant folding" (defun foo (x) (if (= (cond ((irgh x) 0) ((buh x) 1) (t 2)) 0) :yes :no)) This code could be optimized to (defun foo (x) (cond ((irgh x) :yes) ((buh x) :no) (t :no))) -------------------------------------------------------------------------------- #11 (inverted variant of #9) (lambda (x) (let ((y (sap-alien x c-string))) (list (alien-sap y) (alien-sap y)))) It could be optimized to (lambda (x) (list x x)) (if Y were used only once, the current compiler would optimize it) -------------------------------------------------------------------------------- #12 (typep (truly-the (simple-array * (*)) x) 'simple-vector) tests lowtag. -------------------------------------------------------------------------------- #13 FAST-+/FIXNUM and similar should accept unboxed arguments in interests of representation selection. Problem: inter-TN dependencies. -------------------------------------------------------------------------------- #14 The derived type of (/ (THE (DOUBLE-FLOAT (0D0)) X) (THE (DOUBLE-FLOAT 1D0) Y)) is (DOUBLE-FLOAT 0.0d0). While it might be reasonable, it is better to derive (OR (MEMBER 0.0d0) (DOUBLE-FLOAT (0.0d0))). -------------------------------------------------------------------------------- #15 On the alpha, the system is reluctant to refer directly to a constant bignum, preferring to load a large constant through a slow sequence of instructions, then cons up a bignum for it: (LAMBDA (A) (DECLARE (OPTIMIZE (SAFETY 1) (SPEED 3) (DEBUG 1)) (TYPE (INTEGER -10000 10000) A) (IGNORABLE A)) (CASE A ((89 125 16) (ASH A (MIN 18 -706))) (T (DPB -3 (BYTE 30 30) -1)))) -------------------------------------------------------------------------------- #16 (do ((i 0 (1+ i))) ((= i (the (integer 0 100) n))) ...) It is commonly expected for Python to derive (FIXNUMP I). (If ``='' is replaced with ``>='', Python will do.) -------------------------------------------------------------------------------- #17 Type tests for (ARRAY BIT), (ARRAY T) and similar go through full %TYPEP, even though it is relatively simple to establish the arrayness of an object and also to obtain the element type of an array. As of sbcl-0.8.12.30, this affects at least DUMP-OBJECT through COMPOUND-OBJECT-P, and (LABELS MAYBE-EMIT-MAKE-LOAD-FORMS GROVEL) through TYPEP UNBOXED-ARRAY, within the compiler itself. -------------------------------------------------------------------------------- #18 (lambda (x) (declare (null x)) (sxhash x)) goes through SYMBOL-HASH rather than either constant-folding or manipulating NIL-VALUE or NULL-TN directly. -------------------------------------------------------------------------------- #19 (let ((dx (if (foo) (list x) (list y z)))) (declare (dynamic-extent dx)) ...) DX is not allocated on stack. -------------------------------------------------------------------------------- #20 (defun-with-dx foo (x) (flet ((make (x) (let ((l (list nil nil))) (setf (first l) x) (setf (second l) (1- x)) l))) (let ((l (make x))) (declare (dynamic-extent l)) (mapc #'print l)))) Result of MAKE is not stack allocated, which means that stack-allocation of structures is impossible. -------------------------------------------------------------------------------- #21 (defun-with-dx foo () (let ((dx (list (list 1 2) (list 3 4)))) (declare (dynamic-extent dx)) ...)) External list in DX is allocated on stack, but internal are not. -------------------------------------------------------------------------------- #22 IR2 does not perform unused code flushing. -------------------------------------------------------------------------------- #23 Python does not know that &REST lists are LISTs (and cannot derive it). -------------------------------------------------------------------------------- #24 a. Iterations on &REST lists, returning them as VALUES could be rewritten with &MORE vectors. b. Implement local unknown-values mv-call (useful for fast type checking). -------------------------------------------------------------------------------- #26 SBCL cannot derive upper bound for I and uses generic arithmetic here: (defun foo (l) (declare (vector l)) (dotimes (i (length l)) (if (block nil (map-foo (lambda (x) (if x (return t))) l)) t nil))) (So the constraint propagator or a possible future SSA-convertor should know the connection between an NLE and its CLEANUP.) -------------------------------------------------------------------------------- #27 Initialization of stack-allocated arrays is inefficient: we always fill the vector with zeroes, even when it is not needed (as for platforms with conservative GC or for arrays of unboxed objectes) and is performed later explicitely. -------------------------------------------------------------------------------- #28 a. Accessing raw slots in structure instances is more inefficient than it could be; if we placed raw slots before the header word, we would not need to do arithmetic at runtime to access them. (But beware: this would complicate handling of the interior pointer). b. (Also note that raw slots are currently disabled on HPPA) -------------------------------------------------------------------------------- #29 Python is overly zealous when converting high-level CL functions, such as MIN/MAX, LOGBITP, and LOGTEST, to low-level CL functions. Reducing Python's aggressiveness would make it easier to effect changes such as x86-64: * direct MIN/MAX on {SINGLE,DOUBLE}-FLOATs ({MIN,MAX}S{S,D}) x86{,-64}: * direct LOGBITP on word-sized integers and fixnums (BT + JC) x86{,-64}/PPC: * branch-free MIN/MAX on word-sized integers and fixnums * efficient LOGTESTs on word-sized integers and fixnums (TEST / AND.) PPC: * efficient LDB on word-sized integers and fixnums (RLWINM) etc., etc. The "easier" part claimed above would come about because the functions would be available for :TRANSLATE through a VOP or similar, whereas with the current architecture, one would have to pattern-match IR1. While IR1 pattern-matching would be useful in other contexts, it seems better here to attempt the direct :TRANSLATE route. I (NJF) don't know how to implement such architecture-specific optimizations whilst keeping the high->low transformations for other architectures. Certainly adding #!+/- magic in compiler/*.lisp could be done, but such a solution is somewhat inelegant. Moving the relevant DEFTRANSFORMs to the architecture-specific compiler/* areas is also possible, but that would duplicate quite a bit of code. -------------------------------------------------------------------------------- #30 (defun foo (x y) (< x y)) FOO's IR1 representation is roughly: (defun foo (x y) (if (< x y) T NIL)) However, if a full call is generated for < (and similarly for other predicate functions), then the IF is unnecessary, since the return value of (< x y) is already T or NIL.