Fully Inlined Merge Sort
posted by Neil Toronto
While writing the code for the triangular distribution in the upcoming math library, I found that I needed a function that sorts exactly three numbers. This kind of code is annoying to write and to get right. But it comes up rarely enough, and it seems simple enough, that I’ve never felt like making a library function for it.
But what if I wrote a macro that generated code to sort n numbers very quickly, where n is known at expansion time, but the numbers themselves aren’t? I think I could justify putting that in a library.
Here’s code that correctly sorts three numbers a, b and c:
It’s an if tree. Notice that there are 6 leaf expressions, for the 3! = 6 possible permutations. Also, it never compares more than it has to. It’s optimal.The optimality came from my reasoning about transitivity. For example, only two comparisons are needed before returning (values a b c). I knew that both (< a b) and (< b c), so (< a b c) must be true by transitivity.
It would be nice if the macro generated optimal code by explicitly reasoning about transitivity, or as an emergent property of the sorting algorithm it uses.
We’ll write a macro that does the latter, by generating a fully inlined merge sort.
[Edit: The final inline sort macro is here.]
Runtime Merge Sort
Start with a simple, runtime merge sort:
(define (merge as bs) (match* (as bs) [((list) bs) bs] [(as (list)) as] [((list a as ...) (list b bs ...)) (if (< a b) (cons a (merge as (cons b bs))) (cons b (merge (cons a as) bs)))])) (define (mergesort vs) (match vs [(list) vs] [(list a) vs] [_ (definevalues (lvs rvs) (splitat vs (quotient (length vs) 2))) (merge (mergesort lvs) (mergesort rvs))]))
Example:
> (mergesort '(5 1 2 5 10)) '(1 2 5 5 10)
To make a macro out of merge sort, we need to change two things. The most obvious is that it has to return syntax for an if instead of evaluating it. That’s easy for a novice macrologist: change the functions to operate on syntax, stick a syntaxquasiquote in front of the if, and unquote the bits inside that get evaluated at expansion time.
But if we did only that, we’d end up with expanded code like this:
It would be slow because cons allocates. We want the code to be fast.So the other change is to move the conses inside the ifs, and evaluate them at expansion time. We can then construct a values expression out of the resulting list.
AccumulatorPassing Style Won’t Work
Novice functional programmers should know that accumulatorpassing style (APS) moves conses inward. For example, this “add 1 to each element” function:
racket
(define (listadd1 vs)
(match vs
[(list) (list)]
[(list v vs ...)
(cons (add1 v) (listadd1 vs))]))
becomes this after conversion to APS:
Now cons is where we want it: inside the recursive call, instead of in tail position. The problem is that APS doesn’t work on treeshaped recursion.
ContinuationPassing Style Does Work
In continuationpassing style (CPS), we pass a continuation k—i.e. “what happens next”— instead of an accumulator. The functions call k instead of returning values. For example:
If we want, we can pass something besides identity as the basecase continuation:
CPS turns every call into a tail call, so it moves conses inward even with treeshaped recursion. As a demonstration, here’s a CPSed merge sort:
(define (merge/k as bs k) (match* (as bs) [((list) bs) (k bs)] [(as (list)) (k as)] [((list a as ...) (list b bs ...)) (if (< a b) (merge/k as (cons b bs) (λ (vs) (k (cons a vs)))) (merge/k (cons a as) bs (λ (vs) (k (cons b vs)))))])) (define (mergesort/k vs k) (match vs [(list) (k vs)] [(list a) (k vs)] [_ (definevalues (lvs rvs) (splitat vs (quotient (length vs) 2))) (mergesort/k lvs (λ (lvs) (mergesort/k rvs (λ (rvs) (merge/k lvs rvs k)))))]))
You can read the last expression in mergesort/k as, “Sort lvs. Then, with the sorted lvs, sort rvs. Then, with the sorted rvs, merge lvs and rvs.” Example:
The Inline Sort Macro
When we macroize the CPSed merge sort, we’ll turn the continuations into expansiontime functions. So not only will macroized CPS move conses inward, it’ll apply them all at expansion time!
We’ll do it in three parts: the userfacing macro, the inline merge function, and the inline sort function.
The UserFacing Macro
Let’s put a nice face on inline sorting:
(definesyntax (inlinesort stx) (syntaxcase stx () [(_ lst ...) (withsyntax ([(vs ...) (generatetemporaries #'(lst ...))]) #`(let ([vs lst] ...) #,(inlinesort/k #'(vs ...) (λ (vs) #`(values #,@vs)))))]))
This macro does two things. First, it names the values to be sorted, so they don’t get reevaluated every time they’re compared. Second, it calls inlinesort/k with a basecase continuation that converts syntax lists to values expressions.Note that the call #,(inlinesort/k …) happens at expansion time, and that the continuation (λ (vs) …) it passes is an expansiontime value.
The Inline Merge Function
Changing merge/k to operate on syntax at expansion time is as straightforward as possible:
The only substantial changes are the quasiquotes and unquotes, and using syntaxcase to destructure syntax lists instead of using match to destructure lists. Again, note that the recursive calls #,(inlinemerge/k …) in each if branch happen at expansion time, and that their continuations are expansiontime values.We can see what kind of code merge/k returns by applying it at expansion time:
> (beginforsyntax (print (syntax>datum (inlinemerge/k #'(a) #'(b c) (λ (vs) vs))))) '(if (< a b) (a b c) (if (< a c) (b a c) (b c a)))
The syntax list #’(b c) is assumed already sorted, meaning (< b c) at runtime. Therefore, if (< a b) at runtime, by transitivity, (< a b c) at runtime, so the merge generates #’(a b c). In other words, inlinemerge/k’s assumption that its arguments are sorted is equivalent to reasoning about transitivity.
The Inline Sort Function
Lastly, the divideandconquer part:
(require (forsyntax racket/list)) ; for list functions (defineforsyntax (inlinesort/k vs k) (syntaxcase vs () [() (k vs)] [(a) (k vs)] [_ (let ([vs (if (list? vs) vs (syntax>list vs))]) (definevalues (lvs rvs) (splitat vs (quotient (length vs) 2))) (inlinesort/k lvs (λ (lvs) (inlinesort/k rvs (λ (rvs) (inlinemerge/k lvs rvs k))))))]))
This is changed similarly to merge/k. The only new change is using syntax>list to convert syntax to lists so we can use the functions length and splitat. Example:
> (inlinesort 5 2 3 1) 1 2 3 5
Of course, the result of evaluating an inlinesort doesn’t tell the whole story. Let’s fire up the macro stepper and see what (inlinesort 5 2 3) expands to. Copying from the macro stepper window and renaming temp variables, we get
It’s an if tree. Notice that there are 6 leaf expressions, for the 3! = 6 possible permutations. Also, it never compares more than it has to. It’s optimal.
Inline Sort Properties
Inherited from the merge sort, the inline sort has the following properties (assuming a lengthn list):

Time optimality: The depth of the if tree is O(n log(n)).

Size optimality: The number of leaves is exactly n!. The term “size optimality” is misleading, because that’s still a lot of code. I’ve seriously considered requiring any user of this macro to state how many permutations there are for the number of values they’re sorting. They’d have to prove to the macro that they know how much code they’re asking it to generate.Inline sort is wicked fast, as we should expect:
> (define vs '(5 2 3 1)) > (matchdefine (list a b c d) vs) > (for ([_ (inrange 5)]) (time (for ([_ (inrange 1000000)]) (matchlet ([(list a b c d) (sort vs <)]) (void))))) cpu time: 550 real time: 540 gc time: 30 cpu time: 510 real time: 518 gc time: 0 cpu time: 520 real time: 517 gc time: 20 cpu time: 520 real time: 517 gc time: 0 cpu time: 510 real time: 516 gc time: 10 > (for ([_ (inrange 5)]) (time (for ([_ (inrange 1000000)]) (letvalues ([(a b c d) (inlinesort a b c d)]) (void))))) cpu time: 20 real time: 28 gc time: 0 cpu time: 30 real time: 27 gc time: 0 cpu time: 30 real time: 27 gc time: 0 cpu time: 30 real time: 27 gc time: 0 cpu time: 20 real time: 27 gc time: 0
So about 20 times faster than sort on a length–4 list.I use it in Typed Racket, on floatingpoint numbers. Typed Racket’s optimizer replaces <
with unsafefl<
. This tells the compiler that the elements are floats, so it can keep them stackallocated, which reduces allocations further. In all, for my particular use of inlinesort, it’s over 50 times faster than sort.And using it is a heckuvalot easier than writing an if tree and reasoning about transitivity every time I need to sort a handful of floats.
Conclusion
Writing macros in expansiontime CPS to fully inline a recursive function works out beautifully. I suspect that it will work on any recursive, polymorphic function whose wellfoundedness follows only from the structure of the input data.Also, it can generate a metric truckload of code.Interesting note: I originally wrote inlinesort using only syntaxrules, passing the names of higherorder macros to other macros as continuations. Sorting a fiveelement list took almost 19000 expansion steps, which is ridiculously inefficient even for a 120leaf if tree.