24 Jan 2021

Racket Compiler and Runtime Status: January 2021

posted by Matthew Flatt

With the upcoming version 8.0 release, Racket on Chez Scheme (Racket CS) will become the default implementation of Racket. This report discusses the implications of that change, continuing a series that was originally about Racket CS; the most recent previous report was the February 2020 report. The original January 2018 report explains the motivation behind throwing out around 200k lines of C code to replace it with around 150k lines of Scheme and Racket code.

Switching to Racket CS

For most users, the differences between Racket CS and Racket BC (“before Chez”) may be too small to notice. The installer and executables will be larger, because they contain machine code instead of bytecode, and some programs may run a little faster. Otherwise, Racket programs are supposed to run the same.

To test this, we’ve built and tested every package available from the package server on Racket CS over the past several months and compared the results with those from Racket BC version 7.9. Out of more than 1800 packages, only 12 now fail to compile, and 5 others fail some of their tests. The new build failures, as well as 2 of the 5 new test failures, are the result of changes at the C API level. Several packages already have pull requests to fix them, and we hope others will be fixed soon, too. The remaining 3 test failures are differences in thread scheduling that provoked existing bugs in the packages.

Although Racket CS will be the default for v8.0, Racket BC will remain available through a “More Installers” link on the download page for the foreseeable future. So, Racket users will still have the option of falling back to the old runtime system and compiler if something goes wrong, or if they need a package that does not yet work on Racket CS.

Progress in the Past Year

During 2020, Racket CS became a little faster and a little smaller, and many corners of the implementation were repaired or made more compatible with the BC implementation. The planned changes that were on the previous report’s roadmap all happened! In addition, we introduced an AArch64 backend and ported to run natively on M1 Macs. Finally, we made the Chez Scheme garbage collector run in parallel, which brought the performance of places-based parallelism in Racket CS about on par with BC.

Building Racket CS no longer involves first building Racket BC as a way to bootstrap Chez Scheme. Instead, Chez Scheme bootstraps itself on any supported platform using a new portable bytecode (pb) backend. Bootfiles for pb mode are checked into a separate Git repository that is used in a submodule-like way, but bootfiles can always be built from source by using an existing Racket build; the bootstrap implementation is now detangled so that much older Racket versions in the v7 series can work for bootstrapping.

Benchmark Performance

For traditional Scheme benchmarks, not much has changed in the results. A plot is practically indistinguishable from the one in the previous report.

For the shootout benchmarks, which measure more Racket-specific functionality and libraries, there’s some improvement, especially toward the end of the table. (Shorter is better.) The light-blue line shows the results from the previous report (which corresponds to version 7.6), so a shorter “CS” bar means improvement there. The improvements mostly reflect unboxed floating-point arithmetic.

Below is another set of microbenchmarks from the racket-benchmarks package, comparing Racket CS to BC. Some points to note:

CS is usually faster than BC. (Shorter is better.)
The fastest relative CS runs are much faster than the fastest relative BC runs; that is, the shortest blue bars are much shorter than the shortest red bars. Although it’s difficult to extract from these plots, the difference is really that CS performs more consistently, so its slowest relative runs are not as slow as the slowest relative BC runs.
These benchmarks mostly compare subsystems that are implemented in C for BC to subsystems implemented in Scheme and Racket for CS. That’s the key and motivating point — the maintenance advantage, more than the performance advantage.

For the rest of this report, the plots will show CS slower or bigger, but gradually approaching BC performance. You should not read that as “CS is slower than BC.” This report is about the parts of the implementation where we spent time last year, and we don’t dwell on parts where no improvement was needed.

About the measurements: Benchmarks are in the "racket-benchmarks" package in the "shootout", "control", "hash", and "chaperone" directories. We dropped some of the "control" benchmarks because they take too long to run on BC, which means that the plots here understate how much more consistently CS performs compared to BC.

Startup and Memory Use

Load times improved for Racket CS:

That’s mostly due to reduced code size:

On a different scale and measuring peak memory use instead of final memory use for DrRacket start up and exit:

The large drop in peak memory use for DrRacket CS is due to garbage-collection improvements for old generations in large heaps, where old-generation objects tend to be marked in place instead of copied.

About the measurements: These results were gathered by running racket with the arguments -l racket or -l drracket. The command further included -W "debug@GC" -e '(collect-garbage)' -e '(collect-garbage)', and reported sizes are based on the logged memory use before the second collection. For the “BC” line, the reported memory use includes the first number that is printed by logging in square brackets, which is the memory occupied by code outside of the garbage collector’s directly managed space. Baseline memory use was measured by setting the PLT_GCS_ON_EXIT environment variable and running with -n, which has the same effect as -e '(collect-garbage)' -e '(collect-garbage)'. DrRacket was initialized with racket/base as the default language; also, background expansion was disabled, because measuring memory use is tricky on Racket BC.

Build Times

Compile times improved for Racket CS:

This improvement, combined with others, makes the CS distribution now build a little faster from source than a BC build, and it uses about 20% less memory than the BC build. The following two plots use the same scale, where the foreground blue or red line shows memory use (vertical) plotted over time (horizontal) as recorded on each major collection (see this page for a detailed description):

Racket CS

Racket BC

A shorter build time in less space represents a big milestone for Racket CS. Even though most Racket users start with a pre-built distribution, build time measures end-to-end performance across many parts of the implementation, and build performance correlates well with end-to-end performance in many Racket applications.

The build plots above are for a sequential (i.e., single-threaded) build. More typically, Racket is built on machines with more CPUs, and build times easily benefit from process-like parallelism. Over the past year, garbage-collector improvements for parallel collection have made build times with place-based parallelism dramatically shorter on a machine with 4 hyperthreaded cores (although build dependencies ultimately limit parallelism). Since parallel collection debuted in Racket CS v7.9 but has improved since, we show v7.9 as an intermediate point:

As the plot shows, as of the previous report (before parallel garbage collection), trying to build with multiple places was counterproductive. The speedup from places-based parallelism in v8.0 is finally close to BC’s speedup, which is about the same as using process-based parallelism (but with the advantage of staying within a single Racket process). The parallelism available from BC remains a little higher, because its implementation of places uses completely separate allocation heaps, while parallelism in CS uses a single heap.

About the measurements: The compile-time results were gathered by using time in a shell a few times and taking the median. The build plots were generated using the "plt-build-plot" package, which drives a build from source and plots the results. The parallelism results were generated by starting with an in-place installation, using raco setup --fast-clean and rm doc/docindex.sqlite to start from a clean state, getting elapsed real time with time raco setup --jobs 8, and dividing by elapsed time from build plots (which is not exactly the same build job, but is close).
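For readers who want to repeat this kind of measurement, here is a minimal Python sketch of the two steps described above: timing a command a few times and taking the median, then forming a speedup ratio. It is not the script used for this report. The function names median_elapsed and speedup are hypothetical, the numbers in the example are made up, and the sketch assumes that "speedup" means sequential elapsed time divided by parallel elapsed time.

```python
import statistics
import subprocess
import sys
import time


def median_elapsed(cmd, runs=3):
    """Run cmd `runs` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the command's output; only elapsed real time matters here.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(sequential_seconds, parallel_seconds):
    """Assumed definition: sequential time over parallel time (above 1.0 is faster)."""
    return sequential_seconds / parallel_seconds


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; on a machine with Racket
    # installed, substitute something like ["raco", "setup", "--jobs", "8"].
    cmd = [sys.executable, "-c", "pass"]
    print(f"median: {median_elapsed(cmd):.3f}s")
    # Hypothetical timings, for illustration only (not results from this report).
    print(f"speedup: {speedup(600.0, 200.0):.1f}x")
```

Taking the median rather than the mean keeps one unusually slow run (a cold file cache, say) from skewing the result, which matters when only a few runs are taken.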

Reflection and Outlook

We tried to declare success and move on a year ago, but that didn’t work. This time really is different, because Racket CS is ready to be the default Racket implementation. Time will tell whether that difference is enough to allow a different focus in the coming year. Given how much the CS compilation strategy (ahead-of-time) differs from the BC strategy (bytecode compilation plus a JIT), it’s not clear that the compile time, load time, and memory footprint differences between BC and CS can be reduced further.

In terms of maintainability, one possible direction for improvement is to shift even more Racket support into the Chez Scheme layer. Gustavo’s decision to implement type reconstruction as a Chez Scheme pass was a good call. Contributor yjqww6’s implementation of Racket-style lifting as a Chez Scheme pass was a clear improvement for Racket CS, because it eliminated a larger pass on the “schemify” side and implemented lifting at a better layer of abstraction with less interference on local optimizations. Similarly, extending Chez Scheme’s support for weak hash tables, which was not difficult, allowed us to discard a complex and slower implementation in the “rumble” Racket-adapter layer. More direct support for left-to-right evaluation guarantees and cross-linklet inlining could further reduce the schemify layer, and maybe one day eliminate it.

We’ve mostly kept BC and CS in sync, so far, but moving forward, Racket BC will not necessarily get new functionality that is added to CS. After all, the point of CS is to eventually shed technical debt in BC. This divergence has already started: Racket CS fully supports M1 Macs, but Racket BC runs only in interpreted mode or as x86_64 (i.e., we did not add an AArch64 JIT to Racket BC).

Thanks to Sam, Ryan, Robby, Matthias, and John for improving this report. Special thanks to Sam, Paulo, and Robby for moving CS testing forward. Many contributors helped track down and repair CS bugs, and besides others mentioned, thanks to Shu-Hung, Sorawee, and Bogdan for tackling some substantial repairs.