29 Jan 2019

Racket-on-Chez Status: January 2019

posted by Matthew Flatt

For background information about Racket on Chez Scheme (a.k.a. Racket CS), see the original announcement, last January’s report, and the report at Scheme Workshop 2018.

Racket on Chez Scheme is done in a useful sense. All functionality is in place, DrRacket CS works fully, the main Racket CS distribution can build itself, and 99.95% of the core Racket test suite passes.

You can download a build for Windows, Linux, or Mac OS from the Utah snapshot site (look for “Racket CS”):

https://www.cs.utah.edu/plt/snapshots/

While code generally runs as fast as it should, end-to-end performance is not yet good enough to make Racket CS the default implementation of Racket. We’ll let the implementation settle and gradually improve, with the expectation that it will eventually be good enough to switch over— and better in the long run.

Compatibility with the Current Racket Implementation

Racket CS is intended to behave the same as the existing Racket implementation with a few exceptions:

There are still a few internal gaps related to handling large numbers of file descriptors (needed by servers with lots of connections, for example) and support for file-change events. Those should be easy to fill in, but our focus right now is on flushing out the remaining bugs that are exposed by test suites.

Another kind of incompatibility is that the compiled form of Racket code with the current implementation is platform-independent bytecode, while Racket CS’s compiled form is platform-specific machine code. This difference can sometimes affect a development workflow, and it required adjustments to the distribution-build process. Racket CS does not yet support cross compilation.

Here’s an incomplete list of things that are compatible between the current Racket implementation and Racket CS and that required some specific effort:

Macros, modules, threads, futures, places, custodians, events, ports, networking, string encodings, paths, regular expressions, mutable and immutable hash tables, structure properties and applicable structures, procedure arities and names, chaperones and impersonators, delimited continuations, continuation marks, parameters, exceptions, logging, security guards, inspectors, plumbers, reachability-based memory accounting, ephemerons, ordered and unordered finalization, foreign-function interface (mostly), phantom byte strings, source locations, left-to-right evaluation, result-arity checking, left-associative arithmetic, eqv? on NaNs, and eq? and flonums.

Outlook

The rest of this report will provide lots of numbers, but none of them expose the main benefit of Racket CS over the current Racket implementation: it’s more flexible and maintainable.

Putting a number on maintainability is less easy than measuring benchmark performance. Anecdotally, as the person who has worked on both systems, I can report that it’s no contest. The current Racket implementation is fundamentally put together in the wrong way (except for the macro expander), while Racket CS is fundamentally put together in the right way. Time and again, correcting a Racket CS bug or adding a feature has turned out to be easier than expected.

To maximize the maintenance benefits of Racket CS, it’s better to make it the default Racket variant sooner rather than later— and, ideally, discard the current Racket implementation. But while Racket CS is compatible with Racket to a high percentage, it’s never going to be 100%. From here, it’s some combination of patching differences and migrating away from irreconcilable differences, and that will take a little time. Given that both implementations need to exist for a while, anyway, we can given some weight to end-to-end performance when deciding on the right point to switch.

Many plots in this report are intended to tease out reasons for the performance difference between Racket CS and current Racket. From the explorations, so far, its does not appear that the performance difference is an inevitable trade-off from putting Racket together in a better way. Part of the problem is that some new code on top of Chez Scheme needs to be refined. Perhaps more significantly, there are some trade-offs in the space of compilation timing (ahead-of-time or just-in-time) and code representation (machine code versus bytecode) that we can adjust with more work.

Although the current Racket implementation and Racket CS will both exist for a while, we do not anticipate the dueling implementations to create problems for the Racket community. The question of which to use will be more analogous to “which browser works best for your application?” than “does this library need Python 2 or Python 3?”

Meanwhile, there’s even more code to maintain, and accommodating multiple Racket variants creates some extra complexity by itself (e.g., in the distribution builds). It still looks like a good deal in the long run.

Performance of Compiled Code

The plots below show timings for Chez Scheme (purple), Racket CS (blue), and current Racket (red) on traditional Scheme benchmarks. Shorter is better. The results are sorted by Chez Scheme’s time over Racket’s time, except that benchmarks that rely on mutable pairs are in a second group with green labels.

  • Note that the break-even point between Chez Scheme and Racket is toward the end of the set of benchmarks with black lables, which reflects that Chez Scheme is usually faster than current Racket.

  • The main result is that the blue bar tracks the purple bar fairly well for the benchmarks without mutable pairs: Racket CS’s layers on top of Chez Scheme are not interfering too much with Chez Scheme’s base performance, even though Racket CS wraps and constrains Chez Scheme in various ways (e.g., enforcing left-to-right evaluation of application arguments).

  • For the benchmarks that use mutable pairs (green labels), Racket CS loses some of Chez Scheme’s performance by redirecting mutable-pair operations away from the built-in pair datatype, since built-in pairs are used only for immutable pairs in Racket CS.

  • The tak variants where the blue bar is shortest may be due to an extra layer of function inlining. The collatz test is effectively a test of exact-rational arithmetic on large fractions.

The next set of plots compare Racket CS and current Racket on the Racket implementations of benchmarks that were written over the years for The Computer Language Benchmarks Game. These rely more heavily on Racket-specific language features. Racket CS’s slowness toward the end of the list is often due to the I/O implementation, which is newly implemented for Racket CS and will take time to refine.

Aside from the fact that I/O needs work in Racket CS, the takeaway here is that there are no huge problems nor huge performance benefits with the Racket CS implementation. Longer term, the red lines probably aren’t going to move, but because so much new code is involved with the blue lines, there’s reason to think that some blue lines can get shorter.

About the measurements: These benchmarks are in the "racket-benchmark" package in the "common" and "shootout" directories. We used commit f6b6f03401 of the Racket fork of Chez Scheme and commit c9e3788d42 of Racket. The Chez Scheme fork includes Gustavo Massaccesi’s “cptypes” pass, which improves Chez Scheme’s performance on a few benchmarks. The test machine was a Core i7-2600 3.4GHz running 64-bit Linux.

Startup and Load Times

Startup and load time have improved since previous reports, but Racket CS remains slower.

Startup for just the runtime system without any libraries (still on a Core i7-2600 3.4GHz running 64-bit Linux):

The difference here is that the Racket CS startup image has much more Scheme and Racket code that is dynamically loaded and linked, instead of loaded as a read-only code segment like the compiled C code that dominates the current Racket implementation. We can illustrate that effect by building the current Racket implementation in a mode where its Racket-implemented macro expander is compiled to C code instead of bytecode, too, shown below as “R/cify.” We can also compare to Racket v6, which had an expander that was written directly in C:

The gap widens if we load compiled Racket code. Loading the racket/base library:

The additional difference here is that Racket CS’s machine code is bigger than current Racket’s bytecode representation. Furthermore, the current Racket implementation is lazy about parsing some bytecode. We can tease out the latter effect by disabling lazy bytecode loading with the -d flag, shown as “R/all”:

(We could also force bytecode to be JITted immediately— but JITting is more work than just loading, so that timing result would not be useful.)

We get a similar shape and a larger benefit from lazy loading with the racket library, which is what the racket executable loads by default for interactive mode:

About the measurements: These results were gathered by using time in a shell a few times and taking the median. The command was as shown, but using racketcs for the “R/CS” lines and racket -d for the “R/all” lines.

Memory Use

Like load times, differences in memory use between Racket CS and current Racket can be attributed to code-size differences from bytecode versus machine code and by lazy bytecode loading.

The following plots show memory use, including both code and data, after loading racket/base or racket, but subtracting memory use at the end of a run that loads no libraries (which reduces noise from different ways of counting code in the initial heap). The “R/jit!” line uses -d to load all bytecode eagerly, and it further forces that bytecode to be compiled to native code by the JIT compiler:

These results show that bytecode is more compact than machine code, as expected. Lazy parsing of bytecode also makes a substantial difference in memory use for the current Racket implementation. Racket’s current machine code takes a similar amount of space as Chez Scheme machine code, but the JIT overhead and other factors make it even larger. (Bytecode is not retained after conversion to machine code by the JIT.)

On a different scale and measuring peak memory use instead of final memory use for DrRacket start up and exit:

This result reflects that DrRacket’s memory use is mostly the code that implements DrRacket, at least if you just start DrRacket and immediately exit.

The gap narrows if you open an earlier version of this document’s source and run it three times before exiting, so that memory use involves more than mostly DrRacket’s own code:

About the measurements: These results were gathered by running racket or racketcs starting with the arguments -l racket/base, -l racket, or -l drracket. The command further included -W "[email protected]" -e (collect-garbage) -e (collect-garbage) and recording the logged memory use before that second collection. For the “R” line, the reported memory use includes the first number that is printed by logging in square brackets, which is the memory occupied by code outside of the garbage collector’s directly managed space. For “R/all,” the -d flag is used in addition, and for “R/jit!,” the PLT_EAGER_JIT environment variable was set in addition to supplying -d.

Expand and Compile Times

Compile times have improved some for Racket CS since the original report, but not dramatically. These plots compare compile times from source for the racket/base module (and all of its dependencies) and the racket module (and dependencies):

Compilation requires first macro-expanding source, and that’s a significant part of the time for loading from source. Racket CS and current Racket use the same expander implementation, and they expand at practically the same speed, so the extra time in Racket CS can be attributed to machine-code compilation. The following plots show how parts of the compile time can be attributed to specific subtasks:

Another way to look at compile times is to start with modules that are already expanded by the macro expander and just compile them. The -M flag alone does not do that, but it’s meant here to represent an installation that was constructed by using the -M flag for all build steps:

The difference in these compile times reflects how Chez Scheme puts much more effort into compilation. Of course, the benefit is the improved run times that you see in so many benchmarks.

The compile-only bars are also significantly shorter than taking the expansion-plus-compilation bars and removing only the gray part. That’s because the gray part only covers time spent specifically in the macro expander or running macro transformers, but it does not cover the time to compile macro definitions as they are discovered during expansion or to instantiate modules for compile-time use.

Given that the Chez Scheme compiler is so much slower (for good reason) than the current Racket compiler, we might ask how it compares to other, non-Racket compilers. Fortunately, we can make a relatively direct comparison between C and Racket, because the Racket macro expander was formerly written in C, and now it is written in Racket with essentially the same algorithms and architecture (only nicer). The implementations are not so different in lines of code: 45 KLoC in C versus 28.5 KLoC in Racket. The following plot shows compile times for the expander’s implementation:

To further check that we’re comparing similar compilation tasks, we can check the size of the generated machine code. Toward that end, we can compile the Racket code to C code through a cify compiler, which is how the expander is compiled for the current Racket implementation for platforms that are not supported by Racket’s JIT. Below is a summary of machine-code sizes for the various compiled forms of the expander.

The current Racket implementation generates much more code from the same implementation, in part because it inlines functions aggressively and relies on the fact that only called code is normally translated to machine code; the “R/jit!/no” bar shows the code size when inlining is disabled. In any case, while the machine-code sizes vary quote a bit in this test, they’re all on the same general scale.

In summary, as an extensible language, the question of compile times is more complicated than for a conventional programming language. At the core-compiler level, current Racket manages to be very fast as a compiler by not trying hard. Racket CS, which gets its compile times directly from Chez Scheme, spends more time compiling, but it still has respectable compile times.

About the measurements: The numbers in compile-time plots come from running the shown command (but with racketcs instead of racket for the “R/CS” lines) with the PLT_EXPANDER_TIMES and PLT_LINkLET_TIMES environment variables set. The overall time is as reported by time for user plus system time, and the divisions are extracted from the logging that is enabled by the environment variables.

For measuring compile times on the expander itself, the Chez Scheme measurement is based on the build step that generates "expander.so", the current-Racket measurement is based on the build step that generates "cstartup.inc", and the C measurement is based on subtracting the time to rebuild Racket version 6.12 versus version 7.2.0.3 when the ".o" files in "build/racket/gc2" are deleted.

For measuring machine-code size, the expander’s code size for Chez Scheme was computed by comparing the output of object-counts after loading all expander prerequsites to the result after the expander; to reduce the code that is just form the library wrapper, the expander was compiled as a program instead of as a library. The code size for Racket was determined by setting PLT_EAGER_JIT and PLT_LINKLET_TIMES and running racket -d -n, which causes the expander implemtation to be JITted and total bytes of code generated by the JIT to be reported. The “R/no-inline” variant was the same, but compiling the expander to bytecode with compile-context-preservation-enabled set to #f, which disables inlining. The “R/cify” code size was computed by taking the difference on sizes of the Racket shared library for a normal build and one with --enable-cify, after stripping the binaries with strip -S, then further subtracting the size of the expander’s bytecode as it is embedded in the normal build’s shared library. The “C” code size was similarly computed by subtracting the size of the Racket shared library for version 7.2.0.3 from the size for the 6.12 release, stipped and with the expander bytecode size subtracted.

Build Time

Since Racket programs rely heavily on metaprogramming facilities–either directly or just by virtue of being a Racket program— the time required to build a Racket program depends on a combination of compile time, run time, and load time. Few Racket programmers may care exactly how long it takes to build the Racket distribution itself, but distribution-build performance is probably indicative of how end-to-end performance will feel to a programmer using Racket.

The following plots are all on the same scale, and they show memory use plotted against time for building the Racket distribution from source. The first two plots are essentially the same as in the January 2018 report. While the graph stretches out horizontally for Racket CS, showing a build that takes about three times as long, it has very much the same shape for memory use.

Racket CS

 

 

current Racket

 

One might assume that the difference in compile time explains the slower Racket CS build. However, this assumption does not hold up if we completely isolate the step of compiling fully expanded modules. To set up that comparison, the following plots show build activity when using current Racket and making “compile” just mean “expand.” It happens to take about the same time as a Racket CS build, but with more of the time in the documentation phases:

current Racket -M

 

Although current Racket compiles from expanded source relatively quickly, a build requires loading the some modules over and over for compiling different sets of libraries and running different documentation examples. The documentation running and rendering phases, as shown in the blue in green regions, are especially show and use especially much memory, because documentation often uses sandboxes that load libraries to run and render examples. (The big jump at the same point in the blue and green region merits further investigation. It might be a sandbox bug or a leaky unsafe library.)

Given the result of the expand-only build as an input, we can switch in-place to either Racket CS or normal-mode current Racket and compile each fully expanded module to machine code:

Racket CS finish

 

 

current Racket finish

 

Each module is compiled from expanded form just once, and that compiled form can be used as needed (for cross-module optimization) to compile other modules. Also, documentation doesn’t get re-run and re-rendered in this finishing build, because the build process can tell that the sources did not change. Overall, compilation finishes in under 20 minutes for Racket CS, which is a reasonable amount of time for 1.2 million lines of source Racket code.

These build-finishing plots illustrate how the Racket distribution server generate bundles for multiple platforms and variants in hours instead of days. The build server first creates expanded-module builds of the packages and main collections, and it serves those to machine-specific finishing builds.

About the measurements: These plots were generated using the "plt-build-plot" package, which drives a build from source and plots the results. The -M build was created by setting the PLT_COMPILE_ANY environment variable, and then the finishing builds were measured by another run on the result but using the --skip-clean flag for "plt-build-plot".

Implementation Outlook

Based on the data that we’ve collected so far, I see three directions toward improving end-to-end performance for Racket CS:

  • Improvements to new implementation of Racket’s I/O API.

  • Better support in Chez Scheme to trade performance for faster compilation, combined at the Racket CS level with a bytecode-and-JIT setup that supports lazy decoding of bytecode.

    The January 2018 report mentions an experimental JIT mode for Racket CS, and that alternative remains in place. At the moment, it’s not a good alternative to Racket CS’s default mode, but it still may be a step in the right direction, especially considering that it allows JIT-style compilation and ahead-of-time compilation to coexist.

  • Algorithmic improvements to the way macros and modules work.

    That the full expansion stack takes 10 times as long as core compilation for Racket libraries suggests that there is room for algorithmic improvements that would help both the current Racket implementation and Racket CS.

There are bound to be additional performance factors that we haven’t yet isolated. Whether it turns out to be the factors that we know or others, working in the new implementation of Racket will make it easier explore the solutions.

Made with Frog, a static-blog generator written in Racket.
Source code for this blog.