posted by Matthew Flatt
For background information about Racket on Chez Scheme (a.k.a. Racket CS), see the original announcement, the January 2018 report, the report at Scheme Workshop 2018, the January 2019 report, the ICFP experience report, and the report at RacketCon 2019.
Racket on Chez Scheme (Racket CS) is ready for production use.
Racket CS now passes all of the tests for the main Racket distribution,
and differences in compile and run times are much reduced.
Overall, Racket CS tends to perform about the same as the traditional
Racket implementation (Racket BC, “before Chez”)—
sometimes better
and sometimes worse, but typically using more memory due to larger
code sizes.
Racket CS is not yet the default Racket implementation, but it is
available as a download option alongside the regular Racket release at
https://download.racket-lang.org/ (select CS from the
Variant popup).
Run-Time Performance
Run-time performance for Racket CS has continued to improve.
Benchmarks show the difference to some degree, but they understate the
difference for a typical Racket application. Alex Harsanyi shared his
initial
experience with Racket CS in December 2018, and he has been kind enough to keep
taking measurements. The plots below show his results for
Racket CS and Racket BC, where
lower is better, and the overall trend here seems typical for
applications that I’ve measured.
As the trend lines may suggest, the overall improvement is from many
small changes that add up. The plateau around June to October 2019
coincides with a push on correctness and compatibility, as opposed to
performance, to make all Racket tests pass.
The plots below show current results for traditional Scheme
benchmarks. The top bar is current and
unmodified Chez Scheme, while the
second bar is Chez Scheme
as modified to support Racket. The third bar is
Racket CS, and the bottom bar is Racket BC. The
last two rows of benchmarks rely on mutable
pairs, so they are run in Racket as #lang r5rs
programs. There’s not a lot of difference here compared to one year
ago, except that a faster path for integer division in the Racket
variant of Chez Scheme has eliminated the
collatz outlier.
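For context on the r5rs note above: pairs created by racket/base are immutable, while R5RS pairs are mutable, so benchmarks that mutate pairs must run as R5RS programs. A minimal #lang r5rs program of that kind (not one of the benchmark programs) looks like this:

  #lang r5rs
  ;; In an r5rs module, pairs are mutable, unlike in racket/base.
  (define p (list 1 2 3))
  (set-car! p 0)
  (display p)   ; prints (0 2 3)
  (newline)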
The next set of plots compares Racket CS and Racket BC over the years
on Racket-specific implementations of benchmarks from The
Computer Language Benchmarks Game. Compared to one year ago, the
fraction of benchmarks where Racket CS wins over Racket BC has
reversed. Much of that improvement happened in the thread and I/O
layers that were newly implemented for Racket CS.
About the measurements: Alex Harsanyi’s measurements are for parts of the
ActivityLog2 test suite; ActivityLog2 and the CI infrastructure it
runs on have both changed over time, so the plots are approximate, but
they are generally consistent with fresh runs with the current
ActivityLog2 implementation on Racket versions 7.3 through 7.6.0.12.
All other measurements use a Core i7-2600 at 3.4GHz running 64-bit
Linux. Benchmarks are in the
"racket-benchmark"
package in the "common" and "shootout" directories.
We used commit a20e3f305c of the Racket variant of Chez Scheme
and commit cdd0659438 of Racket.
Startup and Load Times
Load times have improved for Racket CS. Loading the
racket/base library:
Loading the full racket library:
Load times are, of course, directly related to memory use, and
Racket CS load times improved primarily through reduced memory use.
About the measurements: These results were gathered by using time in a
shell a few times and taking the median. The command was as shown.
Memory Use
Differences in memory use between Racket CS and Racket BC are mostly
due to bytecode versus machine code, plus the fact that Racket BC can
load bytecode more lazily. Various improvements, including changes to
the Chez Scheme compiler to reduce code size, have decreased memory
use in Racket CS by around 20% for typical applications.
The following plots show memory use, including both code and data,
after loading racket/base or racket,
but subtracting memory use at the end of a run that loads no libraries
(which reduces noise from different ways of counting code in the
initial heap). The gray portion of each bar is an estimate of memory
occupied by code, which may be in machine-code form, bytecode form,
or not yet unmarshaled.
Racket BC heap sizes here are larger compared to previous reports by
about 5 MB. That difference reflects a more accurate measurement of
the initial Racket BC heap.
On a different scale and measuring peak memory use instead of final
memory use for DrRacket start up and exit:
About the measurements: These results were gathered by running racket with the
arguments -l racket/base, -l racket, or -l
drracket. The command further included -W "debug@GC" -e
'(collect-garbage)' -e '(collect-garbage)', and reported sizes are
based on the logged memory use before the second collection. For the
Racket BC bar, the reported memory use includes the first number that is
printed by logging in square brackets, which is the memory occupied by
code outside of the garbage collector’s directly managed space.
Baseline memory use was measured by setting the
PLT_GCS_ON_EXIT environment variable and running with
-n, which has the same effect as -e
'(collect-garbage)' -e '(collect-garbage)'. DrRacket was initialized
with racket/base as the default language; also,
background expansion was disabled, because measuring memory use is
tricky on Racket BC. Code size was estimated using
dump-memory-stats, counting bytecode, JIT-generated native
code, and marshaled code for Racket BC, and machine code, relocations,
and bytevectors (which are mostly marshaled code) for Racket CS.
Expand and Compile Times
Compile times have improved substantially for Racket CS. The main
change was to use an interpreter for compile-time code within the
module currently being compiled, because that’s a better trade-off in
(meta) compilation time and (meta) run time. Compile-time code that is
exported for use by other modules is compiled and optimized normally.
While the Racket CS interpreter and the one in Racket BC are both
intended to be safe for space, the Racket CS one stands a much better
chance of achieving that goal.
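As a rough illustration (a sketch, not code from the Racket sources), the affected code is module-private compile-time code like the helper below, which now runs in the interpreter while its enclosing module is compiled; a compile-time helper provided to other modules would still be compiled and optimized normally:

  #lang racket/base
  (require (for-syntax racket/base))

  ;; A compile-time helper that is private to this module; code like
  ;; this is what the new strategy interprets during compilation.
  (begin-for-syntax
    (define (greeting-string who)
      (string-append "hello, " who)))

  ;; A macro whose expansion-time code calls the private helper.
  (define-syntax (greet stx)
    (syntax-case stx ()
      [(_ who)
       (string? (syntax-e #'who))
       (datum->syntax stx (greeting-string (syntax-e #'who)))]))

  (greet "world")   ; expands to the string literal "hello, world"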
About the measurements: These results were gathered by using time in a
shell a few times and taking the median. The command was as shown.
Build Time
The time and memory used to build a Racket distribution using
Racket CS is now much closer to the time and memory used by
Racket BC. The following plots are all on the same scale, and they
show memory use plotted against time for building the Racket
distribution from source:
In each plot, there are two lines, although they are
often smashed together. The top line is memory use just before a
major garbage collection, and the bottom line is memory use just after
a major garbage collection.
Racket CS: [plots of memory use over build time]
Racket BC: [plots of memory use over build time]
The Racket CS plot used to be more than twice as wide as the Racket BC
plot. About half of the improvement came from fixing a cache that
interacted badly with Racket CS’s more frequent minor garbage
collections. The rest of the improvement was due to many small
improvements.
About the measurements: These plots were generated using the
"plt-build-plot" package, which drives a build from source
and plots the results.
Run-Time Performance Redux
In some ways, the Racket CS project has validated the Racket BC
implementation, because it turns out that Racket BC performs pretty
well. With the notable exception of first-class continuations (where
Racket BC uses a poor strategy), the traditional, JIT-based Racket
engine performs close to Chez Scheme.
Then again, development to date has been aimed at making Racket CS
match the performance of Racket BC on a code base that was developed
and tuned for Racket BC, and that obviously gives Racket BC an
advantage. For example, Racket BC makes dormant code relatively cheap, so
Racket libraries generate a lot of code. Future Racket libraries will
likely shift to take advantage of Racket CS’s cheaper function calls
and dramatically cheaper continuations. One day, probably, Racket BC
will no longer be a viable alternative to Racket CS for most programs.
To make that prediction more concrete, consider these three ways of
counting to 10 million (if N is 10000000):
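The snippets below are a sketch of what such a comparison can look like, assuming (not necessarily the exact code measured) that the three variants are a plain loop, a loop that installs a continuation prompt on each iteration, and a loop that captures a continuation on each iteration:

  (define N 10000000)

  ;; 1. A plain loop.
  (for/fold ([v #f]) ([i (in-range N)])
    i)

  ;; 2. Each iteration goes through one extra abstraction layer,
  ;;    here a continuation prompt.
  (for/fold ([v #f]) ([i (in-range N)])
    (call-with-continuation-prompt
     (lambda () i)))

  ;; 3. Each iteration captures a continuation.
  (for/fold ([v #f]) ([i (in-range N)])
    (call-with-current-continuation
     (lambda (k) i)))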
Here are run times normalized to the first one, which is about the
same in Racket CS and Racket BC:
While no one will start writing trivial loops the slow way, there’s a
big difference between a ×5 and ×25 overhead when choosing how to
implement a new abstraction and deciding how much to make static
versus dynamic. There’s an even bigger difference between ×335 and
×2672—
both irrelevant for a toy loop, but ×335 becomes relevant
sooner than ×2672 for more interesting calculations. Overall, when
implementation choices start to rely on the lower Racket CS overheads,
the corresponding Racket BC overheads will sometimes be unacceptable.
Reflection and Outlook
It took three years to get Racket on Chez Scheme running well enough
for production use, and it will take yet more time for Racket CS to
fully replace Racket BC. But a certain amount of optimism is necessary
to take on a large project like this, and if the timeline gets
stretched beyond initial and early projections, then that’s only to be expected.
Racket CS will eventually outpace Racket BC for the reason that
originally motivated porting Racket to Chez Scheme: it’s put together
in a better way, so it’s easier to modify and improve. Maintainability
is difficult to capture with the same clarity as performance
benchmarks, but after spending one more year modifying both
implementations, I remain as convinced as ever that Racket CS is much
better.
This report is the last one for Racket CS. Here are a few things that
are on the Racket CS roadmap—
but, increasingly, we’ll just call it
the Racket roadmap:
- Support for embedding Racket CS in a larger application.
  Probably the C API here will start with the Chez Scheme C API,
  which will make it different from Racket BC’s C API: providing
  similar functionality, but with simpler rules for cooperating
  with the memory manager.
- Improved garbage collection, especially for large heap sizes,
  including support for incremental collection.
- Unboxed floating-point arithmetic, especially for local
  compositions of floating-point operations.