posted by Matthew Flatt, Ryan Culpepper, Robby
Findler, Gustavo Massaccesi, and Sam Tobin-Hochstadt
With the version 9.0 release, Racket includes support for
shared-memory threads that can take advantage of multicore hardware
and operating-system threads to run in parallel—
not merely
concurrently with other Racket threads, as was the case in versions before 9.0.
Creating a thread that runs in parallel is as simple as adding a
flag to the call to thread.
To see the effect, try first putting the following code into a file named
"thread.rkt" and running racket thread.rkt on the
command line:
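A minimal sketch of such a file (the workload and the value of N are placeholders, not necessarily the post's original code):

```racket
#lang racket
;; thread.rkt: compute many square roots in two coroutine threads.
;; N is a placeholder; tune it so a run takes about a second.
(define N 10000000)

(define (work)
  (for ([i (in-range N)])
    (sqrt (+ 1.0 i))))

;; Plain coroutine threads run concurrently but on one core.
(for-each thread-wait
          (list (thread work)
                (thread work)))
```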
Racket will find many square roots (tweak N to match your
machine), but will keep only one core of your CPU busy. Using
time in the shell reports “CPU” (possibly broken into
“user” and “system”) and “real” times that are similar. To
use two cores, add #:pool 'own to the thread call:
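As a sketch (same caveats: the workload and N are placeholders):

```racket
#lang racket
;; N is a placeholder; tune it to your machine.
(define N 10000000)

(define (work)
  (for ([i (in-range N)])
    (sqrt (+ 1.0 i))))

;; #:pool 'own makes each thread a parallel thread in its own pool,
;; so the two workloads can occupy two cores at once.
(for-each thread-wait
          (list (thread work #:pool 'own)
                (thread work #:pool 'own)))
```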
In this case, real time should be about half of CPU time, while CPU
time should remain similar to before. In other words, the parallel version
runs twice as fast. On the machine used for the measurements in this post:
|            | speedup | real msec | CPU msec | CPU msec for GC |
|------------|---------|-----------|----------|-----------------|
| concurrent | ×1      | 1011      | 979      | 10              |
| parallel   | ×2      | 517       | 1021     | 13              |
Passing the new #:pool argument
creates a parallel thread. Create pools via
make-parallel-thread-pool to have a group of threads share
processor resources, or just pass 'own to give the
new thread its own parallel thread pool.
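For example, a pool shared by several threads might look like this (hedged: this assumes make-parallel-thread-pool accepts a maximum number of backing threads, and the workload is illustrative):

```racket
#lang racket
;; A pool whose threads share up to 4 processor cores.
(define pool (make-parallel-thread-pool 4))

;; Eight Racket threads share the pool's processor resources.
(define threads
  (for/list ([i (in-range 8)])
    (thread (lambda () (expt 2 200000)) #:pool pool)))

(for-each thread-wait threads)
```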
As a further addition to thread, a #:keep 'result
argument keeps the result of thunk when it returns, instead
of discarding the result. Retrieve a thread’s result with
thread-wait. So, for example, creating a thread with
#:pool 'own and #:keep 'result and then
calling thread-wait on it runs thunk in parallel
to other Racket threads, blocks the current Racket thread (while
allowing other Racket threads to continue, even non-parallel ones),
and then returns the result value(s) when thunk completes.
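A hedged sketch of that combination (the computed value is arbitrary):

```racket
#lang racket
;; #:keep 'result retains the thunk's value; thread-wait returns it.
(define t
  (thread (lambda () (* 6 7))
          #:pool 'own
          #:keep 'result))

(thread-wait t) ; returns 42
```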
To maintain backwards compatibility, the thread function still creates a coroutine
thread by default, which is a lightweight thread that is preemptively
scheduled and whose execution is interleaved with other coroutine
threads. For many tasks that need the organizational benefits of
concurrency without the performance benefits of parallelism, such as
when managing GUI interactions or orchestrating remote processes,
coroutine threads are still the best abstraction. Coroutine threads
can use #:keep 'result, too.
Racket’s full thread API works with parallel threads. Follow the links
from the thread documentation to see more details on thread
pools and for more interesting uses. Of course, just because you put
tasks in parallel threads doesn’t mean that they always speed up,
as sharing and communication can limit parallelism. Racket’s
future
visualizer works for parallel threads, though, and it can help you
understand where synchronization in a task limits parallelism.
Also, adding parallelism to Racket potentially creates trouble for existing
libraries that were not designed to accommodate parallelism. We expect
problems to be rare, however.
We’ll explore the performance details and explain why we expect most
programs will continue to work well later in this post, but first:
Racket’s Road to Parallelism
Running threads in parallel counts as news in 2025?! Well, it has been a long
road.
Racket’s implementation started in the mid-1990s, just as a wave of
enthusiasm for parallel programming was winding down. Although
operating systems by that point consistently supported within-process
threads, computers with multiprocessors were not commonly available.
Many language runtime systems from the same era—
including Python,
Ruby, and OCaml—
took advantage of the internal simplicity of a
single-threaded runtime system while offering constructs for
concurrency at the language level. Racket has always
included threads for concurrency, and it was an early adopter of
Concurrent ML’s
abstractions for managing concurrency well. But an absence of
parallelism was deeply baked into the original implementation.
Over time, to provide support for parallelism, we added
places and futures to Racket. Places support
coarse-grained parallelism through a message-passing API, effectively
running parallel instances of the virtual machine within a single
operating-system process; limited sharing makes the implementation
easier and safer than arbitrary sharing between parallel threads.
Futures provide fine-grained parallelism for restricted computations;
a future blocks when it tries to perform any operation that would be
difficult for the runtime system to complete safely in parallel.
Places and futures are both useful, and they avoid some pitfalls of
shared-memory threads. Still, fitting a parallel task into futures or
places usually requires special effort.
Meanwhile, single-threaded execution was only one of the problems with
the original Racket (a.k.a. PLT Scheme) implementation. To address
larger problems with the implementation and to improve performance, we
started in 2017
rebuilding
Racket on top of Chez Scheme. Rebuilding took some time, and we only
gradually deprecated the old “BC” implementation in favor of the new
“CS” implementation, but the transition is now complete. Racket BC
is still maintained, but as of August 2025, we distribute only Racket
CS builds at https://download.racket-lang.org.
Chez Scheme is a much better foundation for improving parallelism in
Racket. Part of the Racket-rebuilding effort included improving Chez
Scheme’s support for parallelism: we added memory fences as needed for
platforms with a weak memory-consistency model, and we parallelized
the Chez Scheme garbage collector so that garbage collection itself
runs in parallel. There’s still plenty of room for improvement—
the
garbage collector is only parallel with itself, not the main program,
for example—
but further improvements are more within reach than
before. Equally important, the rebuild included new implementations of
the Racket thread scheduler and I/O layer in Racket itself (instead of
C). Because of these improvements, Racket’s futures worked better for
parallelism from the start in Racket CS than in Racket BC.
With version 9.0, we finally take advantage of new opportunities for
parallelism created by the move to Racket CS. Internally, a parallel
thread is backed by a combination of a future and a coroutine thread.
The main extra work was making Racket’s coroutine thread scheduler
cooperate more with the future scheduler and making the I/O layer safe
for Chez Scheme threads—
all while making locks fine-grained enough
to enable parallelism, and keeping the cost of needed synchronization
as low as possible, including for non-parallel Racket programs.
Performance
Here are some simple benchmarks on an M2 Mac to give a sense
of the state of the current implementation. This machine has 8
cores, but 4 big and 4 little, so ×4 speedup is possible with
4-way parallelism but less than ×8 with 8-way parallelism.
As an easy first example, we should expect that a Fibonacci [code] run of
1 iteration in each of 4 coroutine threads takes the same time as
running 4 iterations in 1 thread, while 1 iteration in each of 4
parallel threads should take about 1/4th of the time. Also, for such a
simple function, using plain old futures should work just as well as
parallel threads. That’s what we see in the numbers below.
Times are shown as a speedup over single-threaded, followed by the
real elapsed milliseconds in parentheses. The times are from a single
run of each benchmark.
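The benchmark harness itself is not shown here; a hedged sketch of how such numbers could be collected, using the usual naive Fibonacci and time-apply for the timing components:

```racket
#lang racket
;; Naive doubly recursive Fibonacci, as in the benchmark.
(define (fib n)
  (if (< n 2)
      n
      (+ (fib (- n 1)) (fib (- n 2)))))

;; Run `work` in n parallel threads and report timings.
;; time-apply returns results plus CPU, real, and GC milliseconds.
(define (measure n work)
  (define-values (results cpu real gc)
    (time-apply
     (lambda ()
       (for-each thread-wait
                 (for/list ([_ (in-range n)])
                   (thread work #:pool 'own))))
     '()))
  (printf "real ~a msec; CPU ~a msec; GC ~a msec\n" real cpu gc))

;; Example: (measure 4 (lambda () (fib 40)))
```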
(fib 40), speedup (real msec):

| N | sequential | coroutine  | parallel   | futures    |
|---|------------|------------|------------|------------|
| 1 | ×1 (511)   | ×1 (506)   | ×1 (494)   | ×1 (495)   |
| 4 | ×1 (2045)  | ×1 (2034)  | ×3.7 (554) | ×3.8 (545) |
| 8 | ×1 (4154)  | ×1 (4154)  | ×5.4 (776) | ×5.2 (796) |
Of course, most programs are not just simple arithmetic. If we
change our example to repeatedly convert numbers back and forth
to strings as we compute Fibonacci [code], then
we can see the effects of the more complex conversions. This version
also triggers frequent allocation, which lets us see
how thread-local allocation and parallel garbage collection scale.
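A hedged sketch of the idea (the linked benchmark is the authoritative version; the starred variant in the table below additionally swaps in the string->number* wrapper discussed later):

```racket
#lang racket
;; Fibonacci where every intermediate sum is converted to a string
;; and parsed back, forcing allocation on each step.
(define (strfib n)
  (if (< n 2)
      n
      (string->number
       (number->string
        (+ (strfib (- n 1)) (strfib (- n 2)))))))
```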
(strfib* 32), speedup (real msec):

| N | sequential | coroutine | parallel   | futures    |
|---|------------|-----------|------------|------------|
| 1 | ×1 (204)   | ×1 (205)  | ×1.1 (192) | ×1 (211)   |
| 4 | ×1 (826)   | ×1 (808)  | ×3.7 (222) | ×3.7 (221) |
| 8 | ×1 (1619)  | ×1 (1602) | ×3.9 (419) | ×4 (406)   |
From this table, we still see reasonable scaling up to four cores,
but the additional work and the use of the garbage collector limit
scaling beyond that point.
That first string variant of Fibonacci includes a slight cheat,
however: it goes out of its way to use a string->number*
wrapper that carefully calls string->number in a way that
avoids evaluating expressions that compute the default values of some
arguments. The defaults consult the
parameters
read-decimal-as-inexact and
read-single-flonum—
which is a perfectly fine thing to do in
general, but turns out to block a future, because parameter values
can depend on the current continuation. In contrast, parallel threads
continue to provide a benefit when those kinds of Racket constructs
are used. We can see the difference by using plain
string->number in place of string->number*, which
will fetch parameter values 14 million times in each individual run of
(strfib 32):
(strfib 32), speedup (real msec):

| N | sequential | coroutine   | parallel    | futures     |
|---|------------|-------------|-------------|-------------|
| 1 | ×1 (772)   | ×1.3 (578)  | ×1.1 (721)  | ×0.9 (873)  |
| 4 | ×1 (3169)  | ×1.3 (2364) | ×4 (797)    | ×0.8 (4164) |
| 8 | ×1 (6409)  | ×1.4 (4730) | ×4.3 (1493) | ×0.8 (8353) |
The coroutine column here also shows an improvement, surprisingly.
That’s because a coroutine thread has a smaller continuation than the one in
the sequential column, and the cost of fetching a parameter
value can depend (to a limited degree) on continuation size.
The effect of parallel threads on this kind of program is
more consistent than fine details of a continuation’s shape.
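The string->number* wrapper described above might be sketched as follows, assuming string->number's optional radix, convert-mode, and decimal-mode arguments; passing explicit values means no parameter lookup, so a future is not blocked:

```racket
#lang racket
;; Hypothetical wrapper: supplying explicit mode arguments keeps
;; string->number from consulting the read-decimal-as-inexact and
;; read-single-flonum parameters.
(define (string->number* s)
  (string->number s 10 'number-or-false 'decimal-as-inexact))
```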
Operations on mutable equal?-based
hash tables
[code] are another case where
futures block, but parallel threads can provide a performance
improvement.
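The linked benchmark is not reproduced here; a hedged sketch of that kind of workload, where each call builds its own mutable equal?-based table:

```racket
#lang racket
;; String keys force equal?-based hashing on every insert.
(define (hash-nums k)
  (define h (make-hash))
  (for ([i (in-range (expt 10 k))])
    (hash-set! h (number->string i) i))
  (hash-count h))
```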
(hash-nums 6), speedup (real msec):

| N | sequential | coroutine | parallel   | futures   |
|---|------------|-----------|------------|-----------|
| 1 | ×1 (193)   | ×1 (190)  | ×1 (186)   | ×1 (191)  |
| 4 | ×1 (767)   | ×1 (763)  | ×3.7 (208) | ×1 (763)  |
| 8 | ×1 (1541)  | ×1 (1532) | ×4.5 (346) | ×1 (1539) |
As an illustration of the current limitations of parallel threads in
Racket, let’s try a program that writes data to a byte-string
port then hashes it
[code].
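A hedged sketch of the shape of that benchmark (using sha1 from file/sha1; the linked code may differ in detail):

```racket
#lang racket
(require file/sha1)

;; Write numbers to an in-memory byte-string port, then hash the
;; accumulated bytes.
(define (hash-digs k)
  (define out (open-output-bytes))
  (for ([i (in-range (expt 10 k))])
    (write i out))
  (sha1 (open-input-bytes (get-output-bytes out))))
```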
(hash-digs 7), speedup (real msec):

| N | sequential | coroutine   | parallel   | futures   |
|---|------------|-------------|------------|-----------|
| 1 | ×1 (127)   | ×1 (127)    | ×0.9 (135) | ×1 (126)  |
| 4 | ×1 (503)   | ×0.9 (536)  | ×2.5 (201) | ×1 (520)  |
| 8 | ×1 (1022)  | ×0.9 (1097) | ×2.5 (403) | ×1 (1049) |
Here we see that parallel threads do get some speedup, but
they do not scale especially well. Because each thread's
byte-string port is uncontended, parallelism provides some
improvement, but speedup is limited by general locks in the
I/O layer.
Even further in that direction, let’s try a program that
hashes all files in the current directory [code] and
computes a combined hash. When run on the "src" directory of
the Racket Git repository, most of the time is reading bytes from
files, and locks related to file I/O are currently too coarse-grained
to permit much speed-up.
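A hedged sketch of that task (again using sha1 from file/sha1; the linked code and its parallel decomposition may differ):

```racket
#lang racket
(require file/sha1)

;; Hash each file under dir, then combine the per-file digests
;; into a single digest.
(define (hash-dir [dir (current-directory)])
  (define digests
    (for/list ([f (in-directory dir)]
               #:when (file-exists? f))
      (call-with-input-file f sha1)))
  (sha1 (open-input-bytes
         (string->bytes/utf-8 (apply string-append digests)))))
```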
(hash-dir), speedup (real msec):

| N | sequential | coroutine   | parallel   | futures   |
|---|------------|-------------|------------|-----------|
| 1 | ×1 (170)   | ×1 (169)    | ×0.7 (256) | ×1 (170)  |
| 4 | ×1 (692)   | ×1 (662)    | ×1.3 (515) | ×1 (681)  |
| 8 | ×1 (1393)  | ×1.1 (1293) | ×1.6 (868) | ×1 (1368) |
Having locks in place for parallel threads can impose a cost on
sequential programs, since locks generally have to be taken whether or
not any parallel threads are active. Different data structures in
Racket use specialized locks to minimize the cost, and most benchmarks
reported here report the same numbers in the sequential column in Racket v8.18 (the
previous release) and Racket v9.0. The exceptions are the
(hash-nums 6) and (hash-digs 7) benchmarks, because
those measure very fine-grained actions on mutable hash tables and I/O
ports, and the cost is largest for those. Comparing sequential times
for those two versions shows that support for parallel threads can cost
up to 6-8% for programs that do not use them, although the cost tends
to be much less for most programs.
(hash-nums 6), speedup (real msec):

| N | v8.18 sequential | v9.0 sequential |
|---|------------------|-----------------|
| 1 | ×1 (188)         | ×0.96 (195)     |
| 4 | ×1 (757)         | ×0.98 (773)     |
| 8 | ×1 (1520)        | ×0.98 (1546)    |
(hash-digs 7), speedup (real msec):

| N | v8.18 sequential | v9.0 sequential |
|---|------------------|-----------------|
| 1 | ×1 (118)         | ×0.94 (126)     |
| 4 | ×1 (474)         | ×0.94 (506)     |
| 8 | ×1 (947)         | ×0.92 (1025)    |
Overall, parallelizable numerical programs or ones that manipulate
unshared data structures can achieve speedup through parallel threads
relatively easily, but I/O remains a direction for improvement.
Backward Compatibility
If a library uses mutable variables or objects, either publicly or
internally, then it must use locks or some other form of concurrency
control to work properly in a multithreaded context. Racket already
has concurrency, and the expectation for libraries to work with
threads does not change with the introduction of parallel threads.
Racket’s semaphores, channels, and other synchronization constructs
work the same with parallel threads as concurrent threads. Even
programs that use lock-free approaches based on compare-and-swap
operations (such as box-cas!) continue to work, since Racket’s
compare-and-swap operations use processor-level primitives.
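For instance, a lock-free counter built on box-cas! retries until the swap succeeds, and the same code works whether the competing threads are coroutine or parallel (the 4-thread workload here is illustrative):

```racket
#lang racket
;; Lock-free increment: retry the compare-and-swap until it lands.
(define (box-add1! b)
  (let loop ()
    (define v (unbox b))
    (unless (box-cas! b v (add1 v))
      (loop))))

(define counter (box 0))
(define ts
  (for/list ([_ (in-range 4)])
    (thread (lambda ()
              (for ([_ (in-range 1000)])
                (box-add1! counter)))
            #:pool 'own)))
(for-each thread-wait ts)
(unbox counter) ; 4000: no increment is lost under contention
```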
Still, there are a few concerns:
-
Racket’s coroutine threads offer the guarantee of
sequential consistency, which means that effects in one
thread cannot be seen out-of-order in another thread. Parallel threads
in Racket
expose
the underlying machine’s memory-consistency model, which may allow
reordering of memory effects as observed by other threads. In general,
a weak memory model can be an issue for code not intended for use with
threads, but Racket—
more precisely, Chez Scheme—
always guarantees
the memory safety of such code using memory fences. That is, Racket
code might observe out-of-order writes, but it never observes
ill-formed Racket objects. The fences are not new, and they are part of
the same write barrier that already supports generational garbage
collection and the memory safety of futures. Sequential consistency
would permit lock implementations that do not work under weaker memory
models; such locks would work with coroutine threads but not with
parallel threads. We have not found any such implementations in Racket
libraries, however.
-
Some Racket libraries use atomic mode for concurrency
control. Atomic mode in Racket prevents coroutine thread swaps, and
entering atomic mode is a relatively cheap operation within Racket’s
coroutine scheduler. When a parallel thread enters atomic mode,
it prevents other coroutine threads from running, but it does
not prevent other parallel threads from running. As long as
atomic mode is used consistently to guard a shared resource, then it
continues to serve that role with parallel threads.
Entering atomic mode is a much more expensive operation in a parallel
thread than in a coroutine thread; in many cases, Racket core
libraries that need finer-grained locking will need to
move away from atomic mode. Still, making atomic mode
synchronize a parallel thread with coroutine threads provides a
graceful fallback and evolution path.
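As a sketch, atomic mode via ffi/unsafe/atomic guarding a shared mutable resource (the resource here is illustrative):

```racket
#lang racket
(require ffi/unsafe/atomic)

;; call-as-atomic prevents coroutine-thread swaps during the body;
;; used consistently around every access, it guards log-entries
;; for coroutine and parallel threads alike, per the text above.
(define log-entries '())

(define (add-entry! e)
  (call-as-atomic
   (lambda ()
     (set! log-entries (cons e log-entries)))))
```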
-
Foreign functions that are called by Racket in a coroutine
thread are effectively atomic operations when there are no parallel
threads, since a coroutine swap cannot take place during the foreign
call. It’s rare that this atomicity implies any kind of lock at the
Racket level, however, and the foreign function itself is either
adapted to operating-system threads or not. Racket can already create
operating-system threads through dynamic-place, and
foreign-function bindings have generally been adapted already to that
possibility.
The greater degree of concurrency enabled by parallelism exposed some
bugs in our existing core libraries that could have been triggered
with coroutine threads, but hadn’t been triggered reliably enough to
detect and repair the bugs before. Beyond those general improvements,
our experience with pre-release Racket is that parallel threads have
not created backward-compatibility problems.