23 Nov 2025

Parallel Threads in Racket v9.0

posted by Matthew Flatt, Ryan Culpepper, Robby Findler, Gustavo Massaccesi, and Sam Tobin-Hochstadt

With the version 9.0 release, Racket includes support for shared-memory threads that can take advantage of multicore hardware and operating-system threads to run in parallel, not merely concurrently with other Racket threads as in versions before 9.0.

Creating a thread that runs in parallel is as simple as adding a flag to the call to thread. To see the effect, first put the following code into a file named "thread.rkt" and run racket thread.rkt on the command line:

#lang racket/base

;; stand-in for the post's `black-box`: consume each result so the
;; loop's work is not optimized away
(define (black-box v) (void v))

(define N 22)

(define (go)
  (thread
   (λ ()
     (for ([i (in-range (expt 2 N))])
       (black-box (sqrt i))))))

(define t1 (go))
(define t2 (go))

(thread-wait t1)
(thread-wait t2)

Racket will find many square roots (tweak N to match your machine), but it will keep only one core of your CPU busy. Using time in the shell reports “CPU” (possibly broken into “user” and “system”) and “real” times that are similar. To use two cores, add #:pool 'own to the thread call:

#lang racket/base

;; stand-in for the post's `black-box`: consume each result so the
;; loop's work is not optimized away
(define (black-box v) (void v))

(define N 22)

(define (go)
  (thread
   #:pool 'own ; <-- only change is here
   (λ ()
     (for ([i (in-range (expt 2 N))])
       (black-box (sqrt i))))))

(define t1 (go))
(define t2 (go))

(thread-wait t1)
(thread-wait t2)

In this case, real time should be about half of CPU time, while CPU time should remain similar to before. In other words, the parallel version runs twice as fast. On the machine used below:

             speedup   real msec   CPU msec   GC CPU msec
 concurrent  ×1        1011         979       10
 parallel    ×2         517        1021       13

  

Passing the new #:pool argument creates a parallel thread; create pools via make-parallel-thread-pool to have a group of threads share processor resources, or just pass 'own to have the new thread run in its own parallel thread pool.
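For instance, sharing a pool between two threads might look like the following sketch. This assumes Racket 9.0; the zero-argument call to make-parallel-thread-pool, taken here to create a default-sized pool, is our assumption rather than something stated above.

```racket
#lang racket/base

;; Sketch, assuming Racket 9.0: two parallel threads that share one
;; pool's processor resources, instead of each getting its own pool
(define pool (make-parallel-thread-pool))

(define (go)
  (thread
   #:pool pool ; share `pool` rather than passing 'own
   (λ ()
     (for/sum ([i (in-range (expt 2 20))]) i))))

(define t1 (go))
(define t2 (go))
(thread-wait t1)
(thread-wait t2)
```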

As a further addition to thread, a #:keep 'result argument keeps the result of the thread’s thunk when it returns, instead of discarding it. Retrieve a thread’s result with thread-wait. So, for example,

(thread-wait (thread thunk #:pool 'own #:keep 'result))

runs thunk in parallel to other Racket threads, blocks the current Racket thread (while allowing other Racket threads to continue, even non-parallel ones), and then returns the result value(s) when thunk completes.
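Putting the two options together, here is a small sketch (again assuming Racket 9.0) that farms out two computations to parallel threads and collects both results; the in-parallel-thread helper name is ours, not part of the Racket API.

```racket
#lang racket/base

;; Sketch, assuming Racket 9.0: run a computation in each of two
;; parallel threads, keep the results, and collect them via thread-wait
(define (in-parallel-thread thunk)
  (thread thunk #:pool 'own #:keep 'result))

(define t1 (in-parallel-thread (λ () (for/sum ([i (in-range 1000)]) i))))
(define t2 (in-parallel-thread (λ () (expt 2 10))))

;; thread-wait blocks until each thread finishes, then returns its result
(define results (list (thread-wait t1) (thread-wait t2)))
```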

To maintain backwards compatibility, the thread function still creates a coroutine thread by default, which is a lightweight thread that is preemptively scheduled and whose execution is interleaved with other coroutine threads. For many tasks that need the organizational benefits of concurrency without the performance benefits of parallelism, such as when managing GUI interactions or orchestrating remote processes, coroutine threads are still the best abstraction. Coroutine threads can use #:keep 'result, too.

Racket’s full thread API works with parallel threads. Follow the links from the thread documentation to see more details on thread pools and for more interesting uses. Of course, just because you put tasks in parallel threads doesn’t mean that they always speed up, as sharing and communication can limit parallelism. Racket’s future visualizer works for parallel threads, though, and it can help you understand where synchronization in a task limits parallelism. Also, adding parallelism to Racket potentially creates trouble for existing libraries that were not designed to accommodate parallelism. We expect problems to be rare, however.

We’ll explore the performance details and explain why we expect most programs will continue to work well later in this post, but first:

Racket’s Road to Parallelism

Running threads in parallel counts as news in 2025?! Well, it has been a long road.

Racket’s implementation started in the mid-1990s, just as a wave of enthusiasm for parallel programming was winding down. Although operating systems by that point consistently supported within-process threads, computers with multiprocessors were not commonly available. Many language runtime systems from the same era— including Python, Ruby, and OCaml— took advantage of the internal simplicity of a single-threaded runtime system while offering constructs for concurrency at the language level. Racket has always included threads for concurrency, and it was an early adopter of Concurrent ML’s abstractions for managing concurrency well. But an absence of parallelism was deeply baked into the original implementation.

Over time, to provide support for parallelism, we added places and futures to Racket. Places support coarse-grained parallelism through a message-passing API, effectively running parallel instances of the virtual machine within a single operating-system process; limited sharing makes the implementation easier and safer than arbitrary sharing between parallel threads. Futures provide fine-grained parallelism for restricted computations; a future blocks when it tries to perform any operation that would be difficult for the runtime system to complete safely in parallel. Places and futures are both useful, and they avoid some pitfalls of shared-memory threads. Still, fitting a parallel task into futures or places usually requires special effort.

Meanwhile, single-threaded execution was only one of the problems with the original Racket (a.k.a. PLT Scheme) implementation. To address larger problems with the implementation and to improve performance, we started in 2017 rebuilding Racket on top of Chez Scheme. Rebuilding took some time, and we only gradually deprecated the old “BC” implementation in favor of the new “CS” implementation, but the transition is now complete. Racket BC is still maintained, but as of August 2025, we distribute only Racket CS builds at https://download.racket-lang.org.

Chez Scheme is a much better foundation for improving parallelism in Racket. Part of the Racket-rebuilding effort included improving Chez Scheme’s support for parallelism: we added memory fences as needed for platforms with a weak memory-consistency model, and we parallelized the Chez Scheme garbage collector so that garbage collection itself runs in parallel. There’s still plenty of room for improvement— the garbage collector is only parallel with itself, not the main program, for example— but further improvements are more within reach than before. Equally important, the rebuild included new implementations of the Racket thread scheduler and I/O layer in Racket itself (instead of C). Because of these improvements, Racket’s futures worked better for parallelism from the start in Racket CS than in Racket BC.

With version 9.0, we finally take advantage of new opportunities for parallelism created by the move to Racket CS. Internally, a parallel thread is backed by a combination of a future and a coroutine thread. The main extra work was making Racket’s coroutine thread scheduler cooperate more with the future scheduler and making the I/O layer safe for Chez Scheme threads— all while making locks fine-grained enough to enable parallelism and keeping the cost of the needed synchronization as low as possible, including for non-parallel Racket programs.

Performance

Here are some simple benchmarks on an M2 Mac to give a sense of the state of the current implementation. This machine has 8 cores, 4 big and 4 little, so a ×4 speedup is possible with 4-way parallelism, but less than ×8 with 8-way parallelism.

As an easy first example, we should expect that a Fibonacci [code] run of 1 iteration in each of 4 coroutine threads takes the same time as running 4 iterations in 1 thread, while 1 iteration in each of 4 parallel threads should take about 1/4th of the time. Also, for such a simple function, using plain old futures should work just as well as parallel threads. That’s what we see in the numbers below.
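The benchmarks themselves are linked as [code]; the following is only a hypothetical sketch of their shape, running several copies of a task in threads and waiting for all of them. As written it uses coroutine threads, so it runs on any recent Racket; adding #:pool 'own to the thread call (Racket 9.0) gives the parallel variant.

```racket
#lang racket/base

;; Naive doubly-recursive Fibonacci, as a stand-in CPU-bound task
(define (fib n)
  (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

;; Run `iters` copies of `task`, each in its own thread, and wait for all;
;; hypothetical sketch of the benchmark's shape, not the linked code
(define (run-copies iters task)
  (define threads
    (for/list ([_ (in-range iters)])
      (thread task)))  ; parallel variant: (thread task #:pool 'own)
  (for-each thread-wait threads))

(time (run-copies 4 (λ () (fib 25))))
```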

Times are shown as a speedup over single-threaded, along with real elapsed milliseconds, total CPU milliseconds, and the CPU milliseconds spent specifically on GC. The times are from a single run of each benchmark.

  

(fib 40): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    511 (493/0)    ×1    506 (490/0)    ×1     494 (494/0)    ×1     495 (495/0)
 4   ×1   2045 (1978/0)   ×1   2034 (1967/0)   ×3.7   554 (2210/0)   ×3.8   545 (2168/3)
 8   ×1   4154 (4021/0)   ×1   4154 (4021/2)   ×5.4   776 (5928/1)   ×5.2   796 (6006/2)

  

Of course, most programs are not just simple arithmetic. If we change our example to repeatedly convert numbers back and forth to strings as we compute Fibonacci [code], then we can see the effects of the more complex conversions. This version also triggers frequent allocation, which lets us see how thread-local allocation and parallel garbage collection scale.

  

(strfib* 32): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    204 (197/3)    ×1    205 (198/0)    ×1.1   192 (192/0)    ×1     211 (203/0)
 4   ×1    826 (796/10)   ×1    808 (780/2)    ×3.7   222 (861/10)   ×3.7   221 (857/10)
 8   ×1   1619 (1563/11)  ×1   1602 (1545/4)   ×3.9   419 (2544/59)  ×4     406 (2551/59)

  

From this table, we still see reasonable scaling up to four cores, but the additional work and the use of the garbage collector limit scaling beyond that point.

That first string variant of Fibonacci includes a slight cheat, however: it goes out of its way to use a string->number* wrapper that carefully calls string->number in a way that avoids evaluating the expressions that compute the default values of some arguments. The defaults consult the parameters read-decimal-as-inexact and read-single-flonum, which is a perfectly fine thing to do in general, but it turns out to block a future, because parameter values can depend on the current continuation. In contrast, parallel threads continue to provide a benefit when those kinds of Racket constructs are used. We can see the difference by using plain string->number in place of string->number*, which will fetch parameter values 14 million times in each individual run of (strfib 32):
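One plausible shape for such a wrapper, sketched here from string->number's documented optional arguments rather than from the linked benchmark code: supplying the optional arguments explicitly means their parameter-consulting defaults are never evaluated.

```racket
#lang racket/base

;; Hypothetical sketch of a wrapper in the spirit of string->number*:
;; passing radix, convert-mode, and decimal-mode explicitly means the
;; default for decimal-mode, which consults read-decimal-as-inexact,
;; is never evaluated, so a future running this code is not blocked.
;; (A complete wrapper would also pass the single-flonum mode argument,
;; whose default consults read-single-flonum.)
(define (string->number* s)
  (string->number s 10 'number-or-false 'decimal-as-inexact))
```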

  

(strfib 32): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine             parallel               futures
 1   ×1    772 (751/4)    ×1.3   578 (562/1)    ×1.1    721 (721/1)    ×0.9   873 (851/0)
 4   ×1   3169 (3085/12)  ×1.3  2364 (2303/6)   ×4      797 (3103/33)  ×0.8  4164 (4058/2)
 8   ×1   6409 (6225/14)  ×1.4  4730 (4608/13)  ×4.3   1493 (9166/197) ×0.8  8353 (8135/4)

  

Surprisingly, the coroutine column here also shows an improvement. That’s because a coroutine thread has a smaller continuation than the one in the sequential column, and the cost of fetching a parameter value can depend (to a limited degree) on continuation size. The effect of parallel threads on this kind of program is more consistent than the fine details of a continuation’s shape.

Operations on mutable equal?-based hash tables [code] are another case where futures block, but parallel threads can provide a performance improvement.

  

(hash-nums 6): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    193 (184/1)    ×1    190 (182/0)    ×1     186 (186/0)    ×1    191 (182/0)
 4   ×1    767 (733/5)    ×1    763 (729/0)    ×3.7   208 (824/3)    ×1    763 (729/0)
 8   ×1   1541 (1473/11)  ×1   1532 (1463/1)   ×4.5   346 (2373/24)  ×1   1539 (1470/1)

  

As an illustration of the current limitations of parallel threads in Racket, let’s try a program that writes data to a byte-string port and then hashes it [code].

  

(hash-digs 7): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine             parallel              futures
 1   ×1    127 (123/3)    ×1     127 (123/4)    ×0.9   135 (135/3)    ×1    126 (122/3)
 4   ×1    503 (487/13)   ×0.9   536 (519/35)   ×2.5   201 (599/38)   ×1    520 (504/28)
 8   ×1   1022 (986/33)   ×0.9  1097 (1057/81)  ×2.5   403 (1630/85)  ×1   1049 (1017/60)

  

Here we see that parallel threads do get some speedup, but they do not scale especially well. Because separate ports are not contended, parallelism can improve performance, but the speedup is limited by some general locks in the I/O layer.

Even further in that direction, let’s try a program that hashes all files in the current directory [code] and computes a combined hash. When run on the "src" directory of the Racket Git repository, most of the time is reading bytes from files, and locks related to file I/O are currently too coarse-grained to permit much speed-up.

  

(hash-dir): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    170 (169/1)    ×1     169 (169/0)   ×0.7   256 (248/0)    ×1    170 (170/0)
 4   ×1    692 (690/3)    ×1     662 (658/3)   ×1.3   515 (1068/5)   ×1    681 (679/3)
 8   ×1   1393 (1377/10)  ×1.1  1293 (1290/6)  ×1.6   868 (2158/18)  ×1   1368 (1366/7)

  

Having locks in place for parallel threads can impose a cost on sequential programs, since locks generally must be taken whether or not any parallel threads are active. Different data structures in Racket use specialized locks to minimize the cost, and most benchmarks reported here report the same numbers in the sequential column for Racket v8.18 (the previous release) and Racket v9.0. The exceptions are the (hash-nums 6) and (hash-digs 7) benchmarks, because those measure very fine-grained actions on mutable hash tables and I/O ports, where the cost is largest. Comparing sequential times for those two versions shows that support for parallel threads can cost up to 6-8% for programs that do not use them, although the cost tends to be much less for most programs.

  

(hash-nums 6): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   v8.18 sequential     v9.0 sequential
 1   ×1    188 (180/0)    ×0.96   195 (187/0)
 4   ×1    757 (728/2)    ×0.98   773 (739/3)
 8   ×1   1520 (1461/9)   ×0.98  1546 (1477/12)

  

  

(hash-digs 7): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   v8.18 sequential     v9.0 sequential
 1   ×1    118 (114/3)    ×0.94   126 (122/3)
 4   ×1    474 (458/12)   ×0.94   506 (489/13)
 8   ×1    947 (915/25)   ×0.92  1025 (989/26)

  

Overall, parallelizable numerical programs or ones that manipulate unshared data structures can achieve speedup through parallel threads relatively easily, but I/O remains a direction for improvement.

Backward Compatibility

If a library uses mutable variables or objects, either publicly or internally, then it must use locks or some other form of concurrency control to work properly in a multithreaded context. Racket already has concurrency, and the expectation that libraries work with threads does not change with the introduction of parallel threads. Racket’s semaphores, channels, and other synchronization constructs work the same with parallel threads as with concurrent threads. Even programs that use lock-free approaches based on compare-and-swap operations (such as box-cas!) continue to work, since Racket’s compare-and-swap operations use processor-level primitives.
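As a sketch of the compare-and-swap style that keeps working, a lock-free counter can retry box-cas! until its update wins; box-cas! has long been part of racket/base, so this runs with coroutine threads today and, by the reasoning above, with parallel threads as well.

```racket
#lang racket/base

;; A lock-free counter using box-cas!: retry until our update is the
;; one that lands, which is safe from any number of threads
(define counter (box 0))

(define (bump! bx)
  (let loop ()
    (define old (unbox bx))
    (unless (box-cas! bx old (add1 old))
      (loop)))) ; another thread won the race; retry

(define threads
  (for/list ([_ (in-range 4)])
    (thread (λ () (for ([_ (in-range 1000)]) (bump! counter))))))
(for-each thread-wait threads)
(unbox counter) ; 4000: no increments lost
```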

Still, there are a few concerns:

  • Racket’s coroutine threads offer the guarantee of sequential consistency, which means that effects in one thread cannot be seen out-of-order in another thread. Parallel threads in Racket expose the underlying machine’s memory-consistency model, which may allow reordering of memory effects as observed by other threads. In general, a weak memory model can be an issue for code not intended for use with threads, but Racket— more precisely, Chez Scheme— always guarantees the memory safety of such code using memory fences. That is, Racket code might observe out-of-order writes, but it never observes ill-formed Racket objects. The fences are not new; they are part of the same write barrier that already supports generational garbage collection and the memory safety of futures. Sequential consistency would also support lock implementations that do not work with weaker memory models, so such locks would work with coroutine threads but not parallel threads; however, we have not found any such implementations in Racket libraries.

  • Some Racket libraries use atomic mode for concurrency control. Atomic mode in Racket prevents coroutine thread swaps, and entering atomic mode is a relatively cheap operation within Racket’s coroutine scheduler. When a parallel thread enters atomic mode, it prevents other coroutine threads from running, but it does not prevent other parallel threads from running. As long as atomic mode is used consistently to guard a shared resource, it continues to serve that role with parallel threads.

    Entering atomic mode is a much more expensive operation in a parallel thread than in a coroutine thread; in many cases, Racket core libraries that need finer-grained locking will need to move away from using atomic mode. Still, making atomic mode synchronize a parallel thread with coroutine threads provides a graceful fallback and evolution path.

  • Foreign functions called by Racket in a coroutine thread are effectively atomic operations when there are no parallel threads, since a coroutine swap cannot take place during the foreign call. It’s rare that this atomicity implies any kind of lock at the Racket level, however, and the foreign function itself either has been adapted to operating-system threads or it hasn’t. Racket can already create operating-system threads through dynamic-place, and foreign-function bindings have generally been adapted to that possibility already.
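For reference, the atomic mode discussed above is available through ffi/unsafe/atomic; a minimal sketch of using it to guard a shared mutable resource, with names of our own choosing:

```racket
#lang racket/base
(require ffi/unsafe/atomic)

;; Sketch: atomic mode as coarse-grained concurrency control. Within
;; call-as-atomic, no other coroutine thread can run, and with parallel
;; threads it still guards the resource, though at a higher cost there.
(define events '())

(define (record-event! msg)
  (call-as-atomic
   (λ () (set! events (cons msg events)))))

(record-event! "started")
```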

The greater degree of concurrency enabled by parallelism exposed some bugs in our existing core libraries that could have been triggered with coroutine threads, but hadn’t been triggered reliably enough to detect and repair the bugs before. Beyond those general improvements, our experience with pre-release Racket is that parallel threads have not created backward-compatibility problems.
