23 Nov 2025

Parallel Threads in Racket v9.0

posted by Matthew Flatt, Ryan Culpepper, Robby Findler, Gustavo Massaccesi, and Sam Tobin-Hochstadt

With the version 9.0 release, Racket includes support for shared-memory threads that can take advantage of multicore hardware and operating-system threads to run in parallel, not merely concurrently with other Racket threads as in versions before 9.0.

Creating a thread that runs in parallel is as simple as adding a flag to the call to thread. To see the effect, first put the following code into a file named "thread.rkt" and run racket thread.rkt on the command line:

#lang racket/base

;; stand-in for the post's `black-box`: consume each result so the
;; loop's work is not optimized away
(define (black-box v) (void v))

(define N 22)

(define (go)
  (thread
   (λ ()
     (for ([i (in-range (expt 2 N))])
       (black-box (sqrt i))))))

(define t1 (go))
(define t2 (go))

(thread-wait t1)
(thread-wait t2)

Racket will find many square roots (tweak N to match your machine), but it will keep only one core of your CPU busy. Using time in the shell reports “CPU” (possibly broken into “user” and “system”) and “real” times that are similar. To use two cores, add #:pool 'own to the thread call:

#lang racket/base

;; stand-in for the post's `black-box`: consume each result so the
;; loop's work is not optimized away
(define (black-box v) (void v))

(define N 22)

(define (go)
  (thread
   #:pool 'own ; <-- only change is here
   (λ ()
     (for ([i (in-range (expt 2 N))])
       (black-box (sqrt i))))))

(define t1 (go))
(define t2 (go))

(thread-wait t1)
(thread-wait t2)

In this case, real time should be about half of CPU time, while CPU time should remain similar to before. In other words, the parallel version runs twice as fast. On the machine used below:

             speedup   real msec   CPU msec   GC CPU msec
 concurrent  ×1        1011         979       10
 parallel    ×2         517        1021       13

  

Passing the new #:pool argument creates a parallel thread; create pools via make-parallel-thread-pool to have a group of threads share processor resources, or just pass 'own to have the new thread run in its own parallel thread pool.
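For instance, sharing a pool between two threads might look like the following sketch. This assumes Racket 9.0; the zero-argument call to make-parallel-thread-pool, taken here to create a default-sized pool, is our assumption rather than something stated above.

```racket
#lang racket/base

;; Sketch, assuming Racket 9.0: two parallel threads that share one
;; pool's processor resources, instead of each getting its own pool
(define pool (make-parallel-thread-pool))

(define (go)
  (thread
   #:pool pool ; share `pool` rather than passing 'own
   (λ ()
     (for/sum ([i (in-range (expt 2 20))]) i))))

(define t1 (go))
(define t2 (go))
(thread-wait t1)
(thread-wait t2)
```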

As a further addition to thread, a #:keep 'result argument keeps the result of the thread’s thunk when it returns, instead of discarding it. Retrieve a thread’s result with thread-wait. So, for example,

(thread-wait (thread thunk #:pool 'own #:keep 'result))

runs thunk in parallel to other Racket threads, blocks the current Racket thread (while allowing other Racket threads to continue, even non-parallel ones), and then returns the result value(s) when thunk completes.
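Putting the two options together, here is a small sketch (again assuming Racket 9.0) that farms out two computations to parallel threads and collects both results; the in-parallel-thread helper name is ours, not part of the Racket API.

```racket
#lang racket/base

;; Sketch, assuming Racket 9.0: run a computation in each of two
;; parallel threads, keep the results, and collect them via thread-wait
(define (in-parallel-thread thunk)
  (thread thunk #:pool 'own #:keep 'result))

(define t1 (in-parallel-thread (λ () (for/sum ([i (in-range 1000)]) i))))
(define t2 (in-parallel-thread (λ () (expt 2 10))))

;; thread-wait blocks until each thread finishes, then returns its result
(define results (list (thread-wait t1) (thread-wait t2)))
```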

To maintain backwards compatibility, the thread function still creates a coroutine thread by default, which is a lightweight thread that is preemptively scheduled and whose execution is interleaved with other coroutine threads. For many tasks that need the organizational benefits of concurrency without the performance benefits of parallelism, such as when managing GUI interactions or orchestrating remote processes, coroutine threads are still the best abstraction. Coroutine threads can use #:keep 'result, too.

Racket’s full thread API works with parallel threads. Follow the links from the thread documentation to see more details on thread pools and for more interesting uses. Of course, just because you put tasks in parallel threads doesn’t mean that they always speed up, as sharing and communication can limit parallelism. Racket’s future visualizer works for parallel threads, though, and it can help you understand where synchronization in a task limits parallelism. Also, adding parallelism to Racket potentially creates trouble for existing libraries that were not designed to accommodate parallelism. We expect problems to be rare, however.

We’ll explore the performance details and explain why we expect most programs will continue to work well later in this post, but first:

Racket’s Road to Parallelism

Running threads in parallel counts as news in 2025?! Well, it has been a long road.

Racket’s implementation started in the mid-1990s, just as a wave of enthusiasm for parallel programming was winding down. Although operating systems by that point consistently supported within-process threads, computers with multiprocessors were not commonly available. Many language runtime systems from the same era— including Python, Ruby, and OCaml— took advantage of the internal simplicity of a single-threaded runtime system while offering constructs for concurrency at the language level. Racket has always included threads for concurrency, and it was an early adopter of Concurrent ML’s abstractions for managing concurrency well. But an absence of parallelism was deeply baked into the original implementation.

Over time, to provide support for parallelism, we added places and futures to Racket. Places support coarse-grained parallelism through a message-passing API, effectively running parallel instances of the virtual machine within a single operating-system process; limited sharing makes the implementation easier and safer than arbitrary sharing between parallel threads. Futures provide fine-grained parallelism for restricted computations; a future blocks when it tries to perform any operation that would be difficult for the runtime system to complete safely in parallel. Places and futures are both useful, and they avoid some pitfalls of shared-memory threads. Still, fitting a parallel task into futures or places usually requires special effort.

Meanwhile, single-threaded execution was only one of the problems with the original Racket (a.k.a. PLT Scheme) implementation. To address larger problems with the implementation and to improve performance, we started in 2017 rebuilding Racket on top of Chez Scheme. Rebuilding took some time, and we only gradually deprecated the old “BC” implementation in favor of the new “CS” implementation, but the transition is now complete. Racket BC is still maintained, but as of August 2025, we distribute only Racket CS builds at https://download.racket-lang.org.

Chez Scheme is a much better foundation for improving parallelism in Racket. Part of the Racket-rebuilding effort included improving Chez Scheme’s support for parallelism: we added memory fences as needed for platforms with a weak memory-consistency model, and we parallelized the Chez Scheme garbage collector so that garbage collection itself runs in parallel. There’s still plenty of room for improvement— the garbage collector is only parallel with itself, not the main program, for example— but further improvements are more within reach than before. Equally important, the rebuild included new implementations of the Racket thread scheduler and I/O layer in Racket itself (instead of C). Because of these improvements, Racket’s futures worked better for parallelism from the start in Racket CS than in Racket BC.

With version 9.0, we finally take advantage of new opportunities for parallelism created by the move to Racket CS. Internally, a parallel thread is backed by a combination of a future and a coroutine thread. The main extra work was making Racket’s coroutine thread scheduler cooperate more with the future scheduler and making the I/O layer safe for Chez Scheme threads— all while making locks fine-grained enough to enable parallelism and keeping the cost of the needed synchronization as low as possible, including for non-parallel Racket programs.

Performance

Here are some simple benchmarks on an M2 Mac to give a sense of the state of the current implementation. This machine has 8 cores, 4 big and 4 little, so a ×4 speedup is possible with 4-way parallelism, but less than ×8 with 8-way parallelism.

As an easy first example, we should expect that a Fibonacci [code] run of 1 iteration in each of 4 coroutine threads takes the same time as running 4 iterations in 1 thread, while 1 iteration in each of 4 parallel threads should take about 1/4th of the time. Also, for such a simple function, using plain old futures should work just as well as parallel threads. That’s what we see in the numbers below.
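The benchmarks themselves are linked as [code]; the following is only a hypothetical sketch of their shape, running several copies of a task in threads and waiting for all of them. As written it uses coroutine threads, so it runs on any recent Racket; adding #:pool 'own to the thread call (Racket 9.0) gives the parallel variant.

```racket
#lang racket/base

;; Naive doubly-recursive Fibonacci, as a stand-in CPU-bound task
(define (fib n)
  (if (< n 2) n (+ (fib (- n 1)) (fib (- n 2)))))

;; Run `iters` copies of `task`, each in its own thread, and wait for all;
;; hypothetical sketch of the benchmark's shape, not the linked code
(define (run-copies iters task)
  (define threads
    (for/list ([_ (in-range iters)])
      (thread task)))  ; parallel variant: (thread task #:pool 'own)
  (for-each thread-wait threads))

(time (run-copies 4 (λ () (fib 25))))
```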

Times are shown as a speedup over single-threaded, along with real elapsed milliseconds, total CPU milliseconds, and the CPU milliseconds spent specifically on GC. The times are from a single run of each benchmark.

  

(fib 40): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    511 (493/0)    ×1    506 (490/0)    ×1     494 (494/0)    ×1     495 (495/0)
 4   ×1   2045 (1978/0)   ×1   2034 (1967/0)   ×3.7   554 (2210/0)   ×3.8   545 (2168/3)
 8   ×1   4154 (4021/0)   ×1   4154 (4021/2)   ×5.4   776 (5928/1)   ×5.2   796 (6006/2)

  

Of course, most programs are not just simple arithmetic. If we change our example to repeatedly convert numbers back and forth to strings as we compute Fibonacci [code], then we can see the effects of the more complex conversions. This version also triggers frequent allocation, which lets us see how thread-local allocation and parallel garbage collection scale.

  

(strfib* 32): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    204 (197/3)    ×1    205 (198/0)    ×1.1   192 (192/0)    ×1     211 (203/0)
 4   ×1    826 (796/10)   ×1    808 (780/2)    ×3.7   222 (861/10)   ×3.7   221 (857/10)
 8   ×1   1619 (1563/11)  ×1   1602 (1545/4)   ×3.9   419 (2544/59)  ×4     406 (2551/59)

  

From this table, we still see reasonable scaling up to four cores, but the additional work and the use of the garbage collector limit scaling beyond that point.

That first string variant of Fibonacci includes a slight cheat, however: it goes out of its way to use a string->number* wrapper that carefully calls string->number in a way that avoids evaluating the expressions that compute the default values of some arguments. The defaults consult the parameters read-decimal-as-inexact and read-single-flonum, which is a perfectly fine thing to do in general, but it turns out to block a future, because parameter values can depend on the current continuation. In contrast, parallel threads continue to provide a benefit when those kinds of Racket constructs are used. We can see the difference by using plain string->number in place of string->number*, which will fetch parameter values 14 million times in each individual run of (strfib 32):
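One plausible shape for such a wrapper, sketched here from string->number's documented optional arguments rather than from the linked benchmark code: supplying the optional arguments explicitly means their parameter-consulting defaults are never evaluated.

```racket
#lang racket/base

;; Hypothetical sketch of a wrapper in the spirit of string->number*:
;; passing radix, convert-mode, and decimal-mode explicitly means the
;; default for decimal-mode, which consults read-decimal-as-inexact,
;; is never evaluated, so a future running this code is not blocked.
;; (A complete wrapper would also pass the single-flonum mode argument,
;; whose default consults read-single-flonum.)
(define (string->number* s)
  (string->number s 10 'number-or-false 'decimal-as-inexact))
```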

  

(strfib 32): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine             parallel               futures
 1   ×1    772 (751/4)    ×1.3   578 (562/1)    ×1.1    721 (721/1)    ×0.9   873 (851/0)
 4   ×1   3169 (3085/12)  ×1.3  2364 (2303/6)   ×4      797 (3103/33)  ×0.8  4164 (4058/2)
 8   ×1   6409 (6225/14)  ×1.4  4730 (4608/13)  ×4.3   1493 (9166/197) ×0.8  8353 (8135/4)

  

Surprisingly, the coroutine column here also shows an improvement. That’s because a coroutine thread has a smaller continuation than the one in the sequential column, and the cost of fetching a parameter value can depend (to a limited degree) on continuation size. The effect of parallel threads on this kind of program is more consistent than the fine details of a continuation’s shape.

Operations on mutable equal?-based hash tables [code] are another case where futures block, but parallel threads can provide a performance improvement.

  

(hash-nums 6): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    193 (184/1)    ×1    190 (182/0)    ×1     186 (186/0)    ×1    191 (182/0)
 4   ×1    767 (733/5)    ×1    763 (729/0)    ×3.7   208 (824/3)    ×1    763 (729/0)
 8   ×1   1541 (1473/11)  ×1   1532 (1463/1)   ×4.5   346 (2373/24)  ×1   1539 (1470/1)

  

As an illustration of the current limitations of parallel threads in Racket, let’s try a program that writes data to a byte-string port and then hashes it [code].

  

(hash-digs 7): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine             parallel              futures
 1   ×1    127 (123/3)    ×1     127 (123/4)    ×0.9   135 (135/3)    ×1    126 (122/3)
 4   ×1    503 (487/13)   ×0.9   536 (519/35)   ×2.5   201 (599/38)   ×1    520 (504/28)
 8   ×1   1022 (986/33)   ×0.9  1097 (1057/81)  ×2.5   403 (1630/85)  ×1   1049 (1017/60)

  

Here we see that parallel threads do get some speedup, but they do not scale especially well. Because separate ports are not contended, parallelism can improve performance, but the speedup is limited by some general locks in the I/O layer.

Even further in that direction, let’s try a program that hashes all files in the current directory [code] and computes a combined hash. When run on the "src" directory of the Racket Git repository, most of the time is reading bytes from files, and locks related to file I/O are currently too coarse-grained to permit much speed-up.

  

(hash-dir): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   sequential           coroutine            parallel              futures
 1   ×1    170 (169/1)    ×1     169 (169/0)   ×0.7   256 (248/0)    ×1    170 (170/0)
 4   ×1    692 (690/3)    ×1     662 (658/3)   ×1.3   515 (1068/5)   ×1    681 (679/3)
 8   ×1   1393 (1377/10)  ×1.1  1293 (1290/6)  ×1.6   868 (2158/18)  ×1   1368 (1366/7)

  

Having locks in place for parallel threads can impose a cost on sequential programs, since locks generally must be taken whether or not any parallel threads are active. Different data structures in Racket use specialized locks to minimize the cost, and most benchmarks reported here report the same numbers in the sequential column for Racket v8.18 (the previous release) and Racket v9.0. The exceptions are the (hash-nums 6) and (hash-digs 7) benchmarks, because those measure very fine-grained actions on mutable hash tables and I/O ports, where the cost is largest. Comparing sequential times for those two versions shows that support for parallel threads can cost up to 6-8% for programs that do not use them, although the cost tends to be much less for most programs.

  

(hash-nums 6): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   v8.18 sequential     v9.0 sequential
 1   ×1    188 (180/0)    ×0.96   195 (187/0)
 4   ×1    757 (728/2)    ×0.98   773 (739/3)
 8   ×1   1520 (1461/9)   ×0.98  1546 (1477/12)

  

  

(hash-digs 7): each cell is speedup; real msec (CPU msec / GC CPU msec)

 N   v8.18 sequential     v9.0 sequential
 1   ×1    118 (114/3)    ×0.94   126 (122/3)
 4   ×1    474 (458/12)   ×0.94   506 (489/13)
 8   ×1    947 (915/25)   ×0.92  1025 (989/26)

  

Overall, parallelizable numerical programs or ones that manipulate unshared data structures can achieve speedup through parallel threads relatively easily, but I/O remains a direction for improvement.

Backward Compatibility

If a library uses mutable variables or objects, either publicly or internally, then it must use locks or some other form of concurrency control to work properly in a multithreaded context. Racket already has concurrency, and the expectation that libraries work with threads does not change with the introduction of parallel threads. Racket’s semaphores, channels, and other synchronization constructs work the same with parallel threads as with concurrent threads. Even programs that use lock-free approaches based on compare-and-swap operations (such as box-cas!) continue to work, since Racket’s compare-and-swap operations use processor-level primitives.
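As a sketch of the compare-and-swap style that keeps working, a lock-free counter can retry box-cas! until its update wins; box-cas! has long been part of racket/base, so this runs with coroutine threads today and, by the reasoning above, with parallel threads as well.

```racket
#lang racket/base

;; A lock-free counter using box-cas!: retry until our update is the
;; one that lands, which is safe from any number of threads
(define counter (box 0))

(define (bump! bx)
  (let loop ()
    (define old (unbox bx))
    (unless (box-cas! bx old (add1 old))
      (loop)))) ; another thread won the race; retry

(define threads
  (for/list ([_ (in-range 4)])
    (thread (λ () (for ([_ (in-range 1000)]) (bump! counter))))))
(for-each thread-wait threads)
(unbox counter) ; 4000: no increments lost
```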

Still, there are a few concerns:

  • Racket’s coroutine threads offer the guarantee of sequential consistency, which means that effects in one thread cannot be seen out-of-order in another thread. Parallel threads in Racket expose the underlying machine’s memory-consistency model, which may allow reordering of memory effects as observed by other threads. In general, a weak memory model can be an issue for code not intended for use with threads, but Racket— more precisely, Chez Scheme— always guarantees the memory safety of such code using memory fences. That is, Racket code might observe out-of-order writes, but it never observes ill-formed Racket objects. The fences are not new; they are part of the same write barrier that already supports generational garbage collection and the memory safety of futures. Sequential consistency would also support lock implementations that do not work with weaker memory models, so such locks would work with coroutine threads but not parallel threads; however, we have not found any such implementations in Racket libraries.

  • Some Racket libraries use atomic mode for concurrency control. Atomic mode in Racket prevents coroutine thread swaps, and entering atomic mode is a relatively cheap operation within Racket’s coroutine scheduler. When a parallel thread enters atomic mode, it prevents other coroutine threads from running, but it does not prevent other parallel threads from running. As long as atomic mode is used consistently to guard a shared resource, it continues to serve that role with parallel threads.

    Entering atomic mode is a much more expensive operation in a parallel thread than in a coroutine thread; in many cases, Racket core libraries that need finer-grained locking will need to move away from using atomic mode. Still, making atomic mode synchronize a parallel thread with coroutine threads provides a graceful fallback and evolution path.

  • Foreign functions called by Racket in a coroutine thread are effectively atomic operations when there are no parallel threads, since a coroutine swap cannot take place during the foreign call. It’s rare that this atomicity implies any kind of lock at the Racket level, however, and the foreign function itself either has been adapted to operating-system threads or it hasn’t. Racket can already create operating-system threads through dynamic-place, and foreign-function bindings have generally been adapted to that possibility already.
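For reference, the atomic mode discussed above is available through ffi/unsafe/atomic; a minimal sketch of using it to guard a shared mutable resource, with names of our own choosing:

```racket
#lang racket/base
(require ffi/unsafe/atomic)

;; Sketch: atomic mode as coarse-grained concurrency control. Within
;; call-as-atomic, no other coroutine thread can run, and with parallel
;; threads it still guards the resource, though at a higher cost there.
(define events '())

(define (record-event! msg)
  (call-as-atomic
   (λ () (set! events (cons msg events)))))

(record-event! "started")
```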

The greater degree of concurrency enabled by parallelism exposed some bugs in our existing core libraries that could have been triggered with coroutine threads, but hadn’t been triggered reliably enough to detect and repair the bugs before. Beyond those general improvements, our experience with pre-release Racket is that parallel threads have not created backward-compatibility problems.
