Project Proposal: No-Wait Fork-Join Continuation Stealing Framework

Zhengjia Cao (zhengjic) & Jingzhi Zhang (jingzhi2) November 9, 2022

  1. Summary

    We would like to implement a fork/join parallel framework with working stealing in C++. We will discuss the trade-off between implementation designs (e.g., Child-stealing strategy vs. Continuation stealing, Different implementation of distributed work queue, etc.) and measure their performance. On multi-core PSC machines, we would also like to implement the wait-free strand coordination method mentioned in Nowa [1] and reproduce their benchmark against Cilk plus, TBB and OpenMP.

    The URL of project web page is

  2. Background

    For modern computer with multiple cores, scalability of parallelism is important. Fully strict Fork/Join model of concurrent run-time (like Cilk) can guarantee scalability by dynamically allocating and divide tasks until each task reaches a certain granularity. Once the subtasks have been created, threads can execute these subtasks in parallel until they’ve reached the barrier created by Join and continue to execute in sequential order. Program that requires heavy-workload computations can benefit from the model without too much overhead.

    However, the scalability of most of current fully strict Fork/Join models suffer from the locking used in the system implementation. Nowa [1] presents a wait-free continuation stealing approach to further scale the parallelism on machine with large number of processing units. We will follow the same design and transform hazardous race conditions and replace the requirement of using a lock with atomic counters.

  3. The challenge

    There are mainly two challenges of implementing the framework: Firstly, we need to come up with an implementation that minimize the scheduling cost

    Figure 1: Illustration of a basic fork-join model

    and consider different tradeoffs between different stealing policies and lock-free synchronization methods. Secondly, we need to design different scenarios and test the speedup of our fork-join framework against Cilk and OpenMP on different hardware settings. We’d like to understand the results and analyze which tradeoff in design and implementation is responsible for the difference in performance.

  4. Resources

    We will build our implementation of the framework. However, we will borrow some of the idea from existing implementations like :

    1. Nowa Source Code:

    2. Basic fork-join model:

  5. Goals and deliverables

    1. Implement the basic fork-join model with distributed work queue

    2. Improve the framework to be wait-free and reproduce the result from nowa

    3. Test and compare the framework with other fork-join frameworks (e.g. Cilk) and discuss the tradeoffs.

  6. Platform choice

    From the perspective of software, we would like to use the co-routines in C++ 20 to implement the continuation-stealing scheduler. As for hardware, our bottleneck is CPU, so we will develop our code on GHC 8-core machines and do benchmark on PSC 128-core ones.

  7. Schedule


[1] Florian Schmaus et al. “Nowa: A wait-free continuation-stealing concurrency platform”. In: (2021), pp. 360–371.