Notes From the Meeting On Python GIL Removal Between Python Core and Sam Gross

During the annual Python core development sprint we held a meeting with Sam Gross, the author of nogil, a fork of Python 3.9 that removes the GIL. This is a non-linear summary of the meeting.

tl;dr

Sam’s work demonstrates it’s viable to remove the GIL in such a way that the resulting Python interpreter is performant and scales with added CPU cores. For performance to be net positive, other seemingly unrelated interpreter work is required.

At the moment it’s impossible to merge Sam’s changes back to CPython since they’re deliberately made against the legacy 3.9 branch so that the resulting 3.9 nogil interpreter can be tested by end users with the currently available base of pip-installable libraries and C extensions. To merge nogil, the changes will have to be made against the main branch (that is currently scheduled to become 3.11).

Don’t expect Python 3.11 to drop the GIL just yet. Merging Sam’s work back to CPython will itself be a laborious process, but is only part of what’s needed: a very good backwards compatibility and migration plan for the community is needed before CPython drops the GIL. None of this is planned yet. We might still decide it isn’t a good fit.

Some people mention Python 4 when talking about changes of this magnitude. Core developers don’t actively plan to release Python 4 at this point. In fact, the opposite is true: we are actively trying not to release Python 4 since the Python 2 to 3 transition was hard enough for the community. It’s definitely too early to speculate, let alone worry, about Python 4.

Introduction to nogil

Sam published his code alongside a detailed write-up where he explains the motivation and design of his fork.

The design can be summarized as:

  • replacement of Python’s built-in allocator pymalloc with mimalloc for thread safety, including cooperation required for lockless read access to dictionaries and other collections, and efficiency (heap memory layout allows finding GC-tracked objects without having to maintain an explicit list);
  • replacement of Python’s non-atomic eager reference counting with biased reference counting that:
    • ties each object with the thread that created it (called the owner thread);
    • enables efficient non-atomic local reference counting within the owner thread;
    • allows for slower but atomic shared reference counting in other threads;
  • to speed up object access across threads (otherwise slowed down by the atomic shared reference counting), two techniques are employed:
    • some special objects are immortalized, meaning their reference counts are never updated and they are never deallocated: this applies to singletons like None, True, and False, to small integers and interned strings, as well as to statically allocated PyTypeObjects for built-in types;
    • deferred reference counting is used for other globally accessible objects like top-level functions, code objects, and modules; those don’t use immortalization as they aren’t always kept for the lifetime of the program;
  • adjustment of the cyclic garbage collector to become a single-threaded stop-the-world garbage collector that:
    • waits for all threads to pause at a safe point (any bytecode boundary);
    • doesn’t wait for threads that are blocked on I/O (and use PyEval_ReleaseThread, an equivalent of releasing the GIL in current Python);
    • efficiently constructs the list of objects to deallocate just-in-time: thanks to using mimalloc, GC-tracked objects are all kept in a separate lightweight heap;
  • relocation of the process-global MRO cache to thread-local to avoid contention on MRO lookups; invalidations are still global;
  • modification of built-in collections to become thread-safe.
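
To make the biased reference counting idea from the list above more concrete, here is a toy model in pure Python. It is only a sketch: the class name BiasedRefCount and its fields are made up for illustration, a threading.Lock stands in for the atomic instructions a C implementation would use, and none of this is taken from Sam’s code.

```python
import threading


class BiasedRefCount:
    """Toy model of a biased reference count (illustrative, not nogil's code)."""

    def __init__(self):
        self.owner = threading.get_ident()    # thread that created the object
        self.local = 1                        # only ever touched by the owner
        self.shared = 0                       # touched by all other threads
        self._shared_lock = threading.Lock()  # stand-in for an atomic operation

    def incref(self):
        if threading.get_ident() == self.owner:
            self.local += 1                   # fast path: plain update
        else:
            with self._shared_lock:           # slow path: "atomic" update
                self.shared += 1

    def decref(self):
        if threading.get_ident() == self.owner:
            self.local -= 1
        else:
            with self._shared_lock:
                self.shared -= 1

    def total(self):
        with self._shared_lock:
            return self.local + self.shared


def demo():
    rc = BiasedRefCount()
    rc.incref()  # owner thread takes a second reference

    def other_thread():
        # Each non-owner thread ends up holding exactly one extra reference.
        for _ in range(1000):
            rc.incref()
        for _ in range(999):
            rc.decref()

    threads = [threading.Thread(target=other_thread) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return rc.local, rc.shared, rc.total()
```

The point of the split is that the owner thread never pays for synchronization, while references taken from other threads are still counted safely.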

Sam’s design document goes into detail on how those design elements operate, and provides information on thread states and the GIL API as well as on other interpreter and bytecode modifications (replacement of the stack VM with a register VM with an accumulator register; optimized function calls by avoiding creation of C stack frames; other changes to ceval.c; usage of tagged pointers; thread-safe metadata for LOAD_ATTR, LOAD_METHOD, LOAD_GLOBAL opcodes; and more). I encourage you to read it in its entirety.

Early Benchmarks

The no-GIL proof-of-concept interpreter is 10% faster than 3.9 on the pyperformance benchmark suite. It’s estimated that the cost of the GIL removal within the combined modified interpreter is around 9%, most of which is due to biased reference counting and deferred reference counting. In other words, Python 3.9 with all the other changes but the GIL removal itself could be 19% faster. However, this wouldn’t fix the multicore scalability issue.

By the way, some of those changes, like decoupling the C call stack from the Python call stack, have already been implemented for Python 3.11. In fact, we have preliminary benchmarks against current main that demonstrate that the performance-related changes in Python 3.11 make it 16% faster than nogil in single-thread performance.

More benchmarks are needed, particularly utilizing what Larry Hastings used when benchmarking Gilectomy (at the time based on Python 3.5, later ported to 3.6 alpha 1).

Sam reminds us that how well a given end-user application will scale on a GIL-free Python really depends on end-user code. Without testing it in the wild, it’s impossible to predict how well your code will behave without the GIL. Consequently, it’s impossible to responsibly provide a single number that says how-many-X times faster GIL-free Python will be.
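
If you want to see how your own workload behaves, a small harness along these lines is one way to start. This is a sketch with made-up names (cpu_task, run): it times the same CPU-bound function with different thread counts, so you can compare the numbers between a regular interpreter and a GIL-free one. The task is a placeholder; substitute code that resembles your application.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def cpu_task(n):
    """Placeholder CPU-bound work; replace with something like your own code."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run(num_threads, tasks=8, n=200_000):
    """Run `tasks` copies of cpu_task on `num_threads` threads and time them."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        results = list(pool.map(cpu_task, [n] * tasks))
    return time.perf_counter() - start, results


def main():
    baseline, _ = run(1)
    for threads in (1, 2, 4, 8):
        elapsed, _ = run(threads)
        print(f"{threads} threads: {elapsed:.3f}s, speedup {baseline / elapsed:.2f}x")


if __name__ == "__main__":
    main()
```

On an interpreter with the GIL, the speedup column should stay close to 1x for pure Python work like this. What it looks like without the GIL depends on the workload, which is exactly Sam’s point.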

Questions to Sam at the meeting

The questions here were reordered for clarity as compared to how they happened at the meeting. The answers are paraphrased from Sam’s responses and were approved by him reading a draft of this summary. Note that core team members might have other views on some of those topics.

Q: What’s the level of perceived risk that the nogil project will end up not being viable for inclusion in CPython?

The codebase as it stands already proves its technical viability. It works, and is more scalable and performant than both the vanilla CPython interpreter, and the Gilectomy project. It’s close to two years of full-time work at this point.

It all depends on how well the community adapts C extensions so they don’t cause downright crashes of the interpreter. Then, the remaining long tail is the community adopting free threading in their applications in a way that is both correct and scales well. Those two are the biggest challenges but we have to be optimistic.

Q: How would you go about upstreaming your work? Any advice on the commit order? How will you be keeping your work in sync with main?

Sam is currently working on rebasing his work, originally done against 3.9.0a3, to match the 3.9.7 final release. Part of this work is refactoring commits into logical units that tell a better story about what needs to be changed where and why.

There is currently no plan to move this work to main (future 3.11) because that branch is in too much flux. In contrast, 3.9 has a wide array of released pip-installable libraries and C extensions to test against. This enables Sam to evaluate how the project behaves with real-world third-party code. Rebasing on main takes time that can otherwise be spent on improving the GIL-free interpreter, so at this point it’s probably too early to focus on keeping the fork up-to-date.

Splitting the work so that it can be merged is doable but you have to keep in mind that many of the changes are a performance net positive in tandem. In isolation, they are (temporary?) performance degradations.

Note from core devs: we cannot integrate changes made against the 3.9 branch now. It makes sense for this stage of the project to use 3.9 but it will be critical to split it into consumable chunks that can be integrated one by one into the main branch. It’s a good point that doing it chunk by chunk can hurt performance but it’s the only realistic route of integration.

Q: Can we just extract the register VM and compiler without the other changes? Do you foresee any special difficulty using the register VM without the reference counting or GIL changes?

The VM uses deferred/immortal reference counts. It might be possible to convert it to just use classic reference counting but it’s unclear how efficient the end result would be (for instance, all objects on the stack use deferred reference counting due to performance).

Q: …and the opposite question: how difficult would it be to adopt nogil without the new register-based VM?

While the new VM only improves performance, not correctness, it also improves scalability to allow GIL-free Python to utilize available cores without contention. It’s also viable to use the 3.11 interpreter, but with some ideas from the register-based VM that are important for scalability and thread-safety. It’s going to be a substantial amount of work. But bringing the register-based VM up-to-date with the main branch (plus fixing remaining bugs) is also a substantial amount of work. Both options are viable.

Q: What is the recommendation for C extensions which don’t expect their C code to be run in parallel by other threads? Wouldn’t those need some API support from CPython to bridge the gap until they can be adapted to work in the new free-threaded environment?

This will take time. The goal is incremental adoption, eventually by the majority of C extensions. The GIL will still be optionally available as an interpreter startup-time option. If it isn’t enabled and a C extension is unaware of that mode of operation, it could raise a warning or fail to import. The community will have to adapt extensions and opt them into the GIL-free mode.

The proof of concept right now runs without the GIL and accepts any C extension since this is what users expect when they download nogil. If it’s adopted upstream, it makes sense to start with the opposite (to require that Python be run with -X nogil at startup) to let third-party libraries adapt. Then, after a few releases, the default might be switched in the opposite direction.

While it won’t be easy to port everything (parallelism is hard), in many cases it shouldn’t be very hard, especially for C extensions wrapping external libraries.

Note from core devs: there is a large number of “dark matter” Python (and C extension) code out there that isn’t open-source. We need to be careful not to break it since it might not be feasible for its users to make required changes, or to report problems back upstream to us. In particular, some C extensions protect their own internal state with the GIL. This is a big worry, and might be a big hindrance to adoption of a GIL-free Python.

Q: Will you add a PEP 489 “slot” that extensions can use to indicate support for nogil and fail import under nogil mode if not?

This has come up a lot. It may be a good idea but it isn’t entirely clear what that means. Opting into the GIL-free mode is no guarantee that there are no bugs. Instead, we might allow all extensions to run by default (this is what is happening now with nogil). An incompatible extension might instead use the PyInit module code to proactively ask the interpreter whether the GIL is enabled, and raise a warning or even an exception at import time if it isn’t.
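
A Python-level analogue of that import-time check might look like the sketch below. The answer above talks about C-level PyInit code; this is only an illustration of the same idea in Python. It assumes sys._is_gil_enabled(), which exists in CPython 3.13 and later and, as far as I know, is not part of the 3.9-based nogil fork discussed here. The helper names gil_is_enabled and warn_if_gil_disabled are made up.

```python
import sys
import warnings


def gil_is_enabled():
    """Best-effort check; assumes interpreters without the hook have a GIL."""
    check = getattr(sys, "_is_gil_enabled", None)  # CPython 3.13+ only
    if check is None:
        return True
    return bool(check())


def warn_if_gil_disabled(module_name):
    """Call at import time from a module not yet audited for free threading."""
    if gil_is_enabled():
        return False
    warnings.warn(
        f"{module_name} has not been audited for use without the GIL",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```

A stricter module could raise ImportError instead of warning, which matches the “fail to import” option mentioned earlier.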

Q: Is runtime enabling of nogil a long-term viable option or a transitional feature?

Ideally the end game is CPython without the GIL, period. However, there will be an expected long period of community adaptation. We want to avoid a rift similar to the Python 2 to Python 3 transition. Rather, we want the transition to be easier, even if that means stretching it over a longer period of time.

Q: To confirm, is the end state that there is only nogil and no way to turn the GIL back on?

We don’t know yet at this point. Ideally the end game is for there to only be a GIL-free Python but it’s unclear if this can ever be achieved.

Q: If these feature flags are going to live for a long time, will that mean we’ll need to significantly increase the testing matrix?

Yes, you will need to double your testing matrix. However, testing the GIL-free version is probably a good predictor of whether the classic GIL version works or not. It might make sense to run the tests with the GIL enabled only sporadically (nightly?).

Note from core devs: code regresses at an astonishing pace if not tested. In CPython we don’t run all tests with every change due to their required runtime (like reference leak tests) but if nightly tests for those fail, we revert changes immediately because it’s very common for other regressions to creep in behind an already failing buildbot.

Q: What do you think of running multiple Python interpreters in parallel with one GIL per interpreter?

This is in some ways complementary to, and in other ways competing with, the no-GIL proposal. It should be possible to support subinterpreters in a GIL-free interpreter.

It’s unclear if the multiple subinterpreters work will be completed. With no-GIL there is less worry about sharing objects across threads, and less worry about C extension compatibility because with subinterpreters no state can truly be global anymore and therefore it needs to be specifically isolated. Passing objects around between subinterpreters requires some form of serialization/deserialization for mutable ones. For immutable ones, there might be special support added by the interpreter but user code would need to opt those objects in if they’re not known immutable builtin types. This is informed by related work by PyTorch which does use a form of subinterpreters.
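
The cost of that serialization step can be illustrated with a small sketch. Here pickle is only a stand-in for whatever mechanism would carry an object across an interpreter boundary (the meeting didn’t name one), and send_between_interpreters is a hypothetical helper.

```python
import pickle


def send_between_interpreters(obj):
    """Stand-in for crossing an interpreter boundary: serialize, then rebuild."""
    return pickle.loads(pickle.dumps(obj))


original = {"weights": [1.0, 2.0, 3.0]}
received = send_between_interpreters(original)

# The receiver gets an independent copy, so mutating it doesn't affect
# the sender's object. With threads in one interpreter, both sides would
# see the very same list.
received["weights"].append(4.0)
```

For large data the copy is what hurts, which is why direct sharing matters for the workloads Sam describes next.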

Since the use cases Sam was mostly interested in were scientific in nature (PyTorch training workflows), the ability to share data directly and efficiently was crucial to multithreaded performance. With subinterpreters, such sharing could only be enabled at the C extension level, pushing more code towards C/C++ than with a no-GIL Python.

Q: You have gone into great detail on the dict and list implementations. What about other mutable types like queues, sets, arrays, and so on?

The nogil fork is a work-in-progress. The dict and list implementations have seen the most work due to their prevalence in the interpreter’s inner workings. Similar work is complete for queues but not for the other types yet. Sets are the next big one to cover.

Queue was very important since it’s used for communication between threads with concurrent.futures and asyncio. Queues were easier than dictionaries and lists because they use fine-grained locking instead of lock-free reads. Some other objects will probably need a combination.

This work is tricky to do because you need to be careful about when you acquire and release the locks. For example, Py_DECREFs can be reentrant. More “coarse-grained” locks can still be considered for those but of course those run the risk of deadlocks.
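
The reentrancy problem can be shown with a pure-Python sketch. LockedBag and Noisy are made-up names; the example relies on CPython’s reference counting running __del__ as soon as the last reference goes away, which is the Python-level counterpart of a Py_DECREF calling back into arbitrary code.

```python
import threading


class LockedBag:
    """Container guarded by a fine-grained, non-reentrant lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def add(self, item):
        with self._lock:
            self._items.append(item)

    def __len__(self):
        with self._lock:
            return len(self._items)

    def clear(self):
        # Swap the contents out while holding the lock...
        with self._lock:
            old, self._items = self._items, []
        # ...but drop the references only after releasing it. Dropping them
        # inside the `with` block would run Noisy.__del__ while the lock is
        # held, and its len(bag) call would deadlock.
        old.clear()


class Noisy:
    """Object whose finalizer calls back into the container."""

    def __init__(self, bag, log):
        self.bag = bag
        self.log = log

    def __del__(self):
        self.log.append(len(self.bag))


def demo():
    bag = LockedBag()
    log = []
    bag.add(Noisy(bag, log))
    bag.clear()
    return log
```

Using a reentrant lock would avoid the deadlock too, but then the finalizer could observe the container in a half-updated state, so releasing references outside the critical section is usually the safer habit.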

Q: How dependent is nogil on mimalloc? If we needed it to be a compile-time option to use it or not, would a less performant build be feasible that uses the platform’s malloc instead without C preprocessor hell?

mimalloc is used for more than just thread safety. Its cooperation is necessary to enable lock-free dictionary reads, and it also enables efficient GC tracking.

The maintainer of mimalloc is interested in explicitly supporting CPython and is open to necessary changes to make that happen.

Other malloc implementations are reported as stable with CPython: jemalloc used at Facebook, tcmalloc used at Google, although with less integration, more like simple replacements of the default allocator.

Note from core devs: Christian Heimes and Pablo Galindo Salgado are evaluating using mimalloc for CPython. Early tests show no performance regression on average (geometric mean), with most benchmarks doing better, and a smaller number of benchmarks doing marginally worse. There are possible issues to evaluate:

  • mimalloc’s API and ABI stability;
  • licensing;
  • portability across all CPython-supported platforms, for example stdatomic.h only being available in C11;
  • integration with profilers and sanitizer tools (Valgrind, asan, ubsan, etc.);
  • and possibly more.

Q: What similarities does your project have with Larry’s Gilectomy? Were you able to make use of any of his work?

At a high level the projects are similar: deferred reference counting, fine-grained locking, challenges around returning borrowed references. There is no code reuse with Gilectomy.

Q: You’re saying your work is at a high level similar to Larry’s Gilectomy. His work was also based on deferred reference counting. Yet he only observed performance degradation during Gilectomy. Your “nogil” has a promising performance profile. Where do you think the difference comes from?

Switching to the register-based compiler and other optimizations done, like the lock-less dictionary reads powered by mimalloc and avoiding contention with deferred reference counting, were critical to making nogil scale and perform well. Also, in some cases, Python itself grew faster. For instance, function calls in Python 3.9 are much cheaper than they were in Python 3.5.

Making it scale definitely took more work than expected.

Q: Would it be possible to opt a C extension into the GIL-free mode or opt out of it?

As the name implies, the GIL is a simple global lock. For it to protect any single piece of shared data, it needs to be turned on in all threads, not just in the one running the incompatible extension.

It’s tricky to switch the interpreter from being GIL-free to running with the GIL (and vice versa) in an already running process. What is more likely is a startup-time switch that either enables the GIL or not inside the process. C extensions not tagged as compatible would raise warnings or fail to import.

Alternatively, it would be feasible to always “stop the world” when a C extension is accessed but this defeats the purpose of removing the GIL performance-wise.

Note from core devs: there are a few other ideas that so far weren’t explored in depth. One is to convert the GIL into a “many readers - one writer” lock. In this scenario, the GIL-free mode would essentially acquire the lock as a “reader”, i.e. without blocking other new code from doing the same. Legacy code would acquire a “writer” lock, blocking all other threads from execution until freed. This design would require keeping the GIL acquisition/release APIs, which nogil already does to inform the GC that a thread is blocked on I/O.
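
For readers unfamiliar with the “many readers - one writer” pattern, here is a minimal sketch of such a lock in Python. It only illustrates the locking discipline: ReadWriteLock is a made-up class, not a proposed CPython API, and this simple version lets a steady stream of readers starve a waiting writer.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read(self):
        with self._cond:
            while self._writer:            # readers only wait for a writer
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writer or self._readers:  # writer needs exclusivity
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


def demo():
    lock = ReadWriteLock()
    barrier = threading.Barrier(2, timeout=5)
    events = []

    def reader():
        with lock.read():
            barrier.wait()                 # passes only if both readers are inside
            events.append("read")

    readers = [threading.Thread(target=reader) for _ in range(2)]
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    with lock.write():                     # "legacy" code gets exclusive access
        events.append("write")
    return events
```

In the scenario described above, GIL-free-aware code would play the role of the readers and legacy extensions the role of the writer.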

Q: Would it be possible to mark functions as not thread-safe (e.g. using a decorator) and have nogil take this into account when running the code to prevent other threads from getting in the way? (sort of like a temporary GIL)

If the concern is about state accessed elsewhere, every access needs to be locked. This isn’t particularly viable on a decorator level. Enabling the GIL conditionally for paths including unsafe code would be hard to implement, as mentioned above.

Unclear. For C API extensions there’s at least one good design pattern: they often have a similar structure, and keep shared state in a single struct. Pybind11 looks to be the furthest from this pattern at this point, so the most changes might be required in C extensions written with it.

Many complex C extensions already have to deal with locking and multithreading as their raison d’être is to release the GIL as much as possible, like for instance numpy does. So, perhaps surprisingly, those would possibly be easier to migrate.

Next steps

As a follow-up to this meeting, the core developers discussed the feasibility of including nogil upstream and what that would mean to the community. Definitely, a change of this magnitude needs a lot of care.

Before we decide on that, it feels more feasible to introduce some of its code first. In particular, mimalloc looks interesting and there is already an open pull request exploring its inclusion. Look there for links to benchmarks.

On a personal level, we are impressed by Sam’s work so far and invited him to join the CPython project. I’m happy to report he is interested, and to help him ramp up to become a core developer, I will be mentoring him. Guido and Neil Schemenauer will help me review code for the interpreter bits I’m unfamiliar with.
