
Python Enhancement Proposals

PEP 788 – Reimagining native threads

Author:
Peter Bierma <zintensitydev at gmail.com>
Sponsor:
Victor Stinner <vstinner at python.org>
Discussions-To:
Discourse thread
Status:
Draft
Type:
Standards Track
Created:
23-Apr-2025
Python-Version:
3.15
Post-History:
10-Mar-2025, 27-Apr-2025

Abstract

PyGILState_Ensure(), PyGILState_Release(), and other related functions in the PyGILState family are the most common way to create native threads that interact with Python. They have been the standard for over twenty years (PEP 311). But, over time, these functions have become problematic:

  • They aren’t safe for finalization, either causing the calling thread to hang or crashing it with a segmentation fault, preventing further execution.
  • When they’re called before finalization, they force the thread to be “daemon”, meaning that an interpreter won’t wait for it to reach any point of execution. This is mostly frustrating for developers, but can lead to deadlocks!
  • Subinterpreters don’t play nicely with them, because they all assume that the main interpreter is the only one that exists. A fresh thread (that is, one that has never had a thread state) that calls PyGILState_Ensure() will always get a thread state for the main interpreter.
  • The term “GIL” in the name is quite confusing for users of free-threaded Python. There isn’t a GIL, so why do they still have to call it?

This PEP intends to fix all of these issues by providing two new functions, PyThreadState_Ensure() and PyThreadState_Release(), as a more correct and safer replacement for PyGILState_Ensure() and PyGILState_Release(). For example:

if (PyThreadState_Ensure(interp) < 0) {
    fputs("Python is shutting down", stderr);
    return;
}

/* Interact with Python, without worrying about finalization. */
// ...

PyThreadState_Release();

This is achieved by introducing two concepts into the C API:

  • “Daemon” and “non-daemon” threads, similar to how it works in the threading module.
  • Interpreter reference counts which prevent an interpreter from finalizing.

In PyThreadState_Ensure(), both of these ideas are applied. The calling thread first stores a reference to an interpreter via PyInterpreterState_Hold(). PyInterpreterState_Hold() increases the reference count of the interpreter, so the interpreter cannot begin finalization until the thread has finished (by eventually calling PyThreadState_Release()).

For example, creating a native thread with this API would look something like this:

static PyObject *
my_method(PyObject *self, PyObject *unused)
{
    PyThread_handle_t handle;
    PyThread_ident_t ident;

    PyInterpreterState *interp = PyInterpreterState_Hold();
    if (PyThread_start_joinable_thread(thread_func, interp, &ident, &handle) < 0) {
        PyInterpreterState_Release(interp);
        return NULL;
    }
    /* The thread will always attach and finish, because we increased
       the reference count of the interpreter. */
    Py_RETURN_NONE;
}

Motivation

Native threads will always hang during finalization

Many codebases might need to call Python code in highly-asynchronous situations where the interpreter is already finalizing, or might finalize, and want to continue running code after the Python call. This desire has been brought up by users. For example, a callback that wants to call Python code might be invoked when:

  • A kernel has finished running on a GPU.
  • A network packet was received.
  • A thread has quit, and a native library is executing static finalizers of thread local storage.

In the current C API, any non-Python thread (one not created via the threading module) is considered to be “daemon”, meaning that the interpreter won’t wait on that thread to finalize. Instead, the interpreter will hang the thread when it goes to attach a thread state, making it unusable past that point. Attaching a thread state can happen at any point when invoking Python, such as releasing the GIL in-between bytecode instructions, or when a C function exits a Py_BEGIN_ALLOW_THREADS block. (Note that hanging the thread is relatively new behavior; in prior versions, the thread would terminate, but the issue is the same.)

This means that any non-Python thread may be terminated at any point, which is severely limiting for users who want to do more than just execute Python code in their stream of calls (for example, C++ executing finalizers in addition to calling Python).

Using Py_IsFinalizing is insufficient

The docs currently recommend Py_IsFinalizing() to guard against termination of the thread:

Calling this function from a thread when the runtime is finalizing will terminate the thread, even if the thread was not created by Python. You can use Py_IsFinalizing() or sys.is_finalizing() to check if the interpreter is in process of being finalized before calling this function to avoid unwanted termination.

Unfortunately, this isn’t correct, because of time-of-call to time-of-use issues; the interpreter might not be finalizing during the call to Py_IsFinalizing(), but it might start finalizing immediately afterwards, which would cause the attachment of a thread state (typically via PyGILState_Ensure()) to hang the thread.
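The race can be made concrete with a toy model in plain C. Nothing here is CPython code: toy_runtime, toy_is_finalizing(), toy_start_finalization(), and toy_checked_attach() are hypothetical names, and the between callback stands in for whatever another thread might do in the window between the check and the attach.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the runtime; NOT a CPython structure. */
typedef struct {
    bool finalizing;
} toy_runtime;

/* Plays the role of Py_IsFinalizing() in this model. */
static bool
toy_is_finalizing(const toy_runtime *rt)
{
    return rt->finalizing;
}

/* Hypothetical hook: simulates another thread starting finalization. */
static void
toy_start_finalization(toy_runtime *rt)
{
    rt->finalizing = true;
}

/* Check-then-attach, as the quoted documentation suggests.
   `between` runs after the check and before the attach.
   Returns -1 if the check caught finalization in time, 0 if the attach
   would proceed normally, and 1 if the check passed but the attach
   would still hang the thread. */
static int
toy_checked_attach(toy_runtime *rt, void (*between)(toy_runtime *))
{
    if (toy_is_finalizing(rt)) {
        return -1;  /* The guard worked. */
    }
    if (between != NULL) {
        between(rt);  /* Another thread gets scheduled here. */
    }
    /* This is where PyGILState_Ensure() would be called. */
    return toy_is_finalizing(rt) ? 1 : 0;
}
```

Calling toy_checked_attach(&rt, toy_start_finalization) on a fresh runtime returns 1 in this model: the guard reported that it was safe, and the attach would hang anyway.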

Daemon threads can cause finalization deadlocks

When acquiring locks, it’s extremely important to detach the thread state to prevent deadlocks. This is true on both the with-GIL and free-threaded builds. When the GIL is enabled, a deadlock can occur pretty easily when acquiring a lock if the GIL wasn’t released, and lock-ordering deadlocks can still occur on free-threaded builds if the thread state wasn’t detached.

So, all code that needs to work with locks needs to detach the thread state. In C, this is almost always done via Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS, in a code block that looks something like this:

Py_BEGIN_ALLOW_THREADS
acquire_lock();
Py_END_ALLOW_THREADS

Again, in a daemon thread, Py_END_ALLOW_THREADS will hang the thread if the interpreter is finalizing. But, Py_BEGIN_ALLOW_THREADS will not hang the thread; the lock will be acquired, and then the thread will be hung! Once that happens, nothing can try to acquire that lock without deadlocking. The main thread will continue to run finalizers past that point, though. If any of those finalizers try to acquire the lock, deadlock ensues.
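As a rough illustration of that sequence, here is a single-threaded toy model in plain C. It does not use the C API at all: toy_lock, toy_lock_try_acquire(), toy_lock_release(), and toy_daemon_thread_body() are hypothetical names, and an early return stands in for the thread being hung while reattaching.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock shared between the "daemon thread" and the finalizers. */
static atomic_flag toy_lock = ATOMIC_FLAG_INIT;

/* Non-blocking acquire: returns true if the lock was taken. */
static bool
toy_lock_try_acquire(void)
{
    return !atomic_flag_test_and_set(&toy_lock);
}

static void
toy_lock_release(void)
{
    atomic_flag_clear(&toy_lock);
}

/* Models a daemon thread running the Py_BEGIN_ALLOW_THREADS block.
   Returns true if the thread finished and released the lock, or false
   if it was "hung" while reattaching, with the lock still held. */
static bool
toy_daemon_thread_body(bool interp_finalizing)
{
    /* Detaching never hangs, so the lock is always acquired.
       This model assumes the lock is free on entry. */
    bool acquired = toy_lock_try_acquire();
    assert(acquired);
    (void)acquired;

    /* Reattaching (Py_END_ALLOW_THREADS) hangs a daemon thread once the
       interpreter is finalizing; returning early models that hang. */
    if (interp_finalizing) {
        return false;
    }

    toy_lock_release();
    return true;
}
```

After toy_daemon_thread_body(true) returns, toy_lock_try_acquire() fails in this model. A finalizer that used a blocking acquire at that point would wait forever.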

This affects CPython itself, and there’s not much that can be done to fix it. For example, python/cpython#129536 remarks that the ssl module can emit a fatal error when used at finalization, because a daemon thread got hung while holding the lock. There are workarounds for this for pure-Python code, but native threads don’t have such an option.

We can’t change finalization behavior for PyGILState_Ensure

There will always have to be a point in a Python program where PyGILState_Ensure() can no longer acquire the GIL (or more correctly, attach a thread state). If the interpreter is long dead, then Python obviously can’t give a thread a way to invoke it. PyGILState_Ensure() doesn’t have any meaningful way to return a failure, so it has no choice but to terminate the thread or emit a fatal error, as noted in python/cpython#124622:

I think a new GIL acquisition and release C API would be needed. The way the existing ones get used in existing C code is not amenible to suddenly bolting an error state onto; none of the existing C code is written that way. After the call they always just assume they have the GIL and can proceed. The API was designed as “it’ll block and only return once it has the GIL” without any other option.

For this reason, we can’t make any real changes to how PyGILState_Ensure() works for finalization, because it would break existing code. Similarly, threads created with the existing C API will have to remain daemon, because extensions that implement native threads aren’t guaranteed to work during finalization.

The existing APIs are broken and misleading

There are currently two public ways for a user to create and attach their own thread state: manual use of PyThreadState_New() and PyThreadState_Swap(), or PyGILState_Ensure(). The latter, PyGILState_Ensure(), is significantly more common.

PyGILState_Ensure generally crashes during finalization

At the time of writing, the current behavior of PyGILState_Ensure() does not match the documentation. Instead of hanging the thread during finalization as previously noted, it’s extremely common for it to crash with a segmentation fault. This is a known issue that could, in theory, be fixed in CPython, but it’s definitely worth noting here. Incidentally, acceptance and implementation of this PEP will likely fix the existing crashes caused by PyGILState_Ensure().

The term “GIL” is tricky for free-threading

A large issue with the term “GIL” in the C API is that it is semantically misleading. This was noted in python/cpython#127989, created by the authors of this PEP:

The biggest issue is that for free-threading, there is no GIL, so users erroneously call the C API inside Py_BEGIN_ALLOW_THREADS blocks or omit PyGILState_Ensure in fresh threads.

Since Python 3.12, it is an attached thread state that lets a thread invoke the C API. On with-GIL builds, holding an attached thread state implies holding the GIL, so only one thread can have one at a time. Free-threaded builds achieve the effect of multi-core parallelism while remaining backwards-compatible by simply removing that limitation: threads still need a thread state (and thus need to call PyGILState_Ensure()), but they don’t need to wait on one another to do so.

Subinterpreters don’t work with PyGILState_Ensure

As noted in the documentation, PyGILState APIs aren’t officially supported in subinterpreters:

Note that the PyGILState_* functions assume there is only one global interpreter (created automatically by Py_Initialize()). Python supports the creation of additional interpreters (using Py_NewInterpreter()), but mixing multiple interpreters and the PyGILState_* API is unsupported.

More technically, this is because PyGILState_Ensure doesn’t have any way to know which interpreter created the thread, and as such, it has to assume that it was the main interpreter. There isn’t any way to detect this at runtime, so spurious races are bound to come up in threads created by subinterpreters, because synchronization for the wrong interpreter will be used on objects shared between the threads.

Interpreters can concurrently shut down

The other way of creating a native thread that can invoke Python, PyThreadState_New() / PyThreadState_Swap(), is a lot better for supporting subinterpreters (because PyThreadState_New() takes an explicit interpreter, rather than assuming that the main interpreter was intended), but is still limited by the current API.

In particular, subinterpreters typically have a much shorter lifetime than the main interpreter, and as such, there’s not necessarily a guarantee that a PyInterpreterState (acquired by PyInterpreterState_Get()) passed to a fresh thread will still be alive. Similarly, a PyInterpreterState pointer could have been replaced with a new interpreter, causing all sorts of unknown issues. They are also subject to all the finalization related hanging mentioned previously.

Rationale

This PEP includes several new APIs that intend to fix all of the issues stated above.

Replacing the old APIs

As made clear in Motivation, PyGILState is already pretty buggy, and even if it was magically fixed, the current behavior of hanging the thread is beyond repair. In turn, this PEP intends to completely deprecate the existing PyGILState APIs and provide better alternatives. However, even if this PEP is rejected, all of the APIs can be replaced with more correct PyThreadState functions in the current C API:

This PEP specifies a ten-year deprecation for these functions (while remaining in the stable ABI), primarily because it’s expected that the migration won’t be seamless, due to the new requirement of storing an interpreter state. The exact details of this deprecation are currently unclear, see When should the legacy APIs be removed?.

A light layer of magic

The APIs proposed by this PEP intentionally have a layer of abstraction that is hidden from the user and offloads complexity onto CPython. This is done primarily to help ease the transition from PyGILState for existing codebases, and for ease of use for those who provide wrappers for the C API, such as Cython or PyO3.

In particular, the API hides details about the lifetime of the thread state and most of the details with interpreter references.

See also Exposing an Activate/Deactivate API instead of Ensure/Clear.

Bikeshedding and the PyThreadState namespace

To solve the issue with “GIL” terminology, the new functions described by this PEP that are intended as replacements for PyGILState will go under the existing PyThreadState namespace. In Python 3.14, the documentation has been updated to switch over to terms like “attached thread state” instead of “global interpreter lock”, so this namespace seems to fit well for this PEP.

Preventing interpreter finalization with references

Several iterations of this API have taken an approach where PyThreadState_Ensure() can return a failure based on the state of the interpreter. Instead, this PEP takes an approach where an interpreter keeps track of the number of non-daemon threads, which inherently prevents it from beginning finalization.

The main upside with this approach is that there’s more consistency with attaching threads. Using an interpreter reference from the calling thread keeps the interpreter from finalizing before the thread starts, ensuring that it always works. An approach that returned a failure based on the start time of the thread could cause spurious issues.

In the case where it is useful to let the interpreter finalize, such as in an asynchronous callback where there’s no guarantee that the thread will start, strong references to an interpreter can be acquired through PyInterpreterState_Lookup().

Specification

Daemon and non-daemon threads

This PEP introduces the concept of non-daemon thread states. By default, all threads created without the threading module will hang when trying to attach a thread state for a finalizing interpreter (in fact, daemon threads that are created with the threading module will hang in the same way). This generally happens when a thread calls PyEval_RestoreThread() or in between bytecode instructions, based on sys.setswitchinterval().

A new, internal field will be added to the PyThreadState structure that determines if the thread is daemon. Before finalization, an interpreter will wait until all non-daemon threads call PyThreadState_Delete().

For backwards compatibility, all thread states created by existing APIs, including PyGILState_Ensure(), will remain daemon by default. See We can’t change finalization behavior for PyGILState_Ensure.

int PyThreadState_SetDaemon(int is_daemon)
Set the attached thread state as non-daemon or daemon.

The attached thread state must not be the main thread for the interpreter. All thread states created without PyThreadState_Ensure() are daemon by default.

If the thread state is non-daemon, then the current interpreter will wait for this thread to finish before shutting down. See also threading.Thread.daemon.

Return zero on success, non-zero without an exception set on failure.
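The bookkeeping this implies can be sketched with a small toy model in plain C. All names here (toy_thread_counter, toy_tstate, toy_set_daemon(), toy_can_finalize()) are hypothetical, and returning -1 for the main thread is an assumption of the sketch; the text above only says that the main thread must not be used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy interpreter: only tracks how many non-daemon threads exist. */
typedef struct {
    int non_daemon_threads;
} toy_thread_counter;

/* Toy thread state; NOT the real PyThreadState layout. */
typedef struct {
    toy_thread_counter *interp;
    bool is_main;
    bool daemon;
} toy_tstate;

/* Flip the daemon flag and keep the interpreter's count in sync.
   Returns 0 on success and -1 on failure (assumed failure mode). */
static int
toy_set_daemon(toy_tstate *ts, bool is_daemon)
{
    assert(ts != NULL && ts->interp != NULL);
    if (ts->is_main) {
        return -1;
    }
    if (ts->daemon == is_daemon) {
        return 0;  /* Nothing to change. */
    }
    ts->daemon = is_daemon;
    ts->interp->non_daemon_threads += is_daemon ? -1 : 1;
    return 0;
}

/* The interpreter may only begin finalizing once no non-daemon
   threads remain. */
static bool
toy_can_finalize(const toy_thread_counter *interp)
{
    return interp->non_daemon_threads == 0;
}
```

In this model, marking a thread non-daemon blocks toy_can_finalize() until the thread is marked daemon again, which mirrors the waiting behavior described above without any real threads involved.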

Interpreter reference counting

Internally, an interpreter will have to keep track of the number of non-daemon native threads, which will determine when an interpreter can finalize. This is done to prevent use-after-free crashes in PyThreadState_Ensure() for interpreters with short lifetimes, and to remove needless layers of synchronization between the calling thread and the started thread.

An interpreter state returned by Py_NewInterpreter() (or really, PyInterpreterState_New()) will start with a native thread countdown. For simplicity’s sake, this will be referred to as a reference count. A non-zero reference count prevents the interpreter from finalizing.
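A minimal sketch of that rule, as a toy model in plain C rather than the actual implementation: toy_interp and the toy_interp_*() helpers are hypothetical names, and starting the count at zero is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy interpreter state; NOT the real PyInterpreterState. */
typedef struct {
    atomic_long refcount;
} toy_interp;

static void
toy_interp_init(toy_interp *interp)
{
    atomic_init(&interp->refcount, 0);  /* Assumed initial value. */
}

/* Take a strong reference; the interpreter cannot finalize while
   any strong reference is outstanding. */
static toy_interp *
toy_interp_hold(toy_interp *interp)
{
    atomic_fetch_add(&interp->refcount, 1);
    return interp;
}

/* Give a strong reference back. */
static void
toy_interp_release(toy_interp *interp)
{
    long old = atomic_fetch_sub(&interp->refcount, 1);
    assert(old > 0);  /* Releasing more than was held is a bug. */
    (void)old;
}

/* A non-zero reference count prevents finalization. */
static bool
toy_interp_may_finalize(toy_interp *interp)
{
    return atomic_load(&interp->refcount) == 0;
}
```

The sketch leaves out one thing a real implementation would need: the check for a zero count and the start of finalization must be atomic with respect to new references being taken.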

PyInterpreterState *PyInterpreterState_Hold(void)
Similar to PyInterpreterState_Get(), but returns a strong reference to the interpreter (meaning, it has its reference count incremented by one, allowing the returned interpreter state to be safely accessed by another thread, because it will be prevented from finalizing).

This function is generally meant to be used in tandem with PyThreadState_Ensure().

The caller must have an attached thread state. This function cannot return NULL. Failures are always a fatal error.

PyInterpreterState *PyInterpreterState_Lookup(int64_t interp_id)
Similar to PyInterpreterState_Hold(), but looks up an interpreter based on an ID (see PyInterpreterState_GetID()). This has the benefit of allowing the interpreter to finalize in cases where the thread might not start, such as inside of an asynchronous callback.

This function will return NULL without an exception set on failure. If the return value is non-NULL, then the returned interpreter will be prevented from finalizing until the reference is released by PyThreadState_Release() or PyInterpreterState_Release().

Returning NULL typically means that the interpreter is at a point where threads cannot start, or no longer exists.

The caller does not need to have an attached thread state.
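The lookup semantics can be pictured with a toy registry in plain C. This is only a sketch under stated assumptions: toy_entry, toy_registry, and toy_lookup() are hypothetical, the registry is a fixed-size array, and the model is single-threaded, so it ignores the synchronization a real lookup would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_INTERPS 4

/* One registry slot per toy interpreter. */
typedef struct {
    int64_t id;
    bool alive;       /* false once the interpreter no longer exists */
    bool finalizing;  /* true once threads can no longer start */
    long refcount;
} toy_entry;

/* Zero-initialized, so every slot starts out as not alive. */
static toy_entry toy_registry[TOY_MAX_INTERPS];

/* Return a strong reference to the interpreter with the given ID, or
   NULL if it no longer exists or is already finalizing. */
static toy_entry *
toy_lookup(int64_t id)
{
    for (size_t i = 0; i < TOY_MAX_INTERPS; i++) {
        toy_entry *entry = &toy_registry[i];
        if (entry->alive && entry->id == id) {
            if (entry->finalizing) {
                return NULL;  /* Too late: threads cannot start. */
            }
            entry->refcount++;  /* Finalization is now blocked. */
            return entry;
        }
    }
    return NULL;  /* Unknown ID. */
}
```

Until toy_lookup() succeeds, nothing in this model keeps the interpreter alive, which is what allows an interpreter to finalize when a callback never runs.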

void PyInterpreterState_Release(PyInterpreterState *interp)
Decrement the reference count of the interpreter, as was incremented by PyInterpreterState_Hold() or PyInterpreterState_Lookup().

This function cannot fail, other than with a fatal error. The caller does not need to have an attached thread state for interp.

Ensuring and releasing thread states

This proposal includes two new high-level threading APIs that intend to replace PyGILState_Ensure() and PyGILState_Release().

int PyThreadState_Ensure(PyInterpreterState *interp)
Ensure that the thread has an attached thread state for interp, and thus can safely invoke that interpreter. It is OK to call this function if the thread already has an attached thread state, as long as there is a subsequent call to PyThreadState_Release() that matches this one.

The reference to the interpreter interp is stolen by this function. As such, interp should have been acquired by PyInterpreterState_Hold().

Thread states created by this function are non-daemon by default. See PyThreadState_SetDaemon(). If the calling thread already has an attached thread state that matches interp, then this function will mark the existing thread state as non-daemon and return. It will be restored to its prior daemon status upon the next PyThreadState_Release() call.

Return zero on success, and non-zero on failure, with the old attached thread state restored (which may have been NULL).

void PyThreadState_Release(void)
Release the attached thread state set by PyThreadState_Ensure(). Any thread state that was set prior to the original call to PyThreadState_Ensure() will be restored.

This function cannot fail, but may hang the thread if the attached thread state prior to the original PyThreadState_Ensure() was daemon and the interpreter was finalized.
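The pairing and restore behavior can be modeled with a small per-thread stack. This is a toy in plain C, not the reference implementation: toy_thread_ensure(), toy_thread_release(), and toy_attached are hypothetical, interpreters are represented by plain integer IDs (0 meaning "no attached thread state"), and failing when the fixed-size stack is full is an assumption of the sketch.

```c
#include <assert.h>

#define TOY_MAX_DEPTH 8

/* Per-thread bookkeeping. toy_attached is the ID of the interpreter
   whose thread state is currently attached, or 0 for none. */
static _Thread_local int toy_saved[TOY_MAX_DEPTH];
static _Thread_local int toy_depth;
static _Thread_local int toy_attached;

/* Attach a thread state for interp_id, remembering what was attached
   before. Returns 0 on success and -1 on failure, in which case the
   previously attached state is left in place. */
static int
toy_thread_ensure(int interp_id)
{
    if (toy_depth == TOY_MAX_DEPTH) {
        return -1;
    }
    toy_saved[toy_depth++] = toy_attached;
    toy_attached = interp_id;
    return 0;
}

/* Undo the matching toy_thread_ensure() call and restore whatever
   was attached before it. */
static void
toy_thread_release(void)
{
    assert(toy_depth > 0);  /* Every release needs a matching ensure. */
    toy_attached = toy_saved[--toy_depth];
}
```

Nested calls unwind in reverse order here, so a thread that was attached to one interpreter, ensured a second, and then released ends up attached to the first again.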

Deprecation of PyGILState APIs

This PEP deprecates all of the existing PyGILState APIs in favor of the new PyThreadState APIs for the reasons given in the Motivation. Namely:

All of the PyGILState APIs are to be removed from the non-limited C API in Python 3.25. They will remain available in the stable ABI for compatibility.

Backwards Compatibility

This PEP specifies a breaking change with the removal of all the PyGILState APIs from the public headers of the non-limited C API in 10 years (Python 3.25).

Security Implications

This PEP has no known security implications.

How to Teach This

As with all C API functions, all the new APIs in this PEP will be documented in the C API documentation, ideally under the Non-Python created threads section. The existing PyGILState documentation should be updated accordingly to point to the new APIs.

Examples

These examples are here to help understand the APIs described in this PEP. Ideally, they could be reused in the documentation.

Single-threaded example

This example shows acquiring a lock in a Python method.

If this were to be called from a daemon thread, then the interpreter could hang the thread while reattaching the thread state, leaving us with the lock held. Any future finalizer that wanted to acquire the lock would be deadlocked!

static PyObject *
my_critical_operation(PyObject *self, PyObject *unused)
{
    assert(PyThreadState_GetUnchecked() != NULL);
    PyInterpreterState *interp = PyInterpreterState_Hold();
    /* Temporarily make this thread non-daemon to ensure that the
       lock is released. */
    if (PyThreadState_Ensure(interp) < 0) {
        PyErr_SetString(PyExc_PythonFinalizationError,
                        "interpreter is shutting down");
        return NULL;
    }

    Py_BEGIN_ALLOW_THREADS
    acquire_some_lock();
    Py_END_ALLOW_THREADS

    /* Do something while holding the lock */
    // ...

    release_some_lock();
    PyThreadState_Release();
    Py_RETURN_NONE;
}

Transitioning from old functions

The following code uses the old PyGILState APIs:

static int
thread_func(void *arg)
{
    PyGILState_STATE gstate = PyGILState_Ensure();
    /* It's not an issue in this example, but we just attached
       a thread state for the main interpreter. If my_method() was
       originally called in a subinterpreter, then we would be unable
       to safely interact with any objects from it. */
    if (PyRun_SimpleString("print(42)") < 0) {
        PyErr_Print();
    }
    PyGILState_Release(gstate);
    return 0;
}

static PyObject *
my_method(PyObject *self, PyObject *unused)
{
    PyThread_handle_t handle;
    PyThread_ident_t ident;

    if (PyThread_start_joinable_thread(thread_func, NULL, &ident, &handle) < 0) {
        return NULL;
    }
    Py_BEGIN_ALLOW_THREADS
    PyThread_join_thread(handle);
    Py_END_ALLOW_THREADS
    Py_RETURN_NONE;
}

This is the same code, updated to use the new functions:

static int
thread_func(void *arg)
{
    PyInterpreterState *interp = (PyInterpreterState *)arg;
    if (PyThreadState_Ensure(interp) < 0) {
        fputs("Cannot talk to Python", stderr);
        return -1;
    }
    if (PyRun_SimpleString("print(42)") < 0) {
        PyErr_Print();
    }
    PyThreadState_Release();
    return 0;
}

static PyObject *
my_method(PyObject *self, PyObject *unused)
{
    PyThread_handle_t handle;
    PyThread_ident_t ident;

    PyInterpreterState *interp = PyInterpreterState_Hold();
    if (PyThread_start_joinable_thread(thread_func, interp, &ident, &handle) < 0) {
        PyInterpreterState_Release(interp);
        return NULL;
    }
    Py_BEGIN_ALLOW_THREADS
    PyThread_join_thread(handle);
    Py_END_ALLOW_THREADS
    Py_RETURN_NONE;
}

Daemon thread example

Native daemon threads are still a use-case, and as such, they can still be used with this API:

static int
thread_func(void *arg)
{
    PyInterpreterState *interp = (PyInterpreterState *)arg;
    if (PyThreadState_Ensure(interp) < 0) {
        fputs("Cannot talk to Python", stderr);
        return -1;
    }
    (void)PyThreadState_SetDaemon(1);
    if (PyRun_SimpleString("print(42)") < 0) {
        PyErr_Print();
    }
    PyThreadState_Release();
    return 0;
}

static PyObject *
my_method(PyObject *self, PyObject *unused)
{
    PyThread_handle_t handle;
    PyThread_ident_t ident;

    PyInterpreterState *interp = PyInterpreterState_Hold();
    if (PyThread_start_joinable_thread(thread_func, interp, &ident, &handle) < 0) {
        PyInterpreterState_Release(interp);
        return NULL;
    }
    Py_RETURN_NONE;
}

Asynchronous callback example

As stated in the Motivation, there are many cases where it’s desirable to call Python in an asynchronous callback. In such cases, it’s not safe to call PyInterpreterState_Hold(), because it’s not guaranteed that PyThreadState_Ensure() will ever be called. If not, finalization becomes deadlocked.

This scenario requires using PyInterpreterState_Lookup() instead, which only prevents finalization once the lookup has been made.

For example:

typedef struct {
    int64_t interp_id;
} pyrun_t;

static int
async_callback(void *arg)
{
    pyrun_t *data = (pyrun_t *)arg;
    PyInterpreterState *interp = PyInterpreterState_Lookup(data->interp_id);
    PyMem_RawFree(data);
    if (interp == NULL) {
        fputs("Python has shut down", stderr);
        return -1;
    }
    if (PyThreadState_Ensure(interp) < 0) {
        fputs("Cannot talk to Python", stderr);
        return -1;
    }
    if (PyRun_SimpleString("print(42)") < 0) {
        PyErr_Print();
    }
    PyThreadState_Release();
    return 0;
}

static PyObject *
setup_callback(PyObject *self, PyObject *unused)
{
    PyThread_handle_t handle;
    PyThread_ident_t ident;

    pyrun_t *data = PyMem_RawMalloc(sizeof(pyrun_t));
    if (data == NULL) {
        return PyErr_NoMemory();
    }
    // Weak reference to the interpreter. It won't wait on the callback
    // to finalize.
    data->interp_id = PyInterpreterState_GetID(PyInterpreterState_Get());
    register_callback(async_callback, data);

    Py_RETURN_NONE;
}

Reference Implementation

A reference implementation of this PEP can be found here.

Rejected Ideas

Using an interpreter ID instead of an interpreter state for PyThreadState_Ensure

Some iterations of this API took an int64_t interp_id parameter instead of PyInterpreterState *interp, because interpreter IDs cannot be concurrently deleted and cause use-after-free violations. PyInterpreterState_Hold() fixes this issue anyway. An interpreter ID does have the benefit of requiring less magic in the implementation, but it has several downsides:

  • Nearly all existing interpreter APIs already return a PyInterpreterState pointer, not an interpreter ID. Functions like PyThreadState_GetInterpreter() would have to be accompanied by frustrating calls to PyInterpreterState_GetID(). There’s also no existing way to go from an int64_t back to a PyInterpreterState*, and providing such an API would come with its own set of design problems.
  • Threads typically take a void *arg parameter, not an int64_t arg. As such, passing an interpreter pointer requires much less boilerplate for the user, because an additional structure definition or heap allocation would be needed to store the interpreter ID. This is especially an issue on 32-bit systems, where void * is too small for an int64_t.
  • To retain usability, interpreter ID APIs would still need to keep a reference count, otherwise the interpreter could be finalizing before the native thread gets a chance to attach. The problem with using an interpreter ID is that the reference count has to be “invisible”; it must be tracked elsewhere in the interpreter, likely being more complex than PyInterpreterState_Hold(). There’s also a lack of intuition that a standalone integer could have such a thing as a reference count. PyInterpreterState_Lookup() sidesteps this problem because the reference count is always associated with the returned interpreter state, not the integer ID.
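The boilerplate mentioned in the second point might look roughly like the following. This is plain C with hypothetical helper names (toy_box_id() and toy_unbox_id()); it only illustrates why an int64_t needs a heap allocation to travel through a void *arg parameter on platforms where pointers are 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Box an interpreter ID so it can be passed as a thread's void *arg.
   Returns NULL if the allocation fails. */
static void *
toy_box_id(int64_t id)
{
    int64_t *box = malloc(sizeof(*box));
    if (box != NULL) {
        *box = id;
    }
    return box;
}

/* Unbox the ID inside the thread and free the allocation.
   `arg` must be a non-NULL pointer returned by toy_box_id(). */
static int64_t
toy_unbox_id(void *arg)
{
    assert(arg != NULL);
    int64_t id = *(int64_t *)arg;
    free(arg);
    return id;
}
```

With an interpreter pointer, none of this is needed: the pointer itself is passed as arg and cast back inside the thread.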

Exposing an Activate/Deactivate API instead of Ensure/Clear

In prior discussions of this API, it was suggested to provide actual PyThreadState pointers in the API in an attempt to make the ownership and lifetime of the thread state clearer:

More importantly though, I think this makes it clearer who owns the thread state - a manually created one is controlled by the code that created it, and once it’s deleted it can’t be activated again.

This was ultimately rejected for two reasons:

Using PyStatus for the return value of PyThreadState_Ensure

In prior iterations of this API, PyThreadState_Ensure() returned a PyStatus instead of an integer to denote failures, which had the benefit of providing an error message.

This was rejected because it’s not clear that an error message would be all that useful; all the conceived use-cases for this API wouldn’t really care about a message indicating why Python can’t be invoked. As such, the API would only be needlessly harder to use, which in turn would hurt the transition from PyGILState_Ensure().

In addition, PyStatus isn’t commonly used in the C API. A few functions related to interpreter initialization use it (simply because they can’t raise exceptions), and PyThreadState_Ensure() does not fall under that category.

Open Issues

When should the legacy APIs be removed?

PyGILState_Ensure() and PyGILState_Release() have been around for over two decades, and it’s expected that the migration will be difficult. Currently, the plan is to remove them in 10 years (as opposed to the 5 years required by PEP 387), but this is subject to further discussion, as it’s unclear if that’s enough (or too much) time.

In addition, it’s unclear whether to remove them at all. A soft deprecation could reasonably fit for these functions if it’s determined that a full PyGILState removal would be too disruptive for the ecosystem.


Source: https://github.com/python/peps/blob/main/peps/pep-0788.rst

Last modified: 2025-04-30 16:59:34 GMT