PyPy and software transactional memory
This article brought to you by LWN subscribers Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible. |
Software transactional memory (STM) is an ambitious feature that is being worked on for the PyPy interpreter. STM is an alternative to locking that is meant to eventually replace the (in)famous global interpreter lock (GIL) in the PyPy version of Python. That should also help multi-threaded Python programs scale better. But STM is tricky, necessitating two separate rewrites of that code for PyPy—including one announced at the beginning of February.
The idea behind STM is relatively straightforward. Shared data is processed inside of "transactions" that can either fully complete or, if there are conflicts, be rolled back. While a thread is processing a transaction, it records any shared object that is read or written. At the end of the transaction, the thread verifies that no other thread has changed any of the values participating in the transaction (i.e. those recorded). If so, the change is committed, and other threads will see the changes made. If not, the changes are rolled back and retried in a new transaction.
A bit of history
PyPy, which is a Python interpreter written in Python, has always had a focus on performance. That may seem a little contradictory, given that it is written in a language not particularly known for its performance characteristics, but PyPy uses various techniques to overcome that. First, it uses a restricted subset of Python (called RPython) that can easily be statically analyzed. Second, PyPy uses a just-in-time (JIT) compiler to increase its performance.
But STM has been on the project's radar for a few years. PyPy project lead Armin Rigo posted a message to the pypy-dev mailing list about STM back in 2011, but he referenced a similar idea for Python from 2003. Rigo went on to post a PyPy blog message calling for adding STM to PyPy. Seven months later, in March 2012, a call for donations to support STM went out and has so far gathered roughly half of the $50,400 target. Work on the feature started soon after the donation call.
Rigo (and Remi Meier, who works with him on STM for PyPy) then went back to the drawing board in June 2013. Originally, the STM piece was written in C, while the garbage collector was written in RPython. That turned out to be problematic, so they turned to an all-C approach. But, as Rigo stressed, STM in PyPy is a research project—it might take a few iterations before things all come together.
And some status
Another of those iterations came about recently. Rigo and Meier have come up with a different way to organize PyPy's memory to better support STM. It will require a full rewrite of the C STM library, but once that is done, the existing pypy-stm code should be fairly easily moved to it. Early indications are that the new mechanism is significantly faster:
As outlined in a draft design document, the idea behind the new mechanism came out of the discovery of the remap_file_pages() Linux system call. Using remap_file_pages() means the new library will not be portable to other operating systems, at least initially. There are some thoughts on how to add it for Windows and others, but that is still a ways off.
The basic idea is to mmap() enough memory for N copies of the shared data, where N is the number of threads (shown at right). By using remap_file_pages(), each thread's portion of that memory area is actually mapped to the same set of pages, so all threads share the same memory even though each thread sees that memory at a different virtual address. Some trickery with the Clang compiler causes code to be emitted that uses the %gs register as an offset to shift the memory accesses by that amount. It is similar to the way the %fs register is used by the POSIX threads (Pthreads) library for thread-local variables.
The diagram at left shows two threads (T1, T2) running, each with their address space pointing in to the shared region.
When a thread wants to modify an object, the page
that it resides in is "unshared" by another call to
remap_file_pages() that remaps the address back into the
thread-local part of the mapped memory: "i.e. stop the
corresponding %gs-relative,
thread-local page from mapping to the same physical page as
others
".
Much like a copy-on-write operation, each access causes a
new page to be
allocated and the
contents of the corresponding shared page are copied to it.
At right, we see that thread T2 is modifying an object, so it points to the private copy made.
Each transaction will record the objects it accesses: privately it records the objects it reads and publicly records those it writes. Before an object is modified, it is unshared, thus copied to the thread-local page as described above; it can then be rolled back by copying the shared page again. In addition, only one transaction is allowed to write to a given object—other transactions are aborted if they attempt to write to it. Once a transaction completes successfully, which means there were no reads of data changed by another thread (writes can't conflict due to the "one transaction writing an object" rule mentioned above), the page is copied back to the shared area.
It seems like a clever approach and, as the preliminary benchmarks indicate, one that has some significant performance advantages. But it is important to remember Rigo's warning that it is a research project. Significant progress has been made before, but needed to be redone—that could happen again. But, with luck, this approach will lead to a PyPy without a GIL but with the performance to allow it to be used for multi-threaded programs on systems with two cores or more.
(Log in to post comments)