runtime: markroot over finalizers is stop-the-world #11485

Closed
aclements opened this issue Jun 30, 2015 · 8 comments

@aclements

On multi-gigabyte heaps, about half of our mark termination time (~5ms of ~10ms at 4GB) is spent in the loop over all spans in the _RootSpans case of markroot. We should fix this for 1.6.
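
A simplified model of that loop (hypothetical types, not the actual runtime code) shows why the cost tracks the number of spans rather than the number of finalizers:

```go
// Hypothetical model of the _RootSpans work, not the actual runtime code.
type special struct {
	isFinalizer bool
	next        *special
}

type span struct {
	specials *special // finalizers and other specials attached to this span
}

// scanFinalizerRoots must visit every span just to find the few that carry
// finalizer specials, so the cost is O(#spans) even with few finalizers.
func scanFinalizerRoots(allspans []*span, markObject func(*special)) {
	for _, s := range allspans {
		for sp := s.specials; sp != nil; sp = sp.next {
			if sp.isFinalizer {
				markObject(sp) // greyen the finalized object and its finalizer
			}
		}
	}
}
```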

@RLH, @dr2chase, and I discussed a few possible fixes. One is to maintain a global list of finalizers in addition to the per-span lists so that this loop is proportional to the number of finalizers rather than the number of spans. This could still be slow in a hypothetical program with many finalizers (maybe we don't care) and might complicate synchronization around the specials and finalizers lists.

Another possibility is to scan finalizers during the concurrent scan phase, perform the appropriate write barriers for finalizers added during GC, and then simply eliminate this re-scan from mark termination. It's entirely possible we already have all or most of the write barriers we need for this. There is one quirk to this solution: currently we only scan the object with the finalizer during mark termination. It's unclear why we do this, and it's probably a bad idea, since it means we're delaying mark work we could have done concurrently.

@aclements

See issue #11484 for most of the other half of mark termination time.

@aclements

@RLH and I talked today about marking objects reachable from finalizers during concurrent scan rather than during mark termination. We concluded that it should probably Just Work, modulo one quirk: if SetFinalizer is called during a GC cycle, it must itself call scanobject on the object. Otherwise, it's possible to have objects A -> B where A is reachable at the beginning of GC but does not have a finalizer at the time of concurrent scan; then, during concurrent mark, the mutator sets a finalizer on A and makes A unreachable before it gets marked. If SetFinalizer doesn't call scanobject, B won't be marked and may be collected before A's finalizer runs.
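
A sketch of that fix, using hypothetical stand-ins for the runtime internals:

```go
// Hypothetical stand-ins for runtime internals, for illustration only.
var gcMarkRunning bool                           // true while concurrent mark is in progress
func addFinalizerSpecial(obj any, fin func()) {} // attach the finalizer to obj's span
func gcMarkObject(obj any)                    {} // scan obj, greying everything it points to

// setFinalizer sketches the quirk described above: if a finalizer is
// installed while concurrent mark is running, scan the object immediately,
// so B (reachable only through A) cannot be collected before A's finalizer
// runs even if A becomes unreachable afterwards.
func setFinalizer(obj any, fin func()) {
	addFinalizerSpecial(obj, fin)
	if gcMarkRunning {
		gcMarkObject(obj)
	}
}
```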

We might also need more locking. There are comments in addspecial/removespecial claiming that it's okay for the GC to access the specials list without locks, but they don't explain why and I'm not sure I believe them.

rsc changed the title from "runtime: fix loop over spans in markroot" to "runtime: markroot over finalizers is stop-the-world" on Aug 12, 2015
@rsc

rsc commented Aug 12, 2015

I believe the primary expense in finalizers is treating them as roots, right? For the *os.File finalizer, what we care about is getting at the 'fd int' field, not any pointers the *os.File contains. One option is to have a runtime.unsafeSetFinalizer that arranges for the finalizer to be called as usual but without treating the object as a root. The finalizer must be careful not to access pointers contained in the object. We could use that for *os.File but not expose it otherwise.

A safer option, which I like better, might be to introduce an optimization in SetFinalizer that if the object contains no pointers, it doesn't count as a root during GC (since it points at nothing). Then we can put the 'fd int' in its own finalized mini-object that satisfies this property. That would eliminate basically all the finalizer roots in network heavy programs with tons of open network connections, which should kill off the finalizer contribution to stop-the-world in those programs.
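
A minimal sketch of the mini-object idea (File and fdHandle here are illustrative, not the real os types):

```go
// Illustrative sketch only; these are not the actual os package types.
package osfd

import "runtime"

// fdHandle holds only the file descriptor and no pointers, so under the
// proposed optimization SetFinalizer would not need to treat it as a root.
type fdHandle struct {
	fd int
}

// File keeps its pointer-bearing fields separate from the finalized handle.
type File struct {
	h    *fdHandle
	name string
}

func newFile(fd int, name string) *File {
	h := &fdHandle{fd: fd}
	// The finalizer closure touches only the pointer-free mini-object.
	runtime.SetFinalizer(h, func(h *fdHandle) { closeFD(h.fd) })
	return &File{h: h, name: name}
}

func closeFD(fd int) {
	// e.g. syscall.Close(fd) on Unix; elided to keep the sketch portable.
}
```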

Would that help enough?

The sweep of a span containing finalized objects would still need to arrange to call the finalizers, but that runs concurrently so I assume it is not as critical to the speed.

@aclements

The current cost of finalizers is O(# of finalizers + # of spans). The constant on the (# of spans) term is about 1ms per GB of heap. I haven't done the profiling to determine the constant on the (# of finalizers) term, but it does mean that even if we eliminate the marking for some finalizers, we still have the (# of spans) term. If the constant on the (# of finalizers) term turns out to be high, then I think your suggestion may be a good direction. But I suspect it is low, and just moving this work to concurrent mark should be enough, since the cost of the whole mark phase is much higher than 1ms per GB of heap.

Of course, improving on the (# of spans) term is good, too. Currently, we just loop over all of the spans to find spans with finalizers. We could keep a better data structure instead.
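
For example (reusing the hypothetical span and special types from the earlier sketch), addfinalizer/removefinalizer could maintain a registry of the spans that actually carry finalizers, so the root scan visits only those:

```go
// Sketch only; in the runtime this registry would need its own locking.
var spansWithFinalizers = make(map[*span]struct{})

// addfinalizer/removefinalizer would keep the registry in sync.
func noteSpanGainedFinalizer(s *span) { spansWithFinalizers[s] = struct{}{} }
func noteSpanLostFinalizers(s *span)  { delete(spansWithFinalizers, s) }

// The root scan now does work proportional to the number of spans that
// actually carry finalizers, not to the total number of spans.
func scanFinalizerRootsFast(markObject func(*special)) {
	for s := range spansWithFinalizers {
		for sp := s.specials; sp != nil; sp = sp.next {
			if sp.isFinalizer {
				markObject(sp)
			}
		}
	}
}
```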

Note that the finalizer on net.netFD objects is more complex than on os.File objects. However, it looks like everything the finalizer actually touches is still pointer-free (except on nacl!).

@gopherbot

CL https://golang.org/cl/13112 mentions this issue.

aclements added a commit that referenced this issue Sep 14, 2015
Marking of span roots can represent a significant fraction of the time
spent in mark termination. Simply traversing the span list takes about
1ms per GB of heap and if there are a large number of finalizers (for
example, for network connections), it may take much longer.

Improve the situation by splitting the span scan into 128 subtasks
that can be executed in parallel and load balanced by the markroots
parallel for. This lets the GC balance this job across the Ps.

A better solution is to do this during concurrent mark, or to improve
it algorithmically, but this is a simple change with a lot of bang for
the buck.

This was suggested by Rhys Hiltner.

Updates #11485.

Change-Id: I8b281adf0ba827064e154a1b6cc32d4d8031c03c
Reviewed-on: https://go-review.googlesource.com/13112
Reviewed-by: Keith Randall <khr@golang.org>
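A rough sketch of the sharding the commit message describes (the constant and helper are illustrative, not the CL's actual code; span is the hypothetical type from the earlier sketch):

```go
// Illustrative sketch of splitting the span scan into fixed-size chunks.
const rootSpanShards = 128

// spanShard returns the i'th chunk of the span list. Each chunk becomes a
// separate root-marking job, so idle GC workers can pick up chunks in
// parallel instead of one worker walking every span.
func spanShard(allspans []*span, i int) []*span {
	per := (len(allspans) + rootSpanShards - 1) / rootSpanShards // ceil
	lo := i * per
	hi := lo + per
	if lo > len(allspans) {
		lo = len(allspans)
	}
	if hi > len(allspans) {
		hi = len(allspans)
	}
	return allspans[lo:hi]
}
```
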
@aclements

> We might also need more locking. There are comments in addspecial/removespecial claiming that it's okay for the GC to access the specials list without locks, but they don't explain why and I'm not sure I believe them.

It's certainly not safe for the GC to traverse the specials list during removefinalizer/removespecial. Memory for finalizers is manually managed, so at best the GC could read from freed memory while traversing the list. This isn't a problem today because every place that traverses the list does so either during STW or while holding span.speciallock. We could traverse the list during concurrent GC without much cost by peeking at the list head without holding span.speciallock and, only if it's non-nil, taking the lock and walking the list.
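
A sketch of that cheap-peek pattern (assuming the hypothetical span type from the earlier sketch also carries the speciallock mutex that guards its list):

```go
// Sketch of the unlocked peek followed by a locked walk; not runtime code.
func scanSpanSpecialsConcurrently(s *span, visit func(*special)) {
	// Unlocked peek: the common case is a span with no specials at all,
	// so most spans are skipped without touching the lock.
	if s.specials == nil {
		return
	}
	// Only spans that appear to have specials pay for the lock and walk.
	s.speciallock.Lock()
	for sp := s.specials; sp != nil; sp = sp.next {
		visit(sp)
	}
	s.speciallock.Unlock()
}
```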

Traversing the list during addfinalizer/addspecial without locking could be safe, but it isn't in theory under the Go memory model, and it may not be in practice on non-TSO machines, because there is no publication fence between filling in the special and splicing it into the list.
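
One illustrative way to add such a publication fence (not the runtime's actual code) is to fully initialize the special and then publish it with an atomic pointer store, so an unlocked reader that atomically loads the list head never observes a half-initialized node:

```go
// Standalone sketch of publish-after-initialize; not the runtime's code.
package specials

import (
	"sync/atomic"
	"unsafe"
)

type special struct {
	kind int
	next *special
}

type span struct {
	specials *special
}

// addSpecial fills in every field of the new special first, then splices
// it in with an atomic store, which supplies the missing publication fence.
func addSpecial(s *span, kind int) {
	sp := &special{kind: kind}
	sp.next = s.specials
	atomic.StorePointer(
		(*unsafe.Pointer)(unsafe.Pointer(&s.specials)),
		unsafe.Pointer(sp),
	)
}
```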

@aclements

I think I understand the synchronization (and lack of synchronization) involved here. I'm going to try moving this into the concurrent phase. We can separately optimize it by reducing the loop to O(# of finalizers) or even O(# of non-trivial finalizers), but at the end of the day we could still have a program with a legitimately large number of finalizers, and I don't want STW to be proportional to that. Plus, it won't be a big time sink compared to other things in the concurrent phase, so this shifts the focus to the things that affect overall throughput more.

@gopherbot

CL https://golang.org/cl/14982 mentions this issue.
