
Change scheduling complexity from O(containers) to O(nodes) #18831

Closed
hongchaodeng opened this issue Dec 17, 2015 · 14 comments
Labels
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.


@hongchaodeng
Contributor

The scheduling algorithm's complexity is currently O(containers) -- not O(pods) or O(nodes). Scheduling latency grows as more containers/pods are scheduled: we last observed 20 ms at 0 pods, rising to 200 ms at 30k pods.

After fixing #18255, #18452, #18466, #18413, #18170, and making use of the scheduler test suite #18458, we found that the majority of the time was spent in PodFitsResources and PodFitsHostPorts, as well as in the GC burden they bring.

The root problem in both is that they list all containers for every scheduling decision -- O(containers). By caching the entire node-to-pods map and each node's aggregated requested resources, and updating them incrementally, both can be optimized to O(nodes) while remaining logically equivalent.
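As a rough illustration of the O(containers) cost (using hypothetical simplified types, not the actual scheduler structs), the unoptimized path recomputes a node's requested resources by walking every container on it for every single decision:

```go
package main

import "fmt"

// Hypothetical types standing in for the scheduler's pod objects.
type Container struct{ MilliCPU, Memory int64 }
type Pod struct{ Containers []Container }

// requestedOnNode recomputes the node's aggregated request from scratch
// by listing every container of every pod on the node -- this is the
// per-decision O(containers) work the issue describes.
func requestedOnNode(podsOnNode []Pod) (cpu, mem int64) {
	for _, p := range podsOnNode {
		for _, c := range p.Containers {
			cpu += c.MilliCPU
			mem += c.Memory
		}
	}
	return cpu, mem
}

func main() {
	pods := []Pod{
		{Containers: []Container{{MilliCPU: 100, Memory: 200}}},
		{Containers: []Container{{MilliCPU: 300, Memory: 400}, {MilliCPU: 50, Memory: 60}}},
	}
	cpu, mem := requestedOnNode(pods)
	fmt.Println(cpu, mem) // 450 660
}
```

Caching this aggregate per node and updating it incrementally turns the per-decision cost into a single map lookup per node.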

After optimizing only PodFitsResources, we were able to reduce scheduling latency to 20 ms, and the scheduling rate no longer degrades much as more pods get scheduled. The benchmark results:

BenchmarkScheduling100Nodes0Pods-4           200           5530781 ns/op
BenchmarkScheduling100Nodes1000Pods-4        200           7058672 ns/op
BenchmarkScheduling1000Nodes0Pods-4          100          15344940 ns/op
BenchmarkScheduling1000Nodes10kPods-4        100          20022489 ns/op
BenchmarkScheduling1000Nodes20kPods-4        100          19291019 ns/op
BenchmarkScheduling1000Nodes29kPods-4        100          24107608 ns/op

Before the optimization, PodFitsResources computed the pod's request for every node, listed all containers on every node, and calculated the aggregated requested resources to compare against the pod's request. We added a cache of each node's aggregated requested resources. When a pod is assumed, scheduled, or deleted, a watch-based update is applied to the cache. The predicate then simply reads from the cache while iterating over the nodes. We also reused the cache for MapPodsToMachines in findNodesThatFit, replacing a full pod listing with watch-based incremental updates.
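A minimal sketch of such an incrementally updated cache, under illustrative names (NodeInfoCache, Resource) that are not the actual scheduler types: assume/schedule events add a pod's request to its node's aggregate, deletion events subtract it, and the fit predicate reads the aggregate in O(1) per node.

```go
package main

import "fmt"

// Resource is a simplified aggregate of a pod's resource request.
type Resource struct{ MilliCPU, Memory int64 }

// NodeInfoCache keeps a per-node aggregated request, updated
// incrementally from watch events instead of re-listing containers.
type NodeInfoCache struct {
	requested map[string]Resource // node name -> aggregated request
}

func NewNodeInfoCache() *NodeInfoCache {
	return &NodeInfoCache{requested: make(map[string]Resource)}
}

// AssumePod is applied when a pod is assumed or scheduled onto a node.
func (c *NodeInfoCache) AssumePod(node string, req Resource) {
	r := c.requested[node]
	r.MilliCPU += req.MilliCPU
	r.Memory += req.Memory
	c.requested[node] = r
}

// RemovePod is applied on a pod-deletion watch event.
func (c *NodeInfoCache) RemovePod(node string, req Resource) {
	r := c.requested[node]
	r.MilliCPU -= req.MilliCPU
	r.Memory -= req.Memory
	c.requested[node] = r
}

// Requested lets a fit predicate read the aggregate in O(1) per node.
func (c *NodeInfoCache) Requested(node string) Resource { return c.requested[node] }

func main() {
	cache := NewNodeInfoCache()
	cache.AssumePod("node-1", Resource{MilliCPU: 100, Memory: 256})
	cache.AssumePod("node-1", Resource{MilliCPU: 200, Memory: 128})
	cache.RemovePod("node-1", Resource{MilliCPU: 100, Memory: 256})
	fmt.Println(cache.Requested("node-1")) // {200 128}
}
```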

There is still room for improvement: the latest profile output shows that PodFitsHostPorts is still a large consumer of CPU cycles; @wojtek-t mentioned in the scheduler test suite that listing tens of thousands of pods can hurt scheduler performance; the priority functions are still O(containers); etc.

Latest benchmark result on master (0f22ba4):

BenchmarkScheduling100Nodes0Pods-4           200           6049842 ns/op
BenchmarkScheduling100Nodes1000Pods-4        200          10174731 ns/op
BenchmarkScheduling1000Nodes0Pods-4           50          22099051 ns/op
BenchmarkScheduling1000Nodes10kPods-4         20          85206355 ns/op
BenchmarkScheduling1000Nodes20kPods-4         5          220729042 ns/op
BenchmarkScheduling1000Nodes29kPods-4         5          211274674 ns/op

benchcmp shows our improvement:

benchmark                                 old ns/op     new ns/op     delta
BenchmarkScheduling100Nodes0Pods-4        6049842       5530781       -8.58%
BenchmarkScheduling100Nodes1000Pods-4     10174731      7058672       -30.63%
BenchmarkScheduling1000Nodes0Pods-4       22099051      15344940      -30.56%
BenchmarkScheduling1000Nodes10kPods-4     85206355      20022489      -76.50%
BenchmarkScheduling1000Nodes20kPods-4     220729042     19291019      -91.26%
BenchmarkScheduling1000Nodes29kPods-4     211274674     24107608      -88.59%

@hongchaodeng hongchaodeng changed the title Optimize scheduling complexity from O(containers) to O(nodes) Change scheduling complexity from O(containers) to O(nodes) Dec 17, 2015
@wojtek-t
Member

Awesome - I'm happy to review your changes.

@kubernetes/sig-scalability @davidopp

@HaiyangDING

Great work, I'm looking forward to your PR!

@HardySimpson
Contributor

Does the change happen in etcd, or in the scheduler's memory?

@wojtek-t
Member

I don't think we should change anything in etcd - this should be a scheduler-only change.

@davidopp davidopp added team/control-plane priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Dec 18, 2015
@davidopp
Member

We added a cache for aggregated requested resource of every node. When a pod is assumed, scheduled, or deleted, a watch based update is made upon the cache.

I think the caching you're proposing doesn't cause any new problems for a multiple-scheduler configuration (even in the current code without a cache, a new pod can be bound by scheduler A in between scheduler B doing its List and acting on the result), but it's worth thinking about.

ref/ #18745

@davidopp
Member

Actually it's not the same as #18745; #18745 is about multiple schedulers scheduling the same pod at the same time, but my concern here is multiple schedulers scheduling different pods onto the same node at the same time. The kubelet will detect the conflict anyway, so it's not a correctness problem, just a possible increase in conflicts.

@hongchaodeng
Contributor Author

Hi @davidopp, thanks for your comments.

This concern exists today, even without my changes. Optimistic scheduling will produce conflicts one way or another, and I'm not trying to make a tradeoff here. The proposed changes just make the existing cache more efficient.

@davidopp
Member

Yeah, I was just thinking it might increase the rate of conflicts: if the cache replaces re-listing, the scheduler uses less fresh data when making scheduling decisions. That's not a reason not to make the change you suggested, but it's something to watch out for once we have multiple schedulers.

@davidopp
Member

It seems that a simple way to speed up PodFitsHostPorts() is to change it to something like this:

wantPorts := getUsedPorts(pod)
if len(wantPorts) == 0 {
    return true, nil
}
existingPorts := getUsedPorts(existingPods...)
for wport := range wantPorts {
    // ... rest of function as-is ...
}

I think that's probably sufficient. The vast majority of pods will not request a hostPort.
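For reference, a self-contained sketch of the early-exit idea above, with a simplified Pod type and a plain hostPort list standing in for the real API objects (the actual predicate also returns an error):

```go
package main

import "fmt"

// Pod is a simplified stand-in: just the hostPorts its containers request.
type Pod struct{ HostPorts []int }

// getUsedPorts collects every hostPort requested by the given pods.
func getUsedPorts(pods ...Pod) map[int]bool {
	ports := make(map[int]bool)
	for _, p := range pods {
		for _, hp := range p.HostPorts {
			ports[hp] = true
		}
	}
	return ports
}

// PodFitsHostPorts with the proposed early exit: if the pod requests no
// hostPort (the common case), skip scanning the node's existing pods.
func PodFitsHostPorts(pod Pod, existingPods []Pod) bool {
	wantPorts := getUsedPorts(pod)
	if len(wantPorts) == 0 {
		return true
	}
	existingPorts := getUsedPorts(existingPods...)
	for wport := range wantPorts {
		if existingPorts[wport] {
			return false
		}
	}
	return true
}

func main() {
	existing := []Pod{{HostPorts: []int{80}}}
	fmt.Println(PodFitsHostPorts(Pod{}, existing))                     // true: no hostPort requested
	fmt.Println(PodFitsHostPorts(Pod{HostPorts: []int{80}}, existing)) // false: port 80 already taken
}
```

Since most pods request no hostPort, the early return avoids building the existing-ports map for them entirely.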

@hongchaodeng
Contributor Author

@davidopp
I agree that this will speed up both the benchmark and the kubemark results.
Realistic scenarios would be much better to back up the argument, but since we can't be sure about those, keeping it simple as you suggested sounds good to me.

@bgrant0607 bgrant0607 added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 16, 2016
@davidopp davidopp modified the milestones: next-candidate, v1.2 Feb 4, 2016
@davidopp
Member

davidopp commented Feb 4, 2016

Removing from 1.2 milestone, adding to next-candidate.

If we can hit our 1000 node scaling goals without this change, then we do not need it for 1.2.

@timothysc
Member

@hongchaodeng can we close this now?

@wojtek-t
Member

In fact, there are still some things here that are not solved (e.g. the spreading priority is O(pods)). But I'm fine with closing this one and filing separate issues for them.
