Change scheduling complexity from O(containers) to O(nodes) #18831
Comments
Awesome - I'm happy to review your changes. @kubernetes/sig-scalability @davidopp
Great work, I'm looking forward to your PR!
Does the change happen in etcd or in the memory of the scheduler?
I don't think we should change anything in etcd - this should be a scheduler-only change.
I think the caching you're proposing doesn't cause any new problems for a multiple-scheduler configuration (in the current code, without a cache, with multiple schedulers a new pod can be bound by scheduler A in between scheduler B doing List and doing work on the result), but it's worth thinking about. ref #18745
Actually it's not the same as #18745; #18745 is about multiple schedulers scheduling the same pod at the same time, but my concern here is multiple schedulers scheduling different pods onto the same node at the same time. Kubelet will detect the conflict anyway, so it is not a correctness problem, just perhaps an increase in conflicts.
Hi @davidopp, thanks for your comments. The concern exists today even without my changes; optimistic scheduling will bring conflicts in one manner or another. I'm not trying to make a tradeoff here. The changes proposed are just trying to make the existing cache more efficient.
Yeah, I was just thinking it might increase the rate of conflicts if the cache replaces re-listing, since the scheduler would be using less fresh data when making scheduling decisions. It's not a reason not to make the change you suggested, but something to watch out for once we have multiple schedulers.
It seems that a simple way to speed up `PodFitsHostPorts` would be to only check pods that actually request a hostPort.
I think that's probably sufficient. The vast majority of pods will not request a hostPort. |
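A minimal Go sketch of the shortcut being discussed here - `Pod` and `Port` are simplified hypothetical stand-ins, not the actual Kubernetes API types: host-port bookkeeping only ever touches containers that declare a `HostPort`, so the common case (no hostPort requested) does essentially no work.

```go
package main

import "fmt"

// Port and Pod are simplified stand-ins for the real API types.
type Port struct{ HostPort int }
type Pod struct{ Ports []Port }

// usedHostPorts aggregates the host ports in use on a node. Pods that
// request no hostPort (the vast majority) contribute nothing.
func usedHostPorts(pods []Pod) map[int]bool {
	used := map[int]bool{}
	for _, p := range pods {
		for _, port := range p.Ports {
			if port.HostPort > 0 {
				used[port.HostPort] = true
			}
		}
	}
	return used
}

// fitsHostPorts reports whether pod can land on a node whose occupied
// host ports are in used; a pod with no hostPort always fits.
func fitsHostPorts(pod Pod, used map[int]bool) bool {
	for _, port := range pod.Ports {
		if port.HostPort > 0 && used[port.HostPort] {
			return false
		}
	}
	return true
}

func main() {
	used := usedHostPorts([]Pod{{Ports: []Port{{HostPort: 8080}}}})
	fmt.Println(fitsHostPorts(Pod{Ports: []Port{{HostPort: 8080}}}, used)) // false: conflict
	fmt.Println(fitsHostPorts(Pod{}, used))                               // true: no hostPort requested
}
```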
@davidopp |
Removing from 1.2 milestone, adding to next-candidate. If we can hit our 1000 node scaling goals without this change, then we do not need it for 1.2. |
@hongchaodeng can we close this now? |
In fact there are already some things that are not solved here (e.g. spreading priority is O(pods)). But I'm fine with closing this one and filing separate issues for such things. |
So far the scheduling algorithm's complexity is O(containers) -- not O(pods) or O(nodes). Scheduling latency grows as more containers/pods get scheduled -- last time we observed 20 ms at 0 pods, increasing to 200 ms at 30k pods.
After fixing #18255, #18452, #18466, #18413, and #18170, and making use of the scheduler test suite (#18458), we found that the majority of time was spent in `PodFitsResources` and `PodFitsHostPorts`, as well as on the GC burden they brought. The root problem of both is that they list all containers for every scheduling decision -- O(containers). By caching the full node-to-pods map and each node's aggregated requested resources, and updating them incrementally, both checks can be optimized to O(nodes) while remaining logically equivalent.
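As a rough illustration of the O(nodes) idea -- a hedged sketch with made-up types, not the actual scheduler code -- if each node carries a cached aggregate of the resources its pods have requested, the per-node fit check becomes a constant-time comparison instead of a walk over all containers:

```go
// Minimal sketch of the cached fit check; Resource and NodeInfo are
// illustrative, not the real Kubernetes types.
package main

import "fmt"

// Resource is the aggregate CPU/memory requested on a node.
type Resource struct {
	MilliCPU int64
	Memory   int64
}

// NodeInfo caches the sum of requests of all pods assigned to a node,
// so a fit check no longer needs to list the node's containers.
type NodeInfo struct {
	Requested Resource
	Capacity  Resource
}

// PodFitsResources is now O(1) per node: compare the pod's request
// against the cached aggregate instead of re-listing all containers.
func PodFitsResources(podRequest Resource, node *NodeInfo) bool {
	return node.Requested.MilliCPU+podRequest.MilliCPU <= node.Capacity.MilliCPU &&
		node.Requested.Memory+podRequest.Memory <= node.Capacity.Memory
}

func main() {
	node := &NodeInfo{
		Requested: Resource{MilliCPU: 1500, Memory: 2 << 30},
		Capacity:  Resource{MilliCPU: 2000, Memory: 4 << 30},
	}
	fmt.Println(PodFitsResources(Resource{MilliCPU: 400, Memory: 1 << 30}, node)) // true
	fmt.Println(PodFitsResources(Resource{MilliCPU: 600, Memory: 1 << 30}, node)) // false: CPU over capacity
}
```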
After optimizing only `PodFitsResources`, we were able to reduce the scheduling latency to 20 ms, and the scheduling rate no longer drops much as more and more pods get scheduled. The benchmark results:

Before the optimization, `PodFitsResources` computed the pod's resource request for every node, listed all containers on every node, and calculated the node's aggregated requested resources to compare against the pod's request. We added a cache of the aggregated requested resources for every node: when a pod is assumed, scheduled, or deleted, a watch-based update is applied to the cache, and the fit check simply reads from the cache while iterating over the nodes. We also reused the cache for `MapPodsToMachines` in `findNodesThatFit` -- optimized from listing all pods to watch-based incremental updates.
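Continuing the hypothetical sketch above (add `"sync"` to its imports), the watch-based incremental update might look roughly like this: each pod event adjusts a single node's cached aggregate, so keeping the cache fresh is O(1) per event rather than a re-list of all pods per scheduling decision.

```go
// Cache maps node name -> cached aggregate. It is updated incrementally
// from pod watch events instead of re-listing all pods per decision.
type Cache struct {
	mu    sync.Mutex
	nodes map[string]*NodeInfo
}

// AddPod is called when a pod is assumed or bound to nodeName; it
// adjusts that single node's aggregate in O(1).
func (c *Cache) AddPod(nodeName string, req Resource) {
	c.mu.Lock()
	defer c.mu.Unlock()
	info := c.nodes[nodeName]
	if info == nil {
		info = &NodeInfo{}
		c.nodes[nodeName] = info
	}
	info.Requested.MilliCPU += req.MilliCPU
	info.Requested.Memory += req.Memory
}

// RemovePod is the inverse, called when a pod is deleted.
func (c *Cache) RemovePod(nodeName string, req Resource) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if info := c.nodes[nodeName]; info != nil {
		info.Requested.MilliCPU -= req.MilliCPU
		info.Requested.Memory -= req.Memory
	}
}
```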
There are still places for improvement: the latest profile output showed that `PodFitsHostPorts` is still a large consumer of CPU cycles; @wojtek-t mentioned in the scheduler test suite that listing tens of thousands of pods can potentially affect scheduler performance; the `priorityFuncs` are still O(containers); etc.

Latest benchmark result on master (0f22ba4):
benchcmp shows our improvement:
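For context on the workflow: `benchcmp` (from golang.org/x/tools) diffs the iteration counts and ns/op numbers from two `go test -bench` runs. A self-contained toy sketch of producing one such measurement with Go's `testing` package -- the `fit` function here is a made-up stand-in, not the real scheduler benchmark from #18458:

```go
package main

import (
	"fmt"
	"testing"
)

// fit stands in for the O(1) cached check being measured.
func fit(requested, request, capacity int64) bool {
	return requested+request <= capacity
}

func main() {
	// testing.Benchmark picks b.N automatically and returns the
	// iteration count and ns/op -- the numbers benchcmp compares
	// between two `go test -bench` outputs (old.txt vs. new.txt).
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			_ = fit(1500, 400, 2000)
		}
	})
	fmt.Println(res)
}
```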