Linux I/O Schedulers

Linux I/O Schedulers Hao-Ran Liu Why I/O scheduler? „ Disk seek is the slowest operation in a computer „ A system would perform horribly without a suitable I/O scheduler „ I/O scheduler arranges the disk head to move in a single direction to minimize seeks „ Like the way elevators moves between floors „ Achieve greater global throughput at the expense of fairness to some requests What do I/O schedulers do? „ Improve overall disk throughput by „ Reorder requests to reduce the disk seek time „ Merge requests to reduce the number of requests „ Prevent starvation „ submit requests before deadline „ Avoid read starvation by write „ Provide fairness among different processes Linux I/O scheduling framework „ Linux elevator is an abstract layer to which different I/O scheduler can attach „ Merging mechanisms are provided by request queues „ Front or back merge of a request and a bio „ Merge two requests „ Sorting policy and merge decision are done in elevators „ Pick up a request to be merged with a bio „ Add a new request to the request queue „ Select next request to be processed by block drivers elevator (sorting) queue (merging) block drivers policy mechanism Abstraction of Linux I/O scheduler framework block layer (producer) I/O scheduler internal queue(s) enqueue functions dequeue functions low-level device driver (consumer)sort/merge prioritize external queue The relationship of I/O scheduler functions queue block driver elevator ll_merge_requests_fn() ll_front_merge_fn() ll_back_merge_fn() xxx_request_fn() elv_next_request() elv_add_request() elv_queue_empty() elv_remove_request() elv_completed_request() __make_request() block layer elv_merge() elv_may_queue() function calls generic_make_request() submit_bio() Description of elevator functions Type Description elv_merge Find a request in the request queue to be merged with a bio. The function’s return value indicate front merge, back merge or no merge. elv_add_request Add a new request to the request queue elv_may_queue Ask if the elevator allows enqueuing of a new request elv_remove_request Remove a request from the request queue elv_queue_empty Check if the request queue is empty elv_next_request Called by the device drivers to get next request from the request queue elv_set_request When a new request is allocated, this function is called to initialize elevator-specific variables elv_put_request When a request is to be freed, this function is called to free memory allocated for some elevator. elv_completed_request Called when a request is completed Most functions are just wrappers. The actual implementation are elevator-specific Flowchart of __make_request() elv_queue_empty() Check if the queue is empty get_request() Allocate a new request and put the bio into it Yes No elv_merge() Select a suitable request from the queue for merging with the bio Front merge (or back merge) No merge ll_front_merge_fn() (ll_back_merge_fn()) Merge bio into the front or the back of the request add_request() Add the new request into request queue Done attempt_front_merge() (attempt_back_merge()) Also, check if the resulting request can be merged again with the neighbor get_request_wait() Wait for some requests to become available Block queue full?Yes No Merge functions at request queue struct request_queue { struct list_head queue_head; struct elevator_queue *elevator; ... merge_request_fn *back_merge_fn; merge_request_fn *front_merge_fn; merge_requests_fn*merge_requests_fn; ... } A list of requests (external queue) Elevator queue of this request queue Pointers to merge functions: ll_back_merge_fn() : back merge a request and a bio ll_front_merge_fn() : front merge a request and a bio ll_merge_requests_fn() : merge two requests ll_xxx_fn() is the default set of functions for merge The structure of elevator type struct elevator_type { struct list_head list; struct elevator_ops ops; struct kobj_type *elevator_ktype; char elevator_name[ELV_NAME_MAX]; struct module *elevator_owner; }; A list of all available elevator types elevator functions the name of the elevator Each request queue is associated with its own elevator queue of certain type struct elevator_queue { struct elevator_ops *ops; void *elevator_data; struct kobject kobj; struct elevator_type *elevator_type; }; the private data of the elevator The structure of elevator operations struct elevator_ops { elevator_merge_fn *elevator_merge_fn; elevator_merged_fn *elevator_merged_fn; elevator_merge_req_fn *elevator_merge_req_fn; elevator_next_req_fn *elevator_next_req_fn; elevator_add_req_fn *elevator_add_req_fn; elevator_remove_req_fn *elevator_remove_req_fn; elevator_requeue_req_fn *elevator_requeue_req_fn; elevator_deactivate_req_fn *elevator_deactivate_req_fn; elevator_queue_empty_fn *elevator_queue_empty_fn; elevator_completed_req_fn *elevator_completed_req_fn; elevator_request_list_fn *elevator_former_req_fn; elevator_request_list_fn *elevator_latter_req_fn; elevator_set_req_fn *elevator_set_req_fn; elevator_put_req_fn *elevator_put_req_fn; elevator_may_queue_fn *elevator_may_queue_fn; elevator_init_fn *elevator_init_fn; elevator_exit_fn *elevator_exit_fn; }; These pointers point to the functions of a specific elevator Elevators in Linux 2.6 „ All elevator types are registered in a global linked list elv_list „ Request queues can change to a different type of elevator online „ This allows for adaptive I/O scheduling based on current workloads „ I/O schedulers available „ noop, deadline, CFQ, anticipatory NOOP I/O scheduler „ Suitable for truly random-access device, like flash memory card „ Requests in the queue are kept in FIFO order „ Only the last request added to the request queue will be tested for the possibility of a merge NOOP: Registration static struct elevator_type elevator_noop = { .ops = { .elevator_merge_fn = elevator_noop_merge, .elevator_merge_req_fn = elevator_noop_merge_requests, .elevator_next_req_fn = elevator_noop_next_request, .elevator_add_req_fn = elevator_noop_add_request, }, .elevator_name = "noop", .elevator_owner = THIS_MODULE, }; static int __init noop_init(void) { return elv_register(&elevator_noop); } static void __exit noop_exit(void) { elv_unregister(&elevator_noop); } module_init(noop_init); module_exit(noop_exit); This structure stores the name of the This structure stores the name of the noopnoop elevator and pointers to elevator and pointers to noopnoop functions. Use functions. Use elv_registerelv_register()() function to register the structure with function to register the structure with the the pluginplugin interfaces of the elevatorinterfaces of the elevator NOOP: add request and get next request static void elevator_noop_add_request(request_queue_t *q, struct request *rq, int where) { if (where == ELEVATOR_INSERT_FRONT) list_add(&rq->queuelist, &q->queue_head); else list_add_tail(&rq->queuelist, &q->queue_head); /* * new merges must not precede this barrier */ if (rq->flags & REQ_HARDBARRIER) q->last_merge = NULL; else if (!q->last_merge) q->last_merge = rq; } static struct request *elevator_noop_next_request(request_queue_t *q) { if (!list_empty(&q->queue_head)) return list_entry_rq(q->; return NULL; } Called by the device driver to get the next Called by the device driver to get the next request to be submitted. If the request queue request to be submitted. If the request queue is not empty, return the request at the head is not empty, return the request at the head of the queueof the queue Add a new request to Add a new request to the request queuethe request queue NOOP: request merge /* * See if we can find a request that this buffer can be coalesced with. */ static int elevator_noop_merge(request_queue_t *q, struct request **req, struct bio *bio) { int ret; ret = elv_try_last_merge(q, bio); if (ret != ELEVATOR_NO_MERGE) *req = q->last_merge; return ret; } static void elevator_noop_merge_requests(request_queue_t *q, struct request *req, struct request *next) { list_del_init(&next->queuelist); } Given a bio, find a adjacent Given a bio, find a adjacent request in the request queue to request in the request queue to be merged merged with. This function simply remove This function simply remove nextnext request from the request queue. It is request from the request queue. It is called after called after nextnext are merged into are merged into reqreq. . Deadline I/O scheduler „ Goal „ Reorder requests to improve I/O performance while simultaneously ensuring that no I/O request is being starved „ Favor reads over writes „ Each requests is associated with a expire time „ Read: 500ms, write 5sec „ Requests are inserted into „ A sorted-by-start-sector queue (two queues! for read and write) „ A FIFO list (two lists too!) sorted by expire time „ Normally, requests are pulled from sorted queues. However, if the request at the head of either FIFO queue expires, requests are still processed in sorted order but started from the first request in the FIFO queue Architecture view of Deadline I/O scheduler Read RB tree (sorted by start sector) Read FIFO lists (sorted by queue time) Write RB tree (sorted by start sector) Write FIFO lists (sorted by queue time) Request hash table (sorted by end sector) deadline_insert_request Dispatch queue deadline_dispatch_requests deadline_next_request Device driver „ The sorted queues are built on red-black trees „ For back merge purpose, requests are hashed into an array of list indexed by the end sector Deadline: dispatching requests 1. If [next_req] is in the batch (adjacent to previous request and batch count < 16), set it as [dispatch_req] and jump to step 5 2. Here, we are not in a batch. If there are read reqs and write is not starved, select read dir and jump to step 4 3. If there are write reqs, select write dir. Otherwise, return 0 4. If the first req in the fifo of the selected data direction expired, set it as [dispatch_req] and set batch count = 0. Otherwise, set [next_req] as [dispatch_req] 5. Increase batch count and dispatch the [dispatch_req]. 6. Search forward from the end sector of [dispatch_req] in the RB tree of selected dir. Set the next request as [next_req] Anticipatory scheduling Background Disk schedulers reorder available disk requests for • performance by seek optimization, • proportional resource allocation, etc. Any policy needs multiple outstanding requests to make good decisions! from With enough requests… issued by process A issued by process B E.g., Throughput = 21 MB/s (IBM Deskstar disk) seek time location on disk from With synchronous I/O… E.g., Throughput = 5 MB/s schedule issued by process A issued by process B forced! too late! forced! from Deceptive idleness Process A is about to issue next request. but Scheduler hastily assumes that process A has no further requests! from Proportional scheduler Allocate disk service in say 1:2 ratio: Deceptive idleness causes 1:1 allocation: BABA from Anticipatory scheduling Key idea: Sometimes wait for process whose request was last serviced. Keeps disk idle for short intervals. But with informed decisions, this: • Improves throughput • Achieves desired proportions from Cost-benefit analysis Balance expected benefits of waiting against cost of keeping disk idle. Tradeoffs sensitive to scheduling policy e.g., 1. seek optimizing scheduler 2. proportional scheduler from Statistics For each process, measure: 1. Expected median and 95percentile thinktime 2. Expected positioning time Median 95percentile Number of requests Thinktime last next from Benefit = best.positioning_time — next.positioning_time Cost = next.median_thinktime Waiting_duration = (Benefit > Cost) ? next.95percentile_thinktime : 0 from Cost-benefit analysis for seek optimizing scheduler best := best available request chosen by scheduler next := expected forthcoming request from process whose request was last serviced Proportional scheduler Costs and benefits are different. e.g., proportional scheduler: Wait for process whose request was last serviced, 1. if it has received less than its allocation, and 2. if it has thinktime below a threshold (e.g., 3ms) Waiting_duration = next.95percentile_thinktime from Prefetch Overlaps computation with I/O. Side-effect: avoids deceptive idleness! • Application-driven • Kernel-driven from Microbenchmark 0 5 10 15 20 25 Sequential Alternate Random within file Throughput (MB/s) Original Anticipatory no prefetch no prefetch no prefetch prefetch prefetch prefetch from Proportional scheduler 0 10 20 0102030 Experimental time (seconds) Service received (seconds) Original Anticipatory 0 30 60 90 120 Throughput (tps) Database benchmark: two databases, select queries from Work-conserving vs. non-work- conserving „ Work-conserving scheduler „ If the disk is idle or a request is completed, next request in the queue is scheduled immediately „ Non-work-conserving scheduler „ the disk stands idle in the face of nonempty queue „ Anticipatory scheduler are non-work- conserving Anticipatory I/O scheduler in Linux „ Based on deadline I/O scheduler „ Suitable for desktop, good interactive performance „ Design shortcomings „ Assume only 1 physical seeking head „ Bad for RAID devices „ Only 1 read request are dispatched to the disk controller at a time „ Bad for controller that supports TCQ „ Read anticipation assumes synchronous requests are issued by individual processes „ Bad for requests issued cooperatively by multiple processes „ Rough benefit-cost analysis „ Anticipate a better request if mean thinktime of the process < 6ms and mean seek distance of the process < seek distance of next request Anticipatory IO scheduler policy „ One-way elevator algorithm „ Limited backward seeks „ FIFO expiration times for reads and for writes „ When a requests expire, interrupt the current elevator sweep „ Read and write request batching „ Scheduler alternates dispatching read and write batches to the driver. The read (write) FIFO timeout values are tested only during read (write) batches. „ Read Anticipation „ At the end of each read request, the I/O scheduler examines its next candidate read request from its sorted read list and decide whether to wait for a “better request” I/O statistics for anticipatory scheduler „ Per request queue (as_data) „ The last sector of the last request „ Exit probability „ Probability a task will exit while being waited on „ Per process (as_io_context) „ Last request completion time „ Last request position „ Mean think time „ Mean seek distance Anticipation States „ ANTIC_OFF „ Not anticipating (normal operation) „ ANTIC_WAIT_REQ „ The last read has not yet completed „ ANTIC_WAIT_NEXT „ Currently anticipating a request vs last read (which has completed) „ ANTIC_FINISHED „ Anticipating but have found a candidate or timed out State transitions of request anticipation time ANTIC_OFF ANTIC_WAIT_REQ ANTIC_WAIT_NEXT ANTIC_FINISHED Driver ask for next req, but last read is not completed yet Anticipate next read after the last read completed. The anticipated request is found or anticipation timer expire. (ref. next slide as_antic_stop) Last read completes before driver ask for next req A close request from other process is enqueued,or a request from the anticipated process is submitted FIFO expired or a barrier request is submitted FIFO expired or a barrier request is submitted A request is dispatched Functions executed during the anticipation of requests as_completed_request executed when a request is completed. If the completed request is a read, record completion time and call as_antic_waitnext as_antic_waitnext Start a timer (expire time: 6ms) to wait the next read request. as_antic_stop Stop anticipation timer and scheduler a call to request_fn Called in 4 conditions: 1. FIFO queue expired 2. The anticipated process submit a read or write 3. The anticipated process exited 4. A close request is submitted from other process as_antic_timeout Timer expired. Schedule a call to request_fn time as_add_request executed when a new request is queued. If this is a read, update the corresponding process’s I/O statistics.Conditionally call as_antic_stop I/O statistics – thinktime & seek distance „ These statistics are associated with each process, but not with a specific I/O device „ The statistics will be a combination of I/O behavior from all actively-use devices (It seems bad!) „ Thinktime „ At enqueuing of a new read request, thinktime = current jiffies – completion time of last read request „ seek distance „ At enqueuing of a new read request, seek distance = |start sector of the new request – last request end sector| I/O statistics – average thinktime and seek distance „ Previous I/O history decays as new request are enqueued „ Fixed point arithmetic (1.0 == 1 << 8) tsamples ttotaltmean thinktimettotalttotal tsamplestsamples 128 8 2567 8 2567 += ×+×= +×= Mean thinktime of a process Mean seek distance of a process ssamples ssamplesstotal smean seekdiststotalstotal ssamplesssamples 2 8 2567 8 2567 + = ×+×= +×= Make a decision – Shall we anticipate a “better request”? FIFO expire? No Last request is a write? Anticipation state = ANTIC_FINISHED? No Next request is from the same process? Process anticipated on has exited No No Next request is a close read request from other process? No Anticipation timer expired? No No Mean thinktime > anticipation time? No Mean seek distance > seek distance of next request? No Wait for a better requestYes Yes Yes Yes Yes Yes Yes Yes Yes Dispatch next request Cooperative Anticipatory Scheduler „ Proposed in this paper: Enhancements to Linux I/O scheduler, OLS2005 „ The problems of anticipatory scheduler „ Anticipation works only when requests are issued by the same process „ Solution „ Keep anticipating even when the anticipated process has exited „ Cooperative exit probability: existence of cooperative processes related to dead processes CAS: Performance Evaluation Program 1: while true do dd if=/dev/zero of=file \ count=2048 bs=1M done Program 2: time cat 200mb-file > /dev/null Streaming writes and reads Streaming and chunk reads Program 1: while true do cat big-file > /dev/null done Program 2: time find . –type f –exec \ cat ‘{}’ ‘;’ > /dev/null Scheduler Execution time (sec) Throughput (MB/s) Deadline 129 25 AS 10 33 CAS 933 Scheduler Execution time (sec) Throughput (MB/s) Deadline 297 9 AS 4767 35 CAS 255 34 AS failed to anticipate chunk reads AS works too well for Program 1. Program 2 starved. CFQv2 (Complete Fair Queuing) I/O scheduler „ Goal „ Provide fair allocation of I/O bandwidth among all the initiators of I/O requests „ CFQ can be configured to provide fairness at per- process, per-process-group, per-user and per-user- group levels. „ Each initiator has its own request queue and CFQ services these queues round-robin „ Data writeback is usually performed by the pdflush kernel threads. That means, all data writes share the alloted I/O bandwidth of the pdflush threads Architecture view of CFQv2 tgid 1 queue tgid 2 queue tgid n queue queue hash by tgid . . . cfq_insert_request Round robin serving 1 request at a time device queue (sorted by sector) cfq_dispatch_requests Red-black tree (sorted by sector) Read FIFO lists (sorted by queue time) Write FIFO lists (sorted by queue time) References „ Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O, Sitaram Iyer, ACM SOSP’01 „ Enhancements to Linux I/O scheduling, Seetharami Seelam, OLS’05 „ Linux 2.6.12 kernel source „ Linux Kernel Development, 2nd edition, Robert Love, 2005




需要 5 金币 [ 分享pdf获得金币 ] 0 人已下载





下载需要 5 金币 [金币充值 ]
亲,您也可以通过 分享原创pdf 来获得金币奖励!