GIT 241215908147e699fd32f37d7a927af3c863fc04 git+ssh://master.kernel.org/pub/scm/linux/kernel/git/axboe/linux-2.6-block.git#for-akpm

commit
Author: Jens Axboe
Date:   Mon Feb 12 16:21:15 2007 +0100

    [PATCH] ll_rw_blk: improve plug merge ratio

    If we only look at the previously queued request, we miss a lot of
    merging opportunities. So loop over the list from the back, and bail
    when we find a merge candidate.

    This logic can be improved:

    - Potentially O(N) runtime; the likelihood of worst case behaviour is
      really small though, and N should not be a large number at that.

    - If we fail to merge with a request that was a merge candidate, we
      should also bail out. This happens for things like REQ_NOMERGE, when
      a request has been merged to the max.

    Signed-off-by: Jens Axboe

commit 155a903c180b8254018664186fd81fb0f768f236
Author: Jens Axboe
Date:   Mon Feb 5 14:07:45 2007 +0100

    plug: check for plugged process in schedule()

    Temporary work-around for now, although some parts of this do apply
    nicely for a variety of reasons. Sometimes it's a bug if we enter
    schedule() plugged, since we really should send off the requests and
    the caller should have used io_schedule() instead. But sometimes it
    happens inadvertently because the caller blocks for other reasons,
    e.g. calling functions that may block (memory allocation, for one).

    Signed-off-by: Jens Axboe

commit b9214e77f64dea430b96fa9e454bc0d5dcd1790d
Author: Thomas Gleixner
Date:   Wed Jan 17 12:01:07 2007 +1100

    raid1: wakeup fix

    With the plugging changes, raid1 hangs because it's missing a thread
    wakeup.

    Signed-off-by: Jens Axboe

commit 7f70d87e787a366a443e9dfbc516e55d3ef6f5e1
Author: Jens Axboe
Date:   Wed Jan 3 13:35:44 2007 +0100

    Make generic_file_buffered_write() plug

    An obvious candidate for explicit plugging, enable it.

    Signed-off-by: Jens Axboe

commit 05cf56a90e81ce4b79b53d907e36e0a7ff948d12
Author: Jens Axboe
Date:   Fri Feb 9 09:50:44 2007 +0100

    [PATCH] block: explicit plugging

    Nick writes:

    This is a patch to perform block device plugging explicitly in the
    submitting process context rather than implicitly by the block device.
    There are several advantages to plugging in process context over
    plugging by the block device:

    - Implicit plugging is only active when the queue empties, so any
      advantages are lost if there is parallel IO occurring. Not so with
      explicit plugging.

    - Implicit plugging relies on a timer, watermarks, and a kind-of-explicit
      directive in lock_page which directs plugging. These are heuristics
      and can cost performance by holding a block device idle longer than
      it should be. Explicit plugging avoids most of these issues by only
      holding the device idle when it is known that more requests will be
      submitted.

    - This lock_page directive uses a roundabout way to attempt to minimise
      the intrusiveness of plugging on the VM. In doing so, it gets
      needlessly complex: the VM really is in a good position to direct the
      block layer as to the nature of its requests, so there is no need to
      try to hide the fact.

    - Explicit plugging keeps a process-private queue of requests being
      held. This offers some advantages over immediately sending requests
      to the block device: firstly, merging can be attempted on requests in
      this list (currently only attempted on the head of the list) without
      taking any locks; secondly, when unplugging occurs, the requests can
      be delivered to the block device queue in a batch, so the lock
      acquisitions can be batched up (see the sketch below).
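    For readers following the series, the submitter-side calling convention
    that explicit plugging introduces looks roughly like the sketch below.
    blk_plug_current(), blk_unplug_current() and submit_bio() are the real
    interfaces (the first two are added later in this patch); the helper
    name and the bio array are purely illustrative.

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        /* Illustrative only: submit a batch of already-built bios. */
        static void submit_batch(struct bio **bios, int nr)
        {
                int i;

                /* Requests now collect on a process-private plug list */
                blk_plug_current();

                for (i = 0; i < nr; i++)
                        submit_bio(WRITE, bios[i]);

                /*
                 * Hand the whole private list to the driver queue in one
                 * go; back/front merging against the list has already been
                 * attempted without taking queue_lock.
                 */
                blk_unplug_current();
        }

    Each blk_plug_current() must be paired with a blk_unplug_current();
    blocking outside of io_schedule() while plugged is what the schedule()
    check in the commit above is meant to catch.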
    On a parallel tiobench benchmark, of the 800 000 calls to
    __make_request performed, this patch avoids 490 000 (62%) of the
    queue_lock acquisitions by early merging on the private plugged list.

    Signed-off-by: Nick Piggin

    Changes so far by me:

    - Don't invoke ->request_fn() in blk_queue_invalidate_tags
    - Fixup all filesystems for block_sync_page()
    - Add blk_delay_queue() to handle the old plugging-on-shortage usage
    - Unconditionally run replug_current_nested() in io_schedule()
    - Fixup queue start/stop
    - Fixup all the remaining drivers
    - Change the namespace (prefix the plug functions with blk_)
    - Fixup ext4
    - Dead code removal
    - Fixup blktrace plug/unplug notifications
    - __make_request() cleanups
    - bio_sync() fixups
    - Kill queue empty checking
    - Make barriers work again, using QRCU
    - Make blk_sync_queue() work again, reuse barrier SRCU handling
    - Tons of other fixes and improvements

    This patch needs more work and some dedicated testing.

    Signed-off-by: Jens Axboe

commit ceca0ff203e5884acd6abdbc1019a5ae54274415
Author: Paul E. McKenney
Date:   Fri Feb 9 09:44:47 2007 +0100

    [PATCH] qrcu: add documentation

    Signed-off-by: Paul E. McKenney
    Acked-by: Jens Axboe

commit 7b2a5c9729b15a4a594c27cc62efde18bc631c57
Author: Oleg Nesterov
Date:   Thu Dec 21 10:27:07 2006 +0100

    qrcu: add rcutorture test

    Add an rcutorture test for qrcu. Works for me!

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Josh Triplett
    Acked-by: Paul E. McKenney
    Acked-by: Jens Axboe

commit 8d8a582c7a9cf03414cc98b5b8e78ab490f8bde3
Author: Oleg Nesterov
Date:   Mon Dec 18 19:15:28 2006 +0100

    qrcu: "quick" srcu implementation

    Very much based on ideas, corrections, and patient explanations from
    Alan and Paul.

    The current srcu implementation is very good for readers: lock/unlock
    are extremely cheap. But for that reason it is not possible to avoid
    synchronize_sched() and polling in synchronize_srcu().

    Jens Axboe wrote:
    >
    > It works for me, but the overhead is still large. Before it would take
    > 8-12 jiffies for a synchronize_srcu() to complete without there actually
    > being any reader locks active, now it takes 2-3 jiffies. So it's
    > definitely faster, and as suspected the loss of two of three
    > synchronize_sched() cut down the overhead to a third.

    'qrcu' behaves the same as srcu but is optimized for writers. The fast
    path for synchronize_qrcu() is mutex_lock() + atomic_read() +
    mutex_unlock(). The slow path is __wait_event(), with no polling.
    However, the reader does an atomic inc/dec on lock/unlock, and the
    counters are not per-cpu.

    Also, unlike srcu, qrcu read lock/unlock can be used in interrupt
    context, and a 'qrcu_struct' can be compile-time initialized.
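    Since QRCU deliberately mirrors the SRCU interface (see the API
    correspondence table added to Documentation/RCU/checklist.txt below), a
    reader/updater pair looks roughly like the following sketch. The
    qrcu_struct instance, the protected pointer and the init placement are
    illustrative only, not taken from the patch.

        #include <linux/rcupdate.h>
        #include <linux/srcu.h>
        #include <linux/slab.h>

        static struct qrcu_struct my_qrcu; /* init_qrcu_struct() at setup */
        static void *shared;

        static void reader(void)
        {
                void *p;
                int idx;

                idx = qrcu_read_lock(&my_qrcu);  /* atomic inc; ok in irq context */
                p = rcu_dereference(shared);
                /* ... use p ... */
                qrcu_read_unlock(&my_qrcu, idx);
        }

        static void updater(void *newp)
        {
                void *old = shared;

                rcu_assign_pointer(shared, newp);
                synchronize_qrcu(&my_qrcu);      /* mutex + atomic_read fast path */
                kfree(old);
        }

    This is the same pattern the block layer uses below: a plugged process
    takes qrcu_read_lock() on the queue's qrcu_struct, and barrier handling
    calls synchronize_qrcu() to wait for those private plugs to drain.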
    See also (a long) discussion:
    http://marc.theaimsgroup.com/?t=116370857600003

    Signed-off-by: Oleg Nesterov
    Acked-by: Jens Axboe

 Documentation/RCU/checklist.txt | 13 +
 Documentation/RCU/rcu.txt | 6
 Documentation/RCU/torture.txt | 15 +
 Documentation/RCU/whatisRCU.txt | 3
 Documentation/block/biodoc.txt | 5
 block/as-iosched.c | 15 -
 block/cfq-iosched.c | 8 -
 block/deadline-iosched.c | 9 -
 block/elevator.c | 45 ---
 block/ll_rw_blk.c | 509 +++++++++++++++++++++------------
 block/noop-iosched.c | 8 -
 drivers/block/cciss.c | 6
 drivers/block/cpqarray.c | 3
 drivers/block/floppy.c | 1
 drivers/block/loop.c | 12 -
 drivers/block/pktcdvd.c | 2
 drivers/block/rd.c | 2
 drivers/block/umem.c | 16 -
 drivers/ide/ide-cd.c | 9 -
 drivers/ide/ide-io.c | 25 --
 drivers/md/bitmap.c | 1
 drivers/md/dm-emc.c | 2
 drivers/md/dm-table.c | 14 -
 drivers/md/dm.c | 18 -
 drivers/md/dm.h | 1
 drivers/md/linear.c | 14 -
 drivers/md/md.c | 3
 drivers/md/multipath.c | 32 --
 drivers/md/raid0.c | 17 -
 drivers/md/raid1.c | 72 +-----
 drivers/md/raid10.c | 73 +----
 drivers/md/raid5.c | 60 -----
 drivers/message/i2o/i2o_block.c | 6
 drivers/mmc/mmc_queue.c | 3
 drivers/s390/block/dasd.c | 3
 drivers/s390/char/tape_block.c | 1
 drivers/scsi/ide-scsi.c | 2
 drivers/scsi/scsi_lib.c | 46 ++--
 fs/adfs/inode.c | 1
 fs/affs/file.c | 2
 fs/befs/linuxvfs.c | 1
 fs/bfs/file.c | 1
 fs/block_dev.c | 2
 fs/buffer.c | 25 --
 fs/cifs/file.c | 2
 fs/direct-io.c | 7 -
 fs/ecryptfs/mmap.c | 23 --
 fs/efs/inode.c | 1
 fs/ext2/inode.c | 2
 fs/ext3/inode.c | 3
 fs/ext4/inode.c | 3
 fs/fat/inode.c | 1
 fs/freevxfs/vxfs_subr.c | 1
 fs/fuse/inode.c | 1
 fs/gfs2/ops_address.c | 1
 fs/hfs/inode.c | 2
 fs/hfsplus/inode.c | 2
 fs/hpfs/file.c | 1
 fs/isofs/inode.c | 1
 fs/jfs/inode.c | 1
 fs/jfs/jfs_metapage.c | 1
 fs/minix/inode.c | 1
 fs/ntfs/aops.c | 4
 fs/ntfs/compress.c | 2
 fs/ocfs2/aops.c | 1
 fs/ocfs2/cluster/heartbeat.c | 4
 fs/qnx4/inode.c | 1
 fs/reiserfs/inode.c | 1
 fs/sysv/itree.c | 1
 fs/udf/file.c | 1
 fs/udf/inode.c | 1
 fs/ufs/inode.c | 1
 fs/ufs/truncate.c | 2
 fs/xfs/linux-2.6/xfs_aops.c | 1
 fs/xfs/linux-2.6/xfs_buf.c | 13 -
 include/linux/backing-dev.h | 3
 include/linux/blkdev.h | 96 +++++--
 include/linux/buffer_head.h | 1
 include/linux/elevator.h | 8 -
 include/linux/fs.h | 1
 include/linux/pagemap.h | 12 -
 include/linux/raid/md.h | 1
 include/linux/sched.h | 1
 include/linux/srcu.h | 30 ++
 include/linux/swap.h | 2
 kernel/rcutorture.c | 71 +++++
 kernel/sched.c | 3
 kernel/srcu.c | 105 ++++++++
 mm/filemap.c | 66 +----
 mm/nommu.c | 4
 mm/page-writeback.c | 8 +
 mm/readahead.c | 11 -
 mm/shmem.c | 1
 mm/swap_state.c | 5
 mm/swapfile.c | 37 ---
 mm/vmscan.c | 6
 96 files changed, 686 insertions(+), 985 deletions(-)

diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index f4dffad..a31d84c 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -259,3 +259,16 @@ over a rather long period of time, but i
 	Note that, rcu_assign_pointer() and rcu_dereference() relate
 	to SRCU just as they do to other forms of RCU.
+
+14.	QRCU is very similar to SRCU, but features very fast grace-period
+	processing at the expense of heavier-weight read-side operations.
+ The correspondance between QRCU and SRCU is as follows: + + QRCU SRCU + + struct qrcu_struct struct srcu_struct + init_qrcu_struct() init_srcu_struct() + cleanup_qrcu_struct() cleanup_srcu_struct() + qrcu_read_lock() srcu_read_lock() + qrcu_read_unlock() srcu_read_unlock() + synchronize_qrcu() synchronize_srcu() diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index f84407c..ae1e54e 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt @@ -45,8 +45,10 @@ o How can I see where RCU is currently u Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh", - "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu", - "synchronize_net", and "synchronize_srcu". + "qrcu_read_lock", qrcu_read_unlock", "srcu_read_lock", + "srcu_read_unlock", "synchronize_rcu", "synchronize_qrcu", + "synchronize_net", "synchronize_srcu", rcu_assign_pointer(), + and rcu_dereference(). o What guidelines should I follow when writing code that uses RCU? diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt index 25a3c3f..2cb0a3b 100644 --- a/Documentation/RCU/torture.txt +++ b/Documentation/RCU/torture.txt @@ -35,7 +35,8 @@ nfakewriters This is the number of RCU f different numbers of writers running in parallel. nfakewriters defaults to 4, which provides enough parallelism to trigger special cases caused by multiple writers, such as - the synchronize_srcu() early return optimization. + the synchronize_srcu() and synchronize_qrcu() early return + optimizations. stat_interval The number of seconds between output of torture statistics (via printk()). Regardless of the interval, @@ -54,11 +55,13 @@ test_no_idle_hz Whether or not to test t idle CPUs. Boolean parameter, "1" to test, "0" otherwise. torture_type The type of RCU to test: "rcu" for the rcu_read_lock() API, - "rcu_sync" for rcu_read_lock() with synchronous reclamation, - "rcu_bh" for the rcu_read_lock_bh() API, "rcu_bh_sync" for - rcu_read_lock_bh() with synchronous reclamation, "srcu" for - the "srcu_read_lock()" API, and "sched" for the use of - preempt_disable() together with synchronize_sched(). + "rcu_sync" for rcu_read_lock() with synchronous + reclamation, "rcu_bh" for the rcu_read_lock_bh() API, + "rcu_bh_sync" for rcu_read_lock_bh() with synchronous + reclamation, "srcu" for the "srcu_read_lock()" API, + "qrcu" for the "qrcu_read_lock()" "quick grace period" + form of SRCU, and "sched" for the use of preempt_disable() + together with synchronize_sched(). verbose Enable debug printk()s. Default is disabled. diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index e0d6d99..e91650b 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt @@ -780,6 +780,8 @@ Markers for RCU read-side critical secti rcu_read_unlock_bh srcu_read_lock srcu_read_unlock + qrcu_read_lock + qrcu_read_unlock RCU pointer/list traversal: @@ -807,6 +809,7 @@ RCU grace period: synchronize_sched synchronize_rcu synchronize_srcu + synchronize_qrcu call_rcu call_rcu_bh diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt index 3adaace..c6370b8 100644 --- a/Documentation/block/biodoc.txt +++ b/Documentation/block/biodoc.txt @@ -962,11 +962,6 @@ elevator_dispatch_fn fills the dispatch elevator_add_req_fn called to add a new request into the scheduler -elevator_queue_empty_fn returns true if the merge queue is empty. 
- Drivers shouldn't use this, but rather check - if elv_next_request is NULL (without losing the - request if one exists!) - elevator_former_req_fn elevator_latter_req_fn These return the request before or after the one specified in disk sort order. Used by the diff --git a/block/as-iosched.c b/block/as-iosched.c index ef12627..654f742 100644 --- a/block/as-iosched.c +++ b/block/as-iosched.c @@ -1183,20 +1183,6 @@ static void as_deactivate_request(reques atomic_inc(&RQ_IOC(rq)->aic->nr_dispatched); } -/* - * as_queue_empty tells us if there are requests left in the device. It may - * not be the case that a driver can get the next request even if the queue - * is not empty - it is used in the block layer to check for plugging and - * merging opportunities - */ -static int as_queue_empty(request_queue_t *q) -{ - struct as_data *ad = q->elevator->elevator_data; - - return list_empty(&ad->fifo_list[REQ_ASYNC]) - && list_empty(&ad->fifo_list[REQ_SYNC]); -} - static int as_merge(request_queue_t *q, struct request **req, struct bio *bio) { @@ -1445,7 +1431,6 @@ static struct elevator_type iosched_as = .elevator_add_req_fn = as_add_request, .elevator_activate_req_fn = as_activate_request, .elevator_deactivate_req_fn = as_deactivate_request, - .elevator_queue_empty_fn = as_queue_empty, .elevator_completed_req_fn = as_completed_request, .elevator_former_req_fn = elv_rb_former_request, .elevator_latter_req_fn = elv_rb_latter_request, diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index b6491c0..8d55c5a 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -214,13 +214,6 @@ static inline void cfq_schedule_dispatch kblockd_schedule_work(&cfqd->unplug_work); } -static int cfq_queue_empty(request_queue_t *q) -{ - struct cfq_data *cfqd = q->elevator->elevator_data; - - return !cfqd->busy_queues; -} - static inline pid_t cfq_queue_pid(struct task_struct *task, int rw, int is_sync) { /* @@ -2172,7 +2165,6 @@ static struct elevator_type iosched_cfq .elevator_add_req_fn = cfq_insert_request, .elevator_activate_req_fn = cfq_activate_request, .elevator_deactivate_req_fn = cfq_deactivate_request, - .elevator_queue_empty_fn = cfq_queue_empty, .elevator_completed_req_fn = cfq_completed_request, .elevator_former_req_fn = elv_rb_former_request, .elevator_latter_req_fn = elv_rb_latter_request, diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c index 6d673e9..a0ca245 100644 --- a/block/deadline-iosched.c +++ b/block/deadline-iosched.c @@ -335,14 +335,6 @@ dispatch_request: return 1; } -static int deadline_queue_empty(request_queue_t *q) -{ - struct deadline_data *dd = q->elevator->elevator_data; - - return list_empty(&dd->fifo_list[WRITE]) - && list_empty(&dd->fifo_list[READ]); -} - static void deadline_exit_queue(elevator_t *e) { struct deadline_data *dd = e->elevator_data; @@ -455,7 +447,6 @@ static struct elevator_type iosched_dead .elevator_merge_req_fn = deadline_merged_requests, .elevator_dispatch_fn = deadline_dispatch_requests, .elevator_add_req_fn = deadline_add_request, - .elevator_queue_empty_fn = deadline_queue_empty, .elevator_former_req_fn = elv_rb_former_request, .elevator_latter_req_fn = elv_rb_latter_request, .elevator_init_fn = deadline_init_queue, diff --git a/block/elevator.c b/block/elevator.c index 25f6ef2..9e559bc 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -92,7 +92,7 @@ inline int elv_rq_merge_ok(struct reques } EXPORT_SYMBOL(elv_rq_merge_ok); -static inline int elv_try_merge(struct request *__rq, struct bio *bio) +int elv_try_merge(struct request 
*__rq, struct bio *bio) { int ret = ELEVATOR_NO_MERGE; @@ -549,7 +549,6 @@ void elv_insert(request_queue_t *q, stru { struct list_head *pos; unsigned ordseq; - int unplug_it = 1; blk_add_trace_rq(q, rq, BLK_TA_INSERT); @@ -576,7 +575,6 @@ void elv_insert(request_queue_t *q, stru * with anything. There's no point in delaying queue * processing. */ - blk_remove_plug(q); q->request_fn(q); break; @@ -606,12 +604,6 @@ void elv_insert(request_queue_t *q, stru */ rq->cmd_flags |= REQ_SOFTBARRIER; - /* - * Most requeues happen because of a busy condition, - * don't force unplug of the queue for that case. - */ - unplug_it = 0; - if (q->ordseq == 0) { list_add(&rq->queuelist, &q->queue_head); break; @@ -633,18 +625,9 @@ void elv_insert(request_queue_t *q, stru __FUNCTION__, where); BUG(); } - - if (unplug_it && blk_queue_plugged(q)) { - int nrq = q->rq.count[READ] + q->rq.count[WRITE] - - q->in_flight; - - if (nrq >= q->unplug_thresh) - __generic_unplug_device(q); - } } -void __elv_add_request(request_queue_t *q, struct request *rq, int where, - int plug) +void __elv_add_request(request_queue_t *q, struct request *rq, int where) { if (q->ordcolor) rq->cmd_flags |= REQ_ORDERED_COLOR; @@ -673,21 +656,17 @@ void __elv_add_request(request_queue_t * } else if (!(rq->cmd_flags & REQ_ELVPRIV) && where == ELEVATOR_INSERT_SORT) where = ELEVATOR_INSERT_BACK; - if (plug) - blk_plug_device(q); - elv_insert(q, rq, where); } EXPORT_SYMBOL(__elv_add_request); -void elv_add_request(request_queue_t *q, struct request *rq, int where, - int plug) +void elv_add_request(request_queue_t *q, struct request *rq, int where) { unsigned long flags; spin_lock_irqsave(q->queue_lock, flags); - __elv_add_request(q, rq, where, plug); + __elv_add_request(q, rq, where); spin_unlock_irqrestore(q->queue_lock, flags); } @@ -793,21 +772,6 @@ void elv_dequeue_request(request_queue_t EXPORT_SYMBOL(elv_dequeue_request); -int elv_queue_empty(request_queue_t *q) -{ - elevator_t *e = q->elevator; - - if (!list_empty(&q->queue_head)) - return 0; - - if (e->ops->elevator_queue_empty_fn) - return e->ops->elevator_queue_empty_fn(q); - - return 1; -} - -EXPORT_SYMBOL(elv_queue_empty); - struct request *elv_latter_request(request_queue_t *q, struct request *rq) { elevator_t *e = q->elevator; @@ -1037,7 +1001,6 @@ static int elevator_switch(request_queue elv_drain_elevator(q); while (q->rq.elvpriv) { - blk_remove_plug(q); q->request_fn(q); spin_unlock_irq(q->queue_lock); msleep(10); diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c index 38c293b..56223b0 100644 --- a/block/ll_rw_blk.c +++ b/block/ll_rw_blk.c @@ -30,14 +30,14 @@ #include #include #include #include +#include /* * for max sense size */ #include -static void blk_unplug_work(struct work_struct *work); -static void blk_unplug_timeout(unsigned long data); +static void blk_delay_work(struct work_struct *work); static void drive_stat_acct(struct request *rq, int nr_sectors, int new_io); static void init_request_from_bio(struct request *req, struct bio *bio); static int __make_request(request_queue_t *q, struct bio *bio); @@ -217,15 +217,7 @@ void blk_queue_make_request(request_queu blk_queue_congestion_threshold(q); q->nr_batching = BLK_BATCH_REQ; - q->unplug_thresh = 4; /* hmm */ - q->unplug_delay = (3 * HZ) / 1000; /* 3 milliseconds */ - if (q->unplug_delay == 0) - q->unplug_delay = 1; - - INIT_WORK(&q->unplug_work, blk_unplug_work); - - q->unplug_timer.function = blk_unplug_timeout; - q->unplug_timer.data = (unsigned long)q; + INIT_DELAYED_WORK(&q->delay_work, blk_delay_work); /* * by 
default assume old behaviour and bounce for any highmem page @@ -1175,7 +1167,7 @@ void blk_queue_invalidate_tags(request_q blk_queue_end_tag(q, rq); rq->cmd_flags &= ~REQ_STARTED; - __elv_add_request(q, rq, ELEVATOR_INSERT_BACK, 0); + __elv_add_request(q, rq, ELEVATOR_INSERT_BACK); } } @@ -1530,119 +1522,74 @@ static int ll_merge_requests_fn(request_ return 1; } -/* - * "plug" the device if there are no outstanding requests: this will - * force the transfer to start only after we have put all the requests - * on the list. - * - * This is called with interrupts off and no requests on the queue and - * with the queue lock held. - */ -void blk_plug_device(request_queue_t *q) -{ - WARN_ON(!irqs_disabled()); - - /* - * don't plug a stopped queue, it must be paired with blk_start_queue() - * which will restart the queueing - */ - if (blk_queue_stopped(q)) - return; - - if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) { - mod_timer(&q->unplug_timer, jiffies + q->unplug_delay); - blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG); - } -} - -EXPORT_SYMBOL(blk_plug_device); - -/* - * remove the queue from the plugged list, if present. called with - * queue lock held and interrupts disabled. - */ -int blk_remove_plug(request_queue_t *q) +static void blk_delay_work(struct work_struct *work) { - WARN_ON(!irqs_disabled()); - - if (!test_and_clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) - return 0; + request_queue_t *q; - del_timer(&q->unplug_timer); - return 1; + q = container_of(work, request_queue_t, delay_work.work); + spin_lock_irq(q->queue_lock); + q->request_fn(q); + spin_unlock_irq(q->queue_lock); } -EXPORT_SYMBOL(blk_remove_plug); - /* - * remove the plug and let it rip.. + * Make sure that plugs that were pending when this function was entered, + * are now complete and requests pushed to the queue. */ -void __generic_unplug_device(request_queue_t *q) +static inline void queue_sync_plugs(request_queue_t *q) { - if (unlikely(blk_queue_stopped(q))) - return; - - if (!blk_remove_plug(q)) - return; + /* + * If the current process is plugged and has barriers submitted, + * we will livelock if we don't unplug first. + */ + blk_replug_current_nested(); - q->request_fn(q); + synchronize_qrcu(&q->qrcu); } -EXPORT_SYMBOL(__generic_unplug_device); /** - * generic_unplug_device - fire a request queue + * blk_delay_queue - restart queueing after defined interval * @q: The &request_queue_t in question + * @delay: delay in msecs * * Description: - * Linux uses plugging to build bigger requests queues before letting - * the device have at them. If a queue is plugged, the I/O scheduler - * is still adding and merging requests on the queue. Once the queue - * gets unplugged, the request_fn defined for the queue is invoked and - * transfers started. - **/ -void generic_unplug_device(request_queue_t *q) + * Sometimes queueing needs to be postponed for a little while, to allow + * resources to come back. This function will make sure that queueing is + * restarted around the specified time. 
+ * + */ +void blk_delay_queue(request_queue_t *q, unsigned long msecs) { - spin_lock_irq(q->queue_lock); - __generic_unplug_device(q); - spin_unlock_irq(q->queue_lock); + schedule_delayed_work(&q->delay_work, msecs_to_jiffies(msecs)); } -EXPORT_SYMBOL(generic_unplug_device); +EXPORT_SYMBOL(blk_delay_queue); -static void blk_backing_dev_unplug(struct backing_dev_info *bdi, - struct page *page) +static void __blk_run_queue(request_queue_t *q) { - request_queue_t *q = bdi->unplug_io_data; - /* - * devices don't necessarily have an ->unplug_fn defined + * Only recurse once to avoid overrunning the stack, let the unplug + * handling reinvoke the handler shortly if we already got there. */ - if (q->unplug_fn) { - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, - q->rq.count[READ] + q->rq.count[WRITE]); - - q->unplug_fn(q); - } -} - -static void blk_unplug_work(struct work_struct *work) -{ - request_queue_t *q = container_of(work, request_queue_t, unplug_work); - - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, - q->rq.count[READ] + q->rq.count[WRITE]); - - q->unplug_fn(q); + if (!test_and_set_bit(QUEUE_FLAG_REENTER, &q->queue_flags)) { + q->request_fn(q); + clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags); + } else + queue_delayed_work(kblockd_workqueue, &q->delay_work, 0); } -static void blk_unplug_timeout(unsigned long data) +/** + * blk_run_queue - run a single device queue + * @q: The queue to run + */ +void blk_run_queue(struct request_queue *q) { - request_queue_t *q = (request_queue_t *)data; - - blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_TIMER, NULL, - q->rq.count[READ] + q->rq.count[WRITE]); + unsigned long flags; - kblockd_schedule_work(&q->unplug_work); + spin_lock_irqsave(q->queue_lock, flags); + __blk_run_queue(q); + spin_unlock_irqrestore(q->queue_lock, flags); } +EXPORT_SYMBOL(blk_run_queue); /** * blk_start_queue - restart a previously stopped queue @@ -1658,18 +1605,7 @@ void blk_start_queue(request_queue_t *q) WARN_ON(!irqs_disabled()); clear_bit(QUEUE_FLAG_STOPPED, &q->queue_flags); - - /* - * one level of recursion is ok and is much faster than kicking - * the unplug handling - */ - if (!test_and_set_bit(QUEUE_FLAG_REENTER, &q->queue_flags)) { - q->request_fn(q); - clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags); - } else { - blk_plug_device(q); - kblockd_schedule_work(&q->unplug_work); - } + __blk_run_queue(q); } EXPORT_SYMBOL(blk_start_queue); @@ -1690,7 +1626,7 @@ EXPORT_SYMBOL(blk_start_queue); **/ void blk_stop_queue(request_queue_t *q) { - blk_remove_plug(q); + cancel_delayed_work(&q->delay_work); set_bit(QUEUE_FLAG_STOPPED, &q->queue_flags); } EXPORT_SYMBOL(blk_stop_queue); @@ -1711,41 +1647,13 @@ EXPORT_SYMBOL(blk_stop_queue); */ void blk_sync_queue(struct request_queue *q) { - del_timer_sync(&q->unplug_timer); + queue_sync_plugs(q); + cancel_delayed_work(&q->delay_work); kblockd_flush(); } EXPORT_SYMBOL(blk_sync_queue); /** - * blk_run_queue - run a single device queue - * @q: The queue to run - */ -void blk_run_queue(struct request_queue *q) -{ - unsigned long flags; - - spin_lock_irqsave(q->queue_lock, flags); - blk_remove_plug(q); - - /* - * Only recurse once to avoid overrunning the stack, let the unplug - * handling reinvoke the handler shortly if we already got there. 
- */ - if (!elv_queue_empty(q)) { - if (!test_and_set_bit(QUEUE_FLAG_REENTER, &q->queue_flags)) { - q->request_fn(q); - clear_bit(QUEUE_FLAG_REENTER, &q->queue_flags); - } else { - blk_plug_device(q); - kblockd_schedule_work(&q->unplug_work); - } - } - - spin_unlock_irqrestore(q->queue_lock, flags); -} -EXPORT_SYMBOL(blk_run_queue); - -/** * blk_cleanup_queue: - release a &request_queue_t when it is no longer needed * @kobj: the kobj belonging of the request queue to be released * @@ -1775,6 +1683,8 @@ static void blk_release_queue(struct kob blk_trace_shutdown(q); + cleanup_qrcu_struct(&q->qrcu); + kmem_cache_free(requestq_cachep, q); } @@ -1834,15 +1744,16 @@ request_queue_t *blk_alloc_queue_node(gf return NULL; memset(q, 0, sizeof(*q)); - init_timer(&q->unplug_timer); + + if (init_qrcu_struct(&q->qrcu)) { + kmem_cache_free(requestq_cachep, q); + return NULL; + } snprintf(q->kobj.name, KOBJ_NAME_LEN, "%s", "queue"); q->kobj.ktype = &queue_ktype; kobject_init(&q->kobj); - q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug; - q->backing_dev_info.unplug_io_data = q; - mutex_init(&q->sysfs_lock); return q; @@ -1898,6 +1809,7 @@ blk_init_queue_node(request_fn_proc *rfn q->node = node_id; if (blk_init_free_list(q)) { + cleanup_qrcu_struct(&q->qrcu); kmem_cache_free(requestq_cachep, q); return NULL; } @@ -1913,7 +1825,6 @@ blk_init_queue_node(request_fn_proc *rfn q->request_fn = rfn; q->prep_rq_fn = NULL; - q->unplug_fn = generic_unplug_device; q->queue_flags = (1 << QUEUE_FLAG_CLUSTER); q->queue_lock = lock; @@ -2155,8 +2066,8 @@ out: } /* - * No available requests for this queue, unplug the device and wait for some - * requests to become available. + * No available requests for this queue, wait for some requests to become + * available. * * Called with q->queue_lock held, and returns with it unlocked. 
*/ @@ -2181,7 +2092,6 @@ static struct request *get_request_wait( blk_add_trace_generic(q, bio, rw, BLK_TA_SLEEPRQ); - __generic_unplug_device(q); spin_unlock_irq(q->queue_lock); io_schedule(); @@ -2234,10 +2144,7 @@ EXPORT_SYMBOL(blk_get_request); */ void blk_start_queueing(request_queue_t *q) { - if (!blk_queue_plugged(q)) - q->request_fn(q); - else - __generic_unplug_device(q); + q->request_fn(q); } EXPORT_SYMBOL(blk_start_queueing); @@ -2307,7 +2214,7 @@ void blk_insert_request(request_queue_t blk_queue_end_tag(q, rq); drive_stat_acct(rq, rq->nr_sectors, 1); - __elv_add_request(q, rq, where, 0); + __elv_add_request(q, rq, where); blk_start_queueing(q); spin_unlock_irqrestore(q->queue_lock, flags); } @@ -2585,8 +2492,8 @@ void blk_execute_rq_nowait(request_queue rq->end_io = done; WARN_ON(irqs_disabled()); spin_lock_irq(q->queue_lock); - __elv_add_request(q, rq, where, 1); - __generic_unplug_device(q); + __elv_add_request(q, rq, where); + q->request_fn(q); spin_unlock_irq(q->queue_lock); } EXPORT_SYMBOL_GPL(blk_execute_rq_nowait); @@ -2669,7 +2576,7 @@ static void drive_stat_acct(struct reque return; if (!new_io) { - __disk_stat_inc(rq->rq_disk, merges[rw]); + disk_stat_inc(rq->rq_disk, merges[rw]); } else { disk_round_stats(rq->rq_disk); rq->rq_disk->in_flight++; @@ -2689,7 +2596,7 @@ static inline void add_request(request_q * elevator indicated where it wants this request to be * inserted at elevator_merge time */ - __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0); + __elv_add_request(q, req, ELEVATOR_INSERT_SORT); } /* @@ -2868,6 +2775,87 @@ static inline int attempt_front_merge(re return 0; } +static int bio_attempt_back_merge(request_queue_t *q, struct request *req, + struct bio *bio) +{ + BUG_ON(!rq_mergeable(req)); + + if (!ll_back_merge_fn(q, req, bio)) + return 0; + + blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); + + req->biotail->bi_next = bio; + req->biotail = bio; + req->nr_sectors = req->hard_nr_sectors += bio_sectors(bio); + req->ioprio = ioprio_best(req->ioprio, bio_prio(bio)); + + drive_stat_acct(req, bio_sectors(bio), 0); + return 1; +} + +static int bio_attempt_front_merge(request_queue_t *q, struct request *req, + struct bio *bio) +{ + int cur_nr_sectors; + sector_t sector; + + BUG_ON(!rq_mergeable(req)); + + if (!ll_front_merge_fn(q, req, bio)) + return 0; + + blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); + + sector = bio->bi_sector; + cur_nr_sectors = bio_cur_sectors(bio); + + bio->bi_next = req->bio; + req->bio = bio; + + /* + * may not be valid. if the low level driver said + * it didn't need a bounce buffer then it better + * not touch req->buffer either... + */ + req->buffer = bio_data(bio); + req->current_nr_sectors = cur_nr_sectors; + req->hard_cur_sectors = cur_nr_sectors; + req->sector = req->hard_sector = sector; + req->nr_sectors = req->hard_nr_sectors += bio_sectors(bio); + req->ioprio = ioprio_best(req->ioprio, bio_prio(bio)); + + drive_stat_acct(req, bio_sectors(bio), 0); + return 1; +} + +/* + * Attempts to merge with the plugged list in the current process. 
+ */ +static int check_plug_merge(request_queue_t *q, struct io_context *ioc, + struct bio *bio) +{ + struct request *rq; + + if (!ioc || !ioc->plugged) + return 1; + + list_for_each_entry_reverse(rq, &ioc->plugged_list, queuelist) { + int el_ret; + + el_ret = elv_try_merge(rq, bio); + if (el_ret == ELEVATOR_BACK_MERGE) { + if (bio_attempt_back_merge(q, rq, bio)) + return 0; + } else if (el_ret == ELEVATOR_FRONT_MERGE) { + if (bio_attempt_front_merge(q, rq, bio)) + return 0; + } + } + + return 1; +} + static void init_request_from_bio(struct request *req, struct bio *bio) { req->cmd_type = REQ_TYPE_FS; @@ -2904,13 +2892,24 @@ static void init_request_from_bio(struct static int __make_request(request_queue_t *q, struct bio *bio) { - struct request *req; - int el_ret, nr_sectors, barrier, err; - const unsigned short prio = bio_prio(bio); + struct io_context *ioc = current_io_context(GFP_ATOMIC, q->node); const int sync = bio_sync(bio); - int rw_flags; + int el_ret, rw_flags; + struct request *req; - nr_sectors = bio_sectors(bio); + if (unlikely(bio_barrier(bio))) { + if (q->next_ordered == QUEUE_ORDERED_NONE) + goto end_io_eopnotsupp; + + /* + * When we encounter a barrier, we need to make sure that + * the processes that have this queue privately plugged have + * finished. + */ + queue_sync_plugs(q); + spin_lock_irq(q->queue_lock); + goto get_rq; + } /* * low level driver can indicate that it wants pages above a @@ -2919,66 +2918,27 @@ static int __make_request(request_queue_ */ blk_queue_bounce(q, &bio); - barrier = bio_barrier(bio); - if (unlikely(barrier) && (q->next_ordered == QUEUE_ORDERED_NONE)) { - err = -EOPNOTSUPP; - goto end_io; - } + /* + * Check if we can merge with the plugged list before grabbing + * any locks. + */ + if (!check_plug_merge(q, ioc, bio)) + goto out; spin_lock_irq(q->queue_lock); - - if (unlikely(barrier) || elv_queue_empty(q)) - goto get_rq; - el_ret = elv_merge(q, &req, bio); - switch (el_ret) { - case ELEVATOR_BACK_MERGE: - BUG_ON(!rq_mergeable(req)); - - if (!ll_back_merge_fn(q, req, bio)) - break; - - blk_add_trace_bio(q, bio, BLK_TA_BACKMERGE); - - req->biotail->bi_next = bio; - req->biotail = bio; - req->nr_sectors = req->hard_nr_sectors += nr_sectors; - req->ioprio = ioprio_best(req->ioprio, prio); - drive_stat_acct(req, nr_sectors, 0); + if (el_ret == ELEVATOR_BACK_MERGE) { + if (bio_attempt_back_merge(q, req, bio)) { if (!attempt_back_merge(q, req)) elv_merged_request(q, req, el_ret); - goto out; - - case ELEVATOR_FRONT_MERGE: - BUG_ON(!rq_mergeable(req)); - - if (!ll_front_merge_fn(q, req, bio)) - break; - - blk_add_trace_bio(q, bio, BLK_TA_FRONTMERGE); - - bio->bi_next = req->bio; - req->bio = bio; - - /* - * may not be valid. if the low level driver said - * it didn't need a bounce buffer then it better - * not touch req->buffer either... - */ - req->buffer = bio_data(bio); - req->current_nr_sectors = bio_cur_sectors(bio); - req->hard_cur_sectors = req->current_nr_sectors; - req->sector = req->hard_sector = bio->bi_sector; - req->nr_sectors = req->hard_nr_sectors += nr_sectors; - req->ioprio = ioprio_best(req->ioprio, prio); - drive_stat_acct(req, nr_sectors, 0); + goto out_unlock; + } + } else if (el_ret == ELEVATOR_FRONT_MERGE) { + if (bio_attempt_front_merge(q, req, bio)) { if (!attempt_front_merge(q, req)) elv_merged_request(q, req, el_ret); - goto out; - - /* ELV_NO_MERGE: elevator says don't/can't merge. 
*/ - default: - ; + goto out_unlock; + } } get_rq: @@ -3005,19 +2965,29 @@ get_rq: */ init_request_from_bio(req, bio); - spin_lock_irq(q->queue_lock); - if (elv_queue_empty(q)) - blk_plug_device(q); - add_request(q, req); -out: - if (sync) - __generic_unplug_device(q); + if (!ioc || !ioc->plugged || bio_sync(bio)) { + spin_lock_irq(q->queue_lock); + add_request(q, req); + q->request_fn(q); +out_unlock: + spin_unlock_irq(q->queue_lock); + } else { + if (ioc->plugged_queue && ioc->plugged_queue != q) + blk_replug_current_nested(); + if (!ioc->plugged_queue) { + blk_add_trace_generic(q, NULL, ioc->plugged, BLK_TA_PLUG); + ioc->plugged_queue = q; + } + if (bio_data_dir(bio) == WRITE && ioc->qrcu_idx == -1) + ioc->qrcu_idx = qrcu_read_lock(&q->qrcu); + list_add_tail(&req->queuelist, &ioc->plugged_list); + } - spin_unlock_irq(q->queue_lock); +out: return 0; -end_io: - bio_endio(bio, nr_sectors << 9, err); +end_io_eopnotsupp: + bio_endio(bio, bio->bi_size, -EOPNOTSUPP); return 0; } @@ -3715,6 +3685,82 @@ void exit_io_context(void) put_io_context(ioc); } +/** + * blk_plug_current - plug the current task + * + * Description: + * When a process is about to submit multiple pieces of IO, it may first + * call this function to enable better batching of these requests. Any + * request queued after this call will be stored in a lockless manner + * local to the process. The IO will not be sent to the IO scheduler until + * a call to @blk_unplug_current is made. A call to this function must + * be paired with a later call to @blk_unplug_current. + * + **/ +void blk_plug_current(void) +{ + struct io_context *ioc = current_io_context(GFP_NOIO, -1); + + if (ioc) + ioc->plugged++; +} +EXPORT_SYMBOL(blk_plug_current); + +/** + * blk_unplug_current - unplug the current task + * + * Description: + * Also see description of @blk_plug_current. When a process is done + * submitting IO, it calls this function to send those requests off to + * the IO scheduler. + * + **/ +void blk_unplug_current(void) +{ + struct io_context *ioc = current->io_context; + struct request *req; + request_queue_t *q; + int nr_unplug; + + if (!ioc) + return; + + BUG_ON(!ioc->plugged); + ioc->plugged--; + if (ioc->plugged) + return; + + q = ioc->plugged_queue; + ioc->plugged_queue = NULL; + if (list_empty(&ioc->plugged_list)) + goto out; + + nr_unplug = 0; + spin_lock_irq(q->queue_lock); + do { + req = list_entry_rq(ioc->plugged_list.next); + list_del_init(&req->queuelist); + add_request(q, req); + nr_unplug++; + } while (!list_empty(&ioc->plugged_list)); + + q->request_fn(q); + spin_unlock_irq(q->queue_lock); + + blk_add_trace_pdu_int(q, BLK_TA_UNPLUG_IO, NULL, nr_unplug); + +out: + /* + * This process is now done with the plug and has committed it's + * pending io, notify QRCU. + */ + if (ioc->qrcu_idx != -1) { + qrcu_read_unlock(&q->qrcu, ioc->qrcu_idx); + ioc->qrcu_idx = -1; + } +} +EXPORT_SYMBOL(blk_unplug_current); + /* * If the current task has no IO context then create one and initialise it. * Otherwise, return its existing IO context. @@ -3734,13 +3780,12 @@ static struct io_context *current_io_con ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node); if (ret) { + memset(ret, 0, sizeof(*ret)); atomic_set(&ret->refcount, 1); ret->task = current; - ret->ioprio_changed = 0; + ret->qrcu_idx = -1; + INIT_LIST_HEAD(&ret->plugged_list); ret->last_waited = jiffies; /* doesn't matter... 
*/ - ret->nr_batch_requests = 0; /* because this is 0 */ - ret->aic = NULL; - ret->cic_root.rb_node = NULL; /* make sure set_task_ioprio() sees the settings above */ smp_wmb(); tsk->io_context = ret; diff --git a/block/noop-iosched.c b/block/noop-iosched.c index 1c3de2b..8e0e94f 100644 --- a/block/noop-iosched.c +++ b/block/noop-iosched.c @@ -38,13 +38,6 @@ static void noop_add_request(request_que list_add_tail(&rq->queuelist, &nd->queue); } -static int noop_queue_empty(request_queue_t *q) -{ - struct noop_data *nd = q->elevator->elevator_data; - - return list_empty(&nd->queue); -} - static struct request * noop_former_request(request_queue_t *q, struct request *rq) { @@ -89,7 +82,6 @@ static struct elevator_type elevator_noo .elevator_merge_req_fn = noop_merged_requests, .elevator_dispatch_fn = noop_dispatch, .elevator_add_req_fn = noop_add_request, - .elevator_queue_empty_fn = noop_queue_empty, .elevator_former_req_fn = noop_former_request, .elevator_latter_req_fn = noop_latter_request, .elevator_init_fn = noop_init_queue, diff --git a/drivers/block/cciss.c b/drivers/block/cciss.c index 05dfe35..e60c254 100644 --- a/drivers/block/cciss.c +++ b/drivers/block/cciss.c @@ -2456,12 +2456,6 @@ static void do_cciss_request(request_que drive_info_struct *drv; int i, dir; - /* We call start_io here in case there is a command waiting on the - * queue that has not been sent. - */ - if (blk_queue_plugged(q)) - goto startio; - queue: creq = elv_next_request(q); if (!creq) diff --git a/drivers/block/cpqarray.c b/drivers/block/cpqarray.c index b94cd1c..11c7fb5 100644 --- a/drivers/block/cpqarray.c +++ b/drivers/block/cpqarray.c @@ -894,9 +894,6 @@ static void do_ida_request(request_queue struct scatterlist tmp_sg[SG_MAX]; int i, dir, seg; - if (blk_queue_plugged(q)) - goto startio; - queue_next: creq = elv_next_request(q); if (!creq) diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c index 3f1b382..79e8eb8 100644 --- a/drivers/block/floppy.c +++ b/drivers/block/floppy.c @@ -3858,7 +3858,6 @@ static int __floppy_read_block_0(struct bio.bi_end_io = floppy_rb0_complete; submit_bio(READ, &bio); - generic_unplug_device(bdev_get_queue(bdev)); process_fd_request(); wait_for_completion(&complete); diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 6b5b642..97cc0a6 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -535,17 +535,6 @@ out: return 0; } -/* - * kick off io on the underlying address space - */ -static void loop_unplug(request_queue_t *q) -{ - struct loop_device *lo = q->queuedata; - - clear_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags); - blk_run_address_space(lo->lo_backing_file->f_mapping); -} - struct switch_request { struct file *file; struct completion wait; @@ -810,7 +799,6 @@ static int loop_set_fd(struct loop_devic */ blk_queue_make_request(lo->lo_queue, loop_make_request); lo->lo_queue->queuedata = lo; - lo->lo_queue->unplug_fn = loop_unplug; set_capacity(disks[lo->lo_number], size); bd_set_size(bdev, size << 9); diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index 93fb6ed..b4a0a23 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -1633,8 +1633,6 @@ static int kcdrwd(void *foobar) min_sleep_time = pkt->sleep_time; } - generic_unplug_device(bdev_get_queue(pd->bdev)); - VPRINTK("kcdrwd: sleeping\n"); residue = schedule_timeout(min_sleep_time); VPRINTK("kcdrwd: wake up\n"); diff --git a/drivers/block/rd.c b/drivers/block/rd.c index 485aa87..38db306 100644 --- a/drivers/block/rd.c +++ b/drivers/block/rd.c @@ -325,7 +325,6 @@ 
static int rd_ioctl(struct inode *inode, static struct backing_dev_info rd_backing_dev_info = { .ra_pages = 0, /* No readahead */ .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK | BDI_CAP_MAP_COPY, - .unplug_io_fn = default_unplug_io_fn, }; /* @@ -336,7 +335,6 @@ static struct backing_dev_info rd_backin static struct backing_dev_info rd_file_backing_dev_info = { .ra_pages = 0, /* No readahead */ .capabilities = BDI_CAP_MAP_COPY, /* Does contribute to dirty memory */ - .unplug_io_fn = default_unplug_io_fn, }; static int rd_open(struct inode *inode, struct file *filp) diff --git a/drivers/block/umem.c b/drivers/block/umem.c index dff3766..4e26262 100644 --- a/drivers/block/umem.c +++ b/drivers/block/umem.c @@ -273,8 +273,7 @@ static void dump_dmastat(struct cardinfo * * Whenever IO on the active page completes, the Ready page is activated * and the ex-Active page is clean out and made Ready. - * Otherwise the Ready page is only activated when it becomes full, or - * when mm_unplug_device is called via the unplug_io_fn. + * Otherwise the Ready page is only activated when it becomes full. * * If a request arrives while both pages a full, it is queued, and b_rdev is * overloaded to record whether it was a read or a write. @@ -364,17 +363,6 @@ static inline void reset_page(struct mm_ page->biotail = & page->bio; } -static void mm_unplug_device(request_queue_t *q) -{ - struct cardinfo *card = q->queuedata; - unsigned long flags; - - spin_lock_irqsave(&card->lock, flags); - if (blk_remove_plug(q)) - activate(card); - spin_unlock_irqrestore(&card->lock, flags); -} - /* * If there is room on Ready page, take * one bh off list and add it. @@ -559,7 +547,6 @@ static int mm_make_request(request_queue *card->biotail = bio; bio->bi_next = NULL; card->biotail = &bio->bi_next; - blk_plug_device(q); spin_unlock_irq(&card->lock); return 0; @@ -976,7 +963,6 @@ #endif blk_queue_make_request(card->queue, mm_make_request); card->queue->queuedata = card; - card->queue->unplug_fn = mm_unplug_device; tasklet_init(&card->tasklet, process_page, (unsigned long)card); diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c index 5969cec..212e29a 100644 --- a/drivers/ide/ide-cd.c +++ b/drivers/ide/ide-cd.c @@ -792,15 +792,11 @@ static int cdrom_decode_status(ide_drive if (time_after(jiffies, info->write_timeout)) do_end_request = 1; else { - unsigned long flags; - /* * take a breather relying on the * unplug timer to kick us again */ - spin_lock_irqsave(&ide_lock, flags); - blk_plug_device(drive->queue); - spin_unlock_irqrestore(&ide_lock,flags); + blk_delay_queue(drive->queue, 1); return 1; } } @@ -3142,9 +3138,6 @@ int ide_cdrom_setup (ide_drive_t *drive) blk_queue_prep_rq(drive->queue, ide_cdrom_prep_fn); blk_queue_dma_alignment(drive->queue, 31); - drive->queue->unplug_delay = (1 * HZ) / 1000; - if (!drive->queue->unplug_delay) - drive->queue->unplug_delay = 1; drive->special.all = 0; diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c index 2614f41..88df1d2 100644 --- a/drivers/ide/ide-io.c +++ b/drivers/ide/ide-io.c @@ -1094,26 +1094,16 @@ repeat: * though that is 3 requests, it must be seen as a single transaction. 
* we must not preempt this drive until that is complete */ - if (blk_queue_flushing(drive->queue)) { - /* - * small race where queue could get replugged during - * the 3-request flush cycle, just yank the plug since - * we want it to finish asap - */ - blk_remove_plug(drive->queue); + if (blk_queue_flushing(drive->queue)) return drive; - } do { if ((!drive->sleeping || time_after_eq(jiffies, drive->sleep)) - && !elv_queue_empty(drive->queue)) { + && elv_next_request(drive->queue)) { if (!best || (drive->sleeping && (!best->sleeping || time_before(drive->sleep, best->sleep))) || (!best->sleeping && time_before(WAKEUP(drive), WAKEUP(best)))) - { - if (!blk_queue_plugged(drive->queue)) - best = drive; - } + best = drive; } } while ((drive = drive->next) != hwgroup->drive); if (best && best->nice1 && !best->sleeping && best != hwgroup->drive && best->service_time > WAIT_MIN_SLEEP) { @@ -1245,11 +1235,6 @@ #endif drive->sleeping = 0; drive->service_start = jiffies; - if (blk_queue_plugged(drive->queue)) { - printk(KERN_ERR "ide: huh? queue was plugged!\n"); - break; - } - /* * we know that the queue isn't empty, but this can happen * if the q->prep_rq_fn() decides to kill a request @@ -1278,7 +1263,7 @@ #endif */ if (drive->blocked && !blk_pm_request(rq) && !(rq->cmd_flags & REQ_PREEMPT)) { drive = drive->next ? drive->next : hwgroup->drive; - if (loops++ < 4 && !blk_queue_plugged(drive->queue)) + if (loops++ < 4) goto again; /* We clear busy, there should be no pending ATA command at this point. */ hwgroup->busy = 0; @@ -1744,7 +1729,7 @@ int ide_do_drive_cmd (ide_drive_t *drive where = ELEVATOR_INSERT_FRONT; rq->cmd_flags |= REQ_PREEMPT; } - __elv_add_request(drive->queue, rq, where, 0); + __elv_add_request(drive->queue, rq, where); ide_do_request(hwgroup, IDE_NO_IRQ); spin_unlock_irqrestore(&ide_lock, flags); diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c index 5554ada..2ef0377 100644 --- a/drivers/md/bitmap.c +++ b/drivers/md/bitmap.c @@ -1180,7 +1180,6 @@ int bitmap_startwrite(struct bitmap *bit case 0: bitmap_file_set_bit(bitmap, offset); bitmap_count_page(bitmap,offset, 1); - blk_plug_device(bitmap->mddev->queue); /* fall through */ case 1: *bmc = 2; diff --git a/drivers/md/dm-emc.c b/drivers/md/dm-emc.c index 265c467..25d632e 100644 --- a/drivers/md/dm-emc.c +++ b/drivers/md/dm-emc.c @@ -215,7 +215,7 @@ static void emc_pg_init(struct hw_handle } DMINFO("emc_pg_init: sending switch-over command"); - elv_add_request(q, rq, ELEVATOR_INSERT_FRONT, 1); + elv_add_request(q, rq, ELEVATOR_INSERT_FRONT); return; fail_path: diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index 05befa9..b2c6195 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c @@ -984,19 +984,6 @@ int dm_table_any_congested(struct dm_tab return r; } -void dm_table_unplug_all(struct dm_table *t) -{ - struct list_head *d, *devices = dm_table_get_devices(t); - - for (d = devices->next; d != devices; d = d->next) { - struct dm_dev *dd = list_entry(d, struct dm_dev, list); - request_queue_t *q = bdev_get_queue(dd->bdev); - - if (q->unplug_fn) - q->unplug_fn(q); - } -} - int dm_table_flush_all(struct dm_table *t) { struct list_head *d, *devices = dm_table_get_devices(t); @@ -1040,5 +1027,4 @@ EXPORT_SYMBOL(dm_table_get_mode); EXPORT_SYMBOL(dm_table_get_md); EXPORT_SYMBOL(dm_table_put); EXPORT_SYMBOL(dm_table_get); -EXPORT_SYMBOL(dm_table_unplug_all); EXPORT_SYMBOL(dm_table_flush_all); diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 3668b17..c7c457a 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c 
@@ -854,17 +854,6 @@ static int dm_flush_all(request_queue_t return ret; } -static void dm_unplug_all(request_queue_t *q) -{ - struct mapped_device *md = q->queuedata; - struct dm_table *map = dm_get_table(md); - - if (map) { - dm_table_unplug_all(map); - dm_table_put(map); - } -} - static int dm_any_congested(void *congested_data, int bdi_bits) { int r; @@ -1001,7 +990,6 @@ static struct mapped_device *alloc_dev(i md->queue->backing_dev_info.congested_data = md; blk_queue_make_request(md->queue, dm_request); blk_queue_bounce_limit(md->queue, BLK_BOUNCE_ANY); - md->queue->unplug_fn = dm_unplug_all; md->queue->issue_flush_fn = dm_flush_all; md->io_pool = mempool_create_slab_pool(MIN_IOS, _io_cache); @@ -1376,10 +1364,6 @@ int dm_suspend(struct mapped_device *md, add_wait_queue(&md->wait, &wait); up_write(&md->io_lock); - /* unplug */ - if (map) - dm_table_unplug_all(map); - /* * Then we wait for the already mapped ios to * complete. @@ -1489,8 +1473,6 @@ int dm_resume(struct mapped_device *md) clear_bit(DMF_SUSPENDED, &md->flags); - dm_table_unplug_all(map); - kobject_uevent(&md->disk->kobj, KOBJ_CHANGE); r = 0; diff --git a/drivers/md/dm.h b/drivers/md/dm.h index 2f796b1..b10af9c 100644 --- a/drivers/md/dm.h +++ b/drivers/md/dm.h @@ -78,7 +78,6 @@ void dm_table_presuspend_targets(struct void dm_table_postsuspend_targets(struct dm_table *t); int dm_table_resume_targets(struct dm_table *t); int dm_table_any_congested(struct dm_table *t, int bdi_bits); -void dm_table_unplug_all(struct dm_table *t); int dm_table_flush_all(struct dm_table *t); /*----------------------------------------------------------------- diff --git a/drivers/md/linear.c b/drivers/md/linear.c index c625ddb..c787706 100644 --- a/drivers/md/linear.c +++ b/drivers/md/linear.c @@ -79,19 +79,6 @@ static int linear_mergeable_bvec(request return maxsectors << 9; } -static void linear_unplug(request_queue_t *q) -{ - mddev_t *mddev = q->queuedata; - linear_conf_t *conf = mddev_to_conf(mddev); - int i; - - for (i=0; i < mddev->raid_disks; i++) { - request_queue_t *r_queue = bdev_get_queue(conf->disks[i].rdev->bdev); - if (r_queue->unplug_fn) - r_queue->unplug_fn(r_queue); - } -} - static int linear_issue_flush(request_queue_t *q, struct gendisk *disk, sector_t *error_sector) { @@ -280,7 +267,6 @@ static int linear_run (mddev_t *mddev) mddev->array_size = conf->array_size; blk_queue_merge_bvec(mddev->queue, linear_mergeable_bvec); - mddev->queue->unplug_fn = linear_unplug; mddev->queue->issue_flush_fn = linear_issue_flush; mddev->queue->backing_dev_info.congested_fn = linear_congested; mddev->queue->backing_dev_info.congested_data = mddev; diff --git a/drivers/md/md.c b/drivers/md/md.c index 05febfd..d90d704 100644 --- a/drivers/md/md.c +++ b/drivers/md/md.c @@ -5282,7 +5282,6 @@ void md_do_sync(mddev_t *mddev) * about not overloading the IO subsystem. 
(things like an * e2fsck being done on the RAID array should execute fast) */ - mddev->queue->unplug_fn(mddev->queue); cond_resched(); currspeed = ((unsigned long)(io_sectors-mddev->resync_mark_cnt))/2 @@ -5301,8 +5300,6 @@ void md_do_sync(mddev_t *mddev) * this also signals 'finished resyncing' to md_stop */ out: - mddev->queue->unplug_fn(mddev->queue); - wait_event(mddev->recovery_wait, !atomic_read(&mddev->recovery_active)); /* tell personality that we are finished */ diff --git a/drivers/md/multipath.c b/drivers/md/multipath.c index 14da37f..65d7ca0 100644 --- a/drivers/md/multipath.c +++ b/drivers/md/multipath.c @@ -115,37 +115,6 @@ static int multipath_end_request(struct return 0; } -static void unplug_slaves(mddev_t *mddev) -{ - multipath_conf_t *conf = mddev_to_conf(mddev); - int i; - - rcu_read_lock(); - for (i=0; iraid_disks; i++) { - mdk_rdev_t *rdev = rcu_dereference(conf->multipaths[i].rdev); - if (rdev && !test_bit(Faulty, &rdev->flags) - && atomic_read(&rdev->nr_pending)) { - request_queue_t *r_queue = bdev_get_queue(rdev->bdev); - - atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); - - if (r_queue->unplug_fn) - r_queue->unplug_fn(r_queue); - - rdev_dec_pending(rdev, mddev); - rcu_read_lock(); - } - } - rcu_read_unlock(); -} - -static void multipath_unplug(request_queue_t *q) -{ - unplug_slaves(q->queuedata); -} - - static int multipath_make_request (request_queue_t *q, struct bio * bio) { mddev_t *mddev = q->queuedata; @@ -531,7 +500,6 @@ static int multipath_run (mddev_t *mddev */ mddev->array_size = mddev->size; - mddev->queue->unplug_fn = multipath_unplug; mddev->queue->issue_flush_fn = multipath_issue_flush; mddev->queue->backing_dev_info.congested_fn = multipath_congested; mddev->queue->backing_dev_info.congested_data = mddev; diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c index dfe3214..23b8b2a 100644 --- a/drivers/md/raid0.c +++ b/drivers/md/raid0.c @@ -25,21 +25,6 @@ #define MAJOR_NR MD_MAJOR #define MD_DRIVER #define MD_PERSONALITY -static void raid0_unplug(request_queue_t *q) -{ - mddev_t *mddev = q->queuedata; - raid0_conf_t *conf = mddev_to_conf(mddev); - mdk_rdev_t **devlist = conf->strip_zone[0].dev; - int i; - - for (i=0; iraid_disks; i++) { - request_queue_t *r_queue = bdev_get_queue(devlist[i]->bdev); - - if (r_queue->unplug_fn) - r_queue->unplug_fn(r_queue); - } -} - static int raid0_issue_flush(request_queue_t *q, struct gendisk *disk, sector_t *error_sector) { @@ -248,8 +233,6 @@ static int create_strip_zones (mddev_t * conf->hash_spacing = sz; } - mddev->queue->unplug_fn = raid0_unplug; - mddev->queue->issue_flush_fn = raid0_issue_flush; mddev->queue->backing_dev_info.congested_fn = raid0_congested; mddev->queue->backing_dev_info.congested_data = mddev; diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index 97ee870..5cb244d 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -48,23 +48,16 @@ #endif #define NR_RAID1_BIOS 256 -static void unplug_slaves(mddev_t *mddev); - static void allow_barrier(conf_t *conf); static void lower_barrier(conf_t *conf); static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data) { struct pool_info *pi = data; - r1bio_t *r1_bio; int size = offsetof(r1bio_t, bios[pi->raid_disks]); /* allocate a r1bio with room for raid_disks entries in the bios array */ - r1_bio = kzalloc(size, gfp_flags); - if (!r1_bio) - unplug_slaves(pi->mddev); - - return r1_bio; + return kzalloc(size, gfp_flags); } static void r1bio_pool_free(void *r1_bio, void *data) @@ -87,10 +80,8 @@ static void * r1buf_pool_alloc(gfp_t gfp int i, 
j; r1_bio = r1bio_pool_alloc(gfp_flags, pi); - if (!r1_bio) { - unplug_slaves(pi->mddev); + if (!r1_bio) return NULL; - } /* * Allocate bios : 1 for reading, n-1 for writing @@ -539,38 +530,6 @@ static int read_balance(conf_t *conf, r1 return new_disk; } -static void unplug_slaves(mddev_t *mddev) -{ - conf_t *conf = mddev_to_conf(mddev); - int i; - - rcu_read_lock(); - for (i=0; iraid_disks; i++) { - mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev); - if (rdev && !test_bit(Faulty, &rdev->flags) && atomic_read(&rdev->nr_pending)) { - request_queue_t *r_queue = bdev_get_queue(rdev->bdev); - - atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); - - if (r_queue->unplug_fn) - r_queue->unplug_fn(r_queue); - - rdev_dec_pending(rdev, mddev); - rcu_read_lock(); - } - } - rcu_read_unlock(); -} - -static void raid1_unplug(request_queue_t *q) -{ - mddev_t *mddev = q->queuedata; - - unplug_slaves(mddev); - md_wakeup_thread(mddev->thread); -} - static int raid1_issue_flush(request_queue_t *q, struct gendisk *disk, sector_t *error_sector) { @@ -656,8 +615,7 @@ static void raise_barrier(conf_t *conf) /* Wait until no block IO is waiting */ wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, - conf->resync_lock, - raid1_unplug(conf->mddev->queue)); + conf->resync_lock, blk_replug_current_nested()); /* block any new IO from starting */ conf->barrier++; @@ -665,8 +623,7 @@ static void raise_barrier(conf_t *conf) /* No wait for all pending IO to complete */ wait_event_lock_irq(conf->wait_barrier, !conf->nr_pending && conf->barrier < RESYNC_DEPTH, - conf->resync_lock, - raid1_unplug(conf->mddev->queue)); + conf->resync_lock, blk_replug_current_nested()); spin_unlock_irq(&conf->resync_lock); } @@ -687,7 +644,7 @@ static void wait_barrier(conf_t *conf) conf->nr_waiting++; wait_event_lock_irq(conf->wait_barrier, !conf->barrier, conf->resync_lock, - raid1_unplug(conf->mddev->queue)); + blk_replug_current_nested()); conf->nr_waiting--; } conf->nr_pending++; @@ -715,8 +672,7 @@ static void freeze_array(conf_t *conf) conf->nr_waiting++; wait_event_lock_irq(conf->wait_barrier, conf->barrier+conf->nr_pending == conf->nr_queued+2, - conf->resync_lock, - raid1_unplug(conf->mddev->queue)); + conf->resync_lock, blk_replug_current_nested()); spin_unlock_irq(&conf->resync_lock); } static void unfreeze_array(conf_t *conf) @@ -939,10 +895,9 @@ #endif bio_list_merge(&conf->pending_bio_list, &bl); bio_list_init(&bl); - blk_plug_device(mddev->queue); spin_unlock_irqrestore(&conf->device_lock, flags); - if (do_sync) + if (do_sync || !bitmap) md_wakeup_thread(mddev->thread); #if 0 while ((bio = bio_list_pop(&bl)) != NULL) @@ -1500,7 +1455,6 @@ static void raid1d(mddev_t *mddev) unsigned long flags; conf_t *conf = mddev_to_conf(mddev); struct list_head *head = &conf->retry_list; - int unplug=0; mdk_rdev_t *rdev; md_check_recovery(mddev); @@ -1511,7 +1465,6 @@ static void raid1d(mddev_t *mddev) if (conf->pending_bio_list.head) { bio = bio_list_get(&conf->pending_bio_list); - blk_remove_plug(mddev->queue); spin_unlock_irqrestore(&conf->device_lock, flags); /* flush any pending bitmap writes to disk before proceeding w/ I/O */ if (bitmap_unplug(mddev->bitmap) != 0) @@ -1523,8 +1476,6 @@ static void raid1d(mddev_t *mddev) generic_make_request(bio); bio = next; } - unplug = 1; - continue; } @@ -1537,10 +1488,9 @@ static void raid1d(mddev_t *mddev) mddev = r1_bio->mddev; conf = mddev_to_conf(mddev); - if (test_bit(R1BIO_IsSync, &r1_bio->state)) { + if (test_bit(R1BIO_IsSync, &r1_bio->state)) sync_request_write(mddev, 
r1_bio); - unplug = 1; - } else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) { + else if (test_bit(R1BIO_BarrierRetry, &r1_bio->state)) { /* some requests in the r1bio were BIO_RW_BARRIER * requests which failed with -EOPNOTSUPP. Hohumm.. * Better resubmit without the barrier. @@ -1620,14 +1570,11 @@ static void raid1d(mddev_t *mddev) bio->bi_end_io = raid1_end_read_request; bio->bi_rw = READ | do_sync; bio->bi_private = r1_bio; - unplug = 1; generic_make_request(bio); } } } spin_unlock_irqrestore(&conf->device_lock, flags); - if (unplug) - unplug_slaves(mddev); } @@ -2000,7 +1947,6 @@ static int run(mddev_t *mddev) */ mddev->array_size = mddev->size; - mddev->queue->unplug_fn = raid1_unplug; mddev->queue->issue_flush_fn = raid1_issue_flush; mddev->queue->backing_dev_info.congested_fn = raid1_congested; mddev->queue->backing_dev_info.congested_data = mddev; diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index a9401c0..bfb3154 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -52,23 +52,16 @@ #include */ #define NR_RAID10_BIOS 256 -static void unplug_slaves(mddev_t *mddev); - static void allow_barrier(conf_t *conf); static void lower_barrier(conf_t *conf); static void * r10bio_pool_alloc(gfp_t gfp_flags, void *data) { conf_t *conf = data; - r10bio_t *r10_bio; int size = offsetof(struct r10bio_s, devs[conf->copies]); /* allocate a r10bio with room for raid_disks entries in the bios array */ - r10_bio = kzalloc(size, gfp_flags); - if (!r10_bio) - unplug_slaves(conf->mddev); - - return r10_bio; + return kzalloc(size, gfp_flags); } static void r10bio_pool_free(void *r10_bio, void *data) @@ -99,10 +92,8 @@ static void * r10buf_pool_alloc(gfp_t gf int nalloc; r10_bio = r10bio_pool_alloc(gfp_flags, conf); - if (!r10_bio) { - unplug_slaves(conf->mddev); + if (!r10_bio) return NULL; - } if (test_bit(MD_RECOVERY_SYNC, &conf->mddev->recovery)) nalloc = conf->copies; /* resync */ @@ -586,38 +577,6 @@ rb_out: return disk; } -static void unplug_slaves(mddev_t *mddev) -{ - conf_t *conf = mddev_to_conf(mddev); - int i; - - rcu_read_lock(); - for (i=0; iraid_disks; i++) { - mdk_rdev_t *rdev = rcu_dereference(conf->mirrors[i].rdev); - if (rdev && !test_bit(Faulty, &rdev->flags) && atomic_read(&rdev->nr_pending)) { - request_queue_t *r_queue = bdev_get_queue(rdev->bdev); - - atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); - - if (r_queue->unplug_fn) - r_queue->unplug_fn(r_queue); - - rdev_dec_pending(rdev, mddev); - rcu_read_lock(); - } - } - rcu_read_unlock(); -} - -static void raid10_unplug(request_queue_t *q) -{ - mddev_t *mddev = q->queuedata; - - unplug_slaves(q->queuedata); - md_wakeup_thread(mddev->thread); -} - static int raid10_issue_flush(request_queue_t *q, struct gendisk *disk, sector_t *error_sector) { @@ -698,8 +657,7 @@ static void raise_barrier(conf_t *conf, /* Wait until no block IO is waiting (unless 'force') */ wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting, - conf->resync_lock, - raid10_unplug(conf->mddev->queue)); + conf->resync_lock, blk_replug_current_nested()); /* block any new IO from starting */ conf->barrier++; @@ -707,8 +665,7 @@ static void raise_barrier(conf_t *conf, /* No wait for all pending IO to complete */ wait_event_lock_irq(conf->wait_barrier, !conf->nr_pending && conf->barrier < RESYNC_DEPTH, - conf->resync_lock, - raid10_unplug(conf->mddev->queue)); + conf->resync_lock, blk_replug_current_nested()); spin_unlock_irq(&conf->resync_lock); } @@ -729,7 +686,7 @@ static void wait_barrier(conf_t *conf) conf->nr_waiting++; 
wait_event_lock_irq(conf->wait_barrier, !conf->barrier, conf->resync_lock, - raid10_unplug(conf->mddev->queue)); + blk_replug_current_nested()); conf->nr_waiting--; } conf->nr_pending++; @@ -757,8 +714,7 @@ static void freeze_array(conf_t *conf) conf->nr_waiting++; wait_event_lock_irq(conf->wait_barrier, conf->barrier+conf->nr_pending == conf->nr_queued+2, - conf->resync_lock, - raid10_unplug(conf->mddev->queue)); + conf->resync_lock, blk_replug_current_nested()); spin_unlock_irq(&conf->resync_lock); } @@ -920,7 +876,6 @@ static int make_request(request_queue_t bitmap_startwrite(mddev->bitmap, bio->bi_sector, r10_bio->sectors, 0); spin_lock_irqsave(&conf->device_lock, flags); bio_list_merge(&conf->pending_bio_list, &bl); - blk_plug_device(mddev->queue); spin_unlock_irqrestore(&conf->device_lock, flags); if (do_sync) @@ -1496,7 +1451,6 @@ static void raid10d(mddev_t *mddev) unsigned long flags; conf_t *conf = mddev_to_conf(mddev); struct list_head *head = &conf->retry_list; - int unplug=0; mdk_rdev_t *rdev; md_check_recovery(mddev); @@ -1507,7 +1461,6 @@ static void raid10d(mddev_t *mddev) if (conf->pending_bio_list.head) { bio = bio_list_get(&conf->pending_bio_list); - blk_remove_plug(mddev->queue); spin_unlock_irqrestore(&conf->device_lock, flags); /* flush any pending bitmap writes to disk before proceeding w/ I/O */ if (bitmap_unplug(mddev->bitmap) != 0) @@ -1519,8 +1472,6 @@ static void raid10d(mddev_t *mddev) generic_make_request(bio); bio = next; } - unplug = 1; - continue; } @@ -1533,13 +1484,11 @@ static void raid10d(mddev_t *mddev) mddev = r10_bio->mddev; conf = mddev_to_conf(mddev); - if (test_bit(R10BIO_IsSync, &r10_bio->state)) { + if (test_bit(R10BIO_IsSync, &r10_bio->state)) sync_request_write(mddev, r10_bio); - unplug = 1; - } else if (test_bit(R10BIO_IsRecover, &r10_bio->state)) { + else if (test_bit(R10BIO_IsRecover, &r10_bio->state)) recovery_request_write(mddev, r10_bio); - unplug = 1; - } else { + else { int mirror; /* we got a read error. Maybe the drive is bad. Maybe just * the block and we can fix it. 
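The raid1 and raid10 barrier paths above show the common conversion for code that used to kick a per-array unplug callback before sleeping: the wait loop now flushes the sleeping task's own plug list via blk_replug_current_nested(). A minimal sketch of the converted pattern, reduced from the raise_barrier() hunks above (conf_t, resync_lock and wait_barrier are the existing md names; nothing new is introduced here):

	static void example_raise_barrier(conf_t *conf)
	{
		spin_lock_irq(&conf->resync_lock);

		/*
		 * Flush any requests this task is still holding on its
		 * private plug list before sleeping; the old code had to
		 * call the array's unplug_fn here to get the same effect.
		 */
		wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
				    conf->resync_lock,
				    blk_replug_current_nested());

		/* block any new IO from starting */
		conf->barrier++;

		spin_unlock_irq(&conf->resync_lock);
	}
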
@@ -1582,14 +1531,11 @@ static void raid10d(mddev_t *mddev) bio->bi_rw = READ | do_sync; bio->bi_private = r10_bio; bio->bi_end_io = raid10_end_read_request; - unplug = 1; generic_make_request(bio); } } } spin_unlock_irqrestore(&conf->device_lock, flags); - if (unplug) - unplug_slaves(mddev); } @@ -2117,7 +2063,6 @@ static int run(mddev_t *mddev) mddev->array_size = size/2; mddev->resync_max_sectors = size; - mddev->queue->unplug_fn = raid10_unplug; mddev->queue->issue_flush_fn = raid10_issue_flush; mddev->queue->backing_dev_info.congested_fn = raid10_congested; mddev->queue->backing_dev_info.congested_data = mddev; diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 11c3d7b..badbc09 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -114,11 +114,9 @@ static void __release_stripe(raid5_conf_ if (test_bit(STRIPE_HANDLE, &sh->state)) { if (test_bit(STRIPE_DELAYED, &sh->state)) { list_add_tail(&sh->lru, &conf->delayed_list); - blk_plug_device(conf->mddev->queue); } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && sh->bm_seq - conf->seq_write > 0) { list_add_tail(&sh->lru, &conf->bitmap_list); - blk_plug_device(conf->mddev->queue); } else { clear_bit(STRIPE_BIT_DELAY, &sh->state); list_add_tail(&sh->lru, &conf->handle_list); @@ -268,9 +266,6 @@ static struct stripe_head *__find_stripe return NULL; } -static void unplug_slaves(mddev_t *mddev); -static void raid5_unplug_device(request_queue_t *q); - static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks, int pd_idx, int noblock) { @@ -298,8 +293,7 @@ static struct stripe_head *get_active_st < (conf->max_nr_stripes *3/4) || !conf->inactive_blocked), conf->device_lock, - raid5_unplug_device(conf->mddev->queue) - ); + blk_replug_current_nested()); conf->inactive_blocked = 0; } else init_stripe(sh, sector, pd_idx, disks); @@ -445,8 +439,7 @@ static int resize_stripes(raid5_conf_t * wait_event_lock_irq(conf->wait_for_stripe, !list_empty(&conf->inactive_list), conf->device_lock, - unplug_slaves(conf->mddev) - ); + blk_replug_current_nested()); osh = get_free_stripe(conf); spin_unlock_irq(&conf->device_lock); atomic_set(&nsh->count, 1); @@ -2467,49 +2460,6 @@ static void activate_bit_delay(raid5_con } } -static void unplug_slaves(mddev_t *mddev) -{ - raid5_conf_t *conf = mddev_to_conf(mddev); - int i; - - rcu_read_lock(); - for (i=0; iraid_disks; i++) { - mdk_rdev_t *rdev = rcu_dereference(conf->disks[i].rdev); - if (rdev && !test_bit(Faulty, &rdev->flags) && atomic_read(&rdev->nr_pending)) { - request_queue_t *r_queue = bdev_get_queue(rdev->bdev); - - atomic_inc(&rdev->nr_pending); - rcu_read_unlock(); - - if (r_queue->unplug_fn) - r_queue->unplug_fn(r_queue); - - rdev_dec_pending(rdev, mddev); - rcu_read_lock(); - } - } - rcu_read_unlock(); -} - -static void raid5_unplug_device(request_queue_t *q) -{ - mddev_t *mddev = q->queuedata; - raid5_conf_t *conf = mddev_to_conf(mddev); - unsigned long flags; - - spin_lock_irqsave(&conf->device_lock, flags); - - if (blk_remove_plug(q)) { - conf->seq_flush++; - raid5_activate_delayed(conf); - } - md_wakeup_thread(mddev->thread); - - spin_unlock_irqrestore(&conf->device_lock, flags); - - unplug_slaves(mddev); -} - static int raid5_issue_flush(request_queue_t *q, struct gendisk *disk, sector_t *error_sector) { @@ -2868,7 +2818,6 @@ static int make_request(request_queue_t * add failed due to overlap. 
Flush everything * and wait a while */ - raid5_unplug_device(mddev->queue); release_stripe(sh); schedule(); goto retry; @@ -3038,7 +2987,6 @@ static inline sector_t sync_request(mdde if (sector_nr >= max_sector) { /* just being told to finish up .. nothing much to do */ - unplug_slaves(mddev); if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) { end_reshape(conf); return 0; @@ -3216,7 +3164,6 @@ static void raid5d (mddev_t *mddev) if (list_empty(&conf->handle_list) && atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD && - !blk_queue_plugged(mddev->queue) && !list_empty(&conf->delayed_list)) raid5_activate_delayed(conf); @@ -3251,8 +3198,6 @@ static void raid5d (mddev_t *mddev) spin_unlock_irq(&conf->device_lock); - unplug_slaves(mddev); - PRINTK("--- raid5d inactive\n"); } @@ -3557,7 +3502,6 @@ static int run(mddev_t *mddev) /* Ok, everything is just fine now */ sysfs_create_group(&mddev->kobj, &raid5_attrs_group); - mddev->queue->unplug_fn = raid5_unplug_device; mddev->queue->issue_flush_fn = raid5_issue_flush; mddev->queue->backing_dev_info.congested_fn = raid5_congested; mddev->queue->backing_dev_info.congested_data = mddev; diff --git a/drivers/message/i2o/i2o_block.c b/drivers/message/i2o/i2o_block.c index da9859f..9bf21ec 100644 --- a/drivers/message/i2o/i2o_block.c +++ b/drivers/message/i2o/i2o_block.c @@ -919,11 +919,7 @@ static void i2o_block_request_fn(struct { struct request *req; - while (!blk_queue_plugged(q)) { - req = elv_next_request(q); - if (!req) - break; - + while ((req = elv_next_request(q)) != NULL) { if (blk_fs_request(req)) { struct i2o_block_delayed_request *dreq; struct i2o_block_request *ireq = req->special; diff --git a/drivers/mmc/mmc_queue.c b/drivers/mmc/mmc_queue.c index c27e426..564afb3 100644 --- a/drivers/mmc/mmc_queue.c +++ b/drivers/mmc/mmc_queue.c @@ -72,8 +72,7 @@ static int mmc_queue_thread(void *d) spin_lock_irq(q->queue_lock); set_current_state(TASK_INTERRUPTIBLE); - if (!blk_queue_plugged(q)) - req = elv_next_request(q); + req = elv_next_request(q); mq->req = req; spin_unlock_irq(q->queue_lock); diff --git a/drivers/s390/block/dasd.c b/drivers/s390/block/dasd.c index eb5dc62..b7784b0 100644 --- a/drivers/s390/block/dasd.c +++ b/drivers/s390/block/dasd.c @@ -1209,8 +1209,7 @@ __dasd_process_blk_queue(struct dasd_dev list_for_each_entry(cqr, &device->ccw_queue, list) if (cqr->status == DASD_CQR_QUEUED) nr_queued++; - while (!blk_queue_plugged(queue) && - elv_next_request(queue) && + while (elv_next_request(queue) && nr_queued < DASD_CHANQ_MAX_SIZE) { req = elv_next_request(queue); diff --git a/drivers/s390/char/tape_block.c b/drivers/s390/char/tape_block.c index dd0ecae..6704ec9 100644 --- a/drivers/s390/char/tape_block.c +++ b/drivers/s390/char/tape_block.c @@ -169,7 +169,6 @@ tapeblock_requeue(struct work_struct *wo spin_lock(&device->blk_data.request_queue_lock); while ( - !blk_queue_plugged(queue) && elv_next_request(queue) && nr_queued < TAPEBLOCK_MIN_REQUEUE ) { diff --git a/drivers/scsi/ide-scsi.c b/drivers/scsi/ide-scsi.c index 8f6b5bf..fd1b8b1 100644 --- a/drivers/scsi/ide-scsi.c +++ b/drivers/scsi/ide-scsi.c @@ -987,7 +987,7 @@ static int idescsi_eh_abort (struct scsi /* It's somewhere in flight. Does ide subsystem agree? 
*/ if (scsi->pc->scsi_cmd->serial_number == cmd->serial_number && !busy && - elv_queue_empty(drive->queue) && HWGROUP(drive)->rq != scsi->pc->rq) { + HWGROUP(drive)->rq != scsi->pc->rq) { /* * FIXME - not sure this condition can ever occur */ diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 9f7482d..339045c 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -64,6 +64,13 @@ #endif }; #undef SP +/* + * When to reinvoke queueing after a resource shortage. It's 3 msecs to + * not change behaviour from the previous unplug mechanism, experimentation + * may prove this needs changing. + */ +#define SCSI_QUEUE_DELAY 3 + static void scsi_run_queue(struct request_queue *q); /* @@ -144,14 +151,7 @@ int scsi_queue_insert(struct scsi_cmnd * /* * Requeue this command. It will go before all other commands * that are already in the queue. - * - * NOTE: there is magic here about the way the queue is plugged if - * we have no outstanding commands. - * - * Although we *don't* plug the queue, we call the request - * function. The SCSI request function detects the blocked condition - * and plugs the queue appropriately. - */ + */ spin_lock_irqsave(q->queue_lock, flags); blk_requeue_request(q, cmd->request); spin_unlock_irqrestore(q->queue_lock, flags); @@ -1256,11 +1256,11 @@ static int scsi_prep_fn(struct request_q case BLKPREP_DEFER: /* * If we defer, the elv_next_request() returns NULL, but the - * queue must be restarted, so we plug here if no returning - * command will automatically do that. + * queue must be restarted, so we schedule a callback to happen + * shortly. */ if (sdev->device_busy == 0) - blk_plug_device(q); + blk_delay_queue(q, SCSI_QUEUE_DELAY); break; default: req->cmd_flags |= REQ_DONTPREP; @@ -1289,7 +1289,7 @@ static inline int scsi_dev_queue_ready(s sdev_printk(KERN_INFO, sdev, "unblocking device at zero depth\n")); } else { - blk_plug_device(q); + blk_delay_queue(q, SCSI_QUEUE_DELAY); return 0; } } @@ -1321,7 +1321,7 @@ static inline int scsi_host_queue_ready( printk("scsi%d unblocking host at zero depth\n", shost->host_no)); } else { - blk_plug_device(q); + blk_delay_queue(q, SCSI_QUEUE_DELAY); return 0; } } @@ -1444,7 +1444,7 @@ static void scsi_request_fn(struct reque * the host is no longer able to accept any more requests. */ shost = sdev->host; - while (!blk_queue_plugged(q)) { + for (;;) { int rtn; /* * get next queueable request. 
We do this early to make sure @@ -1509,15 +1509,8 @@ static void scsi_request_fn(struct reque */ rtn = scsi_dispatch_cmd(cmd); spin_lock_irq(q->queue_lock); - if(rtn) { - /* we're refusing the command; because of - * the way locks get dropped, we need to - * check here if plugging is required */ - if(sdev->device_busy == 0) - blk_plug_device(q); - - break; - } + if (rtn) + goto out_delay; } goto out; @@ -1536,9 +1529,10 @@ static void scsi_request_fn(struct reque spin_lock_irq(q->queue_lock); blk_requeue_request(q, req); sdev->device_busy--; - if(sdev->device_busy == 0) - blk_plug_device(q); - out: +out_delay: + if (sdev->device_busy == 0) + blk_delay_queue(q, SCSI_QUEUE_DELAY); +out: /* must be careful here...if we trigger the ->remove() function * we cannot be holding the q lock */ spin_unlock_irq(q->queue_lock); diff --git a/fs/adfs/inode.c b/fs/adfs/inode.c index 7e7a04b..ec9d493 100644 --- a/fs/adfs/inode.c +++ b/fs/adfs/inode.c @@ -75,7 +75,6 @@ static sector_t _adfs_bmap(struct addres static const struct address_space_operations adfs_aops = { .readpage = adfs_readpage, .writepage = adfs_writepage, - .sync_page = block_sync_page, .prepare_write = adfs_prepare_write, .commit_write = generic_commit_write, .bmap = _adfs_bmap diff --git a/fs/affs/file.c b/fs/affs/file.c index 4aa8079..1e0b002 100644 --- a/fs/affs/file.c +++ b/fs/affs/file.c @@ -411,7 +411,6 @@ static sector_t _affs_bmap(struct addres const struct address_space_operations affs_aops = { .readpage = affs_readpage, .writepage = affs_writepage, - .sync_page = block_sync_page, .prepare_write = affs_prepare_write, .commit_write = generic_commit_write, .bmap = _affs_bmap @@ -764,7 +763,6 @@ out: const struct address_space_operations affs_aops_ofs = { .readpage = affs_readpage_ofs, //.writepage = affs_writepage_ofs, - //.sync_page = affs_sync_page_ofs, .prepare_write = affs_prepare_write_ofs, .commit_write = affs_commit_write_ofs }; diff --git a/fs/befs/linuxvfs.c b/fs/befs/linuxvfs.c index cc6cc8e..5cf525f 100644 --- a/fs/befs/linuxvfs.c +++ b/fs/befs/linuxvfs.c @@ -74,7 +74,6 @@ static const struct inode_operations bef static const struct address_space_operations befs_aops = { .readpage = befs_readpage, - .sync_page = block_sync_page, .bmap = befs_bmap, }; diff --git a/fs/bfs/file.c b/fs/bfs/file.c index ef4d1fa..cc03674 100644 --- a/fs/bfs/file.c +++ b/fs/bfs/file.c @@ -158,7 +158,6 @@ static sector_t bfs_bmap(struct address_ const struct address_space_operations bfs_aops = { .readpage = bfs_readpage, .writepage = bfs_writepage, - .sync_page = block_sync_page, .prepare_write = bfs_prepare_write, .commit_write = generic_commit_write, .bmap = bfs_bmap, diff --git a/fs/block_dev.c b/fs/block_dev.c index 0c59b70..6872b9a 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -340,7 +340,6 @@ completion: nbytes = iocb->ki_left; iocb->ki_pos += nbytes; - blk_run_address_space(inode->i_mapping); if (atomic_dec_and_test(bio_count)) aio_complete(iocb, nbytes, 0); @@ -1319,7 +1318,6 @@ static long block_ioctl(struct file *fil const struct address_space_operations def_blk_aops = { .readpage = blkdev_readpage, .writepage = blkdev_writepage, - .sync_page = block_sync_page, .prepare_write = blkdev_prepare_write, .commit_write = blkdev_commit_write, .writepages = generic_writepages, diff --git a/fs/buffer.c b/fs/buffer.c index f99c509..00abea4 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -55,23 +55,15 @@ init_buffer(struct buffer_head *bh, bh_e bh->b_private = private; } -static int sync_buffer(void *word) +static int sleep_on_buffer(void 
*word) { - struct block_device *bd; - struct buffer_head *bh - = container_of(word, struct buffer_head, b_state); - - smp_mb(); - bd = bh->b_bdev; - if (bd) - blk_run_address_space(bd->bd_inode->i_mapping); io_schedule(); return 0; } void fastcall __lock_buffer(struct buffer_head *bh) { - wait_on_bit_lock(&bh->b_state, BH_Lock, sync_buffer, + wait_on_bit_lock(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(__lock_buffer); @@ -91,7 +83,7 @@ void fastcall unlock_buffer(struct buffe */ void __wait_on_buffer(struct buffer_head * bh) { - wait_on_bit(&bh->b_state, BH_Lock, sync_buffer, TASK_UNINTERRUPTIBLE); + wait_on_bit(&bh->b_state, BH_Lock, sleep_on_buffer, TASK_UNINTERRUPTIBLE); } static void @@ -2879,16 +2871,6 @@ out: } EXPORT_SYMBOL(try_to_free_buffers); -void block_sync_page(struct page *page) -{ - struct address_space *mapping; - - smp_mb(); - mapping = page_mapping(page); - if (mapping) - blk_run_backing_dev(mapping->backing_dev_info, page); -} - /* * There are no bdflush tunables left. But distributions are * still running obsolete flush daemons, so we terminate them here. @@ -3030,7 +3012,6 @@ EXPORT_SYMBOL(__wait_on_buffer); EXPORT_SYMBOL(block_commit_write); EXPORT_SYMBOL(block_prepare_write); EXPORT_SYMBOL(block_read_full_page); -EXPORT_SYMBOL(block_sync_page); EXPORT_SYMBOL(block_truncate_page); EXPORT_SYMBOL(block_write_full_page); EXPORT_SYMBOL(cont_prepare_write); diff --git a/fs/cifs/file.c b/fs/cifs/file.c index 07ff935..2fdf4b2 100644 --- a/fs/cifs/file.c +++ b/fs/cifs/file.c @@ -2027,7 +2027,6 @@ const struct address_space_operations ci .prepare_write = cifs_prepare_write, .commit_write = cifs_commit_write, .set_page_dirty = __set_page_dirty_nobuffers, - /* .sync_page = cifs_sync_page, */ /* .direct_IO = */ }; @@ -2043,6 +2042,5 @@ const struct address_space_operations ci .prepare_write = cifs_prepare_write, .commit_write = cifs_commit_write, .set_page_dirty = __set_page_dirty_nobuffers, - /* .sync_page = cifs_sync_page, */ /* .direct_IO = */ }; diff --git a/fs/direct-io.c b/fs/direct-io.c index d9d0833..0b826a5 100644 --- a/fs/direct-io.c +++ b/fs/direct-io.c @@ -1064,9 +1064,6 @@ direct_io_worker(int rw, struct kiocb *i if (dio->bio) dio_bio_submit(dio); - /* All IO is now issued, send it on its way */ - blk_run_address_space(inode->i_mapping); - /* * It is possible that, we return short IO due to end of file. * In that case, we need to release all the pages we got hold on. @@ -1159,6 +1156,8 @@ __blockdev_direct_IO(int rw, struct kioc int release_i_mutex = 0; int acquire_i_mutex = 0; + blk_plug_current(); + if (rw & WRITE) rw = WRITE_SYNC; @@ -1251,6 +1250,8 @@ out: mutex_unlock(&inode->i_mutex); else if (acquire_i_mutex) mutex_lock(&inode->i_mutex); + + blk_unplug_current(); return retval; } EXPORT_SYMBOL(__blockdev_direct_IO); diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c index 3a6f65c..1e5d2ba 100644 --- a/fs/ecryptfs/mmap.c +++ b/fs/ecryptfs/mmap.c @@ -776,33 +776,10 @@ static sector_t ecryptfs_bmap(struct add return rc; } -static void ecryptfs_sync_page(struct page *page) -{ - struct inode *inode; - struct inode *lower_inode; - struct page *lower_page; - - inode = page->mapping->host; - lower_inode = ecryptfs_inode_to_lower(inode); - /* NOTE: Recently swapped with grab_cache_page(), since - * sync_page() just makes sure that pending I/O gets done. 
*/ - lower_page = find_lock_page(lower_inode->i_mapping, page->index); - if (!lower_page) { - ecryptfs_printk(KERN_DEBUG, "find_lock_page failed\n"); - return; - } - lower_page->mapping->a_ops->sync_page(lower_page); - ecryptfs_printk(KERN_DEBUG, "Unlocking page with index = [0x%.16x]\n", - lower_page->index); - unlock_page(lower_page); - page_cache_release(lower_page); -} - struct address_space_operations ecryptfs_aops = { .writepage = ecryptfs_writepage, .readpage = ecryptfs_readpage, .prepare_write = ecryptfs_prepare_write, .commit_write = ecryptfs_commit_write, .bmap = ecryptfs_bmap, - .sync_page = ecryptfs_sync_page, }; diff --git a/fs/efs/inode.c b/fs/efs/inode.c index 174696f..5eb86ec 100644 --- a/fs/efs/inode.c +++ b/fs/efs/inode.c @@ -23,7 +23,6 @@ static sector_t _efs_bmap(struct address } static const struct address_space_operations efs_aops = { .readpage = efs_readpage, - .sync_page = block_sync_page, .bmap = _efs_bmap }; diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index dd4e14c..952a2aa 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -688,7 +688,6 @@ const struct address_space_operations ex .readpage = ext2_readpage, .readpages = ext2_readpages, .writepage = ext2_writepage, - .sync_page = block_sync_page, .prepare_write = ext2_prepare_write, .commit_write = generic_commit_write, .bmap = ext2_bmap, @@ -706,7 +705,6 @@ const struct address_space_operations ex .readpage = ext2_readpage, .readpages = ext2_readpages, .writepage = ext2_nobh_writepage, - .sync_page = block_sync_page, .prepare_write = ext2_nobh_prepare_write, .commit_write = nobh_commit_write, .bmap = ext2_bmap, diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 8a824f4..dba6dd2 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1771,7 +1771,6 @@ static const struct address_space_operat .readpage = ext3_readpage, .readpages = ext3_readpages, .writepage = ext3_ordered_writepage, - .sync_page = block_sync_page, .prepare_write = ext3_prepare_write, .commit_write = ext3_ordered_commit_write, .bmap = ext3_bmap, @@ -1785,7 +1784,6 @@ static const struct address_space_operat .readpage = ext3_readpage, .readpages = ext3_readpages, .writepage = ext3_writeback_writepage, - .sync_page = block_sync_page, .prepare_write = ext3_prepare_write, .commit_write = ext3_writeback_commit_write, .bmap = ext3_bmap, @@ -1799,7 +1797,6 @@ static const struct address_space_operat .readpage = ext3_readpage, .readpages = ext3_readpages, .writepage = ext3_journalled_writepage, - .sync_page = block_sync_page, .prepare_write = ext3_prepare_write, .commit_write = ext3_journalled_commit_write, .set_page_dirty = ext3_journalled_set_page_dirty, diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index fbff4b9..806eee1 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -1770,7 +1770,6 @@ static const struct address_space_operat .readpage = ext4_readpage, .readpages = ext4_readpages, .writepage = ext4_ordered_writepage, - .sync_page = block_sync_page, .prepare_write = ext4_prepare_write, .commit_write = ext4_ordered_commit_write, .bmap = ext4_bmap, @@ -1784,7 +1783,6 @@ static const struct address_space_operat .readpage = ext4_readpage, .readpages = ext4_readpages, .writepage = ext4_writeback_writepage, - .sync_page = block_sync_page, .prepare_write = ext4_prepare_write, .commit_write = ext4_writeback_commit_write, .bmap = ext4_bmap, @@ -1798,7 +1796,6 @@ static const struct address_space_operat .readpage = ext4_readpage, .readpages = ext4_readpages, .writepage = ext4_journalled_writepage, - .sync_page = block_sync_page, .prepare_write = 
ext4_prepare_write, .commit_write = ext4_journalled_commit_write, .set_page_dirty = ext4_journalled_set_page_dirty, diff --git a/fs/fat/inode.c b/fs/fat/inode.c index 7610735..da90517 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -197,7 +197,6 @@ static const struct address_space_operat .readpages = fat_readpages, .writepage = fat_writepage, .writepages = fat_writepages, - .sync_page = block_sync_page, .prepare_write = fat_prepare_write, .commit_write = fat_commit_write, .direct_IO = fat_direct_IO, diff --git a/fs/freevxfs/vxfs_subr.c b/fs/freevxfs/vxfs_subr.c index decac62..859ce1a 100644 --- a/fs/freevxfs/vxfs_subr.c +++ b/fs/freevxfs/vxfs_subr.c @@ -45,7 +45,6 @@ static sector_t vxfs_bmap(struct addres const struct address_space_operations vxfs_aops = { .readpage = vxfs_readpage, .bmap = vxfs_bmap, - .sync_page = block_sync_page, }; inline void diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c index 5ab8e50..740b3c4 100644 --- a/fs/fuse/inode.c +++ b/fs/fuse/inode.c @@ -412,7 +412,6 @@ static struct fuse_conn *new_conn(void) INIT_LIST_HEAD(&fc->interrupts); atomic_set(&fc->num_waiting, 0); fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; - fc->bdi.unplug_io_fn = default_unplug_io_fn; fc->reqctr = 0; fc->blocked = 1; get_random_bytes(&fc->scramble_key, sizeof(fc->scramble_key)); diff --git a/fs/gfs2/ops_address.c b/fs/gfs2/ops_address.c index 56e3359..368516a 100644 --- a/fs/gfs2/ops_address.c +++ b/fs/gfs2/ops_address.c @@ -789,7 +789,6 @@ const struct address_space_operations gf .writepages = gfs2_writepages, .readpage = gfs2_readpage, .readpages = gfs2_readpages, - .sync_page = block_sync_page, .prepare_write = gfs2_prepare_write, .commit_write = gfs2_commit_write, .bmap = gfs2_bmap, diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index fafcba5..9290c90 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -117,7 +117,6 @@ static int hfs_writepages(struct address const struct address_space_operations hfs_btree_aops = { .readpage = hfs_readpage, .writepage = hfs_writepage, - .sync_page = block_sync_page, .prepare_write = hfs_prepare_write, .commit_write = generic_commit_write, .bmap = hfs_bmap, @@ -127,7 +126,6 @@ const struct address_space_operations hf const struct address_space_operations hfs_aops = { .readpage = hfs_readpage, .writepage = hfs_writepage, - .sync_page = block_sync_page, .prepare_write = hfs_prepare_write, .commit_write = generic_commit_write, .bmap = hfs_bmap, diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 642012a..7848786 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -112,7 +112,6 @@ static int hfsplus_writepages(struct add const struct address_space_operations hfsplus_btree_aops = { .readpage = hfsplus_readpage, .writepage = hfsplus_writepage, - .sync_page = block_sync_page, .prepare_write = hfsplus_prepare_write, .commit_write = generic_commit_write, .bmap = hfsplus_bmap, @@ -122,7 +121,6 @@ const struct address_space_operations hf const struct address_space_operations hfsplus_aops = { .readpage = hfsplus_readpage, .writepage = hfsplus_writepage, - .sync_page = block_sync_page, .prepare_write = hfsplus_prepare_write, .commit_write = generic_commit_write, .bmap = hfsplus_bmap, diff --git a/fs/hpfs/file.c b/fs/hpfs/file.c index b4eafc0..b6c5457 100644 --- a/fs/hpfs/file.c +++ b/fs/hpfs/file.c @@ -102,7 +102,6 @@ static sector_t _hpfs_bmap(struct addres const struct address_space_operations hpfs_aops = { .readpage = hpfs_readpage, .writepage = hpfs_writepage, - .sync_page = block_sync_page, .prepare_write = hpfs_prepare_write, 
.commit_write = generic_commit_write, .bmap = _hpfs_bmap diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c index 64a96cd..f422a2d 100644 --- a/fs/isofs/inode.c +++ b/fs/isofs/inode.c @@ -1052,7 +1052,6 @@ static sector_t _isofs_bmap(struct addre static const struct address_space_operations isofs_aops = { .readpage = isofs_readpage, - .sync_page = block_sync_page, .bmap = _isofs_bmap }; diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c index e285022..46136b1 100644 --- a/fs/jfs/inode.c +++ b/fs/jfs/inode.c @@ -302,7 +302,6 @@ const struct address_space_operations jf .readpages = jfs_readpages, .writepage = jfs_writepage, .writepages = jfs_writepages, - .sync_page = block_sync_page, .prepare_write = jfs_prepare_write, .commit_write = nobh_commit_write, .bmap = jfs_bmap, diff --git a/fs/jfs/jfs_metapage.c b/fs/jfs/jfs_metapage.c index 58deae0..0a73782 100644 --- a/fs/jfs/jfs_metapage.c +++ b/fs/jfs/jfs_metapage.c @@ -580,7 +580,6 @@ static void metapage_invalidatepage(stru const struct address_space_operations jfs_metapage_aops = { .readpage = metapage_readpage, .writepage = metapage_writepage, - .sync_page = block_sync_page, .releasepage = metapage_releasepage, .invalidatepage = metapage_invalidatepage, .set_page_dirty = __set_page_dirty_nobuffers, diff --git a/fs/minix/inode.c b/fs/minix/inode.c index 92e383a..a2f927c 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -363,7 +363,6 @@ static sector_t minix_bmap(struct addres static const struct address_space_operations minix_aops = { .readpage = minix_readpage, .writepage = minix_writepage, - .sync_page = block_sync_page, .prepare_write = minix_prepare_write, .commit_write = generic_commit_write, .bmap = minix_bmap diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c index 629e7ab..ff9684b 100644 --- a/fs/ntfs/aops.c +++ b/fs/ntfs/aops.c @@ -1550,8 +1550,6 @@ #endif /* NTFS_RW */ */ const struct address_space_operations ntfs_aops = { .readpage = ntfs_readpage, /* Fill page with data. */ - .sync_page = block_sync_page, /* Currently, just unplugs the - disk request queue. */ #ifdef NTFS_RW .writepage = ntfs_writepage, /* Write dirty page to disk. */ #endif /* NTFS_RW */ @@ -1566,8 +1564,6 @@ #endif /* NTFS_RW */ */ const struct address_space_operations ntfs_mst_aops = { .readpage = ntfs_readpage, /* Fill page with data. */ - .sync_page = block_sync_page, /* Currently, just unplugs the - disk request queue. */ #ifdef NTFS_RW .writepage = ntfs_writepage, /* Write dirty page to disk. */ .set_page_dirty = __set_page_dirty_nobuffers, /* Set the page dirty diff --git a/fs/ntfs/compress.c b/fs/ntfs/compress.c index d98daf5..1e3b686 100644 --- a/fs/ntfs/compress.c +++ b/fs/ntfs/compress.c @@ -687,7 +687,7 @@ lock_retry_remap: "uptodate! 
Unplugging the disk queue " "and rescheduling."); get_bh(tbh); - blk_run_address_space(mapping); + blk_replug_current_nested(); schedule(); put_bh(tbh); if (unlikely(!buffer_uptodate(tbh))) diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 93628b0..b10fc22 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -660,6 +660,5 @@ const struct address_space_operations oc .prepare_write = ocfs2_prepare_write, .commit_write = ocfs2_commit_write, .bmap = ocfs2_bmap, - .sync_page = block_sync_page, .direct_IO = ocfs2_direct_IO }; diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c index 5a9779b..c05c567 100644 --- a/fs/ocfs2/cluster/heartbeat.c +++ b/fs/ocfs2/cluster/heartbeat.c @@ -208,11 +208,7 @@ static inline void o2hb_bio_wait_dec(str static void o2hb_wait_on_io(struct o2hb_region *reg, struct o2hb_bio_wait_ctxt *wc) { - struct address_space *mapping = reg->hr_bdev->bd_inode->i_mapping; - - blk_run_address_space(mapping); o2hb_bio_wait_dec(wc, 1); - wait_for_completion(&wc->wc_io_complete); } diff --git a/fs/qnx4/inode.c b/fs/qnx4/inode.c index 83bc8e7..4511c87 100644 --- a/fs/qnx4/inode.c +++ b/fs/qnx4/inode.c @@ -451,7 +451,6 @@ static sector_t qnx4_bmap(struct address static const struct address_space_operations qnx4_aops = { .readpage = qnx4_readpage, .writepage = qnx4_writepage, - .sync_page = block_sync_page, .prepare_write = qnx4_prepare_write, .commit_write = generic_commit_write, .bmap = qnx4_bmap diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c index 9fcbfe3..734f9df 100644 --- a/fs/reiserfs/inode.c +++ b/fs/reiserfs/inode.c @@ -3006,7 +3006,6 @@ const struct address_space_operations re .readpages = reiserfs_readpages, .releasepage = reiserfs_releasepage, .invalidatepage = reiserfs_invalidatepage, - .sync_page = block_sync_page, .prepare_write = reiserfs_prepare_write, .commit_write = reiserfs_commit_write, .bmap = reiserfs_aop_bmap, diff --git a/fs/sysv/itree.c b/fs/sysv/itree.c index f2bcccd..394ae5a 100644 --- a/fs/sysv/itree.c +++ b/fs/sysv/itree.c @@ -468,7 +468,6 @@ static sector_t sysv_bmap(struct address const struct address_space_operations sysv_aops = { .readpage = sysv_readpage, .writepage = sysv_writepage, - .sync_page = block_sync_page, .prepare_write = sysv_prepare_write, .commit_write = generic_commit_write, .bmap = sysv_bmap diff --git a/fs/udf/file.c b/fs/udf/file.c index 40d5047..d776527 100644 --- a/fs/udf/file.c +++ b/fs/udf/file.c @@ -98,7 +98,6 @@ static int udf_adinicb_commit_write(stru const struct address_space_operations udf_adinicb_aops = { .readpage = udf_adinicb_readpage, .writepage = udf_adinicb_writepage, - .sync_page = block_sync_page, .prepare_write = udf_adinicb_prepare_write, .commit_write = udf_adinicb_commit_write, }; diff --git a/fs/udf/inode.c b/fs/udf/inode.c index ae21a0e..8d0e44b 100644 --- a/fs/udf/inode.c +++ b/fs/udf/inode.c @@ -135,7 +135,6 @@ static sector_t udf_bmap(struct address_ const struct address_space_operations udf_aops = { .readpage = udf_readpage, .writepage = udf_writepage, - .sync_page = block_sync_page, .prepare_write = udf_prepare_write, .commit_write = generic_commit_write, .bmap = udf_bmap, diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c index fb34ad0..cdd4e82 100644 --- a/fs/ufs/inode.c +++ b/fs/ufs/inode.c @@ -573,7 +573,6 @@ static sector_t ufs_bmap(struct address_ const struct address_space_operations ufs_aops = { .readpage = ufs_readpage, .writepage = ufs_writepage, - .sync_page = block_sync_page, .prepare_write = ufs_prepare_write, .commit_write = generic_commit_write, .bmap = ufs_bmap 
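The filesystem hunks above all apply the same two rules. First, ->sync_page disappears from struct address_space_operations: a task waiting on a page no longer has to unplug somebody else's queue, because the submitter flushes its own plug list when it sleeps. Second, an explicit blk_run_address_space() call made just before waiting becomes blk_replug_current_nested(), as the ufs and xfs hunks below show. A hedged sketch of the conversion for a made-up filesystem (all examplefs_* names and the examplefs_buf type are invented for illustration):

	static const struct address_space_operations examplefs_aops = {
		.readpage	= examplefs_readpage,
		.writepage	= examplefs_writepage,
		/* .sync_page = block_sync_page is simply deleted */
		.bmap		= examplefs_bmap,
	};

	/* Waiting for our own in-flight IO to finish. */
	static void examplefs_wait_on_io(struct examplefs_buf *ebp)
	{
		/* was: blk_run_address_space(ebp->mapping); */
		blk_replug_current_nested();
		wait_for_completion(&ebp->io_done);
	}
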
diff --git a/fs/ufs/truncate.c b/fs/ufs/truncate.c index 749581f..4df3747 100644 --- a/fs/ufs/truncate.c +++ b/fs/ufs/truncate.c @@ -470,7 +470,7 @@ int ufs_truncate(struct inode *inode, lo break; if (IS_SYNC(inode) && (inode->i_state & I_DIRTY)) ufs_sync_inode (inode); - blk_run_address_space(inode->i_mapping); + blk_replug_current_nested(); yield(); } diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index 143ffc8..732ce92 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -1471,7 +1471,6 @@ const struct address_space_operations xf .readpages = xfs_vm_readpages, .writepage = xfs_vm_writepage, .writepages = xfs_vm_writepages, - .sync_page = block_sync_page, .releasepage = xfs_vm_releasepage, .invalidatepage = xfs_vm_invalidatepage, .prepare_write = xfs_vm_prepare_write, diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c index e2bea6a..825f2a3 100644 --- a/fs/xfs/linux-2.6/xfs_buf.c +++ b/fs/xfs/linux-2.6/xfs_buf.c @@ -910,7 +910,7 @@ xfs_buf_lock( { XB_TRACE(bp, "lock", 0); if (atomic_read(&bp->b_io_remaining)) - blk_run_address_space(bp->b_target->bt_mapping); + blk_replug_current_nested(); down(&bp->b_sema); XB_SET_OWNER(bp); XB_TRACE(bp, "locked", 0); @@ -981,9 +981,7 @@ xfs_buf_wait_unpin( set_current_state(TASK_UNINTERRUPTIBLE); if (atomic_read(&bp->b_pin_count) == 0) break; - if (atomic_read(&bp->b_io_remaining)) - blk_run_address_space(bp->b_target->bt_mapping); - schedule(); + io_schedule(); } remove_wait_queue(&bp->b_waiters, &wait); set_current_state(TASK_RUNNING); @@ -1296,7 +1294,7 @@ xfs_buf_iowait( { XB_TRACE(bp, "iowait", 0); if (atomic_read(&bp->b_io_remaining)) - blk_run_address_space(bp->b_target->bt_mapping); + blk_replug_current_nested(); down(&bp->b_iodonesema); XB_TRACE(bp, "iowaited", (long)bp->b_error); return bp->b_error; @@ -1529,7 +1527,6 @@ xfs_mapping_buftarg( struct inode *inode; struct address_space *mapping; static const struct address_space_operations mapping_aops = { - .sync_page = block_sync_page, .migratepage = fail_migrate_page, }; @@ -1760,7 +1757,7 @@ xfsbufd( if (as_list_len > 0) purge_addresses(); if (count) - blk_run_address_space(target->bt_mapping); + blk_replug_current_nested(); } while (!kthread_should_stop()); @@ -1801,7 +1798,7 @@ xfs_flush_buftarg( } if (wait) - blk_run_address_space(target->bt_mapping); + blk_replug_current_nested(); /* * Remaining list items must be flushed before returning diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 7011d62..5757eb7 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -30,8 +30,6 @@ struct backing_dev_info { unsigned int capabilities; /* Device capabilities */ congested_fn *congested_fn; /* Function pointer if device is md/dm */ void *congested_data; /* Pointer to aux data for congested func */ - void (*unplug_io_fn)(struct backing_dev_info *, struct page *); - void *unplug_io_data; }; @@ -61,7 +59,6 @@ #error please change backing_dev_info::c #endif extern struct backing_dev_info default_backing_dev_info; -void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page); int writeback_acquire(struct backing_dev_info *bdi); int writeback_in_progress(struct backing_dev_info *bdi); diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 83dcd8c..f8cdd44 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -111,6 +111,11 @@ struct io_context { /* * For request batching */ + int plugged; + int qrcu_idx; + struct list_head plugged_list; + struct 
request_queue *plugged_queue; + unsigned long last_waited; /* Time last woken after wait for request */ int nr_batch_requests; /* Number of requests left in the batch */ @@ -118,6 +123,38 @@ struct io_context { struct rb_root cic_root; }; +void blk_plug_current(void); +void blk_unplug_current(void); + +static inline int blk_unplug_current_nested(void) +{ + struct io_context *ioc = current->io_context; + int ret = 0; + + if (ioc && ioc->plugged) { + ret = ioc->plugged; + ioc->plugged = 1; + blk_unplug_current(); + } + + return ret; +} + +static inline void blk_plug_current_nested(int depth) +{ + struct io_context *ioc = current->io_context; + + if (ioc) { + ioc->plugged = depth; + ioc->plugged_queue = NULL; + } +} + +static inline void blk_replug_current_nested(void) +{ + blk_plug_current_nested(blk_unplug_current_nested()); +} + void put_io_context(struct io_context *ioc); void exit_io_context(void); struct io_context *get_io_context(gfp_t gfp_flags, int node); @@ -333,7 +370,6 @@ #include typedef void (request_fn_proc) (request_queue_t *q); typedef int (make_request_fn) (request_queue_t *q, struct bio *bio); typedef int (prep_rq_fn) (request_queue_t *, struct request *); -typedef void (unplug_fn) (request_queue_t *); struct bio_vec; typedef int (merge_bvec_fn) (request_queue_t *, struct bio *, struct bio_vec *); @@ -373,7 +409,6 @@ struct request_queue request_fn_proc *request_fn; make_request_fn *make_request_fn; prep_rq_fn *prep_rq_fn; - unplug_fn *unplug_fn; merge_bvec_fn *merge_bvec_fn; issue_flush_fn *issue_flush_fn; prepare_flush_fn *prepare_flush_fn; @@ -386,12 +421,11 @@ struct request_queue struct request *boundary_rq; /* - * Auto-unplugging state + * This replaces the part of the old plugging mechanism that was + * responsible for re-invoking ->request_fn() after a little nap. + * This is needed to handle temporary resource starvation. 
*/ - struct timer_list unplug_timer; - int unplug_thresh; /* After this many requests */ - unsigned long unplug_delay; /* After this many jiffies */ - struct work_struct unplug_work; + struct delayed_work delay_work; struct backing_dev_info backing_dev_info; @@ -466,6 +500,11 @@ #endif struct request *orig_bar_rq; unsigned int bi_size; + /* + * plug synchronization + */ + struct qrcu_struct qrcu; + struct mutex sysfs_lock; }; @@ -476,8 +515,7 @@ #define QUEUE_FLAG_READFULL 3 /* write q #define QUEUE_FLAG_WRITEFULL 4 /* read queue has been filled */ #define QUEUE_FLAG_DEAD 5 /* queue being torn down */ #define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */ -#define QUEUE_FLAG_PLUGGED 7 /* queue is plugged */ -#define QUEUE_FLAG_ELVSWITCH 8 /* don't use elevator, just do FIFO */ +#define QUEUE_FLAG_ELVSWITCH 7 /* don't use elevator, just do FIFO */ enum { /* @@ -519,7 +557,6 @@ enum { QUEUE_ORDSEQ_DONE = 0x20, }; -#define blk_queue_plugged(q) test_bit(QUEUE_FLAG_PLUGGED, &(q)->queue_flags) #define blk_queue_tagged(q) test_bit(QUEUE_FLAG_QUEUED, &(q)->queue_flags) #define blk_queue_stopped(q) test_bit(QUEUE_FLAG_STOPPED, &(q)->queue_flags) #define blk_queue_flushing(q) ((q)->ordseq) @@ -633,8 +670,7 @@ extern void blk_end_sync_rq(struct reque extern struct request *blk_get_request(request_queue_t *, int, gfp_t); extern void blk_insert_request(request_queue_t *, struct request *, int, void *); extern void blk_requeue_request(request_queue_t *, struct request *); -extern void blk_plug_device(request_queue_t *); -extern int blk_remove_plug(request_queue_t *); +extern void blk_delay_queue(request_queue_t *, unsigned long); extern void blk_recount_segments(request_queue_t *, struct bio *); extern int scsi_cmd_ioctl(struct file *, struct gendisk *, unsigned int, void __user *); extern int sg_scsi_ioctl(struct file *, struct request_queue *, @@ -685,19 +721,6 @@ static inline request_queue_t *bdev_get_ return bdev->bd_disk->queue; } -static inline void blk_run_backing_dev(struct backing_dev_info *bdi, - struct page *page) -{ - if (bdi && bdi->unplug_io_fn) - bdi->unplug_io_fn(bdi, page); -} - -static inline void blk_run_address_space(struct address_space *mapping) -{ - if (mapping) - blk_run_backing_dev(mapping->backing_dev_info, NULL); -} - /* * end_request() and friends. Must be called with the request queue spinlock * acquired. All functions called within end_request() _must_be_ atomic. 
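blk_delay_queue() replaces the old trick of plugging the queue on a temporary resource shortage and letting the unplug timer re-run it: the queue's new delay_work re-invokes ->request_fn() after the given nap, and blk_queue_plugged()/blk_plug_device()/blk_remove_plug() go away outright (which is why the driver loops earlier in this patch become plain elv_next_request() loops). The SCSI hunks above use a 3 msec delay. A sketch of the same idiom in a hypothetical driver's prep function (the exampledrv_* names are invented; the real users are scsi_prep_fn() and scsi_request_fn() above):

	#define EXAMPLEDRV_QUEUE_DELAY	3	/* msecs, mirroring SCSI_QUEUE_DELAY */

	static int exampledrv_prep_fn(request_queue_t *q, struct request *rq)
	{
		struct exampledrv_device *dev = q->queuedata;

		if (!exampledrv_has_resources(dev)) {
			/*
			 * was: blk_plug_device(q) and wait for the unplug
			 * timer; now ask the block layer to call
			 * ->request_fn() again shortly.
			 */
			blk_delay_queue(q, EXAMPLEDRV_QUEUE_DELAY);
			return BLKPREP_DEFER;
		}

		rq->cmd_flags |= REQ_DONTPREP;
		return BLKPREP_OK;
	}
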
@@ -756,8 +779,6 @@ extern void blk_ordered_complete_seq(req extern int blk_rq_map_sg(request_queue_t *, struct request *, struct scatterlist *); extern void blk_dump_rq_flags(struct request *, char *); -extern void generic_unplug_device(request_queue_t *); -extern void __generic_unplug_device(request_queue_t *); extern long nr_blockdev_pages(void); int blk_get_queue(request_queue_t *); @@ -881,6 +902,27 @@ static inline void exit_io_context(void) { } +static inline void blk_replug_current_nested(void) +{ +} + +static inline void blk_plug_current(void) +{ +} + +static inline void blk_unplug_current(void) +{ +} + +static inline int blk_unplug_current_nested(void) +{ + return 0; +} + +static inline void blk_plug_current_nested(int depth) +{ +} + #endif /* CONFIG_BLOCK */ #endif diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index dd27b1c..ff8b2c9 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -208,7 +208,6 @@ int cont_prepare_write(struct page*, uns int generic_cont_expand(struct inode *inode, loff_t size); int generic_cont_expand_simple(struct inode *inode, loff_t size); int block_commit_write(struct page *page, unsigned from, unsigned to); -void block_sync_page(struct page *); sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *); int generic_commit_write(struct file *, struct page *, unsigned, unsigned); int block_truncate_page(struct address_space *, loff_t, get_block_t *); diff --git a/include/linux/elevator.h b/include/linux/elevator.h index e88fcbc..270da8d 100644 --- a/include/linux/elevator.h +++ b/include/linux/elevator.h @@ -17,7 +17,6 @@ typedef int (elevator_allow_merge_fn) (r typedef int (elevator_dispatch_fn) (request_queue_t *, int); typedef void (elevator_add_req_fn) (request_queue_t *, struct request *); -typedef int (elevator_queue_empty_fn) (request_queue_t *); typedef struct request *(elevator_request_list_fn) (request_queue_t *, struct request *); typedef void (elevator_completed_req_fn) (request_queue_t *, struct request *); typedef int (elevator_may_queue_fn) (request_queue_t *, int); @@ -42,7 +41,6 @@ struct elevator_ops elevator_activate_req_fn *elevator_activate_req_fn; elevator_deactivate_req_fn *elevator_deactivate_req_fn; - elevator_queue_empty_fn *elevator_queue_empty_fn; elevator_completed_req_fn *elevator_completed_req_fn; elevator_request_list_fn *elevator_former_req_fn; @@ -96,16 +94,16 @@ struct elevator_queue */ extern void elv_dispatch_sort(request_queue_t *, struct request *); extern void elv_dispatch_add_tail(request_queue_t *, struct request *); -extern void elv_add_request(request_queue_t *, struct request *, int, int); -extern void __elv_add_request(request_queue_t *, struct request *, int, int); +extern void elv_add_request(request_queue_t *, struct request *, int); +extern void __elv_add_request(request_queue_t *, struct request *, int); extern void elv_insert(request_queue_t *, struct request *, int); extern int elv_merge(request_queue_t *, struct request **, struct bio *); +extern int elv_try_merge(struct request *, struct bio *); extern void elv_merge_requests(request_queue_t *, struct request *, struct request *); extern void elv_merged_request(request_queue_t *, struct request *, int); extern void elv_dequeue_request(request_queue_t *, struct request *); extern void elv_requeue_request(request_queue_t *, struct request *); -extern int elv_queue_empty(request_queue_t *); extern struct request *elv_next_request(struct request_queue *q); extern struct request 
*elv_former_request(request_queue_t *, struct request *); extern struct request *elv_latter_request(request_queue_t *, struct request *); diff --git a/include/linux/fs.h b/include/linux/fs.h index 86ec3f4..aa3185e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -399,7 +399,6 @@ struct writeback_control; struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); - void (*sync_page)(struct page *); /* Write back some dirty pages from this mapping. */ int (*writepages)(struct address_space *, struct writeback_control *); diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 7a8dcb8..f9636d8 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -132,7 +132,6 @@ static inline pgoff_t linear_page_index( } extern void FASTCALL(__lock_page(struct page *page)); -extern void FASTCALL(__lock_page_nosync(struct page *page)); extern void FASTCALL(unlock_page(struct page *page)); /* @@ -146,17 +145,6 @@ static inline void lock_page(struct page } /* - * lock_page_nosync should only be used if we can't pin the page's inode. - * Doesn't play quite so well with block device plugging. - */ -static inline void lock_page_nosync(struct page *page) -{ - might_sleep(); - if (TestSetPageLocked(page)) - __lock_page_nosync(page); -} - -/* * This is exported only for wait_on_page_locked/wait_on_page_writeback. * Never use this directly! */ diff --git a/include/linux/raid/md.h b/include/linux/raid/md.h index fbaeda7..82d6a8c 100644 --- a/include/linux/raid/md.h +++ b/include/linux/raid/md.h @@ -85,7 +85,6 @@ extern void md_write_end(mddev_t *mddev) extern void md_handle_safemode(mddev_t *mddev); extern void md_done_sync(mddev_t *mddev, int blocks, int ok); extern void md_error (mddev_t *mddev, mdk_rdev_t *rdev); -extern void md_unplug_mddev(mddev_t *mddev); extern void md_super_write(mddev_t *mddev, mdk_rdev_t *rdev, sector_t sector, int size, struct page *page); diff --git a/include/linux/sched.h b/include/linux/sched.h index 5053dc0..f24c904 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1002,6 +1002,7 @@ #endif struct backing_dev_info *backing_dev_info; +/* block state */ struct io_context *io_context; unsigned long ptrace_message; diff --git a/include/linux/srcu.h b/include/linux/srcu.h index aca0eee..03a9010 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -27,6 +27,8 @@ #ifndef _LINUX_SRCU_H #define _LINUX_SRCU_H +#include + struct srcu_struct_array { int c[2]; }; @@ -50,4 +52,32 @@ void srcu_read_unlock(struct srcu_struct void synchronize_srcu(struct srcu_struct *sp); long srcu_batches_completed(struct srcu_struct *sp); +/* + * fully compatible with srcu, but optimized for writers. + */ + +struct qrcu_struct { + int completed; + atomic_t ctr[2]; + wait_queue_head_t wq; + struct mutex mutex; +}; + +int init_qrcu_struct(struct qrcu_struct *qp); +int qrcu_read_lock(struct qrcu_struct *qp) __acquires(qp); +void qrcu_read_unlock(struct qrcu_struct *qp, int idx) __releases(qp); +void synchronize_qrcu(struct qrcu_struct *qp); + +/** + * cleanup_qrcu_struct - deconstruct a quick-RCU structure + * @qp: structure to clean up. + * + * Must invoke this after you are finished using a given qrcu_struct that + * was initialized via init_qrcu_struct(). We reserve the right to + * leak memory should you fail to do this! 
+ */ +static inline void cleanup_qrcu_struct(struct qrcu_struct *qp) +{ +} + #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 0068688..c4f63bb 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -214,8 +214,6 @@ #ifdef CONFIG_MMU extern int shmem_unuse(swp_entry_t entry, struct page *page); #endif /* CONFIG_MMU */ -extern void swap_unplug_io_fn(struct backing_dev_info *, struct page *); - #ifdef CONFIG_SWAP /* linux/mm/page_io.c */ extern int swap_readpage(struct file *, struct page *); diff --git a/kernel/rcutorture.c b/kernel/rcutorture.c index 482b11f..bd7fd49 100644 --- a/kernel/rcutorture.c +++ b/kernel/rcutorture.c @@ -465,6 +465,73 @@ static struct rcu_torture_ops srcu_ops = }; /* + * Definitions for qrcu torture testing. + */ + +static struct qrcu_struct qrcu_ctl; + +static void qrcu_torture_init(void) +{ + init_qrcu_struct(&qrcu_ctl); + rcu_sync_torture_init(); +} + +static void qrcu_torture_cleanup(void) +{ + synchronize_qrcu(&qrcu_ctl); + cleanup_qrcu_struct(&qrcu_ctl); +} + +static int qrcu_torture_read_lock(void) __acquires(&qrcu_ctl) +{ + return qrcu_read_lock(&qrcu_ctl); +} + +static void qrcu_torture_read_unlock(int idx) __releases(&qrcu_ctl) +{ + qrcu_read_unlock(&qrcu_ctl, idx); +} + +static int qrcu_torture_completed(void) +{ + return qrcu_ctl.completed; +} + +static void qrcu_torture_synchronize(void) +{ + synchronize_qrcu(&qrcu_ctl); +} + +static int qrcu_torture_stats(char *page) +{ + int cnt = 0; + int idx = qrcu_ctl.completed & 0x1; + + cnt += sprintf(&page[cnt], "%s%s per-CPU(idx=%d):", + torture_type, TORTURE_FLAG, idx); + + cnt += sprintf(&page[cnt], " (%d,%d)", + atomic_read(qrcu_ctl.ctr + 0), + atomic_read(qrcu_ctl.ctr + 1)); + + cnt += sprintf(&page[cnt], "\n"); + return cnt; +} + +static struct rcu_torture_ops qrcu_ops = { + .init = qrcu_torture_init, + .cleanup = qrcu_torture_cleanup, + .readlock = qrcu_torture_read_lock, + .readdelay = srcu_read_delay, + .readunlock = qrcu_torture_read_unlock, + .completed = qrcu_torture_completed, + .deferredfree = rcu_sync_torture_deferred_free, + .sync = qrcu_torture_synchronize, + .stats = qrcu_torture_stats, + .name = "qrcu" +}; + +/* * Definitions for sched torture testing. */ @@ -503,8 +570,8 @@ static struct rcu_torture_ops sched_ops }; static struct rcu_torture_ops *torture_ops[] = - { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops, &srcu_ops, - &sched_ops, NULL }; + { &rcu_ops, &rcu_sync_ops, &rcu_bh_ops, &rcu_bh_sync_ops, + &srcu_ops, &qrcu_ops, &sched_ops, NULL }; /* * RCU torture writer kthread. Repeatedly substitutes a new structure diff --git a/kernel/sched.c b/kernel/sched.c index 08f8617..d63145b 100644 --- a/kernel/sched.c +++ b/kernel/sched.c @@ -3446,6 +3446,8 @@ asmlinkage void __sched schedule(void) print_irqtrace_events(current); dump_stack(); } + if (unlikely(current->io_context && current->io_context->plugged)) + blk_replug_current_nested(); profile_hit(SCHED_PROFILING, __builtin_return_address(0)); need_resched: @@ -4726,6 +4728,7 @@ void __sched io_schedule(void) { struct rq *rq = &__raw_get_cpu_var(runqueues); + blk_replug_current_nested(); delayacct_blkio_start(); atomic_inc(&rq->nr_iowait); schedule(); diff --git a/kernel/srcu.c b/kernel/srcu.c index 3507cab..53c6989 100644 --- a/kernel/srcu.c +++ b/kernel/srcu.c @@ -256,3 +256,108 @@ EXPORT_SYMBOL_GPL(srcu_read_unlock); EXPORT_SYMBOL_GPL(synchronize_srcu); EXPORT_SYMBOL_GPL(srcu_batches_completed); EXPORT_SYMBOL_GPL(srcu_readers_active); + +/** + * init_qrcu_struct - initialize a quick-RCU structure. 
+ * @qp: structure to initialize. + * + * Must invoke this on a given qrcu_struct before passing that qrcu_struct + * to any other function. Each qrcu_struct represents a separate domain + * of QRCU protection. + */ +int init_qrcu_struct(struct qrcu_struct *qp) +{ + qp->completed = 0; + atomic_set(qp->ctr + 0, 1); + atomic_set(qp->ctr + 1, 0); + init_waitqueue_head(&qp->wq); + mutex_init(&qp->mutex); + + return 0; +} + +/** + * qrcu_read_lock - register a new reader for an QRCU-protected structure. + * @qp: qrcu_struct in which to register the new reader. + * + * Counts the new reader in the appropriate element of the qrcu_struct. + * Returns an index that must be passed to the matching qrcu_read_unlock(). + */ +int qrcu_read_lock(struct qrcu_struct *qp) +{ + for (;;) { + int idx = qp->completed & 0x1; + if (likely(atomic_inc_not_zero(qp->ctr + idx))) + return idx; + } +} + +/** + * qrcu_read_unlock - unregister a old reader from an QRCU-protected structure. + * @qp: qrcu_struct in which to unregister the old reader. + * @idx: return value from corresponding qrcu_read_lock(). + * + * Removes the count for the old reader from the appropriate element of + * the qrcu_struct. + */ +void qrcu_read_unlock(struct qrcu_struct *qp, int idx) +{ + if (atomic_dec_and_test(qp->ctr + idx)) + wake_up(&qp->wq); +} + +/** + * synchronize_qrcu - wait for prior QRCU read-side critical-section completion + * @qp: qrcu_struct with which to synchronize. + * + * Flip the completed counter, and wait for the old count to drain to zero. + * As with classic RCU, the updater must use some separate means of + * synchronizing concurrent updates. Can block; must be called from + * process context. + * + * Note that it is illegal to call synchronize_qrcu() from the corresponding + * QRCU read-side critical section; doing so will result in deadlock. + * However, it is perfectly legal to call synchronize_qrcu() on one + * qrcu_struct from some other qrcu_struct's read-side critical section. + */ +void synchronize_qrcu(struct qrcu_struct *qp) +{ + int idx; + + /* + * The following memory barrier is needed to ensure that + * any prior data-structure manipulation is seen by other + * CPUs to happen before picking up the value of + * qp->completed. + */ + smp_mb(); + mutex_lock(&qp->mutex); + + idx = qp->completed & 0x1; + if (atomic_read(qp->ctr + idx) == 1) + goto out; + + atomic_inc(qp->ctr + (idx ^ 0x1)); + /* Reduce the likelihood that qrcu_read_lock() will loop */ + smp_mb__after_atomic_inc(); + qp->completed++; + + atomic_dec(qp->ctr + idx); + __wait_event(qp->wq, !atomic_read(qp->ctr + idx)); +out: + mutex_unlock(&qp->mutex); + smp_mb(); + /* + * The above smp_mb() is needed in the case that we + * see the counter reaching zero, so that we do not + * need to block. In this case, we need to make + * sure that the CPU does not re-order any subsequent + * changes made by the caller to occur prior to the + * test, as seen by other CPUs. 
+ */ +} + +EXPORT_SYMBOL_GPL(init_qrcu_struct); +EXPORT_SYMBOL_GPL(qrcu_read_lock); +EXPORT_SYMBOL_GPL(qrcu_read_unlock); +EXPORT_SYMBOL_GPL(synchronize_qrcu); diff --git a/mm/filemap.c b/mm/filemap.c index 0041484..70c7bc9 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -133,38 +133,8 @@ void remove_from_page_cache(struct page write_unlock_irq(&mapping->tree_lock); } -static int sync_page(void *word) +static int sleep_on_page(void *word) { - struct address_space *mapping; - struct page *page; - - page = container_of((unsigned long *)word, struct page, flags); - - /* - * page_mapping() is being called without PG_locked held. - * Some knowledge of the state and use of the page is used to - * reduce the requirements down to a memory barrier. - * The danger here is of a stale page_mapping() return value - * indicating a struct address_space different from the one it's - * associated with when it is associated with one. - * After smp_mb(), it's either the correct page_mapping() for - * the page, or an old page_mapping() and the page's own - * page_mapping() has gone NULL. - * The ->sync_page() address_space operation must tolerate - * page_mapping() going NULL. By an amazing coincidence, - * this comes about because none of the users of the page - * in the ->sync_page() methods make essential use of the - * page_mapping(), merely passing the page down to the backing - * device's unplug functions when it's non-NULL, which in turn - * ignore it for all cases but swap, where only page_private(page) is - * of interest. When page_mapping() does go NULL, the entire - * call stack gracefully ignores the page and returns. - * -- wli - */ - smp_mb(); - mapping = page_mapping(page); - if (mapping && mapping->a_ops && mapping->a_ops->sync_page) - mapping->a_ops->sync_page(page); io_schedule(); return 0; } @@ -478,12 +448,6 @@ struct page *__page_cache_alloc(gfp_t gf EXPORT_SYMBOL(__page_cache_alloc); #endif -static int __sleep_on_page_lock(void *word) -{ - io_schedule(); - return 0; -} - /* * In order to wait for pages to become available there must be * waitqueues associated with pages. By using a hash table of @@ -511,7 +475,7 @@ void fastcall wait_on_page_bit(struct pa DEFINE_WAIT_BIT(wait, &page->flags, bit_nr); if (test_bit(bit_nr, &page->flags)) - __wait_on_bit(page_waitqueue(page), &wait, sync_page, + __wait_on_bit(page_waitqueue(page), &wait, sleep_on_page, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(wait_on_page_bit); @@ -558,32 +522,16 @@ EXPORT_SYMBOL(end_page_writeback); /** * __lock_page - get a lock on the page, assuming we need to sleep to get it * @page: the page to lock - * - * Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary. If some - * random driver's requestfn sets TASK_RUNNING, we could busywait. However - * chances are that on the second loop, the block layer's plug list is empty, - * so sync_page() will then return in state TASK_UNINTERRUPTIBLE. */ void fastcall __lock_page(struct page *page) { DEFINE_WAIT_BIT(wait, &page->flags, PG_locked); - __wait_on_bit_lock(page_waitqueue(page), &wait, sync_page, + __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page, TASK_UNINTERRUPTIBLE); } EXPORT_SYMBOL(__lock_page); -/* - * Variant of lock_page that does not require the caller to hold a reference - * on the page's mapping. 
diff --git a/mm/filemap.c b/mm/filemap.c
index 0041484..70c7bc9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -133,38 +133,8 @@ void remove_from_page_cache(struct page
 	write_unlock_irq(&mapping->tree_lock);
 }
 
-static int sync_page(void *word)
+static int sleep_on_page(void *word)
 {
-	struct address_space *mapping;
-	struct page *page;
-
-	page = container_of((unsigned long *)word, struct page, flags);
-
-	/*
-	 * page_mapping() is being called without PG_locked held.
-	 * Some knowledge of the state and use of the page is used to
-	 * reduce the requirements down to a memory barrier.
-	 * The danger here is of a stale page_mapping() return value
-	 * indicating a struct address_space different from the one it's
-	 * associated with when it is associated with one.
-	 * After smp_mb(), it's either the correct page_mapping() for
-	 * the page, or an old page_mapping() and the page's own
-	 * page_mapping() has gone NULL.
-	 * The ->sync_page() address_space operation must tolerate
-	 * page_mapping() going NULL. By an amazing coincidence,
-	 * this comes about because none of the users of the page
-	 * in the ->sync_page() methods make essential use of the
-	 * page_mapping(), merely passing the page down to the backing
-	 * device's unplug functions when it's non-NULL, which in turn
-	 * ignore it for all cases but swap, where only page_private(page) is
-	 * of interest. When page_mapping() does go NULL, the entire
-	 * call stack gracefully ignores the page and returns.
-	 * -- wli
-	 */
-	smp_mb();
-	mapping = page_mapping(page);
-	if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
-		mapping->a_ops->sync_page(page);
 	io_schedule();
 	return 0;
 }
@@ -478,12 +448,6 @@ struct page *__page_cache_alloc(gfp_t gf
 EXPORT_SYMBOL(__page_cache_alloc);
 #endif
 
-static int __sleep_on_page_lock(void *word)
-{
-	io_schedule();
-	return 0;
-}
-
 /*
  * In order to wait for pages to become available there must be
  * waitqueues associated with pages. By using a hash table of
@@ -511,7 +475,7 @@ void fastcall wait_on_page_bit(struct pa
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
 
 	if (test_bit(bit_nr, &page->flags))
-		__wait_on_bit(page_waitqueue(page), &wait, sync_page,
+		__wait_on_bit(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(wait_on_page_bit);
@@ -558,32 +522,16 @@ EXPORT_SYMBOL(end_page_writeback);
 /**
  * __lock_page - get a lock on the page, assuming we need to sleep to get it
  * @page: the page to lock
- *
- * Ugly. Running sync_page() in state TASK_UNINTERRUPTIBLE is scary.  If some
- * random driver's requestfn sets TASK_RUNNING, we could busywait.  However
- * chances are that on the second loop, the block layer's plug list is empty,
- * so sync_page() will then return in state TASK_UNINTERRUPTIBLE.
  */
 void fastcall __lock_page(struct page *page)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
 
-	__wait_on_bit_lock(page_waitqueue(page), &wait, sync_page,
+	__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
 							TASK_UNINTERRUPTIBLE);
 }
 EXPORT_SYMBOL(__lock_page);
 
-/*
- * Variant of lock_page that does not require the caller to hold a reference
- * on the page's mapping.
- */
-void fastcall __lock_page_nosync(struct page *page)
-{
-	DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
-	__wait_on_bit_lock(page_waitqueue(page), &wait, __sleep_on_page_lock,
-					TASK_UNINTERRUPTIBLE);
-}
-
 /**
  * find_get_page - find and get a page reference
  * @mapping: the address_space to search
@@ -880,6 +828,8 @@ void do_generic_mapping_read(struct addr
 	last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
 	offset = *ppos & ~PAGE_CACHE_MASK;
 
+	blk_plug_current();
+
 	isize = i_size_read(inode);
 	if (!isize)
 		goto out;
@@ -1065,6 +1015,8 @@ out:
 		page_cache_release(cached_page);
 	if (filp)
 		file_accessed(filp);
+
+	blk_unplug_current();
 }
 EXPORT_SYMBOL(do_generic_mapping_read);
 
@@ -2067,6 +2019,8 @@ generic_file_buffered_write(struct kiocb
 		buf = cur_iov->iov_base + iov_base;
 	}
 
+	blk_plug_current();
+
 	do {
 		unsigned long index;
 		unsigned long offset;
@@ -2172,6 +2126,8 @@ zero_length_segment:
 	if (cached_page)
 		page_cache_release(cached_page);
 
+	blk_unplug_current();
+
 	/*
 	 * For now, when the user asks for O_SYNC, we'll actually give O_DSYNC
 	 */
diff --git a/mm/nommu.c b/mm/nommu.c
index 23fb033..310d3af 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1171,10 +1171,6 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
-void swap_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
-{
-}
-
 unsigned long arch_get_unmapped_area(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long pgoff, unsigned long flags)
 {
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f7e088f..f470b47 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -685,12 +685,18 @@ int do_writepages(struct address_space *
 
 	if (wbc->nr_to_write <= 0)
 		return 0;
+
+	blk_plug_current();
+
 	wbc->for_writepages = 1;
 	if (mapping->a_ops->writepages)
 		ret = mapping->a_ops->writepages(mapping, wbc);
 	else
 		ret = generic_writepages(mapping, wbc);
 	wbc->for_writepages = 0;
+
+	blk_unplug_current();
+
 	return ret;
 }
 
@@ -839,7 +845,7 @@ int set_page_dirty_lock(struct page *pag
 {
 	int ret;
 
-	lock_page_nosync(page);
+	lock_page(page);
 	ret = set_page_dirty(page);
 	unlock_page(page);
 	return ret;
diff --git a/mm/readahead.c b/mm/readahead.c
index 93d9ee6..e88b5be 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -16,16 +16,10 @@
 #include
 #include
 #include
 
-void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
-{
-}
-EXPORT_SYMBOL(default_unplug_io_fn);
-
 struct backing_dev_info default_backing_dev_info = {
 	.ra_pages	= (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
-	.unplug_io_fn	= default_unplug_io_fn,
 };
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
@@ -167,6 +161,8 @@ static int read_pages(struct address_spa
 	struct pagevec lru_pvec;
 	int ret;
 
+	blk_plug_current();
+
 	if (mapping->a_ops->readpages) {
 		ret = mapping->a_ops->readpages(filp, mapping, pages, nr_pages);
 		/* Clean up the remaining pages */
@@ -188,7 +184,10 @@ static int read_pages(struct address_spa
 	}
 	pagevec_lru_add(&lru_pvec);
 	ret = 0;
+
 out:
+	blk_unplug_current();
+
 	return ret;
 }
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 8820530..9eb8813 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -186,7 +186,6 @@ static struct vm_operations_struct shmem
 static struct backing_dev_info shmem_backing_dev_info  __read_mostly = {
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
-	.unplug_io_fn	= default_unplug_io_fn,
 };
 
 static LIST_HEAD(shmem_swaplist);
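For reference (not part of the patch): the blk_plug_current()/blk_unplug_current() calls added in the hunks above follow a simple bracket pattern around a batch of submissions. A hedged sketch of that pattern, where submit_one_page() is a made-up stand-in for whatever per-page submission the caller performs (readpage, writepage, and so on):

/*
 * Illustrative sketch only -- not part of the patch.
 */
static void submit_page_batch(struct page **pages, int nr_pages)
{
	int i;

	blk_plug_current();		/* queue requests on the per-task plug list */

	for (i = 0; i < nr_pages; i++)
		submit_one_page(pages[i]);	/* hypothetical submission helper */

	blk_unplug_current();		/* hand the whole batch to the device queue */
}
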
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 5f7cf2a..62096fe 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -21,19 +21,16 @@
 #include
 
 /*
  * swapper_space is a fiction, retained to simplify the path through
- * vmscan's shrink_list, to make sync_page look nicer, and to allow
- * future use of radix_tree tags in the swap cache.
+ * vmscan's shrink_list.
  */
 static const struct address_space_operations swap_aops = {
 	.writepage	= swap_writepage,
-	.sync_page	= block_sync_page,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 	.migratepage	= migrate_page,
 };
 
 static struct backing_dev_info swap_backing_dev_info = {
 	.capabilities	= BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
-	.unplug_io_fn	= swap_unplug_io_fn,
 };
 
 struct address_space swapper_space = {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a2d9bb4..001f3e3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,39 +48,6 @@ static struct swap_info_struct swap_info
 
 static DEFINE_MUTEX(swapon_mutex);
 
-/*
- * We need this because the bdev->unplug_fn can sleep and we cannot
- * hold swap_lock while calling the unplug_fn. And swap_lock
- * cannot be turned into a mutex.
- */
-static DECLARE_RWSEM(swap_unplug_sem);
-
-void swap_unplug_io_fn(struct backing_dev_info *unused_bdi, struct page *page)
-{
-	swp_entry_t entry;
-
-	down_read(&swap_unplug_sem);
-	entry.val = page_private(page);
-	if (PageSwapCache(page)) {
-		struct block_device *bdev = swap_info[swp_type(entry)].bdev;
-		struct backing_dev_info *bdi;
-
-		/*
-		 * If the page is removed from swapcache from under us (with a
-		 * racy try_to_unuse/swapoff) we need an additional reference
-		 * count to avoid reading garbage from page_private(page) above.
-		 * If the WARN_ON triggers during a swapoff it maybe the race
-		 * condition and it's harmless. However if it triggers without
-		 * swapoff it signals a problem.
-		 */
-		WARN_ON(page_count(page) <= 1);
-
-		bdi = bdev->bd_inode->i_mapping->backing_dev_info;
-		blk_run_backing_dev(bdi, page);
-	}
-	up_read(&swap_unplug_sem);
-}
-
 #define SWAPFILE_CLUSTER	256
 #define LATENCY_LIMIT		256
 
@@ -1256,10 +1223,6 @@ asmlinkage long sys_swapoff(const char _
 		goto out_dput;
 	}
 
-	/* wait for any unplug function to finish */
-	down_write(&swap_unplug_sem);
-	up_write(&swap_unplug_sem);
-
 	destroy_swap_extents(p);
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0655d5f..4dec6da 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1017,6 +1017,7 @@ static unsigned long shrink_zones(int pr
  */
 unsigned long try_to_free_pages(struct zone **zones, gfp_t gfp_mask)
 {
+	int plugdepth;
 	int priority;
 	int ret = 0;
 	unsigned long total_scanned = 0;
@@ -1034,6 +1035,8 @@ unsigned long try_to_free_pages(struct z
 
 	count_vm_event(ALLOCSTALL);
 
+	plugdepth = blk_unplug_current_nested();
+
 	for (i = 0; zones[i] != NULL; i++) {
 		struct zone *zone = zones[i];
 
@@ -1098,6 +1101,9 @@ out:
 		zone->prev_priority = priority;
 	}
 
+	blk_plug_current_nested(plugdepth);
+
 	return ret;
 }
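Similarly (again not part of the patch), the vmscan.c hunks show the nested form of the same pattern: a path that may be entered while the caller already holds a plug flushes that plug first, does its work, then restores the caller's plug depth. A minimal sketch under that assumption, where do_reclaim_work() is a made-up placeholder:

/*
 * Illustrative sketch only -- not part of the patch.
 */
static void reclaim_with_nested_plug(void)
{
	int plugdepth;

	plugdepth = blk_unplug_current_nested();	/* flush caller's plug, remember depth */

	do_reclaim_work();				/* may block and issue its own I/O */

	blk_plug_current_nested(plugdepth);		/* restore caller's plug state */
}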