GIT 7b93ed139ea0fa65265c12d8d953816a637040a4 git://lost.foo-projects.org/~dwillia2/git/iop#md-for-linus

commit
Author: Dan Williams
Date:   Thu Aug 2 23:11:20 2007 -0700

raid5: use stripe_queues to prioritize the "most deserving" requests (take5)

Overview:
Taking advantage of the stripe_queue/stripe_head separation, this patch
implements a queue in front of the stripe cache. A stripe_queue pool accepts
incoming requests. As requests are attached, the weight of the queue object
is updated. A workqueue (raid456_cache_arbiter) is introduced to control the
flow of requests to the stripe cache. Pressure (the weight of the queue
object) can push requests to be processed by the cache (raid5d). raid5d also
pulls requests when its 'handle' list is empty.

The cache arbiter prioritizes reads and full-stripe writes, as there is no
performance to be gained by delaying them. Sub-stripe-width writes are
handled as before by a 'preread-active' mechanism. The difference now is
that full-stripe writes can pass delayed writes waiting for the cache.
Previously there was no opportunity to make this decision: sub-width writes
would occupy a stripe cache entry from the time they entered the delayed
list until they finished processing.

Flow:
1/ make_request calls get_active_queue, add_queue_bio, and handle_queue
2/ handle_queue checks whether this stripe_queue is already attached to a
   stripe_head; if so, the queue is bypassed and the stripe is handled
   immediately, done. Otherwise, handle_queue checks the incoming requests
   and flags the queue as overwrite, read, sub-width-write, or delayed.
3/ __release_queue is called and, depending on the determination made by
   handle_queue, the stripe_queue is placed on one of three lists (io-hi,
   io-lo, and delayed). Then the cache arbiter is woken up.
4/ raid456_cache_arbiter runs and attaches stripe_queues to stripe_heads in
   priority order, io-hi then io-lo. If the raid device is not plugged and
   there is nothing else to do it will transition delayed queues to the
   io-lo list. Since there are more stripe_queues in the system than
   stripe_heads we will end up sleeping in get_active_stripe. While
   sleeping, requests can still enter the queue and hopefully promote
   sub-width writes to full-stripe writes.

Details:
* the number of stripe_queue objects in the pool is set at 2x the maximum
  number of stripes in the stripe_cache (STRIPE_QUEUE_SIZE)
* stripe_queues are tracked in a red-black tree
* a stripe_queue is considered active while it has STRIPE_QUEUE_HANDLE set,
  or it is attached to a stripe_head
* once a stripe_queue is activated it is not placed on the inactive list
  until it has been serviced by the stripe cache

Changes in take2:
* separate write and overwrite in the io_weight fields, i.e. an overwrite
  no longer implies a write
* rename queue_weight -> io_weight
* fix r5_io_weight_size
* implement support for sysfs changes to stripe_cache_size
* delete and re-add stripe queues from their management lists rather than
  moving them. This guarantees that when the count is non-zero the queue is
  not on a list (identical to stripe_head handling)
* __wait_for_inactive_queue was incorrectly using conf->inactive_blocked,
  which is exclusively for the stripe_cache. Added
  conf->inactive_queue_blocked and set the routine to wait until the number
  of active queues drops below 7/8's of the total before unblocking
  processing. 7/8's arises from the following: get_active_stripe waits for
  3/4's of the stripe cache, i.e. 1/4 inactive, and
  conf->max_nr_stripes / 4 == conf->max_nr_stripes * STRIPE_QUEUE_SIZE / 8
  iff STRIPE_QUEUE_SIZE == 2
* change raid5_congested to report whether the queue is congested and not
  the cache

Changes in take3:
* rename raid5qd => raid456_cache_arbiter
* make raid456_cache_arbiter the only thread that can block on a call to
  get_active_stripe; this ensures proper ordering of attachments
* added wait_for_cache_attach for routines outside the i/o path (like
  resync and reshape) to request servicing from raid456_cache_arbiter
* change cache attachment priorities to io_hi (full-stripe-writes) and
  io_lo (reads and sub-width-stripe-writes)
* changed handle_queue to attempt a non-blocking cache attachment; this
  recovers some of the lost read throughput from take2
* move flags back to r5dev to stay in sync with the buffers
* use sq->overwrite to set R5_OVERWRITE; fixes a data corruption issue with
  stale overwrite flags when attempting to use sq->overwrite in
  handle_stripe

Changes in take4:
* disconnect sq->sh and sh->sq at the end of __release_queue
* remove the implicit get_active_stripe from get_active_queue to ensure
  that writes that need to be delayed go through raid456_cache_arbiter.
  This fixes the performance regression caused by increasing the stripe
  cache size.
* kill __get_active_stripe, not needed

Changes in take5:
* fix retry_aligned_read... check for null returns from get_active_queue
* workqueue leak fix, dmonakhov@openvz.org

Signed-off-by: Dan Williams
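To make the servicing order described in the Flow section concrete, here is a
tiny userspace model of "io-hi before io-lo, with delayed queues promoted only
when the arbiter is otherwise idle and the device is unplugged". It is not the
patch's raid456_cache_arbiter; every type and helper below is invented for the
illustration.

/* Standalone model of the arbiter's priority order -- illustration only,
 * not the kernel code from the patch below.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct sq {                 /* stands in for struct stripe_queue */
	const char *name;
	struct sq *next;
};

struct lists {
	struct sq *io_hi;   /* full-stripe writes                  */
	struct sq *io_lo;   /* reads and sub-width writes          */
	struct sq *delayed; /* sub-width writes pending activation */
	bool plugged;
};

static struct sq *pop(struct sq **head)
{
	struct sq *sq = *head;
	if (sq)
		*head = sq->next;
	return sq;
}

static void arbiter(struct lists *l)
{
	for (;;) {
		struct sq *sq = pop(&l->io_hi);     /* 1/ io-hi first */

		if (!sq)
			sq = pop(&l->io_lo);        /* 2/ then io-lo  */

		if (!sq) {
			/* 3/ idle and unplugged: promote delayed -> io-lo */
			if (!l->plugged && l->delayed) {
				l->io_lo = l->delayed;
				l->delayed = NULL;
				continue;
			}
			break;                      /* nothing to do  */
		}
		/* in the real driver this is where get_active_stripe()
		 * may block waiting for a free stripe_head; while it
		 * sleeps, more bios can land on queued stripe_queues
		 * and turn sub-width writes into full-stripe writes
		 */
		printf("attach %s to a stripe_head\n", sq->name);
	}
}

int main(void)
{
	struct sq d = { "delayed-write", NULL };
	struct sq r = { "read", NULL };
	struct sq w = { "full-stripe-write", NULL };
	struct lists l = { .io_hi = &w, .io_lo = &r, .delayed = &d,
			   .plugged = false };

	arbiter(&l); /* prints full-stripe-write, read, then delayed-write */
	return 0;
}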
commit eb3e0df0faa5b310de66ab1d59ab71d76fc64ec9
Author: Dan Williams
Date:   Thu Aug 2 23:11:20 2007 -0700

raid5: add the stripe_queue object for tracking raid io requests (take2)

The raid5 stripe cache object, struct stripe_head, serves two purposes:
1/ frontend: queuing incoming requests
2/ backend: transitioning requests through the cache state machine to the
   backing devices

The problem with this model is that queuing decisions are directly tied to
cache availability. There is no facility to determine that a request or
group of requests 'deserves' usage of the cache and disks at any given
time.

This patch separates the object members needed for queuing from the object
members used for caching. The stripe_queue object takes over the incoming
bio lists as well as the buffer state flags.

The following fields are moved from struct stripe_head to struct
stripe_queue:
    raid5_private_data *raid_conf
    int pd_idx
    spinlock_t lock
    int bm_seq

The following fields are moved from struct r5dev to struct r5_queue_dev:
    sector_t sector
    struct bio *toread, *towrite

This patch lays the groundwork for, but does not implement, the facility to
have more queue objects in the system than available stripes; currently
this remains a 1:1 relationship. In other words, this patch just moves
fields around and does not implement new logic.
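To visualize the split, here is a rough C sketch of how the two objects relate
after this patch. It is pieced together from the field lists above and the
raid5.c hunks below rather than copied from include/linux/raid/raid5.h, and it
omits the fields this mail does not mention.

/* Simplified sketch of the stripe_head/stripe_queue split -- not the
 * verbatim raid5.h declarations; fields not mentioned in this mail are
 * omitted.
 */
#include <linux/list.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct r5_queue_dev {
	sector_t sector;               /* moved from struct r5dev */
	struct bio *toread, *towrite;  /* moved from struct r5dev */
};

struct stripe_queue {
	sector_t sector;
	struct rb_node rb_node;        /* active queues live in a red-black tree */
	struct list_head list_node;    /* io_hi / io_lo / delayed / inactive lists */
	spinlock_t lock;                       /* moved from struct stripe_head */
	struct raid5_private_data *raid_conf;  /* moved from struct stripe_head */
	int pd_idx;                            /* moved from struct stripe_head */
	int bm_seq;                            /* moved from struct stripe_head */
	struct stripe_head *sh;        /* set once attached to the stripe cache */
	/* io_weight bitmaps, r5_io_weight_size(disks) bytes each */
	unsigned long *to_read, *to_write, *overwrite;
	struct r5_queue_dev dev[1];    /* really 'disks' entries */
};

struct stripe_head {
	sector_t sector;
	struct stripe_queue *sq;       /* back-pointer to the queue object */
	/* pages, per-buffer flags, read/written bios, and ops state stay
	 * here, in struct r5dev */
};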
--- Performance Data ---

Unit information
================
File size = megabytes
Blk Size  = bytes
Num Thr   = number of threads
Avg Rate  = relative throughput
CPU%      = relative percentage of CPU used during the test
CPU Eff   = Rate divided by CPU% - relative throughput per cpu load

Configuration
=============
Platform: 1200MHz iop348 with 4-disk sata_vsc array

mdadm --create /dev/md0 /dev/sd[abcd] -n 4 -l 5
mkfs.ext2 /dev/md0
mount /dev/md0 /mnt/raid
tiobench --size 2048 --numruns 5 --block 4096 --block 131072 --dir /mnt/raid

Sequential Reads
                File    Blk     Num  Avg     Maximum  CPU
Identifier      Size    Size    Thr  Rate    (CPU%)   Eff
--------------- ------- ------- ---  ------  -------  -----
2.6.22-iop1     2048    4096      1    -1%       2%     -3%
2.6.22-iop1     2048    4096      2   -37%     -34%     -5%
2.6.22-iop1     2048    4096      4   -22%     -19%     -3%
2.6.22-iop1     2048    4096      8    -3%      -3%     -1%
2.6.22-iop1     2048    131072    1     1%      -1%      2%
2.6.22-iop1     2048    131072    2   -11%     -11%     -1%
2.6.22-iop1     2048    131072    4    25%      20%      4%
2.6.22-iop1     2048    131072    8     8%       6%      2%

Sequential Writes
                File    Blk     Num  Avg     Maximum  CPU
Identifier      Size    Size    Thr  Rate    (CPU%)   Eff
--------------- ------- ------- ---  ------  -------  -----
2.6.22-iop1     2048    4096      1    26%      29%     -2%
2.6.22-iop1     2048    4096      2    40%      43%     -2%
2.6.22-iop1     2048    4096      4    24%       7%     16%
2.6.22-iop1     2048    4096      8     6%     -11%     19%
2.6.22-iop1     2048    131072    1    66%      65%      0%
2.6.22-iop1     2048    131072    2    41%      33%      6%
2.6.22-iop1     2048    131072    4    23%      -8%     34%
2.6.22-iop1     2048    131072    8    13%     -24%     49%

The read numbers in this take have improved from a 14% average decline to a
5% average decline. However, it is still a mystery why any significant
variance shows up at all, because most reads should completely bypass the
stripe_cache.

Here is blktrace data for a component disk while running the following:

for i in `seq 1 5`; do dd if=/dev/zero of=/dev/md0 bs=1024k count=1024; done

Pre-patch:
CPU0 (sda):
 Reads Queued:        7965,   31860KiB  Writes Queued:     437458,    1749MiB
 Read Dispatches:      881,   31860KiB  Write Dispatches:   26405,    1749MiB
 Reads Requeued:         0              Writes Requeued:        0
 Reads Completed:      881,   31860KiB  Writes Completed:   26415,    1749MiB
 Read Merges:         6955,   27820KiB  Write Merges:      411007,    1644MiB
 Read depth:             2              Write depth:            2
 IO unplugs:           176              Timer unplugs:        176

Post-patch:
CPU0 (sda):
 Reads Queued:       36255,  145020KiB  Writes Queued:     437727,    1750MiB
 Read Dispatches:     1960,  145020KiB  Write Dispatches:    6672,    1750MiB
 Reads Requeued:         0              Writes Requeued:        0
 Reads Completed:     1960,  145020KiB  Writes Completed:    6682,    1750MiB
 Read Merges:        34235,  136940KiB  Write Merges:      430409,    1721MiB
 Read depth:             2              Write depth:            2
 IO unplugs:           423              Timer unplugs:        423

The performance win is coming from improved merging and not from reduced
reads as previously assumed. Note that with blktrace enabled the throughput
comes in at ~98MB/s compared to ~120MB/s without. Pre-patch throughput
hovers at ~85MB/s for this dd command.

Changes in take2:
* leave the flags with the buffers; this prevents a data corruption issue
  whereby stale buffer state flags are attached to newly initialized
  buffers

Signed-off-by: Dan Williams

 drivers/md/raid5.c         | 1487 ++++++++++++++++++++++++++++++++------------
 include/linux/raid/raid5.h |   87 ++-
 2 files changed, 1167 insertions(+), 407 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2aff4be..537f53a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -31,7 +31,7 @@
  * conf->bm_flush is the number of the last batch that was closed to
  *    new additions.
* When we discover that we will need to write to any block in a stripe - * (in add_stripe_bio) we update the in-memory bitmap and record in sh->bm_seq + * (in add_stripe_bio) we update the in-memory bitmap and record in sq->bm_seq * the number of the batch it will be in. This is bm_flush+1. * When we are ready to do a write, if that batch hasn't been written yet, * we plug the array and queue the stripe for later. @@ -65,6 +65,7 @@ #define STRIPE_SECTORS (STRIPE_SIZE>>9) #define IO_THRESHOLD 1 #define NR_HASH (PAGE_SIZE / sizeof(struct hlist_head)) #define HASH_MASK (NR_HASH - 1) +#define STRIPE_QUEUE_SIZE 2 /* multiple of nr_stripes */ #define stripe_hash(conf, sect) (&((conf)->stripe_hashtbl[((sect) >> STRIPE_SHIFT) & HASH_MASK])) @@ -78,6 +79,8 @@ #define stripe_hash(conf, sect) (&((conf * of the current stripe+device */ #define r5_next_bio(bio, sect) ( ( (bio)->bi_sector + ((bio)->bi_size>>9) < sect + STRIPE_SECTORS) ? (bio)->bi_next : NULL) +#define r5_io_weight_size(devs) (sizeof(unsigned long) * \ + (ALIGN(devs, BITS_PER_LONG) / BITS_PER_LONG)) /* * The following can be used to debug the driver */ @@ -122,44 +125,113 @@ static void return_io(struct bio *return static void print_raid5_conf (raid5_conf_t *conf); +/* __release_queue - route the stripe_queue based on pending i/o's. The + * queue object is allowed to bounce around between 4 lists up until + * it is attached to a stripe_head. The lists in order of priority are: + * 1/ overwrite: all data blocks are set to be overwritten, no prereads + * 2/ unaligned_read: read requests that get past chunk_aligned_read + * 3/ subwidth_write: write requests that require prereading + * 4/ delayed_q: write requests pending activation + */ +static struct stripe_queue init_sq; /* sq for newborn stripe_heads */ +static struct stripe_head init_sh; /* sh for newborn stripe_queues */ +static void __release_queue(raid5_conf_t *conf, struct stripe_queue *sq) +{ + if (atomic_dec_and_test(&sq->count)) { + BUG_ON(!list_empty(&sq->list_node)); + BUG_ON(atomic_read(&conf->active_queues) == 0); + if (test_bit(STRIPE_QUEUE_HANDLE, &sq->state)) { + BUG_ON(sq->sh); + if (test_bit(STRIPE_QUEUE_IO_HI, &sq->state)) { + list_add_tail(&sq->list_node, + &conf->io_hi_queue); + queue_work(conf->workqueue, + &conf->stripe_queue_work); + } else if (test_bit(STRIPE_QUEUE_IO_LO, &sq->state)) { + list_add_tail(&sq->list_node, + &conf->io_lo_queue); + queue_work(conf->workqueue, + &conf->stripe_queue_work); + } else if (test_bit(STRIPE_QUEUE_DELAYED, &sq->state)) { + list_add_tail(&sq->list_node, + &conf->delayed_q_list); + blk_plug_device(conf->mddev->queue); + } else { + /* nothing to handle */ + sq->sh = &init_sh; + clear_bit(STRIPE_QUEUE_HANDLE, &sq->state); + goto nothing_to_handle; /* fall through */ + } + } else { +nothing_to_handle: + BUG_ON(sq->sh == NULL); + atomic_dec(&conf->active_queues); + if (test_and_clear_bit(STRIPE_QUEUE_PREREAD_ACTIVE, + &sq->state)) { + atomic_dec(&conf->preread_active_queues); + if (atomic_read(&conf->preread_active_queues) < + IO_THRESHOLD) + queue_work(conf->workqueue, + &conf->stripe_queue_work); + } + if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) { + list_add_tail(&sq->list_node, + &conf->inactive_queue_list); + wake_up(&conf->wait_for_queue); + if (conf->retry_read_aligned) + md_wakeup_thread(conf->mddev->thread); + } + /* one a queue goes inactive it can be reattached + * to a different stripe_head + */ + sq->sh->sq = NULL; + sq->sh = NULL; + } + } +} + static void __release_stripe(raid5_conf_t *conf, struct stripe_head *sh) 
{ + struct stripe_queue *sq = sh->sq; + if (atomic_dec_and_test(&sh->count)) { BUG_ON(!list_empty(&sh->lru)); BUG_ON(atomic_read(&conf->active_stripes)==0); if (test_bit(STRIPE_HANDLE, &sh->state)) { - if (test_bit(STRIPE_DELAYED, &sh->state)) { - list_add_tail(&sh->lru, &conf->delayed_list); - blk_plug_device(conf->mddev->queue); - } else if (test_bit(STRIPE_BIT_DELAY, &sh->state) && - sh->bm_seq - conf->seq_write > 0) { + if (test_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state) && + sq->bm_seq - conf->seq_write > 0) { list_add_tail(&sh->lru, &conf->bitmap_list); blk_plug_device(conf->mddev->queue); } else { - clear_bit(STRIPE_BIT_DELAY, &sh->state); + clear_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state); list_add_tail(&sh->lru, &conf->handle_list); } md_wakeup_thread(conf->mddev->thread); } else { BUG_ON(sh->ops.pending); - if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) { - atomic_dec(&conf->preread_active_stripes); - if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) - md_wakeup_thread(conf->mddev->thread); - } atomic_dec(&conf->active_stripes); - if (!test_bit(STRIPE_EXPANDING, &sh->state)) { + __release_queue(conf, sq); + if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) { list_add_tail(&sh->lru, &conf->inactive_list); wake_up(&conf->wait_for_stripe); - if (conf->retry_read_aligned) - md_wakeup_thread(conf->mddev->thread); } } } } + +static void release_queue(struct stripe_queue *sq) +{ + raid5_conf_t *conf = sq->raid_conf; + unsigned long flags; + + spin_lock_irqsave(&conf->device_lock, flags); + __release_queue(conf, sq); + spin_unlock_irqrestore(&conf->device_lock, flags); +} + static void release_stripe(struct stripe_head *sh) { - raid5_conf_t *conf = sh->raid_conf; + raid5_conf_t *conf = sh->sq->raid_conf; unsigned long flags; spin_lock_irqsave(&conf->device_lock, flags); @@ -188,9 +260,10 @@ static inline void insert_hash(raid5_con /* find an idle stripe, make sure it is unhashed, and return it. 
*/ -static struct stripe_head *get_free_stripe(raid5_conf_t *conf) +static struct stripe_head *get_free_stripe(struct stripe_queue *sq) { struct stripe_head *sh = NULL; + raid5_conf_t *conf = sq->raid_conf; struct list_head *first; CHECK_DEVLOCK(); @@ -201,10 +274,29 @@ static struct stripe_head *get_free_stri list_del_init(first); remove_hash(sh); atomic_inc(&conf->active_stripes); + atomic_inc(&sq->count); + sh->sq = NULL; out: return sh; } +static struct stripe_queue *get_free_queue(raid5_conf_t *conf) +{ + struct stripe_queue *sq = NULL; + struct list_head *first; + + CHECK_DEVLOCK(); + if (list_empty(&conf->inactive_queue_list)) + goto out; + first = conf->inactive_queue_list.next; + sq = list_entry(first, struct stripe_queue, list_node); + list_del_init(first); + rb_erase(&sq->rb_node, &conf->stripe_queue_tree); + atomic_inc(&conf->active_queues); +out: + return sq; +} + static void shrink_buffers(struct stripe_head *sh, int num) { struct page *p; @@ -236,23 +328,41 @@ static int grow_buffers(struct stripe_he static void raid5_build_block (struct stripe_head *sh, int i); -static void init_stripe(struct stripe_head *sh, sector_t sector, int pd_idx, int disks) +#if BITS_PER_LONG == 32 +#define hweight hweight32 +#else +#define hweight hweight64 +#endif +static unsigned long io_weight(unsigned long *bitmap, int disks) { - raid5_conf_t *conf = sh->raid_conf; + unsigned long weight = hweight(*bitmap); + + for (bitmap++; disks > BITS_PER_LONG; disks -= BITS_PER_LONG, bitmap++) + weight += hweight(*bitmap); + + return weight; +} + +static void +init_stripe(struct stripe_head *sh, struct stripe_queue *sq, int disks) +{ + raid5_conf_t *conf = sq->raid_conf; + sector_t sector = sq->sector; int i; + pr_debug("init_stripe called, stripe %llu\n", + (unsigned long long)sector); + BUG_ON(atomic_read(&sh->count) != 0); BUG_ON(test_bit(STRIPE_HANDLE, &sh->state)); BUG_ON(sh->ops.pending || sh->ops.ack || sh->ops.complete); + sh->sq = sq; CHECK_DEVLOCK(); - pr_debug("init_stripe called, stripe %llu\n", - (unsigned long long)sh->sector); remove_hash(sh); sh->sector = sector; - sh->pd_idx = pd_idx; sh->state = 0; sh->disks = disks; @@ -260,12 +370,11 @@ static void init_stripe(struct stripe_he for (i = sh->disks; i--; ) { struct r5dev *dev = &sh->dev[i]; - if (dev->toread || dev->read || dev->towrite || dev->written || + if (dev->read || dev->written || test_bit(R5_LOCKED, &dev->flags)) { - printk(KERN_ERR "sector=%llx i=%d %p %p %p %p %d\n", - (unsigned long long)sh->sector, i, dev->toread, - dev->read, dev->towrite, dev->written, - test_bit(R5_LOCKED, &dev->flags)); + printk(KERN_ERR "sector=%llx i=%d %p %p %d\n", + (unsigned long long)sector, i, dev->read, + dev->written, test_bit(R5_LOCKED, &dev->flags)); BUG(); } dev->flags = 0; @@ -288,62 +397,257 @@ static struct stripe_head *__find_stripe return NULL; } +static struct stripe_queue *__find_queue(raid5_conf_t *conf, sector_t sector) +{ + struct rb_node *n = conf->stripe_queue_tree.rb_node; + struct stripe_queue *sq; + + pr_debug("%s, sector %llu\n", __FUNCTION__, (unsigned long long)sector); + while (n) { + sq = rb_entry(n, struct stripe_queue, rb_node); + + if (sector < sq->sector) + n = n->rb_left; + else if (sector > sq->sector) + n = n->rb_right; + else + return sq; + } + pr_debug("__queue %llu not in tree\n", (unsigned long long)sector); + return NULL; +} + +static struct stripe_queue * +__insert_active_sq(raid5_conf_t *conf, sector_t sector, struct rb_node *node) +{ + struct rb_node **p = &conf->stripe_queue_tree.rb_node; + struct rb_node 
*parent = NULL; + struct stripe_queue *sq; + + while (*p) { + parent = *p; + sq = rb_entry(parent, struct stripe_queue, rb_node); + + if (sector < sq->sector) + p = &(*p)->rb_left; + else if (sector > sq->sector) + p = &(*p)->rb_right; + else + return sq; + } + + rb_link_node(node, parent, p); + + return NULL; +} + +static inline struct stripe_queue * +insert_active_sq(raid5_conf_t *conf, sector_t sector, struct rb_node *node) +{ + struct stripe_queue *sq = __insert_active_sq(conf, sector, node); + + if (sq) + goto out; + rb_insert_color(node, &conf->stripe_queue_tree); + out: + return sq; +} + +static sector_t compute_blocknr(raid5_conf_t *conf, int raid_disks, + sector_t sector, int pd_idx, int i); + +static void +pickup_cached_stripe(struct stripe_head *sh, struct stripe_queue *sq) +{ + raid5_conf_t *conf = sq->raid_conf; + + if (atomic_read(&sh->count)) + BUG_ON(!list_empty(&sh->lru)); + else { + if (!test_bit(STRIPE_HANDLE, &sh->state)) { + atomic_inc(&conf->active_stripes); + atomic_inc(&sq->count); + sh->sq = NULL; + } + if (list_empty(&sh->lru) && + !test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) + BUG(); + list_del_init(&sh->lru); + } +} + static void unplug_slaves(mddev_t *mddev); static void raid5_unplug_device(struct request_queue *q); -static struct stripe_head *get_active_stripe(raid5_conf_t *conf, sector_t sector, int disks, - int pd_idx, int noblock) +static void +__wait_for_inactive_stripe(raid5_conf_t *conf, struct stripe_queue *sq) { - struct stripe_head *sh; + conf->inactive_blocked = 1; + wait_event_lock_irq(conf->wait_for_stripe, + (!list_empty(&conf->inactive_list) && + (atomic_read(&conf->active_stripes) + < (conf->max_nr_stripes * 3/4) || + !conf->inactive_blocked)), + conf->device_lock, + raid5_unplug_device(conf->mddev->queue) + ); + conf->inactive_blocked = 0; +} - pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector); +static struct stripe_head * +get_active_stripe(struct stripe_queue *sq, int disks, int noblock) +{ + raid5_conf_t *conf = sq->raid_conf; + sector_t sector = sq->sector; + struct stripe_head *sh; spin_lock_irq(&conf->device_lock); + pr_debug("get_stripe, sector %llu\n", (unsigned long long)sq->sector); + do { - wait_event_lock_irq(conf->wait_for_stripe, - conf->quiesce == 0, - conf->device_lock, /* nothing */); + /* try to get a cached stripe */ sh = __find_stripe(conf, sector, disks); + + /* try to activate a new stripe */ if (!sh) { if (!conf->inactive_blocked) - sh = get_free_stripe(conf); + sh = get_free_stripe(sq); if (noblock && sh == NULL) break; - if (!sh) { - conf->inactive_blocked = 1; - wait_event_lock_irq(conf->wait_for_stripe, - !list_empty(&conf->inactive_list) && - (atomic_read(&conf->active_stripes) - < (conf->max_nr_stripes *3/4) - || !conf->inactive_blocked), - conf->device_lock, - raid5_unplug_device(conf->mddev->queue) - ); - conf->inactive_blocked = 0; - } else - init_stripe(sh, sector, pd_idx, disks); - } else { - if (atomic_read(&sh->count)) { - BUG_ON(!list_empty(&sh->lru)); - } else { - if (!test_bit(STRIPE_HANDLE, &sh->state)) - atomic_inc(&conf->active_stripes); - if (list_empty(&sh->lru) && - !test_bit(STRIPE_EXPANDING, &sh->state)) - BUG(); - list_del_init(&sh->lru); - } - } + if (!sh) + __wait_for_inactive_stripe(conf, sq); + else + init_stripe(sh, sq, disks); + } else + pickup_cached_stripe(sh, sq); } while (sh == NULL); - if (sh) + BUG_ON(sq->sector != sector); + + if (sh) { atomic_inc(&sh->count); + sh->sq = sq; + + if (sq->sh) + BUG_ON(sq->sector != sh->sector); + else { + list_del_init(&sq->list_node); 
+ clear_bit(STRIPE_QUEUE_HANDLE, &sq->state); + sq->sh = sh; + } + + BUG_ON(!list_empty(&sq->list_node)); + BUG_ON(test_bit(STRIPE_QUEUE_HANDLE, &sq->state)); + } + spin_unlock_irq(&conf->device_lock); + return sh; } +static void init_queue(struct stripe_queue *sq, sector_t sector, + int disks, int pd_idx) +{ + raid5_conf_t *conf = sq->raid_conf; + int i; + + pr_debug("%s: %llu -> %llu [%p]\n", + __FUNCTION__, (unsigned long long) sq->sector, + (unsigned long long) sector, sq); + + BUG_ON(atomic_read(&sq->count) != 0); + BUG_ON(io_weight(sq->to_read, disks)); + BUG_ON(io_weight(sq->to_write, disks)); + BUG_ON(io_weight(sq->overwrite, disks)); + BUG_ON(test_bit(STRIPE_QUEUE_HANDLE, &sq->state)); + BUG_ON(sq->sh); + + sq->state = (1 << STRIPE_QUEUE_HANDLE); + sq->sector = sector; + sq->pd_idx = pd_idx; + + for (i = disks; i--; ) { + struct r5_queue_dev *dev_q = &sq->dev[i]; + + if (dev_q->toread || dev_q->towrite) { + printk(KERN_ERR "sector=%llx i=%d %p %p\n", + (unsigned long long)sector, i, dev_q->toread, + dev_q->towrite); + BUG(); + } + dev_q->sector = compute_blocknr(conf, disks, sector, pd_idx, i); + } + + sq = insert_active_sq(conf, sector, &sq->rb_node); + if (unlikely(sq)) { + printk(KERN_ERR "%s: sq: %p sector: %llu bounced off the " + "stripe_queue rb_tree\n", __FUNCTION__, sq, + (unsigned long long) sq->sector); + BUG(); + } +} + +static void __wait_for_inactive_queue(raid5_conf_t *conf) +{ + conf->inactive_queue_blocked = 1; + wait_event_lock_irq(conf->wait_for_queue, + !list_empty(&conf->inactive_queue_list) && + (atomic_read(&conf->active_queues) + < conf->max_nr_stripes * + STRIPE_QUEUE_SIZE * 7/8 || + !conf->inactive_queue_blocked), + conf->device_lock, + /* nothing */ + ); + conf->inactive_queue_blocked = 0; +} + + +static struct stripe_queue * +get_active_queue(raid5_conf_t *conf, sector_t sector, int disks, int pd_idx, + int noblock) +{ + struct stripe_queue *sq; + + pr_debug("%s, sector %llu\n", __FUNCTION__, + (unsigned long long)sector); + + spin_lock_irq(&conf->device_lock); + + do { + wait_event_lock_irq(conf->wait_for_queue, + conf->quiesce == 0, + conf->device_lock, /* nothing */); + sq = __find_queue(conf, sector); + if (!sq) { + if (!conf->inactive_queue_blocked) + sq = get_free_queue(conf); + if (noblock && sq == NULL) + break; + if (!sq) + __wait_for_inactive_queue(conf); + else + init_queue(sq, sector, disks, pd_idx); + } else { + if (atomic_read(&sq->count)) + BUG_ON(!list_empty(&sq->list_node)); + else if (!test_and_set_bit(STRIPE_QUEUE_HANDLE, + &sq->state)) + atomic_inc(&conf->active_queues); + list_del_init(&sq->list_node); + } + } while (sq == NULL); + + if (sq) + atomic_inc(&sq->count); + + spin_unlock_irq(&conf->device_lock); + return sq; +} + + /* test_and_ack_op() ensures that we only dequeue an operation once */ #define test_and_ack_op(op, pend) \ do { \ @@ -389,12 +693,13 @@ raid5_end_write_request (struct bio *bi, static void ops_run_io(struct stripe_head *sh) { - raid5_conf_t *conf = sh->raid_conf; + struct stripe_queue *sq = sh->sq; + raid5_conf_t *conf = sq->raid_conf; int i, disks = sh->disks; might_sleep(); - for (i = disks; i--; ) { + for (i = disks; i--;) { int rw; struct bio *bi; mdk_rdev_t *rdev; @@ -513,7 +818,8 @@ static void ops_complete_biofill(void *s { struct stripe_head *sh = stripe_head_ref; struct bio *return_bi = NULL; - raid5_conf_t *conf = sh->raid_conf; + struct stripe_queue *sq = sh->sq; + raid5_conf_t *conf = sq->raid_conf; int i, more_to_read = 0; pr_debug("%s: stripe %llu\n", __FUNCTION__, @@ -522,14 +828,16 @@ static 
void ops_complete_biofill(void *s /* clear completed biofills */ for (i = sh->disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; + /* check if this stripe has new incoming reads */ - if (dev->toread) + if (dev_q->toread) more_to_read++; /* acknowledge completion of a biofill operation */ /* and check if we need to reply to a read request */ - if (test_bit(R5_Wantfill, &dev->flags) && !dev->toread) { + if (test_bit(R5_Wantfill, &dev->flags) && !dev_q->toread) { struct bio *rbi, *rbi2; clear_bit(R5_Wantfill, &dev->flags); @@ -541,8 +849,8 @@ static void ops_complete_biofill(void *s rbi = dev->read; dev->read = NULL; while (rbi && rbi->bi_sector < - dev->sector + STRIPE_SECTORS) { - rbi2 = r5_next_bio(rbi, dev->sector); + dev_q->sector + STRIPE_SECTORS) { + rbi2 = r5_next_bio(rbi, dev_q->sector); spin_lock_irq(&conf->device_lock); if (--rbi->bi_phys_segments == 0) { rbi->bi_next = return_bi; @@ -566,7 +874,8 @@ static void ops_complete_biofill(void *s static void ops_run_biofill(struct stripe_head *sh) { struct dma_async_tx_descriptor *tx = NULL; - raid5_conf_t *conf = sh->raid_conf; + struct stripe_queue *sq = sh->sq; + raid5_conf_t *conf = sq->raid_conf; int i; pr_debug("%s: stripe %llu\n", __FUNCTION__, @@ -574,17 +883,19 @@ static void ops_run_biofill(struct strip for (i = sh->disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sh->sq->dev[i]; if (test_bit(R5_Wantfill, &dev->flags)) { struct bio *rbi; spin_lock_irq(&conf->device_lock); - dev->read = rbi = dev->toread; - dev->toread = NULL; + dev->read = rbi = dev_q->toread; + dev_q->toread = NULL; + clear_bit(i, sq->to_read); spin_unlock_irq(&conf->device_lock); while (rbi && rbi->bi_sector < - dev->sector + STRIPE_SECTORS) { + dev_q->sector + STRIPE_SECTORS) { tx = async_copy_data(0, rbi, dev->page, - dev->sector, tx); - rbi = r5_next_bio(rbi, dev->sector); + dev_q->sector, tx); + rbi = r5_next_bio(rbi, dev_q->sector); } } } @@ -665,7 +976,8 @@ ops_run_prexor(struct stripe_head *sh, s /* kernel stack size limits the total number of disks */ int disks = sh->disks; struct page *xor_srcs[disks]; - int count = 0, pd_idx = sh->pd_idx, i; + struct stripe_queue *sq = sh->sq; + int count = 0, pd_idx = sq->pd_idx, i; /* existing parity data subtracted */ struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page; @@ -675,8 +987,9 @@ ops_run_prexor(struct stripe_head *sh, s for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; /* Only process blocks that are known to be uptodate */ - if (dev->towrite && test_bit(R5_Wantprexor, &dev->flags)) + if (dev_q->towrite && test_bit(R5_Wantprexor, &dev->flags)) xor_srcs[count++] = dev->page; } @@ -691,7 +1004,8 @@ static struct dma_async_tx_descriptor * ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx) { int disks = sh->disks; - int pd_idx = sh->pd_idx, i; + struct stripe_queue *sq = sh->sq; + int pd_idx = sq->pd_idx, i; /* check if prexor is active which means only process blocks * that are part of a read-modify-write (Wantprexor) @@ -703,16 +1017,17 @@ ops_run_biodrain(struct stripe_head *sh, for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; struct bio *chosen; int towrite; towrite = 0; if (prexor) { /* rmw */ - if (dev->towrite && + if (dev_q->towrite && test_bit(R5_Wantprexor, &dev->flags)) towrite = 1; } else { /* rcw */ - if (i != pd_idx && dev->towrite && + if (i != pd_idx && dev_q->towrite && 
test_bit(R5_LOCKED, &dev->flags)) towrite = 1; } @@ -720,18 +1035,19 @@ ops_run_biodrain(struct stripe_head *sh, if (towrite) { struct bio *wbi; - spin_lock(&sh->lock); - chosen = dev->towrite; - dev->towrite = NULL; + spin_lock(&sq->lock); + chosen = dev_q->towrite; + dev_q->towrite = NULL; + clear_bit(i, sq->to_write); BUG_ON(dev->written); wbi = dev->written = chosen; - spin_unlock(&sh->lock); + spin_unlock(&sq->lock); while (wbi && wbi->bi_sector < - dev->sector + STRIPE_SECTORS) { + dev_q->sector + STRIPE_SECTORS) { tx = async_copy_data(1, wbi, dev->page, - dev->sector, tx); - wbi = r5_next_bio(wbi, dev->sector); + dev_q->sector, tx); + wbi = r5_next_bio(wbi, dev_q->sector); } } } @@ -754,7 +1070,8 @@ static void ops_complete_postxor(void *s static void ops_complete_write(void *stripe_head_ref) { struct stripe_head *sh = stripe_head_ref; - int disks = sh->disks, i, pd_idx = sh->pd_idx; + struct stripe_queue *sq = sh->sq; + int disks = sh->disks, i, pd_idx = sq->pd_idx; pr_debug("%s: stripe %llu\n", __FUNCTION__, (unsigned long long)sh->sector); @@ -779,7 +1096,7 @@ ops_run_postxor(struct stripe_head *sh, int disks = sh->disks; struct page *xor_srcs[disks]; - int count = 0, pd_idx = sh->pd_idx, i; + int count = 0, pd_idx = sh->sq->pd_idx, i; struct page *xor_dest; int prexor = test_bit(STRIPE_OP_PREXOR, &sh->ops.pending); unsigned long flags; @@ -833,7 +1150,7 @@ ops_run_postxor(struct stripe_head *sh, static void ops_complete_check(void *stripe_head_ref) { struct stripe_head *sh = stripe_head_ref; - int pd_idx = sh->pd_idx; + int pd_idx = sh->sq->pd_idx; pr_debug("%s: stripe %llu\n", __FUNCTION__, (unsigned long long)sh->sector); @@ -854,7 +1171,7 @@ static void ops_run_check(struct stripe_ struct page *xor_srcs[disks]; struct dma_async_tx_descriptor *tx; - int count = 0, pd_idx = sh->pd_idx, i; + int count = 0, pd_idx = sh->sq->pd_idx, i; struct page *xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page; pr_debug("%s: stripe %llu\n", __FUNCTION__, @@ -909,56 +1226,118 @@ static void raid5_run_ops(struct stripe_ if (test_bit(STRIPE_OP_IO, &pending)) ops_run_io(sh); - if (overlap_clear) + if (overlap_clear) { for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; if (test_and_clear_bit(R5_Overlap, &dev->flags)) - wake_up(&sh->raid_conf->wait_for_overlap); + wake_up(&sh->sq->raid_conf->wait_for_overlap); } + } } static int grow_one_stripe(raid5_conf_t *conf) { struct stripe_head *sh; - sh = kmem_cache_alloc(conf->slab_cache, GFP_KERNEL); + sh = kmem_cache_alloc(conf->sh_slab_cache, GFP_KERNEL); if (!sh) return 0; memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev)); - sh->raid_conf = conf; - spin_lock_init(&sh->lock); if (grow_buffers(sh, conf->raid_disks)) { shrink_buffers(sh, conf->raid_disks); - kmem_cache_free(conf->slab_cache, sh); + kmem_cache_free(conf->sh_slab_cache, sh); return 0; } sh->disks = conf->raid_disks; /* we just created an active stripe so... 
*/ atomic_set(&sh->count, 1); atomic_inc(&conf->active_stripes); + atomic_set(&init_sq.count, 2); /* set to two so that it is not picked + * up by __release_queue + */ INIT_LIST_HEAD(&sh->lru); - release_stripe(sh); + sh->sq = &init_sq; + + spin_lock_irq(&conf->device_lock); + __release_stripe(conf, sh); + sh->sq = NULL; + spin_unlock_irq(&conf->device_lock); + + return 1; +} + +static int grow_one_queue(raid5_conf_t *conf) +{ + struct stripe_queue *sq; + int disks = conf->raid_disks; + void *weight_map; + sq = kmem_cache_alloc(conf->sq_slab_cache, GFP_KERNEL); + if (!sq) + return 0; + memset(sq, 0, (sizeof(*sq)+(disks-1) * sizeof(struct r5_queue_dev)) + + r5_io_weight_size(disks) + r5_io_weight_size(disks) + + r5_io_weight_size(disks)); + + /* set the queue weight bitmaps to the free space at the end of sq */ + weight_map = ((void *) sq) + offsetof(typeof(*sq), dev) + + sizeof(struct r5_queue_dev) * disks; + sq->to_read = weight_map; + weight_map += r5_io_weight_size(disks); + sq->to_write = weight_map; + weight_map += r5_io_weight_size(disks); + sq->overwrite = weight_map; + + spin_lock_init(&sq->lock); + sq->sector = MaxSector; + sq->raid_conf = conf; + /* we just created an active queue so... */ + atomic_set(&sq->count, 1); + atomic_inc(&conf->active_queues); + sq->sh = &init_sh; + INIT_LIST_HEAD(&sq->list_node); + init_waitqueue_head(&sq->wait_for_attach); + RB_CLEAR_NODE(&sq->rb_node); + release_queue(sq); + return 1; } static int grow_stripes(raid5_conf_t *conf, int num) { struct kmem_cache *sc; - int devs = conf->raid_disks; + int devs = conf->raid_disks, num_q = num * STRIPE_QUEUE_SIZE; + + sprintf(conf->sh_cache_name[0], "raid5-%s", mdname(conf->mddev)); + sprintf(conf->sh_cache_name[1], "raid5-%s-alt", mdname(conf->mddev)); + sprintf(conf->sq_cache_name[0], "raid5q-%s", mdname(conf->mddev)); + sprintf(conf->sq_cache_name[1], "raid5q-%s-alt", mdname(conf->mddev)); - sprintf(conf->cache_name[0], "raid5-%s", mdname(conf->mddev)); - sprintf(conf->cache_name[1], "raid5-%s-alt", mdname(conf->mddev)); conf->active_name = 0; - sc = kmem_cache_create(conf->cache_name[conf->active_name], + sc = kmem_cache_create(conf->sh_cache_name[conf->active_name], sizeof(struct stripe_head)+(devs-1)*sizeof(struct r5dev), 0, 0, NULL); if (!sc) return 1; - conf->slab_cache = sc; + conf->sh_slab_cache = sc; conf->pool_size = devs; while (num--) if (!grow_one_stripe(conf)) return 1; + + sc = kmem_cache_create(conf->sq_cache_name[conf->active_name], + (sizeof(struct stripe_queue)+(devs-1) * + sizeof(struct r5_queue_dev)) + + r5_io_weight_size(devs) + + r5_io_weight_size(devs) + + r5_io_weight_size(devs), 0, 0, NULL); + if (!sc) + return 1; + + conf->sq_slab_cache = sc; + while (num_q--) + if (!grow_one_queue(conf)) + return 1; + return 0; } @@ -989,11 +1368,13 @@ static int resize_stripes(raid5_conf_t * * so we use GFP_NOIO allocations. 
*/ struct stripe_head *osh, *nsh; + struct stripe_queue *osq, *nsq; LIST_HEAD(newstripes); + LIST_HEAD(newqueues); struct disk_info *ndisks; int err = 0; - struct kmem_cache *sc; - int i; + struct kmem_cache *sc, *sc_q; + int i, j; if (newsize <= conf->pool_size) return 0; /* never bother to shrink */ @@ -1001,23 +1382,75 @@ static int resize_stripes(raid5_conf_t * md_allow_write(conf->mddev); /* Step 1 */ - sc = kmem_cache_create(conf->cache_name[1-conf->active_name], + sc = kmem_cache_create(conf->sh_cache_name[1-conf->active_name], sizeof(struct stripe_head)+(newsize-1)*sizeof(struct r5dev), 0, 0, NULL); if (!sc) return -ENOMEM; + sc_q = kmem_cache_create(conf->sq_cache_name[conf->active_name], + (sizeof(struct stripe_queue)+(newsize-1) * + sizeof(struct r5_queue_dev)) + + r5_io_weight_size(newsize) + + r5_io_weight_size(newsize) + + r5_io_weight_size(newsize), + 0, 0, NULL); + + if (!sc_q) { + kmem_cache_destroy(sc); + return -ENOMEM; + } + for (i = conf->max_nr_stripes; i; i--) { + struct stripe_queue *nsq_per_sh[STRIPE_QUEUE_SIZE]; + nsh = kmem_cache_alloc(sc, GFP_KERNEL); if (!nsh) break; - memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev)); + /* allocate STRIPE_QUEUE_SIZE queues per stripe */ + for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++) + nsq_per_sh[j] = kmem_cache_alloc(sc_q, GFP_KERNEL); - nsh->raid_conf = conf; - spin_lock_init(&nsh->lock); + for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++) + if (!nsq_per_sh[j]) + break; + if (j <= ARRAY_SIZE(nsq_per_sh)) { + kmem_cache_free(sc, nsh); + do + if (nsq_per_sh[j]) + kmem_cache_free(sc_q, nsq_per_sh[j]); + while (--j >= 0); + break; + } + + memset(nsh, 0, sizeof(*nsh) + (newsize-1)*sizeof(struct r5dev)); list_add(&nsh->lru, &newstripes); + + for (j = 0; j < ARRAY_SIZE(nsq_per_sh); j++) { + void *weight_map; + nsq = nsq_per_sh[j]; + memset(nsq, 0, (sizeof(*nsq)+(newsize-1) * + sizeof(struct r5_queue_dev)) + + r5_io_weight_size(newsize) + + r5_io_weight_size(newsize) + + r5_io_weight_size(newsize)); + /* set the queue weight bitmaps to the free space at + * the end of nsq + */ + weight_map = ((void *) nsq) + + offsetof(typeof(*nsq), dev) + + sizeof(struct r5_queue_dev) * newsize; + nsq->to_read = weight_map; + weight_map += r5_io_weight_size(newsize); + nsq->to_write = weight_map; + weight_map += r5_io_weight_size(newsize); + nsq->overwrite = weight_map; + nsq->raid_conf = conf; + spin_lock_init(&nsq->lock); + list_add(&nsq->list_node, &newqueues); + } } if (i) { /* didn't get enough, give up */ @@ -1026,6 +1459,14 @@ static int resize_stripes(raid5_conf_t * list_del(&nsh->lru); kmem_cache_free(sc, nsh); } + while (!list_empty(&newqueues)) { + nsq = list_entry(newqueues.next, + struct stripe_queue, + list_node); + list_del(&nsh->lru); + kmem_cache_free(sc_q, nsq); + } + kmem_cache_destroy(sc_q); kmem_cache_destroy(sc); return -ENOMEM; } @@ -1033,6 +1474,19 @@ static int resize_stripes(raid5_conf_t * * OK, we have enough stripes, start collecting inactive * stripes and copying them over */ + list_for_each_entry(nsq, &newqueues, list_node) { + spin_lock_irq(&conf->device_lock); + wait_event_lock_irq(conf->wait_for_queue, + !list_empty(&conf->inactive_queue_list), + conf->device_lock, + unplug_slaves(conf->mddev) + ); + osq = get_free_queue(conf); + spin_unlock_irq(&conf->device_lock); + atomic_set(&nsq->count, 1); + kmem_cache_free(conf->sq_slab_cache, osq); + } + list_for_each_entry(nsh, &newstripes, lru) { spin_lock_irq(&conf->device_lock); wait_event_lock_irq(conf->wait_for_stripe, @@ -1040,16 +1494,17 @@ static int 
resize_stripes(raid5_conf_t * conf->device_lock, unplug_slaves(conf->mddev) ); - osh = get_free_stripe(conf); + osh = get_free_stripe(&init_sq); spin_unlock_irq(&conf->device_lock); atomic_set(&nsh->count, 1); for(i=0; ipool_size; i++) nsh->dev[i].page = osh->dev[i].page; for( ; idev[i].page = NULL; - kmem_cache_free(conf->slab_cache, osh); + kmem_cache_free(conf->sh_slab_cache, osh); } - kmem_cache_destroy(conf->slab_cache); + kmem_cache_destroy(conf->sh_slab_cache); + kmem_cache_destroy(conf->sq_slab_cache); /* Step 3. * At this point, we are holding all the stripes so the array @@ -1066,8 +1521,11 @@ static int resize_stripes(raid5_conf_t * err = -ENOMEM; /* Step 4, return new stripes to service */ - while(!list_empty(&newstripes)) { + while (!list_empty(&newstripes)) { + nsq = list_entry(newqueues.next, struct stripe_queue, + list_node); nsh = list_entry(newstripes.next, struct stripe_head, lru); + list_del_init(&nsq->list_node); list_del_init(&nsh->lru); for (i=conf->raid_disks; i < newsize; i++) if (nsh->dev[i].page == NULL) { @@ -1076,11 +1534,24 @@ static int resize_stripes(raid5_conf_t * if (!p) err = -ENOMEM; } + nsq->sh = nsh; + nsh->sq = nsq; release_stripe(nsh); } + + /* dump the remaining sq's onto the inactive list */ + while (!list_empty(&newqueues)) { + nsq = list_entry(newqueues.next, struct stripe_queue, + list_node); + list_del_init(&nsq->list_node); + nsq->sh = &init_sh; + release_queue(nsq); + } + /* critical section pass, GFP_NOIO no longer needed */ - conf->slab_cache = sc; + conf->sh_slab_cache = sc; + conf->sq_slab_cache = sc_q; conf->active_name = 1-conf->active_name; conf->pool_size = newsize; return err; @@ -1092,32 +1563,53 @@ static int drop_one_stripe(raid5_conf_t struct stripe_head *sh; spin_lock_irq(&conf->device_lock); - sh = get_free_stripe(conf); + sh = get_free_stripe(&init_sq); spin_unlock_irq(&conf->device_lock); if (!sh) return 0; BUG_ON(atomic_read(&sh->count)); shrink_buffers(sh, conf->pool_size); - kmem_cache_free(conf->slab_cache, sh); + kmem_cache_free(conf->sh_slab_cache, sh); atomic_dec(&conf->active_stripes); return 1; } +static int drop_one_queue(raid5_conf_t *conf) +{ + struct stripe_queue *sq; + + spin_lock_irq(&conf->device_lock); + sq = get_free_queue(conf); + spin_unlock_irq(&conf->device_lock); + if (!sq) + return 0; + kmem_cache_free(conf->sq_slab_cache, sq); + atomic_dec(&conf->active_queues); + return 1; +} + static void shrink_stripes(raid5_conf_t *conf) { while (drop_one_stripe(conf)) ; - if (conf->slab_cache) - kmem_cache_destroy(conf->slab_cache); - conf->slab_cache = NULL; + while (drop_one_queue(conf)) + ; + + if (conf->sh_slab_cache) + kmem_cache_destroy(conf->sh_slab_cache); + conf->sh_slab_cache = NULL; + + if (conf->sq_slab_cache) + kmem_cache_destroy(conf->sq_slab_cache); + conf->sq_slab_cache = NULL; } static int raid5_end_read_request(struct bio * bi, unsigned int bytes_done, int error) { struct stripe_head *sh = bi->bi_private; - raid5_conf_t *conf = sh->raid_conf; + raid5_conf_t *conf = sh->sq->raid_conf; int disks = sh->disks, i; int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags); char b[BDEVNAME_SIZE]; @@ -1195,7 +1687,8 @@ static int raid5_end_write_request (stru int error) { struct stripe_head *sh = bi->bi_private; - raid5_conf_t *conf = sh->raid_conf; + struct stripe_queue *sq = sh->sq; + raid5_conf_t *conf = sq->raid_conf; int disks = sh->disks, i; int uptodate = test_bit(BIO_UPTODATE, &bi->bi_flags); @@ -1225,9 +1718,6 @@ static int raid5_end_write_request (stru return 0; } - -static sector_t 
compute_blocknr(struct stripe_head *sh, int i); - static void raid5_build_block (struct stripe_head *sh, int i) { struct r5dev *dev = &sh->dev[i]; @@ -1242,9 +1732,6 @@ static void raid5_build_block (struct st dev->req.bi_sector = sh->sector; dev->req.bi_private = sh; - - dev->flags = 0; - dev->sector = compute_blocknr(sh, i); } static void error(mddev_t *mddev, mdk_rdev_t *rdev) @@ -1379,12 +1866,12 @@ static sector_t raid5_compute_sector(sec } -static sector_t compute_blocknr(struct stripe_head *sh, int i) +static sector_t +compute_blocknr(raid5_conf_t *conf, int raid_disks, sector_t sector, + int pd_idx, int i) { - raid5_conf_t *conf = sh->raid_conf; - int raid_disks = sh->disks; int data_disks = raid_disks - conf->max_degraded; - sector_t new_sector = sh->sector, check; + sector_t new_sector = sector, check; int sectors_per_chunk = conf->chunk_size >> 9; sector_t stripe; int chunk_offset; @@ -1396,7 +1883,7 @@ static sector_t compute_blocknr(struct s stripe = new_sector; BUG_ON(new_sector != stripe); - if (i == sh->pd_idx) + if (i == pd_idx) return 0; switch(conf->level) { case 4: break; @@ -1404,14 +1891,14 @@ static sector_t compute_blocknr(struct s switch (conf->algorithm) { case ALGORITHM_LEFT_ASYMMETRIC: case ALGORITHM_RIGHT_ASYMMETRIC: - if (i > sh->pd_idx) + if (i > pd_idx) i--; break; case ALGORITHM_LEFT_SYMMETRIC: case ALGORITHM_RIGHT_SYMMETRIC: - if (i < sh->pd_idx) + if (i < pd_idx) i += raid_disks; - i -= (sh->pd_idx + 1); + i -= (pd_idx + 1); break; default: printk(KERN_ERR "raid5: unsupported algorithm %d\n", @@ -1419,25 +1906,25 @@ static sector_t compute_blocknr(struct s } break; case 6: - if (i == raid6_next_disk(sh->pd_idx, raid_disks)) + if (i == raid6_next_disk(pd_idx, raid_disks)) return 0; /* It is the Q disk */ switch (conf->algorithm) { case ALGORITHM_LEFT_ASYMMETRIC: case ALGORITHM_RIGHT_ASYMMETRIC: - if (sh->pd_idx == raid_disks-1) - i--; /* Q D D D P */ - else if (i > sh->pd_idx) + if (pd_idx == raid_disks-1) + i--; /* Q D D D P */ + else if (i > pd_idx) i -= 2; /* D D P Q D */ break; case ALGORITHM_LEFT_SYMMETRIC: case ALGORITHM_RIGHT_SYMMETRIC: - if (sh->pd_idx == raid_disks-1) + if (pd_idx == raid_disks-1) i--; /* Q D D D P */ else { /* D D P Q D */ - if (i < sh->pd_idx) + if (i < pd_idx) i += raid_disks; - i -= (sh->pd_idx + 2); + i -= (pd_idx + 2); } break; default: @@ -1451,7 +1938,7 @@ static sector_t compute_blocknr(struct s r_sector = (sector_t)chunk_number * sectors_per_chunk + chunk_offset; check = raid5_compute_sector (r_sector, raid_disks, data_disks, &dummy1, &dummy2, conf); - if (check != sh->sector || dummy1 != dd_idx || dummy2 != sh->pd_idx) { + if (check != sector || dummy1 != dd_idx || dummy2 != pd_idx) { printk(KERN_ERR "compute_blocknr: map not correct\n"); return 0; } @@ -1518,8 +2005,9 @@ #define check_xor() do { \ static void compute_parity6(struct stripe_head *sh, int method) { - raid6_conf_t *conf = sh->raid_conf; - int i, pd_idx = sh->pd_idx, qd_idx, d0_idx, disks = sh->disks, count; + struct stripe_queue *sq = sh->sq; + raid6_conf_t *conf = sq->raid_conf; + int i, pd_idx = sq->pd_idx, qd_idx, d0_idx, disks = sh->disks, count; struct bio *chosen; /**** FIX THIS: This could be very bad if disks is close to 256 ****/ void *ptrs[disks]; @@ -1535,9 +2023,10 @@ static void compute_parity6(struct strip BUG(); /* READ_MODIFY_WRITE N/A for RAID-6 */ case RECONSTRUCT_WRITE: for (i= disks; i-- ;) - if ( i != pd_idx && i != qd_idx && sh->dev[i].towrite ) { - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; + if (i != pd_idx && i != 
qd_idx && sq->dev[i].towrite) { + chosen = sq->dev[i].towrite; + sq->dev[i].towrite = NULL; + clear_bit(i, sq->to_write); if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) wake_up(&conf->wait_for_overlap); @@ -1550,9 +2039,9 @@ static void compute_parity6(struct strip BUG(); /* Not implemented yet */ } - for (i = disks; i--;) + for (i = disks; i--; ) if (sh->dev[i].written) { - sector_t sector = sh->dev[i].sector; + sector_t sector = sq->dev[i].sector; struct bio *wbi = sh->dev[i].written; while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) { copy_data(1, wbi, sh->dev[i].page, sector); @@ -1600,9 +2089,10 @@ static void compute_parity6(struct strip /* Compute one missing block */ static void compute_block_1(struct stripe_head *sh, int dd_idx, int nozero) { + struct stripe_queue *sq = sh->sq; int i, count, disks = sh->disks; void *ptr[MAX_XOR_BLOCKS], *dest, *p; - int pd_idx = sh->pd_idx; + int pd_idx = sq->pd_idx; int qd_idx = raid6_next_disk(pd_idx, disks); pr_debug("compute_block_1, stripe %llu, idx %d\n", @@ -1639,7 +2129,7 @@ static void compute_block_1(struct strip static void compute_block_2(struct stripe_head *sh, int dd_idx1, int dd_idx2) { int i, count, disks = sh->disks; - int pd_idx = sh->pd_idx; + int pd_idx = sh->sq->pd_idx; int qd_idx = raid6_next_disk(pd_idx, disks); int d0_idx = raid6_next_disk(qd_idx, disks); int faila, failb; @@ -1701,8 +2191,9 @@ static void compute_block_2(struct strip static int handle_write_operations5(struct stripe_head *sh, int rcw, int expand) { - int i, pd_idx = sh->pd_idx, disks = sh->disks; int locked = 0; + struct stripe_queue *sq = sh->sq; + int i, pd_idx = sq->pd_idx, disks = sh->disks; if (rcw) { /* if we are not expanding this is a proper write request, and @@ -1719,8 +2210,9 @@ handle_write_operations5(struct stripe_h for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; - if (dev->towrite) { + if (dev_q->towrite) { set_bit(R5_LOCKED, &dev->flags); if (!expand) clear_bit(R5_UPTODATE, &dev->flags); @@ -1739,6 +2231,8 @@ handle_write_operations5(struct stripe_h for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; + if (i == pd_idx) continue; @@ -1747,7 +2241,7 @@ handle_write_operations5(struct stripe_h * written so we distinguish these blocks by the * R5_Wantprexor bit */ - if (dev->towrite && + if (dev_q->towrite && (test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) { set_bit(R5_Wantprexor, &dev->flags); @@ -1777,25 +2271,27 @@ handle_write_operations5(struct stripe_h * toread/towrite point to the first in a chain. * The bi_next chain must be in order. 
*/ -static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx, int forwrite) +static int add_queue_bio(struct stripe_queue *sq, struct bio *bi, int dd_idx, + int forwrite) { struct bio **bip; - raid5_conf_t *conf = sh->raid_conf; + raid5_conf_t *conf = sq->raid_conf; int firstwrite=0; + struct stripe_head *sh; - pr_debug("adding bh b#%llu to stripe s#%llu\n", + pr_debug("adding bio (%llu) to queue (%llu)\n", (unsigned long long)bi->bi_sector, - (unsigned long long)sh->sector); - + (unsigned long long)sq->sector); - spin_lock(&sh->lock); + spin_lock(&sq->lock); spin_lock_irq(&conf->device_lock); + sh = sq->sh; if (forwrite) { - bip = &sh->dev[dd_idx].towrite; - if (*bip == NULL && sh->dev[dd_idx].written == NULL) + bip = &sq->dev[dd_idx].towrite; + if (*bip == NULL && (!sh || (sh && !sh->dev[dd_idx].written))) firstwrite = 1; } else - bip = &sh->dev[dd_idx].toread; + bip = &sq->dev[dd_idx].toread; while (*bip && (*bip)->bi_sector < bi->bi_sector) { if ((*bip)->bi_sector + ((*bip)->bi_size >> 9) > bi->bi_sector) goto overlap; @@ -1810,38 +2306,42 @@ static int add_stripe_bio(struct stripe_ *bip = bi; bi->bi_phys_segments ++; spin_unlock_irq(&conf->device_lock); - spin_unlock(&sh->lock); + spin_unlock(&sq->lock); pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n", (unsigned long long)bi->bi_sector, - (unsigned long long)sh->sector, dd_idx); + (unsigned long long)sq->sector, dd_idx); if (conf->mddev->bitmap && firstwrite) { - bitmap_startwrite(conf->mddev->bitmap, sh->sector, + bitmap_startwrite(conf->mddev->bitmap, sq->sector, STRIPE_SECTORS, 0); - sh->bm_seq = conf->seq_flush+1; - set_bit(STRIPE_BIT_DELAY, &sh->state); + sq->bm_seq = conf->seq_flush+1; + set_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state); } if (forwrite) { /* check if page is covered */ - sector_t sector = sh->dev[dd_idx].sector; - for (bi=sh->dev[dd_idx].towrite; - sector < sh->dev[dd_idx].sector + STRIPE_SECTORS && + sector_t sector = sq->dev[dd_idx].sector; + + set_bit(dd_idx, sq->to_write); + for (bi = sq->dev[dd_idx].towrite; + sector < sq->dev[dd_idx].sector + STRIPE_SECTORS && bi && bi->bi_sector <= sector; - bi = r5_next_bio(bi, sh->dev[dd_idx].sector)) { + bi = r5_next_bio(bi, sq->dev[dd_idx].sector)) { if (bi->bi_sector + (bi->bi_size>>9) >= sector) sector = bi->bi_sector + (bi->bi_size>>9); } - if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS) - set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags); - } + if (sector >= sq->dev[dd_idx].sector + STRIPE_SECTORS) + set_bit(dd_idx, sq->overwrite); + } else + set_bit(dd_idx, sq->to_read); + return 1; overlap: set_bit(R5_Overlap, &sh->dev[dd_idx].flags); spin_unlock_irq(&conf->device_lock); - spin_unlock(&sh->lock); + spin_unlock(&sq->lock); return 0; } @@ -1873,6 +2373,8 @@ handle_requests_to_failed_array(raid5_co struct bio **return_bi) { int i; + struct stripe_queue *sq = sh->sq; + for (i = disks; i--; ) { struct bio *bi; int bitmap_end = 0; @@ -1888,8 +2390,9 @@ handle_requests_to_failed_array(raid5_co } spin_lock_irq(&conf->device_lock); /* fail all writes first */ - bi = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; + bi = sq->dev[i].towrite; + sq->dev[i].towrite = NULL; + clear_bit(i, sq->to_write); if (bi) { s->to_write--; bitmap_end = 1; @@ -1899,8 +2402,8 @@ handle_requests_to_failed_array(raid5_co wake_up(&conf->wait_for_overlap); while (bi && bi->bi_sector < - sh->dev[i].sector + STRIPE_SECTORS) { - struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector); + sq->dev[i].sector + STRIPE_SECTORS) { + struct bio *nextbi = r5_next_bio(bi, 
sq->dev[i].sector); clear_bit(BIO_UPTODATE, &bi->bi_flags); if (--bi->bi_phys_segments == 0) { md_write_end(conf->mddev); @@ -1914,8 +2417,8 @@ handle_requests_to_failed_array(raid5_co sh->dev[i].written = NULL; if (bi) bitmap_end = 1; while (bi && bi->bi_sector < - sh->dev[i].sector + STRIPE_SECTORS) { - struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector); + sq->dev[i].sector + STRIPE_SECTORS) { + struct bio *bi2 = r5_next_bio(bi, sq->dev[i].sector); clear_bit(BIO_UPTODATE, &bi->bi_flags); if (--bi->bi_phys_segments == 0) { md_write_end(conf->mddev); @@ -1931,15 +2434,16 @@ handle_requests_to_failed_array(raid5_co if (!test_bit(R5_Wantfill, &sh->dev[i].flags) && (!test_bit(R5_Insync, &sh->dev[i].flags) || test_bit(R5_ReadError, &sh->dev[i].flags))) { - bi = sh->dev[i].toread; - sh->dev[i].toread = NULL; + bi = sq->dev[i].toread; + sq->dev[i].toread = NULL; + clear_bit(i, sq->to_read); if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) wake_up(&conf->wait_for_overlap); if (bi) s->to_read--; while (bi && bi->bi_sector < - sh->dev[i].sector + STRIPE_SECTORS) { + sq->dev[i].sector + STRIPE_SECTORS) { struct bio *nextbi = - r5_next_bio(bi, sh->dev[i].sector); + r5_next_bio(bi, sq->dev[i].sector); clear_bit(BIO_UPTODATE, &bi->bi_flags); if (--bi->bi_phys_segments == 0) { bi->bi_next = *return_bi; @@ -1962,22 +2466,25 @@ handle_requests_to_failed_array(raid5_co static int __handle_issuing_new_read_requests5(struct stripe_head *sh, struct stripe_head_state *s, int disk_idx, int disks) { + struct stripe_queue *sq = sh->sq; struct r5dev *dev = &sh->dev[disk_idx]; + struct r5_queue_dev *dev_q = &sq->dev[disk_idx]; struct r5dev *failed_dev = &sh->dev[s->failed_num]; + struct r5_queue_dev *failed_dev_q = &sq->dev[s->failed_num]; /* don't schedule compute operations or reads on the parity block while * a check is in flight */ - if ((disk_idx == sh->pd_idx) && + if ((disk_idx == sq->pd_idx) && test_bit(STRIPE_OP_CHECK, &sh->ops.pending)) return ~0; /* is the data in this block needed, and can we get it? 
*/ if (!test_bit(R5_LOCKED, &dev->flags) && - !test_bit(R5_UPTODATE, &dev->flags) && (dev->toread || - (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) || + !test_bit(R5_UPTODATE, &dev->flags) && (dev_q->toread || + (dev_q->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) || s->syncing || s->expanding || (s->failed && - (failed_dev->toread || (failed_dev->towrite && + (failed_dev_q->toread || (failed_dev_q->towrite && !test_bit(R5_OVERWRITE, &failed_dev->flags) ))))) { /* 1/ We would like to get this block, possibly by computing it, @@ -2060,18 +2567,22 @@ static void handle_issuing_new_read_requ int disks) { int i; + struct stripe_queue *sq = sh->sq; + for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; + if (!test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) && - (dev->toread || (dev->towrite && + (dev_q->toread || (dev_q->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) || s->syncing || s->expanding || (s->failed >= 1 && - (sh->dev[r6s->failed_num[0]].toread || + (sq->dev[r6s->failed_num[0]].toread || s->to_write)) || (s->failed >= 2 && - (sh->dev[r6s->failed_num[1]].toread || + (sq->dev[r6s->failed_num[1]].toread || s->to_write)))) { /* we would like to get this block, possibly * by computing it, but we might not be able to @@ -2121,11 +2632,12 @@ static void handle_completed_write_reque struct stripe_head *sh, int disks, struct bio **return_bi) { int i; - struct r5dev *dev; + struct stripe_queue *sq = sh->sq; for (i = disks; i--; ) if (sh->dev[i].written) { - dev = &sh->dev[i]; + struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; if (!test_bit(R5_LOCKED, &dev->flags) && test_bit(R5_UPTODATE, &dev->flags)) { /* We can return any write requests */ @@ -2136,8 +2648,8 @@ static void handle_completed_write_reque wbi = dev->written; dev->written = NULL; while (wbi && wbi->bi_sector < - dev->sector + STRIPE_SECTORS) { - wbi2 = r5_next_bio(wbi, dev->sector); + dev_q->sector + STRIPE_SECTORS) { + wbi2 = r5_next_bio(wbi, dev_q->sector); if (--wbi->bi_phys_segments == 0) { md_write_end(conf->mddev); wbi->bi_next = *return_bi; @@ -2145,7 +2657,7 @@ static void handle_completed_write_reque } wbi = wbi2; } - if (dev->towrite == NULL) + if (dev_q->towrite == NULL) bitmap_end = 1; spin_unlock_irq(&conf->device_lock); if (bitmap_end) @@ -2162,10 +2674,14 @@ static void handle_issuing_new_write_req struct stripe_head *sh, struct stripe_head_state *s, int disks) { int rmw = 0, rcw = 0, i; + struct stripe_queue *sq = sh->sq; + for (i = disks; i--; ) { /* would I have to read this buffer for read_modify_write */ struct r5dev *dev = &sh->dev[i]; - if ((dev->towrite || i == sh->pd_idx) && + struct r5_queue_dev *dev_q = &sq->dev[i]; + + if ((dev_q->towrite || i == sq->pd_idx) && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) { @@ -2175,7 +2691,7 @@ static void handle_issuing_new_write_req rmw += 2*disks; /* cannot read it */ } /* Would I have to read this buffer for reconstruct_write */ - if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sh->pd_idx && + if (!test_bit(R5_OVERWRITE, &dev->flags) && i != sq->pd_idx && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags))) { @@ -2191,51 +2707,41 @@ static void handle_issuing_new_write_req /* prefer read-modify-write, but need to get some data */ for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; - if ((dev->towrite || i == 
sh->pd_idx) && + struct r5_queue_dev *dev_q = &sq->dev[i]; + + if ((dev_q->towrite || i == sq->pd_idx) && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) && test_bit(R5_Insync, &dev->flags)) { - if ( - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) { - pr_debug("Read_old block " - "%d for r-m-w\n", i); - set_bit(R5_LOCKED, &dev->flags); - set_bit(R5_Wantread, &dev->flags); - if (!test_and_set_bit( - STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - s->locked++; - } else { - set_bit(STRIPE_DELAYED, &sh->state); - set_bit(STRIPE_HANDLE, &sh->state); - } + pr_debug("Read_old block %d for r-m-w\n", i); + set_bit(R5_LOCKED, &dev->flags); + set_bit(R5_Wantread, &dev->flags); + if (!test_and_set_bit(STRIPE_OP_IO, + &sh->ops.pending)) + sh->ops.count++; + s->locked++; } } if (rcw <= rmw && rcw > 0) /* want reconstruct write, but need to get some data */ for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; + if (!test_bit(R5_OVERWRITE, &dev->flags) && - i != sh->pd_idx && + i != sq->pd_idx && !test_bit(R5_LOCKED, &dev->flags) && !(test_bit(R5_UPTODATE, &dev->flags) || test_bit(R5_Wantcompute, &dev->flags)) && test_bit(R5_Insync, &dev->flags)) { - if ( - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) { - pr_debug("Read_old block " - "%d for Reconstruct\n", i); - set_bit(R5_LOCKED, &dev->flags); - set_bit(R5_Wantread, &dev->flags); - if (!test_and_set_bit( - STRIPE_OP_IO, &sh->ops.pending)) - sh->ops.count++; - s->locked++; - } else { - set_bit(STRIPE_DELAYED, &sh->state); - set_bit(STRIPE_HANDLE, &sh->state); - } + pr_debug("Read_old block " + "%d for Reconstruct\n", i); + set_bit(R5_LOCKED, &dev->flags); + set_bit(R5_Wantread, &dev->flags); + if (!test_and_set_bit(STRIPE_OP_IO, + &sh->ops.pending)) + sh->ops.count++; + s->locked++; } } /* now if nothing is locked, and if we have enough data, @@ -2251,7 +2757,7 @@ static void handle_issuing_new_write_req if ((s->req_compute || !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) && (s->locked == 0 && (rcw == 0 || rmw == 0) && - !test_bit(STRIPE_BIT_DELAY, &sh->state))) + !test_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state))) s->locked += handle_write_operations5(sh, rcw == 0, 0); } @@ -2259,7 +2765,8 @@ static void handle_issuing_new_write_req struct stripe_head *sh, struct stripe_head_state *s, struct r6_state *r6s, int disks) { - int rcw = 0, must_compute = 0, pd_idx = sh->pd_idx, i; + struct stripe_queue *sq = sh->sq; + int rcw = 0, must_compute = 0, pd_idx = sq->pd_idx, i; int qd_idx = r6s->qd_idx; for (i = disks; i--; ) { struct r5dev *dev = &sh->dev[i]; @@ -2290,28 +2797,19 @@ static void handle_issuing_new_write_req && !test_bit(R5_LOCKED, &dev->flags) && !test_bit(R5_UPTODATE, &dev->flags) && test_bit(R5_Insync, &dev->flags)) { - if ( - test_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) { - pr_debug("Read_old stripe %llu " - "block %d for Reconstruct\n", - (unsigned long long)sh->sector, i); - set_bit(R5_LOCKED, &dev->flags); - set_bit(R5_Wantread, &dev->flags); - s->locked++; - } else { - pr_debug("Request delayed stripe %llu " - "block %d for Reconstruct\n", - (unsigned long long)sh->sector, i); - set_bit(STRIPE_DELAYED, &sh->state); - set_bit(STRIPE_HANDLE, &sh->state); - } + pr_debug("Read_old stripe %llu " + "block %d for Reconstruct\n", + (unsigned long long)sh->sector, i); + set_bit(R5_LOCKED, &dev->flags); + set_bit(R5_Wantread, &dev->flags); + s->locked++; } } /* now if nothing is locked, and if we have enough data, we can start a * write request */ if (s->locked == 0 && rcw 
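/*
 * Illustrative sketch (userspace model, not kernel code) of the rmw/rcw
 * choice made above.  Read-modify-write must read the old contents of every
 * block being written plus the parity block; reconstruct-write must read
 * every block that is not fully overwritten.  The cheaper plan wins, and a
 * block that cannot be read is priced at 2 * disks so that plan effectively
 * loses.  Locked/Wantcompute handling is omitted; model_dev and plan_writes
 * are hypothetical names.
 */
#include <stdio.h>

struct model_dev {
        int towrite;            /* queue holds a pending write for this block */
        int overwrite;          /* the pending write covers the whole block   */
        int uptodate;           /* block already valid in the stripe cache    */
        int insync;             /* backing disk is readable                   */
};

static const char *plan_writes(const struct model_dev *dev, int disks, int pd_idx)
{
        int i, rmw = 0, rcw = 0;

        for (i = 0; i < disks; i++) {
                if ((dev[i].towrite || i == pd_idx) && !dev[i].uptodate)
                        rmw += dev[i].insync ? 1 : 2 * disks;
                if (!dev[i].overwrite && i != pd_idx && !dev[i].uptodate)
                        rcw += dev[i].insync ? 1 : 2 * disks;
        }
        if (rmw < rcw && rmw > 0)
                return "read-modify-write";
        return rcw > 0 ? "reconstruct-write" : "no reads needed";
}

int main(void)
{
        /* 4-disk RAID5, parity on disk 3, one block partially rewritten */
        struct model_dev dev[4] = {
                { 1, 0, 0, 1 }, { 0, 0, 0, 1 }, { 0, 0, 0, 1 }, { 0, 0, 0, 1 }
        };

        printf("plan: %s\n", plan_writes(dev, 4, 3));
        return 0;
}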
== 0 && - !test_bit(STRIPE_BIT_DELAY, &sh->state)) { + !test_bit(STRIPE_QUEUE_BIT_DELAY, &sq->state)) { if (must_compute > 0) { /* We have failed blocks and need to compute them */ switch (s->failed) { @@ -2342,19 +2840,13 @@ static void handle_issuing_new_write_req } /* after a RECONSTRUCT_WRITE, the stripe MUST be in-sync */ set_bit(STRIPE_INSYNC, &sh->state); - - if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) { - atomic_dec(&conf->preread_active_stripes); - if (atomic_read(&conf->preread_active_stripes) < - IO_THRESHOLD) - md_wakeup_thread(conf->mddev->thread); - } } } static void handle_parity_checks5(raid5_conf_t *conf, struct stripe_head *sh, struct stripe_head_state *s, int disks) { + struct stripe_queue *sq = sh->sq; set_bit(STRIPE_HANDLE, &sh->state); /* Take one of the following actions: * 1/ start a check parity operation if (uptodate == disks) @@ -2366,7 +2858,7 @@ static void handle_parity_checks5(raid5_ !test_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending)) { if (!test_and_set_bit(STRIPE_OP_CHECK, &sh->ops.pending)) { BUG_ON(s->uptodate != disks); - clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags); + clear_bit(R5_UPTODATE, &sh->dev[sq->pd_idx].flags); sh->ops.count++; s->uptodate--; } else if ( @@ -2392,8 +2884,8 @@ static void handle_parity_checks5(raid5_ set_bit(STRIPE_OP_MOD_REPAIR_PD, &sh->ops.pending); set_bit(R5_Wantcompute, - &sh->dev[sh->pd_idx].flags); - sh->ops.target = sh->pd_idx; + &sh->dev[sq->pd_idx].flags); + sh->ops.target = sq->pd_idx; sh->ops.count++; s->uptodate++; } @@ -2418,9 +2910,10 @@ static void handle_parity_checks5(raid5_ !test_bit(STRIPE_OP_CHECK, &sh->ops.pending) && !test_bit(STRIPE_OP_COMPUTE_BLK, &sh->ops.pending)) { struct r5dev *dev; + /* either failed parity check, or recovery is happening */ if (s->failed == 0) - s->failed_num = sh->pd_idx; + s->failed_num = sq->pd_idx; dev = &sh->dev[s->failed_num]; BUG_ON(!test_bit(R5_UPTODATE, &dev->flags)); BUG_ON(s->uptodate != disks); @@ -2443,8 +2936,9 @@ static void handle_parity_checks6(raid5_ int disks) { int update_p = 0, update_q = 0; + struct stripe_queue *sq = sh->sq; struct r5dev *dev; - int pd_idx = sh->pd_idx; + int pd_idx = sq->pd_idx; int qd_idx = r6s->qd_idx; set_bit(STRIPE_HANDLE, &sh->state); @@ -2534,6 +3028,7 @@ static void handle_stripe_expansion(raid struct r6_state *r6s) { int i; + struct stripe_queue *sq = sh->sq; /* We have read all the blocks in this stripe and now we need to * copy some of them into a target stripe for expand. @@ -2541,26 +3036,35 @@ static void handle_stripe_expansion(raid struct dma_async_tx_descriptor *tx = NULL; clear_bit(STRIPE_EXPAND_SOURCE, &sh->state); for (i = 0; i < sh->disks; i++) - if (i != sh->pd_idx && (r6s && i != r6s->qd_idx)) { + if (i != sq->pd_idx && (r6s && i != r6s->qd_idx)) { int dd_idx, pd_idx, j; struct stripe_head *sh2; + struct stripe_queue *sq2; + int disks = conf->raid_disks; - sector_t bn = compute_blocknr(sh, i); + sector_t bn = compute_blocknr(conf, sh->disks, + sh->sector, sq->pd_idx, i); sector_t s = raid5_compute_sector(bn, conf->raid_disks, - conf->raid_disks - + disks - conf->max_degraded, &dd_idx, &pd_idx, conf); - sh2 = get_active_stripe(conf, s, conf->raid_disks, - pd_idx, 1); - if (sh2 == NULL) + sq2 = get_active_queue(conf, s, disks, pd_idx, 1); + if (sq2) + sh2 = get_active_stripe(sq, disks, 1); + if (!(sq2 && sh2)) { /* so far only the early blocks of this stripe * have been requested. 
When later blocks * get requested, we will try again */ + if (sq2) + release_queue(sq2); continue; - if (!test_bit(STRIPE_EXPANDING, &sh2->state) || - test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) { + } + + if (!test_bit(STRIPE_QUEUE_EXPANDING, &sq2->state) || + test_bit(R5_Expanded, &sh2->dev[dd_idx].flags)) { /* must have already done this block */ + release_queue(sq2); release_stripe(sh2); continue; } @@ -2573,7 +3077,7 @@ static void handle_stripe_expansion(raid set_bit(R5_Expanded, &sh2->dev[dd_idx].flags); set_bit(R5_UPTODATE, &sh2->dev[dd_idx].flags); for (j = 0; j < conf->raid_disks; j++) - if (j != sh2->pd_idx && + if (j != sh2->sq->pd_idx && (r6s && j != r6s->qd_idx) && !test_bit(R5_Expanded, &sh2->dev[j].flags)) break; @@ -2581,6 +3085,7 @@ static void handle_stripe_expansion(raid set_bit(STRIPE_EXPAND_READY, &sh2->state); set_bit(STRIPE_HANDLE, &sh2->state); } + release_queue(sq2); release_stripe(sh2); /* done submitting copies, wait for them to complete */ @@ -2610,7 +3115,8 @@ static void handle_stripe_expansion(raid static void handle_stripe5(struct stripe_head *sh) { - raid5_conf_t *conf = sh->raid_conf; + struct stripe_queue *sq = sh->sq; + raid5_conf_t *conf = sh->sq->raid_conf; int disks = sh->disks, i; struct bio *return_bi = NULL; struct stripe_head_state s; @@ -2620,12 +3126,11 @@ static void handle_stripe5(struct stripe memset(&s, 0, sizeof(s)); pr_debug("handling stripe %llu, state=%#lx cnt=%d, pd_idx=%d " "ops=%lx:%lx:%lx\n", (unsigned long long)sh->sector, sh->state, - atomic_read(&sh->count), sh->pd_idx, + atomic_read(&sh->count), sq->pd_idx, sh->ops.pending, sh->ops.ack, sh->ops.complete); - spin_lock(&sh->lock); + spin_lock(&sq->lock); clear_bit(STRIPE_HANDLE, &sh->state); - clear_bit(STRIPE_DELAYED, &sh->state); s.syncing = test_bit(STRIPE_SYNCING, &sh->state); s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); @@ -2636,18 +3141,21 @@ static void handle_stripe5(struct stripe for (i=disks; i--; ) { mdk_rdev_t *rdev; struct r5dev *dev = &sh->dev[i]; + struct r5_queue_dev *dev_q = &sq->dev[i]; clear_bit(R5_Insync, &dev->flags); + if (test_and_clear_bit(i, sq->overwrite)) + set_bit(R5_OVERWRITE, &dev->flags); pr_debug("check %d: state 0x%lx toread %p read %p write %p " - "written %p\n", i, dev->flags, dev->toread, dev->read, - dev->towrite, dev->written); + "written %p\n", i, dev->flags, dev_q->toread, + dev->read, dev_q->towrite, dev->written); /* maybe we can request a biofill operation * * new wantfill requests are only permitted while * STRIPE_OP_BIOFILL is clear */ - if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread && + if (test_bit(R5_UPTODATE, &dev->flags) && dev_q->toread && !test_bit(STRIPE_OP_BIOFILL, &sh->ops.pending)) set_bit(R5_Wantfill, &dev->flags); @@ -2658,9 +3166,9 @@ static void handle_stripe5(struct stripe if (test_bit(R5_Wantfill, &dev->flags)) s.to_fill++; - else if (dev->toread) + else if (dev_q->toread) s.to_read++; - if (dev->towrite) { + if (dev_q->towrite) { s.to_write++; if (!test_bit(R5_OVERWRITE, &dev->flags)) s.non_overwrite++; @@ -2704,12 +3212,12 @@ static void handle_stripe5(struct stripe /* might be able to return some write requests if the parity block * is safe, or on a failed drive */ - dev = &sh->dev[sh->pd_idx]; + dev = &sh->dev[sq->pd_idx]; if ( s.written && ((test_bit(R5_Insync, &dev->flags) && !test_bit(R5_LOCKED, &dev->flags) && test_bit(R5_UPTODATE, &dev->flags)) || - (s.failed == 1 && s.failed_num == sh->pd_idx))) + (s.failed == 1 && s.failed_num == sq->pd_idx))) handle_completed_write_requests(conf, sh, 
disks, &return_bi); /* Now we might consider reading some blocks, either to check/generate @@ -2754,27 +3262,21 @@ static void handle_stripe5(struct stripe /* All the 'written' buffers and the parity block are ready to * be written back to disk */ - BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags)); + BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[sq->pd_idx].flags)); for (i = disks; i--; ) { dev = &sh->dev[i]; if (test_bit(R5_LOCKED, &dev->flags) && - (i == sh->pd_idx || dev->written)) { + (i == sq->pd_idx || dev->written)) { pr_debug("Writing block %d\n", i); set_bit(R5_Wantwrite, &dev->flags); if (!test_and_set_bit( STRIPE_OP_IO, &sh->ops.pending)) sh->ops.count++; if (!test_bit(R5_Insync, &dev->flags) || - (i == sh->pd_idx && s.failed == 0)) + (i == sq->pd_idx && s.failed == 0)) set_bit(STRIPE_INSYNC, &sh->state); } } - if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) { - atomic_dec(&conf->preread_active_stripes); - if (atomic_read(&conf->preread_active_stripes) < - IO_THRESHOLD) - md_wakeup_thread(conf->mddev->thread); - } } /* Now to consider new write requests and what else, if anything @@ -2836,7 +3338,7 @@ static void handle_stripe5(struct stripe if (test_bit(STRIPE_OP_POSTXOR, &sh->ops.complete) && !test_bit(STRIPE_OP_BIODRAIN, &sh->ops.pending)) { - clear_bit(STRIPE_EXPANDING, &sh->state); + clear_bit(STRIPE_QUEUE_EXPANDING, &sq->state); clear_bit(STRIPE_OP_POSTXOR, &sh->ops.pending); clear_bit(STRIPE_OP_POSTXOR, &sh->ops.ack); @@ -2849,11 +3351,11 @@ static void handle_stripe5(struct stripe } } - if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state) && + if (s.expanded && test_bit(STRIPE_QUEUE_EXPANDING, &sq->state) && !test_bit(STRIPE_OP_POSTXOR, &sh->ops.pending)) { /* Need to write out all blocks after computing parity */ sh->disks = conf->raid_disks; - sh->pd_idx = stripe_to_pdidx(sh->sector, conf, + sq->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks); s.locked += handle_write_operations5(sh, 0, 1); } else if (s.expanded && @@ -2870,7 +3372,7 @@ static void handle_stripe5(struct stripe if (sh->ops.count) pending = get_stripe_work(sh); - spin_unlock(&sh->lock); + spin_unlock(&sq->lock); if (pending) raid5_run_ops(sh, pending); @@ -2881,10 +3383,11 @@ static void handle_stripe5(struct stripe static void handle_stripe6(struct stripe_head *sh, struct page *tmp_page) { - raid6_conf_t *conf = sh->raid_conf; + struct stripe_queue *sq = sh->sq; + raid6_conf_t *conf = sq->raid_conf; int disks = sh->disks; struct bio *return_bi = NULL; - int i, pd_idx = sh->pd_idx; + int i, pd_idx = sq->pd_idx; struct stripe_head_state s; struct r6_state r6s; struct r5dev *dev, *pdev, *qdev; @@ -2896,9 +3399,8 @@ static void handle_stripe6(struct stripe atomic_read(&sh->count), pd_idx, r6s.qd_idx); memset(&s, 0, sizeof(s)); - spin_lock(&sh->lock); + spin_lock(&sq->lock); clear_bit(STRIPE_HANDLE, &sh->state); - clear_bit(STRIPE_DELAYED, &sh->state); s.syncing = test_bit(STRIPE_SYNCING, &sh->state); s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); @@ -2908,24 +3410,31 @@ static void handle_stripe6(struct stripe rcu_read_lock(); for (i=disks; i--; ) { mdk_rdev_t *rdev; + struct r5_queue_dev *dev_q = &sq->dev[i]; + dev = &sh->dev[i]; clear_bit(R5_Insync, &dev->flags); + if (test_and_clear_bit(i, sq->overwrite)) + set_bit(R5_OVERWRITE, &dev->flags); pr_debug("check %d: state 0x%lx read %p write %p written %p\n", - i, dev->flags, dev->toread, dev->towrite, dev->written); + i, dev->flags, dev_q->toread, dev_q->towrite, + dev->written); /* maybe we can reply to a read */ - 
if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) { + if (test_bit(R5_UPTODATE, &dev->flags) && dev_q->toread) { struct bio *rbi, *rbi2; pr_debug("Return read for disc %d\n", i); spin_lock_irq(&conf->device_lock); - rbi = dev->toread; - dev->toread = NULL; + rbi = dev_q->toread; + dev_q->toread = NULL; + clear_bit(i, sq->to_read); if (test_and_clear_bit(R5_Overlap, &dev->flags)) wake_up(&conf->wait_for_overlap); spin_unlock_irq(&conf->device_lock); - while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) { - copy_data(0, rbi, dev->page, dev->sector); - rbi2 = r5_next_bio(rbi, dev->sector); + while (rbi && rbi->bi_sector < + dev_q->sector + STRIPE_SECTORS) { + copy_data(0, rbi, dev->page, dev_q->sector); + rbi2 = r5_next_bio(rbi, dev_q->sector); spin_lock_irq(&conf->device_lock); if (--rbi->bi_phys_segments == 0) { rbi->bi_next = return_bi; @@ -2941,9 +3450,9 @@ static void handle_stripe6(struct stripe if (test_bit(R5_UPTODATE, &dev->flags)) s.uptodate++; - if (dev->toread) + if (dev_q->toread) s.to_read++; - if (dev->towrite) { + if (dev_q->towrite) { s.to_write++; if (!test_bit(R5_OVERWRITE, &dev->flags)) s.non_overwrite++; @@ -3047,10 +3556,10 @@ static void handle_stripe6(struct stripe } } - if (s.expanded && test_bit(STRIPE_EXPANDING, &sh->state)) { + if (s.expanded && test_bit(STRIPE_QUEUE_EXPANDING, &sq->state)) { /* Need to write out all blocks after computing P&Q */ sh->disks = conf->raid_disks; - sh->pd_idx = stripe_to_pdidx(sh->sector, conf, + sq->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks); compute_parity6(sh, RECONSTRUCT_WRITE); for (i = conf->raid_disks ; i-- ; ) { @@ -3058,7 +3567,7 @@ static void handle_stripe6(struct stripe s.locked++; set_bit(R5_Wantwrite, &sh->dev[i].flags); } - clear_bit(STRIPE_EXPANDING, &sh->state); + clear_bit(STRIPE_QUEUE_EXPANDING, &sq->state); } else if (s.expanded) { clear_bit(STRIPE_EXPAND_READY, &sh->state); atomic_dec(&conf->reshape_stripes); @@ -3069,7 +3578,7 @@ static void handle_stripe6(struct stripe if (s.expanding && s.locked == 0) handle_stripe_expansion(conf, sh, &r6s); - spin_unlock(&sh->lock); + spin_unlock(&sq->lock); return_io(return_bi); @@ -3135,26 +3644,68 @@ static void handle_stripe6(struct stripe static void handle_stripe(struct stripe_head *sh, struct page *tmp_page) { - if (sh->raid_conf->level == 6) + if (sh->sq->raid_conf->level == 6) handle_stripe6(sh, tmp_page); else handle_stripe5(sh); } +static void handle_queue(struct stripe_queue *sq, int disks, int data_disks) +{ + struct stripe_head *sh = NULL; + + /* continue to process i/o while the stripe is cached */ + if (test_bit(STRIPE_QUEUE_HANDLE, &sq->state)) { + if (io_weight(sq->overwrite, disks) == data_disks) { + set_bit(STRIPE_QUEUE_IO_HI, &sq->state); + sh = get_active_stripe(sq, disks, 1); + } else if (io_weight(sq->to_read, disks)) { + set_bit(STRIPE_QUEUE_IO_LO, &sq->state); + sh = get_active_stripe(sq, disks, 1); + } else if (io_weight(sq->to_write, disks)) { + if (test_bit(STRIPE_QUEUE_PREREAD_ACTIVE, &sq->state)) { + set_bit(STRIPE_QUEUE_IO_LO, &sq->state); + sh = get_active_stripe(sq, disks, 1); + } else + set_bit(STRIPE_QUEUE_DELAYED, &sq->state); + } + } else { + sh = get_active_stripe(sq, disks, 1); + /* pickup the stripe from the cache, some is wrong + * if the retrieved stripe is not the one currently + * attached to the stripe_queue + */ + BUG_ON(!(sq->sh && sq->sh == sh)); + } + release_queue(sq); + if (sh) { + handle_stripe(sh, NULL); + release_stripe(sh); + } + + pr_debug("%s: sector %llu " + "state: %#lx r: %lu w: %lu o: 
%lu handled: %s\n", __FUNCTION__, + (unsigned long long) sq->sector, sq->state, + io_weight(sq->to_read, disks), + io_weight(sq->to_write, disks), + io_weight(sq->overwrite, disks), sh ? "yes" : "no"); +} static void raid5_activate_delayed(raid5_conf_t *conf) { - if (atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD) { - while (!list_empty(&conf->delayed_list)) { - struct list_head *l = conf->delayed_list.next; - struct stripe_head *sh; - sh = list_entry(l, struct stripe_head, lru); - list_del_init(l); - clear_bit(STRIPE_DELAYED, &sh->state); - if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state)) - atomic_inc(&conf->preread_active_stripes); - list_add_tail(&sh->lru, &conf->handle_list); + if (atomic_read(&conf->preread_active_queues) < IO_THRESHOLD) { + struct stripe_queue *sq, *_sq; + pr_debug("%s\n", __FUNCTION__); + list_for_each_entry_safe(sq, _sq, &conf->delayed_q_list, + list_node) { + clear_bit(STRIPE_QUEUE_DELAYED, &sq->state); + set_bit(STRIPE_QUEUE_IO_LO, &sq->state); + if (!test_and_set_bit(STRIPE_QUEUE_PREREAD_ACTIVE, + &sq->state)) + atomic_inc(&conf->preread_active_queues); + list_move_tail(&sq->list_node, + &conf->io_lo_queue); } } } @@ -3209,6 +3760,7 @@ static void raid5_unplug_device(struct r conf->seq_flush++; raid5_activate_delayed(conf); } + md_wakeup_thread(mddev->thread); spin_unlock_irqrestore(&conf->device_lock, flags); @@ -3252,13 +3804,13 @@ static int raid5_congested(void *data, i raid5_conf_t *conf = mddev_to_conf(mddev); /* No difference between reads and writes. Just check - * how busy the stripe_cache is + * how busy the stripe_queue is */ - if (conf->inactive_blocked) + if (conf->inactive_queue_blocked) return 1; if (conf->quiesce) return 1; - if (list_empty_careful(&conf->inactive_list)) + if (list_empty_careful(&conf->inactive_queue_list)) return 1; return 0; @@ -3450,7 +4002,7 @@ static int chunk_aligned_read(struct req } spin_lock_irq(&conf->device_lock); - wait_event_lock_irq(conf->wait_for_stripe, + wait_event_lock_irq(conf->wait_for_queue, conf->quiesce == 0, conf->device_lock, /* nothing */); atomic_inc(&conf->active_aligned_reads); @@ -3473,7 +4025,7 @@ static int make_request(struct request_q unsigned int dd_idx, pd_idx; sector_t new_sector; sector_t logical_sector, last_sector; - struct stripe_head *sh; + struct stripe_queue *sq; const int rw = bio_data_dir(bi); int remaining; @@ -3535,16 +4087,18 @@ static int make_request(struct request_q (unsigned long long)new_sector, (unsigned long long)logical_sector); - sh = get_active_stripe(conf, new_sector, disks, pd_idx, (bi->bi_rw&RWA_MASK)); - if (sh) { + sq = get_active_queue(conf, new_sector, disks, pd_idx, + (bi->bi_rw & RWA_MASK)); + if (sq) { if (unlikely(conf->expand_progress != MaxSector)) { /* expansion might have moved on while waiting for a - * stripe, so we must do the range check again. + * queue, so we must do the range check again. * Expansion could still move past after this * test, but as we are holding a reference to - * 'sh', we know that if that happens, - * STRIPE_EXPANDING will get set and the expansion - * won't proceed until we finish with the stripe. + * 'sq', we know that if that happens, + * STRIPE_QUEUE_EXPANDING will get set and the + * expansion won't proceed until we finish + * with the queue. 
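/*
 * Illustrative sketch (userspace model) of the classification handle_queue()
 * above applies before releasing the queue: a write covering every data block
 * goes to io_hi, reads and preread-active writes go to io_lo, and remaining
 * sub-width writes are delayed in the hope that later bios promote them to
 * full-stripe writes.  io_weight() is modeled as a plain popcount over one
 * word; the enum and helper names are hypothetical.
 */
#include <stdio.h>

enum sq_class { SQ_IO_HI, SQ_IO_LO, SQ_DELAYED, SQ_IDLE };

static unsigned int weight(unsigned long bits)
{
        return (unsigned int)__builtin_popcountl(bits);     /* GCC builtin */
}

static enum sq_class classify(unsigned long overwrite, unsigned long to_read,
                              unsigned long to_write, unsigned int data_disks,
                              int preread_active)
{
        if (weight(overwrite) == data_disks)
                return SQ_IO_HI;        /* full stripe write, never delay  */
        if (weight(to_read))
                return SQ_IO_LO;        /* reads are never delayed either  */
        if (weight(to_write))
                return preread_active ? SQ_IO_LO : SQ_DELAYED;
        return SQ_IDLE;
}

int main(void)
{
        /* 4+1 array: bits 0-3 stand for the data blocks of one stripe */
        printf("full write    -> class %d (SQ_IO_HI = 0)\n",
               classify(0xfUL, 0UL, 0xfUL, 4, 0));
        printf("partial write -> class %d (SQ_DELAYED = 2)\n",
               classify(0x1UL, 0UL, 0x1UL, 4, 0));
        return 0;
}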
*/ int must_retry = 0; spin_lock_irq(&conf->device_lock); @@ -3554,7 +4108,7 @@ static int make_request(struct request_q must_retry = 1; spin_unlock_irq(&conf->device_lock); if (must_retry) { - release_stripe(sh); + release_queue(sq); goto retry; } } @@ -3563,27 +4117,27 @@ static int make_request(struct request_q */ if (logical_sector >= mddev->suspend_lo && logical_sector < mddev->suspend_hi) { - release_stripe(sh); + release_queue(sq); schedule(); goto retry; } - if (test_bit(STRIPE_EXPANDING, &sh->state) || - !add_stripe_bio(sh, bi, dd_idx, (bi->bi_rw&RW_MASK))) { + if (test_bit(STRIPE_QUEUE_EXPANDING, &sq->state) || + !add_queue_bio(sq, bi, dd_idx, + bi->bi_rw & RW_MASK)) { /* Stripe is busy expanding or * add failed due to overlap. Flush everything * and wait a while */ raid5_unplug_device(mddev->queue); - release_stripe(sh); + release_queue(sq); schedule(); goto retry; } finish_wait(&conf->wait_for_overlap, &w); - handle_stripe(sh, NULL); - release_stripe(sh); + handle_queue(sq, disks, data_disks); } else { - /* cannot get stripe for read-ahead, just give-up */ + /* cannot get queue for read-ahead, just give-up */ clear_bit(BIO_UPTODATE, &bi->bi_flags); finish_wait(&conf->wait_for_overlap, &w); break; @@ -3606,6 +4160,34 @@ static int make_request(struct request_q return 0; } +static struct stripe_queue * +wait_for_cache_attached_queue(raid5_conf_t *conf, sector_t sector, int disks, + int pd_idx) +{ + struct stripe_queue *sq; + struct stripe_head *sh; + + do { + wait_queue_t wait; + init_waitqueue_entry(&wait, current); + add_wait_queue(&conf->wait_for_stripe, &wait); + for (;;) { + sq = get_active_queue(conf, sector, disks, pd_idx, 0); + set_current_state(TASK_UNINTERRUPTIBLE); + sh = get_active_stripe(sq, disks, 1); + if (sh) + break; + release_queue(sq); + schedule(); + } + current->state = TASK_RUNNING; + remove_wait_queue(&conf->wait_for_stripe, &wait); + } while (0); + + return sq; +} + + static sector_t reshape_request(mddev_t *mddev, sector_t sector_nr, int *skipped) { /* reshaping is quite different to recovery/resync so it is @@ -3619,6 +4201,7 @@ static sector_t reshape_request(mddev_t */ raid5_conf_t *conf = (raid5_conf_t *) mddev->private; struct stripe_head *sh; + struct stripe_queue *sq; int pd_idx; sector_t first_sector, last_sector; int raid_disks = conf->previous_raid_disks; @@ -3672,21 +4255,26 @@ static sector_t reshape_request(mddev_t int j; int skipped = 0; pd_idx = stripe_to_pdidx(sector_nr+i, conf, conf->raid_disks); - sh = get_active_stripe(conf, sector_nr+i, - conf->raid_disks, pd_idx, 0); - set_bit(STRIPE_EXPANDING, &sh->state); + sq = wait_for_cache_attached_queue(conf, sector_nr+i, + conf->raid_disks, pd_idx); + sh = sq->sh; + + set_bit(STRIPE_QUEUE_EXPANDING, &sq->state); atomic_inc(&conf->reshape_stripes); /* If any of this stripe is beyond the end of the old * array, then we need to zero those blocks */ for (j=sh->disks; j--;) { sector_t s; - if (j == sh->pd_idx) + int pd_idx = sh->sq->pd_idx; + + if (j == pd_idx) continue; if (conf->level == 6 && - j == raid6_next_disk(sh->pd_idx, sh->disks)) + j == raid6_next_disk(pd_idx, sh->disks)) continue; - s = compute_blocknr(sh, j); + s = compute_blocknr(conf, sh->disks, sh->sector, + pd_idx, j); if (s < (mddev->array_size<<1)) { skipped = 1; continue; @@ -3699,6 +4287,7 @@ static sector_t reshape_request(mddev_t set_bit(STRIPE_EXPAND_READY, &sh->state); set_bit(STRIPE_HANDLE, &sh->state); } + release_queue(sq); release_stripe(sh); } spin_lock_irq(&conf->device_lock); @@ -3723,10 +4312,13 @@ static sector_t 
reshape_request(mddev_t while (first_sector <= last_sector) { pd_idx = stripe_to_pdidx(first_sector, conf, conf->previous_raid_disks); - sh = get_active_stripe(conf, first_sector, - conf->previous_raid_disks, pd_idx, 0); + sq = wait_for_cache_attached_queue(conf, first_sector, + conf->previous_raid_disks, + pd_idx); + sh = sq->sh; set_bit(STRIPE_EXPAND_SOURCE, &sh->state); set_bit(STRIPE_HANDLE, &sh->state); + release_queue(sq); release_stripe(sh); first_sector += STRIPE_SECTORS; } @@ -3738,6 +4330,7 @@ static inline sector_t sync_request(mdde { raid5_conf_t *conf = (raid5_conf_t *) mddev->private; struct stripe_head *sh; + struct stripe_queue *sq; int pd_idx; int raid_disks = conf->raid_disks; sector_t max_sector = mddev->size << 1; @@ -3786,14 +4379,11 @@ static inline sector_t sync_request(mdde } pd_idx = stripe_to_pdidx(sector_nr, conf, raid_disks); - sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 1); - if (sh == NULL) { - sh = get_active_stripe(conf, sector_nr, raid_disks, pd_idx, 0); - /* make sure we don't swamp the stripe cache if someone else - * is trying to get access - */ - schedule_timeout_uninterruptible(1); - } + + sq = wait_for_cache_attached_queue(conf, sector_nr, raid_disks, + pd_idx); + sh = sq->sh; + /* Need to check if array will still be degraded after recovery/resync * We don't need to check the 'failed' flag as when that gets set, * recovery aborts. @@ -3804,12 +4394,13 @@ static inline sector_t sync_request(mdde bitmap_start_sync(mddev->bitmap, sector_nr, &sync_blocks, still_degraded); - spin_lock(&sh->lock); + spin_lock(&sq->lock); set_bit(STRIPE_SYNCING, &sh->state); clear_bit(STRIPE_INSYNC, &sh->state); - spin_unlock(&sh->lock); + spin_unlock(&sq->lock); handle_stripe(sh, NULL); + release_queue(sq); release_stripe(sh); return STRIPE_SECTORS; @@ -3827,17 +4418,19 @@ static int retry_aligned_read(raid5_con * We *know* that this entire raid_bio is in one chunk, so * it will be only one 'dd_idx' and only need one call to raid5_compute_sector. 
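/*
 * Illustrative sketch (userspace model) of the sector walk retry_aligned_read()
 * performs below.  A chunk-aligned read touches a single data disk, so the bio
 * is retried one STRIPE_SECTORS-sized piece at a time starting from the
 * round-down of bi_sector, with bi_hw_segments reused as a resume cursor when
 * a queue or stripe cannot be obtained.  The 8-sector stripe unit and variable
 * names here are illustrative.
 */
#include <stdio.h>

#define SKETCH_STRIPE_SECTORS 8ULL      /* 4KiB stripe unit, 512-byte sectors */

int main(void)
{
        unsigned long long bi_sector = 21;          /* bio start, in sectors  */
        unsigned long long bi_len = 24;             /* bio length, in sectors */
        unsigned long long logical, last, scnt = 0;

        logical = bi_sector & ~(SKETCH_STRIPE_SECTORS - 1);     /* round down */
        last = bi_sector + bi_len;

        for (; logical < last; logical += SKETCH_STRIPE_SECTORS, scnt++)
                printf("segment %llu: stripe unit at sector %llu\n",
                       scnt, logical);
        return 0;
}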
*/ - struct stripe_head *sh; + struct stripe_queue *sq; int dd_idx, pd_idx; sector_t sector, logical_sector, last_sector; int scnt = 0; int remaining; int handled = 0; + int disks = conf->raid_disks; + int data_disks = disks - conf->max_degraded; logical_sector = raid_bio->bi_sector & ~((sector_t)STRIPE_SECTORS-1); sector = raid5_compute_sector( logical_sector, - conf->raid_disks, - conf->raid_disks - conf->max_degraded, + disks, + data_disks, &dd_idx, &pd_idx, conf); @@ -3847,30 +4440,34 @@ static int retry_aligned_read(raid5_con logical_sector += STRIPE_SECTORS, sector += STRIPE_SECTORS, scnt++) { + struct stripe_head *sh; if (scnt < raid_bio->bi_hw_segments) /* already done this stripe */ continue; - sh = get_active_stripe(conf, sector, conf->raid_disks, pd_idx, 1); - - if (!sh) { - /* failed to get a stripe - must wait */ + sq = get_active_queue(conf, sector, disks, pd_idx, 1); + if (sq) + sh = get_active_stripe(sq, disks, 1); + if (!(sq && sh)) { + /* failed to get a queue/stripe - must wait */ raid_bio->bi_hw_segments = scnt; conf->retry_read_aligned = raid_bio; + if (sq) + release_queue(sq); return handled; } set_bit(R5_ReadError, &sh->dev[dd_idx].flags); - if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) { + if (!add_queue_bio(sq, raid_bio, dd_idx, 0)) { + release_queue(sq); release_stripe(sh); raid_bio->bi_hw_segments = scnt; conf->retry_read_aligned = raid_bio; return handled; } - handle_stripe(sh, NULL); - release_stripe(sh); + handle_queue(sq, disks, data_disks); handled++; } spin_lock_irq(&conf->device_lock); @@ -3889,7 +4486,60 @@ static int retry_aligned_read(raid5_con return handled; } +static void raid456_cache_arbiter(struct work_struct *work) +{ + raid5_conf_t *conf = container_of(work, raid5_conf_t, + stripe_queue_work); + struct list_head *sq_entry; + int attach = 0; + + /* attach queues to stripes in priority order */ + pr_debug("+++ %s active\n", __FUNCTION__); + spin_lock_irq(&conf->device_lock); + do { + sq_entry = NULL; + if (!list_empty(&conf->io_hi_queue)) + sq_entry = conf->io_hi_queue.next; + else if (!list_empty(&conf->io_lo_queue)) + sq_entry = conf->io_lo_queue.next; + + /* "these aren't the droids you're looking for..." + * do not handle the delayed list while there are better + * things to do + */ + if (!sq_entry && + atomic_read(&conf->preread_active_queues) < + IO_THRESHOLD && !blk_queue_plugged(conf->mddev->queue) && + !list_empty(&conf->delayed_q_list)) { + raid5_activate_delayed(conf); + sq_entry = conf->io_lo_queue.next; + } + + if (sq_entry) { + struct stripe_queue *sq; + struct stripe_head *sh; + sq = list_entry(sq_entry, struct stripe_queue, + list_node); + + list_del_init(sq_entry); + atomic_inc(&sq->count); + BUG_ON(atomic_read(&sq->count) != 1); + BUG_ON(sq->sh); + spin_unlock_irq(&conf->device_lock); + sh = get_active_stripe(sq, conf->raid_disks, 0); + spin_lock_irq(&conf->device_lock); + + set_bit(STRIPE_HANDLE, &sh->state); + __release_queue(conf, sq); + __release_stripe(conf, sh); + attach++; + } + } while (sq_entry); + spin_unlock_irq(&conf->device_lock); + pr_debug("%d stripe(s) attached\n", attach); + pr_debug("--- %s inactive\n", __FUNCTION__); +} /* * This is our raid5 kernel thread. 
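/*
 * Illustrative sketch (userspace model) of the selection order used by
 * raid456_cache_arbiter() above: queues are attached to stripe_heads from
 * io_hi first, then io_lo, and the delayed list is only promoted to io_lo
 * (with the preread-active accounting) once both priority lists are empty,
 * the preread count is under IO_THRESHOLD and the device is unplugged.  The
 * one-field list type and the threshold value are stand-ins.
 */
#include <stdio.h>

#define SKETCH_IO_THRESHOLD 1

struct sketch_list { int nr; };                         /* queue count only */

static int list_has(const struct sketch_list *l)        { return l->nr > 0; }
static void list_take(struct sketch_list *l)            { l->nr--; }

static const char *pick_next(struct sketch_list *io_hi,
                             struct sketch_list *io_lo,
                             struct sketch_list *delayed,
                             int *preread_active, int plugged)
{
        if (list_has(io_hi)) {
                list_take(io_hi);
                return "io_hi (full stripe write)";
        }
        if (list_has(io_lo)) {
                list_take(io_lo);
                return "io_lo (read or preread-active write)";
        }
        if (*preread_active < SKETCH_IO_THRESHOLD && !plugged &&
            list_has(delayed)) {
                list_take(delayed);             /* promote: delayed -> io_lo */
                (*preread_active)++;
                return "delayed write promoted to io_lo";
        }
        return "nothing to attach";
}

int main(void)
{
        struct sketch_list hi = { 1 }, lo = { 1 }, delayed = { 1 };
        int preread = 0, i;

        for (i = 0; i < 4; i++)
                printf("pass %d: %s\n", i,
                       pick_next(&hi, &lo, &delayed, &preread, 0));
        return 0;
}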
@@ -3923,12 +4573,6 @@ static void raid5d (mddev_t *mddev) activate_bit_delay(conf); } - if (list_empty(&conf->handle_list) && - atomic_read(&conf->preread_active_stripes) < IO_THRESHOLD && - !blk_queue_plugged(mddev->queue) && - !list_empty(&conf->delayed_list)) - raid5_activate_delayed(conf); - while ((bio = remove_bio_from_retry(conf))) { int ok; spin_unlock_irq(&conf->device_lock); @@ -3940,6 +4584,7 @@ static void raid5d (mddev_t *mddev) } if (list_empty(&conf->handle_list)) { + queue_work(conf->workqueue, &conf->stripe_queue_work); async_tx_issue_pending_all(); break; } @@ -3982,7 +4627,8 @@ raid5_store_stripe_cache_size(mddev_t *m { raid5_conf_t *conf = mddev_to_conf(mddev); char *end; - int new; + int new, queue, i; + if (len >= PAGE_SIZE) return -EINVAL; if (!conf) @@ -3998,9 +4644,21 @@ raid5_store_stripe_cache_size(mddev_t *m conf->max_nr_stripes--; else break; + + for (i = 0, queue = 0; i < STRIPE_QUEUE_SIZE; i++) + queue += drop_one_queue(conf); + + if (queue < STRIPE_QUEUE_SIZE) + break; } md_allow_write(mddev); while (new > conf->max_nr_stripes) { + for (i = 0, queue = 0; i < STRIPE_QUEUE_SIZE; i++) + queue += grow_one_queue(conf); + + if (queue < STRIPE_QUEUE_SIZE) + break; + if (grow_one_stripe(conf)) conf->max_nr_stripes++; else break; @@ -4026,9 +4684,23 @@ stripe_cache_active_show(mddev_t *mddev, static struct md_sysfs_entry raid5_stripecache_active = __ATTR_RO(stripe_cache_active); +static ssize_t +stripe_queue_active_show(mddev_t *mddev, char *page) +{ + raid5_conf_t *conf = mddev_to_conf(mddev); + if (conf) + return sprintf(page, "%d\n", atomic_read(&conf->active_queues)); + else + return 0; +} + +static struct md_sysfs_entry +raid5_stripequeue_active = __ATTR_RO(stripe_queue_active); + static struct attribute *raid5_attrs[] = { &raid5_stripecache_size.attr, &raid5_stripecache_active.attr, + &raid5_stripequeue_active.attr, NULL, }; static struct attribute_group raid5_attrs_group = { @@ -4129,16 +4801,30 @@ static int run(mddev_t *mddev) if (!conf->spare_page) goto abort; } + + sprintf(conf->workqueue_name, "%s_cache_arb", + mddev->gendisk->disk_name); + conf->workqueue = create_singlethread_workqueue(conf->workqueue_name); + if (!conf->workqueue) + goto abort; + spin_lock_init(&conf->device_lock); init_waitqueue_head(&conf->wait_for_stripe); + init_waitqueue_head(&conf->wait_for_queue); init_waitqueue_head(&conf->wait_for_overlap); INIT_LIST_HEAD(&conf->handle_list); - INIT_LIST_HEAD(&conf->delayed_list); INIT_LIST_HEAD(&conf->bitmap_list); INIT_LIST_HEAD(&conf->inactive_list); + INIT_LIST_HEAD(&conf->io_hi_queue); + INIT_LIST_HEAD(&conf->io_lo_queue); + INIT_LIST_HEAD(&conf->delayed_q_list); + INIT_LIST_HEAD(&conf->inactive_queue_list); atomic_set(&conf->active_stripes, 0); - atomic_set(&conf->preread_active_stripes, 0); + atomic_set(&conf->active_queues, 0); + atomic_set(&conf->preread_active_queues, 0); atomic_set(&conf->active_aligned_reads, 0); + INIT_WORK(&conf->stripe_queue_work, raid456_cache_arbiter); + init_sq.raid_conf = conf; pr_debug("raid5: run(%s) called.\n", mdname(mddev)); @@ -4238,6 +4924,8 @@ static int run(mddev_t *mddev) printk(KERN_INFO "raid5: allocated %dkB for %s\n", memory, mdname(mddev)); + conf->stripe_queue_tree = RB_ROOT; + if (mddev->degraded == 0) printk("raid5: raid level %d set %s active with %d out of %d" " devices, algorithm %d\n", conf->level, mdname(mddev), @@ -4294,6 +4982,8 @@ static int run(mddev_t *mddev) abort: if (conf) { print_raid5_conf(conf); + if (conf->workqueue) + destroy_workqueue(conf->workqueue); 
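/*
 * Illustrative sketch (userspace model) of the bookkeeping behind the
 * stripe_cache_size store hook above: every stripe_head added or removed is
 * matched by STRIPE_QUEUE_SIZE stripe_queues, so the queue pool stays a fixed
 * multiple (2x) of the stripe cache across sysfs resizes.  The allocation
 * failure paths of grow_one_stripe()/grow_one_queue() are omitted and the
 * struct below is a stand-in for raid5_conf_t.
 */
#include <stdio.h>

#define SKETCH_STRIPE_QUEUE_SIZE 2

struct sketch_conf {
        int max_nr_stripes;
        int nr_queues;
};

static void resize_cache(struct sketch_conf *conf, int target)
{
        while (target < conf->max_nr_stripes) {         /* shrink */
                conf->max_nr_stripes--;
                conf->nr_queues -= SKETCH_STRIPE_QUEUE_SIZE;
        }
        while (target > conf->max_nr_stripes) {         /* grow */
                conf->nr_queues += SKETCH_STRIPE_QUEUE_SIZE;
                conf->max_nr_stripes++;
        }
}

int main(void)
{
        struct sketch_conf conf = { 256, 512 };

        resize_cache(&conf, 384);
        printf("stripes=%d queues=%d (ratio %d)\n", conf.max_nr_stripes,
               conf.nr_queues, conf.nr_queues / conf.max_nr_stripes);
        return 0;
}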
safe_put_page(conf->spare_page); kfree(conf->disks); kfree(conf->stripe_hashtbl); @@ -4318,6 +5008,7 @@ static int stop(mddev_t *mddev) blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/ sysfs_remove_group(&mddev->kobj, &raid5_attrs_group); kfree(conf->disks); + destroy_workqueue(conf->workqueue); kfree(conf); mddev->private = NULL; return 0; @@ -4328,8 +5019,8 @@ static void print_sh (struct seq_file *s { int i; - seq_printf(seq, "sh %llu, pd_idx %d, state %ld.\n", - (unsigned long long)sh->sector, sh->pd_idx, sh->state); + seq_printf(seq, "sh %llu, state %ld.\n", + (unsigned long long)sh->sector, sh->state); seq_printf(seq, "sh %llu, count %d.\n", (unsigned long long)sh->sector, atomic_read(&sh->count)); seq_printf(seq, "sh %llu, ", (unsigned long long)sh->sector); @@ -4340,19 +5031,33 @@ static void print_sh (struct seq_file *s seq_printf(seq, "\n"); } -static void printall (struct seq_file *seq, raid5_conf_t *conf) +static void print_sq(struct seq_file *seq, struct stripe_queue *sq) { - struct stripe_head *sh; - struct hlist_node *hn; int i; + seq_printf(seq, "sq %llu, pd_idx %d, state %ld.\n", + (unsigned long long)sq->sector, sq->pd_idx, sq->state); + seq_printf(seq, "sq %llu, count %d.\n", + (unsigned long long)sq->sector, atomic_read(&sq->count)); + seq_printf(seq, "sq %llu, ", (unsigned long long)sq->sector); + seq_printf(seq, "\n"); + seq_printf(seq, "sq %llu, sh %p.\n", + (unsigned long long) sq->sector, sq->sh); + if (sq->sh) + print_sh(seq, sq->sh); +} + +static void printall(struct seq_file *seq, raid5_conf_t *conf) +{ + struct stripe_queue *sq; + struct rb_node *rbn; + spin_lock_irq(&conf->device_lock); - for (i = 0; i < NR_HASH; i++) { - hlist_for_each_entry(sh, hn, &conf->stripe_hashtbl[i], hash) { - if (sh->raid_conf != conf) - continue; - print_sh(seq, sh); - } + rbn = rb_first(&conf->stripe_queue_tree); + while (rbn) { + sq = rb_entry(rbn, struct stripe_queue, rb_node); + print_sq(seq, sq); + rbn = rb_next(rbn); } spin_unlock_irq(&conf->device_lock); } @@ -4671,8 +5376,8 @@ static void raid5_quiesce(mddev_t *mddev case 1: /* stop all writes */ spin_lock_irq(&conf->device_lock); conf->quiesce = 1; - wait_event_lock_irq(conf->wait_for_stripe, - atomic_read(&conf->active_stripes) == 0 && + wait_event_lock_irq(conf->wait_for_queue, + atomic_read(&conf->active_queues) == 0 && atomic_read(&conf->active_aligned_reads) == 0, conf->device_lock, /* nothing */); spin_unlock_irq(&conf->device_lock); @@ -4681,7 +5386,7 @@ static void raid5_quiesce(mddev_t *mddev case 0: /* re-enable writes */ spin_lock_irq(&conf->device_lock); conf->quiesce = 0; - wake_up(&conf->wait_for_stripe); + wake_up(&conf->wait_for_queue); wake_up(&conf->wait_for_overlap); spin_unlock_irq(&conf->device_lock); break; diff --git a/include/linux/raid/raid5.h b/include/linux/raid/raid5.h index 93678f5..c744fa4 100644 --- a/include/linux/raid/raid5.h +++ b/include/linux/raid/raid5.h @@ -3,6 +3,7 @@ #define _RAID5_H #include #include +#include /* * @@ -158,16 +159,13 @@ #include * the compute block completes. 
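/*
 * Illustrative sketch (userspace model) of the in-order walk printall() above
 * performs with rb_first()/rb_next(): stripe_queues are kept in a red-black
 * tree keyed by sector (conf->stripe_queue_tree), unlike stripe_heads which
 * stay in the hash table, so dumping them in sector order is a plain in-order
 * traversal.  A small unbalanced BST stands in for the rbtree here.
 */
#include <stdio.h>

struct sketch_sq {
        unsigned long long sector;              /* tree key */
        struct sketch_sq *left, *right;
};

static struct sketch_sq *sq_insert(struct sketch_sq *root, struct sketch_sq *sq)
{
        if (!root)
                return sq;
        if (sq->sector < root->sector)
                root->left = sq_insert(root->left, sq);
        else
                root->right = sq_insert(root->right, sq);
        return root;
}

static void sq_print_all(const struct sketch_sq *sq)    /* in-order == by sector */
{
        if (!sq)
                return;
        sq_print_all(sq->left);
        printf("sq %llu\n", sq->sector);
        sq_print_all(sq->right);
}

int main(void)
{
        struct sketch_sq a = { 24 }, b = { 8 }, c = { 16 };
        struct sketch_sq *root = NULL;

        root = sq_insert(root, &a);
        root = sq_insert(root, &b);
        root = sq_insert(root, &c);
        sq_print_all(root);                     /* prints 8, 16, 24 */
        return 0;
}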
*/ +struct stripe_queue; struct stripe_head { struct hlist_node hash; struct list_head lru; /* inactive_list or handle_list */ - struct raid5_private_data *raid_conf; sector_t sector; /* sector of this row */ - int pd_idx; /* parity disk index */ unsigned long state; /* state flags */ atomic_t count; /* nr of active thread/requests */ - spinlock_t lock; - int bm_seq; /* sequence number for bitmap flushes */ int disks; /* disks in stripe */ /* stripe_operations * @pending - pending ops flags (set for request->issue->complete) @@ -184,13 +182,13 @@ struct stripe_head { int count; u32 zero_sum_result; } ops; + struct stripe_queue *sq; /* list of pending bios for this stripe */ struct r5dev { struct bio req; struct bio_vec vec; struct page *page; - struct bio *toread, *read, *towrite, *written; - sector_t sector; /* sector of this page */ - unsigned long flags; + struct bio *read, *written; + unsigned long flags; } dev[1]; /* allocated with extra space depending of RAID geometry */ }; @@ -209,6 +207,35 @@ struct r6_state { int p_failed, q_failed, qd_idx, failed_num[2]; }; +/* stripe_queue + * @sector - rb_tree key + * @lock + * @sh - our stripe_head in the cache + * @list_node - once this queue object satisfies some constraint (like full + * stripe write) it is placed on a list for processing by the cache + * @overwrite_count - how many blocks are set to be overwritten + */ +struct stripe_queue { + struct rb_node rb_node; + sector_t sector; + int pd_idx; /* parity disk index */ + int bm_seq; /* sequence number for bitmap flushes */ + spinlock_t lock; + struct raid5_private_data *raid_conf; + unsigned long state; + struct stripe_head *sh; + struct list_head list_node; + wait_queue_head_t wait_for_attach; + unsigned long *to_read; + unsigned long *to_write; + unsigned long *overwrite; + atomic_t count; + struct r5_queue_dev { + sector_t sector; /* hw starting sector for this block */ + struct bio *toread, *towrite; + } dev[1]; +}; + /* Flags */ #define R5_UPTODATE 0 /* page contains current data */ #define R5_LOCKED 1 /* IO has been submitted on "req" */ @@ -245,11 +272,7 @@ #define CHECK_PARITY 3 #define STRIPE_HANDLE 2 #define STRIPE_SYNCING 3 #define STRIPE_INSYNC 4 -#define STRIPE_PREREAD_ACTIVE 5 -#define STRIPE_DELAYED 6 #define STRIPE_DEGRADED 7 -#define STRIPE_BIT_DELAY 8 -#define STRIPE_EXPANDING 9 #define STRIPE_EXPAND_SOURCE 10 #define STRIPE_EXPAND_READY 11 /* @@ -271,6 +294,17 @@ #define STRIPE_OP_MOD_REPAIR_PD 7 #define STRIPE_OP_MOD_DMA_CHECK 8 /* + * Stripe-queue state + */ +#define STRIPE_QUEUE_HANDLE 0 +#define STRIPE_QUEUE_IO_HI 1 +#define STRIPE_QUEUE_IO_LO 2 +#define STRIPE_QUEUE_DELAYED 3 +#define STRIPE_QUEUE_EXPANDING 4 +#define STRIPE_QUEUE_PREREAD_ACTIVE 5 +#define STRIPE_QUEUE_BIT_DELAY 6 + +/* * Plugging: * * To improve write throughput, we need to delay the handling of some @@ -301,6 +335,7 @@ struct disk_info { struct raid5_private_data { struct hlist_head *stripe_hashtbl; + struct rb_root stripe_queue_tree; mddev_t *mddev; struct disk_info *spare; int chunk_size, level, algorithm; @@ -316,20 +351,32 @@ struct raid5_private_data { int previous_raid_disks; struct list_head handle_list; /* stripes needing handling */ - struct list_head delayed_list; /* stripes that have plugged requests */ struct list_head bitmap_list; /* stripes delaying awaiting bitmap update */ + struct list_head delayed_q_list; /* queues that have plugged + * requests + */ + struct list_head io_hi_queue; /* reads and full stripe writes */ + struct list_head io_lo_queue; /* sub-stripe-width writes 
*/ + struct workqueue_struct *workqueue; /* attaches sq's to sh's */ + struct work_struct stripe_queue_work; + char workqueue_name[20]; + struct bio *retry_read_aligned; /* currently retrying aligned bios */ struct bio *retry_read_aligned_list; /* aligned bios retry list */ - atomic_t preread_active_stripes; /* stripes with scheduled io */ atomic_t active_aligned_reads; + atomic_t preread_active_queues; /* queues with scheduled + * io + */ atomic_t reshape_stripes; /* stripes with pending writes for reshape */ /* unfortunately we need two cache names as we temporarily have * two caches. */ int active_name; - char cache_name[2][20]; - struct kmem_cache *slab_cache; /* for allocating stripes */ + char sh_cache_name[2][20]; + char sq_cache_name[2][20]; + struct kmem_cache *sh_slab_cache; + struct kmem_cache *sq_slab_cache; int seq_flush, seq_write; int quiesce; @@ -342,12 +389,20 @@ struct raid5_private_data { struct page *spare_page; /* Used when checking P/Q in raid6 */ /* + * Free queue pool + */ + atomic_t active_queues; + struct list_head inactive_queue_list; + wait_queue_head_t wait_for_queue; + wait_queue_head_t wait_for_overlap; + int inactive_queue_blocked; + + /* * Free stripes pool */ atomic_t active_stripes; struct list_head inactive_list; wait_queue_head_t wait_for_stripe; - wait_queue_head_t wait_for_overlap; int inactive_blocked; /* release of inactive stripes blocked, * waiting for 25% to be free */
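/*
 * Illustrative sketch (userspace model) of the congestion test raid5_congested()
 * now applies to the stripe_queue pool declared above: the array reports itself
 * congested when writes are quiesced, when queue allocation is blocked, or when
 * the inactive queue list is empty; the stripe cache itself no longer factors
 * into the decision.  Field names mirror the conf members above, but the struct
 * is a stand-in.
 */
#include <stdio.h>

struct sketch_conf {
        int quiesce;
        int inactive_queue_blocked;
        int nr_inactive_queues;         /* stands in for inactive_queue_list */
};

static int queue_congested(const struct sketch_conf *conf)
{
        if (conf->inactive_queue_blocked)
                return 1;
        if (conf->quiesce)
                return 1;
        if (conf->nr_inactive_queues == 0)      /* list_empty_careful(...) */
                return 1;
        return 0;
}

int main(void)
{
        struct sketch_conf busy = { 0, 1, 0 };
        struct sketch_conf idle = { 0, 0, 128 };

        printf("busy: %d  idle: %d\n", queue_congested(&busy),
               queue_congested(&idle));
        return 0;
}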