commit bea9a6d239cb2aa2ced4dcb0a05e1827ce61fa3d
Merge: cd9f040 5453258
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun Jul 18 10:09:25 2010 -0700

    Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
    
    * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2:
      ocfs2: Silence gcc warning in ocfs2_write_zero_page().
      jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
      ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
      ocfs2: Don't duplicate pages past i_size during CoW.
      ocfs2: tighten up strlen() checking
      ocfs2: Make xattr reflink work with new local alloc reservation.
      ocfs2: make xattr extension work with new local alloc reservation.
      ocfs2: Remove the redundant cpu_to_le64.
      ocfs2/dlm: don't access beyond bitmap size
      ocfs2: No need to zero pages past i_size.
      ocfs2: Zero the tail cluster when extending past i_size.
      ocfs2: When zero extending, do it by page.
      ocfs2: Limit default local alloc size within bitmap range.
      ocfs2: Move orphan scan work to ocfs2_wq.
      fs/ocfs2/dlm: Add missing spin_unlock

commit 5453258d532e72731b0829e4fefd36dd611a2fff
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Jul 16 13:32:33 2010 -0700

    ocfs2: Silence gcc warning in ocfs2_write_zero_page().
    
    ocfs2_write_zero_page() has a loop that won't ever be skipped, but gcc
    doesn't know that.  Set ret=0 just to make gcc happy.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 13ceef099edd2b70c5a6f3a9ef5d6d97cda2e096
Author: Jan Kara <jack@suse.cz>
Date:   Wed Jul 14 07:56:33 2010 +0200

    jbd2/ocfs2: Fix block checksumming when a buffer is used in several transactions
    
    OCFS2 uses t_commit trigger to compute and store checksum of the just
    committed blocks. When a buffer has b_frozen_data, checksum is computed
    for it instead of b_data but this can result in an old checksum being
    written to the filesystem in the following scenario:
    
    1) transaction1 is opened
    2) handle1 is opened
    3) journal_access(handle1, bh)
        - This sets jh->b_transaction to transaction1
    4) modify(bh)
    5) journal_dirty(handle1, bh)
    6) handle1 is closed
    7) start committing transaction1, opening transaction2
    8) handle2 is opened
    9) journal_access(handle2, bh)
        - This copies off b_frozen_data to make it safe for transaction1 to commit.
          jh->b_next_transaction is set to transaction2.
    10) jbd2_journal_write_metadata() checksums b_frozen_data
    11) the journal correctly writes b_frozen_data to the disk journal
    12) handle2 is closed
        - There was no dirty call for the bh on handle2, so it is never queued for
          any more journal operation
    13) Checkpointing finally happens, and it just spools the bh via normal buffer
        writeback.  This will write b_data, which was never triggered on and thus
        contains a wrong (old) checksum.
    
    This patch fixes the problem by calling the trigger at the moment data is
    frozen for journal commit - i.e., either when b_frozen_data is created by
    do_get_write_access or just before we write a buffer to the log if
    b_frozen_data does not exist. We also rename the trigger to t_frozen as
    that better describes when it is called.
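    
    The idea can be illustrated with a toy userspace model.  This is only a
    sketch, not the jbd2 code: struct toy_buf, frozen_trigger() and freeze()
    are hypothetical names, the "checksum" is a single inverted byte, and
    firing the trigger before the copy is an assumption of the model.

    ```c
    #include <string.h>

    struct toy_buf {
        unsigned char data[2];    /* [0] = payload, [1] = "checksum"; plays b_data */
        unsigned char frozen[2];  /* copy handed to the committing transaction */
        int has_frozen;
    };

    /* The trigger: recompute the checksum byte from the payload byte. */
    static void frozen_trigger(unsigned char *blk)
    {
        blk[1] = (unsigned char)~blk[0];
    }

    /* Freeze for commit: fire the trigger on the live data, then copy it,
     * so both the frozen copy and the live buffer carry a fresh checksum. */
    static void freeze(struct toy_buf *b)
    {
        frozen_trigger(b->data);
        memcpy(b->frozen, b->data, sizeof(b->frozen));
        b->has_frozen = 1;
    }
    ```

    In this model a later plain writeback of data[] can no longer carry a
    stale checksum, because the trigger already ran on it at freeze time.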
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit a39953dd95ff10e311083d94f4f95c348cb22464
Author: Wengang Wang <wen.gang.wang@oracle.com>
Date:   Wed Jul 14 22:38:21 2010 +0800

    ocfs2/dlm: Remove BUG_ON from migration in the rare case of a down node
    
    For migration, we are waiting for DLM_LOCK_RES_MIGRATING flag to be set
    before sending DLM_MIG_LOCKRES_MSG message to the target. We are using
    dlm_migration_can_proceed() for that purpose.  However, if the node is
    down, dlm_migration_can_proceed() will also return "go ahead".  In this
    rare case, the DLM_LOCK_RES_MIGRATING flag might not be set yet. Remove
    the BUG_ON() that trips over this condition.
    
    Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit f5e27b6ddfbafdd9c9c2f06bbf28af12581409bc
Author: Tao Ma <tao.ma@oracle.com>
Date:   Wed Jul 14 11:19:32 2010 +0800

    ocfs2: Don't duplicate pages past i_size during CoW.
    
    During CoW, the pages after i_size don't contain valid data, so there's
    no need to read and duplicate them.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit e372357ba55ae89307af15cd680467d8f0db4f01
Author: Dan Carpenter <error27@gmail.com>
Date:   Sat Jul 10 16:33:36 2010 +0200

    ocfs2: tighten up strlen() checking
    
    This function is only called from one place and it's like this:
    	dlm_register_domain(conn->cc_name, dlm_key, &fs_version);
    
    The "conn->cc_name" is 64 characters long.  If strlen(conn->cc_name)
    were equal to O2NM_MAX_NAME_LEN (64) that would be a bug because
    strlen() doesn't count the terminating NUL character.
    
    In fact, if you look how O2NM_MAX_NAME_LEN is used, it mostly describes
    64 character buffers.  The only exception is nd_name from struct
    o2nm_node.
    
    Anyway I looked into it and in this case the domain string comes from
    osb->uuid_str in ocfs2_setup_osb_uuid().  That's 32 characters and a NUL
    which easily fits into O2NM_MAX_NAME_LEN.  This patch doesn't change how
    the code works, but I think it makes the code a little cleaner.
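    
    A minimal userspace sketch of the point about strlen() and buffer sizes.
    MAX_NAME_LEN stands in for O2NM_MAX_NAME_LEN and name_fits() is a
    hypothetical helper; the exact comparison used in the patch is not shown
    in this log.

    ```c
    #include <string.h>

    #define MAX_NAME_LEN 64  /* stand-in for O2NM_MAX_NAME_LEN */

    /* Returns 1 if name, including its terminating NUL, fits in a
     * MAX_NAME_LEN byte buffer, 0 otherwise. */
    static int name_fits(const char *name)
    {
        /* strlen() excludes the NUL, so a length equal to the buffer
         * size is already one byte too long. */
        return strlen(name) < MAX_NAME_LEN;
    }
    ```

    A 32-character UUID string passes this check with plenty of room, while
    a 64-character name is rejected.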
    
    Signed-off-by: Dan Carpenter <error27@gmail.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 121a39bb00b421211f4f590c440a8f636d3ae807
Author: Tao Ma <tao.ma@oracle.com>
Date:   Fri Jul 9 14:53:12 2010 +0800

    ocfs2: Make xattr reflink work with new local alloc reservation.
    
    The new reservation code in local alloc has added the limitation
    that the caller must handle the case where the local alloc
    doesn't give us enough contiguous clusters. This breaks the old
    xattr reflink code.
    
    So this patch updates the xattr reflink code so that it can
    handle the case where local alloc gives us one cluster at a time.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit a78f9f4668949a6588b8872f162e86685c63d023
Author: Tao Ma <tao.ma@oracle.com>
Date:   Fri Jul 9 14:53:11 2010 +0800

    ocfs2: make xattr extension work with new local alloc reservation.
    
    The old ocfs2_xattr_extent_allocation is too optimistic about
    the clusters we can get. If the file system is too fragmented,
    ocfs2_add_clusters_in_btree will return EAGAIN and we need to
    allocate clusters once again.
    
    So this patch changes it to a while loop so that we can allocate
    clusters until we reach clusters_to_add.
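    
    The retry loop can be sketched in userspace as follows.  alloc_some()
    and extend_by() are hypothetical stand-ins, not the ocfs2 functions;
    the allocator is assumed to grant at most max_chunk (> 0) clusters per
    call and to return -EAGAIN while more are still needed.

    ```c
    #include <errno.h>

    /* Hypothetical allocator: grants at most max_chunk clusters per call. */
    static int alloc_some(unsigned int want, unsigned int max_chunk,
                          unsigned int *got)
    {
        *got = want < max_chunk ? want : max_chunk;
        return *got < want ? -EAGAIN : 0;
    }

    /* Keep allocating until clusters_to_add is used up or a real error
     * occurs.  *calls reports how many allocation rounds were needed. */
    static int extend_by(unsigned int clusters_to_add, unsigned int max_chunk,
                         unsigned int *calls)
    {
        unsigned int got;
        int ret = 0;

        *calls = 0;
        while (clusters_to_add) {
            ret = alloc_some(clusters_to_add, max_chunk, &got);
            (*calls)++;
            if (ret && ret != -EAGAIN)
                break;              /* genuine failure, give up */
            clusters_to_add -= got;
            ret = 0;                /* -EAGAIN only means "go round again" */
        }
        return ret;
    }
    ```

    On a badly fragmented volume (max_chunk of 1) the loop simply runs once
    per cluster instead of failing.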
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Cc: stable@kernel.org

commit 0a463b74e7e6856b24e613de2b85237c6e11890b
Author: Tao Ma <tao.ma@oracle.com>
Date:   Thu Jul 8 11:11:11 2010 +0800

    ocfs2: Remove the redundant cpu_to_le64.
    
    In ocfs2_block_group_alloc, we set c_blkno from bg->bg_blkno.
    But bg->bg_blkno has already been converted to little endian
    in ocfs2_block_group_fill. So remove the extra cpu_to_le64.
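    
    Why the extra conversion matters can be seen in a small sketch.  On a
    big-endian CPU cpu_to_le64() amounts to a byte swap, so applying it
    twice gives back the CPU-order value instead of the on-disk one; on a
    little-endian CPU it is a no-op and the bug stays hidden.
    swab64_sketch() is a hypothetical userspace helper, not a kernel API.

    ```c
    #include <stdint.h>

    /* Reverse the byte order of a 64-bit value. */
    static uint64_t swab64_sketch(uint64_t v)
    {
        uint64_t r = 0;

        for (int i = 0; i < 8; i++) {
            r = (r << 8) | (v & 0xff);  /* move the lowest byte of v into r */
            v >>= 8;
        }
        return r;
    }
    ```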
    
    Reported-by: Marcos Matsunaga <Marcos.Matsunaga@oracle.com>
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit f471c9df922a80ca9af1d9a490b4aab3f990ec19
Author: Wengang Wang <wen.gang.wang@oracle.com>
Date:   Wed Jun 30 20:23:30 2010 +0800

    ocfs2/dlm: don't access beyond bitmap size
    
    dlm->recovery_map is defined as
    	unsigned long recovery_map[BITS_TO_LONGS(O2NM_MAX_NODES)];
    
    We should treat O2NM_MAX_NODES as the bit map size in bits.
    This patch fixes a bit operation that takes O2NM_MAX_NODES + 1 as the bitmap size.
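    
    A userspace sketch of a bounded bit search.  MAX_NODES is an assumed
    stand-in for O2NM_MAX_NODES (255 here is an assumption), and
    next_set_bit() is a hypothetical helper rather than the kernel's
    find_next_bit().  The size argument is a bit count and the search
    never looks at bit index size or above.

    ```c
    #include <limits.h>
    #include <stddef.h>

    #define MAX_NODES 255  /* assumed stand-in for O2NM_MAX_NODES */
    #define BITS_PER_LONG_ (sizeof(unsigned long) * CHAR_BIT)
    #define BITS_TO_LONGS_(n) (((n) + BITS_PER_LONG_ - 1) / BITS_PER_LONG_)

    /* Return the index of the first set bit at or after start, or size if
     * there is none.  Bits at index >= size are never examined. */
    static size_t next_set_bit(const unsigned long *map, size_t size,
                               size_t start)
    {
        for (size_t i = start; i < size; i++)
            if (map[i / BITS_PER_LONG_] & (1UL << (i % BITS_PER_LONG_)))
                return i;
        return size;
    }
    ```

    Passing MAX_NODES as size keeps every access inside an array declared
    with BITS_TO_LONGS_(MAX_NODES) elements.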
    
    Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 693c241a5f6aa01417f5f4caf9f82e60e316398d
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Jul 2 17:20:27 2010 -0700

    ocfs2: No need to zero pages past i_size.
    
    When ocfs2 fills a hole, it does so by allocating clusters.  When a
    cluster is larger than the write, ocfs2 must zero the portions of the
    cluster outside of the write.  If the clustersize is smaller than a
    pagecache page, this is handled by the normal pagecache mechanisms, but
    when the clustersize is larger than a page, ocfs2's write code will zero
    the pages adjacent to the write.  This makes sure the entire cluster is
    zeroed correctly.
    
    Currently ocfs2 behaves exactly the same when writing past i_size.
    However, this means ocfs2 is writing zeroed pages for portions of a new
    cluster that are beyond i_size.  The page writeback code isn't expecting
    this.  It treats all pages past the one containing i_size as left behind
    due to a previous truncate operation.
    
    Thankfully, ocfs2 calculates the number of pages it will be working on
    up front.  The rest of the write code merely honors the original
    calculation.  We can simply trim the number of pages to only cover the
    actual file data.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Cc: stable@kernel.org

commit 5693486bad2bc2ac585a2c24f7e2f3964b478df9
Author: Joel Becker <joel.becker@oracle.com>
Date:   Thu Jul 1 15:13:31 2010 -0700

    ocfs2: Zero the tail cluster when extending past i_size.
    
    ocfs2's allocation unit is the cluster.  This can be larger than a block
    or even a memory page.  This means that a file may have many blocks in
    its last extent that are beyond the block containing i_size.  There also
    may be more unwritten extents after that.
    
    When ocfs2 grows a file, it zeros the entire cluster in order to ensure
    future i_size growth will see cleared blocks.  Unfortunately,
    block_write_full_page() drops the pages past i_size.  This means that
    ocfs2 is actually leaking garbage data into the tail end of that last
    cluster.  This is a bug.
    
    We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
    when a write or truncate is past i_size.  They will use
    ocfs2_zero_extend() to ensure the data is properly zeroed.
    
    Older versions of ocfs2_zero_extend() simply zeroed every block between
    i_size and the zeroing position.  This presumes three things:
    
    1) There is allocation for all of these blocks.
    2) The extents are not unwritten.
    3) The extents are not refcounted.
    
    (1) and (2) hold true for non-sparse filesystems, which used to be the
    only users of ocfs2_zero_extend().  (3) is another bug.
    
    Since we're now using ocfs2_zero_extend() for sparse filesystems as
    well, we teach ocfs2_zero_extend() to check every extent between
    i_size and the zeroing position.  If the extent is unwritten, it is
    ignored.  If it is refcounted, it is CoWed.  Then it is zeroed.
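    
    The per-extent decision can be sketched as a toy loop.  struct
    toy_extent, the flag names and zero_range() are hypothetical; the real
    code works on ocfs2 extent records and the page cache.

    ```c
    enum { EXT_UNWRITTEN = 1, EXT_REFCOUNTED = 2 };

    struct toy_extent {
        unsigned int flags;
        int cowed;   /* set when the extent was copied before writing */
        int zeroed;  /* set when the extent was zeroed */
    };

    /* Walk the extents between i_size and the zeroing position. */
    static void zero_range(struct toy_extent *ext, int n)
    {
        for (int i = 0; i < n; i++) {
            if (ext[i].flags & EXT_UNWRITTEN)
                continue;                   /* unwritten: ignored */
            if (ext[i].flags & EXT_REFCOUNTED) {
                ext[i].cowed = 1;           /* refcounted: CoW first */
                ext[i].flags &= ~(unsigned int)EXT_REFCOUNTED;
            }
            ext[i].zeroed = 1;              /* then zero it */
        }
    }
    ```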
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Cc: stable@kernel.org

commit a4bfb4cf11fd2211b788af59dc8a8b4394bca227
Author: Joel Becker <joel.becker@oracle.com>
Date:   Tue Jul 6 14:36:06 2010 -0700

    ocfs2: When zero extending, do it by page.
    
    ocfs2_zero_extend() does its zeroing block by block, but it calls a
    function named ocfs2_write_zero_page().  Let's have
    ocfs2_write_zero_page() handle the page level.  From
    ocfs2_zero_extend()'s perspective, it is now page-at-a-time.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Cc: stable@kernel.org

commit 327f935a9ef644c0ec3d050c94bce753756d60c0
Author: Tejun Heo <tj@kernel.org>
Date:   Tue Mar 30 02:52:32 2010 +0900

    ocfs2: update gfp/slab.h includes
    
    Implicit slab.h inclusion via percpu.h is about to go away.  Make sure
    gfp.h or slab.h is included as necessary.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Cc: Stephen Rothwell <sfr@canb.auug.org.au>
    Cc: Joel Becker <joel.becker@oracle.com>
    Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>

commit 1739da40543ed2129050ccfa8a076a851ab6ed00
Author: Tao Ma <tao.ma@oracle.com>
Date:   Wed Jun 9 16:43:05 2010 +0800

    ocfs2: Limit default local alloc size within bitmap range.
    
    In commit 6b82021b9e91cd689fdffadbcdb9a42597bbe764, we increased
    our local alloc size and calculate how many megabytes we can
    get according to group size and volume size.
    But we also need to check the maximum number of bits a local alloc
    block bitmap can have. With bs=512, cs=32K, and a local volume of
    160G, it calculates 96MB while the maximum local alloc size is only
    76MB. So the bitmap will overflow and corrupt the system truncate
    log file. See bug
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1262
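    
    The arithmetic can be sketched as a simple clamp.  clamp_la_mb() is a
    hypothetical helper, not the ocfs2 function.  The 2432-bit figure used
    in the usage note is derived from the numbers above (76MB / 32K per
    cluster), with one bitmap bit per cluster.

    ```c
    /* Limit a wanted local alloc size (in MB) to what a bitmap of
     * bitmap_bits bits can describe with clusters of cluster_kb KB. */
    static unsigned int clamp_la_mb(unsigned int want_mb,
                                    unsigned int cluster_kb,
                                    unsigned int bitmap_bits)
    {
        /* bits * KB per bit, converted to MB */
        unsigned int max_mb = bitmap_bits * cluster_kb / 1024;

        return want_mb > max_mb ? max_mb : want_mb;
    }
    ```

    With cs=32K and 2432 bits, a request for 96MB is trimmed to 76MB, while
    a request for 64MB passes through unchanged.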
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 40f165f416bde747d85cdf71bc9dde700912f71f
Author: Tao Ma <tao.ma@oracle.com>
Date:   Fri May 28 14:22:59 2010 +0800

    ocfs2: Move orphan scan work to ocfs2_wq.
    
    We used to run orphan scan work in the default work queue,
    but there is a corner case which will make the system deadlock.
    The scenario is like this:
    1. Set the heartbeat threshold to 200. This gives us a good
       chance of running an orphan scan before our quorum decision.
    2. Mount node 1.
    3. After 1~2 minutes, mount node 2 (to make the bug easier
       to reproduce, add maxcpus=1 to the kernel command line).
    4. Node 1 does orphan scan work.
    5. Node 2 does orphan scan work.
    6. Node 1 does orphan scan work. After this, node 1 holds the orphan
       scan lock while node 2 knows node 1 is the master.
    7. ifdown eth2 on node 2 (eth2 is the ocfs2 interconnect).
    
    Now when node 2 begins orphan scan, the system queue is blocked.
    
    The root cause is that both orphan scan work and quorum decision work
    use the system event work queue. Orphan scan has a chance of
    blocking the event work queue (in dlm_wait_for_node_death) so that there
    is no chance for quorum decision work to proceed.
    
    This patch resolves it by moving orphan scan work to ocfs2_wq.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 6469272c350872980891dbe38e81c936c43f2d9b
Author: Julia Lawall <julia@diku.dk>
Date:   Wed May 26 17:58:53 2010 +0200

    fs/ocfs2/dlm: Add missing spin_unlock
    
    Add a spin_unlock missing on the error path.  Unlock as in the other code
    that leads to the leave label.
    
    The semantic match that finds this problem is as follows:
    (http://coccinelle.lip6.fr/)
    
    // <smpl>
    @@
    expression E1;
    @@
    
    * spin_lock(E1,...);
      <+... when != E1
      if (...) {
        ... when != E1
    *   return ...;
      }
      ...+>
    * spin_unlock(E1,...);
    // </smpl>
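    
    The shape of the fix can be sketched with a toy lock that only counts
    its depth.  struct toy_spinlock, toy_lock(), toy_unlock() and do_work()
    are hypothetical names and not the dlm code; the point is that every
    path reaching the leave label has dropped the lock.

    ```c
    /* Toy lock that only counts depth, so an unbalanced path is easy to see. */
    struct toy_spinlock { int depth; };

    static void toy_lock(struct toy_spinlock *l)   { l->depth++; }
    static void toy_unlock(struct toy_spinlock *l) { l->depth--; }

    static int do_work(struct toy_spinlock *l, int fail)
    {
        int ret = 0;

        toy_lock(l);
        if (fail) {
            ret = -1;
            toy_unlock(l);  /* the unlock that was missing on the error path */
            goto leave;
        }
        toy_unlock(l);
    leave:
        return ret;
    }
    ```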
    
    Signed-off-by: Julia Lawall <julia@diku.dk>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit d28619f1563140526e2f84eae436f39206f40a69
Merge: 021fad8 f32764b
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Sun May 30 09:11:11 2010 -0700

    Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6
    
    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6:
      quota: Convert quota statistics to generic percpu_counter
      ext3 uses rb_node = NULL; to zero rb_root.
      quota: Fixup dquot_transfer
      reiserfs: Fix resuming of quotas on remount read-write
      pohmelfs: Remove dead quota code
      ufs: Remove dead quota code
      udf: Remove dead quota code
      quota: rename default quotactl methods to dquot_
      quota: explicitly set ->dq_op and ->s_qcop
      quota: drop remount argument to ->quota_on and ->quota_off
      quota: move unmount handling into the filesystem
      quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
      quota: move remount handling into the filesystem
      ocfs2: Fix use after free on remount read-only
    
    Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c

commit 15c6fd9786dfaab43547bf60df6fa63170fb64fc
Author: npiggin@suse.de <npiggin@suse.de>
Date:   Thu May 27 01:05:34 2010 +1000

    kill spurious reference to vmtruncate
    
    Lots of filesystems call vmtruncate despite not implementing the old
    ->truncate method.  Switch them to use simple_setsize and add some
    comments about the truncate code where it seems fitting.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Nick Piggin <npiggin@suse.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

commit 7ea8085910ef3dd4f3cad6845aaa2b580d39b115
Author: Christoph Hellwig <hch@lst.de>
Date:   Wed May 26 17:53:25 2010 +0200

    drop unused dentry argument to ->fsync
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

commit 4be929be34f9bdeffa40d815d32d7d60d2c7f03b
Author: Alexey Dobriyan <adobriyan@gmail.com>
Date:   Mon May 24 14:33:03 2010 -0700

    kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN
    
    - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not
      USHORT_MAX/SHORT_MAX/SHORT_MIN.
    
    - Make SHRT_MIN of type s16, not int, for consistency.
    
    [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c]
    [akpm@linux-foundation.org: fix security/keys/keyring.c]
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit 287a80958cf63fc5c68d5bf6e89a3669dd66234a
Author: Christoph Hellwig <hch@infradead.org>
Date:   Wed May 19 07:16:45 2010 -0400

    quota: rename default quotactl methods to dquot_
    
    Follow the dquot_* style used elsewhere in dquot.c.
    
    [Jan Kara: Fixed up missing conversion of ext2]
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 307ae18a56e5b706056a2050d52e8cc01b5171c0
Author: Christoph Hellwig <hch@infradead.org>
Date:   Wed May 19 07:16:43 2010 -0400

    quota: drop remount argument to ->quota_on and ->quota_off
    
    Remount handling has fully moved into the filesystem, so all this is
    superfluous now.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 0f0dd62fddcbd0f6830ed8ef3d3426ccc46b9250
Author: Christoph Hellwig <hch@infradead.org>
Date:   Wed May 19 07:16:41 2010 -0400

    quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers
    
    Instead of having wrappers in the VFS namespace export the dquot_suspend
    and dquot_resume helpers directly.  Also rename vfs_quota_disable to
    dquot_disable while we're at it.
    
    [Jan Kara: Moved dquot_suspend to quotaops.h and made it inline]
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit eea7feb072f5914ecafa95b3d83be0c229244d90
Author: Jan Kara <jack@suse.cz>
Date:   Thu May 13 22:14:53 2010 +0200

    ocfs2: Fix use after free on remount read-only
    
    We also have to cancel quota syncing thread on remount read only because
    at that moment quota is being turned off. Otherwise quota syncing thread
    will try to access already freed quota structures.
    
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 75fe0a2477dab30f00c228f9a4d79009d5677bde
Author: Dmitry Monakhov <dmonakhov@openvz.org>
Date:   Thu Mar 4 17:32:16 2010 +0300

    ocfs2: replace inode uid,gid,mode initialization with helper function
    
    Acked-by: Joel Becker <joel.becker@oracle.com>
    Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

commit 537d81ca7c5338e4f13f3e7e7b50e87ba293ec68
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Thu May 13 17:53:22 2010 -0700

    ocfs: constify xattr_handler
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

commit c06bcbfa1ed8daaeb2a262f372b411207891e229
Author: Jan Kara <jack@suse.cz>
Date:   Thu May 13 22:14:53 2010 +0200

    ocfs2: Fix lock inversion in quotas during umount
    
    We cannot cancel delayed work from ocfs2_local_free_info because that is called
    with dqonoff_mutex held and the work it cancels requires dqonoff_mutex to
    finish. Cancel the work before acquiring dqonoff_mutex.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 52a9ee281cfb26fffce1d6c409fb4b1f4aa8a766
Author: Jan Kara <jack@suse.cz>
Date:   Thu May 13 20:18:45 2010 +0200

    ocfs2: Use __dquot_transfer to avoid lock inversion
    
    dquot_transfer() acquires own references to dquots via dqget(). Thus it waits
    for dq_lock which creates a lock inversion because dq_lock ranks above
    transaction start but transaction is already started in ocfs2_setattr(). Fix
    the problem by passing own references directly to __dquot_transfer.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 741e128933448e589a85286e535078b24f4cf568
Author: Jan Kara <jack@suse.cz>
Date:   Thu May 13 18:05:15 2010 +0200

    ocfs2: Fix NULL pointer deref when writing local dquot
    
    commit_dqblk() can write quota info to global file. That is actually a bad
    thing to do because if we are just modifying local quota file, we are not
    prepared (do not hold proper locks, do not have transaction credits) to do
    a modification of the global quota file. So do not use commit_dqblk() and
    instead call our writing function directly.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 832d09cf1438bd172f69478bde74f20f05ec0115
Author: Jan Kara <jack@suse.cz>
Date:   Tue May 11 17:04:14 2010 +0200

    ocfs2: Fix estimate of credits needed for quota allocation
    
    We were missing reservation of a journal credit for modification of quota
    file inode when creating new dquot structure in the global quota file.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit fb8dd8d780140a3f0e9074831a59054fec6cc451
Author: Jan Kara <jack@suse.cz>
Date:   Wed Mar 31 16:25:37 2010 +0200

    ocfs2: Fix quota locking
    
    OCFS2 had three issues with quota locking:
    a) When reading dquot from global quota file, we started a transaction while
       holding dqio_mutex which is prone to deadlocks because other paths do it
       the other way around
    b) During ocfs2_sync_dquot we were not protected against concurrent writers
       on the same node. Because we first copy data to local buffer, a race
       could happen resulting in old data being written to global quota file and
       thus causing quota inconsistency after a crash.
    c) ip_alloc_sem of quota files was acquired while a transaction is started
       in ocfs2_quota_write which can deadlock because we first get ip_alloc_sem
       and then start a transaction when extending quota files.
    
    We fix the problem a) by pulling all necessary code to ocfs2_acquire_dquot
    and ocfs2_release_dquot. Thus we no longer depend on generic dquot_acquire
    to do the locking and can force proper lock ordering.
    
    Problems b) and c) are fixed by locking i_mutex and ip_alloc_sem of
    global quota file in ocfs2_lock_global_qf and removing ip_alloc_sem from
    ocfs2_quota_read and ocfs2_quota_write.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit ae4f6ef13417deaa49471c0e903914a3ef3be258
Author: Jan Kara <jack@suse.cz>
Date:   Wed Apr 28 19:04:29 2010 +0200

    ocfs2: Avoid unnecessary block mapping when refreshing quota info
    
    The position of global quota file info does not change. So we do not have
    to do logical -> physical block translation every time we reread it from
    disk. Thus we can also avoid taking ip_alloc_sem.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit f64dd44eb748438783b10b3f7a4968d2656a3c95
Author: Jan Kara <jack@suse.cz>
Date:   Wed Apr 28 00:22:30 2010 +0200

    ocfs2: Do not map blocks from local quota file on each write
    
    There is no need to map offset of local dquot structure to on disk block
    in each quota write. It is enough to map it just once and store the physical
    block number in quota structure in memory. Moreover this simplifies locking
    as we do not have to take ip_alloc_sem from quota write path.
    
    Acked-by: Joel Becker <Joel.Becker@oracle.com>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 12755627bdcddcdb30a1bfb9a09395a52b1d6838
Author: Dmitry Monakhov <dmonakhov@openvz.org>
Date:   Thu Apr 8 22:04:20 2010 +0400

    quota: unify quota init condition in setattr
    
    Quota must be initialized if a size or uid/gid change is requested.
    But the initialization is performed in two different places:
    in the i_size case the file system is responsible for dquot init,
    but in the uid/gid case init is called internally in
    dquot_transfer().
    This ambiguity makes the code harder to understand.
    Let's move this logic to one common helper function.
    
    Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
    Signed-off-by: Jan Kara <jack@suse.cz>

commit 03e62303cf56e87337115f14842321043df2b4bb
Merge: 33cf23b 18d3a98
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Fri May 21 07:20:17 2010 -0700

    Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2
    
    * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: (47 commits)
      ocfs2: Silence a gcc warning.
      ocfs2: Don't retry xattr set in case value extension fails.
      ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
      ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
      fs/ocfs2/dlm: Use kstrdup
      fs/ocfs2/dlm: Drop memory allocation cast
      Ocfs2: Optimize punching-hole code.
      Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
      Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
      Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
      ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
      ocfs2: Wrap signal blocking in void functions.
      ocfs2/dlm: Increase o2dlm lockres hash size
      ocfs2: Make ocfs2_extend_trans() really extend.
      ocfs2/trivial: Code cleanup for allocation reservation.
      ocfs2: make ocfs2_adjust_resv_from_alloc simple.
      ocfs2: Make nointr a default mount option
      ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
      o2net: log socket state changes
      ocfs2: print node # when tcp fails
      ...

commit 18d3a98f3c1b0e27ce026afa4d1ef042f2903726
Author: Joel Becker <joel.becker@oracle.com>
Date:   Tue May 18 16:47:55 2010 -0700

    ocfs2: Silence a gcc warning.
    
    ocfs2_block_group_claim_bits() is never called with min_bits=0, but we
    shouldn't leave status undefined if it ever is.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 5f5261acb059f43c7fb9a2fac9d32c6ef4df2ed5
Author: Tao Ma <tao.ma@oracle.com>
Date:   Thu May 13 22:49:05 2010 +0800

    ocfs2: Don't retry xattr set in case value extension fails.
    
    In a normal xattr set, the set sequence is inode, xattr block
    and finally xattr bucket if we meet with ENOSPC. But there
    is a corner case.
    Suppose we set an xattr whose value will be stored in
    a cluster, and there is no xattr block yet. We will
    reserve 1 xattr block and 1 cluster for setting it. Now if we
    fail in value extension (in case the volume is almost full and
    we can't allocate the cluster because of the check in
    ocfs2_test_bg_bit_allocatable), ENOSPC will be returned. So
    we will try to create a bucket (this time there is a chance that
    the reserved cluster will be used), and when we try value extension
    again, a kernel bug happens. We did meet with it. Check the bug below.
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1251
    
    This patch tries to avoid this by adding a set_abort in
    ocfs2_xattr_set_ctxt, so in case ENOSPC happens in value extension,
    we will check whether it is caused by a real ENOSPC or just by a
    full inode or xattr block. If it is the first case, we set set_abort
    so that we don't try any further. We are safe to exit directly here
    since it is really ENOSPC.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit d9ef75221a6247b758e1d7e18edb661996e4b7cf
Author: Wengang Wang <wen.gang.wang@oracle.com>
Date:   Mon May 17 20:20:44 2010 +0800

    ocfs2:dlm: avoid dlm->ast_lock lockres->spinlock dependency break
    
    Currently we process a dirty lockres with the lockres->spinlock taken. During
    that processing, we may need to take dlm->ast_lock. This breaks the
    lock ordering of dlm->ast_lock (lock first) and lockres->spinlock (lock second).
    
    This patch fixes the problem.
    Since we can't release lockres->spinlock, we have to take dlm->ast_lock
    just before taking the lockres->spinlock and release it after lockres->spinlock
    is released. We also use __dlm_queue_bast()/__dlm_queue_ast(), the nolock versions,
    in dlm_shuffle_lists(). There are not too many locks on a lockres, so there is no
    performance harm.
    
    Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit d5a7df0649fa6a1e7800785d760e2c7d7a3204de
Author: Tao Ma <tao.ma@oracle.com>
Date:   Mon May 10 18:09:47 2010 +0800

    ocfs2: Reset xattr value size after xa_cleanup_value_truncate().
    
    In ocfs2_prepare_xattr_entry, if we fail to grow an existing value,
    xa_cleanup_value_truncate() will leave the old entry in place.  Thus, we
    reset its value size.  However, if we were allocating a new value, we
    must not reset the value size or we will BUG().  This resolves
    oss.oracle.com bug 1247.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 41841b0bcea8af7f3bff8b2a23d542b94d9c1bb1
Merge: 316ce2b 1a934c3
Author: Joel Becker <joel.becker@oracle.com>
Date:   Tue May 18 16:40:42 2010 -0700

    Merge branch 'discontig-bg' of git://oss.oracle.com/git/tma/linux-2.6 into ocfs2-merge-window

commit 316ce2ba8e74a7bb9153b9f93adc883cb1ceb9fd
Author: Julia Lawall <julia@diku.dk>
Date:   Fri May 14 21:30:48 2010 +0200

    fs/ocfs2/dlm: Use kstrdup
    
    Use kstrdup when the goal of an allocation is to copy a string into the
    allocated region.
    
    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)
    
    // <smpl>
    @@
    expression from,to;
    expression flag,E1,E2;
    statement S;
    @@
    
    -  to = kmalloc(strlen(from) + 1,flag);
    +  to = kstrdup(from, flag);
       ... when != \(from = E1 \| to = E1 \)
       if (to==NULL || ...) S
       ... when != \(from = E2 \| to = E2 \)
    -  strcpy(to, from);
    // </smpl>
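    
    For readers outside the kernel, a userspace sketch of what the combined
    helper does.  dup_string() is a hypothetical stand-in built on malloc();
    the real kstrdup() additionally takes a gfp flags argument.

    ```c
    #include <stdlib.h>
    #include <string.h>

    /* Allocate strlen(from) + 1 bytes and copy the string, NUL included.
     * Returns NULL if the allocation fails; the caller frees the result. */
    static char *dup_string(const char *from)
    {
        size_t len = strlen(from) + 1;
        char *to = malloc(len);

        if (to)
            memcpy(to, from, len);
        return to;
    }
    ```

    One call replaces the allocate, check and strcpy() sequence, so the
    length computation and the copy cannot drift apart.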
    
    Signed-off-by: Julia Lawall <julia@diku.dk>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 3914ed0cec6532ab4feb202424fc95ad05024497
Author: Julia Lawall <julia@diku.dk>
Date:   Tue May 11 20:28:14 2010 +0200

    fs/ocfs2/dlm: Drop memory allocation cast
    
    Drop cast on the result of kmalloc and similar functions.
    
    The semantic patch that makes this change is as follows:
    (http://coccinelle.lip6.fr/)
    
    // <smpl>
    @@
    type T;
    @@
    
    - (T *)
      (\(kmalloc\|kzalloc\|kcalloc\|kmem_cache_alloc\|kmem_cache_zalloc\|
       kmem_cache_alloc_node\|kmalloc_node\|kzalloc_node\)(...))
    // </smpl>
    
    Signed-off-by: Julia Lawall <julia@diku.dk>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit c1631d4a484fbb498e35d661f1aebd64c86b66bf
Author: Tristan Ye <tristan.ye@oracle.com>
Date:   Tue May 11 17:54:45 2010 +0800

    Ocfs2: Optimize punching-hole code.
    
    This patch simplifies the logic of handling existing holes and
    skipping extent blocks and removes some confusing comments.
    
    The patch survived the fill_verify_holes testcase in ocfs2-test.
    It also passed my manual sanity check and stress tests with enormous
    extent records.
    
    Until now, punching a hole in a file whose extent tree depth is 3 or
    more was really a performance disaster.  It could even take several
    hours, though we may not hit this in real life with such a huge number
    of extents.
    
    One simple way to improve the performance is quite straightforward.
    From the logic of truncate, we can punch the hole from hole_end to
    hole_start, which reduces the overhead of btree operations in a
    significant way, such as tree rotation and moving.
    
    The following are the test results when punching a hole from byte 0 to the
    end of a 1G file.  The file consists of 256k extent records, each covering
    4k of data (just one cluster; the cluster size is 4k):
    
    ===========================================================================
     * Original punching-hole mechanism:
    ===========================================================================
    
       I waited 1 hour for it to complete; unfortunately it was still running.
    
    ===========================================================================
     * Patched punching-hole mechanism:
    ===========================================================================
    
       real 0m2.518s
       user 0m0.000s
       sys  0m2.445s
    
    That means we've gained more than a 1000-fold performance improvement in
    this case, whee! It's fairly cool, and it looks like the gain will grow
    as the number of extent records grows.
    
    The patch is based on my previous 2 patches, which optimize the truncate
    code and fix hole punching to handle CoW.
    
    Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit ee149a7c6cbaee0e3a1a7d9e9f92711228ef5236
Author: Tristan Ye <tristan.ye@oracle.com>
Date:   Tue May 11 17:54:44 2010 +0800

    Ocfs2: Make ocfs2_find_cpos_for_left_leaf() public.
    
    ocfs2_find_cpos_for_left_leaf() was originally pulled out of alloc.c
    for the benefit of the hole punching optimization patch.  However, it
    can also be used by other functions that need to do the same job in
    the future.
    
    Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit e8aec068ecb1957630816cfa2c150c6b3ddd1790
Author: Tristan Ye <tristan.ye@oracle.com>
Date:   Tue May 11 17:54:43 2010 +0800

    Ocfs2: Fix hole punching to correctly do CoW during cluster zeroing.
    
    Based on the previous patch that optimizes truncate, the bugfix for
    refcount trees when punching holes is fairly easy and straightforward,
    since most of the work needed for refcounting has already been done in
    ocfs2_remove_btree_range().
    
    This patch performs CoW for refcounted extents when the start or end
    offset of the hole being punched is in the middle of a cluster, which
    means that the cluster is about to be partially zeroed.
    
    The patch has been tested and fixes the following bug:
    
    http://oss.oracle.com/bugzilla/show_bug.cgi?id=1216
    
    Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 78f94673d7faf01677f374f4ebbf324ff1a0aa6e
Author: Tristan Ye <tristan.ye@oracle.com>
Date:   Tue May 11 17:54:42 2010 +0800

    Ocfs2: Optimize ocfs2 truncate to use ocfs2_remove_btree_range() instead.
    
    Truncate is just a special case of punching holes (from new i_size to
    end), so we can take advantage of the existing
    ocfs2_remove_btree_range() to reduce the complexity and redundancy in
    alloc.c.  The goal here is to make truncate more generic and
    straightforward.
    
    Several functions only used by ocfs2_commit_truncate() will simply be
    removed.
    
    ocfs2_remove_btree_range() was originally used by the hole punching
    code, which didn't take refcount trees into account (definitely a bug).
    We therefore need to change that func a bit to handle refcount trees.
    It must take the refcount lock, calculate and reserve blocks for
    refcount tree changes, and decrease refcounts at the end.  We replace
    ocfs2_lock_allocators() here by adding a new func
    ocfs2_reserve_blocks_for_rec_trunc() which accepts some extra blocks to
    reserve.  This will not hurt any other code using
    ocfs2_remove_btree_range() (such as dir truncate and hole punching).
    
    I merged the following steps into one patch since they are logically
    one change, though I know it makes the patch a little bit fat to
    review.
    
    1). Remove redundant code used by ocfs2_commit_truncate(), since we're
        moving to ocfs2_remove_btree_range anyway.
    
    2). Add a new func ocfs2_reserve_blocks_for_rec_trunc() for purpose of
        accepting some extra blocks to reserve.
    
    3). Change ocfs2_prepare_refcount_change_for_del() a bit to fit our
        needs.  It's safe to do this since it's only being called by
        truncate.
    
    4). Change ocfs2_remove_btree_range() a bit to take refcount case into
        account.
    
    5). Finally, we change ocfs2_commit_truncate() to call
        ocfs2_remove_btree_range() in a proper way.
    
    The patch has passed the normal sanity checks; stress tests with
    heavier workloads are still to come.
    
    Based on this patch, fixing the punching holes bug will be fairly easy.
    
    Signed-off-by: Tristan Ye <tristan.ye@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 547ba7c8efe43c2cabb38782e23572a6179dd1c1
Author: Joel Becker <joel.becker@oracle.com>
Date:   Mon May 10 11:56:52 2010 -0700

    ocfs2: Block signals for mkdir/link/symlink/O_CREAT.
    
    Once file or link creation gets going, it can't be interrupted by a
    signal.  They're not idempotent.
    
    This blocks signals in ocfs2_mknod(), ocfs2_link(), and ocfs2_symlink()
    once we start actually changing things.  ocfs2_mknod() covers mknod(),
    creat(), mkdir(), and open(O_CREAT).
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit e4b963f10e9026c83419b5c25b93a0350413cf16
Author: Joel Becker <joel.becker@oracle.com>
Date:   Wed Sep 2 17:17:36 2009 -0700

    ocfs2: Wrap signal blocking in void functions.
    
    ocfs2 sometimes needs to block signals around dlm operations, but it
    currently does it with sigprocmask().  Even worse, it's checking the
    error code of sigprocmask().  The in-kernel sigprocmask() can only error
    if you get the SIG_* argument wrong.  We don't.
    
    Wrap the sigprocmask() calls with ocfs2_[un]block_signals().  These
    functions are void, but they will BUG() if somehow sigprocmask() returns
    an error.
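    
    For illustration only, here is a minimal userspace sketch of the wrapper
    idea, assuming the POSIX sigprocmask() rather than the in-kernel one.
    toy_block_signals() and toy_unblock_signals() are hypothetical stand-ins
    for ocfs2_[un]block_signals(), and assert() stands in for BUG():
    
    ```c
    #define _POSIX_C_SOURCE 200809L
    #include <assert.h>
    #include <signal.h>
    #include <stddef.h>
    
    /* Block every blockable signal and save the previous mask in *oldset. */
    void toy_block_signals(sigset_t *oldset)
    {
    	sigset_t blocked;
    	int rc;
    
    	sigfillset(&blocked);
    	rc = sigprocmask(SIG_BLOCK, &blocked, oldset);
    	/* Can only fail on a bad "how" argument, so callers get no status. */
    	assert(rc == 0);
    	(void)rc;
    }
    
    /* Restore the mask saved by toy_block_signals(). */
    void toy_unblock_signals(const sigset_t *oldset)
    {
    	int rc = sigprocmask(SIG_SETMASK, oldset, NULL);
    
    	assert(rc == 0);
    	(void)rc;
    }
    ```
    
    Callers pair the two calls around the critical region and never look at
    a return value.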
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 0467ae954d1843de65e7cf8f706f88fe65cd8418
Author: Sunil Mushran <sunil.mushran@oracle.com>
Date:   Wed May 5 16:25:08 2010 -0700

    ocfs2/dlm: Increase o2dlm lockres hash size
    
    Lockres hash size of 16KB is far too small for large filesystems (where we
    have hundreds of thousands of lock resources stored in the table).
    This patch increases it to 128KB.
    
    Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit c901fb00731e307c2c6e8c7d5eee005df5835f9d
Author: Tao Ma <tao.ma@oracle.com>
Date:   Mon Apr 26 14:34:57 2010 +0800

    ocfs2: Make ocfs2_extend_trans() really extend.
    
    In ocfs2, we use ocfs2_extend_trans() to extend a journal handle's
    blocks. But if jbd2_journal_extend() fails, it will only restart
    with the new number of blocks.  This tends to be awkward since
    in most cases we want additional reserved blocks. It makes our code
    harder to maintain since the caller can't be sure all the original
    blocks will not be accessed and dirtied again.  There are 15 callers
    of ocfs2_extend_trans() in fs/ocfs2, and 12 of them have to add
    h_buffer_credits before they call ocfs2_extend_trans().  This makes
    ocfs2_extend_trans() really extend atop the original block count.
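    
    A toy model of the semantic change, for illustration only.  struct
    toy_handle, its can_extend flag and the two functions are hypothetical
    and only mimic the credit accounting described above; they are not the
    jbd2 API:
    
    ```c
    #include <assert.h>
    
    struct toy_handle {
    	int credits;		/* blocks the handle may still dirty */
    	int can_extend;		/* nonzero if an in-place extend succeeds */
    };
    
    /* Old behaviour: a failed extend restarts with only nblocks credits. */
    void toy_extend_old(struct toy_handle *h, int nblocks)
    {
    	assert(nblocks >= 0);
    	if (h->can_extend)
    		h->credits += nblocks;
    	else
    		h->credits = nblocks;	/* the original credits are lost */
    }
    
    /* New behaviour: both paths end up with old credits + nblocks. */
    void toy_extend_new(struct toy_handle *h, int nblocks)
    {
    	int old_credits = h->credits;
    
    	assert(nblocks >= 0);
    	if (h->can_extend)
    		h->credits += nblocks;
    	else
    		h->credits = old_credits + nblocks;	/* restart on top of the old count */
    }
    ```
    
    With the new behaviour the caller no longer has to add the current
    credit count to nblocks before asking for more.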
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 3e4218df3176657be72ad2fa199779be6c11fe4f
Author: Tao Ma <tao.ma@oracle.com>
Date:   Tue Apr 6 16:46:46 2010 +0800

    ocfs2/trivial: Code cleanup for allocation reservation.
    
    Two tiny cleanups for allocation reservation.
    1. Remove some extra code in ocfs2_local_alloc_find_clear_bits.
    2. Remove an unused variable in ocfs2_find_resv_lhs.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit b065556a7d1a9205403db77a318a5c5aa530e701
Author: Tao Ma <tao.ma@oracle.com>
Date:   Thu Apr 8 16:33:02 2010 +0800

    ocfs2: make ocfs2_adjust_resv_from_alloc simple.
    
    When we allocate some bits from the reservation, we always
    allocate from r_start (see ocfs2_resmap_resv_bits).
    So there should be no reason to check between r_start
    and start. And I don't think we will change this behaviour
    later by allocating from some bits after r_start.  Why not make
    ocfs2_adjust_resv_from_alloc simple for now?
    
    The only chance we have to adjust the reservation is when we haven't
    reached the end. With this patch, the function is more readable.
    
    Note:
    This patch also fixes an existing bug in the function which I hadn't
    noticed before:
    	if (end < ocfs2_resv_end(resv))
    		rhs = end - ocfs2_resv_end(resv);
    This code is of course buggy. ;)
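    
    The message does not spell out what is wrong with the two lines above.
    One plausible reading, sketched here under the assumption that the
    offsets are unsigned, is that the operands are reversed: inside the if,
    end - ocfs2_resv_end(resv) is negative and wraps around.  toy_rhs_buggy()
    and toy_rhs_fixed() are hypothetical userspace stand-ins, not the kernel
    code:
    
    ```c
    #include <assert.h>
    
    /* Mirrors the quoted lines: the subtraction wraps when end < resv_end. */
    unsigned int toy_rhs_buggy(unsigned int end, unsigned int resv_end)
    {
    	unsigned int rhs = 0;
    
    	if (end < resv_end)
    		rhs = end - resv_end;	/* operands reversed */
    	return rhs;
    }
    
    /* Bits left in the window to the right of an allocation ending at end. */
    unsigned int toy_rhs_fixed(unsigned int end, unsigned int resv_end)
    {
    	unsigned int rhs = 0;
    
    	if (end < resv_end)
    		rhs = resv_end - end;
    	/* The remainder can never exceed the window's end offset. */
    	assert(rhs <= resv_end);
    	return rhs;
    }
    ```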
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>
    Acked-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 4b37fcb7d41ce3b9264b9562d6ffd62db9294bd1
Author: Sunil Mushran <sunil.mushran@oracle.com>
Date:   Tue Apr 13 18:00:31 2010 -0700

    ocfs2: Make nointr a default mount option
    
    OCFS2 has never really supported intr. This patch acknowledges this reality
    and makes nointr the default mount option. In a later patch, we intend to
    support intr.
    
    Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 5c80d4c9e5489d5930412add87501702fe5f93fb
Author: Sunil Mushran <sunil.mushran@oracle.com>
Date:   Tue Apr 13 18:00:30 2010 -0700

    ocfs2/dlm: Make o2dlm domain join/leave messages KERN_NOTICE
    
    o2dlm join and leave messages are more than informational as they are
    required for debugging locking issues. This patch changes them from
    KERN_INFO to KERN_NOTICE.
    
    Signed-off-by: Sunil Mushran <sunil.mushran@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 23fd9abdc8f63c72fe3324e83d454ccecedaec37
Author: Srinivas Eeda <srinivas.eeda@oracle.com>
Date:   Wed Mar 31 14:32:29 2010 -0700

    o2net: log socket state changes
    
    This patch logs socket state changes that lead to socket shutdown.
    
    Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit a5196ec5ef80309fd390191c548ee1f2e8a327ee
Author: Wengang Wang <wen.gang.wang@oracle.com>
Date:   Tue Mar 30 12:09:22 2010 +0800

    ocfs2: print node # when tcp fails
    
    Print the node number of a peer node if sending it a message failed.
    
    Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 83f92318fa33cc084e14e64dc903e605f75884c1
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Apr 5 18:17:16 2010 -0700

    ocfs2: Add dir_resv_level mount option
    
    The default behavior for directory reservations stays the same, but we add a
    mount option so people can tweak the size of directory reservations
    according to their workloads.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit b07f8f24dfe54da0f074b78949044842e8df881f
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Apr 5 18:17:15 2010 -0700

    ocfs2: change default reservation window sizes
    
    The default reservation size of 4 (32-bit windows) is a bit too ambitious.
    Scale it back to 16 bits (resv_level=2). I have been testing various sizes
    on a 4-node cluster which runs a mixed workload that is heavily threaded.
    With a 256MB local alloc, I get *roughly* the following levels of average file
    fragmentation:
    
    resv_level=0	70%
    resv_level=1	21%
    resv_level=2	23%
    resv_level=3	24%
    resv_level=4	60%
    resv_level=5	did not test
    resv_level=6	60%
    
    resv_level=2 seemed like a good compromise between not letting windows be
    too small, but not so big that heavier workloads will immediately suffer
    without tuning.
    
    This patch also changes the behavior of directory reservations - they now
    track file reservations.  The previous compromise of giving directory
    windows only 8 bits wound up fragmenting more at some window sizes because
    file allocations had smaller unused windows to poach from.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 6b82021b9e91cd689fdffadbcdb9a42597bbe764
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Apr 5 18:17:14 2010 -0700

    ocfs2: increase the default size of local alloc windows
    
    I have observed that the current size of 8M gives us pretty poor
    fragmentation on multi-threaded workloads which do lots of writes.
    
    Generally, I can increase the size of local alloc windows and observe a
    marked decrease in fragmentation, even up and beyond window sizes of 512
    megabytes. This makes sense for a couple reasons - larger local alloc means
    more room for reservation windows. On multi-node workloads the larger local
    alloc helps as well because we don't have to do window slides as often.
    
    Also, I removed the OCFS2_DEFAULT_LOCAL_ALLOC_SIZE constant as it is no
    longer used and the comment above it was out of date.
    
    To test fragmentation, I used a workload which launched 4 threads that did
    4k writes into a series of about 140 alternating files.
    
    With resv_level=2, and a 4k/4k file system I observed the following average
    fragmentation for various localalloc= parameters:
    
    localalloc=	avg. fragmentation
    	8		48
    	32		16
    	64		10
    	120		7
    
    On larger cluster sizes, the difference is more dramatic.
    
    The new default size tops out at 256M, which we'll only get for cluster
    sizes of 32K and above.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 73c8a80003d13be54e2309865030404441075182
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Apr 5 18:17:13 2010 -0700

    ocfs2: clean up localalloc mount option size parsing
    
    This patch pulls the local alloc sizing code into localalloc.c and provides
    a callout to it from ocfs2_fill_super(). Behavior is essentially unchanged
    except that I correctly calculate the maximum local alloc size. The old code
    in ocfs2_parse_options() calculated the max size as:
    
    ocfs2_local_alloc_size(sb) * 8
    
    which is correct, in bits. Unfortunately though the option passed in is in
    megabytes. Ultimately, this bug made no real difference - the shrink code
    would catch a too-large size and bring it down to something reasonable.
    Still, it's less than efficient as-is.
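    
    The unit mismatch can be sketched as follows.  This assumes that
    ocfs2_local_alloc_size() returns the bitmap size in bytes (implied by
    the "* 8" above) and that each bit tracks one cluster.
    toy_la_max_megs() is a hypothetical helper, not the function added by
    this patch:
    
    ```c
    #include <assert.h>
    
    /*
     * Convert a local alloc bitmap size in bytes to the number of megabytes
     * it can track.  Each bit covers one cluster of (1 << clustersize_bits)
     * bytes.  Assumes 12 <= clustersize_bits <= 20 (4K to 1M clusters).
     */
    unsigned int toy_la_max_megs(unsigned int bitmap_bytes,
    			     unsigned int clustersize_bits)
    {
    	unsigned int max_bits = bitmap_bytes * 8;	/* one bit per cluster */
    
    	assert(clustersize_bits >= 12 && clustersize_bits <= 20);
    	/* clusters -> megabytes: divide by the clusters per megabyte */
    	return max_bits >> (20 - clustersize_bits);
    }
    ```
    
    For example, a 1024-byte bitmap with 4K clusters tracks 8192 clusters,
    which is 32 megabytes, whereas comparing the mount option against the
    raw bit count would have allowed 8192.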
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit a57c8fd2ad238258cc983049008aea5f985804b2
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Tue Mar 16 21:01:00 2010 -0700

    ocfs2: remove ocfs2_local_alloc_in_range()
    
    Inodes are always allocated from the global bitmap now so we don't need this
    any more. Also, the existing implementation bounces reservations around
    needlessly.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>

commit 33d5d380d667ad264675cfdb297dfc3c5b6542cc
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Wed Feb 24 13:34:09 2010 -0800

    ocfs2: allocate btree internal block groups from the global bitmap
    
    Otherwise, the need for a very large contiguous allocation tends to
    wreak havoc on many inode allocation reservations on the local alloc, thus
    ruining any chances for contiguousness.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>

commit e3b4a97dbe9741a3227c3ed857a0632532fcd386
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Dec 7 13:16:07 2009 -0800

    ocfs2: use allocation reservations for directory data
    
    Use the reservations system for unindexed dir tree allocations. We don't
    bother with the indexed tree as reads from it are mostly random anyway.
    Directory reservations are marked separately, to allow the reservations code
    a chance to optimize their window sizes. This patch allocates only 8 bits
    for directory windows as they generally are not expected to grow as quickly
    as file data. Future improvements to dir window sizing can trivially be
    made.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>

commit 4fe370afaae49c57619bb0bedb75de7e7c168308
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Dec 7 13:15:40 2009 -0800

    ocfs2: use allocation reservations during file write
    
    Add a per-inode reservations structure and pass it through to the
    reservations code.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>

commit d02f00cc057809d96c044cc72d5b9809d59f7d49
Author: Mark Fasheh <mfasheh@suse.com>
Date:   Mon Dec 7 13:10:48 2009 -0800

    ocfs2: allocation reservations
    
    This patch improves Ocfs2 allocation policy by allowing an inode to
    reserve a portion of the local alloc bitmap for itself. The reserved
    portion (allocation window) is advisory in that other allocation
    windows might steal it if the local alloc bitmap becomes
    full. Otherwise, the reservations are honored and guaranteed to be
    free. When the local alloc window is moved to a different portion of
    the bitmap, existing reservations are discarded.
    
    Reservation windows are represented internally by a red-black
    tree. Within that tree, each node represents the reservation window of
    one inode. An LRU of active reservations is also maintained. When new
    data is written, we allocate it from the inode's window. When all bits
    in a window are exhausted, we allocate a new one as close to the
    previous one as possible. Should we not find free space, an existing
    reservation is pulled off the LRU and cannibalized.
    
    Signed-off-by: Mark Fasheh <mfasheh@suse.com>

commit ec20cec7a351584ca6c70ead012e73d61f9a8e04
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 19 14:13:52 2010 -0700

    ocfs2: Make ocfs2_journal_dirty() void.
    
    jbd[2]_journal_dirty_metadata() only returns 0.  It's been returning 0
    since before the kernel moved to git.  There is no point in checking
    this error.
    
    ocfs2_journal_dirty() has been faithfully returning the status since the
    beginning.  All over ocfs2, we have blocks of code checking this
    can't-fail status.  In the past few years, we've tried to avoid adding these
    checks, because they are pointless.  But anyone who looks at our code
    assumes they are needed.
    
    Finally, ocfs2_journal_dirty() is made a void function.  All error
    checking is removed from other files.  We'll BUG_ON() the status of
    jbd2_journal_dirty_metadata() just in case they change it someday.  They
    won't.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 1a934c3e57594588c373aea858e4593cdfcba4f4
Author: Tao Ma <tao.ma@oracle.com>
Date:   Thu Mar 18 15:54:22 2010 +0800

    ocfs2: enable discontig block group support.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit abf1b3cb5b20fbad27ca9c7497235eeb4dd3f4fd
Author: Tao Ma <tao.ma@oracle.com>
Date:   Tue Apr 27 08:30:36 2010 +0800

    ocfs2: Set ac_last_group properly with discontig group.
    
    ac_last_group is used to record the last block group we
    used during allocation. But the initialization process
    only calls ocfs2_which_suballoc_group and fails to
    use suballoc_loc properly. So let us do it.
    Another function, ocfs2_test_suballoc_bit, also needs a fix.
    
    I have checked all the callers of ocfs2_which_suballoc_group,
    and all of them take suballoc_loc into account now.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 74380c479ad83addeff8a172ab95f59557b5b0c3
Author: Tao Ma <tao.ma@oracle.com>
Date:   Mon Mar 22 14:20:18 2010 +0800

    ocfs2: Free block to the right block group.
    
    In case the block we are going to free was allocated from
    a discontiguous block group, we have to use suballoc_loc
    to find the right group.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit af2bf0d86019e0b0306965321096f8380b7ca830
Author: Tao Ma <tao.ma@oracle.com>
Date:   Mon May 17 15:14:17 2010 +0800

    ocfs2: Add ocfs2_gd_is_discontig.
    
    Add ocfs2_gd_is_discontig so that we can test whether
    a group descriptor is discontiguous or not.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 8571882c21e5073b2f96147ec4ff9b7042339e1b
Author: Tao Ma <tao.ma@oracle.com>
Date:   Tue Apr 13 14:38:06 2010 +0800

    ocfs2: ocfs2_group_bitmap_size has to handle old volume.
    
    ocfs2_group_bitmap_size has to handle the case when the
    volume doesn't have discontiguous block group support. So
    pass the feature_incompat in and check it.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 4711954eaa8d30f653fda238cecf919f1ae40d6f
Author: Tao Ma <tao.ma@oracle.com>
Date:   Thu Apr 22 14:09:15 2010 +0800

    ocfs2: Some tiny bug fixes for discontiguous block allocation.
    
    The fixes include:
    1. some endian problems.
    2. we should use bit/bpc in ocfs2_block_group_grow_discontig to
       allocate clusters.
    3. set num_clusters properly in __ocfs2_claim_clusters.
    4. change name from ocfs2_supports_discontig_bh to
       ocfs2_supports_discontig_bg.
    
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 95ec0adf0b56d6a3f0ca1ec87173311898486b2e
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:10:08 2010 +0800

    ocfs2: Don't relink cluster groups when allocating discontig block groups
    
    We don't have enough credits, and the filesystem is in a full state
    anyway.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 8b06bc592ebc5a31e8d0b9c2ab17c6e78dde1f86
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:09:29 2010 +0800

    ocfs2: Grow discontig block groups in one transaction.
    
    Rather than extending the transaction every time we add an extent to a
    discontiguous block group, we grab enough credits to fill the extent
    list up front.  This means we can free the bits in the same transaction
    if we end up not getting enough space.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 2b6cb576aa80611f1f6a3c88708d1e68a8d97985
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:09:15 2010 +0800

    ocfs2: Set suballoc_loc on allocated metadata.
    
    Get the suballoc_loc from ocfs2_claim_new_inode() or
    ocfs2_claim_metadata().  Store it on the appropriate field of the block
    we just allocated.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit ba2066351b630f0205ebf725f5c81a2a07a77cd7
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:08:59 2010 +0800

    ocfs2: Return allocated metadata blknos on the ocfs2_suballoc_result.
    
    Rather than calculating the resulting block number, return it on the
    ocfs2_suballoc_result structure.  This way we can calculate block
    numbers for discontiguous block groups.
    
    Cluster groups keep doing it the old way.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 1ed9b777f77929ae961d6f9cdf828a07200ba71c
Author: Joel Becker <joel.becker@oracle.com>
Date:   Thu May 6 13:59:06 2010 +0800

    ocfs2: ocfs2_claim_*() don't need an ocfs2_super argument.
    
    They all take an ocfs2_alloc_context, which has the allocation inode.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 13e434cf0cacd2f03a7f4cd077e3e995ef5ef710
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:08:27 2010 +0800

    ocfs2: Trim suballocations if they cross discontiguous regions
    
    A discontiguous block group can find a range of free bits that straddles
    more than one region of its space.  Callers can't handle that, so we
    trim the returned bits until they fit within one region.
    
    Only cluster allocations ask for min_bits>1.  Discontiguous block groups
    are only for block allocations.  So min_bits doesn't matter here.
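    
    A simplified sketch of the trimming idea, for illustration only.  It
    models a group as an array of region lengths laid out back to back in
    the bitmap and keeps the part of the run that lies in the region
    containing its first bit; the real code may pick the region differently.
    toy_trim_to_region() and its parameters are hypothetical:
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    /*
     * Trim a run of "bits" free bits starting at "bit_off" so that it stays
     * inside the region containing bit_off.  region_bits[i] is the number
     * of bitmap bits covered by region i.  Returns the trimmed length.
     */
    unsigned int toy_trim_to_region(const unsigned int *region_bits,
    				size_t nr_regions,
    				unsigned int bit_off, unsigned int bits)
    {
    	unsigned int region_start = 0;
    	size_t i;
    
    	for (i = 0; i < nr_regions; i++) {
    		unsigned int region_end = region_start + region_bits[i];
    
    		if (bit_off < region_end) {
    			unsigned int room = region_end - bit_off;
    
    			return bits < room ? bits : room;
    		}
    		region_start = region_end;
    	}
    	/* bit_off lies beyond the last region: a caller bug. */
    	assert(0 && "bit_off out of range");
    	return 0;
    }
    ```
    
    Since block allocations only need one bit, a trimmed run is always
    still usable.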
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit aa8f8e93c898a0319bcd6c79a9a42fe52abac7d7
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:08:07 2010 +0800

    ocfs2: ocfs2_claim_suballoc_bits() doesn't need an osb argument.
    
    It's contained on ac->ac_inode->i_sb anyway.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 9cbc01231e82f9390edaea2b766abcb7165dc4b2
Author: Joel Becker <joel.becker@oracle.com>
Date:   Fri Mar 26 10:07:42 2010 +0800

    ocfs2: Add suballoc_loc to metadata blocks.
    
    We need a suballoc_loc field on any suballocated block.  Define them.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>

commit 7d1fe093bf04124dcc50c5dde1765bd098464bfa
Author: Joel Becker <joel.becker@oracle.com>
Date:   Tue Apr 13 14:30:19 2010 +0800

    ocfs2: Pass suballocation results back via a structure.
    
    We're going to be adding more info to a suballocator allocation.  Rather
    than growing every function in the chain, let's pass a result structure
    around.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 798db35f4649eac2778381c390ed7d12de9ec767
Author: Joel Becker <joel.becker@oracle.com>
Date:   Tue Apr 13 14:26:32 2010 +0800

    ocfs2: Allocate discontiguous block groups.
    
    If we cannot get a contiguous region for a block group, allocate a
    discontiguous one when the filesystem supports it.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Signed-off-by: Tao Ma <tao.ma@oracle.com>

commit 4cbe4249d6586d5d88ef271e07302407a14c8443
Author: Joel Becker <joel.becker@oracle.com>
Date:   Tue Apr 13 14:26:12 2010 +0800

    ocfs2: Define data structures for discontiguous block groups.
    
    Defines the OCFS2_FEATURE_INCOMPAT_DISCONTIG_BG feature bit and modifies
    struct ocfs2_group_desc for the feature.
    
    Signed-off-by: Joel Becker <joel.becker@oracle.com>
    Signed-off-by: Tao Ma <tao.ma@oracle.com>