commit 047e5f383505cda6606b73d49a857895de5e2c48
Author: Greg Kroah-Hartman
Date:   Mon Dec 14 08:14:26 2009 -0800

    Linux 2.6.31.8

commit 26973f8bd6dfee21209ef3d521c286ee485ef2a8
Author: Theodore Ts'o
Date:   Thu Dec 10 18:51:31 2009 -0500

    ext4: Fix potential fiemap deadlock (mmap_sem vs. i_data_sem)

    (cherry picked from commit fab3a549e204172236779f502eccb4f9bf0dc87d)

    Fix the following potential circular locking dependency between
    mm->mmap_sem and ei->i_data_sem:

        =======================================================
        [ INFO: possible circular locking dependency detected ]
        2.6.32-04115-gec044c5 #37
        -------------------------------------------------------
        ureadahead/1855 is trying to acquire lock:
         (&mm->mmap_sem){++++++}, at: [] might_fault+0x5c/0xac

        but task is already holding lock:
         (&ei->i_data_sem){++++..}, at: [] ext4_fiemap+0x11b/0x159

        which lock already depends on the new lock.

        the existing dependency chain (in reverse order) is:

        -> #1 (&ei->i_data_sem){++++..}:
               [] __lock_acquire+0xb67/0xd0f
               [] lock_acquire+0xdc/0x102
               [] down_read+0x51/0x84
               [] ext4_get_blocks+0x50/0x2a5
               [] ext4_get_block+0xab/0xef
               [] do_mpage_readpage+0x198/0x48d
               [] mpage_readpages+0xd0/0x114
               [] ext4_readpages+0x1d/0x1f
               [] __do_page_cache_readahead+0x12f/0x1bc
               [] ra_submit+0x21/0x25
               [] filemap_fault+0x19f/0x32c
               [] __do_fault+0x55/0x3a2
               [] handle_mm_fault+0x327/0x734
               [] do_page_fault+0x292/0x2aa
               [] page_fault+0x25/0x30
               [] clear_user+0x38/0x3c
               [] padzero+0x20/0x31
               [] load_elf_binary+0x8bc/0x17ed
               [] search_binary_handler+0xc2/0x259
               [] load_script+0x1b8/0x1cc
               [] search_binary_handler+0xc2/0x259
               [] do_execve+0x1ce/0x2cf
               [] sys_execve+0x43/0x5a
               [] stub_execve+0x6a/0xc0

        -> #0 (&mm->mmap_sem){++++++}:
               [] __lock_acquire+0xa11/0xd0f
               [] lock_acquire+0xdc/0x102
               [] might_fault+0x89/0xac
               [] fiemap_fill_next_extent+0x95/0xda
               [] ext4_ext_fiemap_cb+0x138/0x157
               [] ext4_ext_walk_space+0x178/0x1f1
               [] ext4_fiemap+0x13c/0x159
               [] do_vfs_ioctl+0x348/0x4d6
               [] sys_ioctl+0x56/0x79
               [] system_call_fastpath+0x16/0x1b

        other info that might help us debug this:

        1 lock held by ureadahead/1855:
         #0:  (&ei->i_data_sem){++++..}, at: [] ext4_fiemap+0x11b/0x159

        stack backtrace:
        Pid: 1855, comm: ureadahead Not tainted 2.6.32-04115-gec044c5 #37
        Call Trace:
         [] print_circular_bug+0xa8/0xb7
         [] __lock_acquire+0xa11/0xd0f
         [] ? sched_clock+0x9/0xd
         [] lock_acquire+0xdc/0x102
         [] ? might_fault+0x5c/0xac
         [] might_fault+0x89/0xac
         [] ? might_fault+0x5c/0xac
         [] ? __kmalloc+0x13b/0x18c
         [] fiemap_fill_next_extent+0x95/0xda
         [] ext4_ext_fiemap_cb+0x138/0x157
         [] ? ext4_ext_fiemap_cb+0x0/0x157
         [] ext4_ext_walk_space+0x178/0x1f1
         [] ext4_fiemap+0x13c/0x159
         [] ? might_fault+0x5c/0xac
         [] do_vfs_ioctl+0x348/0x4d6
         [] ? __up_read+0x8d/0x95
         [] ? retint_swapgs+0x13/0x1b
         [] sys_ioctl+0x56/0x79
         [] system_call_fastpath+0x16/0x1b

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 59cebab12f5dd741ae565ed242f1f12790d9c4ba
Author: Sebastian Andrzej Siewior
Date:   Sun Oct 25 15:37:58 2009 +0100

    signal: Fix alternate signal stack check

    commit 2a855dd01bc1539111adb7233f587c5c468732ac upstream.

    All architectures in the kernel increment/decrement the stack pointer
    before storing values on the stack.

    On architectures which have the stack grow down sas_ss_sp == sp is not
    on the alternate signal stack while sas_ss_sp + sas_ss_size == sp is
    on the alternate signal stack.

    On architectures which have the stack grow up sas_ss_sp == sp is on
    the alternate signal stack while sas_ss_sp + sas_ss_size == sp is not
    on the alternate signal stack.

    The current implementation fails for architectures which have the
    stack grow down on the corner case where sas_ss_sp == sp. This was
    reported as Debian bug #544905 on AMD64.
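
The boundary rules in the paragraphs above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the kernel's code: the struct `alt_stack` and all three function names are hypothetical, while `sas_ss_sp` and `sas_ss_size` mirror the fields named in the commit message. The "old" variant models the single check that the message quotes as "sp - sas_ss_sp < sas_ss_size".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-task alternate stack fields. */
struct alt_stack {
	unsigned long sas_ss_sp;	/* lowest address of the alternate stack */
	size_t sas_ss_size;		/* size of the alternate stack in bytes */
};

/*
 * Stack grows down: sp == sas_ss_sp is NOT on the alternate stack,
 * sp == sas_ss_sp + sas_ss_size IS on it.
 */
bool on_alt_stack_grows_down(const struct alt_stack *s, unsigned long sp)
{
	return sp > s->sas_ss_sp && sp - s->sas_ss_sp <= s->sas_ss_size;
}

/*
 * Stack grows up: sp == sas_ss_sp IS on the alternate stack,
 * sp == sas_ss_sp + sas_ss_size is NOT on it.
 */
bool on_alt_stack_grows_up(const struct alt_stack *s, unsigned long sp)
{
	return sp >= s->sas_ss_sp && sp - s->sas_ss_sp < s->sas_ss_size;
}

/*
 * Model of the older single check. Unsigned wraparound makes it false
 * for sp below sas_ss_sp, but it wrongly reports sp == sas_ss_sp as
 * "on the alternate stack" when the stack grows down.
 */
bool on_alt_stack_old(const struct alt_stack *s, unsigned long sp)
{
	return sp - s->sas_ss_sp < s->sas_ss_size;
}
```

With an alternate stack at base 0x100 and size 0x100, the grows-down check rejects sp == 0x100 and accepts sp == 0x200, while the old check accepts sp == 0x100, which is the corner case reported in the Debian bug.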

    Simplified test case: http://download.breakpoint.cc/tc-sig-stack.c

    The test case creates the following stack scenario:

       0xn0300	stack top
       0xn0200	alt stack pointer top (when switching to alt stack)
       0xn01ff	alt stack end
       0xn0100	alt stack start == stack pointer

    If the signal is sent the stack pointer is pointing to the base
    address of the alt stack and the kernel erroneously decides that it
    has already switched to the alternate stack because of the current
    check for "sp - sas_ss_sp < sas_ss_size"

    On parisc (stack grows up) the scenario would be:

       0xn0200	stack pointer
       0xn01ff	alt stack end
       0xn0100	alt stack start = alt stack pointer base
       		(when switching to alt stack)
       0xn0000	stack base

    This is handled correctly by the current implementation.

    [ tglx: Modified for archs which have the stack grow up (parisc) which
      	    would fail with the correct implementation for stack grows
      	    down. Added a check for sp >= current->sas_ss_sp which is
      	    strictly not necessary but makes the code symmetric for both
      	    variants ]

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Kyle McMartin
    LKML-Reference: <20091025143758.GA6653@Chamillionaire.breakpoint.cc>
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

commit 3c4f4e86d320bc7853f9aba7f8382a5a77dc6e43
Author: James Bottomley
Date:   Thu Nov 5 13:33:12 2009 -0600

    SCSI: scsi_lib_dma: fix bug with dma maps on nested scsi objects

    commit d139b9bd0e52dda14fd13412e7096e68b56d0076 upstream.

    Some of our virtual SCSI hosts don't have a proper bus parent at the
    top, which can be a problem for doing DMA on them.

    This patch makes the host device cache a pointer to the physical bus
    device and provides an extra API for setting it (the normal API picks
    it up from the parent). This patch also modifies the qla2xxx and lpfc
    vport logic to use the new DMA host setting API.

    Acked-By: James Smart
    Signed-off-by: James Bottomley
    Signed-off-by: Greg Kroah-Hartman

commit fa0b90829888932592f6f99959ebc42c12356623
Author: Martin Michlmayr
Date:   Mon Nov 16 20:49:25 2009 +0200

    SCSI: osd_protocol.h: Add missing #include

    commit 0899638688f223fd9e9fee60d662665e11693d12 upstream.

    include/scsi/osd_protocol.h uses ALIGN() without an #include , leading to:

    | include/scsi/osd_protocol.h:362: error: implicit declaration of function 'ALIGN'

    Signed-off-by: Martin Michlmayr
    Signed-off-by: Boaz Harrosh
    Signed-off-by: James Bottomley
    Signed-off-by: Greg Kroah-Hartman

commit 66d8e059a54d7aa345ef9a857986e2055870a14e
Author: Yang, Bo
Date:   Tue Oct 6 14:52:20 2009 -0600

    SCSI: megaraid_sas: fix 64 bit sense pointer truncation

    commit 7b2519afa1abd1b9f63aa1e90879307842422dae upstream.

    The current sense pointer is cast to a u32 pointer, which can truncate
    on 64 bits. Fix by using unsigned long instead.

    Signed-off-by: Bo Yang
    Signed-off-by: James Bottomley
    Signed-off-by: Greg Kroah-Hartman

commit 51a88ff8de521caa02d8d208e410ff85a3c85199
Author: Akira Fujita
Date:   Sun Dec 6 23:38:31 2009 -0500

    ext4: Fix insufficient checks in EXT4_IOC_MOVE_EXT

    (cherry picked from commit 4a58579b9e4e2a35d57e6c9c8483e52f6f1b7fd6)

    This patch fixes three problems in the handling of the
    EXT4_IOC_MOVE_EXT ioctl:

    1. The current EXT4_IOC_MOVE_EXT code checks only for read access on
       the original and donor files. That permits an illegal write to the
       donor file, since the donor file is overwritten with the original
       file's data. To fix this problem, change the access mode checks of
       the original (r->r/w) and donor (r->w) files.

    2. Disallow the use of donor files that have the setuid or setgid bit
       set.

    3. Call mnt_want_write() and mnt_drop_write() before and after the
       ext4_move_extents() call to get write access to the mount.
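
The first two checks in the list above can be sketched as a standalone predicate. This is a minimal userspace model under stated assumptions, not the ext4 ioctl code: the function name `move_ext_allowed()` and the `SKETCH_MODE_*` flags are hypothetical (they are not the kernel's FMODE_* values), and the third item (mnt_want_write()/mnt_drop_write()) is deliberately not modeled. Only `S_ISUID` and `S_ISGID` are real, from POSIX `<sys/stat.h>`.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical open-mode flags used only by this sketch. */
#define SKETCH_MODE_READ  0x1u
#define SKETCH_MODE_WRITE 0x2u

/*
 * Returns true when a move-extent request would pass both checks:
 *  1. the original file is open for read and write, and the donor
 *     file is open for write;
 *  2. the donor inode has neither the setuid nor the setgid bit.
 */
bool move_ext_allowed(unsigned int orig_open_mode,
		      unsigned int donor_open_mode,
		      mode_t donor_inode_mode)
{
	bool orig_rw = (orig_open_mode & SKETCH_MODE_READ) != 0 &&
		       (orig_open_mode & SKETCH_MODE_WRITE) != 0;
	bool donor_w = (donor_open_mode & SKETCH_MODE_WRITE) != 0;

	if (!orig_rw || !donor_w)
		return false;	/* check 1: access modes */
	if (donor_inode_mode & (S_ISUID | S_ISGID))
		return false;	/* check 2: no setuid/setgid donor */
	return true;
}
```

A donor opened read-only, or a donor with mode `S_ISUID | 0755`, is rejected by this model, which corresponds to the cases the patch description says were previously allowed.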

    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit e8f0d507456ee6ea071e0bb9d445e848b29872ac
Author: Jan Kara
Date:   Thu Dec 10 00:50:57 2009 -0500

    ext4: Wait for proper transaction commit on fsync

    (cherry picked from commit b436b9bef84de6893e86346d8fbf7104bc520645)

    We cannot rely on buffer dirty bits during fsync because pdflush can
    come before fsync is called and clear dirty bits without forcing a
    transaction commit. Instead, we track which transaction last changed
    the inode and which transaction last changed allocation, and force it
    to disk on fsync.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 224fb952944a6ff5c4032f5cdcf0a73ac45b0702
Author: Dmitry Monakhov
Date:   Tue Dec 8 22:42:28 2009 -0500

    ext4: fix incorrect block reservation on quota transfer.

    (cherry picked from commit 194074acacebc169ded90a4657193f5180015051)

    Inside a ->setattr() call both ATTR_UID and ATTR_GID may be valid.
    This means that we may end up transferring all quotas, and we have to
    reserve QUOTA_DEL_BLOCKS for all quotas, as we do in the case of
    QUOTA_INIT_BLOCKS.

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit e79220b94468feaf42ec05fe197a8dfe5a782f57
Author: Dmitry Monakhov
Date:   Tue Dec 8 22:42:15 2009 -0500

    ext4: quota macros cleanup

    (cherry picked from commit 5aca07eb7d8f14d90c740834d15ca15277f4820c)

    Currently all quota block reservation macros contain a hard-coded "2"
    aka the MAXQUOTAS value. This is no good because in some places it is
    not obvious what this digit represents. Let's introduce a new macro
    with a self-descriptive name.

    Signed-off-by: Dmitry Monakhov
    Acked-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 89ca4c75b2ad9f2610d390590dcd16e60e909f43
Author: Dmitry Monakhov
Date:   Tue Dec 8 22:41:52 2009 -0500

    ext4: ext4_get_reserved_space() must return bytes instead of blocks

    (cherry picked from commit 8aa6790f876e81f5a2211fe1711a5fe3fe2d7b20)

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Eric Sandeen
    Acked-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit a9a3ddb71f81e2529f7c38f06981723c8af1ffd0
Author: Curt Wohlgemuth
Date:   Tue Dec 8 22:18:25 2009 -0500

    ext4: remove blocks from inode prealloc list on failure

    (cherry picked from commit b844167edc7fcafda9623955c05e4c1b3c32ebc7)

    This fixes a leak of blocks in an inode prealloc list if device
    failures cause ext4_mb_mark_diskspace_used() to fail.

    Signed-off-by: Curt Wohlgemuth
    Acked-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 3b9d4e773c20b34534c46e563103071f2f540bf8
Author: Josef Bacik
Date:   Tue Dec 8 21:48:58 2009 -0500

    ext4: wait for log to commit when umounting

    (cherry picked from commit d4edac314e9ad0b21ba20ba8bc61b61f186f79e1)

    There is a potential race when a transaction is committing right when
    the file system is being unmounted. This could result in a race
    because EXT4_SB(sb)->s_group_info could be freed in ext4_put_super
    before the commit code calls a callback so the mballoc code can
    release freed blocks in the transaction, resulting in a panic trying
    to access the freed s_group_info.

    The fix is to wait for the transaction to finish committing before we
    shut down the multiblock allocator.

    Signed-off-by: Josef Bacik
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 1065591da0c1726f0aff654a7e23062e8898d923
Author: Jan Kara
Date:   Tue Dec 8 21:24:33 2009 -0500

    ext4: Avoid data / filesystem corruption when write fails to copy data

    (cherry picked from commit b9a4207d5e911b938f73079a83cc2ae10524ec7f)

    When ext4_write_begin fails after allocating some blocks or
    generic_perform_write fails to copy data to write, we truncate blocks
    already instantiated beyond i_size. Although these blocks were never
    inside i_size, we have to truncate the pagecache of these blocks so
    that corresponding buffers get unmapped. Otherwise subsequent
    __block_prepare_write (called because we are retrying the write) will
    find the buffers mapped, not call ->get_block, and thus the page will
    be backed by already freed blocks leading to filesystem and data
    corruption.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 5e5c63120a3f3be8b89d09198bb0bc76e8d4f926
Author: Roel Kluin
Date:   Mon Dec 7 10:38:16 2009 -0500

    ext4: Return the PTR_ERR of the correct pointer in setup_new_group_blocks()

    (cherry picked from commit c09eef305dd43846360944ad072f051f964fa383)

    Signed-off-by: Roel Kluin
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit aea93db412553d4d7319945868293a3f9b65a3a8
Author: Theodore Ts'o
Date:   Tue Dec 1 09:04:42 2009 -0500

    jbd2: Add ENOMEM checking in and for jbd2_journal_write_metadata_buffer()

    (cherry picked from commit e6ec116b67f46e0e7808276476554727b2e6240b)

    OOM happens.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit eee4dabf625a44ee556cf6869bd90d54364cf38b
Author: Akira Fujita
Date:   Tue Nov 24 10:31:56 2009 -0500

    ext4: move_extent_per_page() cleanup

    (cherry picked from commit ac48b0a1d068887141581bea8285de5fcab182b0)

    Integrate duplicate lines (acquire/release semaphore and invalidate
    extent cache in move_extent_per_page()) into mext_replace_branches(),
    to reduce source and object code size.

    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 3f54503c2b00d8d908f23621f090915bc51c0797
Author: Kazuya Mio
Date:   Tue Nov 24 10:28:48 2009 -0500

    ext4: initialize moved_len before calling ext4_move_extents()

    (cherry picked from commit 446aaa6e7e993b38a6f21c6acfa68f3f1af3dbe3)

    The move_extent.moved_len is used to pass back the number of exchanged
    blocks to user space. Currently the caller must clear this field; but
    we spend more code space checking for this requirement than simply
    zeroing the field ourselves, so let's just make life easier for
    everyone all around.

    Signed-off-by: Kazuya Mio
    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit b9894156432b3013cb46567a2a9a9c46027f4444
Author: Akira Fujita
Date:   Tue Nov 24 10:19:57 2009 -0500

    ext4: Fix double-free of blocks with EXT4_IOC_MOVE_EXT

    (cherry picked from commit 94d7c16cbbbd0e03841fcf272bcaf0620ad39618)

    At the beginning of ext4_move_extents(), we call
    ext4_discard_preallocations() to discard inode PAs of orig and donor
    inodes. But in the following case, blocks can be double freed, so
    move ext4_discard_preallocations() to the end of ext4_move_extents().

    1. Discard inode PAs of orig and donor inodes with
       ext4_discard_preallocations() in ext4_move_extents().

          orig : [ DATA1 ]
          donor: [ DATA2 ]

    2. While data blocks are being exchanged between the orig and donor
       inodes, new inode PAs are created for orig by another process's
       block allocation (since there are semaphore gaps in
       ext4_move_extents()), and the new inode PAs are partially used
       (2-1).

       2-1 Create new inode PAs to orig inode
          orig : [ DATA1 | used PA1 | free PA1 ]
          donor: [ DATA2 ]

    3. The donor inode, which now holds the old orig inode's blocks, is
       deleted after EXT4_IOC_MOVE_EXT has finished (3-1, 3-2). So the
       block bitmap entries corresponding to the old orig inode's blocks
       are freed.

       3-1 After EXT4_IOC_MOVE_EXT finished
          orig : [ DATA2 | free PA1 ]
          donor: [ DATA1 | used PA1 ]

       3-2 Delete donor inode
          orig : [ DATA2 | free PA1 ]
          donor: [ FREE SPACE(DATA1) | FREE SPACE(used PA1) ]

    4. The double free of blocks occurs when close() is called on the
       orig inode, because ext4_discard_preallocations() for the orig
       inode frees used PA1 and free PA1, though used PA1 was already
       freed in step 3.

       4-1 Double free of blocks occurs
          orig : [ DATA2 | FREE SPACE(free PA1) ]
          donor: [ FREE SPACE(DATA1) | DOUBLE FREE(used PA1) ]

    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 360bfe0d07db42ee99a47b5549ac06d81016f356
Author: Eric Sandeen
Date:   Thu Nov 19 14:28:50 2009 -0500

    ext4: make "norecovery" an alias for "noload"

    (cherry picked from commit e3bb52ae2bb9573e84c17b8e3560378d13a5c798)

    Users on the linux-ext4 list recently complained about differences
    across filesystems w.r.t. how to mount without a journal replay.

    In the discussion it was noted that xfs's "norecovery" option is
    perhaps more descriptively accurate than "noload," so let's make
    that an alias for ext4.

    Also show this status in /proc/mounts

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit c46e04b9e788d6abed811128aa4e61dc0996509c
Author: Eric Sandeen
Date:   Thu Nov 19 14:25:42 2009 -0500

    ext4: make trim/discard optional (and off by default)

    (cherry picked from commit 5328e635315734d42080de9a5a1ee87bf4cae0a4)

    It is anticipated that when sb_issue_discard starts doing
    real work on trim-capable devices, we may see issues.
    Make this mount-time optional, and default it to off until
    we know that things are working out OK.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 648d9cddf46454ef6aad9f8faf1dd051543647dd
Author: Jan Kara
Date:   Mon Nov 23 07:24:48 2009 -0500

    ext4: fix error handling in ext4_ind_get_blocks()

    (cherry picked from commit 2bba702d4f88d7b010ec37e2527b552588404ae7)

    When an error happened in ext4_splice_branch we failed to notice that
    in ext4_ind_get_blocks and mapped the buffer anyway. Fix the problem
    by checking for error properly.

    Signed-off-by: Jan Kara
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 84d77b74f7ce2e4937b18864fe6d033b29831918
Author: Theodore Ts'o
Date:   Mon Nov 23 07:24:57 2009 -0500

    ext4: avoid issuing unnecessary barriers

    (cherry picked from commit 6b17d902fdd241adfa4ce780df20547b28bf5801)

    We don't need to issue an I/O barrier on an error or if we force
    commit because we are doing data journaling.

    Signed-off-by: "Theodore Ts'o"
    Cc: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

commit b284e381f7b816a523636271274e38af7aae3dbd
Author: Theodore Ts'o
Date:   Sun Nov 15 15:29:56 2009 -0500

    ext4: fix block validity checks so they work correctly with meta_bg

    (cherry picked from commit 1032988c71f3f85483b2b4319684d1205a704c02)

    The block validity checks used by ext4_data_block_valid() weren't
    correctly written to check file systems with the meta_bg feature. Fix
    this.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 433a171d4c17eff96511617e132efb985f200fad
Author: Theodore Ts'o
Date:   Mon Nov 23 07:24:38 2009 -0500

    ext4: fix uninit block bitmap initialization when s_meta_first_bg is non-zero

    (cherry picked from commit 8dadb198cb70ef811916668fe67eeec82e8858dd)

    The number of old-style block group descriptor blocks is
    s_meta_first_bg when the meta_bg feature flag is set.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit a0dacfcfac479a5b950a8d863fe8b55328a6e1cd
Author: Theodore Ts'o
Date:   Mon Nov 23 07:24:52 2009 -0500

    ext4: don't update the superblock in ext4_statfs()

    (cherry picked from commit 3f8fb9490efbd300887470a2a880a64e04dcc3f5)

    commit a71ce8c6c9bf269b192f352ea555217815cf027e updated ext4_statfs()
    to update the on-disk superblock counters, but modified this buffer
    directly without any journaling of the change. This is one of the
    accesses that was causing the crc errors in journal replay as seen in
    kernel.org bugzilla #14354.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit c1e25a5675eac46bee85953720705efceceb70a2
Author: Eric Sandeen
Date:   Sun Nov 15 15:30:52 2009 -0500

    ext4: journal all modifications in ext4_xattr_set_handle

    (cherry picked from commit 86ebfd08a1930ccedb8eac0aeb1ed4b8b6a41dbc)

    ext4_xattr_set_handle() was zeroing out an inode outside
    of journaling constraints; this is one of the accesses that
    was causing the crc errors in journal replay as seen in
    kernel.org bugzilla #14354.

    Reviewed-by: Andreas Dilger
    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 882e3a7d67adf1bd773efd2d9999383b414832fd
Author: Julia Lawall
Date:   Sun Nov 15 15:30:58 2009 -0500

    ext4: fix i_flags access in ext4_da_writepages_trans_blocks()

    (cherry picked from commit 30c6e07a92ea4cb87160d32ffa9bce172576ae4c)

    We need to be testing the i_flags field in the ext4 specific portion
    of the inode, instead of the (confusingly aliased) i_flags field in
    the generic struct inode.

    Signed-off-by: Julia Lawall
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 063875158fa59b569ba26b9b259e822b5b51657e
Author: Theodore Ts'o
Date:   Mon Nov 23 07:17:34 2009 -0500

    ext4: make sure directory and symlink blocks are revoked

    (cherry picked from commit 50689696867d95b38d9c7be640a311494a04fb86)

    When an inode gets unlinked, the functions ext4_clear_blocks() and
    ext4_remove_blocks() call ext4_forget() for all the buffer heads
    corresponding to the deleted inode's data blocks. If the inode is a
    directory or a symlink, the is_metadata parameter must be non-zero so
    ext4_forget() will revoke them via jbd2_journal_revoke(). Otherwise,
    if these blocks are reused for a data file, and the system crashes
    before a journal checkpoint, the journal replay could end up
    corrupting these data blocks.

    Thanks to Curt Wohlgemuth for pointing out potential problems in this
    area.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 97a9a2516af7df77a961e36466be866f7facfec3
Author: Theodore Ts'o
Date:   Sat Nov 14 08:19:05 2009 -0500

    ext4: plug a buffer_head leak in an error path of ext4_iget()

    (cherry picked from commit 567f3e9a70d71e5c9be03701b8578be77857293b)

    One of the invalid error paths in ext4_iget() forgot to brelse() the
    inode buffer head. Fix it by adding a brelse() in the common error
    return path, which also simplifies the function.

    Thanks to Andi Kleen for reporting the problem.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 1cb0b894567626e02ad2fb6d8e74356918624499
Author: Akira Fujita
Date:   Mon Nov 23 07:24:41 2009 -0500

    ext4: fix possible recursive locking warning in EXT4_IOC_MOVE_EXT

    (cherry picked from commit 49bd22bc4d603a2a4fc2a6a60e156cbea52eb494)

    If CONFIG_PROVE_LOCKING is enabled, the double_down_write_data_sem()
    will trigger a false-positive warning of a recursive lock. Since we
    take i_data_sem for the two inodes ordered by their inode numbers,
    this isn't a problem.
    Use of down_write_nested() will notify the lock dependency checker
    machinery that there is no problem here.

    This problem was reported by Brian Rogers:

    	http://marc.info/?l=linux-ext4&m=125115356928011&w=1

    Reported-by: Brian Rogers
    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit c6b24d6fd81cf34718b0a075e9ce0584e7f6a784
Author: Akira Fujita
Date:   Mon Nov 23 07:24:43 2009 -0500

    ext4: fix lock order problem in ext4_move_extents()

    (cherry picked from commit fc04cb49a898c372a22b21fffc47f299d8710801)

    ext4_move_extents() checks the logical block contiguousness
    of the original file with ext4_ext_find_extent() and
    mext_next_extent(). Therefore the extent which the ext4_ext_path
    structure indicates must not be changed between the above functions.

    But in the current implementation, there is no i_data_sem protection
    between ext4_ext_find_extent() and mext_next_extent(). So the extent
    which the ext4_ext_path structure indicates may be overwritten by
    delalloc. As a result, ext4_move_extents() will exchange the wrong
    blocks between the original and donor files. I change the place where
    i_data_sem is acquired and released to solve this problem.

    Moreover, I changed move_extent_per_page() to start the transaction
    first, and then acquire i_data_sem. Without this change, there is a
    possibility of a deadlock between mmap() and ext4_move_extents():

    * NOTE: "A", "B" and "C" mean different processes

    A-1: ext4_ext_move_extents() acquires i_data_sem of two inodes.

    B:   do_page_fault() starts the transaction (T),
         and then tries to acquire i_data_sem.
         But process "A" is already holding it, so it is kept waiting.

    C:   While "A" and "B" are running, kjournald2 tries to commit
         transaction (T), but it is still being updated, so kjournald2
         waits for it.

    A-2: Calls ext4_journal_start while holding i_data_sem,
         but transaction (T) is locked.

    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 9a88a5ad2afe14b6f077c805a9c6dd95679aabba
Author: Akira Fujita
Date:   Mon Nov 23 07:25:48 2009 -0500

    ext4: fix the returned block count if EXT4_IOC_MOVE_EXT fails

    (cherry picked from commit f868a48d06f8886cb0367568a12367fa4f21ea0d)

    If the EXT4_IOC_MOVE_EXT ioctl fails, the number of blocks that were
    exchanged before the failure should be returned to the userspace
    caller. Unfortunately, currently if the block size is not the same as
    the page size, the block count that is returned is the page-aligned
    block count instead of the actual block count. This commit addresses
    this bug.

    Signed-off-by: Akira Fujita
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit a7aaaff940e284035a16e07cf90424da386ddce1
Author: Theodore Ts'o
Date:   Mon Nov 23 07:24:46 2009 -0500

    ext4: avoid divide by zero when trying to mount a corrupted file system

    (cherry picked from commit 503358ae01b70ce6909d19dd01287093f6b6271c)

    If s_log_groups_per_flex is greater than 31, then groups_per_flex will
    overflow and cause a divide by zero error. This can cause a kernel
    BUG if such a file system is mounted.

    Thanks to Nageswara R Sastry for analyzing the failure and providing
    an initial patch.

    http://bugzilla.kernel.org/show_bug.cgi?id=14287

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 37bd334da00d116dba0526292891b1b572f1955b
Author: Theodore Ts'o
Date:   Mon Nov 23 07:25:49 2009 -0500

    ext4: fix potential buffer head leak when add_dirent_to_buf() returns ENOSPC

    (cherry picked from commit 2de770a406b06dfc619faabbf5d85c835ed3f2e1)

    Previously add_dirent_to_buf() did not free its passed-in buffer head
    in the case of ENOSPC, since in some cases the caller still needed it.
    However, this led to potential buffer head leaks since not all callers
    dealt with this correctly.
    Fix this by simplifying the freeing convention; now
    add_dirent_to_buf() *never* frees the passed-in buffer head, and
    leaves that to the responsibility of its caller. This makes things
    cleaner and makes it easier to prove that the code is neither leaking
    buffer heads nor calling brelse() one time too many.

    Signed-off-by: "Theodore Ts'o"
    Cc: Curt Wohlgemuth
    Signed-off-by: Greg Kroah-Hartman

commit 00909f7247db77fa3a3237b1d7b3d69ea25edd8a
Author: Mingming
Date:   Fri Nov 6 04:01:23 2009 -0500

    ext4: Fix return value of ext4_split_unwritten_extents() to fix direct I/O

    (cherry picked from commit ba230c3f6dc88ec008806adb27b12088486d508e)

    To prepare for a direct I/O write, we need to split the unwritten
    extents before submitting the I/O. When no extents needed to be
    split, ext4_split_unwritten_extents() was incorrectly returning 0
    instead of the size of uninitialized extents. This bug caused the
    wrong return value to be sent back to VFS code when it gets called
    from the async IO path, leading to an unnecessary fall back to
    buffered IO.

    This bug also hid the fact that the check to see whether or not a
    split would be necessary was incorrect; we can only skip splitting the
    extent if the write completely covers the uninitialized extent.

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 2bcfe650d1b5c9994ce58a25efd5a6f5eb4ad0ee
Author: Mingming
Date:   Tue Nov 3 14:44:54 2009 -0500

    ext4: code clean up for dio fallocate handling

    (cherry picked from commit 4b70df181611012a3556f017b57dfcef7e1d279f)

    The ext4_debug() call in ext4_end_io_dio() should be moved after the
    check to make sure that io_end is non-NULL.

    The comment above ext4_get_block_dio_write() ("Maximum number of
    blocks...") is a duplicate; the original and correct comment is above
    the #define DIO_MAX_BLOCKS up above.

    Based on review comments from Curt Wohlgemuth.

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 3e6c8d3881736079664aad465e26486d1d4152b9
Author: Mingming
Date:   Tue Nov 10 10:48:04 2009 -0500

    ext4: skip conversion of uninit extents after direct IO if there isn't any

    (cherry picked from commit 5f5249507e4b5c4fc0f9c93f33d133d8c95f47e1)

    At the end of direct I/O operation, ext4_ext_direct_IO() always called
    ext4_convert_unwritten_extents(), regardless of whether there were any
    unwritten extents involved in the I/O or not.

    This commit adds a state flag so that ext4_ext_direct_IO() only calls
    ext4_convert_unwritten_extents() when necessary.

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 3e270fa58d81a994d1e4371be444f2e5bb072866
Author: Mingming
Date:   Tue Nov 10 10:48:08 2009 -0500

    ext4: fix ext4_ext_direct_IO()'s return value after converting uninit extents

    (cherry picked from commit 109f55651954def97fa41ee71c464d268c512ab0)

    After a direct I/O request covering an uninitialized extent (i.e.,
    created using the fallocate system call) or a hole in a file, ext4
    will convert the uninitialized extent so it is marked as initialized
    by calling ext4_convert_unwritten_extents(). This function returns
    zero on success.

    This return value was getting returned by ext4_direct_IO(); however
    the file system's direct_IO function is supposed to return the number
    of bytes read or written on a success. By returning zero, it confused
    the direct I/O code into falling back to buffered I/O unnecessarily.

    Signed-off-by: Mingming Cao
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit ba593a59c912b3ae1f18b74d8ad66594ba3a47b6
Author: Aneesh Kumar K.V
Date:   Mon Nov 2 18:50:49 2009 -0500

    ext4: discard preallocation when restarting a transaction during truncate

    (cherry picked from commit fa5d11133b07053270e18fa9c18560e66e79217e)

    When restarting a transaction during a truncate operation, we drop and
    reacquire i_data_sem.
    After reacquiring i_data_sem, we need to discard any inode-based
    preallocation that might have been grabbed while we released
    i_data_sem (for example, if pdflush is allocating blocks and racing
    against the truncate).

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 620c66ab00f5b06de7119fbe0cc7dba3cc9aca7d
Author: Eric Sandeen
Date:   Fri Oct 2 21:20:55 2009 -0400

    ext4: retry failed direct IO allocations

    (cherry picked from commit fbbf69456619de5d251cb9f1df609069178c62d5)

    On a 256M filesystem, doing this in a loop:

            xfs_io -F -f -d -c 'pwrite 0 64m' test
            rm -f test

    eventually leads to ENOSPC. (The xfs_io command does a
    64m direct IO write to the file "test".)

    As with other block allocation callers, it looks like we need to
    potentially retry the allocations on the initial ENOSPC.

    Signed-off-by: Eric Sandeen
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 2489f42e40972d65e50a0b842297f90ce02e6cb0
Author: Theodore Ts'o
Date:   Wed Sep 30 22:57:41 2009 -0400

    ext4: fix a BUG_ON crash by checking that page has buffers attached to it

    (cherry picked from commit 1f94533d9cd75f6d2826018d54a971b9cc085992)

    In ext4_num_dirty_pages() we were calling page_buffers() before
    checking to see if the page actually had buffers attached to it; this
    would cause a BUG check crash in the inline function page_buffers().

    Thanks to Markus Trippelsdorf for reporting this bug.

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 28c72b7fbc403ec439b0108e6559c9c67a3afe8d
Author: Theodore Ts'o
Date:   Wed Sep 30 01:13:55 2009 -0400

    ext4: Fix time encoding with extra epoch bits

    (cherry picked from commit c1fccc0696bcaff6008c11865091f5ec4b0937ab)

    "Looking at ext4.h, I think the setting of extra time fields forgets
    to mask the epoch bits so the epoch part overwrites nsec part. The
    second change is only for coherency (2 -> EXT4_EPOCH_BITS)."

    Thanks to Damien Guibouret for pointing out this problem.
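
The masking problem quoted above can be illustrated with a small userspace model. It is a sketch under stated assumptions rather than the ext4.h code: the layout (low two bits for the epoch, remaining bits for nanoseconds) is inferred from the "2 -> EXT4_EPOCH_BITS" remark, and the names `SKETCH_EPOCH_BITS`, `SKETCH_EPOCH_MASK`, `SKETCH_NSEC_MASK` and all three functions are hypothetical.

```c
#include <stdint.h>

/* Assumed layout for this sketch: low 2 bits epoch, upper 30 bits nsec. */
#define SKETCH_EPOCH_BITS 2
#define SKETCH_EPOCH_MASK ((1u << SKETCH_EPOCH_BITS) - 1)
#define SKETCH_NSEC_MASK  (~0u << SKETCH_EPOCH_BITS)

/* Masked encoding: the epoch part cannot spill into the nsec bits. */
uint32_t encode_extra_time(int64_t tv_sec, uint32_t tv_nsec)
{
	uint32_t epoch = (uint32_t)((uint64_t)tv_sec >> 32) & SKETCH_EPOCH_MASK;
	uint32_t nsec = (tv_nsec << SKETCH_EPOCH_BITS) & SKETCH_NSEC_MASK;

	return epoch | nsec;
}

/*
 * Unmasked variant, modeling the failure the quote describes: any bits
 * of (tv_sec >> 32) above the epoch width overwrite the nsec part.
 */
uint32_t encode_extra_time_unmasked(int64_t tv_sec, uint32_t tv_nsec)
{
	return (uint32_t)((uint64_t)tv_sec >> 32) |
	       (tv_nsec << SKETCH_EPOCH_BITS);
}

/* Recover the nanosecond part from an encoded value. */
uint32_t decode_nsec(uint32_t extra)
{
	return (extra & SKETCH_NSEC_MASK) >> SKETCH_EPOCH_BITS;
}
```

For a negative `tv_sec` such as -1, the upper 32 bits are all ones, so the unmasked variant sets every nsec bit, while the masked variant keeps only the two epoch bits and preserves the nanoseconds.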

    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit e39f1920521ed14d874d638265e27a16526df494
Author: Curt Wohlgemuth
Date:   Tue Sep 29 11:01:03 2009 -0400

    ext4: Handle nested ext4_journal_start/stop calls without a journal

    (cherry picked from commit d3d1faf6a74496ea4435fd057c6a2cad49f3e523)

    This patch fixes a problem with handling nested calls to
    ext4_journal_start/ext4_journal_stop, when there is no journal present.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 94ab12a2649b1e86677771d310a20752ba5b1542
Author: Curt Wohlgemuth
Date:   Tue Sep 29 16:06:01 2009 -0400

    ext4: Make sure ext4_dirty_inode() updates the inode in no journal mode

    (cherry picked from commit f3dc272fd5e2ae08244796bb39e7e1ce4b25d3b3)

    This patch fixes a problem where ext4_dirty_inode() was not calling
    ext4_mark_inode_dirty() if the current_handle is not valid, which is
    the case in no journal mode.

    It also removes a test for non-matching transaction which can never
    happen.

    Signed-off-by: Curt Wohlgemuth
    Signed-off-by: "Theodore Ts'o"
    Signed-off-by: Greg Kroah-Hartman

commit 46d67b3c075b0d1630b57630f956edb1dad61baa
Author: Frank Mayhar
Date:   Tue Sep 29 10:07:47 2009 -0400

    ext4: Avoid updating the inode table bh twice in no journal mode

    (cherry picked from commit 830156c79b0a99ddf0f62496bcf4de640f9f52cd)

    This is a cleanup of commit 91ac6f4. Since ext4_mark_inode_dirty()
    has already called ext4_mark_iloc_dirty(), which in turn calls
    ext4_do_update_inode(), it's not necessary to have ext4_write_inode()
    call ext4_do_update_inode() in no journal mode. Indeed, it would be
    duplicated work.
Reviewed-by: "Aneesh Kumar K.V" Signed-off-by: Frank Mayhar Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit ca23b5fd5bd6fd09e90008f9390cff248e01b718 Author: Theodore Ts'o Date: Mon Sep 28 15:58:29 2009 -0400 ext4: EXT4_IOC_MOVE_EXT: Check for different original and donor inodes first (cherry picked from commit f3ce8064b388ccf420012c5a4907aae4f13fe9d0) Move the check to make sure the original and donor inodes are different earlier, to avoid a potential deadlock by trying to lock the same inode twice. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 3c42b7eb297192519fbfa80e1c77abd01c223114 Author: Mingming Cao Date: Mon Sep 28 15:48:29 2009 -0400 ext4: async direct IO for holes and fallocate support (cherry picked from commit 8d5d02e6b176565c77ff03604908b1453a22044d) For async direct IO that covers holes or fallocate, the end_io callback function now queues the conversion work on a workqueue but doesn't flush the work right away, as that might take too long. But when fsync is called after all the data I/O has completed, the user expects the metadata to be updated as well before fsync returns. Thus we need to flush the conversion work when fsync() is called. This patch keeps track of a list of completed async direct I/O requests that have work queued on the workqueue. When fsync() is called, it will go through the list and do the conversion. Signed-off-by: Mingming Cao Signed-off-by: Greg Kroah-Hartman commit 1295e40acf6d150d129fac939bc97aed771e6c7b Author: Mingming Cao Date: Mon Sep 28 15:48:41 2009 -0400 ext4: Use end_io callback to avoid direct I/O fallback to buffered I/O (cherry picked from commit 4c0425ff68b1b87b802ffeda7b6a46ff7da7241c) Currently the DIO VFS code passes create = 0 when writing to the middle of a file. It does this to avoid block allocation for holes, so as not to expose stale data when there is a parallel buffered read (which does not hold the i_mutex lock).
Direct I/O writes into holes fall back to buffered IO for this reason. Since preallocated extents are treated as holes when doing a get_block() look up (buffer is not mapped), direct IO over fallocate also falls back to buffered IO. Thus ext4 actually silently falls back to buffered IO in the above two cases, which is undesirable. To fix this, this patch creates uninitialized extents when a direct I/O write goes into holes in sparse files, and registers an end_io callback which converts the uninitialized extent to an initialized extent after the I/O is completed. Signed-off-by: Mingming Cao Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 45eef039f3a5d0bce583d0a9b34818127ea79e49 Author: Mingming Cao Date: Mon Sep 28 15:49:08 2009 -0400 ext4: Split uninitialized extents for direct I/O (cherry picked from commit 0031462b5b392f90d17f1d75abb795883c44e969) When writing into an uninitialized extent via direct I/O, and the direct I/O doesn't exactly cover the uninitialized extent, split the extent into uninitialized and initialized extents before submitting the I/O. This avoids needing to deal with an ENOSPC error in the end_io callback that gets used for direct I/O. When the IO is complete, the written extent will be marked as initialized. Signed-off-by: Mingming Cao Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit f26ac96224871adf0647496e4154eabdee3689a6 Author: Mingming Cao Date: Mon Sep 28 15:49:52 2009 -0400 ext4: release reserved quota when block reservation for delalloc retry (cherry picked from commit 9f0ccfd8e07d61b413e6536ffa02fbf60d2e20d8) ext4_da_reserve_space() can reserve quota blocks multiple times if ext4_claim_free_blocks() fails and we retry the allocation. We should release the quota reservation before restarting. Bug found by Jan Kara.
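The retry problem described above can be sketched with a toy model. This is an illustration only, not the ext4 code: struct toy_quota and toy_da_reserve() are hypothetical names, and the 'failures' argument simply models how many times the free-block claim fails before it succeeds.

```c
#include <assert.h>

/* Toy model, not kernel code: all names here are hypothetical. */
struct toy_quota {
    long reserved;   /* blocks currently reserved against quota */
};

/* Reserve nblocks against quota, retrying while the free-block claim
 * fails. 'failures' is how many claims fail before one succeeds. */
static int toy_da_reserve(struct toy_quota *q, long nblocks, int failures)
{
    assert(nblocks > 0);
    for (;;) {
        q->reserved += nblocks;      /* take the quota reservation */
        if (failures-- <= 0)
            return 0;                /* claim succeeded: keep the reservation */
        q->reserved -= nblocks;      /* claim failed: release before retrying */
    }
}
```

Without the release step, each failed attempt would leave its reservation behind, so three failures followed by a success would charge four times the requested amount.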
Signed-off-by: Mingming Cao Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 54088f5df9e352c41ef0f2eeb17ec9ae995ab4b3 Author: Theodore Ts'o Date: Tue Sep 29 13:31:31 2009 -0400 ext4: Adjust ext4_da_writepages() to write out larger contiguous chunks (cherry picked from commit 55138e0bc29c0751e2152df9ad35deea542f29b3) Work around problems in the writeback code to force out writebacks in larger chunks than just 4mb, which is just too small. This also works around limitations in the ext4 block allocator, which can't allocate more than 2048 blocks at a time. So we need to defeat the round-robin characteristics of the writeback code and try to write out as many blocks in one inode before allowing the writeback code to move on to another inode. We add a new per-filesystem tunable, max_writeback_mb_bump, which caps this to a default of 128mb per inode. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 0be3bd025b9893a0f2ed3d1cbeb06d7abc0e27df Author: Theodore Ts'o Date: Mon Sep 28 00:06:20 2009 -0400 ext4: Fix hueristic which avoids group preallocation for closed files (cherry picked from commit 71780577306fd1e76c7a92e3b308db624d03adb9) The heuristic was designed to avoid using locality group preallocation when writing the last segment of a closed file. Fix it by moving the setting of size to the maximum of size and isize until after we check whether size == isize. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit a6c92a88b556da1073e03d6936ae70c2c045421c Author: Theodore Ts'o Date: Thu Sep 17 09:34:16 2009 -0400 ext4: Fix the alloc on close after a truncate hueristic (cherry picked from commit 5534fb5bb35a62a94e0bd1fa2421f7fb6e894f10) In an attempt to avoid doing an unneeded flush after opening a (previously non-existent) file with O_CREAT|O_TRUNC, the code only triggered the heuristic if ei->disksize was non-zero.
Turns out that the VFS doesn't call ->truncate() if the file doesn't exist, and ei->disksize is always zero even if the file previously existed. So remove the test, since it isn't necessary and in fact disabled the heuristic. Thanks to Clemens Eisserer for reporting that he was seeing problems with files written using kwrite and eclipse after sudden crashes caused by a buggy Intel video driver. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 1f51991dec0bd679d172b55a86d54ca3f6227c51 Author: Theodore Ts'o Date: Thu Sep 17 08:32:22 2009 -0400 ext4: store EXT4_EXT_MIGRATE in i_state instead of i_flags (cherry picked from commit 1b9c12f44c1eb614fd3b8822bfe8f1f5d8e53737) EXT4_EXT_MIGRATE is only intended to be used for an in-memory flag, and the hex value assigned to it collides with FS_DIRECTIO_FL (which is also stored in i_flags). There's no reason for the EXT4_EXT_MIGRATE bit to be stored in i_flags, so we switch it to use i_state instead. Cc: "Aneesh Kumar K.V" Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit d55f431a04b186cc1ede0f100a030c9dda25bd9c Author: Eric Sandeen Date: Wed Sep 16 14:45:10 2009 -0400 ext4: limit block allocations for indirect-block files to < 2^32 (cherry picked from commit fb0a387dcdcd21aab1b09ee7fd80b7c979bdbbfd) Today, the ext4 allocator will happily allocate blocks past 2^32 for indirect-block files, which results in the block numbers getting truncated, and corruption ensues. This patch limits such allocations to < 2^32, and adds BUG_ONs if we do get blocks larger than that. This should address RH Bug 519471, ext4 bitmap allocator must limit blocks to < 2^32 * ext4_find_goal() is modified to choose a goal < UINT_MAX, so that our starting point is in an acceptable range. * ext4_xattr_block_set() is modified such that the goal block is < UINT_MAX, as above.
* ext4_mb_regular_allocator() is modified so that the group search does not continue into groups which are too high * ext4_mb_use_preallocated() has a check that we don't use preallocated space which is too far out * ext4_alloc_blocks() and ext4_xattr_block_set() add some BUG_ONs No attempt has been made to limit inode locations to < 2^32, so we may wind up with blocks far from their inodes. Doing this much already will lead to some odd ENOSPC issues when the "lower 32" gets full, and further restricting inodes could make that even weirder. For high inodes, choosing a goal of the original, % UINT_MAX, may be a bit odd, but then we're in an odd situation anyway, and I don't know of a better heuristic. Signed-off-by: Eric Sandeen Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 45d2735a0725d39b77a22cf535566ed570d6fc0a Author: Akira Fujita Date: Wed Sep 16 14:25:39 2009 -0400 ext4: Fix different block exchange issue in EXT4_IOC_MOVE_EXT (cherry picked from commit c40ce3c9ea97425a12d7e44031a98fe50add6fc1) If logical block offset of original file which is passed to EXT4_IOC_MOVE_EXT is different from donor file's, a calculation error occurs in ext4_calc_swap_extents(), therefore wrong block is exchanged between original file and donor file. As a result, we hit ext4_error() in check_block_validity(). To detect the logical offset difference in EXT4_IOC_MOVE_EXT, add checks to mext_calc_swap_extents() and handle it as error, since data exchange must be done between the same blocks in EXT4_IOC_MOVE_EXT. Reported-by: Peng Tao Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit f9211e5e69570ec989a5297d1637229340cadb61 Author: Akira Fujita Date: Wed Sep 16 14:25:07 2009 -0400 ext4: Add null extent check to ext_get_path (cherry picked from commit 347fa6f1c7cb5df2b38d3c9167cfe242ce0cd1da) There is the possibility that path structure which is taken by ext4_ext_find_extent() indicates null extents. 
This can happen because the structure of the extent tree may change while data blocks are being exchanged in ext4_move_extents(). As a solution, the patch adds a null extent check to ext_get_path(). Reported-by: Peng Tao Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 06f3ce16b31eb2015f873c96d459494964f9b7cd Author: Akira Fujita Date: Wed Sep 16 13:46:35 2009 -0400 ext4: Replace BUG_ON() with ext4_error() in move_extents.c (cherry picked from commit 2147b1a6a48e28399120ca51d4a91840a278611f) Replace BUG_ON calls with a call to ext4_error() to print an error message if EXT4_IOC_MOVE_EXT fails for some reason. This helps with debugging. Ted pointed this out, thanks. Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit cc61eb8051988796a4a5103b749bb06e24df236c Author: Akira Fujita Date: Wed Sep 16 13:46:38 2009 -0400 ext4: Replace get_ext_path macro with an inline funciton (cherry picked from commit e8505970af46658ece2545e9bc1fe594998fdcdf) Replace get_ext_path macro with an inline function, since this macro looks like a function call but its arguments get modified. Ted pointed this out, thanks. Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 77e8c777ef2f4e31b690688fc27014b88feead9d Author: Akira Fujita Date: Sat Sep 5 23:12:41 2009 -0400 ext4: Fix small typo for move_extent_per_page() (cherry picked from commit 44fc48f7048ab9657b524938a832fec4e0acea98) This function moves extents one page at a time, so change its name from move_extent_par_page().
Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit e4a8cc648a9a3decca1bc5f0f16aacdfc0d72d5a Author: Theodore Ts'o Date: Mon Sep 14 22:59:50 2009 -0400 ext4: Fix include/trace/events/ext4.h to work with Systemtap (cherry picked from commit 3661d28615ea580c1db02a972fd4d3898df1cb01) Using relative pathnames in #include statements interacts badly with SystemTap, since the fs/ext4/*.h header files are not packaged up as part of a distribution kernel's header files. Since systemtap doesn't use TP_fast_assign(), we can use a blind structure definition and then make sure the needed header files are defined before the ext4 source files #include the trace/events/ext4.h header file. https://bugzilla.redhat.com/show_bug.cgi?id=512478 Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 1de54d1bc0248c2b0f21e22b48f145271d38cbd5 Author: Theodore Ts'o Date: Fri Sep 11 16:51:28 2009 -0400 ext4: Fix initalization of s_flex_groups (cherry picked from commit 7ad9bb651fc2036ea94bed94da76a4b08959a911) The s_flex_groups array should have been initialized using atomic_add to sum up the free counts from the block groups that make up a flex_bg. By using atomic_set, the value of the s_flex_groups array was set to the values of the last block group in the flex_bg. The impact of this bug is that the block and inode allocation algorithms might not pick the best flex_bg for new allocation. Thanks to Damien Guibouret for pointing out this problem! Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit d78f2ac343787f48e97c87676fccaef850a50cbe Author: Andreas Schlick Date: Thu Sep 10 23:16:07 2009 -0400 ext4: Always set dx_node's fake_dirent explicitly. 
(cherry picked from commit 1f7bebb9e911d870fa8f997ddff838e82b5715ea) When ext4_dx_add_entry() has to split an index node, it has to ensure that name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck won't recognise it as an intermediate htree node and consider the htree to be corrupted. Signed-off-by: Andreas Schlick Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 7a71dacba604fee399884a5f3d2c53e21e9c3a79 Author: Theodore Ts'o Date: Thu Sep 10 17:31:04 2009 -0400 ext4: Don't update superblock write time when filesystem is read-only (cherry picked from commit 71290b368ad5e1e0b0b300c9d5638490a9fd1a2d) This avoids updating the superblock write time when we are mounting the root file system read/only but we need to replay the journal. For people who are east of GMT and who make their clock tick in localtime for Windows bug-for-bug compatibility, updating the write time at that point will cause e2fsck to complain and force a full file system check. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 44373f34a740e9155cc011734ae0342b34e40d2c Author: Aneesh Kumar K.V Date: Wed Sep 9 23:34:50 2009 -0400 ext4: check for need init flag in ext4_mb_load_buddy (cherry picked from commit f41c0750538667b87a19c93952e5d42fcc069bd7) We should check for need init flag with the group's alloc_sem held, to make sure while we are loading the buddy cache and holding a reference to it, a file system resize can't add new blocks to same group. The patch also drops the need init flag check in ext4_mb_regular_allocator() because doing the check without holding alloc_sem is racy.
Signed-off-by: "Theodore Ts'o" Signed-off-by: Aneesh Kumar K.V Signed-off-by: Greg Kroah-Hartman commit ee0cc3fed6eacd6d791b89fa8fd839918e41fefb Author: Aneesh Kumar K.V Date: Wed Sep 9 23:47:46 2009 -0400 ext4: move ext4_mb_init_group() function earlier in the mballoc.c (cherry picked from commit b6a758ec3af3ec236dbfdcf6a06b84ac8f94957e) This moves the function around so that it can be called from ext4_mb_load_buddy(). Signed-off-by: Aneesh Kumar K.V Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 90608af946cb40a743f3d7af58310ff8381b7101 Author: Frank Mayhar Date: Wed Sep 9 22:33:47 2009 -0400 ext4: Make non-journal fsync work properly (cherry picked from commit 91ac6f43317c0bf99969665f98016548011dfa38) Teach ext4_write_inode() and ext4_do_update_inode() about non-journal mode: If we're not using a journal, ext4_write_inode() now calls ext4_do_update_inode() (after getting the iloc via ext4_get_inode_loc()) with a new "do_sync" parameter. If that parameter is nonzero _and_ we're not using a journal, ext4_do_update_inode() calls sync_dirty_buffer() instead of ext4_handle_dirty_metadata(). This problem was found in power-fail testing, checking the amount of loss of files and blocks after a power failure when using fsync() and when not using fsync(). It turned out that using fsync() was actually worse than not doing so, possibly because it increased the likelihood that the inodes would remain unflushed and would therefore be lost at the power failure. 
Signed-off-by: Frank Mayhar Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 75c1a3f1f6b164f280b7713a8b301e125ae86292 Author: Theodore Ts'o Date: Sat Sep 12 13:41:55 2009 -0400 ext4: Assure that metadata blocks are written during fsync in no journal mode (cherry picked from commit fe188c0e084bdf3038dc0ac963c21d764f53f7da) When there is no journal present, we must attach buffer heads associated with extent tree and indirect blocks to the inode's mapping->private_list via mark_buffer_dirty_inode() so that ext4_sync_file() --- which is called to service fsync() and fdatasync() system calls --- can write out the inode's metadata blocks by calling sync_mapping_buffers(). Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 523a2ab5bf81efa0fc9a1b6afc43e729ddae58f0 Author: Theodore Ts'o Date: Wed Sep 9 21:32:41 2009 -0400 ext4: Use bforget() in no journal mode for ext4_journal_{forget,revoke}() (cherry picked from commit c7acb4c16646943180bd221c167a077e0a084f9c) When ext4 is using a journal, a metadata block which is deallocated must be passed into the journal layer so it can be dropped from the current transaction and/or revoked. This is done by calling the functions ext4_journal_forget() and ext4_journal_revoke(), which call jbd2_journal_forget(), and jbd2_journal_revoke(), respectively. Since the jbd2_journal_forget() and jbd2_journal_revoke() call bforget(), if ext4 is not using a journal, ext4_journal_forget() and ext4_journal_revoke() must call bforget() to avoid a dirty metadata block overwriting a block after it has been reallocated and reused for another inode's data block. 
Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 6758feb58f9c66dbbd8ef67543deb03311296f3a Author: Theodore Ts'o Date: Tue Sep 8 08:21:26 2009 -0400 ext4: print more sysadmin-friendly message in check_block_validity() (cherry picked from commit 80e42468d65475e92651e62175bb7807773321d0) Drop the WARN_ON(1), as the stack trace is not appropriate, since it is triggered by file system corruption, and it misleads users into thinking there is a kernel bug. In addition, change the message displayed by ext4_error() to make it clear that this is a file system corruption problem. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit f112250b558ed59538502992570272253b4ef129 Author: Aneesh Kumar K.V Date: Wed Sep 9 22:36:03 2009 -0400 ext4: Take page lock before looking at attached buffer_heads flags (cherry picked from commit a827eaffff07c7d58a4cb32158cbeb4849f4e33a) In order to check whether the buffer_heads are mapped we need to hold page lock. Otherwise a reclaim can cleanup the attached buffer_heads. Signed-off-by: Aneesh Kumar K.V Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit de48efe24d0e6a2291dc4fe23271b5e5b16b01a7 Author: Akira Fujita Date: Sat Sep 5 22:46:29 2009 -0400 ext4: Return exchanged blocks count to user space in failure (cherry picked from commit 8d6669133d8cdbb7cbe0e1f0f3744e7802a84afe) Return exchanged blocks count (moved_len) to user space, if ext4_move_extents() failed on the way. Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 4062fc4957707493ff8d53d21c058838a8425f2a Author: Akira Fujita Date: Sat Sep 5 22:11:55 2009 -0400 ext4: Remove unneeded BUG_ON() in ext4_move_extents() (cherry picked from commit daea696dbac0e33af3cfe304efbfb8d74e0effe6) The ext4_move_extents() function checks with BUG_ON() whether the exchanged blocks count matches the requested blocks count.
But, if the target range (orig_start + len) includes sparse block(s), 'moved_len' (exchanged blocks count) does not agree with 'len' (request blocks count), since sparse block is not counted in 'moved_len'. This causes us to hit the BUG_ON(), even though the function succeeded. Signed-off-by: Akira Fujita Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit ad6030eecf171af161ad4c215c8d3f967b5ddfc5 Author: Akira Fujita Date: Wed Sep 16 14:28:22 2009 -0400 ext4: Fix wrong comparisons in mext_check_arguments() (cherry picked from commit 70d5d3dcea47c16058d2b093c29e07fdf61b56ad) The mext_check_arguments() function in move_extents.c has wrong comparisons. orig_start which is passed from user-space is block unit, but i_size of inode is byte unit, therefore the checks do not work fine. This mis-check leads to the overflow of 'len' and then hits BUG_ON() in ext4_move_extents(). The patch fixes this issue. Signed-off-by: Akira Fujita Reviewed-by: Greg Freemyer Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 2a8af5d4d0f69036fc02e91642f862e473350143 Author: Christoph Hellwig Date: Sat Sep 5 21:42:42 2009 -0400 ext4: fix cache flush in ext4_sync_file (cherry picked from commit 5f3481e9a80c240f169b36ea886e2325b9aeb745) We need to flush the write cache unconditionally in ->fsync, otherwise writes into already allocated blocks can get lost. Writes into fully allocated files are very common when using disk images for virtualization, and without this fix can easily lose data after an fdatasync, which is the typical implementation for a cache flush on the virtual drive. 
Signed-off-by: Christoph Hellwig Acked-by: Eric Sandeen Signed-off-by: "Theodore Ts'o" commit 63762a2387882e4ff8ae77133b331abef8e6f555 Author: Theodore Ts'o Date: Mon Aug 31 17:00:59 2009 -0400 ext4: Restore wbc->range_start in ext4_da_writepages() (cherry picked from commit de89de6e0cf4b1eb13f27137cf2aa40d287aabdf) To solve a lock inversion problem, we implement part of the range_cyclic algorithm in ext4_da_writepages(). (See commit 2acf2c26 for more details.) As part of that change wbc->range_start was modified by ext4's writepages function, which causes its callers to get confused since they aren't expecting the filesystem to modify it. The simplest fix is to save and restore wbc->range_start in ext4_da_writepages. Signed-off-by: "Theodore Ts'o" commit e4670b394dd5cdfc5fda17e3ec94c444dcaecac2 Author: Theodore Ts'o Date: Sat Aug 29 21:08:08 2009 -0400 ext4: Limit number of links that can be created by ext4_link() (cherry picked from commit b05ab1dc3795e6f997fb0d34f38fce5012533c3e) In ext4_link we need to check using EXT4_LINK_MAX, and not EXT4_DIR_LINK_MAX(), since ext4_link() is creating hard links of regular files, and not directories. Signed-off-by: "Theodore Ts'o" commit 929c7112197d58053117b3303f3bcf160bcdca45 Author: Aneesh Kumar K.V Date: Fri Aug 28 21:43:15 2009 -0400 ext4: Allow rename to create more than EXT4_LINK_MAX subdirectories (cherry picked from commit 2c94eb86c66e1eaaa1e7d8a2120f4fad5e7e7736) Use EXT4_DIR_LINK_MAX so that rename() can move a directory into new parent directory without running into the EXT4_LINK_MAX limit. Signed-off-by: Aneesh Kumar K.V Signed-off-by: "Theodore Ts'o" commit 100acdbadc9707c6ddee16c3288759bcbe317ac1 Author: Aneesh Kumar K.V Date: Tue Aug 25 22:36:05 2009 -0400 ext4: Add missing unlock_new_inode() call in extent migration code (cherry picked from commit a8526e84ac758ac6da45cf273aa1538a6a7aa3de) We need to unlock the new inode before iput. 
This patch fixes the following warning when calling chattr +e to migrate a file to use extents. It also fixes problems when e4defrag attempts to defragment an inode. [ 470.400044] ------------[ cut here ]------------ [ 470.400065] WARNING: at fs/inode.c:1210 generic_delete_inode+0x65/0x16a() [ 470.400072] Hardware name: N/A ..... ... [ 470.400353] Pid: 4451, comm: chattr Not tainted 2.6.31-rc7-red-debug #4 [ 470.400359] Call Trace: [ 470.400372] [] warn_slowpath_common+0x77/0x8f [ 470.400385] [] warn_slowpath_null+0xf/0x11 [ 470.400395] [] generic_delete_inode+0x65/0x16a [ 470.400405] [] generic_drop_inode+0x17/0x1bd [ 470.400413] [] iput+0x61/0x65 [ 470.400455] [] ext4_ext_migrate+0x5eb/0x66a [ext4] [ 470.400492] [] ext4_ioctl+0x340/0x756 [ext4] [ 470.400507] [] vfs_ioctl+0x1d/0x82 [ 470.400517] [] do_vfs_ioctl+0x483/0x4c9 [ 470.400527] [] ? trace_hardirqs_on+0xd/0xf [ 470.400537] [] sys_ioctl+0x51/0x74 [ 470.400549] [] system_call_fastpath+0x16/0x1b [ 470.400557] ---[ end trace ab85723542352dac ]--- Signed-off-by: Aneesh Kumar K.V Signed-off-by: "Theodore Ts'o" commit e139f5b68b7aea8959a7e0fcd1f08a70c19846ba Author: Eric Sandeen Date: Tue Aug 18 00:20:23 2009 -0400 ext4: Add feature set check helper for mount & remount paths (cherry picked from commit a13fb1a4533f26c1e2b0204d5283b696689645af) A user reported that although his root ext4 filesystem was mounting fine, other filesystems would not mount, with the: "Filesystem with huge files cannot be mounted RDWR without CONFIG_LBDAF" error on his 32-bit box built without CONFIG_LBDAF. This is because the test at mount time for this situation was not being re-checked on remount, and the normal boot process makes an ro->rw transition, so this was being missed. Refactor to make a common helper function to test the filesystem features against the type of mount request (RO vs. RW) so that we stay consistent.
Addresses Red-Hat-Bugzilla: #517650 Signed-off-by: Eric Sandeen Signed-off-by: "Theodore Ts'o" commit 74ba8fe3abdfcd3f58e01ebb15b1fe9d2db86053 Author: Eric Sandeen Date: Mon Aug 17 23:48:51 2009 -0400 ext4: reject too-large filesystems on 32-bit kernels (cherry picked from commit bf43d84b185e2ff54598f8c58a5a8e63148b6e90) ext4 will happily mount a > 16T filesystem on a 32-bit box, but this is not safe; writes to the block device will wrap past 16T and the page cache can't index past 16T (2^32 index * 4k pages). Adding another test to the existing "too many sectors" test should do the trick. Add a comment, a relevant return value, and fix the reference to the CONFIG_LBD(AF) option as well. Signed-off-by: Eric Sandeen Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit e2177295902d4849d61ad677e51e2d84eae617e5 Author: Jan Kara Date: Mon Aug 17 22:17:20 2009 -0400 ext4: Fix possible deadlock between ext4_truncate() and ext4_get_blocks() During truncate we are sometimes forced to start a new transaction as the amount of blocks to be journaled is both quite large and hard to predict. So far we restarted a transaction while holding i_data_sem and that violates lock ordering because i_data_sem ranks below a transaction start (and it can lead to a real deadlock with ext4_get_blocks() mapping blocks in some page while having a transaction open). (cherry picked from commit 487caeef9fc08c0565e082c40a8aaf58dad92bbb) We fix the problem by dropping the i_data_sem before restarting the transaction and acquire it afterwards. It's slightly subtle that this works: 1) By the time ext4_truncate() is called, all the page cache for the truncated part of the file is dropped so get_block() should not be called on it (we only have to invalidate extent cache after we reacquire i_data_sem because some extent from not-truncated part could extend also into the part we are going to truncate).
2) Writes, migrate or defrag hold i_mutex so they are stopped for all the time of the truncate. This bug has been found and analyzed by Theodore Tso. Signed-off-by: Jan Kara Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 18623c3d1d26a30943cc1a97a99f385884f1ddca Author: Jan Kara Date: Mon Aug 17 21:23:17 2009 -0400 jbd2: Annotate transaction start also for jbd2_journal_restart() (cherry picked from commit 9599b0e597d810be9b8f759ea6e9619c4f983c5e) lockdep annotation for a transaction start has been at the end of jbd2_journal_start(). But a transaction is also started from jbd2_journal_restart(). Move the lockdep annotation to start_this_handle() which covers both cases. Signed-off-by: Jan Kara Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 225ff23eb45e5ad1229c27b00b2a0729cc3d8b85 Author: Theodore Ts'o Date: Fri Sep 18 13:34:02 2009 -0400 ext4: Avoid group preallocation for closed files (cherry picked from commit 50797481a7bdee548589506d7d7b48b08bc14dcd) Currently the group preallocation code tries to find a large (512) free block from which to do per-cpu group allocation for small files. The problem with this scheme is that it leaves the filesystem horribly fragmented. In the worst case, if the filesystem is unmounted and remounted (after a system shutdown, for example) we forget the fact that we were using a particular (now-partially filled) 512 block extent. So the next time we try to allocate space for a small file, we will find *another* completely free 512 block chunk to allocate small files. Given that there are 32,768 blocks in a block group, after 64 iterations of "mount, write one 4k file in a directory, unmount", the block group will have 64 files, each separated by 511 blocks, and the block group will no longer have any completely free 512-block chunks for group preallocation space.
So if we try to allocate blocks for a file that has been closed, such that we know the final size of the file, and the filesystem is not busy, avoid using group preallocation. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 5a69bbda790cdc3deb8425c6f726d22005e99b93 Author: Theodore Ts'o Date: Sun Aug 9 22:01:13 2009 -0400 ext4: Fix bugs in mballoc's stream allocation mode (cherry picked from commit 4ba74d00a20256e22f159cb288ff34b587608917) The logic around sbi->s_mb_last_group and sbi->s_mb_last_start was all screwed up. These fields were getting set unconditionally, even when stream allocation had not taken place, and they were being used even when the file was smaller than s_mb_stream_request, which is when the allocation should _not_ be doing stream allocation. Fix this by determining whether or not stream allocation should take place once, in ext4_mb_group_or_file(), and setting a flag which gets used in ext4_mb_regular_allocator() and ext4_mb_use_best_found(). This simplifies the code and assures that we are consistently using (or not using) the stream allocation logic. Signed-off-by: "Theodore Ts'o" Signed-off-by: Greg Kroah-Hartman commit 8be78bc620adf972f6d33bc6202ba1b85fdd4db5 Author: Peng Tao Date: Mon Aug 10 23:05:28 2009 -0400 ext4: fix journal ref count in move_extent_par_page (cherry picked from commit 91cc219ad963731191247c5f2db4118be2bc341a) move_extent_par_page calls a_ops->write_begin() to increase journal handler's reference count. However, if either mext_replace_branches() or ext4_get_block fails, the increased reference count isn't decreased. This will cause a later attempt to umount the fs to hang forever. The patch addresses the issue by calling ext4_journal_stop() if page is not NULL (which means a_ops->write_end() isn't invoked). Signed-off-by: Peng Tao Signed-off-by: "Theodore Ts'o"
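The reference count leak described in the last commit above follows a common pattern, sketched here as a toy model. This is not the ext4 code: struct toy_handle, toy_start(), toy_stop(), and toy_move_page() are hypothetical names, and the 'fail_midway' flag merely stands in for a failing step between the begin and end calls.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model, not kernel code: all names here are hypothetical. */
struct toy_handle {
    int refs;   /* outstanding references on the journal handle */
};

static void toy_start(struct toy_handle *h) { h->refs++; }
static void toy_stop(struct toy_handle *h)  { h->refs--; }

/* Returns 0 on success, -1 if the step in the middle fails. */
static int toy_move_page(struct toy_handle *h, int fail_midway)
{
    int err = 0;
    int end_called = 0;

    toy_start(h);                 /* models write_begin() taking a reference */
    if (fail_midway) {
        err = -1;
        goto out;
    }
    toy_stop(h);                  /* models write_end() dropping it */
    end_called = 1;
out:
    if (!end_called)
        toy_stop(h);              /* error path: drop the reference ourselves */
    return err;
}
```

In this model, removing the cleanup under the out label leaves refs at 1 after a failed call, which corresponds to the unmount hang described in the message.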