jbd: Fix jbd_commit_transaction()/journal_try_to_drop_buffers() race From: Mingming Cao There are a few cases direct IO could race with kjournal flushing data buffers which could result direct IO return EIO error. 1) journal_submit_data_buffers() tries to get bh_state lock. If try lock fails, it drops the j_list_lock and sleeps for bh_state lock, while holding a reference on the buffer. In the meanwhile, journal_try_to_free_buffers() can clean up the journal head could call try_to_free_buffers(). try_to_free_buffers() would fail due to the reference held by journal_submit_data_buffers() - which in turn causes failues for DIO (invalidate_inode_pages2()). 2) When the buffer is on t_locked_list waiting for IO to finish, we hold a reference and give up the cpu, if we can't get bh_state lock. This causes try_to_free_buffers() to fail. 3) when journal_submit_data_buffers() saw the buffer is dirty but failed to lock the buffer bh1, journal_submit_data_buffers() released the j_list_lock and submit other buffers collected from previous check, with the reference to bh1 still hold. During this time journal_try_to_free_buffers() could clean up the journal head of bh1 and remove it from the t_syncdata_list. Then try_to_free_buffers() would fail because the reference held by journal_submit_data_buffers() 4) journal_submit_data_buffers() already removed the jh from the bh (when found the buffers are already being synced), but still keep a reference to the buffer head. journal_try_to_free_buffers() could be called. In that case try_to_free_buffers() will be called since there is no jh related to this buffer, and failed due to journal_submit_data_buffers() hasn't finish the cleanup business yet. Fix for first three races is to drop the reference on the buffer head when release the j_list_lock, give up the cpu and re-try the whole operation. This patch also fixes the race that data buffers has been flushed to disk and journal head is cleard by journal_submit_data_buffers() but did not get a chance to release buffer head reference before the journal_try_to_free_buffers() kicked in. Signed-off-by: Badari Pulavarty Signed-off-by: Mingming Cao --- fs/jbd/commit.c | 24 ++++++++++++++++-------- fs/jbd/transaction.c | 13 +++++++++++++ 2 files changed, 29 insertions(+), 8 deletions(-) Index: linux-2.6.26-rc2/fs/jbd/commit.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/commit.c 2008-05-16 11:03:20.000000000 -0700 +++ linux-2.6.26-rc2/fs/jbd/commit.c 2008-05-16 11:03:26.000000000 -0700 @@ -79,12 +79,16 @@ nope: /* * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is - * held. For ranking reasons we must trylock. If we lose, schedule away and + * held. For ranking reasons we must trylock. If we lose, unlock the buffer + * if needed, drop the reference on the buffer, schedule away and * return 0. j_list_lock is dropped in this case. */ -static int inverted_lock(journal_t *journal, struct buffer_head *bh) +static int inverted_lock(journal_t *journal, struct buffer_head *bh, int locked) { if (!jbd_trylock_bh_state(bh)) { + if (locked) + unlock_buffer(bh); + put_bh(bh); spin_unlock(&journal->j_list_lock); schedule(); return 0; @@ -209,19 +213,24 @@ write_out_data: if (buffer_dirty(bh)) { if (test_set_buffer_locked(bh)) { BUFFER_TRACE(bh, "needs blocking lock"); + put_bh(bh); spin_unlock(&journal->j_list_lock); /* Write out all data to prevent deadlocks */ journal_do_submit_data(wbuf, bufs); bufs = 0; - lock_buffer(bh); spin_lock(&journal->j_list_lock); + continue; } locked = 1; } - /* We have to get bh_state lock. Again out of order, sigh. */ - if (!inverted_lock(journal, bh)) { - jbd_lock_bh_state(bh); + /* + * We have to get bh_state lock. If the try lock fails, + * release the ref on the buffer, give up cpu and retry the + * whole operation. + */ + if (!inverted_lock(journal, bh, locked)) { spin_lock(&journal->j_list_lock); + continue; } /* Someone already cleaned up the buffer? */ if (!buffer_jbd(bh) @@ -430,8 +439,7 @@ void journal_commit_transaction(journal_ err = -EIO; spin_lock(&journal->j_list_lock); } - if (!inverted_lock(journal, bh)) { - put_bh(bh); + if (!inverted_lock(journal, bh, 0)) { spin_lock(&journal->j_list_lock); continue; } Index: linux-2.6.26-rc2/fs/jbd/transaction.c =================================================================== --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c 2008-05-16 11:03:20.000000000 -0700 +++ linux-2.6.26-rc2/fs/jbd/transaction.c 2008-05-16 11:03:26.000000000 -0700 @@ -1714,6 +1714,19 @@ int journal_try_to_free_buffers(journal_ goto busy; } while ((bh = bh->b_this_page) != head); ret = try_to_free_buffers(page); + if (ret == 0) { + /* + * it is possible that journal_submit_data_buffers() + * still holds the bh ref even if clears the jh + * after journal_remove_journal_head, + * which leads to try_to_free_buffers() failed + * let's wait for journal_submit_data_buffers() + * to finishing remove the bh from the sync_data_list + */ + spin_lock(&journal->j_list_lock); + ret = try_to_free_buffers(page); + spin_unlock(&journal->j_list_lock); + } busy: return ret; }