GIT ed7ee311520b670d4d5c7f31234fc8e2c983a317 git+ssh://master.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2.git#ALL commit 6113214220a65281f17d33493b465676f8bcb43a Author: Mark Fasheh Date: Mon Dec 3 16:43:01 2007 -0800 ocfs2: Re-journal buffers after transaction extend ocfs2_extend_trans() might call journal_restart() which will commit dirty buffers and then restart the transaction. This means that any buffers which still need changes should be passed to journal_access() again. Some paths during extend weren't doing this right. Signed-off-by: Mark Fasheh commit e4954b75945830d03f64c2729ca8039d17758fab Author: Mark Fasheh Date: Mon Dec 3 16:42:19 2007 -0800 ocfs2: Allow for debugging of transaction extends The nastiest cases of transaction extends are also the rarest. We can expose them more quickly at the expense of performance by going straight to the journal_restart() in ocfs2_extend_trans(). Wrap things in OCFS2_DEBUG_FS so that we only do this when "expensive debugging" is turned on. Signed-off-by: Mark Fasheh commit d2bd5b24f83f0c70f9d7d741eb5f495655edd972 Author: Mark Fasheh Date: Mon Dec 3 15:02:10 2007 -0800 ocfs2: Don't panic when truncating an empty extent This BUG_ON() was unintentionally left in after the sparse file support was written. Signed-off-by: Mark Fasheh commit 930c6bf32dd0ffe7a1da9818293e4966e2a2c603 Author: Mark Fasheh Date: Tue Oct 30 12:09:03 2007 -0700 ocfs2: Documentation update Remove 'readpages' from the list in ocfs2.txt. Instead of having two identical lists, I just removed the list in the OCFS2 section of fs/Kconfig and added a pointer to Documentation/filesystems/ocfs2.txt. Signed-off-by: Mark Fasheh commit 5c7821d448ce942a0ffee6ca30dc156b14d5a734 Author: Mark Fasheh Date: Tue Oct 30 12:08:32 2007 -0700 ocfs2: Readpages support Add ->readpages support to Ocfs2. This is rather trivial - all it required is a small update to ocfs2_get_block (for mapping full extents via b_size) and an ocfs2_readpages() function which partially mirrors ocfs2_readpage(). Signed-off-by: Mark Fasheh commit 132aea23390ab056fcf19b4dd33575c93b9054ac Author: Joel Becker Date: Fri Oct 5 14:31:44 2007 -0700 dlm: Split lock mode and flag constants into a sharable header. This allows others to use the DLM constants without being tied to the function API of fs/dlm. Signed-off-by: Joel Becker Signed-off-by: Steven Whitehouse Signed-off-by: David Teigland Signed-off-by: Mark Fasheh commit 105a81541f301ead6de6fd4d55d8b4fd0197d1c6 Author: Mark Fasheh Date: Thu Oct 18 15:30:42 2007 -0700 ocfs2: Rename ocfs2_meta_[un]lock Call this the "inode_lock" now, since it covers both data and meta data. This patch makes no functional changes. Signed-off-by: Mark Fasheh commit cc3fd8182faa2989f0a506b04ce2d16a48dbdb89 Author: Mark Fasheh Date: Thu Oct 18 15:23:46 2007 -0700 ocfs2: Remove data locks The meta lock now covers both meta data and data, so this just removes the now-redundant data lock. Combining locks saves us a round of lock mastery per inode and one less lock to ping between nodes during read/write. We don't lose much - since meta locks were always held before a data lock (and at the same level) ordered writeout mode (the default) ensured that flushing for the meta data lock also pushed out data anyways. Signed-off-by: Mark Fasheh commit 08fd5668a869d1c990956257bf0ae2eb2edbd4b9 Author: Mark Fasheh Date: Thu Oct 18 15:13:59 2007 -0700 ocfs2: Add data downconvert worker to inode lock In order to extend inode lock coverage to inode data, we use the same data downconvert worker with only a small modification to only do work for regular files. Signed-off-by: Mark Fasheh commit 37f7d9dd4698c31c5090d22053ed0e70d25ba1f3 Author: Mark Fasheh Date: Mon Sep 24 15:56:19 2007 -0700 ocfs2: Remove mount/unmount votes The node maps that are set/unset by these votes are no longer relevant, thus we can remove the mount and umount votes. Since those are the last two remaining votes, we can also remove the entire vote infrastructure. The vote thread has been renamed to the downconvert thread, and the small amount of functionality related to managing it has been moved into fs/ocfs2/dlmglue.c. All references to votes have been removed or updated. Signed-off-by: Mark Fasheh commit aacb23f5361876e33d903ce0631bf2331d8168a2 Author: Mark Fasheh Date: Mon Sep 24 15:09:41 2007 -0700 ocfs2: Remove fs dependency on ocfs2_heartbeat module Now that the dlm exposes domain information to us, we don't need generic node up / node down callbacks. And since the DLM is only telling us when a node goes down unexpectedly, we no longer need to optimize away node down callbacks via the umount map. Signed-off-by: Mark Fasheh commit 3901ded2c0d85715d9bd1c43198c29ac683d1858 Author: Mark Fasheh Date: Fri Sep 7 11:11:10 2007 -0700 ocfs2_dlm: Call node eviction callbacks from heartbeat handler With this, a dlm client can take advantage of the group protocol in the dlm to get full notification whenever a node within the dlm domain leaves unexpectedly. Signed-off-by: Mark Fasheh commit dfeb60ff43c628dffac61a6378a51d0b923d1e84 Author: Mark Fasheh Date: Mon Dec 3 14:06:23 2007 -0800 ocfs2: fix exit-while-locked bug in ocfs2_queue_orphans() We're holding the cluster lock when a failure might happen in ocfs2_dir_foreach() so it needs to be released. Signed-off-by: Mark Fasheh commit 33518067b4a5913575f554a993d4548847691fd8 Author: Tao Ma Date: Tue Nov 27 16:20:19 2007 +0800 ocfs2: Initalize bitmap_cpg of ocfs2_super to be the maximum. This value is initialized from global_bitmap->id2.i_chain.cl_cpg. If there is only 1 group, it will be equal to the total clusters in the volume. So as for online resize, it should change for all the nodes in the cluster. It isn't easy and there is no corresponding lock for it. bitmap_cpg is only used in 2 areas: 1. Check whether the suballoc is too large for us to allocate from the global bitmap, so it is little used. And now the suballoc size is 2048, it rarely meet this situation and the check is almost useless. 2. Calculate which group a cluster belongs to. We use it during truncate to figure out which cluster group an extent belongs too. But we should be OK if we increase it though as the cluster group calculated shouldn't change and we only ever have a small bitmap_cpg on file systems with a single cluster group. Signed-off-by: Tao Ma Signed-off-by: Mark Fasheh commit 154c56d0528bc6db1504429cda77d93d0221f555 Author: Mark Fasheh Date: Wed Nov 7 14:40:36 2007 -0800 ocfs2: Support commit= mount option Mostly taken from ext3. This allows the user to set the jbd commit interval, in seconds. The default of 5 seconds stays the same, but now users can easily increase the commit interval. Typically, this would be increased in order to benefit performance at the expense of data-safety. Signed-off-by: Mark Fasheh commit a4372b59334d2e7d0ef89eafb8af8d0eedfcf984 Author: Sunil Mushran Date: Tue Nov 6 16:10:23 2007 -0800 ocfs2: Update default cluster timeouts Lots of people are having trouble with the default timeouts, which are too low. These new values are derived from an informal survey taken on ocfs2-users, as well as data from bug reports. This should reduce the amount of cluster disconnects and subsequent fencing seen during normal workloads. Signed-off-by: Sunil Mushran Signed-off-by: Mark Fasheh commit 05b719633c4ea75cd0d14b94f47ece80ea403110 Author: Mark Fasheh Date: Tue Nov 6 15:52:58 2007 -0800 ocfs2: bump version number Bump the printed version to 1.5.0. This helps us quickly identify which version of Ocfs2 a bug filer is running. Signed-off-by: Mark Fasheh commit 908e37ab48962f5f8f2ca5291e606c40c866b73c Author: Mark Fasheh Date: Tue Oct 16 18:38:24 2007 -0700 ocfs2: Deferred cleanup of orphaned inodes Unlink in Ocfs2 is an expensive operation, mostly due to locking overhead. To improve performance, we defer final iput of orphaned inodes onto the ocfs2_wq thread, thus allowing sys_unlink to return without incurring all the locking overhead of ocfs2_delete_inode(). This also has the side-effect that the unlinking node will have a better chance of being the one to actually remove the inode from it's orphan dir, thus reducing lock conention. The queue is globally rate-limited so as to prevent any process from pinning an unbounded number of inodes in memory. Signed-off-by: Mark Fasheh Documentation/filesystems/ocfs2.txt | 12 +- fs/Kconfig | 10 +- fs/ocfs2/Makefile | 3 +- fs/ocfs2/alloc.c | 76 +++-- fs/ocfs2/aops.c | 130 ++++--- fs/ocfs2/cluster/heartbeat.h | 2 +- fs/ocfs2/cluster/tcp.h | 4 +- fs/ocfs2/cluster/tcp_internal.h | 8 +- fs/ocfs2/cluster/ver.c | 2 +- fs/ocfs2/dcache.c | 8 +- fs/ocfs2/dir.c | 8 +- fs/ocfs2/dlm/dlmfsver.c | 2 +- fs/ocfs2/dlm/dlmrecovery.c | 7 + fs/ocfs2/dlm/dlmver.c | 2 +- fs/ocfs2/dlmglue.c | 323 ++++++++-------- fs/ocfs2/dlmglue.h | 26 +- fs/ocfs2/export.c | 4 +- fs/ocfs2/file.c | 97 ++--- fs/ocfs2/heartbeat.c | 80 ---- fs/ocfs2/heartbeat.h | 2 - fs/ocfs2/inode.c | 72 ++-- fs/ocfs2/inode.h | 5 +- fs/ocfs2/ioctl.c | 8 +- fs/ocfs2/journal.c | 62 ++-- fs/ocfs2/localalloc.c | 8 +- fs/ocfs2/mmap.c | 17 +- fs/ocfs2/namei.c | 179 +++++++-- fs/ocfs2/namei.h | 3 + fs/ocfs2/ocfs2.h | 30 +- fs/ocfs2/slot_map.c | 19 - fs/ocfs2/slot_map.h | 2 - fs/ocfs2/suballoc.c | 4 +- fs/ocfs2/super.c | 106 ++---- fs/ocfs2/ver.c | 2 +- fs/ocfs2/vote.c | 756 ----------------------------------- fs/ocfs2/vote.h | 48 --- include/linux/Kbuild | 1 + include/linux/dlm.h | 140 +------- include/linux/dlmconstants.h | 159 ++++++++ 39 files changed, 821 insertions(+), 1606 deletions(-) diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt index ed55238..b63bd2d 100644 --- a/Documentation/filesystems/ocfs2.txt +++ b/Documentation/filesystems/ocfs2.txt @@ -35,7 +35,6 @@ Features which OCFS2 does not support yet: - Directory change notification (F_NOTIFY) - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) - POSIX ACLs - - readpages / writepages (not user visible) Mount options ============= @@ -62,3 +61,14 @@ data=writeback Data ordering is not preserved, data may be written preferred_slot=0(*) During mount, try to use this filesystem slot first. If it is in use by another node, the first empty one found will be chosen. Invalid values will be ignored. +commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. diff --git a/fs/Kconfig b/fs/Kconfig index 635f3e2..8729bff 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -440,14 +440,8 @@ config OCFS2_FS Tools web page: http://oss.oracle.com/projects/ocfs2-tools OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ - Note: Features which OCFS2 does not support yet: - - extended attributes - - quotas - - cluster aware flock - - Directory change notification (F_NOTIFY) - - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) - - POSIX ACLs - - readpages / writepages (not user visible) + For more information on OCFS2, see the file + . config OCFS2_DEBUG_MASKLOG bool "OCFS2 logging support" diff --git a/fs/ocfs2/Makefile b/fs/ocfs2/Makefile index 9fb8132..d2057e7 100644 --- a/fs/ocfs2/Makefile +++ b/fs/ocfs2/Makefile @@ -27,8 +27,7 @@ ocfs2-objs := \ symlink.o \ sysfile.o \ uptodate.o \ - ver.o \ - vote.o + ver.o obj-$(CONFIG_OCFS2_FS) += cluster/ obj-$(CONFIG_OCFS2_FS) += dlm/ diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index ce62c15..e6df06a 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -2389,6 +2389,18 @@ static int __ocfs2_rotate_tree_left(struct inode *inode, goto out; } + /* + * Caller might still want to make changes to the + * tree root, so re-add it to the journal here. + */ + ret = ocfs2_journal_access(handle, inode, + path_root_bh(left_path), + OCFS2_JOURNAL_ACCESS_WRITE); + if (ret) { + mlog_errno(ret); + goto out; + } + ret = ocfs2_rotate_subtree_left(inode, handle, left_path, right_path, subtree_root, dealloc, &deleted); @@ -3289,16 +3301,6 @@ static int ocfs2_insert_path(struct inode *inode, int ret, subtree_index; struct buffer_head *leaf_bh = path_leaf_bh(right_path); - /* - * Pass both paths to the journal. The majority of inserts - * will be touching all components anyway. - */ - ret = ocfs2_journal_access_path(inode, handle, right_path); - if (ret < 0) { - mlog_errno(ret); - goto out; - } - if (left_path) { int credits = handle->h_buffer_credits; @@ -3323,6 +3325,16 @@ static int ocfs2_insert_path(struct inode *inode, } } + /* + * Pass both paths to the journal. The majority of inserts + * will be touching all components anyway. + */ + ret = ocfs2_journal_access_path(inode, handle, right_path); + if (ret < 0) { + mlog_errno(ret); + goto out; + } + if (insert->ins_split != SPLIT_NONE) { /* * We could call ocfs2_insert_at_leaf() for some types @@ -3331,6 +3343,17 @@ static int ocfs2_insert_path(struct inode *inode, */ ocfs2_split_record(inode, left_path, right_path, insert_rec, insert->ins_split); + + /* + * Split might have modified either leaf and we don't + * have a guarantee that the later edge insert will + * dirty this for us. + */ + if (left_path) + ret = ocfs2_journal_dirty(handle, + path_leaf_bh(left_path)); + if (ret) + mlog_errno(ret); } else ocfs2_insert_at_leaf(insert_rec, path_leaf_el(right_path), insert, inode); @@ -3430,6 +3453,17 @@ static int ocfs2_do_insert_extent(struct inode *inode, mlog_errno(ret); goto out; } + + /* + * ocfs2_rotate_tree_right() might have extended the + * transaction without re-journaling our tree root. + */ + ret = ocfs2_journal_access(handle, inode, di_bh, + OCFS2_JOURNAL_ACCESS_WRITE); + if (ret) { + mlog_errno(ret); + goto out; + } } else if (type->ins_appending == APPEND_TAIL && type->ins_contig != CONTIG_LEFT) { ret = ocfs2_append_rec_to_path(inode, handle, insert_rec, @@ -3941,7 +3975,7 @@ static int __ocfs2_mark_extent_written(struct inode *inode, { int ret = 0; struct ocfs2_extent_list *el = path_leaf_el(path); - struct buffer_head *eb_bh, *last_eb_bh = NULL; + struct buffer_head *last_eb_bh = NULL; struct ocfs2_extent_rec *rec = &el->l_recs[split_index]; struct ocfs2_merge_ctxt ctxt; struct ocfs2_extent_list *rightmost_el; @@ -3960,14 +3994,6 @@ static int __ocfs2_mark_extent_written(struct inode *inode, goto out; } - eb_bh = path_leaf_bh(path); - ret = ocfs2_journal_access(handle, inode, eb_bh, - OCFS2_JOURNAL_ACCESS_WRITE); - if (ret) { - mlog_errno(ret); - goto out; - } - ctxt.c_contig_type = ocfs2_figure_merge_contig_type(inode, el, split_index, split_rec); @@ -4029,8 +4055,6 @@ static int __ocfs2_mark_extent_written(struct inode *inode, mlog_errno(ret); } - ocfs2_journal_dirty(handle, eb_bh); - out: brelse(last_eb_bh); return ret; @@ -4707,7 +4731,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) mutex_lock(&data_alloc_inode->i_mutex); - status = ocfs2_meta_lock(data_alloc_inode, &data_alloc_bh, 1); + status = ocfs2_inode_lock(data_alloc_inode, &data_alloc_bh, 1); if (status < 0) { mlog_errno(status); goto out_mutex; @@ -4729,7 +4753,7 @@ int __ocfs2_flush_truncate_log(struct ocfs2_super *osb) out_unlock: brelse(data_alloc_bh); - ocfs2_meta_unlock(data_alloc_inode, 1); + ocfs2_inode_unlock(data_alloc_inode, 1); out_mutex: mutex_unlock(&data_alloc_inode->i_mutex); @@ -5053,7 +5077,7 @@ static int ocfs2_free_cached_items(struct ocfs2_super *osb, mutex_lock(&inode->i_mutex); - ret = ocfs2_meta_lock(inode, &di_bh, 1); + ret = ocfs2_inode_lock(inode, &di_bh, 1); if (ret) { mlog_errno(ret); goto out_mutex; @@ -5094,7 +5118,7 @@ out_journal: ocfs2_commit_trans(osb, handle); out_unlock: - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); brelse(di_bh); out_mutex: mutex_unlock(&inode->i_mutex); @@ -6093,8 +6117,6 @@ start: mlog(0, "clusters_to_del = %u in this pass, tail blk=%llu\n", clusters_to_del, (unsigned long long)path_leaf_bh(path)->b_blocknr); - BUG_ON(clusters_to_del == 0); - mutex_lock(&tl_inode->i_mutex); tl_sem = 1; /* ocfs2_truncate_log_needs_flush guarantees us at least one diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c index 56f7790..2a0eaac 100644 --- a/fs/ocfs2/aops.c +++ b/fs/ocfs2/aops.c @@ -26,6 +26,7 @@ #include #include #include +#include #define MLOG_MASK_PREFIX ML_FILE_IO #include @@ -139,7 +140,8 @@ static int ocfs2_get_block(struct inode *inode, sector_t iblock, { int err = 0; unsigned int ext_flags; - u64 p_blkno, past_eof; + u64 max_blocks = bh_result->b_size >> inode->i_blkbits; + u64 p_blkno, count, past_eof; struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); mlog_entry("(0x%p, %llu, 0x%p, %d)\n", inode, @@ -155,7 +157,7 @@ static int ocfs2_get_block(struct inode *inode, sector_t iblock, goto bail; } - err = ocfs2_extent_map_get_blocks(inode, iblock, &p_blkno, NULL, + err = ocfs2_extent_map_get_blocks(inode, iblock, &p_blkno, &count, &ext_flags); if (err) { mlog(ML_ERROR, "Error %d from get_blocks(0x%p, %llu, 1, " @@ -164,6 +166,9 @@ static int ocfs2_get_block(struct inode *inode, sector_t iblock, goto bail; } + if (max_blocks < count) + count = max_blocks; + /* * ocfs2 never allocates in this function - the only time we * need to use BH_New is when we're extending i_size on a file @@ -178,6 +183,8 @@ static int ocfs2_get_block(struct inode *inode, sector_t iblock, if (p_blkno && !(ext_flags & OCFS2_EXT_UNWRITTEN)) map_bh(bh_result, inode->i_sb, p_blkno); + bh_result->b_size = count << inode->i_blkbits; + if (!ocfs2_sparse_alloc(osb)) { if (p_blkno == 0) { err = -EIO; @@ -275,7 +282,7 @@ static int ocfs2_readpage(struct file *file, struct page *page) mlog_entry("(0x%p, %lu)\n", file, (page ? page->index : 0)); - ret = ocfs2_meta_lock_with_page(inode, NULL, 0, page); + ret = ocfs2_inode_lock_with_page(inode, NULL, 0, page); if (ret != 0) { if (ret == AOP_TRUNCATED_PAGE) unlock = 0; @@ -285,7 +292,7 @@ static int ocfs2_readpage(struct file *file, struct page *page) if (down_read_trylock(&oi->ip_alloc_sem) == 0) { ret = AOP_TRUNCATED_PAGE; - goto out_meta_unlock; + goto out_inode_unlock; } /* @@ -305,25 +312,16 @@ static int ocfs2_readpage(struct file *file, struct page *page) goto out_alloc; } - ret = ocfs2_data_lock_with_page(inode, 0, page); - if (ret != 0) { - if (ret == AOP_TRUNCATED_PAGE) - unlock = 0; - mlog_errno(ret); - goto out_alloc; - } - if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) ret = ocfs2_readpage_inline(inode, page); else ret = block_read_full_page(page, ocfs2_get_block); unlock = 0; - ocfs2_data_unlock(inode, 0); out_alloc: up_read(&OCFS2_I(inode)->ip_alloc_sem); -out_meta_unlock: - ocfs2_meta_unlock(inode, 0); +out_inode_unlock: + ocfs2_inode_unlock(inode, 0); out: if (unlock) unlock_page(page); @@ -331,6 +329,62 @@ out: return ret; } +/* + * This is used only for read-ahead. Failures or difficult to handle + * siutations are safe to ignore. + * + * Right now, we don't bother with BH_Boundary - in-inode extent lists + * are quite large (243 extents on 4k blocks), so most inodes don't + * grow out to a tree. If need be, detecting boundary extents could + * trivially be added in a future version of ocfs2_get_block(). + */ +static int ocfs2_readpages(struct file *filp, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages) +{ + int ret, err = -EIO; + struct inode *inode = mapping->host; + struct ocfs2_inode_info *oi = OCFS2_I(inode); + loff_t start; + struct page *last; + + /* + * Use the nonblocking flag for the dlm code to avoid page + * lock inversion, but don't bother with retrying. + */ + ret = ocfs2_inode_lock_full(inode, NULL, 0, OCFS2_LOCK_NONBLOCK); + if (ret) + return err; + + if (down_read_trylock(&oi->ip_alloc_sem) == 0) { + ocfs2_inode_unlock(inode, 0); + return err; + } + + /* + * Don't bother with inline-data. There isn't anything + * to read-ahead in that case anyway... + */ + if (oi->ip_dyn_features & OCFS2_INLINE_DATA_FL) + goto out_unlock; + + /* + * Check whether a remote node truncated this file - we just + * drop out in that case as it's not worth handling here. + */ + last = list_entry(pages->prev, struct page, lru); + start = (loff_t)last->index << PAGE_CACHE_SHIFT; + if (start >= i_size_read(inode)) + goto out_unlock; + + err = mpage_readpages(mapping, pages, nr_pages, ocfs2_get_block); + +out_unlock: + up_read(&oi->ip_alloc_sem); + ocfs2_inode_unlock(inode, 0); + + return err; +} + /* Note: Because we don't support holes, our allocation has * already happened (allocation writes zeros to the file data) * so we don't have to worry about ordered writes in @@ -452,7 +506,7 @@ static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block) * accessed concurrently from multiple nodes. */ if (!INODE_JOURNAL(inode)) { - err = ocfs2_meta_lock(inode, NULL, 0); + err = ocfs2_inode_lock(inode, NULL, 0); if (err) { if (err != -ENOENT) mlog_errno(err); @@ -467,7 +521,7 @@ static sector_t ocfs2_bmap(struct address_space *mapping, sector_t block) if (!INODE_JOURNAL(inode)) { up_read(&OCFS2_I(inode)->ip_alloc_sem); - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); } if (err) { @@ -638,34 +692,12 @@ static ssize_t ocfs2_direct_IO(int rw, if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) return 0; - if (!ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb))) { - /* - * We get PR data locks even for O_DIRECT. This - * allows concurrent O_DIRECT I/O but doesn't let - * O_DIRECT with extending and buffered zeroing writes - * race. If they did race then the buffered zeroing - * could be written back after the O_DIRECT I/O. It's - * one thing to tell people not to mix buffered and - * O_DIRECT writes, but expecting them to understand - * that file extension is also an implicit buffered - * write is too much. By getting the PR we force - * writeback of the buffered zeroing before - * proceeding. - */ - ret = ocfs2_data_lock(inode, 0); - if (ret < 0) { - mlog_errno(ret); - goto out; - } - ocfs2_data_unlock(inode, 0); - } - ret = blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_sb->s_bdev, iov, offset, nr_segs, ocfs2_direct_IO_get_blocks, ocfs2_dio_end_io); -out: + mlog_exit(ret); return ret; } @@ -1754,7 +1786,7 @@ static int ocfs2_write_begin(struct file *file, struct address_space *mapping, struct buffer_head *di_bh = NULL; struct inode *inode = mapping->host; - ret = ocfs2_meta_lock(inode, &di_bh, 1); + ret = ocfs2_inode_lock(inode, &di_bh, 1); if (ret) { mlog_errno(ret); return ret; @@ -1769,30 +1801,22 @@ static int ocfs2_write_begin(struct file *file, struct address_space *mapping, */ down_write(&OCFS2_I(inode)->ip_alloc_sem); - ret = ocfs2_data_lock(inode, 1); - if (ret) { - mlog_errno(ret); - goto out_fail; - } - ret = ocfs2_write_begin_nolock(mapping, pos, len, flags, pagep, fsdata, di_bh, NULL); if (ret) { mlog_errno(ret); - goto out_fail_data; + goto out_fail; } brelse(di_bh); return 0; -out_fail_data: - ocfs2_data_unlock(inode, 1); out_fail: up_write(&OCFS2_I(inode)->ip_alloc_sem); brelse(di_bh); - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); return ret; } @@ -1908,15 +1932,15 @@ static int ocfs2_write_end(struct file *file, struct address_space *mapping, ret = ocfs2_write_end_nolock(mapping, pos, len, copied, page, fsdata); - ocfs2_data_unlock(inode, 1); up_write(&OCFS2_I(inode)->ip_alloc_sem); - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); return ret; } const struct address_space_operations ocfs2_aops = { .readpage = ocfs2_readpage, + .readpages = ocfs2_readpages, .writepage = ocfs2_writepage, .write_begin = ocfs2_write_begin, .write_end = ocfs2_write_end, diff --git a/fs/ocfs2/cluster/heartbeat.h b/fs/ocfs2/cluster/heartbeat.h index 35397dd..e511339 100644 --- a/fs/ocfs2/cluster/heartbeat.h +++ b/fs/ocfs2/cluster/heartbeat.h @@ -35,7 +35,7 @@ #define O2HB_LIVE_THRESHOLD 2 /* number of equal samples to be seen as dead */ extern unsigned int o2hb_dead_threshold; -#define O2HB_DEFAULT_DEAD_THRESHOLD 7 +#define O2HB_DEFAULT_DEAD_THRESHOLD 31 /* Otherwise MAX_WRITE_TIMEOUT will be zero... */ #define O2HB_MIN_DEAD_THRESHOLD 2 #define O2HB_MAX_WRITE_TIMEOUT_MS (O2HB_REGION_TIMEOUT_MS * (o2hb_dead_threshold - 1)) diff --git a/fs/ocfs2/cluster/tcp.h b/fs/ocfs2/cluster/tcp.h index da880fc..f36f66a 100644 --- a/fs/ocfs2/cluster/tcp.h +++ b/fs/ocfs2/cluster/tcp.h @@ -60,8 +60,8 @@ typedef void (o2net_post_msg_handler_func)(int status, void *data, /* same as hb delay, we're waiting for another node to recognize our hb */ #define O2NET_RECONNECT_DELAY_MS_DEFAULT 2000 -#define O2NET_KEEPALIVE_DELAY_MS_DEFAULT 5000 -#define O2NET_IDLE_TIMEOUT_MS_DEFAULT 10000 +#define O2NET_KEEPALIVE_DELAY_MS_DEFAULT 2000 +#define O2NET_IDLE_TIMEOUT_MS_DEFAULT 30000 /* TODO: figure this out.... */ diff --git a/fs/ocfs2/cluster/tcp_internal.h b/fs/ocfs2/cluster/tcp_internal.h index 9606111..b2e832a 100644 --- a/fs/ocfs2/cluster/tcp_internal.h +++ b/fs/ocfs2/cluster/tcp_internal.h @@ -38,6 +38,12 @@ * locking semantics of the file system using the protocol. It should * be somewhere else, I'm sure, but right now it isn't. * + * New in version 10: + * - Meta/data locks combined + * + * New in version 9: + * - All votes removed + * * New in version 8: * - Replace delete inode votes with a cluster lock * @@ -60,7 +66,7 @@ * - full 64 bit i_size in the metadata lock lvbs * - introduction of "rw" lock and pushing meta/data locking down */ -#define O2NET_PROTOCOL_VERSION 8ULL +#define O2NET_PROTOCOL_VERSION 10ULL struct o2net_handshake { __be64 protocol_version; __be64 connector_id; diff --git a/fs/ocfs2/cluster/ver.c b/fs/ocfs2/cluster/ver.c index 7286c48..a56eee6 100644 --- a/fs/ocfs2/cluster/ver.c +++ b/fs/ocfs2/cluster/ver.c @@ -28,7 +28,7 @@ #include "ver.h" -#define CLUSTER_BUILD_VERSION "1.3.3" +#define CLUSTER_BUILD_VERSION "1.5.0" #define VERSION_STR "OCFS2 Node Manager " CLUSTER_BUILD_VERSION diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c index 9923278..b1cc7c3 100644 --- a/fs/ocfs2/dcache.c +++ b/fs/ocfs2/dcache.c @@ -128,9 +128,9 @@ static int ocfs2_match_dentry(struct dentry *dentry, /* * Walk the inode alias list, and find a dentry which has a given * parent. ocfs2_dentry_attach_lock() wants to find _any_ alias as it - * is looking for a dentry_lock reference. The vote thread is looking - * to unhash aliases, so we allow it to skip any that already have - * that property. + * is looking for a dentry_lock reference. The downconvert thread is + * looking to unhash aliases, so we allow it to skip any that already + * have that property. */ struct dentry *ocfs2_find_local_alias(struct inode *inode, u64 parent_blkno, @@ -266,7 +266,7 @@ int ocfs2_dentry_attach_lock(struct dentry *dentry, dl->dl_count = 0; /* * Does this have to happen below, for all attaches, in case - * the struct inode gets blown away by votes? + * the struct inode gets blown away by the downconvert thread? */ dl->dl_inode = igrab(inode); dl->dl_parent_blkno = parent_blkno; diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c index 63b28fd..6b0107f 100644 --- a/fs/ocfs2/dir.c +++ b/fs/ocfs2/dir.c @@ -846,14 +846,14 @@ int ocfs2_readdir(struct file * filp, void * dirent, filldir_t filldir) mlog_entry("dirino=%llu\n", (unsigned long long)OCFS2_I(inode)->ip_blkno); - error = ocfs2_meta_lock_atime(inode, filp->f_vfsmnt, &lock_level); + error = ocfs2_inode_lock_atime(inode, filp->f_vfsmnt, &lock_level); if (lock_level && error >= 0) { /* We release EX lock which used to update atime * and get PR lock again to reduce contention * on commonly accessed directories. */ - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); lock_level = 0; - error = ocfs2_meta_lock(inode, NULL, 0); + error = ocfs2_inode_lock(inode, NULL, 0); } if (error < 0) { if (error != -ENOENT) @@ -865,7 +865,7 @@ int ocfs2_readdir(struct file * filp, void * dirent, filldir_t filldir) error = ocfs2_dir_foreach_blk(inode, &filp->f_version, &filp->f_pos, dirent, filldir, NULL); - ocfs2_meta_unlock(inode, lock_level); + ocfs2_inode_unlock(inode, lock_level); bail_nolock: mlog_exit(error); diff --git a/fs/ocfs2/dlm/dlmfsver.c b/fs/ocfs2/dlm/dlmfsver.c index d2be3ad..a733b33 100644 --- a/fs/ocfs2/dlm/dlmfsver.c +++ b/fs/ocfs2/dlm/dlmfsver.c @@ -28,7 +28,7 @@ #include "dlmfsver.h" -#define DLM_BUILD_VERSION "1.3.3" +#define DLM_BUILD_VERSION "1.5.0" #define VERSION_STR "OCFS2 DLMFS " DLM_BUILD_VERSION diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c index 2fde7bf..b10f3e3 100644 --- a/fs/ocfs2/dlm/dlmrecovery.c +++ b/fs/ocfs2/dlm/dlmrecovery.c @@ -2321,6 +2321,13 @@ void dlm_hb_node_down_cb(struct o2nm_node *node, int idx, void *data) if (!dlm_grab(dlm)) return; + /* + * This will notify any dlm users that a node in our domain + * went away without notifying us first. + */ + if (test_bit(idx, dlm->domain_map)) + dlm_fire_domain_eviction_callbacks(dlm, idx); + spin_lock(&dlm->spinlock); __dlm_hb_node_down(dlm, idx); spin_unlock(&dlm->spinlock); diff --git a/fs/ocfs2/dlm/dlmver.c b/fs/ocfs2/dlm/dlmver.c index 7ef2653..dfc0da4 100644 --- a/fs/ocfs2/dlm/dlmver.c +++ b/fs/ocfs2/dlm/dlmver.c @@ -28,7 +28,7 @@ #include "dlmver.h" -#define DLM_BUILD_VERSION "1.3.3" +#define DLM_BUILD_VERSION "1.5.0" #define VERSION_STR "OCFS2 DLM " DLM_BUILD_VERSION diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index 4e97dcc..c365206 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -55,7 +55,6 @@ #include "slot_map.h" #include "super.h" #include "uptodate.h" -#include "vote.h" #include "buffer_head_io.h" @@ -153,10 +152,10 @@ struct ocfs2_lock_res_ops { struct ocfs2_super * (*get_osb)(struct ocfs2_lock_res *); /* - * Optionally called in the downconvert (or "vote") thread - * after a successful downconvert. The lockres will not be - * referenced after this callback is called, so it is safe to - * free memory, etc. + * Optionally called in the downconvert thread after a + * successful downconvert. The lockres will not be referenced + * after this callback is called, so it is safe to free + * memory, etc. * * The exact semantics of when this is called are controlled * by ->downconvert_worker() @@ -225,17 +224,12 @@ static struct ocfs2_lock_res_ops ocfs2_inode_rw_lops = { .flags = 0, }; -static struct ocfs2_lock_res_ops ocfs2_inode_meta_lops = { +static struct ocfs2_lock_res_ops ocfs2_inode_inode_lops = { .get_osb = ocfs2_get_inode_osb, .check_downconvert = ocfs2_check_meta_downconvert, .set_lvb = ocfs2_set_meta_lvb, - .flags = LOCK_TYPE_REQUIRES_REFRESH|LOCK_TYPE_USES_LVB, -}; - -static struct ocfs2_lock_res_ops ocfs2_inode_data_lops = { - .get_osb = ocfs2_get_inode_osb, .downconvert_worker = ocfs2_data_convert_worker, - .flags = 0, + .flags = LOCK_TYPE_REQUIRES_REFRESH|LOCK_TYPE_USES_LVB, }; static struct ocfs2_lock_res_ops ocfs2_super_lops = { @@ -261,7 +255,6 @@ static struct ocfs2_lock_res_ops ocfs2_inode_open_lops = { static inline int ocfs2_is_inode_lock(struct ocfs2_lock_res *lockres) { return lockres->l_type == OCFS2_LOCK_TYPE_META || - lockres->l_type == OCFS2_LOCK_TYPE_DATA || lockres->l_type == OCFS2_LOCK_TYPE_RW || lockres->l_type == OCFS2_LOCK_TYPE_OPEN; } @@ -310,9 +303,10 @@ static inline void ocfs2_recover_from_dlm_error(struct ocfs2_lock_res *lockres, "resource %s: %s\n", dlm_errname(_stat), _func, \ _lockres->l_name, dlm_errmsg(_stat)); \ } while (0) -static void ocfs2_vote_on_unlock(struct ocfs2_super *osb, - struct ocfs2_lock_res *lockres); -static int ocfs2_meta_lock_update(struct inode *inode, +static int ocfs2_downconvert_thread(void *arg); +static void ocfs2_downconvert_on_unlock(struct ocfs2_super *osb, + struct ocfs2_lock_res *lockres); +static int ocfs2_inode_lock_update(struct inode *inode, struct buffer_head **bh); static void ocfs2_drop_osb_locks(struct ocfs2_super *osb); static inline int ocfs2_highest_compat_lock_level(int level); @@ -402,10 +396,7 @@ void ocfs2_inode_lock_res_init(struct ocfs2_lock_res *res, ops = &ocfs2_inode_rw_lops; break; case OCFS2_LOCK_TYPE_META: - ops = &ocfs2_inode_meta_lops; - break; - case OCFS2_LOCK_TYPE_DATA: - ops = &ocfs2_inode_data_lops; + ops = &ocfs2_inode_inode_lops; break; case OCFS2_LOCK_TYPE_OPEN: ops = &ocfs2_inode_open_lops; @@ -732,7 +723,7 @@ static void ocfs2_blocking_ast(void *opaque, int level) wake_up(&lockres->l_event); - ocfs2_kick_vote_thread(osb); + ocfs2_wake_downconvert_thread(osb); } static void ocfs2_locking_ast(void *opaque) @@ -1089,7 +1080,7 @@ static void ocfs2_cluster_unlock(struct ocfs2_super *osb, mlog_entry_void(); spin_lock_irqsave(&lockres->l_lock, flags); ocfs2_dec_holders(lockres, level); - ocfs2_vote_on_unlock(osb, lockres); + ocfs2_downconvert_on_unlock(osb, lockres); spin_unlock_irqrestore(&lockres->l_lock, flags); mlog_exit_void(); } @@ -1147,13 +1138,7 @@ int ocfs2_create_new_inode_locks(struct inode *inode) * We don't want to use LKM_LOCAL on a meta data lock as they * don't use a generation in their lock names. */ - ret = ocfs2_create_new_lock(osb, &OCFS2_I(inode)->ip_meta_lockres, 1, 0); - if (ret) { - mlog_errno(ret); - goto bail; - } - - ret = ocfs2_create_new_lock(osb, &OCFS2_I(inode)->ip_data_lockres, 1, 1); + ret = ocfs2_create_new_lock(osb, &OCFS2_I(inode)->ip_inode_lockres, 1, 0); if (ret) { mlog_errno(ret); goto bail; @@ -1311,76 +1296,15 @@ out: mlog_exit_void(); } -int ocfs2_data_lock_full(struct inode *inode, - int write, - int arg_flags) -{ - int status = 0, level; - struct ocfs2_lock_res *lockres; - struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); - - BUG_ON(!inode); - - mlog_entry_void(); - - mlog(0, "inode %llu take %s DATA lock\n", - (unsigned long long)OCFS2_I(inode)->ip_blkno, - write ? "EXMODE" : "PRMODE"); - - /* We'll allow faking a readonly data lock for - * rodevices. */ - if (ocfs2_is_hard_readonly(OCFS2_SB(inode->i_sb))) { - if (write) { - status = -EROFS; - mlog_errno(status); - } - goto out; - } - - if (ocfs2_mount_local(osb)) - goto out; - - lockres = &OCFS2_I(inode)->ip_data_lockres; - - level = write ? LKM_EXMODE : LKM_PRMODE; - - status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres, level, - 0, arg_flags); - if (status < 0 && status != -EAGAIN) - mlog_errno(status); - -out: - mlog_exit(status); - return status; -} - -/* see ocfs2_meta_lock_with_page() */ -int ocfs2_data_lock_with_page(struct inode *inode, - int write, - struct page *page) -{ - int ret; - - ret = ocfs2_data_lock_full(inode, write, OCFS2_LOCK_NONBLOCK); - if (ret == -EAGAIN) { - unlock_page(page); - if (ocfs2_data_lock(inode, write) == 0) - ocfs2_data_unlock(inode, write); - ret = AOP_TRUNCATED_PAGE; - } - - return ret; -} - -static void ocfs2_vote_on_unlock(struct ocfs2_super *osb, - struct ocfs2_lock_res *lockres) +static void ocfs2_downconvert_on_unlock(struct ocfs2_super *osb, + struct ocfs2_lock_res *lockres) { int kick = 0; mlog_entry_void(); /* If we know that another node is waiting on our lock, kick - * the vote thread * pre-emptively when we reach a release + * the downconvert thread * pre-emptively when we reach a release * condition. */ if (lockres->l_flags & OCFS2_LOCK_BLOCKED) { switch(lockres->l_blocking) { @@ -1398,27 +1322,7 @@ static void ocfs2_vote_on_unlock(struct ocfs2_super *osb, } if (kick) - ocfs2_kick_vote_thread(osb); - - mlog_exit_void(); -} - -void ocfs2_data_unlock(struct inode *inode, - int write) -{ - int level = write ? LKM_EXMODE : LKM_PRMODE; - struct ocfs2_lock_res *lockres = &OCFS2_I(inode)->ip_data_lockres; - struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); - - mlog_entry_void(); - - mlog(0, "inode %llu drop %s DATA lock\n", - (unsigned long long)OCFS2_I(inode)->ip_blkno, - write ? "EXMODE" : "PRMODE"); - - if (!ocfs2_is_hard_readonly(OCFS2_SB(inode->i_sb)) && - !ocfs2_mount_local(osb)) - ocfs2_cluster_unlock(OCFS2_SB(inode->i_sb), lockres, level); + ocfs2_wake_downconvert_thread(osb); mlog_exit_void(); } @@ -1442,11 +1346,11 @@ static u64 ocfs2_pack_timespec(struct timespec *spec) /* Call this with the lockres locked. I am reasonably sure we don't * need ip_lock in this function as anyone who would be changing those - * values is supposed to be blocked in ocfs2_meta_lock right now. */ + * values is supposed to be blocked in ocfs2_inode_lock right now. */ static void __ocfs2_stuff_meta_lvb(struct inode *inode) { struct ocfs2_inode_info *oi = OCFS2_I(inode); - struct ocfs2_lock_res *lockres = &oi->ip_meta_lockres; + struct ocfs2_lock_res *lockres = &oi->ip_inode_lockres; struct ocfs2_meta_lvb *lvb; mlog_entry_void(); @@ -1496,7 +1400,7 @@ static void ocfs2_unpack_timespec(struct timespec *spec, static void ocfs2_refresh_inode_from_lvb(struct inode *inode) { struct ocfs2_inode_info *oi = OCFS2_I(inode); - struct ocfs2_lock_res *lockres = &oi->ip_meta_lockres; + struct ocfs2_lock_res *lockres = &oi->ip_inode_lockres; struct ocfs2_meta_lvb *lvb; mlog_entry_void(); @@ -1604,12 +1508,12 @@ static inline void ocfs2_complete_lock_res_refresh(struct ocfs2_lock_res *lockre } /* may or may not return a bh if it went to disk. */ -static int ocfs2_meta_lock_update(struct inode *inode, +static int ocfs2_inode_lock_update(struct inode *inode, struct buffer_head **bh) { int status = 0; struct ocfs2_inode_info *oi = OCFS2_I(inode); - struct ocfs2_lock_res *lockres = &oi->ip_meta_lockres; + struct ocfs2_lock_res *lockres = &oi->ip_inode_lockres; struct ocfs2_dinode *fe; struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); @@ -1721,7 +1625,7 @@ static int ocfs2_assign_bh(struct inode *inode, * returns < 0 error if the callback will never be called, otherwise * the result of the lock will be communicated via the callback. */ -int ocfs2_meta_lock_full(struct inode *inode, +int ocfs2_inode_lock_full(struct inode *inode, struct buffer_head **ret_bh, int ex, int arg_flags) @@ -1756,7 +1660,7 @@ int ocfs2_meta_lock_full(struct inode *inode, wait_event(osb->recovery_event, ocfs2_node_map_is_empty(osb, &osb->recovery_map)); - lockres = &OCFS2_I(inode)->ip_meta_lockres; + lockres = &OCFS2_I(inode)->ip_inode_lockres; level = ex ? LKM_EXMODE : LKM_PRMODE; dlm_flags = 0; if (arg_flags & OCFS2_META_LOCK_NOQUEUE) @@ -1795,11 +1699,11 @@ local: } /* This is fun. The caller may want a bh back, or it may - * not. ocfs2_meta_lock_update definitely wants one in, but + * not. ocfs2_inode_lock_update definitely wants one in, but * may or may not read one, depending on what's in the * LVB. The result of all of this is that we've *only* gone to * disk if we have to, so the complexity is worthwhile. */ - status = ocfs2_meta_lock_update(inode, &local_bh); + status = ocfs2_inode_lock_update(inode, &local_bh); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -1821,7 +1725,7 @@ bail: *ret_bh = NULL; } if (acquired) - ocfs2_meta_unlock(inode, ex); + ocfs2_inode_unlock(inode, ex); } if (local_bh) @@ -1832,19 +1736,20 @@ bail: } /* - * This is working around a lock inversion between tasks acquiring DLM locks - * while holding a page lock and the vote thread which blocks dlm lock acquiry - * while acquiring page locks. + * This is working around a lock inversion between tasks acquiring DLM + * locks while holding a page lock and the downconvert thread which + * blocks dlm lock acquiry while acquiring page locks. * * ** These _with_page variantes are only intended to be called from aop * methods that hold page locks and return a very specific *positive* error * code that aop methods pass up to the VFS -- test for errors with != 0. ** * - * The DLM is called such that it returns -EAGAIN if it would have blocked - * waiting for the vote thread. In that case we unlock our page so the vote - * thread can make progress. Once we've done this we have to return - * AOP_TRUNCATED_PAGE so the aop method that called us can bubble that back up - * into the VFS who will then immediately retry the aop call. + * The DLM is called such that it returns -EAGAIN if it would have + * blocked waiting for the downconvert thread. In that case we unlock + * our page so the downconvert thread can make progress. Once we've + * done this we have to return AOP_TRUNCATED_PAGE so the aop method + * that called us can bubble that back up into the VFS who will then + * immediately retry the aop call. * * We do a blocking lock and immediate unlock before returning, though, so that * the lock has a great chance of being cached on this node by the time the VFS @@ -1852,32 +1757,32 @@ bail: * ping locks back and forth, but that's a risk we're willing to take to avoid * the lock inversion simply. */ -int ocfs2_meta_lock_with_page(struct inode *inode, +int ocfs2_inode_lock_with_page(struct inode *inode, struct buffer_head **ret_bh, int ex, struct page *page) { int ret; - ret = ocfs2_meta_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK); + ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK); if (ret == -EAGAIN) { unlock_page(page); - if (ocfs2_meta_lock(inode, ret_bh, ex) == 0) - ocfs2_meta_unlock(inode, ex); + if (ocfs2_inode_lock(inode, ret_bh, ex) == 0) + ocfs2_inode_unlock(inode, ex); ret = AOP_TRUNCATED_PAGE; } return ret; } -int ocfs2_meta_lock_atime(struct inode *inode, +int ocfs2_inode_lock_atime(struct inode *inode, struct vfsmount *vfsmnt, int *level) { int ret; mlog_entry_void(); - ret = ocfs2_meta_lock(inode, NULL, 0); + ret = ocfs2_inode_lock(inode, NULL, 0); if (ret < 0) { mlog_errno(ret); return ret; @@ -1890,8 +1795,8 @@ int ocfs2_meta_lock_atime(struct inode *inode, if (ocfs2_should_update_atime(inode, vfsmnt)) { struct buffer_head *bh = NULL; - ocfs2_meta_unlock(inode, 0); - ret = ocfs2_meta_lock(inode, &bh, 1); + ocfs2_inode_unlock(inode, 0); + ret = ocfs2_inode_lock(inode, &bh, 1); if (ret < 0) { mlog_errno(ret); return ret; @@ -1908,11 +1813,11 @@ int ocfs2_meta_lock_atime(struct inode *inode, return ret; } -void ocfs2_meta_unlock(struct inode *inode, +void ocfs2_inode_unlock(struct inode *inode, int ex) { int level = ex ? LKM_EXMODE : LKM_PRMODE; - struct ocfs2_lock_res *lockres = &OCFS2_I(inode)->ip_meta_lockres; + struct ocfs2_lock_res *lockres = &OCFS2_I(inode)->ip_inode_lockres; struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); mlog_entry_void(); @@ -2320,11 +2225,11 @@ int ocfs2_dlm_init(struct ocfs2_super *osb) goto bail; } - /* launch vote thread */ - osb->vote_task = kthread_run(ocfs2_vote_thread, osb, "ocfs2vote"); - if (IS_ERR(osb->vote_task)) { - status = PTR_ERR(osb->vote_task); - osb->vote_task = NULL; + /* launch downconvert thread */ + osb->dc_task = kthread_run(ocfs2_downconvert_thread, osb, "ocfs2dc"); + if (IS_ERR(osb->dc_task)) { + status = PTR_ERR(osb->dc_task); + osb->dc_task = NULL; mlog_errno(status); goto bail; } @@ -2353,8 +2258,8 @@ local: bail: if (status < 0) { ocfs2_dlm_shutdown_debug(osb); - if (osb->vote_task) - kthread_stop(osb->vote_task); + if (osb->dc_task) + kthread_stop(osb->dc_task); } mlog_exit(status); @@ -2369,9 +2274,9 @@ void ocfs2_dlm_shutdown(struct ocfs2_super *osb) ocfs2_drop_osb_locks(osb); - if (osb->vote_task) { - kthread_stop(osb->vote_task); - osb->vote_task = NULL; + if (osb->dc_task) { + kthread_stop(osb->dc_task); + osb->dc_task = NULL; } ocfs2_lock_res_free(&osb->osb_super_lockres); @@ -2527,7 +2432,7 @@ out: /* Mark the lockres as being dropped. It will no longer be * queued if blocking, but we still may have to wait on it - * being dequeued from the vote thread before we can consider + * being dequeued from the downconvert thread before we can consider * it safe to drop. * * You can *not* attempt to call cluster_lock on this lockres anymore. */ @@ -2590,14 +2495,7 @@ int ocfs2_drop_inode_locks(struct inode *inode) status = err; err = ocfs2_drop_lock(OCFS2_SB(inode->i_sb), - &OCFS2_I(inode)->ip_data_lockres); - if (err < 0) - mlog_errno(err); - if (err < 0 && !status) - status = err; - - err = ocfs2_drop_lock(OCFS2_SB(inode->i_sb), - &OCFS2_I(inode)->ip_meta_lockres); + &OCFS2_I(inode)->ip_inode_lockres); if (err < 0) mlog_errno(err); if (err < 0 && !status) @@ -2850,6 +2748,9 @@ static int ocfs2_data_convert_worker(struct ocfs2_lock_res *lockres, inode = ocfs2_lock_res_inode(lockres); mapping = inode->i_mapping; + if (S_ISREG(inode->i_mode)) + goto out; + /* * We need this before the filemap_fdatawrite() so that it can * transfer the dirty bit from the PTE to the @@ -2875,6 +2776,7 @@ static int ocfs2_data_convert_worker(struct ocfs2_lock_res *lockres, filemap_fdatawait(mapping); } +out: return UNBLOCK_CONTINUE; } @@ -2903,7 +2805,7 @@ static void ocfs2_set_meta_lvb(struct ocfs2_lock_res *lockres) /* * Does the final reference drop on our dentry lock. Right now this - * happens in the vote thread, but we could choose to simplify the + * happens in the downconvert thread, but we could choose to simplify the * dlmglue API and push these off to the ocfs2_wq in the future. */ static void ocfs2_dentry_post_unlock(struct ocfs2_super *osb, @@ -3042,7 +2944,7 @@ void ocfs2_process_blocked_lock(struct ocfs2_super *osb, mlog(0, "lockres %s blocked.\n", lockres->l_name); /* Detect whether a lock has been marked as going away while - * the vote thread was processing other things. A lock can + * the downconvert thread was processing other things. A lock can * still be marked with OCFS2_LOCK_FREEING after this check, * but short circuiting here will still save us some * performance. */ @@ -3091,13 +2993,104 @@ static void ocfs2_schedule_blocked_lock(struct ocfs2_super *osb, lockres_or_flags(lockres, OCFS2_LOCK_QUEUED); - spin_lock(&osb->vote_task_lock); + spin_lock(&osb->dc_task_lock); if (list_empty(&lockres->l_blocked_list)) { list_add_tail(&lockres->l_blocked_list, &osb->blocked_lock_list); osb->blocked_lock_count++; } - spin_unlock(&osb->vote_task_lock); + spin_unlock(&osb->dc_task_lock); mlog_exit_void(); } + +static void ocfs2_downconvert_thread_do_work(struct ocfs2_super *osb) +{ + unsigned long processed; + struct ocfs2_lock_res *lockres; + + mlog_entry_void(); + + spin_lock(&osb->dc_task_lock); + /* grab this early so we know to try again if a state change and + * wake happens part-way through our work */ + osb->dc_work_sequence = osb->dc_wake_sequence; + + processed = osb->blocked_lock_count; + while (processed) { + BUG_ON(list_empty(&osb->blocked_lock_list)); + + lockres = list_entry(osb->blocked_lock_list.next, + struct ocfs2_lock_res, l_blocked_list); + list_del_init(&lockres->l_blocked_list); + osb->blocked_lock_count--; + spin_unlock(&osb->dc_task_lock); + + BUG_ON(!processed); + processed--; + + ocfs2_process_blocked_lock(osb, lockres); + + spin_lock(&osb->dc_task_lock); + } + spin_unlock(&osb->dc_task_lock); + + mlog_exit_void(); +} + +static int ocfs2_downconvert_thread_lists_empty(struct ocfs2_super *osb) +{ + int empty = 0; + + spin_lock(&osb->dc_task_lock); + if (list_empty(&osb->blocked_lock_list)) + empty = 1; + + spin_unlock(&osb->dc_task_lock); + return empty; +} + +static int ocfs2_downconvert_thread_should_wake(struct ocfs2_super *osb) +{ + int should_wake = 0; + + spin_lock(&osb->dc_task_lock); + if (osb->dc_work_sequence != osb->dc_wake_sequence) + should_wake = 1; + spin_unlock(&osb->dc_task_lock); + + return should_wake; +} + +int ocfs2_downconvert_thread(void *arg) +{ + int status = 0; + struct ocfs2_super *osb = arg; + + /* only quit once we've been asked to stop and there is no more + * work available */ + while (!(kthread_should_stop() && + ocfs2_downconvert_thread_lists_empty(osb))) { + + wait_event_interruptible(osb->dc_event, + ocfs2_downconvert_thread_should_wake(osb) || + kthread_should_stop()); + + mlog(0, "downconvert_thread: awoken\n"); + + ocfs2_downconvert_thread_do_work(osb); + } + + osb->dc_task = NULL; + return status; +} + +void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb) +{ + spin_lock(&osb->dc_task_lock); + /* make sure the voting thread gets a swipe at whatever changes + * the caller may have made to the voting state */ + osb->dc_wake_sequence++; + spin_unlock(&osb->dc_task_lock); + wake_up(&osb->dc_event); +} diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h index 87a785e..6dcbc94 100644 --- a/fs/ocfs2/dlmglue.h +++ b/fs/ocfs2/dlmglue.h @@ -49,12 +49,12 @@ struct ocfs2_meta_lvb { __be32 lvb_reserved2; }; -/* ocfs2_meta_lock_full() and ocfs2_data_lock_full() 'arg_flags' flags */ +/* ocfs2_inode_lock_full() 'arg_flags' flags */ /* don't wait on recovery. */ #define OCFS2_META_LOCK_RECOVERY (0x01) /* Instruct the dlm not to queue ourselves on the other node. */ #define OCFS2_META_LOCK_NOQUEUE (0x02) -/* don't block waiting for the vote thread, instead return -EAGAIN */ +/* don't block waiting for the downconvert thread, instead return -EAGAIN */ #define OCFS2_LOCK_NONBLOCK (0x04) int ocfs2_dlm_init(struct ocfs2_super *osb); @@ -69,35 +69,26 @@ void ocfs2_dentry_lock_res_init(struct ocfs2_dentry_lock *dl, void ocfs2_lock_res_free(struct ocfs2_lock_res *res); int ocfs2_create_new_inode_locks(struct inode *inode); int ocfs2_drop_inode_locks(struct inode *inode); -int ocfs2_data_lock_full(struct inode *inode, - int write, - int arg_flags); -#define ocfs2_data_lock(inode, write) ocfs2_data_lock_full(inode, write, 0) -int ocfs2_data_lock_with_page(struct inode *inode, - int write, - struct page *page); -void ocfs2_data_unlock(struct inode *inode, - int write); int ocfs2_rw_lock(struct inode *inode, int write); void ocfs2_rw_unlock(struct inode *inode, int write); int ocfs2_open_lock(struct inode *inode); int ocfs2_try_open_lock(struct inode *inode, int write); void ocfs2_open_unlock(struct inode *inode); -int ocfs2_meta_lock_atime(struct inode *inode, +int ocfs2_inode_lock_atime(struct inode *inode, struct vfsmount *vfsmnt, int *level); -int ocfs2_meta_lock_full(struct inode *inode, +int ocfs2_inode_lock_full(struct inode *inode, struct buffer_head **ret_bh, int ex, int arg_flags); -int ocfs2_meta_lock_with_page(struct inode *inode, +int ocfs2_inode_lock_with_page(struct inode *inode, struct buffer_head **ret_bh, int ex, struct page *page); /* 99% of the time we don't want to supply any additional flags -- * those are for very specific cases only. */ -#define ocfs2_meta_lock(i, b, e) ocfs2_meta_lock_full(i, b, e, 0) -void ocfs2_meta_unlock(struct inode *inode, +#define ocfs2_inode_lock(i, b, e) ocfs2_inode_lock_full(i, b, e, 0) +void ocfs2_inode_unlock(struct inode *inode, int ex); int ocfs2_super_lock(struct ocfs2_super *osb, int ex); @@ -112,9 +103,10 @@ void ocfs2_mark_lockres_freeing(struct ocfs2_lock_res *lockres); void ocfs2_simple_drop_lockres(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres); -/* for the vote thread */ +/* for the downconvert thread */ void ocfs2_process_blocked_lock(struct ocfs2_super *osb, struct ocfs2_lock_res *lockres); +void ocfs2_wake_downconvert_thread(struct ocfs2_super *osb); struct ocfs2_dlm_debug *ocfs2_new_dlm_debug(void); void ocfs2_put_dlm_debug(struct ocfs2_dlm_debug *dlm_debug); diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c index 535bfa9..1f9e353 100644 --- a/fs/ocfs2/export.c +++ b/fs/ocfs2/export.c @@ -95,7 +95,7 @@ static struct dentry *ocfs2_get_parent(struct dentry *child) mlog(0, "find parent of directory %llu\n", (unsigned long long)OCFS2_I(dir)->ip_blkno); - status = ocfs2_meta_lock(dir, NULL, 0); + status = ocfs2_inode_lock(dir, NULL, 0); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -126,7 +126,7 @@ static struct dentry *ocfs2_get_parent(struct dentry *child) parent->d_op = &ocfs2_dentry_ops; bail_unlock: - ocfs2_meta_unlock(dir, 0); + ocfs2_inode_unlock(dir, 0); bail: mlog_exit_ptr(parent); diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index b75b2e1..432e5f3 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -382,18 +382,13 @@ static int ocfs2_truncate_file(struct inode *inode, down_write(&OCFS2_I(inode)->ip_alloc_sem); - /* This forces other nodes to sync and drop their pages. Do - * this even if we have a truncate without allocation change - - * ocfs2 cluster sizes can be much greater than page size, so - * we have to truncate them anyway. */ - status = ocfs2_data_lock(inode, 1); - if (status < 0) { - up_write(&OCFS2_I(inode)->ip_alloc_sem); - - mlog_errno(status); - goto bail; - } - + /* + * The inode lock forced other nodes to sync and drop their + * pages, which (correctly) happens even if we have a truncate + * without allocation change - ocfs2 cluster sizes can be much + * greater than page size, so we have to truncate them + * anyway. + */ unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1); truncate_inode_pages(inode->i_mapping, new_i_size); @@ -403,7 +398,7 @@ static int ocfs2_truncate_file(struct inode *inode, if (status) mlog_errno(status); - goto bail_unlock_data; + goto bail_unlock_sem; } /* alright, we're going to need to do a full blown alloc size @@ -413,25 +408,23 @@ static int ocfs2_truncate_file(struct inode *inode, status = ocfs2_orphan_for_truncate(osb, inode, di_bh, new_i_size); if (status < 0) { mlog_errno(status); - goto bail_unlock_data; + goto bail_unlock_sem; } status = ocfs2_prepare_truncate(osb, inode, di_bh, &tc); if (status < 0) { mlog_errno(status); - goto bail_unlock_data; + goto bail_unlock_sem; } status = ocfs2_commit_truncate(osb, inode, di_bh, tc); if (status < 0) { mlog_errno(status); - goto bail_unlock_data; + goto bail_unlock_sem; } /* TODO: orphan dir cleanup here. */ -bail_unlock_data: - ocfs2_data_unlock(inode, 1); - +bail_unlock_sem: up_write(&OCFS2_I(inode)->ip_alloc_sem); bail: @@ -917,7 +910,7 @@ static int ocfs2_extend_file(struct inode *inode, struct buffer_head *di_bh, u64 new_i_size) { - int ret = 0, data_locked = 0; + int ret = 0; struct ocfs2_inode_info *oi = OCFS2_I(inode); BUG_ON(!di_bh); @@ -943,20 +936,6 @@ static int ocfs2_extend_file(struct inode *inode, && ocfs2_sparse_alloc(OCFS2_SB(inode->i_sb))) goto out_update_size; - /* - * protect the pages that ocfs2_zero_extend is going to be - * pulling into the page cache.. we do this before the - * metadata extend so that we don't get into the situation - * where we've extended the metadata but can't get the data - * lock to zero. - */ - ret = ocfs2_data_lock(inode, 1); - if (ret < 0) { - mlog_errno(ret); - goto out; - } - data_locked = 1; - /* * The alloc sem blocks people in read/write from reading our * allocation until we're done changing it. We depend on @@ -980,7 +959,7 @@ static int ocfs2_extend_file(struct inode *inode, up_write(&oi->ip_alloc_sem); mlog_errno(ret); - goto out_unlock; + goto out; } } @@ -991,7 +970,7 @@ static int ocfs2_extend_file(struct inode *inode, if (ret < 0) { mlog_errno(ret); - goto out_unlock; + goto out; } out_update_size: @@ -999,10 +978,6 @@ out_update_size: if (ret < 0) mlog_errno(ret); -out_unlock: - if (data_locked) - ocfs2_data_unlock(inode, 1); - out: return ret; } @@ -1050,7 +1025,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) } } - status = ocfs2_meta_lock(inode, &bh, 1); + status = ocfs2_inode_lock(inode, &bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -1102,7 +1077,7 @@ int ocfs2_setattr(struct dentry *dentry, struct iattr *attr) bail_commit: ocfs2_commit_trans(osb, handle); bail_unlock: - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); bail_unlock_rw: if (size_change) ocfs2_rw_unlock(inode, 1); @@ -1149,7 +1124,7 @@ int ocfs2_permission(struct inode *inode, int mask, struct nameidata *nd) mlog_entry_void(); - ret = ocfs2_meta_lock(inode, NULL, 0); + ret = ocfs2_inode_lock(inode, NULL, 0); if (ret) { if (ret != -ENOENT) mlog_errno(ret); @@ -1158,7 +1133,7 @@ int ocfs2_permission(struct inode *inode, int mask, struct nameidata *nd) ret = generic_permission(inode, mask, NULL); - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); out: mlog_exit(ret); return ret; @@ -1630,7 +1605,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, goto out; } - ret = ocfs2_meta_lock(inode, &di_bh, 1); + ret = ocfs2_inode_lock(inode, &di_bh, 1); if (ret) { mlog_errno(ret); goto out_rw_unlock; @@ -1638,7 +1613,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, if (inode->i_flags & (S_IMMUTABLE|S_APPEND)) { ret = -EPERM; - goto out_meta_unlock; + goto out_inode_unlock; } switch (sr->l_whence) { @@ -1652,7 +1627,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, break; default: ret = -EINVAL; - goto out_meta_unlock; + goto out_inode_unlock; } sr->l_whence = 0; @@ -1663,14 +1638,14 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, || (sr->l_start + llen) < 0 || (sr->l_start + llen) > max_off) { ret = -EINVAL; - goto out_meta_unlock; + goto out_inode_unlock; } size = sr->l_start + sr->l_len; if (cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) { if (sr->l_len <= 0) { ret = -EINVAL; - goto out_meta_unlock; + goto out_inode_unlock; } } @@ -1678,7 +1653,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, ret = __ocfs2_write_remove_suid(inode, di_bh); if (ret) { mlog_errno(ret); - goto out_meta_unlock; + goto out_inode_unlock; } } @@ -1704,7 +1679,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, up_write(&OCFS2_I(inode)->ip_alloc_sem); if (ret) { mlog_errno(ret); - goto out_meta_unlock; + goto out_inode_unlock; } /* @@ -1714,7 +1689,7 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, if (IS_ERR(handle)) { ret = PTR_ERR(handle); mlog_errno(ret); - goto out_meta_unlock; + goto out_inode_unlock; } if (change_size && i_size_read(inode) < size) @@ -1727,9 +1702,9 @@ static int __ocfs2_change_file_space(struct file *file, struct inode *inode, ocfs2_commit_trans(osb, handle); -out_meta_unlock: +out_inode_unlock: brelse(di_bh); - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); out_rw_unlock: ocfs2_rw_unlock(inode, 1); @@ -1799,7 +1774,7 @@ static int ocfs2_prepare_inode_for_write(struct dentry *dentry, * if we need to make modifications here. */ for(;;) { - ret = ocfs2_meta_lock(inode, NULL, meta_level); + ret = ocfs2_inode_lock(inode, NULL, meta_level); if (ret < 0) { meta_level = -1; mlog_errno(ret); @@ -1817,7 +1792,7 @@ static int ocfs2_prepare_inode_for_write(struct dentry *dentry, * set inode->i_size at the end of a write. */ if (should_remove_suid(dentry)) { if (meta_level == 0) { - ocfs2_meta_unlock(inode, meta_level); + ocfs2_inode_unlock(inode, meta_level); meta_level = 1; continue; } @@ -1886,7 +1861,7 @@ static int ocfs2_prepare_inode_for_write(struct dentry *dentry, *ppos = saved_pos; out_unlock: - ocfs2_meta_unlock(inode, meta_level); + ocfs2_inode_unlock(inode, meta_level); out: return ret; @@ -2099,12 +2074,12 @@ static ssize_t ocfs2_file_splice_read(struct file *in, /* * See the comment in ocfs2_file_aio_read() */ - ret = ocfs2_meta_lock(inode, NULL, 0); + ret = ocfs2_inode_lock(inode, NULL, 0); if (ret < 0) { mlog_errno(ret); goto bail; } - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); ret = generic_file_splice_read(in, ppos, pipe, len, flags); @@ -2160,12 +2135,12 @@ static ssize_t ocfs2_file_aio_read(struct kiocb *iocb, * like i_size. This allows the checks down below * generic_file_aio_read() a chance of actually working. */ - ret = ocfs2_meta_lock_atime(inode, filp->f_vfsmnt, &lock_level); + ret = ocfs2_inode_lock_atime(inode, filp->f_vfsmnt, &lock_level); if (ret < 0) { mlog_errno(ret); goto bail; } - ocfs2_meta_unlock(inode, lock_level); + ocfs2_inode_unlock(inode, lock_level); ret = generic_file_aio_read(iocb, iov, nr_segs, iocb->ki_pos); if (ret == -EINVAL) diff --git a/fs/ocfs2/heartbeat.c b/fs/ocfs2/heartbeat.c index c4c3617..c0efd94 100644 --- a/fs/ocfs2/heartbeat.c +++ b/fs/ocfs2/heartbeat.c @@ -30,9 +30,6 @@ #include #include -#include -#include - #include #define MLOG_MASK_PREFIX ML_SUPER @@ -44,13 +41,9 @@ #include "heartbeat.h" #include "inode.h" #include "journal.h" -#include "vote.h" #include "buffer_head_io.h" -#define OCFS2_HB_NODE_DOWN_PRI (0x0000002) -#define OCFS2_HB_NODE_UP_PRI OCFS2_HB_NODE_DOWN_PRI - static inline void __ocfs2_node_map_set_bit(struct ocfs2_node_map *map, int bit); static inline void __ocfs2_node_map_clear_bit(struct ocfs2_node_map *map, @@ -64,9 +57,7 @@ static void __ocfs2_node_map_set(struct ocfs2_node_map *target, void ocfs2_init_node_maps(struct ocfs2_super *osb) { spin_lock_init(&osb->node_map_lock); - ocfs2_node_map_init(&osb->mounted_map); ocfs2_node_map_init(&osb->recovery_map); - ocfs2_node_map_init(&osb->umount_map); ocfs2_node_map_init(&osb->osb_recovering_orphan_dirs); } @@ -87,24 +78,7 @@ static void ocfs2_do_node_down(int node_num, return; } - if (ocfs2_node_map_test_bit(osb, &osb->umount_map, node_num)) { - /* If a node is in the umount map, then we've been - * expecting him to go down and we know ahead of time - * that recovery is not necessary. */ - ocfs2_node_map_clear_bit(osb, &osb->umount_map, node_num); - return; - } - ocfs2_recovery_thread(osb, node_num); - - ocfs2_remove_node_from_vote_queues(osb, node_num); -} - -static void ocfs2_hb_node_down_cb(struct o2nm_node *node, - int node_num, - void *data) -{ - ocfs2_do_node_down(node_num, (struct ocfs2_super *) data); } /* Called from the dlm when it's about to evict a node. We may also @@ -121,27 +95,8 @@ static void ocfs2_dlm_eviction_cb(int node_num, ocfs2_do_node_down(node_num, osb); } -static void ocfs2_hb_node_up_cb(struct o2nm_node *node, - int node_num, - void *data) -{ - struct ocfs2_super *osb = data; - - BUG_ON(osb->node_num == node_num); - - mlog(0, "node up event for %d\n", node_num); - ocfs2_node_map_clear_bit(osb, &osb->umount_map, node_num); -} - void ocfs2_setup_hb_callbacks(struct ocfs2_super *osb) { - o2hb_setup_callback(&osb->osb_hb_down, O2HB_NODE_DOWN_CB, - ocfs2_hb_node_down_cb, osb, - OCFS2_HB_NODE_DOWN_PRI); - - o2hb_setup_callback(&osb->osb_hb_up, O2HB_NODE_UP_CB, - ocfs2_hb_node_up_cb, osb, OCFS2_HB_NODE_UP_PRI); - /* Not exactly a heartbeat callback, but leads to essentially * the same path so we set it up here. */ dlm_setup_eviction_cb(&osb->osb_eviction_cb, @@ -149,39 +104,6 @@ void ocfs2_setup_hb_callbacks(struct ocfs2_super *osb) osb); } -/* Most functions here are just stubs for now... */ -int ocfs2_register_hb_callbacks(struct ocfs2_super *osb) -{ - int status; - - if (ocfs2_mount_local(osb)) - return 0; - - status = o2hb_register_callback(osb->uuid_str, &osb->osb_hb_down); - if (status < 0) { - mlog_errno(status); - goto bail; - } - - status = o2hb_register_callback(osb->uuid_str, &osb->osb_hb_up); - if (status < 0) { - mlog_errno(status); - o2hb_unregister_callback(osb->uuid_str, &osb->osb_hb_down); - } - -bail: - return status; -} - -void ocfs2_clear_hb_callbacks(struct ocfs2_super *osb) -{ - if (ocfs2_mount_local(osb)) - return; - - o2hb_unregister_callback(osb->uuid_str, &osb->osb_hb_down); - o2hb_unregister_callback(osb->uuid_str, &osb->osb_hb_up); -} - void ocfs2_stop_heartbeat(struct ocfs2_super *osb) { int ret; @@ -341,8 +263,6 @@ int ocfs2_recovery_map_set(struct ocfs2_super *osb, spin_lock(&osb->node_map_lock); - __ocfs2_node_map_clear_bit(&osb->mounted_map, num); - if (!test_bit(num, osb->recovery_map.map)) { __ocfs2_node_map_set_bit(&osb->recovery_map, num); set = 1; diff --git a/fs/ocfs2/heartbeat.h b/fs/ocfs2/heartbeat.h index e8fb079..5685921 100644 --- a/fs/ocfs2/heartbeat.h +++ b/fs/ocfs2/heartbeat.h @@ -29,8 +29,6 @@ void ocfs2_init_node_maps(struct ocfs2_super *osb); void ocfs2_setup_hb_callbacks(struct ocfs2_super *osb); -int ocfs2_register_hb_callbacks(struct ocfs2_super *osb); -void ocfs2_clear_hb_callbacks(struct ocfs2_super *osb); void ocfs2_stop_heartbeat(struct ocfs2_super *osb); /* node map functions - used to keep track of mounted and in-recovery diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index ebb2bbe..00cd5b7 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -49,7 +49,6 @@ #include "symlink.h" #include "sysfile.h" #include "uptodate.h" -#include "vote.h" #include "buffer_head_io.h" @@ -322,7 +321,7 @@ int ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe, */ BUG_ON(le32_to_cpu(fe->i_flags) & OCFS2_SYSTEM_FL); - ocfs2_inode_lock_res_init(&OCFS2_I(inode)->ip_meta_lockres, + ocfs2_inode_lock_res_init(&OCFS2_I(inode)->ip_inode_lockres, OCFS2_LOCK_TYPE_META, 0, inode); ocfs2_inode_lock_res_init(&OCFS2_I(inode)->ip_open_lockres, @@ -333,10 +332,6 @@ int ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe, OCFS2_LOCK_TYPE_RW, inode->i_generation, inode); - ocfs2_inode_lock_res_init(&OCFS2_I(inode)->ip_data_lockres, - OCFS2_LOCK_TYPE_DATA, inode->i_generation, - inode); - ocfs2_set_inode_flags(inode); status = 0; @@ -414,7 +409,7 @@ static int ocfs2_read_locked_inode(struct inode *inode, if (args->fi_flags & OCFS2_FI_FLAG_SYSFILE) generation = osb->fs_generation; - ocfs2_inode_lock_res_init(&OCFS2_I(inode)->ip_meta_lockres, + ocfs2_inode_lock_res_init(&OCFS2_I(inode)->ip_inode_lockres, OCFS2_LOCK_TYPE_META, generation, inode); @@ -429,7 +424,7 @@ static int ocfs2_read_locked_inode(struct inode *inode, mlog_errno(status); return status; } - status = ocfs2_meta_lock(inode, NULL, 0); + status = ocfs2_inode_lock(inode, NULL, 0); if (status) { make_bad_inode(inode); mlog_errno(status); @@ -484,7 +479,7 @@ static int ocfs2_read_locked_inode(struct inode *inode, bail: if (can_lock) - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); if (status < 0) make_bad_inode(inode); @@ -586,7 +581,7 @@ static int ocfs2_remove_inode(struct inode *inode, } mutex_lock(&inode_alloc_inode->i_mutex); - status = ocfs2_meta_lock(inode_alloc_inode, &inode_alloc_bh, 1); + status = ocfs2_inode_lock(inode_alloc_inode, &inode_alloc_bh, 1); if (status < 0) { mutex_unlock(&inode_alloc_inode->i_mutex); @@ -635,7 +630,7 @@ static int ocfs2_remove_inode(struct inode *inode, bail_commit: ocfs2_commit_trans(osb, handle); bail_unlock: - ocfs2_meta_unlock(inode_alloc_inode, 1); + ocfs2_inode_unlock(inode_alloc_inode, 1); mutex_unlock(&inode_alloc_inode->i_mutex); brelse(inode_alloc_bh); bail: @@ -709,7 +704,7 @@ static int ocfs2_wipe_inode(struct inode *inode, * delete_inode operation. We do this now to avoid races with * recovery completion on other nodes. */ mutex_lock(&orphan_dir_inode->i_mutex); - status = ocfs2_meta_lock(orphan_dir_inode, &orphan_dir_bh, 1); + status = ocfs2_inode_lock(orphan_dir_inode, &orphan_dir_bh, 1); if (status < 0) { mutex_unlock(&orphan_dir_inode->i_mutex); @@ -718,8 +713,8 @@ static int ocfs2_wipe_inode(struct inode *inode, } /* we do this while holding the orphan dir lock because we - * don't want recovery being run from another node to vote for - * an inode delete on us -- this will result in two nodes + * don't want recovery being run from another node to try an + * inode delete underneath us -- this will result in two nodes * truncating the same file! */ status = ocfs2_truncate_for_delete(osb, inode, di_bh); if (status < 0) { @@ -733,7 +728,7 @@ static int ocfs2_wipe_inode(struct inode *inode, mlog_errno(status); bail_unlock_dir: - ocfs2_meta_unlock(orphan_dir_inode, 1); + ocfs2_inode_unlock(orphan_dir_inode, 1); mutex_unlock(&orphan_dir_inode->i_mutex); brelse(orphan_dir_bh); bail: @@ -744,7 +739,7 @@ bail: } /* There is a series of simple checks that should be done before a - * vote is even considered. Encapsulate those in this function. */ + * trylock is even considered. Encapsulate those in this function. */ static int ocfs2_inode_is_valid_to_delete(struct inode *inode) { int ret = 0; @@ -758,14 +753,14 @@ static int ocfs2_inode_is_valid_to_delete(struct inode *inode) goto bail; } - /* If we're coming from process_vote we can't go into our own + /* If we're coming from downconvert_thread we can't go into our own * voting [hello, deadlock city!], so unforuntately we just * have to skip deleting this guy. That's OK though because * the node who's doing the actual deleting should handle it * anyway. */ - if (current == osb->vote_task) { + if (current == osb->dc_task) { mlog(0, "Skipping delete of %lu because we're currently " - "in process_vote\n", inode->i_ino); + "in downconvert\n", inode->i_ino); goto bail; } @@ -779,10 +774,9 @@ static int ocfs2_inode_is_valid_to_delete(struct inode *inode) goto bail_unlock; } - /* If we have voted "yes" on the wipe of this inode for - * another node, it will be marked here so we can safely skip - * it. Recovery will cleanup any inodes we might inadvertantly - * skip here. */ + /* If we have allowd wipe of this inode for another node, it + * will be marked here so we can safely skip it. Recovery will + * cleanup any inodes we might inadvertantly skip here. */ if (oi->ip_flags & OCFS2_INODE_SKIP_DELETE) { mlog(0, "Skipping delete of %lu because another node " "has done this for us.\n", inode->i_ino); @@ -929,13 +923,13 @@ void ocfs2_delete_inode(struct inode *inode) /* Lock down the inode. This gives us an up to date view of * it's metadata (for verification), and allows us to - * serialize delete_inode votes. + * serialize delete_inode on multiple nodes. * * Even though we might be doing a truncate, we don't take the * allocation lock here as it won't be needed - nobody will * have the file open. */ - status = ocfs2_meta_lock(inode, &di_bh, 1); + status = ocfs2_inode_lock(inode, &di_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -947,15 +941,15 @@ void ocfs2_delete_inode(struct inode *inode) * before we go ahead and wipe the inode. */ status = ocfs2_query_inode_wipe(inode, di_bh, &wipe); if (!wipe || status < 0) { - /* Error and inode busy vote both mean we won't be + /* Error and remote inode busy both mean we won't be * removing the inode, so they take almost the same * path. */ if (status < 0) mlog_errno(status); - /* Someone in the cluster has voted to not wipe this - * inode, or it was never completely orphaned. Write - * out the pages and exit now. */ + /* Someone in the cluster has disallowed a wipe of + * this inode, or it was never completely + * orphaned. Write out the pages and exit now. */ ocfs2_cleanup_delete_inode(inode, 1); goto bail_unlock_inode; } @@ -981,7 +975,7 @@ void ocfs2_delete_inode(struct inode *inode) OCFS2_I(inode)->ip_flags |= OCFS2_INODE_DELETED; bail_unlock_inode: - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); brelse(di_bh); bail_unblock: status = sigprocmask(SIG_SETMASK, &oldset, NULL); @@ -1008,15 +1002,14 @@ void ocfs2_clear_inode(struct inode *inode) mlog_bug_on_msg(OCFS2_SB(inode->i_sb) == NULL, "Inode=%lu\n", inode->i_ino); - /* For remove delete_inode vote, we hold open lock before, - * now it is time to unlock PR and EX open locks. */ + /* To preven remote deletes we hold open lock before, now it + * is time to unlock PR and EX open locks. */ ocfs2_open_unlock(inode); /* Do these before all the other work so that we don't bounce - * the vote thread while waiting to destroy the locks. */ + * the downconvert thread while waiting to destroy the locks. */ ocfs2_mark_lockres_freeing(&oi->ip_rw_lockres); - ocfs2_mark_lockres_freeing(&oi->ip_meta_lockres); - ocfs2_mark_lockres_freeing(&oi->ip_data_lockres); + ocfs2_mark_lockres_freeing(&oi->ip_inode_lockres); ocfs2_mark_lockres_freeing(&oi->ip_open_lockres); /* We very well may get a clear_inode before all an inodes @@ -1039,8 +1032,7 @@ void ocfs2_clear_inode(struct inode *inode) mlog_errno(status); ocfs2_lock_res_free(&oi->ip_rw_lockres); - ocfs2_lock_res_free(&oi->ip_meta_lockres); - ocfs2_lock_res_free(&oi->ip_data_lockres); + ocfs2_lock_res_free(&oi->ip_inode_lockres); ocfs2_lock_res_free(&oi->ip_open_lockres); ocfs2_metadata_cache_purge(inode); @@ -1184,15 +1176,15 @@ int ocfs2_inode_revalidate(struct dentry *dentry) } spin_unlock(&OCFS2_I(inode)->ip_lock); - /* Let ocfs2_meta_lock do the work of updating our struct + /* Let ocfs2_inode_lock do the work of updating our struct * inode for us. */ - status = ocfs2_meta_lock(inode, NULL, 0); + status = ocfs2_inode_lock(inode, NULL, 0); if (status < 0) { if (status != -ENOENT) mlog_errno(status); goto bail; } - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); bail: mlog_exit(status); diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h index 70e881c..ebf5e68 100644 --- a/fs/ocfs2/inode.h +++ b/fs/ocfs2/inode.h @@ -34,8 +34,7 @@ struct ocfs2_inode_info u64 ip_blkno; struct ocfs2_lock_res ip_rw_lockres; - struct ocfs2_lock_res ip_meta_lockres; - struct ocfs2_lock_res ip_data_lockres; + struct ocfs2_lock_res ip_inode_lockres; struct ocfs2_lock_res ip_open_lockres; /* protects allocation changes on this inode. */ @@ -56,6 +55,8 @@ struct ocfs2_inode_info /* protected by recovery_lock. */ struct inode *ip_next_orphan; + struct inode *ip_next_unlinked; + u32 ip_dir_start_lookup; /* next two are protected by trans_inc_lock */ diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c index 87dcece..67c2fb4 100644 --- a/fs/ocfs2/ioctl.c +++ b/fs/ocfs2/ioctl.c @@ -27,14 +27,14 @@ static int ocfs2_get_inode_attr(struct inode *inode, unsigned *flags) { int status; - status = ocfs2_meta_lock(inode, NULL, 0); + status = ocfs2_inode_lock(inode, NULL, 0); if (status < 0) { mlog_errno(status); return status; } ocfs2_get_inode_flags(OCFS2_I(inode)); *flags = OCFS2_I(inode)->ip_attr; - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); mlog_exit(status); return status; @@ -52,7 +52,7 @@ static int ocfs2_set_inode_attr(struct inode *inode, unsigned flags, mutex_lock(&inode->i_mutex); - status = ocfs2_meta_lock(inode, &bh, 1); + status = ocfs2_inode_lock(inode, &bh, 1); if (status < 0) { mlog_errno(status); goto bail; @@ -100,7 +100,7 @@ static int ocfs2_set_inode_attr(struct inode *inode, unsigned flags, ocfs2_commit_trans(osb, handle); bail_unlock: - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); bail: mutex_unlock(&inode->i_mutex); diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c index f9d01e2..8b9ce2a 100644 --- a/fs/ocfs2/journal.c +++ b/fs/ocfs2/journal.c @@ -44,7 +44,6 @@ #include "localalloc.h" #include "slot_map.h" #include "super.h" -#include "vote.h" #include "sysfile.h" #include "buffer_head_io.h" @@ -103,7 +102,7 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb) mlog(0, "commit_thread: flushed transaction %lu (%u handles)\n", journal->j_trans_id, flushed); - ocfs2_kick_vote_thread(osb); + ocfs2_wake_downconvert_thread(osb); wake_up(&journal->j_checkpointed); finally: mlog_exit(status); @@ -174,6 +173,12 @@ int ocfs2_commit_trans(struct ocfs2_super *osb, * transaction. extend_trans will either extend the current handle by * nblocks, or commit it and start a new one with nblocks credits. * + * This might call journal_restart() which will commit dirty buffers + * and then restart the transaction. Before calling + * ocfs2_extend_trans(), any changed blocks should have been + * dirtied. After calling it, all blocks which need to be changed must + * go through another set of journal_access/journal_dirty calls. + * * WARNING: This will not release any semaphores or disk locks taken * during the transaction, so make sure they were taken *before* * start_trans or we'll have ordering deadlocks. @@ -193,11 +198,15 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks) mlog(0, "Trying to extend transaction by %d blocks\n", nblocks); +#ifdef OCFS2_DEBUG_FS + status = 1; +#else status = journal_extend(handle, nblocks); if (status < 0) { mlog_errno(status); goto bail; } +#endif if (status > 0) { mlog(0, "journal_extend failed, trying journal_restart\n"); @@ -304,14 +313,18 @@ int ocfs2_journal_dirty_data(handle_t *handle, return err; } -#define OCFS2_DEFAULT_COMMIT_INTERVAL (HZ * 5) +#define OCFS2_DEFAULT_COMMIT_INTERVAL (HZ * JBD_DEFAULT_MAX_COMMIT_AGE) void ocfs2_set_journal_params(struct ocfs2_super *osb) { journal_t *journal = osb->journal->j_journal; + unsigned long commit_interval = OCFS2_DEFAULT_COMMIT_INTERVAL; + + if (osb->osb_commit_interval) + commit_interval = osb->osb_commit_interval; spin_lock(&journal->j_state_lock); - journal->j_commit_interval = OCFS2_DEFAULT_COMMIT_INTERVAL; + journal->j_commit_interval = commit_interval; if (osb->s_mount_opt & OCFS2_MOUNT_BARRIER) journal->j_flags |= JFS_BARRIER; else @@ -327,7 +340,7 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) struct ocfs2_dinode *di = NULL; struct buffer_head *bh = NULL; struct ocfs2_super *osb; - int meta_lock = 0; + int inode_lock = 0; mlog_entry_void(); @@ -357,14 +370,14 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) /* Skip recovery waits here - journal inode metadata never * changes in a live cluster so it can be considered an * exception to the rule. */ - status = ocfs2_meta_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); + status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { if (status != -ERESTARTSYS) mlog(ML_ERROR, "Could not get lock on journal!\n"); goto done; } - meta_lock = 1; + inode_lock = 1; di = (struct ocfs2_dinode *)bh->b_data; if (inode->i_size < OCFS2_MIN_JOURNAL_SIZE) { @@ -404,8 +417,8 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty) status = 0; done: if (status < 0) { - if (meta_lock) - ocfs2_meta_unlock(inode, 1); + if (inode_lock) + ocfs2_inode_unlock(inode, 1); if (bh != NULL) brelse(bh); if (inode) { @@ -534,7 +547,7 @@ void ocfs2_journal_shutdown(struct ocfs2_super *osb) OCFS2_I(inode)->ip_open_count--; /* unlock our journal */ - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); brelse(journal->j_bh); journal->j_bh = NULL; @@ -873,8 +886,8 @@ restart: ocfs2_super_unlock(osb, 1); /* We always run recovery on our own orphan dir - the dead - * node(s) may have voted "no" on an inode delete earlier. A - * revote is therefore required. */ + * node(s) may have disallowd a previos inode delete. Re-processing + * is therefore required. */ ocfs2_queue_recovery_completion(osb->journal, osb->slot_num, NULL, NULL); @@ -963,9 +976,9 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, } SET_INODE_JOURNAL(inode); - status = ocfs2_meta_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); + status = ocfs2_inode_lock_full(inode, &bh, 1, OCFS2_META_LOCK_RECOVERY); if (status < 0) { - mlog(0, "status returned from ocfs2_meta_lock=%d\n", status); + mlog(0, "status returned from ocfs2_inode_lock=%d\n", status); if (status != -ERESTARTSYS) mlog(ML_ERROR, "Could not lock journal!\n"); goto done; @@ -1037,7 +1050,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb, done: /* drop the lock on this nodes journal */ if (got_lock) - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); if (inode) iput(inode); @@ -1152,14 +1165,14 @@ static int ocfs2_trylock_journal(struct ocfs2_super *osb, SET_INODE_JOURNAL(inode); flags = OCFS2_META_LOCK_RECOVERY | OCFS2_META_LOCK_NOQUEUE; - status = ocfs2_meta_lock_full(inode, NULL, 1, flags); + status = ocfs2_inode_lock_full(inode, NULL, 1, flags); if (status < 0) { if (status != -EAGAIN) mlog_errno(status); goto bail; } - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); bail: if (inode) iput(inode); @@ -1267,7 +1280,7 @@ static int ocfs2_queue_orphans(struct ocfs2_super *osb, } mutex_lock(&orphan_dir_inode->i_mutex); - status = ocfs2_meta_lock(orphan_dir_inode, NULL, 0); + status = ocfs2_inode_lock(orphan_dir_inode, NULL, 0); if (status < 0) { mlog_errno(status); goto out; @@ -1277,12 +1290,13 @@ static int ocfs2_queue_orphans(struct ocfs2_super *osb, ocfs2_orphan_filldir); if (status) { mlog_errno(status); - goto out; + goto out_cluster; } *head = priv.head; - ocfs2_meta_unlock(orphan_dir_inode, 0); +out_cluster: + ocfs2_inode_unlock(orphan_dir_inode, 0); out: mutex_unlock(&orphan_dir_inode->i_mutex); iput(orphan_dir_inode); @@ -1369,10 +1383,10 @@ static int ocfs2_recover_orphans(struct ocfs2_super *osb, iter = oi->ip_next_orphan; spin_lock(&oi->ip_lock); - /* Delete voting may have set these on the assumption - * that the other node would wipe them successfully. - * If they are still in the node's orphan dir, we need - * to reset that state. */ + /* The remote delete code may have set these on the + * assumption that the other node would wipe them + * successfully. If they are still in the node's + * orphan dir, we need to reset that state. */ oi->ip_flags &= ~(OCFS2_INODE_DELETED|OCFS2_INODE_SKIP_DELETE); /* Set the proper information to get us going into diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c index 58ea88b..0de0792 100644 --- a/fs/ocfs2/localalloc.c +++ b/fs/ocfs2/localalloc.c @@ -231,7 +231,7 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb) mutex_lock(&main_bm_inode->i_mutex); - status = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1); + status = ocfs2_inode_lock(main_bm_inode, &main_bm_bh, 1); if (status < 0) { mlog_errno(status); goto out_mutex; @@ -286,7 +286,7 @@ out_unlock: if (main_bm_bh) brelse(main_bm_bh); - ocfs2_meta_unlock(main_bm_inode, 1); + ocfs2_inode_unlock(main_bm_inode, 1); out_mutex: mutex_unlock(&main_bm_inode->i_mutex); @@ -399,7 +399,7 @@ int ocfs2_complete_local_alloc_recovery(struct ocfs2_super *osb, mutex_lock(&main_bm_inode->i_mutex); - status = ocfs2_meta_lock(main_bm_inode, &main_bm_bh, 1); + status = ocfs2_inode_lock(main_bm_inode, &main_bm_bh, 1); if (status < 0) { mlog_errno(status); goto out_mutex; @@ -424,7 +424,7 @@ int ocfs2_complete_local_alloc_recovery(struct ocfs2_super *osb, ocfs2_commit_trans(osb, handle); out_unlock: - ocfs2_meta_unlock(main_bm_inode, 1); + ocfs2_inode_unlock(main_bm_inode, 1); out_mutex: mutex_unlock(&main_bm_inode->i_mutex); diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c index 9875615..3dc18d6 100644 --- a/fs/ocfs2/mmap.c +++ b/fs/ocfs2/mmap.c @@ -168,7 +168,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct page *page) * node. Taking the data lock will also ensure that we don't * attempt page truncation as part of a downconvert. */ - ret = ocfs2_meta_lock(inode, &di_bh, 1); + ret = ocfs2_inode_lock(inode, &di_bh, 1); if (ret < 0) { mlog_errno(ret); goto out; @@ -181,21 +181,12 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct page *page) */ down_write(&OCFS2_I(inode)->ip_alloc_sem); - ret = ocfs2_data_lock(inode, 1); - if (ret < 0) { - mlog_errno(ret); - goto out_meta_unlock; - } - ret = __ocfs2_page_mkwrite(inode, di_bh, page); - ocfs2_data_unlock(inode, 1); - -out_meta_unlock: up_write(&OCFS2_I(inode)->ip_alloc_sem); brelse(di_bh); - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); out: ret2 = ocfs2_vm_op_unblock_sigs(&oldset); @@ -214,13 +205,13 @@ int ocfs2_mmap(struct file *file, struct vm_area_struct *vma) { int ret = 0, lock_level = 0; - ret = ocfs2_meta_lock_atime(file->f_dentry->d_inode, + ret = ocfs2_inode_lock_atime(file->f_dentry->d_inode, file->f_vfsmnt, &lock_level); if (ret < 0) { mlog_errno(ret); goto out; } - ocfs2_meta_unlock(file->f_dentry->d_inode, lock_level); + ocfs2_inode_unlock(file->f_dentry->d_inode, lock_level); out: vma->vm_ops = &ocfs2_file_vm_ops; vma->vm_flags |= VM_CAN_NONLINEAR; diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 989ac27..17bcaab 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -40,6 +40,7 @@ #include #include #include +#include #define MLOG_MASK_PREFIX ML_NAMEI #include @@ -60,7 +61,6 @@ #include "symlink.h" #include "sysfile.h" #include "uptodate.h" -#include "vote.h" #include "buffer_head_io.h" @@ -96,6 +96,9 @@ static int ocfs2_create_symlink_data(struct ocfs2_super *osb, /* An orphan dir name is an 8 byte value, printed as a hex string */ #define OCFS2_ORPHAN_NAMELEN ((int)(2 * sizeof(u64))) +static atomic_t ocfs2_num_deferred_orphans = ATOMIC_INIT(0); +#define OCFS2_MAX_DEFERRED_ORPHANS 10000 + static struct dentry *ocfs2_lookup(struct inode *dir, struct dentry *dentry, struct nameidata *nd) { @@ -116,7 +119,7 @@ static struct dentry *ocfs2_lookup(struct inode *dir, struct dentry *dentry, mlog(0, "find name %.*s in directory %llu\n", dentry->d_name.len, dentry->d_name.name, (unsigned long long)OCFS2_I(dir)->ip_blkno); - status = ocfs2_meta_lock(dir, NULL, 0); + status = ocfs2_inode_lock(dir, NULL, 0); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -176,8 +179,8 @@ bail_unlock: /* Don't drop the cluster lock until *after* the d_add -- * unlink on another node will message us to remove that * dentry under this lock so otherwise we can race this with - * the vote thread and have a stale dentry. */ - ocfs2_meta_unlock(dir, 0); + * the downconvert thread and have a stale dentry. */ + ocfs2_inode_unlock(dir, 0); bail: @@ -209,7 +212,7 @@ static int ocfs2_mknod(struct inode *dir, /* get our super block */ osb = OCFS2_SB(dir->i_sb); - status = ocfs2_meta_lock(dir, &parent_fe_bh, 1); + status = ocfs2_inode_lock(dir, &parent_fe_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -323,7 +326,7 @@ leave: if (handle) ocfs2_commit_trans(osb, handle); - ocfs2_meta_unlock(dir, 1); + ocfs2_inode_unlock(dir, 1); if (status == -ENOSPC) mlog(0, "Disk is full\n"); @@ -553,7 +556,7 @@ static int ocfs2_link(struct dentry *old_dentry, if (S_ISDIR(inode->i_mode)) return -EPERM; - err = ocfs2_meta_lock(dir, &parent_fe_bh, 1); + err = ocfs2_inode_lock(dir, &parent_fe_bh, 1); if (err < 0) { if (err != -ENOENT) mlog_errno(err); @@ -578,7 +581,7 @@ static int ocfs2_link(struct dentry *old_dentry, goto out; } - err = ocfs2_meta_lock(inode, &fe_bh, 1); + err = ocfs2_inode_lock(inode, &fe_bh, 1); if (err < 0) { if (err != -ENOENT) mlog_errno(err); @@ -643,10 +646,10 @@ static int ocfs2_link(struct dentry *old_dentry, out_commit: ocfs2_commit_trans(osb, handle); out_unlock_inode: - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); out: - ocfs2_meta_unlock(dir, 1); + ocfs2_inode_unlock(dir, 1); if (de_bh) brelse(de_bh); @@ -661,6 +664,115 @@ out: } /* + * Deferred cleanup of orphaned inodes + * + * Unlink in Ocfs2 is rather expensive due to cluster locking + * (network) overhead. We can speed up the 2nd half of unlink by + * deferring cleanup of orphaned inodes to the ocfs2 work queue. This + * gains us about a 25% speedup during large unlink storms. + * + * We're careful to only do this on inodes that have been orphaned and + * sucesfully put in the orphan dir. This leaves a clear trail for + * recovery to follow should the machine crash. + */ + +static void __ocfs2_cleanup_orphans(struct ocfs2_super *osb) +{ + struct inode *inode; + + spin_lock(&osb->osb_lock); + inode = osb->osb_first_unlinked_inode; + while (inode) { + osb->osb_first_unlinked_inode = OCFS2_I(inode)->ip_next_unlinked; + if (osb->osb_last_unlinked_inode == inode) + osb->osb_last_unlinked_inode = NULL; + OCFS2_I(inode)->ip_next_unlinked = NULL; + + spin_unlock(&osb->osb_lock); + + iput(inode); + atomic_dec(&ocfs2_num_deferred_orphans); + + cond_resched(); + + spin_lock(&osb->osb_lock); + inode = osb->osb_first_unlinked_inode; + } + spin_unlock(&osb->osb_lock); +} + +void ocfs2_cleanup_orphans(struct work_struct *work) +{ + struct ocfs2_super *osb = container_of(work, struct ocfs2_super, + osb_orphan_cleanup_wq.work); + + __ocfs2_cleanup_orphans(osb); +} + +void ocfs2_flush_orphans(struct ocfs2_super *osb) +{ + if (ocfs2_mount_local(osb)) + return; + + cancel_delayed_work(&osb->osb_orphan_cleanup_wq); + flush_workqueue(ocfs2_wq); + + __ocfs2_cleanup_orphans(osb); +} + +static void ocfs2_defer_inode_delete(struct inode *inode) +{ + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb); + struct inode *last; + + if (ocfs2_mount_local(osb)) + return; + + if (inode->i_nlink > 0) + return; + + BUG_ON(OCFS2_I(inode)->ip_next_unlinked); + + /* + * Rate limit the machine-wide total number of deferred inode + * deletes. If we hit the limit, we can schedule an immediate + * flush of this mount point to free up some space. + */ + if (!atomic_add_unless(&ocfs2_num_deferred_orphans, 1, + OCFS2_MAX_DEFERRED_ORPHANS)) { + if (osb->osb_first_unlinked_inode) { + cancel_delayed_work(&osb->osb_orphan_cleanup_wq); + queue_delayed_work(ocfs2_wq, + &osb->osb_orphan_cleanup_wq, 0); + } + return; + } + + igrab(inode); + + /* + * Insert at the tail of the list so that we maintain a FIFO. + */ + spin_lock(&osb->osb_lock); + if (osb->osb_first_unlinked_inode == NULL) { + BUG_ON(osb->osb_last_unlinked_inode != NULL); + osb->osb_first_unlinked_inode = inode; + osb->osb_last_unlinked_inode = inode; + } else { + BUG_ON(osb->osb_last_unlinked_inode == NULL); + last = osb->osb_last_unlinked_inode; + OCFS2_I(last)->ip_next_unlinked = inode; + osb->osb_last_unlinked_inode = inode; + } + + spin_unlock(&osb->osb_lock); + + cancel_delayed_work(&osb->osb_orphan_cleanup_wq); + queue_delayed_work(ocfs2_wq, &osb->osb_orphan_cleanup_wq, + 2 * HZ); +} + +/* * Takes and drops an exclusive lock on the given dentry. This will * force other nodes to drop it. */ @@ -720,7 +832,7 @@ static int ocfs2_unlink(struct inode *dir, return -EPERM; } - status = ocfs2_meta_lock(dir, &parent_node_bh, 1); + status = ocfs2_inode_lock(dir, &parent_node_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -745,7 +857,7 @@ static int ocfs2_unlink(struct inode *dir, goto leave; } - status = ocfs2_meta_lock(inode, &fe_bh, 1); + status = ocfs2_inode_lock(inode, &fe_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -765,7 +877,7 @@ static int ocfs2_unlink(struct inode *dir, status = ocfs2_remote_dentry_delete(dentry); if (status < 0) { - /* This vote should succeed under all normal + /* This remote delete should succeed under all normal * circumstances. */ mlog_errno(status); goto leave; @@ -825,6 +937,8 @@ static int ocfs2_unlink(struct inode *dir, goto leave; } + ocfs2_defer_inode_delete(inode); + dir->i_ctime = dir->i_mtime = CURRENT_TIME; if (S_ISDIR(inode->i_mode)) drop_nlink(dir); @@ -841,13 +955,13 @@ leave: ocfs2_commit_trans(osb, handle); if (child_locked) - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); - ocfs2_meta_unlock(dir, 1); + ocfs2_inode_unlock(dir, 1); if (orphan_dir) { /* This was locked for us in ocfs2_prepare_orphan_dir() */ - ocfs2_meta_unlock(orphan_dir, 1); + ocfs2_inode_unlock(orphan_dir, 1); mutex_unlock(&orphan_dir->i_mutex); iput(orphan_dir); } @@ -908,7 +1022,7 @@ static int ocfs2_double_lock(struct ocfs2_super *osb, inode1 = tmpinode; } /* lock id2 */ - status = ocfs2_meta_lock(inode2, bh2, 1); + status = ocfs2_inode_lock(inode2, bh2, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -917,14 +1031,14 @@ static int ocfs2_double_lock(struct ocfs2_super *osb, } /* lock id1 */ - status = ocfs2_meta_lock(inode1, bh1, 1); + status = ocfs2_inode_lock(inode1, bh1, 1); if (status < 0) { /* * An error return must mean that no cluster locks * were held on function exit. */ if (oi1->ip_blkno != oi2->ip_blkno) - ocfs2_meta_unlock(inode2, 1); + ocfs2_inode_unlock(inode2, 1); if (status != -ENOENT) mlog_errno(status); @@ -937,10 +1051,10 @@ bail: static void ocfs2_double_unlock(struct inode *inode1, struct inode *inode2) { - ocfs2_meta_unlock(inode1, 1); + ocfs2_inode_unlock(inode1, 1); if (inode1 != inode2) - ocfs2_meta_unlock(inode2, 1); + ocfs2_inode_unlock(inode2, 1); } static int ocfs2_rename(struct inode *old_dir, @@ -1031,10 +1145,11 @@ static int ocfs2_rename(struct inode *old_dir, /* * Aside from allowing a meta data update, the locking here - * also ensures that the vote thread on other nodes won't have - * to concurrently downconvert the inode and the dentry locks. + * also ensures that the downconvert thread on other nodes + * won't have to concurrently downconvert the inode and the + * dentry locks. */ - status = ocfs2_meta_lock(old_inode, &old_inode_bh, 1); + status = ocfs2_inode_lock(old_inode, &old_inode_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -1143,7 +1258,7 @@ static int ocfs2_rename(struct inode *old_dir, goto bail; } - status = ocfs2_meta_lock(new_inode, &newfe_bh, 1); + status = ocfs2_inode_lock(new_inode, &newfe_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -1355,14 +1470,14 @@ bail: ocfs2_double_unlock(old_dir, new_dir); if (old_child_locked) - ocfs2_meta_unlock(old_inode, 1); + ocfs2_inode_unlock(old_inode, 1); if (new_child_locked) - ocfs2_meta_unlock(new_inode, 1); + ocfs2_inode_unlock(new_inode, 1); if (orphan_dir) { /* This was locked for us in ocfs2_prepare_orphan_dir() */ - ocfs2_meta_unlock(orphan_dir, 1); + ocfs2_inode_unlock(orphan_dir, 1); mutex_unlock(&orphan_dir->i_mutex); iput(orphan_dir); } @@ -1530,7 +1645,7 @@ static int ocfs2_symlink(struct inode *dir, credits = ocfs2_calc_symlink_credits(sb); /* lock the parent directory */ - status = ocfs2_meta_lock(dir, &parent_fe_bh, 1); + status = ocfs2_inode_lock(dir, &parent_fe_bh, 1); if (status < 0) { if (status != -ENOENT) mlog_errno(status); @@ -1657,7 +1772,7 @@ bail: if (handle) ocfs2_commit_trans(osb, handle); - ocfs2_meta_unlock(dir, 1); + ocfs2_inode_unlock(dir, 1); if (new_fe_bh) brelse(new_fe_bh); @@ -1735,7 +1850,7 @@ static int ocfs2_prepare_orphan_dir(struct ocfs2_super *osb, mutex_lock(&orphan_dir_inode->i_mutex); - status = ocfs2_meta_lock(orphan_dir_inode, &orphan_dir_bh, 1); + status = ocfs2_inode_lock(orphan_dir_inode, &orphan_dir_bh, 1); if (status < 0) { mlog_errno(status); goto leave; @@ -1745,7 +1860,7 @@ static int ocfs2_prepare_orphan_dir(struct ocfs2_super *osb, orphan_dir_bh, name, OCFS2_ORPHAN_NAMELEN, de_bh); if (status < 0) { - ocfs2_meta_unlock(orphan_dir_inode, 1); + ocfs2_inode_unlock(orphan_dir_inode, 1); mlog_errno(status); goto leave; diff --git a/fs/ocfs2/namei.h b/fs/ocfs2/namei.h index 688aef6..f9aab81 100644 --- a/fs/ocfs2/namei.h +++ b/fs/ocfs2/namei.h @@ -36,4 +36,7 @@ int ocfs2_orphan_del(struct ocfs2_super *osb, struct inode *inode, struct buffer_head *orphan_dir_bh); +void ocfs2_cleanup_orphans(struct work_struct *work); +void ocfs2_flush_orphans(struct ocfs2_super *osb); + #endif /* OCFS2_NAMEI_H */ diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h index 60a23e1..955b132 100644 --- a/fs/ocfs2/ocfs2.h +++ b/fs/ocfs2/ocfs2.h @@ -189,9 +189,7 @@ struct ocfs2_super struct ocfs2_slot_info *slot_info; spinlock_t node_map_lock; - struct ocfs2_node_map mounted_map; struct ocfs2_node_map recovery_map; - struct ocfs2_node_map umount_map; u64 root_blkno; u64 system_dir_blkno; @@ -231,6 +229,7 @@ struct ocfs2_super wait_queue_head_t checkpoint_event; atomic_t needs_checkpoint; struct ocfs2_journal *journal; + unsigned long osb_commit_interval; enum ocfs2_local_alloc_state local_alloc_state; struct buffer_head *local_alloc_bh; @@ -254,28 +253,15 @@ struct ocfs2_super wait_queue_head_t recovery_event; - spinlock_t vote_task_lock; - struct task_struct *vote_task; - wait_queue_head_t vote_event; - unsigned long vote_wake_sequence; - unsigned long vote_work_sequence; + spinlock_t dc_task_lock; + struct task_struct *dc_task; + wait_queue_head_t dc_event; + unsigned long dc_wake_sequence; + unsigned long dc_work_sequence; struct list_head blocked_lock_list; unsigned long blocked_lock_count; - struct list_head vote_list; - int vote_count; - - u32 net_key; - spinlock_t net_response_lock; - unsigned int net_response_ids; - struct list_head net_response_list; - - struct o2hb_callback_func osb_hb_up; - struct o2hb_callback_func osb_hb_down; - - struct list_head osb_net_handlers; - wait_queue_head_t osb_mount_event; /* Truncate log info */ @@ -286,6 +272,10 @@ struct ocfs2_super struct ocfs2_node_map osb_recovering_orphan_dirs; unsigned int *osb_orphan_wipes; wait_queue_head_t osb_wipe_event; + + struct delayed_work osb_orphan_cleanup_wq; + struct inode *osb_first_unlinked_inode; + struct inode *osb_last_unlinked_inode; }; #define OCFS2_SB(sb) ((struct ocfs2_super *)(sb)->s_fs_info) diff --git a/fs/ocfs2/slot_map.c b/fs/ocfs2/slot_map.c index af4882b..3a50ce5 100644 --- a/fs/ocfs2/slot_map.c +++ b/fs/ocfs2/slot_map.c @@ -48,25 +48,6 @@ static void __ocfs2_fill_slot(struct ocfs2_slot_info *si, s16 slot_num, s16 node_num); -/* Use the slot information we've collected to create a map of mounted - * nodes. Should be holding an EX on super block. assumes slot info is - * up to date. Note that we call this *after* we find a slot, so our - * own node should be set in the map too... */ -void ocfs2_populate_mounted_map(struct ocfs2_super *osb) -{ - int i; - struct ocfs2_slot_info *si = osb->slot_info; - - spin_lock(&si->si_lock); - - for (i = 0; i < si->si_size; i++) - if (si->si_global_node_nums[i] != OCFS2_INVALID_SLOT) - ocfs2_node_map_set_bit(osb, &osb->mounted_map, - si->si_global_node_nums[i]); - - spin_unlock(&si->si_lock); -} - /* post the slot information on disk into our slot_info struct. */ void ocfs2_update_slot_info(struct ocfs2_slot_info *si) { diff --git a/fs/ocfs2/slot_map.h b/fs/ocfs2/slot_map.h index d8c8cee..1025872 100644 --- a/fs/ocfs2/slot_map.h +++ b/fs/ocfs2/slot_map.h @@ -52,8 +52,6 @@ s16 ocfs2_node_num_to_slot(struct ocfs2_slot_info *si, void ocfs2_clear_slot(struct ocfs2_slot_info *si, s16 slot_num); -void ocfs2_populate_mounted_map(struct ocfs2_super *osb); - static inline int ocfs2_is_empty_slot(struct ocfs2_slot_info *si, int slot_num) { diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c index 8f09f52..6df4dbf 100644 --- a/fs/ocfs2/suballoc.c +++ b/fs/ocfs2/suballoc.c @@ -114,7 +114,7 @@ void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac) if (inode) { if (ac->ac_which != OCFS2_AC_USE_LOCAL) - ocfs2_meta_unlock(inode, 1); + ocfs2_inode_unlock(inode, 1); mutex_unlock(&inode->i_mutex); @@ -412,7 +412,7 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb, mutex_lock(&alloc_inode->i_mutex); - status = ocfs2_meta_lock(alloc_inode, &bh, 1); + status = ocfs2_inode_lock(alloc_inode, &bh, 1); if (status < 0) { mutex_unlock(&alloc_inode->i_mutex); iput(alloc_inode); diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 5ee7754..dc85677 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -65,7 +65,6 @@ #include "sysfile.h" #include "uptodate.h" #include "ver.h" -#include "vote.h" #include "buffer_head_io.h" @@ -84,6 +83,7 @@ MODULE_LICENSE("GPL"); struct mount_options { + unsigned long commit_interval; unsigned long mount_opt; unsigned int atime_quantum; signed short slot; @@ -150,6 +150,7 @@ enum { Opt_data_writeback, Opt_atime_quantum, Opt_slot, + Opt_commit, Opt_err, }; @@ -165,6 +166,7 @@ static match_table_t tokens = { {Opt_data_writeback, "data=writeback"}, {Opt_atime_quantum, "atime_quantum=%u"}, {Opt_slot, "preferred_slot=%u"}, + {Opt_commit, "commit=%u"}, {Opt_err, NULL} }; @@ -443,6 +445,8 @@ unlock_osb: osb->s_mount_opt = parsed_options.mount_opt; osb->s_atime_quantum = parsed_options.atime_quantum; osb->preferred_slot = parsed_options.slot; + if (parsed_options.commit_interval) + osb->osb_commit_interval = parsed_options.commit_interval; if (!ocfs2_is_hard_readonly(osb)) ocfs2_set_journal_params(osb); @@ -597,6 +601,7 @@ static int ocfs2_fill_super(struct super_block *sb, void *data, int silent) osb->s_mount_opt = parsed_options.mount_opt; osb->s_atime_quantum = parsed_options.atime_quantum; osb->preferred_slot = parsed_options.slot; + osb->osb_commit_interval = parsed_options.commit_interval; sb->s_magic = OCFS2_SUPER_MAGIC; @@ -747,6 +752,7 @@ static int ocfs2_parse_options(struct super_block *sb, mlog_entry("remount: %d, options: \"%s\"\n", is_remount, options ? options : "(none)"); + mopt->commit_interval = 0; mopt->mount_opt = 0; mopt->atime_quantum = OCFS2_DEFAULT_ATIME_QUANTUM; mopt->slot = OCFS2_INVALID_SLOT; @@ -816,6 +822,18 @@ static int ocfs2_parse_options(struct super_block *sb, if (option) mopt->slot = (s16)option; break; + case Opt_commit: + option = 0; + if (match_int(&args[0], &option)) { + status = 0; + goto bail; + } + if (option < 0) + return 0; + if (option == 0) + option = JBD_DEFAULT_MAX_COMMIT_AGE; + mopt->commit_interval = HZ * option; + break; default: mlog(ML_ERROR, "Unrecognized mount option \"%s\" " @@ -864,6 +882,10 @@ static int ocfs2_show_options(struct seq_file *s, struct vfsmount *mnt) if (osb->s_atime_quantum != OCFS2_DEFAULT_ATIME_QUANTUM) seq_printf(s, ",atime_quantum=%u", osb->s_atime_quantum); + if (osb->osb_commit_interval) + seq_printf(s, ",commit=%u", + (unsigned) (osb->osb_commit_interval / HZ)); + return 0; } @@ -965,7 +987,7 @@ static int ocfs2_statfs(struct dentry *dentry, struct kstatfs *buf) goto bail; } - status = ocfs2_meta_lock(inode, &bh, 0); + status = ocfs2_inode_lock(inode, &bh, 0); if (status < 0) { mlog_errno(status); goto bail; @@ -989,7 +1011,7 @@ static int ocfs2_statfs(struct dentry *dentry, struct kstatfs *buf) brelse(bh); - ocfs2_meta_unlock(inode, 0); + ocfs2_inode_unlock(inode, 0); status = 0; bail: if (inode) @@ -1019,9 +1041,10 @@ static void ocfs2_inode_init_once(struct kmem_cache *cachep, void *data) oi->ip_blkno = 0ULL; oi->ip_clusters = 0; + oi->ip_next_unlinked = NULL; + ocfs2_lock_res_init_once(&oi->ip_rw_lockres); - ocfs2_lock_res_init_once(&oi->ip_meta_lockres); - ocfs2_lock_res_init_once(&oi->ip_data_lockres); + ocfs2_lock_res_init_once(&oi->ip_inode_lockres); ocfs2_lock_res_init_once(&oi->ip_open_lockres); ocfs2_metadata_cache_init(&oi->vfs_inode); @@ -1117,25 +1140,12 @@ static int ocfs2_mount_volume(struct super_block *sb) goto leave; } - status = ocfs2_register_hb_callbacks(osb); - if (status < 0) { - mlog_errno(status); - goto leave; - } - status = ocfs2_dlm_init(osb); if (status < 0) { mlog_errno(status); goto leave; } - /* requires vote_thread to be running. */ - status = ocfs2_register_net_handlers(osb); - if (status < 0) { - mlog_errno(status); - goto leave; - } - status = ocfs2_super_lock(osb, 1); if (status < 0) { mlog_errno(status); @@ -1150,8 +1160,6 @@ static int ocfs2_mount_volume(struct super_block *sb) goto leave; } - ocfs2_populate_mounted_map(osb); - /* load all node-local system inodes */ status = ocfs2_init_local_system_inodes(osb); if (status < 0) { @@ -1174,15 +1182,6 @@ static int ocfs2_mount_volume(struct super_block *sb) if (ocfs2_mount_local(osb)) goto leave; - /* This should be sent *after* we recovered our journal as it - * will cause other nodes to unmark us as needing - * recovery. However, we need to send it *before* dropping the - * super block lock as otherwise their recovery threads might - * try to clean us up while we're live! */ - status = ocfs2_request_mount_vote(osb); - if (status < 0) - mlog_errno(status); - leave: if (unlock_super) ocfs2_super_unlock(osb, 1); @@ -1212,6 +1211,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err) osb = OCFS2_SB(sb); BUG_ON(!osb); + ocfs2_flush_orphans(osb); + ocfs2_shutdown_local_alloc(osb); ocfs2_truncate_log_shutdown(osb); @@ -1240,10 +1241,6 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err) mlog_errno(tmp); return; } - - tmp = ocfs2_request_umount_vote(osb); - if (tmp < 0) - mlog_errno(tmp); } if (osb->slot_num != OCFS2_INVALID_SLOT) @@ -1254,13 +1251,8 @@ static void ocfs2_dismount_volume(struct super_block *sb, int mnt_err) ocfs2_release_system_inodes(osb); - if (osb->dlm) { - ocfs2_unregister_net_handlers(osb); - + if (osb->dlm) ocfs2_dlm_shutdown(osb); - } - - ocfs2_clear_hb_callbacks(osb); debugfs_remove(osb->osb_debug_root); @@ -1315,7 +1307,6 @@ static int ocfs2_initialize_super(struct super_block *sb, int i, cbits, bbits; struct ocfs2_dinode *di = (struct ocfs2_dinode *)bh->b_data; struct inode *inode = NULL; - struct buffer_head *bitmap_bh = NULL; struct ocfs2_journal *journal; __le32 uuid_net_key; struct ocfs2_super *osb; @@ -1344,19 +1335,13 @@ static int ocfs2_initialize_super(struct super_block *sb, osb->s_sectsize_bits = blksize_bits(sector_size); BUG_ON(!osb->s_sectsize_bits); - osb->net_response_ids = 0; - spin_lock_init(&osb->net_response_lock); - INIT_LIST_HEAD(&osb->net_response_list); - - INIT_LIST_HEAD(&osb->osb_net_handlers); init_waitqueue_head(&osb->recovery_event); - spin_lock_init(&osb->vote_task_lock); - init_waitqueue_head(&osb->vote_event); - osb->vote_work_sequence = 0; - osb->vote_wake_sequence = 0; + spin_lock_init(&osb->dc_task_lock); + init_waitqueue_head(&osb->dc_event); + osb->dc_work_sequence = 0; + osb->dc_wake_sequence = 0; INIT_LIST_HEAD(&osb->blocked_lock_list); osb->blocked_lock_count = 0; - INIT_LIST_HEAD(&osb->vote_list); spin_lock_init(&osb->osb_lock); atomic_set(&osb->alloc_stats.moves, 0); @@ -1466,6 +1451,8 @@ static int ocfs2_initialize_super(struct super_block *sb, INIT_WORK(&journal->j_recovery_work, ocfs2_complete_recovery); journal->j_state = OCFS2_JOURNAL_FREE; + INIT_DELAYED_WORK(&osb->osb_orphan_cleanup_wq, ocfs2_cleanup_orphans); + /* get some pseudo constants for clustersize bits */ osb->s_clustersize_bits = le32_to_cpu(di->id2.i_super.s_clustersize_bits); @@ -1496,7 +1483,6 @@ static int ocfs2_initialize_super(struct super_block *sb, } memcpy(&uuid_net_key, di->id2.i_super.s_uuid, sizeof(uuid_net_key)); - osb->net_key = le32_to_cpu(uuid_net_key); strncpy(osb->vol_label, di->id2.i_super.s_label, 63); osb->vol_label[63] = '\0'; @@ -1539,25 +1525,9 @@ static int ocfs2_initialize_super(struct super_block *sb, } osb->bitmap_blkno = OCFS2_I(inode)->ip_blkno; - - /* We don't have a cluster lock on the bitmap here because - * we're only interested in static information and the extra - * complexity at mount time isn't worht it. Don't pass the - * inode in to the read function though as we don't want it to - * be put in the cache. */ - status = ocfs2_read_block(osb, osb->bitmap_blkno, &bitmap_bh, 0, - NULL); iput(inode); - if (status < 0) { - mlog_errno(status); - goto bail; - } - di = (struct ocfs2_dinode *) bitmap_bh->b_data; - osb->bitmap_cpg = le16_to_cpu(di->id2.i_chain.cl_cpg); - brelse(bitmap_bh); - mlog(0, "cluster bitmap inode: %llu, clusters per group: %u\n", - (unsigned long long)osb->bitmap_blkno, osb->bitmap_cpg); + osb->bitmap_cpg = ocfs2_group_bitmap_size(sb) * 8; status = ocfs2_init_slot_info(osb); if (status < 0) { diff --git a/fs/ocfs2/ver.c b/fs/ocfs2/ver.c index 5405ce1..e2488f4 100644 --- a/fs/ocfs2/ver.c +++ b/fs/ocfs2/ver.c @@ -29,7 +29,7 @@ #include "ver.h" -#define OCFS2_BUILD_VERSION "1.3.3" +#define OCFS2_BUILD_VERSION "1.5.0" #define VERSION_STR "OCFS2 " OCFS2_BUILD_VERSION diff --git a/fs/ocfs2/vote.c b/fs/ocfs2/vote.c deleted file mode 100644 index c053585..0000000 --- a/fs/ocfs2/vote.c +++ /dev/null @@ -1,756 +0,0 @@ -/* -*- mode: c; c-basic-offset: 8; -*- - * vim: noexpandtab sw=8 ts=8 sts=0: - * - * vote.c - * - * description here - * - * Copyright (C) 2003, 2004 Oracle. All rights reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License as published by the Free Software Foundation; either - * version 2 of the License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place - Suite 330, - * Boston, MA 021110-1307, USA. - */ - -#include -#include -#include -#include - -#include -#include -#include - -#include - -#define MLOG_MASK_PREFIX ML_VOTE -#include - -#include "ocfs2.h" - -#include "alloc.h" -#include "dlmglue.h" -#include "extent_map.h" -#include "heartbeat.h" -#include "inode.h" -#include "journal.h" -#include "slot_map.h" -#include "vote.h" - -#include "buffer_head_io.h" - -#define OCFS2_MESSAGE_TYPE_VOTE (0x1) -#define OCFS2_MESSAGE_TYPE_RESPONSE (0x2) -struct ocfs2_msg_hdr -{ - __be32 h_response_id; /* used to lookup message handle on sending - * node. */ - __be32 h_request; - __be64 h_blkno; - __be32 h_generation; - __be32 h_node_num; /* node sending this particular message. */ -}; - -struct ocfs2_vote_msg -{ - struct ocfs2_msg_hdr v_hdr; - __be32 v_reserved1; -} __attribute__ ((packed)); - -/* Responses are given these values to maintain backwards - * compatibility with older ocfs2 versions */ -#define OCFS2_RESPONSE_OK (0) -#define OCFS2_RESPONSE_BUSY (-16) -#define OCFS2_RESPONSE_BAD_MSG (-22) - -struct ocfs2_response_msg -{ - struct ocfs2_msg_hdr r_hdr; - __be32 r_response; -} __attribute__ ((packed)); - -struct ocfs2_vote_work { - struct list_head w_list; - struct ocfs2_vote_msg w_msg; -}; - -enum ocfs2_vote_request { - OCFS2_VOTE_REQ_INVALID = 0, - OCFS2_VOTE_REQ_MOUNT, - OCFS2_VOTE_REQ_UMOUNT, - OCFS2_VOTE_REQ_LAST -}; - -static inline int ocfs2_is_valid_vote_request(int request) -{ - return OCFS2_VOTE_REQ_INVALID < request && - request < OCFS2_VOTE_REQ_LAST; -} - -typedef void (*ocfs2_net_response_callback)(void *priv, - struct ocfs2_response_msg *resp); -struct ocfs2_net_response_cb { - ocfs2_net_response_callback rc_cb; - void *rc_priv; -}; - -struct ocfs2_net_wait_ctxt { - struct list_head n_list; - u32 n_response_id; - wait_queue_head_t n_event; - struct ocfs2_node_map n_node_map; - int n_response; /* an agreggate response. 0 if - * all nodes are go, < 0 on any - * negative response from any - * node or network error. */ - struct ocfs2_net_response_cb *n_callback; -}; - -static void ocfs2_process_mount_request(struct ocfs2_super *osb, - unsigned int node_num) -{ - mlog(0, "MOUNT vote from node %u\n", node_num); - /* The other node only sends us this message when he has an EX - * on the superblock, so our recovery threads (if having been - * launched) are waiting on it.*/ - ocfs2_recovery_map_clear(osb, node_num); - ocfs2_node_map_set_bit(osb, &osb->mounted_map, node_num); - - /* We clear the umount map here because a node may have been - * previously mounted, safely unmounted but never stopped - * heartbeating - in which case we'd have a stale entry. */ - ocfs2_node_map_clear_bit(osb, &osb->umount_map, node_num); -} - -static void ocfs2_process_umount_request(struct ocfs2_super *osb, - unsigned int node_num) -{ - mlog(0, "UMOUNT vote from node %u\n", node_num); - ocfs2_node_map_clear_bit(osb, &osb->mounted_map, node_num); - ocfs2_node_map_set_bit(osb, &osb->umount_map, node_num); -} - -static void ocfs2_process_vote(struct ocfs2_super *osb, - struct ocfs2_vote_msg *msg) -{ - int net_status, vote_response; - unsigned int node_num; - u64 blkno; - enum ocfs2_vote_request request; - struct ocfs2_msg_hdr *hdr = &msg->v_hdr; - struct ocfs2_response_msg response; - - /* decode the network mumbo jumbo into local variables. */ - request = be32_to_cpu(hdr->h_request); - blkno = be64_to_cpu(hdr->h_blkno); - node_num = be32_to_cpu(hdr->h_node_num); - - mlog(0, "processing vote: request = %u, blkno = %llu, node_num = %u\n", - request, (unsigned long long)blkno, node_num); - - if (!ocfs2_is_valid_vote_request(request)) { - mlog(ML_ERROR, "Invalid vote request %d from node %u\n", - request, node_num); - vote_response = OCFS2_RESPONSE_BAD_MSG; - goto respond; - } - - vote_response = OCFS2_RESPONSE_OK; - - switch (request) { - case OCFS2_VOTE_REQ_UMOUNT: - ocfs2_process_umount_request(osb, node_num); - goto respond; - case OCFS2_VOTE_REQ_MOUNT: - ocfs2_process_mount_request(osb, node_num); - goto respond; - default: - /* avoids a gcc warning */ - break; - } - -respond: - /* Response struture is small so we just put it on the stack - * and stuff it inline. */ - memset(&response, 0, sizeof(struct ocfs2_response_msg)); - response.r_hdr.h_response_id = hdr->h_response_id; - response.r_hdr.h_blkno = hdr->h_blkno; - response.r_hdr.h_generation = hdr->h_generation; - response.r_hdr.h_node_num = cpu_to_be32(osb->node_num); - response.r_response = cpu_to_be32(vote_response); - - net_status = o2net_send_message(OCFS2_MESSAGE_TYPE_RESPONSE, - osb->net_key, - &response, - sizeof(struct ocfs2_response_msg), - node_num, - NULL); - /* We still want to error print for ENOPROTOOPT here. The - * sending node shouldn't have unregistered his net handler - * without sending an unmount vote 1st */ - if (net_status < 0 - && net_status != -ETIMEDOUT - && net_status != -ENOTCONN) - mlog(ML_ERROR, "message to node %u fails with error %d!\n", - node_num, net_status); -} - -static void ocfs2_vote_thread_do_work(struct ocfs2_super *osb) -{ - unsigned long processed; - struct ocfs2_lock_res *lockres; - struct ocfs2_vote_work *work; - - mlog_entry_void(); - - spin_lock(&osb->vote_task_lock); - /* grab this early so we know to try again if a state change and - * wake happens part-way through our work */ - osb->vote_work_sequence = osb->vote_wake_sequence; - - processed = osb->blocked_lock_count; - while (processed) { - BUG_ON(list_empty(&osb->blocked_lock_list)); - - lockres = list_entry(osb->blocked_lock_list.next, - struct ocfs2_lock_res, l_blocked_list); - list_del_init(&lockres->l_blocked_list); - osb->blocked_lock_count--; - spin_unlock(&osb->vote_task_lock); - - BUG_ON(!processed); - processed--; - - ocfs2_process_blocked_lock(osb, lockres); - - spin_lock(&osb->vote_task_lock); - } - - while (osb->vote_count) { - BUG_ON(list_empty(&osb->vote_list)); - work = list_entry(osb->vote_list.next, - struct ocfs2_vote_work, w_list); - list_del(&work->w_list); - osb->vote_count--; - spin_unlock(&osb->vote_task_lock); - - ocfs2_process_vote(osb, &work->w_msg); - kfree(work); - - spin_lock(&osb->vote_task_lock); - } - spin_unlock(&osb->vote_task_lock); - - mlog_exit_void(); -} - -static int ocfs2_vote_thread_lists_empty(struct ocfs2_super *osb) -{ - int empty = 0; - - spin_lock(&osb->vote_task_lock); - if (list_empty(&osb->blocked_lock_list) && - list_empty(&osb->vote_list)) - empty = 1; - - spin_unlock(&osb->vote_task_lock); - return empty; -} - -static int ocfs2_vote_thread_should_wake(struct ocfs2_super *osb) -{ - int should_wake = 0; - - spin_lock(&osb->vote_task_lock); - if (osb->vote_work_sequence != osb->vote_wake_sequence) - should_wake = 1; - spin_unlock(&osb->vote_task_lock); - - return should_wake; -} - -int ocfs2_vote_thread(void *arg) -{ - int status = 0; - struct ocfs2_super *osb = arg; - - /* only quit once we've been asked to stop and there is no more - * work available */ - while (!(kthread_should_stop() && - ocfs2_vote_thread_lists_empty(osb))) { - - wait_event_interruptible(osb->vote_event, - ocfs2_vote_thread_should_wake(osb) || - kthread_should_stop()); - - mlog(0, "vote_thread: awoken\n"); - - ocfs2_vote_thread_do_work(osb); - } - - osb->vote_task = NULL; - return status; -} - -static struct ocfs2_net_wait_ctxt *ocfs2_new_net_wait_ctxt(unsigned int response_id) -{ - struct ocfs2_net_wait_ctxt *w; - - w = kzalloc(sizeof(*w), GFP_NOFS); - if (!w) { - mlog_errno(-ENOMEM); - goto bail; - } - - INIT_LIST_HEAD(&w->n_list); - init_waitqueue_head(&w->n_event); - ocfs2_node_map_init(&w->n_node_map); - w->n_response_id = response_id; - w->n_callback = NULL; -bail: - return w; -} - -static unsigned int ocfs2_new_response_id(struct ocfs2_super *osb) -{ - unsigned int ret; - - spin_lock(&osb->net_response_lock); - ret = ++osb->net_response_ids; - spin_unlock(&osb->net_response_lock); - - return ret; -} - -static void ocfs2_dequeue_net_wait_ctxt(struct ocfs2_super *osb, - struct ocfs2_net_wait_ctxt *w) -{ - spin_lock(&osb->net_response_lock); - list_del(&w->n_list); - spin_unlock(&osb->net_response_lock); -} - -static void ocfs2_queue_net_wait_ctxt(struct ocfs2_super *osb, - struct ocfs2_net_wait_ctxt *w) -{ - spin_lock(&osb->net_response_lock); - list_add_tail(&w->n_list, - &osb->net_response_list); - spin_unlock(&osb->net_response_lock); -} - -static void __ocfs2_mark_node_responded(struct ocfs2_super *osb, - struct ocfs2_net_wait_ctxt *w, - int node_num) -{ - assert_spin_locked(&osb->net_response_lock); - - ocfs2_node_map_clear_bit(osb, &w->n_node_map, node_num); - if (ocfs2_node_map_is_empty(osb, &w->n_node_map)) - wake_up(&w->n_event); -} - -/* Intended to be called from the node down callback, we fake remove - * the node from all our response contexts */ -void ocfs2_remove_node_from_vote_queues(struct ocfs2_super *osb, - int node_num) -{ - struct list_head *p; - struct ocfs2_net_wait_ctxt *w = NULL; - - spin_lock(&osb->net_response_lock); - - list_for_each(p, &osb->net_response_list) { - w = list_entry(p, struct ocfs2_net_wait_ctxt, n_list); - - __ocfs2_mark_node_responded(osb, w, node_num); - } - - spin_unlock(&osb->net_response_lock); -} - -static int ocfs2_broadcast_vote(struct ocfs2_super *osb, - struct ocfs2_vote_msg *request, - unsigned int response_id, - int *response, - struct ocfs2_net_response_cb *callback) -{ - int status, i, remote_err; - struct ocfs2_net_wait_ctxt *w = NULL; - int dequeued = 0; - - mlog_entry_void(); - - w = ocfs2_new_net_wait_ctxt(response_id); - if (!w) { - status = -ENOMEM; - mlog_errno(status); - goto bail; - } - w->n_callback = callback; - - /* we're pretty much ready to go at this point, and this fills - * in n_response which we need anyway... */ - ocfs2_queue_net_wait_ctxt(osb, w); - - i = ocfs2_node_map_iterate(osb, &osb->mounted_map, 0); - - while (i != O2NM_INVALID_NODE_NUM) { - if (i != osb->node_num) { - mlog(0, "trying to send request to node %i\n", i); - ocfs2_node_map_set_bit(osb, &w->n_node_map, i); - - remote_err = 0; - status = o2net_send_message(OCFS2_MESSAGE_TYPE_VOTE, - osb->net_key, - request, - sizeof(*request), - i, - &remote_err); - if (status == -ETIMEDOUT) { - mlog(0, "remote node %d timed out!\n", i); - status = -EAGAIN; - goto bail; - } - if (remote_err < 0) { - status = remote_err; - mlog(0, "remote error %d on node %d!\n", - remote_err, i); - mlog_errno(status); - goto bail; - } - if (status < 0) { - mlog_errno(status); - goto bail; - } - } - i++; - i = ocfs2_node_map_iterate(osb, &osb->mounted_map, i); - mlog(0, "next is %d, i am %d\n", i, osb->node_num); - } - mlog(0, "done sending, now waiting on responses...\n"); - - wait_event(w->n_event, ocfs2_node_map_is_empty(osb, &w->n_node_map)); - - ocfs2_dequeue_net_wait_ctxt(osb, w); - dequeued = 1; - - *response = w->n_response; - status = 0; -bail: - if (w) { - if (!dequeued) - ocfs2_dequeue_net_wait_ctxt(osb, w); - kfree(w); - } - - mlog_exit(status); - return status; -} - -static struct ocfs2_vote_msg * ocfs2_new_vote_request(struct ocfs2_super *osb, - u64 blkno, - unsigned int generation, - enum ocfs2_vote_request type) -{ - struct ocfs2_vote_msg *request; - struct ocfs2_msg_hdr *hdr; - - BUG_ON(!ocfs2_is_valid_vote_request(type)); - - request = kzalloc(sizeof(*request), GFP_NOFS); - if (!request) { - mlog_errno(-ENOMEM); - } else { - hdr = &request->v_hdr; - hdr->h_node_num = cpu_to_be32(osb->node_num); - hdr->h_request = cpu_to_be32(type); - hdr->h_blkno = cpu_to_be64(blkno); - hdr->h_generation = cpu_to_be32(generation); - } - - return request; -} - -/* Complete the buildup of a new vote request and process the - * broadcast return value. */ -static int ocfs2_do_request_vote(struct ocfs2_super *osb, - struct ocfs2_vote_msg *request, - struct ocfs2_net_response_cb *callback) -{ - int status, response = -EBUSY; - unsigned int response_id; - struct ocfs2_msg_hdr *hdr; - - response_id = ocfs2_new_response_id(osb); - - hdr = &request->v_hdr; - hdr->h_response_id = cpu_to_be32(response_id); - - status = ocfs2_broadcast_vote(osb, request, response_id, &response, - callback); - if (status < 0) { - mlog_errno(status); - goto bail; - } - - status = response; -bail: - - return status; -} - -int ocfs2_request_mount_vote(struct ocfs2_super *osb) -{ - int status; - struct ocfs2_vote_msg *request = NULL; - - request = ocfs2_new_vote_request(osb, 0ULL, 0, OCFS2_VOTE_REQ_MOUNT); - if (!request) { - status = -ENOMEM; - goto bail; - } - - status = -EAGAIN; - while (status == -EAGAIN) { - if (!(osb->s_mount_opt & OCFS2_MOUNT_NOINTR) && - signal_pending(current)) { - status = -ERESTARTSYS; - goto bail; - } - - if (ocfs2_node_map_is_only(osb, &osb->mounted_map, - osb->node_num)) { - status = 0; - goto bail; - } - - status = ocfs2_do_request_vote(osb, request, NULL); - } - -bail: - kfree(request); - return status; -} - -int ocfs2_request_umount_vote(struct ocfs2_super *osb) -{ - int status; - struct ocfs2_vote_msg *request = NULL; - - request = ocfs2_new_vote_request(osb, 0ULL, 0, OCFS2_VOTE_REQ_UMOUNT); - if (!request) { - status = -ENOMEM; - goto bail; - } - - status = -EAGAIN; - while (status == -EAGAIN) { - /* Do not check signals on this vote... We really want - * this one to go all the way through. */ - - if (ocfs2_node_map_is_only(osb, &osb->mounted_map, - osb->node_num)) { - status = 0; - goto bail; - } - - status = ocfs2_do_request_vote(osb, request, NULL); - } - -bail: - kfree(request); - return status; -} - -/* TODO: This should eventually be a hash table! */ -static struct ocfs2_net_wait_ctxt * __ocfs2_find_net_wait_ctxt(struct ocfs2_super *osb, - u32 response_id) -{ - struct list_head *p; - struct ocfs2_net_wait_ctxt *w = NULL; - - list_for_each(p, &osb->net_response_list) { - w = list_entry(p, struct ocfs2_net_wait_ctxt, n_list); - if (response_id == w->n_response_id) - break; - w = NULL; - } - - return w; -} - -/* Translate response codes into local node errno values */ -static inline int ocfs2_translate_response(int response) -{ - int ret; - - switch (response) { - case OCFS2_RESPONSE_OK: - ret = 0; - break; - - case OCFS2_RESPONSE_BUSY: - ret = -EBUSY; - break; - - default: - ret = -EINVAL; - } - - return ret; -} - -static int ocfs2_handle_response_message(struct o2net_msg *msg, - u32 len, - void *data, void **ret_data) -{ - unsigned int response_id, node_num; - int response_status; - struct ocfs2_super *osb = data; - struct ocfs2_response_msg *resp; - struct ocfs2_net_wait_ctxt * w; - struct ocfs2_net_response_cb *resp_cb; - - resp = (struct ocfs2_response_msg *) msg->buf; - - response_id = be32_to_cpu(resp->r_hdr.h_response_id); - node_num = be32_to_cpu(resp->r_hdr.h_node_num); - response_status = - ocfs2_translate_response(be32_to_cpu(resp->r_response)); - - mlog(0, "received response message:\n"); - mlog(0, "h_response_id = %u\n", response_id); - mlog(0, "h_request = %u\n", be32_to_cpu(resp->r_hdr.h_request)); - mlog(0, "h_blkno = %llu\n", - (unsigned long long)be64_to_cpu(resp->r_hdr.h_blkno)); - mlog(0, "h_generation = %u\n", be32_to_cpu(resp->r_hdr.h_generation)); - mlog(0, "h_node_num = %u\n", node_num); - mlog(0, "r_response = %d\n", response_status); - - spin_lock(&osb->net_response_lock); - w = __ocfs2_find_net_wait_ctxt(osb, response_id); - if (!w) { - mlog(0, "request not found!\n"); - goto bail; - } - resp_cb = w->n_callback; - - if (response_status && (!w->n_response)) { - /* we only really need one negative response so don't - * set it twice. */ - w->n_response = response_status; - } - - if (resp_cb) { - spin_unlock(&osb->net_response_lock); - - resp_cb->rc_cb(resp_cb->rc_priv, resp); - - spin_lock(&osb->net_response_lock); - } - - __ocfs2_mark_node_responded(osb, w, node_num); -bail: - spin_unlock(&osb->net_response_lock); - - return 0; -} - -static int ocfs2_handle_vote_message(struct o2net_msg *msg, - u32 len, - void *data, void **ret_data) -{ - int status; - struct ocfs2_super *osb = data; - struct ocfs2_vote_work *work; - - work = kmalloc(sizeof(struct ocfs2_vote_work), GFP_NOFS); - if (!work) { - status = -ENOMEM; - mlog_errno(status); - goto bail; - } - - INIT_LIST_HEAD(&work->w_list); - memcpy(&work->w_msg, msg->buf, sizeof(struct ocfs2_vote_msg)); - - mlog(0, "scheduling vote request:\n"); - mlog(0, "h_response_id = %u\n", - be32_to_cpu(work->w_msg.v_hdr.h_response_id)); - mlog(0, "h_request = %u\n", be32_to_cpu(work->w_msg.v_hdr.h_request)); - mlog(0, "h_blkno = %llu\n", - (unsigned long long)be64_to_cpu(work->w_msg.v_hdr.h_blkno)); - mlog(0, "h_generation = %u\n", - be32_to_cpu(work->w_msg.v_hdr.h_generation)); - mlog(0, "h_node_num = %u\n", - be32_to_cpu(work->w_msg.v_hdr.h_node_num)); - - spin_lock(&osb->vote_task_lock); - list_add_tail(&work->w_list, &osb->vote_list); - osb->vote_count++; - spin_unlock(&osb->vote_task_lock); - - ocfs2_kick_vote_thread(osb); - - status = 0; -bail: - return status; -} - -void ocfs2_unregister_net_handlers(struct ocfs2_super *osb) -{ - if (!osb->net_key) - return; - - o2net_unregister_handler_list(&osb->osb_net_handlers); - - if (!list_empty(&osb->net_response_list)) - mlog(ML_ERROR, "net response list not empty!\n"); - - osb->net_key = 0; -} - -int ocfs2_register_net_handlers(struct ocfs2_super *osb) -{ - int status = 0; - - if (ocfs2_mount_local(osb)) - return 0; - - status = o2net_register_handler(OCFS2_MESSAGE_TYPE_RESPONSE, - osb->net_key, - sizeof(struct ocfs2_response_msg), - ocfs2_handle_response_message, - osb, NULL, &osb->osb_net_handlers); - if (status) { - mlog_errno(status); - goto bail; - } - - status = o2net_register_handler(OCFS2_MESSAGE_TYPE_VOTE, - osb->net_key, - sizeof(struct ocfs2_vote_msg), - ocfs2_handle_vote_message, - osb, NULL, &osb->osb_net_handlers); - if (status) { - mlog_errno(status); - goto bail; - } -bail: - if (status < 0) - ocfs2_unregister_net_handlers(osb); - - return status; -} diff --git a/fs/ocfs2/vote.h b/fs/ocfs2/vote.h deleted file mode 100644 index 9ea46f6..0000000 --- a/fs/ocfs2/vote.h +++ /dev/null @@ -1,48 +0,0 @@ -/* -*- mode: c; c-basic-offset: 8; -*- - * vim: noexpandtab sw=8 ts=8 sts=0: - * - * vote.h - * - * description here - * - * Copyright (C) 2002, 2004 Oracle. All rights reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License as published by the Free Software Foundation; either - * version 2 of the License, or (at your option) any later version. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place - Suite 330, - * Boston, MA 021110-1307, USA. - */ - - -#ifndef VOTE_H -#define VOTE_H - -int ocfs2_vote_thread(void *arg); -static inline void ocfs2_kick_vote_thread(struct ocfs2_super *osb) -{ - spin_lock(&osb->vote_task_lock); - /* make sure the voting thread gets a swipe at whatever changes - * the caller may have made to the voting state */ - osb->vote_wake_sequence++; - spin_unlock(&osb->vote_task_lock); - wake_up(&osb->vote_event); -} - -int ocfs2_request_mount_vote(struct ocfs2_super *osb); -int ocfs2_request_umount_vote(struct ocfs2_super *osb); -int ocfs2_register_net_handlers(struct ocfs2_super *osb); -void ocfs2_unregister_net_handlers(struct ocfs2_super *osb); - -void ocfs2_remove_node_from_vote_queues(struct ocfs2_super *osb, - int node_num); -#endif diff --git a/include/linux/Kbuild b/include/linux/Kbuild index 37bfa19..e5efea7 100644 --- a/include/linux/Kbuild +++ b/include/linux/Kbuild @@ -49,6 +49,7 @@ header-y += comstats.h header-y += const.h header-y += cgroupstats.h header-y += cycx_cfm.h +header-y += dlmconstants.h header-y += dlm_device.h header-y += dlm_netlink.h header-y += dm-ioctl.h diff --git a/include/linux/dlm.h b/include/linux/dlm.h index be9d278..c743fbc 100644 --- a/include/linux/dlm.h +++ b/include/linux/dlm.h @@ -19,148 +19,12 @@ * routines and structures to use DLM lockspaces */ -/* - * Lock Modes - */ +/* Lock levels and flags are here */ +#include -#define DLM_LOCK_IV -1 /* invalid */ -#define DLM_LOCK_NL 0 /* null */ -#define DLM_LOCK_CR 1 /* concurrent read */ -#define DLM_LOCK_CW 2 /* concurrent write */ -#define DLM_LOCK_PR 3 /* protected read */ -#define DLM_LOCK_PW 4 /* protected write */ -#define DLM_LOCK_EX 5 /* exclusive */ - -/* - * Maximum size in bytes of a dlm_lock name - */ #define DLM_RESNAME_MAXLEN 64 -/* - * Flags to dlm_lock - * - * DLM_LKF_NOQUEUE - * - * Do not queue the lock request on the wait queue if it cannot be granted - * immediately. If the lock cannot be granted because of this flag, DLM will - * either return -EAGAIN from the dlm_lock call or will return 0 from - * dlm_lock and -EAGAIN in the lock status block when the AST is executed. - * - * DLM_LKF_CANCEL - * - * Used to cancel a pending lock request or conversion. A converting lock is - * returned to its previously granted mode. - * - * DLM_LKF_CONVERT - * - * Indicates a lock conversion request. For conversions the name and namelen - * are ignored and the lock ID in the LKSB is used to identify the lock. - * - * DLM_LKF_VALBLK - * - * Requests DLM to return the current contents of the lock value block in the - * lock status block. When this flag is set in a lock conversion from PW or EX - * modes, DLM assigns the value specified in the lock status block to the lock - * value block of the lock resource. The LVB is a DLM_LVB_LEN size array - * containing application-specific information. - * - * DLM_LKF_QUECVT - * - * Force a conversion request to be queued, even if it is compatible with - * the granted modes of other locks on the same resource. - * - * DLM_LKF_IVVALBLK - * - * Invalidate the lock value block. - * - * DLM_LKF_CONVDEADLK - * - * Allows the dlm to resolve conversion deadlocks internally by demoting the - * granted mode of a converting lock to NL. The DLM_SBF_DEMOTED flag is - * returned for a conversion that's been effected by this. - * - * DLM_LKF_PERSISTENT - * - * Only relevant to locks originating in userspace. A persistent lock will not - * be removed if the process holding the lock exits. - * - * DLM_LKF_NODLCKWT - * - * Do not cancel the lock if it gets into conversion deadlock. - * Exclude this lock from being monitored due to DLM_LSFL_TIMEWARN. - * - * DLM_LKF_NODLCKBLK - * - * net yet implemented - * - * DLM_LKF_EXPEDITE - * - * Used only with new requests for NL mode locks. Tells the lock manager - * to grant the lock, ignoring other locks in convert and wait queues. - * - * DLM_LKF_NOQUEUEBAST - * - * Send blocking AST's before returning -EAGAIN to the caller. It is only - * used along with the NOQUEUE flag. Blocking AST's are not sent for failed - * NOQUEUE requests otherwise. - * - * DLM_LKF_HEADQUE - * - * Add a lock to the head of the convert or wait queue rather than the tail. - * - * DLM_LKF_NOORDER - * - * Disregard the standard grant order rules and grant a lock as soon as it - * is compatible with other granted locks. - * - * DLM_LKF_ORPHAN - * - * not yet implemented - * - * DLM_LKF_ALTPR - * - * If the requested mode cannot be granted immediately, try to grant the lock - * in PR mode instead. If this alternate mode is granted instead of the - * requested mode, DLM_SBF_ALTMODE is returned in the lksb. - * - * DLM_LKF_ALTCW - * - * The same as ALTPR, but the alternate mode is CW. - * - * DLM_LKF_FORCEUNLOCK - * - * Unlock the lock even if it is converting or waiting or has sublocks. - * Only really for use by the userland device.c code. - * - */ - -#define DLM_LKF_NOQUEUE 0x00000001 -#define DLM_LKF_CANCEL 0x00000002 -#define DLM_LKF_CONVERT 0x00000004 -#define DLM_LKF_VALBLK 0x00000008 -#define DLM_LKF_QUECVT 0x00000010 -#define DLM_LKF_IVVALBLK 0x00000020 -#define DLM_LKF_CONVDEADLK 0x00000040 -#define DLM_LKF_PERSISTENT 0x00000080 -#define DLM_LKF_NODLCKWT 0x00000100 -#define DLM_LKF_NODLCKBLK 0x00000200 -#define DLM_LKF_EXPEDITE 0x00000400 -#define DLM_LKF_NOQUEUEBAST 0x00000800 -#define DLM_LKF_HEADQUE 0x00001000 -#define DLM_LKF_NOORDER 0x00002000 -#define DLM_LKF_ORPHAN 0x00004000 -#define DLM_LKF_ALTPR 0x00008000 -#define DLM_LKF_ALTCW 0x00010000 -#define DLM_LKF_FORCEUNLOCK 0x00020000 -#define DLM_LKF_TIMEOUT 0x00040000 - -/* - * Some return codes that are not in errno.h - */ - -#define DLM_ECANCEL 0x10001 -#define DLM_EUNLOCK 0x10002 typedef void dlm_lockspace_t; diff --git a/include/linux/dlmconstants.h b/include/linux/dlmconstants.h new file mode 100644 index 0000000..fddb3d3 --- /dev/null +++ b/include/linux/dlmconstants.h @@ -0,0 +1,159 @@ +/****************************************************************************** +******************************************************************************* +** +** Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved. +** Copyright (C) 2004-2007 Red Hat, Inc. All rights reserved. +** +** This copyrighted material is made available to anyone wishing to use, +** modify, copy, or redistribute it subject to the terms and conditions +** of the GNU General Public License v.2. +** +******************************************************************************* +******************************************************************************/ + +#ifndef __DLMCONSTANTS_DOT_H__ +#define __DLMCONSTANTS_DOT_H__ + +/* + * Constants used by DLM interface. + */ + +/* + * Lock Modes + */ + +#define DLM_LOCK_IV (-1) /* invalid */ +#define DLM_LOCK_NL 0 /* null */ +#define DLM_LOCK_CR 1 /* concurrent read */ +#define DLM_LOCK_CW 2 /* concurrent write */ +#define DLM_LOCK_PR 3 /* protected read */ +#define DLM_LOCK_PW 4 /* protected write */ +#define DLM_LOCK_EX 5 /* exclusive */ + + +/* + * Flags to dlm_lock + * + * DLM_LKF_NOQUEUE + * + * Do not queue the lock request on the wait queue if it cannot be granted + * immediately. If the lock cannot be granted because of this flag, DLM will + * either return -EAGAIN from the dlm_lock call or will return 0 from + * dlm_lock and -EAGAIN in the lock status block when the AST is executed. + * + * DLM_LKF_CANCEL + * + * Used to cancel a pending lock request or conversion. A converting lock is + * returned to its previously granted mode. + * + * DLM_LKF_CONVERT + * + * Indicates a lock conversion request. For conversions the name and namelen + * are ignored and the lock ID in the LKSB is used to identify the lock. + * + * DLM_LKF_VALBLK + * + * Requests DLM to return the current contents of the lock value block in the + * lock status block. When this flag is set in a lock conversion from PW or EX + * modes, DLM assigns the value specified in the lock status block to the lock + * value block of the lock resource. The LVB is a DLM_LVB_LEN size array + * containing application-specific information. + * + * DLM_LKF_QUECVT + * + * Force a conversion request to be queued, even if it is compatible with + * the granted modes of other locks on the same resource. + * + * DLM_LKF_IVVALBLK + * + * Invalidate the lock value block. + * + * DLM_LKF_CONVDEADLK + * + * Allows the dlm to resolve conversion deadlocks internally by demoting the + * granted mode of a converting lock to NL. The DLM_SBF_DEMOTED flag is + * returned for a conversion that's been effected by this. + * + * DLM_LKF_PERSISTENT + * + * Only relevant to locks originating in userspace. A persistent lock will not + * be removed if the process holding the lock exits. + * + * DLM_LKF_NODLCKWT + * + * Do not cancel the lock if it gets into conversion deadlock. + * Exclude this lock from being monitored due to DLM_LSFL_TIMEWARN. + * + * DLM_LKF_NODLCKBLK + * + * net yet implemented + * + * DLM_LKF_EXPEDITE + * + * Used only with new requests for NL mode locks. Tells the lock manager + * to grant the lock, ignoring other locks in convert and wait queues. + * + * DLM_LKF_NOQUEUEBAST + * + * Send blocking AST's before returning -EAGAIN to the caller. It is only + * used along with the NOQUEUE flag. Blocking AST's are not sent for failed + * NOQUEUE requests otherwise. + * + * DLM_LKF_HEADQUE + * + * Add a lock to the head of the convert or wait queue rather than the tail. + * + * DLM_LKF_NOORDER + * + * Disregard the standard grant order rules and grant a lock as soon as it + * is compatible with other granted locks. + * + * DLM_LKF_ORPHAN + * + * not yet implemented + * + * DLM_LKF_ALTPR + * + * If the requested mode cannot be granted immediately, try to grant the lock + * in PR mode instead. If this alternate mode is granted instead of the + * requested mode, DLM_SBF_ALTMODE is returned in the lksb. + * + * DLM_LKF_ALTCW + * + * The same as ALTPR, but the alternate mode is CW. + * + * DLM_LKF_FORCEUNLOCK + * + * Unlock the lock even if it is converting or waiting or has sublocks. + * Only really for use by the userland device.c code. + * + */ + +#define DLM_LKF_NOQUEUE 0x00000001 +#define DLM_LKF_CANCEL 0x00000002 +#define DLM_LKF_CONVERT 0x00000004 +#define DLM_LKF_VALBLK 0x00000008 +#define DLM_LKF_QUECVT 0x00000010 +#define DLM_LKF_IVVALBLK 0x00000020 +#define DLM_LKF_CONVDEADLK 0x00000040 +#define DLM_LKF_PERSISTENT 0x00000080 +#define DLM_LKF_NODLCKWT 0x00000100 +#define DLM_LKF_NODLCKBLK 0x00000200 +#define DLM_LKF_EXPEDITE 0x00000400 +#define DLM_LKF_NOQUEUEBAST 0x00000800 +#define DLM_LKF_HEADQUE 0x00001000 +#define DLM_LKF_NOORDER 0x00002000 +#define DLM_LKF_ORPHAN 0x00004000 +#define DLM_LKF_ALTPR 0x00008000 +#define DLM_LKF_ALTCW 0x00010000 +#define DLM_LKF_FORCEUNLOCK 0x00020000 +#define DLM_LKF_TIMEOUT 0x00040000 + +/* + * Some return codes that are not in errno.h + */ + +#define DLM_ECANCEL 0x10001 +#define DLM_EUNLOCK 0x10002 + +#endif /* __DLMCONSTANTS_DOT_H__ */