GIT af1304500cee72f0f09848ce91135b45ee50796e git+ssh://master.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git#vfs-2.6.25 9b9584bb5e287312aebb5aecdc4a655dd69c56e9 commit 9b9584bb5e287312aebb5aecdc4a655dd69c56e9 Author: Miklos Szeredi Date: Thu Mar 27 13:06:26 2008 +0100 [patch 7/7] vfs: mountinfo: show dominating group id Show peer group ID of nearest dominating group that has intersection with the mount's namespace. Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit a3cf09e07c56de89d56a51e51a3eaafeec63ed01 Author: Ram Pai Date: Thu Mar 27 13:06:25 2008 +0100 [patch 6/7] vfs: mountinfo: add /proc//mountinfo [mszeredi@suse.cz] rewrite and split big patch into managable chunks /proc/mounts in its current form lacks important information: - propagation state - root of mount for bind mounts - the st_dev value used within the filesystem - identifier for each mount and it's parent It also suffers from the following problems: - not easily extendable - ambiguity of mountpoints within a chrooted environment - doesn't distinguish between filesystem dependent and independent options - doesn't distinguish between per mount and per super block options This patch introduces /proc//mountinfo which attempts to address all these deficiencies. Code shared between /proc//mounts and /proc//mountinfo is extracted into separate functions. Thanks to Al Viro for the help in getting the design right. Signed-off-by: Ram Pai Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit 72c8e314a108d17709643fa3f626dcd33fa28f30 Author: Miklos Szeredi Date: Thu Mar 27 13:06:24 2008 +0100 [patch 5/7] vfs: mountinfo: allow using process root Allow /proc//mountinfo to use the root of to calculate mountpoints. - move definition of 'struct proc_mounts' to - add the process's namespace and root to this structure - pass a pointer to 'struct proc_mounts' into seq_operations In addition the following cleanups are made: - use a common open function for /proc//{mounts,mountstat} - surround namespace.c part of these proc files with #ifdef CONFIG_PROC_FS - make the seq_operations structures const Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit c618976ae640031e7a3bfdc5ce795da65c700743 Author: Miklos Szeredi Date: Thu Mar 27 13:06:23 2008 +0100 [patch 4/7] vfs: mountinfo: add mount peer group ID Add a unique ID to each peer group using the IDR infrastructure. The identifiers are reused after the peer group dissolves. The IDR structures are protected by holding namepspace_sem for write while allocating or deallocating IDs. IDs are allocated when a previously unshared vfsmount becomes the first member of a peer group. When a new member is added to an existing group, the ID is copied from one of the old members. IDs are freed when the last member of a peer group is unshared. Setting the MNT_SHARED flag on members of a subtree is done as a separate step, after all the IDs have been allocated. This way an allocation failure can be cleaned up easilty, without affecting the propagation state. Based on design sketch by Al Viro. Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit ca1ead4c716b9836b916f39ac495f8de555a55af Author: Miklos Szeredi Date: Wed Mar 26 22:11:34 2008 +0100 [patch 3/7] vfs: mountinfo: add mount ID Add a unique ID to each vfsmount using the IDR infrastructure. The identifiers are reused after the vfsmount is freed. Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit 61c5e1dc9700ba327671ce4e56946ca6836725da Author: Miklos Szeredi Date: Thu Mar 27 13:06:21 2008 +0100 [patch 2/7] vfs: mountinfo: add seq_file_root() Add a new function: seq_file_root() This is similar to seq_path(), but calculates the path relative to the given root, instead of current->fs->root. If the path was unreachable from root, then modify the root parameter to reflect this. Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit 6b885f71d0caa8deb56725373935fa3612b64c7b Author: Ram Pai Date: Thu Mar 27 13:06:20 2008 +0100 [patch 1/7] vfs: mountinfo: add dentry_path() [mszeredi@suse.cz] split big patch into managable chunks Add the following functions: dentry_path() seq_dentry() These are similar to d_path() and seq_path(). But instead of calculating the path within a mount namespace, they calculate the path from the root of the filesystem to a given dentry, ignoring mounts completely. Signed-off-by: Ram Pai Signed-off-by: Miklos Szeredi Signed-off-by: Al Viro commit aa2cf1e899c8ee1aa878de98553774912596e848 Author: Al Viro Date: Fri Mar 28 00:46:41 2008 -0400 [PATCH] teach seq_file to discard entries Allow ->show() return SEQ_SKIP; that will discard all output from that element and move on. Signed-off-by: Al Viro commit c0e5804eaab47c5dcc6d2830ae1a069d3003b352 Author: Al Viro Date: Mon Mar 24 00:16:03 2008 -0400 [PATCH] umount_tree() will unhash everything itself Signed-off-by: Al Viro commit b986dfd058e7a5600b09a418c99f926340e47ed7 Author: Al Viro Date: Sat Mar 22 18:00:39 2008 -0400 [PATCH] get rid of more nameidata passing in namespace.c Further reduction of stack footprint (sys_pivot_root()); lose useless BKL in there, while we are at it. Signed-off-by: Al Viro commit ba7678482e729595794e558b9e5c094fd9385736 Author: Al Viro Date: Sat Mar 22 17:48:24 2008 -0400 [PATCH] switch a bunch of LSM hooks from nameidata to path Namely, ones from namespace.c Signed-off-by: Al Viro commit 24202fc82555179f87ef51519bd93ffe030627c6 Author: Al Viro Date: Sat Mar 22 16:19:49 2008 -0400 [PATCH] lock exclusively in collect_mounts() and drop_collected_mounts() Taking namespace_sem shared there isn't worth the trouble, especially with vfsmount ID allocation about to be added. That way we know that umount_tree(), copy_tree() and clone_mnt() are _always_ serialized by namespace_sem. umount_tree() still needs vfsmount_lock (it manipulates hash chains, among other things), but that's a separate story. Signed-off-by: Al Viro commit 389e69461346ea96184b43a6932ff06e1cee27c2 Author: Al Viro Date: Sat Mar 22 15:48:17 2008 -0400 [PATCH] move a bunch of declarations to fs/internal.h Signed-off-by: Al Viro commit 9da6012129380d5c73a1e1c1f06e3934557cd329 Author: Al Viro Date: Sat Feb 23 06:46:49 2008 -0500 [PATCH] fix race in kvm_dev_ioctl_create_vm(), sanitize anon_inode_getfd() a) none of the callers even looks at inode returned by anon_inode_getfd() b) none of correct callers looks at file returned by anon_inode_getfd() c) the only caller that does is (inevitably) racy, since by the time it returns we might have raced with close() from another thread and that file would be pining for fjords. Fixed by adding a variant that pins the file (used by the only odd caller) and by turning anon_inode_getfd() into wrapper around it. And losing the damn pfile and pinode from anon_inode_getfd(), to prevent similar breakage in the future. Signed-off-by: Al Viro commit 55a42249b806551239606acbe6d4f0243150d4de Author: Dave Hansen Date: Fri Feb 15 14:38:01 2008 -0800 [PATCH] r/o bind mounts: debugging for missed calls There have been a few oopses caused by 'struct file's with NULL f_vfsmnts. There was also a set of potentially missed mnt_want_write()s from dentry_open() calls. This patch provides a very simple debugging framework to catch these kinds of bugs. It will WARN_ON() them, but should stop us from having any oopses or mnt_writer count imbalances. I'm quite convinced that this is a good thing because it found bugs in the stuff I was working on as soon as I wrote it. [hch: made it conditional on a debug option. But it's still a little bit too ugly] [hch: merged forced remount r/o fix from Dave and akpm's fix for the fix] Signed-off-by: Dave Hansen Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit 4f7477e33ecdef95e49ad81c3283871cfaea000f Author: Dave Hansen Date: Fri Feb 15 14:38:00 2008 -0800 [PATCH] r/o bind mounts: honor mount writer counts at remount Originally from: Herbert Poetzl This is the core of the read-only bind mount patch set. Note that this does _not_ add a "ro" option directly to the bind mount operation. If you require such a mount, you must first do the bind, then follow it up with a 'mount -o remount,ro' operation: If you wish to have a r/o bind mount of /foo on bar: mount --bind /foo /bar mount -o remount,ro /bar Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit c95f5e2e5ae1c6e3ea7c1f8cccd8bde5d3b09f9d Author: Dave Hansen Date: Fri Feb 15 14:37:59 2008 -0800 [PATCH] r/o bind mounts: track numbers of writers to mounts This is the real meat of the entire series. It actually implements the tracking of the number of writers to a mount. However, it causes scalability problems because there can be hundreds of cpus doing open()/close() on files on the same mnt at the same time. Even an atomic_t in the mnt has massive scalaing problems because the cacheline gets so terribly contended. This uses a statically-allocated percpu variable. All want/drop operations are local to a cpu as long that cpu operates on the same mount, and there are no writer count imbalances. Writer count imbalances happen when a write is taken on one cpu, and released on another, like when an open/close pair is performed on two Upon a remount,ro request, all of the data from the percpu variables is collected (expensive, but very rare) and we determine if there are any outstanding writers to the mount. I've written a little benchmark to sit in a loop for a couple of seconds in several cpus in parallel doing open/write/close loops. http://sr71.net/~dave/linux/openbench.c The code in here is a a worst-possible case for this patch. It does opens on a _pair_ of files in two different mounts in parallel. This should cause my code to lose its "operate on the same mount" optimization completely. This worst-case scenario causes a 3% degredation in the benchmark. I could probably get rid of even this 3%, but it would be more complex than what I have here, and I think this is getting into acceptable territory. In practice, I expect writing more than 3 bytes to a file, as well as disk I/O to mask any effects that this has. (To get rid of that 3%, we could have an #defined number of mounts in the percpu variable. So, instead of a CPU getting operate only on percpu data when it accesses only one mount, it could stay on percpu data when it only accesses N or fewer mounts.) [AV] merged fix for __clear_mnt_mount() stepping on freed vfsmount Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit 3444beedb65c1153a7e1fe90d85f93bcf648a042 Author: Dave Hansen Date: Fri Feb 15 14:37:57 2008 -0800 [PATCH] r/o bind mounts: get callers of vfs_mknod/create() This takes care of all of the direct callers of vfs_mknod(). Since a few of these cases also handle normal file creation as well, this also covers some calls to vfs_create(). So that we don't have to make three mnt_want/drop_write() calls inside of the switch statement, we move some of its logic outside of the switch and into a helper function suggested by Christoph. This also encapsulates a fix for mknod(S_IFREG) that Miklos found. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit 6da603af00e384049c15b1447d52f48c3799bb67 Author: Dave Hansen Date: Fri Feb 15 14:37:56 2008 -0800 [PATCH] r/o bind mounts: check mnt instead of superblock directly If we depend on the inodes for writeability, we will not catch the r/o mounts when implemented. This patches uses __mnt_want_write(). It does not guarantee that the mount will stay writeable after the check. But, this is OK for one of the checks because it is just for a printk(). The other two are probably unnecessary and duplicate existing checks in the VFS. This won't make them better checks than before, but it will make them detect r/o mounts. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit f543e6093c08ef44a038964d1275251b5b043561 Author: Dave Hansen Date: Fri Feb 15 14:37:55 2008 -0800 [PATCH] r/o bind mounts: make access() use new r/o helper It is OK to let access() go without using a mnt_want/drop_write() pair because it doesn't actually do writes to the filesystem, and it is inherently racy anyway. This is a rare case when it is OK to use __mnt_is_readonly() directly. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit 171d58dfcd226af0137055a3dcabf89c55c9dcb1 Author: Dave Hansen Date: Fri Feb 15 14:37:53 2008 -0800 [PATCH] r/o bind mounts: elevate count for xfs timestamp updates Elevate the write count during the xfs m/ctime updates. XFS has to do it's own timestamp updates due to an unfortunate VFS design limitation, so it will have to track writers by itself aswell. [hch: split out from the touch_atime patch as it's not related to it at all] Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit c527ed344a961d7ee414982f6c5db9370507ce29 Author: Dave Hansen Date: Fri Feb 15 14:37:52 2008 -0800 [PATCH] r/o bind mounts: write counts for truncate() Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 56a8145b927db07d8862f78a32edc033d2c9950c Author: Dave Hansen Date: Fri Feb 15 14:37:50 2008 -0800 [PATCH] r/o bind mounts: elevate write count for chmod/chown callers chown/chmod,etc... don't call permission in the same way that the normal "open for write" calls do. They still write to the filesystem, so bump the write count during these operations. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit ba946da60791f040e62c29d427361f3d8eff7b18 Author: Dave Hansen Date: Fri Feb 15 14:37:49 2008 -0800 [PATCH] r/o bind mounts: get write access for vfs_rename() callers This also uses the little helper in the NFS code to make an if() a little bit less ugly. We introduced the helper at the beginning of the series. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit 4eda7907cc90e5a5f63e0bd09fe68b200cc27e11 Author: Dave Hansen Date: Fri Feb 15 14:37:48 2008 -0800 [PATCH] r/o bind mounts: elevate write count for open()s This is the first really tricky patch in the series. It elevates the writer count on a mount each time a non-special file is opened for write. We used to do this in may_open(), but Miklos pointed out that __dentry_open() is used as well to create filps. This will cover even those cases, while a call in may_open() would not have. There is also an elevated count around the vfs_create() call in open_namei(). See the comments for more details, but we need this to fix a 'create, remount, fail r/w open()' race. Some filesystems forego the use of normal vfs calls to create struct files. Make sure that these users elevate the mnt writer count because they will get __fput(), and we need to make sure they're balanced. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit fce416e4a18c0af9aace40766a1590b92151c449 Author: Dave Hansen Date: Fri Feb 15 14:37:46 2008 -0800 [PATCH] r/o bind mounts: elevate write count for ioctls() Some ioctl()s can cause writes to the filesystem. Take these, and make them use mnt_want/drop_write() instead. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit f4add95da55c03f0e4640fd73158411964cd7fed Author: Dave Hansen Date: Fri Feb 15 14:37:45 2008 -0800 [PATCH] r/o bind mounts: write counts for link/symlink Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 0eaa58592ee2f285bcaac4e0e101956b7b07d90b Author: Dave Hansen Date: Fri Feb 15 14:37:43 2008 -0800 [PATCH] r/o bind mounts: write count for file_update_time() Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 9b01fa5b747b42ab2e47494dfd55e498b6fbecf7 Author: Dave Hansen Date: Fri Feb 15 14:37:42 2008 -0800 [PATCH] r/o bind mounts: elevate write count for do_utimes() Now includes fix for oops seen by akpm. "never let a libc developer write your kernel code" - hch "nor, apparently, a kernel developer" - akpm Acked-by: Al Viro Signed-off-by: Christoph Hellwig Cc: Valdis Kletnieks Cc: Balbir Singh Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 2008883ca910d52722463cb03978ce829346d6a1 Author: Dave Hansen Date: Fri Feb 15 14:37:41 2008 -0800 [PATCH] r/o bind mounts: write counts for time functions Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 6086ac4d9b4cac9388faa9bd3117579562a02184 Author: Dave Hansen Date: Fri Feb 15 14:37:39 2008 -0800 [PATCH] r/o bind mounts: elevate write count for ncp_ioctl() Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit d459b383e59b4515cc5bc639b26c9941c3ee3ed3 Author: Dave Hansen Date: Fri Feb 15 14:37:38 2008 -0800 [PATCH] r/o bind mounts: elevate write count for xattr_permission() callers This basically audits the callers of xattr_permission(), which calls permission() and can perform writes to the filesystem. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit de86f3aa9bac8e2acdfbb2c87eef976f36f1f7e4 Author: Dave Hansen Date: Fri Feb 15 14:37:37 2008 -0800 [PATCH] r/o bind mounts: elevate mnt_writers for unlink callers Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 4360f726751ad0af7ec6dab5f55f2e728a226695 Author: Dave Hansen Date: Fri Feb 15 14:37:35 2008 -0800 [PATCH] r/o bind mounts: elevate write count for callers of vfs_mkdir() Pretty self-explanatory. Fits in with the rest of the series. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit e0619052c7e2bfefbc1630cca38766c39ccc32f7 Author: Dave Hansen Date: Fri Feb 15 14:37:34 2008 -0800 [PATCH] r/o bind mounts: elevate write count for vfs_rmdir() Elevate the write count during the vfs_rmdir() call. Acked-by: Serge Hallyn Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Andrew Morton Signed-off-by: Al Viro commit 08610f566c793cd2a2ca90d77bd6c02798020034 Author: Dave Hansen Date: Fri Feb 15 14:37:32 2008 -0800 [PATCH] r/o bind mounts: drop write during emergency remount The emergency remount code forcibly removes FMODE_WRITE from filps. The r/o bind mount code notices that this was done without a proper mnt_drop_write() and properly gives a warning. This patch does a mnt_drop_write() to keep everything balanced. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit e61239683cdf87706c936f34bc7f6dcb3bc08d01 Author: Dave Hansen Date: Fri Feb 15 14:37:31 2008 -0800 [PATCH] r/o bind mounts: create helper to drop file write access If someone decides to demote a file from r/w to just r/o, they can use this same code as __fput(). NFS does just that, and will use this in the next patch. AV: drop write access in __fput() only after we evict from file list. Signed-off-by: Dave Hansen Cc: Erez Zadok Cc: Trond Myklebust Cc: "J Bruce Fields" Acked-by: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Christoph Hellwig Signed-off-by: Al Viro commit fe85da16e6919407f5692fb1e48f88da9858c9ab Author: Dave Hansen Date: Fri Feb 15 14:37:30 2008 -0800 [PATCH] r/o bind mounts: stub functions This patch adds two function mnt_want_write() and mnt_drop_write(). These are used like a lock pair around and fs operations that might cause a write to the filesystem. Before these can become useful, we must first cover each place in the VFS where writes are performed with a want/drop pair. When that is complete, we can actually introduce code that will safely check the counts before allowing r/w<->r/o transitions to occur. Acked-by: Serge Hallyn Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 99d9f7e013ba6c6a9ba646cb41541fa662539ec4 Author: Christoph Hellwig Date: Fri Feb 15 14:37:28 2008 -0800 [PATCH] merge open_namei() and do_filp_open() open_namei() will, in the future, need to take mount write counts over its creation and truncation (via may_open()) operations. It needs to keep these write counts until any potential filp that is created gets __fput()'d. This gets complicated in the error handling and becomes very murky as to how far open_namei() actually got, and whether or not that mount write count was taken. That makes it a bad interface. All that the current do_filp_open() really does is allocate the nameidata on the stack, then call open_namei(). So, this merges those two functions and moves filp_open() over to namei.c so it can be close to its buddy: do_filp_open(). It also gets a kerneldoc comment in the process. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Andrew Morton Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 51e35404d2f97aaed7ff7c55b6a6952bcaa77d62 Author: Dave Hansen Date: Fri Feb 15 14:37:27 2008 -0800 [PATCH] do namei_flags calculation inside open_namei() My end goal here is to make sure all users of may_open() return filps. This will ensure that we properly release mount write counts which were taken for the filp in may_open(). This patch moves the sys_open flags to namei flags calculation into fs/namei.c. We'll shortly be moving the nameidata_to_filp() calls into namei.c, and this gets the sys_open flags to a place where we can get at them when we need them. Acked-by: Al Viro Signed-off-by: Christoph Hellwig Signed-off-by: Dave Hansen Signed-off-by: Al Viro commit 99bda4937b7b89efe9b4955b0f547129ee1c5488 Author: Eric Sandeen Date: Thu Mar 6 13:45:22 2008 +1100 [XFS] The forward declarations for the xfs_ioctl() helpers and the associated comment about gcc behavior really aren't needed; all of these functions are marked STATIC which includes noinline, and the stack usage won't be a problem. This effectively just removes the forward declarations and moves xfs_ioctl() back to the end of the file. SGI-PV: 971186 SGI-Modid: xfs-linux-melb:xfs-kern:30534a Signed-off-by: Eric Sandeen Signed-off-by: Niv Sardi Signed-off-by: Lachlan McIlroy Documentation/filesystems/proc.txt | 38 +++ fs/anon_inodes.c | 24 ++- fs/dcache.c | 114 +++++-- fs/eventfd.c | 5 +- fs/eventpoll.c | 7 +- fs/ext2/ioctl.c | 47 ++- fs/ext3/ioctl.c | 103 ++++-- fs/fat/file.c | 12 +- fs/file_table.c | 42 +++- fs/hfsplus/ioctl.c | 38 ++- fs/inode.c | 51 ++-- fs/internal.h | 11 + fs/jfs/ioctl.c | 33 ++- fs/namei.c | 275 ++++++++++++---- fs/namespace.c | 647 ++++++++++++++++++++++++++++++++---- fs/ncpfs/ioctl.c | 54 +++- fs/nfs/dir.c | 3 +- fs/nfsd/nfs4proc.c | 7 +- fs/nfsd/nfs4recover.c | 5 + fs/nfsd/nfs4state.c | 3 +- fs/nfsd/vfs.c | 26 ++- fs/ocfs2/ioctl.c | 11 +- fs/open.c | 149 +++++--- fs/pnode.c | 60 ++++- fs/pnode.h | 2 + fs/proc/base.c | 121 ++++--- fs/reiserfs/ioctl.c | 63 +++-- fs/seq_file.c | 111 +++++-- fs/signalfd.c | 6 +- fs/super.c | 25 ++- fs/timerfd.c | 5 +- fs/utimes.c | 18 +- fs/xattr.c | 16 +- fs/xfs/linux-2.6/xfs_ioctl.c | 575 +++++++++++++++----------------- fs/xfs/linux-2.6/xfs_iops.c | 7 - fs/xfs/linux-2.6/xfs_lrw.c | 9 +- include/linux/anon_inodes.h | 6 +- include/linux/dcache.h | 3 +- include/linux/file.h | 1 + include/linux/fs.h | 58 +++- include/linux/mnt_namespace.h | 12 + include/linux/mount.h | 15 +- include/linux/security.h | 52 ++-- include/linux/seq_file.h | 6 + ipc/mqueue.c | 25 ++- lib/Kconfig.debug | 10 + net/unix/af_unix.c | 4 + security/dummy.c | 10 +- security/security.c | 20 +- security/selinux/hooks.c | 8 +- security/smack/smack_lsm.c | 4 +- virt/kvm/kvm_main.c | 17 +- 52 files changed, 2106 insertions(+), 868 deletions(-) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 518ebe6..2a99116 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -43,6 +43,7 @@ Table of Contents 2.13 /proc//oom_score - Display current oom-killer score 2.14 /proc//io - Display the IO accounting fields 2.15 /proc//coredump_filter - Core dump filtering settings + 2.16 /proc//mountinfo - Information about mounts ------------------------------------------------------------------------------ Preface @@ -2348,4 +2349,41 @@ For example: $ echo 0x7 > /proc/self/coredump_filter $ ./some_program +2.16 /proc//mountinfo - Information about mounts +-------------------------------------------------------- + +This file contains lines of the form: + +36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue +(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + +(1) mount ID: unique identifier of the mount (may be reused after umount) +(2) parent ID: ID of parent (or of self for the top of the mount tree) +(3) major:minor: value of st_dev for files on filesystem +(4) root: root of the mount within the filesystem +(5) mount point: mount point relative to the process's root +(6) mount options: per mount options +(7) optional fields: zero or more fields of the form "tag[:value]" +(8) separator: marks the end of the optional fields +(9) filesystem type: name of filesystem of the form "type[.subtype]" +(10) mount source: filesystem specific information or "none" +(11) super options: per super block options + +Parsers should ignore all unrecognised optional fields. Currently the +possible optional fields are: + +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X (*) +unbindable mount is unbindable + +(*) X is the closest dominant peer group under the process's root. If +X is the immediate master of the mount, or if there's no dominant peer +group under the same root, then only the "master:X" field is present +and not the "propagate_from:X" field. + +For more information on mount propagation see: + + Documentation/filesystems/sharedsubtree.txt + ------------------------------------------------------------------------------ diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c index f42be06..fbcb87a 100644 --- a/fs/anon_inodes.c +++ b/fs/anon_inodes.c @@ -53,12 +53,11 @@ static struct dentry_operations anon_inodefs_dentry_operations = { }; /** - * anon_inode_getfd - creates a new file instance by hooking it up to an + * __anon_inode_getfd - creates a new file instance by hooking it up to an * anonymous inode, and a dentry that describe the "class" * of the file * * @pfd: [out] pointer to the file descriptor - * @dpinode: [out] pointer to the inode * @pfile: [out] pointer to the file struct * @name: [in] name of the "class" of the new file * @fops [in] file operations for the new file @@ -69,8 +68,13 @@ static struct dentry_operations anon_inodefs_dentry_operations = { * All the files created with anon_inode_getfd() will share a single inode, * hence saving memory and avoiding code duplication for the file/inode/dentry * setup. + * File will be pinned down; the caller should fput() it once it's done looking + * at the damn thing (or just use anon_inode_getfd()). That reference to file + * is in addition to one created by inserting it into descriptor table. Note + * that *pfd might be closed by the time we return - another thread might have + * guessed it and called close(2). */ -int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile, +int __anon_inode_getfd(int *pfd, struct file **pfile, const char *name, const struct file_operations *fops, void *priv) { @@ -123,10 +127,10 @@ int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile, file->f_version = 0; file->private_data = priv; + get_file(file); fd_install(fd, file); *pfd = fd; - *pinode = anon_inode_inode; *pfile = file; return 0; @@ -136,6 +140,18 @@ err_put_unused_fd: put_unused_fd(fd); return error; } +EXPORT_SYMBOL_GPL(__anon_inode_getfd); + +int anon_inode_getfd(int *pfd, const char *name, + const struct file_operations *fops, + void *priv) +{ + struct file *file; + int error = __anon_inode_getfd(pfd, &file, name, fops, priv); + if (!error) + fput(file); + return error; +} EXPORT_SYMBOL_GPL(anon_inode_getfd); /* diff --git a/fs/dcache.c b/fs/dcache.c index 4345577..3ee588d 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1746,12 +1746,21 @@ shouldnt_be_hashed: goto shouldnt_be_hashed; } +static int prepend(char **buffer, int *buflen, const char *str, + int namelen) +{ + *buflen -= namelen; + if (*buflen < 0) + return -ENAMETOOLONG; + *buffer -= namelen; + memcpy(*buffer, str, namelen); + return 0; +} + /** * d_path - return the path of a dentry - * @dentry: dentry to report - * @vfsmnt: vfsmnt to which the dentry belongs - * @root: root dentry - * @rootmnt: vfsmnt to which the root dentry belongs + * @path: the dentry/vfsmount to report + * @root: root vfsmnt/dentry (may be modified by this function) * @buffer: buffer to return value in * @buflen: buffer length * @@ -1761,23 +1770,22 @@ shouldnt_be_hashed: * Returns the buffer or an error code if the path was too long. * * "buflen" should be positive. Caller holds the dcache_lock. + * + * If path is not reachable from the supplied root, then the value of + * root is changed (without modifying refcounts). */ -static char *__d_path(struct dentry *dentry, struct vfsmount *vfsmnt, - struct path *root, char *buffer, int buflen) +char *__d_path(const struct path *path, struct path *root, + char *buffer, int buflen) { + struct dentry *dentry = path->dentry; + struct vfsmount *vfsmnt = path->mnt; char * end = buffer+buflen; char * retval; - int namelen; - - *--end = '\0'; - buflen--; - if (!IS_ROOT(dentry) && d_unhashed(dentry)) { - buflen -= 10; - end -= 10; - if (buflen < 0) + + prepend(&end, &buflen, "\0", 1); + if (!IS_ROOT(dentry) && d_unhashed(dentry) && + (prepend(&end, &buflen, " (deleted)", 10) != 0)) goto Elong; - memcpy(end, " (deleted)", 10); - } if (buflen < 1) goto Elong; @@ -1804,13 +1812,10 @@ static char *__d_path(struct dentry *dentry, struct vfsmount *vfsmnt, } parent = dentry->d_parent; prefetch(parent); - namelen = dentry->d_name.len; - buflen -= namelen + 1; - if (buflen < 0) + if ((prepend(&end, &buflen, dentry->d_name.name, + dentry->d_name.len) != 0) || + (prepend(&end, &buflen, "/", 1) != 0)) goto Elong; - end -= namelen; - memcpy(end, dentry->d_name.name, namelen); - *--end = '/'; retval = end; dentry = parent; } @@ -1818,12 +1823,12 @@ static char *__d_path(struct dentry *dentry, struct vfsmount *vfsmnt, return retval; global_root: - namelen = dentry->d_name.len; - buflen -= namelen; - if (buflen < 0) + retval += 1; /* hit the slash */ + if (prepend(&retval, &buflen, dentry->d_name.name, + dentry->d_name.len) != 0) goto Elong; - retval -= namelen-1; /* hit the slash */ - memcpy(retval, dentry->d_name.name, namelen); + root->mnt = vfsmnt; + root->dentry = dentry; return retval; Elong: return ERR_PTR(-ENAMETOOLONG); @@ -1846,6 +1851,7 @@ char *d_path(struct path *path, char *buf, int buflen) { char *res; struct path root; + struct path tmp; /* * We have various synthetic filesystems that never get mounted. On @@ -1859,10 +1865,11 @@ char *d_path(struct path *path, char *buf, int buflen) read_lock(¤t->fs->lock); root = current->fs->root; - path_get(¤t->fs->root); + path_get(&root); read_unlock(¤t->fs->lock); spin_lock(&dcache_lock); - res = __d_path(path->dentry, path->mnt, &root, buf, buflen); + tmp = root; + res = __d_path(path, &tmp, buf, buflen); spin_unlock(&dcache_lock); path_put(&root); return res; @@ -1890,6 +1897,48 @@ char *dynamic_dname(struct dentry *dentry, char *buffer, int buflen, } /* + * Write full pathname from the root of the filesystem into the buffer. + */ +char *dentry_path(struct dentry *dentry, char *buf, int buflen) +{ + char *end = buf + buflen; + char *retval; + + spin_lock(&dcache_lock); + prepend(&end, &buflen, "\0", 1); + if (!IS_ROOT(dentry) && d_unhashed(dentry) && + (prepend(&end, &buflen, "//deleted", 9) != 0)) + goto Elong; + if (buflen < 1) + goto Elong; + /* Get '/' right */ + retval = end-1; + *retval = '/'; + + for (;;) { + struct dentry *parent; + if (IS_ROOT(dentry)) + break; + + parent = dentry->d_parent; + prefetch(parent); + + if ((prepend(&end, &buflen, dentry->d_name.name, + dentry->d_name.len) != 0) || + (prepend(&end, &buflen, "/", 1) != 0)) + goto Elong; + + retval = end; + dentry = parent; + } + spin_unlock(&dcache_lock); + return retval; +Elong: + spin_unlock(&dcache_lock); + return ERR_PTR(-ENAMETOOLONG); +} + +/* * NOTE! The user-level library version returns a * character pointer. The kernel system call just * returns the length of the buffer filled (which @@ -1918,9 +1967,9 @@ asmlinkage long sys_getcwd(char __user *buf, unsigned long size) read_lock(¤t->fs->lock); pwd = current->fs->pwd; - path_get(¤t->fs->pwd); + path_get(&pwd); root = current->fs->root; - path_get(¤t->fs->root); + path_get(&root); read_unlock(¤t->fs->lock); error = -ENOENT; @@ -1928,9 +1977,10 @@ asmlinkage long sys_getcwd(char __user *buf, unsigned long size) spin_lock(&dcache_lock); if (pwd.dentry->d_parent == pwd.dentry || !d_unhashed(pwd.dentry)) { unsigned long len; + struct path tmp = root; char * cwd; - cwd = __d_path(pwd.dentry, pwd.mnt, &root, page, PAGE_SIZE); + cwd = __d_path(&pwd, &tmp, page, PAGE_SIZE); spin_unlock(&dcache_lock); error = PTR_ERR(cwd); diff --git a/fs/eventfd.c b/fs/eventfd.c index a9f130c..07eef3e 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -202,8 +202,6 @@ asmlinkage long sys_eventfd(unsigned int count) { int error, fd; struct eventfd_ctx *ctx; - struct file *file; - struct inode *inode; ctx = kmalloc(sizeof(*ctx), GFP_KERNEL); if (!ctx) @@ -216,8 +214,7 @@ asmlinkage long sys_eventfd(unsigned int count) * When we call this, the initialization must be complete, since * anon_inode_getfd() will install the fd. */ - error = anon_inode_getfd(&fd, &inode, &file, "[eventfd]", - &eventfd_fops, ctx); + error = anon_inode_getfd(&fd, "[eventfd]", &eventfd_fops, ctx); if (!error) return fd; diff --git a/fs/eventpoll.c b/fs/eventpoll.c index a415f42..369b665 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -1071,8 +1071,6 @@ asmlinkage long sys_epoll_create(int size) { int error, fd = -1; struct eventpoll *ep; - struct inode *inode; - struct file *file; DNPRINTK(3, (KERN_INFO "[%p] eventpoll: sys_epoll_create(%d)\n", current, size)); @@ -1087,10 +1085,9 @@ asmlinkage long sys_epoll_create(int size) /* * Creates all the items needed to setup an eventpoll file. That is, - * a file structure, and inode and a free file descriptor. + * a file structure and a free file descriptor. */ - error = anon_inode_getfd(&fd, &inode, &file, "[eventpoll]", - &eventpoll_fops, ep); + error = anon_inode_getfd(&fd, "[eventpoll]", &eventpoll_fops, ep); if (error) goto error_free; diff --git a/fs/ext2/ioctl.c b/fs/ext2/ioctl.c index b8ea11f..7ad01cc 100644 --- a/fs/ext2/ioctl.c +++ b/fs/ext2/ioctl.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -23,6 +24,7 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) struct ext2_inode_info *ei = EXT2_I(inode); unsigned int flags; unsigned short rsv_window_size; + int ret; ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg); @@ -34,14 +36,19 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) case EXT2_IOC_SETFLAGS: { unsigned int oldflags; - if (IS_RDONLY(inode)) - return -EROFS; + ret = mnt_want_write(filp->f_vfsmnt); + if (ret) + return ret; - if (!is_owner_or_cap(inode)) - return -EACCES; + if (!is_owner_or_cap(inode)) { + ret = -EACCES; + goto setflags_out; + } - if (get_user(flags, (int __user *) arg)) - return -EFAULT; + if (get_user(flags, (int __user *) arg)) { + ret = -EFAULT; + goto setflags_out; + } if (!S_ISDIR(inode->i_mode)) flags &= ~EXT2_DIRSYNC_FL; @@ -50,7 +57,8 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) /* Is it quota file? Do not allow user to mess with it */ if (IS_NOQUOTA(inode)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + ret = -EPERM; + goto setflags_out; } oldflags = ei->i_flags; @@ -63,7 +71,8 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) { if (!capable(CAP_LINUX_IMMUTABLE)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + ret = -EPERM; + goto setflags_out; } } @@ -75,20 +84,26 @@ long ext2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) ext2_set_inode_flags(inode); inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setflags_out: + mnt_drop_write(filp->f_vfsmnt); + return ret; } case EXT2_IOC_GETVERSION: return put_user(inode->i_generation, (int __user *) arg); case EXT2_IOC_SETVERSION: if (!is_owner_or_cap(inode)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; - if (get_user(inode->i_generation, (int __user *) arg)) - return -EFAULT; - inode->i_ctime = CURRENT_TIME_SEC; - mark_inode_dirty(inode); - return 0; + ret = mnt_want_write(filp->f_vfsmnt); + if (ret) + return ret; + if (get_user(inode->i_generation, (int __user *) arg)) { + ret = -EFAULT; + } else { + inode->i_ctime = CURRENT_TIME_SEC; + mark_inode_dirty(inode); + } + mnt_drop_write(filp->f_vfsmnt); + return ret; case EXT2_IOC_GETRSVSZ: if (test_opt(inode->i_sb, RESERVATION) && S_ISREG(inode->i_mode) diff --git a/fs/ext3/ioctl.c b/fs/ext3/ioctl.c index 023a070..09b850a 100644 --- a/fs/ext3/ioctl.c +++ b/fs/ext3/ioctl.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include @@ -38,14 +39,19 @@ int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, unsigned int oldflags; unsigned int jflag; - if (IS_RDONLY(inode)) - return -EROFS; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; - if (!is_owner_or_cap(inode)) - return -EACCES; + if (!is_owner_or_cap(inode)) { + err = -EACCES; + goto flags_out; + } - if (get_user(flags, (int __user *) arg)) - return -EFAULT; + if (get_user(flags, (int __user *) arg)) { + err = -EFAULT; + goto flags_out; + } if (!S_ISDIR(inode->i_mode)) flags &= ~EXT3_DIRSYNC_FL; @@ -54,7 +60,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, /* Is it quota file? Do not allow user to mess with it */ if (IS_NOQUOTA(inode)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + err = -EPERM; + goto flags_out; } oldflags = ei->i_flags; @@ -70,7 +77,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, if ((flags ^ oldflags) & (EXT3_APPEND_FL | EXT3_IMMUTABLE_FL)) { if (!capable(CAP_LINUX_IMMUTABLE)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + err = -EPERM; + goto flags_out; } } @@ -81,7 +89,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) { if (!capable(CAP_SYS_RESOURCE)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + err = -EPERM; + goto flags_out; } } @@ -89,7 +98,8 @@ int ext3_ioctl (struct inode * inode, struct file * filp, unsigned int cmd, handle = ext3_journal_start(inode, 1); if (IS_ERR(handle)) { mutex_unlock(&inode->i_mutex); - return PTR_ERR(handle); + err = PTR_ERR(handle); + goto flags_out; } if (IS_SYNC(inode)) handle->h_sync = 1; @@ -115,6 +125,8 @@ flags_err: if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) err = ext3_change_inode_journal_flag(inode, jflag); mutex_unlock(&inode->i_mutex); +flags_out: + mnt_drop_write(filp->f_vfsmnt); return err; } case EXT3_IOC_GETVERSION: @@ -129,14 +141,18 @@ flags_err: if (!is_owner_or_cap(inode)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; - if (get_user(generation, (int __user *) arg)) - return -EFAULT; - + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; + if (get_user(generation, (int __user *) arg)) { + err = -EFAULT; + goto setversion_out; + } handle = ext3_journal_start(inode, 1); - if (IS_ERR(handle)) - return PTR_ERR(handle); + if (IS_ERR(handle)) { + err = PTR_ERR(handle); + goto setversion_out; + } err = ext3_reserve_inode_write(handle, inode, &iloc); if (err == 0) { inode->i_ctime = CURRENT_TIME_SEC; @@ -144,6 +160,8 @@ flags_err: err = ext3_mark_iloc_dirty(handle, inode, &iloc); } ext3_journal_stop(handle); +setversion_out: + mnt_drop_write(filp->f_vfsmnt); return err; } #ifdef CONFIG_JBD_DEBUG @@ -179,18 +197,24 @@ flags_err: } return -ENOTTY; case EXT3_IOC_SETRSVSZ: { + int err; if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode)) return -ENOTTY; - if (IS_RDONLY(inode)) - return -EROFS; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; - if (!is_owner_or_cap(inode)) - return -EACCES; + if (!is_owner_or_cap(inode)) { + err = -EACCES; + goto setrsvsz_out; + } - if (get_user(rsv_window_size, (int __user *)arg)) - return -EFAULT; + if (get_user(rsv_window_size, (int __user *)arg)) { + err = -EFAULT; + goto setrsvsz_out; + } if (rsv_window_size > EXT3_MAX_RESERVE_BLOCKS) rsv_window_size = EXT3_MAX_RESERVE_BLOCKS; @@ -208,7 +232,9 @@ flags_err: rsv->rsv_goal_size = rsv_window_size; } mutex_unlock(&ei->truncate_mutex); - return 0; +setrsvsz_out: + mnt_drop_write(filp->f_vfsmnt); + return err; } case EXT3_IOC_GROUP_EXTEND: { ext3_fsblk_t n_blocks_count; @@ -218,17 +244,20 @@ flags_err: if (!capable(CAP_SYS_RESOURCE)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; - - if (get_user(n_blocks_count, (__u32 __user *)arg)) - return -EFAULT; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; + if (get_user(n_blocks_count, (__u32 __user *)arg)) { + err = -EFAULT; + goto group_extend_out; + } err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count); journal_lock_updates(EXT3_SB(sb)->s_journal); journal_flush(EXT3_SB(sb)->s_journal); journal_unlock_updates(EXT3_SB(sb)->s_journal); - +group_extend_out: + mnt_drop_write(filp->f_vfsmnt); return err; } case EXT3_IOC_GROUP_ADD: { @@ -239,18 +268,22 @@ flags_err: if (!capable(CAP_SYS_RESOURCE)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; if (copy_from_user(&input, (struct ext3_new_group_input __user *)arg, - sizeof(input))) - return -EFAULT; + sizeof(input))) { + err = -EFAULT; + goto group_add_out; + } err = ext3_group_add(sb, &input); journal_lock_updates(EXT3_SB(sb)->s_journal); journal_flush(EXT3_SB(sb)->s_journal); journal_unlock_updates(EXT3_SB(sb)->s_journal); - +group_add_out: + mnt_drop_write(filp->f_vfsmnt); return err; } diff --git a/fs/fat/file.c b/fs/fat/file.c index c614175..3da035f 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -8,6 +8,7 @@ #include #include +#include #include #include #include @@ -46,10 +47,9 @@ int fat_generic_ioctl(struct inode *inode, struct file *filp, mutex_lock(&inode->i_mutex); - if (IS_RDONLY(inode)) { - err = -EROFS; - goto up; - } + err = mnt_want_write(filp->f_vfsmnt); + if (err) + goto up_no_drop_write; /* * ATTR_VOLUME and ATTR_DIR cannot be changed; this also @@ -105,7 +105,9 @@ int fat_generic_ioctl(struct inode *inode, struct file *filp, MSDOS_I(inode)->i_attrs = attr & ATTR_UNUSED; mark_inode_dirty(inode); - up: +up: + mnt_drop_write(filp->f_vfsmnt); +up_no_drop_write: mutex_unlock(&inode->i_mutex); return err; } diff --git a/fs/file_table.c b/fs/file_table.c index 986ff4e..7a0a9b8 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -42,6 +42,7 @@ static inline void file_free_rcu(struct rcu_head *head) static inline void file_free(struct file *f) { percpu_counter_dec(&nr_files); + file_check_state(f); call_rcu(&f->f_u.fu_rcuhead, file_free_rcu); } @@ -199,6 +200,18 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry, file->f_mapping = dentry->d_inode->i_mapping; file->f_mode = mode; file->f_op = fop; + + /* + * These mounts don't really matter in practice + * for r/o bind mounts. They aren't userspace- + * visible. We do this for consistency, and so + * that we can do debugging checks at __fput() + */ + if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) { + file_take_write(file); + error = mnt_want_write(mnt); + WARN_ON(error); + } return error; } EXPORT_SYMBOL(init_file); @@ -211,6 +224,31 @@ void fput(struct file *file) EXPORT_SYMBOL(fput); +/** + * drop_file_write_access - give up ability to write to a file + * @file: the file to which we will stop writing + * + * This is a central place which will give up the ability + * to write to @file, along with access to write through + * its vfsmount. + */ +void drop_file_write_access(struct file *file) +{ + struct vfsmount *mnt = file->f_path.mnt; + struct dentry *dentry = file->f_path.dentry; + struct inode *inode = dentry->d_inode; + + put_write_access(inode); + + if (special_file(inode->i_mode)) + return; + if (file_check_writeable(file) != 0) + return; + mnt_drop_write(mnt); + file_release_write(file); +} +EXPORT_SYMBOL_GPL(drop_file_write_access); + /* __fput is called from task context when aio completion releases the last * last use of a struct file *. Do not use otherwise. */ @@ -236,10 +274,10 @@ void __fput(struct file *file) if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL)) cdev_put(inode->i_cdev); fops_put(file->f_op); - if (file->f_mode & FMODE_WRITE) - put_write_access(inode); put_pid(file->f_owner.pid); file_kill(file); + if (file->f_mode & FMODE_WRITE) + drop_file_write_access(file); file->f_path.dentry = NULL; file->f_path.mnt = NULL; file_free(file); diff --git a/fs/hfsplus/ioctl.c b/fs/hfsplus/ioctl.c index b60c0af..4500010 100644 --- a/fs/hfsplus/ioctl.c +++ b/fs/hfsplus/ioctl.c @@ -14,6 +14,7 @@ #include #include +#include #include #include #include @@ -35,25 +36,32 @@ int hfsplus_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, flags |= FS_NODUMP_FL; /* EXT2_NODUMP_FL */ return put_user(flags, (int __user *)arg); case HFSPLUS_IOC_EXT2_SETFLAGS: { - if (IS_RDONLY(inode)) - return -EROFS; - - if (!is_owner_or_cap(inode)) - return -EACCES; - - if (get_user(flags, (int __user *)arg)) - return -EFAULT; - + int err = 0; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; + + if (!is_owner_or_cap(inode)) { + err = -EACCES; + goto setflags_out; + } + if (get_user(flags, (int __user *)arg)) { + err = -EFAULT; + goto setflags_out; + } if (flags & (FS_IMMUTABLE_FL|FS_APPEND_FL) || HFSPLUS_I(inode).rootflags & (HFSPLUS_FLG_IMMUTABLE|HFSPLUS_FLG_APPEND)) { - if (!capable(CAP_LINUX_IMMUTABLE)) - return -EPERM; + if (!capable(CAP_LINUX_IMMUTABLE)) { + err = -EPERM; + goto setflags_out; + } } /* don't silently ignore unsupported ext2 flags */ - if (flags & ~(FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL)) - return -EOPNOTSUPP; - + if (flags & ~(FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL)) { + err = -EOPNOTSUPP; + goto setflags_out; + } if (flags & FS_IMMUTABLE_FL) { /* EXT2_IMMUTABLE_FL */ inode->i_flags |= S_IMMUTABLE; HFSPLUS_I(inode).rootflags |= HFSPLUS_FLG_IMMUTABLE; @@ -75,7 +83,9 @@ int hfsplus_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setflags_out: + mnt_drop_write(filp->f_vfsmnt); + return err; } default: return -ENOTTY; diff --git a/fs/inode.c b/fs/inode.c index 53245ff..cfddbaf 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1199,42 +1199,37 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry) struct inode *inode = dentry->d_inode; struct timespec now; - if (inode->i_flags & S_NOATIME) + if (mnt_want_write(mnt)) return; + if (inode->i_flags & S_NOATIME) + goto out; if (IS_NOATIME(inode)) - return; + goto out; if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode)) - return; + goto out; - /* - * We may have a NULL vfsmount when coming from NFSD - */ - if (mnt) { - if (mnt->mnt_flags & MNT_NOATIME) - return; - if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)) - return; - - if (mnt->mnt_flags & MNT_RELATIME) { - /* - * With relative atime, only update atime if the - * previous atime is earlier than either the ctime or - * mtime. - */ - if (timespec_compare(&inode->i_mtime, - &inode->i_atime) < 0 && - timespec_compare(&inode->i_ctime, - &inode->i_atime) < 0) - return; - } + if (mnt->mnt_flags & MNT_NOATIME) + goto out; + if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode)) + goto out; + if (mnt->mnt_flags & MNT_RELATIME) { + /* + * With relative atime, only update atime if the previous + * atime is earlier than either the ctime or mtime. + */ + if (timespec_compare(&inode->i_mtime, &inode->i_atime) < 0 && + timespec_compare(&inode->i_ctime, &inode->i_atime) < 0) + goto out; } now = current_fs_time(inode->i_sb); if (timespec_equal(&inode->i_atime, &now)) - return; + goto out; inode->i_atime = now; mark_inode_dirty_sync(inode); +out: + mnt_drop_write(mnt); } EXPORT_SYMBOL(touch_atime); @@ -1255,10 +1250,13 @@ void file_update_time(struct file *file) struct inode *inode = file->f_path.dentry->d_inode; struct timespec now; int sync_it = 0; + int err; if (IS_NOCMTIME(inode)) return; - if (IS_RDONLY(inode)) + + err = mnt_want_write(file->f_vfsmnt); + if (err) return; now = current_fs_time(inode->i_sb); @@ -1279,6 +1277,7 @@ void file_update_time(struct file *file) if (sync_it) mark_inode_dirty_sync(inode); + mnt_drop_write(file->f_vfsmnt); } EXPORT_SYMBOL(file_update_time); diff --git a/fs/internal.h b/fs/internal.h index 392e8cc..80aa9a0 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -43,3 +43,14 @@ extern void __init chrdev_init(void); * namespace.c */ extern int copy_mount_options(const void __user *, unsigned long *); + +extern void free_vfsmnt(struct vfsmount *); +extern struct vfsmount *alloc_vfsmnt(const char *); +extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int); +extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *, + struct vfsmount *); +extern void release_mounts(struct list_head *); +extern void umount_tree(struct vfsmount *, int, struct list_head *); +extern struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int); + +extern void __init mnt_init(void); diff --git a/fs/jfs/ioctl.c b/fs/jfs/ioctl.c index a1f8e37..bb72ff2 100644 --- a/fs/jfs/ioctl.c +++ b/fs/jfs/ioctl.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -65,23 +66,30 @@ long jfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) return put_user(flags, (int __user *) arg); case JFS_IOC_SETFLAGS: { unsigned int oldflags; + int err; - if (IS_RDONLY(inode)) - return -EROFS; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; - if (!is_owner_or_cap(inode)) - return -EACCES; - - if (get_user(flags, (int __user *) arg)) - return -EFAULT; + if (!is_owner_or_cap(inode)) { + err = -EACCES; + goto setflags_out; + } + if (get_user(flags, (int __user *) arg)) { + err = -EFAULT; + goto setflags_out; + } flags = jfs_map_ext2(flags, 1); if (!S_ISDIR(inode->i_mode)) flags &= ~JFS_DIRSYNC_FL; /* Is it quota file? Do not allow user to mess with it */ - if (IS_NOQUOTA(inode)) - return -EPERM; + if (IS_NOQUOTA(inode)) { + err = -EPERM; + goto setflags_out; + } /* Lock against other parallel changes of flags */ mutex_lock(&inode->i_mutex); @@ -98,7 +106,8 @@ long jfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) (JFS_APPEND_FL | JFS_IMMUTABLE_FL))) { if (!capable(CAP_LINUX_IMMUTABLE)) { mutex_unlock(&inode->i_mutex); - return -EPERM; + err = -EPERM; + goto setflags_out; } } @@ -110,7 +119,9 @@ long jfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) mutex_unlock(&inode->i_mutex); inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setflags_out: + mnt_drop_write(filp->f_vfsmnt); + return err; } default: return -ENOTTY; diff --git a/fs/namei.c b/fs/namei.c index 8cf9bb9..e179f71 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1623,8 +1623,7 @@ int may_open(struct nameidata *nd, int acc_mode, int flag) return -EACCES; flag &= ~O_TRUNC; - } else if (IS_RDONLY(inode) && (acc_mode & MAY_WRITE)) - return -EROFS; + } error = vfs_permission(nd, acc_mode); if (error) @@ -1677,7 +1676,12 @@ int may_open(struct nameidata *nd, int acc_mode, int flag) return 0; } -static int open_namei_create(struct nameidata *nd, struct path *path, +/* + * Be careful about ever adding any more callers of this + * function. Its flags must be in the namei format, not + * what get passed to sys_open(). + */ +static int __open_namei_create(struct nameidata *nd, struct path *path, int flag, int mode) { int error; @@ -1696,26 +1700,56 @@ static int open_namei_create(struct nameidata *nd, struct path *path, } /* - * open_namei() + * Note that while the flag value (low two bits) for sys_open means: + * 00 - read-only + * 01 - write-only + * 10 - read-write + * 11 - special + * it is changed into + * 00 - no permissions needed + * 01 - read-permission + * 10 - write-permission + * 11 - read-write + * for the internal routines (ie open_namei()/follow_link() etc) + * This is more logical, and also allows the 00 "no perm needed" + * to be used for symlinks (where the permissions are checked + * later). * - * namei for open - this is in fact almost the whole open-routine. - * - * Note that the low bits of "flag" aren't the same as in the open - * system call - they are 00 - no permissions needed - * 01 - read permission needed - * 10 - write permission needed - * 11 - read/write permissions needed - * which is a lot more logical, and also allows the "no perm" needed - * for symlinks (where the permissions are checked later). - * SMP-safe +*/ +static inline int open_to_namei_flags(int flag) +{ + if ((flag+1) & O_ACCMODE) + flag++; + return flag; +} + +static int open_will_write_to_fs(int flag, struct inode *inode) +{ + /* + * We'll never write to the fs underlying + * a device file. + */ + if (special_file(inode->i_mode)) + return 0; + return (flag & O_TRUNC); +} + +/* + * Note that the low bits of the passed in "open_flag" + * are not the same as in the local variable "flag". See + * open_to_namei_flags() for more details. */ -int open_namei(int dfd, const char *pathname, int flag, - int mode, struct nameidata *nd) +struct file *do_filp_open(int dfd, const char *pathname, + int open_flag, int mode) { + struct file *filp; + struct nameidata nd; int acc_mode, error; struct path path; struct dentry *dir; int count = 0; + int will_write; + int flag = open_to_namei_flags(open_flag); acc_mode = ACC_MODE(flag); @@ -1733,18 +1767,19 @@ int open_namei(int dfd, const char *pathname, int flag, */ if (!(flag & O_CREAT)) { error = path_lookup_open(dfd, pathname, lookup_flags(flag), - nd, flag); + &nd, flag); if (error) - return error; + return ERR_PTR(error); goto ok; } /* * Create - we need to know the parent. */ - error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode); + error = path_lookup_create(dfd, pathname, LOOKUP_PARENT, + &nd, flag, mode); if (error) - return error; + return ERR_PTR(error); /* * We have the parent and last component. First of all, check @@ -1752,14 +1787,14 @@ int open_namei(int dfd, const char *pathname, int flag, * will not do. */ error = -EISDIR; - if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len]) + if (nd.last_type != LAST_NORM || nd.last.name[nd.last.len]) goto exit; - dir = nd->path.dentry; - nd->flags &= ~LOOKUP_PARENT; + dir = nd.path.dentry; + nd.flags &= ~LOOKUP_PARENT; mutex_lock(&dir->d_inode->i_mutex); - path.dentry = lookup_hash(nd); - path.mnt = nd->path.mnt; + path.dentry = lookup_hash(&nd); + path.mnt = nd.path.mnt; do_last: error = PTR_ERR(path.dentry); @@ -1768,18 +1803,31 @@ do_last: goto exit; } - if (IS_ERR(nd->intent.open.file)) { - mutex_unlock(&dir->d_inode->i_mutex); - error = PTR_ERR(nd->intent.open.file); - goto exit_dput; + if (IS_ERR(nd.intent.open.file)) { + error = PTR_ERR(nd.intent.open.file); + goto exit_mutex_unlock; } /* Negative dentry, just create the file */ if (!path.dentry->d_inode) { - error = open_namei_create(nd, &path, flag, mode); + /* + * This write is needed to ensure that a + * ro->rw transition does not occur between + * the time when the file is created and when + * a permanent write count is taken through + * the 'struct file' in nameidata_to_filp(). + */ + error = mnt_want_write(nd.path.mnt); if (error) + goto exit_mutex_unlock; + error = __open_namei_create(&nd, &path, flag, mode); + if (error) { + mnt_drop_write(nd.path.mnt); goto exit; - return 0; + } + filp = nameidata_to_filp(&nd, open_flag); + mnt_drop_write(nd.path.mnt); + return filp; } /* @@ -1804,23 +1852,52 @@ do_last: if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link) goto do_link; - path_to_nameidata(&path, nd); + path_to_nameidata(&path, &nd); error = -EISDIR; if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode)) goto exit; ok: - error = may_open(nd, acc_mode, flag); - if (error) + /* + * Consider: + * 1. may_open() truncates a file + * 2. a rw->ro mount transition occurs + * 3. nameidata_to_filp() fails due to + * the ro mount. + * That would be inconsistent, and should + * be avoided. Taking this mnt write here + * ensures that (2) can not occur. + */ + will_write = open_will_write_to_fs(flag, nd.path.dentry->d_inode); + if (will_write) { + error = mnt_want_write(nd.path.mnt); + if (error) + goto exit; + } + error = may_open(&nd, acc_mode, flag); + if (error) { + if (will_write) + mnt_drop_write(nd.path.mnt); goto exit; - return 0; + } + filp = nameidata_to_filp(&nd, open_flag); + /* + * It is now safe to drop the mnt write + * because the filp has had a write taken + * on its behalf. + */ + if (will_write) + mnt_drop_write(nd.path.mnt); + return filp; +exit_mutex_unlock: + mutex_unlock(&dir->d_inode->i_mutex); exit_dput: - path_put_conditional(&path, nd); + path_put_conditional(&path, &nd); exit: - if (!IS_ERR(nd->intent.open.file)) - release_open_intent(nd); - path_put(&nd->path); - return error; + if (!IS_ERR(nd.intent.open.file)) + release_open_intent(&nd); + path_put(&nd.path); + return ERR_PTR(error); do_link: error = -ELOOP; @@ -1836,43 +1913,60 @@ do_link: * stored in nd->last.name and we will have to putname() it when we * are done. Procfs-like symlinks just set LAST_BIND. */ - nd->flags |= LOOKUP_PARENT; - error = security_inode_follow_link(path.dentry, nd); + nd.flags |= LOOKUP_PARENT; + error = security_inode_follow_link(path.dentry, &nd); if (error) goto exit_dput; - error = __do_follow_link(&path, nd); + error = __do_follow_link(&path, &nd); if (error) { /* Does someone understand code flow here? Or it is only * me so stupid? Anathema to whoever designed this non-sense * with "intent.open". */ - release_open_intent(nd); - return error; + release_open_intent(&nd); + return ERR_PTR(error); } - nd->flags &= ~LOOKUP_PARENT; - if (nd->last_type == LAST_BIND) + nd.flags &= ~LOOKUP_PARENT; + if (nd.last_type == LAST_BIND) goto ok; error = -EISDIR; - if (nd->last_type != LAST_NORM) + if (nd.last_type != LAST_NORM) goto exit; - if (nd->last.name[nd->last.len]) { - __putname(nd->last.name); + if (nd.last.name[nd.last.len]) { + __putname(nd.last.name); goto exit; } error = -ELOOP; if (count++==32) { - __putname(nd->last.name); + __putname(nd.last.name); goto exit; } - dir = nd->path.dentry; + dir = nd.path.dentry; mutex_lock(&dir->d_inode->i_mutex); - path.dentry = lookup_hash(nd); - path.mnt = nd->path.mnt; - __putname(nd->last.name); + path.dentry = lookup_hash(&nd); + path.mnt = nd.path.mnt; + __putname(nd.last.name); goto do_last; } /** + * filp_open - open file and return file pointer + * + * @filename: path to open + * @flags: open flags as per the open(2) second argument + * @mode: mode for the new file if O_CREAT is set, else ignored + * + * This is the helper to open a file from kernelspace if you really + * have to. But in generally you should not do this, so please move + * along, nothing to see here.. + */ +struct file *filp_open(const char *filename, int flags, int mode) +{ + return do_filp_open(AT_FDCWD, filename, flags, mode); +} +EXPORT_SYMBOL(filp_open); + +/** * lookup_create - lookup a dentry, creating it if it doesn't exist * @nd: nameidata info * @is_dir: directory flag @@ -1945,6 +2039,23 @@ int vfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev) return error; } +static int may_mknod(mode_t mode) +{ + switch (mode & S_IFMT) { + case S_IFREG: + case S_IFCHR: + case S_IFBLK: + case S_IFIFO: + case S_IFSOCK: + case 0: /* zero mode translates to S_IFREG */ + return 0; + case S_IFDIR: + return -EPERM; + default: + return -EINVAL; + } +} + asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode, unsigned dev) { @@ -1963,12 +2074,19 @@ asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode, if (error) goto out; dentry = lookup_create(&nd, 0); - error = PTR_ERR(dentry); - + if (IS_ERR(dentry)) { + error = PTR_ERR(dentry); + goto out_unlock; + } if (!IS_POSIXACL(nd.path.dentry->d_inode)) mode &= ~current->fs->umask; - if (!IS_ERR(dentry)) { - switch (mode & S_IFMT) { + error = may_mknod(mode); + if (error) + goto out_dput; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; + switch (mode & S_IFMT) { case 0: case S_IFREG: error = vfs_create(nd.path.dentry->d_inode,dentry,mode,&nd); break; @@ -1979,14 +2097,11 @@ asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode, case S_IFIFO: case S_IFSOCK: error = vfs_mknod(nd.path.dentry->d_inode,dentry,mode,0); break; - case S_IFDIR: - error = -EPERM; - break; - default: - error = -EINVAL; - } - dput(dentry); } + mnt_drop_write(nd.path.mnt); +out_dput: + dput(dentry); +out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); path_put(&nd.path); out: @@ -2044,7 +2159,12 @@ asmlinkage long sys_mkdirat(int dfd, const char __user *pathname, int mode) if (!IS_POSIXACL(nd.path.dentry->d_inode)) mode &= ~current->fs->umask; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; error = vfs_mkdir(nd.path.dentry->d_inode, dentry, mode); + mnt_drop_write(nd.path.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); @@ -2151,7 +2271,12 @@ static long do_rmdir(int dfd, const char __user *pathname) error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto exit2; + error = mnt_want_write(nd.path.mnt); + if (error) + goto exit3; error = vfs_rmdir(nd.path.dentry->d_inode, dentry); + mnt_drop_write(nd.path.mnt); +exit3: dput(dentry); exit2: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); @@ -2232,7 +2357,11 @@ static long do_unlinkat(int dfd, const char __user *pathname) inode = dentry->d_inode; if (inode) atomic_inc(&inode->i_count); + error = mnt_want_write(nd.path.mnt); + if (error) + goto exit2; error = vfs_unlink(nd.path.dentry->d_inode, dentry); + mnt_drop_write(nd.path.mnt); exit2: dput(dentry); } @@ -2313,7 +2442,12 @@ asmlinkage long sys_symlinkat(const char __user *oldname, if (IS_ERR(dentry)) goto out_unlock; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; error = vfs_symlink(nd.path.dentry->d_inode, dentry, from, S_IALLUGO); + mnt_drop_write(nd.path.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); @@ -2408,7 +2542,12 @@ asmlinkage long sys_linkat(int olddfd, const char __user *oldname, error = PTR_ERR(new_dentry); if (IS_ERR(new_dentry)) goto out_unlock; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; error = vfs_link(old_nd.path.dentry, nd.path.dentry->d_inode, new_dentry); + mnt_drop_write(nd.path.mnt); +out_dput: dput(new_dentry); out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); @@ -2634,8 +2773,12 @@ static int do_rename(int olddfd, const char *oldname, if (new_dentry == trap) goto exit5; + error = mnt_want_write(oldnd.path.mnt); + if (error) + goto exit5; error = vfs_rename(old_dir->d_inode, old_dentry, new_dir->d_inode, new_dentry); + mnt_drop_write(oldnd.path.mnt); exit5: dput(new_dentry); exit4: diff --git a/fs/namespace.c b/fs/namespace.c index 94f026e..0505fb6 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -17,6 +17,7 @@ #include #include #include +#include #include #include #include @@ -26,6 +27,7 @@ #include #include #include +#include #include #include #include "pnode.h" @@ -38,6 +40,8 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(vfsmount_lock); static int event; +static DEFINE_IDA(mnt_id_ida); +static DEFINE_IDA(mnt_group_ida); static struct list_head *mount_hashtable __read_mostly; static struct kmem_cache *mnt_cache __read_mostly; @@ -55,10 +59,65 @@ static inline unsigned long hash(struct vfsmount *mnt, struct dentry *dentry) return tmp & (HASH_SIZE - 1); } +#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16) + +/* allocation is serialized by namespace_sem */ +static int mnt_alloc_id(struct vfsmount *mnt) +{ + int res; + +retry: + ida_pre_get(&mnt_id_ida, GFP_KERNEL); + spin_lock(&vfsmount_lock); + res = ida_get_new(&mnt_id_ida, &mnt->mnt_id); + spin_unlock(&vfsmount_lock); + if (res == -EAGAIN) + goto retry; + + return res; +} + +static void mnt_free_id(struct vfsmount *mnt) +{ + spin_lock(&vfsmount_lock); + ida_remove(&mnt_id_ida, mnt->mnt_id); + spin_unlock(&vfsmount_lock); +} + +/* + * Allocate a new peer group ID + * + * mnt_group_ida is protected by namespace_sem + */ +static int mnt_alloc_group_id(struct vfsmount *mnt) +{ + if (!ida_pre_get(&mnt_group_ida, GFP_KERNEL)) + return -ENOMEM; + + return ida_get_new_above(&mnt_group_ida, 1, &mnt->mnt_group_id); +} + +/* + * Release a peer group ID + */ +void mnt_release_group_id(struct vfsmount *mnt) +{ + ida_remove(&mnt_group_ida, mnt->mnt_group_id); + mnt->mnt_group_id = 0; +} + struct vfsmount *alloc_vfsmnt(const char *name) { struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL); if (mnt) { + int err; + + err = mnt_alloc_id(mnt); + if (err) { + kmem_cache_free(mnt_cache, mnt); + return NULL; + } + atomic_set(&mnt->mnt_count, 1); INIT_LIST_HEAD(&mnt->mnt_hash); INIT_LIST_HEAD(&mnt->mnt_child); @@ -68,6 +127,7 @@ struct vfsmount *alloc_vfsmnt(const char *name) INIT_LIST_HEAD(&mnt->mnt_share); INIT_LIST_HEAD(&mnt->mnt_slave_list); INIT_LIST_HEAD(&mnt->mnt_slave); + atomic_set(&mnt->__mnt_writers, 0); if (name) { int size = strlen(name) + 1; char *newname = kmalloc(size, GFP_KERNEL); @@ -80,6 +140,263 @@ struct vfsmount *alloc_vfsmnt(const char *name) return mnt; } +/* + * Most r/o checks on a fs are for operations that take + * discrete amounts of time, like a write() or unlink(). + * We must keep track of when those operations start + * (for permission checks) and when they end, so that + * we can determine when writes are able to occur to + * a filesystem. + */ +/* + * __mnt_is_readonly: check whether a mount is read-only + * @mnt: the mount to check for its write status + * + * This shouldn't be used directly ouside of the VFS. + * It does not guarantee that the filesystem will stay + * r/w, just that it is right *now*. This can not and + * should not be used in place of IS_RDONLY(inode). + * mnt_want/drop_write() will _keep_ the filesystem + * r/w. + */ +int __mnt_is_readonly(struct vfsmount *mnt) +{ + if (mnt->mnt_flags & MNT_READONLY) + return 1; + if (mnt->mnt_sb->s_flags & MS_RDONLY) + return 1; + return 0; +} +EXPORT_SYMBOL_GPL(__mnt_is_readonly); + +struct mnt_writer { + /* + * If holding multiple instances of this lock, they + * must be ordered by cpu number. + */ + spinlock_t lock; + struct lock_class_key lock_class; /* compiles out with !lockdep */ + unsigned long count; + struct vfsmount *mnt; +} ____cacheline_aligned_in_smp; +static DEFINE_PER_CPU(struct mnt_writer, mnt_writers); + +static int __init init_mnt_writers(void) +{ + int cpu; + for_each_possible_cpu(cpu) { + struct mnt_writer *writer = &per_cpu(mnt_writers, cpu); + spin_lock_init(&writer->lock); + lockdep_set_class(&writer->lock, &writer->lock_class); + writer->count = 0; + } + return 0; +} +fs_initcall(init_mnt_writers); + +static void unlock_mnt_writers(void) +{ + int cpu; + struct mnt_writer *cpu_writer; + + for_each_possible_cpu(cpu) { + cpu_writer = &per_cpu(mnt_writers, cpu); + spin_unlock(&cpu_writer->lock); + } +} + +static inline void __clear_mnt_count(struct mnt_writer *cpu_writer) +{ + if (!cpu_writer->mnt) + return; + /* + * This is in case anyone ever leaves an invalid, + * old ->mnt and a count of 0. + */ + if (!cpu_writer->count) + return; + atomic_add(cpu_writer->count, &cpu_writer->mnt->__mnt_writers); + cpu_writer->count = 0; +} + /* + * must hold cpu_writer->lock + */ +static inline void use_cpu_writer_for_mount(struct mnt_writer *cpu_writer, + struct vfsmount *mnt) +{ + if (cpu_writer->mnt == mnt) + return; + __clear_mnt_count(cpu_writer); + cpu_writer->mnt = mnt; +} + +/* + * Most r/o checks on a fs are for operations that take + * discrete amounts of time, like a write() or unlink(). + * We must keep track of when those operations start + * (for permission checks) and when they end, so that + * we can determine when writes are able to occur to + * a filesystem. + */ +/** + * mnt_want_write - get write access to a mount + * @mnt: the mount on which to take a write + * + * This tells the low-level filesystem that a write is + * about to be performed to it, and makes sure that + * writes are allowed before returning success. When + * the write operation is finished, mnt_drop_write() + * must be called. This is effectively a refcount. + */ +int mnt_want_write(struct vfsmount *mnt) +{ + int ret = 0; + struct mnt_writer *cpu_writer; + + cpu_writer = &get_cpu_var(mnt_writers); + spin_lock(&cpu_writer->lock); + if (__mnt_is_readonly(mnt)) { + ret = -EROFS; + goto out; + } + use_cpu_writer_for_mount(cpu_writer, mnt); + cpu_writer->count++; +out: + spin_unlock(&cpu_writer->lock); + put_cpu_var(mnt_writers); + return ret; +} +EXPORT_SYMBOL_GPL(mnt_want_write); + +static void lock_mnt_writers(void) +{ + int cpu; + struct mnt_writer *cpu_writer; + + for_each_possible_cpu(cpu) { + cpu_writer = &per_cpu(mnt_writers, cpu); + spin_lock(&cpu_writer->lock); + __clear_mnt_count(cpu_writer); + cpu_writer->mnt = NULL; + } +} + +/* + * These per-cpu write counts are not guaranteed to have + * matched increments and decrements on any given cpu. + * A file open()ed for write on one cpu and close()d on + * another cpu will imbalance this count. Make sure it + * does not get too far out of whack. + */ +static void handle_write_count_underflow(struct vfsmount *mnt) +{ + if (atomic_read(&mnt->__mnt_writers) >= + MNT_WRITER_UNDERFLOW_LIMIT) + return; + /* + * It isn't necessary to hold all of the locks + * at the same time, but doing it this way makes + * us share a lot more code. + */ + lock_mnt_writers(); + /* + * vfsmount_lock is for mnt_flags. + */ + spin_lock(&vfsmount_lock); + /* + * If coalescing the per-cpu writer counts did not + * get us back to a positive writer count, we have + * a bug. + */ + if ((atomic_read(&mnt->__mnt_writers) < 0) && + !(mnt->mnt_flags & MNT_IMBALANCED_WRITE_COUNT)) { + printk(KERN_DEBUG "leak detected on mount(%p) writers " + "count: %d\n", + mnt, atomic_read(&mnt->__mnt_writers)); + WARN_ON(1); + /* use the flag to keep the dmesg spam down */ + mnt->mnt_flags |= MNT_IMBALANCED_WRITE_COUNT; + } + spin_unlock(&vfsmount_lock); + unlock_mnt_writers(); +} + +/** + * mnt_drop_write - give up write access to a mount + * @mnt: the mount on which to give up write access + * + * Tells the low-level filesystem that we are done + * performing writes to it. Must be matched with + * mnt_want_write() call above. + */ +void mnt_drop_write(struct vfsmount *mnt) +{ + int must_check_underflow = 0; + struct mnt_writer *cpu_writer; + + cpu_writer = &get_cpu_var(mnt_writers); + spin_lock(&cpu_writer->lock); + + use_cpu_writer_for_mount(cpu_writer, mnt); + if (cpu_writer->count > 0) { + cpu_writer->count--; + } else { + must_check_underflow = 1; + atomic_dec(&mnt->__mnt_writers); + } + + spin_unlock(&cpu_writer->lock); + /* + * Logically, we could call this each time, + * but the __mnt_writers cacheline tends to + * be cold, and makes this expensive. + */ + if (must_check_underflow) + handle_write_count_underflow(mnt); + /* + * This could be done right after the spinlock + * is taken because the spinlock keeps us on + * the cpu, and disables preemption. However, + * putting it here bounds the amount that + * __mnt_writers can underflow. Without it, + * we could theoretically wrap __mnt_writers. + */ + put_cpu_var(mnt_writers); +} +EXPORT_SYMBOL_GPL(mnt_drop_write); + +static int mnt_make_readonly(struct vfsmount *mnt) +{ + int ret = 0; + + lock_mnt_writers(); + /* + * With all the locks held, this value is stable + */ + if (atomic_read(&mnt->__mnt_writers) > 0) { + ret = -EBUSY; + goto out; + } + /* + * nobody can do a successful mnt_want_write() with all + * of the counts in MNT_DENIED_WRITE and the locks held. + */ + spin_lock(&vfsmount_lock); + if (!ret) + mnt->mnt_flags |= MNT_READONLY; + spin_unlock(&vfsmount_lock); +out: + unlock_mnt_writers(); + return ret; +} + +static void __mnt_unmake_readonly(struct vfsmount *mnt) +{ + spin_lock(&vfsmount_lock); + mnt->mnt_flags &= ~MNT_READONLY; + spin_unlock(&vfsmount_lock); +} + int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb) { mnt->mnt_sb = sb; @@ -92,6 +409,7 @@ EXPORT_SYMBOL(simple_set_mnt); void free_vfsmnt(struct vfsmount *mnt) { kfree(mnt->mnt_devname); + mnt_free_id(mnt); kmem_cache_free(mnt_cache, mnt); } @@ -238,6 +556,17 @@ static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, struct vfsmount *mnt = alloc_vfsmnt(old->mnt_devname); if (mnt) { + if (flag & (CL_SLAVE | CL_PRIVATE)) + mnt->mnt_group_id = 0; /* not a peer of original */ + else + mnt->mnt_group_id = old->mnt_group_id; + + if ((flag & CL_MAKE_SHARED) && !mnt->mnt_group_id) { + int err = mnt_alloc_group_id(mnt); + if (err) + goto out_free; + } + mnt->mnt_flags = old->mnt_flags; atomic_inc(&sb->s_active); mnt->mnt_sb = sb; @@ -267,11 +596,44 @@ static struct vfsmount *clone_mnt(struct vfsmount *old, struct dentry *root, } } return mnt; + + out_free: + free_vfsmnt(mnt); + return NULL; } static inline void __mntput(struct vfsmount *mnt) { + int cpu; struct super_block *sb = mnt->mnt_sb; + /* + * We don't have to hold all of the locks at the + * same time here because we know that we're the + * last reference to mnt and that no new writers + * can come in. + */ + for_each_possible_cpu(cpu) { + struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu); + if (cpu_writer->mnt != mnt) + continue; + spin_lock(&cpu_writer->lock); + atomic_add(cpu_writer->count, &mnt->__mnt_writers); + cpu_writer->count = 0; + /* + * Might as well do this so that no one + * ever sees the pointer and expects + * it to be valid. + */ + cpu_writer->mnt = NULL; + spin_unlock(&cpu_writer->lock); + } + /* + * This probably indicates that somebody messed + * up a mnt_want/drop_write() pair. If this + * happens, the filesystem was probably unable + * to make r/w->r/o transitions. + */ + WARN_ON(atomic_read(&mnt->__mnt_writers)); dput(mnt->mnt_root); free_vfsmnt(mnt); deactivate_super(sb); @@ -362,20 +724,21 @@ void save_mount_options(struct super_block *sb, char *options) } EXPORT_SYMBOL(save_mount_options); +#ifdef CONFIG_PROC_FS /* iterator */ static void *m_start(struct seq_file *m, loff_t *pos) { - struct mnt_namespace *n = m->private; + struct proc_mounts *p = m->private; down_read(&namespace_sem); - return seq_list_start(&n->list, *pos); + return seq_list_start(&p->ns->list, *pos); } static void *m_next(struct seq_file *m, void *v, loff_t *pos) { - struct mnt_namespace *n = m->private; + struct proc_mounts *p = m->private; - return seq_list_next(v, &n->list, pos); + return seq_list_next(v, &p->ns->list, pos); } static void m_stop(struct seq_file *m, void *v) @@ -383,20 +746,30 @@ static void m_stop(struct seq_file *m, void *v) up_read(&namespace_sem); } -static int show_vfsmnt(struct seq_file *m, void *v) +struct proc_fs_info { + int flag; + const char *str; +}; + +static void show_sb_opts(struct seq_file *m, struct super_block *sb) { - struct vfsmount *mnt = list_entry(v, struct vfsmount, mnt_list); - int err = 0; - static struct proc_fs_info { - int flag; - char *str; - } fs_info[] = { + static const struct proc_fs_info fs_info[] = { { MS_SYNCHRONOUS, ",sync" }, { MS_DIRSYNC, ",dirsync" }, { MS_MANDLOCK, ",mand" }, { 0, NULL } }; - static struct proc_fs_info mnt_info[] = { + const struct proc_fs_info *fs_infop; + + for (fs_infop = fs_info; fs_infop->flag; fs_infop++) { + if (sb->s_flags & fs_infop->flag) + seq_puts(m, fs_infop->str); + } +} + +static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt) +{ + static const struct proc_fs_info mnt_info[] = { { MNT_NOSUID, ",nosuid" }, { MNT_NODEV, ",nodev" }, { MNT_NOEXEC, ",noexec" }, @@ -405,40 +778,108 @@ static int show_vfsmnt(struct seq_file *m, void *v) { MNT_RELATIME, ",relatime" }, { 0, NULL } }; - struct proc_fs_info *fs_infop; + const struct proc_fs_info *fs_infop; + + for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) { + if (mnt->mnt_flags & fs_infop->flag) + seq_puts(m, fs_infop->str); + } +} + +static void show_type(struct seq_file *m, struct super_block *sb) +{ + mangle(m, sb->s_type->name); + if (sb->s_subtype && sb->s_subtype[0]) { + seq_putc(m, '.'); + mangle(m, sb->s_subtype); + } +} + +static int show_vfsmnt(struct seq_file *m, void *v) +{ + struct vfsmount *mnt = list_entry(v, struct vfsmount, mnt_list); + int err = 0; struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; mangle(m, mnt->mnt_devname ? mnt->mnt_devname : "none"); seq_putc(m, ' '); seq_path(m, &mnt_path, " \t\n\\"); seq_putc(m, ' '); - mangle(m, mnt->mnt_sb->s_type->name); - if (mnt->mnt_sb->s_subtype && mnt->mnt_sb->s_subtype[0]) { - seq_putc(m, '.'); - mangle(m, mnt->mnt_sb->s_subtype); - } - seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw"); - for (fs_infop = fs_info; fs_infop->flag; fs_infop++) { - if (mnt->mnt_sb->s_flags & fs_infop->flag) - seq_puts(m, fs_infop->str); - } - for (fs_infop = mnt_info; fs_infop->flag; fs_infop++) { - if (mnt->mnt_flags & fs_infop->flag) - seq_puts(m, fs_infop->str); - } + show_type(m, mnt->mnt_sb); + seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw"); + show_sb_opts(m, mnt->mnt_sb); + show_mnt_opts(m, mnt); if (mnt->mnt_sb->s_op->show_options) err = mnt->mnt_sb->s_op->show_options(m, mnt); seq_puts(m, " 0 0\n"); return err; } -struct seq_operations mounts_op = { +const struct seq_operations mounts_op = { .start = m_start, .next = m_next, .stop = m_stop, .show = show_vfsmnt }; +static int show_mountinfo(struct seq_file *m, void *v) +{ + struct proc_mounts *p = m->private; + struct vfsmount *mnt = list_entry(v, struct vfsmount, mnt_list); + struct super_block *sb = mnt->mnt_sb; + struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; + struct path root = p->root; + int err = 0; + + seq_printf(m, "%i %i %u:%u ", mnt->mnt_id, mnt->mnt_parent->mnt_id, + MAJOR(sb->s_dev), MINOR(sb->s_dev)); + seq_dentry(m, mnt->mnt_root, " \t\n\\"); + seq_putc(m, ' '); + seq_path_root(m, &mnt_path, &root, " \t\n\\"); + if (root.mnt != p->root.mnt || root.dentry != p->root.dentry) { + /* + * Mountpoint is outside root, discard that one. Ugly, + * but less so than trying to do that in iterator in a + * race-free way (due to renames). + */ + return SEQ_SKIP; + } + seq_puts(m, mnt->mnt_flags & MNT_READONLY ? " ro" : " rw"); + show_mnt_opts(m, mnt); + + /* Tagged fields ("foo:X" or "bar") */ + if (IS_MNT_SHARED(mnt)) + seq_printf(m, " shared:%i", mnt->mnt_group_id); + if (IS_MNT_SLAVE(mnt)) { + int master = mnt->mnt_master->mnt_group_id; + int dom = get_dominating_id(mnt, &p->root); + seq_printf(m, " master:%i", master); + if (dom && dom != master) + seq_printf(m, " propagate_from:%i", dom); + } + if (IS_MNT_UNBINDABLE(mnt)) + seq_puts(m, " unbindable"); + + /* Filesystem specific data */ + seq_puts(m, " - "); + show_type(m, sb); + seq_putc(m, ' '); + mangle(m, mnt->mnt_devname ? mnt->mnt_devname : "none"); + seq_puts(m, sb->s_flags & MS_RDONLY ? " ro" : " rw"); + show_sb_opts(m, sb); + if (sb->s_op->show_options) + err = sb->s_op->show_options(m, mnt); + seq_putc(m, '\n'); + return err; +} + +const struct seq_operations mountinfo_op = { + .start = m_start, + .next = m_next, + .stop = m_stop, + .show = show_mountinfo, +}; + static int show_vfsstat(struct seq_file *m, void *v) { struct vfsmount *mnt = list_entry(v, struct vfsmount, mnt_list); @@ -459,7 +900,7 @@ static int show_vfsstat(struct seq_file *m, void *v) /* file system type */ seq_puts(m, "with fstype "); - mangle(m, mnt->mnt_sb->s_type->name); + show_type(m, mnt->mnt_sb); /* optional statistics */ if (mnt->mnt_sb->s_op->show_stats) { @@ -471,12 +912,13 @@ static int show_vfsstat(struct seq_file *m, void *v) return err; } -struct seq_operations mountstats_op = { +const struct seq_operations mountstats_op = { .start = m_start, .next = m_next, .stop = m_stop, .show = show_vfsstat, }; +#endif /* CONFIG_PROC_FS */ /** * may_umount_tree - check if a mount tree is busy @@ -801,23 +1243,50 @@ Enomem: struct vfsmount *collect_mounts(struct vfsmount *mnt, struct dentry *dentry) { struct vfsmount *tree; - down_read(&namespace_sem); + down_write(&namespace_sem); tree = copy_tree(mnt, dentry, CL_COPY_ALL | CL_PRIVATE); - up_read(&namespace_sem); + up_write(&namespace_sem); return tree; } void drop_collected_mounts(struct vfsmount *mnt) { LIST_HEAD(umount_list); - down_read(&namespace_sem); + down_write(&namespace_sem); spin_lock(&vfsmount_lock); umount_tree(mnt, 0, &umount_list); spin_unlock(&vfsmount_lock); - up_read(&namespace_sem); + up_write(&namespace_sem); release_mounts(&umount_list); } +static void cleanup_group_ids(struct vfsmount *mnt, struct vfsmount *end) +{ + struct vfsmount *p; + + for (p = mnt; p != end; p = next_mnt(p, mnt)) { + if (p->mnt_group_id && !IS_MNT_SHARED(p)) + mnt_release_group_id(p); + } +} + +static int invent_group_ids(struct vfsmount *mnt, bool recurse) +{ + struct vfsmount *p; + + for (p = mnt; p; p = recurse ? next_mnt(p, mnt) : NULL) { + if (!p->mnt_group_id && !IS_MNT_SHARED(p)) { + int err = mnt_alloc_group_id(p); + if (err) { + cleanup_group_ids(mnt, p); + return err; + } + } + } + + return 0; +} + /* * @source_mnt : mount tree to be attached * @nd : place the mount tree @source_mnt is attached @@ -888,9 +1357,16 @@ static int attach_recursive_mnt(struct vfsmount *source_mnt, struct vfsmount *dest_mnt = path->mnt; struct dentry *dest_dentry = path->dentry; struct vfsmount *child, *p; + int err; - if (propagate_mnt(dest_mnt, dest_dentry, source_mnt, &tree_list)) - return -EINVAL; + if (IS_MNT_SHARED(dest_mnt)) { + err = invent_group_ids(source_mnt, true); + if (err) + goto out; + } + err = propagate_mnt(dest_mnt, dest_dentry, source_mnt, &tree_list); + if (err) + goto out_cleanup_ids; if (IS_MNT_SHARED(dest_mnt)) { for (p = source_mnt; p; p = next_mnt(p, source_mnt)) @@ -913,34 +1389,40 @@ static int attach_recursive_mnt(struct vfsmount *source_mnt, } spin_unlock(&vfsmount_lock); return 0; + + out_cleanup_ids: + if (IS_MNT_SHARED(dest_mnt)) + cleanup_group_ids(source_mnt, NULL); + out: + return err; } -static int graft_tree(struct vfsmount *mnt, struct nameidata *nd) +static int graft_tree(struct vfsmount *mnt, struct path *path) { int err; if (mnt->mnt_sb->s_flags & MS_NOUSER) return -EINVAL; - if (S_ISDIR(nd->path.dentry->d_inode->i_mode) != + if (S_ISDIR(path->dentry->d_inode->i_mode) != S_ISDIR(mnt->mnt_root->d_inode->i_mode)) return -ENOTDIR; err = -ENOENT; - mutex_lock(&nd->path.dentry->d_inode->i_mutex); - if (IS_DEADDIR(nd->path.dentry->d_inode)) + mutex_lock(&path->dentry->d_inode->i_mutex); + if (IS_DEADDIR(path->dentry->d_inode)) goto out_unlock; - err = security_sb_check_sb(mnt, nd); + err = security_sb_check_sb(mnt, path); if (err) goto out_unlock; err = -ENOENT; - if (IS_ROOT(nd->path.dentry) || !d_unhashed(nd->path.dentry)) - err = attach_recursive_mnt(mnt, &nd->path, NULL); + if (IS_ROOT(path->dentry) || !d_unhashed(path->dentry)) + err = attach_recursive_mnt(mnt, path, NULL); out_unlock: - mutex_unlock(&nd->path.dentry->d_inode->i_mutex); + mutex_unlock(&path->dentry->d_inode->i_mutex); if (!err) - security_sb_post_addmount(mnt, nd); + security_sb_post_addmount(mnt, path); return err; } @@ -953,6 +1435,7 @@ static noinline int do_change_type(struct nameidata *nd, int flag) struct vfsmount *m, *mnt = nd->path.mnt; int recurse = flag & MS_REC; int type = flag & ~MS_REC; + int err = 0; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -961,12 +1444,20 @@ static noinline int do_change_type(struct nameidata *nd, int flag) return -EINVAL; down_write(&namespace_sem); + if (type == MS_SHARED) { + err = invent_group_ids(mnt, recurse); + if (err) + goto out_unlock; + } + spin_lock(&vfsmount_lock); for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL)) change_mnt_propagation(m, type); spin_unlock(&vfsmount_lock); + + out_unlock: up_write(&namespace_sem); - return 0; + return err; } /* @@ -1004,7 +1495,7 @@ static noinline int do_loopback(struct nameidata *nd, char *old_name, if (!mnt) goto out; - err = graft_tree(mnt, nd); + err = graft_tree(mnt, &nd->path); if (err) { LIST_HEAD(umount_list); spin_lock(&vfsmount_lock); @@ -1019,6 +1510,23 @@ out: return err; } +static int change_mount_flags(struct vfsmount *mnt, int ms_flags) +{ + int error = 0; + int readonly_request = 0; + + if (ms_flags & MS_RDONLY) + readonly_request = 1; + if (readonly_request == __mnt_is_readonly(mnt)) + return 0; + + if (readonly_request) + error = mnt_make_readonly(mnt); + else + __mnt_unmake_readonly(mnt); + return error; +} + /* * change filesystem flags. dir should be a physical root of filesystem. * If you've mounted a non-root directory somewhere and want to do remount @@ -1041,7 +1549,10 @@ static noinline int do_remount(struct nameidata *nd, int flags, int mnt_flags, return -EINVAL; down_write(&sb->s_umount); - err = do_remount_sb(sb, flags, data, 0); + if (flags & MS_BIND) + err = change_mount_flags(nd->path.mnt, flags); + else + err = do_remount_sb(sb, flags, data, 0); if (!err) nd->path.mnt->mnt_flags = mnt_flags; up_write(&sb->s_umount); @@ -1191,7 +1702,7 @@ int do_add_mount(struct vfsmount *newmnt, struct nameidata *nd, goto unlock; newmnt->mnt_flags = mnt_flags; - if ((err = graft_tree(newmnt, nd))) + if ((err = graft_tree(newmnt, &nd->path))) goto unlock; if (fslist) /* add to the specified expiration list */ @@ -1425,6 +1936,8 @@ long do_mount(char *dev_name, char *dir_name, char *type_page, mnt_flags |= MNT_NODIRATIME; if (flags & MS_RELATIME) mnt_flags |= MNT_RELATIME; + if (flags & MS_RDONLY) + mnt_flags |= MNT_READONLY; flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE | MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT); @@ -1434,7 +1947,8 @@ long do_mount(char *dev_name, char *dir_name, char *type_page, if (retval) return retval; - retval = security_sb_mount(dev_name, &nd, type_page, flags, data_page); + retval = security_sb_mount(dev_name, &nd.path, + type_page, flags, data_page); if (retval) goto dput_out; @@ -1674,15 +2188,13 @@ asmlinkage long sys_pivot_root(const char __user * new_root, const char __user * put_old) { struct vfsmount *tmp; - struct nameidata new_nd, old_nd, user_nd; - struct path parent_path, root_parent; + struct nameidata new_nd, old_nd; + struct path parent_path, root_parent, root; int error; if (!capable(CAP_SYS_ADMIN)) return -EPERM; - lock_kernel(); - error = __user_walk(new_root, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &new_nd); if (error) @@ -1695,14 +2207,14 @@ asmlinkage long sys_pivot_root(const char __user * new_root, if (error) goto out1; - error = security_sb_pivotroot(&old_nd, &new_nd); + error = security_sb_pivotroot(&old_nd.path, &new_nd.path); if (error) { path_put(&old_nd.path); goto out1; } read_lock(¤t->fs->lock); - user_nd.path = current->fs->root; + root = current->fs->root; path_get(¤t->fs->root); read_unlock(¤t->fs->lock); down_write(&namespace_sem); @@ -1710,9 +2222,9 @@ asmlinkage long sys_pivot_root(const char __user * new_root, error = -EINVAL; if (IS_MNT_SHARED(old_nd.path.mnt) || IS_MNT_SHARED(new_nd.path.mnt->mnt_parent) || - IS_MNT_SHARED(user_nd.path.mnt->mnt_parent)) + IS_MNT_SHARED(root.mnt->mnt_parent)) goto out2; - if (!check_mnt(user_nd.path.mnt)) + if (!check_mnt(root.mnt)) goto out2; error = -ENOENT; if (IS_DEADDIR(new_nd.path.dentry->d_inode)) @@ -1722,13 +2234,13 @@ asmlinkage long sys_pivot_root(const char __user * new_root, if (d_unhashed(old_nd.path.dentry) && !IS_ROOT(old_nd.path.dentry)) goto out2; error = -EBUSY; - if (new_nd.path.mnt == user_nd.path.mnt || - old_nd.path.mnt == user_nd.path.mnt) + if (new_nd.path.mnt == root.mnt || + old_nd.path.mnt == root.mnt) goto out2; /* loop, on the same file system */ error = -EINVAL; - if (user_nd.path.mnt->mnt_root != user_nd.path.dentry) + if (root.mnt->mnt_root != root.dentry) goto out2; /* not a mountpoint */ - if (user_nd.path.mnt->mnt_parent == user_nd.path.mnt) + if (root.mnt->mnt_parent == root.mnt) goto out2; /* not attached */ if (new_nd.path.mnt->mnt_root != new_nd.path.dentry) goto out2; /* not a mountpoint */ @@ -1750,27 +2262,26 @@ asmlinkage long sys_pivot_root(const char __user * new_root, } else if (!is_subdir(old_nd.path.dentry, new_nd.path.dentry)) goto out3; detach_mnt(new_nd.path.mnt, &parent_path); - detach_mnt(user_nd.path.mnt, &root_parent); + detach_mnt(root.mnt, &root_parent); /* mount old root on put_old */ - attach_mnt(user_nd.path.mnt, &old_nd.path); + attach_mnt(root.mnt, &old_nd.path); /* mount new_root on / */ attach_mnt(new_nd.path.mnt, &root_parent); touch_mnt_namespace(current->nsproxy->mnt_ns); spin_unlock(&vfsmount_lock); - chroot_fs_refs(&user_nd.path, &new_nd.path); - security_sb_post_pivotroot(&user_nd, &new_nd); + chroot_fs_refs(&root, &new_nd.path); + security_sb_post_pivotroot(&root, &new_nd.path); error = 0; path_put(&root_parent); path_put(&parent_path); out2: mutex_unlock(&old_nd.path.dentry->d_inode->i_mutex); up_write(&namespace_sem); - path_put(&user_nd.path); + path_put(&root); path_put(&old_nd.path); out1: path_put(&new_nd.path); out0: - unlock_kernel(); return error; out3: spin_unlock(&vfsmount_lock); diff --git a/fs/ncpfs/ioctl.c b/fs/ncpfs/ioctl.c index c67b4bd..10c07ff 100644 --- a/fs/ncpfs/ioctl.c +++ b/fs/ncpfs/ioctl.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* server, struct ncp_nls_ioctl __user *arg) } #endif /* CONFIG_NCPFS_NLS */ -int ncp_ioctl(struct inode *inode, struct file *filp, +static int __ncp_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg) { struct ncp_server *server = NCP_SERVER(inode); @@ -822,6 +823,57 @@ outrel: return -EINVAL; } +static int ncp_ioctl_need_write(unsigned int cmd) +{ + switch (cmd) { + case NCP_IOC_GET_FS_INFO: + case NCP_IOC_GET_FS_INFO_V2: + case NCP_IOC_NCPREQUEST: + case NCP_IOC_SETDENTRYTTL: + case NCP_IOC_SIGN_INIT: + case NCP_IOC_LOCKUNLOCK: + case NCP_IOC_SET_SIGN_WANTED: + return 1; + case NCP_IOC_GETOBJECTNAME: + case NCP_IOC_SETOBJECTNAME: + case NCP_IOC_GETPRIVATEDATA: + case NCP_IOC_SETPRIVATEDATA: + case NCP_IOC_SETCHARSETS: + case NCP_IOC_GETCHARSETS: + case NCP_IOC_CONN_LOGGED_IN: + case NCP_IOC_GETDENTRYTTL: + case NCP_IOC_GETMOUNTUID2: + case NCP_IOC_SIGN_WANTED: + case NCP_IOC_GETROOT: + case NCP_IOC_SETROOT: + return 0; + default: + /* unkown IOCTL command, assume write */ + return 1; + } +} + +int ncp_ioctl(struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + int ret; + + if (ncp_ioctl_need_write(cmd)) { + /* + * inside the ioctl(), any failures which + * are because of file_permission() are + * -EACCESS, so it seems consistent to keep + * that here. + */ + if (mnt_want_write(filp->f_vfsmnt)) + return -EACCES; + } + ret = __ncp_ioctl(inode, filp, cmd, arg); + if (ncp_ioctl_need_write(cmd)) + mnt_drop_write(filp->f_vfsmnt); + return ret; +} + #ifdef CONFIG_COMPAT long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c index 6cea747..d9e30ac 100644 --- a/fs/nfs/dir.c +++ b/fs/nfs/dir.c @@ -967,7 +967,8 @@ static int is_atomic_open(struct inode *dir, struct nameidata *nd) if (nd->flags & LOOKUP_DIRECTORY) return 0; /* Are we trying to write to a read only partition? */ - if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE))) + if (__mnt_is_readonly(nd->path.mnt) && + (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE))) return 0; return 1; } diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c index c593db0..c309c88 100644 --- a/fs/nfsd/nfs4proc.c +++ b/fs/nfsd/nfs4proc.c @@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, return status; } } + status = mnt_want_write(cstate->current_fh.fh_export->ex_path.mnt); + if (status) + return status; status = nfs_ok; if (setattr->sa_acl != NULL) status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh, setattr->sa_acl); if (status) - return status; + goto out; status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr, 0, (time_t)0); +out: + mnt_drop_write(cstate->current_fh.fh_export->ex_path.mnt); return status; } diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c index 1ff9062..f33a847 100644 --- a/fs/nfsd/nfs4recover.c +++ b/fs/nfsd/nfs4recover.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -154,7 +155,11 @@ nfsd4_create_clid_dir(struct nfs4_client *clp) dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n"); goto out_put; } + status = mnt_want_write(rec_dir.path.mnt); + if (status) + goto out_put; status = vfs_mkdir(rec_dir.path.dentry->d_inode, dentry, S_IRWXU); + mnt_drop_write(rec_dir.path.mnt); out_put: dput(dentry); out_unlock: diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c index bcb97d8..81a75f3 100644 --- a/fs/nfsd/nfs4state.c +++ b/fs/nfsd/nfs4state.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -1239,7 +1240,7 @@ static inline void nfs4_file_downgrade(struct file *filp, unsigned int share_access) { if (share_access & NFS4_SHARE_ACCESS_WRITE) { - put_write_access(filp->f_path.dentry->d_inode); + drop_file_write_access(filp); filp->f_mode = (filp->f_mode | FMODE_READ) & ~FMODE_WRITE; } } diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c index 46f59d5..b923e89 100644 --- a/fs/nfsd/vfs.c +++ b/fs/nfsd/vfs.c @@ -1264,7 +1264,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp, case S_IFBLK: case S_IFIFO: case S_IFSOCK: + host_err = mnt_want_write(fhp->fh_export->ex_path.mnt); + if (host_err) + break; host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev); + mnt_drop_write(fhp->fh_export->ex_path.mnt); break; default: printk("nfsd: bad file type %o in nfsd_create\n", type); @@ -1678,13 +1682,20 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, if (ndentry == trap) goto out_dput_new; -#ifdef MSNFS - if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) && + if (svc_msnfs(ffhp) && ((atomic_read(&odentry->d_count) > 1) || (atomic_read(&ndentry->d_count) > 1))) { host_err = -EPERM; - } else -#endif + goto out_dput_new; + } + + host_err = -EXDEV; + if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt) + goto out_dput_new; + host_err = mnt_want_write(ffhp->fh_export->ex_path.mnt); + if (host_err) + goto out_dput_new; + host_err = vfs_rename(fdir, odentry, tdir, ndentry); if (!host_err && EX_ISSYNC(tfhp->fh_export)) { host_err = nfsd_sync_dir(tdentry); @@ -1692,6 +1703,8 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen, host_err = nfsd_sync_dir(fdentry); } + mnt_drop_write(ffhp->fh_export->ex_path.mnt); + out_dput_new: dput(ndentry); out_dput_old: @@ -1865,7 +1878,7 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp, inode->i_mode, IS_IMMUTABLE(inode)? " immut" : "", IS_APPEND(inode)? " append" : "", - IS_RDONLY(inode)? " ro" : ""); + __mnt_is_readonly(exp->ex_path.mnt)? " ro" : ""); dprintk(" owner %d/%d user %d/%d\n", inode->i_uid, inode->i_gid, current->fsuid, current->fsgid); #endif @@ -1876,7 +1889,8 @@ nfsd_permission(struct svc_rqst *rqstp, struct svc_export *exp, */ if (!(acc & MAY_LOCAL_ACCESS)) if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) { - if (exp_rdonly(rqstp, exp) || IS_RDONLY(inode)) + if (exp_rdonly(rqstp, exp) || + __mnt_is_readonly(exp->ex_path.mnt)) return nfserr_rofs; if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode)) return nfserr_perm; diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c index 5177fba..d6ed677 100644 --- a/fs/ocfs2/ioctl.c +++ b/fs/ocfs2/ioctl.c @@ -59,10 +59,6 @@ static int ocfs2_set_inode_attr(struct inode *inode, unsigned flags, goto bail; } - status = -EROFS; - if (IS_RDONLY(inode)) - goto bail_unlock; - status = -EACCES; if (!is_owner_or_cap(inode)) goto bail_unlock; @@ -133,8 +129,13 @@ int ocfs2_ioctl(struct inode * inode, struct file * filp, if (get_user(flags, (int __user *) arg)) return -EFAULT; - return ocfs2_set_inode_attr(inode, flags, + status = mnt_want_write(filp->f_vfsmnt); + if (status) + return status; + status = ocfs2_set_inode_attr(inode, flags, OCFS2_FL_MODIFIABLE); + mnt_drop_write(filp->f_vfsmnt); + return status; case OCFS2_IOC_RESVSP: case OCFS2_IOC_RESVSP64: case OCFS2_IOC_UNRESVSP: diff --git a/fs/open.c b/fs/open.c index 3fa4e4f..303b44f 100644 --- a/fs/open.c +++ b/fs/open.c @@ -244,21 +244,21 @@ static long do_sys_truncate(const char __user * path, loff_t length) if (!S_ISREG(inode->i_mode)) goto dput_and_out; - error = vfs_permission(&nd, MAY_WRITE); + error = mnt_want_write(nd.path.mnt); if (error) goto dput_and_out; - error = -EROFS; - if (IS_RDONLY(inode)) - goto dput_and_out; + error = vfs_permission(&nd, MAY_WRITE); + if (error) + goto mnt_drop_write_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto mnt_drop_write_and_out; error = get_write_access(inode); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; /* * Make sure that there are no leases. get_write_access() protects @@ -276,6 +276,8 @@ static long do_sys_truncate(const char __user * path, loff_t length) put_write_and_out: put_write_access(inode); +mnt_drop_write_and_out: + mnt_drop_write(nd.path.mnt); dput_and_out: path_put(&nd.path); out: @@ -457,8 +459,17 @@ asmlinkage long sys_faccessat(int dfd, const char __user *filename, int mode) if(res || !(mode & S_IWOTH) || special_file(nd.path.dentry->d_inode->i_mode)) goto out_path_release; - - if(IS_RDONLY(nd.path.dentry->d_inode)) + /* + * This is a rare case where using __mnt_is_readonly() + * is OK without a mnt_want/drop_write() pair. Since + * no actual write to the fs is performed here, we do + * not need to telegraph to that to anyone. + * + * By doing this, we accept that this access is + * inherently racy and know that the fs may change + * state before we even see this result. + */ + if (__mnt_is_readonly(nd.path.mnt)) res = -EROFS; out_path_release: @@ -567,12 +578,12 @@ asmlinkage long sys_fchmod(unsigned int fd, mode_t mode) audit_inode(NULL, dentry); - err = -EROFS; - if (IS_RDONLY(inode)) + err = mnt_want_write(file->f_vfsmnt); + if (err) goto out_putf; err = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto out_putf; + goto out_drop_write; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) mode = inode->i_mode; @@ -581,6 +592,8 @@ asmlinkage long sys_fchmod(unsigned int fd, mode_t mode) err = notify_change(dentry, &newattrs); mutex_unlock(&inode->i_mutex); +out_drop_write: + mnt_drop_write(file->f_vfsmnt); out_putf: fput(file); out: @@ -600,13 +613,13 @@ asmlinkage long sys_fchmodat(int dfd, const char __user *filename, goto out; inode = nd.path.dentry->d_inode; - error = -EROFS; - if (IS_RDONLY(inode)) + error = mnt_want_write(nd.path.mnt); + if (error) goto dput_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto out_drop_write; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) @@ -616,6 +629,8 @@ asmlinkage long sys_fchmodat(int dfd, const char __user *filename, error = notify_change(nd.path.dentry, &newattrs); mutex_unlock(&inode->i_mutex); +out_drop_write: + mnt_drop_write(nd.path.mnt); dput_and_out: path_put(&nd.path); out: @@ -638,9 +653,6 @@ static int chown_common(struct dentry * dentry, uid_t user, gid_t group) printk(KERN_ERR "chown_common: NULL inode\n"); goto out; } - error = -EROFS; - if (IS_RDONLY(inode)) - goto out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) goto out; @@ -671,7 +683,12 @@ asmlinkage long sys_chown(const char __user * filename, uid_t user, gid_t group) error = user_path_walk(filename, &nd); if (error) goto out; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_release; error = chown_common(nd.path.dentry, user, group); + mnt_drop_write(nd.path.mnt); +out_release: path_put(&nd.path); out: return error; @@ -691,7 +708,12 @@ asmlinkage long sys_fchownat(int dfd, const char __user *filename, uid_t user, error = __user_walk_fd(dfd, filename, follow, &nd); if (error) goto out; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_release; error = chown_common(nd.path.dentry, user, group); + mnt_drop_write(nd.path.mnt); +out_release: path_put(&nd.path); out: return error; @@ -705,7 +727,12 @@ asmlinkage long sys_lchown(const char __user * filename, uid_t user, gid_t group error = user_path_walk_link(filename, &nd); if (error) goto out; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_release; error = chown_common(nd.path.dentry, user, group); + mnt_drop_write(nd.path.mnt); +out_release: path_put(&nd.path); out: return error; @@ -722,14 +749,48 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group) if (!file) goto out; + error = mnt_want_write(file->f_vfsmnt); + if (error) + goto out_fput; dentry = file->f_path.dentry; audit_inode(NULL, dentry); error = chown_common(dentry, user, group); + mnt_drop_write(file->f_vfsmnt); +out_fput: fput(file); out: return error; } +/* + * You have to be very careful that these write + * counts get cleaned up in error cases and + * upon __fput(). This should probably never + * be called outside of __dentry_open(). + */ +static inline int __get_file_write_access(struct inode *inode, + struct vfsmount *mnt) +{ + int error; + error = get_write_access(inode); + if (error) + return error; + /* + * Do not take mount writer counts on + * special files since no writes to + * the mount itself will occur. + */ + if (!special_file(inode->i_mode)) { + /* + * Balanced in __fput() + */ + error = mnt_want_write(mnt); + if (error) + put_write_access(inode); + } + return error; +} + static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, int flags, struct file *f, int (*open)(struct inode *, struct file *)) @@ -742,9 +803,11 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, FMODE_PREAD | FMODE_PWRITE; inode = dentry->d_inode; if (f->f_mode & FMODE_WRITE) { - error = get_write_access(inode); + error = __get_file_write_access(inode, mnt); if (error) goto cleanup_file; + if (!special_file(inode->i_mode)) + file_take_write(f); } f->f_mapping = inode->i_mapping; @@ -784,8 +847,19 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, cleanup_all: fops_put(f->f_op); - if (f->f_mode & FMODE_WRITE) + if (f->f_mode & FMODE_WRITE) { put_write_access(inode); + if (!special_file(inode->i_mode)) { + /* + * We don't consider this a real + * mnt_want/drop_write() pair + * because it all happenend right + * here, so just reset the state. + */ + file_reset_write(f); + mnt_drop_write(mnt); + } + } file_kill(f); f->f_path.dentry = NULL; f->f_path.mnt = NULL; @@ -796,43 +870,6 @@ cleanup_file: return ERR_PTR(error); } -/* - * Note that while the flag value (low two bits) for sys_open means: - * 00 - read-only - * 01 - write-only - * 10 - read-write - * 11 - special - * it is changed into - * 00 - no permissions needed - * 01 - read-permission - * 10 - write-permission - * 11 - read-write - * for the internal routines (ie open_namei()/follow_link() etc). 00 is - * used by symlinks. - */ -static struct file *do_filp_open(int dfd, const char *filename, int flags, - int mode) -{ - int namei_flags, error; - struct nameidata nd; - - namei_flags = flags; - if ((namei_flags+1) & O_ACCMODE) - namei_flags++; - - error = open_namei(dfd, filename, namei_flags, mode, &nd); - if (!error) - return nameidata_to_filp(&nd, flags); - - return ERR_PTR(error); -} - -struct file *filp_open(const char *filename, int flags, int mode) -{ - return do_filp_open(AT_FDCWD, filename, flags, mode); -} -EXPORT_SYMBOL(filp_open); - /** * lookup_instantiate_filp - instantiates the open intent filp * @nd: pointer to nameidata diff --git a/fs/pnode.c b/fs/pnode.c index 1d8f544..8d5f392 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -9,6 +9,7 @@ #include #include #include +#include "internal.h" #include "pnode.h" /* return the next shared peer mount of @p */ @@ -27,6 +28,57 @@ static inline struct vfsmount *next_slave(struct vfsmount *p) return list_entry(p->mnt_slave.next, struct vfsmount, mnt_slave); } +/* + * Return true if path is reachable from root + * + * namespace_sem is held, and mnt is attached + */ +static bool is_path_reachable(struct vfsmount *mnt, struct dentry *dentry, + const struct path *root) +{ + while (mnt != root->mnt && mnt->mnt_parent != mnt) { + dentry = mnt->mnt_mountpoint; + mnt = mnt->mnt_parent; + } + return mnt == root->mnt && is_subdir(dentry, root->dentry); +} + +static struct vfsmount *get_peer_under_root(struct vfsmount *mnt, + struct mnt_namespace *ns, + const struct path *root) +{ + struct vfsmount *m = mnt; + + do { + /* Check the namespace first for optimization */ + if (m->mnt_ns == ns && is_path_reachable(m, m->mnt_root, root)) + return m; + + m = next_peer(m); + } while (m != mnt); + + return NULL; +} + +/* + * Get ID of closest dominating peer group having a representative + * under the given root. + * + * Caller must hold namespace_sem + */ +int get_dominating_id(struct vfsmount *mnt, const struct path *root) +{ + struct vfsmount *m; + + for (m = mnt->mnt_master; m != NULL; m = m->mnt_master) { + struct vfsmount *d = get_peer_under_root(m, mnt->mnt_ns, root); + if (d) + return d->mnt_group_id; + } + + return 0; +} + static int do_make_slave(struct vfsmount *mnt) { struct vfsmount *peer_mnt = mnt, *master = mnt->mnt_master; @@ -45,7 +97,11 @@ static int do_make_slave(struct vfsmount *mnt) if (peer_mnt == mnt) peer_mnt = NULL; } + if (IS_MNT_SHARED(mnt) && list_empty(&mnt->mnt_share)) + mnt_release_group_id(mnt); + list_del_init(&mnt->mnt_share); + mnt->mnt_group_id = 0; if (peer_mnt) master = peer_mnt; @@ -67,7 +123,6 @@ static int do_make_slave(struct vfsmount *mnt) } mnt->mnt_master = master; CLEAR_MNT_SHARED(mnt); - INIT_LIST_HEAD(&mnt->mnt_slave_list); return 0; } @@ -211,8 +266,7 @@ int propagate_mnt(struct vfsmount *dest_mnt, struct dentry *dest_dentry, out: spin_lock(&vfsmount_lock); while (!list_empty(&tmp_list)) { - child = list_entry(tmp_list.next, struct vfsmount, mnt_hash); - list_del_init(&child->mnt_hash); + child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash); umount_tree(child, 0, &umount_list); } spin_unlock(&vfsmount_lock); diff --git a/fs/pnode.h b/fs/pnode.h index f249be2..958665d 100644 --- a/fs/pnode.h +++ b/fs/pnode.h @@ -35,4 +35,6 @@ int propagate_mnt(struct vfsmount *, struct dentry *, struct vfsmount *, struct list_head *); int propagate_umount(struct list_head *); int propagate_mount_busy(struct vfsmount *, int); +void mnt_release_group_id(struct vfsmount *); +int get_dominating_id(struct vfsmount *mnt, const struct path *root); #endif /* _LINUX_PNODE_H */ diff --git a/fs/proc/base.c b/fs/proc/base.c index 81d7d14..9e8a46c 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -502,17 +502,14 @@ static const struct inode_operations proc_def_inode_operations = { .setattr = proc_setattr, }; -extern const struct seq_operations mounts_op; -struct proc_mounts { - struct seq_file m; - int event; -}; - -static int mounts_open(struct inode *inode, struct file *file) +static int mounts_open_common(struct inode *inode, struct file *file, + const struct seq_operations *op) { struct task_struct *task = get_proc_task(inode); struct nsproxy *nsp; struct mnt_namespace *ns = NULL; + struct fs_struct *fs = NULL; + struct path root; struct proc_mounts *p; int ret = -EINVAL; @@ -525,40 +522,61 @@ static int mounts_open(struct inode *inode, struct file *file) get_mnt_ns(ns); } rcu_read_unlock(); - + if (ns) + fs = get_fs_struct(task); put_task_struct(task); } - if (ns) { - ret = -ENOMEM; - p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL); - if (p) { - file->private_data = &p->m; - ret = seq_open(file, &mounts_op); - if (!ret) { - p->m.private = ns; - p->event = ns->event; - return 0; - } - kfree(p); - } - put_mnt_ns(ns); - } + if (!ns) + goto err; + if (!fs) + goto err_put_ns; + + read_lock(&fs->lock); + root = fs->root; + path_get(&root); + read_unlock(&fs->lock); + put_fs_struct(fs); + + ret = -ENOMEM; + p = kmalloc(sizeof(struct proc_mounts), GFP_KERNEL); + if (!p) + goto err_put_path; + + file->private_data = &p->m; + ret = seq_open(file, op); + if (ret) + goto err_free; + + p->m.private = p; + p->ns = ns; + p->root = root; + p->event = ns->event; + + return 0; + + err_free: + kfree(p); + err_put_path: + path_put(&root); + err_put_ns: + put_mnt_ns(ns); + err: return ret; } static int mounts_release(struct inode *inode, struct file *file) { - struct seq_file *m = file->private_data; - struct mnt_namespace *ns = m->private; - put_mnt_ns(ns); + struct proc_mounts *p = file->private_data; + path_put(&p->root); + put_mnt_ns(p->ns); return seq_release(inode, file); } static unsigned mounts_poll(struct file *file, poll_table *wait) { struct proc_mounts *p = file->private_data; - struct mnt_namespace *ns = p->m.private; + struct mnt_namespace *ns = p->ns; unsigned res = 0; poll_wait(file, &ns->poll, wait); @@ -573,6 +591,11 @@ static unsigned mounts_poll(struct file *file, poll_table *wait) return res; } +static int mounts_open(struct inode *inode, struct file *file) +{ + return mounts_open_common(inode, file, &mounts_op); +} + static const struct file_operations proc_mounts_operations = { .open = mounts_open, .read = seq_read, @@ -581,38 +604,22 @@ static const struct file_operations proc_mounts_operations = { .poll = mounts_poll, }; -extern const struct seq_operations mountstats_op; -static int mountstats_open(struct inode *inode, struct file *file) +static int mountinfo_open(struct inode *inode, struct file *file) { - int ret = seq_open(file, &mountstats_op); - - if (!ret) { - struct seq_file *m = file->private_data; - struct nsproxy *nsp; - struct mnt_namespace *mnt_ns = NULL; - struct task_struct *task = get_proc_task(inode); - - if (task) { - rcu_read_lock(); - nsp = task_nsproxy(task); - if (nsp) { - mnt_ns = nsp->mnt_ns; - if (mnt_ns) - get_mnt_ns(mnt_ns); - } - rcu_read_unlock(); + return mounts_open_common(inode, file, &mountinfo_op); +} - put_task_struct(task); - } +static const struct file_operations proc_mountinfo_operations = { + .open = mountinfo_open, + .read = seq_read, + .llseek = seq_lseek, + .release = mounts_release, + .poll = mounts_poll, +}; - if (mnt_ns) - m->private = mnt_ns; - else { - seq_release(inode, file); - ret = -EINVAL; - } - } - return ret; +static int mountstats_open(struct inode *inode, struct file *file) +{ + return mounts_open_common(inode, file, &mountstats_op); } static const struct file_operations proc_mountstats_operations = { @@ -2311,6 +2318,7 @@ static const struct pid_entry tgid_base_stuff[] = { LNK("root", root), LNK("exe", exe), REG("mounts", S_IRUGO, mounts), + REG("mountinfo", S_IRUGO, mountinfo), REG("mountstats", S_IRUSR, mountstats), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, clear_refs), @@ -2643,6 +2651,7 @@ static const struct pid_entry tid_base_stuff[] = { LNK("root", root), LNK("exe", exe), REG("mounts", S_IRUGO, mounts), + REG("mountinfo", S_IRUGO, mountinfo), #ifdef CONFIG_PROC_PAGE_MONITOR REG("clear_refs", S_IWUSR, clear_refs), REG("smaps", S_IRUGO, smaps), diff --git a/fs/reiserfs/ioctl.c b/fs/reiserfs/ioctl.c index e0f0f09..619470b 100644 --- a/fs/reiserfs/ioctl.c +++ b/fs/reiserfs/ioctl.c @@ -4,6 +4,7 @@ #include #include +#include #include #include #include @@ -25,6 +26,7 @@ int reiserfs_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg) { unsigned int flags; + int err = 0; switch (cmd) { case REISERFS_IOC_UNPACK: @@ -48,50 +50,67 @@ int reiserfs_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, if (!reiserfs_attrs(inode->i_sb)) return -ENOTTY; - if (IS_RDONLY(inode)) - return -EROFS; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; - if (!is_owner_or_cap(inode)) - return -EPERM; - - if (get_user(flags, (int __user *)arg)) - return -EFAULT; - - /* Is it quota file? Do not allow user to mess with it. */ - if (IS_NOQUOTA(inode)) - return -EPERM; + if (!is_owner_or_cap(inode)) { + err = -EPERM; + goto setflags_out; + } + if (get_user(flags, (int __user *)arg)) { + err = -EFAULT; + goto setflags_out; + } + /* + * Is it quota file? Do not allow user to mess with it + */ + if (IS_NOQUOTA(inode)) { + err = -EPERM; + goto setflags_out; + } if (((flags ^ REISERFS_I(inode)-> i_attrs) & (REISERFS_IMMUTABLE_FL | REISERFS_APPEND_FL)) - && !capable(CAP_LINUX_IMMUTABLE)) - return -EPERM; - + && !capable(CAP_LINUX_IMMUTABLE)) { + err = -EPERM; + goto setflags_out; + } if ((flags & REISERFS_NOTAIL_FL) && S_ISREG(inode->i_mode)) { int result; result = reiserfs_unpack(inode, filp); - if (result) - return result; + if (result) { + err = result; + goto setflags_out; + } } sd_attrs_to_i_attrs(flags, inode); REISERFS_I(inode)->i_attrs = flags; inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setflags_out: + mnt_drop_write(filp->f_vfsmnt); + return err; } case REISERFS_IOC_GETVERSION: return put_user(inode->i_generation, (int __user *)arg); case REISERFS_IOC_SETVERSION: if (!is_owner_or_cap(inode)) return -EPERM; - if (IS_RDONLY(inode)) - return -EROFS; - if (get_user(inode->i_generation, (int __user *)arg)) - return -EFAULT; + err = mnt_want_write(filp->f_vfsmnt); + if (err) + return err; + if (get_user(inode->i_generation, (int __user *)arg)) { + err = -EFAULT; + goto setversion_out; + } inode->i_ctime = CURRENT_TIME_SEC; mark_inode_dirty(inode); - return 0; +setversion_out: + mnt_drop_write(filp->f_vfsmnt); + return err; default: return -ENOTTY; } diff --git a/fs/seq_file.c b/fs/seq_file.c index 8537702..aba3ed9 100644 --- a/fs/seq_file.c +++ b/fs/seq_file.c @@ -25,6 +25,7 @@ * into the buffer. In case of error ->start() and ->next() return * ERR_PTR(error). In the end of sequence they return %NULL. ->show() * returns 0 in case of success and negative number in case of error. + * Returning SEQ_SKIP means "discard this element and move on". */ int seq_open(struct file *file, const struct seq_operations *op) { @@ -114,8 +115,10 @@ ssize_t seq_read(struct file *file, char __user *buf, size_t size, loff_t *ppos) if (!p || IS_ERR(p)) break; err = m->op->show(m, p); - if (err) + if (err < 0) break; + if (unlikely(err)) + m->count = 0; if (m->count < m->size) goto Fill; m->op->stop(m, p); @@ -140,9 +143,10 @@ Fill: break; } err = m->op->show(m, p); - if (err || m->count == m->size) { + if (m->count == m->size || err) { m->count = offs; - break; + if (likely(err <= 0)) + break; } pos = next; } @@ -199,8 +203,12 @@ static int traverse(struct seq_file *m, loff_t offset) if (IS_ERR(p)) break; error = m->op->show(m, p); - if (error) + if (error < 0) break; + if (unlikely(error)) { + error = 0; + m->count = 0; + } if (m->count == m->size) goto Eoverflow; if (pos + m->count > offset) { @@ -342,28 +350,40 @@ int seq_printf(struct seq_file *m, const char *f, ...) } EXPORT_SYMBOL(seq_printf); +static char *mangle_path(char *s, char *p, char *esc) +{ + while (s <= p) { + char c = *p++; + if (!c) { + return s; + } else if (!strchr(esc, c)) { + *s++ = c; + } else if (s + 4 > p) { + break; + } else { + *s++ = '\\'; + *s++ = '0' + ((c & 0300) >> 6); + *s++ = '0' + ((c & 070) >> 3); + *s++ = '0' + (c & 07); + } + } + return NULL; +} + +/* + * return the absolute path of 'dentry' residing in mount 'mnt'. + */ int seq_path(struct seq_file *m, struct path *path, char *esc) { if (m->count < m->size) { char *s = m->buf + m->count; char *p = d_path(path, s, m->size - m->count); if (!IS_ERR(p)) { - while (s <= p) { - char c = *p++; - if (!c) { - p = m->buf + m->count; - m->count = s - m->buf; - return s - p; - } else if (!strchr(esc, c)) { - *s++ = c; - } else if (s + 4 > p) { - break; - } else { - *s++ = '\\'; - *s++ = '0' + ((c & 0300) >> 6); - *s++ = '0' + ((c & 070) >> 3); - *s++ = '0' + (c & 07); - } + s = mangle_path(s, p, esc); + if (s) { + p = m->buf + m->count; + m->count = s - m->buf; + return s - p; } } } @@ -372,6 +392,57 @@ int seq_path(struct seq_file *m, struct path *path, char *esc) } EXPORT_SYMBOL(seq_path); +/* + * Same as seq_path, but relative to supplied root. + * + * root may be changed, see __d_path(). + */ +int seq_path_root(struct seq_file *m, struct path *path, struct path *root, + char *esc) +{ + int err = -ENAMETOOLONG; + if (m->count < m->size) { + char *s = m->buf + m->count; + char *p; + + spin_lock(&dcache_lock); + p = __d_path(path, root, s, m->size - m->count); + spin_unlock(&dcache_lock); + err = PTR_ERR(p); + if (!IS_ERR(p)) { + s = mangle_path(s, p, esc); + if (s) { + p = m->buf + m->count; + m->count = s - m->buf; + return 0; + } + } + } + m->count = m->size; + return err; +} + +/* + * returns the path of the 'dentry' from the root of its filesystem. + */ +int seq_dentry(struct seq_file *m, struct dentry *dentry, char *esc) +{ + if (m->count < m->size) { + char *s = m->buf + m->count; + char *p = dentry_path(dentry, s, m->size - m->count); + if (!IS_ERR(p)) { + s = mangle_path(s, p, esc); + if (s) { + p = m->buf + m->count; + m->count = s - m->buf; + return s - p; + } + } + } + m->count = m->size; + return -1; +} + static void *single_start(struct seq_file *p, loff_t *pos) { return NULL + (*pos == 0); diff --git a/fs/signalfd.c b/fs/signalfd.c index 8ead0db..a06296b 100644 --- a/fs/signalfd.c +++ b/fs/signalfd.c @@ -210,8 +210,6 @@ asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemas int error; sigset_t sigmask; struct signalfd_ctx *ctx; - struct file *file; - struct inode *inode; if (sizemask != sizeof(sigset_t) || copy_from_user(&sigmask, user_mask, sizeof(sigmask))) @@ -230,12 +228,12 @@ asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemas * When we call this, the initialization must be complete, since * anon_inode_getfd() will install the fd. */ - error = anon_inode_getfd(&ufd, &inode, &file, "[signalfd]", + error = anon_inode_getfd(&ufd, "[signalfd]", &signalfd_fops, ctx); if (error) goto err_fdalloc; } else { - file = fget(ufd); + struct file *file = fget(ufd); if (!file) return -EBADF; ctx = file->private_data; diff --git a/fs/super.c b/fs/super.c index 09008db..4798350 100644 --- a/fs/super.c +++ b/fs/super.c @@ -37,7 +37,9 @@ #include #include #include +#include #include +#include "internal.h" LIST_HEAD(super_blocks); @@ -567,10 +569,29 @@ static void mark_files_ro(struct super_block *sb) { struct file *f; +retry: file_list_lock(); list_for_each_entry(f, &sb->s_files, f_u.fu_list) { - if (S_ISREG(f->f_path.dentry->d_inode->i_mode) && file_count(f)) - f->f_mode &= ~FMODE_WRITE; + struct vfsmount *mnt; + if (!S_ISREG(f->f_path.dentry->d_inode->i_mode)) + continue; + if (!file_count(f)) + continue; + if (!(f->f_mode & FMODE_WRITE)) + continue; + f->f_mode &= ~FMODE_WRITE; + if (file_check_writeable(f) != 0) + continue; + file_release_write(f); + mnt = mntget(f->f_path.mnt); + file_list_unlock(); + /* + * This can sleep, so we can't hold + * the file_list_lock() spinlock. + */ + mnt_drop_write(mnt); + mntput(mnt); + goto retry; } file_list_unlock(); } diff --git a/fs/timerfd.c b/fs/timerfd.c index 10c80b5..1ac728d 100644 --- a/fs/timerfd.c +++ b/fs/timerfd.c @@ -182,8 +182,6 @@ asmlinkage long sys_timerfd_create(int clockid, int flags) { int error, ufd; struct timerfd_ctx *ctx; - struct file *file; - struct inode *inode; if (flags) return -EINVAL; @@ -199,8 +197,7 @@ asmlinkage long sys_timerfd_create(int clockid, int flags) ctx->clockid = clockid; hrtimer_init(&ctx->tmr, clockid, HRTIMER_MODE_ABS); - error = anon_inode_getfd(&ufd, &inode, &file, "[timerfd]", - &timerfd_fops, ctx); + error = anon_inode_getfd(&ufd, "[timerfd]", &timerfd_fops, ctx); if (error) { kfree(ctx); return error; diff --git a/fs/utimes.c b/fs/utimes.c index b18da9c..a2bef77 100644 --- a/fs/utimes.c +++ b/fs/utimes.c @@ -2,6 +2,7 @@ #include #include #include +#include #include #include #include @@ -59,6 +60,7 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags struct inode *inode; struct iattr newattrs; struct file *f = NULL; + struct vfsmount *mnt; error = -EINVAL; if (times && (!nsec_valid(times[0].tv_nsec) || @@ -79,18 +81,20 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags if (!f) goto out; dentry = f->f_path.dentry; + mnt = f->f_path.mnt; } else { error = __user_walk_fd(dfd, filename, (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW, &nd); if (error) goto out; dentry = nd.path.dentry; + mnt = nd.path.mnt; } inode = dentry->d_inode; - error = -EROFS; - if (IS_RDONLY(inode)) + error = mnt_want_write(mnt); + if (error) goto dput_and_out; /* Don't worry, the checks are done in inode_change_ok() */ @@ -98,7 +102,7 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags if (times) { error = -EPERM; if (IS_APPEND(inode) || IS_IMMUTABLE(inode)) - goto dput_and_out; + goto mnt_drop_write_and_out; if (times[0].tv_nsec == UTIME_OMIT) newattrs.ia_valid &= ~ATTR_ATIME; @@ -118,22 +122,24 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags } else { error = -EACCES; if (IS_IMMUTABLE(inode)) - goto dput_and_out; + goto mnt_drop_write_and_out; if (!is_owner_or_cap(inode)) { if (f) { if (!(f->f_mode & FMODE_WRITE)) - goto dput_and_out; + goto mnt_drop_write_and_out; } else { error = vfs_permission(&nd, MAY_WRITE); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; } } } mutex_lock(&inode->i_mutex); error = notify_change(dentry, &newattrs); mutex_unlock(&inode->i_mutex); +mnt_drop_write_and_out: + mnt_drop_write(mnt); dput_and_out: if (f) fput(f); diff --git a/fs/xattr.c b/fs/xattr.c index 3acab16..b5c47e1 100644 --- a/fs/xattr.c +++ b/fs/xattr.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, const char *name, int mask) * filesystem or on an immutable / append-only inode. */ if (mask & MAY_WRITE) { - if (IS_RDONLY(inode)) - return -EROFS; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) return -EPERM; } @@ -262,7 +261,11 @@ sys_setxattr(char __user *path, char __user *name, void __user *value, error = user_path_walk(path, &nd); if (error) return error; + error = mnt_want_write(nd.path.mnt); + if (error) + return error; error = setxattr(nd.path.dentry, name, value, size, flags); + mnt_drop_write(nd.path.mnt); path_put(&nd.path); return error; } @@ -277,7 +280,11 @@ sys_lsetxattr(char __user *path, char __user *name, void __user *value, error = user_path_walk_link(path, &nd); if (error) return error; + error = mnt_want_write(nd.path.mnt); + if (error) + return error; error = setxattr(nd.path.dentry, name, value, size, flags); + mnt_drop_write(nd.path.mnt); path_put(&nd.path); return error; } @@ -293,9 +300,14 @@ sys_fsetxattr(int fd, char __user *name, void __user *value, f = fget(fd); if (!f) return error; + error = mnt_want_write(f->f_vfsmnt); + if (error) + goto out_fput; dentry = f->f_path.dentry; audit_inode(NULL, dentry); error = setxattr(dentry, name, value, size, flags); + mnt_drop_write(f->f_vfsmnt); +out_fput: fput(f); return error; } diff --git a/fs/xfs/linux-2.6/xfs_ioctl.c b/fs/xfs/linux-2.6/xfs_ioctl.c index f34bd01..3680b12 100644 --- a/fs/xfs/linux-2.6/xfs_ioctl.c +++ b/fs/xfs/linux-2.6/xfs_ioctl.c @@ -535,8 +535,6 @@ xfs_attrmulti_attr_set( char *kbuf; int error = EFAULT; - if (IS_RDONLY(inode)) - return -EROFS; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) return EPERM; if (len > XATTR_SIZE_MAX) @@ -562,8 +560,6 @@ xfs_attrmulti_attr_remove( char *name, __uint32_t flags) { - if (IS_RDONLY(inode)) - return -EROFS; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) return EPERM; return xfs_attr_remove(XFS_I(inode), name, flags); @@ -573,6 +569,7 @@ STATIC int xfs_attrmulti_by_handle( xfs_mount_t *mp, void __user *arg, + struct file *parfilp, struct inode *parinode) { int error; @@ -626,13 +623,21 @@ xfs_attrmulti_by_handle( &ops[i].am_length, ops[i].am_flags); break; case ATTR_OP_SET: + ops[i].am_error = mnt_want_write(parfilp->f_vfsmnt); + if (ops[i].am_error) + break; ops[i].am_error = xfs_attrmulti_attr_set(inode, attr_name, ops[i].am_attrvalue, ops[i].am_length, ops[i].am_flags); + mnt_drop_write(parfilp->f_vfsmnt); break; case ATTR_OP_REMOVE: + ops[i].am_error = mnt_want_write(parfilp->f_vfsmnt); + if (ops[i].am_error) + break; ops[i].am_error = xfs_attrmulti_attr_remove(inode, attr_name, ops[i].am_flags); + mnt_drop_write(parfilp->f_vfsmnt); break; default: ops[i].am_error = EINVAL; @@ -651,314 +656,6 @@ xfs_attrmulti_by_handle( return -error; } -/* prototypes for a few of the stack-hungry cases that have - * their own functions. Functions are defined after their use - * so gcc doesn't get fancy and inline them with -03 */ - -STATIC int -xfs_ioc_space( - struct xfs_inode *ip, - struct inode *inode, - struct file *filp, - int flags, - unsigned int cmd, - void __user *arg); - -STATIC int -xfs_ioc_bulkstat( - xfs_mount_t *mp, - unsigned int cmd, - void __user *arg); - -STATIC int -xfs_ioc_fsgeometry_v1( - xfs_mount_t *mp, - void __user *arg); - -STATIC int -xfs_ioc_fsgeometry( - xfs_mount_t *mp, - void __user *arg); - -STATIC int -xfs_ioc_xattr( - xfs_inode_t *ip, - struct file *filp, - unsigned int cmd, - void __user *arg); - -STATIC int -xfs_ioc_fsgetxattr( - xfs_inode_t *ip, - int attr, - void __user *arg); - -STATIC int -xfs_ioc_getbmap( - struct xfs_inode *ip, - int flags, - unsigned int cmd, - void __user *arg); - -STATIC int -xfs_ioc_getbmapx( - struct xfs_inode *ip, - void __user *arg); - -int -xfs_ioctl( - xfs_inode_t *ip, - struct file *filp, - int ioflags, - unsigned int cmd, - void __user *arg) -{ - struct inode *inode = filp->f_path.dentry->d_inode; - xfs_mount_t *mp = ip->i_mount; - int error; - - xfs_itrace_entry(XFS_I(inode)); - switch (cmd) { - - case XFS_IOC_ALLOCSP: - case XFS_IOC_FREESP: - case XFS_IOC_RESVSP: - case XFS_IOC_UNRESVSP: - case XFS_IOC_ALLOCSP64: - case XFS_IOC_FREESP64: - case XFS_IOC_RESVSP64: - case XFS_IOC_UNRESVSP64: - /* - * Only allow the sys admin to reserve space unless - * unwritten extents are enabled. - */ - if (!xfs_sb_version_hasextflgbit(&mp->m_sb) && - !capable(CAP_SYS_ADMIN)) - return -EPERM; - - return xfs_ioc_space(ip, inode, filp, ioflags, cmd, arg); - - case XFS_IOC_DIOINFO: { - struct dioattr da; - xfs_buftarg_t *target = - XFS_IS_REALTIME_INODE(ip) ? - mp->m_rtdev_targp : mp->m_ddev_targp; - - da.d_mem = da.d_miniosz = 1 << target->bt_sshift; - da.d_maxiosz = INT_MAX & ~(da.d_miniosz - 1); - - if (copy_to_user(arg, &da, sizeof(da))) - return -XFS_ERROR(EFAULT); - return 0; - } - - case XFS_IOC_FSBULKSTAT_SINGLE: - case XFS_IOC_FSBULKSTAT: - case XFS_IOC_FSINUMBERS: - return xfs_ioc_bulkstat(mp, cmd, arg); - - case XFS_IOC_FSGEOMETRY_V1: - return xfs_ioc_fsgeometry_v1(mp, arg); - - case XFS_IOC_FSGEOMETRY: - return xfs_ioc_fsgeometry(mp, arg); - - case XFS_IOC_GETVERSION: - return put_user(inode->i_generation, (int __user *)arg); - - case XFS_IOC_FSGETXATTR: - return xfs_ioc_fsgetxattr(ip, 0, arg); - case XFS_IOC_FSGETXATTRA: - return xfs_ioc_fsgetxattr(ip, 1, arg); - case XFS_IOC_GETXFLAGS: - case XFS_IOC_SETXFLAGS: - case XFS_IOC_FSSETXATTR: - return xfs_ioc_xattr(ip, filp, cmd, arg); - - case XFS_IOC_FSSETDM: { - struct fsdmidata dmi; - - if (copy_from_user(&dmi, arg, sizeof(dmi))) - return -XFS_ERROR(EFAULT); - - error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask, - dmi.fsd_dmstate); - return -error; - } - - case XFS_IOC_GETBMAP: - case XFS_IOC_GETBMAPA: - return xfs_ioc_getbmap(ip, ioflags, cmd, arg); - - case XFS_IOC_GETBMAPX: - return xfs_ioc_getbmapx(ip, arg); - - case XFS_IOC_FD_TO_HANDLE: - case XFS_IOC_PATH_TO_HANDLE: - case XFS_IOC_PATH_TO_FSHANDLE: - return xfs_find_handle(cmd, arg); - - case XFS_IOC_OPEN_BY_HANDLE: - return xfs_open_by_handle(mp, arg, filp, inode); - - case XFS_IOC_FSSETDM_BY_HANDLE: - return xfs_fssetdm_by_handle(mp, arg, inode); - - case XFS_IOC_READLINK_BY_HANDLE: - return xfs_readlink_by_handle(mp, arg, inode); - - case XFS_IOC_ATTRLIST_BY_HANDLE: - return xfs_attrlist_by_handle(mp, arg, inode); - - case XFS_IOC_ATTRMULTI_BY_HANDLE: - return xfs_attrmulti_by_handle(mp, arg, inode); - - case XFS_IOC_SWAPEXT: { - error = xfs_swapext((struct xfs_swapext __user *)arg); - return -error; - } - - case XFS_IOC_FSCOUNTS: { - xfs_fsop_counts_t out; - - error = xfs_fs_counts(mp, &out); - if (error) - return -error; - - if (copy_to_user(arg, &out, sizeof(out))) - return -XFS_ERROR(EFAULT); - return 0; - } - - case XFS_IOC_SET_RESBLKS: { - xfs_fsop_resblks_t inout; - __uint64_t in; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (copy_from_user(&inout, arg, sizeof(inout))) - return -XFS_ERROR(EFAULT); - - /* input parameter is passed in resblks field of structure */ - in = inout.resblks; - error = xfs_reserve_blocks(mp, &in, &inout); - if (error) - return -error; - - if (copy_to_user(arg, &inout, sizeof(inout))) - return -XFS_ERROR(EFAULT); - return 0; - } - - case XFS_IOC_GET_RESBLKS: { - xfs_fsop_resblks_t out; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - error = xfs_reserve_blocks(mp, NULL, &out); - if (error) - return -error; - - if (copy_to_user(arg, &out, sizeof(out))) - return -XFS_ERROR(EFAULT); - - return 0; - } - - case XFS_IOC_FSGROWFSDATA: { - xfs_growfs_data_t in; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (copy_from_user(&in, arg, sizeof(in))) - return -XFS_ERROR(EFAULT); - - error = xfs_growfs_data(mp, &in); - return -error; - } - - case XFS_IOC_FSGROWFSLOG: { - xfs_growfs_log_t in; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (copy_from_user(&in, arg, sizeof(in))) - return -XFS_ERROR(EFAULT); - - error = xfs_growfs_log(mp, &in); - return -error; - } - - case XFS_IOC_FSGROWFSRT: { - xfs_growfs_rt_t in; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (copy_from_user(&in, arg, sizeof(in))) - return -XFS_ERROR(EFAULT); - - error = xfs_growfs_rt(mp, &in); - return -error; - } - - case XFS_IOC_FREEZE: - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (inode->i_sb->s_frozen == SB_UNFROZEN) - freeze_bdev(inode->i_sb->s_bdev); - return 0; - - case XFS_IOC_THAW: - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - if (inode->i_sb->s_frozen != SB_UNFROZEN) - thaw_bdev(inode->i_sb->s_bdev, inode->i_sb); - return 0; - - case XFS_IOC_GOINGDOWN: { - __uint32_t in; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (get_user(in, (__uint32_t __user *)arg)) - return -XFS_ERROR(EFAULT); - - error = xfs_fs_goingdown(mp, in); - return -error; - } - - case XFS_IOC_ERROR_INJECTION: { - xfs_error_injection_t in; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - if (copy_from_user(&in, arg, sizeof(in))) - return -XFS_ERROR(EFAULT); - - error = xfs_errortag_add(in.errtag, mp); - return -error; - } - - case XFS_IOC_ERROR_CLEARALL: - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - - error = xfs_errortag_clearall(mp, 1); - return -error; - - default: - return -ENOTTY; - } -} - STATIC int xfs_ioc_space( struct xfs_inode *ip, @@ -1332,3 +1029,257 @@ xfs_ioc_getbmapx( return 0; } + +int +xfs_ioctl( + xfs_inode_t *ip, + struct file *filp, + int ioflags, + unsigned int cmd, + void __user *arg) +{ + struct inode *inode = filp->f_path.dentry->d_inode; + xfs_mount_t *mp = ip->i_mount; + int error; + + xfs_itrace_entry(XFS_I(inode)); + switch (cmd) { + + case XFS_IOC_ALLOCSP: + case XFS_IOC_FREESP: + case XFS_IOC_RESVSP: + case XFS_IOC_UNRESVSP: + case XFS_IOC_ALLOCSP64: + case XFS_IOC_FREESP64: + case XFS_IOC_RESVSP64: + case XFS_IOC_UNRESVSP64: + /* + * Only allow the sys admin to reserve space unless + * unwritten extents are enabled. + */ + if (!xfs_sb_version_hasextflgbit(&mp->m_sb) && + !capable(CAP_SYS_ADMIN)) + return -EPERM; + + return xfs_ioc_space(ip, inode, filp, ioflags, cmd, arg); + + case XFS_IOC_DIOINFO: { + struct dioattr da; + xfs_buftarg_t *target = + XFS_IS_REALTIME_INODE(ip) ? + mp->m_rtdev_targp : mp->m_ddev_targp; + + da.d_mem = da.d_miniosz = 1 << target->bt_sshift; + da.d_maxiosz = INT_MAX & ~(da.d_miniosz - 1); + + if (copy_to_user(arg, &da, sizeof(da))) + return -XFS_ERROR(EFAULT); + return 0; + } + + case XFS_IOC_FSBULKSTAT_SINGLE: + case XFS_IOC_FSBULKSTAT: + case XFS_IOC_FSINUMBERS: + return xfs_ioc_bulkstat(mp, cmd, arg); + + case XFS_IOC_FSGEOMETRY_V1: + return xfs_ioc_fsgeometry_v1(mp, arg); + + case XFS_IOC_FSGEOMETRY: + return xfs_ioc_fsgeometry(mp, arg); + + case XFS_IOC_GETVERSION: + return put_user(inode->i_generation, (int __user *)arg); + + case XFS_IOC_FSGETXATTR: + return xfs_ioc_fsgetxattr(ip, 0, arg); + case XFS_IOC_FSGETXATTRA: + return xfs_ioc_fsgetxattr(ip, 1, arg); + case XFS_IOC_GETXFLAGS: + case XFS_IOC_SETXFLAGS: + case XFS_IOC_FSSETXATTR: + return xfs_ioc_xattr(ip, filp, cmd, arg); + + case XFS_IOC_FSSETDM: { + struct fsdmidata dmi; + + if (copy_from_user(&dmi, arg, sizeof(dmi))) + return -XFS_ERROR(EFAULT); + + error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask, + dmi.fsd_dmstate); + return -error; + } + + case XFS_IOC_GETBMAP: + case XFS_IOC_GETBMAPA: + return xfs_ioc_getbmap(ip, ioflags, cmd, arg); + + case XFS_IOC_GETBMAPX: + return xfs_ioc_getbmapx(ip, arg); + + case XFS_IOC_FD_TO_HANDLE: + case XFS_IOC_PATH_TO_HANDLE: + case XFS_IOC_PATH_TO_FSHANDLE: + return xfs_find_handle(cmd, arg); + + case XFS_IOC_OPEN_BY_HANDLE: + return xfs_open_by_handle(mp, arg, filp, inode); + + case XFS_IOC_FSSETDM_BY_HANDLE: + return xfs_fssetdm_by_handle(mp, arg, inode); + + case XFS_IOC_READLINK_BY_HANDLE: + return xfs_readlink_by_handle(mp, arg, inode); + + case XFS_IOC_ATTRLIST_BY_HANDLE: + return xfs_attrlist_by_handle(mp, arg, inode); + + case XFS_IOC_ATTRMULTI_BY_HANDLE: + return xfs_attrmulti_by_handle(mp, arg, filp, inode); + + case XFS_IOC_SWAPEXT: { + error = xfs_swapext((struct xfs_swapext __user *)arg); + return -error; + } + + case XFS_IOC_FSCOUNTS: { + xfs_fsop_counts_t out; + + error = xfs_fs_counts(mp, &out); + if (error) + return -error; + + if (copy_to_user(arg, &out, sizeof(out))) + return -XFS_ERROR(EFAULT); + return 0; + } + + case XFS_IOC_SET_RESBLKS: { + xfs_fsop_resblks_t inout; + __uint64_t in; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&inout, arg, sizeof(inout))) + return -XFS_ERROR(EFAULT); + + /* input parameter is passed in resblks field of structure */ + in = inout.resblks; + error = xfs_reserve_blocks(mp, &in, &inout); + if (error) + return -error; + + if (copy_to_user(arg, &inout, sizeof(inout))) + return -XFS_ERROR(EFAULT); + return 0; + } + + case XFS_IOC_GET_RESBLKS: { + xfs_fsop_resblks_t out; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + error = xfs_reserve_blocks(mp, NULL, &out); + if (error) + return -error; + + if (copy_to_user(arg, &out, sizeof(out))) + return -XFS_ERROR(EFAULT); + + return 0; + } + + case XFS_IOC_FSGROWFSDATA: { + xfs_growfs_data_t in; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&in, arg, sizeof(in))) + return -XFS_ERROR(EFAULT); + + error = xfs_growfs_data(mp, &in); + return -error; + } + + case XFS_IOC_FSGROWFSLOG: { + xfs_growfs_log_t in; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&in, arg, sizeof(in))) + return -XFS_ERROR(EFAULT); + + error = xfs_growfs_log(mp, &in); + return -error; + } + + case XFS_IOC_FSGROWFSRT: { + xfs_growfs_rt_t in; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&in, arg, sizeof(in))) + return -XFS_ERROR(EFAULT); + + error = xfs_growfs_rt(mp, &in); + return -error; + } + + case XFS_IOC_FREEZE: + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (inode->i_sb->s_frozen == SB_UNFROZEN) + freeze_bdev(inode->i_sb->s_bdev); + return 0; + + case XFS_IOC_THAW: + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + if (inode->i_sb->s_frozen != SB_UNFROZEN) + thaw_bdev(inode->i_sb->s_bdev, inode->i_sb); + return 0; + + case XFS_IOC_GOINGDOWN: { + __uint32_t in; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (get_user(in, (__uint32_t __user *)arg)) + return -XFS_ERROR(EFAULT); + + error = xfs_fs_goingdown(mp, in); + return -error; + } + + case XFS_IOC_ERROR_INJECTION: { + xfs_error_injection_t in; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (copy_from_user(&in, arg, sizeof(in))) + return -XFS_ERROR(EFAULT); + + error = xfs_errortag_add(in.errtag, mp); + return -error; + } + + case XFS_IOC_ERROR_CLEARALL: + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + error = xfs_errortag_clearall(mp, 1); + return -error; + + default: + return -ENOTTY; + } +} diff --git a/fs/xfs/linux-2.6/xfs_iops.c b/fs/xfs/linux-2.6/xfs_iops.c index cc4abd3..26e2f4a 100644 --- a/fs/xfs/linux-2.6/xfs_iops.c +++ b/fs/xfs/linux-2.6/xfs_iops.c @@ -157,13 +157,6 @@ xfs_ichgtime_fast( */ ASSERT((flags & XFS_ICHGTIME_ACC) == 0); - /* - * We're not supposed to change timestamps in readonly-mounted - * filesystems. Throw it away if anyone asks us. - */ - if (unlikely(IS_RDONLY(inode))) - return; - if (flags & XFS_ICHGTIME_MOD) { tvp = &inode->i_mtime; ip->i_d.di_mtime.t_sec = (__int32_t)tvp->tv_sec; diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c index 1663533..c6c444a 100644 --- a/fs/xfs/linux-2.6/xfs_lrw.c +++ b/fs/xfs/linux-2.6/xfs_lrw.c @@ -51,6 +51,7 @@ #include "xfs_vnodeops.h" #include +#include #include @@ -679,10 +680,16 @@ start: if (new_size > xip->i_size) xip->i_new_size = new_size; - if (likely(!(ioflags & IO_INVIS))) { + /* + * We're not supposed to change timestamps in readonly-mounted + * filesystems. Throw it away if anyone asks us. + */ + if (likely(!(ioflags & IO_INVIS) && + !mnt_want_write(file->f_vfsmnt))) { file_update_time(file); xfs_ichgtime_fast(xip, inode, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG); + mnt_drop_write(file->f_vfsmnt); } /* diff --git a/include/linux/anon_inodes.h b/include/linux/anon_inodes.h index b2e1ba3..5fdcf9a 100644 --- a/include/linux/anon_inodes.h +++ b/include/linux/anon_inodes.h @@ -8,7 +8,11 @@ #ifndef _LINUX_ANON_INODES_H #define _LINUX_ANON_INODES_H -int anon_inode_getfd(int *pfd, struct inode **pinode, struct file **pfile, +int __anon_inode_getfd(int *pfd, struct file **pfile, + const char *name, const struct file_operations *fops, + void *priv); + +int anon_inode_getfd(int *pfd, const char *name, const struct file_operations *fops, void *priv); diff --git a/include/linux/dcache.h b/include/linux/dcache.h index 6bd6460..cfb1627 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -301,7 +301,9 @@ extern int d_validate(struct dentry *, struct dentry *); */ extern char *dynamic_dname(struct dentry *, char *, int, const char *, ...); +extern char *__d_path(const struct path *path, struct path *root, char *, int); extern char *d_path(struct path *, char *, int); +extern char *dentry_path(struct dentry *, char *, int); /* Allocation counts.. */ @@ -359,7 +361,6 @@ static inline int d_mountpoint(struct dentry *dentry) } extern struct vfsmount *lookup_mnt(struct vfsmount *, struct dentry *); -extern struct vfsmount *__lookup_mnt(struct vfsmount *, struct dentry *, int); extern struct dentry *lookup_create(struct nameidata *nd, int is_dir); extern int sysctl_vfs_cache_pressure; diff --git a/include/linux/file.h b/include/linux/file.h index 7239baa..6534770 100644 --- a/include/linux/file.h +++ b/include/linux/file.h @@ -61,6 +61,7 @@ extern struct kmem_cache *filp_cachep; extern void __fput(struct file *); extern void fput(struct file *); +extern void drop_file_write_access(struct file *file); struct file_operations; struct vfsmount; diff --git a/include/linux/fs.h b/include/linux/fs.h index b84b848..5149545 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -305,7 +305,6 @@ struct vfsmount; extern void __init inode_init(void); extern void __init inode_init_early(void); -extern void __init mnt_init(void); extern void __init files_init(unsigned long); struct buffer_head; @@ -776,6 +775,9 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) index < ra->start + ra->size); } +#define FILE_MNT_WRITE_TAKEN 1 +#define FILE_MNT_WRITE_RELEASED 2 + struct file { /* * fu_list becomes invalid after file_free is called and queued via @@ -810,6 +812,9 @@ struct file { spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ struct address_space *f_mapping; +#ifdef CONFIG_DEBUG_WRITECOUNT + unsigned long f_mnt_write_state; +#endif }; extern spinlock_t files_lock; #define file_list_lock() spin_lock(&files_lock); @@ -818,6 +823,49 @@ extern spinlock_t files_lock; #define get_file(x) atomic_inc(&(x)->f_count) #define file_count(x) atomic_read(&(x)->f_count) +#ifdef CONFIG_DEBUG_WRITECOUNT +static inline void file_take_write(struct file *f) +{ + WARN_ON(f->f_mnt_write_state != 0); + f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN; +} +static inline void file_release_write(struct file *f) +{ + f->f_mnt_write_state |= FILE_MNT_WRITE_RELEASED; +} +static inline void file_reset_write(struct file *f) +{ + f->f_mnt_write_state = 0; +} +static inline void file_check_state(struct file *f) +{ + /* + * At this point, either both or neither of these bits + * should be set. + */ + WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN); + WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED); +} +static inline int file_check_writeable(struct file *f) +{ + if (f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) + return 0; + printk(KERN_WARNING "writeable file with no " + "mnt_want_write()\n"); + WARN_ON(1); + return -EINVAL; +} +#else /* !CONFIG_DEBUG_WRITECOUNT */ +static inline void file_take_write(struct file *filp) {} +static inline void file_release_write(struct file *filp) {} +static inline void file_reset_write(struct file *filp) {} +static inline void file_check_state(struct file *filp) {} +static inline int file_check_writeable(struct file *filp) +{ + return 0; +} +#endif /* CONFIG_DEBUG_WRITECOUNT */ + #define MAX_NON_LFS ((1UL<<31) - 1) /* Page cache limit. The filesystems should put that into their s_maxbytes @@ -1487,12 +1535,7 @@ extern struct vfsmount *kern_mount_data(struct file_system_type *, void *data); #define kern_mount(type) kern_mount_data(type, NULL) extern int may_umount_tree(struct vfsmount *); extern int may_umount(struct vfsmount *); -extern void umount_tree(struct vfsmount *, int, struct list_head *); -extern void release_mounts(struct list_head *); extern long do_mount(char *, char *, char *, unsigned long, void *); -extern struct vfsmount *copy_tree(struct vfsmount *, struct dentry *, int); -extern void mnt_set_mountpoint(struct vfsmount *, struct dentry *, - struct vfsmount *); extern struct vfsmount *collect_mounts(struct vfsmount *, struct dentry *); extern void drop_collected_mounts(struct vfsmount *); @@ -1735,7 +1778,8 @@ extern struct file *create_read_pipe(struct file *f); extern struct file *create_write_pipe(void); extern void free_write_pipe(struct file *); -extern int open_namei(int dfd, const char *, int, int, struct nameidata *); +extern struct file *do_filp_open(int dfd, const char *pathname, + int open_flag, int mode); extern int may_open(struct nameidata *, int, int); extern int kernel_read(struct file *, unsigned long, char *, unsigned long); diff --git a/include/linux/mnt_namespace.h b/include/linux/mnt_namespace.h index 8eed44f..830bbcd 100644 --- a/include/linux/mnt_namespace.h +++ b/include/linux/mnt_namespace.h @@ -5,6 +5,7 @@ #include #include #include +#include struct mnt_namespace { atomic_t count; @@ -14,6 +15,13 @@ struct mnt_namespace { int event; }; +struct proc_mounts { + struct seq_file m; /* must be the first element */ + struct mnt_namespace *ns; + struct path root; + int event; +}; + extern struct mnt_namespace *copy_mnt_ns(unsigned long, struct mnt_namespace *, struct fs_struct *); extern void __put_mnt_ns(struct mnt_namespace *ns); @@ -37,5 +45,9 @@ static inline void get_mnt_ns(struct mnt_namespace *ns) atomic_inc(&ns->count); } +extern const struct seq_operations mounts_op; +extern const struct seq_operations mountinfo_op; +extern const struct seq_operations mountstats_op; + #endif #endif diff --git a/include/linux/mount.h b/include/linux/mount.h index 5ee2df2..b4836d5 100644 --- a/include/linux/mount.h +++ b/include/linux/mount.h @@ -14,6 +14,7 @@ #include #include +#include #include #include @@ -28,8 +29,10 @@ struct mnt_namespace; #define MNT_NOATIME 0x08 #define MNT_NODIRATIME 0x10 #define MNT_RELATIME 0x20 +#define MNT_READONLY 0x40 /* does the user want this to be r/o? */ #define MNT_SHRINKABLE 0x100 +#define MNT_IMBALANCED_WRITE_COUNT 0x200 /* just for debugging */ #define MNT_SHARED 0x1000 /* if the vfsmount is a shared mount */ #define MNT_UNBINDABLE 0x2000 /* if the vfsmount is a unbindable mount */ @@ -53,6 +56,8 @@ struct vfsmount { struct list_head mnt_slave; /* slave list entry */ struct vfsmount *mnt_master; /* slave is on master->mnt_slave_list */ struct mnt_namespace *mnt_ns; /* containing namespace */ + int mnt_id; /* mount identifier */ + int mnt_group_id; /* peer group identifier */ /* * We put mnt_count & mnt_expiry_mark at the end of struct vfsmount * to let these frequently modified fields in a separate cache line @@ -62,6 +67,11 @@ struct vfsmount { int mnt_expiry_mark; /* true if marked for expiry */ int mnt_pinned; int mnt_ghosts; + /* + * This value is not stable unless all of the mnt_writers[] spinlocks + * are held, and all mnt_writer[]s on this mount have 0 as their ->count + */ + atomic_t __mnt_writers; }; static inline struct vfsmount *mntget(struct vfsmount *mnt) @@ -71,9 +81,12 @@ static inline struct vfsmount *mntget(struct vfsmount *mnt) return mnt; } +extern int mnt_want_write(struct vfsmount *mnt); +extern void mnt_drop_write(struct vfsmount *mnt); extern void mntput_no_expire(struct vfsmount *mnt); extern void mnt_pin(struct vfsmount *mnt); extern void mnt_unpin(struct vfsmount *mnt); +extern int __mnt_is_readonly(struct vfsmount *mnt); static inline void mntput(struct vfsmount *mnt) { @@ -83,8 +96,6 @@ static inline void mntput(struct vfsmount *mnt) } } -extern void free_vfsmnt(struct vfsmount *mnt); -extern struct vfsmount *alloc_vfsmnt(const char *name); extern struct vfsmount *do_kern_mount(const char *fstype, int flags, const char *name, void *data); diff --git a/include/linux/security.h b/include/linux/security.h index c673dfd..0a10329 100644 --- a/include/linux/security.h +++ b/include/linux/security.h @@ -220,7 +220,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) * loopback/bind mount (@flags & MS_BIND), @dev_name identifies the * pathname of the object being mounted. * @dev_name contains the name for object being mounted. - * @nd contains the nameidata structure for mount point object. + * @path contains the path for mount point object. * @type contains the filesystem type. * @flags contains the mount flags. * @data contains the filesystem-specific data. @@ -239,7 +239,7 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) * Check permission before the device with superblock @mnt->sb is mounted * on the mount point named by @nd. * @mnt contains the vfsmount for device being mounted. - * @nd contains the nameidata object for the mount point. + * @path contains the path for the mount point. * Return 0 if permission is granted. * @sb_umount: * Check permission before the @mnt file system is unmounted. @@ -268,16 +268,16 @@ static inline void security_free_mnt_opts(struct security_mnt_opts *opts) * This hook is called any time a mount is successfully grafetd to * the tree. * @mnt contains the mounted filesystem. - * @mountpoint_nd contains the nameidata structure for the mount point. + * @mountpoint contains the path for the mount point. * @sb_pivotroot: * Check permission before pivoting the root filesystem. - * @old_nd contains the nameidata structure for the new location of the current root (put_old). - * @new_nd contains the nameidata structure for the new root (new_root). + * @old_path contains the path for the new location of the current root (put_old). + * @new_path contains the path for the new root (new_root). * Return 0 if permission is granted. * @sb_post_pivotroot: * Update module state after a successful pivot. - * @old_nd contains the nameidata structure for the old root. - * @new_nd contains the nameidata structure for the new root. + * @old_path contains the path for the old root. + * @new_path contains the path for the new root. * @sb_get_mnt_opts: * Get the security relevant mount options used for a superblock * @sb the superblock to get security mount options from @@ -1260,20 +1260,20 @@ struct security_operations { int (*sb_copy_data)(char *orig, char *copy); int (*sb_kern_mount) (struct super_block *sb, void *data); int (*sb_statfs) (struct dentry *dentry); - int (*sb_mount) (char *dev_name, struct nameidata * nd, + int (*sb_mount) (char *dev_name, struct path *path, char *type, unsigned long flags, void *data); - int (*sb_check_sb) (struct vfsmount * mnt, struct nameidata * nd); + int (*sb_check_sb) (struct vfsmount * mnt, struct path *path); int (*sb_umount) (struct vfsmount * mnt, int flags); void (*sb_umount_close) (struct vfsmount * mnt); void (*sb_umount_busy) (struct vfsmount * mnt); void (*sb_post_remount) (struct vfsmount * mnt, unsigned long flags, void *data); void (*sb_post_addmount) (struct vfsmount * mnt, - struct nameidata * mountpoint_nd); - int (*sb_pivotroot) (struct nameidata * old_nd, - struct nameidata * new_nd); - void (*sb_post_pivotroot) (struct nameidata * old_nd, - struct nameidata * new_nd); + struct path *mountpoint); + int (*sb_pivotroot) (struct path *old_path, + struct path *new_path); + void (*sb_post_pivotroot) (struct path *old_path, + struct path *new_path); int (*sb_get_mnt_opts) (const struct super_block *sb, struct security_mnt_opts *opts); int (*sb_set_mnt_opts) (struct super_block *sb, @@ -1528,16 +1528,16 @@ void security_sb_free(struct super_block *sb); int security_sb_copy_data(char *orig, char *copy); int security_sb_kern_mount(struct super_block *sb, void *data); int security_sb_statfs(struct dentry *dentry); -int security_sb_mount(char *dev_name, struct nameidata *nd, +int security_sb_mount(char *dev_name, struct path *path, char *type, unsigned long flags, void *data); -int security_sb_check_sb(struct vfsmount *mnt, struct nameidata *nd); +int security_sb_check_sb(struct vfsmount *mnt, struct path *path); int security_sb_umount(struct vfsmount *mnt, int flags); void security_sb_umount_close(struct vfsmount *mnt); void security_sb_umount_busy(struct vfsmount *mnt); void security_sb_post_remount(struct vfsmount *mnt, unsigned long flags, void *data); -void security_sb_post_addmount(struct vfsmount *mnt, struct nameidata *mountpoint_nd); -int security_sb_pivotroot(struct nameidata *old_nd, struct nameidata *new_nd); -void security_sb_post_pivotroot(struct nameidata *old_nd, struct nameidata *new_nd); +void security_sb_post_addmount(struct vfsmount *mnt, struct path *mountpoint); +int security_sb_pivotroot(struct path *old_path, struct path *new_path); +void security_sb_post_pivotroot(struct path *old_path, struct path *new_path); int security_sb_get_mnt_opts(const struct super_block *sb, struct security_mnt_opts *opts); int security_sb_set_mnt_opts(struct super_block *sb, struct security_mnt_opts *opts); @@ -1805,7 +1805,7 @@ static inline int security_sb_statfs (struct dentry *dentry) return 0; } -static inline int security_sb_mount (char *dev_name, struct nameidata *nd, +static inline int security_sb_mount (char *dev_name, struct path *path, char *type, unsigned long flags, void *data) { @@ -1813,7 +1813,7 @@ static inline int security_sb_mount (char *dev_name, struct nameidata *nd, } static inline int security_sb_check_sb (struct vfsmount *mnt, - struct nameidata *nd) + struct path *path) { return 0; } @@ -1834,17 +1834,17 @@ static inline void security_sb_post_remount (struct vfsmount *mnt, { } static inline void security_sb_post_addmount (struct vfsmount *mnt, - struct nameidata *mountpoint_nd) + struct path *mountpoint) { } -static inline int security_sb_pivotroot (struct nameidata *old_nd, - struct nameidata *new_nd) +static inline int security_sb_pivotroot (struct path *old_path, + struct path *new_path) { return 0; } -static inline void security_sb_post_pivotroot (struct nameidata *old_nd, - struct nameidata *new_nd) +static inline void security_sb_post_pivotroot (struct path *old_path, + struct path *new_path) { } static inline int security_sb_get_mnt_opts(const struct super_block *sb, struct security_mnt_opts *opts) diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h index 67c2563..d7cc1b3 100644 --- a/include/linux/seq_file.h +++ b/include/linux/seq_file.h @@ -10,6 +10,7 @@ struct seq_operations; struct file; struct path; struct inode; +struct dentry; struct seq_file { char *buf; @@ -30,6 +31,8 @@ struct seq_operations { int (*show) (struct seq_file *m, void *v); }; +#define SEQ_SKIP 1 + int seq_open(struct file *, const struct seq_operations *); ssize_t seq_read(struct file *, char __user *, size_t, loff_t *); loff_t seq_lseek(struct file *, loff_t, int); @@ -42,6 +45,9 @@ int seq_printf(struct seq_file *, const char *, ...) __attribute__ ((format (printf,2,3))); int seq_path(struct seq_file *, struct path *, char *); +int seq_dentry(struct seq_file *, struct dentry *, char *); +int seq_path_root(struct seq_file *m, struct path *path, struct path *root, + char *esc); int single_open(struct file *, int (*)(struct seq_file *, void *), void *); int single_release(struct inode *, struct file *); diff --git a/ipc/mqueue.c b/ipc/mqueue.c index 60f7a27..94fd3b0 100644 --- a/ipc/mqueue.c +++ b/ipc/mqueue.c @@ -598,6 +598,7 @@ static struct file *do_create(struct dentry *dir, struct dentry *dentry, int oflag, mode_t mode, struct mq_attr __user *u_attr) { struct mq_attr attr; + struct file *result; int ret; if (u_attr) { @@ -612,13 +613,24 @@ static struct file *do_create(struct dentry *dir, struct dentry *dentry, } mode &= ~current->fs->umask; + ret = mnt_want_write(mqueue_mnt); + if (ret) + goto out; ret = vfs_create(dir->d_inode, dentry, mode, NULL); dentry->d_fsdata = NULL; if (ret) - goto out; - - return dentry_open(dentry, mqueue_mnt, oflag); - + goto out_drop_write; + + result = dentry_open(dentry, mqueue_mnt, oflag); + /* + * dentry_open() took a persistent mnt_want_write(), + * so we can now drop this one. + */ + mnt_drop_write(mqueue_mnt); + return result; + +out_drop_write: + mnt_drop_write(mqueue_mnt); out: dput(dentry); mntput(mqueue_mnt); @@ -742,8 +754,11 @@ asmlinkage long sys_mq_unlink(const char __user *u_name) inode = dentry->d_inode; if (inode) atomic_inc(&inode->i_count); - + err = mnt_want_write(mqueue_mnt); + if (err) + goto out_err; err = vfs_unlink(dentry->d_parent->d_inode, dentry); + mnt_drop_write(mqueue_mnt); out_err: dput(dentry); diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0796c1a..d52e4f0 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -437,6 +437,16 @@ config DEBUG_VM If unsure, say N. +config DEBUG_WRITECOUNT + bool "Debug filesystem writers count" + depends on DEBUG_KERNEL + help + Enable this to catch wrong use of the writers count in struct + vfsmount. This will increase the size of each file struct by + 32 bits. + + If unsure, say N. + config DEBUG_LIST bool "Debug linked list manipulation" depends on DEBUG_KERNEL diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index b8788fd..8576e7b 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -819,7 +819,11 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) */ mode = S_IFSOCK | (SOCK_INODE(sock)->i_mode & ~current->fs->umask); + err = mnt_want_write(nd.path.mnt); + if (err) + goto out_mknod_dput; err = vfs_mknod(nd.path.dentry->d_inode, dentry, mode, 0); + mnt_drop_write(nd.path.mnt); if (err) goto out_mknod_dput; mutex_unlock(&nd.path.dentry->d_inode->i_mutex); diff --git a/security/dummy.c b/security/dummy.c index 78d8f92..0be900e 100644 --- a/security/dummy.c +++ b/security/dummy.c @@ -196,13 +196,13 @@ static int dummy_sb_statfs (struct dentry *dentry) return 0; } -static int dummy_sb_mount (char *dev_name, struct nameidata *nd, char *type, +static int dummy_sb_mount (char *dev_name, struct path *path, char *type, unsigned long flags, void *data) { return 0; } -static int dummy_sb_check_sb (struct vfsmount *mnt, struct nameidata *nd) +static int dummy_sb_check_sb (struct vfsmount *mnt, struct path *path) { return 0; } @@ -229,17 +229,17 @@ static void dummy_sb_post_remount (struct vfsmount *mnt, unsigned long flags, } -static void dummy_sb_post_addmount (struct vfsmount *mnt, struct nameidata *nd) +static void dummy_sb_post_addmount (struct vfsmount *mnt, struct path *path) { return; } -static int dummy_sb_pivotroot (struct nameidata *old_nd, struct nameidata *new_nd) +static int dummy_sb_pivotroot (struct path *old_path, struct path *new_path) { return 0; } -static void dummy_sb_post_pivotroot (struct nameidata *old_nd, struct nameidata *new_nd) +static void dummy_sb_post_pivotroot (struct path *old_path, struct path *new_path) { return; } diff --git a/security/security.c b/security/security.c index b1387a6..2caae71 100644 --- a/security/security.c +++ b/security/security.c @@ -260,15 +260,15 @@ int security_sb_statfs(struct dentry *dentry) return security_ops->sb_statfs(dentry); } -int security_sb_mount(char *dev_name, struct nameidata *nd, +int security_sb_mount(char *dev_name, struct path *path, char *type, unsigned long flags, void *data) { - return security_ops->sb_mount(dev_name, nd, type, flags, data); + return security_ops->sb_mount(dev_name, path, type, flags, data); } -int security_sb_check_sb(struct vfsmount *mnt, struct nameidata *nd) +int security_sb_check_sb(struct vfsmount *mnt, struct path *path) { - return security_ops->sb_check_sb(mnt, nd); + return security_ops->sb_check_sb(mnt, path); } int security_sb_umount(struct vfsmount *mnt, int flags) @@ -291,19 +291,19 @@ void security_sb_post_remount(struct vfsmount *mnt, unsigned long flags, void *d security_ops->sb_post_remount(mnt, flags, data); } -void security_sb_post_addmount(struct vfsmount *mnt, struct nameidata *mountpoint_nd) +void security_sb_post_addmount(struct vfsmount *mnt, struct path *mountpoint) { - security_ops->sb_post_addmount(mnt, mountpoint_nd); + security_ops->sb_post_addmount(mnt, mountpoint); } -int security_sb_pivotroot(struct nameidata *old_nd, struct nameidata *new_nd) +int security_sb_pivotroot(struct path *old_path, struct path *new_path) { - return security_ops->sb_pivotroot(old_nd, new_nd); + return security_ops->sb_pivotroot(old_path, new_path); } -void security_sb_post_pivotroot(struct nameidata *old_nd, struct nameidata *new_nd) +void security_sb_post_pivotroot(struct path *old_path, struct path *new_path) { - security_ops->sb_post_pivotroot(old_nd, new_nd); + security_ops->sb_post_pivotroot(old_path, new_path); } int security_sb_get_mnt_opts(const struct super_block *sb, diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index d39b59c..c29e1a8 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2344,22 +2344,22 @@ static int selinux_sb_statfs(struct dentry *dentry) } static int selinux_mount(char * dev_name, - struct nameidata *nd, + struct path *path, char * type, unsigned long flags, void * data) { int rc; - rc = secondary_ops->sb_mount(dev_name, nd, type, flags, data); + rc = secondary_ops->sb_mount(dev_name, path, type, flags, data); if (rc) return rc; if (flags & MS_REMOUNT) - return superblock_has_perm(current, nd->path.mnt->mnt_sb, + return superblock_has_perm(current, path->mnt->mnt_sb, FILESYSTEM__REMOUNT, NULL); else - return dentry_has_perm(current, nd->path.mnt, nd->path.dentry, + return dentry_has_perm(current, path->mnt, path->dentry, FILE__MOUNTON); } diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c index 732ba27..1d4fb02 100644 --- a/security/smack/smack_lsm.c +++ b/security/smack/smack_lsm.c @@ -315,10 +315,10 @@ static int smack_sb_statfs(struct dentry *dentry) * Returns 0 if current can write the floor of the filesystem * being mounted on, an error code otherwise. */ -static int smack_sb_mount(char *dev_name, struct nameidata *nd, +static int smack_sb_mount(char *dev_name, struct path *path, char *type, unsigned long flags, void *data) { - struct superblock_smack *sbp = nd->path.mnt->mnt_sb->s_security; + struct superblock_smack *sbp = path->mnt->mnt_sb->s_security; return smk_curacc(sbp->smk_floor, MAY_WRITE); } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index b2e1289..3351d71 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -718,14 +718,11 @@ static struct file_operations kvm_vcpu_fops = { static int create_vcpu_fd(struct kvm_vcpu *vcpu) { int fd, r; - struct inode *inode; - struct file *file; - r = anon_inode_getfd(&fd, &inode, &file, - "kvm-vcpu", &kvm_vcpu_fops, vcpu); + r = anon_inode_getfd(&fd, "kvm-vcpu", &kvm_vcpu_fops, vcpu); if (r) return r; - atomic_inc(&vcpu->kvm->filp->f_count); + get_file(vcpu->kvm->filp); return fd; } @@ -1015,20 +1012,26 @@ static struct file_operations kvm_vm_fops = { static int kvm_dev_ioctl_create_vm(void) { int fd, r; - struct inode *inode; struct file *file; struct kvm *kvm; kvm = kvm_create_vm(); if (IS_ERR(kvm)) return PTR_ERR(kvm); - r = anon_inode_getfd(&fd, &inode, &file, "kvm-vm", &kvm_vm_fops, kvm); + r = __anon_inode_getfd(&fd, &file, "kvm-vm", &kvm_vm_fops, kvm); if (r) { kvm_destroy_vm(kvm); return r; } kvm->filp = file; + /* + * Drop the temporary reference pinned for us by __anon_inode_getfd(). + * If we'd raced with close() from another thread, that could free kvm, + * but by that point we don't care - from here on it could be reached + * only via file->private_data. + */ + fput(file); return fd; }