fallocate() implementation on i86, x86_64 and powerpc This patch implements sys_fallocate() and adds support on i386, x86_64 and powerpc platforms. Signed-off-by: Amit Arora General fallocate discussion Description: ============ fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called fallocate. Applications can use this feature to avoid fragmentation to certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if later the the system becomes full. Currently, glibc provides an interface called posix_fallocate() which can be used for similar cause. Though this has the advantage of working on all file systems, but it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and incase the kernel/filesystem does not implement it, it should fall back to the current implementation of writing zeroes to the new blocks. Interface: ========== The proposed system call's layout is: asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) fd: The descriptor of the open file. mode*: This specifies the behavior of the system call. Currently the system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE. FA_ALLOCATE: Applications can use this mode to preallocate blocks to a given file (specified by fd). This mode changes the file size if the preallocation is done beyond the EOF. It also updates the ctime in the inode of the corresponding file, marking a successfull allocation. FA_DEALLOCATE: This mode can be used by applications to deallocate the previously preallocated blocks. This also may change the file size and the ctime/mtime. * New modes might get added in future. One such new mode which is already under discussion is FA_PREALLOCATE, which when used will preallocate space but will not change the filesize and [cm]time. Since the semantics of this new mode is not clear and agreed upon yet, this patchset does not implement it currently. offset: This is the offset in bytes, from where the preallocation should start. len: This is the number of bytes requested for preallocation (from offset). RETURN VALUE: The system call returns 0 on success and an error on failure. This is done to keep the semantics same as of posix_fallocate(). sys_fallocate() on s390: ======================== There is a problem with s390 ABI to implement sys_fallocate() with the proposed order of arguments. Martin Schwidefsky has suggested a patch to solve this problem which makes use of a wrapper in the kernel. This will require special handling of this system call on s390 in glibc as well. But, this seems to be the best solution so far. Known Problems: ============== mmapped writes into uninitialized extents is a known problem with the current ext4 patches. Like XFS, ext4 may need to implement ->page_mkwrite() to solve this. See: http://lkml.org/lkml/2007/5/8/583 Since there is a talk of ->fault() replacing ->page_mkwrite() and also with a generic block_page_mkwrite() implementation already posted, we can implement this later some time. See: http://lkml.org/lkml/2007/3/7/161 http://lkml.org/lkml/2007/3/18/198 ToDos: ====== 1> Implementation on other architectures (other than i386, x86_64, ppc64 and s390(x)). David Chinner has already posted a patch for ia64. 2> A generic file system operation to handle fallocate (generic_fallocate), for filesystems that do _not_ have the fallocate inode operation implemented. 3> Changes to glibc, a) to support fallocate() system call b) to make posix_fallocate() and posix_fallocate64() call fallocate() Changelog: ========== Changes from Take3 to Take4: 1) Do not update c/mtime. Let each filesystem update ctime (update of mtime will not be required for allocation since we touch only metadata/inode and not blocks), if required. Changes from Take2 to Take3: 1) Patches now based on 2.6.22-rc1 kernel. Changes from Take1(initial post on 26th April, 2007) to Take2: 1) Added description before sys_fallocate() definition. 2) Return EINVAL for len<=0 (With new draft that Ulrich pointed to, posix_fallocate should return EINVAL for len <= 0. 3) Return EOPNOTSUPP if mode is not one of FA_ALLOCATE or FA_DEALLOCATE 4) Do not return ENODEV for dirs (let individual file systems decide if they want to support preallocation to directories or not. 5) Check for wrap through zero. 6) Update c/mtime if fallocate() succeeds. 7) Added mode descriptions in fs.h 8) Added variable names to function definition (fallocate inode op) --- arch/i386/kernel/syscall_table.S | 1 arch/powerpc/kernel/sys_ppc32.c | 7 +++ arch/x86_64/ia32/ia32entry.S | 1 fs/open.c | 86 +++++++++++++++++++++++++++++++++++++++ include/asm-i386/unistd.h | 3 - include/asm-powerpc/systbl.h | 1 include/asm-powerpc/unistd.h | 3 - include/asm-x86_64/unistd.h | 2 include/linux/fs.h | 13 +++++ include/linux/syscalls.h | 1 10 files changed, 116 insertions(+), 2 deletions(-) diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index bf6adce..8344c70 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -323,3 +323,4 @@ ENTRY(sys_call_table) .long sys_signalfd .long sys_timerfd .long sys_eventfd + .long sys_fallocate diff --git a/arch/powerpc/kernel/sys_ppc32.c b/arch/powerpc/kernel/sys_ppc32.c index 047246a..1269b32 100644 --- a/arch/powerpc/kernel/sys_ppc32.c +++ b/arch/powerpc/kernel/sys_ppc32.c @@ -773,6 +773,13 @@ asmlinkage int compat_sys_truncate64(const char __user * path, u32 reg4, return sys_truncate(path, (high << 32) | low); } +asmlinkage long compat_sys_fallocate(int fd, int mode, u32 offhi, u32 offlo, + u32 lenhi, u32 lenlo) +{ + return sys_fallocate(fd, mode, ((loff_t)offhi << 32) | offlo, + ((loff_t)lenhi << 32) | lenlo); +} + asmlinkage int compat_sys_ftruncate64(unsigned int fd, u32 reg4, unsigned long high, unsigned long low) { diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 21868f9..d6064cd 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -719,4 +719,5 @@ ia32_sys_call_table: .quad compat_sys_signalfd .quad compat_sys_timerfd .quad sys_eventfd + .quad sys_fallocate ia32_syscall_end: diff --git a/fs/open.c b/fs/open.c index 0d515d1..02d50c3 100644 --- a/fs/open.c +++ b/fs/open.c @@ -353,6 +353,92 @@ asmlinkage long sys_ftruncate64(unsigned int fd, loff_t length) #endif /* + * sys_fallocate - preallocate blocks or free preallocated blocks + * @fd: the file descriptor + * @mode: mode specifies if fallocate should preallocate blocks OR free + * (unallocate) preallocated blocks. Currently only FA_ALLOCATE and + * FA_DEALLOCATE modes are supported. + * @offset: The offset within file, from where (un)allocation is being + * requested. It should not have a negative value. + * @len: The amount (in bytes) of space to be (un)allocated, from the offset. + * + * This system call, depending on the mode, preallocates or unallocates blocks + * for a file. The range of blocks depends on the value of offset and len + * arguments provided by the user/application. For FA_ALLOCATE mode, if this + * system call succeeds, subsequent writes to the file in the given range + * (specified by offset & len) should not fail - even if the file system + * later becomes full. Hence the preallocation done is persistent (valid + * even after reopen of the file and remount/reboot). + * + * It is expected that the ->fallocate() inode operation implemented by the + * individual file systems will update the file size and/or ctime/mtime + * depending on the mode and also on the success of the operation. + * + * Note: Incase the file system does not support preallocation, + * posix_fallocate() should fall back to the library implementation (i.e. + * allocating zero-filled new blocks to the file). + * + * Return Values + * 0 : On SUCCESS a value of zero is returned. + * error : On Failure, an error code will be returned. + * An error code of -ENOSYS or -EOPNOTSUPP should make posix_fallocate() + * fall back on library implementation of fallocate. + * + * Generic fallocate to be added for file systems that do not + * support fallocate it. + */ +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len) +{ + struct file *file; + struct inode *inode; + long ret = -EINVAL; + + if (offset < 0 || len <= 0) + goto out; + + /* Return error if mode is not supported */ + ret = -EOPNOTSUPP; + if (mode != FA_ALLOCATE && mode !=FA_DEALLOCATE) + goto out; + + ret = -EBADF; + file = fget(fd); + if (!file) + goto out; + if (!(file->f_mode & FMODE_WRITE)) + goto out_fput; + + inode = file->f_path.dentry->d_inode; + + ret = -ESPIPE; + if (S_ISFIFO(inode->i_mode)) + goto out_fput; + + ret = -ENODEV; + /* + * Let individual file system decide if it supports preallocation + * for directories or not. + */ + if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) + goto out_fput; + + ret = -EFBIG; + /* Check for wrap through zero too */ + if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) + goto out_fput; + + if (inode->i_op && inode->i_op->fallocate) + ret = inode->i_op->fallocate(inode, mode, offset, len); + else + ret = -ENOSYS; + +out_fput: + fput(file); +out: + return ret; +} + +/* * access() needs to use the real uid/gid, not the effective uid/gid. * We do this by temporarily clearing all FS-related capabilities and * switching the fsuid/fsgid around to the real ones. diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index e84ace1..9b15545 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -329,10 +329,11 @@ #define __NR_signalfd 321 #define __NR_timerfd 322 #define __NR_eventfd 323 +#define __NR_fallocate 324 #ifdef __KERNEL__ -#define NR_syscalls 324 +#define NR_syscalls 325 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR diff --git a/include/asm-powerpc/systbl.h b/include/asm-powerpc/systbl.h index 700ca59..437b0df 100644 --- a/include/asm-powerpc/systbl.h +++ b/include/asm-powerpc/systbl.h @@ -308,6 +308,7 @@ COMPAT_SYS_SPU(move_pages) SYSCALL_SPU(getcpu) COMPAT_SYS(epoll_pwait) COMPAT_SYS_SPU(utimensat) +COMPAT_SYS(fallocate) COMPAT_SYS_SPU(signalfd) COMPAT_SYS_SPU(timerfd) SYSCALL_SPU(eventfd) diff --git a/include/asm-powerpc/unistd.h b/include/asm-powerpc/unistd.h index e3c28dc..cb22470 100644 --- a/include/asm-powerpc/unistd.h +++ b/include/asm-powerpc/unistd.h @@ -330,10 +330,11 @@ #define __NR_signalfd 305 #define __NR_timerfd 306 #define __NR_eventfd 307 +#define __NR_fallocate 308 #ifdef __KERNEL__ -#define __NR_syscalls 308 +#define __NR_syscalls 309 #define __NR__exit __NR_exit #define NR_syscalls __NR_syscalls diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index ae1ed05..12329e9 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -630,6 +630,8 @@ __SYSCALL(__NR_signalfd, sys_signalfd) __SYSCALL(__NR_timerfd, sys_timerfd) #define __NR_eventfd 283 __SYSCALL(__NR_eventfd, sys_eventfd) +#define __NR_fallocate 284 +__SYSCALL(__NR_fallocate, sys_fallocate) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR diff --git a/include/linux/fs.h b/include/linux/fs.h index 7cf0c54..1514354 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -266,6 +266,17 @@ extern int dir_notify_enable; #define SYNC_FILE_RANGE_WRITE 2 #define SYNC_FILE_RANGE_WAIT_AFTER 4 +/* + * sys_fallocate modes + * Currently sys_fallocate supports two modes: + * FA_ALLOCATE : This is the preallocate mode, using which an application/user + * may request (pre)allocation of blocks. + * FA_DEALLOCATE: This is the deallocate mode, which can be used to free + * the preallocated blocks. + */ +#define FA_ALLOCATE 0x1 +#define FA_DEALLOCATE 0x2 + #ifdef __KERNEL__ #include @@ -1137,6 +1148,8 @@ struct inode_operations { ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); void (*truncate_range)(struct inode *, loff_t, loff_t); + long (*fallocate)(struct inode *inode, int mode, loff_t offset, + loff_t len); }; struct seq_file; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index b02070e..dd61dc5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -608,6 +608,7 @@ asmlinkage long sys_signalfd(int ufd, sigset_t __user *user_mask, size_t sizemas asmlinkage long sys_timerfd(int ufd, int clockid, int flags, const struct itimerspec __user *utmr); asmlinkage long sys_eventfd(unsigned int count); +asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len); int kernel_execve(const char *filename, char *const argv[], char *const envp[]);