From: Wu Fengguang

The read-ahead logic is called when the reading hits
	- a PG_readahead marked page;
	- a non-present page.

ra.prev_page should be properly set up on entrance, and
readahead_cache_hit() should be called on every page reference to maintain
the cache_hits counter.

This call scheme achieves the following goals:
	- makes all stateful/stateless methods happy;
	- eliminates the cache hit problem naturally;
	- lives in harmony with application managed read-aheads via
	  fadvise/madvise (a userspace sketch of such application managed
	  read-ahead follows the patch).

Signed-off-by: Wu Fengguang
DESC
readahead: initial method - expected read size - fix fastcall
EDESC
From: Fengguang Wu

Remove the 'fastcall' directive for function readahead_close().  It has
drawn concerns from Andrew Morton.  Now I have some benchmarks on it, and
they prove it to be a _false_ optimization.

The tests are simple runs of the following command over _cached_ dirs:

	time find / > /dev/null

Summary table (averages):
              user        sys         cpu     total
fastcall:     1.236       4.39        89%     6.2936
non-fastcall: 1.18        4.14166667  92%     5.75416667
stock:        1.25833333  4.14666667  93.3%   5.75866667

-----------
Detailed outputs:

readahead patched kernel with fastcall:
noglob find / > /dev/null  1.21s user 4.58s system 90% cpu 6.378 total
noglob find / > /dev/null  1.25s user 4.47s system 86% cpu 6.623 total
noglob find / > /dev/null  1.23s user 4.36s system 90% cpu 6.173 total
noglob find / > /dev/null  1.25s user 4.33s system 92% cpu 6.067 total
noglob find / > /dev/null  1.24s user 4.21s system 87% cpu 6.227 total

readahead patched kernel without fastcall:
noglob find / > /dev/null  1.21s user 4.46s system 95% cpu 5.962 total
noglob find / > /dev/null  1.26s user 4.58s system 94% cpu 6.142 total
noglob find / > /dev/null  1.10s user 3.80s system 86% cpu 5.661 total
noglob find / > /dev/null  1.13s user 3.98s system 95% cpu 5.355 total
noglob find / > /dev/null  1.18s user 4.00s system 89% cpu 5.805 total
noglob find / > /dev/null  1.22s user 4.03s system 93% cpu 5.600 total

stock kernel:
noglob find / > /dev/null  1.22s user 4.24s system 94% cpu 5.803 total
noglob find / > /dev/null  1.31s user 4.21s system 95% cpu 5.784 total
noglob find / > /dev/null  1.27s user 4.24s system 97% cpu 5.676 total
noglob find / > /dev/null  1.34s user 4.21s system 94% cpu 5.844 total
noglob find / > /dev/null  1.26s user 4.08s system 89% cpu 5.935 total
noglob find / > /dev/null  1.15s user 3.90s system 91% cpu 5.510 total

-----------
A similar regression has also been found by Voluspa:

> "cd /usr ; time find . -type f -exec md5sum {} \;"
>
> 2.6.17-rc5 ------- 2.6.17-rc5-ar
>
> real 21m21.009s -- 21m37.663s
> user  3m20.784s --  3m20.701s
> sys   6m34.261s --  6m41.735s

Signed-off-by: Wu Fengguang
DESC
readahead: call scheme - no fastcall for readahead_cache_hit()
EDESC
From: Wu Fengguang

Remove the 'fastcall' directive for readahead_cache_hit().
It leads to unfavorable performance in the following micro benchmark on
i386 with CONFIG_REGPARM=n:

Command:
	time cp cold /dev/null

Summary:
             user    sys     cpu     total
no-fastcall  1.24    24.88   90.9    28.57
fastcall     1.16    25.69   91.5    29.23

Details:

without fastcall:
cp cold /dev/null  1.27s user 24.63s system 91% cpu 28.348 total
cp cold /dev/null  1.17s user 25.09s system 91% cpu 28.653 total
cp cold /dev/null  1.24s user 24.75s system 91% cpu 28.448 total
cp cold /dev/null  1.20s user 25.04s system 91% cpu 28.614 total
cp cold /dev/null  1.31s user 24.67s system 91% cpu 28.499 total
cp cold /dev/null  1.30s user 24.87s system 91% cpu 28.530 total
cp cold /dev/null  1.26s user 24.84s system 91% cpu 28.542 total
cp cold /dev/null  1.16s user 25.15s system 90% cpu 28.925 total

with fastcall:
cp cold /dev/null  1.16s user 26.39s system 91% cpu 30.061 total
cp cold /dev/null  1.25s user 26.53s system 91% cpu 30.378 total
cp cold /dev/null  1.10s user 25.32s system 92% cpu 28.679 total
cp cold /dev/null  1.15s user 25.20s system 91% cpu 28.747 total
cp cold /dev/null  1.19s user 25.38s system 92% cpu 28.841 total
cp cold /dev/null  1.11s user 25.75s system 92% cpu 29.126 total
cp cold /dev/null  1.17s user 25.49s system 91% cpu 29.042 total
cp cold /dev/null  1.17s user 25.49s system 92% cpu 28.970 total

Signed-off-by: Wu Fengguang
DESC
readahead: kconfig option READAHEAD_HIT_FEEDBACK
EDESC
From: Wu Fengguang

Introduce a kconfig option READAHEAD_HIT_FEEDBACK to let users disable the
readahead hit feedback feature.

The readahead hit accounting brings per-page overheads.  However, it is
necessary for the onseek method, and for a possible strides method in the
future.

Signed-off-by: Wu Fengguang
DESC
readahead-call-scheme fix
EDESC
From: Mike Galbraith

On Thu, 2006-08-10 at 02:19 -0700, Andrew Morton wrote:
> It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
> perhaps that got broken.

A typo was pinning pagecache.  This fixes the leak encountered with
rpm -qaV.

Signed-off-by: Mike Galbraith
Signed-off-by: Andrew Morton
---

 include/linux/fs.h |   14 +--
 include/linux/mm.h |   17 +++
 mm/Kconfig         |   10 ++
 mm/filemap.c       |   50 ++++++++++-
 mm/readahead.c     |  184 +++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 258 insertions(+), 17 deletions(-)

diff -puN include/linux/fs.h~readahead-call-scheme include/linux/fs.h
--- a/include/linux/fs.h~readahead-call-scheme
+++ a/include/linux/fs.h
@@ -722,6 +722,13 @@ struct file_ra_state {
 	pgoff_t readahead_index;
 
 	/*
+	 * Snapshot of the (node's) read-ahead aging value
+	 * on time of I/O submission.
+	 */
+	unsigned long age;
+
+#ifdef CONFIG_READAHEAD_HIT_FEEDBACK
+	/*
 	 * Read-ahead hits.
 	 * i.e. # of distinct read-ahead pages accessed.
 	 *
@@ -734,12 +741,7 @@ struct file_ra_state {
 	u16 hit1;		/* for the current sequence */
 	u16 hit2;		/* for the previous sequence */
 	u16 hit3;		/* for the prev-prev sequence */
-
-	/*
-	 * Snapshot of the (node's) read-ahead aging value
-	 * on time of I/O submission.
-	 */
-	unsigned long age;
+#endif
 };
 #endif
 };
diff -puN include/linux/mm.h~readahead-call-scheme include/linux/mm.h
--- a/include/linux/mm.h~readahead-call-scheme
+++ a/include/linux/mm.h
@@ -1067,7 +1067,22 @@ unsigned long page_cache_readahead(struc
 void handle_ra_miss(struct address_space *mapping,
 		    struct file_ra_state *ra, pgoff_t offset);
 unsigned long max_sane_readahead(unsigned long nr);
-void fastcall readahead_close(struct file *file);
+void readahead_close(struct file *file);
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t first_index, pgoff_t index, pgoff_t last_index);
+
+#if defined(CONFIG_READAHEAD_HIT_FEEDBACK) || defined(CONFIG_DEBUG_READAHEAD)
+void readahead_cache_hit(struct file_ra_state *ra, struct page *page);
+#else
+static inline void readahead_cache_hit(struct file_ra_state *ra,
+					struct page *page)
+{
+}
+#endif
+
 
 #ifdef CONFIG_ADAPTIVE_READAHEAD
 extern int readahead_ratio;
diff -puN mm/Kconfig~readahead-call-scheme mm/Kconfig
--- a/mm/Kconfig~readahead-call-scheme
+++ a/mm/Kconfig
@@ -225,6 +225,16 @@ config DEBUG_READAHEAD
 
 	  Say N for production servers.
 
+config READAHEAD_HIT_FEEDBACK
+	bool "Readahead hit feedback"
+	default y
+	depends on READAHEAD_ALLOW_OVERHEADS
+	help
+	  Enable readahead hit feedback.
+
+	  It is not needed in normal cases, except for detecting the
+	  seek-and-read pattern.
+
 config READAHEAD_SMOOTH_AGING
 	bool "Fine grained readahead aging"
 	default n
diff -puN mm/filemap.c~readahead-call-scheme mm/filemap.c
--- a/mm/filemap.c~readahead-call-scheme
+++ a/mm/filemap.c
@@ -975,14 +975,32 @@ void do_generic_mapping_read(struct addr
 		nr = nr - offset;
 
 		cond_resched();
-		if (index == next_index)
+
+		if (!prefer_adaptive_readahead() && index == next_index)
 			next_index = page_cache_readahead(mapping, &ra, filp,
 					index, last_index - index);
 
 find_page:
 		page = find_get_page(mapping, index);
+		if (prefer_adaptive_readahead()) {
+			if (unlikely(page == NULL)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, NULL,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+				page = find_get_page(mapping, index);
+			} else if (PageReadahead(page)) {
+				ra.prev_page = prev_index;
+				page_cache_readahead_adaptive(mapping, &ra,
+						filp, prev_page, page,
+						*ppos >> PAGE_CACHE_SHIFT,
+						index, last_index);
+			}
+		}
 		if (unlikely(page == NULL)) {
-			handle_ra_miss(mapping, &ra, index);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, &ra, index);
 			goto no_cached_page;
 		}
 
@@ -990,6 +1008,9 @@ find_page:
 			page_cache_release(prev_page);
 		prev_page = page;
 
+		if (prefer_adaptive_readahead())
+			readahead_cache_hit(&ra, page);
+
 		if (!PageUptodate(page))
 			goto page_not_up_to_date;
 page_ok:
@@ -1134,6 +1155,8 @@ no_cached_page:
 
 out:
 	*_ra = ra;
+	if (prefer_adaptive_readahead())
+		_ra->prev_page = prev_index;
 
 	*ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
 	if (cached_page)
@@ -1406,6 +1429,7 @@ struct page *filemap_nopage(struct vm_ar
 	unsigned long size, pgoff;
 	int did_readaround = 0, majmin = VM_FAULT_MINOR;
 
+	ra->flags |= RA_FLAG_MMAP;
 	pgoff = ((address-area->vm_start) >> PAGE_CACHE_SHIFT) + area->vm_pgoff;
 
 retry_all:
@@ -1423,7 +1447,7 @@ retry_all:
 	 *
 	 * For sequential accesses, we use the generic readahead logic.
 	 */
-	if (VM_SequentialReadHint(area))
+	if (!prefer_adaptive_readahead() && VM_SequentialReadHint(area))
 		page_cache_readahead(mapping, ra, file, pgoff, 1);
 
 	/*
@@ -1431,11 +1455,24 @@ retry_all:
 	 */
 retry_find:
 	page = find_get_page(mapping, pgoff);
+	if (prefer_adaptive_readahead() && VM_SequentialReadHint(area)) {
+		if (!page) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, NULL,
+						pgoff, pgoff, pgoff + 1);
+			page = find_get_page(mapping, pgoff);
+		} else if (PageReadahead(page)) {
+			page_cache_readahead_adaptive(mapping, ra,
+						file, NULL, page,
+						pgoff, pgoff, pgoff + 1);
+		}
+	}
 	if (!page) {
 		unsigned long ra_pages;
 
 		if (VM_SequentialReadHint(area)) {
-			handle_ra_miss(mapping, ra, pgoff);
+			if (!prefer_adaptive_readahead())
+				handle_ra_miss(mapping, ra, pgoff);
 			goto no_cached_page;
 		}
 		ra->mmap_miss++;
@@ -1472,6 +1509,9 @@ retry_find:
 	if (!did_readaround)
 		ra->mmap_hit++;
 
+	if (prefer_adaptive_readahead())
+		readahead_cache_hit(ra, page);
+
 	/*
 	 * Ok, found a page in the page cache, now we need to check
 	 * that it's up-to-date.
@@ -1486,6 +1526,8 @@ success:
 	mark_page_accessed(page);
 	if (type)
 		*type = majmin;
+	if (prefer_adaptive_readahead())
+		ra->prev_page = page->index;
 	return page;
 
 outside_data_content:
diff -puN mm/readahead.c~readahead-call-scheme mm/readahead.c
--- a/mm/readahead.c~readahead-call-scheme
+++ a/mm/readahead.c
@@ -913,8 +913,12 @@ static unsigned long ra_invoke_interval(
  */
 static int ra_cache_hit_ok(struct file_ra_state *ra)
 {
+#ifdef CONFIG_READAHEAD_HIT_FEEDBACK
 	return ra->hit0 * readahead_hit_rate >=
 				(ra->lookahead_index - ra->la_index);
+#else
+	return 1;
+#endif
 }
 
@@ -948,6 +952,7 @@ static void ra_set_class(struct file_ra_
 
 	ra->flags = flags | old_ra_class | ra_class;
 
+#ifdef CONFIG_READAHEAD_HIT_FEEDBACK
 	/*
 	 * Add request-hit up to sequence-hit and reset the former.
 	 */
@@ -964,6 +969,7 @@ static void ra_set_class(struct file_ra_
 		ra->hit2 = ra->hit1;
 		ra->hit1 = 0;
 	}
+#endif
 }
 
@@ -1633,6 +1639,7 @@ static int
 try_readahead_on_seek(struct file_ra_state *ra, pgoff_t index,
 			unsigned long ra_size, unsigned long ra_max)
 {
+#ifdef CONFIG_READAHEAD_HIT_FEEDBACK
 	unsigned long hit0 = ra->hit0;
 	unsigned long hit1 = ra->hit1 + hit0;
 	unsigned long hit2 = ra->hit2;
@@ -1661,6 +1668,9 @@ try_readahead_on_seek(struct file_ra_sta
 	ra_set_size(ra, ra_size, 0);
 
 	return 1;
+#else
+	return 0;
+#endif
 }
 
@@ -1688,7 +1698,7 @@ thrashing_recovery_readahead(struct addr
 		ra_size = ra->ra_index - index;
 	else {
 		/* After thrashing, we know the exact thrashing-threshold. */
-		ra_size = ra->hit0;
+		ra_size = index - ra->ra_index;
 		update_ra_thrash_bytes(mapping->backing_dev_info, ra_size);
 
 		/* And we'd better be a bit conservative. */
@@ -1724,6 +1734,165 @@ static inline void get_readahead_bounds(
 				(128*1024) / PAGE_CACHE_SIZE), *ra_max / 2);
 }
 
+/**
+ * page_cache_readahead_adaptive - thrashing safe adaptive read-ahead
+ * @mapping, @ra, @filp: the same as page_cache_readahead()
+ * @prev_page: the page at @index-1, may be NULL to let the function find it
+ * @page: the page at @index, or NULL if non-present
+ * @begin_index, @index, @end_index: offsets into @mapping
+ *	[@begin_index, @end_index) is the read the caller is performing
+ *	@index indicates the page to be read now
+ *
+ * page_cache_readahead_adaptive() is the entry point of the adaptive
+ * read-ahead logic. It tries a set of methods in turn to determine the
+ * appropriate readahead action and submits the readahead I/O.
+ *
+ * The caller is expected to point ra->prev_page to the previously accessed
+ * page, and to call it on two conditions:
+ * 1. @page == NULL
+ *	A cache miss happened, some pages have to be read in
+ * 2. @page != NULL && PageReadahead(@page)
+ *	A look-ahead mark encountered, this is set by a previous read-ahead
+ *	invocation to instruct the caller to give the function a chance to
+ *	check up and do next read-ahead in advance.
+ */
+unsigned long
+page_cache_readahead_adaptive(struct address_space *mapping,
+			struct file_ra_state *ra, struct file *filp,
+			struct page *prev_page, struct page *page,
+			pgoff_t begin_index, pgoff_t index, pgoff_t end_index)
+{
+	unsigned long size;
+	unsigned long ra_min;
+	unsigned long ra_max;
+	int ret;
+
+	might_sleep();
+
+	if (page) {
+		if (!TestClearPageReadahead(page))
+			return 0;
+		if (bdi_read_congested(mapping->backing_dev_info)) {
+			ra_account(ra, RA_EVENT_IO_CONGESTION,
+							end_index - index);
+			return 0;
+		}
+	}
+
+	if (page)
+		ra_account(ra, RA_EVENT_LOOKAHEAD_HIT,
+				ra->readahead_index - ra->lookahead_index);
+	else if (index)
+		ra_account(ra, RA_EVENT_CACHE_MISS, end_index - begin_index);
+
+	size = end_index - index;
+	get_readahead_bounds(ra, &ra_min, &ra_max);
+
+	/* readahead disabled? */
+	if (unlikely(!ra_max || !readahead_ratio)) {
+		size = max_sane_readahead(size);
+		goto readit;
+	}
+
+	/*
+	 * Start of file.
+	 */
+	if (index == 0)
+		return initial_readahead(mapping, filp, ra, size);
+
+	/*
+	 * State based sequential read-ahead.
+	 */
+	if (!debug_option(disable_stateful_method) &&
+			index == ra->lookahead_index && ra_cache_hit_ok(ra))
+		return state_based_readahead(mapping, filp, ra, page,
+						index, size, ra_max);
+
+	/*
+	 * Recover from possible thrashing.
+	 */
+	if (!page && index == ra->prev_page + 1 && ra_has_index(ra, index))
+		return thrashing_recovery_readahead(mapping, filp, ra,
+							index, ra_max);
+
+	/*
+	 * Backward read-ahead.
+	 */
+	if (!page && begin_index == index &&
+				try_read_backward(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Context based sequential read-ahead.
+	 */
+	ret = try_context_based_readahead(mapping, ra, prev_page, page,
+						index, ra_min, ra_max);
+	if (ret > 0)
+		return ra_dispatch(ra, mapping, filp);
+	if (ret < 0)
+		return 0;
+
+	/* No action on look ahead time? */
+	if (page) {
+		ra_account(ra, RA_EVENT_LOOKAHEAD_NOACTION,
+						ra->readahead_index - index);
+		return 0;
+	}
+
+	/*
+	 * Random read that follows a sequential one.
+	 */
+	if (try_readahead_on_seek(ra, index, size, ra_max))
+		return ra_dispatch(ra, mapping, filp);
+
+	/*
+	 * Random read.
+	 */
+	if (size > ra_max)
+		size = ra_max;
+
+readit:
+	size = __do_page_cache_readahead(mapping, filp, index, size, 0);
+
+	ra_account(ra, RA_EVENT_RANDOM_READ, size);
+	dprintk("random_read(ino=%lu, pages=%lu, index=%lu-%lu-%lu) = %lu\n",
+			mapping->host->i_ino, mapping->nrpages,
+			begin_index, index, end_index, size);
+
+	return size;
+}
+
+#if defined(CONFIG_READAHEAD_HIT_FEEDBACK) || defined(CONFIG_DEBUG_READAHEAD)
+/**
+ * readahead_cache_hit - adaptive read-ahead feedback function
+ * @ra: file_ra_state which holds the readahead state
+ * @page: the page just accessed
+ *
+ * readahead_cache_hit() is the feedback route of the adaptive read-ahead
+ * logic. It must be called on every access on the read-ahead pages.
+ */
+void readahead_cache_hit(struct file_ra_state *ra, struct page *page)
+{
+	if (PageActive(page) || PageReferenced(page))
+		return;
+
+	if (!PageUptodate(page))
+		ra_account(ra, RA_EVENT_IO_BLOCK, 1);
+
+	if (!ra_has_index(ra, page->index))
+		return;
+
+#ifdef CONFIG_READAHEAD_HIT_FEEDBACK
+	ra->hit0++;
+#endif
+
+	if (page->index >= ra->ra_index)
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, 1);
+	else
+		ra_account(ra, RA_EVENT_READAHEAD_HIT, -1);
+}
+#endif
+
 /*
  * When closing a normal readonly file,
  *	- on cache hit: increase `backing_dev_info.ra_expect_bytes' slowly;
@@ -1732,13 +1901,12 @@ static inline void get_readahead_bounds(
  * The resulted `ra_expect_bytes' answers the question of:
  *	How many pages are expected to be read on start-of-file?
  */
-void fastcall readahead_close(struct file *file)
+void readahead_close(struct file *file)
 {
 	struct inode *inode = file->f_dentry->d_inode;
 	struct address_space *mapping = inode->i_mapping;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long pos = file->f_pos;	/* supposed to be small */
-	unsigned long pgrahit = file->f_ra.hit0;
 	unsigned long pgcached = mapping->nrpages;
 	unsigned long pgaccess;
 
@@ -1748,7 +1916,12 @@ void fastcall readahead_close(struct fil
 	if (pgcached > bdi->ra_pages0)		/* excessive reads */
 		return;
 
-	pgaccess = max(pgrahit, 1 + pos / PAGE_CACHE_SIZE);
+	pgaccess = 1 + pos / PAGE_CACHE_SIZE;
+#ifdef CONFIG_READAHEAD_HIT_FEEDBACK
+	if (pgaccess < file->f_ra.hit0)
+		pgaccess = file->f_ra.hit0;
+#endif
+
 	if (pgaccess >= pgcached) {
 		if (bdi->ra_expect_bytes < bdi->ra_pages0 * PAGE_CACHE_SIZE)
 			bdi->ra_expect_bytes += pgcached * PAGE_CACHE_SIZE / 8;
@@ -1769,12 +1942,11 @@ void fastcall readahead_close(struct fil
 		debug_inc(initial_ra_miss);
 		dprintk("initial_ra_miss on file %s "
-				"size %lluK cached %luK hit %luK "
+				"size %lluK cached %luK "
 				"pos %lu by %s(%d)\n",
 				file->f_dentry->d_name.name,
 				i_size_read(inode) / 1024,
 				pgcached << (PAGE_CACHE_SHIFT - 10),
-				pgrahit << (PAGE_CACHE_SHIFT - 10),
 				pos,
 				current->comm, current->pid);
 	}
_
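
Editor's illustration, not part of the patch: the changelog claims the new
call scheme lives in harmony with application managed read-aheads via
fadvise/madvise.  The userspace sketch below shows what such a managed
reader looks like; the file name "data.bin" and the CHUNK size are
arbitrary assumptions for the example, and error handling is minimal.

	/*
	 * Illustrative sketch only: a reader that manages its own
	 * read-ahead with posix_fadvise().  With the patch applied, the
	 * kernel's adaptive logic is only re-entered on a missing page or
	 * a PG_readahead mark, so these explicit WILLNEED requests and the
	 * in-kernel read-ahead can coexist.
	 */
	#define _XOPEN_SOURCE 600
	#include <fcntl.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define CHUNK	(256 * 1024)	/* prefetch window: one chunk ahead */

	int main(int argc, char **argv)
	{
		const char *name = argc > 1 ? argv[1] : "data.bin";
		char *buf = malloc(CHUNK);
		int fd = open(name, O_RDONLY);
		off_t off = 0;
		ssize_t n;

		if (fd < 0 || !buf)
			return 1;

		/* declare the overall pattern, then prefetch ahead of each read */
		posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
		do {
			posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);
			n = read(fd, buf, CHUNK);
			off += (n > 0) ? n : 0;
		} while (n > 0);

		close(fd);
		free(buf);
		return 0;
	}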