From clameter@sgi.com Sun Apr 22 23:21:29 2007
Message-Id: <20070423062107.843307112@sgi.com>
User-Agent: quilt/0.45-1
Date: Sun, 22 Apr 2007 23:21:07 -0700
From: clameter@sgi.com
To: linux-mm@kvack.org
Cc: Mel Gorman, William Lee Irwin III, Adam Litke, David Chinner, Jens Axboe, Avi Kivity, Dave Hansen, Badari Pulavarty, Maxim Levitsky
Subject: [RFC 00/16] Variable Order Page Cache Patchset V2

RFC V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.

This patchset modifies the Linux kernel so that higher order page cache pages become possible. The higher order page cache pages are compound pages and can be handled in the same way as regular pages.

Rationales:

1. We have problems supporting devices with a higher blocksize than page size. This is, for example, important to support CDs and DVDs that can only read and write 32k or 64k blocks. We currently have a shim layer in there to deal with this situation, which limits the speed of I/O. The developers are currently looking for ways to completely bypass the page cache because of this deficiency.

2. 32/64k blocksize is also used in flash devices. Same issues.

3. Future hard disks will support bigger block sizes.

4. Performance. If we look at IA64 vs. x86_64 then it seems that the faster interrupt handling on x86_64 compensates for the speed loss due to a smaller page size (4k vs 16k on IA64). Having higher page sizes on all platforms allows a significant reduction in I/O overhead and increases the size of I/O that can be performed by hardware in a single request, since the number of scatter/gather entries for one request is typically limited. This is going to become increasingly important to support the ever growing memory sizes, since we may have to handle excessively large numbers of 4k requests for data sizes that may become common soon. For example, to write a 1 terabyte file the kernel would have to handle 256 million 4k chunks.

5. Cross arch compatibility: It is currently not possible to mount a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.

The support here is currently only for buffered I/O and only for two filesystems: ramfs and ext2.

Note that the higher order pages are subject to reclaim. This works in general since we are always operating on a single page struct. Reclaim is fooled into thinking that it is touching page-sized objects (there are likely issues to be fixed there if we want to go down this road).

What is currently not supported:
- Mmapping higher order pages
- Direct I/O (there are some fundamental issues: direct I/O puts compound pages that have to be treated as single pages onto the pagevecs, while the variable order page cache puts higher order compound pages that have to be treated as a single large page onto pagevecs).

Breakage:
- Reclaim does not work for some reason. Compound pages on the active list get lost somehow.
- Disk data is corrupted when writing ext2fs data. There is likely still a lot of work to do in the block layer.
- There is a lot of incomplete work. There are numerous places that have not been fixed yet where the kernel can no longer assume that the page cache consists of PAGE_SIZE pages.

Future:
- Expect several more RFCs
- We hope for XFS support soon
- There are filesystem layer and lower layer issues here that I am not that familiar with. If you can, then please enhance my patches.
- Mmap support could be done in a way that makes the mmap page size independent from the page cache order.
There is no problem mapping a 4k section of a larger page cache page. This should leave mmap as is.
- Let's try to keep the scope as small as possible.

--

From clameter@sgi.com Sun Apr 22 23:21:29 2007
Message-Id: <20070423062129.171333818@sgi.com>
References: <20070423062107.843307112@sgi.com>
User-Agent: quilt/0.45-1
Date: Sun, 22 Apr 2007 23:21:08 -0700
From: clameter@sgi.com
To: linux-mm@kvack.org
Cc: Mel Gorman, William Lee Irwin III, Adam Litke, David Chinner, Jens Axboe, Avi Kivity, Dave Hansen, Badari Pulavarty, Maxim Levitsky
Subject: [RFC 01/16] Free up page->private for compound pages
Content-Disposition: inline; filename=var_pc_compound

If we add a new flag so that we can distinguish between the first page and the tail pages then we can avoid using page->private in the first page. page->private == page for the first page, so there is no real information in there.

Freeing up page->private makes the use of compound pages more transparent. They can then be used more like real pages. Right now we have to be careful, e.g. if we are going beyond PAGE_SIZE allocations in the slab on i386, because we can then no longer use the private field. This is one of the issues that keeps us from supporting debugging for page size slabs in SLAB.

Also, if page->private is available then a compound page may be equipped with buffer heads. This may clear the way for filesystems to support larger blocks than page size.

Note that this patch is different from the one in mm. The one in mm uses PG_reclaim as a PG_tail. We cannot use PG_reclaim since pages can be reclaimed now. So use a separate page flag.

We allow compound page heads on pagevecs. That will break Direct I/O because direct I/O needs pagevecs to handle the component pages but not the whole. Ideas for a solution welcome. Maybe we should modify the Direct I/O layer to not operate on the individual pages but on the compound page as a whole.
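As an illustration of what freeing up page->private enables (a minimal sketch, not part of the patch; attach_example_private() is a hypothetical caller):

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * With PG_tail marking tail pages, and only the tail pages carrying the
 * head pointer in ->private, the head page's ->private is free again,
 * e.g. for a buffer_head chain.
 */
static void attach_example_private(struct page *page, unsigned long data)
{
	struct page *head = compound_head(page);	/* no-op for order-0 pages */

	set_page_private(head, data);
	SetPagePrivate(head);
}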
Signed-off-by: Christoph Lameter --- arch/ia64/mm/init.c | 2 +- include/linux/mm.h | 32 ++++++++++++++++++++++++++------ include/linux/page-flags.h | 6 ++++++ mm/internal.h | 2 +- mm/page_alloc.c | 35 +++++++++++++++++++++++++---------- mm/slab.c | 6 ++---- mm/swap.c | 20 ++++++++++++++++++-- 7 files changed, 79 insertions(+), 24 deletions(-) Index: linux-2.6.21-rc7/include/linux/mm.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-21 20:58:32.000000000 -0700 @@ -263,21 +263,24 @@ static inline int put_page_testzero(stru */ static inline int get_page_unless_zero(struct page *page) { - VM_BUG_ON(PageCompound(page)); return atomic_inc_not_zero(&page->_count); } +static inline struct page *compound_head(struct page *page) +{ + if (unlikely(PageTail(page))) + return (struct page *)page->private; + return page; +} + static inline int page_count(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); - return atomic_read(&page->_count); + return atomic_read(&compound_head(page)->_count); } static inline void get_page(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); + page = compound_head(page); VM_BUG_ON(atomic_read(&page->_count) == 0); atomic_inc(&page->_count); } @@ -314,6 +317,23 @@ static inline compound_page_dtor *get_co return (compound_page_dtor *)page[1].lru.next; } +static inline int compound_order(struct page *page) +{ + if (!PageCompound(page) || PageTail(page)) + return 0; + return (unsigned long)page[1].lru.prev; +} + +static inline void set_compound_order(struct page *page, unsigned long order) +{ + page[1].lru.prev = (void *)order; +} + +static inline int base_pages(struct page *page) +{ + return 1 << compound_order(page); +} + /* * Multiple processes may "see" the same page. E.g. 
for untouched * mappings of /dev/null, all processes see the same page full of Index: linux-2.6.21-rc7/include/linux/page-flags.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/page-flags.h 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/page-flags.h 2007-04-21 20:52:15.000000000 -0700 @@ -91,6 +91,8 @@ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_tail 20 /* Page is tail of a compound page */ + /* PG_owner_priv_1 users should have descriptive aliases */ #define PG_checked PG_owner_priv_1 /* Used by some filesystems */ @@ -241,6 +243,10 @@ static inline void SetPageUptodate(struc #define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags) #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags) +#define PageTail(page) test_bit(PG_tail, &(page)->flags) +#define __SetPageTail(page) __set_bit(PG_tail, &(page)->flags) +#define __ClearPageTail(page) __clear_bit(PG_tail, &(page)->flags) + #ifdef CONFIG_SWAP #define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags) #define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags) Index: linux-2.6.21-rc7/mm/internal.h =================================================================== --- linux-2.6.21-rc7.orig/mm/internal.h 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/internal.h 2007-04-21 20:52:15.000000000 -0700 @@ -24,7 +24,7 @@ static inline void set_page_count(struct */ static inline void set_page_refcounted(struct page *page) { - VM_BUG_ON(PageCompound(page) && page_private(page) != (unsigned long)page); + VM_BUG_ON(PageTail(page)); VM_BUG_ON(atomic_read(&page->_count)); set_page_count(page, 1); } Index: linux-2.6.21-rc7/mm/page_alloc.c =================================================================== --- linux-2.6.21-rc7.orig/mm/page_alloc.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/page_alloc.c 2007-04-21 20:58:32.000000000 -0700 @@ -227,7 +227,7 @@ static void bad_page(struct page *page) static void free_compound_page(struct page *page) { - __free_pages_ok(page, (unsigned long)page[1].lru.prev); + __free_pages_ok(page, compound_order(page)); } static void prep_compound_page(struct page *page, unsigned long order) @@ -236,12 +236,14 @@ static void prep_compound_page(struct pa int nr_pages = 1 << order; set_compound_page_dtor(page, free_compound_page); - page[1].lru.prev = (void *)order; - for (i = 0; i < nr_pages; i++) { + set_compound_order(page, order); + __SetPageCompound(page); + for (i = 1; i < nr_pages; i++) { struct page *p = page + i; + __SetPageTail(p); __SetPageCompound(p); - set_page_private(p, (unsigned long)page); + p->private = (unsigned long)page; } } @@ -250,15 +252,19 @@ static void destroy_compound_page(struct int i; int nr_pages = 1 << order; - if (unlikely((unsigned long)page[1].lru.prev != order)) + if (unlikely(compound_order(page) != order)) bad_page(page); - for (i = 0; i < nr_pages; i++) { + if (unlikely(!PageCompound(page))) + bad_page(page); + __ClearPageCompound(page); + for (i = 1; i < nr_pages; i++) { struct page *p = page + i; - if (unlikely(!PageCompound(p) | - (page_private(p) != (unsigned long)page))) + if (unlikely(!PageCompound(p) | !PageTail(p) | + ((struct page *)p->private != page))) bad_page(page); + __ClearPageTail(p); __ClearPageCompound(p); } } @@ -1438,8 +1444,17 @@ void __pagevec_free(struct pagevec *pvec { int i = pagevec_count(pvec); - while (--i >= 0) - 
free_hot_cold_page(pvec->pages[i], pvec->cold); + while (--i >= 0) { + struct page *page = pvec->pages[i]; + + if (PageCompound(page)) { + compound_page_dtor *dtor; + + dtor = get_compound_page_dtor(page); + (*dtor)(page); + } else + free_hot_cold_page(page, pvec->cold); + } } fastcall void __free_pages(struct page *page, unsigned int order) Index: linux-2.6.21-rc7/mm/slab.c =================================================================== --- linux-2.6.21-rc7.orig/mm/slab.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/slab.c 2007-04-21 20:52:15.000000000 -0700 @@ -592,8 +592,7 @@ static inline void page_set_cache(struct static inline struct kmem_cache *page_get_cache(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); + page = compound_head(page); BUG_ON(!PageSlab(page)); return (struct kmem_cache *)page->lru.next; } @@ -605,8 +604,7 @@ static inline void page_set_slab(struct static inline struct slab *page_get_slab(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); + page = compound_head(page); BUG_ON(!PageSlab(page)); return (struct slab *)page->lru.prev; } Index: linux-2.6.21-rc7/mm/swap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/swap.c 2007-04-21 21:02:59.000000000 -0700 @@ -55,7 +55,7 @@ static void fastcall __page_cache_releas static void put_compound_page(struct page *page) { - page = (struct page *)page_private(page); + page = compound_head(page); if (put_page_testzero(page)) { compound_page_dtor *dtor; @@ -263,7 +263,23 @@ void release_pages(struct page **pages, for (i = 0; i < nr; i++) { struct page *page = pages[i]; - if (unlikely(PageCompound(page))) { + /* + * There is a conflict here between handling a compound + * page as a single big page or a set of smaller pages. + * + * Direct I/O wants us to treat them separately. Variable + * Page Size support means we need to treat then as + * a single unit. + * + * So we compromise here. Tail pages are handled as a + * single page (for direct I/O) but head pages are + * handled as full pages (for Variable Page Size + * Support). + * + * FIXME: That breaks direct I/O for the head page. 
+ */ + if (unlikely(PageTail(page))) { + /* Must treat as a single page */ if (zone) { spin_unlock_irq(&zone->lru_lock); zone = NULL; Index: linux-2.6.21-rc7/arch/ia64/mm/init.c =================================================================== --- linux-2.6.21-rc7.orig/arch/ia64/mm/init.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/arch/ia64/mm/init.c 2007-04-21 20:52:15.000000000 -0700 @@ -121,7 +121,7 @@ lazy_mmu_prot_update (pte_t pte) return; /* i-cache is already coherent with d-cache */ if (PageCompound(page)) { - order = (unsigned long) (page[1].lru.prev); + order = compound_order(page); flush_icache_range(addr, addr + (1UL << order << PAGE_SHIFT)); } else -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.317055444@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:09 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 02/16] vmstat.c: Support accounting for compound pages Content-Disposition: inline; filename=var_pc_vmstat Compound pages must increment the counters in terms of base pages. If we detect a compound page then add the number of base pages that a compound page has to the counter. This will avoid numerous changes in the VM to fix up page accounting as we add more support for compound pages. Also fix up the accounting for active / inactive pages. Signed-off-by: Christoph Lameter --- include/linux/mm_inline.h | 12 ++++++------ mm/vmstat.c | 8 +++----- 2 files changed, 9 insertions(+), 11 deletions(-) Index: linux-2.6.21-rc7/mm/vmstat.c =================================================================== --- linux-2.6.21-rc7.orig/mm/vmstat.c 2007-04-21 23:35:49.000000000 -0700 +++ linux-2.6.21-rc7/mm/vmstat.c 2007-04-21 23:35:59.000000000 -0700 @@ -223,7 +223,7 @@ void __inc_zone_state(struct zone *zone, void __inc_zone_page_state(struct page *page, enum zone_stat_item item) { - __inc_zone_state(page_zone(page), item); + __mod_zone_page_state(page_zone(page), item, base_pages(page)); } EXPORT_SYMBOL(__inc_zone_page_state); @@ -244,7 +244,7 @@ void __dec_zone_state(struct zone *zone, void __dec_zone_page_state(struct page *page, enum zone_stat_item item) { - __dec_zone_state(page_zone(page), item); + __mod_zone_page_state(page_zone(page), item, -base_pages(page)); } EXPORT_SYMBOL(__dec_zone_page_state); @@ -260,11 +260,9 @@ void inc_zone_state(struct zone *zone, e void inc_zone_page_state(struct page *page, enum zone_stat_item item) { unsigned long flags; - struct zone *zone; - zone = page_zone(page); local_irq_save(flags); - __inc_zone_state(zone, item); + __inc_zone_page_state(page, item); local_irq_restore(flags); } EXPORT_SYMBOL(inc_zone_page_state); Index: linux-2.6.21-rc7/include/linux/mm_inline.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/mm_inline.h 2007-04-22 00:20:15.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/mm_inline.h 2007-04-22 00:21:12.000000000 -0700 @@ -2,28 +2,28 @@ static inline void add_page_to_active_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->active_list); - __inc_zone_state(zone, NR_ACTIVE); + __inc_zone_page_state(page, NR_ACTIVE); } static inline void add_page_to_inactive_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->inactive_list); - __inc_zone_state(zone, NR_INACTIVE); 
+ __inc_zone_page_state(page, NR_INACTIVE); } static inline void del_page_from_active_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_ACTIVE); + __dec_zone_page_state(page, NR_ACTIVE); } static inline void del_page_from_inactive_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_INACTIVE); + __dec_zone_page_state(page, NR_INACTIVE); } static inline void @@ -32,9 +32,9 @@ del_page_from_lru(struct zone *zone, str list_del(&page->lru); if (PageActive(page)) { __ClearPageActive(page); - __dec_zone_state(zone, NR_ACTIVE); + __dec_zone_page_state(page, NR_ACTIVE); } else { - __dec_zone_state(zone, NR_INACTIVE); + __dec_zone_page_state(page, NR_INACTIVE); } } -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.504330506@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:10 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 03/16] Variable Order Page Cache: Add order field in mapping Content-Disposition: inline; filename=var_pc_order_field Add an "order" field in the address space structure that specifies the page order of pages in an address space. Set the field to zero by default so that filesystems not prepared to deal with higher pages can be left as is. Putting page order in the address space structure means that the order of the pages in the page cache can be varied per file that a filesystem creates. This means we can keep small 4k pages for small files. Larger files can be configured by the file system to use a higher order. 
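As a hypothetical illustration (example_pick_order() is not part of the patch), a filesystem could then pick an order per inode; a later patch additionally makes the mapping's gfp mask include __GFP_COMP for nonzero orders:

#include <linux/fs.h>

/*
 * Choose a larger page cache order for inodes expected to hold big files,
 * keep the default order 0 set in alloc_inode() otherwise.
 */
static void example_pick_order(struct inode *inode, loff_t expected_size)
{
	if (expected_size >= 16 * 1024 * 1024)
		inode->i_mapping->order = 2;	/* 16k pages on a 4k base page */
	/* else: keep order 0, the ordinary 4k page cache */
}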
Signed-off-by: Christoph Lameter --- fs/inode.c | 1 + include/linux/fs.h | 1 + 2 files changed, 2 insertions(+) Index: linux-2.6.21-rc7/fs/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-18 21:21:56.000000000 -0700 +++ linux-2.6.21-rc7/fs/inode.c 2007-04-18 21:26:31.000000000 -0700 @@ -145,6 +145,7 @@ static struct inode *alloc_inode(struct mapping->a_ops = &empty_aops; mapping->host = inode; mapping->flags = 0; + mapping->order = 0; mapping_set_gfp_mask(mapping, GFP_HIGHUSER); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; Index: linux-2.6.21-rc7/include/linux/fs.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-18 21:21:56.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-18 21:26:31.000000000 -0700 @@ -435,6 +435,7 @@ struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ rwlock_t tree_lock; /* and rwlock protecting it */ + unsigned int order; /* Page order in this space */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.645837417@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:11 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 04/16] Variable Order Page Cache: Add basic allocation functions Content-Disposition: inline; filename=var_pc_basic_alloc Extend __page_cache_alloc to take an order parameter and modify call sites. Modify mapping_set_gfp_mask to set __GFP_COMP if the mapping requires higher order allocations. put_page() is already capable of handling compound pages. So there are no changes needed to release higher order page cache pages. However, there is a call to "alloc_page" in mm/filemap.c that does not perform an allocation conformant with the parameters of the mapping. Fix that by introducing a new page cache allocation function (page_cache_alloc_mask) that is capable of taking a gfp_t flag. Signed-off-by: Christoph Lameter --- include/linux/pagemap.h | 34 ++++++++++++++++++++++++++------ mm/filemap.c | 12 +++++++----- 2 files changed, 35 insertions(+), 11 deletions(-) Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 21:47:47.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 21:52:37.000000000 -0700 @@ -3,6 +3,9 @@ /* * Copyright 1995 Linus Torvalds + * + * (C) 2007 sgi, Christoph Lameter + * Add variable order page cache support.
*/ #include #include @@ -32,6 +35,18 @@ static inline void mapping_set_gfp_mask( { m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) | (__force unsigned long)mask; + if (m->order) + m->flags |= __GFP_COMP; +} + +static inline void set_mapping_order(struct address_space *m, int order) +{ + m->order = order; + + if (order) + m->flags |= __GFP_COMP; + else + m->flags &= ~__GFP_COMP; } /* @@ -40,7 +55,7 @@ static inline void mapping_set_gfp_mask( * throughput (it can then be mapped into user * space in smaller chunks for same flexibility). * - * Or rather, it _will_ be done in larger chunks. + * This is the base page size */ #define PAGE_CACHE_SHIFT PAGE_SHIFT #define PAGE_CACHE_SIZE PAGE_SIZE @@ -52,22 +67,29 @@ static inline void mapping_set_gfp_mask( void release_pages(struct page **pages, int nr, int cold); #ifdef CONFIG_NUMA -extern struct page *__page_cache_alloc(gfp_t gfp); +extern struct page *__page_cache_alloc(gfp_t gfp, int order); #else -static inline struct page *__page_cache_alloc(gfp_t gfp) +static inline struct page *__page_cache_alloc(gfp_t gfp, int order) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp, order); } #endif +static inline struct page *page_cache_alloc_mask(struct address_space *x, + gfp_t flags) +{ + return __page_cache_alloc(mapping_gfp_mask(x) | flags, + x->order); +} + static inline struct page *page_cache_alloc(struct address_space *x) { - return __page_cache_alloc(mapping_gfp_mask(x)); + return page_cache_alloc_mask(x, 0); } static inline struct page *page_cache_alloc_cold(struct address_space *x) { - return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); + return page_cache_alloc_mask(x, __GFP_COLD); } typedef int filler_t(void *, struct page *); Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:47:47.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700 @@ -467,13 +467,13 @@ int add_to_page_cache_lru(struct page *p } #ifdef CONFIG_NUMA -struct page *__page_cache_alloc(gfp_t gfp) +struct page *__page_cache_alloc(gfp_t gfp, int order) { if (cpuset_do_page_mem_spread()) { int n = cpuset_mem_spread_node(); - return alloc_pages_node(n, gfp, 0); + return alloc_pages_node(n, gfp, order); } - return alloc_pages(gfp, 0); + return alloc_pages(gfp, order); } EXPORT_SYMBOL(__page_cache_alloc); #endif @@ -670,7 +670,8 @@ repeat: page = find_lock_page(mapping, index); if (!page) { if (!cached_page) { - cached_page = alloc_page(gfp_mask); + cached_page = + page_cache_alloc_mask(mapping, gfp_mask); if (!cached_page) return NULL; } @@ -803,7 +804,8 @@ grab_cache_page_nowait(struct address_sp page_cache_release(page); return NULL; } - page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS); + page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS, + mapping->order); if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) { page_cache_release(page); page = NULL; -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.804903028@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:12 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes Content-Disposition: inline; 
filename=var_pc_size_functions We use the macros PAGE_CACHE_SIZE PAGE_CACHE_SHIFT PAGE_CACHE_MASK and PAGE_CACHE_ALIGN in various places in the kernel. These now refer to the base page size, but we do not have a means of calculating these values for higher order pages. Provide these functions. An address_space pointer must be passed to them. Also add a set of extended functions that will be used to consolidate the hand-crafted shifts and adds in use right now for the page cache.

New function				Related base page constant
---------------------------------------------------
page_cache_shift(a)			PAGE_CACHE_SHIFT
page_cache_size(a)			PAGE_CACHE_SIZE
page_cache_mask(a)			PAGE_CACHE_MASK
page_cache_index(a, pos)		Calculate page number from position
page_cache_next(a, pos)			Page number of next page
page_cache_offset(a, pos)		Calculate offset into a page
page_cache_pos(a, index, offset)	Form position based on page number and an offset.

Signed-off-by: Christoph Lameter --- include/linux/pagemap.h | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:30:50.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:44:12.000000000 -0700 @@ -62,6 +62,48 @@ static inline void set_mapping_order(str #define PAGE_CACHE_MASK PAGE_MASK #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) +static inline int page_cache_shift(struct address_space *a) +{ + return a->order + PAGE_SHIFT; +} + +static inline unsigned int page_cache_size(struct address_space *a) +{ + return PAGE_SIZE << a->order; +} + +static inline loff_t page_cache_mask(struct address_space *a) +{ + return (loff_t)PAGE_MASK << a->order; +} + +static inline unsigned int page_cache_offset(struct address_space *a, + loff_t pos) +{ + return pos & ~(PAGE_MASK << a->order); +} + +static inline pgoff_t page_cache_index(struct address_space *a, + loff_t pos) +{ + return pos >> page_cache_shift(a); +} + +/* + * Index of the page starting on or after the given position. + */ +static inline pgoff_t page_cache_next(struct address_space *a, + loff_t pos) +{ + return page_cache_index(a, pos + page_cache_size(a) - 1); +} + +static inline loff_t page_cache_pos(struct address_space *a, + pgoff_t index, unsigned long offset) +{ + return ((loff_t)index << page_cache_shift(a)) + offset; +} + #define page_cache_get(page) get_page(page) #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062129.967621050@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:13 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order Content-Disposition: inline; filename=var_pc_guards Before we start changing the page order we had better get some debugging in there that trips us up whenever a wrong order page shows up in a mapping. This will be helpful for converting new filesystems to utilize higher orders.
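To make the helpers from the previous patch concrete, and to show the invariant the checks below assert, here is a small illustrative sketch (example_check_pos() is hypothetical, not part of the patch); it assumes a 4k base page and mapping->order == 2, i.e. 16k page cache pages:

#include <linux/pagemap.h>

static void example_check_pos(struct address_space *mapping, struct page *page)
{
	loff_t pos = 100000;

	/* Every page in a mapping must match the mapping's order. */
	VM_BUG_ON(mapping->order != compound_order(page));

	/* With PAGE_SHIFT == 12 and mapping->order == 2: */
	BUG_ON(page_cache_shift(mapping) != 14);		/* 12 + 2 */
	BUG_ON(page_cache_size(mapping) != 16384);		/* 4096 << 2 */
	BUG_ON(page_cache_index(mapping, pos) != 6);		/* 100000 >> 14 */
	BUG_ON(page_cache_offset(mapping, pos) != 1696);	/* 100000 & 16383 */
	BUG_ON(page_cache_pos(mapping, 6, 1696) != pos);	/* (6 << 14) + 1696 */
}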
Signed-off-by: Christoph Lameter --- mm/filemap.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700 @@ -127,6 +127,7 @@ void remove_from_page_cache(struct page struct address_space *mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping->order != compound_order(page)); write_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); @@ -268,6 +269,7 @@ int wait_on_page_writeback_range(struct if (page->index > end) continue; + VM_BUG_ON(mapping->order != compound_order(page)); wait_on_page_writeback(page); if (PageError(page)) ret = -EIO; @@ -439,6 +441,7 @@ int add_to_page_cache(struct page *page, { int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); + VM_BUG_ON(mapping->order != compound_order(page)); if (error == 0) { write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); @@ -598,8 +601,10 @@ struct page * find_get_page(struct addre read_lock_irq(&mapping->tree_lock); page = radix_tree_lookup(&mapping->page_tree, offset); - if (page) + if (page) { + VM_BUG_ON(mapping->order != compound_order(page)); page_cache_get(page); + } read_unlock_irq(&mapping->tree_lock); return page; } @@ -624,6 +629,7 @@ struct page *find_lock_page(struct addre repeat: page = radix_tree_lookup(&mapping->page_tree, offset); if (page) { + VM_BUG_ON(mapping->order != compound_order(page)); page_cache_get(page); if (TestSetPageLocked(page)) { read_unlock_irq(&mapping->tree_lock); @@ -683,6 +689,7 @@ repeat: } else if (err == -EEXIST) goto repeat; } + VM_BUG_ON(mapping->order != compound_order(page)); if (cached_page) page_cache_release(cached_page); return page; @@ -714,8 +721,10 @@ unsigned find_get_pages(struct address_s read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages, start, nr_pages); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping->order != compound_order(pages[i])); page_cache_get(pages[i]); + } read_unlock_irq(&mapping->tree_lock); return ret; } @@ -745,6 +754,7 @@ unsigned find_get_pages_contig(struct ad if (pages[i]->mapping == NULL || pages[i]->index != index) break; + VM_BUG_ON(mapping->order != compound_order(pages[i])); page_cache_get(pages[i]); index++; } @@ -772,8 +782,10 @@ unsigned find_get_pages_tag(struct addre read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages, *index, nr_pages, tag); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping->order != compound_order(pages[i])); page_cache_get(pages[i]); + } if (ret) *index = pages[ret - 1]->index + 1; read_unlock_irq(&mapping->tree_lock); @@ -2454,6 +2466,7 @@ int try_to_release_page(struct page *pag struct address_space * const mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping->order != compound_order(page)); if (PageWriteback(page)) return 0; -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.131716294@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:14 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim 
Levitsky Subject: [RFC 07/16] Variable Order Page Cache: Add clearing and flushing function Content-Disposition: inline; filename=var_pc_flush_zero Add a flushing and clearing function for higher order pages. These are provisional and will likely have to be optimized. Signed-off-by: Christoph Lameter --- include/linux/pagemap.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:37:24.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 17:37:39.000000000 -0700 @@ -250,6 +250,31 @@ static inline void wait_on_page_writebac extern void end_page_writeback(struct page *page); +/* Support for clearing higher order pages */ +static inline void clear_mapping_page(struct page *page) +{ + int nr_pages = base_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + clear_highpage(page + i); +} + +/* + * Support for flushing higher order pages. + * + * A bit stupid: On many platforms flushing the first page + * will flush any TLB starting there + */ +static inline void flush_mapping_page(struct page *page) +{ + int nr_pages = base_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + flush_dcache_page(page + i); +} + /* * Fault a userspace page into pagetables. Return non-zero on a fault. * -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.292552667@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:15 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 08/16] Variable Order Page Cache: Fixup fallback functions Content-Disposition: inline; filename=var_pc_libfs Fixup the fallback function in fs/libfs.c to be able to handle higher order page cache pages. FIXME: There is a use of kmap here that we leave unchanged (none of my testing platforms use highmem). There needs to be some way to clear higher order partial pages if a platform supports HIGHMEM. 
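Purely as an illustration of the kind of helper that FIXME asks for (zero_mapping_page_range() is hypothetical and not part of this patchset), a HIGHMEM-safe partial clear could map one base page at a time:

#include <linux/kernel.h>
#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Zero the byte range [from, to) of a (possibly higher order) page,
 * mapping each base page separately so highmem pages work too.
 */
static void zero_mapping_page_range(struct page *page, unsigned from, unsigned to)
{
	while (from < to) {
		struct page *p = page + (from >> PAGE_SHIFT);
		unsigned offset = from & ~PAGE_MASK;
		unsigned len = min_t(unsigned, to - from, PAGE_SIZE - offset);
		void *kaddr = kmap_atomic(p, KM_USER0);

		memset(kaddr + offset, 0, len);
		flush_dcache_page(p);
		kunmap_atomic(kaddr, KM_USER0);
		from += len;
	}
}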
Signed-off-by: Christoph Lameter --- fs/libfs.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) Index: linux-2.6.21-rc7/fs/libfs.c =================================================================== --- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-22 17:28:04.000000000 -0700 +++ linux-2.6.21-rc7/fs/libfs.c 2007-04-22 17:38:58.000000000 -0700 @@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir, int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); - flush_dcache_page(page); + clear_mapping_page(page); + flush_mapping_page(page); SetPageUptodate(page); unlock_page(page); return 0; @@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi unsigned from, unsigned to) { if (!PageUptodate(page)) { - if (to - from != PAGE_CACHE_SIZE) { + if (to - from != page_cache_size(file->f_mapping)) { + /* + * Mapping to higher order pages need to be supported + * if higher order pages can be in highmem + */ void *kaddr = kmap_atomic(page, KM_USER0); memset(kaddr, 0, from); - memset(kaddr + to, 0, PAGE_CACHE_SIZE - to); - flush_dcache_page(page); + memset(kaddr + to, 0, page_cache_size(file->f_mapping) - to); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); } } @@ -345,8 +349,9 @@ int simple_prepare_write(struct file *fi int simple_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); if (!PageUptodate(page)) SetPageUptodate(page); -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.458545974@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:16 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 09/16] Variable Order Page Cache: Fix up mm/filemap.c Content-Disposition: inline; filename=var_pc_filemap Fix up the functions in mm/filemap.c to use the variable page cache. Like many of the following patches, this one is also pretty straightforward. 1. Convert the bit operations into calls of page_cache_xxx(mapping, ....) 2. Use the mapping flush function Doing this also cleans up the handling of page cache pages.
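The conversion pattern is mechanical; as a sketch (split_pos() is a hypothetical helper, not part of the patch), a file position is now split like this:

#include <linux/pagemap.h>

static void split_pos(struct address_space *mapping, loff_t pos,
		      pgoff_t *index, unsigned long *offset)
{
	/* before: index  = pos >> PAGE_CACHE_SHIFT;
	 *         offset = pos & ~PAGE_CACHE_MASK;	*/
	*index = page_cache_index(mapping, pos);
	*offset = page_cache_offset(mapping, pos);
}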
Signed-off-by: Christoph Lameter --- mm/filemap.c | 62 +++++++++++++++++++++++++++++------------------------------ 1 file changed, 31 insertions(+), 31 deletions(-) Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700 @@ -304,8 +304,8 @@ int wait_on_page_writeback_range(struct int sync_page_range(struct inode *inode, struct address_space *mapping, loff_t pos, loff_t count) { - pgoff_t start = pos >> PAGE_CACHE_SHIFT; - pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT; + pgoff_t start = page_cache_index(mapping, pos); + pgoff_t end = page_cache_index(mapping, pos + count - 1); int ret; if (!mapping_cap_writeback_dirty(mapping) || !count) @@ -336,8 +336,8 @@ EXPORT_SYMBOL(sync_page_range); int sync_page_range_nolock(struct inode *inode, struct address_space *mapping, loff_t pos, loff_t count) { - pgoff_t start = pos >> PAGE_CACHE_SHIFT; - pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT; + pgoff_t start = page_cache_index(mapping, pos); + pgoff_t end = page_cache_index(mapping, pos + count - 1); int ret; if (!mapping_cap_writeback_dirty(mapping) || !count) @@ -366,7 +366,7 @@ int filemap_fdatawait(struct address_spa return 0; return wait_on_page_writeback_range(mapping, 0, - (i_size - 1) >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, i_size - 1)); } EXPORT_SYMBOL(filemap_fdatawait); @@ -414,8 +414,8 @@ int filemap_write_and_wait_range(struct /* See comment of filemap_write_and_wait() */ if (err != -EIO) { int err2 = wait_on_page_writeback_range(mapping, - lstart >> PAGE_CACHE_SHIFT, - lend >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, lstart), + page_cache_index(mapping, lend)); if (!err) err = err2; } @@ -888,27 +888,27 @@ void do_generic_mapping_read(struct addr struct file_ra_state ra = *_ra; cached_page = NULL; - index = *ppos >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, *ppos); next_index = index; prev_index = ra.prev_page; - last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; - offset = *ppos & ~PAGE_CACHE_MASK; + last_index = page_cache_next(mapping, *ppos + desc->count); + offset = page_cache_offset(mapping, *ppos); isize = i_size_read(inode); if (!isize) goto out; - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, isize - 1); for (;;) { struct page *page; unsigned long nr, ret; /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; + nr = page_cache_size(mapping); if (index >= end_index) { if (index > end_index) goto out; - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + nr = page_cache_offset(mapping, isize - 1) + 1; if (nr <= offset) { goto out; } @@ -935,7 +935,7 @@ page_ok: * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_mapping_page(page); /* * When (part of) the same page is read multiple times @@ -957,8 +957,8 @@ page_ok: */ ret = actor(desc, page, offset, nr); offset += ret; - index += offset >> PAGE_CACHE_SHIFT; - offset &= ~PAGE_CACHE_MASK; + index += page_cache_index(mapping, offset); + offset = page_cache_offset(mapping, offset); page_cache_release(page); if (ret == nr && desc->count) @@ -1022,16 +1022,16 @@ readpage: * another truncate extends the file - this is desired though). 
*/ isize = i_size_read(inode); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, isize - 1); if (unlikely(!isize || index > end_index)) { page_cache_release(page); goto out; } /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; + nr = page_cache_size(mapping); if (index == end_index) { - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + nr = page_cache_offset(mapping, isize - 1) + 1; if (nr <= offset) { page_cache_release(page); goto out; @@ -1074,7 +1074,7 @@ no_cached_page: out: *_ra = ra; - *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; + *ppos = page_cache_pos(mapping, index, offset); if (cached_page) page_cache_release(cached_page); if (filp) @@ -1270,8 +1270,8 @@ asmlinkage ssize_t sys_readahead(int fd, if (file) { if (file->f_mode & FMODE_READ) { struct address_space *mapping = file->f_mapping; - unsigned long start = offset >> PAGE_CACHE_SHIFT; - unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT; + unsigned long start = page_cache_index(mapping, offset); + unsigned long end = page_cache_index(mapping, offset + count - 1); unsigned long len = end - start + 1; ret = do_readahead(mapping, file, start, len); } @@ -2086,9 +2086,9 @@ generic_file_buffered_write(struct kiocb unsigned long offset; size_t copied; - offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ - index = pos >> PAGE_CACHE_SHIFT; - bytes = PAGE_CACHE_SIZE - offset; + offset = page_cache_offset(mapping, pos); + index = page_cache_index(mapping, pos); + bytes = page_cache_size(mapping) - offset; /* Limit the size of the copy to the caller's write size */ bytes = min(bytes, count); @@ -2149,7 +2149,7 @@ generic_file_buffered_write(struct kiocb else copied = filemap_copy_from_user_iovec(page, offset, cur_iov, iov_base, bytes); - flush_dcache_page(page); + flush_mapping_page(page); status = a_ops->commit_write(file, page, offset, offset+bytes); if (status == AOP_TRUNCATED_PAGE) { page_cache_release(page); @@ -2315,8 +2315,8 @@ __generic_file_aio_write_nolock(struct k if (err == 0) { written = written_buffered; invalidate_mapping_pages(mapping, - pos >> PAGE_CACHE_SHIFT, - endbyte >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, pos), + page_cache_index(mapping, endbyte)); } else { /* * We don't know how much we wrote, so just return @@ -2403,7 +2403,7 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE) { write_len = iov_length(iov, nr_segs); - end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT; + end = page_cache_index(mapping, offset + write_len - 1); if (mapping_mapped(mapping)) unmap_mapping_range(mapping, offset, write_len, 0); } @@ -2420,7 +2420,7 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE && mapping->nrpages) { retval = invalidate_inode_pages2_range(mapping, - offset >> PAGE_CACHE_SHIFT, end); + page_cache_index(mapping, offset), end); if (retval) goto out; } @@ -2438,7 +2438,7 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE && mapping->nrpages) { int err = invalidate_inode_pages2_range(mapping, - offset >> PAGE_CACHE_SHIFT, end); + page_cache_index(mapping, offset), end); if (err && retval >= 0) retval = err; } -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.623658661@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:17 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari 
Pulavarty , Maxim Levitsky Subject: [RFC 10/16] Variable Order Page Cache: Readahead fixups Content-Disposition: inline; filename=var_pc_readahead Readahead is now dependent on the page size. For larger page sizes we want less readahead. Add a parameter to max_sane_readahead specifying the page order and update the code in mm/readahead.c to be aware of variant page sizes. Mark the 2M readahead constant as a potential future problem. Signed-off-by: Christoph Lameter --- include/linux/mm.h | 2 +- mm/fadvise.c | 5 +++-- mm/filemap.c | 5 +++-- mm/madvise.c | 4 +++- mm/readahead.c | 20 +++++++++++++------- 5 files changed, 23 insertions(+), 13 deletions(-) Index: linux-2.6.21-rc7/include/linux/mm.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-22 21:48:22.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-22 22:04:44.000000000 -0700 @@ -1104,7 +1104,7 @@ unsigned long page_cache_readahead(struc unsigned long size); void handle_ra_miss(struct address_space *mapping, struct file_ra_state *ra, pgoff_t offset); -unsigned long max_sane_readahead(unsigned long nr); +unsigned long max_sane_readahead(unsigned long nr, int order); /* Do stack extension */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); Index: linux-2.6.21-rc7/mm/fadvise.c =================================================================== --- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 21:47:41.000000000 -0700 +++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700 @@ -86,10 +86,11 @@ asmlinkage long sys_fadvise64_64(int fd, nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, - max_sane_readahead(nrpages)); + max_sane_readahead(nrpages, + mapping->order)); if (ret > 0) ret = 0; break; Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:04:44.000000000 -0700 @@ -1256,7 +1256,7 @@ do_readahead(struct address_space *mappi return -EINVAL; force_page_cache_readahead(mapping, filp, index, - max_sane_readahead(nr)); + max_sane_readahead(nr, mapping->order)); return 0; } @@ -1391,7 +1391,8 @@ retry_find: count_vm_event(PGMAJFAULT); } did_readaround = 1; - ra_pages = max_sane_readahead(file->f_ra.ra_pages); + ra_pages = max_sane_readahead(file->f_ra.ra_pages, + mapping->order); if (ra_pages) { pgoff_t start = 0; Index: linux-2.6.21-rc7/mm/madvise.c =================================================================== --- linux-2.6.21-rc7.orig/mm/madvise.c 2007-04-22 21:47:41.000000000 -0700 +++ linux-2.6.21-rc7/mm/madvise.c 2007-04-22 22:04:44.000000000 -0700 @@ -105,7 +105,9 @@ static long madvise_willneed(struct vm_a end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; force_page_cache_readahead(file->f_mapping, - file, start, max_sane_readahead(end - start)); + file, start, + max_sane_readahead(end - start, + file->f_mapping->order)); return 0; } Index: linux-2.6.21-rc7/mm/readahead.c =================================================================== --- linux-2.6.21-rc7.orig/mm/readahead.c 2007-04-22 21:47:41.000000000 -0700 +++ linux-2.6.21-rc7/mm/readahead.c 2007-04-22 22:06:47.000000000 -0700 @@ -152,7 +152,7 @@ int read_cache_pages(struct address_spac put_pages_list(pages); break; } - task_io_account_read(PAGE_CACHE_SIZE); + 
task_io_account_read(page_cache_size(mapping)); } pagevec_lru_add(&lru_pvec); return ret; @@ -276,7 +276,7 @@ __do_page_cache_readahead(struct address if (isize == 0) goto out; - end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); + end_index = page_cache_index(mapping, isize - 1); /* * Preallocate as many pages as we will need. @@ -330,7 +330,11 @@ int force_page_cache_readahead(struct ad while (nr_to_read) { int err; - unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE; + /* + * FIXME: Note the 2M constant here that may prove to + * be a problem if page sizes become bigger than one megabyte. + */ + unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024); if (this_chunk > nr_to_read) this_chunk = nr_to_read; @@ -570,11 +574,13 @@ void handle_ra_miss(struct address_space } /* - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a + * Given a desired number of page order readahead pages, return a * sensible upper limit. */ -unsigned long max_sane_readahead(unsigned long nr) +unsigned long max_sane_readahead(unsigned long nr, int order) { - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE) - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2); + unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE) + + node_page_state(numa_node_id(), NR_FREE_PAGES); + + return min(nr, (base_pages / 2) >> order); } -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.785519484@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:18 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters Content-Disposition: inline; filename=var_pc_reclaim We can now reclaim larger pages. Adjust the VM counters to deal with it. Note that this currently does not make things work. For some reason we keep losing pages off the active lists and reclaim stalls at some point attempting to remove active pages from an empty active list. It seems that the removal from the active lists happens outside of reclaim ?!?
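The counting convention used below, shown as a minimal sketch (example_count_list() is hypothetical): list entries remain whole, possibly compound, pages, but all counters are kept in units of base pages:

#include <linux/mm.h>
#include <linux/list.h>

static unsigned long example_count_list(struct list_head *lru)
{
	struct page *page;
	unsigned long nr_base = 0;

	list_for_each_entry(page, lru, lru)
		nr_base += base_pages(page);	/* 1 << compound_order(page) */

	return nr_base;
}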
Signed-off-by: Christoph Lameter --- mm/vmscan.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) Index: linux-2.6.21-rc7/mm/vmscan.c =================================================================== --- linux-2.6.21-rc7.orig/mm/vmscan.c 2007-04-22 06:50:03.000000000 -0700 +++ linux-2.6.21-rc7/mm/vmscan.c 2007-04-22 17:19:35.000000000 -0700 @@ -471,14 +471,14 @@ static unsigned long shrink_page_list(st VM_BUG_ON(PageActive(page)); - sc->nr_scanned++; + sc->nr_scanned += base_pages(page); if (!sc->may_swap && page_mapped(page)) goto keep_locked; /* Double the slab pressure for mapped and swapcache pages */ if (page_mapped(page) || PageSwapCache(page)) - sc->nr_scanned++; + sc->nr_scanned += base_pages(page); if (PageWriteback(page)) goto keep_locked; @@ -581,7 +581,7 @@ static unsigned long shrink_page_list(st free_it: unlock_page(page); - nr_reclaimed++; + nr_reclaimed += base_pages(page); if (!pagevec_add(&freed_pvec, page)) __pagevec_release_nonlru(&freed_pvec); continue; @@ -627,7 +627,7 @@ static unsigned long isolate_lru_pages(u struct page *page; unsigned long scan; - for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { + for (scan = 0; scan < nr_to_scan && !list_empty(src); ) { struct list_head *target; page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); @@ -644,10 +644,11 @@ static unsigned long isolate_lru_pages(u */ ClearPageLRU(page); target = dst; - nr_taken++; + nr_taken += base_pages(page); } /* else it is being freed elsewhere */ list_add(&page->lru, target); + scan += base_pages(page); } *scanned = scan; @@ -856,7 +857,7 @@ force_reclaim_mapped: ClearPageActive(page); list_move(&page->lru, &zone->inactive_list); - pgmoved++; + pgmoved += base_pages(page); if (!pagevec_add(&pvec, page)) { __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); spin_unlock_irq(&zone->lru_lock); @@ -884,7 +885,7 @@ force_reclaim_mapped: SetPageLRU(page); VM_BUG_ON(!PageActive(page)); list_move(&page->lru, &zone->active_list); - pgmoved++; + pgmoved += base_pages(page); if (!pagevec_add(&pvec, page)) { __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); pgmoved = 0; -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062130.952242003@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:19 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 12/16] Variable Order Page Cache: Fix up the writeback logic Content-Disposition: inline; filename=var_pc_writeback Nothing special here. Just the usual transformations. 
Signed-off-by: Christoph Lameter --- fs/sync.c | 8 ++++---- mm/fadvise.c | 8 ++++---- mm/page-writeback.c | 4 ++-- mm/truncate.c | 23 ++++++++++++----------- 4 files changed, 22 insertions(+), 21 deletions(-) Index: linux-2.6.21-rc7/mm/page-writeback.c =================================================================== --- linux-2.6.21-rc7.orig/mm/page-writeback.c 2007-04-22 21:47:34.000000000 -0700 +++ linux-2.6.21-rc7/mm/page-writeback.c 2007-04-22 22:08:35.000000000 -0700 @@ -606,8 +606,8 @@ int generic_writepages(struct address_sp index = mapping->writeback_index; /* Start from prev offset */ end = -1; } else { - index = wbc->range_start >> PAGE_CACHE_SHIFT; - end = wbc->range_end >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, wbc->range_start); + end = page_cache_index(mapping, wbc->range_end); if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; scanned = 1; Index: linux-2.6.21-rc7/fs/sync.c =================================================================== --- linux-2.6.21-rc7.orig/fs/sync.c 2007-04-22 21:47:34.000000000 -0700 +++ linux-2.6.21-rc7/fs/sync.c 2007-04-22 22:08:35.000000000 -0700 @@ -254,8 +254,8 @@ int do_sync_file_range(struct file *file ret = 0; if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) { ret = wait_on_page_writeback_range(mapping, - offset >> PAGE_CACHE_SHIFT, - endbyte >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, offset), + page_cache_index(mapping, endbyte)); if (ret < 0) goto out; } @@ -269,8 +269,8 @@ int do_sync_file_range(struct file *file if (flags & SYNC_FILE_RANGE_WAIT_AFTER) { ret = wait_on_page_writeback_range(mapping, - offset >> PAGE_CACHE_SHIFT, - endbyte >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, offset), + page_cache_index(mapping, endbyte)); } out: return ret; Index: linux-2.6.21-rc7/mm/fadvise.c =================================================================== --- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700 +++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:08:35.000000000 -0700 @@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd, } /* First and last PARTIAL page! */ - start_index = offset >> PAGE_CACHE_SHIFT; - end_index = endbyte >> PAGE_CACHE_SHIFT; + start_index = page_cache_index(mapping, offset); + end_index = page_cache_index(mapping, endbyte); /* Careful about overflow on the "+1" */ nrpages = end_index - start_index + 1; @@ -101,8 +101,8 @@ asmlinkage long sys_fadvise64_64(int fd, filemap_flush(mapping); /* First and last FULL page! 
*/ - start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT; - end_index = (endbyte >> PAGE_CACHE_SHIFT); + start_index = page_cache_next(mapping, offset); + end_index = page_cache_index(mapping, endbyte); if (end_index >= start_index) invalidate_mapping_pages(mapping, start_index, Index: linux-2.6.21-rc7/mm/truncate.c =================================================================== --- linux-2.6.21-rc7.orig/mm/truncate.c 2007-04-22 21:47:34.000000000 -0700 +++ linux-2.6.21-rc7/mm/truncate.c 2007-04-22 22:11:19.000000000 -0700 @@ -46,7 +46,8 @@ void do_invalidatepage(struct page *page static inline void truncate_partial_page(struct page *page, unsigned partial) { - memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial); + memclear_highpage_flush(page, partial, + (PAGE_SIZE << compound_order(page)) - partial); if (PagePrivate(page)) do_invalidatepage(page, partial); } @@ -94,7 +95,7 @@ truncate_complete_page(struct address_sp if (page->mapping != mapping) return; - cancel_dirty_page(page, PAGE_CACHE_SIZE); + cancel_dirty_page(page, page_cache_size(mapping)); if (PagePrivate(page)) do_invalidatepage(page, 0); @@ -156,9 +157,9 @@ invalidate_complete_page(struct address_ void truncate_inode_pages_range(struct address_space *mapping, loff_t lstart, loff_t lend) { - const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; + const pgoff_t start = page_cache_next(mapping, lstart); pgoff_t end; - const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1); + const unsigned partial = page_cache_offset(mapping, lstart); struct pagevec pvec; pgoff_t next; int i; @@ -166,8 +167,9 @@ void truncate_inode_pages_range(struct a if (mapping->nrpages == 0) return; - BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); - end = (lend >> PAGE_CACHE_SHIFT); + BUG_ON(page_cache_offset(mapping, lend) != + page_cache_size(mapping) - 1); + end = page_cache_index(mapping, lend); pagevec_init(&pvec, 0); next = start; @@ -402,9 +404,8 @@ int invalidate_inode_pages2_range(struct * Zap the rest of the file in one hit. */ unmap_mapping_range(mapping, (loff_t)page_index< References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:20 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 13/16] Variable Order Page Cache: Fixes to the block layer Content-Disposition: inline; filename=var_pc_buffer_head Fix up (at least some pieces of) the block layer. It already has some flexibility. Extend that for larger page sizes. set_blocksize is changed to allow specifying a blocksize larger than a page. If that occurs then we switch the device to use compound pages.
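For example (example_use_64k_blocks() is a hypothetical caller, assuming a 4k PAGE_SIZE and a device whose hardware sector size permits it), a 64k block size now switches the block device mapping to order 4 compound pages:

#include <linux/fs.h>

static int example_use_64k_blocks(struct block_device *bdev)
{
	int err = set_blocksize(bdev, 65536);	/* rejected before this patch */

	if (!err)
		/* blksize_bits(65536) == 16, so order == 16 - PAGE_SHIFT == 4 */
		BUG_ON(bdev->bd_inode->i_mapping->order != 4);
	return err;
}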
Signed-off-by: Christoph Lameter --- fs/block_dev.c | 22 ++++++--- fs/buffer.c | 101 +++++++++++++++++++++++--------------------- fs/inode.c | 5 +- fs/mpage.c | 34 +++++++------- include/linux/buffer_head.h | 9 +++ 5 files changed, 100 insertions(+), 71 deletions(-) Index: linux-2.6.21-rc7/include/linux/buffer_head.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/buffer_head.h 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/buffer_head.h 2007-04-22 22:14:41.000000000 -0700 @@ -129,7 +129,14 @@ BUFFER_FNS(Ordered, ordered) BUFFER_FNS(Eopnotsupp, eopnotsupp) BUFFER_FNS(Unwritten, unwritten) -#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK) +static inline unsigned long bh_offset(struct buffer_head *bh) +{ + /* Cannot use the mapping since it may be set to NULL. */ + unsigned long mask = ~(PAGE_MASK << compound_order(bh->b_page)); + + return (unsigned long)bh->b_data & mask; +} + #define touch_buffer(bh) mark_page_accessed(bh->b_page) /* If we *know* page->private refers to buffer_heads */ Index: linux-2.6.21-rc7/fs/block_dev.c =================================================================== --- linux-2.6.21-rc7.orig/fs/block_dev.c 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/fs/block_dev.c 2007-04-22 22:11:44.000000000 -0700 @@ -60,12 +60,12 @@ static void kill_bdev(struct block_devic { invalidate_bdev(bdev, 1); truncate_inode_pages(bdev->bd_inode->i_mapping, 0); -} +} int set_blocksize(struct block_device *bdev, int size) { - /* Size must be a power of two, and between 512 and PAGE_SIZE */ - if (size > PAGE_SIZE || size < 512 || (size & (size-1))) + /* Size must be a power of two, and greater than 512 */ + if (size < 512 || (size & (size-1))) return -EINVAL; /* Size cannot be smaller than the size supported by the device */ @@ -74,10 +74,16 @@ int set_blocksize(struct block_device *b /* Don't change the size if it is same as current */ if (bdev->bd_block_size != size) { + int bits = blksize_bits(size); + struct address_space *mapping = + bdev->bd_inode->i_mapping; + sync_blockdev(bdev); - bdev->bd_block_size = size; - bdev->bd_inode->i_blkbits = blksize_bits(size); kill_bdev(bdev); + bdev->bd_block_size = size; + bdev->bd_inode->i_blkbits = bits; + set_mapping_order(mapping, + bits < PAGE_SHIFT ? 0 : bits - PAGE_SHIFT); } return 0; } @@ -88,8 +94,10 @@ int sb_set_blocksize(struct super_block { if (set_blocksize(sb->s_bdev, size)) return 0; - /* If we get here, we know size is power of two - * and it's value is between 512 and PAGE_SIZE */ + /* + * If we get here, we know size is power of two + * and it's value is larger than 512 + */ sb->s_blocksize = size; sb->s_blocksize_bits = blksize_bits(size); return sb->s_blocksize; Index: linux-2.6.21-rc7/fs/buffer.c =================================================================== --- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/fs/buffer.c 2007-04-22 22:11:44.000000000 -0700 @@ -259,7 +259,7 @@ __find_get_block_slow(struct block_devic struct page *page; int all_mapped = 1; - index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); + index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits); page = find_get_page(bd_mapping, index); if (!page) goto out; @@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page if (page->mapping) { /* Race with truncate? 
*/ if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - task_io_account_write(PAGE_CACHE_SIZE); + task_io_account_write(page_cache_size(mapping)); } radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -879,10 +879,13 @@ struct buffer_head *alloc_page_buffers(s { struct buffer_head *bh, *head; long offset; + unsigned page_size = page_cache_size(page->mapping); + + BUG_ON(size > page_size); try_again: head = NULL; - offset = PAGE_SIZE; + offset = page_size; while ((offset -= size) >= 0) { bh = alloc_buffer_head(GFP_NOFS); if (!bh) @@ -1080,7 +1083,7 @@ __getblk_slow(struct block_device *bdev, { /* Size must be multiple of hard sectorsize */ if (unlikely(size & (bdev_hardsect_size(bdev)-1) || - (size < 512 || size > PAGE_SIZE))) { + size < 512)) { printk(KERN_ERR "getblk(): invalid block size %d requested\n", size); printk(KERN_ERR "hardsect size: %d\n", @@ -1417,7 +1420,7 @@ void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { bh->b_page = page; - BUG_ON(offset >= PAGE_SIZE); + VM_BUG_ON(offset >= page_cache_size(page->mapping)); if (PageHighMem(page)) /* * This catches illegal uses and preserves the offset: @@ -1766,8 +1769,8 @@ static int __block_prepare_write(struct struct buffer_head *bh, *head, *wait[2], **wait_bh=wait; BUG_ON(!PageLocked(page)); - BUG_ON(from > PAGE_CACHE_SIZE); - BUG_ON(to > PAGE_CACHE_SIZE); + BUG_ON(from > page_cache_size(inode->i_mapping)); + BUG_ON(to > page_cache_size(inode->i_mapping)); BUG_ON(from > to); blocksize = 1 << inode->i_blkbits; @@ -1776,7 +1779,7 @@ static int __block_prepare_write(struct head = page_buffers(page); bbits = inode->i_blkbits; - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); + block = (sector_t)page->index << (page_cache_shift(inode->i_mapping) - bbits); for(bh = head, block_start = 0; bh != head || !block_start; block++, block_start=block_end, bh = bh->b_this_page) { @@ -1934,7 +1937,7 @@ int block_read_full_page(struct page *pa create_empty_buffers(page, blocksize, 0); head = page_buffers(page); - iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + iblock = (sector_t)page->index << (page_cache_shift(page->mapping) - inode->i_blkbits); lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits; bh = head; nr = 0; @@ -1957,7 +1960,7 @@ int block_read_full_page(struct page *pa if (!buffer_mapped(bh)) { void *kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + i * blocksize, 0, blocksize); - flush_dcache_page(page); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); if (!err) set_buffer_uptodate(bh); @@ -2058,10 +2061,11 @@ out: int generic_cont_expand(struct inode *inode, loff_t size) { + struct address_space *mapping = inode->i_mapping; pgoff_t index; unsigned int offset; - offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */ + offset = page_cache_offset(mapping, size); /* ugh. in prepare/commit_write, if from==to==start of block, we ** skip the prepare. make sure we never send an offset for the start @@ -2071,7 +2075,7 @@ int generic_cont_expand(struct inode *in /* caller must handle this extra byte. 
*/ offset++; } - index = size >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, size); return __generic_cont_expand(inode, size, index, offset); } @@ -2079,8 +2083,8 @@ int generic_cont_expand(struct inode *in int generic_cont_expand_simple(struct inode *inode, loff_t size) { loff_t pos = size - 1; - pgoff_t index = pos >> PAGE_CACHE_SHIFT; - unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1; + pgoff_t index = page_cache_index(inode->i_mapping, pos); + unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1; /* prepare/commit_write can handle even if from==to==start of block. */ return __generic_cont_expand(inode, size, index, offset); @@ -2103,31 +2107,32 @@ int cont_prepare_write(struct page *page unsigned blocksize = 1 << inode->i_blkbits; void *kaddr; - while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) { + while(page->index > (pgpos = page_cache_index(mapping, *bytes))) { status = -ENOMEM; new_page = grab_cache_page(mapping, pgpos); if (!new_page) goto out; /* we might sleep */ - if (*bytes>>PAGE_CACHE_SHIFT != pgpos) { + if (page_cache_index(mapping, *bytes) != pgpos) { unlock_page(new_page); page_cache_release(new_page); continue; } - zerofrom = *bytes & ~PAGE_CACHE_MASK; + zerofrom = page_cache_offset(mapping, *bytes); if (zerofrom & (blocksize-1)) { *bytes |= (blocksize-1); (*bytes)++; } status = __block_prepare_write(inode, new_page, zerofrom, - PAGE_CACHE_SIZE, get_block); + page_cache_size(mapping), get_block); if (status) goto out_unmap; + /* Need higher order kmap?? */ kaddr = kmap_atomic(new_page, KM_USER0); - memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom); + memset(kaddr+zerofrom, 0, page_cache_size(mapping)-zerofrom); flush_dcache_page(new_page); kunmap_atomic(kaddr, KM_USER0); - generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE); + generic_commit_write(NULL, new_page, zerofrom, page_cache_size(mapping)); unlock_page(new_page); page_cache_release(new_page); } @@ -2137,7 +2142,7 @@ int cont_prepare_write(struct page *page zerofrom = offset; } else { /* page covers the boundary, find the boundary offset */ - zerofrom = *bytes & ~PAGE_CACHE_MASK; + zerofrom = page_cache_offset(mapping, *bytes); /* if we will expand the thing last block will be filled */ if (to > zerofrom && (zerofrom & (blocksize-1))) { @@ -2192,8 +2197,9 @@ int block_commit_write(struct page *page int generic_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); __block_commit_write(inode,page,from,to); /* * No need to use i_size_read() here, the i_size @@ -2235,6 +2241,7 @@ static void end_buffer_read_nobh(struct int nobh_prepare_write(struct page *page, unsigned from, unsigned to, get_block_t *get_block) { + struct address_space *mapping = page->mapping; struct inode *inode = page->mapping->host; const unsigned blkbits = inode->i_blkbits; const unsigned blocksize = 1 << blkbits; @@ -2242,6 +2249,7 @@ int nobh_prepare_write(struct page *page struct buffer_head *read_bh[MAX_BUF_PER_PAGE]; unsigned block_in_page; unsigned block_start; + unsigned page_size = page_cache_size(mapping); sector_t block_in_file; char *kaddr; int nr_reads = 0; @@ -2252,7 +2260,7 @@ int nobh_prepare_write(struct page *page if (PageMappedToDisk(page)) return 0; - block_in_file = (sector_t)page->index << 
(PAGE_CACHE_SHIFT - blkbits); + block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits); map_bh.b_page = page; /* @@ -2261,7 +2269,7 @@ int nobh_prepare_write(struct page *page * page is fully mapped-to-disk. */ for (block_start = 0, block_in_page = 0; - block_start < PAGE_CACHE_SIZE; + block_start < page_size; block_in_page++, block_start += blocksize) { unsigned block_end = block_start + blocksize; int create; @@ -2288,7 +2296,7 @@ int nobh_prepare_write(struct page *page memset(kaddr+block_start, 0, from-block_start); if (block_end > to) memset(kaddr + to, 0, block_end - to); - flush_dcache_page(page); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); continue; } @@ -2356,8 +2364,8 @@ failed: * so we'll later zero out any blocks which _were_ allocated. */ kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr, 0, PAGE_CACHE_SIZE); - flush_dcache_page(page); + memset(kaddr, 0, page_size); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); SetPageUptodate(page); set_page_dirty(page); @@ -2372,8 +2380,9 @@ EXPORT_SYMBOL(nobh_prepare_write); int nobh_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); SetPageUptodate(page); set_page_dirty(page); @@ -2395,7 +2404,7 @@ int nobh_writepage(struct page *page, ge { struct inode * const inode = page->mapping->host; loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; + const pgoff_t end_index = page_cache_offset(page->mapping, i_size); unsigned offset; void *kaddr; int ret; @@ -2405,7 +2414,7 @@ int nobh_writepage(struct page *page, ge goto out; /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size & (PAGE_CACHE_SIZE-1); + offset = page_cache_offset(page->mapping, i_size); if (page->index >= end_index+1 || !offset) { /* * The page may have dirty, unmapped buffers. For example, @@ -2429,7 +2438,7 @@ int nobh_writepage(struct page *page, ge * writes to that region are not written out to the file." 
*/ kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); + memset(kaddr + offset, 0, page_cache_size(page->mapping) - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); out: @@ -2447,8 +2456,8 @@ int nobh_truncate_page(struct address_sp { struct inode *inode = mapping->host; unsigned blocksize = 1 << inode->i_blkbits; - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); + pgoff_t index = page_cache_index(mapping, from); + unsigned offset = page_cache_offset(mapping, from); unsigned to; struct page *page; const struct address_space_operations *a_ops = mapping->a_ops; @@ -2467,8 +2476,8 @@ int nobh_truncate_page(struct address_sp ret = a_ops->prepare_write(NULL, page, offset, to); if (ret == 0) { kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); + memset(kaddr + offset, 0, page_cache_size(mapping) - offset); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); /* * It would be more correct to call aops->commit_write() @@ -2487,8 +2496,8 @@ EXPORT_SYMBOL(nobh_truncate_page); int block_truncate_page(struct address_space *mapping, loff_t from, get_block_t *get_block) { - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); + pgoff_t index = page_cache_index(mapping, from); + unsigned offset = page_cache_offset(mapping, from); unsigned blocksize; sector_t iblock; unsigned length, pos; @@ -2506,7 +2515,7 @@ int block_truncate_page(struct address_s return 0; length = blocksize - length; - iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + iblock = (sector_t)index << (page_cache_shift(mapping) - inode->i_blkbits); page = grab_cache_page(mapping, index); err = -ENOMEM; @@ -2551,7 +2560,7 @@ int block_truncate_page(struct address_s kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + offset, 0, length); - flush_dcache_page(page); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); mark_buffer_dirty(bh); @@ -2572,7 +2581,7 @@ int block_write_full_page(struct page *p { struct inode * const inode = page->mapping->host; loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; + const pgoff_t end_index = page_cache_index(page->mapping, i_size); unsigned offset; void *kaddr; @@ -2581,7 +2590,7 @@ int block_write_full_page(struct page *p return __block_write_full_page(inode, page, get_block, wbc); /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size & (PAGE_CACHE_SIZE-1); + offset = page_cache_offset(page->mapping, i_size); if (page->index >= end_index+1 || !offset) { /* * The page may have dirty, unmapped buffers. For example, @@ -2601,8 +2610,8 @@ int block_write_full_page(struct page *p * writes to that region are not written out to the file." */ kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); + memset(kaddr + offset, 0, page_cache_size(page->mapping) - offset); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); return __block_write_full_page(inode, page, get_block, wbc); } @@ -2857,7 +2866,7 @@ int try_to_free_buffers(struct page *pag * dirty bit from being lost. 
*/ if (ret) - cancel_dirty_page(page, PAGE_CACHE_SIZE); + cancel_dirty_page(page, page_cache_size(mapping)); spin_unlock(&mapping->private_lock); out: if (buffers_to_free) { Index: linux-2.6.21-rc7/fs/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-22 21:52:18.000000000 -0700 +++ linux-2.6.21-rc7/fs/inode.c 2007-04-22 22:11:44.000000000 -0700 @@ -145,7 +145,10 @@ static struct inode *alloc_inode(struct mapping->a_ops = &empty_aops; mapping->host = inode; mapping->flags = 0; - mapping->order = 0; + if (inode->i_blkbits > PAGE_SHIFT) + set_mapping_order(mapping, inode->i_blkbits - PAGE_SHIFT); + else + set_mapping_order(mapping, 0); mapping_set_gfp_mask(mapping, GFP_HIGHUSER); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; Index: linux-2.6.21-rc7/fs/mpage.c =================================================================== --- linux-2.6.21-rc7.orig/fs/mpage.c 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/fs/mpage.c 2007-04-22 22:11:44.000000000 -0700 @@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev, static void map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block) { - struct inode *inode = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; struct buffer_head *page_bh, *head; int block = 0; @@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, st * don't make any buffers if there is only one buffer on * the page and the page just needs to be set up to date */ - if (inode->i_blkbits == PAGE_CACHE_SHIFT && + if (inode->i_blkbits == page_cache_shift(mapping) && buffer_uptodate(bh)) { - SetPageUptodate(page); + SetPageUptodate(page); return; } create_empty_buffers(page, 1 << inode->i_blkbits, 0); @@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struc sector_t *last_block_in_bio, struct buffer_head *map_bh, unsigned long *first_logical_block, get_block_t get_block) { - struct inode *inode = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; const unsigned blkbits = inode->i_blkbits; - const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits; + const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits; const unsigned blocksize = 1 << blkbits; sector_t block_in_file; sector_t last_block; @@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struc if (page_has_buffers(page)) goto confused; - block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); + block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits); last_block = block_in_file + nr_pages * blocks_per_page; last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits; if (last_block > last_block_in_file) @@ -286,8 +288,8 @@ do_mpage_readpage(struct bio *bio, struc if (first_hole != blocks_per_page) { char *kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + (first_hole << blkbits), 0, - PAGE_CACHE_SIZE - (first_hole << blkbits)); - flush_dcache_page(page); + page_cache_size(mapping) - (first_hole << blkbits)); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); if (first_hole == 0) { SetPageUptodate(page); @@ -465,7 +467,7 @@ __mpage_writepage(struct bio *bio, struc struct inode *inode = page->mapping->host; const unsigned blkbits = inode->i_blkbits; unsigned long end_index; - const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits; + const unsigned blocks_per_page = page_cache_size(mapping) >> 
blkbits; sector_t last_block; sector_t block_in_file; sector_t blocks[MAX_BUF_PER_PAGE]; @@ -533,7 +535,7 @@ __mpage_writepage(struct bio *bio, struc * The page has no buffers: map it to disk */ BUG_ON(!PageUptodate(page)); - block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); + block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits); last_block = (i_size - 1) >> blkbits; map_bh.b_page = page; for (page_block = 0; page_block < blocks_per_page; ) { @@ -565,7 +567,7 @@ __mpage_writepage(struct bio *bio, struc first_unmapped = page_block; page_is_mapped: - end_index = i_size >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, i_size); if (page->index >= end_index) { /* * The page straddles i_size. It must be zeroed out on each @@ -575,14 +577,14 @@ page_is_mapped: * is zeroed when mapped, and writes to that region are not * written out to the file." */ - unsigned offset = i_size & (PAGE_CACHE_SIZE - 1); + unsigned offset = page_cache_offset(mapping, i_size); char *kaddr; if (page->index > end_index || !offset) goto confused; kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); + memset(kaddr + offset, 0, page_cache_size(mapping) - offset); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); } @@ -727,8 +729,8 @@ mpage_writepages(struct address_space *m index = mapping->writeback_index; /* Start from prev offset */ end = -1; } else { - index = wbc->range_start >> PAGE_CACHE_SHIFT; - end = wbc->range_end >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, wbc->range_start); + end = page_cache_index(mapping, wbc->range_end); if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; scanned = 1; -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062131.279607407@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:21 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 14/16] Variable Order Page Cache: Add support to ramfs Content-Disposition: inline; filename=var_pc_ramfs The simplest file system to use is ramfs. Add a mount parameter that specifies the page order of the pages that ramfs should use. If the order is greater than zero then disable mmap functionality. This could be removed if the VM would be changes to support faulting higher order pages but for now we are content with buffered I/O on higher order pages. Note that ramfs does not use the lower layers (buffer I/O etc) so its the safest to use right now. If you apply this patch and then you can f.e. try this: mount -tramfs -o10 none /media Mounts a ramfs filesystem with order 10 pages (4 MB) cp linux-2.6.21-rc7.tar.gz /media Populate the ramfs. Note that we allocate 14 pages of 4M each instead of 13508.. 
umount /media Gets rid of the large pages again Signed-off-by: Christoph Lameter --- fs/ramfs/file-mmu.c | 11 +++++++++++ fs/ramfs/inode.c | 15 ++++++++++++--- include/linux/ramfs.h | 1 + 3 files changed, 24 insertions(+), 3 deletions(-) Index: linux-2.6.21-rc7/fs/ramfs/file-mmu.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ramfs/file-mmu.c 2007-04-18 21:46:38.000000000 -0700 +++ linux-2.6.21-rc7/fs/ramfs/file-mmu.c 2007-04-18 22:02:03.000000000 -0700 @@ -45,6 +45,17 @@ const struct file_operations ramfs_file_ .llseek = generic_file_llseek, }; +/* Higher order mappings do not support mmmap */ +const struct file_operations ramfs_file_higher_order_operations = { + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, + .fsync = simple_sync_file, + .sendfile = generic_file_sendfile, + .llseek = generic_file_llseek, +}; + const struct inode_operations ramfs_file_inode_operations = { .getattr = simple_getattr, }; Index: linux-2.6.21-rc7/fs/ramfs/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ramfs/inode.c 2007-04-18 21:46:38.000000000 -0700 +++ linux-2.6.21-rc7/fs/ramfs/inode.c 2007-04-18 22:02:03.000000000 -0700 @@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup inode->i_blocks = 0; inode->i_mapping->a_ops = &ramfs_aops; inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info; + inode->i_mapping->order = sb->s_blocksize_bits - PAGE_CACHE_SHIFT; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; switch (mode & S_IFMT) { default: @@ -68,7 +69,10 @@ struct inode *ramfs_get_inode(struct sup break; case S_IFREG: inode->i_op = &ramfs_file_inode_operations; - inode->i_fop = &ramfs_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ramfs_file_higher_order_operations; + else + inode->i_fop = &ramfs_file_operations; break; case S_IFDIR: inode->i_op = &ramfs_dir_inode_operations; @@ -164,10 +168,15 @@ static int ramfs_fill_super(struct super { struct inode * inode; struct dentry * root; + int order = 0; + char *options = data; + + if (options && *options) + order = simple_strtoul(options, NULL, 10); sb->s_maxbytes = MAX_LFS_FILESIZE; - sb->s_blocksize = PAGE_CACHE_SIZE; - sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_blocksize = PAGE_CACHE_SIZE << order; + sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT; sb->s_magic = RAMFS_MAGIC; sb->s_op = &ramfs_ops; sb->s_time_gran = 1; Index: linux-2.6.21-rc7/include/linux/ramfs.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/ramfs.h 2007-04-18 21:46:38.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/ramfs.h 2007-04-18 22:02:03.000000000 -0700 @@ -16,6 +16,7 @@ extern int ramfs_nommu_mmap(struct file #endif extern const struct file_operations ramfs_file_operations; +extern const struct file_operations ramfs_file_higher_order_operations; extern struct vm_operations_struct generic_file_vm_ops; extern int __init init_rootfs(void); -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062131.446138927@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:22 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 15/16] ext2: Add variable page size support 
Content-Disposition: inline; filename=var_pc_ext2 This adds variable page size support. It is then possible to mount filesystems that have a larger blocksize than the page size. F.e. the following is possible on x86_64 and i386 that have only a 4k page size. mke2fs -b 16384 /dev/hdd2 mount /dev/hdd2 /media ls -l /media .... Do more things with the volume that uses a 16k page cache size on a 4k page sized platform.. Note that there are issues with ext2 support: 1. Data is not writtten back correctly (block layer?) 2. Reclaim does not work right. And we disable mmap for higher order pages like also done for ramfs. This is temporary until we get support for mmapping higher order pages. Signed-off-by: Christoph Lameter --- fs/ext2/dir.c | 40 +++++++++++++++++++++++----------------- fs/ext2/ext2.h | 1 + fs/ext2/file.c | 18 ++++++++++++++++++ fs/ext2/inode.c | 10 ++++++++-- fs/ext2/namei.c | 10 ++++++++-- 5 files changed, 58 insertions(+), 21 deletions(-) Index: linux-2.6.21-rc7/fs/ext2/dir.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/dir.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/dir.c 2007-04-22 20:09:57.000000000 -0700 @@ -44,7 +44,8 @@ static inline void ext2_put_page(struct static inline unsigned long dir_pages(struct inode *inode) { - return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT; + return (inode->i_size+page_cache_size(inode->i_mapping)-1)>> + page_cache_shift(inode->i_mapping); } /* @@ -55,10 +56,11 @@ static unsigned ext2_last_byte(struct inode *inode, unsigned long page_nr) { unsigned last_byte = inode->i_size; + struct address_space *mapping = inode->i_mapping; - last_byte -= page_nr << PAGE_CACHE_SHIFT; - if (last_byte > PAGE_CACHE_SIZE) - last_byte = PAGE_CACHE_SIZE; + last_byte -= page_nr << page_cache_shift(mapping); + if (last_byte > page_cache_size(mapping)) + last_byte = page_cache_size(mapping); return last_byte; } @@ -77,18 +79,19 @@ static int ext2_commit_chunk(struct page static void ext2_check_page(struct page *page) { - struct inode *dir = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *dir = mapping->host; struct super_block *sb = dir->i_sb; unsigned chunk_size = ext2_chunk_size(dir); char *kaddr = page_address(page); u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count); unsigned offs, rec_len; - unsigned limit = PAGE_CACHE_SIZE; + unsigned limit = page_cache_size(mapping); ext2_dirent *p; char *error; - if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) { - limit = dir->i_size & ~PAGE_CACHE_MASK; + if (page_cache_index(mapping, dir->i_size) == page->index) { + limit = page_cache_offset(mapping, dir->i_size); if (limit & (chunk_size - 1)) goto Ebadsize; if (!limit) @@ -140,7 +143,7 @@ Einumber: bad_entry: ext2_error (sb, "ext2_check_page", "bad entry in directory #%lu: %s - " "offset=%lu, inode=%lu, rec_len=%d, name_len=%d", - dir->i_ino, error, (page->index<i_ino, error, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode), rec_len, p->name_len); goto fail; @@ -149,7 +152,7 @@ Eend: ext2_error (sb, "ext2_check_page", "entry in directory #%lu spans the page boundary" "offset=%lu, inode=%lu", - dir->i_ino, (page->index<i_ino, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode)); fail: SetPageChecked(page); @@ -250,8 +253,9 @@ ext2_readdir (struct file * filp, void * loff_t pos = filp->f_pos; struct inode *inode = filp->f_path.dentry->d_inode; struct super_block *sb = 
inode->i_sb; - unsigned int offset = pos & ~PAGE_CACHE_MASK; - unsigned long n = pos >> PAGE_CACHE_SHIFT; + struct address_space *mapping = inode->i_mapping; + unsigned int offset = page_cache_offset(mapping, pos); + unsigned long n = page_cache_index(mapping, pos); unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); unsigned char *types = NULL; @@ -272,14 +276,14 @@ ext2_readdir (struct file * filp, void * ext2_error(sb, __FUNCTION__, "bad page in #%lu", inode->i_ino); - filp->f_pos += PAGE_CACHE_SIZE - offset; + filp->f_pos += page_cache_size(mapping) - offset; return -EIO; } kaddr = page_address(page); if (unlikely(need_revalidate)) { if (offset) { offset = ext2_validate_entry(kaddr, offset, chunk_mask); - filp->f_pos = (n<f_pos = page_cache_pos(mapping, n, offset); } filp->f_version = inode->i_version; need_revalidate = 0; @@ -302,7 +306,7 @@ ext2_readdir (struct file * filp, void * offset = (char *)de - kaddr; over = filldir(dirent, de->name, de->name_len, - (n<inode), d_type); if (over) { ext2_put_page(page); @@ -328,6 +332,7 @@ struct ext2_dir_entry_2 * ext2_find_entr struct dentry *dentry, struct page ** res_page) { const char *name = dentry->d_name.name; + struct address_space *mapping = dir->i_mapping; int namelen = dentry->d_name.len; unsigned reclen = EXT2_DIR_REC_LEN(namelen); unsigned long start, n; @@ -369,7 +374,7 @@ struct ext2_dir_entry_2 * ext2_find_entr if (++n >= npages) n = 0; /* next page is past the blocks we've got */ - if (unlikely(n > (dir->i_blocks >> (PAGE_CACHE_SHIFT - 9)))) { + if (unlikely(n > (dir->i_blocks >> (page_cache_shift(mapping) - 9)))) { ext2_error(dir->i_sb, __FUNCTION__, "dir %lu size %lld exceeds block count %llu", dir->i_ino, dir->i_size, @@ -438,6 +443,7 @@ void ext2_set_link(struct inode *dir, st int ext2_add_link (struct dentry *dentry, struct inode *inode) { struct inode *dir = dentry->d_parent->d_inode; + struct address_space *mapping = inode->i_mapping; const char *name = dentry->d_name.name; int namelen = dentry->d_name.len; unsigned chunk_size = ext2_chunk_size(dir); @@ -467,7 +473,7 @@ int ext2_add_link (struct dentry *dentry kaddr = page_address(page); dir_end = kaddr + ext2_last_byte(dir, n); de = (ext2_dirent *)kaddr; - kaddr += PAGE_CACHE_SIZE - reclen; + kaddr += page_cache_size(mapping) - reclen; while ((char *)de <= kaddr) { if ((char *)de == dir_end) { /* We hit i_size */ Index: linux-2.6.21-rc7/fs/ext2/ext2.h =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/ext2.h 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/ext2.h 2007-04-22 19:44:22.000000000 -0700 @@ -160,6 +160,7 @@ extern const struct file_operations ext2 /* file.c */ extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; +extern const struct file_operations ext2_no_mmap_file_operations; extern const struct file_operations ext2_xip_file_operations; /* inode.c */ Index: linux-2.6.21-rc7/fs/ext2/file.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/file.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/file.c 2007-04-22 19:44:22.000000000 -0700 @@ -58,6 +58,24 @@ const struct file_operations ext2_file_o .splice_write = generic_file_splice_write, }; +const struct file_operations ext2_no_mmap_file_operations = { + .llseek = generic_file_llseek, + .read = do_sync_read, + .write = do_sync_write, + .aio_read = 
generic_file_aio_read, + .aio_write = generic_file_aio_write, + .ioctl = ext2_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ext2_compat_ioctl, +#endif + .open = generic_file_open, + .release = ext2_release_file, + .fsync = ext2_sync_file, + .sendfile = generic_file_sendfile, + .splice_read = generic_file_splice_read, + .splice_write = generic_file_splice_write, +}; + #ifdef CONFIG_EXT2_FS_XIP const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, Index: linux-2.6.21-rc7/fs/ext2/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/inode.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/inode.c 2007-04-22 19:44:22.000000000 -0700 @@ -1128,10 +1128,16 @@ void ext2_read_inode (struct inode * ino inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } else { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext2_dir_inode_operations; Index: linux-2.6.21-rc7/fs/ext2/namei.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/namei.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/namei.c 2007-04-22 19:44:22.000000000 -0700 @@ -114,10 +114,16 @@ static int ext2_create (struct inode * d inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } else { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } mark_inode_dirty(inode); err = ext2_add_nondir(dentry, inode); -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062131.611972880@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:23 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros Content-Disposition: inline; filename=var_pc_alternate Implement the page cache macros in a more efficient way by storing key values in the mapping. This reduces code size but increases inode size. 
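Before the diff, a quick consistency check on the cached-field variant. The sketch below is user space only: demo_mapping and the demo_* helpers are made-up stand-ins for struct address_space and the pagemap.h functions in this patch, with a 4k base page assumed. It asserts that the cached shift/offset_mask fields give the same size, offset and index as the plain order-based arithmetic they replace.

/*
 * User-space sketch, not kernel code.  Mirrors set_mapping_order() and
 * the page_cache_size/offset/index helpers from this patch.
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct demo_mapping {
	unsigned int order;		/* page order for allocations */
	unsigned int shift;		/* cached: order + PAGE_SHIFT */
	uint64_t offset_mask;		/* cached: page cache size - 1 */
};

static void demo_set_mapping_order(struct demo_mapping *m, int order)
{
	m->order = order;
	m->shift = order + PAGE_SHIFT;
	m->offset_mask = (1ULL << m->shift) - 1;
}

static unsigned int demo_page_cache_size(struct demo_mapping *m)
{
	return m->offset_mask + 1;	/* == PAGE_SIZE << order */
}

static unsigned int demo_page_cache_offset(struct demo_mapping *m, uint64_t pos)
{
	return pos & m->offset_mask;	/* == pos % page cache size */
}

static uint64_t demo_page_cache_index(struct demo_mapping *m, uint64_t pos)
{
	return pos >> m->shift;		/* == pos / page cache size */
}

int main(void)
{
	struct demo_mapping m;
	uint64_t pos = 100000;		/* arbitrary file offset */

	demo_set_mapping_order(&m, 2);	/* 16k page cache pages */

	assert(demo_page_cache_size(&m) == (PAGE_SIZE << m.order));
	assert(demo_page_cache_offset(&m, pos) == pos % demo_page_cache_size(&m));
	assert(demo_page_cache_index(&m, pos) == pos / demo_page_cache_size(&m));
	return 0;
}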
Signed-off-by: Christoph Lameter --- include/linux/fs.h | 4 +++- include/linux/pagemap.h | 13 +++++++------ 2 files changed, 10 insertions(+), 7 deletions(-) Index: linux-2.6.21-rc7/include/linux/fs.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-22 19:43:01.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-22 19:44:29.000000000 -0700 @@ -435,7 +435,9 @@ struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ rwlock_t tree_lock; /* and rwlock protecting it */ - unsigned int order; /* Page order in this space */ + unsigned int shift; /* Shift for to get to the page number */ + unsigned int order; /* Page order for allocations */ + loff_t offset_mask; /* To mask out offset in page */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 19:44:16.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:46:23.000000000 -0700 @@ -42,7 +42,8 @@ static inline void mapping_set_gfp_mask( static inline void set_mapping_order(struct address_space *m, int order) { m->order = order; - + m->shift = order + PAGE_SHIFT; + m->offset_mask = (1UL << m->shift) -1; if (order) m->flags |= __GFP_COMP; else @@ -64,23 +65,23 @@ static inline void set_mapping_order(str static inline int page_cache_shift(struct address_space *a) { - return a->order + PAGE_SHIFT; + return a->shift; } static inline unsigned int page_cache_size(struct address_space *a) { - return PAGE_SIZE << a->order; + return a->offset_mask + 1; } static inline loff_t page_cache_mask(struct address_space *a) { - return (loff_t)PAGE_MASK << a->order; + return ~(loff_t)a->offset_mask; } static inline unsigned int page_cache_offset(struct address_space *a, loff_t pos) { - return pos & ~(PAGE_MASK << a->order); + return pos & a->offset_mask; } static inline pgoff_t page_cache_index(struct address_space *a, @@ -95,7 +96,7 @@ static inline pgoff_t page_cache_index(s static inline pgoff_t page_cache_next(struct address_space *a, loff_t pos) { - return page_cache_index(a, pos + page_cache_size(a) - 1); + return page_cache_index(a, pos + a->offset_mask); } static inline loff_t page_cache_pos(struct address_space *a, --
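A closing note on the helpers above: page_cache_next() is what truncate_inode_pages_range() in the truncate patch earlier in this series uses to find the first page that lies entirely past lstart. Its round-up behaviour can be checked with the same kind of stand-alone sketch; demo_index()/demo_next() and the order-2 mapping here are illustrative assumptions, not patch code.

/*
 * Sketch, not kernel code: page_cache_next()-style rounding for an
 * order-2 mapping (16k page cache pages on a 4k-page machine).
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

static const unsigned int shift = PAGE_SHIFT + 2;		/* order 2 */
static const uint64_t offset_mask = (1ULL << (PAGE_SHIFT + 2)) - 1;

static uint64_t demo_index(uint64_t pos)	/* page containing pos */
{
	return pos >> shift;
}

static uint64_t demo_next(uint64_t pos)	/* first page starting at or after pos */
{
	return demo_index(pos + offset_mask);
}

int main(void)
{
	/* lstart on a page boundary: that page itself is fully truncated. */
	assert(demo_next(2 * 16384) == 2);

	/* lstart one byte into page 2: page 2 is partial, start from page 3. */
	assert(demo_next(2 * 16384 + 1) == 3);

	/* The partial-page offset that truncate_partial_page() would zero. */
	assert(((2 * 16384 + 1) & offset_mask) == 1);
	return 0;
}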