From clameter@sgi.com Sun Apr 22 23:21:29 2007
Message-Id: <20070423062107.843307112@sgi.com>
User-Agent: quilt/0.45-1
Date: Sun, 22 Apr 2007 23:21:07 -0700
From: clameter@sgi.com
To: linux-mm@kvack.org
Cc: Mel Gorman, William Lee Irwin III, Adam Litke, David Chinner, Jens Axboe, Avi Kivity, Dave Hansen, Badari Pulavarty, Maxim Levitsky
Subject: [RFC 00/16] Variable Order Page Cache Patchset V2

RFC V1->V2
- Some ext2 support
- Some block layer, fs layer support etc.
- Better page cache macros
- Use macros to clean up code.

This patchset modifies the Linux kernel so that higher order page cache pages become possible. The higher order page cache pages are compound pages and can be handled in the same way as regular pages.

Rationales:

1. We have problems supporting devices with a higher blocksize than page size. This is, for example, important to support CDs and DVDs that can only read and write 32k or 64k blocks. We currently have a shim layer in there to deal with this situation, which limits the speed of I/O. The developers are currently looking for ways to completely bypass the page cache because of this deficiency.

2. 32/64k blocksize is also used in flash devices. Same issues.

3. Future hard disks will support bigger block sizes.

4. Performance. If we look at IA64 vs. x86_64 then it seems that the faster interrupt handling on x86_64 compensates for the speed loss due to a smaller page size (4k vs 16k on IA64). Having higher page sizes on all platforms allows a significant reduction in I/O overhead and increases the size of I/O that can be performed by hardware in a single request, since the number of scatter/gather entries for one request is typically limited. This is going to become increasingly important to support the ever growing memory sizes, since we may have to handle excessively large numbers of 4k requests for data sizes that may become common soon. For example, to write a 1 terabyte file the kernel would have to handle 256 million 4k chunks.

5. Cross arch compatibility: It is currently not possible to mount a 16k blocksize ext2 filesystem created on IA64 on an x86_64 system.

The support here is currently only for buffered I/O and only for two filesystems: ramfs and ext2.

Note that the higher order pages are subject to reclaim. This works in general since we are always operating on a single page struct. Reclaim is fooled into thinking that it is touching page-sized objects (there are likely issues to be fixed there if we want to go down this road).

What is currently not supported:
- Mmapping higher order pages
- Direct I/O (there are some fundamental issues: direct I/O puts compound pages that have to be treated as single pages onto the pagevecs, while the variable order page cache puts higher order compound pages that have to be treated as a single large page onto pagevecs).

Breakage:
- Reclaim does not work for some reason. Compound pages on the active list get lost somehow.
- Disk data is corrupted when writing ext2fs data. There is likely still a lot of work to do in the block layer.
- There is a lot of incomplete work. There are numerous places that have not been fixed yet where the kernel can no longer assume that the page cache consists of PAGE_SIZE pages.

Future:
- Expect several more RFCs
- We hope for XFS support soon
- There are filesystem layer and lower layer issues here that I am not that familiar with. If you can, then please enhance my patches.
- Mmap support could be done in a way that makes the mmap page size independent from the page cache order.
There is no problem mapping a 4k section of a larger page cache page. This should leave mmap as is.
- Let's try to keep the scope as small as possible.

--

From clameter@sgi.com Sun Apr 22 23:21:29 2007
Message-Id: <20070423062129.171333818@sgi.com>
References: <20070423062107.843307112@sgi.com>
User-Agent: quilt/0.45-1
Date: Sun, 22 Apr 2007 23:21:08 -0700
From: clameter@sgi.com
To: linux-mm@kvack.org
Cc: Mel Gorman, William Lee Irwin III, Adam Litke, David Chinner, Jens Axboe, Avi Kivity, Dave Hansen, Badari Pulavarty, Maxim Levitsky
Subject: [RFC 01/16] Free up page->private for compound pages
Content-Disposition: inline; filename=var_pc_compound

If we add a new flag so that we can distinguish between the first page and the tail pages then we can avoid using page->private in the first page. page->private == page for the first page, so there is no real information in there.

Freeing up page->private makes the use of compound pages more transparent. They can then be used more like real pages. Right now we have to be careful, e.g. if we are going beyond PAGE_SIZE allocations in the slab on i386, because we can then no longer use the private field. This is one of the issues that keeps us from supporting debugging for page size slabs in SLAB.

Also, if page->private is available then a compound page may be equipped with buffer heads. This may clear the way for filesystems to support larger blocks than page size.

Note that this patch is different from the one in mm. The one in mm uses PG_reclaim as a PG_tail. We cannot use PG_reclaim since pages can be reclaimed now. So use a separate page flag.

We allow compound page heads on pagevecs. That will break Direct I/O because direct I/O needs pagevecs to handle the component pages but not the whole. Ideas for a solution welcome. Maybe we should modify the Direct I/O layer to not operate on the individual pages but on the compound page as a whole.
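As an illustration of what freeing up page->private enables (a minimal sketch, not part of the patch; attach_example_private() is a hypothetical caller):

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * With PG_tail marking tail pages, and only the tail pages carrying the
 * head pointer in ->private, the head page's ->private is free again,
 * e.g. for a buffer_head chain.
 */
static void attach_example_private(struct page *page, unsigned long data)
{
	struct page *head = compound_head(page);	/* no-op for order-0 pages */

	set_page_private(head, data);
	SetPagePrivate(head);
}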
Signed-off-by: Christoph Lameter --- arch/ia64/mm/init.c | 2 +- include/linux/mm.h | 32 ++++++++++++++++++++++++++------ include/linux/page-flags.h | 6 ++++++ mm/internal.h | 2 +- mm/page_alloc.c | 35 +++++++++++++++++++++++++---------- mm/slab.c | 6 ++---- mm/swap.c | 20 ++++++++++++++++++-- 7 files changed, 79 insertions(+), 24 deletions(-) Index: linux-2.6.21-rc7/include/linux/mm.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-21 20:58:32.000000000 -0700 @@ -263,21 +263,24 @@ static inline int put_page_testzero(stru */ static inline int get_page_unless_zero(struct page *page) { - VM_BUG_ON(PageCompound(page)); return atomic_inc_not_zero(&page->_count); } +static inline struct page *compound_head(struct page *page) +{ + if (unlikely(PageTail(page))) + return (struct page *)page->private; + return page; +} + static inline int page_count(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); - return atomic_read(&page->_count); + return atomic_read(&compound_head(page)->_count); } static inline void get_page(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); + page = compound_head(page); VM_BUG_ON(atomic_read(&page->_count) == 0); atomic_inc(&page->_count); } @@ -314,6 +317,23 @@ static inline compound_page_dtor *get_co return (compound_page_dtor *)page[1].lru.next; } +static inline int compound_order(struct page *page) +{ + if (!PageCompound(page) || PageTail(page)) + return 0; + return (unsigned long)page[1].lru.prev; +} + +static inline void set_compound_order(struct page *page, unsigned long order) +{ + page[1].lru.prev = (void *)order; +} + +static inline int base_pages(struct page *page) +{ + return 1 << compound_order(page); +} + /* * Multiple processes may "see" the same page. E.g. 
for untouched * mappings of /dev/null, all processes see the same page full of Index: linux-2.6.21-rc7/include/linux/page-flags.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/page-flags.h 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/page-flags.h 2007-04-21 20:52:15.000000000 -0700 @@ -91,6 +91,8 @@ #define PG_nosave_free 18 /* Used for system suspend/resume */ #define PG_buddy 19 /* Page is free, on buddy lists */ +#define PG_tail 20 /* Page is tail of a compound page */ + /* PG_owner_priv_1 users should have descriptive aliases */ #define PG_checked PG_owner_priv_1 /* Used by some filesystems */ @@ -241,6 +243,10 @@ static inline void SetPageUptodate(struc #define __SetPageCompound(page) __set_bit(PG_compound, &(page)->flags) #define __ClearPageCompound(page) __clear_bit(PG_compound, &(page)->flags) +#define PageTail(page) test_bit(PG_tail, &(page)->flags) +#define __SetPageTail(page) __set_bit(PG_tail, &(page)->flags) +#define __ClearPageTail(page) __clear_bit(PG_tail, &(page)->flags) + #ifdef CONFIG_SWAP #define PageSwapCache(page) test_bit(PG_swapcache, &(page)->flags) #define SetPageSwapCache(page) set_bit(PG_swapcache, &(page)->flags) Index: linux-2.6.21-rc7/mm/internal.h =================================================================== --- linux-2.6.21-rc7.orig/mm/internal.h 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/internal.h 2007-04-21 20:52:15.000000000 -0700 @@ -24,7 +24,7 @@ static inline void set_page_count(struct */ static inline void set_page_refcounted(struct page *page) { - VM_BUG_ON(PageCompound(page) && page_private(page) != (unsigned long)page); + VM_BUG_ON(PageTail(page)); VM_BUG_ON(atomic_read(&page->_count)); set_page_count(page, 1); } Index: linux-2.6.21-rc7/mm/page_alloc.c =================================================================== --- linux-2.6.21-rc7.orig/mm/page_alloc.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/page_alloc.c 2007-04-21 20:58:32.000000000 -0700 @@ -227,7 +227,7 @@ static void bad_page(struct page *page) static void free_compound_page(struct page *page) { - __free_pages_ok(page, (unsigned long)page[1].lru.prev); + __free_pages_ok(page, compound_order(page)); } static void prep_compound_page(struct page *page, unsigned long order) @@ -236,12 +236,14 @@ static void prep_compound_page(struct pa int nr_pages = 1 << order; set_compound_page_dtor(page, free_compound_page); - page[1].lru.prev = (void *)order; - for (i = 0; i < nr_pages; i++) { + set_compound_order(page, order); + __SetPageCompound(page); + for (i = 1; i < nr_pages; i++) { struct page *p = page + i; + __SetPageTail(p); __SetPageCompound(p); - set_page_private(p, (unsigned long)page); + p->private = (unsigned long)page; } } @@ -250,15 +252,19 @@ static void destroy_compound_page(struct int i; int nr_pages = 1 << order; - if (unlikely((unsigned long)page[1].lru.prev != order)) + if (unlikely(compound_order(page) != order)) bad_page(page); - for (i = 0; i < nr_pages; i++) { + if (unlikely(!PageCompound(page))) + bad_page(page); + __ClearPageCompound(page); + for (i = 1; i < nr_pages; i++) { struct page *p = page + i; - if (unlikely(!PageCompound(p) | - (page_private(p) != (unsigned long)page))) + if (unlikely(!PageCompound(p) | !PageTail(p) | + ((struct page *)p->private != page))) bad_page(page); + __ClearPageTail(p); __ClearPageCompound(p); } } @@ -1438,8 +1444,17 @@ void __pagevec_free(struct pagevec *pvec { int i = pagevec_count(pvec); - while (--i >= 0) - 
free_hot_cold_page(pvec->pages[i], pvec->cold); + while (--i >= 0) { + struct page *page = pvec->pages[i]; + + if (PageCompound(page)) { + compound_page_dtor *dtor; + + dtor = get_compound_page_dtor(page); + (*dtor)(page); + } else + free_hot_cold_page(page, pvec->cold); + } } fastcall void __free_pages(struct page *page, unsigned int order) Index: linux-2.6.21-rc7/mm/slab.c =================================================================== --- linux-2.6.21-rc7.orig/mm/slab.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/slab.c 2007-04-21 20:52:15.000000000 -0700 @@ -592,8 +592,7 @@ static inline void page_set_cache(struct static inline struct kmem_cache *page_get_cache(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); + page = compound_head(page); BUG_ON(!PageSlab(page)); return (struct kmem_cache *)page->lru.next; } @@ -605,8 +604,7 @@ static inline void page_set_slab(struct static inline struct slab *page_get_slab(struct page *page) { - if (unlikely(PageCompound(page))) - page = (struct page *)page_private(page); + page = compound_head(page); BUG_ON(!PageSlab(page)); return (struct slab *)page->lru.prev; } Index: linux-2.6.21-rc7/mm/swap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/swap.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/mm/swap.c 2007-04-21 21:02:59.000000000 -0700 @@ -55,7 +55,7 @@ static void fastcall __page_cache_releas static void put_compound_page(struct page *page) { - page = (struct page *)page_private(page); + page = compound_head(page); if (put_page_testzero(page)) { compound_page_dtor *dtor; @@ -263,7 +263,23 @@ void release_pages(struct page **pages, for (i = 0; i < nr; i++) { struct page *page = pages[i]; - if (unlikely(PageCompound(page))) { + /* + * There is a conflict here between handling a compound + * page as a single big page or a set of smaller pages. + * + * Direct I/O wants us to treat them separately. Variable + * Page Size support means we need to treat then as + * a single unit. + * + * So we compromise here. Tail pages are handled as a + * single page (for direct I/O) but head pages are + * handled as full pages (for Variable Page Size + * Support). + * + * FIXME: That breaks direct I/O for the head page. 
+ */ + if (unlikely(PageTail(page))) { + /* Must treat as a single page */ if (zone) { spin_unlock_irq(&zone->lru_lock); zone = NULL; Index: linux-2.6.21-rc7/arch/ia64/mm/init.c =================================================================== --- linux-2.6.21-rc7.orig/arch/ia64/mm/init.c 2007-04-21 20:52:07.000000000 -0700 +++ linux-2.6.21-rc7/arch/ia64/mm/init.c 2007-04-21 20:52:15.000000000 -0700 @@ -121,7 +121,7 @@ lazy_mmu_prot_update (pte_t pte) return; /* i-cache is already coherent with d-cache */ if (PageCompound(page)) { - order = (unsigned long) (page[1].lru.prev); + order = compound_order(page); flush_icache_range(addr, addr + (1UL << order << PAGE_SHIFT)); } else -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.317055444@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:09 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 02/16] vmstat.c: Support accounting for compound pages Content-Disposition: inline; filename=var_pc_vmstat Compound pages must increment the counters in terms of base pages. If we detect a compound page then add the number of base pages that a compound page has to the counter. This will avoid numerous changes in the VM to fix up page accounting as we add more support for compound pages. Also fix up the accounting for active / inactive pages. Signed-off-by: Christoph Lameter --- include/linux/mm_inline.h | 12 ++++++------ mm/vmstat.c | 8 +++----- 2 files changed, 9 insertions(+), 11 deletions(-) Index: linux-2.6.21-rc7/mm/vmstat.c =================================================================== --- linux-2.6.21-rc7.orig/mm/vmstat.c 2007-04-21 23:35:49.000000000 -0700 +++ linux-2.6.21-rc7/mm/vmstat.c 2007-04-21 23:35:59.000000000 -0700 @@ -223,7 +223,7 @@ void __inc_zone_state(struct zone *zone, void __inc_zone_page_state(struct page *page, enum zone_stat_item item) { - __inc_zone_state(page_zone(page), item); + __mod_zone_page_state(page_zone(page), item, base_pages(page)); } EXPORT_SYMBOL(__inc_zone_page_state); @@ -244,7 +244,7 @@ void __dec_zone_state(struct zone *zone, void __dec_zone_page_state(struct page *page, enum zone_stat_item item) { - __dec_zone_state(page_zone(page), item); + __mod_zone_page_state(page_zone(page), item, -base_pages(page)); } EXPORT_SYMBOL(__dec_zone_page_state); @@ -260,11 +260,9 @@ void inc_zone_state(struct zone *zone, e void inc_zone_page_state(struct page *page, enum zone_stat_item item) { unsigned long flags; - struct zone *zone; - zone = page_zone(page); local_irq_save(flags); - __inc_zone_state(zone, item); + __inc_zone_page_state(page, item); local_irq_restore(flags); } EXPORT_SYMBOL(inc_zone_page_state); Index: linux-2.6.21-rc7/include/linux/mm_inline.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/mm_inline.h 2007-04-22 00:20:15.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/mm_inline.h 2007-04-22 00:21:12.000000000 -0700 @@ -2,28 +2,28 @@ static inline void add_page_to_active_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->active_list); - __inc_zone_state(zone, NR_ACTIVE); + __inc_zone_page_state(page, NR_ACTIVE); } static inline void add_page_to_inactive_list(struct zone *zone, struct page *page) { list_add(&page->lru, &zone->inactive_list); - __inc_zone_state(zone, NR_INACTIVE); 
+ __inc_zone_page_state(page, NR_INACTIVE); } static inline void del_page_from_active_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_ACTIVE); + __dec_zone_page_state(page, NR_ACTIVE); } static inline void del_page_from_inactive_list(struct zone *zone, struct page *page) { list_del(&page->lru); - __dec_zone_state(zone, NR_INACTIVE); + __dec_zone_page_state(page, NR_INACTIVE); } static inline void @@ -32,9 +32,9 @@ del_page_from_lru(struct zone *zone, str list_del(&page->lru); if (PageActive(page)) { __ClearPageActive(page); - __dec_zone_state(zone, NR_ACTIVE); + __dec_zone_page_state(page, NR_ACTIVE); } else { - __dec_zone_state(zone, NR_INACTIVE); + __dec_zone_page_state(page, NR_INACTIVE); } } -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.504330506@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:10 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 03/16] Variable Order Page Cache: Add order field in mapping Content-Disposition: inline; filename=var_pc_order_field Add an "order" field in the address space structure that specifies the page order of pages in an address space. Set the field to zero by default so that filesystems not prepared to deal with higher pages can be left as is. Putting page order in the address space structure means that the order of the pages in the page cache can be varied per file that a filesystem creates. This means we can keep small 4k pages for small files. Larger files can be configured by the file system to use a higher order. 
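As a hypothetical illustration (example_pick_order() is not part of the patch), a filesystem could then pick an order per inode; a later patch additionally makes the mapping's gfp mask include __GFP_COMP for nonzero orders:

#include <linux/fs.h>

/*
 * Choose a larger page cache order for inodes expected to hold big files,
 * keep the default order 0 set in alloc_inode() otherwise.
 */
static void example_pick_order(struct inode *inode, loff_t expected_size)
{
	if (expected_size >= 16 * 1024 * 1024)
		inode->i_mapping->order = 2;	/* 16k pages on a 4k base page */
	/* else: keep order 0, the ordinary 4k page cache */
}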
Signed-off-by: Christoph Lameter --- fs/inode.c | 1 + include/linux/fs.h | 1 + 2 files changed, 2 insertions(+) Index: linux-2.6.21-rc7/fs/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-18 21:21:56.000000000 -0700 +++ linux-2.6.21-rc7/fs/inode.c 2007-04-18 21:26:31.000000000 -0700 @@ -145,6 +145,7 @@ static struct inode *alloc_inode(struct mapping->a_ops = &empty_aops; mapping->host = inode; mapping->flags = 0; + mapping->order = 0; mapping_set_gfp_mask(mapping, GFP_HIGHUSER); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; Index: linux-2.6.21-rc7/include/linux/fs.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-18 21:21:56.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-18 21:26:31.000000000 -0700 @@ -435,6 +435,7 @@ struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ rwlock_t tree_lock; /* and rwlock protecting it */ + unsigned int order; /* Page order in this space */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.645837417@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:11 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 04/16] Variable Order Page Cache: Add basic allocation functions Content-Disposition: inline; filename=var_pc_basic_alloc Extend __page_cache_alloc to take an order parameter and modify call sites. Modify mapping_set_gfp_mask to set __GFP_COMP if the mapping requires higher order allocations. put_page() is already capable of handling compound pages. So there are no changes needed to release higher order page cache pages. However, there is a call to "alloc_page" in mm/filemap.c that does not perform an allocation conformant with the parameters of the mapping. Fix that by introducing a new page cache allocation function (page_cache_alloc_mask) that is capable of taking a gfp_t flag. Signed-off-by: Christoph Lameter --- include/linux/pagemap.h | 34 ++++++++++++++++++++++++++------ mm/filemap.c | 12 +++++++----- 2 files changed, 35 insertions(+), 11 deletions(-) Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 21:47:47.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 21:52:37.000000000 -0700 @@ -3,6 +3,9 @@ /* * Copyright 1995 Linus Torvalds + * + * (C) 2007 sgi, Christoph Lameter + * Add variable order page cache support.
*/ #include #include @@ -32,6 +35,18 @@ static inline void mapping_set_gfp_mask( { m->flags = (m->flags & ~(__force unsigned long)__GFP_BITS_MASK) | (__force unsigned long)mask; + if (m->order) + m->flags |= __GFP_COMP; +} + +static inline void set_mapping_order(struct address_space *m, int order) +{ + m->order = order; + + if (order) + m->flags |= __GFP_COMP; + else + m->flags &= ~__GFP_COMP; } /* @@ -40,7 +55,7 @@ static inline void mapping_set_gfp_mask( * throughput (it can then be mapped into user * space in smaller chunks for same flexibility). * - * Or rather, it _will_ be done in larger chunks. + * This is the base page size */ #define PAGE_CACHE_SHIFT PAGE_SHIFT #define PAGE_CACHE_SIZE PAGE_SIZE @@ -52,22 +67,29 @@ static inline void mapping_set_gfp_mask( void release_pages(struct page **pages, int nr, int cold); #ifdef CONFIG_NUMA -extern struct page *__page_cache_alloc(gfp_t gfp); +extern struct page *__page_cache_alloc(gfp_t gfp, int order); #else -static inline struct page *__page_cache_alloc(gfp_t gfp) +static inline struct page *__page_cache_alloc(gfp_t gfp, int order) { - return alloc_pages(gfp, 0); + return alloc_pages(gfp, order); } #endif +static inline struct page *page_cache_alloc_mask(struct address_space *x, + gfp_t flags) +{ + return __page_cache_alloc(mapping_gfp_mask(x) | flags, + x->order); +} + static inline struct page *page_cache_alloc(struct address_space *x) { - return __page_cache_alloc(mapping_gfp_mask(x)); + return page_cache_alloc_mask(x, 0); } static inline struct page *page_cache_alloc_cold(struct address_space *x) { - return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD); + return page_cache_alloc_mask(x, __GFP_COLD); } typedef int filler_t(void *, struct page *); Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:47:47.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700 @@ -467,13 +467,13 @@ int add_to_page_cache_lru(struct page *p } #ifdef CONFIG_NUMA -struct page *__page_cache_alloc(gfp_t gfp) +struct page *__page_cache_alloc(gfp_t gfp, int order) { if (cpuset_do_page_mem_spread()) { int n = cpuset_mem_spread_node(); - return alloc_pages_node(n, gfp, 0); + return alloc_pages_node(n, gfp, order); } - return alloc_pages(gfp, 0); + return alloc_pages(gfp, order); } EXPORT_SYMBOL(__page_cache_alloc); #endif @@ -670,7 +670,8 @@ repeat: page = find_lock_page(mapping, index); if (!page) { if (!cached_page) { - cached_page = alloc_page(gfp_mask); + cached_page = + page_cache_alloc_mask(mapping, gfp_mask); if (!cached_page) return NULL; } @@ -803,7 +804,8 @@ grab_cache_page_nowait(struct address_sp page_cache_release(page); return NULL; } - page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS); + page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS, + mapping->order); if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) { page_cache_release(page); page = NULL; -- From clameter@sgi.com Sun Apr 22 23:21:29 2007 Message-Id: <20070423062129.804903028@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:12 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 05/16] Variable Order Page Cache: Add functions to establish sizes Content-Disposition: inline; 
filename=var_pc_size_functions We use the macros PAGE_CACHE_SIZE PAGE_CACHE_SHIFT PAGE_CACHE_MASK and PAGE_CACHE_ALIGN in various places in the kernel. These now refer to the base page size, but we do not have a means of calculating these values for higher order pages. Provide these functions. An address_space pointer must be passed to them. Also add a set of extended functions that will be used to consolidate the hand-crafted shifts and adds in use right now for the page cache.

New function				Related base page constant
---------------------------------------------------
page_cache_shift(a)			PAGE_CACHE_SHIFT
page_cache_size(a)			PAGE_CACHE_SIZE
page_cache_mask(a)			PAGE_CACHE_MASK
page_cache_index(a, pos)		Calculate page number from position
page_cache_next(a, pos)			Page number of next page
page_cache_offset(a, pos)		Calculate offset into a page
page_cache_pos(a, index, offset)	Form position based on page number and an offset.

Signed-off-by: Christoph Lameter --- include/linux/pagemap.h | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:30:50.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:44:12.000000000 -0700 @@ -62,6 +62,48 @@ static inline void set_mapping_order(str #define PAGE_CACHE_MASK PAGE_MASK #define PAGE_CACHE_ALIGN(addr) (((addr)+PAGE_CACHE_SIZE-1)&PAGE_CACHE_MASK) +static inline int page_cache_shift(struct address_space *a) +{ + return a->order + PAGE_SHIFT; +} + +static inline unsigned int page_cache_size(struct address_space *a) +{ + return PAGE_SIZE << a->order; +} + +static inline loff_t page_cache_mask(struct address_space *a) +{ + return (loff_t)PAGE_MASK << a->order; +} + +static inline unsigned int page_cache_offset(struct address_space *a, + loff_t pos) +{ + return pos & ~(PAGE_MASK << a->order); +} + +static inline pgoff_t page_cache_index(struct address_space *a, + loff_t pos) +{ + return pos >> page_cache_shift(a); +} + +/* + * Index of the page starting on or after the given position. + */ +static inline pgoff_t page_cache_next(struct address_space *a, + loff_t pos) +{ + return page_cache_index(a, pos + page_cache_size(a) - 1); +} + +static inline loff_t page_cache_pos(struct address_space *a, + pgoff_t index, unsigned long offset) +{ + return ((loff_t)index << page_cache_shift(a)) + offset; +} + #define page_cache_get(page) get_page(page) #define page_cache_release(page) put_page(page) void release_pages(struct page **pages, int nr, int cold); -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062129.967621050@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:13 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 06/16] Variable Page Cache: Add VM_BUG_ONs to check for correct page order Content-Disposition: inline; filename=var_pc_guards Before we start changing the page order we had better get some debugging in there that trips us up whenever a wrong order page shows up in a mapping. This will be helpful for converting new filesystems to utilize higher orders.
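To make the helpers from the previous patch concrete, and to show the invariant the checks below assert, here is a small illustrative sketch (example_check_pos() is hypothetical, not part of the patch); it assumes a 4k base page and mapping->order == 2, i.e. 16k page cache pages:

#include <linux/pagemap.h>

static void example_check_pos(struct address_space *mapping, struct page *page)
{
	loff_t pos = 100000;

	/* Every page in a mapping must match the mapping's order. */
	VM_BUG_ON(mapping->order != compound_order(page));

	/* With PAGE_SHIFT == 12 and mapping->order == 2: */
	BUG_ON(page_cache_shift(mapping) != 14);		/* 12 + 2 */
	BUG_ON(page_cache_size(mapping) != 16384);		/* 4096 << 2 */
	BUG_ON(page_cache_index(mapping, pos) != 6);		/* 100000 >> 14 */
	BUG_ON(page_cache_offset(mapping, pos) != 1696);	/* 100000 & 16383 */
	BUG_ON(page_cache_pos(mapping, 6, 1696) != pos);	/* (6 << 14) + 1696 */
}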
Signed-off-by: Christoph Lameter --- mm/filemap.c | 19 ++++++++++++++++--- 1 file changed, 16 insertions(+), 3 deletions(-) Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:54:00.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700 @@ -127,6 +127,7 @@ void remove_from_page_cache(struct page struct address_space *mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping->order != compound_order(page)); write_lock_irq(&mapping->tree_lock); __remove_from_page_cache(page); @@ -268,6 +269,7 @@ int wait_on_page_writeback_range(struct if (page->index > end) continue; + VM_BUG_ON(mapping->order != compound_order(page)); wait_on_page_writeback(page); if (PageError(page)) ret = -EIO; @@ -439,6 +441,7 @@ int add_to_page_cache(struct page *page, { int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM); + VM_BUG_ON(mapping->order != compound_order(page)); if (error == 0) { write_lock_irq(&mapping->tree_lock); error = radix_tree_insert(&mapping->page_tree, offset, page); @@ -598,8 +601,10 @@ struct page * find_get_page(struct addre read_lock_irq(&mapping->tree_lock); page = radix_tree_lookup(&mapping->page_tree, offset); - if (page) + if (page) { + VM_BUG_ON(mapping->order != compound_order(page)); page_cache_get(page); + } read_unlock_irq(&mapping->tree_lock); return page; } @@ -624,6 +629,7 @@ struct page *find_lock_page(struct addre repeat: page = radix_tree_lookup(&mapping->page_tree, offset); if (page) { + VM_BUG_ON(mapping->order != compound_order(page)); page_cache_get(page); if (TestSetPageLocked(page)) { read_unlock_irq(&mapping->tree_lock); @@ -683,6 +689,7 @@ repeat: } else if (err == -EEXIST) goto repeat; } + VM_BUG_ON(mapping->order != compound_order(page)); if (cached_page) page_cache_release(cached_page); return page; @@ -714,8 +721,10 @@ unsigned find_get_pages(struct address_s read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages, start, nr_pages); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping->order != compound_order(pages[i])); page_cache_get(pages[i]); + } read_unlock_irq(&mapping->tree_lock); return ret; } @@ -745,6 +754,7 @@ unsigned find_get_pages_contig(struct ad if (pages[i]->mapping == NULL || pages[i]->index != index) break; + VM_BUG_ON(mapping->order != compound_order(pages[i])); page_cache_get(pages[i]); index++; } @@ -772,8 +782,10 @@ unsigned find_get_pages_tag(struct addre read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)pages, *index, nr_pages, tag); - for (i = 0; i < ret; i++) + for (i = 0; i < ret; i++) { + VM_BUG_ON(mapping->order != compound_order(pages[i])); page_cache_get(pages[i]); + } if (ret) *index = pages[ret - 1]->index + 1; read_unlock_irq(&mapping->tree_lock); @@ -2454,6 +2466,7 @@ int try_to_release_page(struct page *pag struct address_space * const mapping = page->mapping; BUG_ON(!PageLocked(page)); + VM_BUG_ON(mapping->order != compound_order(page)); if (PageWriteback(page)) return 0; -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.131716294@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:14 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim 
Levitsky Subject: [RFC 07/16] Variable Order Page Cache: Add clearing and flushing function Content-Disposition: inline; filename=var_pc_flush_zero Add a flushing and clearing function for higher order pages. These are provisional and will likely have to be optimized. Signed-off-by: Christoph Lameter --- include/linux/pagemap.h | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 17:37:24.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 17:37:39.000000000 -0700 @@ -250,6 +250,31 @@ static inline void wait_on_page_writebac extern void end_page_writeback(struct page *page); +/* Support for clearing higher order pages */ +static inline void clear_mapping_page(struct page *page) +{ + int nr_pages = base_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + clear_highpage(page + i); +} + +/* + * Support for flushing higher order pages. + * + * A bit stupid: On many platforms flushing the first page + * will flush any TLB starting there + */ +static inline void flush_mapping_page(struct page *page) +{ + int nr_pages = base_pages(page); + int i; + + for (i = 0; i < nr_pages; i++) + flush_dcache_page(page + i); +} + /* * Fault a userspace page into pagetables. Return non-zero on a fault. * -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.292552667@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:15 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 08/16] Variable Order Page Cache: Fixup fallback functions Content-Disposition: inline; filename=var_pc_libfs Fixup the fallback function in fs/libfs.c to be able to handle higher order page cache pages. FIXME: There is a use of kmap here that we leave unchanged (none of my testing platforms use highmem). There needs to be some way to clear higher order partial pages if a platform supports HIGHMEM. 
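Purely as an illustration of the kind of helper that FIXME asks for (zero_mapping_page_range() is hypothetical and not part of this patchset), a HIGHMEM-safe partial clear could map one base page at a time:

#include <linux/kernel.h>
#include <linux/highmem.h>
#include <linux/string.h>

/*
 * Zero the byte range [from, to) of a (possibly higher order) page,
 * mapping each base page separately so highmem pages work too.
 */
static void zero_mapping_page_range(struct page *page, unsigned from, unsigned to)
{
	while (from < to) {
		struct page *p = page + (from >> PAGE_SHIFT);
		unsigned offset = from & ~PAGE_MASK;
		unsigned len = min_t(unsigned, to - from, PAGE_SIZE - offset);
		void *kaddr = kmap_atomic(p, KM_USER0);

		memset(kaddr + offset, 0, len);
		flush_dcache_page(p);
		kunmap_atomic(kaddr, KM_USER0);
		from += len;
	}
}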
Signed-off-by: Christoph Lameter --- fs/libfs.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) Index: linux-2.6.21-rc7/fs/libfs.c =================================================================== --- linux-2.6.21-rc7.orig/fs/libfs.c 2007-04-22 17:28:04.000000000 -0700 +++ linux-2.6.21-rc7/fs/libfs.c 2007-04-22 17:38:58.000000000 -0700 @@ -320,8 +320,8 @@ int simple_rename(struct inode *old_dir, int simple_readpage(struct file *file, struct page *page) { - clear_highpage(page); - flush_dcache_page(page); + clear_mapping_page(page); + flush_mapping_page(page); SetPageUptodate(page); unlock_page(page); return 0; @@ -331,11 +331,15 @@ int simple_prepare_write(struct file *fi unsigned from, unsigned to) { if (!PageUptodate(page)) { - if (to - from != PAGE_CACHE_SIZE) { + if (to - from != page_cache_size(file->f_mapping)) { + /* + * Mapping to higher order pages need to be supported + * if higher order pages can be in highmem + */ void *kaddr = kmap_atomic(page, KM_USER0); memset(kaddr, 0, from); - memset(kaddr + to, 0, PAGE_CACHE_SIZE - to); - flush_dcache_page(page); + memset(kaddr + to, 0, page_cache_size(file->f_mapping) - to); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); } } @@ -345,8 +349,9 @@ int simple_prepare_write(struct file *fi int simple_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); if (!PageUptodate(page)) SetPageUptodate(page); -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.458545974@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:16 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 09/16] Variable Order Page Cache: Fix up mm/filemap.c Content-Disposition: inline; filename=var_pc_filemap Fix up the functions in mm/filemap.c to use the variable page cache. Like many of the following patches, this one is also pretty straightforward. 1. Convert the bit operations into calls of page_cache_xxx(mapping, ....) 2. Use the mapping flush function Doing this also cleans up the handling of page cache pages.
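The conversion pattern is mechanical; as a sketch (split_pos() is a hypothetical helper, not part of the patch), a file position is now split like this:

#include <linux/pagemap.h>

static void split_pos(struct address_space *mapping, loff_t pos,
		      pgoff_t *index, unsigned long *offset)
{
	/* before: index  = pos >> PAGE_CACHE_SHIFT;
	 *         offset = pos & ~PAGE_CACHE_MASK;	*/
	*index = page_cache_index(mapping, pos);
	*offset = page_cache_offset(mapping, pos);
}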
Signed-off-by: Christoph Lameter --- mm/filemap.c | 62 +++++++++++++++++++++++++++++------------------------------ 1 file changed, 31 insertions(+), 31 deletions(-) Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 21:59:15.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700 @@ -304,8 +304,8 @@ int wait_on_page_writeback_range(struct int sync_page_range(struct inode *inode, struct address_space *mapping, loff_t pos, loff_t count) { - pgoff_t start = pos >> PAGE_CACHE_SHIFT; - pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT; + pgoff_t start = page_cache_index(mapping, pos); + pgoff_t end = page_cache_index(mapping, pos + count - 1); int ret; if (!mapping_cap_writeback_dirty(mapping) || !count) @@ -336,8 +336,8 @@ EXPORT_SYMBOL(sync_page_range); int sync_page_range_nolock(struct inode *inode, struct address_space *mapping, loff_t pos, loff_t count) { - pgoff_t start = pos >> PAGE_CACHE_SHIFT; - pgoff_t end = (pos + count - 1) >> PAGE_CACHE_SHIFT; + pgoff_t start = page_cache_index(mapping, pos); + pgoff_t end = page_cache_index(mapping, pos + count - 1); int ret; if (!mapping_cap_writeback_dirty(mapping) || !count) @@ -366,7 +366,7 @@ int filemap_fdatawait(struct address_spa return 0; return wait_on_page_writeback_range(mapping, 0, - (i_size - 1) >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, i_size - 1)); } EXPORT_SYMBOL(filemap_fdatawait); @@ -414,8 +414,8 @@ int filemap_write_and_wait_range(struct /* See comment of filemap_write_and_wait() */ if (err != -EIO) { int err2 = wait_on_page_writeback_range(mapping, - lstart >> PAGE_CACHE_SHIFT, - lend >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, lstart), + page_cache_index(mapping, lend)); if (!err) err = err2; } @@ -888,27 +888,27 @@ void do_generic_mapping_read(struct addr struct file_ra_state ra = *_ra; cached_page = NULL; - index = *ppos >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, *ppos); next_index = index; prev_index = ra.prev_page; - last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; - offset = *ppos & ~PAGE_CACHE_MASK; + last_index = page_cache_next(mapping, *ppos + desc->count); + offset = page_cache_offset(mapping, *ppos); isize = i_size_read(inode); if (!isize) goto out; - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, isize - 1); for (;;) { struct page *page; unsigned long nr, ret; /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; + nr = page_cache_size(mapping); if (index >= end_index) { if (index > end_index) goto out; - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + nr = page_cache_offset(mapping, isize - 1) + 1; if (nr <= offset) { goto out; } @@ -935,7 +935,7 @@ page_ok: * before reading the page on the kernel side. */ if (mapping_writably_mapped(mapping)) - flush_dcache_page(page); + flush_mapping_page(page); /* * When (part of) the same page is read multiple times @@ -957,8 +957,8 @@ page_ok: */ ret = actor(desc, page, offset, nr); offset += ret; - index += offset >> PAGE_CACHE_SHIFT; - offset &= ~PAGE_CACHE_MASK; + index += page_cache_index(mapping, offset); + offset = page_cache_offset(mapping, offset); page_cache_release(page); if (ret == nr && desc->count) @@ -1022,16 +1022,16 @@ readpage: * another truncate extends the file - this is desired though). 
*/ isize = i_size_read(inode); - end_index = (isize - 1) >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, isize - 1); if (unlikely(!isize || index > end_index)) { page_cache_release(page); goto out; } /* nr is the maximum number of bytes to copy from this page */ - nr = PAGE_CACHE_SIZE; + nr = page_cache_size(mapping); if (index == end_index) { - nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1; + nr = page_cache_offset(mapping, isize - 1) + 1; if (nr <= offset) { page_cache_release(page); goto out; @@ -1074,7 +1074,7 @@ no_cached_page: out: *_ra = ra; - *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; + *ppos = page_cache_pos(mapping, index, offset); if (cached_page) page_cache_release(cached_page); if (filp) @@ -1270,8 +1270,8 @@ asmlinkage ssize_t sys_readahead(int fd, if (file) { if (file->f_mode & FMODE_READ) { struct address_space *mapping = file->f_mapping; - unsigned long start = offset >> PAGE_CACHE_SHIFT; - unsigned long end = (offset + count - 1) >> PAGE_CACHE_SHIFT; + unsigned long start = page_cache_index(mapping, offset); + unsigned long end = page_cache_index(mapping, offset + count - 1); unsigned long len = end - start + 1; ret = do_readahead(mapping, file, start, len); } @@ -2086,9 +2086,9 @@ generic_file_buffered_write(struct kiocb unsigned long offset; size_t copied; - offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */ - index = pos >> PAGE_CACHE_SHIFT; - bytes = PAGE_CACHE_SIZE - offset; + offset = page_cache_offset(mapping, pos); + index = page_cache_index(mapping, pos); + bytes = page_cache_size(mapping) - offset; /* Limit the size of the copy to the caller's write size */ bytes = min(bytes, count); @@ -2149,7 +2149,7 @@ generic_file_buffered_write(struct kiocb else copied = filemap_copy_from_user_iovec(page, offset, cur_iov, iov_base, bytes); - flush_dcache_page(page); + flush_mapping_page(page); status = a_ops->commit_write(file, page, offset, offset+bytes); if (status == AOP_TRUNCATED_PAGE) { page_cache_release(page); @@ -2315,8 +2315,8 @@ __generic_file_aio_write_nolock(struct k if (err == 0) { written = written_buffered; invalidate_mapping_pages(mapping, - pos >> PAGE_CACHE_SHIFT, - endbyte >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, pos), + page_cache_index(mapping, endbyte)); } else { /* * We don't know how much we wrote, so just return @@ -2403,7 +2403,7 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE) { write_len = iov_length(iov, nr_segs); - end = (offset + write_len - 1) >> PAGE_CACHE_SHIFT; + end = page_cache_index(mapping, offset + write_len - 1); if (mapping_mapped(mapping)) unmap_mapping_range(mapping, offset, write_len, 0); } @@ -2420,7 +2420,7 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE && mapping->nrpages) { retval = invalidate_inode_pages2_range(mapping, - offset >> PAGE_CACHE_SHIFT, end); + page_cache_index(mapping, offset), end); if (retval) goto out; } @@ -2438,7 +2438,7 @@ generic_file_direct_IO(int rw, struct ki */ if (rw == WRITE && mapping->nrpages) { int err = invalidate_inode_pages2_range(mapping, - offset >> PAGE_CACHE_SHIFT, end); + page_cache_index(mapping, offset), end); if (err && retval >= 0) retval = err; } -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.623658661@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:17 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari 
Pulavarty , Maxim Levitsky Subject: [RFC 10/16] Variable Order Page Cache: Readahead fixups Content-Disposition: inline; filename=var_pc_readahead Readahead is now dependent on the page size. For larger page sizes we want less readahead. Add a parameter to max_sane_readahead specifying the page order and update the code in mm/readahead.c to be aware of variant page sizes. Mark the 2M readahead constant as a potential future problem. Signed-off-by: Christoph Lameter --- include/linux/mm.h | 2 +- mm/fadvise.c | 5 +++-- mm/filemap.c | 5 +++-- mm/madvise.c | 4 +++- mm/readahead.c | 20 +++++++++++++------- 5 files changed, 23 insertions(+), 13 deletions(-) Index: linux-2.6.21-rc7/include/linux/mm.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/mm.h 2007-04-22 21:48:22.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/mm.h 2007-04-22 22:04:44.000000000 -0700 @@ -1104,7 +1104,7 @@ unsigned long page_cache_readahead(struc unsigned long size); void handle_ra_miss(struct address_space *mapping, struct file_ra_state *ra, pgoff_t offset); -unsigned long max_sane_readahead(unsigned long nr); +unsigned long max_sane_readahead(unsigned long nr, int order); /* Do stack extension */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); Index: linux-2.6.21-rc7/mm/fadvise.c =================================================================== --- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 21:47:41.000000000 -0700 +++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700 @@ -86,10 +86,11 @@ asmlinkage long sys_fadvise64_64(int fd, nrpages = end_index - start_index + 1; if (!nrpages) nrpages = ~0UL; - + ret = force_page_cache_readahead(mapping, file, start_index, - max_sane_readahead(nrpages)); + max_sane_readahead(nrpages, + mapping->order)); if (ret > 0) ret = 0; break; Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-22 22:03:09.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-22 22:04:44.000000000 -0700 @@ -1256,7 +1256,7 @@ do_readahead(struct address_space *mappi return -EINVAL; force_page_cache_readahead(mapping, filp, index, - max_sane_readahead(nr)); + max_sane_readahead(nr, mapping->order)); return 0; } @@ -1391,7 +1391,8 @@ retry_find: count_vm_event(PGMAJFAULT); } did_readaround = 1; - ra_pages = max_sane_readahead(file->f_ra.ra_pages); + ra_pages = max_sane_readahead(file->f_ra.ra_pages, + mapping->order); if (ra_pages) { pgoff_t start = 0; Index: linux-2.6.21-rc7/mm/madvise.c =================================================================== --- linux-2.6.21-rc7.orig/mm/madvise.c 2007-04-22 21:47:41.000000000 -0700 +++ linux-2.6.21-rc7/mm/madvise.c 2007-04-22 22:04:44.000000000 -0700 @@ -105,7 +105,9 @@ static long madvise_willneed(struct vm_a end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; force_page_cache_readahead(file->f_mapping, - file, start, max_sane_readahead(end - start)); + file, start, + max_sane_readahead(end - start, + file->f_mapping->order)); return 0; } Index: linux-2.6.21-rc7/mm/readahead.c =================================================================== --- linux-2.6.21-rc7.orig/mm/readahead.c 2007-04-22 21:47:41.000000000 -0700 +++ linux-2.6.21-rc7/mm/readahead.c 2007-04-22 22:06:47.000000000 -0700 @@ -152,7 +152,7 @@ int read_cache_pages(struct address_spac put_pages_list(pages); break; } - task_io_account_read(PAGE_CACHE_SIZE); + 
task_io_account_read(page_cache_size(mapping)); } pagevec_lru_add(&lru_pvec); return ret; @@ -276,7 +276,7 @@ __do_page_cache_readahead(struct address if (isize == 0) goto out; - end_index = ((isize - 1) >> PAGE_CACHE_SHIFT); + end_index = page_cache_index(mapping, isize - 1); /* * Preallocate as many pages as we will need. @@ -330,7 +330,11 @@ int force_page_cache_readahead(struct ad while (nr_to_read) { int err; - unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_CACHE_SIZE; + /* + * FIXME: Note the 2M constant here that may prove to + * be a problem if page sizes become bigger than one megabyte. + */ + unsigned long this_chunk = page_cache_index(mapping, 2 * 1024 * 1024); if (this_chunk > nr_to_read) this_chunk = nr_to_read; @@ -570,11 +574,13 @@ void handle_ra_miss(struct address_space } /* - * Given a desired number of PAGE_CACHE_SIZE readahead pages, return a + * Given a desired number of page order readahead pages, return a * sensible upper limit. */ -unsigned long max_sane_readahead(unsigned long nr) +unsigned long max_sane_readahead(unsigned long nr, int order) { - return min(nr, (node_page_state(numa_node_id(), NR_INACTIVE) - + node_page_state(numa_node_id(), NR_FREE_PAGES)) / 2); + unsigned long base_pages = node_page_state(numa_node_id(), NR_INACTIVE) + + node_page_state(numa_node_id(), NR_FREE_PAGES); + + return min(nr, (base_pages / 2) >> order); } -- From clameter@sgi.com Sun Apr 22 23:21:30 2007 Message-Id: <20070423062130.785519484@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:18 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 11/16] Variable Page Cache Size: Fix up reclaim counters Content-Disposition: inline; filename=var_pc_reclaim We can now reclaim larger pages. Adjust the VM counters to deal with it. Note that this currently does not make things work. For some reason we keep losing pages off the active lists and reclaim stalls at some point attempting to remove active pages from an empty active list. It seems that the removal from the active lists happens outside of reclaim ?!?
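The counting convention used below, shown as a minimal sketch (example_count_list() is hypothetical): list entries remain whole, possibly compound, pages, but all counters are kept in units of base pages:

#include <linux/mm.h>
#include <linux/list.h>

static unsigned long example_count_list(struct list_head *lru)
{
	struct page *page;
	unsigned long nr_base = 0;

	list_for_each_entry(page, lru, lru)
		nr_base += base_pages(page);	/* 1 << compound_order(page) */

	return nr_base;
}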
Signed-off-by: Christoph Lameter --- mm/vmscan.c | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) Index: linux-2.6.21-rc7/mm/vmscan.c =================================================================== --- linux-2.6.21-rc7.orig/mm/vmscan.c 2007-04-22 06:50:03.000000000 -0700 +++ linux-2.6.21-rc7/mm/vmscan.c 2007-04-22 17:19:35.000000000 -0700 @@ -471,14 +471,14 @@ static unsigned long shrink_page_list(st VM_BUG_ON(PageActive(page)); - sc->nr_scanned++; + sc->nr_scanned += base_pages(page); if (!sc->may_swap && page_mapped(page)) goto keep_locked; /* Double the slab pressure for mapped and swapcache pages */ if (page_mapped(page) || PageSwapCache(page)) - sc->nr_scanned++; + sc->nr_scanned += base_pages(page); if (PageWriteback(page)) goto keep_locked; @@ -581,7 +581,7 @@ static unsigned long shrink_page_list(st free_it: unlock_page(page); - nr_reclaimed++; + nr_reclaimed += base_pages(page); if (!pagevec_add(&freed_pvec, page)) __pagevec_release_nonlru(&freed_pvec); continue; @@ -627,7 +627,7 @@ static unsigned long isolate_lru_pages(u struct page *page; unsigned long scan; - for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { + for (scan = 0; scan < nr_to_scan && !list_empty(src); ) { struct list_head *target; page = lru_to_page(src); prefetchw_prev_lru_page(page, src, flags); @@ -644,10 +644,11 @@ static unsigned long isolate_lru_pages(u */ ClearPageLRU(page); target = dst; - nr_taken++; + nr_taken += base_pages(page); } /* else it is being freed elsewhere */ list_add(&page->lru, target); + scan += base_pages(page); } *scanned = scan; @@ -856,7 +857,7 @@ force_reclaim_mapped: ClearPageActive(page); list_move(&page->lru, &zone->inactive_list); - pgmoved++; + pgmoved += base_pages(page); if (!pagevec_add(&pvec, page)) { __mod_zone_page_state(zone, NR_INACTIVE, pgmoved); spin_unlock_irq(&zone->lru_lock); @@ -884,7 +885,7 @@ force_reclaim_mapped: SetPageLRU(page); VM_BUG_ON(!PageActive(page)); list_move(&page->lru, &zone->active_list); - pgmoved++; + pgmoved += base_pages(page); if (!pagevec_add(&pvec, page)) { __mod_zone_page_state(zone, NR_ACTIVE, pgmoved); pgmoved = 0; -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062130.952242003@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:19 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 12/16] Variable Order Page Cache: Fix up the writeback logic Content-Disposition: inline; filename=var_pc_writeback Nothing special here. Just the usual transformations. 
Signed-off-by: Christoph Lameter --- fs/sync.c | 8 ++++---- mm/fadvise.c | 8 ++++---- mm/page-writeback.c | 4 ++-- mm/truncate.c | 23 ++++++++++++----------- 4 files changed, 22 insertions(+), 21 deletions(-) Index: linux-2.6.21-rc7/mm/page-writeback.c =================================================================== --- linux-2.6.21-rc7.orig/mm/page-writeback.c 2007-04-22 21:47:34.000000000 -0700 +++ linux-2.6.21-rc7/mm/page-writeback.c 2007-04-22 22:08:35.000000000 -0700 @@ -606,8 +606,8 @@ int generic_writepages(struct address_sp index = mapping->writeback_index; /* Start from prev offset */ end = -1; } else { - index = wbc->range_start >> PAGE_CACHE_SHIFT; - end = wbc->range_end >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, wbc->range_start); + end = page_cache_index(mapping, wbc->range_end); if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; scanned = 1; Index: linux-2.6.21-rc7/fs/sync.c =================================================================== --- linux-2.6.21-rc7.orig/fs/sync.c 2007-04-22 21:47:34.000000000 -0700 +++ linux-2.6.21-rc7/fs/sync.c 2007-04-22 22:08:35.000000000 -0700 @@ -254,8 +254,8 @@ int do_sync_file_range(struct file *file ret = 0; if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) { ret = wait_on_page_writeback_range(mapping, - offset >> PAGE_CACHE_SHIFT, - endbyte >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, offset), + page_cache_index(mapping, endbyte)); if (ret < 0) goto out; } @@ -269,8 +269,8 @@ int do_sync_file_range(struct file *file if (flags & SYNC_FILE_RANGE_WAIT_AFTER) { ret = wait_on_page_writeback_range(mapping, - offset >> PAGE_CACHE_SHIFT, - endbyte >> PAGE_CACHE_SHIFT); + page_cache_index(mapping, offset), + page_cache_index(mapping, endbyte)); } out: return ret; Index: linux-2.6.21-rc7/mm/fadvise.c =================================================================== --- linux-2.6.21-rc7.orig/mm/fadvise.c 2007-04-22 22:04:44.000000000 -0700 +++ linux-2.6.21-rc7/mm/fadvise.c 2007-04-22 22:08:35.000000000 -0700 @@ -79,8 +79,8 @@ asmlinkage long sys_fadvise64_64(int fd, } /* First and last PARTIAL page! */ - start_index = offset >> PAGE_CACHE_SHIFT; - end_index = endbyte >> PAGE_CACHE_SHIFT; + start_index = page_cache_index(mapping, offset); + end_index = page_cache_index(mapping, endbyte); /* Careful about overflow on the "+1" */ nrpages = end_index - start_index + 1; @@ -101,8 +101,8 @@ asmlinkage long sys_fadvise64_64(int fd, filemap_flush(mapping); /* First and last FULL page! 
*/ - start_index = (offset+(PAGE_CACHE_SIZE-1)) >> PAGE_CACHE_SHIFT; - end_index = (endbyte >> PAGE_CACHE_SHIFT); + start_index = page_cache_next(mapping, offset); + end_index = page_cache_index(mapping, endbyte); if (end_index >= start_index) invalidate_mapping_pages(mapping, start_index, Index: linux-2.6.21-rc7/mm/truncate.c =================================================================== --- linux-2.6.21-rc7.orig/mm/truncate.c 2007-04-22 21:47:34.000000000 -0700 +++ linux-2.6.21-rc7/mm/truncate.c 2007-04-22 22:11:19.000000000 -0700 @@ -46,7 +46,8 @@ void do_invalidatepage(struct page *page static inline void truncate_partial_page(struct page *page, unsigned partial) { - memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial); + memclear_highpage_flush(page, partial, + (PAGE_SIZE << compound_order(page)) - partial); if (PagePrivate(page)) do_invalidatepage(page, partial); } @@ -94,7 +95,7 @@ truncate_complete_page(struct address_sp if (page->mapping != mapping) return; - cancel_dirty_page(page, PAGE_CACHE_SIZE); + cancel_dirty_page(page, page_cache_size(mapping)); if (PagePrivate(page)) do_invalidatepage(page, 0); @@ -156,9 +157,9 @@ invalidate_complete_page(struct address_ void truncate_inode_pages_range(struct address_space *mapping, loff_t lstart, loff_t lend) { - const pgoff_t start = (lstart + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT; + const pgoff_t start = page_cache_next(mapping, lstart); pgoff_t end; - const unsigned partial = lstart & (PAGE_CACHE_SIZE - 1); + const unsigned partial = page_cache_offset(mapping, lstart); struct pagevec pvec; pgoff_t next; int i; @@ -166,8 +167,9 @@ void truncate_inode_pages_range(struct a if (mapping->nrpages == 0) return; - BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1)); - end = (lend >> PAGE_CACHE_SHIFT); + BUG_ON(page_cache_offset(mapping, lend) != + page_cache_size(mapping) - 1); + end = page_cache_index(mapping, lend); pagevec_init(&pvec, 0); next = start; @@ -402,9 +404,8 @@ int invalidate_inode_pages2_range(struct * Zap the rest of the file in one hit. */ unmap_mapping_range(mapping, (loff_t)page_index< References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:20 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 13/16] Variable Order Page Cache: Fixes to the block layer Content-Disposition: inline; filename=var_pc_buffer_head Fix up (at least some pieces of) the block layer. It already has some flexibility. Extend that for larger page sizes. set_blocksize is changed to allow specifying a blocksize larger than a page. If that occurs then we switch the device to use compound pages.
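For example (example_use_64k_blocks() is a hypothetical caller, assuming a 4k PAGE_SIZE and a device whose hardware sector size permits it), a 64k block size now switches the block device mapping to order 4 compound pages:

#include <linux/fs.h>

static int example_use_64k_blocks(struct block_device *bdev)
{
	int err = set_blocksize(bdev, 65536);	/* rejected before this patch */

	if (!err)
		/* blksize_bits(65536) == 16, so order == 16 - PAGE_SHIFT == 4 */
		BUG_ON(bdev->bd_inode->i_mapping->order != 4);
	return err;
}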
Signed-off-by: Christoph Lameter --- fs/block_dev.c | 22 ++++++--- fs/buffer.c | 101 +++++++++++++++++++++++--------------------- fs/inode.c | 5 +- fs/mpage.c | 34 +++++++------- include/linux/buffer_head.h | 9 +++ 5 files changed, 100 insertions(+), 71 deletions(-) Index: linux-2.6.21-rc7/include/linux/buffer_head.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/buffer_head.h 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/buffer_head.h 2007-04-22 22:14:41.000000000 -0700 @@ -129,7 +129,14 @@ BUFFER_FNS(Ordered, ordered) BUFFER_FNS(Eopnotsupp, eopnotsupp) BUFFER_FNS(Unwritten, unwritten) -#define bh_offset(bh) ((unsigned long)(bh)->b_data & ~PAGE_MASK) +static inline unsigned long bh_offset(struct buffer_head *bh) +{ + /* Cannot use the mapping since it may be set to NULL. */ + unsigned long mask = ~(PAGE_MASK << compound_order(bh->b_page)); + + return (unsigned long)bh->b_data & mask; +} + #define touch_buffer(bh) mark_page_accessed(bh->b_page) /* If we *know* page->private refers to buffer_heads */ Index: linux-2.6.21-rc7/fs/block_dev.c =================================================================== --- linux-2.6.21-rc7.orig/fs/block_dev.c 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/fs/block_dev.c 2007-04-22 22:11:44.000000000 -0700 @@ -60,12 +60,12 @@ static void kill_bdev(struct block_devic { invalidate_bdev(bdev, 1); truncate_inode_pages(bdev->bd_inode->i_mapping, 0); -} +} int set_blocksize(struct block_device *bdev, int size) { - /* Size must be a power of two, and between 512 and PAGE_SIZE */ - if (size > PAGE_SIZE || size < 512 || (size & (size-1))) + /* Size must be a power of two, and greater than 512 */ + if (size < 512 || (size & (size-1))) return -EINVAL; /* Size cannot be smaller than the size supported by the device */ @@ -74,10 +74,16 @@ int set_blocksize(struct block_device *b /* Don't change the size if it is same as current */ if (bdev->bd_block_size != size) { + int bits = blksize_bits(size); + struct address_space *mapping = + bdev->bd_inode->i_mapping; + sync_blockdev(bdev); - bdev->bd_block_size = size; - bdev->bd_inode->i_blkbits = blksize_bits(size); kill_bdev(bdev); + bdev->bd_block_size = size; + bdev->bd_inode->i_blkbits = bits; + set_mapping_order(mapping, + bits < PAGE_SHIFT ? 0 : bits - PAGE_SHIFT); } return 0; } @@ -88,8 +94,10 @@ int sb_set_blocksize(struct super_block { if (set_blocksize(sb->s_bdev, size)) return 0; - /* If we get here, we know size is power of two - * and it's value is between 512 and PAGE_SIZE */ + /* + * If we get here, we know size is power of two + * and it's value is larger than 512 + */ sb->s_blocksize = size; sb->s_blocksize_bits = blksize_bits(size); return sb->s_blocksize; Index: linux-2.6.21-rc7/fs/buffer.c =================================================================== --- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/fs/buffer.c 2007-04-22 22:11:44.000000000 -0700 @@ -259,7 +259,7 @@ __find_get_block_slow(struct block_devic struct page *page; int all_mapped = 1; - index = block >> (PAGE_CACHE_SHIFT - bd_inode->i_blkbits); + index = block >> (page_cache_shift(bd_mapping) - bd_inode->i_blkbits); page = find_get_page(bd_mapping, index); if (!page) goto out; @@ -733,7 +733,7 @@ int __set_page_dirty_buffers(struct page if (page->mapping) { /* Race with truncate? 
*/ if (mapping_cap_account_dirty(mapping)) { __inc_zone_page_state(page, NR_FILE_DIRTY); - task_io_account_write(PAGE_CACHE_SIZE); + task_io_account_write(page_cache_size(mapping)); } radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); @@ -879,10 +879,13 @@ struct buffer_head *alloc_page_buffers(s { struct buffer_head *bh, *head; long offset; + unsigned page_size = page_cache_size(page->mapping); + + BUG_ON(size > page_size); try_again: head = NULL; - offset = PAGE_SIZE; + offset = page_size; while ((offset -= size) >= 0) { bh = alloc_buffer_head(GFP_NOFS); if (!bh) @@ -1080,7 +1083,7 @@ __getblk_slow(struct block_device *bdev, { /* Size must be multiple of hard sectorsize */ if (unlikely(size & (bdev_hardsect_size(bdev)-1) || - (size < 512 || size > PAGE_SIZE))) { + size < 512)) { printk(KERN_ERR "getblk(): invalid block size %d requested\n", size); printk(KERN_ERR "hardsect size: %d\n", @@ -1417,7 +1420,7 @@ void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { bh->b_page = page; - BUG_ON(offset >= PAGE_SIZE); + VM_BUG_ON(offset >= page_cache_size(page->mapping)); if (PageHighMem(page)) /* * This catches illegal uses and preserves the offset: @@ -1766,8 +1769,8 @@ static int __block_prepare_write(struct struct buffer_head *bh, *head, *wait[2], **wait_bh=wait; BUG_ON(!PageLocked(page)); - BUG_ON(from > PAGE_CACHE_SIZE); - BUG_ON(to > PAGE_CACHE_SIZE); + BUG_ON(from > page_cache_size(inode->i_mapping)); + BUG_ON(to > page_cache_size(inode->i_mapping)); BUG_ON(from > to); blocksize = 1 << inode->i_blkbits; @@ -1776,7 +1779,7 @@ static int __block_prepare_write(struct head = page_buffers(page); bbits = inode->i_blkbits; - block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); + block = (sector_t)page->index << (page_cache_shift(inode->i_mapping) - bbits); for(bh = head, block_start = 0; bh != head || !block_start; block++, block_start=block_end, bh = bh->b_this_page) { @@ -1934,7 +1937,7 @@ int block_read_full_page(struct page *pa create_empty_buffers(page, blocksize, 0); head = page_buffers(page); - iblock = (sector_t)page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + iblock = (sector_t)page->index << (page_cache_shift(page->mapping) - inode->i_blkbits); lblock = (i_size_read(inode)+blocksize-1) >> inode->i_blkbits; bh = head; nr = 0; @@ -1957,7 +1960,7 @@ int block_read_full_page(struct page *pa if (!buffer_mapped(bh)) { void *kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + i * blocksize, 0, blocksize); - flush_dcache_page(page); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); if (!err) set_buffer_uptodate(bh); @@ -2058,10 +2061,11 @@ out: int generic_cont_expand(struct inode *inode, loff_t size) { + struct address_space *mapping = inode->i_mapping; pgoff_t index; unsigned int offset; - offset = (size & (PAGE_CACHE_SIZE - 1)); /* Within page */ + offset = page_cache_offset(mapping, size); /* ugh. in prepare/commit_write, if from==to==start of block, we ** skip the prepare. make sure we never send an offset for the start @@ -2071,7 +2075,7 @@ int generic_cont_expand(struct inode *in /* caller must handle this extra byte. 
*/ offset++; } - index = size >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, size); return __generic_cont_expand(inode, size, index, offset); } @@ -2079,8 +2083,8 @@ int generic_cont_expand(struct inode *in int generic_cont_expand_simple(struct inode *inode, loff_t size) { loff_t pos = size - 1; - pgoff_t index = pos >> PAGE_CACHE_SHIFT; - unsigned int offset = (pos & (PAGE_CACHE_SIZE - 1)) + 1; + pgoff_t index = page_cache_index(inode->i_mapping, pos); + unsigned int offset = page_cache_offset(inode->i_mapping, pos) + 1; /* prepare/commit_write can handle even if from==to==start of block. */ return __generic_cont_expand(inode, size, index, offset); @@ -2103,31 +2107,32 @@ int cont_prepare_write(struct page *page unsigned blocksize = 1 << inode->i_blkbits; void *kaddr; - while(page->index > (pgpos = *bytes>>PAGE_CACHE_SHIFT)) { + while(page->index > (pgpos = page_cache_index(mapping, *bytes))) { status = -ENOMEM; new_page = grab_cache_page(mapping, pgpos); if (!new_page) goto out; /* we might sleep */ - if (*bytes>>PAGE_CACHE_SHIFT != pgpos) { + if (page_cache_index(mapping, *bytes) != pgpos) { unlock_page(new_page); page_cache_release(new_page); continue; } - zerofrom = *bytes & ~PAGE_CACHE_MASK; + zerofrom = page_cache_offset(mapping, *bytes); if (zerofrom & (blocksize-1)) { *bytes |= (blocksize-1); (*bytes)++; } status = __block_prepare_write(inode, new_page, zerofrom, - PAGE_CACHE_SIZE, get_block); + page_cache_size(mapping), get_block); if (status) goto out_unmap; + /* Need higher order kmap?? */ kaddr = kmap_atomic(new_page, KM_USER0); - memset(kaddr+zerofrom, 0, PAGE_CACHE_SIZE-zerofrom); + memset(kaddr+zerofrom, 0, page_cache_size(mapping)-zerofrom); flush_dcache_page(new_page); kunmap_atomic(kaddr, KM_USER0); - generic_commit_write(NULL, new_page, zerofrom, PAGE_CACHE_SIZE); + generic_commit_write(NULL, new_page, zerofrom, page_cache_size(mapping)); unlock_page(new_page); page_cache_release(new_page); } @@ -2137,7 +2142,7 @@ int cont_prepare_write(struct page *page zerofrom = offset; } else { /* page covers the boundary, find the boundary offset */ - zerofrom = *bytes & ~PAGE_CACHE_MASK; + zerofrom = page_cache_offset(mapping, *bytes); /* if we will expand the thing last block will be filled */ if (to > zerofrom && (zerofrom & (blocksize-1))) { @@ -2192,8 +2197,9 @@ int block_commit_write(struct page *page int generic_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); __block_commit_write(inode,page,from,to); /* * No need to use i_size_read() here, the i_size @@ -2235,6 +2241,7 @@ static void end_buffer_read_nobh(struct int nobh_prepare_write(struct page *page, unsigned from, unsigned to, get_block_t *get_block) { + struct address_space *mapping = page->mapping; struct inode *inode = page->mapping->host; const unsigned blkbits = inode->i_blkbits; const unsigned blocksize = 1 << blkbits; @@ -2242,6 +2249,7 @@ int nobh_prepare_write(struct page *page struct buffer_head *read_bh[MAX_BUF_PER_PAGE]; unsigned block_in_page; unsigned block_start; + unsigned page_size = page_cache_size(mapping); sector_t block_in_file; char *kaddr; int nr_reads = 0; @@ -2252,7 +2260,7 @@ int nobh_prepare_write(struct page *page if (PageMappedToDisk(page)) return 0; - block_in_file = (sector_t)page->index << 
(PAGE_CACHE_SHIFT - blkbits); + block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits); map_bh.b_page = page; /* @@ -2261,7 +2269,7 @@ int nobh_prepare_write(struct page *page * page is fully mapped-to-disk. */ for (block_start = 0, block_in_page = 0; - block_start < PAGE_CACHE_SIZE; + block_start < page_size; block_in_page++, block_start += blocksize) { unsigned block_end = block_start + blocksize; int create; @@ -2288,7 +2296,7 @@ int nobh_prepare_write(struct page *page memset(kaddr+block_start, 0, from-block_start); if (block_end > to) memset(kaddr + to, 0, block_end - to); - flush_dcache_page(page); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); continue; } @@ -2356,8 +2364,8 @@ failed: * so we'll later zero out any blocks which _were_ allocated. */ kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr, 0, PAGE_CACHE_SIZE); - flush_dcache_page(page); + memset(kaddr, 0, page_size); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); SetPageUptodate(page); set_page_dirty(page); @@ -2372,8 +2380,9 @@ EXPORT_SYMBOL(nobh_prepare_write); int nobh_commit_write(struct file *file, struct page *page, unsigned from, unsigned to) { - struct inode *inode = page->mapping->host; - loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; + loff_t pos = page_cache_pos(mapping, page->index, to); SetPageUptodate(page); set_page_dirty(page); @@ -2395,7 +2404,7 @@ int nobh_writepage(struct page *page, ge { struct inode * const inode = page->mapping->host; loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; + const pgoff_t end_index = page_cache_offset(page->mapping, i_size); unsigned offset; void *kaddr; int ret; @@ -2405,7 +2414,7 @@ int nobh_writepage(struct page *page, ge goto out; /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size & (PAGE_CACHE_SIZE-1); + offset = page_cache_offset(page->mapping, i_size); if (page->index >= end_index+1 || !offset) { /* * The page may have dirty, unmapped buffers. For example, @@ -2429,7 +2438,7 @@ int nobh_writepage(struct page *page, ge * writes to that region are not written out to the file." 
*/ kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); + memset(kaddr + offset, 0, page_cache_size(page->mapping) - offset); flush_dcache_page(page); kunmap_atomic(kaddr, KM_USER0); out: @@ -2447,8 +2456,8 @@ int nobh_truncate_page(struct address_sp { struct inode *inode = mapping->host; unsigned blocksize = 1 << inode->i_blkbits; - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); + pgoff_t index = page_cache_index(mapping, from); + unsigned offset = page_cache_offset(mapping, from); unsigned to; struct page *page; const struct address_space_operations *a_ops = mapping->a_ops; @@ -2467,8 +2476,8 @@ int nobh_truncate_page(struct address_sp ret = a_ops->prepare_write(NULL, page, offset, to); if (ret == 0) { kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); + memset(kaddr + offset, 0, page_cache_size(mapping) - offset); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); /* * It would be more correct to call aops->commit_write() @@ -2487,8 +2496,8 @@ EXPORT_SYMBOL(nobh_truncate_page); int block_truncate_page(struct address_space *mapping, loff_t from, get_block_t *get_block) { - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); + pgoff_t index = page_cache_index(mapping, from); + unsigned offset = page_cache_offset(mapping, from); unsigned blocksize; sector_t iblock; unsigned length, pos; @@ -2506,7 +2515,7 @@ int block_truncate_page(struct address_s return 0; length = blocksize - length; - iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); + iblock = (sector_t)index << (page_cache_shift(mapping) - inode->i_blkbits); page = grab_cache_page(mapping, index); err = -ENOMEM; @@ -2551,7 +2560,7 @@ int block_truncate_page(struct address_s kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + offset, 0, length); - flush_dcache_page(page); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); mark_buffer_dirty(bh); @@ -2572,7 +2581,7 @@ int block_write_full_page(struct page *p { struct inode * const inode = page->mapping->host; loff_t i_size = i_size_read(inode); - const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT; + const pgoff_t end_index = page_cache_index(page->mapping, i_size); unsigned offset; void *kaddr; @@ -2581,7 +2590,7 @@ int block_write_full_page(struct page *p return __block_write_full_page(inode, page, get_block, wbc); /* Is the page fully outside i_size? (truncate in progress) */ - offset = i_size & (PAGE_CACHE_SIZE-1); + offset = page_cache_offset(page->mapping, i_size); if (page->index >= end_index+1 || !offset) { /* * The page may have dirty, unmapped buffers. For example, @@ -2601,8 +2610,8 @@ int block_write_full_page(struct page *p * writes to that region are not written out to the file." */ kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); + memset(kaddr + offset, 0, page_cache_size(page->mapping) - offset); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); return __block_write_full_page(inode, page, get_block, wbc); } @@ -2857,7 +2866,7 @@ int try_to_free_buffers(struct page *pag * dirty bit from being lost. 
*/ if (ret) - cancel_dirty_page(page, PAGE_CACHE_SIZE); + cancel_dirty_page(page, page_cache_size(mapping)); spin_unlock(&mapping->private_lock); out: if (buffers_to_free) { Index: linux-2.6.21-rc7/fs/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/inode.c 2007-04-22 21:52:18.000000000 -0700 +++ linux-2.6.21-rc7/fs/inode.c 2007-04-22 22:11:44.000000000 -0700 @@ -145,7 +145,10 @@ static struct inode *alloc_inode(struct mapping->a_ops = &empty_aops; mapping->host = inode; mapping->flags = 0; - mapping->order = 0; + if (inode->i_blkbits > PAGE_SHIFT) + set_mapping_order(mapping, inode->i_blkbits - PAGE_SHIFT); + else + set_mapping_order(mapping, 0); mapping_set_gfp_mask(mapping, GFP_HIGHUSER); mapping->assoc_mapping = NULL; mapping->backing_dev_info = &default_backing_dev_info; Index: linux-2.6.21-rc7/fs/mpage.c =================================================================== --- linux-2.6.21-rc7.orig/fs/mpage.c 2007-04-22 21:47:33.000000000 -0700 +++ linux-2.6.21-rc7/fs/mpage.c 2007-04-22 22:11:44.000000000 -0700 @@ -133,7 +133,8 @@ mpage_alloc(struct block_device *bdev, static void map_buffer_to_page(struct page *page, struct buffer_head *bh, int page_block) { - struct inode *inode = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; struct buffer_head *page_bh, *head; int block = 0; @@ -142,9 +143,9 @@ map_buffer_to_page(struct page *page, st * don't make any buffers if there is only one buffer on * the page and the page just needs to be set up to date */ - if (inode->i_blkbits == PAGE_CACHE_SHIFT && + if (inode->i_blkbits == page_cache_shift(mapping) && buffer_uptodate(bh)) { - SetPageUptodate(page); + SetPageUptodate(page); return; } create_empty_buffers(page, 1 << inode->i_blkbits, 0); @@ -177,9 +178,10 @@ do_mpage_readpage(struct bio *bio, struc sector_t *last_block_in_bio, struct buffer_head *map_bh, unsigned long *first_logical_block, get_block_t get_block) { - struct inode *inode = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *inode = mapping->host; const unsigned blkbits = inode->i_blkbits; - const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits; + const unsigned blocks_per_page = page_cache_size(mapping) >> blkbits; const unsigned blocksize = 1 << blkbits; sector_t block_in_file; sector_t last_block; @@ -196,7 +198,7 @@ do_mpage_readpage(struct bio *bio, struc if (page_has_buffers(page)) goto confused; - block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); + block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits); last_block = block_in_file + nr_pages * blocks_per_page; last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits; if (last_block > last_block_in_file) @@ -286,8 +288,8 @@ do_mpage_readpage(struct bio *bio, struc if (first_hole != blocks_per_page) { char *kaddr = kmap_atomic(page, KM_USER0); memset(kaddr + (first_hole << blkbits), 0, - PAGE_CACHE_SIZE - (first_hole << blkbits)); - flush_dcache_page(page); + page_cache_size(mapping) - (first_hole << blkbits)); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); if (first_hole == 0) { SetPageUptodate(page); @@ -465,7 +467,7 @@ __mpage_writepage(struct bio *bio, struc struct inode *inode = page->mapping->host; const unsigned blkbits = inode->i_blkbits; unsigned long end_index; - const unsigned blocks_per_page = PAGE_CACHE_SIZE >> blkbits; + const unsigned blocks_per_page = page_cache_size(mapping) >> 
blkbits; sector_t last_block; sector_t block_in_file; sector_t blocks[MAX_BUF_PER_PAGE]; @@ -533,7 +535,7 @@ __mpage_writepage(struct bio *bio, struc * The page has no buffers: map it to disk */ BUG_ON(!PageUptodate(page)); - block_in_file = (sector_t)page->index << (PAGE_CACHE_SHIFT - blkbits); + block_in_file = (sector_t)page->index << (page_cache_shift(mapping) - blkbits); last_block = (i_size - 1) >> blkbits; map_bh.b_page = page; for (page_block = 0; page_block < blocks_per_page; ) { @@ -565,7 +567,7 @@ __mpage_writepage(struct bio *bio, struc first_unmapped = page_block; page_is_mapped: - end_index = i_size >> PAGE_CACHE_SHIFT; + end_index = page_cache_index(mapping, i_size); if (page->index >= end_index) { /* * The page straddles i_size. It must be zeroed out on each @@ -575,14 +577,14 @@ page_is_mapped: * is zeroed when mapped, and writes to that region are not * written out to the file." */ - unsigned offset = i_size & (PAGE_CACHE_SIZE - 1); + unsigned offset = page_cache_offset(mapping, i_size); char *kaddr; if (page->index > end_index || !offset) goto confused; kaddr = kmap_atomic(page, KM_USER0); - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset); - flush_dcache_page(page); + memset(kaddr + offset, 0, page_cache_size(mapping) - offset); + flush_mapping_page(page); kunmap_atomic(kaddr, KM_USER0); } @@ -727,8 +729,8 @@ mpage_writepages(struct address_space *m index = mapping->writeback_index; /* Start from prev offset */ end = -1; } else { - index = wbc->range_start >> PAGE_CACHE_SHIFT; - end = wbc->range_end >> PAGE_CACHE_SHIFT; + index = page_cache_index(mapping, wbc->range_start); + end = page_cache_index(mapping, wbc->range_end); if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; scanned = 1; -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062131.279607407@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:21 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 14/16] Variable Order Page Cache: Add support to ramfs Content-Disposition: inline; filename=var_pc_ramfs The simplest file system to use is ramfs. Add a mount parameter that specifies the page order of the pages that ramfs should use. If the order is greater than zero then disable mmap functionality. This could be removed if the VM would be changes to support faulting higher order pages but for now we are content with buffered I/O on higher order pages. Note that ramfs does not use the lower layers (buffer I/O etc) so its the safest to use right now. If you apply this patch and then you can f.e. try this: mount -tramfs -o10 none /media Mounts a ramfs filesystem with order 10 pages (4 MB) cp linux-2.6.21-rc7.tar.gz /media Populate the ramfs. Note that we allocate 14 pages of 4M each instead of 13508.. 
umount /media Gets rid of the large pages again Signed-off-by: Christoph Lameter --- fs/ramfs/file-mmu.c | 11 +++++++++++ fs/ramfs/inode.c | 15 ++++++++++++--- include/linux/ramfs.h | 1 + 3 files changed, 24 insertions(+), 3 deletions(-) Index: linux-2.6.21-rc7/fs/ramfs/file-mmu.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ramfs/file-mmu.c 2007-04-18 21:46:38.000000000 -0700 +++ linux-2.6.21-rc7/fs/ramfs/file-mmu.c 2007-04-18 22:02:03.000000000 -0700 @@ -45,6 +45,17 @@ const struct file_operations ramfs_file_ .llseek = generic_file_llseek, }; +/* Higher order mappings do not support mmmap */ +const struct file_operations ramfs_file_higher_order_operations = { + .read = do_sync_read, + .aio_read = generic_file_aio_read, + .write = do_sync_write, + .aio_write = generic_file_aio_write, + .fsync = simple_sync_file, + .sendfile = generic_file_sendfile, + .llseek = generic_file_llseek, +}; + const struct inode_operations ramfs_file_inode_operations = { .getattr = simple_getattr, }; Index: linux-2.6.21-rc7/fs/ramfs/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ramfs/inode.c 2007-04-18 21:46:38.000000000 -0700 +++ linux-2.6.21-rc7/fs/ramfs/inode.c 2007-04-18 22:02:03.000000000 -0700 @@ -61,6 +61,7 @@ struct inode *ramfs_get_inode(struct sup inode->i_blocks = 0; inode->i_mapping->a_ops = &ramfs_aops; inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info; + inode->i_mapping->order = sb->s_blocksize_bits - PAGE_CACHE_SHIFT; inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; switch (mode & S_IFMT) { default: @@ -68,7 +69,10 @@ struct inode *ramfs_get_inode(struct sup break; case S_IFREG: inode->i_op = &ramfs_file_inode_operations; - inode->i_fop = &ramfs_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ramfs_file_higher_order_operations; + else + inode->i_fop = &ramfs_file_operations; break; case S_IFDIR: inode->i_op = &ramfs_dir_inode_operations; @@ -164,10 +168,15 @@ static int ramfs_fill_super(struct super { struct inode * inode; struct dentry * root; + int order = 0; + char *options = data; + + if (options && *options) + order = simple_strtoul(options, NULL, 10); sb->s_maxbytes = MAX_LFS_FILESIZE; - sb->s_blocksize = PAGE_CACHE_SIZE; - sb->s_blocksize_bits = PAGE_CACHE_SHIFT; + sb->s_blocksize = PAGE_CACHE_SIZE << order; + sb->s_blocksize_bits = order + PAGE_CACHE_SHIFT; sb->s_magic = RAMFS_MAGIC; sb->s_op = &ramfs_ops; sb->s_time_gran = 1; Index: linux-2.6.21-rc7/include/linux/ramfs.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/ramfs.h 2007-04-18 21:46:38.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/ramfs.h 2007-04-18 22:02:03.000000000 -0700 @@ -16,6 +16,7 @@ extern int ramfs_nommu_mmap(struct file #endif extern const struct file_operations ramfs_file_operations; +extern const struct file_operations ramfs_file_higher_order_operations; extern struct vm_operations_struct generic_file_vm_ops; extern int __init init_rootfs(void); -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062131.446138927@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:22 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 15/16] ext2: Add variable page size support 
Content-Disposition: inline; filename=var_pc_ext2 This adds variable page size support. It is then possible to mount filesystems that have a larger blocksize than the page size. F.e. the following is possible on x86_64 and i386 that have only a 4k page size. mke2fs -b 16384 /dev/hdd2 mount /dev/hdd2 /media ls -l /media .... Do more things with the volume that uses a 16k page cache size on a 4k page sized platform.. Note that there are issues with ext2 support: 1. Data is not writtten back correctly (block layer?) 2. Reclaim does not work right. And we disable mmap for higher order pages like also done for ramfs. This is temporary until we get support for mmapping higher order pages. Signed-off-by: Christoph Lameter --- fs/ext2/dir.c | 40 +++++++++++++++++++++++----------------- fs/ext2/ext2.h | 1 + fs/ext2/file.c | 18 ++++++++++++++++++ fs/ext2/inode.c | 10 ++++++++-- fs/ext2/namei.c | 10 ++++++++-- 5 files changed, 58 insertions(+), 21 deletions(-) Index: linux-2.6.21-rc7/fs/ext2/dir.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/dir.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/dir.c 2007-04-22 20:09:57.000000000 -0700 @@ -44,7 +44,8 @@ static inline void ext2_put_page(struct static inline unsigned long dir_pages(struct inode *inode) { - return (inode->i_size+PAGE_CACHE_SIZE-1)>>PAGE_CACHE_SHIFT; + return (inode->i_size+page_cache_size(inode->i_mapping)-1)>> + page_cache_shift(inode->i_mapping); } /* @@ -55,10 +56,11 @@ static unsigned ext2_last_byte(struct inode *inode, unsigned long page_nr) { unsigned last_byte = inode->i_size; + struct address_space *mapping = inode->i_mapping; - last_byte -= page_nr << PAGE_CACHE_SHIFT; - if (last_byte > PAGE_CACHE_SIZE) - last_byte = PAGE_CACHE_SIZE; + last_byte -= page_nr << page_cache_shift(mapping); + if (last_byte > page_cache_size(mapping)) + last_byte = page_cache_size(mapping); return last_byte; } @@ -77,18 +79,19 @@ static int ext2_commit_chunk(struct page static void ext2_check_page(struct page *page) { - struct inode *dir = page->mapping->host; + struct address_space *mapping = page->mapping; + struct inode *dir = mapping->host; struct super_block *sb = dir->i_sb; unsigned chunk_size = ext2_chunk_size(dir); char *kaddr = page_address(page); u32 max_inumber = le32_to_cpu(EXT2_SB(sb)->s_es->s_inodes_count); unsigned offs, rec_len; - unsigned limit = PAGE_CACHE_SIZE; + unsigned limit = page_cache_size(mapping); ext2_dirent *p; char *error; - if ((dir->i_size >> PAGE_CACHE_SHIFT) == page->index) { - limit = dir->i_size & ~PAGE_CACHE_MASK; + if (page_cache_index(mapping, dir->i_size) == page->index) { + limit = page_cache_offset(mapping, dir->i_size); if (limit & (chunk_size - 1)) goto Ebadsize; if (!limit) @@ -140,7 +143,7 @@ Einumber: bad_entry: ext2_error (sb, "ext2_check_page", "bad entry in directory #%lu: %s - " "offset=%lu, inode=%lu, rec_len=%d, name_len=%d", - dir->i_ino, error, (page->index<i_ino, error, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode), rec_len, p->name_len); goto fail; @@ -149,7 +152,7 @@ Eend: ext2_error (sb, "ext2_check_page", "entry in directory #%lu spans the page boundary" "offset=%lu, inode=%lu", - dir->i_ino, (page->index<i_ino, page_cache_pos(mapping, page->index, offs), (unsigned long) le32_to_cpu(p->inode)); fail: SetPageChecked(page); @@ -250,8 +253,9 @@ ext2_readdir (struct file * filp, void * loff_t pos = filp->f_pos; struct inode *inode = filp->f_path.dentry->d_inode; struct super_block *sb = 
inode->i_sb; - unsigned int offset = pos & ~PAGE_CACHE_MASK; - unsigned long n = pos >> PAGE_CACHE_SHIFT; + struct address_space *mapping = inode->i_mapping; + unsigned int offset = page_cache_offset(mapping, pos); + unsigned long n = page_cache_index(mapping, pos); unsigned long npages = dir_pages(inode); unsigned chunk_mask = ~(ext2_chunk_size(inode)-1); unsigned char *types = NULL; @@ -272,14 +276,14 @@ ext2_readdir (struct file * filp, void * ext2_error(sb, __FUNCTION__, "bad page in #%lu", inode->i_ino); - filp->f_pos += PAGE_CACHE_SIZE - offset; + filp->f_pos += page_cache_size(mapping) - offset; return -EIO; } kaddr = page_address(page); if (unlikely(need_revalidate)) { if (offset) { offset = ext2_validate_entry(kaddr, offset, chunk_mask); - filp->f_pos = (n<f_pos = page_cache_pos(mapping, n, offset); } filp->f_version = inode->i_version; need_revalidate = 0; @@ -302,7 +306,7 @@ ext2_readdir (struct file * filp, void * offset = (char *)de - kaddr; over = filldir(dirent, de->name, de->name_len, - (n<inode), d_type); if (over) { ext2_put_page(page); @@ -328,6 +332,7 @@ struct ext2_dir_entry_2 * ext2_find_entr struct dentry *dentry, struct page ** res_page) { const char *name = dentry->d_name.name; + struct address_space *mapping = dir->i_mapping; int namelen = dentry->d_name.len; unsigned reclen = EXT2_DIR_REC_LEN(namelen); unsigned long start, n; @@ -369,7 +374,7 @@ struct ext2_dir_entry_2 * ext2_find_entr if (++n >= npages) n = 0; /* next page is past the blocks we've got */ - if (unlikely(n > (dir->i_blocks >> (PAGE_CACHE_SHIFT - 9)))) { + if (unlikely(n > (dir->i_blocks >> (page_cache_shift(mapping) - 9)))) { ext2_error(dir->i_sb, __FUNCTION__, "dir %lu size %lld exceeds block count %llu", dir->i_ino, dir->i_size, @@ -438,6 +443,7 @@ void ext2_set_link(struct inode *dir, st int ext2_add_link (struct dentry *dentry, struct inode *inode) { struct inode *dir = dentry->d_parent->d_inode; + struct address_space *mapping = inode->i_mapping; const char *name = dentry->d_name.name; int namelen = dentry->d_name.len; unsigned chunk_size = ext2_chunk_size(dir); @@ -467,7 +473,7 @@ int ext2_add_link (struct dentry *dentry kaddr = page_address(page); dir_end = kaddr + ext2_last_byte(dir, n); de = (ext2_dirent *)kaddr; - kaddr += PAGE_CACHE_SIZE - reclen; + kaddr += page_cache_size(mapping) - reclen; while ((char *)de <= kaddr) { if ((char *)de == dir_end) { /* We hit i_size */ Index: linux-2.6.21-rc7/fs/ext2/ext2.h =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/ext2.h 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/ext2.h 2007-04-22 19:44:22.000000000 -0700 @@ -160,6 +160,7 @@ extern const struct file_operations ext2 /* file.c */ extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; +extern const struct file_operations ext2_no_mmap_file_operations; extern const struct file_operations ext2_xip_file_operations; /* inode.c */ Index: linux-2.6.21-rc7/fs/ext2/file.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/file.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/file.c 2007-04-22 19:44:22.000000000 -0700 @@ -58,6 +58,24 @@ const struct file_operations ext2_file_o .splice_write = generic_file_splice_write, }; +const struct file_operations ext2_no_mmap_file_operations = { + .llseek = generic_file_llseek, + .read = do_sync_read, + .write = do_sync_write, + .aio_read = 
generic_file_aio_read, + .aio_write = generic_file_aio_write, + .ioctl = ext2_ioctl, +#ifdef CONFIG_COMPAT + .compat_ioctl = ext2_compat_ioctl, +#endif + .open = generic_file_open, + .release = ext2_release_file, + .fsync = ext2_sync_file, + .sendfile = generic_file_sendfile, + .splice_read = generic_file_splice_read, + .splice_write = generic_file_splice_write, +}; + #ifdef CONFIG_EXT2_FS_XIP const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, Index: linux-2.6.21-rc7/fs/ext2/inode.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/inode.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/inode.c 2007-04-22 19:44:22.000000000 -0700 @@ -1128,10 +1128,16 @@ void ext2_read_inode (struct inode * ino inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } else { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } } else if (S_ISDIR(inode->i_mode)) { inode->i_op = &ext2_dir_inode_operations; Index: linux-2.6.21-rc7/fs/ext2/namei.c =================================================================== --- linux-2.6.21-rc7.orig/fs/ext2/namei.c 2007-04-22 19:43:05.000000000 -0700 +++ linux-2.6.21-rc7/fs/ext2/namei.c 2007-04-22 19:44:22.000000000 -0700 @@ -114,10 +114,16 @@ static int ext2_create (struct inode * d inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } else { inode->i_mapping->a_ops = &ext2_aops; - inode->i_fop = &ext2_file_operations; + if (inode->i_mapping->order) + inode->i_fop = &ext2_no_mmap_file_operations; + else + inode->i_fop = &ext2_file_operations; } mark_inode_dirty(inode); err = ext2_add_nondir(dentry, inode); -- From clameter@sgi.com Sun Apr 22 23:21:31 2007 Message-Id: <20070423062131.611972880@sgi.com> References: <20070423062107.843307112@sgi.com> User-Agent: quilt/0.45-1 Date: Sun, 22 Apr 2007 23:21:23 -0700 From: clameter@sgi.com To: linux-mm@kvack.org Cc: Mel Gorman , William Lee Irwin III , Adam Litke , David Chinner , Jens Axboe , Avi Kivity , Dave Hansen , Badari Pulavarty , Maxim Levitsky Subject: [RFC 16/16] Variable Order Page Cache: Alternate implementation of page cache macros Content-Disposition: inline; filename=var_pc_alternate Implement the page cache macros in a more efficient way by storing key values in the mapping. This reduces code size but increases inode size. 
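Before the diff, a quick consistency check on the cached-field variant. The sketch below is user space only: demo_mapping and the demo_* helpers are made-up stand-ins for struct address_space and the pagemap.h functions in this patch, with a 4k base page assumed. It asserts that the cached shift/offset_mask fields give the same size, offset and index as the plain order-based arithmetic they replace.

/*
 * User-space sketch, not kernel code.  Mirrors set_mapping_order() and
 * the page_cache_size/offset/index helpers from this patch.
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct demo_mapping {
	unsigned int order;		/* page order for allocations */
	unsigned int shift;		/* cached: order + PAGE_SHIFT */
	uint64_t offset_mask;		/* cached: page cache size - 1 */
};

static void demo_set_mapping_order(struct demo_mapping *m, int order)
{
	m->order = order;
	m->shift = order + PAGE_SHIFT;
	m->offset_mask = (1ULL << m->shift) - 1;
}

static unsigned int demo_page_cache_size(struct demo_mapping *m)
{
	return m->offset_mask + 1;	/* == PAGE_SIZE << order */
}

static unsigned int demo_page_cache_offset(struct demo_mapping *m, uint64_t pos)
{
	return pos & m->offset_mask;	/* == pos % page cache size */
}

static uint64_t demo_page_cache_index(struct demo_mapping *m, uint64_t pos)
{
	return pos >> m->shift;		/* == pos / page cache size */
}

int main(void)
{
	struct demo_mapping m;
	uint64_t pos = 100000;		/* arbitrary file offset */

	demo_set_mapping_order(&m, 2);	/* 16k page cache pages */

	assert(demo_page_cache_size(&m) == (PAGE_SIZE << m.order));
	assert(demo_page_cache_offset(&m, pos) == pos % demo_page_cache_size(&m));
	assert(demo_page_cache_index(&m, pos) == pos / demo_page_cache_size(&m));
	return 0;
}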
Signed-off-by: Christoph Lameter --- include/linux/fs.h | 4 +++- include/linux/pagemap.h | 13 +++++++------ 2 files changed, 10 insertions(+), 7 deletions(-) Index: linux-2.6.21-rc7/include/linux/fs.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/fs.h 2007-04-22 19:43:01.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/fs.h 2007-04-22 19:44:29.000000000 -0700 @@ -435,7 +435,9 @@ struct address_space { struct inode *host; /* owner: inode, block_device */ struct radix_tree_root page_tree; /* radix tree of all pages */ rwlock_t tree_lock; /* and rwlock protecting it */ - unsigned int order; /* Page order in this space */ + unsigned int shift; /* Shift for to get to the page number */ + unsigned int order; /* Page order for allocations */ + loff_t offset_mask; /* To mask out offset in page */ unsigned int i_mmap_writable;/* count VM_SHARED mappings */ struct prio_tree_root i_mmap; /* tree of private and shared mappings */ struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */ Index: linux-2.6.21-rc7/include/linux/pagemap.h =================================================================== --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-22 19:44:16.000000000 -0700 +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-22 19:46:23.000000000 -0700 @@ -42,7 +42,8 @@ static inline void mapping_set_gfp_mask( static inline void set_mapping_order(struct address_space *m, int order) { m->order = order; - + m->shift = order + PAGE_SHIFT; + m->offset_mask = (1UL << m->shift) -1; if (order) m->flags |= __GFP_COMP; else @@ -64,23 +65,23 @@ static inline void set_mapping_order(str static inline int page_cache_shift(struct address_space *a) { - return a->order + PAGE_SHIFT; + return a->shift; } static inline unsigned int page_cache_size(struct address_space *a) { - return PAGE_SIZE << a->order; + return a->offset_mask + 1; } static inline loff_t page_cache_mask(struct address_space *a) { - return (loff_t)PAGE_MASK << a->order; + return ~(loff_t)a->offset_mask; } static inline unsigned int page_cache_offset(struct address_space *a, loff_t pos) { - return pos & ~(PAGE_MASK << a->order); + return pos & a->offset_mask; } static inline pgoff_t page_cache_index(struct address_space *a, @@ -95,7 +96,7 @@ static inline pgoff_t page_cache_index(s static inline pgoff_t page_cache_next(struct address_space *a, loff_t pos) { - return page_cache_index(a, pos + page_cache_size(a) - 1); + return page_cache_index(a, pos + a->offset_mask); } static inline loff_t page_cache_pos(struct address_space *a, --
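A closing note on the helpers above: page_cache_next() is what truncate_inode_pages_range() in the truncate patch earlier in this series uses to find the first page that lies entirely past lstart. Its round-up behaviour can be checked with the same kind of stand-alone sketch; demo_index()/demo_next() and the order-2 mapping here are illustrative assumptions, not patch code.

/*
 * Sketch, not kernel code: page_cache_next()-style rounding for an
 * order-2 mapping (16k page cache pages on a 4k-page machine).
 */
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

static const unsigned int shift = PAGE_SHIFT + 2;		/* order 2 */
static const uint64_t offset_mask = (1ULL << (PAGE_SHIFT + 2)) - 1;

static uint64_t demo_index(uint64_t pos)	/* page containing pos */
{
	return pos >> shift;
}

static uint64_t demo_next(uint64_t pos)	/* first page starting at or after pos */
{
	return demo_index(pos + offset_mask);
}

int main(void)
{
	/* lstart on a page boundary: that page itself is fully truncated. */
	assert(demo_next(2 * 16384) == 2);

	/* lstart one byte into page 2: page 2 is partial, start from page 3. */
	assert(demo_next(2 * 16384 + 1) == 3);

	/* The partial-page offset that truncate_partial_page() would zero. */
	assert(((2 * 16384 + 1) & offset_mask) == 1);
	return 0;
}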