From clameter@sgi.com Mon Jun 18 02:31:17 2007 Message-Id: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:53 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 00/26] Current slab allocator / SLUB patch queue This queue contains the following groups of patches: 1. Slab allocator code consolidation and fixing of inconsistencies This makes ZERO_SIZE_PTR generic so that it works in all slab allocators. It adds __GFP_ZERO support to all slab allocators, cleans up the zeroing in the slab allocators and provides modifications to remove explicit zeroing following kmalloc_node and kmem_cache_alloc_node calls. 2. SLUB improvements Inline some small functions to reduce code size. Some more memory optimizations using CONFIG_SLUB_DEBUG. Changes to the handling of the slub_lock and an optimization of the runtime determination of kmalloc slabs (replaces the ilog2 patch that failed with gcc 3.3 on powerpc). 3. Slab defragmentation This is V3 of the patchset with the one fix for the locking problem that showed up during testing. 4. Performance optimizations These patches have a long history since the early drafts of SLUB. The problem with these patches is that they require touching additional cachelines (only for read), and SLUB was designed for minimal cacheline touching. In exchange we may be able to remove cacheline bouncing, in particular for remote alloc/free situations where I have had reports of issues that I was not able to confirm for lack of specificity. The tradeoffs here are not clear. Certainly the larger cacheline footprint will hurt the casual slab user somewhat, but it will benefit processes that perform these local/remote alloc/free operations. I'd appreciate it if someone could evaluate these. The complete patchset against 2.6.22-rc4-mm2 is available at http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub-patches/2.6.22-rc4-mm2 Tested on: x86_64 SMP, x86_64 NUMA emulation, IA64 emulator, Altix 64p/128G NUMA system, Altix 8p/6G asymmetric NUMA system. -- From clameter@sgi.com Mon Jun 18 02:31:18 2007 Message-Id: <20070618093117.839900398@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:54 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 01/26] SLUB Debug: Fix initial object debug state of NUMA bootstrap objects Content-Disposition: inline; filename=fix_numa_boostrap_debug The function we are calling to initialize object debug state during early NUMA bootstrap sets up an inactive object, giving it the wrong redzone signature. The bootstrap nodes are active objects and should have active redzone signatures; slab validation complains and reverts the object to active state. Call the functions that setup_object_debug() calls ourselves and specify that this is going to be an active object.
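In other words, the bootstrap path should do roughly the following (a sketch only, using the init_object()/init_tracking() helpers named in the hunk below; the wrapper function itself is hypothetical):

/*
 * Sketch: the bootstrap kmem_cache_node is an object that is in use,
 * so it must carry the "active" redzone signature.  setup_object_debug()
 * would mark it free/inactive, which slab validation later flags.
 */
static void __init init_bootstrap_node_debug(struct kmem_cache *s, void *n)
{
	init_object(s, n, 1);	/* 1 == active: the object is allocated */
	init_tracking(s, n);	/* record the allocation track as usual */
}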
Signed-off-by: Christoph Lameter Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 23:51:43.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-18 00:40:04.000000000 -0700 @@ -1925,7 +1925,8 @@ static struct kmem_cache_node * __init e page->freelist = get_freepointer(kmalloc_caches, n); page->inuse++; kmalloc_caches->node[node] = n; - setup_object_debug(kmalloc_caches, page, n); + init_object(kmalloc_caches, n, 1); + init_tracking(kmalloc_caches, n); init_kmem_cache_node(n); atomic_long_inc(&n->nr_slabs); add_partial(n, page); -- From clameter@sgi.com Mon Jun 18 02:31:18 2007 Message-Id: <20070618093118.023053339@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:55 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 02/26] Slab allocators: Consolidate code for krealloc in mm/util.c Content-Disposition: inline; filename=slab_allocators_consolidate_krealloc There is no reason to have separate functions for krealloc. The size of a kmalloc object is readily available via ksize(). ksize is provided by all allocators. Implement krealloc in mm/util.c and drop slab specific definitions of krealloc. Signed-off-by: Christoph Lameter --- mm/slab.c | 46 ---------------------------------------------- mm/slob.c | 33 --------------------------------- mm/slub.c | 37 ------------------------------------- mm/util.c | 34 ++++++++++++++++++++++++++++++++++ 4 files changed, 34 insertions(+), 116 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slab.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-16 18:58:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-16 18:58:10.000000000 -0700 @@ -3715,52 +3715,6 @@ EXPORT_SYMBOL(__kmalloc); #endif /** - * krealloc - reallocate memory. The contents will remain unchanged. - * @p: object to reallocate memory for. - * @new_size: how many bytes of memory are required. - * @flags: the type of memory to allocate. - * - * The contents of the object pointed to are preserved up to the - * lesser of the new and old sizes. If @p is %NULL, krealloc() - * behaves exactly like kmalloc(). If @size is 0 and @p is not a - * %NULL pointer, the object pointed to is freed. - */ -void *krealloc(const void *p, size_t new_size, gfp_t flags) -{ - struct kmem_cache *cache, *new_cache; - void *ret; - - if (unlikely(!p)) - return kmalloc_track_caller(new_size, flags); - - if (unlikely(!new_size)) { - kfree(p); - return NULL; - } - - cache = virt_to_cache(p); - new_cache = __find_general_cachep(new_size, flags); - - /* - * If new size fits in the current cache, bail out. - */ - if (likely(cache == new_cache)) - return (void *)p; - - /* - * We are on the slow-path here so do not use __cache_alloc - * because it bloats kernel text. - */ - ret = kmalloc_track_caller(new_size, flags); - if (ret) { - memcpy(ret, p, min(new_size, ksize(p))); - kfree(p); - } - return ret; -} -EXPORT_SYMBOL(krealloc); - -/** * kmem_cache_free - Deallocate an object * @cachep: The cache the allocation was from. * @objp: The previously allocated object. 
Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-16 18:58:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-16 18:58:10.000000000 -0700 @@ -407,39 +407,6 @@ void *__kmalloc(size_t size, gfp_t gfp) } EXPORT_SYMBOL(__kmalloc); -/** - * krealloc - reallocate memory. The contents will remain unchanged. - * - * @p: object to reallocate memory for. - * @new_size: how many bytes of memory are required. - * @flags: the type of memory to allocate. - * - * The contents of the object pointed to are preserved up to the - * lesser of the new and old sizes. If @p is %NULL, krealloc() - * behaves exactly like kmalloc(). If @size is 0 and @p is not a - * %NULL pointer, the object pointed to is freed. - */ -void *krealloc(const void *p, size_t new_size, gfp_t flags) -{ - void *ret; - - if (unlikely(!p)) - return kmalloc_track_caller(new_size, flags); - - if (unlikely(!new_size)) { - kfree(p); - return NULL; - } - - ret = kmalloc_track_caller(new_size, flags); - if (ret) { - memcpy(ret, p, min(new_size, ksize(p))); - kfree(p); - } - return ret; -} -EXPORT_SYMBOL(krealloc); - void kfree(const void *block) { struct slob_page *sp; Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-16 18:58:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-16 18:58:10.000000000 -0700 @@ -2476,43 +2476,6 @@ int kmem_cache_shrink(struct kmem_cache } EXPORT_SYMBOL(kmem_cache_shrink); -/** - * krealloc - reallocate memory. The contents will remain unchanged. - * @p: object to reallocate memory for. - * @new_size: how many bytes of memory are required. - * @flags: the type of memory to allocate. - * - * The contents of the object pointed to are preserved up to the - * lesser of the new and old sizes. If @p is %NULL, krealloc() - * behaves exactly like kmalloc(). If @size is 0 and @p is not a - * %NULL pointer, the object pointed to is freed. - */ -void *krealloc(const void *p, size_t new_size, gfp_t flags) -{ - void *ret; - size_t ks; - - if (unlikely(!p || p == ZERO_SIZE_PTR)) - return kmalloc(new_size, flags); - - if (unlikely(!new_size)) { - kfree(p); - return ZERO_SIZE_PTR; - } - - ks = ksize(p); - if (ks >= new_size) - return (void *)p; - - ret = kmalloc(new_size, flags); - if (ret) { - memcpy(ret, p, min(new_size, ks)); - kfree(p); - } - return ret; -} -EXPORT_SYMBOL(krealloc); - /******************************************************************** * Basic setup of slabs *******************************************************************/ Index: linux-2.6.22-rc4-mm2/mm/util.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/util.c 2007-06-16 18:58:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/util.c 2007-06-16 19:01:26.000000000 -0700 @@ -81,6 +81,40 @@ void *kmemdup(const void *src, size_t le } EXPORT_SYMBOL(kmemdup); +/** + * krealloc - reallocate memory. The contents will remain unchanged. + * @p: object to reallocate memory for. + * @new_size: how many bytes of memory are required. + * @flags: the type of memory to allocate. + * + * The contents of the object pointed to are preserved up to the + * lesser of the new and old sizes. If @p is %NULL, krealloc() + * behaves exactly like kmalloc(). If @size is 0 and @p is not a + * %NULL pointer, the object pointed to is freed. 
+ */ +void *krealloc(const void *p, size_t new_size, gfp_t flags) +{ + void *ret; + size_t ks; + + if (unlikely(!new_size)) { + kfree(p); + return NULL; + } + + ks = ksize(p); + if (ks >= new_size) + return (void *)p; + + ret = kmalloc_track_caller(new_size, flags); + if (ret) { + memcpy(ret, p, min(new_size, ks)); + kfree(p); + } + return ret; +} +EXPORT_SYMBOL(krealloc); + /* * strndup_user - duplicate an existing string from user space * @s: The string to duplicate -- From clameter@sgi.com Mon Jun 18 02:31:18 2007 Message-Id: <20070618093118.197791676@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:56 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 03/26] Slab allocators: Consistent ZERO_SIZE_PTR support and NULL result semantics Content-Disposition: inline; filename=slab_allocators_consolidate_zero_size_ptr Define ZERO_OR_NULL_PTR macro to be able to remove the ugly checks from the allocators. Move ZERO_SIZE_PTR related stuff into slab.h. Make ZERO_SIZE_PTR work for all slab allocators and get rid of the WARN_ON_ONCE(size == 0) that is still remaining in SLAB. Make slub also return NULL like the other allocators if a too large memory segment is requested via __kmalloc. Signed-off-by: Christoph Lameter --- include/linux/slab.h | 13 +++++++++++++ include/linux/slab_def.h | 12 ++++++++++++ include/linux/slub_def.h | 11 ----------- mm/slab.c | 14 ++++++++------ mm/slob.c | 13 ++++++++----- mm/slub.c | 29 ++++++++++++++++------------- mm/util.c | 2 +- 7 files changed, 58 insertions(+), 36 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slab.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h 2007-06-12 12:04:57.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab.h 2007-06-17 20:35:03.000000000 -0700 @@ -33,6 +33,19 @@ #define SLAB_RECLAIM_ACCOUNT 0x00020000UL /* Objects are reclaimable */ #define SLAB_TEMPORARY SLAB_RECLAIM_ACCOUNT /* Objects are short-lived */ /* + * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests. + * + * Dereferencing ZERO_SIZE_PTR will lead to a distinct access fault. + * + * ZERO_SIZE_PTR can be passed to kfree though in the same way that NULL can. + * Both make kfree a no-op. 
+ */ +#define ZERO_SIZE_PTR ((void *)16) + +#define ZERO_OR_NULL_PTR(x) ((unsigned long)(x) < \ + (unsigned long)ZERO_SIZE_PTR) + +/* * struct kmem_cache related prototypes */ void __init kmem_cache_init(void); Index: linux-2.6.22-rc4-mm2/include/linux/slab_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab_def.h 2007-06-04 17:57:25.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab_def.h 2007-06-17 20:35:03.000000000 -0700 @@ -29,6 +29,10 @@ static inline void *kmalloc(size_t size, { if (__builtin_constant_p(size)) { int i = 0; + + if (!size) + return ZERO_SIZE_PTR; + #define CACHE(x) \ if (size <= x) \ goto found; \ @@ -55,6 +59,10 @@ static inline void *kzalloc(size_t size, { if (__builtin_constant_p(size)) { int i = 0; + + if (!size) + return ZERO_SIZE_PTR; + #define CACHE(x) \ if (size <= x) \ goto found; \ @@ -84,6 +92,10 @@ static inline void *kmalloc_node(size_t { if (__builtin_constant_p(size)) { int i = 0; + + if (!size) + return ZERO_SIZE_PTR; + #define CACHE(x) \ if (size <= x) \ goto found; \ Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-17 19:10:33.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-17 20:35:03.000000000 -0700 @@ -160,17 +160,6 @@ static inline struct kmem_cache *kmalloc #endif -/* - * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests. - * - * Dereferencing ZERO_SIZE_PTR will lead to a distinct access fault. - * - * ZERO_SIZE_PTR can be passed to kfree though in the same way that NULL can. - * Both make kfree a no-op. - */ -#define ZERO_SIZE_PTR ((void *)16) - - static inline void *kmalloc(size_t size, gfp_t flags) { if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) { Index: linux-2.6.22-rc4-mm2/mm/slab.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-17 19:10:33.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-17 20:35:08.000000000 -0700 @@ -774,7 +774,9 @@ static inline struct kmem_cache *__find_ */ BUG_ON(malloc_sizes[INDEX_AC].cs_cachep == NULL); #endif - WARN_ON_ONCE(size == 0); + if (!size) + return ZERO_SIZE_PTR; + while (size > csizep->cs_size) csizep++; @@ -2340,7 +2342,7 @@ kmem_cache_create (const char *name, siz * this should not happen at all. * But leave a BUG_ON for some lucky dude. 
*/ - BUG_ON(!cachep->slabp_cache); + BUG_ON(ZERO_OR_NULL_PTR(cachep->slabp_cache)); } cachep->ctor = ctor; cachep->name = name; @@ -3642,8 +3644,8 @@ __do_kmalloc_node(size_t size, gfp_t fla struct kmem_cache *cachep; cachep = kmem_find_general_cachep(size, flags); - if (unlikely(cachep == NULL)) - return NULL; + if (unlikely(ZERO_OR_NULL_PTR(cachep))) + return cachep; return kmem_cache_alloc_node(cachep, flags, node); } @@ -3749,7 +3751,7 @@ void kfree(const void *objp) struct kmem_cache *c; unsigned long flags; - if (unlikely(!objp)) + if (unlikely(ZERO_OR_NULL_PTR(objp))) return; local_irq_save(flags); kfree_debugcheck(objp); @@ -4436,7 +4438,7 @@ const struct seq_operations slabstats_op */ size_t ksize(const void *objp) { - if (unlikely(objp == NULL)) + if (unlikely(ZERO_OR_NULL_PTR(objp))) return 0; return obj_size(virt_to_cache(objp)); Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-17 19:10:33.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-17 20:35:04.000000000 -0700 @@ -306,7 +306,7 @@ static void slob_free(void *block, int s slobidx_t units; unsigned long flags; - if (!block) + if (ZERO_OR_NULL_PTR(block)) return; BUG_ON(!size); @@ -384,11 +384,14 @@ out: void *__kmalloc(size_t size, gfp_t gfp) { + unsigned int *m; int align = max(ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); if (size < PAGE_SIZE - align) { - unsigned int *m; - m = slob_alloc(size + align, gfp, align); + if (!size) + return ZERO_SIZE_PTR; + + m = slob_alloc(size + align, gfp, align); if (m) *m = size; return (void *)m + align; @@ -411,7 +414,7 @@ void kfree(const void *block) { struct slob_page *sp; - if (!block) + if (ZERO_OR_NULL_PTR(block)) return; sp = (struct slob_page *)virt_to_page(block); @@ -430,7 +433,7 @@ size_t ksize(const void *block) { struct slob_page *sp; - if (!block) + if (ZERO_OR_NULL_PTR(block)) return 0; sp = (struct slob_page *)virt_to_page(block); Index: linux-2.6.22-rc4-mm2/mm/util.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/util.c 2007-06-17 19:10:33.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/util.c 2007-06-17 20:35:03.000000000 -0700 @@ -99,7 +99,7 @@ void *krealloc(const void *p, size_t new if (unlikely(!new_size)) { kfree(p); - return NULL; + return ZERO_SIZE_PTR; } ks = ksize(p); Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 19:10:33.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 20:35:04.000000000 -0700 @@ -2279,10 +2279,11 @@ static struct kmem_cache *get_slab(size_ int index = kmalloc_index(size); if (!index) - return NULL; + return ZERO_SIZE_PTR; /* Allocation too large? 
*/ - BUG_ON(index < 0); + if (index < 0) + return NULL; #ifdef CONFIG_ZONE_DMA if ((flags & SLUB_DMA)) { @@ -2323,9 +2324,10 @@ void *__kmalloc(size_t size, gfp_t flags { struct kmem_cache *s = get_slab(size, flags); - if (s) - return slab_alloc(s, flags, -1, __builtin_return_address(0)); - return ZERO_SIZE_PTR; + if (ZERO_OR_NULL_PTR(s)) + return s; + + return slab_alloc(s, flags, -1, __builtin_return_address(0)); } EXPORT_SYMBOL(__kmalloc); @@ -2334,9 +2336,10 @@ void *__kmalloc_node(size_t size, gfp_t { struct kmem_cache *s = get_slab(size, flags); - if (s) - return slab_alloc(s, flags, node, __builtin_return_address(0)); - return ZERO_SIZE_PTR; + if (ZERO_OR_NULL_PTR(s)) + return s; + + return slab_alloc(s, flags, node, __builtin_return_address(0)); } EXPORT_SYMBOL(__kmalloc_node); #endif @@ -2387,7 +2390,7 @@ void kfree(const void *x) * this comparison would be true for all "negative" pointers * (which would cover the whole upper half of the address space). */ - if ((unsigned long)x <= (unsigned long)ZERO_SIZE_PTR) + if (ZERO_OR_NULL_PTR(x)) return; page = virt_to_head_page(x); @@ -2706,8 +2709,8 @@ void *__kmalloc_track_caller(size_t size { struct kmem_cache *s = get_slab(size, gfpflags); - if (!s) - return ZERO_SIZE_PTR; + if (ZERO_OR_NULL_PTR(s)) + return s; return slab_alloc(s, gfpflags, -1, caller); } @@ -2717,8 +2720,8 @@ void *__kmalloc_node_track_caller(size_t { struct kmem_cache *s = get_slab(size, gfpflags); - if (!s) - return ZERO_SIZE_PTR; + if (ZERO_OR_NULL_PTR(s)) + return s; return slab_alloc(s, gfpflags, node, caller); } -- From clameter@sgi.com Mon Jun 18 02:31:18 2007 Message-Id: <20070618093118.360996596@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:57 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 04/26] Slab allocators: Support __GFP_ZERO in all allocators. Content-Disposition: inline; filename=slab_gfpzero A kernel convention for allocators is that if __GFP_ZERO is passed to an allocator (currently only vmalloc and the page allocator) then the memory that is returned should be zeroed. This is currently not supported by the slab allocators, and as a result it is also difficult to implement in derived allocators such as the uncached allocator and the pool allocators. The support for zeroing does not currently have a consistent API in the slab allocator: there are no zeroing allocator functions for NUMA node placement, and the zeroing allocations are only provided for the default allocations. __GFP_ZERO will make zeroing universally available and does not require any additional functions. So add the necessary logic to all slab allocators to support __GFP_ZERO. The code is added to the hot path. However, the gfp flags are on the stack and so the cacheline is readily available for checking if we want a zeroed object. Zeroing while allocating is now a frequent operation and we seem to be gradually approaching a 1:1 parity between zeroing and non-zeroing allocations: the current tree has 3476 uses of kmalloc vs. 2731 uses of kzalloc. It is therefore reasonable to check for zeroing in the hot path instead of having to rely on alternate code paths.
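The hot-path check that each allocator gains looks roughly like this (a sketch, not the literal SLAB/SLOB/SLUB code from the diffs below; the helper name is made up, and each allocator passes its own notion of the object size):

/*
 * Sketch of the post-allocation check added to each allocator's hot path.
 * "object" may be NULL on allocation failure; "size" is the object size
 * the allocator already knows.
 */
static inline void *maybe_zero_object(void *object, gfp_t gfpflags, size_t size)
{
	if (unlikely((gfpflags & __GFP_ZERO) && object))
		memset(object, 0, size);
	return object;
}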
Signed-off-by: Christoph Lameter --- mm/slab.c | 8 +++++++- mm/slob.c | 2 ++ mm/slub.c | 24 +++++++++++++++--------- 3 files changed, 24 insertions(+), 10 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-17 22:30:01.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-17 22:30:38.000000000 -0700 @@ -293,6 +293,8 @@ static void *slob_alloc(size_t size, gfp BUG_ON(!b); spin_unlock_irqrestore(&slob_lock, flags); } + if (unlikely((gfp & __GFP_ZERO) && b)) + memset(b, 0, size); return b; } Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 22:30:01.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 22:31:41.000000000 -0700 @@ -1087,7 +1087,7 @@ static struct page *new_slab(struct kmem void *last; void *p; - BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK)); + BUG_ON(flags & ~(GFP_DMA | __GFP_ZERO | GFP_LEVEL_MASK)); if (flags & __GFP_WAIT) local_irq_enable(); @@ -1550,7 +1550,7 @@ debug: * Otherwise we can simply pick the next object from the lockless free list. */ static void __always_inline *slab_alloc(struct kmem_cache *s, - gfp_t gfpflags, int node, void *addr) + gfp_t gfpflags, int node, void *addr, int length) { struct page *page; void **object; @@ -1568,19 +1568,25 @@ static void __always_inline *slab_alloc( page->lockless_freelist = object[page->offset]; } local_irq_restore(flags); + + if (unlikely((gfpflags & __GFP_ZERO) && object)) + memset(object, 0, length); + return object; } void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags) { - return slab_alloc(s, gfpflags, -1, __builtin_return_address(0)); + return slab_alloc(s, gfpflags, -1, + __builtin_return_address(0), s->objsize); } EXPORT_SYMBOL(kmem_cache_alloc); #ifdef CONFIG_NUMA void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node) { - return slab_alloc(s, gfpflags, node, __builtin_return_address(0)); + return slab_alloc(s, gfpflags, node, + __builtin_return_address(0), s->objsize); } EXPORT_SYMBOL(kmem_cache_alloc_node); #endif @@ -2327,7 +2333,7 @@ void *__kmalloc(size_t size, gfp_t flags if (ZERO_OR_NULL_PTR(s)) return s; - return slab_alloc(s, flags, -1, __builtin_return_address(0)); + return slab_alloc(s, flags, -1, __builtin_return_address(0), size); } EXPORT_SYMBOL(__kmalloc); @@ -2339,7 +2345,7 @@ void *__kmalloc_node(size_t size, gfp_t if (ZERO_OR_NULL_PTR(s)) return s; - return slab_alloc(s, flags, node, __builtin_return_address(0)); + return slab_alloc(s, flags, node, __builtin_return_address(0), size); } EXPORT_SYMBOL(__kmalloc_node); #endif @@ -2662,7 +2668,7 @@ void *kmem_cache_zalloc(struct kmem_cach { void *x; - x = slab_alloc(s, flags, -1, __builtin_return_address(0)); + x = slab_alloc(s, flags, -1, __builtin_return_address(0), 0); if (x) memset(x, 0, s->objsize); return x; @@ -2712,7 +2718,7 @@ void *__kmalloc_track_caller(size_t size if (ZERO_OR_NULL_PTR(s)) return s; - return slab_alloc(s, gfpflags, -1, caller); + return slab_alloc(s, gfpflags, -1, caller, size); } void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, @@ -2723,7 +2729,7 @@ void *__kmalloc_node_track_caller(size_t if (ZERO_OR_NULL_PTR(s)) return s; - return slab_alloc(s, gfpflags, node, caller); + return slab_alloc(s, gfpflags, node, caller, size); } #if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG) Index: linux-2.6.22-rc4-mm2/mm/slab.c 
=================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-17 22:30:01.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-17 22:30:38.000000000 -0700 @@ -2734,7 +2734,7 @@ static int cache_grow(struct kmem_cache * Be lazy and only check for valid flags here, keeping it out of the * critical path in kmem_cache_alloc(). */ - BUG_ON(flags & ~(GFP_DMA | GFP_LEVEL_MASK)); + BUG_ON(flags & ~(GFP_DMA | __GFP_ZERO | GFP_LEVEL_MASK)); local_flags = (flags & GFP_LEVEL_MASK); /* Take the l3 list lock to change the colour_next on this node */ @@ -3380,6 +3380,9 @@ __cache_alloc_node(struct kmem_cache *ca local_irq_restore(save_flags); ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller); + if (unlikely((flags & __GFP_ZERO) && ptr)) + memset(ptr, 0, cachep->buffer_size); + return ptr; } @@ -3431,6 +3434,9 @@ __cache_alloc(struct kmem_cache *cachep, objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller); prefetchw(objp); + if (unlikely((flags & __GFP_ZERO) && objp)) + memset(objp, 0, cachep->buffer_size); + return objp; } -- From clameter@sgi.com Mon Jun 18 02:31:18 2007 Message-Id: <20070618093118.534914229@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:58 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 05/26] Slab allocators: Cleanup zeroing allocations Content-Disposition: inline; filename=slab_remove_shortcut It becomes quite easy to support the zeroing allocs with simple inline functions in slab.h. Provide some inline definitions to allow the continued use of kzalloc, kmem_cache_zalloc etc but remove all other definitions of zeroing functions from the slab allocators and util.c. 
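Conceptually the zeroing entry points then collapse into thin wrappers of this form (a sketch; the actual inline definitions are added to include/linux/slab.h in the diff below):

/* Sketch: the zeroing variants simply OR __GFP_ZERO into the flags. */
static inline void *kzalloc_sketch(size_t size, gfp_t flags)
{
	return kmalloc(size, flags | __GFP_ZERO);
}

static inline void *kmem_cache_zalloc_sketch(struct kmem_cache *k, gfp_t flags)
{
	return kmem_cache_alloc(k, flags | __GFP_ZERO);
}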
Signed-off-by: Christoph Lameter --- include/linux/slab.h | 36 ++++++++++++++++++++++++------------ include/linux/slab_def.h | 30 ------------------------------ include/linux/slub_def.h | 13 ------------- mm/slab.c | 17 ----------------- mm/slob.c | 10 ---------- mm/slub.c | 11 ----------- mm/util.c | 14 -------------- 7 files changed, 24 insertions(+), 107 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slab.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h 2007-06-17 18:08:09.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab.h 2007-06-17 18:11:59.000000000 -0700 @@ -58,7 +58,6 @@ struct kmem_cache *kmem_cache_create(con void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); void *kmem_cache_alloc(struct kmem_cache *, gfp_t); -void *kmem_cache_zalloc(struct kmem_cache *, gfp_t); void kmem_cache_free(struct kmem_cache *, void *); unsigned int kmem_cache_size(struct kmem_cache *); const char *kmem_cache_name(struct kmem_cache *); @@ -105,7 +104,6 @@ static inline void *kmem_cache_alloc_nod * Common kmalloc functions provided by all allocators */ void *__kmalloc(size_t, gfp_t); -void *__kzalloc(size_t, gfp_t); void * __must_check krealloc(const void *, size_t, gfp_t); void kfree(const void *); size_t ksize(const void *); @@ -120,7 +118,7 @@ static inline void *kcalloc(size_t n, si { if (n != 0 && size > ULONG_MAX / n) return NULL; - return __kzalloc(n * size, flags); + return __kmalloc(n * size, flags | __GFP_ZERO); } /* @@ -192,15 +190,6 @@ static inline void *kmalloc(size_t size, return __kmalloc(size, flags); } -/** - * kzalloc - allocate memory. The memory is set to zero. - * @size: how many bytes of memory are required. - * @flags: the type of memory to allocate (see kmalloc). - */ -static inline void *kzalloc(size_t size, gfp_t flags) -{ - return __kzalloc(size, flags); -} #endif #ifndef CONFIG_NUMA @@ -258,6 +247,29 @@ extern void *__kmalloc_node_track_caller #endif /* DEBUG_SLAB */ +/* + * Shortcuts + */ +static inline void *kmem_cache_zalloc(struct kmem_cache *k, gfp_t flags) +{ + return kmem_cache_alloc(k, flags | __GFP_ZERO); +} + +static inline void *__kzalloc(int size, gfp_t flags) +{ + return kmalloc(size, flags | __GFP_ZERO); +} + +/** + * kzalloc - allocate memory. The memory is set to zero. + * @size: how many bytes of memory are required. + * @flags: the type of memory to allocate (see kmalloc). 
+ */ +static inline void *kzalloc(size_t size, gfp_t flags) +{ + return kmalloc(size, flags | __GFP_ZERO); +} + #endif /* __KERNEL__ */ #endif /* _LINUX_SLAB_H */ Index: linux-2.6.22-rc4-mm2/include/linux/slab_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab_def.h 2007-06-17 18:08:09.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab_def.h 2007-06-17 18:11:59.000000000 -0700 @@ -55,36 +55,6 @@ found: return __kmalloc(size, flags); } -static inline void *kzalloc(size_t size, gfp_t flags) -{ - if (__builtin_constant_p(size)) { - int i = 0; - - if (!size) - return ZERO_SIZE_PTR; - -#define CACHE(x) \ - if (size <= x) \ - goto found; \ - else \ - i++; -#include "kmalloc_sizes.h" -#undef CACHE - { - extern void __you_cannot_kzalloc_that_much(void); - __you_cannot_kzalloc_that_much(); - } -found: -#ifdef CONFIG_ZONE_DMA - if (flags & GFP_DMA) - return kmem_cache_zalloc(malloc_sizes[i].cs_dmacachep, - flags); -#endif - return kmem_cache_zalloc(malloc_sizes[i].cs_cachep, flags); - } - return __kzalloc(size, flags); -} - #ifdef CONFIG_NUMA extern void *__kmalloc_node(size_t size, gfp_t flags, int node); Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-17 18:08:09.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-17 18:11:59.000000000 -0700 @@ -173,19 +173,6 @@ static inline void *kmalloc(size_t size, return __kmalloc(size, flags); } -static inline void *kzalloc(size_t size, gfp_t flags) -{ - if (__builtin_constant_p(size) && !(flags & SLUB_DMA)) { - struct kmem_cache *s = kmalloc_slab(size); - - if (!s) - return ZERO_SIZE_PTR; - - return kmem_cache_zalloc(s, flags); - } else - return __kzalloc(size, flags); -} - #ifdef CONFIG_NUMA extern void *__kmalloc_node(size_t size, gfp_t flags, int node); Index: linux-2.6.22-rc4-mm2/mm/slab.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-17 18:11:55.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-17 18:11:59.000000000 -0700 @@ -3578,23 +3578,6 @@ void *kmem_cache_alloc(struct kmem_cache EXPORT_SYMBOL(kmem_cache_alloc); /** - * kmem_cache_zalloc - Allocate an object. The memory is set to zero. - * @cache: The cache to allocate from. - * @flags: See kmalloc(). - * - * Allocate an object from this cache and set the allocated memory to zero. - * The flags are only relevant if the cache has no available objects. - */ -void *kmem_cache_zalloc(struct kmem_cache *cache, gfp_t flags) -{ - void *ret = __cache_alloc(cache, flags, __builtin_return_address(0)); - if (ret) - memset(ret, 0, obj_size(cache)); - return ret; -} -EXPORT_SYMBOL(kmem_cache_zalloc); - -/** * kmem_ptr_validate - check if an untrusted pointer might * be a slab entry. 
* @cachep: the cache we're checking against Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-17 18:10:47.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-17 18:11:59.000000000 -0700 @@ -505,16 +505,6 @@ void *kmem_cache_alloc(struct kmem_cache } EXPORT_SYMBOL(kmem_cache_alloc); -void *kmem_cache_zalloc(struct kmem_cache *c, gfp_t flags) -{ - void *ret = kmem_cache_alloc(c, flags); - if (ret) - memset(ret, 0, c->size); - - return ret; -} -EXPORT_SYMBOL(kmem_cache_zalloc); - static void __kmem_cache_free(void *b, int size) { if (size < PAGE_SIZE) Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:10:47.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:11:59.000000000 -0700 @@ -2664,17 +2664,6 @@ err: } EXPORT_SYMBOL(kmem_cache_create); -void *kmem_cache_zalloc(struct kmem_cache *s, gfp_t flags) -{ - void *x; - - x = slab_alloc(s, flags, -1, __builtin_return_address(0), 0); - if (x) - memset(x, 0, s->objsize); - return x; -} -EXPORT_SYMBOL(kmem_cache_zalloc); - #ifdef CONFIG_SMP /* * Use the cpu notifier to insure that the cpu slabs are flushed when Index: linux-2.6.22-rc4-mm2/mm/util.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/util.c 2007-06-17 18:08:09.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/util.c 2007-06-17 18:11:59.000000000 -0700 @@ -5,20 +5,6 @@ #include /** - * __kzalloc - allocate memory. The memory is set to zero. - * @size: how many bytes of memory are required. - * @flags: the type of memory to allocate. - */ -void *__kzalloc(size_t size, gfp_t flags) -{ - void *ret = kmalloc_track_caller(size, flags); - if (ret) - memset(ret, 0, size); - return ret; -} -EXPORT_SYMBOL(__kzalloc); - -/** * kstrdup - allocate space for and copy an existing string * @s: the string to duplicate * @gfp: the GFP mask used in the kmalloc() call when allocating memory -- From clameter@sgi.com Mon Jun 18 02:31:18 2007 Message-Id: <20070618093118.693111291@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:30:59 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 06/26] Slab allocators: Replace explicit zeroing with __GFP_ZERO Content-Disposition: inline; filename=slab_use_gfpzero_for_kmalloc_node kmalloc_node() and kmem_cache_alloc_node() were not available in a zeroing variant in the past. But with __GFP_ZERO it is now possible to do the zeroing while allocating. Remove the explicit clearing of memory via memset wherever we can.
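The conversion pattern applied throughout the diff below is, schematically (a sketch; struct foo and foo_cachep are made up for illustration):

struct foo {
	int a;
};
static struct kmem_cache *foo_cachep;	/* assumed to be created elsewhere */

/* Before: no zeroing variant of the node-aware allocators, so clear by hand. */
static struct foo *alloc_foo_old(int node)
{
	struct foo *f = kmem_cache_alloc_node(foo_cachep, GFP_KERNEL, node);

	if (f)
		memset(f, 0, sizeof(*f));
	return f;
}

/* After: let the allocator zero the object via __GFP_ZERO. */
static struct foo *alloc_foo_new(int node)
{
	return kmem_cache_alloc_node(foo_cachep, GFP_KERNEL | __GFP_ZERO, node);
}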
Signed-off-by: Christoph Lameter --- block/as-iosched.c | 3 +-- block/cfq-iosched.c | 18 +++++++++--------- block/deadline-iosched.c | 3 +-- block/elevator.c | 3 +-- block/genhd.c | 8 ++++---- block/ll_rw_blk.c | 4 ++-- drivers/ide/ide-probe.c | 4 ++-- kernel/timer.c | 4 ++-- lib/genalloc.c | 3 +-- mm/allocpercpu.c | 9 +++------ mm/mempool.c | 3 +-- mm/vmalloc.c | 6 +++--- 12 files changed, 30 insertions(+), 38 deletions(-) Index: linux-2.6.22-rc4-mm2/block/as-iosched.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/block/as-iosched.c 2007-06-17 15:46:35.000000000 -0700 +++ linux-2.6.22-rc4-mm2/block/as-iosched.c 2007-06-17 15:46:59.000000000 -0700 @@ -1322,10 +1322,9 @@ static void *as_init_queue(request_queue { struct as_data *ad; - ad = kmalloc_node(sizeof(*ad), GFP_KERNEL, q->node); + ad = kmalloc_node(sizeof(*ad), GFP_KERNEL | __GFP_ZERO, q->node); if (!ad) return NULL; - memset(ad, 0, sizeof(*ad)); ad->q = q; /* Identify what queue the data belongs to */ Index: linux-2.6.22-rc4-mm2/block/cfq-iosched.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/block/cfq-iosched.c 2007-06-17 15:42:50.000000000 -0700 +++ linux-2.6.22-rc4-mm2/block/cfq-iosched.c 2007-06-17 15:47:21.000000000 -0700 @@ -1249,9 +1249,9 @@ cfq_alloc_io_context(struct cfq_data *cf { struct cfq_io_context *cic; - cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask, cfqd->queue->node); + cic = kmem_cache_alloc_node(cfq_ioc_pool, gfp_mask | __GFP_ZERO, + cfqd->queue->node); if (cic) { - memset(cic, 0, sizeof(*cic)); cic->last_end_request = jiffies; INIT_LIST_HEAD(&cic->queue_list); cic->dtor = cfq_free_io_context; @@ -1374,17 +1374,19 @@ retry: * free memory. */ spin_unlock_irq(cfqd->queue->queue_lock); - new_cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask|__GFP_NOFAIL, cfqd->queue->node); + new_cfqq = kmem_cache_alloc_node(cfq_pool, + gfp_mask | __GFP_NOFAIL | __GFP_ZERO, + cfqd->queue->node); spin_lock_irq(cfqd->queue->queue_lock); goto retry; } else { - cfqq = kmem_cache_alloc_node(cfq_pool, gfp_mask, cfqd->queue->node); + cfqq = kmem_cache_alloc_node(cfq_pool, + gfp_mask | __GFP_ZERO, + cfqd->queue->node); if (!cfqq) goto out; } - memset(cfqq, 0, sizeof(*cfqq)); - RB_CLEAR_NODE(&cfqq->rb_node); INIT_LIST_HEAD(&cfqq->fifo); @@ -2046,12 +2048,10 @@ static void *cfq_init_queue(request_queu { struct cfq_data *cfqd; - cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL, q->node); + cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node); if (!cfqd) return NULL; - memset(cfqd, 0, sizeof(*cfqd)); - cfqd->service_tree = CFQ_RB_ROOT; INIT_LIST_HEAD(&cfqd->cic_list); Index: linux-2.6.22-rc4-mm2/block/deadline-iosched.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/block/deadline-iosched.c 2007-06-17 15:47:37.000000000 -0700 +++ linux-2.6.22-rc4-mm2/block/deadline-iosched.c 2007-06-17 15:47:47.000000000 -0700 @@ -360,10 +360,9 @@ static void *deadline_init_queue(request { struct deadline_data *dd; - dd = kmalloc_node(sizeof(*dd), GFP_KERNEL, q->node); + dd = kmalloc_node(sizeof(*dd), GFP_KERNEL | __GFP_ZERO, q->node); if (!dd) return NULL; - memset(dd, 0, sizeof(*dd)); INIT_LIST_HEAD(&dd->fifo_list[READ]); INIT_LIST_HEAD(&dd->fifo_list[WRITE]); Index: linux-2.6.22-rc4-mm2/block/elevator.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/block/elevator.c 2007-06-17 15:47:57.000000000 -0700 +++ 
linux-2.6.22-rc4-mm2/block/elevator.c 2007-06-17 15:48:07.000000000 -0700 @@ -177,11 +177,10 @@ static elevator_t *elevator_alloc(reques elevator_t *eq; int i; - eq = kmalloc_node(sizeof(elevator_t), GFP_KERNEL, q->node); + eq = kmalloc_node(sizeof(elevator_t), GFP_KERNEL | __GFP_ZERO, q->node); if (unlikely(!eq)) goto err; - memset(eq, 0, sizeof(*eq)); eq->ops = &e->ops; eq->elevator_type = e; kobject_init(&eq->kobj); Index: linux-2.6.22-rc4-mm2/block/genhd.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/block/genhd.c 2007-06-17 15:48:27.000000000 -0700 +++ linux-2.6.22-rc4-mm2/block/genhd.c 2007-06-17 15:49:03.000000000 -0700 @@ -726,21 +726,21 @@ struct gendisk *alloc_disk_node(int mino { struct gendisk *disk; - disk = kmalloc_node(sizeof(struct gendisk), GFP_KERNEL, node_id); + disk = kmalloc_node(sizeof(struct gendisk), + GFP_KERNEL | __GFP_ZERO, node_id); if (disk) { - memset(disk, 0, sizeof(struct gendisk)); if (!init_disk_stats(disk)) { kfree(disk); return NULL; } if (minors > 1) { int size = (minors - 1) * sizeof(struct hd_struct *); - disk->part = kmalloc_node(size, GFP_KERNEL, node_id); + disk->part = kmalloc_node(size, + GFP_KERNEL | __GFP_ZERO, node_id); if (!disk->part) { kfree(disk); return NULL; } - memset(disk->part, 0, size); } disk->minors = minors; kobj_set_kset_s(disk,block_subsys); Index: linux-2.6.22-rc4-mm2/block/ll_rw_blk.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/block/ll_rw_blk.c 2007-06-17 15:44:27.000000000 -0700 +++ linux-2.6.22-rc4-mm2/block/ll_rw_blk.c 2007-06-17 15:45:03.000000000 -0700 @@ -1828,11 +1828,11 @@ request_queue_t *blk_alloc_queue_node(gf { request_queue_t *q; - q = kmem_cache_alloc_node(requestq_cachep, gfp_mask, node_id); + q = kmem_cache_alloc_node(requestq_cachep, + gfp_mask | __GFP_ZERO, node_id); if (!q) return NULL; - memset(q, 0, sizeof(*q)); init_timer(&q->unplug_timer); snprintf(q->kobj.name, KOBJ_NAME_LEN, "%s", "queue"); Index: linux-2.6.22-rc4-mm2/drivers/ide/ide-probe.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/drivers/ide/ide-probe.c 2007-06-17 15:49:57.000000000 -0700 +++ linux-2.6.22-rc4-mm2/drivers/ide/ide-probe.c 2007-06-17 15:50:13.000000000 -0700 @@ -1073,14 +1073,14 @@ static int init_irq (ide_hwif_t *hwif) hwgroup->hwif->next = hwif; spin_unlock_irq(&ide_lock); } else { - hwgroup = kmalloc_node(sizeof(ide_hwgroup_t), GFP_KERNEL, + hwgroup = kmalloc_node(sizeof(ide_hwgroup_t), + GFP_KERNEL | __GFP_ZERO, hwif_to_node(hwif->drives[0].hwif)); if (!hwgroup) goto out_up; hwif->hwgroup = hwgroup; - memset(hwgroup, 0, sizeof(ide_hwgroup_t)); hwgroup->hwif = hwif->next = hwif; hwgroup->rq = NULL; hwgroup->handler = NULL; Index: linux-2.6.22-rc4-mm2/kernel/timer.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/kernel/timer.c 2007-06-17 15:50:50.000000000 -0700 +++ linux-2.6.22-rc4-mm2/kernel/timer.c 2007-06-17 15:51:16.000000000 -0700 @@ -1221,7 +1221,8 @@ static int __devinit init_timers_cpu(int /* * The APs use this path later in boot */ - base = kmalloc_node(sizeof(*base), GFP_KERNEL, + base = kmalloc_node(sizeof(*base), + GFP_KERNEL | __GFP_ZERO, cpu_to_node(cpu)); if (!base) return -ENOMEM; @@ -1232,7 +1233,6 @@ static int __devinit init_timers_cpu(int kfree(base); return -ENOMEM; } - memset(base, 0, sizeof(*base)); per_cpu(tvec_bases, cpu) = base; } else { /* Index: linux-2.6.22-rc4-mm2/lib/genalloc.c 
=================================================================== --- linux-2.6.22-rc4-mm2.orig/lib/genalloc.c 2007-06-17 15:51:38.000000000 -0700 +++ linux-2.6.22-rc4-mm2/lib/genalloc.c 2007-06-17 15:51:56.000000000 -0700 @@ -54,11 +54,10 @@ int gen_pool_add(struct gen_pool *pool, int nbytes = sizeof(struct gen_pool_chunk) + (nbits + BITS_PER_BYTE - 1) / BITS_PER_BYTE; - chunk = kmalloc_node(nbytes, GFP_KERNEL, nid); + chunk = kmalloc_node(nbytes, GFP_KERNEL | __GFP_ZERO, nid); if (unlikely(chunk == NULL)) return -1; - memset(chunk, 0, nbytes); spin_lock_init(&chunk->lock); chunk->start_addr = addr; chunk->end_addr = addr + size; Index: linux-2.6.22-rc4-mm2/mm/allocpercpu.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/allocpercpu.c 2007-06-17 15:52:19.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/allocpercpu.c 2007-06-17 15:52:38.000000000 -0700 @@ -53,12 +53,9 @@ void *percpu_populate(void *__pdata, siz int node = cpu_to_node(cpu); BUG_ON(pdata->ptrs[cpu]); - if (node_online(node)) { - /* FIXME: kzalloc_node(size, gfp, node) */ - pdata->ptrs[cpu] = kmalloc_node(size, gfp, node); - if (pdata->ptrs[cpu]) - memset(pdata->ptrs[cpu], 0, size); - } else + if (node_online(node)) + pdata->ptrs[cpu] = kmalloc_node(size, gfp|__GFP_ZERO, node); + else pdata->ptrs[cpu] = kzalloc(size, gfp); return pdata->ptrs[cpu]; } Index: linux-2.6.22-rc4-mm2/mm/mempool.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/mempool.c 2007-06-17 15:52:52.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/mempool.c 2007-06-17 15:53:19.000000000 -0700 @@ -62,10 +62,9 @@ mempool_t *mempool_create_node(int min_n mempool_free_t *free_fn, void *pool_data, int node_id) { mempool_t *pool; - pool = kmalloc_node(sizeof(*pool), GFP_KERNEL, node_id); + pool = kmalloc_node(sizeof(*pool), GFP_KERNEL | __GFP_ZERO, node_id); if (!pool) return NULL; - memset(pool, 0, sizeof(*pool)); pool->elements = kmalloc_node(min_nr * sizeof(void *), GFP_KERNEL, node_id); if (!pool->elements) { Index: linux-2.6.22-rc4-mm2/mm/vmalloc.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/vmalloc.c 2007-06-17 15:57:18.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/vmalloc.c 2007-06-17 16:03:38.000000000 -0700 @@ -434,11 +434,12 @@ void *__vmalloc_area_node(struct vm_stru area->nr_pages = nr_pages; /* Please note that the recursion is strictly bounded. 
*/ if (array_size > PAGE_SIZE) { - pages = __vmalloc_node(array_size, gfp_mask, PAGE_KERNEL, node); + pages = __vmalloc_node(array_size, gfp_mask | __GFP_ZERO, + PAGE_KERNEL, node); area->flags |= VM_VPAGES; } else { pages = kmalloc_node(array_size, - (gfp_mask & GFP_LEVEL_MASK), + (gfp_mask & GFP_LEVEL_MASK) | __GFP_ZERO, node); } area->pages = pages; @@ -447,7 +448,6 @@ void *__vmalloc_area_node(struct vm_stru kfree(area); return NULL; } - memset(area->pages, 0, array_size); for (i = 0; i < area->nr_pages; i++) { if (node < 0) -- From clameter@sgi.com Mon Jun 18 02:31:19 2007 Message-Id: <20070618093118.867627072@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:00 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 07/26] SLUB: Add some more inlines and #ifdef CONFIG_SLUB_DEBUG Content-Disposition: inline; filename=slub_inlines_ifdefs Add #ifdefs around data structures that are only needed if debugging is compiled in. Add inlines to small functions to reduce code size. Signed-off-by: Christoph Lameter --- include/linux/slub_def.h | 4 ++++ mm/slub.c | 13 +++++++------ 2 files changed, 11 insertions(+), 6 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:11:59.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:04.000000000 -0700 @@ -259,9 +259,10 @@ static int sysfs_slab_add(struct kmem_ca static int sysfs_slab_alias(struct kmem_cache *, const char *); static void sysfs_slab_remove(struct kmem_cache *); #else -static int sysfs_slab_add(struct kmem_cache *s) { return 0; } -static int sysfs_slab_alias(struct kmem_cache *s, const char *p) { return 0; } -static void sysfs_slab_remove(struct kmem_cache *s) {} +static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; } +static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p) + { return 0; } +static inline void sysfs_slab_remove(struct kmem_cache *s) {} #endif /******************************************************************** @@ -1405,7 +1406,7 @@ static void deactivate_slab(struct kmem_ unfreeze_slab(s, page); } -static void flush_slab(struct kmem_cache *s, struct page *page, int cpu) +static inline void flush_slab(struct kmem_cache *s, struct page *page, int cpu) { slab_lock(page); deactivate_slab(s, page, cpu); @@ -1415,7 +1416,7 @@ static void flush_slab(struct kmem_cache * Flush cpu slab. * Called from IPI handler with interrupts disabled. */ -static void __flush_cpu_slab(struct kmem_cache *s, int cpu) +static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { struct page *page = s->cpu_slab[cpu]; @@ -2174,7 +2175,7 @@ static int free_list(struct kmem_cache * /* * Release all resources used by a slab cache.
*/ -static int kmem_cache_close(struct kmem_cache *s) +static inline int kmem_cache_close(struct kmem_cache *s) { int node; Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-17 18:11:59.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-17 18:12:04.000000000 -0700 @@ -16,7 +16,9 @@ struct kmem_cache_node { unsigned long nr_partial; atomic_long_t nr_slabs; struct list_head partial; +#ifdef CONFIG_SLUB_DEBUG struct list_head full; +#endif }; /* @@ -44,7 +46,9 @@ struct kmem_cache { int align; /* Alignment */ const char *name; /* Name (only for display!) */ struct list_head list; /* List of slab caches */ +#ifdef CONFIG_SLUB_DEBUG struct kobject kobj; /* For sysfs */ +#endif #ifdef CONFIG_NUMA int defrag_ratio; -- From clameter@sgi.com Mon Jun 18 02:31:19 2007 Message-Id: <20070618093119.043263862@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:01 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 08/26] SLUB: Extract dma_kmalloc_cache from get_cache. Content-Disposition: inline; filename=slub_extract_dma_cache The function leads to get_slab becoming too complex. The compiler begins to spill variables from the working set onto the stack. The created function is only used in extremely rare cases so make sure that the compiler does not decide on its own to merge it back into get_slab. Signed-off-by: Christoph Lameter --- mm/slub.c | 66 +++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 36 insertions(+), 30 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:04.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:10.000000000 -0700 @@ -2281,6 +2281,40 @@ panic: panic("Creation of kmalloc slab %s size=%d failed.\n", name, size); } +#ifdef CONFIG_ZONE_DMA +static struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags) +{ + struct kmem_cache *s; + struct kmem_cache *x; + char *text; + size_t realsize; + + s = kmalloc_caches_dma[index]; + if (s) + return s; + + /* Dynamically create dma cache */ + x = kmalloc(kmem_size, flags & ~SLUB_DMA); + if (!x) + panic("Unable to allocate memory for dma cache\n"); + + if (index <= KMALLOC_SHIFT_HIGH) + realsize = 1 << index; + else { + if (index == 1) + realsize = 96; + else + realsize = 192; + } + + text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d", + (unsigned int)realsize); + s = create_kmalloc_cache(x, text, realsize, flags); + kmalloc_caches_dma[index] = s; + return s; +} +#endif + static struct kmem_cache *get_slab(size_t size, gfp_t flags) { int index = kmalloc_index(size); @@ -2293,36 +2327,8 @@ static struct kmem_cache *get_slab(size_ return NULL; #ifdef CONFIG_ZONE_DMA - if ((flags & SLUB_DMA)) { - struct kmem_cache *s; - struct kmem_cache *x; - char *text; - size_t realsize; - - s = kmalloc_caches_dma[index]; - if (s) - return s; - - /* Dynamically create dma cache */ - x = kmalloc(kmem_size, flags & ~SLUB_DMA); - if (!x) - panic("Unable to allocate memory for dma cache\n"); - - if (index <= KMALLOC_SHIFT_HIGH) - realsize = 1 << index; - else { - if (index == 1) - realsize = 96; - else - realsize = 192; - } - - text = kasprintf(flags & 
~SLUB_DMA, "kmalloc_dma-%d", - (unsigned int)realsize); - s = create_kmalloc_cache(x, text, realsize, flags); - kmalloc_caches_dma[index] = s; - return s; - } + if ((flags & SLUB_DMA)) + return dma_kmalloc_cache(index, flags); #endif return &kmalloc_caches[index]; } -- From clameter@sgi.com Mon Jun 18 02:31:19 2007 Message-Id: <20070618093119.208439977@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:02 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 09/26] SLUB: Do proper locking during dma slab creation Content-Disposition: inline; filename=slub_dma_create_lock We modify the kmalloc_caches_dma[] array without proper locking. Do the proper locking and undo the dma cache creation if another processor has already created it. Signed-off-by: Christoph Lameter --- mm/slub.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:10.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:13.000000000 -0700 @@ -2310,8 +2310,15 @@ static struct kmem_cache *dma_kmalloc_ca text = kasprintf(flags & ~SLUB_DMA, "kmalloc_dma-%d", (unsigned int)realsize); s = create_kmalloc_cache(x, text, realsize, flags); - kmalloc_caches_dma[index] = s; - return s; + down_write(&slub_lock); + if (!kmalloc_caches_dma[index]) { + kmalloc_caches_dma[index] = s; + up_write(&slub_lock); + return s; + } + up_write(&slub_lock); + kmem_cache_destroy(s); + return kmalloc_caches_dma[index]; } #endif -- From clameter@sgi.com Mon Jun 18 02:31:19 2007 Message-Id: <20070618093119.369653717@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:03 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 10/26] SLUB: faster more efficient slab determination for __kmalloc. Content-Disposition: inline; filename=slub_faster_kmalloc_slab kmalloc_index is a long series of comparisons. The attempt to replace kmalloc_index with something more efficient like ilog2 failed due to compiler issues with constant folding on gcc 3.3 / powerpc. kmalloc_index's long list of comparisons works fine for constant folding since all the comparisons are optimized away. However, SLUB also uses kmalloc_index to determine the slab to use for the __kmalloc_xxx functions. This leads to a large set of comparisons in get_slab(). This patch gets rid of that list of comparisons in get_slab: 1. If the requested size is larger than 192 then we can simply use fls to determine the slab index since all larger slabs are power-of-two sized. 2. If the requested size is smaller, we cannot use fls since there are non-power-of-two caches to be considered. However, the sizes are in a manageable range, so we divide the size by 8. That leaves only 24 possibilities, and we simply look up the kmalloc index in a table. Code size of slub.o decreases by more than 200 bytes through this patch.
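A worked example of the small-size path (a sketch; size_index is the table added in the diff below, and the wrapper function here is hypothetical):

/*
 * Sketch: index lookup for sizes 1..192 using the size_index table.
 * E.g. size == 100: (100 - 1) / 8 == 12, size_index[12] == 7, i.e. the
 * 128-byte kmalloc cache (96 < 100 <= 128).
 * E.g. size == 72: (72 - 1) / 8 == 8, size_index[8] == 1, the 96-byte cache.
 */
static int small_kmalloc_index(size_t size)
{
	return size_index[(size - 1) / 8];
}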
Signed-off-by: Christoph Lameter --- mm/slub.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 65 insertions(+), 8 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:13.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:16.000000000 -0700 @@ -2282,7 +2282,7 @@ panic: } #ifdef CONFIG_ZONE_DMA -static struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags) +static noinline struct kmem_cache *dma_kmalloc_cache(int index, gfp_t flags) { struct kmem_cache *s; struct kmem_cache *x; @@ -2322,20 +2322,59 @@ static struct kmem_cache *dma_kmalloc_ca } #endif +/* + * Conversion table for small slabs sizes / 8 to the index in the + * kmalloc array. This is necessary for slabs < 192 since we have non power + * of two cache sizes there. The size of larger slabs can be determined using + * fls. + */ +static s8 size_index[24] = { + 3, /* 8 */ + 4, /* 16 */ + 5, /* 24 */ + 5, /* 32 */ + 6, /* 40 */ + 6, /* 48 */ + 6, /* 56 */ + 6, /* 64 */ + 1, /* 72 */ + 1, /* 80 */ + 1, /* 88 */ + 1, /* 96 */ + 7, /* 104 */ + 7, /* 112 */ + 7, /* 120 */ + 7, /* 128 */ + 2, /* 136 */ + 2, /* 144 */ + 2, /* 152 */ + 2, /* 160 */ + 2, /* 168 */ + 2, /* 176 */ + 2, /* 184 */ + 2 /* 192 */ +}; + static struct kmem_cache *get_slab(size_t size, gfp_t flags) { - int index = kmalloc_index(size); + int index; - if (!index) - return ZERO_SIZE_PTR; + if (size <= 192) { + if (!size) + return ZERO_SIZE_PTR; - /* Allocation too large? */ - if (index < 0) - return NULL; + index = size_index[(size - 1) / 8]; + } else { + if (size > KMALLOC_MAX_SIZE) + return NULL; + + index = fls(size - 1) + 1; + } #ifdef CONFIG_ZONE_DMA - if ((flags & SLUB_DMA)) + if (unlikely((flags & SLUB_DMA))) return dma_kmalloc_cache(index, flags); + #endif return &kmalloc_caches[index]; } @@ -2550,6 +2589,24 @@ void __init kmem_cache_init(void) caches++; } + + /* + * Patch up the size_index table if we have strange large alignment + * requirements for the kmalloc array. This is only the case for + * mips it seems. The standard arches will not generate any code here. + * + * Largest permitted alignment is 256 bytes due to the way we + * handle the index determination for the smaller caches. + * + * Make sure that nothing crazy happens if someone starts tinkering + * around with ARCH_KMALLOC_MINALIGN + */ + BUG_ON(KMALLOC_MIN_SIZE > 256 || + (KMALLOC_MIN_SIZE & (KMALLOC_MIN_SIZE - 1))); + + for (i = 8; i < KMALLOC_MIN_SIZE;i++) + size_index[(i - 1) / 8] = KMALLOC_SHIFT_LOW; + slab_state = UP; /* Provide the correct kmalloc names now that the caches are up */ -- From clameter@sgi.com Mon Jun 18 02:31:19 2007 Message-Id: <20070618093119.541071249@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:04 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 11/26] SLUB: Add support for kmem_cache_ops Content-Disposition: inline; filename=slab_defrag_kmem_cache_ops We use the parameter formerly used by the destructor to pass an optional pointer to a kmem_cache_ops structure to kmem_cache_create. kmem_cache_ops starts out empty; later patches will populate it. Create a KMEM_CACHE_OPS macro that allows the specification of the kmem_cache_ops. Code to handle kmem_cache_ops is added to SLUB.
SLAB and SLOB are updated to be able to accept a kmem_cache_ops structure but will ignore it. Signed-off-by: Christoph Lameter --- include/linux/slab.h | 13 +++++++++---- include/linux/slub_def.h | 1 + mm/slab.c | 6 +++--- mm/slob.c | 2 +- mm/slub.c | 44 ++++++++++++++++++++++++++++++-------------- 5 files changed, 44 insertions(+), 22 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slab.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h 2007-06-17 18:11:59.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab.h 2007-06-17 18:12:19.000000000 -0700 @@ -51,10 +51,13 @@ void __init kmem_cache_init(void); int slab_is_available(void); +struct kmem_cache_ops { +}; + struct kmem_cache *kmem_cache_create(const char *, size_t, size_t, unsigned long, void (*)(void *, struct kmem_cache *, unsigned long), - void (*)(void *, struct kmem_cache *, unsigned long)); + const struct kmem_cache_ops *s); void kmem_cache_destroy(struct kmem_cache *); int kmem_cache_shrink(struct kmem_cache *); void *kmem_cache_alloc(struct kmem_cache *, gfp_t); @@ -71,9 +74,11 @@ int kmem_ptr_validate(struct kmem_cache * f.e. add ____cacheline_aligned_in_smp to the struct declaration * then the objects will be properly aligned in SMP configurations. */ -#define KMEM_CACHE(__struct, __flags) kmem_cache_create(#__struct,\ - sizeof(struct __struct), __alignof__(struct __struct),\ - (__flags), NULL, NULL) +#define KMEM_CACHE_OPS(__struct, __flags, __ops) \ + kmem_cache_create(#__struct, sizeof(struct __struct), \ + __alignof__(struct __struct), (__flags), NULL, (__ops)) + +#define KMEM_CACHE(__struct, __flags) KMEM_CACHE_OPS(__struct, __flags, NULL) #ifdef CONFIG_NUMA extern void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node); Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:16.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:19.000000000 -0700 @@ -300,6 +300,9 @@ static inline int check_valid_pointer(st return 1; } +struct kmem_cache_ops slub_default_ops = { +}; + /* * Slow version of get and set free pointer. 
* @@ -2081,11 +2084,13 @@ static int calculate_sizes(struct kmem_c static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, const char *name, size_t size, size_t align, unsigned long flags, - void (*ctor)(void *, struct kmem_cache *, unsigned long)) + void (*ctor)(void *, struct kmem_cache *, unsigned long), + const struct kmem_cache_ops *ops) { memset(s, 0, kmem_size); s->name = name; s->ctor = ctor; + s->ops = ops; s->objsize = size; s->flags = flags; s->align = align; @@ -2268,7 +2273,7 @@ static struct kmem_cache *create_kmalloc down_write(&slub_lock); if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN, - flags, NULL)) + flags, NULL, &slub_default_ops)) goto panic; list_add(&s->list, &slab_caches); @@ -2645,12 +2650,16 @@ static int slab_unmergeable(struct kmem_ if (s->refcount < 0) return 1; + if (s->ops != &slub_default_ops) + return 1; + return 0; } static struct kmem_cache *find_mergeable(size_t size, size_t align, unsigned long flags, - void (*ctor)(void *, struct kmem_cache *, unsigned long)) + void (*ctor)(void *, struct kmem_cache *, unsigned long), + const struct kmem_cache_ops *ops) { struct kmem_cache *s; @@ -2660,6 +2669,9 @@ static struct kmem_cache *find_mergeable if (ctor) return NULL; + if (ops != &slub_default_ops) + return NULL; + size = ALIGN(size, sizeof(void *)); align = calculate_alignment(flags, align, size); size = ALIGN(size, align); @@ -2692,13 +2704,15 @@ static struct kmem_cache *find_mergeable struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void *, struct kmem_cache *, unsigned long), - void (*dtor)(void *, struct kmem_cache *, unsigned long)) + const struct kmem_cache_ops *ops) { struct kmem_cache *s; - BUG_ON(dtor); + if (!ops) + ops = &slub_default_ops; + down_write(&slub_lock); - s = find_mergeable(size, align, flags, ctor); + s = find_mergeable(size, align, flags, ctor, ops); if (s) { s->refcount++; /* @@ -2712,7 +2726,7 @@ struct kmem_cache *kmem_cache_create(con } else { s = kmalloc(kmem_size, GFP_KERNEL); if (s && kmem_cache_open(s, GFP_KERNEL, name, - size, align, flags, ctor)) { + size, align, flags, ctor, ops)) { if (sysfs_slab_add(s)) { kfree(s); goto err; @@ -3323,16 +3337,18 @@ static ssize_t order_show(struct kmem_ca } SLAB_ATTR_RO(order); -static ssize_t ctor_show(struct kmem_cache *s, char *buf) +static ssize_t ops_show(struct kmem_cache *s, char *buf) { - if (s->ctor) { - int n = sprint_symbol(buf, (unsigned long)s->ctor); + int x = 0; - return n + sprintf(buf + n, "\n"); + if (s->ctor) { + x += sprintf(buf + x, "ctor : "); + x += sprint_symbol(buf + x, (unsigned long)s->ctor); + x += sprintf(buf + x, "\n"); } - return 0; + return x; } -SLAB_ATTR_RO(ctor); +SLAB_ATTR_RO(ops); static ssize_t aliases_show(struct kmem_cache *s, char *buf) { @@ -3564,7 +3580,7 @@ static struct attribute * slab_attrs[] = &slabs_attr.attr, &partial_attr.attr, &cpu_slabs_attr.attr, - &ctor_attr.attr, + &ops_attr.attr, &aliases_attr.attr, &align_attr.attr, &sanity_checks_attr.attr, Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-17 18:12:04.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-17 18:12:19.000000000 -0700 @@ -42,6 +42,7 @@ struct kmem_cache { int objects; /* Number of objects in slab */ int refcount; /* Refcount for slab cache destroy */ void (*ctor)(void *, struct kmem_cache *, unsigned long); + const 
struct kmem_cache_ops *ops; int inuse; /* Offset to metadata */ int align; /* Alignment */ const char *name; /* Name (only for display!) */ Index: linux-2.6.22-rc4-mm2/mm/slab.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-17 18:11:59.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-17 18:12:19.000000000 -0700 @@ -2102,7 +2102,7 @@ static int __init_refok setup_cpu_cache( * @align: The required alignment for the objects. * @flags: SLAB flags * @ctor: A constructor for the objects. - * @dtor: A destructor for the objects (not implemented anymore). + * @ops: A kmem_cache_ops structure (ignored). * * Returns a ptr to the cache on success, NULL on failure. * Cannot be called within a int, but can be interrupted. @@ -2128,7 +2128,7 @@ struct kmem_cache * kmem_cache_create (const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void*, struct kmem_cache *, unsigned long), - void (*dtor)(void*, struct kmem_cache *, unsigned long)) + const struct kmem_cache_ops *ops) { size_t left_over, slab_size, ralign; struct kmem_cache *cachep = NULL, *pc; @@ -2137,7 +2137,7 @@ kmem_cache_create (const char *name, siz * Sanity checks... these are all serious usage bugs. */ if (!name || in_interrupt() || (size < BYTES_PER_WORD) || - size > KMALLOC_MAX_SIZE || dtor) { + size > KMALLOC_MAX_SIZE) { printk(KERN_ERR "%s: Early error in slab %s\n", __FUNCTION__, name); BUG(); Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-17 18:11:59.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-17 18:12:19.000000000 -0700 @@ -455,7 +455,7 @@ struct kmem_cache { struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align, unsigned long flags, void (*ctor)(void*, struct kmem_cache *, unsigned long), - void (*dtor)(void*, struct kmem_cache *, unsigned long)) + const struct kmem_cache_ops *o) { struct kmem_cache *c; -- From clameter@sgi.com Mon Jun 18 02:31:19 2007 Message-Id: <20070618093119.702023138@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:05 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 12/26] SLUB: Slab defragmentation core Content-Disposition: inline; filename=slab_defrag_core Slab defragmentation occurs either 1. Unconditionally, when kmem_cache_shrink() is called on a slab cache by the kernel or when slabinfo triggers slab shrinking. This form performs defragmentation on all nodes of a NUMA system. 2. Conditionally, when kmem_cache_defrag(<percentage>, <node>) is called. The defragmentation is only performed if the fragmentation of the slab is higher than the specified percentage. Fragmentation ratios are measured by calculating the percentage of objects in use compared to the total number of objects that the slab cache could hold. kmem_cache_defrag takes a node parameter. This can either be -1 if defragmentation should be performed on all nodes, or a node number. If a node number was specified then defragmentation is only performed on that node. Slab defragmentation is a memory intensive operation that can be sped up in a NUMA system if mostly node local memory is accessed. That is the case if we have just performed reclaim on a node.
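The conditional trigger amounts to the following check (an illustrative sketch with made-up parameter names, not the kernel function; the real computation appears in __kmem_cache_defrag() in the patch below):

/*
 * Sketch of the defragmentation trigger only. 'capacity' is the number
 * of objects all slabs of the cache on this node could hold, 'in_use'
 * the number of objects currently allocated in them.
 */
static int defrag_needed(unsigned long in_use, unsigned long capacity,
			 int threshold_percent)
{
	unsigned long ratio;

	if (!capacity)
		return 0;
	ratio = in_use * 100 / capacity;	/* usage in percent */

	/* Defragment only if usage dropped to or below the threshold */
	return ratio <= threshold_percent;
}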
For defragmentation SLUB first generates a sorted list of partial slabs. Sorting is performed according to the number of objects allocated. Thus the slabs with the least objects will be at the end. We extract slabs off the tail of that list until we have either reached a minimum number of slabs or until we encounter a slab that has more than a quarter of its objects allocated. Then we attempt to remove the objects from each of the slabs taken. In order for a slab cache to support defragmentation, two functions must be defined via kmem_cache_ops. These are: void *get(struct kmem_cache *s, int nr, void **objects) Must obtain a reference to the listed objects. SLUB guarantees that the objects are still allocated. However, other threads may be blocked in slab_free attempting to free objects in the slab. These may succeed as soon as get() returns to the slab allocator. The function must be able to detect that situation and void the attempts to handle such objects (by, for example, voiding the corresponding entry in the objects array). No slab operations may be performed in get(). Interrupts are disabled. What can be done is very limited. The slab lock for the page with the object is taken. Any attempt to perform a slab operation may lead to a deadlock. get() returns a private pointer that is passed to kick(). Should we be unable to obtain all references then that pointer may indicate to the kick() function that it should not attempt any object removal or move but simply remove the reference counts. void kick(struct kmem_cache *, int nr, void **objects, void *get_result) After SLUB has established references to the objects in a slab it will drop all locks and then use kick() to move objects out of the slab. The existence of the objects is guaranteed by virtue of the references obtained earlier via get(). The callback may perform any slab operation since no locks are held at the time of call. The callback should remove the object from the slab in some way. This may be accomplished by reclaiming the object and then running kmem_cache_free() or reallocating it and then running kmem_cache_free(). Reallocation is advantageous because the partial slabs were just sorted to have the partial slabs with the most objects first. Reallocation is likely to result in filling up a slab in addition to freeing up one slab so that it also can be removed from the partial list. kick() does not return a result. SLUB will check the number of remaining objects in the slab. If all objects were removed then we know that the operation was successful. Signed-off-by: Christoph Lameter --- include/linux/slab.h | 32 ++++ mm/slab.c | 5 mm/slob.c | 5 mm/slub.c | 344 +++++++++++++++++++++++++++++++++++++++++---------- 4 files changed, 322 insertions(+), 64 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slab.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h 2007-06-17 18:12:19.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab.h 2007-06-17 18:12:22.000000000 -0700 @@ -51,7 +51,39 @@ void __init kmem_cache_init(void); int slab_is_available(void); +struct kmem_cache; + struct kmem_cache_ops { + /* + * Called with slab lock held and interrupts disabled. + * No slab operation may be performed. + * + * Parameters passed are the number of objects to process + * and an array of pointers to objects for which we + * need references. + * + * Returns a pointer that is passed to the kick function.
+ * If all objects cannot be moved then the pointer may + * indicate that this wont work and then kick can simply + * remove the references that were already obtained. + * + * The array passed to get() is also passed to kick(). The + * function may remove objects by setting array elements to NULL. + */ + void *(*get)(struct kmem_cache *, int nr, void **); + + /* + * Called with no locks held and interrupts enabled. + * Any operation may be performed in kick(). + * + * Parameters passed are the number of objects in the array, + * the array of pointers to the objects and the pointer + * returned by get(). + * + * Success is checked by examining the number of remaining + * objects in the slab. + */ + void (*kick)(struct kmem_cache *, int nr, void **, void *private); }; struct kmem_cache *kmem_cache_create(const char *, size_t, size_t, Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:19.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:22.000000000 -0700 @@ -2464,6 +2464,195 @@ void kfree(const void *x) } EXPORT_SYMBOL(kfree); +static unsigned long count_partial(struct kmem_cache_node *n) +{ + unsigned long flags; + unsigned long x = 0; + struct page *page; + + spin_lock_irqsave(&n->list_lock, flags); + list_for_each_entry(page, &n->partial, lru) + x += page->inuse; + spin_unlock_irqrestore(&n->list_lock, flags); + return x; +} + +/* + * Vacate all objects in the given slab. + * + * Slab must be locked and frozen. Interrupts are disabled (flags must + * be passed). + * + * Will drop and regain and drop the slab lock. At the end the slab will + * either be freed or returned to the partial lists. + * + * Returns the number of remaining objects + */ +static int __kmem_cache_vacate(struct kmem_cache *s, + struct page *page, unsigned long flags, void *scratch) +{ + void **vector = scratch; + void *p; + void *addr = page_address(page); + DECLARE_BITMAP(map, s->objects); + int leftover; + int objects; + void *private; + + if (!page->inuse) + goto out; + + /* Determine used objects */ + bitmap_fill(map, s->objects); + for_each_free_object(p, s, page->freelist) + __clear_bit(slab_index(p, s, addr), map); + + objects = 0; + memset(vector, 0, s->objects * sizeof(void **)); + for_each_object(p, s, addr) { + if (test_bit(slab_index(p, s, addr), map)) + vector[objects++] = p; + } + + private = s->ops->get(s, objects, vector); + + /* + * Got references. Now we can drop the slab lock. The slab + * is frozen so it cannot vanish from under us nor will + * allocations be performed on the slab. However, unlocking the + * slab will allow concurrent slab_frees to proceed. + */ + slab_unlock(page); + local_irq_restore(flags); + + /* + * Perform the KICK callbacks to remove the objects. + */ + s->ops->kick(s, objects, vector, private); + + local_irq_save(flags); + slab_lock(page); +out: + /* + * Check the result and unfreeze the slab + */ + leftover = page->inuse; + unfreeze_slab(s, page); + local_irq_restore(flags); + return leftover; +} + +/* + * Sort the partial slabs by the number of items allocated. + * The slabs with the least objects come last. 
+ */ +static unsigned long sort_partial_list(struct kmem_cache *s, + struct kmem_cache_node *n, void *scratch) +{ + struct list_head *slabs_by_inuse = scratch; + int i; + struct page *page; + struct page *t; + unsigned long freed = 0; + + for (i = 0; i < s->objects; i++) + INIT_LIST_HEAD(slabs_by_inuse + i); + + /* + * Build lists indexed by the items in use in each slab. + * + * Note that concurrent frees may occur while we hold the + * list_lock. page->inuse here is the upper limit. + */ + list_for_each_entry_safe(page, t, &n->partial, lru) { + if (!page->inuse && slab_trylock(page)) { + /* + * Must hold slab lock here because slab_free + * may have freed the last object and be + * waiting to release the slab. + */ + list_del(&page->lru); + n->nr_partial--; + slab_unlock(page); + discard_slab(s, page); + freed++; + } else { + list_move(&page->lru, + slabs_by_inuse + page->inuse); + } + } + + /* + * Rebuild the partial list with the slabs filled up most + * first and the least used slabs at the end. + */ + for (i = s->objects - 1; i >= 0; i--) + list_splice(slabs_by_inuse + i, n->partial.prev); + + return freed; +} + +/* + * Shrink the slab cache on a particular node of the cache + */ +static unsigned long __kmem_cache_shrink(struct kmem_cache *s, + struct kmem_cache_node *n, void *scratch) +{ + unsigned long flags; + struct page *page, *page2; + LIST_HEAD(zaplist); + int freed; + + spin_lock_irqsave(&n->list_lock, flags); + freed = sort_partial_list(s, n, scratch); + + /* + * If we have no functions available to defragment the slabs + * then we are done. + */ + if (!s->ops->get || !s->ops->kick) { + spin_unlock_irqrestore(&n->list_lock, flags); + return freed; + } + + /* + * Take slabs with just a few objects off the tail of the now + * ordered list. These are the slabs with the least objects + * and those are likely easy to reclaim. + */ + while (n->nr_partial > MAX_PARTIAL) { + page = container_of(n->partial.prev, struct page, lru); + + /* + * We are holding the list_lock so we can only + * trylock the slab + */ + if (page->inuse > s->objects / 4) + break; + + if (!slab_trylock(page)) + break; + + list_move_tail(&page->lru, &zaplist); + n->nr_partial--; + SetSlabFrozen(page); + slab_unlock(page); + } + + spin_unlock_irqrestore(&n->list_lock, flags); + + /* Now we can free objects in the slabs on the zaplist */ + list_for_each_entry_safe(page, page2, &zaplist, lru) { + unsigned long flags; + + local_irq_save(flags); + slab_lock(page); + if (__kmem_cache_vacate(s, page, flags, scratch) == 0) + freed++; + } + return freed; +} + /* * kmem_cache_shrink removes empty slabs from the partial lists and sorts * the remaining slabs by the number of items in use. 
The slabs with the @@ -2477,71 +2666,97 @@ EXPORT_SYMBOL(kfree); int kmem_cache_shrink(struct kmem_cache *s) { int node; - int i; - struct kmem_cache_node *n; - struct page *page; - struct page *t; - struct list_head *slabs_by_inuse = - kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL); - unsigned long flags; + void *scratch; + + flush_all(s); - if (!slabs_by_inuse) + scratch = kmalloc(sizeof(struct list_head) * s->objects, + GFP_KERNEL); + if (!scratch) return -ENOMEM; - flush_all(s); - for_each_online_node(node) { - n = get_node(s, node); + for_each_online_node(node) + __kmem_cache_shrink(s, get_node(s, node), scratch); - if (!n->nr_partial) - continue; + kfree(scratch); + return 0; +} +EXPORT_SYMBOL(kmem_cache_shrink); - for (i = 0; i < s->objects; i++) - INIT_LIST_HEAD(slabs_by_inuse + i); +static unsigned long __kmem_cache_defrag(struct kmem_cache *s, + int percent, int node, void *scratch) +{ + unsigned long capacity; + unsigned long objects; + unsigned long ratio; + struct kmem_cache_node *n = get_node(s, node); - spin_lock_irqsave(&n->list_lock, flags); + /* + * An insignificant number of partial slabs makes + * the slab not interesting. + */ + if (n->nr_partial <= MAX_PARTIAL) + return 0; - /* - * Build lists indexed by the items in use in each slab. - * - * Note that concurrent frees may occur while we hold the - * list_lock. page->inuse here is the upper limit. - */ - list_for_each_entry_safe(page, t, &n->partial, lru) { - if (!page->inuse && slab_trylock(page)) { - /* - * Must hold slab lock here because slab_free - * may have freed the last object and be - * waiting to release the slab. - */ - list_del(&page->lru); - n->nr_partial--; - slab_unlock(page); - discard_slab(s, page); - } else { - if (n->nr_partial > MAX_PARTIAL) - list_move(&page->lru, - slabs_by_inuse + page->inuse); - } - } + /* + * Calculate usage ratio + */ + capacity = atomic_long_read(&n->nr_slabs) * s->objects; + objects = capacity - n->nr_partial * s->objects + count_partial(n); + ratio = objects * 100 / capacity; - if (n->nr_partial <= MAX_PARTIAL) - goto out; + /* + * If usage ratio is more than required then no + * defragmentation + */ + if (ratio > percent) + return 0; + + return __kmem_cache_shrink(s, n, scratch) << s->order; +} + +/* + * Defrag slabs on the local node if fragmentation is higher + * than the given percentage. This is called from the memory reclaim + * path. + */ +int kmem_cache_defrag(int percent, int node) +{ + struct kmem_cache *s; + unsigned long pages = 0; + void *scratch; + + /* + * kmem_cache_defrag may be called from the reclaim path which may be + * called for any page allocator alloc. So there is the danger that we + * get called in a situation where slub already acquired the slub_lock + * for other purposes. + */ + if (!down_read_trylock(&slub_lock)) + return 0; + + list_for_each_entry(s, &slab_caches, list) { /* - * Rebuild the partial list with the slabs filled up most - * first and the least used slabs at the end. + * The slab cache must have defrag methods. 
*/ - for (i = s->objects - 1; i >= 0; i--) - list_splice(slabs_by_inuse + i, n->partial.prev); + if (!s->ops || !s->ops->kick) + continue; - out: - spin_unlock_irqrestore(&n->list_lock, flags); + scratch = kmalloc(sizeof(struct list_head) * s->objects, + GFP_KERNEL); + if (node == -1) { + for_each_online_node(node) + pages += __kmem_cache_defrag(s, percent, + node, scratch); + } else + pages += __kmem_cache_defrag(s, percent, node, scratch); + kfree(scratch); } - - kfree(slabs_by_inuse); - return 0; + up_read(&slub_lock); + return pages; } -EXPORT_SYMBOL(kmem_cache_shrink); +EXPORT_SYMBOL(kmem_cache_defrag); /******************************************************************** * Basic setup of slabs @@ -3178,19 +3393,6 @@ static int list_locations(struct kmem_ca return n; } -static unsigned long count_partial(struct kmem_cache_node *n) -{ - unsigned long flags; - unsigned long x = 0; - struct page *page; - - spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, lru) - x += page->inuse; - spin_unlock_irqrestore(&n->list_lock, flags); - return x; -} - enum slab_stat_type { SL_FULL, SL_PARTIAL, @@ -3346,6 +3548,20 @@ static ssize_t ops_show(struct kmem_cach x += sprint_symbol(buf + x, (unsigned long)s->ctor); x += sprintf(buf + x, "\n"); } + + if (s->ops->get) { + x += sprintf(buf + x, "get : "); + x += sprint_symbol(buf + x, + (unsigned long)s->ops->get); + x += sprintf(buf + x, "\n"); + } + + if (s->ops->kick) { + x += sprintf(buf + x, "kick : "); + x += sprint_symbol(buf + x, + (unsigned long)s->ops->kick); + x += sprintf(buf + x, "\n"); + } return x; } SLAB_ATTR_RO(ops); Index: linux-2.6.22-rc4-mm2/mm/slab.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-17 18:12:19.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-17 18:12:22.000000000 -0700 @@ -2518,6 +2518,11 @@ int kmem_cache_shrink(struct kmem_cache } EXPORT_SYMBOL(kmem_cache_shrink); +int kmem_cache_defrag(int percent, int node) +{ + return 0; +} + /** * kmem_cache_destroy - delete a cache * @cachep: the cache to destroy Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-17 18:12:19.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-17 18:12:22.000000000 -0700 @@ -553,6 +553,11 @@ int kmem_cache_shrink(struct kmem_cache } EXPORT_SYMBOL(kmem_cache_shrink); +int kmem_cache_defrag(int percentage, int node) +{ + return 0; +} + int kmem_ptr_validate(struct kmem_cache *a, const void *b) { return 0; -- From clameter@sgi.com Mon Jun 18 02:31:20 2007 Message-Id: <20070618093119.867308492@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:06 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 13/26] SLUB: Extend slabinfo to support -D and -C options Content-Disposition: inline; filename=slab_defrag_slabinfo_updates -D lists caches that support defragmentation -C lists caches that use a ctor. 
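Both options are derived from the per-cache ops file in sysfs. Roughly, as a simplified standalone sketch of the classification that the diff below adds (slabinfo itself reads the file via read_slab_obj(); the classify_cache() helper and its buffer handling are additions for this note):

#include <stdio.h>
#include <string.h>

/*
 * A cache is shown under -C if its /sys/slab/<name>/ops file contains a
 * "ctor : <symbol>" line, and under -D if it contains a "kick : <symbol>"
 * line, i.e. a defragmentation kick() method is wired up.
 */
static void classify_cache(const char *name, int *has_ctor, int *has_defrag)
{
	char path[256], buf[4096];
	size_t n;
	FILE *f;

	*has_ctor = *has_defrag = 0;
	snprintf(path, sizeof(path), "/sys/slab/%s/ops", name);
	f = fopen(path, "r");
	if (!f)
		return;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	buf[n] = '\0';
	fclose(f);

	*has_ctor = strstr(buf, "ctor :") != NULL;
	*has_defrag = strstr(buf, "kick :") != NULL;
}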
Signed-off-by: Christoph Lameter --- Documentation/vm/slabinfo.c | 39 ++++++++++++++++++++++++++++++++++----- 1 file changed, 34 insertions(+), 5 deletions(-) Index: linux-2.6.22-rc4-mm2/Documentation/vm/slabinfo.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/Documentation/vm/slabinfo.c 2007-06-18 01:26:22.000000000 -0700 +++ linux-2.6.22-rc4-mm2/Documentation/vm/slabinfo.c 2007-06-18 01:27:21.000000000 -0700 @@ -30,6 +30,7 @@ struct slabinfo { int hwcache_align, object_size, objs_per_slab; int sanity_checks, slab_size, store_user, trace; int order, poison, reclaim_account, red_zone; + int defrag, ctor; unsigned long partial, objects, slabs; int numa[MAX_NODES]; int numa_partial[MAX_NODES]; @@ -56,6 +57,8 @@ int show_slab = 0; int skip_zero = 1; int show_numa = 0; int show_track = 0; +int show_defrag = 0; +int show_ctor = 0; int show_first_alias = 0; int validate = 0; int shrink = 0; @@ -90,18 +93,20 @@ void fatal(const char *x, ...) void usage(void) { printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n" - "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n" + "slabinfo [-aCDefhilnosSrtTvz1] [-d debugopts] [slab-regexp]\n" "-a|--aliases Show aliases\n" + "-C|--ctor Show slabs with ctors\n" "-d|--debug= Set/Clear Debug options\n" - "-e|--empty Show empty slabs\n" + "-D|--defrag Show defragmentable caches\n" + "-e|--empty Show empty slabs\n" "-f|--first-alias Show first alias\n" "-h|--help Show usage information\n" "-i|--inverted Inverted list\n" "-l|--slabs Show slabs\n" "-n|--numa Show NUMA information\n" - "-o|--ops Show kmem_cache_ops\n" + "-o|--ops Show kmem_cache_ops\n" "-s|--shrink Shrink slabs\n" - "-r|--report Detailed report on single slabs\n" + "-r|--report Detailed report on single slabs\n" "-S|--Size Sort by size\n" "-t|--tracking Show alloc/free information\n" "-T|--Totals Show summary information\n" @@ -281,7 +286,7 @@ int line = 0; void first_line(void) { printf("Name Objects Objsize Space " - "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n"); + "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n"); } /* @@ -452,6 +457,12 @@ void slabcache(struct slabinfo *s) if (show_empty && s->slabs) return; + if (show_defrag && !s->defrag) + return; + + if (show_ctor && !s->ctor) + return; + store_size(size_str, slab_size(s)); sprintf(dist_str,"%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs); @@ -462,6 +473,10 @@ void slabcache(struct slabinfo *s) *p++ = '*'; if (s->cache_dma) *p++ = 'd'; + if (s->defrag) + *p++ = 'D'; + if (s->ctor) + *p++ = 'C'; if (s->hwcache_align) *p++ = 'A'; if (s->poison) @@ -481,7 +496,7 @@ void slabcache(struct slabinfo *s) printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n", s->name, s->objects, s->object_size, size_str, dist_str, s->objs_per_slab, s->order, - s->slabs ? (s->partial * 100) / s->slabs : 100, + s->slabs ? (s->objects * 100) / (s->slabs * s->objs_per_slab) : 100, s->slabs ? 
(s->objects * s->object_size * 100) / (s->slabs * (page_size << s->order)) : 100, flags); @@ -1072,6 +1087,12 @@ void read_slab_dir(void) slab->store_user = get_obj("store_user"); slab->trace = get_obj("trace"); chdir(".."); + if (read_slab_obj(slab, "ops")) { + if (strstr(buffer, "ctor :")) + slab->ctor = 1; + if (strstr(buffer, "kick :")) + slab->defrag = 1; + } if (slab->name[0] == ':') alias_targets++; slab++; @@ -1121,7 +1142,9 @@ void output_slabs(void) struct option opts[] = { { "aliases", 0, NULL, 'a' }, + { "ctor", 0, NULL, 'C' }, { "debug", 2, NULL, 'd' }, + { "defrag", 0, NULL, 'D' }, { "empty", 0, NULL, 'e' }, { "first-alias", 0, NULL, 'f' }, { "help", 0, NULL, 'h' }, @@ -1146,7 +1169,7 @@ int main(int argc, char *argv[]) page_size = getpagesize(); - while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzTS", + while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzCDTS", opts, NULL)) != -1) switch(c) { case '1': @@ -1196,6 +1219,12 @@ int main(int argc, char *argv[]) case 'z': skip_zero = 0; break; + case 'C': + show_ctor = 1; + break; + case 'D': + show_defrag = 1; + break; case 'T': show_totals = 1; break; -- From clameter@sgi.com Mon Jun 18 02:31:20 2007 Message-Id: <20070618093120.033130317@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:07 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 14/26] SLUB: Logic to trigger slab defragmentation from memory reclaim Content-Disposition: inline; filename=slab_defrag_trigger At some point slab defragmentation needs to be triggered. The logical point for this is after slab shrinking was performed in vmscan.c. At that point the fragmentation ratio of a slab was increased by objects being freed. So we call kmem_cache_defrag from there. kmem_cache_defrag takes the defrag ratio to make the decision to defrag a slab or not. We define a new VM tunable slab_defrag_ratio that contains the limit to trigger slab defragmentation. slab_shrink() from vmscan.c is called in some contexts to do global shrinking of slabs and in others to do shrinking for a particular zone. Pass the zone to slab_shrink, so that slab_shrink can call kmem_cache_defrag() and restrict the defragmentation to the node that is under memory pressure. Signed-off-by: Christoph Lameter --- Documentation/sysctl/vm.txt | 25 +++++++++++++++++++++++++ fs/drop_caches.c | 2 +- include/linux/mm.h | 2 +- include/linux/slab.h | 1 + kernel/sysctl.c | 10 ++++++++++ mm/vmscan.c | 34 +++++++++++++++++++++++++++------- 6 files changed, 65 insertions(+), 9 deletions(-) Index: linux-2.6.22-rc4-mm2/Documentation/sysctl/vm.txt =================================================================== --- linux-2.6.22-rc4-mm2.orig/Documentation/sysctl/vm.txt 2007-06-17 18:08:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/Documentation/sysctl/vm.txt 2007-06-17 18:12:29.000000000 -0700 @@ -35,6 +35,7 @@ Currently, these files are in /proc/sys/ - swap_prefetch - swap_prefetch_delay - swap_prefetch_sleep +- slab_defrag_ratio ============================================================== @@ -300,3 +301,27 @@ sleep for when the ram is found to be fu further. The default value is 5. + +============================================================== + +slab_defrag_ratio + +After shrinking the slabs the system checks if slabs have a lower usage +ratio than the percentage given here. 
If so then slab defragmentation is +activated to increase the usage ratio of the slab and in order to free +memory. + +This is the percentage of objects allocated of the total possible number +of objects in a slab. A lower percentage signifies more fragmentation. + +Note slab defragmentation only works on slabs that have the proper methods +defined (see /sys/slab//ops). When this text was written slab +defragmentation was only supported by the dentry cache and the inode cache. + +The main purpose of the slab defragmentation is to address pathological +situations in which large amounts of inodes or dentries have been +removed from the system. That may leave lots of slabs around with just +a few objects. Slab defragmentation removes these slabs. + +The default value is 30% meaning for 3 items in use we have 7 free +and unused items. Index: linux-2.6.22-rc4-mm2/include/linux/slab.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h 2007-06-17 18:12:22.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab.h 2007-06-17 18:12:29.000000000 -0700 @@ -97,6 +97,7 @@ void kmem_cache_free(struct kmem_cache * unsigned int kmem_cache_size(struct kmem_cache *); const char *kmem_cache_name(struct kmem_cache *); int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr); +int kmem_cache_defrag(int percentage, int node); /* * Please use this macro to create slab caches. Simply specify the Index: linux-2.6.22-rc4-mm2/kernel/sysctl.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/kernel/sysctl.c 2007-06-17 18:08:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/kernel/sysctl.c 2007-06-17 18:12:29.000000000 -0700 @@ -81,6 +81,7 @@ extern int percpu_pagelist_fraction; extern int compat_log; extern int maps_protect; extern int sysctl_stat_interval; +extern int sysctl_slab_defrag_ratio; /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */ static int maxolduid = 65535; @@ -917,6 +918,15 @@ static ctl_table vm_table[] = { .strategy = &sysctl_intvec, .extra1 = &zero, }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "slab_defrag_ratio", + .data = &sysctl_slab_defrag_ratio, + .maxlen = sizeof(sysctl_slab_defrag_ratio), + .mode = 0644, + .proc_handler = &proc_dointvec, + .strategy = &sysctl_intvec, + }, #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT { .ctl_name = VM_LEGACY_VA_LAYOUT, Index: linux-2.6.22-rc4-mm2/mm/vmscan.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/vmscan.c 2007-06-17 18:08:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/vmscan.c 2007-06-17 18:12:29.000000000 -0700 @@ -135,6 +135,12 @@ void unregister_shrinker(struct shrinker EXPORT_SYMBOL(unregister_shrinker); #define SHRINK_BATCH 128 + +/* + * Slabs should be defragmented if less than 30% of objects are allocated. + */ +int sysctl_slab_defrag_ratio = 30; + /* * Call the shrink functions to age shrinkable caches * @@ -152,10 +158,19 @@ EXPORT_SYMBOL(unregister_shrinker); * are eligible for the caller's allocation attempt. It is used for balancing * slab reclaim versus page reclaim. * + * zone is the zone for which we are shrinking the slabs. If the intent + * is to do a global shrink then zone can be be NULL. This is currently + * only used to limit slab defragmentation to a NUMA node. 
The performace + * of shrink_slab would be better (in particular under NUMA) if it could + * be targeted as a whole to a zone that is under memory pressure but + * the VFS datastructures do not allow that at the present time. As a + * result zone_reclaim must perform global slab reclaim in order + * to free up memory in a zone. + * * Returns the number of slab objects which we shrunk. */ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, - unsigned long lru_pages) + unsigned long lru_pages, struct zone *zone) { struct shrinker *shrinker; unsigned long ret = 0; @@ -218,6 +233,8 @@ unsigned long shrink_slab(unsigned long shrinker->nr += total_scan; } up_read(&shrinker_rwsem); + kmem_cache_defrag(sysctl_slab_defrag_ratio, + zone ? zone_to_nid(zone) : -1); return ret; } @@ -1163,7 +1180,8 @@ unsigned long try_to_free_pages(struct z if (!priority) disable_swap_token(); nr_reclaimed += shrink_zones(priority, zones, &sc); - shrink_slab(sc.nr_scanned, gfp_mask, lru_pages); + shrink_slab(sc.nr_scanned, gfp_mask, lru_pages, + NULL); if (reclaim_state) { nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; @@ -1333,7 +1351,7 @@ loop_again: nr_reclaimed += shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); + lru_pages, zone); nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; if (zone->all_unreclaimable) @@ -1601,7 +1619,7 @@ unsigned long shrink_all_memory(unsigned /* If slab caches are huge, it's better to hit them first */ while (nr_slab >= lru_pages) { reclaim_state.reclaimed_slab = 0; - shrink_slab(nr_pages, sc.gfp_mask, lru_pages); + shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL); if (!reclaim_state.reclaimed_slab) break; @@ -1639,7 +1657,7 @@ unsigned long shrink_all_memory(unsigned reclaim_state.reclaimed_slab = 0; shrink_slab(sc.nr_scanned, sc.gfp_mask, - count_lru_pages()); + count_lru_pages(), NULL); ret += reclaim_state.reclaimed_slab; if (ret >= nr_pages) goto out; @@ -1656,7 +1674,8 @@ unsigned long shrink_all_memory(unsigned if (!ret) { do { reclaim_state.reclaimed_slab = 0; - shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages()); + shrink_slab(nr_pages, sc.gfp_mask, + count_lru_pages(), NULL); ret += reclaim_state.reclaimed_slab; } while (ret < nr_pages && reclaim_state.reclaimed_slab > 0); } @@ -1816,7 +1835,8 @@ static int __zone_reclaim(struct zone *z * Note that shrink_slab will free memory on all zones and may * take a long time. 
*/ - while (shrink_slab(sc.nr_scanned, gfp_mask, order) && + while (shrink_slab(sc.nr_scanned, gfp_mask, order, + zone) && zone_page_state(zone, NR_SLAB_RECLAIMABLE) > slab_reclaimable - nr_pages) ; Index: linux-2.6.22-rc4-mm2/fs/drop_caches.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/fs/drop_caches.c 2007-06-17 18:08:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/fs/drop_caches.c 2007-06-17 18:12:29.000000000 -0700 @@ -52,7 +52,7 @@ void drop_slab(void) int nr_objects; do { - nr_objects = shrink_slab(1000, GFP_KERNEL, 1000); + nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL); } while (nr_objects > 10); } Index: linux-2.6.22-rc4-mm2/include/linux/mm.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/mm.h 2007-06-17 18:08:02.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/mm.h 2007-06-17 18:12:29.000000000 -0700 @@ -1229,7 +1229,7 @@ int in_gate_area_no_task(unsigned long a int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, - unsigned long lru_pages); + unsigned long lru_pages, struct zone *zone); extern void drop_pagecache_sb(struct super_block *); void drop_pagecache(void); void drop_slab(void); -- From clameter@sgi.com Mon Jun 18 02:31:20 2007 Message-Id: <20070618093120.198614342@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:08 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 15/26] Slab defrag: Support generic defragmentation for inode slab caches Content-Disposition: inline; filename=slub_defrag_inode_generic This implements the ability to remove inodes in a particular slab from inode cache. In order to remove an inode we may have to write out the pages of an inode, the inode itself and remove the dentries referring to the node. Provide generic functionality that can be used by filesystems that have their own inode caches to also tie into the defragmentation functions that are made available here. Signed-off-by: Christoph Lameter --- fs/inode.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++- include/linux/fs.h | 5 ++ 2 files changed, 104 insertions(+), 1 deletion(-) Index: linux-2.6.22-rc4-mm2/fs/inode.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/fs/inode.c 2007-06-17 22:29:43.000000000 -0700 +++ linux-2.6.22-rc4-mm2/fs/inode.c 2007-06-17 22:54:41.000000000 -0700 @@ -1351,6 +1351,105 @@ static int __init set_ihash_entries(char } __setup("ihash_entries=", set_ihash_entries); +static void *get_inodes(struct kmem_cache *s, int nr, void **v) +{ + int i; + + spin_lock(&inode_lock); + for (i = 0; i < nr; i++) { + struct inode *inode = v[i]; + + if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE)) + v[i] = NULL; + else + __iget(inode); + } + spin_unlock(&inode_lock); + return NULL; +} + +/* + * Function for filesystems that embedd struct inode into their own + * structures. The offset is the offset of the struct inode in the fs inode. 
+ */ +void *fs_get_inodes(struct kmem_cache *s, int nr, void **v, + unsigned long offset) +{ + int i; + + for (i = 0; i < nr; i++) + v[i] += offset; + + return get_inodes(s, nr, v); +} +EXPORT_SYMBOL(fs_get_inodes); + +void kick_inodes(struct kmem_cache *s, int nr, void **v, void *private) +{ + struct inode *inode; + int i; + int abort = 0; + LIST_HEAD(freeable); + struct super_block *sb; + + for (i = 0; i < nr; i++) { + inode = v[i]; + if (!inode) + continue; + + if (inode_has_buffers(inode) || inode->i_data.nrpages) { + if (remove_inode_buffers(inode)) + invalidate_mapping_pages(&inode->i_data, + 0, -1); + } + + /* Invalidate children and dentry */ + if (S_ISDIR(inode->i_mode)) { + struct dentry *d = d_find_alias(inode); + + if (d) { + d_invalidate(d); + dput(d); + } + } + + if (inode->i_state & I_DIRTY) + write_inode_now(inode, 1); + + d_prune_aliases(inode); + } + + mutex_lock(&iprune_mutex); + for (i = 0; i < nr; i++) { + inode = v[i]; + if (!inode) + continue; + + sb = inode->i_sb; + iput(inode); + if (abort || !(sb->s_flags & MS_ACTIVE)) + continue; + + spin_lock(&inode_lock); + abort = !can_unuse(inode); + + if (!abort) { + list_move(&inode->i_list, &freeable); + inode->i_state |= I_FREEING; + inodes_stat.nr_unused--; + } + spin_unlock(&inode_lock); + } + dispose_list(&freeable); + mutex_unlock(&iprune_mutex); +} +EXPORT_SYMBOL(kick_inodes); + +static struct kmem_cache_ops inode_kmem_cache_ops = { + .get = get_inodes, + .kick = kick_inodes +}; + /* * Initialize the waitqueues and inode hash table. */ @@ -1389,7 +1488,7 @@ void __init inode_init(unsigned long mem (SLAB_RECLAIM_ACCOUNT|SLAB_PANIC| SLAB_MEM_SPREAD), init_once, - NULL); + &inode_kmem_cache_ops); register_shrinker(&icache_shrinker); /* Hash may have been set up in inode_init_early */ Index: linux-2.6.22-rc4-mm2/include/linux/fs.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/fs.h 2007-06-17 22:29:43.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/fs.h 2007-06-17 22:31:52.000000000 -0700 @@ -1790,6 +1790,11 @@ static inline void insert_inode_hash(str __insert_inode_hash(inode, inode->i_ino); } +/* Helper functions for inode defragmentation support in filesystems */ +extern void kick_inodes(struct kmem_cache *, int, void **, void *); +extern void *fs_get_inodes(struct kmem_cache *, int nr, void **, + unsigned long offset); + extern struct file * get_empty_filp(void); extern void file_move(struct file *f, struct list_head *list); extern void file_kill(struct file *f); -- From clameter@sgi.com Mon Jun 18 02:31:20 2007 Message-Id: <20070618093120.360924307@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:09 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 16/26] Slab defragmentation: Support defragmentation for extX filesystem inodes Content-Disposition: inline; filename=slub_defrag_fs_ext234 Signed-off-by: Christoph Lameter --- fs/ext2/super.c | 16 ++++++++++++++-- fs/ext3/super.c | 14 +++++++++++++- fs/ext4/super.c | 14 +++++++++++++- 3 files changed, 40 insertions(+), 4 deletions(-) Index: slub/fs/ext2/super.c =================================================================== --- slub.orig/fs/ext2/super.c 2007-06-07 14:09:36.000000000 -0700 +++ slub/fs/ext2/super.c 2007-06-07 14:28:47.000000000 -0700 @@ -168,14 +168,26 @@ static void init_once(void * foo, struct 
mutex_init(&ei->truncate_mutex); inode_init_once(&ei->vfs_inode); } - + +static void *ext2_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, + offsetof(struct ext2_inode_info, vfs_inode)); +} + +static struct kmem_cache_ops ext2_kmem_cache_ops = { + .get = ext2_get_inodes, + .kick = kick_inodes +}; + static int init_inodecache(void) { ext2_inode_cachep = kmem_cache_create("ext2_inode_cache", sizeof(struct ext2_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), - init_once, NULL); + init_once, + &ext2_kmem_cache_ops); if (ext2_inode_cachep == NULL) return -ENOMEM; return 0; Index: slub/fs/ext3/super.c =================================================================== --- slub.orig/fs/ext3/super.c 2007-06-07 14:09:36.000000000 -0700 +++ slub/fs/ext3/super.c 2007-06-07 14:28:47.000000000 -0700 @@ -483,13 +483,25 @@ static void init_once(void * foo, struct inode_init_once(&ei->vfs_inode); } +static void *ext3_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, + offsetof(struct ext3_inode_info, vfs_inode)); +} + +static struct kmem_cache_ops ext3_kmem_cache_ops = { + .get = ext3_get_inodes, + .kick = kick_inodes +}; + static int init_inodecache(void) { ext3_inode_cachep = kmem_cache_create("ext3_inode_cache", sizeof(struct ext3_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), - init_once, NULL); + init_once, + &ext3_kmem_cache_ops); if (ext3_inode_cachep == NULL) return -ENOMEM; return 0; Index: slub/fs/ext4/super.c =================================================================== --- slub.orig/fs/ext4/super.c 2007-06-07 14:09:36.000000000 -0700 +++ slub/fs/ext4/super.c 2007-06-07 14:29:49.000000000 -0700 @@ -543,13 +543,25 @@ static void init_once(void * foo, struct inode_init_once(&ei->vfs_inode); } +static void *ext4_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, + offsetof(struct ext4_inode_info, vfs_inode)); +} + +static struct kmem_cache_ops ext4_kmem_cache_ops = { + .get = ext4_get_inodes, + .kick = kick_inodes +}; + static int init_inodecache(void) { ext4_inode_cachep = kmem_cache_create("ext4_inode_cache", sizeof(struct ext4_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), - init_once, NULL); + init_once, + &ext4_kmem_cache_ops); if (ext4_inode_cachep == NULL) return -ENOMEM; return 0; -- From clameter@sgi.com Mon Jun 18 02:31:20 2007 Message-Id: <20070618093120.532176203@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:10 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 17/26] Slab defragmentation: Support inode defragmentation for xfs Content-Disposition: inline; filename=slub_defrag_fs_xfs Add slab defrag support. 
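All of the filesystem conversions in this series follow the same pattern as the ext2 hunk above. As a generic, hedged sketch for a hypothetical filesystem 'myfs' whose private inode info embeds the VFS inode (the struct and function names here are illustrative, while fs_get_inodes() and kick_inodes() are the generic helpers added to fs/inode.c earlier in the series):

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/stddef.h>

/* Hypothetical filesystem; only the defragmentation hookup is shown. */
struct myfs_inode_info {
	unsigned long my_state;
	struct inode vfs_inode;		/* embedded VFS inode */
};

/*
 * Convert the slab object pointers to the embedded struct inode and
 * take references via the generic helper.
 */
static void *myfs_get_inodes(struct kmem_cache *s, int nr, void **v)
{
	return fs_get_inodes(s, nr, v,
			offsetof(struct myfs_inode_info, vfs_inode));
}

static struct kmem_cache_ops myfs_kmem_cache_ops = {
	.get = myfs_get_inodes,
	.kick = kick_inodes,		/* generic eviction of unused inodes */
};

The ops structure is then passed as the last argument of kmem_cache_create() (or via KMEM_CACHE_OPS) when the filesystem sets up its inode cache.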
Signed-off-by: Christoph Lameter --- fs/xfs/linux-2.6/kmem.h | 5 +++-- fs/xfs/linux-2.6/xfs_buf.c | 2 +- fs/xfs/linux-2.6/xfs_super.c | 13 ++++++++++++- fs/xfs/xfs_vfsops.c | 6 +++--- 4 files changed, 19 insertions(+), 7 deletions(-) Index: slub/fs/xfs/linux-2.6/kmem.h =================================================================== --- slub.orig/fs/xfs/linux-2.6/kmem.h 2007-06-06 13:08:09.000000000 -0700 +++ slub/fs/xfs/linux-2.6/kmem.h 2007-06-06 13:32:58.000000000 -0700 @@ -79,9 +79,10 @@ kmem_zone_init(int size, char *zone_name static inline kmem_zone_t * kmem_zone_init_flags(int size, char *zone_name, unsigned long flags, - void (*construct)(void *, kmem_zone_t *, unsigned long)) + void (*construct)(void *, kmem_zone_t *, unsigned long), + const struct kmem_cache_ops *ops) { - return kmem_cache_create(zone_name, size, 0, flags, construct, NULL); + return kmem_cache_create(zone_name, size, 0, flags, construct, ops); } static inline void Index: slub/fs/xfs/linux-2.6/xfs_buf.c =================================================================== --- slub.orig/fs/xfs/linux-2.6/xfs_buf.c 2007-06-06 13:08:09.000000000 -0700 +++ slub/fs/xfs/linux-2.6/xfs_buf.c 2007-06-06 13:32:58.000000000 -0700 @@ -1834,7 +1834,7 @@ xfs_buf_init(void) #endif xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf", - KM_ZONE_HWALIGN, NULL); + KM_ZONE_HWALIGN, NULL, NULL); if (!xfs_buf_zone) goto out_free_trace_buf; Index: slub/fs/xfs/linux-2.6/xfs_super.c =================================================================== --- slub.orig/fs/xfs/linux-2.6/xfs_super.c 2007-06-06 13:08:09.000000000 -0700 +++ slub/fs/xfs/linux-2.6/xfs_super.c 2007-06-06 13:32:58.000000000 -0700 @@ -355,13 +355,24 @@ xfs_fs_inode_init_once( inode_init_once(vn_to_inode((bhv_vnode_t *)vnode)); } +static void *xfs_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, offsetof(bhv_vnode_t, v_inode)); +}; + +static struct kmem_cache_ops xfs_kmem_cache_ops = { + .get = xfs_get_inodes, + .kick = kick_inodes +}; + STATIC int xfs_init_zones(void) { xfs_vnode_zone = kmem_zone_init_flags(sizeof(bhv_vnode_t), "xfs_vnode", KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD, - xfs_fs_inode_init_once); + xfs_fs_inode_init_once, + &xfs_kmem_cache_ops); if (!xfs_vnode_zone) goto out; Index: slub/fs/xfs/xfs_vfsops.c =================================================================== --- slub.orig/fs/xfs/xfs_vfsops.c 2007-06-06 15:19:52.000000000 -0700 +++ slub/fs/xfs/xfs_vfsops.c 2007-06-06 15:20:36.000000000 -0700 @@ -109,13 +109,13 @@ xfs_init(void) xfs_inode_zone = kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode", KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | - KM_ZONE_SPREAD, NULL); + KM_ZONE_SPREAD, NULL, NULL); xfs_ili_zone = kmem_zone_init_flags(sizeof(xfs_inode_log_item_t), "xfs_ili", - KM_ZONE_SPREAD, NULL); + KM_ZONE_SPREAD, NULL, NULL); xfs_chashlist_zone = kmem_zone_init_flags(sizeof(xfs_chashlist_t), "xfs_chashlist", - KM_ZONE_SPREAD, NULL); + KM_ZONE_SPREAD, NULL, NULL); /* * Allocate global trace buffers. 
-- From clameter@sgi.com Mon Jun 18 02:31:20 2007 Message-Id: <20070618093120.709349679@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:11 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 18/26] Slab defragmentation: Support procfs inode defragmentation Content-Disposition: inline; filename=slub_defrag_fs_proc Signed-off-by: Christoph Lameter --- fs/proc/inode.c | 22 ++++++++++++++++++++-- 1 file changed, 20 insertions(+), 2 deletions(-) Index: slub/fs/proc/inode.c =================================================================== --- slub.orig/fs/proc/inode.c 2007-06-04 20:12:56.000000000 -0700 +++ slub/fs/proc/inode.c 2007-06-04 21:35:00.000000000 -0700 @@ -112,14 +112,25 @@ static void init_once(void * foo, struct inode_init_once(&ei->vfs_inode); } - + +static void *proc_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, + offsetof(struct proc_inode, vfs_inode)); +}; + +static struct kmem_cache_ops proc_kmem_cache_ops = { + .get = proc_get_inodes, + .kick = kick_inodes +}; + int __init proc_init_inodecache(void) { proc_inode_cachep = kmem_cache_create("proc_inode_cache", sizeof(struct proc_inode), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), - init_once, NULL); + init_once, &proc_kmem_cache_ops); if (proc_inode_cachep == NULL) return -ENOMEM; return 0; -- From clameter@sgi.com Mon Jun 18 02:31:21 2007 Message-Id: <20070618093120.965286820@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:12 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 19/26] Slab defragmentation: Support reiserfs inode defragmentation Content-Disposition: inline; filename=slub_defrag_fs_reiser Add inode defrag support Signed-off-by: Christoph Lameter --- fs/reiserfs/super.c | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) Index: slub/fs/reiserfs/super.c =================================================================== --- slub.orig/fs/reiserfs/super.c 2007-06-07 14:09:36.000000000 -0700 +++ slub/fs/reiserfs/super.c 2007-06-07 14:30:49.000000000 -0700 @@ -520,6 +520,17 @@ static void init_once(void *foo, struct #endif } +static void *reiserfs_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, + offsetof(struct reiserfs_inode_info, vfs_inode)); +} + +struct kmem_cache_ops reiserfs_kmem_cache_ops = { + .get = reiserfs_get_inodes, + .kick = kick_inodes +}; + static int init_inodecache(void) { reiserfs_inode_cachep = kmem_cache_create("reiser_inode_cache", @@ -527,7 +538,8 @@ static int init_inodecache(void) reiserfs_inode_info), 0, (SLAB_RECLAIM_ACCOUNT| SLAB_MEM_SPREAD), - init_once, NULL); + init_once, + &reiserfs_kmem_cache_ops); if (reiserfs_inode_cachep == NULL) return -ENOMEM; return 0; -- From clameter@sgi.com Mon Jun 18 02:31:21 2007 Message-Id: <20070618093121.039905792@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:13 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 20/26] Slab defragmentation: Support inode defragmentation for sockets Content-Disposition: inline; 
filename=slub_defrag_fs_socket Signed-off-by: Christoph Lameter --- net/socket.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) Index: slub/net/socket.c =================================================================== --- slub.orig/net/socket.c 2007-06-06 15:19:29.000000000 -0700 +++ slub/net/socket.c 2007-06-06 15:20:54.000000000 -0700 @@ -264,6 +264,17 @@ static void init_once(void *foo, struct inode_init_once(&ei->vfs_inode); } +static void *sock_get_inodes(struct kmem_cache *s, int nr, void **v) +{ + return fs_get_inodes(s, nr, v, + offsetof(struct socket_alloc, vfs_inode)); +} + +static struct kmem_cache_ops sock_kmem_cache_ops = { + .get = sock_get_inodes, + .kick = kick_inodes +}; + static int init_inodecache(void) { sock_inode_cachep = kmem_cache_create("sock_inode_cache", @@ -273,7 +284,7 @@ static int init_inodecache(void) SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD), init_once, - NULL); + &sock_kmem_cache_ops); if (sock_inode_cachep == NULL) return -ENOMEM; return 0; -- From clameter@sgi.com Mon Jun 18 02:31:21 2007 Message-Id: <20070618093121.211864359@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:14 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 21/26] Slab defragmentation: support dentry defragmentation Content-Disposition: inline; filename=slub_defrag_dentry get() uses the dcache lock and then works with dget_locked to obtain a reference to the dentry. An additional complication is that the dentry may be in process of being freed or it may just have been allocated. We add an additional flag to d_flags to be able to determined the status of an object. kick() is called after get() has been used and after the slab has dropped all of its own locks. The dentry pruning for unused entries works in a straighforward way. Signed-off-by: Christoph Lameter --- fs/dcache.c | 112 +++++++++++++++++++++++++++++++++++++++++++++---- include/linux/dcache.h | 5 ++ 2 files changed, 109 insertions(+), 8 deletions(-) Index: slub/fs/dcache.c =================================================================== --- slub.orig/fs/dcache.c 2007-06-07 14:31:24.000000000 -0700 +++ slub/fs/dcache.c 2007-06-07 14:31:39.000000000 -0700 @@ -135,6 +135,7 @@ static struct dentry *d_kill(struct dent list_del(&dentry->d_u.d_child); dentry_stat.nr_dentry--; /* For d_free, below */ + dentry->d_flags &= ~DCACHE_ENTRY_VALID; /*drops the locks, at that point nobody can reach this dentry */ dentry_iput(dentry); parent = dentry->d_parent; @@ -951,6 +952,7 @@ struct dentry *d_alloc(struct dentry * p if (parent) list_add(&dentry->d_u.d_child, &parent->d_subdirs); dentry_stat.nr_dentry++; + dentry->d_flags |= DCACHE_ENTRY_VALID; spin_unlock(&dcache_lock); return dentry; @@ -2108,18 +2110,112 @@ static void __init dcache_init_early(voi INIT_HLIST_HEAD(&dentry_hashtable[loop]); } +/* + * The slab is holding off frees. Thus we can safely examine + * the object without the danger of it vanishing from under us. + */ +static void *get_dentries(struct kmem_cache *s, int nr, void **v) +{ + struct dentry *dentry; + int i; + + spin_lock(&dcache_lock); + for (i = 0; i < nr; i++) { + dentry = v[i]; + /* + * if DCACHE_ENTRY_VALID is not set then the dentry + * may be already in the process of being freed. 
+ */ + if (!(dentry->d_flags & DCACHE_ENTRY_VALID)) + v[i] = NULL; + else + dget_locked(dentry); + } + spin_unlock(&dcache_lock); + return 0; +} + +/* + * Slab has dropped all the locks. Get rid of the + * refcount we obtained earlier and also rid of the + * object. + */ +static void kick_dentries(struct kmem_cache *s, int nr, void **v, void *private) +{ + struct dentry *dentry; + int abort = 0; + int i; + + /* + * First invalidate the dentries without holding the dcache lock + */ + for (i = 0; i < nr; i++) { + dentry = v[i]; + + if (dentry) + d_invalidate(dentry); + } + + /* + * If we are the last one holding a reference then the dentries can + * be freed. We need the dcache_lock. + */ + spin_lock(&dcache_lock); + for (i = 0; i < nr; i++) { + dentry = v[i]; + if (!dentry) + continue; + + if (abort) + goto put_dentry; + + spin_lock(&dentry->d_lock); + if (atomic_read(&dentry->d_count) > 1) { + /* + * Reference count was increased. + * We need to abandon the freeing of + * objects. + */ + abort = 1; + spin_unlock(&dentry->d_lock); +put_dentry: + spin_unlock(&dcache_lock); + dput(dentry); + spin_lock(&dcache_lock); + continue; + } + + /* Remove from LRU */ + if (!list_empty(&dentry->d_lru)) { + dentry_stat.nr_unused--; + list_del_init(&dentry->d_lru); + } + /* Drop the entry */ + prune_one_dentry(dentry, 1); + } + spin_unlock(&dcache_lock); + + /* + * dentries are freed using RCU so we need to wait until RCU + * operations arei complete + */ + if (!abort) + synchronize_rcu(); +} + +static struct kmem_cache_ops dentry_kmem_cache_ops = { + .get = get_dentries, + .kick = kick_dentries, +}; + static void __init dcache_init(unsigned long mempages) { int loop; - /* - * A constructor could be added for stable state like the lists, - * but it is probably not worth it because of the cache nature - * of the dcache. - */ - dentry_cache = KMEM_CACHE(dentry, - SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD); - + dentry_cache = KMEM_CACHE_OPS(dentry, + SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD, + &dentry_kmem_cache_ops); + register_shrinker(&dcache_shrinker); /* Hash may have been set up in dcache_init_early */ Index: slub/include/linux/dcache.h =================================================================== --- slub.orig/include/linux/dcache.h 2007-06-07 14:31:24.000000000 -0700 +++ slub/include/linux/dcache.h 2007-06-07 14:32:35.000000000 -0700 @@ -177,6 +177,11 @@ d_iput: no no no yes #define DCACHE_INOTIFY_PARENT_WATCHED 0x0020 /* Parent inode is watched */ +#define DCACHE_ENTRY_VALID 0x0040 /* + * Entry is valid and not in the + * process of being created or + * destroyed. + */ extern spinlock_t dcache_lock; /** -- From clameter@sgi.com Mon Jun 18 02:31:21 2007 Message-Id: <20070618093121.375402205@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:15 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 22/26] SLUB: kmem_cache_vacate to support page allocator memory defragmentation Content-Disposition: inline; filename=slab_defrag_kmem_cache_vacate Special function kmem_cache_vacate() to push out the objects in a specified slab. In order to make that work we will have to handle slab page allocations in such a way that we can determine if a slab is valid whenever we access it regardless of its time in life. A valid slab that can be freed has PageSlab(page) and page->inuse > 0 set. 
So we need to make sure in allocate_slab that page->inuse is zero before PageSlab is set otherwise kmem_cache_vacate may operate on a slab that has not been properly setup yet. There is currently no in kernel user. The hope is that Mel's defragmentation method can at some point use this functionality to make slabs movable so that the reclaimable type of pages may not be necessary anymore. Signed-off-by: Christoph Lameter --- include/linux/slab.h | 1 mm/slab.c | 9 ++++ mm/slob.c | 9 ++++ mm/slub.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++----- 4 files changed, 119 insertions(+), 9 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slab.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slab.h 2007-06-17 18:12:29.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slab.h 2007-06-17 18:12:37.000000000 -0700 @@ -98,6 +98,7 @@ unsigned int kmem_cache_size(struct kmem const char *kmem_cache_name(struct kmem_cache *); int kmem_ptr_validate(struct kmem_cache *cachep, const void *ptr); int kmem_cache_defrag(int percentage, int node); +int kmem_cache_vacate(struct page *); /* * Please use this macro to create slab caches. Simply specify the Index: linux-2.6.22-rc4-mm2/mm/slab.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slab.c 2007-06-17 18:12:22.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slab.c 2007-06-17 18:12:37.000000000 -0700 @@ -2523,6 +2523,15 @@ int kmem_cache_defrag(int percent, int n return 0; } +/* + * SLAB does not support slab defragmentation + */ +int kmem_cache_vacate(struct page *page) +{ + return 0; +} +EXPORT_SYMBOL(kmem_cache_vacate); + /** * kmem_cache_destroy - delete a cache * @cachep: the cache to destroy Index: linux-2.6.22-rc4-mm2/mm/slob.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slob.c 2007-06-17 18:12:22.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slob.c 2007-06-17 18:12:37.000000000 -0700 @@ -558,6 +558,15 @@ int kmem_cache_defrag(int percentage, in return 0; } +/* + * SLOB does not support slab defragmentation + */ +int kmem_cache_vacate(struct page *page) +{ + return 0; +} +EXPORT_SYMBOL(kmem_cache_vacate); + int kmem_ptr_validate(struct kmem_cache *a, const void *b) { return 0; Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:22.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:37.000000000 -0700 @@ -1038,6 +1038,7 @@ static inline int slab_pad_check(struct static inline int check_object(struct kmem_cache *s, struct page *page, void *object, int active) { return 1; } static inline void add_full(struct kmem_cache_node *n, struct page *page) {} +static inline void remove_full(struct kmem_cache *s, struct page *page) {} static inline void kmem_cache_open_debug_check(struct kmem_cache *s) {} #define slub_debug 0 #endif @@ -1103,12 +1104,11 @@ static struct page *new_slab(struct kmem n = get_node(s, page_to_nid(page)); if (n) atomic_long_inc(&n->nr_slabs); + + page->inuse = 0; + page->lockless_freelist = NULL; page->offset = s->offset / sizeof(void *); page->slab = s; - page->flags |= 1 << PG_slab; - if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON | - SLAB_STORE_USER | SLAB_TRACE)) - SetSlabDebug(page); start = page_address(page); end = start + s->objects * s->size; @@ -1126,11 +1126,20 @@ static struct page *new_slab(struct kmem 
set_freepointer(s, last, NULL); page->freelist = start; - page->lockless_freelist = NULL; - page->inuse = 0; -out: - if (flags & __GFP_WAIT) - local_irq_disable(); + + /* + * page->inuse must be 0 when PageSlab(page) becomes + * true so that defrag knows that this slab is not in use. + */ + smp_wmb(); + __SetPageSlab(page); + if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON | + SLAB_STORE_USER | SLAB_TRACE)) + SetSlabDebug(page); + + out: + if (flags & __GFP_WAIT) + local_irq_disable(); return page; } @@ -2654,6 +2663,88 @@ static unsigned long __kmem_cache_shrink } /* + * Get a page off a list and freeze it. Must be holding slab lock. + */ +static void freeze_from_list(struct kmem_cache *s, struct page *page) +{ + if (page->inuse < s->objects) + remove_partial(s, page); + else if (s->flags & SLAB_STORE_USER) + remove_full(s, page); + SetSlabFrozen(page); +} + +/* + * Attempt to free objects in a page. Return 1 if succesful. + */ +int kmem_cache_vacate(struct page *page) +{ + unsigned long flags; + struct kmem_cache *s; + int vacated = 0; + void **vector = NULL; + + /* + * Get a reference to the page. Return if its freed or being freed. + * This is necessary to make sure that the page does not vanish + * from under us before we are able to check the result. + */ + if (!get_page_unless_zero(page)) + return 0; + + if (!PageSlab(page)) + goto out; + + s = page->slab; + if (!s) + goto out; + + vector = kmalloc(s->objects * sizeof(void *), GFP_KERNEL); + if (!vector) + goto out2; + + local_irq_save(flags); + /* + * The implicit memory barrier in slab_lock guarantees that page->inuse + * is loaded after PageSlab(page) has been established to be true. + * Only revelant for a newly created slab. + */ + slab_lock(page); + + /* + * We may now have locked a page that may be in various stages of + * being freed. If the PageSlab bit is off then we have already + * reached the page allocator. If page->inuse is zero then we are + * in SLUB but freeing or allocating the page. + * page->inuse is never modified without the slab lock held. + * + * Also abort if the page happens to be already frozen. If its + * frozen then a concurrent vacate may be in progress. + */ + if (!PageSlab(page) || SlabFrozen(page) || !page->inuse) + goto out_locked; + + /* + * We are holding a lock on a slab page and all operations on the + * slab are blocking. + */ + if (!s->ops->get || !s->ops->kick) + goto out_locked; + freeze_from_list(s, page); + vacated = __kmem_cache_vacate(s, page, flags, vector); +out: + kfree(vector); +out2: + put_page(page); + return vacated == 0; +out_locked: + slab_unlock(page); + local_irq_restore(flags); + goto out; + +} + +/* * kmem_cache_shrink removes empty slabs from the partial lists and sorts * the remaining slabs by the number of items in use. The slabs with the * most items in use come first. New allocations will then fill those up -- From clameter@sgi.com Mon Jun 18 02:31:21 2007 Message-Id: <20070618093121.575014717@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:16 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 23/26] SLUB: Move sysfs operations outside of slub_lock Content-Disposition: inline; filename=slub_lock_cleanup Sysfs can do a gazillion things when called. Make sure that we do not call any sysfs functions while holding the slub_lock. Let sysfs fend for itself locking wise. 
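For illustration, the destroy path after this change reads roughly as follows (a condensed rendering of the kmem_cache_destroy() hunk further down; the down_write() at the top sits above the hunk shown):

	void kmem_cache_destroy(struct kmem_cache *s)
	{
		down_write(&slub_lock);
		s->refcount--;
		if (!s->refcount) {
			/* Unlink under slub_lock ... */
			list_del(&s->list);
			/* ... then drop it before the slow, sysfs-touching teardown. */
			up_write(&slub_lock);
			if (kmem_cache_close(s))
				WARN_ON(1);
			sysfs_slab_remove(s);
			kfree(s);
		} else
			up_write(&slub_lock);
	}
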
Just protect the essentials: 1. The list of all slab caches 2. The kmalloc_dma array 3. The ref counters of the slabs. Signed-off-by: Christoph Lameter --- mm/slub.c | 34 +++++++++++++++++++++------------- 1 file changed, 21 insertions(+), 13 deletions(-) Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:37.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:44.000000000 -0700 @@ -2217,12 +2217,13 @@ void kmem_cache_destroy(struct kmem_cach s->refcount--; if (!s->refcount) { list_del(&s->list); + up_write(&slub_lock); if (kmem_cache_close(s)) WARN_ON(1); sysfs_slab_remove(s); kfree(s); - } - up_write(&slub_lock); + } else + up_write(&slub_lock); } EXPORT_SYMBOL(kmem_cache_destroy); @@ -3027,26 +3028,33 @@ struct kmem_cache *kmem_cache_create(con */ s->objsize = max(s->objsize, (int)size); s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *))); + up_write(&slub_lock); + if (sysfs_slab_alias(s, name)) goto err; - } else { - s = kmalloc(kmem_size, GFP_KERNEL); - if (s && kmem_cache_open(s, GFP_KERNEL, name, + + return s; + } + + s = kmalloc(kmem_size, GFP_KERNEL); + if (s) { + if (kmem_cache_open(s, GFP_KERNEL, name, size, align, flags, ctor, ops)) { - if (sysfs_slab_add(s)) { - kfree(s); - goto err; - } list_add(&s->list, &slab_caches); + up_write(&slub_lock); raise_kswapd_order(s->order); - } else - kfree(s); + + if (sysfs_slab_add(s)) + goto err; + + return s; + + } + kfree(s); } up_write(&slub_lock); - return s; err: - up_write(&slub_lock); if (flags & SLAB_PANIC) panic("Cannot create slabcache %s\n", name); else -- From clameter@sgi.com Mon Jun 18 02:31:21 2007 Message-Id: <20070618093121.715910394@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:17 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 24/26] SLUB: Avoid page struct cacheline bouncing due to remote frees to cpu slab Content-Disposition: inline; filename=slub_performance_conc_free_alloc A remote free may access the same page struct that also contains the lockless freelist for the cpu slab. If objects have a short lifetime and are freed by a different processor then remote frees back to the slab from which we are currently allocating are frequent. The cacheline with the page struct needs to be repeately acquired in exclusive mode by both the allocating thread and the freeing thread. If this is frequent enough then performance will suffer because of cacheline bouncing. This patchset puts the lockless_freelist pointer in its own cacheline. In order to make that happen we introduce a per cpu structure called kmem_cache_cpu. Instead of keeping an array of pointers to page structs we now keep an array to a per cpu structure that--among other things--contains the pointer to the lockless freelist. The freeing thread can then keep possession of exclusive access to the page struct cacheline while the allocating thread keeps its exclusive access to the cacheline containing the per cpu structure. This works as long as the allocating cpu is able to service its request from the lockless freelist. If the lockless freelist runs empty then the allocating thread needs to acquire exclusive access to the cacheline with the page struct lock the slab. The allocating thread will then check if new objects were freed to the per cpu slab. 
If so, it will keep that slab as the cpu slab and continue with the objects that were just freed remotely. So the allocating thread can take a series of objects freed by the remote processor and dish them out again. Ideally, allocations would just keep recycling objects within the same slab this way, which leads to an ideal allocation / remote free pattern. The number of objects that can be handled like that is limited by the capacity of one slab. Increasing the slab size via slub_min_objects/slub_max_order may increase the number of objects if necessary.

If the allocating thread runs out of objects and finds that no objects were put back by the remote processor, then it will retrieve a new slab (from the partial lists or from the page allocator) and start with a whole new set of objects, while the remote thread may still be freeing objects to the old cpu slab. This may then repeat until the new slab is also exhausted. If remote freeing has freed objects in the earlier slab, then that earlier slab will now be on the partial list and the allocating thread will pick it next for allocation. So the loop is extended. However, both threads need to take the list_lock to make the swizzling via the partial list happen.

It is likely that this kind of scheme will keep the objects being passed around confined to a small set that stays in the cpu caches, leading to increased performance.

More code cleanups become possible:

- Instead of passing a cpu we can now pass a kmem_cache_cpu structure around, which reduces the number of parameters to various functions.
- A node_match() function can be defined for NUMA to encapsulate the locality checks.

Effect on allocations:

Cachelines touched before this patch:

	Write: page struct and first cacheline of object

Cachelines touched after this patch:

	Write: kmem_cache_cpu cacheline and first cacheline of object

The handling when the lockless alloc list runs empty gets to be a bit more complicated since another cacheline now has to be written to. But that is halfway out of the hot path.

Effect on freeing:

Cachelines touched before this patch:

	Write: page struct and first cacheline of object

Cachelines touched after this patch, depending on how we free:

	Write (to cpu slab): kmem_cache_cpu struct and first cacheline of object
	Write (to other): page struct and first cacheline of object
	Read (to cpu slab): page struct to identify the slab etc.
	Read (to other): cpu local kmem_cache_cpu struct to verify that it is not the cpu slab

Summary:

Pro:
	- Distinct cachelines, so that concurrent remote frees and local allocs on a cpu slab can occur without cacheline bouncing.
	- Avoids potential cacheline bouncing due to neighboring per cpu pointer updates in kmem_cache's cpu_slab array, since each entry now grows to a full cacheline (therefore the comment that discussed this concern is removed).

Cons:
	- Freeing objects now requires reading one additional cacheline.
	- Memory usage grows slightly. Each per cpu object is blown up from one word (pointing to the page struct) to one cacheline with various data, i.e. NR_CPUS*NR_SLABS*L1_BYTES more memory. Say NR_SLABS is 100 and the cacheline size is 128 bytes: that is 12.8k of additional slab metadata per cpu.
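To make the cacheline argument concrete, here is the allocation fast path condensed from the hunks below (sketch only; irq handling, the empty-list slow path and debug processing are elided).

Before the patch, every fast-path allocation writes page->lockless_freelist, i.e. the same struct page cacheline that a remote free also writes:

	page = s->cpu_slab[smp_processor_id()];
	object = page->lockless_freelist;
	page->lockless_freelist = object[page->offset];

After the patch, the fast path writes only the per cpu kmem_cache_cpu cacheline and leaves the struct page cacheline to the freeing side:

	c = get_cpu_slab(s, smp_processor_id());
	object = c->lockless_freelist;
	c->lockless_freelist = object[c->page->offset];
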
Signed-off-by: Christoph Lameter --- include/linux/slub_def.h | 9 ++ mm/slub.c | 164 +++++++++++++++++++++++++---------------------- 2 files changed, 97 insertions(+), 76 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-17 18:12:19.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-17 18:12:48.000000000 -0700 @@ -11,6 +11,13 @@ #include #include +struct kmem_cache_cpu { + void **lockless_freelist; + struct page *page; + int node; + /* Lots of wasted space */ +} ____cacheline_aligned_in_smp; + struct kmem_cache_node { spinlock_t list_lock; /* Protect partial list and nr_partial */ unsigned long nr_partial; @@ -55,7 +62,7 @@ struct kmem_cache { int defrag_ratio; struct kmem_cache_node *node[MAX_NUMNODES]; #endif - struct page *cpu_slab[NR_CPUS]; + struct kmem_cache_cpu cpu_slab[NR_CPUS]; }; /* Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-17 18:12:44.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-17 18:12:48.000000000 -0700 @@ -140,11 +140,6 @@ static inline void ClearSlabDebug(struct /* * Issues still to be resolved: * - * - The per cpu array is updated for each new slab and and is a remote - * cacheline for most nodes. This could become a bouncing cacheline given - * enough frequent updates. There are 16 pointers in a cacheline, so at - * max 16 cpus could compete for the cacheline which may be okay. - * * - Support PAGE_ALLOC_DEBUG. Should be easy to do. * * - Variable sizing of the per node arrays @@ -283,6 +278,11 @@ static inline struct kmem_cache_node *ge #endif } +static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu) +{ + return &s->cpu_slab[cpu]; +} + static inline int check_valid_pointer(struct kmem_cache *s, struct page *page, const void *object) { @@ -1395,33 +1395,34 @@ static void unfreeze_slab(struct kmem_ca /* * Remove the cpu slab */ -static void deactivate_slab(struct kmem_cache *s, struct page *page, int cpu) +static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) { + struct page *page = c->page; /* * Merge cpu freelist into freelist. Typically we get here * because both freelists are empty. So this is unlikely * to occur. 
*/ - while (unlikely(page->lockless_freelist)) { + while (unlikely(c->lockless_freelist)) { void **object; /* Retrieve object from cpu_freelist */ - object = page->lockless_freelist; - page->lockless_freelist = page->lockless_freelist[page->offset]; + object = c->lockless_freelist; + c->lockless_freelist = c->lockless_freelist[page->offset]; /* And put onto the regular freelist */ object[page->offset] = page->freelist; page->freelist = object; page->inuse--; } - s->cpu_slab[cpu] = NULL; + c->page = NULL; unfreeze_slab(s, page); } -static inline void flush_slab(struct kmem_cache *s, struct page *page, int cpu) +static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) { - slab_lock(page); - deactivate_slab(s, page, cpu); + slab_lock(c->page); + deactivate_slab(s, c); } /* @@ -1430,18 +1431,17 @@ static inline void flush_slab(struct kme */ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu) { - struct page *page = s->cpu_slab[cpu]; + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); - if (likely(page)) - flush_slab(s, page, cpu); + if (likely(c && c->page)) + flush_slab(s, c); } static void flush_cpu_slab(void *d) { struct kmem_cache *s = d; - int cpu = smp_processor_id(); - __flush_cpu_slab(s, cpu); + __flush_cpu_slab(s, smp_processor_id()); } static void flush_all(struct kmem_cache *s) @@ -1458,6 +1458,19 @@ static void flush_all(struct kmem_cache } /* + * Check if the objects in a per cpu structure fit numa + * locality expectations. + */ +static inline int node_match(struct kmem_cache_cpu *c, int node) +{ +#ifdef CONFIG_NUMA + if (node != -1 && c->node != node) + return 0; +#endif + return 1; +} + +/* * Slow path. The lockless freelist is empty or we need to perform * debugging duties. * @@ -1475,45 +1488,46 @@ static void flush_all(struct kmem_cache * we need to allocate a new slab. This is slowest path since we may sleep. 
*/ static void *__slab_alloc(struct kmem_cache *s, - gfp_t gfpflags, int node, void *addr, struct page *page) + gfp_t gfpflags, int node, void *addr, struct kmem_cache_cpu *c) { void **object; - int cpu = smp_processor_id(); + struct page *new; - if (!page) + if (!c->page) goto new_slab; - slab_lock(page); - if (unlikely(node != -1 && page_to_nid(page) != node)) + slab_lock(c->page); + if (unlikely(!node_match(c, node))) goto another_slab; load_freelist: - object = page->freelist; + object = c->page->freelist; if (unlikely(!object)) goto another_slab; - if (unlikely(SlabDebug(page))) + if (unlikely(SlabDebug(c->page))) goto debug; - object = page->freelist; - page->lockless_freelist = object[page->offset]; - page->inuse = s->objects; - page->freelist = NULL; - slab_unlock(page); + object = c->page->freelist; + c->lockless_freelist = object[c->page->offset]; + c->page->inuse = s->objects; + c->page->freelist = NULL; + c->node = page_to_nid(c->page); + slab_unlock(c->page); return object; another_slab: - deactivate_slab(s, page, cpu); + deactivate_slab(s, c); new_slab: - page = get_partial(s, gfpflags, node); - if (page) { - s->cpu_slab[cpu] = page; + new = get_partial(s, gfpflags, node); + if (new) { + c->page = new; goto load_freelist; } - page = new_slab(s, gfpflags, node); - if (page) { - cpu = smp_processor_id(); - if (s->cpu_slab[cpu]) { + new = new_slab(s, gfpflags, node); + if (new) { + c = get_cpu_slab(s, smp_processor_id()); + if (c->page) { /* * Someone else populated the cpu_slab while we * enabled interrupts, or we have gotten scheduled @@ -1521,34 +1535,32 @@ new_slab: * requested node even if __GFP_THISNODE was * specified. So we need to recheck. */ - if (node == -1 || - page_to_nid(s->cpu_slab[cpu]) == node) { + if (node_match(c, node)) { /* * Current cpuslab is acceptable and we * want the current one since its cache hot */ - discard_slab(s, page); - page = s->cpu_slab[cpu]; - slab_lock(page); + discard_slab(s, new); + slab_lock(c->page); goto load_freelist; } /* New slab does not fit our expectations */ - flush_slab(s, s->cpu_slab[cpu], cpu); + flush_slab(s, c); } - slab_lock(page); - SetSlabFrozen(page); - s->cpu_slab[cpu] = page; + slab_lock(new); + SetSlabFrozen(new); + c->page = new; goto load_freelist; } return NULL; debug: - object = page->freelist; - if (!alloc_debug_processing(s, page, object, addr)) + object = c->page->freelist; + if (!alloc_debug_processing(s, c->page, object, addr)) goto another_slab; - page->inuse++; - page->freelist = object[page->offset]; - slab_unlock(page); + c->page->inuse++; + c->page->freelist = object[c->page->offset]; + slab_unlock(c->page); return object; } @@ -1565,20 +1577,20 @@ debug: static void __always_inline *slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr, int length) { - struct page *page; void **object; unsigned long flags; + struct kmem_cache_cpu *c; local_irq_save(flags); - page = s->cpu_slab[smp_processor_id()]; - if (unlikely(!page || !page->lockless_freelist || - (node != -1 && page_to_nid(page) != node))) + c = get_cpu_slab(s, smp_processor_id()); + if (unlikely(!c->page || !c->lockless_freelist || + !node_match(c, node))) - object = __slab_alloc(s, gfpflags, node, addr, page); + object = __slab_alloc(s, gfpflags, node, addr, c); else { - object = page->lockless_freelist; - page->lockless_freelist = object[page->offset]; + object = c->lockless_freelist; + c->lockless_freelist = object[c->page->offset]; } local_irq_restore(flags); @@ -1678,12 +1690,13 @@ static void __always_inline slab_free(st { 
void **object = (void *)x; unsigned long flags; + struct kmem_cache_cpu *c; local_irq_save(flags); - if (likely(page == s->cpu_slab[smp_processor_id()] && - !SlabDebug(page))) { - object[page->offset] = page->lockless_freelist; - page->lockless_freelist = object; + c = get_cpu_slab(s, smp_processor_id()); + if (likely(page == c->page && !SlabDebug(page))) { + object[page->offset] = c->lockless_freelist; + c->lockless_freelist = object; } else __slab_free(s, page, x, addr); @@ -2931,7 +2944,7 @@ void __init kmem_cache_init(void) #endif kmem_size = offsetof(struct kmem_cache, cpu_slab) + - nr_cpu_ids * sizeof(struct page *); + nr_cpu_ids * sizeof(struct kmem_cache_cpu); printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d," " MinObjects=%d, CPUs=%d, Nodes=%d\n", @@ -3518,22 +3531,20 @@ static unsigned long slab_objects(struct per_cpu = nodes + nr_node_ids; for_each_possible_cpu(cpu) { - struct page *page = s->cpu_slab[cpu]; - int node; + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); - if (page) { - node = page_to_nid(page); + if (c && c->page) { if (flags & SO_CPU) { int x = 0; if (flags & SO_OBJECTS) - x = page->inuse; + x = c->page->inuse; else x = 1; total += x; - nodes[node] += x; + nodes[c->node] += x; } - per_cpu[node]++; + per_cpu[c->node]++; } } @@ -3579,14 +3590,17 @@ static int any_slab_objects(struct kmem_ int node; int cpu; - for_each_possible_cpu(cpu) - if (s->cpu_slab[cpu]) + for_each_possible_cpu(cpu) { + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); + + if (c && c->page) return 1; + } - for_each_node(node) { + for_each_online_node(node) { struct kmem_cache_node *n = get_node(s, node); - if (n->nr_partial || atomic_read(&n->nr_slabs)) + if (n && (n->nr_partial || atomic_read(&n->nr_slabs))) return 1; } return 0; -- From clameter@sgi.com Mon Jun 18 02:31:22 2007 Message-Id: <20070618093121.884470044@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:18 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 25/26] SLUB: Add an object counter to the kmem_cache_cpu structure Content-Disposition: inline; filename=slub_performance_cpuslab_counter The kmem_cache_cpu structure is now 2 1/2 words. Allocation sizes are rounded to word boundaries so we can place an additional integer in the kmem_cache_structure without increasing its size. The counter is useful to keep track of the numbers of objects left in the lockless per cpu list. If we have this number then the merging of the per cpu objects back into the slab (when a slab is deactivated) can be very fast since we have no need anymore to count the objects. Pros: - The benefit is that requests to rapidly changing node numbers from a single processor are improved on NUMA. Switching from a slab on one node to another becomes faster since the back spilling of objects is simplified. Cons: - Additional need to increase and decrease a counter in slab_alloc and slab_free. But the counter is in a cacheline already written to so its cheap to do. 
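Condensed from the deactivate_slab() hunk below, the fast spill-back that the counter enables looks like this (sketch; the merge loop for the case where remote frees did occur remains as before):

	if (c->lockless_freelist && !page->freelist) {
		/*
		 * No remote frees happened while this was the cpu slab:
		 * hand the whole lockless list back and restore the saved
		 * object count in one step instead of walking the list.
		 */
		page->freelist = c->lockless_freelist;
		page->inuse = c->objects;
		c->lockless_freelist = NULL;
	}
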
Signed-off-by: Christoph Lameter --- include/linux/slub_def.h | 1 mm/slub.c | 48 ++++++++++++++++++++++++++++++++++++----------- 2 files changed, 38 insertions(+), 11 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-17 23:51:36.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-18 00:45:54.000000000 -0700 @@ -14,6 +14,7 @@ struct kmem_cache_cpu { void **lockless_freelist; struct page *page; + int objects; /* Saved page->inuse */ int node; /* Lots of wasted space */ } ____cacheline_aligned_in_smp; Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-18 00:40:04.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-18 00:45:54.000000000 -0700 @@ -1398,23 +1398,47 @@ static void unfreeze_slab(struct kmem_ca static void deactivate_slab(struct kmem_cache *s, struct kmem_cache_cpu *c) { struct page *page = c->page; + /* * Merge cpu freelist into freelist. Typically we get here * because both freelists are empty. So this is unlikely * to occur. */ - while (unlikely(c->lockless_freelist)) { - void **object; + if (unlikely(c->lockless_freelist)) { - /* Retrieve object from cpu_freelist */ - object = c->lockless_freelist; - c->lockless_freelist = c->lockless_freelist[page->offset]; + /* + * Special case in which no remote frees have occurred. + * Then we can simply have the lockless_freelist become + * the page->freelist and put the counter back. + */ + if (!page->freelist) { + page->freelist = c->lockless_freelist; + page->inuse = c->objects; + c->lockless_freelist = NULL; + } else { - /* And put onto the regular freelist */ - object[page->offset] = page->freelist; - page->freelist = object; - page->inuse--; + /* + * Objects both on page freelist and cpu freelist. + * We need to merge both lists. By doing that + * we reverse the object order in the slab. + * Sigh. But we rarely get here. 
+ */ + while (c->lockless_freelist) { + void **object; + + /* Retrieve object from lockless freelist */ + object = c->lockless_freelist; + c->lockless_freelist = + c->lockless_freelist[page->offset]; + + /* And put onto the regular freelist */ + object[page->offset] = page->freelist; + page->freelist = object; + page->inuse--; + } + } } + c->page = NULL; unfreeze_slab(s, page); } @@ -1508,6 +1532,7 @@ load_freelist: object = c->page->freelist; c->lockless_freelist = object[c->page->offset]; + c->objects = c->page->inuse + 1; c->page->inuse = s->objects; c->page->freelist = NULL; c->node = page_to_nid(c->page); @@ -1583,14 +1608,14 @@ static void __always_inline *slab_alloc( local_irq_save(flags); c = get_cpu_slab(s, smp_processor_id()); - if (unlikely(!c->page || !c->lockless_freelist || - !node_match(c, node))) + if (unlikely(!c->lockless_freelist || !node_match(c, node))) object = __slab_alloc(s, gfpflags, node, addr, c); else { object = c->lockless_freelist; c->lockless_freelist = object[c->page->offset]; + c->objects++; } local_irq_restore(flags); @@ -1697,6 +1722,7 @@ static void __always_inline slab_free(st if (likely(page == c->page && !SlabDebug(page))) { object[page->offset] = c->lockless_freelist; c->lockless_freelist = object; + c->objects--; } else __slab_free(s, page, x, addr); -- From clameter@sgi.com Mon Jun 18 02:31:22 2007 Message-Id: <20070618093122.055195981@sgi.com> References: <20070618093053.890987336@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 18 Jun 2007 02:31:19 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Pekka Enberg Cc: suresh.b.siddha@intel.com Subject: [patch 26/26] SLUB: Place kmem_cache_cpu structures in a NUMA aware way. Content-Disposition: inline; filename=slub_performance_numa_placement The kmem_cache_cpu structures introduced are currently an array in the kmem_cache_struct. Meaning the kmem_cache_cpu structures are overwhelmingly on the wrong node for systems with a higher amount of nodes. These are performance critical structures since the per node information has to be touched for every alloc and free in a slab. In order to place the kmem_cache_cpu structure optimally we put an array of pointers to kmem_cache_cpu structs in kmem_cache (similar to SLAB). The kmem_cache_cpu structures can now be allocated in a more intelligent way. We could put per cpu structures for the same cpu but different slab caches in cachelines together to save space and decrease the cache footprint. However, the slab allocators itself control only allocations per node. Thus we set up a simple per cpu array for every processor with 100 per cpu structures which is usually enough to get them all set up right. If we run out then we fall back to kmalloc_node. This also solves the bootstrap problem since we do not have to use slab allocator functions early in boot to get memory for the small per cpu structures. Pro: - NUMA aware placement improves memory performance - All global structures in struct kmem_cache become readonly - Dense packing of per cpu structures reduces cacheline footprint in SMP and NUMA. - Potential avoidance of exclusive cacheline fetches on the free and alloc hotpath since multiple kmem_cache_cpu structures are in one cacheline. This is particularly important for the kmalloc array. Cons: - Additional reference to one read only cacheline (per cpu array of pointers to kmem_cache_cpu) in both slab_alloc() and slab_free(). 
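The bootstrap scheme for these structures is simple enough to model in a few lines. The following is a userspace sketch only (malloc/free stand in for kmalloc_node/kfree, and the array is global rather than per cpu), but it mirrors how alloc_kmem_cache_cpu()/free_kmem_cache_cpu() below thread free entries through their first word:

	#include <stdlib.h>
	#include <string.h>

	#define NR_KMEM_CACHE_CPU 100		/* same table size as the patch */

	struct kmem_cache_cpu {
		void **lockless_freelist;	/* doubles as the free list link */
		/* ... remaining per cpu fields ... */
	};

	static struct kmem_cache_cpu pool[NR_KMEM_CACHE_CPU];
	static struct kmem_cache_cpu *pool_free_list;

	static void pool_init(void)
	{
		int i;

		/* Thread every entry onto the free list through its first word. */
		for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--) {
			pool[i].lockless_freelist = (void **)pool_free_list;
			pool_free_list = &pool[i];
		}
	}

	static struct kmem_cache_cpu *pool_alloc(void)
	{
		struct kmem_cache_cpu *c = pool_free_list;

		if (c)
			pool_free_list = (struct kmem_cache_cpu *)c->lockless_freelist;
		else
			c = malloc(sizeof(*c));	/* table overflow fallback */
		if (c)
			memset(c, 0, sizeof(*c));
		return c;
	}

	static void pool_release(struct kmem_cache_cpu *c)
	{
		if (c < pool || c >= pool + NR_KMEM_CACHE_CPU) {
			free(c);		/* came from the overflow path */
			return;
		}
		c->lockless_freelist = (void **)pool_free_list;
		pool_free_list = c;
	}
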
Signed-off-by: Christoph Lameter --- include/linux/slub_def.h | 9 ++- mm/slub.c | 131 +++++++++++++++++++++++++++++++++++++++++++++-- 2 files changed, 133 insertions(+), 7 deletions(-) Index: linux-2.6.22-rc4-mm2/include/linux/slub_def.h =================================================================== --- linux-2.6.22-rc4-mm2.orig/include/linux/slub_def.h 2007-06-18 01:28:48.000000000 -0700 +++ linux-2.6.22-rc4-mm2/include/linux/slub_def.h 2007-06-18 01:34:52.000000000 -0700 @@ -16,8 +16,7 @@ struct kmem_cache_cpu { struct page *page; int objects; /* Saved page->inuse */ int node; - /* Lots of wasted space */ -} ____cacheline_aligned_in_smp; +}; struct kmem_cache_node { spinlock_t list_lock; /* Protect partial list and nr_partial */ @@ -63,7 +62,11 @@ struct kmem_cache { int defrag_ratio; struct kmem_cache_node *node[MAX_NUMNODES]; #endif - struct kmem_cache_cpu cpu_slab[NR_CPUS]; +#ifdef CONFIG_SMP + struct kmem_cache_cpu *cpu_slab[NR_CPUS]; +#else + struct kmem_cache_cpu cpu_slab; +#endif }; /* Index: linux-2.6.22-rc4-mm2/mm/slub.c =================================================================== --- linux-2.6.22-rc4-mm2.orig/mm/slub.c 2007-06-18 01:34:42.000000000 -0700 +++ linux-2.6.22-rc4-mm2/mm/slub.c 2007-06-18 02:15:22.000000000 -0700 @@ -280,7 +280,11 @@ static inline struct kmem_cache_node *ge static inline struct kmem_cache_cpu *get_cpu_slab(struct kmem_cache *s, int cpu) { - return &s->cpu_slab[cpu]; +#ifdef CONFIG_SMP + return s->cpu_slab[cpu]; +#else + return &s->cpu_slab; +#endif } static inline int check_valid_pointer(struct kmem_cache *s, @@ -1924,14 +1928,126 @@ static void init_kmem_cache_node(struct INIT_LIST_HEAD(&n->full); } +#ifdef CONFIG_SMP +/* + * Per cpu array for per cpu structures. + * + * The per cpu array places all kmem_cache_cpu structures from one processor + * close together meaning that it becomes possible that multiple per cpu + * structures are contained in one cacheline. This may be particularly + * beneficial for the kmalloc caches. + * + * A desktop system typically has around 60-80 slabs. With 100 here we are + * likely able to get per cpu structures for all caches from the array defined + * here. We must be able to cover all kmalloc caches during bootstrap. + * + * If the per cpu array is exhausted then fall back to kmalloc + * of individual cachelines. No sharing is possible then. 
+ */ +#define NR_KMEM_CACHE_CPU 100 + +static DEFINE_PER_CPU(struct kmem_cache_cpu, + kmem_cache_cpu)[NR_KMEM_CACHE_CPU]; + +static DEFINE_PER_CPU(struct kmem_cache_cpu *, kmem_cache_cpu_free); + +static struct kmem_cache_cpu *alloc_kmem_cache_cpu(int cpu, gfp_t flags) +{ + struct kmem_cache_cpu *c = per_cpu(kmem_cache_cpu_free, cpu); + + if (c) + per_cpu(kmem_cache_cpu_free, cpu) = + (void *)c->lockless_freelist; + else { + /* Table overflow: So allocate ourselves */ + c = kmalloc_node( + ALIGN(sizeof(struct kmem_cache_cpu), cache_line_size()), + flags, cpu_to_node(cpu)); + if (!c) + return NULL; + } + + memset(c, 0, sizeof(struct kmem_cache_cpu)); + return c; +} + +static void free_kmem_cache_cpu(struct kmem_cache_cpu *c, int cpu) +{ + if (c < per_cpu(kmem_cache_cpu, cpu) || + c > per_cpu(kmem_cache_cpu, cpu) + NR_KMEM_CACHE_CPU) { + kfree(c); + return; + } + c->lockless_freelist = (void *)per_cpu(kmem_cache_cpu_free, cpu); + per_cpu(kmem_cache_cpu_free, cpu) = c; +} + +static void free_kmem_cache_cpus(struct kmem_cache *s) +{ + int cpu; + + for_each_online_cpu(cpu) { + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); + + if (c) { + s->cpu_slab[cpu] = NULL; + free_kmem_cache_cpu(c, cpu); + } + } +} + +static int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags) +{ + int cpu; + + for_each_online_cpu(cpu) { + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); + + if (c) + continue; + + c = alloc_kmem_cache_cpu(cpu, flags); + if (!c) { + free_kmem_cache_cpus(s); + return 0; + } + s->cpu_slab[cpu] = c; + } + return 1; +} + +static void __init init_alloc_cpu(void) +{ + int cpu; + int i; + + for_each_online_cpu(cpu) { + for (i = NR_KMEM_CACHE_CPU - 1; i >= 0; i--) + free_kmem_cache_cpu(&per_cpu(kmem_cache_cpu, cpu)[i], + cpu); + } +} + +#else +static inline void free_kmem_cache_cpus(struct kmem_cache *s) {} +static inline void init_alloc_cpu(struct kmem_cache *s) {} + +static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags) +{ + return 1; +} +#endif + #ifdef CONFIG_NUMA + /* * No kmalloc_node yet so do it by hand. We know that this is the first * slab on the node for this slabcache. There are no concurrent accesses * possible. * * Note that this function only works on the kmalloc_node_cache - * when allocating for the kmalloc_node_cache. + * when allocating for the kmalloc_node_cache. This is used for bootstrapping + * memory on a fresh node that has no slab structures yet. 
*/ static struct kmem_cache_node * __init early_kmem_cache_node_alloc(gfp_t gfpflags, int node) @@ -2152,8 +2268,13 @@ static int kmem_cache_open(struct kmem_c #ifdef CONFIG_NUMA s->defrag_ratio = 100; #endif - if (init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA)) + if (!init_kmem_cache_nodes(s, gfpflags & ~SLUB_DMA)) + goto error; + + if (alloc_kmem_cache_cpus(s, gfpflags & ~SLUB_DMA)) return 1; + + free_kmem_cache_nodes(s); error: if (flags & SLAB_PANIC) panic("Cannot create slab %s size=%lu realsize=%u " @@ -2236,6 +2357,8 @@ static inline int kmem_cache_close(struc flush_all(s); /* Attempt to free all objects */ + free_kmem_cache_cpus(s); + for_each_online_node(node) { struct kmem_cache_node *n = get_node(s, node); @@ -2908,6 +3031,8 @@ void __init kmem_cache_init(void) slub_min_objects = DEFAULT_ANTIFRAG_MIN_OBJECTS; } + init_alloc_cpu(); + #ifdef CONFIG_NUMA /* * Must first have the slab cache available for the allocations of the @@ -2971,7 +3096,7 @@ void __init kmem_cache_init(void) #endif kmem_size = offsetof(struct kmem_cache, cpu_slab) + - nr_cpu_ids * sizeof(struct kmem_cache_cpu); + nr_cpu_ids * sizeof(struct kmem_cache_cpu *); printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d," " MinObjects=%d, CPUs=%d, Nodes=%d\n", @@ -3116,15 +3241,28 @@ static int __cpuinit slab_cpuup_callback unsigned long flags; switch (action) { + case CPU_UP_PREPARE: + case CPU_UP_PREPARE_FROZEN: + down_read(&slub_lock); + list_for_each_entry(s, &slab_caches, list) + s->cpu_slab[cpu] = alloc_kmem_cache_cpu(cpu, + GFP_KERNEL); + up_read(&slub_lock); + break; + case CPU_UP_CANCELED: case CPU_UP_CANCELED_FROZEN: case CPU_DEAD: case CPU_DEAD_FROZEN: down_read(&slub_lock); list_for_each_entry(s, &slab_caches, list) { + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); + local_irq_save(flags); __flush_cpu_slab(s, cpu); local_irq_restore(flags); + free_kmem_cache_cpu(c, cpu); + s->cpu_slab[cpu] = NULL; } up_read(&slub_lock); break; --
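As a closing, back-of-the-envelope check of the packing claim in this last patch, here is a userspace sketch under assumed sizes (64 bit pointers, 4 byte int; a plain pointer stands in for struct page *):

	#include <assert.h>

	struct kmem_cache_cpu_sketch {
		void **lockless_freelist;	/* 8 bytes */
		void *page;			/* 8 bytes */
		int objects;			/* 4 bytes */
		int node;			/* 4 bytes */
	};

	int main(void)
	{
		/*
		 * 24 bytes once ____cacheline_aligned_in_smp is dropped, so
		 * two (64 byte lines) to five (128 byte lines) per cpu
		 * structures can share a cacheline, e.g. for neighboring
		 * kmalloc caches set up from the per cpu array above.
		 */
		assert(sizeof(struct kmem_cache_cpu_sketch) == 24);
		assert(2 * sizeof(struct kmem_cache_cpu_sketch) <= 64);
		assert(5 * sizeof(struct kmem_cache_cpu_sketch) <= 128);
		return 0;
	}
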