From compudj@krystal.dyndns.org Thu Mar 27 20:31:33 2008
Date: Thu, 27 Mar 2008 23:33:25 -0400
From: Mathieu Desnoyers
To: Christoph Lameter
Cc: Robert Wisniewski, mbligh@google.com, michel.dagenais@polymtl.ca
Subject: [PATCH] Implement slub fastpath with sequence number

* Christoph Lameter (clameter@sgi.com) wrote:
> On Tue, 25 Mar 2008, Mathieu Desnoyers wrote:
>
> > > Actually, I think there is a problem with my approach : it won't work if
> > > _all_ allocations are done in interrupt context, because it would be
> > > stuck in the slow path when it reaches the overflow. I'll try to think
> > > about a solution.
> >
> > A solution would be to keep an "active fastpath" count in each percpu slab
> > structure. So instead of checking for in_interrupt() in the slow path, I
> > can check if the number of active fastpaths is non-zero. This, too,
> > requires preemption to be disabled in the slow path so I can use
> > non-atomic increment/decrements for this active fastpaths count.
>
> We are getting nearer to the 2.6.26 merge cycle. Do we have a usable
> approach now? I am fine with an approach that initially requires !PREEMPT.
>

I had to add a new member to the per-cpu slab structure :

  hulong active;  /* Active alloc/free fastpaths */

It counts the number of fastpaths nested over the running code. Just checking
for in_interrupt() was buggy : if we only allocate/free from interrupt
context, we always end up running the slow path once the counter has
overflowed.

Since we already have to add a counter, I see no reason to try to pack this
nesting count _and_ the "base" into the 5 LSBs of the freelist pointer.
Therefore, I suggest using half a long for the count and half a long for the
base.

Here is a version with individual SLUB_FASTPATH_ALLOC and SLUB_FASTPATH_FREE
defines. The numbers below indicate that the overall gain on an
alloc-then-free pair outweighs the performance degradation of the free path
when the slow path must be taken.

Here is the patch. It applies on 2.6.25-rc7 in vm.git.

Mathieu


Implement slub fastpath with sequence number

It allows cmpxchg_local to detect nested object re-use and slab changes by
keeping a sequence counter in the freelist MSBs (16 bits on 32-bit
architectures, 32 bits on 64-bit architectures). Whenever an object is freed
into the cpu slab cache, the counter is incremented. Whenever the alloc/free
slow paths modify the offset or the free base, the sequence counter is also
incremented. It is used to make sure we know whether the base has been
modified by an interrupt nested over the fast path.
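To make the packing concrete, here is a minimal stand-alone user-space sketch
of the half-long split described above. The helper names and the test address
are made up for illustration only; the actual kernel code in the patch below
uses the union halfselect type and the make_ptr()/make_version() helpers:

#include <stdio.h>

#define HALF_BITS_PER_LONG      (sizeof(unsigned long) * 8 / 2)
#define HALF_LONG_MASK          ((1UL << HALF_BITS_PER_LONG) - 1)

/* Pack: high half = sequence number, low half = low bits of the object. */
static unsigned long make_freelist_word(unsigned long seq, unsigned long obj)
{
        return (seq << HALF_BITS_PER_LONG) | (obj & HALF_LONG_MASK);
}

/*
 * Unpack: recombine the "base" (high half of the object address, kept in a
 * separate per-cpu field) with the offset stored in the freelist word.
 */
static unsigned long make_object_addr(unsigned long base, unsigned long word)
{
        return (base << HALF_BITS_PER_LONG) | (word & HALF_LONG_MASK);
}

int main(void)
{
        unsigned long obj = 0x12345678UL;               /* made-up address */
        unsigned long base = obj >> HALF_BITS_PER_LONG; /* would live in c->base */
        unsigned long word = make_freelist_word(7, obj);

        /* Bump the sequence: add 1 to the high half, keep the low half. */
        word += HALF_LONG_MASK + 1;

        printf("seq=%lu addr=%#lx\n", word >> HALF_BITS_PER_LONG,
               make_object_addr(base, word));
        return 0;
}

Only a pointer whose high half still matches the recorded base can be rebuilt
this way, which is what the same_base() check in the patch enforces.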
This version, running with preemption enabled, gives the following results on
x86_32 (UP, 3GHz Pentium 4), 2.6.25-rc4 kernel. The result "Overall for alloc
and then free" is the speedup for an alloc and free pair in the "Repeatedly
allocate then free" test.

-----------------
* With preemption
-----------------

Speedups (kmalloc 8)

SLUB_FASTPATH_FREE disabled
  Allocs :                          2.04
  Then free :                       0.97 (slowdown)
  Overall for alloc and then free : 1.37
  Repeated alloc+free :             1.44

SLUB_FASTPATH_FREE enabled
  Allocs :                          2.04
  Then free :                       0.83 (slowdown)
  Overall for alloc and then free : 1.25
  Repeated alloc+free :             2.45

Single thread testing
=====================

1. Kmalloc: Repeatedly allocate then free test

* baseline

10000 times kmalloc(8) -> 190 cycles kfree -> 150 cycles
10000 times kmalloc(16) -> 191 cycles kfree -> 150 cycles
10000 times kmalloc(32) -> 190 cycles kfree -> 152 cycles
10000 times kmalloc(64) -> 207 cycles kfree -> 162 cycles
10000 times kmalloc(128) -> 253 cycles kfree -> 178 cycles
10000 times kmalloc(256) -> 364 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 369 cycles kfree -> 284 cycles
10000 times kmalloc(1024) -> 434 cycles kfree -> 362 cycles
10000 times kmalloc(2048) -> 454 cycles kfree -> 385 cycles
10000 times kmalloc(4096) -> 512 cycles kfree -> 395 cycles
10000 times kmalloc(8192) -> 670 cycles kfree -> 692 cycles
10000 times kmalloc(16384) -> 710 cycles kfree -> 797 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE disabled)

10000 times kmalloc(8) -> 93 cycles kfree -> 154 cycles
10000 times kmalloc(16) -> 97 cycles kfree -> 154 cycles
10000 times kmalloc(32) -> 106 cycles kfree -> 156 cycles
10000 times kmalloc(64) -> 124 cycles kfree -> 161 cycles
10000 times kmalloc(128) -> 180 cycles kfree -> 178 cycles
10000 times kmalloc(256) -> 306 cycles kfree -> 276 cycles
10000 times kmalloc(512) -> 347 cycles kfree -> 283 cycles
10000 times kmalloc(1024) -> 481 cycles kfree -> 370 cycles
10000 times kmalloc(2048) -> 503 cycles kfree -> 382 cycles
10000 times kmalloc(4096) -> 560 cycles kfree -> 394 cycles
10000 times kmalloc(8192) -> 693 cycles kfree -> 691 cycles
10000 times kmalloc(16384) -> 715 cycles kfree -> 819 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE enabled)

10000 times kmalloc(8) -> 93 cycles kfree -> 180 cycles
10000 times kmalloc(16) -> 98 cycles kfree -> 188 cycles
10000 times kmalloc(32) -> 107 cycles kfree -> 187 cycles
10000 times kmalloc(64) -> 121 cycles kfree -> 191 cycles
10000 times kmalloc(128) -> 182 cycles kfree -> 206 cycles
10000 times kmalloc(256) -> 295 cycles kfree -> 274 cycles
10000 times kmalloc(512) -> 349 cycles kfree -> 300 cycles
10000 times kmalloc(1024) -> 474 cycles kfree -> 385 cycles
10000 times kmalloc(2048) -> 502 cycles kfree -> 401 cycles
10000 times kmalloc(4096) -> 563 cycles kfree -> 432 cycles
10000 times kmalloc(8192) -> 665 cycles kfree -> 695 cycles
10000 times kmalloc(16384) -> 720 cycles kfree -> 816 cycles
2. Kmalloc: alloc/free test

* baseline

10000 times kmalloc(8)/kfree -> 314 cycles
10000 times kmalloc(16)/kfree -> 312 cycles
10000 times kmalloc(32)/kfree -> 310 cycles
10000 times kmalloc(64)/kfree -> 310 cycles
10000 times kmalloc(128)/kfree -> 310 cycles
10000 times kmalloc(256)/kfree -> 326 cycles
10000 times kmalloc(512)/kfree -> 322 cycles
10000 times kmalloc(1024)/kfree -> 320 cycles
10000 times kmalloc(2048)/kfree -> 324 cycles
10000 times kmalloc(4096)/kfree -> 324 cycles
10000 times kmalloc(8192)/kfree -> 885 cycles
10000 times kmalloc(16384)/kfree -> 916 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE disabled)

10000 times kmalloc(8)/kfree -> 218 cycles
10000 times kmalloc(16)/kfree -> 218 cycles
10000 times kmalloc(32)/kfree -> 218 cycles
10000 times kmalloc(64)/kfree -> 218 cycles
10000 times kmalloc(128)/kfree -> 224 cycles
10000 times kmalloc(256)/kfree -> 232 cycles
10000 times kmalloc(512)/kfree -> 226 cycles
10000 times kmalloc(1024)/kfree -> 226 cycles
10000 times kmalloc(2048)/kfree -> 228 cycles
10000 times kmalloc(4096)/kfree -> 228 cycles
10000 times kmalloc(8192)/kfree -> 899 cycles
10000 times kmalloc(16384)/kfree -> 952 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE enabled)

10000 times kmalloc(8)/kfree -> 128 cycles
10000 times kmalloc(16)/kfree -> 125 cycles
10000 times kmalloc(32)/kfree -> 133 cycles
10000 times kmalloc(64)/kfree -> 228 cycles
10000 times kmalloc(128)/kfree -> 262 cycles
10000 times kmalloc(256)/kfree -> 138 cycles
10000 times kmalloc(512)/kfree -> 190 cycles
10000 times kmalloc(1024)/kfree -> 133 cycles
10000 times kmalloc(2048)/kfree -> 135 cycles
10000 times kmalloc(4096)/kfree -> 130 cycles
10000 times kmalloc(8192)/kfree -> 898 cycles
10000 times kmalloc(16384)/kfree -> 938 cycles

--------------------
* Without preemption
--------------------

Speedups (kmalloc 8)
  Allocs :                          2.18
  Then free :                       0.88 (slowdown)
  Overall for alloc and then free : 1.31
  Repeated alloc+free :             2.83

1. Kmalloc: Repeatedly allocate then free test

* baseline

10000 times kmalloc(8) -> 181 cycles kfree -> 149 cycles
10000 times kmalloc(16) -> 183 cycles kfree -> 150 cycles
10000 times kmalloc(32) -> 190 cycles kfree -> 152 cycles
10000 times kmalloc(64) -> 214 cycles kfree -> 156 cycles
10000 times kmalloc(128) -> 251 cycles kfree -> 182 cycles
10000 times kmalloc(256) -> 356 cycles kfree -> 270 cycles
10000 times kmalloc(512) -> 372 cycles kfree -> 275 cycles
10000 times kmalloc(1024) -> 424 cycles kfree -> 346 cycles
10000 times kmalloc(2048) -> 443 cycles kfree -> 375 cycles
10000 times kmalloc(4096) -> 504 cycles kfree -> 386 cycles
10000 times kmalloc(8192) -> 675 cycles kfree -> 652 cycles
10000 times kmalloc(16384) -> 703 cycles kfree -> 804 cycles

* cmpxchg_local

10000 times kmalloc(8) -> 83 cycles kfree -> 168 cycles
10000 times kmalloc(16) -> 89 cycles kfree -> 173 cycles
10000 times kmalloc(32) -> 98 cycles kfree -> 179 cycles
10000 times kmalloc(64) -> 117 cycles kfree -> 179 cycles
10000 times kmalloc(128) -> 169 cycles kfree -> 196 cycles
10000 times kmalloc(256) -> 291 cycles kfree -> 270 cycles
10000 times kmalloc(512) -> 335 cycles kfree -> 295 cycles
10000 times kmalloc(1024) -> 466 cycles kfree -> 367 cycles
10000 times kmalloc(2048) -> 470 cycles kfree -> 382 cycles
10000 times kmalloc(4096) -> 529 cycles kfree -> 401 cycles
10000 times kmalloc(8192) -> 701 cycles kfree -> 681 cycles
10000 times kmalloc(16384) -> 678 cycles kfree -> 801 cycles
2. Kmalloc: alloc/free test

* baseline

10000 times kmalloc(8)/kfree -> 318 cycles
10000 times kmalloc(16)/kfree -> 318 cycles
10000 times kmalloc(32)/kfree -> 318 cycles
10000 times kmalloc(64)/kfree -> 321 cycles
10000 times kmalloc(128)/kfree -> 316 cycles
10000 times kmalloc(256)/kfree -> 326 cycles
10000 times kmalloc(512)/kfree -> 326 cycles
10000 times kmalloc(1024)/kfree -> 326 cycles
10000 times kmalloc(2048)/kfree -> 333 cycles
10000 times kmalloc(4096)/kfree -> 330 cycles
10000 times kmalloc(8192)/kfree -> 836 cycles
10000 times kmalloc(16384)/kfree -> 920 cycles

* cmpxchg_local

10000 times kmalloc(8)/kfree -> 112 cycles
10000 times kmalloc(16)/kfree -> 113 cycles
10000 times kmalloc(32)/kfree -> 193 cycles
10000 times kmalloc(64)/kfree -> 216 cycles
10000 times kmalloc(128)/kfree -> 144 cycles
10000 times kmalloc(256)/kfree -> 127 cycles
10000 times kmalloc(512)/kfree -> 119 cycles
10000 times kmalloc(1024)/kfree -> 122 cycles
10000 times kmalloc(2048)/kfree -> 122 cycles
10000 times kmalloc(4096)/kfree -> 125 cycles
10000 times kmalloc(8192)/kfree -> 847 cycles
10000 times kmalloc(16384)/kfree -> 918 cycles

Changelog :

- Fix end of freelist.
- Removed VM_BUG_ON in get_/set_freelist_ptr (killed init). It makes no sense
  anyway, because it is valid to get/set the freelist pointer without
  disabling interrupts while the freelist is not yet in use by the rest of
  the system.
- Fixed debug_check_no_locks_freed(object, c->objsize) in the SLUB_FASTPATH
  free path.
- Fix deactivate_slab : write back the freelist pointer after the loop has
  been executed.
- Use a union to select sub-registers instead of using masks.
- Add CONFIG_PREEMPT support.
- Fix direct references to the cpu slab cache c->freelist (use the get/set
  accessors).
- Fix error in set_freelist_ptr (base was taken from the LSBs instead of the
  MSBs).
- Use END to mark the end of a slab freelist.
- Check for same_base at the beginning of the fastpaths. This detects whether
  we are dealing with cross-slab pointers (object in one slab, next_object in
  another slab). It also detects whether the slab order used is too big to be
  represented in half a long. It should be safe to deal with slabs of order
  higher than 4: the same_base check detects this and falls back on the slow
  path.
- Check the sequence counter before using c->base, to make sure base and
  offset are in sync and won't trigger a page fault. We re-read c->freelist
  after reading c->base and check whether the sequence number has changed.
  Since only an interrupt could have changed them, this tells us that both
  are in sync and that it is safe to use the pointer generated by make_ptr.
- Make sure an overflow will _always_ be detected, by forcing the slow path
  when an overflow is detected and by only allowing !in_interrupt() code to
  reset the counter. This solution is good as long as preemption is disabled
  in the fastpath. (A short stand-alone sketch of this overflow rule follows
  the changelog.)
- Small error I made : the !in_interrupt() code path must also go through the
  slow path, to correctly deal with the case where the !in_interrupt() path
  hits the overflow condition, interrupts then come in and modify the
  freelist LSBs, and the cmpxchg could miss the LSB modification made by the
  nested interrupts. Forcing the slow path upon overflow detection fixes
  this.
- Turn off the CMPXCHG_LOCAL fast path under CONFIG_DEBUG_PAGEALLOC. This
  makes sure we never read a pointer in the cmpxchg loop, get interrupted,
  and have the nested context call free_pages(), which flags the pages
  against read and write under debug pagealloc. The fastpath requires that
  these pages are always reused and stay readable.
- I had to add a new member to the per-cpu slab structure :

    hulong active;  /* Active alloc/free fastpaths */

  It counts the number of fastpaths nested over the running code. Just
  checking for in_interrupt() was buggy : if we only allocate/free from
  interrupt context, we always end up running the slow path once the counter
  has overflowed.
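As a side note, the overflow rule described in the changelog boils down to two
small predicates. The following is an illustrative stand-alone user-space
sketch with hypothetical helper names, not code from the patch:

#include <stdbool.h>

#define HALF_BITS_PER_LONG      (sizeof(unsigned long) * 8 / 2)
#define HALF_LONG_MASK          ((1UL << HALF_BITS_PER_LONG) - 1)

/*
 * Once the sequence half of the freelist word saturates, every fastpath
 * attempt must bail out to the irq-disabled slow path.
 */
static inline bool must_take_slowpath(unsigned long freelist_word)
{
        return (freelist_word >> HALF_BITS_PER_LONG) == HALF_LONG_MASK;
}

/*
 * The slow path may only wrap the saturated counter when no alloc/free
 * fastpath is active underneath it; tracking that is the job of the new
 * "active" field.
 */
static inline bool may_wrap_sequence(unsigned long active_fastpaths)
{
        return active_fastpaths == 0;
}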
Signed-off-by: Mathieu Desnoyers
CC: Christoph Lameter
CC: Martin Bligh
CC: Robert Wisniewski
CC: Michel Dagenais
---
 include/linux/slub_def.h |   52 +++++++
 mm/slub.c                |  335 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 359 insertions(+), 28 deletions(-)

Index: vm/mm/slub.c
===================================================================
--- vm.orig/mm/slub.c	2008-03-27 21:22:01.000000000 -0400
+++ vm/mm/slub.c	2008-03-27 22:05:22.000000000 -0400
@@ -138,6 +138,80 @@ static inline void ClearSlabDebug(struct
 	page->flags &= ~SLABDEBUG;
 }
 
+#ifdef SLUB_FASTPATH
+
+static inline unsigned long get_high_half(unsigned long v)
+{
+	return ((union halfselect)v).s.h;
+}
+
+static inline unsigned long get_low_half(unsigned long v)
+{
+	return ((union halfselect)v).s.l;
+}
+
+static inline int same_base(unsigned long base, void *object)
+{
+	return base == get_high_half((unsigned long)object);
+}
+
+static inline void *make_ptr(unsigned long base, void *freelist)
+{
+	union halfselect hs = {
+		.s.h = base,
+		.s.l = get_low_half((unsigned long)freelist),
+	};
+	return (void *)hs.v;
+}
+
+/*
+ * Make a new version by keeping the high half of the previous version and
+ * low half of object.
+ */
+static inline void *make_version(void *version, void *object)
+{
+	union halfselect ret = {
+		.s.h = get_high_half((unsigned long)version),
+		.s.l = get_low_half((unsigned long)object),
+	};
+	return (void *)ret.v;
+}
+
+static inline void **get_freelist_ptr(struct kmem_cache_cpu *c)
+{
+	return make_ptr(c->base, c->freelist);
+}
+
+static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
+{
+	c->base = get_high_half((unsigned long)freelist);
+	/*
+	 * Detect overflow. Only wrap if no active fastpath underneath us.
+	 * Slowpath is always taken when a counter overflow is detected.
+	 */
+	if (unlikely(get_high_half((unsigned long)c->freelist)
+			== HALF_LONG_MASK && c->active > 0)) {
+		c->freelist = make_version((void *)c->freelist, freelist);
+		return;
+	}
+	c->freelist = make_version((void *)c->freelist + HALF_LONG_MASK + 1,
+			freelist);
+}
+
+#else
+
+static inline void **get_freelist_ptr(struct kmem_cache_cpu *c)
+{
+	return c->freelist;
+}
+
+static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
+{
+	c->freelist = freelist;
+}
+
+#endif
+
 /*
  * Issues still to be resolved:
  *
@@ -270,13 +344,20 @@ static inline struct kmem_cache_cpu *get
 #endif
 }
 
+#define END ((void *)1)
+
+static inline int is_end(const void *freelist)
+{
+	return (unsigned long)freelist & (unsigned long)END;
+}
+
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
 {
 	void *base;
 
-	if (!object)
+	if (is_end(object))
 		return 1;
 
 	base = page_address(page);
@@ -312,7 +393,8 @@ static inline void set_freepointer(struc
 
 /* Scan freelist */
 #define for_each_free_object(__p, __s, __free) \
-	for (__p = (__free); __p; __p = get_freepointer((__s), __p))
+	for (__p = (__free); !is_end(__p); __p = get_freepointer((__s),\
+		__p))
 
 /* Determine object index from a given position */
 static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
@@ -730,7 +812,7 @@ static int check_object(struct kmem_cach
 		 * of the free objects in this slab. May cause
 		 * another error because the object count is now wrong.
 		 */
-		set_freepointer(s, p, NULL);
+		set_freepointer(s, p, END);
 		return 0;
 	}
 	return 1;
@@ -771,21 +853,21 @@ static int on_freelist(struct kmem_cache
 {
 	int nr = 0;
 	void *fp = page->freelist;
-	void *object = NULL;
+	void *object = END;
 	unsigned long max_objects;
 
-	while (fp && nr <= page->objects) {
+	while (!is_end(fp) && nr <= page->objects) {
 		if (fp == search)
 			return 1;
 		if (!check_valid_pointer(s, page, fp)) {
-			if (object) {
+			if (!is_end(object)) {
 				object_err(s, page, object,
 					"Freechain corrupt");
-				set_freepointer(s, object, NULL);
+				set_freepointer(s, object, END);
 				break;
 			} else {
 				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
+				page->freelist = END;
 				page->inuse = page->objects;
 				slab_fix(s, "Freelist cleared");
 				return 0;
@@ -813,7 +895,7 @@ static int on_freelist(struct kmem_cache
 		page->inuse = page->objects - nr;
 		slab_fix(s, "Object count adjusted.");
 	}
-	return search == NULL;
+	return is_end(search);
 }
 
 static void trace(struct kmem_cache *s, struct page *page, void *object, int alloc)
@@ -901,7 +983,7 @@ bad:
 		 */
 		slab_fix(s, "Marking all objects used");
 		page->inuse = page->objects;
-		page->freelist = NULL;
+		page->freelist = END;
 	}
 	return 0;
 }
@@ -941,7 +1023,7 @@ static int free_debug_processing(struct
 	}
 
 	/* Special debug activities for freeing objects */
-	if (!SlabFrozen(page) && !page->freelist)
+	if (!SlabFrozen(page) && is_end(page->freelist))
 		remove_full(s, page);
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
@@ -1142,7 +1224,7 @@ static struct page *new_slab(struct kmem
 		last = p;
 	}
 	setup_object(s, page, last);
-	set_freepointer(s, last, NULL);
+	set_freepointer(s, last, END);
 
 	page->freelist = start;
 	page->inuse = 0;
@@ -1376,7 +1458,7 @@ static void unfreeze_slab(struct kmem_ca
 	ClearSlabFrozen(page);
 	if (page->inuse) {
 
-		if (page->freelist) {
+		if (!is_end(page->freelist)) {
 			add_partial(n, page, tail);
 			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
@@ -1415,28 +1497,32 @@ static void deactivate_slab(struct kmem_
 {
 	struct page *page = c->page;
 	int tail = 1;
+	void **freelist;
 
-	if (page->freelist)
+	if (!is_end(page->freelist))
 		stat(c, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
	 */
-	while (unlikely(c->freelist)) {
+	freelist = get_freelist_ptr(c);
+	while (unlikely(!is_end(freelist))) {
 		void **object;
 
 		tail = 0;	/* Hot objects. Put the slab first */
 
 		/* Retrieve object from cpu_freelist */
-		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		object = freelist;
+		freelist = object[c->offset];
 
 		/* And put onto the regular freelist */
 		object[c->offset] = page->freelist;
 		page->freelist = object;
 		page->inuse--;
 	}
+	if (!tail)
+		set_freelist_ptr(c, freelist);
 	c->page = NULL;
 	unfreeze_slab(s, page, tail);
 }
@@ -1517,7 +1603,14 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+#ifdef SLUB_FASTPATH_ALLOC
+	unsigned long flags;
+	local_irq_save(flags);
+#ifdef CONFIG_PREEMPT
+	c = get_cpu_slab(s, raw_smp_processor_id());
+#endif
+#endif
 
 	if (!c->page)
 		goto new_slab;
@@ -1529,18 +1622,21 @@ static void *__slab_alloc(struct kmem_ca
 
 load_freelist:
 	object = c->page->freelist;
-	if (unlikely(!object))
+	if (unlikely(is_end(object)))
 		goto another_slab;
 	if (unlikely(SlabDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	set_freelist_ptr(c, object[c->offset]);
 	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
+	c->page->freelist = END;
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
 	stat(c, ALLOC_SLOWPATH);
+#ifdef SLUB_FASTPATH_ALLOC
+	local_irq_restore(flags);
+#endif
 	return object;
 
 another_slab:
@@ -1572,6 +1668,9 @@ new_slab:
 		c->page = new;
 		goto load_freelist;
 	}
+#ifdef SLUB_FASTPATH_ALLOC
+	local_irq_restore(flags);
+#endif
 	return NULL;
 debug:
 	if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1598,20 +1697,112 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
+
+/*
+ * The SLUB_FASTPATH path is provisional and is currently disabled if the
+ * kernel is compiled with preemption or if the arch does not support
+ * fast cmpxchg operations. There are a couple of coming changes that will
+ * simplify matters and allow preemption. Ultimately we may end up making
+ * SLUB_FASTPATH the default.
+ *
+ * 1. The introduction of the per cpu allocator will avoid array lookups
+ *    through get_cpu_slab(). A special register can be used instead.
+ *
+ * 2. The introduction of per cpu atomic operations (cpu_ops) means that
+ *    we can realize the logic here entirely with per cpu atomics. The
+ *    per cpu atomic ops will take care of the preemption issues.
+ */
+
+#ifdef SLUB_FASTPATH_ALLOC
+	void *old, *new, *result, *next_object;
+	unsigned long base;
+
+	preempt_disable();
+	c = get_cpu_slab(s, raw_smp_processor_id());
+	c->active++;
+	barrier();
+fastpath:	/* fastpath cmpxchg loop */
+	old = c->freelist;
+	/*
+	 * Whenever c->base is changed, the sequence number
+	 * _must_ be incremented. This barrier insures we read
+	 * version before c->base wrt interrupts.
+	 */
+	barrier();
+	base = c->base;
+
+	if (unlikely(is_end(old) || !node_match(c, node)))
+		goto slowpath;
+	if (unlikely(get_high_half((unsigned long)old) == HALF_LONG_MASK))
+		goto slowpath;
+	/*
+	 * make_ptr on base should always return a valid pointer;
+	 * insure base has not been changed by a nested interrupt by
+	 * re-reading the freelist sequence number. It makes sure the
+	 * base and the offset will generate a valid pointer.
+	 */
+	barrier();
+	if (c->freelist != old)
+		goto fastpath;	/* retry */
+	object = make_ptr(base, old);
+	/*
+	 * Need to increment the MSB counter here, because
+	 * object[c->offset] use is racy. We can race against
+	 * another slab_alloc fast path.
+	 * Note that the object[c->offset] read may return garbage, but
+	 * is insured to point to a valid address since pages are always
+	 * reused in the page allocator. We know if the
+	 * object[c->offset] read returned garbage because the sequence
+	 * number is incremented each time the freelist is modified.
+	 */
+	next_object = object[c->offset];
+	if (unlikely(!same_base(base, next_object)))
+		goto slowpath;
+	stat(c, ALLOC_FASTPATH);
+	new = make_version(old + HALF_LONG_MASK + 1, next_object);
+	result = cmpxchg_local(&c->freelist, old, new);
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * Just to be paranoid : warn if we detect that enough free or
+	 * slow paths nested on top of us to get the counter to go
+	 * half-way to overflow. That would be insane to do that many
+	 * allocations/free in interrupt handers, but check it anyway.
+	 */
+	WARN_ON(result - old > -1UL >> 1);
+#endif
+	if (result != old)
+		goto fastpath;	/* retry */
+	barrier();
+	c->active--;
+	preempt_enable();
+	goto got_object;
+slowpath:
+	barrier();
+	c->active--;
+	preempt_enable();
+	/*
+	 * __slab_alloc must make no assumption about the
+	 * tests previously done by slab_alloc : we could be
+	 * migrated to a different CPU.
+	 */
+	object = __slab_alloc(s, gfpflags, node, addr, c);
+got_object:
+#else
 	unsigned long flags;
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(is_end(c->freelist)) || !node_match(c, node))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
-		c->freelist = object[c->offset];
+		object = get_freelist_ptr(c);
+		set_freelist_ptr(c, object[c->offset]);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
+#endif
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, c->objsize);
@@ -1648,6 +1839,11 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
+#ifdef SLUB_FASTPATH_FREE
+	unsigned long flags;
+
+	local_irq_save(flags);
+#endif
 	c = get_cpu_slab(s, raw_smp_processor_id());
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
@@ -1672,17 +1868,20 @@ checks_ok:
 	 * Objects left in the slab. If it was not on the partial list before
 	 * then add it.
 	 */
-	if (unlikely(!prior)) {
+	if (unlikely(is_end(prior))) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
 		stat(c, FREE_ADD_PARTIAL);
 	}
 
out_unlock:
 	slab_unlock(page);
+#ifdef SLUB_FASTPATH_FREE
+	local_irq_restore(flags);
+#endif
 	return;
 
 slab_empty:
-	if (prior) {
+	if (!is_end(prior)) {
 		/*
 		 * Slab still on the partial list.
 		 */
@@ -1692,6 +1891,9 @@ slab_empty:
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
 	discard_slab(s, page);
+#ifdef SLUB_FASTPATH_FREE
+	local_irq_restore(flags);
+#endif
 	return;
 
 debug:
@@ -1716,19 +1918,90 @@ static __always_inline void slab_free(st
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
+
+#ifdef SLUB_FASTPATH_FREE
+	void *old, *new, *result;
+	unsigned long base;
+
+	preempt_disable();
+	c = get_cpu_slab(s, raw_smp_processor_id());
+	c->active++;
+	barrier();
+	debug_check_no_locks_freed(object, c->objsize);
+	while (1) {
+		old = c->freelist;
+		/*
+		 * If the compiler would reorder the retrieval of c->page and
+		 * c->base to come before c->freelist then an interrupt
+		 * could change the cpu slab before we retrieve c->version.
+		 * We could be matching on a page no longer active and put the
+		 * object onto the freelist of the wrong slab.
+		 *
+		 * On the other hand: If we already have the version
+		 * then any change of cpu_slab will cause the cmpxchg to fail
+		 * since the freelist pointers are unique per slab.
+		 */
+		barrier();
+		base = c->base;
+
+		if (unlikely(get_high_half((unsigned long)old) == HALF_LONG_MASK
+				|| !same_base(base, object)
+				|| page != c->page || c->node < 0)) {
+			barrier();
+			c->active--;
+			preempt_enable();
+			/*
+			 * __slab_free must make no assumption about the
+			 * tests previously done by slab_free : we could be
+			 * migrated to a different CPU.
+			 */
+			__slab_free(s, page, x, addr, c->offset);
+			break;
+		}
+		/*
+		 * It's ok to overwrite the content of object[c->offset] because
+		 * we own the object. This object won't appear in the freelist
+		 * until our cmpxchg_local succeeds. Therefore, no other part of
+		 * the slub slow path can use this object.
+		 * The result of make_ptr does not have to be dereferenced
+		 * until the cmpxchg succeeds. We don't care if base and old are
+		 * out-of-sync.
+		 */
+		object[c->offset] = make_ptr(base, old);
+		stat(c, FREE_FASTPATH);
+		new = make_version(old + HALF_LONG_MASK + 1, object);
+		result = cmpxchg_local(&c->freelist, old, new);
+#ifdef CONFIG_DEBUG_VM
+		/*
+		 * Just to be paranoid : warn if we detect that enough free or
+		 * slow paths nested on top of us to get the counter to go
+		 * half-way to overflow. That would be insane to do that many
+		 * allocations/free in interrupt handers, but check it anyway.
+		 */
+		WARN_ON(result - old > -1UL >> 1);
+#endif
+		if (result == old) {
+			barrier();
+			c->active--;
+			preempt_enable();
+			break;
+		}
+	}
+#else
 	unsigned long flags;
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
 	debug_check_no_locks_freed(object, c->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
-		c->freelist = object;
+		object[c->offset] = get_freelist_ptr(c);
+		set_freelist_ptr(c, object);
 		stat(c, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr, c->offset);
 
 	local_irq_restore(flags);
+#endif
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -1911,7 +2184,13 @@ static void init_kmem_cache_cpu(struct k
 					struct kmem_cache_cpu *c)
 {
 	c->page = NULL;
-	c->freelist = NULL;
+#ifdef SLUB_FASTPATH
+	c->freelist = make_version(0, END);
+	c->base = 0;
+	c->active = 0;
+#else
+	c->freelist = END;
+#endif
 	c->node = 0;
 	c->offset = s->offset / sizeof(void *);
 	c->objsize = s->objsize;
Index: vm/include/linux/slub_def.h
===================================================================
--- vm.orig/include/linux/slub_def.h	2008-03-27 21:21:57.000000000 -0400
+++ vm/include/linux/slub_def.h	2008-03-27 21:34:09.000000000 -0400
@@ -32,9 +32,61 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
+#if (defined (CONFIG_FAST_CMPXCHG_LOCAL) && !defined(CONFIG_DEBUG_PAGEALLOC))
+#define SLUB_FASTPATH
+#define SLUB_FASTPATH_ALLOC
+
+/*
+ * FREE FASTPATH is slower with config preempt because the slow path is taken
+ * often and performance is degradedby preempt disable/enable + irq
+ * disable/enable.
+ */
+#ifndef CONFIG_PREEMPT
+#define SLUB_FASTPATH_FREE
+#endif
+
+/*
+ * The fastpath divides the free pointer into two halves.
+ *
+ * The lower bits provide an offset relative to the base pointer (that is
+ * kept in different field). The upper bits provide a sequence number so
+ * that we can decide if a given pointer is still valid through a cmpxchg.
+ */
+#define HALF_BITS_PER_LONG (BITS_PER_LONG / 2)
+#define HALF_LONG_MASK ((1UL << HALF_BITS_PER_LONG) - 1)
+
+/*
+ * Define a half-long type to use register selection instead of bitmasks.
+ */
+#if (HALF_BITS_PER_LONG == 32)
+typedef u32 hulong;
+#elif (HALF_BITS_PER_LONG == 16)
+typedef u16 hulong;
+#else
+#error "Unknown long size"
+#endif
+
+union halfselect {
+	struct {
+#ifdef __BIG_ENDIAN
+		hulong h;
+		hulong l;
+#else
+		hulong l;
+		hulong h;
+#endif
+	} s;
+	unsigned long v;
+};
+#endif
+
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
+#ifdef SLUB_FASTPATH
+	hulong base;		/* Base for fastpath. */
+	hulong active;		/* Active alloc/free fastpaths */
+#endif
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68