From mathieu.desnoyers@polymtl.ca Wed Mar 12 18:35:01 2008
Date: Wed, 12 Mar 2008 21:35:40 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Lameter <clameter@sgi.com>, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] Implement slub fastpath with sequence number

* Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@polymtl.ca) wrote:
> > * Peter Zijlstra (peterz@infradead.org) wrote:
[...]
> > > Is it (remotely) possible that the version will wrap giving the false
> > > impression that nothing has changed and thus falsely proceed with a
> > > wrong object?
> > > 
> > 
> > If we have exactly, on a 32 bits arch, 65536 memory alloc or free in the
> > slab we are dealing with done in interrupt and softirqs nested over the
> > cmpxchg_local loop, without ever giving the control back to check the
> > value, and that after we get the control back the offset within the slab
> > is exactly the same, then, yes, it's possible. In that case, we would
> > think it's ok to proceed with the object and we would have memory
> > corruption (object used twice or free object list corruption).
> > 
> 
> I think I found a solution to this.
> 
> Quoting the changelog :
> 
> - Make sure an overflow will _always_ be detected by forcing the slow path when
>   overflow is detected and by only allowing !in_interrupt() code to reset the
>   counter. This solution is good as long as preemption is disabled in the
>   fastpath.
> 
 
Small error I made : the !in_interrupt() code path must also go through
the slow path to correctly deal with the case where the !in_interrupt()
path comes on the overflow condition, then interrupts come and modify
the freelist LSBs, and then the cmpxchg could miss detection of LSB
modification by the nested interrupts.

Forcing the slow path upon overflow detection will fix this.

New patch below.

Mathieu


Implement slub fastpath with sequence number

It allows the cmpxchg_local to detect nested object re-use and slab change by
keeping a counter in the freelist MSBs. (16 bits on 32 bits architectures, 32
bits on 64 bits architectures)

Whenever an object is freed in the cpu slab cache, the counter is incremented.
Whenever the alloc/free slow paths are modifying the offset or freebase, the
sequence counter is also incremented. It is used to make sure we know if
freebase has been modified in an interrupt nested over the fast path.

This version running with preemption enabled gives the following results on
x86_32 (UP, 3GHz Pentium 4), 2.6.25-rc4 kernel.

The speedup for the alloc/free test is 2.69.

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test

* baseline

10000 times kmalloc(8) -> 184 cycles kfree -> 150 cycles
10000 times kmalloc(16) -> 185 cycles kfree -> 151 cycles
10000 times kmalloc(32) -> 194 cycles kfree -> 152 cycles
10000 times kmalloc(64) -> 208 cycles kfree -> 158 cycles
10000 times kmalloc(128) -> 325 cycles kfree -> 178 cycles
10000 times kmalloc(256) -> 387 cycles kfree -> 254 cycles
10000 times kmalloc(512) -> 376 cycles kfree -> 238 cycles
10000 times kmalloc(1024) -> 379 cycles kfree -> 246 cycles
10000 times kmalloc(2048) -> 403 cycles kfree -> 270 cycles
10000 times kmalloc(4096) -> 439 cycles kfree -> 292 cycles
10000 times kmalloc(8192) -> 697 cycles kfree -> 642 cycles
10000 times kmalloc(16384) -> 742 cycles kfree -> 744 cycles

* cmpxchg_local

10000 times kmalloc(8) -> 89 cycles kfree -> 178 cycles
10000 times kmalloc(16) -> 91 cycles kfree -> 176 cycles
10000 times kmalloc(32) -> 103 cycles kfree -> 187 cycles
10000 times kmalloc(64) -> 125 cycles kfree -> 186 cycles
10000 times kmalloc(128) -> 238 cycles kfree -> 208 cycles
10000 times kmalloc(256) -> 312 cycles kfree -> 258 cycles
10000 times kmalloc(512) -> 299 cycles kfree -> 245 cycles
10000 times kmalloc(1024) -> 312 cycles kfree -> 263 cycles
10000 times kmalloc(2048) -> 334 cycles kfree -> 282 cycles
10000 times kmalloc(4096) -> 398 cycles kfree -> 323 cycles
10000 times kmalloc(8192) -> 690 cycles kfree -> 641 cycles
10000 times kmalloc(16384) -> 738 cycles kfree -> 729 cycles


2. Kmalloc: alloc/free test

* baseline

10000 times kmalloc(8)/kfree -> 310 cycles
10000 times kmalloc(16)/kfree -> 318 cycles
10000 times kmalloc(32)/kfree -> 314 cycles
10000 times kmalloc(64)/kfree -> 310 cycles
10000 times kmalloc(128)/kfree -> 318 cycles
10000 times kmalloc(256)/kfree -> 332 cycles
10000 times kmalloc(512)/kfree -> 332 cycles
10000 times kmalloc(1024)/kfree -> 324 cycles
10000 times kmalloc(2048)/kfree -> 324 cycles
10000 times kmalloc(4096)/kfree -> 328 cycles
10000 times kmalloc(8192)/kfree -> 888 cycles
10000 times kmalloc(16384)/kfree -> 973 cycles

* cmpxchg_local

10000 times kmalloc(8)/kfree -> 115 cycles
10000 times kmalloc(16)/kfree -> 112 cycles
10000 times kmalloc(32)/kfree -> 112 cycles
10000 times kmalloc(64)/kfree -> 112 cycles
10000 times kmalloc(128)/kfree -> 112 cycles
10000 times kmalloc(256)/kfree -> 124 cycles
10000 times kmalloc(512)/kfree -> 128 cycles
10000 times kmalloc(1024)/kfree -> 127 cycles
10000 times kmalloc(2048)/kfree -> 127 cycles
10000 times kmalloc(4096)/kfree -> 127 cycles
10000 times kmalloc(8192)/kfree -> 874 cycles
10000 times kmalloc(16384)/kfree -> 934 cycles

Changelog :
- fix end of freelist
- Removed VM_BUG_ON in get_/set_freelist_ptr (killed init). It makes no sense
  anyway, because it is valid to get/set the freelist ptr when the freelist
  is not in use by the rest of the system yet without disabling interrupts.
- Fixed debug_check_no_locks_freed(object, c->objsize) in SLUB_FASTPATH free.
- Fix deactivate slab : write back the freelist pointer after the loop has been
  executed.
- Use union to select sub-registers instead of using masks.
- Add CONFIG_PREEMPT support.
- Fix CPU slab caches c->freelist direct references (use get/set).
- Fix error in set_freelist_ptr (base was set to LSB instead of MSB).
- Use END to mark slab end.
- Check for same_base at the beginning of fastpaths. It detects if we deal with
  cross-slabs pointers (object in one slab, next_object in another slab). It
  will also detect if the order of slab used is too big to be represented in
  half long size. It should be safe to deal with slabs with order higher than 4:
  the same_base check should detect this and fallback on the slow path.
- Check the sequence counter before using c->base. It makes sure base and
  offset are in sync and won't trigger a page fault. We re-read the c->freelist
  after reading c->base and check if the sequence number has changed. Therefore,
  since only an interrupt could have changed them, we can detect that both are
  in sync and that it is safe to use the make_ptr generated.
- Make sure an overflow will _always_ be detected by forcing the slow path when
  overflow is detected and by only allowing !in_interrupt() code to reset the
  counter. This solution is good as long as preemption is disabled in the
  fastpath.
- Small error I made : the !in_interrupt() code path must also go through
  the slow path to correctly deal with the case where the !in_interrupt()
  path comes on the overflow condition, then interrupts come and modify
  the freelist LSBs, and then the cmpxchg could miss detection of LSB
  modification by the nested interrupts. Forcing the slow path upon overflow
  detection will fix this.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
 include/linux/slub_def.h |    1 
 mm/slub.c                |  298 +++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 291 insertions(+), 8 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2008-03-14 02:34:12.000000000 -0700
+++ linux-2.6/mm/slub.c	2008-03-14 02:42:39.000000000 -0700
@@ -293,19 +293,113 @@ static inline struct kmem_cache_cpu *get
 #endif
 }
 
-#define END NULL
+#if defined(CONFIG_FAST_CMPXCHG_LOCAL) && defined(CONFIG_SMP)
+#define SLUB_FASTPATH
+/*
+ * The fastpath divides the free pointer into two halves.
+ *
+ * The lower bits provide an offset relative to the base pointer (that is
+ * kept in different field). The upper bits provide a sequence number so
+ * that we can decide if a given pointer is still valid through a cmpxchg.
+ */
+#define HALF_BITS_PER_LONG (BITS_PER_LONG / 2)
+#define HALF_LONG_MASK ((1UL << HALF_BITS_PER_LONG) - 1)
+
+/*
+ * Define a half-long type to use register selection instead of bitmasks.
+ */
+#if (HALF_BITS_PER_LONG == 32)
+typedef u32 hulong;
+#elif (HALF_BITS_PER_LONG == 16)
+typedef u16 hulong;
+#else
+#error "Unknown long size"
+#endif
+
+union halfselect {
+	struct {
+#ifdef __BIG_ENDIAN
+		hulong h;
+		hulong l;
+#else
+		hulong l;
+		hulong h;
+#endif
+	} s;
+	unsigned long v;
+};
+
+static inline unsigned long get_high_half(unsigned long v)
+{
+	return ((union halfselect)v).s.h;
+}
+
+static inline unsigned long get_low_half(unsigned long v)
+{
+	return ((union halfselect)v).s.l;
+}
+
+static inline int same_base(unsigned long base, void *object)
+{
+	return get_high_half(base) == get_high_half((unsigned long)object);
+}
+
+static inline void *make_ptr(unsigned long base, void *freelist)
+{
+	return (void *)(base | get_low_half((unsigned long)freelist));
+}
+
+/*
+ * Make a new version by keeping the high half of the previous version and
+ * low half of object.
+ */
+static inline void *make_version(void *version, void *object)
+{
+	union halfselect ret = {
+		.s.h = get_high_half((unsigned long)version),
+		.s.l = get_low_half((unsigned long)object),
+	};
+	return (void *)ret.v;
+}
+
+#define END ((void *)1)
 
 static inline int is_end(const void *freelist)
 {
-	return freelist == NULL;
+	return (unsigned long)freelist & (unsigned long)END;
 }
 
-/* Determine the maximum number of objects that a slab page can hold */
-static inline unsigned long slab_objects(struct kmem_cache *s, struct page *page)
+static inline void **get_freelist(struct kmem_cache_cpu *c)
 {
-	if (unlikely(SlabSmall(page)))
-		return s->min_objects;
-	return s->max_objects;
+	return make_ptr(c->base, c->freelist);
+}
+
+static inline void set_freelist(struct kmem_cache_cpu *c, void **freelist)
+{
+	c->base = get_high_half((unsigned long)freelist) << HALF_BITS_PER_LONG;
+	/*
+	 * Detect overflow. Only wrap if not in interrupt.
+	 * Slowpath is always taken when a counter overflow is detected.
+	 */
+	if (unlikely(get_high_half((unsigned long)c->freelist)
+			== HALF_LONG_MASK && in_interrupt())) {
+		c->freelist = make_version((void *)c->freelist, freelist);
+		return;
+	}
+	c->freelist = make_version((void *)c->freelist + HALF_LONG_MASK + 1,
+				freelist);
+}
+
+#else
+/*
+ * No cmpxchg_local fastpath. Thus no tricks with pointers. NULL
+ * can be our end pointer which is simple to check for.
+ */
+#define END NULL
+
+static inline int is_end(const void *freelist)
+{
+	return freelist == NULL;
 }
 
 static inline void **get_freelist(struct kmem_cache_cpu *c)
@@ -318,6 +412,20 @@ static inline void set_freelist(struct k
 	c->freelist = p;
 }
 
+static inline void *make_version(void *version, void *object)
+{
+	return object;
+}
+#endif
+
+/* Determine the maximum number of objects that a slab page can hold */
+static inline unsigned long slab_objects(struct kmem_cache *s, struct page *page)
+{
+	if (unlikely(SlabSmall(page)))
+		return s->min_objects;
+	return s->max_objects;
+}
+
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
@@ -1526,7 +1634,14 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+#ifdef SLUB_FASTPATH
+	unsigned long flags;
 
+	local_irq_save(flags);
+#ifdef CONFIG_PREEMPT
+	c = get_cpu_slab(s, raw_smp_processor_id());
+#endif
+#endif
 	if (!c->page)
 		goto new_slab;
 
@@ -1550,6 +1665,9 @@ load_freelist:
 unlock_out:
 	slab_unlock(c->page);
 	stat(c, ALLOC_SLOWPATH);
+#ifdef SLUB_FASTPATH
+	local_irq_restore(flags);
+#endif
 	return object;
 
 another_slab:
@@ -1581,6 +1699,9 @@ new_slab:
 		c->page = new;
 		goto load_freelist;
 	}
+#ifdef SLUB_FASTPATH
+	local_irq_restore(flags);
+#endif
 	return NULL;
 debug:
 	if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1607,6 +1728,90 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
+
+/*
+ * The SLUB_FASTPATH path is provisional and is currently disabled if the
+ * kernel is compiled with preemption or if the arch does not support
+ * fast cmpxchg operations. There are a couple of coming changes that will
+ * simplify matters and allow preemption. Ultimately we may end up making
+ * SLUB_FASTPATH the default.
+ *
+ * 1. The introduction of the per cpu allocator will avoid array lookups
+ *    through get_cpu_slab(). A special register can be used instead.
+ *
+ * 2. The introduction of per cpu atomic operations (cpu_ops) means that
+ *    we can realize the logic here entirely with per cpu atomics. The
+ *    per cpu atomic ops will take care of the preemption issues.
+ */
+
+#ifdef SLUB_FASTPATH
+ 	void *old, *new, *result, *next_object;
+ 	unsigned long base;
+
+ 	preempt_disable();
+ 	c = get_cpu_slab(s, raw_smp_processor_id());
+fastpath:	/* fastpath cmpxchg loop */
+ 	old = c->freelist;
+ 	/*
+ 	 * Whenever c->base is changed, the sequence number
+ 	 * _must_ be incremented. This barrier insures we read
+ 	 * version before c->base wrt interrupts.
+ 	 */
+ 	barrier();
+ 	base = c->base;
+ 	if (unlikely(is_end(old) || !node_match(c, node)))
+ 		goto slowpath;
+ 	if (unlikely(get_high_half((unsigned long)old) == HALF_LONG_MASK))
+ 		goto slowpath;
+ 	/*
+ 	 * make_ptr on base should always return a valid pointer;
+ 	 * insure base has not been changed by a nested interrupt by
+ 	 * re-reading the freelist sequence number. It makes sure the
+ 	 * base and the offset will generate a valid pointer.
+ 	 */
+ 	barrier();
+ 	if (c->freelist != old)
+ 		goto fastpath;	/* retry */
+ 	object = make_ptr(base, old);
+ 	/*
+ 	 * Need to increment the MSB counter here, because
+ 	 * object[c->offset] use is racy. We can race against
+ 	 * another slab_alloc fast path.
+ 	 * Note that the object[c->offset] read may return garbage, but
+ 	 * is insured to point to a valid address since pages are always
+ 	 * reused in the page allocator. We know if the
+ 	 * object[c->offset] read returned garbage because the sequence
+ 	 * number is incremented each time the freelist is modified.
+ 	 */
+ 	next_object = object[c->offset];
+ 	if (unlikely(!same_base(base, next_object)))
+ 		goto slowpath;
+ 	stat(c, ALLOC_FASTPATH);
+ 	new = make_version(old + HALF_LONG_MASK + 1, next_object);
+ 	result = cmpxchg_local(&c->freelist, old, new);
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * Just to be paranoid : warn if we detect that enough free or
+	 * slow paths nested on top of us to get the counter to go
+	 * half-way to overflow. That would be insane to do that much
+	 * allocations/free in interrupt handers, but check it anyway.
+	 */
+	WARN_ON(result - old > -1UL >> 1);
+#endif
+	if (result != old)
+		goto fastpath;	/* retry */
+	preempt_enable();
+	goto got_object;
+slowpath:
+	preempt_enable();
+	/*
+	 * __slab_alloc must make no assumption about the
+	 * tests previously done by slab_alloc : we could be
+	 * migrated to a different CPU.
+	 */
+	object = __slab_alloc(s, gfpflags, node, addr, c);
+got_object:
+#else
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -1621,6 +1826,7 @@ static __always_inline void *slab_alloc(
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
+#endif
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, c->objsize);
@@ -1657,6 +1863,11 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
+#ifdef SLUB_FASTPATH
+	unsigned long flags;
+
+	local_irq_save(flags);
+#endif
 	c = get_cpu_slab(s, raw_smp_processor_id());
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
@@ -1688,6 +1899,9 @@ checks_ok:
 
 out_unlock:
 	slab_unlock(page);
+#ifdef SLUB_FASTPATH
+	local_irq_restore(flags);
+#endif
 	return;
 
 slab_empty:
@@ -1701,6 +1915,9 @@ slab_empty:
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
 	discard_slab(s, page);
+#ifdef SLUB_FASTPATH
+	local_irq_restore(flags);
+#endif
 	return;
 
 debug:
@@ -1725,6 +1942,69 @@ static __always_inline void slab_free(st
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
+
+#ifdef SLUB_FASTPATH
+	void *old, *new, *result;
+	unsigned long base;
+
+	preempt_disable();
+	c = get_cpu_slab(s, raw_smp_processor_id());
+	debug_check_no_locks_freed(object, c->objsize);
+	while (1) {
+		old = c->freelist;
+		/*
+		 * If the compiler would reorder the retrieval of c->page and
+		 * c->base to come before c->freelist then an interrupt
+		 * could change the cpu slab before we retrieve c->version.
+		 * We could be matching on a page no longer active and put the
+		 * object onto the freelist of the wrong slab.
+		 *
+		 * On the other hand: If we already have the version
+		 * then any change of cpu_slab will cause the cmpxchg to fail
+		 * since the freelist pointers are unique per slab.
+		 */
+		barrier();
+		base = c->base;
+		if (unlikely(get_high_half((unsigned long)old) == HALF_LONG_MASK
+				|| !same_base(base, object)
+				|| page != c->page || c->node < 0)) {
+			preempt_enable();
+			/*
+			 * __slab_free must make no assumption about the
+			 * tests previously done by slab_free : we could be
+			 * migrated to a different CPU.
+			 */
+			__slab_free(s, page, x, addr, c->offset);
+			break;
+		}
+		/*
+		 * It's ok to overwrite the content of object[c->offset] because
+		 * we own the object. This object won't appear in the freelist
+		 * until our cmpxchg_local succeeds. Therefore, no other part of
+		 * the slub slow path can use this object.
+		 * The result of make_ptr does not have to be dereferenced
+		 * until the cmpxchg succeeds. We don't care if base and old are
+		 * out-of-sync.
+		 */
+		object[c->offset] = make_ptr(base, old);
+		stat(c, FREE_FASTPATH);
+		new = make_version(old + HALF_LONG_MASK + 1, object);
+		result = cmpxchg_local(&c->freelist, old, new);
+#ifdef CONFIG_DEBUG_VM
+		/*
+		 * Just to be paranoid : warn if we detect that enough free or
+		 * slow paths nested on top of us to get the counter to go
+		 * half-way to overflow. That would be insane to do that much
+		 * allocations/free in interrupt handers, but check it anyway.
+		 */
+		WARN_ON(result - old > -1UL >> 1);
+#endif
+		if (result == old) {
+			preempt_enable();
+			break;
+		}
+	}
+#else
 	unsigned long flags;
 
 	local_irq_save(flags);
@@ -1738,6 +2018,7 @@ static __always_inline void slab_free(st
 		__slab_free(s, page, x, addr, c->offset);
 
 	local_irq_restore(flags);
+#endif
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -1915,7 +2196,8 @@ static void init_kmem_cache_cpu(struct k
 			struct kmem_cache_cpu *c)
 {
 	c->page = NULL;
-	set_freelist(c, END);
+	set_freelist(c, make_version(0, END));
+	c->base = 0;
 	c->node = 0;
 	c->offset = s->offset / sizeof(void *);
 	c->objsize = s->objsize;
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h	2008-03-14 02:31:28.000000000 -0700
+++ linux-2.6/include/linux/slub_def.h	2008-03-14 02:36:04.000000000 -0700
@@ -35,6 +35,7 @@ enum stat_item {
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
+	unsigned long base;	/* Base for fastpath. */
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */