From compudj@krystal.dyndns.org Thu Mar 27 20:31:33 2008
Date: Thu, 27 Mar 2008 23:33:25 -0400
From: Mathieu Desnoyers
To: Christoph Lameter
Cc: Robert Wisniewski, mbligh@google.com, michel.dagenais@polymtl.ca
Subject: [PATCH] Implement slub fastpath with sequence number

* Christoph Lameter (clameter@sgi.com) wrote:
> On Tue, 25 Mar 2008, Mathieu Desnoyers wrote:
>
> > > Actually, I think there is a problem with my approach : it won't work if
> > > _all_ allocations are done in interrupt context, because it would be
> > > stuck in the slow path when it reaches the overflow. I'll try to think
> > > about a solution.
> >
> > A solution would be to keep an "active fastpath" count in each percpu slab
> > structure. So instead of checking for in_interrupt() in the slow path, I
> > can check if the number of active fastpaths is non-zero. This, too,
> > requires preemption to be disabled in the slow path so I can use
> > non-atomic increment/decrements for this active fastpaths count.
>
> We are getting nearer to the 2.6.26 merge cycle. Do we have a usable
> approach now? I am fine with an approach that initially requires !PREEMPT.
>

I had to add a new member to the per-cpu slab structure :

  hulong active;  /* Active alloc/free fastpaths */

It counts the number of fastpaths nested over the running code. Just checking
for in_interrupt() was buggy : if we only allocate/free from interrupt
context, we always end up running the slow path once the counter has
overflowed.

Since we already have to add a counter, I see no reason to try to pack this
nesting count _and_ the "base" into the 5 LSBs of the freelist pointer.
Therefore, I suggest using half a long for the count and half a long for the
base.

Here is a version with individual SLUB_FASTPATH_ALLOC and SLUB_FASTPATH_FREE
defines. The numbers below indicate that the overall gain on an
alloc-then-free pair outweighs the performance degradation of the free path
when the slow path must be taken.

Here is the patch. It applies on 2.6.25-rc7 in vm.git.

Mathieu


Implement slub fastpath with sequence number

It allows cmpxchg_local to detect nested object re-use and slab changes by
keeping a sequence counter in the freelist MSBs (16 bits on 32-bit
architectures, 32 bits on 64-bit architectures). Whenever an object is freed
into the cpu slab cache, the counter is incremented. Whenever the alloc/free
slow paths modify the offset or the free base, the sequence counter is also
incremented. It is used to make sure we know whether the base has been
modified by an interrupt nested over the fast path.
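To make the packing concrete, here is a minimal stand-alone user-space sketch
of the half-long split described above. The helper names and the test address
are made up for illustration only; the actual kernel code in the patch below
uses the union halfselect type and the make_ptr()/make_version() helpers:

#include <stdio.h>

#define HALF_BITS_PER_LONG      (sizeof(unsigned long) * 8 / 2)
#define HALF_LONG_MASK          ((1UL << HALF_BITS_PER_LONG) - 1)

/* Pack: high half = sequence number, low half = low bits of the object. */
static unsigned long make_freelist_word(unsigned long seq, unsigned long obj)
{
        return (seq << HALF_BITS_PER_LONG) | (obj & HALF_LONG_MASK);
}

/*
 * Unpack: recombine the "base" (high half of the object address, kept in a
 * separate per-cpu field) with the offset stored in the freelist word.
 */
static unsigned long make_object_addr(unsigned long base, unsigned long word)
{
        return (base << HALF_BITS_PER_LONG) | (word & HALF_LONG_MASK);
}

int main(void)
{
        unsigned long obj = 0x12345678UL;               /* made-up address */
        unsigned long base = obj >> HALF_BITS_PER_LONG; /* would live in c->base */
        unsigned long word = make_freelist_word(7, obj);

        /* Bump the sequence: add 1 to the high half, keep the low half. */
        word += HALF_LONG_MASK + 1;

        printf("seq=%lu addr=%#lx\n", word >> HALF_BITS_PER_LONG,
               make_object_addr(base, word));
        return 0;
}

Only a pointer whose high half still matches the recorded base can be rebuilt
this way, which is what the same_base() check in the patch enforces.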
This version, running with preemption enabled, gives the following results on
x86_32 (UP, 3GHz Pentium 4), 2.6.25-rc4 kernel. The result "Overall for alloc
and then free" is the speedup for an alloc and free pair in the "Repeatedly
allocate then free" test.

-----------------
* With preemption
-----------------

Speedups (kmalloc 8)

SLUB_FASTPATH_FREE disabled
  Allocs :                          2.04
  Then free :                       0.97 (slowdown)
  Overall for alloc and then free : 1.37
  Repeated alloc+free :             1.44

SLUB_FASTPATH_FREE enabled
  Allocs :                          2.04
  Then free :                       0.83 (slowdown)
  Overall for alloc and then free : 1.25
  Repeated alloc+free :             2.45

Single thread testing
=====================

1. Kmalloc: Repeatedly allocate then free test

* baseline

10000 times kmalloc(8) -> 190 cycles kfree -> 150 cycles
10000 times kmalloc(16) -> 191 cycles kfree -> 150 cycles
10000 times kmalloc(32) -> 190 cycles kfree -> 152 cycles
10000 times kmalloc(64) -> 207 cycles kfree -> 162 cycles
10000 times kmalloc(128) -> 253 cycles kfree -> 178 cycles
10000 times kmalloc(256) -> 364 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 369 cycles kfree -> 284 cycles
10000 times kmalloc(1024) -> 434 cycles kfree -> 362 cycles
10000 times kmalloc(2048) -> 454 cycles kfree -> 385 cycles
10000 times kmalloc(4096) -> 512 cycles kfree -> 395 cycles
10000 times kmalloc(8192) -> 670 cycles kfree -> 692 cycles
10000 times kmalloc(16384) -> 710 cycles kfree -> 797 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE disabled)

10000 times kmalloc(8) -> 93 cycles kfree -> 154 cycles
10000 times kmalloc(16) -> 97 cycles kfree -> 154 cycles
10000 times kmalloc(32) -> 106 cycles kfree -> 156 cycles
10000 times kmalloc(64) -> 124 cycles kfree -> 161 cycles
10000 times kmalloc(128) -> 180 cycles kfree -> 178 cycles
10000 times kmalloc(256) -> 306 cycles kfree -> 276 cycles
10000 times kmalloc(512) -> 347 cycles kfree -> 283 cycles
10000 times kmalloc(1024) -> 481 cycles kfree -> 370 cycles
10000 times kmalloc(2048) -> 503 cycles kfree -> 382 cycles
10000 times kmalloc(4096) -> 560 cycles kfree -> 394 cycles
10000 times kmalloc(8192) -> 693 cycles kfree -> 691 cycles
10000 times kmalloc(16384) -> 715 cycles kfree -> 819 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE enabled)

10000 times kmalloc(8) -> 93 cycles kfree -> 180 cycles
10000 times kmalloc(16) -> 98 cycles kfree -> 188 cycles
10000 times kmalloc(32) -> 107 cycles kfree -> 187 cycles
10000 times kmalloc(64) -> 121 cycles kfree -> 191 cycles
10000 times kmalloc(128) -> 182 cycles kfree -> 206 cycles
10000 times kmalloc(256) -> 295 cycles kfree -> 274 cycles
10000 times kmalloc(512) -> 349 cycles kfree -> 300 cycles
10000 times kmalloc(1024) -> 474 cycles kfree -> 385 cycles
10000 times kmalloc(2048) -> 502 cycles kfree -> 401 cycles
10000 times kmalloc(4096) -> 563 cycles kfree -> 432 cycles
10000 times kmalloc(8192) -> 665 cycles kfree -> 695 cycles
10000 times kmalloc(16384) -> 720 cycles kfree -> 816 cycles
2. Kmalloc: alloc/free test

* baseline

10000 times kmalloc(8)/kfree -> 314 cycles
10000 times kmalloc(16)/kfree -> 312 cycles
10000 times kmalloc(32)/kfree -> 310 cycles
10000 times kmalloc(64)/kfree -> 310 cycles
10000 times kmalloc(128)/kfree -> 310 cycles
10000 times kmalloc(256)/kfree -> 326 cycles
10000 times kmalloc(512)/kfree -> 322 cycles
10000 times kmalloc(1024)/kfree -> 320 cycles
10000 times kmalloc(2048)/kfree -> 324 cycles
10000 times kmalloc(4096)/kfree -> 324 cycles
10000 times kmalloc(8192)/kfree -> 885 cycles
10000 times kmalloc(16384)/kfree -> 916 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE disabled)

10000 times kmalloc(8)/kfree -> 218 cycles
10000 times kmalloc(16)/kfree -> 218 cycles
10000 times kmalloc(32)/kfree -> 218 cycles
10000 times kmalloc(64)/kfree -> 218 cycles
10000 times kmalloc(128)/kfree -> 224 cycles
10000 times kmalloc(256)/kfree -> 232 cycles
10000 times kmalloc(512)/kfree -> 226 cycles
10000 times kmalloc(1024)/kfree -> 226 cycles
10000 times kmalloc(2048)/kfree -> 228 cycles
10000 times kmalloc(4096)/kfree -> 228 cycles
10000 times kmalloc(8192)/kfree -> 899 cycles
10000 times kmalloc(16384)/kfree -> 952 cycles

* cmpxchg_local (SLUB_FASTPATH_FREE enabled)

10000 times kmalloc(8)/kfree -> 128 cycles
10000 times kmalloc(16)/kfree -> 125 cycles
10000 times kmalloc(32)/kfree -> 133 cycles
10000 times kmalloc(64)/kfree -> 228 cycles
10000 times kmalloc(128)/kfree -> 262 cycles
10000 times kmalloc(256)/kfree -> 138 cycles
10000 times kmalloc(512)/kfree -> 190 cycles
10000 times kmalloc(1024)/kfree -> 133 cycles
10000 times kmalloc(2048)/kfree -> 135 cycles
10000 times kmalloc(4096)/kfree -> 130 cycles
10000 times kmalloc(8192)/kfree -> 898 cycles
10000 times kmalloc(16384)/kfree -> 938 cycles

--------------------
* Without preemption
--------------------

Speedups (kmalloc 8)
  Allocs :                          2.18
  Then free :                       0.88 (slowdown)
  Overall for alloc and then free : 1.31
  Repeated alloc+free :             2.83

1. Kmalloc: Repeatedly allocate then free test

* baseline

10000 times kmalloc(8) -> 181 cycles kfree -> 149 cycles
10000 times kmalloc(16) -> 183 cycles kfree -> 150 cycles
10000 times kmalloc(32) -> 190 cycles kfree -> 152 cycles
10000 times kmalloc(64) -> 214 cycles kfree -> 156 cycles
10000 times kmalloc(128) -> 251 cycles kfree -> 182 cycles
10000 times kmalloc(256) -> 356 cycles kfree -> 270 cycles
10000 times kmalloc(512) -> 372 cycles kfree -> 275 cycles
10000 times kmalloc(1024) -> 424 cycles kfree -> 346 cycles
10000 times kmalloc(2048) -> 443 cycles kfree -> 375 cycles
10000 times kmalloc(4096) -> 504 cycles kfree -> 386 cycles
10000 times kmalloc(8192) -> 675 cycles kfree -> 652 cycles
10000 times kmalloc(16384) -> 703 cycles kfree -> 804 cycles

* cmpxchg_local

10000 times kmalloc(8) -> 83 cycles kfree -> 168 cycles
10000 times kmalloc(16) -> 89 cycles kfree -> 173 cycles
10000 times kmalloc(32) -> 98 cycles kfree -> 179 cycles
10000 times kmalloc(64) -> 117 cycles kfree -> 179 cycles
10000 times kmalloc(128) -> 169 cycles kfree -> 196 cycles
10000 times kmalloc(256) -> 291 cycles kfree -> 270 cycles
10000 times kmalloc(512) -> 335 cycles kfree -> 295 cycles
10000 times kmalloc(1024) -> 466 cycles kfree -> 367 cycles
10000 times kmalloc(2048) -> 470 cycles kfree -> 382 cycles
10000 times kmalloc(4096) -> 529 cycles kfree -> 401 cycles
10000 times kmalloc(8192) -> 701 cycles kfree -> 681 cycles
10000 times kmalloc(16384) -> 678 cycles kfree -> 801 cycles
2. Kmalloc: alloc/free test

* baseline

10000 times kmalloc(8)/kfree -> 318 cycles
10000 times kmalloc(16)/kfree -> 318 cycles
10000 times kmalloc(32)/kfree -> 318 cycles
10000 times kmalloc(64)/kfree -> 321 cycles
10000 times kmalloc(128)/kfree -> 316 cycles
10000 times kmalloc(256)/kfree -> 326 cycles
10000 times kmalloc(512)/kfree -> 326 cycles
10000 times kmalloc(1024)/kfree -> 326 cycles
10000 times kmalloc(2048)/kfree -> 333 cycles
10000 times kmalloc(4096)/kfree -> 330 cycles
10000 times kmalloc(8192)/kfree -> 836 cycles
10000 times kmalloc(16384)/kfree -> 920 cycles

* cmpxchg_local

10000 times kmalloc(8)/kfree -> 112 cycles
10000 times kmalloc(16)/kfree -> 113 cycles
10000 times kmalloc(32)/kfree -> 193 cycles
10000 times kmalloc(64)/kfree -> 216 cycles
10000 times kmalloc(128)/kfree -> 144 cycles
10000 times kmalloc(256)/kfree -> 127 cycles
10000 times kmalloc(512)/kfree -> 119 cycles
10000 times kmalloc(1024)/kfree -> 122 cycles
10000 times kmalloc(2048)/kfree -> 122 cycles
10000 times kmalloc(4096)/kfree -> 125 cycles
10000 times kmalloc(8192)/kfree -> 847 cycles
10000 times kmalloc(16384)/kfree -> 918 cycles

Changelog :

- Fix end of freelist.
- Removed VM_BUG_ON in get_/set_freelist_ptr (killed init). It makes no sense
  anyway, because it is valid to get/set the freelist pointer without
  disabling interrupts while the freelist is not yet in use by the rest of
  the system.
- Fixed debug_check_no_locks_freed(object, c->objsize) in the SLUB_FASTPATH
  free path.
- Fix deactivate_slab : write back the freelist pointer after the loop has
  been executed.
- Use a union to select sub-registers instead of using masks.
- Add CONFIG_PREEMPT support.
- Fix direct references to the cpu slab cache c->freelist (use the get/set
  accessors).
- Fix error in set_freelist_ptr (base was taken from the LSBs instead of the
  MSBs).
- Use END to mark the end of a slab freelist.
- Check for same_base at the beginning of the fastpaths. This detects whether
  we are dealing with cross-slab pointers (object in one slab, next_object in
  another slab). It also detects whether the slab order used is too big to be
  represented in half a long. It should be safe to deal with slabs of order
  higher than 4: the same_base check detects this and falls back on the slow
  path.
- Check the sequence counter before using c->base, to make sure base and
  offset are in sync and won't trigger a page fault. We re-read c->freelist
  after reading c->base and check whether the sequence number has changed.
  Since only an interrupt could have changed them, this tells us that both
  are in sync and that it is safe to use the pointer generated by make_ptr.
- Make sure an overflow will _always_ be detected, by forcing the slow path
  when an overflow is detected and by only allowing !in_interrupt() code to
  reset the counter. This solution is good as long as preemption is disabled
  in the fastpath. (A short stand-alone sketch of this overflow rule follows
  the changelog.)
- Small error I made : the !in_interrupt() code path must also go through the
  slow path, to correctly deal with the case where the !in_interrupt() path
  hits the overflow condition, interrupts then come in and modify the
  freelist LSBs, and the cmpxchg could miss the LSB modification made by the
  nested interrupts. Forcing the slow path upon overflow detection fixes
  this.
- Turn off the CMPXCHG_LOCAL fast path under CONFIG_DEBUG_PAGEALLOC. This
  makes sure we never read a pointer in the cmpxchg loop, get interrupted,
  and have the nested context call free_pages(), which flags the pages
  against read and write under debug pagealloc. The fastpath requires that
  these pages are always reused and stay readable.
- I had to add a new member to the per-cpu slab structure :

    hulong active;  /* Active alloc/free fastpaths */

  It counts the number of fastpaths nested over the running code. Just
  checking for in_interrupt() was buggy : if we only allocate/free from
  interrupt context, we always end up running the slow path once the counter
  has overflowed.
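As a side note, the overflow rule described in the changelog boils down to two
small predicates. The following is an illustrative stand-alone user-space
sketch with hypothetical helper names, not code from the patch:

#include <stdbool.h>

#define HALF_BITS_PER_LONG      (sizeof(unsigned long) * 8 / 2)
#define HALF_LONG_MASK          ((1UL << HALF_BITS_PER_LONG) - 1)

/*
 * Once the sequence half of the freelist word saturates, every fastpath
 * attempt must bail out to the irq-disabled slow path.
 */
static inline bool must_take_slowpath(unsigned long freelist_word)
{
        return (freelist_word >> HALF_BITS_PER_LONG) == HALF_LONG_MASK;
}

/*
 * The slow path may only wrap the saturated counter when no alloc/free
 * fastpath is active underneath it; tracking that is the job of the new
 * "active" field.
 */
static inline bool may_wrap_sequence(unsigned long active_fastpaths)
{
        return active_fastpaths == 0;
}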
Signed-off-by: Mathieu Desnoyers
CC: Christoph Lameter
CC: Martin Bligh
CC: Robert Wisniewski
CC: Michel Dagenais
---
 include/linux/slub_def.h |   52 +++++++
 mm/slub.c                |  335 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 359 insertions(+), 28 deletions(-)

Index: vm/mm/slub.c
===================================================================
--- vm.orig/mm/slub.c	2008-03-27 21:22:01.000000000 -0400
+++ vm/mm/slub.c	2008-03-27 22:05:22.000000000 -0400
@@ -138,6 +138,80 @@ static inline void ClearSlabDebug(struct
 	page->flags &= ~SLABDEBUG;
 }
 
+#ifdef SLUB_FASTPATH
+
+static inline unsigned long get_high_half(unsigned long v)
+{
+	return ((union halfselect)v).s.h;
+}
+
+static inline unsigned long get_low_half(unsigned long v)
+{
+	return ((union halfselect)v).s.l;
+}
+
+static inline int same_base(unsigned long base, void *object)
+{
+	return base == get_high_half((unsigned long)object);
+}
+
+static inline void *make_ptr(unsigned long base, void *freelist)
+{
+	union halfselect hs = {
+		.s.h = base,
+		.s.l = get_low_half((unsigned long)freelist),
+	};
+	return (void *)hs.v;
+}
+
+/*
+ * Make a new version by keeping the high half of the previous version and
+ * low half of object.
+ */
+static inline void *make_version(void *version, void *object)
+{
+	union halfselect ret = {
+		.s.h = get_high_half((unsigned long)version),
+		.s.l = get_low_half((unsigned long)object),
+	};
+	return (void *)ret.v;
+}
+
+static inline void **get_freelist_ptr(struct kmem_cache_cpu *c)
+{
+	return make_ptr(c->base, c->freelist);
+}
+
+static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
+{
+	c->base = get_high_half((unsigned long)freelist);
+	/*
+	 * Detect overflow. Only wrap if no active fastpath underneath us.
+	 * Slowpath is always taken when a counter overflow is detected.
+	 */
+	if (unlikely(get_high_half((unsigned long)c->freelist)
+			== HALF_LONG_MASK && c->active > 0)) {
+		c->freelist = make_version((void *)c->freelist, freelist);
+		return;
+	}
+	c->freelist = make_version((void *)c->freelist + HALF_LONG_MASK + 1,
+			freelist);
+}
+
+#else
+
+static inline void **get_freelist_ptr(struct kmem_cache_cpu *c)
+{
+	return c->freelist;
+}
+
+static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
+{
+	c->freelist = freelist;
+}
+
+#endif
+
 /*
  * Issues still to be resolved:
  *
@@ -270,13 +344,20 @@ static inline struct kmem_cache_cpu *get
 #endif
 }
 
+#define END ((void *)1)
+
+static inline int is_end(const void *freelist)
+{
+	return (unsigned long)freelist & (unsigned long)END;
+}
+
 /* Verify that a pointer has an address that is valid within a slab page */
 static inline int check_valid_pointer(struct kmem_cache *s,
 				struct page *page, const void *object)
 {
 	void *base;
 
-	if (!object)
+	if (is_end(object))
 		return 1;
 
 	base = page_address(page);
@@ -312,7 +393,8 @@ static inline void set_freepointer(struc
 
 /* Scan freelist */
 #define for_each_free_object(__p, __s, __free) \
-	for (__p = (__free); __p; __p = get_freepointer((__s), __p))
+	for (__p = (__free); !is_end(__p); __p = get_freepointer((__s),\
+		__p))
 
 /* Determine object index from a given position */
 static inline int slab_index(void *p, struct kmem_cache *s, void *addr)
@@ -730,7 +812,7 @@ static int check_object(struct kmem_cach
 		 * of the free objects in this slab. May cause
 		 * another error because the object count is now wrong.
 		 */
-		set_freepointer(s, p, NULL);
+		set_freepointer(s, p, END);
 		return 0;
 	}
 	return 1;
@@ -771,21 +853,21 @@ static int on_freelist(struct kmem_cache
 {
 	int nr = 0;
 	void *fp = page->freelist;
-	void *object = NULL;
+	void *object = END;
 	unsigned long max_objects;
 
-	while (fp && nr <= page->objects) {
+	while (!is_end(fp) && nr <= page->objects) {
 		if (fp == search)
 			return 1;
 		if (!check_valid_pointer(s, page, fp)) {
-			if (object) {
+			if (!is_end(object)) {
 				object_err(s, page, object,
 					"Freechain corrupt");
-				set_freepointer(s, object, NULL);
+				set_freepointer(s, object, END);
 				break;
 			} else {
 				slab_err(s, page, "Freepointer corrupt");
-				page->freelist = NULL;
+				page->freelist = END;
 				page->inuse = page->objects;
 				slab_fix(s, "Freelist cleared");
 				return 0;
@@ -813,7 +895,7 @@ static int on_freelist(struct kmem_cache
 		page->inuse = page->objects - nr;
 		slab_fix(s, "Object count adjusted.");
 	}
-	return search == NULL;
+	return is_end(search);
 }
 
 static void trace(struct kmem_cache *s, struct page *page, void *object, int alloc)
@@ -901,7 +983,7 @@ bad:
 		 */
 		slab_fix(s, "Marking all objects used");
 		page->inuse = page->objects;
-		page->freelist = NULL;
+		page->freelist = END;
 	}
 	return 0;
 }
@@ -941,7 +1023,7 @@ static int free_debug_processing(struct
 	}
 
 	/* Special debug activities for freeing objects */
-	if (!SlabFrozen(page) && !page->freelist)
+	if (!SlabFrozen(page) && is_end(page->freelist))
 		remove_full(s, page);
 	if (s->flags & SLAB_STORE_USER)
 		set_track(s, object, TRACK_FREE, addr);
@@ -1142,7 +1224,7 @@ static struct page *new_slab(struct kmem
 		last = p;
 	}
 	setup_object(s, page, last);
-	set_freepointer(s, last, NULL);
+	set_freepointer(s, last, END);
 
 	page->freelist = start;
 	page->inuse = 0;
@@ -1376,7 +1458,7 @@ static void unfreeze_slab(struct kmem_ca
 	ClearSlabFrozen(page);
 	if (page->inuse) {
 
-		if (page->freelist) {
+		if (!is_end(page->freelist)) {
 			add_partial(n, page, tail);
 			stat(c, tail ? DEACTIVATE_TO_TAIL : DEACTIVATE_TO_HEAD);
 		} else {
@@ -1415,28 +1497,32 @@ static void deactivate_slab(struct kmem_
 {
 	struct page *page = c->page;
 	int tail = 1;
+	void **freelist;
 
-	if (page->freelist)
+	if (!is_end(page->freelist))
 		stat(c, DEACTIVATE_REMOTE_FREES);
 	/*
 	 * Merge cpu freelist into slab freelist. Typically we get here
 	 * because both freelists are empty. So this is unlikely
 	 * to occur.
	 */
-	while (unlikely(c->freelist)) {
+	freelist = get_freelist_ptr(c);
+	while (unlikely(!is_end(freelist))) {
 		void **object;
 
 		tail = 0;	/* Hot objects. Put the slab first */
 
 		/* Retrieve object from cpu_freelist */
-		object = c->freelist;
-		c->freelist = c->freelist[c->offset];
+		object = freelist;
+		freelist = object[c->offset];
 
 		/* And put onto the regular freelist */
 		object[c->offset] = page->freelist;
 		page->freelist = object;
 		page->inuse--;
 	}
+	if (!tail)
+		set_freelist_ptr(c, freelist);
 	c->page = NULL;
 	unfreeze_slab(s, page, tail);
 }
@@ -1517,7 +1603,14 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
+#ifdef SLUB_FASTPATH_ALLOC
+	unsigned long flags;
+	local_irq_save(flags);
+#ifdef CONFIG_PREEMPT
+	c = get_cpu_slab(s, raw_smp_processor_id());
+#endif
+#endif
 
 	if (!c->page)
 		goto new_slab;
@@ -1529,18 +1622,21 @@ static void *__slab_alloc(struct kmem_ca
 
 load_freelist:
 	object = c->page->freelist;
-	if (unlikely(!object))
+	if (unlikely(is_end(object)))
 		goto another_slab;
 	if (unlikely(SlabDebug(c->page)))
 		goto debug;
 
-	c->freelist = object[c->offset];
+	set_freelist_ptr(c, object[c->offset]);
 	c->page->inuse = c->page->objects;
-	c->page->freelist = NULL;
+	c->page->freelist = END;
 	c->node = page_to_nid(c->page);
 unlock_out:
 	slab_unlock(c->page);
 	stat(c, ALLOC_SLOWPATH);
+#ifdef SLUB_FASTPATH_ALLOC
+	local_irq_restore(flags);
+#endif
 	return object;
 
 another_slab:
@@ -1572,6 +1668,9 @@ new_slab:
 		c->page = new;
 		goto load_freelist;
 	}
+#ifdef SLUB_FASTPATH_ALLOC
+	local_irq_restore(flags);
+#endif
 	return NULL;
 debug:
 	if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1598,20 +1697,112 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
+
+/*
+ * The SLUB_FASTPATH path is provisional and is currently disabled if the
+ * kernel is compiled with preemption or if the arch does not support
+ * fast cmpxchg operations. There are a couple of coming changes that will
+ * simplify matters and allow preemption. Ultimately we may end up making
+ * SLUB_FASTPATH the default.
+ *
+ * 1. The introduction of the per cpu allocator will avoid array lookups
+ *    through get_cpu_slab(). A special register can be used instead.
+ *
+ * 2. The introduction of per cpu atomic operations (cpu_ops) means that
+ *    we can realize the logic here entirely with per cpu atomics. The
+ *    per cpu atomic ops will take care of the preemption issues.
+ */
+
+#ifdef SLUB_FASTPATH_ALLOC
+	void *old, *new, *result, *next_object;
+	unsigned long base;
+
+	preempt_disable();
+	c = get_cpu_slab(s, raw_smp_processor_id());
+	c->active++;
+	barrier();
+fastpath:	/* fastpath cmpxchg loop */
+	old = c->freelist;
+	/*
+	 * Whenever c->base is changed, the sequence number
+	 * _must_ be incremented. This barrier insures we read
+	 * version before c->base wrt interrupts.
+	 */
+	barrier();
+	base = c->base;
+
+	if (unlikely(is_end(old) || !node_match(c, node)))
+		goto slowpath;
+	if (unlikely(get_high_half((unsigned long)old) == HALF_LONG_MASK))
+		goto slowpath;
+	/*
+	 * make_ptr on base should always return a valid pointer;
+	 * insure base has not been changed by a nested interrupt by
+	 * re-reading the freelist sequence number. It makes sure the
+	 * base and the offset will generate a valid pointer.
+	 */
+	barrier();
+	if (c->freelist != old)
+		goto fastpath;	/* retry */
+	object = make_ptr(base, old);
+	/*
+	 * Need to increment the MSB counter here, because
+	 * object[c->offset] use is racy. We can race against
+	 * another slab_alloc fast path.
+	 * Note that the object[c->offset] read may return garbage, but
+	 * is insured to point to a valid address since pages are always
+	 * reused in the page allocator. We know if the
+	 * object[c->offset] read returned garbage because the sequence
+	 * number is incremented each time the freelist is modified.
+	 */
+	next_object = object[c->offset];
+	if (unlikely(!same_base(base, next_object)))
+		goto slowpath;
+	stat(c, ALLOC_FASTPATH);
+	new = make_version(old + HALF_LONG_MASK + 1, next_object);
+	result = cmpxchg_local(&c->freelist, old, new);
+#ifdef CONFIG_DEBUG_VM
+	/*
+	 * Just to be paranoid : warn if we detect that enough free or
+	 * slow paths nested on top of us to get the counter to go
+	 * half-way to overflow. That would be insane to do that many
+	 * allocations/free in interrupt handers, but check it anyway.
+	 */
+	WARN_ON(result - old > -1UL >> 1);
+#endif
+	if (result != old)
+		goto fastpath;	/* retry */
+	barrier();
+	c->active--;
+	preempt_enable();
+	goto got_object;
+slowpath:
+	barrier();
+	c->active--;
+	preempt_enable();
+	/*
+	 * __slab_alloc must make no assumption about the
+	 * tests previously done by slab_alloc : we could be
+	 * migrated to a different CPU.
+	 */
+	object = __slab_alloc(s, gfpflags, node, addr, c);
+got_object:
+#else
 	unsigned long flags;
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(is_end(c->freelist)) || !node_match(c, node))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 
 	else {
-		object = c->freelist;
-		c->freelist = object[c->offset];
+		object = get_freelist_ptr(c);
+		set_freelist_ptr(c, object[c->offset]);
 		stat(c, ALLOC_FASTPATH);
 	}
 	local_irq_restore(flags);
+#endif
 
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, c->objsize);
@@ -1648,6 +1839,11 @@ static void __slab_free(struct kmem_cach
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
 
+#ifdef SLUB_FASTPATH_FREE
+	unsigned long flags;
+
+	local_irq_save(flags);
+#endif
 	c = get_cpu_slab(s, raw_smp_processor_id());
 	stat(c, FREE_SLOWPATH);
 	slab_lock(page);
@@ -1672,17 +1868,20 @@ checks_ok:
 	 * Objects left in the slab. If it was not on the partial list before
 	 * then add it.
 	 */
-	if (unlikely(!prior)) {
+	if (unlikely(is_end(prior))) {
 		add_partial(get_node(s, page_to_nid(page)), page, 1);
 		stat(c, FREE_ADD_PARTIAL);
 	}
 
out_unlock:
 	slab_unlock(page);
+#ifdef SLUB_FASTPATH_FREE
+	local_irq_restore(flags);
+#endif
 	return;
 
 slab_empty:
-	if (prior) {
+	if (!is_end(prior)) {
 		/*
 		 * Slab still on the partial list.
 		 */
@@ -1692,6 +1891,9 @@ slab_empty:
 	slab_unlock(page);
 	stat(c, FREE_SLAB);
 	discard_slab(s, page);
+#ifdef SLUB_FASTPATH_FREE
+	local_irq_restore(flags);
+#endif
 	return;
 
 debug:
@@ -1716,19 +1918,90 @@ static __always_inline void slab_free(st
 {
 	void **object = (void *)x;
 	struct kmem_cache_cpu *c;
+
+#ifdef SLUB_FASTPATH_FREE
+	void *old, *new, *result;
+	unsigned long base;
+
+	preempt_disable();
+	c = get_cpu_slab(s, raw_smp_processor_id());
+	c->active++;
+	barrier();
+	debug_check_no_locks_freed(object, c->objsize);
+	while (1) {
+		old = c->freelist;
+		/*
+		 * If the compiler would reorder the retrieval of c->page and
+		 * c->base to come before c->freelist then an interrupt
+		 * could change the cpu slab before we retrieve c->version.
+		 * We could be matching on a page no longer active and put the
+		 * object onto the freelist of the wrong slab.
+		 *
+		 * On the other hand: If we already have the version
+		 * then any change of cpu_slab will cause the cmpxchg to fail
+		 * since the freelist pointers are unique per slab.
+		 */
+		barrier();
+		base = c->base;
+
+		if (unlikely(get_high_half((unsigned long)old) == HALF_LONG_MASK
+				|| !same_base(base, object)
+				|| page != c->page || c->node < 0)) {
+			barrier();
+			c->active--;
+			preempt_enable();
+			/*
+			 * __slab_free must make no assumption about the
+			 * tests previously done by slab_free : we could be
+			 * migrated to a different CPU.
+			 */
+			__slab_free(s, page, x, addr, c->offset);
+			break;
+		}
+		/*
+		 * It's ok to overwrite the content of object[c->offset] because
+		 * we own the object. This object won't appear in the freelist
+		 * until our cmpxchg_local succeeds. Therefore, no other part of
+		 * the slub slow path can use this object.
+		 * The result of make_ptr does not have to be dereferenced
+		 * until the cmpxchg succeeds. We don't care if base and old are
+		 * out-of-sync.
+		 */
+		object[c->offset] = make_ptr(base, old);
+		stat(c, FREE_FASTPATH);
+		new = make_version(old + HALF_LONG_MASK + 1, object);
+		result = cmpxchg_local(&c->freelist, old, new);
+#ifdef CONFIG_DEBUG_VM
+		/*
+		 * Just to be paranoid : warn if we detect that enough free or
+		 * slow paths nested on top of us to get the counter to go
+		 * half-way to overflow. That would be insane to do that many
+		 * allocations/free in interrupt handers, but check it anyway.
+		 */
+		WARN_ON(result - old > -1UL >> 1);
+#endif
+		if (result == old) {
+			barrier();
+			c->active--;
+			preempt_enable();
+			break;
+		}
+	}
+#else
 	unsigned long flags;
 
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
 	debug_check_no_locks_freed(object, c->objsize);
 	if (likely(page == c->page && c->node >= 0)) {
-		object[c->offset] = c->freelist;
-		c->freelist = object;
+		object[c->offset] = get_freelist_ptr(c);
+		set_freelist_ptr(c, object);
 		stat(c, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, x, addr, c->offset);
 
 	local_irq_restore(flags);
+#endif
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
@@ -1911,7 +2184,13 @@ static void init_kmem_cache_cpu(struct k
 					struct kmem_cache_cpu *c)
 {
 	c->page = NULL;
-	c->freelist = NULL;
+#ifdef SLUB_FASTPATH
+	c->freelist = make_version(0, END);
+	c->base = 0;
+	c->active = 0;
+#else
+	c->freelist = END;
+#endif
 	c->node = 0;
 	c->offset = s->offset / sizeof(void *);
 	c->objsize = s->objsize;
Index: vm/include/linux/slub_def.h
===================================================================
--- vm.orig/include/linux/slub_def.h	2008-03-27 21:21:57.000000000 -0400
+++ vm/include/linux/slub_def.h	2008-03-27 21:34:09.000000000 -0400
@@ -32,9 +32,61 @@ enum stat_item {
 	ORDER_FALLBACK,		/* Number of times fallback was necessary */
 	NR_SLUB_STAT_ITEMS };
 
+#if (defined (CONFIG_FAST_CMPXCHG_LOCAL) && !defined(CONFIG_DEBUG_PAGEALLOC))
+#define SLUB_FASTPATH
+#define SLUB_FASTPATH_ALLOC
+
+/*
+ * FREE FASTPATH is slower with config preempt because the slow path is taken
+ * often and performance is degradedby preempt disable/enable + irq
+ * disable/enable.
+ */
+#ifndef CONFIG_PREEMPT
+#define SLUB_FASTPATH_FREE
+#endif
+
+/*
+ * The fastpath divides the free pointer into two halves.
+ *
+ * The lower bits provide an offset relative to the base pointer (that is
+ * kept in different field). The upper bits provide a sequence number so
+ * that we can decide if a given pointer is still valid through a cmpxchg.
+ */
+#define HALF_BITS_PER_LONG (BITS_PER_LONG / 2)
+#define HALF_LONG_MASK ((1UL << HALF_BITS_PER_LONG) - 1)
+
+/*
+ * Define a half-long type to use register selection instead of bitmasks.
+ */
+#if (HALF_BITS_PER_LONG == 32)
+typedef u32 hulong;
+#elif (HALF_BITS_PER_LONG == 16)
+typedef u16 hulong;
+#else
+#error "Unknown long size"
+#endif
+
+union halfselect {
+	struct {
+#ifdef __BIG_ENDIAN
+		hulong h;
+		hulong l;
+#else
+		hulong l;
+		hulong h;
+#endif
+	} s;
+	unsigned long v;
+};
+#endif
+
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to first free per cpu object */
 	struct page *page;	/* The slab from which we are allocating */
+#ifdef SLUB_FASTPATH
+	hulong base;		/* Base for fastpath. */
+	hulong active;		/* Active alloc/free fastpaths */
+#endif
 	int node;		/* The node of the page (or -1 for debug) */
 	unsigned int offset;	/* Freepointer offset (in word units) */
 	unsigned int objsize;	/* Size of an object (from kmem_cache) */

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68