Message-Id: <20070823064653.081843729@sgi.com>
User-Agent: quilt/0.46-1
Date: Wed, 22 Aug 2007 23:46:53 -0700
From: Christoph Lameter
To: akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: Pekka Enberg
Subject: Per cpu structures for SLUB
Subject-Prefix: [patch @num@/@total@]

The following patchset introduces per cpu structures for SLUB. These are
very small (multiples of them may fit into one cacheline) and, apart from
performance improvements, allow several issues in SLUB to be addressed
(a rough sketch of such a structure follows below):

1. The number of objects per slab is no longer limited to a 16 bit number.

2. Room is freed up in the page struct. We can avoid using the mapping
   field, which allows us to get rid of the #ifdef CONFIG_SLUB in
   page_mapping().

3. We will have an easier time adding new things like Peter Z.'s reserve
   management.

The RFC for this patchset was discussed on lkml a while ago:

http://marc.info/?l=linux-kernel&m=118386677704534&w=2

(And no, this patchset does not include the use of cmpxchg_local that we
discussed recently on lkml, nor the cmpxchg implementation mentioned in
the RFC.)
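To give an idea of what such a per cpu structure could look like, here is
a minimal sketch. The field names and layout are illustrative assumptions
rather than the exact definition from the patches; the point is only that
the structure is a handful of words, so several instances can share a
cacheline and each instance can be placed in memory local to its
processor.

	/*
	 * Illustrative sketch only -- the fields shown are assumptions,
	 * not necessarily the exact definition from this patchset.
	 */
	struct kmem_cache_cpu {
		void **freelist;	/* Next free object in the cpu slab */
		struct page *page;	/* Slab page we currently allocate from */
		int node;		/* Node of the current slab page */
		unsigned int offset;	/* Free pointer offset within an object */
		unsigned int objsize;	/* Object size including metadata */
	};

Because the structure is local to the processor, the hot allocation path
never has to touch state shared with other cpus, which is where the NUMA
gains in the numbers below come from.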
Performance
-----------

Norm = 2.6.23-rc3
PCPU = Adds page allocator pass through plus per cpu structure patches

IA64 8p 4n NUMA Altix

                Single threaded          Concurrent
             Kmalloc    Alloc/Free    Kmalloc    Alloc/Free
   Size    Norm  PCPU   Norm  PCPU   Norm  PCPU   Norm  PCPU
-------------------------------------------------------------------
      8     132    84     93   104     98    90     95   106
     16      98    92     93   104    115    98     95   106
     32     112   105     93   104    146   111     95   106
     64     119   112     93   104    214   133     95   106
    128     132   119     94   104    321   163     95   106
   256+   83255   176    106   115    415   224    108   117
    512     191   176    106   115    487   341    108   117
   1024     252   246    106   115    937   609    108   117
   2048     308   292    107   115   2494  1207    108   117
   4096     341   319    107   115   2497  1217    108   117
   8192     402   380    107   115   2367  1188    108   117
 16384*     560   474    106   434   4464  1904    108   478

X86_64 2p SMP (Dual Core Pentium 940)

                Single threaded          Concurrent
             Kmalloc    Alloc/Free    Kmalloc    Alloc/Free
   Size    Norm  PCPU   Norm  PCPU   Norm  PCPU   Norm  PCPU
--------------------------------------------------------------------
      8     313   227    314   324    207   208    314   323
     16     202   203    315   324    209   211    312   321
     32     212   207    314   324    251   243    312   321
     64     240   237    314   326    329   306    312   321
    128     301   302    314   324    511   416    313   324
    256     498   554    327   332    970   837    326   332
    512     532   553    324   332   1025   932    326   335
   1024     705   718    325   333   1489  1231    324   330
   2048     764   767    324   334   2708  2175    324   332
  4096*    1033   476    325   674   4727   782    324   678

Notes:

Worst case:
-----------
We generally lose in the alloc/free test (x86_64 3%, IA64 5-10%) since the
processing overhead increases because we need to look up the per cpu
structure. Alloc/Free is simply kfree(kmalloc(size, mask)), i.e. objects
with the shortest possible lifetime. We would never use objects in that
way, but the measurement is important to show the worst case overhead
created.

Single Threaded:
----------------
The single threaded kmalloc test shows the behavior of a continual stream
of allocations without contention. In the SMP case the losses are minimal.
In the NUMA case we already have a winner because the per cpu structure is
placed local to the processor. So in the single threaded case we already
win around 5% just by placing things better.

Concurrent Alloc:
-----------------
We have varying gains of up to 50% on NUMA because we are now never
updating a cacheline used by the other processor and the data structures
are local to the processor. The SMP case shows gains but they are smaller
(especially since this is the smallest SMP system possible: 2 CPUs), so
only up to 25%.

Page allocator pass through
---------------------------
There is a significant difference in the rows marked with a * because of
the way that allocations for page sized objects are handled (see the
sketch at the end of this message). If we handle the allocations in the
slab allocator (Norm) then the alloc/free test results are superb since we
can use the per cpu slab to just pass a pointer back and forth. The page
allocator pass through (PCPU) shows that the page allocator may have
problems with giving back the same page after a free. Or there is
something else in the page allocator that creates significant overhead
compared to slab. Needs to be checked out I guess.

However, the page allocator pass through is a win in the other cases since
we can cut out the slab allocator overhead for these sizes. That is the
more typical load of allocating a sequence of objects and we should
optimize for that.

(+ = Must be some cache artifact here or code crossing a TLB boundary.
The result is reproducible.)
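For clarity, here is a minimal sketch of the pass through decision
discussed above. The threshold and the size_to_cache() helper are
assumptions for illustration only, not the exact code from the patches;
the idea is simply that page sized allocations bypass the slab layer and
go straight to the page allocator.

	/*
	 * Minimal sketch of page allocator pass through, for illustration
	 * only. size_to_cache() is a hypothetical helper that maps a size
	 * to its kmalloc cache; the real patches wire this into kmalloc()
	 * differently, and the exact threshold is an assumption.
	 */
	static inline void *kmalloc_sketch(size_t size, gfp_t flags)
	{
		if (size >= PAGE_SIZE)
			/* Large object: bypass the slab layer entirely */
			return (void *)__get_free_pages(flags, get_order(size));

		/* Small object: allocate from the per cpu slab as before */
		return kmem_cache_alloc(size_to_cache(size), flags);
	}

This is why the starred rows behave so differently in the alloc/free
columns: with pass through, every alloc/free pair for a page sized object
is a round trip through the page allocator instead of a pointer exchange
in the per cpu slab.

--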