From: Christoph Lameter
To: akpm@linux-foundation.org
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Mathieu Desnoyers
Cc: Pekka Enberg
Subject: [RFC] SLUB: Improve allocpercpu to reduce per cpu access overhead
Subject-Prefix: [patch @num@/@total@]

This patch increases the speed of the SLUB fastpath by improving the per cpu
allocator and making it usable for SLUB.

Currently allocpercpu manages arrays of pointers to per cpu objects, which
means it has to allocate the arrays and then populate them with objects as
needed. Although these objects are called per cpu objects, they cannot be
addressed like ordinary per cpu data, i.e. by adding the per cpu offset of
the respective cpu to a base address.

This patch changes that. We create a small memory pool in the percpu area and
allocate from there when alloc_percpu() is called. As a result the per cpu
pointer arrays for each object are no longer needed, which reduces memory
usage and the cache footprint of allocpercpu users. In addition, the per cpu
objects of a single processor are packed tightly next to each other, reducing
the cache footprint even further and making it possible to access multiple
objects in the same cacheline.

SLUB has the same mechanism implemented. After fixing up the allocpercpu code
we throw the SLUB-private method out and use the new allocpercpu handling.

Then we optimize allocpercpu addressing by adding a new function,
this_cpu_ptr(), that determines the per cpu pointer for the current processor
in a more efficient way on many platforms.

This increases the speed of SLUB (and likely of other kernel subsystems that
benefit from the allocpercpu enhancements):

Size    SLAB    SLUB    SLUB+   SLUB-o  SLUB-a
   8      96      86      45      44      38    3 *
  16      84      92      49      48      43    2 *
  32      84     106      61      59      53    +++
  64     102     129      82      88      75    ++
 128     147     226     188     181     176    -
 256     200     248     207     285     204    =
 512     300     301     260     209     250    +
1024     416     440     398     264     391    ++
2048     720     542     530     390     511    +++
4096    1254     342     342     336     376    3 *

alloc/free test

        SLAB    SLUB    SLUB+   SLUB-o  SLUB-a
     137-146     151   68-72   68-74   56-58    3 *

Note: The per cpu optimizations are only halfway there because of the screwed
up way that x86_64 handles its cpu area: additional cycles are spent
retrieving a pointer from memory and adding it to the address. The i386 code
is much less cycle intensive since it can reach per cpu data with a segment
prefix. If we can get that to work on x86_64 then we may be able to get the
cycle count for the fastpath down to 20-30 cycles.
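
To illustrate the addressing change (a simplified sketch only; the arch
specific implementations and the exact macro names in the patch differ), the
generic form of this_cpu_ptr() is just a per cpu offset addition:

	/*
	 * Generic fallback sketch: look up the offset of the executing
	 * processor and add it to the per cpu base address. i386 can do
	 * the equivalent with a single segment prefixed access, which is
	 * why its fastpath is so much cheaper.
	 */
	#define this_cpu_ptr(ptr)					\
		((__typeof__(ptr))((char *)(ptr) +			\
			per_cpu_offset(smp_processor_id())))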
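
The effect on the SLUB fastpath can be sketched as follows (simplified;
field names are approximate and interrupt handling, debugging hooks and the
slow path details are omitted). The per cpu structure of a cache is reached
with a single offset addition instead of a pointer array lookup:

	/*
	 * Simplified sketch, assuming kmem_cache carries a cpu_slab per cpu
	 * pointer allocated from the new percpu pool.
	 */
	static __always_inline void *slab_alloc(struct kmem_cache *s,
						gfp_t gfpflags)
	{
		struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
		void **object = c->freelist;

		if (unlikely(!object))
			return __slab_alloc(s, gfpflags);

		/* The next free object is stored at c->offset inside the object */
		c->freelist = object[c->offset];
		return object;
	}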