GIT 28e4b71a66881df1ac343f13d06395fa01021e8e git+ssh://master.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6.git#for-mm

28e4b71a66881df1ac343f13d06395fa01021e8e

commit 28e4b71a66881df1ac343f13d06395fa01021e8e
Author: Pekka Enberg
Date: Tue Apr 8 22:26:36 2008 +0300

slub: use typedefs for ->get and ->kick functions

As suggested by Andrew Morton, use typedefs for the SLUB defragmentation ->get and ->kick callback functions.

Signed-off-by: Pekka Enberg

commit 9ea7cc66193775609c0e736de6ef38c5487a5ad9
Author: Christoph Lameter
Date: Tue Apr 8 22:26:31 2008 +0300

SLUB: Trigger defragmentation from memory reclaim

This patch triggers slab defragmentation from memory reclaim. The logical point for this is after slab shrinking has been performed in vmscan.c. At that point the fragmentation of slab caches has increased because objects were freed via the LRUs, so we call kmem_cache_defrag() from there.

shrink_slab() in vmscan.c is called in some contexts to do global shrinking of slabs and in others to do shrinking for a particular zone. Pass the zone to shrink_slab() so that it can call kmem_cache_defrag() and restrict the defragmentation to the node that is under memory pressure.

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 94c7b233b80d6ecff8304ed20acf071c37de342c
Author: Christoph Lameter
Date: Tue Apr 8 22:26:31 2008 +0300

slub: add defrag statistics

Add statistics counters for slab defragmentation.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 171250363fe803b4dc61301276c2693cce3e5684
Author: Christoph Lameter
Date: Tue Apr 8 22:26:30 2008 +0300

SLUB: Extend slabinfo to support -D and -F options

-F lists caches that support defragmentation.
-C lists caches that use a ctor.

Change field names for defrag_ratio and remote_node_defrag_ratio.

Add determination of the allocation ratio for a slab. The allocation ratio is the percentage of available slots for objects in use.

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 831d78b552aade2c383cf8d75b180dd35f81a4e3
Author: Christoph Lameter
Date: Tue Apr 8 22:26:30 2008 +0300

SLUB: Add KICKABLE to avoid repeated kick() attempts

Add a flag KICKABLE to be set on slabs with a defragmentation method. Clear the flag if a kick action is not successful in reducing the number of objects in a slab. This avoids future attempts to kick objects out. The KICKABLE flag is set again when all objects of the slab have been allocated (this occurs during removal of a slab from the partial lists).

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit c963d891d875a9bd39ae44da623c421bc0140937
Author: Christoph Lameter
Date: Tue Apr 8 22:26:30 2008 +0300

SLUB: Slab defrag core

Slab defragmentation may occur:

1. Unconditionally when kmem_cache_shrink() is called on a slab cache by the kernel.
2. Through use of the slabinfo command line tool to trigger slab shrinking.
3. Per node, conditionally, when kmem_cache_defrag() is called.

Defragmentation is only performed if the allocation ratio of a slab is lower than the specified percentage. The ratio is measured by calculating the percentage of objects in use compared to the total number of objects that the slab page can accommodate.

Defragmentation of a slab cache is skipped if less than a tenth of a second has passed since it was last checked. An unsuccessful defrag attempt pauses further attempts for at least one second. This is necessary to limit useless partial list scanning.

The scanning of slab caches is optimized because the defragmentable slabs come first on the list. Thus we can terminate scans on the first slab encountered that does not support defragmentation.

kmem_cache_defrag() takes a node parameter. This can either be -1 if defragmentation should be performed on all nodes, or a node number. If a node number is specified then defragmentation is only performed on that node.

A couple of functions must be set up via a call to kmem_cache_setup_defrag() in order for a slab cache to support defragmentation. These are:

void *get(struct kmem_cache *s, int nr, void **objects)

Must obtain a reference to the listed objects. SLUB guarantees that the objects are still allocated. However, other threads may be blocked in slab_free() attempting to free objects in the slab. These may succeed as soon as get() returns to the slab allocator. The function must be able to detect such situations and void the attempts to free such objects (for example by voiding the corresponding entry in the objects array).

No slab operations may be performed in get(). Interrupts are disabled, so what can be done is very limited. The slab lock for the page that contains the object is taken, and any attempt to perform a slab operation may lead to a deadlock.

get() returns a private pointer that is passed to kick(). Should we be unable to obtain all references then that pointer may indicate to the kick() function that it should not attempt any object removal or move but simply drop the reference counts that were already obtained.

void kick(struct kmem_cache *s, int nr, void **objects, void *get_result)

After SLUB has established references to the objects in a slab it drops all locks and uses kick() to move objects out of the slab. The existence of the objects is guaranteed by virtue of the references obtained earlier via get(). The callback may perform any slab operation since no locks are held at the time of the call.

The callback should remove the object from the slab in some way. This may be accomplished by reclaiming the object and then running kmem_cache_free(), or by reallocating it and then running kmem_cache_free(). Reallocation is advantageous because the partial slabs were just sorted to have the partial slabs with the most objects first. Reallocation is likely to result in filling up a slab in addition to freeing up one slab; a filled up slab can also be removed from the partial list, so there can be a double effect.

kick() does not return a result. SLUB will check the number of remaining objects in the slab. If all objects were removed then we know that the operation was successful.

[penberg@cs.helsinki.fi: fix up locking in __kmem_cache_shrink()]
Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
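The interface described above can be summarized with a short sketch. This is not part of the patch set: the callback signatures and kmem_cache_setup_defrag() are the ones introduced by this series, while the example cache and the foo_tryget()/foo_evict()/foo_put() helpers are hypothetical and stand in for whatever reference counting and eviction scheme a real user would provide.

/*
 * Hypothetical user of the defragmentation hooks. foo_cache is assumed
 * to have been created with kmem_cache_create() and a constructor,
 * which the patch requires (kmem_cache_setup_defrag() does
 * BUG_ON(!s->ctor)).
 */
static void *foo_get(struct kmem_cache *s, int nr, void **objects)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = objects[i];

		/*
		 * Only take references here: interrupts are off and the
		 * slab lock is held, so no slab operations are allowed.
		 * Objects that are concurrently being freed are voided
		 * by clearing their entry in the array.
		 */
		if (!foo_tryget(f))
			objects[i] = NULL;
	}
	return NULL;		/* private pointer handed to foo_kick() */
}

static void foo_kick(struct kmem_cache *s, int nr, void **objects,
			void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = objects[i];

		if (!f)
			continue;
		/* No locks held: evict the object and drop the reference */
		foo_evict(f);
		foo_put(f);	/* final put ends in kmem_cache_free() */
	}
}

static int __init foo_init(void)
{
	/* Register the callbacks right after the cache is created */
	kmem_cache_setup_defrag(foo_cache, foo_get, foo_kick);
	return 0;
}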
commit bf197812d941b484ac64da8d28ee885f8938dcff
Author: Christoph Lameter
Date: Tue Apr 8 22:26:30 2008 +0300

SLUB: Sort slab cache list and establish maximum objects for defrag slabs

When defragmenting slabs it is advantageous to have all defragmentable slabs together at the beginning of the list so that there is no need to scan the complete list. Put defragmentable caches first when adding a slab cache and others last.

Determine the maximum number of objects in defragmentable slabs. This allows sizing the allocation of the arrays holding references to these objects later (see the sketch below).

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
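The scratch buffer used when vacating a slab page is sized from this maximum. The snippet below restates alloc_scratch() and the buffer layout from the patch further down, only to make the sizing explicit.

/*
 * Scratch space for vacating one slab page, sized for the largest
 * defragmentable cache: max_defrag_slab_objects object pointers
 * (the array handed to get() and kick()) followed by a bitmap with
 * one bit per object marking which slots are in use.
 */
static inline void *alloc_scratch(void)
{
	return kmalloc(max_defrag_slab_objects * sizeof(void *) +
		BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long),
		GFP_KERNEL);
}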
commit 66892337435a0d88996057af221e8c18ff91bc14
Author: Christoph Lameter
Date: Tue Apr 8 22:26:29 2008 +0300

SLUB: Add get() and kick() methods

Add the two methods needed for defragmentation and add the display of the methods via the proc interface. Add documentation explaining the use of these methods.

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 24337bca6e77ab48f459e35690b32ef20a34bda5
Author: Christoph Lameter
Date: Tue Apr 8 22:26:29 2008 +0300

SLUB: Replace ctor field with ops field in /sys/slab/*

Create an ops field in /sys/slab/*/ops to contain all the operations defined on a slab. This will be used to display the additional operations that will be defined soon.

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 676e1fd04174c8192c0fbf920a798c8f033e1960
Author: Christoph Lameter
Date: Tue Apr 8 22:26:29 2008 +0300

SLUB: Add defrag_ratio field and sysfs support.

The defrag_ratio is used to set the threshold at which defragmentation should be run on a slab cache. The allocation ratio is measured as the percentage of the available slots that are allocated; the percentage will be lower for slabs that are more fragmented.

Add a defrag_ratio field and set it to 30% by default. A limit of 30% specifies that fewer than 3 out of 10 available slots for objects must be in use before slab defragmentation runs.

Reviewed-by: Rik van Riel
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
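A minimal sketch of the threshold test this implies; the comparison mirrors the check used by __kmem_cache_shrink() later in the patch, but the helper itself and its name are only illustrative.

/*
 * Defragmentation of a slab page is attempted only when its
 * allocation ratio (objects in use / total objects) is below the
 * cache's defrag_ratio percentage. With the default of 30, a page
 * holding 10 objects is a candidate when fewer than 3 are in use.
 */
static inline int defrag_candidate(unsigned int inuse,
				unsigned int objects, int defrag_ratio)
{
	/* equivalent to (100 * inuse / objects) < defrag_ratio */
	return inuse * 100 < (unsigned int)defrag_ratio * objects;
}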
commit d3cd9f89992a413a94e1e48714426b7ee22b06dc
Author: Zhang, Yanmin
Date: Tue Apr 8 17:18:06 2008 +0800

slub: change the formula which calculates min_objects based on number of processors

The current formula to calculate min_objects based on the number of processors is '4 * fls(nr_cpu_ids)', which is not the best optimization on a 16-core Tigerton. If I add 4 to its result, the hackbench result is better. On the 16-core Tigerton, running ./hackbench 100 process 2000 gives:

1) 2.6.25-rc6 slab: 23.5 seconds
2) 2.6.25-rc7 SLUB + slub_min_objects=20: 31 seconds
3) 2.6.25-rc7 SLUB + slub_min_objects=24: 23.5 seconds

So adding 4 to the output of '4 * fls(nr_cpu_ids)' gets a result similar to CONFIG_SLAB=y. This patch adds 4 to the formula. With the patch, the minimum objects per slab is calculated as below:

Processors   min_objects
---------------------------
      1           8
      2          12
      4          16
      8          20
     16          24
     32          28
     64          32
   1024          48
   4096          56

Signed-off-by: Zhang Yanmin
CC: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 8d817d05eeb38a52ebbe2982e65691e507bff1e2
Author: Christoph Lameter
Date: Fri Apr 4 15:50:31 2008 -0700

slub: pack objects denser

Since we now have more orders available, use a denser packing. Increase the slab order if more than 1/16th of a slab would be wasted.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 88a50aab9a11f0bd75c34703a42445a56489446a
Author: Christoph Lameter
Date: Fri Apr 4 15:50:30 2008 -0700

slub: Calculate min_objects based on number of processors.

The minimum objects per slab is calculated based on the number of processors that may come online:

Processors   min_objects
---------------------------
      1           4
      2           8
      4          12
      8          16
     16          20
     32          24
     64          28
   1024          44
   4096          52

The higher the number of processors, the larger the order sizes used for various slab caches will become. This has been shown to address the performance issues in hackbench on 16p etc.

The calculation is only performed if slub_min_objects is zero (the default). If one specifies slub_min_objects on boot then that setting is taken.

Cc: yanmin_zhang@linux.intel.com
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg
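The tables above can be reproduced with a small userspace sketch (not part of the patch) that evaluates the new formula, 4 * (fls(nr_cpu_ids) + 1), using a local equivalent of the kernel's fls().

/*
 * Standalone check of the min_objects formula used above:
 * min_objects = 4 * (fls(nr_cpu_ids) + 1), where fls() returns the
 * position of the highest set bit (fls(1) == 1, fls(16) == 5).
 */
#include <stdio.h>

static int fls(unsigned int x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

int main(void)
{
	unsigned int cpus[] = { 1, 2, 4, 8, 16, 32, 64, 1024, 4096 };
	unsigned int i;

	for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++)
		printf("%4u processors -> min_objects = %d\n",
		       cpus[i], 4 * (fls(cpus[i]) + 1));
	return 0;
}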
commit b81da56f11a47ec547f4a745e329c6945d85637b
Author: Christoph Lameter
Date: Fri Apr 4 15:50:29 2008 -0700

slub: Drop DEFAULT_MAX_ORDER / DEFAULT_MIN_OBJECTS

We can now fall back to order 0 slabs. So set slub_max_order to PAGE_ALLOC_COSTLY_ORDER but keep slub_min_objects at 4. This will mostly preserve the orders used in 2.6.25. E.g. the 2k kmalloc slab will use order 1 allocations and the 4k kmalloc slab order 2.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 23ac45a7efac366d5eabfbe3a94f16f992311f08
Author: Christoph Lameter
Date: Fri Apr 4 15:50:28 2008 -0700

slub: Simplify any_slab_object checks

Since we now have a total_objects counter per node, use that to check for the presence of any objects. The loop over all cpu slabs is not that useful, since any cpu slab would require an object allocation first. So drop it.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit acd49c885e03f087c31f49e7c42ccb8befbf4009
Author: Christoph Lameter
Date: Fri Apr 4 15:50:27 2008 -0700

slub: Make the order configurable for each slab cache

Makes /sys/kernel/slab//order writable. The allocation order of a slab cache can then be changed dynamically during runtime. This can be used to override the objects-per-slab value established with the slub_min_objects setting that was manually specified or calculated on bootup.

A change of the slab order can occur while allocate_slab() runs. allocate_slab() needs the order and the number of slab objects, both of which are changed by a change of order. Both are put into a single word (struct kmem_cache_order_objects) so they can be atomically updated and retrieved.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit d82693414e6841a6ce9ae30a7d767a1a98b30161
Author: Christoph Lameter
Date: Fri Apr 4 15:50:26 2008 -0700

slub: Drop fallback to page allocator method

There is now a generic method of falling back to a slab page of minimal order, so the fallback to kmalloc_large() is no longer needed.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit c5dafdd6cf205e2ecb5ef513d0f3deab82ed45dd
Author: Christoph Lameter
Date: Fri Apr 4 15:50:25 2008 -0700

slub: Fallback to minimal order during slab page allocation

If any higher order allocation fails then fall back to the smallest order necessary to contain at least one object. This enables fallback to order 0 pages for all allocations. The fallback will waste more memory (objects will not fit neatly) and the fallback slabs will not be as efficient as larger slabs since they contain fewer objects.

Note that SLAB also depends on order 1 allocations for some slabs that waste too much memory if forced into a PAGE_SIZE'd page. SLUB can now deal with failing order 1 allocations, which SLAB cannot do.

Add a new field min that contains the number of objects for the smallest possible order for a slab cache.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 5bfe92162b96c73a6244400e54676fa73617c6a4
Author: Christoph Lameter
Date: Fri Apr 4 15:50:24 2008 -0700

slub: Update statistics handling for variable order slabs

Change the statistics to consider that slabs of the same slab cache can have a different number of objects in them since they may be of different order.

Provide a new sysfs field total_objects which shows the total number of objects that the allocated slabs of a slab cache could hold.

Add a max field that holds the largest slab order that was ever used for a slab cache.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 5293be7138acda6968f082c57462837eb4d5040c
Author: Christoph Lameter
Date: Fri Apr 4 15:50:23 2008 -0700

slub: Add kmem_cache_order_objects struct

Pack the order and the number of objects into a single word. This saves some memory in the kmem_cache structure and more importantly allows us to fetch both values atomically.

Later the slab orders become runtime configurable and we need to fetch these two items together in order to properly allocate a slab and initialize its objects. Fix the race by fetching the order and the number of objects in one word.

[penberg@cs.helsinki.fi: fix memset() page order in new_slab()]
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 303383ffc854607f1ea0e6f0b9a7467549b2393a
Author: Christoph Lameter
Date: Fri Apr 4 15:50:22 2008 -0700

slub: for_each_object must be passed the number of objects in a slab

Pass the number of objects to the for_each_object macro. Most of these changes are debug related.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 2c494faabfb92d03b1a6cb8d0295dd2db129a9ab
Author: Christoph Lameter
Date: Fri Apr 4 15:50:21 2008 -0700

slub: Store max number of objects in the page struct.

Split the inuse field up to be able to store the number of objects in this page in the page struct as well. This is necessary if we want to have pages of various orders for a slab. It also avoids touching struct kmem_cache cachelines in __slab_alloc().

Update the diagnostic code to check the number of objects and make sure that the number of objects always stays within the bounds of a 16-bit unsigned integer.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 910e419b394acb4f280aea33ab47eb52d654b4aa
Author: Christoph Lameter
Date: Sat Apr 5 20:02:13 2008 +0300

slub: No need for per node slab counters if !SLUB_DEBUG

The per node counters are used mainly for showing data through the sysfs API. If that API is not compiled in then there is no point in keeping track of this data. Disable the counters for the number of slabs and the total number of objects if !SLUB_DEBUG. Incrementing the per node counters also accesses a potentially contended cacheline, so this could actually be a performance benefit to embedded systems.

SLABINFO support is also affected. It now must depend on SLUB_DEBUG (which is on by default).

The patch also avoids a check for a NULL kmem_cache_node pointer in new_slab() if the system is not compiled with NUMA support.

[penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 6aac8c3bc0db55415b461e9cb80086ac025e1bab
Author: Christoph Lameter
Date: Sat Apr 5 18:44:58 2008 +0300

slub: Move map/flag clearing to __free_slab

__free_slab does some diagnostics. The resetting of mapcount etc. in discard_slab() can interfere with debug processing, so move the reset to immediately before the page is freed.

Signed-off-by: Christoph Lameter
Signed-off-by: Pekka Enberg

commit 20666019174a16a129dd4b20130fd4adb09c3702
Author: Christoph Lameter
Date: Fri Apr 4 13:00:55 2008 -0700

slub: Fixes to per cpu stat output in sysfs

Only output per cpu stats if the kernel is built for SMP.

Use a capital "C" as a leading character for the processor number (same as the numa statistics that also use a capital letter "N").
Signed-off-by: Christoph Lameter Signed-off-by: Pekka Enberg commit d1f8e10ba374edee5aa456594e0df0373ca1a0ad Author: Christoph Lameter Date: Fri Apr 4 13:00:54 2008 -0700 slub: Deal with config variable dependencies count_partial() is used by both slabinfo and the sysfs proc support. Move the function directly before the beginning of the sysfs code so that it can be easily found. Rework the preprocessor conditional to take into account that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG. Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There is no point of keeping statistics if no one can restrive them. Signed-off-by: Christoph Lameter Signed-off-by: Pekka Enberg commit efbda3893a44778241f02a77bf241f4cbd1cdccf Author: Christoph Lameter Date: Fri Apr 4 13:00:53 2008 -0700 slub: Reduce #ifdef ZONE_DMA by moving kmalloc_caches_dma near dma logic Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA. This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support. Signed-off-by: Christoph Lameter Signed-off-by: Pekka Enberg commit 80ef72015572c71f81cfb128e208b2a0f51c7a66 Author: Pekka Enberg Date: Fri Apr 4 13:00:52 2008 -0700 slub: Initialize per-cpu stats As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before using it. [kmem_cache_cpu structures are usually allocated from arrays defined via DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl]. Reported-by: Vegard Nossum Signed-off-by: Christoph Lameter Signed-off-by: Pekka Enberg Documentation/vm/slabinfo.c | 103 ++++-- fs/drop_caches.c | 2 +- include/linux/mm.h | 2 +- include/linux/mm_types.h | 5 +- include/linux/slab_def.h | 5 + include/linux/slob_def.h | 5 + include/linux/slub_def.h | 70 ++++- init/Kconfig | 2 +- lib/Kconfig.debug | 2 +- mm/slub.c | 914 ++++++++++++++++++++++++++++++------------- mm/vmscan.c | 26 +- 11 files changed, 823 insertions(+), 313 deletions(-) diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c index 22d7e3e..e14f026 100644 --- a/Documentation/vm/slabinfo.c +++ b/Documentation/vm/slabinfo.c @@ -31,7 +31,9 @@ struct slabinfo { int hwcache_align, object_size, objs_per_slab; int sanity_checks, slab_size, store_user, trace; int order, poison, reclaim_account, red_zone; - unsigned long partial, objects, slabs; + int defrag, ctor; + int defrag_ratio, remote_node_defrag_ratio; + unsigned long partial, objects, slabs, objects_partial, objects_total; unsigned long alloc_fastpath, alloc_slowpath; unsigned long free_fastpath, free_slowpath; unsigned long free_frozen, free_add_partial, free_remove_partial; @@ -39,6 +41,9 @@ struct slabinfo { unsigned long cpuslab_flush, deactivate_full, deactivate_empty; unsigned long deactivate_to_head, deactivate_to_tail; unsigned long deactivate_remote_frees; + unsigned long shrink_calls, shrink_attempt_defrag, shrink_empty_slab; + unsigned long shrink_slab_skipped, shrink_slab_reclaimed; + unsigned long shrink_object_reclaim_failure; int numa[MAX_NODES]; int numa_partial[MAX_NODES]; } slabinfo[MAX_SLABS]; @@ -64,6 +69,8 @@ int show_slab = 0; int skip_zero = 1; int show_numa = 0; int show_track = 0; +int show_defrag = 0; +int show_ctor = 0; int show_first_alias = 0; int validate = 0; int shrink = 0; @@ -100,13 +107,15 @@ void fatal(const char *x, ...) void usage(void) { printf("slabinfo 5/7/2007. (c) 2007 sgi. 
clameter@sgi.com\n\n" - "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n" + "slabinfo [-aCdDefFhnpvtsz] [-d debugopts] [slab-regexp]\n" "-a|--aliases Show aliases\n" "-A|--activity Most active slabs first\n" + "-C|--ctor Show slabs with ctors\n" "-d|--debug= Set/Clear Debug options\n" "-D|--display-active Switch line format to activity\n" "-e|--empty Show empty slabs\n" "-f|--first-alias Show first alias\n" + "-F|--defrag Show defragmentable caches\n" "-h|--help Show usage information\n" "-i|--inverted Inverted list\n" "-l|--slabs Show slabs\n" @@ -296,7 +305,7 @@ void first_line(void) printf("Name Objects Alloc Free %%Fast\n"); else printf("Name Objects Objsize Space " - "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n"); + "Slabs/Part/Cpu O/S O %%Ra %%Ef Flg\n"); } /* @@ -345,7 +354,7 @@ void slab_numa(struct slabinfo *s, int mode) return; if (!line) { - printf("\n%-21s:", mode ? "NUMA nodes" : "Slab"); + printf("\n%-21s: Rto ", mode ? "NUMA nodes" : "Slab"); for(node = 0; node <= highest_node; node++) printf(" %4d", node); printf("\n----------------------"); @@ -354,6 +363,7 @@ void slab_numa(struct slabinfo *s, int mode) printf("\n"); } printf("%-21s ", mode ? "All slabs" : s->name); + printf("%3d ", s->remote_node_defrag_ratio); for(node = 0; node <= highest_node; node++) { char b[20]; @@ -459,22 +469,28 @@ void slab_stats(struct slabinfo *s) printf("Total %8lu %8lu\n\n", total_alloc, total_free); - if (s->cpuslab_flush) - printf("Flushes %8lu\n", s->cpuslab_flush); - - if (s->alloc_refill) - printf("Refill %8lu\n", s->alloc_refill); + if (s->cpuslab_flush || s->alloc_refill) + printf("CPU Slab : Flushes=%lu Refills=%lu\n", + s->cpuslab_flush, s->alloc_refill); total = s->deactivate_full + s->deactivate_empty + s->deactivate_to_head + s->deactivate_to_tail; if (total) - printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) " + printf("Deactivate: Full=%lu(%lu%%) Empty=%lu(%lu%%) " "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n", s->deactivate_full, (s->deactivate_full * 100) / total, s->deactivate_empty, (s->deactivate_empty * 100) / total, s->deactivate_to_head, (s->deactivate_to_head * 100) / total, s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total); + + if (s->shrink_calls) + printf("Shrink : Calls=%lu Attempts=%lu Empty=%lu Successful=%lu\n", + s->shrink_calls, s->shrink_attempt_defrag, + s->shrink_empty_slab, s->shrink_slab_reclaimed); + if (s->shrink_slab_skipped || s->shrink_object_reclaim_failure) + printf("Defrag : Slabs skipped=%lu Object reclaim failure=%lu\n", + s->shrink_slab_skipped, s->shrink_object_reclaim_failure); } void report(struct slabinfo *s) @@ -492,6 +508,8 @@ void report(struct slabinfo *s) printf("** Slabs are destroyed via RCU\n"); if (s->reclaim_account) printf("** Reclaim accounting active\n"); + if (s->defrag) + printf("** Defragmentation at %d%%\n", s->defrag_ratio); printf("\nSizes (bytes) Slabs Debug Memory\n"); printf("------------------------------------------------------------------------\n"); @@ -539,8 +557,15 @@ void slabcache(struct slabinfo *s) if (show_empty && s->slabs) return; + if (show_defrag && !s->defrag) + return; + + if (show_ctor && !s->ctor) + return; + store_size(size_str, slab_size(s)); - snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs, s->partial, s->cpu_slabs); + snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs, + s->partial, s->cpu_slabs); if (!line++) first_line(); @@ -549,6 +574,10 @@ void slabcache(struct slabinfo *s) *p++ = '*'; if (s->cache_dma) *p++ = 'd'; + if (s->defrag) + *p++ = 'F'; + if (s->ctor) + *p++ = 'C'; if 
(s->hwcache_align) *p++ = 'A'; if (s->poison) @@ -582,7 +611,8 @@ void slabcache(struct slabinfo *s) printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n", s->name, s->objects, s->object_size, size_str, dist_str, s->objs_per_slab, s->order, - s->slabs ? (s->partial * 100) / s->slabs : 100, + s->slabs ? (s->partial * 100) / + (s->slabs * s->objs_per_slab) : 100, s->slabs ? (s->objects * s->object_size * 100) / (s->slabs * (page_size << s->order)) : 100, flags); @@ -776,7 +806,6 @@ void totals(void) unsigned long used; unsigned long long wasted; unsigned long long objwaste; - long long objects_in_partial_slabs; unsigned long percentage_partial_slabs; unsigned long percentage_partial_objs; @@ -790,18 +819,11 @@ void totals(void) wasted = size - used; objwaste = s->slab_size - s->object_size; - objects_in_partial_slabs = s->objects - - (s->slabs - s->partial - s ->cpu_slabs) * - s->objs_per_slab; - - if (objects_in_partial_slabs < 0) - objects_in_partial_slabs = 0; - percentage_partial_slabs = s->partial * 100 / s->slabs; if (percentage_partial_slabs > 100) percentage_partial_slabs = 100; - percentage_partial_objs = objects_in_partial_slabs * 100 + percentage_partial_objs = s->objects_partial * 100 / s->objects; if (percentage_partial_objs > 100) @@ -823,8 +845,8 @@ void totals(void) min_objects = s->objects; if (used < min_used) min_used = used; - if (objects_in_partial_slabs < min_partobj) - min_partobj = objects_in_partial_slabs; + if (s->objects_partial < min_partobj) + min_partobj = s->objects_partial; if (percentage_partial_slabs < min_ppart) min_ppart = percentage_partial_slabs; if (percentage_partial_objs < min_ppartobj) @@ -848,8 +870,8 @@ void totals(void) max_objects = s->objects; if (used > max_used) max_used = used; - if (objects_in_partial_slabs > max_partobj) - max_partobj = objects_in_partial_slabs; + if (s->objects_partial > max_partobj) + max_partobj = s->objects_partial; if (percentage_partial_slabs > max_ppart) max_ppart = percentage_partial_slabs; if (percentage_partial_objs > max_ppartobj) @@ -864,7 +886,7 @@ void totals(void) total_objects += s->objects; total_used += used; - total_partobj += objects_in_partial_slabs; + total_partobj += s->objects_partial; total_ppart += percentage_partial_slabs; total_ppartobj += percentage_partial_objs; @@ -1160,6 +1182,8 @@ void read_slab_dir(void) slab->hwcache_align = get_obj("hwcache_align"); slab->object_size = get_obj("object_size"); slab->objects = get_obj("objects"); + slab->objects_partial = get_obj("objects_partial"); + slab->objects_total = get_obj("objects_total"); slab->objs_per_slab = get_obj("objs_per_slab"); slab->order = get_obj("order"); slab->partial = get_obj("partial"); @@ -1193,7 +1217,24 @@ void read_slab_dir(void) slab->deactivate_to_head = get_obj("deactivate_to_head"); slab->deactivate_to_tail = get_obj("deactivate_to_tail"); slab->deactivate_remote_frees = get_obj("deactivate_remote_frees"); + slab->shrink_calls = get_obj("shrink_calls"); + slab->shrink_attempt_defrag = get_obj("shrink_attempt_defrag"); + slab->shrink_empty_slab = get_obj("shrink_empty_slab"); + slab->shrink_slab_skipped = get_obj("shrink_slab_skipped"); + slab->shrink_slab_reclaimed = get_obj("shrink_slab_reclaimed"); + slab->shrink_object_reclaim_failure = + get_obj("shrink_object_reclaim_failure"); + slab->defrag_ratio = get_obj("defrag_ratio"); + slab->remote_node_defrag_ratio = + get_obj("remote_node_defrag_ratio"); chdir(".."); + if (read_slab_obj(slab, "ops")) { + if (strstr(buffer, "ctor :")) + slab->ctor = 1; + if (strstr(buffer, 
"kick :")) + slab->defrag = 1; + } + if (slab->name[0] == ':') alias_targets++; slab++; @@ -1244,10 +1285,12 @@ void output_slabs(void) struct option opts[] = { { "aliases", 0, NULL, 'a' }, { "activity", 0, NULL, 'A' }, + { "ctor", 0, NULL, 'C' }, { "debug", 2, NULL, 'd' }, { "display-activity", 0, NULL, 'D' }, { "empty", 0, NULL, 'e' }, { "first-alias", 0, NULL, 'f' }, + { "defrag", 0, NULL, 'F' }, { "help", 0, NULL, 'h' }, { "inverted", 0, NULL, 'i'}, { "numa", 0, NULL, 'n' }, @@ -1270,7 +1313,7 @@ int main(int argc, char *argv[]) page_size = getpagesize(); - while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS", + while ((c = getopt_long(argc, argv, "aACd::DefFhil1noprstvzTS", opts, NULL)) != -1) switch (c) { case '1': @@ -1326,6 +1369,12 @@ int main(int argc, char *argv[]) case 'z': skip_zero = 0; break; + case 'C': + show_ctor = 1; + break; + case 'F': + show_defrag = 1; + break; case 'T': show_totals = 1; break; diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 59375ef..fb58e63 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -50,7 +50,7 @@ void drop_slab(void) int nr_objects; do { - nr_objects = shrink_slab(1000, GFP_KERNEL, 1000); + nr_objects = shrink_slab(1000, GFP_KERNEL, 1000, NULL); } while (nr_objects > 10); } diff --git a/include/linux/mm.h b/include/linux/mm.h index b695875..c322ae3 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1206,7 +1206,7 @@ int in_gate_area_no_task(unsigned long addr); int drop_caches_sysctl_handler(struct ctl_table *, int, struct file *, void __user *, size_t *, loff_t *); unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, - unsigned long lru_pages); + unsigned long lru_pages, struct zone *z); void drop_pagecache(void); void drop_slab(void); diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index af190ce..e0bd223 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -42,7 +42,10 @@ struct page { * to show when page is mapped * & limit reverse map searches. 
*/ - unsigned int inuse; /* SLUB: Nr of objects */ + struct { /* SLUB */ + u16 inuse; + u16 objects; + }; }; union { struct { diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h index 39c3a5e..3a3811a 100644 --- a/include/linux/slab_def.h +++ b/include/linux/slab_def.h @@ -95,4 +95,9 @@ found: #endif /* CONFIG_NUMA */ +static inline void kmem_cache_setup_defrag(struct kmem_cache *s, + void *(*get)(struct kmem_cache *, int nr, void **), + void (*kick)(struct kmem_cache *, int nr, void **, void *private)) {} +static inline int kmem_cache_defrag(int node) { return 0; } + #endif /* _LINUX_SLAB_DEF_H */ diff --git a/include/linux/slob_def.h b/include/linux/slob_def.h index 59a3fa4..1e94782 100644 --- a/include/linux/slob_def.h +++ b/include/linux/slob_def.h @@ -33,4 +33,9 @@ static inline void *__kmalloc(size_t size, gfp_t flags) return kmalloc(size, flags); } +static inline void kmem_cache_setup_defrag(struct kmem_cache *s, + void *(*get)(struct kmem_cache *, int nr, void **), + void (*kick)(struct kmem_cache *, int nr, void **, void *private)) {} +static inline int kmem_cache_defrag(int node) { return 0; } + #endif /* __LINUX_SLOB_DEF_H */ diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h index b00c1c7..b2933d1 100644 --- a/include/linux/slub_def.h +++ b/include/linux/slub_def.h @@ -29,8 +29,18 @@ enum stat_item { DEACTIVATE_TO_HEAD, /* Cpu slab was moved to the head of partials */ DEACTIVATE_TO_TAIL, /* Cpu slab was moved to the tail of partials */ DEACTIVATE_REMOTE_FREES,/* Slab contained remotely freed objects */ + ORDER_FALLBACK, /* Number of times fallback was necessary */ + SHRINK_CALLS, /* Number of invocations of kmem_cache_shrink */ + SHRINK_ATTEMPT_DEFRAG, /* Slabs that were attempted to be reclaimed */ + SHRINK_EMPTY_SLAB, /* Shrink encountered and freed empty slab */ + SHRINK_SLAB_SKIPPED, /* Slab reclaim skipped an slab (busy etc) */ + SHRINK_SLAB_RECLAIMED, /* Successfully reclaimed slabs */ + SHRINK_OBJECT_RECLAIM_FAILED, /* Callbacks signaled busy objects */ NR_SLUB_STAT_ITEMS }; +typedef void *(*kmem_get_fn_t)(struct kmem_cache *, int, void **); +typedef void (*kmem_kick_fn_t)(struct kmem_cache *, int, void **, void *); + struct kmem_cache_cpu { void **freelist; /* Pointer to first free per cpu object */ struct page *page; /* The slab from which we are allocating */ @@ -45,14 +55,24 @@ struct kmem_cache_cpu { struct kmem_cache_node { spinlock_t list_lock; /* Protect partial list and nr_partial */ unsigned long nr_partial; - atomic_long_t nr_slabs; struct list_head partial; #ifdef CONFIG_SLUB_DEBUG + atomic_long_t nr_slabs; + atomic_long_t total_objects; struct list_head full; #endif }; /* + * Word size structure that can be atomically updated or read and that + * contains both the order and the number of objects that a slab of the + * given order would contain. + */ +struct kmem_cache_order_objects { + unsigned long x; +}; + +/* * Slab cache management. */ struct kmem_cache { @@ -61,7 +81,7 @@ struct kmem_cache { int size; /* The size of an object including meta data */ int objsize; /* The size of an object without meta data */ int offset; /* Free pointer offset. 
*/ - int order; /* Current preferred allocation order */ + struct kmem_cache_order_objects oo; /* * Avoid an extra cache line for UP, SMP and for the node local to @@ -70,12 +90,52 @@ struct kmem_cache { struct kmem_cache_node local_node; /* Allocation and freeing of slabs */ - int objects; /* Number of objects in slab */ + struct kmem_cache_order_objects max; + struct kmem_cache_order_objects min; gfp_t allocflags; /* gfp flags to use on each alloc */ int refcount; /* Refcount for slab cache destroy */ + unsigned long next_defrag; void (*ctor)(struct kmem_cache *, void *); + /* + * Called with slab lock held and interrupts disabled. + * No slab operation may be performed in get(). + * + * Parameters passed are the number of objects to process + * and an array of pointers to objects for which we + * need references. + * + * Returns a pointer that is passed to the kick function. + * If all objects cannot be moved then the pointer may + * indicate that this wont work and then kick can simply + * remove the references that were already obtained. + * + * The array passed to get() is also passed to kick(). The + * function may remove objects by setting array elements to NULL. + */ + kmem_get_fn_t get; + + /* + * Called with no locks held and interrupts enabled. + * Any operation may be performed in kick(). + * + * Parameters passed are the number of objects in the array, + * the array of pointers to the objects and the pointer + * returned by get(). + * + * Success is checked by examining the number of remaining + * objects in the slab. + */ + kmem_kick_fn_t kick; + int inuse; /* Offset to metadata */ int align; /* Alignment */ + int defrag_ratio; /* + * Ratio used to check the percentage of + * objects allocate in a slab page. + * If less than this ratio is allocated + * then reclaim attempts are made. + */ + const char *name; /* Name (only for display!) */ struct list_head list; /* List of slab caches */ #ifdef CONFIG_SLUB_DEBUG @@ -231,4 +291,8 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node) } #endif +void kmem_cache_setup_defrag(struct kmem_cache *s, kmem_get_fn_t get, + kmem_kick_fn_t kick); +int kmem_cache_defrag(int node); + #endif /* _LINUX_SLUB_DEF_H */ diff --git a/init/Kconfig b/init/Kconfig index a97924b..7fccf09 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -763,7 +763,7 @@ endmenu # General setup config SLABINFO bool depends on PROC_FS - depends on SLAB || SLUB + depends on SLAB || SLUB_DEBUG default y config RT_MUTEXES diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 0796c1a..eef557d 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -211,7 +211,7 @@ config SLUB_DEBUG_ON config SLUB_STATS default n bool "Enable SLUB performance statistics" - depends on SLUB + depends on SLUB && SLUB_DEBUG && SYSFS help SLUB statistics are useful to debug SLUBs allocation behavior in order find ways to optimize the allocator. 
This should never be diff --git a/mm/slub.c b/mm/slub.c index acc975f..4b694a7 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -101,6 +101,7 @@ */ #define FROZEN (1 << PG_active) +#define KICKABLE (1 << PG_dirty) #ifdef CONFIG_SLUB_DEBUG #define SLABDEBUG (1 << PG_error) @@ -138,6 +139,21 @@ static inline void ClearSlabDebug(struct page *page) page->flags &= ~SLABDEBUG; } +static inline int SlabKickable(struct page *page) +{ + return page->flags & KICKABLE; +} + +static inline void SetSlabKickable(struct page *page) +{ + page->flags |= KICKABLE; +} + +static inline void ClearSlabKickable(struct page *page) +{ + page->flags &= ~KICKABLE; +} + /* * Issues still to be resolved: * @@ -149,25 +165,6 @@ static inline void ClearSlabDebug(struct page *page) /* Enable to test recovery from slab corruption on boot */ #undef SLUB_RESILIENCY_TEST -#if PAGE_SHIFT <= 12 - -/* - * Small page size. Make sure that we do not fragment memory - */ -#define DEFAULT_MAX_ORDER 1 -#define DEFAULT_MIN_OBJECTS 4 - -#else - -/* - * Large page machines are customarily able to handle larger - * page orders. - */ -#define DEFAULT_MAX_ORDER 2 -#define DEFAULT_MIN_OBJECTS 8 - -#endif - /* * Mininum number of partial slabs. These will be left on the partial * lists even if they are empty. kmem_cache_shrink may reclaim them. @@ -176,10 +173,10 @@ static inline void ClearSlabDebug(struct page *page) /* * Maximum number of desirable partial slabs. - * The existence of more partial slabs makes kmem_cache_shrink - * sort the partial list by the number of objects in the. + * More slabs cause kmem_cache_shrink to sort the slabs by objects + * and triggers slab defragmentation. */ -#define MAX_PARTIAL 10 +#define MAX_PARTIAL 20 #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) @@ -204,8 +201,6 @@ static inline void ClearSlabDebug(struct page *page) /* Internal SLUB flags */ #define __OBJECT_POISON 0x80000000 /* Poison object */ #define __SYSFS_ADD_DEFERRED 0x40000000 /* Not yet visible via sysfs */ -#define __KMALLOC_CACHE 0x20000000 /* objects freed using kfree */ -#define __PAGE_ALLOC_FALLBACK 0x10000000 /* Allow fallback to page alloc */ /* Not all arches define cache_line_size */ #ifndef cache_line_size @@ -229,6 +224,9 @@ static enum { static DECLARE_RWSEM(slub_lock); static LIST_HEAD(slab_caches); +/* Maximum objects in defragmentable slabs */ +static unsigned int max_defrag_slab_objects __read_mostly; + /* * Tracking user of a slab. 
*/ @@ -301,7 +299,7 @@ static inline int check_valid_pointer(struct kmem_cache *s, return 1; base = page_address(page); - if (object < base || object >= base + s->objects * s->size || + if (object < base || object >= base + page->objects * s->size || (object - base) % s->size) { return 0; } @@ -327,8 +325,8 @@ static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) } /* Loop over all objects in a slab */ -#define for_each_object(__p, __s, __addr) \ - for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\ +#define for_each_object(__p, __s, __addr, __objects) \ + for (__p = (__addr); __p < (__addr) + (__objects) * (__s)->size;\ __p += (__s)->size) /* Scan freelist */ @@ -341,6 +339,26 @@ static inline int slab_index(void *p, struct kmem_cache *s, void *addr) return (p - addr) / s->size; } +static inline struct kmem_cache_order_objects oo_make(int order, + unsigned long size) +{ + struct kmem_cache_order_objects x = { + (order << 16) + (PAGE_SIZE << order) / size + }; + + return x; +} + +static inline int oo_order(struct kmem_cache_order_objects x) +{ + return x.x >> 16; +} + +static inline int oo_objects(struct kmem_cache_order_objects x) +{ + return x.x & ((1 << 16) - 1); +} + #ifdef CONFIG_SLUB_DEBUG /* * Debug settings: @@ -451,8 +469,8 @@ static void print_tracking(struct kmem_cache *s, void *object) static void print_page_info(struct page *page) { - printk(KERN_ERR "INFO: Slab 0x%p used=%u fp=0x%p flags=0x%04lx\n", - page, page->inuse, page->freelist, page->flags); + printk(KERN_ERR "INFO: Slab 0x%p objects=%u used=%u fp=0x%p flags=0x%04lx\n", + page, page->objects, page->inuse, page->freelist, page->flags); } @@ -652,6 +670,7 @@ static int check_pad_bytes(struct kmem_cache *s, struct page *page, u8 *p) p + off, POISON_INUSE, s->size - off); } +/* Check the pad bytes at the end of a slab page */ static int slab_pad_check(struct kmem_cache *s, struct page *page) { u8 *start; @@ -664,20 +683,20 @@ static int slab_pad_check(struct kmem_cache *s, struct page *page) return 1; start = page_address(page); - end = start + (PAGE_SIZE << s->order); - length = s->objects * s->size; - remainder = end - (start + length); + length = (PAGE_SIZE << compound_order(page)); + end = start + length; + remainder = length % s->size; if (!remainder) return 1; - fault = check_bytes(start + length, POISON_INUSE, remainder); + fault = check_bytes(end - remainder, POISON_INUSE, remainder); if (!fault) return 1; while (end > fault && end[-1] == POISON_INUSE) end--; slab_err(s, page, "Padding overwritten. 
0x%p-0x%p", fault, end - 1); - print_section("Padding", start, length); + print_section("Padding", end - remainder, remainder); restore_bytes(s, "slab padding", POISON_INUSE, start, end); return 0; @@ -739,15 +758,24 @@ static int check_object(struct kmem_cache *s, struct page *page, static int check_slab(struct kmem_cache *s, struct page *page) { + int maxobj; + VM_BUG_ON(!irqs_disabled()); if (!PageSlab(page)) { slab_err(s, page, "Not a valid slab page"); return 0; } - if (page->inuse > s->objects) { + + maxobj = (PAGE_SIZE << compound_order(page)) / s->size; + if (page->objects > maxobj) { + slab_err(s, page, "objects %u > max %u", + s->name, page->objects, maxobj); + return 0; + } + if (page->inuse > page->objects) { slab_err(s, page, "inuse %u > max %u", - s->name, page->inuse, s->objects); + s->name, page->inuse, page->objects); return 0; } /* Slab_pad_check fixes things up after itself */ @@ -764,8 +792,9 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search) int nr = 0; void *fp = page->freelist; void *object = NULL; + unsigned long max_objects; - while (fp && nr <= s->objects) { + while (fp && nr <= page->objects) { if (fp == search) return 1; if (!check_valid_pointer(s, page, fp)) { @@ -777,7 +806,7 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search) } else { slab_err(s, page, "Freepointer corrupt"); page->freelist = NULL; - page->inuse = s->objects; + page->inuse = page->objects; slab_fix(s, "Freelist cleared"); return 0; } @@ -788,16 +817,27 @@ static int on_freelist(struct kmem_cache *s, struct page *page, void *search) nr++; } - if (page->inuse != s->objects - nr) { + max_objects = (PAGE_SIZE << compound_order(page)) / s->size; + if (max_objects > 65535) + max_objects = 65535; + + if (page->objects != max_objects) { + slab_err(s, page, "Wrong number of objects. Found %d but " + "should be %d", page->objects, max_objects); + page->objects = max_objects; + slab_fix(s, "Number of objects adjusted."); + } + if (page->inuse != page->objects - nr) { slab_err(s, page, "Wrong object count. Counter is %d but " - "counted were %d", page->inuse, s->objects - nr); - page->inuse = s->objects - nr; + "counted were %d", page->inuse, page->objects - nr); + page->inuse = page->objects - nr; slab_fix(s, "Object count adjusted."); } return search == NULL; } -static void trace(struct kmem_cache *s, struct page *page, void *object, int alloc) +static void +trace(struct kmem_cache *s, struct page *page, void *object, int alloc) { if (s->flags & SLAB_TRACE) { printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n", @@ -837,6 +877,38 @@ static void remove_full(struct kmem_cache *s, struct page *page) spin_unlock(&n->list_lock); } +/* Tracking of the number of slabs for debugging purposes */ +static inline unsigned long slabs_node(struct kmem_cache *s, int node) +{ + struct kmem_cache_node *n = get_node(s, node); + + return atomic_long_read(&n->nr_slabs); +} + +static inline void inc_slabs_node(struct kmem_cache *s, int node, int objects) +{ + struct kmem_cache_node *n = get_node(s, node); + + /* + * May be called early in order to allocate a slab for the + * kmem_cache_node structure. Solve the chicken-egg + * dilemma by deferring the increment of the count during + * bootstrap (see early_kmem_cache_node_alloc). 
+ */ + if (!NUMA_BUILD || n) { + atomic_long_inc(&n->nr_slabs); + atomic_long_add(objects, &n->total_objects); + } +} +static inline void dec_slabs_node(struct kmem_cache *s, int node, int objects) +{ + struct kmem_cache_node *n = get_node(s, node); + + atomic_long_dec(&n->nr_slabs); + atomic_long_sub(objects, &n->total_objects); +} + +/* Object debug checks for alloc/free paths */ static void setup_object_debug(struct kmem_cache *s, struct page *page, void *object) { @@ -881,7 +953,7 @@ bad: * as used avoids touching the remaining objects. */ slab_fix(s, "Marking all objects used"); - page->inuse = s->objects; + page->inuse = page->objects; page->freelist = NULL; } return 0; @@ -1028,29 +1100,55 @@ static inline unsigned long kmem_cache_flags(unsigned long objsize, return flags; } #define slub_debug 0 + +static inline unsigned long slabs_node(struct kmem_cache *s, int node) + { return 0; } +static inline void inc_slabs_node(struct kmem_cache *s, int node, + int objects) {} +static inline void dec_slabs_node(struct kmem_cache *s, int node, + int objects) {} #endif + /* * Slab allocation and freeing */ +static inline struct page *alloc_slab_page(gfp_t flags, int node, + struct kmem_cache_order_objects oo) +{ + int order = oo_order(oo); + + if (node == -1) + return alloc_pages(flags, order); + else + return alloc_pages_node(node, flags, order); +} + static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) { struct page *page; - int pages = 1 << s->order; + struct kmem_cache_order_objects oo = s->oo; flags |= s->allocflags; - if (node == -1) - page = alloc_pages(flags, s->order); - else - page = alloc_pages_node(node, flags, s->order); - - if (!page) - return NULL; + page = alloc_slab_page(flags | __GFP_NOWARN | __GFP_NORETRY, node, + oo); + if (unlikely(!page)) { + oo = s->min; + /* + * Allocation may have failed due to fragmentation. + * Try a lower order alloc if possible + */ + page = alloc_slab_page(flags, node, oo); + if (!page) + return NULL; + stat(get_cpu_slab(s, raw_smp_processor_id()), ORDER_FALLBACK); + } + page->objects = oo_objects(oo); mod_zone_page_state(page_zone(page), (s->flags & SLAB_RECLAIM_ACCOUNT) ? 
NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, - pages); + 1 << oo_order(oo)); return page; } @@ -1066,7 +1164,6 @@ static void setup_object(struct kmem_cache *s, struct page *page, static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) { struct page *page; - struct kmem_cache_node *n; void *start; void *last; void *p; @@ -1078,22 +1175,23 @@ static struct page *new_slab(struct kmem_cache *s, gfp_t flags, int node) if (!page) goto out; - n = get_node(s, page_to_nid(page)); - if (n) - atomic_long_inc(&n->nr_slabs); + inc_slabs_node(s, page_to_nid(page), page->objects); page->slab = s; page->flags |= 1 << PG_slab; if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | SLAB_TRACE)) SetSlabDebug(page); + if (s->kick) + SetSlabKickable(page); + start = page_address(page); if (unlikely(s->flags & SLAB_POISON)) - memset(start, POISON_INUSE, PAGE_SIZE << s->order); + memset(start, POISON_INUSE, PAGE_SIZE << compound_order(page)); last = start; - for_each_object(p, s, start) { + for_each_object(p, s, start, page->objects) { setup_object(s, page, last); set_freepointer(s, last, p); last = p; @@ -1109,13 +1207,15 @@ out: static void __free_slab(struct kmem_cache *s, struct page *page) { - int pages = 1 << s->order; + int order = compound_order(page); + int pages = 1 << order; if (unlikely(SlabDebug(page))) { void *p; slab_pad_check(s, page); - for_each_object(p, s, page_address(page)) + for_each_object(p, s, page_address(page), + page->objects) check_object(s, page, p, 0); ClearSlabDebug(page); } @@ -1125,7 +1225,10 @@ static void __free_slab(struct kmem_cache *s, struct page *page) NR_SLAB_RECLAIMABLE : NR_SLAB_UNRECLAIMABLE, -pages); - __free_pages(page, s->order); + ClearSlabKickable(page); + __ClearPageSlab(page); + reset_page_mapcount(page); + __free_pages(page, order); } static void rcu_free_slab(struct rcu_head *h) @@ -1151,11 +1254,7 @@ static void free_slab(struct kmem_cache *s, struct page *page) static void discard_slab(struct kmem_cache *s, struct page *page) { - struct kmem_cache_node *n = get_node(s, page_to_nid(page)); - - atomic_long_dec(&n->nr_slabs); - reset_page_mapcount(page); - __ClearPageSlab(page); + dec_slabs_node(s, page_to_nid(page), page->objects); free_slab(s, page); } @@ -1211,7 +1310,8 @@ static void remove_partial(struct kmem_cache *s, * * Must hold list_lock. */ -static inline int lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page) +static inline int +lock_and_freeze_slab(struct kmem_cache_node *n, struct page *page) { if (slab_trylock(page)) { list_del(&page->lru); @@ -1335,6 +1435,8 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) stat(c, DEACTIVATE_FULL); if (SlabDebug(page) && (s->flags & SLAB_STORE_USER)) add_full(n, page); + if (s->kick) + SetSlabKickable(page); } slab_unlock(page); } else { @@ -1347,8 +1449,8 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail) * so that the others get filled first. That way the * size of the partial list stays small. * - * kmem_cache_shrink can reclaim any empty slabs from the - * partial list. + * kmem_cache_shrink can reclaim any empty slabs from + * the partial list. 
*/ add_partial(n, page, 1); slab_unlock(page); @@ -1470,9 +1572,6 @@ static void *__slab_alloc(struct kmem_cache *s, void **object; struct page *new; - /* We handle __GFP_ZERO in the caller */ - gfpflags &= ~__GFP_ZERO; - if (!c->page) goto new_slab; @@ -1490,7 +1589,7 @@ load_freelist: goto debug; c->freelist = object[c->offset]; - c->page->inuse = s->objects; + c->page->inuse = c->page->objects; c->page->freelist = NULL; c->node = page_to_nid(c->page); unlock_out: @@ -1527,27 +1626,6 @@ new_slab: c->page = new; goto load_freelist; } - - /* - * No memory available. - * - * If the slab uses higher order allocs but the object is - * smaller than a page size then we can fallback in emergencies - * to the page allocator via kmalloc_large. The page allocator may - * have failed to obtain a higher order page and we can try to - * allocate a single page if the object fits into a single page. - * That is only possible if certain conditions are met that are being - * checked when a slab is created. - */ - if (!(gfpflags & __GFP_NORETRY) && - (s->flags & __PAGE_ALLOC_FALLBACK)) { - if (gfpflags & __GFP_WAIT) - local_irq_enable(); - object = kmalloc_large(s->objsize, gfpflags); - if (gfpflags & __GFP_WAIT) - local_irq_disable(); - return object; - } return NULL; debug: if (!alloc_debug_processing(s, c->page, object, addr)) @@ -1748,8 +1826,8 @@ static struct page *get_object_page(const void *x) * take the list_lock. */ static int slub_min_order; -static int slub_max_order = DEFAULT_MAX_ORDER; -static int slub_min_objects = DEFAULT_MIN_OBJECTS; +static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER; +static int slub_min_objects __read_mostly; /* * Merge control. If this is set then no merging of slab caches will occur. @@ -1764,7 +1842,7 @@ static int slub_nomerge; * system components. Generally order 0 allocations should be preferred since * order 0 does not cause fragmentation in the page allocator. Larger objects * be problematic to put into order 0 slabs because there may be too much - * unused space left. We go to a higher order if more than 1/8th of the slab + * unused space left. We go to a higher order if more than 1/16th of the slab * would be wasted. * * In order to reach satisfactory performance we must ensure that a minimum @@ -1789,6 +1867,9 @@ static inline int slab_order(int size, int min_objects, int rem; int min_order = slub_min_order; + if ((PAGE_SIZE << min_order) / size > 65535) + return get_order(size * 65535) - 1; + for (order = max(min_order, fls(min_objects * size - 1) - PAGE_SHIFT); order <= max_order; order++) { @@ -1823,8 +1904,10 @@ static inline int calculate_order(int size) * we reduce the minimum objects required in a slab. 
*/ min_objects = slub_min_objects; + if (!min_objects) + min_objects = 4 * (fls(nr_cpu_ids) + 1); while (min_objects > 1) { - fraction = 8; + fraction = 16; while (fraction >= 4) { order = slab_order(size, min_objects, slub_max_order, fraction); @@ -1886,15 +1969,18 @@ static void init_kmem_cache_cpu(struct kmem_cache *s, c->node = 0; c->offset = s->offset / sizeof(void *); c->objsize = s->objsize; +#ifdef CONFIG_SLUB_STATS + memset(c->stat, 0, NR_SLUB_STAT_ITEMS * sizeof(unsigned)); +#endif } static void init_kmem_cache_node(struct kmem_cache_node *n) { n->nr_partial = 0; - atomic_long_set(&n->nr_slabs, 0); spin_lock_init(&n->list_lock); INIT_LIST_HEAD(&n->partial); #ifdef CONFIG_SLUB_DEBUG + atomic_long_set(&n->nr_slabs, 0); INIT_LIST_HEAD(&n->full); #endif } @@ -2063,7 +2149,7 @@ static struct kmem_cache_node *early_kmem_cache_node_alloc(gfp_t gfpflags, init_tracking(kmalloc_caches, n); #endif init_kmem_cache_node(n); - atomic_long_inc(&n->nr_slabs); + inc_slabs_node(kmalloc_caches, node, page->objects); /* * lockdep requires consistent irq usage for each lock @@ -2139,11 +2225,12 @@ static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags) * calculate_sizes() determines the order and the distribution of data within * a slab object. */ -static int calculate_sizes(struct kmem_cache *s) +static int calculate_sizes(struct kmem_cache *s, int forced_order) { unsigned long flags = s->flags; unsigned long size = s->objsize; unsigned long align = s->align; + int order; /* * Round up object size to the next word boundary. We can only @@ -2227,26 +2314,16 @@ static int calculate_sizes(struct kmem_cache *s) */ size = ALIGN(size, align); s->size = size; + if (forced_order >= 0) + order = forced_order; + else + order = calculate_order(size); - if ((flags & __KMALLOC_CACHE) && - PAGE_SIZE / size < slub_min_objects) { - /* - * Kmalloc cache that would not have enough objects in - * an order 0 page. Kmalloc slabs can fallback to - * page allocator order 0 allocs so take a reasonably large - * order that will allows us a good number of objects. 
- */ - s->order = max(slub_max_order, PAGE_ALLOC_COSTLY_ORDER); - s->flags |= __PAGE_ALLOC_FALLBACK; - s->allocflags |= __GFP_NOWARN; - } else - s->order = calculate_order(size); - - if (s->order < 0) + if (order < 0) return 0; s->allocflags = 0; - if (s->order) + if (order) s->allocflags |= __GFP_COMP; if (s->flags & SLAB_CACHE_DMA) @@ -2258,9 +2335,12 @@ static int calculate_sizes(struct kmem_cache *s) /* * Determine the number of objects per slab */ - s->objects = (PAGE_SIZE << s->order) / size; + s->oo = oo_make(order, size); + s->min = oo_make(get_order(size), size); + if (oo_objects(s->oo) > oo_objects(s->max)) + s->max = s->oo; - return !!s->objects; + return !!oo_objects(s->oo); } @@ -2276,10 +2356,11 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, s->align = align; s->flags = kmem_cache_flags(size, flags, name, ctor); - if (!calculate_sizes(s)) + if (!calculate_sizes(s, -1)) goto error; s->refcount = 1; + s->defrag_ratio = 30; #ifdef CONFIG_NUMA s->remote_node_defrag_ratio = 100; #endif @@ -2293,7 +2374,7 @@ error: if (flags & SLAB_PANIC) panic("Cannot create slab %s size=%lu realsize=%u " "order=%u offset=%u flags=%lx\n", - s->name, (unsigned long)size, s->size, s->order, + s->name, (unsigned long)size, s->size, oo_order(s->oo), s->offset, flags); return 0; } @@ -2376,7 +2457,7 @@ static inline int kmem_cache_close(struct kmem_cache *s) struct kmem_cache_node *n = get_node(s, node); n->nr_partial -= free_list(s, n, &n->partial); - if (atomic_long_read(&n->nr_slabs)) + if (slabs_node(s, node)) return 1; } free_kmem_cache_nodes(s); @@ -2409,10 +2490,6 @@ EXPORT_SYMBOL(kmem_cache_destroy); struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned; EXPORT_SYMBOL(kmalloc_caches); -#ifdef CONFIG_ZONE_DMA -static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1]; -#endif - static int __init setup_slub_min_order(char *str) { get_option(&str, &slub_min_order); @@ -2458,10 +2535,10 @@ static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s, down_write(&slub_lock); if (!kmem_cache_open(s, gfp_flags, name, size, ARCH_KMALLOC_MINALIGN, - flags | __KMALLOC_CACHE, NULL)) + flags, NULL)) goto panic; - list_add(&s->list, &slab_caches); + list_add_tail(&s->list, &slab_caches); up_write(&slub_lock); if (sysfs_slab_add(s)) goto panic; @@ -2472,6 +2549,7 @@ panic: } #ifdef CONFIG_ZONE_DMA +static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1]; static void sysfs_add_func(struct work_struct *w) { @@ -2688,91 +2766,268 @@ void kfree(const void *x) } EXPORT_SYMBOL(kfree); -#if defined(CONFIG_SLUB_DEBUG) || defined(CONFIG_SLABINFO) -static unsigned long count_partial(struct kmem_cache_node *n) +static inline void *alloc_scratch(void) { - unsigned long flags; - unsigned long x = 0; - struct page *page; + return kmalloc(max_defrag_slab_objects * sizeof(void *) + + BITS_TO_LONGS(max_defrag_slab_objects) * sizeof(unsigned long), + GFP_KERNEL); +} - spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, lru) - x += page->inuse; - spin_unlock_irqrestore(&n->list_lock, flags); - return x; +void kmem_cache_setup_defrag(struct kmem_cache *s, kmem_get_fn_t get, + kmem_kick_fn_t kick) +{ + int max_objects = oo_objects(s->max); + + /* + * Defragmentable slabs must have a ctor otherwise objects may be + * in an undetermined state after they are allocated. 
+ */ + BUG_ON(!s->ctor); + s->get = get; + s->kick = kick; + down_write(&slub_lock); + list_move(&s->list, &slab_caches); + if (max_objects > max_defrag_slab_objects) + max_defrag_slab_objects = max_objects; + up_write(&slub_lock); } -#endif +EXPORT_SYMBOL(kmem_cache_setup_defrag); /* - * kmem_cache_shrink removes empty slabs from the partial lists and sorts - * the remaining slabs by the number of items in use. The slabs with the - * most items in use come first. New allocations will then fill those up - * and thus they can be removed from the partial lists. + * Vacate all objects in the given slab. * - * The slabs with the least items are placed last. This results in them - * being allocated from last increasing the chance that the last objects - * are freed in them. + * The scratch aread passed to list function is sufficient to hold + * struct listhead times objects per slab. We use it to hold void ** times + * objects per slab plus a bitmap for each object. */ -int kmem_cache_shrink(struct kmem_cache *s) +static int kmem_cache_vacate(struct page *page, void *scratch) { - int node; - int i; - struct kmem_cache_node *n; - struct page *page; - struct page *t; - struct list_head *slabs_by_inuse = - kmalloc(sizeof(struct list_head) * s->objects, GFP_KERNEL); + void **vector = scratch; + void *p; + void *addr = page_address(page); + struct kmem_cache *s; + unsigned long *map; + int leftover; + int count; + void *private; unsigned long flags; + unsigned long objects; + struct kmem_cache_cpu *c; - if (!slabs_by_inuse) - return -ENOMEM; + BUG_ON(!PageSlab(page)); + local_irq_save(flags); + slab_lock(page); + BUG_ON(!SlabFrozen(page)); - flush_all(s); - for_each_node_state(node, N_NORMAL_MEMORY) { - n = get_node(s, node); + s = page->slab; + objects = page->objects; + c = get_cpu_slab(s, smp_processor_id()); + map = scratch + max_defrag_slab_objects * sizeof(void **); + if (!page->inuse || !s->kick || !SlabKickable(page)) { + stat(c, SHRINK_SLAB_SKIPPED); + goto out; + } + + /* Determine used objects */ + bitmap_fill(map, objects); + for_each_free_object(p, s, page->freelist) + __clear_bit(slab_index(p, s, addr), map); + + count = 0; + memset(vector, 0, objects * sizeof(void **)); + for_each_object(p, s, addr, objects) + if (test_bit(slab_index(p, s, addr), map)) + vector[count++] = p; + + private = s->get(s, count, vector); + + /* + * Got references. Now we can drop the slab lock. The slab + * is frozen so it cannot vanish from under us nor will + * allocations be performed on the slab. However, unlocking the + * slab will allow concurrent slab_frees to proceed. + */ + slab_unlock(page); + local_irq_restore(flags); + + /* + * Perform the KICK callbacks to remove the objects. + */ + s->kick(s, count, vector, private); + + local_irq_save(flags); + slab_lock(page); +out: + /* + * Check the result and unfreeze the slab + */ + leftover = page->inuse; + if (leftover) { + stat(c, SHRINK_OBJECT_RECLAIM_FAILED); + ClearSlabKickable(page); + } else + stat(c, SHRINK_SLAB_RECLAIMED); + + unfreeze_slab(s, page, leftover > 0); + local_irq_restore(flags); + return leftover; +} + +/* + * Remove objects from a list of slab pages that have been gathered. + * Must be called with slabs that have been isolated before. 
+ */ +int kmem_cache_reclaim(struct list_head *zaplist) +{ + int freed = 0; + void **scratch; + struct page *page; + struct page *page2; + + if (list_empty(zaplist)) + return 0; + + scratch = alloc_scratch(); + if (!scratch) + return 0; - if (!n->nr_partial) + list_for_each_entry_safe(page, page2, zaplist, lru) { + list_del(&page->lru); + if (kmem_cache_vacate(page, scratch) == 0) + freed++; + } + kfree(scratch); + return freed; +} + +/* + * Shrink the slab cache on a particular node of the cache + * by releasing slabs with zero objects and trying to reclaim + * slabs with less than a quarter of objects allocated. + */ +static unsigned long __kmem_cache_shrink(struct kmem_cache *s, int node, + unsigned long limit) +{ + unsigned long flags; + struct page *page, *page2; + LIST_HEAD(zaplist); + int freed = 0; + struct kmem_cache_node *n = get_node(s, node); + struct kmem_cache_cpu *c; + + if (n->nr_partial <= limit) + return 0; + + spin_lock_irqsave(&n->list_lock, flags); + c = get_cpu_slab(s, smp_processor_id()); + stat(c, SHRINK_CALLS); + list_for_each_entry_safe(page, page2, &n->partial, lru) { + if (!slab_trylock(page)) continue; - for (i = 0; i < s->objects; i++) - INIT_LIST_HEAD(slabs_by_inuse + i); + if (page->inuse) { + if (!SlabKickable(page)) + continue; - spin_lock_irqsave(&n->list_lock, flags); + if (page->inuse * 100 >= + s->defrag_ratio * page->objects) { + slab_unlock(page); + continue; + } - /* - * Build lists indexed by the items in use in each slab. - * - * Note that concurrent frees may occur while we hold the - * list_lock. page->inuse here is the upper limit. - */ - list_for_each_entry_safe(page, t, &n->partial, lru) { - if (!page->inuse && slab_trylock(page)) { - /* - * Must hold slab lock here because slab_free - * may have freed the last object and be - * waiting to release the slab. - */ - list_del(&page->lru); + list_move(&page->lru, &zaplist); + if (s->kick) { + stat(c, SHRINK_ATTEMPT_DEFRAG); n->nr_partial--; - slab_unlock(page); - discard_slab(s, page); - } else { - list_move(&page->lru, - slabs_by_inuse + page->inuse); + SetSlabFrozen(page); } + slab_unlock(page); + } else { + stat(c, SHRINK_EMPTY_SLAB); + list_del(&page->lru); + n->nr_partial--; + slab_unlock(page); + discard_slab(s, page); + freed++; } + } + + if (!s->kick) + /* Simply put the zaplist at the end */ + list_splice(&zaplist, n->partial.prev); + + spin_unlock_irqrestore(&n->list_lock, flags); + + if (s->kick) + freed += kmem_cache_reclaim(&zaplist); + + return freed; +} + +/* + * Defrag slabs conditional on the amount of fragmentation in a page. + */ +int kmem_cache_defrag(int node) +{ + struct kmem_cache *s; + unsigned long slabs = 0; + + /* + * kmem_cache_defrag may be called from the reclaim path which may be + * called for any page allocator alloc. So there is the danger that we + * get called in a situation where slub already acquired the slub_lock + * for other purposes. + */ + if (!down_read_trylock(&slub_lock)) + return 0; + + list_for_each_entry(s, &slab_caches, list) { + unsigned long reclaimed; + + if (time_before(jiffies, s->next_defrag)) + continue; /* - * Rebuild the partial list with the slabs filled up most - * first and the least used slabs at the end. + * Defragmentable caches come first. If the slab cache is not + * defragmentable then we can stop traversing the list. 
*/ - for (i = s->objects - 1; i >= 0; i--) - list_splice(slabs_by_inuse + i, n->partial.prev); + if (!s->kick) + break; - spin_unlock_irqrestore(&n->list_lock, flags); + if (node == -1) { + int nid; + + for_each_node_state(nid, N_NORMAL_MEMORY) + reclaimed = __kmem_cache_shrink(s, nid, + MAX_PARTIAL); + } else + reclaimed = __kmem_cache_shrink(s, node, MAX_PARTIAL); + + if (reclaimed) + s->next_defrag = jiffies + HZ / 10; + else + s->next_defrag = jiffies + HZ; + + slabs += reclaimed; } + up_read(&slub_lock); + return slabs; +} +EXPORT_SYMBOL(kmem_cache_defrag); + +/* + * kmem_cache_shrink removes empty slabs from the partial lists. + * If the slab cache supports defragmentation then objects are + * reclaimed. + */ +int kmem_cache_shrink(struct kmem_cache *s) +{ + int node; + + flush_all(s); + for_each_node_state(node, N_NORMAL_MEMORY) + __kmem_cache_shrink(s, node, 0); - kfree(slabs_by_inuse); return 0; } EXPORT_SYMBOL(kmem_cache_shrink); @@ -2816,7 +3071,7 @@ static void slab_mem_offline_callback(void *arg) * and offline_pages() function shoudn't call this * callback. So, we must fail. */ - BUG_ON(atomic_long_read(&n->nr_slabs)); + BUG_ON(slabs_node(s, offline_node)); s->node[offline_node] = NULL; kmem_cache_free(kmalloc_caches, n); @@ -2841,7 +3096,7 @@ static int slab_mem_going_online_callback(void *arg) return 0; /* - * We are bringing a node online. No memory is availabe yet. We must + * We are bringing a node online. No memory is available yet. We must * allocate a kmem_cache_node structure in order to bring the node * online. */ @@ -2987,10 +3242,7 @@ static int slab_unmergeable(struct kmem_cache *s) if (slub_nomerge || (s->flags & SLUB_NEVER_MERGE)) return 1; - if ((s->flags & __PAGE_ALLOC_FALLBACK)) - return 1; - - if (s->ctor) + if (s->ctor || s->kick || s->get) return 1; /* @@ -3080,7 +3332,7 @@ struct kmem_cache *kmem_cache_create(const char *name, size_t size, if (s) { if (kmem_cache_open(s, GFP_KERNEL, name, size, align, flags, ctor)) { - list_add(&s->list, &slab_caches); + list_add_tail(&s->list, &slab_caches); up_write(&slub_lock); if (sysfs_slab_add(s)) goto err; @@ -3181,6 +3433,37 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags, return slab_alloc(s, gfpflags, node, caller); } +#if (defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG)) || defined(CONFIG_SLABINFO) +static unsigned long count_partial(struct kmem_cache_node *n, + int (*get_count)(struct page *)) +{ + unsigned long flags; + unsigned long x = 0; + struct page *page; + + spin_lock_irqsave(&n->list_lock, flags); + list_for_each_entry(page, &n->partial, lru) + x += get_count(page); + spin_unlock_irqrestore(&n->list_lock, flags); + return x; +} + +static int count_inuse(struct page *page) +{ + return page->inuse; +} + +static int count_total(struct page *page) +{ + return page->objects; +} + +static int count_free(struct page *page) +{ + return page->objects - page->inuse; +} +#endif + #if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG) static int validate_slab(struct kmem_cache *s, struct page *page, unsigned long *map) @@ -3193,7 +3476,7 @@ static int validate_slab(struct kmem_cache *s, struct page *page, return 0; /* Now we know that a valid freelist exists */ - bitmap_zero(map, s->objects); + bitmap_zero(map, page->objects); for_each_free_object(p, s, page->freelist) { set_bit(slab_index(p, s, addr), map); @@ -3201,7 +3484,7 @@ static int validate_slab(struct kmem_cache *s, struct page *page, return 0; } - for_each_object(p, s, addr) + for_each_object(p, s, addr, page->objects) if 
(!test_bit(slab_index(p, s, addr), map)) if (!check_object(s, page, p, 1)) return 0; @@ -3267,7 +3550,7 @@ static long validate_slab_cache(struct kmem_cache *s) { int node; unsigned long count = 0; - unsigned long *map = kmalloc(BITS_TO_LONGS(s->objects) * + unsigned long *map = kmalloc(BITS_TO_LONGS(oo_objects(s->max)) * sizeof(unsigned long), GFP_KERNEL); if (!map) @@ -3470,14 +3753,14 @@ static void process_slab(struct loc_track *t, struct kmem_cache *s, struct page *page, enum track_item alloc) { void *addr = page_address(page); - DECLARE_BITMAP(map, s->objects); + DECLARE_BITMAP(map, page->objects); void *p; - bitmap_zero(map, s->objects); + bitmap_zero(map, page->objects); for_each_free_object(p, s, page->freelist) set_bit(slab_index(p, s, addr), map); - for_each_object(p, s, addr) + for_each_object(p, s, addr, page->objects) if (!test_bit(slab_index(p, s, addr), map)) add_location(t, s, get_track(s, p, alloc)); } @@ -3567,22 +3850,23 @@ static int list_locations(struct kmem_cache *s, char *buf, } enum slab_stat_type { - SL_FULL, - SL_PARTIAL, - SL_CPU, - SL_OBJECTS + SL_ALL, /* All slabs */ + SL_PARTIAL, /* Only partially allocated slabs */ + SL_CPU, /* Only slabs used for cpu caches */ + SL_OBJECTS, /* Determine allocated objects not slabs */ + SL_TOTAL /* Determine object capacity not slabs */ }; -#define SO_FULL (1 << SL_FULL) +#define SO_ALL (1 << SL_ALL) #define SO_PARTIAL (1 << SL_PARTIAL) #define SO_CPU (1 << SL_CPU) #define SO_OBJECTS (1 << SL_OBJECTS) +#define SO_TOTAL (1 << SL_TOTAL) static ssize_t show_slab_objects(struct kmem_cache *s, char *buf, unsigned long flags) { unsigned long total = 0; - int cpu; int node; int x; unsigned long *nodes; @@ -3593,56 +3877,60 @@ static ssize_t show_slab_objects(struct kmem_cache *s, return -ENOMEM; per_cpu = nodes + nr_node_ids; - for_each_possible_cpu(cpu) { - struct page *page; - struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); + if (flags & SO_CPU) { + int cpu; - if (!c) - continue; + for_each_possible_cpu(cpu) { + struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); - page = c->page; - node = c->node; - if (node < 0) - continue; - if (page) { - if (flags & SO_CPU) { - if (flags & SO_OBJECTS) - x = page->inuse; + if (!c || c->node < 0) + continue; + + if (c->page) { + if (flags & SO_TOTAL) + x = c->page->objects; + else if (flags & SO_OBJECTS) + x = c->page->inuse; else x = 1; + total += x; - nodes[node] += x; + nodes[c->node] += x; } - per_cpu[node]++; + per_cpu[c->node]++; } } - for_each_node_state(node, N_NORMAL_MEMORY) { - struct kmem_cache_node *n = get_node(s, node); + if (flags & SO_ALL) { + for_each_node_state(node, N_NORMAL_MEMORY) { + struct kmem_cache_node *n = get_node(s, node); + + if (flags & SO_TOTAL) + x = atomic_long_read(&n->total_objects); + else if (flags & SO_OBJECTS) + x = atomic_long_read(&n->total_objects) - + count_partial(n, count_free); - if (flags & SO_PARTIAL) { - if (flags & SO_OBJECTS) - x = count_partial(n); else - x = n->nr_partial; + x = atomic_long_read(&n->nr_slabs); total += x; nodes[node] += x; } - if (flags & SO_FULL) { - int full_slabs = atomic_long_read(&n->nr_slabs) - - per_cpu[node] - - n->nr_partial; + } else if (flags & SO_PARTIAL) { + for_each_node_state(node, N_NORMAL_MEMORY) { + struct kmem_cache_node *n = get_node(s, node); - if (flags & SO_OBJECTS) - x = full_slabs * s->objects; + if (flags & SO_TOTAL) + x = count_partial(n, count_total); + else if (flags & SO_OBJECTS) + x = count_partial(n, count_inuse); else - x = full_slabs; + x = n->nr_partial; total += x; nodes[node] += x; } } - x 
= sprintf(buf, "%lu", total); #ifdef CONFIG_NUMA for_each_node_state(node, N_NORMAL_MEMORY) @@ -3657,14 +3945,6 @@ static ssize_t show_slab_objects(struct kmem_cache *s, static int any_slab_objects(struct kmem_cache *s) { int node; - int cpu; - - for_each_possible_cpu(cpu) { - struct kmem_cache_cpu *c = get_cpu_slab(s, cpu); - - if (c && c->page) - return 1; - } for_each_online_node(node) { struct kmem_cache_node *n = get_node(s, node); @@ -3672,7 +3952,7 @@ static int any_slab_objects(struct kmem_cache *s) if (!n) continue; - if (n->nr_partial || atomic_long_read(&n->nr_slabs)) + if (atomic_read(&n->total_objects)) return 1; } return 0; @@ -3714,26 +3994,59 @@ SLAB_ATTR_RO(object_size); static ssize_t objs_per_slab_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", s->objects); + return sprintf(buf, "%d\n", oo_objects(s->oo)); } SLAB_ATTR_RO(objs_per_slab); +static ssize_t order_store(struct kmem_cache *s, + const char *buf, size_t length) +{ + unsigned long order; + int err; + + err = strict_strtoul(buf, 10, &order); + if (err) + return err; + + if (order > slub_max_order || order < slub_min_order) + return -EINVAL; + + calculate_sizes(s, order); + return length; +} + static ssize_t order_show(struct kmem_cache *s, char *buf) { - return sprintf(buf, "%d\n", s->order); + return sprintf(buf, "%d\n", oo_order(s->oo)); } -SLAB_ATTR_RO(order); +SLAB_ATTR(order); -static ssize_t ctor_show(struct kmem_cache *s, char *buf) +static ssize_t ops_show(struct kmem_cache *s, char *buf) { + int x = 0; + if (s->ctor) { - int n = sprint_symbol(buf, (unsigned long)s->ctor); + x += sprintf(buf + x, "ctor : "); + x += sprint_symbol(buf + x, (unsigned long)s->ctor); + x += sprintf(buf + x, "\n"); + } - return n + sprintf(buf + n, "\n"); + if (s->get) { + x += sprintf(buf + x, "get : "); + x += sprint_symbol(buf + x, + (unsigned long)s->get); + x += sprintf(buf + x, "\n"); } - return 0; + + if (s->kick) { + x += sprintf(buf + x, "kick : "); + x += sprint_symbol(buf + x, + (unsigned long)s->kick); + x += sprintf(buf + x, "\n"); + } + return x; } -SLAB_ATTR_RO(ctor); +SLAB_ATTR_RO(ops); static ssize_t aliases_show(struct kmem_cache *s, char *buf) { @@ -3743,7 +4056,7 @@ SLAB_ATTR_RO(aliases); static ssize_t slabs_show(struct kmem_cache *s, char *buf) { - return show_slab_objects(s, buf, SO_FULL|SO_PARTIAL|SO_CPU); + return show_slab_objects(s, buf, SO_ALL); } SLAB_ATTR_RO(slabs); @@ -3761,10 +4074,22 @@ SLAB_ATTR_RO(cpu_slabs); static ssize_t objects_show(struct kmem_cache *s, char *buf) { - return show_slab_objects(s, buf, SO_FULL|SO_PARTIAL|SO_CPU|SO_OBJECTS); + return show_slab_objects(s, buf, SO_ALL|SO_OBJECTS); } SLAB_ATTR_RO(objects); +static ssize_t objects_partial_show(struct kmem_cache *s, char *buf) +{ + return show_slab_objects(s, buf, SO_PARTIAL|SO_OBJECTS); +} +SLAB_ATTR_RO(objects_partial); + +static ssize_t total_objects_show(struct kmem_cache *s, char *buf) +{ + return show_slab_objects(s, buf, SO_ALL|SO_TOTAL); +} +SLAB_ATTR_RO(total_objects); + static ssize_t sanity_checks_show(struct kmem_cache *s, char *buf) { return sprintf(buf, "%d\n", !!(s->flags & SLAB_DEBUG_FREE)); @@ -3844,7 +4169,7 @@ static ssize_t red_zone_store(struct kmem_cache *s, s->flags &= ~SLAB_RED_ZONE; if (buf[0] == '1') s->flags |= SLAB_RED_ZONE; - calculate_sizes(s); + calculate_sizes(s, -1); return length; } SLAB_ATTR(red_zone); @@ -3863,7 +4188,7 @@ static ssize_t poison_store(struct kmem_cache *s, s->flags &= ~SLAB_POISON; if (buf[0] == '1') s->flags |= SLAB_POISON; - calculate_sizes(s); + 
calculate_sizes(s, -1); return length; } SLAB_ATTR(poison); @@ -3882,7 +4207,7 @@ static ssize_t store_user_store(struct kmem_cache *s, s->flags &= ~SLAB_STORE_USER; if (buf[0] == '1') s->flags |= SLAB_STORE_USER; - calculate_sizes(s); + calculate_sizes(s, -1); return length; } SLAB_ATTR(store_user); @@ -3941,6 +4266,27 @@ static ssize_t free_calls_show(struct kmem_cache *s, char *buf) } SLAB_ATTR_RO(free_calls); +static ssize_t defrag_ratio_show(struct kmem_cache *s, char *buf) +{ + return sprintf(buf, "%d\n", s->defrag_ratio); +} + +static ssize_t defrag_ratio_store(struct kmem_cache *s, + const char *buf, size_t length) +{ + unsigned long ratio; + int err; + + err = strict_strtoul(buf, 10, &ratio); + if (err) + return err; + + if (ratio < 100) + s->defrag_ratio = ratio; + return length; +} +SLAB_ATTR(defrag_ratio); + #ifdef CONFIG_NUMA static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf) { @@ -3950,10 +4296,16 @@ static ssize_t remote_node_defrag_ratio_show(struct kmem_cache *s, char *buf) static ssize_t remote_node_defrag_ratio_store(struct kmem_cache *s, const char *buf, size_t length) { - int n = simple_strtoul(buf, NULL, 10); + unsigned long ratio; + int err; + + err = strict_strtoul(buf, 10, &ratio); + if (err) + return err; + + if (ratio < 100) + s->remote_node_defrag_ratio = ratio * 10; - if (n < 100) - s->remote_node_defrag_ratio = n * 10; return length; } SLAB_ATTR(remote_node_defrag_ratio); @@ -3979,10 +4331,12 @@ static int show_stat(struct kmem_cache *s, char *buf, enum stat_item si) len = sprintf(buf, "%lu", sum); +#ifdef CONFIG_SMP for_each_online_cpu(cpu) { if (data[cpu] && len < PAGE_SIZE - 20) - len += sprintf(buf + len, " c%d=%u", cpu, data[cpu]); + len += sprintf(buf + len, " C%d=%u", cpu, data[cpu]); } +#endif kfree(data); return len + sprintf(buf + len, "\n"); } @@ -4011,7 +4365,13 @@ STAT_ATTR(DEACTIVATE_EMPTY, deactivate_empty); STAT_ATTR(DEACTIVATE_TO_HEAD, deactivate_to_head); STAT_ATTR(DEACTIVATE_TO_TAIL, deactivate_to_tail); STAT_ATTR(DEACTIVATE_REMOTE_FREES, deactivate_remote_frees); - +STAT_ATTR(ORDER_FALLBACK, order_fallback); +STAT_ATTR(SHRINK_CALLS, shrink_calls); +STAT_ATTR(SHRINK_ATTEMPT_DEFRAG, shrink_attempt_defrag); +STAT_ATTR(SHRINK_EMPTY_SLAB, shrink_empty_slab); +STAT_ATTR(SHRINK_SLAB_SKIPPED, shrink_slab_skipped); +STAT_ATTR(SHRINK_SLAB_RECLAIMED, shrink_slab_reclaimed); +STAT_ATTR(SHRINK_OBJECT_RECLAIM_FAILED, shrink_object_reclaim_failed); #endif static struct attribute *slab_attrs[] = { @@ -4020,10 +4380,12 @@ static struct attribute *slab_attrs[] = { &objs_per_slab_attr.attr, &order_attr.attr, &objects_attr.attr, + &objects_partial_attr.attr, + &total_objects_attr.attr, &slabs_attr.attr, &partial_attr.attr, &cpu_slabs_attr.attr, - &ctor_attr.attr, + &ops_attr.attr, &aliases_attr.attr, &align_attr.attr, &sanity_checks_attr.attr, @@ -4038,6 +4400,7 @@ static struct attribute *slab_attrs[] = { &shrink_attr.attr, &alloc_calls_attr.attr, &free_calls_attr.attr, + &defrag_ratio_attr.attr, #ifdef CONFIG_ZONE_DMA &cache_dma_attr.attr, #endif @@ -4062,6 +4425,13 @@ static struct attribute *slab_attrs[] = { &deactivate_to_head_attr.attr, &deactivate_to_tail_attr.attr, &deactivate_remote_frees_attr.attr, + &order_fallback_attr.attr, + &shrink_calls_attr.attr, + &shrink_attempt_defrag_attr.attr, + &shrink_empty_slab_attr.attr, + &shrink_slab_skipped_attr.attr, + &shrink_slab_reclaimed_attr.attr, + &shrink_object_reclaim_failed_attr.attr, #endif NULL }; @@ -4305,8 +4675,8 @@ __initcall(slab_sysfs_init); */ #ifdef CONFIG_SLABINFO 
-ssize_t slabinfo_write(struct file *file, const char __user * buffer, - size_t count, loff_t *ppos) +ssize_t slabinfo_write(struct file *file, const char __user *buffer, + size_t count, loff_t *ppos) { return -EINVAL; } @@ -4348,7 +4718,8 @@ static int s_show(struct seq_file *m, void *p) unsigned long nr_partials = 0; unsigned long nr_slabs = 0; unsigned long nr_inuse = 0; - unsigned long nr_objs; + unsigned long nr_objs = 0; + unsigned long nr_free = 0; struct kmem_cache *s; int node; @@ -4362,14 +4733,15 @@ static int s_show(struct seq_file *m, void *p) nr_partials += n->nr_partial; nr_slabs += atomic_long_read(&n->nr_slabs); - nr_inuse += count_partial(n); + nr_objs += atomic_long_read(&n->total_objects); + nr_free += count_partial(n, count_free); } - nr_objs = nr_slabs * s->objects; - nr_inuse += (nr_slabs - nr_partials) * s->objects; + nr_inuse = nr_objs - nr_free; seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse, - nr_objs, s->size, s->objects, (1 << s->order)); + nr_objs, s->size, oo_objects(s->oo), + (1 << oo_order(s->oo))); seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0); seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs, 0UL); diff --git a/mm/vmscan.c b/mm/vmscan.c index 4046434..032ff11 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -166,10 +166,18 @@ EXPORT_SYMBOL(unregister_shrinker); * are eligible for the caller's allocation attempt. It is used for balancing * slab reclaim versus page reclaim. * + * zone is the zone for which we are shrinking the slabs. If the intent + * is to do a global shrink then zone may be NULL. Specification of a + * zone is currently only used to limit slab defragmentation to a NUMA node. + * The performace of shrink_slab would be better (in particular under NUMA) + * if it could be targeted as a whole to the zone that is under memory + * pressure but the VFS infrastructure does not allow that at the present + * time. + * * Returns the number of slab objects which we shrunk. */ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, - unsigned long lru_pages) + unsigned long lru_pages, struct zone *zone) { struct shrinker *shrinker; unsigned long ret = 0; @@ -226,6 +234,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask, shrinker->nr += total_scan; } up_read(&shrinker_rwsem); + if (ret && (gfp_mask & __GFP_FS)) + kmem_cache_defrag(zone ? 
zone_to_nid(zone) : -1); return ret; } @@ -1339,7 +1349,7 @@ static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask, * over limit cgroups */ if (scan_global_lru(sc)) { - shrink_slab(sc->nr_scanned, gfp_mask, lru_pages); + shrink_slab(sc->nr_scanned, gfp_mask, lru_pages, NULL); if (reclaim_state) { nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; @@ -1566,7 +1576,7 @@ loop_again: nr_reclaimed += shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, - lru_pages); + lru_pages, zone); nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; if (zone_is_all_unreclaimable(zone)) @@ -1806,7 +1816,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages) /* If slab caches are huge, it's better to hit them first */ while (nr_slab >= lru_pages) { reclaim_state.reclaimed_slab = 0; - shrink_slab(nr_pages, sc.gfp_mask, lru_pages); + shrink_slab(nr_pages, sc.gfp_mask, lru_pages, NULL); if (!reclaim_state.reclaimed_slab) break; @@ -1844,7 +1854,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages) reclaim_state.reclaimed_slab = 0; shrink_slab(sc.nr_scanned, sc.gfp_mask, - count_lru_pages()); + count_lru_pages(), NULL); ret += reclaim_state.reclaimed_slab; if (ret >= nr_pages) goto out; @@ -1861,7 +1871,8 @@ unsigned long shrink_all_memory(unsigned long nr_pages) if (!ret) { do { reclaim_state.reclaimed_slab = 0; - shrink_slab(nr_pages, sc.gfp_mask, count_lru_pages()); + shrink_slab(nr_pages, sc.gfp_mask, + count_lru_pages(), NULL); ret += reclaim_state.reclaimed_slab; } while (ret < nr_pages && reclaim_state.reclaimed_slab > 0); } @@ -2024,7 +2035,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * Note that shrink_slab will free memory on all zones and may * take a long time. */ - while (shrink_slab(sc.nr_scanned, gfp_mask, order) && + while (shrink_slab(sc.nr_scanned, gfp_mask, order, + zone) && zone_page_state(zone, NR_SLAB_RECLAIMABLE) > slab_reclaimable - nr_pages) ;
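
To illustrate the new interface from a cache user's point of view, here is a minimal sketch of how a slab cache could register the defragmentation callbacks introduced above. It is not part of the patch: the "foo_cache" cache, struct foo, its refcount convention and the eviction placeholder are hypothetical. The callback prototypes are inferred from the call sites in kmem_cache_vacate() (private = s->get(s, count, vector); s->kick(s, count, vector, private)) and from the kmem_get_fn_t/kmem_kick_fn_t parameters of kmem_cache_setup_defrag(); a constructor is mandatory because kmem_cache_setup_defrag() does BUG_ON(!s->ctor). The ctor prototype follows the kmem_cache_create() signature of the kernel this series is based against.

#include <linux/init.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <asm/atomic.h>

struct foo {
	atomic_t refcount;	/* raised to 1 by the (not shown) allocation path */
	struct list_head lru;
};

static struct kmem_cache *foo_cache;

/* A ctor is required so that objects are in a defined state once allocated. */
static void foo_ctor(struct kmem_cache *s, void *object)
{
	struct foo *f = object;

	atomic_set(&f->refcount, 0);
	INIT_LIST_HEAD(&f->lru);
}

static void foo_put(struct foo *f)
{
	if (atomic_dec_and_test(&f->refcount))
		kmem_cache_free(foo_cache, f);
}

/*
 * get(): runs with interrupts off and the slab lock held, so only take
 * references here.  Entries whose free is already in flight are voided.
 * The return value is handed unchanged to kick() as @private.
 */
static void *foo_get(struct kmem_cache *s, int nr, void **objects)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = objects[i];

		if (!atomic_inc_not_zero(&f->refcount))
			objects[i] = NULL;	/* concurrent free, skip */
	}
	return NULL;	/* this cache needs no extra state */
}

/*
 * kick(): all locks have been dropped again.  Remove each referenced
 * object from the slab, here by evicting it and dropping the reference
 * taken in foo_get(), which frees it via kmem_cache_free().
 */
static void foo_kick(struct kmem_cache *s, int nr, void **objects,
							void *private)
{
	int i;

	for (i = 0; i < nr; i++) {
		struct foo *f = objects[i];

		if (!f)
			continue;
		/*
		 * A real user would unhash f from its lookup structures
		 * here and drop the allocation-time reference (not shown).
		 */
		foo_put(f);	/* drop the reference taken in foo_get() */
	}
}

static int __init foo_cache_init(void)
{
	foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
					SLAB_RECLAIM_ACCOUNT, foo_ctor);
	if (!foo_cache)
		return -ENOMEM;

	kmem_cache_setup_defrag(foo_cache, foo_get, foo_kick);
	return 0;
}

Once registered, the cache is moved into the defragmentable part of slab_caches, so kmem_cache_defrag() will consider it; how aggressively its partial slabs are vacated can then be tuned per cache through the new defrag_ratio sysfs attribute (30 by default, as set in kmem_cache_open()).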