From clameter@sgi.com Mon May 7 14:14:47 2007 Message-Id: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:34 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 00/17] SLUB fixes and enhancements against 2.6.21-mm1 A series of SLUB fixes and enhancements. Some of the patches may already be in mm. The patches can also be found at http://ftp.kernel.org/pub/linux/kernel/people/christoph/slub-patches -- From clameter@sgi.com Mon May 7 14:14:48 2007 Message-Id: <20070507211447.888239993@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:35 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 01/17] SLUB: Add support for dynamic cacheline size determination Content-Disposition: inline; filename=cacheline SLUB currently assumes that the cacheline size is static. However, i386 for example supports dynamic cache line size determination. Use cache_line_size() instead of L1_CACHE_BYTES in the allocator. That also explains the purpose of SLAB_HWCACHE_ALIGN: we need to keep that flag around to allow dynamic alignment of objects depending on the cache line size determined at boot. Signed-off-by: Christoph Lameter --- mm/slub.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 14:00:21.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 14:00:23.000000000 -0700 @@ -1492,8 +1492,8 @@ static unsigned long calculate_alignment * then use it. */ if ((flags & SLAB_HWCACHE_ALIGN) && - size > L1_CACHE_BYTES / 2) - return max_t(unsigned long, align, L1_CACHE_BYTES); + size > cache_line_size() / 2) + return max_t(unsigned long, align, cache_line_size()); if (align < ARCH_SLAB_MINALIGN) return ARCH_SLAB_MINALIGN; @@ -1679,8 +1679,8 @@ static int calculate_sizes(struct kmem_c size += sizeof(void *); /* * Determine the alignment based on various parameters that the - * user specified (this is unecessarily complex due to the attempt - * to be compatible with SLAB. Should be cleaned up some day). + * user specified and the dynamic determination of cache line size + * on bootup. */ align = calculate_alignment(flags, align, s->objsize); @@ -2301,7 +2301,7 @@ void __init kmem_cache_init(void) printk(KERN_INFO "SLUB: Genslabs=%d, HWalign=%d, Order=%d-%d, MinObjects=%d," " Processors=%d, Nodes=%d\n", - KMALLOC_SHIFT_HIGH, L1_CACHE_BYTES, + KMALLOC_SHIFT_HIGH, cache_line_size(), slub_min_order, slub_max_order, slub_min_objects, nr_cpu_ids, nr_node_ids); } -- From clameter@sgi.com Mon May 7 14:14:48 2007 Message-Id: <20070507211448.075023914@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:36 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 02/17] SLUB: Reduce antifrag max order Content-Disposition: inline; filename=reduce_order My test system fails to obtain order 4 allocations after prolonged use. So the antifragmentation patches are unable to guarantee order 4 blocks after a while (straight compile and edit load). Reduce the max order to 3 if antifrag measures are detected.
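As background, an order-N allocation covers 2^N contiguous pages, so with a 4 KiB page size the cap drops from 64 KiB (order 4) to 32 KiB (order 3). The minimal user-space sketch below only illustrates that arithmetic and is not part of the patch; the 4 KiB page size is an assumed example value.

#include <stdio.h>

/* Illustration only: an order-N allocation is 2^N contiguous pages. */
static unsigned long order_to_bytes(unsigned long page_size, int order)
{
	return page_size << order;
}

int main(void)
{
	unsigned long page_size = 4096;	/* assumed example page size */

	printf("order 4 = %lu bytes\n", order_to_bytes(page_size, 4)); /* 65536 */
	printf("order 3 = %lu bytes\n", order_to_bytes(page_size, 3)); /* 32768 */
	return 0;
}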
Signed-off-by: Christoph Lameter --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 14:00:23.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 14:00:27.000000000 -0700 @@ -126,7 +126,7 @@ * If antifragmentation methods are in effect then increase the * slab sizes to increase performance */ -#define DEFAULT_ANTIFRAG_MAX_ORDER 4 +#define DEFAULT_ANTIFRAG_MAX_ORDER 3 #define DEFAULT_ANTIFRAG_MIN_OBJECTS 16 /* -- From clameter@sgi.com Mon May 7 14:14:48 2007 Message-Id: <20070507211448.237055274@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:37 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 03/17] SLUB: After object padding only needed for Redzoning Content-Disposition: inline; filename=better_padding If no redzoning is selected then we do not need padding before the next object. Signed-off-by: Christoph Lameter --- mm/slub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:51:50.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:52:44.000000000 -0700 @@ -1668,7 +1668,7 @@ static int calculate_sizes(struct kmem_c */ size += 2 * sizeof(struct track); - if (flags & DEBUG_DEFAULT_FLAGS) + if (flags & SLAB_RED_ZONE) /* * Add some empty padding so that we can catch * overwrites from earlier objects rather than let -- From clameter@sgi.com Mon May 7 14:14:48 2007 Message-Id: <20070507211448.421171737@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:38 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 04/17] SLUB: slabinfo upgrade Content-Disposition: inline; filename=slabinfo_debug -e Show empty slabs -d Modification of slab debug options at runtime -o Operations. Display of ctor / dtor etc. -r Report: Display all available information about a slabcache. Cleanup tracking display and make it work right. Signed-off-by: Christoph Lameter --- Documentation/vm/slabinfo.c | 426 ++++++++++++++++++++++++++++++++++++-------- 1 file changed, 352 insertions(+), 74 deletions(-) Index: slub/Documentation/vm/slabinfo.c =================================================================== --- slub.orig/Documentation/vm/slabinfo.c 2007-05-07 13:51:40.000000000 -0700 +++ slub/Documentation/vm/slabinfo.c 2007-05-07 13:57:38.000000000 -0700 @@ -16,6 +16,7 @@ #include #include #include +#include #define MAX_SLABS 500 #define MAX_ALIASES 500 @@ -41,12 +42,15 @@ struct aliasinfo { } aliasinfo[MAX_ALIASES]; int slabs = 0; +int actual_slabs = 0; int aliases = 0; int alias_targets = 0; int highest_node = 0; char buffer[4096]; +int show_empty = 0; +int show_report = 0; int show_alias = 0; int show_slab = 0; int skip_zero = 1; @@ -59,6 +63,15 @@ int show_inverted = 0; int show_single_ref = 0; int show_totals = 0; int sort_size = 0; +int set_debug = 0; +int show_ops = 0; + +/* Debug options */ +int sanity = 0; +int redzone = 0; +int poison = 0; +int tracking = 0; +int tracing = 0; int page_size; @@ -76,20 +89,33 @@ void fatal(const char *x, ...) void usage(void) { - printf("slabinfo [-ahnpvtsz] [slab-regexp]\n" + printf("slabinfo 5/7/2007. (c) 2007 sgi. 
clameter@sgi.com\n\n" + "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n" "-a|--aliases Show aliases\n" + "-d|--debug= Set/Clear Debug options\n" + "-e|--empty Show empty slabs\n" + "-f|--first-alias Show first alias\n" "-h|--help Show usage information\n" + "-i|--inverted Inverted list\n" + "-l|--slabs Show slabs\n" "-n|--numa Show NUMA information\n" + "-o|--ops Show kmem_cache_ops\n" "-s|--shrink Shrink slabs\n" - "-v|--validate Validate slabs\n" + "-r|--report Detailed report on single slabs\n" + "-S|--Size Sort by size\n" "-t|--tracking Show alloc/free information\n" "-T|--Totals Show summary information\n" - "-l|--slabs Show slabs\n" - "-S|--Size Sort by size\n" + "-v|--validate Validate slabs\n" "-z|--zero Include empty slabs\n" - "-f|--first-alias Show first alias\n" - "-i|--inverted Inverted list\n" "-1|--1ref Single reference\n" + "\nValid debug options (FZPUT may be combined)\n" + "a / A Switch on all debug options (=FZUP)\n" + "- Switch off all debug options\n" + "f / F Sanity Checks (SLAB_DEBUG_FREE)\n" + "z / Z Redzoning\n" + "p / P Poisoning\n" + "u / U Tracking\n" + "t / T Tracing\n" ); } @@ -143,11 +169,10 @@ unsigned long get_obj_and_str(char *name void set_obj(struct slabinfo *s, char *name, int n) { char x[100]; + FILE *f; sprintf(x, "%s/%s", s->name, name); - - FILE *f = fopen(x, "w"); - + f = fopen(x, "w"); if (!f) fatal("Cannot write to %s\n", x); @@ -155,6 +180,26 @@ void set_obj(struct slabinfo *s, char *n fclose(f); } +unsigned long read_slab_obj(struct slabinfo *s, char *name) +{ + char x[100]; + FILE *f; + int l; + + sprintf(x, "%s/%s", s->name, name); + f = fopen(x, "r"); + if (!f) { + buffer[0] = 0; + l = 0; + } else { + l = fread(buffer, 1, sizeof(buffer), f); + buffer[l] = 0; + fclose(f); + } + return l; +} + + /* * Put a size string together */ @@ -226,7 +271,7 @@ int line = 0; void first_line(void) { - printf("Name Objects Objsize Space " + printf("Name Objects Objsize Space " "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n"); } @@ -246,10 +291,7 @@ struct aliasinfo *find_one_alias(struct return best; } } - if (best) - return best; - fatal("Cannot find alias for %s\n", find->name); - return NULL; + return best; } unsigned long slab_size(struct slabinfo *s) @@ -257,6 +299,126 @@ unsigned long slab_size(struct slabinfo return s->slabs * (page_size << s->order); } +void slab_numa(struct slabinfo *s, int mode) +{ + int node; + + if (strcmp(s->name, "*") == 0) + return; + + if (!highest_node) { + printf("\n%s: No NUMA information available.\n", s->name); + return; + } + + if (skip_zero && !s->slabs) + return; + + if (!line) { + printf("\n%-21s:", mode ? "NUMA nodes" : "Slab"); + for(node = 0; node <= highest_node; node++) + printf(" %4d", node); + printf("\n----------------------"); + for(node = 0; node <= highest_node; node++) + printf("-----"); + printf("\n"); + } + printf("%-21s ", mode ? 
"All slabs" : s->name); + for(node = 0; node <= highest_node; node++) { + char b[20]; + + store_size(b, s->numa[node]); + printf(" %4s", b); + } + printf("\n"); + if (mode) { + printf("%-21s ", "Partial slabs"); + for(node = 0; node <= highest_node; node++) { + char b[20]; + + store_size(b, s->numa_partial[node]); + printf(" %4s", b); + } + printf("\n"); + } + line++; +} + +void show_tracking(struct slabinfo *s) +{ + printf("\n%s: Kernel object allocation\n", s->name); + printf("-----------------------------------------------------------------------\n"); + if (read_slab_obj(s, "alloc_calls")) + printf(buffer); + else + printf("No Data\n"); + + printf("\n%s: Kernel object freeing\n", s->name); + printf("------------------------------------------------------------------------\n"); + if (read_slab_obj(s, "free_calls")) + printf(buffer); + else + printf("No Data\n"); + +} + +void ops(struct slabinfo *s) +{ + if (strcmp(s->name, "*") == 0) + return; + + if (read_slab_obj(s, "ops")) { + printf("\n%s: kmem_cache operations\n", s->name); + printf("--------------------------------------------\n"); + printf(buffer); + } else + printf("\n%s has no kmem_cache operations\n", s->name); +} + +const char *onoff(int x) +{ + if (x) + return "On "; + return "Off"; +} + +void report(struct slabinfo *s) +{ + if (strcmp(s->name, "*") == 0) + return; + printf("\nSlabcache: %-20s Aliases: %2d Order : %2d\n", s->name, s->aliases, s->order); + if (s->hwcache_align) + printf("** Hardware cacheline aligned\n"); + if (s->cache_dma) + printf("** Memory is allocated in a special DMA zone\n"); + if (s->destroy_by_rcu) + printf("** Slabs are destroyed via RCU\n"); + if (s->reclaim_account) + printf("** Reclaim accounting active\n"); + + printf("\nSizes (bytes) Slabs Debug Memory\n"); + printf("------------------------------------------------------------------------\n"); + printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n", + s->object_size, s->slabs, onoff(s->sanity_checks), + s->slabs * (page_size << s->order)); + printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n", + s->slab_size, s->slabs - s->partial - s->cpu_slabs, + onoff(s->red_zone), s->objects * s->object_size); + printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n", + page_size << s->order, s->partial, onoff(s->poison), + s->slabs * (page_size << s->order) - s->objects * s->object_size); + printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n", + s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user), + (s->slab_size - s->object_size) * s->objects); + printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n", + s->align, s->objs_per_slab, onoff(s->trace), + ((page_size << s->order) - s->objs_per_slab * s->slab_size) * + s->slabs); + + ops(s); + show_tracking(s); + slab_numa(s, 1); +} void slabcache(struct slabinfo *s) { @@ -265,7 +427,18 @@ void slabcache(struct slabinfo *s) char flags[20]; char *p = flags; - if (skip_zero && !s->slabs) + if (strcmp(s->name, "*") == 0) + return; + + if (actual_slabs == 1) { + report(s); + return; + } + + if (skip_zero && !show_empty && !s->slabs) + return; + + if (show_empty && s->slabs) return; store_size(size_str, slab_size(s)); @@ -303,48 +476,128 @@ void slabcache(struct slabinfo *s) flags); } -void slab_numa(struct slabinfo *s) +/* + * Analyze debug options. Return false if something is amiss. 
+ */ +int debug_opt_scan(char *opt) { - int node; + if (!opt || !opt[0] || strcmp(opt, "-") == 0) + return 1; - if (!highest_node) - fatal("No NUMA information available.\n"); + if (strcasecmp(opt, "a") == 0) { + sanity = 1; + poison = 1; + redzone = 1; + tracking = 1; + return 1; + } + + for ( ; *opt; opt++) + switch (*opt) { + case 'F' : case 'f': + if (sanity) + return 0; + sanity = 1; + break; + case 'P' : case 'p': + if (poison) + return 0; + poison = 1; + break; - if (skip_zero && !s->slabs) - return; + case 'Z' : case 'z': + if (redzone) + return 0; + redzone = 1; + break; - if (!line) { - printf("\nSlab Node "); - for(node = 0; node <= highest_node; node++) - printf(" %4d", node); - printf("\n----------------------"); - for(node = 0; node <= highest_node; node++) - printf("-----"); - printf("\n"); - } - printf("%-21s ", s->name); - for(node = 0; node <= highest_node; node++) { - char b[20]; + case 'U' : case 'u': + if (tracking) + return 0; + tracking = 1; + break; - store_size(b, s->numa[node]); - printf(" %4s", b); - } - printf("\n"); - line++; + case 'T' : case 't': + if (tracing) + return 0; + tracing = 1; + break; + default: + return 0; + } + return 1; } -void show_tracking(struct slabinfo *s) +int slab_empty(struct slabinfo *s) { - printf("\n%s: Calls to allocate a slab object\n", s->name); - printf("---------------------------------------------------\n"); - if (read_obj("alloc_calls")) - printf(buffer); + if (s->objects > 0) + return 0; - printf("%s: Calls to free a slab object\n", s->name); - printf("-----------------------------------------------\n"); - if (read_obj("free_calls")) - printf(buffer); + /* + * We may still have slabs even if there are no objects. Shrinking will + * remove them. + */ + if (s->slabs != 0) + set_obj(s, "shrink", 1); + return 1; +} + +void slab_debug(struct slabinfo *s) +{ + if (sanity && !s->sanity_checks) { + set_obj(s, "sanity", 1); + } + if (!sanity && s->sanity_checks) { + if (slab_empty(s)) + set_obj(s, "sanity", 0); + else + fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name); + } + if (redzone && !s->red_zone) { + if (slab_empty(s)) + set_obj(s, "red_zone", 1); + else + fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name); + } + if (!redzone && s->red_zone) { + if (slab_empty(s)) + set_obj(s, "red_zone", 0); + else + fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name); + } + if (poison && !s->poison) { + if (slab_empty(s)) + set_obj(s, "poison", 1); + else + fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name); + } + if (!poison && s->poison) { + if (slab_empty(s)) + set_obj(s, "poison", 0); + else + fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name); + } + if (tracking && !s->store_user) { + if (slab_empty(s)) + set_obj(s, "store_user", 1); + else + fprintf(stderr, "%s not empty cannot enable tracking\n", s->name); + } + if (!tracking && s->store_user) { + if (slab_empty(s)) + set_obj(s, "store_user", 0); + else + fprintf(stderr, "%s not empty cannot disable tracking\n", s->name); + } + if (tracing && !s->trace) { + if (slabs == 1) + set_obj(s, "trace", 1); + else + fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name); + } + if (!tracing && s->trace) + set_obj(s, "trace", 1); } void totals(void) @@ -673,7 +926,7 @@ void link_slabs(void) for (a = aliasinfo; a < aliasinfo + aliases; a++) { - for(s = slabinfo; s < slabinfo + slabs; s++) + for (s = slabinfo; s < slabinfo + slabs; s++) if (strcmp(a->ref, s->name) == 0) { a->slab = s; 
s->refs++; @@ -704,7 +957,7 @@ void alias(void) continue; } } - printf("\n%-20s <- %s", a->slab->name, a->name); + printf("\n%-12s <- %s", a->slab->name, a->name); active = a->slab->name; } else @@ -729,7 +982,12 @@ void rename_slabs(void) a = find_one_alias(s); - s->name = a->name; + if (a) + s->name = a->name; + else { + s->name = "*"; + actual_slabs--; + } } } @@ -748,11 +1006,14 @@ void read_slab_dir(void) char *t; int count; + if (chdir("/sys/slab")) + fatal("SYSFS support for SLUB not active\n"); + dir = opendir("."); while ((de = readdir(dir))) { if (de->d_name[0] == '.' || - slab_mismatch(de->d_name)) - continue; + (de->d_name[0] != ':' && slab_mismatch(de->d_name))) + continue; switch (de->d_type) { case DT_LNK: alias->name = strdup(de->d_name); @@ -807,6 +1068,7 @@ void read_slab_dir(void) } closedir(dir); slabs = slab - slabinfo; + actual_slabs = slabs; aliases = alias - aliasinfo; if (slabs > MAX_SLABS) fatal("Too many slabs\n"); @@ -825,34 +1087,37 @@ void output_slabs(void) if (show_numa) - slab_numa(slab); - else - if (show_track) + slab_numa(slab, 0); + else if (show_track) show_tracking(slab); - else - if (validate) + else if (validate) slab_validate(slab); - else - if (shrink) + else if (shrink) slab_shrink(slab); - else { - if (show_slab) - slabcache(slab); - } + else if (set_debug) + slab_debug(slab); + else if (show_ops) + ops(slab); + else if (show_slab) + slabcache(slab); } } struct option opts[] = { { "aliases", 0, NULL, 'a' }, - { "slabs", 0, NULL, 'l' }, - { "numa", 0, NULL, 'n' }, - { "zero", 0, NULL, 'z' }, - { "help", 0, NULL, 'h' }, - { "validate", 0, NULL, 'v' }, + { "debug", 2, NULL, 'd' }, + { "empty", 0, NULL, 'e' }, { "first-alias", 0, NULL, 'f' }, + { "help", 0, NULL, 'h' }, + { "inverted", 0, NULL, 'i'}, + { "numa", 0, NULL, 'n' }, + { "ops", 0, NULL, 'o' }, + { "report", 0, NULL, 'r' }, { "shrink", 0, NULL, 's' }, + { "slabs", 0, NULL, 'l' }, { "track", 0, NULL, 't'}, - { "inverted", 0, NULL, 'i'}, + { "validate", 0, NULL, 'v' }, + { "zero", 0, NULL, 'z' }, { "1ref", 0, NULL, '1'}, { NULL, 0, NULL, 0 } }; @@ -864,10 +1129,9 @@ int main(int argc, char *argv[]) char *pattern_source; page_size = getpagesize(); - if (chdir("/sys/slab")) - fatal("This kernel does not have SLUB support.\n"); - while ((c = getopt_long(argc, argv, "afhil1npstvzTS", opts, NULL)) != -1) + while ((c = getopt_long(argc, argv, "ad::efhil1noprstvzTS", + opts, NULL)) != -1) switch(c) { case '1': show_single_ref = 1; @@ -875,6 +1139,14 @@ int main(int argc, char *argv[]) case 'a': show_alias = 1; break; + case 'd': + set_debug = 1; + if (!debug_opt_scan(optarg)) + fatal("Invalid debug option '%s'\n", optarg); + break; + case 'e': + show_empty = 1; + break; case 'f': show_first_alias = 1; break; @@ -887,6 +1159,12 @@ int main(int argc, char *argv[]) case 'n': show_numa = 1; break; + case 'o': + show_ops = 1; + break; + case 'r': + show_report = 1; + break; case 's': shrink = 1; break; @@ -914,8 +1192,8 @@ int main(int argc, char *argv[]) } - if (!show_slab && !show_alias && !show_track - && !validate && !shrink) + if (!show_slab && !show_alias && !show_track && !show_report + && !validate && !shrink && !set_debug && !show_ops) show_slab = 1; if (argc > optind) -- From clameter@sgi.com Mon May 7 14:14:48 2007 Message-Id: <20070507211448.604910225@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:39 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 05/17] Move remote node draining 
out of slab allocators Content-Disposition: inline; filename=newdrain Currently the slab allocators contain callbacks into the page allocator to perform the draining of pagesets on remote nodes. This requires SLUB to have a whole subsystem in order to be compatible with SLAB. Moving node draining out of the slab allocators avoids a section of code in SLUB. Move the node draining so that it is done when the vm statistics are updated. At that point we are already touching all the cachelines with the pagesets of a processor. Add an expire counter there. If we have to update per zone or global vm statistics then assume that the pageset will require subsequent draining. The expire counter will be decremented on each vm stats update pass until it reaches zero. Then we will drain one batch from the pageset. The draining will cause vm counter updates which will then cause another expiration until the pcp is empty. So we will drain a batch every 3 seconds. Note that remote node draining is a somewhat esoteric feature that is required on large NUMA systems because otherwise significant portions of system memory can become trapped in pcp queues. The number of pcps is determined by the number of processors and nodes in a system. A system with 4 processors and 2 nodes has 8 pcps, which is okay. But a system with 1024 processors and 512 nodes has 512k pcps with a high potential for large amounts of memory being caught in them. Signed-off-by: Christoph Lameter --- include/linux/gfp.h | 6 --- include/linux/mmzone.h | 3 + mm/page_alloc.c | 45 ++++++++------------------ mm/slab.c | 6 --- mm/slub.c | 84 ------------------------------------------------- mm/vmstat.c | 54 ++++++++++++++++++++++++++++--- 6 files changed, 67 insertions(+), 131 deletions(-) Index: linux-2.6.21-mm1/include/linux/gfp.h =================================================================== --- linux-2.6.21-mm1.orig/include/linux/gfp.h 2007-05-05 09:58:56.000000000 -0700 +++ linux-2.6.21-mm1/include/linux/gfp.h 2007-05-05 10:10:41.000000000 -0700 @@ -197,10 +197,6 @@ extern void FASTCALL(free_cold_page(stru #define free_page(addr) free_pages((addr),0) void page_alloc_init(void); -#ifdef CONFIG_NUMA -void drain_node_pages(int node); -#else -static inline void drain_node_pages(int node) { }; -#endif +void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); #endif /* __LINUX_GFP_H */ Index: linux-2.6.21-mm1/include/linux/mmzone.h =================================================================== --- linux-2.6.21-mm1.orig/include/linux/mmzone.h 2007-05-05 09:58:56.000000000 -0700 +++ linux-2.6.21-mm1/include/linux/mmzone.h 2007-05-05 10:10:41.000000000 -0700 @@ -104,6 +104,9 @@ struct per_cpu_pages { struct per_cpu_pageset { struct per_cpu_pages pcp[2]; /* 0: hot. 1: cold */ +#ifdef CONFIG_NUMA + s8 expire; +#endif #ifdef CONFIG_SMP s8 stat_threshold; s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS]; Index: linux-2.6.21-mm1/mm/page_alloc.c =================================================================== --- linux-2.6.21-mm1.orig/mm/page_alloc.c 2007-05-05 09:58:56.000000000 -0700 +++ linux-2.6.21-mm1/mm/page_alloc.c 2007-05-05 10:10:41.000000000 -0700 @@ -933,43 +933,26 @@ static void __init setup_nr_node_ids(voi #ifdef CONFIG_NUMA /* - * Called from the slab reaper to drain pagesets on a particular node that - * belongs to the currently executing processor. + * Called from the vmstat counter updater to drain pagesets of this + * currently executing processor on remote nodes after they have + * expired.
+ * * Note that this function must be called with the thread pinned to * a single processor. */ -void drain_node_pages(int nodeid) +void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) { - int i; - enum zone_type z; unsigned long flags; + int to_drain; - for (z = 0; z < MAX_NR_ZONES; z++) { - struct zone *zone = NODE_DATA(nodeid)->node_zones + z; - struct per_cpu_pageset *pset; - - if (!populated_zone(zone)) - continue; - - pset = zone_pcp(zone, smp_processor_id()); - for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) { - struct per_cpu_pages *pcp; - - pcp = &pset->pcp[i]; - if (pcp->count) { - int to_drain; - - local_irq_save(flags); - if (pcp->count >= pcp->batch) - to_drain = pcp->batch; - else - to_drain = pcp->count; - free_pages_bulk(zone, to_drain, &pcp->list, 0); - pcp->count -= to_drain; - local_irq_restore(flags); - } - } - } + local_irq_save(flags); + if (pcp->count >= pcp->batch) + to_drain = pcp->batch; + else + to_drain = pcp->count; + free_pages_bulk(zone, to_drain, &pcp->list, 0); + pcp->count -= to_drain; + local_irq_restore(flags); } #endif Index: linux-2.6.21-mm1/mm/slab.c =================================================================== --- linux-2.6.21-mm1.orig/mm/slab.c 2007-05-05 09:58:56.000000000 -0700 +++ linux-2.6.21-mm1/mm/slab.c 2007-05-05 10:10:41.000000000 -0700 @@ -928,12 +928,6 @@ static void next_reap_node(void) { int node = __get_cpu_var(reap_node); - /* - * Also drain per cpu pages on remote zones - */ - if (node != numa_node_id()) - drain_node_pages(node); - node = next_node(node, node_online_map); if (unlikely(node >= MAX_NUMNODES)) node = first_node(node_online_map); Index: linux-2.6.21-mm1/mm/slub.c =================================================================== --- linux-2.6.21-mm1.orig/mm/slub.c 2007-05-05 09:58:56.000000000 -0700 +++ linux-2.6.21-mm1/mm/slub.c 2007-05-05 10:10:41.000000000 -0700 @@ -2462,90 +2462,6 @@ static struct notifier_block __cpuinitda #endif -#ifdef CONFIG_NUMA - -/***************************************************************** - * Generic reaper used to support the page allocator - * (the cpu slabs are reaped by a per slab workqueue). - * - * Maybe move this to the page allocator? - ****************************************************************/ - -static DEFINE_PER_CPU(unsigned long, reap_node); - -static void init_reap_node(int cpu) -{ - int node; - - node = next_node(cpu_to_node(cpu), node_online_map); - if (node == MAX_NUMNODES) - node = first_node(node_online_map); - - __get_cpu_var(reap_node) = node; -} - -static void next_reap_node(void) -{ - int node = __get_cpu_var(reap_node); - - /* - * Also drain per cpu pages on remote zones - */ - if (node != numa_node_id()) - drain_node_pages(node); - - node = next_node(node, node_online_map); - if (unlikely(node >= MAX_NUMNODES)) - node = first_node(node_online_map); - __get_cpu_var(reap_node) = node; -} -#else -#define init_reap_node(cpu) do { } while (0) -#define next_reap_node(void) do { } while (0) -#endif - -#define REAPTIMEOUT_CPUC (2*HZ) - -#ifdef CONFIG_SMP -static DEFINE_PER_CPU(struct delayed_work, reap_work); - -static void cache_reap(struct work_struct *unused) -{ - next_reap_node(); - schedule_delayed_work(&__get_cpu_var(reap_work), - REAPTIMEOUT_CPUC); -} - -static void __devinit start_cpu_timer(int cpu) -{ - struct delayed_work *reap_work = &per_cpu(reap_work, cpu); - - /* - * When this gets called from do_initcalls via cpucache_init(), - * init_workqueues() has already run, so keventd will be setup - * at that time. 
- */ - if (keventd_up() && reap_work->work.func == NULL) { - init_reap_node(cpu); - INIT_DELAYED_WORK(reap_work, cache_reap); - schedule_delayed_work_on(cpu, reap_work, HZ + 3 * cpu); - } -} - -static int __init cpucache_init(void) -{ - int cpu; - - /* - * Register the timers that drain pcp pages and update vm statistics - */ - for_each_online_cpu(cpu) - start_cpu_timer(cpu); - return 0; -} -__initcall(cpucache_init); -#endif - #ifdef SLUB_RESILIENCY_TEST static unsigned long validate_slab_cache(struct kmem_cache *s); Index: linux-2.6.21-mm1/mm/vmstat.c =================================================================== --- linux-2.6.21-mm1.orig/mm/vmstat.c 2007-05-05 09:58:56.000000000 -0700 +++ linux-2.6.21-mm1/mm/vmstat.c 2007-05-05 10:10:41.000000000 -0700 @@ -281,6 +281,17 @@ EXPORT_SYMBOL(dec_zone_page_state); /* * Update the zone counters for one cpu. + * + * Note that refresh_cpu_vm_stats strives to only access + * node local memory. The per cpu pagesets on remote zones are placed + * in the memory local to the processor using that pageset. So the + * loop over all zones will access a series of cachelines local to + * the processor. + * + * The call to zone_page_state_add updates the cachelines with the + * statistics in the remote zone struct as well as the global cachelines + * with the global counters. These could cause remote node cache line + * bouncing and will have to be only done when necessary. */ void refresh_cpu_vm_stats(int cpu) { @@ -289,21 +300,54 @@ void refresh_cpu_vm_stats(int cpu) unsigned long flags; for_each_zone(zone) { - struct per_cpu_pageset *pcp; + struct per_cpu_pageset *p; if (!populated_zone(zone)) continue; - pcp = zone_pcp(zone, cpu); + p = zone_pcp(zone, cpu); for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) - if (pcp->vm_stat_diff[i]) { + if (p->vm_stat_diff[i]) { local_irq_save(flags); - zone_page_state_add(pcp->vm_stat_diff[i], + zone_page_state_add(p->vm_stat_diff[i], zone, i); - pcp->vm_stat_diff[i] = 0; + p->vm_stat_diff[i] = 0; +#ifdef CONFIG_NUMA + /* 3 seconds idle till flush */ + p->expire = 3; +#endif local_irq_restore(flags); } +#ifdef CONFIG_NUMA + /* + * Deal with draining the remote pageset of this + * processor + * + * Check if there are pages remaining in this pageset + * if not then there is nothing to expire. + */ + if (!p->expire || (!p->pcp[0].count && !p->pcp[1].count)) + continue; + + /* + * We never drain zones local to this processor. + */ + if (zone_to_nid(zone) == numa_node_id()) { + p->expire = 0; + continue; + } + + p->expire--; + if (p->expire) + continue; + + if (p->pcp[0].count) + drain_zone_pages(zone, p->pcp + 0); + + if (p->pcp[1].count) + drain_zone_pages(zone, p->pcp + 1); +#endif } } -- From clameter@sgi.com Mon May 7 14:14:48 2007 Message-Id: <20070507211448.775669300@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:40 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 06/17] SLUB: Use check_valid_pointer in kmem_ptr_validate Content-Disposition: inline; filename=use_check_valid_pointer_in_kmem_ptr_validate We needlessly duplicate code. Also make check_valid_pointer inline. 
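For readers who skim the diff below: the shared check_valid_pointer() logic amounts to a bounds test plus a test that the pointer sits on an object-size boundary within the slab. The following user-space sketch is illustrative only; the struct and function names are made-up stand-ins, not the kernel code.

#include <stddef.h>

/* Stand-in for the few kmem_cache fields the check needs (illustrative). */
struct cache_layout {
	size_t size;		/* bytes consumed by one object slot */
	unsigned int objects;	/* object slots per slab */
};

/*
 * A pointer is a valid object address if it falls inside the slab's
 * object area and is offset from the slab base by a multiple of the
 * object size.
 */
static int pointer_is_valid(const struct cache_layout *s,
			    const char *slab_base, const char *object)
{
	if (object < slab_base ||
	    object >= slab_base + (size_t)s->objects * s->size)
		return 0;	/* out of bounds */

	if ((size_t)(object - slab_base) % s->size)
		return 0;	/* improperly aligned */

	return 1;
}

int main(void)
{
	struct cache_layout s = { .size = 64, .objects = 8 };
	char slab[8 * 64];

	/* Slot 2 is valid, an address inside an object is not. */
	return !(pointer_is_valid(&s, slab, slab + 128) &&
		 !pointer_is_valid(&s, slab, slab + 130));
}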
Signed-off-by: Christoph Lameter --- mm/slub.c | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:52:44.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:52:47.000000000 -0700 @@ -407,9 +407,8 @@ static int check_bytes(u8 *start, unsign return 1; } - -static int check_valid_pointer(struct kmem_cache *s, struct page *page, - void *object) +static inline int check_valid_pointer(struct kmem_cache *s, + struct page *page, const void *object) { void *base; @@ -1803,13 +1802,7 @@ int kmem_ptr_validate(struct kmem_cache /* No slab or wrong slab */ return 0; - addr = page_address(page); - if (object < addr || object >= addr + s->objects * s->size) - /* Out of bounds */ - return 0; - - if ((object - addr) % s->size) - /* Improperly aligned */ + if (!check_valid_pointer(s, page, object)) return 0; /* -- From clameter@sgi.com Mon May 7 14:14:49 2007 Message-Id: <20070507211448.955034534@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:41 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 07/17] SLUB: Clean up krealloc Content-Disposition: inline; filename=better_krealloc We really do not need all this gaga there. ksize gives us all the information we need to figure out if the object can cope with the new size. Signed-off-by: Christoph Lameter --- mm/slub.c | 15 ++++----------- 1 file changed, 4 insertions(+), 11 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:52:47.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:52:51.000000000 -0700 @@ -2206,9 +2206,8 @@ EXPORT_SYMBOL(kmem_cache_shrink); */ void *krealloc(const void *p, size_t new_size, gfp_t flags) { - struct kmem_cache *new_cache; void *ret; - struct page *page; + unsigned long ks; if (unlikely(!p)) return kmalloc(new_size, flags); @@ -2218,19 +2217,13 @@ void *krealloc(const void *p, size_t new return NULL; } - page = virt_to_head_page(p); - - new_cache = get_slab(new_size, flags); - - /* - * If new size fits in the current cache, bail out. - */ - if (likely(page->slab == new_cache)) + ks = ksize(p); + if (ks >= new_size) return (void *)p; ret = kmalloc(new_size, flags); if (ret) { - memcpy(ret, p, min(new_size, ksize(p))); + memcpy(ret, p, min(new_size, ks)); kfree(p); } return ret; -- From clameter@sgi.com Mon May 7 14:14:49 2007 Message-Id: <20070507211449.120783005@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:42 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 08/17] SLUB: Get rid of finish_bootstrap Content-Disposition: inline; filename=die_finish_bootstrap Its only purpose was to bring some sort of symmetry to sysfs usage when dealing with bootstrapping per cpu flushing. Since we do not time out slabs anymore we have no need to run finish_bootstrap even without sysfs. Fold it back into slab_sysfs_init and drop the initcall for the !SYSFS case.
Signed-off-by: Christoph Lameter --- mm/slub.c | 30 ++++++++++-------------------- 1 file changed, 10 insertions(+), 20 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:52:51.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:52:54.000000000 -0700 @@ -1711,23 +1711,6 @@ static int calculate_sizes(struct kmem_c } -static int __init finish_bootstrap(void) -{ - struct list_head *h; - int err; - - slab_state = SYSFS; - - list_for_each(h, &slab_caches) { - struct kmem_cache *s = - container_of(h, struct kmem_cache, list); - - err = sysfs_slab_add(s); - BUG_ON(err); - } - return 0; -} - static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags, const char *name, size_t size, size_t align, unsigned long flags, @@ -3415,6 +3398,7 @@ static int sysfs_slab_alias(struct kmem_ static int __init slab_sysfs_init(void) { + struct list_head *h; int err; err = subsystem_register(&slab_subsys); @@ -3423,7 +3407,15 @@ static int __init slab_sysfs_init(void) return -ENOSYS; } - finish_bootstrap(); + slab_state = SYSFS; + + list_for_each(h, &slab_caches) { + struct kmem_cache *s = + container_of(h, struct kmem_cache, list); + + err = sysfs_slab_add(s); + BUG_ON(err); + } while (alias_list) { struct saved_alias *al = alias_list; @@ -3439,6 +3431,4 @@ static int __init slab_sysfs_init(void) } __initcall(slab_sysfs_init); -#else -__initcall(finish_bootstrap); #endif -- From clameter@sgi.com Mon May 7 14:14:49 2007 Message-Id: <20070507211449.299892792@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:43 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 09/17] SLUB: Update comments Content-Disposition: inline; filename=comments Update comments throughout SLUB to reflect the new developments. Fix up various awkward sentences. Signed-off-by: Christoph Lameter --- mm/slub.c | 246 ++++++++++++++++++++++++++++++-------------------------------- 1 file changed, 121 insertions(+), 125 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 14:00:33.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 14:00:34.000000000 -0700 @@ -66,11 +66,11 @@ * SLUB assigns one slab for allocation to each processor. * Allocations only occur from these slabs called cpu slabs. * - * Slabs with free elements are kept on a partial list. - * There is no list for full slabs. If an object in a full slab is + * Slabs with free elements are kept on a partial list and during regular + * operations no list for full slabs is used. If an object in a full slab is * freed then the slab will show up again on the partial lists. - * Otherwise there is no need to track full slabs unless we have to - * track full slabs for debugging purposes. + * We track full slabs for debugging purposes though because otherwise we + * cannot scan all objects. * * Slabs are freed when they become empty. Teardown and setup is * minimal so we rely on the page allocators per cpu caches for @@ -92,8 +92,8 @@ * * - The per cpu array is updated for each new slab and and is a remote * cacheline for most nodes. This could become a bouncing cacheline given - * enough frequent updates. There are 16 pointers in a cacheline.so at - * max 16 cpus could compete. Likely okay. + * enough frequent updates. There are 16 pointers in a cacheline, so at + * max 16 cpus could compete for the cacheline which may be okay. 
* * - Support PAGE_ALLOC_DEBUG. Should be easy to do. * @@ -144,6 +144,7 @@ #define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \ SLAB_POISON | SLAB_STORE_USER) + /* * Set of flags that will prevent slab merging */ @@ -173,7 +174,7 @@ static struct notifier_block slab_notifi static enum { DOWN, /* No slab functionality available */ PARTIAL, /* kmem_cache_open() works but kmalloc does not */ - UP, /* Everything works */ + UP, /* Everything works but does not show up in sysfs */ SYSFS /* Sysfs up */ } slab_state = DOWN; @@ -247,9 +248,9 @@ static void print_section(char *text, u8 /* * Slow version of get and set free pointer. * - * This requires touching the cache lines of kmem_cache. - * The offset can also be obtained from the page. In that - * case it is in the cacheline that we already need to touch. + * This version requires touching the cache lines of kmem_cache which + * we avoid to do in the fast alloc free paths. There we obtain the offset + * from the page struct. */ static void *get_freepointer(struct kmem_cache *s, void *object) { @@ -431,26 +432,34 @@ static inline int check_valid_pointer(st * Bytes of the object to be managed. * If the freepointer may overlay the object then the free * pointer is the first word of the object. + * * Poisoning uses 0x6b (POISON_FREE) and the last byte is * 0xa5 (POISON_END) * * object + s->objsize * Padding to reach word boundary. This is also used for Redzoning. - * Padding is extended to word size if Redzoning is enabled - * and objsize == inuse. + * Padding is extended by another word if Redzoning is enabled and + * objsize == inuse. + * * We fill with 0xbb (RED_INACTIVE) for inactive objects and with * 0xcc (RED_ACTIVE) for objects in use. * * object + s->inuse + * Meta data starts here. + * * A. Free pointer (if we cannot overwrite object on free) * B. Tracking data for SLAB_STORE_USER - * C. Padding to reach required alignment boundary - * Padding is done using 0x5a (POISON_INUSE) + * C. Padding to reach required alignment boundary or at mininum + * one word if debuggin is on to be able to detect writes + * before the word boundary. + * + * Padding is done using 0x5a (POISON_INUSE) * * object + s->size + * Nothing is used beyond s->size. * - * If slabcaches are merged then the objsize and inuse boundaries are to - * be ignored. And therefore no slab options that rely on these boundaries + * If slabcaches are merged then the objsize and inuse boundaries are mostly + * ignored. And therefore no slab options that rely on these boundaries * may be used with merged slabcaches. */ @@ -576,8 +585,7 @@ static int check_object(struct kmem_cach /* * No choice but to zap it and thus loose the remainder * of the free objects in this slab. May cause - * another error because the object count maybe - * wrong now. + * another error because the object count is now wrong. */ set_freepointer(s, p, NULL); return 0; @@ -617,9 +625,8 @@ static int check_slab(struct kmem_cache } /* - * Determine if a certain object on a page is on the freelist and - * therefore free. Must hold the slab lock for cpu slabs to - * guarantee that the chains are consistent. + * Determine if a certain object on a page is on the freelist. Must hold the + * slab lock to guarantee that the chains are in a consistent state. 
*/ static int on_freelist(struct kmem_cache *s, struct page *page, void *search) { @@ -665,7 +672,7 @@ static int on_freelist(struct kmem_cache } /* - * Tracking of fully allocated slabs for debugging + * Tracking of fully allocated slabs for debugging purposes. */ static void add_full(struct kmem_cache_node *n, struct page *page) { @@ -716,7 +723,7 @@ bad: /* * If this is a slab page then lets do the best we can * to avoid issues in the future. Marking all objects - * as used avoids touching the remainder. + * as used avoids touching the remaining objects. */ printk(KERN_ERR "@@@ SLUB: %s slab 0x%p. Marking all objects used.\n", s->name, page); @@ -972,9 +979,9 @@ static void remove_partial(struct kmem_c } /* - * Lock page and remove it from the partial list + * Lock slab and remove from the partial list. * - * Must hold list_lock + * Must hold list_lock. */ static int lock_and_del_slab(struct kmem_cache_node *n, struct page *page) { @@ -987,7 +994,7 @@ static int lock_and_del_slab(struct kmem } /* - * Try to get a partial slab from a specific node + * Try to allocate a partial slab from a specific node. */ static struct page *get_partial_node(struct kmem_cache_node *n) { @@ -996,7 +1003,8 @@ static struct page *get_partial_node(str /* * Racy check. If we mistakenly see no partial slabs then we * just allocate an empty slab. If we mistakenly try to get a - * partial slab then get_partials() will return NULL. + * partial slab and there is none available then get_partials() + * will return NULL. */ if (!n || !n->nr_partial) return NULL; @@ -1012,8 +1020,7 @@ out: } /* - * Get a page from somewhere. Search in increasing NUMA - * distances. + * Get a page from somewhere. Search in increasing NUMA distances. */ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags) { @@ -1023,24 +1030,22 @@ static struct page *get_any_partial(stru struct page *page; /* - * The defrag ratio allows to configure the tradeoffs between - * inter node defragmentation and node local allocations. - * A lower defrag_ratio increases the tendency to do local - * allocations instead of scanning throught the partial - * lists on other nodes. - * - * If defrag_ratio is set to 0 then kmalloc() always - * returns node local objects. If its higher then kmalloc() - * may return off node objects in order to avoid fragmentation. + * The defrag ratio allows a configuration of the tradeoffs between + * inter node defragmentation and node local allocations. A lower + * defrag_ratio increases the tendency to do local allocations + * instead of attempting to obtain partial slabs from other nodes. * - * A higher ratio means slabs may be taken from other nodes - * thus reducing the number of partial slabs on those nodes. + * If the defrag_ratio is set to 0 then kmalloc() always + * returns node local objects. If the ratio is higher then kmalloc() + * may return off node objects because partial slabs are obtained + * from other nodes and filled up. * * If /sys/slab/xx/defrag_ratio is set to 100 (which makes - * defrag_ratio = 1000) then every (well almost) allocation - * will first attempt to defrag slab caches on other nodes. This - * means scanning over all nodes to look for partial slabs which - * may be a bit expensive to do on every slab allocation. + * defrag_ratio = 1000) then every (well almost) allocation will + * first attempt to defrag slab caches on other nodes. 
This means + * scanning over all nodes to look for partial slabs which may be + * expensive if we do it every time we are trying to find a slab + * with available objects. */ if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio) return NULL; @@ -1100,11 +1105,12 @@ static void putback_slab(struct kmem_cac } else { if (n->nr_partial < MIN_PARTIAL) { /* - * Adding an empty page to the partial slabs in order - * to avoid page allocator overhead. This page needs to - * come after all the others that are not fully empty - * in order to make sure that we do maximum - * defragmentation. + * Adding an empty slab to the partial slabs in order + * to avoid page allocator overhead. This slab needs + * to come after the other slabs with objects in + * order to fill them up. That way the size of the + * partial list stays small. kmem_cache_shrink can + * reclaim empty slabs from the partial list. */ add_partial_tail(n, page); slab_unlock(page); @@ -1172,7 +1178,7 @@ static void flush_all(struct kmem_cache * 1. The page struct * 2. The first cacheline of the object to be allocated. * - * The only cache lines that are read (apart from code) is the + * The only other cache lines that are read (apart from code) is the * per cpu array in the kmem_cache struct. * * Fastpath is not possible if we need to get a new slab or have @@ -1226,9 +1232,11 @@ have_slab: cpu = smp_processor_id(); if (s->cpu_slab[cpu]) { /* - * Someone else populated the cpu_slab while we enabled - * interrupts, or we have got scheduled on another cpu. - * The page may not be on the requested node. + * Someone else populated the cpu_slab while we + * enabled interrupts, or we have gotten scheduled + * on another cpu. The page may not be on the + * requested node even if __GFP_THISNODE was + * specified. So we need to recheck. */ if (node == -1 || page_to_nid(s->cpu_slab[cpu]) == node) { @@ -1241,7 +1249,7 @@ have_slab: slab_lock(page); goto redo; } - /* Dump the current slab */ + /* New slab does not fit our expectations */ flush_slab(s, s->cpu_slab[cpu], cpu); } slab_lock(page); @@ -1282,7 +1290,8 @@ EXPORT_SYMBOL(kmem_cache_alloc_node); * The fastpath only writes the cacheline of the page struct and the first * cacheline of the object. * - * No special cachelines need to be read + * We read the cpu_slab cacheline to check if the slab is the per cpu + * slab for this processor. */ static void slab_free(struct kmem_cache *s, struct page *page, void *x, void *addr) @@ -1327,7 +1336,7 @@ out_unlock: slab_empty: if (prior) /* - * Slab on the partial list. + * Slab still on the partial list. */ remove_partial(s, page); @@ -1376,22 +1385,16 @@ static struct page *get_object_page(cons } /* - * kmem_cache_open produces objects aligned at "size" and the first object - * is placed at offset 0 in the slab (We have no metainformation on the - * slab, all slabs are in essence "off slab"). - * - * In order to get the desired alignment one just needs to align the - * size. + * Object placement in a slab is made very easy because we always start at + * offset 0. If we tune the size of the object to the alignment then we can + * get the required alignment by putting one properly sized object after + * another. * * Notice that the allocation order determines the sizes of the per cpu * caches. Each processor has always one slab available for allocations. 
* Increasing the allocation order reduces the number of times that slabs - * must be moved on and off the partial lists and therefore may influence + * must be moved on and off the partial lists and is therefore a factor in * locking overhead. - * - * The offset is used to relocate the free list link in each object. It is - * therefore possible to move the free list link behind the object. This - * is necessary for RCU to work properly and also useful for debugging. */ /* @@ -1407,15 +1410,11 @@ static int user_override; */ static int slub_min_order; static int slub_max_order = DEFAULT_MAX_ORDER; - -/* - * Minimum number of objects per slab. This is necessary in order to - * reduce locking overhead. Similar to the queue size in SLAB. - */ static int slub_min_objects = DEFAULT_MIN_OBJECTS; /* * Merge control. If this is set then no merging of slab caches will occur. + * (Could be removed. This was introduced to pacify the merge skeptics.) */ static int slub_nomerge; @@ -1429,23 +1428,27 @@ static char *slub_debug_slabs; /* * Calculate the order of allocation given an slab object size. * - * The order of allocation has significant impact on other elements - * of the system. Generally order 0 allocations should be preferred - * since they do not cause fragmentation in the page allocator. Larger - * objects may have problems with order 0 because there may be too much - * space left unused in a slab. We go to a higher order if more than 1/8th - * of the slab would be wasted. - * - * In order to reach satisfactory performance we must ensure that - * a minimum number of objects is in one slab. Otherwise we may - * generate too much activity on the partial lists. This is less a - * concern for large slabs though. slub_max_order specifies the order - * where we begin to stop considering the number of objects in a slab. - * - * Higher order allocations also allow the placement of more objects - * in a slab and thereby reduce object handling overhead. If the user - * has requested a higher mininum order then we start with that one - * instead of zero. + * The order of allocation has significant impact on performance and other + * system components. Generally order 0 allocations should be preferred since + * order 0 does not cause fragmentation in the page allocator. Larger objects + * be problematic to put into order 0 slabs because there may be too much + * unused space left. We go to a higher order if more than 1/8th of the slab + * would be wasted. + * + * In order to reach satisfactory performance we must ensure that a minimum + * number of objects is in one slab. Otherwise we may generate too much + * activity on the partial lists which requires taking the list_lock. This is + * less a concern for large slabs though which are rarely used. + * + * slub_max_order specifies the order where we begin to stop considering the + * number of objects in a slab as critical. If we reach slub_max_order then + * we try to keep the page order as low as possible. So we accept more waste + * of space in favor of a small page order. + * + * Higher order allocations also allow the placement of more objects in a + * slab and thereby reduce object handling overhead. If the user has + * requested a higher mininum order then we start with that one instead of + * the smallest order which will fit the object. 
*/ static int calculate_order(int size) { @@ -1465,18 +1468,18 @@ static int calculate_order(int size) rem = slab_size % size; - if (rem <= (PAGE_SIZE << order) / 8) + if (rem <= slab_size / 8) break; } if (order >= MAX_ORDER) return -E2BIG; + return order; } /* - * Function to figure out which alignment to use from the - * various ways of specifying it. + * Figure out what the alignment of the objects will be. */ static unsigned long calculate_alignment(unsigned long flags, unsigned long align, unsigned long size) @@ -1631,18 +1634,16 @@ static int calculate_sizes(struct kmem_c size = ALIGN(size, sizeof(void *)); /* - * If we are redzoning then check if there is some space between the + * If we are Redzoning then check if there is some space between the * end of the object and the free pointer. If not then add an - * additional word, so that we can establish a redzone between - * the object and the freepointer to be able to check for overwrites. + * additional word to have some bytes to store Redzone information. */ if ((flags & SLAB_RED_ZONE) && size == s->objsize) size += sizeof(void *); /* - * With that we have determined how much of the slab is in actual - * use by the object. This is the potential offset to the free - * pointer. + * With that we have determined the number of bytes in actual use + * by the object. This is the potential offset to the free pointer. */ s->inuse = size; @@ -1676,6 +1677,7 @@ static int calculate_sizes(struct kmem_c * of the object. */ size += sizeof(void *); + /* * Determine the alignment based on various parameters that the * user specified and the dynamic determination of cache line size @@ -1777,7 +1779,6 @@ EXPORT_SYMBOL(kmem_cache_open); int kmem_ptr_validate(struct kmem_cache *s, const void *object) { struct page * page; - void *addr; page = get_object_page(object); @@ -1814,7 +1815,8 @@ const char *kmem_cache_name(struct kmem_ EXPORT_SYMBOL(kmem_cache_name); /* - * Attempt to free all slabs on a node + * Attempt to free all slabs on a node. Return the number of slabs we + * were unable to free. */ static int free_list(struct kmem_cache *s, struct kmem_cache_node *n, struct list_head *list) @@ -1835,7 +1837,7 @@ static int free_list(struct kmem_cache * } /* - * Release all resources used by slab cache + * Release all resources used by a slab cache. */ static int kmem_cache_close(struct kmem_cache *s) { @@ -2096,13 +2098,14 @@ void kfree(const void *x) EXPORT_SYMBOL(kfree); /* - * kmem_cache_shrink removes empty slabs from the partial lists - * and then sorts the partially allocated slabs by the number - * of items in use. The slabs with the most items in use - * come first. New allocations will remove these from the - * partial list because they are full. The slabs with the - * least items are placed last. If it happens that the objects - * are freed then the page can be returned to the page allocator. + * kmem_cache_shrink removes empty slabs from the partial lists and sorts + * the remaining slabs by the number of items in use. The slabs with the + * most items in use come first. New allocations will then fill those up + * and thus they can be removed from the partial lists. + * + * The slabs with the least items are placed last. This results in them + * being allocated from last increasing the chance that the last objects + * are freed in them. 
*/ int kmem_cache_shrink(struct kmem_cache *s) { @@ -2131,12 +2134,10 @@ int kmem_cache_shrink(struct kmem_cache spin_lock_irqsave(&n->list_lock, flags); /* - * Build lists indexed by the items in use in - * each slab or free slabs if empty. + * Build lists indexed by the items in use in each slab. * - * Note that concurrent frees may occur while - * we hold the list_lock. page->inuse here is - * the upper limit. + * Note that concurrent frees may occur while we hold the + * list_lock. page->inuse here is the upper limit. */ list_for_each_entry_safe(page, t, &n->partial, lru) { if (!page->inuse && slab_trylock(page)) { @@ -2160,8 +2161,8 @@ int kmem_cache_shrink(struct kmem_cache goto out; /* - * Rebuild the partial list with the slabs filled up - * most first and the least used slabs at the end. + * Rebuild the partial list with the slabs filled up most + * first and the least used slabs at the end. */ for (i = s->objects - 1; i >= 0; i--) list_splice(slabs_by_inuse + i, n->partial.prev); @@ -2233,7 +2234,7 @@ void __init kmem_cache_init(void) #ifdef CONFIG_NUMA /* * Must first have the slab cache available for the allocations of the - * struct kmalloc_cache_node's. There is special bootstrap code in + * struct kmem_cache_node's. There is special bootstrap code in * kmem_cache_open for slab_state == DOWN. */ create_kmalloc_cache(&kmalloc_caches[0], "kmem_cache_node", @@ -2405,8 +2406,8 @@ static void for_all_slabs(void (*func)(s } /* - * Use the cpu notifier to insure that the slab are flushed - * when necessary. + * Use the cpu notifier to insure that the cpu slabs are flushed when + * necessary. */ static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb, unsigned long action, void *hcpu) @@ -2488,11 +2489,6 @@ static void resiliency_test(void) static void resiliency_test(void) {}; #endif -/* - * These are not as efficient as kmalloc for the non debug case. - * We do not have the page struct available so we have to touch one - * cacheline in struct kmem_cache to check slab flags. - */ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, void *caller) { struct kmem_cache *s = get_slab(size, gfpflags); @@ -2610,7 +2606,7 @@ static unsigned long validate_slab_cache } /* - * Generate lists of locations where slabcache objects are allocated + * Generate lists of code addresses where slabcache objects are allocated * and freed. */ @@ -2689,7 +2685,7 @@ static int add_location(struct loc_track } /* - * Not found. Insert new tracking element + * Not found. Insert new tracking element. */ if (t->count >= t->max && !alloc_loc_track(t, 2 * t->max)) return 0; -- From clameter@sgi.com Mon May 7 14:14:49 2007 Message-Id: <20070507211449.475055347@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:44 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 10/17] SLUB: Add macros for scanning objects in a slab Content-Disposition: inline; filename=for_each_object Scanning of objects happens in a number of functions. Consolidate that code. DECLARE_BITMAP instead of coding the declaration for bitmaps. 
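To make the consolidation concrete, the following user-space sketch models the scanning pattern the new helpers express: walk the freelist to mark free slots, then walk every slot and treat the unmarked ones as allocated. It is illustrative only; the types, the byte array standing in for DECLARE_BITMAP(), and the simplified macro bodies are assumptions for the example, not the kernel definitions.

#include <stddef.h>
#include <stdio.h>

/* Stand-ins for the few fields the scanning helpers rely on (illustrative). */
struct cache_layout {
	size_t size;	/* bytes per object slot */
	int objects;	/* slots per slab */
	size_t offset;	/* where a free object stores its next pointer */
};

static void *get_freepointer(const struct cache_layout *s, void *object)
{
	return *(void **)((char *)object + s->offset);
}

/* Loop over all object slots in a slab. */
#define for_each_object(p, s, addr)					\
	for (p = (addr);						\
	     (char *)p < (char *)(addr) + (s)->objects * (s)->size;	\
	     p = (char *)p + (s)->size)

/* Walk the freelist. */
#define for_each_free_object(p, s, freelist)				\
	for (p = (freelist); p; p = get_freepointer((s), p))

/* Object index from a pointer into the slab. */
static int slab_index(void *p, const struct cache_layout *s, void *addr)
{
	return (int)(((char *)p - (char *)addr) / s->size);
}

int main(void)
{
	struct cache_layout s = { .size = sizeof(void *), .objects = 4, .offset = 0 };
	void *slab[4];			/* four pointer-sized slots */
	char free_map[4] = { 0 };	/* stand-in for DECLARE_BITMAP() */
	void *p;

	/* Freelist: slot 1 -> slot 3 -> NULL. */
	slab[1] = &slab[3];
	slab[3] = NULL;

	for_each_free_object(p, &s, &slab[1])
		free_map[slab_index(p, &s, slab)] = 1;

	for_each_object(p, &s, slab)
		if (!free_map[slab_index(p, &s, slab)])
			printf("slot %d is allocated\n", slab_index(p, &s, slab));
	return 0;
}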
Signed-off-by: Christoph Lameter --- mm/slub.c | 75 ++++++++++++++++++++++++++++++++++++-------------------------- 1 file changed, 44 insertions(+), 31 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:54:26.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:54:31.000000000 -0700 @@ -211,6 +211,38 @@ static inline struct kmem_cache_node *ge } /* + * Slow version of get and set free pointer. + * + * This version requires touching the cache lines of kmem_cache which + * we avoid to do in the fast alloc free paths. There we obtain the offset + * from the page struct. + */ +static inline void *get_freepointer(struct kmem_cache *s, void *object) +{ + return *(void **)(object + s->offset); +} + +static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp) +{ + *(void **)(object + s->offset) = fp; +} + +/* Loop over all objects in a slab */ +#define for_each_object(__p, __s, __addr) \ + for (__p = (__addr); __p < (__addr) + (__s)->objects * (__s)->size;\ + __p += (__s)->size) + +/* Scan freelist */ +#define for_each_free_object(__p, __s, __free) \ + for (__p = (__free); __p; __p = get_freepointer((__s), __p)) + +/* Determine object index from a given position */ +static inline int slab_index(void *p, struct kmem_cache *s, void *addr) +{ + return (p - addr) / s->size; +} + +/* * Object debugging */ static void print_section(char *text, u8 *addr, unsigned int length) @@ -246,23 +278,6 @@ static void print_section(char *text, u8 } /* - * Slow version of get and set free pointer. - * - * This version requires touching the cache lines of kmem_cache which - * we avoid to do in the fast alloc free paths. There we obtain the offset - * from the page struct. - */ -static void *get_freepointer(struct kmem_cache *s, void *object) -{ - return *(void **)(object + s->offset); -} - -static void set_freepointer(struct kmem_cache *s, void *object, void *fp) -{ - *(void **)(object + s->offset) = fp; -} - -/* * Tracking user of a slab. 
*/ struct track { @@ -854,7 +869,7 @@ static struct page *new_slab(struct kmem memset(start, POISON_INUSE, PAGE_SIZE << s->order); last = start; - for (p = start + s->size; p < end; p += s->size) { + for_each_object(p, s, start) { setup_object(s, page, last); set_freepointer(s, last, p); last = p; @@ -875,12 +890,10 @@ static void __free_slab(struct kmem_cach int pages = 1 << s->order; if (unlikely(PageError(page) || s->dtor)) { - void *start = page_address(page); - void *end = start + (pages << PAGE_SHIFT); void *p; slab_pad_check(s, page); - for (p = start; p <= end - s->size; p += s->size) { + for_each_object(p, s, page_address(page)) { if (s->dtor) s->dtor(p, s, 0); check_object(s, page, p, 0); @@ -2516,7 +2529,7 @@ static int validate_slab(struct kmem_cac { void *p; void *addr = page_address(page); - unsigned long map[BITS_TO_LONGS(s->objects)]; + DECLARE_BITMAP(map, s->objects); if (!check_slab(s, page) || !on_freelist(s, page, NULL)) @@ -2525,14 +2538,14 @@ static int validate_slab(struct kmem_cac /* Now we know that a valid freelist exists */ bitmap_zero(map, s->objects); - for(p = page->freelist; p; p = get_freepointer(s, p)) { - set_bit((p - addr) / s->size, map); + for_each_free_object(p, s, page->freelist) { + set_bit(slab_index(p, s, addr), map); if (!check_object(s, page, p, 0)) return 0; } - for(p = addr; p < addr + s->objects * s->size; p += s->size) - if (!test_bit((p - addr) / s->size, map)) + for_each_object(p, s, addr) + if (!test_bit(slab_index(p, s, addr), map)) if (!check_object(s, page, p, 1)) return 0; return 1; @@ -2704,15 +2717,15 @@ static void process_slab(struct loc_trac struct page *page, enum track_item alloc) { void *addr = page_address(page); - unsigned long map[BITS_TO_LONGS(s->objects)]; + DECLARE_BITMAP(map, s->objects); void *p; bitmap_zero(map, s->objects); - for (p = page->freelist; p; p = get_freepointer(s, p)) - set_bit((p - addr) / s->size, map); + for_each_free_object(p, s, page->freelist) + set_bit(slab_index(p, s, addr), map); - for (p = addr; p < addr + s->objects * s->size; p += s->size) - if (!test_bit((p - addr) / s->size, map)) { + for_each_object(p, s, addr) + if (!test_bit(slab_index(p, s, addr), map)) { void *addr = get_track(s, p, alloc)->addr; add_location(t, s, addr); -- From clameter@sgi.com Mon May 7 14:14:49 2007 Message-Id: <20070507211449.651055191@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:45 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 11/17] SLUB: Move resiliency check into SYSFS section Content-Disposition: inline; filename=move_resiliency_check Move the resiliency check into the SYSFS section after validate_slab that is used by the resiliency check. This will avoid a forward declaration. Signed-off-by: Christoph Lameter --- mm/slub.c | 112 ++++++++++++++++++++++++++++++-------------------------------- 1 file changed, 55 insertions(+), 57 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:54:31.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:54:35.000000000 -0700 @@ -2445,63 +2445,6 @@ static struct notifier_block __cpuinitda #endif -#ifdef SLUB_RESILIENCY_TEST -static unsigned long validate_slab_cache(struct kmem_cache *s); - -static void resiliency_test(void) -{ - u8 *p; - - printk(KERN_ERR "SLUB resiliency testing\n"); - printk(KERN_ERR "-----------------------\n"); - printk(KERN_ERR "A. 
Corruption after allocation\n"); - - p = kzalloc(16, GFP_KERNEL); - p[16] = 0x12; - printk(KERN_ERR "\n1. kmalloc-16: Clobber Redzone/next pointer" - " 0x12->0x%p\n\n", p + 16); - - validate_slab_cache(kmalloc_caches + 4); - - /* Hmmm... The next two are dangerous */ - p = kzalloc(32, GFP_KERNEL); - p[32 + sizeof(void *)] = 0x34; - printk(KERN_ERR "\n2. kmalloc-32: Clobber next pointer/next slab" - " 0x34 -> -0x%p\n", p); - printk(KERN_ERR "If allocated object is overwritten then not detectable\n\n"); - - validate_slab_cache(kmalloc_caches + 5); - p = kzalloc(64, GFP_KERNEL); - p += 64 + (get_cycles() & 0xff) * sizeof(void *); - *p = 0x56; - printk(KERN_ERR "\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n", - p); - printk(KERN_ERR "If allocated object is overwritten then not detectable\n\n"); - validate_slab_cache(kmalloc_caches + 6); - - printk(KERN_ERR "\nB. Corruption after free\n"); - p = kzalloc(128, GFP_KERNEL); - kfree(p); - *p = 0x78; - printk(KERN_ERR "1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p); - validate_slab_cache(kmalloc_caches + 7); - - p = kzalloc(256, GFP_KERNEL); - kfree(p); - p[50] = 0x9a; - printk(KERN_ERR "\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n", p); - validate_slab_cache(kmalloc_caches + 8); - - p = kzalloc(512, GFP_KERNEL); - kfree(p); - p[512] = 0xab; - printk(KERN_ERR "\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p); - validate_slab_cache(kmalloc_caches + 9); -} -#else -static void resiliency_test(void) {}; -#endif - void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, void *caller) { struct kmem_cache *s = get_slab(size, gfpflags); @@ -2618,6 +2561,61 @@ static unsigned long validate_slab_cache return count; } +#ifdef SLUB_RESILIENCY_TEST +static void resiliency_test(void) +{ + u8 *p; + + printk(KERN_ERR "SLUB resiliency testing\n"); + printk(KERN_ERR "-----------------------\n"); + printk(KERN_ERR "A. Corruption after allocation\n"); + + p = kzalloc(16, GFP_KERNEL); + p[16] = 0x12; + printk(KERN_ERR "\n1. kmalloc-16: Clobber Redzone/next pointer" + " 0x12->0x%p\n\n", p + 16); + + validate_slab_cache(kmalloc_caches + 4); + + /* Hmmm... The next two are dangerous */ + p = kzalloc(32, GFP_KERNEL); + p[32 + sizeof(void *)] = 0x34; + printk(KERN_ERR "\n2. kmalloc-32: Clobber next pointer/next slab" + " 0x34 -> -0x%p\n", p); + printk(KERN_ERR "If allocated object is overwritten then not detectable\n\n"); + + validate_slab_cache(kmalloc_caches + 5); + p = kzalloc(64, GFP_KERNEL); + p += 64 + (get_cycles() & 0xff) * sizeof(void *); + *p = 0x56; + printk(KERN_ERR "\n3. kmalloc-64: corrupting random byte 0x56->0x%p\n", + p); + printk(KERN_ERR "If allocated object is overwritten then not detectable\n\n"); + validate_slab_cache(kmalloc_caches + 6); + + printk(KERN_ERR "\nB. Corruption after free\n"); + p = kzalloc(128, GFP_KERNEL); + kfree(p); + *p = 0x78; + printk(KERN_ERR "1. kmalloc-128: Clobber first word 0x78->0x%p\n\n", p); + validate_slab_cache(kmalloc_caches + 7); + + p = kzalloc(256, GFP_KERNEL); + kfree(p); + p[50] = 0x9a; + printk(KERN_ERR "\n2. kmalloc-256: Clobber 50th byte 0x9a->0x%p\n\n", p); + validate_slab_cache(kmalloc_caches + 8); + + p = kzalloc(512, GFP_KERNEL); + kfree(p); + p[512] = 0xab; + printk(KERN_ERR "\n3. kmalloc-512: Clobber redzone 0xab->0x%p\n\n", p); + validate_slab_cache(kmalloc_caches + 9); +} +#else +static void resiliency_test(void) {}; +#endif + /* * Generate lists of code addresses where slabcache objects are allocated * and freed. 
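The resiliency test moved above works by deliberately clobbering bytes just past an allocation, or inside an object that has already been freed, and then asking validate_slab_cache() to report the damage. The fragment below is only a userspace analogy of that redzone idea, with an invented one-byte guard; it shares no code with SLUB, whose real checks sit in the slab debug paths.

#include <stdio.h>
#include <stdlib.h>

#define REDZONE 0xbb	/* guard value written just past the usable bytes */

/* Hand out 16 usable bytes followed by a one-byte redzone. */
static unsigned char *alloc16(void)
{
	unsigned char *p = malloc(16 + 1);

	if (p)
		p[16] = REDZONE;
	return p;
}

/* Report whether the redzone survived. */
static int validate16(unsigned char *p)
{
	if (p[16] != REDZONE) {
		fprintf(stderr, "redzone clobbered: 0x%02x\n", p[16]);
		return 0;
	}
	return 1;
}

int main(void)
{
	unsigned char *p = alloc16();

	if (!p)
		return 1;
	p[16] = 0x12;		/* same off-by-one write the test injects */
	validate16(p);		/* detected, as validate_slab_cache() would */
	free(p);
	return 0;
}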
-- From clameter@sgi.com Mon May 7 14:14:49 2007 Message-Id: <20070507211449.819048130@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:46 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 12/17] SLUB: Introduce DebugSlab(page) Content-Disposition: inline; filename=define_debugslab This replaces the PageError() checking. DebugSlab is clearer and allows for future changes to the page bit used. We also need it to support CONFIG_SLUB_DEBUG. Signed-off-by: Christoph Lameter --- mm/slub.c | 40 ++++++++++++++++++++++++++++------------ 1 file changed, 28 insertions(+), 12 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:54:35.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:56:25.000000000 -0700 @@ -87,6 +87,21 @@ * the fast path. */ +static inline int SlabDebug(struct page *page) +{ + return PageError(page); +} + +static inline void SetSlabDebug(struct page *page) +{ + SetPageError(page); +} + +static inline void ClearSlabDebug(struct page *page) +{ + ClearPageError(page); +} + /* * Issues still to be resolved: * @@ -825,7 +840,7 @@ static struct page *allocate_slab(struct static void setup_object(struct kmem_cache *s, struct page *page, void *object) { - if (PageError(page)) { + if (SlabDebug(page)) { init_object(s, object, 0); init_tracking(s, object); } @@ -860,7 +875,7 @@ static struct page *new_slab(struct kmem page->flags |= 1 << PG_slab; if (s->flags & (SLAB_DEBUG_FREE | SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER | SLAB_TRACE)) - page->flags |= 1 << PG_error; + SetSlabDebug(page); start = page_address(page); end = start + s->objects * s->size; @@ -889,7 +904,7 @@ static void __free_slab(struct kmem_cach { int pages = 1 << s->order; - if (unlikely(PageError(page) || s->dtor)) { + if (unlikely(SlabDebug(page) || s->dtor)) { void *p; slab_pad_check(s, page); @@ -936,7 +951,8 @@ static void discard_slab(struct kmem_cac atomic_long_dec(&n->nr_slabs); reset_page_mapcount(page); - page->flags &= ~(1 << PG_slab | 1 << PG_error); + ClearSlabDebug(page); + __ClearPageSlab(page); free_slab(s, page); } @@ -1111,7 +1127,7 @@ static void putback_slab(struct kmem_cac if (page->freelist) add_partial(n, page); - else if (PageError(page) && (s->flags & SLAB_STORE_USER)) + else if (SlabDebug(page) && (s->flags & SLAB_STORE_USER)) add_full(n, page); slab_unlock(page); @@ -1195,7 +1211,7 @@ static void flush_all(struct kmem_cache * per cpu array in the kmem_cache struct. 
* * Fastpath is not possible if we need to get a new slab or have - * debugging enabled (which means all slabs are marked with PageError) + * debugging enabled (which means all slabs are marked with SlabDebug) */ static void *slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node, void *addr) @@ -1218,7 +1234,7 @@ redo: object = page->freelist; if (unlikely(!object)) goto another_slab; - if (unlikely(PageError(page))) + if (unlikely(SlabDebug(page))) goto debug; have_object: @@ -1316,7 +1332,7 @@ static void slab_free(struct kmem_cache local_irq_save(flags); slab_lock(page); - if (unlikely(PageError(page))) + if (unlikely(SlabDebug(page))) goto debug; checks_ok: prior = object[page->offset] = page->freelist; @@ -2504,12 +2520,12 @@ static void validate_slab_slab(struct km s->name, page); if (s->flags & DEBUG_DEFAULT_FLAGS) { - if (!PageError(page)) - printk(KERN_ERR "SLUB %s: PageError not set " + if (!SlabDebug(page)) + printk(KERN_ERR "SLUB %s: SlabDebug not set " "on slab 0x%p\n", s->name, page); } else { - if (PageError(page)) - printk(KERN_ERR "SLUB %s: PageError set on " + if (SlabDebug(page)) + printk(KERN_ERR "SLUB %s: SlabDebug set on " "slab 0x%p\n", s->name, page); } } -- From clameter@sgi.com Mon May 7 14:14:50 2007 Message-Id: <20070507211449.991042782@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:47 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 13/17] SLUB: Consolidate trace code Content-Disposition: inline; filename=slub_trace Trace in both slab_alloc and slab_free has a lot of common code. Use a single function for both. Signed-off-by: Christoph Lameter --- mm/slub.c | 31 ++++++++++++++++++------------- 1 file changed, 18 insertions(+), 13 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:56:25.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:57:04.000000000 -0700 @@ -807,6 +807,22 @@ fail: return 0; } +static void trace(struct kmem_cache *s, struct page *page, void *object, int alloc) +{ + if (s->flags & SLAB_TRACE) { + printk(KERN_INFO "TRACE %s %s 0x%p inuse=%d fp=0x%p\n", + s->name, + alloc ? 
"alloc" : "free", + object, page->inuse, + page->freelist); + + if (!alloc) + print_section("Object", (void *)object, s->objsize); + + dump_stack(); + } +} + /* * Slab allocation and freeing */ @@ -1291,12 +1307,7 @@ debug: goto another_slab; if (s->flags & SLAB_STORE_USER) set_track(s, object, TRACK_ALLOC, addr); - if (s->flags & SLAB_TRACE) { - printk(KERN_INFO "TRACE %s alloc 0x%p inuse=%d fp=0x%p\n", - s->name, object, page->inuse, - page->freelist); - dump_stack(); - } + trace(s, page, object, 1); init_object(s, object, 1); goto have_object; } @@ -1381,13 +1392,7 @@ debug: remove_full(s, page); if (s->flags & SLAB_STORE_USER) set_track(s, x, TRACK_FREE, addr); - if (s->flags & SLAB_TRACE) { - printk(KERN_INFO "TRACE %s free 0x%p inuse=%d fp=0x%p\n", - s->name, object, page->inuse, - page->freelist); - print_section("Object", (void *)object, s->objsize); - dump_stack(); - } + trace(s, page, object, 0); init_object(s, object, 0); goto checks_ok; } -- From clameter@sgi.com Mon May 7 14:14:50 2007 Message-Id: <20070507211450.163054003@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:48 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 14/17] SLUB: Move tracking definitions and check_valid_pointer() away from debug code Content-Disposition: inline; filename=move_tracking Move the tracking definitions and the check_valid_pointer() function away from the debugging related functions. Signed-off-by: Christoph Lameter --- mm/slub.c | 58 +++++++++++++++++++++++++++++----------------------------- 1 file changed, 29 insertions(+), 29 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:57:04.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:57:30.000000000 -0700 @@ -197,6 +197,18 @@ static enum { static DECLARE_RWSEM(slub_lock); LIST_HEAD(slab_caches); +/* + * Tracking user of a slab. + */ +struct track { + void *addr; /* Called from address */ + int cpu; /* Was running on cpu */ + int pid; /* Pid context */ + unsigned long when; /* When did the operation occur */ +}; + +enum track_item { TRACK_ALLOC, TRACK_FREE }; + #ifdef CONFIG_SYSFS static int sysfs_slab_add(struct kmem_cache *); static int sysfs_slab_alias(struct kmem_cache *, const char *); @@ -225,6 +237,23 @@ static inline struct kmem_cache_node *ge #endif } +static inline int check_valid_pointer(struct kmem_cache *s, + struct page *page, const void *object) +{ + void *base; + + if (!object) + return 1; + + base = page_address(page); + if (object < base || object >= base + s->objects * s->size || + (object - base) % s->size) { + return 0; + } + + return 1; +} + /* * Slow version of get and set free pointer. * @@ -292,18 +321,6 @@ static void print_section(char *text, u8 } } -/* - * Tracking user of a slab. 
- */ -struct track { - void *addr; /* Called from address */ - int cpu; /* Was running on cpu */ - int pid; /* Pid context */ - unsigned long when; /* When did the operation occur */ -}; - -enum track_item { TRACK_ALLOC, TRACK_FREE }; - static struct track *get_track(struct kmem_cache *s, void *object, enum track_item alloc) { @@ -438,23 +455,6 @@ static int check_bytes(u8 *start, unsign return 1; } -static inline int check_valid_pointer(struct kmem_cache *s, - struct page *page, const void *object) -{ - void *base; - - if (!object) - return 1; - - base = page_address(page); - if (object < base || object >= base + s->objects * s->size || - (object - base) % s->size) { - return 0; - } - - return 1; -} - /* * Object layout: * -- From clameter@sgi.com Mon May 7 14:14:50 2007 Message-Id: <20070507211450.341007090@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:49 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 15/17] SLUB: Add CONFIG_SLUB_DEBUG Content-Disposition: inline; filename=slub_debug CONFIG_SLUB_DEBUG can be used to switch off the debugging and sysfs components of SLUB. Thus SLUB will be able to replace SLOB. SLUB can arrange objects in a denser way than SLOB and the code size should be minimal without debugging and sysfs support. Note that CONFIG_SLUB_DEBUG is materially different from CONFIG_SLAB_DEBUG. CONFIG_SLAB_DEBUG is used to enable slab debugging in SLAB. SLUB enables debugging via a boot parameter. SLUB debug code should always be present. CONFIG_SLUB_DEBUG can be modified in the embedded config section. Signed-off-by: Christoph Lameter --- init/Kconfig | 9 ++ mm/slub.c | 189 +++++++++++++++++++++++++++++++++++------------------------ 2 files changed, 123 insertions(+), 75 deletions(-) Index: slub/init/Kconfig =================================================================== --- slub.orig/init/Kconfig 2007-05-07 13:51:41.000000000 -0700 +++ slub/init/Kconfig 2007-05-07 13:57:34.000000000 -0700 @@ -566,6 +566,15 @@ config VM_EVENT_COUNTERS on EMBEDDED systems. /proc/vmstat will only show page counts if VM event counters are disabled. +config SLUB_DEBUG + default y + bool "Enable SLUB debugging support" if EMBEDDED + help + SLUB has extensive debug support features. Disabling these can + result in significant savings in code size. This also disables + SLUB sysfs support. /sys/slab will not exist and there will be + no support for cache validation etc. 
+ choice prompt "Choose SLAB allocator" default SLUB Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:57:30.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:57:34.000000000 -0700 @@ -89,17 +89,25 @@ static inline int SlabDebug(struct page *page) { +#ifdef CONFIG_SLUB_DEBUG return PageError(page); +#else + return 0; +#endif } static inline void SetSlabDebug(struct page *page) { +#ifdef CONFIG_SLUB_DEBUG SetPageError(page); +#endif } static inline void ClearSlabDebug(struct page *page) { +#ifdef CONFIG_SLUB_DEBUG ClearPageError(page); +#endif } /* @@ -209,7 +217,7 @@ struct track { enum track_item { TRACK_ALLOC, TRACK_FREE }; -#ifdef CONFIG_SYSFS +#if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG) static int sysfs_slab_add(struct kmem_cache *); static int sysfs_slab_alias(struct kmem_cache *, const char *); static void sysfs_slab_remove(struct kmem_cache *); @@ -286,6 +294,14 @@ static inline int slab_index(void *p, st return (p - addr) / s->size; } +#ifdef CONFIG_SLUB_DEBUG +/* + * Debug settings: + */ +static int slub_debug; + +static char *slub_debug_slabs; + /* * Object debugging */ @@ -823,6 +839,97 @@ static void trace(struct kmem_cache *s, } } +static int __init setup_slub_debug(char *str) +{ + if (!str || *str != '=') + slub_debug = DEBUG_DEFAULT_FLAGS; + else { + str++; + if (*str == 0 || *str == ',') + slub_debug = DEBUG_DEFAULT_FLAGS; + else + for( ;*str && *str != ','; str++) + switch (*str) { + case 'f' : case 'F' : + slub_debug |= SLAB_DEBUG_FREE; + break; + case 'z' : case 'Z' : + slub_debug |= SLAB_RED_ZONE; + break; + case 'p' : case 'P' : + slub_debug |= SLAB_POISON; + break; + case 'u' : case 'U' : + slub_debug |= SLAB_STORE_USER; + break; + case 't' : case 'T' : + slub_debug |= SLAB_TRACE; + break; + default: + printk(KERN_ERR "slub_debug option '%c' " + "unknown. skipped\n",*str); + } + } + + if (*str == ',') + slub_debug_slabs = str + 1; + return 1; +} + +__setup("slub_debug", setup_slub_debug); + +static void kmem_cache_open_debug_check(struct kmem_cache *s) +{ + /* + * The page->offset field is only 16 bit wide. This is an offset + * in units of words from the beginning of an object. If the slab + * size is bigger then we cannot move the free pointer behind the + * object anymore. + * + * On 32 bit platforms the limit is 256k. On 64bit platforms + * the limit is 512k. + * + * Debugging or ctor/dtors may create a need to move the free + * pointer. Fail if this happens. + */ + if (s->size >= 65535 * sizeof(void *)) { + BUG_ON(s->flags & (SLAB_RED_ZONE | SLAB_POISON | + SLAB_STORE_USER | SLAB_DESTROY_BY_RCU)); + BUG_ON(s->ctor || s->dtor); + } + else + /* + * Enable debugging if selected on the kernel commandline. 
+ */ + if (slub_debug && (!slub_debug_slabs || + strncmp(slub_debug_slabs, s->name, + strlen(slub_debug_slabs)) == 0)) + s->flags |= slub_debug; +} +#else + +static inline int alloc_object_checks(struct kmem_cache *s, + struct page *page, void *object) { return 0; } + +static inline int free_object_checks(struct kmem_cache *s, + struct page *page, void *object) { return 0; } + +static inline void add_full(struct kmem_cache_node *n, struct page *page) {} +static inline void remove_full(struct kmem_cache *s, struct page *page) {} +static inline void trace(struct kmem_cache *s, struct page *page, + void *object, int alloc) {} +static inline void init_object(struct kmem_cache *s, + void *object, int active) {} +static inline void init_tracking(struct kmem_cache *s, void *object) {} +static inline int slab_pad_check(struct kmem_cache *s, struct page *page) + { return 1; } +static inline int check_object(struct kmem_cache *s, struct page *page, + void *object, int active) { return 1; } +static inline void set_track(struct kmem_cache *s, void *object, + enum track_item alloc, void *addr) {} +static inline void kmem_cache_open_debug_check(struct kmem_cache *s) {} +#define slub_debug 0 +#endif /* * Slab allocation and freeing */ @@ -1453,13 +1560,6 @@ static int slub_min_objects = DEFAULT_MI static int slub_nomerge; /* - * Debug settings: - */ -static int slub_debug; - -static char *slub_debug_slabs; - -/* * Calculate the order of allocation given an slab object size. * * The order of allocation has significant impact on performance and other @@ -1667,6 +1767,7 @@ static int calculate_sizes(struct kmem_c */ size = ALIGN(size, sizeof(void *)); +#ifdef CONFIG_SLUB_DEBUG /* * If we are Redzoning then check if there is some space between the * end of the object and the free pointer. If not then add an @@ -1674,6 +1775,7 @@ static int calculate_sizes(struct kmem_c */ if ((flags & SLAB_RED_ZONE) && size == s->objsize) size += sizeof(void *); +#endif /* * With that we have determined the number of bytes in actual use @@ -1681,6 +1783,7 @@ static int calculate_sizes(struct kmem_c */ s->inuse = size; +#ifdef CONFIG_SLUB_DEBUG if (((flags & (SLAB_DESTROY_BY_RCU | SLAB_POISON)) || s->ctor || s->dtor)) { /* @@ -1711,6 +1814,7 @@ static int calculate_sizes(struct kmem_c * of the object. */ size += sizeof(void *); +#endif /* * Determine the alignment based on various parameters that the @@ -1760,32 +1864,7 @@ static int kmem_cache_open(struct kmem_c s->objsize = size; s->flags = flags; s->align = align; - - /* - * The page->offset field is only 16 bit wide. This is an offset - * in units of words from the beginning of an object. If the slab - * size is bigger then we cannot move the free pointer behind the - * object anymore. - * - * On 32 bit platforms the limit is 256k. On 64bit platforms - * the limit is 512k. - * - * Debugging or ctor/dtors may create a need to move the free - * pointer. Fail if this happens. - */ - if (s->size >= 65535 * sizeof(void *)) { - BUG_ON(flags & (SLAB_RED_ZONE | SLAB_POISON | - SLAB_STORE_USER | SLAB_DESTROY_BY_RCU)); - BUG_ON(ctor || dtor); - } - else - /* - * Enable debugging if selected on the kernel commandline. 
- */ - if (slub_debug && (!slub_debug_slabs || - strncmp(slub_debug_slabs, name, - strlen(slub_debug_slabs)) == 0)) - s->flags |= slub_debug; + kmem_cache_open_debug_check(s); if (!calculate_sizes(s)) goto error; @@ -1956,45 +2035,6 @@ static int __init setup_slub_nomerge(cha __setup("slub_nomerge", setup_slub_nomerge); -static int __init setup_slub_debug(char *str) -{ - if (!str || *str != '=') - slub_debug = DEBUG_DEFAULT_FLAGS; - else { - str++; - if (*str == 0 || *str == ',') - slub_debug = DEBUG_DEFAULT_FLAGS; - else - for( ;*str && *str != ','; str++) - switch (*str) { - case 'f' : case 'F' : - slub_debug |= SLAB_DEBUG_FREE; - break; - case 'z' : case 'Z' : - slub_debug |= SLAB_RED_ZONE; - break; - case 'p' : case 'P' : - slub_debug |= SLAB_POISON; - break; - case 'u' : case 'U' : - slub_debug |= SLAB_STORE_USER; - break; - case 't' : case 'T' : - slub_debug |= SLAB_TRACE; - break; - default: - printk(KERN_ERR "slub_debug option '%c' " - "unknown. skipped\n",*str); - } - } - - if (*str == ',') - slub_debug_slabs = str + 1; - return 1; -} - -__setup("slub_debug", setup_slub_debug); - static struct kmem_cache *create_kmalloc_cache(struct kmem_cache *s, const char *name, int size, gfp_t gfp_flags) { @@ -2487,8 +2527,7 @@ void *__kmalloc_node_track_caller(size_t return slab_alloc(s, gfpflags, node, caller); } -#ifdef CONFIG_SYSFS - +#if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG) static int validate_slab(struct kmem_cache *s, struct page *page) { void *p; -- From clameter@sgi.com Mon May 7 14:14:50 2007 Message-Id: <20070507211450.527987346@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:50 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 16/17] SLUB: Include lifetime stats and sets of cpus / nodes in tracking output Content-Disposition: inline; filename=lifetime We have information about how long an object existed and about the nodes and cpus where the allocations and frees took place. 
Add that information to the tracking output in /sys/slab/xx/alloc_calls and /sys/slab/free_calls This will then enable slabinfo to output nice reports like this: christoph@qirst:~/slub$ ./slabinfo kmalloc-128 Slabcache: kmalloc-128 Aliases: 0 Order : 0 Sizes (bytes) Slabs Debug Memory ------------------------------------------------------------------------ Object : 128 Total : 12 Sanity Checks : On Total: 49152 SlabObj: 200 Full : 7 Redzoning : On Used : 24832 SlabSiz: 4096 Partial: 4 Poisoning : On Loss : 24320 Loss : 72 CpuSlab: 1 Tracking : On Lalig: 13968 Align : 8 Objects: 20 Tracing : Off Lpadd: 1152 kmalloc-128 has no kmem_cache operations kmalloc-128: Kernel object allocation ----------------------------------------------------------------------- 6 param_sysfs_setup+0x71/0x130 age=284512/284512/284512 pid=1 nodes=0-1,3 11 percpu_populate+0x39/0x80 age=283914/284428/284512 pid=1 nodes=0 21 __register_chrdev_region+0x31/0x170 age=282896/284347/284473 pid=1-1705 nodes=0-2 1 sys_inotify_init+0x76/0x1c0 age=283423 pid=1004 nodes=0 19 as_get_io_context+0x32/0xd0 age=6/247567/283988 pid=1-11782 nodes=0,2 10 ida_pre_get+0x4a/0x80 age=277666/283773/284526 pid=0-2177 nodes=0,2 24 kobject_kset_add_dir+0x37/0xb0 age=282727/283860/284472 pid=1-1723 nodes=0-2 1 acpi_ds_build_internal_buffer_obj+0xd3/0x11d age=284508 pid=1 nodes=0 24 con_insert_unipair+0xd7/0x110 age=284438/284438/284438 pid=1 nodes=0,2 1 uart_open+0x2d2/0x4b0 age=283896 pid=1 nodes=0 26 dma_pool_create+0x73/0x1a0 age=282762/282833/282916 pid=1705-1723 nodes=0 1 neigh_table_init_no_netlink+0xd2/0x210 age=284461 pid=1 nodes=0 2 neigh_parms_alloc+0x2b/0xe0 age=284410/284411/284412 pid=1 nodes=2 2 neigh_resolve_output+0x1e1/0x280 age=276289/276291/276293 pid=0-2443 nodes=0 1 netlink_kernel_create+0x90/0x170 age=284472 pid=1 nodes=0 4 xt_alloc_table_info+0x39/0xf0 age=283958/283958/283959 pid=1 nodes=1 3 fn_hash_insert+0x473/0x720 age=277653/277661/277666 pid=2177-2185 nodes=0 1 get_mtrr_state+0x285/0x2a0 age=284526 pid=0 nodes=0 1 cacheinfo_cpu_callback+0x26d/0x3e0 age=284458 pid=1 nodes=0 29 kernel_param_sysfs_setup+0x25/0x90 age=284511/284511/284512 pid=1 nodes=0-1,3 5 process_zones+0x5e/0x170 age=284546/284546/284546 pid=0 nodes=0 1 drm_core_init+0x48/0x160 age=284421 pid=1 nodes=2 kmalloc-128: Kernel object freeing ------------------------------------------------------------------------ 163 age=4295176847 pid=0 nodes=0-3 1 __vunmap+0x6e/0xf0 age=282907 pid=1723 nodes=0 28 free_as_io_context+0x12/0x90 age=9243/262197/283474 pid=42-11754 nodes=0 1 acpi_get_object_info+0x1b7/0x1d4 age=284475 pid=1 nodes=0 1 do_acpi_find_child+0x45/0x4e age=284475 pid=1 nodes=0 NUMA nodes : 0 1 2 3 ------------------------------------------ All slabs 7 2 2 1 Partial slabs 2 2 0 0 Signed-off-by: Christoph Lameter --- mm/slub.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 79 insertions(+), 15 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:57:34.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:57:42.000000000 -0700 @@ -1593,13 +1593,16 @@ static int calculate_order(int size) order < MAX_ORDER; order++) { unsigned long slab_size = PAGE_SIZE << order; - if (slub_max_order > order && + if (order < slub_max_order && slab_size < slub_min_objects * size) continue; if (slab_size < size) continue; + if (order >= slub_max_order) + break; + rem = slab_size % size; if (rem <= slab_size / 8) @@ -2684,6 +2687,13 @@ static void 
resiliency_test(void) {}; struct location { unsigned long count; void *addr; + long long sum_time; + long min_time; + long max_time; + long min_pid; + long max_pid; + cpumask_t cpus; + nodemask_t nodes; }; struct loc_track { @@ -2724,11 +2734,12 @@ static int alloc_loc_track(struct loc_tr } static int add_location(struct loc_track *t, struct kmem_cache *s, - void *addr) + const struct track *track) { long start, end, pos; struct location *l; void *caddr; + unsigned long age = jiffies - track->when; start = -1; end = t->count; @@ -2744,12 +2755,29 @@ static int add_location(struct loc_track break; caddr = t->loc[pos].addr; - if (addr == caddr) { - t->loc[pos].count++; + if (track->addr == caddr) { + + l = &t->loc[pos]; + l->count++; + if (track->when) { + l->sum_time += age; + if (age < l->min_time) + l->min_time = age; + if (age > l->max_time) + l->max_time = age; + + if (track->pid < l->min_pid) + l->min_pid = track->pid; + if (track->pid > l->max_pid) + l->max_pid = track->pid; + + cpu_set(track->cpu, l->cpus); + } + node_set(page_to_nid(virt_to_page(track)), l->nodes); return 1; } - if (addr < caddr) + if (track->addr < caddr) end = pos; else start = pos; @@ -2767,7 +2795,16 @@ static int add_location(struct loc_track (t->count - pos) * sizeof(struct location)); t->count++; l->count = 1; - l->addr = addr; + l->addr = track->addr; + l->sum_time = age; + l->min_time = age; + l->max_time = age; + l->min_pid = track->pid; + l->max_pid = track->pid; + cpus_clear(l->cpus); + cpu_set(track->cpu, l->cpus); + nodes_clear(l->nodes); + node_set(page_to_nid(virt_to_page(track)), l->nodes); return 1; } @@ -2783,11 +2820,8 @@ static void process_slab(struct loc_trac set_bit(slab_index(p, s, addr), map); for_each_object(p, s, addr) - if (!test_bit(slab_index(p, s, addr), map)) { - void *addr = get_track(s, p, alloc)->addr; - - add_location(t, s, addr); - } + if (!test_bit(slab_index(p, s, addr), map)) + add_location(t, s, get_track(s, p, alloc)); } static int list_locations(struct kmem_cache *s, char *buf, @@ -2821,15 +2855,45 @@ static int list_locations(struct kmem_ca } for (i = 0; i < t.count; i++) { - void *addr = t.loc[i].addr; + struct location *l = &t.loc[i]; if (n > PAGE_SIZE - 100) break; - n += sprintf(buf + n, "%7ld ", t.loc[i].count); - if (addr) - n += sprint_symbol(buf + n, (unsigned long)t.loc[i].addr); + n += sprintf(buf + n, "%7ld ", l->count); + + if (l->addr) + n += sprint_symbol(buf + n, (unsigned long)l->addr); else n += sprintf(buf + n, ""); + + if (l->sum_time != l->min_time) + n += sprintf(buf + n, " age=%ld/%ld/%ld", + l->min_time, + (unsigned long)(l->sum_time / l->count), + l->max_time); + else + n += sprintf(buf + n, " age=%ld", + l->min_time); + + if (l->min_pid != l->max_pid) + n += sprintf(buf + n, " pid=%ld-%ld", + l->min_pid, l->max_pid); + else + n += sprintf(buf + n, " pid=%ld", + l->min_pid); + + if (num_online_cpus() > 1 && !cpus_empty(l->cpus)) { + n += sprintf(buf + n, " cpus="); + n += cpulist_scnprintf(buf + n, PAGE_SIZE - n - 50, + l->cpus); + } + + if (num_online_nodes() > 1 && !nodes_empty(l->nodes)) { + n += sprintf(buf + n, " nodes="); + n += nodelist_scnprintf(buf + n, PAGE_SIZE - n - 50, + l->nodes); + } + n += sprintf(buf + n, "\n"); } -- From clameter@sgi.com Mon May 7 14:14:50 2007 Message-Id: <20070507211450.704162956@sgi.com> References: <20070507211334.710890195@sgi.com> User-Agent: quilt/0.46-1 Date: Mon, 07 May 2007 14:13:51 -0700 From: clameter@sgi.com To: akpm@linux-foundation.org Cc: linux-mm@kvack.org Subject: [patch 17/17] SLUB: Rework slab 
order determination Content-Disposition: inline; filename=fixordercalc In some cases SLUB needlessly creates slabs that are larger than slub_max_order. Also the layout of some of the slabs was not satisfactory. Switch to an iterative approach. Signed-off-by: Christoph Lameter --- mm/slub.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 52 insertions(+), 14 deletions(-) Index: slub/mm/slub.c =================================================================== --- slub.orig/mm/slub.c 2007-05-07 13:57:42.000000000 -0700 +++ slub/mm/slub.c 2007-05-07 13:57:45.000000000 -0700 @@ -1584,37 +1584,75 @@ static int slub_nomerge; * requested a higher mininum order then we start with that one instead of * the smallest order which will fit the object. */ -static int calculate_order(int size) +static inline int slab_order(int size, int min_objects, + int max_order, int fract_leftover) { int order; int rem; - for (order = max(slub_min_order, fls(size - 1) - PAGE_SHIFT); - order < MAX_ORDER; order++) { - unsigned long slab_size = PAGE_SIZE << order; + for (order = max(slub_min_order, + fls(min_objects * size - 1) - PAGE_SHIFT); + order <= max_order; order++) { - if (order < slub_max_order && - slab_size < slub_min_objects * size) - continue; + unsigned long slab_size = PAGE_SIZE << order; - if (slab_size < size) + if (slab_size < min_objects * size) continue; - if (order >= slub_max_order) - break; - rem = slab_size % size; - if (rem <= slab_size / 8) + if (rem <= slab_size / fract_leftover) break; } - if (order >= MAX_ORDER) - return -E2BIG; return order; } +static inline int calculate_order(int size) +{ + int order; + int min_objects; + int fraction; + + /* + * Attempt to find best configuration for a slab. This + * works by first attempting to generate a layout with + * the best configuration and backing off gradually. + * + * First we reduce the acceptable waste in a slab. Then + * we reduce the minimum objects required in a slab. + */ + min_objects = slub_min_objects; + while (min_objects > 1) { + fraction = 8; + while (fraction >= 4) { + order = slab_order(size, min_objects, + slub_max_order, fraction); + if (order <= slub_max_order) + return order; + fraction /= 2; + } + min_objects /= 2; + } + + /* + * We were unable to place multiple objects in a slab. Now + * lets see if we can place a single object there. + */ + order = slab_order(size, 1, slub_max_order, 1); + if (order <= slub_max_order) + return order; + + /* + * Doh this slab cannot be placed using slub_max_order. + */ + order = slab_order(size, 1, MAX_ORDER, 1); + if (order <= MAX_ORDER) + return order; + return -ENOSYS; +} + /* * Figure out what the alignment of the objects will be. */ --
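Because the new two-level search is self-contained arithmetic, it can be exercised outside the kernel. The sketch below is a userspace approximation for experimenting with the heuristic: PAGE_SHIFT, MAX_ORDER, fls_approx() and the default tuning values are stand-ins chosen for illustration rather than the kernel's configuration, and -1 stands in for -ENOSYS.

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define MAX_ORDER	11

static int slub_min_order;		/* 0: no minimum requested */
static int slub_max_order = 3;		/* illustrative default */
static int slub_min_objects = 16;	/* illustrative default */

/* Rough stand-in for the kernel's fls(). */
static int fls_approx(unsigned long x)
{
	int r = 0;

	while (x) {
		x >>= 1;
		r++;
	}
	return r;
}

static int slab_order(int size, int min_objects, int max_order,
		      int fract_leftover)
{
	unsigned long rem;
	int order;

	order = fls_approx((unsigned long)min_objects * size - 1) - PAGE_SHIFT;
	if (order < slub_min_order)
		order = slub_min_order;

	for (; order <= max_order; order++) {
		unsigned long slab_size = PAGE_SIZE << order;

		if (slab_size < (unsigned long)min_objects * size)
			continue;

		rem = slab_size % size;
		if (rem <= slab_size / fract_leftover)
			break;
	}
	return order;
}

static int calculate_order(int size)
{
	int order, min_objects, fraction;

	/* Back off first on acceptable waste, then on objects per slab. */
	for (min_objects = slub_min_objects; min_objects > 1; min_objects /= 2)
		for (fraction = 8; fraction >= 4; fraction /= 2) {
			order = slab_order(size, min_objects,
					   slub_max_order, fraction);
			if (order <= slub_max_order)
				return order;
		}

	/* Fall back to one object per slab, then to MAX_ORDER. */
	order = slab_order(size, 1, slub_max_order, 1);
	if (order <= slub_max_order)
		return order;

	order = slab_order(size, 1, MAX_ORDER, 1);
	if (order <= MAX_ORDER)
		return order;
	return -1;	/* the kernel returns -ENOSYS here */
}

int main(void)
{
	int sizes[] = { 32, 192, 1024, 5000, 70000 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("size %6d -> order %d\n",
		       sizes[i], calculate_order(sizes[i]));
	return 0;
}

Backing off first on the acceptable waste (an eighth of the slab, then a quarter) and only then on the object count keeps the result at or below slub_max_order whenever any reasonably dense layout exists, which is the point of the rework.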