From aneesh.kumar@linux.vnet.ibm.com Wed Jan 9 10:59:16 2008
Date: Thu, 10 Jan 2008 00:28:59 +0530
From: Aneesh Kumar K.V
To: Christoph Lameter
Cc: Nishanth Aravamudan, KAMEZAWA Hiroyuki, linux-mm@kvack.org,
    lee.schermerhorn@hp.com, bob.picco@hp.com, mel@skynet.ie
Subject: Re: [BUG] at mm/slab.c:3320

On Wed, Jan 09, 2008 at 09:50:56AM -0800, Christoph Lameter wrote:
> On Tue, 8 Jan 2008, Nishanth Aravamudan wrote:
>
> > Do we (perhaps you already have done so, Christoph), want to validate
> > any other users of numa_node_id() that then make assumptions about the
> > characteristics of the nid? Hrm, that sounds good in theory, but seems
> > hard in practice?
>
> Hmmm... The main allocs are the slab allocations. If we fallback in
> kmalloc etc then we are fine for the common case. SLUB falls back
> correctly. Its just the weird nesting of functions in SLAB that has made
> this a bit difficult for that allocator.
>

This patch didn't work. I still see

------------[ cut here ]------------
kernel BUG at mm/slab.c:3323!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in:

Pid: 0, comm: swapper Not tainted (2.6.24-rc5-autokern1 #1)
EIP: 0060:[] EFLAGS: 00010046 CPU: 0
EIP is at ____cache_alloc_node+0x1c/0x130
EAX: e2c005c0 EBX: 00000000 ECX: 00000001 EDX: 000000d0
ESI: 00000000 EDI: e2c005c0 EBP: c03fef68 ESP: c03fef48
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process swapper (pid: 0, ti=c03fe000 task=c03cbd80 task.ti=c03fe000)
Stack: c03cbd80 c03fef60 c017ac2a 00000001 000000d0 00000000 000000d0 e2c005c0
       c03fef7c c018156a 0002080c 00099800 00000000 c03fefa8 c0181a90 22222222
       22222222 00000246 c01395b5 000000d0 e2c005c0 0002080c 00099800 c03d2cec
Call Trace:
 [] show_trace_log_lvl+0x19/0x2e
 [] show_stack_log_lvl+0x99/0xa1
 [] show_registers+0xb3/0x1e9
 [] die+0x11b/0x1fe
 [] do_trap+0x8e/0xa8
 [] do_invalid_op+0x88/0x92
 [] error_code+0x72/0x78
 [] alternate_node_alloc+0x5b/0x60
 [] kmem_cache_alloc+0x56/0x272
 [] create_pid_cachep+0x4c/0xec
 [] pidmap_init+0x2f/0x6e
 [] start_kernel+0x1ca/0x23e
 [<00000000>] 0x0
 =======================
Code: ff eb 02 31 ff 89 f8 83 c4 10 5b 5e 5f 5d c3 55 89 e5 57 89 c7 56 53 83 ec 14 89 55 f0 89 4d ec 8b b4 88 88 02 00 00 85 f6 75 04 <0f> 0b eb fe e8 f4 ee ff ff 8d 46 24 89 45 e4 e8 c0 0e 17 00 8b
EIP: [] ____cache_alloc_node+0x1c/0x130 SS:ESP 0068:c03fef48
Kernel panic - not syncing: Attempted to kill the idle task!
-- 0:conmux-control -- time-stamp -- Jan/09/08 10:21:55 --
-- 0:conmux-control -- time-stamp -- Jan/09/08 10:33:39 --
(bot:conmon-payload) disconnected

Index: linux-2.6/mm/slab.c
===================================================================
--- linux-2.6.orig/mm/slab.c	2008-01-03 12:26:42.000000000 -0800
+++ linux-2.6/mm/slab.c	2008-01-09 15:59:49.000000000 -0800
@@ -2977,7 +2977,10 @@ retry:
 	}
 	l3 = cachep->nodelists[node];
 
-	BUG_ON(ac->avail > 0 || !l3);
+	if (!l3)
+		return NULL;
+
+	BUG_ON(ac->avail > 0);
 	spin_lock(&l3->list_lock);
 
 	/* See if we can refill from the shared array */
@@ -3224,7 +3227,7 @@ static void *alternate_node_alloc(struct
 		nid_alloc = cpuset_mem_spread_node();
 	else if (current->mempolicy)
 		nid_alloc = slab_node(current->mempolicy);
-	if (nid_alloc != nid_here)
+	if (nid_alloc != nid_here && node_state(nid_alloc, N_NORMAL_MEMORY))
 		return ____cache_alloc_node(cachep, flags, nid_alloc);
 	return NULL;
 }
@@ -3439,8 +3442,14 @@ __do_cache_alloc(struct kmem_cache *cach
 	 * We may just have run out of memory on the local node.
 	 * ____cache_alloc_node() knows how to locate memory on other nodes
 	 */
-	if (!objp)
-		objp = ____cache_alloc_node(cache, flags, numa_node_id());
+	if (!objp) {
+		int node_id = numa_node_id();
+		if (likely(cache->nodelists[node_id])) /* fast path */
+			objp = ____cache_alloc_node(cache, flags, node_id);
+		else /* this function can do good fallback */
+			objp = __cache_alloc_node(cache, flags, node_id,
+					__builtin_return_address(0));
+	}
 
   out:
 	return objp;
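
To illustrate what the last hunk of the patch is aiming for, here is a minimal user-space sketch in C. It is not kernel code: struct node_list, struct cache_model, alloc_on_node() and do_cache_alloc_model() are hypothetical names made up for this illustration, MAX_NODES is an arbitrary value, and a NULL nodelists[] entry stands in for a memoryless node. The model only shows the decision itself: use the local node when it has a list, otherwise look at the other nodes instead of hitting the BUG.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 4	/* hypothetical node count for this model */

/* Hypothetical stand-in for the per-node list (l3 in mm/slab.c). */
struct node_list {
	int free_objects;
};

/*
 * Hypothetical stand-in for struct kmem_cache. A NULL entry models a
 * node without memory, i.e. cachep->nodelists[node] == NULL.
 */
struct cache_model {
	struct node_list *nodelists[MAX_NODES];
};

/*
 * Try to take one object from a single node. Returns the node id that
 * served the request, or -1 if the node has no list or no free objects.
 */
static int alloc_on_node(struct cache_model *c, int node)
{
	assert(node >= 0 && node < MAX_NODES);

	struct node_list *l3 = c->nodelists[node];

	if (!l3 || l3->free_objects == 0)
		return -1;
	l3->free_objects--;
	return node;
}

/*
 * Model of the patched decision: take the fast path only when the
 * local node has a list, otherwise (or on failure) scan the others.
 */
static int do_cache_alloc_model(struct cache_model *c, int local_node)
{
	int served = -1;

	if (c->nodelists[local_node])	/* fast path */
		served = alloc_on_node(c, local_node);

	/* fallback: any other node that can satisfy the request */
	for (int n = 0; served < 0 && n < MAX_NODES; n++)
		if (n != local_node)
			served = alloc_on_node(c, n);

	return served;
}
```

With node 0 memoryless and node 1 holding objects, a request from node 0 is served by node 1 in this model, which is the behaviour the fallback path is meant to provide.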