From: Andy Whitcroft

To summarise the problem: the buddy allocator currently requires that boundaries between zones occur on MAX_ORDER boundaries. The specific case where we were tripping up on this was x86 with NUMA enabled. There we try to ensure that each node's struct pages are in node-local memory; in order to allow them to be virtually mapped we have to reduce the size of ZONE_NORMAL. Here we round the remap space up to a large page size so that large-page TLB entries can be used. However, large pages are smaller than MAX_ORDER blocks, and this can lead to bad buddy merges. With VM_DEBUG enabled we detect the attempts to merge across this boundary and panic.

We have two basic options: we can either apply the appropriate alignment when we make the NUMA remap space, or we can 'fix' the assumption in the buddy allocator. The fix for the buddy allocator involves adding conditionals to the free fast path, so it seems reasonable to at least favour realigning the remap space.

Following this email are 3 patches:

zone-init-check-and-report-unaligned-zone-boundaries
	-- introduces a zone alignment helper, and uses it to add a check
	   to zone initialisation for unaligned zone boundaries,

x86-align-highmem-zone-boundaries-with-NUMA
	-- uses the zone alignment helper to align the end of ZONE_NORMAL
	   after the remap space has been reserved, and

zone-allow-unaligned-zone-boundaries
	-- modifies the buddy allocator so that we can allow unaligned
	   zone boundaries.  A new configuration option is added to enable
	   this functionality.

The first two are the fixes for alignment in x86; these fix the panics thrown when VM_DEBUG is enabled. The last is a patch to support unaligned zone boundaries. As this (re)introduces a zone check into the free hot path it seems reasonable to only enable this should it be needed; for example we never need this if we have a single zone. I have tested the failing system with this patch enabled and it also fixes the panic.
I am inclined to suggest that it be included, as it very clearly documents the alignment requirements for the buddy allocator.

This patch:

The buddy allocator has a requirement that boundaries between contiguous zones occur aligned with MAX_ORDER ranges. Where they do not, we will incorrectly merge pages across zone boundaries. Add a check during zone initialisation to warn when zone boundaries do not maintain this requirement.

The buddy allocator coalesces a newly freed page if it finds that the buddy of that page is also free. A buddy is eligible for coalescing if the page* for its left-hand end is marked free and the buddy is complete (ie all pages in the buddy are free). The buddy allocator takes no account of the node or zone in which the page* is located. This places an alignment requirement on zones: where two zones meet, they must do so aligned with a MAX_ORDER range, else we will pick up buddies from both sides of the boundary and merge them into one of the zones. This is very undesirable as it can lead to memory from the wrong zone being handed out.

This patch adds checks during zone initialisation to detect misaligned zone boundaries and report them.
Signed-off-by: Andy Whitcroft
Signed-off-by: Andrew Morton
---

 include/linux/mmzone.h |    5 +++++
 mm/page_alloc.c        |    5 +++++
 2 files changed, 10 insertions(+)

diff -puN include/linux/mmzone.h~zone-init-check-and-report-unaligned-zone-boundaries include/linux/mmzone.h
--- 25/include/linux/mmzone.h~zone-init-check-and-report-unaligned-zone-boundaries	Fri May 26 15:49:37 2006
+++ 25-akpm/include/linux/mmzone.h	Fri May 26 15:49:37 2006
@@ -388,6 +388,11 @@ static inline int is_dma(struct zone *zo
 	return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
 }
 
+static inline unsigned long zone_boundary_align_pfn(unsigned long pfn)
+{
+	return pfn & ~((1 << MAX_ORDER) - 1);
+}
+
 /* These two functions are used to setup the per zone pages min values */
 struct ctl_table;
 struct file;
diff -puN mm/page_alloc.c~zone-init-check-and-report-unaligned-zone-boundaries mm/page_alloc.c
--- 25/mm/page_alloc.c~zone-init-check-and-report-unaligned-zone-boundaries	Fri May 26 15:49:37 2006
+++ 25-akpm/mm/page_alloc.c	Fri May 26 15:49:44 2006
@@ -2087,6 +2087,11 @@ static void __init free_area_init_core(s
 		nr_kernel_pages += realsize;
 		nr_all_pages += realsize;
 
+		if (zone_boundary_align_pfn(zone_start_pfn) !=
+				zone_start_pfn && j != 0 && size != 0)
+			printk(KERN_CRIT "node %d zone %s misaligned "
+					"start pfn\n", nid, zone_names[j]);
+
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
 		zone->name = zone_names[j];
_