[x86] 64 bit cpu_alloc configuration

Allow the use of a 1TB virtually mapped area for cpu_alloc allocations.
For UP and SMP we simply use page size mappings, since the usual per cpu
storage usage is less than 32k.
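
As a back of the envelope check (the 32k figure is just the typical
usage quoted above, not a limit enforced anywhere), a handful of 4k
pages per cpu is all that the page size mappings have to cover:

	#include <stdio.h>

	int main(void)
	{
		unsigned long typical_cpu_storage = 32 * 1024;	/* usual per cpu usage */
		unsigned long page_size = 4096;

		/* ~8 small pages per cpu cover the common case, so page size
		 * mappings do not hurt UP/SMP. */
		printf("pages per cpu: %lu\n", typical_cpu_storage / page_size);
		return 0;
	}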

For the NUMA case we use PMD mappings in order to avoid the per cpu
areas generating TLB pressure. Large systems in particular may need
large amounts of per cpu storage, since each additional node managed by
the page allocator adds to the per cpu storage needs of every processor.
Typically a single 2M segment is sufficient for small NUMA systems of up
to 16 nodes.
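
A rough sizing sketch for the PMD case (the 8k of per node data per cpu
below is an assumption made up for illustration, not something this
patch establishes):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long pmd_size = 2ULL << 20;	/* one 2M PMD mapping */
		unsigned long long per_node = 8 * 1024;		/* assumed per node data per cpu */
		unsigned long long nodes = 16;

		/* 16 nodes * 8k = 128k of per cpu storage, comfortably within
		 * a single 2M segment. */
		printf("per cpu storage: %lluk of %lluk in one PMD\n",
		       nodes * per_node >> 10, pmd_size >> 10);
		return 0;
	}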

Usually the 1TB of virtually mapped memory provides enough space for
arbitrary amounts of per cpu data. However, the following extreme cases
need to be kept in mind (mostly relevant to SGI machines):

4k cpu configurations:

Maximum mappable cpu data per cpu is 256MB (2^(40-12) bytes).
The maximum per cpu data usage will be reached on 256 node machines.

Theoretically possible 16k machines in the far future:

Maximum per cpu data can only reach 64MB (2^(40-14) bytes). These
machines may have up to 1k nodes, which will likely require about 4x the
amount of per cpu storage of a 4k cpu configuration for both the slab
and page allocators.
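
Both limits follow directly from dividing the 1TB (2^40 byte) area by
the cpu count; a quick check:

	#include <stdio.h>

	int main(void)
	{
		unsigned long long area = 1ULL << 40;	/* 1TB virtually mapped cpu area */

		/* 4k cpus: 2^(40-12) = 256MB, 16k cpus: 2^(40-14) = 64MB */
		printf("4k cpus:  %lluMB per cpu\n", (area >> 12) >> 20);
		printf("16k cpus: %lluMB per cpu\n", (area >> 14) >> 20);
		return 0;
	}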

Per cpu memory use can grow rapidly. E.g. if we assume that a per cpu
pageset occupies 64 bytes of memory and we have 3 zones per node, then
we need 3 * 1k * 16k = ~50 million pagesets, or 3072 pagesets per
processor. This results in a total of about 3.2 GB of pagesets. So each
cpu needs around 200k of per cpu storage for the page allocator alone.
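
The same numbers, spelled out (the 64 byte pageset size is the
assumption used above):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long zones = 3, nodes = 1024, cpus = 16384;
		unsigned long long pageset_size = 64;	/* assumed bytes per pageset */

		unsigned long long total = zones * nodes * cpus;	/* ~50 million */
		unsigned long long per_cpu = zones * nodes;		/* 3072 per cpu */

		printf("total pagesets:   %llu (%lluMB)\n", total,
		       total * pageset_size >> 20);
		printf("per cpu pagesets: %llu (%lluKB)\n", per_cpu,
		       per_cpu * pageset_size >> 10);
		return 0;
	}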

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 arch/x86/mm/init_64.c        |   38 ++++++++++++++++++++++++++++++++++++++
 include/asm-x86/pgtable_64.h |    5 +++++
 2 files changed, 43 insertions(+)

Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h	2007-11-03 13:32:08.717052956 -0700
+++ linux-2.6/include/asm-x86/pgtable_64.h	2007-11-03 13:58:37.789991830 -0700
@@ -138,6 +138,11 @@ static inline pte_t ptep_get_and_clear_f
 #define VMALLOC_START    _AC(0xffffc20000000000, UL)
 #define VMALLOC_END      _AC(0xffffe1ffffffffff, UL)
 #define VMEMMAP_START	 _AC(0xffffe20000000000, UL)
+#define CPU_AREA_BASE	 _AC(0xfffff20000000000, UL)
+#define CPU_AREA_BITS	 43
+#ifdef CONFIG_NUMA
+#define CPU_AREA_BLOCK_SHIFT	PMD_SHIFT
+#endif
 #define MODULES_VADDR    _AC(0xffffffff88000000, UL)
 #define MODULES_END      _AC(0xfffffffffff00000, UL)
 #define MODULES_LEN   (MODULES_END - MODULES_VADDR)
Index: linux-2.6/arch/x86/mm/init_64.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_64.c	2007-11-03 13:30:49.054553388 -0700
+++ linux-2.6/arch/x86/mm/init_64.c	2007-11-03 13:57:44.162053088 -0700
@@ -781,3 +781,41 @@ int __meminit vmemmap_populate(struct pa
 	return 0;
 }
 #endif
+
+#ifdef CONFIG_NUMA
+int __meminit cpu_area_populate(void *start, unsigned long size,
+						gfp_t flags, int node)
+{
+	unsigned long addr = (unsigned long)start;
+	unsigned long end = addr + size;
+	unsigned long next;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+
+		pgd = cpu_area_pgd_populate(addr, flags, node);
+		if (!pgd)
+			return -ENOMEM;
+		pud = cpu_area_pud_populate(pgd, addr, flags, node);
+		if (!pud)
+			return -ENOMEM;
+
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd)) {
+			pte_t entry;
+			void *p = cpu_area_alloc_block(PMD_SIZE, flags, node);
+			if (!p)
+				return -ENOMEM;
+
+			entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
+			mk_pte_huge(entry);
+			set_pmd(pmd, __pmd(pte_val(entry)));
+		}
+	}
+
+	return 0;
+}
+#endif
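
For reference, a consumer in the cpu_alloc core would be expected to hand
PMD sized blocks to cpu_area_populate() along these lines. This is only a
sketch; everything here except cpu_area_populate(), CPU_AREA_BASE, PMD_SIZE
and GFP_KERNEL is made up for illustration, and the real caller is
presumably provided elsewhere in this series:

	/* Hypothetical expansion path in the cpu_alloc core (sketch only). */
	static int expand_cpu_area_sketch(unsigned long bytes, int node)
	{
		/* Assume the range for this expansion starts at the base of
		 * the cpu area and grows in whole PMD sized blocks. */
		void *start = (void *)CPU_AREA_BASE;
		unsigned long size = ALIGN(bytes, PMD_SIZE);

		/* On NUMA this installs 2M PMD mappings backed by node local
		 * memory for the range [start, start + size). */
		return cpu_area_populate(start, size, GFP_KERNEL, node);
	}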