From david@gibson.dropbear.id.au Sun Oct 22 21:12:23 2006
Date: Mon, 23 Oct 2006 14:12:08 +1000
From: David Gibson
To: Christoph Lameter
Subject: Re: Huge performance regression in huge pages

On Mon, Oct 23, 2006 at 01:29:42PM +1000, David Gibson wrote:
> On Fri, Oct 20, 2006 at 04:19:51PM +1000, David Gibson wrote:
> > On Fri, Oct 20, 2006 at 02:13:16PM +1000, David Gibson wrote:
> > > On Thu, Oct 19, 2006 at 08:58:19AM -0700, Christoph Lameter wrote:
> > > > On Thu, 19 Oct 2006, David Gibson wrote:
> > > >
> > > > > Ok, I think I've come up with a much simpler approach.  Instead of
> > > > > a tentative_pages count, we maintain an (atomic_t) generation count
> > > > > of hugepage faults.  On entry to hugetlb_fault() we do
> > > > > atomic_inc_return() on it.  Then on the exit path(s) from
> > > > > hugetlb_fault() we check whether the count has changed from what we
> > > > > recorded on entry, and if it has we convert VM_FAULT_OOM to
> > > > > VM_FAULT_MINOR.  There would still need to be some tweaks to
> > > > > properly account resv_huge_pages in the case of concurrent faults,
> > > > > of course.
> > > >
> > > > Ahh... Good idea.
> > >
> > > Not good enough, though.  The last process to enter the fault handler
> > > - which is the one that will actually get the OOM - won't see the
> > > generation count tick over, because that increment happened before it
> > > entered.  To make this work we would need both a generation count and
> > > a pending_faults count, or similar.
> > >
> > > Sigh.  I hope you're beginning to see why I took the lazy option and
> > > just serialized the fault path last time...
> >
> > Well, I thought I'd figured out a way to fix the generation count
> > approach - by incrementing both on entry and exit, and checking that
> > the count was even as well as unchanged from entry.  Then I realized
> > that's broken with more than two CPUs.  I worked out (theoretically) a
> > way around *that*, with clever congruence tricks assigning a different
> > prime number to each CPU, before deciding that was just getting silly.
> >
> > Talked to paulus, who suggested what seems like a much simpler idea:
> > still have a mutex, but instead of a single global mutex, have a table
> > of them, indexed by a hash of the struct address_space and file
> > offset.  If we have 2*NR_CPUS available mutexes, we should almost
> > never lose parallelism.  I'm now in the process of convincing myself
> > that this really does still squash all the spurious allocation races
> > (in particular, thinking about cases where a set of racing allocators
> > might cause an OOM in another, unrelated process).
> >
> > I will try implementing it on Monday.  For now I have some other
> > hugepage patches I'd better test.
>
> Ok, implemented now.  How does the patch below work on your
> parallelized startup app?

Actually, better to try the version below.  The hash function is
cleaned up and made a bit better.  Still not sure that it's great,
though.

hugepage: Allow parallelization of the hugepage fault path

At present, the page fault path for hugepages is serialized by a
single mutex.  This is used to avoid spurious out-of-memory conditions
when the hugepage pool is fully utilized (two processes or threads can
race to instantiate the same mapping with the last hugepage from the
pool, with the race loser returning VM_FAULT_OOM).  This problem is
specific to hugepages, because it is normal to want to use every
single hugepage in the system - with normal pages we simply assume
there will always be a few spare pages which can be used temporarily
until the race is resolved.

Unfortunately, this serialization also means that clearing of
hugepages cannot be parallelized across multiple CPUs, which can lead
to very long process startup times when using large numbers of
hugepages.

This patch improves the situation by replacing the single mutex with a
table of mutexes, selected by hashing the address_space and file
offset being faulted (or the mm and virtual address, for MAP_PRIVATE
mappings).

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c	2006-10-23 13:50:31.000000000 +1000
+++ working-2.6/mm/hugetlb.c	2006-10-23 13:58:54.000000000 +1000
@@ -32,6 +32,13 @@ static unsigned int free_huge_pages_node
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
 
+/*
+ * Serializes faults on the same logical page.  This is used to
+ * prevent spurious OOMs when the hugepage pool is fully utilized.
+ */
+static int num_fault_mutexes;
+static struct mutex *htlb_fault_mutex_table;
+
 static void clear_huge_page(struct page *page, unsigned long addr)
 {
 	int i;
@@ -160,6 +167,13 @@ static int __init hugetlb_init(void)
 	}
 	max_huge_pages = free_huge_pages = nr_huge_pages = i;
 	printk("Total HugeTLB memory allocated, %ld\n", free_huge_pages);
+
+	num_fault_mutexes = 2 * num_possible_cpus() - 1;
+	htlb_fault_mutex_table =
+		kmalloc(sizeof(struct mutex) * num_fault_mutexes, GFP_KERNEL);
+	for (i = 0; i < num_fault_mutexes; i++)
+		mutex_init(&htlb_fault_mutex_table[i]);
+
 	return 0;
 }
 module_init(hugetlb_init);
@@ -458,19 +472,14 @@ static int hugetlb_cow(struct mm_struct 
 }
 
 int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep, int write_access)
+		    struct address_space *mapping, unsigned long idx,
+		    unsigned long address, pte_t *ptep, int write_access)
 {
 	int ret = VM_FAULT_SIGBUS;
-	unsigned long idx;
 	unsigned long size;
 	struct page *page;
-	struct address_space *mapping;
 	pte_t new_pte;
 
-	mapping = vma->vm_file->f_mapping;
-	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
-		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
-
 	/*
 	 * Use page lock to guard against racing truncation
 	 * before we get page_table_lock.
@@ -535,28 +544,50 @@ backout:
 	goto out;
 }
 
+static int fault_mutex_hash(struct mm_struct *mm, struct vm_area_struct *vma,
+			    struct address_space *mapping,
+			    unsigned long pagenum, unsigned long address)
+{
+	void *p = mapping;
+
+	if (!(vma->vm_flags & VM_SHARED)) {
+		p = mm;
+		pagenum = address >> HPAGE_SHIFT;
+	}
+
+	return ((unsigned long)p + pagenum) % num_fault_mutexes;
+}
+
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, int write_access)
 {
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+	struct address_space *mapping;
+	unsigned long idx;
+	int hash;
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
+	mapping = vma->vm_file->f_mapping;
+	idx = ((address - vma->vm_start) >> HPAGE_SHIFT)
+		+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
+
 	/*
 	 * Serialize hugepage allocation and instantiation, so that we don't
 	 * get spurious allocation failures if two CPUs race to instantiate
 	 * the same page in the page cache.
 	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
+	hash = fault_mutex_hash(mm, vma, mapping, idx, address);
+	mutex_lock(&htlb_fault_mutex_table[hash]);
 
 	entry = *ptep;
 	if (pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
+		ret = hugetlb_no_page(mm, vma, mapping, idx,
+				      address, ptep, write_access);
+		mutex_unlock(&htlb_fault_mutex_table[hash]);
 		return ret;
 	}
 
@@ -568,7 +599,7 @@ int hugetlb_fault(struct mm_struct *mm, 
 	if (write_access && !pte_write(entry))
 		ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
+	mutex_unlock(&htlb_fault_mutex_table[hash]);
 
 	return ret;
 }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
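
P.S. For anyone who wants to poke at this without the original
"parallelized startup app", below is a hypothetical userspace sketch
(not part of the patch) of the kind of workload it addresses: each
thread touches a different hugepage of a shared hugetlbfs mapping, so
with the mutex table the faults - and the page clearing - can proceed
on different CPUs in parallel.  The /mnt/huge mount point, the 16MB
hugepage size and the thread count are illustrative assumptions only.

/*
 * Hypothetical test sketch: fault in NTHREADS hugepages concurrently.
 * Assumes hugetlbfs is mounted at /mnt/huge with at least NTHREADS
 * free hugepages of HPAGE_SIZE.  Build with: cc -pthread test.c
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(16UL * 1024 * 1024)	/* assumed hugepage size */
#define NTHREADS	4			/* assumed thread count */

static char *map;

static void *touch_page(void *arg)
{
	long i = (long)arg;

	/* First write instantiates (and clears) this thread's hugepage */
	memset(map + i * HPAGE_SIZE, 0, HPAGE_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	long i;
	int fd;

	fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	map = mmap(NULL, NTHREADS * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Each thread faults a different hugepage at the same time */
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, touch_page, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	munmap(map, NTHREADS * HPAGE_SIZE);
	close(fd);
	unlink("/mnt/huge/testfile");
	return 0;
}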