From david@gibson.dropbear.id.au Wed Oct 18 00:47:33 2006
Date: Wed, 18 Oct 2006 17:47:23 +1000
From: David Gibson
To: Christoph Lameter
Subject: Re: Huge performance regression in huge pages

On Tue, Oct 17, 2006 at 12:58:06PM -0700, Christoph Lameter wrote:
> Here is an implementation of a reservation scheme.
>
> We keep the number of hugetlb pages that are in flight in
> hugetlb_reserve. The increment of hugetlb_reserve is safe since it
> occurs in alloc_huge_page when we hold the hugetlb_lock anyway. We
> only decrement hugetlb_reserve when the complete allocation path has
> finished.
>
> We modify alloc_huge_page to increment hugetlb_reserve when a page
> is allocated. In addition we set a concurrent flag if an allocation
> fails but we know that there are concurrent allocations ongoing.
> Again, the determination of concurrent allocations outstanding is
> safe due to the hugetlb_lock.
>
> We then modify the allocation paths to not return OOM but a minor
> fault if we ran out of pages and we know that there are concurrent
> outstanding allocations.
>
> The mutex is then not useful anymore, so drop it.

Mm. Nice approach; some problems with the implementation:

1. As it stands there is still a (very narrow) race. Another thread
   could finish with its tentatively allocated page between our
   failing dequeue_huge_page() and the atomic_read(), leading to
   spurious failure again. (alloc-instantiate-race from the
   libhugetlbfs testsuite is still able to trigger even this small
   race.)

2. hugetlb_reserve is a terrible name, since it's unrelated to
   resv_huge_pages.

3. The resv_huge_pages accounting was also relying on the
   serialization (my fault again; the serialization was already in,
   IIRC, so I sidestepped the bunch of tricky edge cases that crop up
   for the reserve accounting). I get leaks and wraparounds in the
   value shown in /proc/meminfo with your patch applied.

4. There were some error/backout paths in which you weren't correctly
   updating the tentative page count.

Can I commend to you again the libhugetlbfs testsuite? It will pick
up a whole bunch of potential hugepage regressions (some of them
pretty subtle) in moments:

	git clone git://ozlabs.org/srv/projects/libhugetlbfs/libhugetlbfs.git

then "make check".

Below is a substantially revised version of your patch, addressing
the issues above. However, I've seen at least one "VM: killing" and
one Oops with it when running the testsuite (though not on most runs
through), so there's more debugging to do yet. It's also rather more
complex than I'd really like; suggestions for simplification
gratefully accepted.

hugepage: Smarter handling of "in-flight" allocations

It's common in practice to want to use every single allocated
hugepage in the system. This leads to some curly corner cases which
don't occur with normal pages, where we assume that if we can't at
least temporarily allocate a page, then we're stuffed anyway.

Without care, when multiple processes race to instantiate a
MAP_SHARED page, we can get spurious OOM conditions: one process
fails to allocate a huge page because it attempts the allocation
after another process has taken the last page from the pool, but
before that other process has inserted the page (which is actually
the very one we want) into the page cache. A similar situation is
possible with multiple threads racing to instantiate a MAP_PRIVATE
page.

At present we avoid this by serializing most of the hugepage fault
path with a mutex - everything from the allocation of a hugepage from
the pool until the point we're sure the page will be used.
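For concreteness, the MAP_SHARED scenario above looks roughly like
this from userspace. This is a minimal, hypothetical sketch rather
than the real alloc-instantiate-race test: the 2MB hugepage size, the
hugetlbfs mount at /mnt/huge and the file name are assumptions, and
unlike the real test it doesn't arrange the pool state or synchronize
the two touches, so it only shows the shape of the race.

/*
 * race-sketch.c: two processes touching the same MAP_SHARED hugepage.
 * Build: cc -o race-sketch race-sketch.c
 *
 * Without serialization in the hugepage fault path (the instantiation
 * mutex, or a correct replacement for it), the loser of the race can
 * see an apparently empty pool while the winner is still clearing the
 * page, and get killed by a spurious VM_FAULT_OOM instead of simply
 * finding the winner's page in the page cache.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed; adjust for your arch */

int main(void)
{
	int fd, status;
	char *p;
	pid_t child;

	fd = open("/mnt/huge/race-test", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	unlink("/mnt/huge/race-test");

	/* Shared mapping of one hugepage, set up before the fork */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	child = fork();
	if (child < 0) {
		perror("fork");
		return 1;
	}

	/* Parent and child fault the same hugepage at (nearly) the same time */
	memset(p, child ? 0xaa : 0x55, HPAGE_SIZE);

	if (child) {
		waitpid(child, &status, 0);
		printf("child %s\n", WIFSIGNALED(status) ? "killed" : "ok");
	}
	return 0;
}

The point is just that both tasks fault the same page of the same
hugetlbfs file: whichever loses the race must end up using the page
the winner instantiated, rather than seeing a spurious OOM.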
Unfortunately this serialized section includes the clearing of
hugepages, which causes bad performance for applications which could
otherwise clear different ranges of their hugepages in parallel (a
rough userspace sketch of such a workload appears after the patch
below). Moving the clear outside the mutex is not easily possible,
because the clearing can't be moved after the set_pte(), and for the
MAP_PRIVATE case the set_pte() is the earliest point at which we
could drop the mutex. Lock ordering constraints with the
page_table_lock further complicate this approach.

This patch takes a different approach, abolishing the mutex.
Instead, it keeps a count, 'tentative_huge_pages' (protected by
hugetlb_lock), of the number of "in-flight" hugepages. The count is
incremented as pages are removed from the pool, then decremented when
one of the following happens:

  1) we add the page to the page cache, or
  2) we add the page to the pagetables, or
  3) we discover we didn't need the page we allocated (due to a race)
     and return it to the pool.

If we're unable to allocate a page while this count is non-zero, we
ensure that the fault returns VM_FAULT_MINOR rather than OOM, so the
fault will be retried.

The accounting of "reserved" hugepages (pages for shared mappings
whose availability we guarantee, but which haven't been physically
allocated yet) also relied on the serialization of the fault path, so
this patch updates that accounting as well. Because races are now
possible, we can (apparently) exhaust the reserved hugepage pool even
when filling a pre-reserved mapping, so we need to check for that
condition. Also, when a tentatively allocated hugepage which was
taken from the reserved pool turns out not to be needed, we must
return it to the reserved pool rather than to the general free pool.

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2006-10-18 19:16:44.000000000 -0700
+++ linux-2.6/mm/hugetlb.c	2006-10-18 19:17:06.000000000 -0700
@@ -13,7 +13,6 @@
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
-#include <linux/mutex.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -28,10 +27,32 @@ static struct list_head hugepage_freelis
 static unsigned int nr_huge_pages_node[MAX_NUMNODES];
 static unsigned int free_huge_pages_node[MAX_NUMNODES];
 /*
- * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
+ * Number of pages taken from the pool that we have not processed yet.
+ * This is used to avoid falsely reporting OOM conditions.
+ */
+static unsigned int tentative_huge_pages;	/* = 0 */
+
+/*
+ * Protects updates to hugepage_freelists, nr_huge_pages, free_huge_pages and
+ * tentative_huge_pages
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
+
+static int page_tentative(struct page *page)
+{
+	BUG_ON(!PageCompound(page));
+	BUG_ON(page != (struct page *)page_private(page));
+	return atomic_read(&page[1]._count);
+}
+
+static void set_page_tentative(struct page *page, int value)
+{
+	BUG_ON(!PageCompound(page));
+	BUG_ON(page != (struct page *)page_private(page));
+	atomic_set(&page[1]._count, value);
+}
+
 static void clear_huge_page(struct page *page, unsigned long addr)
 {
 	int i;
@@ -95,6 +116,15 @@ static void free_huge_page(struct page *
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
+	switch (page_tentative(page)) {
+	case 2:
+		/* tentative and from the reserved pool */
+		resv_huge_pages++;
+		/* FALLTHROUGH */
+	case 1:
+		/* tentative */
+		tentative_huge_pages--;
+	}
 	enqueue_huge_page(page);
 	spin_unlock(&hugetlb_lock);
 }
@@ -126,22 +156,46 @@ static struct page *alloc_huge_page(stru
 	struct page *page;
 
 	spin_lock(&hugetlb_lock);
-	if (vma->vm_flags & VM_MAYSHARE)
+	if (vma->vm_flags & VM_MAYSHARE) {
+		/* Draw from the pre-reserved pool.  Despite
+		 * pre-reservation it's possible to (apparently) run
+		 * out of pages here, if we have racing processes
+		 * attempting to instantiate the same page in a shared
+		 * mapping. */
+		if (!resv_huge_pages) {
+			BUG_ON(!tentative_huge_pages);
+			spin_unlock(&hugetlb_lock);
+			return ERR_PTR(-EAGAIN);
+		}
 		resv_huge_pages--;
-	else if (free_huge_pages <= resv_huge_pages)
-		goto fail;
+	} else if (free_huge_pages <= resv_huge_pages) {
+		/* Draw from the non-reserved free pool */
+		if (tentative_huge_pages)
+			page = ERR_PTR(-EAGAIN);
+		else
+			page = ERR_PTR(-ENOMEM);
+		spin_unlock(&hugetlb_lock);
+		return page;
+	}
 
 	page = dequeue_huge_page(vma, addr);
-	if (!page)
-		goto fail;
+	BUG_ON(!page);	/* We've already checked the counts */
 
+	tentative_huge_pages++;
+	set_page_tentative(page, (vma->vm_flags & VM_MAYSHARE) ? 2 : 1);
 	spin_unlock(&hugetlb_lock);
 	set_page_refcounted(page);
 	return page;
+}
 
-fail:
+static void keep_huge_page(struct page *page)
+{
+	spin_lock(&hugetlb_lock);
+	if (page_tentative(page)) {
+		tentative_huge_pages--;
+		set_page_tentative(page, 0);
+	}
 	spin_unlock(&hugetlb_lock);
-	return NULL;
 }
@@ -434,10 +488,10 @@ static int hugetlb_cow(struct mm_struct
 	page_cache_get(old_page);
 	new_page = alloc_huge_page(vma, address);
-
-	if (!new_page) {
+	if (IS_ERR(new_page)) {
 		page_cache_release(old_page);
-		return VM_FAULT_OOM;
+		return (PTR_ERR(new_page) == -EAGAIN)
+			? VM_FAULT_MINOR : VM_FAULT_OOM;
 	}
 
 	spin_unlock(&mm->page_table_lock);
@@ -449,6 +503,7 @@ static int hugetlb_cow(struct mm_struct
 		/* Break COW */
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
+		keep_huge_page(new_page);
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
@@ -477,14 +532,30 @@ int hugetlb_no_page(struct mm_struct *mm
 	 */
 retry:
 	page = find_lock_page(mapping, idx);
+	/* In the case the page exists, we want to lock it before we
+	 * check against i_size to guard against racing truncations.
+	 * In the case it doesn't exist, we have to check against
+	 * i_size before attempting to allocate a page, or we could
+	 * get the wrong error if we're also out of hugepages in the
+	 * pool (OOM instead of SIGBUS).  So the i_size test has to go
+	 * in this slightly odd location.
+	 */
+	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+	if (idx >= size) {
+		hugetlb_put_quota(mapping);
+		if (page) {
+			unlock_page(page);
+			put_page(page);
+		}
+		return VM_FAULT_SIGBUS;
+	}
 	if (!page) {
 		if (hugetlb_get_quota(mapping))
-			goto out;
+			return VM_FAULT_SIGBUS;
 		page = alloc_huge_page(vma, address);
-		if (!page) {
+		if (IS_ERR(page)) {
 			hugetlb_put_quota(mapping);
-			ret = VM_FAULT_OOM;
-			goto out;
+			return (PTR_ERR(page) == -EAGAIN)
+				? VM_FAULT_MINOR : VM_FAULT_OOM;
 		}
 		clear_huge_page(page, address);
@@ -497,25 +568,29 @@ retry:
 				hugetlb_put_quota(mapping);
 				if (err == -EEXIST)
 					goto retry;
-				goto out;
+				return VM_FAULT_SIGBUS;
 			}
+			keep_huge_page(page);
 		} else
 			lock_page(page);
 	}
 
 	spin_lock(&mm->page_table_lock);
-	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
-	if (idx >= size)
-		goto backout;
-
 	ret = VM_FAULT_MINOR;
-	if (!pte_none(*ptep))
-		goto backout;
+
+	if (!pte_none(*ptep)) {
+		spin_unlock(&mm->page_table_lock);
+		hugetlb_put_quota(mapping);
+		unlock_page(page);
+		put_page(page);
+		return VM_FAULT_MINOR;
+	}
 
 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
+	keep_huge_page(page);
 
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
@@ -524,15 +599,7 @@ retry:
 
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
-out:
 	return ret;
-
-backout:
-	spin_unlock(&mm->page_table_lock);
-	hugetlb_put_quota(mapping);
-	unlock_page(page);
-	put_page(page);
-	goto out;
 }
 
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -541,22 +608,14 @@ int hugetlb_fault(struct mm_struct *mm,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
-	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
-	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = *ptep;
 	if (pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
 		return ret;
 	}
@@ -568,7 +627,6 @@ int hugetlb_fault(struct mm_struct *mm,
 	if (write_access && !pte_write(entry))
 		ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
 }
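For reference, the sort of workload that motivates getting the
clearing out from under the mutex looks roughly like the sketch
below: several threads first-touching disjoint hugepages of a single
private mapping. This is a hypothetical illustration, not part of the
patch or of libhugetlbfs; the 2MB hugepage size, the hugetlbfs mount
at /mnt/huge and the thread count are assumptions, and it needs at
least NTHREADS free hugepages in the pool.

/*
 * parallel-touch.c: NTHREADS threads each first-touch their own
 * hugepage of one MAP_PRIVATE hugetlbfs mapping, so the kernel's page
 * clearing for those faults can (in principle) run in parallel.
 * Build: cc -o parallel-touch parallel-touch.c -lpthread
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/time.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed; adjust for your arch */
#define NTHREADS	4			/* arbitrary */

static char *map;

static void *toucher(void *arg)
{
	long i = (long)arg;

	/* The first write to each hugepage faults it in (and clears it) */
	memset(map + i * HPAGE_SIZE, 0x5a, HPAGE_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	struct timeval start, end;
	long i;
	int fd;

	fd = open("/mnt/huge/parallel-touch", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	unlink("/mnt/huge/parallel-touch");

	map = mmap(NULL, NTHREADS * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	gettimeofday(&start, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, toucher, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	gettimeofday(&end, NULL);

	printf("first-touched %d hugepages in %ld us\n", NTHREADS,
	       (long)((end.tv_sec - start.tv_sec) * 1000000
		      + (end.tv_usec - start.tv_usec)));
	return 0;
}

With the instantiation mutex in place, the first-touch clearing done
by those memset()s is serialized even though the threads touch
different pages; with the mutex gone, each thread should be able to
clear its own hugepage concurrently, which is the performance win the
description above is after.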