From david@gibson.dropbear.id.au Wed Oct 18 00:47:33 2006
Date: Wed, 18 Oct 2006 17:47:23 +1000
From: David Gibson
To: Christoph Lameter
Subject: Re: Huge performance regression in huge pages

On Tue, Oct 17, 2006 at 12:58:06PM -0700, Christoph Lameter wrote:
> Here is an implementation of a reservation scheme.
>
> We keep the number of hugetlb pages that are in flight in
> hugetlb_reserve. The increment of hugetlb_reserve is safe since it
> occurs in alloc_huge_page when we hold the hugetlb_lock anyway. We
> only decrement hugetlb_reserve when the complete allocation path has
> finished.
>
> We modify alloc_huge_page to increment hugetlb_reserve when a page
> is allocated. In addition we set a concurrent flag if an allocation
> fails but we know that there are concurrent allocations ongoing.
> Again, the determination of concurrent allocations outstanding is
> safe due to the hugetlb_lock.
>
> We then modify the allocation paths to not return OOM but a minor
> fault if we ran out of pages and we know that there are concurrent
> outstanding allocations.
>
> The mutex is then not useful anymore, so drop it.

Mm. Nice approach; some problems with the implementation:

1. As it stands there is still a (very narrow) race. Another thread
   could finish with its tentatively allocated page between our
   failing dequeue_huge_page() and the atomic_read(), leading to
   spurious failure again. (alloc-instantiate-race from the
   libhugetlbfs testsuite is still able to trigger even this small
   race.)

2. hugetlb_reserve is a terrible name, since it's unrelated to
   resv_huge_pages.

3. The resv_huge_pages accounting was also relying on the
   serialization (my fault again; the serialization was already in,
   IIRC, so I sidestepped the bunch of tricky edge cases that crop up
   for the reserve accounting). I get leaks and wraparounds in the
   value shown in /proc/meminfo with your patch applied.

4. There were some error/backout paths in which you weren't correctly
   updating the tentative page count.

Can I commend to you again the libhugetlbfs testsuite? It will pick
up a whole bunch of potential hugepage regressions (some of them
pretty subtle) in moments:

	git clone git://ozlabs.org/srv/projects/libhugetlbfs/libhugetlbfs.git

then "make check".

Below is a substantially revised version of your patch, addressing
the issues above. However, I've seen at least one "VM: killing" and
one Oops with it when running the testsuite (though not on most runs
through), so there's more debugging to do yet. It's also rather more
complex than I'd really like; suggestions for simplification
gratefully accepted.

hugepage: Smarter handling of "in-flight" allocations

It's common in practice to want to use every single allocated
hugepage in the system. This leads to some curly corner cases which
don't occur with normal pages, where we assume that if we can't at
least temporarily allocate a page, then we're stuffed anyway.

Without care, when multiple processes race to instantiate a
MAP_SHARED page, we can get spurious OOM conditions: one process
fails to allocate a huge page because it attempts the allocation
after another process has taken the last page from the pool, but
before that other process has inserted the page (which is actually
the very one we want) into the page cache. A similar situation is
possible with multiple threads racing to instantiate a MAP_PRIVATE
page.

At present we avoid this by serializing most of the hugepage fault
path with a mutex - everything from the allocation of a hugepage from
the pool until the point we're sure the page will be used.
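For concreteness, the MAP_SHARED scenario above looks roughly like
this from userspace. This is a minimal, hypothetical sketch rather
than the real alloc-instantiate-race test: the 2MB hugepage size, the
hugetlbfs mount at /mnt/huge and the file name are assumptions, and
unlike the real test it doesn't arrange the pool state or synchronize
the two touches, so it only shows the shape of the race.

/*
 * race-sketch.c: two processes touching the same MAP_SHARED hugepage.
 * Build: cc -o race-sketch race-sketch.c
 *
 * Without serialization in the hugepage fault path (the instantiation
 * mutex, or a correct replacement for it), the loser of the race can
 * see an apparently empty pool while the winner is still clearing the
 * page, and get killed by a spurious VM_FAULT_OOM instead of simply
 * finding the winner's page in the page cache.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed; adjust for your arch */

int main(void)
{
	int fd, status;
	char *p;
	pid_t child;

	fd = open("/mnt/huge/race-test", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	unlink("/mnt/huge/race-test");

	/* Shared mapping of one hugepage, set up before the fork */
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	child = fork();
	if (child < 0) {
		perror("fork");
		return 1;
	}

	/* Parent and child fault the same hugepage at (nearly) the same time */
	memset(p, child ? 0xaa : 0x55, HPAGE_SIZE);

	if (child) {
		waitpid(child, &status, 0);
		printf("child %s\n", WIFSIGNALED(status) ? "killed" : "ok");
	}
	return 0;
}

The point is just that both tasks fault the same page of the same
hugetlbfs file: whichever loses the race must end up using the page
the winner instantiated, rather than seeing a spurious OOM.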
Unfortunately this serialized section includes the clearing of
hugepages, which causes bad performance for applications which could
otherwise clear different ranges of their hugepages in parallel (a
rough userspace sketch of such a workload appears after the patch
below). Moving the clear outside the mutex is not easily possible,
because the clearing can't be moved after the set_pte(), and for the
MAP_PRIVATE case the set_pte() is the earliest point at which we
could drop the mutex. Lock ordering constraints with the
page_table_lock further complicate this approach.

This patch takes a different approach, abolishing the mutex.
Instead, it keeps a count, 'tentative_huge_pages' (protected by
hugetlb_lock), of the number of "in-flight" hugepages. The count is
incremented as pages are removed from the pool, then decremented when
one of the following happens:

  1) we add the page to the page cache, or
  2) we add the page to the pagetables, or
  3) we discover we didn't need the page we allocated (due to a race)
     and return it to the pool.

If we're unable to allocate a page while this count is non-zero, we
ensure that the fault returns VM_FAULT_MINOR rather than OOM, so the
fault will be retried.

The accounting of "reserved" hugepages (pages for shared mappings
whose availability we guarantee, but which haven't been physically
allocated yet) also relied on the serialization of the fault path, so
this patch updates that accounting as well. Because races are now
possible, we can (apparently) exhaust the reserved hugepage pool even
when filling a pre-reserved mapping, so we need to check for that
condition. Also, when a tentatively allocated hugepage which was
taken from the reserved pool turns out not to be needed, we must
return it to the reserved pool rather than to the general free pool.

Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2006-10-18 19:16:44.000000000 -0700
+++ linux-2.6/mm/hugetlb.c	2006-10-18 19:17:06.000000000 -0700
@@ -13,7 +13,6 @@
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/cpuset.h>
-#include <linux/mutex.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -28,10 +27,32 @@ static struct list_head hugepage_freelis
 static unsigned int nr_huge_pages_node[MAX_NUMNODES];
 static unsigned int free_huge_pages_node[MAX_NUMNODES];
 /*
- * Protects updates to hugepage_freelists, nr_huge_pages, and free_huge_pages
+ * Number of pages taken from the pool that we have not processed yet.
+ * This is used to avoid falsely reporting OOM conditions.
+ */
+static unsigned int tentative_huge_pages;	/* = 0 */
+
+/*
+ * Protects updates to hugepage_freelists, nr_huge_pages, free_huge_pages and
+ * tentative_huge_pages
  */
 static DEFINE_SPINLOCK(hugetlb_lock);
+
+static int page_tentative(struct page *page)
+{
+	BUG_ON(!PageCompound(page));
+	BUG_ON(page != (struct page *)page_private(page));
+	return atomic_read(&page[1]._count);
+}
+
+static void set_page_tentative(struct page *page, int value)
+{
+	BUG_ON(!PageCompound(page));
+	BUG_ON(page != (struct page *)page_private(page));
+	atomic_set(&page[1]._count, value);
+}
+
 static void clear_huge_page(struct page *page, unsigned long addr)
 {
 	int i;
@@ -95,6 +116,15 @@ static void free_huge_page(struct page *
 	INIT_LIST_HEAD(&page->lru);
 
 	spin_lock(&hugetlb_lock);
+	switch (page_tentative(page)) {
+	case 2:
+		/* tentative and from the reserved pool */
+		resv_huge_pages++;
+		/* FALLTHROUGH */
+	case 1:
+		/* tentative */
+		tentative_huge_pages--;
+	}
 	enqueue_huge_page(page);
 	spin_unlock(&hugetlb_lock);
 }
@@ -126,22 +156,46 @@ static struct page *alloc_huge_page(stru
 	struct page *page;
 
 	spin_lock(&hugetlb_lock);
-	if (vma->vm_flags & VM_MAYSHARE)
+	if (vma->vm_flags & VM_MAYSHARE) {
+		/* Draw from the pre-reserved pool.  Despite
+		 * pre-reservation it's possible to (apparently) run
+		 * out of pages here, if we have racing processes
+		 * attempting to instantiate the same page in a shared
+		 * mapping. */
+		if (!resv_huge_pages) {
+			BUG_ON(!tentative_huge_pages);
+			spin_unlock(&hugetlb_lock);
+			return ERR_PTR(-EAGAIN);
+		}
 		resv_huge_pages--;
-	else if (free_huge_pages <= resv_huge_pages)
-		goto fail;
+	} else if (free_huge_pages <= resv_huge_pages) {
+		/* Draw from the non-reserved free pool */
+		if (tentative_huge_pages)
+			page = ERR_PTR(-EAGAIN);
+		else
+			page = ERR_PTR(-ENOMEM);
+		spin_unlock(&hugetlb_lock);
+		return page;
+	}
 
 	page = dequeue_huge_page(vma, addr);
-	if (!page)
-		goto fail;
+	BUG_ON(!page);	/* We've already checked the counts */
 
+	tentative_huge_pages++;
+	set_page_tentative(page, (vma->vm_flags & VM_MAYSHARE) ? 2 : 1);
 	spin_unlock(&hugetlb_lock);
 	set_page_refcounted(page);
 	return page;
+}
 
-fail:
+static void keep_huge_page(struct page *page)
+{
+	spin_lock(&hugetlb_lock);
+	if (page_tentative(page)) {
+		tentative_huge_pages--;
+		set_page_tentative(page, 0);
+	}
 	spin_unlock(&hugetlb_lock);
-	return NULL;
 }
@@ -434,10 +488,10 @@ static int hugetlb_cow(struct mm_struct
 	page_cache_get(old_page);
 	new_page = alloc_huge_page(vma, address);
-
-	if (!new_page) {
+	if (IS_ERR(new_page)) {
 		page_cache_release(old_page);
-		return VM_FAULT_OOM;
+		return (PTR_ERR(new_page) == -EAGAIN)
+			? VM_FAULT_MINOR : VM_FAULT_OOM;
 	}
 
 	spin_unlock(&mm->page_table_lock);
@@ -449,6 +503,7 @@ static int hugetlb_cow(struct mm_struct
 		/* Break COW */
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
+		keep_huge_page(new_page);
 		/* Make the old page be freed below */
 		new_page = old_page;
 	}
@@ -477,14 +532,30 @@ int hugetlb_no_page(struct mm_struct *mm
 	 */
 retry:
 	page = find_lock_page(mapping, idx);
+	/* In the case the page exists, we want to lock it before we
+	 * check against i_size to guard against racing truncations.
+	 * In the case it doesn't exist, we have to check against
+	 * i_size before attempting to allocate a page, or we could
+	 * get the wrong error if we're also out of hugepages in the
+	 * pool (OOM instead of SIGBUS).  So the i_size test has to go
+	 * in this slightly odd location.
+	 */
+	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
+	if (idx >= size) {
+		hugetlb_put_quota(mapping);
+		if (page) {
+			unlock_page(page);
+			put_page(page);
+		}
+		return VM_FAULT_SIGBUS;
+	}
 	if (!page) {
 		if (hugetlb_get_quota(mapping))
-			goto out;
+			return VM_FAULT_SIGBUS;
 		page = alloc_huge_page(vma, address);
-		if (!page) {
+		if (IS_ERR(page)) {
 			hugetlb_put_quota(mapping);
-			ret = VM_FAULT_OOM;
-			goto out;
+			return (PTR_ERR(page) == -EAGAIN)
+				? VM_FAULT_MINOR : VM_FAULT_OOM;
 		}
 		clear_huge_page(page, address);
@@ -497,25 +568,29 @@ retry:
 				hugetlb_put_quota(mapping);
 				if (err == -EEXIST)
 					goto retry;
-				goto out;
+				return VM_FAULT_SIGBUS;
 			}
+			keep_huge_page(page);
 		} else
 			lock_page(page);
 	}
 
 	spin_lock(&mm->page_table_lock);
-	size = i_size_read(mapping->host) >> HPAGE_SHIFT;
-	if (idx >= size)
-		goto backout;
-
 	ret = VM_FAULT_MINOR;
-	if (!pte_none(*ptep))
-		goto backout;
+
+	if (!pte_none(*ptep)) {
+		spin_unlock(&mm->page_table_lock);
+		hugetlb_put_quota(mapping);
+		unlock_page(page);
+		put_page(page);
+		return VM_FAULT_MINOR;
+	}
 
 	add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
 	new_pte = make_huge_pte(vma, page, ((vma->vm_flags & VM_WRITE)
 				&& (vma->vm_flags & VM_SHARED)));
 	set_huge_pte_at(mm, address, ptep, new_pte);
+	keep_huge_page(page);
 
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
@@ -524,15 +599,7 @@ retry:
 
 	spin_unlock(&mm->page_table_lock);
 	unlock_page(page);
-out:
 	return ret;
-
-backout:
-	spin_unlock(&mm->page_table_lock);
-	hugetlb_put_quota(mapping);
-	unlock_page(page);
-	put_page(page);
-	goto out;
 }
 
 int hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -541,22 +608,14 @@ int hugetlb_fault(struct mm_struct *mm,
 	pte_t *ptep;
 	pte_t entry;
 	int ret;
-	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
 
 	ptep = huge_pte_alloc(mm, address);
 	if (!ptep)
 		return VM_FAULT_OOM;
 
-	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
-	 */
-	mutex_lock(&hugetlb_instantiation_mutex);
 	entry = *ptep;
 	if (pte_none(entry)) {
 		ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
-		mutex_unlock(&hugetlb_instantiation_mutex);
 		return ret;
 	}
@@ -568,7 +627,6 @@ int hugetlb_fault(struct mm_struct *mm,
 	if (write_access && !pte_write(entry))
 		ret = hugetlb_cow(mm, vma, address, ptep, entry);
 	spin_unlock(&mm->page_table_lock);
-	mutex_unlock(&hugetlb_instantiation_mutex);
 
 	return ret;
 }
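For reference, the sort of workload that motivates getting the
clearing out from under the mutex looks roughly like the sketch
below: several threads first-touching disjoint hugepages of a single
private mapping. This is a hypothetical illustration, not part of the
patch or of libhugetlbfs; the 2MB hugepage size, the hugetlbfs mount
at /mnt/huge and the thread count are assumptions, and it needs at
least NTHREADS free hugepages in the pool.

/*
 * parallel-touch.c: NTHREADS threads each first-touch their own
 * hugepage of one MAP_PRIVATE hugetlbfs mapping, so the kernel's page
 * clearing for those faults can (in principle) run in parallel.
 * Build: cc -o parallel-touch parallel-touch.c -lpthread
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/time.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed; adjust for your arch */
#define NTHREADS	4			/* arbitrary */

static char *map;

static void *toucher(void *arg)
{
	long i = (long)arg;

	/* The first write to each hugepage faults it in (and clears it) */
	memset(map + i * HPAGE_SIZE, 0x5a, HPAGE_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];
	struct timeval start, end;
	long i;
	int fd;

	fd = open("/mnt/huge/parallel-touch", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	unlink("/mnt/huge/parallel-touch");

	map = mmap(NULL, NTHREADS * HPAGE_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	gettimeofday(&start, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, toucher, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	gettimeofday(&end, NULL);

	printf("first-touched %d hugepages in %ld us\n", NTHREADS,
	       (long)((end.tv_sec - start.tv_sec) * 1000000
		      + (end.tv_usec - start.tv_usec)));
	return 0;
}

With the instantiation mutex in place, the first-touch clearing done
by those memset()s is serialized even though the threads touch
different pages; with the mutex gone, each thread should be able to
clear its own hugepage concurrently, which is the performance win the
description above is after.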