Subject: transparent hugepage support documentation
From: Andrea Arcangeli

Documentation/vm/transhuge.txt

Signed-off-by: Andrea Arcangeli
---
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,194 @@
+= Transparent Hugepage Support =
+
+== Objective ==
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent Hugepage Support is an alternative to
+libhugetlbfs that offers the same features but without the
+shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc, etc.).
+
+In the future it can expand over the pagecache layer, starting with
+tmpfs, to reduce hugetlbfs usage even further.
+
+Applications run faster because of two factors. The first factor is
+almost completely irrelevant and not of significant interest, because
+it comes with the downside of requiring larger clear-page and
+copy-page operations in page faults, which is potentially negative:
+it consists in taking a single page fault for each 2M virtual region
+touched by userland (so reducing the enter/exit kernel frequency by a
+factor of 512). This only matters the first time the memory is
+accessed for the lifetime of a memory mapping. The second factor is
+long lasting and much more important: it affects all subsequent
+accesses to the memory for the whole runtime of the application. It
+consists of two components: 1) the TLB miss will run faster
+(especially with virtualization using nested pagetables, but also on
+bare metal without virtualization) and 2) a single TLB entry will be
+mapping a much larger amount of virtual memory, in turn reducing the
+number of TLB misses. With virtualization and nested pagetables the
+TLB entries can only map the larger size if both KVM and the Linux
+guest are using hugepages, but a significant speedup already happens
+if only one of the two is using hugepages, simply because the TLB
+miss runs faster.
+
+== Design ==
+
+- hugepages have to be swappable and be handled as any other page by
+  the core Linux VM
+
+- if a hugepage allocation fails because of memory fragmentation,
+  regular pages should be gracefully allocated instead and mixed in
+  the same vma without any failure or significant delay and generally
+  without userland noticing
+
+- if some task quits and more hugepages become available (either
+  immediately in the buddy or through the VM), guest physical memory
+  backed by regular pages should be relocated to hugepages
+  automatically (with khugepaged)
+
+- it doesn't require boot-time memory reservation and in turn it uses
+  hugepages whenever possible (the only possible reservation here is
+  kernelcore= to prevent unmovable pages from fragmenting all the
+  memory, but such a tweak is not specific to transparent hugepage
+  support; it's a generic feature that applies to all dynamic high
+  order allocations in the kernel)
+
+- this initial support only offers the feature in anonymous memory
+  regions, but it'd be ideal to move it to tmpfs and the pagecache
+  later
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or for other movable (or even
+unmovable) entities. It doesn't require reservation to prevent
+hugepage allocation failures from being noticeable to userland. It
+allows paging and all other advanced VM features to be available on
+the hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications can however be further optimized to take maximum
+advantage of this feature, much as they've been optimized in the past
+to avoid a flood of mmap system calls for every malloc(4k).
+Optimizing userland is far from mandatory, and khugepaged can already
+take care of long lived page allocations even for hugepage unaware
+applications that deal with large amounts of memory.
+
+In certain cases, when hugepages are enabled system wide, an
+application may end up allocating more memory resources: it may mmap
+a large region but only touch 1 byte of it, in which case a 2M page
+might be allocated instead of a 4k page for no good reason. This is
+why it's possible to disable hugepages system-wide and to only have
+them inside MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions,
+to eliminate any risk of wasting precious bytes of memory and to be
+guaranteed to only run faster.
+
+Applications that get a lot of benefit from hugepages, and that don't
+risk losing memory by using them, should use madvise(MADV_HUGEPAGE)
+on their critical mmapped regions.
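+
+For example, a minimal sketch of what such an application could do
+(the region size is arbitrary, and the fallback define is only needed
+with libc headers old enough not to know about MADV_HUGEPAGE):
+
+#include <stdio.h>
+#include <sys/mman.h>
+
+#ifndef MADV_HUGEPAGE
+#define MADV_HUGEPAGE 14	/* Linux value, for older libc headers */
+#endif
+
+#define REGION_SIZE (256UL * 1024 * 1024)	/* arbitrary example size */
+
+int main(void)
+{
+	void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
+		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap");
+		return 1;
+	}
+	/* hint that this region benefits from hugepages; with the
+	 * "madvise" system-wide setting this is what enables them */
+	if (madvise(p, REGION_SIZE, MADV_HUGEPAGE))
+		perror("madvise");
+	/* ... use p as long lived, frequently accessed memory ... */
+	munmap(p, REGION_SIZE);
+	return 0;
+}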
+
+== sysfs ==
+
+Transparent Hugepage Support can be entirely disabled (mostly for
+debugging purposes), or only enabled inside MADV_HUGEPAGE regions (to
+avoid the risk of consuming more memory resources), or enabled system
+wide. This can be achieved with one of:
+
+echo always >/sys/kernel/mm/transparent_hugepage/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit the VM's defrag efforts to the hugepage
+allocations of madvise regions (in case hugepages aren't immediately
+free), or to never try to defrag memory and simply fall back to
+regular pages unless hugepages are immediately available. Clearly, if
+we spend CPU time defragging memory, we would expect to gain even
+more later by using hugepages instead of regular pages. This isn't
+always guaranteed, but it's more likely when the allocation is for a
+MADV_HUGEPAGE region.
+
+echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+khugepaged can also be instructed to scan all regions that are large
+enough to fit hugepages, or only the ones large enough inside madvise
+regions, or none (in which case the khugepaged kernel daemon will
+quit):
+
+echo always >/sys/kernel/mm/transparent_hugepage/khugepaged/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/khugepaged/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/khugepaged/enabled
+
+khugepaged usually runs at low frequency, so while one may not want
+to invoke defrag algorithms synchronously during page faults, it
+should be worth invoking defrag at least in khugepaged. However it's
+also possible to disable defrag in khugepaged:
+
+echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass
+(you can set this to 0 to run khugepaged at 100% utilization of one
+core):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged after a hugepage
+allocation failure, to throttle the next allocation attempt:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+and in the number of completed passes:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+== Need of restart ==
+
+The sysfs control values only affect future behavior, so to make them
+effective you need to restart any application that could have been
+using hugepages. This also applies to the regions scanned by
+khugepaged.
+
+== GUP ==
+
+get_user_pages and follow_page, if run on a hugepage, will return the
+head or tail pages as usual (exactly as they would on hugetlbfs).
+Most gup users will only care about the actual physical address of
+the page and its temporary pinning (to be released after the I/O is
+complete), so they won't ever notice the fact that the page is huge.
+But if any driver is going to poke into the page structure of the
+tail page (like checking page->mapping or other bits that are
+relevant for the head page and not for the tail page), it should be
+updated to check the head page instead (while properly serializing
+against split_huge_page() to avoid the head and tail pages
+disappearing from under it); a rough sketch follows at the end of
+this section.
+
+If a caller can't handle compound pages returned by follow_page, the
+FOLL_SPLIT bit can be specified as a parameter to follow_page, so
+that it will split the hugepages before returning them. Migration
+uses this trick, as it's not hugepage aware and can't deal with
+hugepages being returned (it's not only checking the pfn of the page
+and pinning it during the copy).
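+
+As a rough sketch of the head page rule above (assuming the current
+in-kernel get_user_pages() calling convention; the serialization
+against split_huge_page() is deliberately left as a comment):
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+
+static int inspect_user_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, addr, 1,
+			     0 /* write */, 0 /* force */, &page, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (ret != 1)
+		return ret < 0 ? ret : -EFAULT;
+
+	/*
+	 * "page" may be a hugepage tail: fields like ->mapping are only
+	 * meaningful on the head page, which compound_head() finds.
+	 * Real code must also serialize against split_huge_page()
+	 * before trusting head page state, as explained above.
+	 */
+	if (PageCompound(page))
+		printk("head mapping: %p\n", compound_head(page)->mapping);
+
+	put_page(page);	/* release the temporary gup pin */
+	return 0;
+}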
+
+== Optimizing the applications ==
+
+To be guaranteed that the kernel will map a 2M page immediately in
+any memory region, the mmapped region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee (a worked
+example closes this document).
+
+== Hugetlbfs ==
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine, as always. No difference can be noticed in
+hugetlbfs other than there will be less overall fragmentation. All
+the usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
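+
+Putting the pieces together, a minimal sketch of a hugepage aligned
+allocation (the 2M size is the x86 hugepage size, and the buffer
+length is arbitrary):
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+
+#ifndef MADV_HUGEPAGE
+#define MADV_HUGEPAGE 14	/* Linux value, for older libc headers */
+#endif
+
+#define HPAGE_SIZE (2UL * 1024 * 1024)
+
+int main(void)
+{
+	void *buf;
+	size_t len = 64 * HPAGE_SIZE;
+	int err;
+
+	/* 2M alignment guarantees hugepage mappings from the very
+	 * first fault */
+	err = posix_memalign(&buf, HPAGE_SIZE, len);
+	if (err) {
+		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
+		return 1;
+	}
+	/* only needed when the system-wide policy is "madvise" */
+	if (madvise(buf, len, MADV_HUGEPAGE))
+		perror("madvise");
+	memset(buf, 0, len);	/* first touch can fault in 2M pages */
+	free(buf);
+	return 0;
+}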