Subject: transparent hugepage support documentation
From: Andrea Arcangeli

Documentation/vm/transhuge.txt

Signed-off-by: Andrea Arcangeli
---
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,194 @@
+= Transparent Hugepage Support =
+
+== Objective ==
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent Hugepage Support is an alternative to
+libhugetlbfs that offers the same features but without the
+shortcomings of hugetlbfs (for KVM, JVM, HPC, even gcc, etc.).
+
+In the future it can expand over the pagecache layer, starting with
+tmpfs, to reduce hugetlbfs usage even further.
+
+Applications run faster because of two factors. The first factor is
+almost completely irrelevant and not of significant interest, because
+it comes with the downside of requiring larger clear-page and
+copy-page operations in page faults, which is potentially negative:
+it consists in taking a single page fault for each 2M virtual region
+touched by userland (so reducing the enter/exit kernel frequency by a
+factor of 512). This only matters the first time the memory is
+accessed for the lifetime of a memory mapping. The second factor is
+long lasting and much more important: it affects all subsequent
+accesses to the memory for the whole runtime of the application. It
+consists of two components: 1) the TLB miss will run faster
+(especially with virtualization using nested pagetables, but also on
+bare metal without virtualization) and 2) a single TLB entry will be
+mapping a much larger amount of virtual memory, in turn reducing the
+number of TLB misses. With virtualization and nested pagetables the
+TLB entries can only map the larger size if both KVM and the Linux
+guest are using hugepages, but a significant speedup already happens
+if only one of the two is using hugepages, simply because the TLB
+miss runs faster.
+
+== Design ==
+
+- hugepages have to be swappable and be handled as any other page by
+  the core Linux VM
+
+- if a hugepage allocation fails because of memory fragmentation,
+  regular pages should be gracefully allocated instead and mixed in
+  the same vma without any failure or significant delay and generally
+  without userland noticing
+
+- if some task quits and more hugepages become available (either
+  immediately in the buddy or through the VM), guest physical memory
+  backed by regular pages should be relocated to hugepages
+  automatically (with khugepaged)
+
+- it doesn't require boot-time memory reservation and in turn it uses
+  hugepages whenever possible (the only possible reservation here is
+  kernelcore= to prevent unmovable pages from fragmenting all the
+  memory, but such a tweak is not specific to transparent hugepage
+  support; it's a generic feature that applies to all dynamic high
+  order allocations in the kernel)
+
+- this initial support only offers the feature in anonymous memory
+  regions, but it'd be ideal to move it to tmpfs and the pagecache
+  later
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or for other movable (or even
+unmovable) entities. It doesn't require reservation to prevent
+hugepage allocation failures from being noticeable to userland. It
+allows paging and all other advanced VM features to be available on
+the hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications can however be further optimized to take maximum
+advantage of this feature, much as they've been optimized in the past
+to avoid a flood of mmap system calls for every malloc(4k).
+Optimizing userland is far from mandatory, and khugepaged can already
+take care of long lived page allocations even for hugepage unaware
+applications that deal with large amounts of memory.
+
+In certain cases, when hugepages are enabled system wide, an
+application may end up allocating more memory resources: it may mmap
+a large region but only touch 1 byte of it, in which case a 2M page
+might be allocated instead of a 4k page for no good reason. This is
+why it's possible to disable hugepages system-wide and to only have
+them inside MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions,
+to eliminate any risk of wasting precious bytes of memory and to be
+guaranteed to only run faster.
+
+Applications that get a lot of benefit from hugepages, and that don't
+risk losing memory by using them, should use madvise(MADV_HUGEPAGE)
+on their critical mmapped regions.
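+
+For example, a minimal sketch of what such an application could do
+(the region size is arbitrary, and the fallback define is only needed
+with libc headers old enough not to know about MADV_HUGEPAGE):
+
+#include <stdio.h>
+#include <sys/mman.h>
+
+#ifndef MADV_HUGEPAGE
+#define MADV_HUGEPAGE 14	/* Linux value, for older libc headers */
+#endif
+
+#define REGION_SIZE (256UL * 1024 * 1024)	/* arbitrary example size */
+
+int main(void)
+{
+	void *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
+		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (p == MAP_FAILED) {
+		perror("mmap");
+		return 1;
+	}
+	/* hint that this region benefits from hugepages; with the
+	 * "madvise" system-wide setting this is what enables them */
+	if (madvise(p, REGION_SIZE, MADV_HUGEPAGE))
+		perror("madvise");
+	/* ... use p as long lived, frequently accessed memory ... */
+	munmap(p, REGION_SIZE);
+	return 0;
+}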
+
+== sysfs ==
+
+Transparent Hugepage Support can be entirely disabled (mostly for
+debugging purposes), or only enabled inside MADV_HUGEPAGE regions (to
+avoid the risk of consuming more memory resources), or enabled system
+wide. This can be achieved with one of:
+
+echo always >/sys/kernel/mm/transparent_hugepage/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit the VM's defrag efforts to the hugepage
+allocations of madvise regions (in case hugepages aren't immediately
+free), or to never try to defrag memory and simply fall back to
+regular pages unless hugepages are immediately available. Clearly, if
+we spend CPU time defragging memory, we would expect to gain even
+more later by using hugepages instead of regular pages. This isn't
+always guaranteed, but it's more likely when the allocation is for a
+MADV_HUGEPAGE region.
+
+echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+khugepaged can also be instructed to scan all regions that are large
+enough to fit hugepages, or only the ones large enough inside madvise
+regions, or none (in which case the khugepaged kernel daemon will
+quit):
+
+echo always >/sys/kernel/mm/transparent_hugepage/khugepaged/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/khugepaged/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/khugepaged/enabled
+
+khugepaged usually runs at low frequency, so while one may not want
+to invoke defrag algorithms synchronously during page faults, it
+should be worth invoking defrag at least in khugepaged. However it's
+also possible to disable defrag in khugepaged:
+
+echo yes >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+echo no >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass
+(you can set this to 0 to run khugepaged at 100% utilization of one
+core):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged after a hugepage
+allocation failure, to throttle the next allocation attempt:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+and in the number of completed passes:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+== Need of restart ==
+
+The sysfs control values only affect future behavior, so to make them
+effective you need to restart any application that could have been
+using hugepages. This also applies to the regions scanned by
+khugepaged.
+
+== GUP ==
+
+get_user_pages and follow_page, if run on a hugepage, will return the
+head or tail pages as usual (exactly as they would on hugetlbfs).
+Most gup users will only care about the actual physical address of
+the page and its temporary pinning (to be released after the I/O is
+complete), so they won't ever notice the fact that the page is huge.
+But if any driver is going to poke into the page structure of the
+tail page (like checking page->mapping or other bits that are
+relevant for the head page and not for the tail page), it should be
+updated to check the head page instead (while properly serializing
+against split_huge_page() to avoid the head and tail pages
+disappearing from under it); a rough sketch follows at the end of
+this section.
+
+If a caller can't handle compound pages returned by follow_page, the
+FOLL_SPLIT bit can be specified as a parameter to follow_page, so
+that it will split the hugepages before returning them. Migration
+uses this trick, as it's not hugepage aware and can't deal with
+hugepages being returned (it's not only checking the pfn of the page
+and pinning it during the copy).
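+
+As a rough sketch of the head page rule above (assuming the current
+in-kernel get_user_pages() calling convention; the serialization
+against split_huge_page() is deliberately left as a comment):
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+
+static int inspect_user_page(unsigned long addr)
+{
+	struct page *page;
+	int ret;
+
+	down_read(&current->mm->mmap_sem);
+	ret = get_user_pages(current, current->mm, addr, 1,
+			     0 /* write */, 0 /* force */, &page, NULL);
+	up_read(&current->mm->mmap_sem);
+	if (ret != 1)
+		return ret < 0 ? ret : -EFAULT;
+
+	/*
+	 * "page" may be a hugepage tail: fields like ->mapping are only
+	 * meaningful on the head page, which compound_head() finds.
+	 * Real code must also serialize against split_huge_page()
+	 * before trusting head page state, as explained above.
+	 */
+	if (PageCompound(page))
+		printk("head mapping: %p\n", compound_head(page)->mapping);
+
+	put_page(page);	/* release the temporary gup pin */
+	return 0;
+}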
+
+== Optimizing the applications ==
+
+To be guaranteed that the kernel will map a 2M page immediately in
+any memory region, the mmapped region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee (a worked
+example closes this document).
+
+== Hugetlbfs ==
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine, as always. No difference can be noticed in
+hugetlbfs other than there will be less overall fragmentation. All
+the usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
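+
+Putting the pieces together, a minimal sketch of a hugepage aligned
+allocation (the 2M size is the x86 hugepage size, and the buffer
+length is arbitrary):
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/mman.h>
+
+#ifndef MADV_HUGEPAGE
+#define MADV_HUGEPAGE 14	/* Linux value, for older libc headers */
+#endif
+
+#define HPAGE_SIZE (2UL * 1024 * 1024)
+
+int main(void)
+{
+	void *buf;
+	size_t len = 64 * HPAGE_SIZE;
+	int err;
+
+	/* 2M alignment guarantees hugepage mappings from the very
+	 * first fault */
+	err = posix_memalign(&buf, HPAGE_SIZE, len);
+	if (err) {
+		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
+		return 1;
+	}
+	/* only needed when the system-wide policy is "madvise" */
+	if (madvise(buf, len, MADV_HUGEPAGE))
+		perror("madvise");
+	memset(buf, 0, len);	/* first touch can fault in 2M pages */
+	free(buf);
+	return 0;
+}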