The current state and the future of page migration

The current page migration in Linus' tree uses swap entries to track unmapped anonymous pages and has the side effect of removing all references to file backed pages. If multiple migrations run concurrently then we are typically limited by contention on the tree_lock for swap space. We see migration rates of around 600-900 MB/sec for a single migration and around 250 MB/sec for 4 concurrent migrations.

The code in Andrew's tree uses migration entries, restores ptes to file backed pages and preserves the write enable bit. This means that a process can be repeatedly migrated without losing the file backed pages that were not referenced in the intermediate period. We also avoid useless COW faults. The contention on the swap tree_lock has been removed, so we see increased migration rates of around 800 MB-1 GB/sec for a single process that degrade only slightly for 4 concurrent processes.

I would like to keep the features of page migration as they are right now in Andrew's tree until the patches have made it into Linus' tree.

I have put some additional patches for page migration at ftp://ftp.kernel.org/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc3-mm1/. These are in testing and need work. Feedback on these would be useful.

1. Restructure migrate_pages() so that the current goto mess is avoided. This extracts two functions from migrate_pages() that deal with taking the page lock for either the source or the destination page.

2. Dispose of migrated pages immediately. This moves the recycling of migrated pages into migrate_pages(). Callers then only have to deal with pages that are still candidates for migration or for which migration could be retried. This simplifies handling but prevents potentially necessary post processing of migrated pages. Should we do this at all?

3. Use arrays to pass lists of pages to migrate_pages(). Doing so makes a 1-1 association possible between the pages to be migrated and their target pages.
If we have this 1-1 association then we can accurately allocate pages for MPOL_INTERLEAVE during migration. Specifying MPOL_INTERLEAVE|MPOL_MF_MOVE to mbind() could then move all pages so that they accurately follow the best interleave pattern.

4. A new system call for the migration of lists of pages (incomplete implementation!):

   sys_move_pages([int pid,?] int nr_pages, unsigned long *addresses, int *nodes, unsigned int flags);

This function would migrate individual pages of a process to specific nodes. F.e. user space tools exist that can provide off node access statistics showing from which node a page is most frequently accessed. Additional code could then use this new system call to migrate lists of pages to more advantageous locations, so automatic page migration could be implemented in user space. Many of us remain unconvinced that automatic page migration can provide a consistent benefit; this API would allow the implementation of various automatic migration methods without changes to the kernel.

5. vma migration hooks. This adds a new function call "migrate" to the vm_operations structure. The vm_ops migration method may be used by vmas without page structs (PFN_MAP?) to implement their own migration schemes. Currently there is no user of such functionality, but the uncached allocator for IA64 could potentially use such vma migration hooks.

Potential future work:

- Implement the migration of mlocked pages. This would mean ignoring VM_LOCKED in try_to_unmap(). Currently VM_LOCKED can be used to prevent the migration of pages. If we allow the migration of mlocked pages then we would need to introduce some alternate means of declaring a page not migratable (VM_DONTMIGRATE?). Not sure if this should be done at all.

- Migration of pages outside of a process context. Currently page migration requires that a read lock on mmap_sem is held to prevent the anonymous vmas from vanishing while we migrate pages.
If page migration were used to remove all pages from a zone (as needed by the memory hotplug project) then we would first need to find a way to ensure that the anon_vmas do not vanish under us. We could f.e. take a read lock on one of the mm_structs that may be discovered via the reverse maps.

Did I miss anything?