Add ability to modify memory policies via the /proc filesystem

1. Read and write the memory policy of a task via /proc/<pid>/numa_policy

Reading this file returns a text string describing the memory policy of the
process. Writing a new policy to "numa_policy" changes the memory policy of
the process.

The following strings may be written to /proc/<pid>/numa_policy:

default                 -> Reset allocation policy to default
prefer=<node>           -> Prefer allocation on specified node
interleave={nodelist}   -> Interleaved allocation on the given nodes
bind={nodelist}         -> Restrict allocation to the specified zones

Zones are specified either by giving only the node number or by using the
notation <node>/<zonename>, e.g. 3/normal 1/high 0/dma etc.

Additionally, the patch adds write capability to "numa_maps". One can write
a VMA address followed by the policy to that file to change the mempolicy of
an individual virtual memory area, e.g.

echo "2aaaaaaab000 bind=0,2-5" >numa_maps

This is compatible with the output format of numa_maps.

These functions are a core requirement for managing the memory allocation of
processes dynamically. This may be done manually by the administrator as
described here, or one may write a batch process manager that manages the
memory on a NUMA system.

DANGER: There is no locking scheme that makes it safe to update a task's
memory policy from outside the task, since a thread may read its memory
policy without any locks at any time (in particular for get_pages()!). This
is no problem if there was no prior memory policy. If a policy exists then
the system will wait for one second before freeing the old policy. Hopefully
this will be enough to allow existing uses of the policy to cease. If not,
the application may segfault!

Here is an example. We want to reorganize how process 12024 is allocating
memory. We would like to allocate most pages on node 1.
However, we would like the heap pages to be allocated interleaved on nodes 2
and 3 to allow better throughput.

cd /proc/12024/
echo "prefer=1" >numa_policy

margin:/proc/12024 # cat numa_maps
00000000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000000000 prefer=1 MaxRef=42 Pages=11 Mapped=11 N0=3 N1=2 N2=2 N3=4
2000000000038000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000040000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
2000000000058000 prefer=1 MaxRef=42 Pages=59 Mapped=59 N0=14 N1=16 N2=15 N3=14
2000000000260000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000268000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000274000 prefer=1 MaxRef=1 Pages=3 Mapped=3 Anon=3 N1=3
2000000000280000 prefer=1 MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 prefer=1 MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
4000000000000000 prefer=1 MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
6000000000008000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000fff7fffc000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000ffffff3c000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1

margin:/proc/12024 # cat maps
00000000-00004000 r--p 00000000 00:00 0
2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so
2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so
2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE
2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache
2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty
6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty
6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap]
60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
60000ffffff3c000-60000ffffff90000 rw-p 60000ffffff3c000 00:00 0 [stack]
a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso]

echo "2xxxx interleave={2,3}" >numa_maps

margin:/proc/12024 # cat numa_maps
00000000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000000000 prefer=1 MaxRef=42 Pages=11 Mapped=11 N0=3 N1=2 N2=2 N3=4
2000000000038000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000040000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
2000000000058000 prefer=1 MaxRef=42 Pages=59 Mapped=59 N0=14 N1=16 N2=15 N3=14
2000000000260000 prefer=1 MaxRef=0 Pages=0 Mapped=0
2000000000268000 prefer=1 MaxRef=1 Pages=2 Mapped=2 Anon=2 N1=2
2000000000274000 prefer=1 MaxRef=1 Pages=3 Mapped=3 Anon=3 N1=3
2000000000280000 prefer=1 MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 prefer=1 MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
4000000000000000 prefer=1 MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
6000000000008000 interleave=2,3 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000fff7fffc000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1
60000ffffff3c000 prefer=1 MaxRef=1 Pages=1 Mapped=1 Anon=1 N1=1

Signed-off-by: Christoph Lameter

Index: linux-2.6.14-rc5-mm1/fs/proc/base.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/fs/proc/base.c	2005-11-03 11:55:55.000000000 -0800
+++ linux-2.6.14-rc5-mm1/fs/proc/base.c	2005-11-03 12:02:49.000000000 -0800
@@ -100,7 +100,10 @@ enum pid_directory_inos {
 	PROC_TGID_STAT,
 	PROC_TGID_STATM,
 	PROC_TGID_MAPS,
+#ifdef CONFIG_NUMA
 	PROC_TGID_NUMA_MAPS,
+	PROC_TGID_NUMA_POLICY,
+#endif
 	PROC_TGID_MOUNTS,
 	PROC_TGID_WCHAN,
 #ifdef CONFIG_MMU
@@ -140,7 +143,10 @@ enum pid_directory_inos {
 	PROC_TID_STAT,
 	PROC_TID_STATM,
 	PROC_TID_MAPS,
+#ifdef CONFIG_NUMA
 	PROC_TID_NUMA_MAPS,
+	PROC_TID_NUMA_POLICY,
+#endif
 	PROC_TID_MOUNTS,
 	PROC_TID_WCHAN,
 #ifdef CONFIG_MMU
@@ -190,6 +196,7 @@ static struct pid_entry tgid_base_stuff[
 	E(PROC_TGID_MAPS, "maps", S_IFREG|S_IRUGO),
 #ifdef CONFIG_NUMA
 	E(PROC_TGID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
+	E(PROC_TGID_NUMA_POLICY, "numa_policy", S_IFREG|S_IRUGO|S_IWUSR),
 #endif
 	E(PROC_TGID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
 #ifdef CONFIG_SECCOMP
@@ -232,6 +239,7 @@ static struct pid_entry tid_base_stuff[]
 	E(PROC_TID_MAPS, "maps", S_IFREG|S_IRUGO),
 #ifdef CONFIG_NUMA
 	E(PROC_TID_NUMA_MAPS, "numa_maps", S_IFREG|S_IRUGO),
+	E(PROC_TID_NUMA_POLICY, "numa_policy", S_IFREG|S_IRUGO|S_IWUSR),
 #endif
 	E(PROC_TID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
 #ifdef CONFIG_SECCOMP
@@ -1659,6 +1667,10 @@ static struct dentry *proc_pident_lookup
 		case PROC_TGID_NUMA_MAPS:
 			inode->i_fop = &proc_numa_maps_operations;
 			break;
+		case PROC_TID_NUMA_POLICY:
+		case PROC_TGID_NUMA_POLICY:
+			inode->i_fop = &proc_numa_policy_operations;
+			break;
 #endif
 		case PROC_TID_MEM:
 		case PROC_TGID_MEM:
Index: linux-2.6.14-rc5-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/mm/mempolicy.c	2005-11-03 12:02:49.000000000 -0800
+++ linux-2.6.14-rc5-mm1/mm/mempolicy.c	2005-11-03 17:31:05.000000000 -0800
@@ -86,6 +86,10 @@
 #include
 #include
 #include
+#include
+#include
+#include
+#include
 
 /* Internal MPOL_MF_xxx flags */
 #define MPOL_MF_DISCONTIG_OK (1<<20)	/* Skip checks for continuous vmas */
@@ -1539,10 +1543,6 @@ void numa_default_policy(void)
 
 #define MPOL_BUFFER_SIZE 50
 
-#define MPOL_BUFFER_SIZE 50
-
-#define MPOL_BUFFER_SIZE 50
-
 struct numa_maps {
 	unsigned long pages;
 	unsigned long anon;
@@ -1644,7 +1644,7 @@ static int mpol_to_str(char *buffer, int
 	return p - buffer;
 }
 
-static int show_numa_map(struct seq_file *m, void *v)
+int show_numa_map(struct seq_file *m, void *v)
 {
 	struct task_struct *task = m->private;
 	struct vm_area_struct *vma = v;
@@ -1677,12 +1677,42 @@ static int show_numa_map(struct seq_file
 	return 0;
 }
 
-struct seq_operations proc_pid_numa_maps_op = {
-	.start	= m_start,
-	.next	= m_next,
-	.stop	= m_stop,
-	.show	= show_numa_map
-};
+/*
+ * Convert a representation of a memory policy from text
+ * form to binary. The function does not restrict
+ * the set of nodes to the ones allowed by a task.
+ *
+ * Returns either a memory policy or NULL for error.
+ */
+static struct mempolicy *str_to_mpol(char *buffer)
+{
+	nodemask_t nodes;
+	int mode;
+	int l;
+
+	for (mode = 0; mode <= MPOL_MAX; mode++) {
+		l = strlen(policy_types[mode]);
+		if (strnicmp(policy_types[mode], buffer, l) == 0
+			&& (mode == MPOL_DEFAULT || buffer[l] == '='))
+			break;
+	}
+
+	if (mode > MPOL_MAX)
+		return NULL;
+
+	if (mode == MPOL_DEFAULT)
+		return &default_policy;
+
+	if (nodelist_parse(buffer + l + 1, nodes) || nodes_empty(nodes))
+		return NULL;
+
+	if (mpol_check_policy(mode, &nodes))
+		return NULL;
+
+	return mpol_new(mode, &nodes);
+}
+
+#define proc_task(inode) (PROC_I(inode)->task)
 
 ssize_t numa_maps_write(struct file *file, const char __user *buf,
 				size_t count, loff_t *ppos)
@@ -1690,14 +1720,12 @@ ssize_t numa_maps_write(struct file *fil
 	struct task_struct *task = proc_task(file->f_dentry->d_inode);
 	struct vm_area_struct *vma;
 	char *p, *q;
-	unsigned long addr;
-	nodemask_t nodes;
-	int target = -1;
+	unsigned long from_addr, to_addr;
+	nodemask_t from_nodes, to_nodes;
 	int pages = -1;
 	int rc;
 	char buffer[MPOL_BUFFER_SIZE];
 
-	nodes_clear(nodes);
 	if (!capable(CAP_SYS_RESOURCE))
 		return -EPERM;
 
@@ -1707,7 +1735,10 @@ ssize_t numa_maps_write(struct file *fil
 	if (copy_from_user(buffer, buf, count))
 		return -EFAULT;
 
-	addr = simple_strtoul(buffer, &p, 16);
+	from_addr = simple_strtoul(buffer, &p, 16);
+	if (*p == '-')
+		to_addr = simple_strtoul(buffer,&p, p+1);
+
 	if (*p++ != ' ')
 		return -EINVAL;
 
@@ -1716,40 +1747,41 @@ ssize_t numa_maps_write(struct file *fil
 		return -EINVAL;
 
 	if (toupper(*p) == 'N') {
+		p++;
 		/* Node number must follow */
-		node_set(simple_strtoul(p, &q, 10), nodes);
+		nodelist_scnprintf(p, MPOL_BUFFER_SIZE, from_nodes);
 
-		if (q == p)
+		if (!nodes_weight(from_nodes))
 			return -EINVAL;
 
+		while (*p && *p != '>')
+			p++;
+
 		/* Check for optional number of pages */
 		if (*q == '(') {
 			q++;
-			pages = simple_strtoul(q, &p, 10);
-			if (*p != ')')
+			pages = simple_strtoul(p, &q, 10);
+			if (*q != ')')
 				return -EINVAL;
-			p++;
-		} else
-			p = q;
+			p = q + 1;
+		}
 
 		if (*p == '>') {
 			p++;
-			target = simple_strtoul(p, &q, 10);
-			if (q == p)
-				return -EINVAL;
-			p = q;
-		}
+			nodelist_scnprintf(p, MPOL_BUFFER_SIZE, to_nodes);
+		} else
+			nodes_clear(to_nodes);
 
-		down_read(&vma->vm_mm->mmap_sem);
-		rc = try_to_migrate_vma_pages(vma, nodes, target, pages);
-		up_read(&vma->vm_mm->mmap_sem);
+		rc = migrate(task->vm_mm, from_addr, to_addr, &from_nodes, &to_nodes, pages, MPOL_MF_MOVE);
 		if (rc)
-			printk(KERN_ERR "migrate_vma(%p,%d,%d)=%d\n", vma, target, pages, rc);
+			printk(KERN_ERR "migrate_vma(%p,%d)=%d\n", vma, pages, rc);
 		return p - buffer;
+
 	} else {
+		struct mempolicy *pol, *old_policy;
 
 		/*
@@ -1774,24 +1806,103 @@ ssize_t numa_maps_write(struct file *fil
 		up_write(&vma->vm_mm->mmap_sem);
 		mpol_free(old_policy);
 		return count;
+
+	}
+}
+
+/*
+ * Retrieval and setting of the memory policy for a task
+ */
+static ssize_t numa_policy_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	char buffer[MPOL_BUFFER_SIZE]; /* Should this really be on the stack ?? */
+	size_t len;
+	loff_t __ppos = *ppos;
+
+	if (task->mm)
+		down_read(&task->mm->mmap_sem);
+
+	len = mpol_to_str(buffer, MPOL_BUFFER_SIZE, task->mempolicy);
+	if (__ppos >= len) {
+		count = 0;
+		goto out;
+	}
+	if (count > len-__ppos)
+		count = len-__ppos;
+	if (copy_to_user(buf, buffer + __ppos, count)) {
+		count = -EFAULT;
+		goto out;
 	}
+	*ppos = __ppos + count;
+out:
+	if (task->mm)
+		up_read(&task->mm->mmap_sem);
+
+	return count;
 }
 
-static int numa_maps_open(struct inode *inode, struct file *file)
+static ssize_t numa_policy_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos)
 {
-	struct task_struct *task = proc_task(inode);
-	int ret = seq_open(file, &proc_pid_numa_maps_op);
-	if (!ret) {
-		struct seq_file *m = file->private_data;
-		m->private = task;
-	}
-	return ret;
-}
-
-struct file_operations proc_numa_maps_operations = {
-	.open		= numa_maps_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-	.write		= numa_maps_write
+	struct task_struct *task = proc_task(file->f_dentry->d_inode);
+	char buffer[MPOL_BUFFER_SIZE];
+	struct mempolicy *pol, *old_policy;
+
+	if (!capable(CAP_SYS_RESOURCE))
+		return -EPERM;
+	if (count >= MPOL_BUFFER_SIZE || !task->mm)
+		return -EINVAL;
+	if (copy_from_user(buffer, buf, count))
+		return -EFAULT;
+
+	/*
+	 * Note that this policy is not properly contextualized and may
+	 * contain nodes not allowed in the current cpuset!
+	 */
+	pol = str_to_mpol(buffer);
+	if (!pol)
+		return -EINVAL;
+
+	/*
+	 * WARNING: The process whose numa policy is set
+	 * may read the policy at any time without locking.
+	 * Taking the write lock on mmap_sem here will only synchronize
+	 * writes to the memory policy.
+	 *
+	 * The update here may cause the application to fail if it just
+	 * happens to follow the policy pointer!
+	 * the down_write will only synchronize writes to the policy
+	 */
+	down_write(&task->mm->mmap_sem);
+	old_policy = task->mempolicy;
+
+	if (!mpol_equal(pol, old_policy)) {
+		if (pol->policy == MPOL_DEFAULT)
+			pol = NULL;
+
+		task->mempolicy = pol;
+	} else
+		old_policy = pol;
+
+	up_write(&task->mm->mmap_sem);
+	if (old_policy) {
+		/*
+		 * Hack to ensure that some time passes before the old
+		 * policy is released. Hopefully that is enough
+		 * for others to have finished using the policy.
+		 */
+		ssleep(1);
+		mpol_free(old_policy);
+	}
+
+	return count;
+}
+
+
+struct file_operations proc_numa_policy_operations = {
+	.read		= numa_policy_read,
+	.write		= numa_policy_write
 };
+
Index: linux-2.6.14-rc5-mm1/fs/proc/task_mmu.c
===================================================================
--- linux-2.6.14-rc5-mm1.orig/fs/proc/task_mmu.c	2005-11-03 11:26:18.000000000 -0800
+++ linux-2.6.14-rc5-mm1/fs/proc/task_mmu.c	2005-11-03 17:07:36.000000000 -0800
@@ -389,3 +389,34 @@ struct seq_operations proc_pid_smaps_op
 	.show	= show_smap
 };
 
+#ifdef CONFIG_NUMA
+extern int show_numa_map(struct seq_file *m, void *v);
+extern ssize_t numa_maps_write(struct file *file, const char __user *buf,
+				size_t count, loff_t *ppos);
+
+static struct seq_operations proc_pid_numa_maps_op = {
+	.start	= m_start,
+	.next	= m_next,
+	.stop	= m_stop,
+	.show	= show_numa_map
+};
+
+static int numa_maps_open(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = proc_task(inode);
+	int ret = seq_open(file, &proc_pid_numa_maps_op);
+	if (!ret) {
+		struct seq_file *m = file->private_data;
+		m->private = task;
+	}
+	return ret;
+}
+
+struct file_operations proc_numa_maps_operations = {
+	.open		= numa_maps_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+	.write		= numa_maps_write
+};
+#endif
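
For illustration only, and not part of the patch: the policy strings listed
in the changelog ("default", "prefer=...", "interleave=...", "bind=...")
could be composed and written from user space roughly as sketched below.
The names format_policy(), write_policy() and POLICY_MAX are hypothetical
helpers invented for this sketch. The 50-byte bound is an assumption that
mirrors MPOL_BUFFER_SIZE in the patch, where numa_policy_write() rejects
writes of MPOL_BUFFER_SIZE bytes or more. The target file only exists on a
kernel carrying this patch, so the path is passed in by the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: mirrors MPOL_BUFFER_SIZE (50) from the patch. */
#define POLICY_MAX 50

/*
 * Hypothetical helper: compose "mode" or "mode=nodelist" into buf.
 * A NULL or empty nodelist yields the bare mode, as for "default".
 * Returns the string length, or -1 if it does not fit in size bytes.
 */
int format_policy(char *buf, size_t size, const char *mode,
		  const char *nodelist)
{
	int n;

	assert(buf != NULL && mode != NULL);

	if (nodelist == NULL || *nodelist == '\0')
		n = snprintf(buf, size, "%s", mode);
	else
		n = snprintf(buf, size, "%s=%s", mode, nodelist);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}

/*
 * Hypothetical helper: write a policy string to path, which is expected
 * to be /proc/<pid>/numa_policy on a patched kernel.
 * Returns 0 on success and -1 on any open, write or close error.
 */
int write_policy(const char *path, const char *policy)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (f == NULL)
		return -1;

	ok = fputs(policy, f) >= 0;
	if (fclose(f) != 0)	/* the kernel's verdict may only show up here */
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would size its buffer as char buf[POLICY_MAX], call
format_policy(buf, sizeof buf, "prefer", "1") and pass the result to
write_policy() together with the numa_policy path of the target process.
Since the patch checks CAP_SYS_RESOURCE, an unprivileged caller should
expect write_policy() to fail.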