From: Ingo Molnar

This is the CFS scheduler.

CFS is a drop-in replacement for the vanilla scheduler. 80% of CFS's
design can be summed up in a single sentence: CFS basically models an
"ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-) CPU that has 100%
physical power and which can run each task at precise equal speed, in
parallel, each at 1/nr_running speed. For example: if there are 2 tasks
running then it runs each at 50% physical power - totally in parallel.

On real hardware, we can run only a single task at once, so while that
one task runs the other tasks that are waiting for the CPU are at a
disadvantage - the current task gets an unfair amount of CPU time. In
CFS this fairness imbalance is expressed and tracked via the per-task
p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
time the task should now run on the CPU for it to become completely
fair and balanced.

( small detail: on 'ideal' hardware, the p->wait_runtime value would
  always be zero - no task would ever get 'out of balance' from the
  'ideal' share of CPU time. )

CFS's task picking logic is based on this p->wait_runtime value and it
is thus very simple: it always tries to run the task with the largest
p->wait_runtime value. In other words, CFS tries to run the task with
the 'gravest need' for more CPU time. So CFS always tries to split up
CPU time between runnable tasks as close to 'ideal multitasking
hardware' as possible.

Most of the rest of CFS's design just falls out of this really simple
concept, with a few add-on embellishments like nice levels,
multiprocessing and various algorithm variants to recognize sleepers.

In practice it works like this: the system runs a task a bit, and when
the task schedules (or a scheduler tick happens) the task's CPU usage
is 'accounted for': the (small) time it just spent using the physical
CPU is deducted from p->wait_runtime [minus the 'fair share' it would
have gotten anyway]. Once p->wait_runtime gets low enough so that
another task becomes the 'leftmost task' (plus a small amount of
'granularity' distance relative to the leftmost task so that we do not
over-schedule tasks and trash the cache) then the new leftmost task is
picked and the current task is preempted.

The rq->fair_clock value tracks the 'CPU time a runnable task would
have fairly gotten, had it been runnable during that time'. So by using
rq->fair_clock values we can accurately timestamp and measure the
'expected CPU time' a task should have gotten. All runnable tasks are
sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
CFS picks the 'leftmost' task and sticks to it. As the system
progresses forwards, newly woken tasks are put into the tree more and
more to the right - slowly but surely giving a chance for every task to
become the 'leftmost task' and thus get on the CPU within a
deterministic amount of time.
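For illustration, here is a minimal userspace C sketch of the
accounting described above. It is not the kernel implementation: it
uses a fixed 1 ms tick and a plain array scan instead of the rbtree
(picking the largest wait_runtime is equivalent to picking the
leftmost "rq->fair_clock - p->wait_runtime" key when all tasks share
the same fair clock), and the struct/field names (task, wait_runtime,
sum_exec_runtime) merely mirror the description. On every simulated
tick the running task is charged the tick minus its fair share, every
waiting task is credited its fair share, and the next task picked is
the one owed the most CPU time:

	/*
	 * Userspace sketch of the fairness bookkeeping described above.
	 * NOT the kernel code: plain array instead of an rbtree, fixed
	 * tick lengths, purely to illustrate how wait_runtime evolves.
	 */
	#include <stdio.h>

	#define NR_TASKS	3
	#define TICK_NS		1000000		/* 1 ms per simulated tick */
	#define NR_TICKS	12

	struct task {
		const char *name;
		long long wait_runtime;		/* CPU time this task is owed */
		long long sum_exec_runtime;	/* CPU time actually received */
	};

	static struct task tasks[NR_TASKS] = {
		{ "A", 0, 0 }, { "B", 0, 0 }, { "C", 0, 0 },
	};

	/* Pick the task with the 'gravest need': the largest wait_runtime. */
	static struct task *pick_next_task(void)
	{
		struct task *best = &tasks[0];
		int i;

		for (i = 1; i < NR_TASKS; i++)
			if (tasks[i].wait_runtime > best->wait_runtime)
				best = &tasks[i];
		return best;
	}

	int main(void)
	{
		long long fair_share = TICK_NS / NR_TASKS;
		int tick, i;

		for (tick = 0; tick < NR_TICKS; tick++) {
			struct task *curr = pick_next_task();

			/*
			 * The running task used a full tick of real CPU time
			 * but was only entitled to 1/nr_running of it; every
			 * waiting task fell behind by its fair share.
			 */
			curr->sum_exec_runtime += TICK_NS;
			curr->wait_runtime -= TICK_NS - fair_share;
			for (i = 0; i < NR_TASKS; i++)
				if (&tasks[i] != curr)
					tasks[i].wait_runtime += fair_share;

			printf("tick %2d: ran %s", tick, curr->name);
			for (i = 0; i < NR_TASKS; i++)
				printf("  %s: wait=%lld", tasks[i].name,
				       tasks[i].wait_runtime);
			printf("\n");
		}
		return 0;
	}

Running this toy shows the three tasks round-robining, with
wait_runtime values that stay bounded and always sum to zero - the
fairness imbalance never drifts, which is the invariant CFS maintains
(at nanosecond granularity, and with the granularity-based preemption
check described above).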
Signed-off-by: Ingo Molnar Signed-off-by: Andrew Morton --- Documentation/kernel-parameters.txt | 43 Documentation/sched-design-CFS.txt | 107 + arch/i386/kernel/smpboot.c | 13 arch/i386/kernel/syscall_table.S | 1 arch/i386/kernel/tsc.c | 1 arch/ia64/kernel/setup.c | 6 arch/mips/kernel/smp.c | 11 arch/sparc/kernel/smp.c | 10 arch/sparc64/kernel/smp.c | 27 drivers/char/drm/radeon_cp.c | 5 fs/proc/array.c | 18 fs/proc/base.c | 62 include/asm-generic/bitops/sched.h | 21 include/asm-i386/unistd.h | 3 include/asm-x86_64/unistd.h | 2 include/linux/hardirq.h | 13 include/linux/sched.h | 105 - include/linux/topology.h | 12 init/main.c | 2 kernel/exit.c | 3 kernel/fork.c | 5 kernel/posix-cpu-timers.c | 34 kernel/sched.c | 2490 +++++++------------------- kernel/sched_debug.c | 224 ++ kernel/sched_fair.c | 756 +++++++ kernel/sched_rt.c | 184 + kernel/sched_stats.h | 235 ++ kernel/sysctl.c | 48 28 files changed, 2519 insertions(+), 1922 deletions(-) diff -puN Documentation/kernel-parameters.txt~cfs-scheduler Documentation/kernel-parameters.txt --- a/Documentation/kernel-parameters.txt~cfs-scheduler +++ a/Documentation/kernel-parameters.txt @@ -1032,49 +1032,6 @@ and is between 256 and 4096 characters. mga= [HW,DRM] - migration_cost= - [KNL,SMP] debug: override scheduler migration costs - Format: ,,... - This debugging option can be used to override the - default scheduler migration cost matrix. The numbers - are indexed by 'CPU domain distance'. - E.g. migration_cost=1000,2000,3000 on an SMT NUMA - box will set up an intra-core migration cost of - 1 msec, an inter-core migration cost of 2 msecs, - and an inter-node migration cost of 3 msecs. - - WARNING: using the wrong values here can break - scheduler performance, so it's only for scheduler - development purposes, not production environments. - - migration_debug= - [KNL,SMP] migration cost auto-detect verbosity - Format=<0|1|2> - If a system's migration matrix reported at bootup - seems erroneous then this option can be used to - increase verbosity of the detection process. - We default to 0 (no extra messages), 1 will print - some more information, and 2 will be really - verbose (probably only useful if you also have a - serial console attached to the system). - - migration_factor= - [KNL,SMP] multiply/divide migration costs by a factor - Format= - This debug option can be used to proportionally - increase or decrease the auto-detected migration - costs for all entries of the migration matrix. - E.g. migration_factor=150 will increase migration - costs by 50%. (and thus the scheduler will be less - eager migrating cache-hot tasks) - migration_factor=80 will decrease migration costs - by 20%. (thus the scheduler will be more eager to - migrate tasks) - - WARNING: using the wrong values here can break - scheduler performance, so it's only for scheduler - development purposes, not production environments. - mousedev.tap_time= [MOUSE] Maximum time between finger touching and leaving touchpad surface for touch to be considered diff -puN /dev/null Documentation/sched-design-CFS.txt --- /dev/null +++ a/Documentation/sched-design-CFS.txt @@ -0,0 +1,107 @@ +[announce] [patch] Modular Scheduler Core and Completely Fair Scheduler [CFS] + +i'm pleased to announce the first release of the "Modular Scheduler Core +and Completely Fair Scheduler [CFS]" patchset: + + http://redhat.com/~mingo/cfs-scheduler/ + +This project is a complete rewrite of the Linux task scheduler. 
My goal +is to address various feature requests and to fix deficiencies in the +vanilla scheduler that were suggested/found in the past few years, both +for desktop scheduling and for server scheduling workloads. + +[ QuickStart: apply the patch, recompile, reboot. The new scheduler + will be active by default and all tasks will default to the + SCHED_NORMAL interactive scheduling class. ] + +Highlights are: + + - the introduction of Scheduling Classes: an extensible hierarchy of + scheduler modules. These modules encapsulate scheduling policy + details and are handled by the scheduler core without the core + code assuming about them too much. + + - sched_fair.c implements the 'CFS desktop scheduler': it is a + replacement for the vanilla scheduler's SCHED_OTHER interactivity + code. + + i'd like to give credit to Con Kolivas for the general approach here: + he has proven via RSDL/SD that 'fair scheduling' is possible and that + it results in better desktop scheduling. Kudos Con! + + The CFS patch uses a completely different approach and implementation + from RSDL/SD. My goal was to make CFS's interactivity quality exceed + that of RSDL/SD, which is a high standard to meet :-) Testing + feedback is welcome to decide this one way or another. [ and, in any + case, all of SD's logic could be added via a kernel/sched_sd.c module + as well, if Con is interested in such an approach. ] + + CFS's design is quite radical: it does not use runqueues, it uses a + time-ordered rbtree to build a 'timeline' of future task execution, + and thus has no 'array switch' artifacts (by which both the vanilla + scheduler and RSDL/SD are affected). + + CFS uses nanosecond granularity accounting and does not rely on any + jiffies or other HZ detail. Thus the CFS scheduler has no notion of + 'timeslices' and has no heuristics whatsoever. There is only one + central tunable: + + /proc/sys/kernel/sched_granularity_ns + + which can be used to tune the scheduler from 'desktop' (low + latencies) to 'server' (good batching) workloads. It defaults to a + setting suitable for desktop workloads. SCHED_BATCH is handled by the + CFS scheduler module too. + + due to its design, the CFS scheduler is not prone to any of the + 'attacks' that exist today against the heuristics of the stock + scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all + work fine and do not impact interactivity and produce the expected + behavior. + + the CFS scheduler has a much stronger handling of nice levels and + SCHED_BATCH: both types of workloads should be isolated much more + agressively than under the vanilla scheduler. + + ( another rdetail: due to nanosec accounting and timeline sorting, + sched_yield() support is very simple under CFS, and in fact under + CFS sched_yield() behaves much better than under any other + scheduler i have tested so far. ) + + - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler + way than the vanilla scheduler does. It uses 100 runqueues (for all + 100 RT priority levels, instead of 140 in the vanilla scheduler) + and it needs no expired array. + + - reworked/sanitized SMP load-balancing: the runqueue-walking + assumptions are gone from the load-balancing code now, and + iterators of the scheduling modules are used. The balancing code got + quite a bit simpler as a result. 
+ +the core scheduler got smaller by more than 700 lines: + + kernel/sched.c | 1454 ++++++++++++++++------------------------------------------------ + 1 file changed, 372 insertions(+), 1082 deletions(-) + +and even adding all the scheduling modules, the total size impact is +relatively small: + + 18 files changed, 1454 insertions(+), 1133 deletions(-) + +most of the increase is due to extensive comments. The kernel size +impact is in fact a small negative: + + text data bss dec hex filename + 23366 4001 24 27391 6aff kernel/sched.o.vanilla + 24159 2705 56 26920 6928 kernel/sched.o.CFS + +(this is mainly due to the benefit of getting rid of the expired array +and its data structure overhead.) + +thanks go to Thomas Gleixner and Arjan van de Ven for review of this +patchset. + +as usual, any sort of feedback, bugreports, fixes and suggestions are +more than welcome, + + Ingo diff -puN arch/i386/kernel/smpboot.c~cfs-scheduler arch/i386/kernel/smpboot.c --- a/arch/i386/kernel/smpboot.c~cfs-scheduler +++ a/arch/i386/kernel/smpboot.c @@ -940,18 +940,6 @@ exit: } #endif -static void smp_tune_scheduling(void) -{ - unsigned long cachesize; /* kB */ - - if (cpu_khz) { - cachesize = boot_cpu_data.x86_cache_size; - - if ((long)cachesize > 0) - max_cache_size = cachesize * 1024; - } -} - /* * Cycle through the processors sending APIC IPIs to boot each. */ @@ -980,7 +968,6 @@ static void __init smp_boot_cpus(unsigne x86_cpu_to_apicid[0] = boot_cpu_physical_apicid; current_thread_info()->cpu = 0; - smp_tune_scheduling(); set_cpu_sibling_map(0); diff -puN arch/i386/kernel/syscall_table.S~cfs-scheduler arch/i386/kernel/syscall_table.S --- a/arch/i386/kernel/syscall_table.S~cfs-scheduler +++ a/arch/i386/kernel/syscall_table.S @@ -326,3 +326,4 @@ ENTRY(sys_call_table) .long sys_revokeat .long sys_frevoke /* 325 */ .long sys_fallocate + .long sys_sched_yield_to diff -puN arch/i386/kernel/tsc.c~cfs-scheduler arch/i386/kernel/tsc.c --- a/arch/i386/kernel/tsc.c~cfs-scheduler +++ a/arch/i386/kernel/tsc.c @@ -220,6 +220,7 @@ static struct clocksource clocksource_ts void mark_tsc_unstable(char *reason) { + sched_clock_unstable_event(); if (!tsc_unstable) { tsc_unstable = 1; tsc_enabled = 0; diff -puN arch/ia64/kernel/setup.c~cfs-scheduler arch/ia64/kernel/setup.c --- a/arch/ia64/kernel/setup.c~cfs-scheduler +++ a/arch/ia64/kernel/setup.c @@ -801,7 +801,6 @@ static void __cpuinit get_max_cacheline_size (void) { unsigned long line_size, max = 1; - unsigned int cache_size = 0; u64 l, levels, unique_caches; pal_cache_config_info_t cci; s64 status; @@ -831,8 +830,6 @@ get_max_cacheline_size (void) line_size = 1 << cci.pcci_line_size; if (line_size > max) max = line_size; - if (cache_size < cci.pcci_cache_size) - cache_size = cci.pcci_cache_size; if (!cci.pcci_unified) { status = ia64_pal_cache_config_info(l, /* cache_type (instruction)= */ 1, @@ -849,9 +846,6 @@ get_max_cacheline_size (void) ia64_i_cache_stride_shift = cci.pcci_stride; } out: -#ifdef CONFIG_SMP - max_cache_size = max(max_cache_size, cache_size); -#endif if (max > ia64_max_cacheline_size) ia64_max_cacheline_size = max; } diff -puN arch/mips/kernel/smp.c~cfs-scheduler arch/mips/kernel/smp.c --- a/arch/mips/kernel/smp.c~cfs-scheduler +++ a/arch/mips/kernel/smp.c @@ -51,16 +51,6 @@ int __cpu_logical_map[NR_CPUS]; /* Map EXPORT_SYMBOL(phys_cpu_present_map); EXPORT_SYMBOL(cpu_online_map); -/* This happens early in bootup, can't really do it better */ -static void smp_tune_scheduling (void) -{ - struct cache_desc *cd = ¤t_cpu_data.scache; - unsigned long 
cachesize = cd->linesz * cd->sets * cd->ways; - - if (cachesize > max_cache_size) - max_cache_size = cachesize; -} - extern void __init calibrate_delay(void); extern ATTRIB_NORET void cpu_idle(void); @@ -228,7 +218,6 @@ void __init smp_prepare_cpus(unsigned in { init_new_context(current, &init_mm); current_thread_info()->cpu = 0; - smp_tune_scheduling(); plat_prepare_cpus(max_cpus); #ifndef CONFIG_HOTPLUG_CPU cpu_present_map = cpu_possible_map; diff -puN arch/sparc/kernel/smp.c~cfs-scheduler arch/sparc/kernel/smp.c --- a/arch/sparc/kernel/smp.c~cfs-scheduler +++ a/arch/sparc/kernel/smp.c @@ -68,16 +68,6 @@ void __cpuinit smp_store_cpu_info(int id cpu_data(id).prom_node = cpu_node; cpu_data(id).mid = cpu_get_hwmid(cpu_node); - /* this is required to tune the scheduler correctly */ - /* is it possible to have CPUs with different cache sizes? */ - if (id == boot_cpu_id) { - int cache_line,cache_nlines; - cache_line = 0x20; - cache_line = prom_getintdefault(cpu_node, "ecache-line-size", cache_line); - cache_nlines = 0x8000; - cache_nlines = prom_getintdefault(cpu_node, "ecache-nlines", cache_nlines); - max_cache_size = cache_line * cache_nlines; - } if (cpu_data(id).mid < 0) panic("No MID found for CPU%d at node 0x%08d", id, cpu_node); } diff -puN arch/sparc64/kernel/smp.c~cfs-scheduler arch/sparc64/kernel/smp.c --- a/arch/sparc64/kernel/smp.c~cfs-scheduler +++ a/arch/sparc64/kernel/smp.c @@ -1159,32 +1159,6 @@ int setup_profiling_timer(unsigned int m return -EINVAL; } -static void __init smp_tune_scheduling(void) -{ - unsigned int smallest = ~0U; - int i; - - for (i = 0; i < NR_CPUS; i++) { - unsigned int val = cpu_data(i).ecache_size; - - if (val && val < smallest) - smallest = val; - } - - /* Any value less than 256K is nonsense. */ - if (smallest < (256U * 1024U)) - smallest = 256 * 1024; - - max_cache_size = smallest; - - if (smallest < 1U * 1024U * 1024U) - printk(KERN_INFO "Using max_cache_size of %uKB\n", - smallest / 1024U); - else - printk(KERN_INFO "Using max_cache_size of %uMB\n", - smallest / 1024U / 1024U); -} - /* Constrain the number of cpus to max_cpus. */ void __init smp_prepare_cpus(unsigned int max_cpus) { @@ -1202,7 +1176,6 @@ void __init smp_prepare_cpus(unsigned in } cpu_data(boot_cpu_id).udelay_val = loops_per_jiffy; - smp_tune_scheduling(); } void __devinit smp_prepare_boot_cpu(void) diff -puN drivers/char/drm/radeon_cp.c~cfs-scheduler drivers/char/drm/radeon_cp.c --- a/drivers/char/drm/radeon_cp.c~cfs-scheduler +++ a/drivers/char/drm/radeon_cp.c @@ -2267,6 +2267,11 @@ int radeon_driver_load(struct drm_device DRM_DEBUG("%s card detected\n", ((dev_priv->flags & RADEON_IS_AGP) ? "AGP" : (((dev_priv->flags & RADEON_IS_PCIE) ? "PCIE" : "PCI")))); + if (sysctl_sched_yield_bug_workaround == -1) { + sysctl_sched_yield_bug_workaround = 1; + printk(KERN_WARNING "quirk installed: turning on " + "sys_sched_yield() workaround for Radeon DRM.\n"); + } return ret; } diff -puN fs/proc/array.c~cfs-scheduler fs/proc/array.c --- a/fs/proc/array.c~cfs-scheduler +++ a/fs/proc/array.c @@ -165,7 +165,6 @@ static inline char * task_state(struct t rcu_read_lock(); buffer += sprintf(buffer, "State:\t%s\n" - "SleepAVG:\t%lu%%\n" "Tgid:\t%d\n" "Pid:\t%d\n" "PPid:\t%d\n" @@ -173,9 +172,8 @@ static inline char * task_state(struct t "Uid:\t%d\t%d\t%d\t%d\n" "Gid:\t%d\t%d\t%d\t%d\n", get_task_state(p), - (p->sleep_avg/1024)*100/(1020000000/1024), - p->tgid, p->pid, - pid_alive(p) ? rcu_dereference(p->real_parent)->tgid : 0, + p->tgid, p->pid, + pid_alive(p) ? 
rcu_dereference(p->real_parent)->tgid : 0, pid_alive(p) && p->ptrace ? rcu_dereference(p->parent)->pid : 0, p->uid, p->euid, p->suid, p->fsuid, p->gid, p->egid, p->sgid, p->fsgid); @@ -411,6 +409,14 @@ static int do_task_stat(struct task_stru /* convert nsec -> ticks */ start_time = nsec_to_clock_t(start_time); + /* + * Use CFS's precise accounting, if available: + */ + if (!has_rt_policy(task)) { + utime = nsec_to_clock_t(task->sum_exec_runtime); + stime = 0; + } + res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %u %lu \ %lu %lu %lu %lu %lu %ld %ld %ld %ld %d 0 %llu %lu %ld %lu %lu %lu %lu %lu \ %lu %lu %lu %lu %lu %lu %lu %lu %d %d %u %u %llu\n", @@ -427,8 +433,8 @@ static int do_task_stat(struct task_stru cmin_flt, maj_flt, cmaj_flt, - cputime_to_clock_t(utime), - cputime_to_clock_t(stime), + utime, + stime, cputime_to_clock_t(cutime), cputime_to_clock_t(cstime), priority, diff -puN fs/proc/base.c~cfs-scheduler fs/proc/base.c --- a/fs/proc/base.c~cfs-scheduler +++ a/fs/proc/base.c @@ -898,6 +898,66 @@ static const struct file_operations proc }; #endif +/* + * Print out various scheduling related per-task fields: + */ +static int sched_show(struct seq_file *m, void *v) +{ + struct inode *inode = m->private; + struct task_struct *p; + + WARN_ON(!inode); + + p = get_proc_task(inode); + if (!p) + return -ESRCH; + proc_sched_show_task(p, m); + + put_task_struct(p); + + return 0; +} + +static ssize_t +sched_write(struct file *file, const char __user *buf, + size_t count, loff_t *offset) +{ + struct inode *inode = file->f_path.dentry->d_inode; + struct task_struct *p; + + WARN_ON(!inode); + + p = get_proc_task(inode); + if (!p) + return -ESRCH; + proc_sched_set_task(p); + + put_task_struct(p); + + return count; +} + +static int sched_open(struct inode *inode, struct file *filp) +{ + int ret; + + ret = single_open(filp, sched_show, NULL); + if (!ret) { + struct seq_file *m = filp->private_data; + + m->private = inode; + } + return ret; +} + +const struct file_operations proc_pid_sched_operations = { + .open = sched_open, + .read = seq_read, + .write = sched_write, + .llseek = seq_lseek, + .release = seq_release, +}; + static void *proc_pid_follow_link(struct dentry *dentry, struct nameidata *nd) { struct inode *inode = dentry->d_inode; @@ -1933,6 +1993,7 @@ static const struct pid_entry tgid_base_ INF("environ", S_IRUSR, pid_environ), INF("auxv", S_IRUSR, pid_auxv), INF("status", S_IRUGO, pid_status), + REG("sched", S_IRUGO|S_IWUSR, pid_sched), INF("cmdline", S_IRUGO, pid_cmdline), INF("stat", S_IRUGO, tgid_stat), INF("statm", S_IRUGO, pid_statm), @@ -2224,6 +2285,7 @@ static const struct pid_entry tid_base_s INF("environ", S_IRUSR, pid_environ), INF("auxv", S_IRUSR, pid_auxv), INF("status", S_IRUGO, pid_status), + REG("sched", S_IRUGO|S_IWUSR, pid_sched), INF("cmdline", S_IRUGO, pid_cmdline), INF("stat", S_IRUGO, tid_stat), INF("statm", S_IRUGO, pid_statm), diff -puN include/asm-generic/bitops/sched.h~cfs-scheduler include/asm-generic/bitops/sched.h --- a/include/asm-generic/bitops/sched.h~cfs-scheduler +++ a/include/asm-generic/bitops/sched.h @@ -6,28 +6,23 @@ /* * Every architecture must define this function. It's the fastest - * way of searching a 140-bit bitmap where the first 100 bits are - * unlikely to be set. It's guaranteed that at least one of the 140 - * bits is cleared. + * way of searching a 100-bit bitmap. It's guaranteed that at least + * one of the 100 bits is cleared. 
*/ static inline int sched_find_first_bit(const unsigned long *b) { #if BITS_PER_LONG == 64 - if (unlikely(b[0])) + if (b[0]) return __ffs(b[0]); - if (likely(b[1])) - return __ffs(b[1]) + 64; - return __ffs(b[2]) + 128; + return __ffs(b[1]) + 64; #elif BITS_PER_LONG == 32 - if (unlikely(b[0])) + if (b[0]) return __ffs(b[0]); - if (unlikely(b[1])) + if (b[1]) return __ffs(b[1]) + 32; - if (unlikely(b[2])) + if (b[2]) return __ffs(b[2]) + 64; - if (b[3]) - return __ffs(b[3]) + 96; - return __ffs(b[4]) + 128; + return __ffs(b[3]) + 96; #else #error BITS_PER_LONG not defined #endif diff -puN include/asm-i386/unistd.h~cfs-scheduler include/asm-i386/unistd.h --- a/include/asm-i386/unistd.h~cfs-scheduler +++ a/include/asm-i386/unistd.h @@ -332,10 +332,11 @@ #define __NR_revokeat 324 #define __NR_frevoke 325 #define __NR_fallocate 326 +#define __NR_sched_yield_to 327 #ifdef __KERNEL__ -#define NR_syscalls 327 +#define NR_syscalls 328 #define __ARCH_WANT_IPC_PARSE_VERSION #define __ARCH_WANT_OLD_READDIR diff -puN include/asm-x86_64/unistd.h~cfs-scheduler include/asm-x86_64/unistd.h --- a/include/asm-x86_64/unistd.h~cfs-scheduler +++ a/include/asm-x86_64/unistd.h @@ -632,6 +632,8 @@ __SYSCALL(__NR_timerfd, sys_timerfd) __SYSCALL(__NR_eventfd, sys_eventfd) #define __NR_fallocate 284 __SYSCALL(__NR_fallocate, sys_fallocate) +#define __NR_sched_yield_to 285 +__SYSCALL(__NR_sched_yield_to, sys_sched_yield_to) #ifndef __NO_STUBS #define __ARCH_WANT_OLD_READDIR diff -puN include/linux/hardirq.h~cfs-scheduler include/linux/hardirq.h --- a/include/linux/hardirq.h~cfs-scheduler +++ a/include/linux/hardirq.h @@ -79,6 +79,19 @@ #endif #ifdef CONFIG_PREEMPT +# define PREEMPT_CHECK_OFFSET 1 +#else +# define PREEMPT_CHECK_OFFSET 0 +#endif + +/* + * Check whether we were atomic before we did preempt_disable(): + * (used by the scheduler) + */ +#define in_atomic_preempt_off() \ + ((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET) + +#ifdef CONFIG_PREEMPT # define preemptible() (preempt_count() == 0 && !irqs_disabled()) # define IRQ_EXIT_OFFSET (HARDIRQ_OFFSET-1) #else diff -puN include/linux/sched.h~cfs-scheduler include/linux/sched.h --- a/include/linux/sched.h~cfs-scheduler +++ a/include/linux/sched.h @@ -2,7 +2,6 @@ #define _LINUX_SCHED_H #include /* For AT_VECTOR_SIZE */ - /* * cloning flags: */ @@ -37,6 +36,8 @@ #ifdef __KERNEL__ +#include /* For run_node */ + struct sched_param { int sched_priority; }; @@ -130,6 +131,9 @@ extern unsigned long nr_active(void); extern unsigned long nr_iowait(void); extern unsigned long weighted_cpuload(const int cpu); +struct seq_file; +extern void proc_sched_show_task(struct task_struct *p, struct seq_file *m); +extern void proc_sched_set_task(struct task_struct *p); /* * Task state bitmask. NOTE! These bits are also @@ -479,7 +483,7 @@ struct signal_struct { * from jiffies_to_ns(utime + stime) if sched_clock uses something * other than jiffies.) 
*/ - unsigned long long sched_time; + unsigned long long sum_sched_runtime; /* * We don't bother to synchronize most readers of this at all, @@ -539,6 +543,7 @@ struct signal_struct { #define MAX_RT_PRIO MAX_USER_RT_PRIO #define MAX_PRIO (MAX_RT_PRIO + 40) +#define DEFAULT_PRIO (MAX_RT_PRIO + 20) #define rt_prio(prio) unlikely((prio) < MAX_RT_PRIO) #define rt_task(p) rt_prio((p)->prio) @@ -650,7 +655,14 @@ enum idle_type /* * sched-domains (multiprocessor balancing) declarations: */ -#define SCHED_LOAD_SCALE 128UL /* increase resolution of load */ + +/* + * Increase resolution of nice-level calculations: + */ +#define SCHED_LOAD_SHIFT 10 +#define SCHED_LOAD_SCALE (1UL << SCHED_LOAD_SHIFT) + +#define SCHED_LOAD_SCALE_FUZZ (SCHED_LOAD_SCALE >> 5) #ifdef CONFIG_SMP #define SD_LOAD_BALANCE 1 /* Do load balancing on this domain. */ @@ -750,12 +762,6 @@ struct sched_domain { #endif }; -/* - * Maximum cache size the migration-costs auto-tuning code will - * search from: - */ -extern unsigned int max_cache_size; - #endif /* CONFIG_SMP */ @@ -806,14 +812,28 @@ struct mempolicy; struct pipe_inode_info; struct uts_namespace; -enum sleep_type { - SLEEP_NORMAL, - SLEEP_NONINTERACTIVE, - SLEEP_INTERACTIVE, - SLEEP_INTERRUPTED, -}; +struct rq; -struct prio_array; +struct sched_class { + struct sched_class *next; + + void (*enqueue_task) (struct rq *rq, struct task_struct *p, + int wakeup, u64 now); + void (*dequeue_task) (struct rq *rq, struct task_struct *p, + int sleep, u64 now); + void (*yield_task) (struct rq *rq, struct task_struct *p, + struct task_struct *p_to); + + void (*check_preempt_curr) (struct rq *rq, struct task_struct *p); + + struct task_struct * (*pick_next_task) (struct rq *rq, u64 now); + void (*put_prev_task) (struct rq *rq, struct task_struct *p, u64 now); + + struct task_struct * (*load_balance_start) (struct rq *rq); + struct task_struct * (*load_balance_next) (struct rq *rq); + void (*task_tick) (struct rq *rq, struct task_struct *p); + void (*task_new) (struct rq *rq, struct task_struct *p); +}; struct task_struct { volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */ @@ -830,26 +850,43 @@ struct task_struct { #endif #endif int load_weight; /* for niceness load balancing purposes */ + int prio, static_prio, normal_prio; + int on_rq; struct list_head run_list; - struct prio_array *array; + struct rb_node run_node; unsigned short ioprio; #ifdef CONFIG_BLK_DEV_IO_TRACE unsigned int btrace_seq; #endif - unsigned long sleep_avg; - unsigned long long timestamp, last_ran; - unsigned long long sched_time; /* sched_clock time spent running */ - enum sleep_type sleep_type; + /* CFS scheduling class statistics fields: */ + u64 wait_start_fair; + u64 wait_start; + u64 exec_start; + u64 sleep_start, sleep_start_fair; + u64 block_start; + u64 sleep_max; + u64 block_max; + u64 exec_max; + u64 wait_max; + u64 last_ran; + + s64 wait_runtime; + u64 sum_exec_runtime; + s64 fair_key; + s64 sum_wait_runtime, sum_sleep_runtime; + unsigned long wait_runtime_overruns, wait_runtime_underruns; unsigned int policy; cpumask_t cpus_allowed; - unsigned int time_slice, first_time_slice; + unsigned int time_slice; + struct sched_class *sched_class; #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) struct sched_info sched_info; #endif + u64 nr_switches; struct list_head tasks; /* @@ -1219,8 +1256,9 @@ static inline int set_cpus_allowed(struc #endif extern unsigned long long sched_clock(void); +extern void sched_clock_unstable_event(void); extern unsigned long long 
-current_sched_time(const struct task_struct *current_task); +current_sched_runtime(const struct task_struct *current_task); /* sched_exec is called by processes performing an exec */ #ifdef CONFIG_SMP @@ -1236,6 +1274,14 @@ static inline void idle_task_exit(void) #endif extern void sched_idle_next(void); +extern char * sched_print_task_state(struct task_struct *p, char *buffer); + +extern unsigned int sysctl_sched_granularity; +extern unsigned int sysctl_sched_wakeup_granularity; +extern unsigned int sysctl_sched_runtime_limit; +extern unsigned int sysctl_sched_child_runs_first; +extern unsigned int sysctl_sched_load_smoothing; +extern int sysctl_sched_yield_bug_workaround; #ifdef CONFIG_RT_MUTEXES extern int rt_mutex_getprio(struct task_struct *p); @@ -1314,8 +1360,8 @@ extern void FASTCALL(wake_up_new_task(st #else static inline void kick_process(struct task_struct *tsk) { } #endif -extern void FASTCALL(sched_fork(struct task_struct * p, int clone_flags)); -extern void FASTCALL(sched_exit(struct task_struct * p)); +extern void sched_fork(struct task_struct * p, int clone_flags); +extern void sched_dead(struct task_struct * p); extern int in_group_p(gid_t); extern int in_egroup_p(gid_t); @@ -1635,10 +1681,7 @@ static inline unsigned int task_cpu(cons return task_thread_info(p)->cpu; } -static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) -{ - task_thread_info(p)->cpu = cpu; -} +extern void set_task_cpu(struct task_struct *p, unsigned int cpu); #else @@ -1647,6 +1690,10 @@ static inline unsigned int task_cpu(cons return 0; } +static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu) +{ +} + static inline void set_task_cpu(struct task_struct *p, unsigned int cpu) { } diff -puN include/linux/topology.h~cfs-scheduler include/linux/topology.h --- a/include/linux/topology.h~cfs-scheduler +++ a/include/linux/topology.h @@ -98,7 +98,7 @@ .cache_nice_tries = 0, \ .busy_idx = 0, \ .idle_idx = 0, \ - .newidle_idx = 1, \ + .newidle_idx = 0, \ .wake_idx = 0, \ .forkexec_idx = 0, \ .flags = SD_LOAD_BALANCE \ @@ -128,14 +128,15 @@ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ .busy_idx = 2, \ - .idle_idx = 1, \ - .newidle_idx = 2, \ + .idle_idx = 0, \ + .newidle_idx = 0, \ .wake_idx = 1, \ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ | SD_BALANCE_EXEC \ | SD_WAKE_AFFINE \ + | SD_WAKE_IDLE \ | SD_SHARE_PKG_RESOURCES\ | BALANCE_FOR_MC_POWER, \ .last_balance = jiffies, \ @@ -158,14 +159,15 @@ .imbalance_pct = 125, \ .cache_nice_tries = 1, \ .busy_idx = 2, \ - .idle_idx = 1, \ - .newidle_idx = 2, \ + .idle_idx = 0, \ + .newidle_idx = 0, \ .wake_idx = 1, \ .forkexec_idx = 1, \ .flags = SD_LOAD_BALANCE \ | SD_BALANCE_NEWIDLE \ | SD_BALANCE_EXEC \ | SD_WAKE_AFFINE \ + | SD_WAKE_IDLE \ | BALANCE_FOR_PKG_POWER,\ .last_balance = jiffies, \ .balance_interval = 1, \ diff -puN init/main.c~cfs-scheduler init/main.c --- a/init/main.c~cfs-scheduler +++ a/init/main.c @@ -436,7 +436,7 @@ static void noinline __init_refok rest_i /* * The boot idle thread must execute schedule() - * at least one to get things moving: + * at least once to get things moving: */ preempt_enable_no_resched(); schedule(); diff -puN kernel/exit.c~cfs-scheduler kernel/exit.c --- a/kernel/exit.c~cfs-scheduler +++ a/kernel/exit.c @@ -123,9 +123,9 @@ static void __exit_signal(struct task_st sig->maj_flt += tsk->maj_flt; sig->nvcsw += tsk->nvcsw; sig->nivcsw += tsk->nivcsw; - sig->sched_time += tsk->sched_time; sig->inblock += task_io_get_inblock(tsk); sig->oublock += 
task_io_get_oublock(tsk); + sig->sum_sched_runtime += tsk->sum_exec_runtime; sig = NULL; /* Marker for below. */ } @@ -183,7 +183,6 @@ repeat: zap_leader = (leader->exit_signal == -1); } - sched_exit(p); write_unlock_irq(&tasklist_lock); proc_flush_task(p); release_thread(p); diff -puN kernel/fork.c~cfs-scheduler kernel/fork.c --- a/kernel/fork.c~cfs-scheduler +++ a/kernel/fork.c @@ -117,6 +117,7 @@ void __put_task_struct(struct task_struc WARN_ON(!(tsk->exit_state & (EXIT_DEAD | EXIT_ZOMBIE))); WARN_ON(atomic_read(&tsk->usage)); WARN_ON(tsk == current); + sched_dead(tsk); security_task_free(tsk); free_uid(tsk->user); @@ -878,7 +879,7 @@ static inline int copy_signal(unsigned l sig->nvcsw = sig->nivcsw = sig->cnvcsw = sig->cnivcsw = 0; sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0; sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0; - sig->sched_time = 0; + sig->sum_sched_runtime = 0; INIT_LIST_HEAD(&sig->cpu_timers[0]); INIT_LIST_HEAD(&sig->cpu_timers[1]); INIT_LIST_HEAD(&sig->cpu_timers[2]); @@ -1036,7 +1037,7 @@ static struct task_struct *copy_process( p->utime = cputime_zero; p->stime = cputime_zero; - p->sched_time = 0; + #ifdef CONFIG_TASK_XACCT p->rchar = 0; /* I/O counter: bytes read */ p->wchar = 0; /* I/O counter: bytes written */ diff -puN kernel/posix-cpu-timers.c~cfs-scheduler kernel/posix-cpu-timers.c --- a/kernel/posix-cpu-timers.c~cfs-scheduler +++ a/kernel/posix-cpu-timers.c @@ -161,7 +161,7 @@ static inline cputime_t virt_ticks(struc } static inline unsigned long long sched_ns(struct task_struct *p) { - return (p == current) ? current_sched_time(p) : p->sched_time; + return (p == current) ? current_sched_runtime(p) : p->sum_exec_runtime; } int posix_cpu_clock_getres(const clockid_t which_clock, struct timespec *tp) @@ -246,10 +246,10 @@ static int cpu_clock_sample_group_locked } while (t != p); break; case CPUCLOCK_SCHED: - cpu->sched = p->signal->sched_time; + cpu->sched = p->signal->sum_sched_runtime; /* Add in each other live thread. 
*/ while ((t = next_thread(t)) != p) { - cpu->sched += t->sched_time; + cpu->sched += t->sum_exec_runtime; } cpu->sched += sched_ns(p); break; @@ -422,7 +422,7 @@ int posix_cpu_timer_del(struct k_itimer */ static void cleanup_timers(struct list_head *head, cputime_t utime, cputime_t stime, - unsigned long long sched_time) + unsigned long long sum_exec_runtime) { struct cpu_timer_list *timer, *next; cputime_t ptime = cputime_add(utime, stime); @@ -451,10 +451,10 @@ static void cleanup_timers(struct list_h ++head; list_for_each_entry_safe(timer, next, head, entry) { list_del_init(&timer->entry); - if (timer->expires.sched < sched_time) { + if (timer->expires.sched < sum_exec_runtime) { timer->expires.sched = 0; } else { - timer->expires.sched -= sched_time; + timer->expires.sched -= sum_exec_runtime; } } } @@ -467,7 +467,7 @@ static void cleanup_timers(struct list_h void posix_cpu_timers_exit(struct task_struct *tsk) { cleanup_timers(tsk->cpu_timers, - tsk->utime, tsk->stime, tsk->sched_time); + tsk->utime, tsk->stime, tsk->sum_exec_runtime); } void posix_cpu_timers_exit_group(struct task_struct *tsk) @@ -475,7 +475,7 @@ void posix_cpu_timers_exit_group(struct cleanup_timers(tsk->signal->cpu_timers, cputime_add(tsk->utime, tsk->signal->utime), cputime_add(tsk->stime, tsk->signal->stime), - tsk->sched_time + tsk->signal->sched_time); + tsk->sum_exec_runtime + tsk->signal->sum_sched_runtime); } @@ -536,7 +536,7 @@ static void process_timer_rebalance(stru nsleft = max_t(unsigned long long, nsleft, 1); do { if (likely(!(t->flags & PF_EXITING))) { - ns = t->sched_time + nsleft; + ns = t->sum_exec_runtime + nsleft; if (t->it_sched_expires == 0 || t->it_sched_expires > ns) { t->it_sched_expires = ns; @@ -1004,7 +1004,7 @@ static void check_thread_timers(struct t struct cpu_timer_list *t = list_first_entry(timers, struct cpu_timer_list, entry); - if (!--maxfire || tsk->sched_time < t->expires.sched) { + if (!--maxfire || tsk->sum_exec_runtime < t->expires.sched) { tsk->it_sched_expires = t->expires.sched; break; } @@ -1024,7 +1024,7 @@ static void check_process_timers(struct int maxfire; struct signal_struct *const sig = tsk->signal; cputime_t utime, stime, ptime, virt_expires, prof_expires; - unsigned long long sched_time, sched_expires; + unsigned long long sum_sched_runtime, sched_expires; struct task_struct *t; struct list_head *timers = sig->cpu_timers; @@ -1044,12 +1044,12 @@ static void check_process_timers(struct */ utime = sig->utime; stime = sig->stime; - sched_time = sig->sched_time; + sum_sched_runtime = sig->sum_sched_runtime; t = tsk; do { utime = cputime_add(utime, t->utime); stime = cputime_add(stime, t->stime); - sched_time += t->sched_time; + sum_sched_runtime += t->sum_exec_runtime; t = next_thread(t); } while (t != tsk); ptime = cputime_add(utime, stime); @@ -1090,7 +1090,7 @@ static void check_process_timers(struct struct cpu_timer_list *t = list_first_entry(timers, struct cpu_timer_list, entry); - if (!--maxfire || sched_time < t->expires.sched) { + if (!--maxfire || sum_sched_runtime < t->expires.sched) { sched_expires = t->expires.sched; break; } @@ -1182,7 +1182,7 @@ static void check_process_timers(struct virt_left = cputime_sub(virt_expires, utime); virt_left = cputime_div_non_zero(virt_left, nthreads); if (sched_expires) { - sched_left = sched_expires - sched_time; + sched_left = sched_expires - sum_sched_runtime; do_div(sched_left, nthreads); sched_left = max_t(unsigned long long, sched_left, 1); } else { @@ -1208,7 +1208,7 @@ static void check_process_timers(struct 
t->it_virt_expires = ticks; } - sched = t->sched_time + sched_left; + sched = t->sum_exec_runtime + sched_left; if (sched_expires && (t->it_sched_expires == 0 || t->it_sched_expires > sched)) { t->it_sched_expires = sched; @@ -1300,7 +1300,7 @@ void run_posix_cpu_timers(struct task_st if (UNEXPIRED(prof) && UNEXPIRED(virt) && (tsk->it_sched_expires == 0 || - tsk->sched_time < tsk->it_sched_expires)) + tsk->sum_exec_runtime < tsk->it_sched_expires)) return; #undef UNEXPIRED diff -puN kernel/sched.c~cfs-scheduler kernel/sched.c --- a/kernel/sched.c~cfs-scheduler +++ a/kernel/sched.c @@ -16,6 +16,11 @@ * by Davide Libenzi, preemptible kernel bits by Robert Love. * 2003-09-03 Interactivity tuning by Con Kolivas. * 2004-04-02 Scheduler domains code by Nick Piggin + * 2007-04-15 Work begun on replacing all interactivity tuning with a + * fair scheduling design by Con Kolivas. + * 2007-05-05 Load balancing (smp-nice) and other improvements + * by Peter Williams + * 2007-05-06 Interactivity improvements to CFS by Mike Galbraith */ #include @@ -100,131 +105,13 @@ unsigned long long __attribute__((weak)) */ #define MIN_TIMESLICE max(5 * HZ / 1000, 1) #define DEF_TIMESLICE (100 * HZ / 1000) -#define ON_RUNQUEUE_WEIGHT 30 -#define CHILD_PENALTY 95 -#define PARENT_PENALTY 100 -#define EXIT_WEIGHT 3 -#define PRIO_BONUS_RATIO 25 -#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100) -#define INTERACTIVE_DELTA 2 -#define MAX_SLEEP_AVG (DEF_TIMESLICE * MAX_BONUS) -#define STARVATION_LIMIT (MAX_SLEEP_AVG) -#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG)) - -/* - * If a task is 'interactive' then we reinsert it in the active - * array after it has expired its current timeslice. (it will not - * continue to run immediately, it will still roundrobin with - * other interactive tasks.) - * - * This part scales the interactivity limit depending on niceness. - * - * We scale it linearly, offset by the INTERACTIVE_DELTA delta. - * Here are a few examples of different nice levels: - * - * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0] - * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0] - * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0] - * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0] - * - * (the X axis represents the possible -5 ... 0 ... +5 dynamic - * priority range a task can explore, a value of '1' means the - * task is rated interactive.) - * - * Ie. nice +19 tasks can never get 'interactive' enough to be - * reinserted into the active array. And only heavily CPU-hog nice -20 - * tasks will be expired. Default nice 0 tasks are somewhere between, - * it takes some effort for them to get interactive, but it's not - * too hard. - */ - -#define CURRENT_BONUS(p) \ - (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \ - MAX_SLEEP_AVG) - -#define GRANULARITY (10 * HZ / 1000 ? : 1) - -#ifdef CONFIG_SMP -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \ - num_online_cpus()) -#else -#define TIMESLICE_GRANULARITY(p) (GRANULARITY * \ - (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? 
: 1) - 1))) -#endif - -#define SCALE(v1,v1_max,v2_max) \ - (v1) * (v2_max) / (v1_max) - -#define DELTA(p) \ - (SCALE(TASK_NICE(p) + 20, 40, MAX_BONUS) - 20 * MAX_BONUS / 40 + \ - INTERACTIVE_DELTA) - -#define TASK_INTERACTIVE(p) \ - ((p)->prio <= (p)->static_prio - DELTA(p)) - -#define INTERACTIVE_SLEEP(p) \ - (JIFFIES_TO_NS(MAX_SLEEP_AVG * \ - (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1)) - -#define TASK_PREEMPTS_CURR(p, rq) \ - ((p)->prio < (rq)->curr->prio) - -#define SCALE_PRIO(x, prio) \ - max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) - -static unsigned int static_prio_timeslice(int static_prio) -{ - if (static_prio < NICE_TO_PRIO(0)) - return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio); - else - return SCALE_PRIO(DEF_TIMESLICE, static_prio); -} - -#ifdef CONFIG_SMP -/* - * Divide a load by a sched group cpu_power : (load / sg->__cpu_power) - * Since cpu_power is a 'constant', we can use a reciprocal divide. - */ -static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load) -{ - return reciprocal_divide(load, sg->reciprocal_cpu_power); -} /* - * Each time a sched group cpu_power is changed, - * we must compute its reciprocal value - */ -static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val) -{ - sg->__cpu_power += val; - sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power); -} -#endif - -/* - * task_timeslice() scales user-nice values [ -20 ... 0 ... 19 ] - * to time slice values: [800ms ... 100ms ... 5ms] - * - * The higher a thread's priority, the bigger timeslices - * it gets during one round of execution. But even the lowest - * priority thread gets MIN_TIMESLICE worth of execution time. + * This is the priority-queue data structure of the RT scheduling class: */ - -static inline unsigned int task_timeslice(struct task_struct *p) -{ - return static_prio_timeslice(p->static_prio); -} - -/* - * These are the runqueue data structures: - */ - struct prio_array { - unsigned int nr_active; - DECLARE_BITMAP(bitmap, MAX_PRIO+1); /* include 1 bit for delimiter */ - struct list_head queue[MAX_PRIO]; + DECLARE_BITMAP(bitmap, MAX_RT_PRIO+1); /* include 1 bit for delimiter */ + struct list_head queue[MAX_RT_PRIO]; }; /* @@ -241,16 +128,17 @@ struct rq { * nr_running and cpu_load should be in the same cacheline because * remote CPUs use both these fields when doing load calculation. 
*/ - unsigned long nr_running; + long nr_running; unsigned long raw_weighted_load; -#ifdef CONFIG_SMP - unsigned long cpu_load[3]; + #define CPU_LOAD_IDX_MAX 5 + unsigned long cpu_load[CPU_LOAD_IDX_MAX]; + unsigned char idle_at_tick; #ifdef CONFIG_NO_HZ unsigned char in_nohz_recently; #endif -#endif - unsigned long long nr_switches; + u64 nr_switches; + unsigned long nr_load_updates; /* * This is part of a global counter where only the total sum @@ -260,14 +148,30 @@ struct rq { */ unsigned long nr_uninterruptible; - unsigned long expired_timestamp; - /* Cached timestamp set by update_cpu_clock() */ - unsigned long long most_recent_timestamp; struct task_struct *curr, *idle; unsigned long next_balance; struct mm_struct *prev_mm; - struct prio_array *active, *expired, arrays[2]; - int best_expired_prio; + + u64 clock, prev_clock_raw; + s64 clock_max_delta; + u64 fair_clock, prev_fair_clock; + u64 exec_clock, prev_exec_clock; + s64 wait_runtime; + unsigned long wait_runtime_overruns, wait_runtime_underruns; + + unsigned int clock_warps; + unsigned int clock_unstable_events; + + struct sched_class *load_balance_class; + + struct prio_array active; + int rt_load_balance_idx; + struct list_head *rt_load_balance_head, *rt_load_balance_curr; + + struct rb_root tasks_timeline; + struct rb_node *rb_leftmost; + struct rb_node *rb_load_balance_curr; + atomic_t nr_iowait; #ifdef CONFIG_SMP @@ -307,6 +211,77 @@ struct rq { static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); static DEFINE_MUTEX(sched_hotcpu_mutex); +static inline void check_preempt_curr(struct rq *rq, struct task_struct *p) +{ + rq->curr->sched_class->check_preempt_curr(rq, p); +} + +#define SCALE_PRIO(x, prio) \ + max(x * (MAX_PRIO - prio) / (MAX_USER_PRIO / 2), MIN_TIMESLICE) + +/* + * static_prio_timeslice() scales user-nice values [ -20 ... 0 ... 19 ] + * to time slice values: [800ms ... 100ms ... 5ms] + */ +static unsigned int static_prio_timeslice(int static_prio) +{ + if (static_prio == NICE_TO_PRIO(19)) + return 1; + + if (static_prio < NICE_TO_PRIO(0)) + return SCALE_PRIO(DEF_TIMESLICE * 4, static_prio); + else + return SCALE_PRIO(DEF_TIMESLICE, static_prio); +} + +#ifdef CONFIG_SMP +/* + * Divide a load by a sched group cpu_power : (load / sg->__cpu_power) + * Since cpu_power is a 'constant', we can use a reciprocal divide. 
+ */ +static inline u32 sg_div_cpu_power(const struct sched_group *sg, u32 load) +{ + return reciprocal_divide(load, sg->reciprocal_cpu_power); +} + +/* + * Each time a sched group cpu_power is changed, + * we must compute its reciprocal value + */ +static inline void sg_inc_cpu_power(struct sched_group *sg, u32 val) +{ + sg->__cpu_power += val; + sg->reciprocal_cpu_power = reciprocal_value(sg->__cpu_power); +} +#endif +/* + * Per-runqueue clock, as finegrained as the platform can give us: + */ +static inline unsigned long long __rq_clock(struct rq *rq) +{ + u64 now = sched_clock(); + u64 clock = rq->clock; + u64 prev_raw = rq->prev_clock_raw; + s64 delta = now - prev_raw; + + /* + * Protect against sched_clock() occasionally going backwards: + */ + if (unlikely(delta < 0)) { + clock++; + rq->clock_warps++; + } else { + if (unlikely(delta > rq->clock_max_delta)) + rq->clock_max_delta = delta; + clock += delta; + } + + rq->prev_clock_raw = now; + rq->clock = clock; + + return clock; +} + static inline int cpu_of(struct rq *rq) { #ifdef CONFIG_SMP @@ -316,6 +291,16 @@ static inline int cpu_of(struct rq *rq) #endif } +static inline unsigned long long rq_clock(struct rq *rq) +{ + int this_cpu = smp_processor_id(); + + if (this_cpu == cpu_of(rq)) + return __rq_clock(rq); + + return rq->clock; +} + /* * The domain tree (rq->sd) is protected by RCU's quiescent state transition. * See detach_destroy_domains: synchronize_sched for details. @@ -460,134 +445,6 @@ static inline void task_rq_unlock(struct spin_unlock_irqrestore(&rq->lock, *flags); } -#ifdef CONFIG_SCHEDSTATS -/* - * bump this up when changing the output format or the meaning of an existing - * format, so that tools can adapt (or abort) - */ -#define SCHEDSTAT_VERSION 14 - -static int show_schedstat(struct seq_file *seq, void *v) -{ - int cpu; - - seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION); - seq_printf(seq, "timestamp %lu\n", jiffies); - for_each_online_cpu(cpu) { - struct rq *rq = cpu_rq(cpu); -#ifdef CONFIG_SMP - struct sched_domain *sd; - int dcnt = 0; -#endif - - /* runqueue-specific stats */ - seq_printf(seq, - "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu", - cpu, rq->yld_both_empty, - rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt, - rq->sched_switch, rq->sched_cnt, rq->sched_goidle, - rq->ttwu_cnt, rq->ttwu_local, - rq->rq_sched_info.cpu_time, - rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt); - - seq_printf(seq, "\n"); - -#ifdef CONFIG_SMP - /* domain-specific stats */ - preempt_disable(); - for_each_domain(cpu, sd) { - enum idle_type itype; - char mask_str[NR_CPUS]; - - cpumask_scnprintf(mask_str, NR_CPUS, sd->span); - seq_printf(seq, "domain%d %s", dcnt++, mask_str); - for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; - itype++) { - seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu " - "%lu", - sd->lb_cnt[itype], - sd->lb_balanced[itype], - sd->lb_failed[itype], - sd->lb_imbalance[itype], - sd->lb_gained[itype], - sd->lb_hot_gained[itype], - sd->lb_nobusyq[itype], - sd->lb_nobusyg[itype]); - } - seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu" - " %lu %lu %lu\n", - sd->alb_cnt, sd->alb_failed, sd->alb_pushed, - sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed, - sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed, - sd->ttwu_wake_remote, sd->ttwu_move_affine, - sd->ttwu_move_balance); - } - preempt_enable(); -#endif - } - return 0; -} - -static int schedstat_open(struct inode *inode, struct file *file) -{ - unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32); - char *buf = kmalloc(size, GFP_KERNEL); 
- struct seq_file *m; - int res; - - if (!buf) - return -ENOMEM; - res = single_open(file, show_schedstat, NULL); - if (!res) { - m = file->private_data; - m->buf = buf; - m->size = size; - } else - kfree(buf); - return res; -} - -const struct file_operations proc_schedstat_operations = { - .open = schedstat_open, - .read = seq_read, - .llseek = seq_lseek, - .release = single_release, -}; - -/* - * Expects runqueue lock to be held for atomicity of update - */ -static inline void -rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) -{ - if (rq) { - rq->rq_sched_info.run_delay += delta_jiffies; - rq->rq_sched_info.pcnt++; - } -} - -/* - * Expects runqueue lock to be held for atomicity of update - */ -static inline void -rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) -{ - if (rq) - rq->rq_sched_info.cpu_time += delta_jiffies; -} -# define schedstat_inc(rq, field) do { (rq)->field++; } while (0) -# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) -#else /* !CONFIG_SCHEDSTATS */ -static inline void -rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) -{} -static inline void -rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) -{} -# define schedstat_inc(rq, field) do { } while (0) -# define schedstat_add(rq, field, amt) do { } while (0) -#endif - /* * this_rq_lock - lock this runqueue and disable interrupts. */ @@ -603,178 +460,72 @@ static inline struct rq *this_rq_lock(vo return rq; } -#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) /* - * Called when a process is dequeued from the active array and given - * the cpu. We should note that with the exception of interactive - * tasks, the expired queue will become the active queue after the active - * queue is empty, without explicitly dequeuing and requeuing tasks in the - * expired queue. (Interactive tasks may be requeued directly to the - * active queue, thus delaying tasks in the expired queue from running; - * see scheduler_tick()). - * - * This function is only called from sched_info_arrive(), rather than - * dequeue_task(). Even though a task may be queued and dequeued multiple - * times as it is shuffled about, we're really interested in knowing how - * long it was from the *first* time it was queued to the time that it - * finally hit a cpu. + * CPU frequency is/was unstable - start new by setting prev_clock_raw: */ -static inline void sched_info_dequeued(struct task_struct *t) +void sched_clock_unstable_event(void) { - t->sched_info.last_queued = 0; -} - -/* - * Called when a task finally hits the cpu. We can now calculate how - * long it was waiting to run. We also note when it began so that we - * can keep stats on how long its timeslice is. - */ -static void sched_info_arrive(struct task_struct *t) -{ - unsigned long now = jiffies, delta_jiffies = 0; - - if (t->sched_info.last_queued) - delta_jiffies = now - t->sched_info.last_queued; - sched_info_dequeued(t); - t->sched_info.run_delay += delta_jiffies; - t->sched_info.last_arrival = now; - t->sched_info.pcnt++; + unsigned long flags; + struct rq *rq; - rq_sched_info_arrive(task_rq(t), delta_jiffies); + rq = task_rq_lock(current, &flags); + rq->prev_clock_raw = sched_clock(); + rq->clock_unstable_events++; + task_rq_unlock(rq, &flags); } /* - * Called when a process is queued into either the active or expired - * array. The time is noted and later used to determine how long we - * had to wait for us to reach the cpu. 
Since the expired queue will - * become the active queue after active queue is empty, without dequeuing - * and requeuing any tasks, we are interested in queuing to either. It - * is unusual but not impossible for tasks to be dequeued and immediately - * requeued in the same or another array: this can happen in sched_yield(), - * set_user_nice(), and even load_balance() as it moves tasks from runqueue - * to runqueue. + * resched_task - mark a task 'to be rescheduled now'. * - * This function is only called from enqueue_task(), but also only updates - * the timestamp if it is already not set. It's assumed that - * sched_info_dequeued() will clear that stamp when appropriate. - */ -static inline void sched_info_queued(struct task_struct *t) -{ - if (unlikely(sched_info_on())) - if (!t->sched_info.last_queued) - t->sched_info.last_queued = jiffies; -} - -/* - * Called when a process ceases being the active-running process, either - * voluntarily or involuntarily. Now we can calculate how long we ran. + * On UP this means the setting of the need_resched flag, on SMP it + * might also involve a cross-CPU call to trigger the scheduler on + * the target CPU. */ -static inline void sched_info_depart(struct task_struct *t) -{ - unsigned long delta_jiffies = jiffies - t->sched_info.last_arrival; +#ifdef CONFIG_SMP - t->sched_info.cpu_time += delta_jiffies; - rq_sched_info_depart(task_rq(t), delta_jiffies); -} +#ifndef tsk_is_polling +#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG) +#endif -/* - * Called when tasks are switched involuntarily due, typically, to expiring - * their time slice. (This may also be called when switching to or from - * the idle task.) We are only called when prev != next. - */ -static inline void -__sched_info_switch(struct task_struct *prev, struct task_struct *next) +static void resched_task(struct task_struct *p) { - struct rq *rq = task_rq(prev); + int cpu; - /* - * prev now departs the cpu. It's not interesting to record - * stats about how efficient we were at scheduling the idle - * process, however. - */ - if (prev != rq->idle) - sched_info_depart(prev); + assert_spin_locked(&task_rq(p)->lock); - if (next != rq->idle) - sched_info_arrive(next); -} -static inline void -sched_info_switch(struct task_struct *prev, struct task_struct *next) -{ - if (unlikely(sched_info_on())) - __sched_info_switch(prev, next); -} -#else -#define sched_info_queued(t) do { } while (0) -#define sched_info_switch(t, next) do { } while (0) -#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */ + if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED))) + return; -/* - * Adding/removing a task to/from a priority array: - */ -static void dequeue_task(struct task_struct *p, struct prio_array *array) -{ - array->nr_active--; - list_del(&p->run_list); - if (list_empty(array->queue + p->prio)) - __clear_bit(p->prio, array->bitmap); -} + set_tsk_thread_flag(p, TIF_NEED_RESCHED); -static void enqueue_task(struct task_struct *p, struct prio_array *array) -{ - sched_info_queued(p); - list_add_tail(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; -} + cpu = task_cpu(p); + if (cpu == smp_processor_id()) + return; -/* - * Put task to the end of the run list without the overhead of dequeue - * followed by enqueue. 
- */ -static void requeue_task(struct task_struct *p, struct prio_array *array) -{ - list_move_tail(&p->run_list, array->queue + p->prio); + /* NEED_RESCHED must be visible before we test polling */ + smp_mb(); + if (!tsk_is_polling(p)) + smp_send_reschedule(cpu); } -static inline void -enqueue_task_head(struct task_struct *p, struct prio_array *array) +static void resched_cpu(int cpu) { - list_add(&p->run_list, array->queue + p->prio); - __set_bit(p->prio, array->bitmap); - array->nr_active++; - p->array = array; -} + struct rq *rq = cpu_rq(cpu); + unsigned long flags; -/* - * __normal_prio - return the priority that is based on the static - * priority but is modified by bonuses/penalties. - * - * We scale the actual sleep average [0 .... MAX_SLEEP_AVG] - * into the -5 ... 0 ... +5 bonus/penalty range. - * - * We use 25% of the full 0...39 priority range so that: - * - * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs. - * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks. - * - * Both properties are important to certain workloads. - */ + if (!spin_trylock_irqsave(&rq->lock, flags)) + return; + resched_task(cpu_curr(cpu)); + spin_unlock_irqrestore(&rq->lock, flags); +} -static inline int __normal_prio(struct task_struct *p) +#else +static inline void resched_task(struct task_struct *p) { - int bonus, prio; - - bonus = CURRENT_BONUS(p) - MAX_BONUS / 2; - - prio = p->static_prio - bonus; - if (prio < MAX_RT_PRIO) - prio = MAX_RT_PRIO; - if (prio > MAX_PRIO-1) - prio = MAX_PRIO-1; - return prio; + assert_spin_locked(&task_rq(p)->lock); + set_tsk_need_resched(p); } +#endif /* * To aid in avoiding the subversion of "niceness" due to uneven distribution @@ -798,23 +549,26 @@ static inline int __normal_prio(struct t #define RTPRIO_TO_LOAD_WEIGHT(rp) \ (PRIO_TO_LOAD_WEIGHT(MAX_RT_PRIO) + LOAD_WEIGHT(rp)) -static void set_load_weight(struct task_struct *p) -{ - if (has_rt_policy(p)) { -#ifdef CONFIG_SMP - if (p == task_rq(p)->migration_thread) - /* - * The migration thread does the actual balancing. - * Giving its load any weight will skew balancing - * adversely. - */ - p->load_weight = 0; - else -#endif - p->load_weight = RTPRIO_TO_LOAD_WEIGHT(p->rt_priority); - } else - p->load_weight = PRIO_TO_LOAD_WEIGHT(p->static_prio); -} +#define NICE_0_LOAD SCHED_LOAD_SCALE +#define NICE_0_SHIFT SCHED_LOAD_SHIFT + +/* + * Nice levels are multiplicative, with a gentle 10% change for every + * nice level changed. I.e. when a CPU-bound task goes from nice 0 to + * nice 1, it will get ~10% less CPU time than another CPU-bound task + * that remained on nice 0. + * + * The "10% effect" is relative and cumulative: from _any_ nice level, + * if you go up 1 level, it's -10% CPU usage, if you go down 1 level + * it's +10% CPU usage. 
+ */ +const int prio_to_weight[40] = { +/* -20 */ 88818, 71054, 56843, 45475, 36380, 29104, 23283, 18626, 14901, 11921, +/* -10 */ 9537, 7629, 6103, 4883, 3906, 3125, 2500, 2000, 1600, 1280, +/* 0 */ NICE_0_LOAD /* 1024 */, +/* 1 */ 819, 655, 524, 419, 336, 268, 215, 172, 137, +/* 10 */ 110, 87, 70, 56, 45, 36, 29, 23, 18, 15, +}; static inline void inc_raw_weighted_load(struct rq *rq, const struct task_struct *p) @@ -840,6 +594,60 @@ static inline void dec_nr_running(struct dec_raw_weighted_load(rq, p); } +static void activate_task(struct rq *rq, struct task_struct *p, int wakeup); + +#include "sched_stats.h" +#include "sched_rt.c" +#include "sched_fair.c" +#include "sched_debug.c" + +#define sched_class_highest (&rt_sched_class) + +static void set_load_weight(struct task_struct *p) +{ + task_rq(p)->wait_runtime -= p->wait_runtime; + p->wait_runtime = 0; + + if (has_rt_policy(p)) { + p->load_weight = prio_to_weight[0] * 2; + return; + } + /* + * SCHED_BATCH tasks get minimal weight: + */ + if (p->policy == SCHED_BATCH) { + p->load_weight = 1; + return; + } + + p->load_weight = prio_to_weight[p->static_prio - MAX_RT_PRIO]; +} + +static void enqueue_task(struct rq *rq, struct task_struct *p, int wakeup) +{ + u64 now = rq_clock(rq); + + sched_info_queued(p); + p->sched_class->enqueue_task(rq, p, wakeup, now); + p->on_rq = 1; +} + +static void dequeue_task(struct rq *rq, struct task_struct *p, int sleep) +{ + u64 now = rq_clock(rq); + + p->sched_class->dequeue_task(rq, p, sleep, now); + p->on_rq = 0; +} + +/* + * __normal_prio - return the priority that is based on the static prio + */ +static inline int __normal_prio(struct task_struct *p) +{ + return p->static_prio; +} + /* * Calculate the expected normal priority: i.e. priority * without taking RT-inheritance into account. Might be @@ -879,222 +687,32 @@ static int effective_prio(struct task_st } /* - * __activate_task - move a task to the runqueue. + * activate_task - move a task to the runqueue. */ -static void __activate_task(struct task_struct *p, struct rq *rq) +static void activate_task(struct rq *rq, struct task_struct *p, int wakeup) { - struct prio_array *target = rq->active; - - if (batch_task(p)) - target = rq->expired; - enqueue_task(p, target); + enqueue_task(rq, p, wakeup); inc_nr_running(p, rq); } /* - * __activate_idle_task - move idle task to the _front_ of runqueue. + * activate_idle_task - move idle task to the _front_ of runqueue. */ -static inline void __activate_idle_task(struct task_struct *p, struct rq *rq) +static inline void activate_idle_task(struct task_struct *p, struct rq *rq) { - enqueue_task_head(p, rq->active); + enqueue_task(rq, p, 0); inc_nr_running(p, rq); } /* - * Recalculate p->normal_prio and p->prio after having slept, - * updating the sleep-average too: - */ -static int recalc_task_prio(struct task_struct *p, unsigned long long now) -{ - /* Caller must always ensure 'now >= p->timestamp' */ - unsigned long sleep_time = now - p->timestamp; - - if (batch_task(p)) - sleep_time = 0; - - if (likely(sleep_time > 0)) { - /* - * This ceiling is set to the lowest priority that would allow - * a task to be reinserted into the active array on timeslice - * completion. - */ - unsigned long ceiling = INTERACTIVE_SLEEP(p); - - if (p->mm && sleep_time > ceiling && p->sleep_avg < ceiling) { - /* - * Prevents user tasks from achieving best priority - * with one single large enough sleep. 
- */ - p->sleep_avg = ceiling; - /* - * Using INTERACTIVE_SLEEP() as a ceiling places a - * nice(0) task 1ms sleep away from promotion, and - * gives it 700ms to round-robin with no chance of - * being demoted. This is more than generous, so - * mark this sleep as non-interactive to prevent the - * on-runqueue bonus logic from intervening should - * this task not receive cpu immediately. - */ - p->sleep_type = SLEEP_NONINTERACTIVE; - } else { - /* - * Tasks waking from uninterruptible sleep are - * limited in their sleep_avg rise as they - * are likely to be waiting on I/O - */ - if (p->sleep_type == SLEEP_NONINTERACTIVE && p->mm) { - if (p->sleep_avg >= ceiling) - sleep_time = 0; - else if (p->sleep_avg + sleep_time >= - ceiling) { - p->sleep_avg = ceiling; - sleep_time = 0; - } - } - - /* - * This code gives a bonus to interactive tasks. - * - * The boost works by updating the 'average sleep time' - * value here, based on ->timestamp. The more time a - * task spends sleeping, the higher the average gets - - * and the higher the priority boost gets as well. - */ - p->sleep_avg += sleep_time; - - } - if (p->sleep_avg > NS_MAX_SLEEP_AVG) - p->sleep_avg = NS_MAX_SLEEP_AVG; - } - - return effective_prio(p); -} - -/* - * activate_task - move a task to the runqueue and do priority recalculation - * - * Update all the scheduling statistics stuff. (sleep average - * calculation, priority modifiers, etc.) - */ -static void activate_task(struct task_struct *p, struct rq *rq, int local) -{ - unsigned long long now; - - if (rt_task(p)) - goto out; - - now = sched_clock(); -#ifdef CONFIG_SMP - if (!local) { - /* Compensate for drifting sched_clock */ - struct rq *this_rq = this_rq(); - now = (now - this_rq->most_recent_timestamp) - + rq->most_recent_timestamp; - } -#endif - - /* - * Sleep time is in units of nanosecs, so shift by 20 to get a - * milliseconds-range estimation of the amount of time that the task - * spent sleeping: - */ - if (unlikely(prof_on == SLEEP_PROFILING)) { - if (p->state == TASK_UNINTERRUPTIBLE) - profile_hits(SLEEP_PROFILING, (void *)get_wchan(p), - (now - p->timestamp) >> 20); - } - - p->prio = recalc_task_prio(p, now); - - /* - * This checks to make sure it's not an uninterruptible task - * that is now waking up. - */ - if (p->sleep_type == SLEEP_NORMAL) { - /* - * Tasks which were woken up by interrupts (ie. hw events) - * are most likely of interactive nature. So we give them - * the credit of extending their sleep time to the period - * of time they spend on the runqueue, waiting for execution - * on a CPU, first time around: - */ - if (in_interrupt()) - p->sleep_type = SLEEP_INTERRUPTED; - else { - /* - * Normal first-time wakeups get a credit too for - * on-runqueue time, but it will be weighted down: - */ - p->sleep_type = SLEEP_INTERACTIVE; - } - } - p->timestamp = now; -out: - __activate_task(p, rq); -} - -/* * deactivate_task - remove a task from the runqueue. */ -static void deactivate_task(struct task_struct *p, struct rq *rq) +static void deactivate_task(struct rq *rq, struct task_struct *p, int sleep) { + dequeue_task(rq, p, sleep); dec_nr_running(p, rq); - dequeue_task(p, p->array); - p->array = NULL; } -/* - * resched_task - mark a task 'to be rescheduled now'. - * - * On UP this means the setting of the need_resched flag, on SMP it - * might also involve a cross-CPU call to trigger the scheduler on - * the target CPU. 
- */ -#ifdef CONFIG_SMP - -#ifndef tsk_is_polling -#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG) -#endif - -static void resched_task(struct task_struct *p) -{ - int cpu; - - assert_spin_locked(&task_rq(p)->lock); - - if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED))) - return; - - set_tsk_thread_flag(p, TIF_NEED_RESCHED); - - cpu = task_cpu(p); - if (cpu == smp_processor_id()) - return; - - /* NEED_RESCHED must be visible before we test polling */ - smp_mb(); - if (!tsk_is_polling(p)) - smp_send_reschedule(cpu); -} - -static void resched_cpu(int cpu) -{ - struct rq *rq = cpu_rq(cpu); - unsigned long flags; - - if (!spin_trylock_irqsave(&rq->lock, flags)) - return; - resched_task(cpu_curr(cpu)); - spin_unlock_irqrestore(&rq->lock, flags); -} -#else -static inline void resched_task(struct task_struct *p) -{ - assert_spin_locked(&task_rq(p)->lock); - set_tsk_need_resched(p); -} -#endif - /** * task_curr - is this task currently executing on a CPU? * @p: the task in question. @@ -1111,6 +729,36 @@ unsigned long weighted_cpuload(const int } #ifdef CONFIG_SMP + +static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu) +{ + task_thread_info(p)->cpu = cpu; +} + +void set_task_cpu(struct task_struct *p, unsigned int new_cpu) +{ + int old_cpu = task_cpu(p); + struct rq *old_rq = cpu_rq(old_cpu), *new_rq = cpu_rq(new_cpu); + u64 clock_offset, fair_clock_offset; + + clock_offset = old_rq->clock - new_rq->clock; + fair_clock_offset = old_rq->fair_clock - new_rq->fair_clock; + + if (p->wait_start) + p->wait_start -= clock_offset; + if (p->wait_start_fair) + p->wait_start_fair -= fair_clock_offset; + if (p->sleep_start) + p->sleep_start -= clock_offset; + if (p->block_start) + p->block_start -= clock_offset; + if (p->sleep_start_fair) + p->sleep_start_fair -= fair_clock_offset; + + task_thread_info(p)->cpu = new_cpu; + +} + struct migration_req { struct list_head list; @@ -1133,7 +781,7 @@ migrate_task(struct task_struct *p, int * If the task is not on a runqueue (and not running), then * it is sufficient to simply update the task's cpu field. */ - if (!p->array && !task_running(rq, p)) { + if (!p->on_rq && !task_running(rq, p)) { set_task_cpu(p, dest_cpu); return 0; } @@ -1164,7 +812,7 @@ void wait_task_inactive(struct task_stru repeat: rq = task_rq_lock(p, &flags); /* Must be off runqueue entirely, not preempted. */ - if (unlikely(p->array || task_running(rq, p))) { + if (unlikely(p->on_rq || task_running(rq, p))) { /* If it's preempted, we yield. It could be a while. */ preempted = !task_running(rq, p); task_rq_unlock(rq, &flags); @@ -1341,9 +989,9 @@ static int sched_balance_self(int cpu, i struct sched_domain *tmp, *sd = NULL; for_each_domain(cpu, tmp) { - /* - * If power savings logic is enabled for a domain, stop there. - */ + /* + * If power savings logic is enabled for a domain, stop there. 
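
The clock_offset/fair_clock_offset adjustments in set_task_cpu() above keep per-task timestamps meaningful across a cross-CPU migration: only the distance between a timestamp and its runqueue's clock matters, and shifting the timestamp by the difference of the two runqueue clocks preserves that distance. A sketch with made-up numbers (illustrative only):

/*
 * Illustration of the timestamp re-basing done by set_task_cpu().
 * All values are invented for the example.
 */
#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint64_t old_rq_clock = 5000, new_rq_clock = 3000;
	uint64_t wait_start = 4000;	/* started waiting at old-rq time 4000 */
	uint64_t waited = old_rq_clock - wait_start;		/* 1000 ns so far */
	uint64_t clock_offset = old_rq_clock - new_rq_clock;	/* 2000 ns */

	wait_start -= clock_offset;	/* re-express against the new rq's clock */

	/* the accumulated wait time is unchanged on the new runqueue */
	assert(new_rq_clock - wait_start == waited);
	return 0;
}
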
+ */ if (tmp->flags & SD_POWERSAVINGS_BALANCE) break; if (tmp->flags & flag) @@ -1470,7 +1118,7 @@ static int try_to_wake_up(struct task_st if (!(old_state & state)) goto out; - if (p->array) + if (p->on_rq) goto out_running; cpu = task_cpu(p); @@ -1563,7 +1211,7 @@ out_set_cpu: old_state = p->state; if (!(old_state & state)) goto out; - if (p->array) + if (p->on_rq) goto out_running; this_cpu = smp_processor_id(); @@ -1572,25 +1220,10 @@ out_set_cpu: out_activate: #endif /* CONFIG_SMP */ - if (old_state == TASK_UNINTERRUPTIBLE) { + if (old_state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible--; - /* - * Tasks on involuntary sleep don't earn - * sleep_avg beyond just interactive state. - */ - p->sleep_type = SLEEP_NONINTERACTIVE; - } else - - /* - * Tasks that have marked their sleep as noninteractive get - * woken up with their sleep average not weighted in an - * interactive way. - */ - if (old_state & TASK_NONINTERACTIVE) - p->sleep_type = SLEEP_NONINTERACTIVE; - - activate_task(p, rq, cpu == this_cpu); + activate_task(rq, p, 1); /* * Sync wakeups (i.e. those types of wakeups where the waker * has indicated that it will leave the CPU in short order) @@ -1599,10 +1232,8 @@ out_activate: * the waker guarantees that the freshly woken up task is going * to be considered on this CPU.) */ - if (!sync || cpu != this_cpu) { - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); - } + if (!sync || cpu != this_cpu) + check_preempt_curr(rq, p); success = 1; out_running: @@ -1625,19 +1256,38 @@ int fastcall wake_up_state(struct task_s return try_to_wake_up(p, state, 0); } -static void task_running_tick(struct rq *rq, struct task_struct *p); +/* + * The task was running during this tick - call the class tick + * (to update the time slice counter and other statistics, etc.): + */ +static void task_running_tick(struct rq *rq, struct task_struct *p) +{ + spin_lock(&rq->lock); + p->sched_class->task_tick(rq, p); + spin_unlock(&rq->lock); +} + /* * Perform scheduler related setup for a newly forked process p. * p is forked by current. + * + * __sched_fork() is basic setup used by init_idle() too: */ -void fastcall sched_fork(struct task_struct *p, int clone_flags) +static void __sched_fork(struct task_struct *p) { - int cpu = get_cpu(); + p->wait_start_fair = p->wait_start = p->exec_start = p->last_ran = 0; + p->sum_exec_runtime = 0; -#ifdef CONFIG_SMP - cpu = sched_balance_self(cpu, SD_BALANCE_FORK); -#endif - set_task_cpu(p, cpu); + p->wait_runtime = 0; + + p->sum_wait_runtime = p->sum_sleep_runtime = 0; + p->sleep_start = p->sleep_start_fair = p->block_start = 0; + p->sleep_max = p->block_max = p->exec_max = p->wait_max = 0; + p->wait_runtime_overruns = p->wait_runtime_underruns = 0; + + INIT_LIST_HEAD(&p->run_list); + p->on_rq = 0; + p->nr_switches = 0; /* * We mark the process as running here, but have not actually @@ -1646,16 +1296,29 @@ void fastcall sched_fork(struct task_str * event cannot wake it up and insert it on the runqueue either. 
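
Most policy decisions now go through the p->sched_class method table (see enqueue_task(), dequeue_task() and task_running_tick() above, plus pick_next_task(), the yield path and the load-balance iterators further down). The structure itself lives in sched.h and is not part of this hunk; the sketch below only lists the hooks actually invoked in this patch, with prototypes inferred from the call sites - treat it as a reading aid, not the authoritative definition:

#include <linux/types.h>

struct rq;
struct task_struct;

/*
 * Rough sketch of the per-policy method table reached via p->sched_class.
 * Hook names and argument lists are taken from the call sites in this
 * patch; the real struct sched_class may contain further members.
 */
struct sched_class {
	struct sched_class *next;	/* rt_sched_class -> fair_sched_class -> NULL */

	void (*enqueue_task)(struct rq *rq, struct task_struct *p, int wakeup, u64 now);
	void (*dequeue_task)(struct rq *rq, struct task_struct *p, int sleep, u64 now);
	void (*yield_task)(struct rq *rq, struct task_struct *p, struct task_struct *p_to);

	struct task_struct *(*pick_next_task)(struct rq *rq, u64 now);
	void (*put_prev_task)(struct rq *rq, struct task_struct *p, u64 now);

	struct task_struct *(*load_balance_start)(struct rq *rq);
	struct task_struct *(*load_balance_next)(struct rq *rq);

	void (*task_tick)(struct rq *rq, struct task_struct *p);
	void (*task_new)(struct rq *rq, struct task_struct *p);
};
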
*/ p->state = TASK_RUNNING; +} + +/* + * fork()/clone()-time setup: + */ +void sched_fork(struct task_struct *p, int clone_flags) +{ + int cpu = get_cpu(); + + __sched_fork(p); + +#ifdef CONFIG_SMP + cpu = sched_balance_self(cpu, SD_BALANCE_FORK); +#endif + __set_task_cpu(p, cpu); /* * Make sure we do not leak PI boosting priority to the child: */ p->prio = current->normal_prio; - INIT_LIST_HEAD(&p->run_list); - p->array = NULL; #if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) - if (unlikely(sched_info_on())) + if (likely(sched_info_on())) memset(&p->sched_info, 0, sizeof(p->sched_info)); #endif #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW) @@ -1665,146 +1328,53 @@ void fastcall sched_fork(struct task_str /* Want to start with kernel preemption disabled. */ task_thread_info(p)->preempt_count = 1; #endif - /* - * Share the timeslice between parent and child, thus the - * total amount of pending timeslices in the system doesn't change, - * resulting in more scheduling fairness. - */ - local_irq_disable(); - p->time_slice = (current->time_slice + 1) >> 1; - /* - * The remainder of the first timeslice might be recovered by - * the parent if the child exits early enough. - */ - p->first_time_slice = 1; - current->time_slice >>= 1; - p->timestamp = sched_clock(); - if (unlikely(!current->time_slice)) { - /* - * This case is rare, it happens when the parent has only - * a single jiffy left from its timeslice. Taking the - * runqueue lock is not a problem. - */ - current->time_slice = 1; - task_running_tick(cpu_rq(cpu), current); - } - local_irq_enable(); put_cpu(); } /* - * wake_up_new_task - wake up a newly created task for the first time. - * - * This function will do some initial scheduler statistics housekeeping - * that must be done for every newly created context, then puts the task - * on the runqueue and wakes it. - */ -void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags) -{ - struct rq *rq, *this_rq; - unsigned long flags; - int this_cpu, cpu; - - rq = task_rq_lock(p, &flags); - BUG_ON(p->state != TASK_RUNNING); - this_cpu = smp_processor_id(); - cpu = task_cpu(p); - - /* - * We decrease the sleep average of forking parents - * and children as well, to keep max-interactive tasks - * from forking tasks that are max-interactive. The parent - * (current) is done further down, under its lock. - */ - p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) * - CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - - p->prio = effective_prio(p); - - if (likely(cpu == this_cpu)) { - if (!(clone_flags & CLONE_VM)) { - /* - * The VM isn't cloned, so we're in a good position to - * do child-runs-first in anticipation of an exec. This - * usually avoids a lot of COW overhead. - */ - if (unlikely(!current->array)) - __activate_task(p, rq); - else { - p->prio = current->prio; - p->normal_prio = current->normal_prio; - list_add_tail(&p->run_list, ¤t->run_list); - p->array = current->array; - p->array->nr_active++; - inc_nr_running(p, rq); - } - set_need_resched(); - } else - /* Run child last */ - __activate_task(p, rq); - /* - * We skip the following code due to cpu == this_cpu - * - * task_rq_unlock(rq, &flags); - * this_rq = task_rq_lock(current, &flags); - */ - this_rq = rq; - } else { - this_rq = cpu_rq(this_cpu); - - /* - * Not the local CPU - must adjust timestamp. This should - * get optimised away in the !CONFIG_SMP case. 
- */ - p->timestamp = (p->timestamp - this_rq->most_recent_timestamp) - + rq->most_recent_timestamp; - __activate_task(p, rq); - if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); - - /* - * Parent and child are on different CPUs, now get the - * parent runqueue to update the parent's ->sleep_avg: - */ - task_rq_unlock(rq, &flags); - this_rq = task_rq_lock(current, &flags); - } - current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) * - PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS); - task_rq_unlock(this_rq, &flags); -} + * After fork, child runs first. (default) If set to 0 then + * parent will (try to) run first. + */ +unsigned int __read_mostly sysctl_sched_child_runs_first = 1; /* - * Potentially available exiting-child timeslices are - * retrieved here - this way the parent does not get - * penalized for creating too many threads. + * wake_up_new_task - wake up a newly created task for the first time. * - * (this cannot be used to 'generate' timeslices - * artificially, because any timeslice recovered here - * was given away by the parent in the first place.) + * This function will do some initial scheduler statistics housekeeping + * that must be done for every newly created context, then puts the task + * on the runqueue and wakes it. */ -void fastcall sched_exit(struct task_struct *p) +void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags) { unsigned long flags; struct rq *rq; + int this_cpu; - /* - * If the child was a (relative-) CPU hog then decrease - * the sleep_avg of the parent as well. - */ - rq = task_rq_lock(p->parent, &flags); - if (p->first_time_slice && task_cpu(p) == task_cpu(p->parent)) { - p->parent->time_slice += p->time_slice; - if (unlikely(p->parent->time_slice > task_timeslice(p))) - p->parent->time_slice = task_timeslice(p); - } - if (p->sleep_avg < p->parent->sleep_avg) - p->parent->sleep_avg = p->parent->sleep_avg / - (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg / - (EXIT_WEIGHT + 1); + rq = task_rq_lock(p, &flags); + BUG_ON(p->state != TASK_RUNNING); + this_cpu = smp_processor_id(); /* parent's CPU */ + + p->prio = effective_prio(p); + + if (!sysctl_sched_child_runs_first || (clone_flags & CLONE_VM) || + task_cpu(p) != this_cpu || !current->on_rq) { + activate_task(rq, p, 0); + } else { + /* + * Let the scheduling class do new task startup + * management (if any): + */ + p->sched_class->task_new(rq, p); + } + check_preempt_curr(rq, p); task_rq_unlock(rq, &flags); } +void sched_dead(struct task_struct *p) +{ + WARN_ON_ONCE(p->on_rq); +} + /** * prepare_task_switch - prepare to switch tasks * @rq: the runqueue preparing to switch @@ -2006,17 +1576,59 @@ unsigned long nr_active(void) return running + uninterruptible; } -#ifdef CONFIG_SMP - -/* - * Is this task likely cache-hot: - */ -static inline int -task_hot(struct task_struct *p, unsigned long long now, struct sched_domain *sd) +static void update_load_fair(struct rq *this_rq) { - return (long long)(now - p->last_ran) < (long long)sd->cache_hot_time; + unsigned long this_load, fair_delta, exec_delta, idle_delta; + unsigned int i, scale; + s64 fair_delta64, exec_delta64; + unsigned long tmp; + u64 tmp64; + + this_rq->nr_load_updates++; + if (!(sysctl_sched_load_smoothing & 64)) { + this_load = this_rq->raw_weighted_load; + goto do_avg; + } + + fair_delta64 = this_rq->fair_clock - this_rq->prev_fair_clock + 1; + this_rq->prev_fair_clock = this_rq->fair_clock; + + exec_delta64 = this_rq->exec_clock - this_rq->prev_exec_clock + 1; + this_rq->prev_exec_clock = 
this_rq->exec_clock; + + if (fair_delta64 > (s64)LONG_MAX) + fair_delta64 = (s64)LONG_MAX; + fair_delta = (unsigned long)fair_delta64; + + if (exec_delta64 > (s64)LONG_MAX) + exec_delta64 = (s64)LONG_MAX; + exec_delta = (unsigned long)exec_delta64; + if (exec_delta > TICK_NSEC) + exec_delta = TICK_NSEC; + + idle_delta = TICK_NSEC - exec_delta; + + tmp = (SCHED_LOAD_SCALE * exec_delta) / fair_delta; + tmp64 = (u64)tmp * (u64)exec_delta; + do_div(tmp64, TICK_NSEC); + this_load = (unsigned long)tmp64; + +do_avg: + /* Update our load: */ + for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) { + unsigned long old_load, new_load; + + /* scale is effectively 1 << i now, and >> i divides by scale */ + + old_load = this_rq->cpu_load[i]; + new_load = this_load; + + this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i; + } } +#ifdef CONFIG_SMP + /* * double_rq_lock - safely lock two runqueues * @@ -2133,23 +1745,17 @@ void sched_exec(void) * pull_task - move a task from a remote runqueue to the local runqueue. * Both runqueues must be locked. */ -static void pull_task(struct rq *src_rq, struct prio_array *src_array, - struct task_struct *p, struct rq *this_rq, - struct prio_array *this_array, int this_cpu) +static void pull_task(struct rq *src_rq, struct task_struct *p, + struct rq *this_rq, int this_cpu) { - dequeue_task(p, src_array); - dec_nr_running(p, src_rq); + deactivate_task(src_rq, p, 0); set_task_cpu(p, this_cpu); - inc_nr_running(p, this_rq); - enqueue_task(p, this_array); - p->timestamp = (p->timestamp - src_rq->most_recent_timestamp) - + this_rq->most_recent_timestamp; + activate_task(this_rq, p, 0); /* * Note that idle threads have a prio of MAX_PRIO, for this test * to be always true for them. */ - if (TASK_PREEMPTS_CURR(p, this_rq)) - resched_task(this_rq->curr); + check_preempt_curr(this_rq, p); } /* @@ -2174,25 +1780,59 @@ int can_migrate_task(struct task_struct return 0; /* - * Aggressive migration if: - * 1) task is cache cold, or - * 2) too many balance attempts have failed. 
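
About the cpu_load[] update loop in update_load_fair() above: it is an exponential moving average per index, where index i lets a new sample contribute only 1/2^i of its value (scale is effectively 1 << i), so higher indices react more slowly and smooth out short load spikes. A userspace sketch of the same arithmetic over an invented load sequence (CPU_LOAD_IDX_MAX is assumed to be 5 here, purely for the example):

/*
 * Sketch of the cpu_load[] averaging rule, run over made-up samples:
 * cpu_load[i] = (old * (2^i - 1) + new) / 2^i
 */
#include <stdio.h>

#define CPU_LOAD_IDX_MAX 5	/* assumed value for the sketch */

int main(void)
{
	unsigned long cpu_load[CPU_LOAD_IDX_MAX] = { 0 };
	unsigned long samples[] = { 1024, 1024, 0, 0, 2048 };
	unsigned int i, t, scale;

	for (t = 0; t < sizeof(samples) / sizeof(samples[0]); t++) {
		for (i = 0, scale = 1; i < CPU_LOAD_IDX_MAX; i++, scale += scale) {
			unsigned long old_load = cpu_load[i];
			unsigned long new_load = samples[t];

			cpu_load[i] = (old_load * (scale - 1) + new_load) >> i;
		}
		printf("after sample %lu: cpu_load[0]=%lu cpu_load[%d]=%lu\n",
		       samples[t], cpu_load[0], CPU_LOAD_IDX_MAX - 1,
		       cpu_load[CPU_LOAD_IDX_MAX - 1]);
	}
	return 0;
}

cpu_load[0] simply tracks the latest sample, while the highest index drifts towards it only slowly.
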
+ * Aggressive migration if too many balance attempts have failed: */ - - if (sd->nr_balance_failed > sd->cache_nice_tries) { -#ifdef CONFIG_SCHEDSTATS - if (task_hot(p, rq->most_recent_timestamp, sd)) - schedstat_inc(sd, lb_hot_gained[idle]); -#endif + if (sd->nr_balance_failed > sd->cache_nice_tries) return 1; - } - if (task_hot(p, rq->most_recent_timestamp, sd)) - return 0; return 1; } -#define rq_best_prio(rq) min((rq)->curr->prio, (rq)->best_expired_prio) +/* + * Load-balancing iterator: iterate through the hieararchy of scheduling + * classes, starting with the highest-prio one: + */ + +struct task_struct * load_balance_start(struct rq *rq) +{ + struct sched_class *class = sched_class_highest; + struct task_struct *p; + + do { + p = class->load_balance_start(rq); + if (p) { + rq->load_balance_class = class; + return p; + } + class = class->next; + } while (class); + + return NULL; +} + +struct task_struct * load_balance_next(struct rq *rq) +{ + struct sched_class *class = rq->load_balance_class; + struct task_struct *p; + + p = class->load_balance_next(rq); + if (p) + return p; + /* + * Pick up the next class (if any) and attempt to start + * the iterator there: + */ + while ((class = class->next)) { + p = class->load_balance_start(rq); + if (p) { + rq->load_balance_class = class; + return p; + } + } + return NULL; +} + +#define rq_best_prio(rq) (rq)->curr->prio /* * move_tasks tries to move up to max_nr_move tasks and max_load_move weighted @@ -2206,11 +1846,9 @@ static int move_tasks(struct rq *this_rq struct sched_domain *sd, enum idle_type idle, int *all_pinned) { - int idx, pulled = 0, pinned = 0, this_best_prio, best_prio, + int pulled = 0, pinned = 0, this_best_prio, best_prio, best_prio_seen, skip_for_load; - struct prio_array *array, *dst_array; - struct list_head *head, *curr; - struct task_struct *tmp; + struct task_struct *p; long rem_load_move; if (max_nr_move == 0 || max_load_move == 0) @@ -2230,76 +1868,41 @@ static int move_tasks(struct rq *this_rq best_prio_seen = best_prio == busiest->curr->prio; /* - * We first consider expired tasks. Those will likely not be - * executed in the near future, and they are most likely to - * be cache-cold, thus switching CPUs has the least effect - * on them. - */ - if (busiest->expired->nr_active) { - array = busiest->expired; - dst_array = this_rq->expired; - } else { - array = busiest->active; - dst_array = this_rq->active; - } - -new_array: - /* Start searching at priority 0: */ - idx = 0; -skip_bitmap: - if (!idx) - idx = sched_find_first_bit(array->bitmap); - else - idx = find_next_bit(array->bitmap, MAX_PRIO, idx); - if (idx >= MAX_PRIO) { - if (array == busiest->expired && busiest->active->nr_active) { - array = busiest->active; - dst_array = this_rq->active; - goto new_array; - } + * Start the load-balancing iterator: + */ + p = load_balance_start(busiest); +next: + if (!p) goto out; - } - - head = array->queue + idx; - curr = head->prev; -skip_queue: - tmp = list_entry(curr, struct task_struct, run_list); - - curr = curr->prev; - /* * To help distribute high priority tasks accross CPUs we don't * skip a task if it will be the highest priority task (i.e. 
smallest * prio value) on its new queue regardless of its load weight */ - skip_for_load = tmp->load_weight > rem_load_move; - if (skip_for_load && idx < this_best_prio) - skip_for_load = !best_prio_seen && idx == best_prio; + skip_for_load = (p->load_weight >> 1) > rem_load_move + SCHED_LOAD_SCALE_FUZZ; + if (skip_for_load && p->prio < this_best_prio) + skip_for_load = !best_prio_seen && p->prio == best_prio; if (skip_for_load || - !can_migrate_task(tmp, busiest, this_cpu, sd, idle, &pinned)) { + !can_migrate_task(p, busiest, this_cpu, sd, idle, &pinned)) { - best_prio_seen |= idx == best_prio; - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; + best_prio_seen |= p->prio == best_prio; + p = load_balance_next(busiest); + goto next; } - pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu); + pull_task(busiest, p, this_rq, this_cpu); pulled++; - rem_load_move -= tmp->load_weight; + rem_load_move -= p->load_weight; /* * We only want to steal up to the prescribed number of tasks * and the prescribed amount of weighted load. */ if (pulled < max_nr_move && rem_load_move > 0) { - if (idx < this_best_prio) - this_best_prio = idx; - if (curr != head) - goto skip_queue; - idx++; - goto skip_bitmap; + if (p->prio < this_best_prio) + this_best_prio = p->prio; + p = load_balance_next(busiest); + goto next; } out: /* @@ -2426,8 +2029,8 @@ find_busiest_group(struct sched_domain * * Busy processors will not participate in power savings * balance. */ - if (idle == NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE)) - goto group_next; + if (idle == NOT_IDLE || !(sd->flags & SD_POWERSAVINGS_BALANCE)) + goto group_next; /* * If the local group is idle or completely loaded @@ -2437,42 +2040,42 @@ find_busiest_group(struct sched_domain * !this_nr_running)) power_savings_balance = 0; - /* + /* * If a group is already running at full capacity or idle, * don't include that group in power savings calculations - */ - if (!power_savings_balance || sum_nr_running >= group_capacity + */ + if (!power_savings_balance || sum_nr_running >= group_capacity || !sum_nr_running) - goto group_next; + goto group_next; - /* + /* * Calculate the group which has the least non-idle load. 
- * This is the group from where we need to pick up the load - * for saving power - */ - if ((sum_nr_running < min_nr_running) || - (sum_nr_running == min_nr_running && + * This is the group from where we need to pick up the load + * for saving power + */ + if ((sum_nr_running < min_nr_running) || + (sum_nr_running == min_nr_running && first_cpu(group->cpumask) < first_cpu(group_min->cpumask))) { - group_min = group; - min_nr_running = sum_nr_running; + group_min = group; + min_nr_running = sum_nr_running; min_load_per_task = sum_weighted_load / sum_nr_running; - } + } - /* + /* * Calculate the group which is almost near its - * capacity but still has some space to pick up some load - * from other group and save more power - */ - if (sum_nr_running <= group_capacity - 1) { - if (sum_nr_running > leader_nr_running || - (sum_nr_running == leader_nr_running && - first_cpu(group->cpumask) > - first_cpu(group_leader->cpumask))) { - group_leader = group; - leader_nr_running = sum_nr_running; - } + * capacity but still has some space to pick up some load + * from other group and save more power + */ + if (sum_nr_running <= group_capacity - 1) { + if (sum_nr_running > leader_nr_running || + (sum_nr_running == leader_nr_running && + first_cpu(group->cpumask) > + first_cpu(group_leader->cpumask))) { + group_leader = group; + leader_nr_running = sum_nr_running; + } } group_next: #endif @@ -2527,7 +2130,7 @@ group_next: * a think about bumping its value to force at least one task to be * moved */ - if (*imbalance < busiest_load_per_task) { + if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task/2) { unsigned long tmp, pwr_now, pwr_move; unsigned int imbn; @@ -2541,7 +2144,8 @@ small_imbalance: } else this_load_per_task = SCHED_LOAD_SCALE; - if (max_load - this_load >= busiest_load_per_task * imbn) { + if (max_load - this_load + SCHED_LOAD_SCALE_FUZZ >= + busiest_load_per_task * imbn) { *imbalance = busiest_load_per_task; return busiest; } @@ -2957,32 +2561,6 @@ static void active_load_balance(struct r spin_unlock(&target_rq->lock); } -static void update_load(struct rq *this_rq) -{ - unsigned long this_load; - unsigned int i, scale; - - this_load = this_rq->raw_weighted_load; - - /* Update our load: */ - for (i = 0, scale = 1; i < 3; i++, scale += scale) { - unsigned long old_load, new_load; - - /* scale is effectively 1 << i now, and >> i divides by scale */ - - old_load = this_rq->cpu_load[i]; - new_load = this_load; - /* - * Round up the averaging division if load is increasing. This - * prevents us from getting stuck on 9 if the load is 10, for - * example. - */ - if (new_load > old_load) - new_load += scale-1; - this_rq->cpu_load[i] = (old_load*(scale-1) + new_load) >> i; - } -} - #ifdef CONFIG_NO_HZ static struct { atomic_t load_balancer; @@ -3086,6 +2664,9 @@ static inline void rebalance_domains(int interval = msecs_to_jiffies(interval); if (unlikely(!interval)) interval = 1; + if (interval > HZ*NR_CPUS/10) + interval = HZ*NR_CPUS/10; + if (sd->flags & SD_SERIALIZE) { if (!spin_trylock(&balancing)) @@ -3240,54 +2821,22 @@ DEFINE_PER_CPU(struct kernel_stat, kstat EXPORT_PER_CPU_SYMBOL(kstat); /* - * This is called on clock ticks and on context switches. - * Bank in p->sched_time the ns elapsed since the last tick or switch. 
- */ -static inline void -update_cpu_clock(struct task_struct *p, struct rq *rq, unsigned long long now) -{ - p->sched_time += now - p->last_ran; - p->last_ran = rq->most_recent_timestamp = now; -} - -/* - * Return current->sched_time plus any more ns on the sched_clock + * Return current->sum_exec_runtime plus any more ns on the sched_clock * that have not yet been banked. */ -unsigned long long current_sched_time(const struct task_struct *p) +unsigned long long current_sched_runtime(const struct task_struct *p) { unsigned long long ns; unsigned long flags; local_irq_save(flags); - ns = p->sched_time + sched_clock() - p->last_ran; + ns = p->sum_exec_runtime + sched_clock() - p->last_ran; local_irq_restore(flags); return ns; } /* - * We place interactive tasks back into the active array, if possible. - * - * To guarantee that this does not starve expired tasks we ignore the - * interactivity of a task if the first expired task had to wait more - * than a 'reasonable' amount of time. This deadline timeout is - * load-dependent, as the frequency of array switched decreases with - * increasing number of running tasks. We also ignore the interactivity - * if a better static_prio task has expired: - */ -static inline int expired_starving(struct rq *rq) -{ - if (rq->curr->static_prio > rq->best_expired_prio) - return 1; - if (!STARVATION_LIMIT || !rq->expired_timestamp) - return 0; - if (jiffies - rq->expired_timestamp > STARVATION_LIMIT * rq->nr_running) - return 1; - return 0; -} - -/* * Account user cpu time to a process. * @p: the process that the cpu time gets accounted to * @hardirq_offset: the offset to subtract from hardirq_count() @@ -3360,81 +2909,6 @@ void account_steal_time(struct task_stru cpustat->steal = cputime64_add(cpustat->steal, tmp); } -static void task_running_tick(struct rq *rq, struct task_struct *p) -{ - if (p->array != rq->active) { - /* Task has expired but was not scheduled yet */ - set_tsk_need_resched(p); - return; - } - spin_lock(&rq->lock); - /* - * The task was running during this tick - update the - * time slice counter. Note: we do not update a thread's - * priority until it either goes to sleep or uses up its - * timeslice. This makes it possible for interactive tasks - * to use up their timeslices at their highest priority levels. - */ - if (rt_task(p)) { - /* - * RR tasks need a special form of timeslice management. - * FIFO tasks have no timeslices. - */ - if ((p->policy == SCHED_RR) && !--p->time_slice) { - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - set_tsk_need_resched(p); - - /* put it at the end of the queue: */ - requeue_task(p, rq->active); - } - goto out_unlock; - } - if (!--p->time_slice) { - dequeue_task(p, rq->active); - set_tsk_need_resched(p); - p->prio = effective_prio(p); - p->time_slice = task_timeslice(p); - p->first_time_slice = 0; - - if (!rq->expired_timestamp) - rq->expired_timestamp = jiffies; - if (!TASK_INTERACTIVE(p) || expired_starving(rq)) { - enqueue_task(p, rq->expired); - if (p->static_prio < rq->best_expired_prio) - rq->best_expired_prio = p->static_prio; - } else - enqueue_task(p, rq->active); - } else { - /* - * Prevent a too long timeslice allowing a task to monopolize - * the CPU. We do this by splitting up the timeslice into - * smaller pieces. - * - * Note: this does not mean the task's timeslices expire or - * get lost in any way, they just might be preempted by - * another task of equal priority. (one with higher - * priority would have preempted this task already.) 
We - * requeue this task to the end of the list on this priority - * level, which is in essence a round-robin of tasks with - * equal priority. - * - * This only applies to tasks in the interactive - * delta range with at least TIMESLICE_GRANULARITY to requeue. - */ - if (TASK_INTERACTIVE(p) && !((task_timeslice(p) - - p->time_slice) % TIMESLICE_GRANULARITY(p)) && - (p->time_slice >= TIMESLICE_GRANULARITY(p)) && - (p->array == rq->active)) { - - requeue_task(p, rq->active); - set_tsk_need_resched(p); - } - } -out_unlock: - spin_unlock(&rq->lock); -} - /* * This function gets called by the timer code, with HZ frequency. * We call it with interrupts disabled. @@ -3444,18 +2918,15 @@ out_unlock: */ void scheduler_tick(void) { - unsigned long long now = sched_clock(); struct task_struct *p = current; int cpu = smp_processor_id(); int idle_at_tick = idle_cpu(cpu); struct rq *rq = cpu_rq(cpu); - update_cpu_clock(p, rq, now); - if (!idle_at_tick) task_running_tick(rq, p); + update_load_fair(rq); #ifdef CONFIG_SMP - update_load(rq); rq->idle_at_tick = idle_at_tick; trigger_load_balance(cpu); #endif @@ -3499,49 +2970,27 @@ EXPORT_SYMBOL(sub_preempt_count); #endif -static inline int interactive_sleep(enum sleep_type sleep_type) -{ - return (sleep_type == SLEEP_INTERACTIVE || - sleep_type == SLEEP_INTERRUPTED); -} - /* - * schedule() is the main scheduler function. + * Various schedule()-time debugging checks and statistics: */ -asmlinkage void __sched schedule(void) +static inline void schedule_debug(struct rq *rq, struct task_struct *prev) { - struct task_struct *prev, *next; - struct prio_array *array; - struct list_head *queue; - unsigned long long now; - unsigned long run_time; - int cpu, idx, new_prio; - long *switch_count; - struct rq *rq; - /* * Test if we are atomic. Since do_exit() needs to call into * schedule() atomically, we ignore that path for now. * Otherwise, whine if we are scheduling when we should not be. */ - if (unlikely(in_atomic() && !current->exit_state)) { + if (unlikely(in_atomic_preempt_off() && !prev->exit_state)) { printk(KERN_ERR "BUG: scheduling while atomic: " "%s/0x%08x/%d\n", - current->comm, preempt_count(), current->pid); - debug_show_held_locks(current); + prev->comm, preempt_count(), prev->pid); + debug_show_held_locks(prev); if (irqs_disabled()) - print_irqtrace_events(current); + print_irqtrace_events(prev); dump_stack(); } profile_hit(SCHED_PROFILING, __builtin_return_address(0)); -need_resched: - preempt_disable(); - prev = current; - release_kernel_lock(prev); -need_resched_nonpreemptible: - rq = this_rq(); - /* * The idle thread is not allowed to schedule! * Remove this check after it has been exercised a bit. @@ -3552,19 +3001,45 @@ need_resched_nonpreemptible: } schedstat_inc(rq, sched_cnt); - now = sched_clock(); - if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) { - run_time = now - prev->timestamp; - if (unlikely((long long)(now - prev->timestamp) < 0)) - run_time = 0; - } else - run_time = NS_MAX_SLEEP_AVG; +} - /* - * Tasks charged proportionately less run_time at high sleep_avg to - * delay them losing their interactive status - */ - run_time /= (CURRENT_BONUS(prev) ? 
: 1); +static inline struct task_struct * +pick_next_task(struct rq *rq, struct task_struct *prev) +{ + struct sched_class *class = sched_class_highest; + u64 now = __rq_clock(rq); + struct task_struct *p; + + prev->sched_class->put_prev_task(rq, prev, now); + + do { + p = class->pick_next_task(rq, now); + if (p) + return p; + class = class->next; + } while (class); + + return NULL; +} + +/* + * schedule() is the main scheduler function. + */ +asmlinkage void __sched schedule(void) +{ + struct task_struct *prev, *next; + long *switch_count; + struct rq *rq; + int cpu; + +need_resched: + preempt_disable(); + prev = current; + release_kernel_lock(prev); +need_resched_nonpreemptible: + rq = this_rq(); + + schedule_debug(rq, prev); spin_lock_irq(&rq->lock); @@ -3577,7 +3052,7 @@ need_resched_nonpreemptible: else { if (prev->state == TASK_UNINTERRUPTIBLE) rq->nr_uninterruptible++; - deactivate_task(prev, rq); + deactivate_task(rq, prev, 1); } } @@ -3585,65 +3060,25 @@ need_resched_nonpreemptible: if (unlikely(!rq->nr_running)) { idle_balance(cpu, rq); if (!rq->nr_running) { + prev->sched_class->put_prev_task(rq, prev, + __rq_clock(rq)); next = rq->idle; - rq->expired_timestamp = 0; + schedstat_inc(rq, sched_goidle); goto switch_tasks; } } - array = rq->active; - if (unlikely(!array->nr_active)) { - /* - * Switch the active and expired arrays. - */ - schedstat_inc(rq, sched_switch); - rq->active = rq->expired; - rq->expired = array; - array = rq->active; - rq->expired_timestamp = 0; - rq->best_expired_prio = MAX_PRIO; - } - - idx = sched_find_first_bit(array->bitmap); - queue = array->queue + idx; - next = list_entry(queue->next, struct task_struct, run_list); + next = pick_next_task(rq, prev); + next->nr_switches++; - if (!rt_task(next) && interactive_sleep(next->sleep_type)) { - unsigned long long delta = now - next->timestamp; - if (unlikely((long long)(now - next->timestamp) < 0)) - delta = 0; - - if (next->sleep_type == SLEEP_INTERACTIVE) - delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; - - array = next->array; - new_prio = recalc_task_prio(next, next->timestamp + delta); - - if (unlikely(next->prio != new_prio)) { - dequeue_task(next, array); - next->prio = new_prio; - enqueue_task(next, array); - } - } - next->sleep_type = SLEEP_NORMAL; switch_tasks: - if (next == rq->idle) - schedstat_inc(rq, sched_goidle); prefetch(next); prefetch_stack(next); clear_tsk_need_resched(prev); rcu_qsctr_inc(task_cpu(prev)); - update_cpu_clock(prev, rq, now); - - prev->sleep_avg -= run_time; - if ((long)prev->sleep_avg <= 0) - prev->sleep_avg = 0; - prev->timestamp = prev->last_ran = now; - sched_info_switch(prev, next); if (likely(prev != next)) { - next->timestamp = next->last_ran = now; rq->nr_switches++; rq->curr = next; ++*switch_count; @@ -4074,29 +3509,28 @@ EXPORT_SYMBOL(sleep_on_timeout); */ void rt_mutex_setprio(struct task_struct *p, int prio) { - struct prio_array *array; unsigned long flags; + int oldprio, on_rq; struct rq *rq; - int oldprio; BUG_ON(prio < 0 || prio > MAX_PRIO); rq = task_rq_lock(p, &flags); oldprio = p->prio; - array = p->array; - if (array) - dequeue_task(p, array); + on_rq = p->on_rq; + if (on_rq) + dequeue_task(rq, p, 0); + + if (rt_prio(prio)) + p->sched_class = &rt_sched_class; + else + p->sched_class = &fair_sched_class; + p->prio = prio; - if (array) { - /* - * If changing to an RT priority then queue it - * in the active array! 
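
pick_next_task() above is where the modular structure shows: classes are tried strictly in order, starting at sched_class_highest (the RT class), and the first class that returns a task wins - so a runnable RT task is always chosen ahead of any SCHED_NORMAL/SCHED_BATCH task, and the fair class is only asked when the RT class has nothing to run. A toy stand-alone model of that fall-through (not kernel code, names invented for illustration):

#include <stdio.h>
#include <stddef.h>

struct toy_task { const char *name; };

struct toy_class {
	const char *name;
	struct toy_class *next;
	struct toy_task *(*pick_next)(void);
};

static struct toy_task fair_task = { "fair task" };

static struct toy_task *rt_pick(void)   { return NULL; }	/* no runnable RT task */
static struct toy_task *fair_pick(void) { return &fair_task; }

static struct toy_class fair_class = { "fair", NULL, fair_pick };
static struct toy_class rt_class   = { "rt", &fair_class, rt_pick };

int main(void)
{
	struct toy_class *class = &rt_class;	/* plays the role of sched_class_highest */
	struct toy_task *p;

	do {
		p = class->pick_next();
		if (p) {
			printf("%s class picked: %s\n", class->name, p->name);
			return 0;
		}
		class = class->next;
	} while (class);

	printf("nothing runnable - the idle task would run\n");
	return 0;
}
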
- */ - if (rt_task(p)) - array = rq->active; - enqueue_task(p, array); + if (on_rq) { + enqueue_task(rq, p, 0); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on @@ -4105,8 +3539,9 @@ void rt_mutex_setprio(struct task_struct if (task_running(rq, p)) { if (p->prio > oldprio) resched_task(rq->curr); - } else if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + } else { + check_preempt_curr(rq, p); + } } task_rq_unlock(rq, &flags); } @@ -4115,8 +3550,7 @@ void rt_mutex_setprio(struct task_struct void set_user_nice(struct task_struct *p, long nice) { - struct prio_array *array; - int old_prio, delta; + int old_prio, delta, on_rq; unsigned long flags; struct rq *rq; @@ -4137,9 +3571,9 @@ void set_user_nice(struct task_struct *p p->static_prio = NICE_TO_PRIO(nice); goto out_unlock; } - array = p->array; - if (array) { - dequeue_task(p, array); + on_rq = p->on_rq; + if (on_rq) { + dequeue_task(rq, p, 0); dec_raw_weighted_load(rq, p); } @@ -4149,8 +3583,8 @@ void set_user_nice(struct task_struct *p p->prio = effective_prio(p); delta = p->prio - old_prio; - if (array) { - enqueue_task(p, array); + if (on_rq) { + enqueue_task(rq, p, 0); inc_raw_weighted_load(rq, p); /* * If the task increased its priority or is running and @@ -4271,20 +3705,27 @@ static inline struct task_struct *find_p } /* Actually do priority change: must hold rq lock. */ -static void __setscheduler(struct task_struct *p, int policy, int prio) +static void +__setscheduler(struct rq *rq, struct task_struct *p, int policy, int prio) { - BUG_ON(p->array); + BUG_ON(p->on_rq); p->policy = policy; + switch (p->policy) { + case SCHED_NORMAL: + case SCHED_BATCH: + p->sched_class = &fair_sched_class; + break; + case SCHED_FIFO: + case SCHED_RR: + p->sched_class = &rt_sched_class; + break; + } + p->rt_priority = prio; p->normal_prio = normal_prio(p); /* we are holding p->pi_lock already */ p->prio = rt_mutex_getprio(p); - /* - * SCHED_BATCH tasks are treated as perpetual CPU hogs: - */ - if (policy == SCHED_BATCH) - p->sleep_avg = 0; set_load_weight(p); } @@ -4299,8 +3740,7 @@ static void __setscheduler(struct task_s int sched_setscheduler(struct task_struct *p, int policy, struct sched_param *param) { - int retval, oldprio, oldpolicy = -1; - struct prio_array *array; + int retval, oldprio, oldpolicy = -1, on_rq; unsigned long flags; struct rq *rq; @@ -4374,13 +3814,13 @@ recheck: spin_unlock_irqrestore(&p->pi_lock, flags); goto recheck; } - array = p->array; - if (array) - deactivate_task(p, rq); + on_rq = p->on_rq; + if (on_rq) + deactivate_task(rq, p, 0); oldprio = p->prio; - __setscheduler(p, policy, param->sched_priority); - if (array) { - __activate_task(p, rq); + __setscheduler(rq, p, policy, param->sched_priority); + if (on_rq) { + activate_task(rq, p, 0); /* * Reschedule if we are currently running on this runqueue and * our priority decreased, or if we are not currently running on @@ -4389,8 +3829,9 @@ recheck: if (task_running(rq, p)) { if (p->prio > oldprio) resched_task(rq->curr); - } else if (TASK_PREEMPTS_CURR(p, rq)) - resched_task(rq->curr); + } else { + check_preempt_curr(rq, p); + } } __task_rq_unlock(rq); spin_unlock_irqrestore(&p->pi_lock, flags); @@ -4653,50 +4094,66 @@ asmlinkage long sys_sched_getaffinity(pi if (ret < 0) return ret; - if (copy_to_user(user_mask_ptr, &mask, sizeof(cpumask_t))) - return -EFAULT; + if (copy_to_user(user_mask_ptr, &mask, sizeof(cpumask_t))) + return -EFAULT; + + return sizeof(cpumask_t); +} + +/** + * 
sys_sched_yield - yield the current processor to other threads. + * + * This function yields the current CPU to other tasks. If there are no + * other threads running on this CPU then this function will return. + */ +asmlinkage long sys_sched_yield(void) +{ + struct rq *rq = this_rq_lock(); + + schedstat_inc(rq, yld_cnt); + if (rq->nr_running == 1) + schedstat_inc(rq, yld_act_empty); + else + current->sched_class->yield_task(rq, current, NULL); + + /* + * Since we are going to call schedule() anyway, there's + * no need to preempt or enable interrupts: + */ + __release(rq->lock); + spin_release(&rq->lock.dep_map, 1, _THIS_IP_); + _raw_spin_unlock(&rq->lock); + preempt_enable_no_resched(); + + schedule(); - return sizeof(cpumask_t); + return 0; } /** - * sys_sched_yield - yield the current processor to other threads. + * sys_sched_yield_to - yield the current processor to another thread * * This function yields the current CPU by moving the calling thread * to the expired array. If there are no other threads running on this * CPU then this function will return. */ -asmlinkage long sys_sched_yield(void) +asmlinkage long sys_sched_yield_to(pid_t pid) { - struct rq *rq = this_rq_lock(); - struct prio_array *array = current->array, *target = rq->expired; + struct task_struct *p_to; + struct rq *rq; - schedstat_inc(rq, yld_cnt); - /* - * We implement yielding by moving the task into the expired - * queue. - * - * (special rule: RT tasks will just roundrobin in the active - * array.) - */ - if (rt_task(current)) - target = rq->active; + rcu_read_lock(); + p_to = find_task_by_pid(pid); + if (!p_to) + goto out_unlock; + + rq = this_rq_lock(); - if (array->nr_active == 1) { + schedstat_inc(rq, yld_cnt); + if (rq->nr_running == 1) schedstat_inc(rq, yld_act_empty); - if (!rq->expired->nr_active) - schedstat_inc(rq, yld_both_empty); - } else if (!rq->expired->nr_active) - schedstat_inc(rq, yld_exp_empty); - - if (array != target) { - dequeue_task(current, array); - enqueue_task(current, target); - } else - /* - * requeue_task is cheaper so perform that if possible. - */ - requeue_task(current, array); + else + current->sched_class->yield_task(rq, current, p_to); /* * Since we are going to call schedule() anyway, there's @@ -4705,13 +4162,19 @@ asmlinkage long sys_sched_yield(void) __release(rq->lock); spin_release(&rq->lock.dep_map, 1, _THIS_IP_); _raw_spin_unlock(&rq->lock); + rcu_read_unlock(); preempt_enable_no_resched(); schedule(); return 0; + +out_unlock: + rcu_read_unlock(); + return -ESRCH; } + static void __cond_resched(void) { #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP @@ -4905,7 +4368,7 @@ long sys_sched_rr_get_interval(pid_t pid goto out_unlock; jiffies_to_timespec(p->policy == SCHED_FIFO ? - 0 : task_timeslice(p), &t); + 0 : static_prio_timeslice(p->static_prio), &t); read_unlock(&tasklist_lock); retval = copy_to_user(interval, &t, sizeof(t)) ? 
-EFAULT : 0; out_nounlock: @@ -4986,6 +4449,7 @@ void show_state_filter(unsigned long sta */ if (state_filter == -1) debug_show_all_locks(); + sysrq_sched_debug_show(); } /** @@ -5001,13 +4465,12 @@ void __cpuinit init_idle(struct task_str struct rq *rq = cpu_rq(cpu); unsigned long flags; - idle->timestamp = sched_clock(); - idle->sleep_avg = 0; - idle->array = NULL; + __sched_fork(idle); + idle->exec_start = sched_clock(); + idle->prio = idle->normal_prio = MAX_PRIO; - idle->state = TASK_RUNNING; idle->cpus_allowed = cpumask_of_cpu(cpu); - set_task_cpu(idle, cpu); + __set_task_cpu(idle, cpu); spin_lock_irqsave(&rq->lock, flags); rq->curr = rq->idle = idle; @@ -5123,19 +4586,10 @@ static int __migrate_task(struct task_st goto out; set_task_cpu(p, dest_cpu); - if (p->array) { - /* - * Sync timestamp with rq_dest's before activating. - * The same thing could be achieved by doing this step - * afterwards, and pretending it was a local activate. - * This way is cleaner and logically correct. - */ - p->timestamp = p->timestamp - rq_src->most_recent_timestamp - + rq_dest->most_recent_timestamp; - deactivate_task(p, rq_src); - __activate_task(p, rq_dest); - if (TASK_PREEMPTS_CURR(p, rq_dest)) - resched_task(rq_dest->curr); + if (p->on_rq) { + deactivate_task(rq_src, p, 0); + activate_task(rq_dest, p, 0); + check_preempt_curr(rq_dest, p); } ret = 1; out: @@ -5308,10 +4762,10 @@ void sched_idle_next(void) */ spin_lock_irqsave(&rq->lock, flags); - __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1); + __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1); /* Add idle task to the _front_ of its priority queue: */ - __activate_idle_task(p, rq); + activate_idle_task(p, rq); spin_unlock_irqrestore(&rq->lock, flags); } @@ -5361,16 +4815,15 @@ static void migrate_dead(unsigned int de static void migrate_dead_tasks(unsigned int dead_cpu) { struct rq *rq = cpu_rq(dead_cpu); - unsigned int arr, i; + struct task_struct *next; - for (arr = 0; arr < 2; arr++) { - for (i = 0; i < MAX_PRIO; i++) { - struct list_head *list = &rq->arrays[arr].queue[i]; - - while (!list_empty(list)) - migrate_dead(dead_cpu, list_entry(list->next, - struct task_struct, run_list)); - } + for (;;) { + if (!rq->nr_running) + break; + next = pick_next_task(rq, rq->curr); + if (!next) + break; + migrate_dead(dead_cpu, next); } } #endif /* CONFIG_HOTPLUG_CPU */ @@ -5400,7 +4853,7 @@ migration_call(struct notifier_block *nf kthread_bind(p, cpu); /* Must be high prio: stop_machine expects to yield to it. */ rq = task_rq_lock(p, &flags); - __setscheduler(p, SCHED_FIFO, MAX_RT_PRIO-1); + __setscheduler(rq, p, SCHED_FIFO, MAX_RT_PRIO-1); task_rq_unlock(rq, &flags); cpu_rq(cpu)->migration_thread = p; break; @@ -5431,9 +4884,9 @@ migration_call(struct notifier_block *nf rq->migration_thread = NULL; /* Idle task back to normal (off runqueue, low prio) */ rq = task_rq_lock(rq->idle, &flags); - deactivate_task(rq->idle, rq); + deactivate_task(rq, rq->idle, 0); rq->idle->static_prio = MAX_PRIO; - __setscheduler(rq->idle, SCHED_NORMAL, 0); + __setscheduler(rq, rq->idle, SCHED_NORMAL, 0); migrate_dead_tasks(cpu); task_rq_unlock(rq, &flags); migrate_nr_uninterruptible(rq); @@ -5742,483 +5195,6 @@ init_sched_build_groups(cpumask_t span, #define SD_NODES_PER_DOMAIN 16 -/* - * Self-tuning task migration cost measurement between source and target CPUs. - * - * This is done by measuring the cost of manipulating buffers of varying - * sizes. 
For a given buffer-size here are the steps that are taken: - * - * 1) the source CPU reads+dirties a shared buffer - * 2) the target CPU reads+dirties the same shared buffer - * - * We measure how long they take, in the following 4 scenarios: - * - * - source: CPU1, target: CPU2 | cost1 - * - source: CPU2, target: CPU1 | cost2 - * - source: CPU1, target: CPU1 | cost3 - * - source: CPU2, target: CPU2 | cost4 - * - * We then calculate the cost3+cost4-cost1-cost2 difference - this is - * the cost of migration. - * - * We then start off from a small buffer-size and iterate up to larger - * buffer sizes, in 5% steps - measuring each buffer-size separately, and - * doing a maximum search for the cost. (The maximum cost for a migration - * normally occurs when the working set size is around the effective cache - * size.) - */ -#define SEARCH_SCOPE 2 -#define MIN_CACHE_SIZE (64*1024U) -#define DEFAULT_CACHE_SIZE (5*1024*1024U) -#define ITERATIONS 1 -#define SIZE_THRESH 130 -#define COST_THRESH 130 - -/* - * The migration cost is a function of 'domain distance'. Domain - * distance is the number of steps a CPU has to iterate down its - * domain tree to share a domain with the other CPU. The farther - * two CPUs are from each other, the larger the distance gets. - * - * Note that we use the distance only to cache measurement results, - * the distance value is not used numerically otherwise. When two - * CPUs have the same distance it is assumed that the migration - * cost is the same. (this is a simplification but quite practical) - */ -#define MAX_DOMAIN_DISTANCE 32 - -static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] = - { [ 0 ... MAX_DOMAIN_DISTANCE-1 ] = -/* - * Architectures may override the migration cost and thus avoid - * boot-time calibration. Unit is nanoseconds. Mostly useful for - * virtualized hardware: - */ -#ifdef CONFIG_DEFAULT_MIGRATION_COST - CONFIG_DEFAULT_MIGRATION_COST -#else - -1LL -#endif -}; - -/* - * Allow override of migration cost - in units of microseconds. - * E.g. migration_cost=1000,2000,3000 will set up a level-1 cost - * of 1 msec, level-2 cost of 2 msecs and level3 cost of 3 msecs: - */ -static int __init migration_cost_setup(char *str) -{ - int ints[MAX_DOMAIN_DISTANCE+1], i; - - str = get_options(str, ARRAY_SIZE(ints), ints); - - printk("#ints: %d\n", ints[0]); - for (i = 1; i <= ints[0]; i++) { - migration_cost[i-1] = (unsigned long long)ints[i]*1000; - printk("migration_cost[%d]: %Ld\n", i-1, migration_cost[i-1]); - } - return 1; -} - -__setup ("migration_cost=", migration_cost_setup); - -/* - * Global multiplier (divisor) for migration-cutoff values, - * in percentiles. E.g. use a value of 150 to get 1.5 times - * longer cache-hot cutoff times. - * - * (We scale it from 100 to 128 to long long handling easier.) 
- */ - -#define MIGRATION_FACTOR_SCALE 128 - -static unsigned int migration_factor = MIGRATION_FACTOR_SCALE; - -static int __init setup_migration_factor(char *str) -{ - get_option(&str, &migration_factor); - migration_factor = migration_factor * MIGRATION_FACTOR_SCALE / 100; - return 1; -} - -__setup("migration_factor=", setup_migration_factor); - -/* - * Estimated distance of two CPUs, measured via the number of domains - * we have to pass for the two CPUs to be in the same span: - */ -static unsigned long domain_distance(int cpu1, int cpu2) -{ - unsigned long distance = 0; - struct sched_domain *sd; - - for_each_domain(cpu1, sd) { - WARN_ON(!cpu_isset(cpu1, sd->span)); - if (cpu_isset(cpu2, sd->span)) - return distance; - distance++; - } - if (distance >= MAX_DOMAIN_DISTANCE) { - WARN_ON(1); - distance = MAX_DOMAIN_DISTANCE-1; - } - - return distance; -} - -static unsigned int migration_debug; - -static int __init setup_migration_debug(char *str) -{ - get_option(&str, &migration_debug); - return 1; -} - -__setup("migration_debug=", setup_migration_debug); - -/* - * Maximum cache-size that the scheduler should try to measure. - * Architectures with larger caches should tune this up during - * bootup. Gets used in the domain-setup code (i.e. during SMP - * bootup). - */ -unsigned int max_cache_size; - -static int __init setup_max_cache_size(char *str) -{ - get_option(&str, &max_cache_size); - return 1; -} - -__setup("max_cache_size=", setup_max_cache_size); - -/* - * Dirty a big buffer in a hard-to-predict (for the L2 cache) way. This - * is the operation that is timed, so we try to generate unpredictable - * cachemisses that still end up filling the L2 cache: - */ -static void touch_cache(void *__cache, unsigned long __size) -{ - unsigned long size = __size / sizeof(long); - unsigned long chunk1 = size / 3; - unsigned long chunk2 = 2 * size / 3; - unsigned long *cache = __cache; - int i; - - for (i = 0; i < size/6; i += 8) { - switch (i % 6) { - case 0: cache[i]++; - case 1: cache[size-1-i]++; - case 2: cache[chunk1-i]++; - case 3: cache[chunk1+i]++; - case 4: cache[chunk2-i]++; - case 5: cache[chunk2+i]++; - } - } -} - -/* - * Measure the cache-cost of one task migration. Returns in units of nsec. - */ -static unsigned long long -measure_one(void *cache, unsigned long size, int source, int target) -{ - cpumask_t mask, saved_mask; - unsigned long long t0, t1, t2, t3, cost; - - saved_mask = current->cpus_allowed; - - /* - * Flush source caches to RAM and invalidate them: - */ - sched_cacheflush(); - - /* - * Migrate to the source CPU: - */ - mask = cpumask_of_cpu(source); - set_cpus_allowed(current, mask); - WARN_ON(smp_processor_id() != source); - - /* - * Dirty the working set: - */ - t0 = sched_clock(); - touch_cache(cache, size); - t1 = sched_clock(); - - /* - * Migrate to the target CPU, dirty the L2 cache and access - * the shared buffer. (which represents the working set - * of a migrated task.) - */ - mask = cpumask_of_cpu(target); - set_cpus_allowed(current, mask); - WARN_ON(smp_processor_id() != target); - - t2 = sched_clock(); - touch_cache(cache, size); - t3 = sched_clock(); - - cost = t1-t0 + t3-t2; - - if (migration_debug >= 2) - printk("[%d->%d]: %8Ld %8Ld %8Ld => %10Ld.\n", - source, target, t1-t0, t1-t0, t3-t2, cost); - /* - * Flush target caches to RAM and invalidate them: - */ - sched_cacheflush(); - - set_cpus_allowed(current, saved_mask); - - return cost; -} - -/* - * Measure a series of task migrations and return the average - * result. 
Since this code runs early during bootup the system - * is 'undisturbed' and the average latency makes sense. - * - * The algorithm in essence auto-detects the relevant cache-size, - * so it will properly detect different cachesizes for different - * cache-hierarchies, depending on how the CPUs are connected. - * - * Architectures can prime the upper limit of the search range via - * max_cache_size, otherwise the search range defaults to 20MB...64K. - */ -static unsigned long long -measure_cost(int cpu1, int cpu2, void *cache, unsigned int size) -{ - unsigned long long cost1, cost2; - int i; - - /* - * Measure the migration cost of 'size' bytes, over an - * average of 10 runs: - * - * (We perturb the cache size by a small (0..4k) - * value to compensate size/alignment related artifacts. - * We also subtract the cost of the operation done on - * the same CPU.) - */ - cost1 = 0; - - /* - * dry run, to make sure we start off cache-cold on cpu1, - * and to get any vmalloc pagefaults in advance: - */ - measure_one(cache, size, cpu1, cpu2); - for (i = 0; i < ITERATIONS; i++) - cost1 += measure_one(cache, size - i * 1024, cpu1, cpu2); - - measure_one(cache, size, cpu2, cpu1); - for (i = 0; i < ITERATIONS; i++) - cost1 += measure_one(cache, size - i * 1024, cpu2, cpu1); - - /* - * (We measure the non-migrating [cached] cost on both - * cpu1 and cpu2, to handle CPUs with different speeds) - */ - cost2 = 0; - - measure_one(cache, size, cpu1, cpu1); - for (i = 0; i < ITERATIONS; i++) - cost2 += measure_one(cache, size - i * 1024, cpu1, cpu1); - - measure_one(cache, size, cpu2, cpu2); - for (i = 0; i < ITERATIONS; i++) - cost2 += measure_one(cache, size - i * 1024, cpu2, cpu2); - - /* - * Get the per-iteration migration cost: - */ - do_div(cost1, 2 * ITERATIONS); - do_div(cost2, 2 * ITERATIONS); - - return cost1 - cost2; -} - -static unsigned long long measure_migration_cost(int cpu1, int cpu2) -{ - unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0; - unsigned int max_size, size, size_found = 0; - long long cost = 0, prev_cost; - void *cache; - - /* - * Search from max_cache_size*5 down to 64K - the real relevant - * cachesize has to lie somewhere inbetween. 
- */ - if (max_cache_size) { - max_size = max(max_cache_size * SEARCH_SCOPE, MIN_CACHE_SIZE); - size = max(max_cache_size / SEARCH_SCOPE, MIN_CACHE_SIZE); - } else { - /* - * Since we have no estimation about the relevant - * search range - */ - max_size = DEFAULT_CACHE_SIZE * SEARCH_SCOPE; - size = MIN_CACHE_SIZE; - } - - if (!cpu_online(cpu1) || !cpu_online(cpu2)) { - printk("cpu %d and %d not both online!\n", cpu1, cpu2); - return 0; - } - - /* - * Allocate the working set: - */ - cache = vmalloc(max_size); - if (!cache) { - printk("could not vmalloc %d bytes for cache!\n", 2 * max_size); - return 1000000; /* return 1 msec on very small boxen */ - } - - while (size <= max_size) { - prev_cost = cost; - cost = measure_cost(cpu1, cpu2, cache, size); - - /* - * Update the max: - */ - if (cost > 0) { - if (max_cost < cost) { - max_cost = cost; - size_found = size; - } - } - /* - * Calculate average fluctuation, we use this to prevent - * noise from triggering an early break out of the loop: - */ - fluct = abs(cost - prev_cost); - avg_fluct = (avg_fluct + fluct)/2; - - if (migration_debug) - printk("-> [%d][%d][%7d] %3ld.%ld [%3ld.%ld] (%ld): " - "(%8Ld %8Ld)\n", - cpu1, cpu2, size, - (long)cost / 1000000, - ((long)cost / 100000) % 10, - (long)max_cost / 1000000, - ((long)max_cost / 100000) % 10, - domain_distance(cpu1, cpu2), - cost, avg_fluct); - - /* - * If we iterated at least 20% past the previous maximum, - * and the cost has dropped by more than 20% already, - * (taking fluctuations into account) then we assume to - * have found the maximum and break out of the loop early: - */ - if (size_found && (size*100 > size_found*SIZE_THRESH)) - if (cost+avg_fluct <= 0 || - max_cost*100 > (cost+avg_fluct)*COST_THRESH) { - - if (migration_debug) - printk("-> found max.\n"); - break; - } - /* - * Increase the cachesize in 10% steps: - */ - size = size * 10 / 9; - } - - if (migration_debug) - printk("[%d][%d] working set size found: %d, cost: %Ld\n", - cpu1, cpu2, size_found, max_cost); - - vfree(cache); - - /* - * A task is considered 'cache cold' if at least 2 times - * the worst-case cost of migration has passed. - * - * (this limit is only listened to if the load-balancing - * situation is 'nice' - if there is a large imbalance we - * ignore it for the sake of CPU utilization and - * processing fairness.) - */ - return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE; -} - -static void calibrate_migration_costs(const cpumask_t *cpu_map) -{ - int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id(); - unsigned long j0, j1, distance, max_distance = 0; - struct sched_domain *sd; - - j0 = jiffies; - - /* - * First pass - calculate the cacheflush times: - */ - for_each_cpu_mask(cpu1, *cpu_map) { - for_each_cpu_mask(cpu2, *cpu_map) { - if (cpu1 == cpu2) - continue; - distance = domain_distance(cpu1, cpu2); - max_distance = max(max_distance, distance); - /* - * No result cached yet? 
- */ - if (migration_cost[distance] == -1LL) - migration_cost[distance] = - measure_migration_cost(cpu1, cpu2); - } - } - /* - * Second pass - update the sched domain hierarchy with - * the new cache-hot-time estimations: - */ - for_each_cpu_mask(cpu, *cpu_map) { - distance = 0; - for_each_domain(cpu, sd) { - sd->cache_hot_time = migration_cost[distance]; - distance++; - } - } - /* - * Print the matrix: - */ - if (migration_debug) - printk("migration: max_cache_size: %d, cpu: %d MHz:\n", - max_cache_size, -#ifdef CONFIG_X86 - cpu_khz/1000 -#else - -1 -#endif - ); - if (system_state == SYSTEM_BOOTING && num_online_cpus() > 1) { - printk("migration_cost="); - for (distance = 0; distance <= max_distance; distance++) { - if (distance) - printk(","); - printk("%ld", (long)migration_cost[distance] / 1000); - } - printk("\n"); - } - j1 = jiffies; - if (migration_debug) - printk("migration: %ld seconds\n", (j1-j0) / HZ); - - /* - * Move back to the original CPU. NUMA-Q gets confused - * if we migrate to another quad during bootup. - */ - if (raw_smp_processor_id() != orig_cpu) { - cpumask_t mask = cpumask_of_cpu(orig_cpu), - saved_mask = current->cpus_allowed; - - set_cpus_allowed(current, mask); - set_cpus_allowed(current, saved_mask); - } -} - #ifdef CONFIG_NUMA /** @@ -6748,10 +5724,6 @@ static int build_sched_domains(const cpu #endif cpu_attach_domain(sd, i); } - /* - * Tune cache-hot values: - */ - calibrate_migration_costs(cpu_map); return 0; @@ -6931,6 +5903,21 @@ void __init sched_init_smp(void) /* Move init over to a non-isolated CPU */ if (set_cpus_allowed(current, non_isolated_cpus) < 0) BUG(); + /* + * Increase the granularity value when there are more CPUs, + * because with more CPUs the 'effective latency' as visible + * to users decreases. But the relationship is not linear, + * so pick a second-best guess by going with the log2 of the + * number of CPUs. 
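+ * (For example: ilog2() of 2 is 1, of 4 is 2, of 16 is 4, so a 4-CPU
+ * box gets factor 3 and a 16-CPU box gets factor 5 - both the
+ * granularity and the runtime limit below are scaled by that factor.)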
+ * + * This idea comes from the SD scheduler of Con Kolivas: + */ + { + unsigned int factor = 1 + ilog2(num_online_cpus()); + + sysctl_sched_granularity *= factor; + sysctl_sched_runtime_limit *= factor; + } } #else void __init sched_init_smp(void) @@ -6950,8 +5937,15 @@ int in_sched_functions(unsigned long add void __init sched_init(void) { - int i, j, k; int highest_cpu = 0; + int i, j; + + current->sched_class = &fair_sched_class; + /* + * Link up the scheduling class hierarchy: + */ + rt_sched_class.next = &fair_sched_class; + fair_sched_class.next = NULL; for_each_possible_cpu(i) { struct prio_array *array; @@ -6961,14 +5955,13 @@ void __init sched_init(void) spin_lock_init(&rq->lock); lockdep_set_class(&rq->lock, &rq->rq_lock_key); rq->nr_running = 0; - rq->active = rq->arrays; - rq->expired = rq->arrays + 1; - rq->best_expired_prio = MAX_PRIO; + rq->tasks_timeline = RB_ROOT; + rq->clock = rq->fair_clock = 1; + for (j = 0; j < CPU_LOAD_IDX_MAX; j++) + rq->cpu_load[j] = 0; #ifdef CONFIG_SMP rq->sd = NULL; - for (j = 1; j < 3; j++) - rq->cpu_load[j] = 0; rq->active_balance = 0; rq->push_cpu = 0; rq->cpu = i; @@ -6977,16 +5970,14 @@ void __init sched_init(void) #endif atomic_set(&rq->nr_iowait, 0); - for (j = 0; j < 2; j++) { - array = rq->arrays + j; - for (k = 0; k < MAX_PRIO; k++) { - INIT_LIST_HEAD(array->queue + k); - __clear_bit(k, array->bitmap); - } - // delimiter for bitsearch - __set_bit(MAX_PRIO, array->bitmap); + array = &rq->active; + for (j = 0; j < MAX_RT_PRIO; j++) { + INIT_LIST_HEAD(array->queue + j); + __clear_bit(j, array->bitmap); } highest_cpu = i; + /* delimiter for bitsearch: */ + __set_bit(MAX_RT_PRIO, array->bitmap); } set_load_weight(&init_task); @@ -7043,28 +6034,55 @@ EXPORT_SYMBOL(__might_sleep); #ifdef CONFIG_MAGIC_SYSRQ void normalize_rt_tasks(void) { - struct prio_array *array; struct task_struct *p; unsigned long flags; struct rq *rq; + int on_rq; read_lock_irq(&tasklist_lock); for_each_process(p) { - if (!rt_task(p)) + p->fair_key = 0; + p->wait_runtime = 0; + p->wait_start_fair = 0; + p->wait_start = 0; + p->exec_start = 0; + p->sleep_start = 0; + p->sleep_start_fair = 0; + p->block_start = 0; + task_rq(p)->fair_clock = 0; + task_rq(p)->clock = 0; + + if (!rt_task(p)) { + /* + * Renice negative nice level userspace + * tasks back to 0: + */ + if (TASK_NICE(p) < 0 && p->mm) + set_user_nice(p, 0); continue; + } spin_lock_irqsave(&p->pi_lock, flags); rq = __task_rq_lock(p); +#ifdef CONFIG_SMP + /* + * Do not touch the migration thread: + */ + if (p == rq->migration_thread) + goto out_unlock; +#endif - array = p->array; - if (array) - deactivate_task(p, task_rq(p)); - __setscheduler(p, SCHED_NORMAL, 0); - if (array) { - __activate_task(p, task_rq(p)); + on_rq = p->on_rq; + if (on_rq) + deactivate_task(task_rq(p), p, 0); + __setscheduler(rq, p, SCHED_NORMAL, 0); + if (on_rq) { + activate_task(task_rq(p), p, 0); resched_task(rq->curr); } - +#ifdef CONFIG_SMP + out_unlock: +#endif __task_rq_unlock(rq); spin_unlock_irqrestore(&p->pi_lock, flags); } diff -puN /dev/null kernel/sched_debug.c --- /dev/null +++ a/kernel/sched_debug.c @@ -0,0 +1,224 @@ +/* + * kernel/time/sched_debug.c + * + * Print the CFS rbtree + * + * Copyright(C) 2007, Red Hat, Inc., Ingo Molnar + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. 
+ */ + +#include +#include +#include +#include + +typedef void (*print_fn_t)(struct seq_file *m, unsigned int *classes); + +/* + * This allows printing both to /proc/sched_debug and + * to the console + */ +#define SEQ_printf(m, x...) \ + do { \ + if (m) \ + seq_printf(m, x); \ + else \ + printk(x); \ + } while (0) + +static void +print_task(struct seq_file *m, struct rq *rq, struct task_struct *p, u64 now) +{ + if (rq->curr == p) + SEQ_printf(m, "R"); + else + SEQ_printf(m, " "); + + SEQ_printf(m, "%15s %5d %15Ld %13Ld %13Ld %9Ld %5d " + "%15Ld %15Ld %15Ld %15Ld %15Ld\n", + p->comm, p->pid, + (long long)p->fair_key, (long long)p->fair_key - rq->fair_clock, + (long long)p->wait_runtime, + (long long)p->nr_switches, + p->prio, + (long long)p->sum_exec_runtime, + (long long)p->sum_wait_runtime, + (long long)p->sum_sleep_runtime, + (long long)p->wait_runtime_overruns, + (long long)p->wait_runtime_underruns); +} + +static void print_rq(struct seq_file *m, struct rq *rq, u64 now) +{ + struct task_struct *p; + struct rb_node *curr; + + SEQ_printf(m, + "\nrunnable tasks:\n" + " task PID tree-key delta waiting" + " switches prio" + " sum-exec sum-wait sum-sleep" + " wait-overrun wait-underrun\n" + "------------------------------------------------------------------" + "----------------" + "------------------------------------------------" + "--------------------------------\n"); + + curr = first_fair(rq); + while (curr) { + p = rb_entry(curr, struct task_struct, run_node); + print_task(m, rq, p, now); + + curr = rb_next(curr); + } +} + +static void print_rq_runtime_sum(struct seq_file *m, struct rq *rq) +{ + s64 wait_runtime_rq_sum = 0; + struct task_struct *p; + struct rb_node *curr; + unsigned long flags; + + spin_lock_irqsave(&rq->lock, flags); + curr = first_fair(rq); + while (curr) { + p = rb_entry(curr, struct task_struct, run_node); + wait_runtime_rq_sum += p->wait_runtime; + + curr = rb_next(curr); + } + spin_unlock_irqrestore(&rq->lock, flags); + + SEQ_printf(m, " .%-22s: %Ld\n", "wait_runtime_rq_sum", + wait_runtime_rq_sum); +} + +static void print_cpu(struct seq_file *m, int cpu, u64 now) +{ + struct rq *rq = &per_cpu(runqueues, cpu); + + SEQ_printf(m, "\ncpu: %d\n", cpu); +#define P(x) \ + SEQ_printf(m, " .%-22s: %Ld\n", #x, (long long)(rq->x)) + + P(nr_running); + P(raw_weighted_load); + P(nr_switches); + P(nr_load_updates); + P(nr_uninterruptible); + SEQ_printf(m, " .%-22s: %lu\n", "jiffies", jiffies); + P(next_balance); + P(curr->pid); + P(clock); + P(prev_clock_raw); + P(clock_warps); + P(clock_unstable_events); + P(clock_max_delta); + rq->clock_max_delta = 0; + P(fair_clock); + P(prev_fair_clock); + P(exec_clock); + P(prev_exec_clock); + P(wait_runtime); + P(wait_runtime_overruns); + P(wait_runtime_underruns); + P(cpu_load[0]); + P(cpu_load[1]); + P(cpu_load[2]); + P(cpu_load[3]); + P(cpu_load[4]); +#undef P + print_rq_runtime_sum(m, rq); + + print_rq(m, rq, now); +} + +static int sched_debug_show(struct seq_file *m, void *v) +{ + u64 now = ktime_to_ns(ktime_get()); + int cpu; + + SEQ_printf(m, "Sched Debug Version: v0.02\n"); + SEQ_printf(m, "now at %Lu nsecs\n", (unsigned long long)now); + + for_each_online_cpu(cpu) + print_cpu(m, cpu, now); + + SEQ_printf(m, "\n"); + + return 0; +} + +void sysrq_sched_debug_show(void) +{ + sched_debug_show(NULL, NULL); +} + +static int sched_debug_open(struct inode *inode, struct file *filp) +{ + return single_open(filp, sched_debug_show, NULL); +} + +static struct file_operations sched_debug_fops = { + .open = sched_debug_open, + .read = 
seq_read, + .llseek = seq_lseek, + .release = seq_release, +}; + +static int __init init_sched_debug_procfs(void) +{ + struct proc_dir_entry *pe; + + pe = create_proc_entry("sched_debug", 0644, NULL); + if (!pe) + return -ENOMEM; + + pe->proc_fops = &sched_debug_fops; + + return 0; +} + +__initcall(init_sched_debug_procfs); + +void proc_sched_show_task(struct task_struct *p, struct seq_file *m) +{ +#define P(F) \ + SEQ_printf(m, "%-25s:%20Ld\n", #F, (long long)p->F) + + P(wait_start); + P(wait_start_fair); + P(exec_start); + P(sleep_start); + P(sleep_start_fair); + P(block_start); + P(sleep_max); + P(block_max); + P(exec_max); + P(wait_max); + P(last_ran); + P(wait_runtime); + P(wait_runtime_overruns); + P(wait_runtime_underruns); + P(sum_exec_runtime); +#undef P + + { + u64 t0, t1; + + t0 = sched_clock(); + t1 = sched_clock(); + SEQ_printf(m, "%-25s:%20Ld\n", "clock-delta", (long long)t1-t0); + } +} + +void proc_sched_set_task(struct task_struct *p) +{ + p->sleep_max = p->block_max = p->exec_max = p->wait_max = 0; + p->wait_runtime_overruns = p->wait_runtime_underruns = 0; + p->sum_exec_runtime = 0; +} diff -puN /dev/null kernel/sched_fair.c --- /dev/null +++ a/kernel/sched_fair.c @@ -0,0 +1,756 @@ +/* + * Completely Fair Scheduling (CFS) Class (SCHED_NORMAL/SCHED_BATCH) + */ + +/* + * Preemption granularity: + * (default: 2 msec, units: nanoseconds) + * + * NOTE: this granularity value is not the same as the concept of + * 'timeslice length' - timeslices in CFS will typically be somewhat + * larger than this value. (to see the precise effective timeslice + * length of your workload, run vmstat and monitor the context-switches + * field) + * + * On SMP systems the value of this is multiplied by the log2 of the + * number of CPUs. (i.e. factor 2x on 2-way systems, 3x on 4-way + * systems, 4x on 8-way systems, 5x on 16-way systems, etc.) + */ +unsigned int sysctl_sched_granularity __read_mostly = 1000000000/HZ; + +/* + * Wake-up granularity. + * (default: 1 msec, units: nanoseconds) + * + * This option delays the preemption effects of decoupled workloads + * and reduces their over-scheduling. Synchronous workloads will still + * have immediate wakeup/sleep latencies. + */ +unsigned int sysctl_sched_wakeup_granularity __read_mostly = 0; + +unsigned int sysctl_sched_runtime_limit __read_mostly = 2000000000/HZ; + +unsigned int sysctl_sched_load_smoothing __read_mostly = 0; + +/* + * sys_sched_yield unfairness bug workaround switch. + * (default: -1:auto-detect+disabled. Other values: 0:disabled, 1:enabled) + * + * This option switches the unfair yield implementation of the + * old scheduler back on. Needed for good performance of certain + * apps like 3D games on Radeon cards. + */ +int sysctl_sched_yield_bug_workaround __read_mostly = 1; + +EXPORT_SYMBOL_GPL(sysctl_sched_yield_bug_workaround); + +extern struct sched_class fair_sched_class; + +/**************************************************************/ +/* Scheduling class tree data structure manipulation methods: + */ + +/* + * Enqueue a task into the rb-tree: + */ +static inline void __enqueue_task_fair(struct rq *rq, struct task_struct *p) +{ + struct rb_node **link = &rq->tasks_timeline.rb_node; + struct rb_node *parent = NULL; + struct task_struct *entry; + s64 key = p->fair_key; + int leftmost = 1; + + /* + * Find the right place in the rbtree: + */ + while (*link) { + parent = *link; + entry = rb_entry(parent, struct task_struct, run_node); + /* + * We dont care about collisions. Nodes with + * the same key stay together. 
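+ * (A new node whose key is equal to an existing one goes to the
+ * right, so equal-key tasks keep their insertion order.)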
+ */ + if ((s64)(key - entry->fair_key) < 0) { + link = &parent->rb_left; + } else { + link = &parent->rb_right; + leftmost = 0; + } + } + + /* + * Maintain a cache of leftmost tree entries (it is frequently + * used): + */ + if (leftmost) + rq->rb_leftmost = &p->run_node; + + rb_link_node(&p->run_node, parent, link); + rb_insert_color(&p->run_node, &rq->tasks_timeline); +} + +static inline void __dequeue_task_fair(struct rq *rq, struct task_struct *p) +{ + if (rq->rb_leftmost == &p->run_node) + rq->rb_leftmost = NULL; + rb_erase(&p->run_node, &rq->tasks_timeline); +} + +static inline struct rb_node * first_fair(struct rq *rq) +{ + if (rq->rb_leftmost) + return rq->rb_leftmost; + /* Cache the value returned by rb_first() */ + rq->rb_leftmost = rb_first(&rq->tasks_timeline); + return rq->rb_leftmost; +} + +static struct task_struct * __pick_next_task_fair(struct rq *rq) +{ + return rb_entry(first_fair(rq), struct task_struct, run_node); +} + +/**************************************************************/ +/* Scheduling class statistics methods: + */ + +/* + * We rescale the rescheduling granularity of tasks according to their + * nice level, but only linearly, not exponentially: + */ +static u64 +niced_granularity(struct task_struct *curr, unsigned long granularity) +{ + /* + * Negative nice levels get the same granularity as nice-0: + */ + if (curr->load_weight >= NICE_0_LOAD) + return granularity; + /* + * Positive nice level tasks get linearly finer + * granularity: + */ + return curr->load_weight * (s64)(granularity / NICE_0_LOAD); +} + +unsigned long get_rq_load(struct rq *rq) +{ + unsigned long load = rq->cpu_load[CPU_LOAD_IDX_MAX-1] + 1; + + if (!(sysctl_sched_load_smoothing & 1)) + return rq->raw_weighted_load; + + if (sysctl_sched_load_smoothing & 4) + load = max(load, rq->raw_weighted_load); + + return load; +} + +static void limit_wait_runtime(struct rq *rq, struct task_struct *p) +{ + s64 limit = sysctl_sched_runtime_limit; + s64 nice_limit = limit; // niced_granularity(p, limit); + + /* + * Niced tasks have the same history dynamic range as + * non-niced tasks, but their limits are offset. + */ + if (p->wait_runtime > nice_limit) { + p->wait_runtime = nice_limit; + p->wait_runtime_overruns++; + rq->wait_runtime_overruns++; + } + limit = (limit << 1) - nice_limit; + if (p->wait_runtime < -limit) { + p->wait_runtime = -limit; + p->wait_runtime_underruns++; + rq->wait_runtime_underruns++; + } +} + +static void __add_wait_runtime(struct rq *rq, struct task_struct *p, s64 delta) +{ + p->wait_runtime += delta; + p->sum_wait_runtime += delta; + limit_wait_runtime(rq, p); +} + +static void add_wait_runtime(struct rq *rq, struct task_struct *p, s64 delta) +{ + rq->wait_runtime -= p->wait_runtime; + __add_wait_runtime(rq, p, delta); + rq->wait_runtime += p->wait_runtime; +} + +/* + * Update the current task's runtime statistics. Skip current tasks that + * are not in our scheduling class. 
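+ *
+ * (Example: with two runnable nice-0 tasks, raw_weighted_load is
+ * 2*NICE_0_LOAD, so for every delta_exec nanoseconds the current task
+ * runs, fair_clock advances by delta_exec/2, the running task is
+ * charged delta_mine - delta_exec = -delta_exec/2 of wait_runtime and
+ * the waiting task earns +delta_exec/2 once its wait period ends in
+ * update_stats_wait_end().)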
+ */ +static inline void update_curr(struct rq *rq, u64 now) +{ + u64 delta_exec, delta_fair, delta_mine; + struct task_struct *curr = rq->curr; + + if (curr->sched_class != &fair_sched_class || curr == rq->idle + || !curr->on_rq) + return; + /* + * Get the amount of time the current task was running + * since the last time we changed raw_weighted_load: + */ + delta_exec = now - curr->exec_start; + if (unlikely(delta_exec > curr->exec_max)) + curr->exec_max = delta_exec; + + if (sysctl_sched_load_smoothing & 1) { + unsigned long load = get_rq_load(rq); + + if (sysctl_sched_load_smoothing & 2) { + delta_fair = delta_exec * NICE_0_LOAD; + do_div(delta_fair, load); + } else { + delta_fair = delta_exec * NICE_0_LOAD; + do_div(delta_fair, rq->raw_weighted_load); + } + + delta_mine = delta_exec * curr->load_weight; + do_div(delta_mine, load); + } else { + delta_fair = delta_exec * NICE_0_LOAD; + delta_fair += rq->raw_weighted_load >> 1; + do_div(delta_fair, rq->raw_weighted_load); + + delta_mine = delta_exec * curr->load_weight; + delta_mine += rq->raw_weighted_load >> 1; + do_div(delta_mine, rq->raw_weighted_load); + } + + curr->sum_exec_runtime += delta_exec; + curr->exec_start = now; + rq->exec_clock += delta_exec; + + /* + * Task already marked for preemption, do not burden + * it with the cost of not having left the CPU yet. + */ + if (unlikely(test_tsk_thread_flag(curr, TIF_NEED_RESCHED))) + goto out_nowait; + + rq->fair_clock += delta_fair; + /* + * We executed delta_exec amount of time on the CPU, + * but we were only entitled to delta_mine amount of + * time during that period (if nr_running == 1 then + * the two values are equal) + * [Note: delta_mine - delta_exec is negative]: + */ + add_wait_runtime(rq, curr, delta_mine - delta_exec); +out_nowait: + ; +} + +static inline void +update_stats_wait_start(struct rq *rq, struct task_struct *p, u64 now) +{ + p->wait_start_fair = rq->fair_clock; + p->wait_start = now; +} + +/* + * Task is being enqueued - update stats: + */ +static inline void +update_stats_enqueue(struct rq *rq, struct task_struct *p, u64 now) +{ + s64 key; + + /* + * Are we enqueueing a waiting task? (for current tasks + * a dequeue/enqueue event is a NOP) + */ + if (p != rq->curr) + update_stats_wait_start(rq, p, now); + /* + * Update the key: + */ + key = rq->fair_clock; + + /* + * Optimize the common nice 0 case: + */ + if (likely(p->load_weight == NICE_0_LOAD)) { + key -= p->wait_runtime; + } else { + int negative = p->wait_runtime < 0; + u64 tmp; + + if (p->load_weight > NICE_0_LOAD) { + /* negative-reniced tasks get helped: */ + + if (negative) { + tmp = -p->wait_runtime; + tmp *= NICE_0_LOAD; + do_div(tmp, p->load_weight); + + key += tmp; + } else { + tmp = p->wait_runtime; + tmp *= p->load_weight; + do_div(tmp, NICE_0_LOAD); + + key -= tmp; + } + } else { + /* plus-reniced tasks get hurt: */ + + if (negative) { + tmp = -p->wait_runtime; + + tmp *= NICE_0_LOAD; + do_div(tmp, p->load_weight); + + key += tmp; + } else { + tmp = p->wait_runtime; + + tmp *= p->load_weight; + do_div(tmp, NICE_0_LOAD); + + key -= tmp; + } + } + } + + p->fair_key = key; +} + +/* + * Note: must be called with a freshly updated rq->fair_clock. 
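+ * (The credit is the fair-clock delta since wait_start_fair, scaled
+ * by load_weight: e.g. a nice-0 task that waited while fair_clock
+ * advanced 3 msec is credited 3 msec of wait_runtime, a task with
+ * half the nice-0 weight only 1.5 msec.)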
+ */ +static inline void +update_stats_wait_end(struct rq *rq, struct task_struct *p, u64 now) +{ + s64 delta_fair, delta_wait; + + delta_wait = now - p->wait_start; + if (unlikely(delta_wait > p->wait_max)) + p->wait_max = delta_wait; + + if (p->wait_start_fair) { + delta_fair = rq->fair_clock - p->wait_start_fair; + if (unlikely(p->load_weight != NICE_0_LOAD)) + delta_fair = (delta_fair * p->load_weight) / + NICE_0_LOAD; + add_wait_runtime(rq, p, delta_fair); + } + + p->wait_start_fair = 0; + p->wait_start = 0; +} + +static inline void +update_stats_dequeue(struct rq *rq, struct task_struct *p, u64 now) +{ + update_curr(rq, now); + /* + * Mark the end of the wait period if dequeueing a + * waiting task: + */ + if (p != rq->curr) + update_stats_wait_end(rq, p, now); +} + +/* + * We are picking a new current task - update its stats: + */ +static inline void +update_stats_curr_start(struct rq *rq, struct task_struct *p, u64 now) +{ + /* + * We are starting a new run period: + */ + p->exec_start = now; +} + +/* + * We are descheduling a task - update its stats: + */ +static inline void +update_stats_curr_end(struct rq *rq, struct task_struct *p, u64 now) +{ + update_curr(rq, now); + + p->exec_start = 0; +} + +/**************************************************************/ +/* Scheduling class queueing methods: + */ + +static void enqueue_sleeper(struct rq *rq, struct task_struct *p) +{ + unsigned long load = get_rq_load(rq); + u64 delta_fair = 0; + + if (!(sysctl_sched_load_smoothing & 16)) + goto out; + + delta_fair = rq->fair_clock - p->sleep_start_fair; + if ((s64)delta_fair < 0) + delta_fair = 0; + + /* + * Fix up delta_fair with the effect of us running + * during the whole sleep period: + */ + if (sysctl_sched_load_smoothing & 8) { + delta_fair = delta_fair * load; + do_div(delta_fair, load + p->load_weight); + } + + __add_wait_runtime(rq, p, delta_fair); + +out: + rq->wait_runtime += p->wait_runtime; + + p->sleep_start_fair = 0; +} + +/* + * The enqueue_task method is called before nr_running is + * increased. Here we update the fair scheduling stats and + * then put the task into the rbtree: + */ +static void +enqueue_task_fair(struct rq *rq, struct task_struct *p, int wakeup, u64 now) +{ + u64 delta = 0; + + /* + * Update the fair clock. + */ + update_curr(rq, now); + + if (wakeup) { + if (p->sleep_start) { + delta = now - p->sleep_start; + if ((s64)delta < 0) + delta = 0; + + if (unlikely(delta > p->sleep_max)) + p->sleep_max = delta; + + p->sleep_start = 0; + } + if (p->block_start) { + delta = now - p->block_start; + if ((s64)delta < 0) + delta = 0; + + if (unlikely(delta > p->block_max)) + p->block_max = delta; + + p->block_start = 0; + } + p->sum_sleep_runtime += delta; + + if (p->sleep_start_fair) + enqueue_sleeper(rq, p); + } + update_stats_enqueue(rq, p, now); + __enqueue_task_fair(rq, p); +} + +/* + * The dequeue_task method is called before nr_running is + * decreased. 
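+ * A task that is going to sleep is timestamped here - sleep_start for
+ * interruptible and block_start for uninterruptible sleeps - so that
+ * the wakeup path in enqueue_task_fair() can update sleep_max and
+ * block_max.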
We remove the task from the rbtree and + * update the fair scheduling stats: + */ +static void +dequeue_task_fair(struct rq *rq, struct task_struct *p, int sleep, u64 now) +{ + update_stats_dequeue(rq, p, now); + if (sleep) { + if (p->state & TASK_INTERRUPTIBLE) + p->sleep_start = now; + if (p->state & TASK_UNINTERRUPTIBLE) + p->block_start = now; + p->sleep_start_fair = rq->fair_clock; + rq->wait_runtime -= p->wait_runtime; + } + __dequeue_task_fair(rq, p); +} + +/* + * sched_yield() support is very simple via the rbtree: we just + * dequeue the task and move it after the next task, which + * causes tasks to roundrobin. + */ +static void +yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to) +{ + struct rb_node *curr, *next, *first; + struct task_struct *p_next; + s64 yield_key; + u64 now; + + /* + * Bug workaround for 3D apps running on the radeon 3D driver: + */ + if (unlikely(sysctl_sched_yield_bug_workaround > 0)) { + if (sysctl_sched_yield_bug_workaround == 2) { + resched_task(p); + return; + } + now = __rq_clock(rq); + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_task_fair(rq, p, 0, now); + p->on_rq = 0; + enqueue_task_fair(rq, p, 0, now); + p->on_rq = 1; + + /* + * Reschedule if another task tops the current one. + */ + p_next = __pick_next_task_fair(rq); + if (p_next != p) + resched_task(p); + return; + } + + /* + * yield-to support: if we are on the same runqueue then + * give half of our wait_runtime (if it's positive) to the other task: + */ + if (p_to && p->wait_runtime > 0) { + p_to->wait_runtime += p->wait_runtime >> 1; + p->wait_runtime >>= 1; + } + curr = &p->run_node; + first = first_fair(rq); + /* + * Move this task to the second place in the tree: + */ + if (unlikely(curr != first)) { + next = first; + } else { + next = rb_next(curr); + /* + * We were the last one already - nothing to do, return + * and reschedule: + */ + if (unlikely(!next)) + return; + } + + p_next = rb_entry(next, struct task_struct, run_node); + /* + * Minimally necessary key value to be the second in the tree: + */ + yield_key = p_next->fair_key + 1; + + now = __rq_clock(rq); + dequeue_task_fair(rq, p, 0, now); + p->on_rq = 0; + + /* + * Only update the key if we need to move more backwards + * than the minimally necessary position to be the second: + */ + if (p->fair_key < yield_key) + p->fair_key = yield_key; + + __enqueue_task_fair(rq, p); + p->on_rq = 1; +} + +/* + * Preempt the current task with a newly woken task if needed: + */ +static inline void +__check_preempt_curr_fair(struct rq *rq, struct task_struct *p, + struct task_struct *curr, unsigned long granularity) +{ + s64 __delta = curr->fair_key - p->fair_key; + + /* + * Take scheduling granularity into account - do not + * preempt the current task unless the best task has + * a larger than sched_granularity fairness advantage: + */ + if (__delta > niced_granularity(curr, granularity)) + resched_task(curr); +} + +/* + * Preempt the current task with a newly woken task if needed: + */ +static void check_preempt_curr_fair(struct rq *rq, struct task_struct *p) +{ + struct task_struct *curr = rq->curr; + + if ((curr == rq->idle) || rt_prio(p->prio)) { + resched_task(curr); + } else { + __check_preempt_curr_fair(rq, p, curr, + sysctl_sched_wakeup_granularity); + } +} + +static struct task_struct * pick_next_task_fair(struct rq *rq, u64 now) +{ + struct task_struct *p = __pick_next_task_fair(rq); + + /* + * Any task has to be enqueued before it get to execute on + * a 
CPU. So account for the time it spent waiting on the + * runqueue. (note, here we rely on pick_next_task() having + * done a put_prev_task_fair() shortly before this, which + * updated rq->fair_clock - used by update_stats_wait_end()) + */ + update_stats_wait_end(rq, p, now); + update_stats_curr_start(rq, p, now); + + return p; +} + +/* + * Account for a descheduled task: + */ +static void put_prev_task_fair(struct rq *rq, struct task_struct *prev, u64 now) +{ + if (prev == rq->idle) + return; + + update_stats_curr_end(rq, prev, now); + /* + * If the task is still waiting for the CPU (it just got + * preempted), start the wait period: + */ + if (prev->on_rq) + update_stats_wait_start(rq, prev, now); +} + +/**************************************************************/ +/* Fair scheduling class load-balancing methods: + */ + +/* + * Load-balancing iterator. Note: while the runqueue stays locked + * during the whole iteration, the current task might be + * dequeued so the iterator has to be dequeue-safe. Here we + * achieve that by always pre-iterating before returning + * the current task: + */ +static struct task_struct * load_balance_start_fair(struct rq *rq) +{ + struct rb_node *first = first_fair(rq); + struct task_struct *p; + + if (!first) + return NULL; + + p = rb_entry(first, struct task_struct, run_node); + + rq->rb_load_balance_curr = rb_next(first); + + return p; +} + +static struct task_struct * load_balance_next_fair(struct rq *rq) +{ + struct rb_node *curr = rq->rb_load_balance_curr; + struct task_struct *p; + + if (!curr) + return NULL; + + p = rb_entry(curr, struct task_struct, run_node); + rq->rb_load_balance_curr = rb_next(curr); + + return p; +} + +/* + * scheduler tick hitting a task of our scheduling class: + */ +static void task_tick_fair(struct rq *rq, struct task_struct *curr) +{ + struct task_struct *next; + u64 now = __rq_clock(rq); + + /* + * Dequeue and enqueue the task to update its + * position within the tree: + */ + dequeue_task_fair(rq, curr, 0, now); + curr->on_rq = 0; + enqueue_task_fair(rq, curr, 0, now); + curr->on_rq = 1; + + /* + * Reschedule if another task tops the current one. + */ + next = __pick_next_task_fair(rq); + if (next == curr) + return; + + if ((curr == rq->idle) || (rt_prio(next->prio) && + (next->prio < curr->prio))) + resched_task(curr); + else + __check_preempt_curr_fair(rq, next, curr, + sysctl_sched_granularity); +} + +/* + * Share the fairness runtime between parent and child, thus the + * total amount of pressure for CPU stays equal - new tasks + * get a chance to run but frequent forkers are not allowed to + * monopolize the CPU. Note: the parent runqueue is locked, + * the child is not running yet. + */ +static void task_new_fair(struct rq *rq, struct task_struct *p) +{ + sched_info_queued(p); + update_stats_enqueue(rq, p, rq_clock(rq)); + /* + * Child runs first: we let it run before the parent + * until it reschedules once. 
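+ * (The offset used below is niced_granularity() of the parent plus
+ * one nanosecond, i.e. the child lands that far to the left of the
+ * parent in the rbtree.)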
We set up the key so that + * it will preempt the parent: + */ + p->fair_key = current->fair_key - niced_granularity(rq->curr, + sysctl_sched_granularity) - 1; + /* + * The first wait is dominated by the child-runs-first logic, + * so do not credit it with that waiting time yet: + */ + p->wait_start_fair = 0; + + __enqueue_task_fair(rq, p); + p->on_rq = 1; + inc_nr_running(p, rq); +} + +/* + * All the scheduling class methods: + */ +struct sched_class fair_sched_class __read_mostly = { + .enqueue_task = enqueue_task_fair, + .dequeue_task = dequeue_task_fair, + .yield_task = yield_task_fair, + + .check_preempt_curr = check_preempt_curr_fair, + + .pick_next_task = pick_next_task_fair, + .put_prev_task = put_prev_task_fair, + + .load_balance_start = load_balance_start_fair, + .load_balance_next = load_balance_next_fair, + .task_tick = task_tick_fair, + .task_new = task_new_fair, +}; diff -puN /dev/null kernel/sched_rt.c --- /dev/null +++ a/kernel/sched_rt.c @@ -0,0 +1,184 @@ +/* + * Real-Time Scheduling Class (mapped to the SCHED_FIFO and SCHED_RR + * policies) + */ + +static void +enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup, u64 now) +{ + struct prio_array *array = &rq->active; + + list_add_tail(&p->run_list, array->queue + p->prio); + __set_bit(p->prio, array->bitmap); +} + +/* + * Adding/removing a task to/from a priority array: + */ +static void +dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep, u64 now) +{ + struct prio_array *array = &rq->active; + + list_del(&p->run_list); + if (list_empty(array->queue + p->prio)) + __clear_bit(p->prio, array->bitmap); +} + +/* + * Put task to the end of the run list without the overhead of dequeue + * followed by enqueue. + */ +static void requeue_task_rt(struct rq *rq, struct task_struct *p) +{ + struct prio_array *array = &rq->active; + + list_move_tail(&p->run_list, array->queue + p->prio); +} + +static void +yield_task_rt(struct rq *rq, struct task_struct *p, struct task_struct *p_to) +{ + requeue_task_rt(rq, p); +} + +/* + * Preempt the current task with a newly woken task if needed: + */ +static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) +{ + if (p->prio < rq->curr->prio) + resched_task(rq->curr); +} + +static struct task_struct * pick_next_task_rt(struct rq *rq, u64 now) +{ + struct prio_array *array = &rq->active; + struct list_head *queue; + int idx; + + idx = sched_find_first_bit(array->bitmap); + if (idx >= MAX_RT_PRIO) + return NULL; + + queue = array->queue + idx; + return list_entry(queue->next, struct task_struct, run_list); +} + +/* + * No accounting done when RT tasks are descheduled: + */ +static void put_prev_task_rt(struct rq *rq, struct task_struct *p, u64 now) +{ +} + +/* + * Load-balancing iterator. Note: while the runqueue stays locked + * during the whole iteration, the current task might be + * dequeued so the iterator has to be dequeue-safe. 
Here we + * achieve that by always pre-iterating before returning + * the current task: + */ +static struct task_struct * load_balance_start_rt(struct rq *rq) +{ + struct prio_array *array = &rq->active; + struct list_head *head, *curr; + struct task_struct *p; + int idx; + + idx = sched_find_first_bit(array->bitmap); + if (idx >= MAX_RT_PRIO) + return NULL; + + head = array->queue + idx; + curr = head->prev; + + p = list_entry(curr, struct task_struct, run_list); + + curr = curr->prev; + + rq->rt_load_balance_idx = idx; + rq->rt_load_balance_head = head; + rq->rt_load_balance_curr = curr; + + return p; +} + +static struct task_struct * load_balance_next_rt(struct rq *rq) +{ + struct prio_array *array = &rq->active; + struct list_head *head, *curr; + struct task_struct *p; + int idx; + + idx = rq->rt_load_balance_idx; + head = rq->rt_load_balance_head; + curr = rq->rt_load_balance_curr; + + /* + * If we arrived back to the head again then + * iterate to the next queue (if any): + */ + if (unlikely(head == curr)) { + int next_idx = find_next_bit(array->bitmap, MAX_RT_PRIO, idx+1); + + if (next_idx >= MAX_RT_PRIO) + return NULL; + + idx = next_idx; + head = array->queue + idx; + curr = head->prev; + + rq->rt_load_balance_idx = idx; + rq->rt_load_balance_head = head; + } + + p = list_entry(curr, struct task_struct, run_list); + + curr = curr->prev; + + rq->rt_load_balance_curr = curr; + + return p; +} + +static void task_tick_rt(struct rq *rq, struct task_struct *p) +{ + /* + * RR tasks need a special form of timeslice management. + * FIFO tasks have no timeslices. + */ + if ((p->policy == SCHED_RR) && !--p->time_slice) { + p->time_slice = static_prio_timeslice(p->static_prio); + set_tsk_need_resched(p); + + /* put it at the end of the queue: */ + requeue_task_rt(rq, p); + } +} + +/* + * No parent/child timeslice management necessary for RT tasks, + * just activate them: + */ +static void task_new_rt(struct rq *rq, struct task_struct *p) +{ + activate_task(rq, p, 1); +} + +static struct sched_class rt_sched_class __read_mostly = { + .enqueue_task = enqueue_task_rt, + .dequeue_task = dequeue_task_rt, + .yield_task = yield_task_rt, + + .check_preempt_curr = check_preempt_curr_rt, + + .pick_next_task = pick_next_task_rt, + .put_prev_task = put_prev_task_rt, + + .load_balance_start = load_balance_start_rt, + .load_balance_next = load_balance_next_rt, + + .task_tick = task_tick_rt, + .task_new = task_new_rt, +}; diff -puN /dev/null kernel/sched_stats.h --- /dev/null +++ a/kernel/sched_stats.h @@ -0,0 +1,235 @@ + +#ifdef CONFIG_SCHEDSTATS +/* + * bump this up when changing the output format or the meaning of an existing + * format, so that tools can adapt (or abort) + */ +#define SCHEDSTAT_VERSION 14 + +static int show_schedstat(struct seq_file *seq, void *v) +{ + int cpu; + + seq_printf(seq, "version %d\n", SCHEDSTAT_VERSION); + seq_printf(seq, "timestamp %lu\n", jiffies); + for_each_online_cpu(cpu) { + struct rq *rq = cpu_rq(cpu); +#ifdef CONFIG_SMP + struct sched_domain *sd; + int dcnt = 0; +#endif + + /* runqueue-specific stats */ + seq_printf(seq, + "cpu%d %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu", + cpu, rq->yld_both_empty, + rq->yld_act_empty, rq->yld_exp_empty, rq->yld_cnt, + rq->sched_switch, rq->sched_cnt, rq->sched_goidle, + rq->ttwu_cnt, rq->ttwu_local, + rq->rq_sched_info.cpu_time, + rq->rq_sched_info.run_delay, rq->rq_sched_info.pcnt); + + seq_printf(seq, "\n"); + +#ifdef CONFIG_SMP + /* domain-specific stats */ + preempt_disable(); + for_each_domain(cpu, sd) { + enum idle_type 
itype; + char mask_str[NR_CPUS]; + + cpumask_scnprintf(mask_str, NR_CPUS, sd->span); + seq_printf(seq, "domain%d %s", dcnt++, mask_str); + for (itype = SCHED_IDLE; itype < MAX_IDLE_TYPES; + itype++) { + seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu " + "%lu", + sd->lb_cnt[itype], + sd->lb_balanced[itype], + sd->lb_failed[itype], + sd->lb_imbalance[itype], + sd->lb_gained[itype], + sd->lb_hot_gained[itype], + sd->lb_nobusyq[itype], + sd->lb_nobusyg[itype]); + } + seq_printf(seq, " %lu %lu %lu %lu %lu %lu %lu %lu %lu" + " %lu %lu %lu\n", + sd->alb_cnt, sd->alb_failed, sd->alb_pushed, + sd->sbe_cnt, sd->sbe_balanced, sd->sbe_pushed, + sd->sbf_cnt, sd->sbf_balanced, sd->sbf_pushed, + sd->ttwu_wake_remote, sd->ttwu_move_affine, + sd->ttwu_move_balance); + } + preempt_enable(); +#endif + } + return 0; +} + +static int schedstat_open(struct inode *inode, struct file *file) +{ + unsigned int size = PAGE_SIZE * (1 + num_online_cpus() / 32); + char *buf = kmalloc(size, GFP_KERNEL); + struct seq_file *m; + int res; + + if (!buf) + return -ENOMEM; + res = single_open(file, show_schedstat, NULL); + if (!res) { + m = file->private_data; + m->buf = buf; + m->size = size; + } else + kfree(buf); + return res; +} + +const struct file_operations proc_schedstat_operations = { + .open = schedstat_open, + .read = seq_read, + .llseek = seq_lseek, + .release = single_release, +}; + +/* + * Expects runqueue lock to be held for atomicity of update + */ +static inline void +rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) +{ + if (rq) { + rq->rq_sched_info.run_delay += delta_jiffies; + rq->rq_sched_info.pcnt++; + } +} + +/* + * Expects runqueue lock to be held for atomicity of update + */ +static inline void +rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) +{ + if (rq) + rq->rq_sched_info.cpu_time += delta_jiffies; +} +# define schedstat_inc(rq, field) do { (rq)->field++; } while (0) +# define schedstat_add(rq, field, amt) do { (rq)->field += (amt); } while (0) +#else /* !CONFIG_SCHEDSTATS */ +static inline void +rq_sched_info_arrive(struct rq *rq, unsigned long delta_jiffies) +{} +static inline void +rq_sched_info_depart(struct rq *rq, unsigned long delta_jiffies) +{} +# define schedstat_inc(rq, field) do { } while (0) +# define schedstat_add(rq, field, amt) do { } while (0) +#endif + +#if defined(CONFIG_SCHEDSTATS) || defined(CONFIG_TASK_DELAY_ACCT) +/* + * Called when a process is dequeued from the active array and given + * the cpu. We should note that with the exception of interactive + * tasks, the expired queue will become the active queue after the active + * queue is empty, without explicitly dequeuing and requeuing tasks in the + * expired queue. (Interactive tasks may be requeued directly to the + * active queue, thus delaying tasks in the expired queue from running; + * see scheduler_tick()). + * + * This function is only called from sched_info_arrive(), rather than + * dequeue_task(). Even though a task may be queued and dequeued multiple + * times as it is shuffled about, we're really interested in knowing how + * long it was from the *first* time it was queued to the time that it + * finally hit a cpu. + */ +static inline void sched_info_dequeued(struct task_struct *t) +{ + t->sched_info.last_queued = 0; +} + +/* + * Called when a task finally hits the cpu. We can now calculate how + * long it was waiting to run. We also note when it began so that we + * can keep stats on how long its timeslice is. 
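+ * (Example: a task queued at jiffies 1000 that first gets the CPU at
+ * jiffies 1004 adds 4 ticks to its run_delay and to the runqueue's
+ * rq_sched_info.run_delay.)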
+ */ +static void sched_info_arrive(struct task_struct *t) +{ + unsigned long now = jiffies, delta_jiffies = 0; + + if (t->sched_info.last_queued) + delta_jiffies = now - t->sched_info.last_queued; + sched_info_dequeued(t); + t->sched_info.run_delay += delta_jiffies; + t->sched_info.last_arrival = now; + t->sched_info.pcnt++; + + rq_sched_info_arrive(task_rq(t), delta_jiffies); +} + +/* + * Called when a process is queued into either the active or expired + * array. The time is noted and later used to determine how long we + * had to wait for us to reach the cpu. Since the expired queue will + * become the active queue after active queue is empty, without dequeuing + * and requeuing any tasks, we are interested in queuing to either. It + * is unusual but not impossible for tasks to be dequeued and immediately + * requeued in the same or another array: this can happen in sched_yield(), + * set_user_nice(), and even load_balance() as it moves tasks from runqueue + * to runqueue. + * + * This function is only called from enqueue_task(), but also only updates + * the timestamp if it is already not set. It's assumed that + * sched_info_dequeued() will clear that stamp when appropriate. + */ +static inline void sched_info_queued(struct task_struct *t) +{ + if (unlikely(sched_info_on())) + if (!t->sched_info.last_queued) + t->sched_info.last_queued = jiffies; +} + +/* + * Called when a process ceases being the active-running process, either + * voluntarily or involuntarily. Now we can calculate how long we ran. + */ +static inline void sched_info_depart(struct task_struct *t) +{ + unsigned long delta_jiffies = jiffies - t->sched_info.last_arrival; + + t->sched_info.cpu_time += delta_jiffies; + rq_sched_info_depart(task_rq(t), delta_jiffies); +} + +/* + * Called when tasks are switched involuntarily due, typically, to expiring + * their time slice. (This may also be called when switching to or from + * the idle task.) We are only called when prev != next. + */ +static inline void +__sched_info_switch(struct task_struct *prev, struct task_struct *next) +{ + struct rq *rq = task_rq(prev); + + /* + * prev now departs the cpu. It's not interesting to record + * stats about how efficient we were at scheduling the idle + * process, however. 
+ */ + if (prev != rq->idle) + sched_info_depart(prev); + + if (next != rq->idle) + sched_info_arrive(next); +} +static inline void +sched_info_switch(struct task_struct *prev, struct task_struct *next) +{ + if (unlikely(sched_info_on())) + __sched_info_switch(prev, next); +} +#else +#define sched_info_queued(t) do { } while (0) +#define sched_info_switch(t, next) do { } while (0) +#endif /* CONFIG_SCHEDSTATS || CONFIG_TASK_DELAY_ACCT */ + diff -puN kernel/sysctl.c~cfs-scheduler kernel/sysctl.c --- a/kernel/sysctl.c~cfs-scheduler +++ a/kernel/sysctl.c @@ -210,6 +210,54 @@ static ctl_table root_table[] = { static ctl_table kern_table[] = { { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_granularity_ns", + .data = &sysctl_sched_granularity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_wakeup_granularity_ns", + .data = &sysctl_sched_wakeup_granularity, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_runtime_limit_ns", + .data = &sysctl_sched_runtime_limit, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_child_runs_first", + .data = &sysctl_sched_child_runs_first, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_load_smoothing", + .data = &sysctl_sched_load_smoothing, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { + .ctl_name = CTL_UNNUMBERED, + .procname = "sched_yield_bug_workaround", + .data = &sysctl_sched_yield_bug_workaround, + .maxlen = sizeof(unsigned int), + .mode = 0644, + .proc_handler = &proc_dointvec, + }, + { .ctl_name = KERN_PANIC, .procname = "panic", .data = &panic_timeout, _
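
The tunables added to kern_table above are plain proc_dointvec entries, so with this patch applied they should show up as readable/writable files under /proc/sys/kernel/. Below is a minimal userspace sketch (not part of the patch) that reads sched_granularity_ns and optionally writes a new value; the path and the nanosecond unit come from the sysctl table above, everything else is illustrative.

/* granularity.c - read and optionally set sched_granularity_ns */
#include <stdio.h>

static const char *path = "/proc/sys/kernel/sched_granularity_ns";

int main(int argc, char **argv)
{
	unsigned long ns;
	FILE *f;

	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%lu", &ns) == 1)
		printf("current granularity: %lu nsecs\n", ns);
	fclose(f);

	if (argc > 1) {
		/* new value, in nanoseconds */
		f = fopen(path, "w");
		if (!f) {
			perror(path);
			return 1;
		}
		fprintf(f, "%s\n", argv[1]);
		fclose(f);
		printf("granularity set to %s nsecs\n", argv[1]);
	}
	return 0;
}

Built with "cc -o granularity granularity.c", running "./granularity 4000000" as root would request a 4 msec preemption granularity; reading the file back shows the value actually in effect.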