From: Peter Williams

Problem:

On systems with more than 2 CPUs it is possible for a single task with a high smpnice load weight to suppress load balancing on all CPUs other than the one it is running on, if it is the only runnable task on its CPU.  E.g. consider a 4-way system (a simple SMP system with no HT or multi-core) where a high priority task (nice -20) is running on P0 and two normal priority tasks are running on P1.  The load balancing code with smpnice will never detect an imbalance and hence will never move one of the normal priority tasks from P1 to the idle CPUs P2 or P3, because P0 is always identified as the busiest CPU even though it has no tasks that can be moved.

Solution:

Make sure that only CPUs with tasks that can be moved get selected as the busiest queue.  This involves ensuring that find_busiest_group() only considers groups that have at least one CPU with more than one running task as candidates for the busiest group, and that find_busiest_queue() only considers CPUs with more than one running task as candidates for the busiest run queue.  One effect of this is that, when there are no tasks that can be moved, load balancing is abandoned earlier in the sequence than it would be without this patch (i.e. before the double run queue locks are taken prior to calling move_tasks(), rather than inside move_tasks() itself).

However, it is undesirable for HT/MC packages to have more than one of their CPUs busy while other packages have all of their CPUs idle.  Fixing that involves moving the only running task (i.e. the one actually on the CPU) off to another CPU, and is achieved by using active_load_balance() and relying on the fact that, when it starts, the queue's migration thread will preempt the sole running task and therefore make it movable.  The migration thread then moves it to an idle package.  Unfortunately, the mechanism for setting a run queue's active_balance flag is buried deep inside load_balance() and relies heavily on find_busiest_group() and find_busiest_queue() reporting success even if the busiest queue has only one running task.  To support this requirement, the solution has been modified so that queues with only one task will still be found (if there are none available with more than one task) provided the value of idle passed to find_busiest_group() and find_busiest_queue() is not NEWLY_IDLE, which it never is when they are called from load_balance().  This suboptimal modification should be removed when a proper implementation of the HT/MC special balancing requirements is available.

PS: This doesn't take into account tasks that can't be moved because they are pinned to a particular CPU.  At this stage, I don't think it's worth the effort to make the changes that would enable this.
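[Not part of the patch: a standalone user-space sketch of the 4-way scenario above.  The load weights are made up for illustration; what matters is only that the single nice -20 task on P0 outweighs the two nice 0 tasks on P1 combined.  Picking the busiest CPU by raw weighted load alone selects P0, which has nothing movable; restricting the choice to CPUs with nr_running > 1 selects P1.]

/*
 * Illustrative only -- not kernel code, weights are invented.
 */
#include <stdio.h>

struct cpu {
	unsigned long weighted_load;	/* sum of the tasks' load weights */
	unsigned int nr_running;	/* number of runnable tasks */
};

/* Old behaviour: busiest CPU is simply the one with the highest load. */
static int busiest_by_load(const struct cpu *cpus, int n)
{
	unsigned long max_load = 0;
	int i, busiest = -1;

	for (i = 0; i < n; i++)
		if (cpus[i].weighted_load > max_load) {
			max_load = cpus[i].weighted_load;
			busiest = i;
		}
	return busiest;
}

/* New behaviour: only CPUs with a movable task (nr_running > 1) qualify. */
static int busiest_with_movable_tasks(const struct cpu *cpus, int n)
{
	unsigned long max_load = 0;
	int i, busiest = -1;

	for (i = 0; i < n; i++)
		if (cpus[i].nr_running > 1 && cpus[i].weighted_load > max_load) {
			max_load = cpus[i].weighted_load;
			busiest = i;
		}
	return busiest;
}

int main(void)
{
	struct cpu cpus[4] = {
		{ 4096, 1 },	/* P0: one nice -20 task */
		{ 2048, 2 },	/* P1: two nice 0 tasks */
		{ 0, 0 },	/* P2: idle */
		{ 0, 0 },	/* P3: idle */
	};

	printf("by raw load:        P%d\n", busiest_by_load(cpus, 4));
	printf("movable tasks only: P%d\n", busiest_with_movable_tasks(cpus, 4));
	return 0;
}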
Signed-off-by: Peter Williams
Cc: "Siddha, Suresh B"
Cc: Con Kolivas
Cc: Nick Piggin
Cc: Ingo Molnar
Cc: "Chen, Kenneth W"
Signed-off-by: Andrew Morton
---

 kernel/sched.c |   35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff -puN kernel/sched.c~sched-prevent-high-load-weight-tasks-suppressing-balancing kernel/sched.c
--- devel/kernel/sched.c~sched-prevent-high-load-weight-tasks-suppressing-balancing	2006-06-09 15:22:29.000000000 -0700
+++ devel-akpm/kernel/sched.c	2006-06-09 15:22:29.000000000 -0700
@@ -2084,6 +2084,7 @@ find_busiest_group(struct sched_domain *
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
 	unsigned long this_load_per_task, this_nr_running;
+	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -2101,6 +2102,7 @@ find_busiest_group(struct sched_domain *
 		int local_group;
 		int i;
 		unsigned long sum_nr_running, sum_weighted_load;
+		unsigned int nr_loaded_cpus = 0; /* where nr_running > 1 */
 
 		local_group = cpu_isset(this_cpu, group->cpumask);
 
@@ -2121,6 +2123,8 @@ find_busiest_group(struct sched_domain *
 
 			avg_load += load;
 			sum_nr_running += rq->nr_running;
+			if (rq->nr_running > 1)
+				++nr_loaded_cpus;
 			sum_weighted_load += rq->raw_weighted_load;
 		}
 
@@ -2135,7 +2139,15 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (avg_load > max_load) {
+		} else if (nr_loaded_cpus) {
+			if (avg_load > max_load || !busiest_has_loaded_cpus) {
+				max_load = avg_load;
+				busiest = group;
+				busiest_nr_running = sum_nr_running;
+				busiest_load_per_task = sum_weighted_load;
+				busiest_has_loaded_cpus = 1;
+			}
+		} else if (!busiest_has_loaded_cpus && avg_load > max_load) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2144,7 +2156,7 @@ find_busiest_group(struct sched_domain *
 		group = group->next;
 	} while (group != sd->groups);
 
-	if (!busiest || this_load >= max_load || busiest_nr_running <= 1)
+	if (!busiest || this_load >= max_load || busiest_nr_running == 0)
 		goto out_balanced;
 
 	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
@@ -2246,16 +2258,23 @@ out_balanced:
 static runqueue_t *find_busiest_queue(struct sched_group *group,
 	enum idle_type idle)
 {
-	unsigned long load, max_load = 0;
-	runqueue_t *busiest = NULL;
+	unsigned long max_load = 0;
+	runqueue_t *busiest = NULL, *rqi;
+	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
-		load = weighted_cpuload(i);
+		rqi = cpu_rq(i);
 
-		if (load > max_load) {
-			max_load = load;
-			busiest = cpu_rq(i);
+		if (rqi->nr_running > 1) {
+			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
+				max_load = rqi->raw_weighted_load;
+				busiest = rqi;
+				busiest_is_loaded = 1;
+			}
+		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
+			max_load = rqi->raw_weighted_load;
+			busiest = rqi;
 		}
 	}
 
_
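[Again not part of the patch: a user-space rendering of the queue-selection rule used in the new find_busiest_queue() above, showing the NEWLY_IDLE distinction.  With NEWLY_IDLE, a queue whose only task is the one running on the CPU is never returned; for the other idle types it is accepted as a last resort so that load_balance() can still trigger active_load_balance().  Types, names and values are simplified for the sketch.]

#include <stdio.h>

enum idle_type { NOT_IDLE, NEWLY_IDLE };	/* reduced set, sketch only */

struct rq {
	unsigned long raw_weighted_load;
	unsigned int nr_running;
};

/*
 * Same selection rule as the patched find_busiest_queue(): prefer run
 * queues with more than one task; accept a single-task queue only when
 * idle != NEWLY_IDLE and no multi-task queue has been found.
 */
static struct rq *pick_busiest(struct rq *rqs, int n, enum idle_type idle)
{
	unsigned long max_load = 0;
	struct rq *busiest = NULL;
	int busiest_is_loaded = (idle == NEWLY_IDLE);
	int i;

	for (i = 0; i < n; i++) {
		struct rq *rqi = &rqs[i];

		if (rqi->nr_running > 1) {
			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
				max_load = rqi->raw_weighted_load;
				busiest = rqi;
				busiest_is_loaded = 1;
			}
		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
			max_load = rqi->raw_weighted_load;
			busiest = rqi;
		}
	}
	return busiest;
}

int main(void)
{
	/* Only single-task queues, e.g. one busy sibling in an HT/MC package. */
	struct rq rqs[2] = {
		{ 2048, 1 },	/* lone running task */
		{ 0, 0 },	/* idle sibling */
	};
	struct rq *b;

	b = pick_busiest(rqs, 2, NEWLY_IDLE);
	printf("NEWLY_IDLE: %s\n", b ? "queue found" : "nothing movable, give up");
	b = pick_busiest(rqs, 2, NOT_IDLE);
	printf("NOT_IDLE:   %s\n",
	       b ? "single-task queue accepted (active balance case)" : "nothing");
	return 0;
}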