From: Peter Williams

Problem:

On systems with more than 2 CPUs it is possible for a single task with a high smpnice load weight to suppress load balancing on all CPUs other than the one it is running on, if it is the only runnable task on its CPU.  E.g. consider a 4-way system (a simple SMP system with no HT or multi-core) where a high priority task (nice -20) is running on P0 and two normal priority tasks are running on P1.  The load balancing code with smpnice will never detect an imbalance and hence will never move one of the normal priority tasks from P1 to the idle CPUs P2 or P3, because P0 is always identified as the busiest CPU even though it has no tasks that can be moved.

Solution:

Make sure that only CPUs with tasks that can be moved get selected as the busiest queue.  This involves ensuring that find_busiest_group() only considers groups that have at least one CPU with more than one running task as candidates for the busiest group, and that find_busiest_queue() only considers CPUs with more than one running task as candidates for the busiest run queue.  One effect of this is that, when there are no tasks that can be moved, load balancing is abandoned earlier in the sequence than it would be without this patch (i.e. before the double run queue locks are taken prior to calling move_tasks(), rather than inside move_tasks() itself).

However, it is undesirable for HT/MC packages to have more than one of their CPUs busy while other packages have all of their CPUs idle.  Fixing that involves moving the only running task (i.e. the one actually on the CPU) off to another CPU, and is achieved by using active_load_balance() and relying on the fact that, when it starts, the queue's migration thread will preempt the sole running task and therefore make it movable.  The migration thread then moves it to an idle package.  Unfortunately, the mechanism for setting a run queue's active_balance flag is buried deep inside load_balance() and relies heavily on find_busiest_group() and find_busiest_queue() reporting success even if the busiest queue has only one running task.  To support this requirement, the solution has been modified so that queues with only one task will still be found (if there are none available with more than one task) provided the value of idle passed to find_busiest_group() and find_busiest_queue() is not NEWLY_IDLE, which it never is when they are called from load_balance().  This suboptimal modification should be removed when a proper implementation of the HT/MC special balancing requirements is available.

PS: This doesn't take into account tasks that can't be moved because they are pinned to a particular CPU.  At this stage, I don't think it's worth the effort to make the changes that would enable this.
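[Not part of the patch: a standalone user-space sketch of the 4-way scenario above.  The load weights are made up for illustration; what matters is only that the single nice -20 task on P0 outweighs the two nice 0 tasks on P1 combined.  Picking the busiest CPU by raw weighted load alone selects P0, which has nothing movable; restricting the choice to CPUs with nr_running > 1 selects P1.]

/*
 * Illustrative only -- not kernel code, weights are invented.
 */
#include <stdio.h>

struct cpu {
	unsigned long weighted_load;	/* sum of the tasks' load weights */
	unsigned int nr_running;	/* number of runnable tasks */
};

/* Old behaviour: busiest CPU is simply the one with the highest load. */
static int busiest_by_load(const struct cpu *cpus, int n)
{
	unsigned long max_load = 0;
	int i, busiest = -1;

	for (i = 0; i < n; i++)
		if (cpus[i].weighted_load > max_load) {
			max_load = cpus[i].weighted_load;
			busiest = i;
		}
	return busiest;
}

/* New behaviour: only CPUs with a movable task (nr_running > 1) qualify. */
static int busiest_with_movable_tasks(const struct cpu *cpus, int n)
{
	unsigned long max_load = 0;
	int i, busiest = -1;

	for (i = 0; i < n; i++)
		if (cpus[i].nr_running > 1 && cpus[i].weighted_load > max_load) {
			max_load = cpus[i].weighted_load;
			busiest = i;
		}
	return busiest;
}

int main(void)
{
	struct cpu cpus[4] = {
		{ 4096, 1 },	/* P0: one nice -20 task */
		{ 2048, 2 },	/* P1: two nice 0 tasks */
		{ 0, 0 },	/* P2: idle */
		{ 0, 0 },	/* P3: idle */
	};

	printf("by raw load:        P%d\n", busiest_by_load(cpus, 4));
	printf("movable tasks only: P%d\n", busiest_with_movable_tasks(cpus, 4));
	return 0;
}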
Signed-off-by: Peter Williams
Cc: "Siddha, Suresh B"
Cc: Con Kolivas
Cc: Nick Piggin
Cc: Ingo Molnar
Cc: "Chen, Kenneth W"
Signed-off-by: Andrew Morton
---

 kernel/sched.c |   35 +++++++++++++++++++++++++++--------
 1 file changed, 27 insertions(+), 8 deletions(-)

diff -puN kernel/sched.c~sched-prevent-high-load-weight-tasks-suppressing-balancing kernel/sched.c
--- devel/kernel/sched.c~sched-prevent-high-load-weight-tasks-suppressing-balancing	2006-06-09 15:22:29.000000000 -0700
+++ devel-akpm/kernel/sched.c	2006-06-09 15:22:29.000000000 -0700
@@ -2084,6 +2084,7 @@ find_busiest_group(struct sched_domain *
 	unsigned long max_pull;
 	unsigned long busiest_load_per_task, busiest_nr_running;
 	unsigned long this_load_per_task, this_nr_running;
+	unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
 	int load_idx;
 
 	max_load = this_load = total_load = total_pwr = 0;
@@ -2101,6 +2102,7 @@ find_busiest_group(struct sched_domain *
 		int local_group;
 		int i;
 		unsigned long sum_nr_running, sum_weighted_load;
+		unsigned int nr_loaded_cpus = 0; /* where nr_running > 1 */
 
 		local_group = cpu_isset(this_cpu, group->cpumask);
 
@@ -2121,6 +2123,8 @@ find_busiest_group(struct sched_domain *
 
 			avg_load += load;
 			sum_nr_running += rq->nr_running;
+			if (rq->nr_running > 1)
+				++nr_loaded_cpus;
 			sum_weighted_load += rq->raw_weighted_load;
 		}
 
@@ -2135,7 +2139,15 @@ find_busiest_group(struct sched_domain *
 			this = group;
 			this_nr_running = sum_nr_running;
 			this_load_per_task = sum_weighted_load;
-		} else if (avg_load > max_load) {
+		} else if (nr_loaded_cpus) {
+			if (avg_load > max_load || !busiest_has_loaded_cpus) {
+				max_load = avg_load;
+				busiest = group;
+				busiest_nr_running = sum_nr_running;
+				busiest_load_per_task = sum_weighted_load;
+				busiest_has_loaded_cpus = 1;
+			}
+		} else if (!busiest_has_loaded_cpus && avg_load > max_load) {
 			max_load = avg_load;
 			busiest = group;
 			busiest_nr_running = sum_nr_running;
@@ -2144,7 +2156,7 @@ find_busiest_group(struct sched_domain *
 		group = group->next;
 	} while (group != sd->groups);
 
-	if (!busiest || this_load >= max_load || busiest_nr_running <= 1)
+	if (!busiest || this_load >= max_load || busiest_nr_running == 0)
 		goto out_balanced;
 
 	avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
@@ -2246,16 +2258,23 @@ out_balanced:
 static runqueue_t *find_busiest_queue(struct sched_group *group,
 	enum idle_type idle)
 {
-	unsigned long load, max_load = 0;
-	runqueue_t *busiest = NULL;
+	unsigned long max_load = 0;
+	runqueue_t *busiest = NULL, *rqi;
+	unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
 	int i;
 
 	for_each_cpu_mask(i, group->cpumask) {
-		load = weighted_cpuload(i);
+		rqi = cpu_rq(i);
 
-		if (load > max_load) {
-			max_load = load;
-			busiest = cpu_rq(i);
+		if (rqi->nr_running > 1) {
+			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
+				max_load = rqi->raw_weighted_load;
+				busiest = rqi;
+				busiest_is_loaded = 1;
+			}
+		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
+			max_load = rqi->raw_weighted_load;
+			busiest = rqi;
 		}
 	}
 
_
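[Again not part of the patch: a user-space rendering of the queue-selection rule used in the new find_busiest_queue() above, showing the NEWLY_IDLE distinction.  With NEWLY_IDLE, a queue whose only task is the one running on the CPU is never returned; for the other idle types it is accepted as a last resort so that load_balance() can still trigger active_load_balance().  Types, names and values are simplified for the sketch.]

#include <stdio.h>

enum idle_type { NOT_IDLE, NEWLY_IDLE };	/* reduced set, sketch only */

struct rq {
	unsigned long raw_weighted_load;
	unsigned int nr_running;
};

/*
 * Same selection rule as the patched find_busiest_queue(): prefer run
 * queues with more than one task; accept a single-task queue only when
 * idle != NEWLY_IDLE and no multi-task queue has been found.
 */
static struct rq *pick_busiest(struct rq *rqs, int n, enum idle_type idle)
{
	unsigned long max_load = 0;
	struct rq *busiest = NULL;
	int busiest_is_loaded = (idle == NEWLY_IDLE);
	int i;

	for (i = 0; i < n; i++) {
		struct rq *rqi = &rqs[i];

		if (rqi->nr_running > 1) {
			if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
				max_load = rqi->raw_weighted_load;
				busiest = rqi;
				busiest_is_loaded = 1;
			}
		} else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
			max_load = rqi->raw_weighted_load;
			busiest = rqi;
		}
	}
	return busiest;
}

int main(void)
{
	/* Only single-task queues, e.g. one busy sibling in an HT/MC package. */
	struct rq rqs[2] = {
		{ 2048, 1 },	/* lone running task */
		{ 0, 0 },	/* idle sibling */
	};
	struct rq *b;

	b = pick_busiest(rqs, 2, NEWLY_IDLE);
	printf("NEWLY_IDLE: %s\n", b ? "queue found" : "nothing movable, give up");
	b = pick_busiest(rqs, 2, NOT_IDLE);
	printf("NOT_IDLE:   %s\n",
	       b ? "single-task queue accepted (active balance case)" : "nothing");
	return 0;
}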