From: Peter Williams

Problem:

On a 2-CPU system, if cpu-0 has two high priority tasks and cpu-1 has one
normal priority task, the current code cannot detect the imbalance, because
imbalance will always be < busiest_load_per_task, max_load - this_load will
be < 2 * busiest_load_per_task, and pwr_move will be <= pwr_now.

Solution:

Modify the assessment of small imbalances to take into account the relative
sizes of busiest_load_per_task and this_load_per_task.  This exploits the
fact that if the difference between the loads is greater than
busiest_load_per_task, and busiest_load_per_task is greater than
this_load_per_task, then moving busiest_load_per_task worth of load from
busiest to this will improve the distribution of weighted load.

Note:

This patch makes no change to load balancing in the case where all tasks are
nice==0.

Signed-off-by: Peter Williams
Cc: "Chen, Kenneth W"
Cc: "Siddha, Suresh B"
Acked-by: Ingo Molnar
Signed-off-by: Andrew Morton
---

 kernel/sched.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff -puN kernel/sched.c~sched-improve-smpnice-load-balancing-when-load-per-task kernel/sched.c
--- devel/kernel/sched.c~sched-improve-smpnice-load-balancing-when-load-per-task	2006-05-19 16:01:06.000000000 -0700
+++ devel-akpm/kernel/sched.c	2006-05-19 16:01:06.000000000 -0700
@@ -2237,8 +2237,16 @@ find_busiest_group(struct sched_domain *
 	if (*imbalance < busiest_load_per_task) {
 		unsigned long pwr_now = 0, pwr_move = 0;
 		unsigned long tmp;
+		unsigned int imbn = 2;
 
-		if (max_load - this_load >= busiest_load_per_task*2) {
+		if (this_nr_running) {
+			this_load_per_task /= this_nr_running;
+			if (busiest_load_per_task > this_load_per_task)
+				imbn = 1;
+		} else
+			this_load_per_task = SCHED_LOAD_SCALE;
+
+		if (max_load - this_load >= busiest_load_per_task * imbn) {
 			*imbalance = busiest_load_per_task;
 			return busiest;
 		}
@@ -2251,10 +2259,6 @@ find_busiest_group(struct sched_domain *
 
 		pwr_now += busiest->cpu_power *
 			min(busiest_load_per_task, max_load);
-		if (this_nr_running)
-			this_load_per_task /= this_nr_running;
-		else
-			this_load_per_task = SCHED_LOAD_SCALE;
 		pwr_now += this->cpu_power *
 			min(this_load_per_task, this_load);
 		pwr_now /= SCHED_LOAD_SCALE;
_
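
The scenario in the changelog can be checked with a small userspace sketch.
This is not kernel code: the function names small_imbalance_old() and
small_imbalance_new() are hypothetical, SCHED_LOAD_SCALE is assumed to be 128,
and the load weights used in the note below (512 for a high priority task,
128 for a normal one) are illustrative assumptions, not values taken from the
patch.  Only the threshold test is modelled; the computation of *imbalance
and the pwr_now/pwr_move comparison are left out.

```c
#include <assert.h>

/* Assumed for illustration; the kernel supplies its own definition. */
#define SCHED_LOAD_SCALE 128UL

/* Pre-patch test: the gap must cover two busiest-task loads. */
int small_imbalance_old(unsigned long max_load, unsigned long this_load,
			unsigned long busiest_load_per_task)
{
	/* Guard the unsigned subtraction below. */
	assert(max_load >= this_load);
	return max_load - this_load >= busiest_load_per_task * 2;
}

/*
 * Post-patch test: when the average task on the busiest CPU is heavier
 * than the average task on this CPU, one busiest-task load is enough.
 * this_load_sum is the summed task load on this CPU (the value that
 * this_load_per_task holds before the division in the patch).
 */
int small_imbalance_new(unsigned long max_load, unsigned long this_load,
			unsigned long busiest_load_per_task,
			unsigned long this_load_sum,
			unsigned long this_nr_running)
{
	unsigned int imbn = 2;

	assert(max_load >= this_load);

	/*
	 * The patch also sets this_load_per_task to SCHED_LOAD_SCALE when
	 * this CPU is idle; that fallback only feeds the later pwr_now
	 * computation, so it is omitted from this sketch.
	 */
	if (this_nr_running &&
	    busiest_load_per_task > this_load_sum / this_nr_running)
		imbn = 1;

	return max_load - this_load >= busiest_load_per_task * imbn;
}
```

With the assumed weights, cpu-0 carries 2 * 512 = 1024 and cpu-1 carries 128,
so max_load - this_load = 896.  small_imbalance_old(1024, 128, 512) returns 0
because 896 < 1024, while small_imbalance_new(1024, 128, 512, 128, 1) returns
1 because imbn drops to 1 and 896 >= 512.  When every task has the same
weight, busiest_load_per_task is never greater than this_load_per_task, imbn
stays at 2, and both functions agree, which matches the note that nice==0
workloads are unaffected.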