From: Oleg Nesterov

sched_exit() is called by release_task().  If the task auto-reaps itself,
this call happens a bit too early (the task still uses the CPU after
that).  If the task goes to TASK_ZOMBIE, this call is unpredictably
delayed.

I think it is better to do sched_exit() right before the last schedule().

We can use rcu_read_lock() instead of write_lock(tasklist).  In that case
it is possible that sched_exit() changes ->time_slice/->sleep_avg of the
already dead ->parent, but I think this is tolerable.

Signed-off-by: Oleg Nesterov
Cc: Ingo Molnar
Cc: Nick Piggin
Cc: Con Kolivas
Cc: Peter Williams
Cc: "Paul E. McKenney"
Signed-off-by: Andrew Morton
---

 kernel/exit.c  |    3 ++-
 kernel/sched.c |    6 +++++-
 2 files changed, 7 insertions(+), 2 deletions(-)

diff -puN kernel/exit.c~sched_exit-move-the-callsite-to-do_exit kernel/exit.c
--- a/kernel/exit.c~sched_exit-move-the-callsite-to-do_exit
+++ a/kernel/exit.c
@@ -171,7 +171,6 @@ repeat:
 		zap_leader = (leader->exit_signal == -1);
 	}
 
-	sched_exit(p);
 	write_unlock_irq(&tasklist_lock);
 	spin_unlock(&p->proc_lock);
 	proc_pid_flush(proc_dentry);
@@ -950,6 +949,8 @@ fastcall NORET_TYPE void do_exit(long co
 	if (tsk->splice_pipe)
 		__free_pipe_info(tsk->splice_pipe);
 
+	sched_exit(tsk);
+
 	/* PF_DEAD causes final put_task_struct after we schedule. */
 	preempt_disable();
 	BUG_ON(tsk->flags & PF_DEAD);
diff -puN kernel/sched.c~sched_exit-move-the-callsite-to-do_exit kernel/sched.c
--- a/kernel/sched.c~sched_exit-move-the-callsite-to-do_exit
+++ a/kernel/sched.c
@@ -1606,10 +1606,13 @@ void fastcall wake_up_new_task(task_t *p
  */
 void fastcall sched_exit(task_t *p)
 {
-	task_t *parent = p->parent;
+	task_t *parent;
 	unsigned long flags;
 	runqueue_t *rq;
 
+	rcu_read_lock();
+	parent = p->real_parent;
+
 	/*
 	 * If the child was a (relative-) CPU hog then decrease
 	 * the sleep_avg of the parent as well.
@@ -1625,6 +1628,7 @@ void fastcall sched_exit(task_t *p)
 		(EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
 		(EXIT_WEIGHT + 1);
 	task_rq_unlock(rq, &flags);
+	rcu_read_unlock();
 }
 
 /**
_