Noticed that a slave was hanging on a job that was cancelled yesterday. This is what's in its log:
2013-07-11 13:32:21: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:32:21: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-07-11 13:32:37: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:32:37: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-07-11 13:32:53: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:32:53: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-07-11 13:33:08: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:33:08: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
And the slave app is going full tilt on one core (100% CPU usage).
Are you guys still on a beta version, or have you upgraded to the final release? I seem to recall that issue being fixed before the release.
Still on the beta, will try to update soon. I'll re-report if it happens with the release.
On the release version now, and it's still happening. About 15% of all our slaves are hanging like this. These slaves just loop the following messages (some for days now):
slave#329:
Scheduler Thread - Cancelling task because task “20_1081-1084” could not be found
Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
slave#326:
Scheduler Thread - Cancelling task because task “2_2-2” could not be found
Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
slave#327:
Scheduler Thread - Cancelling task because task “9_1010-1010” could not be found
Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
slave#316:
Scheduler Thread - Cancelling task because task “1_1005-1005” could not be found
Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
It seems to be related to this problem:
viewtopic.php?f=86&t=9899
Sounds like we need to get to the bottom of why a slave would hang like this when running python scripts.
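To help narrow down which slaves are stuck in this loop, here is a rough Python sketch that counts how often the same task ID shows up in the "could not be found" message within a chunk of slave log text. It is not part of Deadline: the function name find_looping_tasks and the threshold of 3 are made up for illustration, and it only assumes the message format quoted earlier in this thread.

```python
import re
from collections import Counter

# Matches the cancellation message quoted in this thread. Both straight and
# curly quotes are accepted, since the forum may have altered the originals.
CANCEL_RE = re.compile(
    r'Cancelling task because task ["“]([^"”]+)["”] could not be found'
)


def find_looping_tasks(log_text, threshold=3):
    """Return {task_id: count} for tasks cancelled at least `threshold` times.

    Hypothetical helper: the threshold is an arbitrary assumption, not a
    value taken from Deadline.
    """
    counts = Counter(CANCEL_RE.findall(log_text))
    return {task: n for task, n in counts.items() if n >= threshold}


# Log excerpt copied from the first post in this thread.
sample = """\
2013-07-11 13:32:21: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:32:37: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:32:53: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
2013-07-11 13:33:08: Scheduler Thread - Cancelling task because task “21_1025-1025” could not be found
"""

print(find_looping_tasks(sample))
```

Running something like this over each slave's log should flag the ones repeating the same cancellation, which may make it easier to see what the affected jobs have in common.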