AWS Thinkbox Discussion Forums

intermittently disappearing tasks from db?

We noticed that slaves sometimes restart their plugins, even though they are rendering the same job and not jumping between jobs.

Looking at the logs, it seems that their task gets cancelled, but then they pick up the exact same task again right away. Very odd, since in reality nothing is cancelled:

2013-10-28 12:52:34:  0: Got task!
2013-10-28 12:52:34:  0: Plugin already loaded: 3dsmax
2013-10-28 12:52:34:  0: Task timeout is 960 seconds (Regular Task Timeout)
2013-10-28 12:52:34:  0: Job already loaded: [NEMO] HB_118_0970_v0060_str_Fixedwind2_cache_flowline_Spray_1
2013-10-28 12:52:34:  0: Plugin rendering frame(s): 1019
2013-10-28 12:52:34:  0: INFO: Render Tasks called
2013-10-28 12:52:34:  0: INFO: STARTED
2013-10-28 12:52:34:  0: INFO: Lightning: Render frame 1019
2013-10-28 12:53:41:  Scheduler Thread - Cancelling task because task "35_1019-1019" could not be found
2013-10-28 12:53:41:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-10-28 12:53:43:  0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2013-10-28 12:53:43:  0: Unloading plugin: 3dsmax

Then, the next task it loads is the exact same task!

2013-10-28 12:53:51:  0: Got task!
2013-10-28 12:53:51:  Constructor: 3dsmax
2013-10-28 12:53:51:  0: Loaded plugin: 3dsmax
2013-10-28 12:53:51:  0: Task timeout is 960 seconds (Regular Task Timeout)
2013-10-28 12:53:51:  0: Loaded job: [NEMO] HB_118_0970_v0060_str_Fixedwind2_cache_flowline_Spray_1 (526eb089c3f6eb24b09fb958)
…misc log entries of 3ds Max starting up…
2013-10-28 12:57:27:  0: Plugin rendering frame(s): 1019
2013-10-28 12:57:27:  0: INFO: Render Tasks called
2013-10-28 12:57:27:  0: INFO: STARTED

Any ideas?

Are there any requeue reports or Stalled Slave error reports for this job in the Job Reports panel? Just wondering if the task was requeued at some point.

Nope, nothing… The same machine popped this error, and then the very next task it picked up (2 seconds later) was the exact same task it thought had disappeared. The task/job has not been requeued.

No errors pop up in the Monitor, so I suspect this is happening all over the place, but it was really visible on this job.

You can see the ‘restarts’ in this list; look at the Startup Time column. Whenever it goes into the minutes, this has happened.

Yeah, looking at other jobs rendered by a single slave like this, it happens roughly every 10th to 15th frame.

Thanks for the additional info. We’ve logged it as a bug, and will try to reproduce. Hopefully we can figure out what’s going on for the next beta release.

I just watched around 40 jobs do this all at once! In the exact same second, they went from Rendering -> Queued. No errors on any of them, just entries like this:

0: INFO: Lightning: Render frame 1051
---- 2013/10/29 22:59 ----
Scheduler Thread - Cancelling task because task “35_1051-1051” could not be found
Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
0: In the process of canceling current task: ignoring exception thrown by PluginLoader
0: Unloading plugin: 3dsmax

A timeout to the database? A rejected DB connection?

This seems to happen periodically, and then it affects a LOT of slaves all at once.
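
To illustrate that suspicion, here is a minimal hypothetical sketch (not Deadline’s actual code; the MongoDB collection and field names are assumptions): if the task-existence lookup times out or the connection is rejected, and that error were swallowed somewhere, the result would look exactly like the task genuinely not being found.

# Hypothetical sketch -- shows how a transient DB error could be
# mistaken for "task not found" if the lookup exception is swallowed.
from pymongo import MongoClient
from pymongo.errors import PyMongoError

client = MongoClient("mongodb://db-host:27017", serverSelectionTimeoutMS=5000)
tasks = client["deadline"]["JobTasks"]  # collection name is an assumption

def task_still_exists(job_id, task_id):
    try:
        doc = tasks.find_one({"JobID": job_id, "TaskID": task_id})
    except PyMongoError:
        # Treating a timeout or rejected connection as "no document" would make
        # a transient outage indistinguishable from a deleted task, and the
        # render would be cancelled for many slaves at once.
        return False
    return doc is not None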

We’re going to be adding additional debug logging to this check for beta 10. These are the reasons why Deadline would cancel the task in this case:

  1. The job has been deleted.
  2. The task has been deleted.
  3. The properties for the task in the database don't match the slave's current task (i.e., the task state has changed, the slave that's rendering the task is different, or the task's frame list has changed).

Now, when the task is canceled, we’ll know which of the 3 possibilities above triggered it, and hopefully that will help shed some light on the problem.
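
For concreteness, a rough sketch of what a three-way check with per-reason logging might look like (names and structure are hypothetical, not the actual Deadline source):

# Hypothetical sketch of the three-way cancellation check described above.
# db_job / db_task are the documents re-read from the database;
# current_task is the task the slave is rendering in memory.
def should_cancel_task(db_job, db_task, current_task, slave_name, log):
    if db_job is None:
        log("Cancelling: the job has been deleted")
        return True
    if db_task is None:
        log("Cancelling: the task has been deleted")
        return True
    if (db_task.state != current_task.state
            or db_task.slave != slave_name
            or db_task.frames != current_task.frames):
        log("Cancelling: task properties changed externally "
            "(state / assigned slave / frame list no longer match)")
        return True
    return False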

Cheers,

  • Ryan

Cool, hopefully that helps narrow it down.

I know you changed how the housecleaning handles this in beta 10, and we are eager to test it with beta 11, but here is some additional information in the meantime:

  • This issue is causing our sims to jump from machine to machine and to get randomly requeued without it being logged in the Monitor, so artists are quite frustrated.
  • What I found is that even if a slave is still rendering the same task, Pulse still triggers a requeue.

See this slave log:
2013-11-13 20:08:00: Scheduler Thread - Render Thread 0 completed its task
2013-11-13 20:08:00: Scheduler Thread - Seconds before next job scan: 2
2013-11-13 20:08:02: Scheduler - Performing Job scan on Primary Pools.
2013-11-13 20:08:02: Scheduler - The 52843aa180eca30d8051fe79 limit is maxed out.
2013-11-13 20:08:02: Scheduler - Successfully dequeued 1 task(s). Returning.
2013-11-13 20:08:02: Scheduler - Returning limit stubs not in use.
2013-11-13 20:08:02: Scheduler Thread - Synchronizing job auxiliary files from \inferno2\deadline\repository6\jobs\52844a6580eca303bc9b976c
2013-11-13 20:08:02: Scheduler Thread - All job files are already synchronized
2013-11-13 20:08:03: 0: Got task!
2013-11-13 20:08:03: 0: Plugin already loaded: 3dsmax
2013-11-13 20:08:03: 0: Task timeout is 24000 seconds (Regular Task Timeout)
2013-11-13 20:08:03: 0: Job already loaded: [VULCAN] CE_100_0180_v0055_tya_sizeadj_cache_flowline_Meteor_1
2013-11-13 20:08:03: 0: Plugin rendering frame(s): 1025-1034
2013-11-13 20:08:04: 0: INFO: Render Tasks called
2013-11-13 20:08:04: 0: INFO: STARTED
2013-11-13 20:08:04: 0: INFO: Lightning: Render frame 1025

2013-11-13 20:11:43: 0: INFO: Lightning: Render frame 1032
2013-11-13 20:11:58: Scheduler Thread - Cancelling task because task “1_1025-1034” could not be found
2013-11-13 20:11:58: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-11-13 20:11:59: 0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2013-11-13 20:11:59: 0: Unloading plugin: 3dsmax

As you can see, it's rendering the same task from 20:08 to 20:11. In the Pulse log, it says that very same task (task 1: 1025-1034) is orphaned, and that the slave is no longer rendering it:

2013-11-13 20:11:54: Requeuing orphaned task ‘1’ for LAPRO0415: the task is in the rendering state, but the slave is no longer rendering this task

It seems that the house cleaning starts at around 20:05:

2013-11-13 20:05:05: Executable: “/opt/Thinkbox/Deadline6/bin/deadlinecommand”
2013-11-13 20:05:05: Argument: -DoHouseCleaning 10 True
2013-11-13 20:05:05: Startup Directory: “/opt/Thinkbox/Deadline6/bin”
2013-11-13 20:05:05: Process Priority: BelowNormal
2013-11-13 20:05:05: Process Affinity: default
2013-11-13 20:05:05: Process is now running

Maybe the log message is misleading? As in, LAPRO0415 IS rendering that task, but it's not the same task it was rendering when the housecleaning started? Dunno.
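
Purely as speculation, here is a sketch of the suspected race, using hypothetical names: the housecleaning snapshots slave state at 20:05, the slave dequeues task 1 at 20:08, and at 20:11 the orphan check runs against the stale snapshot rather than the live state.

# Speculative sketch of the suspected race; names and structure are hypothetical.
# 20:05 - housecleaning snapshots slave state; LAPRO0415 is not yet on task 1
snapshot = {"LAPRO0415": {"rendering_task": None}}

# 20:08 - the slave dequeues task 1 (frames 1025-1034) and starts rendering
live_state = {"LAPRO0415": {"rendering_task": "1_1025-1034"}}

def find_orphaned_tasks(rendering_tasks, slave_states):
    # Flag tasks in the Rendering state whose slave (per slave_states) is not on them.
    orphans = []
    for task_id, slave in rendering_tasks:
        if slave_states.get(slave, {}).get("rendering_task") != task_id:
            orphans.append(task_id)
    return orphans

# 20:11 - against the stale 20:05 snapshot, task "1_1025-1034" looks orphaned
# even though the slave is actively rendering it, so it would be requeued.
print(find_orphaned_tasks([("1_1025-1034", "LAPRO0415")], snapshot))    # ['1_1025-1034']
print(find_orphaned_tasks([("1_1025-1034", "LAPRO0415")], live_state))  # []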
