
slave hanging

Log last lines:

2013-03-04 08:56:06: 0: STDOUT: // generating Maya nodes…
2013-03-04 08:56:21: Scheduler Thread - Cancelling task because task “2_4415-4429” could not be found
2013-03-04 09:53:28: Slave - Updating slave settings

Been idle since. The slave is minimized on the Windows taskbar and can't be maximized; it's using 100% of a single core. The mayabatch process is still in the task manager using 0% CPU. If I kill the slave process, the mayabatch process goes to 100% usage on one core.

Similar node:

Last log lines:
2013-03-04 08:30:24: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : pop widget QMainWindow MainWindow
2013-03-04 08:30:24: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : new topwidget None
2013-03-04 09:50:25: Slave - Updating slave settings
2013-03-04 11:51:18: Scheduler Thread - Cancelling task because task “11_1166-1180” could not be found

Monitor reports: stalled
Deadline slave process: using 100% of a core, gui hanging
mayabatch process: 0% cpu usage

Killing deadline slave resumes the mayabatch which starts going at 100%

Another:

Last log lines:
2013-03-02 15:29:42: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : pop widget QDialog Dialog
2013-03-02 15:29:42: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : new topwidget None
2013-03-02 15:29:42: 0: STDOUT: Sat Mar 02 15:29:42 2013 (+312ms) : debug : Registering Node Class : “MocapAnimationExporter”
2013-03-02 15:29:45: Scheduler Thread - Cancelling task because task “0_1001-1015” could not be found

Slave: lapro0230
Monitor reports: stalled
Deadline slave process: using 100% of a core, gui hanging
mayabatch process: 0% cpu usage

Killing deadline slave resumes the mayabatch which starts going at 100%

Thanks for these logs. This was one of the cases from your previous reports that we couldn’t reproduce. I’m guessing what’s happening is that a job gets deleted or requeued, and the slave working on it detects this, but then craps out when trying to deal with it.

It would actually be good news if this is currently the only problem you guys are seeing after upgrading to beta 13, because that means some of the issues we fixed actually addressed some of your problems. We will do some digging on this one and hopefully deal with it for beta 14.

Cheers,

  • Ryan

Hey, just out of curiosity, do you only see this problem happen for MayaBatch jobs? I know you guys run Nuke jobs too, and it would be interesting to know if you ever see this problem for Nuke jobs too.

If you only see them for MayaBatch jobs, I just want to confirm if all the jobs they happen to are script jobs, or if they also happen with regular Maya render jobs (assuming you’re doing both).

Thanks!

  • Ryan

So far I have only seen it with mayabatch, but that's basically 95% of the usage currently on Deadline (we are slowly rolling out Nuke right now, with just 3 users first).

We have two types of Maya jobs: animation publishes and software renders. Both are mayabatch jobs.

Are either of those script-based jobs (using the ScriptJob=True setting in the plugin info file)? I’m just wondering if the problem is specific to script-based jobs.
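(For reference, the plugin info file is just the key=value text file submitted alongside the job, so a script-based MayaBatch submission would contain something roughly like the lines below. ScriptJob is the setting I'm asking about; the other keys and file names here are purely illustrative and may differ in your setup.)

Version=2013
Build=64bit
SceneFile=//server/projects/shot/animation_scene.mb
ScriptJob=True
ScriptFilename=publish_animation.py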

The weird thing about this error is that in these cases (as far as I can trace back), even though the log says something like this:

2013-03-01 14:27:26: Scheduler Thread - Cancelling task because task “2_1031-1045” could not be found

Based on the job history and task reports, no requeueing or cancelling took place. For example, there is this task (slave log attached)…
In the monitor, I see one error for a different task altogether. The stalled task had no error report in the job(!).

Attached are 3 logs:

stalled log from slave 0291 (error report for job)
successful log for same task from another slave (no mention of previous stall error in job history for this task) (Job_2013-03-01_15-40-56_47f811715c77b90f80a98ff6.txt)
only error on the job, completely different task (Job_2013-03-01_14-29-17_51312b5fb2469c0f9cef1983.txt)
Job_2013-03-01_15-40-56_47f811715c77b90f80a98ff6.txt (104 KB)
Job_2013-03-01_14-29-17_51312b5fb2469c0f9cef1983.txt (4.63 KB)
deadlineslave_LAPRO0291-LAPRO0291-2013-03-01-0001.log (50.5 KB)

Actually, scratch that. In this case it WAS requeued:

2013/03/01 14:27:28 ScanlineVfx_user LAPRO3015 (LAPRO3015\ScanlineVfx_user): Requeuing Tasks: [4, 3, 2]

hrmpf

Another possibly related issue: a slave that timed out:

(only contents of the logs from the 2nd)
2013-03-02 15:15:09: BEGIN - LAPRO0304\ScanlineVFX
2013-03-02 15:15:09: 0: Slave timed out – canceling current task…

(only contents of the log from the 4th):
2013-03-04 09:55:35: BEGIN - LAPRO0304\ScanlineVFX
2013-03-04 09:55:35: Slave - Updating slave settings
2013-03-04 11:55:59: Scheduler Thread - Cancelling task because task “0_1087-1101” could not be found

no log from the 3rd

The job start log contains no error messages; last lines:
2013-03-01 19:18:52: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : pop layout QVBoxLayout verticalLayout
2013-03-01 19:18:52: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : pop widget QWidget centralwidget
2013-03-01 19:18:52: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : new topwidget MainWindow
2013-03-01 19:18:52: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : pop widget QMainWindow MainWindow
2013-03-01 19:18:52: 0: STDOUT: PyQt4.uic.uiparser : DEBUG : new topwidget None
2013-03-01 19:18:52: 0: STDOUT: Fri Mar 01 19:18:52 2013 (+141ms) : debug : Registering Node Class : “ImagesExporter”

From our logs, the last lines are (it doesn't seem like the job even started):
Fri Mar 01 19:19:01 2013 (+391ms) : Loading plugin: poseDeformer.mll
Fri Mar 01 19:19:01 2013 (+125ms) : Loading plugin: poseReader.mll
Fri Mar 01 19:19:01 2013 (+108ms) : Loading plugin: rigBlueprint.py
Fri Mar 01 19:19:01 2013 (+141ms) : Loading plugin: rigConfig.py
Fri Mar 01 19:19:02 2013 (+108ms) : Loading plugin: rigRoot.py
Fri Mar 01 19:19:02 2013 (+93ms) : Loading plugin: savePc2Cmd.mll
Fri Mar 01 19:19:02 2013 (+125ms) : Loading plugin: cacheFilePc2.mll
Fri Mar 01 19:19:02 2013 (+187ms) : Loading plugin: relaxNode.mll
Fri Mar 01 19:19:02 2013 (+484ms) : Loading plugin: postBlendShape.mll
Fri Mar 01 19:19:03 2013 (+155ms) : Loading plugin: Mayatomr
Fri Mar 01 19:19:17 2013 (+14141ms) : Scanline Pipeline Maya Setup Complete!

Slave is: lapro0304
Monitor reports: stalled
Slave process: running at 100% on 1 core
mayabatch: also active but idle. Killing the slave makes mayabatch go to 100% of 1 core.

Ok, here is one where the slave says:

2013-03-04 15:44:29: Scheduler Thread - Cancelling task because task “4_1061-1075” could not be found
2013-03-04 15:44:29: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-03-04 15:44:34: Scheduler Thread - Cancelling task because task “4_1061-1075” could not be found
2013-03-04 15:44:34: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-03-04 15:44:43: Scheduler Thread - Cancelling task because task “4_1061-1075” could not be found
2013-03-04 15:44:43: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-03-04 15:44:51: Scheduler Thread - Cancelling task because task “4_1061-1075” could not be found
2013-03-04 15:44:51: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-03-04 15:44:59: Scheduler Thread - Cancelling task because task “4_1061-1075” could not be found
2013-03-04 15:44:59: Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.

It repeats every couple of seconds, but the task was never requeued or anything. In the monitor, the task reports no errors, just a single success from another slave.
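Purely as an illustration (this is obviously not Deadline's actual code), the pattern above looks like a polling loop that re-issues the cancel until the render thread acknowledges it, and if mayabatch is wedged that acknowledgment never comes. A minimal Python sketch of that failure mode:

# Illustrative sketch only -- not Deadline's real implementation.
# The scheduler thread notices its task is gone from the database and
# re-issues the cancel every polling cycle, because the render thread
# (stuck inside mayabatch) never acknowledges the cancellation.
import time

render_thread_acknowledged = False      # pretend the render thread is wedged

def task_exists(task_id):
    return False                        # task was requeued/deleted externally

def cancel_current_task(task_id):
    print('Scheduler Thread - Cancelling task because task "%s" could not be found' % task_id)
    print('Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.')

active_task = "4_1061-1075"
for _ in range(5):                      # bounded here; the real loop just keeps going
    if not task_exists(active_task):
        cancel_current_task(active_task)
        if render_thread_acknowledged:  # never true while the plugin is hung,
            active_task = None          # so the cancel is re-issued forever
            break
    time.sleep(1)                       # the slave polls every few seconds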
Slave logs attached
continuous_cancel.zip (91.1 KB)

Yep, the anim publishes are script jobs, the scanline renders are not. I'll try to keep an eye out on whether both fail like this, or only the script jobs.

Another hanging slave.

This one seems to be stuck in a cycle of ‘updating slave settings’.

Logs attached
deadlinelauncher-LAPRO0240-2013-03-05-0000.zip (83.2 KB)

Another, same thing, logs attached.
deadlinelauncher-LAPRO0218-2013-03-05-0000.zip (949 KB)

Happened with a Nuke job as well:

The slave is reporting rendering, but it's not doing anything. The GUI is up and interactive, but other than a loop of updating slave settings, nothing is happening:
Last lines of the log:

2013-03-05 11:46:37: 0: INFO: Popup Handling Enabled: True
2013-03-05 11:46:37: 0: INFO: Using Process Tree: True
2013-03-05 11:46:37: 0: INFO: Hiding DOS Window: True
2013-03-05 11:46:37: 0: INFO: Creating New Console: False
2013-03-05 11:46:37: 0: INFO: Render Executable: “C:\Program Files\Nuke6.3v7\Nuke6.3.exe”
2013-03-05 11:46:37: 0: INFO: Render Argument: -V -x “C:\Documents and Settings\ScanlineVFX\Local Settings\Temp\CB_036_0510_comp_v0182_rbr_thread0.nk” 981-984
2013-03-05 11:46:37: 0: INFO: Startup Directory: “C:\Program Files\Nuke6.3v7”
2013-03-05 11:46:37: 0: INFO: Process Priority: BelowNormal
2013-03-05 11:46:37: 0: INFO: Process Affinity: default
2013-03-05 11:46:37: 0: INFO: Process is now running
2013-03-05 11:46:38: 0: STDOUT: Nuke 6.3v7, 64 bit, built Mar 13 2012.
2013-03-05 11:46:38: 0: STDOUT: Copyright © 2012 The Foundry Visionmongers Ltd. All Rights Reserved.
2013-03-05 11:46:39: 0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2013-03-05 11:46:39: 0: Unloading plugin: Nuke
2013-03-05 11:51:08: Slave - Updating slave settings
2013-03-05 11:56:09: Slave - Updating slave settings
2013-03-05 12:01:10: Slave - Updating slave settings
2013-03-05 12:06:12: Slave - Updating slave settings
2013-03-05 12:11:13: Slave - Updating slave settings
2013-03-05 12:16:19: Slave - Updating slave settings
2013-03-05 12:21:25: Slave - Updating slave settings

Thanks for this. At least this rules out the possibility of it being plugin-specific. As mentioned here, we'll be trying out something new with the slave in beta 14:
viewtopic.php?f=86&t=9163&start=10#p39160

If you still see this problem with beta 14, at the very least we’ve ruled out the slave UI log as the cause.

Found a slave that crashed out (not hanging, the slave process was gone altogether) even though it was not rendering anything (we deactivated a large chunk of the Deadline farm by removing all pools/groups).

Logs attached.

Lines of interest:

2013-03-05 23:53:13: Scheduler - Job chooser found no jobs.
2013-03-05 23:53:13: Starting between task wait - seconds: 2
2013-03-05 23:53:37: Slave - An error occurred while updating the slave’s info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
2013-03-05 23:53:37: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. (FranticX.Database.DocumentException)
2013-03-05 23:53:37: Scheduler - Error occured while scanning for jobs: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
2013-03-05 23:53:37: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. (FranticX.Database.DocumentException)
2013-03-05 23:53:38: Starting between task wait - seconds: 1
2013-03-05 23:54:00: Scheduler - Error occured while scanning for jobs: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
2013-03-05 23:54:00: Starting between task wait - seconds: 2
2013-03-05 23:54:05: Scheduler - Job chooser found no jobs.

The slave reported stalled 2.7 hours ago according to the monitor (so at around 7 am on 3/6).
deadlinelauncher-LAPRO0250-2013-03-05-0002.zip (128 KB)
