Slave Auto canceling task

pingus43 · December 8, 2014, 8:50am

Yep !

After a week end trying to rendering some basic jobs, i’m still into troubles.

Some task worked pretty fine, then after a while, all slaves goes into “auto cancellling mode” and no image goes out.
I checked somes log files and the basic path is like that :

—RENDERING FINE…
2014-12-08 08:58:27: 0: INFO: Prepass 3 of 3… [00:00:04.6] [00:00:25.2 est]
2014-12-08 08:58:27: 0: INFO: Prepass 3 of 3… [00:00:04.7] [00:00:25.1 est]
2014-12-08 08:58:28: 0: INFO: Prepass 3 of 3… [00:00:04.8] [00:00:25.0 est]
2014-12-08 08:58:28: 0: INFO: Prepass 3 of 3…: done [00:00:05.0]
2014-12-08 08:59:00: Scheduler Thread - Task “16_14-14” could not be found because task has been modified:
2014-12-08 08:59:00: current status = Rendering, new status = Rendering
2014-12-08 08:59:00: current slave = NODE106, new slave = NODE105
2014-12-08 08:59:00: current frames = 14-14, new frames = 14-14
2014-12-08 08:59:00: Scheduler Thread - Cancelling task…
2014-12-08 08:59:02: 0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2014-12-08 08:59:02: 0: Unloading plugin: 3dsmax
2014-12-08 08:59:02: Scheduler Thread - In the process of canceling current tasks: ignoring exception thrown by render thread 0
2014-12-08 08:59:03: Scheduler Thread - Seconds before next job scan: 2
2014-12-08 08:59:05: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Priority, First-in First-out
2014-12-08 08:59:05: Scheduler - Successfully dequeued 1 task(s). Returning.
2014-12-08 08:59:05: 0: Shutdown
2014-12-08 08:59:05: 0: Exited ThreadMain(), cleaning up…
2014-12-08 08:59:06: Scheduler - Returning limit stubs not in use.
2014-12-08 08:59:06: Scheduler - returning 5481e3de6466b54914bd3a11
2014-12-08 08:59:06: Scheduler - Checking with Pulse if throttling is necessary…
2014-12-08 08:59:06: Scheduler - Throttling is necessary, waiting to start job.
2014-12-08 08:59:06: Scheduler Thread - Waiting for 20 seconds to ask Pulse for permission to start job.
2014-12-08 08:59:26: Scheduler - Checking with Pulse for permission to start job…
2014-12-08 08:59:26: Scheduler - Pulse returned wait message. Continuing to wait.

Pulse log corresponding :

2014-12-08 08:59:12: Process exit code: 0
2014-12-08 08:59:26: Update timeout has been set to 1800 seconds
2014-12-08 08:59:26: Stdout Handling Enabled: False
2014-12-08 08:59:26: Popup Handling Enabled: False
2014-12-08 08:59:26: Using Process Tree: True
2014-12-08 08:59:26: Hiding DOS Window: True
2014-12-08 08:59:26: Creating New Console: False
2014-12-08 08:59:26: Running as user: Administrateur
2014-12-08 08:59:26: Executable: “C:\Program Files\Thinkbox\Deadline7\bin\deadlinecommand.exe”
2014-12-08 08:59:26: Argument: -DoHouseCleaning False False
2014-12-08 08:59:26: Startup Directory: “C:\Program Files\Thinkbox\Deadline7\bin”
2014-12-08 08:59:26: Process Priority: BelowNormal
2014-12-08 08:59:26: Process Affinity: default
2014-12-08 08:59:26: Process is now running
2014-12-08 08:59:28: Update timeout has been set to 1800 seconds
2014-12-08 08:59:28: Stdout Handling Enabled: False
2014-12-08 08:59:28: Popup Handling Enabled: False
2014-12-08 08:59:28: Using Process Tree: True
2014-12-08 08:59:28: Hiding DOS Window: True
2014-12-08 08:59:28: Creating New Console: False
2014-12-08 08:59:28: Running as user: Administrateur
2014-12-08 08:59:28: Executable: “C:\Program Files\Thinkbox\Deadline7\bin\deadlinecommand.exe”
2014-12-08 08:59:28: Argument: -DoRepositoryRepair False False
2014-12-08 08:59:28: Startup Directory: “C:\Program Files\Thinkbox\Deadline7\bin”
2014-12-08 08:59:28: Process Priority: BelowNormal
2014-12-08 08:59:28: Process Affinity: default
2014-12-08 08:59:28: Process is now running
2014-12-08 08:59:28: Performing house cleaning
2014-12-08 08:59:28: Performing Job Cleanup Scan…
2014-12-08 08:59:28: Job Cleanup Scan - Loading completed jobs
2014-12-08 08:59:28: Job Cleanup Scan - Loaded 30 completed jobs in 234.002 ms
2014-12-08 08:59:28: Job Cleanup Scan - Scanning completed jobs
2014-12-08 08:59:28: Job Cleanup Scan - Deleted 0 and archived 0 completed jobs in 0.000 s
2014-12-08 08:59:28: Job Cleanup Scan - Done.
2014-12-08 08:59:28: Purging Unsubmitted Jobs
2014-12-08 08:59:28: Unsubmitted Job Scan - Loading unsubmitted jobs
2014-12-08 08:59:28: Unsubmitted Job Scan - Loaded 0 unsubmitted jobs in 0.000 s
2014-12-08 08:59:28: Unsubmitted Job Scan - Done.
2014-12-08 08:59:28: Purging Deleted Jobs
2014-12-08 08:59:28: Deleted Job Scan - Loading deleted jobs
2014-12-08 08:59:28: Deleted Job Scan - Loaded 0 deleted jobs in 0.000 s
2014-12-08 08:59:28: Deleted Job Scan - Done.
2014-12-08 08:59:28: Purging Old Job Auxiliary Files
2014-12-08 08:59:28: Auxiliary File Scan - Scanning for auxiliary directories
2014-12-08 08:59:28: Auxiliary File Scan - Found 52 auxiliary directories in 0.000 s
2014-12-08 08:59:28: Auxiliary File Scan - Loading job IDs
2014-12-08 08:59:28: Auxiliary File Scan - Loaded 52 job IDs in 15.600 ms
2014-12-08 08:59:28: Auxiliary File Scan - Purged 0 auxiliary folders in 0.000 s
2014-12-08 08:59:28: Auxiliary File Scan - Done.
2014-12-08 08:59:28: Purging Old Job Reports
2014-12-08 08:59:28: Job Report Scan - Loading job report collections
2014-12-08 08:59:28: Job Report Scan - Found 52 report collections in 31.200 ms
2014-12-08 08:59:28: Job Report Scan - Loading job IDs
2014-12-08 08:59:28: Job Report Scan - Loaded 52 job IDs in 0.000 s
2014-12-08 08:59:28: Job Report Scan - Purged 0 report collections in 0.000 s
2014-12-08 08:59:28: Job Report Scan - Purging old job report files
2014-12-08 08:59:28: Job Report Scan - Purged 0 report files in 15.600 ms
2014-12-08 08:59:28: Job Report Scan - Done.
2014-12-08 08:59:28: Purging Obsolete Slaves
2014-12-08 08:59:28: Obsolete Slave Scan - Skipping because it is disabled in the Repository Options

Tasks are requeue forever, so nothing goes ahead.
Maybe this could be a bad options / timer for pulse, but i ain’t change many options for the moment.

Windows Env.
Repo & Pulse upgraded 0.50 manually, slaves automatic upgraded.
3dsmax, vray and some additionnals plugins (ornatrix, forest…)

rrussell · December 8, 2014, 4:02pm

I think for RC5 we’re going to try reverting back to the way the slaves reported their status to see if that helps. The new way we introduced can result in a 30% reduction in the amount of data the slave is sending to the database, but that’s only in the best case scenario.

We’ve been running some scale testing in the cloud, and we seem to be hitting a state-related bug as well, and we think this could be the potential fix.

We hope to have RC5 out tomorrow.

Cheers,
Ryan

LaszloSebo · December 8, 2014, 5:54pm

Ryan does this affect farms only with throttling on? Or the problem would affect all scenarios?

rrussell · December 8, 2014, 5:57pm

It would apply to all scenarios.

pingus43 · December 12, 2014, 10:04am

Hello !

update 0.51 is munch better !

But i still have some random slaves that stay stuck sending information to the pulse.
Slave log :

2014-12-12 10:21:04: 0: STDOUT: - 10:21:03.389 INFO: End loading assembly
2014-12-12 10:21:04: 0: STDOUT: - 10:21:03.389 INFO: End loading assemblies
2014-12-12 10:21:04: 0: STDOUT: - 10:21:03.389 INFO: Begin registering loaded plugins
2014-12-12 10:21:04: 0: STDOUT: - 10:21:03.560 INFO: End registering loaded plugins
2014-12-12 10:21:19: Scheduler - Notifying Pulse of job start progress…
2014-12-12 10:57:33: Connecting to slave log: NODE113
Success
2014-12-12 10:58:28: 0: An exception occurred: Error: 3dsmax startup: Timed out waiting for 3ds max to start - consider increasing the LoadMaxTimeout in the 3dsmax plugin configuration
2014-12-12 10:58:28: à Deadline.Plugins.ScriptPlugin.StartJob(Job job, String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)
2014-12-12 10:58:28: 0: Unloading plugin: 3dsmax
2014-12-12 10:58:28: Scheduler Thread - Render Thread 0 threw a major error:
2014-12-12 10:58:28: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

In the monitor they are easily trackable because the status is “starting up” even after 10 or 15min (my timeout for 3dsmax plugin is 1K seconds) so i can requeue them manually.
Did you tracked it ?

Thanks !

MikeOwen · December 12, 2014, 10:11am

Hi,
It looks like NODE113 is having difficulty starting 3dsMax. Can you test manually starting 3dsMax on NODE113 under the same user account that Deadline runs as? Do you see any issues which you need to resolve? Is NODE113 running the latest SP for 3dsMax? If there is still an issue, please can you provide the full 3dsMax job log report for this node, which should tell us some more information potentially why 3dsMax is failing to start on NODE 113.

pingus43 · December 12, 2014, 10:24am

Node113 is a totally rendernode, so i can’t launch max on it.
BTW i know it works fine. It rendered some other frames, and if a requeue the bugged frame it could work fine on same render node. That’s a bit random at the moment…
And yes, all my nodes are up to date from version 3DSmax / Vray / deadline and others plugs…

MikeOwen · December 12, 2014, 10:45am

If this is an option for you, you could use the 3dsMax 30-day eval license to startup 3dsMax on this troublesome machine if that would help?
Alternatively, our 3dsMax job log reports are reasonably comprehensive, so if you prefer, feel free to post one of these logs from a machine displaying this issue and we can take a look and see if anything stands out for you.

pingus43 · December 12, 2014, 4:36pm

As i said, this looks like pretty random at the moment, differents frames, different nodes, differents jobs…

Here’s a recent log from a “waiting” slave.
You can see job queued at 17h15, slave still in “waiting to start” at 17h30 when i connect to it.
deadlineslave-NODE214-2014-12-12-0000.log (3.67 MB)

rrussell · December 12, 2014, 6:19pm

In your 3dsmax plugin configuration settings (Tools -> Configure Plugins while in monitor), can you turn on Kill ADSK Comms Center Process option to see if it helps? We’ve seen cases where Max can lock up randomly like this, and enabling this option can sometimes help.

Cheers,
Ryan

pingus43 · December 15, 2014, 12:52pm

I turned on this option, but i still have random waiting slaves