
Slave auto-canceling task

Hi all!

I’ve got a problem with Deadline 7.0.0.47R (Win7).

I’ve got a 3ds Max 2015 job with V-Ray 3; each frame renders in about 15-20 min, but the job takes extremely long to finish.
I’ve just noticed that the Slave regularly exits the job during the diffuse calculation.
Here’s a screenshot of the logger:

Looks like the Slave detects that the job changed, so it requeues the frame. Of course it didn’t change. No errors or job reports are sent.

Any tips?
Thanks!

Many users have reported issues like this with the current batch of v7 release candidates. We’ll be doing an RC4 release tomorrow that should address this issue.

In the meantime, you can try bumping up your stalled slave detection interval, which can help alleviate this problem. The default is 10 minutes, so if you bump it up to something higher than any render would take (like 1000 minutes), it should help. You can find the stalled slave interval setting in the Slave Settings section of the Repository Options in the Monitor.

Sorry for the inconvenience!

Cheers,
Ryan
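
As a rough illustration of what that setting controls (this is not Deadline’s actual code, just a sketch of the timestamp-based idea): a slave gets flagged as stalled once its last status update is older than the configured interval, so raising the interval well above your longest frame time keeps a slave that stops reporting mid-render from being flagged and having its task requeued.

```python
from datetime import datetime, timedelta

# Illustrative sketch only, NOT Deadline's implementation: a slave is treated
# as stalled when its last status update is older than the configured interval.
def is_stalled(last_update: datetime, interval_minutes: int) -> bool:
    return datetime.utcnow() - last_update > timedelta(minutes=interval_minutes)

# A slave that last reported 45 minutes ago (e.g. deep in a long V-Ray pass) is
# fine with the 1000-minute workaround, but would be flagged with the 10-minute
# default and its task requeued.
last_report = datetime.utcnow() - timedelta(minutes=45)
print(is_stalled(last_report, interval_minutes=1000))  # False
print(is_stalled(last_report, interval_minutes=10))    # True
```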

Exactly the problem we had.
I think Thinkbox is on the issue; they’re going through some things to check.

It seems to have gone away when we made sure that all nodes have exactly the same render software on them, i.e. Max 2014 and Max 2015.
If a machine didn’t have the same software, it would error.

Check the Corona animation errors thread:
viewtopic.php?f=205&t=12692&p=56625#p56625
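
If you want to sanity-check the “same software on every node” point before rendering, something like the sketch below can flag nodes missing an install. The node names and install paths are assumptions (default Autodesk locations reached over the C$ admin share), so adjust them for your farm.

```python
import os

# Hypothetical sketch: verify that every render node exposes the same 3ds Max
# installs. Node names and paths are assumptions; requires admin share access.
NODES = ["NODE201", "NODE202", "NODE203"]
MAX_PATHS = {
    "3ds Max 2014": r"Program Files\Autodesk\3ds Max 2014\3dsmax.exe",
    "3ds Max 2015": r"Program Files\Autodesk\3ds Max 2015\3dsmax.exe",
}

for node in NODES:
    missing = [name for name, rel in MAX_PATHS.items()
               if not os.path.exists(rf"\\{node}\C$\{rel}")]
    if missing:
        print(f"{node}: missing {', '.join(missing)}")
    else:
        print(f"{node}: OK")
```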

OK, thanks for the answer.

I’ve increased the stalled timeout to 1000; let’s see if that helps a bit while waiting for the next release.

Thanks!

Just an FYI that Deadline 7 RC4 has been released!
viewtopic.php?f=204&t=12744

Cheers,
Ryan

Yep, thanks!

I’m downloading it.
Should I drop the stalled timeout back down to 10 instead of 1000?

Yup, once you have all your machines upgraded, you should drop it back down to 10.

Just note that if you are running Pulse, make sure that machine gets updated too.

Cheers,
Ryan

Hmm, version 7.0.0.50 doesn’t seem to fix the problem.
My slaves are still ghosts: they show as rendering in the Monitor, but no job is assigned to them. When I connect to them, I get this:

I’ll try bumping the stalled timer back up to 1000…

In fact, since I updated to 7.0.0.50, the slaves don’t work at all. I can’t render anything; they just stay blocked waiting to start.

Could you please confirm, in the “Version” column of your Deadline Monitor, that ALL your machines have been updated to “v7.0.0.50”, and that Slave and Pulse have been restarted on ALL your machines?

Of course I checked it; I just updated about 50 nodes this morning.

I updated the repository, then updated Pulse with the automatic upgrade, then rebooted all my slaves.
They are all detected as 7.0.0.50 in my Monitor.

BTW, I didn’t upgrade ALL my nodes (I have 160+ nodes); the rest are offline or unknown.

I’m just setting up Backburner to get some images out.

Due to the slave state issue, is it possible that any of these offline/unknown slaves are actually running? Also, are all your slaves able to connect to Pulse? You can check the Connected To Pulse column in the slave list to see if this is the case. If not, make sure your Pulse is set to be the primary Pulse, which can be done from the right-click menu in the Pulse list in the Monitor while in super user mode (choose the option to modify the pulse settings). If the slaves can’t connect to pulse, and you have an old slave running, that old slave could be doing housecleaning with old code and causing problems.

This still sounds like the previous problem. Are these slaves updating their state in a regular fashion in the slave list in the Monitor? Maybe try running the client installer on some of these slaves to see if that helps (ie: maybe the auto-upgrade had an issue and not all files were updated properly).

Cheers,
Ryan
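
For the “can the slaves reach Pulse at all” question, a quick reachability test from a render node can rule out basic networking. The host name and port below are assumptions (Pulse’s port is set in your Pulse/Repository settings); a successful TCP connect only proves the machine can reach Pulse, not that the Slave shows Connected To Pulse in the Monitor.

```python
import socket

# Rough sketch: test whether the Pulse host accepts TCP connections from this node.
PULSE_HOST = "pulse-server"   # hypothetical host name; use your Pulse machine
PULSE_PORT = 17060            # assumed port; check your Pulse settings

def can_reach_pulse(host: str, port: int, timeout: float = 5.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach_pulse(PULSE_HOST, PULSE_PORT))
```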

Yes, they are connected to Pulse; the Slave says so, and I can ping Pulse.

I’ve tried installing 7.0.0.50 manually on a node, but it doesn’t seem to fix anything.
I’m trying to render an easy job (<1 min, low RAM), and it’s not starting at all; after 10 min the node goes into stalled status. The log is empty.

Can you send me the full slave log from this node (NODE201)? To get it, select Help -> Explore Log Folder from the slave UI, and then grab the slave log for the current session and upload it as an attachment.

Thanks!
Ryan
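
If you’d rather grab the log from a script than through the Slave UI, a sketch like this finds the newest session log. The log folder path is an assumption about a default Windows install (the same folder that Help -> Explore Log Folder opens); the filename pattern matches the attachments below.

```python
import glob
import os

# Assumed default slave log location on Windows; adjust if your install differs.
LOG_DIR = r"C:\ProgramData\Thinkbox\Deadline7\logs"

# Slave session logs are named like deadlineslave-<HOSTNAME>-<date>-<NNNN>.log.
logs = glob.glob(os.path.join(LOG_DIR, "deadlineslave-*.log"))
if logs:
    latest = max(logs, key=os.path.getmtime)
    print("Most recent slave log:", latest)
else:
    print("No slave logs found in", LOG_DIR)
```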

Here it is.

Log 0002 is the current job: the slave has been running a really easy job for 20 min, and there is still no information in the log.

Log 0001 is the first one after the 7.0.0.50 install. Maybe there is more info in it.
deadlineslave-NODE201-2014-12-04-0001.log (55 KB)
deadlineslave-NODE201-2014-12-04-0002.log (745 Bytes)

Thanks! Just to confirm, do you have Throttling enabled in the Pulse settings in the Repository Options? Also, could you enable Slave Verbose Logging in the Application Logging section of the Repository Options?

I’ve also attached an updated deadline.dll here to try and eliminate a couple of suspicions of what the problem could be. On this machine (NODE201), stop all the Deadline applications (Monitor, Slave, Launcher, etc.), then unzip the attached file to C:\Program Files\Thinkbox\Deadline7\bin and overwrite the existing deadline.dll. Then start up the slave again and see if it can render. If it can’t, send us the new log and we’ll take another look.

Thanks!
Ryan
deadline.zip (570 KB)
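
A rough sketch of those manual patch steps, for anyone scripting them on a node: stop the Deadline apps, back up the existing DLL, and copy the patched one into the bin folder. The process names and the patch location are assumptions about a default Windows client install; run it elevated, and only on the affected machine.

```python
import shutil
import subprocess
from pathlib import Path

BIN_DIR = Path(r"C:\Program Files\Thinkbox\Deadline7\bin")
PATCH = Path(r"C:\temp\deadline.dll")  # hypothetical location of the unzipped attachment

# Stop the Deadline client applications (assumed process names; ignore ones not running).
for proc in ("deadlineslave.exe", "deadlinemonitor.exe", "deadlinelauncher.exe"):
    subprocess.run(["taskkill", "/F", "/IM", proc], check=False)

target = BIN_DIR / "deadline.dll"
backup = BIN_DIR / "deadline.dll.bak"
shutil.copy2(target, backup)   # keep a backup of the original DLL
shutil.copy2(PATCH, target)    # overwrite with the patched DLL
print("deadline.dll replaced; restart the Slave and check whether it renders.")
```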

Yes, Throttling is enabled.
I’ve swapped the DLL and activated Slave Verbose Logging.

Looks like Pulse is sending waiting commands. Maybe I should upgrade Pulse manually.
Here’s the log file.
deadlineslave-NODE201-2014-12-04-0003.log (3.1 KB)

Yeah, try upgrading pulse manually and let’s see if that makes a difference. After starting up pulse again after the upgrade, restart the slave on NODE201 so that we get a fresh log, and then post that log if there are still issues.

Thanks!
Ryan

Wow, I stopped Pulse to upgrade it manually, and suddenly all my stalled nodes started rendering :smiley:

I’ve relaunched Pulse, and now rendering looks fine. I’ll try heavier jobs to see.

Looks like it was the Automatic Upgrade on Pulse that broke all the nodes…
Stay tuned.

I’m having similar problems. I also went to .50.

Slaves get up to 75%, then die for no reason. I’ve tried smaller task sizes, down to single frames; that doesn’t work. Light scenes sort of work; heavy scenes don’t. I’ll try changing the limits as suggested.

I’d estimate that over 80% of all the tasks we have tested have failed with no error other than exception 0. This is with Maya 2015, tested on multiple OSes.
