repeat slave stalls

Hi,

I’ve been rendering an old project with Deadline 7.2.3.0 R, 3ds Max 2014 and V-Ray 3.4, and I’ve been getting a lot of the stalled slave reports below. I haven’t narrowed it down to any specific geometry yet, but could this error have any other cause?

The frames seem to be about 30 mins into rendering when the slave stalls. The frame still seems to be rendering, but the task is then picked up by another slave. Slightly annoying!

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: DHSBS
Version: v7.2.3.0 R (d21b3e911)

Stalled Slave: Node06
Slave Version: v7.2.3.0 R (d21b3e911)
Last Slave Update: 2016-07-12 10:56:08
Current Time: 2016-07-12 11:06:27
Time Difference: 10.316 m
Maximum Time Allowed Between Updates: 10.000 m

Current Job Name: 2463_EVR_UILW_Vault
Current Job ID: 5784baa7dfb99f4bd82eabef
Current Job User: mark.h
Current Task Names: 0
Current Task Ids: 0

Searching for job with id “5784baa7dfb99f4bd82eabef”
Found possible job: 2463_EVR_UILW_Vault
Searching for task with id “0”
Found possible task: 0:[0-0]
Task’s current slave: Node06
Slave machine names match, stopping search
Associated Job Found: 2463_EVR_UILW_Vault
Job User: mark.h
Submission Machine: Hive71
Submit Time: 07/12/2016 10:38:47
Associated Task Found: 0:[0-0]
Task’s current slave: Node06
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.
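
For anyone comparing several of these reports, the timing check can be reproduced from the report text. This is only a rough sketch: `parse_report`, `minutes_since_update` and `REPORT` are hypothetical names and not part of Deadline, and it assumes the report keeps the "Key: value" layout and the timestamp format shown above.

```python
from datetime import datetime

# Excerpt of the report above; only the timing lines are needed here.
REPORT = """\
Stalled Slave: Node06
Last Slave Update: 2016-07-12 10:56:08
Current Time: 2016-07-12 11:06:27
Maximum Time Allowed Between Updates: 10.000 m
"""

def parse_report(text):
    """Collect 'Key: value' lines into a dict (first occurrence of a key wins)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields

def minutes_since_update(fields):
    """Minutes between the slave's last update and the house cleaner's clock."""
    fmt = "%Y-%m-%d %H:%M:%S"  # assumed from the timestamps in the report
    last = datetime.strptime(fields["Last Slave Update"], fmt)
    now = datetime.strptime(fields["Current Time"], fmt)
    return (now - last).total_seconds() / 60.0

fields = parse_report(REPORT)
silence = minutes_since_update(fields)
# The limit is written as "10.000 m", so take the number before the unit.
limit = float(fields["Maximum Time Allowed Between Updates"].split()[0])
stalled = silence > limit
print(f"{fields['Stalled Slave']}: {silence:.3f} m since last update, limit {limit:.3f} m")
```

In this report the slave was only about 19 seconds past the limit, which is why the frame still looked like it was rendering when the task was requeued.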

Thanks,
Mark

I’ve just seen another thread which suggested it might be to do with multiple slaves having the same name. I have D8 running, but disabled, so I’ve stopped all the D8 slaves and restarted all the frames to see if that helps.

I’m running a pulse as well.

So far so good! The jobs have been running way longer than before and have not stalled. Hopefully that’s the issue: running both D7 and D8 at the same time. What would be the quickest way to stop all the D8 slaves from starting, without going to each machine and stopping them?

Thanks

Hey Mark,

Off the top of my head I can’t think of anything faster than stopping them all via a Deadline 8 monitor. As for disabling them from starting up for good (or until you are able to move up to 8) you could remove the Deadline 8 launcher from startup.

Assuming that when you say ‘slave’ you mean the Deadline Slave application and not a machine that does rendering all day.

Would you be able to send over a job log from a job that’s failing because of this behavior? If you’d rather not put it up on the forum, you can send it to us at support@thinkboxsoftware.com.

I also have this problem. I hope you can solve it.

Hi Justin,

Here is a slave report on a machine that just failed a job, but is still rendering.

Slave_2016-07-13_15-12-36_57864c5581f709586429fcf8.txt (1.07 KB)

This was written in the slave report viewer:

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: DHSBS
Version: v7.2.3.0 R (d21b3e911)

Stalled Slave: Node03
Slave Version: v7.2.3.0 R (d21b3e911)
Last Slave Update: 2016-07-13 15:02:07
Current Time: 2016-07-13 15:12:36
Time Difference: 10.496 m
Maximum Time Allowed Between Updates: 10.000 m

Current Job Name: 2463 [2463_EVR_UILW_Vault_Ridges.max]
Current Job ID: 57864544dd986cf4a0a38285
Current Job User: dan
Current Task Names: 0
Current Task Ids: 0

Searching for job with id “57864544dd986cf4a0a38285”
Found possible job: 2463 [2463_EVR_UILW_Vault_Ridges.max]
Searching for task with id “0”
Found possible task: 0:[0-0]
Task’s current slave: Node03
Slave machine names match, stopping search
Associated Job Found: 2463 [2463_EVR_UILW_Vault_Ridges.max]
Job User: dan
Submission Machine: Hive70
Submit Time: 07/13/2016 14:42:27
Associated Task Found: 0:[0-0]
Task’s current slave: Node03
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.

Hope this is the right one. It’s all stable until 8 runs alongside 7.

Hey Mark,

You’ve sent a slave log, but I’d like to see a job report, as it should have some extra information about why the task is still considered to be rendering after the job has completed. I think that disconnect is what causes the slave to stall.

All reports for a job can be viewed in the Job Reports panel. It can be opened from the right-click menu of the Job and Task panels. You can use the Job Reports panel’s right-click menu to save reports as files.

Documentation on logs: docs.thinkboxsoftware.com/produc … nd-history

Sorry for the delay. It has just happened again and I saved the job report.
Job_2016-07-18_11-15-03_578cac2781f709586429fde4.txt (1.11 KB)

This was written in the Job Reports panel:

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: DHSBS
Version: v7.2.3.0 R (d21b3e911)

Stalled Slave: Render51
Slave Version: v7.2.3.0 R (d21b3e911)
Last Slave Update: 2016-07-18 11:05:03
Current Time: 2016-07-18 11:15:03
Time Difference: 10.007 m
Maximum Time Allowed Between Updates: 10.000 m

Current Job Name: NWR [2463_SurfaceReceiptTransfer_SWTC_150.max]
Current Job ID: 5784a75bdfb99f19ecc5b9e9
Current Job User: mark.h
Current Task Names: 0
Current Task Ids: 0

Searching for job with id “5784a75bdfb99f19ecc5b9e9”
Found possible job: NWR [2463_SurfaceReceiptTransfer_SWTC_150.max]
Searching for task with id “0”
Found possible task: 0:[0-0]
Task’s current slave: Render51
Slave machine names match, stopping search
Associated Job Found: NWR [2463_SurfaceReceiptTransfer_SWTC_150.max]
Job User: mark.h
Submission Machine: Hive71
Submit Time: 07/12/2016 09:16:27
Associated Task Found: 0:[0-0]
Task’s current slave: Render51
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.

Hope it helps. I was thinking, if it’s the Deadline 8 process running, could the time delay for the process be increased?

Mark

Slightly annoyingly, I have now stopped the D8 slaves and the D8 Pulse, and the D7 jobs are still stalling, but not, so I’m not sure it is D8’s fault. I have no idea what it could be, however! These are old scenes, so it might be something in the files, but I don’t see what.

Hi,

I have now stopped using D7 and moved fully over to D8. I’m still getting slaves stalling. I’ve changed the stall timeout from 10 mins to 120 mins, and this has helped, but it is sort of hiding the problem. I just had one slave stall as unresponsive after 120 mins. I’m giving D8 a week, then uninstalling D7 from all slaves, just in case there is still a conflict. It does seem to be a D7 and D8 issue, but I can’t work out what it is!

I think we would still like a Slave log (found by using the ‘help’ menu in the Slave UI, the logs are substantially longer). It might be worthwhile giving us a call so we can look at things.

The Slave is being marked as stalled because it claims to be online, but it’s not updating its state in the database at a regular interval. The log shows that instead of updating every 7 seconds like it’s supposed to, it hasn’t recorded anything in the past ten minutes. That means it missed about 86 intervals, so we assume it’s either dead or disconnected and that we should free up whatever work it was doing.
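
To make that arithmetic concrete, here is a minimal sketch of the check. It is an illustration only, not the actual house cleaning code: the function names are made up, and the 7 second update interval and 10 minute limit are simply the figures quoted in this thread.

```python
UPDATE_INTERVAL_SECONDS = 7    # how often the Slave is expected to update its state
MAX_SILENCE_SECONDS = 10 * 60  # "Maximum Time Allowed Between Updates: 10.000 m"

def missed_updates(silence_seconds, interval_seconds=UPDATE_INTERVAL_SECONDS):
    """Approximate number of update intervals that passed with no update."""
    return silence_seconds / interval_seconds

def looks_stalled(silence_seconds, max_silence_seconds=MAX_SILENCE_SECONDS):
    """True once the Slave has been silent for longer than the allowed maximum."""
    return silence_seconds > max_silence_seconds

# Ten minutes of silence is 600 / 7, or roughly 86 missed updates.
missed_at_limit = missed_updates(MAX_SILENCE_SECONDS)
print(round(missed_at_limit))

# First report in this thread: 10:56:08 to 11:06:27 is 619 seconds of silence.
print(looks_stalled(619))
```

Raising the limit (for example to 120 minutes) only changes `MAX_SILENCE_SECONDS` in this picture; the Slave is still going quiet, it just takes longer to be flagged.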

Usually, if the interval is less than ten minutes it’s because the Slaves are fighting each other.
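
One quick way to check for that kind of clash is to list which Slave names are registered by more than one Deadline version. The sketch below is purely hypothetical: the `registrations` list is made-up example data (you would have to collect the real names yourself, for instance from each version’s Monitor), and `find_name_clashes` is not a Deadline function.

```python
from collections import defaultdict

def find_name_clashes(registrations):
    """Return {slave_name: [versions]} for names that appear under several versions."""
    versions_by_name = defaultdict(set)
    for name, version in registrations:
        versions_by_name[name].add(version)
    return {
        name: sorted(versions)
        for name, versions in versions_by_name.items()
        if len(versions) > 1
    }

# Hypothetical example data: (slave name, Deadline major version it runs under).
registrations = [
    ("Node03", "7"),
    ("Node06", "7"),
    ("Node06", "8"),
    ("Render51", "7"),
]

clashes = find_name_clashes(registrations)
for name, versions in clashes.items():
    print(f"{name} is registered under Deadline versions: {', '.join(versions)}")
```

Any name that shows up here is a candidate for two Slave applications competing over the same state.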

I’m expecting to see database connection errors in the log, but we’ll have to see.

Would you be able to give us a call? Our number is in my signature here. I have a few meetings today, but you can try to call me directly. In the worst case it’ll forward to the main support line. Most of us are pretty seasoned at digging into this stuff now.