Our studio recently discovered concurrent tasks, which are a great option! We had a lot of problems related to reading files through multiple slave instances, and now that we use concurrent tasks instead of instances, that problem is gone.
But a new set of issues has come up with this option: jobs keep stalling over and over.
So I enabled the auto task timeout (after 30 tasks and 75% of the job are done, take the average task time and double it), and I also set a regular timeout. Most of the time, though, the timeout doesn’t apply itself, and I don’t know why.
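As I understand it, that auto timeout rule works roughly like this (a sketch only, not Deadline’s actual code; the function and variable names are made up):

```python
def auto_task_timeout(completed_task_times, total_tasks):
    """Sketch of the auto task timeout rule described above: once at
    least 30 tasks have completed AND at least 75% of the job is done,
    the timeout becomes double the average completed-task time.
    Returns None while the rule is not yet active."""
    completed = len(completed_task_times)
    if completed < 30 or completed < 0.75 * total_tasks:
        return None  # not enough data yet; no auto timeout applied
    average = sum(completed_task_times) / completed
    return 2 * average  # timeout in the same units as the task times

# Example: 40 of 50 tasks done, averaging 120 s each -> 240 s timeout
print(auto_task_timeout([120] * 40, 50))
```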
To explain the “crash”: once in a while, every concurrent task (for example, 12 on one node) will keep rendering forever and never stop, while the node’s CPUs sit at 0 percent. Sometimes, when I requeue the stalled frames, the node itself becomes stalled, but not every time. I’ve also noticed that this error is more likely to come up on small rendering jobs (2 minutes or less per frame).
If the timeout isn’t being applied, it could be that the slave is losing track of the task, which can be caused by network issues. Maybe the extra concurrent tasks are putting a strain on the network. To determine if this is the case, first enable Slave Verbose Logging in your Repository Options: thinkboxsoftware.com/deadlin … on_Logging
Then restart your slaves so that they recognize the change immediately. The next time a task gets stuck like this, see which slave is rendering it in the task list, then go to that slave, and from its user interface, select Help -> Explore Log Folder. Find the current slave log (the most recent one based on the last modified date/time) and post it. We’ll take a look to see what’s going on.
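If it helps, a quick way to grab that newest slave log from the log folder is to sort by modified time (a small helper sketch; the folder path below is just a placeholder, use whatever Explore Log Folder opens for you):

```python
import glob
import os

def newest_slave_log(log_dir):
    """Return the slave log file with the most recent modified time."""
    logs = glob.glob(os.path.join(log_dir, "deadlineslave_*.log"))
    if not logs:
        return None  # no slave logs found in this folder
    return max(logs, key=os.path.getmtime)

# Example (illustrative path only):
print(newest_slave_log(r"C:\path\to\Deadline\logs"))
```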
I will try to find a more accurate log tomorrow. So far I found this on a node that got stuck today, but I cannot tell from the log at which point it was rendering:
2012-04-30 14:35:39: BEGIN - POSTE-0468\fasavard
2012-04-30 14:35:39: Start-up
2012-04-30 14:35:39: Deadline Monitor 5.1 [v5.1.0.46114 R]
2012-04-30 14:35:39: 2012-04-30 14:35:38
2012-04-30 14:35:54: UpdateAll (uimanager) !!
2012-04-30 14:35:55: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:35:55: Requesting jobs update from Deadline Pulse...
2012-04-30 14:35:55: Update received from Deadline Pulse.
2012-04-30 14:35:55: Received Update for 70 jobs.
2012-04-30 14:35:55: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:35:55: Requesting slaves update from Deadline Pulse...
2012-04-30 14:35:56: Update received from Deadline Pulse.
2012-04-30 14:35:56: Received Update for 22 slaves.
2012-04-30 14:39:22: Enqueing: &Requeue 23 Tasks
2012-04-30 14:39:22: Dequeued: &Requeue 23 Tasks
2012-04-30 14:39:58: Enqueing: &Refresh Job
2012-04-30 14:39:58: Dequeued: &Refresh Job
2012-04-30 14:40:02: Enqueing: View 13 Error Reports...
2012-04-30 14:40:02: Dequeued: View 13 Error Reports...
2012-04-30 14:41:16: Enqueing: &Modify Properties...
2012-04-30 14:41:16: Dequeued: &Modify Properties...
2012-04-30 14:41:35: Enqueing: &Restart Slave
2012-04-30 14:41:35: Dequeued: &Restart Slave
2012-04-30 14:41:49: Enqueing: &Requeue 19 Tasks
2012-04-30 14:41:49: Dequeued: &Requeue 19 Tasks
2012-04-30 14:41:52: Enqueing: &Modify Properties...
2012-04-30 14:41:52: Dequeued: &Modify Properties...
2012-04-30 14:49:08: Enqueing: &Refresh All Jobs
2012-04-30 14:49:08: Dequeued: &Refresh All Jobs
2012-04-30 14:49:08: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:08: Requesting jobs update from Deadline Pulse...
2012-04-30 14:49:09: Update received from Deadline Pulse.
2012-04-30 14:49:09: Received Update for 5 jobs.
2012-04-30 14:49:12: Enqueing: &Refresh All Slaves
2012-04-30 14:49:12: Dequeued: &Refresh All Slaves
2012-04-30 14:49:12: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:12: Requesting slaves update from Deadline Pulse...
2012-04-30 14:49:12: Update received from Deadline Pulse.
2012-04-30 14:49:12: Received Update for 12 slaves.
2012-04-30 14:49:22: Enqueing: &Refresh All Jobs
2012-04-30 14:49:22: Dequeued: &Refresh All Jobs
2012-04-30 14:49:22: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:22: Requesting jobs update from Deadline Pulse...
2012-04-30 14:49:22: Update received from Deadline Pulse.
2012-04-30 14:49:22: Received Update for 2 jobs.
2012-04-30 14:49:54: Enqueing: &Refresh All Jobs
2012-04-30 14:49:54: Dequeued: &Refresh All Jobs
2012-04-30 14:49:54: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:54: Requesting jobs update from Deadline Pulse...
2012-04-30 14:49:54: Update received from Deadline Pulse.
2012-04-30 14:49:54: Received Update for 2 jobs.
2012-04-30 14:50:54: Enqueing: View 6 Error Reports...
2012-04-30 14:50:54: Dequeued: View 6 Error Reports...
2012-04-30 14:51:04: Enqueing: Copy
2012-04-30 14:51:04: Dequeued: Copy
2012-04-30 14:51:32: Enqueing: View 6 Error Reports...
2012-04-30 14:51:32: Dequeued: View 6 Error Reports...
2012-04-30 14:51:37: Enqueing: &Suspend &Job
2012-04-30 14:51:37: Dequeued: &Suspend &Job
2012-04-30 14:52:06: Enqueing: &Explore Log Folder
2012-04-30 14:52:06: Dequeued: &Explore Log Folder
2012-04-30 14:52:38: Enqueing: View Error Report...
2012-04-30 14:52:38: Dequeued: View Error Report...
2012-04-30 14:52:42: Enqueing: View Error Report...
2012-04-30 14:52:42: Dequeued: View Error Report...
2012-04-30 14:52:52: Enqueing: &Explore Log Folder
2012-04-30 14:52:52: Dequeued: &Explore Log Folder
That’s the Monitor log actually. We’ll need the slave log, and it needs to be collected from the actual slave machine (wasn’t sure if fasavard is a render node or not).
Thanks! Based on this log, it doesn’t look like any tasks were lost during this session. Note that a new log gets started for each day, and the timestamp for this one is today. If the slave lost track of a task yesterday, we’ll want to look at the corresponding log. Just look for all the logs that start with “deadlineslave_Fx-render-13(Fx-render-13)-2012-04-30-” and post them. If the slave lost the task on another day, just grab that day’s slave logs instead.
Here is the one from the day before. I am actually quite surprised by what you are telling me. Jobs on the slave keep “sticking” to it when my concurrency is too high, so sometimes I have 12 tasks that don’t render even though Deadline says they are rendering. deadlineslave_Fx-render-13(Fx-render-13)-2012-04-30-0003.log (685 KB)
Hmm, nothing in there either… I did notice this was log 0003, which means that there are logs 0002, 0001, and 0000 from that same day. The evidence might be in those logs. Please post those 3 and we’ll have a look.
Also, out of curiosity, can you open your Repository Options, and under the Connection tab in the Pulse Settings, can you check if the Task Confirmation feature is enabled? If it’s not, try enabling it and see if that makes an improvement: thinkboxsoftware.com/deadlin … e_Settings
The option wasn’t activated. Hopefully that will help. Is there any way that Pulse could look for stalled slaves and restart them by itself? Because that’s what seems to happen.
I made you a folder with my logs so you have all the information you need.
Thanks! v0002 of the log from yesterday contained the info I was looking for. For example, at 17:08:31, thread 0 could no longer see its task:
It then printed out the contents of the job folder, and sure enough, task 00048 had a different name:
In this situation, one of two things could have happened:
1. The task was actually requeued.
2. Network latency caused the slave to see the contents differently than they actually were.
Obviously, it’s not (1) here, because the task still shows as rendering. By enabling that Task Confirmation feature, the slave will wait up to the given amount of time until it can confirm it can see the task file it will be working on, so it should help here. You can even try setting the wait time to something high like 30000 milliseconds (instead of the default 5000), since the slave will move on as soon as it sees its task file.
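Conceptually, that confirmation step is just a poll-and-wait on the task file before the slave starts working on it. Here is a rough sketch of the idea (not Deadline’s actual implementation; the function name, polling interval, and path are illustrative):

```python
import os
import time

def confirm_task_file(task_file_path, wait_ms=5000, poll_interval=0.25):
    """Poll until the task file is visible, or until the confirmation
    window expires. A sketch of the idea behind Task Confirmation,
    not Deadline's actual code."""
    give_up_at = time.time() + wait_ms / 1000.0
    while time.time() < give_up_at:
        if os.path.exists(task_file_path):
            return True  # file is visible; start rendering immediately
        time.sleep(poll_interval)  # give a laggy file server time to catch up
    return False  # never saw the file; treat the task as lost/requeued

# e.g. wait up to 30 seconds instead of the default 5 (path is illustrative):
# confirm_task_file(r"\\repository\jobs\some_job\00048_task_file", wait_ms=30000)
```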
Glad to hear it helped. I took a look at the logs, and I noticed that the two tasks it lost track of occurred before the slave recognized that the Task Confirmation feature was enabled. It can take the slaves up to 10 minutes to recognize changes made in the Repository Options.
It sure did. Overnight I had maybe 5 Nuke frames that got stuck, but that’s a major improvement. Thanks.
I still have to manually edit my RIB job script. The RIB submitter fails on any warning, even simple ones. For example, if I have a job that has all of its textures but doesn’t have the lambert1 default shader (even if it’s not used), it will crash.
So thanks again, I’ll keep you posted on how everything goes.
What we’re rendering is a cutout. So I have a model in a scene, and I output a RIB from that scene. What you see on screen is the RIB render, at a concurrency of 16, with the maximum concurrency limited by the CPU count. When I remove the concurrency, the render doesn’t seem to crash.
The problem was around 13h00. log report.zip (41.8 KB)
Is it possible these machines simply can’t handle 16 rendering processes at the same time? Have you tried reducing it to 4 or 8 to see if it improves things?
I did, and it’s not as bad, but it still crashes. It doesn’t actually use all 16 tasks, because I have 12 cores; if I set it to 12 I get just as many errors. With 8 they still crash, but not as much. At first I thought the problem was the CPU being at 100%, but with Nuke the CPU goes to 100% for 12 with “pretty much no problem”. The problem seems to be with smaller jobs. Small frames…
Maybe the issue is specific to rib renders. Do you know if 3delight renders are multithreaded by default or not? Maybe try adding the “-t 1” argument to limit the render to a single thread. Maybe it will play nicer with concurrent tasks that way…
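For what it’s worth, here is a rough sketch of how a wrapper script might force a RIB render to a single thread (this assumes the 3delight command-line renderer is renderdl; the “-t 1” flag is the one mentioned above, and everything else is illustrative and would normally be handled by the Deadline plugin itself):

```python
import subprocess
import sys

def render_rib_single_threaded(rib_path):
    """Render one RIB file with a single thread so it plays nicer with
    concurrent tasks. Sketch only; in practice the Deadline render
    plugin builds this command line for you."""
    cmd = [
        "renderdl",  # assumed name of the 3delight command-line renderer
        "-t", "1",   # limit the render to a single thread
        rib_path,
    ]
    return subprocess.call(cmd)

if __name__ == "__main__":
    sys.exit(render_rib_single_threaded(sys.argv[1]))
```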