Concurrent Tasks Keep Stalling

Hi,

Our studio recently discovered concurrent tasks, which is quite a great option! We had a lot of problems related to reading files through multiple slave instances, and now that we use concurrent tasks instead of instances, the problem is gone :slight_smile:

But now a new set of issues has come up with this option: jobs keep stalling over and over.

I have the auto task timeout enabled (after 30 tasks and 75% of the job complete, it takes the average task time and doubles it), and I also set a regular timeout. Most of the time, though, the timeout doesn’t get applied, and I don’t know why.
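
For reference, here is a minimal sketch of how I read that auto timeout heuristic. This is only an illustration in Python with made-up names and numbers, not Deadline’s actual code:

    # Illustration only: once enough tasks are done and enough of the job has
    # completed, the timeout becomes twice the average completed-task time.
    def auto_task_timeout(completed_task_times, total_tasks,
                          min_samples=30, min_progress=0.75, multiplier=2.0):
        completed = len(completed_task_times)
        if completed < min_samples or completed / float(total_tasks) < min_progress:
            return None  # not enough history yet, so no auto timeout is applied
        average = sum(completed_task_times) / float(completed)
        return average * multiplier

    # Example: 40 tasks done out of 50, averaging 120 seconds -> 240 second limit.
    print(auto_task_timeout([120.0] * 40, 50))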

To explain the “crash”: once in a while, every concurrent task on a node (for example, 12 on one node) will keep “rendering” and never stop, while the node’s CPU sits at 0 percent. Sometimes when I requeue the stalled frames the node itself becomes stalled, but not every time. I’ve also noticed that this error is more likely to come up on small rendering jobs (2 minutes or less per frame).

Do you have any idea what’s going on?

Thanks !

Fred

Hi Fred,

If the timeout isn’t being applied, it could be that the slave is losing track of the task, which can be caused by network issues. Maybe the extra concurrent tasks are putting a strain on the network. To determine if this is the case, first enable Slave Verbose Logging in your Repository Options:
thinkboxsoftware.com/deadlin … on_Logging

Then restart your slaves so that they recognize the change immediately. The next time a task gets stuck like this, see which slave is rendering it in the task list, go to that slave, and from its user interface, select Help -> Explore Log Folder. Find the current slave log (the most recent one based on the last modified date/time) and post it. We’ll take a look to see what’s going on.
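
If it helps, here’s a quick sketch for picking out the most recent slave log by its modified time. This is a hypothetical helper, and the log folder path is an assumption based on a default Windows install, so adjust it to wherever your slave actually writes its logs:

    # Sketch: find the most recently modified slave log in the log folder.
    # The folder path is assumed (default Windows location); adjust as needed.
    import glob
    import os

    log_dir = r"C:\ProgramData\Thinkbox\Deadline\logs"
    logs = glob.glob(os.path.join(log_dir, "deadlineslave_*.log"))
    latest = max(logs, key=os.path.getmtime)  # newest by last-modified time
    print(latest)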

Thanks!

  • Ryan

I will try to find a more accurate log tomorrow. So far I found this on a node that got stuck today, but I cannot tell from the log at which point it was rendering.

2012-04-30 14:35:39: BEGIN - POSTE-0468\fasavard
2012-04-30 14:35:39: Start-up
2012-04-30 14:35:39: Deadline Monitor 5.1 [v5.1.0.46114 R]
2012-04-30 14:35:39: 2012-04-30 14:35:38
2012-04-30 14:35:54: UpdateAll (uimanager) !!
2012-04-30 14:35:55: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:35:55: Requesting jobs update from Deadline Pulse...
2012-04-30 14:35:55: Update received from Deadline Pulse.
2012-04-30 14:35:55: Received Update for 70 jobs.
2012-04-30 14:35:55: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:35:55: Requesting slaves update from Deadline Pulse...
2012-04-30 14:35:56: Update received from Deadline Pulse.
2012-04-30 14:35:56: Received Update for 22 slaves.
2012-04-30 14:39:22: Enqueing: &Requeue 23 Tasks
2012-04-30 14:39:22: Dequeued: &Requeue 23 Tasks
2012-04-30 14:39:58: Enqueing: &Refresh Job
2012-04-30 14:39:58: Dequeued: &Refresh Job
2012-04-30 14:40:02: Enqueing: View 13 Error Reports...
2012-04-30 14:40:02: Dequeued: View 13 Error Reports...
2012-04-30 14:41:16: Enqueing: &Modify Properties...
2012-04-30 14:41:16: Dequeued: &Modify Properties...
2012-04-30 14:41:35: Enqueing: &Restart Slave
2012-04-30 14:41:35: Dequeued: &Restart Slave
2012-04-30 14:41:49: Enqueing: &Requeue 19 Tasks
2012-04-30 14:41:49: Dequeued: &Requeue 19 Tasks
2012-04-30 14:41:52: Enqueing: &Modify Properties...
2012-04-30 14:41:52: Dequeued: &Modify Properties...
2012-04-30 14:49:08: Enqueing: &Refresh All Jobs
2012-04-30 14:49:08: Dequeued: &Refresh All Jobs
2012-04-30 14:49:08: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:08: Requesting jobs update from Deadline Pulse...
2012-04-30 14:49:09: Update received from Deadline Pulse.
2012-04-30 14:49:09: Received Update for 5 jobs.
2012-04-30 14:49:12: Enqueing: &Refresh All Slaves
2012-04-30 14:49:12: Dequeued: &Refresh All Slaves
2012-04-30 14:49:12: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:12: Requesting slaves update from Deadline Pulse...
2012-04-30 14:49:12: Update received from Deadline Pulse.
2012-04-30 14:49:12: Received Update for 12 slaves.
2012-04-30 14:49:22: Enqueing: &Refresh All Jobs
2012-04-30 14:49:22: Dequeued: &Refresh All Jobs
2012-04-30 14:49:22: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:22: Requesting jobs update from Deadline Pulse...
2012-04-30 14:49:22: Update received from Deadline Pulse.
2012-04-30 14:49:22: Received Update for 2 jobs.
2012-04-30 14:49:54: Enqueing: &Refresh All Jobs
2012-04-30 14:49:54: Dequeued: &Refresh All Jobs
2012-04-30 14:49:54: Attempting to contact Deadline Pulse (FX-Pulse)...
2012-04-30 14:49:54: Requesting jobs update from Deadline Pulse...
2012-04-30 14:49:54: Update received from Deadline Pulse.
2012-04-30 14:49:54: Received Update for 2 jobs.
2012-04-30 14:50:54: Enqueing: View 6 Error Reports...
2012-04-30 14:50:54: Dequeued: View 6 Error Reports...
2012-04-30 14:51:04: Enqueing: Copy
2012-04-30 14:51:04: Dequeued: Copy
2012-04-30 14:51:32: Enqueing: View 6 Error Reports...
2012-04-30 14:51:32: Dequeued: View 6 Error Reports...
2012-04-30 14:51:37: Enqueing: &Suspend &Job
2012-04-30 14:51:37: Dequeued: &Suspend &Job
2012-04-30 14:52:06: Enqueing: &Explore Log Folder
2012-04-30 14:52:06: Dequeued: &Explore Log Folder
2012-04-30 14:52:38: Enqueing: View Error Report...
2012-04-30 14:52:38: Dequeued: View Error Report...
2012-04-30 14:52:42: Enqueing: View Error Report...
2012-04-30 14:52:42: Dequeued: View Error Report...
2012-04-30 14:52:52: Enqueing: &Explore Log Folder
2012-04-30 14:52:52: Dequeued: &Explore Log Folder

I’ll be back with more information tomorrow.

Thanks :slight_smile:!

Fred

That’s the Monitor log actually. We’ll need the slave log, and it needs to be collected from the actual slave machine (wasn’t sure if fasavard is a render node or not).

Cheers,

  • Ryan

I can’t find it. My slave runs as a service, so I cannot get to its user interface.

I use Windows 7 x64. Do you know the path?

C:\ProgramData\Thinkbox\Deadline\logs

There it goes :slight_smile:
deadlineslave_Fx-render-13(Fx-render-13)-2012-05-01-0000.log (2.03 MB)

Thanks! Based on this log, it doesn’t look like any tasks were lost during this session. Note that a new log gets started for each day, and the timestamp for this one is today. If the slave lost track of a task yesterday, we’ll want to look at the corresponding log. Just look for all the logs that start with “deadlineslave_Fx-render-13(Fx-render-13)-2012-04-30-” and post them. If the slave lost the task on another day, just grab that day’s slave logs instead.
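
For example, something like this sketch would gather that day’s slave logs so they can be zipped up and posted (the folder path is again assumed to be the default Windows location):

    # Sketch: list all of a given day's slave logs for a specific slave.
    # The slave name and date below come from the example above.
    import glob
    import os

    log_dir = r"C:\ProgramData\Thinkbox\Deadline\logs"
    pattern = "deadlineslave_Fx-render-13(Fx-render-13)-2012-04-30-*.log"
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        print(path, os.path.getsize(path), "bytes")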

Thanks!

  • Ryan

Here’s the one from the day before. I’m actually quite surprised by what you’re telling me. Tasks keep “sticking” to the slave when my concurrency is too high, so sometimes I have 12 tasks that don’t render even though Deadline says they are rendering.
deadlineslave_Fx-render-13(Fx-render-13)-2012-04-30-0003.log (685 KB)

Hmm, nothing in there either… I did notice this was log 0003, which means that there are logs 0002, 0001, and 0000 from that same day. The evidence might be in those logs. Please post those 3 and we’ll have a look.

Also, out of curiosity, open your Repository Options, and under the Connection tab in the Pulse Settings, check whether the Task Confirmation feature is enabled. If it’s not, try enabling it and see if that makes an improvement:
thinkboxsoftware.com/deadlin … e_Settings

Thanks!

  • Ryan

The option wasn’t activated. Hopefully that will help. Is there any way that Pulse could look for stalled slaves and restart them by itself? Because that’s what seems to happen.

I made you a folder with my logs so you have all the information you need.

Thanks !

Fred
log transphere.zip (839 KB)

Thanks! v0002 of the log from yesterday contained the info I was looking for. For example, at 17:08:31, thread 0 could no longer see its task:

It then printed out the contents of the job folder, and sure enough, task 00048 had a different name:

In this situation, one of two things could have happened.

  1. The task was actually requeued.
  2. Network latency caused the slave to see the contents differently than they actually were.

Obviously, it’s not (1) here, because the task still shows as rendering. By enabling that Task Confirmation feature, the slave will wait up to the given amount of time until it can confirm it can see the task file it will be working on, so it should help here. You can even try setting the wait time to something high like 30000 milliseconds (instead of the default 5000), since the slave will move on as soon as it sees its task file.
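
Conceptually, the wait behaves something like this sketch. It’s hypothetical code just to illustrate the poll-until-visible idea, not the slave’s actual implementation:

    # Conceptual sketch of a poll-until-visible wait like Task Confirmation:
    # keep checking for the task file until it shows up or the timeout expires.
    import os
    import time

    def wait_for_task_file(task_file_path, timeout_ms=30000, poll_ms=500):
        give_up_at = time.time() + timeout_ms / 1000.0
        while time.time() < give_up_at:
            if os.path.exists(task_file_path):  # visible -> start rendering now
                return True
            time.sleep(poll_ms / 1000.0)        # short pause, then check again
        return False  # never saw the file; treat the task as lost or requeued

Because the loop returns as soon as the file is visible, a high limit like 30000 milliseconds only costs time in the bad case.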

Cheers,

  • Ryan

Can’t wait to see the results :smiley:

I’ll give you an update in a day or two. Thanks once again, sir!

Fred

So the problem is a lot less severe, but I still get major crashes on some jobs.

All the slaves are stalling. I did make the change in the Pulse menu.

I’ve attached the logs from one of the nodes.
logs.zip (329 KB)

Glad to hear it helped. I took a look at the logs, and I noticed that the two tasks it lost track of occurred before the slave recognized that the Task Confirmation feature was enabled. It can take the slaves up to 10 minutes to recognize changes made in the Repository Options.

Have things improved today?

It sure did. Overnight I had maybe 5 Nuke frames that got stuck, but that’s a major improvement :slight_smile: Thanks.

I still have to manually edit my RIB job script. The RIB submitter fails on any warning, even simple ones. For example, if a job has all of its textures but doesn’t have the default lambert1 shader (even if it’s not used), the job will fail.
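
What I’d want is for the plugin to treat real errors and plain warnings differently, something like this rough sketch (my own illustration, not the actual submitter or plugin code, and the sample warning text is made up):

    # Rough sketch of the behaviour I'd like: fail only on real errors and
    # just log warnings such as a missing default lambert1 shader.
    import re

    def check_render_line(line):
        if re.search(r"\bERROR\b|\bSEVERE\b", line, re.IGNORECASE):
            raise RuntimeError("Render error: " + line.strip())
        if re.search(r"\bWARNING\b", line, re.IGNORECASE):
            print("Ignoring warning: " + line.strip())  # warning is not fatal

    check_render_line("WARNING: no shader named 'lambert1' (made-up example)")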

So thanks again. I’ll keep you posted on how everything goes.

Fred

I got some new data about my crashes. As you can see, they all crash at the same time.

I’m attaching the log from host 17.

What’s rendering is a cutout: I have a model in a scene, and I output RIB from the scene. What you see on screen is the RIB render, at a concurrency of 16, with the maximum concurrency limited by the CPU count. When I remove the concurrency, the render doesn’t seem to crash.
The problem was around 13:00.
log report.zip (41.8 KB)

Is it possible these machines simply can’t handle 16 rendering processes at the same time? Have you tried reducing it to 4 or 8 to see if it improves things?

I did, and it’s not as bad, but it still crashes. It doesn’t actually take 16 tasks, because I have 12 cores; if I set it to 12 I get just as many errors. With 8 they still crash, but not as much. At first I thought the problem was the CPU being at 100%, but with Nuke the CPU goes to 100% for 12 concurrent tasks with pretty much no problem. The problem seems to be on smaller jobs. Small frames…

Maybe the issue is specific to rib renders. Do you know if 3delight renders are multithreaded by default or not? Maybe try adding the “-t 1” argument to limit the render to a single thread. Maybe it will play nicer with concurrent tasks that way…
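
To put rough numbers on that suggestion (a sketch with assumed values; whether 3delight grabs every core by default is only an assumption here):

    # Rough arithmetic sketch: if each concurrent task spawns a renderer that
    # uses every core, the node is heavily oversubscribed; "-t 1" keeps the
    # total render thread count at the concurrency level instead.
    cores = 12                # cores on the render node (from this thread)
    concurrency = 16          # concurrent tasks per slave
    threads_per_task = cores  # assumption: renderer uses all cores by default
    print("default:", concurrency * threads_per_task, "render threads on", cores, "cores")
    print("with -t 1:", concurrency * 1, "render threads on", cores, "cores")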