I’m trying to figure out what could be causing Pulse to mark a slave as stalled repeatedly for the same task, or what could cause the slave to stop updating the database for long enough that Pulse marks it as stalled. This is a large job (1,000+ tasks), but other than that, I have nothing to go on.
That does seem unusual. The slave logs are probably a good place to start. What sort of task is running?
It’s a pipeline plugin that bootstraps an appropriate environment and then executes Python to do the work. This particular job type does a bulk data transfer from our remote location to here using a combination of bbcp and tar. We are using two slaves on a single machine to do these transfers, and the tasks can run for a long time for large files (though most of them finish in 5-6 minutes or less). We have many of these jobs running, but for some reason, one slave has been marked stalled 4 times in the last 2 days with nothing obvious to explain why. The jobs it gets marked stalled on are transferring large-ish files that take 15-20 minutes each (one file per task).
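To give a rough idea of what each task does, it boils down to something like the sketch below (the host name, paths, and bbcp options here are placeholders, not our real values):
[code]# Simplified sketch of a transfer task -- placeholder host/paths/options.
import subprocess

def transfer_file(remote_host, remote_path, local_dir):
    # bbcp prints a progress line every N seconds; we echo each line so it
    # shows up as STDOUT in the slave log while the copy runs.
    cmd = ["bbcp", "-P", "10", "%s:%s" % (remote_host, remote_path), local_dir]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    for line in iter(proc.stdout.readline, ""):
        print(line.rstrip())
    return proc.wait()

if __name__ == "__main__":
    rc = transfer_file("remote-gw", "/remote/archive/file.tar", "/local/incoming/")
    if rc != 0:
        raise RuntimeError("bbcp exited with code %d" % rc)
[/code]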
Since my original post, it has also happened twice on a job with 90 tasks, with two stall reports within 15 seconds of each other:
The first:
[code]Stalled Slave: sv-sync01-01
Slave Version: v6.2.0.32 R (2563d5bc8)
Last Slave Update: 2014-07-24 12:33:27
Current Time: 2014-07-24 12:47:50
Time Difference: 14.375 m
Maximum Time Allowed Between Updates: 14.000 m
Current Job Name: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Current Job ID: 53ce13a68799e36e9a51f5bb
Current Job User: kierano
Current Task Names: 12
Current Task Ids: 11
Searching for job with id “53ce13a68799e36e9a51f5bb”
Found possible job: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Searching for task with id “11”
Found possible task: 11:[12-12]
Task’s current slave: sv-sync01-01
Slave machine names match, stopping search
Associated Job Found: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Job User: kierano
Submission Machine: sv-deadline01
Submit Time: 07/22/2014 17:32:54
Associated Task Found: 11:[12-12]
Task’s current slave: sv-sync01-01
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.
[/code]
And the second:
[code]Stalled Slave: sv-sync01-01
Slave Version: v6.2.0.32 R (2563d5bc8)
Last Slave Update: 2014-07-24 12:33:27
Current Time: 2014-07-24 12:48:04
Time Difference: 14.618 m
Maximum Time Allowed Between Updates: 14.000 m
Current Job Name: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Current Job ID: 53ce13a68799e36e9a51f5bb
Current Job User: kierano
Current Task Names: 12
Current Task Ids: 11
Searching for job with id “53ce13a68799e36e9a51f5bb”
Found possible job: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Searching for task with id “11”
Found possible task: 11:[12-12]
Task’s current slave:
Slave machine names do not match, continuing search
Associated job not found, it has probably been deleted.
Setting slave’s status to Stalled.
Setting last update time to now.[/code]
Something seems to be going wrong with the stalled slave “head-hunting” process; at the end of the first report, it says “Setting last update time to now”, but the second report (again, 14 seconds later) clearly indicates that this has not happened. These two reports were generated by different machines; the first was by another idle render node, and the second was by Pulse.
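For what it’s worth, here is my reading of what that head-hunting check does, pieced together purely from the report output above (this is not actual Deadline code; the dict keys are just stand-ins for the real DB records):
[code]# My reading of the stall check, based only on the reports above.
from datetime import datetime

MAX_MINUTES = 14.0  # "Maximum Time Allowed Between Updates"

def check_for_stall(slave, tasks, now=None):
    now = now or datetime.now()
    minutes = (now - slave["last_update"]).total_seconds() / 60.0
    if minutes <= MAX_MINUTES:
        return False

    # "Searching for task with id ..." -- look up the task the slave reported.
    task = tasks.get(slave["current_task_id"])
    if task is not None and task.get("slave") == slave["name"]:
        # "Task is still rendering, attempting to fix situation. Requeuing task"
        task["status"] = "Queued"
        task["slave"] = None

    # "Setting slave's status to Stalled. Setting last update time to now."
    slave["status"] = "Stalled"
    slave["last_update"] = now  # if this write doesn't stick, another machine
                                # running the same check fires a second report
    return True
[/code]
If that last write were failing, or being read stale, it would explain two machines producing reports 14 seconds apart.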
Also, this is only happening in one of our studios, though they are both using the same slave setup for performing these transfers.
The slave logs do not reveal anything. They just end in the middle of the task log. The launcher log doesn’t really have anything interesting in it, except this (which seems unrelated):
2014-07-24 23:43:04: Could not set deadline.ini file setting 'User' because: Could not create file "/tmp/deadline.ini" or "/Volumes/newhome/qmaster/Thinkbox/Deadline6/deadline.ini". File already exists. (System.IO.IOException)
Could the file transfers somehow be interfering with the slave’s ability to communicate with the database, and thus update its state in a timely fashion?
Maybe as a stopgap solution for now, you could increase the maximum amount of time allowed between updates from 14 minutes to, say, 30 minutes to see if that helps:
thinkboxsoftware.com/deadlin … Settings_2
Cheers,
Ryan
Are the jobs assigned to this slave unique in some way? In other words, all jobs of type X go only to the troubled slave?
If other slaves can complete these jobs without issue, it seems to point to something unique to the troubled slave (although external causes cannot entirely be ruled out). In cases like this, my instinct is to re-image the slave as it could be a software issue anywhere from the OS on up. I suppose it could also be a physical networking issue of some kind (network card, cable, or switch), but it seems like that would also cause transfer errors.
I should also add that whenever choosing between solutions to Deadline problems offered by Ryan and myself, you should always go with Ryan’s suggestion first. However, whenever choosing the best place to get a burrito, you should go with my suggestion first, because there are not a lot of Mexican food places in Winnipeg as compared to LA.
I can’t imagine how… The other slave doing these transfers doesn’t seem to have any issues.
I tried that yesterday, but came in today to find another one (again, marked by Pulse):
[code]Stalled Slave: sv-sync01-01
Slave Version: v6.2.0.32 R (2563d5bc8)
Last Slave Update: 2014-07-25 23:48:48
Current Time: 2014-07-26 00:19:25
Time Difference: 30.616 m
Maximum Time Allowed Between Updates: 30.000 m
Current Job Name: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Current Job ID: 53ce13a68799e36e9a51f5bb
Current Job User: kierano
Current Task Names: 66
Current Task Ids: 65
Searching for job with id “53ce13a68799e36e9a51f5bb”
Found possible job: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Searching for task with id “65”
Found possible task: 65:[66-66]
Task’s current slave: sv-sync01-01
Slave machine names match, stopping search
Associated Job Found: cpremote_moonsun_OCN_001_010_fxMapGen_Scatter_KOG_2014.07.22-17:32:53.555744
Job User: kierano
Submission Machine: sv-deadline01
Submit Time: 07/22/2014 17:32:54
Associated Task Found: 65:[66-66]
Task’s current slave: sv-sync01-01
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.[/code]
There are two slaves assigned to these types of jobs, and they both exist on the same machine (which is a VM).
Again, two slaves on the same machine. The only way I can see that it would be network-related is if one slave process was somehow hogging some critical resource and preventing the other from updating its state in the DB in time. So far it has always been the same slave instance that has been getting marked as stalled, which I suppose could be a coincidence.
Duly noted. Burrito scouting is an invaluable resource.
Thanks for reiterating that there are 2 slaves running on the same machine. I missed that point when I first read your post.
I’m stumped as to why one slave is affected but not the other. It might be a coincidence, as you mentioned, but you would think that by now we would have seen it happen with the first slave at least once. I wouldn’t expect the issue to be a result of running multiple slaves (one slave shouldn’t be doing anything that would prevent the other from writing). As a test, would it be possible to run only the slave that has been marked as stalled in the past, and leave the “good” one offline for a day or two to see if the problem still occurs?
This happened to the other slave in the pair over the weekend. Before that, it hadn’t happened in quite a while. It seems to occur when the slaves are executing jobs with quite a few frames, and so far it also only seems to occur when both slaves are executing tasks within the same job.
I wonder if this could be related to the task state corruption issues that have been popping up around the forums.
I wouldn’t expect it to be related, since the state of the job shouldn’t affect a slave’s ability to run. You could always export the job that the slaves were working on, though, and upload it here so we can check the task counts.
One thing I just realized we haven’t confirmed yet is the actual state of a slave that is marked as stalled. Is it still running? I’m wondering if something is actually causing the slave to just crash out and die.
I took a look at the slave log for the most recent occurrence, and that may actually be a possibility.
Here’s the tail of the log from the day of the last crash. This shows a transfer task echoing out heartbeat messages, which end abruptly at 5:05:
[code]2014-08-02 05:04:33: 0: STDOUT: bbcp: 140801 12:04:32 110032% done; 0.0 MB/s, avg 1.5 B/s
2014-08-02 05:04:43: 0: STDOUT: bbcp: 140801 12:04:41 114916% done; 1.6 MB/s, avg 1.5 MB/s
2014-08-02 05:04:52: 0: STDOUT: bbcp: 140801 12:04:51 118516% done; 1.3 MB/s, avg 1.5 MB/s
2014-08-02 05:05:02: 0: STDOUT: bbcp: 140801 12:05:00 121601% done; 1.0 MB/s, avg 1.4 MB/s[/code]
The stalled slave report shows that the last slave update occurred just before 5:05:
[code]Stalled Slave: sv-sync01-02
Slave Version: v6.2.0.32 R (2563d5bc8)
Last Slave Update: 2014-08-02 05:04:52
Current Time: 2014-08-02 05:34:53
Time Difference: 30.029 m
Maximum Time Allowed Between Updates: 30.000 m[/code]
Is there a way for me to enable some kind of Mono debug mode that will write out a crash dump file or something if the process goes down? I know I could Google it, but I’m guessing you guys already have a method you use for testing.
Yup, looks like it’s crashing. Give this a try: open up your Repository Options, and under Slave Settings, disable the Gather System Resources option. Then restart these slaves so that they recognize the change right away, and see if the problem happens again.
We recently discovered that a Garbage Collection bug in Mono could cause Deadline to crash when this option is enabled. We’ve done extensive testing with Mono 3.4 and it looks like this bug has been fixed, which is why we’re shipping Mono 3.4 with Deadline 7.
If you find that disabling this option helps, you could install Mono 3.4 on this machine and then turn the option back on (we’ve tested Deadline 6.2 with Mono 3.4 for many weeks straight with no side effects).
Thanks Ryan, I’ll give that a try.