Temp directory permission woes

awilson · March 8, 2016, 5:54pm

I’m using Deadline 7.2.0.18 on Linux and have been having a lot of trouble lately. Three problems coincide to make life painful:

1. The temp directory is set with permissions 755, which means that if I kill a slave and start it as a different user it fails every task because it can’t write to the temp directory.
2. These failures seem to happen too early to actually register as a failed task, so the slave is never marked as bad. This means that a bad slave can very quickly set every job on our farm to a failed status with thousands of errors.
3. I have my repository options set to send me an email notification when a job fails, but this never happens in the above case. I do get job completion emails, so it seems like there’s something about a job being in a failed state with thousands of errors that is not triggering the send.

This has made us lose entire weekends of rendering, due to a rogue slave taking everything down. I could try to get around #1 by just adding a chmod command into all of our plugins, but that seems like a really bad workaround that is fraught with peril.
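
As an alternative to a chmod in every plugin, a wrapper script could check that the temp directory is writable before the slave is launched. Below is a minimal sketch of such a check in Python. It is not part of Deadline; `can_write_to` is a hypothetical helper, and the directory to test is whatever temp path your slaves actually use.

```python
import os
import tempfile


def can_write_to(directory):
    """Return True if the current user can create a file in `directory`."""
    try:
        # Actually creating a file is a more direct test than os.access(),
        # which checks the real uid and may not reflect what an open() does.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory as well as a permission error.
        return False
```

A launcher could call `can_write_to(...)` on the temp path and refuse to start the slave (or alert someone) when it returns False.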

Anyone have ideas on how to troubleshoot these issues? Thank you!

dwallbridge · March 8, 2016, 10:23pm

Hello,

For the first one, have you tried using the slave property to use the /tmp folder on Linux, instead of the normal temp folder? I believe this feature was brought in for this very purpose of ensuring that users aren’t locked out of access to files.

For the second, are there job reports being made when these failures happen? If so, can you share one? The slave should be detecting that it is hitting the limits and marking itself as bad, so I am curious what the slave is seeing when verbose logging is on. If a failure counts towards the job limit, it should count towards the bad slave limit. Can you verify that the mark slave as bad option is on in the repository options?

For the last one, email notifications should be sent by the machine that triggers them, so it’s harder to track down when you have a farm where many machines could have been the ones that should have sent the email. You mentioned other emails do happen, but I can’t imagine that it would be that one sided. Troubleshooting this, though, will be hard due to the way these things get sent.

I am hopeful fixing the first one will give a reprieve so we can try to look deeper at the latter two problems, though.

awilson · March 8, 2016, 11:57pm

Thanks for the response. We have the option enabled to use /tmp. I thought about unchecking that, since that would make it per-user, but that would also put all of the temp data in a user’s home folder, which in our case is sync’d to several locations. If it were possible to specify a per-user temp directory (like /tmp/Thinkbox_USER) that would get around our issue, but I don’t think I have that ability.
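
To make the per-user idea concrete, here is a minimal Python sketch of what I mean. This is only an illustration, not something Deadline exposes: `per_user_temp_dir` and its parameters are hypothetical, and the `Thinkbox_` prefix just mirrors the example path above.

```python
import getpass
import os
import tempfile


def per_user_temp_dir(base=None, user=None, prefix="Thinkbox_"):
    """Create (if needed) and return a temp directory private to `user`."""
    base = base or tempfile.gettempdir()   # /tmp on a typical Linux box
    user = user or getpass.getuser()       # the account running the slave
    path = os.path.join(base, prefix + user)
    # 0o700 keeps the directory private to its owner; exist_ok makes the
    # call safe to repeat every time the slave starts.
    os.makedirs(path, mode=0o700, exist_ok=True)
    return path
```

Because each account gets its own directory, restarting the slave as a different user never touches a directory owned by someone else.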

For the slaves not being marked as bad, I’m attaching a log from one of the failures. What’s interesting (and maybe points to the issue?) is that these failures are all available by looking at the job’s job reports, but they never show up under the task they’re trying to run. In the case of this job, the job has 275 errors, but all tasks have 0 errors.

I’m guessing that issue #3 is related to #2 - the failures aren’t getting processed properly at some point, and I’m getting to some sort of hard limit of errors on a job instead of a “softer” failed job. That’s completely gut reaction with no knowledge of the source code, though.
failure_log.txt (42.7 KB)

eamsler · March 9, 2016, 8:31pm

I think this just lands in an edge case we hadn’t considered. I think we’ll need to add a username to the temporary folder writing so there aren’t conflicts. I’ll go and make an internal dev issue for this one.

awilson · March 9, 2016, 9:19pm

That should work, as long as the parent /tmp/Thinkbox directory is world-writable.
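
For a shared parent like that, the usual convention is the same mode /tmp itself has: world-writable plus the sticky bit (1777), so users can create their own entries but not delete each other’s. A minimal sketch of a check for that, assuming Python is available on the slave; `is_shared_and_sticky` is a hypothetical helper name:

```python
import os
import stat


def is_shared_and_sticky(path):
    """Return True if `path` is world-writable and has the sticky bit set."""
    mode = os.stat(path).st_mode
    world_writable = bool(mode & stat.S_IWOTH)  # "other" write permission
    sticky = bool(mode & stat.S_ISVTX)          # only owners may delete entries
    return world_writable and sticky
```

A directory with mode 755 fails this check, which is exactly the situation that locks out the second user.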

That’ll take care of this specific problem, but I’m still somewhat wary of the ability of slaves to take down our farm without being marked as bad. Any ideas on that front? Was the log helpful at all?

eamsler · March 9, 2016, 10:23pm

Thanks for the poke on that one. I’m pretty sure the issue is that the plugin context isn’t completely initialized yet, so the exception handler likely hasn’t been registered yet either. I think we only mark as failed if there’s a problem after GetDeadlinePlugin(). I’ll throw another issue in the system for this one, it’s pretty clear-cut I think.
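
Roughly, the gap looks like the sketch below. This is a simplified Python illustration, not Deadline’s actual code: `FailureTracker`, `load_plugin`, `render`, and the `bad_after` threshold are all hypothetical names. The point is only that the plugin-loading step has to sit inside the same try block as the render for an early failure to count towards marking the slave as bad.

```python
class FailureTracker:
    """Counts consecutive task failures and flags the slave as bad."""

    def __init__(self, bad_after=3):
        self.bad_after = bad_after
        self.consecutive_failures = 0

    @property
    def marked_bad(self):
        return self.consecutive_failures >= self.bad_after

    def run(self, load_plugin, render):
        """Run one task; return True on success, False on failure."""
        try:
            # Loading the plugin is inside the try block, so a failure
            # here (e.g. an unwritable temp directory) is counted too.
            plugin = load_plugin()
            render(plugin)
        except Exception:
            self.consecutive_failures += 1
            return False
        self.consecutive_failures = 0  # a success resets the streak
        return True
```

If `load_plugin()` were called before the try block, its exception would bypass the counter entirely, which matches the symptom of a job collecting errors while the slave is never marked as bad.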

awilson · March 9, 2016, 10:26pm

Awesome. Thank you so much!

eamsler · March 11, 2016, 9:29pm

FYI, the merge request is in. We’ll have that log failure problem fixed in the 8.0 release.

We’re still figuring out the temporary folder fix. There are ideas, we just need to implement.