
slave error threshold - self disable

It would be nice if slaves that produce errors on multiple jobs without completing a single task would disable themselves, or mark themselves bad somehow…

Deadline 6 is so fast that having 10 badly configured slaves can quickly mean all queued jobs get marked failed within five minutes, before you even notice you have an issue :)

Do you have Bad Slave Detection enabled in the Repository Options?
thinkboxsoftware.com/deadlin … _Detection

If it's enabled, a slave will move on from a job once it generates the specified number of consecutive errors on it.
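Conceptually, the behaviour is something like the following. This is just an illustrative sketch, not the actual implementation; the class and method names are made up, and 5 is the default error threshold:

=======================================================
# Illustrative sketch of consecutive-error Bad Slave Detection.
# NOT Deadline's actual code; names are hypothetical.

ERROR_THRESHOLD = 5  # the Bad Slave Detection error count from Repository Options

class JobSlaveTracker:
    """Tracks one slave's consecutive errors on one job."""

    def __init__(self, slave_name):
        self.slave_name = slave_name
        self.consecutive_errors = 0

    def on_task_error(self, bad_slave_list):
        self.consecutive_errors += 1
        if self.consecutive_errors >= ERROR_THRESHOLD:
            # The slave is added to the job's bad slave list and
            # stops dequeuing tasks from this job.
            bad_slave_list.add(self.slave_name)

    def on_task_completed(self):
        # A successfully completed task resets the consecutive count.
        self.consecutive_errors = 0

# Example: a slave erroring 5 times in a row gets marked bad for the job.
bad_slaves = set()
tracker = JobSlaveTracker("LAPRO0474")
for _ in range(5):
    tracker.on_task_error(bad_slaves)
print(bad_slaves)  # {'LAPRO0474'}
=======================================================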

Yep, we do. I'm not sure it works, though: single slaves generated way more than 5 errors per job (see the attached image of a single slave's error reports). Also, none of the failed jobs have any bad slaves listed.

In our case, 30 new slaves were added to the Deadline farm, and all of them generated errors for about 20 minutes before they mysteriously fixed themselves (very odd; the errors were all Python related, and it seemed like the site-packages folder was not being found properly).
Even if Bad Slave Detection had worked, the 30 machines would have quickly plowed through all our queued jobs, accumulating 5 errors each: 30 slaves × 5 errors is 150 errors per job, easily pushing the jobs past their 100-error failure limit.

The Python error they generated is below; it fails while loading some common Python libs. JobPreLoad is supposed to run in a separate, 'system-like' Python environment, right?

=======================================================
Error in StartJob: job preload script “C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline6\slave\LAPRO0474\plugins\52549021c3f6ebd220d9b922\JobPreLoad.py”: Python Error: ImportError : No module named scanline (Python.Runtime.PythonException)
Stack Trace:
[’ File “none”, line 8, in \n’]
(System.Exception)
at FranticX.Scripting.PythonNetScriptEngine.a(Exception A_0)
at FranticX.Scripting.PythonNetScriptEngine.ExecuteScript(String scriptName, String script)
at Deadline.Scripting.DeadlineScriptManager.CreateScopeFromFile(String scopeName, String scriptFile, Boolean addGlobalFunctions, Boolean redirectToScriptManagerListener)
at Deadline.Plugins.ScriptPlugin.a(String A_0, String A_1, String A_2)
at Deadline.Plugins.ScriptPlugin.a(String A_0, String A_1, String A_2)
at Deadline.Plugins.ScriptPlugin.d(String A_0)
at Deadline.Plugins.ScriptPlugin.StartJob(Job job, String& outMessage, AbortLevel& abortLevel)

=======================================================
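For reference, a defensive version of the JobPreLoad would look something like this. A sketch only, assuming the standard __main__(deadlinePlugin) entry point; the site-packages path is a placeholder for wherever the scanline package actually lives:

=======================================================
# JobPreLoad.py -- defensive sketch. The site-packages path below is a
# placeholder; point it at wherever the 'scanline' package really lives.
import os
import sys

def __main__(deadlinePlugin):
    # The Slave runs this script in its own embedded Python, so studio
    # site-packages folders are not necessarily on sys.path.
    site_packages = r"\\server\pipeline\python\site-packages"  # hypothetical
    if os.path.isdir(site_packages) and site_packages not in sys.path:
        sys.path.append(site_packages)

    try:
        import scanline  # noqa: F401
    except ImportError:
        # Fail with a clear message instead of a bare ImportError.
        deadlinePlugin.FailRender(
            "JobPreLoad: could not import 'scanline'; checked " + site_packages
        )
=======================================================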

What's the reattempt Frequency % in the Failure Detection settings? If it's not 0, you can set it to 0 so that slaves never reattempt jobs they've been marked bad for. That's my guess as to why the slaves generated more than the 5-error limit.

It's set to 0. Also, none of the jobs got any bad slaves added to them, so I think the mechanism isn't working right now…

Maybe it's because they were still running an older beta? (Our image still has beta 4, and the slaves usually self-update when they get enabled.)

That’s probably it. There was a bug that broke this feature in previous betas, and it was fixed in beta 5. I tested this today with beta 7, and it definitely seems to be working properly.
