AWS Thinkbox Discussion Forums

Automatically disable slave if too many error frames?

Hello!
We are using a render farm and multiple workstations for rendering frames, mainly with Autodesk 3ds Max 2016.
Our problem is that occasionally a specific computer stops working correctly, for one reason or another, and causes errors in Deadline.
If a single computer is not rendering correctly, the error limit is exceeded quite quickly, which fails the job, or even multiple jobs.

Is it possible to configure Deadline 8 so that, for example, when one computer fails 10 frames in any job, that computer (slave) is disabled automatically? That way the other slaves could keep rendering the job even if one slave is having issues.

Thank you for any insight concerning this matter!

Otto

The option you’re looking for is called “Mark a Slave as Bad after it has generated this many errors for a job in a row”. You’ll find it under Repository Options > Job Settings > Failure Detection tab.

Link to documentation:
http://docs.thinkboxsoftware.com/products/deadline/8.0/1_User%20Manual/manual/failure-detection.html#slave-failure-detection

Cheers

Thank you for the amazingly quick response! I will look into this material!

Yeah Tomasz! Much appreciated from me over here. :smiley:

One day I’d like to see us have an event script that will actually disable the Slave for the entire farm. I think we have all the pieces we need from the API now, it’s just a matter of scraping together enough time to get it done.

Basic plan is:

  1. Create an OnJobError() callback
  2. Check the Slave’s last X job reports
  3. If last X job reports are failures, mark state as disabled and optionally prefix a comment like “[Disabled for failing X times]”

It would probably also be wise to check that the slave isn’t giving out errors on just one job. We’ve had users assign jobs to the wrong groups, which has caused some slaves to fail.
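The plan above, including the "more than one job" check, can be sketched in plain Python. This is only an illustration of the decision logic, not Deadline API code: the `Report` record, `should_disable()` and `disabled_comment()` are hypothetical names, and in a real event plugin the reports would have to come from the Slave's job reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    """Hypothetical stand-in for one job report of a Slave."""
    job_id: str
    is_error: bool


def should_disable(reports: List[Report], x: int, min_jobs: int = 2) -> bool:
    """Return True if the last x reports (oldest first) are all errors
    and those errors are spread over at least min_jobs different jobs."""
    if x <= 0 or len(reports) < x:
        return False
    recent = reports[-x:]
    if not all(r.is_error for r in recent):
        return False
    # Guard against a single misconfigured job (e.g. wrong group).
    return len({r.job_id for r in recent}) >= min_jobs


def disabled_comment(x: int, existing: str = "") -> str:
    """Prefix the Slave comment with the reason it was disabled."""
    prefix = f"[Disabled for failing {x} times]"
    return f"{prefix} {existing}".strip()
```

With `min_jobs=1` the function falls back to the simple "last X reports are failures" rule from step 3.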

This will be super hard to do properly. Many kinds of errors can happen, e.g. the 3ds Max scene can be corrupted, a plugin can be missing, the 3ds Max installation can be corrupted, etc. I’m not sure I would like to see a slave disabled because of a missing plugin error :slight_smile: Moreover, we run different programs on our farm (Maya, Cinema, 3ds Max), and a slave may be “not working properly” for 3ds Max but perfectly usable for the other two. This will be a tricky feature, but even if it isn’t designed for my case, others may find it useful to have slaves auto-disabled after X consecutive errors.

The feature I would like to see, with a similar effect, is slave auto-restart after X consecutive errors. I know I can set up Deadline to restart nodes after X amount of time, but that’s not exactly what I’m looking for. From time to time I find failing jobs with errors like “3ds max installation cannot be verified” (or something like that). When I log in to the machine to check what is happening, it looks like the previous job left a 3ds Max process (with an open VFB) hanging around. After a machine restart or, more precisely, after killing the 3ds Max process, everything is back to normal.

We actually have implemented such a tool. It keeps a counter internally that gets reset on successful renders without additional db queries. After 10 errors it sends out a warning email, and after 20 it disables the slave. Works pretty well!
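A counter like the one described could look roughly like this. It is a minimal sketch, not the poster's actual tool: the class and method names are made up, and sending the email or disabling the slave is left to the caller.

```python
class SlaveErrorCounter:
    """In-memory count of consecutive errors per slave (hypothetical sketch)."""

    def __init__(self, warn_at: int = 10, disable_at: int = 20):
        self.warn_at = warn_at
        self.disable_at = disable_at
        self._counts = {}  # slave name -> consecutive error count

    def record_success(self, slave: str) -> None:
        # A successful render resets the streak; no database query needed.
        self._counts[slave] = 0

    def record_error(self, slave: str) -> str:
        """Count one error and return 'warn', 'disable' or 'none'."""
        count = self._counts.get(slave, 0) + 1
        self._counts[slave] = count
        # Compare with == so each action fires once per streak.
        if count == self.disable_at:
            return "disable"
        if count == self.warn_at:
            return "warn"
        return "none"
```

The caller would send the warning email on "warn" and disable the slave on "disable".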

Well, good to know it’d be useful for some. Really, if the logic was done, we could optionally have the script send a remote command to the local Slave to restart instead of having it disable itself. That’s really just an “if” statement and one more line of code.

[WARNING: Prototype event plugin - you have been warned!]

Please find attached a custom event plugin for Deadline 8 or 9 which adds the ability to handle one or more BAD Slaves in your farm based on consecutive errors being generated by that individual Slave. A number of configurable options have been exposed to the event plugin UI to allow you to dial in the settings you prefer. See below for these options.

Unzip “HandleBadSlave.zip” into: “<your_repo>/custom/events/” so you will have a single directory called: “HandleBadSlave” which contains 2 x files: “HandleBadSlave.py” and “HandleBadSlave.param”.

Set the event plugin State=Global Enabled to enable the plugin in your farm.

I recommend you run in DEBUG mode first, so that only “print” statements are generated as an “event log report”, which will then be visible in the job log reports of a Deadline job. Once you are happy, set DEBUG=False to execute the Slave Command and, optionally, mark the Slave as disabled and add a comment to the Slave’s comment field. Note the various options for Slave Command, such as StopSlave, RestartSlave or RestartMachine.
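To make the debug behaviour concrete, here is a rough sketch of how the options could translate into actions. It is not the code from HandleBadSlave.py: `plan_actions()` is a hypothetical helper, and it assumes that "Continue" means no command is sent and that the comment option is independent of the disable option.

```python
from typing import List


def plan_actions(debug: bool,
                 slave_command: str = "Continue",
                 disable_slave: bool = False,
                 add_comment: bool = False) -> List[str]:
    """List what would happen to a Slave that has hit the threshold."""
    actions = []
    if slave_command != "Continue":  # assumption: Continue sends nothing
        actions.append(f"send {slave_command}")
    if disable_slave:
        actions.append("disable slave")
    if add_comment:
        actions.append("add comment")
    if debug:
        # Debug mode: only log the actions, execute nothing.
        return [f"DEBUG (not executed): {a}" for a in actions]
    return actions
```

For example, `plan_actions(True, "RestartSlave", True)` only describes the restart and the disable step, which is the kind of output you would expect to find in the event log report while testing.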

NOTE: Error Threshold. To clarify, the calculation currently works as follows: the integer value of the Repo Options setting under “Job Settings” --> “Failure Detection” --> “Mark a Slave as Bad after it has generated this many errors for a job in a row” is added to the Error Threshold value that you enter. So, if the Error Threshold is left at its default of 10 and the Repo Options setting at its default of 5, the event plugin is executed against a Slave once it has generated 15 errors. This means a Slave is allowed to at least try to render one job until failure before it becomes ‘watched’ by this event plugin. If you are not using the Repo Option “Mark a Slave as Bad after it has generated this many errors for a job in a row”, this event plugin simply triggers when a Slave hits the Error Threshold, which is 10 by default.
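The arithmetic in the note, written out as a small sketch. The function name and the use of `None` for "Repo Option not in use" are assumptions for illustration; the plugin's own code may differ.

```python
from typing import Optional


def effective_threshold(error_threshold: int = 10,
                        repo_bad_slave_limit: Optional[int] = 5) -> int:
    """Error count at which the event plugin acts on a Slave.

    repo_bad_slave_limit is the "Mark a Slave as Bad after it has generated
    this many errors for a job in a row" value, or None if that option is
    not in use.
    """
    if repo_bad_slave_limit is None:
        return error_threshold
    return error_threshold + repo_bad_slave_limit
```

With both defaults this gives 15, and with the Repo Option unused it gives 10, matching the two cases described above.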

Feedback is welcome.

Configurable options in HandleBadSlave event plugin:

  • [State]
    Default=Disabled
    Items=Global Enabled;Opt-in;Disabled
    Description=How this event plug-in should respond to events. If Global, all jobs and slaves will trigger the events for this plugin. If Opt-In, jobs and slaves can choose to trigger the events for this plugin. If Disabled, no events are triggered for this plugin.

  • [Debug]
    Default=False
    Description=Enable to place event plugin into debug mode. No action is executed, instead it is logged.

  • [Error Threshold]
    Default=10
    Description=Threshold number of consecutive errors a Slave can generate in a single Slave session before being deemed a Bad Slave. This threshold value will be combined with Repository setting (if enabled): “Mark a Slave as Bad after it has generated this many errors for a job in a row” value. If this total error count exceeds the threshold value, then the Slave Command will be executed.

  • [Slave Command]
    Default=Continue
    Items=Continue;StopSlave;RestartSlave;ShutdownMachine;RestartMachine
    Description=The Slave command to be executed on Slaves which exceed the Error Threshold.

  • [Disable Slave]
    Default=False
    Description=If enabled, the Slave will be disabled if the Error Threshold is exceeded.

  • [Add Slave Comment]
    Default=False
    Description=If enabled, a comment will be added to the Slave’s comment field explaining why it was disabled.

HandleBadSlave.zip (2.41 KB)
HandleBadSlave.png
