
Some job/task behavior questions

Hi.



One of our leads is trying to do some maintenance on a large number of Max files using Deadline, by setting up jobs that address organizational sets of the files. Each task should, ideally, open one Max file, run a MaxScript on it, and save it again.



He’s having some trouble, apparently because some of the organizational sets contain as many as 15,000 Max files. He’s running into an apparent limit of about 1881 tasks per job. Also, since some of the Max file modifications fail for various reasons, he’s finding that when he groups 8-10 files per task, tasks fail partway through, and then the entire job fails once enough tasks have failed.



The behavior he is looking for would be one file per task, ~15,000 tasks per job, with Deadline completely ignoring task failures and ensuring that each task gets attempted before the entire job completes. This seems like it should be feasible. Any chance you could give us some pointers on how to set this up (job properties, etc.)? Also, is there an undocumented limit on the number of tasks per job?



Thanks,



Sean

Hi Sean,



Yes, there is an undocumented task limit. We will look at removing this limit for the next release.



In the meantime, could you possibly submit multiple jobs (say, with 1500 tasks each), where each task across all the jobs handles its own separate Max file? For example, if you were rendering normally, job 1 would render frames 1-1500, job 2 would render frames 1501-3000, and so on.
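
As a very rough sketch, the splitting could be scripted along these lines. It assumes the jobs are submitted by handing deadlinecommand a job info file and a plugin info file; the network path, the FileList key, and the "3dsmax" plugin name below are placeholders for whatever your actual setup uses.

import glob
import os
import subprocess

TASKS_PER_JOB = 1500
# Placeholder path; point this at wherever the Max files to process live.
max_files = sorted(glob.glob(r"\\server\projects\maintenance\*.max"))

for start in range(0, len(max_files), TASKS_PER_JOB):
    chunk = max_files[start:start + TASKS_PER_JOB]
    index = start // TASKS_PER_JOB

    # File list for this job; the idea is that task N processes line N.
    list_file = os.path.abspath("file_list_%03d.txt" % index)
    with open(list_file, "w") as f:
        f.write("\n".join(chunk))

    # Job info file: one Max file per task.
    job_info = "job_info_%03d.txt" % index
    with open(job_info, "w") as f:
        f.write("Plugin=3dsmax\n")  # or the name of your custom script plugin
        f.write("Name=Max maintenance %d-%d\n" % (start + 1, start + len(chunk)))
        f.write("Frames=%d-%d\n" % (start + 1, start + len(chunk)))
        f.write("ChunkSize=1\n")  # one file per task
        f.write("IgnoreFailedJobDetection=true\n")   # keep per-file errors from
        f.write("IgnoreFailedTaskDetection=true\n")  # failing the whole job

    # Plugin info contents depend entirely on your plugin; FileList is a
    # made-up key standing in for however your script finds its input.
    plugin_info = "plugin_info_%03d.txt" % index
    with open(plugin_info, "w") as f:
        f.write("FileList=%s\n" % list_file)

    # Submit the pair with deadlinecommand (assumed to be on the PATH).
    subprocess.call(["deadlinecommand", job_info, plugin_info])

Each job then covers its own contiguous 1500-file slice, so the per-job task count stays under the current limit, and the frame number of a failed task tells you exactly which file needs a second look.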



You can prevent jobs from failing by opening their Properties dialog in the Monitor (right-click -> Modify Properties) and choosing to ignore failed job and failed task detection.



Cheers,



Ryan Russell

Frantic Films Software

http://software.franticfilms.com/

(204)949-0070

Ryan,

Thanks for the response. Sean posted his questions on my behalf. I've been able to resolve most of my file processing issues, but I'm still having trouble controlling how many times a task (in my case, a single file/frame) will be attempted before it is marked as failed. I would like to be able to say that a given task will be retried 5 times before it is flagged as failed. If I enable "Disable Failed Task Detection", does that mean a task will be retried indefinitely? Or does it mean that the per-job failed task limit (300 in our case) will be ignored for the purposes of failing the job?

It seems like I should be able to do this; I'm just not exactly clear on which options I need to set and where they are located.

thanks,

Jared

Hi Jared,



There are two types of failure detection: per task and per job. The limits for both can be configured in the Repository Options in the Monitor (Tools -> Repository Options while in super user mode). These are completely separate from each other, and you can choose to ignore them separately in the Job Properties dialog.



So yes, if you choose to ignore failed task detection, a task can error indefinitely without ever being marked as failed. However, if the job as a whole reaches 300 errors, it will still fail (unless you choose to ignore failed job detection as well).



I should mention that there is a feature in Deadline where, if a slave reports an error on a job 5 consecutive times, it will mark that job as bad for itself and move on to try other jobs (note that this doesn’t affect how other slaves render the same job). The purpose of this feature is to prevent slaves from getting hung up on jobs that they likely have no chance of rendering successfully. In the next release, this functionality will be customizable, so you will be able to set the number of times a slave can consecutively error on a job, have individual jobs ignore this feature, or disable it altogether.



Cheers,



Ryan Russell

Frantic Films Software

http://software.franticfilms.com/

(204)949-0070

OK, now I understand. I'm running into issues with the slave self-removal behavior. I've disabled Failed Job Detection, but after about 1200 task failures all of the slaves have marked the job as bad for themselves, so now it just sits in the queue with no slaves left to work on it.

I guess it comes down to what I consider "failing" in the case of .max file processing. I have a number of conditions upon which I'm failing the render through script. I guess I could just log the error and return true instead. It would be nice to be able to override that self-removal, though.

It would also be nice to be able to set "Disable Failed Job Detection" when I submit the script job. Right now I submit a bunch of jobs as suspended and then disable Failed Job Detection through the Deadline Monitor.

- Jared

Right now, the only way to “clear” the self-removal cache for the slaves is to restart them. You can use the “Restart slave after last task” remote control option to do so without interrupting any rendering. The option to override this feature (or disable it completely) will be included in the next release.



Are you using a custom submission script to submit your script jobs to Deadline? If so, you could always add the ability to ignore the failed job detection to your own submission script. In the submission info file (where you specify stuff like Pool, Frames, Plugin, etc), you can add these keys:



IgnoreFailedJobDetection=<true/false>

IgnoreFailedTaskDetection=<true/false>



You can either add a checkbox for these to your submission UI, or just default them to true. For a complete list of keys that you can specify in the submission info file, check out this link:

http://software.franticfilms.com/index.aspx?page=deadline/command/submissionfile
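
For reference, a minimal submission info file for one of these script jobs might look something like this (the job name, pool, priority, and frame range below are just placeholders, and "3dsmax" stands in for whatever plugin your script jobs actually use):

Plugin=3dsmax
Name=Max maintenance batch 01
Pool=none
Priority=50
Frames=1-1500
ChunkSize=1
IgnoreFailedJobDetection=true
IgnoreFailedTaskDetection=true

With those last two keys set at submission time, there would be no need to submit the jobs suspended and toggle the settings by hand in the Monitor afterwards.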



Cheers,



Ryan Russell

Frantic Films Software

http://software.franticfilms.com/

(204)949-0070
