When I run a Naiad simulation on a single machine and the simulation crashes, the slave cannot restart the sim at the frame where the crash occurred; instead, it starts again from the first frame. (Naiad handles this a bit awkwardly: /path/to/naiad.exe --restart ‘frame’ ‘path to emp sequence’).
Is there a way to continue the simulation with Deadline?
While it would be nice to build this into the Naiad plugin, I’m concerned with how it would actually work. It would be pretty easy to add a “Resume” option to the submitter that lets you pick a frame, but the problem is that the first time you submit a job, you wouldn’t be using it. Then if Naiad crashes, the task would get requeued to be attempted again, but there would be nothing to indicate that the job should now be resumed from a specific frame.
For now, it might be best to override the error limit for your Naiad jobs to be 1. That way, they fail after the first crash, rather than being requeued and starting the simulation again from the first frame. Then you could manually submit a command line job that resumes from the frame it failed at. You’ll want to set the error limit to 1 on that command line job too, so a second crash doesn’t rerun the same restart command.
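As a rough illustration of the manual resume step, here is a small Python sketch that builds the restart command line quoted in the question. It only assumes the `--restart <frame> <emp sequence>` form given above. The helper names, and especially the EMP file naming pattern (`name.<frame>.emp`) used to guess the last frame written to disk, are assumptions for illustration and not something Naiad or Deadline defines, so adjust them to match your own output files.

```python
import re


def last_frame_in(filenames, pattern=r"\.(\d+)\.emp$"):
    """Return the highest frame number found in the given file names.

    The default pattern assumes names like 'body.0042.emp'. This naming
    scheme is an assumption; change the pattern to fit your sequence.
    Returns None if no name matches.
    """
    frames = []
    for name in filenames:
        match = re.search(pattern, name)
        if match:
            frames.append(int(match.group(1)))
    return max(frames) if frames else None


def build_restart_command(naiad_exe, frame, emp_sequence):
    """Build the argument list for the restart call quoted in the question:
    /path/to/naiad.exe --restart <frame> <path to emp sequence>
    """
    return [str(naiad_exe), "--restart", str(frame), str(emp_sequence)]


if __name__ == "__main__":
    # Hypothetical file names, as they might appear in the EMP output folder.
    names = ["body.0001.emp", "body.0002.emp", "body.0003.emp", "notes.txt"]
    frame = last_frame_in(names)
    # Placeholder paths, copied from the question rather than real locations.
    cmd = build_restart_command("/path/to/naiad.exe", frame, "path to emp sequence")
    print(" ".join(cmd))
```

The resulting argument list is what you would paste into the command line job. Keep in mind that the last file on disk may be incomplete if the crash happened mid-write, so it can be safer to restart from one frame earlier.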