Naiad error handling and restart simulation

Hi,

When I run a Naiad simulation on a single machine and the simulation crashes, the slave cannot restart the sim at the frame where the crash occurred; instead, it starts over from the first frame. (Naiad handles restarts in a slightly awkward way: /path/to/naiad.exe --restart ‘frame’ ‘path to emp sequence’).

Is there a way to continue the simulation with Deadline?

thanks in advance,

G

You could probably resume the simulation by submitting a command line job to Deadline with those arguments:
thinkboxsoftware.com/deadlin … o_Deadline

While it would be nice to build this into the Naiad plugin, I’m concerned with how it would actually work. It would be pretty easy to add a “Resume” option to the submitter that lets you pick a frame, but the problem is that the first time you submit a job, you wouldn’t be using it. Then if Naiad crashes, the task would get requeued to be attempted again, but there would be nothing to indicate that the job should now be resumed from a specific frame.

For now, it might be best to override the error limit for your Naiad jobs to be 1. That way, they fail after the first crash rather than being requeued and starting again from the first frame. Then you could manually submit the command line job to resume from the frame it failed at. You’ll want to set the error limit to 1 on that resume job too.
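To work out which frame to resume from, something like the following rough sketch could scan the EMP output folder for the highest frame number already on disk. The file naming it expects (name.frame.emp, with a numeric frame) and the function name are assumptions for illustration only, so adjust the pattern to match your own sequence. Note that the last file written may be incomplete after a crash, so you may prefer to restart one frame earlier.

```python
import re
from pathlib import Path
from typing import Optional

# ASSUMPTION: files are named like "<name>.<frame digits>.emp".
# This pattern is hypothetical; change it to match your own EMP sequence.
FRAME_RE = re.compile(r"\.(\d+)\.emp$")


def last_written_frame(emp_dir: str) -> Optional[int]:
    """Return the highest frame number among the .emp files in emp_dir.

    Returns None if no file in the folder matches the assumed pattern.
    """
    frames = []
    for path in Path(emp_dir).iterdir():
        match = FRAME_RE.search(path.name)
        if match:
            frames.append(int(match.group(1)))
    return max(frames) if frames else None
```

The returned frame number is what you would pass to Naiad's --restart argument when submitting the resume job.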

The option to override the error limit can be specified in the job info submission file:
thinkboxsoftware.com/deadlin … _Info_File

OverrideJobFailureDetection=true
FailureDetectionJobErrors=1

For the command line submission of the resume job, you would add these additional properties in the command line:

-prop OverrideJobFailureDetection=true -prop FailureDetectionJobErrors=1
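If you end up scripting the resubmission, the argument pieces could be put together along these lines. This is only a sketch: the helper names are made up for illustration, and it only builds the parts mentioned in this thread (the two -prop overrides and Naiad's --restart arguments). The rest of the command line submission should follow the documentation linked above.

```python
from typing import List


def failure_props(max_errors: int = 1) -> List[str]:
    """Build the -prop arguments that override the job's error limit."""
    return [
        "-prop", "OverrideJobFailureDetection=true",
        "-prop", f"FailureDetectionJobErrors={max_errors}",
    ]


def naiad_restart_args(frame: int, emp_sequence: str) -> List[str]:
    """Build the Naiad arguments to restart at a frame.

    Mirrors the form given earlier in this thread:
    naiad.exe --restart <frame> <path to emp sequence>
    """
    return ["--restart", str(frame), emp_sequence]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(failure_props()))
    print(" ".join(naiad_restart_args(42, "/path/to/emp/sequence")))
```

Keeping the arguments as lists avoids quoting problems if the EMP path contains spaces and you later pass them to a subprocess call.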

Hope this helps.

Cheers,

- Ryan

Thank you very much Ryan.
I'll try to implement automatic resubmission of the job…