
Tk-nuke-writenodes, sgtk bootstrapping and concurrent tasks

Hey All,

I’m currently trying to update our farm to handle running tk-nuke-writenodes from distributed sgtk configs*. This is mostly working… except when a Worker runs concurrent tasks. All the tasks start at once, which creates a “race condition”: as sgtk bootstraps, every task tries to read/write the same files on disk to localise the sgtk config, and that causes a range of errors.

Ideally, we’d want the first task in a set of concurrent tasks to run a little ahead of the others, so that one task can do the bootstrapping… then, once that’s done, the other concurrent tasks can be released to run.
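
Something along these lines is what I’m picturing: a plain lock file so whichever task gets there first does the localisation while the rest wait. This is only a sketch; the lock path and timeout are placeholders of mine, not part of any Deadline or sgtk API.

```python
# Rough sketch: serialise the sgtk bootstrap across concurrent tasks on
# one machine with an exclusive lock file. Lock path/timeout are placeholders.
import errno
import os
import time

LOCK_PATH = os.path.join(os.path.expanduser("~"), ".sgtk_bootstrap.lock")

def acquire_bootstrap_lock(timeout=600.0):
    """Block until the lock file can be created exclusively."""
    start = time.time()
    while True:
        try:
            # O_EXCL makes creation atomic: exactly one process wins.
            return os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except OSError as exc:
            if exc.errno != errno.EEXIST:
                raise
            # A crashed task can leave a stale lock; the timeout stops us
            # from waiting forever in that case.
            if time.time() - start > timeout:
                raise RuntimeError("Timed out waiting for bootstrap lock")
            time.sleep(1.0)

def release_bootstrap_lock(fd):
    os.close(fd)
    os.remove(LOCK_PATH)

# The first task to acquire the lock localises the config; the others
# block here, then bootstrap against the already-warm local caches.
fd = acquire_bootstrap_lock()
try:
    pass  # run the sgtk bootstrap here
finally:
    release_bootstrap_lock(fd)
```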

I’ve tried using Resource Limits to handle this, with “Release at Task Progress” set to 1.0%, in the hope that it would do exactly that… but in my tests Deadline doesn’t react when the limit is released and never allows the other tasks to start.

Has anyone else had success controlling for this kind of scenario?

I’m currently thinking that I might need to do something in the OnSlaveRenderingCallback event that runs a bootstrap of its own to make sure the files are localised first. However, this isn’t ideal, as it means the job bootstraps twice: once just to localise files safely, then again to actually initialise sgtk within the Nuke session so it can load the tk-nuke-writenodes in a script.
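
For the curious, this is roughly the shape I’m imagining for that event plugin. The class name, the plugin id and the SGTK_PROJECT_ID environment key are placeholders of mine, and the sketch assumes sgtk is importable on the Worker.

```python
# Sketch of a Deadline event plugin that pre-bootstraps sgtk so the
# config/core caches are localised before Nuke bootstraps for real.
from Deadline.Events import DeadlineEventListener

def GetDeadlineEventListener():
    return PreBootstrapSgtk()

def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()

class PreBootstrapSgtk(DeadlineEventListener):
    def __init__(self):
        # Fires on the Worker when it starts rendering a job.
        self.OnSlaveRenderingCallback += self.OnSlaveRendering

    def Cleanup(self):
        del self.OnSlaveRenderingCallback

    def OnSlaveRendering(self, slaveName, job):
        import sgtk
        # Authenticate however your site does it; a script user is typical.
        user = sgtk.authentication.ShotgunAuthenticator().get_default_user()
        mgr = sgtk.bootstrap.ToolkitManager(sg_user=user)
        mgr.plugin_id = "basic.shell"  # assumed plugin id
        # SGTK_PROJECT_ID is a made-up env key set on the job at submission.
        project_id = int(job.GetJobEnvironmentKeyValue("SGTK_PROJECT_ID"))
        # bootstrap_engine localises the config to disk as a side effect.
        engine = mgr.bootstrap_engine(
            "tk-shell", entity={"type": "Project", "id": project_id}
        )
        engine.destroy()
```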

I appreciate that the other possible solution here would be to have a centralised farm config, but I’d rather avoid that if possible as it just means having to manage more configs.

Thanks
Tom FW
Pipeline Supervisor - Union VFX

*NB: Our long term goal is to actually ditch using these nodes entirely as they’re a pain in the a** anyway. Also for a number of reasons we’d rather avoid using the “convert to write node” options that SG provides.

Hello @Tomfw

Thanks for reaching out. If I understood it right, you need to run a bootstrap script for every task, and since you are using concurrent tasks the Workers are bumping into each other and hitting file locks?

With concurrent tasks, a Worker renders two or more tasks at the same time. Is it the same Worker that tries to access the files at the same time?

When you use Resource Limits what exactly happens? Also, what is the usage level of the Resource Limit set to?

Try setting it to the Task level and releasing the Worker at Task progress.

Hey @zainali,

Thanks for getting back to me :)

Yeah I think you’ve summarised it well.

When you use Resource Limits what exactly happens?

What I’m finding is that the tasks just render one at a time when I have the Resource Limit set up as you suggested, with the usage level set to Task and “Release at Task Progress” set to 1.0%.

My understanding of the documentation is that with those settings I should get the following:

When a Job starts, it will be limited to rendering one Task only, until that first Task reaches 1% complete, at which point another Task can be started.

When I’m watching the Limits (in a Deadline Monitor panel) as the job progresses, I can see that the Resource Limit is released… but the Job itself doesn’t seem to recognise this change and doesn’t allow a new Task to be started. Instead, the Job waits until the first Task has fully completed before it allows another Task to start.

The jobs I’m rendering are relatively lightweight; in my examples the task render time is 3-5 minutes. Is the problem I’m facing a limitation of how quickly Deadline can communicate back and forth between the Limits and the Jobs/Tasks?

Thanks
Tom.

It is weird that the job did not recognize the limit. Try setting the release progress to 10%; there used to be a bug in this area which was resolved in later Deadline versions. What version of Deadline is this? Also, look in the Worker logs around when the limit was released. It could be a bug in progress reporting or something else.

I think if you can increase the frames per task, it will slow down the I/O per minute and might help reduce the lockups.
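
For example, on the submission side a larger chunk size groups more frames into each task, so fewer tasks bootstrap at once. The values below are only an illustration:

```python
# Illustrative only: a manual-submission job info file where ChunkSize
# controls frames per task. 100 frames with ChunkSize 10 gives 10 tasks,
# so far fewer simultaneous bootstraps than 100 single-frame tasks.
job_info = {
    "Plugin": "Nuke",
    "Name": "example_comp",  # placeholder job name
    "Frames": "1001-1100",
    "ChunkSize": "10",
    "ConcurrentTasks": "2",
}
with open("job_info.job", "w") as handle:
    for key, value in job_info.items():
        handle.write("{0}={1}\n".format(key, value))
# Then submit with: deadlinecommand job_info.job plugin_info.job
```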

I am not sure what the last question means. If a limit is applied to a job, a Worker will only acquire a stub (one unit of the resource) when a stub is available. Can you elaborate on the question?
