AWS Thinkbox Discussion Forums

Setting PostJobScript causes the final two tasks to never render

Heya,
I have a piece of Python code that sets the PostJobScript on a job, and when it does, the final frame of the render as well as the post-job script itself are ‘never’ rendered (at least I haven’t found a way to get them to render, even when the job’s priority is higher than anything else on the farm).

For context, the job is set to a pool that only has disabled workers in it. The Python code is then run: it enables one of the workers in this pool, sets the post-job script, and activates the worker. The worker then runs the job but stops at the last frame of the job before diverting to any other task it can find. The intention is that when the post-job script runs, it disables the worker once more; this works, but only if the post-job script is set manually. If the job is then manually failed, its status is listed as corrupted. Can anyone help out with this?

Thanks,
Stephen

def EnableWorker(slave_name):
    # MakeChangesToWorkers is a module-level dry-run flag; when it is 0 we only log.
    slave_settings = RepositoryUtils.GetSlaveSettings(slave_name, True)
    prependum = "Fake"
    if MakeChangesToWorkers == 1:
        slave_settings.SlaveEnabled = True
        RepositoryUtils.SaveSlaveSettings(slave_settings)
        prependum = ""
    print(prependum, "Enabling:", slave_name)

def set_JobPostJobScript(jobId, pythonFileName):
    job = RepositoryUtils.GetJob(jobId, True)
    prependum = "Fake"
    if MakeChangesToWorkers == 1:
        RepositoryUtils.SetPostJobScript(job, pythonFileName)
        prependum = ""
    print(prependum, "Set", jobId, "postJobScript to:", pythonFileName)

This is the worker log (apologies for the spam); I don't understand it, but hopefully someone does:

2023-03-23 17:32:09:  0: Plugin rendering frame(s): 4
2023-03-23 17:32:09:  0: Render Thread - Render State transition from = 'Other' to = 'Rendering'
2023-03-23 17:32:10:  0: Executing plugin command of type 'Render Task'
2023-03-23 17:32:10:  0: INFO: Waiting until maya is ready to go
2023-03-23 17:32:10:  0: STDOUT: mel: READY FOR INPUT
2023-03-23 17:32:11:  0: INFO: >This is a Render Job
2023-03-23 17:32:11:  0: INFO: Rendering with redshift
2023-03-23 17:32:11:  0: INFO: Rendering to network drive
2023-03-23 17:32:11:  0: INFO: Creating melscript to execute render
2023-03-23 17:32:11:  0: INFO: Executing script: C:\Users\deadline\AppData\Local\Temp\tmpCFD4.tmp
2023-03-23 17:32:11:  0: INFO: Waiting for script to finish
2023-03-23 17:32:11:  0: STDOUT: mel: Loading scene: Z:/_STAFF/StephenS/Maya/Test_CupAndPencils.mb
2023-03-23 17:32:11:  0: WARNING: Strict error checking off, ignoring the following error or warning.
2023-03-23 17:32:11:  0: STDOUT: Error: file: C:/Users/deadline/AppData/Local/Temp/tmpCFD4.tmp line 6: Unable to dynamically load : C:/ProgramData/Redshift/Plugins/Maya/2020/nt-x86-64/redshift4maya.mll
2023-03-23 17:32:11:  0: STDOUT: The specified procedure could not be found.
2023-03-23 17:32:11:  0: WARNING: Strict error checking off, ignoring the following error or warning.
2023-03-23 17:32:11:  0: STDOUT: Error: file: C:/Users/deadline/AppData/Local/Temp/tmpCFD4.tmp line 6: The specified procedure could not be found.
2023-03-23 17:32:11:  0: STDOUT:  (redshift4maya)
2023-03-23 17:32:11:  0: WARNING: Strict error checking off, ignoring the following error or warning.
2023-03-23 17:32:11:  0: STDOUT: Error: file: C:/Users/deadline/AppData/Local/Temp/tmpCFD4.tmp line 46: Render failed.
2023-03-23 17:32:12:  0: STDOUT: mel: READY FOR INPUT
2023-03-23 17:32:12:  0: Done executing plugin command of type 'Render Task'
2023-03-23 17:32:12:  0: Render time for frame(s): 3.296 s
2023-03-23 17:32:12:  0: Total time for task: 4.095 s
2023-03-23 17:32:13:  0: Saving task log...
2023-03-23 17:32:14:  0: Render Thread - Render State transition from = 'Rendering' to = 'WaitingForTask'
2023-03-23 17:32:14:  Scheduler Thread - Render Thread 0 completed its task
2023-03-23 17:32:14:  Scheduler Thread - Scheduler State transition from = 'PreRendering' to = 'PostRendering'
2023-03-23 17:32:14:  Scheduler Thread - Scheduler State transition from = 'PostRendering' to = 'EndJob'
2023-03-23 17:32:14:  Scheduler Thread - Scheduler State transition from = 'EndJob' to = 'WaitingForJob'
2023-03-23 17:32:14:  Scheduler Thread - Seconds before next job scan: 1
2023-03-23 17:32:15:  Scheduler Thread - Performing pending job scan...
2023-03-23 17:32:15:  Skipping pending job scan because it is not required at this time
2023-03-23 17:32:15:  Scheduler Thread - Performing repository repair...
2023-03-23 17:32:15:  Skipping repository repair because it is not required at this time
2023-03-23 17:32:15:  Scheduler Thread - Performing house cleaning...
2023-03-23 17:32:15:  Skipping house cleaning because it is not required at this time
2023-03-23 17:32:15:  Scheduler Thread - Scheduler State transition from = 'WaitingForJob' to = 'LicenseCheck'
2023-03-23 17:32:15:  Scheduler Thread - Scheduler State transition from = 'LicenseCheck' to = 'LicenseConfirmed'
2023-03-23 17:32:15:  Scheduler - Previously-acquired limits: [Deadline.LimitGroups.LimitGroupStub]
2023-03-23 17:32:16:  Scheduler - Preliminary check: 'Worker1' is not in allow list for limit '641a1c007a9e0c1ca832919c'
2023-03-23 17:32:16:  Scheduler - Scheduler - Acquired limit '641be5bff7175463a4fdb715'
2023-03-23 17:32:16:  Scheduler - Successfully dequeued 1 task(s) for Job '641be5bff7175463a4fdb715'.  Returning.
2023-03-23 17:32:17:  Scheduler - Job scan acquired 1 tasks in 1s after evaluating 3 different jobs.
2023-03-23 17:32:17:  Scheduler - Limits held after Job scan: [641be5bff7175463a4fdb715]
2023-03-23 17:32:17:  0: Shutdown
2023-03-23 17:32:17:  0: RenderThread CancelCurrentTask called, will transition from state None to None
2023-03-23 17:32:17:  0: Exited SlaveRenderThread.ThreadMain(), cleaning up...
2023-03-23 17:32:17:  0: Executing plugin command of type 'Cancel Task'
2023-03-23 17:32:17:  0: Done executing plugin command of type 'Cancel Task'
2023-03-23 17:32:17:  0: Executing plugin command of type 'End Job'
2023-03-23 17:32:17:  0: INFO: Ending Maya Job
2023-03-23 17:32:17:  Listener Thread - ::ffff:10.1.129.48 has connected
Success
2023-03-23 17:32:17:  0: INFO: Waiting for Maya to shut down
2023-03-23 17:32:17:  0: INFO: Maya has shut down
2023-03-23 17:32:17:  0: Done executing plugin command of type 'End Job'
2023-03-23 17:32:17:  0: Stopped job: Test_CupAndPencils
2023-03-23 17:32:17:  0: Unloading plugin: MayaBatch
2023-03-23 17:32:20:  0: Render Thread - Render State transition from = 'WaitingForTask' to = 'PreInitializing'
2023-03-23 17:32:20:  0: Shutdown
2023-03-23 17:32:20:  0: Render Thread - Render State transition from = 'PreInitializing' to = 'Initializing'
2023-03-23 17:32:20:  0: Initialized
2023-03-23 17:32:20:  Scheduler - Returning limit stubs not in use.
2023-03-23 17:32:20:  Scheduler Thread - Job's Limit Groups: 
2023-03-23 17:32:20:  0: Render Thread - Render State transition from = 'Initializing' to = 'WaitingForTask'
2023-03-23 17:32:20:  Scheduler Thread - Scheduler State transition from = 'LicenseConfirmed' to = 'LicenseCheck'
2023-03-23 17:32:20:  Scheduler Thread - Scheduler State transition from = 'LicenseCheck' to = 'LicenseConfirmed'
2023-03-23 17:32:20:  Scheduler Thread - Scheduler State transition from = 'LicenseConfirmed' to = 'StartJob'
2023-03-23 17:32:21:  0: Render Thread - Render State transition from = 'WaitingForTask' to = 'ReceivedTask'
2023-03-23 17:32:21:  Scheduler Thread - Scheduler State transition from = 'StartJob' to = 'PreRendering'
2023-03-23 17:32:21:  0: Got task!
2023-03-23 17:32:21:  0: Render Thread - Render State transition from = 'ReceivedTask' to = 'Other'
2023-03-23 17:32:21:  0: Plugin will be reloaded because a new job has been loaded.
2023-03-23 17:32:21:  0: Loading Job's Plugin timeout is Disabled
2023-03-23 17:32:21:  0: SandboxedPlugin: Render Job As User disabled, running as current user 'deadline'
2023-03-23 17:32:23:  'C:\Users\deadline\AppData\Local\Thinkbox\Deadline10\pythonAPIs\2022-07-22T224802.0000000Z' already exists. Skipping extraction of PythonSync.
2023-03-23 17:32:24:  0: Loaded plugin MayaBatch
2023-03-23 17:32:24:  All job files are already synchronized
2023-03-23 17:32:24:  Synchronizing Plugin MayaBatch from Y:\_Deadline_Repository\plugins\MayaBatch took: 0 seconds
2023-03-23 17:32:24:  0: Executing plugin command of type 'Initialize Plugin'
2023-03-23 17:32:24:  0: INFO: Executing plugin script 'C:\ProgramData\Thinkbox\Deadline10\workers\Worker1\plugins\641be5bff7175463a4fdb715\MayaBatch.py'
2023-03-23 17:32:24:  0: INFO: Plugin execution sandbox using Python version 3
2023-03-23 17:32:24:  0: INFO: About: Maya Batch Plugin for Deadline
2023-03-23 17:32:24:  0: INFO: The job's environment will be merged with the current environment before rendering
2023-03-23 17:32:24:  0: Done executing plugin command of type 'Initialize Plugin'
2023-03-23 17:32:24:  0: Start Job timeout is disabled.
2023-03-23 17:32:24:  0: Task timeout is disabled.
2023-03-23 17:32:24:  0: Loaded job: REDACTED

Redshift is unable to load, due to missing or incompatible libraries.

Hello Stephen

Thanks for reaching out. The Python code you are running to change the job’s task list might be corrupting your job; this forum post talks about the cause of corruption. It is not recommended to change a job’s tasks at render time.
What runs that Python script on the job? For example, is it a pre-job script or an OnJobSubmitted event?
I see you are using pools and have disabled Workers; could you use groups instead?


Cheers for the reply. The correct plugin is there, so I'm not sure why it would be using different procedures specifically for the last frame, and only when the inputs are made by code rather than by hand. Cheers for the insight though.

Hi @zainali,
Thanks for the reply. For the record, my code does not assign or change jobs at all; that is left up to Deadline to do.

  • The exception I think you're alluding to here would be adding the postJobScript, but I don't understand how that would be any different from adding it via Modify Job Properties (which does work fine). That said, all of this occurs before any task on the job is rendered. Also, the job does not show as corrupted until it is manually failed after the last two tasks don't run.
  • The reason this (kind of) has to be run on the job is that I don't currently want to interface with Maya and was hoping to keep everything within Deadline for the time being. A job doesn't know it needs to run the code until I tell it to, so it runs via the right-click menu on a job that is in the queue.
  • I haven't really explored using groups, as for my use case they didn't seem too different from pools. The code is intended to enable and disable parent and child workers depending on the resource needs of an artist. Child workers are parents split in two with CPU and GPU affinity, so if a child is running but the parent is needed, the children need to be disabled before the parent can be enabled to run on the job. Parent enabling and child disabling is run via a post-task script on the children's jobs that disables the child upon completion of the task and enables the parent once all children are disabled (definitely a render-time change, but this isn't implemented yet, so we'll see; a rough sketch of that kind of script follows this list). This process is reversed with the job-completion script on the parent's job, which re-enables the children and disables the parent. Rather than enabling/disabling, would you recommend moving workers in and out of groups, something else entirely, or is there a better way to achieve all of this?
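For illustration, a rough sketch of the disable-child / enable-parent direction of that post-task script (not production code; the worker names are placeholders, and it assumes the usual __main__ entry point that Deadline scripts use):

from Deadline.Scripting import RepositoryUtils

PARENT_WORKER = "Worker1"                               # placeholder parent worker
CHILD_WORKERS = ["Worker1-child-a", "Worker1-child-b"]  # placeholder child workers

def set_worker_enabled(name, enabled):
    settings = RepositoryUtils.GetSlaveSettings(name, True)
    settings.SlaveEnabled = enabled
    RepositoryUtils.SaveSlaveSettings(settings)

def __main__(*args):
    # Disable the child that just finished, then hand the hardware back to the
    # parent worker once every child is disabled.
    set_worker_enabled(CHILD_WORKERS[0], False)  # placeholder: the child that ran this task
    all_children_off = all(
        not RepositoryUtils.GetSlaveSettings(child, True).SlaveEnabled
        for child in CHILD_WORKERS
    )
    if all_children_off:
        set_worker_enabled(PARENT_WORKER, True)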

Sticking to the points here:

  • We know setting post-job scripts via the API can corrupt jobs, even though setting the script in the Monitor works without issue. The 'is this job corrupted' check doesn't run until the job state changes, so if the job shows as corrupted after it gets marked failed, it's likely the job was corrupted earlier. Instead, set the post-job script at job creation (a submission-time sketch follows this list).

  • Given what you've described, setting the script at job creation isn't workable in your case. Instead, you could have that right-click script duplicate the existing job, add the script, and submit the new job, then delete the old script-less job. That's also not a great solution, so a little more context on why the Workers need to be split might be helpful. I assume you've looked at concurrent tasks and they didn't work out?

  • For the difference between groups and pools: in short, pools are for job priority (both start with P :slightly_smiling_face: ) and groups are for hardware and software limitations. In longer form, this blog post breaks down the three ways to manage/limit jobs. So you'd have job pools to make sure the most important jobs get done first, and groups to separate the machines with and without GPUs. For this setup you could put the parent and child Workers into their own groups.
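As a rough illustration of the 'set it at job creation' route (the values below are placeholders, not taken from this thread): one way is to include the PostJobScript key in the job info file handed to deadlinecommand at submission time, assuming deadlinecommand is on the PATH.

import subprocess
import tempfile

def write_info_file(entries):
    # Deadline submission info files are plain key=value text files.
    with tempfile.NamedTemporaryFile("w", suffix=".job", delete=False) as f:
        for key, value in entries.items():
            f.write("{}={}\n".format(key, value))
    return f.name

job_info = {
    "Plugin": "MayaBatch",
    "Name": "Test_CupAndPencils",
    "Frames": "1-10",                                  # placeholder frame range
    "Pool": "parent_pool",                             # placeholder pool
    "PostJobScript": r"Z:\scripts\disable_worker.py",  # set at creation instead of patched in later
}
plugin_info = {
    "SceneFile": "Z:/_STAFF/StephenS/Maya/Test_CupAndPencils.mb",
    "Version": "2020",
}

subprocess.check_call(["deadlinecommand", write_info_file(job_info), write_info_file(plugin_info)])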

I think the issue you're having is due to setting the post-job script via the API and corrupting the job, and disabling a Worker that's currently running will make it fail the task it's working on.

If I could get a job archive where you’re seeing this behaviour I’d be able to better confirm what’s going on. If you’re not comfortable putting that online, you can send that file in via the ticket system.

As for a better solution, Deadline’s really not built to do this sort of dynamic enabling/disabling of Workers while they’re active. Is it just Maya+Redshift getting run on this farm? What makes a job appropriate for the child Workers versus the parent workers?


Hi @Justin_B and thank you so much for the response.

  • Great to know about that. Just curious, but is there a way to run Monitor scripts with the API, similar to remote Worker commands (a remote repository command)?
  • I am completely ignorant of many aspects of Deadline, and no one here has really touched this sort of stuff, so I'm mostly going in blind. If I get desperate I will look into the job copying, but obviously I'd like to keep this as simple as I can and work within what Deadline is meant to do (aware of the irony as I try to force Deadline to do things it's not intended for!). The splitting is essentially to allow our workers to handle both more computationally intensive tasks and less intensive tasks without getting caught up on those smaller tasks. I suspect this current route was taken as a bit of a band-aid fix, as no one has properly investigated concurrent tasks, pools, pool priorities, and secondary pools, and currently we lack the knowledge to design such a system.
  • Thanks, that's pretty much my understanding now as well. As previously mentioned, these are not being used effectively right now: groups are basically used for location (workers render to separate servers which take time to sync, so people generally render with their local machines to their local server using these groups), and pools are a bit of a mess. I don't think using groups for parents and children would work, because what's preventing both the parent and the child from taking on jobs and overloading the machine? That is what the enabling and disabling is intended to do, but how do I ensure only the parent OR the child takes on jobs based on the needs of the farm?

Will for sure look into sending in an archive if there isn't a better way, but I think we know what the issue is anyhow. Jobs are from Maya, Redshift, Houdini, Arnold, and Nuke.

When I was talking about a better solution, I meant having the farm (more specifically the more powerful workers) efficiently handle more resource-intensive tasks as well as less resource-intensive tasks when needed. I honestly suspect that the approach itself is wrong, which tends to happen when ignorant people come up with a plan lol. If you have any suggestions or guidance on a better pool design, that would be awesome. We basically have three tiers: a couple of 256GB RAM machines that can be split into 2x128GB or 4x64GB or a mixture of the two (I know RAM isn't split like this, but we weren't sure how else to "give" larger jobs access to more RAM), a couple of 128GB machines that can each be split into two 64GB machines, and then a few 64GB machines. Ideally, when larger jobs come up in the queue, it would ensure that the children are freed up so the 128GB workers can take the job. The splitting is achieved with GPU and CPU affinities to ensure they're not overlapping on hardware, and at the moment we manually enable and disable the workers as needed.

Cheers for any guidance,
Stephen

  • Nope, afraid not.
  • I'd look into concurrent tasks. Truthfully, set a machine to have only the big Worker, modify one of your existing jobs to have 2 concurrent tasks, and see how it goes. You might be just fine without this extra layer of dynamic enabling/disabling fun.
  • Deadline doesn't have a mechanism for this, to be truthful. I'd do the test with two concurrent tasks before thinking harder about building around it.

What you'd love to have is slot-based scheduling, where you can say how many slots a job needs and how many slots a Worker has to work with. I think Son of Renderman has that? I'm majorly mis-remembering. Deadline just doesn't have tooling to account for that.

What you could do is make 3 groups to match your tiers and submit jobs to the appropriately sized group. However, that doesn't account for big machines doing small jobs; that would be fine in principle, but Workers won't dequeue tasks outside of their group.

A couple of folks have taken a crack at this in the past, and I haven't found a good way to pull it off.

Thank you for replying to my whims. What's the difference between concurrent tasks and multiple Worker instances on the one computer? I'll pass it by the team and see what they say. As you already saw, I'm looking into using a separate Python task to draw the workers I want away, so I'll see how these approaches turn out.
Cheers,
Stephen

If a job has concurrent tasks set, the Worker will start multiple render sandboxes and work on multiple tasks at a time. You can set an upper limit of concurrent tasks per Worker as well, so you can cap how many tasks your smaller machines can take on.
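For example, a minimal sketch of bumping an existing job to two concurrent tasks from a Deadline script (assuming the scripting API is available in that context; the job ID below is just the one from the log above):

from Deadline.Scripting import RepositoryUtils

def set_concurrent_tasks(job_id, count):
    # Load the job, raise its concurrent task count, and save it back to the repository.
    job = RepositoryUtils.GetJob(job_id, True)
    job.JobConcurrentTasks = count
    RepositoryUtils.SaveJob(job)

set_concurrent_tasks("641be5bff7175463a4fdb715", 2)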

The question did pertain to the difference between concurrent tasks and multiple Worker instances. From my understanding, they appear to be very similar in many respects. Anyhow, I appreciate the replies.
