AWS Thinkbox Discussion Forums

Custom plugin stopping at random

Hello,
Our custom plugin, running multiple concurrent tasks on the same machine, gets stuck while executing some of the tasks. The log is as follows:

2023-02-24 12:10:03:  7: INFO: Process exit code: 0
2023-02-24 12:10:03:  7: Done executing plugin command of type 'Render Task'
2023-02-24 12:10:03:  7: Render time for frame(s): 1.184 m
2023-02-24 12:10:03:  7: Total time for task: 1.192 m
2023-02-24 12:10:04:  7: Saving task log...
2023-02-24 12:10:05:  7: Render Thread - Render State transition from = 'Rendering' to = 'WaitingForTask'
2023-02-24 12:10:07:  7: Render Thread - Render State transition from = 'WaitingForTask' to = 'ReceivedTask'
2023-02-24 12:10:07:  7: Got task!
2023-02-24 12:10:07:  7: Render Thread - Render State transition from = 'ReceivedTask' to = 'Other'
2023-02-24 12:10:07:  7: Plugin already loaded: TEST_WORKER_PLUGIN
2023-02-24 12:10:07:  7: Start Job timeout is disabled.
2023-02-24 12:10:07:  7: Task timeout is disabled.
2023-02-24 12:10:07:  7: Job already loaded: Step_2_Test_Job
2023-02-24 12:10:07:  7: Plugin rendering frame(s): 160
2023-02-24 12:10:07:  7: Render Thread - Render State transition from = 'Other' to = 'Rendering'
2023-02-24 12:10:08:  7: PYTHON: Deadline has requested Job cleanup.
2023-02-24 12:10:48:  7: Timed out waiting for output from the Sandbox, stopping the Sandbox thread.
2023-02-24 12:10:48:  ERROR: 7: An exception occurred: Sandbox process exited unexpectedly while waiting for response from Plugin. (Deadline.Plugins.RenderPluginException)
2023-02-24 12:10:48:  7: Unloading plugin: TEST_WORKER_PLUGIN
2023-02-24 12:10:58:  7: Timed out waiting for output from the Sandbox, stopping the Sandbox thread.
2023-02-24 12:10:58:  7: Error in EndJob: One or more errors occurred. (Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..)
2023-02-24 12:11:30:  7: SandboxedPlugin still waiting for SandboxThread to exit

Is this the log from the task report or the worker log? If this isn’t a snippet from the Worker log, could we get the Worker log? The task report will only have what happened inside that given render thread, whereas the Worker log will include what happened in managing, and around, that same render thread.

Based on the log we’ve got, I’d guess that your database or network is getting overloaded, but that’s just a guess.

The logs will be on the machine in one of these locations:
Windows: C:\ProgramData\Thinkbox\Deadline10\logs
Linux: /var/log/Thinkbox/Deadline10
Mac OS X: /Library/Logs/Thinkbox/Deadline10

This is indeed a snippet from the worker log, and I’ll share more of it below, for other tasks as well. Almost all of the tasks complete normally; for example, out of 1600 tasks, every one will finish except one that gets stuck. When that happens the worker hangs, with the thread showing a status of EndJob, Rendering, or something else.

What we think is happening is that, roughly once in a hundred times, Deadline tries to clean up right after it loads a task. Since cleanup deletes all of the callbacks, including the render task callback we use to run the process via self.RunManagedProcess(self.Process), the task is left with nothing to execute. We run multiple applications through tasks, so each task launches a separate application that does its own work; that is why we use the render task callback to run the process.
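Roughly, the plugin structure looks like this (a simplified sketch rather than our actual code; the class, process, and executable names below are placeholders):

import sys

from Deadline.Plugins import *   # DeadlinePlugin, ManagedProcess, PluginType
from Deadline.Scripting import *


def GetDeadlinePlugin():
    return TestWorkerPlugin()


def CleanupDeadlinePlugin(deadlinePlugin):
    deadlinePlugin.Cleanup()


class TestWorkerPlugin(DeadlinePlugin):
    def __init__(self):
        if sys.version_info.major == 3:
            super().__init__()
        # RenderTasks launches one external application per task.
        self.InitializeProcessCallback += self.InitializeProcess
        self.RenderTasksCallback += self.RenderTasks

    def InitializeProcess(self):
        # Advanced plugin type: we drive the process ourselves from RenderTasks.
        self.PluginType = PluginType.Advanced

    def RenderTasks(self):
        self.Process = TestWorkerProcess(self)
        self.RunManagedProcess(self.Process)

    def Cleanup(self):
        # Cleanup removes the callbacks, so if it fires right after a task is
        # picked up, there is nothing left to run for that task.
        del self.InitializeProcessCallback
        del self.RenderTasksCallback


class TestWorkerProcess(ManagedProcess):
    def __init__(self, deadlinePlugin):
        if sys.version_info.major == 3:
            super().__init__()
        self.deadlinePlugin = deadlinePlugin
        self.RenderExecutableCallback += self.RenderExecutable
        self.RenderArgumentCallback += self.RenderArgument

    def RenderExecutable(self):
        return r"C:\path\to\our\application.exe"  # placeholder path

    def RenderArgument(self):
        return ""  # per-task arguments go here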
Some logs are below:

The first shows the log from when the plugin is first called, at the very start:

2023-02-27 10:33:27:  0: Loaded TEST_WORKER_PLUGIN
2023-02-27 10:33:27:  All job files are already synchronized
2023-02-27 10:33:27:  Synchronizing Plugin TEST_WORKER_PLUGIN from …\\DeadlineRepository10\custom\plugins\ TEST_WORKER_PLUGIN took: 0 seconds TEST_WORKER_PLUGIN
2023-02-27 10:33:28:  'C:\Users\TestUser\AppData\Local\Thinkbox\Deadline10\pythonAPIs\2022-01-07T025548.0000000Z' already exists. Skipping extraction of PythonSync.
2023-02-27 10:33:28:  0: Executing plugin command of type 'Initialize Plugin'
2023-02-27 10:33:28:  0: INFO: Executing plugin script 'C:\ProgramData\Thinkbox\Deadline10\workers\JLR1GBMW2JWDV53\plugins\63fc86231dc1bc4b8487923f\TEST_WORKER_PLUGIN.py'
2023-02-27 10:33:28:  0: INFO: Plugin execution sandbox using Python version 3
2023-02-27 10:33:28:  0: PYTHON: Deadline has requested Job Manager creation.
2023-02-27 10:33:28:  1: Loaded plugin TEST_WORKER_PLUGIN
2023-02-27 10:33:28:  All job files are already synchronized
2023-02-27 10:33:28:  0: INFO: About: TEST_WORKER_PLUGIN
2023-02-27 10:33:28:  0: INFO: The job's environment will be merged with the current environment before rendering
2023-02-27 10:33:28:  0: Done executing plugin command of type 'Initialize Plugin'
2023-02-27 10:33:28:  Plugin TEST_WORKER_PLUGIN was already synchronized.
2023-02-27 10:33:28:  0: Start Job timeout is disabled.
2023-02-27 10:33:28:  0: Task timeout is disabled.
2023-02-27 10:33:28:  0: Loaded job: Step_2_TestJob (63fc86231dc1bc4b8487923f)
2023-02-27 10:33:28:  0: Skipping K: because it is already mapped
2023-02-27 10:33:28:  0: Executing plugin command of type 'Start Job'
2023-02-27 10:33:28:  0: DEBUG: S3BackedCache Client is not installed.
2023-02-27 10:33:28:  0: INFO: Executing global asset transfer TestWorker preload script 'C:\ProgramData\Thinkbox\Deadline10\workers\TestWorker\plugins\63fc86231dc1bc4b8487923f\GlobalAssetTransferPreLoad.py'
2023-02-27 10:33:28:  0: INFO: Looking for legacy (pre-10.0.26) AWS Portal File Transfer...
2023-02-27 10:33:28:  0: INFO: Looking for legacy (pre-10.0.26) File Transfer controller in C:/Program Files/Thinkbox/S3BackedCache/bin/task.py...
2023-02-27 10:33:28:  0: INFO: Could not find legacy (pre-10.0.26) AWS Portal File Transfer.
2023-02-27 10:33:28:  0: INFO: Legacy (pre-10.0.26) AWS Portal File Transfer is not installed on the system.
2023-02-27 10:33:28:  1: Executing plugin command of type 'Initialize Plugin'
2023-02-27 10:33:28:  1: INFO: Executing plugin script 'C:\ProgramData\Thinkbox\Deadline10\workers\TestWorker\plugins\63fc86231dc1bc4b8487923f\ TEST_WORKER_PLUGIN.py'
2023-02-27 10:33:29:  0: Done executing plugin command of type 'Start Job'
2023-02-27 10:33:29:  0: Plugin rendering frame(s): 1
2023-02-27 10:33:29:  0: Render Thread - Render State transition from = 'Other' to = 'Rendering'
2023-02-27 10:33:29:  1: INFO: Plugin execution sandbox using Python version 3
2023-02-27 10:33:29:  1: PYTHON: Deadline has requested Job Manager creation.
2023-02-27 10:33:29:  0: Executing plugin command of type 'Render Task'

The second one shows a task executing normally, without any problems:

2023-02-27 10:34:32:  1: Got task!
2023-02-27 10:34:32:  1: Render Thread - Render State transition from = 'ReceivedTask' to = 'Other'
2023-02-27 10:34:32:  1: Plugin already loaded: TEST_WORKER_PLUGIN
2023-02-27 10:34:32:  1: Start Job timeout is disabled.
2023-02-27 10:34:32:  1: Task timeout is disabled.
2023-02-27 10:34:32:  1: Job already loaded: Step_2_TestJob
2023-02-27 10:34:32:  1: Plugin rendering frame(s): 34
2023-02-27 10:34:32:  1: Render Thread - Render State transition from = 'Other' to = 'Rendering'
2023-02-27 10:34:33:  1: Executing plugin command of type 'Render Task'
2023-02-27 10:36:12:  1: INFO: Process exit code: 0
2023-02-27 10:36:12:  1: Done executing plugin command of type 'Render Task'
2023-02-27 10:36:12:  1: Render time for frame(s): 43.519 s
2023-02-27 10:36:12:  1: Total time for task: 43.890 s
2023-02-27 10:36:13:  1: Saving task log...
2023-02-27 10:36:14:  1: Render Thread - Render State transition from = 'Rendering' to = 'WaitingForTask'

This part is where Deadline randomly calls job cleanup after getting a task:

2023-02-27 12:13:20:  1: Done executing plugin command of type 'Render Task'
2023-02-27 12:13:20:  1: Render time for frame(s): 26.412 s
2023-02-27 12:13:20:  1: Total time for task: 27.068 s
2023-02-27 12:13:21:  1: Saving task log...
2023-02-27 12:13:21:  1: Render Thread - Render State transition from = 'Rendering' to = 'WaitingForTask'
2023-02-27 12:13:28:  1: Render Thread - Render State transition from = 'WaitingForTask' to = 'ReceivedTask'
2023-02-27 12:13:28:  1: Got task!
2023-02-27 12:13:28:  1: Render Thread - Render State transition from = 'ReceivedTask' to = 'Other'
2023-02-27 12:13:28:  1: Plugin already loaded: TEST_WORKER_PLUGIN
2023-02-27 12:13:28:  1: Start Job timeout is disabled.
2023-02-27 12:13:28:  1: Task timeout is disabled.
2023-02-27 12:13:28:  1: Job already loaded: Step_2_TestJob
2023-02-27 12:13:28:  1: Plugin rendering frame(s): 1010
2023-02-27 12:13:28:  1: Render Thread - Render State transition from = 'Other' to = 'Rendering'
2023-02-27 12:13:29:  1: PYTHON: Deadline has requested Job Manager cleanup.
2023-02-27 12:13:29:  1: PYTHON: JOB STATUS IS: Active
2023-02-27 12:14:09:  1: Timed out waiting for output from the Sandbox, stopping the Sandbox thread.
2023-02-27 12:14:09:  ERROR: 1: An exception occurred: Sandbox process exited unexpectedly while waiting for response from Plugin. (Deadline.Plugins.RenderPluginException)
2023-02-27 12:14:09:  1: Unloading plugin: PCP_Pipeline_Worker
2023-02-27 12:14:19:  1: Timed out waiting for output from the Sandbox, stopping the Sandbox thread.
2023-02-27 12:14:19:  1: Error in EndJob: One or more errors occurred. (Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host..)
2023-02-27 12:14:51:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-02-27 12:15:21:  1: SandboxedPlugin still waiting for SandboxThread to exit

Interesting!

Which version of Deadline is this?

And could I get a copy of your plugin to look at? I’d like to see if there’s something in the code that might be off. I’ve never seen cleanup get called at random; it only happens when the task is considered complete.

If you’re not comfortable sharing that with the world, create a ticket at awsthinkbox.zendesk.com and share it there.
Thanks!

Deadline_Generic_Script.py (28.0 KB)
I am using Deadline Client Version 10.1.20.3 Release (1a0d97d0f), and here is the plugin as well.
Thanks for all the help btw.

@Justin_B Hey, did you manage to find time to look into this?

Hey, sorry about the delay. I’m not on the forums rotation at the moment, so my attention’s been elsewhere, and I’ve only had 30 minutes to look at this, so please excuse the drive-by nature of this analysis. :slight_smile:

Cleanup gets called when the plugin object is being disposed of by the render thread. That should only happen when the function connected to the RenderTasksCallback exits, and from the logs and your script it’s not clear why that’s happening. The log shows a timeout communicating with the render sandbox, which might also trigger disposal of the plugin, but I’m not sure.
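If it helps narrow that down, one option (just a suggestion, using placeholder names along the lines of the sketch you posted rather than your actual code) is to log at the entry and exit of RenderTasks and at the top of Cleanup, so the worker log records exactly when the render thread disposes of the plugin relative to your managed process:

    def RenderTasks(self):
        # Log entry and exit so the worker log shows whether Cleanup fires
        # before, during, or after the managed process runs.
        self.LogInfo("RenderTasks: starting managed process")
        try:
            self.Process = TestWorkerProcess(self)
            self.RunManagedProcess(self.Process)
        finally:
            self.LogInfo("RenderTasks: callback exiting")

    def Cleanup(self):
        self.LogInfo("Cleanup: plugin object is being disposed")
        del self.InitializeProcessCallback
        del self.RenderTasksCallback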

Given the random nature of the failure I’m not sure what to do next. Offhand, I’d see if you can re-create the failure with one of our shipped application plugins, to find out whether the problem is specific to your plugin or whether the actual rendering is creating issues with the sandbox.

Wish I had something more conclusive for you though!
