
Custom plugin directory not being honored

We use a custom plugin for all of our jobs, which does not actually exist in the Deadline repository (so we can maintain multiple parallel versions of it), and every job has its “Custom Plugin Directory” property set to point to the appropriate version. This has been working fine for the most part, but some machines don’t seem to get the message: they are still trying to synchronize the plugin from the Deadline repository’s ‘plugins’ directory.
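
For context, a per-job override like this is set at submission time. A minimal sketch of such a submission is below; the paths are illustrative, and the CustomPluginDirectory job info key name is an assumption worth checking against the manual job submission docs for your Deadline version:

[code]# Minimal sketch (not our production submission code). Assumes "deadlinecommand"
# is on PATH and that the job info file accepts a "CustomPluginDirectory" key --
# both assumptions to verify against your Deadline version's documentation.
import subprocess
import tempfile

def write_info_file(info, suffix):
    # deadlinecommand takes simple Key=Value text files for job and plugin info.
    f = tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False)
    for key, value in info.items():
        f.write("%s=%s\n" % (key, value))
    f.close()
    return f.name

job_info = {
    "Plugin": "LumaJob",
    "Name": "example job",
    # Versioned plugin location outside the repository (illustrative path).
    "CustomPluginDirectory": "/Volumes/tools/deadline/plugins/LumaJob-1.2.0",
}
plugin_info = {}  # plugin-specific keys would go here

subprocess.check_call([
    "deadlinecommand",
    write_info_file(job_info, ".job"),
    write_info_file(plugin_info, ".plugin"),
])[/code]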

I have an empty stub directory in the repository’s ‘custom/plugins’ subdirectory (which I know is a necessary workaround for my situation in the current version), but some of the slaves still seem to be looking for something in the ‘plugins’ directory.

Could not find plugin named "LumaJob", path does not exist: ..../deadline/repo-la-6/plugins/LumaJob

This raises a few questions:

  • Why is the slave looking in the ‘plugins’ directory, and erroring there?
  • Why doesn’t it appear to be checking custom/plugins?
  • Why isn’t the slave respecting the job’s Custom Plugin Directory?

Thanks

I checked our code, and we fall back to the default plugin folder if the custom one doesn’t exist. Is it possible that these problematic machines can’t “see” the custom location that the job override is pointing to?

No, it’s in a location that is mounted by basically every machine in the building. If it only happened once in a while, I would suspect a stale NFS file handle, but the fact that the same machine can generate an unbounded number of these errors (because they aren’t counted as actual task/slave errors) seems to indicate otherwise.

And it will continue to generate these errors after multiple slave restarts.

Is that restarts of just the slave application, or the machine it’s running on?

Also, is it always the same group of machines that is affected, or is it random?

Restarts of the slave application. I have confirmed that the custom plugin directory is accessible from the slave in question.

It happens a lot on workstations, so my hunch is that it has something to do with machine restarts. I have one render node that has been having the same problem: it was restarted this morning, has been up for just over 3 hours, and consistently displays the problem every time the slave process is started.
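
(By “accessible” I mean that a check along these lines, run as the slave’s user, succeeds; the path below is illustrative.)

[code]# Quick sanity check, run as the slave's user, that the per-job custom plugin
# directory can be seen and read (illustrative path).
import os

plugin_dir = "/Volumes/tools/deadline/plugins/LumaJob-1.2.0"

print("exists:  ", os.path.isdir(plugin_dir))
print("readable:", os.access(plugin_dir, os.R_OK | os.X_OK))
print("contents:", os.listdir(plugin_dir))[/code]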

Hmm… could these machines possibly be having permission issues accessing the custom plugin folder? Also, could you maybe post a larger section of the log around where this error occurs (or maybe the full slave log)?

I checked the code, and this is the search order that is used for the jobs:

  1. Custom plugin folder override for job (if there is one).
  2. Custom plugin directory in the Repository.
  3. Default plugin directory in the Repository.

So if a job has an override, and LumaJob is in the custom folder, perhaps there is a bug that is somehow allowing the slave to fall back to the default plugin folder. Seeing the full log might help track this down if that’s the case.
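
In rough pseudo-Python, the intended resolution is the following (a simplified sketch, not the actual implementation):

[code]# Simplified sketch of the search order described above; not the actual
# Deadline implementation.
import os

def resolve_plugin_dir(plugin_name, repo_root, custom_plugin_dir=None):
    candidates = []
    if custom_plugin_dir:
        # 1. Custom plugin folder override for the job.
        candidates.append(custom_plugin_dir)
    # 2. Custom plugin directory in the Repository.
    candidates.append(os.path.join(repo_root, "custom", "plugins", plugin_name))
    # 3. Default plugin directory in the Repository.
    candidates.append(os.path.join(repo_root, "plugins", plugin_name))

    for path in candidates:
        if os.path.isdir(path):
            return path
    raise RuntimeError('Could not find plugin named "%s"' % plugin_name)[/code]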

Thanks!
Ryan

Here’s a log from the same slave, with verbose logging turned on:

[code]2014-09-16 09:30:12: BEGIN - rd-078.luma-pictures.com\qmaster
2014-09-16 09:30:12: Deadline Slave 6.2 [v6.2.0.32 R (2563d5bc8)]
2014-09-16 09:30:12: The current PYTHONPATH environment variable will be modified to avoid potential Python conflicts
2014-09-16 09:30:12: Scanning for auto configuration
2014-09-16 09:30:15: Auto Configuration: No auto configuration could be detected, using local configuration
2014-09-16 09:30:15: Connecting to repository
2014-09-16 09:30:15: Info Thread - Created.
2014-09-16 09:30:16: Trying to connect using license server '@sv-lic13.luma.la'...
2014-09-16 09:30:16: License obtained.
2014-09-16 09:30:16: Purging old logs
2014-09-16 09:30:16: Scheduler Thread - Slave initialization complete.
2014-09-16 09:30:16: Scheduler - Performing Job scan on Primary Pools with scheduling order Priority, Pool, First-in First-out
2014-09-16 09:30:16: Scheduler - The 5417981d2b64553b72a63e1a limit is maxed out.
2014-09-16 09:30:16: Scheduler - Successfully dequeued 1 task(s). Returning.
2014-09-16 09:30:16: Scheduler - Returning limit stubs not in use.
2014-09-16 09:30:16: Scheduler Thread - Synchronizing job auxiliary files from /.../deadline/repo-la-6/jobs/5418626f2b64553b72a640d6
2014-09-16 09:30:16: Scheduler Thread - All job files are already synchronized
2014-09-16 09:30:16: Scheduler Thread - Synchronizing plugin files from /.../deadline/repo-la-6/plugins/LumaJob
2014-09-16 09:30:16: Scheduler Thread - Failed to synchronize plugin files because: Could not find plugin named "LumaJob", path does not exist: /.../deadline/repo-la-6/plugins/LumaJob (Deadline.Configuration.DeadlineConfigException)
2014-09-16 09:30:16: Scheduler Thread - Unexpected Error Occured
2014-09-16 09:30:16: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-09-16 09:30:16: Exception Details
2014-09-16 09:30:16: DeadlineConfigException -- Could not find plugin named "LumaJob", path does not exist: /.../deadline/repo-la-6/plugins/LumaJob
2014-09-16 09:30:16: Exception.Source: deadline
2014-09-16 09:30:16: Exception.TargetSite: Void LoadPluginFiles(System.String, System.String, System.String)
2014-09-16 09:30:16: Exception.Data: ( )
2014-09-16 09:30:16: Exception.StackTrace:
2014-09-16 09:30:16:   at Deadline.StorageDB.SettingsStorage.LoadPluginFiles (System.String pluginName, System.String customPluginDirectory, System.String slaveDestPath) [0x00000] in <filename unknown>:0
2014-09-16 09:30:16:   at Deadline.Slaves.SlaveSchedulerThread.SyncPluginFiles () [0x00000] in <filename unknown>:0
2014-09-16 09:30:16: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2014-09-16 09:30:16: Scheduler Thread - Seconds before next job scan: 5[/code]

This cycle will repeat indefinitely.

Thanks! The log allowed me to walk the exact code that is being executed here, and I can’t understand why only certain machines would have problems with this.

Is it at all possible that this Slave is pointing to a different repo (i.e. a test repo), but that repo’s dbConnect.xml file points to your production database?

Nope. The deadline.ini file matches that of other machines that are working away. And again, this slave has only had problems since the machine was rebooted just over 24 hours ago; before that, it was rendering fine. I’ve also checked its mount settings, just to make sure, but there isn’t a whole lot of room for error here, since these machines are all cloned from the same image, and all managed using Puppet.

I’ve been holding off on restarting the launcher, since I wanted to make sure I still had a reliable way to reproduce this behavior, but maybe it’s time to try that?

I guess it doesn’t hurt to try, but I can’t think of a reason why the launcher would impact the slave this way…

Sure enough, after restarting the launcher, the slave fires up and works fine.

I’m wondering if this could be related to the bug in the launcher service script. Right now, it’s impossible to stop the launcher service using that script, so Linux has no way of cleanly shutting the Deadline components down if the system is shut down.

Is it possible that an abruptly-killed launcher or slave could leave things in a state where either could have problems the next time it started up? I know the client uses some file-based locking, but I don’t know if files are used for any other varieties of IPC.
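
To illustrate the failure mode I have in mind (a purely hypothetical sketch, not Deadline’s actual code): a lock file written at startup survives a SIGKILL or power loss, while an OS-level resource like a local socket is released automatically when the process dies.

[code]# Purely hypothetical sketch of a naive lock-file scheme; not Deadline's code.
# After an abrupt kill, release_lock() never runs, so the stale file makes the
# next startup think another instance is still alive.
import os
import sys

LOCK_FILE = "/var/tmp/launcher.lock"

def acquire_lock():
    if os.path.exists(LOCK_FILE):
        # A stale lock from a killed process looks identical to a live one
        # unless the recorded PID is also checked.
        print("lock file present; assuming another instance is running")
        sys.exit(1)
    with open(LOCK_FILE, "w") as f:
        f.write(str(os.getpid()))

def release_lock():
    # Never reached after SIGKILL or a hard power-off.
    if os.path.exists(LOCK_FILE):
        os.remove(LOCK_FILE)[/code]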

Interesting… we are looking at backporting the changes made to the service script for 6.2.1, so it would be interesting to see if that helps here or not.

That file-based locking issue has already been resolved in beta 1 of 6.2.1 (we use a local socket instead).

Cheers,
Ryan

I just wanted to follow up and see what the latest status of this issue is, and whether any part of it has been tracked down to one of the Deadline applications yet.

The reason I ask is that I have backported the patches to the Linux launcher service script from 6.2.1, but we continue to see this issue with machines after they are rebooted.

There isn’t anything new to report on this one. We were never able to reproduce it here, and we were basically waiting to see if 6.2.1 would fix it or not.

Would you be able to install 6.2.1 on a few machines to see if they have this problem? I know you’re using the 6.2.1 service script, but it’s still working with the 6.2 binaries. If it still happens with a proper 6.2.1 installation, then perhaps we can get our support team involved to set up a remote session and try to figure out what’s going on.

Cheers,
Ryan

Just to add another dimension: machines that experience this problem also fail to spawn processes, even from event listener plugins:

[code]=======================================================
Error

Event Error (OnJobStarted): ApplicationName='/Volumes/sv-dev01/branches/luma/python/bin/renderbin', CommandLine='-q deadline start', CurrentDirectory='/var/tmp'

=======================================================
Type

Win32Exception

=======================================================
Stack Trace

at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in <filename unknown>:0
at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in <filename unknown>:0
at System.Diagnostics.Process.Start () [0x00000] in <filename unknown>:0
at (wrapper remoting-invoke-with-check) System.Diagnostics.Process:Start ()
at FranticX.Processes.ChildProcess.InternalLaunch (System.String executable, System.String arguments, System.String startupDirectory) [0x00000] in <filename unknown>:0
at FranticX.Processes.ChildProcess.Launch (System.String executable, System.String arguments, System.String startupDirectory) [0x00000] in <filename unknown>:0
at Deadline.Events.DeadlineEventPlugin.RunProcess (System.String executable, System.String arguments, System.String startupDirectory, Int32 timeoutMilliseconds) [0x00000] in <filename unknown>:0
at Deadline.Events.DeadlineEventListener.RunProcess (System.String executable, System.String arguments, System.String startupDirectory, Int32 timeoutMilliseconds) [0x00000] in <filename unknown>:0
at (wrapper managed-to-native) System.Reflection.MonoMethod:InternalInvoke (System.Reflection.MonoMethod,object,object[],System.Exception&)
at System.Reflection.MonoMethod.Invoke (System.Object obj, BindingFlags invokeAttr, System.Reflection.Binder binder, System.Object[] parameters, System.Globalization.CultureInfo culture) [0x00000] in <filename unknown>:0

=======================================================
Full Log

An error occurred in the "OnJobStarted" function in events plugin 'LumaEvents': An error occurred in function "OnJobStarted" the event plugin script file "/Volumes/luma/_globalAssets/production/deadline/repo-la-6/custom/events/LumaEvents/LumaEvents.py": ApplicationName='/Volumes/sv-dev01/branches/luma/python/bin/renderbin', CommandLine='-q deadline start', CurrentDirectory='/var/tmp' (Deadline.Events.DeadlineEventPluginException)
at Deadline.Events.DeadlineEventPlugin.HandlePythonError (System.String message, System.Exception e) [0x00000] in <filename unknown>:0
at Deadline.Events.DeadlineEventPlugin.OnJobStarted (Deadline.Jobs.Job job, System.String[] auxiliaryFilenames) [0x00000] in <filename unknown>:0
at Deadline.Events.DeadlineEventManager.OnJobStarted (Deadline.Jobs.Job job, System.String[] auxiliaryFilenames, Deadline.Controllers.DataController dataController) [0x00000] in <filename unknown>:0

=======================================================
Details

Date: 10/28/2014 01:33:16
Event Type: OnJobStarted
Frames: (no task)
Job Submit Date: 10/27/2014 21:15:20
Job User: joeys[/code]
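
For context, an event listener that spawns a process from OnJobStarted follows the standard pattern below (a minimal sketch, not our actual LumaEvents code; the executable and arguments are the ones from the log):

[code]# Minimal sketch of an OnJobStarted event listener that spawns a process;
# not the actual LumaEvents plugin. Executable/arguments are from the log above.
from Deadline.Events import DeadlineEventListener

def GetDeadlineEventListener():
    return LumaEventsListener()

def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()

class LumaEventsListener(DeadlineEventListener):
    def __init__(self):
        self.OnJobStartedCallback += self.OnJobStarted

    def Cleanup(self):
        del self.OnJobStartedCallback

    def OnJobStarted(self, job):
        # RunProcess(executable, arguments, startupDirectory, timeoutMilliseconds)
        # is the call that raises the Win32Exception above when Mono cannot
        # spawn the child process.
        self.RunProcess(
            "/Volumes/sv-dev01/branches/luma/python/bin/renderbin",
            "-q deadline start",
            "/var/tmp",
            -1,
        )[/code]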

Possibly, but we have auto-upgrades enabled, and since auto-configuration doesn’t work here I can’t use it to disable them. Wouldn’t all of the 6.2.1 slaves immediately try to downgrade back to 6.2.0?

Anything remote is probably out of the question. Security has gotten very tight around here thanks to some (excessively paranoid, in my opinion) Marvel demands.

You could simply disable auto-upgrades for the duration of this test. Also, if you are able to run 6.2.1 on a few machines, it would be worthwhile to install Mono 3.8 on them as well (this is the version of Mono we will be shipping with 7.0; we upgraded from 3.4 with the latest 7.0 beta release). This version of Mono will print out the underlying system error when processes fail to start like this.

Cheers,
Ryan

Is there any way to do that without editing deadline.ini on all the machines in question?

It’s a setting in the repository options:
thinkboxsoftware.com/deadlin … ient_Setup

The setting in the deadline.ini file is just meant to override the global repository option.
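
In other words, the effective value is resolved roughly like this (a sketch of the precedence only; the local key name, e.g. AutoUpdateOverride, is an assumption to verify in your deadline.ini):

[code]# Sketch of the precedence only: a non-empty local deadline.ini override wins,
# otherwise the global repository option applies. The "AutoUpdateOverride" key
# name is an assumption -- verify it against your deadline.ini / docs.
def auto_upgrades_enabled(repo_option, local_override=""):
    if local_override.lower() in ("true", "false"):
        return local_override.lower() == "true"
    return repo_option

# auto_upgrades_enabled(True)          -> True   (repository option)
# auto_upgrades_enabled(True, "False") -> False  (local deadline.ini override)[/code]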

Cheers,
Ryan

Oh right… Far too obvious. :wink:
