It should affect any new jobs that were submitted after applying the patch.
Hi Ryan,
Half of our slaves are stalled now, most of the errors are
‘Slave - Exception: Failed to update slaveInfo: The process cannot access the file because it is being used by another process.’
so here is an error report, I’ve changed the script back to the one we had earlier yesterday.
=======================================================
Error Message
Exception during render: An error occurred in RenderTasks(): Cannot create temporary config directory C:\Documents and Settings\Administrator\Local Settings\Application Data\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_tempbSYHa0\tempCFG_fe0
at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
=======================================================
Slave Log
ill be handled as appropriate
0: INFO: Stdout Handling Enabled: True
0: INFO: Popup Handling Enabled: True
0: INFO: Using Process Tree: True
0: INFO: Hiding DOS Window: True
0: INFO: Creating New Console: False
0: INFO: Render Executable: “S:\LW9.6_x64_TEST\Programs\WSN.exe”
0: INFO: Using local Config for FPrime
---- August 25 2011 – 09:46 AM ----
Slave - Exception: Failed to update slaveInfo: The process cannot access the file because it is being used by another process.
0: INFO: Render Argument: -3 -c"C:\Documents and Settings\Administrator\Local Settings\Application Data\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_tempbSYHa0" -d"H:\TeamTo_Plankton_Invasion_11" “C:\Documents and Settings\Administrator\Local Settings\Temp\pla_render_108_064_RGB_Vs004_LCA_0.lws” 73 84
0: INFO: Startup Directory: “S:\LW9.6_x64_TEST\Programs”
0: INFO: Process Priority: BelowNormal
0: INFO: Process is now running
0: STDOUT: WSN Layout Launcher 1.01
0: STDOUT: Using Config dir “C:\Documents and Settings\Administrator\Local Settings\Application Data\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_tempbSYHa0\LW9-64.cfg”
0: STDOUT: Using command-line Content dir “H:\TeamTo_Plankton_Invasion_11”
0: STDOUT: Scene file output file prefix found: I:/TeamTo_Plankton_Invasion_11/Renders_3D/108/108-064/Beauty/pla_render_108_064_BEAUTY_rgb_
0: STDOUT: Cannot create temporary config directory C:\Documents and Settings\Administrator\Local Settings\Application Data\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_tempbSYHa0\tempCFG_fe0
Scheduler Thread - Render Thread 0 threw an error:
Scheduler Thread - Exception during render: An error occurred in RenderTasks(): Cannot create temporary config directory C:\Documents and Settings\Administrator\Local Settings\Application Data\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_tempbSYHa0\tempCFG_fe0
at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
=======================================================
Error Type
RenderPluginException
=======================================================
Error Stack Trace
at Deadline.Plugins.Plugin.RenderTask(Int32 startFrame, Int32 endFrame)
at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask()
could you take a look?
Thanks!
hmmmmm
most of the slaves are getting stalled now…
although I’ve changed the script again back to that from yesterday, and I’ve rebooted them.
They all give error messages…
here’s a new error report:
=======================================================
Error Message
Exception during render: An error occurred in RenderTasks(): Cannot create temporary config directory C:/Users/Grid/AppData/Local/Prime Focus/Deadline/slave/jobsData/LocalFPrimeConfig_tempi9J0g0\tempCFG_82c
at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
=======================================================
Slave Log
ropriate
0: INFO: Any stdout that matches the regular expression “.Cannot create temporary config directory.” will be handled as appropriate
0: INFO: Any stdout that matches the regular expression “(Rendering frame [0-9]+).* pass ([0-9]+)[^0-9]+([0-9]+).” will be handled as appropriate
0: INFO: Any stdout that matches the regular expression “Frame completed” will be handled as appropriate
0: INFO: Stdout Handling Enabled: True
0: INFO: Popup Handling Enabled: True
0: INFO: Using Process Tree: True
0: INFO: Hiding DOS Window: True
0: INFO: Creating New Console: False
0: INFO: Render Executable: “S:\LW9.6_x64_TEST\Programs\WSN.exe”
0: INFO: Using local Config for FPrime
0: INFO: Render Argument: -3 -c"C:/Users/Grid/AppData/Local/Prime Focus/Deadline/slave/jobsData/LocalFPrimeConfig_tempi9J0g0" -d"H:/TeamTo_Plankton_Invasion_11" “C:\Users\Grid\AppData\Local\Temp\pla_render_117_102_RGB_Vs002_LCA_0.lws” 217 226
0: INFO: Startup Directory: “S:\LW9.6_x64_TEST\Programs”
0: INFO: Process Priority: BelowNormal
0: INFO: Process is now running
0: STDOUT: WSN Layout Launcher 1.01
0: STDOUT: Using Config dir “C:/Users/Grid/AppData/Local/Prime Focus/Deadline/slave/jobsData/LocalFPrimeConfig_tempi9J0g0\LW9-64.cfg”
0: STDOUT: Using command-line Content dir “H:/TeamTo_Plankton_Invasion_11”
0: STDOUT: Scene file output file prefix found: I:/TeamTo_Plankton_Invasion_11/Renders_3D/117/117-102/Beauty/pla_render_117_102_BEAUTY_rgb_
0: STDOUT: Cannot create temporary config directory C:/Users/Grid/AppData/Local/Prime Focus/Deadline/slave/jobsData/LocalFPrimeConfig_tempi9J0g0\tempCFG_82c
Scheduler Thread - Render Thread 0 threw an error:
Scheduler Thread - Exception during render: An error occurred in RenderTasks(): Cannot create temporary config directory C:/Users/Grid/AppData/Local/Prime Focus/Deadline/slave/jobsData/LocalFPrimeConfig_tempi9J0g0\tempCFG_82c
at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
=======================================================
Error Type
RenderPluginException
=======================================================
Error Stack Trace
at Deadline.Plugins.Plugin.RenderTask(Int32 startFrame, Int32 endFrame)
at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask()
and it starts to be a bit urgent… is there another way to contact thinkbox?
Thanks,
Lieven
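The updated script I sent you wouldn’t have caused the slaves to start stalling, so there must be something else going on here. This line from the slave logs explains the problem:
Slave - Exception: Failed to update slaveInfo: The process cannot access the file because it is being used by another process.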
A slave is considered stalled if it hasn’t updated its slave info for a certain amount of time (the default is 10 or 20 minutes). If something is preventing the slaves from updating this file, they will appear as stalled. It would appear these files are locked on the repository side (so rebooting the slaves won’t help). The slave info files can be found in \your\repository\slaves. Each slave has a folder here, and each one will contain a slaveInfo file.
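To see which slaves have stopped updating their info file, a small script can compare each file's modification time against a timeout. This is only a sketch under stated assumptions: the function name `find_stale_slaves`, the 600-second default, and the rule that the info file's name starts with "slaveInfo" are illustrative choices, not Deadline's own logic.

```python
import os
import time


def find_stale_slaves(slaves_root, max_age_seconds=600, now=None):
    """Return (slave_name, age_seconds) for slaves whose info file looks stale.

    slaves_root is the "slaves" folder inside the repository.
    """
    now = time.time() if now is None else now
    stale = []
    for entry in sorted(os.listdir(slaves_root)):
        slave_dir = os.path.join(slaves_root, entry)
        if not os.path.isdir(slave_dir):
            continue
        # Assumption: the info file's name starts with "slaveInfo".
        info_files = [
            name for name in os.listdir(slave_dir)
            if name.lower().startswith("slaveinfo")
        ]
        if not info_files:
            continue
        # Use the most recently written info file in the folder.
        newest = max(
            os.path.getmtime(os.path.join(slave_dir, name))
            for name in info_files
        )
        age = now - newest
        if age > max_age_seconds:
            stale.append((entry, age))
    return stale
```

Calling `find_stale_slaves(r"\your\repository\slaves")` from a machine that can see the repository would list the slave folders whose info file has not changed within the timeout.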
You could try removing the existing slaveInfo files manually, since that should clear the path for the slaves to start writing them again. Another option might be to try restarting the repository machine to see if that clears things up.
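The manual cleanup could also be scripted so that it reports which files are still locked instead of stopping at the first failure. This is a rough sketch, not a Deadline tool: `remove_slave_info` is a hypothetical helper, and matching files by the "slaveInfo" name prefix is an assumption about how the files are named.

```python
import os


def remove_slave_info(slaves_root):
    """Try to delete every slaveInfo file under slaves_root.

    Returns (removed, locked): paths that were deleted, and
    (path, error_text) pairs for files that could not be deleted.
    """
    removed, locked = [], []
    for entry in sorted(os.listdir(slaves_root)):
        slave_dir = os.path.join(slaves_root, entry)
        if not os.path.isdir(slave_dir):
            continue
        for name in sorted(os.listdir(slave_dir)):
            # Assumption: the info file's name starts with "slaveInfo".
            if not name.lower().startswith("slaveinfo"):
                continue
            path = os.path.join(slave_dir, name)
            try:
                os.remove(path)
                removed.append(path)
            except OSError as exc:
                # A file held open by another process typically fails here.
                locked.append((path, str(exc)))
    return removed, locked
```

The `locked` list then shows exactly which slave folders still have a file held open on the repository side.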
Our support contact info can be found here:
thinkboxsoftware.com/support/
Sorry we didn’t jump on this thread sooner - we were just processing our support requests in the order we received them.
Hello,
What will happen when I delete the slave info?
Will the machines stop rendering?
Is it dangerous?
Thanks,
Lieven
The machines seem to render for a while, an hour or two after reboot, and then they start to give these errors.
Would the machines, according to your explanation, give the error immediately?
I’m just a bit afraid of removing the slave info?
Thanks,
Lieven
Oh,
and I just replaced the script back to one you’ve sent me yesterday.
Nothing bad will come from deleting the slave info. This just holds the slave’s last state, and the slave will just write a new one every 20-40 seconds.
I just want to note that the Lightwave errors you’re getting are completely independent of the stalled slave problem I was referring to. Just to make sure we’re on the same page, are your slaves appearing as stalled in the Monitor, or do you mean that they get into a state where they can’t process your LW jobs anymore and just start reporting these “temp config directory” errors?
Hello,
This morning I tried to delete all the slaveInfo files.
It worked for some, and not for others.
So, I checked for a stalled slave in deadline and tried to delete it.
I received the windows message
“Cannot delete … It is being used by another person or program.
Close any programs that might be using the file and try again.”
so.
I tried to kill the deadline slave on the slave and tried to delete the slaveinfo, didn’t work.
I tried to kill the deadline launcher and tried to delete, didn’t work.
I tried to shutdown the slave and tried to delete, didn’t work.
So, I tried to edit the file:
<?xml version="1.0"?> 2011-08-26T09:54:28 Rf113 Grid fe80::c820:56f7:35cc:e490%20 00:30:48:CF:08:64 16 12875509760 8464400384 82830266368 77.142 GB 2400 x64 100 Windows 7 <_VersionString>v4.1.0.43205 R 84076.75 mv video hook driver2 false true -1 @server_projects Lightwave thibault 112_MB 999_060_001_106bba1f 60 none plankton_16core 61-72 5 0% Rendering 2011-08-26T09:43:21.3663253 Continue Running maya2010,fusion6,fusion61,3dsmax2010,3dsmax2011,fusion53,fusion51,lightwave plankton_error,plankton_16core plankton 3 51 8 666.705139 Aug 25/11 10:33:01 Slave started
and I couldn’t do it either…?
The slaves show the stalled status in the deadline monitor, and they aren’t rendering.
They keep showing the message in the slave log that the slaveInfo file can’t be accessed because it is being used by another process.
Next to that they keep giving these error logs
=======================================================
Error Message
Exception during render: An error occurred in RenderTasks(): Cannot create temporary config directory C:\Users\Grid\AppData\Local\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_temp4y6fa0\tempCFG_ce8
at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
=======================================================
Slave Log
ropriate
0: INFO: Any stdout that matches the regular expression “.Cannot create temporary config directory.” will be handled as appropriate
0: INFO: Any stdout that matches the regular expression “(Rendering frame [0-9]+).* pass ([0-9]+)[^0-9]+([0-9]+).” will be handled as appropriate
0: INFO: Any stdout that matches the regular expression “Frame completed” will be handled as appropriate
0: INFO: Stdout Handling Enabled: True
0: INFO: Popup Handling Enabled: True
0: INFO: Using Process Tree: True
0: INFO: Hiding DOS Window: True
0: INFO: Creating New Console: False
0: INFO: Render Executable: “S:\LW9.6_x64_TEST\Programs\WSN.exe”
0: INFO: Using local Config for FPrime
0: INFO: Render Argument: -3 -c"C:\Users\Grid\AppData\Local\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_temp4y6fa0" -d"H:\TeamTo_Plankton_Invasion_11" “C:\Users\Grid\AppData\Local\Temp\pla_render_114_036_RGB_Vs003_DMO_0.lws” 157 168
0: INFO: Startup Directory: “S:\LW9.6_x64_TEST\Programs”
0: INFO: Process Priority: BelowNormal
0: INFO: Process is now running
0: STDOUT: WSN Layout Launcher 1.01
0: STDOUT: Using Config dir “C:\Users\Grid\AppData\Local\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_temp4y6fa0\LW9-64.cfg”
0: STDOUT: Using command-line Content dir “H:\TeamTo_Plankton_Invasion_11”
0: STDOUT: Scene file output file prefix found: I:/TeamTo_Plankton_Invasion_11/Renders_3D/114/114-036/Beauty/pla_render_114_036_BEAUTY_rgb_
0: STDOUT: Cannot create temporary config directory C:\Users\Grid\AppData\Local\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_temp4y6fa0\tempCFG_ce8
Scheduler Thread - Render Thread 0 threw an error:
Scheduler Thread - Exception during render: An error occurred in RenderTasks(): Cannot create temporary config directory C:\Users\Grid\AppData\Local\Prime Focus\Deadline\slave\jobsData\LocalFPrimeConfig_temp4y6fa0\tempCFG_ce8
at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
=======================================================
Error Type
RenderPluginException
=======================================================
Error Stack Trace
at Deadline.Plugins.Plugin.RenderTask(Int32 startFrame, Int32 endFrame)
at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask()
Thanks,
Lieven
Hello,
I’ve uploaded 2 screenshots so that you can see what’s happening.
Another screenshot shows that a machine seems to be rendering, but still gives the message of the process which can’t be accessed…
Thanks,
Lieven
Hello,
After rebooting most of the slaves which were stalled, I could delete the slaveInfo of the specific test slave I was working on.
Thanks,
Lieven
That’s good to hear! Something must have happened that resulted in those slave files being locked. They were probably locked on the server side, which explains why rebooting the slave machines didn’t help.
I’m going to assume you’ll still run into those fPrime config directory errors. If you do, I really don’t know what else we can do at this point. We tried having it create those folders locally, and we made sure the path separators were correct, and that still didn’t resolve the problem. If fPrime could at least print out why it failed to create the folder, that could help diagnose the problem further, but that’s a change they would have to make on their end.
Cheers,
- Ryan
Hi Ryan,
Unfortunately, perhaps something went wrong in our communication, because the slaves are still stalling. The problem with the locked file keeps occurring. I’ve deleted the slaveInfo files, but after a while it starts again.
After that I can check what happens with the local config of fPrime. For example, right now 50 of our 90 slaves are stalled.
Thank you,
Lieven
Is it possible that there is an issue with the server that you’re hosting the repository on? That’s odd that these files would start locking up like this over time. What is the operating system of the machine that you’re hosting the repository on?
It’s an Isilon storage system.
It runs an operating system called OneFS 6.5.3.
Hi Ryan,
We’ve shut down every slave now and tried to delete the slaveInfo files. It didn’t work for some slaves. After we rebooted the storage system, we could delete them. We restarted Deadline everywhere, and for now the slaves seem to be rendering. We’re also emailing the support team for our storage system to find out what the problem is on the repository.
I’ll keep you posted,
Thanks,
Lieven
Hi Ryan,
We’re now renaming the slaves directory. Deadline immediately creates a new directory, and this causes the Deadline slaves to render again for a while. We’re making a script that will rename the slaves folder every hour, but this is far from ideal.
So we’re still searching for what is causing the slaveInfo files to get locked.
Further ideas?
Thank you
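For reference, the rename workaround described in this post could look roughly like the sketch below. The function name `rotate_slaves_folder` and the `_old_<timestamp>` suffix are made up for illustration, and the hourly trigger is assumed to come from an external scheduler such as Windows Task Scheduler or cron.

```python
import os
import time


def rotate_slaves_folder(repository_root, folder_name="slaves", stamp=None):
    """Rename <repository_root>/slaves so the slaves start with a fresh folder.

    Returns the new path of the renamed folder, or None if there was
    nothing to rename. The post above reports that Deadline recreates
    the folder on its own; this sketch does not create it.
    """
    current = os.path.join(repository_root, folder_name)
    if not os.path.isdir(current):
        return None
    # Hypothetical naming scheme: slaves_old_YYYYMMDD_HHMMSS.
    stamp = stamp or time.strftime("%Y%m%d_%H%M%S")
    target = os.path.join(repository_root, f"{folder_name}_old_{stamp}")
    os.rename(current, target)
    return target
```

The renamed folders accumulate over time, so a real script would also need some cleanup of old copies once their files are no longer locked.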
Wow, that’s quite the workaround. I’m short of ideas at the moment, but we’ve contacted another client of ours that uses an Isilon to see if they have any ideas. We’ll let you know when we hear back.
For the slave info file, the only process that should be writing to it is the Slave application that it belongs to. Every 20 to 40 seconds, the slave writes a temporary slaveInfo file with a random name in the same folder as the actual slaveInfo file. After writing the file, it then copies it over the actual slaveInfo file to replace it. Applications like Pulse or the Monitor will then read the file.
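The write pattern described above (write a randomly named temporary file, then copy it over the real one) can be illustrated with a short sketch. This is not Deadline's actual code: `write_slave_info`, the `.tmp` extension, and the `slaveInfo.xml` file name are assumptions made for the example.

```python
import os
import shutil
import uuid


def write_slave_info(slave_dir, contents, final_name="slaveInfo.xml"):
    """Write contents to a random temp file, then copy it over the info file."""
    os.makedirs(slave_dir, exist_ok=True)
    # Hypothetical temp naming: a random hex name in the same folder.
    temp_path = os.path.join(slave_dir, f"{uuid.uuid4().hex}.tmp")
    final_path = os.path.join(slave_dir, final_name)
    with open(temp_path, "w", encoding="utf-8") as handle:
        handle.write(contents)
    try:
        # This overwrite is the step that would fail if something else
        # holds the destination file open without allowing writes.
        shutil.copyfile(temp_path, final_path)
    finally:
        # Remove the temp file whether or not the copy succeeded.
        os.remove(temp_path)
    return final_path
```

In a scheme like this, a lock held on the final file would most likely show up at the copy step, which may be useful detail to pass along to the storage support team.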
I don’t think we’re doing anything out of the ordinary here, so I really can’t think of a reason why this would result in the file becoming permanently locked. Maybe you can pass this info on to your storage support team to see what they think.
Cheers,
- Ryan
I’m now following a new idea.
I’ve closed down one server which had a deadline pulse running.
All of a sudden, I could delete slaveInfo.
I’m not sure this is a real solution or coincidence.
I’ll keep an eye on it.
But this doesn’t seem to be the right explanation, as Deadline Pulse only reads files.
That’s interesting that Pulse would be locking the files like that, because it just reads the files…
I would definitely leave Pulse “off” for a day and see if the problem resurfaces. If it doesn’t, then it’s probably safe to assume that Pulse was the culprit here. If that’s the case, maybe you could try running Pulse on a different machine for a day or two to see if the problem is at all related to the machine that Pulse is running on.