AWS Thinkbox Discussion Forums

Bug with starting the monitor

Hi!

Yet another bug on Fedora, which this time is more serious because it affects all systems. The Monitor crashes rather often on Fedora (especially after using the “refresh” button). Sometimes after such a crash, restarting the Monitor reports a “stalled slave not giving up its lock”. It names a specific job; after deleting that exact job and launching the Monitor again, it finds yet another job with the locked-up slave. The only remedy is deleting all jobs, which is kind of ridiculous. We’re using Mono 2.6.7 on FC12. Any idea which logs you would need to look into this problem?

Is this with the current beta 2 build? If so, it might be worth trying to upgrade one or two machines to Mono 2.10.5 if possible to see if that improves stability. If not, which version of Deadline are you using?

The Monitor logs can be found in [Deadline Client Folder]/logs. Just grab the most recent monitor log after the crash occurs and post it.
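For anyone who wants to script the "grab the most recent monitor log" step, here is a minimal Python sketch. It assumes the log files are named like `deadlinemonitor*.log` (as the attachment name later in this thread suggests) and that you pass in the path to the `logs` folder yourself; the function name is made up for illustration and is not part of Deadline.

```python
from pathlib import Path
from typing import Optional


def newest_monitor_log(logs_dir: str) -> Optional[Path]:
    """Return the most recently modified deadlinemonitor*.log in logs_dir.

    Returns None if no matching file exists. The file name pattern is an
    assumption based on the log attached in this thread.
    """
    candidates = list(Path(logs_dir).glob("deadlinemonitor*.log"))
    if not candidates:
        return None
    # Pick the file with the latest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Call it with the path to your client install's `logs` folder, e.g. `newest_monitor_log("/path/to/DeadlineClient/logs")` (placeholder path), right after the crash and before restarting the Monitor, so that the newest file is the one from the crashed session.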

Out of curiosity, when the crash occurs, does it prevent the Monitor on other machines from starting because the files are “locked”? Have you tried restarting the machine the crash occurred on to see if that cleans things up (rather than having to delete all the jobs)? Just trying to get a better idea of what the problem might be.

Thanks!

  • Ryan

Yes, that’s exactly what I meant by saying that all systems are affected. A Monitor crash on Fedora sometimes triggers this bug, and afterwards I cannot start the Monitor on either Fedora or Windows.

We’re using Build 2. I cannot find any logs on either Windows or Linux. Restarting the machines doesn’t help much, as we have a network distribution of FC12, so basically every machine runs the exact same system, and Deadline was installed only once there, as it is shared throughout the network. I’ll try to update the Mono library.

Thanks!

Where are you looking for the logs on Linux? There should be a “logs” folder in the Deadline Client install folder on the slaves. Is that not the case? We don’t need the logs from a Windows machine, since the crashing isn’t occurring there. Note that we’re interested in the log from a session where the Monitor initially crashed, not when starting up the Monitor again afterwards (where you get the locked file error). We want to try and figure out why the crash locks things up in the first place.

I forgot that you had the unique setup where you use a network distribution on your nodes. I should note that we have never tested (or designed) Deadline to work in such a setup, and as far as we know, no other clients of ours are doing this. I’m not saying this shouldn’t work, but I’m wondering if this setup is the reason why a single crashed monitor locks up the entire system. We’ve heard of the Monitor crashing on Linux before, but we’ve never heard of it bringing down the entire farm.

Just to confirm, is your repository on a separate server, or is it on the same network distribution as the render nodes? Also, is the network distribution only for render nodes, or is it used on workstations as well?

Note: for the mono update, we’re just wondering if this will reduce or prevent the monitor crashing in the first place. If the monitor still crashes with the latest mono version, I wouldn’t be surprised if things still get locked up.

Cheers,

  • Ryan

Ok, I’ve done some testing. It seems that I’m now able to run the Monitor on Fedora, but on Windows it still fails to launch. I’m unable to crash the Monitor on Fedora right now though :slight_smile: As soon as it happens I’ll send you the log.

About the network distribution: basically I’ve installed the Repository and the Client on the same system, though the Repository itself (since it’s a folder) is physically on another machine, so at startup our system just mounts it. The distribution is the same for all computers here, render nodes as well as workstations.

On Windows, are you getting the same locked file error you were getting on Linux? Out of curiosity, what did you end up doing to get the Monitor starting up on Linux now?

With this setup, does it mean that if one “node” locks a file, that file is now locked for all nodes? In normal cases, you could probably just reboot that one node to sever the lock, but like you said, that won’t work because all nodes are the same node. It’s not just a concern from a Deadline perspective, because we’ve seen rendering software like Max or Nuke lock network files as well. When this happened to us in production, we could often just reboot the offending node to clean things up (in rare cases, we actually had to unlock the file from the server side).

@1. I’ll try to be more specific next time and fully describe the steps for reproducing this bug, sorry about that. It’s too random right now. The fact is, it started happening when we began developing our custom plugin for Houdini, so maybe the issue is there. I deleted all jobs, and the Monitor now runs fine on both systems. When the issue comes back I’ll post an exact description of what happened.

@2. Well, it is still the machine that runs the system, so when you have a slave on Render01 and another on Render02, each should lock only its own slave. As I understand it, back in Deadline 5.0 only one slave per machine was allowed, so there was no conflict, since every slave and machine had its own specific name. We’re not using a multi-slave setup now, so it shouldn’t be an issue, just as it wasn’t earlier.

In the meantime, I’ll report two minor Mono/Deadline Monitor issues I’ve found:
a) In the Monitor, after opening anything that creates a window (logs, configuration windows, etc.), pressing ESC crashes it every time.
b) At 1920x1200, the “Submit” menu extends past the bottom of the screen, so it’s impossible to open the Monitor submission window for applications low in the list, such as “XSI”. A simple scrollbar would do. The menu’s dropdown list is also sometimes erratic: when you move the mouse cursor over the list, the highlight often jumps to options that are not under the cursor. I can provide screenshots if necessary.

Thanks!

  1. Sounds like a plan! :slight_smile:

  2. I was referring to a locked file in the repository, which isn’t related to the multi-slave feature. For example, let’s say we had 3 machines: 2 nodes and a server. Let’s say node 1 locks a file on the server (due to a crash, for example). In the current situation, node 2 cannot access the same file. Normally, rebooting node 1 would be enough to sever that lock so that node 2 could access it again.

My concern is that because of the network distribution setup, it’s not enough to just reboot node 1 in this situation. Now I could be completely wrong here, as I don’t have any experience with this type of setup, so please let me know if my concerns make no sense at all! :slight_smile:

Now on to the bug reports:
a) I’ve logged this as a bug and we will try to reproduce. I’m actually out of the office right now, so I don’t have access to any of our Linux VMs. Is it possible to send a log in the meantime? What you can do is press ESC to get the crash to occur. Then restart the Monitor and select Help -> Explore Log Folder. Find the monitor log from the session where the monitor just crashed and post it. This log should hopefully contain the error message.

b) This is a known issue, and is a limitation of the menu controls under Mono. You can deal with this problem by removing menu items from the Submit menu that you don’t use. This can be done from the Repository Options in the Monitor:
thinkboxsoftware.com/deadlin … enus_Setup

Cheers,

  • Ryan

I’ve tried repeatedly to reproduce the Monitor crashing issue when pressing ESC, but I’m not having any luck. We’ll have to see the log to get a better understanding of why the crash is occurring. If you open a terminal and run ‘deadlinemonitor’ from the command line, all error messages should be dumped to stdout. So if you can reproduce the crash and send us the stdout, that would be great!
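If copying from the terminal is awkward, the output can be captured to a file instead. The following Python sketch is one way to do that; it assumes `deadlinemonitor` is on your PATH, and the helper name and output file name are purely illustrative. It merges stderr into the same file, on the assumption that some runtime warnings may be written there rather than to stdout.

```python
import subprocess


def run_and_capture(command, output_path):
    """Run command and write its combined stdout/stderr to output_path.

    Returns the process exit code. Blocks until the process exits
    (for the Monitor, that means until it is closed or crashes).
    """
    with open(output_path, "w") as out:
        proc = subprocess.run(command, stdout=out, stderr=subprocess.STDOUT)
    return proc.returncode
```

For example, `run_and_capture(["deadlinemonitor"], "monitor_output.txt")` starts the Monitor; after reproducing the crash, `monitor_output.txt` holds whatever was printed during that session.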

Cheers,

  • Ryan

I’ve opened the Monitor, viewed the error report of some random job, and hit the ESC key.

I’ve attached the log from the installation directory, but apparently it does not record any error. However, here is the output from the terminal from which I launched the Monitor:

** (/opt/packages/Deadline5.1b2/bin/deadlinemonitor.exe:2862): WARNING **: System.Net.Sockets.SocketOptionName 0x1b is not supported at IPv6 level

** (/opt/packages/Deadline5.1b2/bin/deadlinemonitor.exe:2862): WARNING **: System.Net.Sockets.SocketOptionName 0x1b is not supported at IPv6 level
UpdateAll (uimanager) !!
CheckPathMapping: Swapped "Z:\PROD\dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1" with "/PROD/dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1"
CheckPathMapping: Swapped "Z:\PROD\dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1" with "/PROD/dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1"
CheckPathMapping: Swapped "Z:\PROD\dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1" with "/PROD/dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1"
CheckPathMapping: Swapped "Z:\PROD\dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1" with "/PROD/dev\sandbox\user\wojak\software\xsi\Render_Pictures\wojak\deadline_test_5_1"
System.NullReferenceException: Object reference not set to an instance of an object
  at FranticXForms.CustomControls.CustomRichTextBox.OnKeyDown (System.Windows.Forms.KeyEventArgs e) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.Control.ProcessKeyEventArgs (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.Control.ProcessKeyMessage (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.TextBoxBase.WndProc (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.RichTextBox.WndProc (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at FranticXForms.CustomControls.CustomRichTextBox.WndProc (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.Control+ControlWindowTarget.OnMessage (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.Control+ControlNativeWindow.WndProc (System.Windows.Forms.Message& m) [0x00000] in <filename unknown>:0 
  at System.Windows.Forms.NativeWindow.WndProc (IntPtr hWnd, Msg msg, IntPtr wParam, IntPtr lParam) [0x00000] in <filename unknown>:0 
Main window closing
 Listener Thread - OnConnect: Listener Socket has been closed.

deadlinemonitor(Grafika03)-2011-10-10-0000.log (1.34 KB)

Thanks for the log. This looks similar to an error that we fixed in beta 3. I see from the monitor log that you’re still on beta 2, so you should upgrade and let us know if the problem persists.

Cheers,

  • Ryan

Ok, update on accessing the monitor:

After submitting a Nuke job while Houdini/Mantra jobs were rendering, I launched the Monitor and it came up with this window:

An error occurred while gathering the repository data:
Error in file: /repository/jobs/999_050_005_19f511cb/999_050_005_19f511cb.job (System.Exception)

It's likely that there is a stalled slave in the network (...)

It’s worth noting that another machine with the Monitor already running has no problem refreshing (through F5).

From the output of the Monitor:

Exception Details
Exception -- Error in file: /repository/jobs/999_050_005_19f511cb/999_050_005_19f511cb.job
Exception.Source: franticx
Exception.TargetSite: System.Object ReadXmlFile(System.String, System.Type, Int32, Int32)
Exception.Data: ( )
  Exception.StackTrace: 
  at FranticX.Xml.XmlUtils.ReadXmlFile (System.String fileName, System.Type type, Int32 attempts, Int32 millisecondsBetweenAttemps) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.JobStorage.LoadJobFromFile (System.String filename) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.JobStorage.LoadJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.InternalJobStorageCache.OnFetchValueViaKey (System.Object key) [0x00000] in <filename unknown>:0 
  at FranticX.Collections.KeyValueCache.GetValueViaCacheEntry (FranticX.Collections.CacheEntry cacheEntry) [0x00000] in <filename unknown>:0 
  at FranticX.Collections.KeyValueCache.GetValueViaKey (System.Object key) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.InternalJobStorageCache.GetJob (System.String jobId) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.JobStorageCache.GetJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at Deadline.Controllers.DeadlineController.RetrieveJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at DeadlineForms.Controls.JobListView.RefreshList (Boolean visibleOnly) [0x00000] in <filename unknown>:0 
  at DeadlineForms.Controls.JobListView.UpdateAll (Boolean visibleOnly) [0x00000] in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) DeadlineForms.Controls.JobListView:UpdateAll (bool)
  at DeadlineForms.Controls.JobListView.UpdateAll () [0x00000] in <filename unknown>:0 
  at Deadline.Monitor.MonitorManager.UpdateAll () [0x00000] in <filename unknown>:0 
  at DeadlineMonitor.DeadlineMonitorApp.Main (System.String[] args) [0x00000] in <filename unknown>:0 

reporting '[Deadline Monitor 5.1 Exception] - Exception' via email...
ErrorLog.WriteExceptionLog() >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Date Stamp
  CurrentDate: Thursday, 03 November 2011
  CurrentTime: 17:01:32

Exception Details
Exception -- Error in file: /repository/jobs/999_050_005_19f511cb/999_050_005_19f511cb.job
Exception.Source: franticx
Exception.TargetSite: System.Object ReadXmlFile(System.String, System.Type, Int32, Int32)
Exception.Data: ( )
  Exception.StackTrace: 
  at FranticX.Xml.XmlUtils.ReadXmlFile (System.String fileName, System.Type type, Int32 attempts, Int32 millisecondsBetweenAttemps) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.JobStorage.LoadJobFromFile (System.String filename) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.JobStorage.LoadJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.InternalJobStorageCache.OnFetchValueViaKey (System.Object key) [0x00000] in <filename unknown>:0 
  at FranticX.Collections.KeyValueCache.GetValueViaCacheEntry (FranticX.Collections.CacheEntry cacheEntry) [0x00000] in <filename unknown>:0 
  at FranticX.Collections.KeyValueCache.GetValueViaKey (System.Object key) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.InternalJobStorageCache.GetJob (System.String jobId) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.JobStorageCache.GetJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at Deadline.Controllers.DeadlineController.RetrieveJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at DeadlineForms.Controls.JobListView.RefreshList (Boolean visibleOnly) [0x00000] in <filename unknown>:0 
  at DeadlineForms.Controls.JobListView.UpdateAll (Boolean visibleOnly) [0x00000] in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) DeadlineForms.Controls.JobListView:UpdateAll (bool)
  at DeadlineForms.Controls.JobListView.UpdateAll () [0x00000] in <filename unknown>:0 
  at Deadline.Monitor.MonitorManager.UpdateAll () [0x00000] in <filename unknown>:0 
  at DeadlineMonitor.DeadlineMonitorApp.Main (System.String[] args) [0x00000] in <filename unknown>:0 

ErrorReporting Watches
  "Version": v5.1.0.45606 R[String]
  "lastRefreshedJob_A": 999_050_005_154c149b[String]
  "lastRefreshedJob_B": 999_050_005_154c149b[String]
  "DeserializeXMLFileName": /repository/jobs/999_050_005_19f511cb/999_050_005_19f511cb.job[String]

Trace2Cache: 
  
Exception Details
Exception -- Error in file: /repository/jobs/999_050_005_19f511cb/999_050_005_19f511cb.job
Exception.Source: franticx
Exception.TargetSite: System.Object ReadXmlFile(System.String, System.Type, Int32, Int32)
Exception.Data: ( )
  Exception.StackTrace: 
  at FranticX.Xml.XmlUtils.ReadXmlFile (System.String fileName, System.Type type, Int32 attempts, Int32 millisecondsBetweenAttemps) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.JobStorage.LoadJobFromFile (System.String filename) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.JobStorage.LoadJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.InternalJobStorageCache.OnFetchValueViaKey (System.Object key) [0x00000] in <filename unknown>:0 
  at FranticX.Collections.KeyValueCache.GetValueViaCacheEntry (FranticX.Collections.CacheEntry cacheEntry) [0x00000] in <filename unknown>:0 
  at FranticX.Collections.KeyValueCache.GetValueViaKey (System.Object key) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.InternalJobStorageCache.GetJob (System.String jobId) [0x00000] in <filename unknown>:0 
  at Deadline.Storage.Caches.JobStorageCache.GetJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at Deadline.Controllers.DeadlineController.RetrieveJob (System.String jobId, Boolean archived) [0x00000] in <filename unknown>:0 
  at DeadlineForms.Controls.JobListView.RefreshList (Boolean visibleOnly) [0x00000] in <filename unknown>:0 
  at DeadlineForms.Controls.JobListView.UpdateAll (Boolean visibleOnly) [0x00000] in <filename unknown>:0 
  at (wrapper remoting-invoke-with-check) DeadlineForms.Controls.JobListView:UpdateAll (bool)
  at DeadlineForms.Controls.JobListView.UpdateAll () [0x00000] in <filename unknown>:0 
  at Deadline.Monitor.MonitorManager.UpdateAll () [0x00000] in <filename unknown>:0 
  at DeadlineMonitor.DeadlineMonitorApp.Main (System.String[] args) [0x00000] in <filename unknown>:0 



0 Other Running Threads

Process Threads

Memory and CPU Stats
  GC.TotalMemory: 12.465 MB
  Environment.WorkingSet: 0 Bytes
  ComputerSystem.TotalPhysicalMemory: 7.816 GB
  ComputerSystem.FreePhysicalMemory: 5.352 GB

Application Information
  Application.ExecutablePath: /opt/packages/deadline_5.1b4/bin/deadlinemonitor.exe
  Application.CurrentDirectory: /PROD/dev/sandbox/user/wojak
  Application.StartupPath: /opt/packages/deadline_5.1b4/bin
  Application.ProductName: Deadline Monitor 5.1
  Application.ProductVersion: 5.1.0.45606  File.GetLastWriteTime( Application.ExecutablePath ): 10/20/2011 22:54:57

Assembly Information (Executing)
  ExecutingAssembly.CodeBase: file:///opt/packages/deadline_5.1b4/bin/franticx.dll
  ExecutingAssembly.Location: /opt/packages/deadline_5.1b4/bin/franticx.dll
  ExecutingAssembly.GlobalAssemblyCache: False  File.GetLastWriteTime( ExecutingAssembly.Location ): 10/20/2011 22:54:57

Assembly Information (Current)
  CurrentAssembly.CodeBase: file:///opt/packages/deadline_5.1b4/bin/franticx.dll
  CurrentAssembly.Location: /opt/packages/deadline_5.1b4/bin/franticx.dll
  CurrentAssembly.GlobalAssemblyCache: False  File.GetLastWriteTime( CurrentAssembly.Location ): 10/20/2011 22:54:57

Thread Information
  CurrentThread.Name: 
  CurrentThread.Priority: Lowest

Operating System Information
  Environment.OSVersion.Platform: Fedora release 12 (Constantine)
  Environment.OSVersion.Version: 2.6.32.21

.NET Platform Information
  Environment.Version.Major: 2
  Environment.Version.Minor: 0
  Environment.Version.Build: 50727
  Environment.Version.Revision: 1433

Misc Environment Information
  Environment.MachineName: grafika03
  Environment.UserName: wojak
  Environment.SystemDirectory: 
  Environment.TickCount: 2200811

Command Line
  Environment.CommandLine: /opt/package/deadline_5.1b4/bin/deadlinemonitor.exe
  Environment.CommandLineArgs[0]: /opt/package/deadline_5.1b4/bin/deadlinemonitor.exe

Current Call Stack
  Environment.StackTrace:    at System.Environment.get_StackTrace()
   at FranticX.Diagnostics.Reporting.ErrorReporting.GetExceptionReport(System.Exception e, ErrorReportDetail detail)
   at FranticX.Diagnostics.Reporting.ErrorReporting.WriteExceptionReport(System.Exception ex)
   at FranticX.Applications.ApplicationManager.StartupException(System.Exception e)
   at DeadlineMonitor.DeadlineMonitorApp.Main(System.String[] args)

email file reporting failed: Sender account in the SMTP settings is blank (System.Exception)
exception occurred while recording file: Sender account in the SMTP settings is blank (System.Exception)

Nothing in the Monitor log, and that job folder is actually empty. This is PROBABLY the Houdini “hbatch” job, which converts the Houdini scene file to Mantra IFD files and then enqueues the Mantra job. We’re using our custom plugin, however. It’s a simple “command plugin” which only submits two jobs through “deadlinecommand” without any cleanup afterwards.

We’ve seen something similar to this before. Kind of annoying that malformed jobs cause the monitor to fail instead of just opening and ignoring the job with a warning.

I did a small workaround for the time being. A simple script that finds any empty directories in the /repository/jobs folder and removes them.
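A workaround along these lines could look roughly like the Python sketch below. This is not the poster's actual script, just an illustration of the idea: it only looks at the immediate subfolders of the jobs folder, defaults to a dry run, and the function name is made up. It does not guard against a job folder that is empty only because a submission is still in progress, so treat it with care on a busy farm.

```python
import os


def remove_empty_job_dirs(jobs_root, dry_run=True):
    """Find immediate subdirectories of jobs_root that contain no entries.

    With dry_run=True (the default) nothing is deleted; the function only
    returns the list of empty directories it would remove.
    """
    removed = []
    for entry in sorted(os.listdir(jobs_root)):
        path = os.path.join(jobs_root, entry)
        # Skip plain files and symlinks; only real directories are considered.
        if os.path.isdir(path) and not os.path.islink(path):
            if not os.listdir(path):
                if not dry_run:
                    # os.rmdir refuses to delete a non-empty directory,
                    # so a folder filled in the meantime is left alone.
                    os.rmdir(path)
                removed.append(path)
    return removed
```

Run it first as `remove_empty_job_dirs("/repository/jobs")` to see what would be removed, then pass `dry_run=False` once the list looks right.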

EDIT: Any particular reason why “XSIBatch” is not showing in the “Configurate Plugins” dialog in the Monitor on Linux?

Yup, very annoying. We just tracked this one down, and this will be fixed in beta 5. Now the job will just appear as corrupted in the Monitor (which is what it should have been doing in the first place).

Yup, that’s a display glitch with Mono. You need to select XSI in the list, then click on XSI again to give focus to the plugin list. Then you can press the down arrow key to show XSIBatch. We’re looking at new UI toolkits for the next major Deadline release, which should make glitches like this a thing of the past.

Cheers,

  • Ryan

Great, thanks very much!

Also… The stability of the Monitor on Linux has in fact improved; however, some crashes still happen from time to time. Shall I send the dumps and give you reproduction steps for some of them? (If so, what would be the appropriate place for that?)

Yes, please do! For the remainder of the beta, feel free to post them on the Bug Reports forum. If you’re still getting them after 5.1 is released, you can use the regular forums or send it to our support team:
thinkboxsoftware.com/support/

Cheers,

  • Ryan