AWS Thinkbox Discussion Forums

slave crash

From the log:

2013-03-01 17:40:30: 0: STDOUT: mel: READY FOR INPUT
2013-03-01 17:40:34: 0: INFO: Ending Maya Job
2013-03-01 17:40:34: 0: INFO: Waiting for Maya to shut down
2013-03-01 17:40:34: 0: INFO: Maya has shut down
2013-03-01 18:14:39: Traceback (most recent call last):
2013-03-01 18:14:39: File "DeadlineSlave\UI\Forms\MainWindowSlave.py", line 293, in update
2013-03-01 18:14:39: File "DeadlineSlave\UI\Forms\MainWindowSlave.py", line 317, in updateJobInfo
2013-03-01 18:14:39: InvalidOperationException: Collection was modified; enumeration operation may not execute.
2013-03-01 18:14:39: at System.Collections.Generic.Dictionary`2.KeyCollection.Enumerator.MoveNext()
2013-03-01 18:14:39: at Deadline.Slaves.Slave.GetLimitGroupStubs()

cheers,
laszlo

Hey Laszlo,

Thanks! This will be fixed in beta 14.

Did this actually crash the slave? Based on the code, it shouldn’t crash the entire slave. It should have just reported this message and continued on.

Cheers,

- Ryan
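
For anyone reading this later: the "Collection was modified" exception in the log above is what .NET raises when a dictionary is changed while something else is enumerating it. Below is a rough Python analogue of the same class of bug and one common way around it (copy the values under a lock, then iterate the copy). `LimitGroupRegistry` and its methods are hypothetical names for illustration only. This is not Deadline's code, and it is not necessarily how the beta 14 fix works.

```python
import threading


class LimitGroupRegistry:
    """Hypothetical stand-in for a slave's limit-group table (not Deadline code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stubs = {}

    def set_stub(self, name, stub):
        with self._lock:
            self._stubs[name] = stub

    def remove_stub(self, name):
        with self._lock:
            self._stubs.pop(name, None)

    def get_stubs(self):
        # Copy under the lock so callers iterate a private snapshot
        # instead of the live dictionary.
        with self._lock:
            return list(self._stubs.values())


def unsafe_iteration(table):
    """Mutate a dict while iterating it; return the error message, if any."""
    try:
        for key in table:
            table[key + "_copy"] = table[key]
    except RuntimeError as exc:
        return str(exc)
    return None
```

With the snapshot approach, a UI update that walks `get_stubs()` is unaffected if another thread adds or removes a stub in the meantime. The trade-off is that the snapshot can be slightly stale.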

Yeah, the slave was not running. The reason I noticed this is because I forgot to add this slave to the power management group that ensures crashed slaves restart…

cheers,
laszlo

Another case of the slave reporting stalled / crashing out completely (process not running on slave).

The logs don’t have any error messages. I’ve attached all logs for the slave from the last couple of days.
deadlinelauncher-LAPRO0335-2013-03-01-0001.zip (142 KB)

Another one. Monitor reports: stalled slave
Slave process: not running
Log attached

Doesn’t seem like anything fancy was happening on the box, but the slave just disappeared…
deadlineslave_LAPRO0241-LAPRO0241-2013-03-04-0001.log (1.09 MB)

Another one, same deal. Reporting stalled, but the machine doesn’t have the slave app running.

Logs attached.
deadlinelauncher-LAPRO0238-2013-03-04-0000.zip (129 KB)

What I’m noticing is that the most common errors are, in order:

  1. slave tries to cancel a task, then hangs. Either the gui hangs and the monitor reports the slave as stalled, or the gui stays interactive and the monitor reports the slave as rendering
  2. slave goes into a loop of “updating slave settings”, and hangs up
  3. slave just disappears, with nothing in the log to indicate any error

Every morning about 10% of all slaves are hanging / crashed.

For (1) and (3), we’re still trying to figure out what’s going on here.

For (2) I would lump this together with (1), since the “Update Slave Settings” log info is just from another thread.

Have you had a chance to test with nogui on some machines yet? The reason why this is helpful is that if you don’t see these problems in nogui mode, that significantly reduces the possibilities of where this problem is occurring.

Cheers,

  • Ryan

Jon and I have done a bit of digging, and we may have a lead on the hanging problem. One thing that caught our eye was the line that is printed by the scheduler thread when it can no longer find the task. In our code, this is immediately followed by another print statement, but we’ve never seen that line in any of your logs when this happens.

When we do a print statement in our code, we have a few “listener” delegates that get called so that they can do something with that text (write to the log file, write to the log UI, etc). We’ve discovered that if one of these delegates blocks indefinitely when handling a print statement, that will block future print statements from getting written out and potentially deadlock the entire application.
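
To make the failure mode above concrete, here is a minimal Python sketch. It is an illustration under assumptions, not Deadline's implementation: `SyncLogger` and `QueuedListener` are made-up names. `SyncLogger.log` calls each listener inline while holding a lock, so a listener that never returns stalls every later `log` call. `QueuedListener` shows one possible mitigation: the logging call only enqueues the text, and a background thread runs the slow listener.

```python
import queue
import threading


class SyncLogger:
    """Calls every listener inline under one lock (hypothetical, not Deadline code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._listeners = []

    def add_listener(self, fn):
        self._listeners.append(fn)

    def log(self, text):
        with self._lock:
            for fn in self._listeners:
                # If fn never returns, the lock is never released and
                # every other thread calling log() waits forever.
                fn(text)


class QueuedListener:
    """Wraps a listener so the caller only enqueues; a daemon thread does the work."""

    def __init__(self, fn, maxsize=1000):
        self._fn = fn
        self._queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0
        threading.Thread(target=self._drain, daemon=True).start()

    def __call__(self, text):
        try:
            self._queue.put_nowait(text)
        except queue.Full:
            # Drop the line rather than block the thread that is logging.
            self.dropped += 1

    def _drain(self):
        while True:
            self._fn(self._queue.get())
```

Registering a UI listener as `logger.add_listener(QueuedListener(ui_write))` means a stuck UI can only cost log lines in the UI. The file writer and the threads doing the actual work keep going.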

Our gut is telling us it’s the listener for the UI log that’s the culprit, since the log writer delegate has been around forever. It would be really great if you could get 20 of your machines running in nogui mode to test this theory.

Something we could even consider doing for beta 14 is adding a repository option to disable the Slave UI log. That way, you can just enable the option and restart your slaves, without having to manually launch them in nogui mode.

EDIT: note that the remote log viewer should still work, even if the slave isn’t writing to its own UI.

Oh ok, I’ll try to get some time this week to do the nogui thing. That option would be best, since then a machine restart would not revert back to gui mode.

So after some more thinking on our end, we think it might be beneficial to remove the slave log from the UI, and replace it with a button to spawn the log viewer (similar to how you remotely connect to the slave log from the Monitor). The main benefit to this is that the slave’s responsiveness will no longer be affected by how much output the renderer is logging. Another benefit is that if something screws up in the log viewer, it won’t impact the slave.

We’re going to put this into beta 14 and see what people think.

Cheers,

  • Ryan

Would it let you scroll back in the log? My main issue with the ‘connect to log’ feature is the lack of scroll back.

Yup, it’s going to grab the last 500 lines when you connect.
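
A bounded scroll-back like this is commonly done with a fixed-size buffer that discards the oldest lines. Here is a minimal Python sketch of that idea. `LogBacklog` is a hypothetical name, and the only detail taken from this thread is the 500-line figure; how the actual log viewer stores its backlog is not stated here.

```python
from collections import deque


class LogBacklog:
    """Keeps only the most recent lines; older ones fall off the front."""

    def __init__(self, max_lines=500):
        # A deque with maxlen discards the oldest entry on overflow.
        self._lines = deque(maxlen=max_lines)

    def append(self, line):
        self._lines.append(line)

    def snapshot(self):
        # What a newly connected viewer would receive as scroll-back.
        return list(self._lines)
```

Memory use stays constant no matter how much the renderer prints, at the cost of losing anything older than the last `max_lines` lines.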
