AWS Thinkbox Discussion Forums

limit stubs not returned

We have this happen randomly every day to multiple jobs.

A job is happily rendering, then all of a sudden it stops rendering completely. It has a machine limit of 1.
I go into the db to figure out what's happening, and find that a mystery machine is holding on to its stub:

{u'LastWriteTime': {u'$date': 1408482113478L},
u'Name': u'53f3b38d0294983064037d6b',
u'Props': {u'Limit': 1,
u'RelPer': -1,
u'Slaves': [],
u'SlavesEx': [],
u'White': False},
u'StubCount': 1,
u'StubLevel': 0,
u'Stubs': [{u'Holder': u'lapro0605', u'Time': {u'$date': 1408482113478L}}],
u'Type': 1,
u'_id': u'53f3b38d0294983064037d6b'}

lapro0605 is not actually doing anything, nor is it rendering another job. It will never, ever get rid of this stub assignment.

The only way to fix the jobs is to manually increase their machine limit.
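
For reference, here is a minimal pymongo sketch for spotting limits whose stubs are all claimed, and who is holding them. The connection host and the database/collection names ("deadlinedb", "LimitGroups") are placeholders/assumptions based on how the documents above look, so adjust them to your setup:

# Minimal sketch: list saturated limits and who is holding their stubs.
# Host and database/collection names are assumptions, not confirmed.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("your-deadline-mongo-host", 27017)["deadlinedb"]

for limit in db["LimitGroups"].find({"StubCount": {"$gt": 0}}):
    if limit["StubCount"] < limit["Props"]["Limit"]:
        continue  # not saturated, slaves can still pick this limit up
    print(limit["_id"])
    for stub in limit["Stubs"]:
        age = datetime.utcnow() - stub["Time"]  # pymongo returns naive UTC datetimes
        print("  held by %s for %s" % (stub["Holder"], age))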

Needless to say, this is causing quite the stir with artists. Frustration all around. Shouldn't the machine's own housecleaning or the Pulse central housecleaning sort this out? It doesn't.

Once the limit has been increased, the job picks up again:

{u'LastWriteTime': {u'$date': 1408484022841L},
u'Name': u'53f3b38d0294983064037d6b',
u'Props': {u'Limit': 2,
u'RelPer': -1,
u'Slaves': [],
u'SlavesEx': [],
u'White': False},
u'StubCount': 2,
u'StubLevel': 0,
u'Stubs': [{u'Holder': u'lapro0605', u'Time': {u'$date': 1408482113478L}},
{u'Holder': u'lapro1393', u'Time': {u'$date': 1408484022841L}}],
u'Type': 1,
u'_id': u'53f3b38d0294983064037d6b'}

The weird thing is, I can't even find any reference to these stuck jobs in the logs of the slaves that are apparently holding on to the stubs…

Is it possible that the 'looking for a new task to do' algorithm grabs a stub and doesn't put it in the log, then crashes?

OK, it simply doesn't look like slaves do a sweep on restart to remove themselves from the 'Stubs' entry of jobs at all…

No matter what voodoo dance I do to restart the slaves, the entries stick around:

{u'LastWriteTime': {u'$date': 1408494300536L},
u'Name': u'53f3d84c0294982b049ba636',
u'Props': {u'Limit': 2,
u'RelPer': -1,
u'Slaves': [],
u'SlavesEx': [],
u'White': False},
u'StubCount': 2,
u'StubLevel': 0,
u'Stubs': [{u'Holder': u'lapro0825', u'Time': {u'$date': 1408492900201L}},
{u'Holder': u'lapro0571', u'Time': {u'$date': 1408494300536L}}],
u'Type': 1,
u'_id': u'53f3d84c0294982b049ba636'}

In this case, lapro0825 never rendered this job, ever; the job ID is never referenced anywhere in its log.
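
As a stopgap, instead of raising the machine limit, one could in principle pull the dead holder straight out of the limit document and decrement StubCount. A hedged sketch of that idea is below; the connection host and database/collection names are assumptions, and this is at-your-own-risk surgery on a live Deadline database, so a backup first is wise:

# Hypothetical workaround: remove lapro0825's stub from the limit shown above
# and decrement StubCount. Only sensible if the holder is definitely not rendering.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("your-deadline-mongo-host", 27017)["deadlinedb"]  # placeholders

result = db["LimitGroups"].update_one(
    {"_id": "53f3d84c0294982b049ba636", "Stubs.Holder": "lapro0825"},
    {"$pull": {"Stubs": {"Holder": "lapro0825"}},
     "$inc": {"StubCount": -1},
     "$set": {"LastWriteTime": datetime.utcnow()}})
print("matched %d, modified %d" % (result.matched_count, result.modified_count))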

Mystery, compounded by growing distrust

Hey Laszlo,

I imported this limit into a local Deadline 6 database here, and confirmed that the housecleaning did clean it up properly. I think the problem (which you brought up in another thread) is that your housecleaning operation takes a very long time to cycle through because of all the cleanup it's doing, and it doesn't help that most aspects of housecleaning are slightly randomized.

We are already addressing this in Deadline 7 by doing the following:

  • There will be a new Repository Repair operation that will be separate from Housecleaning, and it will do things like check for orphaned tasks, orphaned limit stubs, and stalled slaves. This will run asynchronously from the rest of Housecleaning, so that things like purging jobs or old reports don’t interfere with Deadline’s ability to repair itself.
  • The randomization of operations will be removed. This may have been helpful in older versions of Deadline to reduce load on the file system, but with the database backend, this really isn’t necessary any more.

Another thing to note is that the stub must exist for 5 minutes before it is considered orphaned (just to avoid false positives). I’m sure that isn’t the case here, just wanted to make you aware.

Cheers,
Ryan

Housecleaning never cleans these up for us. The artists have overnight renders that stop at 11pm and never pick up again.

The slave's own housecleaning doesn't clean these up either (and I would guess the slave should be the primary candidate to do this?). Shouldn't the slave clean up the stubs it owns? Is stub ownership stored in multiple collections? Maybe the slave thinks it doesn't have any, when in fact the limit still has the slave in its stub holder list?

That’s really strange that housecleaning never cleans these up, especially since it worked here for me. I just imported that exact limit into the database, and then ran a housecleaning operation from the Monitor.

I also confirmed that when the slave self-cleans on startup, it cleans up any stubs that it is aware of (stubs that were listed in its slave state when it exited uncleanly). So in this case, as you saw in the logs, the slave had no record of the job, which leads me to believe it was able to acquire this stub when trying to pick up this job, but then failed on another limit for the same job, and then this stub wasn't returned properly. However, the housecleaning would be the "catch-all" that eventually cleans this up.
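
To make that failure mode concrete, here is a rough illustration (in Python, and not Deadline's actual code) of what "acquired one stub, failed on another limit, and the first stub was never returned" looks like:

# Illustrative only (not Deadline's code): a slave tries to acquire the stub for
# every limit a job needs, and rolls back the ones it already holds on failure.
def try_acquire_job_limits(job_limits, acquire, release):
    held = []
    try:
        for limit in job_limits:
            if not acquire(limit):          # e.g. a second limit is maxed out
                raise RuntimeError("limit %s unavailable" % limit)
            held.append(limit)
        return True
    except Exception:
        # If release() itself throws here (db error, OutOfMemoryException, ...),
        # the stubs that were not yet returned stay checked out until the
        # orphaned limit stub scan eventually notices them.
        for limit in held:
            release(limit)
        return False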

The next time this happens, can you try running housecleaning from your Monitor and check the output about orphaned limit stubs? There is no randomness in the various housecleaning operations when they're run from the Monitor, so everything is guaranteed to run.

I’ll attempt.

Here is a log from Pulse:

2014-08-20 00:00:10:  Argument: -DoHouseCleaning 10 True
2014-08-20 00:00:10:  Startup Directory: "/opt/Thinkbox/Deadline6/bin"
2014-08-20 00:00:10:  Process Priority: BelowNormal
2014-08-20 00:00:10:  Process Affinity: default
2014-08-20 00:00:10:  Process is now running
2014-08-20 00:00:11:  Performing Job Cleanup Scan...
2014-08-20 00:00:11:      Job Cleanup Scan - Loading completed jobs
2014-08-20 00:00:17:      Job Cleanup Scan - Loaded 6117 completed jobs in 6.546 s
2014-08-20 00:00:17:      Job Cleanup Scan - Scanning completed jobs
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_FL-SprayS02_beauty_v0069" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_FL-SplashMistS01_beauty_v0006" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_FL-SprayMistS02_beauty_v0069" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_FL-SprayMistS06_beauty_v0066" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[FAST7] Create Proxy Quicktime: SHR_shr_rsrc_vehDodgeCharger-Lookdev-Parkade_beauty_v0026" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_Light-All_beauty_v0010" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_FL-SprayMistS07_beauty_v0042" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[MJ] Create Proxy Quicktime: RP_093_0260_FL-SprayBurst_beauty_v0005" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[MJ] Create Proxy Quicktime: RP_093_0260_FX-dambreak_beauty_v0121" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Create Proxy Quicktime: RS_190_2060_FL-SprayMistS03_beauty_v0047" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_Light-All_beauty_v0010" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[MJ] Upload Proxy Quicktime: RP_093_0260_FX-dambreak_beauty_v0121" because Delete On Complete was enabled.
2014-08-20 00:00:18:      Job Cleanup Scan - Deleted completed job "[MJ] Upload Proxy Quicktime: RP_093_0260_FL-SprayBurst_beauty_v0005" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_FL-SprayMistS03_beauty_v0047" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[FAST7] Upload Proxy Quicktime: SHR_shr_rsrc_vehDodgeCharger-Lookdev-Parkade_beauty_v0026" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_FL-SprayMistS06_beauty_v0066" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_FL-SprayS02_beauty_v0069" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_FL-SplashMistS01_beauty_v0006" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_FL-SprayMistS02_beauty_v0069" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted completed job "[EXO] Upload Proxy Quicktime: RS_190_2060_FL-SprayMistS07_beauty_v0042" because Delete On Complete was enabled.
2014-08-20 00:00:19:      Job Cleanup Scan - Deleted 20 and archived 0 completed jobs in 1.610 s
2014-08-20 00:00:19:      Job Cleanup Scan - Done.
2014-08-20 00:00:19:  Performing Orphaned Task Scan...
2014-08-20 00:00:19:      Orphaned Task Scan - Loading rendering jobs
2014-08-20 00:00:19:      Orphaned Task Scan - Loaded 236 rendering jobs in 68.562 ms
2014-08-20 00:00:19:      Orphaned Task Scan - Scanning for orphaned tasks
2014-08-20 00:00:19:      Orphaned Task Scan - Separated jobs into 3 lists of 100
2014-08-20 00:00:19:      Orphaned Task Scan - Scanning job list 1 of 3 (100 jobs)
2014-08-20 00:00:20:      Orphaned Task Scan - Requeuing orphaned task '93' for LAPRO0618: the task is in the rendering state, but the slave is no longer rendering this task
2014-08-20 00:00:20:      Orphaned Task Scan - Scanning job list 2 of 3 (100 jobs)
2014-08-20 00:00:21:      Orphaned Task Scan - Requeuing orphaned task '174' for LAPRO0648: the task is in the rendering state, but the slave is no longer rendering this task
2014-08-20 00:00:21:      Orphaned Task Scan - Scanning job list 3 of 3 (36 jobs)
2014-08-20 00:00:21:      Orphaned Task Scan - Requeuing orphaned task '17' for LAPRO0648: the task is in the rendering state, but the slave is no longer rendering this task
2014-08-20 00:00:21:      Orphaned Task Scan - Requeuing orphaned task '41' for LAPRO0777: the task is in the rendering state, but the slave is no longer rendering this task
2014-08-20 00:00:21:      Orphaned Task Scan - Requeuing orphaned task '71' for LAPRO0686: the task is in the rendering state, but the slave is no longer rendering this task
2014-08-20 00:00:21:      Orphaned Task Scan - Cleaned up 5 orphaned tasks in 1.519 s
2014-08-20 00:00:21:      Orphaned Task Scan - Done.
2014-08-20 00:00:21:  Performing Stalled Slave Scan...
2014-08-20 00:00:21:      Stalled Slave Scan - Loading slave states
2014-08-20 00:00:21:      Stalled Slave Scan - Loaded 1938 slave states in 123.137 ms
2014-08-20 00:00:21:      Stalled Slave Scan - Scanning slave states
2014-08-20 00:00:21:      Stalled Slave Scan - Cleaned up 0 stalled slaves in 24.249 ms
2014-08-20 00:00:21:      Stalled Slave Scan - Done.
2014-08-20 00:00:22:  Process exit code: 0

You can see that it does a job scan, and indeed finds orphaned tasks. So something is happening…

What's suspicious is this line: "Loaded 236 rendering jobs in 68.562 ms"

Is it only looking at jobs with a rendering state? Because these jobs are usually in a queued state when we find them.

That’s for the orphaned task scan. The orphaned limit scan is separate, and in that case, we go through all limits, and then compare their stubs with the states of the slaves.

Another thing to check if this does happen again is the Limit Stubs In Use column in the slave list. The only reason I could see why housecleaning wouldn't be working is if the slave is still showing that it has that stub checked out.
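
For reference, that cross-check would presumably look something like the sketch below: a stub is only treated as orphaned if it is past the 5-minute grace period and the holding slave's own state no longer lists it. The slave-side collection and field names ("SlaveInfo", "LimitStubs") are guesses rather than Deadline's real schema, but the sketch shows why a slave that still lists the stub in its Limit Stubs In Use column would block the cleanup:

# Rough sketch of the orphaned limit stub check. Collection/field names on the
# slave side ("SlaveInfo", "LimitStubs") are assumptions, not Deadline's schema.
from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient("your-deadline-mongo-host", 27017)["deadlinedb"]  # placeholders
GRACE = timedelta(minutes=5)  # a stub must be at least 5 minutes old to count as orphaned

# What each slave believes it has checked out (the "Limit Stubs In Use" column).
stubs_by_slave = {
    s["Name"].lower(): set(s.get("LimitStubs", []))
    for s in db["SlaveInfo"].find({}, {"Name": 1, "LimitStubs": 1})
}

for limit in db["LimitGroups"].find({"StubCount": {"$gt": 0}}):
    for stub in limit["Stubs"]:
        old_enough = datetime.utcnow() - stub["Time"] > GRACE
        holder_agrees = limit["_id"] in stubs_by_slave.get(stub["Holder"].lower(), set())
        if old_enough and not holder_agrees:
            print("orphaned stub on limit %s, held by %s" % (limit["_id"], stub["Holder"]))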

Wow. Slaves are holding onto loads of limit stubs…

Check it out. This is an idle slave:

53f15e19bb649a31809cc56c, 53ef3ee2f8f9130f44591e7a, 53ecd0cff8f91302986046ba, 53eff5d110f85223c88b1478, 53f15dc3bb649a319479cf0b, 53eff68e10f852302c987faf, 53ef3ee5f8f9130cd445217b, 53ef3eeff8f9130d44a7e5d9, 53ed1b4c9e277f0e9853c596, 53f03a4701f73e388c5161da, 53f13e88bb649a36e80ce5de, 53e502dbc3ce0a00bcaf7005, 53efd22a162dfe0ff89e04d9, 53efd1d1162dfe2660bd69a9, 53efcca5162dfe014089a416, 53f0513a01f73e070825d9d1, 53ef9985162dfe201c4f56c0, 53f00322162dfe19a0e31a31, 53efccdf162dfe240c4e8336, 53f15e2ebb649a1874b7a242, 53f15dd8bb649a05e0907c26, 53ed2d4fe74b542d384487df, 53f15a69029498230ca6fe3c, 53f15a5f0294981d1405c2c6, 53f139f60294981fe8472f99, 53f139eb02949807e81654fb, 53f1493abb649a18b0b2bc16, 53ee4d848b53a02dcce9d2de, 53f1494ebb649a130ca1d0f2, 53ed49289298cb2b682283c7, 53f1396a01f73e162c996cf2, 53ed49329298cb2860d66e5c, 53ed59eb9298cb11f802bf68, 53ed5a189298cb2090d920f2, 53ed48fa9298cb22a8c6d66e, 53ed59fc9298cb2edcb54f38, 53ed29a9e74b5415e0629347, 53ed490b9298cb2c6455fa0b, 53f141d6bb649a12c4570607, 53f138ddbb649a1b5c7c3eac, 53f14162bb649a2610a9df43, 53f1417ebb649a204882f82f, 53f14285bb649a26c8a16ffe, 53f14fafbb649a3558df58fd, 53f141f3bb649a2760dcb935, 53f14f93bb649a05e00caf52, 53f15393bb649a326047ff69, 53f15413bb649a21f0178166, 53f153adbb649a2bc456e62c, 53ed2d86e74b543ddcb3c56a, 53f1404cbb649a2dd01dbc27, 53f1402fbb649a2f90a258a2, 53f1426ebb649a1da04a0e1c, 53f15a0abb649a36309614a0, 53f15529bb649a0ad43341d3, mx_highmem, vray2, 53f1542fbb649a23e4865792, 53f15545bb649a2108a217fe, 53f1524dbb649a13006840ee, 53f15230bb649a1fd405f9f3, win7, 53ee3da92eb2af24e42be70c, mx_128gb, 53f15a3c0294981a8038a39b, 53f15af902949820dc7604af, 53f00338162dfe042ca89f3d, 53f0031b162dfe1da4ae5b7c, 53efccd7162dfe25842c73a0, 53efcc9e162dfe23840c2076, 53efd1ca162dfe19a0847d10, mx_render, mx_flowlinerender, 53f0032f162dfe2254207e79, 53f0032c162dfe20a041aed4, 53f153a5bb649a1c885c7284, rendernodes, mx_flowlinesim, mx_fastmachines, 53f141e8bb649a2e7088743c

Is this expected?

I'm pretty sure that isn't expected behavior, and it could explain why housecleaning never cleans it up. However, from looking at the code, the slave should be returning any unused stubs it might still be holding on to before starting the next job, or before going from rendering to idle. So even if one attempt at returning the stubs fails, the slave will try again the next time.

In this slave’s log, do you see any messages that look like this?

Or any MongoDB-related errors? I'm wondering if something is preventing the slave from returning its stubs…

I found a crash on this particular machine; it then failed to connect to the db, and then had other errors:

2014-08-17 19:07:03:  0: Shutdown
2014-08-17 19:07:03:  0: Shutdown
2014-08-17 19:07:04:  0: Exited ThreadMain(), cleaning up...
2014-08-17 19:07:04:  Scheduler Thread - Unexpected Error Occured
2014-08-17 19:07:04:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-08-17 19:07:05:  Exception Details
2014-08-17 19:07:05:  OutOfMemoryException -- Exception of type 'System.OutOfMemoryException' was thrown.
2014-08-17 19:07:05:  Exception.Data: ( )
2014-08-17 19:07:05:  Exception.TargetSite: Void StartInternal(System.Security.Principal.IPrincipal, System.Threading.StackCrawlMark ByRef)
2014-08-17 19:07:05:  Exception.Source: mscorlib
2014-08-17 19:07:05:    Exception.StackTrace: 
2014-08-17 19:07:05:     at System.Threading.Thread.StartInternal(IPrincipal principal, StackCrawlMark& stackMark)
2014-08-17 19:07:05:     at System.Threading.Thread.Start()
2014-08-17 19:07:05:     at Deadline.Slaves.SlaveRenderThread.Initialize(Int32 threadID)
2014-08-17 19:07:05:     at Deadline.Slaves.SlaveSchedulerThread.ThreadMain()
2014-08-17 19:07:05:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2014-08-17 19:07:05:  Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
2014-08-17 19:07:05:  Exception of type 'System.OutOfMemoryException' was thrown. (FranticX.Database.DatabaseConnectionException)

This was a random slave, by the way; there are a lot of others with these stuck stubs.

I found a job that recently did this, and this is the limit group document it's showing (the artist manually upped the machine limit to 'fix' the issue, that's why you see 2):

{u'LastWriteTime': {u'$date': 1408566014281L},
u'Name': u'53f3efaebb649a0f884084c2',
u'Props': {u'Limit': 2,
u'RelPer': -1,
u'Slaves': [],
u'SlavesEx': [],
u'White': False},
u'StubCount': 2,
u'StubLevel': 0,
u'Stubs': [{u'Holder': u'lapro0678', u'Time': {u'$date': 1408496866325L}},
{u'Holder': u'lapro0417', u'Time': {u'$date': 1408566014281L}}],
u'Type': 1,
u'_id': u'53f3efaebb649a0f884084c2'}

lapro0417 is in fact rendering it, but lapro0678 is a 'stub hoarder'… check it out: all of these are assigned as its current stubs, among them the one from the job above:

53f3a5cae74b5416040a3627, 53f3e2299e277f218c0dd121, mx_flowlinerender, 53f3ca1f229aec303c1e1e19, 53f3c64bbb649a18442e07e3, 53ed5a189298cb2090d920f2, 53ed59fc9298cb2edcb54f38, 53ed59eb9298cb11f802bf68, 53ed49329298cb2860d66e5c, 53ed49289298cb2b682283c7, 53ed490b9298cb2c6455fa0b, 53ed48fa9298cb22a8c6d66e, 53f3c747bb649a195885e9ac, 53f3c634bb649a1010bfc8fe, 53f3b8b5bb649a0cf89f4f13, 53f3c27e9e277f0854283441, 53f3b57be74b5416040a3629, vray2, mx_render, 53f3b39c0294980e2000c7e0, mx_fastmachines, 53f3c2b3e74b5416040a362d, mx_128gb, 53f3e2e0162dfe1f4496b063, 53f3daa5cf71591ca06f79c1, 53f3efaebb649a0f884084c2, rendernodes, mx_flowlinesim, 53f3efa4bb649a196c75bd2a

Remoting into the slave, I can see it has crashed. These are the last lines of its log:

2014-08-19 18:06:58:  Scheduler - The 53f3da8acf715918e8ca8843 limit is maxed out.
2014-08-19 18:06:58:  Scheduler - The 53f3e307cf71591574e6d658 limit is maxed out.
2014-08-19 18:06:58:  Scheduler - Slave has been marked bad for job 53f246b5e74b5430b8d9ca87, skipping this job.
2014-08-19 18:06:58:  Scheduler - Slave has been marked bad for job 53f3b57be74b5416040a3629, skipping this job.
2014-08-19 18:06:58:  Scheduler - Slave has been marked bad for job 53f3afa5e74b5416040a3628, skipping this job.
2014-08-19 18:06:58:  Scheduler - Unexpected Error Occurred
2014-08-19 18:06:58:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-08-19 18:06:58:  Scheduler Thread - exception occurred:

The actual exception isn't there… but the Deadline slave is hanging with a crash popup.

According to the Event Viewer, it was an out-of-memory exception:

Log Name:      Application
Source:        .NET Runtime
Date:          8/19/2014 6:06:59 PM
Event ID:      1026
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      LAPRO0678.scanlinevfxla.local
Description:
Application: deadlineslave.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.OutOfMemoryException
Stack:
   at Deadline.Slaves.SlaveSchedulerThread.HandleException(System.Exception)
   at Deadline.Slaves.SlaveSchedulerThread.ThreadMain()
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.ThreadHelper.ThreadStart()

Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name=".NET Runtime" />
    <EventID Qualifiers="0">1026</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2014-08-20T01:06:59.000000000Z" />
    <EventRecordID>34223</EventRecordID>
    <Channel>Application</Channel>
    <Computer>LAPRO0678.scanlinevfxla.local</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Application: deadlineslave.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.OutOfMemoryException
Stack:
   at Deadline.Slaves.SlaveSchedulerThread.HandleException(System.Exception)
   at Deadline.Slaves.SlaveSchedulerThread.ThreadMain()
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.ThreadHelper.ThreadStart()
</Data>
  </EventData>
</Event>

On a 64-bit machine, this is odd to see…

I found several other 'stub hoarders' that hang with a semi-frozen slave: crash popup plus out-of-memory exception.

I think this is a big part of what we are seeing… the orphaned task / stub checks seem to run very intermittently, sometimes not at all for 20+ minutes…
So we get a lot of people complaining about 'stuck' / hanging frames and mysterious behavior (they remote into a slave and see that it has been rendering something else for the past hour, but the Monitor still says it's rendering their job, etc.).

We’re going to add some additional robustness to the limit stub returning code. There are two areas I think we can improve on:

  1. When the slave is returning its stubs, if an exception occurs for one stub, the loop breaks out, leaving the rest of the stubs hanging around. We can improve this by making sure each limit is attempted (see the sketch after this list).

  2. When an unexpected exception like the one you’re seeing occurs, the slave tries to clean up its job, but not the limit stubs. We can improve this by making it clean up its limit stubs as well.
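
Here is a rough sketch of the more robust return loop from point 1 (illustrative Python, not the actual slave code): every stub gets its own try/except, so one failure no longer strands the rest.

# Illustrative only: return every held stub, even if some of the returns fail.
def return_all_stubs(held_stubs, return_stub, log):
    still_held = []
    for stub in held_stubs:
        try:
            return_stub(stub)
        except Exception as e:
            # Previously a single failure here would abort the whole loop;
            # now we log it, keep the stub for a later retry, and carry on.
            log("failed to return limit stub %s: %s" % (stub, e))
            still_held.append(stub)
    return still_held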

The next time you see a slave in this state, can you check its memory usage in Task Manager? If it’s large, could you get a dump of the slave’s memory so that we can take a look at it?

Thanks!
Ryan
