AWS Thinkbox Discussion Forums

pulse crash

Random Pulse crashes… seemingly it's not doing anything. The last lines of the log are:

2013-11-15 05:45:01: Server Thread - lapro2077
2013-11-15 05:45:01: Server Thread - ::ffff:172.18.7.60
2013-11-15 05:45:01: Server Thread - Auto Config: No config worth sending
2013-11-15 05:45:01: Server Thread - Auto Config: Received packet on autoconfig port
2013-11-15 05:45:01: Server Thread - Auto Config: Picking configuration based on:
2013-11-15 05:45:01: Server Thread - lapro2034
2013-11-15 05:45:01: Server Thread - ::ffff:172.18.4.117
2013-11-15 05:45:01: Server Thread - Auto Config: No config worth sending
2013-11-15 05:45:01: Server Thread - Auto Config: Received packet on autoconfig port
2013-11-15 05:45:01: Server Thread - Auto Config: Picking configuration based on:
2013-11-15 05:45:01: Server Thread - lapro3072
2013-11-15 05:45:01: Server Thread - ::ffff:172.18.8.99
2013-11-15 05:45:01: Server Thread - Auto Config: No config worth sending

The corresponding entries from /var/log/messages:

Nov 15 05:49:14 deadline abrt[32248]: Saved core dump of pid 18793 (/opt/mono-2.10.9/bin/mono) to /var/spool/abrt/ccpp-2013-11-15-05:45:01-18793 (4386299904 bytes)
Nov 15 05:49:14 deadline abrtd: Directory 'ccpp-2013-11-15-05:45:01-18793' creation detected
Nov 15 05:49:14 deadline abrtd: Executable '/opt/mono-2.10.9/bin/mono' doesn't belong to any package
Nov 15 05:49:14 deadline abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-11-15-05:45:01-18793' exited with 1
Nov 15 05:49:14 deadline abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2013-11-15-05:45:01-18793, deleting

It seems like it can't save the crash dump properly, so it's hard to tell what went wrong.
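
From the abrtd messages it looks like the core was actually written, but then thrown away because the mono binary doesn't belong to any package and the post-create step failed. Purely as a guess, two abrt settings might be worth checking; the config paths and values below assume a stock CentOS/RHEL abrt install, so adjust as needed:

# Let abrt keep cores from binaries that aren't part of an RPM package
# (the /opt/mono-2.10.9/bin/mono build is unpackaged).
sudo sed -i 's/^ProcessUnpackaged.*/ProcessUnpackaged = yes/' /etc/abrt/abrt-action-save-package-data.conf

# The core above was ~4.4 GB; raise the size cap so abrt doesn't discard it.
sudo sed -i 's/^MaxCrashReportsSize.*/MaxCrashReportsSize = 8192/' /etc/abrt/abrt.conf

sudo service abrtd restart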

Had 5 pulse crashes identical to this today.

Some exceptions from earlier in the log look like this:

2013-11-15 15:37:06: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2013-11-15 15:37:06: Exception Details
2013-11-15 15:37:06: ArgumentException – An element with the same key already exists in the dictionary.
2013-11-15 15:37:06: Exception.Source: mscorlib
2013-11-15 15:37:06: Exception.TargetSite: Void Add(System.String, Deadline.Jobs.Job)
2013-11-15 15:37:06: Exception.Data: ( )
2013-11-15 15:37:06: Exception.StackTrace:
2013-11-15 15:37:06: at System.Collections.Generic.Dictionary`2[System.String,Deadline.Jobs.Job].Add (System.String key, Deadline.Jobs.Job value) [0x00000] in :0
2013-11-15 15:37:06: at Deadline.Pulses.PulseDataThread.ThreadMain () [0x00000] in :0
2013-11-15 15:37:06: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This exception happens fairly regularly; I would say once every 10-20 minutes or so.
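
If it helps to quantify that, a quick per-hour count over the Pulse log looks like this (the log path and file name pattern are guesses; point it at wherever your Pulse logs actually live):

# Count occurrences of the duplicate-key exception per hour.
# cut -d: -f1 keeps just the date and hour from the timestamp, so uniq -c
# gives a per-hour tally.
grep -h "An element with the same key already exists" /var/log/Thinkbox/Deadline6/deadlinepulse*.log | cut -d: -f1 | sort | uniq -c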

Has the frequency of these crashes gone down since you started cleaning up the deleted jobs? There definitely appear to be some memory issues with housecleaning that we were able to reproduce over the weekend with a very large database.

It happened once overnight last night, so that's much better than before.

Found a new exception in the logs though:

2013-11-20 02:55:33: Data Thread - An error occurred while updating cached data: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
2013-11-20 02:55:33: The requested address is not valid in this context (FranticX.Database.DocumentException)
2013-11-20 02:55:33: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2013-11-20 02:55:33: Exception Details
2013-11-20 02:55:33: SocketException – The requested address is not valid in this context
2013-11-20 02:55:33: SocketException.ErrorCode: 10049 (The requested address is not valid in this context)
2013-11-20 02:55:33: SocketException.SocketErrorCode: AddressNotAvailable (10049)
2013-11-20 02:55:33: Win32Exception.NativeErrorCode: 10049
2013-11-20 02:55:33: Exception.Source: System
2013-11-20 02:55:33: Exception.TargetSite: Void Connect(System.Net.EndPoint)
2013-11-20 02:55:33: Exception.Data: ( )
2013-11-20 02:55:33: Exception.StackTrace:
2013-11-20 02:55:33: at System.Net.Sockets.Socket.Connect (System.Net.EndPoint remoteEP) [0x00000] in :0
2013-11-20 02:55:33: at System.Net.Sockets.TcpClient.Connect (System.Net.IPEndPoint remote_end_point) [0x00000] in :0
2013-11-20 02:55:33: at MongoDB.Driver.Internal.MongoConnection.Open () [0x00000] in :0
2013-11-20 02:55:33: at MongoDB.Driver.Internal.MongoConnection.GetNetworkStream () [0x00000] in :0
2013-11-20 02:55:33: at MongoDB.Driver.Internal.MongoConnection.SendMessage (MongoDB.Driver.Internal.MongoRequestMessage message, MongoDB.Driver.WriteConcern writeConcern, System.String databaseName) [0x00000] in :0
2013-11-20 02:55:33: DocumentException – An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017):
2013-11-20 02:55:33: The requested address is not valid in this context
2013-11-20 02:55:33: Exception.Source: deadline
2013-11-20 02:55:33: Exception.TargetSite: Void HandleException(MongoDB.Driver.MongoServer, System.Exception)
2013-11-20 02:55:33: Exception.Data: ( )
2013-11-20 02:55:33: Exception.StackTrace:
2013-11-20 02:55:33: at Deadline.StorageDB.MongoDB.MongoDBUtils.HandleException (MongoDB.Driver.MongoServer server, System.Exception ex) [0x00000] in :0
2013-11-20 02:55:33: at Deadline.StorageDB.MongoDB.MongoDBUtils.HandleException (MongoDB.Driver.MongoDatabase database, System.Exception ex) [0x00000] in :0
2013-11-20 02:55:33: at Deadline.StorageDB.MongoDB.MongoJobStorage.GetJobs (Boolean invalidateCache) [0x00000] in :0
2013-11-20 02:55:33: at Deadline.Pulses.PulseDataThread.ThreadMain () [0x00000] in :0
2013-11-20 02:55:33: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This is from Pulse, which is running on the same machine as MongoDB.
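
Since that SocketException (10049, AddressNotAvailable) points at the address itself rather than the server being down, a few quick checks on the box might narrow it down. These are just standard tools, nothing Deadline-specific; the hostname and port are the ones from the log above:

# What does the hostname resolve to locally?
getent hosts deadline.scanlinevfxla.com

# Is mongod actually listening on that address/port?
netstat -tlnp | grep 27017

# Can we reach it the same way the driver does?
mongo --host deadline.scanlinevfxla.com --port 27017 --eval 'db.runCommand({ping:1}).ok'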

Another crash, and it seems like something held on to the port, because Pulse was unable to restart itself:

[root@deadline ~]# /opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -pulse -nogui
Port 17060 is already in use
Another Launcher is already running
got reply: Sent "LaunchPulsePromptForUpgrade" command. Result: ""
Shutting down
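
For next time, to see what is actually holding the launcher port before restarting (17060 is the port from the message above; standard netstat/lsof, nothing Deadline-specific):

netstat -tlnp | grep 17060
# or
lsof -i :17060
# then kill the leftover PID it reports before starting the launcher again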

The last lines of the log:

2013-11-21 20:54:20: Requeuing orphaned task '9' for LAPRO0614: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Requeuing orphaned task '10' for LAPRO1416: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Requeuing orphaned task '11' for LAPRO0640: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Requeuing orphaned task '12' for LAPRO1356: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Requeuing orphaned task '163' for LAPRO0145: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Requeuing orphaned task '164' for LAPRO1278: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Requeuing orphaned task '165' for LAPRO1461: the task is in the rendering state, but the slave is no longer rendering this task
2013-11-21 20:54:20: Done.
2013-11-21 20:54:20: Checking available Database connections
2013-11-21 20:54:20: Process exit code: 0

After that, Pulse had crashed.

It's very odd… while Pulse was down, people simply could not submit anything to Deadline; Mongo was so extremely overloaded.
The Monitor would be timing out, etc. Just restarting Pulse took me around 20 minutes; in the end it only worked after I shut down the Launcher and all Deadline processes, restarted MongoDB, and then started everything fresh.
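
For reference, that sequence written out as a rough script. The pkill patterns and the mongod service name are assumptions; check ps aux | grep -i deadline on your own box before using anything like this:

# Stop the launcher and any remaining Deadline processes (patterns are a guess).
pkill -f deadlinelauncher.exe
pkill -f deadlinepulse.exe

# Restart MongoDB (assumes the standard 'mongod' init service).
service mongod restart

# Bring everything back up with the usual command.
/opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -pulse -nogui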

Why does Mongo get overloaded when Pulse is not running? I thought it was an optional service to speed up housecleaning, but it seems that at this scale it's crucial to Deadline functioning.

My guess is that it’s due to some of the housecleaning operations that the slaves perform in pulse’s absence. We are focusing heavily on improving the housecleaning for beta 12 to be more performant and less memory intensive, and part of that change is locking the housecleaning process so that only one process can do it at a time.

Attached is another crash; I had it running in a console to catch the crash dump.
pulse_crash_dump.txt (26.1 KB)

Still on beta 9, but Pulse died 9 times today… every time it goes down, the farm basically stops and people are unable to submit jobs, etc.

I hope we can sort this issue out soon :-\

We are working to resolve these issues for beta 12.

Cheers,

  • Ryan

Cool, beta 11 Pulse started great, but it crashed again this morning :\

Trying to recover from the crash now… Mongo became inaccessible again. I think the crash dumping process hangs up the database.

Our mono build has the following command line option:

--gc=[sgen,boehm]    Select SGen or Boehm GC (runs mono or mono-sgen)

Do you think it would make a difference to modify the shell scripts to use SGen? (I'm not sure this switch is enough; most online sources recommend rebuilding Mono.)

Based on this, it looks like you just need to use the command line option:
mono-project.com/Release_Notes_Mono_2.8

So give that a try to see if it helps.
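
Something like this should be enough to test it (a sketch only; the launch command is the one from your console output earlier, and MONO_ENV_OPTIONS is an alternative if you'd rather not edit the scripts — I believe it's supported in Mono 2.10, but treat that as an assumption):

# Option 1: pass the GC flag directly on the command line.
/opt/mono-2.10.9/bin/mono --gc=sgen /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -pulse -nogui

# Option 2: set it via the environment so every mono invocation picks it up.
export MONO_ENV_OPTIONS="--gc=sgen"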

Over the weekend, we ran a test where our simulator maintained 20,000 queued/rendering jobs in the queue at all times, and these jobs were set to auto-delete on completion. We had our internal version of Pulse running that entire time, and not once did its memory usage climb above 300 MB. However, beta 11 Pulse would run fine for a bit, but then climb up to many GB of memory before it eventually ran out of resources. So it's looking like our refactored housecleaning is making a big difference.

There is a good chance that beta 12 will be ready this week.

Cheers,

  • Ryan

Oh cool, that's really good to hear!! It seems like a huge improvement, as our Pulse usually sits around 4-5 GB of RAM. Housecleaning and the pending job scan can also easily go up to 2 GB each.

The Launcher usually hovers around 950 MB.
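
In case it's useful for comparing against beta 12, a simple way to sample those numbers over time (the grep pattern for the Pulse process is an assumption; adjust it to however Pulse shows up in ps on your machines):

# Log Pulse's resident memory once a minute.
while true; do
    rss_kb=$(ps axo rss,args | grep deadlinepulse.exe | grep -v grep | awk '{print $1}')
    echo "$(date '+%F %T')  ${rss_kb} kB" >> /tmp/pulse_mem.log
    sleep 60
done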
