AWS Thinkbox Discussion Forums

Farm is stalling, "Waiting to start"

We are experiencing some issues currently across all Deadline plugins (Nuke, Maya, Max, etc) where the slaves don’t start rendering.
The slaves get stuck at the “Waiting to start” status in the monitor, if I VNC into the slave it says “Running / IDLE” and it has the right job information loaded under the job information tab but its not doing anything. The slave list monitor says the slave are idle even though in the task list for the job it says Waiting to start. Don’t have much debug information at this point, here is a snippet from a Nuke job that finished and the next job was supposed to render a Maya job…

0: STDOUT: Frame 298 (1 of 1) 0: INFO: Process exit code: 0 ---- July 25 2011 -- 10:19 AM ---- Info Thread - Using process affinity of 8 CPUs ---- July 25 2011 -- 10:24 AM ---- Info Thread - Using process affinity of 8 CPUs ---- July 25 2011 -- 10:26 AM ---- purging error reports older than 5 days ---- July 25 2011 -- 10:29 AM ---- Info Thread - Using process affinity of 8 CPUs ---- July 25 2011 -- 10:34 AM ---- Info Thread - Using process affinity of 8 CPUs

Tried restarting the slaves, and when they come online they get stuck here with the task/job details loaded,

Info Thread - Using process affinity of 8 CPUs Checking repository integrity ---- July 25 2011 -- 11:33 AM ---- Info Thread - Using process affinity of 8 CPUs

Any ideas?

Ok, so restarting pulse did the trick, everything back to normal…

Not much more debug details, pulse was showing the same info we get normally - x slaves connected, x waiting for jobs, x waiting to start jobs… Worth noting is that it claimed 0 slaves were waiting to start jobs…

Glad to hear things are back to normal, but I’m curious what happened to Pulse. If you open the temp folder (%temp%) on the Pulse machine, are there any “Deadline Pulse…” exception reports from the last couple of days? I’m wondering if a thread crashed…

Hmm… the latest log file in the temp folder is from 07/07/2011… There are a bunch of logs from before that but nothing for the last tree weeks… This is the last one and seems to be from when we were migrating, version being 5.1.0.44844, which I guess would mean while we were still on alpha 3… Not sure when the pulse logs but, but no logs for the last three weeks seems a bit odd, no?

“Deadline Pulse 5.1 [v5.1.0.44844 R] - NpgsqlException - 2011-07-07 13-01-50.log”

[code]ErrorLog.WriteExceptionLog() >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Date Stamp
CurrentDate: Thursday, 07 July 2011
CurrentTime: 13:01:50

Exception Details
Exception – Failed to establish a connection to ‘127.0.0.1’.
Exception.Data: ( )
Exception.TargetSite: Void Open(Npgsql.NpgsqlConnector)
Exception.Source: Npgsql
Exception.StackTrace:
at Npgsql.NpgsqlClosedState.Open(NpgsqlConnector context)

NpgsqlException – Failed to establish a connection to ‘127.0.0.1’.
NpgsqlException.Item:
NpgsqlException.Severity:
NpgsqlException.Code:
NpgsqlException.BaseMessage: Failed to establish a connection to ‘127.0.0.1’.
NpgsqlException.Detail:
NpgsqlException.Hint:
NpgsqlException.Position:
NpgsqlException.Where:
NpgsqlException.File:
NpgsqlException.Line:
NpgsqlException.Routine:
NpgsqlException.ErrorSql:
NpgsqlException.Errors: ( Failed to establish a connection to ‘127.0.0.1’. )
ExternalException.ErrorCode: -2147467259 (Unspecified error)
Exception.Data: ( )
Exception.TargetSite: Void Open(Npgsql.NpgsqlConnector)
Exception.Source: Npgsql
Exception.StackTrace:
at Npgsql.NpgsqlClosedState.Open(NpgsqlConnector context)
at Npgsql.NpgsqlConnector.Open()
at Npgsql.NpgsqlConnectorPool.GetPooledConnector(NpgsqlConnection Connection)
at Npgsql.NpgsqlConnectorPool.RequestPooledConnectorInternal(NpgsqlConnection Connection)
at Npgsql.NpgsqlConnectorPool.RequestPooledConnector(NpgsqlConnection Connection)
at Npgsql.NpgsqlConnection.Open()
at FranticX.Database.DatabaseService.Open()
at Deadline.Statistics.SlaveStatistics.DeleteSlaveStats(StatisticDatabase db, DateTime entryDateTime)
at Deadline.Storage.ReportStorage.DeleteSlaveStats(DateTime entryDateTime)
at Deadline.Controllers.DeadlineController.DeleteSlaveStats(DateTime entryDateTime)
at Deadline.Controllers.DeadlineController.PurgeOldStatistics()
at Deadline.Scheduling.HouseCleaning.DoPeriodicHouseCleaning(DeadlineNetworkSettings networkSettings, Boolean verbose, Boolean quitRequested, Boolean fromSlave)
at Deadline.Pulses.PulseCleanUpThread.ThreadMain()

ErrorReporting Watches
“Version”: v5.1.0.44844 R[String]

Trace2Cache:
Pulse Disk Space Limit: 29.297 GB
Total Disk Space: 596.17 GB
Available Disk Space: 591.287 GB
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Pulse Disk Space Limit: 29.297 GB
Total Disk Space: 596.17 GB
Available Disk Space: 591.287 GB
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging old job reports
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Checking repository integrity
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging limits
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging trash
Pulse Disk Space Limit: 29.297 GB
Total Disk Space: 596.17 GB
Available Disk Space: 591.287 GB
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging repository temp files
Purging trash
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging repository temp files
Deleted 5 previously locked and then ignored files.
Purging old statistics
Purging slave statistics that are older than Jun 07/11 13:02:47
Clean Up Thread - Unexpected Exception Occured: Failed to establish a connection to ‘127.0.0.1’. (Npgsql.NpgsqlException)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging old job auxiliary files
Listener - WARNING: received request from 10.0.33.93, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.75, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.80, but the Slave is running a different version (4.1.0.43205)
Listener - WARNING: received request from 10.0.33.178, but the Slave is running a different version (4.1.0.43205)
Purging old statistics
Purging slave statistics that are older than Jun 07/11 13:07:06
Clean Up Thread - Unexpected Exception Occured: Failed to establish a connection to ‘127.0.0.1’. (Npgsql.NpgsqlException)

0 Other Running Threads

Process Threads
2580 Wait UserRequest
1916 Wait UserRequest
2952 Wait UserRequest
952 Wait LpcReceive
2924 Wait ExecutionDelay
2512 Wait ExecutionDelay
1736 Wait UserRequest
2532 Wait UserRequest
3000 Wait ExecutionDelay
2192 Wait UserRequest
524 Wait UserRequest
2484 Wait UserRequest
1904 Wait LpcReceive
2756 Wait UserRequest
3044 Wait ExecutionDelay
1688 Running
2432 Wait ExecutionDelay
992 Wait ExecutionDelay
1352 Wait UserRequest
1116 Wait UserRequest
3028 Wait UserRequest
980 Wait UserRequest
3064 Wait UserRequest
1640 Wait UserRequest
1232 Wait UserRequest
1756 Wait UserRequest
2320 Wait UserRequest
2052 Wait ExecutionDelay
1356 Wait UserRequest
3016 Wait UserRequest
1488 Wait EventPairLow
1168 Wait EventPairLow
844 Wait UserRequest
1368 Wait EventPairLow
2148 Wait UserRequest

Memory and CPU Stats
GC.TotalMemory: 11.073 MB
Environment.WorkingSet: 102.695 MB
ComputerSystem.TotalPhysicalMemory: 11.999 GB
ComputerSystem.FreePhysicalMemory: 11.04 GB

Application Information
Application.ExecutablePath: C:\Program Files (x86)\Prime Focus\Deadline\bin\deadlinepulse.exe
Application.CurrentDirectory: C:\Program Files (x86)\Prime Focus\Deadline\bin
Application.StartupPath: C:\Program Files (x86)\Prime Focus\Deadline\bin
Application.ProductName: Deadline Pulse 5.1
Application.ProductVersion: 5.1.0.44844 File.GetLastWriteTime( Application.ExecutablePath ): 06/24/2011 15:16:39

Assembly Information (Executing)
ExecutingAssembly.CodeBase: file:///C:/Program Files (x86)/Prime Focus/Deadline/bin/franticx.DLL
ExecutingAssembly.Location: C:\Program Files (x86)\Prime Focus\Deadline\bin\franticx.dll
ExecutingAssembly.GlobalAssemblyCache: False File.GetLastWriteTime( ExecutingAssembly.Location ): 06/24/2011 15:16:39

Assembly Information (Current)
CurrentAssembly.CodeBase: file:///C:/Program Files (x86)/Prime Focus/Deadline/bin/franticx.DLL
CurrentAssembly.Location: C:\Program Files (x86)\Prime Focus\Deadline\bin\franticx.dll
CurrentAssembly.GlobalAssemblyCache: False File.GetLastWriteTime( CurrentAssembly.Location ): 06/24/2011 15:16:39

Thread Information
CurrentThread.Name: PulseCleanUpThread
CurrentThread.Priority: AboveNormal

Process Information
CurrentProcess.Name: deadlinepulse
CurrentProcess.MachineName: .
CurrentProcess.MainModule: System.Diagnostics.ProcessModule (deadlinepulse.exe)
CurrentProcess.StartDate: Tuesday, 05 July 2011
CurrentProcess.StartTime: 08:46:59
CurrentProcess.UserProcessorTime: 1.041 hrs
CurrentProcess.TotalProcessorTime: 3.235 hrs

Operating System Information
Environment.OSVersion.Platform: Windows XP64 (SP1)
Environment.OSVersion.Version: 5.2.3790.65536

.NET Platform Information
Environment.Version.Major: 2
Environment.Version.Minor: 0
Environment.Version.Build: 50727
Environment.Version.Revision: 3053

Misc Environment Information
Environment.MachineName: pulse-van-1
Environment.UserName: slave
Environment.SystemDirectory: C:\WINDOWS\system32
Environment.TickCount: 1310081406

Command Line
Environment.CommandLine: “C:\Program Files (x86)\Prime Focus\Deadline\bin\deadlinepulse.exe”
Environment.CommandLineArgs[0]: C:\Program Files (x86)\Prime Focus\Deadline\bin\deadlinepulse.exe

Current Call Stack
Environment.StackTrace: at System.Environment.get_StackTrace()
at FranticX.Diagnostics.Reporting.ErrorReporting.GetExceptionReport(Exception e, ErrorReportDetail detail)
at FranticX.Diagnostics.Reporting.ErrorReporting.WriteExceptionReport(Exception ex)
at FranticX.Diagnostics.Reporting.ErrorReporting.RecordException(Exception e)
at Deadline.Pulses.PulseCleanUpThread.ThreadMain()
at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
at System.Threading.ThreadHelper.ThreadStart()
[/code]

Yeah, that’s definitely from an older version of 5.1, so there wouldn’t be anything relevant in there.

Note that Deadline just places these exception dumps in the temp folder. To view the actual Pulse logs, select Help -> Explore Log Folder from the Pulse application. If you want to collect the Pulse logs over the last few days and post them, we can look for anything out of the ordinary.

Attached is the logs from the last week or so…

Thanks!
pf-deadlinepulse-logs.rar (138 KB)

Thanks! I don’t see anything in the logs that stands out. There was a logon error in the latest pulse log, but that exception is handled, so it wouldn’t bring a thread down.

If it happens again, let us know and we’ll see what we can do to investigate further.

Cheers,

  • Ryan
Privacy | Site terms | Cookie preferences