AWS Thinkbox Discussion Forums

slave inaccessible, task not stalled

One of our slaves stalled overnight, but Deadline did not mark the task as stalled. Is that normal? In the Monitor, the frame has been rendering for 14 hours and counting.

The machine does not respond to ping, but for some odd reason I can remote into it, and I see this in the machine log:

Scheduler Thread - Render Thread 0 threw a major error:

Exception Details
RenderPluginException – Error in RenderTasks: Deadline caught error “Warning: Could not open file. : //inferno2/projects/tboa/scenes/SHR_shr_rsrc/cache/maya/rig/vehPersianShipA/v0052_jsu_breakoutChange/anim_proxy.mb”. If this error message is unavoidable but not fatal to your render, please email support@thinkboxsoftware.com with the error message, and disable the Maya job setting Strict Error Checking.
at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
RenderPluginException.Cause: JobError (2)
RenderPluginException.Level: Major (1)
RenderPluginException.HasSlaveLog: True
Exception.Data: ( )
Exception.TargetSite: Void RenderTask(System.String, Int32, Int32)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Plugins.Plugin.RenderTask(String taskId, Int32 startFrame, Int32 endFrame)
at Deadline.Slaves.SlaveRenderThread.a(TaskLogWriter A_0)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Error occurred while writing report log:

Exception Details
IOException – The network path was not found.
Exception.Data: ( )
Exception.TargetSite: Void WinIOError(Int32, System.String)
Exception.Source: mscorlib
Exception.StackTrace:
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite)
at Deadline.StorageDB.JobStorage.WriteJobReportFile(Report report, String reportLog)
An error occurred while saving job report: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
Error occurred while writing report log:

Exception Details
IOException – The network path was not found.
Exception.Data: ( )
Exception.TargetSite: Void WinIOError(Int32, System.String)
Exception.Source: mscorlib
Exception.StackTrace:
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite)
at Deadline.StorageDB.SlaveStorage.WriteSlaveReportFile(Report report, String reportLog)
An error occurred while saving slave report: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
Scheduler Thread - Unexpected Error Occured While Handling Exception

Exception Details
SocketException – No such host is known
SocketException.ErrorCode: 11001 (No such host is known)
SocketException.SocketErrorCode: HostNotFound (11001)
Win32Exception.NativeErrorCode: 11001
Exception.Data: ( )
Exception.TargetSite: System.Net.IPHostEntry InternalGetHostByName(System.String, Boolean)
Exception.Source: System
Exception.StackTrace:
at System.Net.Dns.InternalGetHostByName(String hostName, Boolean includeIPv6)
at System.Net.Dns.GetHostAddresses(String hostNameOrAddress)
at MongoDB.Driver.MongoServerAddress.ToIPEndPoint(AddressFamily addressFamily)
at MongoDB.Driver.MongoServerInstance.GetIPEndPoint()
at MongoDB.Driver.Internal.MongoConnection.Open()
at MongoDB.Driver.Internal.MongoConnection.SendMessage(MongoRequestMessage message, WriteConcern writeConcern, String databaseName)
at MongoDB.Driver.Internal.MongoConnection.RunCommand(String databaseName, QueryFlags queryFlags, CommandDocument command, Boolean throwOnError)
at MongoDB.Driver.MongoServerInstance.Ping(MongoConnection connection)
at MongoDB.Driver.MongoServerInstance.Connect()
at MongoDB.Driver.Internal.DirectMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)

MongoConnectionException – Unable to connect to server deadline.scanlinevfxla.com:27017: No such host is known.
Exception.Data: ( System.Collections.DictionaryEntry )
Exception.TargetSite: Void Connect(System.TimeSpan, MongoDB.Driver.ReadPreference)
Exception.Source: MongoDB.Driver
Exception.StackTrace:
at MongoDB.Driver.Internal.DirectMongoServerProxy.Connect(TimeSpan timeout, ReadPreference readPreference)
at MongoDB.Driver.Internal.DirectMongoServerProxy.ChooseServerInstance(ReadPreference readPreference)
at MongoDB.Driver.MongoServer.AcquireConnection(MongoDatabase database, ReadPreference readPreference)
at MongoDB.Driver.MongoCursorEnumerator`1.AcquireConnection()
at MongoDB.Driver.MongoCursorEnumerator`1.GetFirst()
at MongoDB.Driver.MongoCursorEnumerator`1.MoveNext()
at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
at Deadline.StorageDB.MongoDB.MongoJobStorage.GetJob(String jobID, Boolean invalidateCache)

DatabaseConnectionException – An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues.
Exception.Data: ( )
Exception.TargetSite: Void a(MongoDB.Driver.MongoDatabase, System.Exception)
Exception.Source: deadline
Exception.StackTrace:
at e.a(MongoDatabase A_0, Exception A_1)
at Deadline.StorageDB.MongoDB.MongoJobStorage.GetJob(String jobID, Boolean invalidateCache)
at Deadline.Slaves.SlaveSchedulerThread.a(Int32 A_0, Task A_1, TimeSpan A_2, Exception A_3, AbortLevel A_4)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Error occurred while writing report log:

Exception Details
IOException – The network path was not found.
Exception.Data: ( )
Exception.TargetSite: Void WinIOError(Int32, System.String)
Exception.Source: mscorlib
Exception.StackTrace:
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite)
at Deadline.StorageDB.JobStorage.WriteJobReportFile(Report report, String reportLog)
An error occurred while saving job report: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
Error occurred while writing report log:

Exception Details
IOException – The network path was not found.
Exception.Data: ( )
Exception.TargetSite: Void WinIOError(Int32, System.String)
Exception.Source: mscorlib
Exception.StackTrace:
at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
at System.IO.File.InternalCopy(String sourceFileName, String destFileName, Boolean overwrite)
at Deadline.StorageDB.SlaveStorage.WriteSlaveReportFile(Report report, String reportLog)
An error occurred while saving slave report: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
Scheduler Thread - >>> UNABLE TO CONFIRM CURRENT LICENSE, SKIPPING TASK DEQUEUING
Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
----2013/02/06 05:27 ----
Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
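
The log shows two different failures: the DNS lookup for the database host fails ("No such host is known"), and the network path to the repository is not found. A quick way to tell these apart from the affected machine is to test name resolution and the TCP connection as separate steps. The following is only a rough sketch, not part of Deadline; the function name and its `resolve`/`connect` parameters are made up for illustration, and the host and port are the ones from the log.

```python
import socket


def check_database_reachable(host, port, timeout=3.0,
                             resolve=socket.getaddrinfo,
                             connect=socket.create_connection):
    """Report whether DNS resolution or the TCP connection is the failing step.

    `resolve` and `connect` are hypothetical injection points so the check
    can be exercised without touching the network.
    """
    # Step 1: name resolution. A failure here corresponds to the
    # "No such host is known" SocketException in the log.
    try:
        resolve(host, port)
    except socket.gaierror as exc:
        return f"DNS lookup failed for {host}: {exc}"

    # Step 2: TCP connection to the database port.
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError as exc:
        return f"resolved {host}, but TCP connect to port {port} failed: {exc}"

    conn.close()
    return "ok"
```

Running `check_database_reachable("deadline.scanlinevfxla.com", 27017)` on the slave would show whether the name still fails to resolve or whether only the connection is refused.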

cheers,
laszlo

Are you guys currently running Pulse? I can’t remember.

While Pulse isn’t necessary, it does things like stalled slave detection much more frequently. In this case, the slave should have been marked as stalled and the task should have been requeued.
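
To illustrate the idea (this is not Deadline's or Pulse's actual code), stalled slave detection can be thought of as comparing each slave's last status update against a timeout and requeuing whatever those slaves were rendering. The function names, the 10 minute timeout, and the dictionary layout below are all assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def find_stalled_slaves(last_update, now, timeout=timedelta(minutes=10)):
    """Return the names of slaves whose last status update is older than `timeout`.

    `last_update` maps a slave name to the time of its last update
    (a hypothetical structure, not Deadline's storage format).
    """
    return sorted(name for name, seen in last_update.items()
                  if now - seen > timeout)


def requeue_tasks(tasks, stalled):
    """Put every rendering task held by a stalled slave back in the queue."""
    requeued = []
    for task in tasks:
        if task["status"] == "Rendering" and task["slave"] in stalled:
            task["status"] = "Queued"
            task["slave"] = None
            requeued.append(task["id"])
    return requeued
```

With a check like this, a slave that has been silent for 14 hours would be flagged on the first pass after the timeout, and its frame would go back to the queue for another machine to pick up.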

We will do more testing on our end though to make sure this is functioning properly.

Cheers,
- Ryan

Yes, Pulse is running on the same server where the MongoDB database lives.
