Hi there,
I hope I'm wrong, but it seems the issue of 3ds Max hanging on status updates may be back in beta 11?
We have loads of machines hanging at 0% CPU usage. They report that they are working on a job, but they do not actually appear in that job's task list at all…
The slave log has an endless list of:
2013-11-25 11:02:26: 0: INFO: Loading bitmaps…
2013-11-25 11:02:28: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:02:30: 0: INFO: Updating transformation matrices
2013-11-25 11:02:53: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:03:12: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:03:29: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:03:46: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:04:02: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:04:19: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 11:04:38: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
…
2013-11-25 18:41:46: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 18:42:02: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 18:42:21: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
2013-11-25 18:42:41: An error occurred while checking task status: Object reference not set to an instance of an object. (System.NullReferenceException)
If I kill the Max process manually, the slave says this:
0: WARNING: Monitored managed process 3dsmaxProcess is no longer running
0: An exception occurred: Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.)
at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)
0: Unloading plugin: 3dsmax
Scheduler Thread - Render Thread 0 threw a major error:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Exception Details
RenderPluginException -- Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.)
at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
RenderPluginException.Cause: JobError (2)
RenderPluginException.Level: Major (1)
RenderPluginException.HasSlaveLog: True
Exception.Data: ( )
Exception.TargetSite: Void RenderTask(System.String, Int32, Int32)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Plugins.Plugin.RenderTask(String taskId, Int32 startFrame, Int32 endFrame)
at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask(TaskLogWriter tlw)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - Exception occurred while trying to requeue task.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Exception Details
NullReferenceException -- Object reference not set to an instance of an object.
Exception.Data: ( )
Exception.TargetSite: Boolean WasTaskRequeued(System.String, Deadline.Jobs.Task, System.String ByRef)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.StorageDB.MongoDB.MongoJobStorage.WasTaskRequeued(String jobID, Task task, String& reason)
at Deadline.Controllers.DataController.RequeueTask(Job job, Task task)
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask(Job currentJob, Task task)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - It is likely that the Slave cannot connect to the Repository, which means that the network may be down or the Repository machine is offline.
Scheduler Thread - The Slave cannot continue until this operation has completed successfully.
Scheduler Thread - Waiting 20 seconds before retrying...
Connecting to slave log: LAPRO1261
---- 2013/11/25 18:45 ----
Scheduler Thread - Exception occurred while trying to requeue task.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Exception Details
NullReferenceException -- Object reference not set to an instance of an object.
Exception.Data: ( )
Exception.TargetSite: Boolean WasTaskRequeued(System.String, Deadline.Jobs.Task, System.String ByRef)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.StorageDB.MongoDB.MongoJobStorage.WasTaskRequeued(String jobID, Task task, String& reason)
at Deadline.Controllers.DataController.RequeueTask(Job job, Task task)
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask(Job currentJob, Task task)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - It is likely that the Slave cannot connect to the Repository, which means that the network may be down or the Repository machine is offline.
Scheduler Thread - The Slave cannot continue until this operation has completed successfully.
Scheduler Thread - Waiting 20 seconds before retrying...
Listener Thread - 172.18.4.46 has connected
Listener Thread - Received message: StreamLog
Listener Thread - Responded with: Success
Scheduler Thread - Exception occurred while trying to requeue task.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Exception Details
NullReferenceException -- Object reference not set to an instance of an object.
Exception.Data: ( )
Exception.TargetSite: Boolean WasTaskRequeued(System.String, Deadline.Jobs.Task, System.String ByRef)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.StorageDB.MongoDB.MongoJobStorage.WasTaskRequeued(String jobID, Task task, String& reason)
at Deadline.Controllers.DataController.RequeueTask(Job job, Task task)
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask(Job currentJob, Task task)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - It is likely that the Slave cannot connect to the Repository, which means that the network may be down or the Repository machine is offline.
Scheduler Thread - The Slave cannot continue until this operation has completed successfully.
Scheduler Thread - Waiting 20 seconds before retrying...
Scheduler Thread - Exception occurred while trying to requeue task.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Exception Details
NullReferenceException -- Object reference not set to an instance of an object.
Exception.Data: ( )
Exception.TargetSite: Boolean WasTaskRequeued(System.String, Deadline.Jobs.Task, System.String ByRef)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.StorageDB.MongoDB.MongoJobStorage.WasTaskRequeued(String jobID, Task task, String& reason)
at Deadline.Controllers.DataController.RequeueTask(Job job, Task task)
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask(Job currentJob, Task task)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - It is likely that the Slave cannot connect to the Repository, which means that the network may be down or the Repository machine is offline.
Scheduler Thread - The Slave cannot continue until this operation has completed successfully.
Scheduler Thread - Waiting 20 seconds before retrying...
It's not recovering…
We have hundreds of machines erroring like this… Only force-restarting the slave fixes it.
Another error I have found is Max hanging, with the following message in the log:
0: INFO: Preparing direct light manager…
0: INFO: Preparing global light manager…
---- 2013/11/25 16:23 ----
Info Thread - Could not check if Pulse is running because: The requested address is not valid in its context 255.255.255.255:17062
---- 2013/11/25 19:08 ----
Or
---- 2013/11/25 14:33 ----
0: INFO: Preparing direct light manager…
0: INFO: Preparing global light manager…
---- 2013/11/25 14:53 ----
Info Thread - Could not check if Pulse is running because: Unable to write data to the transport connection: An established connection was aborted by the software in your host machine.
---- 2013/11/25 19:08 ----
Hey Laszlo,
We figured out what was causing this NullReferenceException error, and it will be fixed in beta 12.
The problem only happens if a slave is working on a particular task for a job, and someone changes the frame range of that job, resulting in that particular task no longer existing. It doesn’t happen if the job is deleted, because that is detected before entering the block of code where this error can occur.
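To illustrate the kind of guard involved, here is a minimal Python sketch. It is not the actual Deadline code: `JobStorage`, `find_task`, `was_task_requeued`, and the status strings are hypothetical names chosen for illustration. It assumes the failure comes from using a task lookup result that is missing once the frame range change has removed the task, and it treats that case as "the task is gone" instead of dereferencing it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Task:
    task_id: str
    status: str = "Rendering"  # hypothetical status values
    slave_name: str = ""


@dataclass
class JobStorage:
    """Hypothetical in-memory stand-in for the job/task store."""
    tasks: Dict[Tuple[str, str], Task] = field(default_factory=dict)

    def find_task(self, job_id: str, task_id: str) -> Optional[Task]:
        # Returns None when the task does not exist (any more).
        return self.tasks.get((job_id, task_id))


def was_task_requeued(storage: JobStorage, job_id: str, task: Task,
                      slave_name: str) -> Tuple[bool, str]:
    """Return (requeued, reason) for the task this slave thinks it owns."""
    stored = storage.find_task(job_id, task.task_id)
    if stored is None:
        # The task vanished, e.g. because the job's frame range was changed.
        # Report it as no longer ours instead of touching a missing object.
        return True, "task no longer exists"
    if stored.status == "Queued":
        return True, "task was requeued"
    if stored.slave_name != slave_name:
        return True, "task was picked up by another slave"
    return False, ""


if __name__ == "__main__":
    storage = JobStorage()
    storage.tasks[("job1", "0")] = Task("0", "Rendering", "LAPRO1261")
    print(was_task_requeued(storage, "job1", Task("0"), "LAPRO1261"))
    # Simulate a frame range change that removes the task.
    del storage.tasks[("job1", "0")]
    print(was_task_requeued(storage, "job1", Task("0"), "LAPRO1261"))
```

With a guard like this, a slave in the situation described above would be told to drop the task and move on, rather than retrying the same failing check every 20 seconds.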
Thanks for reporting this, and sorry for the inconvenience!
Cheers,