
Potential crasher bug in lightning.dlx

Hi there,

We have a pretty critical but very, VERY elusive issue that we have been trying to troubleshoot for more than a week now. Until now we thought it was a problem in the scenes themselves, but rendering the same files on assburner (using 3dsmaxcmd.exe), they are fine.

The problem happens mid-render. The file is processing, chugging away doing its thing, then all of a sudden max disappears.
In max.log, this is what we see:

2013/10/31 15:06:56 ERR: An unexpected exception has occurred in the network renderer and it is terminating.

In the slave log (the scLog lines are our internal progress report functions):

2013-10-31 15:06:57:  0: INFO: scLog: info 1016f took 6991 ms
2013-10-31 15:06:57:  0: INFO: scLog: info  slidertime: 1015f anim range: 900f 1300f
2013-10-31 15:06:57:  0: INFO: scLog: info calling setProgress with: 28.75
2013-10-31 15:06:57:  0: WARNING: Monitored managed process 3dsmaxProcess is no longer running
2013-10-31 15:06:58:  0: An exception occurred: Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.)
2013-10-31 15:06:58:     at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)

We went through everything: network connectivity issues, bad configs, network cables, whatever you can think of, but found no pattern.

Today Chris Deiss attached a debugger to a job that was crashing like this and let it run during the process. He noticed that some exceptions are thrown regularly, but I guess sometimes that's OK.

However, they might be connected somehow to a socket connection issue between Lightning and the slave, and I suspect that it can actually crash max.
The exceptions that repeat regularly every couple of seconds:

First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00126f68…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00126f68…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: frantic::win32::simple_socket_timeout_error at memory location 0x001273d0…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: frantic::win32::simple_socket_timeout_error at memory location 0x00127760…

I think the ConversionError exceptions are from our progress report MAXScript, which we will take out to see if it makes those exceptions go away. However, there are also some socket timeout errors in there from the frantic libraries.

Just before 3dsmax disappears, we get this:
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00127518…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00126f68…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000…
First-chance exception at 0x000007fefd23cacd in 3dsmax.exe: Microsoft C++ exception: ConversionError at memory location 0x00126f68…
The thread 'Win64 Thread' (0x4dd4) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x40e0) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x2140) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x5b4) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x10f4) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x39a4) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x11a0) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x4da8) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x3104) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x4d84) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x1d60) has exited with code -1 (0xffffffff).
The thread 'Win64 Thread' (0x2920) has exited with code -1 (0xffffffff).

Note the missing socket errors… Could it be crashing there? After that all the threads quit with exit code -1, and max goes missing…

Hey Laszlo,

The socket-related first-chance exceptions showing up aren't super concerning, since we're expecting (and handling) timeouts in a lot of spots. Figuring out where exactly this is happening might be a good hint, though. Can you send us the full log for this job? There might be some more hints in there as to what's going on. At what point in the rendering process does this typically happen? Does it happen right away-ish, or can renders be chugging along for a while before crashing?

Cheers
Jon

This one in particular is a TP (Thinking Particles) cache job, which renders a single frame. In a pre-render callback, the TP simulation is triggered and a time callback is registered.

That time callback basically provides progress updates to Deadline and to our logs. It seems that if we remove that time callback, these exceptions go away. We will test overnight whether that also means the jobs stop crashing…

If they do, it might be a problem with the DeadlineUtil.SetProgress function.

Well, removing the progress updates seems to have fixed this issue. So it seems that the Deadline progress update notification can destabilize max :(

We are getting weird random crashes in other render jobs too, but it's much less reproducible.

We did get about 10-15 of these:
Error in RenderTasks: Timed out waiting for the next progress update. The ProgressUpdateTimeout setting can be modified in the 3dsmax plugin configuration.

But that's simply because we removed the progress updates and the timeout was too low… The random crashes would happen at random points, sometimes at 2% completion, sometimes at 50%, 78%, etc. And they happened around 90% of the time, so we would definitely have seen a bunch by the morning if it was still occurring.

Did another test: we are now simply calling SetProgress with a static value, and no other calculation is done in the time callback function. We got crashes right away…

So I think something really bad is happening within DeadlineUtil.SetProgress… :-\
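
For reference, the stripped-down callback we are testing with looks roughly like this (just a sketch; fnProgressOnly is a made-up name, the point is that SetProgress is the only call left in the time callback):

fn fnProgressOnly =
(
    -- no logging, no frame math: just report a constant value to Deadline
    DeadlineUtil.SetProgress (42.0)
)
registerTimeCallback fnProgressOnly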

Thanks for the information! We’ll look into it.

Could you potentially try reducing the frequency of the SetProgress call? Just curious if the problem is calling it too many times, or calling it in general.

We are calling it when the simulation advances a frame, but at most once per second.

So it's usually once per frame (the frames can take 5-10 minutes each to process).

I think calling it at all is the problem; there might be something in there that doesn't handle memory properly, because it seems random. A job could fail on random frames 10 times, then finish on the 11th.
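
For context, the throttling in our real callback is roughly like this (a sketch only; lastProgressTime, lastProgressFrame and fnThrottledProgress are illustrative names, timeStamp() is the built-in millisecond timer):

global lastProgressTime = 0
global lastProgressFrame = -1f

fn fnThrottledProgress =
(
    local now = timeStamp() -- milliseconds since max started
    -- only report when the frame has advanced, and at most once per second
    if (currentTime != lastProgressFrame) and ((now - lastProgressTime) >= 1000) do
    (
        lastProgressFrame = currentTime
        lastProgressTime = now
        local done = (currentTime - animationRange.start) as float
        local total = (animationRange.end - animationRange.start) as float
        DeadlineUtil.SetProgress (100.0 * done / total)
    )
)
registerTimeCallback fnThrottledProgress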

What I’m currently doing as a test is modifying the customize.ms file in the 3dsmax plugin folder in the repository to just loop indefinitely and set progress back and forth between two values:

	DeadlineUtil.LogMessage ">TESTING STABILITY:"
	local ready = false
	local toggle = false
	while not ready do
	(
		sleep 0.10
		if toggle then
		(
			DeadlineUtil.SetProgress( 33.3333333333 )
		)
		else
		(
			DeadlineUtil.SetProgress( 66.6666666666 )
		)
		toggle = (not toggle)
	)

It’s been running for a few minutes now with no errors. I’ll let it go a bit longer, then look at moving it into a pre-render/time callback.

Cool! We have a pre-render callback that sets up a time callback function using the registerTimeCallback call. SetProgress is called in there. Something like this:

contents of TPCACHE.ms:

global scLog
if(scLog == undefined) then
	filein @"SC_Log.ms"

fn fnTimeChange =
(
	try (
		DeadlineUtil.SetProgress (33.33)
	) catch (scLog.log ("exception : " + (getCurrentException())))
)

struct TPCacheToNetworkStruct
(
	fn fnCache =
	(
		registerTimeCallback fnTimeChange
		result = mDynamicSet.cacherecord localTPSFile false false -- <- do TP caching here
	)
)

TPCacheToNetwork = TPCacheToNetworkStruct()

Then a pre-render callback is assigned to the job:

script="Global TPCacheToNetwork\n"
script+="filein @\"TPCACHE.ms\"\n"
script+="TPCacheToNetwork.fnCache()\n"

So the pre-render callback is saved into the file. When Deadline gets to rendering, it calls that, which in turn fileIn's the callback handler script and registers the time change callback, which then sets the progress.

I can’t get it to happen with a time callback either. Here’s the script I’m registering with the pre-render callback:

fn setProgressTest =
(
    DeadlineUtil.LogMessage ">TESTING STABILITY:"
    local ready = false
    local toggle = false
    local counter = 0
    while not ready do
    (
        sleep 0.10
        if toggle then
        (
            DeadlineUtil.SetProgress( 33.3333333333 )
        )
        else
        (
            DeadlineUtil.SetProgress( 66.6666666666 )
        )
        toggle = (not toggle)
        
        counter = counter + 1
        if counter > 50 do
        (
            ready = true
        )
    )
)

registerTimeCallback setProgressTest

If you attach a debugger to max, are you seeing those first-chance exceptions?

Yes, I did see those, and they didn’t seem to cause any issues. However, I discovered a bug in the UpdateProgress function that caused them to appear as often as they do, so that will be fixed in beta 10. They will still appear at the start of every task (mainly due to the way the listening socket in Lightning works), but they should not be appearing as your script updates progress.

When you upgrade to beta 10, it will be interesting to see if this helps the stability issues you are seeing.

Cheers,

Ryan

Cool, we will still try to narrow it down on our end; we are isolating the minimum code necessary to get it to crash (takes a while).

It seems that it's somehow related to other logging. If we remove all logging from the progress update function and just leave SetProgress, we get no crashes. Still trying to narrow it down further… Our logging library is a bit complicated (it's a generic lib used in all our applications): it generates external files via Python libraries, prints to stdout and to the MAXScript listener, manages archival of old unused log files, etc. Any of those processes could cause max to crash, I guess.
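
The way we are isolating it is basically to re-enable one logging path at a time inside the time callback and wait for a crash. Something like this (a sketch only; the enable* flags are made-up names, and scLog is our internal library from the snippet above):

global enableFileLog  = false -- external log files written via the Python-backed lib
global enableListener = false -- prints to the MAXScript listener
global enableStdOut   = false -- prints to stdout / the slave log

fn fnTimeChange =
(
    try (
        DeadlineUtil.SetProgress (33.33)
        -- flip these on one at a time to find which path takes max down
        if enableFileLog do scLog.log "progress tick"
        if enableListener do format "progress tick\n"
        if enableStdOut do DeadlineUtil.LogMessage "progress tick"
    ) catch (
        DeadlineUtil.LogMessage ("exception : " + (getCurrentException()))
    )
)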
