Win7-64 Ultimate, newly installed SP1 > a lot of Deadline hangs

I recently, finally, installed the first big service pack for Win7 (I run 64-bit Ultimate).
Since then I get proportionally more hangs in Deadline: the slave is not picking up jobs although it is started, or it hangs at “Starting Up” forever and never actually picks up the job.
I also have .NET Framework issues when simulating FumeFX since then, but I don’t think this is directly connected to the Deadline issue of slaves not starting right or picking up jobs beyond the “Starting Up” point(?).

Are there any known issues with SP1 of Win7? (asking bravely without having used the search function of the forum)

If SP1 reeks of trouble… any tips on how to roll back safely?

thanks in advance,
Anselm

Hi,
Make sure you use Fume v2.1c or newer if you have SP1 installed, otherwise you will have DLL hell.
Also make sure you don’t have any duplicate DLLs left over from any uninstall/re-install of Fume.
We found that sometimes there would be a duplicate voxelflow.dll left over.
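If you want to double-check for leftovers, here is a rough Python sketch that walks a directory tree and flags any plugin DLL filename that appears in more than one place. The Max install path in the usage comment is just an example, and the helper name is my own, not anything from Fume or Deadline:

```python
import os
from collections import defaultdict

def find_duplicate_dlls(root):
    """Map each plugin filename (lowercased) to every path it appears at under root,
    keeping only names that show up more than once."""
    seen = defaultdict(list)
    for dirpath, _, files in os.walk(root):
        for name in files:
            # .dll plus the Max plugin extensions .dlo / .dlr
            if name.lower().endswith((".dll", ".dlo", ".dlr")):
                seen[name.lower()].append(os.path.join(dirpath, name))
    return {name: paths for name, paths in seen.items() if len(paths) > 1}

# Example (path is an assumption -- point it at your own Max install):
# for name, paths in find_duplicate_dlls(r"C:\Program Files\Autodesk\3ds Max 2012").items():
#     print(name, paths)
```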
Otherwise, I’ve got quite a few machines on SP1 now and it all seems to be OK. Any errors I have seen recently have all been artist-driven ‘bad’ files.
HTH,
Mike

What about the “Starting Up” hangs? Those render scenes don’t even have FumeFX in them…?

Which version of Deadline are you running? We have SP1 installed on some of our Windows 7 machines and we have never seen this problem.

Can you post a screenshot of the slave when it hangs at startup? Also, when the slave is stuck like this, can you check to see if it produced a log? You can open the log folder from the Launcher on the machine by selecting Explore Log Folder. If it does produce a log, can you post that too?
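If digging through the folder by hand is a pain, a quick Python sketch like this will print the tail of the newest log file. The log folder path varies per machine, so treat the path in the comment as a placeholder; use Explore Log Folder to find yours:

```python
import glob
import os

def tail_newest_log(log_dir, n=10):
    """Return the last n lines of the most recently modified *.log file in log_dir,
    or an empty list if no logs exist."""
    logs = glob.glob(os.path.join(log_dir, "*.log"))
    if not logs:
        return []
    newest = max(logs, key=os.path.getmtime)
    with open(newest, "r", errors="replace") as f:
        return f.read().splitlines()[-n:]

# Example (hypothetical path -- substitute your own slave's log folder):
# print("\n".join(tail_newest_log(r"C:\Users\Anselm\AppData\Local\Thinkbox\Deadline\logs")))
```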

Thanks!

  • Ryan

Thank you Ryan!

Here is the last log the machine produced:

[code]Thread: 14392
running daemon…

Got command: IgnoreMissingExternalFiles

Command: IgnoreMissingExternalFiles
IgnoreMissingExternalFiles

Got command: IgnoreMissingUVWs

Command: IgnoreMissingUVWs
IgnoreMissingUVWs

Got command: IgnoreMissingXREFs

Command: IgnoreMissingXREFs
IgnoreMissingXREFs

Got command: StartJob,"C:/Users/Anselm/AppData/Local/Thinkbox/Deadline/slave/jobsData/SpellChase06_01.max",""

Command: StartJob,"C:/Users/Anselm/AppData/Local/Thinkbox/Deadline/slave/jobsData/SpellChase06_01.max",""
StartJob
C:/Users/Anselm/AppData/Local/Thinkbox/Deadline/slave/jobsData/SpellChase06_01.max

StartJob begin
setting up default actions
Ignoring missing external file errors
Ignoring missing UVW cooridinate errors
Not ignoring missing DLL errors
Ignoring missing XREF errors
Checking max file validity…
Trying to load max file…[/code]

The version of Deadline is: 5.0.0.44528
I run it in free mode with just 2 machines.

So “Trying to load max file…” is the last entry. The file it tried to load is only 3.66 MB, so it can’t be bandwidth, IMO. This file is for a cache job. Render jobs go through fine.

That’s interesting that it’s a specific type of job (caching job in this case) that is causing you grief. When Deadline is loading the Max file, it is just calling a Max SDK function to load the file, so something is causing Max to stall out under the hood.

Maybe something to try is to see if the problem occurs outside of Deadline. You could copy the scene file to the machine, then open a command prompt on the render node and try rendering it with 3dsmaxcmd.exe. For example:

"C:\program files\autodesk\3ds max 2012\3dsmaxcmd.exe" "c:\temp\SpellChase06_01.max"

If that also stalls out, then at least we know it’s not specific to Deadline.

Cheers,

  • Ryan

It happens with most of the jobs lately! It’s not a specific scene, unfortunately, and it is doing it on both slaves. If lucky, 1 out of 10 partition jobs starts; the rest hang at the “Starting Up” stage.

But I’m guessing all the scenes use FumeFX? In other words, all non-fumefx related jobs go through fine?

Did you get a chance to test rendering the problematic scenes from the command line?

Now it throws this error:

Scheduler Thread - Cancelling task because task filename "\\Juggernaut\f\DeadlineRepository\jobs\999_050_999_44239781\tasks\999_050_999_44239781_00004_5-5.Rendering.Beastmachine" could not be found, it was likely requeued sending cancel task command to plugin

The directory exists and is where I usually have the repository living. Now a DeadlineRepository2 folder is there as well. I changed the folder the other slave should be looking for to \\Juggernaut\f\DeadlineRepository2, but it still just sits there “starting up”, never actually starting the partitioning.

What’s the OS of the machine that hosts the Repository and other assets? If it’s a non-server version of Windows, maybe you’re hitting the maximum connection limitation that prevents more than 10 files from being opened remotely. That could explain the “lost” task files, as well as the problems with loading the scene file.

Sorry for beating the dead horse here, but I need to know: Did you get a chance to test rendering the problematic scenes from the command line? Maybe try launching 2 command line renders at the same time (one on each slave machine) to more or less replicate what Deadline is trying to do.
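To roughly replicate the two slaves, something like this small Python sketch could kick both renders off at once and wait for them. The 3dsmaxcmd path and scene file in the usage comment are assumptions carried over from earlier in the thread, and the helper itself is just an illustration, not part of Deadline:

```python
import subprocess

def launch_parallel_renders(executable, scenes):
    """Start one command-line render per scene without waiting for each to finish,
    then block until all are done and return their exit codes in order."""
    procs = [subprocess.Popen([executable, scene]) for scene in scenes]
    return [p.wait() for p in procs]

# Hypothetical usage, one local copy of the scene per machine:
# launch_parallel_renders(
#     r"C:\Program Files\Autodesk\3ds Max 2012\3dsmaxcmd.exe",
#     [r"c:\temp\SpellChase06_01.max", r"c:\temp\SpellChase06_01.max"],
# )
```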

Command-line rendering worked! Sorry it took so long.
So I changed both machines to the DeadlineRepository2 and they seem to do things… I copied the whole repository folder to a new location. The weird thing is that I can access the folder from both machines over the network. So it is accessible and of course has read and write rights enabled.

I am running windows 7-64 Ultimate on both boxes.

So on the new repository it seems to work again, BUT what I notice is that the 2 machines still hesitate to pick up jobs and never pick up more than 2 partitioning jobs, although their limit is set to 10 (and “limit by processors” is checked, but they have 16 and 24 cores). All in all, Deadline has slowed down the caching process, not picking up more than 2 cache jobs in parallel and taking a good while to pick up the jobs.

Can you expand on this a bit? Does it take a while for a slave to START a job after it has been submitted, or does it start it right away and then takes a while to load the scene? If possible, could you post screen shots of the slave when it’s “hesitating”? I just want to make sure we’re on the same page.

Is that 2 tasks per machine, or 2 tasks total (ie: each machine is grabbing only one task)? Can you send screen shots of the submission window so we can see your job settings?

Finally, I had asked before which OS the repository is on, as well as the scene assets. You had mentioned you’re running windows 7-64 Ultimate on both boxes, but I need to confirm if the repository and the scene assets are on one of those machines or on a different machine.

Thanks!

  • Ryan

The slaves pick up the job right away and then sit there “starting up”.

The repository is on one of the 2 machines, in a network-accessible directory with read and write rights.

I attached a screenshot of the Krakatoa submission dialogue. This usually (like a month back) would start 10 partitions per machine if set to 20 partitions, e.g.

I didn’t realize you were using Krakatoa until now. :) Are Krakatoa partitioning jobs the only jobs that are affected this way? For example, if you submit a regular render scene, does Deadline load the scene and render it in the expected time? I’m not all that familiar with how Krakatoa submits partitions to Deadline, but we can do some digging.

After you submit the jobs, I’m assuming 2 jobs appear in the Monitor with 10 tasks each? If that’s the case, please right-click on one of the jobs and select Browse Repository Directory. Grab the .job file with the job ID as the file name, zip it up, and post it.

Also, if you could right-click on one of your slave machines (while in super user mode), select Modify Slave Settings, and send us a screen shot of the settings, that would help too! If you could send a screen shot of the slave UI after it has picked up the first task, that could be helpful too.

Cheers,

  • Ryan

Sorry it took so long to get back! My apologies!!!

Yes, that is what is happening when submitting through Deadline. It spawns 2 jobs (1 task per partition) per machine.

I attached the .job file for a partitioning job that only allows 2 caching jobs at a time, although it is set to start as many as 10 at a time. And still, some jobs just sit there, “starting up” or “queued”, for hours sometimes :(
999_050_999_2ed94341.zip (1.33 KB)

Cool, everything looks correct in the job file. I still need the info from these other questions though:

It’s just more pieces to the puzzle, which helps us get a better understanding of the problem.

Thanks!

  • Ryan

The bad news doesn’t seem to stop…

Now I am getting trapped SEH exceptions when using Particle Flow Tools Box #3, which I use in pretty much EVERY one of my scenes:

[code]=======================================================
Error Message

An error occurred in StartJob(): 3dsmax: Trapped SEH Exception in LoadFromFile(): Access Violation
Process: C:\Program Files\Autodesk\3ds Max 2010\3dsmax.exe
Module: C:\Program Files\Autodesk\3ds Max 2010\plugins\ParticleFlowTools\Box3\ParticleFlowSubOperators.dlo
Date Modified: 10/11/2010
Exception Code: C0000005
Read Address: 00000000
Instruction: 48 8B 11 48 8B F1 44 0F 29 40 A8 44 0F 29 48 98
Call Stack:
33940000 C:\Program Files\Autodesk\3ds Max 2010\plugins\ParticleFlowTools\Box3\ParticleFlowSubOperators.dlo
+001415A4 Exception Offset
2011/10/21 15:42:40 INF: Loaded C:/Users/ansi/AppData/Local/Thinkbox/Deadline/slave/jobsData/SandErrosion.max
[/code]

Any idea what could cause this issue? It only appears on 1 of the 2 machines. Both have Box #3 licensed in full workstation mode. The signed-in user is Administrator too. I attached the .JOB file here as well.

thanks in advance,
Anselm
999_050_999_7b63d54d.zip (1.33 KB)

Access Violation errors occur when code reads or writes memory it shouldn’t; the read address of 00000000 here points to a null-pointer dereference, often a sign of corrupted state. Based on the error message, the problem is occurring inside ParticleFlowSubOperators.dlo. I think Deadline is just the messenger here…