Deadline takes a long time to start the job, stuck at a certain phase for about 10 min

Hi team, I was wondering if anybody else has experienced this, or knows a solution to this issue.

I am using Houdini + Redshift.
The scene renders fine locally.

When I submit the job, it gets submitted without any issue and proceeds to the point shown in the log below, where it gets stuck for about 10 min. When it eventually gets going, everything renders fine, until the next task, which gets stuck at the same point again.

Not sure if it is a related problem, but when I submit the same job with six concurrent tasks (I have 6 GPUs installed), it gets stuck here for even longer (so long that I had to force stop the job), whereas a single task is stuck for about 10 min.

I can't see any error messages or any hint at what could be causing this delay.

Would appreciate any help! Thanks!

--------- from log -----------------------

2020-11-18 00:52:27: 0: INFO: Starting Houdini Job
2020-11-18 00:52:27: 0: INFO: Stdout Redirection Enabled: True
2020-11-18 00:52:27: 0: INFO: Stdout Handling Enabled: True
2020-11-18 00:52:27: 0: INFO: Popup Handling Enabled: True
2020-11-18 00:52:27: 0: INFO: QT Popup Handling Enabled: False
2020-11-18 00:52:27: 0: INFO: WindowsForms10.Window.8.app.* Popup Handling Enabled: False
2020-11-18 00:52:27: 0: INFO: Using Process Tree: True
2020-11-18 00:52:27: 0: INFO: Hiding DOS Window: True
2020-11-18 00:52:27: 0: INFO: Creating New Console: False
2020-11-18 00:52:27: 0: INFO: Running as user: alteredgene
2020-11-18 00:52:27: 0: INFO: Executable: "C:\Program Files\Side Effects Software\Houdini 18.0.499\bin\Hython.exe"
2020-11-18 00:52:27: 0: INFO: Argument: "C:\ProgramData\Thinkbox\Deadline10\workers\IR-214226-7540\plugins\5fb4dee713f8ff37ac6c7770\hrender_dl.py" -f 710 719 1 -o "$HIP/render2/$OS/$OS.$F4.exr" -g -d /out/RS_testMid -tempdir "C:\ProgramData\Thinkbox\Deadline10\workers\IR-214226-7540\jobsData\5fb4dee713f8ff37ac6c7770\0_tempzX2ip0" -arnoldAbortOnLicenseFail 1 "C:/IMB_NationalMuseumExhibition/IMB-NationalMuseumExhibition.hip"
2020-11-18 00:52:27: 0: INFO: Full Command: "C:\Program Files\Side Effects Software\Houdini 18.0.499\bin\Hython.exe" "C:\ProgramData\Thinkbox\Deadline10\workers\IR-214226-7540\plugins\5fb4dee713f8ff37ac6c7770\hrender_dl.py" -f 710 719 1 -o "$HIP/render2/$OS/$OS.$F4.exr" -g -d /out/RS_testMid -tempdir "C:\ProgramData\Thinkbox\Deadline10\workers\IR-214226-7540\jobsData\5fb4dee713f8ff37ac6c7770\0_tempzX2ip0" -arnoldAbortOnLicenseFail 1 "C:/IMB_NationalMuseumExhibition/IMB-NationalMuseumExhibition.hip"
2020-11-18 00:52:27: 0: INFO: Startup Directory: "C:\Program Files\Side Effects Software\Houdini 18.0.499\bin"
2020-11-18 00:52:27: 0: INFO: Process Priority: BelowNormal
2020-11-18 00:52:27: 0: INFO: Process Affinity: default
2020-11-18 00:52:27: 0: INFO: Process is now running
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Redshift for Houdini plugin version 3.0.30 (Sep 26 2020 14:49:18)
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Plugin compile time HDK version: 18.0.499
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Houdini host version: 18.0.499
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Plugin dso/dll and config path: C:/ProgramData/Redshift/Plugins/Houdini/18.0.499/dso
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Core data path: C:\ProgramData\Redshift
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Local data path: C:\ProgramData\Redshift
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Procedurals path: C:\ProgramData\Redshift\Procedurals
2020-11-18 00:52:32: 0: STDOUT: [Redshift] Preferences file path: C:\ProgramData\Redshift\preferences.xml
2020-11-18 00:52:32: 0: STDOUT: [Redshift] License path: C:\ProgramData\Redshift
2020-11-18 00:52:35: 0: STDOUT: Detected Houdini version: (18, 0, 499)
2020-11-18 00:52:35: 0: STDOUT: ['C:\ProgramData\Thinkbox\Deadline10\workers\IR-214226-7540\plugins\5fb4dee713f8ff37ac6c7770\hrender_dl.py', '-f', '710', '719', '1', '-o', '$HIP/render2/$OS/$OS.$F4.exr', '-g', '-d', '/out/RS_testMid', '-tempdir', 'C:\ProgramData\Thinkbox\Deadline10\workers\IR-214226-7540\jobsData\5fb4dee713f8ff37ac6c7770\0_tempzX2ip0', '-arnoldAbortOnLicenseFail', '1', 'C:/IMB_NationalMuseumExhibition/IMB-NationalMuseumExhibition.hip']
2020-11-18 00:52:35: 0: STDOUT: Start: 710
2020-11-18 00:52:35: 0: STDOUT: End: 719
2020-11-18 00:52:35: 0: STDOUT: Increment: 1
2020-11-18 00:52:35: 0: STDOUT: Ignore Inputs: True
2020-11-18 00:52:35: 0: STDOUT: Output: $HIP/render2/$OS/$OS.$F4.exr
2020-11-18 00:52:35: 0: STDOUT: Driver: /out/RS_testMid
2020-11-18 00:52:35: 0: STDOUT: Input File: C:/IMB_NationalMuseumExhibition/IMB-NationalMuseumExhibition.hip
2020-11-18 00:52:56: 0: STDOUT: Unknown command: verification_id
2020-11-18 00:52:56: 0: STDOUT: Unknown command: license_id
2020-11-18 00:52:56: 0: STDOUT: Unknown command: lock
2020-11-18 00:52:56: 0: STDOUT: Unknown command: product_id
2020-11-18 00:52:56: 0: STDOUT: Unknown command: server_platform
2020-11-18 00:52:56: 0: STDOUT: Unknown command: support_expiry
2020-11-18 00:52:56: 0: STDOUT: Unknown command: houdini_version
2020-11-18 00:52:56: 0: STDOUT: Unknown command: available
2020-11-18 00:52:56: 0: STDOUT: Unknown command: count
2020-11-18 00:52:56: 0: STDOUT: Unknown command: ip_mask
2020-11-18 00:52:56: 0: STDOUT: Unknown command: display
2020-11-18 00:52:56: 0: STDOUT: Unknown command: }

I'm always a bit wary of running concurrent tasks with GPUs.

You may submit 6 concurrent tasks, each using a single card, but I'm never sure what determines which card they use, or whether they're all jumping on the first one.

If I were running 6 cards, I'd likely use 2x workers with 3-card affinity, or 3x workers with 2-card affinity, then submit jobs with a GPU limit and let the worker assign the cards.
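
For what it's worth, here's a rough sketch of one way to pin each concurrent task to its own card by limiting CUDA device visibility before the render process starts. It assumes Redshift only enumerates the devices exposed via CUDA_VISIBLE_DEVICES, like most CUDA applications, which is worth verifying against the Redshift docs; the render script path is just a placeholder.

```python
import os
import subprocess

# Hypothetical sketch: give each concurrent task its own GPU by limiting CUDA
# device visibility before the render process starts. ASSUMPTION: Redshift only
# sees the devices exposed via CUDA_VISIBLE_DEVICES, like most CUDA apps -
# verify this against the Redshift docs before relying on it.
HYTHON = r"C:\Program Files\Side Effects Software\Houdini 18.0.499\bin\Hython.exe"
RENDER_SCRIPT = r"C:\path\to\render_one_task.py"  # placeholder per-task render script

def launch_task(gpu_index, frame):
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)  # only this card is visible to the task
    return subprocess.Popen([HYTHON, RENDER_SCRIPT, str(frame)], env=env)

# e.g. six tasks, one frame each, one card each
procs = [launch_task(gpu, 710 + gpu) for gpu in range(6)]
for p in procs:
    p.wait()
```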

I'd recommend using a GPU monitoring tool like MSI Afterburner, so you can see which cards are being picked up.
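
If you'd rather watch it from a terminal instead of Afterburner, here's a quick sketch that polls nvidia-smi and prints per-card VRAM and utilization, so you can see whether anything is actually loading during the stall (assumes the NVIDIA drivers are installed and nvidia-smi is on the PATH).

```python
import subprocess
import time

# Poll nvidia-smi every few seconds and print per-GPU memory and utilization.
# Assumes NVIDIA drivers are installed and nvidia-smi is on the PATH. Ctrl+C to stop.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader",
]

while True:
    stamp = time.strftime("%H:%M:%S")
    out = subprocess.run(QUERY, capture_output=True, text=True).stdout
    for line in out.strip().splitlines():
        print(f"{stamp}  {line}")
    time.sleep(5)
```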

Hey Anthony! Thank you for your advice. Yes, I had good success with 2 concurrent tasks (2 GPUs, 1 GPU per task); it really scaled linearly. But I think your advice on pairing 2 to 3 cards per worker makes good sense.

That being said, the above issue actually happens even with a single task (no GPU affinity), so even a simple, straightforward Deadline submission has this lag of about 5 min per task.

Do you have the same lag submitting a command-line render without Deadline?
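
One way to test that is a small hython script that loads the hip file and renders the same ROP directly, timing the scene load and the first frame separately, so you can see where the delay actually sits. This is just a sketch using the paths from your log; run it with Hython.exe outside of Deadline.

```python
# Run with Hython.exe, outside Deadline, to time scene load vs. first frame.
import time
import hou

HIP = "C:/IMB_NationalMuseumExhibition/IMB-NationalMuseumExhibition.hip"
ROP = "/out/RS_testMid"

t0 = time.time()
try:
    hou.hipFile.load(HIP)
except hou.LoadWarning as e:
    print(e)  # non-fatal load warnings (e.g. missing plugins locally)
t1 = time.time()
print("Scene load took %.1f s" % (t1 - t0))

rop = hou.node(ROP)
rop.render(frame_range=(710, 710), verbose=True)  # a single frame is enough
print("First frame took %.1f s" % (time.time() - t1))
```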

Are you able to switch the debug log on to see if Redshift outputs any more info?

Are you able to monitor the GPU to see if it's loading into VRAM, etc.?

I prefer exporting to standalone in whatever application or renderer I'm using, though I know it's not always quicker due to the export process.

I saw GridMarkets had some nice tips on submitting Houdini/Redshift; not sure if it's of any use, I've not tested it out.

Hmm, interesting tips again!

Will look into those. For the time being, things I know:

GPU VRAM was not being loaded (nor were the CUDA cores being used).
I will need to look at the Redshift logs as well.

Will post an update here when I find out! Thank you for your suggestions, Anthony!

Having this exact same issue.

Haven’t tried adjusting concurrent tasks, but given I’m only running two GPUs per machine (only two machines), and don’t have anywhere near the same bottleneck when rendering locally in Houdini, I’m very curious to see what’s causing the hang-up here…

Network?

Disk

We recommend using fast SSD drives. Redshift automatically converts textures (JPG, EXR, PNG, TIFF, etc) to its own texture format which is faster to load and use during rendering. Those converted textures are stored in a local drive folder. We recommend using an SSD for that texture cache folder so that, during rendering, the converted texture files can be opened fast. Redshift can optionally not do any of this caching and simply open textures from their original location (even if that is a network folder), but we don’t recommend this. For more information on the texture cache folder, please read the online documentation.

To recap:

* Prefer SSDs to mechanical hard disks
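
If you want to put a rough number on the disk side of this, here's a small sketch that just reads back everything in a given texture cache folder and reports the throughput. The folder path is a placeholder - point it at whatever cache location is set in your Redshift preferences, and note that a second run will be skewed by the OS file cache.

```python
import os
import time

# Rough read-throughput check for the Redshift texture cache folder.
# CACHE_DIR is a placeholder - use the cache path set in your Redshift preferences.
CACHE_DIR = r"C:\path\to\RedshiftTextureCache"

total_bytes = 0
start = time.time()
for root, _dirs, files in os.walk(CACHE_DIR):
    for name in files:
        path = os.path.join(root, name)
        with open(path, "rb") as f:
            while f.read(8 * 1024 * 1024):  # read in 8 MB chunks
                pass
        total_bytes += os.path.getsize(path)

elapsed = time.time() - start
print("Read %.1f GB in %.1f s (%.0f MB/s)"
      % (total_bytes / 1e9, elapsed, total_bytes / 1e6 / max(elapsed, 1e-6)))
```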

Network and NAS

Redshift can render several times faster than CPU renderers. This means that the burden on your network can be higher too, just like it would be if you were adding lots more render nodes! As mentioned above, Redshift caches textures to the local disk so it won't try to load textures through the network over and over again (it will only do it if the texture changes). However, other files (like Redshift proxies) are not locally cached, so they will be accessed over the network repeatedly. Fast networks and network-attached storage (NAS) typically work fine in this scenario.

However, there have been a few cases where users reported extremely low performance with certain NAS solutions. Since there are many NAS products available in the market, we strongly recommend thoroughly testing your chosen NAS with large Redshift proxies over the network. For example, try exporting a large Redshift proxy containing 30 million triangles or so (a tessellated sphere would do), save it in a network folder and then try using it in a scene both through a network path and also through a local file - and measure the rendering performance difference between the two.

To recap:

* Rendering with Redshift is like rendering with lots of machines. It might put a strain on your network.
* Thoroughly test your network storage solution! Some of them have performance issues!
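
For the NAS test described above, here's a minimal sketch that times reading the same large Redshift proxy from a network path and from a local copy. Both paths are placeholders, and raw read throughput is only a rough stand-in for the actual render comparison the docs suggest.

```python
import time

# Placeholders: the same large .rs proxy stored on the NAS and copied locally.
NETWORK_PROXY = r"\\nas\projects\proxies\big_sphere_30m.rs"
LOCAL_PROXY = r"C:\temp\big_sphere_30m.rs"

def time_read(path):
    # Read the whole file in 8 MB chunks and report effective throughput.
    start = time.time()
    size = 0
    with open(path, "rb") as f:
        chunk = f.read(8 * 1024 * 1024)
        while chunk:
            size += len(chunk)
            chunk = f.read(8 * 1024 * 1024)
    elapsed = time.time() - start
    print("%s: %.1f GB in %.1f s (%.0f MB/s)"
          % (path, size / 1e9, elapsed, size / 1e6 / max(elapsed, 1e-6)))

time_read(NETWORK_PROXY)
time_read(LOCAL_PROXY)
```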

SSDs… SSDs everywhere…

Could be a network bottleneck? But in the bit of testing I’ve done, even pulling source scene/caches from my server and rendering locally on a workstation is still heaps faster than submitting a job to the farm.
