What is the exact reason for the Monitor to take 5 minutes!!! to bring up an error report, for example, over the internet?
It’s unbearable!
Are you using remote mode?
In this case, I wasn’t.
But I'm just curious, what is it exactly that takes so unbelievably long to update the Monitor over the internet?
Hey, Mike,
yeah, all these things are OK. It's just that the Monitor is extremely unresponsive.
I don’t even have long latencies!
[code]Microsoft Windows [Version 6.1.7600]
Copyright © 2009 Microsoft Corporation. All rights reserved.

C:\Users\loocas>ping rammstein

Pinging rammstein [192.168.0.100] with 32 bytes of data:
Reply from 192.168.0.100: bytes=32 time=9ms TTL=127
Reply from 192.168.0.100: bytes=32 time=15ms TTL=127
Reply from 192.168.0.100: bytes=32 time=13ms TTL=127
Reply from 192.168.0.100: bytes=32 time=9ms TTL=127

Ping statistics for 192.168.0.100:
    Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
    Minimum = 9ms, Maximum = 15ms, Average = 11ms[/code]
Hi,
Does the error report contain a large number of STDOUT lines? i.e. a lot more than the usually captured amount of stderr?
Have you got line numbers enabled?
M
Oops. Just deleted my first post.
Hmm… well, no, the error reports (I deleted them already) were only showing missing texture files. About 10-15 or so.
Ok, can you remember if there were a large number of error reports on this job(s)?
Yep, about 400
OK. I believe the actual report files in Deadline aren't cached by Pulse, so if you have a high number of them, it can take a little while whilst the Monitor loads the individual report files directly from the file server.
Under your repository options, "Jobs: Failure Detection", what numbers have you set for "Error limit before a job fails" & "Error limit before a Task Fails"? By reducing these numbers, the troublesome jobs will in effect fail quicker with fewer error reports, which will make the reports load quicker in your Monitor.
However, this is not really the best answer. I'm guessing it's a 3dsMax job with the missing-textures error? If so, have you by chance got the "ignore missing textures" checkbox enabled during submission?
M
I enabled the missing textures checkbox. This was partially a failure on my post-load script's side: I made a mistake and repathed to a wrong destination, where it couldn't load all the texture files, unfortunately. I can't fix the script, though, as it seems that ATSOps is broken.
Anyways, this is strange. I imagine Deadline streaming roughly kilobytes of data, so what the hell is the holdup? I can download up to 600 KB/s with my lame internet connection (I'm on 4 Mbit) and my server sits behind an unlimited 25 Mbit up/down fiber optics line. The latencies are in the tens of milliseconds, tops. That's why I don't understand why Deadline is so terribly slow!
But I've even noticed some wait times on my local 1 Gbit line at the studio; sometimes Deadline just takes a second or two to refresh, etc.
Hi,
ATS is tricky to control sometimes and is hard to trust all the time. Typically, when a user opens a Max file and uses the "browse to…" missing-maps dialog to fix any missing assets, they are effectively using the "customise local paths / …external files [session paths]" function, which is an arse when sending the job to a render farm, as these paths may not exist across the network. Luckily, the SMTD submission interface has a *.mxp pathing function to help export a *.mxp file which can be sent with a job. However, IIRC, you don't use SMTD, so this is no good, and I would argue it's still side-stepping the enforced Autodesk way of thinking.
Anyway, going back to my original point: when you insert a local session path by browsing to the asset path, this doesn't update the ATS system. Autodesk, I guess, think it's a good idea to have all your asset paths untouched but allow a local override via the session-path system. I have found this leads to confusion among artists.
On top of this, I'm of the opinion that during network rendering, with "ignore missing textures" enabled, 3dsMax is VERY sensitive on frame 0, or indeed whatever the first numbered frame on your 3dsMax timeline scrub bar is when the file is opened for rendering. If you have this setup and a missing texture, Deadline will error out, and even if you have a limit set on the number of times a task/job can FAIL before being stopped, Deadline still continues to process. This is something I need to take up with Ryan further, as Bobo is already looking into the "frame-0" part of this issue and is talking to Autodesk about it.
Yep, the error reports are really small XML files. I just checked on one of my systems, and 73 error reports of varying stderr size add up to 300 KB in total, so they should stream to your remote system reasonably quickly; at your quoted 600 KB/s, 300 KB is roughly half a second of raw transfer. However, under really heavy load it can sometimes take a couple of seconds to receive all that data, but I don't think I've ever seen it take more than, say, 5 seconds or so.
Now, IIRC, I did make a feature request to get the report loading to be multi-threaded, as I 'think' it's presently a single-threaded process, so this may well be causing the bottleneck. i.e. the data has been sent to your machine, but it's reading the XML files one at a time, which causes the slowdown. Again, Mr. R will be able to enlighten us (Friday = Canada Day, Monday = 4th of July, so it might not be till Tuesday for an answer).
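To illustrate why single-threaded loading would hurt so much over a high-latency link, here's a rough Python sketch (hypothetical code and paths, not Deadline's actual implementation): reading many small report files one at a time pays the network round trip once per file, whereas a thread pool overlaps those waits.
[code]import glob
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

def load_report(path):
    # Each parse of a small XML file over the VPN pays the full latency cost.
    return ET.parse(path).getroot()

# Assumed repository layout, for illustration only.
report_paths = glob.glob(r"\\server\DeadlineRepository\reports\*.xml")

# Single-threaded: ~400 reports x ~300 ms latency is minutes of pure waiting.
reports = [load_report(p) for p in report_paths]

# Multi-threaded: the same reads overlap, so wall time drops dramatically.
with ThreadPoolExecutor(max_workers=16) as pool:
    reports = list(pool.map(load_report, report_paths))[/code]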
BTW, if you want your repository to go at stupidly fast speeds, installing it on a RAIDed SSD setup will help a lot. As of v5.0 you can also off-load the actual "data files" (3dsMax files), if you submit them with your Deadline jobs, to a dedicated storage system, as SSDs are fast but their capacity isn't quite there yet. If you need insane speed, then FusionIO PCI card(s) will give you that, but you'll need to sell your house, car and a kidney to afford one of reasonable capacity. Although the smallest one (~80 GB) is affordable-ish, and with the new v5.0 off-loading feature you could set up an amazingly fast repository; but then you get yourself an I/O bottleneck elsewhere and the fun and games continue. The Deadline white paper talks the talk on all this tech/setup.
So, in summary, I would say configuring your repository to error out sooner on jobs with missing textures is a good thing and saves wasted proc cycles on the farm. However, I think there are some underlying issues which are WIP at the moment.
Mike
Wow, thanks a lot for the post. Tons of interesting info and stuff to check out.
As for a ridiculously speedy repository: I have one RevoDrive PCIe SSD I'll be taking out of my workstation (its capacity isn't enough even for an OS install), so I might upgrade my server with that one. But other than that, the speed of the pipeline isn't really an issue here. It's all stored on a fast DAS (Dell MD1000) with rather fast regular HDDs, and speed hasn't really been a problem, especially since I only have ten render nodes.
As for the paths: currently I have two post-load scripts I use with Max when submitting jobs. One sets up the GI prepass and repaths all the source data; the other sets up the beauty pass and also repaths the source data.
The problem is that ATSOps is, pardon my French, fucked up! The methods just don't work, and I even had a developer (in the beta forums) confirm a serious bug they discovered in the path handling after my bug submissions.
Anyways, if you have any tricks up your sleeve for managing repathing (essentially, I only need to replace D:\ with \\myServer\projects, that's it), I'm all ears.
Thanks a lot, Mike, always appreciate your input!
Hey, Mike,
just tested the .mxp setup with a custom .mxp file and Deadline, and as I suspected, the .mxp doesn't work as I'd imagined, so it's pretty much useless.
I have a scene with some geometry and only one texture (for testing purposes):
D:\test\shots\aaa\2d\texture.jpg
and I have this .mxp:
[code][Directories]
Animations=.\sceneassets\animations
Archives=.\archives
AutoBackup=.\autoback
BitmapProxies=.\proxies
Downloads=.\downloads
Export=.\export
Expressions=.\express
Images=.\sceneassets\images
Import=.\import
Materials=.\materiallibraries
MaxStart=.\scenes
Photometric=.\sceneassets\photometric
Previews=.\previews
RenderAssets=.\sceneassets\renderassets
RenderOutput=.\renderoutput
RenderPresets=.\renderpresets
Scenes=.\scenes
Sounds=.\sceneassets\sounds
VideoPost=.\vpost
[XReferenceDirs]
Dir1=\\rammstein\_UNMANAGED_PROJECTS_\test\shots\aaa\2d
[BitmapDirs]
Dir1=\\rammstein\_UNMANAGED_PROJECTS_\test\shots\aaa\2d\[/code]
But even after this, I still get “2011/07/03 21:33:58 ERR: Missing Map: D:\test\shots\aaa\2d\texture.jpg” in Deadline.
Oh man! I hate the closed, goddamn .max file format! It's awful!
HA! It seems I've found a "workaround", or rather a usable solution, in ATSOps, if anyone cares:
[code]-- srcPath and destPath must be defined beforehand, e.g.
-- srcPath = "D:\\" and destPath = "\\\\myServer\\projects\\"
ATSOps.Visible = true
ATSOps.getFiles &fileArray -- fills fileArray with all tracked asset paths
deleteItem fileArray 1 -- drop the first entry (presumably the scene file itself)
for f in fileArray do
(
	ATSOps.selectFiles f
	ATSOps.retargetSelection (substituteString f srcPath destPath)
	ATSOps.clearSelection()
)
ATSOps.Visible = false[/code]
This seems to be working on the render farm!
The reason why loading the error reports is slow over the internet is because they are being read directly from the file system over the VPN. If the data were being streamed from Pulse or some other web service, it would be much faster, but because Deadline needs to touch the file system directly, it is much, much slower.
In remote mode, we avoid the long load time by only collecting the file names of the reports and parsing as much info as possible from them to populate the report viewer list. The full report is only loaded when you click on an actual report in the list.
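What Ryan describes is essentially lazy loading. Here's a minimal sketch of the idea in Python (hypothetical class and paths, not Deadline's actual code): the cheap metadata pass lists file names only, and the expensive file read is deferred until a report is clicked.
[code]import os

class ReportList:
    def __init__(self, report_dir):
        # Cheap pass: directory listing only, no file contents are read yet.
        self.report_dir = report_dir
        self.names = sorted(os.listdir(report_dir))

    def open_report(self, name):
        # Expensive pass: read the full report only when it's actually clicked.
        with open(os.path.join(self.report_dir, name)) as f:
            return f.read()

reports = ReportList(r"\\server\DeadlineRepository\reports")  # assumed path
print(reports.names[:10])  # fast: names populate the report viewer list
# body = reports.open_report(reports.names[0])  # slow read happens on demand[/code]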
Cheers,
Thanks for the info, Ryan. It's still strange, though, that files totalling a few KB take so long to stream over a 4 Mbit bottleneck.
But, I assume, that's where Apache and IIS would come in, right?
I now have a bit of trouble with my Wi-Fi, and with latencies around 300 ms the Monitor takes about 10 minutes!!! to open, even in remote mode!
Yeah, we still need to make improvements to the initial launch speed. The Monitor still needs to gather some info from the repository (like loading the repository options so that it knows where Pulse is), but we hope to prune down any unnecessary file system reads as much as possible.
Wouldn't Apache or IIS help in this regard? I mean, by serving the actual data?
No, because the Monitor still touches the actual repository file system in some areas while in remote mode. All we can do is reduce the number of direct file system queries and move more of the communication between the Monitor and Pulse, which goes over a socket connection and is why it's so much faster.
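To make that contrast concrete, here is a rough Python sketch (hypothetical names and wire format, not Deadline's actual API or protocol): fetching N report files over a VPN-mounted file system costs roughly one network round trip per file, whereas a single socket request to Pulse costs one round trip for the whole answer.
[code]import socket

def query_pulse(host, port, request):
    # One request/response over a plain TCP socket; the protocol here is
    # invented for illustration, not Deadline's real one.
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall(request.encode() + b"\n")
        chunks = []
        while chunk := s.recv(65536):  # read until the server closes
            chunks.append(chunk)
    return b"".join(chunks)

# One round trip for the whole report list, instead of one per report file:
# data = query_pulse("pulse-server", 8080, "GET_REPORT_LIST jobid=123")[/code]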