Provider: Google
DL7 version: 7.1.0.30R
DL and Balancer: Win7
Render VMs: CentOS 6.6
So the issue is: I have a 24-frame test render, and I upped my budget to 16.0. The Balancer spins up 16 nodes, but by the time the render is over half done, 8 of the 16 are still spinning up and sitting in 'unknown' status. Later, with only 3 unassigned tasks left, I still have 6 in 'unknown'. By the end of the job, I've paid for 6 instances that never rendered a frame; they spent the entire time spinning up, then coming back down.
Any suggestions on how to tighten this up so we’re not paying for compute time we aren’t actually using?
We've had issues with spin-up time [on the provider side], and we're working with the providers to nail this down to a more reasonable standard, or even a metric…
For the interim, I would keep your instance count down unless the render is longer. I've seen spin-up rates of 10-12 instances per minute, and 5-20 minutes per instance, which is a huge range. We could publish a white paper on some of our tests, but things keep changing [and mostly improving] on the provider side, and I don't want to seed incorrect information.
I hope I don't sound like we're throwing the cloud guys to the wolves, but in our scale testing and live tradeshow events we've seen these swings, and it's tough to get to the bottom of the why for each one. The time to scale can differ without any externally visible reason, and from provider to provider. That being said, we'll keep looking at it from our end, e.g. adding more sanity checks in the Balancer. And we ARE working with the providers to get this improved.
My advice for now: unless your renders run into the hours, or your frame counts into the thousands, keep your instance quantity down. Other folks may chime in with better suggestions.
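The "keep your instance quantity down" advice boils down to a break-even check: an extra instance only pays off if the job will still have work left by the time that instance finishes spinning up. Here's a minimal back-of-envelope sketch (the function and the example numbers are illustrative; this is not part of Deadline's Balancer, and it assumes uniform frame times and one frame per instance at a time):

```python
def worthwhile_instances(num_frames, minutes_per_frame,
                         spinup_minutes, max_budget):
    """Rough cap on how many instances are worth requesting.

    Shrinks the count until the job, spread over that many instances,
    would outlast a single instance's worst-case spin-up time.
    """
    total_work = num_frames * minutes_per_frame
    cap = max_budget
    # If the whole job finishes faster than an instance can spin up,
    # that instance never renders a frame; drop it from the request.
    while cap > 1 and total_work / cap < spinup_minutes:
        cap -= 1
    return cap

# 24 frames at ~2 min/frame with a 20-minute worst-case spin-up:
print(worthwhile_instances(24, 2.0, 20.0, 16))    # small job: very few
# 1000 frames at the same rate: the full budget is justified.
print(worthwhile_instances(1000, 2.0, 20.0, 16))
```

With the short 24-frame test above, this suggests only a couple of instances are worth starting, which matches the experience of 6-8 nodes billing for nothing but spin-up.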
Dig it. Yeah, it ultimately lands at the providers' feet; after all, they're their machines. I was just looking for advice, and I got it. Less is more until we find the sweet spot. Thank you.
Yeah, no worries. If anything kills this cloud thing, it will be underestimating the kind of core counts you guys need. I was at a keynote where someone said they used 1800 cores over 5 regions to get some machine-learning job done [or something], and the whole room oohed and ahhed… meanwhile, we've scaled up 7500 MACHINES on a provider as a test. I can't wait to see what happens when someone needs 1000 frames on some killer render, has money to burn, and spins up Deadline + Jigsaw to do 100 regions on each frame, with all the tasks in parallel :twisted:
Anyway, we're pitching 100 machines per minute to each provider as a usable target, and I hope they can get there in a reasonable timeframe.