AWS Thinkbox Discussion Forums

aux file alternate 'read' location?

We have an Avere cache server that’s a ‘front’ to some of the data that lives on our back-end Isilon.

Writing data to this cache server is slow, but reading from it is very fast and scales extremely well. We would like to store the aux files on our regular Isilon, but have them read from the cache server.

Is it possible to set up an alternate ‘read’ location while having submissions go to the regular space?

So for example, our data files would be written to:

\\dataserv\deadline\repository\jobs\

but read from:

\\cacheserv\deadline\repository\jobs

(it contains the same data)

Hi Laszlo,
I assume you already have the “Configure Repository Options…” > “Job Settings” > “Auxiliary Files” section populated with your “jobs” paths on your Isilon?

I believe your desired setup can be achieved through configuration of your Avere cluster.

You can configure your Avere global namespace to present itself as either a new namespace or an override of your existing pipeline namespace. It can present itself as CIFS as well as NFS (I assume all your Windows clients are connecting via CIFS and not using Windows 7 Enterprise and its built-in NFS client). Either way, Avere can present itself as NFS or CIFS to your clients. (You will just get 5%, maybe 10% better performance with NFS compared to CIFS, and that’s assuming you’re not using Windows 8.1 and its newer built-in SMB version?)

Anyway, you can configure your Avere cluster to allow “write-around” when client machines are submitting (writing) jobs, BUT “read from” the Averes when pulling the Aux files (data/3dsmax files, etc.) from the Aux file path location configured in the repo options, which really points at your Isilon cluster. Avere handles all the trickery at the CIFS/NFS and global namespace level.

You could also configure the Avere cluster to make the Isilon Aux file path a “hot folder” in Avere speak, which means it would recursively keep the contents of the Isilon Aux file path in its highest storage/memory tier (DRAM). However, with the number of jobs you guys have in your queue, which I can only guess contains very large Aux (data) files on occasion, your DRAM cache will fill up very quickly and the Avere will start populating the lower storage tiers. That could lead to an undesirable degradation of the Avere’s performance for the rest of the facility as they ‘work through’ the Avere cluster reading/writing files.

So, a couple of options for this last issue. Either configure a ‘cap’ on the total disk capacity of the Aux “jobs” directory held by the Avere, OR, since the Deadline Aux job data can be considered ‘transient’ data that is only needed/useful at the beginning of a job, when the largest number of slaves are trying to PULL it, configure the Averes with a FIFO/retention policy.

The logging/table-view/graphing analysis tools on the Avere are absolutely brilliant and will give you good feedback on how the configuration is working for you, including stats on current “hot folders” and individual “hot machines”… so you will be able to see how well it is working or whether further tweaking is required.

There may well be other newer/advanced options in the Avere configuration (I’m no expert), so I would recommend speaking with your local expert :)

Mike

Wow Mike, it’s rare to see someone who knows Avere configuration :)

We went through about a year of evaluation with the Avere cluster we have, and without disclosing any NDA stuff, I can only say that we currently have to constrain it to select types of traffic. At one point we had it as “the” server namespace, with the write-through/read type configuration you are describing. Due to some issues, we took a step back and are now using it only as a front end to some cache-sensitive, but not “instant failure in case of error”, traffic. We are working with their engineers on the issues, so at some point we might go full speed ahead again though.

Maybe I could achieve the same functionality by patching 3dsmax.py though.

Sounds like you are currently using it in “read-only” mode, which would be perfect for your rendernodes to pull the Aux files from. The majority of the time, your rendernodes should only need to “read” from the Aux directory, so it should be safe to identify them via the Avere configuration as machines that “read from” the cluster. All your submitting client machines can continue to point at the Isilon, and if you do have a rendernode which, say, does an automated submission event type of thing, then it will write-around the Avere and still get to the Isilon path. Essentially, the idea is to take the bulk of the load (which will be the read access from the rendernodes) off your Isilon.

In theory it all sounds possible, but of course I don’t know your precise setup or any conflicting issues, which it sounds like you guys have suffered from. :(

I don’t think the plugin-level .py files like 3dsmax.py will be able to help here, as the Deadline core handles where the “Aux” file path points at the global repo options level.

Yeah, you are right :(
I could not patch 3dsmax.py, because the aux file synchronization is already handled before it gets there.

Hm:
It would be nice if we could use path mapping for this, but without remapping everything… just the aux path when syncing locally to the machines, not for submission, plugin outputs, etc. Any tips would be appreciated…

I can’t think of a solution right now that would not require major infrastructural changes (like making cachesrv -> datasrv, which gave us some really bad issues and caused a lot of pain all around: missing files, access problems, etc.). On Assburner we just patched where the file was being loaded from, string-replace style.
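Just to illustrate the kind of “string replace style” patch I mean, it was nothing smarter than a prefix swap applied only where the files get read. A minimal sketch (the names and example file below are made up for illustration, not real Deadline or Assburner hooks):

```python
# Purely illustrative sketch of a "string replace style" read-path remap.
# None of these names are real Deadline or Assburner hooks; they just show
# the idea of swapping the server prefix only when *reading* aux files.

READ_PREFIX_MAP = {
    r"\\dataserv\deadline\repository\jobs": r"\\cacheserv\deadline\repository\jobs",
}

def remap_for_read(path):
    """Return the cache-server path for reads; submit/write paths are untouched."""
    lowered = path.lower()
    for original, cached in READ_PREFIX_MAP.items():
        if lowered.startswith(original.lower()):
            return cached + path[len(original):]
    return path

# Only the aux sync path gets redirected:
print(remap_for_read(r"\\dataserv\deadline\repository\jobs\528668f2cf71592780ef9380\scene.max"))
# -> \\cacheserv\deadline\repository\jobs\528668f2cf71592780ef9380\scene.max
```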

What happens if I reset the job aux path location in the repository settings with a live & well farm full of jobs?

Will old jobs still source the files from the old location that they had when they were submitted?
Or will they simply swap the root folder to the new location and start sourcing from there?
Will Pulse be clearing the old location, or will it simply start looking at the new location?

Basically, we already have the ‘write-around’ set up, so we are thinking of simply pointing the aux path in the Deadline repo at \\cacheserv instead of \\dataserv. That way, we take a speed hit on publishing files (because the write-around is actually pretty slow), but at least we don’t have to worry about every path being repointed to \\cachesrv.
But I am worried about doing this with a queue with 20k jobs in it… So if you can tell me this is safe, and Pulse won’t wipe the jobs folder, I’ll go ahead and do the change :)

cheers,
laszlo

Deadline shouldn’t purge the original job folders because the housecleaning only removes folders from the jobs folder or the alternate auxiliary folder if the job is no longer in the database.

The aux files that are stored with the job object aren’t rooted, and the aux folder is prepended to them whenever necessary.

If an aux file can’t be found in the alternate location, Deadline falls back to checking the original location in the repository.
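If it helps to picture it, the lookup behaves roughly like this. This is just a simplified sketch of that behaviour (not the actual Deadline code); the <auxRoot>\<jobId>\<fileName> layout is the usual repository jobs folder layout:

```python
import os

# Simplified sketch of the lookup described above: aux file names are stored
# unrooted on the job, the configured aux folder is prepended at read time,
# and the original repository jobs folder is the fallback. Not the actual
# Deadline source, just the behaviour in miniature.

ALTERNATE_AUX_ROOT = r"\\cacheserv\deadline\repository\jobs"   # alternate aux folder (repo option)
REPOSITORY_JOBS_ROOT = r"\\dataserv\deadline\repository\jobs"  # original jobs folder

def resolve_aux_file(job_id, file_name):
    """Return the first location where the job's unrooted aux file is found."""
    for root in (ALTERNATE_AUX_ROOT, REPOSITORY_JOBS_ROOT):
        candidate = os.path.join(root, job_id, file_name)
        if os.path.isfile(candidate):
            return candidate
    raise IOError("aux file %s not found for job %s" % (file_name, job_id))
```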

Cheers,

Ryan

Beautiful, thanks Ryan!

I have reset the job aux path to the new location, but it seems like the slaves are still getting their files from the original one.

Would this only work for freshly submitted jobs?

Job Aux Files Windows path set to: \\inferno2new\deadline\repository6\jobs

From a task report from 20 seconds ago:
Scheduler Thread - Synchronizing job auxiliary files from \\inferno2\deadline\repository6\jobs\528668f2cf71592780ef9380

Yep, only newly submitted jobs will be directed towards the new path location.
It will be very interesting to see your Avere performance change during this time! PM me!

Odd, because the right-click options for old jobs go to the new location (browse to aux path).

Also, it seems that new jobs are still syncing from the old path.

Maybe the slaves need to be restarted?

The slaves only update their repository options every 10 minutes or so, so it might take a bit before they all recognize the change. Restarting the slaves would let them notice the change immediately.

Yep, seeing traffic on the Avere now! Thanks!

It’s odd: we are seeing substantial traffic on the Avere now, and the traffic on the data server has dropped by 80%, but the logs still show the old server name. I think it’s just the log messages that are buggered.
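For what it’s worth, a quick way to confirm it really is just the log text would be to tally the server names in the “Synchronizing job auxiliary files from” lines across recent task reports. A rough sketch (the reports directory below is just a placeholder, not a Deadline default, so point it at wherever your job reports actually live):

```python
import collections
import os
import re

# Rough tally of which server the "Synchronizing job auxiliary files from ..."
# lines point at, to check whether the old name only survives in the log text.
# REPORTS_DIR is a placeholder path, not a Deadline default.

REPORTS_DIR = r"\\dataserv\deadline\repository6\reports"
PATTERN = re.compile(r"Synchronizing job auxiliary files from \\\\([^\\]+)", re.IGNORECASE)

counts = collections.Counter()
for dirpath, _dirnames, filenames in os.walk(REPORTS_DIR):
    for name in filenames:
        with open(os.path.join(dirpath, name)) as handle:
            try:
                text = handle.read()
            except UnicodeDecodeError:
                continue  # skip any non-text files
            for match in PATTERN.finditer(text):
                counts[match.group(1).lower()] += 1

for server, hits in counts.most_common():
    print("%-20s %d" % (server, hits))
```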
