AWS Thinkbox Discussion Forums

parallel pending job scans?

Hi there,

We have quite a few interdependent jobs in the queue (they are file-dependent, so they do a lot of disk queries), and the pending job scan can take 20-22 minutes per cycle. This is way too slow, so we were wondering whether there is an official way of running multiple Pulses (or other processes) to distribute the pending scan load?

cheers
laszlo

Not at the moment unfortunately.

Do you see any value in caching that scan? Since each check is just “does this file exist?”, it should be good enough to keep that answer for a minute or two if the requests were being forwarded through Pulse.
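
To make the caching idea concrete, here is a minimal Python sketch of a short-lived cache for “does this file exist?” answers. The class name `ExistsCache`, the 90-second default and the injectable `check` and `clock` parameters are made up for illustration; none of this is Deadline or Pulse API.

```python
import os
import time


class ExistsCache:
    """Remember file-existence answers for a short time (hypothetical helper)."""

    def __init__(self, ttl_seconds=90.0, check=os.path.exists, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.check = check      # function that does the real disk query
        self.clock = clock      # injectable so the cache can be tested
        self._entries = {}      # path -> (answer, time the answer was fetched)

    def exists(self, path):
        now = self.clock()
        cached = self._entries.get(path)
        if cached is not None and now - cached[1] < self.ttl:
            return cached[0]            # still fresh, skip the disk
        answer = self.check(path)       # stale or unknown, ask the disk
        self._entries[path] = (answer, now)
        return answer
```

The trade-off is that a file appearing on disk can go unnoticed for up to the TTL, so a dependent job could be released that much later.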

Update: On thinking about this, the real issue is that the scan runs linearly and needs to be distributed, not the cost of the individual file access calls. This is going to be a lot harder to design…

I was thinking of maybe submitting a Deadline job every minute (one that auto-deletes on completion) that carries a list of all current pending jobs and distributes the scans across ~10-20 machines. That would be our ‘workaround’…
But best would be if the secondary Pulses could simply take a chunk of the job IDs that the regular Pulse is cycling through.
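
Either way, the list of pending job IDs has to be cut into one chunk per machine. A rough Python sketch of that split, where `split_into_chunks` is a hypothetical helper and not anything Deadline provides:

```python
def split_into_chunks(job_ids, worker_count):
    """Split job_ids into worker_count contiguous chunks of near-equal size."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    base, extra = divmod(len(job_ids), worker_count)
    chunks = []
    start = 0
    for i in range(worker_count):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(job_ids[start:start + size])
        start += size
    return chunks
```

Each chunk would then become one task of the scan job, or one secondary Pulse's share.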

Synchronizing the state between the various Pulses feels like the most challenging part. At least a job would leverage the existing distribution instead of re-cooking something in core…

I guess the hackish workaround in core would be to grab pending jobs at random and let the distribution of Pulse machines handle it. Each Pulse would grab a share of the pending jobs: the number of pending jobs divided by the number of active Pulses. It won’t be perfect, but it could be close enough. There’s also the global lock we store in the database that we would have to contend with.

Are there holes in that solution? Because it relies on a random number, the order in which jobs get checked could be unexpected.
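
The random-grab idea can be sketched in Python as below, purely as an illustration. The function names and the simulation are assumptions for the sake of the example; this is not how Pulse is implemented.

```python
import math
import random


def random_share(pending_ids, active_pulses, rng=random):
    """Pick one Pulse's random share: ceil(pending / active Pulses) job IDs."""
    share = math.ceil(len(pending_ids) / active_pulses)
    return rng.sample(pending_ids, share)


def simulate_cycle(pending_ids, active_pulses, seed=None):
    """Return (scanned_more_than_once, never_scanned) for one uncoordinated cycle."""
    rng = random.Random(seed)
    counts = {job_id: 0 for job_id in pending_ids}
    for _ in range(active_pulses):
        for job_id in random_share(pending_ids, active_pulses, rng):
            counts[job_id] += 1
    duplicated = [j for j, c in counts.items() if c > 1]
    missed = [j for j, c in counts.items() if c == 0]
    return duplicated, missed
```

Since the Pulses do not coordinate here, every job scanned twice in a cycle means another job is not scanned at all in that cycle. A missed job waits for a later cycle, so how long it waits is down to chance.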

What if the pending jobs had an additional property, something like “last pending check”, and the Pulses processed them (somewhat randomly) based on that value, from the oldest check to the newest, using a lock?
So one Pulse could grab 100 jobs, then the next Pulse the next 100, etc. With the lock and the ‘last pending check’ property, they wouldn’t pick the same jobs at the same time (when Pulse 2 looks for jobs, the ones Pulse 1 picked would already have a newer check time).
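
Roughly what I mean, as an in-memory Python sketch. `PendingQueue` and `claim_batch` are hypothetical names, and the `threading.Lock` only stands in for whatever shared lock the real thing would use:

```python
import threading
import time


class PendingQueue:
    """In-memory stand-in for the pending-job records (hypothetical)."""

    def __init__(self, job_ids, clock=time.time):
        self._lock = threading.Lock()   # stands in for the shared lock
        self._clock = clock
        # job_id -> time of last pending check; 0.0 means never checked
        self._last_check = {job_id: 0.0 for job_id in job_ids}

    def claim_batch(self, batch_size):
        """Claim the batch_size least recently checked jobs, oldest first."""
        with self._lock:
            oldest_first = sorted(self._last_check, key=self._last_check.get)
            batch = oldest_first[:batch_size]
            now = self._clock()
            for job_id in batch:
                # Stamp while holding the lock so the next caller skips these.
                self._last_check[job_id] = now
            return batch
```

The read and the timestamp update happen inside the same lock. If they were separate steps, two callers could read the same oldest jobs before either had stamped them.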

Well, if Pulse 2 does a query while Pulse 1 is updating those values, either they’ll be doing the same work or we’ll be doing a lot of individual queries. I’ll bounce the ideas off of RR next week and we’ll see if anything falls out of that.
