Hey guys,
Would you mind explaining how the Monitor's update process works? We run into situations where the artists and even wranglers get confused about what Deadline is showing versus reality.
For example, a task can show that it's rendering on a slave; you remote into that slave to troubleshoot why it's taking so long, only to find that it isn't actually rendering that job.
You then go back to the Monitor, click refresh, wait a minute, and it's still showing that machine.
So the issue gets reported to wrangling/pipeline, but by the time I look at the Monitor (on my computer), it shows the correct data: the given task failed on that slave and has since been requeued on another machine.
The artist might spend 2-5 minutes looking at wrong/obsolete information, KNOWING that it's wrong. Is that expected? How does the Deadline update mechanism work? When I click 'refresh' manually, does it actually go to the DB and do a full update, or just apply its internal caches to the GUI?
We have our refresh rates as follows:
Job update interval: 30s
Slave update interval: 30s
Pulse update interval: 90s
Limit update interval: 30s
Settings update interval: 120s
Cloud update interval: 90s
Enable manual refreshing: ON, seconds between manual refreshes: 10s
The only time the Monitor ever loads up all the Jobs, Slaves, etc., is when it starts up.
From then on, it queries the DB for any Slaves/Jobs/etc. that have changed since the last time it checked (this is an indexed query), and updates those in the Monitor. The refresh button manually triggers one of these updates instead of waiting for the next interval to tick down.
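Conceptually, it's equivalent to something like this minimal sketch, assuming a MongoDB backend with a last-write timestamp on each document. This is not the Monitor's actual code; the database, collection, and field names (e.g. `LastWriteTime`) are placeholders for illustration only:

```python
# Sketch of the "only fetch what changed" pattern described above.
# NOT Deadline's actual implementation; names below are assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["deadline_db"]  # hypothetical database name

# Full load happens once, at startup.
jobs_cache = {job["_id"]: job for job in db.Jobs.find()}
last_check = datetime.now(timezone.utc)

def refresh_jobs():
    """One update 'cycle': pull only documents modified since the last check.
    The manual Refresh button would simply trigger this early instead of
    waiting for the interval timer to tick down."""
    global last_check
    now = datetime.now(timezone.utc)
    # Indexed query on the last-write timestamp, so only changed docs come back.
    for job in db.Jobs.find({"LastWriteTime": {"$gt": last_check}}):
        jobs_cache[job["_id"]] = job
    last_check = now
```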
Theoretically, the Monitor shouldn’t ever be displaying data that’s older than one ‘cycle’. In practice, if lock queues start creeping up under heavy load, the server may not get to an individual Monitor’s query in a timely manner, and the user could end up looking at data that is older than a cycle’s typical length. This is not ‘intended’ behaviour, but it could happen under heavy load.
Does this only happen when the DB is being strained? If not, it might be some other bug that’s somehow preventing updates from being processed properly…
Pretty elusive situation, so it's hard to tell… We had heavy loads over the last couple of days though, so that might be a factor.
The different updates (jobs, slaves, etc.), are they running in parallel or sequentially? So jobs first, then slaves, etc.? I always wondered why the 'last update' counter never shows the actual update intervals set in Deadline, but something arbitrary.
They’re all in separate threads, so they run in parallel. The 'last update' counter resets whenever any of them gets data back from the DB; it's meant more as a "heartbeat" to make sure the Monitor is still actively talking to the DB than a "when should the next update happen" kinda thing.
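As a rough illustration of that threading model (again, a simplified sketch rather than the Monitor's actual code, with made-up names and only a few of the collections; the intervals just mirror the settings quoted above):

```python
# Sketch of per-collection update threads sharing a "last update" heartbeat.
# Illustrative only; not Deadline's actual implementation.
import threading
import time

heartbeat_lock = threading.Lock()
last_update = time.monotonic()  # what the "last update" counter is based on

def touch_heartbeat():
    """Reset the counter whenever ANY updater gets data back from the DB."""
    global last_update
    with heartbeat_lock:
        last_update = time.monotonic()

def update_loop(interval_seconds, fetch_changes):
    """One update loop; each collection (Jobs, Slaves, ...) gets its own thread."""
    while True:
        fetch_changes()      # e.g. the delta query sketched earlier
        touch_heartbeat()    # resets no matter which updater just ran
        time.sleep(interval_seconds)

for name, interval in [("Jobs", 30), ("Slaves", 30), ("Pulse", 90), ("Limits", 30)]:
    threading.Thread(target=update_loop, name=name,
                     args=(interval, lambda: None),  # placeholder fetch
                     daemon=True).start()
```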