AWS Thinkbox Discussion Forums

Job Reports Discussion

Hey all,

Lately we’ve been looking at Job Reports and their potential to really balloon the size of the DB if left unattended. There are several things we can do to reduce the impact of this, and we are definitely going to do what we can – for example, in Beta 13, individual reports should be taking up half the amount of space in the DB that they used to.

However, there are also a few other things we’ve discussed doing that may be a bit more controversial, and we wanted to bring them up here so that we could have a discussion about it before we do anything. So here goes!

First, we’re talking about doing away with Requeue Reports. They use up a lot of space relative to their usefulness; we figured that Requeues should all be logged in the Job History (we’ll make sure they are if they aren’t already), and that individual reports for each Task are kind of unnecessary.

Second, we discussed capping Render Logs to only 1 per task (in other words, we would only keep the latest Render Log for each task). The rationale here is that only the latest one of these should be relevant.

Third, we are considering forcing Job Failure Detection on, probably with a very generous default threshold (say, 500-1000 errors). This is mostly to prevent slaves from generating an indefinite number of errors on an overnight job, or when no one is paying attention to that particular Job at the time. I feel this is particularly important because we’ve dramatically lowered the time a Slave waits in between Jobs, so in the worst case (a job failing right away), Error Reports can grow out of control much quicker than they did in Deadline 5.

Finally – and this is probably the most controversial/impactful change we’ve discussed – we are thinking of putting a hard cap on the number of Error Reports we keep around per Job. How (or even if) we do this is mostly open to discussion. There are a lot of different options here; we could enforce the cap on a per-Task or per-Slave basis, or just for the overall Job. There’s also a choice in which reports we keep; we could either keep only the first X reports (and no more until they get deleted), or we could keep cycling through reports and only keep the latest X reports. No matter how you slice it, there are reports that aren’t going to be kept around; the challenge here is making sure we are keeping a good distribution of the relevant ones to cover end-user needs.
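
To make the difference between those two retention policies concrete, here’s a rough Python sketch. The function names and cap value are made up purely for illustration; this is not how Deadline actually stores reports:

```python
# Hypothetical sketch of the two cap policies discussed above -- NOT Deadline's actual storage code.
CAP = 100  # example per-Job cap on Error Reports

def keep_first(reports, new_report, cap=CAP):
    """'First X' policy: once the cap is hit, new reports are simply dropped."""
    if len(reports) < cap:
        reports.append(new_report)

def keep_latest(reports, new_report, cap=CAP):
    """'Latest X' policy: new reports push the oldest ones out."""
    reports.append(new_report)
    while len(reports) > cap:
        reports.pop(0)  # discard the oldest report

first_policy, latest_policy = [], []
for i in range(150):
    keep_first(first_policy, "error %d" % i)
    keep_latest(latest_policy, "error %d" % i)
# first_policy now holds errors 0-99; latest_policy holds errors 50-149
```

The same logic could be applied per-Task or per-Slave instead of per-Job by keeping a separate list for each task or slave.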

Now, all of these are up for debate; we haven’t decided for sure to implement any of them. I wanted to bring them up here before implementing anything (especially the last one) to get your opinions on the matter. We want to make sure we’re accounting for the most common use-cases of Job Reports before doing anything drastic!

Of course, any other ideas to reduce Job Reports’ footprint in the DB are definitely welcome!

NOTE: Keep in mind for this discussion that the bulk of the reports are actually stored in (compressed) log files in the Repository; the main thing that affects the size in the DB is the sheer number of reports.

Cheers,

  • Jon

Requeue reports aren’t terribly important or useful on our end, so whatever you end up doing would probably be fine. It might even be nice to have an option to disable them altogether.

Please don’t do this, at least not as an irreversible change. Having logs of previous task runs is very useful, especially with development versions of software or plugins. Having a setting to control how many reports are stored per-task might be nice… if there isn’t one already.

Sure, seems reasonable.

My gut reaction to this (and actually to the first two items as well) is “make it flexible.” Basically, I don’t know that there will be a good “one-size-fits-all” solution to most of these, so giving users a good granular cross-section of controls (with some reasonable defaults) seems important. Now, I’m not completely familiar with all of the different types of reports that are generated for all of the different types of events, but the last thing I want to have to do is revert to managing our own logging.

On the fence. Sometimes we have specific tasks failing and we can’t figure out why. Usually it ends up being a slave problem, not a task-specific problem, but sometimes it’s a specific task failing on all machines due to a missing or corrupted frame (Nuke jobs). Are you suggesting no longer having separate logs for each task in case one task is failing but the others work?

That assumption would be incorrect. Sometimes slaves fail for different reasons; in fact, they often do. One might fail because of a 3ds Max misconfiguration and the next might fail because of a license error. I need to know both errors so that I can track down why the different slaves failed.

One caveat if you only do default “bad slave” detection: often a subsection of our slaves will error out (bad Max version, for instance) while the rest of the farm renders fine. In that case we’ll finish the job without trouble, but end up with 5,000 errors on a successful job.

This is my least objectionable change haha. I’m fine with a hard cap of like 100 errors. Honestly after 100 I probably have a good enough random cross section of the problem(s).

It seems like the best solution would be to have some sort of “archive” solution where after 100 errors the older errors get dumped into an archive file but not tracked in the database.
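
As a rough sketch of what that archive idea could look like (the cap, file format, and function name here are all hypothetical, not an existing Deadline feature), assuming reports are JSON-serializable records ordered oldest-first:

```python
import json

# Hypothetical sketch of the "archive after 100 errors" idea -- not an existing Deadline feature.
DB_CAP = 100

def archive_overflow(reports, archive_path, cap=DB_CAP):
    """Keep the newest `cap` reports tracked in the DB; append older ones to an archive file."""
    overflow = reports[:-cap] if len(reports) > cap else []
    if overflow:
        with open(archive_path, "a") as f:
            for report in overflow:
                f.write(json.dumps(report) + "\n")  # one JSON line per archived report
    return reports[-cap:]  # what would remain tracked in the DB
```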

One of the best features about Deadline is that you have full oversight as to what happens to a job while it’s in the queue. Very frequently, when something unexpected happens, due to the lack of proper logging in our current render management system, we see emails circulating asking people “who changed my priority” or “who requeued task/job X”, etc. With Deadline that is never a problem: you look at a task list, see that a task was requeued 3-4 times, check its logs, see who did it, and can ask what happened.
So as long as you just move those logs to be stored under the job, but they are still trackable through a task, I guess that’s fine.

Please don’t do this. We frequently use older logs for troubleshooting purposes, to find patterns etc.

I think that’s fine; even lower should be OK.

Hm… not sure what my opinion is here haha. With 100-300 errors, I usually just skim through to see if there are any patterns. I rarely dig into individual ones, but sometimes I do, and then it’s crucial for the report to be there. A job with 200-300 errors is probably such an epic fail that it doesn’t really matter what errors are logged, though. Maybe make this a user setting? We could tweak it based on our experience…

Agreed, we don’t care about per task requeues.

As rusch already said, a configuration option here is what’s needed. It needs to be a global default with a per-task override. While the latest render log is the MOST relevant, I often look at iterative logs to find differences when troubleshooting.

Awesome.

Again, needs more configuration. I would go for an option to keep first X and/or last Y log entries.
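
Purely to illustrate that idea (the function name and defaults here are made up, not anything Deadline provides), a combined first-X/last-Y policy could look something like:

```python
# Hypothetical sketch of a combined "first X and/or last Y" retention policy.
def keep_first_and_last(entries, first_x=10, last_y=50):
    """Keep the first `first_x` and the last `last_y` entries, dropping the middle."""
    if len(entries) <= first_x + last_y:
        return entries
    return entries[:first_x] + entries[-last_y:]
```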

Thanks for all the valuable input, guys! As an update to this thread, we’ve decided to pretty much leave things as they are and just default Job Failure Detection to ‘on’.

All the other stuff proposed in this thread would mostly only have a small impact (requeues and render logs really don’t contribute that much to bloat anyway, compared to error logs). As a result, and due to your feedback, we’ll wait and see if the other changes we’ve made are enough to fix the problem.

However, if logs bloating up the DB becomes a problem again, we’ll definitely revisit this and make sure whatever changes/caps we put in are configurable.

Cheers,

  • Jon