Efficient issue resolution workflow on large farms by grouping error logs

anon16865508 · July 31, 2015, 11:10pm

We have hundreds of render nodes, and frequently a pipeline or license error causes hundreds of tasks to throw very similar errors.
So here is the process we need to go through:

Identify that it’s a serious issue because it shows up in many error messages
Find the issue and deploy a fix
Remove the error reports on all jobs which had this type of error (time intensive)
Monitor whether the error is fully resolved or shows up again (by looking through many error reports)

All in all, a very tedious and labor-intensive process. So how about this approach, which automates most of it?

Automatically group similar errors across tasks and jobs, including a counter of how often each error showed up and on which jobs and nodes
If the issue is serious (i.e. above a threshold), send a notification
Allow marking this consolidated error as resolved, which in turn resolves all individual error reports
If the same error shows up again after it was marked as resolved, send a notification immediately
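
The four steps above can be sketched in a few lines of Python. This is only a rough illustration, not how Deadline or Sentry actually does it: the names `fingerprint`, `ErrorGroup` and `ErrorGrouper` are made up for this example, the default threshold of 50 is arbitrary, and "similar" is assumed to mean "identical once numbers and hex addresses are masked out". Notifications are just collected in a list here; a real setup would send mail or a chat message instead.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


def fingerprint(message: str) -> str:
    """Mask the variable parts of a message so similar errors share one key."""
    text = message.lower()
    text = re.sub(r"0x[0-9a-f]+", "<hex>", text)  # memory addresses
    text = re.sub(r"\d+", "<n>", text)            # frame numbers, node ids, ...
    return re.sub(r"\s+", " ", text).strip()


@dataclass
class ErrorGroup:
    count: int = 0
    jobs: set = field(default_factory=set)
    nodes: set = field(default_factory=set)
    resolved: bool = False
    notified: bool = False


class ErrorGrouper:
    """Hypothetical in-memory grouper; not part of Deadline or Sentry."""

    def __init__(self, threshold: int = 50):
        self.threshold = threshold
        self.groups = defaultdict(ErrorGroup)
        self.notifications = []  # (reason, key) tuples

    def record(self, message: str, job: str, node: str) -> str:
        """Add one error report to its group and notify if needed."""
        key = fingerprint(message)
        group = self.groups[key]
        group.count += 1
        group.jobs.add(job)
        group.nodes.add(node)
        if group.resolved:
            # The error came back after being marked resolved: notify at once.
            group.resolved = False
            group.notified = True
            self.notifications.append(("regression", key))
        elif group.count >= self.threshold and not group.notified:
            group.notified = True
            self.notifications.append(("threshold", key))
        return key

    def resolve(self, key: str) -> None:
        """Mark the whole group resolved and clear its individual reports."""
        group = self.groups[key]
        group.resolved = True
        group.notified = False
        group.count = 0
        group.jobs.clear()
        group.nodes.clear()
```

With this sketch, "License checkout failed on node 12" and "License checkout failed on node 47" land in the same group, and calling `resolve` on that group's key stands in for clearing the error reports on every affected job.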

This could either be implemented in Deadline, or an open source, web-based tool like Sentry could be used to do the same thing:
github.com/getsentry/sentry
The process of consolidating events is called rollup and the algorithm is described here:
docs.getsentry.com/on-premise/rollups/
Sentry has the added benefit that a consolidated error report has a status, can be tagged, and can trigger notifications.

Thanks for implementing it!
Patrick

eamsler · August 3, 2015, 8:53pm

Hey Patrick!

I think the easiest way is going to be to pump data into Sentry. I can work on the Deadline side if you want to work on the Sentry side.

For the Deadline side, we don’t have an event for task errors or completion for performance reasons, so it might need to happen either manually or on a set interval. Also, would the log title be enough to match the similar errors?
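
To make the "set interval" idea concrete, here is a rough Python sketch of one polling pass that counts new error reports by title. Everything in it is an assumption for illustration: `fetch_reports` is a hypothetical callable standing in for however the reports would be pulled out of Deadline, and each report is assumed to be a dict with an `id` and a `title`. It does not use any real Deadline or Sentry API.

```python
from collections import Counter


def poll_once(fetch_reports, seen_ids: set, title_counts: Counter) -> int:
    """Process reports not seen in earlier polls; return how many were new.

    fetch_reports is a hypothetical callable returning an iterable of
    dicts with "id" and "title" keys.
    """
    new = 0
    for report in fetch_reports():
        if report["id"] in seen_ids:
            continue  # already counted in a previous pass
        seen_ids.add(report["id"])
        title_counts[report["title"]] += 1  # group by exact title only
        new += 1
    return new
```

Running `poll_once` from a timer or a loop with a sleep between passes would give the interval behaviour. Grouping on the exact title only works if the title carries no per-task details such as frame numbers or host names; if it does, the titles would need to be normalized first.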

anon16865508 · August 14, 2015, 12:08am

Should this post be moved to the beta forum?