We have hundreds of render nodes, and frequently a pipeline or license error causes hundreds of tasks to throw very similar errors.
So here is the process we need to go through:
- identify that it's a serious issue because it shows up in many error messages
- find the issue and deploy a fix
- remove the error reports on all jobs which had this type of error (time intensive)
- monitor whether the error is fully resolved or shows up again (by looking through many error reports)
All in all, a very tedious and labor-intensive process. So how about this approach, which automates most of it?
- Automatically group similar errors across tasks and jobs, including a counter of how often each error showed up and on which jobs and nodes
- If the issue is serious (i.e. above a threshold), send a notification
- Allow marking this consolidated error as resolved which in turn resolves all individual error reports
- If the same error shows up again after it was marked as resolved, send a notification right away
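
To make the idea more concrete, here is a rough Python sketch of the four steps above. It is only an illustration and not how Deadline or Sentry actually do it: the names `fingerprint`, `ErrorGroup`, `ErrorRollup` and the threshold of 3 are all made up, and the grouping rule (lowercase the message, replace numbers and hex codes with a placeholder, then hash) is a simple stand-in for a real rollup algorithm.

```python
import hashlib
import re
from dataclasses import dataclass, field


def fingerprint(message: str) -> str:
    """Hash a message after masking numbers and hex codes (assumed grouping rule)."""
    normalized = re.sub(r"0x[0-9a-fA-F]+|\d+", "<n>", message.lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:12]


@dataclass
class ErrorGroup:
    """One consolidated error: a sample message plus counters."""
    sample: str
    count: int = 0
    jobs: set = field(default_factory=set)
    nodes: set = field(default_factory=set)
    resolved: bool = False
    notified: bool = False


class ErrorRollup:
    """Hypothetical in-memory rollup; notifications are just collected in a list."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.groups = {}
        self.notifications = []

    def report(self, message: str, job: str, node: str) -> str:
        """Add one error report to its group and notify if needed."""
        key = fingerprint(message)
        group = self.groups.setdefault(key, ErrorGroup(sample=message))
        group.count += 1
        group.jobs.add(job)
        group.nodes.add(node)
        if group.resolved:
            # The error came back after being marked resolved: notify immediately.
            group.resolved = False
            group.notified = True
            self.notifications.append(("regression", key))
        elif group.count >= self.threshold and not group.notified:
            # First time the group crosses the threshold: notify once.
            group.notified = True
            self.notifications.append(("threshold", key))
        return key

    def resolve(self, key: str) -> None:
        """Mark the group resolved and clear its individual reports."""
        group = self.groups[key]
        group.resolved = True
        group.notified = False
        group.count = 0
        group.jobs.clear()
        group.nodes.clear()


rollup = ErrorRollup(threshold=3)
for node_id in (17, 204, 5):
    rollup.report(f"License checkout failed on node {node_id}", "jobA", f"node{node_id}")
print(rollup.notifications)
```

In a real setup the notification list would be replaced by an email or chat hook, and resolving a group would also have to clear the matching error reports on the individual jobs.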
This approach could either be implemented in Deadline, or an open-source, web-based tool like Sentry could be used to do the same thing:
github.com/getsentry/sentry
The process of consolidating events is called rollup and the algorithm is described here:
docs.getsentry.com/on-premise/rollups/
Sentry has the added benefit that a consolidated error report has a status, can be tagged, and can trigger notifications.
Thanks for implementing it!
Patrick