We still don’t know what could be causing this. Just to refresh our memory, what do you have to do to “fix” this? Does restarting the slave application work, or do you have to restart the machine?
If you enable bad slave detection in Repository Options -> Job Settings -> Failure Detection, then at least your slaves can move on after X consecutive errors, instead of you having to blacklist them manually.
If I remember correctly, and this seems to jibe with what I’m seeing in practice, you have an auto-fail hardcoded in for this error after about 10 minutes. I don’t think it’s actually being marked as an “error” that counts towards the failure detection, though. I do already have failure detection enabled.
The error for this timeout is just like any other error, so it should “count” like any other error. If you have the bad slave error limit in Failure Detection set to 5, then if the slave times out on customize.ms 5 times in a row, it should move on. With a 10 minute timeout per attempt, though, that would take about 50 minutes to determine. Maybe 10 minutes is still a bit high for the timeout for this script…
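To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes every attempt runs for the full timeout before failing and that the time between attempts is negligible. The function name is made up for illustration and is not part of Deadline.

```python
def minutes_until_slave_moves_on(error_limit, timeout_minutes):
    """Worst-case wait before the bad slave error limit is reached.

    Assumption: each of the consecutive attempts hits the full timeout,
    and nothing else adds delay between attempts.
    """
    return error_limit * timeout_minutes


# Compare the current 10 minute timeout against shorter hypothetical values,
# with the bad slave error limit set to 5.
for timeout in (10, 5, 2):
    total = minutes_until_slave_moves_on(5, timeout)
    print(f"timeout {timeout} min -> slave moves on after about {total} min")
```

So under those assumptions, shortening the timeout shortens the wait proportionally, as does lowering the error limit.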