OK, here is an odd one. I started up a slave, but it was a DL4 slave, so it connected to repository 4 instead of repository 5 and started kicking back errors:
Exception Details
JobDeletedException -- job was deleted, jobDirectory: \\sfs-file\repository4\jobs\004_060_000_69adcc6a\004_060_000_69adcc6a.job
JobDeletedException.JobDirectory: \\sfs-file\repository4\jobs\004_060_000_69adcc6a\004_060_000_69adcc6a.job
Exception.Data: ( )
Exception.TargetSite: Boolean RefreshJob(Deadline.Jobs.Job ByRef, Boolean)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Storage.JobStorage.RefreshJob(Job& job, Boolean forceRefresh)
at Deadline.Scheduling.SchedulerUtils.DequeueTasksFromPulse(DeadlineController deadlineController, DeadlineNetworkSettings networkSettings, SlaveState& slaveState, Boolean& invalidMessage)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler - Returning limit group stubs not in use.
Scheduler Thread - Could not lock job to write start date time, trying to force it...
Scheduler Thread - Synchronizing job files
Scheduler Thread - Unexpected Error Occured
I think the jobs were being assigned by Pulse. Because the slave was connected to repository 4 instead of repository 5, it could not make sense of the job info and failed every time. It then told Pulse it needed more work (without registering the error), so Pulse assigned it another job, and as a result this slave took over 80% of the tasks on the farm.
The slave never showed up in the list of slaves either, since it wasn’t connected to repository 5’s slave list, so it was just stealthily stealing all the tasks.
Solution: Have Pulse send a handshake that includes the version number, so it can confirm that Pulse and the slave are on the same repository and the same version before it hands out any tasks?
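
To make the suggestion concrete, here is a rough sketch of the kind of check I have in mind. It is not Deadline code: Identity, normalize and handshake_ok are names I made up, and the repository5 path is only a guess modeled on the repository4 path in the log above. The idea is that Pulse compares its own version and repository path against what the slave reports, and refuses to assign tasks if either one differs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Identity:
    """What each side would report in the handshake (hypothetical structure)."""
    major_version: int
    repository_path: str


def normalize(path: str) -> str:
    # Compare UNC paths case-insensitively, ignoring slash style and trailing slashes.
    return path.replace("/", "\\").rstrip("\\").lower()


def handshake_ok(pulse: Identity, slave: Identity) -> Tuple[bool, str]:
    """Return (accepted, reason). Pulse would only assign tasks when accepted is True."""
    if pulse.major_version != slave.major_version:
        return False, "version mismatch: pulse=%d, slave=%d" % (
            pulse.major_version, slave.major_version)
    if normalize(pulse.repository_path) != normalize(slave.repository_path):
        return False, "repository mismatch: pulse=%s, slave=%s" % (
            pulse.repository_path, slave.repository_path)
    return True, "ok"


if __name__ == "__main__":
    # Hypothetical values: the repository5 path is assumed, not taken from the logs.
    pulse = Identity(5, r"\\sfs-file\repository5")
    old_slave = Identity(4, r"\\sfs-file\repository4")
    accepted, reason = handshake_ok(pulse, old_slave)
    print(accepted, reason)
```

With something like this in place, the old slave would be rejected on its first request instead of being handed one task after another, and the rejection reason could be logged on the Pulse side so the stray machine is at least visible somewhere.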