We’re currently experiencing an issue where our simulation jobs with a machine limit of one eventually end up stuck in the queued state. We believe this could be the same limit issue we encountered back in the Deadline 7 days. What happens is that a machine will be rendering a job and an error comes up on the job. The job then sits queued until we manually change the machine limit from 1 to 2, and even then it will only pick up one machine. Any ideas on what’s causing this?
This is affecting most of our simulation jobs and is really hurting us right now; we have to babysit all of these jobs.
I remember this bug from the Deadline 7 days as well; it seems like there was a regression? We did not have this problem with v8.0.4.
It sounds like the slave that is rendering when the error occurs never releases the limit. That definitely should not be happening. Are you using custom Limit Groups at all, or just the limit that corresponds to the job?
The jobs usually get submitted with 3 different limits that we’ve created. Not sure if that answers your question; if not, could I get some clarification?
When a job is submitted, a limit is created for that job automatically. These limits are not displayed in the Limits panel; the limits you assign are the ones you’ve created yourself. When this error occurs, do you need to raise the machine limit on the job itself, or the count on one of the limits you assigned to the job?
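If it’s easier to check directly in the database, here is a minimal pymongo sketch showing both kinds of limits side by side. It assumes a database named deadline8db and a LimitGroups collection where the auto-created job-level limit is stored under the job’s ID and your own limits under the names you gave them; treat those names (and the placeholders) as assumptions and adjust them to match your repository.

# Minimal sketch: compare the hidden job-level limit with an assigned limit.
# The database, collection, and field names here are assumptions; adjust
# them to match your repository.
from pymongo import MongoClient

client = MongoClient("mongodb://your-db-server:27017")  # adjust host/port
limits = client["deadline8db"]["LimitGroups"]

# The auto-created, hidden limit that enforces the job's machine limit:
job_limit = limits.find_one({"_id": "your_job_id_here"})

# One of the limits you created and assigned at submission time:
assigned_limit = limits.find_one({"_id": "your_custom_limit_name"})

for doc in (job_limit, assigned_limit):
    if doc is not None:
        print(doc["_id"], "Used:", doc.get("Used"), "Holds:", doc.get("Holds"))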
Here is one stalled slave report as an example; after this, the job was stuck until we increased the machine limit:
STALLED SLAVE REPORT
Current House Cleaner Information
Machine Performing Cleanup: deadline01.scanlinevfxla.com
Version: v8.0.10.4 Release (c19fd2cef)
Stalled Slave: LAPRO0677
Slave Version: v8.0.10.4 Release (c19fd2cef)
Last Slave Update: 2016-11-06 00:59:36
Current Time: 2016-11-06 01:10:11
Time Difference: 10.596 m
Maximum Time Allowed Between Updates: 10.000 m
Current Job Name: [GATE] NXN_017_0620_v0039_npf_v39Cache_cache_flowline_Fire_1
Current Job ID: 5817fbadeb608122d8f06b82
Current Job User: nick.pfeiffer
Current Task Names: 1285
Current Task Ids: 335
Searching for job with id "5817fbadeb608122d8f06b82"
Found possible job: [GATE] NXN_017_0620_v0039_npf_v39Cache_cache_flowline_Fire_1
Searching for task with id "335"
Found possible task: 335:[1285-1285]
Task's current slave: LAPRO0677
Slave machine names match, stopping search
Associated Job Found: [GATE] NXN_017_0620_v0039_npf_v39Cache_cache_flowline_Fire_1
Job User: nick.pfeiffer
Submission Machine: LAPRO3145
Submit Time: 10/31/2016 19:19:25
Associated Task Found: 335:[1285-1285]
Task's current slave: LAPRO0677
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave's status to Stalled.
Setting last update time to now.
Slave state updated.
LAPRO0639 is the machine currently rendering. The one that stalled was LAPRO0677, and you can see that it’s no longer listed under “Holds”. But the document still shows “Used: 2”…
The other jobs’ LimitGroups are similarly corrupted.
I’m randomly picking jobs here from the many that failed: this one did it again after having had its limit increased. I have not yet increased its limit further:
It’s possible; we’ll have to do some digging on our end. From the limit you posted, I did notice that the InOverage value is less than 0. We made a fix a few versions ago that should prevent that from happening, but this is especially suspicious considering job-level limits never use overage, so that count should never be getting decremented. Can you confirm whether the other offending jobs have this issue as well?
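If it helps, here is a rough pymongo sketch that would flag limits with that symptom: a negative InOverage, or a Used count that no longer matches the number of Holds. It assumes the database is named deadline8db and that the LimitGroups collection uses the Used/InOverage/Holds field names from the document you posted, so treat those names as assumptions.

# Rough check for corrupted limits: negative InOverage, or a Used count that
# no longer matches the number of Holds. Database, collection, and field
# names are assumptions based on the documents posted in this thread.
from pymongo import MongoClient

client = MongoClient("mongodb://your-db-server:27017")  # adjust host/port
limits = client["deadline8db"]["LimitGroups"]

for doc in limits.find():
    used = doc.get("Used", 0)
    overage = doc.get("InOverage", 0)
    holds = doc.get("Holds", [])
    if overage < 0 or used != len(holds):
        print(doc["_id"], "Used:", used, "InOverage:", overage, "Holds:", holds)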
I’ve done some tests, and I’ve confirmed the issue: when a slave stalls and then tries to resume the job, it decrements the InOverage count on the previous job’s limit instead of the in-use value. We’re currently working on a fix.
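To illustrate what we think is happening (this is only a sketch with made-up names, not actual Deadline code): when the stalled slave’s hold on the job-level limit is released, the wrong counter is decremented, so the in-use count never drops back down and no other slave can dequeue the job.

# Illustrative sketch only, with hypothetical names (LimitDoc, release_hold);
# it is not Deadline source code.
from dataclasses import dataclass
from typing import List

@dataclass
class LimitDoc:
    used: int         # slaves currently holding the limit (in-use count)
    in_overage: int   # slaves holding the limit via overage
    holds: List[str]  # names of the slaves holding the limit

def release_hold(limit: LimitDoc, slave: str, buggy: bool) -> None:
    # Release a stalled slave's hold on the job-level limit.
    if slave in limit.holds:
        limit.holds.remove(slave)
        if buggy:
            # Reported behaviour: the overage counter is decremented, so
            # InOverage goes negative and Used never drops back down.
            limit.in_overage -= 1
        else:
            # Expected behaviour: release the in-use counter instead.
            limit.used -= 1

# With a machine limit of 1, the buggy path leaves used == 1 with no holds,
# so no other slave can dequeue the job until the limit is raised by hand.
limit = LimitDoc(used=1, in_overage=0, holds=["LAPRO0677"])
release_hold(limit, "LAPRO0677", buggy=True)
print(limit)  # LimitDoc(used=1, in_overage=-1, holds=[])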
So simply rolling back the Pulse version (to get the housekeeping and other maintenance operations to use the old version) would not fix this issue; we would have to roll back the entire farm or wait for a hotfix from you guys. Correct?
Thanks, we have rolled back the Pulse machines for now (there were other fixes in 8.0.10 that were also required for us), and it seems to behave much better now. What is your expected timeline for the next update?