I’ve noticed that many jobs with Priority 50 and a machine limit of 1 are taking more than 1 machine.
Sometimes jobs with priority 48 and a machine limit of 0 take all the available machines, even when a job with priority 49 and no machine limit is in the queue.
Thanks for your help.
QBuc.
Edit: It happens most of the time when Pulse is over 15% CPU usage.
We’re starting to handle it, but Pulse goes crazy during some job ingestions.
When some information is erroneous or missing, it maxes out a core (it’s single-threaded, hence the 15% limit on an 8-threaded CPU).
We had some issues with some homemade Maya job submissions, but that seems cleared up now;
it also happens when submitting from Softimage with very few tweaks, using the integrated submission scripts.
We’ll try reverting to the default submission script to make sure it’s not our tweaks.
At some point I’m forced to kill Pulse so it doesn’t get stuck ingesting jobs,
and that’s the precise moment when some slaves pick up jobs they shouldn’t.
That’s very strange behavior. Note that in Deadline 6, the slaves never talk to Pulse to figure out which job they will dequeue. So if Pulse is freaking out on bad job data, that same bad job data must be affecting the slaves as well.
I wouldn’t expect tweaks to the submitter to affect anything, but it would be interesting to know if reverting to the out-of-the-box submitter makes a difference or not. If you are able to reproduce the problem and find the problematic jobs in the Monitor, can you right-click on one or two and select Export Job, and then post the export here? We can drop them into our database to try and reproduce the problem.
Nothing really makes sense in this strange behavior.
I’m trying to narrow down which kinds of jobs or moments make things buggy, and I can’t…
I’ll try to reset our setup over the weekend, preserving a copy of the existing database and repository elsewhere.
If it’s not too heavy, I could make it available so you can check whether something inconsistent got in.
Reset of the Deadline repo + db: useless, the bug is still here.
The bug itself is Pulse throwing this in a loop:
[code]Data Thread - An error occurred while updating cached data: Object reference not set to an instance of an object. (System.NullReferenceException)
Exception Details
NullReferenceException – Object reference not set to an instance of an object.
Exception.Data: ( )
Exception.TargetSite: Void a()
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Pulses.PulseDataThread.a()
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[/code]
Secondary bug:
Repository Clean-Up locks itself up very regularly for no apparent reason (except that it has something to do, which it never does), so there’s an infinite cleanup thread, but it doesn’t output anything.
I’ll try running without Pulse for a while, as it is clearly the weak point as things stand today.
How big is your database if you zip it up? Since it’s difficult to pinpoint the exact jobs that are causing this, perhaps you could just zip up and send us your database the next time it occurs. We’ve never been able to reproduce this here, so it’s probably an edge case that we’re not thinking of. If we can reproduce it here, it should be much easier to track down the cause.
I could remote in there, but I’m concerned about mucking around with your system while you’re in the middle of rendering. If you’re okay with me remoting in, zipping up your database myself, and sending it back to us, I can do that.
Feel free to send me a private message with the connection details. I should be able to remote in today.
EDIT: I should mention that I’ll need access to the machine that mongo is running on, as well as your Pulse machine (if they’re different).
The jobs in the database have a flag called IsSub, which stands for “Is Submitted”. When a job is being submitted, it is first created in the database with this flag set to False. When the job finishes submitting, this is set to True. This is to prevent slaves from rendering a job that hasn’t finished being submitted.
However, there was a bug with the general GetJobs function that Pulse uses to load the jobs. If there was at least one job with IsSub set to false, this error would occur. We have fixed this internally, and the fix will be included with beta 13. This fix should prevent Pulse from freaking out like this in the future. We have also improved the code that was responsible for setting this flag to make it a bit more robust.
Finally, we’re going to look into adding house cleaning code to remove jobs that have IsSub set to False and aren’t showing any signs that the flag will ever be set to True.
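For reference, here is a rough sketch of how that flag appears on a job document. Everything here is simplified and hypothetical apart from the Jobs collection and the IsSub field mentioned above; real job documents contain many more fields.
[code]// While submission is still in progress: slaves must skip this job
{ "_id" : "<job id>", "IsSub" : false /* , ...many other job fields... */ }

// Once submission completes, the flag flips and the job becomes eligible for dequeue
{ "_id" : "<job id>", "IsSub" : true /* , ...many other job fields... */ }[/code]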
For now, you can manually remove these problematic jobs by using mongo.exe to connect to the database. If you’re running it on the same machine that mongo is running on, do this:
[code]mongo.exe localhost[/code]
Then do the following from the mongo terminal:
[code]use deadlinedb
db.Jobs.remove( {IsSub:false} )[/code]
That will remove the problematic jobs and should allow Pulse to calm down again.
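If you want to confirm how many jobs are in that state first (and again afterwards to verify they’re gone), a quick check from the same mongo terminal is sketched below, using only the deadlinedb database and the Jobs/IsSub names shown above:
[code]use deadlinedb
// How many jobs are still flagged as not fully submitted?
db.Jobs.count( {IsSub:false} )
// Optionally inspect them before removing, if you want to be sure
db.Jobs.find( {IsSub:false} )[/code]
Running the count again after the remove should return 0 if all the stuck jobs were cleared.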
Hmm, 2 minutes does seem longer than it should take. Do you happen to know how many jobs were in the queue at this time?
Can you send us the Pulse log file from this session? The file will be useful because we prefix every line with a time stamp. You can find the log folder from Pulse by selecting Help -> Explore Log Folder.
Thanks. I’m really not sure what would cause this at this point. The fact that it’s random doesn’t help either.
We’ll add a bit more debugging output to beta 13 so that if it does happen again, we can determine if the slowdown happens when loading the jobs or when scanning them.
I’m sorry to announce that we had more and more issues,
and we finally had to revert to 5.2 as production is growing.
The good news is that all the code we wrote works flawlessly with both versions.
I’ll sure miss the new layout capabilities of the Monitor, but it’s becoming too risky for us to be on an unstable version.
I hope the bugs we reported will help…
Is there any chance we could get modo progress reporting in 5.2?
That’s completely understandable, and we appreciate you sending us the bug reports. I think we’ve addressed most of your issues for the upcoming beta 13 release, which we hope to get out this week. The long job repository scan was the only one we couldn’t track down, but hopefully the additional debug logging we’ve added will help us narrow the scope. It’s even possible that fixing the other Pulse issue with loading jobs also fixed this repository scan issue, since they both involve Pulse loading jobs.
Getting progress reporting in modo for 5.2 should be pretty straightforward. What’s the full version of 5.2 you are using? We just want to make sure we’re updating the correct version of the modo plugin for you.