I’ve noticed that many jobs with Priority 50 and a machine limit of 1 are taking more than 1 machine.
Sometimes jobs with priority 48 and a machine limit of 0 take all the available machines, even when a job with priority 49 and no machine limit is in the queue.
Thanks for your help.
QBuc.
Edit: It happens most of the time when Pulse is over 15% CPU usage.
We’re starting to handle it, but Pulse goes crazy during some job ingestions.
When some information is erroneous or missing, it maxes out a core (it’s single-threaded, hence the 15% limit on an 8-threaded CPU).
We had some issues with some homemade Maya job submissions, but that seems cleared up now;
it also happens when submitting from Softimage with very few tweaks, using the integrated submission scripts.
We’ll try reverting to the default submission script to make sure it’s not our tweaks.
At some point I’m forced to kill Pulse so it doesn’t get stuck ingesting jobs,
and that’s the precise moment when some slaves pick up jobs they shouldn’t.
That’s very strange behavior. Note that in Deadline 6, the slaves never talk to Pulse to figure out which job they will dequeue. So if Pulse is freaking out on bad job data, that same bad job data must be affecting the slaves as well.
I wouldn’t expect tweaks to the submitter to affect anything, but it would be interesting to know if reverting to the out-of-the-box submitter makes a difference or not. If you are able to reproduce the problem and find the problematic jobs in the Monitor, can you right-click on one or two and select Export Job, and then post the export here? We can drop them into our database to try and reproduce the problem.
Nothing really makes sense in this strange behavior.
I’m trying to narrow down which kinds of jobs or moments make things buggy, and I can’t…
I’ll try to reset our setup over the weekend, preserving a copy of the existing database and repository elsewhere.
If it’s not too heavy, I could make it available so you can check whether something inconsistent got in.
Reset of the Deadline repo + db: useless, the bug is still here.
The bug itself is Pulse throwing this in a loop:
[code]Data Thread - An error occurred while updating cached data: Object reference not set to an instance of an object. (System.NullReferenceException)
Exception Details
NullReferenceException – Object reference not set to an instance of an object.
Exception.Data: ( )
Exception.TargetSite: Void a()
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Pulses.PulseDataThread.a()
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[/code]
Secondary bug:
Repository Clean-Up locks itself up very regularly for no apparent reason (except that it has something to do, which it never does), so there’s an infinite cleanup thread, but it doesn’t output anything.
I’ll try running without Pulse for a while, as it is clearly the weak point as things stand today.
How big is your database if you zip it up? Since it’s difficult to pinpoint the exact jobs that are causing this, perhaps you could just zip up and send us your database the next time it occurs. We’ve never been able to reproduce this here, so it’s probably an edge case that we’re not thinking of. If we can reproduce it here, it should be much easier to track down the cause.
I could remote in there, but I’m concerned about mucking around with your system while you’re in the middle of rendering. If you’re okay with me remoting in, zipping up your database myself, and sending it back to us, I can do that.
Feel free to send me a private message with the connection details. I should be able to remote in today.
EDIT: I should mention that I’ll need access to the machine that mongo is running on, as well as your Pulse machine (if they’re different).
The jobs in the database have a flag called IsSub, which stands for “Is Submitted”. When a job is being submitted, it is first created in the database with this flag set to False. When the job finishes submitting, this is set to True. This is to prevent slaves from rendering a job that hasn’t finished being submitted.
However, there was a bug with the general GetJobs function that Pulse uses to load the jobs. If there was at least one job with IsSub set to false, this error would occur. We have fixed this internally, and the fix will be included with beta 13. This fix should prevent Pulse from freaking out like this in the future. We have also improved the code that was responsible for setting this flag to make it a bit more robust.
Finally, we’re going to look into adding house cleaning code to remove jobs that have IsSub set to False and aren’t showing any signs that the flag will ever be set to True.
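For reference, here is a rough sketch of how that flag appears on a job document. Everything here is simplified and hypothetical apart from the Jobs collection and the IsSub field mentioned above; real job documents contain many more fields.
[code]// While submission is still in progress: slaves must skip this job
{ "_id" : "<job id>", "IsSub" : false /* , ...many other job fields... */ }

// Once submission completes, the flag flips and the job becomes eligible for dequeue
{ "_id" : "<job id>", "IsSub" : true /* , ...many other job fields... */ }[/code]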
For now, you can manually remove these problematic jobs by using mongo.exe to connect to the database. If you’re running it on the same machine that mongo is running on, do this:
[code]mongo.exe localhost[/code]
Then do the following from the mongo terminal:
[code]use deadlinedb
db.Jobs.remove( {IsSub:false} )[/code]
That will remove the problematic jobs and should allow Pulse to calm down again.
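If you want to confirm how many jobs are in that state first (and again afterwards to verify they’re gone), a quick check from the same mongo terminal is sketched below, using only the deadlinedb database and the Jobs/IsSub names shown above:
[code]use deadlinedb
// How many jobs are still flagged as not fully submitted?
db.Jobs.count( {IsSub:false} )
// Optionally inspect them before removing, if you want to be sure
db.Jobs.find( {IsSub:false} )[/code]
Running the count again after the remove should return 0 if all the stuck jobs were cleared.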
Hmm, 2 minutes does seem longer than it should take. Do you happen to know how many jobs were in the queue at this time?
Can you send us the Pulse log file from this session? The file will be useful because we prefix every line with a time stamp. You can find the log folder from Pulse by selecting Help -> Explore Log Folder.
Thanks. I’m really not sure what would cause this at this point. The fact that it’s random doesn’t help either.
We’ll add a bit more debugging output to beta 13 so that if it does happen again, we can determine if the slowdown happens when loading the jobs or when scanning them.
I’m sorry to announce that we had more and more issues,
and we finally had to revert to 5.2 as production is growing.
The good news is that all the code we wrote works flawlessly with both versions.
I’ll sure miss the new layout capabilities of the Monitor, but it’s becoming too risky for us to be on an unstable version.
I hope the bugs we reported will help…
Is there any chance we could get modo progress reporting in 5.2?
That’s completely understandable, and we appreciate you sending us the bug reports. I think we’ve addressed most of your issues for the upcoming beta 13 release, which we hope to get out this week. The long job repository scan was the only one we couldn’t track down, but hopefully the additional debug logging we’ve added will help us narrow the scope. It’s even possible that fixing the other Pulse issue with loading jobs also fixed this repository scan issue, since they both involve Pulse loading jobs.
Getting progress reporting in modo for 5.2 should be pretty straightforward. What’s the full version of 5.2 you are using? We just want to make sure we’re updating the correct version of the modo plugin for you.