Sometimes I click on a job, then wait 3-5 minutes for the task list to update :-\
I checked, and the mongo service is running at 100% usage of one core (total usage on the machine is maybe 16% or so).
Hey Laszlo,
What does your disk I/O look like on the mongo machine? Also, are you guys using regular hard drives, or SSDs? If your I/O is constantly maxed and you're not using SSDs, it would be a good idea to try migrating over to some. It'd also be good to get some logs from the mongod process, so we can see what might be causing this.
Cheers,
Also, out of curiosity, what values are you currently using for Monitor update intervals, and Slave wait times? Have you tried tweaking those values at all to see if it makes a difference for you?
It'd also be useful if you could connect to the Mongo DB through the mongo shell and get a few stats from there. The easiest way to do that is to get on the MongoDB machine and run 'mongo.exe' from the mongo install directory. (You can also download the Mongo binaries on your machine and connect to the DB server remotely by running 'mongo.exe <hostname>:<port>', if you want.)
Once you’ve got the shell open, you need to switch to the Deadline DB, by using the ‘use deadlinedb’ command (if you changed the default db name, use that instead of ‘deadlinedb’).
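For example, a remote session would look something like this (the hostname and port below are just placeholders; 27017 is only the default port):

mongo.exe mydbserver:27017
> use deadlinedb
switched to db deadlinedb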
Finally, if you could run the following four commands and paste the output here, that’d be super-useful!
db.stats()
db.printCollectionStats()
db.call.stats()
db.serverStatus()
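If it's easier, you can also capture these non-interactively with --eval and redirect the output to files (assuming mongo.exe is on your PATH and the DB is local with the default name):

mongo.exe deadlinedb --eval "printjson(db.stats())" > mongo_stats.txt
mongo.exe deadlinedb --eval "printjson(db.serverStatus())" > mongo_serverstatus.txt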
Cheers,
Hey Laszlo,
We'll be adding a 'View Database Statistics' menu item to the Help menu in the Monitor, Slave, and Pulse to get these stats easily. This will be included in the next beta.
However, if you could still let us know the info that Jon was looking for, that would be great!
Cheers,
Some IO stats. They don't look too bad? But I don't know how to read these numbers; the utilization doesn't seem very high:
[root@deadline ~]# iostat -d -x 5 3
Linux 2.6.32-279.el6.x86_64 (deadline.scanlinevfxla.com) 04/02/2013 x86_64 (8 CPU)
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 1.90 4.36 33.93 11.90 4588.16 148.56 103.36 0.48 10.52 1.31 6.02
sdb 2.14 4.40 33.70 11.85 4589.78 148.56 104.03 0.42 9.15 1.55 7.05
md127 0.00 0.00 0.06 16.15 13.06 148.45 9.97 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.06 15.23 13.06 142.80 10.19 0.48 31.18 3.28 5.02
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 46.88 1.14 0.00
dm-2 0.00 0.00 0.00 0.71 0.00 5.65 8.00 0.20 283.16 0.10 0.01
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.00 0.00 16.80 0.00 163.20 9.71 0.29 17.48 2.21 3.72
sdb 0.00 1.00 0.00 16.80 0.00 163.20 9.71 0.29 17.13 2.21 3.72
md127 0.00 0.00 0.00 17.80 0.00 163.20 9.17 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 17.60 0.00 163.20 9.27 0.34 19.07 2.41 4.24
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.40 0.00 8.80 0.00 84.80 9.64 0.04 4.55 2.68 2.36
sdb 0.00 0.80 0.00 8.40 0.00 84.80 10.10 0.04 4.26 2.64 2.22
md127 0.00 0.00 0.00 9.20 0.00 84.80 9.22 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 9.00 0.00 84.80 9.42 0.05 5.13 3.02 2.72
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Attached are the mongo stats
mongo_stats.txt (468 Bytes)
mongo_serverstatus.txt (5.88 KB)
mongo_collectionstats.txt (9.44 KB)
The good news is that after looking at your stats, I don’t think it’s I/O related (all of the Deadline DB is fitting in RAM still, which is great). It seems to have a really high write-lock percentage though… I’m guessing it’s related to slaves taking longer than we’d expect to update their status, since that’s really the main thing that is performing writes on a regular basis.
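If you want to keep an eye on that write-lock figure yourself, you can compute it in the mongo shell from serverStatus() (a rough sketch; the globalLock.lockTime field assumes a 2.x-era mongod, and since both counters run from startup this gives a since-boot average, not a live reading):

var s = db.serverStatus()
// totalTime and lockTime are both in microseconds since mongod started
var pct = 100 * s.globalLock.lockTime / s.globalLock.totalTime
print("global write lock held " + pct.toFixed(1) + "% of uptime")

For a live reading, the 'locked %' column in mongostat shows roughly the same thing, sampled every second.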
What are you guys using for the 'Number of seconds between Slave information updates' value in the Repository Options (under Slave Settings)? I'd definitely try bumping that up (especially if it's under 15s), at least until we figure out the root of the problem. You won't be getting Task/Slave progress updates as often, but I bet you'd see an improvement in performance all around. Keep in mind, too, that the Slaves will still always update as soon as their statuses change (from idle to rendering, or what have you); this really only affects Task Progress.
-Jon
It was at the default setting of 10; I adjusted it up to 15.
How fast does that setting propagate to the slaves?
Btw, my Deadline Monitor has not updated any of its jobs for days now. The only way to get it to update is to restart it, but then it gets stuck again right away.
Tasks / slaves seem to update.
The slaves should be updating the network settings every 10 minutes or so, so it shouldn’t take super long to propagate. If you’re still having issues, I’d try setting it to something over a minute, to see if it makes any difference.
How many jobs do you guys have in the repo right now, and what proportion of them are queued/rendering, roughly? Obviously you can't give an exact number since the Jobs aren't refreshing properly; a ballpark is fine.
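If the Monitor won't refresh, you can also get rough numbers straight from the mongo shell. This sketch just prints the document count of every collection in the Deadline DB, rather than assuming any particular collection names:

use deadlinedb
db.getCollectionNames().forEach(function (name) {
  print(name + ": " + db[name].count())
})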
We have around 4700 jobs (4570 of them complete, which I can't delete because deletion takes too long).
Only a few are queued, maybe around 10-15, with about the same number rendering.
By changing the slave update time to 30 seconds and restarting most of the slaves, we managed to push the mongodb process activity down a bit. It currently fluctuates between 50-120%, with periods under 100% usage and periods sustained over 100%. But it's not a constant 100% any more. As I write this, it's been sitting at 110%, so the problem is not yet solved.
Glad to know it's at least helping a bit. We are currently looking for ways to reduce the number of writes the Slaves do, to avoid this issue in the future.
Also, here's another interval to check. Under Slave Settings in the Repository Options, what do you have set for these two properties: the Job polling time, and the Task polling time?
Thanks!
Just realized that there is a bug where, when the slave is idle, it still uses the Task polling time instead of the Job polling time. That would cause idle slaves to hit the repository much more rapidly than they should.
This will be fixed in beta 18.
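To illustrate the fix (just a sketch of the behavior described above, not Deadline's actual code):

// Sketch only -- not Deadline source. Before the fix, idle slaves
// effectively polled at the (shorter) task interval regardless of state.
function nextPollSeconds(slaveIsIdle, jobPollSeconds, taskPollSeconds) {
  // Beta 18: idle slaves use the job polling time,
  // so they hit the repository far less often.
  return slaveIsIdle ? jobPollSeconds : taskPollSeconds;
}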