beta 13 errors

LaszloSebo · March 1, 2013, 8:04pm

We've started seeing a lot of errors like this since we rolled out beta 13. The slave picks up a task, starts rendering, then pops this right away. Note that the slave system time seems to be off by an hour:

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: lapro0216
Version: v6.0.0.50509 R

Stalled Slave: LAPRO0233
Slave Version: v6.0.0.50509 R
Last Slave Update: 2013-03-01 12:04:56
Current Time: 2013-03-01 13:04:58
Time Difference: 1.001 hrs
Maximum Time Allowed Between Updates: 10.000 m

Current Job Name: [TBOA] Software Render: FB_090_1770_maya_animation_layout.ma version: v0013
Current Job ID: 51310898f5ec9b07b48a2868
Current Job User: Winn.OBrien
Current Task Names: 1151-1165
Current Task Ids: 10

Searching for job with id “51310898f5ec9b07b48a2868”
Found possible job: [TBOA] Software Render: FB_090_1770_maya_animation_layout.ma version: v0013
Searching for task with id “10”
Found possible task: 10:[1151-1165]
Task’s current slave: LAPRO0233
Slave machine names match, stopping search
Associated Job Found: [TBOA] Software Render: FB_090_1770_maya_animation_layout.ma version: v0013
Job User: Winn.OBrien
Submission Machine: LAPRO3060
Submit Time: 03/01/2013 12:02:34
Associated Task Found: 10:[1151-1165]
Task’s current slave: LAPRO0233
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.

LaszloSebo · March 1, 2013, 8:06pm

It seems that the slave time is being checked against the db time, which can be quite off.

LaszloSebo · March 1, 2013, 8:11pm

Strange thing is, double checking the offending slave, it has the right time setting:

C:\Documents and Settings\ScanlineVFX>time
The current time is: 12:10:25.60

The Deadline server time is also correct:
[root@deadline ~]# date
Fri Mar 1 12:11:41 PST 2013

Any ideas?

LaszloSebo · March 1, 2013, 8:14pm

For now I increased the maximum time allowed between updates to 120 minutes, since otherwise our farm stopped working :\

LaszloSebo · March 1, 2013, 8:22pm

I noticed that the slave was reporting proper times before beta 13, then it started reporting times that are an hour later.

The particular slave had error reports logged at (these are the times listed in the slave ‘right click / slave reports’ popup dialog):

13.10 <-- now using v6.0.0.50509 R
13.07 <-- now using v6.0.0.50509 R
13.05 <-- now using v6.0.0.50509 R
11.58 <-- still using v6.0.0.50272 R
11.56 <-- still using v6.0.0.50272 R
10.57 <-- still using v6.0.0.50272 R

rrussell · March 1, 2013, 8:24pm

Hmm, beta 13 shouldn’t have introduced anything like this…

We’ve logged this as a bug, and will investigate.

Did the time on the database machine perhaps change?

LaszloSebo · March 1, 2013, 10:02pm

Nope, the db server has the right time

rrussell · March 4, 2013, 2:30pm

We did make a breaking change between beta 12 and 13 with respect to reports, so it’s possible those times are misleading.

Jon and I walked through the code on Friday, and we couldn’t find any reason why this would be happening. When checking for stalled slaves, Deadline is always using the database time, and is adjusting it to local time so that even if the database is in a different time zone, it shouldn’t matter.
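As a rough illustration of the check described above (this is not Deadline's actual code), here is a minimal Python sketch that compares two timestamps only after normalizing both to UTC. The function name `is_stalled`, the fixed-offset `PST` zone, and the "two seconds later" database time are assumptions made up for the example; the 10 minute limit and the last update time come from the report.

```python
from datetime import datetime, timedelta, timezone

def is_stalled(last_update, now, max_gap):
    """Return True if the gap between two timezone-aware timestamps exceeds max_gap."""
    # Normalizing both sides to UTC makes the result independent of
    # whichever time zone each timestamp happens to be expressed in.
    gap = now.astimezone(timezone.utc) - last_update.astimezone(timezone.utc)
    return gap > max_gap

PST = timezone(timedelta(hours=-8))  # fixed offset, for illustration only
MAX_GAP = timedelta(minutes=10)      # "Maximum Time Allowed Between Updates"

# Last slave update from the report, tagged as Pacific standard time.
last_update = datetime(2013, 3, 1, 12, 4, 56, tzinfo=PST)
# Hypothetical database time two seconds later, expressed in UTC.
now = datetime(2013, 3, 1, 20, 4, 58, tzinfo=timezone.utc)

print(is_stalled(last_update, now, MAX_GAP))  # False
```

With both values anchored to UTC, the two timestamps are two seconds apart and the slave is not flagged, regardless of the zone either one was written in.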

Just in case, can you check the time on ‘lapro0216’ if you haven’t done so already? This is the machine that reported that the slave on ‘LAPRO0233’ was stalled.

LaszloSebo · March 4, 2013, 6:28pm

You are right, the system time on lapro0216 is all out of whack. It's showing April 8th 2008, 12:11pm…

I've reset that now, but it's weird that the time is WAY off, yet the reported times were just 1 hour off?

rrussell · March 4, 2013, 7:07pm

Thanks for checking that. It would appear that the slave machine’s local time is still playing a role, which is something we were trying to avoid in Deadline 6. We’ll look into it.

Cheers,

Ryan

jgaudet · March 4, 2013, 11:24pm

Given that date (and the current date), I’m willing to bet it was a Daylight Saving Time thing (April 2008 would’ve been in DST, while the current date is not). We’ll make sure that it won’t stay a problem.
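To show how a DST mismatch alone can produce a one hour error even when a clock is years off, here is a small Python sketch. It is only one plausible mechanism, not a description of Deadline's code: it assumes the database time is in UTC and that the machine whose clock sits in April 2008 applies the daylight offset (UTC-7) instead of the standard offset (UTC-8) when converting to local time. The `db_now_utc` value is hypothetical; the last update time is taken from the report.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # Pacific standard time (March 1, 2013)
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific daylight time (April 8, 2008)

# Hypothetical database timestamp in UTC, i.e. 12:04:58 PST on March 1, 2013.
db_now_utc = datetime(2013, 3, 1, 20, 4, 58, tzinfo=timezone.utc)

# Converting with the correct offset versus the offset a machine would
# pick if it believed the current date fell inside DST.
correct_local = db_now_utc.astimezone(PST).replace(tzinfo=None)
skewed_local = db_now_utc.astimezone(PDT).replace(tzinfo=None)

# "Last Slave Update" from the report, as a naive local time.
last_update_local = datetime(2013, 3, 1, 12, 4, 56)

print(correct_local - last_update_local)  # 0:00:02
print(skewed_local - last_update_local)   # 1:00:02
```

Under these assumptions the skewed difference is 1 hour and 2 seconds, which rounds to the 1.001 hrs shown in the stalled slave report.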

LaszloSebo · March 4, 2013, 11:53pm

Thanks. In the meantime, I asked our IT to try and fix the dates/times on the slaves. Considering our workload right now, that might not happen :-\

jgaudet · March 4, 2013, 11:58pm

I’ve confirmed that the issue was indeed a DST thing. I was able to replicate it on my machine, and I’ve fixed it for the next Beta!

Cheers,

Jon

LaszloSebo · March 5, 2013, 12:04am

Glorious, thanks!