Sorry for the misleading title - ‘tasks continually restarting’ is actually a side effect of this, I believe.
When I investigated further, I noticed the machine was running multiple slave instances.
Thanks for the information! A couple of follow-up questions…
Just to confirm, did you check this on a machine that was running more than one slave instance? The “ps” test you ran only shows one slave instance running on that machine.
Also, on a machine that has multiple instances like this running, can you grab the logs from that day (and maybe the previous day), zip/tar them up, and post them? You can find the logs on the render node in /var/log/Thinkbox/Deadline7.
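If it helps, something along these lines should confirm whether more than one slave process is running and bundle the logs for upload (the path assumes the default Deadline 7 install on Linux; adjust as needed):

    # list any running Deadline slave/launcher processes
    ps aux | grep -i [d]eadline

    # bundle the log folder for upload
    tar czf deadline-logs.tar.gz /var/log/Thinkbox/Deadline7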
Thanks for the logs! Based on the logs, it looks like two deadlinelaunchers are trying to start when the system boots up. This might have something to do with the problem.
Can you go to the /etc/init.d folder on this machine and see if there is more than one file that looks like ‘deadlinelauncherservice’ or ‘deadline#launcherservice’ (where # represents the Deadline version)? If there is more than one, try deleting the others EXCEPT for ‘deadline7launcherservice’, then reboot the machine and let us know if the problem happens again. If it does, send us the logs again.
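Something like the following should show any duplicates (the exact file names can vary, so double-check before deleting anything; ‘deadline6launcherservice’ below is just a hypothetical example of an older leftover script):

    # look for duplicate launcher service scripts
    ls -l /etc/init.d/ | grep -i launcher

    # remove any extras, keeping only deadline7launcherservice, e.g.:
    # rm /etc/init.d/deadline6launcherservice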
We also have a problem with tasks constantly restarting. It’s affecting a large chunk of our jobs. The machine picks up the job and starts rendering, then a couple of minutes later I notice that another machine takes over the task. Later, the first machine has this in the log:
2015-01-28 10:01:20: 0: INFO: STARTED
2015-01-28 10:01:21: 0: INFO: Lightning: Render frame 990
2015-01-28 10:04:50: Scheduler Thread - Task "39_990-990" could not be found because task has been modified:
2015-01-28 10:04:50: current status = Rendering, new status = Rendering
2015-01-28 10:04:50: current slave = LAPRO0623, new slave = LAPRO0656
2015-01-28 10:04:50: current frames = 990-990, new frames = 990-990
2015-01-28 10:04:50: Scheduler Thread - Cancelling task...
2015-01-28 10:04:51: 0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2015-01-28 10:04:51: 0: Unloading plugin: 3dsmax
2015-01-28 10:04:58: Scheduler Thread - In the process of canceling current tasks: ignoring exception thrown by render thread 0
No error log, no requeue log. In fact, max is usually still rendering when this happens.
This is happening on jobs that were rendering happily for a day; then, since yesterday around 6pm, they just constantly requeue…
This is happening because you likely still have some slaves (or pulse) running an older version of Deadline 7. You can use the version column in the Slave list in the Monitor to check the versions of the slaves. It is highly recommended to get all machines upgraded to 7.0.2.3 (the current public release).
Note that we had been running without any issues for 2 weeks. This started happening yesterday (and is still happening) on a large chunk of jobs, none of which have progressed a single frame (while other jobs are going just fine).
All signs point to this being unrelated to version discrepancies. We are mostly on 7.0.2.2; afaik .3 only had a minor MAXScript-related fix.
I’ll move all machines to 7.0.2.2 to make sure it’s consistent, and report back.
I suspect that perhaps an old slave (or pulse) was started up yesterday, and that’s what is causing this. Have you checked the pulse list in the Monitor to see if there is another pulse running?
Also, in your slave list, do you have any slave lines that are mostly blank (and the slave is a blue color)? If so, those would be old versions of the slave as well.
I did find a couple of machines that were blue that weren’t even supposed to be in the slave list (administrative images etc.); I disabled them ~10 minutes ago. Monitoring now.
It must have been one of those rogue machines with an old image. I am starting to see some tasks finish now…
I really wish we could globally turn off slaves performing pulse operations so that we could eliminate this risk in a controlled manner.
With 2k+ slaves, it’s practically impossible to control what’s going on whenever a breaking change is introduced in a Deadline build, or there is some disruption in slave-to-pulse communication. Currently, slaves acting as fallback pulse machines provides no redundancy, only risk. I think with the new “multiple pulse” feature, we should be able to create a robust pulse network without relying on the slaves.
Really glad to hear you were able to find the rogue slaves. This was a really nasty bug, and your idea of having a global option to disable housecleaning for the slaves is a good one. We’ll add that to the todo list for 7.1.
Thanks Ryan. Yeah, am I ever glad we could sort this out with relative ease. I had a near heart attack this morning, haha.
Thanks for adding that switch in 7.1; I think it will provide peace of mind (and it will also light a fire under our bottoms to set up the redundant pulses).