AWS Thinkbox Discussion Forums

Feature Request

One thing I’ve noticed about Deadline Slaves: when a machine boots up, the slave is started, and if it fails to see the repository it goes offline and doesn’t keep trying. It displays a dialog box about the error and stays stuck there, waiting for the user to click the “ok” button. Meanwhile, the repository may have become accessible.



You can reproduce this problem by booting a slave with its network cable unplugged. After the popup error message appears, plug the cable back into the machine. The slave will not start. (Note: this assumes your repository is located on another machine.)



Would it be possible for the slave to remain running and check every X minutes for access to the repository? It doesn’t have to be 100% running. You could have a popup dialog saying “waiting for access to repository”.
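
To illustrate the kind of polling I mean, here is a minimal sketch in Python. It is not Deadline code: the function name, the repository path, and the default interval are all hypothetical, and it assumes the repository counts as accessible once its directory can be seen on the network share.

```python
import os
import time


def wait_for_repository(path, interval_seconds=300, max_attempts=None,
                        is_available=os.path.isdir, sleep=time.sleep):
    """Poll until `path` is reachable and return the number of attempts made.

    Returns None if `max_attempts` is set and every attempt failed.
    `is_available` and `sleep` are injectable so the loop can be tested
    without a real file server (both are assumptions of this sketch).
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if is_available(path):
            return attempts
        # Stand-in for the "waiting for access to repository" dialog.
        print("Waiting for access to repository: %s" % path)
        sleep(interval_seconds)
    return None
```

A wrapper script could call something like `wait_for_repository(r"\\fileserver\DeadlineRepository")` (a made-up path) before launching the slave, so the slave only starts once the share is visible.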



Mostly, this happens when we get a power outage. When the power comes back, the slaves boot faster than the main file server that hosts the repository. The slaves check for access to our file server, then display a popup, and a few minutes later our file server finishes booting. The slaves remain stuck waiting for a user to click the “ok” button.



I’m worried this will happen during a weekend job and cause all the slaves to get stuck.



Mathew Foscarini

Technical Director

Crush, Inc.

Toronto

Another thing: I haven’t tested this yet, but I think that if the machine boots and the FLEXlm service is not found, the slave will report the error that the service could not be found and then also get stuck.



Mathew Foscarini

Technical Director

Crush, Inc.

Toronto

Hi Mathew,



I think that’s a good idea, and we’ll add it to our wish list. We would likely make it a feature that you can toggle on and off.



Cheers,



Ryan Russell

Frantic Films Software

http://software.franticfilms.com/

(204)949-0070

That’s great. Thanks.



Mathew Foscarini

Technical Director

Crush, Inc.

Toronto

Has anything been done about this in version 2.6?



We had a power outage over the weekend, and all the slaves failed to restart correctly. They showed up as “stalled” in the manager, and when I went to the desktops of the slaves, the slave software was not running.



It appears that if they fail to see the repository, they just quit.



The slaves should keep trying.



Mathew Foscarini

Technical Director

Crush, Inc.

Toronto

Hey Mathew,



If the slaves are already running, and the repository machine goes down, they should still keep trying to connect. However, if the slaves are just starting up and they can’t connect (which sounds like the case you’re reporting), then they will just display an error and exit. We agree that in that case, the slave should keep trying to connect, and we’ve logged this feature request on our todo list.



Cheers,

Yes, please add the feature.



Not that it matters, but what happens here is that our file server is a Unix machine with about 8 RAID drives mounted on it. When the power is cut off, the file server assumes there might have been a hardware failure, so when it boots up it checks each drive, mounting the drives one at a time as each passes its check. This can take several minutes. So the slaves boot up and sit idle for several minutes until the file system comes online.



We’re expanding our file system (it could reach 10 terabytes), so I expect our power-up operations to take even longer in the future.



Mathew Foscarini

Technical Director

Crush, Inc.

Toronto

Just to give you an update, we will include this feature in the next release. In the cases where the slave can’t connect to the repository or license server on startup, it will keep trying to connect until someone comes along and tells the slave to do otherwise. After each attempt, the slave will display that countdown dialog, but instead of exiting when the countdown reaches 0, it will just try again and again until it connects or until someone hits the “Exit Slave” button in the countdown dialog. Of course, you will also have the option to set the repository or license server if necessary.
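
The behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not the actual slave implementation: `connect_with_countdown`, `try_connect`, and the use of a `threading.Event` to stand in for the “Exit Slave” button are all assumptions made for the example.

```python
import threading


def connect_with_countdown(try_connect, exit_requested, countdown_seconds=10):
    """Retry try_connect() until it succeeds or an exit is requested.

    `try_connect` is a hypothetical callable returning True once the
    repository or license server is reachable. `exit_requested` is a
    threading.Event that a dialog's "Exit Slave" button would set.
    Returns True on success, False if the user asked to exit.
    """
    while not exit_requested.is_set():
        if try_connect():
            return True
        # Countdown between attempts. Event.wait() returns True as soon
        # as the event is set, so an exit request ends the wait early.
        if exit_requested.wait(countdown_seconds):
            return False
    return False
```

The point of the design is that reaching the end of the countdown leads to another attempt rather than an exit; only an explicit request stops the loop.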



Cheers,

That’s great.



If it fails to connect to the repository or license server, it should send an e-mail to the supervisor reporting the problem, but it should only send one e-mail.
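
As a rough sketch of the “only one e-mail” idea, in Python: the class name and the `send` callable are hypothetical, and how the message is actually delivered (SMTP or otherwise) is left out on purpose.

```python
class StartupFailureNotifier:
    """Send at most one notification per run of consecutive failures."""

    def __init__(self, send):
        # `send` is any callable taking a message string (an assumption
        # of this sketch; it could wrap an SMTP client).
        self._send = send
        self._notified = False

    def report_failure(self, reason):
        """Notify on the first failure only; return True if a message went out."""
        if self._notified:
            return False
        self._send("Slave could not start: %s" % reason)
        self._notified = True
        return True

    def report_success(self):
        """Re-arm the notifier so a later outage triggers a new e-mail."""
        self._notified = False
```

Each failed connection attempt would call `report_failure`, but the supervisor would only hear about the first one until the slave connects again.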



Also, it would be nice if, when Deadline switches the status of a slave from something active to “stalled”, it sent an e-mail to the supervisor about the stalled condition.
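
Detecting that transition could look something like this Python sketch. The status strings and the dictionary layout are made up for illustration and are not taken from Deadline.

```python
def newly_stalled(previous, current, stalled="stalled"):
    """Return the sorted names of slaves that just changed to `stalled`.

    `previous` and `current` map slave names to status strings from two
    consecutive checks. A slave that was already stalled, or that was not
    known at the previous check, is not reported.
    """
    return sorted(
        name for name, status in current.items()
        if status == stalled and previous.get(name, stalled) != stalled
    )
```

An e-mail would then be sent once for each name returned, rather than on every check while the slave stays stalled.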



Mathew Foscarini

Technical Director

Crush, Inc.

Toronto
