AWS Thinkbox Discussion Forums

Understanding power management commands

Hi,
I’m trying to set up power management for some nodes.

I have a working IPMI script on the machine running Pulse that is supposed to be used to restart the nodes, but I can’t get it to execute.

Maybe I got confused about where the wake-up command is executed from. Isn’t it the node that is running Pulse?

Thank you,
Pierre

Hi Pierre,

Power Management signals such as “Machine Startup” via the “Run Command” are executed by the Pulse application. Note that if you execute a “Remote Command” via the Slaves panel in your Monitor to “start up a machine”, this is currently hard-wired to support only WOL and is executed by the machine running Monitor. There is a wishlist item to expose and abstract this functionality so it can be configured per slave in the future.

So it should indeed be executed by Pulse if you’re using Power Management. Here’s the example from our docs:
docs.thinkboxsoftware.com/produc … ne-startup

Could I see your script and the arguments you are passing to the “Run Command”?

Hello,

It would indeed be nice to have access to an IPMI script in the Monitor’s “Remote Command”.
We don’t use WOL at all, so we would have to go through the BIOS setup on every node we run. A hell of a job :frowning:

About Power Management: I wrote a little script on the server that runs Pulse.

ipmi_control:

[code]#!/bin/bash
# ipmi_control: power-control a render node through its IPMI interface.
# Usage: ipmi_control <node IP> [on|softoff|off]

slave_ip=$1

# Record every call (timestamp and arguments) to see whether the script ran at all.
echo "$(date) : $*" >> /tmp/test_ipmi_control

if [[ $slave_ip != 10.2.1.* ]] ; then
    echo "$slave_ip is not a valid node IP. Exiting ..."
    exit 1
else
    echo "$slave_ip seems to be a valid node IP. Continuing ..."
    # Nodes are on 10.2.1.N; their IPMI interfaces are on 10.2.16.N.
    ipmi_ip=$(echo "$slave_ip" | sed 's/^10\.2\.1\./10.2.16./')
    echo "IPMI node IP should be: $ipmi_ip"
    if [[ "$2" == "on" ]] ; then
        echo "Launching power on ..."
        ipmitool -H "$ipmi_ip" -U XXXXX -P XXXXX chassis power on
    elif [[ "$2" == "softoff" ]] ; then
        echo "Launching soft power off ..."
        ipmitool -H "$ipmi_ip" -U XXXXX -P XXXXX chassis power soft
    elif [[ "$2" == "off" ]] ; then
        echo "Launching power off ..."
        ipmitool -H "$ipmi_ip" -U XXXXX -P XXXXX chassis power off
    else
        echo "Invalid command. Usage is $0 IP [on|softoff|off]"
    fi
fi

[/code]

Note that 10.2.1.* are the nodes’ IPs, and 10.2.16.* are the corresponding IPMI IPs.
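For reference, the same node-to-IPMI mapping can be done without sed. A minimal sketch using bash parameter expansion, assuming the 10.2.1.N / 10.2.16.N layout described here; it also rejects addresses that only look similar (such as 10.2.11.5). `node_to_ipmi_ip` is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: translate a node IP (10.2.1.N) into its IPMI IP (10.2.16.N).
# The prefixes follow the layout described in this thread; adjust them for your farm.
NODE_PREFIX="10.2.1."
IPMI_PREFIX="10.2.16."

node_to_ipmi_ip() {
    local ip="$1"
    # Strip the node prefix; the quotes make the dots literal, not wildcards.
    local host="${ip#"$NODE_PREFIX"}"
    # Unchanged string means the prefix was absent; the rest must be a 1-3 digit octet.
    if [[ "$host" == "$ip" || ! "$host" =~ ^[0-9]{1,3}$ ]]; then
        echo "invalid node IP: $ip" >&2
        return 1
    fi
    echo "${IPMI_PREFIX}${host}"
}
```

For example, `node_to_ipmi_ip 10.2.1.42` prints `10.2.16.42`, while `node_to_ipmi_ip 10.2.11.5` prints an error to stderr and returns 1.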

The script runs fine when I call it manually.

I have put this in the “Run Command” of “Machine Startup”:

Any clue?

Cheers,
Pierre

Hi Pierre,

A few things to check here to hopefully get to the bottom of this for you.

  1. Just checking: is “/root/scripts/…” a path that the Pulse server can resolve, either locally or via a mount? It shouldn’t make a difference which, as long as it can resolve it.

  2. I wonder if the script needs a “.sh” at the end so that Pulse knows it’s a script to execute? (A bit silly in the Linux world if this is the case, but worth a try; if it is, I’ll get it fixed internally.)

  3. The Pulse log, or the “Machine Restart” field in the GUI, should print the stdout/stderr from the last time the “Run Command” was executed. Does it give us any clues as to why the command is failing for you?
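Checks 1 and 2 can be run by hand on the Pulse machine, ideally as the same user Pulse runs as. A minimal sketch, assuming bash; `check_run_command` is a hypothetical helper, and it only tests the conditions listed above (the file exists, is executable, and starts with a shebang line), not how Deadline itself launches the command:

```shell
#!/bin/bash
# Hypothetical pre-flight check for a Power Management "Run Command" script.
# Run it as the same user that Pulse runs as, with the full path from the Run Command.
check_run_command() {
    local script="$1"
    if [[ ! -e "$script" ]]; then
        echo "missing: $script"
        return 1
    fi
    if [[ ! -x "$script" ]]; then
        echo "not executable: $script (try: chmod +x \"$script\")"
        return 1
    fi
    # The first line should name an interpreter, e.g. #!/bin/bash
    local first_line
    IFS= read -r first_line < "$script"
    if [[ "$first_line" != '#!'* ]]; then
        echo "no shebang line: $script"
        return 1
    fi
    # Everything after the leading "#!" is the interpreter line.
    echo "ok: $script (interpreter: ${first_line:2})"
}
```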

We should probably update our docs to include a Linux-friendly version of the script in our example here:
docs.thinkboxsoftware.com/produc … ne-startup

I have an example (which I need to dig out) that also adds a right-click script to your Slaves panel. When you right-click one or more slaves, it brings up a UI that lets you select IPMI options, then sends them as a remote command to your Pulse server, which feeds the “args” to a Python script residing on the Pulse machine (just like your IPMI bash script); that script does the actual IPMI machine command execution. The reason for this extra hop is that, when I wrote these scripts for another studio, only the Pulse server had a NIC bridging the two VLANs (the data VLAN and the management/IPMI VLAN). So, to allow any client running Deadline Monitor (residing only on the data VLAN) to power up a machine via IPMI ad hoc, artists could use this script, which then executed on the secure Pulse server (on both the data and management VLANs). This was in addition to the normal “Machine Startup” --> “Run Command” IPMI startup script.

Anyway, the first thing is to get this “Run Command” working for you on your Pulse server. I used this for years in production, so I know it can work! :slight_smile:
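The Python relay script described above isn't included in the thread. As a rough illustration of the Pulse-side end of such a relay, here is a bash sketch that accepts an action plus one or more data-VLAN node IPs, allows only a fixed set of actions, and runs `ipmitool` against each node's management-VLAN address. Assumptions: the 10.2.1.N → 10.2.16.N mapping from this thread, `ipmitool` on the PATH, the user in `IPMI_USER` and the password in `IPMI_PASSWORD` (read by `ipmitool -E`), and a hypothetical `DRY_RUN` switch that prints the commands instead of running them. The function names are also hypothetical:

```shell
#!/bin/bash
# Hypothetical Pulse-side relay for IPMI power actions.
# Assumes nodes on 10.2.1.N with BMCs on 10.2.16.N, ipmitool on PATH,
# IPMI_USER set, the password in IPMI_PASSWORD (ipmitool -E), DRY_RUN=1 to print only.

# Map an allowed action name to its "ipmitool chassis power" argument.
ipmi_subcommand() {
    case "$1" in
        on)      echo "on" ;;
        softoff) echo "soft" ;;
        off)     echo "off" ;;
        reset)   echo "reset" ;;
        status)  echo "status" ;;
        *)       return 1 ;;    # anything else is rejected
    esac
}

# relay ACTION IP [IP ...]
relay() {
    local action="$1"; shift
    local sub ip bmc
    if ! sub=$(ipmi_subcommand "$action"); then
        echo "unknown action: $action" >&2
        return 2
    fi
    for ip in "$@"; do
        # Only accept data-VLAN node addresses; capture the last octet.
        if [[ ! "$ip" =~ ^10\.2\.1\.([0-9]{1,3})$ ]]; then
            echo "skipping non-node IP: $ip" >&2
            continue
        fi
        bmc="10.2.16.${BASH_REMATCH[1]}"
        local cmd=(ipmitool -H "$bmc" -U "${IPMI_USER:-}" -E chassis power "$sub")
        if [[ -n "${DRY_RUN:-}" ]]; then
            echo "${cmd[*]}"
        else
            "${cmd[@]}" || echo "ipmitool failed for $bmc" >&2
        fi
    done
}
```

Keeping the action list as a whitelist means arbitrary text arriving from a remote command is never passed through to `ipmitool`, and `-E` keeps the password off the command line.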

Hi Mike,

Thanks for your answer.

I managed to get the command executed by Pulse.

It seems that it is mandatory to add the .sh “extension” to make it work.
I’m almost certain this addition is the only thing I changed.

It would be nice to have the example you mention in your last paragraph :slight_smile:

Thank you!
Pierre

Glad you got it working! I’ll see if we can do something about getting that file ext requirement removed in a future version.

I will add a tutorial/improved IPMI docs to my ever-increasing to-do list and see if I can update my code example (from the Deadline v4-v5 days) to make it work in v6 onwards! I’ll probably upload it to our GitHub repo when it’s ready. I’ll also see about improving our built-in “Machine Startup” support. - github.com/ThinkboxSoftware/Deadline

Hi,

Damn, I thought I had already replied to this post; it seems I lost my answer :frowning:

Thanks for the link, definitely useful!

After a few more tests, it seems that only the wake-up works.

The shutdown does not call the script at all. In fact, only the slave is shut down.

I cannot find anything relevant in the logs (Pulse’s log or my script’s log).
Any clue?

Pierre

PS: here’s the latest version of the script:

[code]#!/bin/bash
# ipmi_control: power-control a render node through its IPMI interface.
# Usage: ipmi_control <node IP> [on|softoff|off|reset|status]

logfile="/var/log/Thinkbox/Deadline7/ipmi_control.log"

slave_ip=$1

# Record every call (timestamp and arguments) in the log.
echo "$(date) : $*" >> "$logfile"
# Empty LD_LIBRARY_PATH so ipmitool uses the default library search path
# instead of whatever the calling process passed down.
export LD_LIBRARY_PATH=""

if [[ $slave_ip != 10.2.1.* ]] ; then
    echo "$slave_ip is not a valid node IP. Exiting ..."
    exit 1
else
    echo "$slave_ip seems to be a valid node IP. Continuing ..."
    # Nodes are on 10.2.1.N; their IPMI interfaces are on 10.2.16.N.
    ipmi_ip=$(echo "$slave_ip" | sed 's/^10\.2\.1\./10.2.16./')
    cal_number=$(echo "$slave_ip" | sed 's/^10\.2\.1\.//')
    echo "IPMI node IP should be: $ipmi_ip"
    # (…) part of the script not shown in the post; $user and $password are presumably set here
    if [[ "$2" == "on" ]] ; then
        echo "Launching power on ..."
        ipmitool -H "$ipmi_ip" -U "$user" -P "$password" chassis power on >> "$logfile" 2>&1
    elif [[ "$2" == "softoff" ]] ; then
        echo "Launching soft power off ..."
        ipmitool -H "$ipmi_ip" -U "$user" -P "$password" chassis power soft >> "$logfile" 2>&1
    elif [[ "$2" == "off" ]] ; then
        echo "Launching power off ..."
        ipmitool -H "$ipmi_ip" -U "$user" -P "$password" chassis power off >> "$logfile" 2>&1
    elif [[ "$2" == "reset" ]] ; then
        echo "Launching power reset ..."
        ipmitool -H "$ipmi_ip" -U "$user" -P "$password" chassis power reset >> "$logfile" 2>&1
    elif [[ "$2" == "status" ]] ; then
        echo "Querying power status ..."
        ipmitool -H "$ipmi_ip" -U "$user" -P "$password" chassis power status >> "$logfile" 2>&1
    else
        echo "Invalid command. Usage is $0 IP [on|softoff|off|reset|status]"
    fi
fi

[/code]

Well, the first step for me would be to see whether Deadline is actually calling it. I hope that shows up in the Pulse logs (it’s been a while) when you have ‘verbose’ mode on (see ‘Application Logging’ in the Repository Options). If it’s not clear from there, just echo some output to a file to see whether the script is starting up correctly.

I’ve had people use this before, and we haven’t changed it in some time, so I expect things to still work.

If not, a verbose Pulse log is going to help us out here.
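To check whether Deadline runs the command at all, one option is to point the “Run Command” at a thin wrapper that records each invocation and then hands off to the real script. A minimal sketch, assuming bash on the Pulse machine; the wrapper, the `PM_WRAPPER_LOG` and `REAL_SCRIPT` variables, and their default paths are hypothetical placeholders (and, given the `.sh` requirement found earlier in this thread, the wrapper file should probably end in `.sh` too):

```shell
#!/bin/bash
# Hypothetical wrapper: point the "Run Command" at this file instead of the real script.
# Each call is logged (time, user, working directory, arguments) before handing off,
# so the log shows whether the command was invoked at all.
LOG="${PM_WRAPPER_LOG:-/tmp/pm_wrapper.log}"
REAL_SCRIPT="${REAL_SCRIPT:-/path/to/ipmi_control.sh}"

log_invocation() {
    printf '%s user=%s cwd=%s args=[%s]\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$(id -un)" "$PWD" "$*" >> "$LOG"
}

main() {
    log_invocation "$@"
    # Capture the real script's stdout and stderr in the same log.
    "$REAL_SCRIPT" "$@" >> "$LOG" 2>&1
    local rc=$?
    echo "exit code: $rc" >> "$LOG"
    return "$rc"
}

# Run only when executed directly, not when sourced.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
    main "$@"
fi
```

If the log stays empty after a Power Management event, the wrapper was never called, which narrows the problem to how (or where) Deadline sends the command.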

Here is the Pulse log.

Shutting down:

Corresponding mail received:

But the script is never called.

Waking up:

Corresponding mail received:

The script is called.

I’m not sure where to investigate now…

I can confirm this. Digging in.

Here’s my testing:

Create a simple script:

[code]Mobile-029:tmp edwin.amsler$ vi sayhello.sh
Mobile-029:tmp edwin.amsler$ chmod +x sayhello.sh
[/code]

Contents of the script:

[code]#!/bin/sh
echo "Hello" >> /tmp/message.txt
[/code]

Settings:

[Screenshot: Screen Shot 2015-07-20 at 1.31.23 PM.png]

Okay, so I’d forgotten that the command is sent to the Slave.

Since we have the Launcher running, we don’t need to send IPMI commands as long as the Launcher can be reached. Have you tried shutting down render nodes without the IPMI script?

Here’s what my Launcher said (I’m still investigating the connection problem, but it’s unrelated):

[code]2015-07-20 13:41:48:  Got reply: Mobile-029.local: Sent "OnLastTaskComplete ExecuteCommandIdle  :    :  /tmp/sayhello.sh " command. Result: "Connection Accepted.
2015-07-20 13:41:48:  "
2015-07-20 13:41:48:  Sending command to slave: OnLastTaskComplete ExecuteCommandIdle  :    :  /tmp/sayhello.sh 
2015-07-20 13:41:48:  Got reply: Mobile-029.local: Sent "" command. Result: "Connection refused"
2015-07-20 13:41:48:  Sending command to slave: OnLastTaskComplete ExecuteCommandIdle  :    :  /tmp/sayhello.sh 
2015-07-20 13:41:48:  Got reply: Mobile-029.local: Sent "" command. Result: "Connection refused"
2015-07-20 13:41:48:  Sending command to slave: OnLastTaskComplete ExecuteCommandIdle  :    :  /tmp/sayhello.sh 
2015-07-20 13:41:48:  Got reply: Mobile-029.local: Sent "" command. Result: "Connection refused"
2015-07-20 13:41:48:  Sending command to slave: OnLastTaskComplete ExecuteCommandIdle  :    :  /tmp/sayhello.sh 
2015-07-20 13:41:48:  Got reply: Mobile-029.local: Sent "" command. Result: "Connection refused"
[/code]

I’m also having issues with power management and Execute Command…

I had set up a batch file much like the one in the help file referenced above when using Deadline 6. It read the slave’s IP and sent the command to the corresponding BMC IP (100 higher on the subnet) to wake it up. The machine starts, logs in, renders, and Power Management shuts it down after 30 minutes of idle time. I was also able to start machines manually by right-clicking in Monitor, running Execute Command, and using the same batch file. So far so good.

This has been working fine for a while on my original set of nodes. We are now on Deadline 7.2, and we got some new nodes. These also use IPMI, but they are newer machines with newer firmware. For some reason, the command run through Deadline won’t wake these ones up. If I run the batch file locally on the Pulse machine, it works fine. If Power Management runs it through Pulse, only the original nodes turn on. If I right-click and run Execute Command, I can’t wake up any of the machines, not even the old ones; I just get this error:

A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 192.168.0.57:17070

Any ideas gratefully received.

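Command:

[code]C:\ipmiutil\BoxxDLPM.bat {SLAVE_IP}
[/code]

Batch file:

[code]@echo off
rem %1 is the slave IP passed in by Deadline ({SLAVE_IP}); %~1 strips any quotes.
set IP=%~1
rem Characters from offset 10 onward (up to 12), i.e. the last octet of 192.168.0.x.
echo %IP:~10,12%
rem BMC address: "1" prefixed to the last octet (+100 for two-digit octets).
set IPNew=192.168.0.1%IP:~10,12%
echo.%IPNew%
rem Power the machine up through its BMC (credentials masked).
c:\ipmiutil\ipmiutil reset -u -N %IPNew% -U XXXXXX -P XXXXXX
[/code]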
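The substring trick `%IP:~10,12%` only adds 100 when the last octet has exactly two digits: 192.168.0.5 would become 192.168.0.15, and 192.168.0.157 would become 192.168.0.1157. A sketch of the same "BMC = slave IP + 100" rule done arithmetically, written in bash to match the other scripts in this thread rather than as a drop-in batch replacement; `bmc_ip_for` is a hypothetical helper name:

```shell
#!/bin/bash
# Hypothetical helper: BMC address = slave address with 100 added to the last octet,
# the convention described above (e.g. 192.168.0.57 -> 192.168.0.157).
bmc_ip_for() {
    local ip="$1"
    local prefix="${ip%.*}"    # everything before the last dot, e.g. 192.168.0
    local last="${ip##*.}"     # the last octet, e.g. 57
    if [[ ! "$last" =~ ^[0-9]{1,3}$ || "$prefix" == "$ip" ]]; then
        echo "not an IPv4 address: $ip" >&2
        return 1
    fi
    # 10# forces base 10 so an octet written like "057" is not read as octal.
    local bmc=$((10#$last + 100))
    if (( bmc > 254 )); then
        echo "no room for a BMC address: $ip" >&2
        return 1
    fi
    echo "$prefix.$bmc"
}
```

For example, `bmc_ip_for 192.168.0.5` prints `192.168.0.105`, whereas the batch substring approach would produce 192.168.0.15.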

Are the newer machines ‘blade’ based, and therefore all essentially passing through the same backplane of the blade chassis? One thing I have noticed in the past is that if you send a signal too many times to multiple blades that all share the same chassis backplane, the BMC gets itself in a twist and you have to log in via the DRAC (or the Dell equivalent of the KVM) to ‘reset’ the controller. The fix is easy: introduce a 5-10 second delay between signals. This would explain why it works manually but not en masse.

FYI, new PM docs are also available as of Deadline v7.2: big improvements, plus examples of this time delay!
docs.thinkboxsoftware.com/produc … ement.html
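A minimal sketch of the staggered approach suggested above, in bash, assuming `ipmitool` is available and the BMC addresses are already known. `power_on_staggered` and the `DRY_RUN` switch are hypothetical; the user is taken from `IPMI_USER` and the password from `IPMI_PASSWORD` via `ipmitool -E`:

```shell
#!/bin/bash
# Hypothetical sketch: power on several machines through their BMCs one at a time,
# pausing between requests so a shared controller is not flooded.
# Assumes ipmitool on PATH, IPMI_USER set, the password in IPMI_PASSWORD
# (read by ipmitool -E); DRY_RUN=1 prints the commands instead of running them.

# power_on_staggered DELAY_SECONDS BMC_IP [BMC_IP ...]
power_on_staggered() {
    local delay="$1"; shift
    local count=0 bmc
    for bmc in "$@"; do
        # Pause before every request except the first.
        if (( count > 0 )); then
            sleep "$delay"
        fi
        count=$((count + 1))
        if [[ -n "${DRY_RUN:-}" ]]; then
            echo "ipmitool -H $bmc -U ${IPMI_USER:-} -E chassis power on"
        else
            ipmitool -H "$bmc" -U "${IPMI_USER:-}" -E chassis power on \
                || echo "power on failed for $bmc" >&2
        fi
    done
}
```

For example, `power_on_staggered 10 10.2.16.5 10.2.16.6 10.2.16.7` waits 10 seconds between machines, the upper end of the 5-10 second range mentioned above.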

Thanks for the response; I will look through the new help file in detail.

The machines are the same type as the previous ones, Boxx RenderPro, which use Supermicro motherboards. They aren’t blades; they are separate machines.

I can look at introducing a delay, but if that were the problem, you would think Execute Command would still work, as it runs one machine at a time.

I will take a look at the new Supermicro-specific software and see if it does anything differently from ipmiutil.

Any clue whether something changed between DL6 and 7.2 that would cause Execute Command to stop working? As far as I know, nothing has changed on the machines.

OK, so you can scrap that shot-in-the-dark guess, as you don’t have blades.

We should first check that ExecuteCommand is working fine on one of these ‘new’ machines when it is up and running ok:
docs.thinkboxsoftware.com/produc … -balancers

You could send a simple command, such as the one given in the docs link above, just to confirm that all is well and that there’s no firewall or application block that needs adding to your firewall exception list. We provide details here if you need to go through a checklist to make sure all is well:
docs.thinkboxsoftware.com/produc … iderations

I assume that when Power Management runs through Pulse, it executes the same script as when you run it manually on the Pulse machine? (Just checking.)

The original error message you posted would indicate that something like a firewall block is causing the network connection timeout. Could you turn off the firewalls to verify?

Although you can execute the script manually and it starts the new machines, I wonder if this is just a bit of luck, and either you need a newer version of ipmiutil that supports these newer machines, or something has indeed changed on the Supermicro nodes and you need to consider using the BOXX IPMI software instead of a generic one. Gotta love ‘industry standards’…
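Before turning firewalls off, one quick check is whether anything answers on the port from that error message (17070) at all. A minimal sketch, assuming a machine on the same network with bash and coreutils `timeout` available; `port_open` is a hypothetical helper. Run it from the machine that produced the error, while the target machine is up. A refusal or timeout points at a firewall or at nothing listening on that port, while a success only shows that the TCP connection itself can be established:

```shell
#!/bin/bash
# Hypothetical check: can this machine open a TCP connection to HOST:PORT?
# Uses bash's /dev/tcp redirection, capped by coreutils `timeout`.
# port_open HOST PORT [TIMEOUT_SECONDS]
port_open() {
    local host="$1" port="$2" secs="${3:-3}"
    # The child bash opens fd 3 to the target and exits 0 only if the connection succeeds.
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example: `port_open 192.168.0.57 17070 && echo reachable || echo "refused or timed out"`.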

OK, so I’ve done a bit more testing and read through the new documentation.

I downloaded the Supermicro IPMI software and have tested with it. It works fine for powering machines on and off manually, and I have incorporated it into the batch file. Other than syntax, it appears to behave exactly the same as IPMIUtil. When a Power Management check occurs with the new batch file, the same machines turn on, or fail to turn on, as with the old method.

The Execute Command test (writing a text file while a machine is on) works fine.
I can also use Execute Command to run an IPMI command that turns a different machine on or off. (For example, right-click slave 3 and execute a command that turns off slave 2 by entering its IP manually.)
However, executing the command on a slave to turn itself off doesn’t work: the command executes and I can see a cmd prompt appear on the machine’s desktop, but the IPMI command fails to find its IP address.
Also, executing a command on a slave that is currently off to turn itself on doesn’t work. I’m guessing this is because, instead of executing the command on the Pulse machine, Deadline tries to execute it on the slave itself, which is off and so can’t run it. Is this different behaviour from previous versions of Deadline, where this worked?

I would note that I am using the shared IPMI interface, not a dedicated one, so I only have one network cable going to each machine. This has seemed to work fine so far.

Yes, Power Management on Pulse uses the same script as when I run it manually.

I don’t have any firewalls running internally that would cause this not to connect, as far as I am aware. All other functions of the render farm work fine.

Sorry for the delay [back from vacation].
Hmm… I think it might be best if you open a support ticket so one of our support team can remote in and work with you to resolve this.
support.thinkboxsoftware.com/

Hello,

Coming back to this issue after several months and our recent 7.2 upgrade.

I finally understood that the shutdown command is sent to the slave and not to Pulse!

Is there any way to change this and send the shutdown command via Pulse?

For several reasons, we cannot use the shutdown option or execute a script on the nodes.
Sometimes the shutdown does not work because of the node’s system state: over-swapping, kernel panic, hardware issues…
IPMI is the best way to ensure the power state of the nodes.

Thanks,
Pierre

Hi Pierre,
We already have this logged as a feature request/wishlist item for all PM commands (IPMI, PM script, etc.) to be redirected and executed via the machine running the Primary Pulse. Hopefully, we will get to this feature later this year.
In the meantime, could you use a right-click “slave” script in your Slaves panel that sends a command to your primary Pulse machine, which then sends out the IPMI/script command? Would that work for now? If so, you should be able to use our API to achieve this already in 7.2. LMK if you want me to try and put something together for you.

Hello Mike,

Is it in the v8 beta already?

Yes, having the possibility to add a new script to the slave menu could be very useful.
I’m not sure how to do that; any help would be appreciated :stuck_out_tongue:

Cheers,
Pierre
