Delivering reliability with a UPS

Motivation

After four or five power failures over the last year, one of which corrupted a development repository, nuked my Git Extensions configuration, and interrupted external services, I decided I needed a UPS. A friend recommended the BX500CI, but an extended period of it being out of stock on Amazon gave me time to realise that it wasn’t really right for my needs. Since I’m not always at home, it wouldn’t help much in long outages: if I wasn’t around to shut things down, the equipment would merely lose power a little later.

A search turned up the more expensive, but USB-connected, BX950UI, which is fully supported on Windows, GNU/Linux, and macOS by the excellent APC UPS Daemon (apcupsd), so I went for it.

Initial setup

When it arrived, I charged it and spent a couple of hours on the physical hookup, which involved replacing the plugs on a couple of multiway power strips with IEC C14 plugs to fit the UPS’s C13 outlets. This article and the apcupsd manpage helped me get the basics going in another couple of hours, after which I ran a battery calibration.

My theory is that this step doesn’t directly use power-consumption information, but records the drain curve of the battery with respect to total energy delivered to loads, allowing the UPS to correctly estimate remaining runtime for any load. I suspect that it’s worth recalibrating every 3 months or so, both to keep this profile updated and to give the battery itself enough exercise to avoid degrading too fast; I gather lead-acid batteries don’t like being left charged for too many months on end.

Fine-tuning

At first I edited the /etc/apcupsd/apccontrol script directly, but this file needs to be replaceable on package upgrade, so the correct approach is to edit the per-event scripts instead, e.g. /etc/apcupsd/doshutdown. The next gotcha was that, not having read the above article properly, I missed the important step of setting ISCONFIGURED=yes in /etc/default/apcupsd; without it, apcaccess would work, but the daemon itself would not. The smoking gun was that neither /var/log/apcupsd.events nor /etc/apcupsd/powerfail was being created during testing.
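As I understand apccontrol, it only runs an event script such as /etc/apcupsd/onbattery if the file exists and is executable, and an exit status of 99 from that script tells it to skip its own default action for the event. The sketch below is a quick check for both of the gotchas above; the function names are hypothetical, and the default paths are the Ubuntu ones from the text.

```shell
#!/bin/sh
# Quick checks for two apcupsd setup gotchas. Function names are hypothetical;
# default paths are the Ubuntu locations discussed in the text.

# Succeeds only if the defaults file sets ISCONFIGURED=yes; without it the
# daemon itself does not run.
apcupsd_is_configured() {
    grep -q '^ISCONFIGURED=yes' "${1:-/etc/default/apcupsd}"
}

# Prints each event script that exists but is not executable, since
# apccontrol skips such files and falls back to its default action.
nonexecutable_event_scripts() {
    dir=${1:-/etc/apcupsd}
    for event in onbattery offbattery doshutdown commfailure commok; do
        script="$dir/$event"
        if [ -f "$script" ] && [ ! -x "$script" ]; then
            echo "$script"
        fi
    done
}
```

Running `apcupsd_is_configured || echo 'set ISCONFIGURED=yes'` and `nonexecutable_event_scripts` on the server would have flagged both problems before any power-pull testing.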

As well as the Ubuntu server (an Intel NUC), I have a Windows 10 desktop attached to the UPS. Since the latter sucks enough juice to reduce battery runtime to about ten minutes, it was crucial to make it shut down quickly on power loss in order to maximise the server’s uptime. To this end I added the following command to the /etc/apcupsd/onbattery script, just before its exit 0 line so that any failure (e.g. because the machine is already off) is concealed. It addresses the machine by IP address to sidestep any transient name-resolution problems:
net rpc shutdown -t 45 -f -C "ups.rdg shutdown" -I $IP_ADDRESS -U$USERNAME%$PASSWORD

This gives 45 seconds’ warning, just enough to run shutdown /a if there’s a pressing need to use the machine. Testing revealed a couple of gotchas. The first was that remote shutdown apparently (see below) requires the Remote Registry service to be running; starting it merely got me from one error message to another.
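The same command can be wrapped in a small function inside onbattery, with the address and credentials quoted so that a password containing spaces or shell metacharacters survives intact. This is a sketch only: WIN_IP, WIN_USER, WIN_PASS and shutdown_windows are hypothetical names, and the values are placeholders.

```shell
#!/bin/sh
# Hypothetical placeholders: substitute the desktop's address and an
# administrator account on it.
WIN_IP=192.0.2.10
WIN_USER=upsadmin
WIN_PASS=changeme

# Ask the Windows machine to shut down: -t 45 gives 45 seconds' warning,
# -f forces applications to close, -C sets the message shown to the user,
# and -I targets the IP address directly, bypassing name resolution.
shutdown_windows() {
    net rpc shutdown -t 45 -f -C "ups.rdg shutdown" \
        -I "$WIN_IP" -U "$WIN_USER%$WIN_PASS"
}
```

In onbattery, a call to shutdown_windows would sit just before the existing exit 0, so a failure still leaves the script reporting success.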

The second was UAC’s remote restrictions, which deny local accounts full administrative rights over network connections; I fixed that by using regedit to create the DWORD value LocalAccountTokenFilterPolicy=1 inside the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System key.

I later discovered that the Remote Registry service was set to Automatic but wasn’t actually running, so it wasn’t needed after all. I left it on Automatic anyway and started it, because of a nasty impasse one can trip over when cloning or migrating Windows builds, where logon is blocked by a faulty MountedDevices configuration.

Another detail worth noting: I don’t see the point of blocking remote logins on a GNU/Linux system that’s in the process of shutting down. It seems like a nannying feature that saves admins from having their sessions killed by the shutdown (not much of a benefit), and it takes away options in a time-critical situation. I therefore set this line in /etc/apcupsd/apcupsd.conf:
NOLOGON disable

Final configuration

Here’s how it looks, along with my comments.

/etc/apcupsd/.gitignore

Since I run a Git repository in the Ubuntu server’s /etc folder, the following gets rid of diff noise:
apctest.output
powerfail

/etc/apcupsd/apccontrol

Since I don’t check root’s mailbox on the server, I changed the export SYSADMIN line to refer to my external e-mail address.

/etc/apcupsd/apcupsd.conf

Here I altered the following:

UPSNAME $NAME
UPSCABLE usb
UPSTYPE usb
DEVICE
POLLTIME 15
ONBATTERYDELAY 30
MINUTES 2
#BATTERYLEVEL 5

POLLTIME is the maximum time (in seconds) that apcupsd will take to notice a UPS event. I believe ONBATTERYDELAY is the time in seconds between one of apcupsd’s polls noticing power loss and the onbattery event being triggered; this is how I prevent very short outages from shutting down the Windows machine. MINUTES sets the remaining runtime at which a shutdown is triggered: once the UPS’s estimate of runtime left falls to that many minutes or fewer, apcupsd shuts the system down. Two minutes is generous for my server, which usually takes 10-20 seconds to shut down.

Using MINUTES made BATTERYLEVEL unnecessary, so I commented it out.
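To make the timing arithmetic concrete, here is a rough model in shell. The latency function assumes, as described above, that an outage is only noticed at the next poll and that ONBATTERYDELAY starts counting from then. The second function mirrors the MINUTES rule by reading apcaccess status output, assuming its lines look like TIMELEFT : 9.5 Minutes. Both function names are hypothetical.

```shell
#!/bin/sh
# Rough upper bound (seconds) from the start of an outage to the onbattery
# event: up to one POLLTIME to notice it, then ONBATTERYDELAY.
onbattery_latency_bound() {
    echo $(( $1 + $2 ))
}

# Reads "apcaccess status"-style text on stdin and prints "yes" if TIMELEFT
# is at or below the given MINUTES threshold, "no" if it is above it, and
# nothing if no TIMELEFT line is present.
timeleft_at_or_below() {
    awk -v limit="$1" '
        /^TIMELEFT/ {
            sub(/^TIMELEFT *: */, "")
            print (($1 + 0 <= limit + 0) ? "yes" : "no")
            exit
        }
    '
}
```

With the settings above, `onbattery_latency_bound 15 30` prints 45, so by this model the Windows machine receives its shutdown request at most about 45 seconds into an outage, then gets its own 45-second warning. On the live system, `apcaccess status | timeleft_at_or_below 2` shows whether the MINUTES 2 threshold has been reached.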

/etc/apcupsd/onbattery

Here I added the remote-shutdown command for the Windows machine (see above).

/etc/apcupsd/doshutdown

I replicated the Windows remote-shutdown command here (again before the exit 0 line, to conceal failure) so that, even if I choose to keep the machine on during a power loss via shutdown /a, the shutdown is retried when the battery nears exhaustion.
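One refinement worth considering, as I understand apccontrol: it runs doshutdown before carrying out the server’s own shutdown, so if the desktop is unreachable and the connection attempt is slow to fail, the server’s shutdown is held up while the battery is already low. A minimal sketch that caps the wait with coreutils timeout follows; the placeholder values, the 30-second cap and the function name are all hypothetical.

```shell
#!/bin/sh
# Hypothetical placeholders for the desktop and an administrator account on it.
WIN_IP=192.0.2.10
WIN_USER=upsadmin
WIN_PASS=changeme
# Hypothetical upper bound, in seconds, on how long net may take.
NET_TIMEOUT=30

# Retry the Windows shutdown request, but let coreutils timeout kill net if
# it has not finished within NET_TIMEOUT seconds. Any failure is ignored,
# matching the exit 0 approach described in the text.
retry_windows_shutdown() {
    timeout "$NET_TIMEOUT" net rpc shutdown -t 45 -f -C "ups.rdg shutdown" \
        -I "$WIN_IP" -U "$WIN_USER%$WIN_PASS" || true
}
```

The cap should stay well inside the MINUTES margin, so that the server still has time to shut itself down afterwards.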

/etc/default/apcupsd

The daemon itself won’t run without this line:

ISCONFIGURED=yes

Flaws and future work (or Things I Didn’t Have Time For)

One thing I must address is what I consider the biggest (only?) flaw in apcupsd: it offers no configurable handling of communications failures, which might sabotage the whole setup by hiding a power loss. In theory, this could be worked around with a bunch of custom scripting, but I think the real solution would lie in apcupsd itself:

  • introduce a COMMFAILUREDELAY setting that allows comms to recover within some timespan without generating an event;
  • change the comms-failure behaviour to (optionally?) trigger a shutdown.
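apcupsd does already fire commfailure and commok events, which apccontrol passes to /etc/apcupsd/commfailure and /etc/apcupsd/commok if those scripts exist, so the custom-scripting workaround could start there. The sketch below emulates the proposed COMMFAILUREDELAY: commfailure records a token and waits, commok removes it, and the action only fires if the token is still in place when the wait ends. Every name in it (the flag path, the variables, the functions) is hypothetical, and the action is a placeholder message where a cautious setup might trigger a shutdown.

```shell
#!/bin/sh
# Hypothetical settings: where the pending-failure token lives and how long
# comms are allowed to recover before acting.
COMMFAIL_FLAG=/run/apcupsd-commfailure.flag
COMMFAIL_DELAY=120

# Placeholder action for a comms failure that outlasts the delay; a cautious
# setup might shut down here instead of just reporting.
on_persistent_commfailure() {
    echo "UPS comms still down after ${COMMFAIL_DELAY}s"
}

# For commok: comms have recovered, so cancel any pending action.
mark_comms_ok() {
    rm -f "$COMMFAIL_FLAG"
}

# For commfailure: record a token unique to this script run ($$), wait, and
# act only if neither commok nor a later commfailure has removed or replaced it.
handle_commfailure() {
    token=$$
    echo "$token" > "$COMMFAIL_FLAG"
    sleep "$COMMFAIL_DELAY"
    if [ -f "$COMMFAIL_FLAG" ] &&
       [ "$(cat "$COMMFAIL_FLAG" 2>/dev/null)" = "$token" ]; then
        on_persistent_commfailure
    fi
}
```

Both scripts would share these definitions. commfailure would run `handle_commfailure &` before its exit 0, so the wait doesn’t hold up apccontrol, and commok would simply call mark_comms_ok.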

The other major improvement to my setup would be to hibernate the machines in question instead of shutting them down. I looked into this for a while, but there are two problems:

  • GNU/Linux requires swap space (a swap partition, or a swap file with some extra setup) in order to hibernate;
  • The Samba net rpc shutdown command provides only shutdown and restart, not sleep or hibernate.

I’ll probably deal with the former during an already-planned server rebuild. The latter could be fixed by remotely invoking the native Windows SHUTDOWN command instead of the Samba one, but since Windows 10 no longer has a built-in Telnet server, this would require extra software and time.
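For the Linux side, a doshutdown script could choose between hibernating and powering off depending on whether swap is active. This is only a sketch: active swap is necessary but not sufficient (the hibernation image has to fit, and the kernel has to be told where to resume from), and swap_active and choose_power_action are hypothetical names. It reads the swap table, normally /proc/swaps, which lists one device per line after a header.

```shell
#!/bin/sh
# Succeeds if the swap table (normally /proc/swaps) lists at least one
# device, i.e. has any line after its header.
swap_active() {
    awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}

# Prints the systemctl verb to use: "hibernate" when swap is active,
# "poweroff" otherwise.
choose_power_action() {
    if swap_active "$1"; then
        echo hibernate
    else
        echo poweroff
    fi
}
```

On the server, `systemctl "$(choose_power_action)"` would then hibernate or power off accordingly. As I understand apccontrol, doshutdown would also have to exit with status 99 to stop apccontrol from running its own shutdown afterwards.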

Conclusion

Thanks to this modest amount of work and some testing (sometimes using the TIMEOUT setting in /etc/apcupsd/apcupsd.conf to trigger the shutdown after a set time on battery, rather than running the battery all the way down), I no longer have to worry about power loss in the middle of an intense work session, fiddling around with manual Git-repository repairs, or touring the local construction sites in a blame-seeking rage. Yay!
