High Availability Troubleshooting¶
High availability configurations can be complex, and with so many different ways to configure a failover cluster, it can be tricky to get things working properly. In this section, some common (and not so common) problems will be discussed and hopefully solved for the majority of cases. If issues are still present after consulting this section, there is a dedicated HA/CARP/VIPs board on the Netgate Forum.
Before proceeding, take the time to check all members of the HA cluster to ensure that they have consistent configurations. Often, it helps to walk through the example setup, double checking all of the proper settings. Repeat the process on the secondary node, and watch for any places where the configuration must be different on the secondary. Be sure to check the CARP status (Check CARP status) and ensure CARP is enabled on all cluster members.
Errors relating to HA will be logged in Status > System Logs, on the System tab. Check those logs on each system involved to see if there are any messages relating to XMLRPC sync, CARP state transitions, or other related errors.
Three common misconfigurations prevent HA from working properly.
Use a different VHID on each CARP VIP¶
A different VHID must be used on each CARP VIP created on a given interface or broadcast domain. With a single HA pair, input validation will prevent duplicate VHIDs. Unfortunately, it is not always that simple. CARP uses multicast, so anything using CARP on the same network segment must use a unique VHID. VRRP uses a protocol similar to CARP, so also ensure there are no conflicts with VRRP virtual router IDs, such as when the ISP or another router on the local network is using VRRP.
The best way around this is to use a unique set of VHIDs. If a known-safe private network is in use, start numbering at 1. On a network where VRRP or CARP are conflicting, consult with the administrator of that network to find a free block of VHIDs.
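One way to survey a segment for IDs already in use is to capture advertisement traffic and list the IDs seen. The following is a minimal sketch rather than an official tool: `list_vhids` is a hypothetical helper, the interface name `em0` is an assumption, and the patterns it matches (`vhid=N` and `vrid N`) are assumed to reflect how tcpdump prints CARP and VRRP advertisements, which may differ between tcpdump versions.

```shell
# list_vhids: hypothetical helper that reads tcpdump text output on stdin
# and prints each distinct VHID/VRID number once, in numeric order.
# Assumption: advertisements appear as "vhid=N" (CARP) or "vrid N" (VRRP).
list_vhids() {
    grep -Eo '(vhid=|vrid )[0-9]+' | grep -Eo '[0-9]+' | sort -n | uniq
}

# Example invocation on the firewall (not run here; em0 is an assumed
# interface name). CARP and VRRP both use IP protocol 112:
#   tcpdump -n -c 50 -i em0 ip proto 112 | list_vhids
```

Any ID printed by the helper that does not belong to the local cluster is a candidate for a conflict and should be avoided when choosing VHIDs.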
Check that all systems involved are properly synchronizing their clocks and have valid time zones, especially if running in a Virtual Machine. If the clocks are too far apart, some synchronization tasks like DHCP failover will not work properly.
Incorrect Subnet Mask¶
The real subnet mask must be used for a CARP VIP, not /32. This must match the subnet mask for the IP address on the interface to which the CARP IP is assigned.
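As a quick sanity check, the prefix length of the CARP VIP can be compared against the prefix length of the interface address. The following is a minimal sketch: `check_vip_mask` is a hypothetical helper, and the addresses in the usage note are examples only.

```shell
# check_vip_mask: hypothetical helper comparing the prefix lengths of two
# addresses given in CIDR notation (interface address first, then the VIP).
check_vip_mask() {
    iface_prefix=${1##*/}   # text after the last "/" in the interface address
    vip_prefix=${2##*/}     # text after the last "/" in the CARP VIP
    if [ "$vip_prefix" = "$iface_prefix" ]; then
        echo "ok"
    else
        echo "mismatch: VIP is /$vip_prefix, interface is /$iface_prefix"
    fi
}
```

For example, `check_vip_mask 192.168.1.2/24 192.168.1.1/32` reports a mismatch, while the same VIP entered as `192.168.1.1/24` passes.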
IP Address for CARP Interface¶
The interface upon which the CARP VIP resides must already have another IP defined directly on the interface (VLAN, LAN, WAN, OPT) before it can be utilized.
Incorrect Hash Error¶
There are a few reasons why this error turns up in the system logs, some more worrisome than others.
If CARP is not working properly when this error is present, it could be due to a configuration mismatch. Ensure that the VHID, password, and IP address/subnet mask for a given VIP all match on every node.
If the settings appear to be proper and CARP still does not work while generating this error message, then there may be multiple CARP instances on the same broadcast domain. Disable CARP and monitor the network with tcpdump (Packet Capturing) to check for other CARP or CARP-like traffic, and adjust VHIDs appropriately.
If CARP is working properly, and this message is in the logs when the system boots up, it may be disregarded. It is normal for this message to be seen when booting, as long as CARP continues to function properly (primary shows MASTER, secondary shows BACKUP for status).
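To confirm the state of each VIP from the shell, the CARP lines in the `ifconfig` output can be summarized on each node. This is a minimal sketch: `carp_summary` is a hypothetical helper, and it assumes `ifconfig` prints lines of the form `carp: MASTER vhid 1 advbase 1 advskew 0`, which may vary between versions.

```shell
# carp_summary: hypothetical helper that reads ifconfig output on stdin
# and prints one "vhid N: STATE" line per CARP entry found.
# Assumption: CARP entries look like "carp: STATE vhid N advbase ...".
carp_summary() {
    awk '$1 == "carp:" && $3 == "vhid" { print "vhid " $4 ": " $2 }'
}

# Example invocation on each node (not run here):
#   ifconfig | carp_summary
```

On a healthy pair, the primary would be expected to list every VHID as MASTER and the secondary to list the same VHIDs as BACKUP.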
Both Systems Appear as MASTER¶
This will happen if the secondary cannot see the CARP advertisements from the primary. Check firewall rules, connectivity, and switch configurations. Also check the system logs for any relevant errors that may lead to a solution. If this is encountered in a Virtual Machine (VM) product such as ESX, see Issues inside of Virtual Machines (ESX).
Primary system is stuck as BACKUP¶
In some cases, this may happen normally for a short period after a system comes back online. However, certain hardware failures or other error conditions can cause a server to silently take on a high advskew of 240 in order to signal that it still has a problem and should not become master. This can be checked from the GUI, or via the shell or Diagnostics > Command.
In the GUI, this condition is printed in an error message on Status > CARP.
From the shell or Diagnostics > Command, run the following command to check for a demotion:
# sysctl net.inet.carp.demotion
net.inet.carp.demotion: 240
If the value is greater than 0, the node has demoted itself.
In that case, isolate the firewall, check its network connections, and perform further hardware testing.
If the demotion value is 0 and the primary node still appears to be demoting itself to BACKUP or is flapping, check the network to ensure there are no layer 2 loops. If the firewall receives back its own heartbeats from the switch, it can also trigger a change to BACKUP status.
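The demotion check can be wrapped in a small helper for use in scripts. This is a minimal sketch: `carp_demotion_status` is a hypothetical name, and it simply applies the rule above (a value greater than 0 means the node has demoted itself) to a number passed as its argument.

```shell
# carp_demotion_status: hypothetical helper that interprets a demotion
# value (as reported by net.inet.carp.demotion) passed as the first argument.
carp_demotion_status() {
    if [ "$1" -gt 0 ]; then
        echo "demoted (demotion=$1)"
    else
        echo "not demoted"
    fi
}

# Example invocation on the firewall (not run here); "sysctl -n" prints
# only the value, without the variable name:
#   carp_demotion_status "$(sysctl -n net.inet.carp.demotion)"
```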
Issues inside of Virtual Machines (ESX)¶
When using HA inside of a Virtual Machine, especially VMware ESX, some special configurations are needed:
Enable promiscuous mode on the vSwitch.
Enable “MAC Address changes”.
Enable “Forged transmits”.
ESX VDS Promiscuous Mode Workaround¶
If a Virtual Distributed Switch is in use, a port group can be made for the firewall interfaces with promiscuous mode enabled, and a separate non-promiscuous port group for other hosts. This has been reported to work by users on the forum as a way to strike a balance between the requirements for letting CARP function and for securing client ports.
ESX VDS Upgrade Issue¶
If a Virtual Distributed Switch (VDS) was used in 4.0 or 4.1 and then upgraded from 4.0 to 4.1 or 5.0, the VDS will not properly pass CARP traffic. A new VDS created on 4.1 or 5.0 will work, but the upgraded VDS will not.
It is reported that disabling promiscuous mode on the VDS and then re-enabling it will resolve the issue.
ESX VDS Port Mirroring Issue¶
If port mirroring is enabled on a VDS it will break promiscuous mode. To fix it, disable and then re-enable promiscuous mode.
ESX Client Port Issues¶
If a physical HA cluster is connected to a switch with an ESX host using multiple ports on the ESX host (lagg group or similar), and only certain devices/IPs are reachable by the target VM, then the port group settings may need adjusting in ESX to set the load balancing for the group to hash based on IP, not the originating interface.
Side effects of having that setting incorrect include:
Traffic only reaching the target VM in promiscuous mode on its NIC.
Inability to reach the CARP VIP from the target VM when the “real” IP address of the primary firewall can be reached.
Port forwards or other inbound connections to the target VM work from some IP addresses and not others.
ESX Physical NIC Failure Fails to Trigger Failover¶
Self-demotion in CARP relies on the loss of link on a switch port. As such, if a primary and secondary firewall instance are on separate ESX units and the primary unit loses a switch port link and does not expose that to the VM, CARP will stay MASTER on all of its VIPs there and the secondary will also believe it should be MASTER. One way around this is to script an event in ESX that will take down the switch port on the VM if the physical port loses link. There may be other ways around this in ESX as well.
Use e1000 NICs (em(4)), not ed(4) NICs, or CARP VIPs will never leave the init state.
Setting “Promiscuous mode: Allow All” on the relevant interfaces of the VM allows CARP to function on any interface type (Bridged, Host-Only, Internal).
Other Switch and Layer 2 Issues¶
If the units are plugged into separate switches, ensure that the switches are properly trunking and passing broadcast/multicast traffic.
Some switches have broadcast/multicast filtering, limiting, or “storm control” features that can break CARP.
Some switches have broken firmware that can cause features like IGMP Snooping to interfere with CARP.
If a switch on the back of a modem/CPE is in use, try a real switch instead. These built-in switches often do not properly handle CARP traffic. Plugging the firewalls into a proper switch and then uplinking to the CPE will often eliminate problems.
Configuration Synchronization Problems¶
Double check the following items when problems with configuration synchronization are encountered:
The username must be admin on all nodes.
The password in the configuration synchronization settings on the primary must match the password on the backup.
The WebGUI must be on the same port on all nodes.
The WebGUI must be using the same protocol (HTTP or HTTPS) on all nodes.
Traffic must be permitted to the WebGUI port on the interface which handles the synchronization traffic.
The pfsync interface must be enabled and configured on all nodes.
Verify that only the primary sync node has the configuration synchronization options enabled.
Ensure no IP address is specified in the Synchronize Config to IP on the secondary node.
Ensure the clocks on both nodes are current and are reasonably accurate.
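The clock check in the last item can be approximated by comparing epoch timestamps taken on each node. This is a minimal sketch: `clock_skew` is a hypothetical helper, and the host name `secondary` in the usage comment is an assumption. How much skew is tolerable depends on the services in use and is not specified here.

```shell
# clock_skew: hypothetical helper that prints the absolute difference,
# in seconds, between two epoch timestamps given as arguments.
clock_skew() {
    diff=$(( $1 - $2 ))
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}

# Example invocation from the primary (not run here; "secondary" is an
# assumed host name for the other node):
#   clock_skew "$(date +%s)" "$(ssh admin@secondary date +%s)"
```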