Troubleshooting High Availability¶
High availability configurations can be complex, and with so many different ways to configure a failover cluster, it can be tricky to get things working properly. In this section, some common (and not so common) problems will be discussed and hopefully solved for the majority of cases. If issues are still present after consulting this section, there is a dedicated HA/CARP/VIPs board on the Netgate Forum.
Before proceeding, take the time to check all members of the HA cluster to ensure that they have consistent configurations. Often, it helps to walk through the example setup, double checking all of the proper settings. Repeat the process on the secondary node, and watch for any places where the configuration must be different on the secondary. Be sure to check the CARP status (Check CARP status) and ensure CARP is enabled on all cluster members.
Errors relating to HA will be logged in Status > System Logs, on the System tab. Check those logs on each system involved to see if there are any messages relating to XMLRPC sync, CARP state transitions, or other related errors.
The issues on this page are for HA in general. For issues specific to using HA in virtual environments, see Troubleshooting High Availability Clusters in Virtual Environments
There are several common misconfigurations that happen which prevent HA from working properly.
Incorrect Interface Order¶
As mentioned on pfSense XML-RPC Config Sync Overview, the interface assignment order and internal identifiers must match identically on both nodes.
If the interface order does not match, the configuration synchronziation process will copy rules and other settings such as DHCP failover to the wrong interfaces on the secondary node.
Use a different VHID on each CARP VIP¶
A different VHID must be used on each CARP VIP created on a given interface or broadcast domain. The VHID determines the virtual MAC address used by that CARP IP address, this different clusters attempting to use the same VHID on the same L2 segment cause a MAC address conflict.
With a single HA pair, input validation will prevent duplicate VHIDs. Unfortunately it isn’t always that simple. CARP is a multicast technology, and as such anything using CARP on the same network segment must use a unique VHID. VRRP also uses a similar protocol as CARP, so ensure there are no conflicts with VRRP VHIDs, such as if the ISP or another router on the local network is using VRRP.
The best way around this is to use a unique set of VHIDs. If a known-safe private network is in use, start numbering at 1. On a network where VRRP or CARP are conflicting, consult with the administrator of that network to find a free block of VHIDs.
Incorrect CARP VIP Settings¶
Inspect the settings for CARP VIPs (Firewall > Virtual IPs) to ensure they are correct and consistent on both nodes.
The Advertising Frequency values must be appropriate for each VIP and node:
Values should be the same on both nodes. In some situations where the secondary node is on a slow or non-local link, users have increased this value on only the secondary, but that can lead to problems with each node assuming their expected roles at the proper times.
Values must be different on the primary and secondary nodes. The primary is typically
0, and the secondary is typically
Check that all systems involved are properly synchronizing their clocks and have valid time zones, especially if running in a Virtual Machine. If the clocks are too far apart, some synchronization tasks like DHCP failover will not work properly.
Incorrect Subnet Mask¶
The real subnet mask must be used for a CARP VIP, not /32. This must match the subnet mask for the IP address on the interface to which the CARP IP is assigned.
Both Nodes in Maintenance Mode¶
If both nodes have activated Persistent CARP Maintenance Mode at Status >
CARP (failover), they each will advertise a skew of
254 and the actual
status will be unpredictable. Ensure only one node is in maintenance mode at a
Incorrect Hash Error¶
There are a few reasons why this error turns up in the system logs, some more worrisome than others.
If CARP is not working properly when this error is present, it could be due to a configuration mismatch. Ensure that for a given VIP, that the VHID, password, and IP address/subnet mask all match.
If the settings appear to be proper and CARP still does not work while generating this error message, then there may be multiple CARP instances on the same broadcast domain. Disable CARP and monitor the network with tcpdump (Packet Capturing) to check for other CARP or CARP-like traffic, and adjust VHIDs appropriately.
If CARP is working properly, and this message is in the logs when the system boots up, it may be disregarded. It is normal for this message to be seen when booting, as long as CARP continues to function properly (primary shows MASTER, secondary shows BACKUP for status).
Both Systems Appear as MASTER¶
This will happen if the secondary cannot see the CARP advertisements from the primary. Check for firewall rules, connectivity trouble, switch configurations. Also check the system logs for any relevant errors that may lead to a solution. If this is encountered in a Virtual Machine (VM) Product such as ESX, see Troubleshooting High Availability Clusters in Virtual Environments.
Primary system is stuck as BACKUP¶
In some cases, this is may happen normally for a short period after a system comes back online. However, certain hardware failures or other error conditions can cause a server to silently take on a high advskew of 240 in order to signal that it still has a problem and should not become master. This can check be checked from the GUI, or via the shell or Diagnostics > Command.
In the GUI, this condition is printed in an error message on Status > CARP.
From the shell or Diagnostics > Command, run the following command to check for a demotion:
# sysctl net.inet.carp.demotion net.inet.carp.demotion: 240
If the value is greater than
0, the node has demoted itself.
In that case, isolate the firewall, check its network connections, and perform further hardware testing.
If the demotion value is 0 and the primary node still appears to be demoting itself to BACKUP or is flapping, check the network to ensure there are no layer 2 loops. If the firewall receives back its own heartbeats from the switch, it can also trigger a change to BACKUP status.
Other Switch and Layer 2 Issues¶
If the units are plugged into separate switches, ensure that the switches are properly trunking and passing broadcast/multicast traffic.
Some switches have broadcast/multicast filtering, limiting, or “storm control” features that can break CARP.
Some switches have broken firmware that can cause features like IGMP Snooping to interfere with CARP.
If a switch on the back of a modem/CPE is use, try a real switch instead. These built-in switches often do not properly handle CARP traffic. Often plugging the firewalls into a proper switch and then uplinking to the CPE will eliminate problems.
Configuration Synchronization Problems¶
Double check the following items when problems with configuration synchronization are encountered:
The username must be admin on all nodes.
The password in the configuration synchronization settings on the primary must match the password on the backup.
The WebGUI must be on the same port on all nodes.
The WebGUI must be using the same protocol (HTTP or HTTPS) on all nodes.
Traffic must be permitted to the WebGUI port on the interface which handles the synchronization traffic.
The pfsync interface must be enabled and configured on all nodes.
Verify that only the primary sync node has the configuration synchronization options enabled.
Ensure no IP address is specified in the Synchronize Config to IP on the secondary node.
Ensure the clocks on both nodes are current and are reasonably accurate.