Troubleshooting High Availability

High availability configurations can be complex, and with so many different ways to configure a failover cluster, it can be tricky to get things working properly. This section covers some common (and not so common) problems along with solutions that work in the majority of cases. If issues are still present after consulting this section, there is a dedicated HA/CARP/VIPs board on the Netgate Forum.

Before proceeding, take the time to check all members of the HA cluster to ensure that they have consistent configurations. Often, it helps to walk through the example setup, double checking all of the proper settings. Repeat the process on the secondary node, and watch for any places where the configuration must be different on the secondary. Be sure to check the CARP status (Check CARP status) and ensure CARP is enabled on all cluster members.

Errors relating to HA will be logged in Status > System Logs, on the System tab. Check those logs on each system involved to see if there are any messages relating to XMLRPC sync, CARP state transitions, or other related errors.

See also

The issues on this page are for HA in general. For issues specific to using HA in virtual environments, see Troubleshooting High Availability Clusters in Virtual Environments.

Common Misconfigurations

Several common misconfigurations prevent HA from working properly.

Use a different VHID on each CARP VIP

A different VHID must be used on each CARP VIP created on a given interface or broadcast domain. The VHID determines the virtual MAC address used by that CARP IP address, so different clusters attempting to use the same VHID on the same L2 segment cause a MAC address conflict.
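To illustrate why a shared VHID collides, the sketch below derives the virtual MAC address for a given VHID. It assumes the standard layout in which CARP uses the fixed prefix 00:00:5e:00:01 (the same prefix VRRP uses for IPv4) followed by the VHID as the last octet. The function name carp_mac is hypothetical and is not part of pfSense software.

```shell
# Hypothetical helper: print the virtual MAC address that corresponds to a VHID.
# Assumes the CARP/VRRP IPv4 prefix 00:00:5e:00:01 with the VHID (1-255)
# as the final octet.
carp_mac() {
    vhid="$1"
    if [ "$vhid" -lt 1 ] || [ "$vhid" -gt 255 ]; then
        echo "VHID must be between 1 and 255" >&2
        return 1
    fi
    printf '00:00:5e:00:01:%02x\n' "$vhid"
}

carp_mac 1     # prints 00:00:5e:00:01:01
carp_mac 200   # prints 00:00:5e:00:01:c8
```

Two unrelated clusters that both pick VHID 1 on the same segment would therefore both claim 00:00:5e:00:01:01.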

With a single HA pair, input validation will prevent duplicate VHIDs. Unfortunately it isn’t always that simple. CARP is a multicast technology, and as such anything using CARP on the same network segment must use a unique VHID. VRRP uses a protocol similar to CARP, so ensure there are no conflicts with VRRP VHIDs, such as if the ISP or another router on the local network is using VRRP.

The best way around this is to use a unique set of VHIDs. If a known-safe private network is in use, start numbering at 1. On a network where VRRP or CARP are conflicting, consult with the administrator of that network to find a free block of VHIDs.

Incorrect CARP VIP Settings

Inspect the settings for CARP VIPs (Firewall > Virtual IPs) to ensure they are correct and consistent on both nodes.

The Advertising Frequency values must be appropriate for each VIP and node:

Base

Values should be the same on both nodes. In some situations where the secondary node is on a slow or non-local link, users have increased this value on only the secondary, but that can lead to problems with each node assuming its expected role at the proper time.

Skew

Values must be different on the primary and secondary nodes. The primary is typically 1 or 0, and the secondary is typically 100.
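As a rough illustration of how the two values interact, a CARP node advertises approximately every base + skew/256 seconds, and the node that advertises most frequently is preferred as MASTER. The sketch below computes that interval; carp_interval is a hypothetical helper name, and the values shown are the typical ones mentioned above, not measurements from a real cluster.

```shell
# Hypothetical helper: approximate advertisement interval in seconds,
# computed as base + skew/256.
carp_interval() {
    awk -v base="$1" -v skew="$2" 'BEGIN { printf "%.3f\n", base + skew / 256 }'
}

carp_interval 1 0     # typical primary:   1.000
carp_interval 1 100   # typical secondary: 1.391
```

With the same base on both nodes, the lower skew on the primary gives it the shorter interval, which is why the skew values must differ.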

Incorrect Times

Check that all systems involved are properly synchronizing their clocks and have valid time zones, especially if running in a Virtual Machine. If the clocks are too far apart, some synchronization tasks like DHCP failover will not work properly.

Incorrect Subnet Mask

The real subnet mask must be used for a CARP VIP, not /32. This must match the subnet mask for the IP address on the interface to which the CARP IP is assigned.

Both Nodes in Maintenance Mode

If both nodes have activated Persistent CARP Maintenance Mode at Status > CARP (failover), they each will advertise a skew of 254 and the actual status will be unpredictable. Ensure only one node is in maintenance mode at a time.

Incorrect Hash Error

There are a few reasons why this error turns up in the system logs, some more worrisome than others.

If CARP is not working properly when this error is present, it could be due to a configuration mismatch. Ensure that the VHID, password, and IP address/subnet mask for a given VIP all match on both nodes.

If the settings appear to be proper and CARP still does not work while generating this error message, then there may be multiple CARP instances on the same broadcast domain. Disable CARP and monitor the network with tcpdump (Packet Capturing) to check for other CARP or CARP-like traffic, and adjust VHIDs appropriately.
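A capture along these lines can be sketched as follows. CARP and VRRP both use IP protocol 112, so filtering on that protocol number shows advertisements from either. The function name watch_carp is hypothetical, and the interface name em0 in the usage comment is only a placeholder for whichever interface carries the VIP.

```shell
# Hypothetical helper: capture CARP/VRRP advertisements (IP protocol 112)
# on one interface.
#   -n  do not resolve names
#   -e  print link-level headers, so the source MAC of each sender is visible
#   -i  interface to listen on
watch_carp() {
    iface="$1"
    tcpdump -n -e -i "$iface" ip proto 112
}

# Usage (as root, interface name is an example only):
#   watch_carp em0
```

Advertisements arriving from an unexpected source address or MAC while CARP is disabled on the cluster point to another device using CARP or VRRP on that segment.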

If CARP is working properly, and this message is in the logs when the system boots up, it may be disregarded. It is normal for this message to be seen when booting, as long as CARP continues to function properly (primary shows MASTER, secondary shows BACKUP for status).
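The status can also be checked from the shell. The sketch below summarizes CARP state from ifconfig output. It assumes ifconfig prints one line per VIP of the form "carp: MASTER vhid 1 advbase 1 advskew 0"; if the output on a given version differs, the field positions need adjusting. The function name carp_states is hypothetical.

```shell
# Hypothetical helper: summarize CARP state lines read from standard input.
# Assumes lines shaped like: "carp: MASTER vhid 1 advbase 1 advskew 0"
carp_states() {
    awk '$1 == "carp:" {
        printf "vhid %s: %s (advbase %s, advskew %s)\n", $4, $2, $6, $8
    }'
}

# Usage on the firewall:
#   ifconfig | carp_states
```

On a healthy pair, every VHID should report MASTER on the primary and BACKUP on the secondary.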

Both Systems Appear as MASTER

This will happen if the secondary cannot see the CARP advertisements from the primary. Check for firewall rules blocking CARP, connectivity trouble, and switch configuration problems. Also check the system logs for any relevant errors that may lead to a solution. If this is encountered in a virtual machine (VM) product such as ESX, see Troubleshooting High Availability Clusters in Virtual Environments.

Primary system is stuck as BACKUP

In some cases, this may happen normally for a short period after a system comes back online. However, certain hardware failures or other error conditions can cause a server to silently take on a high advskew of 240 in order to signal that it still has a problem and should not become master. This can be checked from the GUI, or via the shell or Diagnostics > Command.

In the GUI, this condition is printed in an error message on Status > CARP.

Figure: CARP Status when Primary is demoted

From the shell or Diagnostics > Command, run the following command to check for a demotion:

#  sysctl net.inet.carp.demotion
net.inet.carp.demotion: 240

If the value is greater than 0, the node has demoted itself.

In that case, isolate the firewall, check its network connections, and perform further hardware testing.
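The same check can be wrapped in a small script, for example to run on both nodes in turn. This is only a sketch: carp_demotion_status is a hypothetical helper name, and the rule it applies is the one stated above (a value greater than 0 means the node has demoted itself).

```shell
# Hypothetical helper: interpret a net.inet.carp.demotion value.
carp_demotion_status() {
    value="$1"
    if [ "$value" -gt 0 ]; then
        echo "demoted: net.inet.carp.demotion=$value"
    else
        echo "not demoted"
    fi
}

# Usage on the firewall (sysctl -n prints only the value):
#   carp_demotion_status "$(sysctl -n net.inet.carp.demotion)"
```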

If the demotion value is 0 and the primary node still appears to be demoting itself to BACKUP or is flapping, check the network to ensure there are no layer 2 loops. If the firewall receives back its own heartbeats from the switch, it can also trigger a change to BACKUP status.

Other Switch and Layer 2 Issues

  • If the units are plugged into separate switches, ensure that the switches are properly trunking and passing broadcast/multicast traffic.

  • Some switches have broadcast/multicast filtering, limiting, or “storm control” features that can break CARP.

  • Some switches have broken firmware that can cause features like IGMP Snooping to interfere with CARP.

  • If a switch on the back of a modem/CPE is in use, try a real switch instead. These built-in switches often do not properly handle CARP traffic. Often plugging the firewalls into a proper switch and then uplinking to the CPE will eliminate problems.

Configuration Synchronization Problems

Double check the following items when problems with configuration synchronization are encountered:

  • The username must be admin on all nodes.

  • The password in the configuration synchronization settings on the primary must match the password on the backup.

  • The WebGUI must be on the same port on all nodes.

  • The WebGUI must be using the same protocol (HTTP or HTTPS) on all nodes.

  • Traffic must be permitted to the WebGUI port on the interface which handles the synchronization traffic.

  • The pfsync interface must be enabled and configured on all nodes.

  • Verify that only the primary sync node has the configuration synchronization options enabled.

  • Ensure no IP address is specified in the Synchronize Config to IP field on the secondary node.

  • Ensure the clocks on both nodes are current and are reasonably accurate.
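The port and firewall rule items in this list can be spot-checked from the shell on the primary. The sketch below tests whether the WebGUI port on the secondary accepts TCP connections, assuming nc is available on the system. The function name check_sync_port is hypothetical, and the address and port in the usage comment are placeholders, not values taken from any particular configuration.

```shell
# Hypothetical helper: test whether the peer's WebGUI port accepts connections.
#   -z  only check for a listening port, send no data
#   -w  give up after this many seconds
check_sync_port() {
    peer="$1"
    port="$2"
    if nc -z -w 3 "$peer" "$port" >/dev/null 2>&1; then
        echo "reachable: $peer:$port"
    else
        echo "unreachable: $peer:$port"
        return 1
    fi
}

# Usage on the primary (placeholder address and port):
#   check_sync_port 172.16.1.3 443
```

A successful connection only shows that the port is open on the sync path. A mismatch in protocol, username, or password would still need to be checked in the GUI and the system logs.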

HA and Multi-WAN Troubleshooting

If trouble is encountered reaching CARP VIPs when dealing with Multi-WAN, double check that a rule is present like the one mentioned in Firewall Configuration.