Troubleshooting High Availability

If any of the tests fail (see Testing High Availability), there are a few common things to check.

Review the Configuration

Before digging into the technical details below, review the configuration and ensure all steps were followed accurately.

Troubleshooting CARP

Check Interface Status

If an interface shows “INIT” for the CARP state, as shown in CARP Status on Primary with Disconnected Interface, this most commonly indicates that the interface on which the VIP resides is not connected to anything. If there is no link to a switch or another port, the interface is down and the VIP cannot be fully initialized. If the NIC is plugged in and appears to have a link when this occurs, edit, save, and apply changes for the VIP in question to reconfigure it.

../../_images/ha-carp-init.png

CARP Status on Primary with Disconnected Interface

Conflicting VHIDs

The VHID determines the virtual MAC address used by that CARP IP. The input validation in pfSense® will not permit using conflicting VHIDs on a single pair of systems. However, if there are multiple systems on the same broadcast domain running CARP, it is possible to create a conflict. VRRP uses the same virtual MAC address scheme, so a VRRP IP using the same VRID as a CARP IP VHID will generate the same MAC address conflict.

When using CARP on the WAN interface, VRRP or CARP used by the ISP can also conflict. Be sure to use VHIDs that are not in use by the ISP on that broadcast domain.

In addition to creating a MAC address conflict that can interfere with traffic, a duplicate VHID can also interfere with the CARP VIP status.
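
As a rough illustration of why duplicate VHIDs collide, the sketch below derives the virtual MAC address from a VHID using the 00:00:5e:00:01:xx layout shared by CARP and IPv4 VRRP, then flags any VHID claimed by more than one cluster on a segment. The function names and the sample inventory are hypothetical; the list of VHIDs in use would have to be gathered by hand from each cluster and from the ISP.

```python
def carp_virtual_mac(vhid):
    """Return the virtual MAC address derived from a CARP VHID (or VRRP VRID)."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be between 1 and 255")
    return f"00:00:5e:00:01:{vhid:02x}"


def find_vhid_conflicts(vhids_by_cluster):
    """Map each VHID used by more than one cluster to the clusters using it.

    vhids_by_cluster: {cluster_name: [vhid, ...]} for one broadcast domain.
    Both nodes of an HA pair count as a single cluster, since they are
    expected to share VHIDs.
    """
    users = {}
    for cluster, vhids in vhids_by_cluster.items():
        for vhid in set(vhids):  # ignore duplicates within one cluster
            users.setdefault(vhid, []).append(cluster)
    return {vhid: names for vhid, names in users.items() if len(names) > 1}


# Hypothetical WAN segment: the local HA pair and the ISP's VRRP routers.
wan_segment = {"office-fw-pair": [1, 2], "isp-vrrp-routers": [1]}

print(carp_virtual_mac(1))  # 00:00:5e:00:01:01
for vhid, clusters in find_vhid_conflicts(wan_segment).items():
    print(f"VHID {vhid} ({carp_virtual_mac(vhid)}) is used by: {', '.join(clusters)}")
```

Any VHID reported here maps to a single MAC address claimed by unrelated devices, so one side would need to move to an unused VHID.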

Incorrect Subnet Mask

The subnet mask for a CARP VIP must match the subnet mask on the Interface IP address for the same subnet. For example, if an interface IP address is 192.168.1.2/24, a CARP VIP in that subnet must also use a /24 mask, such as 192.168.1.1/24.
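
The mask rule can be checked mechanically. The following is a minimal sketch using Python's standard ipaddress module; the function name check_vip_mask is hypothetical, and the addresses are the ones from the example above, entered by hand rather than read from the firewall.

```python
import ipaddress


def check_vip_mask(interface_cidr, vip_cidr):
    """Return a list of problems found with a CARP VIP's subnet settings."""
    iface = ipaddress.ip_interface(interface_cidr)
    vip = ipaddress.ip_interface(vip_cidr)
    problems = []
    if vip.ip not in iface.network:
        # The VIP address is outside the interface subnet entirely.
        problems.append(f"{vip.ip} is not inside {iface.network}")
    elif vip.network.prefixlen != iface.network.prefixlen:
        # Same subnet, but the VIP was entered with a different mask.
        problems.append(
            f"VIP mask /{vip.network.prefixlen} does not match "
            f"interface mask /{iface.network.prefixlen}"
        )
    return problems


print(check_vip_mask("192.168.1.2/24", "192.168.1.1/24"))  # []
print(check_vip_mask("192.168.1.2/24", "192.168.1.1/32"))
```

An empty list means the VIP and the interface agree; the second call reports the mask mismatch that commonly results from entering a VIP as /32.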

Switch/Layer 2 Issues

Typically a switch or layer 2 issue manifests itself as both units showing “MASTER” status for one or more CARP VIPs. If this happens, check the following items:

  1. Ensure that the interfaces on both units (WAN, LAN, and so on) are connected to the proper switch/VLAN/layer 2. For example, ensure that the LAN port on both units is connected to the same switch/VLAN.

  2. Verify that the two nodes can reach each other (via ICMP echo, for example) on each segment. Firewall rules may need to be added to WAN to accommodate this test.

  3. If the units are plugged into separate switches, ensure that the switches are properly trunking and passing broadcast/multicast traffic.

  4. If the switch on the back of a modem/CPE is being used, try a standalone switch instead. These built-in switches often do not properly handle CARP traffic. Plugging the firewalls into a proper switch and then uplinking to the CPE usually eliminates the problem.

  5. Disable IGMP snooping or other multicast limiting and inspecting features. If they are already off, try enabling the feature and disabling it again.

Troubleshooting XMLRPC

If an XMLRPC synchronization attempt fails, a notice is generated in the GUI to bring attention to it, as seen in XMLRPC Sync Failure Notice. The notice typically contains some information about why it failed that points to a fix, but if that is not enough, check the other items in this section.

../../_images/ha-xmlrpc-fail-notice.png

XMLRPC Sync Failure Notice

Check the System Log

XMLRPC failure details are logged to the main system log (Status > System Logs, General tab). Usually the error is stated plainly. For example, an authentication failure indicates that the password entered for the admin user in the synchronization settings is incorrect. As shown in XMLRPC Sync Failure Log Entry, a timeout happened during the synchronization attempt. In this example it was due to a missing firewall rule.

../../_images/ha-xmlrpc-fail-log.png

XMLRPC Sync Failure Log Entry

Check the Firewall Log

Visit Status > System Logs, Firewall tab on the secondary node. Check the log for traffic that failed to reach the secondary’s Sync interface on the GUI port, as seen in XMLRPC Sync Failure Firewall Log Entry. If the traffic is shown as blocked, adjust the Sync interface rules as needed.

../../_images/ha-xmlrpc-fail-fwlog.png

XMLRPC Sync Failure Firewall Log Entry

Check the Admin User

Visit System > User Manager and ensure that the admin user is enabled on both systems and that the admin password is the same on both systems. Visit System > High Avail Sync and double-check that the admin username has been entered and that the correct password is present.

Verify Connectivity

Check Status > Interfaces and ensure the Sync interface shows a link on both units. If there is no link, ensure a cable is connected between the two units. The ports on the SG-4860 are Auto-MDIX, so either a straight-through patch cable or a crossover cable will work. If a short cable is in use, try a longer cable (minimum 3ft/1m). If a link still cannot be established, try using a small switch or VLAN between the two nodes.

Add a firewall rule to the Sync interface to allow ICMP echo requests and then attempt to ping from one firewall to the other to ensure they can reach each other at layer 3. If they cannot, double-check the interface IP address and subnet mask settings, along with the cabling.

Troubleshooting pfsync

If the pfsync nodes do not line up under Status > CARP, that can indicate that the states have not been synchronized.

Check Firewall Rules

Check the firewall log at Status > System Logs, Firewall tab on both nodes. If any pfsync protocol traffic is present in the log, the firewall rules on the Sync interface are probably incorrect.

Look at Firewall > Rules on the Sync interface tab. Make sure that the rules will pass pfsync protocol traffic, or traffic of any protocol, to any destination. Adjust the rules accordingly and check the logs and CARP status again to see if it starts working.

Verify Connectivity

See Verify Connectivity above to check the connection between the nodes.

Check Interfaces

If the states appear to sync but failover is still not seamless, check Interfaces > (Assign) and make sure the interfaces all line up physically as well as by name. In pfSense 2.2 and later, the states are bound to the interface, so if, for example, the LAN interface is igb0 on one unit but igb3 on the other, the states will not line up. Fix the interfaces so they are identical on both units.
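
When there are many interfaces, comparing the two assignment lists by eye is error-prone. The sketch below compares two hand-copied assignment tables and reports every interface whose underlying NIC differs; the function name and the sample data are hypothetical and would be replaced with the values shown under Interfaces > (Assign) on each node.

```python
def assignment_mismatches(primary, secondary):
    """Return {name: (primary_nic, secondary_nic)} for assignments that differ.

    An interface missing on one node is reported with None on that side.
    """
    names = sorted(set(primary) | set(secondary))
    return {
        name: (primary.get(name), secondary.get(name))
        for name in names
        if primary.get(name) != secondary.get(name)
    }


# Hypothetical assignments copied from each node.
primary = {"WAN": "igb1", "LAN": "igb0", "SYNC": "igb2"}
secondary = {"WAN": "igb1", "LAN": "igb3", "SYNC": "igb2"}

for name, (nic_a, nic_b) in assignment_mismatches(primary, secondary).items():
    print(f"{name}: primary uses {nic_a}, secondary uses {nic_b}")
```

With the sample data, only LAN is reported, matching the igb0/igb3 example above.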

Troubleshooting Local Services

DNS Resolution

If local clients are unable to obtain DNS responses from a CARP VIP on the cluster, check the following items:

  • If using the default DNS Resolver (unbound), visit Services > DNS Resolver and click Save on the primary to ensure the default values are fully respected.

  • If using either the DNS Resolver or DNS Forwarder, ensure the daemon is configured to listen on All interfaces or at least Localhost and the internal CARP VIPs.

  • Ensure the local interface firewall rules pass both TCP and UDP port 53 to the CARP VIPs used for local DNS.

  • Ensure the firewall itself has DNS servers configured under System > General, especially if using the DNS Forwarder (dnsmasq) instead of the DNS Resolver (unbound).

DHCP

If the DHCP failover pool status does not reach “normal”, there are a few items to check:

  • Ensure both units are connected to the same switch/subnet on the correct interface.

  • Verify connectivity between the two units on that interface.

  • Ensure the failover peer IP address has been properly configured.

  • Ensure that there is a CARP VIP on the interface in question.

  • Ensure that the CARP VIP on the primary node has a skew of 0 or 1, and the secondary has a skew of 100 or higher.

  • If all else fails:

    • Click the stop icon to stop the DHCP service from Status > Services on both nodes

    • Visit Diagnostics > Command Prompt on both nodes

    • Run the following command in the Shell Execute box on both nodes: rm /var/dhcpd/var/db/dhcpd.leases*

    • Click the start icon to start the DHCP service from Status > Services on both nodes