Troubleshooting High Availability DHCP Failover¶
There are several potential scenarios that can cause problems with DHCP service for a high availability cluster. This document contains items to check as well as potentially problematic scenarios and workarounds.
Time Not Synchronized¶
The system time on both cluster nodes must be within 60 seconds of each other. Otherwise the time difference is too large and the DHCP daemon processes will not communicate.
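The tolerance check can be reasoned about by comparing clock readings taken from both nodes at the same moment. The following is a minimal sketch, not part of pfSense software: the function name is hypothetical, the readings are assumed to be epoch seconds collected by hand (for example with `date +%s` on each node), and the tolerance is passed in rather than hard-coded.

```python
def clocks_within_tolerance(primary_epoch: float,
                            secondary_epoch: float,
                            tolerance_seconds: float) -> bool:
    """Return True when two clock readings differ by no more than the tolerance.

    Hypothetical helper: both readings are epoch seconds sampled at
    (roughly) the same moment on each node.
    """
    return abs(primary_epoch - secondary_epoch) <= tolerance_seconds
```

If the function returns False for the tolerance stated above, correct time synchronization (for example NTP) on both nodes before investigating anything else.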
Interface Order Mismatch¶
The interfaces must be assigned identically on both nodes, for example: wan=WAN, lan=LAN, opt1=Sync, opt2=DMZ. Check the config.xml contents directly to ensure a match. If the interfaces are not assigned in the same order, the automatically generated failover pool names will not match, which prevents DHCP failover from working.
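Comparing the assignments by eye is error-prone, so a small script can help. The sketch below assumes a copy of config.xml from each node is available as text and that each entry under the `interfaces` section carries a `descr` element; both the layout assumption and the function names are illustrative, not an official tool.

```python
import xml.etree.ElementTree as ET


def interface_assignments(config_xml: str) -> dict[str, str]:
    """Map each internal interface name (wan, lan, opt1, ...) to its description."""
    root = ET.fromstring(config_xml)
    interfaces = root.find("interfaces")  # assumed to be a direct child of the root
    if interfaces is None:
        return {}
    return {
        iface.tag: iface.findtext("descr", default="").strip()
        for iface in interfaces
    }


def assignment_mismatches(primary: dict[str, str],
                          secondary: dict[str, str]) -> dict[str, tuple]:
    """Return the interface names whose descriptions differ between the nodes."""
    names = sorted(set(primary) | set(secondary))
    return {
        name: (primary.get(name), secondary.get(name))
        for name in names
        if primary.get(name) != secondary.get(name)
    }
```

An empty result from `assignment_mismatches` suggests the assignments line up. Any entry it returns names an interface that is assigned differently, or is missing, on one node.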
Pool Status¶
Look at the pool status section at Status > DHCP Leases. All defined pools (often one per interface) are listed at the top of that page.
If any of the pools are in a state other than “normal”, then debug the problem further. The solution is most likely found in one of the other items in this document.
If the pool status on both nodes is “normal” for a given interface, then problems with clients obtaining leases on that interface are not likely caused by the DHCP configuration or failover, but by something else, such as a layer 2 or switch problem or a client problem.
Incorrect Failover Peer Address¶
Each interface tab in the DHCP server options has a separate field for the Failover Peer IP address. This field must be filled in for each interface participating in DHCP with HA. The address in this field must be the actual interface IP address on the peer corresponding to the chosen tab – not a CARP VIP, other shared address, or an address from an unrelated interface (e.g. the Sync interface).
When XMLRPC synchronization is enabled, the primary node adjusts this automatically when copying settings to the secondary node, filling in its own IP address for the interface.
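The rules above can be turned into a quick sanity check. This is a minimal sketch with hypothetical names, using only the Python standard library. The subnet test is a heuristic that assumes the peer's interface address lies in the same subnet as the local interface, which is how an address from an unrelated interface would typically be caught.

```python
import ipaddress


def peer_address_problems(peer_ip: str,
                          local_interface: str,
                          shared_addresses: list[str]) -> list[str]:
    """List likely mistakes in a Failover Peer IP value.

    local_interface is this node's own address with prefix, e.g. "192.168.1.2/24".
    shared_addresses holds CARP VIPs and any other shared addresses.
    """
    problems = []
    peer = ipaddress.ip_address(peer_ip)
    local = ipaddress.ip_interface(local_interface)
    shared = {ipaddress.ip_address(addr) for addr in shared_addresses}

    if peer == local.ip:
        problems.append("peer address is this node's own address")
    if peer in shared:
        problems.append("peer address is a shared address such as a CARP VIP")
    if peer not in local.network:
        problems.append("peer address is outside this interface's subnet")
    return problems
```

For example, `peer_address_problems("192.168.1.1", "192.168.1.2/24", ["192.168.1.1"])` flags the value as a shared address, while a peer of `192.168.1.3` on the same interface produces no findings.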
Failover Peer Unreachable¶
If one failover peer cannot contact the other peer when it starts up, it will intentionally refrain from handing out leases. This is a fail-safe that prevents it from handing out conflicting lease data.
This can happen if, for example, both nodes suffer a power loss and only one recovers. Another common scenario is if one node suffers a hardware failure and the working node must be rebooted before the failed node can be repaired.
Correcting this can be tricky. The simplest way to correct it is to bring the other peer online if possible. If that is not possible, then the only way may be to remove the failover peer IP addresses from each DHCP interface configuration so the node no longer believes it should be part of a failover pool. When the other node recovers, the configuration can be put back in place.
Firewall Rules¶
For two DHCP peers to exchange failover data, they must be able to reach each other on an interface. There are typically automatic firewall rules which handle this, but there have been issues in the past where these automatic rules did not cover every possible scenario.
If the firewall log shows this traffic being blocked, then it may be necessary to add manual rules to pass the traffic. Ensure the two nodes are allowed to communicate on every relevant interface. The primary node must be allowed to receive traffic on TCP port 519 from the secondary node and the secondary node must be allowed to receive traffic on TCP port 520 from the primary node.
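To confirm whether the failover ports are reachable, a plain TCP connection attempt from one node to the other is usually enough. The sketch below is illustrative only: the helper names are hypothetical, it assumes Python is available on the host running the test, and a result is only meaningful when the test is run from the peer node (or a host subject to the same rules) while the DHCP daemon on the target is running.

```python
import socket

# Listening ports as described above: primary on 519, secondary on 520.
FAILOVER_LISTEN_PORT = {"primary": 519, "secondary": 520}


def peer_listen_port(local_role: str) -> int:
    """Return the port the *other* node listens on, given this node's role."""
    peer_role = "secondary" if local_role == "primary" else "primary"
    return FAILOVER_LISTEN_PORT[peer_role]


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On the secondary node, `tcp_port_reachable(primary_ip, peer_listen_port("secondary"))` tests the path to the primary on port 519, where `primary_ip` is a placeholder for the primary's interface address. A False result alongside blocked entries in the firewall log points to a missing pass rule.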
Restart DHCP Daemons¶
Stop and restart the DHCP daemon from Status > Services on both nodes and check the status after a few moments. This may correct the issue or at least provide better detail in the logs during the startup procedure for the daemons.
Check CARP VIP Configuration¶
Check the CARP VIP configuration for VIPs on interfaces used for DHCP failover. The primary node must have an Advertising Frequency Skew value below 20, and the secondary node must have an Advertising Frequency Skew value above 20.
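The skew rule can be expressed as a small check. This is a sketch with hypothetical function names that encodes only the thresholds stated above. A value of exactly 20 is not covered by that rule, so the sketch reports it as undetermined rather than guessing.

```python
def failover_role_for_skew(advskew: int) -> str:
    """Classify a CARP VIP by its Advertising Frequency Skew value."""
    if advskew < 20:
        return "primary"
    if advskew > 20:
        return "secondary"
    return "undetermined"  # exactly 20 is not covered by the stated rule


def skew_problems(primary_skew: int, secondary_skew: int) -> list[str]:
    """Report skew values that do not match each node's intended role."""
    problems = []
    if failover_role_for_skew(primary_skew) != "primary":
        problems.append(f"primary node skew {primary_skew} is not below 20")
    if failover_role_for_skew(secondary_skew) != "secondary":
        problems.append(f"secondary node skew {secondary_skew} is not above 20")
    return problems
```

For instance, `skew_problems(0, 100)` returns an empty list, while `skew_problems(0, 10)` reports the secondary node's value.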
Mismatched Versions¶
Both nodes must be running the same version of pfSense® software. Update both nodes to the newest available stable release if they do not match. Older versions may have problems with various aspects of DHCP failover that have already been corrected.
Reset Lease Database¶
If the two nodes cannot agree on the pool status, it may be due to the contents of the lease databases. This can sometimes happen when first setting up failover or after reinstalling an HA node without backing up and restoring its DHCP lease database.
If all else fails, perform the following:
1. Stop the DHCP daemon on both nodes
2. Remove the DHCP lease database files from /var/dhcpd/var/db/dhcpd.leases* on both nodes
3. Start the DHCP daemon on both nodes
It is important to complete each step on both nodes before moving on to the next step, rather than running the whole procedure on one node and then on the other. The key is that both nodes start at the same time with a new, empty lease database.
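The required ordering can be illustrated by generating the sequence of actions. This sketch only builds a checklist and does not run anything on the nodes; the function name and node labels are hypothetical.

```python
RESET_STEPS = (
    "stop the DHCP daemon",
    "remove the DHCP lease database files",
    "start the DHCP daemon",
)


def reset_plan(nodes: list[str]) -> list[tuple[str, str]]:
    """Return (step, node) pairs ordered so each step covers every node
    before the next step begins."""
    return [(step, node) for step in RESET_STEPS for node in nodes]


if __name__ == "__main__":
    for number, (step, node) in enumerate(reset_plan(["primary", "secondary"]), start=1):
        print(f"{number}. On the {node} node: {step}")
```

The outer loop iterates over steps and the inner loop over nodes, so both daemons are stopped before either lease database is removed, and both databases are removed before either daemon is started.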
Inconsistent Client Hostnames¶
The DHCP servers on each node in a failover configuration work in coordination with one another. Each server will handle a portion of the total pool and relay lease information to the failover peer.
The lease information the nodes exchange, however, does not include client hostnames. This means that features which rely on DHCP hostnames, such as DNS resolution of DHCP client hostnames, will not work consistently when using DHCP failover.
This is a limitation of the ISC DHCP daemon and not something that can be changed or corrected in pfSense software.
Currently the most viable workaround is to define static DHCP mappings for each host that must be resolved via DNS.