Troubleshooting High Availability DHCP Failover¶
There are several potential scenarios that can cause problems with DHCP service for a high availability cluster. This document contains items to check as well as potentially problematic scenarios and workarounds.
The issues and limitations encountered here vary depending on which DHCP daemon backend is active.
Common Issues¶
These issues may affect high availability DHCP failover no matter which backend is active.
Time Not Synchronized¶
The system time on both cluster nodes must be within 90 seconds of each other. Otherwise the time difference is too large and the DHCP daemon processes will not communicate.
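The 90 second limit can be expressed as a simple comparison. The following is a minimal sketch, assuming the current time has already been read from each node by some other means; the function name and the sample timestamps are hypothetical and only serve as illustration.

```python
from datetime import datetime, timedelta, timezone

MAX_SKEW_SECONDS = 90  # limit stated in this section


def clocks_in_sync(primary_time, secondary_time, max_skew=MAX_SKEW_SECONDS):
    """Return True if the two clock readings differ by at most max_skew seconds."""
    skew = abs((primary_time - secondary_time).total_seconds())
    return skew <= max_skew


# Hypothetical readings taken from each node at the same moment.
primary = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
secondary = primary + timedelta(seconds=30)
print(clocks_in_sync(primary, secondary))  # True
```

If the check fails, correct the time source on each node before investigating other causes.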
Kea DHCP Daemon¶
These issues may affect high availability DHCP failover when the Kea DHCP backend is active.
Kea High Availability Node Status¶
Check the DHCP Leases or DHCPv6 Leases page to see the current status of failover for each daemon. The High Availability status for Kea DHCP operates identically for both DHCP and DHCPv6. For details on how the DHCP HA status operates, see High Availability Status – Kea DHCP Only.
If any of the peers are in a state other than hot-standby, it may indicate a problem. The solution is likely in one of the other sections of this document.
Incorrect Failover Peer Address¶
The Settings tab for DHCP and DHCPv6 service defines the failover peers for HA in Kea.
The address in this field must be an actual interface IP address on the peer, not a CARP VIP or other shared address. Most often this is the IP address of the Sync interface on each node.
Note
The address family does not need to match the family of the DHCP daemon, for example, DHCPv6 HA can be performed using IPv4 peer addresses.
When XMLRPC synchronization is enabled the primary node will adjust this automatically when copying settings to the secondary node, swapping the local and remote values appropriately.
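The peer address rule above can be checked mechanically. This is a minimal sketch, assuming the list of CARP VIPs and other shared addresses is gathered by hand; the function name and the sample addresses are hypothetical.

```python
import ipaddress


def peer_address_problems(peer, shared_addresses):
    """Return a list of problems with a failover peer address (empty if none)."""
    try:
        addr = ipaddress.ip_address(peer)
    except ValueError:
        return [f"{peer!r} is not a valid IP address"]

    # CARP VIPs and other shared addresses must not be used as the peer.
    shared = {ipaddress.ip_address(a) for a in shared_addresses}
    if addr in shared:
        return [f"{addr} is a CARP VIP or other shared address"]
    return []


# Hypothetical values: a Sync interface address and a CARP VIP.
print(peer_address_problems("172.16.1.3", ["192.168.1.1"]))
print(peer_address_problems("192.168.1.1", ["192.168.1.1"]))
```

Because the address family does not need to match the daemon, the sketch accepts both IPv4 and IPv6 addresses.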
Failover Peer Unreachable¶
If the peers are both active but cannot communicate with each other, they may both consider the peer to be in an unreachable state. If this happens, both nodes may hand out addresses at the same time, causing a conflict. Double check the IP addresses and ports used for failover as well as the firewall rules on the interfaces involved.
Firewall Rules¶
For two DHCP peers to exchange failover data they must be able to reach each other on the configured ports through the interfaces containing the configured local and remote addresses. These addresses are typically on the Sync interface.
It may be necessary to add firewall rules to pass the sync traffic, for example when using strict firewall rules on the Sync interface.
The primary node must be allowed to reach the secondary node on the defined port (default 8765 for DHCPv4 and 8766 for DHCPv6) using the Remote Address defined in the DHCP settings, and vice versa. Check the firewall rules on the Sync interface and add rules as needed to pass this traffic.
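A quick way to confirm that the firewall passes this traffic is to attempt a connection from one node to the other. The sketch below assumes the failover traffic uses TCP on the default ports listed above and that a Python interpreter is available on the node running the test; the function names and the sample address are hypothetical.

```python
import socket

# Default ports named above; adjust if the DHCP settings use other values.
KEA_HA_PORTS = {"DHCPv4": 8765, "DHCPv6": 8766}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peer(remote_address, ports=KEA_HA_PORTS):
    """Map each service name to whether its port is reachable on the peer."""
    return {name: can_connect(remote_address, port) for name, port in ports.items()}
```

For example, `check_peer("172.16.1.3")` run on the primary node would report whether each port on the secondary node answers, where `172.16.1.3` stands in for the Remote Address. Run the same check from the secondary node toward the primary node to cover the reverse direction.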
Secondary Does Not Enter partner-down State¶
In certain cases with customized Advanced settings, a secondary node may not enter the partner-down state even when the primary is unreachable.
Usually this happens because of the Max Unacked Clients setting, which instructs the daemon to wait for that many clients to send unanswered DHCP requests before it will take over.
While this setting can help prevent a secondary node from taking over too quickly in certain cases, a high value for Max Unacked Clients will prevent that number (minus 1) of clients from getting addresses until another client also requests an address and goes unanswered. On a large and busy network this may not take long, but on a small or quiet network this could take a significant amount of time and leave some clients stranded. If too many clients are going unanswered for too long, lower the value until clients are served in a timely manner during a failure of the primary node.
The default value is 0, which causes the secondary to take over immediately after the primary has been unresponsive (default time: 60 seconds).
Mismatched TLS Settings¶
The TLS settings for Kea DHCP and DHCPv6 HA do not synchronize via XMLRPC as they may be different on each node. If the TLS settings on the two nodes are not compatible with each other, the nodes cannot communicate and DHCP HA will fail.
When using TLS transport, ensure each node has the Server certificate option set to use a Server type certificate signed by a CA trusted by both peers (e.g. both using the same CA).
When using mutual TLS, ensure each node is using an appropriate User certificate signed by the same CA in the Client certificate option.
ISC DHCP Daemon¶
These issues may affect high availability DHCP failover when the ISC DHCP backend is active.
Interface Order Mismatch¶
The interfaces must be assigned identically on both nodes, for example: wan=WAN, lan=LAN, opt1=Sync, opt2=DMZ. Check the config.xml contents directly to ensure a match. If the interfaces are not assigned in the same order, the automatically generated failover pool names will not match, which prevents DHCP failover from working.
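Comparing the assignments by eye is error prone on nodes with many interfaces, so the comparison can be scripted. The sketch below assumes a copy of config.xml from each node is available as text and that interface assignments are stored as child elements of an `interfaces` element, each with a `descr` child holding the description; verify that layout against the actual files before relying on it. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest


def interface_assignments(config_xml_text):
    """Return [(internal name, description), ...] in the order found in the file."""
    root = ET.fromstring(config_xml_text)
    interfaces = root.find("interfaces")  # assumed location of the assignments
    if interfaces is None:
        return []
    return [(iface.tag, (iface.findtext("descr") or "").strip())
            for iface in interfaces]


def assignment_mismatches(primary_xml, secondary_xml):
    """Return (primary entry, secondary entry) pairs that differ by position."""
    pairs = zip_longest(interface_assignments(primary_xml),
                        interface_assignments(secondary_xml))
    return [(p, s) for p, s in pairs if p != s]
```

An empty list from `assignment_mismatches()` means both nodes assign the same descriptions to the same internal names in the same order. Any returned pair shows the position where the nodes diverge; a missing interface on one node appears as `None`.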
Pool Status¶
Look at the pool status section at Status > DHCP Leases. All defined pools (often one per interface) are listed at the top of that page.
If any of the pools are in a state other than “normal”, then debug the problem further. The solution is most likely found in one of the other items in this document.
If the pool status on both nodes is “normal” for a given interface, then problems with clients obtaining leases on that interface are not likely caused by the DHCP configuration or failover but by something else, such as an L2 or switch problem or a client problem.
Incorrect Failover Peer Address¶
Each interface tab in the DHCP server options has a separate field for the Failover Peer IP address. This field must be filled in for each interface participating in DHCP with HA. The address in this field must be the actual interface IP address on the peer corresponding to the chosen tab, not a CARP VIP, other shared address, or an address from an unrelated interface (e.g. the Sync interface).
When XMLRPC synchronization is enabled the primary node will adjust this automatically when copying settings to the secondary node, filling in its own IP address for the interface.
Failover Peer Unreachable¶
If one failover peer cannot contact the other peer when it starts up, it will intentionally stop handing out leases. It does this as a fail-safe to avoid handing out conflicting lease data.
This can happen if, for example, both nodes suffer a power loss and only one recovers. Another common scenario is if one node suffers a hardware failure and the working node must be rebooted before the failed node can be repaired.
Correcting this can be tricky. The simplest way to correct it is to bring the other peer online if possible. If that is not possible, then the only way may be to remove the failover peer IP addresses from each DHCP interface configuration so the node no longer believes it should be part of a failover pool. When the other node recovers, the configuration can be put back in place.
Firewall Rules¶
For two DHCP peers to exchange failover data, they must be able to reach each other on an interface. There are typically automatic firewall rules which handle this, but there have been issues in the past where these automatic rules did not cover every possible scenario.
If the firewall log shows this traffic being blocked, then it may be necessary to add manual rules to pass the traffic. Ensure the two nodes are allowed to communicate on every relevant interface. The primary node must be allowed to receive traffic on TCP port 519 from the secondary node and the secondary node must be allowed to receive traffic on TCP port 520 from the primary node.
Restart DHCP Daemons¶
Stop and restart the DHCP daemon from Status > Services on both nodes and check the status after a few moments. This may correct the issue or at least provide better detail in the logs during the startup procedure for the daemons.
Check CARP VIP Configuration¶
Check the CARP VIP configuration for VIPs on interfaces used for DHCP failover.
The primary node must have an Advertising Frequency Skew value below 20, and the secondary node must have an Advertising Frequency Skew value above 20.
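The skew rule above is easy to verify once the values have been read from each node. This is a minimal sketch, assuming the skew values are copied by hand from the CARP VIP configuration; the function name and the sample values are hypothetical.

```python
SKEW_THRESHOLD = 20  # boundary stated in this section


def skew_problems(primary_skew, secondary_skew, threshold=SKEW_THRESHOLD):
    """Return a list of problems with the Advertising Frequency Skew values."""
    problems = []
    if primary_skew >= threshold:
        problems.append(f"primary skew {primary_skew} is not below {threshold}")
    if secondary_skew <= threshold:
        problems.append(f"secondary skew {secondary_skew} is not above {threshold}")
    return problems


# Hypothetical values for one CARP VIP as configured on each node.
print(skew_problems(primary_skew=0, secondary_skew=100))  # []
```

Repeat the check for every CARP VIP on an interface used for DHCP failover.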
Mismatched Versions¶
Both nodes must be running the same version of pfSense® software. Update both nodes to the newest available stable release if they do not match. Older versions may have problems with various aspects of DHCP failover that have already been corrected.
Reset Lease Database¶
If the two nodes cannot agree on the pool status, it may be due to the contents of the lease databases. This can sometimes happen when first setting up failover or after reinstalling an HA node without backing up and restoring its DHCP lease database.
If all else fails, perform the following:
Stop the DHCP daemon on both nodes
Remove the DHCP lease database files from /var/dhcpd/var/db/dhcpd.leases* on both nodes
Start the DHCP daemon on both nodes
Complete each step on both nodes before moving on to the next step. Do not run the whole procedure on one node and then repeat it on the other. The important part is that both nodes start at the same time with a new, empty lease database.
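The ordering requirement can be made explicit by listing the actions in the order they must happen. The sketch below only builds that list and performs no changes on either node; the step descriptions come from the procedure above, while the function name and node labels are hypothetical.

```python
STEPS = [
    "stop the DHCP daemon",
    "remove /var/dhcpd/var/db/dhcpd.leases*",
    "start the DHCP daemon",
]
NODES = ["primary", "secondary"]


def reset_plan(steps=STEPS, nodes=NODES):
    """Return (step, node) pairs with each step applied to every node in turn."""
    # The outer loop is over steps, so no node advances to the next step
    # until every node has finished the current one.
    return [(step, node) for step in steps for node in nodes]


for step, node in reset_plan():
    print(f"{node}: {step}")
```

The printed plan stops the daemon on both nodes first, then removes the lease files on both nodes, and only then starts the daemons again.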
Inconsistent Client Hostnames¶
The DHCP servers on each node in a failover configuration work in coordination with one another. Each server will handle a portion of the total pool and relay lease information to the failover peer.
The lease information the nodes exchange, however, does not include client hostnames. This means that features which rely on DHCP hostnames, such as DNS resolution of DHCP client hostnames, will not work consistently when using DHCP failover.
This is a limitation of the ISC DHCP daemon and not something that can be changed or corrected in pfSense software.
Currently the most viable workaround is to define static DHCP mappings for each host that must be resolved via DNS.