Troubleshooting High Availability Clusters in Virtual Environments

Hypervisor Users (Especially VMware ESX/ESXi)

The settings below are specific to VMware ESX/ESXi, but similar settings may be present on Hyper-V, VirtualBox, and other hypervisors.

Note

These notes all apply to CARP VIPs in multicast mode. Unicast mode CARP on pfSense Plus software may not require these settings, but experiences may vary by hypervisor and environment.

  • Enable promiscuous mode on the vSwitch

  • Enable MAC Address changes

  • Enable Forged transmits

  • If multiple physical ports exist on the same vSwitch, the Net.ReversePathFwdCheckPromisc option must be enabled to work around a vSwitch bug in which multicast traffic loops back to the host. When this happens, CARP fails to function and logs “link states coalesced” messages. See Changing Net.ReversePathFwdCheckPromisc below.
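The checklist above can be sketched as a small audit function. This is only an illustration: the dictionary keys and the function name are hypothetical labels for the settings listed above, not identifiers from any VMware API.

```python
# Hypothetical labels for the settings listed above; these are not VMware API names.
REQUIRED_FOR_CARP = {
    "promiscuous_mode": True,
    "mac_address_changes": True,
    "forged_transmits": True,
}


def missing_carp_settings(policy, multiple_uplinks=False):
    """Return the sorted names of settings that still need to be enabled.

    policy maps a setting name to True (enabled/Accept) or False (Reject).
    Settings absent from policy are treated as disabled.
    """
    required = dict(REQUIRED_FOR_CARP)
    if multiple_uplinks:
        # Only needed when several physical ports share the same vSwitch.
        required["reverse_path_fwd_check_promisc"] = True
    return sorted(
        name for name, wanted in required.items()
        if policy.get(name, False) != wanted
    )


current = {"promiscuous_mode": True}
print(missing_carp_settings(current))
```

Passing multiple_uplinks=True adds the Net.ReversePathFwdCheckPromisc requirement to the check.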

ESX VDS Promisc Workaround

If a Virtual Distributed Switch is in use, a port group can be made for the firewall interfaces with promiscuous mode enabled, and a separate non-promiscuous port group may be used for other hosts. This has been reported to work by users on the forum as a way to strike a balance between the requirements for letting CARP function and for securing client ports.

ESX VDS Upgrade Issue

If a VDS (Virtual Distributed Switch) is used in ESX 4.0 or 4.1 and the host is upgraded from 4.0 to 4.1 or 5.0, the upgraded VDS will not properly pass CARP traffic. A new VDS created on 4.1 or 5.0 will work, but the upgraded VDS will not.

It is reported that disabling promiscuous mode on the VDS and then re-enabling it will resolve the issue.

ESX VDS Port Mirroring Issue

If port mirroring is enabled on a VDS, it will break promiscuous mode. To fix this, disable promiscuous mode and then re-enable it.

Client Port Issues

If a bare metal HA cluster is connected to a switch along with an ESX host that uses multiple ports (LAGG or similar), and only certain devices or IP addresses are reachable by the target VM, then the port group settings in ESX may need to be adjusted so that load balancing for the group hashes based on IP address, not the originating interface.

Side effects of having that set incorrectly include:

  • Traffic only reaching the target VM when its NIC is in promiscuous mode

  • Inability to reach the CARP VIP from the target VM when the “real” IP address of the primary firewall is reachable

  • Port forwards or other inbound connections to the target VM work from some IP addresses and not others.
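To see why the symptoms depend on the source address, the following toy model compares the two uplink selection policies. It is a simplified sketch, not VMware's actual hashing algorithm, and the function names and example addresses are hypothetical.

```python
import ipaddress


def uplink_by_ip_hash(src, dst, uplink_count):
    """Toy IP hash: XOR the two addresses and take the result modulo the uplink count."""
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % uplink_count


def uplink_by_originating_port(virtual_port_id, uplink_count):
    """Toy originating-port policy: every flow from one virtual port uses one uplink."""
    return virtual_port_id % uplink_count


# Two clients talking to the same VM land on different uplinks with IP hashing,
# which matches how a LAGG on the physical switch typically spreads traffic.
print(uplink_by_ip_hash("192.0.2.10", "192.0.2.1", 2))
print(uplink_by_ip_hash("192.0.2.11", "192.0.2.1", 2))

# With the originating-port policy the VM is pinned to a single uplink, so
# traffic the switch sends down the other LAGG member may never reach it.
print(uplink_by_originating_port(7, 2))
```

When the switch and the ESX host disagree on which uplink carries a flow, only the flows that happen to hash onto the expected uplink work, which is consistent with connections succeeding from some IP addresses and not others.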

Changing Net.ReversePathFwdCheckPromisc

Log in to the VMware vSphere Client.

For each VMware host:

  • Click the host to configure and select the Configuration tab

  • Under Software, click Advanced Settings in the left pane

  • Click on Net and scroll down to Net.ReversePathFwdCheckPromisc and set to 1

  • Click OK
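Where shell access to the host is available, the same option can usually be set from the command line instead of the client. The sketch below only builds and prints an esxcli command line; it assumes the host's esxcli version supports the system settings advanced namespace with these options, which should be verified on the host before use. The helper name is hypothetical.

```python
import shlex


def advanced_setting_command(option, value):
    """Build an esxcli argument list that sets an integer advanced option."""
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "--option", option,
        "--int-value", str(value),
    ]


cmd = advanced_setting_command("/Net/ReversePathFwdCheckPromisc", 1)
# Print the command for review rather than executing it.
print(shlex.join(cmd))
```

The printed command can be reviewed and then run in the ESXi shell on each host. Promiscuous mode still needs to be toggled afterward, as described next.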

Promiscuous mode must now be enabled on the relevant interfaces or, if it is already enabled, toggled off and then back on. This is done per host by clicking Networking in the Hardware section.

  • For each vSwitch and/or Virtual Machine Port Group:

    Note

    If Promiscuous Mode is already enabled, it must be disabled and saved, then re-enabled and saved again.

    • Click Properties for the vSwitch

      By default, Promiscuous Mode is set to Reject.

    • Click Edit, then select the Security tab

    • Select Accept from the drop down

    • Click OK

  • However, this setting is usually applied per Virtual Machine Port Group (more secure), with the vSwitch left at its default of Reject.

    • Navigate to Edit > Security > Policy Exceptions

    • Uncheck Promiscuous Mode

    • Click OK

    • Navigate to Edit > Security > Policy Exceptions

    • Check Promiscuous Mode and select Accept.
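The disable and re-enable sequence for a port group can also be scripted. The sketch below builds the two esxcli command lines in order. It assumes a standard vSwitch (not a VDS) and that the host's esxcli supports the network vswitch standard portgroup policy security namespace with these options. The port group name WAN-CARP and the helper name are hypothetical.

```python
import shlex


def toggle_promiscuous_commands(portgroup):
    """Return two esxcli argument lists: disable promiscuous mode, then re-enable it."""
    base = [
        "esxcli", "network", "vswitch", "standard", "portgroup",
        "policy", "security", "set",
        f"--portgroup-name={portgroup}",
    ]
    return [
        base + ["--allow-promiscuous=false"],
        base + ["--allow-promiscuous=true"],
    ]


# "WAN-CARP" is a placeholder port group name.
for command in toggle_promiscuous_commands("WAN-CARP"):
    print(shlex.join(command))
```

Running the two printed commands in order on each host mirrors the uncheck, save, check, save steps above.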

More information available from VMware

ESX Physical NIC Failure Fails to Trigger Failover

Self-demotion of a CARP VIP relies on the loss of link on a switch port. As such, if the primary and secondary nodes run on separate ESX hosts and the primary ESX host loses a switch port link but does not expose that loss to the VM, the primary will remain MASTER on all of its VIPs while the secondary will also believe it should be MASTER. One way around this is to script an event in ESX that takes down the switch port on the VM if the physical port loses link. There may be other ways around this in ESX as well.

VMware Workstation

If VMware Workstation is used on Linux for testing or modeling and CARP failover does not function, the likely cause is that VMware Workstation is running as a non-root user and cannot put the vmnet adapter into promiscuous mode.

The permissions on /dev/vmnet* should be changed such that the user running VMware workstation is allowed to modify the /dev/vmnet* devices. See the VMware KB for details.

To make the change permanent, edit /etc/init.d/vmware and, in the function vmwareStartVmnet(), add commands that change the group and the permissions of the vmnet devices (for example with chgrp and chmod) so that a group containing the user running VMware Workstation can modify them.
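As a rough sketch of the permission change, the function below sets the group on each matching device and adds group read/write access. Granting group read/write is an assumption about what "allowed to modify" requires here, so check the VMware KB for the exact steps. The function name and the group name vmnetusers are hypothetical, and the call must be made by a user permitted to change the devices (typically root).

```python
import glob
import os
import stat


def grant_group_access(gid, pattern="/dev/vmnet*"):
    """Set the group of each matching device to gid and add group read/write."""
    changed = []
    for path in sorted(glob.glob(pattern)):
        os.chown(path, -1, gid)  # -1 leaves the owning user unchanged
        mode = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, mode | stat.S_IRGRP | stat.S_IWGRP)
        changed.append(path)
    return changed


# Example usage (run as root; "vmnetusers" is a placeholder group name):
#   import grp
#   grant_group_access(grp.getgrnam("vmnetusers").gr_gid)
```

Device nodes under /dev are commonly recreated when the VMware services start, which is why the text above suggests adding the equivalent commands to the init script.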

Proxmox VE, KVM, and QEMU Issues

Use VirtIO (vtnet(4)) or e1000 (em(4)) NICs, not ed(4) NICs. With ed(4) NICs, CARP VIPs will never leave the INIT state.

VirtualBox Issues

From this thread:

Setting Promiscuous mode: Allow All on the relevant interfaces of the VM allows CARP to function on any interface type (Bridged, Host-Only, Internal).