How Network Partitioning Management Works

Geode handles network outages by using a weighting system to determine whether the remaining available members have a sufficient quorum to continue as a cluster.

Each member is assigned a weight, and quorum is determined by comparing the total weight of currently responsive members with the total weight of members in the previous membership view.
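
To make the comparison concrete, the following is a minimal sketch only, not Geode's internal code. It assumes the default member weights (10 for an ordinary cache member, 15 for the lead member, and 3 for a locator).

    // A sketch only, not Geode's internal implementation.
    // Assumed default weights: ordinary cache member = 10, lead member = 15, locator = 3.
    public class QuorumWeightSketch {
        static boolean quorumRetained(int previousViewWeight, int currentViewWeight) {
            // Quorum is lost when the surviving weight drops below 51%
            // of the total weight of the previous membership view.
            return currentViewWeight * 100 >= previousViewWeight * 51;
        }

        public static void main(String[] args) {
            // Previous view: 1 locator + lead member + 3 other cache members.
            int previous = 3 + 15 + 10 + 10 + 10;   // = 48
            // After a network failure only the locator and one member remain.
            int surviving = 3 + 10;                 // = 13, about 27% of 48
            System.out.println(quorumRetained(previous, surviving)); // prints false
        }
    }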

Your cluster can split into separate running systems when members lose the ability to see each other. The typical cause of this problem is a failure in the network. When a partitioned system is detected, only one side of the system keeps running and the other side automatically shuts down.

Network partition detection is enabled by default: the enable-network-partition-detection property defaults to true. See Configure Apache Geode to Handle Network Partitioning for details. Quorum weight calculations are always performed and logged, regardless of this configuration setting.
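
If you configure members programmatically, the property can be set when the cache is created. The following is a minimal sketch; the locator address is a placeholder, and the same property can equally be placed in gemfire.properties.

    import java.util.Properties;
    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;

    public class PartitionDetectionConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            // true is already the default; shown here only for illustration.
            props.setProperty("enable-network-partition-detection", "true");
            props.setProperty("locators", "localhost[10334]"); // placeholder locator
            Cache cache = new CacheFactory(props).create();
            // ... work with the cache ...
            cache.close();
        }
    }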

The overall process for detecting a network partition is as follows:

  1. The cluster starts up. Start the locators first, the cache servers second, and then any other members, such as applications or processes that access cluster data. (A startup sketch in Java follows this list.)
  2. After the members start up, the oldest member, typically a locator, assumes the role of membership coordinator. Peer discovery occurs as members come up, and the members generate a membership discovery list for the cluster. Locators hand out this list as each member process starts up. The list typically contains a hint about which member is the current membership coordinator.
  3. Members join and if necessary, depart the cluster:

    • Member processes request to join the cluster through the coordinator. If the member is authenticated, the coordinator creates a new membership view, hands the new membership view to the new member, and begins distributing the new membership view (which adds the new member or members) by sending a view preparation message to the existing members in the view.
    • While members are joining the system, other members may be leaving or being removed through the normal failure detection process, which removes unresponsive or slow members. See Managing Slow Receivers and Failure Detection and Membership Views for descriptions of the failure detection process. If a new membership view that includes one or more failed processes is sent out, the coordinator logs the new weight calculations. At any point, if loss of quorum is detected due to unresponsive processes, the coordinator also logs a severe-level message that identifies the failed processes:

      Possible loss of quorum detected due to loss of {0} cache processes: {1}
      

      where {0} is the number of processes that failed and {1} lists the processes.

  4. Whenever the coordinator is notified of a membership change (a member joins or leaves the cluster), the coordinator generates a new membership view. The membership view is generated by a two-phase protocol:

    1. In the first phase, the membership coordinator sends a view preparation message to all members and waits for a view preparation acknowledgement from each member. If the coordinator does not receive an acknowledgement from a member within the timeout period described below, the coordinator attempts to connect to the member’s failure-detection socket. If that connection attempt also fails, the coordinator declares the member dead and restarts the membership view protocol from the beginning.

      The timeout period for acknowledgement of a view change, the view ack timeout, is based on the value of the member-timeout system property and defaults to about 12 seconds (12437 ms). The allowable range for the view ack timeout is 1500 ms to 12437 ms. (A member-timeout configuration sketch follows this list.)

    2. In the second phase, the coordinator sends out the new membership view to all members that acknowledged the view preparation message or passed the connection test.
  5. Each time the membership coordinator sends a view, each member calculates the total weight of members in the current membership view and compares it to the total weight of the previous membership view. Some conditions to note:

    • When the first membership view is sent out, there are no accumulated losses. The first view only has additions.
    • A new coordinator may have a stale view of membership if it did not see the last membership view sent by the previous (failed) coordinator. If new members were added during that failure, then the new members may be ignored when the first new view is sent out.
    • If members were removed during the failover to the new coordinator, then the new coordinator has to determine these losses during the view preparation step.
  6. When enable-network-partition-detection has its default value of true, any member that detects that the total membership weight has dropped below 51% within a single membership view change (loss of quorum) declares a network partition event. The coordinator sends a network-partitioned-detected UDP message to all members (even the unresponsive ones) and then closes the cluster with a ForcedDisconnectException. If a member fails to receive the message before the coordinator closes the system, the member is responsible for detecting the event on its own.
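
As referenced in step 1, the sketch below illustrates the recommended startup order using the Java launcher APIs. It is an illustration only: the member names, ports, and locator address are placeholders, and each launcher would normally run in its own process (or be started with gfsh).

    import org.apache.geode.distributed.LocatorLauncher;
    import org.apache.geode.distributed.ServerLauncher;

    // Process 1: start the locator first.
    public class StartLocatorFirst {
        public static void main(String[] args) {
            new LocatorLauncher.Builder()
                .setMemberName("locator1")   // placeholder name
                .setPort(10334)              // placeholder port
                .build()
                .start();
        }
    }

    // Process 2: start cache servers second, pointing them at the locator.
    class StartServerSecond {
        public static void main(String[] args) {
            new ServerLauncher.Builder()
                .setMemberName("server1")             // placeholder name
                .setServerPort(40404)                 // placeholder port
                .set("locators", "localhost[10334]")  // must match the locator above
                .build()
                .start();
        }
    }

    // Other members, such as applications that access cluster data, start last.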
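
As referenced in step 4, the view ack timeout is based on the member-timeout property, which can be tuned like any other distributed system property. A minimal sketch follows; the 10000 ms value, member name, and locator address are examples only.

    import org.apache.geode.distributed.ServerLauncher;

    public class MemberTimeoutSketch {
        public static void main(String[] args) {
            new ServerLauncher.Builder()
                .setMemberName("server-timeout-demo")   // placeholder name
                .set("member-timeout", "10000")         // milliseconds; the default is 5000
                .set("locators", "localhost[10334]")    // placeholder locator
                .build()
                .start();
        }
    }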

The presumption is that when a network partition is declared, the members that comprise a quorum will continue operations. The surviving members elect a new coordinator, designate a lead member, and so on.
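
On the other side of the partition, members that lose quorum are shut down with a ForcedDisconnectException, which an application typically observes as a closed cache. The sketch below shows one way such an application might react, assuming auto-reconnect is left enabled (the default); the locator address and wait time are placeholders.

    import java.util.concurrent.TimeUnit;
    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheClosedException;
    import org.apache.geode.cache.CacheFactory;

    public class ForcedDisconnectSketch {
        public static void main(String[] args) throws Exception {
            Cache cache = new CacheFactory()
                .set("locators", "localhost[10334]")  // placeholder locator
                .create();
            try {
                // ... normal cache work ...
            } catch (CacheClosedException e) {
                // A forced disconnect (for example, after losing quorum) closes the cache.
                if (cache.isReconnecting()) {
                    // With auto-reconnect enabled, wait for the member to rejoin
                    // the cluster and then pick up the new cache instance.
                    boolean reconnected = cache.waitUntilReconnected(5, TimeUnit.MINUTES);
                    if (reconnected) {
                        cache = cache.getReconnectedCache();
                    }
                }
            }
        }
    }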