Apache Geode CHANGELOG

Configure Apache Geode to Handle Network Partitioning

This section lists configuration considerations relating to network partition detection.

The system uses a combination of member coordinators and system members, designated as lead members, to detect and resolve network partitioning problems.

  • Network partition detection works in all environments. Using multiple locators mitigates the effect of network partitioning. See Configuring Peer-to-Peer Discovery.

  • Network partition detection is enabled by default. The default setting in the gemfire.properties file is

    enable-network-partition-detection=true
    

    Processes that do not have network partition detection enabled are not eligible to be the lead member, so their failure will not trigger declaration of a network partition.

    All system members should have the same setting for enable-network-partition-detection. If they do not, the system throws a GemFireConfigException upon startup.

  • The property enable-network-partition-detection must be true if you are using either partitioned or persistent regions. If you create a persistent region and enable-network-partition-detection to set to false, you will receive the following warning message:

    Creating persistent region {0}, but enable-network-partition-detection is set to false.
          Running with network partition detection disabled can lead to an unrecoverable system in the
          event of a network split."
    
  • Configure regions you want to protect from network partitioning with a scope setting of DISTRIBUTED_ACK or GLOBAL. Do not use DISTRIBUTED_NO_ACK scope. This prevents operations from being performed throughout the cluster before a network partition is detected. Note: Geode issues an alert if it detects DISTRIBUTED_NO_ACK regions when network partition detection is enabled:

    Region {0} is being created with scope {1} but enable-network-partition-detection is enabled in the distributed system. 
    This can lead to cache inconsistencies if there is a network failure.
    
    
  • These other configuration parameters affect or interact with network partitioning detection. Check whether they are appropriate for your installation and modify as needed.

    • If you have network partition detection enabled, the threshold percentage value for allowed membership weight loss is automatically configured to 51. You cannot modify this value. Note: The weight loss calculation uses round to nearest. Therefore, a value of 50.51 is rounded to 51 and will cause a network partition.
    • Failure detection is initiated if a member’s ack-wait-threshold (default is 15 seconds) and ack-severe-alert-threshold (15 seconds) properties elapse before receiving a response to a message. If you modify the ack-wait-threshold configuration value, you should modify ack-severe-alert-threshold to match the other configuration value.
    • If the system has clients connecting to it, the clients’ cache.xml pool read-timeout should be set to at least three times the member-timeout setting in the server’s gemfire.properties file. The default pool read-timeout setting is 10000 milliseconds.
    • You can adjust the default weights of members by specifying the system property gemfire.member-weight upon startup. For example, if you have some VMs that host a needed service, you could assign them a higher weight upon startup.
  • By default, members that are forced out of the cluster by a network partition event will automatically restart and attempt to reconnect. Data members will attempt to reinitialize the cache. See Handling Forced Cache Disconnection Using Autoreconnect.