Pivotal GemFire® v8.2

Handling Forced Cache Disconnection Using Autoreconnect

A GemFire member may be forcibly disconnected from a GemFire distributed system if the member is unresponsive for a period of time, or if a network partition separates one or more members into a group that is too small to act as the distributed system.

How the Autoreconnection Process Works

After being disconnected from a distributed system, a GemFire member shuts down and then automatically restarts into a "reconnecting" state, in which it periodically attempts to rejoin the distributed system by contacting a list of known locators. If the member succeeds in reconnecting to a known locator, the member rebuilds its view of the distributed system from existing members and receives a new distributed system ID.

If the member cannot connect to a known locator, it checks whether it is itself a locator (or hosts an embedded locator process) or whether multicast discovery is being used instead of locators. If the member is a locator, or if multicast discovery is available, the member performs a quorum-based reconnect: it attempts to contact a quorum of the members that were in the membership view just before it became disconnected. If a quorum of members can be contacted, startup of the distributed system is allowed to begin. Because the reconnecting member does not know which members survived the network partition event, all members in a reconnecting state keep their UDP unicast ports open and respond to ping requests.

Membership quorum is determined using the same member weighting system used in network partition detection. See Membership Coordinators, Lead Members and Member Weighting.
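
As a rough illustration of a weighted quorum check, the sketch below sums the weights of the members that responded and compares that sum with the total weight of the last known membership view. The class name, the member names, the weight values, and the strict-majority threshold are all assumptions made for this example; they are not taken from GemFire's implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the GemFire API.
class QuorumSketch {

    /**
     * Returns true when the responding members hold more than half of the
     * total weight of the last known membership view.
     * The "more than half" threshold is an assumption of this sketch.
     */
    static boolean hasQuorum(Map<String, Integer> viewWeights, Set<String> responders) {
        int total = 0;
        int reachable = 0;
        for (Map.Entry<String, Integer> entry : viewWeights.entrySet()) {
            total += entry.getValue();
            if (responders.contains(entry.getKey())) {
                reachable += entry.getValue();
            }
        }
        return total > 0 && reachable * 2 > total;
    }

    public static void main(String[] args) {
        // Last membership view before the disconnect (illustrative weights).
        Map<String, Integer> view = new HashMap<>();
        view.put("locator1", 3);
        view.put("server1", 15);
        view.put("server2", 10);
        view.put("server3", 10);

        // Members that answered the reconnecting member's pings.
        Set<String> responders = new HashSet<>();
        responders.add("server1");
        responders.add("server2");

        System.out.println("quorum reached: " + hasQuorum(view, responders));
    }
}
```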

Note that when a locator is in the reconnecting state, it provides no discovery services for the distributed system.

After the cache has reconnected, applications must fetch a reference to the new Cache, Regions, DistributedSystem and other artifacts. Old references will continue to throw cancellation exceptions like CacheClosedException(cause=ForcedDisconnectException).
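
One way to avoid holding stale references is to route all access through a small holder that swaps in the reconnected cache when the old one has closed. The sketch below shows the idea against a hypothetical `CacheHandle` interface that stands in for the relevant part of the Cache API, so that it compiles without the GemFire library. `CacheHandle`, `CacheRef`, and `FakeCache` are invented for this example, and the sketch assumes that `getReconnectedCache()` returns null until a reconnect has completed.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the reconnect-related part of the Cache API.
interface CacheHandle {
    boolean isClosed();
    CacheHandle getReconnectedCache(); // assumed: null until a reconnect completes
}

// Holds the current cache reference and replaces it after a reconnect.
class CacheRef {
    private final AtomicReference<CacheHandle> current;

    CacheRef(CacheHandle initial) {
        this.current = new AtomicReference<>(initial);
    }

    CacheHandle get() {
        CacheHandle seen = current.get();
        if (seen.isClosed()) {
            CacheHandle fresh = seen.getReconnectedCache();
            if (fresh != null) {
                // Swap only if no other thread has already replaced the reference.
                current.compareAndSet(seen, fresh);
                return current.get();
            }
        }
        return seen; // still valid, or closed with no replacement yet
    }
}

// Test double that lets the example simulate a forced disconnect.
class FakeCache implements CacheHandle {
    volatile boolean closed;
    volatile CacheHandle reconnected;

    public boolean isClosed() { return closed; }
    public CacheHandle getReconnectedCache() { return reconnected; }
}

class CacheRefDemo {
    public static void main(String[] args) {
        FakeCache old = new FakeCache();
        CacheRef ref = new CacheRef(old);

        old.closed = true;                 // simulate the forced disconnect
        old.reconnected = new FakeCache(); // simulate a completed reconnect

        System.out.println("reference refreshed: " + (ref.get() != old));
    }
}
```

In a real application, Region and other references obtained from the old cache would need to be looked up again from the new one in the same way.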

See the GemFire DistributedSystem and Cache Java API documentation for more information.

Managing the Autoreconnection Process

By default, a GemFire member keeps trying to reconnect until it is told to stop through the DistributedSystem.stopReconnecting() or Cache.stopReconnecting() method. You can disable automatic reconnection entirely by setting the disable-auto-reconnect GemFire property to "true".
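
A minimal sketch of setting that property programmatically is shown below. It uses only java.util.Properties so that it runs without the GemFire library; the class and method names are hypothetical, and how the properties reach the member (for example, through a cache factory or a gemfire.properties file) depends on how your application starts GemFire.

```java
import java.util.Properties;

// Hypothetical helper for illustration only.
class AutoReconnectConfig {

    // Builds member properties with automatic reconnection turned off.
    static Properties withAutoReconnectDisabled() {
        Properties props = new Properties();
        props.setProperty("disable-auto-reconnect", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties props = withAutoReconnectDisabled();
        // In a real member these properties would be supplied when the cache
        // or distributed system is created. That step is omitted here so the
        // sketch has no GemFire dependency.
        System.out.println("disable-auto-reconnect="
                + props.getProperty("disable-auto-reconnect"));
    }
}
```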

You can use DistributedSystem and Cache callback methods to perform actions during the reconnect process, or to cancel the reconnect process if necessary.

The DistributedSystem and Cache APIs provide several methods that you can use to take action while a member is reconnecting to the distributed system:

  • DistributedSystem.isReconnecting() returns true if the member is in the process of reconnecting and recreating the cache after having been removed from the system by other members, or has shut down due to missing Roles and is reconnecting.
  • DistributedSystem.waitUntilReconnected(long, TimeUnit) waits for a period of time, and then returns a boolean value to indicate whether the member has reconnected to the DistributedSystem. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the member shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
  • DistributedSystem.getReconnectedSystem() returns the reconnected DistributedSystem.
  • DistributedSystem.stopReconnecting() stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
  • Cache.isReconnecting() returns true if the cache is attempting to reconnect to a distributed system.
  • Cache.waitUntilReconnected(long, TimeUnit) waits for a period of time, and then returns a boolean value to indicate whether the DistributedSystem has reconnected. Use a value of -1 seconds to wait indefinitely until the reconnect completes or the cache shuts down. Use a value of 0 seconds as a quick probe to determine if the member has reconnected.
  • Cache.getReconnectedCache() returns the reconnected Cache.
  • Cache.stopReconnecting() stops the reconnection process and ensures that the DistributedSystem stays in a disconnected state.
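
The methods above can be combined into a simple policy: wait for the reconnect in bounded slices, and stop reconnecting if it has not completed after a fixed number of attempts. The sketch below illustrates this against a hypothetical `Reconnectable` interface that mirrors the DistributedSystem method names listed above, so that it compiles without the GemFire library. `Reconnectable`, `ReconnectPolicy`, `FakeMember`, and the attempt and slice parameters are assumptions made for this example.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical mirror of the reconnect methods described above.
interface Reconnectable {
    boolean isReconnecting();
    boolean waitUntilReconnected(long time, TimeUnit units) throws InterruptedException;
    void stopReconnecting();
}

class ReconnectPolicy {

    /**
     * Waits up to maxAttempts slices for the reconnect to finish.
     * Returns true if the member reconnected. Otherwise stops the
     * reconnect process and returns false.
     */
    static boolean awaitOrGiveUp(Reconnectable member, int maxAttempts, long sliceMillis) {
        try {
            for (int i = 0; i < maxAttempts && member.isReconnecting(); i++) {
                if (member.waitUntilReconnected(sliceMillis, TimeUnit.MILLISECONDS)) {
                    return true;
                }
            }
            // Quick probe with a zero timeout, in case the reconnect finished
            // between the last wait and this check.
            if (member.waitUntilReconnected(0, TimeUnit.SECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        member.stopReconnecting();
        return false;
    }
}

// Test double that reports success after a fixed number of timed waits.
class FakeMember implements Reconnectable {
    private int waitsUntilSuccess;
    boolean reconnected;
    boolean stopped;

    FakeMember(int waitsUntilSuccess) {
        this.waitsUntilSuccess = waitsUntilSuccess;
    }

    public boolean isReconnecting() {
        return !reconnected && !stopped;
    }

    public boolean waitUntilReconnected(long time, TimeUnit units) {
        // A zero timeout is a pure probe and does not advance the simulation.
        if (!reconnected && !stopped && time != 0 && --waitsUntilSuccess <= 0) {
            reconnected = true;
        }
        return reconnected;
    }

    public void stopReconnecting() {
        stopped = true;
    }
}

class ReconnectPolicyDemo {
    public static void main(String[] args) {
        System.out.println(ReconnectPolicy.awaitOrGiveUp(new FakeMember(2), 5, 1L));
        System.out.println(ReconnectPolicy.awaitOrGiveUp(new FakeMember(10), 3, 1L));
    }
}
```

Bounding the wait is a design choice for this sketch; passing -1 instead waits indefinitely, as described in the list above.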

Operator Intervention

You may need to intervene in the autoreconnection process if processes or hardware have crashed or were otherwise shut down before the network connection is healed. In this case, the members in a "reconnecting" state cannot find the lost processes through UDP probes and will not rejoin the system until they are able to contact a locator. If multicast discovery is being used, the members in a "reconnecting" state must be bounced (restarted) in order to rejoin.