Core group Failure Detection Protocol

Operating Systems: i5/OS

When a core group member starts, a task running the Failure Detection Protocol also starts. This task runs as long as the member is active. The Failure Detection Protocol monitors the core group network connections that the Discovery Protocol establishes. When the Failure Detection Protocol detects a failed network connection, it reports the failure to the View Synchrony Protocol and the Discovery Protocol. The View Synchrony Protocol adjusts the view to exclude the failed member. The Discovery Protocol attempts to reestablish a network connection with the failed member. The Failure Detection Protocol uses two distinct mechanisms to find failed members:

It looks for connections that closed because the underlying socket was closed.
It listens for active heartbeats from the core group members.

Sockets closing

When a core group member stops normally in response to an administration command, the core group transport for that member also stops, and the socket that is associated with the transport closes. If a core group member terminates abnormally, the underlying operating system normally closes the sockets that the process opened, and the socket that is associated with the core group transport is closed.

For either type of termination, core group members that have an open connection to the terminated member are notified that the connection is no longer usable. The core group member that receives the socket closed notification considers the terminated member a failed member. When a failed member is detected because of the socket closing mechanism, one or more of the following messages are logged in the SystemOut.log file for the surviving members:

DCSV1113W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: 
Suspected another member because the outgoing connection to the other member was closed. 
Suspected member is anzioCell01\nettuno\ServerB. DCS logical channel is View|Ptp.

DCSV1111W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: 
Suspected another member because the outgoing connection from the other member was closed. 
Suspected members is anzioCell01\nettuno\ServerB. DCS logical channel is Connected|Ptp.

The closed socket mechanism is the way that failed members are typically discovered. TCP settings in the underlying operating system, such as FIN_WAIT, affect how quickly socket closing events are received.
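The following minimal sketch illustrates the general idea behind the closed socket mechanism. It is not the core group transport implementation. It uses a connected local socket pair as a stand-in for a core group connection, and the names peer_closed, survivor, and terminated are hypothetical. On a TCP or stream socket, a read that returns no bytes indicates that the remote end closed the connection.

```python
import socket


def peer_closed(conn: socket.socket) -> bool:
    """Return True if the remote end has closed the connection."""
    data = conn.recv(1024)  # blocks until data arrives or the peer closes
    return data == b""      # an empty read signals end of stream


# A connected socket pair stands in for a connection between two members.
survivor, terminated = socket.socketpair()

terminated.close()              # the "terminated member" goes away
closed = peer_closed(survivor)  # the surviving side observes the closure
survivor.close()

print("suspect peer:", closed)
```

In this sketch the surviving side learns of the closure as soon as it reads from the socket, which is why detection through this mechanism is usually fast when the operating system closes the socket promptly.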

Active heart beating

The active heart beating mechanism is analogous to the TCP KeepAlive function. At regularly scheduled intervals, each core group member sends a ping packet on every open core group connection. If the packet is acknowledged, the connection is assumed to be healthy. If no response is received from a given member for a certain number of consecutive pings, the member is marked as failed. When a member is marked as failed, the following message is logged:

DCSV1112W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: 
Suspected member anzioCell01\nettuno\ServerB because of heartbeat timeout. 
Configured Timeout is 180000 milliseconds. DCS logical channel is Connected|Ptp.
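When you scan a SystemOut.log file for suspected members, a small script can extract the reporting member, the suspected member, and the configured timeout from the DCSV1111W, DCSV1112W, and DCSV1113W messages. The following sketch is an illustration only. The regular expressions are assumptions that are derived solely from the sample messages in this topic, and the function name parse_suspect is hypothetical. Actual log lines might carry additional prefixes, such as time stamps, that this sketch ignores.

```python
import re
from typing import Optional

# Patterns are assumptions based only on the sample messages in this topic.
SUSPECT_RE = re.compile(
    r"(?P<msg_id>DCSV111[123]W): DCS Stack (?P<core_group>\S+) "
    r"at Member (?P<reporter>\S+):"
    r".*?Suspected members? (?:is )?(?P<suspect>\S+?)(?:\.(?:\s|$)| because)"
)
TIMEOUT_RE = re.compile(r"Configured Timeout is (\d+) milliseconds")


def parse_suspect(message: str) -> Optional[dict]:
    """Extract fields from one suspect message, or return None."""
    text = " ".join(message.split())  # collapse line breaks and extra spaces
    match = SUSPECT_RE.search(text)
    if match is None:
        return None
    record = match.groupdict()
    timeout = TIMEOUT_RE.search(text)
    # Only the heartbeat message carries a timeout value.
    record["timeout_ms"] = int(timeout.group(1)) if timeout else None
    return record


sample = (
    r"DCSV1112W: DCS Stack DefaultCoreGroup at Member "
    r"anzioCell01\anzioCellManager01\dmgr: "
    "\n"
    r"Suspected member anzioCell01\nettuno\ServerB because of heartbeat timeout. "
    "\n"
    r"Configured Timeout is 180000 milliseconds. "
    r"DCS logical channel is Connected|Ptp."
)
record = parse_suspect(sample)
print(record)
```

The message ID distinguishes the detection mechanism: in the samples in this topic, DCSV1112W reports a heartbeat timeout, and DCSV1111W and DCSV1113W report closed connections.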

Active heartbeats are most useful for detecting core group members that are unreachable because the network is stopped. Active heartbeats consume CPU, and the amount that is consumed is proportional to the number of active members in the core group. The default configuration for active heartbeats balances CPU usage against timely detection of failed members. You can use the following core group custom properties to change the settings for active heartbeats:

IBM_CS_FD_PERIOD_SECS, which specifies the time interval, in seconds, between consecutive heartbeats. The default value for this property is 30 seconds.
IBM_CS_FD_CONSECUTIVE_MISSED, which specifies the number of consecutive heartbeats that must be missed before the core group member is considered failed. The default value for this property is 6.
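The two properties together determine how long an unresponsive member can go undetected. With the default values, 30 seconds multiplied by 6 missed heartbeats gives 180 seconds, which is consistent with the 180000 millisecond timeout in the sample DCSV1112W message. The following sketch models the consecutive-miss counting under that assumption. The HeartbeatMonitor class and its methods are hypothetical and do not represent the product implementation.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    # Field defaults mirror the documented property defaults.
    period_secs: int = 30         # IBM_CS_FD_PERIOD_SECS
    consecutive_missed: int = 6   # IBM_CS_FD_CONSECUTIVE_MISSED
    missed: dict = field(default_factory=dict)

    @property
    def timeout_ms(self) -> int:
        """Assumed detection time: heartbeat period times the miss threshold."""
        return self.period_secs * self.consecutive_missed * 1000

    def record_ping(self, member: str, acknowledged: bool) -> bool:
        """Record one ping result; return True if the member is now failed."""
        if acknowledged:
            self.missed[member] = 0  # any acknowledgment resets the count
            return False
        self.missed[member] = self.missed.get(member, 0) + 1
        return self.missed[member] >= self.consecutive_missed


monitor = HeartbeatMonitor()
# Six unacknowledged pings in a row for the same member.
results = [monitor.record_ping("ServerB", acknowledged=False) for _ in range(6)]
print(monitor.timeout_ms, results)
```

Lowering either value shortens the detection time in this model but sends heartbeats more often or tolerates fewer missed responses, which reflects the trade-off between CPU usage and timely detection.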

Related concepts

Core group Discovery Protocol
Core groups (high availability domains)