When a core group member starts, a task running the Failure Detection Protocol also starts. This task runs as long as the member is active. The Failure Detection Protocol monitors the core group network connections that the Discovery Protocol establishes. When the Failure Detection Protocol detects a failed network connection, it reports the failure to the View Synchrony Protocol and the Discovery Protocol. The View Synchrony Protocol adjusts the view to exclude the failed member. The Discovery Protocol attempts to reestablish a network connection with the failed member. The Failure Detection Protocol uses two distinct mechanisms to find failed members:
When a core group member normally stops in response to an administration command, the core group transport for that member also stops, and the socket that is associated with the transport closes. If a core group member terminates abnormally, the underlying operating system normally closes the sockets that the process opened and the socket associated with the core group transport. is closed.
For either type of termination, core group members that have an open connection to the terminated member are notified that the connection is no longer usable. The core group member that receives the socket closed notification considers the terminated member a failed member. When a failed member is detected because of the socket closing mechanism, one or more of the following messages are logged in the SystemOut.log file for the surviving members:
DCSV1113W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: Suspected another member because the outgoing connection to the other member was closed. Suspected member is anzioCell01\nettuno\ServerB. DCS logical channel is View|Ptp. DCSV1111W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: Suspected another member because the outgoing connection from the other member was closed. Suspected members is anzioCell01\nettuno\ServerB. DCS logical channel is Connected|Ptp.
The closed socket mechanism is the way that failed members are typically discovered. TCP settings in the underlying operating system, such as FIN_WAIT, affect how quickly socket closing events are received.
DCSV1112W: DCS Stack DefaultCoreGroup at Member anzioCell01\anzioCellManager01\dmgr: Suspected member anzioCell01\nettuno\ServerB because of heartbeat timeout. Configured Timeout is 180000 milliseconds. DCS logical channel is Connected|Ptp.Active heartbeats are most useful for detecting core group members that are unreachable because the network is stopped. Active heartbeats consume some CPU usage. The amount of CPU usage that is consumed is proportional to the number of active members in the core group. The default configuration for active heartbeats is a balance of CPU usage and timely failed member detection. You can use the following core group custom properties to change the settings for active heartbeats: