HTTP plug-in failover

Overview

The HTTP plug-in provides failover when it can no longer send requests to a particular cluster member. Conditions under which the plug-in marks a cluster member down include:

The plug-in is unable to establish a connection to a cluster member's application server transport.

The plug-in detects a newly connected socket that was prematurely closed by a cluster member during an active read or write.

ConnectTimeout

The ConnectTimeout attribute lets the plug-in perform non-blocking connections with a backend cluster member. This is beneficial when the plug-in cannot contact the destination to determine whether the port is available. For example:
<Server CloneID="10k66djk2" 
        ConnectTimeout="10" 
        ExtendedHandshake="false" 
        LoadBalanceWeight="1000" 
        MaxConnections="0" 
        Name="Server1_WebSphere_Appserver" 
        WaitForContinue="false"> 

    <Transport Hostname="server1.domain.com" 
               Port="9091" 
               Protocol="http"/>
</Server>
If ConnectTimeout is not set, the plug-in performs a blocking connect: it waits until the operating system TCP timeout occurs (as long as 2 minutes, depending on the platform) before it can mark the cluster member unavailable. If set to 0, the plug-in also performs a blocking connect. If set to a value greater than 0, the value is the number of seconds the plug-in waits for a successful connection. If no connection is made within that interval, the plug-in marks the cluster member unavailable and fails over to one of the other cluster members defined in the cluster.
In an environment with a heavy workload or a slow network connection, setting this value too low can cause the plug-in to mark a cluster member down falsely, so choose the ConnectTimeout value with caution.
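As a minimal sketch of the cases described above, the following fragment contrasts a Server element that uses a blocking connect with one that uses a non-blocking connect. The server names, host names, and the 5-second value are hypothetical and chosen only for illustration; the other attributes shown in the example above are left out for brevity.

```xml
<!-- Blocking connect: ConnectTimeout="0" behaves like an unset value,
     so the plug-in waits for the operating system TCP timeout. -->
<Server Name="ServerA_Hypothetical" ConnectTimeout="0">
    <Transport Hostname="servera.domain.com" Port="9091" Protocol="http"/>
</Server>

<!-- Non-blocking connect: the plug-in waits up to 5 seconds (hypothetical
     value), then marks the member unavailable and fails over. -->
<Server Name="ServerB_Hypothetical" ConnectTimeout="5">
    <Transport Hostname="serverb.domain.com" Port="9091" Protocol="http"/>
</Server>
```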

ServerIOTimeout

Timeout value, in seconds, for sending requests to, and reading responses from, a cluster member. If not set, the plug-in by default uses blocking I/O to write requests to and read responses from the cluster member until the TCP connection times out. For example, if we specify:
<Server CloneID="10k66djk2" 
        ServerIOTimeout="120" 
        ConnectTimeout="10" 
        ExtendedHandshake="false" 
        LoadBalanceWeight="1000" 
        MaxConnections="0" 
        Name="Server1_WebSphere_Appserver" 
        WaitForContinue="false">

    <Transport Hostname="server1.domain.com" 
               Port="9091" 
               Protocol="http"/>
</Server>
In this case, if a cluster member stops responding to requests, the plug-in waits 120 seconds (2 minutes) before timing out the TCP connection. Setting the ServerIOTimeout attribute to a reasonable value enables the plug-in to time out the connection sooner, and transfer requests to another cluster member when possible.
It might take a couple of minutes for a cluster member to process a request. Setting the value of the ServerIOTimeout attribute too low could cause the plug-in to send a false server error response to the client.
ServerIOTimeout is ideal for situations where Keep-Alive connections exist between WAS and the plug-in, and the application server machine is abruptly disconnected from the network. Without ServerIOTimeout, the plug-in would take a long time to detect that the connection was closed abruptly on the WAS machine.
When an application server host machine is shut down abruptly, the Keep-Alive connections between the plug-in and the application server might not be closed cleanly. As a result, when the plug-in needs to route a request to that host, it uses an existing Keep-Alive connection if one is in the pool. Because the host was taken down abruptly, the plug-in machine does not receive any TCP packets to close the connection, and the plug-in's request write does not return a failure until the connection times out at the TCP level. The plug-in then tries to contact the same application server by establishing a new connection, and the connect() call fails only after the TCP timeout.
As a result, depending on the operating system TCP timeout setting, it can take a considerable amount of time for the plug-in to detect the application server's status and mark it down before failing over to another application server. If many requests are sent to the server during this time, every one of them incurs this delay.
When both ConnectTimeout and ServerIOTimeout are specified, it could take as long as (ConnectTimeout + ServerIOTimeout) for the plug-in to detect and mark a server down.
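To make the worst case concrete, the fragment below reuses the values from the earlier ServerIOTimeout example and works through the sum in a comment. It only illustrates the formula above and is not a measured result; attributes that do not affect the calculation are left out.

```xml
<!-- Worst-case time to detect and mark this server down:
     ConnectTimeout + ServerIOTimeout = 10 + 120 = 130 seconds -->
<Server CloneID="10k66djk2"
        ServerIOTimeout="120"
        ConnectTimeout="10"
        Name="Server1_WebSphere_Appserver">
    <Transport Hostname="server1.domain.com"
               Port="9091"
               Protocol="http"/>
</Server>
```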

RetryInterval

Length of time that should elapse from the time that a server is marked down to the time that the plug-in will retry a connection. The default is 60 seconds.
RetryInterval is specified in the ServerCluster element:
<ServerCluster CloneSeparatorChange="false" 
               LoadBalance="Round Robin"
               Name="Server_WebSphere_Cluster" 
               PostSizeLimit="10000000" 
               RemoveSpecialHeaders="true" 
               RetryInterval="120"> 
This would mean that if a cluster member were marked as down, the plug-in would not retry it for 120 seconds.
There is no way to recommend one specific value; the value chosen depends on your environment. For example, if you have numerous cluster members, and one cluster member being unavailable does not affect the performance of your application, then you can safely set the value to a very high number.
Alternatively, if your optimum load has been calculated assuming that all cluster members are available, or if you do not have many cluster members, then you will want your cluster members to be retried more often to maintain the load.
Also, take into consideration the time it takes to restart your server. If a server takes a long time to boot up and load applications, then you will need a longer retry interval.
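As a hedged illustration of the last point, suppose a cluster member takes roughly 4 minutes to restart and load its applications (a hypothetical figure). A retry interval somewhat longer than that avoids retrying a server that cannot be ready yet. The 300-second value below is an assumption for this scenario, not a recommendation.

```xml
<!-- Hypothetical scenario: restart takes about 240 seconds, so the
     plug-in waits 300 seconds before retrying a member marked down. -->
<ServerCluster CloneSeparatorChange="false"
               LoadBalance="Round Robin"
               Name="Server_WebSphere_Cluster"
               PostSizeLimit="10000000"
               RemoveSpecialHeaders="true"
               RetryInterval="300">
    <!-- Server, PrimaryServers, and BackupServers elements go here. -->
</ServerCluster>
```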

PrimaryServers versus BackupServers

The plug-in can be configured for true failover by using the PrimaryServers and BackupServers elements in the plugin-cfg.xml configuration file.
In the following example, the plug-in load balances only between Server1_WebSphere_Appserver and Server2_WebSphere_Appserver, the two servers defined in the PrimaryServers element. If both Server1_WebSphere_Appserver and Server2_WebSphere_Appserver become unavailable and are marked down, the plug-in fails over and starts sending requests to Server3_WebSphere_Appserver, which is defined in the BackupServers element.
<ServerCluster CloneSeparatorChange="false" 
               LoadBalance="Round Robin"
               Name="Server_WebSphere_Cluster" 
               PostSizeLimit="10000000" 
               RemoveSpecialHeaders="true" 
               RetryInterval="120">

    <Server CloneID="10k66djk2" 
            ServerIOTimeout="120" 
            ConnectTimeout="10" 
            ExtendedHandshake="false" 
            LoadBalanceWeight="1000" 
            MaxConnections="0" 
            Name="Server1_WebSphere_Appserver" 
            WaitForContinue="false">

        <Transport Hostname="server1.domain.com" 
                   Port="9091" 
                   Protocol="http"/>
    </Server>

    <Server CloneID="10k67eta9" 
            ServerIOTimeout="120" 
            ConnectTimeout="10" 
            ExtendedHandshake="false" 
            LoadBalanceWeight="999" 
            MaxConnections="0" 
            Name="Server2_WebSphere_Appserver" 
            WaitForContinue="false">

        <Transport Hostname="server2.domain.com" 
                   Port="9091" 
                   Protocol="http"/>

    </Server>

    <Server CloneID="10k68xtw10" 
            ServerIOTimeout="120" 
            ConnectTimeout="10" 
            ExtendedHandshake="false" 
            LoadBalanceWeight="998" 
            MaxConnections="0" 
            Name="Server3_WebSphere_Appserver" 
            WaitForContinue="false">

        <Transport Hostname="server3.domain.com" 
                   Port="9091" 
                   Protocol="http"/>

    </Server>

    <PrimaryServers>
        <Server Name="Server1_WebSphere_Appserver"/>
        <Server Name="Server2_WebSphere_Appserver"/>
    </PrimaryServers>
    <BackupServers>
        <Server Name="Server3_WebSphere_Appserver"/>
    </BackupServers>
    
</ServerCluster>