Failure detection in Sentinel is ping-pong based. It used to work by
remembering the last time a valid PONG reply was received, and checking
if the reception time was too old compared to the current current time.
PINGs were sent at a fixed interval of 1 second.
This works in a decent way, but does not scale well when we want to set
very small values of "down-after-milliseconds" (this is the node
timeout basically).
This commit reiplements the failure detection making a number of
changes. Some changes are inspired to Redis Cluster failure detection
code:
* A new last_ping_time field is added in representation of instances.
If non zero, we have an active ping that was sent at the specified
time. When a valid reply to ping is received, the field is zeroed
again.
* last_ping_time is not reset when we reconnect the link or send a new
ping, so from our point of view it represents the time we started
waiting for the instance to reply to our pings without receiving a
reply.
* last_ping_time is now used in order to check if the instance is
timed out. This means that we can have a node timeout of 100
milliseconds and yet the system will work well since the new check is
not bound to the period used to send pings.
* Pings are now sent every second, or often if the value of
down-after-milliseconds is less than one second. With a lower limit of
10 HZ ping frequency.
* Link reconnection code was improved. This is used in order to try to
reconnect the link when we are at 50% of the node timeout without a
valid reply received yet. However the old code triggered unnecessary
reconnections when the node timeout was very small. Now that should be
ok.
The new code passes the tests but more testing is needed and more unit
tests stressing the failure detector, so currently this is merged only
in the unstable branch.