64 Commits

Author SHA1 Message Date
antirez
8c551d65a1 Sentinel: make sure role_reported is always updated. 2013-11-21 15:20:15 +01:00
antirez
9577fed8a3 Sentinel: track role change time. Wait before reconfigurations. 2013-11-21 15:20:11 +01:00
antirez
fc10fb17da Sentinel: fix no-down check in master->slave conversion code. 2013-11-21 15:20:07 +01:00
antirez
b02ef3d59a Sentinel: readd slaves back after a master reset. 2013-11-21 15:19:56 +01:00
antirez
1a1dc3de38 Sentinel: sentinelResetMaster() new flag to avoid removing set of sentinels.
This commit also removes some dead code and cleanup generic flags.
2013-11-21 15:19:52 +01:00
antirez
9f0e52a13d Sentinel: receive Pub/Sub messages from slaves. 2013-11-21 15:19:49 +01:00
antirez
f7604b4c07 Sentinel: change event name when converting master to slave. 2013-11-21 15:19:45 +01:00
antirez
1569326277 Sentinel: added config-epoch to SENTINEL masters output. 2013-11-21 15:19:41 +01:00
antirez
9f20780de6 Sentinel: new failover algo, desync slaves and update config epoch. 2013-11-21 15:19:34 +01:00
antirez
2488257ad8 Sentinel: when starting failover seek for votes ASAP. 2013-11-21 15:19:30 +01:00
antirez
6593f33222 Sentinel: +new-epoch events. 2013-11-21 15:19:26 +01:00
antirez
48acc675dd Sentinel: wait some time between failover attempts. 2013-11-21 15:19:22 +01:00
antirez
447c2787a0 Sentinel: allow to vote for myself. 2013-11-21 15:19:16 +01:00
antirez
eba4775b5d Sentinel: fix PUBLISH to masters and slaves. 2013-11-21 15:19:11 +01:00
antirez
b95c6ed7b7 Sentinel: epoch introduced in leader vote. 2013-11-21 15:18:52 +01:00
antirez
663d79c0d5 Sentinel: leadership handling changes WIP.
Changes to leadership handling.

Now the leader gets selected by every Sentinel, for a specified epoch,
when the SENTINEL is-master-down-by-addr is sent.

This command now includes the runid and the currentEpoch of the instance
seeking for a vote. The Sentinel only votes a single time in a given
epoch.

Still a work in progress, does not even compile at this stage.
2013-11-21 15:18:45 +01:00
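The one-vote-per-epoch rule described in the commit above can be sketched roughly as follows. This is a minimal Python illustration, not the code in sentinel.c: the `Voter` class, its field names, and the rule that the local epoch advances to the requested one are all assumptions made for the example.

```python
class Voter:
    """Hypothetical model of a Sentinel that votes at most once per epoch."""

    def __init__(self):
        self.current_epoch = 0
        self.leader = None       # runid this Sentinel last voted for
        self.leader_epoch = -1   # epoch in which that vote was cast

    def vote(self, req_runid, req_epoch):
        """Handle a vote request and return (leader_runid, leader_epoch)."""
        # Assumption: a request carrying a newer epoch advances ours.
        if req_epoch > self.current_epoch:
            self.current_epoch = req_epoch
        # Grant the vote only if we have not voted in this epoch yet
        # and the request is not for an epoch older than ours.
        if self.leader_epoch < req_epoch and self.current_epoch <= req_epoch:
            self.leader = req_runid
            self.leader_epoch = req_epoch
        return self.leader, self.leader_epoch
```

A second request in the same epoch gets back the vote already cast instead of a new one.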
antirez
b72985d7f7 Sentinel: handle Hello messages received via slaves correctly.
Even when messages are received via the slave, we should perform
operations (like adding a new Sentinel) in the context of the master.
2013-11-21 15:18:41 +01:00
antirez
fe7f96f18c Sentinel: remove code not useful in the new design. 2013-11-21 15:18:36 +01:00
antirez
be2ef1b59f Sentinel: epoch introduced.
Sentinel state now includes the idea of current epoch and config epoch.
In the Hello message, that is now published both on masters and slaves,
a Sentinel no longer just advertises itself but also broadcasts its
current view of the configuration: the master name / ip / port and its
current epoch.

Sentinels receiving such information switch to the new master if the
configuration epoch received is newer and the ip / port of the master
are indeed different compared to the previous ones.
2013-11-21 15:18:31 +01:00
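The switch rule in the commit above (newer configuration epoch and a different ip / port) might look like the sketch below. `MasterView` and `apply_hello` are hypothetical names, and leaving the local state untouched when the address is unchanged is a simplification of this example, not something the commit states.

```python
from dataclasses import dataclass


@dataclass
class MasterView:
    """Hypothetical view of a master as carried in a Hello message."""
    name: str
    ip: str
    port: int
    config_epoch: int


def apply_hello(local: MasterView, hello: MasterView) -> bool:
    """Switch `local` to the advertised master; return True if switched."""
    if hello.name != local.name:
        return False  # Hello is about a different master
    if hello.config_epoch <= local.config_epoch:
        return False  # received configuration is not newer
    if (hello.ip, hello.port) == (local.ip, local.port):
        return False  # same address, nothing to switch to
    local.ip, local.port = hello.ip, hello.port
    local.config_epoch = hello.config_epoch
    return True
```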
antirez
c874e39c45 Sentinel: sentinelSendSlaveOf() was missing a var and the prototype. 2013-11-06 11:29:57 +01:00
antirez
1767998751 Sentinel: increment pending_commands counter in two more places.
AUTH and SCRIPT KILL were sent without incrementing the pending commands
counter. Clearly this needs some kind of wrapper doing it for the caller
in order to be less bug prone.
2013-11-06 11:29:53 +01:00
antirez
97810c45e8 Sentinel: always send CONFIG REWRITE when changing instance role.
This change makes Sentinel less fragile about a number of failure modes.

This commit also fixes a different bug as a side effect, SLAVEOF command
was sent multiple times without incrementing the pending commands count.
2013-11-06 11:29:49 +01:00
antirez
f899ab55ca sdsrange() does not need to return a value.
Actually the string is modified in-place and a reallocation is never
needed, so there is no need to return the new sds string pointer as
return value of the function, that is now just "void".
2013-07-24 11:22:52 +02:00
antirez
1e23848ed3 Sentinel: embed IPv6 address into [] when naming slave/sentinel instance. 2013-07-11 17:10:09 +02:00
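As a rough illustration of the naming change in the commit above, the sketch below wraps an IPv6 address in square brackets so that the port separator stays unambiguous. The `instance_name` helper and the `ip:port` name layout are assumptions for the example, not taken from sentinel.c.

```python
def instance_name(ip: str, port: int) -> str:
    """Build an instance name, bracketing IPv6 addresses (hypothetical helper)."""
    # An IPv6 address contains colons, which would clash with the
    # colon that separates the address from the port.
    if ":" in ip:
        return f"[{ip}]:{port}"
    return f"{ip}:{port}"
```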
antirez
076f6395b9 Sentinel: use comma as separator to publish hello messages.
We use comma to play well with IPv6 addresses, but the implementation is
still able to parse the old messages separated by colons.
2013-07-11 17:09:44 +02:00
antirez
98d0abcecd Sentinel: make sure published addr/id buffer is large enough.
With ipv6 support we need more space, so we account for the IP address
max size plus what we need for the Run ID, port, flags.
2013-07-11 17:09:30 +02:00
antirez
a7451c1b6d All IP string repr buffers are now REDIS_IP_STR_LEN bytes. 2013-07-11 17:07:52 +02:00
Geoff Garside
0d8f254359 Add IPv6 support to sentinel.c.
This has been done by exposing the anetSockName() function anet.c
to be used when the sentinel is publishing its existence to the masters.

This implementation is very unintelligent as it will likely break if used
with IPv6 as the nested colons will break any parsing of the PUBLISH string
by the master.
2013-07-11 17:07:31 +02:00
Geoff Garside
4b2e374e4a Update calls to anetResolve to include buffer size 2013-07-11 17:05:08 +02:00
antirez
9af8125c7d Sentinel: parse new INFO replication output correctly.
Sentinel was not able to detect slaves when connected to a very recent
version of Redis master since a previous non-backward compatible change
to INFO broke the parsing of the slaves ip:port INFO output.

This fixes issue #1164
2013-06-20 10:24:31 +02:00
antirez
e7bcec829c Sentinel: changes to tilt mode.
Tilt mode was too aggressive (not processing INFO output), this
resulted in a few problems:

1) Redirections were not followed when in tilt mode. This opened a
   window to misinform clients about the current master when a Sentinel
   was in tilt mode and a fail over happened during the time it was not
   able to update the state.

2) It was possible for a Sentinel exiting tilt mode to detect a false
   fail over start, if a slave rebooted with a wrong configuration
   about at the same time. This used to happen since in tilt mode we
   lose the information that the runid changed (reboot).

   Now instead the Sentinel in tilt mode will still remove the instance
   from the list of slaves if it changes state AND runid at the same
   time.

Both are edge conditions but the changes should overall improve the
reliability of Sentinel.
2013-04-30 15:09:14 +02:00
antirez
4028a777b6 Sentinel: more sensible delay in master demote after tilt. 2013-04-30 15:09:10 +02:00
antirez
70845320cc Sentinel: only demote old master into slave under certain conditions.
We used to always turn a master into a slave if the DEMOTE flag was set,
as this was a resurrecting master instance.

However the following race condition is possible for a Sentinel that
got partitioned or had internal issues (tilt mode), and was not able to
refresh the state in the meantime:

1) Sentinel X is running, master is instance "A".
2) "A" fails, sentinels will promote slave "B" as master.
3) Sentinel X goes down because of a network partition.
4) "A" returns available, Sentinels will demote it as a slave.
5) "B" fails, other Sentinels will promote slave "A" as master.
6) At this point Sentinel X comes back.

When "X" comes back he thinks that:

"B" is the master.
"A" is the slave to demote.

We want to avoid that Sentinel "X" will demote "A" into a slave.
We also want that Sentinel "X" will detect that the conditions changed
and will reconfigure itself to monitor the right master.

There are two main ways for the Sentinel to reconfigure itself after
this event:

1) If "B" is reachable and already configured as a slave by other
sentinels, "X" will perform a redirection to "A".
2) If there are not the conditions to demote "A", the fact that "A"
reports to be a master will trigger a failover detection in "X", that
will end in a reconfiguration to monitor "A".

However if the Sentinel was not reachable, its state may not be updated,
so in case it tilted, or was partitioned from the master instance of the
slave to demote, the new implementation waits some time (enough to
guarantee we can detect the new INFO, and new DOWN conditions).

If after some time there are still not the right conditions to demote
the instance, the DEMOTE flag is cleared.
2013-04-30 15:09:06 +02:00
antirez
d2ff5ed603 Sentinel: always redirect on master->slave transition.
Sentinel redirected to the master if the instance changed runid or it
was the first time we got INFO, and a role change was detected from
master to slave.

While this is a good idea in case of slave->master, since otherwise we
could detect a failover without good reasons just after a reboot with a
slave with a wrong configuration, in the case of master->slave
transition it is much better to always perform the redirection for the
following reasons:

1) A Sentinel may go down for some time. When it is back online there is
no other way to understand there was a failover.
2) Pointing clients to a slave seems to be always the wrong thing to do.
3) There is no good rationale about handling things differently once an
instance is rebooted (runid change) in that case.
2013-04-24 11:34:02 +02:00
antirez
d0c9a2a767 Sentinel: turn old master into a slave when it comes back. 2013-04-22 11:26:29 +02:00
antirez
fcfdbda104 Sentinel: advertise the promoted slave address only after successful setup. 2013-02-11 11:44:14 +01:00
guiquanz
1caf09399e Fixed many typos.
Conflicts fixed, mainly because 2.8 has no cluster support / files:
	00-RELEASENOTES
	src/cluster.c
	src/crc16.c
	src/redis-trib.rb
	src/redis.h
2013-01-19 11:03:19 +01:00
antirez
8ddb23b90c BSD license added to every C source and header file. 2012-11-08 18:34:04 +01:00
antirez
dfb7194cba Sentinel: Support for AUTH. 2012-09-27 13:06:17 +02:00
antirez
b8ce9a84c5 Sentinel: reply -IDONTKNOW to get-master-addr-by-name on lack of info.
If we don't have any clue about a master since it never replied to INFO
so far, reply with an -IDONTKNOW error to SENTINEL
get-master-addr-by-name requests.
2012-09-27 13:06:12 +02:00
antirez
1f8bd82332 Sentinel: more easy master redirection if master is a slave.
Before this commit Sentinel used to redirect master ip/addr if the
current instance reported to be a slave only if this was the first INFO
output received, and the role was found to be slave.

Now instead also if we find that the runid is different, and the
reported role is slave, we also redirect to the reported master ip/addr.

This unifies the behavior of Sentinel in the case of a reboot (where it
will see the first INFO output with the wrong role and will perform the
redirection), with the behavior of Sentinel in the case of a change in
what it sees in the INFO output of the master.
2012-09-27 13:06:05 +02:00
antirez
ef792fc950 Sentinel: do not crash against slaves not publishing the runid.
Older versions of Redis (before 2.4.17) don't publish the runid field in
INFO. This commit makes Sentinel able to handle that without crashing.
2012-09-27 13:06:01 +02:00
antirez
de499f7f7e Sentinel: INFO command implementation. 2012-09-27 13:05:58 +02:00
antirez
161e137c55 Sentinel: Sentinel-side support for slave priority.
The slave priority that is now published by Redis in INFO output is
now used by Sentinel in order to select the slave with minimum priority
for promotion, and in order to consider slaves with priority set to 0 as
not able to play the role of master (they will never be promoted by
Sentinel).

The "slave-priority" field is now one of the fields that Sentinel
publishes when describing an instance via the SENTINEL commands such as
"SENTINEL slaves mastername".
2012-09-27 13:05:49 +02:00
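The selection rule in the commit above (lowest priority wins, priority 0 is never promoted) can be sketched as follows. The dictionary layout and the `select_slave` name are hypothetical, and the other criteria Sentinel may apply when choosing a slave are deliberately left out.

```python
from typing import Dict, List, Optional


def select_slave(slaves: List[Dict]) -> Optional[Dict]:
    """Pick the promotable slave with the lowest priority, if any."""
    # Slaves with priority 0 are never eligible for promotion.
    candidates = [s for s in slaves if s["priority"] != 0]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s["priority"])
```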
antirez
d480b9ce7f Sentinel: suppress harmless warning by initializing 'table' to NULL.
Note that the assertion guarantees that one of the if branches setting
table is always entered.
2012-09-27 13:05:45 +02:00
antirez
fa23fc3363 Sentinel: send SCRIPT KILL on -BUSY reply and SDOWN instance.
From the point of view of Redis an instance replying -BUSY is down,
since it is effectively not able to reply to user requests. However
a looping script is a recoverable condition in Redis if the script has
not yet performed any write to the dataset. In that case performing a
fail over is not optimal, so Sentinel now tries to restore the normal server
condition killing the script with a SCRIPT KILL command.

If the script already performed some write before entering an infinite
(or long enough to timeout) loop, SCRIPT KILL will not work and the
fail over will be triggered anyway.
2012-09-27 13:05:41 +02:00
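A minimal sketch of the decision described in the commit above: when an instance is already considered SDOWN and it answers with a -BUSY error, try SCRIPT KILL before anything else. The function name, its arguments, and the empty-string return value for "nothing to send" are assumptions for illustration only.

```python
def action_for_reply(reply: str, is_sdown: bool) -> str:
    """Return the command to send to the instance, or "" for none."""
    # A -BUSY reply means a script is blocking the server. Killing it
    # only succeeds if the script has not written to the dataset yet.
    if is_sdown and reply.startswith("-BUSY"):
        return "SCRIPT KILL"
    return ""
```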
antirez
fc0a0d4aa7 Sentinel: fixed a crash on script execution.
The call to sentinelScheduleScriptExecution() lacked the final NULL
argument to signal the end of arguments. This resulted into a crash.
2012-09-27 13:05:38 +02:00
antirez
ea9bec50c6 Sentinel: SENTINEL FAILOVER command implemented.
This command can be used in order to force a Sentinel instance to start
a failover for the specified master, as leader, forcing the failover
even if the master is up.

The commit also adds some minor refactoring and other improvements to
functions already implemented that make them able to work when the
master is not in SDOWN condition. For instance slave selection
assumed that we ask INFO every second to every slave; this is true
only when the master is in SDOWN condition, so slave selection did not
work when the master was not in SDOWN condition.
2012-09-27 13:05:33 +02:00
antirez
26a340095d Sentinel: client reconfiguration script execution.
This commit adds support to optionally execute a script when one of the
following events happen:

* The failover starts (with a slave already promoted).
* The failover ends.
* The failover is aborted.

The script is called with enough parameters (documented in the example
sentinel.conf file) to provide information about the old and new ip:port
pair of the master, the role of the sentinel (leader or observer) and
the name of the master.

The goal of the script is to inform clients of the configuration change
in a way specific to the environment Sentinel is running, that can't be
implemented in a general way inside Sentinel itself.
2012-09-27 13:05:30 +02:00
antirez
524b79d231 Sentinel: when leader in wait-start, sense another leader as race.
When we are in wait start, if another leader (or any other external
entity) turns a slave into a master, abort the failover, and detect it
as an observer.

Note that the wait-start state is mainly there for this reason but the
abort was not yet implemented.

This adds a new sentinel event -failover-abort-race.
2012-09-27 13:05:26 +02:00