441 Commits

Author SHA1 Message Date
7c359449d5 Fix return value check for anetTcpAccept
anetTcpAccept returns ANET_ERR, not AE_ERR.

This isn't a real bug since both ANET_ERR
and AE_ERR are -1, but it is better to be consistent.
2014-03-11 11:09:37 +01:00
9a7cf31960 Bind source address for cluster communication
The first address specified as a bind parameter
(server.bindaddr[0]) gets used as the source IP
for cluster communication.

If no bind address is specified by the user, the
behavior is unchanged.

This patch allows multiple Redis Cluster instances
to communicate when running on the same interface
of the same host.
2014-03-11 11:09:37 +01:00
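A note on 9a7cf31960: choosing the source IP of an outgoing connection generally means calling bind() on the socket before connect(). The sketch below shows that pattern with plain POSIX sockets. The helper name tcp_socket_with_source is hypothetical and is not the function used in the Redis source; IPv4 only is an assumption made to keep the example short.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket whose outgoing connections will use source_ip as
 * the local address. Returns the socket fd, or -1 on error.
 * Hypothetical helper for illustration only. */
int tcp_socket_with_source(const char *source_ip) {
    assert(source_ip != NULL);

    struct sockaddr_in local;
    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(0); /* let the kernel pick the source port */
    if (inet_pton(AF_INET, source_ip, &local.sin_addr) != 1) return -1;

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;

    /* Binding before connect() fixes the source address of the link. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) == -1) {
        close(fd);
        return -1;
    }
    return fd; /* the caller would connect() to the peer's bus port next */
}
```

With no bind address configured, a caller would skip the bind() step entirely, which matches the "behavior is unchanged" case described in the commit message.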
a0ea8f235e Cluster: error out quicker if port is unusable
The default cluster control port is 10,000 ports higher than
the base Redis port. If Redis is started on a port above 55535,
the control port would exceed 65535, so Cluster can't start and
everything will exit later anyway.
2014-03-11 11:09:37 +01:00
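A note on a0ea8f235e: the early check amounts to simple port arithmetic. The sketch below is a minimal illustration, assuming the 10,000 offset from the commit message and the TCP port ceiling of 65535; the macro and function names are hypothetical, not the identifiers used in the Redis source.

```c
#include <assert.h>
#include <stdbool.h>

#define CLUSTER_PORT_INCR 10000 /* offset stated in the commit message */
#define MAX_TCP_PORT 65535      /* highest valid TCP port */

/* Port used for the cluster bus, derived from the base Redis port. */
int cluster_bus_port(int base_port) {
    return base_port + CLUSTER_PORT_INCR;
}

/* Return true if base_port leaves room for the cluster bus port. */
bool cluster_port_is_usable(int base_port) {
    assert(base_port > 0);
    return base_port <= MAX_TCP_PORT - CLUSTER_PORT_INCR;
}
```

Under these assumptions 6379 maps to bus port 16379, and 55535 is the highest usable base port.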
e4833ed8bf Fix configEpoch assignment when a cluster slot gets "closed".
This code still needs to be reworked to use agreement to obtain a new
configEpoch when a slot is migrated. However, this commit handles the
special case that happens when the nodes have just started and every
node has a configEpoch of 0. In this special condition, having the
maximum configEpoch is not enough, as the special epoch 0 is not unique
(all the others are).

This does not fix the intrinsic race condition of a failover happening
while we are resharding; that will be addressed later.
2014-03-05 10:22:07 +01:00
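A note on e4833ed8bf and 6d550f2de4: taken together, the two messages suggest a rule of the form "keep the current configEpoch only if it is already the cluster-wide maximum and is not the non-unique epoch 0". The sketch below is one possible reading of that rule, not the code from the commits; next_config_epoch is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Decide which configEpoch a node should claim after taking a slot.
 * my_epoch:  this node's current configEpoch.
 * max_epoch: the highest configEpoch known in the cluster.
 * Hypothetical helper; a sketch of the rule described in the messages. */
uint64_t next_config_epoch(uint64_t my_epoch, uint64_t max_epoch) {
    assert(my_epoch <= max_epoch);
    /* Epoch 0 is shared by every freshly started node, so holding the
     * maximum is not enough when the maximum is 0. */
    if (my_epoch == 0 || my_epoch != max_epoch) return max_epoch + 1;
    /* Already the unique maximum: do not burn a new epoch. */
    return my_epoch;
}
```

For example, a node at epoch 0 in a fresh cluster moves to 1, a node at 3 with a cluster maximum of 5 moves to 6, and a node already at the maximum 5 stays at 5.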
0725988a07 Cluster: clusterDelNode(): remove node from master's slaves. 2014-02-11 10:34:14 +01:00
4513d8fcd4 Cluster: UPDATE messages are the norm and verbose.
Logging them at WARNING level was of little use and certainly noisy.
2014-02-11 10:22:05 +01:00
6d550f2de4 Cluster: configEpoch assignment in SETNODE improved.
Avoid wasting a configEpoch for every slot migrated if this node
already has the max configEpoch across the cluster.

There is still work to do in this area, but this avoids both ending up
with a very high configEpoch for no reason and flooding the system with
fsyncs.
2014-02-11 10:21:58 +01:00
585e9fb886 Cluster: clusterSetStartupEpoch() made more generally useful.
The actual goal of the function was to get the max configEpoch found in
the cluster, so make it general by removing the assignment of the max
epoch to currentEpoch that is useful only at startup.
2014-02-11 10:21:55 +01:00
8b5196addf Cluster: always increment the configEpoch in SETNODE after import.
Removed a stale conditional preventing the configEpoch from incrementing
after the import in certain conditions. Since the master got a new slot
it should always claim a new configuration.
2014-02-11 10:21:52 +01:00
2e3f6b0fb3 Cluster: on resharding upgrade version of receiving node.
The node receiving the hash slot needs to have a version that wins over
the other versions in order to force the ownership of the slot.

However, the current code is far from perfect, since a failover can
happen during the manual resharding. The fix is a work in progress, but
the bottom line is that the new version must either be voted as usual,
or be set by redis-trib manually after it makes sure the version can't
be used by other nodes, or reserved configEpochs could be used for
manual operations (for example, odd versions could never be used by
slaves and always be used by CLUSTER SETSLOT NODE).
2014-02-11 00:39:24 +01:00
a221ae5ce2 Cluster: fsync at every SETSLOT command puts too much pressure on disks.
During slots migration redis-trib can send a number of SETSLOT commands.
Fsyncing every time is a bit too much in production as verified
empirically.

To make sure configs are fsynced on all nodes after a resharding
redis-trib may send something like CLUSTER CONFSYNC.

In this case fsyncs were not providing much value anyway, since
processes can crash in the middle of the resharding of a hash slot, and
redis-trib should be able to recover from this condition.
2014-02-11 00:39:20 +01:00
77c6fa65f1 Cluster: conditions to clear "migrating" on slot for SETSLOT ... NODE changed.
If the slot is manually assigned to another node, clear the migrating
status regardless of whether it was previously assigned to us,
as long as we no longer have keys for this slot.

This avoids a race during slot migration that may leave the slot in
migrating status in the source node, since it received an update message
from the destination node that is already claiming the slot.

This way we are sure that redis-trib at the end of the slot migration is
always able to close the slot correctly.
2014-02-11 00:39:14 +01:00
cc97305ec3 Cluster: don't update slave's master if we don't know it.
There is no way we can update the slave's node->slaveof pointer if we
don't know the master (no node with such an ID in our tables).
2014-02-11 00:39:02 +01:00
fa6f4f21c3 Cluster: ignore slot config changes if we are importing it. 2014-02-11 00:38:59 +01:00
30214fff3e Cluster: update configEpoch after manually messing with slots. 2014-02-11 00:38:56 +01:00
8e12fae05e Cluster: fixed inverted arguments in logging function call. 2014-02-10 17:21:17 +01:00
6a01545744 Cluster: clear the FAIL status for masters without slots.
Masters without slots don't participate in the cluster but just do
redirections, so there is no need to keep them in FAIL state once they
are reachable again.
2014-02-10 17:19:16 +01:00
969a4f1db3 Cluster: replica migration should only work for masters serving slots. 2014-02-10 17:08:47 +01:00
6987a95952 Cluster: clusterReadHandler() fixed to work with new message header. 2014-02-10 16:28:44 +01:00
b82b66b51d Cluster: signature changed to "RCmb" (Redis Cluster message bus).
Sounds better after all.
2014-02-10 16:05:22 +01:00
b6e04f5584 Cluster: discard bus messages with version != 0. 2014-02-10 16:05:18 +01:00
0ee1a78c86 Cluster: added signature + version in bus packets. 2014-02-10 16:05:15 +01:00
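A note on 0ee1a78c86, b6e04f5584 and b82b66b51d: a receiver can validate the start of each bus packet before looking at anything else. The sketch below assumes a four-byte signature followed by a 16-bit version already converted to host byte order; the struct layout and names are hypothetical, and rejecting on a signature mismatch is an assumption (the commit messages only state that packets with version != 0 are discarded).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BUS_SIGNATURE "RCmb"  /* "Redis Cluster message bus" */
#define BUS_PROTO_VERSION 0   /* the only version accepted */

/* Hypothetical leading fields of a bus packet. */
struct bus_header {
    char sig[4];   /* signature bytes, not NUL terminated */
    uint16_t ver;  /* protocol version, host byte order in this sketch */
};

/* Return true if the packet should be processed, false if discarded. */
bool bus_header_is_valid(const struct bus_header *h) {
    assert(h != NULL);
    if (memcmp(h->sig, BUS_SIGNATURE, 4) != 0) return false;
    return h->ver == BUS_PROTO_VERSION;
}
```

Checking the version first thing lets a node ignore packets from peers speaking a newer bus protocol instead of misparsing them.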
142281dc79 Cluster: keys slot computation now supports hash tags.
Currently this is marginally useful, only to make sure two keys are in
the same hash slot when the cluster is stable (no rehashing in
progress).

In the future it is possible that support will be added to run
multi-key operations with keys in the same hash slot.
2014-02-07 17:39:01 +01:00
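A note on 142281dc79: the sketch below illustrates hash tags as they are described in the Redis Cluster specification: if the key contains a non-empty substring between the first '{' and the first following '}', only that substring is hashed. It assumes CRC16 (XMODEM variant) and 16384 slots, as in the specification; the function names are hypothetical and the code at this commit may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC16, XMODEM variant: polynomial 0x1021, initial value 0. */
uint16_t crc16_xmodem(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) crc = (uint16_t)((crc << 1) ^ 0x1021);
            else crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Map a key to one of 16384 hash slots, honoring hash tags. */
unsigned int key_hash_slot(const char *key, size_t keylen) {
    assert(key != NULL);
    size_t s, e;

    /* Find the first '{'. Without one, hash the whole key. */
    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;
    if (s == keylen) return crc16_xmodem(key, keylen) & 16383;

    /* Find the first '}' to the right of it. */
    for (e = s + 1; e < keylen; e++)
        if (key[e] == '}') break;

    /* No closing brace, or nothing between the braces: hash everything. */
    if (e == keylen || e == s + 1) return crc16_xmodem(key, keylen) & 16383;

    /* Hash only the tag between the braces. */
    return crc16_xmodem(key + s + 1, e - s - 1) & 16383;
}
```

With this rule, {user1000}.following and {user1000}.followers land in the same slot as the plain key user1000, which is what makes the tag useful for grouping keys.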
04fe000bf8 Cluster: fixed MF condition in clusterHandleSlaveFailover().
A manual failover requires both that a manual failover is in progress
and that mf_can_start is true (master offset received and matched).
2014-02-05 16:01:56 +01:00
c6f02fd67a Cluster: CLUSTER FAILOVER replies with OK and logs the event. 2014-02-05 15:52:38 +01:00
c72449af30 Cluster: check that a MF is in progress in manualFailoverCheckTimeout().
Otherwise a manual failover timeout is always detected, even when no
manual failover is in progress.
2014-02-05 15:45:24 +01:00
b7402bcad5 Cluster: force AUTH ACK on manual failover.
When a slave requests the masters' votes for a manual failover, the
REQUEST_AUTH message is flagged in a special way in order to force the
masters to give the authorization even if the slave's master is not
marked as failing.
2014-02-05 13:10:03 +01:00
4cf0cd5719 Cluster: manual failover initial implementation. 2014-02-05 13:01:24 +01:00
a7d30681c9 Cluster: configurable replicas migration barrier.
It is possible to configure the min number of additional working slaves
a master should be left with, for a slave to migrate to an orphaned
master.
2014-01-31 11:26:36 +01:00
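A note on a7d30681c9: the barrier can be read as "after this slave leaves, its old master must still have at least that many working slaves". The sketch below encodes that reading; slave_may_migrate is a hypothetical name and the other conditions for replica migration are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if a slave may leave its master for an orphaned one.
 * working_slaves:    working slaves of the current master, including
 *                    the slave that wants to migrate.
 * migration_barrier: minimum working slaves the master must keep. */
bool slave_may_migrate(int working_slaves, int migration_barrier) {
    assert(working_slaves >= 1);
    assert(migration_barrier >= 0);
    return working_slaves - 1 >= migration_barrier;
}
```

With a barrier of 1, a master with two working slaves can give one away, while a master with a single working slave keeps it.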
6c9359add1 Cluster: perform orphaned masters check before continue statements.
The check was placed in a way that conflicted with the continue
statements used later by the node heartbeat code, which sometimes needs
to skip the current node. Moved to the start of the function so that it
is always executed.
2014-01-30 18:23:31 +01:00
c2507b0ff6 Cluster: replica migration implementation.
This feature allows slaves to migrate to orphaned masters (masters
without working slaves), as long as a set of conditions is met,
including the fact that the migrating slave needs to be in a
master-slaves ring with at least one other working slave.
2014-01-30 18:05:11 +01:00
5b4020fb42 Cluster: swap two code blocks to have a more obvious flow. 2014-01-30 16:34:23 +01:00
4beaaff8ea Cluster: remove not needed return statement breaking failover. 2014-01-29 17:28:46 +01:00
3582054982 Cluster: broadcast pong to other slaves in the same ring.
When we schedule a failover, broadcast a PONG to the slaves.
The other slaves that plan to get elected will do the same too, this way
it is likely that every slave will have a good picture of its own rank.

Note that this is N*N messages where N is the number of slaves for the
failing master, however usually even large clusters have many master
nodes but a limited number of replicas per node, so this is harmless.
2014-01-29 17:19:55 +01:00
e2b59621a8 Cluster: log offset when announcing the failover election delay. 2014-01-29 17:16:10 +01:00
940531e9b7 Cluster: added progressive election delay according to slave rank.
Note that when we compute the initial delay, there is probably still
more up-to-date information to receive from slaves with new offsets, so
the delay is recomputed when new data is available.
2014-01-29 16:53:45 +01:00
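A note on 940531e9b7: the commit message does not give the formula, so the sketch below only illustrates the general shape of a rank-proportional delay, a fixed base plus a per-rank step. Both durations are parameters precisely because their real values are not stated here; election_delay_ms is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Delay before a slave asks for votes, growing with its rank.
 * base_ms:     delay applied to every slave (assumed parameter).
 * per_rank_ms: extra delay per rank position (assumed parameter).
 * rank:        0 for the slave with the best replication offset. */
int64_t election_delay_ms(int64_t base_ms, int64_t per_rank_ms, int rank) {
    assert(base_ms >= 0 && per_rank_ms >= 0 && rank >= 0);
    return base_ms + per_rank_ms * rank;
}
```

Recomputing the delay when new offsets arrive then simply means calling the function again with the updated rank and adjusting the scheduled election time by the difference.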
6f54032080 Cluster: function clusterGetSlaveRank() added.
Return the number of slaves of the same master that have a better
replication offset than the current slave, that is, the slave "rank"
used to pick a delay before the request for election.
2014-01-29 16:39:04 +01:00
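A note on 6f54032080: the rank described above is a simple count. The sketch below works on a plain array of offsets rather than on cluster node structures, which is an assumption made to keep it self-contained; slave_rank is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the sibling slaves whose replication offset is strictly greater
 * than ours. Rank 0 means no sibling is known to be more up to date. */
int slave_rank(const int64_t *sibling_offsets, size_t count,
               int64_t my_offset) {
    assert(sibling_offsets != NULL || count == 0);
    int rank = 0;
    for (size_t i = 0; i < count; i++)
        if (sibling_offsets[i] > my_offset) rank++;
    return rank;
}
```

A slave with the highest offset gets rank 0 and therefore the shortest election delay.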
40cd38f0c4 Cluster: update node replication offset from bus packets headers. 2014-01-29 16:01:00 +01:00
9d4ded7ec6 Cluster: refactoring: new macros to check node flags. 2014-01-29 12:17:16 +01:00
099bd336db Cluster: use myself instead of server->cluster.myself. 2014-01-29 11:38:14 +01:00
e36bd8b43e Cluster: added a global myself pointer in cluster.c.
Accessing the 'myself' node, the node representing the currently
running instance, is handy without the need to type
server.cluster->myself every time.
2014-01-29 11:22:22 +01:00
f1e09d8c41 Cluster: clusterBroadcastPong() improved with target selection.
Now we can broadcast a pong to all the instances or just the local
slaves (that is useful for replication offset propagation).
2014-01-29 11:08:52 +01:00
befcf6259e Cluster: broadcast master/slave replication offset in bus header. 2014-01-28 16:51:50 +01:00
0b1b25c51c Cluster: introduced repl_offset fields in clusterNode.
The two fields are used in order to remember the latest known
replication offset and the time we received it from other slave nodes.

This will be used by slaves in order to start the election procedure
with a delay that is proportional to the rank of the slave among the
other slaves for this master, when sorted by replication offset.

Usually this allows the slave with the most up-to-date offset to win the
election and replace the failing master in the cluster.
2014-01-28 16:28:07 +01:00
0f9422d575 Cluster: update slaves lists in clusterSetMaster(). 2014-01-22 18:46:53 +01:00
5383ab0bc6 Cluster: CLUSTER SLAVES subcommand added. 2014-01-22 18:38:42 +01:00
603e480fd5 Cluster: clusterGenNodesDescription() refactored into two functions. 2014-01-22 18:36:12 +01:00
80e80668f4 Cluster: master nodes wait before rejoining the cluster after reboot.
One of the simple heuristics used by Redis Cluster in order to avoid
losing data in the typical failure modes created by the asynchronous
replication with the slaves (a master is unable, when accepting a
write, to immediately tell if it should really be accepted or refused
because of a configuration change), is to wait some time before
rejoining the cluster after being partitioned away from the majority of
instances.

A similar condition happens when a master is restarted. It does not know
if it was already failed over, nor if all the clients already have an
updated configuration about the cluster map, so it is possible that
clients will try to write to stale masters that were restarted.

In a similar way, this commit changes the masters' behavior so they wait
2000 milliseconds before accepting writes after a reboot. There is
nothing special about 2 seconds other than being a value presumably a
few orders of magnitude larger than the cluster bus communication
latencies.
2014-01-20 11:52:52 +01:00
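A note on 80e80668f4: the rule is a plain elapsed-time check against the 2000 millisecond value given in the message. In the sketch below the macro and function names are hypothetical, and timestamps are passed in explicitly so the check can be exercised without a real clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WRITABLE_DELAY_MS 2000 /* value stated in the commit message */

/* Return true once a restarted master has waited long enough to accept
 * writes again. Both arguments are millisecond timestamps. */
bool master_may_accept_writes(int64_t start_ms, int64_t now_ms) {
    assert(now_ms >= start_ms);
    return now_ms - start_ms >= WRITABLE_DELAY_MS;
}
```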
e6970e204f Cluster: debug printf statements removed.
These were committed by mistake after being inserted in order to fix an
issue.
2014-01-20 11:19:04 +01:00
ac3850cabd Cluster: allow CLUSTER REPLICATE to switch master.
The code was doing checks for slaves that should be done only when the
instance is currently a master. Switching a slave from one master to
another should just work.
2014-01-17 18:22:35 +01:00