6231 Commits

Author SHA1 Message Date
antirez
39d3448703 Cluster: fix gossip section ping/pong times encoding.
The gossip section times are 32 bit, so cannot store the milliseconds
time but just the seconds approximation, which is good enough for our
uses. At the same time however, when comparing the gossip section times
of other nodes with our node's view, we need to convert back to
milliseconds.

Related to #3929. Without this change the patch to reduce the traffic in
the bus message does not work.
2017-04-18 16:17:34 +02:00
antirez
78148d0e5a Cluster: add clean-logs command to create-cluster script. 2017-04-18 16:17:31 +02:00
antirez
a5c1c77eb8 Cluster: decrease ping/pong traffic by trusting other nodes reports.
Cluster of bigger sizes tend to have a lot of traffic in the cluster bus
just for failure detection: a node will try to get a ping reply from
another node no longer than when the half the node timeout would elapsed,
in order to avoid a false positive.

However this means that if we have N nodes and the node timeout is set
to, for instance M seconds, we'll have to ping N nodes every M/2
seconds. This N*M/2 pings will receive the same number of pongs, so
a total of N*M packets per node. However given that we have a total of N
nodes doing this, the total number of messages will be N*N*M.

In a 100 nodes cluster with a timeout of 60 seconds, this translates
to a total of 100*100*30 packets per second, summing all the packets
exchanged by all the nodes.

This is, as you can guess, a lot... So this patch changes the
implementation in a very simple way in order to trust the reports of
other nodes: if a node A reports a node B as alive at least up to
a given time, we update our view accordingly.

The problem with this approach is that it could result into a subset of
nodes being able to reach a given node X, and preventing others from
detecting that is actually not reachable from the majority of nodes.
So the above algorithm is refined by trusting other nodes only if we do
not have currently a ping pending for the node X, and if there are no
failure reports for that node.

Since each node, anyway, pings 10 other nodes every second (one node
every 100 milliseconds), anyway eventually even trusting the other nodes
reports, we will detect if a given node is down from our POV.

Now to understand the number of packets that the cluster would exchange
for failure detection with the patch, we can start considering the
random PINGs that the cluster sent anyway as base line:
Each node sends 10 packets per second, so the total traffic if no
additioal packets would be sent, including PONG packets, would be:

    Total messages per second = N*10*2

However by trusting other nodes gossip sections will not AWALYS prevent
pinging nodes for the "half timeout reached" rule all the times. The
math involved in computing the actual rate as N and M change is quite
complex and depends also on another parameter, which is the number of
entries in the gossip section of PING and PONG packets. However it is
possible to compare what happens in cluster of different sizes
experimentally. After applying this patch a very important reduction in
the number of packets exchanged is trivial to observe, without apparent
impacts on the failure detection performances.

Actual numbers with different cluster sizes should be published in the
Reids Cluster documentation in the future.

Related to #3929.
2017-04-18 16:17:29 +02:00
antirez
51901396ea Cluster: collect more specific bus messages stats.
First step in order to change Cluster in order to use less messages.
Related to issue #3929.
2017-04-18 16:17:26 +02:00
antirez
f7b91b6c89 Add a top comment in crucial functions inside networking.c. 2017-04-18 16:17:13 +02:00
antirez
6e1489ae50 Set lua-time-limit default value at safe place.
Otherwise, as it was, it will overwrite whatever the user set.

Close #3703.
2017-04-18 16:17:11 +02:00
antirez
5fd841c069 Fix preprocessor if/else chain broken in order to fix #3927. 2017-04-18 16:17:08 +02:00
antirez
185b361aa8 Fix typo in feedReplicationBacklog() top comment. 2017-04-18 16:16:51 +02:00
lorneli
b740fc1ee3 Expire: Update comment of activeExpireCycle function
The macro REDIS_EXPIRELOOKUPS_TIME_PERC has been replaced by
ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC in commit
6500fabfb881a7ffaadfbff74ab801c55d4591fc.
2017-04-18 16:16:51 +02:00
antirez
56cafcceac Fix zmalloc_get_memory_size() ifdefs to actually use the else branch.
Close #3927.
2017-04-18 16:16:28 +02:00
antirez
a5b66da883 Make more obvious why there was issue #3843. 2017-04-18 16:16:27 +02:00
antirez
f60d6f09ce Fix modules blocking commands awake delay.
If a thread unblocks a client blocked in a module command, by using the
RedisMdoule_UnblockClient() API, the event loop may not be awaken until
the next timeout of the multiplexing API or the next unrelated I/O
operation on other clients. We actually want the client to be served
ASAP, so a mechanism is needed in order for the unblocking API to inform
Redis that there is a client to serve ASAP.

This commit fixes the issue using the old trick of the pipe: when a
client needs to be unblocked, a byte is written in a pipe. When we run
the list of clients blocked in modules, we consume all the bytes
written in the pipe. Writes and reads are performed inside the context
of the mutex, so no race is possible in which we consume the bytes that
are actually related to an awake request for a client that should still
be put into the list of clients to unblock.

It was verified that after the fix the server handles the blocked
clients with the expected short delay.

Thanks to @dvirsky for understanding there was such a problem and
reporting it.
2017-04-18 16:16:27 +02:00
antirez
c56668c89e Rax library updated. 2017-04-18 16:16:09 +02:00
antirez
c4716d3345 Cluster: hash slots tracking using a radix tree. 2017-04-18 16:16:03 +02:00
vienna
a9fefbce2e fix #3847: add close socket before return ANET_ERR. 2017-04-18 16:15:58 +02:00
Dvir Volk
17250409ba fixed free of blocked client before refering to it 2017-04-18 16:15:55 +02:00
Oran Agra
8aced9e9c5 add LFU policies to the test suite, just for coverage 2017-03-22 10:14:36 +01:00
antirez
3aa656abf5 Use sha256 instead of sha1 to generate tarball hashes. 2017-03-22 10:07:07 +01:00
Salvatore Sanfilippo
42d6a6c36f Makefile: fix building with Solaris C compiler, 64 bit. 2017-03-22 10:07:00 +01:00
Salvatore Sanfilippo
e082d0569b Use ARM unaligned accesses ifdefs for SPARC as well. 2017-03-22 10:07:00 +01:00
Salvatore Sanfilippo
7269d5471c Fix BITPOS unaligned memory access. 2017-03-22 10:07:00 +01:00
antirez
1552058881 Solaris fixes about tail usage and atomic vars.
Testing with Solaris C compiler (SunOS 5.11 11.2 sun4v sparc sun4v)
there were issues compiling due to atomicvar.h and running the
tests also failed because of "tail" usage not conform with Solaris
tail implementation. This commit fixes both the issues.
2017-03-22 10:07:00 +01:00
antirez
9faeed04ac Test: replication-psync, wait more to detect write load.
Slow systems like the original Raspberry PI need more time
than 5 seconds to start the script and detect writes.
After fixing the Raspberry PI can pass the unit without issues.
2017-03-22 10:07:00 +01:00
antirez
b3440b3559 Test: fix conditional execution of HINCRBYFLOAT representation test. 2017-03-22 10:06:44 +01:00
antirez
5a4133034e SipHash 2-4 -> SipHash 1-2.
For performance reasons we use a reduced rounds variant of
SipHash. This should still provide enough protection and the
effects in the hash table distribution are non existing.
If some real world attack on SipHash 1-2 will be found we can
trivially switch to something more secure. Anyway it is a
big step forward from Murmurhash, for which it is trivial to
generate *seed independent* colliding keys... The speed
penatly introduced by SipHash 2-4, around 4%, was a too big
price to pay compared to the effectiveness of the HashDoS
attack against SipHash 1-2, and considering so far in the
Redis history, no such an incident ever happened even while
using trivially to collide hash functions.
2017-02-21 17:16:35 +01:00
antirez
a8cbc3ec87 freeMemoryIfNeeded(): improve code and lazyfree handling.
1. Refactor memory overhead computation into a function.
2. Every 10 keys evicted, check if memory usage already reached
   the target value directly, since we otherwise don't count all
   the memory reclaimed by the background thread right now.
2017-02-21 17:16:35 +01:00
antirez
857e6d5641 Use locale agnostic tolower() in dict.c hash function. 2017-02-21 17:16:35 +01:00
antirez
34387ceae3 SipHash x86 optimizations. 2017-02-21 17:16:35 +01:00
antirez
ba647598b4 Use SipHash hash function to mitigate HashDos attempts.
This change attempts to switch to an hash function which mitigates
the effects of the HashDoS attack (denial of service attack trying
to force data structures to worst case behavior) while at the same time
providing Redis with an hash function that does not expect the input
data to be word aligned, a condition no longer true now that sds.c
strings have a varialbe length header.

Note that it is possible sometimes that even using an hash function
for which collisions cannot be generated without knowing the seed,
special implementation details or the exposure of the seed in an
indirect way (for example the ability to add elements to a Set and
check the return in which Redis returns them with SMEMBERS) may
make the attacker's life simpler in the process of trying to guess
the correct seed, however the next step would be to switch to a
log(N) data structure when too many items in a single bucket are
detected: this seems like an overkill in the case of Redis.

SPEED REGRESION TESTS:

In order to verify that switching from MurmurHash to SipHash had
no impact on speed, a set of benchmarks involving fast insertion
of 5 million of keys were performed.

The result shows Redis with SipHash in high pipelining conditions
to be about 4% slower compared to using the previous hash function.
However this could partially be related to the fact that the current
implementation does not attempt to hash whole words at a time but
reads single bytes, in order to have an output which is endian-netural
and at the same time working on systems where unaligned memory accesses
are a problem.

Further X86 specific optimizations should be tested, the function
may easily get at the same level of MurMurHash2 if a few optimizations
are performed.
2017-02-21 17:16:35 +01:00
Salvatore Sanfilippo
2ee19d9805 ARM: Avoid fast path for BITOP.
GCC will produce certain unaligned multi load-store instructions
that will be trapped by the Linux kernel since ARM v6 cannot
handle them with unaligned addresses. Better to use the slower
but safer implementation instead of generating the exception which
should be anyway very slow.
2017-02-21 17:16:35 +01:00
Salvatore Sanfilippo
eb62cfeadb ARM: Use libc malloc by default.
I'm not sure how much test Jemalloc gets on ARM, moreover
compiling Redis with Jemalloc support in not very powerful
devices, like most ARMs people will build Redis on, is extremely
slow. It is possible to enable Jemalloc build anyway if needed
by using "make MALLOC=jemalloc".
2017-02-21 17:16:35 +01:00
Salvatore Sanfilippo
620e48b1d7 ARM: Avoid memcpy() in MurmurHash64A() if we are using 64 bit ARM.
However note that in architectures supporting 64 bit unaligned
accesses memcpy(...,...,8) is likely translated to a simple
word memory movement anyway.
2017-02-21 17:16:35 +01:00
Salvatore Sanfilippo
980d8805da ARM: Fix 64 bit unaligned access in MurmurHash64A(). 2017-02-21 17:16:35 +01:00
John.Koepi
522b10e4f6 fix #2883, #2857 pipe fds leak when fork() failed on bg aof rw 2017-02-20 10:28:49 +01:00
antirez
03f557223b Don't leak file descriptor on syncWithMaster().
Close #3804.
2017-02-20 10:28:49 +01:00
antirez
8d55aeb5de Fix MIGRATE closing of cached socket on error.
After investigating issue #3796, it was discovered that MIGRATE
could call migrateCloseSocket() after the original MIGRATE c->argv
was already rewritten as a DEL operation. As a result the host/port
passed to migrateCloseSocket() could be anything, often a NULL pointer
that gets deferenced crashing the server.

Now the socket is closed at an earlier time when there is a socket
error in a later stage where no retry will be performed, before we
rewrite the argument vector. Moreover a check was added so that later,
in the socket_err label, there is no further attempt at closing the
socket if the argument was rewritten.

This fix should resolve the bug reported in #3796.
2017-02-09 10:15:49 +01:00
antirez
7c22d76869 Fix ziplist fix... 2017-02-01 17:01:41 +01:00
antirez
8327b8136f Ziplist: insertion bug under particular conditions fixed.
Ziplists had a bug that was discovered while investigating a different
issue, resulting in a corrupted ziplist representation, and a likely
segmentation foult and/or data corruption of the last element of the
ziplist, once the ziplist is accessed again.

The bug happens when a specific set of insertions / deletions is
performed so that an entry is encoded to have a "prevlen" field (the
length of the previous entry) of 5 bytes but with a count that could be
encoded in a "prevlen" field of a since byte. This could happen when the
"cascading update" process called by ziplistInsert()/ziplistDelete() in
certain contitious forces the prevlen to be bigger than necessary in
order to avoid too much data moving around.

Once such an entry is generated, inserting a very small entry
immediately before it will result in a resizing of the ziplist for a
count smaller than the current ziplist length (which is a violation,
inserting code expects the ziplist to get bigger actually). So an FF
byte is inserted in a misplaced position. Moreover a realloc() is
performed with a count smaller than the ziplist current length so the
final bytes could be trashed as well.

SECURITY IMPLICATIONS:

Currently it looks like an attacker can only crash a Redis server by
providing specifically choosen commands. However a FF byte is written
and there are other memory operations that depend on a wrong count, so
even if it is not immediately apparent how to mount an attack in order
to execute code remotely, it is not impossible at all that this could be
done. Attacks always get better... and we did not spent enough time in
order to think how to exploit this issue, but security researchers
or malicious attackers could.
2017-02-01 15:03:06 +01:00
antirez
1688ccff26 ziplist: better comments, some refactoring. 2017-02-01 15:03:06 +01:00
antirez
36c1acc222 Jemalloc updated to 4.4.0.
The original jemalloc source tree was modified to:

1. Remove the configure error that prevents nested builds.
2. Insert the Redis private Jemalloc API in order to allow the
Redis fragmentation function to work.
2017-01-30 10:09:48 +01:00
Jan-Erik Rediger
37b4c954a9 Don't divide by zero
Previously Redis crashed on `MEMORY DOCTOR` when it has no slaves attached.

Fixes #3783
2017-01-30 10:09:45 +01:00
miter
aee1ddca5d Change switch statment to if statment 2017-01-30 10:09:41 +01:00
oranagra
af292b54a8 fix rare assertion in DEBUG DIGEST
getExpire calls dictFind which can do rehashing.
found by calling computeDatasetDigest from serverCron and running the test suite.
2017-01-30 10:09:36 +01:00
Itamar Haber
c3c2aa3bbf Verify pairs are provided after subcommands
Fixes https://github.com/antirez/redis/issues/3639
2017-01-30 10:09:33 +01:00
antirez
7c2153dafa Add panic() into redisassert.h.
This header file is for libs, like ziplist.c, that we want to leave
almost separted from the core. The panic() calls will be easy to delete
in order to use such files outside, but the debugging info we gain are
very valuable compared to simple assertions where it is not possible to
print debugging info.
2017-01-27 10:49:25 +01:00
antirez
dc83ddf068 serverPanic(): allow printf() alike formatting.
This is of great interest because allows us to print debugging
informations that could be of useful when debugging, like in the
following example:

    serverPanic("Unexpected encoding for object %d, %d",
        obj->type, obj->encoding);
2017-01-27 10:49:22 +01:00
antirez
3ef81eb301 Ziplist: remove static from functions, they prevent good crash reports. 2017-01-13 11:47:14 +01:00
Jan-Erik Rediger
96f75faac6 Initialize help only in repl mode 2017-01-13 11:34:55 +01:00
antirez
bcd51a6acb Use const in modules types mem_usage method.
As suggested by @itamarhaber.
2017-01-13 09:07:37 +01:00
antirez
354ccf0ce9 Add memory defragmenting capability in 4.0 release notes. 2017-01-12 10:01:21 +01:00