AUTH and SCRIPT KILL were sent without incrementing the pending commands
counter. Clearly this needs some kind of wrapper doing it for the caller
in order to be less bug prone.
This change makes Sentinel less fragile about a number of failure modes.
This commit also fixes a different bug as a side effect, SLAVEOF command
was sent multiple times without incrementing the pending commands count.
The previous implementation of SCAN parsed the cursor in the generic
function implementing SCAN, SSCAN, HSCAN and ZSCAN.
The actual higher-level command implementation only checked for empty
keys and return ASAP in that case. The result was that inverting the
arguments of, for instance, SSCAN for example and write:
SSCAN 0 key
Instead of
SSCAN key 0
Resulted into no error, since 0 is a non-existing key name very likely.
Just the iterator returned no elements at all.
In order to fix this issue the code was refactored to extract the
function to parse the cursor and return the error. Every higher level
command implementation now parses the cursor and later checks if the key
exist or not.
The previous implementation assumed that the first call always happens
with cursor set to 0, this may not be the case, and we want to return 0
anyway otherwise the (broken) client code will loop forever.
The new implementation is capable of iterating the keyspace but also
sets, hashes, and sorted sets, and can be used to implement SSCAN, ZSCAN
and HSCAN.
Sometimes when we resurrect a cached master after a successful partial
resynchronization attempt, there is pending data in the output buffers
of the client structure representing the master (likely REPLCONF ACK
commands).
If we don't reinstall the write handler, it will never be installed
again by addReply*() family functions as they'll assume that if there is
already data pending, the write handler is already installed.
This bug caused some slaves after a successful partial sync to never
send REPLCONF ACK, and continuously being detected as timing out by the
master, with a disconnection / reconnection loop.
Since we started sending REPLCONF ACK from slaves to masters, the
lastinteraction field of the client structure is always refreshed as
soon as there is room in the socket output buffer, so masters in timeout
are detected with too much delay (the socket buffer takes a lot of time
to be filled by small REPLCONF ACK <number> entries).
This commit only counts data received as interactions with a master,
solving the issue.
There was a bug that over-esteemed the amount of backlog available,
however this could only happen when a slave was asking for an offset
that was in the "future" compared to the master replication backlog.
Now this case is handled well and logged as an incident in the master
log file.
The code freed a reply object that was never created, resulting in a
segfault every time randomkey returned a key that was deleted before we
queried it for size.
Multiple missing calls to lua_pop prevented the error handler function
pushed on the stack for lua_pcall() to be popped before returning,
causing a memory leak in almost all the code paths of EVAL (both
successful calls and calls returning errors).
This caused two issues: Lua leaking memory (and this was very visible
from INFO memory output, as the 'used_memory_lua' field reported an
always increasing amount of memory used), and as a result slower and
slower GC cycles resulting in all the CPU being used.
Thanks to Tanguy Le Barzic for noticing something was wrong with his 2.8
slave, and for creating a testing EC2 environment where I was able to
investigate the issue.
When no encoding is possible, at least try to reallocate the sds string
with one that does not waste memory (with free space at the end of the
buffer) when the string is large enough.