fix rare replication stream corruption with disk-based replication

The slave sends \n keepalive messages to the master while parsing the rdb,
and later sends REPLCONF ACK once a second. rarely, the master recives both
a linefeed char and a REPLCONF in the same read, \n*3\r\n$8\r\nREPLCONF\r\n...
and it tries to trim two chars (\r\n) from the query buffer,
trimming the '*' from *3\r\n$8\r\nREPLCONF\r\n...

then the master tries to process a command starting with '3' and replies to
the slave a bunch of -ERR and one +OK.
although the slave silently ignores these (prints a log message), this corrupts
the replication offset at the slave since the slave increases the replication
offset, and the master did not.

other than the fix in processInlineBuffer, i did several other improvments
while hunting this very rare bug.

- when redis replies with "unknown command" it includes a portion of the
  arguments, not just the command name. so it would be easier to understand
  what was recived, in my case, on the slave side,  it was -ERR, but
  the "arguments" were the interesting part (containing info on the error).
- about a year ago i added code in addReplyErrorLength to print the error to
  the log in case of a reply to master (since this string isn't actually
  trasmitted to the master), now changed that block to print a similar log
  message to indicate an error being sent from the master to the slave.
  note that the slave is marked as CLIENT_SLAVE only after PSYNC was received,
  so this will not cause any harm for REPLCONF, and will only indicate problems
  that are gonna corrupt the replication stream anyway.
- two places were c->reply was emptied, and i wanted to reset sentlen
  this is a precaution (i did not actually see such a problem), since a
  non-zero sentlen will cause corruption to be transmitted on the socket.
This commit is contained in:
Oran Agra
2018-07-14 16:49:15 +03:00
parent cefe21d28a
commit d55598988b
3 changed files with 18 additions and 9 deletions

View File

@ -2148,6 +2148,7 @@ void replicationCacheMaster(client *c) {
server.master->read_reploff = server.master->reploff;
if (c->flags & CLIENT_MULTI) discardTransaction(c);
listEmpty(c->reply);
c->sentlen = 0;
c->reply_bytes = 0;
c->bufpos = 0;
resetClient(c);