uname() was profiled to be a slow syscall. It always produces the same
output in the context of a single Redis execution, so calling it at
every INFO output generation does not make much sense.
The utsname structure is now cached in a static variable, together with
a static integer flag used to check whether uname() still needs to be
called the first time.
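A minimal self-contained sketch of the caching pattern follows (the
function and variable names are illustrative, not the exact ones used
in the Redis source):

    #include <stdio.h>
    #include <sys/utsname.h>

    static struct utsname cached_name; /* filled once, reused later */
    static int uname_called = 0;       /* did we already call uname()? */

    static struct utsname *cachedUname(void) {
        if (!uname_called) {
            uname(&cached_name);
            uname_called = 1;
        }
        return &cached_name;
    }

    int main(void) {
        /* Only the first call pays the syscall cost. */
        printf("%s %s\n", cachedUname()->sysname, cachedUname()->release);
        return 0;
    }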
used_memory_peak is only updated in serverCron(), which runs server.hz
times per second, but Redis can allocate more memory between two cron
runs, and a user can request the memory section of INFO before
used_memory_peak gets updated in the next cron run.
This patch updates used_memory_peak to the current
memory usage if the current memory usage is higher
than the recorded used_memory_peak value.
(And it only calls zmalloc_used_memory() once instead of
twice as it was doing before.)
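A sketch of the update performed at INFO generation time follows;
zmalloc_used_memory() is the real allocator hook, while the server
structure is stubbed here so the snippet stands alone:

    #include <stddef.h>

    static struct { size_t stat_peak_memory; } server;
    size_t zmalloc_used_memory(void); /* provided by zmalloc.c */

    void updatePeakOnInfo(void) {
        size_t zmalloc_used = zmalloc_used_memory(); /* one call, not two */
        if (zmalloc_used > server.stat_peak_memory)  /* raise the peak if */
            server.stat_peak_memory = zmalloc_used;  /* usage is higher */
    }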
The code tried to obtain the configuration file absolute path after
processing the configuration file. However if the config file was a
relative path and a "dir" directive was processed while reading the
config, the absolute path obtained was wrong.
With this fix the absolute path is obtained before processing the
configuration while the server is still in the original directory where
it was executed.
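A self-contained demonstration of the principle using realpath(); the
Redis code uses its own helper, but the ordering is the point:

    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        char abspath[PATH_MAX];
        if (argc != 2) return 1;
        /* Resolve first, while still in the start directory... */
        if (realpath(argv[1], abspath) == NULL) return 1;
        /* ...because after a chdir() (a "dir" directive, in the Redis
         * case) the same relative path would resolve elsewhere. */
        if (chdir("/tmp") != 0) return 1;
        printf("config file: %s\n", abspath);
        return 0;
    }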
server.unixtime and server.mstime are cached less precise timestamps
that we use every time we don't need an accurate time representation and
a syscall would be too slow for the number of calls we require.
One example is the initialization and update of the last interaction
time with the client, which is used for timeouts.
However rdbLoad() can take quite some time to load the DB, and it did
not update the cached time while loading. This resulted in the bug
described in issue #1535, where during the replication process the
slave loads the DB, creates the redisClient representation of its
master, but the timestamp is so old that the master, under certain
conditions, is sensed as already "timed out".
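A minimal sketch of the idea, assuming a hypothetical per-key loading
loop; updateCachedTime() is the Redis helper that refreshes
server.unixtime and server.mstime:

    long loops = 0;
    while (loadNextKeyFromRDB()) {   /* hypothetical per-key loop */
        /* Refresh the cached clocks from time to time, so clients
         * created right after loading get a fresh timestamp. */
        if (!(loops++ % 1000)) updateCachedTime();
    }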
Thanks to @yoav-steinberg and Redis Labs Inc for the bug report and
analysis.
A system similar to the RDB write error handling is used, in which when
we can't write to the AOF file, writes are no longer accepted until we
are able to write again.
For fsync == always we still abort on errors since there is currently no
easy way to avoid replying with success to the user otherwise, and this
would violate the contract with the user of only acknowledging data
already secured on disk.
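A hedged sketch of the flag-based gating; aof_last_write_status mirrors
the actual field name, while the surrounding logic is simplified:

    ssize_t nwritten = write(server.aof_fd, buf, len);
    if (nwritten != (ssize_t)len) {
        /* Enter the error state: commands involving writes are
         * refused until the AOF is writable again. */
        server.aof_last_write_status = REDIS_ERR;
    } else if (server.aof_last_write_status == REDIS_ERR) {
        server.aof_last_write_status = REDIS_OK; /* recovered */
    }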
In high RPS environments, the default listen backlog is not sufficient, so
giving users the power to configure it is the right approach, especially
since it requires only minor modifications to the code.
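In practice the backlog passed to listen(2) becomes a configuration
driven value instead of a hard-coded constant, something along these
lines (the field name matches the new "tcp-backlog" option; error
handling elided):

    /* Before: listen(sockfd, 511);  hard-coded constant */
    listen(sockfd, server.tcp_backlog); /* user-configurable backlog */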
When a slave was disconnected from its master the replication offset
was reported as -1. Now it is reported as the replication offset of
the previous master, so that a failover can use this value in order to
try to select, among the slaves of the old master, the one that
processed more data.
Sentinel's is-master-down-by-addr requests come in two flavors: some
are sent just to check if the master is down, and in this case the
runid in the request is set to "*"; others are sent in order to seek a
vote and get elected. In the latter case the runid is set to the runid
of the instance seeking the vote.
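For reference, the two request flavors look like this (angle brackets
mark placeholders):

    SENTINEL is-master-down-by-addr <masterip> <masterport> <current-epoch> *
    SENTINEL is-master-down-by-addr <masterip> <masterport> <current-epoch> <runid>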
The INFO keyspace section now reports the average TTL of the keys in
each database. Example:
db0:keys=221913,expires=221913,avg_ttl=655
The algorithm uses a running average with only two samples (current and
previous). Keys found to be expired are considered at TTL zero even if
the actual TTL can be negative.
The TTL is reported in milliseconds.
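A sketch of the two-sample running average (names are illustrative;
expired keys contribute a TTL of zero):

    long long sample = (ttl > 0) ? ttl : 0; /* expired => TTL zero */
    /* Mix the new sample with the previous average, half and half. */
    db_avg_ttl = db_avg_ttl ? (db_avg_ttl + sample) / 2 : sample;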
We don't want to repeat a fast cycle too soon: the previous code was
broken, as we need to wait two times the period *since the start* of
the previous cycle in order to get an even spacing between cycles:
.-> start                   .-> second start
|                           |
+-------------+-------------+--------------+
| first cycle |    pause    | second cycle |
+-------------+-------------+--------------+
The second start and the first start must be PERIOD*2 microseconds
apart, hence the *2 in the new code.
This commit makes the fast collection cycle time configurable, while
at the same time not allowing a new fast collection cycle to start for
the same amount of time as the max duration of the fast collection
cycle itself.
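A sketch of the resulting guard at the top of the expire cycle (the
names mirror the actual ones; the surrounding context is elided):

    if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
        /* Don't start a fast cycle until PERIOD*2 microseconds have
         * passed since the *start* of the previous fast cycle. */
        if (start < last_fast_cycle + ACTIVE_EXPIRE_CYCLE_FAST_DURATION*2)
            return;
        last_fast_cycle = start;
    }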
The main idea here is that when we are no longer able to expire keys
at the rate they are created, we can't block for longer in the normal
expire cycle, as this would result in too big latency spikes.
For this reason the commit introduces a "fast" expire cycle that does
not run for more than 1 millisecond but is called from the
beforeSleep() hook of the event loop, so much more often, and with a
frequency bound to the frequency of executed commands.
The fast expire cycle is only called when the standard expiration
algorithm runs out of time, that is, it consumed more than
REDIS_EXPIRELOOKUPS_TIME_PERC of CPU in a given cycle without being
able to bring the number of already expired keys that are not yet
collected down to less than 25% of the number of keys.
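A sketch of the hook placement (the rest of the beforeSleep() body is
elided; a slave never expires keys on its own, hence the masterhost
check):

    void beforeSleep(struct aeEventLoop *eventLoop) {
        /* Masters run a fast expire cycle before returning to the
         * event loop, so collection happens at command-rate frequency. */
        if (server.masterhost == NULL)
            activeExpireCycle(ACTIVE_EXPIRE_CYCLE_FAST);
        /* ... rest of beforeSleep() ... */
    }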
You can test this commit with different loads, but a simple way is to
use the following:
Extreme load with pipelining:
redis-benchmark -r 100000000 -n 100000000 \
-P 32 set ele:rand:000000000000 foo ex 2
Remove the -P 32 option in order to avoid pipelining, for a more
real-world load.
In another terminal tab you can monitor the Redis behavior with:
redis-cli -i 0.1 -r -1 info keyspace
and
redis-cli --latency-history
Note: this commit will make Redis print a lot of debug messages; it
is not a good idea to use it in production.