This chapter discusses topics relevant to the development of the RedisRaft module itself.
The module includes a basic set of unit tests and integration tests. To run them you'll need:
- lcov (for coverage analysis, on Linux)
- Python and a few packages (e.g. pytest, redis, etc.)
- redis-server in your PATH, or in `../redis/src`.
To run tests in a Python virtualenv, follow these steps:
```
$ mkdir -p .env
$ virtualenv .env
$ . .env/bin/activate
$ pip install -r tests/integration/requirements.txt
$ make tests
```
Integration tests are based on pytest, and specific parameters can be provided to configure tests.
For example, running a single test with live logging of all output:
```
pytest tests/integration/test_snapshots.py::test_snapshot_delivery -v --log-cli-level=debug
```
To see a full list of custom test configuration options, use:
```
pytest tests/integration --help
```
To run unit tests and see a detailed coverage report:
```
$ make clean
$ make COVERAGE=1 unit-tests
$ make unit-lcov-report
```
To see coverage reports for the entire set of integration tests, you'll first need to build Redis with gcov support:
```
$ cd <redis dir>
$ make clean
$ make gcov
```
Then execute:
```
$ cd <redisraft dir>
$ make clean
$ COVERAGE=1 make
$ make integration-tests
$ make integration-lcov-report
```
See jepsen/README.md for information on using Jepsen to test RedisRaft for linearizability violations.
A single `RAFT` command is implemented as a prefix command for clients to submit requests to the Raft log.
This triggers the following series of events:
- The command is appended as an entry to the local Raft log (in-memory cache and file).
- The log is replicated to the majority of cluster members. The RedisRaft module uses module-specific commands against other nodes to accomplish this.
- Once the log has been replicated to a majority of nodes and RedisRaft determines that the entry can be committed, the command is executed locally on all nodes as a regular Redis command and the response is sent to the user.
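For example, a client can submit an ordinary Redis command through the Raft log by prefixing it with `RAFT` (a minimal sketch; the reply shown is simply the normal `SET` reply, returned once the entry has been committed and executed):

```
$ redis-cli RAFT SET mykey myvalue
OK
```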
Raft communication between cluster members is handled by the `RAFT.AE` and `RAFT.REQUESTVOTE` commands, which are also implemented by the RedisRaft module.
The module starts a background thread which handles all Raft-related tasks, such as:
- Maintaining connections with all cluster members.
- Periodically sending heartbeats (leader) or initiating an election if heartbeats are not seen (follower/candidate).
- Processing committed entries (delivering them to Redis in a thread-safe context).
All received Raft commands are placed on a queue and handled by the Raft thread itself, using the Module API's blocking client support and a thread-safe context.
When a new node starts up, it can follow one of these flows:
- Start as the first node of a new cluster.
- Start as a new node of an existing cluster (with a new unique ID). Initially it will be a non-voting node, only receiving logs (or a snapshot).
- Start as an existing cluster node which recovers from a crash. Typically this is done by loading persistent data from disk.
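For illustration, the first two flows might look as follows (the module path, ports, and the exact `RAFT.CLUSTER INIT` form are assumptions; consult the main README for authoritative usage):

```
# Start the first node of a new cluster and initialize it
$ redis-server --port 5001 --loadmodule /path/to/redisraft.so
$ redis-cli -p 5001 RAFT.CLUSTER INIT

# Start a new node and join it to the existing cluster
$ redis-server --port 5002 --loadmodule /path/to/redisraft.so
$ redis-cli -p 5002 RAFT.CLUSTER JOIN localhost:5001
```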
Configuration changes are propagated as special Raft log entries, as described in the Raft paper.
The trigger for configuration changes is provided by the `RAFT.NODE ADD` and `RAFT.NODE REMOVE` commands.
NOTE: Currently membership operations are node-centric. That is, a node is started with module arguments that describe how it should behave. For example, `RAFT.CLUSTER JOIN` is invoked on a new node in order to initiate a connection to the leader and execute a `RAFT.NODE ADD` command.
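For example, removing a node might look like this, invoked against the leader with the departing node's ID (the exact argument form is an assumption; consult the command reference):

```
$ redis-cli -p 5001 RAFT.NODE REMOVE 2
```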
While there are some benefits to this approach, it may make more sense to move to a cluster-centric approach, which is how Redis Cluster handles membership.
The Raft Log is persisted to disk in a dedicated log file managed by the module. In addition, an in-memory cache of recent entries is maintained in order to optimize log access.
The file format is based on RESP encoding and is similar to an AOF file. It begins with a header entry that stores the Raft state at the time the log was created, followed by a list of entries.
The header entry may be updated in place to persist additional data such as voting information; for this reason, the header entry has a fixed size.
In addition, the module maintains a simple index file that stores the 64-bit file offset of every entry written to the log. The index is updated on the fly as new entries are appended to the Raft log. After a crash, however, it is not treated as a source of truth and is reconstructed as the Raft log is read.
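For illustration, a node's data directory might then contain files along these lines (the filenames here are hypothetical and depend on the configured Raft log filename and `dbfilename`):

```
$ ls
raftlog.db        # Raft log: fixed-size header entry followed by RESP-encoded entries
raftlog.db.idx    # index file: 64-bit file offset of every log entry
dump.rdb          # most recent snapshot
```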
Raft defines a mechanism for compaction of logs by storing and exchanging snapshots. The snapshot is expected to be persisted just like the log, and include information that was removed from the log during compaction.
When the Raft module determines it needs to perform log compaction, it does the following:
First, a child process is forked and:
- Performs a Redis `SAVE` operation after modifying the `dbfilename` configuration, so a temporary file is created.
- Iterates the Raft log and creates a new temporary Raft log with only the entries that follow the snapshot.
- Exits and reports success to the parent.
The parent detects that the child has completed and:
- Renames the temporary snapshot (rdb) file so it overwrites the existing one.
- Appends all Raft log entries that have been received since the child was forked to the temporary Raft log file.
- Renames the temporary Raft log so it overwrites the existing one.
Note that while the above is not atomic, operations are ordered such that a failure at any given time would not result in data loss.
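During development and testing, compaction can be triggered manually; the integration tests do this through a debug command (shown here on the assumption that its form is unchanged):

```
$ redis-cli -p 5001 RAFT.DEBUG COMPACT
```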
When a Raft follower node lags behind and requires log entries that have been compacted, a snapshot needs to be delivered instead:
- The leader decides it needs to send a snapshot to a remote node.
- The leader sends a `RAFT.LOADSNAPSHOT` command, which includes the snapshot (RDB file) as well as the last-included term and last-included index.
- The follower may respond in different ways:
  - `1` indicates the snapshot was successfully loaded.
  - `0` indicates the local index already matches the required snapshot index, so nothing needs to be done.
  - `-LOADING` indicates snapshot loading is already in progress.
NOTE: Because of the store-and-forward implementation in Redis, this is not very efficient and will fail on very large datasets. In the future this should be optimized.
When `RAFT MULTI` is issued, an entry is created in the `multi_clients` dictionary for the current client. While the entry exists, every new command is appended to it and a `QUEUED` response is generated.
When `RAFT EXEC` is issued and an entry exists for the client, a `RaftRedisCommandArray` with all the commands is created and processed as a regular command.
A Redis Module API event handler is created to catch client disconnect events and clean up their state (in case a client initiates `MULTI` and drops the connection).
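For illustration, in the default (transparent) mode a transaction looks like a regular Redis one, with the whole batch going through the Raft log on `EXEC` (a minimal sketch; replies are the standard Redis responses):

```
127.0.0.1:5001> MULTI
OK
127.0.0.1:5001> SET key1 value1
QUEUED
127.0.0.1:5001> INCR counter
QUEUED
127.0.0.1:5001> EXEC
1) OK
2) (integer) 1
```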
`WATCH` needs to be implemented by the Raft module itself, because we need to detect and fail an `EXEC` before creating an entry and propagating it.
Supporting `WATCH` in proxying mode is difficult because of the need to propagate the watched-keys state between nodes on leader election (in case re-election takes place between the `WATCH` and the `EXEC`).
To support `WATCH`:
- Every `WATCH` command should be propagated to Redis directly.
- When `MULTI` and `EXEC` are performed, we consult the Module API context to determine if we have dirty keys, in which case the `EXEC` would fail. In that case, we fail it ourselves.
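For reference, these are the standard Redis semantics the design aims to preserve: if a watched key is modified before `EXEC`, the transaction aborts and `EXEC` returns nil.

```
127.0.0.1:5001> WATCH mykey
OK
127.0.0.1:5001> MULTI
OK
127.0.0.1:5001> SET mykey newvalue
QUEUED
(another client modifies mykey here)
127.0.0.1:5001> EXEC
(nil)
```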
Proxying should work for the simple `MULTI` case, but there are issues with `WATCH`:
- We need to be sure we maintain a dedicated proxy connection per client, because `WATCH` lives in a connection context.
- If a new leader is elected between `WATCH` and `EXEC`, we must not proxy commands, as the `WATCH` state will be invalid on the new leader.
At the moment this is not implemented.
RedisRaft supports two experimental and/or for-testing-only modes, which are described below.
Follower Proxy mode allows a follower (non-leader) node to proxy user commands to the leader, wait for them to complete, and send the reply back to the client.
The benefit of this experimental mode of operation is that a client no longer needs to deal with `-MOVED` redirect replies.
This mode has several limitations:
- It cannot preserve and manage state across commands. This affects commands like `MULTI`/`EXEC` and `WATCH`, which will exhibit undefined behavior if proxied to a leader (or even different leaders over time).
- It uses a single connection and therefore may introduce additional performance limitations.
To enable Follower Proxy mode, specify `follower-proxy yes` as a configuration directive.
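For example, the directive can be supplied as a module argument at startup (the module path and port are illustrative):

```
$ redis-server --port 5001 --loadmodule /path/to/redisraft.so follower-proxy yes
```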
By default, RedisRaft works transparently by intercepting all user commands and processing them through the Raft Log.
Explicit Mode disables this automatic interception and allows the client to decide which commands should be run through the Raft log on a command-by-command basis.
To allow for this, RedisRaft's explicit mode exposes the `RAFT` command.
For example:
```
RAFT SET mykey myvalue
```
This sends the `SET mykey myvalue` Redis command to RedisRaft, causing it to execute with strongly consistent guarantees.
- Add a NO-OP log entry when starting up, to force computation of the commit index.
- Latency optimizations through better concurrency (batch operations, distribute entries while syncing to disk, etc.).
- Improve debug logging (pending Redis Module API support).
- Batch log operations (pending Raft lib support).
- Cleaner snapshot RDB loading (pending Redis Module API support).
- Stream snapshot data on LOAD.SNAPSHOT (pending hiredis support/RESP3 or a dedicated side channel implementation).
- Improve follower proxy performance.