Partitions¶
In the original implementation of ringpop, if a cluster is split to multiple
partitions, nodes in each partition declare each other as faulty, and afterward
will no longer communicate. Ringpop implemented support for merging the
partitions, which we call healing
.
Partition Healing – basic algorithm¶
In order for two partitions to heal, the algorithm does the following, periodically (some details are omitted for brevity; exact algorithm can be found in the code comments in the implementation):
- Randomly select a
faulty
node. - Send it a
/join
request, get its membership list. - If the local and retrieved lists are incompatible (merging them will introduce new faulties), mark all incompatible nodes suspect. When receiving this change the respective node will reassert that it is actually alive and update its incarnation number making it compatible for merge.
- If the local and retrieved lists are compatible (merging them will not introduce new faulties), merge the membership list with the local node’s membership and disseminate the changes.
We test this feature in 3 ways:
- Unit tests asserting the correct behavior.
- Integration tests, which will be the same for Go and Node implementations, checking the behavior of a node in isolation.
- Manual acceptance test to see partitions actually getting healed.
Further down, we will talk about how to manually create a partition and confirm it heals itself.
Forming a partition¶
With the current implementation of tick-cluster, it is non-trivial to form a partition. To understand why, we need to understand how connections are established.
Port Allocation¶
A ringpop instance opens a local tchannel socket (=listening tcp socket) to
accept incoming connections from other ringpops. By default, on a 2-node
tick-cluster, this is 127.0.0.1:3000
. Let’s call it instance a
. For
instance a
to establish a connection to instance b
(127.0.0.1:3001
),
instance a
will open an ephemeral port, e.g. 43323
, to connect to instance
b
. This connection, from 127.0.0.1:43323
(a
) to 127.0.0.1:3001
(b
) is
used for messages initiated by node a
. The other connection (example below),
from 127.0.0.1:36113
(b
) to 127.0.0.1:3000
(a
), is used for messages
initiated by b
. Here is a snapshot of lsof
from a two-node cluster:
root:/# lsof -Pnni
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
node 64 root 10u IPv4 217924 0t0 TCP 127.0.0.1:3000 (LISTEN)
node 64 root 11u IPv4 217925 0t0 TCP 127.0.0.1:43323->127.0.0.1:3001 (ESTABLISHED)
node 64 root 12u IPv4 217926 0t0 TCP 127.0.0.1:3000->127.0.0.1:36113 (ESTABLISHED)
node 66 root 10u IPv4 219916 0t0 TCP 127.0.0.1:3001 (LISTEN)
node 66 root 11u IPv4 219917 0t0 TCP 127.0.0.1:36113->127.0.0.1:3000 (ESTABLISHED)
node 66 root 12u IPv4 219918 0t0 TCP 127.0.0.1:3001->127.0.0.1:43323 (ESTABLISHED)
root:/#
Armed with this knowledge, we can try to make a partition.
Manually forming a partition¶
The naïve approach to make a partition between a
and b
is to block incoming
connections from and to port 3000
: then no packet will leave a
, and we will
have a partition. However, this misses the fact that ephemeral connections are
used for relaying traffic between nodes, and, in this case, connection from
127.0.0.1:43323
(a
) to 127.0.0.1:3001
is established and... misses the
firewall! We could block port 3001
too, but, with more nodes, that
would create a cluster with N partitions (N being the number of nodes) – not
what we want. In our example, we want two partitions.
With that in mind, a bit more sophistication in firewall rules is required. To
easily create a partition in tick-cluster
locally, we created
tools/make_partitions
, which, by reading the state of the connections from
lsof
, will emit iptables
/pf
commands accordingly.
In the example above, firewall rules to create a partition will look as follows (OS X):
$ sudo lsof -Pnni | ./tools/make_partition 3000 3001 --platform darwin
block drop in proto tcp from 127.0.0.1 port 3000 flags S/S
block drop in proto tcp from 127.0.0.1 port 3001 flags S/S
block drop in proto tcp from 127.0.0.1 port 43323 to 127.0.0.1 port 3001
block drop in proto tcp from 127.0.0.1 port 3001 to 127.0.0.1 port 43323
block drop in proto tcp from 127.0.0.1 port 36113 to 127.0.0.1 port 3000
block drop in proto tcp from 127.0.0.1 port 3000 to 127.0.0.1 port 36113
Linux:
$ sudo lsof -Pnni | ./tools/make_partition 3000 3001 --platform linux
*filter
-A INPUT -p tcp -s 127.0.0.1 -d 127.0.0.1 --tcp-flags RST RST -j ACCEPT
-A INPUT -p tcp --syn -m state --state NEW -d 127.0.0.1 --dport 3000 -j REJECT --reject-with tcp-reset
-A INPUT -p tcp --syn -m state --state NEW -d 127.0.0.1 --dport 3001 -j REJECT --reject-with tcp-reset
-A INPUT -p tcp -s 127.0.0.1 --sport 43323 -d 127.0.0.1 --dport 3001 -j REJECT --reject-with tcp-reset
-A INPUT -p tcp -s 127.0.0.1 --sport 3001 -d 127.0.0.1 --dport 43323 -j REJECT --reject-with tcp-reset
-A INPUT -p tcp -s 127.0.0.1 --sport 36113 -d 127.0.0.1 --dport 3000 -j REJECT --reject-with tcp-reset
-A INPUT -p tcp -s 127.0.0.1 --sport 3000 -d 127.0.0.1 --dport 36113 -j REJECT --reject-with tcp-reset
COMMIT
To sum up:
- New connections to the listening ports (
3000
,3001
) will be blocked. This prevents tchannel to re-open new valid connections. - Relevant existing connections will be terminated (e.g.
3000
to ephemeral ports). - Linux only: for the above to work, the firewall needs to explicitly accept
RST
packets.
During the partition, new connections to the nodes will be impossible to make.
This is important to keep in mind when using ringpop-admin
: invoke
ringpop-admin before forming the partition.
Armed with background how this works, we can go and make a local partition:
Start the tick-cluster¶
In this example, we use Node version of ringpop, but we can use testpop
from
go too:
$ ./scripts/tick-cluster.js -n 4 ./main.js # node
Open the ringpop-admin to observe the cluster state¶
We shall open ringpop-admin partitions
and ringpop-admin top
before making the
partition. This way, the “management” connections will be open and status will
be visible during the partition:
$ ringpop-admin top 127.0.0.1:3000
$ ringpop-admin partitions -w 1 127.0.0.1:3000 # other terminal
ringpop-admin top
will show something like this:
Address P1
127.0.0.1:3000 alive
127.0.0.1:3001 alive
127.0.0.1:3002 alive
127.0.0.1:3003 alive
1 of 4
ringpop-admin partitions
will show a single partition, updated every second:
10:27:09.615 Checksum # Nodes # Alive # Suspect # Faulty Sample Host
192859590 4 4 0 0 127.0.0.1:3000
10:27:10.607 Checksum # Nodes # Alive # Suspect # Faulty Sample Host
192859590 4 4 0 0 127.0.0.1:3000
Start the partition¶
First, check how the firewall rules would look like before applying them to the
firewall (optionally, you can pass --platform=darwin
or --platform=linux
to
the make_partition
script:
$ sudo lsof -Pnni | ./tools/make_partition 3000,3001 3002,3003
If you are happy with the output, apply the rules:
OS X:
$ sudo lsof -Pnni | ./tools/make_partition 3000,3001 3002,3003 | sudo pfctl -emf -
Linux:
$ sudo lsof -Pnni | ./tools/make_partition 3000,3001 3002,3003 | sudo iptables-restore
In a few seconds, you should see output from tick-cluster
that some of the
nodes aren’t able to ping each other. Let’s verify we actuall have a partition.
Checking in the tools¶
On partition, ringpop-admin top
(opened before the partition) should display
something like this:
Address P1 P2
127.0.0.1:3000 faulty alive
127.0.0.1:3001 faulty alive
127.0.0.1:3002 alive faulty
127.0.0.1:3003 alive faulty
ringpop-admin partitions
(opened before forming a partition) shows a more
high-level view:
10:37:04.878 Checksum # Nodes # Alive # Suspect # Faulty Sample Host
400620880 2 2 0 2 127.0.0.1:3002
3283514511 2 2 0 2 127.0.0.1:3000
That’s it, we have a partition! To break it, we need to wipe the firewall rules:
- OS X:
pfctl -f /etc/pf.conf
. - Linux:
iptables -F
.
... and wait for partition healing to kick in.
Final remarks¶
tools/make_partition
can only create two partitions. It can work with arbitrary partition sizes; for usage, runtools/make_partition --help
.tools/make_partition
is not intended to be used in an automated way. See--help
to learn about the limitations.