FOSDEM 2016: Why we choose the Raft consensus algorithm

  • Next speaker is Robert and he will talk about how using the RAFT consensus algorithm helped him a lot.

  • Hello everyone, can you hear me?

  • Well...

  • To start, I would like to introduce myself. I'm Robert Wojciechowski, I work for Skylable, which is making a distributed storage system.

  • And...

  • I'm wondering how many of you are system administrators here...

  • Yeah... I can see a lot of you are, so probably you have faced the issue of having lots of...

  • Maybe not lots of, but a couple of servers with a lot of free space on them.

  • And you were wondering what to do with that space.

  • So...

  • I started thinking about this situation at once and I thought that I could build my own distributed storage cluster.

  • Because why not? I have a lot of free space. I could make some good use of it.

  • And I could also have a cluster which would be S3 compatible.

  • That would give me some additional freedom of usage.

  • My goals were to build a cluster which would be, first of all, free for me, and which I would not have to spend a lot of time maintaining.

  • I asked you how many of you are system administrators and I'm not.

  • And for me ease of use is one of the most important goals because I really wanted to...

  • To just set up the configuration and be able to store my data there.

  • And...

  • One of the requirements was fault tolerance. I store my backups on those servers.

  • And I really wanted them to be distributed on my behalf.

  • And they would be replicated so my data should be safe there.

  • After looking at some possible solutions, and there were a couple of them... I got interested in the Skylable SX distributed storage,

  • Which is a GPL-licensed solution for people like me.

  • It is meant to be easy to use.

  • And it is also... It also supports S3 compatibility with the LibreS3 layer.

  • And this solution seemed very nice to me because I was reading a little bit about it and it stood out that it supports a couple of nice features like...

  • ...block-level deduplication, which would be... well, I was thinking that it would be nice for me to have this deduplication because I store a lot of data

  • Which can usually be deduplicated, so why should I spend a lot of time sending data which is already there?
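
The idea behind block-level deduplication can be shown with a minimal sketch (this is not SX's actual code; the block size, function name, and the set of known hashes are assumptions): the client hashes each block and only sends the blocks whose hash the cluster does not already know.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # assumed fixed block size for the sketch


def blocks_to_upload(path, known_hashes):
    """Split a file into fixed-size blocks and keep only the blocks whose
    content hash is not already stored in the cluster (the dedup win)."""
    new_blocks = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            if digest not in known_hashes:
                new_blocks.append((digest, block))
    return new_blocks
```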

  • Besides the deduplication it is of course scalable, which means that you can easily add nodes to or remove them from the cluster...

  • ...on demand, and you don't have to disable the cluster, turn it off, or anything else.

  • It just works as it is.

  • And it is also a multiplatform solution which you can easily install on any UNIX-based system.

  • It works for Linux, BSD...

  • So...

  • Well... I had scalability, but there was one thing I would really appreciate...

  • It was...

  • Fault tolerance.

  • The software supported data replication and distribution of the blocks so they are equally distributed between nodes of the cluster.

  • And the software already provided administrator tools through the command line interface.

  • Which allowed me to... disable nodes which I knew had failed.

  • This routine, which is called the set-faulty function, needs to be run by an admin.

  • And it enables you to inform the cluster that some of the data may not be available. So when you have enough replicas of your blocks...

  • And you know that one or more of the nodes are basically dead, because of some power outage or something...

  • You can inform the live nodes that data should no longer be sent to or retrieved from those failed nodes.
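
Conceptually, set-faulty just removes the dead nodes from the set that reads and writes are routed to. A minimal sketch of that idea, assuming a simple deterministic placement (hypothetical names, not SX's actual placement logic):

```python
import hashlib


def replica_targets(block_hash, nodes, faulty, replica_count=2):
    """Pick replica_count live nodes for a block, skipping nodes that were
    marked as faulty, so nothing is sent to or fetched from dead nodes."""
    live = [n for n in nodes if n not in faulty]
    if len(live) < replica_count:
        raise RuntimeError("not enough live nodes for the requested replica count")

    # Rendezvous-style ranking: deterministic, and stable for the surviving
    # nodes when a faulty node drops out of the list.
    def score(node):
        return hashlib.sha256(f"{block_hash}:{node}".encode()).hexdigest()

    return sorted(live, key=score)[:replica_count]


# Example: with node "c" marked faulty, replicas land only on "a", "b", "d".
print(replica_targets("deadbeef", ["a", "b", "c", "d"], faulty={"c"}))
```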

  • ok...

  • but...

  • As I said, there was a missing feature: automatic failover...

  • The cluster was able to deal with failed nodes only when I told it to do so.

  • So...

  • I started thinking about what could be done to improve this software and make it capable of automatic failover.

  • What I needed was the ability for the cluster members, the nodes, to detect a failure situation themselves.

  • In order to detect the failure situation I needed some leader node which would perform the operations I previously had to do from the CLI as an admin.

  • The leader node should do this task on my behalf... also, the leader node should be the one which would be respected as making the decision.

  • And also...

  • All the alive nodes, all the rest of the nodes, should follow the decision made by this leader node.

  • So the... conclusion was that I definitely needed a consensus algorithm.

  • The software at that point was in a state where it did not support that...

  • Well, a consensus algorithm is basically designed to handle the exact situations I had.

  • When you have a failure condition, when you need some internal decision to be made automatically...

  • You definitely need a consensus algorithm.

  • So I started looking at possibilities and basically I...

  • ...started looking at two of them, which were the ones that came up most commonly among the possible algorithms.

  • The first was PAXOS, which is probably one of the best-known algorithms.

  • It is a proven but, let's say, very complicated algorithm to implement. There are many implementations of it.

  • But it turns out that many of them also have to be somehow adjusted to the working environment...

  • so...

  • Well... there was some existing software that I could probably use, but I also didn't want to add too much to the existing software.

  • But I also started looking at the RAFT algorithm, and the RAFT algorithm was basically designed to be an easier version of PAXOS... easier to implement.

  • So why not take a look at the specification, since it claimed to be easy?

  • And it turned out that I should be able to do it... so why not try?

  • so...

  • I started reading, and now I'm going to tell you a few basics on how it works and what I needed from it...

  • ...to do my job

  • So first thing I needed was the leader election.

  • As I said in the beginning, the software itself did not have a concept of master and slave nodes, or leader and followers as they are called here.

  • So... this is the first situation I had to face...

  • Basically every node... each node is equal to the others. They all have the follower role, which means that they should all obey the leader node's commands...

  • ...but there is no leader

  • so...

  • At least one of them will at some point... just time out.

  • The timeout, which is called the election timeout, is the time after which a node should start thinking that something is wrong...

  • ...there is no command from the leader

  • When the node times out, it starts a routine called, basically, an election.

  • When the election starts, the node which timed out changes its role to candidate. When a node is a candidate...

  • It starts sending request vote queries. Request vote queries are just simple... queries that contain the ID of the sender...

  • ...which can be saved on the follower nodes, which means that when a follower node receives a request vote query...

  • ...it saves the number or ID of the node which sent the request vote query.

  • So that is pretty simple and... that mark is needed so that the nodes avoid voting for other candidates.

  • So when there are a couple of candidate nodes in the cluster...

  • Follower nodes can only vote for one at a time, and it is also... a part of the algorithm that a candidate node votes for itself, so...

  • ...in one election a candidate node receives at least one vote.

  • And the final state of this voting is that... the candidate received 3 votes.

  • 3 votes are more than half of the nodes in the cluster.

  • In this simple case there are 3 nodes in the cluster, so it had to get 2 votes out of 3... so the situation is simple.

  • Node number 2 becomes the leader now.
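
A minimal sketch of that voting rule (it leaves out Raft's terms and log checks, and it is not SX's implementation; all names are made up): a follower grants at most one vote per election, the candidate always votes for itself, and it only wins with a majority.

```python
class Follower:
    def __init__(self, node_id):
        self.node_id = node_id
        self.voted_for = None  # remembers whom this node voted for in this election

    def on_request_vote(self, candidate_id):
        """Grant at most one vote: only if we have not voted yet."""
        if self.voted_for is None:
            self.voted_for = candidate_id
            return True
        return False


def run_election(candidate_id, followers):
    """The candidate votes for itself, asks everyone else, and becomes
    leader only with votes from more than half of the cluster."""
    votes = 1  # the candidate's own vote
    for follower in followers:
        if follower.on_request_vote(candidate_id):
            votes += 1
    cluster_size = len(followers) + 1
    return votes > cluster_size // 2


# Three-node cluster: node 2 needs at least 2 of 3 votes to win.
followers = [Follower(1), Follower(3)]
print(run_election(candidate_id=2, followers=followers))  # True
```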

  • When it becomes a leader it has to immediately send heartbeat queries.

  • Heartbeat queries are just simple ping-like queries which are meant to reset the timeouts on the follower nodes.

  • Effectively stopping them from timing out... then, while the leader node is sending heartbeat queries to all the nodes...

  • It should...

  • It should just stay the leader, and all the nodes should follow its commands.
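
A minimal sketch of the heartbeat side (the interval value and the names are assumptions, not SX's actual code): the leader pings periodically, and every heartbeat pushes the follower's election deadline further into the future.

```python
import time

HEARTBEAT_INTERVAL = 0.5  # seconds between pings; an assumed value


class FollowerTimer:
    def __init__(self, election_timeout):
        self.election_timeout = election_timeout
        self.deadline = time.monotonic() + election_timeout

    def on_heartbeat(self):
        # Each heartbeat resets the election timeout, so a healthy leader
        # keeps its followers from ever starting an election.
        self.deadline = time.monotonic() + self.election_timeout

    def timed_out(self):
        return time.monotonic() > self.deadline


def leader_loop(send_heartbeat, follower_ids):
    """While it stays leader, the leader keeps pinging every follower."""
    while True:
        for node_id in follower_ids:
            send_heartbeat(node_id)  # simple ping-like query
        time.sleep(HEARTBEAT_INTERVAL)
```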

  • Ok, so this is the first thing I needed from the cluster: to elect some node which would behave as an administrator, basically...

  • What happens when a node fail?

  • When a node fails...

  • ...the leader node should somehow detect this situation, and that's how this will work in SX storage.

  • Let's suppose that node number 3 is dead; in this situation it's crossed out, so it should get the heartbeat query but it's not responding properly to it.

  • So the...

  • Node number 2 and node number 1 are basically connected to each other, but node number 3 is not really working.

  • So what happens in this situation is that node number 2 will reach the timeout; it's called the heartbeat timeout, and it's a different timeout than the election timeout.

  • It can be configured by the cluster administrator, and after that time I want my leader node to send the query to exclude node number 3.

  • And... what happens next is that node number 3 is excluded.

  • How is it excluded? It is performed in the same way as I would do it from the command line.

  • Only the alive nodes are sent this exclusion command, and node number 3 is no longer considered a member of the cluster.

  • This means for clients that when they connect to the nodes of the cluster, they won't get any errors at connection time, and when the node...

  • ...3, which was marked as faulty, gets excluded from the cluster, it can later be exchanged for a new node.

  • So the cluster administrator can recover the missing data replicas.
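
A minimal sketch of the leader-side bookkeeping described above (hypothetical names, not the SX implementation): the leader remembers when each node last answered a heartbeat and, once the heartbeat timeout passes, tells the live nodes to exclude it, which has the same effect as the manual set-faulty command.

```python
import time


class LeaderFailureDetector:
    def __init__(self, hb_timeout, exclude_node):
        self.hb_timeout = hb_timeout      # e.g. 120 seconds in the demo setup
        self.exclude_node = exclude_node  # callback that broadcasts the exclusion to live nodes
        self.last_seen = {}

    def on_heartbeat_reply(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def check(self):
        """Exclude every node that has been silent for longer than the timeout."""
        now = time.monotonic()
        for node_id, seen in list(self.last_seen.items()):
            if now - seen > self.hb_timeout:
                self.exclude_node(node_id)  # same effect as running set-faulty by hand
                del self.last_seen[node_id]
```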

  • Now let me say a little bit about how it was implemented in SX...

  • I basically used the existing internal API... my goal was not to overcomplicate things...

  • So I... started thinking about whether I couldn't reuse the existing API, and it was fairly easy. It was just adding two types of queries...

  • ...and the node exclusion command was already available to the cluster admin; now it just started being sent internally.

  • Which... also gave me effective failure detection on the cluster.
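
As a rough illustration only (SX's real internal API is not shown here, and these handler names are invented), adding the two query types could look like registering two handlers in whatever request dispatcher the cluster already has, reusing the Follower and FollowerTimer objects from the earlier sketches.

```python
# Hypothetical sketch: register the two consensus messages as ordinary
# internal queries, reusing the cluster's existing request routing.
HANDLERS = {}


def handler(query_type):
    def register(fn):
        HANDLERS[query_type] = fn
        return fn
    return register


@handler("request_vote")
def handle_request_vote(node, payload):
    granted = node.on_request_vote(payload["candidate_id"])
    return {"vote_granted": granted}


@handler("heartbeat")
def handle_heartbeat(node, payload):
    node.timer.on_heartbeat()  # reset the election timeout on this follower
    return {"ok": True}


def dispatch(node, query_type, payload):
    return HANDLERS[query_type](node, payload)
```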

  • Also... what is important is that the cluster administrator still has control over how this exclusion works.

  • By default... the... heartbeat timeout is disabled. Which means that if you are an administrator and you don't really want those nodes to be excluded automatically...

  • Just in case...

  • ...you have it disabled, but you can simply enable it with a single command, which is setting one value in the cluster.

  • And... when you kick out a node... the 120 is a timeout in seconds; it is just for testing purposes, so it is meant to give you a quick overview of how it works.

  • So when you kill one of the nodes, you will see that the administrator information tool won't be able to communicate with it.

  • But after some time the node will be marked as faulty. When the node is marked as faulty it is no longer considered a part of the cluster...

  • But you can still recover the data from the other nodes to exchange this faulty node.
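
A sketch of the configuration behaviour described here (the setting name and the "disabled" sentinel are assumptions, not the actual SX option): automatic exclusion stays off by default and only runs once the administrator sets a heartbeat timeout. The detector argument is the LeaderFailureDetector from the earlier sketch.

```python
HB_TIMEOUT_DISABLED = 0  # assumed sentinel: no automatic exclusion


class ClusterSettings:
    def __init__(self):
        # Off by default: nodes are never excluded automatically
        # until the administrator opts in with a single setting.
        self.hb_timeout = HB_TIMEOUT_DISABLED


def maybe_autoexclude(settings, detector):
    """Run the automatic exclusion check only when the timeout is enabled."""
    if settings.hb_timeout == HB_TIMEOUT_DISABLED:
        return
    detector.hb_timeout = settings.hb_timeout
    detector.check()


# Example: enable automatic failover with a short timeout for testing.
settings = ClusterSettings()
settings.hb_timeout = 120  # seconds, as in the demo
```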

  • Thank you very much. I hope that it was not too complicated...

  • Please follow Skylable if you are interested and... I'm giving the next presentation right after this, which is about a client for the storage...

  • It's called SXFS; it's a client tool with client-side encryption for this cloud, basically...

  • Thank you very much...

  • Ok there is time for a couple of questions...

  • - You mentioned kicking out the client nodes, but what happens if your lead node dies? I didn't see this...

  • ...because during the presentation I didn't see what the failover is if your lead node dies... can a client take over, or...?

  • Let me check... there is a situation when the leader node dies... as you can see it's marked as a leader.

  • The situation is pretty easy...

  • The situation is pretty easy now... the leader node dies which means that effectively some of the follower nodes will reach the election time out.

  • As I said before followers are waiting for heartbeat queries, so when a leader node dies, heartbeat is no longer going to the follower nodes which means that...

  • ...one of them will time out, become a candidate, and win the election... of course it still has to follow the convention of getting a majority of the nodes.

  • - Any other questions?

  • ...ok just a second...

  • - Hi, I'm wondering what about a network split when we have a cluster larger than 3 nodes... let's say 7 nodes, and you have it split into clusters of 3 and 4?

  • I'm not sure what you mean by "Split clusters"?

  • - There is a network split, and there are connections only between 3 nodes on one side and the other 4 split-off nodes on the other...

  • Yeah, in this particular situation of 7 nodes you still have 4 which can still communicate with each other, and they are in the majority...

  • So... what happens is that the majority of nodes will elect a leader in the 4-node part of the cluster, and the 3 nodes won't be able to elect a leader among themselves.

  • So... what happens is that the leader is in the majority still and it can work normally...
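
The majority rule from this answer can be stated in a few lines (a generic sketch, not tied to SX):

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a quorum."""
    return cluster_size // 2 + 1


def can_elect_leader(partition_size, cluster_size):
    """Only the side of a network split that holds a majority can elect a leader."""
    return partition_size >= majority(cluster_size)


# A 7-node cluster split into groups of 4 and 3:
assert can_elect_leader(4, 7)      # the 4-node side keeps working
assert not can_elect_leader(3, 7)  # the 3-node side cannot elect a leader
```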

  • - Ok

  • - I guess it's best to start the next talk, and if there is any time left at the end you can still ask questions about this talk as well.
