Kubernetes at Datadog Scale
Hello and welcome, everyone. My name is Ara Pulido, and I hope everyone is having a wonderful DockerCon so far. I work for Datadog as a developer advocate, and in this talk I'm going to be talking about how we run Kubernetes internally at Datadog, and particularly at Datadog scale.

This talk is not about Datadog; it's about Kubernetes, as I said. But just so you know, Datadog is a monitoring and analytics platform that helps companies improve the observability of their infrastructure and applications, including, of course, Kubernetes and the applications that you run on your Kubernetes cluster. As I said, I'm a developer advocate at Datadog. This is my Twitter handle and my email, so if you have any questions or want to follow up after this talk or later, just ping me on either of those two.

But this is a talk about Kubernetes. So what is Kubernetes? Kubernetes is an orchestration platform for your containerized applications. What does that mean? It's going to help you run your containers in production. It's a completely open source project, a graduated CNCF project, and a very successful one. Just to give some numbers: it has had 19 major releases since 2015, more than 90,000 commits, and more than 2,000 contributors. A super successful open source project.

And it's not only super successful as an open source project, but also from a user point of view. This is Google Trends for searches on Kubernetes from 2016 until January of this year, more or less. Obviously, you can see that the trend is growing steadily, and it will probably continue to do so. But you may have noticed that there are some outliers in this trend. When I first saw this graph, I had a theory about what was happening. One of the things we like doing at Datadog is putting different metrics together on the same timeline.
We do that to try to find correlations. We know that correlation is not causation, but it certainly can help you find the root cause. So, because I had a theory, I decided to put on the same time frame the Google Trends searches for Mariah Carey's classic, "All I Want for Christmas Is You". And when I did, this happened. As you can see, it's a perfect correlation. Obviously, the root cause is that it's the holiday season: people take time off work, and they tend to search less for work-related stuff. If correlation meant causation, it would mean that once a year, we don't know why, a bunch of people decide to drop everything they have to do with Kubernetes to listen to Mariah. Which is another possibility.

This has been my journey with Kubernetes. Before I joined Datadog, I worked for two years full time on Kubernetes projects, both from an admin point of view, managing a cluster, and also building applications on top of the Kubernetes API. I'm a Certified Kubernetes Administrator, and I was also part of the team that created another, similar exam, the Certified Kubernetes Application Developer. I also maintain a YouTube channel about Kubernetes, where I post things that you can do with Kubernetes, concepts, etc.

But then I joined Datadog. While I was interviewing for Datadog, as part of that interview process I had the opportunity to talk to many people on their infra team about how they were running Kubernetes, at what scale, etc. It made me realize that it would be a great opportunity for me to join this company and learn how people run Kubernetes at a super big scale.
And what scale are we talking about? Just to give an example, this is one of our clusters. Each hexagon there represents a node, so it's a very big cluster, and definitely not the only one: we run thousands of clusters, with thousands of nodes in each cluster, and we run them on different clouds.

One of the things that happens with open source projects is that they get better as more people use them. But what if you're using one in a way that fewer people do? I found this survey, with about 800 respondents, and one of the questions was how many nodes you have in your clusters. About 40 percent of people said between one and five, which clearly means that they are still trying out the platform. 27 percent said between six and ten, which is a little bigger, but still small. And only four percent said more than 100 nodes. Again, this is a very small survey, so take it with a pinch of salt, but it gives us an idea that there are definitely more people running Kubernetes at a smaller scale. So when you run at a bigger scale, you tend to find issues that you are probably the first to encounter.

And what are those common scalability issues? No surprise: these are not the only ones, but many of them come from networking. In a distributed system, networking is a pain point, and people soon realize when they start that Kubernetes networking is not particularly easy. So this is what I'm going to focus on in this talk: I'm going to be talking a little bit about Kubernetes networking.

I'm going to introduce three concepts of Kubernetes networking, at a very, very high level, but it will give you an idea of what they are. I'm going to be talking about pod networking, service networking, and DNS. Then I'm going to explain how people usually run those when they're running at a smaller scale, and what people do, or what Datadog does,
when they have to move to a bigger scale.

First I'm going to introduce those concepts. So what is pod networking? A pod is the smallest unit that you can deploy into Kubernetes: basically one or more containers that share the same network namespace. Pod networking in Kubernetes is basically about giving those pods IPs and routing traffic, and it has two basic rules: every pod gets a unique IP, and every pod can route traffic to any other pod in the cluster. That is, at a very high level, what pod networking is.

What about service networking? When you have a service in Kubernetes, you don't talk directly to the pods that are part of that service, because those are ephemeral: they can come and go, and you don't know where they are. Instead, when for example your front-end service wants to talk to a back-end service, that back-end is a Kubernetes concept called a Service, which basically acts as a load balancer between all the pods that are part of that service. That is done through a component called kube-proxy, which we will see in a bit more detail.

And finally, DNS. Instead of talking directly to the IPs of the services, you usually talk to a service name, and that gets translated to an IP thanks to a component called CoreDNS.

So, starting with pod networking. Just to summarize again: every pod gets a unique IP, and every pod can potentially route traffic to any other pod. The catch is that this is not implemented by Kubernetes itself.
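As a rough illustration of the first rule (every pod gets a unique IP), here is a minimal sketch in Python. It is not how any particular plugin works: the cluster range 10.244.0.0/16, the one-/24-per-node split, the reserved first address, and the names `allocate_pod_ips`, `node-1` and `node-2` are all assumptions made up for the example.

```python
import ipaddress


def allocate_pod_ips(cluster_cidr, nodes, pods_per_node):
    """Give each node its own /24 and each pod a unique IP from it."""
    cluster = ipaddress.ip_network(cluster_cidr)
    node_subnets = cluster.subnets(new_prefix=24)
    allocation = {}
    for node in nodes:
        subnet = next(node_subnets)
        hosts = subnet.hosts()
        next(hosts)  # assumption: first address is kept for the node's bridge
        for i in range(pods_per_node):
            allocation[f"{node}/pod-{i}"] = next(hosts)
    return allocation


pod_ips = allocate_pod_ips("10.244.0.0/16", ["node-1", "node-2"], pods_per_node=3)
for name, ip in pod_ips.items():
    print(name, ip)
```

Because every node draws from a different subnet, no two pods can end up with the same address. The second rule, that any pod can reach any other pod, is the harder part.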
You have to do it through plugins: plugins that follow a specification called the Container Network Interface (CNI). It is a spec that is part of the CNCF as well, and it basically tells you, if you want to write one of these plugins, what it should do and how plugins can interact with each other, because you can actually mix and match them. And because it's done through plugins, there are many of them, and they work very differently from one another.

When people start at a smaller scale, they usually use plugins that implement network overlays. Examples are Flannel, which is a classic one, or Weave Net. This is what I was doing, because I was running at a smaller scale. Basically, these plugins create an overlay network, and all the pods get an IP on that network. Each node gets a subnet of this overlay, connected directly to the cni0 interface, which is basically the Docker bridge.

So how does it route traffic? Let's imagine that a pod on the first node wants to talk to a pod on the second node. The rule is: if I want to talk to a pod on my same subnet, go directly through the bridge; but if I want to talk to a pod in a different subnet, go to an interface called flannel. Flannel is a daemon that does IP-in-IP encapsulation: it maintains a table of which node is on each subnet, wraps the original packet in a new packet with an extra header, which means extra traffic, and passes that on to the destination node. This works great, but it has a cost.
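The forwarding decision just described can be sketched roughly as follows. This is a simplified model and not Flannel's actual code: the `SUBNET_TO_NODE` table, the addresses, and the `forward` function are hypothetical, and the "encapsulation" is only represented as a dictionary with an outer destination.

```python
import ipaddress

# Hypothetical table the overlay daemon maintains: pod subnet -> node address.
SUBNET_TO_NODE = {
    ipaddress.ip_network("10.244.0.0/24"): "192.168.1.10",
    ipaddress.ip_network("10.244.1.0/24"): "192.168.1.11",
}


def forward(src_pod_ip, dst_pod_ip, payload):
    """Decide how a packet leaves the source pod's node."""
    src = ipaddress.ip_address(src_pod_ip)
    dst = ipaddress.ip_address(dst_pod_ip)
    for subnet, node_ip in SUBNET_TO_NODE.items():
        if dst in subnet:
            if src in subnet:
                # Same subnet: deliver locally through the bridge.
                return {"via": "bridge", "packet": payload}
            # Different subnet: wrap the original packet in an outer
            # header addressed to the node that owns that subnet.
            return {"via": "overlay", "outer_dst": node_ip, "packet": payload}
    raise LookupError(f"no node owns a subnet containing {dst_pod_ip}")


print(forward("10.244.0.2", "10.244.0.3", b"hello"))  # stays on the node
print(forward("10.244.0.2", "10.244.1.5", b"hello"))  # goes through the overlay
```

Every cross-node packet pays for the table lookup and the extra outer header.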
This extra networking that needs to happen, and the extra CPU spent creating the IP-in-IP packets, don't scale very well, so people tend to move away from overlays as they scale. There are several options; as I said, there are many, many plugins.

One example is the plugins that run the BGP protocol to share all these routing tables. An example of this is Calico, which is super popular if you're running Kubernetes on bare metal. Basically, it creates an interface for each pod, and it creates route tables that are shared between all the nodes thanks to the BGP protocol. For example, this would be the route table for the first node: this IP is directly connected to that interface, and for the second pod, on the second node, I know the next hop, because those routing tables have been shared through BGP.

This is great, but if you're running in a cloud, it seems like you're duplicating what your cloud is already giving you: you're creating a software-defined network on top of another one. So what people running in the cloud do is use direct routing instead: reusing the SDN that the cloud already provides to get an IP per pod directly from the VPC where you're running your cluster. This is what Datadog is doing, and it's really recommended if you're running on a public cloud. One of the benefits in this case is that, first of all, it matches the two rules that we had.
Every pod is going to have a unique IP, coming from the same VPC that your cluster is running in. The traffic is going to be routable, because you're running on that VPC. And another benefit is that you get routable traffic not only between pods, but also between pods and other elements in your VPC, for example VMs, which is great if you're running a hybrid environment.

So the first takeaway from this talk is that your team will need to become an expert not only in Kubernetes but also in the CNI plugin that you use, because things are going to fail and you will need to know how to debug them.

The next one is service networking. Just to recap: we have a Service, which acts as a load balancer between the different pods, thanks to a component called kube-proxy. The good news here is that kube-proxy is actually implemented in Kubernetes itself. But it has two modes. The first is iptables mode, and this is the default mode if you're running Kubernetes for the first time. And because it's the default mode, this is what I was doing before joining Datadog.

The way it works is that kube-proxy is a component that runs on every node in your cluster. It watches the Kubernetes API for pods that get created and deleted, services that get created and deleted, etc., and it updates the iptables NAT table on that node. This is more or less what that NAT table looks like: you have a service at the very top, with the IP that you want to talk to, and then you have a probability of 50 percent between the two pods, so it's random, and each of those rules matches one of those pods.

This works well and it's easy to debug, but it has some issues. That table grows linearly with pods and services, so obviously, as you scale, it takes longer to find the right pod.
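To see why a list of rules with probabilities still spreads traffic evenly, here is a small sketch of the idea in Python. It is a model of the rule chain, not kube-proxy code, and the function names are hypothetical. The assumption is that rules are evaluated top to bottom and that rule i, out of n pods, matches with probability 1/(n - i), so the last rule always matches.

```python
import random
from fractions import Fraction


def rule_probabilities(n):
    """Probability attached to each rule, in the order they are evaluated."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_probabilities(n):
    """Overall chance of landing on each pod once earlier rules are skipped."""
    remaining = Fraction(1)
    result = []
    for p in rule_probabilities(n):
        result.append(remaining * p)
        remaining *= 1 - p
    return result


def pick_pod(pods, rng=random):
    """Walk the rules in order, as a packet would, and return the chosen pod."""
    for pod, p in zip(pods, rule_probabilities(len(pods))):
        if rng.random() < p:
            return pod
    return pods[-1]


print(rule_probabilities(2))       # two pods: 1/2, then 1
print(effective_probabilities(4))  # every pod ends up with the same share
print(pick_pod(["pod-a", "pod-b", "pod-c"]))
```

With two pods this gives the 50 percent rule from the table above, and with more pods each one still gets an equal share. The walk in `pick_pod` is also linear in the number of pods, which is the scaling problem.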
Also, for every change the whole table needs to be resynced. And from a Kubernetes developer's point of view there is another issue: there is no room for more features. They wanted to improve kube-proxy, and iptables didn't allow for much more.

So here comes IPVS mode, which is a different mode to run kube-proxy in. It has been GA since 1.11, and this is what Datadog is running. It has many benefits. Changes are atomic, so you don't have to resync every rule every time there is a change. It has different load-balancing algorithms, so it doesn't have to be random: you can use least connection, for example, or other algorithms. And the lookup complexity is O(1), so no matter how big the table is, it's always going to be fast.

But even though it was GA, when Datadog started moving to IPVS they definitely found some issues. This is what we were talking about: when you use software in a way that fewer people do, you tend to find issues that nobody else found before. This, for example, is a PR from one of our team members, who became a maintainer of IPVS, improving IPVS mode a lot.

So the second takeaway is that many large deployments are already moving to IPVS, and newer solutions are being developed too, like Cilium, for example, which is eBPF-based.

Finally, we are going to talk about DNS. Just to recap, it's just a name resolver, but it had some issues at scale. This is one issue that people were hitting a lot when scaling, particularly in the number of nodes: a conntrack race condition that many people were having. One of the solutions that the Kubernetes community came up with is NodeLocal DNSCache, which is basically what Datadog is using. In this case, every node has a DNS cache, and when a client
needs to resolve a name, instead of reaching out directly to CoreDNS, it tries the cache first, over a UDP connection. If the cache misses, the connection is upgraded: the connection between the DNS cache and CoreDNS happens over TCP, which is a first benefit. It also allows bypassing CoreDNS for lookups that are not related to the cluster. For example, if a pod wants to connect to google.com, the cache will skip CoreDNS for that one in favor of the cloud DNS resolver, which reduces the traffic on CoreDNS.

Cool, so we have reached the very end of this talk. Some takeaways. First, some bad news: Kubernetes is very flexible for developers, but it's still quite complex for your operations teams, and that is something to take into account.
The second one is that CNI plugins work very differently from one another, so you will need to learn how yours works. And finally, you will hit bugs, as with any other software, so you will need to understand how to reach out to the community, where to file bugs, etc.

But there are many good things as well. The ecosystem picks things up very quickly: there are a lot of companies putting a lot of effort and development into Kubernetes. Just as we saw with IPVS and NodeLocal DNSCache, as soon as people start finding scaling issues, new solutions get developed. Also, the experience from your developers' point of view doesn't change much whether you use a DNS cache or not, or if you change your CNI plugin, so you don't have to make those changes as one big migration project. And changes don't have to happen for all workloads at once: you can create a new cluster, start moving some of the workloads, and try things out.

So, thanks very much. I'm available for questions, so I'm going to go now to the chat window and start answering them. And again, if you want to reach out, just let me know. Thank you very much, and have a nice rest of DockerCon.