Hardware-accelerated Networked Systems


Okay, thank you for your time. As I said, we wanted to get a conversation going between people in networking, optics, hardware, and distributed systems. The main motivation here is actually very simple: this is a time of change. If you look at almost any resource in cloud infrastructure, whether it's compute, networking inside the data center or in the long haul, or different kinds of storage, the dominant technology for each of these resources is approaching the tail end of its S-curve; things are looking like a plateau.

For example, if you think about compute, the primary metric of interest is the flops per dollar we can achieve, thanks to Moore's law, and that is no longer scaling. We've had a lovely ride over the past 35 years, but because of fundamental physical constraints on the size of transistors, on the CPU drive voltage, and on the number of cores we can put on individual sockets, things seem to be flatlining. This has prompted a lot of work on custom accelerators, things like FPGAs, GPUs, and custom ASICs, which are getting a lot of traction in cloud environments. You've probably already heard about the growth of FPGAs, and later in this session you'll hear from Tom and Sambhrama about some of the FPGA innovation happening for networking workloads.

Similarly, when you start thinking about the network inside data centers, it relies on electrical switches, and there's an analogous phenomenon that I'm terming the Moore's law for networking. Essentially, every couple of years switch vendors give us a new switch generation that has double the bandwidth while the cost stays exactly the same. This is a lovely property because it allows us to build faster and faster networks that keep pace with increasing demand while keeping the overall network cost low and constant. That's actually one of the assumptions underlying the cloud economic model. However, things could start getting tricky very soon, and to understand that, let's look at this graph again.

Historically, the increase in switch bandwidth has come from our ability to increase the speed of the SerDes at the edge of the switching ASIC, and switch vendors have always used the non-return-to-zero modulation format: a one is encoded as high voltage, a zero as low voltage. The latest generation of switches, however, is the first where we've had to resort to higher-order modulation, because it's becoming harder and harder to increase the speed and the density of those I/O pins. This is four-level modulation, which means you encode two bits per symbol, but it also means you have tighter SNR requirements because the levels are getting closer and closer together. And this is a trick that's hard to play over and over again, because if you go to eight levels your SNR requirements are even harsher.

So if you step back and look at the network inside data centers, the primary metric of interest is gigabits per dollar. We've had a lovely ride, but within the next five or six years, by the time we get to a 50-terabit-per-second switch generation or the one beyond that, things could start to look very dodgy. The same applies to other resources: if you look at WAN capacity, we are actually approaching the Shannon capacity of individual fibers, and if you go to the next storage session you'll hear about fundamental physical reasons why we are approaching capacity limits on hard drives.
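To put a rough number on the modulation trade-off just described, here is a small illustrative calculation of my own (not from the talk): with a fixed voltage swing, packing more PAM levels into the same eye gives more bits per symbol but shrinks the spacing between levels, and the first-order SNR penalty follows directly.

```python
import math

def pam_summary(levels: int, swing_volts: float = 1.0) -> dict:
    """Illustrative numbers for M-level PAM with a fixed peak-to-peak voltage swing."""
    bits_per_symbol = math.log2(levels)
    # With M levels squeezed into the same swing, adjacent levels are separated by
    # swing / (M - 1), so the "eye opening" shrinks as M grows.
    eye_opening = swing_volts / (levels - 1)
    # First-order SNR penalty versus NRZ (2 levels), from the smaller eye alone;
    # this ignores coding, FEC, and equalization, so it is only a rough guide.
    snr_penalty_db = 20 * math.log10(1.0 / (levels - 1))
    return {"bits/symbol": bits_per_symbol,
            "eye (V)": round(eye_opening, 3),
            "penalty vs NRZ (dB)": round(snr_penalty_db, 1)}

for m in (2, 4, 8):      # NRZ, PAM4, PAM8
    print(f"PAM{m}:", pam_summary(m))
```

At first order this gives the familiar figures: roughly a 9.5 dB penalty for four levels and about 17 dB for eight, which is why "just add more levels" is hard to repeat.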
Now, if you are the CTO of a company, this is a nightmare scenario; it's a perfect storm. But for researchers, this is the perfect opportunity.

Because it means that if we come up with a new technology, no matter how disruptive it is, and it leads to a new growth curve in terms of these metrics of interest, it has a very good chance of deployment. I think that's a really powerful message, and if that's the only message you take away from this talk, or even from this session, at least Yibo and I will be delighted.

What I want to do next is talk a little bit about the disruptions happening in the networking space and tell you about the optical innovations we're doing inside the company, both in research and in Azure networking, to address these challenges. I'm going to start with the network inside the data center. If you remember, I told you about this Moore's law for networking and the fact that it has allowed us to keep the cost of the network low and constant. However, if electrical switches stop scaling for free, we'll see an increase in cost, and I'll explain the shape of this curve a couple of slides from now. What's really worrying is the timing, because this is expected to happen around the same time that cloud traffic not only continues to grow but its pace actually picks up, because of custom hardware such as FPGAs and GPUs and because of new scenarios like large-scale learning, which means we'll be generating and consuming a lot more data and we need to do it in a power-efficient fashion. So this is a big problem for us.

One of the benefits of working at a place like Microsoft Research is that we always think about medium-term strategic solutions and long-term disruptive solutions, and I wanted to give you a sense of examples of both in the context of this problem over the next few slides. To understand the solution, let's look inside the problem for a second. If you look at a switch, inside it we have a switching ASIC that sits on a substrate, or package, connected by BGA connectors, or I/O pins, a PCB trace, and an optical transceiver that takes the bits from the electronic domain to the optical domain: something tiny like this. It is the speed and the density of those I/O pins that are causing the problem, because they are harder and harder to scale. So the obvious thing to do is to move the optics closer to the switching ASIC, which gets rid of this bottleneck resource and makes the switches easier to scale.

This is something that's been discussed a lot in the optics and silicon photonics communities and it goes by many names; we are calling it in-package optics. It's essentially what the switch would look like two generations from now: you'd have a switching ASIC surrounded by chiplets comprising the electrical components, all the drivers and analog components, and then the optical chiplets, and what this would do is take the data being generated by the ASIC and modulate light, which is then sent tens or hundreds of meters. In this case the chiplet is made in silicon, both to reduce cost and to make it amenable to integration with the CMOS ASIC. And what we've started doing is actually designing our own chips; this is a silicon-based chip designed in our lab, by our chip designer, and these chips are fabricated at external foundries.
One of the key technical challenges here is that to do optical networking you need light, and it's hard to generate light in silicon; that's a fundamental problem, since silicon is an indirect-bandgap material. So we now need disaggregated transceivers: in a standard transceiver the laser and the modulator are in the same package, whereas in this case we need a separate light source, which here happens to be shared across all the chiplets.

And this is cool because it means we can have a better light source and we can actually cool it better, and you'll hear more later about some of the implications of this architectural change. The light source is typically made in a material like indium phosphide, and again we've started designing our own chips. This animation shows the different layers of the chip; it comprises lasers, waveguides, and other components. It's a few millimeters on a side, about 12 millimeters by 12 millimeters, and it's going to be shared across the entire switch.

Now, this is an example of a medium-term solution, and I want to step back and talk about its implications for overall cloud network cost. Historically, cost has been neutral. We're seeing signs of disturbance, but if we can deliver on this in-package optics solution, we'd be able to extend the cost neutrality of switches for the next five or six years, which is great because it means the world's not ending. But at some point we'll run into the actual physical constraints on the number of fibers we can align at the edge of the switching ASIC, which means more complex solutions and an increase in overall cost. And at some point we run into the actual CMOS scaling limit, whereby you cannot put more transistors onto the switching ASIC, which means we'll have multiple ASICs inside each package and each switch, which in turn means an increase in complexity and an increase in cost for every network generation thereafter. That's the nightmare scenario for us, and that's where the long-term opportunity for optics and photonics really comes in.

What we really want to do is look at these topologies and, instead of using electrical packet switches, replace them with optical circuit switches, which is something that has been thought about a lot in both the optics community and the networking community. One of the advantages of moving away from CMOS components is that you can develop technologies with a much higher radix, so you can have a much flatter topology. This means we can offer better performance per dollar, because there are fewer transceivers and fewer switches in the network.

And because there's no buffering inside the network, we can offer very low and very predictable latency, and that predictability is really important for the kinds of application scenarios we expect to develop in the cloud over the next few years. Finally, because there is no dependency on CMOS, this can actually be a long-lasting solution. This is what has motivated us as a team over the past couple of years to develop a research prototype that shows the basic feasibility of very fast optical switching, nanosecond switching, so that we can match the performance of electrical packet switches. This builds on work in the optics community, but we're now taking on the challenge of building a real system out of it. I do want to point out that there is a big difference between a circuit-switched architecture and a packet-switched architecture, so there is a long, long way to go to take this from a research prototype with a bunch of wires back in Cambridge to something that can connect an actual data center and run real applications. I just want to keep expectations in check, but the in-package optics technology buys us the four to five years needed to do that.

I want to change tack now quickly, move outside the data center, and give you another example of this sort of hardware, software, network, and optics co-design. About five years ago we used to build mega data center facilities: a few hundred thousand servers, a lot of land, lots of power. But increasingly we are finding it hard to find the land, the power, and the space for such mega facilities. So over the past couple of years we've moved to a regional architecture, where we have many medium-sized data centers, about a hundred thousand servers each, in the same metro region, within roughly 80 kilometers, connected by a regional hub.

Now let's think about the implications of this for the underlying network architecture. In a traditional mega data center we have the standard Clos topology; in this case I'm showing four levels. In all our data centers we use optical cables outside the rack, which means every switch port needs a transceiver like the one I showed earlier to do the opto-electronic conversion; this is a gray transceiver, which transmits over a single wavelength. Then we do transmission over the long haul: when you're going thousands of kilometers you need a colored transceiver, which is massive, power-hungry, and, you know, rather heavy, so I can't hold it up for too long. On the other hand, when you move to the regional architecture, what happens is that, yes, you are essentially taking this big block and splitting it into smaller chunks.

But your regional links, the one shown in red, are actually 80 kilometers. What this means is that, yes, we can still use gray transceivers inside the data centers, but now we have to use colored, power-hungry, expensive transceivers both on the WAN and on the regional links, and that's a nightmare for power, cost, space, and all of those reasons. So what the Azure networking team did is go out and design their own transceiver, customized for the regional environment. This is a colored transceiver that can operate over 80 kilometers, and you can see it has exactly the same form factor as the gray transceiver we use inside the data center. Essentially the trick is to customize the DSP and the FEC for the particular scenario of interest: if the light is not going tens of thousands of kilometers, why would we compensate chromatic dispersion for that distance? Those are the kinds of design-for-our-scenario decisions that led to this innovation, and it was co-design both inside the company and with Inphi, who manufactures these transceivers; they have been deployed in our data centers over the past year.

The reason I chose these two examples is to give you a sense of our thinking: innovation by combining optics with networking, distributed systems, architecture, and hardware. We genuinely believe that to develop the next generation of cloud infrastructure we will have to innovate across the stack, and we can do this because, as a cloud operator, we own and operate the entire stack. As I showed you, we are developing our own custom chips, we are building optical components, we have designed and are deploying our own transceivers and our own switches, and, as you'll hear, we already have our own NICs, so we can deploy things on the NIC or on the servers themselves. That allows us to innovate across the stack, and it's something we're doing both in our own work and in all our collaborations with the community, for example the work we're doing with Dan and many of you in this room. That's the message I wanted to get across. Thank you for your time; I think I'm running out of time, so thanks. Do we have time for a question? Maybe one.

Question: Real fast, you didn't say anything about efficient use of the optical network. Even with 100 gigabit right now we're seeing people doing small writes, and not many of the data lanes on this sort of wide data path are really occupied. Say it's a ten-lane data path, sixty-six bits each, and you've got one byte of good data there, maybe 60 bytes of headers, and the rest is just empty. What can be done at that layer?

Answer: That's a very good question, and there are two short answers. You can apply that question in the WAN scenario, where efficiency is very important and you really need to think about software-defined primitives to achieve it. But I think your question is more about the intra-DC context.

One of the interesting things about the intra-DC context is that things like the 60-byte or 40-byte header come from scenarios where we are limited to legacy protocols. If we control things like the NIC and the underlying network, then in many customized scenarios, like our RDMA clusters, I would argue we should just go to customized protocols that reduce a lot of that overhead.

Questioner: Yeah, I completely agree. I think in that sense you really need a decoupling of the data plane and the control plane, which would allow you to make those changes.

Speaker: Yes, and come talk to me on that front. Do we have one more question? Maybe not; okay. All right, thank you again.

Host: The next speaker is Tom Anderson. I believe most of you already know him; as a short description, he holds an endowed chair in the Paul G. Allen School of Computer Science & Engineering at the University of Washington. His research interests span all aspects of building practical, robust, and efficient computer systems. He is a member of the National Academy of Engineering and the winner of multiple lifetime achievement awards.

Tom Anderson: Great, and I think this will actually be a great counterpoint. I am, at essence, an operating systems person, and many of the things that have been driving our research over time are just the recognition that in the cloud setting we are spending, particularly for really key applications, huge amounts of time doing repetitive operations in the operating system. What's clear about the hardware technology trends is that everything that can be regularized and is high-volume, the things you're doing all the time, is going to be re-implemented in hardware; you just can't afford to do general-purpose processing for things you're doing many, many times. You can see this with TPUs, but you're also going to see it in networking and in operating systems. The argument I'm going to try to make here is about the protocol stack, but more broadly I think one could ask this question about operating systems in general.

I'm going to come at this from the application side rather than up from what we can do in hardware. One of the things that's really characteristic of the applications people actually run in data centers is large amounts of very fine-grained communication, partly because as you spread applications out over more and more nodes, and you need more and more nodes because Dennard scaling isn't doing it for you anymore, your communication granularity goes down. There's an interesting data set that somebody in the SIGCOMM community collected of all of the RPCs inside the Google data center over a period of time, and the median size was only about 300 bytes. So generally you're doing single-packet transfers between nodes. You still want the semantics of TCP, that is, reliable in-order delivery; it's many-to-many; and the applications this drives are things like key-value stores, databases, and distributed analytics.

And you're pushing performance limits on those, and then scaling up as you go. So the question is how you deal with this. On the software processing side, with Linux, even if you ignore the VM layer because you might say Catapult will take care of that for you, on the TCP side you're spending huge amounts of time doing just regular processing: if you're going through Linux, it's about three and a half microseconds per packet. Single-core performance is not going to bail you out, and you could try to parallelize this, running many, many connections over many, many cores, but that potentially just ends up burning lots of power and spending a lot of cost and time doing things you don't really need to do.

There are a lot of proposed solutions to this; probably everybody in the room knows many of them. One is kernel bypass, DPDK or things like that. The challenge, though, is that what you want is something that gives you policy enforcement: if you're in a multi-tenant data center, you also need to worry about what these applications are actually doing and whether you can trust them. Are they going to obey congestion control? The behavior of your network is going to depend on end-host behavior, so you need something that gives you policy enforcement. Some people are deploying smart NICs, or NIC CPU arrays; I should say there's a really nice paper from MSR, which Sambhrama is possibly going to cover, I'm not sure, that goes through essentially the same arguments; it appeared at the last NSDI, in 2018, and it makes all the same arguments we would make. If you put CPU arrays on your NIC, the challenge is that as you scale out the number of CPUs doing your TCP processing, you have a complex assignment problem of flows to the CPU array, you get limited per-flow performance when you potentially want really high bandwidth for a given individual flow, and it's also relatively expensive to build. RDMA is the one the research community is spending most of its time on; in fact, most of the stuff I'm going to talk about is essentially trying to understand how you would build a more flexible RDMA engine. With RDMA, the API is fine for some things, but for this kind of small-scale, single-packet RPC communication, writing or reading remote memory isn't necessarily the natural protocol you would use to build a really fast RPC system. And the hardware for RDMA composes not so well with the congestion control protocols you would actually want to use in practice. Some people have also built TCP offload engines, to give another example; I'm going to make an argument towards protocol agility.

Ultimately, if you integrate over that whole set, I think what you want is the traditional operating system functionality, the feature set of operating systems, which is multi-tenant isolation, the ability to say that the behavior of one application doesn't affect other applications, and at the same time a hardware assist: some way of having a flexible, reconfigurable architecture.
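As a back-of-the-envelope illustration of why the per-packet cost above matters, here is my own arithmetic using the roughly 3.5 microseconds per packet and roughly 300-byte median RPC figures quoted in the talk, and assuming a 100 Gb/s NIC (the link speed is my assumption, not the talk's):

```python
# Back-of-the-envelope: what ~3.5 us of kernel processing per packet implies.
per_packet_s = 3.5e-6        # Linux TCP stack cost quoted in the talk
rpc_bytes = 300              # median RPC size quoted in the talk
link_gbps = 100              # assumed NIC speed for illustration

pkts_per_core = 1 / per_packet_s                        # ~285,000 packets/s per core
gbps_per_core = pkts_per_core * rpc_bytes * 8 / 1e9     # ~0.7 Gb/s of RPC payload
cores_to_fill_link = link_gbps / gbps_per_core          # ~150 cores

print(f"{pkts_per_core:,.0f} pkts/s/core, "
      f"{gbps_per_core:.2f} Gb/s/core of small RPCs, "
      f"~{cores_to_fill_link:.0f} cores to fill {link_gbps} Gb/s")
```

Well under a gigabit per second of small-RPC goodput per core is why neither single-core performance nor brute-force parallelization gets you to line rate.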
With that you could then change your protocols as you need them, do the policy compliance you want, scale up to really large numbers of connections, get CPU efficiency in the common case, and get performance. Essentially you want the flexibility side to be able to do whatever you want, and the other side to handle the common-case stuff at the speed you want without doing a huge amount of work. I'm going to make this argument independently of the FPGA-versus-ASIC question; Microsoft has its own perspective on this, which is FPGAs, but in some ways the argument I'm going to make here is independent of that.

You can think of this hardware assist at multiple levels. One layer, and the keynote talked about the extent to which Microsoft has already deployed and demonstrated this widely, is the packet processing at the VM layer: creating a virtual network that makes the virtual machine look like "okay, I've got my own little LAN here that I'm going to communicate over." That set of work is pretty regular, and it's pretty demonstrable that you can move it onto hardware. What I'm going to talk about is a harder problem, which is: once you've gotten these packets to the virtual machine, the virtual machine now spends almost all of its time doing TCP packet processing in the guest operating system.

That's inefficient from the customer's perspective: why am I burning all that CPU time on that? So can you do hardware acceleration there, flexible hardware acceleration inside the VM? How do you do TCP packet processing in hardware? Then you could ask the same question about application-level processing: you're handing packets off to an application that is also doing, essentially, layer-7 packet processing, so how do you optimize that? This whole set of things can actually be handled by the thing I'm going to talk about, called FlexNIC, which is really just a model for how you would program it. I'm also going to talk a little bit about getting this flexibility at network switches. In general, and this is where it ties in with the optical side, where you do have congestion, that is, where you have large amounts of communication coming in, generally at the top-of-rack switch or wherever, you are going to need to do something like resource allocation, largely in a programmable way.

The overarching lesson is going to be that common-case packet handling is systolic, which is to say that the steps you take to process an individual packet are regularizable and can be pipelined, not just packet to packet but even within a packet. As a consequence, you end up being able to do high-bandwidth operations even on a single flow, because you're pipelining the individual stages of computation. And I'll come back to the point I made earlier: anything that can be regularized and that you're spending a lot of time on will eventually be built into hardware; that's just how the economics of data centers is going to work. So now the question is how you go about this, and it will happen both on NICs and on switches.

The programming model we started with is essentially reconfigurable multi-stage pipelines; we're not the only people working in this space. You take the packet stream coming in, you have a programmable parser, and the packet goes through some number of stages of computation. These computational stages are protocol-agnostic. You can run at roughly a terabit per second in a single pipeline, Barefoot has demonstrated that, so from a NIC perspective, from an end-host perspective, that's pretty good; it will carry you pretty far. And the idea is that you get predictable performance: the stages all execute in parallel, and you can think of each successive packet going through the stages in sequence. As a consequence, from a single-flow perspective you're still processing everything in order, as opposed to spreading it out over multiple pipelines or multiple CPUs, as you might in a smart NIC, where you have to make sure that all the packets for a given flow go through the same sequence or land on the same processor, just for state management or ordering constraints. Okay, so what do these match-action systems actually do?
You can think of the simple cases: they can steer packets, they can initiate DMAs, trigger replies, and so on. They're doing simple things like: this is an application listening on a particular port, so what I should do with this packet is hash it to a particular core and then DMA it to a queue. Because they're regular, these operations fit the systolic model: you don't want loops, there's no complicated arithmetic, you have to be careful about what you do with state, and you can't have an arbitrary number of stages. It's not programming the way we would normally think of it; rather, you're programming for hardware.

So we designed a version of this from a NIC perspective. The way you would actually make it work is that the NIC has multiple pipelines: for received data you have a set of pipeline stages for packets arriving from the network, and, since the host itself is also delivering packets into the NIC, there's a pipeline for handling that as well.
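As a rough software sketch of what one such match-action stage does, an exact match on a parsed header field followed by a tiny bounded action such as hashing to a queue, something like the following; the class and field names are invented for illustration and this is not the FlexNIC or P4 API:

```python
from dataclasses import dataclass, field

@dataclass
class Packet:
    dst_port: int
    payload: bytes
    metadata: dict = field(default_factory=dict)   # per-packet scratch state

class MatchActionStage:
    """One stage: exact-match on one header field, then a small fixed action."""
    def __init__(self, match_field: str):
        self.match_field = match_field
        self.table = {}                             # match value -> action callable

    def add_rule(self, value, action):
        self.table[value] = action

    def process(self, pkt: Packet) -> Packet:
        action = self.table.get(getattr(pkt, self.match_field))
        return action(pkt) if action else pkt       # default: pass through unchanged

# Example rule: packets for port 5000 get steered to a queue chosen by a hash
# (a stand-in for an RSS-style core/queue selection).
def steer_to_queue(pkt: Packet) -> Packet:
    pkt.metadata["queue"] = hash((pkt.dst_port, len(pkt.payload))) % 8
    return pkt

stage = MatchActionStage("dst_port")
stage.add_rule(5000, steer_to_queue)
print(stage.process(Packet(dst_port=5000, payload=b"rpc")).metadata)
```

A hardware pipeline would chain a fixed number of such stages back to back, which is exactly why loops and unbounded state don't fit the model.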

There's a transmit pipeline to actually send data out, and there's also a pipeline for essentially programmable DMA behavior. You can imagine decomposing what happens on an RDMA packet, if that were the protocol you were trying to implement: you're essentially parsing the packet and then demuxing on an element of the packet to tell the DMA engine to do a particular operation. That step is pretty similar to what happens when a TCP packet arrives: I take the packet in, I decide which queue it goes into based on the port number, and then, based on that and maybe the sequence number in the packet, I know where to go from there. What this then allows me to do is transform these packets efficiently, DMA directly in and out of application data structures, so it has all the same benefits that RDMA has, and then send acknowledgments out on the NIC.

But potentially you can implement TCP with it. If you look at what TCP is doing, it has a lot going on: you're opening and closing connections, you've got locking on the socket API, IP routing, firewalls, congestion control, and very complicated connection state, often in the kernel. If you look at the kernel code, you're doing multiple steps, chasing data structures, doing pointer chasing, and a lot of it raises the question of how this is going to work in a systolic model. The approach we've taken is to split the operating system code into a fast path that is regularizable and a slow path for everything that isn't, and then hopefully the slow path only handles things off the critical path. You can think of the fast path as doing all the normal operations you would do on every data packet, generating data segments, applying rate limits, collecting congestion statistics, and so forth, all of it ending up running on the NIC itself. The slow path runs in the kernel and handles essentially all the other connection and control-plane operations, and the application ends up doing some additional amount of work. One aspect of this is that you're splitting up congestion control in a way that divides the work: instead of doing congestion control on every packet, you can think of doing it on a per-round-trip basis, collecting information on the data path and then using it in the background to enforce control, and that's actually sufficient for implementing the protocols we have.
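Here is a minimal sketch of that fast-path/slow-path split under my own assumptions and naming; per-packet work (sequence check, DMA, ACK generation, congestion counters) sits in a fast path that would live on the NIC, while connection setup, exceptions, and once-per-RTT congestion decisions sit in a kernel slow path. This is a mental model, not the FlexNIC implementation.

```python
class FlowState:
    def __init__(self):
        self.rcv_next = 0           # next expected sequence number
        self.acked_bytes = 0        # congestion statistics gathered on the data path
        self.rate_limit_bps = None  # written by the slow path roughly once per RTT

class FastPath:
    """Per-packet work that is regular enough to run on the NIC."""
    def __init__(self, slow_path):
        self.flows = {}             # flow key -> FlowState, installed by the slow path
        self.slow_path = slow_path

    def on_receive(self, flow_key, seq, payload):
        st = self.flows.get(flow_key)
        if st is None or seq != st.rcv_next:
            # Connection setup, out-of-order data, etc.: punt to the slow path.
            return self.slow_path.handle_exception(flow_key, seq, payload)
        st.rcv_next += len(payload)
        st.acked_bytes += len(payload)
        dma_to_application(flow_key, payload)   # write straight into the app's buffer
        send_ack(flow_key, st.rcv_next)         # generate the ACK on the NIC

class SlowPath:
    """Control-plane work in the kernel: setup, exceptions, per-RTT congestion control."""
    def handle_exception(self, flow_key, seq, payload):
        ...   # open the connection, install FlowState into the fast path, etc.
    def per_rtt_congestion_update(self, st: FlowState):
        ...   # read st.acked_bytes, compute a new rate, write st.rate_limit_bps

def dma_to_application(flow_key, payload): ...
def send_ack(flow_key, seq): ...
```

The point of the split is that everything in FastPath touches a bounded amount of state per packet, which is what makes it pipeline-friendly.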

This is a kind of performance graph I'll leave up here, showing roughly how well this can do. The yellow line is Linux, scaling up to sixteen cores, not very well. Then there's a software implementation of what I've just described, which actually does enforce congestion control, compared with a software version that is just a TCP implementation without congestion control enforcement at all; those actually do reasonably well. And then a kind of hardware version is that line over there. All of this is emulated rather than real hardware.

I'll just briefly talk about how a bunch of these issues also show up inside the network itself. Switches themselves can be programmable, and you can ask what we can do inside the network, using reprogrammable hardware. One of the things we've been looking at is approximate fair queuing. Fair queuing is one of these long-standing ideas, maybe 30 years old; I'm not actually sure exactly when it was, but I think it dates from almost before I graduated from grad school, so that's how long ago it was. The idea is that you get isolation and more predictable behavior out of your network if your queuing discipline, how you handle packets inside the network, is fair queuing. The challenge is that it's really difficult to implement at speed. This is just an animation of what it takes to get true fair queuing to work: you need a sorted packet buffer, you need to store and update per-flow counters, and you need to track global state that updates on every packet, all of which raises the question of how you implement it at multi-gigabit speeds; we were talking about implementing it at many gigabits.

So the argument is that you can use this kind of reconfigurable match-action model to approximate fair queuing. I'm not going to walk through the example here, I'll skip past it, but the idea is basically to use multiple queues to approximate the behavior, to store approximate per-flow counters in tables, and so forth. There's a lot of research going on about what you can do with reprogrammable switches. As for the outcome: in this chart, flow size is on the x-axis and normalized completion time is on the y-axis, so smaller is better, and AFQ is the blue line. As you get bigger flows it of course takes longer to complete, but one thing you see is that even against DCTCP, which is kind of the best-known congestion control protocol for data centers, you can do better as a consequence of having more intelligence inside the switch.
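To make the approximate-fair-queuing idea concrete, here is a toy sketch of the rotating-queue approach: a handful of FIFO queues act as scheduling "rounds", and an approximate per-flow counter decides how many rounds ahead each packet is placed, so heavy flows wait while light flows leave near the head. This is only an illustration of the idea, with invented parameters, not the published AFQ design or a switch program.

```python
from collections import defaultdict, deque

class ApproxFairQueue:
    """Toy rotating-calendar queue approximating fair queuing (illustrative only)."""
    def __init__(self, num_queues=8, bytes_per_round=1500):
        self.queues = [deque() for _ in range(num_queues)]
        self.finish_round = defaultdict(int)   # approximate per-flow state, table-sized
        self.current_round = 0
        self.bytes_per_round = bytes_per_round

    def enqueue(self, flow_id, pkt_len, pkt):
        # Each flow gets ~bytes_per_round per round; extra bytes land in later rounds,
        # i.e. later queues. A real design would also age/subtract served bytes.
        start = max(self.finish_round[flow_id], self.current_round)
        self.finish_round[flow_id] = start + max(1, pkt_len // self.bytes_per_round)
        offset = min(start - self.current_round, len(self.queues) - 1)
        self.queues[(self.current_round + offset) % len(self.queues)].append(pkt)

    def dequeue(self):
        for _ in range(len(self.queues)):
            q = self.queues[self.current_round % len(self.queues)]
            if q:
                return q.popleft()
            self.current_round += 1            # this round's queue drained; rotate on
        return None
```

The appeal for reconfigurable switches is that everything here is a small table lookup plus bounded arithmetic, which fits a match-action pipeline, unlike a true sorted packet buffer.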

Okay, I'll just leave it there. Thank you.

Question: You talked about TCP and kernel-bypass techniques, about isolating the kernel stack and using the hardware. How does the enforcement work?

Tom Anderson: The enforcement you can think of as the equivalent of virtual memory. I think the best analogy is that there was a period, about 40 or 50 years ago, when we decided that what we needed for multiprogramming was virtual memory hardware, and there was a period where people didn't know what that virtual memory hardware should look like, what exactly the abstractions were. What we ended up with was the notion of a table-driven enforcement mechanism controlled by the operating system kernel. That model is essentially the one we're moving towards: we're trying to come up with some new set of constraints, some new model for how operating systems interact with networks, where the control plane operating in the kernel is the enforcement mechanism, and everything else, everything the application actually does itself, runs at full speed. What VM hardware essentially says is: in the normal case you execute at the full speed of the hardware even though I'm providing this protection layer with some hardware assist. The question we're trying to figure out is what hardware assist you need to implement this at full speed.

Question: I have the microphone, so I'll ask my question anyway. A lot of the dialogue about RDMA centers on the use of resources that really live in the OS, such as the page table entries, the specific DMA mappings, and the DMA permission records, which are, well, tables, but this starts to exceed the amount of SRAM available, and most of these NICs don't have time to go off to DRAM on an average packet. Do you have an opinion on all that and where it's trending?

Tom Anderson: Yeah. There was, again, a really nice NSDI paper looking at how they map this onto Catapult: they have a two-level connection-state model where the SRAM holds the active connections and there's a DRAM backup for the rest. I think once you accept that your operating system is managing hundreds of thousands of connections, you need some kind of two-stage model. That raises a really interesting question: if everybody is actively using, say, tens of thousands or hundreds of thousands of connections simultaneously, does this kind of caching model work, and how do you make it work? The upside is that even if you have to fetch your connection state in from DRAM to do some work, as long as it's relatively compact, the amount of time a connection is actually actively using the NIC is relatively small, so the number of active connections inside the NIC at any point in time might not be large. There are also plenty of plain protocol issues that come from trying to run large numbers of connections simultaneously; TCP is basically broken once you get above a thousand connections. We ran a really simple test, just doing simple RPCs between two nodes with a thousand connections, and Linux will literally starve connections. It makes you ask: is that the best we can do?
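A sketch of the two-level connection-state idea mentioned in that answer, a small LRU "SRAM" table in front of a larger "DRAM" backing store with idle flows evicted, might look like this; the sizes and names are invented for illustration and this is not the Catapult design.

```python
from collections import OrderedDict

class FlowStateStore:
    """Two-level connection state: small LRU 'SRAM' cache over a 'DRAM' table."""
    def __init__(self, sram_entries=4096):
        self.sram = OrderedDict()     # flow_key -> state, kept in LRU order
        self.dram = {}                # backing store for all known flows
        self.sram_entries = sram_entries

    def lookup(self, flow_key):
        if flow_key in self.sram:                 # fast path: on-chip hit
            self.sram.move_to_end(flow_key)
            return self.sram[flow_key]
        state = self.dram.get(flow_key)           # slow path: fetch from DRAM
        if state is not None:
            self._install(flow_key, state)
        return state

    def insert(self, flow_key, state):
        self.dram[flow_key] = state
        self._install(flow_key, state)

    def _install(self, flow_key, state):
        self.sram[flow_key] = state
        self.sram.move_to_end(flow_key)
        if len(self.sram) > self.sram_entries:    # evict the least-recently-used flow
            old_key, old_state = self.sram.popitem(last=False)
            self.dram[old_key] = old_state        # write idle state back to DRAM
```

The bet, as in the answer above, is that the working set of flows actively touching the NIC at any instant is much smaller than the total number of open connections.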
Question: What you're proposing with FlexNIC looks a lot like P4. Are you trying to use the same programming abstraction as P4 on the host side?

Tom Anderson: You can think of it as very similar to P4. I think the difference is that P4 doesn't really have a model for how to program DMA, so what we added is essentially P4 plus that kind of DMA programming.

Host: Okay, let's thank Tom again. The next speaker is Sambhrama Mundkur. She's a technical lead in Azure networking at Microsoft, working in the area of host networking. Her primary work is on network offloading technologies, especially in the area of network virtualization, such as virtual machine queues, single-root I/O virtualization, and Generic Flow Table offloading.

Sambhrama Mundkur: Thank you. Hi, like he said, my name is Sambhrama, and I'm a tech lead in Azure networking.

I specifically work in the host networking group, and for the last couple of years what we've been working on is offloading the software stack that we have in the Azure host onto an FPGA SmartNIC, deploying this across our Azure data centers, and seeing great benefits from it in performance: latency, throughput, packets per second, and so on. I'll go into some of the details about how we went about implementing this. The agenda is a little bit of Azure background; the problems we have at scale; the SDN stack as it's implemented in software today and some of the challenges we face around per-packet processing in the host; the considerations that led us to decide an FPGA SmartNIC is the right choice going forward; and then some of the details of the architecture and design of the SmartNIC. We also have a lot of lessons learned, because this is pretty much the first time we're deploying a software and hardware co-design project at such a scale: deploying hardware at this scale and being able to debug it, monitor it, have alerts, and do all of that in an agile way, which is a cornerstone of Azure networking.

This is the generic slide we use for Azure services: at the bottom-most layer we have the infrastructure services, which include storage, compute, networking, and so on; on top of that we have the data services; and those support the app services like Azure ML, analytics, and so on. Azure networking has come a long way since it was first deployed. In terms of compute, we started from something like 10K or 100K VMs and now we are at millions of virtual machines; Azure storage has seen the same rate of growth, as has networking, where we've gone from terabytes of data to petabytes today. This is a general marketing slide showing the rate at which Azure is growing on a day-to-day basis. We literally have to support all the Linux VMs, because most of the virtual machines out there are Linux VMs, and we see strong growth in deployments and the onboarding of new customers.

We also see growth in the different kinds of features and requirements we have for Azure networking specifically, which shows the pace we need to maintain, because a lot of the features we add are customer-driven, either through issues customers face or through new features they need in order to move from their current on-premises network to the Azure network.

There are other ways to think about scale. It's not just about deploying a service or a feature; it's about how we handle failures, how we detect and monitor them, and the continuous feedback loop we need to be in. Believe me, we all get our on-call rotations, and it's not pretty to live through a week of on-call if you don't have a way to automate the detection of hard failures: switch, link, and node failures, crashes of software components on the nodes, and so on. At such a large scale it's impossible, literally impossible, to do this manually, or even without a highly sophisticated automated solution.

So one goal is detection of these failures and mitigation of them. Ideally we would know about an issue before a customer even notices it, so we need automation in place to detect these conditions, some hard failure going on, and to mitigate it, which could mean service-healing the virtual machine to a different node or, in our case, if the hardware is misbehaving, switching back to the software path. Then we always have the problem of soft failures, which are not obvious. If a link between two switches is down, it's easy to detect and easy to fix; but the soft failures we see are things like one in a hundred packets getting corrupted by one extra byte, and sometimes, even in practice, those go unnoticed. For example, in our case, when we offloaded a flow to an FPGA, the FPGA was adding an extra 32 bytes at the end of the packet, and that went unnoticed until some TLS-type checksum was computed and the customer endpoint saw the effect. So how do we detect those kinds of failures? A lot of the issues we see are also in latency: generally everything works fine, but occasionally the latency peaks at 50 milliseconds or so in a TCP-ping type of operation. How do we detect and monitor that at this scale? That's also very important, and again I'm stressing that automation is the key to all of this.

The other thing is that customers are highly intolerant of any virtual machine downtime or any networking downtime. Whenever we have to service any of our components, and we have a lot of components, in the virtual machine, in the host, and now in the hardware, serviceability is key. We cannot afford to bring down the host and we cannot reboot, so we try to update all our components rebootlessly, except for the OS, which we also try to update in such a way that the downtime is extremely small. Any networking component being updated or serviced needs to guarantee that we do not cause a network outage for the virtual machine.
We also try our best to keep TCP connections alive; basically, the VM should not be able to see a TCP connection drop. The key to all of this is performance: everything is measured at four to five nines on a per-customer basis, which includes network performance, host performance, VM availability, where customers should not suffer any kind of downtime, and the serviceability of all our components.

So how do we achieve SDN in Azure? Today what we do is what we call host SDN: our data plane runs in the host, and we have distributed controllers running across the regions. For example, the software load balancer has a regional SLB manager, for any VNET programming there is a regional network manager, and so on, per region, and we have controllers running in each of the hosts. Any configuration that comes from the management plane, via the REST APIs, the portal, or the CLI, however the deployment occurs for the virtual machine, eventually reaches the regional network manager, and those policies are programmed down to the per-host level. The controller on the host programs our virtual switch running in the host, and the data plane in the virtual switch is what applies the policies.

Obviously this data plane needs to apply flow policy to millions of VMs, and that's only achievable because we handle it at the host level, for however many VMs are running on that host. This is a diagram of our software stack; it runs in each of the hosts. We have the virtual machines hosted on the Azure node, or the host as we call it. Before hardware acceleration, what is presented to the virtual machine is a synthetic NIC, a software NIC: any packet that needs to go in or out of the VM always goes through the software stack in the VM, through the hypervisor, to the host. In the host we have what we call the VM switch, which is a layer-2 switch, really a delivery mechanism in and out of the VM. All the policies are programmed by the controllers using the VFP API, which is a simple match-action table API. There can be different kinds of controllers: something for load balancing, something for the VNET customer-address-to-provider-address mapping, ACLs, metering, and so on. Each of the controllers can program its own rules, which go into different layers of VFP, independent of each other; if we were to add a new feature, we could add a new layer and none of the other controllers would have to change any of their policy.

So what happens in VFP? Let's take the example of a packet transmitted from the VM. It lands from the VM, over what we call the virtual machine bus, onto the host from the hypervisor, and it hits the VM switch. VFP is built as an extension to the VM switch; originally, when we started Azure, all of the policy enforcement was done as part of the VM switch itself, but because of constraints such as a VM switch update requiring a reboot, we went with this model of a pluggable virtual filtering platform within the VM switch. Every packet the VM switch sees, on the inbound and outbound paths, is routed through VFP, and VFP is the brains of host SDN on the host: it applies all the policies and it goes through the layers.

For example, an outbound packet goes through the ACL layer, which handles the packet based on its configuration; then to the VNET layer, which modifies the packet, say by mapping the customer address to the provider address, which might also require communication with the regional control plane, for example if you don't yet have the mapping for the destination VM and need to wait for that configuration to come back before applying the change; and then to the SLB layer, where any VIP-to-DIP mapping would occur. This processing lets us have multiple layers, each programmed independently by a different controller.

Then we go one step further. Once a packet has gone through, if it is a connection-setup packet, in the case of TCP the SYN packet, you know the flow ID for this packet and the end result is always going to be the same. So you can combine the ACL, VNET, SLB, and NAT layers into the final action, which we call a transposition, and save that information, so you don't have to go through the individual layers after connection setup is done. We call this technology unified flows, and it is actually the basis for our offloading to the accelerated NICs.

So, again, this explains how the layering occurs, both inbound and outbound. VFP is also the forwarder: once it does the outbound processing for a packet, it forwards the packet, determines the target, and then does the inbound processing based on the rules set up for that port. The target could be another VM on the same host, for example VM-to-VM traffic, and we are seeing more of this for particular requirements; or another VM on a VNET in Azure, within the region or in another region; or some internet endpoint. It doesn't matter: VFP is just doing per-port policy application on inbound and outbound. This slide just shows the flow of a packet: the ACLs are applied, with a different action specified per layer, for example dropping certain packets with a 10.4/16 address; then the action at the load balancer layer; and then the VNET mapping at the final layer.

So what do we see as the challenge here? This works fine at 1G, and at 10G it works fine. But as networking speeds increase, there are more and more packets being generated and not enough CPU to handle the load. What are the options? We would prefer to offload this to hardware, but if we offload it to hardware using ASICs, you have a one-to-two-year cycle to update the ASICs, which might not be sufficient, plus constraints on what features are available in the ASICs, and so on. So we decided to look at FPGAs to solve this problem.
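A rough mental model of the layering and the unified-flow "transposition" described above: the first packet of a connection walks every VFP layer, the composed result is cached keyed by the five-tuple, and subsequent packets apply the cached action directly. The layer logic, addresses, and maps below are invented placeholders, not the real VFP rules.

```python
CA_TO_PA = {"10.1.0.5": "100.64.7.9"}   # toy customer-address -> provider-address map
VIP_TO_DIP = {}                         # toy load-balancer map (empty here)

def acl_layer(pkt):                     # drop or pass based on simple rules
    if pkt["dst_ip"].startswith("10.4."):
        raise PermissionError("blocked by ACL")
    return pkt

def slb_layer(pkt):                     # VIP -> DIP mapping for load-balanced traffic
    return {**pkt, "dst_ip": VIP_TO_DIP.get(pkt["dst_ip"], pkt["dst_ip"])}

def vnet_layer(pkt):                    # CA -> PA encapsulation for the virtual network
    return {**pkt, "outer_dst": CA_TO_PA[pkt["dst_ip"]]}

LAYERS = [acl_layer, slb_layer, vnet_layer]
unified_flows = {}                      # flow key -> single cached transformation

def flow_key(pkt):
    return (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"], pkt["proto"])

def process(pkt):
    key = flow_key(pkt)
    if key in unified_flows:            # established flow: apply one cached action
        return unified_flows[key](pkt)
    for layer in LAYERS:                # first (e.g. SYN) packet: walk every layer
        pkt = layer(pkt)
    final = dict(pkt)
    # Cache the composed result (the "transposition") for the rest of the connection.
    unified_flows[key] = lambda p, f=final: {**p, "dst_ip": f["dst_ip"], "outer_dst": f["outer_dst"]}
    return pkt

syn = {"src_ip": "10.1.0.4", "src_port": 12345,
       "dst_ip": "10.1.0.5", "dst_port": 443, "proto": "tcp"}
print(process(syn))          # slow path through all layers, flow gets cached
print(process(dict(syn)))    # later packets take the cached transposition
```

This cached per-flow action is exactly the kind of flat match-plus-transform state that can then be pushed down to a hardware flow table.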
What we have today is that we still keep the ASIC NIC that provides access to the virtual machine. There is a technology called single-root I/O virtualization, SR-IOV, which allows a PCIe virtual function to be exposed to the VM, so packets from the VM can go directly to and from the network. But if we did just that, it would bypass the host and there would be nobody applying the policies. So we go with a bump-in-the-wire model: the delivery and reception of packets to and from the VM still occurs on the virtual function on the existing NIC, and any of the offloads the NIC can provide, for example receive-side scaling for CPU steering, task offloads like checksum, and lots of others, can still occur on that NIC, but the per-packet policy application occurs on an FPGA that is programmed by our VFP. We use an API called the GFT API, which converts the unified flow we talked about in VFP into the settings the FPGA understands. And these are some of the measurements we got from running regular, you know, open-source tools like iperf and so forth.

You can see the compute D-series gets up to 25G, and we see improvements in latency, especially because with SR-IOV we totally bypass the whole stack. We see packets-per-second improvements, and you also reduce jitter, because you don't have software effects like getting descheduled by the hypervisor while doing VM network processing. This gives us more cores for VM processing, since the network work is offloaded to the hardware, and since the VM can get near-native performance, we can also onboard other workloads that need that kind of performance.

On the left side you can see what packet transmission and reception look like when processed in software: anything coming out of the VM goes to the virtual switch, which does the processing, and then out the physical layer. In the case of accelerated networking, it just goes to the network card and directly out to the ToR and beyond.

This is an example of how GFT actually does the processing. Like I said, the controllers program VFP as they do today; there's no change there. VFP still needs to be in the path for what we call exception packets, which are the connection-setup packets: for TCP, the SYN and SYN-ACK packets; for UDP, the first packet we see. Right now we are offloading just TCP and UDP, and potentially we would do RDMA flows as well. As I said, the unified flow gets created in VFP when it receives that first packet: the packet goes through the slow-path processing across each layer, which generates the unified flow, this is the flow and this is the action, and that is what VFP configures into GFT. GFT is the engine running inside the FPGA SmartNIC. When the FPGA receives a packet from the VM, say an outbound packet, it goes through its tables: there's a multi-level hash-table lookup, like he was saying, first in SRAM, where we have an L1 cache, and if you don't find the flow there you go to DRAM and look it up. If you don't find a flow for that particular packet, it's tagged as an exception packet and sent up to the host stack for VFP to process. If it does find the flow, which is the case once VFP has offloaded it, the FPGA does the transformation itself, bypasses the stack, and sends the packet out to the ToR. Similarly, when a packet is received, the same mechanism applies, but the FPGA indicates the packet to the ASIC NIC, which, using SR-IOV, indicates the packet directly to the VM; again we bypass the host.
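Putting the pieces together, a simplified model of the GFT hit/miss behavior just described, where a table hit applies the offloaded transformation in "hardware" and a miss becomes an exception packet handled by the VFP slow path, which then installs the unified flow, could be sketched as follows. This is illustrative pseudocode under my own naming, not the real FPGA pipeline.

```python
class GftEngine:
    """Toy model of the SmartNIC flow table: hit -> transform in hardware,
    miss -> exception packet punted to the host (VFP), which installs the flow."""
    def __init__(self, vfp_slow_path):
        self.flow_table = {}          # flow key -> transformation (the "unified flow")
        self.vfp = vfp_slow_path

    def on_outbound(self, pkt):
        key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
        action = self.flow_table.get(key)
        if action is None:
            # Exception packet: first packet of the connection (e.g. TCP SYN).
            pkt, action = self.vfp.slow_path_process(pkt)
            self.flow_table[key] = action        # VFP offloads the unified flow
            return send_to_tor(pkt)
        return send_to_tor(action(pkt))          # hardware fast path, host bypassed

class VfpSlowPath:
    def slow_path_process(self, pkt):
        # Walk the VFP layers (ACL, VNET, SLB, ...) and build the unified action.
        transformed = {**pkt, "outer_dst": "100.64.7.9"}   # stand-in for encapsulation
        action = lambda p: {**p, "outer_dst": transformed["outer_dst"]}
        return transformed, action

def send_to_tor(pkt):
    return pkt

gft = GftEngine(VfpSlowPath())
syn = {"src_ip": "10.1.0.4", "src_port": 4242, "dst_ip": "10.1.0.5", "dst_port": 443}
gft.on_outbound(syn)          # exception path through VFP, flow gets offloaded
gft.on_outbound(dict(syn))    # subsequent packets hit the flow table in "hardware"
```

The same picture, run in reverse on the inbound side, is what lets received packets be delivered to the VM over SR-IOV without touching the host.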

The other key thing here is serviceability. Serviceability is key because we need to be able to update any of the components without any kind of downtime for the VM. I have just one minute, so I'll quickly go over some of the lessons we learned in this area as well. One of them is basically that if the VM cannot suffer downtime, then we need a simple failover from the hardware to the software path whenever any of the key components is being serviced. That was one of the key things we did as part of the initial design, because we also need to service frequently: the whole point of using an FPGA is to innovate as fast as we can, which means our goal is to service the entire stack every month, in a sprint fashion.

Some of the lessons we learned: I would say the most important is that monitoring is really the key. It's not in my bullet points here, but monitoring is the key, because at this scale, where in hardware we tend to think "okay, you need to get a repro," that's not going to happen; a customer is facing the issue and you have to figure it out. So we need as much monitoring, alerting, and notification as we can have. The failure rate has remained quite low, and we have alerts for things like DRAM failing over time, power issues, and parts like oscillators failing, so we keep monitoring what the trend is. Another thing is that it's a mindset change as well, to work in a coexisting fashion between hardware engineers and software engineers and to build a team around that; there were initial challenges in understanding each other, and that's an iterative problem to solve. And the biggest benefit I saw first-hand is that when we have a new SDN feature to implement, whereas when we started off the software existed and we then figured out how to offload it to hardware, now, whenever we are doing a software feature, we ask up front how the hardware will be able to support it, because we do have hardware limitations in terms of tables and so on. That's another mind-shift we saw while doing this project. That's it; any questions, or am I out of time?

Question: I had a quick question about migration. It seems like there's a lot of dependency on the hardware; what about live migration?

Sambhrama Mundkur: Yes, we do take that into account, and at this point in time that does mean un-offloading the flows from the hardware back to the software, because we need to save the state of the VM as it is and then migrate it out. So we remove the hardware offloads at that point, save the state, move it, and then re-enable the hardware offloads at the destination.

Question: Do you anticipate any scalability limits with FPGAs as host networking scales to even faster speeds, say 400 gigabits per second, maybe even 800 gigabits per second?

Sambhrama Mundkur: We do see that. At this point in time the bottleneck is the host, because we still have to do the first-packet processing in the host. Once the flows have been offloaded to the hardware, it is able to scale at that level, but we make continuous improvements. You also have a memory constraint, which is a bottleneck as well: you have multiple flows, starting from a single full-size VM and going down to many smaller ones, and there is a limit at the end of it. And if any packet has to get processed in the host because of hardware limits, that is a problem we have to look at.

Size VM and then you go down to having multiple flows there, is a limit at the end of it and if any packet, has to get processed in the host because, of Hardware limits that is a problem, that we have to look at either. Yes. I guess a following question, um do. Programmable. ASIC switches like the barefoot switches. Have. The capability, to handle. All the layers here and also hit the cost point. As. An alternative, one that's and what scale up. Switch. Is like the barefoot switches the programmable, ASIC switches have the capabilities, to handle this as they, scale up in capacity, to I I'm, not aware but the one thing is we are doing all this in the host so that is something that we have to keep in mind like we have to support the hosts Sdn because, that's where we try to do all the policy enforcement.

2018-08-29
