Hello. Welcome. You guys having a great build, yeah. Excellent, I'm Jeffrey Stover a technical, fellow of Microsoft and it is my great pleasure to, introduce Scott, havens, Scott. Is a works. At jet comm, and it's, one of my things I like most about build, is now we're starting to bring in people, from the industry, and it's not just Microsoft, talking to you about our our, technology, which of course we love but, you have industry, experts who have been taking this technology to, solve real world problems, so. They can tell you where it works and where it doesn't so, Scott works at jet comm and Walmart. Labs, okay he's focused in on a bunch of the date of intensive, systems and, interesting. Note jet. Comm, is one of the largest if not the largest employer. Of functional. Programmers, okay, so if you got anybody and, if you or your friends or like want to spend your likes code, an f-sharp this. Is the guy you want to talk to now, Walmart in, 2016. Took a look at this digital, transformation and. Decided. That the best way know that if they were not on that bus, they were gonna be run over by that bus and they, did. That the best way for them the best and fastest, way for them to get on that bus was, to acquire, jet, comm, and so, now, Scott. And his team are taking the techniques, that they brought to jet. Comm, and applying. It to the. World's, largest, retailer. So, let's give a big welcome. To Scott, havens. Thank. For that wonderful introduction Jeffrey, so. You heard a little bit about jet comm you know that we are mass-market, ecommerce company, and, we've been dealing, with scale from our earliest, days and. In, 2016. We were purchased, by Walmart after. Purchase. For, a few months we had started some early, integrations, of systems, make. Sure that our catalogs, can kind of talk to each other make sure order, processing, systems, can kind of talk to each other but. Never any advanced. Full-fledged. Integration. Or reunifying the systems. There's. Always going to be politics, at a large company especially when two companies are being merged together, but. One of the reasons that Walmart, had bought Jed in the first place is because our tech looked cool it looked transformative. But, not everyone. Goes. Into that scenario being, convinced, that our techniques, are going to be providing, real benefits. Well. It wasn't long before we, were fortunate, enough to get the chance to demonstrate these benefits, and by, fortunate. I mean disaster. Struck in, the, middle of March last year a. Little, bit after 3:30, a.m. I, got, paged I woke. Up hopped on our bridge for a pager duty in case anyone wanted to talk about the problem and started, looking into it and almost, immediately, I was joined by co-workers. From several other teams it. Turned out our production, kafka cluster was down. Kafka. For, us is the primary method of communication, among all of our back-end services, we. Try to stick with asynchronous. Messaging. Whenever possible in Kafka's worked really well for us for that however. In this case it was, not working, it what everything, went to a dead stop. Before. Long we, realized it wasn't just down, but, dead, every. Single message in flight, in Kafka was gone customer. Orders, replenishment. Requests, catalog, changes, inventory. Updates warehouse. Replenishment, notifications, pricing, updates every, single one was gone we. Were going to have to rebuild the cluster from the round up now. This could have been catastrophic this could have been the end of the grand jet experiment, enough to convince, our new Walmart compatriots. That, that's. Technical, tenets sound good on paper but, don't work in a real enterprise, with compared, to tried-and-true, systems, so. What happened with that well. You're. Gonna have to wait to the end to find out first. Let's, talk about those, tenants those principles.
That Have guided us. What, does Jet do differently. There. Are several idiomatic. Practices, that when I was first. Interviewing, at jet the. Parts that made me really excited or, that they were so, big into functional. Programming and F sharp in, particular, I've, done a lot of functional, programming style, in c-sharp at other. In, other organizations, but. I was really excited to work. With a language that made functional, first made, doing, the right thing easier, there. Have been a lot of spikes. In. Early. Systems done, before I got there that showed that there it, looks like there could be a lot of benefits, from adopting, F sharp as, our language, of choice, another. Big thing one, of the early principles, I learned, joining, jet was that we treat micro services as stream, processors, and we meant then make everything a micro, service, that. Works really well with the functional paradigm then. The third thing that I tried in other places but, didn't, have a real good grasp on the exact right way to do it was, event, sourcing, I'll, talk more about all three of these things later, the. Benefits, as we'll talk about a little bit later include. Resiliency. The, fact that it's really easy to add features, in. Fact then it makes systems really scalable, make. Systems really, testable, and it. Gives you a time machine. That's. Not it's. Not a typo that is not me miss speaking you actually do get a time machine with this we'll talk about what that means in a little bit, so. Now. That we've got to throw these concepts, let's talk, about them in a real-world system. First. An aside about naming, systems, in, the early days at jet we were spinning out an e-commerce company which meant a lot of new systems and a lot of new teams in a short period of time having. A theme made, it fun made, it easier to come up with new names and the first ones were the obvious ones Batman. And Superman two. Years into building systems, and teams a lot of the big names were taken we, were in the early stages of planning out a new system and I tried to think of a name of a superhero who wasn't, yet popular enough to be chosen but, who might have some clout a couple years down the line so. Ladies and gentlemen for today's case study allow me to introduce. Panther. Couldn't. Get the rights from Marvel for any of the official, Black Panther media but I think this is a better option. Black. Panther is the inventory tracking, and reservation. Management, system on the, supply side it. Aggregates. And tracks all sources of inventory, not, just from jet and Walmart, owned warehouses, but, from all partner merchants, and all of their warehouses worldwide, then. On the demand, side it acts, as the source of truth for reservations, against, available inventory when. A customer, is checking out the contents, of their cart get reserved if, the inventory is not available. At that point the reservation, fails and either the items must be resourced. From a different location or a different merchant or the. Customer needs be given the option to choose a different item, between. Those two our, primary goals are to maximise on site availability, while, minimizing. Reject, rates due to lack of inventory, our. Secondary, goals are to, improve the customer experience by reserving, inventory, earlier, in the order pipeline, to. Enhance insights, for the marketing and operations teams, by, providing more, historical, data and better, analytics, and we. Wanted to unify the, inventory, management responsibilities. That, it typically spread across multiple, systems, of. Course with all these business, goals our solution, had a lot of non-functional.
Goals As well like, high availability, geo. Replication, and really. Fast performance backed. Up by SLA s. Now. Geo, replication, and performance are a fun problem in this domain because, actions, taken by any. In the world can affect the validity of any, subsequent, attempted, actions by any other customer, to, get the lowest latency we want to serve our East Coast customers, data, from, an East Coast data center and our, West Coast customers, from a West Coast data center but, we don't want one customer, in each region to try to purchase the last baseball, sitting in our Kansas City warehouse and both, be told that they successfully, ordered it now. Lots of you'll recognize this as a fundamental. Problem of distributed, systems you. Can get perfect consistency, guaranteeing. That the data is exactly, the same no matter which node you check which, is really good for making sure that you don't lose anything if an entire region falls into the ocean or, you. Can get really good uptime and latency using, all your nodes to make sure that a customer, can hit the site in that it's responsive, but, you can't get both at the same time there's. Always a trade-off a spectrum, of fruit consistency, to availability, this. Also informs our design choices as it must in many domains that, might have people. From different geographic, regions interacting. So. Let's jump into how we built Panther and how it solves these problems. Panther. Uses, the. Canonic. What we consider at jet to be our canonical building. Block for how to build the data, intensive. Systems we, call this the. Command. Processor, pattern you, some of you may recognize this from elsewhere. Where. We have a stream, of commands, that, come in that. Are processed, by, a process. Command micro service. The. Command, micro service and I'll go into more detail in all of these stages in just a second but it has a quick overview the. Command. Micro service will pull in the state whatever, state is relevant, to, execute, the command against. That state this. Execution. Will produce an, event, as output, the, event gets written into the. Its appropriate, event stream and. Then those events, are emitted. Via Kafka, via, event, hubs or, any other mechanism that you may choose, down. To all other downstream, systems. The. State, itself, is Bill. Over. Time with a snapshot service, you, don't want to have in a stream, that goes on forever and try to rebuild the state from that every time so. Instead, you cache the most recent, state of the stream, using. A in a synchronous service back. Into, a snapshot, data store that can be used in real, time with, lower. Latencies, and we'll, go into details, in all of these in a moment. So. First let's dig into the, first part what, does it mean to execute. A command and produce an event for. Us there's it's, a several stage process one. You ingest, the message this, could be coming in from Kafka be coming in over HTTP, coming, over some kind of cue any of these would work it. Could be in JSON. Format or, a protobuf, Avro thrift, and, any, of those can work as well. We. We've, tried. A few different a. Few. Different inputs, and a few different protocols, at a jet and we've. Been using just, made of JSON zipped and a, certain move toward protobuf, and, were very heavy users of Kafka. For our asynchronous, messaging. And HTTP. Direct connectivity when, it has to be synchronous. The. Next step is that we'll deserialize. That, payload, of the message into a strongly, typed command. Now. In f-sharp. We. Read, deserialize, all of these into f-sharp record. Types here's. An example here and the kind of data that we would normally see in a command they're usually not very heavy it, has an ID that, can be used for idempotence, the, whatever, the relevant. Data is to identify what. The context, is in, this case would be an order and a SKU or item. Identifier. At. Which. Fulfillment, node or warehouse. We want to be reserving it how much you want to reserve and then a little bit of metadata who's doing it and at what time.
Now, The set, of the commands are messages, that express, intent, to, change state they. Are named with imperative, verbs and then usually followed by a direct object. They. Could be update. Inventory, reserved, inventory, canceled, reservation, or ship. Order would be some. That you might see in this domain. Now. In F sharp we have really nice property, of being able to describe the entire set. Of commands, that could possibly be. Processed. By a particular, command processor, via, discriminated, unions if, you're not familiar with discriminated, unions in languages. That have algebraic, data types like, F sharp you, can do a match, against. Any particular. Item in descriptive Union kind of like a switch, some. Languages call it a choice type or a tagged union here, it's a discriminated, union it. Works well for message, handling when you have a set of known message types. Here. A couple examples you've an idea of what a discriminated, union might look like if you're not familiar with it for. Colors you would have three options red, green and blue and these. Are untyped, which looks like an enum but. You can also add types, to. Your discriminated, unions instead. Of just being a list, of string or numbers they can be full types themselves. Or, tuples, of types so, we have an example on the lower half where you might have a union, of shapes or rectangle, is defined by, its, height and its width whereas. A circle would be defined just as its radius. So. This is what it looks like when you have a set, of commands, in a discriminated, union, you. Have your the, different types in the discriminant union each, of which is referring to an entire, standalone, type of its own I showed, you the reserved inventory, command, earlier and. You'd have similar ones for update cancel, ship, and however many other commands, you'd like a few. Years ago when I was working entirely, in c-sharp I tried. Something like this a couple. Different ways with typecasting, I would have a base message class and then messages, would inherit from it or in another I tried, a message, of type T in both. I did typecasting, with a lot of switch statements, what makes this model really nice is that it's an exhaustive, set the, compiler, knows that you'd need to handle every, single, case if, I add a new command to this list and I don't add, the code to handle this case in every, place an item, from the commanded.
The Screen and Union is used the, compiler will tell me and I'll know that I need to go in and put. In the code that I need to have there. So. Once. We've. Completely. Deserialized, the strongly-typed command, we move on to step three where we retrieve. The state of it from the database for. Inventory commands. Just as an example for this domain the, appropriate identifier, is usually. The SKU ID plus, the FN, ID or, the warehouse idea where that's being stored, so. It will retrieve at that point say the current, inventory were the available, inventory counts, for, that SKU at that warehouse. Also. And this will be important a moment it retrieves, a logical, clock for that state a monotonically. Increasing. Integer that identifies, the last update that clock is called a sequence, number if. There have been a million changes, to this inventory a sequence, number would be 1 million. Raw. Inventory, counts aren't the only state in our domain some commands like ship order expected, reservation, probably, exists as well and will. Include the order ID to. Retrieve that matching reservation, state in addition to just the SKU availability, state so. You can get a lot as complex, as you want with the amount of state that you need to retrieve, so. Once you have the command that's, describing, your intent and the current state you want to change you. Execute, the command against, the state executing. The command itself does not change the state but. It executes, the business, logic that usually consists of validation. Rules in some. Cases like, update, inventory the, command is coming, straight from the merchant or straight, from the warehouse management service they're the source of truth so the validation is trivial you just reset the the quantity at that point. But. Some other commands may have some real validation, rules Reserve, inventory, it would be a great example but, still simple enough to talk about here is the quantity, being reserved, less, than or equal to the quantity available, to sell great. In, cases that case it worked if. Not then, it failed. So. If executing, the command itself doesn't change the state what, does it do it, emits, an event, as output. An event. Indicates, that something has happened. It, usually flips, the syntax, of the command that has triggered it with a past, tense of the verb instead. Of update inventory we now have inventory, updated. Instead. Of reserve inventory, we have inventory, reserved. And so, on and. Failures. Are events, too if, the. Reason instead, of getting an inventory reserved, if, the quantity wasn't enough you might get a reserve, inventory, failed event, as output, it, covered the idea, is that the events cover every single potential domain, output, from, the execution, of the command. So. With, that output, event in hand you commit, the event to, the event stream it's. Only provisional. Until. It's committed you. Can attempt to write the new event to the stream at the next sequence number if. The right is successful, with no conflicts, then, the event at that point has officially. Happened, if the. Right attempt fails do a canoe, to a conflicting, right say, another writer has, re-written, an event with that particular, sequence number then, this event is invalidated, we. Know that the state has changed or potentially, has changed, so we have to re retrieve, the change state and re execute the original, command. Now. It should be noted that this model of processing, commands, or of processing, a stream, of commands, lends, itself extremely. Well to a functional, style of coding you're. Starting, with your immutable. Inputs your, command, which, is just an immutable message, and your current state which, is also immutable, you. Run it through a chain of stateless. Functions. Decode. Your. Retrieval. Of state your, execution, of command your creation, of an event and then, you write a new immutable, output, to, your datastore this, works great, for free in a functional style.
So. Now, we have, an event stream stored somewhere, how. Do we build the current state from that. Well. It turns out this is also really, well-suited for functional, programming. There, are three things we have to do we, have to define a zero state in. Our. Domain this is really easy it's, just an unknown, SKU at, an unknown warehouse, with zero, on hand that's. What we use when we're starting from scratch in the event stream we. Need an apply, function, that, takes a state, and an, arbitrary event, as parameters. And produces. A new state as output. This. Is where the meat of the business logic usually. Is in most domains and that's the case here. After. Inventory, updated, event, the, on hand, the. State will be reset, to whatever the merchant told us in that event the, count of completed, in, other words shipped, reservations. We were keeping around for, accounting purposes can, be reset, to zero we don't need to worry about those anymore we have a new canonical, count, after. Inventory. Reserved. The, on hand would stay the same the. Total reserved, would, increase, by some, amount and thus. The available, to sell will be decremented by, that, same amount. Finally. We have a fold, function, that, takes, the apply function we just defined a, state. And a, sequence, of events to apply so. We, can start from scratch have. An arbitrary, sequence, of events and keep. Folding in those events until we get to the current state and this. Is exactly what snapshot. Event stream does it. Reads in a batch of commits just committed events from the change feed retrieves. The latest snapshot from the snapshot, data store applies. The entire batch of events, and overrides. The old snapshot, with the newly updated one. That. Happens in our domain it's a little more tricky since we have multiple, ways of snapshotting, a stream would be the latest quantities, along, with the latest, state of each individual, reservation, in that stream but, the idea is the same. Now. While. It described doing this, after. The fact after the event is written in snapshot, event stream, it's. Important to know you could also do this in the, execute.
Command Microservice, if the snapshots, are not up to date especially. If there are a lot of a lot of events coming in or a lot of commands coming in that are producing a lot of events maybe, the asynchronous, snapshot, er isn't, always keeping up you know with every. Last event in time for the next one so. The command. Executor, can. Not. Just get the latest snapshot from the state but and get the latest snapshot and any, events written since that snapshot the, snapshot has a sequence number associated. With it so if the snapshots still only at 1 million even, though the stream is progressed by a few events you can query the event store for, all events that are greater than 1 million apply, those 3, resulting. Events to the snapshot in memory and continue. On your merry way. Now. We've learned. A lot in, in. Our. Practices, of what. Makes good event stream design and we've, made a lot of mistakes along the way as well, since. Event streams are defined as every, event that happens, to an aggregate. Over time in a, linearized, or strictly, ordered fashion. We. This, is a constraint, that has. A lot, of real-world practical, implications, as well. Different. Event streams don't always have a global ordering there's nothing, that forces, a given event from stream a to, have happened before or, after any given event from stream B you, may know every single change that's happened to the item and warehouse, over here and you may know every change from the warehouse over here but you have no idea which came first. We. And, we want to define an aggregate, a stream, as narrowly. As possible, because imposing. Order has a cost, usually. In the form of a minimum, amount of time it takes per, event which. Inhibits, scaling, throughput. On a given stream we. Started, in our particular domain with, defining. All events that happened to a particular, item because. It was convenient we. Eventually realized that, item, at a given warehouse, is more, appropriate, we. Were conflating. Independent. Events in the same stream a reservation. On a box of tissues in warehouse. A in, New. Jersey has, zero. Impact on whether we, should be permitted to reserve a similar. Box of tissues in a, different warehouse in California. We. Found that then we could achieve higher, maximum, throughput by splitting if you, have your item spread, evenly across ten, warehouses, by, using 10, different, SKU, fulfilment node streams instead of just one SKU. Stream we can get ten times the the, throughput because of that. Now. Because streams are independent, or once you have streams that are independent, it's good to note that the number of streams should be able to grow arbitrarily, and distribute, easily, it makes, it a really good fit for, for. Storage systems, that happen. To be partitioned, and achieve their scalability, through partitioning. So. Given, that idea. Of what the core flow of data. Intensive micro services or, a set of micro services at Jett looks like how does that fit into a larger architecture. In. The middle you'll see that same core flow with a lot of extra stuff on the outside and that's, not a mistake you'll see this pattern over and over when you look at services at Jett the, center core flow with the command commands. Command processor. Event. Streams that then emitted, as events and what. You'll see that's really different in the architecture, is different, inputs, and different, outputs, in. This. Case, for. Our domain we have both asynchronous, inputs and we, have synchronous. Inputs, the. Asynchronous ones on the bottom as I mentioned earlier usually come in through Kafka our inventory, updates our order updates these, in large part are coming from external merchants, where, there.
Might Be some kind of delay maybe just a few milliseconds, but it works really well in an asynchronous fashion on. The, topside coming in we have the customers, who are actually. On the website checking. Out as we speak we. Want to be able to give them an immediate response every time so, they can know whether, that item, has actually been reserved or not we don't want to make them wait, any. Longer than would, be expected so. Instead of going in through. We send those commands in over HTTP, we, have two different command processors, one hosted in HTTP, versus, a Kafka. Reader, but. The domain, logic that's within them is exactly, the same, they. Both write they both work the same way and write events to the same data store. Now. On the other end the. Events, getting emitted from your system may. Not, everyone. Who's listening to these events may not care about every, single event that you have they. Might need some kind of transform, into whatever. Format they desire or they might be able to filter. It down to just the events that they want so, you'll see that in the upper right hand corner. We. Have the sellable, inventory, in a raw form that's, being emitted out as an event over Kafka, but. For our different Mart's, whether, we're actually selling, an item on walmart.com, or. On jet comm on a needle or any of our other Mart's, they, may not want care about the inventory at every warehouse or. Every single SKU hey. Needle which is a site, that is selling primarily home goods probably. Isn't going to be caring about our, grocery, items that we're also tracking so. We filter it out in, that way it also gives us a chance to put in any kind of buffers. Or any other kind of Mart. Specific, logic for any particular downstream, consumer, but, the key is that having. Those filters and those Maps don't change the core processing, flow. Now. What were we using or. What how do we write the, events, to a datastore. Originally. And some of the systems at jet still do this we're, using an. Open source product called event store it has a great, API for, streaming. Events and be able to read back a stream, of events it's. Very clean and it, turned out it does. Not scale, to, the performance, and geo replicated, characteristics, that we want so. Instead we, ended up with cosmos, TV. We've. Been most. Of you at this point have at least a cursory, knowledge of what cosmos TV is we've. Been we found it by far the best choice for designing, an event store on top of it heard - all of a lot of the other competitors that we've gotten to look at and. We've been lucky enough to work with the cosmos TV team on the future design to, help make an even better event store, so. We use it in several different ways one, is for, the event, stream storage, itself, we've, written an API that looks very similar to the event store API with which we were familiar before, that. Reads and writes JSON, events the storage is partitioned by a stream, ID and each, event is stored as a new document, within, the name of the of that document. We have an incrementing counter that matches, to the sequence number mentioned earlier.
We. Take advantage of the tunable, consistency. Models that, Kosmos DB offers. The. Next loosest, is called, consistent. Prefix, consistency. Without, going into what technically. That means one, very, important, guarantee that it provides is that you can't accidentally. Create, two documents, with the same ID the, second writer will get an exception and this. Is how we implement, optimistic. Concurrency, now. This is something that a lot of the competitors can't, provide at the scale that we're talking one. Data. Store that we looked into heavily, in building the system was Cassandra. Unfortunately. Cassandra, in a Geo replicated. Environment. If you have something on the East Coast and the west coast because. It is multi-master, and only, multi-master, can't. Provide us the consistent, prefix, model without going, into full so-called. Lightweight, transactions. And, unfortunately, what that means from, a performance perspective is, that if you have an East Coast and a west coast data center trying, to maintain a consistent prefix. You're, doing, four four, hops - for. The consistency, model. Called pesos which. Reduces. The throughput down to about four, events per, second per event string which. Just, isn't enough scale for what we're talking about. We, also use cosmos to be as snapshot. Storage, as. A document, database it's well suited for low latency high, throughput reads and writes we. Get in practice. One, millisecond, reads that's including, the work and, the writes are usually, in the five to six, millisecond, range, for us the. Snapshots, that we write the roll-up of the entire state is fully indexed, and queryable, with a sequel, variant language so. For a given stream we store both, the roll-up. Of the item warehouse, state for. Instance the on-hand, quantities, the available to sell quantities and all, of, the reservations, for items as snapshots. The. Third feature that, we, found as a to, be just a killer feature in cosmos DB it's the change feed we. We're. Also lucky enough to work closely with them on the, design of this so that it would work really well for an event sourcing mechanism, every. Single change that gets written to the datastore once. Committed, is immediately. Available through, the change feed now. We've modeled this interface after cuff, consumers, so, some number of consumers, divided the database partitions, among themselves rebalancing. As consumers, drop. Out or some more spin up you. Can regularly, checkpoint your progress, and those, checkpoints, are then geo replicated, along with the data store and you, can start they can start, the consumers, from the latest. Point in time from, the beginning of time or from. An arbitrary date, time in between. What. Makes change feeds so nice is that it, eliminates dual, rights which, we found in practice to be one of the biggest causes of divergence, state and non. Resilient, systems when you're talking about a distributed, system. The. Common pattern where you need to write to a database and, notify. An external system that, something has changed or. You are writing to databases, or. Anytime you're, running multiple times, from, the same thread, and hoping, that all of them work or trying to coordinate it with distributed. Transactions, it just doesn't work very well you, will get divergence, over time, with. A change, feed based design, everyone. Who needs the day we'll get it eventually there's. No such thing as undetected. Divergent state only. State that hasn't yet converged. With, a constantly, known measurable. Time to convergence. This. Is another thing that when, looking at the competitors, we couldn't find another one that did this well. To. Use Cassandra again as an example and, they did finally. Introduce. A change data capture output. Which works, similarly, at least on the surface but. It is a limited, amount of time and storage. If the change data capture is, enabled. But the consumer, from, Cassandra isn't, live in keeping, up the, pre-configured, buffer will fill, up and right so that table will stop the, intent is for a single consumer to read and then delete, the log contents, but, you can't start consuming to, an arbitrary, date time and pick, up from there especially. Not from the beginning of time if you, really want multiple consumers you need to coordinate and have only one purging, completed data. It's. A good start but it is not just not as full featured as the cosmos DBA change feed it.
Has A lot of other features of note as. You probably know it is fully G replicated, with controllable, failover and a59, uptime, SLA, the. TTLs are assignable, on a per document so, you can automatically. Delete old records, from, the event store if you don't need them anymore, we. Found, this to be really helpful in dealing with the large amounts of data that inherently, gets stored in an event store if, a. Particular. Event has been sort off to a cheaper. Bulk, cool storage location, then, we may not need it in the hot event store anymore and we can let it expire. In. Cosmos, DB gives us really, easy scaling, including. An API so we can algorithmically, control, auto scaling this, is something that we've started introducing where, we either based, on the. The. Number. Of request units being used in cosmos, or in some business, level metrics such as how, far behind, we are in processing, orders, we, can scale up or scale down our, most collection. So. This design as I promised earlier does have a lot of benefits. The. First one I want to talk about is resiliency. And just. Make sure we're all in the same page what. Is resiliency. To. Me it means when a system stays functional, with little or no customer, impact in case of an outage or even, just a partial outage it means that recovering, from these outages is trivial. And it, means that any errors, that you get are either, trivially, corrected, yourself, or, self-correcting. So. What are some types of problems that this. Architecture, helps us solve in. Resiliency. In the domain of resiliency, one. Is when bugs, in code produce, our own bad state if. We find a bug we. Can fix. The bug redeploy. Our code, rollback. To. Whatever the, command. Stream to whatever point in time came before the bug and replay. Everything, up to that point to fix the error, we. Might have out of order reservation. Workflow, events we, have a bunch of different sources from in the architecture, diagram it could have been from any number. Of third party merchants, or management, systems, third party api's, warehouse. Is all of these different sources of data they, are not synchronized they, may try to be and they may be most of the time but, they're there's no guarantee that they will be all the time so if commands, and events may get out of order so. How do we deal with it well, we've adopted a lot of learnings, from what. Are called CR T T's, conflict-free. Replicated, data types in other lattice. Based structures, where. If we adopt, a certain number or a certain set of principles in our domain design it, gives us a lot of flexibility, in how we're processing events in in, order or out of order we. Need to make sure that our events are idempotent. If we process, an event multiple times it. Shouldn't matter we should still end up in the same state, we. Need to make sure that our events are commutative, if you process event. A and then event B and then, swap and process, event, B before, event a you, should still end up in the, same end state. Now. You can do that, by. Adopting. Some principles in how you build your state and in. In. An event sourced, world. Like we have here that, end state is these snapshots that are getting rolled up and stored in the snapshots store we. Try to make sure that all of the state is additive. Only, it. Can either mean that it is literally. Additive, like we're taking two integers, and event, a says add on five and event B says add on four we're. Going to end up in nine no matter which order we process them in. Some. State. Changes, some events aren't as obvious as, that but. We can still try to design it with a monotonic, state change make. Sure that, the. State, always has, only. A unidirectional way, to end, up in a particular, point if. A reservation, that, we. Have early on when the customer, places an order eventually. Gets shipped and we have two events for that place, the reservation, and ship the order it.
Shouldn't Matter what order in which we process them because reservation, once shipped can never return back, to the just, created stage so, if we process the order shipped event first it, doesn't matter we should still keep that marked as shipped, even if we later process. A create reservation. Other. Bugs, downstream. Consumers, may. Be caches, of our information, may. Have their own bugs or they may have data loss maybe their system goes down what. We can do in that case is replay. Ourselves, either, they can pick. Up from. Whatever's. In Kafka if it's being read asynchronously, and they've, only lost maybe the last days worth but. If they need a full, replay back from the beginning of time or from weeks ago we. Can replay, ourselves, from our own data stores change feed, we. May have random, restarts of compute in. This model that works fine all, of the compute, is stateless, if it comes down maybe. Some partitions, won't be processed, for. A few seconds until that. Gets rebalanced, but, we should still end up okay we. Might have a total, loss of the coffee cluster that, might sound familiar if you're thinking, back to the beginning of the presentation and, we'll. Talk about techniques, we can solve with that or how we can solve that problem in a moment, when we talk about how we actually did solve that problem and, we might have problems, as catastrophic as an entire region going down, now. We can solve that by running hot-hot where applicable, or, you can with when we're using cosmos, DB we, can switch the right regions for the data store in seconds, and if. We don't want to be hot. For a particular, system maybe for cost reasons we, can still scale up a secondary region, of compute in seconds, with modern techniques like using. Docker and some kind of. Sub. Service like nomads or communities, to, to. Schedule. The on a new, cluster. We. Found that another benefit of this model is that it's really easy to add features. Here. We have a feature called the merchant snapshot, expiration. Just. Because an cuss, or a merchant, has told us that they have 20, units of something 3, months ago it doesn't mean that we should still believe them they. Might have done might have sold that off maybe, maybe. They're just not very accurate for whatever reason, so. We decided to introduce a service that after, a certain amount of time will. Delete. That that. Snapshot I will just zero it out because we don't believe them anymore and we. Found that when we implemented it it was really, good business win for us it dropped our reject rates very pretty, quickly now this, entire service, was. Where a set of services was developed by a single, engineer, with light f-sharp training, and no, cloud micro service background before this went. From design to production in three, weeks. It's. Because this system in. Order to develop as this, additional, feature we, were able to just add on to what was already there with, very few changes and an easy to test that everything worked out ok. Then. One, of my favorite benefits from event sourcing is that it gives us a time machine what.
I Actually mean by that is that with, the, stream of events we can see, the state at any, point in time we. Can walk step-by-step, through everything, that happened this is great for troubleshooting this. Is great for when a, customer. Comes to us or a merchant comes to us says they, thought they were getting changes. Into the site they're uploading them but, they never see anything and or, maybe they put it in and it's there and then it disappears, five days later they have no idea where it went with. A time machine like this we can validate the behaviors that people are observing, if people, report the problem around, a specific time we can go back in time and then, reabsorb, it and do a root cause analysis, you, can do this days weeks. Months. Later. Having. This event stream be able to go back in time to any given point is fabulous. For, troubleshooting for, auditing, for analytics, so, many things. This. Model, scales, really, well it. Practice, in at jet we, deploy, all of our compute, and, schedule. On nomad clusters, you, don't have to use Nomad a lot of other ones are popular but that's worked well for us all, of our all, of our compute, is then independently, deployable, independently. Scalable our. Datastore is, then. Scalable. Within seconds as well since we use cosmos, and. Not, just scalable in terms of throughput but in terms of size the art there, is now automatic, splitting, in cosmos, so that if partitions, get too full if there's too much storage, being, used it. Will continue to grow on its own over time. In. In practice this has worked really well for us overnight. We went to 250, percent of our normal throughput, when a particular, merchant started. Dumping way. Way more information than we needed. For. The most part just repeats, of data that they had already sent us and. While. We do have some filtering on the API side a lot of it did still get through to. This degree. It. Wasn't a big deal for us we saw that it it was happening we scaled it up to 250%, to, deal. With the flow until the merchant, was able to fix it when. That was done or done we just scaled it back down within seconds, no. Problem. The. Last benefit I want to talk about with this is how testable. This model is. 100%, of the domain logic in Panther, is unit, testable when.
You Execute, the state of where. You execute, a command, against. The state and you produce an event, that. Entire, piece of logic has no external dependencies, when. You apply, in a new event to an old state to, produce a new state you. Have no external, dependencies, you. Can identify every, single path, through the business code and test, every, single one not. Just write unit tests but, you can write executable. Specifications with. Whatever invariants. Need to be checked with, large numbers of randomized. Inputs extremely, quickly there, are a lot of tools out there that can help you with that one of our favorites is FS, check all, you need to do is say that inventory. Should. Never be able to drop below zero no. Matter what events you add into the onto, this state and it. Will be able to test that and make sure that no matter what combination that, continues to be true if not, it'll tell you that, is a huge win enable, in being able to write, new domain logic quickly, and get. It into production knowing, that you have all your bases covered. So. Let's, go back to this story is talking about at the beginning where we had our coffee outage. How. Did we actually deal with that, it. Ended up being no, big deal we. Got all the teams, on. To the online who. Are especially the ones who are at the edge of the company exposed, to the outside world all of our merchants, API inputs, all of our customer order inputs, anything. That touches, the real world while. We were doing that we rebuilt, our Kafka cluster just. Set. It up in a new, location. Completely, from fresh had no data in it but, getting it ready to go so we could replace the old one we. Scaled up all of the systems to be ready for, this by an order of magnitude or more and then. Once all of that prep work was done each, team reset. The checkpoints on their event streams to, a point, prior to the outage if caused. All of those systems to just remit, all their events to the downstream consumers, who just reprocessed again with, no, problems, and because. We were scaled up so far we. Were able to catch up very quickly, and get. Back to where we were and, with. No loss in any data at that point whatsoever. This. Was. A huge win for us in establishing. Not. Only that all of our processes, worked, but, all of our design principles worked. Really well in order to deal with the kind of scale that we needed at Walmart. So. Going, forward what, are we going to be doing jet, comm is now a part of Walmart, Walmart. And jet comm working together to, unify, our back-end, systems, but, in the meantime we. Still have to deal with the fact that all, of Walmart's. Servers. And all of their systems exist. In data. Centers where all of jets are, up in Azure in the cloud so. We're moving forward with a hybrid, cloud approach we. Have, services. In. Each side that connect, through Express route, just. Like the systems. Are talking to each other in the same data center it's. Proved. To be a very. Scalable model we've. Added. Some new data, centers that are close. Geographically. To the data, centers that Walmart. Has physically, and.
We're Able to keep Layton sees really, low for, all the communication, between them. Now. Once. We have that in place we'll, be able to move the data and data, and systems into the cloud gradually, we're. Identifying all, of the points where the system's interact, a, lot. Of these are in. Kafka a lot of these are HTTP. And, but. We find where those integration, points are. Now because, we followed the asynchronous event, model and we're. Able to read. Events from the same stream with new systems piece-by-piece. Introduced. A new micro service that is reading from the same stream as an existing service and confirm. That in parallel, that they are working if. Appropriate, we, can, do. In some a be testing. Or a, canary, testing, where we slowly, start to migrate traffic, over from one piece of that interface to the other if. The interface, is more complex and there are more integration, points we can still do that gradually, if. There isn't a good integration, point already, in place we, introduce a strangler pattern where we set up an explicit, interface and are, able to overtime, gradually. Migrate, features. From, the old system to the new system. The. Goal here is to eventually, get a. Significant. Portion of Walmart, systems running. In, Azure. So, we get all of that scalability. While, still taking advantage of all, of the data center resources we, already have in place. We. Found that functional programming has been a huge boon in helping. Us write code quickly and get it to production without any problems and. It. Turns out we've been very happy with all, of the principles that we've adopted originally, and hope to continue using them for years. Two. Books that found very useful and. Recommend. That everyone on teams. That I that. Either I join or when, people join teams that I'm on are. Designing. Data intensive applications. By, Martin clapman and domain. Modeling made functional, by Scott, WA Washington, the. First one is. Six. To seven hundred pages of in-depth, but very readable analysis, of how to build systems that. Run at scale, that meet. All of these principles of consistency, that I've talked about and, help, you understand, what. Works and what doesn't, where. The problems can can come in and how to avoid them the, second one talks, a lot about using. A lot of the techniques that are in F sharp and. Other, functional. Languages, to. Build, the kind of domain models that I talked about earlier where. The, domain. Is very explicitly.
Matched, To where the domain code very, explicitly, matches your business domain without, a lot of the problems a lot, of the runtime errors that you might get in other places. Wow. That was amazing. So. We, have some time for some QA, I'm. Gonna start with a few but please line, up at the mics we'll have some questions well, this is fascinating stuff, you know you you talked about resiliency. The way I like to define resiliency. Is robustness. In the face of change and, in physics once, the. Professor. Once described this as if you had a cereal, bowl in a marble in it there's two ways to configure, it one, this way and you can bash it and the marble rolls around and always ends up the other way at, the bottom the, others to take that cereal bowl and move it this way and, put the marble on top and, then as. Long as nothing changes, it can stay that way but the least, perturbation. Of the system it. Never returns to the normal state and so. What I think what, I like so much about your talk was it, seems to me that you've got a very robust, system, because. You have, managed, your state very, very well so I had a couple questions about that sure first is I followed, a lot of it but not all of it do you have a white paper that that that does that, describes. Us in more detail I suspect. That we do but I don't have a link for it right now we have some blog posts I know at the very least I go in more detail I can. Find some links for those that'd be great, and then can you talk about the. To. What degree. Does. You get this great state management, through functional programming, versus. Say if you used as your functions, would you get the same sort of thing with as your functions, as. Your functions, have I. Have. A lot of hope that they will develop into what. I would like to see our micro services being the. The. Idea with Azure functions, is very, similar that they get the input, message they, run some kind of function do have some kind of output it's. Very much like our micro service model, the. Biggest. Concerns I have with other functions, right now are how quickly they, can scale up and. What kinds of Layton sees are dealing with yeah when, you're talking, you know three four or five milliseconds. Latency, as your desired or, as your requirement, that's, the biggest, error where they where they don't quite meet the bar yet but, I have huge hope that stateless, computing, like. With Azure functions, over time will get to that point and be able to replace this, entire infrastructure, yep. And then you. Mentioned that the messages. Were in impotant but I didn't follow that how, are they in a pond like say you say, update our inventory. Exactly. Like -, - because. Some of the order - how's that it impotant is it through the ID itself yeah there are a couple different ways that we achieve item potency, one. On if. It's an update, inventory, event. Or command. Because in this case they're nearly interchangeable, it's. It's. Not adjusting. The inventory, up by - or down by - it is just saying this is the state right now we, have two in the warehouse so, as many times as you tell me that you have to you, still have to mmm. Now for, the. As. Some of the others do involve Delta, events, like the reservations. Where maybe we need to set. Aside two for this particular user and that. Is part, of the state that we need to store we. Fold over that state to make, a reservation. To. Say, that this particular customer has, ordered to of this particular, item from this particular warehouse and that.
Is, Now on order, but not yet shipped if we, get another, reserved, inventory, and for that same customer it'll. Have the same order ID which. Itself. Is enough to to help but the same order ID the same item, ID the same quantity, from the same warehouse it. Matches up with the state that we already have there, is no potential. Delta, against, the current state and that makes that item that particular, item potent I see I think I've followed that ok very cool ok, I got two more questions so you guys should line up if you have some. The. First was in. This functional, programming model and you're you have such well managed, state it seems to me that it make. It easy, to roll, out updates to your application. Is that the case it, really is how the, versioning. Is, it. Turns out in practice has not been nearly as hard as a, lot, of people are often worried about and, as a as much as we've been worried about when. We want to make. A new version of an event we, just make a new event that's v2, we. Right, there make the changes to our command, processor, code to support it we, make, changes to our snapshot, event, stream to support it and. Then we. Just roll out those changes. We're. Replacing, in place the existing, code and then, once we, just make sure that that's there before, the new version of the, messages come through. It. Works, really well we keep we do on the. The problem with it is that, if you maintain, the same code base over. Years, and years you may end up having, code. That needs to handle, not just, the latest version, of an, event but maybe your, maybe your on version 20 and they're denied. 18:17. In. Practice. We found that that's not a big deal maybe, that's just because we're a young company and haven't had, three. Decades of different, events that we need to version but, at least we're talking on a few year period it's very easy to just, write the code that supports a new version deploy. It, then. Let the new version come through and everything, works all right let's go here for a question. Talking. About the time. Machine we're, mostly talking about replaying, the past or there also things where for like forecasting.
And Maybe, looking at what you're ordering. Or, what, styles, are for like an upcoming season, or something where you have to predict the future with this stuff too and do. Different things using the same kind of models, yeah. The the, models. End up being different but, the historical, data ends up being critical, for that while. Talking. About order. Forecasting. It's. A little bit far away from the domain which I do have experience but. One that is very similar is. Using. The historical data to predict, how, accurate. The incoming. Inputs. Are we. May have and we do have some merchants, that maybe. They tell us they have 10 so, we try to send them ten orders and two, of them get rejected and, maybe this happens a lot we. Can use this historical, data and this, is something that we're actually rolling, out in a couple weeks. Machine. Learning models that are taking all of this historical, data analyzing. To say hey we're getting rejections. In for. These reasons. Over. 250. Different variables, maybe this merchants dealing with this particular, kind of item they're, just really bad at it and we. Then, adjust. How much inventory we think they have if, they tell us 10 and they're, always lying in sell and reject - we, adjust the number down to 8 pretty. Quickly you. Mentioned. Generative, testing. Have, you been able to leverage that to do into n integration. Test to provide like look, at your SLA s and say we're actually providing these SaaS and consistent, consistency. Guarantees we, you, know we project we. Can't do that within the domain because the domain is entirely. Without any external dependencies, and that is where we're doing most our generative, testing right now using. FS check we. Have, started looking into TL, A+ to do full, end-to-end. System, testing, of of similar. Specifications. But, that's not something that we've actually. Gotten to do yet. It's just something we're still exploring but. Something we absolutely want to do and is are on our radar q. You. Mentioned earlier about your, vendor stream assuming. You will have thousands sk use and you know hundreds, of Warhawks locations. Right are you maintaining. This combination, of, these, event streams for example you know tissue, at house in California, I mean New Jersey each one will, have its own, stream. Right. But, that's one, of the things that we changed from our, original design instead, of having all warehouses. In one stream, for. A given item we've split it into each, item, at a warehouse has, its own stream, so. When, the customer order, so, is that you designate a, specific. Location. With the order you have to attach that location, property, to that stream, that's, when you check, against, it so basically like I said were like, sovereigns are cute in that sense. Yeah. The, I. Didn't talk about this a lot but one of the things that makes jet comm. Different. From a lot of the competition is, its pricing. Engine which, adjusts prices, for, the items in this basket, depending. On what. Else is in the basket and if they can all be sourced, from the same location, what. That means, as a consequence, is that by, the time you check out we, already know from, where it's going to be sourced which warehouse, so, we can make the reservation, at that point. Against. The, particular. Item. At a particular, warehouse now. Not all, front-ends. Not all sourcing, engines work that way walmart.com, is. Migrating, toward that but doesn't have that yet so we I don't. Have an architect I do have an architectural diagram it's a lot more complicated for what we're doing next generation, where. We can make reservations, at. The, item, level regardless. Of the of the. Warehouse. Where it might eventually end, up being sourced and then. Migrating, those reservations, to, more specific, sub. Systems as they, as, we get more specific, sourcing. Information I see another person you, missioned when, you have your cluster, fell down you were able to you know don't lose any data to me they, will have you know different you know different streams.
And Different time stamps, right so it's, a bit more how can you don't. Lose any data. Yeah. Well Kafka, might have gone down in everything, that was in flight went down every. System, that put a message into Kafka, in the first place still, had all of its data all of its events that it was writing into Kafka so, all we needed to do was, make sure that once. We brought up a new cluster that, any event that was writing, into cuff we're sorry any system writing, into Kafka did. A replay, from a point in time that, was, prior. To the, Kafka cluster going about that as well you've. Got another diagram, in. Your diagram you showed the event of commands coming in is, there a store, for the commands before they get to the command processor, so to replay, generally. No we. We haven't. Found a value in storing. Both the commands, and the, events because. Usually. Because, any. Command, any, domaine command for a system. Maps. To, events. That are coming from an upstream system, for. Instance, in, this. Architectural, diagram where on the lower part where we. Have all of our inputs inventory. And order updates, the. That. Get put into commands that are in Kafka we have a stateless, mapper that map's the message to a command, but. Upstream, in order updates, comes. From our, processing, system that. Order processing, system already has its, events, and that. It can replay so, all it it all it needs to do is replay, those order events. They. Will get mapped using, our stateless mappers into, commands, again. Where, we can then reprocess, them in other words we don't need to store the commands because, other systems, already have the events and we, can just convert them, thanks.thanks. For. The, attack so I have two questions the. First one is regarding. Regulatory. Compliance. Based on the region so you said that a that. Is possible for you to have like two versions of like of. Like a process command. Right what happened when you have to have like, versions. That they need to live together because for example exams, like like. You need to comply to us for, to something in China. Okay. Yes. Yeah. So my, question is like how do you handle having. A. Not. Just like abrasion, of like a command, but like the, equivalent with like a small. Difference, I'm, thinking, of like like now with the before, this regulation. That we are seeing like it's very nice to have the cloud to sell yeah the data's going to be stored just. On the London, data. Sent that's very. Good right but when it comes to the business logic, you. Have like any insight how, how. Can you keep sanity, there when. You need to change the logic. We. Probably won't need to do a lot, with that in. Directly. In in how we design, or write the code. When. We're keeping, the. When. You're having to keep the data separate, in the first place for. The regulatory, reasons we. Probably, aren't, going to be shipping, inventory. Across. Those regions, that have the difference. The. Different regulatory domains either. It's. Not something we've had to do so far and that has been sufficient, so far. Maybe. We'll have to revisit that in the future but it's not a problem we've had to solve directly yet. In. Other word to expand, on that a little bit more if, B were to expand into those areas we'd. Probably just roll. Out a new version of the system in, that particular region, to, deal with the inventory and whatever, business problems, we have in that particular area. Okay and, you think like if you go that way right if you have to have like a command, for like you know update the inventory right, and they have like a different set of you know like I mean basically has to be processing a different way but, semantically, is the same like. Do you see that this would be like version 2.1. Of the, same or like you know like it's, like something different like I'm thinking like hey how, is it going to look like five. Years because. This is like ongoing, process like and e-commerce. Is not something that you can stop and say that you know what it's like question to, the. We. Decide to version when. The. Data formats, that are coming, in don't match anymore and, the semantics, don't, match anymore if we can do a map, from, version. One to version two, stateless. Lis and not, lose anything then that's what we'll do but, the semantics, change and. The semantics are different, between. One version another version then, we do have to treat them separately have a different, code path will. Still maintain the old code path for the old, versions, or the original, versions but. Well we will have a different code path at that point so if. I understand, that right yours you. Are saying that the signature, of like, a command to be process is something, more than an extreme it's. Like the data it's, like what is coming from it's like more ways right. It's. Yeah it's more than just the shape of it it's it's definitely, what.
The Intent, is that's, the point behind it command is that it is expressing, an intent to do something, so, if that's something that, is intended to be done changes, then that's a different command even if the shape is exactly the same so. My second question so. You. You. Were saying that you are using microservice, and you are looking at the functions, right so, if I look at the service and the functions it is just this is just like ways to deploy business. Like. It's just like how you deploy. Like new code and keep it they're. Want. To eat safe to run right do, you have like a wish list or, like I mean besides the latency, that you mentioned, because right, now a functions, F. Chef functions it's just you know II said the net company, you just said done net and you turn it there right but a. The. Things like there. Is like more than one ways to run that depending, on the interpretation of, the content, in there like for, example if you run. An. ODS, functions, right on Google. Cloud a all. The code that is outside of the exports, that. That. Is persisted, there so if you have a LED and you increase that the, next time you call the function. That is, there so this is not going to happen with f-sharp but my point is like besides, the latency, if there's something, else that you would like to see on the, official. Functions, on a show so. That is easy for you to say you know what let's. Remove, all the hassle with the micro-services, let's go completely. Functional here. That. Is a good question and I, have not looked into functions. Enough recently, to do a full evaluation I kind of just, for time reasons I got to the point like oh this is not good enough I'm, not gonna worry about it until this gets solved, and then I'll look at the rest of it so maybe, there's more I don't know sorry. I. Guess, since you're honest, and. Do. You feel like the the. Knowledge of like what. The community is asking for functions, that is like there like that does, Micro sauce does. Microsoft, has a very good idea of what the, community is expecting, with. Functions. Because right, now it's like we, are like opening. Like any, other and we, we. Need to deal with like how to keep like versioning. On these like how to keep the correctness, I think. That with, respect to functions. It's. A different, way of thinking about the world and it'll take a while for people to get their head through that and then, to understand, both its benefits, and then the what things they need out of it in my mind I liken, it to you. Know when we wrote you, know, console. Applications. Right you have a main. And you have an exit you just walk through it and then when we started doing windows, programming, we had a message pump and all of a sudden we had to learn like wait what, so I write what. I wrote it with cement handler, like how's that call like who's how, did I where. Do I call my event handler like, no somebody else is doing that and the, state management associated. With that and how, you build applications, and if
2018-05-09