Unapologetically Technical Ep.3 Abhishek Chauhan - Grainite
Hello and welcome to Unapologetically Technical. My name is Jesse Anderson and we're going to be going deep today because this is the show that goes deep. In fact, we'll probably be going deeper than normal because we'll be talking about some pretty difficult distributed systems things that you may have hit, you may have wondered about, and we're going to go deep on that. And to do that, I have Abhishek Chauhan with me. He is the CTO and co-founder
of Grainite. Abhishek, would you mind introducing yourself a little bit more, please? Sure. Thanks, Jesse. Like Jesse said, I am the CTO and co-founder at Grainite. We've been around for about four years. Before this, I was at Citrix. I was the technology fellow in charge of all of their products, but I specialized in networking. I grew up on
the networking side of Citrix as well. I was the CTO for NetScaler. Before that, I was the chief architect for it. And before that, the chief security architect for NetScaler. So I do have some of the paranoid blood running through me as well. Before Citrix, I was at a startup that I had created that was an application firewall that Citrix had acquired. And before that, I was at
Sun Microsystems. That part is very interesting to me. At Sun Microsystems, I was the blueprint architect for J2EE. And we are now talking about 25 years ago. Those were the days when the dot-com boom was in full swing. Everybody wanted to build web applications and get online, and it was the wild, wild west out there. And when we created J2EE, it really simplified and standardized the building of web applications. So I was part of that. Before that, one more startup that was acquired by Microsoft, and I was at Microsoft for a little bit after that. Well, that's quite a pedigree, because I'm really interested in hearing about some of those stories. And spoiler alert, we're going to get into some of those stories too. Because as much as we talk about things 20, 25 years ago, there's a cycle in technology and things come around again. And we'll talk
about that with Java. So you talked a little bit about some of that background. What makes you so interested in data and pipelines? Yes, it's interesting. I started out in applications. If you think of J2EE as kind of the peak of that, I moved into networking. And networking, if you have heard the famous quote from Linus, it's that the database guys don't have any taste, and networking was supposed to be the elegant one. So I was the elegant one for a while. But over time, what I realized, especially when I tried first-hand to build an application that's event-driven and data-intensive, is that there is just a huge amount of demand and a huge amount of need in databases, data analysis, and data processing. And unlike some of the other fields like networking, this huge demand isn't all that well-matched by the technology today. Sure, there are a lot of different technologies on the data side. But most of these technologies are still fairly nascent. And I know this might sound naive to many people who have practiced databases a lot longer than I have. But I did train in the world of databases when I did my masters back at Wisconsin, which in those days was called the Institute of Databases and Other Computer Sciences because it was just so focused on databases. But yeah, most recently, I tried my hand at databases, or at data-intensive applications. And to tell you a little more about that, I was at Citrix, and I was the CTO and
technical fellow. So I was on stage at a keynote with 5,000 of our best customers in the audience. And I went up and announced a new product, a new behavioral analytics security product that Citrix was going to build. And the idea is that when you announce these products, you expect to be able to build them in the same year that you announced them. We announced it in May. The expectation was you'd get something out and in the hands of customers by December. Turns out that by December, we weren't even able to assemble the set of people who could build the concept or the vision that we had around this product.
Because what we wanted was to collect a bunch of telemetry that the Citrix customers could produce, bring it all in one place, analyze it in real time, and every time a system call is made, be able to say whether to let that system call through or not. And these were the days when some very sensitive information had leaked on the internet; Snowden and so forth was in the news. And we felt like we had the technical wherewithal to be able to do that. And Citrix is one of the top 10, or used to be one of the top 10, software companies. So we weren't completely clueless about the world of data and data-intensive applications. But it was a long, long journey. It took us maybe three, three and a half years before we finally got a product that we could be proud of in that space. I'll tell you more as we go through the talk, but it was anything but easy, and it was anything but predictable. And it was anything but stable or robust, even when we were done. Yes, no surprise there. There's that narrator voice: when you said "we were up on stage announcing this," the narrator voice said, "and it didn't happen, and it took far longer than they thought." Yeah, you're definitely not the only one on that. And that's part of what I try to do with the show is say, hey, just because somebody says it's easy or thinks it's easy. When you get down into it,
you hear from people like yourself who have done it: it's not easy. It's really difficult. Yeah, even for, as you mentioned, a top 10 company. So it's tough. And then speaking of being at the top, I just read that you graduated from IIT, which is the Indian Institute of Technology.
And I read about what it means to get in. And it seems, based on just the Indian population, that there are a lot of people trying to get in. Could you explain to people what IIT is and what that means? Yeah, yeah. So IIT is a preeminent technology institution, or a collection of universities, in India. And I'll tell you, if you remember the Men in Black movie where Will Smith says, these are the best of the best of the best. It's kind of like that.
And I say this half jokingly. But really, in those days, there were about 2,000 people who were selected by these 5 IITs. And there were, God knows how many, maybe 300,000, maybe more people who applied for this. The university education was almost entirely free. And so it was something that
equalized or leveled the playing field across the different demographics within all of India. But at the same time, you get like a 9-digit roll number when you take that exam. There's an entrance exam that lasts about half a day. There are three subjects they test you
on. But at the end of it, a couple of months later, the results are published. And in those days, there was no internet. So a list of 2,000 roll numbers, which were these 9-digit numbers, was printed on a piece of paper and stuck on a wall. And you were supposed to see if your number
was in that list. And this list was sorted by rank. So the person who scored the highest would be at the top. And nobody in their right mind scans the list from the top. So you always start at the bottom and you start matching each of those 9-digit numbers to yours. And imagine there are thousands and thousands of people, children, kind of teeming around that hall trying to look for these numbers. And as you climb towards the top, your hopes keep on declining because you're like, you know, I didn't rank 2,000. I didn't rank 1,900. I probably won't rank 100. And so eventually, if you do find your number, it is that much of a delight. So it's an excellent institution, and it is life-changing once you get in, in the sense that the people you are surrounded by are outstanding. Almost every one of them is better than you in at least one or more
areas. The teachers and professors are outstanding as well. A lot of them are there out of passion for teaching and spreading knowledge, as opposed to just doing a job. But having said all that, and having glorified IIT and created an aura around it, I will say this: IIT selected 2,000 people, but of the top 10,000 people that applied, I would say on any given day, any of those could have made it into those 2,000. And what I mean by that is that the competition is so intense that if you look at the person who ranked 2,000 and the person who ranked 10,000, there wouldn't be all that much of a difference in their performance. And if you imagine that maybe 2 or 3 or 5 percent of your performance on any given day is just the day, just the way you woke up that day, then that tells you that even though it's selective, and it's unfortunate that there are only a small number of seats, there are a lot of excellent people, even the ones who didn't necessarily get into IIT, or didn't get in that year. And this has been borne out throughout my career. I have interacted over these years with so many people, both from IIT and not from IIT, who were all smarter than me. I think you make a really good point. And it's a point that I would tell everybody: surround yourself with people smarter than you, rather than trying to be smarter than them. And as you see with Abhishek's life, this is one of the secrets. You start to become smarter, rather than trying to lead the room and be the smartest. This is a key life lesson, I think, and it's good that you've been doing that all along. So you mentioned before about some acquisitions. You started some companies and those were acquired. Those two companies were called VxTreme and Teros. What advice would you give to startups in today's
climate where they may be either looking to be acquired because of a cash flow problem, what have you, or maybe because they've become very successful? What advice would you give them? I think it's a fairly broad question. I mean, I have had a lot of experience from both sides of the table when it comes to acquisitions. Like Jesse mentioned, I have been part of two companies that were acquired, but one of my roles at Citrix for a long time was also to look for companies that we could acquire. So I have acquired a few, and I have looked at what works and what doesn't work. From a startup's perspective, I think the most important thing to remember is that getting acquired cannot be planned. You have a vision, you are passionate about that vision, you are trying to take it somewhere. And an acquisition, if it happens, is because it makes a lot of sense for both the acquirer and the acquiree, but it is not something that you typically want to will to happen. There's an old adage in the world of raising money: if you ask for money, you will get advice, but if you ask for advice, you will get money. And the same kind of thing plays out in the world of acquisitions. You don't strategize or plan for an acquisition. What you plan for is to work with different companies, to partner with them, to show them how you have mutual customers that could benefit if you join hands together. And typically,
you join hands together by meeting in the field, as in, I have a common customer; by meeting on technology, that is, I have complementary technology; and then finally, you look at a meeting of the teams. Also, as a startup, remember that many times we look at a big company and say, you know, I don't want to talk to them because they will replicate what I am doing, or I don't want to talk to them because this is already on their roadmap, or something like that. We should, again, having worked at a big company for a while as well, know that the soft underbelly of a large company is its lack of agility. Even if you went to a whiteboard and spelled out exactly what you do and how you do it, it would take that big company probably three years before they could pull it off, if they wanted to do it themselves. And no big company really wants to build technology from scratch. They would much rather build something in partnership with you that they can bring to their customers quicker, rather than having the customer wait three more years. So I'll stop there. No, you make a very good point about opportunity cost. That's kind of what you're referring to. If you take three years to build whatever, or to rip something off, you take three years of opportunity cost,
and the company that you're ripping off has had three years of time to improve their product. You're now three years behind, three years plus in some cases. So yeah, there's a lack of agility. Generally, companies would rather buy than build, and good management realizes this. So you make a very good point there. And it's interesting that you were on both sides of that. I think it's quite rare
to talk to somebody who has both advised and sold, and so you're able to peer in in a different way. As you bought those companies at Citrix, what do you think made those acquisitions successful? Yeah, I think we had a fair share of acquisitions that didn't work, some that did work, and some that worked spectacularly well. And beyond the basic fundamentals of it, you know, we have joint customers, they could benefit, and so on and so forth, there are two things that from an acquirer's side you often miss. The first is that you have to keep some powder dry, in the sense that if I had, let's say, a $200 million budget to get into a new area, and I spent all of that on the acquisition cost, then I don't leave myself enough money to be able to support that team and integrate them in, and, you know, incentivize the sales force to be able to carry a new product for the first several months or quarters. It's always advisable to have maybe 20 to 25% of your budget as dry powder that you use after the acquisition to invest into it and make it more successful. That's one. And the second thing that sometimes doesn't work out is the culture
of the team you're acquiring. And it is, you know, you will hear people talk about, you know, we've got to find somebody who is a good cultural match. And sometimes good cultural match is not the right thing. For example, at Citrix, you know, this company had been around for 30 years.
Even in 2016, 2017, we weren't using Git for our source code control. We were using some other old-fashioned proprietary software to do that. Our engineering practices were still fairly old-fashioned and waterfall-like, and we hadn't embraced Agile as much as one would expect a
company of our size to have done. And so sometimes when you acquire a new company, it is also an opportunity for what I call a reverse acquisition, in the sense that you are acquiring a smaller team, but if you work together and you are open to the infusion of new ideas, that small team can sometimes multiply and grow exponentially in terms of its ability to influence culture and practices at the bigger company. So you have to be open to that and not get into cultural clashes, always trying to find somebody who looks and feels just like you. You mentioned this dry powder and keeping some. Is that what we often see, where a company will get acquired and the acquirer will completely gut that other company because they just didn't leave enough for it? Or was that the plan all along? Usually, when you acquire a company, it depends on the size of the company, of course. If you are simply acquiring a team, then you want to retain most of the team. If you are acquiring a business, which is already executing and has its own finance department and its own sales department and so on, there are some synergies that are built in. And "synergies" is a glorified word for saying that there will be some things that will be redundant and you would cut them. For example, you don't need two CFOs, but you acquired one of them. So typically, there are some functions that get eliminated in that process. The sales team, especially one that is productive, is pure gold, and you almost never eliminate that, because these are the guys who know how to sell the product you just acquired, and if anything, you want to replicate their successes and have them be the mentors of your sales team. So those people typically stay. But what I meant by dry powder was: sometimes you would have a CFO who says, I have announced to the street that I am going to take 21 basis points of dilution as part of this acquisition. And now the next time you want to do something with it, say you want to create a spiff for the sales team: hey, you know, this is a new product, we want to have every one of our customers try it. But the spiff is going to show up on the balance sheet as maybe another basis point of dilution. Then, you know, sometimes these budgets get tight and you are not able to do the right thing to make that acquisition thrive. And ultimately the acquisition is measured by its ability to produce additional revenue. And if it doesn't thrive, and we are not able to fund the activities of that startup after we have acquired it, then we have basically thrown away good money without getting the returns. Now that's really important. So let's go further back in your career and talk about the early days of Java. From your LinkedIn profile,
I saw you were there working at Sun in 1997. I started working with Java in 1996, which is kind of crazy to think about. Do you see any parallels between that time and Java and now? Yeah, man, that's an excellent question. There are a surprising number of parallels,
an uncanny number of them. So when I got into Java, I was on the server side of Java. In those days, as you remember, Java applets and Java running in the browser is where Java first built its name. It's like: write once, run anywhere. And it would run in the browser and so forth. But then
it was becoming clear around that time that people wanted to have online applications. You know, this idea that I would have an online presence and I could do e-commerce online, or I could have a brokerage online, or I could have other functions that could be delivered over the web, the concept was just starting. There was a gold rush where every large company wanted to do something in that area. And the kind of things people were doing: I remember if you logged into E-Trade in those days, the URL would say e-trade.dll. And if you logged into eBay, it would say e-bay.pl, as in, that script was written in Perl. And there were plenty of others who were generating HTML out of shell scripts. And when Java got to J2EE, which is Java 2 Enterprise Edition, we started standardizing some of the ways in which you could build online applications, and build them in Java. So, the idea of an application server, the idea of Java Server Pages. In fact, the first thing that I worked on when I was part of J2EE was the spec for Java Server Pages. I remember Microsoft had Active Server Pages in those days, and they had announced ActiveX, if I remember correctly, or something like that. And Java Server Pages was a more rational, more well-thought-out way of generating HTML, something with a well-defined language and grammar, and of mixing HTML and Java together.
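To make that contrast concrete, here is a rough sketch (not actual J2EE blueprint code; the class and parameter names are made up) of what generating a page from a servlet looked like, which is essentially the same chore as the Perl and shell-script generators above, just in Java:

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Pre-JSP style: the page is Java code that prints HTML by hand.
public class HelloServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body>");
        out.println("<h1>Hello, " + req.getParameter("name") + "</h1>");
        out.println("</body></html>");
    }
}
```

A JSP inverts this: the file is mostly HTML, with Java escaping in only where needed, e.g. `<h1>Hello, <%= request.getParameter("name") %></h1>`.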
So that's where I started. But it was very clear that people weren't sure how to build these applications, not just on the front end, as in the Java Server Pages, but also in terms of how do you manage your data? If your data is in an Oracle database or MySQL or something, how do I get the data? How do I interact with the data? How do I fill forms? How do I bind data to tables, and so forth? And so we created a number of different standards or proposals. And I had a first-hand view of how you work in a standards body and get multiple different vendors to collaborate towards a common standard that helps the industry as a whole, but also, most importantly, helps the users of the product. Before you knew it, we had about 20 different API specifications, each one of them about 50 pages, with lots of rationale and thinking behind it. And now we had a different sort of problem. The problem now was that people looked at these different services or APIs that J2EE could provide, but nobody knew how these things fit together. And people didn't have the time or the patience, and sometimes not the skill, to be able to work it out for themselves: why do I need to use an Enterprise JavaBean over here and a session bean over there? And what do I need to do if I have to scale myself out, and so on? So with the Blueprints in J2EE, which is the thing that I was one of the architects for, we created a set of guidelines, or blueprints, that showed people how you put these things together.
The reason I am talking in so much detail about something that happened 25 years ago is because we see something very similar happening now. What's happened is that for the last 10 years, you have been hearing the words digital transformation. And that means that all of my assets and so on should all have a digital representation. I should have my information, my processes, my technology, everything represented digitally. For the most part, this has already happened to a great degree in most of the companies that we interact with today. Their assets are digital, the data is digital, their interactions with their customers are digital, their events are coming in digitally. So they have managed to do that. What has not happened is the great big aha: now that I have everything digital, some magic was supposed to happen. The company was supposed to gain a competitive edge and do a lot better than a competitor who didn't go digital. That part of it is going
through some disillusionment right now. And the reason that disillusionment is happening is because the ability of these companies to make sense of all of the data that became digital is fairly limited. So this time around in 2023, the gold rush is in being able to take the large amounts of data that you are collecting or have the ability to collect every day and be able to make sense of it while the data is still fresh. And it turns out that the sooner or quicker you can make sense of the data, the more value you can extract from the data. And that gold rush you
see all around everywhere. People are building pipelines. You might hear it in the form of ETL; ETL at its very basic level is a pipeline. You might hear it in the form of streaming applications, payment processing, fraud analytics, all of these different use cases that you hear about. They are all attempts to take the data, which is a resource that you have now amassed, and be able to extract insight from it. Every company that I have talked to, and I have talked to about 300 of them over the last three, four years, has got multiple, and sometimes dozens of, projects that are in the queue to be built or developed that are going to analyze the data that they are collecting and produce insights from it, either security insights or customer insights or sales insights or product management insights, and so on and so forth. And this is the
new gold rush: companies that are able to accomplish harnessing this data successfully are going to have a distinct competitive edge over the ones who are slow or lack the resources to pull it off. Your experience corresponds with mine. I think that there are two issues in the industry. One is technical, as you mentioned, and the other is people, and I think the two go hand in hand: you have the wrong people trying to use the wrong technologies, and it's hard to sort out which is which, or which is worse. And it's really concerning to me, because as an industry we're not generating the value that we should be. We're actually generating relatively low amounts of value relative to spend, and that's really concerning, because eventually the party stops, and I've been worried about this for a while. So let's stop this. Let's not have this happen. You were there for the dot-com bust and you don't want a bust like that. That's not fun. It's not fun at all. Well, let's go forward in time to your
time at Citrix. Citrix was the company that you were at right before you started Grainite. So you have this pattern of being early. I noticed in your resume you were early in Java, you were early in the cloud at Citrix. What made the cloud so interesting to you? It's an interesting story. Citrix, when I got into Citrix, we had acquired another company
that used to build a product called NetScaler. NetScaler is a load balancer. That product is still there as part of Citrix's product portfolio. It's the load balancer that at one point in time, if you interacted with any website on the internet at all, there was a 75% chance that your packets came to you through a NetScaler. So it was all of the large web properties, and you can imagine which ones, but they were all scaled or built with a NetScaler in front. And as time passed by, all these companies that were previously building web applications slowly started building the cloud. In fact, if you look at the three big clouds today, at one point there was a NetScaler serving a lot of the content behind each of them. But it
turns out that we at NetScaler saw the cloud being born and the cloud gaining traction and growing, except that we felt that, you know, the cloud is something that other people do, and NetScaler itself would not need to do it. And then one day came where we were talking to one of our largest customers, who was an advisor to us as well. And he says: this NetScaler that you shipped to me as a 2U appliance (which I was very proud of, by the way, because we could push 300 gigabits per second through a 2U appliance), it's like a big boulder in my cloud. I can't move it. I can only provision it with a CLI. It's just a rock that is weighing down this entire cloud. And that's when the light bulb went on. And we said: even though we have had this seat at the table, or at the front of the table, to watch the growth of the cloud, we haven't clearly thought of how our business would run in the cloud. And from that day onwards, we started rethinking the cloud, rethinking our relationship to the cloud. And Citrix now has almost all of its products delivered in the cloud.
So fast-forwarding to now: do you think there's something missing in the cloud now? Yeah, that's an interesting question. Look, the way I think of the cloud is that it's gone through a few eras as cloud. In the early days of cloud, we used to have these debates about infrastructure as a service versus platform as a service versus software as a service. And it turns out that platform as a service has kind of declined in fashion, and infrastructure as a service is the way most people think of the cloud now. And then, to the extent that they need something cooked, they think of SaaS services to cook it with. Now, in the middle, the gap has been filled by what I call disaggregated platform services.
So if you look at all of the services from Amazon that are not just storage and compute, things like Kinesis, for example, on the streaming side, or DynamoDB, these are individual point services that are disaggregated from each other. It's like: here is a bouquet of services that you, Mr. Customer, can use. And you can put these together, stitch them together, whichever way you see fit. And in the menu of options that we have, to your question of what is missing: we have some excellent low-level options; we have got some disaggregated platform services that are slightly higher level than that; and then at the other extreme, you have SaaS services, which are completely, fully cooked. And, you know, "please leave your imagination behind and use my SaaS service to do whatever it is that you're doing" makes a lot of sense in many cases, but doesn't make sense in some places. The gap I see is something that is more cooked than the disaggregated platform services, but still sufficiently flexible so that you are not living entirely in the world of SaaS, beholden to precisely the features that the SaaS provider has built. Because SaaS is an equalizer, and companies that are trying to build differentiators don't necessarily want to adopt SaaS as a means of differentiating. On the other hand, to build a differentiator today by stitching together all of these different services is incredibly difficult as well. So I wonder if there is something
in the middle. Well, it turns out that you've made an incredible lead-in: we've got these disaggregated services, so what do we do about that? And that leads us to basically the core idea behind Grainite. So let's start whiteboarding; we've been talking for a bit. Would you mind diagramming out, whiteboarding out, what is Grainite? Yeah, so thanks for that. The way I think of Grainite is that it's a converged streaming database, and we've got to talk about what each of these keywords means. You know, at the outset, when I say converged streaming database, it's like: this guy knows how to use ChatGPT. But there is more to it. And I'll talk about what are the things that we have converged and why we call it a streaming database. It turns out that no matter what sort of data analysis or data processing pipeline you build, you are going to use three different things. The first block I'm going to call ingest. Then there's going to be a second block that we can call process, and then a third block that we are going to call record. Okay, so on the ingest side, you can imagine that events are coming in. You have got some sort of processing tier behind it, and you don't want to miss out on these events.
Typically, the person producing an event, like a web browser or an application, has nowhere to put that event if you were not available to record that event, or to ingest it. And so the ingest tier has this burden on it: it needs to act like a buffer and be able to absorb the events as they're coming in, at whatever rate they're coming. And typically you would use Kafka; it's the preeminent product that everybody thinks of first. So you use Kafka to ingest these events. Now I've got these events; I can look at them at my leisure. I can read them like a file if I wanted, or I can process them as they come in. And that gets you to the processing tier, which is going to have to make sense of these events. So in that processing tier, you would have code that you have written in one of a variety of different languages. You would use something like Flink, for example; before Flink, people used to use Spark to process these events. And then once you've processed these events, typically, especially if you're doing anything beyond the prototypical word count example that everybody uses to describe this, if you're doing something meaningful, the processing of the
event is not done in isolation. For example, if I received a payment as an event, I need to know how much balance was in the account that the payment is being withdrawn from. So I need to pull up some state. I need to process this event in the context of that state. If, let's say, the event was a location update for the user, I might want to pull up information about that user and see what that location update means in the context of that user. It could mean that they have walked to the nearby Starbucks, or it could mean that they have flown to a different country. And what I would do in both cases would be very different. And the only way I would know which one is if I had context about this user. So I'm going to pull that context before I process the event. So I'm going to draw a database here: you pull that context from here, and you feed the event in from here. And then once you have done processing that event, typically you would want to write something back, or record the effects of your event back into the database. So you would say: okay, I have received this new event; as a
result, the new balance of this user is this value. Or: as a result, the new location of this user is this. Or: if this user is a car driver and I was trying to track whether they are speeding or driving riskily, I would maybe record that, based on this event and the user's history, I think their new risk score for how dangerously they are driving on the road is this new value. So I've got to do a recording into the database as well. And this is something that you do for every event that you process. So this is a prototypical event processing pipeline. Turns out that this pipeline is full of a lot of gotchas, and I'm not even drawing the more complex versions of this pipeline yet. But I built a pipeline like this, like I was saying, for Citrix, and it took us a long time to build it. I had something almost exactly like this drawn on a paper napkin even before I announced that product. I was like: how hard can this be? This should be fairly easy to accomplish; each one of these things seems fairly tractable. But it turns out that there are dangers in this, and the dangers are, one, in the building of it, and second, in the operations of it.
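To ground the diagram, here is a deliberately naive sketch of that ingest-process-record loop in Java, using the Kafka consumer API. The topic name and the in-memory "database" are hypothetical stand-ins; in the real pipeline, the state reads and writes go against an actual database (and, as we'll see, a cache):

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PaymentPipeline {
    // Hypothetical stand-in for the database holding per-account state.
    private static final Map<String, Long> balances = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "payments-processor");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments"));          // ingest tier
            while (true) {
                for (ConsumerRecord<String, String> rec :
                        consumer.poll(Duration.ofMillis(500))) {
                    String account = rec.key();
                    long amount = Long.parseLong(rec.value());
                    // process: pull state so the event has context
                    long balance = balances.getOrDefault(account, 0L);
                    // record: write the effect of the event back
                    balances.put(account, balance - amount);
                }
            }
        }
    }
}
```

Every gotcha discussed next lives in the gaps this sketch papers over: what happens when the process crashes between the read and the write, when a poll is retried, or when one copy of this loop can't keep up and you need ten of them.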
Let me start with the operations side first, because that's the part where I still have some scar tissue. On the operations side, you build a pipeline like this and you try to scale it. Your team has just come back; the architect that you hired nine months ago has built their first pipeline, and you ask them: hey, you have built this pipeline, and I'm planning to scale it to about 10x the load that we previously expected. What do I need to do? Turns out that the answer, and that's the answer I got, and that's the answer that invariably most other people that I have talked to get, is: oh, well, we didn't think of 10x scale. Even though each of these components is supposed to be like a pizza box, and you just add more pizza boxes and life is good, for that we are going to have to rethink the architecture. So scale is not automatic. The next thing is the half-life of this technology. The half-life of the technology that you use to build a pipeline is two years. This means that if you built it with Spark, the new kids are telling you that, hey man, you should have used Flink. If you built it with Flink, they are telling you you should have used something else, maybe Materialize or something. If you built it with Materialize, you are yet to find out what the next thing is going to be. And that next thing is going to be written in Rust, just so that everybody knows. Yeah. So finding people who have experience having built one of these things before is incredibly difficult. What we were able to muster, for example at Citrix: we were able to find people who were experts at Kafka, we found somebody who was an expert at Spark, and we found some people who understood the graph databases that we were planning to use. But we didn't know anybody who had built a pipeline like this.
So then on the operations side, the first question I asked them when we went to production was: hey, these things are going to fail, like they always do, and when that happens, what am I looking at in terms of RTO and RPO? Now, this team of 50 people who had built this thing over nine months couldn't answer that question. And it wasn't that they had no idea at all; it's that they didn't know whether it was a day or an hour. That's how big the gap was. Well, define RTO and RPO, just so people have that. Yeah. RTO is the recovery time objective: if this thing all went south, how long before I can start offering service again? And RPO is the recovery point objective: if I lost some data, how much data am I talking about having lost if this whole thing blew up? So clearly you don't want to lose any
data, but in practice you are able to tolerate a little bit of data loss. You get your risk analysis department involved, the CFO adds a risk item to the balance sheet, and then life goes on. So you can work with non-zero RPOs; in fact, I don't know of any pipeline so far that I have looked at that has a zero RPO. But typically you want to know: if a node or two were to fail, or if a region were to crash, is it like five minutes later that the service will be back up? Because remember, we are talking about intercepting a system call on the user's laptop and not letting that call proceed until we hear an answer here. To be able to do something like that, I don't want the service to be like two nines or three nines or three and a half nines. If it goes down, I want to know it has gone down, and I want to know when it will come back up. And it turns out that because of the number of moving parts that we were using, there wasn't a well-formed answer to that simple question.

On the ingest side, sure, we were using Kafka, but there was the number of shards (the number of partitions, I guess, is what Kafka calls it, so let me change that: the number of partitions), and the system is very sensitive to how many you provision and how you change that. And by the way, on the Kafka side there was also ZooKeeper. Every one of these blocks has got a main product and then a sidekick, right? It's like a comedy: there's the hero, and then there is one guy for comic relief. So on the Kafka side you have got ZooKeeper, and you've got to make sure that your ZooKeeper and Kafka agree with each other on what's the current notion of membership and partitions and consensus, and that's tricky business.

On the processing side, you have Flink and Spark, but you need a scratchpad. A scratchpad database is needed, and you're going to checkpoint into the scratchpad database, and then from time to time, if the processing were to crash, you would restore from the checkpoint and continue from there. The one that lots of people use is RocksDB. The author of RocksDB, by the way, was one of the guys that I used to do my grad school projects with at Wisconsin, so I know this very well and have benefited from his advice a few times as well. But yeah, so you put a RocksDB on the processing side, and now you are in the business of figuring out how to recover a fallen RocksDB instance. And if I am taking snapshots every 15 minutes, I've got to redo processing for the last 15 minutes. If I am taking snapshots every minute, then my RocksDB is going to be permanently checkpointing. So there is no good answer for how frequently I need to take these snapshots.
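That snapshot-frequency dilemma is literally one knob. Here is a sketch using Flink's checkpointing API, with illustrative intervals rather than a recommendation (a real job would also define sources, operators, and a state backend):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTradeoff {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 15 minutes: cheap at runtime, but after a crash
        // you redo up to 15 minutes of processing from the last snapshot.
        env.enableCheckpointing(15 * 60 * 1000);

        // Checkpoint every minute: a small redo window, but the scratchpad
        // state backend (e.g. RocksDB) is now snapshotting almost
        // continuously, and that cost never goes away.
        // env.enableCheckpointing(60 * 1000);
    }
}
```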
And then finally, on the record side, you have a database, so you'd think: what could be wrong with a database? Turns out that most databases are not designed to read or write data at the rate of your event stream. So you're going to add a cache here; let's say, just for naming names, you add a Redis here. And when you first add it, it's a pleasant surprise, because everything is working so much more smoothly now. But when that thing goes down, it is not as if the pipeline is going to keep on working, just a little bit slower. Once that thing goes down, the pipeline is going to stop working. The KPIs of the pipeline are going to go beyond acceptable limits and probably cause a cascading failure somewhere or other in the system.

So now I have got six products on the board, and figuring out failover: let's say a region is down and I am failing over. Just consider a simple example where each of these products is its own cluster, and being its own cluster, it has its own opinion, its own mind, about when it considers a region to be down. If you did that, then what would happen is that your Kafka has failed over to us-west from us-east, but your Spark, or whatever cluster is doing the processing, hasn't. And your database has maybe moved over, or maybe the cache has moved over, but the database hasn't. That doesn't work. So you're going to need to write a bunch of glue code just to convince all of these products to behave in the same modality when they see the same thing happening in their environment. So I need to add a box called failover, slash, switchover, which ticks back to the RTO question. And sometimes you can make incremental restores, and other times you cannot, and so the RTO question in particular is a question mark, because RTO could sometimes be three seconds, sometimes five minutes, sometimes 12 hours. And really, the more you start relying on these pipelines, and the more mission-critical they become, the more you are looking for something which has an RTO of three seconds as opposed to 12 hours. I was talking to a very large logistics company, one that probably ships half of the world's packages, or at least half of the United States' packages, and they had built a pipeline like this, and their RTO was 12 hours, which they only found out after they had built it. So it turns out that if one of the regions in their cloud goes down, it is 12 hours later that they can tell you whether your package was even delivered, or whether it's been put on a truck, or which truck to even put it on. So these are the kinds of problems on the operations side of a contraption like this.

On the developer side, remember, we are targeting enterprise developers. These guys wake up in the morning experts at the business problem. They are experts at the domain that they are working in; they understand the rules, they understand the regulations, they understand the business. They are not experts at distributed systems, and they shouldn't be. I faced the exact same pattern when I started my second company, Teros, which was an application firewall. When I talked to developers, they were experts in their domain, but they were not experts in security. And they were like: I woke up this morning because I wanted to make this inventory workflow work; I didn't wake up this morning to figure out whether SQL injection is going to ruin my day. So the same kind of thing is happening here. I have got these developers; they understand their business; they don't understand, and don't want to understand, distributed systems. But because I have got six distributed systems in the mix, they have no choice but to understand distributed systems. They have no choice but to plan for the contingency of when the system only half fails, when there is a split brain, when something is being retried, when something that they wrote is not idempotent, when they acquired a durable lock and then crashed while holding it.
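As a taste of that boilerplate, here is a minimal sketch of the idempotency guard an enterprise developer ends up writing. The in-memory set of seen event IDs is a hypothetical stand-in; making it durable and atomic with the state update is exactly the hard part being described:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentHandler {
    // Hypothetical stand-in; in production this must be a durable store.
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    public void handle(String eventId, Runnable businessLogic) {
        // add() returns false if the id was already seen: a duplicate
        // delivered by a retry or a replay after a crash.
        if (!processedEventIds.add(eventId)) {
            return; // swallow the duplicate
        }
        businessLogic.run();
        // Gotcha: if the process crashes between recording the id and
        // applying the update (in either order), the retry is either lost
        // or double-applied. Making those two steps atomic has nothing to
        // do with the domain, yet it lands on the domain developer.
    }
}
```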
All of these things have nothing to do with their domain. If my domain is to build a ride-sharing application, or my domain is to build the best user experience for a coffee shop, or my domain is to build payment processing for a bank, none of the things I talked about earlier has anything to do with that domain. The developer rightfully complains that they shouldn't have to understand these things. Except that to make this pipeline work, you have to understand those things, because it is very likely that your processing will have to be restarted, and then you would have duplicates, and then you would need idempotency to get rid of those duplicates. And if you went down the path of distributed transactions and locking, then you would have to deal with lock recovery. So there are all of these gotchas that come in the moment you realize that the system we are building is a conglomeration of distributed systems put together. And so that's the building, or coding, side of it and the hazards there.

And now, finally, coming back to the question that you asked in the beginning, which is: what is Grainite? I said it's a converged streaming database. What we do is that we take these three things and we converge them into a single system: just a single cluster that auto-scales and that internally contains all three of these functions built into one. And I'll talk about why we use the word converged, as opposed to, for example, integrated, which we could have said but didn't; what it means to be converged; and why, or how, converging these things together solves some of the problems that I just outlined.

I just want to do a quick plus-one on a lot of this. Although a vendor may tell you, hey, Kafka can have an RTO of minutes: not necessarily, not necessarily. And then you brought up another important part: the level of integration between these two clusters, a Spark cluster and a Kafka cluster. One could be down and one could think it's up, or they could be reaching across to a completely different cluster that's now being replicated to in a different region, and now you've got an increase in latency and you may not know why. There's a lot to this. This is by no means an easy problem. And I just want to call out to the people who are curling up in a ball right now thinking about this, because they're living it: there are solutions. And those of you who are thinking, what are they talking about: this is all real. Don't let the hand-waving people tell you that this is not real. These are real issues that you will have to deal with. So yes, how about we go to whiteboard two, and let's talk and go deep into how Grainite works.

Yes, let's talk about it. So I said that we are converged, so I'll start there and say what it means to be converged. Now we have got ingest, process, and record, all in the same cluster. And one could say: okay, so you've got it in the same cluster, I get that, but how does it solve the problems that you were talking about? Do I not have to do retries? Does this thing not crash? It's running in the cloud; what do you mean it will not crash? Surely it will crash. And how do RTO and RPO improve? How do I operate this thing more efficiently? And so on. So my answer to that is that if you
look at the previous picture, the root of the evil, the big problem there, was this. There are two safety constructs that databases use: the first construct is called consensus, and the second one is called the journal. A journal is something that you write on your disk. It's a single-node construct: whenever the node receives an action, something to do, something that will change its state, it will first note it down in a journal. It's like a logbook. A write-ahead log is probably the term people have heard; Kafka, for example, calls this a log. And once you write it into the journal, then as far as the user who submitted that action is concerned, their job is done. It's written in the journal; now it's become durable. But once it becomes durable, it has not yet become readable, in the sense that the journal still has to be processed, and the action that you took has to be reflected, or quote-unquote materialized, into the database, so that you can now read and query and see the results of your action. So the journal is the first such construct. And the beautiful property of the journal is that if I wanted five things to all happen at once, to all happen atomically, as in either all five should happen or none of the five should happen, the way that's accomplished is by writing them into the journal as a single entry. Because the journal is atomic on entries, I can write my five things that I want to do together as one entry, and then it will be atomic.

The consensus is the distributed analog of the journal. What it says is: I've got all these different nodes, they're all writing their own journals, but they will not necessarily agree on the order in which actions were taken. If the order was supposed to be A, B, C, the first node might remember it as A, B, C, while the second node might remember it as A, C, B: a different order. The job of the consensus is to get these different nodes to agree on the order in which they are going to do things. And why is it that different nodes are doing the same thing? It's because of redundancy and durability. When you write things in multiple places, when you do the same thing in multiple places, then you have redundancy, which leads to higher availability. Otherwise, if you build a system where everything is happening only in one place, then the mere failure of a single node is going to cause you to lose processing and to lose data, and nobody wants that. So typically the rule of thumb is that you write your results, or you write your actions, in at least three different places, and ideally those three places would be fault-isolated, as in they wouldn't all fail at once. So that's the job of consensus: to put things in order and make sure that these different places where you write things are all able to keep track of the order in which they are applying these changes.

Collectively, this is in the broad branch of what is known as state machine replication, or I'll just call it a replicated state machine. People call this a replicated state machine because as long as I can guarantee that different nodes will see things in the same order, and they always do the same thing when they see a given action, then these nodes will stay in sync and keep doing the same thing. This part is well understood.
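A bare-bones illustration of the journal idea in Java, assuming a single node and a single log file. Real journals add entry framing, checksums to detect torn writes, and group commit, but the core contract (durable before acknowledged, atomic per entry) is just this:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class Journal {
    private final FileChannel log;

    public Journal(Path path) throws IOException {
        log = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // One entry is meant to be one atomic unit: five actions written
    // together as a single entry become durable together.
    public void append(String entry) throws IOException {
        ByteBuffer buf = ByteBuffer.wrap(
                (entry + "\n").getBytes(StandardCharsets.UTF_8));
        while (buf.hasRemaining()) {
            log.write(buf);
        }
        log.force(true); // fsync: only now is the action durable
        // Durable, but not yet readable: the entry still has to be
        // applied ("materialized") into the database state.
    }
}
```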
It turns out that in the previous diagram, the ingest, process, and record components each had their own replicated state machine. So while each component was guarding and safeguarding its own state, nobody was responsible for consistency across components. I have these beautiful constructs, they have been studied for 20 years in the world of distributed systems, every one of the components that you have recruited is already taking advantage of these constructs, but the user of the system is still at the mercy of Murphy's law. Bad things are going to happen to this user, because there are cracks between these components through which these guarantees slip. What we have done, very simply, is we have wrapped the three components in a distributed consensus on top and a distributed journal on the bottom. So we are saying it's still going to be ingest, process, and record, because that's the fundamental thing that we're trying to accomplish. But instead of the ingest having its own idea of consensus, and the process having its own idea, and the record having its own idea, how about if we excise the idea of consensus from each of these components, and we excise the idea of the journal from each of these components, and then we unify, or converge, these two concepts across the three components?

Turns out that when I first started discussing this idea... In fact, this is a great time for me to introduce Ashish, who is our co-founder. You have heard my background: this guy has been building networking stuff for the last 15 years; what does he know about databases? Well, my partner in crime here is Ashish. Ashish used to run all of the databases at Google, so everything from Bigtable to Spanner and anything in between, like Cloud Datastore and Megastore and all of the other properties that have to do with databases. Those were all things that Ashish was directly involved in; he used to manage teams for all of those, and he grew up on the technology side as well. So he and I kind of debated these ideas before we had started Grainite, and we're like: there's a problem to be solved here. I'm thinking: Ashish, could we separate these things out? And Ashish is like: that's a good idea. In fact, he had tried doing something like this as part of Spanner and finally concluded that it was too difficult to do on an existing code base. Turns out that once you have built something like a distributed database or a distributed queue (think of Kafka in this case), there are so many assumptions that go into it. Sometimes for performance and sometimes for correctness, you so tightly interweave the journal and the consensus into the core of your widget that it is impossible to take them out of an existing system.

So when I first started Grainite, I looked at maybe 35 different open source components, because, you know, the year is 2019; nobody builds a database from scratch in 2019, because surely there are all these open source components out there that we could start from. And it turns out that we couldn't start from any of them. The reason is that we wanted to unify, or converge, the idea of consensus and journal across them. This is a really difficult problem. The closer somebody is to the engineering behind a database or behind a queue, the more incredulous they are when I tell them that this is what we have done. So this is a fairly tall order,
a fairly difficult problem to solve. It took us four years to solve it, and we finally are at a point where we have managed to take the consensus and the journal out of each of these three components. As a trade-off, each of these three components is completely built from scratch in our world: we are not using Kafka for the ingest, we are not using RocksDB for the record. These are homegrown components, and I will talk more about why that's the case, beyond just consensus and journal.

But some interesting things happen when you converge these components. The first thing that happens is that you get some strong guarantees, and the first strong guarantee is what I call virtual resilience. What I mean by that is that now, when an application developer writes code, they simply write code as if this code is running on a node that will never crash. They get the illusion that their code is going to run exactly once and never crash. So now they can get out of the business of writing boilerplate code for idempotency, writing boilerplate code for retries, figuring out whether their KPIs, in terms of end-to-end latency across the pipeline, are going to be impacted if their Flink was recovering a checkpoint at that exact moment. Those issues are simply no longer the developer's concern. From their perspective, the code they write will run exactly once. It's just like running code on your laptop: you run it, life is good. You don't have to worry about things crashing. Now, behind the scenes, of course, Grainite is a distributed system. It consists of multiple nodes, and of course the network is going to sometimes misbehave, the nodes are sometimes going to crash, and even Grainite software is sometimes going to crash or be the culprit behind a crash. So it is a virtual resiliency paradigm, as in: these things do crash, but the developer doesn't have to care about it; in their mindset, it's all one node. So that's the first thing that happens by combining these things together.

The second thing is performance. I could have gotten these strong guarantees by wrapping the previous system that I showed you in what's known as a distributed transaction manager, and then I would have similar strong virtual-resiliency-type guarantees, where things happen exactly once and so on. Turns out that those systems are extremely slow; you would maybe be lucky to do even 100 events per second in a system like that if it was replicated three ways. In Grainite, this performance comes for free, because we have unified the journal and we have unified the consensus. Said another way: for the cost of a non-distributed transaction, we are able to provide strong guarantees that otherwise only a distributed transaction coordinator would be able to provide. So performance is the second thing.

The third thing (so the first was the strong guarantees and atomicity, the second was performance) is this: when you have got six different consensuses and six different journals, each one of which wants to write things in triplicate (remember, if you don't write it in triplicate, then there is a non-zero chance that you would lose the data), each one of the components wants to write it in three places, and they all have their own journal. Which means, if you do the math, you are suffering through a write amplification of 18x.
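The arithmetic behind that claim, spelled out. The pairing of the six journaling components follows the earlier diagram (Kafka plus ZooKeeper, the processor plus RocksDB, the database plus the cache):

```java
public class WriteAmplification {
    public static void main(String[] args) {
        int components = 6; // each with its own journal and consensus
        int replicas = 3;   // triplicate, so losing a node loses no data
        System.out.println("pipeline:  " + (components * replicas) + "x writes per event"); // 18x
        System.out.println("converged: " + (1 * replicas) + "x writes per event");          // 3x
    }
}
```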
So we take that 18x write amplification, and we take it down to 3x. 3x is the minimum cost of doing business, because there is only a single consensus and a single journal in Grainite: we journal things only once. We do write them in triplicate, to make sure that we don't lose anything, but we do that act of writing in triplicate only once. And that's why the system is fast: not because we wrote really clever...