They say one of the keys to intelligence is the ability to make connections between disparate pieces of information, and to see patterns and links before everyone else sees them. If that's so, then today's guest might be the most intelligent person I know. I'm joined by Ben Gamble, and he's just exceptional at seeing the connections between different technologies and different technical ideas. He has such a varied background as a programmer that he's got a wealth of different experiences to draw on. So who better to gaze at the programming landscape with us and ask: with all these pieces, how are we supposed to fit things together into a coherent system? Which bits of technology belong in a modern architecture? Do we really want half a dozen different databases with an event bus between them, or will we be happier with just a general-purpose relational database ruling the world? Because you can build a thriving business with just MySQL and Python, you really can. But by the time you're a thriving business, are you going to start to wish you had a dedicated analytics database like Apache Pinot for your queries? Are you going to wish you had Redis as a high-performance caching layer? Which bits should you use, when do you make the trade-off to switch, and when the landscape's this vast, why isn't there a map? That's the topic up for discussion today, and it's a huge one, so we'd best get started. I'm your host, Chris Jenkins. This is Developer Voices, and today's voice is Ben Gamble. [Music]

I'm joined today by the inimitable Ben Gamble. Ben, how's things?

Things are great, if slightly warm today.

Yeah, it's cooking up in England, it really is. We had a wet and remarkably cold summer until literally this last week or two, it seems.

That's the joy. I used to know an Italian person who said she couldn't understand why the English talk about the weather so much, until she came to England. The weather here is so interesting because you never know what you're going to get.

You really don't. What's funny is that one person on my team, it turns out, lives within walking distance of my parents, and we've been going to the same library for the last 30 years without knowing it. He's ten miles away from me, and we normally have different weather. Cambridge, you know; it's quite funny.

OK, we'll save trying to explain that one for the meteorology book. I've got you in because you've got one of those avalanche minds: whenever I poke it, I get an avalanche of information coming out. That's why you're here, and I want to get into that. But first: one of the reasons you have so many perspectives and angles on programming is that you've got such a varied backstory. You've worked in so many corners of the industry, so I'm going to start there. Give us your biography.

Sure thing. I got into the industry the way a lot of programmers do: you see video games, you think "I want to make those", you go to the local library, and you find a book that says "Learn C++".

That book has a lot to answer for. Why do we do that to children?

I know. And then my mother said, well, there's a course down at the local college you can go to for that. So me, aged 11, was surrounded by postdocs (because I was growing up in Cambridge), learning C++ to some NVQ level or so, way back in 1999. And that's where it all started.
Then at school I was lucky enough to play with everything from microchips, coding in BASIC (yes, actual BASIC on microchips), to some assembly, and I had a fun maths teacher who used to run C# courses during lunchtimes every now and again. So I got into it through all this, and then the fateful thing happened: I found modding tools for video games, and attempted to break every single video game ever, from hacking data files in things like GTA 3 to using the Unreal Engine editing kit to build my own levels and occasionally mod games.

That's a freakishly good way to grab a teenager's imagination.

It really is, because you give them a bunch of tools and say "are you smart enough?". The first instinct is "yes, yes I am, let me go prove it", and the answer was no, no I wasn't. All I did was learn a lot about loops, about what doesn't work, and about how to crash a game (very easily, if you do the wrong thing), plus some fun lessons like what scale actually costs. There are limits to what your machine can do, and there's a real reason games designers make certain levels certain shapes: so you can't see things around corners, and they can lazy-load them in.

Oh yeah, that was a serious thing back in Doom.

Still a serious thing today. Portal rendering: Doom did a bit of this, but it really became big in the Build engine, and Thief: The Dark Project was famously a portal renderer. The idea is you render things through little portal doors into the next zone, and never beyond that.

To get back to the story: I went to university originally to do some sort of avionics engineering, which I hated, and then transferred to astrophysics. So I have a degree in astrophysics, which I spent basically 90% of my time trying to avoid the physics of: every computing module you can imagine, every stats module you can imagine, electronic engineering modules, and then just enough physics to get by and still get the qualification. Along the way I got hooked on various bits and pieces; the first version of Unity came out (a Mac product originally), and a friend and I tried to build games on it. Still a good friend: we were chatting about building games in San Francisco last week, because it turned out we were both flying out, the world being a small place.

After uni I got my first job at a consultancy, doing technical consulting and management consultancy. I got that by talking about Minecraft mods I'd built.

Really?

I know. I used to build world generators to plug on top of Minecraft, trying to build worlds based on the kind of equations I'd been learning in planetary science: things like, how do you make a Mercury-looking world in Minecraft?

Oh, cool.

That's what I was doing at university, rather than my actual degree, which I really struggled through the end of. On top of that I was trying to prove that a 3DS didn't need two cameras, so I tried to build an augmented-reality system on my own, from scratch. I don't recommend it. [Laughter] I got about halfway there, and that actually got me hired as an image-processing engineer at this consultancy (that much I can say). There I joined in on a bunch of very large system developments, everything from inspection machines for drugs: the drug capsules you buy have all gone through a variety of inspection machines.
I helped design the insides of one. It was amazing: I got to play with high-speed cameras, gigabit Ethernet back when it was still a rarity, and custom GPU imaging techniques, all to ask how we look for defects in true real time. These are, you know, a million items per day, with one-to-one rejection as each item slides along the conveyor belt: it gets printed, goes over the edge of the belt, and you do air rejection, blowing the bad capsules off.

Oh, cool.

It looks fairly cool to watch, but then it gets faster and faster and faster, and you realize what can really go wrong at speed. I kid you not, the first bit of debugging I did, after we couldn't work out what was going wrong, was putting an oscilloscope on the trigger wires of the camera.

Awesome.

I found a missing common ground between various UPSes in the system.

Oh my God. Seriously?

Oh yes. Also very annoying; I lucked out by finding it. My colleagues at the time said no one else would have done that, it was kind of a weird move, but it worked.

There's literal debugging, where you pull the moth off the circuit board, and then there's one level above that, which is an oscilloscope between two things, watching pulses go by, to see if the camera's trigger is actually a proper rising edge in the right place.

Oh God, yeah. From there I built embedded circuits, programmed tiny 8-bit micros, and some other bits along the way, and I ended up doing mobile dev because I'd talked about the augmented-reality thing. One of the partners overheard that I could write code for devices, and asked how long I thought it would take to write an iPad app to do a questionnaire. I said a few weeks, it shouldn't really be that much more complicated, with only a vague understanding of how iOS worked. I didn't know it was Objective-C.

So you thought it was still C++.

And then an iPad turned up on my desk the next day.

It's one way to get the gear.

Three weeks later I shipped the app for this internal thing, and it did very well, except I'd had to speed-learn Objective-C, which I do not recommend. From there it snowballed: I increasingly ended up in these bigger and bigger systems, things like camera systems, inventory-management systems, and then a lot of apps which had to communicate, either as IoT devices or otherwise. Over time I specialized deeper and deeper into trying to make things more interactive, so a lot of augmented reality, and a lot of just high-speed data. Things like: how do you deal with low-transmission-rate UHF signal lines? The answer is, it's fine, as long as you accept a baud rate of almost nothing. I mean, think no-end-of-null-modem levels of bad. The thing is, an Ethernet port went in at either end, so it was transparent, if slow. So that's where I started out, and I did a lot of things I can't talk about because of the Official Secrets Act, but generally, the more secret, the more boring.

And then I left and founded an augmented-reality company on Google Glass.

Google Glass! You were actually one of those companies?

Yes. There's still a YouTube video you can look up, Race Yourself, for augmented-reality exercise games. The whole idea was you run in the park with someone virtual chasing you, or a personal trainer telling you to slow down and speed up.
Oh yeah, I could see that.

Those kinds of things were what we did lots of. And because I'd been involved in the launch of WebRTC when I was at the consultancy (when Google launched WebRTC, I built an air drum kit for it with a Kinect, so you could do that Jam with Chrome thing, if you remember that; it was at I/O, and it was good fun), I realized: hey, why don't I just take this tech and reuse it myself? So I used WebRTC to do multiplayer games over Google Glass, so you could run around playing Pac-Man with other people in the park. I'll have to share the video with you; it's good fun. It was over ten years ago now.

I'm always sad that Google Glass didn't evolve into something. It felt like it should have.

But then there was always that weird moment where we were telling people about turning on the GPS chip and they said it didn't have one. I showed them the bill of materials, and I showed them the GPS strings coming out of the device, and they were like, "how did you do that?"

So after that petered out and a bit of pivoting happened, I stepped away and worked at a hiking-app company for a while, called ViewRanger, as their R&D lead, doing more augmented reality on iOS and Android: cross-platform building of AR apps to label mountains.

Oh, nice.

And then low-level GPS code, to sort out safety-critical positioning in sharp valleys and things like that. How do you de-amplify error? How do you make it safe, so that all the mountain-rescue teams who used this app could actually rely on it, and navigate safely around the mountain where it told them they really were, rather than amplifying error, which you can do quite easily.

Have you pitched any of this to Apple, now they're doing their Apple Vision thing? They must be clamoring.

I have a picture of Tim Cook using the Vision thing with the mountain-labelling stuff. And at the Series 2 watch launch there's a literal keynote moment: one of my old colleagues is on stage talking about the watch app.

Oh, cool.

So I know for a fact they've seen some of this stuff already.

That's quite a cool feather to have in your cap.

Oh, it is. There are loads of these little bits and pieces. I just wish I'd physically been there, rather than back at the head office going "please don't explode the server, please don't explode the server". The firehose of clicks only eventually exploded the server, but they were pre-cloud, on a physical machine somewhere in a data center, because originally they were a company that launched on Symbian. They're now part of Outdooractive, so it's still a big app; it had five million actives, so it was a pretty cool time.

From there I stepped away to a logistics company I'd been building in the background for a while. I found my own investors, and did a ton of contracting along the way, for bootstrapped-startup reasons, which is how I worked at Rare (Microsoft) on things like Sea of Thieves and Everwild.

Oh, cool.

Which was good fun. I worked on AI development there for their games. Originally I was brought in to bring literally generative-AI thinking into the game; that was the original remit of the hire. It never quite happened, due to remits shifting around, but that's what I was hired for at the time.

What year was that, that you were trying to do generative AI in games?

2016.
2016, OK.

And what's funny is that we actually built something very similar to an LLM for addresses, at the logistics company. There was a lot of NLP processing behind the scenes, where we were basically taking in addresses, pulling them apart by parts of speech and by additional steps, and doing functional correction across that stuff. From there I went through a bunch of consulting gigs, at places building literal operating systems from the kernel up (which is fun and silly), and doing crazy-scale stuff. There was a great demo with CCP Games where I got 15,000 players into an actual real-time game; I helped with a lot of the architecture for the game itself. Then, after the logistics company had gone on to its high point, a failed acquisition kind of burned me out, and I went to be an exec producer at a small studio for a while, and then Improbable for a bit as well, after the merger-y stuff happened, doing big-scale things again; I built a big initial part of the renderer that became their metaverse renderer. Then the pandemic hit, I left just before that, and my daughter was born, so that was rather well-timed time out. Then I joined Ably Realtime; I think we met while I was still there.

Yes, real-time data.

Real-time data, yeah. I was the one banging the Kafka drum there. That was big-scale WebSockets with hardcore reliability, and also MQTT. I was Head of DevRel, and then a position analogous to a field CTO. After about two and a half years of absolutely hilarious, really good fun there, getting things like the Kafka story sorted out, getting us to market in these bigger and bigger areas, showing what you can really do with high-speed data, I joined Aiven, and now I lead developer education here, though I tend to go by "open source code sommelier" these days.

Open source code sommelier. There is one thing I'm jealous of you at Aiven for: they're a kind of stuff-as-a-service platform, right? You've got all the databases and all the infrastructure-y playground stuff, and you can go and build crazy demos with it and call it work.

That is, let me say, a conservative 90% of why I joined. It's also the biggest story of all, which is that open source is eating the world. One of my colleagues had this great description of open source: you're basically leveraging the free cycles of every single one of the world's developers out there. Because that's what they're doing: they're putting things into open source, arguing beautifully, productively, to make something really good, because they care. And at somewhere like Aiven, we are still a big contributor to open source; something like 20 to 25% of our dev teams are literally upstream-only committers. Seriously, we have that many committers on staff.

That is a decent percentage.

It is. All four founders were Postgres committers.

Oh, cool.

People don't realize. Everyone thinks, oh, you're just leeches. No: founded by people, run by people, who still contribute.

That's really cool.

So the DNA of the company is this idea of: find the best of open source, deliver it on an awesome piece of infrastructure, with so much infrastructure work behind the scenes to abstract away all those cloud-platform layers.
And just say: here's the database, get going.

Which leads us on to the main topic I've brought you in for, because you've got access to all these different data platforms, and a lot of experience of different ways of building software. And, not everyone can see this, but you're wearing a t-shirt that says "have a nice data".

This is from Kafka Summit this year, and possibly one of my actual favourite t-shirts. I've had so many requests from people saying "where did you get that t-shirt? It seems awesome." The other good one is from last year.

Putting all that together, let me put it this way. I think a lot of companies, a lot of projects, say to themselves: for generic Project X, we have two arguments. Let's argue about whether we're going to use MySQL or Postgres, and once we've settled that, we'll argue over whether we should be processing data using Python or SQL. Settle those two questions and you're away.

Yes. This is the classic dichotomy: there are some answers in the world, but most of them are Postgres. Everyone starts out reaching for the tools they know, and we all remember the LAMP stack of old.

Yeah.

It was beautiful, because it answered every question you had. You needed a machine to run on, running Linux. You needed something to handle the actual requests themselves and reverse-proxy them, so you had Apache. You needed a database, and MySQL was everywhere at the time; Postgres wasn't really an option yet.

It was just a little early in the Postgres story.

And then of course you needed something to actually write your code in, and PHP got stuff done. I will always give it credit for that: it got stuff done, and it opened the door for a lot of programmers. It had productivity on its side, and eventually, getting stuff done soonest wins, whether it runs or not.

It's like The Simpsons: there's the right way, the wrong way, and the Max Power way.

I've not heard that quote.

Homer renames himself Max Power, and the question in response is "isn't that just the wrong way?", and Homer says "yeah, but faster!"

That's PHP. Though it's not actually always the wrong way; it's just faster to get there. You'll get an answer faster.

That's another great dichotomy in programming: do you want your problems today or tomorrow? There are a lot of solutions which are just storing up problems for tomorrow.

Absolutely. And as much as I'm often the one saying technical debt is a coin you spend like any other, you do end up at the point where sometimes you have to pay it back. Which comes back to this idea of: what tools do you use along the way?

So how do you navigate that decision space? When would you come off that path of Postgres with Python?

If you're actually at Aiven yourself, the answer is never. Aiven is still mostly built by Postgres committers in Python. That is the religion in this building. That aside, though, it comes down to two or three major questions. The first is always going to be access patterns: how are you trying to access this data? Am I changing, say, the location of a single person a thousand times and then needing to query it quickly? Am I reading the stock market in as a massive stream of data and acting on the events as they change?
Am I looking at flight data for the last 100 years... maybe not 100, let's say the last ten years of flight data, to work out where I can optimize my flight routes? All of these questions really come down to a different access pattern on the data itself. You can always say "Postgres, dot dot dot" and it will give you an answer, and until you hit some very far-out outliers, you'll just get a worse and worse answer, depending on what happens next.

Yeah, you'll be stretching it further and further out of its comfort zone.

And as soon as you go beyond a certain point, what really gets you, more than anything else, is cost. Fundamentally, you only have a defined budget for an answer, and that's both financial cost and the cost of failure.

So you're considering time as a cost in there too.

Yes: you have a time window bounding success, and a failure window on top of that. An example I keep giving these days: you're trying to recommend a movie when someone finishes one on Netflix, or Amazon Prime, or a service yet to be determined. When you leave that video, they have about 20 seconds total to recommend something, or they're effectively going to recommend something wrong. A good recommendation keeps you on the platform, or lets you lock in your next thing; your churn percentage will go through the roof if they don't give you the next thing, because you'll just not see what to watch next, and exploration takes time. Your engagement is massively determined by how fast that answer arrives, so the cost of failure is actually quite large if the data isn't fast enough. And if the cost of making the data fast enough is "I now need a hundred thousand cores to run my Postgres", that's not a good answer any more. This is where we come back to cost: I'm abstracting cost as time versus machines here, because if you have enough machines, the time will go down, but the number of machines is a cost in itself.

And then the flip side is trading. Let's say you've got a fill-or-kill order for 100 shares of Tesla: if you don't get that right, you have potentially unlimited liability. The cost of failure can be very high, so having a correct answer in the time allowed is something you can genuinely spend money on.

But then, coming back to why we don't use Postgres for everything: these days, a lot of the time, I'm saying use a specialist analytics database, like ClickHouse, like Snowflake, like Redshift, because honestly you can't put more than a few terabytes into Postgres before it gets very upset with you. It just starts consuming cores, and then you watch it slow down, and you can't do the fast transactions you want on it. Some of that is what Postgres is, with its ACID transaction model; some of it is SQL; but a lot of it is simply that the way it's engineered is not for scrubbing through a very large quantity of data very quickly, but for finding one answer within a defined number of jumps. Six, I think it is, with the B+ tree under the hood.
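As a rough back-of-the-envelope for that "defined number of jumps" point (the fanout figure here is a typical assumption, not a number from the episode): a B+ tree's depth grows with the logarithm of the row count, so even enormous tables stay shallow.

    \text{depth} \approx \lceil \log_f N \rceil
    \text{with fanout } f \approx 500 \text{ keys/page and } N = 10^{10} \text{ rows:}
    \log_{500} 10^{10} = \frac{10 \ln 10}{\ln 500} \approx \frac{23.0}{6.2} \approx 3.7
    \Rightarrow \text{4 levels, i.e. a handful of page reads per lookup}

Which is exactly why point lookups stay cheap at almost any size, while scan-the-world analytics queries are the thing that hurts.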
That kind of implies this is a question that only affects people with more than a few terabytes.

Well, here's the thing: one of the most ubiquitously freeing things of the last 30 years of software development has been Moore's Law. Fundamentally, the machines we run on are supercomputers. I was at a computer museum in San Francisco recently, and the Cray-1 in front of me, which I wanted to sit on like a piece of lounge furniture (I didn't, because they said no), is probably less powerful than the server this podcast is being recorded on.

I think my laptop's got a dozen graphics cores. When did that happen?

I know. And those are scalable vector units; that kind of thing used to only be in supercomputers. That's serious business when you think about it. So with Moore's Law basically holding up, whatever we use, we get away with a lot of solutions which are not optimal but are lost in the noise.

This is where we come back to where speed comes from. If I'm going to call a database and it's on the other end of a network wire, I'm not going to get faster than maybe 50 microseconds in the same data center, or, for a cloud one, if I get better than 10 milliseconds there and back, I'm already doing pretty well. So I don't really need my database to be much faster than about half that network cost.

You might find the bottleneck is actually serialization and deserialization.

It really often is. We have these novel formats specifically for this, things like MessagePack and Protobuf. Or hopefully not Protobuf; I have a bone to pick there, I ranted on stage at QCon about them. But the key thing always comes back to this idea that everything costs something, therefore where do we need to optimize? And particularly with Postgres: Postgres is pretty quick, most data is pretty small, and most queries are not very complicated. If you're not doing a very complicated query, you can get away with almost anything. I know people who have Redis as their primary database, because they're not doing anything clever, and it works. I also know someone who had Redis as their primary database, and the company imploded one day.

I once lucked out on that. For about eight months at one startup we had Redis as our database, because we were in a hurry, and it worked. We never got caught out until we found out how bad it really was later down the line, when we realized that, thanks to a very aggressive caching policy, we'd never actually exercised the clever write module we'd put in to persist things properly.

Oh, right. For disclaimer purposes, I'm not sure Redis are advising it be your primary database of record.

The website would say otherwise, actually.

Oh, OK.

It's a very multi-model story these days. And if you work for Redis, you're invited on the podcast. But genuinely, they've done some magic. They have write-ahead logs going to disk these days; they've taken a fun idea and stretched it so far that it's weirdly awesome now. I have nothing but fun things to say about what Redis can do. The general question is: should it?

Fair enough.
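To put numbers on that latency-budget argument (the round-trip figure is the one quoted in the conversation; the query-execution time is an illustrative assumption):

    \text{total} = \text{network round trip} + \text{query execution}
    \approx 10\,\text{ms} + 5\,\text{ms} = 15\,\text{ms}
    \text{halving execution to } 2.5\,\text{ms} \Rightarrow 12.5\,\text{ms total}
    \text{only a} \sim 17\% \text{ win: past a point, the wire dominates}

Hence the rule of thumb: a database much faster than roughly half your network cost buys you very little.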
So push this more into the language space for me, because we talk a lot about databases on this podcast, but where does this affect language choice?

What it really comes down to is that most of what we do in programming is try to express a series of concepts in our code which a machine can then understand. By the way, this is my general statement about LLMs: LLMs do two things very well, they make us sound like computers, and computers sound like us. They're really good at translating between the two. They're basically a very weird programming language, for want of a better word. Or a very bad interpreter for a programming language.

Which makes me think of something someone else said about LLMs: they don't give you an answer, they give you something that looks the shape of an answer.

Exactly. We're missing the compiler for the LLM, basically.

The thing that checks you actually said what you think you said. I can imagine a borrow checker looking at someone's grammar and going "no".

Exactly. But to get back to this idea of manipulating data at any kind of scale, whether it's the one or the many: currently the de facto standard is SQL. That's 49 years of development and an ISO standard behind it. It's designed to be a declarative language: you state what you want to happen, and then it happens. Over time that's stretched; you can do all things in SQL. I think I may have shared the Schemaverse with you at one point; if I haven't, I need to. Someone built a multiplayer space-trading game in Postgres SQL, and the whole game runs inside Postgres stored procedures.

No! OK, you can find it, it's open source; we're going to link to that in the show notes.

It's how I got into Postgres: I was searching for multiplayer space games, found that, and was like, I'm in. If it's powerful enough to do that...

You're interested. Nerd-sniping you is not really a challenge, I get that sense.

It really isn't. But back to the point: SQL can do most things, but should it, and is it optimal for them? In general it was designed against the relational model, which carries a lot of thoughts and patterns from when it originally came about. The idea that relations are important in data; that you normalize your data to reduce its storage footprint, because, remember, storage was almost an order of magnitude more expensive than compute for a while.

It really was. Normalizing your data mattered.

And then we had the big shift, probably about ten years ago, maybe fifteen now: compute collapsed in price a bit, but here's what really happened, storage went to near zero. Commodity storage became a thing. So suddenly it's: why are we paying costs to normalize when we don't need to? Why don't we just make access faster? So we denormalized our data, and suddenly the relational model didn't hold up, so SQL didn't hold up, so we went NoSQL. Or "not just SQL", or "not only SQL", or an acronym of your choice.
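A minimal sketch of that trade-off in plain SQL (the schema and names are illustrative, not from the episode): the normalized form saves bytes but pays a join on every read; the denormalized form pays the join once, at write time.

    -- Normalized: storage-efficient, but every read pays the join.
    CREATE TABLE customers (
        id   INT PRIMARY KEY,
        name TEXT,
        city TEXT
    );
    CREATE TABLE orders (
        id          INT PRIMARY KEY,
        customer_id INT REFERENCES customers (id),
        total       NUMERIC
    );

    SELECT c.city, sum(o.total)
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY c.city;

    -- Denormalized: storage is near-free, so widen the row at write time
    -- and make the read a straight scan of a single table.
    CREATE TABLE orders_wide (
        id            INT PRIMARY KEY,
        customer_name TEXT,
        customer_city TEXT,
        total         NUMERIC
    );

    SELECT customer_city, sum(total)
    FROM orders_wide
    GROUP BY customer_city;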
And then you come around again and think: wait, we now have a different access pattern. Cassandra's first query language was Thrift.

Really?

Yes, an API to begin with. Its designer has since said, roughly, "it was good at the time; I regret many things about it". But this is the thing: RPCs are just one language model of choice, and that's all SQL is. It's an RPC language that changes something, whether that's data description, data modelling, or a query. Then you think: why aren't we using something more general-purpose? Because I can transform data with SQL if the data is SQL-shaped, but what happens if I actually want to run something a bit more clever? Say I want to do a Fourier transform, to find the frequency domain. This is very prominent with images: JPEG is a Fourier-domain compression, where you take an image, transform it into the frequency domain, do some normalization there, and flip it back. That's why JPEGs are very good at looking good even though they're highly compressed. There are smarter schemes, but they work on similar principles. And you can't really express a Fourier transform in SQL. Unless it's a competition, at which point, you know, competitive coding, like one of your previous guests talked about; that is the realm of the wonderfully crazy, and I don't approach it.

Instead you'd want to reach for a more complete language. But the problem now is: imagine you've given people Python, with access to your actual database, and they're running Python in your database.

Are you actively worried about that?

Personally, I like the idea of it, right up until everyone asks: what happens if someone puts in something like a malicious query and it starts doing random things across your network? How do you sandbox it? I've seen, in the wild, a Java stored procedure that decided it would be a good idea to start a web server.

Yes. And it wasn't malicious; it was just a really bad idea.

I've literally had that bad idea myself. That kind of thing is why, on average, you don't let people run code directly in the database. Other than Lua, inside many things, it seems. But it's really powerful, because suddenly you have arbitrary compute, and we like arbitrary compute: if I can manipulate the data where it is, I'm not paying a network cost. I can use the fact that there's a big CPU there doing my heavy lifting, and my web server can stay nice and stateless. It's like one of those benedictions: may your services be stateless.

Some people would argue that's a mistake in how you're looking at the database, and you should actually split it into a storage layer and a compute layer.

Ah yes, the shared-everything model, which, bizarrely given the name, is the one with the two separate parts. Shared-nothing has each node holding its own compute and storage together; shared-everything splits them apart. One of those moments of "I see where you're going, but the naming came out funny". Loads of services do this now. Famously, Snowflake just dumps everything into S3, and loads of databases secretly dump everything into S3 these days. Hell, WarpStream have rebuilt Kafka with only S3 below it.

They have. Very interesting-looking thing.

And it's modelled like a lot of these Prometheus backends, like Thanos, where you just have agents writing directly into S3. It's almost a data-lake model, like Iceberg or Hudi tables. The problem ends up being that once you're in S3, you're at the whim of S3.
S3 is way faster than it should be for what it is, way faster. It's also way more reliable and more scalable than it should be; it's almost black magic under there. But fundamentally it's still around 100 milliseconds to do a change or a read. That's partially network hops, partially the actual S3 machinery, but also the API itself is not that fast. It's OK once you start streaming, but you've got to establish and re-establish connections, and then you start having a problem. When you're on a local disk, you end up literally two orders of magnitude faster.

That makes perfect sense, and the question is where you can afford to pay it.

If you can afford to have it in S3, it's probably a good idea. If you can't, you often don't realize until you've already paid the cost, without knowing you were paying it. That's where you end up with these hybrids, and a lot of what I'm working on right now is around this idea of doing hot and cold storage at the same time: treating local disk as just another layer of cache, fundamentally. I have some ClickHouse examples where I have dictionaries, which are literal key-value lookups, covering half of my literal RAM; then I have a hot layer, which is a strict materialized view on local SSD; and I have my cold extension off in S3, which is either scrubbing across Parquet files from a Delta-style data lake, or just reading files I've dumped there myself. So I have tiered storage, but with more layers than just the two everyone talks about (roughly the shape sketched below).
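For concreteness, a hedged ClickHouse sketch of two of those layers (table, policy and dictionary names are all illustrative; it assumes the server is configured with a storage policy, here called 'hot_and_cold', that has a local-SSD volume and an S3-backed volume named 'cold'; the mid-tier materialized view is omitted for brevity):

    -- Warm/cold: a MergeTree table whose parts age off local SSD into S3.
    CREATE TABLE driver_positions (
        driver_id UInt64,
        lat       Float64,
        lon       Float64,
        ts        DateTime
    )
    ENGINE = MergeTree
    ORDER BY (driver_id, ts)
    TTL ts + INTERVAL 7 DAY TO VOLUME 'cold'   -- move old parts to S3
    SETTINGS storage_policy = 'hot_and_cold';

    -- Hot: an in-RAM dictionary doing literal key-value lookups over the
    -- latest state, refreshed every 10 to 60 seconds.
    CREATE DICTIONARY driver_latest (
        driver_id UInt64,
        lat       Float64,
        lon       Float64
    )
    PRIMARY KEY driver_id
    SOURCE(CLICKHOUSE(TABLE 'driver_positions'))
    LAYOUT(HASHED())
    LIFETIME(MIN 10 MAX 60);

Queries then hit dictGet() for the hot path and fall through to the table (and transparently to S3) for history.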
Is that because you just like the idea, or is it practically useful? Do you think people should be doing it, that the future is going to look like that? That's what I'm getting at.

I started out thinking "this is fun", like I often do, and "what can I do next?", because ClickHouse is a massive box of tools. But then you come across the simple fact that you need the right tool for the job, and this is where we come back to those language choices. Why should I use R or Python today versus SQL? If my problem domain is too complicated to express easily in SQL, I probably shouldn't use SQL. But if I have a regular query I'm running on repeat... let's say I have my streaming analytics coming in: every single position of every single vehicle in my fleet of delivery drivers, plus the state of every kitchen of all the restaurants we work with. We're a food-delivery company today. I often want a view of every single restaurant against where its delivery driver currently is. That's just a series of key-value pairs. But think about how much scrubbing you'd have to do to get that view out of something in S3; that is disproportionately expensive. I want to do the work once and have it ready to go, and this is where you get to those optimizations: it's not just faster by two or three orders of magnitude, it's cheaper by four, because I'm not paying for every single one of those network hops along the way.

And when you actually want big answers fast, you have to start thinking: how do I have small data? I can't have lots of data changing the moment I want to query it, otherwise my calculation starts stretching towards infinity, and we can only put so many cores in before it stops making sense, because otherwise you hit the simple fact that you're hopping between cores, hopping between threads, hopping between servers, and then you're back to "wait, how fast is my network again?"

Now, to play devil's advocate, someone is going to hop in and say: hang on, you're trying to join two large data sets, you're back to Postgres.

And the answer is, I wish. I think Ben Stopford put it best: if we could get away with just Postgres, we would. You're always joining something, but my argument comes down to this: if I pay the join once, and only once, and then have the secondary table ready to go, am I good? You end up with these cascaded views of your data, and that cascades all the way up. This is the big conceit of where I work right now: I don't have to say "this tool is magic", I can say "this tool starts here and stops here", and then I go up the tree, because then I get to say "and now let's go really fast", with a Redis cache on top.

OK, so give me that map then, and I'll allow you a bit of a plug here for Aiven. Given the services on offer, where do they start and end?

To give you an example: let's say you're a ride-share company, or a delivery company, or e-commerce; everything is e-commerce these days, strictly speaking.

Give me the restaurant one, because I like that; I've not heard that one before.

You start with many drivers streaming their locations, so you've got lots of quick data coming in. That's MQTT, and then we're going to absorb it into Kafka, because Kafka beautifully matches MQTT (I wanted to give that talk, but they said no). MQTT is just the IoT protocol of choice; you could put WebSockets there with Ably or something else instead, it doesn't really matter, you just need to get that data in quickly, as a stream.

So we've got the stream of data coming in through Kafka, and now we're going to build what I like to joke of as the KFC stack. It's actually a good joke, because it's Kafka, Flink and ClickHouse, but also because we have eleven products, which are our herbs and spices.

Oh God.

We had the whole room groaning, which is the proof of a good pun: if you can make the whole room groan. And literally one of my team just went "nope" and walked away.

So what happens then is the data goes into something like Flink. Flink is the stream-processing engine du jour, and not to say it's going to go away: it's actually slightly older than Spark, which is funny; it has been around that long. It's very good at this distributed stream-processing thing: take the concepts from Kafka Streams, but wrap them up in something that handles all of the offset management for you, the checkpointing for you. So you do a big join, a denormalization, you split out the data, you convert it to Avro, you get it into some properly easy-to-process format, and you put it back into Kafka. Then you stream it into ClickHouse, so your long-term store is being built live. You have the data going back in time, but it's still in Kafka too, so we can do more with it.
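A minimal Flink SQL sketch of that middle hop (topic names, fields and formats are illustrative, not from the episode): read raw positions from Kafka, downsample to one snapshot per driver per 20-second window, and write the snapshots back to Kafka as Avro for everything downstream.

    -- Source: raw driver positions arriving from the MQTT->Kafka bridge.
    CREATE TABLE driver_positions (
        driver_id BIGINT,
        lat       DOUBLE,
        lon       DOUBLE,
        ts        TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic'     = 'driver-positions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format'    = 'json',
        'scan.startup.mode' = 'latest-offset'
    );

    -- Sink: denormalized snapshots, Avro-encoded for downstream consumers.
    CREATE TABLE driver_snapshots (
        driver_id  BIGINT,
        lat        DOUBLE,
        lon        DOUBLE,
        window_end TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic'     = 'driver-snapshots',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format'    = 'avro'
    );

    -- One row per driver per 20-second tumbling window: the
    -- "approximately hot" view described next.
    INSERT INTO driver_snapshots
    SELECT
        driver_id,
        LAST_VALUE(lat),
        LAST_VALUE(lon),
        TUMBLE_END(ts, INTERVAL '20' SECOND)
    FROM driver_positions
    GROUP BY driver_id, TUMBLE(ts, INTERVAL '20' SECOND);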
Because Kafka is pub/sub, you then also have a hot cache in Redis, and this is where you do geo-adds. The Kafka connectors for Redis can add the data to geo sets in Redis, so now I can build a hot cache of where every single driver is, for geo-point queries, while I'm doing all the rest of this in flight, at no real cost to anything else, all in real time. I do the same thing with my restaurants. So I have two caches: give me the restaurant and its current status; give me the driver and their current status.

But now I want to go one step more and queue up another nice, quick, easy-access table that says: give me the best thing for my current situation. So what I do now is join the contents of my drivers in flight with where they are, averaging over a window. Every 20 seconds, say, I dump out the drivers, their locations, their current contents and their current direction, so that when I do a "select nearest driver, please" I have a nice, approximately-hot cache telling me these are the people in the snapshot window, at which point I can ask the really hot cache where they really are right now, and then I can safely make that join.

And it gets better, because I have more than one subscriber. That restaurant data is coming through too, so I now know whether my restaurants are starting to meet their capacity, because I have historic data in ClickHouse saying these restaurants can only really handle 30 or 40 orders per second. So I can do a join of your current orders against your max orders per second and say: anyone likely to exceed that, recommend them down, put them lower in the list, so I never end up over-indexing on the most popular restaurants. It's egalitarian: I want everyone on my restaurant platform to experience an even load. But more importantly, I can't give a bad customer experience. Everything we do is because of customer experience.

So now what I've got is the ability to recommend the right restaurant at the right time, independent of how loaded they are. If the load is too high on the local place that serves the best pineapple pizza, it's going to go down the list.

(That's an in-joke; you'll probably have another guest on at a later point who will make it more apparent.)

And when that place goes down the order list because they're too busy, no one suffers: neither the restaurant having to say no, nor a user getting frustrated. What's most important here is that we're doing this in real time. You don't have to re-query and see "restaurant full, restaurant full"; we're just not going to return it to your app, because we know. We also know not to give a driver too many jobs; we know when a driver isn't actually going to make a turnaround point; we can give a really accurate assessment of cycle time. The driver we suggested might be the closest, but he's on break, and we have his current state before we make the decision.

Right. So in that stack you're advocating for real-time data, spending the cost of materialization once per mutation, and having those historic views already turned into something usable.

It's basically why I often term this "extract, transform, load and optimize": the key problem always ends up being that I've loaded this data, but if I don't optimize it, I can't really use it outside of a dashboard. So how do I make long-term historic data usable at real-time speed?
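A hedged ClickHouse sketch of that "recommend them down" check (table and column names are illustrative): compare each restaurant's live order rate over the last minute against the ceiling derived from its history, and flag the ones to demote.

    SELECT
        r.restaurant_id,
        live.orders_per_sec,
        r.max_orders_per_sec,
        live.orders_per_sec > 0.8 * r.max_orders_per_sec AS demote
    FROM
    (
        -- Live load: orders per second over the last minute.
        SELECT restaurant_id, count() / 60 AS orders_per_sec
        FROM orders
        WHERE ts > now() - INTERVAL 1 MINUTE
        GROUP BY restaurant_id
    ) AS live
    JOIN restaurant_capacity AS r USING (restaurant_id);

In the architecture described above you'd materialize this once, fed by the stream, rather than run it per request; the 0.8 headroom factor is an assumption, not a figure from the episode.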
And your answer to that is picking specific data tools for the job.

Exactly, and you never have a stack of one tool. If you have one magical tool, and for me it's not Postgres, I'm going to be surprised.

At what point should a solution start going beyond Postgres into what is definitely a more complex and more expensive stack?

The answer is: as late as you can get away with, and no later. Though the inevitable fact is that "slightly too late" is nearly always the case. But you can be quite forward-looking here, because none of these things are hard dependencies on each other. That's the magic of doing a proper distributed architecture with something like an event bus between the parts. We can start with Postgres, add a ClickHouse next and just federate, start draining the long-term data out straight away, no additional tools required. Then we say: actually, we need this data to go more places. Start with pub/sub: start with RabbitMQ if you need to go lightweight, and go to Kafka when you outgrow it. Though when you realize RabbitMQ's protocol is still pre-version-one, maybe it has a problem.

Yes. I'd reach for RabbitMQ first myself, but it is still below version one; it's not version 1.0 yet, due to reasons.

At this point I believe it's naming convention more than anything else. We could go into a whole separate rant about what version one actually means, but let's not.

I believe "marketing" is the technical answer. But yeah, the idea is that you incrementally build one of these systems, and the reason you anchor it on something like an event bus is that once you decouple the systems a little, and bring the right tool to the right job rather than over-stressing any individual system, at no point does your main transaction system fail because something downstream failed. You still take orders. You can still have a rough guess at whether your driver is going to get there or not. And it gives you the ability to say: well, I wanted to do this transactional system in Postgres, but now I've gone too big; let's roll Cassandra, let's go massively huge, make sure I can't fail at any given point in time, no single points of failure.

So, voila, Cassandra. Let's say I'm doing shopping baskets now; this is one of my favourite little demos, built mostly to prove a point to a local supermarket. My baskets are at big scale, I'm using change data capture, I'm pulling everything out of the Cassandra tables, because I can go to any scale, and now I'm matching my baskets against my actual inventory. So now I know when I've exceeded the threshold of "I might have actually tried to sell too many oranges today". Therefore I can message my top few people, the ones who are subscribers, and say "lock in now and don't get a substitution", because I now know roughly who is going to be disappointed, ahead of time, because I've seen it happening as it happens. But my stock levels are changing in real time as well, so I need both.
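A sketch of that oversell check as a continuously updating streaming-SQL query (assuming basket changes arrive via CDC on one stream and stock levels on another; all names are illustrative):

    -- Reserved quantity per SKU across all open baskets, joined against
    -- live stock. In a streaming engine this query updates as events land.
    SELECT
        i.sku,
        SUM(b.quantity)              AS reserved,
        i.in_stock,
        SUM(b.quantity) > i.in_stock AS oversold
    FROM basket_items AS b
    JOIN inventory    AS i ON b.sku = i.sku
    GROUP BY i.sku, i.in_stock;

The "message your subscribers" step would hang off the rows where oversold flips to true.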
Two questions then, and they may have the same answer. In that stack, what's the place for transaction-heavy processing? Because everything you've described feels like analytics-shaped processing.

So, I fall into the camp of being more in the event-sourcing world, and arguing that transactions are kind of a flawed concept.

OK.

The idea of a transaction is that it's atomic, consistent, isolated and... what's the last one...

Durable.

Durable, yes. Between the two of us we can pass for one computer scientist. We're computer enthusiasts.

The point about transactions is this. Everyone says distributed transactions are hard, and any system with two systems in it is distributed, so we're already into distributed transactions before we've even started. The second part is that the first thing you must do, for a transaction to be real, is stop time, because your transaction is only ever consistent within a certain time slice. You can only be atomic assuming no other writes happened while it happened, which means you're already time-slicing in ticks, which means your consistency is time-sliced to that point in time, and by the time I want to refer to it, that time point has passed. So either it's event-sourced, in which case when I materialize my events it is consistent, or it's not consistent. There's that classic bit of the quantum world about incremental time, but there's also just the idea that we treat ACID compliance as a touchstone, assuming it's the only way to do things, whereas it's never really held perfectly true. It's like the CAP theorem: we normally get one, not two, and on a good day we get one and a bit.

But it doesn't actually matter, and that's the other point. The trick is: if we can make everything go quickly enough, and the likelihood of a concurrent change low enough, then statistically we're good enough. And beyond that, you push into event sourcing and avoid the transactional system entirely, or dial it back to the lowest footprint you can. You do need guarantees, and transactional systems offer really good guarantees, though I'd argue that what they consider guarantees are softer than they admit. Just because... you've used computers for a long time, I've used computers for a long time, and the one thing about using computers for a long time is that you become more surprised they work at all.

Every day I'm more surprised anything works.

Absolutely. The more you see these systems, the more you know what's going on inside, the more you realize none of them has a consistent view of the world. So we basically assume that Postgres, MySQL or one of the others has a pretty damn good view of the world, trust it to a certain point as the starting point, and then, as we cascade down the stack, we accept that we are eventually consistent within a defined time frame, assuming no more events pop up. That is the model we have to go for, because any system above a certain level of complexity is going to be somewhat consistent at best.

There's a great talk from Kafka Summit, by, I can't remember her name any more, about the idea of using completion patterns...

That was Anna McDonald. Great talk; genuinely the best talk I heard there.

We'll link to that.

The best talk of Kafka Summit, in my humble opinion. I've written a kind of follow-up, a literal follow-on to it, that just didn't get accepted, but it's now a cornerstone of what I talk about, because I've used exactly those patterns before, for exactly the same reasons. It was one of those crystallizing moments of "I need to talk about this more". And it's exactly this: if we can achieve consensus, it's not a problem, but we've got to pay for it somewhere, and we don't need to pay for it where everyone assumes we do. So outbox patterns: they're fine, but do we need them?

Take me through that in a bit more detail.

The outbox basically says: we have our transactional table, we join a bunch of things, we output to another table, and we just follow the log of that table.
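For reference, a minimal sketch of the transactional outbox being discussed (table names and payload are illustrative): the business write and the event write commit atomically in one transaction, and change data capture (Debezium, say) tails the outbox table's log.

    BEGIN;

    -- The business write.
    INSERT INTO orders (id, customer_id, total)
    VALUES (42, 7, 18.50);

    -- The event write, in the same transaction, so downstream consumers
    -- never see an order without its event, or vice versa.
    INSERT INTO outbox (aggregate_id, event_type, payload)
    VALUES (42, 'OrderCreated', '{"order_id": 42, "total": 18.50}');

    COMMIT;

Ben's counter-argument, next, is that outside regulated domains you can skip the extra table and do the equivalent join downstream in the stream processor.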
Whereas I'm going to argue that, fundamentally, we're doing some processing anyway, so let's just throw it into our stream-processing engine, which is a bit further downstream, keep the events as they are, and get no delay. Why not just use Flink for that, downstream, rebuild the view at will, and have all the information when we need it, rather than assuming we only need a limited subset? Now, this doesn't work in banking or some of the highly regulated domains, where you need to show certain things are true, and showing three things work in one database is much cleaner than showing the work across ten.

Yes, and more likely to be true, frankly.

One would hope, these days. I have been surprised. The early days of certain databases I won't mention led me to believe that publishing to /dev/null was at least deterministic, compared to some of them.

I can probably guess which database you're thinking of, but let's duck that.

If you know the one I'm talking about: it got a hell of a lot better. They bought somebody who fixed it.

It's MySQL, isn't it?

It's not.

Oh, is it not? OK, because MySQL also bought something and got a hell of a lot better.

That one definitely did too. But the one you're thinking of might or might not be the right one. Let's just say that WiredTiger is really good.

OK.

Genuinely, it's an absolutely awesome tool these days. In fact, it's so good it actually causes problems, because people don't model their data as much as they might need to ahead of time.

It's a wonderful safety blanket, until it isn't. Because that's the other lesson of relational databases: modelling your data, and understanding your domain's data model, as a primary concern. I think that's an art we've lost in programming.

It really is. I came into databases originally through Cassandra, and the idea of not modelling data seemed 100% alien to me, because of course you model your data. If you start in strongly-typed languages like C++, or Haskell for a while in my case, as for you too, I think...

Yeah, absolutely.

...you have this idea that everything has a type. I always like the Rich Hickey quote: everything has a schema, the only question is whether you wrote it down.

Yes, exactly. And then he wrote a language without static types, and at this point I'm going "two of these things do not agree, and I do not know where you're going with this". I love what Clojure can do; I just can't wrap my head around the thought patterns it requires.

See, I love the thought patterns. What retired Clojure for me, let's say, not killed, is that I found in Haskell I could do everything I liked about Clojure, plus static typing and all the benefits that come with it.

I went the other way, and that was the problem: I started out in Haskell, and then I got to Clojure and went "but now I have JSON objects flying around and I have no idea what they are".
2023-09-26