Building data pipelines with Kafka and PostgreSQL


Yeah. So I'll be giving a presentation about data pipelines with Postgres and Kafka. I gave the same presentation a month ago at PGConf.EU. This time around it was actually supposed to be presented by my colleague, but he fell sick earlier today, so I'm covering for him.

The agenda for the talk: I'm going to talk about data pipelines, a bit about Kafka and how it works and the related concepts, then how you use Kafka and Postgres together, with a word about InfluxDB somewhere in between, and then how you use Kafka for processing the data. The examples I'll mostly be covering are from real life: time series data and its manipulation.

A word about me: I'm a co-founder at Aiven. I'm also a Postgres aficionado, have been using Postgres for ages, and have contributed to multiple different open source projects around it.

And a word about us: we're a database-as-a-service startup, founded a couple of years ago. We currently serve customers in six different public clouds, offer eight different databases as a service, and our customers range from small aquarium stores to huge companies like Toyota and Comcast. We've been operational and selling our services since early 2016.

Then, a word about data pipelines. The Wikipedia definition is: a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one; the elements of a pipeline are often executed in parallel or in time-sliced fashion; the name "pipeline" comes from a rough analogy with physical plumbing. So now that we've got the definition out of the way, you know what I'm going to be talking about today.

Anyway, modern data pipelines are used to ingest huge volumes of data. For example, Netflix claims they're processing two trillion messages a day in their Kafka clusters, which roughly equates to three petabytes of data every given day of the year. So people are using data pipelines at huge volumes.

Also, real-time streaming data pipelines are these days more or less replacing the old ETL processes in different firms. Previously you'd have, say, a nightly batch where you dump your whole database, load it into a reporting database, and run some sort of ETL on it — extraction, transformation, loading — basically massaging the data so it fits your reporting database. These days people would rather process the data in a streaming fashion, and hopefully in real time. People aren't willing to wait 24 hours for an answer to "how many books did we sell today"; they want to know it by the second. The same applies if you're a gaming company: you want high scores that are updated every couple of seconds or so, and you definitely don't want them lagging behind by a day.

Anyway, the common components of a data pipeline. There's usually a component that ingests data, and that's the thing that has to survive firehose-style data bursts — lots and lots of data coming in. Then you typically do some sort of filtering. Say you have an HTTP access log and you're only interested in entries with 200 as the status code — not the 404s or the 300 series, just the 200s — you'd be filtering at this point. Then you usually do some sort of processing, because what's the point of collecting the data if you're not doing anything with it — this could be calculating how many requests a day you got, or whatnot. Then, typically, after the fact you want to query the data somehow; it's nice to have the data somewhere, but you'll probably want to actually query it at some point later. Then eventually you probably want to dump it somewhere, so that if for some unforeseen reason you're interested in it six months from now, you don't need to keep it in your day-to-day databases, but you still have some place you pushed the data to. And because somebody will eventually come up with a great idea for analyzing the data better the next time around, you probably need to be able to reprocess the data again and again.
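
As a toy illustration of the filter step (this sketch is mine, not from the talk, and the field names are assumed):

    # keep only successful requests from a stream of parsed access-log events
    def successful_requests(events):
        for event in events:  # each event is a dict like {"path": "/", "status": 200}
            if event.get("status") == 200:
                yield event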

About the requirements for these: scalability, like I mentioned — there are firms running huge data pipeline platforms, there are firms doing this on a small scale, and we try to cater to everything in between. Then, because of these data volumes, you typically want the system to be available all the time; it's usually not okay if it goes down for a couple of hours or days, you want to keep it running at all times. Also, one of our customers, Comcast, was fairly adamant about having really low latency for their data processing pipeline, so we had to do some tweaking there to get the best performance available. And these data pipelines typically come with client libraries for different languages and operating systems, so you want the thing to support whatever set of applications, operating systems, and programming languages you need to support — say you're a Java house, you want it to support Java.

This is the traditional data flow model: there are clients at the top of the picture, then some sort of web app or other service that processes the data, and eventually the data ends up in some sort of database.

But the problems usually come along when people start doing this sort of thing: here's this tiny thing where I want to go grab some data from, let's say, an HTTP API, then run it through some Python code so I can filter the data somehow, and then eventually pipe it into psql so it gets inserted into the database I'm actually talking to. Eventually you start getting pictures like this: the picture you just saw was kind of simple, and this is basically the same picture taken six or twelve months later. Typically these things don't get simpler over time; you keep adding these curl and funny one-off scripts until things actually start hurting, and at that point it's going to be very hard to develop new software, because you don't really have any clear-cut interfaces between the systems — they're all accessing one another's data without any constraints. You can have whatever sort of data coming in from whichever direction, and if you're doing it diligently maybe you can still handle it, but it's quite a lot to keep in mind whenever you're touching any component in this diagram.
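
The kind of one-off glue I'm describing often starts as something as small as this (a sketch; the URL, script, and table are made up):

    curl -s https://api.example.com/events \
      | python filter_events.py \
      | psql reporting_db -c "COPY events FROM STDIN WITH (FORMAT csv)"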
So, a word about Apache Kafka. It's an open source project that came out of LinkedIn, and it's basically meant for streaming data. How many of you are aware of what Apache Kafka is? OK, so a fair few of you. It's a top-level Apache project that comes with the same voting procedure on every release as every other Apache project, for whatever that's worth. These days it's used by lots of different companies — the Airbnbs and Comcasts of the world are all using Apache Kafka for a lot of their processing needs. You can basically pick any Fortune 500 firm and they're definitely using Kafka today.

The good thing about Kafka is that it's able to scale to pretty much any given size. But the revolutionary thing about Kafka was the inversion of the usual model: with a traditional message queue you typically had a sender that knew who it wanted to send a message to. Kafka inverted this, so the parties writing messages into Kafka don't know who is eventually going to be reading them. By removing this coupling between the readers and the writers, you allow any kind of new use case to come up after the fact. Say you were previously using something like RabbitMQ and sending a message to a group of services, and then somebody came up with a new use case for analyzing the data — how would you actually do that? You'd probably have to reconfigure the senders to send it to the new place as well. In the case of Kafka, anybody can read any data as long as the access control lists allow it, which allows lots of different new use cases for processing.

This is the ideal Kafka-centric data flow model. It's sort of beautiful — nobody really gets it to be this centric, but the idea is that there's one thing that everybody talks to, so instead of all services talking to each other directly, they'd be using Kafka to do that. This is where people yearn to be; in reality it's still more like the spaghetti thing, but this is what people would like to have.

Basically, you have Kafka, all interactions happen through Kafka with well-defined interfaces, and the data is structured in some sort of message format.

A word about Kafka concepts. In Kafka, when you're writing, you write to a topic. It's fairly analogous to a PostgreSQL table, if you will: it's an entity you can write data to, and it's further split into partitions. You can have a very high number of partitions, and the partitions are the unit of concurrency: if you have, say, five readers, you need at least five partitions so that Kafka can spread the data across all five without them processing the same messages again, and you actually benefit from having five readers.

The other thing about Kafka: it's an immutable log. You can think of it as a log file — you write to it sequentially, you read back from it, and you have an offset telling you where you're reading from. Because it's a sequential, immutable log, it's also really fast to write to; it doesn't have much structure beyond that, it's just an immutable log you can keep writing to at really high speeds.

The thing I was alluding to earlier about the decoupling between producers — the parties doing the writing, in Kafka terms — and consumers: the producers just write to, let's say, a logs topic, and maybe nobody is reading it at that point. Then we come up with a use case — I want to read the logs because I want to copy them to Elasticsearch, for example — and that would be one consumer group. Then, say, six months later, somebody comes up with the idea that they would like to process all the data in the logs topic again, and you just create a new consumer group that reads through the data. You can also seek within a topic by time, so instead of saying "go to offset four, the fourth message in the topic", you can say "please seek to January 1st, 2018" and then start reading and reprocessing all your data from that point onwards.
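
A rough sketch of what consumer groups and seeking by time look like from a client (this is my illustration using the confluent-kafka Python client; the broker address, group name, and topic are made up):

    from datetime import datetime, timezone
    from confluent_kafka import Consumer, TopicPartition

    consumer = Consumer({
        'bootstrap.servers': 'kafka.example.com:9092',
        'group.id': 'logs-to-elasticsearch',  # each consumer group tracks its own offsets
        'auto.offset.reset': 'earliest',
    })

    # translate a wall-clock timestamp into per-partition offsets and start reading there
    ts_ms = int(datetime(2018, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
    offsets = consumer.offsets_for_times([TopicPartition('logs', 0, ts_ms)], timeout=10)
    consumer.assign(offsets)

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        print(msg.offset(), msg.value())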

Some of the benefits of Apache Kafka, and why it's this popular these days: it supports real-time streaming, or something close to it — you can get latencies below ten milliseconds with it. It's not really hard real-time, but it is fast enough for most use cases. You can also scale to billions of messages a day; like I mentioned earlier, Netflix says they're currently pushing a couple of trillion messages a day — two thousand billion messages a day — and they have something like three thousand Kafka nodes, plus I'm sure tons and tons of machines actually doing the processing, the consumers and producers. Basically every time you're browsing Netflix looking for a movie to watch, they're recording where your mouse hovered, over which title, and for how long — are you interested in, I don't know, westerns or sci-fi movies — and they're collecting all this data and approaching a real-time profile of what you're interested in. So this guy is a soap opera fan and only gets that sort of thing suggested in the future, or your kids were watching Netflix, so you get suggestions for cartoons or whatnot.

It also supports rack- and data-center-aware replication out of the box. Typically, in the case of Kafka, the only durability you have for messages comes from replication: if a machine dies, you probably want there to be another replica of the data. So people use a replication factor of two or three or higher, but typically people just pick something like three and go with that.

It also supports things like cross-region replication. We have customers running Kafka in South America and sending all their data to Europe; you can have Kafka clusters around the world, with clusters doing ingestion on one continent and then moving that data somewhere else for later processing. That sort of thing is completely commonplace when you're using Kafka at scale.

Anyway, like I mentioned earlier, the huge paradigm shift here is that this decouples message production from consumption. At the time you're creating and writing a message you don't have to know who's eventually going to process it, so you can come up with a new idea six months down the road and reprocess the messages, because producers and consumers are completely independent from each other.

Client libraries are available in pretty much any language; in some languages they're worse than others. Node has had particularly challenging clients — it has gotten better, but it's still slightly off. Python has fairly good support. Java has the native Kafka consumer and producer libraries, so it has excellent support. And for C there's a thing called librdkafka that is very well supported by Confluent.

Some downsides of using Kafka: it actually relies on ZooKeeper. How many of you have had to maintain ZooKeeper clusters? And how many have enjoyed maintaining a ZooKeeper cluster? OK — the hand back there doesn't count, he's our CTO and he loves that sort of thing. But Kafka depends on ZooKeeper; it's a hard dependency, you can't really run it without it. Then there are things it doesn't take care of: say you keep adding new machines — it doesn't automatically rebalance your data. There are tools that do this for you, but they're extra tools outside of Kafka itself, so if you want to balance load between Kafka broker nodes, you have to have some way of doing it. And, especially historically, Kafka has had its share of stability issues.

It has gotten much better over time, but a couple of years ago, when we started offering it, it was fairly rough around the edges. Some would say it still is, but it's much better than it used to be back then. If you don't want to deal with the hassle, consider using a managed Kafka service like ours or Confluent's; it'll save you a lot of time. You can of course run it yourself, but then you get to keep the ZooKeeper problems to yourself.

Anyway, a word about databases in the pipeline. They usually have a fairly similar set of requirements to the ones I presented for Kafka in the earlier slides: they need to be scalable, you want to be able to rely on them, and you need platform support for the languages, operating systems, and environments you're running in. Postgres is usually a fairly good choice here. It's really robust — it's very hard to break Postgres. You can do it, but it's usually still standing when almost all the other components in your data pipeline have fallen over, so Postgres is fairly reliable. It's also really easy to run arbitrary queries on your data. Say you're pushing the data into something like Cassandra — it's not that easy to run arbitrary queries over the data there. With Postgres you can create arbitrary indexes and do whatever sort of joins between the data easily. And then there's the EXPLAIN support Postgres has: it actually tells you how it's going to run a query, which is very, very useful when you're wondering why your query is slow, because not all the competing databases have anything close to as usable as Postgres's EXPLAIN support.

The downside of using something like Postgres is its limited horizontal scalability: if your data doesn't fit into a single machine, that's usually a problem. Of course, hardware has gotten better over time, so these days it's not uncommon to see five-to-ten-terabyte databases around, which used to be in the realm of fantasy ten years ago — or you needed really custom hardware costing millions. These days ten-terabyte databases are commonplace in different firms.

If you're running something like a Kafka-centric data flow model with Postgres — basically pushing all the data from the application layer into Kafka and then ingesting it into Postgres — you might have a separate OLTP cluster for real real-time needs, and you might have a separate cluster to handle metrics, say time series data that you want to push in. Postgres can cater to a lot of different use cases: it's not just OLTP, it's not just warehousing; you can do lots of different things with lots of different data types with it.

Here's an example of what it would take if we were running InfluxData's Telegraf. For those of you who haven't seen it before, it's basically a metrics collection agent — a way of collecting metrics like CPU or disk usage from a VM and pushing them to an output of your choice. It has a really large selection of outputs you can push data to: it supports Kafka, it supports Postgres, it supports lots of other systems, and in this case it supports InfluxDB, for example. So this allows you to collect the data. This is just an example of a data pipeline where we're collecting metrics — the metrics could be coming directly from something like phones, but in this example of mine the pipeline basically consists of getting data from Telegraf itself, then something that sends the data to Apache Kafka, and then another container, possibly a Kafka Streams application or a Kafka Connect application. For those of you who don't know, Kafka Connect is a Kafka service that lets you ingest data from a source and send it somewhere else. This is really useful if you want to, say, read data from Postgres and send it to Elasticsearch or some other service like Amazon S3 — there's a Kafka Connect connector ready to do that sort of thing out of the box. But in this example of ours we're using Telegraf to do the sending, into InfluxDB.
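
As a rough sketch, the Telegraf side of a pipeline like this could be configured along these lines (the broker address and topic name are made up):

    [[inputs.cpu]]
    [[inputs.disk]]

    [[outputs.kafka]]
      brokers = ["kafka.example.com:9092"]
      topic = "telegraf-metrics"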

A word about InfluxDB in this example — this is actually fairly close to the time series data pipeline we've used within Aiven. InfluxDB is basically based on the Gorilla paper for compression: it compresses time series data really efficiently, and because the compression is so efficient, its disk footprint is really small. It's also fairly fast and it scales fairly well. Unfortunately it doesn't have things like HA: it used to have a rudimentary way of doing high availability, but they removed it from the open source version and put it in the proprietary version that they sell, and that kind of limits its use cases. It also has a bit of an unpredictable memory usage pattern. Say you're doing SELECT * FROM foo LIMIT 1 — it doesn't just return one row as you'd think; it actually materializes the whole of foo into memory and then takes the one row out of it. So if your table foo is, I don't know, a terabyte, you'd better have at least a terabyte of memory to get that one row. These things are not easy to predict, and it has no support whatsoever for things like EXPLAIN — if you wanted to know whether it's going to materialize the whole result set into memory and then return one row, it won't tell you. These are things you just need to know, and there are sharp edges everywhere. It's also switching languages: it has a SQL-like language that it supports now, but it's changing to something different soon, so there's a different barrier to entry again because it's going to be using a different language.

This is another way of doing the same thing: instead of pushing the data from whichever devices over Kafka to InfluxDB, which is a time series database, you can do the same thing using just Postgres. You'd basically have one less component — if you already have Postgres running somewhere, why take on another dependency to manage and maintain if you can just get rid of it? There have historically been multiple ways of storing time series data in Postgres; people have been using pg_partman and that sort of thing to do partitioning, so you can store and ingest the data at great speeds. These days there's also a fairly commonly used extension called TimescaleDB, which basically does Postgres partitioning and also has some time series queries for putting time into buckets and that sort of thing. Basically, Postgres has the same functionality you'd have in a time-series-specific database, in addition to being a really good relational database — you can put time series data there as well. Postgres is really versatile; you can do lots and lots of different things in different domains with the same thing.

Postgres, unlike the InfluxDB I mentioned earlier, has support for HA these days: since 9.0 you've been able to use streaming replication, and it's been fairly easy to set up HA around that. It also has a great ecosystem of tools around it — there are plenty of different graphical user interfaces if you want to run queries against Postgres — and there are tons and tons of people who know how to use it, whom you can hire into your company if you need somebody who knows something about Postgres. I mean, here we are, fifty people at a Postgres Helsinki meetup — how many people would there be at a more niche database vendor's event? Probably just one in Helsinki, for example.

The other nice thing about using something like Postgres is that the command-line tools, for example, are way superior to many other products'. This is a real example I'm showing here: we have this open source backup daemon for Postgres called PGHoard, which compresses and encrypts your backups and puts them into object storage, and we were wondering what the compression ratio is — it supports different compression algorithms, things like Snappy and LZMA. When we were using InfluxDB we actually didn't know that we were already collecting this; it has been in PGHoard for a couple of years now, we've supported collecting the compression ratio for ages. We didn't really know, because in InfluxDB it's really hard to just browse through your metrics and see what sort of data you have there. So we didn't know we already had the data, but when we started pushing the same data to TimescaleDB on Postgres, we just looked at the list of tables in psql and — oh, there's this thing called pghoard compressed size ratio. OK, so now we know what the compression ratios are for the different algorithms across our client base. We were really just surprised that we were already collecting the data we were looking for, but we had no idea, because in InfluxDB browsing your data is really not that simple.

Then, a word about SQL. It was originally invented in the 70s and it's really, really popular. Even the NoSQL databases have pretty much all started supporting SQL in one form or another — MongoDB is still an exception, but a lot of the others, like Cassandra, which started out with some different way of accessing the data, pretty much all have some sort of SQL support these days. A lot of people criticize it, but the skills you develop using SQL are usually transferable to something else. Say you're using Postgres today: some of those skills transfer to Cassandra — not that many, because the data model is different, but at least you can read a query and tell what it's doing, even if it's a Cassandra query. And the skills you develop when learning SQL also transfer to other RDBMSs like Postgres, MySQL, Oracle, and the like.

Also, you can now do SQL over Kafka, which is why we have this slide. When you're storing data in a Kafka topic, there's a fairly new thing called KSQL that allows you to write SQL against your Kafka topic. Say you're writing messages into Kafka that have three fields — path, user ID, and status: you can query those, but you can also do more complex things. If you were storing, say, an integer, you could just sum it up — sum field A to B — and the end result of that processing would be stored in another Kafka topic. The reason this is interesting is that historically, when you've been using Kafka, you've had to write some code that consumes the data from Kafka, does some sort of processing on it, and pushes the result set back. SQL is of course still coding and programming as such, but it has a way lower barrier to entry than having somebody start writing Python or Java or Node or whatnot.
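
A rough sketch of what that looks like in KSQL (the topic and field names follow the clickstream example in this talk, but the exact statements are my own):

    -- register an existing Kafka topic as a stream
    CREATE STREAM clickstream (path VARCHAR, user_id VARCHAR, status INT)
      WITH (KAFKA_TOPIC = 'clickstream', VALUE_FORMAT = 'JSON');

    -- a continuous query whose results are written to a new topic backing the "errors" stream
    CREATE STREAM errors AS
      SELECT path, user_id, status
      FROM clickstream
      WHERE status >= 400;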

This is something you can hand to a lot of other people to run queries with.

Anyway, KSQL — it's early days for it, but it's clearly going places, and it really helps that it supports SQL. You can also do things like windowed queries with it, so more complex SQL too. The queries you run in KSQL have two different modes. One basically returns a subset of the data in real time; this isn't that interesting except when you're browsing through your topic data in Kafka — "give me the first hundred rows in the topic" — which is fine for browsing. But the really interesting thing is when you have a continuous query. Take the query we had on the previous page — path, user ID, status from clickstream where status is above 400. In this example we have a clickstream with HTTP status codes, so 200 OK means a request succeeded, and everything above 400 is some sort of an issue. The stream you create in KSQL processes the data in real time into another topic, so if you want real-time data processing in Kafka, you can just write a SQL statement to do it. It has been possible to do this otherwise, but it is much simpler to do it this way than it has been in the past.

Then there are things like Kafka Connect, which I alluded to earlier. It's basically a framework within Kafka for things that read data from somewhere: there's a source connector for pretty much any database or service, so you could have a source that's a Twitter feed, for example — people are reading Twitter feeds directly into Kafka with Kafka Connect. And then there are sinks, which is where the data gets pushed to. A sink could be something like Amazon S3, if you want to put the data into object storage, or it could push data into Postgres, which is a fairly common thing to do. You can also have sources that read data from Postgres — with something like Debezium, which uses Postgres logical replication to replicate data in real time, as you're writing it into Postgres, and push it into a Kafka topic. These sorts of things are easily possible with tools from the Kafka ecosystem. Anyway, the idea behind a Kafka Connect connector is that instead of everyone writing their own application to do this — you've always been able to write an application that reads from a database and writes it to another place — you don't have to write code yourself anymore. You just set up a configuration file and say: read the data from Postgres, write it to Elasticsearch, or any combination of different services you can imagine.

This is an example of what your clickstream could look like. Say you have clients — phones or whatnot — that make HTTP requests to the application servers that were at the top of the original picture, and those keep writing messages into Kafka. Then, with KSQL, we reprocess the clickstream so we get another topic called errors: at this point we have processed and filtered all the failing HTTP access requests into a separate topic, so now you can read a topic that contains only the errors you're getting from your HTTP servers. If you're doing something like alerting, this would be really useful for seeing that you're always getting a 403 for path X, or whatnot. And you can do all of this just by writing configuration or SQL — you don't have to write any Python, Java, or Node code.

And in the end, for this errors thing, there's a Kafka Connect connector called the JDBC sink, which allows you to insert the data from a Kafka topic into a Postgres table or a Postgres database. So in this example, all the data we processed from the clients — the HTTP errors — ends up in a Postgres cluster as a table, with path, user ID, and status as the columns, so you can query it in Postgres and do whatever you want to do in your Postgres database. And you can do this in real time: say you have a service with hundreds of HTTP web servers producing error logs — you can build these sorts of real-time data pipelines easily.
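
For reference, the kind of Kafka Connect configuration that sets up such a JDBC sink might look roughly like this (the connection details, connector name, and topic are illustrative):

    {
      "name": "errors-to-postgres",
      "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "errors",
        "connection.url": "jdbc:postgresql://pg.example.com:5432/analytics",
        "connection.user": "kafka_connect",
        "connection.password": "secret",
        "auto.create": "true",
        "insert.mode": "insert"
      }
    }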

And all of that without writing a single line of code — well, if you don't count SQL as code; it's still code, but a slightly easier sort of code.

Yeah, so, to summarize this data pipeline thing: Kafka is a fairly powerful component that allows you to have a nicely drawn Kafka-centric architecture. Instead of all servers talking to each other, you have Kafka sitting at the center of the picture and everybody talks to Kafka, and instead of everybody talking directly to, or reading directly from, each other's databases, you have a nice separation between the different components in your system. Postgres is a really robust RDBMS — it has been developed for twenty-plus years and it works really well. These days nobody starting a firm is picking something like Oracle as the database; nobody's going to pay thirty-plus grand for a new firm's database, so people are picking Postgres, and Postgres simply just works. Kafka and Postgres are used in a lot of firms to ingest billions and billions of messages a day. And a word about SQL: SQL is mostly here to stay, so just go ahead and learn it — it will help you later in life.

Any questions?

It depends on the thing. If you're using the Telegraf output that I mentioned instead of Kafka Connect, it actually supports batching — it was written by our CEO to support that; we needed it for a demo at another conference, so we wrote the code for it, and it's open source, so feel free to use it. There are other ways of doing it, but the JDBC connector actually does it row by row, so the example I had is not using batching. There's no real reason why it couldn't do batching there; the Kafka Connect framework doesn't really like doing things in batches — it's more about handling a single message at a time — but you could still make it do things in batches if you worked around that. Everybody has some philosophical point of view on how you should be doing things.

OK, one person who knows how Kafka stores its data. On disk it's, like I said, an immutable log — it's basically stored as-is in a log file. Newer versions of Kafka also keep a time index next to it, so it knows that at three o'clock on Friday, December 13th, you were at offset X; it keeps an offset-to-time mapping on the side. But it's basically an immutable log with almost no structure to it, it has very low overhead, and it's written sequentially, so it's really fast on almost all hardware.

Yes, we do — if you want to answer this one too, go ahead. There is some transaction support, but it's not as comprehensive as if you were using something like... well, a lot of banks are using Kafka — even some of our customers are banks. Basically, in Kafka, when you're committing a message, you can configure it so that you get, let's say, three acknowledgements that the data is on disk before you actually say it's done. There are ways of getting as many acknowledgements as you want, or you can just fire and forget and hope the data is there — it mostly is. That's one option.

Sort Of data into it. Yeah. I mean you don't want to use Postgres for, your file system either I mean you can put blobs into it up. To X gigs but, you it's, not really meant for that file systems are much better at handling files. Kafka, is actually agnostic, towards it's basically binary blobs so you serialize, it in any which way you want so typically, people typically use things like Jason, or are ou or protobufs. Or something. Like that but you get to basically put anything there it's really or whatever, data you wish but. A lot of people like C and Jason for example eternally, with we user Kafka centric architecture, ourselves we, use Jason pretty much exclusively. Yeah. There's a thing called schema, registry, includes a Kafka world which basically says that it's. Basically you're. Sending, topic. And messages. Of type X to a topic, and it has these four fields so two integers and two strings or whatnot and then, basically, when you're reading it back the. Schema registry, you can say that okay it's message version X it should have two integers and two strings in it and then, you can basically get. Validation for the data. Thank. You very much.

