Consolidate Your Technical Debt With Spark Data Sources: Tools + Techniques to Integrate Native Code
Hi. There I'm Doug Carson I'm a customer, solution architect, illuminated, technologies, I'm, in a spark user for, probably. About five years now 2015. Looking. At various optimization. Techniques, and we're now start to look at spark. As a way. Of storing data and also ingesting, data into, our systems as well so, this is a kind of walk through some of the challenges we had to integrate legacy, cords and hopefully, be useful, to look, at spark as having, an alternative data source and what you're trying to do okay. So. Here's. Our agenda first, of all why native code well I think. What Claude's is, gonna give some illustrations, and that everybody has their own unique vault what Claude everybody has their own Scott uniques a, secret. Sauce so this is a we. All do connect you've always got legacy, cords and you've, always got builded, by decisions, about whether it's better to start, again or do you take your old cord and wrap it up. And. One, of the ways to think about it is there, RFC, 1925. So if you're following along google, that and you'll, be in on the joke and why that is but there's actually some wise ones in there 19:25, as well so. In the means that me to the thing. But we'll look at is integrating. Native code and we'll. Look at taking, from C++, air, cords, which is actually the pcap library, it's, an open source library that we're going to wrap a and, turn. It into a spark data source now, to actually bring it into spark itself, there's a few tips, and tricks you need to do to actually load that so, we're gonna have a look at that as well. So. We'll look at the data source, version. Two life cycle, and. Also how you plug into the API to create yourself, a. Fully. Formed independent. Spark. Data source, the. Lever to, actually make this data sources work we'll look at a file example, where we're just reading from a file and ingesting, the data into spark which. Is the i/o access pattern, and we'll, have, a look at some, of the best they're missing in the EP is and how you can walk. And, then lastly we'll look at streaming because that's very often you may have large-scale sensor, data, where. You want, to ingest that data in real time and, spark. Streaming is a great way of getting systems, editor, into your systems and we'll have a look at that and, there's a couple of tricks that we want to look at within, that as well. Okay. Let's have a look at what clued. What. Our systems, do is, look. At a. Cyber. Security aspects. Of mobile, devices, and mobile networks. These are carrier skill networks, with you, know eventually. Millions, of users on board and. We have tools. That will help look at the security that now what can ensure that the network secure, so, clearly there's a lot of data to deal with and. We deploy sensors, within the network that, can look at Internet traffic we can look at interconnect, traffic, and then, we can look at traffic related to the radio side as well so. We can archive. That data on and then a security. Operation center within the telecoms, provider can, look at that data and the, great thing about SPARC is we can slice and dice that data in lots of different ways lots. Of different analytics, for lots of different use cases so for example on, the internet attack surface, you might want to look at, potential. Hacks or, intrusions. On the device you might want to do packet packet. Forensics. To see air. Holder network was intruded. There. Might be critical voice services, during assurance, on that and again we can do some tools that help with that, there. The. Stuff that we can do in the ran attack. Surface which is the radio access so, we can do things like imagine, see geolocation assist, for example, by looking at historical, data within. That and then on the interconnect, and, we can actually look at attacks, that are coming in from. Interconnected. Networks and doing stuff so we actually deal with quite a lot of data obviously that data is all in real time and there's quite a lot of data as well probably about a billion events, so. We need a way of ingesting. That data so. Why. Would we want to do a data, source if we're already doing that sort of thing. Well if, we think about where we want to get to in 5g, networks, we've. Got two little bubbles. There the first one in 2019. Is probably, what you fill me with 5g, know where. The. Access network changes so, the radio is change you, can see it's a very small part of the network moving. Forward five years the core of the networks going to change and that'll be a completely. Different way of how. The networks controlled, and how, you service has around it and unfortunately.
The Reality is, the. Rest of the network is actually fairly old in some cases it goes back 20, years that's. A lot of legacy cords that's a lot tradecraft that were built up over the years and a lot of domain knowledge and we, don't want to throw any of that away, so we need to get away of being. Ready for the new technologies, that coming in 5g, but. Have, all this legacy cords and bring that in and that legacy cords being written in C++, for performance, reasons so. That was kind of driving a decision, when we took a look at spark okay, we need to have a better we're getting deeper into the system and you don't want to lose that. So. Native, code why would you do it well we've said that this. Lot since we've got a builder by decision you can use some so. You might ask why would you want to use some crufty old codes for it and by something like me when you can use some great, in your language and great, middleware, and that that's a valid, point of view you've always got builded by decisions, there's. Some very good reasons why you want to build and what you already have rather than taking new stuff. Within. Now. Within. Any network or any domain. RFC. 2019. 25 gives you some fundamental, truth so that the first one of there says the, problem was already solved. You. Cord. Has to work some, things in networking could never be fully understood until. You've actually been there you've only really achieved, it and. There's. Always the same problem rehashed, over again so. What this means is. That. Your existing Court has value you. Have domain knowledge you've test to depth and you've learned from your mistakes and if. You're gonna change that, cord and have you cord then you're only just gonna revisit. Those problems, again in. A different way. The. Other reason is your cords already optimized, and. It quite succinctly most, efficient thrusts pigs will fly can always throw more CPU at the problem or you. Might have optimized, it's the limit anyway so there's no point in trying to rate it in a different language and expect something else to happen you, can't increase the speed of light is quite a good observation and. Also, it takes a long time to optimize cord right. In the cords only 20% of it there the. Optimisation, and debugging, is usually 80% of that so you can have good fast and cheap okay -.
Usually. You want to get there fast and you want it to be good. And. For. All we ever resources. You're always gonna need a bit more and in, fact even though it's legacy cords, that, the requirements, your domain will increase, anyway so you're, always got to you, know try, and get ahead of the game and do you know better stuff in there so. The, summary of that is Merlot of won't save you yeah. Your, domain will always have new requirements, you'll always have, to take your baseline cords and make it go faster, for more. Subscribers, more, simultaneous connections, whatever. Metric you use and also, remember, that yes you can have as much. CPU. Power as you like and clouds can scale. Fantastically. But, those cloud compute, costs and 0 it's basically a number. Of saints per CPU, per hour so. The less CPU use and the better so optimized, cord is a good thing. And. The. Other one is by. Changing horses you. Can introduce new problems. I already discovered, the ones you already have. So, it's always possible, to move, the problem around you can always add more levels. Of indirection, you. Can always take multiple, problems that are already been solved and create a horrible, interdependent. Solution, and it's, almost harder than you think and there, are no silver bullets in there either. So. Yes things like Kafka, can for example can give you another, level of indirection. But. Make sure you don't make a calf guess who's. Gonna do the checkpointing, who's going to do deduplication, who's done your streaming analytics how, many network operations, of that who's. Doing storage, Kafka. Has a lot of these features and sort of spark and you want to make sure they're good, partition, on that so, I think carefully about who does what and if your code already does some of this stuff that's fair enough, and but. Don't don't. Try and do it in duplication, just use the minimum you want to have. So. The bottom line is that, is, it's in the twelfth rule in RFC, 1925. Perfection. Is reached when there's nothing left to it not, good there's nothing left to add there's, nothing left to take away, so, really, you don't have to do any more work and, you've. You've. Optimized, everything, so that every. Component is doing the minimum that needs to do the minimum viable now. That point there you've. Got rid of all the intuitive into dependencies and and all the other issues so, the way to look at it from a spark perspective. And. This was the summary, is they've, sparked those data storage and analysis your. Cords will do the data collection and the processing, and that's the perfect split there's nothing left to add there's nothing left to take away. Okay. So what we're going to cover here. Well we're going to cover writing. A custom data source that's, going to look at pika, now pcap is the standard way of collecting. Network packets, and, for. For troubleshooting, and analysis on. A network and. Usually. Those libraries, are in C and C++ the, written native code and they're, actually Creed's it's. Such a common format, and a lot of viral heavy that it's actually it's, quite a nuisance deported, to a new language because the libraries they are usually heavily optimized, and. If such a data format is used throughout the industry. So. What we're gonna do first of all is, how. Do we wrap a native code in an auto loading class because I said it's gonna need, stuff so, we're going to create a class called peek at native that's going to basically all. The stuff that's all the natives codes can be wrapped up in there and then we're going to make the code look like Java to, the rest of the system.
The. Next thing we'll do is look at the basics, of a data source how do you what, sort of calls. Do you need in the data source and how do you integrate it into your system and. Also how do you build the data source as well. Next. Thing we're going to look at is I, all access, patterns, so how do you access, the data and how do you marshal it so that spark can understand, it and make it look like a data frame and. Then. We're going to understand, streaming, floor control so when I look, at a streaming. Source, and. Then we'll look at how, the data streams and how we do flow control and that, and, then, finally. Within. Streaming, it's, very important, that you understand, how to do a single tune a or reader because. You can't keep looking, at the data. Over and over again sometimes, you might wanna create perhaps. A user thread that's readin long-lived, users thread that's looking at the data it. Might be listening on a network socket, it. Might open a file and stream. A little bit at a time, so. It's pretty important, that you have a single, to know your video rather than what's, a different one and there's a clever trick that you can use to do that. Okay. First off how do we bridge C++. And Java in a single in a single class there's. A couple of libraries that we can use, for that one is Java C++ which. Is a really. Good library, for wrapping. Your. Code and, so they can access native, code and it creates a hole with the boilerplate underneath. It in C++, they'll wrap that very clever tool and. The, other thing that we want to do is be able to load that in a shared library and, the, clever, thing about the native Lib loader is that our shared library, that holds all a C++, code can. Be loaded automatically, using, the boilerplate code that you see there single line in. A static block will, actually automatically. Bring in the native code from Java. Pretty clever a. Quick. Tip from that on Java C plea Java. CPP. If. You're gonna do marshaling data is better to deal with the raw data types in C++ and, then, Marshall into Java using. A wrapper one so, basically you've got one class, that bridges, everything this is the Java. Native class, you. Can see on the c++, side, anything. That's a native class has its analog in c++ and that does the low level stuff so let's take the open file for example that's, going to read in a path which is a string you can see it matches and. The open file call and the other side, hey. And then. We've, got Ted I read, above a call as well and you can see it's analog on the other side there the. Other. Important, part of this is to build ology, of it and. We. Use maven to do the building and. You can actually bring, in Java. Cpp. As. A command, line on on that that. Will generate all your boil plate codes then. For c++, we've, got an example here of the sort of thing that we are doing which, is using c make. That. Will compile, the boilerplate, code and. Then actually build a shared library and you can see that's going to inject it in the right plant the code then, finally, we do you make to build that that shared library, and then the compilation, precedes, us normal so, by doing those compilations. Ahead of time the. Integration, of pretty seamless, and. We only have to, have. One area where we we, change the chord change the interfaces, which is great.
Ok. Let's consider the data source life cycle, in your chords, what you want to do is basically use the spark api a nice icing, tab. Syntax. Really. Great way and really consistent, so, what you would like in your code is basically and, this is a piece of Python code it. Wants to open that failed called traffic cookie cap and you want to tell it's a pcap, a. Pcap. Data source which we're gonna write customer how's. That happen in the data source lifecycle, well first of all. The. System needs to know that it. Was called pcap, is, as. Your. Data source and that's done, by having. A data source registration. File in the meta interface. And. It. Actually gives the class name of the pcap, data source which is cleverly. Called pcap, data source how. Does it know that it's. Associated with the with the name PC. Ap it's, the the short name function. Caller does that. When. That happens, it creates something called a pcap file reader and then it creates some partitions. Then. You, create as many file, partitions. As you want, depends how much parallelism you want in a job also depends and I'm perhaps, how many executives, you have in the system could, be how many threads you have in the system the, choices yes but. You can also do in the pcap file partition. Is have preferred, locations, you can actually see which nodes you want to send out to as well. Well. The system then does which is quite clever, is. It looks at those their locations. And it. Then serializes, it. Sends. Out close the wire to, their. To. The worker nodes, and. Then it sends the executives, and then it's deserialized. On the other side so, it starts, till actually the same object, and, and, then, it creates a partition, reader on the other side. So. There's a partition, reader any other side and then, that does they get. Basically, there's an API that does the i/o in there and the data send back all the way back so that's the sort of life cycle it's important, to realize that there's a serialization, in the middle of that so. Let's have a look at what. We have to do on the on. A file data source, I've. Mentioned, before that the really the the, key part of this is the file, partition, reader now. Remember, the, file partition, reader is. Created. On the driver side and then serialized. And then. Sent. Over to the executor, and that. Creates a bit of a problem to access, the i/o resources. The, reason is that the constructor, normally you'd put your initial, in the constructor, but the constructor, happens on the driver side so.
We Need to do something that does. The initialization. When. You arrive on the executor, site now unfortunately there's, no API, call to do that it's a bit of an oversight in the spark API but. You can hack around that by. Basically looking. At the. Next. Call, which actually advances. To get the next rule into. The table and just have a little binary, flags that you initialize there so, that the first time you go in there you call in a net routine which we're doing here and you can see it's actually creating, the. Pcap native class that we're going to use to access our data and you can see it calls down into the open file, so. It initializes the. Native. Thing there and then. We. Can go on the next next. Part of it we do the next packet, and then read that in and in. In our case if the lens, plane. 0 then you bail otherwise, you keep on going. What. Happens after that is you know so the next calls every time is there another rule is there another rule to actually, read a rule you get the gate call and you. Need to fill in the blanks basically, so. You need to create a structure called raw which, is a get. Which. Is a generic internal, rule class and then, fill in the blanks in this case column, 0 is is. A string a. Column. 1 is a, timestamp. And. The third column second, column call. Them three if you like is. Is. The link tape ok. So that's how you write a, file. Data source to. Do a streaming. Sauce you, have the same sort of thing, but. We have to do something else in the file reader err SOA. On the stream reader API. So. Let's have a look at what, we need to do in the streaming site. So. In the stream reader, there's. Another couple of calls normally, you just call and, plan, improve partitions. And. On. This on a streaming side there's a couple of extra calls because, every. Time you're, going to read in a partition. You want to advance, things and, make, sure there's been no error of us there's. A bit of floor control going on and. That's done by a couple, of calls one of which is set offset range, which is based on this system. Updating. Your. Streaming. Reader. To, see okay this is as far as a god this is the last thing I've seen and. You need to update those. Internal. Values, as well and, it checks that using, a coupla calls called gate start off saying get on and off, set. Fortunately. With inspark. Those. Classes, already defined, so it can be any structure, you like I've used Long's in this case and. What you also have to do is provide a DC realization, function, because. The communication is happening between the driver and executive, they have to see, realize they've, chosen to use Jason, as a serialization, function, so. That's a templates that you can work with and. It works extremely well. Now. For. That we mentioned before that. Within. A streaming partition. You do not want to be creating, a load of oil resources, every time you read a partition, because. You could be. Doing this every few seconds, and actually. You may want to keep some some context, as they're there, as well so usually, they say you know it is a single dim patron for example there might be a knot your thread you might have a socket opened as well and.
This. Has already been experienced. By the, Kafka plugin, there's. A reference there to what you have to do and you, can actually use a spark infrastructure. For that so. In this case you, actually keep a reference to, the partition, reader inside. The broadcast variable. And. You do that, when. You, when. You create it in the constructor there, which, means that when you come to create, partition, reader you, basically just read that value. There. So that make sure that happens across serialization. And the executor. So. It's created on the driver and then read in the executor, and keeps, that global thing, there so, it, means that it's always the same you know every, time you create a new part which, is exactly what you're looking for. Hey, so. The use cases for that io threads. It might be an i/o descriptor. Might be a piece of hardware for example, you want to keep her a, file. Descriptor or you, know a descriptor, block - or, it might be some sort of middleware sync as well so it's. A very common very. Common pattern that is, wise to follow and you can see that sparks, actually got some fairly good and portable, ways of handling. So. In summary. Perfections. Reached not when there's nothing, - left to adds but when there's nothing left to take away and, that's. What. We've got got, to here and, your. Cords doing IO it's doing high-performance. Data processing. It's got your your, crown jewels your proprietary, algorithms, your secret sauce call. It what you like it's, all within your codes any innovations. Locked within your code as well and. Anything. To do with. Proprietary. Hardware, or that, from matter driver, hardware, is in your code and. You've already solved, the problem so you've got high productivity, so. The ability to wrap that within a. Data. Source is is, fantastic, you. Don't have to revalue things now. What you get from spark is all the other good stuff that. You don't have to retrofit to, your code or use, another framework for so, data storage, you can handle all the different data formats that you'll ever want a.
Data. Deduplication is. Handled for you horizontal. Scaling, is handled through you through partitioning. Data. Format, translations, handled, as well in underlying schema so if you want to move from JSON to CSV. No. Problem. Now. Clearly you've got machine learning that's part of spark and then, you've got dr2 resilience as well in terms of checkpointing, and you've, got documented, and stable API as, SDK. As well so. Once you've complained, to the, the. Data source API. You're. Going to be there for the longer term ok. So thanks. For attention I'm going to translate, to know is, we've. Got a demonstration. We're. Going to show some. Access. To native chords then we're going to bring in a failed demonstration. And then lastly we're going to source streaming them demonstration. Okay. Let's show some Auto, loading and, binding. To native code in action so, got test file here that's gonna actually gonna. Pull. In a class which is going to give us native access, to the pcap access, function, and. We can see got here that the first thing it does is open. The file and. It takes in. The. Name of the file here now. It's analog in, Java. World is this pcap native. Class. And you can see it's got a method in here which is open file and then. Within. C++. Lines we've got an. Interface, here that say gonna talk to. The little level and. Access. Functions, in here and actually open, it and return an integer so. The. Way that we build, all this together is a, a, couple. Of things on the c++, side. We're. Actually going to take. In this the pcap native, file. Here which is the kind of header file. So. That we've got a, well defined interface, into a c++, library and then, we're also going to compile. In linking these two, automatically. Generated. Boilerplate. Code and that's going to be generated by a a, preprocessor. That's going to take in the Java file and Link that one end that'll, produce a library, and Enel producer. A shared. Library that's going to be actually pretty inside the jar file here so to, orchestra all that we've got a maven file and. Then we'll use the the java cpp, tool to do that, so. Let's. Run. That once that's when you. Okay, so that's all built that and. It's generated, the boilerplate chords. And. Now we, can go ahead and actually just run the ad. The compilation of the C++, chord. This, is calling through calling, through all the sub libraries, and, then. Each sub library is gonna create an archive and then the archives all are going to be linked together with a boilerplate. And I would say called on that. Cool. And then lastly we do the link fees and you can see that we've linked in all the Java libraries, automatically. The. Libraries that you, know within there the archive itself. And. We've also linked, into the, pcap, library as well so it has just done everything automatically. What. We can now do is. Compile. All the Java bits as well. And we've, created our own, archive. That a jar jar file has everything in it and we can have a look at that. And, you can see that actually the, shared. Libraries, in here so, that when we run this file it'll automatically, that shared library, and then the coached walk so won't say, see. If it all box, so. Just. Created a little alias here for, the, chords so we'd be star gonna run Java and. Just add that jar file to the class path and then run the native test and supply a test, program to that. So. Automatically. Loaded it it's. Got, all the, and. The package there and we, can see that the parade, 21, frames, and we've got to be recited, it packaged there as well we're just. Sure. There's no cheating here you. Can see that put in the raw file there's 21 frames, and. The first length is 84. And then 300. Needs which. Matches that all, good. Now. They've got a native act, class, which, can access. Underlying. Libraries. And C++, we, can use that within a spark data source and when, we use things in spark we get the fill spark experience, a really. Nice tidy inconsistence. Syntax. So, in this case we're going to open, up the pcap source which is actually just going to load that file and then turn it into spark data frame. So. We've already seen the source code for this and. Underlying, that we're gonna have a spark file partition. That's, going to work, with our uh native, library, class. Here, had, to do all the accesses, and. When we lured the Spartan, beta source is automatically, gonna lured to the, C++, a, shared. Library in there as well. So. Let's. Go in ahead and, run that again, we've got an ileus for it a. Nest. Time of a new spark submit a, bit. Early the jar file has, the data source and the test cords we're. Going to pass in the the. SCTP, pcap. File, and. We, just call a failed test as. A class. Run. That. So, it's brought in the file reader and, it's doing all the way things in spark and. Then if we can have a look at that come through as a spark data frame and, sure.
Enough We've got all the right figures. They are 84 in 308 which remember from a native access is all correct, and. Then further down, should. See that we've read 21, frames, which. Corresponds, to at. The counts that we got back so the data frame is exactly, the same as the stuff. We got from the native library before all, good again. Okay. Lastly we're going to test the spark. Stream here so, we've, got a spark streaming sauce that's going to have P cap data I'm gonna do synthetic data and it's going to send ten, rules. Every. Time we pull, it or. Every time we generate a partition, and we're going to see that we're going to trigger that every, five seconds. We're. Going to store the. Partition, and append it to datastore, in orc file, and. Then, we're also going to show the first line the first five lines of that as well, and. We're going to run this for about 20 seconds. Okay. So let let's run, that. And we can see that there is fired. Up and we're getting the report every time we can see that the number of engine, roars, there was it was ten i'm, you can see the start, and end offset walk-in as well so the floor controls working we. Can also see at the top in the screen that. We've got the batches. Are walking and you can see that the lines, are go in there as well and. Three. Timestamps. Proximately. Correct yeah we've got the dummy, data and then. The, combs there's decrementing. From from Maine dainties ear or so so all good we, have a look at the. See. That we've got some. Some, data has been stored off. Anyhow. So, have created that data store and then. We've. Got a anak file for each partition, I was RIT hold. Good. Thanks. For your attention. What we've covered here is taking, a native library, which in this case with the peek apply every then, we've wrapped it up and so it's been in the spark infrastructure. And then file, source than that so that we can reader, the. Pcap file and then, we've taken that on a stage father-to-be, a streaming one so, I hope you've got an insight into actually. How easy it is to use the spark API for, data sources and. Hope you've been inspired to, go there and try. It yourself and your native cords, so. That you can be able to ingest data directly. Into, into, spark and then analyze, it using all the great techniques, and tools there, so. Again thank you for your time and don't forget to rate and review the sessions, and. Don't. Forget to remember that the. Channels. Are open for any chat as well okay, thanks, for your time. You.