Running Apache Spark Jobs Using Kubernetes


Hi everyone. Today we're going to cover how to run Apache Spark on Kubernetes. I'm Yaron, CTO and founder of Iguazio, and with me is Marcelo, a long-time big data and Spark expert from our team, who is going to show the demo and talk through the details. With that, let's move to the first slide.

First, let's talk about the challenges today. One of the critical elements when you're building machine learning and AI is building a production pipeline, not only working with CSVs and Excel spreadsheets in research. The real problem is deploying it in production. In production you have real data, you have streaming data, you have to do ETL from operational databases, you have to integrate with APIs. Once you bring all that data in, you have to run analytics at scale, not only within a single Jupyter container or notebook; you want to scale out the processing of the data. Once you've done that, you run training and serving, and the whole pipeline around them. Spark is very instrumental in processing data at scale, but at the same time it needs to coexist with other elements of the stack, as we're going to present. As part of that, we want to move to a microservices architecture and use Kubernetes as the baseline for it.

When we look at the entire pipeline for machine learning and AI, we have different steps. There is data ingestion, followed by preparation, which means joining, aggregating, and splitting data and turning it into meaningful features. That is followed by training, where you run various algorithms and training models on the same data set to generate a model. Once you've run the training, you move to a validation step: take some portion of the data, run the model against it, and verify that you get decent accuracy. Finally, you deploy those models, alongside the APIs that come together with them, to serve the incoming feature vectors.

Looking at this entire pipeline, Spark can play a major role in some of those steps, especially around data engineering, data preparation, and analytics at scale. At the same time, there are many other frameworks in the industry, Python packages and libraries like scikit-learn and TensorFlow and others, that may run as microservices. So the overall architecture should be one where your data lake, or data sets, sit on a shared medium (a cluster file system, HDFS, managed databases, whatever), and on top of it you have those microservices. Some of them will be Spark jobs or Spark-based containers, and others will be API functions and model-serving functions, where Spark may not be the best tool for the job. Essentially, we want a single cluster that hosts both types of workloads in one environment.

This is really where something like Kubernetes comes into play: one cluster, with a single framework for scheduling different workloads. Kubernetes is quite generic, unlike Hadoop, which is very specific to big data workloads, and it's also pretty secure.

It also has a very vibrant community, which lets you leverage all the innovation that has been happening there in recent years. So, again, we want to move away from the Hadoop architecture, which has become stagnant and is not really cloud native. What does cloud native mean? Essentially, being able to drive workloads through a more agile architecture, as microservices: if they fail or need to scale, that is handled automatically, using containers rather than virtual machines or the non-isolated workloads you get on top of Hadoop.

So this is the architecture we want to move to. The data layer, instead of a file system like HDFS, with all of its limitations and its heavy resource consumption from all the replicas, becomes a managed storage or database framework that stores data in things like a cloud object store or managed databases. At the scheduling layer, we want to move from a one-trick pony like YARN to something generic that can run any type of workload and has a vibrant community, as we discussed, namely Kubernetes. And instead of running middleware that is very specific to data workloads, we want to run all the different middleware and services on the same cluster, rather than managing those frameworks independently. Then we can focus on running our business logic and not spend so much effort managing clusters.

So what is Spark over Kubernetes, and how does it work? In a minute we're going to see a live demo by Marcelo. The general idea in Kubernetes is that everything is a container, or a pod, which is a set of containers, so we essentially want to go and launch such containers. In Spark we have different types of containers: the workers, or executors, and a driver, which is the thing that launches the job and manages its lifecycle.

The first step is to talk to some service on the cluster; we're going to see in a minute how that's done, because there are different ways of doing it on Kubernetes. That service goes and spawns the driver and the workers, assigns resources such as storage mounts, configuration, and CPU and GPU resources to those workloads, launches them, and also manages the lifecycle of the jobs while they're running. For example, if something fails it will automatically restart the resource, and if there aren't enough resources it will notify us, and so on. So this is what Spark over Kubernetes provides: a mechanism to launch Spark-based containers, the orchestration associated with them (placing the drivers, the workers, and potentially other services for shuffling and so on, on the same cluster), and then management of the job's lifecycle. Now, there are different ways to skin the cat here: three different mechanisms, and Marcelo is going to cover them as well as demonstrate how it works.

Hi, my name is Marcelo Litovsky, I'm a solutions architect with Iguazio. I'm going to continue from what Yaron was talking about: how to run Spark on Kubernetes. There are different modes of running Spark on Kubernetes, but they all require, first of all, that you have a Kubernetes cluster up and running, that you have the right permissions, and that you have the right components deployed to that platform.

The first way of running a Spark job on Kubernetes is where your driver runs outside of where the rest of the Spark cluster is running: the driver runs on a container or a host, but the workers are deployed to the Kubernetes cluster. There are advantages and disadvantages to all three modes, and it all depends on your architecture and how other applications in your environment are running. In the second case, you submit the Spark job and even the driver runs inside the Kubernetes cluster, so you have everything running inside the cluster. You still have to interact with your Spark jobs, and there are several dependencies around that; in the first two modes a lot happens outside of the Kubernetes cluster that you need to be aware of, and you need access in order to monitor the job and understand what's going on. The last option is to run it using the Spark-on-K8s operator. An operator in Kubernetes, in general, carries the full template of resources required to run the type of job you're requesting; in this case it's an operator for Spark. So it knows, when you deploy your application, that it needs to deploy a driver and deploy executors, and it knows, based on your specification, how many replicas of each executor to deploy. Then you can let Kubernetes run that cluster and use the Kubernetes tools to monitor how it's running.

So, like I mentioned, there are three different modes of operation. When you're working with the driver running outside of the cluster, the main thing is that you have to be aware of the communication between your workers, the executors, and your driver.
You also need access to Kubernetes to be able to deploy the resources.
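In the first two modes, that submission typically goes through spark-submit pointed at the Kubernetes API server. Here is a minimal sketch using Spark's built-in Kubernetes support; the API server address, image name, and example jar are placeholders to adapt:

    # Cluster mode: the driver itself is launched as a pod inside Kubernetes.
    # Client mode (--deploy-mode client) keeps the driver on the submitting host instead.
    spark-submit \
      --master k8s://https://<k8s-api-server>:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.container.image=<your-spark-image> \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar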

You have to have what is called a kubeconfig file, which determines who you are (it's kind of like a login to the cluster) and allows you to connect and deploy all the resources. In the first case, like I said, the driver runs on your container or on another host, but the workers run on the Kubernetes platform. So the first two modes are very similar: there are a lot of things outside of the platform that you have to be aware of and manage outside of Kubernetes. The last one is a little bit different: when you deploy a Spark job, the service account associated with that job (I'll show you that in the demo) has the ability to do the full deployment, and once you let the job start, you can interact with Kubernetes using the commands I'll show you in a couple of minutes, to monitor the job, understand whether errors are happening, look at the logs, and reach the Spark UI itself. Those are all components built into the Spark operator for Kubernetes.

So now I'm going to give you a quick demo of doing things with the Spark operator; let me make sure I share the right screen. The first thing, like I mentioned, is that you have to have the Spark operator running on Kubernetes. I already have it installed: if you use Helm, the package manager for Kubernetes, you'll see that I have a version of the Spark operator running in my environment. I can look at any applications that have already executed, so I can see what I've already run in this cluster, and I can deploy new jobs. The way I'll do this: I have the YAML that defines how to deploy the job to Kubernetes, and this is actually part of a GitHub project that we'll share shortly. First of all, I have my Python script; I'm using examples from the Apache Spark project, so this is just a simple PySpark script that does a basic calculation. And I have the YAML that gives the instructions to the Kubernetes cluster on how to run that script.

There are a few things you have to be aware of when you look at this YAML and start building your own. First, the version of the operator you're working with has to match what you're deploying; this is the version of the operator I'm running. Then there is the image that you use. In this case I'm going to run a PySpark script, so this is a Docker image for PySpark; if you're using Scala you're going to need a different image. The image also has to be built so it has all the dependencies that each of the executors will need. You can start, like I did, with the images already provided by the Apache Spark project, but you can build your own based on that image and add your own packages to it. Also, the actual script is not part of the image; I'm going to pick it up from a GitHub project. I could also run it locally.
The caveat there is that you have to be aware that when this container executes, it's running on Kubernetes: it does not know anything about your local file system, so you have to understand how to mount the required file system in order to access a file that exists elsewhere, outside that container. The way this YAML was built, and because I'm running on a single-node Kubernetes cluster, I'm able to mount my /tmp directory inside the container, so anything I put in /tmp will be available to Spark as the application runs. I'm also defining limits on the resources I'm going to consume. If you're used to YARN as a resource manager, you know there's a lot of complexity to managing YARN, managing permissions, and managing resource allocation; Kubernetes makes that a little easier. As part of the spec of your application you can define how many cores and how much memory you need, and if your process needs GPUs you can allocate GPUs to the container as well, as long as your Kubernetes admin allows you to request those resources; there is some configuration on the Kubernetes side that limits what you can access. So I have a definition of what the driver and the executors (I've been calling them workers, but they're executors) are going to request, and the file system I'm going to mount; in this case it's a local file system, which is a bad practice for production, but I'm mounting it just for testing. A minimal sketch of this kind of spec is shown below.
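As a rough illustration of the kind of spec being described here (not the exact file from the demo), a minimal SparkApplication for the spark-on-k8s operator might look like the following; the image tags, versions, and paths are placeholders, and the field names follow the operator's v1beta2 API:

    apiVersion: sparkoperator.k8s.io/v1beta2    # must match the installed operator version
    kind: SparkApplication
    metadata:
      name: pyspark-pi
    spec:
      type: Python
      pythonVersion: "3"
      mode: cluster
      image: gcr.io/spark-operator/spark-py:v3.0.0    # PySpark image; use a JVM image for Scala/Java jobs
      mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py   # or an https:// URL to a script in a repo
      sparkVersion: "3.0.0"
      driver:
        cores: 1
        memory: "512m"
        serviceAccount: spark            # service account allowed to create executor pods
        volumeMounts:
          - name: host-tmp
            mountPath: /tmp
      executor:
        instances: 1
        cores: 1
        memory: "512m"
        volumeMounts:
          - name: host-tmp
            mountPath: /tmp
      volumes:
        - name: host-tmp
          hostPath:
            path: /tmp                   # single-node/testing only, as noted above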

Then, at the top of the YAML, I also specify the script I'm going to execute, which in this case comes from a Git repository. Just to show you, I also made another YAML, altered slightly, to pick up the Python file locally from the /tmp directory instead of from Git. I'm able to do this because I mounted that file system inside the container: whatever is in the /tmp directory on my node is going to be available, so I can execute that application as well.

To run the application, I have the YAML, and what you have to do is tell Kubernetes to deploy the application. I've run this application in advance, so this is going to end very quickly: it's deployed to the cluster and it's now running in the background. Obviously, you need a way to interact with the cluster to know what the job is doing and how it's operating. The first thing is what I showed you earlier: you can list the Spark applications, so the one I just executed should show up here, or it's probably still deploying. Then I can look specifically at that application and at the pods it's running, and you can see my driver and my worker; I had one driver and one worker, and the worker should be coming online shortly. So those are running.

Now, as you start doing development you'll likely run into errors: some things won't be available, some things will be missing, so you want to be able to interact with the application and understand what's going on. You can use the kubectl logs command to look at the logs of the driver, so you can see exactly what happened; if there was an exception or an error you'll see it here, and you can also see the output of the script. I believe the one I just triggered is still executing; if your job is still executing you can add the -f flag, which tails the execution log, but I guess by the time I did that it had already finished. As you look through the output you should see the result of the calculation. It is a lot of output, some of it debug information from the application, and I keep missing the result because it's only a single line, so I'm going to do something much easier, which is grep for it. So this is basically the output of my script. The version I have for the local file system gave me a more interesting output, but I ran the one that is on GitHub, so it has a slightly different layout of the output.

Just to do a recap of how to look at your application and how to troubleshoot, and I'll close the demo with that: kubectl get sparkapplications gives you a list of all the applications that are running; this is the application I executed earlier. If you want to look at the pods based on an application name, you can use kubectl get pods, and here I'm just grepping for the application I'm running. And the last thing, most importantly, is to be able to look at the logs so you can see the output.
Now, when you are done and you want to clean up, the last thing to do, if you want to keep your cluster clean and not have all those applications laying around, is to actually delete the application. Before, with the first command, I did an apply: we took the YAML, created the application, and ran the code. Now I do a delete, and my application no longer exists; when I run get sparkapplications again, the application is no longer in the cluster. So this is how you handle running a Spark-on-Kubernetes job using the YAML and the kubectl commands (the full command-line flow is recapped below). This requires a good amount of understanding of how Kubernetes works: you have to make sure your YAML is formatted properly for the version of the Spark operator, you have to make sure the location of your Python or Scala script is available to the containers when they execute, and that means you also have to make sure you can correctly mount whatever file system resources you need at runtime. And with that, that was the end of the demo; I'll stop sharing my screen and we can go to the next slide. I do have a repo with the demo, so you can reproduce this on your desktop with Docker Desktop for Windows or for Mac. I did this repo last year for our MLOps conference and it probably needs a little tuning; before you watch this video I'll probably go in and make the necessary changes so you can execute it on your laptop as well.
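Put together, the command-line flow walked through above looks roughly like this; the application name, YAML file name, and pod name are placeholders that follow the operator's default naming:

    # Check that the spark-on-k8s operator is installed (Helm-managed here)
    helm ls

    # Deploy the SparkApplication described by the YAML
    kubectl apply -f spark-pi.yaml

    # List Spark applications and their pods
    kubectl get sparkapplications
    kubectl get pods | grep pyspark-pi

    # Inspect the driver log (add -f to tail it while the job is running)
    kubectl logs pyspark-pi-driver
    kubectl logs pyspark-pi-driver | grep -i "pi is roughly"

    # Clean up when done
    kubectl delete -f spark-pi.yaml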

Okay, thank you Marcelo for a great demo. Let's continue and see how we can potentially improve the experience with some additional tools, because, as you saw, some of the DevOps challenges still exist. We need to configure YAMLs, which is not so trivial for someone who just wants to launch a job, and we need to configure resources, security attributes, volume mounts, package dependencies, and so on, so there is still a bunch of things to address. There's also scaling and scheduling of the workloads in a more automated fashion, and the question of how we integrate with other frameworks and how we embed various packages into the image we want to launch, which means building images every time we launch. So there are still a lot of challenges that force us to spend a lot of DevOps resources, and this is where we're trying to improve the experience with some extra functionality that we've added here at Iguazio.

The first idea: how do you automate DevOps today? The most automated form of DevOps today is the concept of serverless functions. If you go to Amazon, for example, you have Lambda functions: you write some code, you click, and now you have a function that runs in production and responds to HTTP requests. No one needs to build containers, no one writes any YAMLs like Marcelo showed us; you just write code and deploy it. The key challenge is that while serverless is very friendly and nice (there's resource elasticity, you don't pay for what you don't use, and you automate the entire lifecycle of development and operations), it's still not a good fit for data-intensive and machine learning workloads, because those workloads have different attributes than serverless functions. If we examine existing serverless technologies today, they have a very short lifespan: after five minutes your Lambda crashes, while you may have training jobs that run for 30 minutes. If you want to scale your workload, serverless technologies usually do it with an API gateway and a load balancer that throws different requests at different containers, while Spark and other technologies like Dask or Horovod and other machine learning and data analytics frameworks use different strategies, like shuffle, reduce, and RDDs in the case of Spark, or hyperparameter tuning in some other frameworks. So we need a different approach to scaling. And in most cases the workload is stateful, because we're processing data: serverless functions are usually stateless and attach to resources through HTTP or S3 calls, whereas here you want to access data from your function in order to process and analyze it, not necessarily through an HTTP hook.

The last thing is that when you're running a job, you're not passing a JSON event as the request, the way you usually do with serverless functions that are RESTful. What you typically pass is a set of parameters for your job, and data sets: this is the input data and this is the query I want to run on it, or this is the input data and this is the model I want to train with these parameters. So the input is different. With that, we understand that we cannot just use a serverless technology that was invented for event-driven or real-time workloads, like Lambda, or, in our case, the open source project called Nuclio, which we maintain. We need a different story: we need to extend serverless while still maintaining its inherent benefits (elastic scaling, not paying for things we don't use, and automating the DevOps work), but apply it to the problem at hand. So we essentially want to create, think of it as, a serverless Spark, or a serverless Dask, or a serverless Horovod. And this is really the idea behind Nuclio functions in combination with the framework called MLRun, which automates this work.

We've seen that within Kubernetes there are things called CRDs, or operators. They know how to take code plus a spec, usually described through a YAML file, and from those two things run a job until it finishes. So we understand that there is your code, or a service that runs something, and there is what we call a runtime, which is this operator. We can have different runtimes: one for Spark, one for regular jobs, one for Dask if you're more in the Python camp, Horovod for deep learning, Nuclio for real-time functions, and so on; these are different types of resources that embody the code. So we want to create a function that comprises those two things, the specification and the resources, along with the code, and abstract it in a certain way. Then we want to define the contract between those functions and the user. What does the user want to do? They want to pass data, parameters for the job, and potentially secrets and credentials, because if my function is going to access some external database I may need the password, and I don't want to pass the password in clear text, so I need a mechanism for passing credentials into my code. Once the function finishes executing, it generates outputs: results, for example the accuracy of my model; operational data like logs, telemetry, monitoring, and CPU usage; and new data sets. If I'm running analytics, I have a source data set and then I have the analyzed data set with the join results, or whatever; the same goes for models, where I run training on some data and generate a model. So what we want to define is a box, a virtual box, where we throw in parameters, data, and secrets, and we get back results, logs, and data artifacts, through a simple user interface. We also want to be able to define those functions as part of a bigger story, a pipeline. For example, we may have a function that ingests data, another function that prepares the data, a third function that runs training on the prepared data, a fourth that does validation, and so on.
So we need a simple way to plug those functions into such a generic architecture, and this is really where we invented the concept called MLRun, which automates all of these things: no YAMLs, you focus only on simple code, and everything else is fully automated. For example, I can write my function inside the notebook, and we're going to see a demo of that in a minute with Marcelo. I can take some code and define a function; the function includes the code, but it can also include definitions of resources like CPUs, or jars, and things like that, so I essentially create a Python object that encompasses all the configuration. This is what we call the function: the configuration alongside the code.

The code itself is referenced by a pointer to where I'm storing the file, and I can store that file in various places. We can also use different execution engines. As Marcelo showed us, there are three ways of using the cluster, and each has slight advantages and disadvantages. I may want to run Spark locally within my Jupyter notebook, or against a live cluster that is always up and running and is not decommissioned when I'm not using it. At the same time, when we run exactly the same job against a cluster that is built ad hoc for my job, I can customize the image, the Spark version I'm going to use, and all the packages or jars I need for that job, which gives me a lot more power, and the cluster is only used while I'm running the job, so it saves resources. On the other hand, there are downsides, like the time it takes to spawn such a cluster, so for very short tasks we may not want to do that. With this mechanism I can run exactly the same code without modifying it much: the same code locally inside my notebook, and, with a slight configuration change, the same code on a cluster in a distributed fashion. We're going to see that in the demo in a minute.

The other point I mentioned is that we want to compose a workflow from all those different function instances, and the best solution on Kubernetes for running a workflow is something called Kubeflow (meaning Kubernetes and workflow), specifically Kubeflow Pipelines, an open source project initiated originally by Google along with other companies. You essentially plug different steps into a pipeline and say: the first step is getting the data, the second is preparing, then training, and so on. We can use this tool, which is great and also knows how to record information about artifacts, logs, and other things, and we couple it with another tool called MLRun, developed by Iguazio and open source, which lets me track everything in a very easy way, as we're going to see in a minute: no YAMLs, no kubectl commands, everything through a UI or notebooks. Introducing serverless into a pipeline allows a very simple composition of such a workflow, because every step in the pipeline can be a self-managed function with its own lifecycle, versioning, images, and definitions, which we define only once, store, and then pull in whenever we need to use it. Another benefit is that a single workflow can combine different kinds of jobs: some of them are simple Python jobs, some are Go or Java code, some are Spark jobs, some are maybe TensorFlow serving services. We can combine all those different things into a single pipeline, unlike solutions like Hadoop, where we're confined to a set of microservices for big data.

One example of a solution that deploys this technology in production is a company called Payoneer, one of our customers, and there is a public use case around what they've done. They're a unicorn in the payment services space, and they originally had Hadoop clusters and wanted to achieve two things.
The first is to move from fraud detection to fraud prevention, which is very critical for their business, because if they can prevent fraud it impacts their bottom line: they can make more money and take on riskier propositions. The second thing they wanted to solve is cutting the time to production, so that instead of very long release cycles that consume a huge amount of resources, they can just update things and launch them into production in a more self-service CI/CD fashion.

That agile mechanism is enabled through the microservices architecture. The first solution they had, based on Hadoop, was too slow and very cumbersome. Take a typical ETL workload: you take data from SQL, you run ETL using Spark, you run some aggregations and processing on it, and then they used an R server to predict. This entire process, from the ETL job to prediction and blocking the customer transaction for fraud, was taking 40 minutes. Forty minutes is a pretty long time, and it allows someone to steal a lot of money if they want to, so that's not an ideal solution. On top of that, managing Hadoop and Spark in a siloed cluster is very resource intensive, as many of you know, and it takes a lot of time to productize a new application or a new version of that application.

So they wanted to move to this architecture, which is entirely serverless and fully automated. They achieved two things: one, everything becomes real time, and two, DevOps and software deployment are fully automated, so they can deploy a lot more versions of the software, actually on a weekly basis. Every time they want to change the model or change the logic, it's very easy to do, and at the same time they managed to get real-time performance with this microservices architecture. For example, instead of doing ETL, they just ingest the data through CDC into a stream, using RabbitMQ. The first stage uses Nuclio functions, serverless functions that have native support for the RabbitMQ protocol and are real time, so they can ingest the data, crunch it immediately, and write it to a database in a slightly cleaned-up form. The second step is a Spark function, which runs as a job that periodically processes all the data that was ingested, combining the real-time data with some batch data. The third step uses model training functions, which are simple Python functions, with scikit-learn in this case. Those functions read from the database or the feature store the data that was generated by Spark or by the real-time ingestion function, build models out of it, and store the models back into the database or the system. Then there are model inferencing functions that listen on the live stream of events and, based on the enriched data generated by Spark and the real-time data generated by the ingestion function, make predictions. If they suspect that a transaction for a specific user is fraud, they fire an event immediately into the database and change the user's state to blocked. This entire process takes about 12 seconds from the minute we got the event, through the real-time analytics pipeline, until we actually block the account, measured from the first indication of fraud.

So the user will most likely not be able to steal money, because we almost instantly block any further transactions they try to make on the system. The other aspect, as I mentioned, is that because it's serverless with automated DevOps, you can do releases as often as you want: everything is rolling upgrades without even taking the system down, and all the monitoring, logging, and telemetry is built into the solution, so you don't need to develop anything to achieve that. So with that, let's move to the second demo, showing more of a serverless architecture for Spark and Kubeflow.

Right, so that gives you an idea of all the other elements that have to exist to be able to incorporate this as part of a full ML application. We also talked about the complexity of doing everything with YAML and kubectl, and about how we can instead use MLRun, a component that lets you encapsulate all the resources you need to run your PySpark application without having to worry about the YAML, the images, and everything else that goes with that infrastructure. So I took the same code that I ran from the command line using the basic Spark operator for Kubernetes, and I added some MLRun pieces to be able to keep track of what I'm doing. I'm basically logging things into MLRun, and you'll see what that does when I execute it. When you start wanting to keep track of your experiments, it's good practice to log artifacts and log results, so you can keep track of everything you execute. In essence this is the same script, except that instead of printing the output to standard output, I'm saving it to an artifacts database, and I'll come back to that in a few minutes.

Now, as a developer you're working in Jupyter notebooks, and you want to test the full extent of your application, from running it in your notebook to running it at scale on Kubernetes. When you run in your notebook, the notebook is a single container, and it has limits on how many CPUs and how much memory you have allocated; you might need GPUs and not have GPUs, so there are constraints. How do you go beyond that environment to execute your PySpark code? MLRun gives you the ability to use the MLRun API to facilitate the communication with the Kubernetes cluster, so your script can run seamlessly on the Kubernetes cluster without you having to leave your notebook. I have a few definitions, similar to what you saw in the YAML, but now I'm expressing the specification in a language I'm more comfortable with: I define where my script is, how much memory and CPU I'm going to use, and how many replicas of the executors there will be. I also pick up the dependencies, the jars I need to run my code, leveraging what I already have in the context of my Jupyter notebook. Then, instead of having to write the YAML, I use MLRun to define the few things I need in order to execute this: first of all, the kind is a Spark job, so this is going to run as a Spark job; it's going to use this image as the base image for the execution; and I provide the location of the script.
As you can see, the script location is local to the image, so I have to know that it's going to be mounted in the location I expect it to be mounted at, and once you start learning Kubernetes you see that there are multiple ways of doing that. MLRun gives you a few tools to handle it without having to worry about that complexity. The first is the ability to mount the Iguazio file system with a simple command; we also have a function to mount a persistent volume claim, so you can mount a volume that lives outside of Iguazio. A rough sketch of this kind of function definition is shown below.
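As an illustration of the pattern being described here (not the exact notebook code from the demo), defining and running a Spark-kind MLRun function might look roughly like this. The resource helper names, the v3io mount, the image, the paths, and the parameters are assumptions based on the MLRun Spark-operator runtime, so check them against the MLRun documentation:

    import mlrun
    from mlrun import mount_v3io

    # An MLRun function backed by the Spark operator: the code plus its spec in one Python object
    fn = mlrun.new_function(
        name="spark-pi",
        kind="spark",                       # run via the spark-on-k8s operator
        command="/User/code/spark_pi.py",   # script path as seen inside the container (assumed mount point)
        image="<your-spark-py-image>",      # placeholder base image
    )

    # Mount the shared file system so the container sees the same paths as the notebook
    fn.apply(mount_v3io())

    # Driver/executor resources and replicas, analogous to the YAML's driver and executor sections
    # (method names assumed from the MLRun Spark runtime)
    fn.with_driver_requests(cpu=1, mem="512m")
    fn.with_executor_requests(cpu=1, mem="512m")
    fn.spec.replicas = 2

    # Launch on the cluster without leaving the notebook; params follow the "function contract" above.
    # Inside the script, mlrun.get_or_create_ctx() and context.log_result() record outputs instead of print().
    run = fn.run(params={"partitions": 10}, artifact_path="/User/artifacts")
    print(run.outputs)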

Mounting this way guarantees that you have access to the file system in the same way you have access to it in your notebook, so the container environment will have a very similar path to everything you're running. There are a few things specified here that I needed in order to run within our environment, and then I define the limits, the requests, and the number of replicas, the same way I did in the YAML, but, like I said, in this case we're using constructs a developer is used to, things that are friendly to a developer. And I think another point to add here, Marcelo, is that the function definition is reusable across jobs, so I don't have to repeat the definition of the function, its replicas, and its CPUs; I define it only once and then execute it multiple times. Yeah, this object becomes part of that, and I'll show that capability in a few moments, further down in this notebook.

So I define this and I run it. The interaction with the execution stays within my notebook, but it's building the Spark cluster on Kubernetes and running my code there, and you'll see the same output we saw when we ran it using the YAML. Somewhere around here I'll see that Pi result, and I keep missing it; there is a lot of output on these jobs. There you go: it's formatted a little differently here, so it stands out. The other thing is that, because I ran it within this framework, if you look at the bottom of the job you get a listing of any inputs, outputs, and artifacts I recorded during the execution. I also have a link to look at the history of the execution of this job and any artifacts associated with it: MLRun provides a UI that gives you access to the execution of that job. You can see the logs there as well, and there is a version for each execution of your job, so as you run an experiment, if you're recording results and information about your runs, you can look at the different versions of the job and the different parameters you used. So by just wrapping that PySpark script with the MLRun library, you're able to keep track of all these experiments, and, more importantly, as a developer I'm not constrained to the resources I have in my notebook: I can leverage the Kubernetes cluster seamlessly, without having to worry about Docker containers or about YAML I'd have to write to execute the job.

Now, I ran that as an independent job, but usually it's not going to be a single job and that's it: there's going to be a pipeline, a sequence of steps to follow. Spark might be doing the data processing, but you might have a TensorFlow job that is going to do the training of the model. So we can take the exact same function (as Yaron mentioned earlier, I do not have to redefine how it's going to be executed), use the Kubeflow DSL to define the pipeline, take the same function as defined above, and incorporate it into the pipeline. I can run the pipeline from my notebook, and at the end of the run I get an experiment link that takes me to Kubeflow; we have Kubeflow integrated with Iguazio. There you'll see the execution of that Spark step as part of a pipeline, and you'll be able to see the logs of the execution as well.
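A rough sketch of how that pipeline definition can look with the Kubeflow Pipelines DSL, reusing the same MLRun function object from the previous sketch; the step name, parameters, and the as_step helper reflect the MLRun/KFP integration as I recall it, so verify them against the current docs:

    import mlrun
    from kfp import dsl

    # Same Spark-kind function object as in the previous sketch (placeholders as before)
    fn = mlrun.new_function(name="spark-pi", kind="spark",
                            command="/User/code/spark_pi.py",
                            image="<your-spark-py-image>")

    @dsl.pipeline(name="spark-demo-pipeline",
                  description="a single Spark step; prep, training, and serving steps would chain after it")
    def spark_pipeline(partitions: int = 10):
        # Turn the MLRun function into a pipeline step; more steps (e.g. a TensorFlow
        # training job or a serving deployment) can be added and chained the same way.
        spark_step = fn.as_step(name="spark-pi", params={"partitions": partitions})

    # Submitting the pipeline from the notebook is typically done with MLRun/KFP helpers
    # (e.g. mlrun's run_pipeline or a kfp.Client), which return the experiment link shown in the demo.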
Now, obviously this is a single step, and there are more complex pipelines. I'll show you another example of a pipeline with the typical processing of a machine learning pipeline: getting the data, training the models, providing a summary of the data as you consume it, as well as deployment as part of the pipeline. If your data prep step is a Spark job, you can run it as part of the same pipeline, mixing steps that run as Spark-on-Kubernetes jobs with TensorFlow and with other components of your pipeline that are developed by different teams within the application environment. The key thing for me as a developer is the ability to do all of this using something that is very familiar to me and that puts together the full machine learning pipeline: we're leveraging Kubernetes, we're leveraging Spark on Kubernetes, we're using Kubeflow to build pipelines, and, as Yaron mentioned earlier, we're using

serverless components to be able to do inference in real time. And with that, I'll end my part of the demo.

Thank you, Marcelo, that was a great presentation and a great demo. I think what we've seen through Marcelo is that, on the one hand, Kubernetes is great, as we mentioned before: it provides a single cluster where you can run all your different workloads, all the modern workloads side by side with traditional big data applications; Spark, Presto, and Hive can all run on Kubernetes. The challenge with running Spark on Kubernetes is that there's a lot of manual work and DevOps work associated with it: messing with the YAMLs, kubectl commands, and so on. What we've seen is that, with the serverless approach we've introduced with MLRun, you get full automation of launching your job from within the notebooks, together with the various configurations. Those functions, by the way, are also stored in a database, so I can always pull them in later and reuse them, or plug other functions from the database into a new pipeline that I've created. You have all the logging of your outputs presented directly to you in the Jupyter notebook, so you don't need to wander around with CLI commands, and it's also recorded in this tool called MLRun, so I can go back later and see what was running, what the Git commit version of the job was, what the results were, and so on, for everything that happened. I can also launch jobs from a UI, and all of that makes my life much easier.

So with that, and again thank you Marcelo for the demo, anyone who wants to learn more about these technologies, about how to run Spark on Kubernetes or how to further automate it using serverless technologies, just ping me or Marcelo. My Twitter handle is my full name, Yaron Haviv, and you can look me up on LinkedIn or in other places. Marcelo is also very happy to help people in those areas.

Thank you.

2020-08-01
