Serverless and Open-Source Machine Learning at Sling Media (Cloud Next '19)
My name is Robin Warner. I'm a customer engineer at Google Cloud, and I've had the privilege of working with Sling TV over the last year on some really exciting things they're doing to transform their workloads. You heard this morning that a big focus in Google Cloud is making it easy to deploy and making sure you're not locked in. That's the promise of serverless computing: workloads with minimized operations. And open-source tools, or managed open-source tools, let you run the workloads you want while still having the power of the open-source community behind them. I've watched this journey over the last nine months or so as they took existing machine learning workloads, algorithms that had been built previously, moved them into GCP, and learned a lot along the way; we all love learning things along the way. Austin Bennett is a senior data engineer with Sling Media, and Sam Ozzie is a data scientist, also with Sling Media, and they're going to give us a glimpse into the journey they've had over the last nine months. So, Austin, kick it off.

All right, cool. So far, so good. I think Robin just covered the intro, so I'll walk you through the agenda to start: why we're working with GCP at all; a bit about our data and an overview of the sorts of things we're addressing; how we took things from a local laptop, to virtual machines, to Cloud Machine Learning Engine. No machine learning project can work well without the data being where you need it, so some thoughts on how to move data around and migrate it; a bit on pre-processing with Dataflow and Beam; and then digging into how to automate prediction using Cloud Composer (hosted Airflow, in the open-source sense).
For our motivation: we used to be an analytics group, just doing reporting for things around Sling TV, but the goal wound up being that we needed to support machine learning workflows, for a bunch of reasons. The current tools, managed database services, really aren't great for machine learning workflows, so we needed to change our tooling and infrastructure. One trade-off that's needed in a lot of places is data warehouse versus data lake. A data warehouse is really just a SQL engine; it could be MySQL or whatever store, or some MPP system. It's super familiar to many people, and updates are straightforward. In a data lake, by contrast, you work with machine learning flows by just reading the data off some form of blob storage, which is super nice for Spark ML and things like that. So it's a question of where to use which system, so you're not constantly trying to unload your data out of a database. Also, as a bit of our group's origin story: they were nice enough to send me to the KDD conference in 2017 to keep track of research in the area. You may have come across the TFX paper there, about TensorFlow and a really nice way to productionize machine learning. So I was familiar with that and eager to follow the developments, and the team has been attentive to that. Open source is super cool. I don't like managed services where what's going on under the hood isn't apparent, because I want to be able to really dig into the code and understand what's going on. So here's a whole lot of things that we either had been using or that made sense to jump into, plus a quick shout-out to these other things we're trying to get through the whole infrastructure and pipeline, down to even the more opsy sorts of things. What really kicked off this process was Cloud Next last year, where you all are now. We came as a group and saw all the great things other companies were doing, so it made sense to really dig in and figure these things out. Sam is going to share a bit on our data background.

Hi everyone, thanks for joining our session. I'm going to start by talking about the data and background, and the many iterations we had to take to get to the end-to-end machine learning pipeline we have today. How many of you know what Sling TV is? That's so many! For those of you who don't, Sling TV is an online TV platform where users can sign on and watch live TV and on-demand content. Like so many other subscription services, we deal with customer churn, and we want to identify customers before they churn so we can do something about it. For this binary classification problem we can look at many different features: for example, what users watched, what channel they watched, how long they watched for, what day of the week, and the time slot. That's just an example; we can look at millions of attributes. So for this churn prediction problem, let's assume we look at six days of activity and want a prediction at day 30. For this one user: they subscribe on the first day and watch ESPN for
1,500 seconds, and it's a Monday night. They come back on the second day and watch Food Network on a Tuesday, 6 to 8 p.m. They don't watch anything on days three and four, then come back and watch some Travel Channel on a Friday morning. For many non-deep-learning models, this data needs to be aggregated per user, so for that one user you'd have the total number of seconds watched, the most-watched channel, the total number of sessions, whether it was mostly weekday or weekend, and the most common time slot. However, this kind of model disregards the daily pattern of user activity. There are studies online, papers that use convolutional neural networks for churn prediction, and they use the daily activity patterns of users. The architecture for this kind of model has multiple convolutional layers followed by max pooling, then a fully connected layer, and then a final softmax or sigmoid output for the classification. In order to use this kind of model, we need to distinguish between days of activity and inactivity, so for the days the user was inactive we insert zeros. For example, for days three, four, and six we insert zeros for the numerical values and encode the categoricals. Then, before feeding this data into the model, it needs to be encoded and scaled: numerical features scaled between 0 and 1, and categorical features encoded. After trying that, we were able to improve our model metric by 15% compared to our baseline non-deep-learning models. For that one sample, one user with six days of activity and five features, it's very simple to pre-process and train the model. But with a more granular time resolution, for example minutes or hours, and millions of users with thousands of features, this problem gets very challenging really quickly.

So now I'm going to talk about how I did the pre-processing in each iteration. First I started small, with the data warehouse. I moved the data to a virtual machine, then pre-processed and saved it on the virtual machine using pandas and NumPy; I'll get into the details in the upcoming slides. I trained the model, a Keras model, on the virtual machine, and ran the predictions on the virtual machine too. The only thing is that I needed to pre-process the prediction data the same way I pre-processed the training data.
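The zero-insertion and scaling steps described above can be sketched in pandas and NumPy. This is a minimal illustration under assumed column names and the six-day window from the example, not Sling's actual code:

```python
import pandas as pd

# Hypothetical sparse activity log: one row per active day per user.
activity = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "day": [1, 2, 5],                      # days 3, 4, 6 are inactive
    "seconds_watched": [1500.0, 3600.0, 900.0],
    "channel": ["ESPN", "Food Network", "Travel Channel"],
})

WINDOW = 6  # look at the first six days of activity

def densify(df: pd.DataFrame, window: int = WINDOW) -> pd.DataFrame:
    """Insert zero rows for inactive days so every user has `window` rows."""
    full_index = pd.MultiIndex.from_product(
        [df["user_id"].unique(), range(1, window + 1)],
        names=["user_id", "day"],
    )
    dense = df.set_index(["user_id", "day"]).reindex(full_index)
    dense["seconds_watched"] = dense["seconds_watched"].fillna(0.0)
    dense["channel"] = dense["channel"].fillna("NONE")  # placeholder category
    return dense.reset_index()

dense = densify(activity)
# Scale numerics into [0, 1] and integer-encode the categorical.
dense["seconds_scaled"] = dense["seconds_watched"] / dense["seconds_watched"].max()
dense["channel_code"] = dense["channel"].astype("category").cat.codes
```

At prediction time the same densify step has to run with the same window, which is exactly the "pre-process the prediction data the same way" burden mentioned above.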
Then I saved the predictions back to the original database. We realized quickly that that setup was not good for production; it's not very scalable, and we wanted to move toward a serverless, scalable system. So we started changing our tools and moving into GCP. First I started with the original database, moved the data to Google Cloud Storage, and did the pre-processing in Datalab using the same functions and tools. I saved the pre-processed data to Cloud Storage, did the training with Keras on CMLE (Cloud Machine Learning Engine), and did the predictions on a virtual machine. I did it that way because last year Keras wasn't compatible with the prediction API, but I think today things are different. After the predictions on the virtual machine, I saved them to the original database, and again I needed to do the pre-processing on the prediction data.

Here's some detail on the code I used for the first two iterations. First, I read the unloaded CSV file into a pandas DataFrame. Then I scaled and encoded the data, saved it to a NumPy file while inserting zeros in the format I showed earlier, and saved the NumPy and pickle files. I did the encoding with LabelEncoder and StandardScaler from scikit-learn and pickled those fitted transformers, which I then needed to keep track of throughout the modeling process. The remaining problems were scale, extra packages, the pickle files, and prediction pre-processing. This is what the model code looks like: in the first four lines I read the whole NumPy array at once and then load the pickle files. The problem with loading everything at once is that, as you can see at the bottom, I get an out-of-memory error in Cloud Machine Learning Engine. After that, it's just a normal Keras model.
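As a rough sketch of that scikit-learn encode-and-pickle flow (the values and file name here are made up): fit LabelEncoder and StandardScaler on the training data, pickle them, and reload the same fitted objects so the prediction data goes through identical transforms.

```python
import pickle
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical training columns.
train_channels = np.array(["ESPN", "Food Network", "NONE", "ESPN"])
train_seconds = np.array([[1500.0], [3600.0], [0.0], [900.0]])

encoder = LabelEncoder().fit(train_channels)
scaler = StandardScaler().fit(train_seconds)

# Persist the fitted transformers: prediction data must be transformed
# with exactly these objects, never with refit ones.
with open("transformers.pkl", "wb") as f:
    pickle.dump({"channel_encoder": encoder, "seconds_scaler": scaler}, f)

# Later, at prediction time:
with open("transformers.pkl", "rb") as f:
    fitted = pickle.load(f)

new_channels = fitted["channel_encoder"].transform(np.array(["NONE", "ESPN"]))
new_seconds = fitted["seconds_scaler"].transform(np.array([[1500.0]]))
```

Carrying these pickle files everywhere, and re-running this at prediction time, is precisely the bookkeeping burden the later tf.Transform iteration removes.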
Before solving the out-of-memory error, while the data was small, I experimented with hyperparameter tuning, which is a really cool tool: you can set your parameters and experiment with them. Here's a very exaggerated hyperparameter tuning example; you'd never want to experiment with this many at once. I'm tuning the number of filters in the convolutional layers, the batch size, the optimizer, and the dropout size. On the right you see the YAML file that's submitted along with the job, where you set the ranges you want to try; for example, for the optimizer there's Adam, SGD, Adamax, and others. In TensorBoard you can follow all the trials being run, and based on your model metric you can pick your favorite one. Here, in the lower-left pane, I pick trial number 18, and in the Google Cloud Platform console you can go find what trial 18's hyperparameters were: the first convolutional layer had 112 filters, and the optimizer was Adamax. In TensorBoard you can also check the model graph to see whether everything looks as expected and whether anything stands out for troubleshooting.

Now, back to the out-of-memory error; I still needed to solve that. This time I started with the original database, moved the data to Cloud Storage, did the pre-processing in Datalab, and saved the NumPy array again, but this time I converted the NumPy array to a TFRecord file so I could use tf.data and TensorFlow to feed the data in batches. Then I did the predictions against the prediction API in Cloud Machine Learning Engine and saved the predictions to BigQuery. Again, I needed to do pre-processing on the prediction data. Here's how I converted the NumPy array to a TFRecord file: I used a single Example that saves all my features into one feature called x.
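A minimal sketch of that TFRecord conversion, assuming TensorFlow 2.x eager mode and a made-up four-user, six-feature matrix: each row is serialized as one Example whose features all live in a single feature called x, and tf.data then reads the file back in batches instead of loading everything into memory at once.

```python
import numpy as np
import tensorflow as tf

# Hypothetical pre-processed matrix: one row per user, already encoded/scaled.
rows = np.random.rand(4, 6).astype(np.float32)   # 4 users x 6 features
labels = np.array([0, 1, 0, 1], dtype=np.int64)

# Write: everything goes into one float feature "x", as described above.
with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for x, y in zip(rows, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(float_list=tf.train.FloatList(value=x)),
            "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[y])),
        }))
        writer.write(example.SerializeToString())

# Read back in batches with tf.data rather than one giant NumPy array.
def parse_fn(serialized):
    spec = {
        "x": tf.io.FixedLenFeature([6], tf.float32),
        "y": tf.io.FixedLenFeature([1], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["x"], parsed["y"]

dataset = (tf.data.TFRecordDataset("train.tfrecord")
           .map(parse_fn)
           .batch(2))
```

A real input function would also add `.shuffle(...)` for training data, as mentioned in the talk.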
That's inconvenient, because later it's hard to get at each feature by name, but it solved the out-of-memory error and the scale issue. This is the model's input data and architecture: first you need the parsing function to parse your input, the TFRecord file. There's x, which is all the features combined, and y. The input function reads the data in batches, and you can set it to shuffle your training data. Just as a reminder, this is where iteration three stands: everything is almost entirely in Google Cloud Platform, except for the beginning, which is the original database. Now Austin will take us through how he moved that to BigQuery. Thank you.

Okay, cool. I meant to switch the order of these slides. So: BigQuery is great. I talked before about the trade-off between a data lake and an MPP database; realistically, BigQuery gives you the best of both worlds. Curating data and needing to think about the files on a storage system is a pain, I don't want to do it, and I'm glad there's a managed service for it.
I can update data and not think about things, and it's as performant as anything else I've seen. We haven't needed to do a ton with it yet, but the ability to share data across different organizations and across projects is super straightforward. I didn't pay enough attention earlier today to whether the BigQuery Storage Read API was talked about, but it's super cool: it gives us direct access to the underlying storage, which lets Dataflow/Beam jobs, or Spark and so on, access that data directly. A great way to try things out, which I wish had existed when I first got started, is the transfer service: you input a few parameters in the UI (and I prefer not to use UIs, but this will get you moving really quickly). In our case we moved a few terabytes over and saw that performance was great. Given that I just gave you a great sales pitch, everything isn't all roses all the time, right? I ran out of resources even on BigQuery in one specific instance, although after talking with the BigQuery team they fixed things more or less, we were eventually able to do the sort of query we'd done elsewhere, and this became a non-issue.

ML can't be successful without ingesting and moving data around, so I want to talk about a few patterns: batch ingest from various places, third-party APIs that we hit, and streaming APIs, using various open-source or at least managed tools. Also, containerization is pretty cool. A lot of our stuff has been orchestrated from a central virtual machine, which, although it has worked very well, also has its problems: scheduling, resource conflicts, and package and version issues. Those sorts of things go away when we adopt the containerization practices we've been working toward. Many of our batch ingests follow patterns like these, and a point I'm trying to get across in this talk is that multi-cloud is totally doable. Use the great tools where they exist, or within the confines of your organization; if people say you need to use something, it's not all or nothing.
In many of our workflows, data gets dropped into S3. This is a very simplified workflow: a Spark job transforms data and writes it back into S3; to do things with it, we just take that output and copy it over using the transfer service, or gsutil, or whatnot, and then it's in Cloud Storage. Another way we've played with is, on ingest in the first place into S3, getting it over to Cloud Storage with a Dataflow job. Especially once it's there, and because we're talking about batches, it's straightforward to get it into BigQuery. Also, Dataflow/Beam (okay, that arrow is missing from the slide) can go straight into BigQuery. I put "poof" there because I haven't tried it yet, but I think there's zero problem having Dataflow/Beam read and write from S3 directly; the pieces exist to do it, and it seems like a great way to simplify my architecture.

Another bit on batch ingest: we have a bunch of third-party data providers we get data from, so we hit their APIs. In the first iteration, GKE and a container managed everything: it would go to BigQuery, get, say, the last timestamp of things, go back to the API, get the data it needed, write to storage, and then we'd use a Dataflow/Beam job to write things to, say, BigQuery and Cloud Storage. Things got a little simpler when we added Airflow to the mix, letting the orchestration be the Composer/Airflow layer, with smaller containers and discrete services. We were then able to reduce query costs by keeping the metadata itself in Firestore rather than in BigQuery: essentially using Firestore as a metadata store for the last timestamp, so it's just a quick key-value lookup.

For streaming data ingest, we have processes that look like this: NGINX and Logstash on on-premise endpoints stream to Kafka, that goes to Logstash, winds up getting batched and written to S3, and gets loaded into Redshift to support the other fantastic groups and colleagues we work with. We then wound up adopting more of the data-lake model, unloading things out of Redshift using, say, Spark and writing that back into S3 in columnar form, since we absolutely have to support S3. Dataflow and Beam, though, can read right from Kafka and are able to write to these other pieces. For streaming data, we're still toying with whether we'll get enough value out of Bigtable itself:
writing to Bigtable and reading from it into BigQuery, writing to both, or whether BigQuery by itself is going to be it. And although Dataflow and Beam can talk to our Kafka clusters directly, we went ahead and jumped all the way to having NGINX and Logstash forward right to Pub/Sub, the eventual goal being to get our on-premise infrastructure into the cloud. It looks like we're leaning toward Cloud Endpoints and Cloud Functions, although we're first proving out the rest of the pipeline that's here. I'm a big fan of all things Beam, and I'll give a shout-out to my colleagues' session on stream analytics on Thursday, 11:40 to 12:30. All right, and Sam is going to share more on our modeling.

So, now that we have the data in BigQuery (thank you, Austin, for putting it in BigQuery; everything is in BigQuery now), I start from BigQuery. As Austin mentioned earlier, we were at Next last year and learned about all these new tools, like Dataflow and tf.Transform, that are great for pre-processing big-scale data, so I wanted to give those a try. I found courses on Coursera that were really helpful, and Datalab is a great tool for following those courses and learning tf.Transform. This time I started from BigQuery, then did the pre-processing in Datalab; by pre-processing I mean submitting a job to Dataflow. Then I saved the pre-processed data to Cloud Storage, trained the same model on CMLE, predicted on CMLE, and wrote the results to BigQuery. But this time, no pre-processing for predictions. So that's great. Here's a learning example of tf.Transform in Datalab, in case you're interested: you can follow it and troubleshoot your code; it's interactive, easy to follow, serverless, and customizable.

So here's our final iteration, which is much shorter than where we began, and scalable and serverless. We start from BigQuery, do the pre-processing on Dataflow, save the TFRecord files on Cloud Storage, do the training and predictions on CMLE, and write back to BigQuery. Here we solved all of the earlier problems: scale, extra packages, and no more pickle files. There's no pre-processing for predictions, because tf.Transform writes out the transform graph; you point your model at the graph output of tf.Transform and it will automatically process the data for you. Here's an example of a Beam pipeline that I use. First, I read the data from BigQuery; no more unloading data from a database, it reads straight from there. The second step is grouping by user, which is just reordering the elements. Then I insert zeros and reshape (I'll get into the details in the upcoming slides), then I scale and encode using the tf.Transform functions, and finally I write to a TFRecord file. In more detail: the first function is just a SELECT query. The second is data validation; here I'm checking that days is greater than or equal to zero, but it could be anything. The third step is collect-users, which reorders the input elements so they can be grouped later. The fourth step is the right-format NumPy operation, which is just a normal NumPy function, and the fifth step is where the encoding and scaling happen.
After the fifth step, the tf.Transform graph gets written to your bucket, where you can point your model at it. Here's the Dataflow example; I'm going to show you a demo later, but just in case, here's what it looks like. You can follow each step and see where you are, you can see your job description and summary, and in the right middle pane you can see the autoscaling. Here I needed 70 workers for ten minutes; it autoscaled up to 70 and then back down, and my job was done in 15 minutes.

So now I'm going to move to the demo. All right, there are these two files, very similar to what I showed you. I'm going to run them from the terminal and hope they work. Here's the GCP console homepage, where you can find Compute Engine, ML Engine, all these tools I just talked about, and BigQuery, but I'm interested in Dataflow, and one of the jobs has kicked off. The graph is still being analyzed, so we have to wait a little bit. Okay, here you can follow what's happening. The first step is the BigQuery import, and it didn't run. To check why, you can always click on the step and then on the logs, and it will give you the error: here it says the BigQuery execution failed because of an unrecognized name. I deliberately put an unrecognized name there so you could see the error. All right, here's the second job, which is exactly like the one I showed you before. You can see the steps running; I'm running it for my training data and also the validation data, and they all get saved. Here's where the transform function gets saved, and this is where the transform happens. The autoscaling is kicking off, and you can see the job information, the resource metrics, and the pipeline options; you can always submit a setup file with the job to specify what you want. Here's the autoscaling: the current worker count is zero, but it wants to be seventy. If we wait a little bit, the current workers will go up to 70 and then back down, and the job will be done. This is the same job I showed you earlier.
So we don't have to wait the 15 minutes; I'm going to go back to the slides now. All right: in the model, the only thing you need to change is the serving input function, where you point it at the tf.Transform graph output. This is how that happens, and below it is the predictions output, with viewer ID, probabilities, and prediction classes.

So now that we have all the pieces of the puzzle together, it would be great to have this automated so we could get fresh predictions every day. Here's Austin, who will tell us how he did that.

All right, so, cool: things should be automated. We're doing batch predictions, and one of the more straightforward ways of doing it is a three-step process. First, unload from BigQuery to get the records in newline-delimited JSON format (in this case we used Beam/Dataflow) and write those to Google Cloud Storage. Second, hit the ML Engine batch prediction service, which reads those files from Cloud Storage and writes results back to Cloud Storage. And then, for what we've been doing, those batch results just go into BigQuery so we can do things with them. To highlight what this is: we have our Beam job, which is called extract.py, and the documentation on using Beam and Dataflow for this is super straightforward. For Composer and Airflow, less so for these operators; I'm really hoping they wind up getting accepted for Season of Docs (I forget whether it's summer or season), where there should be some resources helping even the open-source projects get documentation going. But you can take, more or less, the Python command line and translate it into what's needed for Airflow; this is for the Dataflow operator. Equivalently, you can see the whole lot of variables we're using with `gcloud ml-engine jobs submit`; again, the documentation is really great.
Here, with the ML Engine batch prediction operator, you can see we were able to make this pretty straightforward; it's pretty much plug-and-play. The same goes for the `bq` command-line tool and its load command: a little bit more to it, but again pretty straightforward. So that's how we weave these things together. Yes, these could be cron jobs wrapping the command-line tools we saw, but Airflow and Composer give us a lot more visibility into what's happening, and also, hey, let's learn these tools, because the lessons get applied all over the place, like we saw earlier for moving batch data around.
Okay, so far, in summary: we've seen how we slowly built up to our eventual solution. The real goal was to have in our tool set the best scalable solution, one we can work with for, say, our twelve-gigabyte data sets as well as our multi-terabyte data sets and anything in between. Hopefully, as you've seen, you don't have to go dark for a long time while you figure things out; you can work toward those solutions incrementally. Using the TensorFlow stack, we've seen better predictions; hopefully that came across. And you can use open-source tools. Also, we're a very small group, so leveraging serverless and managed services is amazing.

A quick announcement: I'm helping organize the Beam Summit in Europe in mid-June; if you're into Beam and Dataflow, even for running things like TensorFlow Transform, check it out. There will also be a two-day Beam Summit track in September at ApacheCon in Las Vegas, another place to check things out. And I want to say thanks to a bunch of people: especially our Sling and DISH colleagues, for encouraging and supporting this path, either directly or by supporting the needs of the group so that we had the time, ability, and attention to devote to these sorts of things; the GCP team that helped us along this journey; and certainly the open-source tools and everybody, paid or volunteer, in those open-source communities. And thank you all for your attention.