Lessons Learned Scaling Machine Learning at Go-Jek on Google Cloud (Cloud Next '18)


My name is Willem, and I am the data science platform lead for Go-Jek. Today we're going to be talking about some of the lessons we learned scaling machine learning at Go-Jek on Google Cloud. Since we're already running, I'm going to just dive in.

So who or what is Go-Jek? Go-Jek is an Indonesian technology startup. We have a wide variety of products and services, and some of them are on the screen right now. The one that we're most famous for is our ride-hailing service, GO-RIDE, and the photo actually shows you what the service does: it's a woman being taken on one of these motorcycles to a destination. The reason why it's not a car but a motorcycle is because of the unique challenges in Indonesia. A lot of people don't know this, but Indonesia is one of the largest countries in the world by population — it has over 260 million people, making it the fourth largest country in the world by population. It has a lot of unique challenges with public infrastructure and traffic congestion, and the way Indonesians have solved this problem is by using motorcycles to get around; they also have their own motorcycle taxis, called ojeks. When we originally started, we started as a call center for ojeks. It was offline — there was no application, it was completely disconnected. You just called in and we sent an ojek to take you to your destination.

Originally our founders launched the service and there was great growth and demand for it, because the traffic is so bad in Indonesia that the service really solved everyday problems for Indonesians. They continued to collect information about what products our customers wanted in Indonesia, and in 2015 we launched our first mobile application. It was just a modest bundle of products. The key products there were GO-RIDE, the ride hailing on motorcycles; GO-FOOD, food delivery, also on motorcycles; and then things like GO-MART, grocery shopping, and other logistics services. When we launched this initial application, the uptake was incredible. We hit hyper-growth very quickly, and the demand in Indonesia was insatiable, because it solved so many of the daily problems that Indonesians had getting around the city.

So over the next couple of years we launched a bunch of products — not just at random; we've always been a very data-driven company. We've always looked at what the customers wanted and launched very targeted products for them. Currently we are a unicorn, one of the few unicorns in Southeast Asia. We've got eighteen products that we've launched in the application, and many more that are not in the application. In 2018 we are focusing on international expansion.

But let's talk a bit about our home, Indonesia. Our app has been downloaded almost 80 million times, and we operate in 60 cities throughout Indonesia. In many of our product verticals we're actually the largest player, so let me just talk to two or three of those. For ride hailing, we have more than a million drivers on our platform; on a typical day you'll have hundreds of thousands of drivers online at the same time servicing our customers. For GO-FOOD, our food delivery service, we have more than 200,000 merchants on our platform — and we call them merchants because it's not just restaurants, it's also moms and pops that are selling food from their garages. So this platform that we've built really allows for a kind of socio-economic mobility; it enables people to rise up from poverty.

The third product I wanted to highlight is GO-PAY, one of the leading e-money platforms in Southeast Asia. So we have many different products, and I just wanted to highlight the scale at which we're operating. Along with that scale comes data. This visualization represents some of the data points we're dealing with: it's a visualization of Jakarta, the capital city of Indonesia, where each pixel represents a person being dropped off or picked up on our ride-hailing platform. This is just one type of data point that we have on a single product, but we have many types of products and data points, and data science plays a key role in our organization — machine learning as well. When you have all these decisions being made in real time, machine learning is critical to making those decisions.

The business side of the organization also looks to data science to understand our customers, and especially with international expansion looming, we wanted to understand what the unique demands of our customers were in different markets.

As the data science platform lead, it's my responsibility to make sure our data scientists are as efficient as possible — that their time is being used correctly and that they have the right tools available to them. Especially with international expansion, you don't want to just add more data scientists; you want to add leverage to the existing ones you have. So I launched an investigation into where our data scientists were spending their time, to see if we couldn't improve their efficiency.

When we launched this investigation, one of the data scientists — Darius, on screen — came to us and said: hey, I've got a lot of projects that I'm working on right now, I'm actually getting a bit swamped, so maybe you can focus your investigation on me and see if you can't optimize the way I'm doing things. Darius is very representative of a typical data scientist in our organization. He's working on a lot of interesting things, like fraud detection, driver allocation, personalization and forecasting. He's got projects all across the board, and we thought that if we solved Darius's problems, we could solve all the data-science-related inefficiencies.

So we looked at where Darius was spending his time, and this is what we found. This is a chart of where Darius spends his time on a typical project, across the lifecycle of the project, and we honed in on the different tasks he was doing. The grey areas represent tasks which we didn't deem to be data science per se — tasks like provisioning infrastructure and installing software, the kind of systems and engineering work Darius was doing. The coloured blocks are all of the data science related tasks that he's an expert at, that he's good at doing. We wanted him to focus more on those and less on the engineering. If you're a call center one day, and a few years later you're a unicorn, billion-dollar company with thousands of employees, and you tell data scientists to deliver results, then you get something like this — because if there's no platform for them to build on, data scientists end up provisioning infrastructure themselves across all these systems. So what we told Darius was that we were going to try and solve his inefficiencies and problems, and we were going to look at all of his projects through three different lenses: the first is the sourcing of data, the second is feature engineering, and the final one is machine learning. By looking at these three aspects, we were going to try to improve the way he was spending his time, so that he could focus more on his own data science and machine learning.

The first of the three is sourcing of data, and this is a very foundational block that we had to address before we could get on to machine learning and feature engineering. We asked Darius: can you give us an example of a project where you had trouble with sourcing of data? This is what he told us. He was working on an exploratory data analysis project. He had this hypothesis that when a driver is going to pick up a customer, the closer they are to the customer, the quicker the pickup will be — except the bit of data he had indicated that the opposite was actually true: for some drivers, the closer they are to the customer, the longer it takes them to pick the customer up. This was very counterintuitive, so he said, okay, he's going to investigate why this is the case. He wanted to look at those specific drivers in more detail, so he needed more detailed data on the drivers. He asked other data scientists in his team for this data and they didn't have it. He asked his manager, he asked the VP of data science, he asked the CTO. He was directed to a team in Bangalore that could potentially have this data, and it turns out they did have it. But what they told him was that he couldn't query their production database, because it would lock the DB and ruin the driver experience. The problem here is that it is not really Darius's job to hunt after data, so we knew this was something we needed to address before we could get to any of the ML or feature engineering.

Our realization was: don't go to the data — let the data come to you. The key thing here is that Darius shouldn't have to go after the data. We realized we needed to centralize data storage within our organization, and we did it through the following means. We said we're going to build a data foundation, and we're going to bring in all the data from all the product teams — data that is being created on a very frequent basis — and we're going to do this with three components: BigQuery, Kafka and Cloud Storage. BigQuery becomes our data warehouse. This is kind of a no-brainer; I don't know if I have to sell anybody on BigQuery — it's big data storage, it scales, it's easy to access with SQL. For Kafka, we needed some of the functionality Kafka provides, which is why we opted for it in this case, though we do use Pub/Sub as well; Kafka is an industry-standard event bus. And Cloud Storage is our data lake.

The problem with this, though, is that even if you standardize publishing of data to a centralized location, you have to get people to actually do it, either through a stick or a carrot, and we wanted to give them benefits and incentives to publish. So what we said is: if you publish your data to the foundation, we'll give you some benefits. The benefits were, first, reporting — so that managers were happy; we automatically generate reports based on the data you publish there. We give you automatic archival. We give you automatic monitoring and alerting of your events on Kafka. And finally, because you're using BigQuery and other Google services like Cloud Storage, you have centralized authentication and authorization built in. With all these benefits, the product teams and other teams in the organization started publishing data, and once they did, people like Darius and analysts and other employees could find data, and this led to more insights. So Darius went back and looked at those drivers with the weird pattern of taking very long to reach customers, and this is what he found.

Here are two images that illustrate what Darius found when he looked at these drivers. On the left is a driver that is permanently stationed on top of a building — he's just always sitting there. On the right is a driver that is moving through a bunch of buildings at 106 kilometers an hour. I don't know what that is in miles per hour — something like 60 or 70 — but the point is that it's impossible: nobody moves at that speed through Jakarta traffic.
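As an aside, once driver ping data lands in a centralized BigQuery warehouse, an analysis like Darius's becomes a single query. The sketch below is purely illustrative — the project, dataset, table and column names are assumptions, not Go-Jek's actual schema — but it shows the shape of the "find drivers with implausible movement patterns" question:

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # hypothetical project

# Hypothetical table of driver GPS pings in the centralized data foundation.
sql = """
SELECT
  driver_id,
  AVG(speed_kmh) AS avg_speed_kmh,
  COUNT(*) AS pings
FROM `example-project.data_foundation.driver_pings`
WHERE DATE(ping_time) = '2018-07-01'
GROUP BY driver_id
HAVING avg_speed_kmh > 100   -- faster than anyone really moves in Jakarta traffic
    OR avg_speed_kmh < 1     -- parked on a rooftop all day
ORDER BY avg_speed_kmh DESC
"""

for row in client.query(sql).result():
    print(row.driver_id, row.avg_speed_kmh, row.pings)
```

This kind of ad hoc analysis is only possible because the pings now live in one shared warehouse rather than in a product team's production database.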

What we realized is that these drivers were actually faking their location. They were using software that allowed them to pretend to be at a specific location so that they could get preferential treatment from our machine learning models when being assigned to trips. In that case, the driver was pretending to be in a shopping mall, so that if somebody left the shopping mall he would be assigned to that trip — but in reality he was far away and would have to drive there. Armed with this knowledge, Darius could go and build a model that could identify these drivers, and we could react accordingly. The key thing here is that building a data foundation is a foundational and fundamental part of solving ML problems, and you shouldn't be hiring data scientists — or at least not many — until you've built this fundamental block. So this was the first part that we did.

Feature engineering is the second part. We asked Darius what projects he had where he had frustrations working with features. He told us about another project, and this one is a lot more important for Go-Jek: the driver allocation problem. Darius explained how this works. Basically, if you're a customer making a booking request to go to a destination, you need to be assigned a driver. Most customers think they just want the closest driver — oftentimes you can see on your app that there's a driver right outside your door, and you wonder why the algorithm doesn't allocate that driver. But it's often a very complex process, because sometimes drivers want to head home, sometimes drivers are already on a trip, sometimes they've just turned onto a highway and will take a long time to turn around. Sometimes you want to optimize for the driver experience, because drivers are also part of our system: a driver might not have had an opportunity to take a trip with a customer for a long time, and they are also trying to earn a living.

So basically Darius realized that the driver allocation problem is one that is very dependent on features, because the features really drive the decision making of the model, and this model is extremely important for our organization. If you're doing a hundred million bookings every month, this model is going to process a lot of money, and a small tweak to it can have a massive impact on the bottom line. So Darius said, okay, let's first make a list of all the features we need. He started listing the features, and these are some of the typical ones he came up with off the top of his head. There are driver-related features, like the driver's location, speed, direction and ETA. There are customer-related features, like the profile, their clicks or actions, and their history. There are spatial features, like what the demand is in this area, what the supply is like, and traffic-related features. And finally there are temporal features, like the time of day, the day of the week, and whether it's a public holiday — often in Indonesia you have religious holidays that can completely change the way traffic behaves and how supply and demand dynamics change.

He made this list of features and then set about creating them, so that he could train a model and deploy it into production. But he ran into some problems, and these were some of them.

The first problem was the volume of feature data. We had the data foundation now, and what Darius was doing was spinning up virtual machines and scheduling jobs that would transform that information into features, publish it somewhere, and then he would train a model on those features. This is a batch process, an offline process, but a scheduled one. The trick here is that for data scientists that's a lot of work. These pipelines run for hours and days, the iteration cycle is very slow, making a small tweak means rerunning the whole pipeline, and making a mistake means starting all over. So this was a frustration costing us a lot of data science hours, something we wanted to solve.

The second problem was real-time features. You can imagine that if you have an event stream on Kafka, you can't just run a query on Kafka — you need to build a system that streams that data into a data store, with transformations that build features, so that you can access those features in real time. So every time we wanted to launch a project, we needed an engineer to come in and build a system that could stream these raw events and build features for real-time access. This is something we wanted to solve across the board.

The third problem we had was consistency. We had engineers building features in real time, and we had Darius building them in batch, but these were disconnected. The real-time features were in Java and Go and other languages like Scala, and Darius was building his features in Python, so there's an inconsistency there. This creates a problem, because models are being trained on one set of features and served in production with different features, and that's real scope for problems to creep in. There's also a duplication of work — you actually just want to define a feature once, not twice.

And then finally, discovery. Darius went to lunch with another data scientist and realized that this data scientist had already developed a lot of the features Darius had developed. So Darius was redeveloping the same features due to a lack of knowledge and a lack of discovery in the organization. We knew we needed to solve this problem holistically for the whole organization, so that the discovery was there and the standardization was there.

Our realization was that features should be free. You're going to pay a cost to build a feature the first time, but after that it should be available in all of your environments consistently — in production for serving and for training — it should be discoverable, and you should have information about your features.

So what we were going to build is a platform for Darius to access features and create features, and the first part of that platform is Dataflow. We had our data sources now, with our data foundation, and we needed a way to solve the consistency problem; we did that by introducing Dataflow. What's great about Dataflow, and why it's so suited for this purpose, is that we want consistency — we don't want to redefine the same features over and over — and with Dataflow, batch and stream are supported as first-class citizens. There's really no distinction between the two; they're just different data sources. With Dataflow you can take in CSV, you can take in events, you can take in relational SQL data, transform it with a single transformation, and produce it into any kind of data store. So by introducing Dataflow we have a single place where Darius can build and define his features. Some of the other advantages of Dataflow are that it automatically scales and there are no servers to manage. There's also no lock-in, because the code you write for Dataflow uses the Apache Beam API, which can also run on Flink or on Spark, so the code you write is portable — you're not locked into Dataflow.

Some of the challenges with Dataflow are that, because it handles both streaming and batch in all cases, the API is kind of tricky. You're also limited by the fact that the Python and Go support is not as feature-complete as Java, so in many cases you'll be forced to write Java code. That's the trick with Beam and Dataflow: often you'll need an engineer to help define the features with the data scientist. But the key thing is that once a feature is defined, it's done — it's always there, and you can store it in the training store as well as a real-time store.

So here's an example of a feature definition. In this case we're just going to look at how many trips the driver has completed in the day, and we take trip events as the input.
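The slide code itself isn't reproduced in this transcript, but a minimal Apache Beam (Python) sketch of the feature Darius describes might look like the following — the event schema and field names are assumptions for illustration only:

```python
import apache_beam as beam

def completed_trips_per_driver(trip_events):
    """trip_events: a PCollection of dicts like
    {"driver_id": "d-1", "status": "COMPLETED"} (hypothetical schema)."""
    return (
        trip_events
        | "OnlyCompleted" >> beam.Filter(lambda e: e["status"] == "COMPLETED")
        | "OnePerTrip" >> beam.Map(lambda e: (e["driver_id"], 1))
        | "SumPerDriver" >> beam.CombinePerKey(sum)
    )

if __name__ == "__main__":
    with beam.Pipeline() as pipeline:
        trips = pipeline | "SampleEvents" >> beam.Create([
            {"driver_id": "d-1", "status": "COMPLETED"},
            {"driver_id": "d-1", "status": "CANCELLED"},
            {"driver_id": "d-2", "status": "COMPLETED"},
        ])
        completed_trips_per_driver(trips) | "Print" >> beam.Map(print)
```

The same transform can be applied to a bounded (batch) or unbounded (streaming) PCollection, which is what gives the batch/stream consistency described above.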
The code is Python, and you apply PTransforms to the PCollection: the trip events are a collection of events, or elements, and you apply subsequent transformations — you filter for the successful trips so you're only left with those, you create a data structure with a count of one per event, and then you group and sum per key, and you're left with the count for each driver. This is a very basic example of how you could define a feature, although in our case we often use Java to define it, which has a lot more boilerplate.

Right, so now you've solved the feature creation, standardization and consistency problem, but you need to introduce storage as well. For training, we introduced BigQuery. BigQuery is just a no-brainer store for us. We looked at a lot of competing databases to store training data, but ultimately it came down to one thing: our label data generally sits in the data sources as raw data, and our feature data often needs to be joined onto our label data to create training sets.
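As a concrete illustration, building such a training set then becomes a join inside BigQuery. The dataset, table and column names below are invented; only the pattern — label rows joined to feature columns and pulled down for model training — reflects what's described here:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical tables: booking outcomes (labels) joined onto driver features.
sql = """
SELECT
  l.booking_id,
  l.pickup_duration_seconds AS label,
  f.completed_trips_today,
  f.avg_speed_kmh,
  f.distance_to_customer_m
FROM `example-project.labels.bookings` AS l
JOIN `example-project.features.driver_features` AS f
  ON f.driver_id = l.driver_id
 AND f.feature_date = DATE(l.booking_time)
"""

# Pull the joined rows down for model training (requires pandas).
training_df = client.query(sql).to_dataframe()
print(training_df.shape)
```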

We didn't want a completely distinct training store — at least not without a good reason — so we opted for BigQuery because of its high scalability and the fact that it's a completely cloud-based service: you don't have to manage any infrastructure, and it's easy to access with SQL. For this use case it's also very good because features are generally represented as columns in this data store, and BigQuery is a columnar store, so it's very efficient at querying feature data. That's what we introduced for our training store. It's also very closely integrated with other services on Google Cloud, but we'll get to that later.

For serving, we introduced two data stores: the first is Bigtable and the second is Cloud Memorystore, Google's hosted Redis. Bigtable was really a game changer for us, not just for features but for other applications as well. The reason it's so good is that it allows you to consistently access feature data at very low latency — less than 10 milliseconds. It allows you to handle very high load: you can read and write up to about 10,000 times per second, combined, per node, and if you want to scale up you just add more nodes — you can scale linearly to any number of nodes with Bigtable. We also introduced Cloud Memorystore (Redis) for a similar use case. Sometimes, when you have hundreds of thousands of drivers, there are so many features, and some of them change so frequently, that the load is just incredible — you'd have to spin up maybe 50 Bigtable nodes to handle it. If you need to update some features at a rate of 200 or 300 thousand times per second, it doesn't make sense to scale Bigtable out just for that. So we introduced Redis, because Redis allows you to read and write at extremely high rates. Often for these types of features we don't care if the data store goes down, so the durability of Redis is not really a concern for us. Then, ultimately, we layered a feature serving API on top of this whole system. This API intelligently finds where the features are located: when a feature lookup request comes in, it breaks apart that request, finds the data, and joins it back together. Also, very importantly, there's caching in the feature serving API — without that caching it's extremely difficult to handle the load.

Now, if you look at this chart, you actually have a lot of information about the features. You've got Dataflow, where you create your features, so you know what features you've defined; you've got the access patterns on the feature serving API, so you know which features are being used the most; and you've got the training store, BigQuery, where you're creating training datasets and training models.
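For illustration, a toy version of such a feature serving lookup — check a Redis cache first, fall back to Bigtable, then cache the result — might look like this. The store names, key layout and column family are assumptions, not the actual Go-Jek API:

```python
import json

import redis
from google.cloud import bigtable

# Hypothetical stores: hot features cached in Redis, the rest in Bigtable.
cache = redis.Redis(host="10.0.0.5", port=6379)
bt_table = (
    bigtable.Client(project="example-project")
    .instance("serving")
    .table("driver_features")
)

def get_features(driver_id, feature_names, ttl_seconds=30):
    """Return the requested feature values for one driver."""
    key = "driver_features:{}".format(driver_id)
    cached = cache.get(key)
    if cached is not None:
        features = json.loads(cached)
    else:
        # Cache miss: read the whole feature row from Bigtable.
        row = bt_table.read_row("driver#{}".format(driver_id).encode())
        features = {}
        if row is not None:
            for qualifier, cells in row.cells.get("features", {}).items():
                features[qualifier.decode()] = cells[0].value.decode()
        cache.setex(key, ttl_seconds, json.dumps(features))
    return {name: features.get(name) for name in feature_names}
```

In the real system the API would also fan requests out across stores and join the results back, as described above; this sketch only shows the cache-then-Bigtable path for a single entity.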

And you know how well those models are performing at inference. With all this data, if you log it, it's actually very useful to data scientists. So what we did is we just dumped all of that metadata about features into a database and layered Data Studio on top of it. Data Studio is essentially a BI tool that allows you to look at data, and with it we can see the relative efficacy, or impact, that specific features have relative to each other for predicting outcomes or optimizing for specific objectives. If you expose this to data scientists, it's a really powerful tool for key insights. And even if you didn't have any of the impact data — if you just had a list of features — that's already useful to data scientists, because they can take that list and say, okay, I'm going to go to BigQuery, select these specific features, and train my model.

And then, finally, this is what we've built. We've got our data sources; we've got a standard way in which Darius can define a feature to transform those data sources; he doesn't have to worry about the training store, he doesn't have to worry about Memorystore or Bigtable, he doesn't have to worry about the serving API. All he needs to do is find features in the explorer, select them in BigQuery, train his model, and remember which features are available in the feature serving API. So we've solved his driver allocation problem to a large degree, because the feature engineering was the toughest part for him; all he needs to do now is define features as feature transforms.

The final part we're going to look at is machine learning. We now have both of the foundational blocks — the data sources and feature engineering — so let's have a look at some of the projects Darius was working on for machine learning.

Dynamic pricing is one of the key things that all ride-hailing companies must build. The problem is that if you don't have dynamic pricing, you have inefficiencies in your market. You have customers and drivers, and they need to be matched; the drivers need to service your customers. But if there are not enough drivers in an area, you need to incentivize drivers to move into that area and service those customers, and you do this by having variable pricing in specific regions. Here's an example of a heat map of Jakarta, where the center of Jakarta has higher prices than the outer parts, and this is just to incentivize drivers to move into that area to service those customers.

When we started with this project, Darius said, okay, he's just going to take data that's in BigQuery, train an ML model, and deploy it. And that's what he did: he trained a model on some of the data, he had some ideas, and he deployed it on the outskirts of one of the towns as a small experiment. But there were some problems. The first was that it didn't scale: his model was written in Python and couldn't scale beyond that experimental base. The second problem was that the model was not interpretable — it was a black box, and nobody else in the organization could explain what it was doing. And the third problem was that even if we could look at the results of the model, we didn't know if they were good or bad, because this is an extremely challenging problem to evaluate: it's very difficult to have a baseline or a control in dynamic pricing. If you run an experiment where you raise prices in one area, you're affecting essentially the whole system — there's no isolation, no apples-to-apples comparison. You raise prices and you pull in supply from external areas. And we had other issues as well: say you've got food delivery as one of your products, and you've also got, essentially, people delivery on GO-RIDE — they share a supply base. So if you raise the price on the GO-RIDE side, you're taking drivers — the people who deliver food — away from your other product.

And this is what we found: we needed a very clear balance between our products in terms of dynamic pricing. So this was our realization: do machine learning like the great engineer you are, not like the great machine learning expert you aren't. This was important for us, because we knew we were making a mistake by jumping straight into ML. What we did first was define our objective, and we built a very solid engineering system that could handle the load and that could handle a very basic mathematical model — not a machine learning model. We deployed this with clear objectives, a clear measurement of success, clear dashboards and monitoring and everything — without any machine learning. And then we had dynamic pricing: drivers could see in which areas the prices were higher or lower, and they were incentivized to go and service those areas. Ultimately we would end up adding machine learning to the system, but at the start the key thing was knowing what your success criteria are and building a rock-solid engineering system first.

But there were still some inefficiencies in the system. What the drivers were complaining to Darius about was: okay, you can see that there are some areas where demand is high, but it's high for a reason — there's heavy traffic, or a bottleneck, or some other reason why it's hard to get there. If they had some way to know in advance that there was going to be high demand, that would help them, and this is where forecasting came in. Darius knew that he could forecast what demand would be like in those areas, so he set about building a model to do forecasting. He would take all of the data — supply data, demand data, traffic data, user clickstream data, all the signals he could get on a per-region basis. This is an extremely large amount of data, terabytes and terabytes that he was streaming, and the model was so ambitious that we even brought in weather stations in Jakarta and streamed in weather data, because if rain falls in Jakarta, traffic changes very quickly. What he wanted to do was feed all this data through a TensorFlow model and predict, on a per-region, per-area basis, what demand would be.

But he ran into some challenges, as usual. One problem was that he couldn't train this model on his local machine, even just to test on a subset of the data. When he tried to move the data onto a virtual machine and train it there, that also fell over — the virtual machine couldn't load all of his data into memory. So he asked a bunch of engineers: can you help me spin up a Spark cluster and train the model there? They did, and it worked: they used Spark, trained the model, and could actually deploy it. But now there's a problem, because Darius is dependent on those engineers — every time he wants to make a change he has to ask them for help with the cluster. And another problem was that it was quite slow: it would take hours, and sometimes more than a day, to train this model.
So the lesson here was: don't break abstraction. If you're a data scientist, try to operate with the appropriate tooling at the abstraction layer you're comfortable with and that you're an expert at; don't drop down into the engineering world unless you absolutely have to. As the data science platform lead, it's my responsibility to give Darius the tools he needs to operate at this abstraction layer, so what we introduced to Darius was Cloud ML Engine. One of the reasons we also chose BigQuery as our feature store and data warehouse is that it has a very close integration with Cloud ML Engine. Cloud ML Engine allows you to train TensorFlow models; it's a completely managed service, it scales automatically, and training is distributed. There are other advantages too: you've got hyperparameter tuning, which means your models can be more accurate, and you've got levers you can pull if you want to iterate and train faster — you can go from CPUs to GPUs. So this is what we introduced to Darius.
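Submitting a training job to Cloud ML Engine can be done from the command line (`gcloud ml-engine jobs submit training`) or, as sketched below, through the Google API client. The project, bucket, region and trainer package names here are placeholders; the exact job spec Darius used isn't described in the talk:

```python
from googleapiclient import discovery

project_id = "example-project"  # hypothetical project

# Hypothetical training job spec: a packaged TensorFlow trainer on Cloud Storage.
job_spec = {
    "jobId": "demand_forecast_20180801",
    "trainingInput": {
        "scaleTier": "STANDARD_1",                      # distributed training
        "packageUris": ["gs://example-bucket/trainer-0.1.tar.gz"],
        "pythonModule": "trainer.task",
        "region": "asia-east1",
        "runtimeVersion": "1.8",
    },
}

ml = discovery.build("ml", "v1")
response = (
    ml.projects()
    .jobs()
    .create(parent="projects/{}".format(project_id), body=job_spec)
    .execute()
)
print(response["state"])  # e.g. QUEUED
```

The point is that the data scientist submits a job spec rather than managing a Spark cluster or a fleet of virtual machines.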

The way he used it was as follows. He already has his feature transformations: Dataflow is creating features into BigQuery, and those are the time-series signals he's getting from the field, like the weather data I spoke of. He trains the model on Cloud ML Engine and he serves it on Cloud ML Engine. His signals are also coming in from Kafka — these are the raw events — and he has a separate stream that does inference: a Cloud Dataflow job that triggers serving on Cloud ML Engine. So if rain falls, it triggers Cloud ML Engine for new inference, predicts the demand, and publishes it back to Kafka.

And this worked. One of the downsides of Cloud ML Engine is that it only supports training for TensorFlow, so if you're using XGBoost or scikit-learn or similar, it doesn't work for training — you can serve those other types of models, but there are some downsides. For us, though, the key thing was that we wanted at least one tool that the data scientist could use without having to pull in engineers.

So then we had forecasting. These are just three of the different regions; here's a 30-minute forecast, where the green line represents the forecast and the yellow line represents the actual values. Now that we knew what demand would be in the future, we could tell drivers that an area was going to be high in demand, or we could just change the pricing, and that would also influence the way the drivers react to the dynamics.
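As a rough illustration of the streaming inference step — a handler that, when a new weather or demand event arrives, asks Cloud ML Engine for an online prediction — consider the sketch below. The model name, project and feature fields are placeholders; the actual request schema used in the talk isn't shown:

```python
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

def predict_demand(region_features):
    """region_features: list of feature dicts for the regions to score
    (hypothetical schema)."""
    name = "projects/{}/models/{}".format("example-project", "demand_forecast")
    response = (
        ml.projects()
        .predict(name=name, body={"instances": region_features})
        .execute()
    )
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]

# Example: score one region when a rain event arrives.
print(predict_demand([{"region_id": "jkt-001", "rain_mm": 12.0, "hour_of_day": 18}]))
```

Inside a streaming Dataflow pipeline, a call like this would live in a DoFn triggered by the incoming events, with the predictions written back to Kafka as described.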

A Business. You have objectives, like you want to introduce new products to certain customers so this is a very important, part of our application, so. Darius, was tasked with building this model to. What his model would get is the. User information. So. The users ID. It's, the user's location and the time at which the, user has opened, the application and, then, what he has to respond with is a list, of recommendations to. Personalize, this. Home, screen, but. This is actually quite challenging because. That data is only available to, us when the user, opens the application see, the trick here is that we, don't track users when they're not inside of our application, so we don't know where or when you're gonna open the application until, you do it and the, type of personalization. We want. Or, Darius described it as follows, he. Said that they say you know that a person, will. Order a pizza for his children. On a Friday night he's been doing it for three, weeks in a row so. You want to be able to recommend that, buying. A pizza to him if you doubt if it arrives at home on that Friday night again. But, if it arrives at the office or if your eyes at the friend's house then you don't want to recommend that right, so that's. The type of control we want in personalization. So. So there were some the challenge is of course that we don't know where, where when the user will pop up so. There is said okay fine he's gonna try and build a model and. To see if he can serve this. Right. So here are his targets on the right so, he needs to serve the response in 30 milliseconds, it's the home screen of the application so, you have to serve responses, very quickly our, SL is at 10,000, requests per second, so. The heart the throughput is very high and with international expansion we, knew we needed to support even higher throughput, soon so. This was this model that he both was an application, it. Was embedded within an application and it was before we had the feature platform, so, in this case he, actually had user data that was loaded into the application, statically. As batch, data so. He would get the user ID the time and the lat/long when, the user opens the application and then, what he would do is he would transform. Based. On the user data as well, as the incoming request new, features, he wouldn't Agenor eight new features, put, it into the change of flow model do inference and then it'll produce a list of recommendations which he would return and. So. What there is there is he did, a benchmark, he's just had a look at what, performance, he's actually getting with, this application. And, this was his initial, performance, his, target is 30 milliseconds but he was serving, results at 140, milliseconds, which is almost five times to slow so. As. Any, engineer, will tell you the first thing you do in kubernetes is you scale it out and. That's. What we did of course that doesn't really solve your problem right there's, a certain amount of complexity, and producing, these results, or, and, doing, inference and scanning. Out will just reduce the load it won't actually, minimize. The latency that much, so. Then eventually what we did is we introduced, the featured platform and. By. Doing that we could externalize, a lot of the features outside, of this application so, that only the model is hosted in the application.

So that's what we did, but the latency was still too slow — it was taking too long to serve results, and we couldn't deploy this at scale in production; it would be too disruptive to the user experience. We really hit a wall here, and this is a common problem, because each project has a unique model with its own characteristics — a completely unique system.

So we mulled this over and came to a realization: we should precompute everything. Of course, in this case it's not that easy to just precompute everything, because you don't know where users will be, you don't know when they'll open the application, and you don't know which user it will be, so the number of combinations is immense when you have many tens of millions of users. You need to somehow reduce the number of predictions, or recommendations, that you want to precompute, and we did that in two ways. First, we bucket locations into areas — we say all the locations within a football-field-sized area count as that area. Then we bucket time into windows of, say, 30 minutes or an hour. Then for each customer — for each of our millions of customers, for each of these tens or hundreds of thousands of areas, and for each of these time buckets — we produce a customized, personalized recommendation. But you're still talking about hundreds of millions, or billions, of recommendations for a single model — really, really big datasets, terabytes and terabytes of data. And of course the solution we chose was Bigtable to store this data; it's one of the few tools we have found that can actually store and serve all of these recommendations.

The way we changed Darius's system was as follows. Users take specific actions — say you buy a pizza on a Friday night; that is something that describes your behavior. We log that as an event on Kafka, and then we have a Dataflow job that hosts a TensorFlow model. This model streams in the event and produces personalized recommendations for that user based on their unique behavior, and it stores these recommendations — or updates the existing ones — in Bigtable. This process continually happens behind the scenes, and now our model service has changed: it's now a lookup service. All the lookup service does is look in Bigtable at a specific landing zone where that information is found, retrieve it, and serve it. If you think about this, the paradigm shift is very powerful as an engineer, because your machine learning system is completely disconnected from the serving infrastructure. If you delete or remove all of the ML pipelines and the event stream, you still have Bigtable, which is independently testable and whose characteristics you know: it can handle terabytes of data, it can serve responses at 10 milliseconds, and you'll never have an issue as long as you've tested it at least once for your use case. So this is what we presented to Darius, and when we looked at the performance, it was good enough — around 15 milliseconds, so you're always going to hit the target, and there's enough buffer.
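To make the lookup-service idea concrete, here is a toy sketch of the bucketing and the Bigtable read. The grid size, time window, row-key layout and table names are all assumptions made for illustration — the talk only describes the general approach:

```python
import datetime

from google.cloud import bigtable

def location_bucket(lat, lng):
    # Roughly football-field-sized cells: ~110 m per 0.001 degree near the equator.
    return "{:.3f},{:.3f}".format(lat, lng)

def time_bucket(ts, minutes=30):
    # Snap a timestamp to the start of its 30-minute window.
    snapped = ts.replace(minute=(ts.minute // minutes) * minutes, second=0, microsecond=0)
    return snapped.strftime("%Y%m%d%H%M")

def recommendation_key(user_id, lat, lng, ts):
    # Hypothetical row-key layout: user, area bucket, time bucket.
    return "{}#{}#{}".format(user_id, location_bucket(lat, lng), time_bucket(ts)).encode()

# Hypothetical Bigtable instance/table names.
table = (
    bigtable.Client(project="example-project")
    .instance("serving")
    .table("precomputed_recommendations")
)

def lookup_recommendations(user_id, lat, lng, when=None):
    """Serving path: a single Bigtable row read, no model in the request path."""
    when = when or datetime.datetime.utcnow()
    row = table.read_row(recommendation_key(user_id, lat, lng, when))
    if row is None:
        return []  # fall back to a non-personalized default feed
    cell = row.cells["reco"][b"items"][0]
    return cell.value.decode("utf-8").split(",")
```

The writer side (the Dataflow job) would use the same key function when it stores precomputed recommendations, so the serving path stays a single row read.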
Of course, this doesn't solve all of your problems as an engineer or data scientist: some models you can't precompute, and for those you'll always have to have a model in the serving path. But if you have the option, think about storing very large amounts of precomputed results as an alternative to serving models in production, because of how much easier that is to productionize.

So what was the impact of the changes we introduced for Darius? Originally, when we looked at Darius's workflow on a typical project, he was spending a lot of time provisioning infrastructure and installing software — all of these tasks that were not really data science related. We wanted him to focus more on what he's good at: creating features, developing models, evaluating experiments, and so on. So we introduced what is essentially a platform, built on Google Cloud. If you look at the data sources, you've got all the types of data source you want: a data lake, an event stream, and BigQuery as your data warehouse.

There's a single place where Darius can define his features, in Cloud Dataflow — he doesn't have to define features anywhere else. He can do inference in the stream on Dataflow by hosting models there, or alternatively host them on Cloud ML Engine. He's got his serving stores in Bigtable and Cloud Memorystore and a training store in BigQuery, and he doesn't have to manage those. Most of this infrastructure is managed, and not really managed by us as engineers, because in most cases it's all just cloud services. So all Darius actually has to worry about is doing modelling in his notebooks, finding features in the feature explorer, using BigQuery to create training sets, and using ML Engine to train those models. Of course, there will always be custom projects, but the key thing here is that we introduced technologies that allow Darius to operate at his abstraction layer, where he doesn't need to pull in engineers: he can do things on his own, he can produce results on his own, and he doesn't have to spin up infrastructure.

So now Darius is spending a lot less time on a typical project, and more time on developing models; because the time has been compressed, it's faster for him to get to market and he can take on more projects. The impact on Go-Jek is that our data scientists can now deliver projects faster. We have more touch points within our application, because we can deliver more models. Our customer experience has improved a lot, because the models are more accurate and because we have more features as well. We have fewer data scientists per customer, and this is a great thing, because when expanding to new markets you don't necessarily want to just add new data scientists — it's better to add leverage to the existing ones you have. And obviously having less infrastructure is a big one for us: less time spent managing infrastructure, and it costs less.

Just to recap the lessons. Let the data come to you: centralize your data, and don't do anything else until you've done that first, especially if you have a very big, geographically separated organization like we do. For features, you'll pay a cost once, but after that they should be free and available in all of your environments, discoverable and trackable — this is also a very important step. The third lesson was don't break abstraction: let your data scientists operate with tools at the abstraction layer they are comfortable with, where their domain expertise is. And finally, if you have the option, precompute everything — precompute your results.

Okay, that's it for me. Thank you.
