Cloud OnAir: Credit Karma improves efficiencies in financial assistance with BigQuery and Looker
Hello everyone, welcome to Google Cloud OnAir, a live webinar from Google Cloud. We host webinar sessions every Tuesday. My name is Megan, and this is my colleague Harish. Today we will be talking about how Credit Karma improved its working efficiency by using Google BigQuery and Looker dashboards. You can ask a question at any time on the platform, and we have Googlers standing by to answer them. Let's get started.

I'm a marketing analyst at Credit Karma, and this is Harish. Hi, this is Harish, a staff BI developer at Credit Karma.

Let me introduce Credit Karma for a minute. Credit Karma is a financial technology company, founded in 2007, best known for providing credit scores for free to our members. We provide data-driven insights and identify opportunities for our members to make the most of their money. Whether someone is interested in getting a better deal on a loan, like a credit card or mortgage, or simply needs to monitor their identity or credit information, we provide actionable tools that help them make financial progress.

Some quick stats about Credit Karma. We have more than 85 million members in the US and Canada; if Credit Karma were a country, it would be the 16th largest country in the world, ahead of Germany. Around $40 billion of credit has been originated through Credit Karma over the last ten years; that's bigger than the GDPs of Kenya, Jordan, and Iceland. Around 20 million Americans have applied for a product, like a credit card offer, on the Credit Karma platform; that's more people than were born in the US over the last five years. And around half of US millennials use the Credit Karma app; that's more than the number of millennials who watch Hulu or Amazon Prime.
As I mentioned, Credit Karma is a highly data-driven company; we provide actionable tools and help our members make financial progress. So, Harish, would you please introduce what our data infrastructure looked like in the past and what it looks like today at Credit Karma?

Absolutely, thanks Megan. Before 2016 we had an in-house columnar database, and we were spending a lot of time on scalability and performance tuning. On the scalability side, the team kept changing database configurations to support our growing data needs, spent a lot of time on hardware maintenance, and paid for expensive license renewals. On the performance side, query performance was really slow because most queries involved complex joins on the columnar database, which delayed our overall ETL processing time, and the analytics team was waiting a long time to see up-to-date data. So we ran a lot of POCs, and we finally found that BigQuery was the best database to support our business needs.

Let me cover what our data stack looks like now. Since 2016, Credit Karma's data has grown from 30 terabytes to 1.8 petabytes, and we collect nine terabytes of data every day. We also run many model predictions. That's the volume side, but we also built many API jobs that connect to various sources: for example, we connect to Salesforce for finance data, to Facebook for media data, and to Google AdWords for marketing data.
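As an illustration of how such an extract-and-load job can hand data to BigQuery, a common pattern is to serialize the fetched API records as newline-delimited JSON, the batch format BigQuery load jobs accept. This is only a hedged sketch; the record fields are invented for illustration, not taken from Credit Karma's actual pipeline:

```python
import json

def to_ndjson(records):
    """Serialize API records as newline-delimited JSON, the batch
    format accepted by BigQuery load jobs (one JSON object per line)."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)

# Hypothetical rows pulled from a marketing API: daily spend by channel.
rows = [
    {"date": "2018-06-01", "channel": "adwords", "spend_usd": 1250.0},
    {"date": "2018-06-01", "channel": "facebook", "spend_usd": 980.5},
]

payload = to_ndjson(rows)
print(payload)
```

The resulting payload would then be written to a file and loaded into a date-partitioned BigQuery table by the scheduled job.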
We run API jobs like these every day, land the data in BigQuery, and do our Looker reporting on top of it. On the other side, we have another type of job, mostly machine learning models, that also run on BigQuery: for example, forecasting our daily active users, forecasting how many emails we want to send and what kind of offers to send, and predicting approval rates. I would like to thank Google and their entire support team, who really helped us get to this position; this has added a lot of value to our day-to-day business.

We have three different use cases that show how we leverage BigQuery in our day-to-day work. The first use case is daily active user forecasting; Megan will walk you through this one.

Thanks, Harish. Let me define "daily active user" for a second. Daily active user is a metric that measures the number of members who come to our site every day; it reflects how we connect with our members and is also a measure of product performance. A better daily active user forecast model allows our stakeholders to monitor product performance very promptly. We use BigQuery to store both the model input and the forecasted results, then surface the results from BigQuery on a Looker dashboard, so our stakeholders can monitor performance and compare it against actual daily active users directly on the dashboard. In this way they improve the speed and quality of their business decisions for engaging with our active users.

Before we improved the data infrastructure and the pipeline, we had to retrain the model manually every time we launched a new initiative, so it took a long turnaround time to adjust for the impact of new initiatives.
We also had to collect the data from different channel owners and organizations, which made it hard to unify the model input and produce output promptly for our stakeholders. Because data collection was siloed, it added a layer of channel dependency to the pipeline, and human error could creep in during the process. So, Harish, would you please tell us how the business intelligence team at Credit Karma solved those challenges and improved the infrastructure?

Absolutely. We built an automated data pipeline that combines our expected email click ratios with actual email click ratios, and then forecasts how users will act on the emails we send. For this pipeline we use Airflow, an open-source tool from Airbnb for managing complex workflows and scheduling jobs.

First, let me cover how we gather the expected email clicks. The pipeline starts by reading a Google spreadsheet through a Python program. That sheet is maintained by our campaign managers, who continuously update the expected email and push-notification click volumes on a day-to-day basis. The Python program loads that data into BigQuery. On the other side, we query BigQuery again for the actual data. Once we have both the expected and the actual email and push-notification data, we combine them and prepare the dataset for the model. We then feed this dataset to our machine learning model, which predicts how daily users will engage with the emails we send: whether they will open them, read them, and act on them.
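The join step in this pipeline can be sketched in plain Python. The keys and field names below are illustrative assumptions, not the actual schema; in production the join would run as a BigQuery query inside the Airflow DAG:

```python
def build_model_dataset(expected, actual):
    """Join expected click ratios (from the campaign managers' sheet)
    with actual click ratios (from BigQuery), keyed by (campaign, date),
    keeping only keys present in both sources."""
    rows = []
    for key, expected_ratio in sorted(expected.items()):
        if key in actual:
            campaign, day = key
            rows.append({
                "campaign": campaign,
                "date": day,
                "expected_click_ratio": expected_ratio,
                "actual_click_ratio": actual[key],
            })
    return rows

# Expected ratios entered by campaign managers in the spreadsheet...
expected = {("june_push", "2018-06-01"): 0.12,
            ("june_email", "2018-06-01"): 0.08}
# ...and actuals measured in BigQuery (no actuals yet for june_email).
actual = {("june_push", "2018-06-01"): 0.10}

print(build_model_dataset(expected, actual))
```

Rows that exist only on one side (a campaign with no actuals yet) are dropped before training, which is one simple way to keep anomalous inputs out of the model.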
Finally, we write the forecast numbers back to BigQuery, and we connect Looker to BigQuery to see the overall performance of our daily forecast.

Thanks, Harish. Because of those improvements to the data pipeline, we were able to improve the flexibility and robustness of forecast models like the daily active user forecast. Our stakeholders get better, up-to-date reports directly on the Looker dashboard for their business planning, and they can monitor product and channel performance on the dashboard directly. It improved our working efficiency a lot. Next, Harish will talk about how Credit Karma used machine learning tools to improve email targeting efficiency.

Thanks, Megan. In this use case we are trying to improve the overall targeting efficiency of our email notifications. Before I proceed, let me explain what a model-driven email is. A model-driven email is driven by a machine learning model: we look at what kind of offer users are searching for, for example a credit card or a personal (non-mortgage) loan, and based on that requirement we predict the users who are most likely to open those emails, click on them, and then take the offer.

I will cover the data pipeline in more detail, but first I want to explain the challenges we had before. We have a huge number of variables to support: 85-million-plus members and a thousand-plus input variables. It was very difficult, time-consuming, and resource-intensive to collect all that data.
Sending all that data through the model, scoring it, and then sending emails out to users was almost a nightmare to run on the columnar database given the data volume. There were also no quality checks in place, and we didn't know what data anomalies were hiding in those email variables; many variables arrived blank or with no values at all.
There was no point in running the model on such values, and it was very difficult to identify those problems without an automated pipeline system.

Let me talk about how we solved those challenges. Again we used Airflow to implement an end-to-end pipeline. This pipeline collects data on all the users who acted on our emails in the last few months, runs the scoring, and finally sends out the emails. Let me describe in a little more detail how it works. The pipeline is actually built on three models. The first is impression-to-click: when we send an email to our users, did they see it and click on it. The second is click-to-application: after clicking on an offer, did they fill in their information on the partner's site and submit an application. The third is approval: once an application is submitted to our partners, did the partner approve the offer or not.

In the first step, we collect the datasets for all three models in BigQuery. For impression-to-click and click-to-application, we collect the data points from each user's recent monetization history, that is, how they acted on emails over the past few months. For the approval model, we rely on our online recommendation system team, who give us the approval odds for our members. Once we have all three datasets, we run the queries and store the data in Google Cloud Storage to prepare the training datasets. Then we send the data to TensorFlow.
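Once trained, the three conditional models can be composed into a single expected-approvals-per-impression score used to rank members. This is a toy sketch with invented probabilities, not the production TensorFlow code:

```python
def funnel_score(p_click, p_apply, p_approve):
    """Expected approvals per email impression: the product of the three
    conditional model outputs (impression->click, click->application,
    application->approval)."""
    return p_click * p_apply * p_approve

def rank_members(model_outputs):
    """Rank member ids by composed funnel score, best candidates first.
    `model_outputs` maps member_id -> (p_click, p_apply, p_approve)."""
    return sorted(model_outputs,
                  key=lambda m: funnel_score(*model_outputs[m]),
                  reverse=True)

model_outputs = {
    "member_a": (0.20, 0.30, 0.50),   # ~0.030 expected approvals
    "member_b": (0.40, 0.10, 0.90),   # ~0.036
    "member_c": (0.05, 0.60, 0.80),   # ~0.024
}
print(rank_members(model_outputs))    # member_b ranks first
```

Note that a member with a modest click probability (member_b) can still rank first once the downstream approval odds are factored in, which is exactly why the three models are composed rather than used in isolation.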
TensorFlow is another cloud service we get from Google, where we can run our machine learning models over these three datasets. The models score every member on thousands of variables, to make sure we rank every member correctly for the offers they should receive. The scoring is calculated entirely by these models and stored back in BigQuery.

Once the scoring set is ready, we drive it through a campaign sheet, where we maintain the different email campaigns. Our campaign managers come in and keep adding new campaigns, modifying existing ones, and changing each campaign's criteria. For example, if I want to send an email to users who have a credit score between 600 and 700 and are in the 30-to-35 age band, they set those criteria in the sheet and choose which offer we need to send to those users. Our pipeline automatically reads the sheet, combines it with the scoring dataset, and runs our targeting queries in BigQuery. The targeting queries figure out who is eligible to receive an email. Once the targeting queries are ready, we send the data to Hermes, an internal tool that allows us to send emails to our members with their specific offers.
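A campaign-sheet row like the one above can be rendered into a targeting query mechanically. The table and column names below are invented for illustration; the real pipeline reads the sheet and templates its queries inside Airflow:

```python
def build_targeting_sql(campaign):
    """Render a BigQuery Standard SQL targeting query from one row of
    the campaign sheet. Schema names here are illustrative only."""
    filters = [
        "credit_score BETWEEN {score_min} AND {score_max}".format(**campaign),
        "age BETWEEN {age_min} AND {age_max}".format(**campaign),
        "funnel_score >= {min_funnel_score}".format(**campaign),
    ]
    return ("SELECT member_id\n"
            "FROM scored_members\n"
            "WHERE " + "\n  AND ".join(filters))

# Example row: score band 600-700, age band 30-35.
campaign = {"score_min": 600, "score_max": 700,
            "age_min": 30, "age_max": 35,
            "min_funnel_score": 0.02}
print(build_targeting_sql(campaign))
```

Generating the SQL from the sheet is what removes the manual step: campaign managers edit criteria, and the next scheduled run picks them up without anyone rewriting queries.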
Harish, would you please explain what Hermes is at Credit Karma and how we utilize it to send emails to our targeted users? Great question. Hermes is an internal tool for user-level targeting; it allows marketing analysts to write their own SQL queries, decide the range of users they want to email, and choose what kind of offers to filter out.

Overall, this data pipeline improved a lot more than just efficiency. It helps us send the right offer to the right users, which improved our overall engagement on emails, and it reduced the resources needed to collect the data, refresh it, run the model, and send up-to-date emails each time. This whole automated system saved a lot of manpower, improved overall data processing time, and added a lot of value to our business. We also have Looker pointing at the email campaign data, where we measure how each campaign is performing. For example, if we send a Chase card offer to particular members, how many members opened those emails, how many clicked, and how many took the offer from us. We continuously track this on the Looker dashboards, and based on that we go back and re-engineer our entire dataset and model, to make sure we have the right data and the right feature set and keep delivering the best emails we can to our members.

This is really helpful, because email is one of our most efficient ways of targeting and engaging with our users; sending the most relevant message at the right frequency is critical to our business. Next, Harish is going to wrap up with our last use case: how we obtained a better model for user acquisition and better offer pricing on paid search media.

Absolutely. This is one of our newer use cases, where we are trying to improve the overall user engagement and user experience.
When users are looking for any offer on paid search and display media, here is how it works. We use Google's return-on-ad-spend (ROAS) algorithm, which allows us to bid in auctions at user-level granularity. This also helps the overall quality of new-user engagement based on their monetization, that is, how users clicked on emails in the past and how they took offers through Google Ads. Ultimately this increases overall efficiency and the revenue Credit Karma earns against our targets.

Before I explain how this data pipeline works and how we improved the automated ad campaign features, let me first explain the challenges we had: why we got less traffic on our ads in the past, and why we were getting a flat value on every transaction. Initially we had a flat-rate concept on every credit card application coming through ads.
That flat value was based on what kind of users came through Google and took offers from us in the past. Those users come from many different segments: different credit score bands, different platforms like mobile or web, different recency windows (the last 7 days, 10 days, or months), and different sources, organic versus display media. Based on this mixed traffic, we were assigning an average weight to the offers. That average-weight scoring was fine for campaign-level analytics, but it was not good for user-level scoring. Ultimately we were getting a flat rate for every credit card transaction, which slowed down our overall ad revenue growth.

We also had difficulty identifying issues in the pipeline. As I mentioned, different user segments come to the site and take offers from us; the data was not normally distributed, and we weren't sure what kind of volumes were coming in or how the data was arriving. Meanwhile we were spending a lot of money on our ad campaigns, so overall this was hurting the transaction revenue on our credit card applications from ads.

Here is how we solved those challenges. We built a data pipeline that automatically collects all the users' history of how they acted on Google ads; for users coming in with specific segments, we make sure we collect that whole dataset over a few months.
We went into our database in BigQuery and figured out what the user behaviors are. It's not just about five segments; it's about thousands of variables for a given member: how they actually came to the site, at what time, from where, and what kind of offer they were looking for, not just on our website but across media overall. We collect all those data points, store them in Google Cloud Storage, and then run our machine learning model to figure out each user's propensity to take an offer from us. Once that dataset is ready, we send it back to Google. Google AdWords has its own algorithm, ROAS (return on ad spend); it takes this dataset, runs its own scoring on it, and determines which users are most likely to take a particular offer from our ads. Ultimately we send all this data back to BigQuery, and again we have Looker connected to BigQuery to watch the overall performance.
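The shift from a flat per-transaction value to user-level values can be illustrated with a toy comparison (all numbers invented): the flat approach bids the same segment average for everyone, while the user-level approach bids each user's propensity times the offer payout.

```python
def flat_segment_value(historical_payouts):
    """Old approach: one average value applied to every transaction
    in a mixed-traffic segment."""
    return sum(historical_payouts) / len(historical_payouts)

def user_level_value(propensity, payout):
    """New approach: expected value per user, which the
    return-on-ad-spend bidder can act on at auction time."""
    return propensity * payout

payouts = [40.0, 40.0, 120.0, 0.0]     # past payouts in one segment
print(flat_segment_value(payouts))      # 50.0, bid for every user alike

# Per-user expected values diverge sharply from the flat 50.0:
print(user_level_value(0.90, 120.0))    # likely converter, rich offer
print(user_level_value(0.05, 40.0))     # unlikely converter
```

With the flat average, the bidder overpays for the unlikely converter and underbids for the likely one; user-level values let the auction allocate spend where conversion is probable.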
On those dashboards we can see how things are performing: how many and what kind of users are coming every day, what kinds of offers they are looking for, and where we can improve. Ultimately this helped us a lot. There is no manual work: BigQuery queries execute automatically on a schedule, collect all the data points on user behavior, and send them to Google AdWords, which applies its own scoring and targeting. It aligns the incentives between our Credit Karma data in BigQuery and Google's bidding, it uses our proprietary data, and it makes our media channel more profitable.

Thanks, Harish, for the amazing presentation. From an analyst's perspective, it is really critical and helpful for us to get better efficiency by using Google Cloud Platform and BigQuery, as well as Looker dashboards. As these three examples show, we can reduce the turnaround time for generating forecast models for our stakeholders, improve email targeting efficiency, and get a better click-to-conversion ratio in a very cost-effective way. I hope you enjoyed this webinar session as much as we did; stay tuned for the live Q&A, and we will be back in a minute.

Welcome back. Let's get started with some live questions. The first one: have you used Cloud Dataflow for the pipeline as well, and are you doing this in real time or as a batch process? Great question. We have done POCs on Dataflow, but that's nothing we have put in production; mostly we are using Dataproc and TensorFlow for our pipelines. We are also doing a lot of in-house testing on real-time model building, but currently most of it is batch processing.
Next question: do you have any tips to optimize the performance of BigQuery? Yeah, it depends on what kind of queries you are running.
If you are using legacy SQL queries, we definitely encourage you to use standard SQL, and make sure you keep your joins in mind. It also depends on what time you run those queries: if you have a limited license and many analytics teams using BigQuery at the same time, I would recommend running the heavy queries in a batch overnight.

Next question: how does Looker connect to BigQuery? In Looker, we first define the connection to BigQuery in the admin panel, where we have full control over connecting to any database. Then, with the LookML feature, you just import a BigQuery table and it automatically builds the LookML for you; all you need to do is refresh the model, then drag and drop, and start visualizing your BigQuery data live.

Next question: what advice would you give to someone starting this process now? That depends on a lot of things, for example what kind of infrastructure you have and what kind of instances you want to run. If you want to run on a smaller dataset, that's fine: you can spin up small GCP instances, create a small project, start with a limited license, and continue from there. But if it's a medium or large-scale business, you have to make decisions on the architecture side: how many instances and how many projects you need. You ultimately don't want one project that everyone in your company uses, because of the limited concurrent query limit; you have to split projects out and consider your organization's numbers, like how many people are going to run queries and what kinds of queries they will run.
If you organize it well, then of course this is a perfect use case: you can continuously scale your instances on the cloud and keep running your queries.

Next one: how up to date is the data shown on a Looker dashboard? That also depends on how you want to refresh the data. If you want to refresh on a scheduled basis, Looker has an option where you can create derived-table SQL queries and set a trigger that automatically clears the cache. That way, when someone opens the Looker dashboard, it hits the LookML behind the scenes, runs the query, clears the cache automatically, and shows up-to-date data. But sometimes people don't want to create derived SQL queries, because they come with a performance impact; in that case, you can always go to the Looker dashboard, click the clear-cache option, and it will simply hit the BigQuery table and give you up-to-date data on your dashboard.

Okay, cool. Next question: what made you decide to switch to BigQuery?
This is something we discussed at the beginning; it depends where you are coming from. We came from a columnar database system, which came with a lot of maintenance issues and performance issues. Our ETL queries are very heavy, running over terabytes of data every day, and they really slowed down performance because of the complex joins; ultimately our ETL was delayed and the analytics teams weren't getting data on time. There was also a lot of maintenance and infrastructure cost, in terms of license renewals. If you consider all those factors, Google BigQuery is a place where you can go to the cloud and spin up an instance quickly; it's easy to get started, and a lot of features are readily available. You just spin up the instance, create your project, and you can scale on demand. You don't need much interaction with any support team, because it's so self-explanatory: you can simply go to the instance, decide your own concurrency limits, define your own cluster, and start using the database.

Cool, I think that's all the questions we got. Thanks for joining us for this webinar session. Please stay tuned for the next session, Google Cloud Internet of Things: from devices to cloud, delivering business outcomes in a secure way. Thank you.