Hello, everyone, and welcome to our presentation, "Advertising Fraud Detection at Scale at T-Mobile." My name is Eric and I'm a data scientist, and I'm joined by Phan, who is a data engineer. Both of us are on T-Mobile's Marketing Solutions team, which is a new team within T-Mobile that provides services to the advertising industry. We'll be talking today about a research product we developed, intended to identify potentially fraudulent activity using data gathered from T-Mobile's network, while discussing some of the challenges we faced and the tips we picked up in the process of developing this tool.

To give you an overview of the presentation: we'll start with some background information about the ad tech industry. Then Phan will talk about Spark tuning and optimization. Then we'll discuss writing Spark code for those with a Python background, along with some specifics about the algorithms we are using to identify fraud. And finally, Phan will finish by explaining our process of automating the project and our use of MLflow.

Before we go into the details of the project, I want to give some background about the ad tech industry for those less familiar, because it will help explain why ad fraud is such a difficult and pervasive issue. To give you an idea of the scale, it is estimated that advertising fraud cost advertisers about 23 billion dollars in 2019, and that number is expected to grow in the coming years. You could say there are two main reasons why ad fraud is so prevalent. The first is a general lack of regulation around online advertising; that changed a bit after CCPA started being enforced earlier this year, but federal and global regulations are still fairly lax. The second is that the ad tech industry is fairly complex and not well understood by most people. With so many moving parts, it can be extremely difficult to monitor everything that goes on during an ad transaction, especially when the average American is exposed to about 5,000 ads every day.
To give you an example of the complexity of ad tech, here's a diagram of the main components involved in an online ad campaign. You're likely familiar with the idea of an advertiser and a publisher, but there are a few more components that control the movement of information and determine who sees which ad and where. For each impression, this entire process of deciding which ad to show, how much it will cost, and who will see it takes less than a second, and each moving part presents an opportunity for fraudulent activity to occur. Since it is almost always the advertiser who loses the most money, though, the fraudulent activity is most likely to occur on the publisher's side.

This becomes more apparent when we dig into the two main types of fraud. The first is the bot farm. A bot farm is essentially a collection of devices whose sole purpose is to browse a particular website or app to generate ad impressions for that publisher, with the publisher benefiting from the impressions and the clicks. To be clear, everyone except the advertiser makes money in this scenario. The other main type of fraud is referred to as domain spoofing. This can take several forms, but it is simply the process of a publisher sending a URL to the SSP that is not the publisher's actual URL, to trick everyone upstream into thinking they are sending their ad to a legitimate website.

This leads us to the question of how we identify fraudulent activity. The first thing we need is data: we need to be able to see and track what is going on in the billions of online ad transactions that come through our network. Being a telecom company, we have the advantage of being able to utilize all of the network data used by mobile devices. Second, we need to use this data to develop a model that can identify several different types of fraud. And lastly, we need to be able to scale this model to handle the four to ten terabytes of data T-Mobile collects every day.

There are a number of platforms that we use at T-Mobile, including both on-prem and cloud-based platforms, and as you can see here, there are instances where we use a hybrid approach. But the most important thing is to develop a working pipeline that can extract data from its source so that the data can be used to train a model and generate the required outputs. The process always carries data from a traditional format, whether that be ORC, Parquet, or CSV, to the desired output, whether that be a table of metrics or a dashboard for visualizing data. Many data scientists like myself are used to working with Python and SQL to accomplish some or all of these tasks, but when the scale of the data becomes great enough, for example when you're dealing with terabytes worth of data, Spark becomes a necessary tool. Hive and SQL don't really have the machine learning tools required to do most data science work, and Python just can't handle such large amounts of data. So with that in mind, I'm going to pass it over to Phan, who is going to go into some more detail about Spark.

Thank you, Eric. My name is Phan, and I'm a data engineer on the data science team in T-Mobile's Marketing Solutions group. Working with Spark daily means you are going to deal a lot with resource allocation, as well as reading, writing, joining, and aggregating, which happen to be our topics today.
When you start a Spark application, it uses static allocation by default, which means you have to tell the resource manager up front, for example, how many executors you want to work with. The problem is that you usually don't know how many executors you actually need, unless the job you're running is an obvious one.
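As a point of reference, static allocation is just a fixed set of resources requested when the session starts. Here is a minimal sketch, with purely illustrative numbers rather than our production settings:

```python
from pyspark.sql import SparkSession

# Static allocation: a fixed number of executors for the whole application.
# The values below are illustrative examples only.
spark = (
    SparkSession.builder
    .appName("static-allocation-example")
    .config("spark.executor.instances", "50")   # fixed executor count
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)
```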
You often allocate too few, which makes the job very slow or even makes it fail, or you allocate too many and waste a lot of resources. In addition, those resources are occupied for the entire lifetime of your application, which also leads to resources being wasted unnecessarily. That's why the more flexible mode, dynamic allocation, comes into play. With this mode, executors spin up and down depending on what you're working with, as you can see in the chart on the left. This mode, however, comes with a bit of setup. You may want to set the initial and minimum executors; as the names suggest, these are the minimum resources for your app. They take a bit longer to spin up, but your jobs execute faster once they're available. You may also want to set the maximum executors, because without any limit your app can take over the whole cluster, and I'm talking about 100 percent of the cluster. The executor idle timeout config keeps your executors alive a bit longer than they otherwise would be, waiting for a new job; without it they can be killed for idling, and a new job will then require new executors, which need some time to start and cause additional delay. The cached executor idle timeout config controls when executors holding your cached data are released. Mind you, if you don't set these up and you cache data fairly often, dynamic allocation will not do much better than static allocation.

Spark doesn't read files individually; it reads them in batches, trying to merge files together until a partition is filled up. Think about this unusual case: you have one million files in an HDFS directory holding one gigabyte of data in total, so about one kilobyte per file. With the maxPartitionBytes configuration at 8 megabytes, you end up with about 500,000 tasks, and the job will most likely fail. The chart on the bottom left shows the correlation between the number of tasks and the partition size, and it's simple: the bigger the partition you pick, the fewer tasks you will have. Too many tasks can make the job fail with a driver memory error or a network connection error; too few tasks will normally slow the job down because of a lack of parallelization, and will often trigger executor memory errors as well. Fortunately, we found a formula that calculates an ideal partition size. It involves the data size, the number of files, the number of directories, the number of executors, and the number of cores. Apply all of those parameters to this formula and, for the example above, you get a partition size of about two gigabytes and about two thousand tasks, a much more reasonable number.

When you join tables or do similar operations, Spark will shuffle data. Shuffling is a very expensive job: your entire data set gets moved through network connections between executors. Tuning this shuffle behavior, however, is an art. Take a look at this small example, where there is a mismatch between the shuffle partition config shown here and the number of threads actually available from the executors and cores in this case. The ideal number of shuffle partitions can be calculated with this formula, essentially the number of executors times the cores per executor. You can easily get the number of executors if you run in static allocation mode, but what about dynamic allocation? Use the maximum executors if you have set it, or if you haven't, let me tell you a secret: try using the number of subdirectories of the folder you are reading from.
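Putting those pieces together, a rough sketch of what this looks like in configuration terms might be the following. The property names are standard Spark settings, but the numeric values and the helper function are illustrative assumptions rather than our exact production setup:

```python
from pyspark.sql import SparkSession

# Dynamic allocation: executors scale up and down with the workload.
# Property names are standard Spark settings; values are illustrative.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "10")
    .config("spark.dynamicAllocation.initialExecutors", "10")
    .config("spark.dynamicAllocation.maxExecutors", "500")
    .config("spark.dynamicAllocation.executorIdleTimeout", "120s")
    .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "600s")
    .getOrCreate()
)

def ideal_partition_bytes(data_bytes, num_files, num_executors, cores_per_executor,
                          open_cost_bytes=4 * 1024 * 1024):
    """Partition size ~ (data size + files * open cost) / (executors * cores)."""
    return (data_bytes + num_files * open_cost_bytes) // (num_executors * cores_per_executor)

# The 1 GB / 1,000,000-file example from the slide, assuming 500 executors
# with 4 cores each (so executors * cores = 2000):
part_bytes = ideal_partition_bytes(
    data_bytes=1 * 1024**3,
    num_files=1_000_000,
    num_executors=500,
    cores_per_executor=4,
)
spark.conf.set("spark.sql.files.maxPartitionBytes", str(part_bytes))  # ~2 GB here

# Shuffle partitions ~ executors * cores; under dynamic allocation, use
# maxExecutors or the number of input subdirectories as the executor count.
spark.conf.set("spark.sql.shuffle.partitions", str(500 * 4))
```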
Writing in Spark is a bit interesting. See the live example here: if you have a thousand executors, you get a thousand files. The point is, we can't know whether that is too many files or too few until we know the data size. From our experience with huge data sets, keeping the file size under about one gigabyte, or keeping the number of files under a couple of thousand, is good enough. To do that, you can use either coalesce or repartition. Coalesce is good enough in general if you just want to scale down the number of files. Repartition, on the other hand, adds a shuffle stage, which will actually speed up the job if your data is skewed, like the example on the right; of course, it is more expensive.
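A minimal sketch of the two write paths, with illustrative paths and partition counts, might look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000_000)  # stand-in for a large DataFrame

# coalesce: no shuffle, so it is the cheap way to cap the number of output files
(df.coalesce(2000)
   .write.mode("overwrite")
   .parquet("/tmp/example/coalesced"))

# repartition: adds a full shuffle, which costs more but rebalances skewed data
# so that tasks (and output files) come out roughly the same size
(df.repartition(2000)
   .write.mode("overwrite")
   .parquet("/tmp/example/repartitioned"))
```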
Now I'd like to show you a quick demo of this Spark tuning and how effective it is in our case. Our demo today has two sections. First, we'd like to show you our customized kernel and its features; we call it mySpark. This kernel is our attempt to work with Spark from Jupyter in an on-premises environment. It allows us to set up Spark configurations in the notebook itself, without restarting Jupyter. That is extremely useful for data engineers who have to work with different scales of data and need a different set of configs for each data set. It's integrated with SQL, so instead of typing dataframe.show(20, False) all the time, you can simply use the SQL magic cell, type your query, and run it. It also applies a permanent fix for the broken formatting you get when displaying Spark DataFrames, and it connects to the Spark UI to give you a quick view of the current job's progress.

Now let us show you the optimization techniques we mentioned, applied to two data sets. They are almost exactly the same in terms of data volume and structure: around 740 directories and about a terabyte daily. The main difference is that the first one has about 250,000 files and the second one has only about 1,100. If we read them both using the default configs, the first one gives you 16,000 partitions and therefore 16,000 tasks when you perform an action that goes through the entire data set; it took about 20 minutes to finish. The second one has far fewer files, so the number of partitions is about half that of the first one, and it took only five minutes. Now let's apply the formula we showed you earlier: the partition size is equal to the data bytes plus the number of files times the open cost in bytes, which is four megabytes here, all divided by the number of executors, which under dynamic allocation is the number of directories, times the number of cores from our configuration. We print out some useful information so we can refer back to it later, and then we run the same code as before. You can see the number of partitions drops to less than 1,000, and the execution time is cut in half. What if we apply the same fix to the second data set? We end up with about 800 partitions and tasks, and less than two minutes to complete the job.

Now Eric, our data scientist, will go through another aspect: how a data scientist works with PySpark.

Many data scientists code in Python. It is the programming language I am most familiar with, and I know there are many other data scientists with similar backgrounds. Making the transition from Python to PySpark can be challenging for anyone, so I'm going to talk through an example of some code I needed to write in PySpark and how I got there with the help of a UDF. For a lot of what a data scientist does on a day-to-day basis, there are similar functions and methods in Spark that will be familiar to someone used to working with the pandas and scikit-learn libraries. For example, reading a CSV file, splitting your data into train and test sets, and fitting a model are all things that have similar syntax between the two, and if you are using only predefined methods and classes, writing code in Spark may not be very difficult.
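As an illustration of that familiar read, split, and fit pattern, here is a minimal PySpark sketch; the file path and column names are placeholders, not our actual data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()

# Read a CSV file, much like pandas.read_csv
df = spark.read.csv("/tmp/example/train.csv", header=True, inferSchema=True)

# Spark ML expects the features in a single vector column
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"], outputCol="features")
df = assembler.transform(df)

# Split into train and test sets, much like train_test_split
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Fit a model and score the test set, much like scikit-learn's fit/predict
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(test)
```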
But there are a number of instances where converting your Python knowledge into PySpark code is not straightforward, whether that is because the code for a particular machine learning algorithm isn't in PySpark, or because the code you need is very specific to the problem you're trying to solve. Sometimes you need to write your own code in Spark, and with Spark's cryptic error messages to guide you, that may seem like a daunting task. So I'm going to show you a piece of code that I wrote for this ad fraud project, first as a UDF and then as native Spark code, and then talk through how I got there.

As someone who is more comfortable writing in Python than in Spark, I decided to first write out my code as a Python UDF, which you can see here. Once I did that, I broke my code down into logical snippets, lines of code that each performed a fairly basic task, as you can see outlined in the colored boxes. From there, I was able to find a function, or more likely a group of functions, that served to accomplish the task each snippet set out to complete. After doing this for each of the logical snippets, I just had to make sure that the bits of code fit together and test that the end result was more or less the same. Then I was left with code that ran in Spark rather than in a Python UDF. And to justify this effort, here are the results showing the difference in performance between the UDF and the PySpark code: the PySpark code ran over twice as fast, which for a piece of code like this that runs every day means saving many hours of compute time over the course of a year.
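The code from the slide isn't reproduced in this transcript, so as an illustration only, here is the same kind of translation applied to a toy example: a row-at-a-time Python UDF rewritten with built-in column functions. The column names and the threshold are made up for the example.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Toy data: clicks and impressions per app (placeholder schema)
df = spark.createDataFrame(
    [("app_a", 12, 20), ("app_b", 1, 500), ("app_c", 0, 0)],
    ["app", "clicks", "impressions"],
)

# 1) Python UDF: easy to write, but every row is shipped to a Python worker
@F.udf(returnType=StringType())
def flag_ratio(clicks, impressions):
    if not impressions:
        return "unknown"
    return "suspicious" if clicks / impressions > 0.5 else "normal"

with_udf = df.withColumn("flag", flag_ratio("clicks", "impressions"))

# 2) The same logic with built-in functions: stays in the JVM and can be
#    optimized by Catalyst, which is where the speedup comes from
with_native = df.withColumn(
    "flag",
    F.when(F.col("impressions").isNull() | (F.col("impressions") == 0), "unknown")
     .when(F.col("clicks") / F.col("impressions") > 0.5, "suspicious")
     .otherwise("normal"),
)
```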
So now I want to take a minute to talk briefly about one of the metrics we use to identify potentially fraudulent activity in our network. The metric is called normalized entropy. For the data scientists watching who are familiar with decision trees, you may have used or know of Shannon entropy, a metric commonly used to calculate information gain. Normalized entropy in this case is a normalized form of Shannon entropy. For those unfamiliar, Shannon entropy is defined by the equation in the top right as

    H = -Σ_i P(x_i) log P(x_i),

where x_i in this case is a T-Mobile customer and P(x_i) is the number of times a given app shows up for customer x_i on our network, C(x_i), divided by the number of times that app shows up across the entire network, C(X). We get the normalized entropy by dividing the Shannon entropy by the maximum entropy for a particular app; the equation for this is given just below the other one, and it works out to

    NE = 1 - Σ_i C(x_i) log C(x_i) / (C(X) log C(X)).

You can then multiply by 100 to get a value between 0 and 100.

The idea behind this metric is fairly simple. Since common apps tend to get intermittent traffic from a number of different people, we would expect safe apps to have a value of around 40 or so, give or take a couple of standard deviations. You can see this in the histogram in the bottom right corner, which shows the distribution of normalized entropy values for all the apps we're tracking. Values that are close to zero or close to 100, on the other hand, score higher in terms of their potential fraudulence. This is of course not a flawless metric, which is why we include several others in the final analysis. For example, some people only use their banking apps every few days, so apps like these would have a higher normalized entropy but are unlikely to be fraudulent.

Now, back to Spark. The code you see on the top left is the code we are currently using to calculate normalized entropy, and what you see on the right is the time difference between running this code with the default configuration and with an optimized configuration. The optimized config sets the number of executors to 100, with 4 cores per executor, 2 gigabytes of memory, and shuffle partitions equal to executors times cores, in this case 400. As you can see, the difference in compute time is significant, showing that even fairly simple code can benefit greatly from an optimized configuration, saving you a lot of waiting time. This is a great skill for any data scientist analyzing large amounts of data in Spark.

The last thing I wanted to bring up with regard to Spark configurations is the difference between static and dynamic allocation. The two previous configs were static; when I use dynamic allocation, with the dynamic equivalent of the static config shown on the top right, the dynamic config takes almost twice as long to compute the same data using the same code. The reason is that the Spark cluster will initially spin up only the minimum number of executors; it is only when you start running your code that the number of executors increases, and only as needed. This adds time, since executors do not spin up immediately, and that can extend your compute time. It matters most when running shorter jobs and when executing jobs in a production environment, so if you are doing something like EDA or debugging, dynamic allocation is still probably the way to go.
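The normalized entropy code on the slide isn't reproduced in this transcript, but a minimal sketch of the per-app aggregation it describes, assuming one input row per (app, customer) request and using built-in column functions, could look something like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder events: one row per (app, customer) request
events = spark.createDataFrame(
    [("app_a", "cust_1"), ("app_a", "cust_1"), ("app_a", "cust_2"),
     ("app_b", "cust_1"), ("app_b", "cust_2"), ("app_b", "cust_3")],
    ["app", "customer"],
)

# C(x_i): per-app, per-customer request counts
per_customer = events.groupBy("app", "customer").agg(F.count("*").alias("c_xi"))

# NE = 100 * (1 - sum_i c_xi*log(c_xi) / (C*log(C))), where C = sum_i c_xi.
# Apps seen only once would divide by zero (log(1) = 0), so they are filtered out.
normalized_entropy = (
    per_customer.groupBy("app")
    .agg(
        F.sum("c_xi").alias("c_total"),
        F.sum(F.col("c_xi") * F.log("c_xi")).alias("sum_c_log_c"),
    )
    .where(F.col("c_total") > 1)
    .withColumn(
        "normalized_entropy",
        100 * (1 - F.col("sum_c_log_c") / (F.col("c_total") * F.log("c_total"))),
    )
)
normalized_entropy.show()
```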
Now that we have scaled our model to cover terabytes of data, we are ready for the next step: productionizing it and turning it into a complete internal product. This is a quick overview of our ad fraud platform. The data pipeline at the bottom of the architecture, which is what we talked about in the last few minutes, is scheduled and executed on our Hadoop production environment. The output of the algorithm, as well as the other measurements you can see, is aggregated and transferred into an intermediate platform in the middle here, where web applications, analytics tools, and alerting systems can reach it efficiently. As you can see, we also implemented an anomaly detection model, and together with MLflow, which we're going to talk about shortly, we have a complete workflow to automate, monitor, and operate this product.

In this project, MLflow plays an important role in the monitoring phase because of its simplicity and its built-in functions for tracking and visualizing KPIs. With a few lines of code, an engineer is able to log hundreds of metrics in no time.
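As a minimal sketch of what that kind of KPI logging looks like with the MLflow tracking API (the experiment name, metric names, and values here are placeholders, not our real KPIs):

```python
import mlflow

# Placeholder daily KPIs; in practice these come out of the pipeline run
daily_kpis = {
    "apps_scored": 41250,
    "median_normalized_entropy": 38.7,
    "pct_apps_flagged": 1.9,
}

mlflow.set_experiment("ad-fraud-daily-monitoring")  # assumed experiment name

with mlflow.start_run(run_name="daily-run"):
    mlflow.log_param("model_version", "v1")  # placeholder parameter
    for name, value in daily_kpis.items():
        mlflow.log_metric(name, value)
```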
We tag the metrics by date for visualization purposes, and with the anomaly detection model in place, we can observe our machine learning model's output without constant monitoring.

The last thing I want to talk about is how our marketing team would use the data generated by this project. We created a UI to help navigate the information provided. Navigating to the actual project, we can search for information on an individual app. For example, we searched for an app from a developer that was banned from the Google Play Store in July of last year. We can see quite a bit of data collected about this app, including basic app information, negative patterns, and the trust score generated by the ad fraud project. The marketing team will then be able to make a decision, based on this information, about whether they are comfortable publishing ads on this app. When we search for the New York Times app, we can see all of the data associated with it, including its rating and trust score. Other features of the UI include distributions and information about the metrics used to generate the trust score, and the trust score itself, which will help marketers determine what trust score they are comfortable with; an alternate search tool that allows for sorting and filtering by different metrics; and a monitoring page that shows a selection of KPIs and their values over time, including potential anomalies.

As some of you might know, validation is a very difficult process when it comes to something like fraud, due to the lack of a clear definition of fraud, and it is very important from a legal and moral perspective to make sure it is done correctly. We are currently working with our marketing team to run A/B testing to validate our process and make improvements accordingly.

And that is our presentation for today. I hope you enjoyed it, and we will now start taking questions.