IT Expert Roundtable: Using Naïve Bayes & Machine Learning Server to improve data quality
Hi. Welcome to the IT Showcase Expert Roundtable. I'm Sherry Bateen, and I'll be your host. We are the IT Showcase team, and we love to talk to you, our customers, about how Microsoft does IT. We've recently published a paper on how Microsoft weeds out fake marketing leads using Machine Learning Server, and today I'm here with the experts who helped write that paper. This is your opportunity to ask direct questions of our SMEs and receive candid answers. You can post your questions in the Q&A window, and I'm going to ask our SMEs to introduce themselves. We'll start with you, Kavita.

Hi, I'm Kavita. I'm a software engineer in the Core Services Engineering organization. I develop services and machine learning models in the marketing space.

Hi, my name is Ashish, and I'm a program manager within Core Services Engineering, the same group as all three of us. I'm responsible for machine learning driven initiatives that focus on lead quality, from a lead's inception to its consumption.

Hey, I'm Shishir Banki. I'm a senior software engineering lead. I manage marketing automation and the global demand center for Microsoft, and as a co-responsibility I work in the machine learning space on how we can improve data quality with machine learning in sales and marketing.

Very good. We are going to make every effort in the next hour to answer all of your questions. However, if we can't, we'll stay behind in the studio, continue answering questions, and post that extended footage with the video at microsoft.com/itshowcase. We'll start with a few questions of my own for these experts. Ashish, could you tell us a little bit more about the group that you work in?

Sure. We are all part of Core Services Engineering, and within Core Services Engineering we are responsible for providing different kinds of services and support to all of Microsoft, whether that be for online customers or commercial customers.
Within that, our group is focused on sales and marketing related activities and on supporting the sales and marketing groups. The three of us and our extended team focus specifically on lead quality and machine learning driven initiatives, making sure we are leveraging machine learning technologies to solve problems for customers within marketing and sales.

Shishir, could you talk us through what the solution was before your machine learning solution?

Yeah, sure. In sales and marketing, as Ashish said, we have a service in our space called the Customer Data Enrichment Service. As customers sign up for trials, or for a webinar like this one, or register for an event, or download a white paper, they provide some information to Microsoft, and that information becomes a lead for Microsoft. That's where we were observing leads with a lot of fake information and bad quality data. As part of this Customer Data Enrichment Service, we looked at how we can validate the data before we start matching and enrichment on our side. So we put a static solution in place, which was a list of profanity words across different languages. We were definitely able to catch some of that bad quality data, but we soon realized the solution was not good enough to scale, because we were seeing a lot of variations of those profanity words, and it had a lot of limitations: the moment gibberish words came in, the solution was not able to catch them. That's where we started venturing into how machine learning can help us solve this problem and, basically, improve lead quality.

Good. Kavita, could you walk us through the solution?

Sure, Sherry. So, we host a service called the Customer Data Enrichment Service. What the service does is take lead information as input. A lead is any person who is interested in a Microsoft service or product, and we get lead information through a variety of sources, like Shishir mentioned. During events, or when a lead signs up for a webinar or downloads a white paper or a trial, we capture some information about the lead: usually their name, the company name, and perhaps a phone number and an email. It gets sent to our system for matching and enrichment before it gets sent to sales. We do matching and enrichment mainly so that sales can then have a more engaging conversation with the lead. Prior to doing matching and enrichment, we realized that there's a lot of profanity and gibberish, a lot of lead information coming in from people who are really not serious about learning about or buying a Microsoft product or service. We want to filter such information out of the system. For that we have a number of data quality services, and the frivolous company service is one such service.

The way this frivolous company service works is that we train a Naïve Bayes model on both good and frivolous company names. Frivolous company names include both profanity and gibberish. Not only do we use the whole words, we also use character n-grams from each of these lists, for both frivolous and good company names. The Naïve Bayes model is coded up in R, and we use Machine Learning Server to host the model. That's the bottom part of the architecture diagram you see. The Machine Learning Server has a web node and three compute nodes. It's an enterprise configuration of Machine Learning Server, so you can add as many compute nodes as you need based on your input volume. We decided to go with three based on a lot of performance tests.

The Naïve Bayes model is running in each of the compute nodes. When an input company name comes in, the web node routes the request to one of the compute nodes, where there is a model running that's already pre-trained on both good and frivolous company names. The input company name gets split into n-grams, and the model scores each n-gram to determine whether it is part of the good or the frivolous dataset. It then comes up with a combined probability score that helps determine whether the input is frivolous or not.

Okay, very good. One last question before we go to customer questions. Why is it important to weed out fake marketing leads?

Sure, Sherry, I can take that one. What was happening was that within the sales and marketing ecosystem we were facing a challenge, and the challenge was really poor quality in marketing leads. These poor quality marketing leads contained fake names, junk names, and junk company names, and without appropriate filtration all of them were going to our global telecenters all over the world. The telecenters were suffering a loss of productivity, and they were not able to get to good revenue because most of the time they were going through this junk.

It is extremely inefficient.

Exactly. So what
we did... On the other hand, we did not want to place an unnecessary burden on our customers who were filling in these web forms by forcing them to enter accurate company names. Even if it was a genuine typo, or if for whatever reason a customer didn't want to enter their real first name and last name, we still want to let them go ahead and download the white paper they want to read. So we came up with this middle path, so to say, where we let customers enter whatever information they want in web forms, and we leverage machine learning algorithms to separate the wheat from the chaff, so to say, and figure out which are the good leads that have genuine names and genuine company names, and which are the so-called bad leads that should not be sent to our telecenters all over the world. That's where this machine learning algorithm helped.

Great. That's a good start. Let's go to the first customer question. What is your understanding of BI, big data, AI, and machine learning? That's a big one. Maybe all of you can chime in on that.

Yeah, I can take the first stab at that. Basically, data is key, right? Right now we're producing a lot of data as part of our day-to-day cycle, and a lot of data is being gathered as part of our business processes. That's where we are moving from our traditional relational databases to more of a big data type of system. But the key thing here is what we do with that data and how we can find insights in it. There are multiple ways. Traditional BI was one way, which is after the fact: you're not learning, you're just saying, "This data is there; what insight can I gain from it?" Then there is AI, where you want to learn from the data and start predicting,
start prescribing, start describing: What is this data telling me? How will my business process evolve? Is my customer going to churn? Am I able to sell something? So AI is getting you into more of a prescriptive
guidance, more of a predictive guidance sort of thing, instead of legacy BI, where you still have more of a descriptive, after-the-fact intelligence. That's how I see the technology evolving over time. Definitely there is still a lot of opportunity to use BI, and there are a lot of business problems where AI is best suited, so it really depends on which business problem we are trying to solve.

And just to add to what Shishir was saying, I was reading somewhere that the amount of data this world is generating on an hourly basis is more than the amount of data that was generated and recorded in some form in the 2,000 years before the year 2000. What this means is not that our activity has changed; the activity we are carrying out is still about the same, but the data that is being recorded is massive. Now imagine coming up with static reports based on that data. It's just impossible, because the insights are too deep, too meaningful, too specific, and too unexpected for us to figure out up front and put in some kind of BI report. That's where machine learning comes into the picture. I read recently someplace that one of the big stores, using machine learning, figured out that in hurricane season the sale of orange-flavored Pop-Tarts goes up dramatically. That cannot be predicted by any static BI report. It can only be done when you have these loosely formatted machine learning algorithms running and figuring out these kinds of patterns. And we discovered that as well when we leveraged machine learning.

Okay. Anything that you want to add, Kavita?

Oh yeah. So indeed, there are large amounts of data entering our systems and services. One other place where machine
learning can be used with big data is all the telemetry that gets captured. We host a lot of services and we put a lot of models into production, so how are they doing? That's also one aspect where a lot of information can be captured and needs to go into a big data store to gather insights from. So I think there might be spaces where both BI and AI can be used in conjunction. It's not just "my project is just an AI project" or "my project is just a BI project." There is scope for both to coexist, and the most power will come when they coexist.

Very good. Next customer question. Marketing data is the most complex. How do you handle internationalized data?

Yeah, maybe I can take a shot, and then people can add. The whole internationalization question, especially in the context of marketing, is definitely complex. When we talk about Microsoft, we are talking about every country in this world, and within those countries we are literally talking about every person who has come across Microsoft in some shape or form. And the
moment they express any form of interest in Microsoft, they become a potential customer for us. At the moment we learn about those customers, we might not even know what their name is or what company they work for. They might have just visited a website; they are still a potentially interested customer. Starting from that relatively unknown phase to a more known phase, where we know who they are, they have shared their information with us, and they have said, "We are interested in these products and we want to be contacted at this number," there's a continuum of how much we know about that customer. As we learn more and more about a given customer, we face different kinds of challenges as to the quality of the data coming in at different stages. For example, in China we have seen that customers are culturally reluctant to divulge their real name until much later in the cycle. In the US, on the other hand, it's more common to see full disclosure of the exact name and company name much earlier in the cycle. So it changes from country to country, and we have to make sure that our machine learning algorithms, infrastructure, and services take care of that.

Yes, I can add to what Ashish said. I definitely agree that marketing is very complex. The data itself is complex, because you have a lot of internet activity happening on microsoft.com pages or on the various product pages of microsoft.com, and we do get data from non-English subsidiaries or non-English countries as well. It really depends on how we are using that data and understanding the linguistic part of it: how we are understanding,
if it is non-English, whether it is bad or not. That's where the key thing comes in, and what Kavita mentioned earlier was how we are training and what our training data is. As part of this problem we looked at the top 10 countries where a lot of our lead data was coming from. We got a lot of training data from our existing system to understand
the profanity aspect of it and trained our model to handle the linguistic aspects of non-English languages as well. So the key thing here is, even though I agree the data is complex, the test and training data become more important as you venture your machine learning solution into non-English datasets as well.

And you mentioned that you started with your top 10 countries and languages. Is the vision to add more as we go down the road?

Yes. It's continuous learning, and the model we have put in place is iterative. So to answer your question, yes, we are constantly learning and constantly retraining our model on different languages, and also on the same languages. Machine learning is not about stopping once you put a model in place; you are constantly learning from it and retraining on the data.

One more thing to add on the same topic. When internationalization comes into play, the problem can become bigger, so one tip is to tackle it in stages. You might not have your training datasets, good company names and frivolous datasets, handy for all the languages and all countries, so we tried to split the problem into phases. First we tried to determine who is mostly using our service, and it was mainly the English-speaking population. So in the first iteration we trained on English company names, and then we started adding more and more to the list. When you tackle a problem such as this, mainly text classification or any ML problem related to text, it's important to keep this aspect in mind and break it down into phases, so you can see how well your algorithm is performing on the majority of the data that it's going to be used for, and then learn from it, iterate, make it more accurate, and then venture
into other spaces. Also, for other languages you might need language experts, and you might need a little bit more resources when it comes to the scoring step. Whenever you train an algorithm, you have to find the false positives and false negatives. With English I can do it; with other languages, I'm not proficient enough to do it. So you might need to leverage other services that are available, maybe a judgment service or the translation services that are available these days. It takes a little bit more time, so it's important to split it.

Yeah. So it sounds like you start with the language where the majority of your data is and build a rock-solid model with that. It's easier because it's what you know in your domain space, and it fits most of the data. Then you take on another chunk, and another chunk, and you need other experts to help.

Yeah. And one more point I would like to add: it's important to know when the so-called body is over and the tail has started. The list of countries is very, very long, and we may end up covering 95% of the cases in 30 or 40 countries. At that point we have to make a business decision: do we continue to invest in each of those countries that result in a much smaller number of leads? We would end up with a very, very long tail, which would require a significant investment from our side.

It's a business decision.

It's a business decision, yeah.

Okay, next. What are the limitations with this approach that we have, Shishir? The machine learning approach.

So, one of the things we are learning as we roll out this machine learning is basically the scale, the operationalization. With the way AI is democratizing, I think everybody can start building machine learning models. But the real test comes in how you operationalize
the model and how you measure whether your model is performing well: are you hitting the success criteria or the KPIs you wanted to measure, the false positives, those sorts of things. The key thing we realized with Machine Learning Server was that it
has helped us tremendously to operationalize our model. That was the key learning. We also saw some challenges at the time we developed this model: the Machine Learning Server group itself was doing some fixes for us, because we had some challenges with how we could scale out the machine learning requests for those models. That's where the Machine Learning Server team provided us some hotfixes, and I think they have addressed those issues now. So, to summarize, operationalization was the key challenge, along with understanding what my measurements are and how I can measure whether my model is still working right. Those were the two challenges, and we are constantly learning and evolving our metrics to measure them.

Very good. I know Kavita walked us through the three compute nodes. Did we start with fewer than three?

Yes. We started with one compute node and one web node and hoped that it would all work. But the way our service works is that multiple requests come in at the same time, and we need to validate multiple messages at the same time. When we used one compute node, given the amount of memory and CPU that was available, it was just an overload on that compute node to serve the requests. Each request was taking way too long, and this model was becoming the bottleneck. So then, through performance and stress tests, we determined how much memory and how much CPU one model run uses. Then you determine, if one model run takes this much memory and compute, how
many requests one compute node can handle in parallel, and then you work out how many compute nodes you need at that point. That's how you scale out your solution, based on the input throughput that you need and how many parallel requests you need.

So what is the throughput of your system today?

For the frivolous company model, we are trying to take in at least 15 or more parallel requests at one time, and each parallel request can take in at least 150 leads at one shot. So you can assess how many: it's actually upwards of 2,000 to 2,500 leads at one shot.

Although I would add one thing: it's not just the number of leads. Every lead will have a different load, so we cannot just go by a static number.
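
The sizing approach described in this exchange can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's actual tooling: the 15 parallel requests and 150 leads per request come from the discussion, while the node memory, core count, and per-run resource figures are hypothetical placeholders rather than results from the team's stress tests.

```python
import math


def runs_per_node(node_memory_mb, node_cores, run_memory_mb, run_cores):
    """Return how many model runs one compute node can serve in parallel.

    The node is limited by whichever resource (memory or CPU) runs out first.
    """
    by_memory = math.floor(node_memory_mb / run_memory_mb)
    by_cpu = math.floor(node_cores / run_cores)
    return min(by_memory, by_cpu)


def nodes_needed(target_parallel_requests, per_node_capacity):
    """Return the number of compute nodes needed for the target parallelism."""
    if per_node_capacity < 1:
        raise ValueError("a single model run does not fit on one node")
    return math.ceil(target_parallel_requests / per_node_capacity)


# Figures taken from the discussion.
PARALLEL_REQUESTS = 15
LEADS_PER_REQUEST = 150

# Hypothetical figures, for illustration only (not measured values).
NODE_MEMORY_MB = 16384
NODE_CORES = 4
RUN_MEMORY_MB = 3000
RUN_CORES = 0.75

capacity = runs_per_node(NODE_MEMORY_MB, NODE_CORES, RUN_MEMORY_MB, RUN_CORES)
nodes = nodes_needed(PARALLEL_REQUESTS, capacity)
leads_in_flight = PARALLEL_REQUESTS * LEADS_PER_REQUEST

print(f"runs per node: {capacity}")
print(f"compute nodes needed: {nodes}")
print(f"leads in flight: {leads_in_flight}")
```

With these placeholder inputs the sketch arrives at three compute nodes, but that is a consequence of the made-up figures. Because every lead carries a different load, a static estimate like this is at best a starting point that performance and stress tests would then confirm or correct.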
And I'll add one thing. I remember when Kavita and Shishir were facing challenges and the server was getting overloaded, we decided, "Let's be patient. Let's give it time, because we are all learning together." The business is learning, the technical team is learning, and all of us are learning more. When we all realize that we are in this learning mode, the patience comes from within, and we said, "Okay, we have to let this thing mature and not call it quits." So one of the things I would say to all those who are trying this out is just be patient, because we are all learning and figuring it out.

Also, to add to that: the way this model was deployed was in a provider pattern. There are multiple providers. One of the first few providers is the frivolous company name provider, which says whether a company name is frivolous or not, and then the lead goes to other providers, which do matching and enrichment. When this model was performing poorly, we had a good way in the service to remove this model and let the normal flow continue. When this was a bottleneck in production, we could quickly flip a switch, basically a config setting, enabled or disabled, to say, "Let's disable this; the model is not working well, or we are running into issues." We did some integration runs, but the volume that we faced in production during a certain time was really high, and that was when this became a bottleneck. So when you add a machine learning model into your architecture, just make sure that there is a configurable way to disable or enable it.

Because when you're piloting, in case you get bottlenecked, you need to be able to pull it out. Yeah, okay, great lesson. How did you get the training data? Tell us a little bit about the training data.

Yeah. So for training we had both frivolous and good company names. For good company names, like we discussed, we of course prioritized the countries that we wanted to tackle first, and for most of the countries we got the stock exchange company list. That was where we got the training dataset, and then we also had internal sources that we leveraged to add on to the stock exchange company
list. That became the good training dataset. For the bad training dataset there are two parts, like I mentioned: profanity and gibberish. For profanity, we used some internal sources to give us profanity in the different languages. There was a content validation services team that had a list, and they have world readiness experts who were able to give us a very, very extensive list of profanity words in different languages. For the gibberish list, we had a program that would generate random characters and put them together, and that formed our gibberish dataset.

I would add that just building really good training data was one of the challenges, and I'm sure most of the users out there will face that as one of the challenges they have to tackle up front. We were fortunate because we had very engaged field offices all over the world that pitched in at different points in time, from Japan or from Europe, and helped us figure out good data from bad data, like I was mentioning earlier. Having that kind of engagement from the field offices, and having people who have skin in the game and are willing to pitch in and help you improve the quality of your training data, is really critical.

Yeah, I would add one more thing here. I think the labeling is also key. You can get training data, but how do you differentiate whether this is good data or not? As part of our maturity cycle we ventured out to our field offices and so on, but how we started was within our sprint team, the engineering team. We divided this training data, something like 10,000 records, among ourselves. I know that was a manual effort, but
the team was vested in saying, "We want to solve this critical business problem, and we need to get these labels right." So we distributed the training data among ourselves, and then manually, although we automated some parts, we labeled it: good, good, bad, bad. The labeling was critical for training the machine learning models. And however manual and boring it may sound, it was actually pretty exciting.

Yes, we had quite a few chuckles going through this process. And one point: this process also comes into play after you have a model running and you want to see how well it is performing. Then there's tons of data to look through to see whether it is performing correctly, especially when different input comes in than what you have already tested.

Okay, the other side of it: what's coming out, is it right or is it not? Next question. Is the gibberish dataset tool public or private?

A colleague of mine actually developed it. We can try to expose that if there is interest, on GitHub maybe.

Yeah, and in fact, Kavita, you just started a code share?

Yeah, that's right. We have a code share that will be published soon. As part of that I have a walkthrough of the R code that we used: how we did the training part of the machine learning process, how we do scoring, how it is deployed on Machine Learning Server, and the actual client code that can call into this web API and get the results back. That should be available to you as well.

So we're posting not just the video, in a blog form, but the code on GitHub. So you're saying we can share this as well?

Yeah. I'd like to give a shout-out to my colleague Adder, who helped us build the gibberish tool, which is fantastic. I will be sharing that as part of the code share.

Okay.

Yes, we will all take care of that.

Next question. So, talk
To us a little bit about the business any, business challenges. You may have faced. Just. For. This project. Sure. So. The product itself was. Meant to address a business challenge and I can't Blake touched on it in the very beginning but, what was happening, was that the. The. Telecentres. All over the world. Facing. Productivity, loss because.
Of The bad data, that they were receiving and, so. When we started solving this issue. Initially, we resorted to static, list what you should have mentioned and static, list does not really take you far enough at all so, that's when we move to machine learning and I, would say and the challenge. When we started implementing, was, that. Of making, sure that, as a team is learning and, implementing. We, have enough air cover for the team. Because. You, know they'll be iteration, there, will be multiple cycles just like that developing any machine learning model there'll be multiple cycles, that we have to go through there's always time pressure we need the solution, we need it now exactly, you there. Were some challenges you know setting. Proper expectations. Exactly. And it's a mindset of machines, a machine learning mindset is hey we are all learning machine learning includes learning, and. Sometimes, learning Kelty end up taking time and learning is not an individual, learning it's a whole organisation that is learning so. The patience, comes with the fact that there is a learning, that we are all going through and machine. And human as well so I think that was really critical part of the whole cycle there we made. Sure that. The team has enough air cover and business. Has enough patience, to make sure that even, if things are faltering we, know that we are still moving in the right, was. There any pushback, to, this solution. Not. Sure how much we can, push. Panko is basically. As. The. AI is democratizing, like, there are various. Themes or, like everybody, want to do everything, in AI right, so, and, it will the challenge is not just us but I think even like the. Who is look at like going to build the problem like machine, learning model they will probably face the similar challenge where more. Than one team is trying to do the same thing right, so. Who, gets who. 
Who gets to do it but I think the the thing is like as, a company, you need to think hey for, a customer, we don't want to like, like, attacked, like the that customer data from multiple angle, right so as a company it makes sense to streamline, and provide, a singular, solution right, so we do had like we're like we have more than one team working. On the different aspect, of this problem, and we, ended up like hey collaborating. Because there we kind of leverage each other's, skills. And leverage each other's strengths and then, basically the, model which we were building we started, giving the signal to other team so that they can improve other aspect. Of like they can see because, the lead quality is one aspect in the customer, journey because. There are other thing lead is doing online like the lead activity, right so and we are scoring, then we have the lead scoring concept, right and the lead scoring the lead quality signal is one of the aspect when they do the lead scoring, so there were a teams who, are building the lead scoring model and they, were not looking at the lead quality angle, which we were looking at right so we kind of here and we said hey let's collaborate, rather than saying hey let all. This versus, let's collaborate each other's solution. And. Then we ended up giving our signals, lead quality signals to other team which like as a win-win, for Microsoft, and win-win for the whole team as well right yeah. Is. Machine. Learning discovery. Or design like. Architecting. Software, do you use a top-down, or a bottom-up, approach I. Can. I can take my, so I think the, the. Key thing on machine learning is, we. Had in mind is we need to find the right business problem, like, he is this business problem, is right, to, be picked up for the machine learning we just don't want to do it because hey everybody. Is doing machine learning so let's just do the machine. It's. Not a solution you need to have the right business, problem. Which. 
you want to attack with machine learning. And then the architecting definitely matters. In machine learning that means you need to pick your language, starting with which language you are going to write your model in, whether it's R or Python. You need to think through the data, the training set and the test set, those sorts of things. And eventually you need to think through the operationalization aspect: at what scale does that machine learning model need to be used? Is it just a desktop application, where somebody is running against that model on a desktop, or will it be deployed on a server, where you have thousands or millions of requests hitting that machine learning model and you need to score on the fly? Those are the things that need to be taken into consideration. So I do believe architecting
that solution is important. And throughout this journey I realized that the operationalization piece, how we operationalize the model, was a key learning for us. Like I said, as AI is democratizing, each one of us can build a model with a little learning, but how we operationalize it at scale, that's where architecting the solution is the biggest challenge.
And Kavita, what about picking the right algorithm or model?
Yeah. Towards that, we actually tried different algorithms to see which one we wanted to land on. Naïve Bayes wasn't the only algorithm that we tried. We also used the rx algorithms: the MicrosoftML package has published several algorithms, so there are rx neural network, logistic regression, and fast linear algorithms. We trained all of these algorithms with the same training data, and tried it multiple ways. With the MicrosoftML package, the algorithms that come with it are extremely fast, and there are ways in Machine Learning Server to do the scoring step in milliseconds. We did observe that the scoring process is much faster with the rx algorithms, and Naïve Bayes is the one we got the most accuracy with. So that's how you choose algorithms.
On that question, I also want to add that it is an iterative process. In a machine learning project there is some requirements gathering in terms of what problem you want to solve, and then you get the training and test data sets. All of these AI problems are data heavy, so what you put in is what you get out: if you train on junk data, you're not going to get good results. There is a heavy emphasis on the data that you use, so you have to take the time to gather the right data. Then you train the model, and in training
the model you have to choose which algorithm you want to use and which programming language you want to use. Then you train a model: you split the data into training and test sets, and then you do the evaluation process. That's just the training part of it. Then, when you want to operationalize that, what are the things that you need to do? And prior to training you have to pre-process the data and clean it up. Sometimes the data can be very structured; sometimes it is very, very unstructured. You have to invest a lot of time before and after the fact, rather than just in building the model. So that's our key learning: give it enough time. It's not like a regular coding project, where most of the time goes to the coding aspect of things and once it's put into production it's kind of done, with minor bugs to fix. Here you have to invest in evaluation: how well is your model doing, which data sets is it not doing well on, which data sets is it doing well on, and then retraining it. So there is a lot of dev time involved before and after, and different types of skills around getting the data. That's what differentiates an ML project from some of the traditional software projects.
Sounds like there's a care-and-feeding aspect: you're always evaluating the accuracy of your model and then deciding when you need to retrain and improve.
And there will be drift that happens with time. Just keep an eye on that drift, and if it's wide apart, then let's retrain the model.
OK. Do you know at any point in time the accuracy of your model? Do you have a dashboard that tells you that?
Yeah, we do calculate the accuracy of the model periodically. And when we get false positives and false negatives, prior to release cycles we do this exercise of trying to figure out how well we have done from
the previous release, and then see whether the model needs to be retrained on any of the company names that are being tagged as false positives or false negatives, et cetera.
Yeah, and just one thing I would add; I think Kavita touched on it. The other aspect of which model you choose when you are attacking a machine learning problem is whether it is a classification problem versus
a time series sort of problem. Again, you need to understand well, based on your business problem and data set, what outcome you are looking for. That will help you decide: should I go for Naïve Bayes classification (and there are a bunch of classification algorithms), or should I go for more of a time series type of option? So yeah, that will be a criterion.
At one of the seminars I attended, they said that getting the first order is the hardest one for a salesman. Getting the first machine learning model that can create an impact is the hardest one, too. But once you get through that, things change. I was recently in a meeting where we were all discussing how we solve the segmentation issue, and somebody said, "Hey, can you use machine learning for that?" I was like, that is awesome. So once the group sees the success cycle, the floodgates might open, and we are just going to work more towards that.
Very good. You get comfortable.
Yes, then you get comfortable with the idea.
If you could go back to the beginning of the project, what would you do differently?
I can take that one first. At the beginning of the project, I'll say, I underestimated the importance of training data. The team, Kavita included, would come to me all the time. They would see me in the hall and ask about the training data set, and I would sort of run away. So I really underestimated the importance of training data here. Having gone through this whole cycle, you realize that having good training data is probably the number one deciding factor for the success of any machine learning model. Going forward, we have another project starting on machine learning in a similar kind of field, and, to be honest, I said let's make sure that training data is the number
one thing we tackle up front, because that can make or break the project. That's my learning.
To add to that, I also personally underestimated the effort involved in evaluating how well the model is doing on a test data set. Say you have a test data set of a hundred thousand records and the model is scoring it. Now how are you going to evaluate how well it is doing? There is manual labeling that needs to happen before you can even gather the test data set and evaluate against it. So you should budget time for each of these steps: getting the training data, and how you will do the scoring. Are you going to spend cycles within the team to do this scoring, or can you outsource it to somebody else or use a judgment service? That aspect is something
we would try to tackle in a different way.
Yeah, and I would add to the learnings. We are actually already working on the next ML problem, and we are taking all the learning from this project to that one. Again, the focus is on how we want to deploy the model, which we were not thinking about up front when we started on this specific problem. Now we know which model we have to use, which language we are going to use, and whether it is operationalization friendly or not. Those are the things we are keeping in mind up front now.
The first time you do anything, you don't really know what you're getting into. The second time, you do.
Yep. Actually, one thing we are already doing differently, and Kavita is already doing differently, is that we have started exploring Python as our language to build our model. Microsoft ML Server already has great support for Python for operationalization. So for anybody looking to try this out, I would say Python is another great language to build the model in.
So, can you quantify the business impact from this solution?
Sure. I may not be able to share the absolute numbers; they're internal to Microsoft. The exact measurement still needs to be done, because the pipeline plumbing still needs to happen, but the machine learning model itself is in place. What we are looking for is two things. One is an increase in productivity of our tele representatives, because they were spending a lot of time just separating the wheat from the chaff. The typical numbers I got were that they were spending two to three minutes on every bad lead, and if in a given day they were able to make a few successful calls, that was a win for them. That was really disheartening to know. So that's the
quantification of the increase in productivity, which we will know at some point in time. The second part, and again I can't share the specific numbers, is that because they were spending so much time on poor-quality leads, they were not able to get to that 10 million or 20 million dollar lead that was sitting further down. It would become old with time, rot, and eventually die, which is really unfortunate. So that is the impact on revenue that we will have, because we have a much cleaner list of leads that our tele reps can work through and create impact. Those are the two quantifiable business impacts, I would say. The third, which is not quantifiable but is actually really critical, is the morale of the workforce, the tele reps who are doing the work. If they are making only two or three good calls in a day, morale is not going to be very high for them. If we keep that in mind, we actually feel pretty good about the work we are doing, because we are making somebody else's life better, and they in turn will go home happier, and so on and so forth. So that's the non-quantifiable part.
So what are some of the other approaches you could have used if you didn't want to use Machine Learning Server?
There are multiple options available to operationalize the model. From Microsoft itself I can tell you of a couple of options: one is Azure ML, the other is R Services on SQL Server, and then there is Machine Learning Server. So there are multiple paths to operationalize a model. Each one has its own benefits, and you have to choose which one you want to go with. Azure Machine Learning is very good if you have a quick, small model that you want to develop; if you want to train and score, then Azure ML is great.
You've tried all of these?
Yes, for this project. Azure ML is probably the easiest one to operationalize. It doesn't require a big ramp-up; it's very easy to build a model and operationalize the model on Azure ML. So we tried that first. But as the training data set grows, each
step of the Azure ML process (it's a step-by-step approach that you take with Azure Machine Learning) is backed by a certain compute node, and that compute node you can't change. So as our feature size kept growing and we wanted to train on bigger data sets, like 1 billion company names, let's say, to get the accuracy up, Azure Machine Learning started having problems with the training steps. We were advised to train outside of Azure Machine Learning and just call the model from Azure Machine Learning. Then it comes down to the package you are using and the scoring step of the algorithm you are using: how much compute does that step need? If that fits within your Azure ML compute node size, that's great, and you can choose Azure ML. Now, if you want to have more control over the machine that you use, the packages that you want to use, and the whole end-to-end setup, then you can go with Machine Learning Server, where you can provision the compute node you want. Based on your throughput you can put in whatever machine you want, with as much compute as you need to serve parallel requests. You can also take the latest algorithm that you might find somewhere, download it onto the machine, and make it run. So Machine Learning Server is really great for both R and Python if you want to have more control over the environment. And SQL Server R Services is really good if your data resides in SQL Server. In our case we didn't inherently have the data within SQL Server. But if you have a lot of training data in SQL Server, you don't want to take it all out, bring it to some other place, try to do the training and the scoring there, and then go back and forth with such humongous data. So if you have data that already sits in SQL Server, then SQL Server R Services is really great. So we
had to follow kind of an iterative process to try to understand each of these technologies and what limitations we might have, and to figure out when is the right time to use the right one.
And just to add to what Kavita was saying: we did start developing our Naïve Bayes model as a custom implementation in C#. So we do have one model where we wrote the Naïve Bayes in C# and deployed
it like any other C# service. That's where we learned that it's not a great fit: deploying it that way may not help us leverage all the machine learning capabilities. That's when we explored SQL Server R Services and Machine Learning Server, and started exploring languages that have more inherent support for machine learning, like Python and R.
Also, to add on SQL Server: if there are simple models or simple algorithms, like the one Shishir mentioned that a colleague of mine developed as a custom implementation, you can tweak the implementation pretty well. Say you want to change the algorithm: maybe it's using some n-grams, and adding a lot of the n-grams from the training data into the scoring process is something you don't want to do. If that step can still give you the accuracy that you need and it's very, very fast, then you can have custom implementations as well, or you can try to use R and write your own package to do your algorithm. So in some of these cases you might need your own way of doing it. If you have a small data set and you can code up the algorithm, then SQL Server alone is enough for that process; SQL Server is what we used. All that you need for Naïve Bayes, pretty much, is the n-grams and their frequency of occurrence in the good data set and the bad data set, which you can pretty much get from SQL Server. If you want to operationalize it, if you want to expose it as an API, that's where it becomes a little bit of a challenge. But if you just want it to run as a DLL, you can still do your own implementation.
OK, very good. I see that we're getting close to the top of the hour, and I do want to ask our experts one final question: what is the one tip you would like to leave our audience with today? Shishir, let's start with you.
Yeah. So in this whole journey,
when we started venturing into this whole ML-based development, I really didn't have anybody on my team with a data scientist title or a PhD. But as a group we invested a lot in learning machine learning technologies, like R, machine learning concepts, and all of that. So really the key takeaway was: if you are looking to venture into machine learning, to start with you really don't need a data scientist on your team, or anybody with a PhD. It's really a maturity curve. As you grow into the machine learning space and
you have a lot of challenges you are solving with machine learning, at some point you really will need a PhD in mathematics or data science folks. But what we learned in this whole space was that the key thing was having good business domain knowledge of your space, which my team had, and also a good data engineering background: how you do data transformation, data cleansing, those sorts of things. That is combined with the language, like R, and, to start with, a good grasp of statistics. This team had those by the nature of our work: we were doing a lot of BI before, so we already had that knowledge. So I would encourage everyone out there: if you are looking to solve any problem with machine learning, you can really get started with a lot of the Microsoft technology out there, like Azure ML, which Kavita mentioned, and ML Server. You definitely need to invest time in learning R and some machine learning concepts. But yes, you really don't need to be a data scientist to get started on machine learning.
True democratized AI.
I would add to what you just said, and then I'll have my own point, so I'll make two points. What I would add is that if I were to choose between a team that is expert in machine learning and a team that has business domain expertise, I would lean towards the team that has business domain expertise. We were in working sessions where we were going through mounds and mounds of data, rows and rows of data, and just knowing machine learning would not have produced the kind of insights, feedback, and iteration loops that we were able to go through. That all came from business domain expertise. And business
domain expertise takes a long time; it's a marination process. Machine learning is also a long process, but in our case the business domain expertise was really important. The other point I want to make, going back to your earlier question, is that picking a good business problem to solve is very important: something that is not a fire that is burning and has to be solved right now, because that will put too much pressure on the machine learning team. It's an iterative process. Picking a business problem that has the right business impact, but is not so urgent that it creates too much pressure on the team, is very important. Once the team sees the success cycle, then, like I said earlier, the floodgates will open. That is a very important part here.
OK. Kavita?
So, for this particular problem that we tried to solve, detecting frivolous names: as I mentioned, we have training data, and it has profanity, gibberish, and company names. We train on letter n-grams. What we found was that when we started to train on profanity words with letter n-grams, many of the n-grams that were captured also appeared in the good data set, even though the whole word was itself profanity. That was leading to a lot of false positives in our case, so we had to go back. We found during the testing phase that there were a lot of false positives, so we had to do some investigation to determine why it was happening. So for bad words we don't use letter n-grams, because we don't want our false positive rate to be so high: if somebody is entering a genuine company, we don't want to tag it as frivolous. So it's important to have a good choice of training and test data sets, and to evaluate your false positive rate before you put something in production.
I had one more thing. What I would like to add here is that there is an experimentation
mindset needed when you are working on a machine learning problem: a hypothesis mindset and an experimentation mindset, because you need to constantly learn. As Kavita mentioned earlier, it's not like traditional development, but
you are constantly hypothesizing about some problem and asking, "Will that work?" Then you develop the model, deploy it, learn from how it behaves, and again you tweak it and again deploy it. So I would say that, as more and more of us move to cloud services and a services mindset, machine learning is similar: we need to apply a lot of this hypothesis and experimentation mindset as we go.
OK, very good. That came through loud and clear today. We are near the end of our program, and I do want to thank our experts for taking time away from your day jobs to come here and talk to us and our customers; it's truly an important thing to do. And I want to thank our customers for joining us. We love the questions, the dialogue, and the opportunity to come talk to you. You can find this video posted in about a week on microsoft.com/itshowcase. We have new live shows every week, so join us again and bring your colleagues. On our IT Showcase site we have technical case studies, business case studies, articles, blogs, vlogs, and code share, which Kavita has already participated in, all about how we do IT here at Microsoft and how we're running this large global enterprise organization on Microsoft's technology. We thank you for joining us today and wish you a great day.
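The Naïve Bayes approach the panel describes (letter n-gram frequencies taken from a good and a bad data set, whole-word matching for profanity, and a false positive check before release) can be sketched roughly as follows. This is a minimal Python illustration, not the team's implementation: the trigram size, the add-one smoothing, the sample names, the placeholder blocklist entry, and every class and function name here are assumptions made for the example.

```python
import math
from collections import Counter

N = 3  # letter n-gram size; an assumed value, not stated in the discussion


def letter_ngrams(name, n=N):
    """Return the letter n-grams of a lowercased, space-padded name."""
    padded = " " + name.lower().strip() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Two-class Naive Bayes over letter n-gram counts ("good" vs. "bad")."""

    def fit(self, names, labels):
        self.counts = {"good": Counter(), "bad": Counter()}
        self.docs = Counter(labels)  # number of training names per class
        for name, label in zip(names, labels):
            self.counts[label].update(letter_ngrams(name))
        self.vocab = set(self.counts["good"]) | set(self.counts["bad"])
        return self

    def log_score(self, name, label):
        total = sum(self.counts[label].values())
        prior = math.log(self.docs[label] / sum(self.docs.values()))
        # Add-one smoothing so an unseen n-gram does not zero out the product.
        return prior + sum(
            math.log((self.counts[label][gram] + 1) / (total + len(self.vocab)))
            for gram in letter_ngrams(name)
        )

    def predict(self, name, blocklist=frozenset()):
        # Profanity is matched on whole words, not on letter n-grams.
        if any(word in blocklist for word in name.lower().split()):
            return "bad"
        return "bad" if self.log_score(name, "bad") > self.log_score(name, "good") else "good"


def false_positive_rate(model, names, labels, blocklist=frozenset()):
    """Share of genuinely good names that the model tags as frivolous."""
    good_names = [name for name, label in zip(names, labels) if label == "good"]
    flagged = sum(model.predict(name, blocklist) == "bad" for name in good_names)
    return flagged / len(good_names)


# Toy data, invented for illustration only.
TRAIN_NAMES = ["Contoso Ltd", "Fabrikam Inc", "Northwind Traders", "Adventure Works",
               "asdfgh", "qwerty", "zzzz", "xxxxxx"]
TRAIN_LABELS = ["good"] * 4 + ["bad"] * 4
BLOCKLIST = frozenset({"darn"})  # placeholder standing in for a profanity list

model = NgramNaiveBayes().fit(TRAIN_NAMES, TRAIN_LABELS)
print(false_positive_rate(model, ["Contoso Inc", "asdf"], ["good", "bad"], BLOCKLIST))
```

Scoring profanity with letter n-grams instead of the whole-word check would bring back the false positives Kavita describes, because fragments of a bad word can also occur inside genuine company names.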