NLP Text Recommendation System: Journey to Automated Training
Hi everyone, my name is Aditya. I'm an engineering lead at Salesforce, working with an awesome team of scientists and engineers. Today we're going to talk about our journey to automated training at scale for a recommendation system.

Let's look at the agenda. We'll talk about the goal, then the current scenario, which was the motivation for building our system. We'll talk about our approach and the metrics we kept in mind. Then we'll deep dive into the system architecture, covering the feature engineering, model training, and serving aspects of the system. Next we'll talk about the evolution: how we transitioned from a simplistic system to a more advanced, robust one. Towards the end we'll touch on our deployment strategy, including how we do rollbacks for machine learning models, and finally we'll talk about the challenges and our takeaways.

So what's the goal of the system? The business goal is to assist agents in providing solutions to customer problems. And what's the current scenario? Agents rely on traditional search results to find relevant answers to the questions customers have, and these questions are generally very long and time sensitive.

What's our approach to tackling that scenario? We have a recommender system with two layers: candidate generation and ranking. Candidate generation takes in a large corpus of knowledge articles provided by the customer organization and filters it down to a smaller set of articles, which the ranking layer then uses to return the most relevant ranked results to the agent. That, in turn, helps the agent provide an answer back to the customer.

The business metrics are just as important as the approach. Agent time to resolution is a key one: we want the case to be resolved as quickly as possible by the agent, and that is where the knowledge article recommendations help. At the same time, if there is no resolution, we don't want the agent spending too much time on the case; one scenario is that the case gets escalated to the next tier, and in that case too we want the time spent by the current agent to be low. Thirdly, we want to track the attach rate. The case and the article are like the question and the answer, so if our recommendation is relevant, the attach rate should go up. That's not always true, but it's still a very good measure to track.

Besides these business metrics, we also track the count of recommendations served, which signals the scale of the system; the monthly active orgs and monthly active users, which are other key metrics; and finally the serving latency, which gives a sense of how quickly we respond as a system, and as an ML system specifically.

Now let's deep dive into the system architecture, going one level deeper into the two layers we just talked about: candidate generation and the ranking model. In candidate generation, the first step is to convert the complex, long user question into a more meaningful, formulated query that we can send to our candidate generator. For us, the candidate generator is the search system. It's a very complex system in itself, and how we generate candidates there could be a very long talk on its own, so we won't deep dive into it for now. For now, it's enough to understand that we extract the key terms, the nouns, the parts of speech that give a good sense of the question's intent; we then formulate multiple queries for the IR, or search, system and get back a capped number of results. That is what we call the generated candidates.
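To make that query-formulation step concrete, here is a minimal sketch of the idea, assuming spaCy for part-of-speech tagging. The extraction rules and query templates below are illustrative stand-ins, not our production logic:

```python
# Minimal sketch: turn a long case description into a few keyword queries
# for an IR/search backend. Assumes spaCy with an English model installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def formulate_queries(case_text: str, max_terms: int = 8) -> list[str]:
    doc = nlp(case_text)
    # Keep nouns and proper nouns; these carry most of the question's intent.
    terms = [t.lemma_.lower() for t in doc
             if t.pos_ in ("NOUN", "PROPN") and not t.is_stop]
    # Deduplicate while preserving order, then cap the number of terms.
    seen, keep = set(), []
    for term in terms:
        if term not in seen:
            seen.add(term)
            keep.append(term)
    keep = keep[:max_terms]
    # Formulate multiple queries: the full term set plus shorter variants.
    return [" ".join(keep), " ".join(keep[:4]), " ".join(keep[:2])]

queries = formulate_queries(
    "My solar inverter shows error E042 after the latest firmware update "
    "and the mobile app cannot reconnect to the panel."
)
# Each query is sent to the search index, and the union of results
# (capped) becomes the candidate set for the ranking layer.
```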
Once that happens, we move on to the ranking layer. The ranking layer uses pre-computed features, which we have offline jobs for: features like document frequency, many features around the document, and many features around incoming queries from the past, all already sitting in a data store. We use those, and then we do feature generation for every candidate paired with the question, so it's pairwise feature generation. Once that happens, we pass each pair through our trained model, which gives back a score used for ranking the results, and once we've ranked them we can pick the top k by score.
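A minimal sketch of that pairwise scoring and top-k selection, assuming a hypothetical trained scikit-learn-style classifier `model`; the `pairwise_features` helper here is a tiny stand-in for the real feature-crossing step:

```python
import numpy as np

def pairwise_features(question: str, article: str, feature_store: dict) -> np.ndarray:
    # Stand-in for the real feature generation: two simple lexical-overlap
    # features plus a pre-computed, per-article statistic from the store.
    q_terms, a_terms = set(question.lower().split()), set(article.lower().split())
    overlap = len(q_terms & a_terms)
    jaccard = overlap / max(len(q_terms | a_terms), 1)
    precomputed = feature_store.get(article, 0.0)
    return np.array([overlap, jaccard, precomputed])

def rank_candidates(question, candidates, model, feature_store, k=5):
    # Score each (question, article) pair, then return the top-k articles.
    X = np.vstack([pairwise_features(question, a, feature_store) for a in candidates])
    scores = model.predict_proba(X)[:, 1]  # probability the article is relevant
    top = np.argsort(scores)[::-1][:k]
    return [(candidates[i], float(scores[i])) for i in top]
```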
Now let's look at the serving workflow. In the bottom-right corner you can see the service agent; that is the entry point to our workflow. A case is created by the service agent on behalf of the customer, and the case represents the user's question. Once the case is created, it is fanned out onto a message queue to scale out the whole system, because multiple cases are being created in parallel by multiple agents and we want the system to be scalable. The handler on the other side of the queue picks up each case and invokes our recommendation system. As we said earlier, there are two parts to it: candidate generation and ranking. Candidate generation, arrow number four here, picks up results from the search index, which is a huge system in itself, a different topic altogether we could deep dive into some other time. So candidate generation happens in step four; after that, the shortlisted candidates are sent to the ranking layer, which, as mentioned earlier, ranks the candidates based on the features computed earlier. Finally, in step seven, the recommendations are sent back to the service agent.

There are two more personas here, as you can see: the org admin and the knowledge base admin, who are also part of the customer organization. The knowledge base admin is responsible for maintaining the knowledge base: creating it, updating it, creating knowledge articles in different languages. Those activities are triggered by the knowledge base admin asynchronously, and the content gets indexed into the search index offline. The org admin is the one who controls the setup flow of our recommendation system. They get to select the fields that matter in the customer's data. Customer data at Salesforce is complex in the sense that the schema is very flexible and customizable, so it's important for the org admin to specify which columns, which projections, are important to the customer; that is what the data setup UI is for. There are a bunch of other UIs as well. I've highlighted the metrics UI here because it shows the customer how well their data and their model are doing: if the shape of their data isn't good for training the model, they get a sense of that here, and likewise if their model's performance isn't up to the mark. Finally, there are Salesforce-internal personas: support engineers, or even engineers on the team, who can jump in and help troubleshoot customer investigations.

Now let's talk about the other side of our system, the offline processing system, which includes data prep, feature engineering, and the actual model training. In data prep and feature engineering, the first step is to ingest the data from the system of record into a data lake, where the data is processed into a shape that can be used for model training. Once it's ingested, we do a bunch of cleansing and sanity checks on the dataset, and we pre-compute certain statistics over the data corpus. The feature engineering step then generates the feature vectors used by model training. We have 100-plus NLP features, generated by crossing features from the article with features from the incoming question, or case; after feature crossing, those 100-plus features are grouped under six to seven statistical feature categories. The feature categories are what signal to the model how relevant the article is to the incoming case or question. We'll deep dive into that a bit more.
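As an illustration of what one family of crossed features might look like, here is a small sketch that crosses case fields with article fields and computes one similarity statistic per pair. The field names and the choice of TF-IDF cosine similarity are assumptions for the example, not a description of our exact feature set:

```python
from itertools import product
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical fields; real orgs have flexible, customizable schemas.
CASE_FIELDS = ["subject", "description"]
ARTICLE_FIELDS = ["title", "summary", "body"]

def crossed_features(case: dict, article: dict, vectorizer: TfidfVectorizer) -> dict:
    # Cross every case field with every article field; each pair yields one
    # similarity feature. 2 x 3 fields -> 6 features here; with more fields
    # and more statistics per pair, this grows to 100+ features.
    feats = {}
    for cf, af in product(CASE_FIELDS, ARTICLE_FIELDS):
        vecs = vectorizer.transform([case.get(cf, ""), article.get(af, "")])
        feats[f"tfidf_cos_{cf}_x_{af}"] = float(cosine_similarity(vecs[0], vecs[1])[0, 0])
    return feats

# The vectorizer would be fit offline on the org's corpus, e.g.:
# vectorizer = TfidfVectorizer().fit(corpus_texts)
```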
Then finally, the serving/training drift. I put it out here just to highlight that it's something that happens quite a bit if you're not careful about sharing libraries between the serving and training stacks, and if your data is drifting over time. So, just calling that out as part of the feature engineering stage.

Now, the model training itself. We have a ranking model whose hyperparameters are auto-tuned using cross-validation and grid search, and we have auto model-comparison logic that picks the best model between the currently serving model and the newly trained model for a particular customer, a particular organization. This happens for every organization automatically, and that is where our strength lies: we have built a system that is automatic in terms of training, model comparison, and tuning. The interaction of the data scientists with the running system is now minimal; they are looking at the next best thing, and minimal hand-holding is required.

Some of the key metrics are, of course, area under the curve, where we look at both the ROC and PR curves; then the F-measure, precision, and recall; and finally hit rate at k, where we look at accuracy within the top k results across all cases.
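A compressed sketch of those three pieces together: grid-search tuning with cross-validation, the champion-versus-challenger comparison, and hit rate at k. It assumes scikit-learn, with a gradient-boosting classifier as a stand-in for the actual ranking model and validation ROC AUC as an illustrative comparison criterion:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

def train_challenger(X_train, y_train):
    # Auto-tune hyperparameters with cross-validated grid search.
    grid = GridSearchCV(
        GradientBoostingClassifier(),
        param_grid={"n_estimators": [100, 300], "max_depth": [2, 3, 4],
                    "learning_rate": [0.05, 0.1]},
        scoring="roc_auc", cv=5,
    )
    grid.fit(X_train, y_train)
    return grid.best_estimator_

def pick_winner(champion, challenger, X_val, y_val):
    # Auto model comparison: keep whichever model scores better on held-out
    # data; the losing model remains available for rollback.
    def auc(m):
        return roc_auc_score(y_val, m.predict_proba(X_val)[:, 1])
    return challenger if auc(challenger) > auc(champion) else champion

def hit_rate_at_k(ranked_lists, relevant_sets, k=3):
    # Fraction of cases where at least one relevant article appears
    # in the top-k recommendations.
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant_sets)
               if set(ranked[:k]) & rel)
    return hits / len(ranked_lists)
```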
Now, talking about the training pipeline itself. On the left you can see the Salesforce app, representing orgs of all shapes and sizes. The data from the Salesforce app is ingested into the data lake, at which point we can start the actual feature engineering and model serving, with model training before that. The entities we care about here are the cases, the articles, and the attachments. Once the data is in the data lake, it goes through the typical stages of preparation and precomputation. Then comes feature engineering, where we do the actual transformation of the data columns into attributes the model can understand and train on. We then do feature crossing and generate the features, grouped into the different statistical feature categories we talked about earlier. Finally, we do a form of feature selection to make sure we aren't including features that don't add value to the training. In the training stage, the feature weights are learnt, and we do the validation and comparison we were hinting at earlier. Finally we get a winning model, and that is pushed back to the app; the agents and the admins interact with the app and get the results they can pass on to the customer. Another key part here is model retraining. The retraining arrow drawn here looks very simple, but a lot is happening behind it: we have an automatic retraining cycle that runs periodically, and it can also be triggered by the customer if they change their data sharing.

Now that we know the system architecture on both the training and serving sides, we'll cover the system's evolution: how we evolved from a very simple system to what we have right now. In version zero we started with a rule-based system. We didn't really think about training a model on day one; we first wanted to show that the business goals could be met by what we were trying to build, so we went with heuristics, a rule-based system, integrated it with our Salesforce app, and that is how it started. We signed up the first pilot, and the pilot was our partner in a way: we learned from pilots about their use cases and what could be generalized. Our first use case, however, was targeted at a collaborative user space, the Communities specifically. But the questions coming into a community are not that different from the questions that come up in a service setup: the length and nature of the questions may differ, but at a technical level the problems have a lot in common. Back in the Communities days we used "best answer" as the positive label, and eventually we went on to train a generalized model on an open dataset.

In version one we had the first glimpses of the model we are building upon now: a ranking model, as we talked about earlier, trained using offline notebooks and on demand; it wasn't automated in the first cut. We also had a static dataset. Customer data at Salesforce is very dynamic and configurable, but we didn't start that way: we let the customer specify only the entities, without any facility to specify which fields, which projections or selections, or which filter criteria would be applicable. So we didn't have that in the beginning. Eventually we had to build it, because customer orgs have very specific requirements and manage their data in very different, unique ways, so we incorporated a setup flow where they can specify the selections and projections on their datasets.

The next few bullet points are key milestones. We invested heavily in retraining so we could keep improving model quality as new data comes in; that tackles data drift, the fact that the data keeps evolving over time. Then we added multilingual support to expand beyond English to our European customers. And of course the auto-training pipeline I've been referencing quite a bit, which is a key highlight of our system: engineers and scientists now have minimal involvement in running it in production. Finally, observability and deployments with rollbacks were key to having a robust system; we invested heavily there, and that's in production now as well.

Okay, so looking a bit more into our model deployment, continuous integration, and rollback strategy. In the top flowchart, the developers or scientists are the ones continuously upgrading the training code and adding new features to it. Once that happens, it goes through a typical continuous-integration build cycle: the code is built and bundled into a training image, which is pushed into the container registry. No surprises there; it's something we invested in, and we now have a stable system where this happens in a very consistent fashion. The deployment cycle then has a rollback mechanism. The DevOps person kicks off a deployment by updating the image tags for the staging and production targets; the images are pulled from the container registry that was updated earlier, deployed to the test environment, and then to the production environment if the test environment is successful. If it's not successful, we can easily roll back: it's an option in our workflow system where they can pretty much just say "roll back," and the previous image is redeployed on the target tag.

Okay, looking a bit into the container itself: this is about making the container independent of the cloud, so we can host it anywhere; we host it in a managed training service. It takes in the parameters needed for training, the hyperparameters themselves, the training dataset for model training, and some static configs used for housekeeping inside the container. Once training happens, we get the model; it is pushed to a storage bucket on the cloud, exposed through a model API, and the serving system picks it up from there.
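To illustrate the shape of such a cloud-agnostic training container, here is a minimal entrypoint sketch. The argument names, the default output path, and the `train` helper are hypothetical stand-ins, not our actual interface:

```python
# entrypoint.py -- hypothetical training-container entrypoint. The container
# only sees local paths and supplied settings, so the same image can run on
# any cloud's managed training service.
import argparse
import json
import pathlib
import pickle

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--hyperparams", required=True)    # JSON file of hyperparameters
    p.add_argument("--train-data", required=True)     # mounted training dataset
    p.add_argument("--static-config", required=True)  # housekeeping config (logging, limits, ...)
    p.add_argument("--model-out", default="/opt/ml/model/model.pkl")
    args = p.parse_args()

    hyperparams = json.loads(pathlib.Path(args.hyperparams).read_text())
    config = json.loads(pathlib.Path(args.static_config).read_text())

    model = train(args.train_data, hyperparams, config)  # stand-in for the real training routine

    # Write the model artifact locally; the managed service (or a thin
    # wrapper) syncs this path to a cloud storage bucket, from where the
    # model API and the serving system pick it up.
    out = pathlib.Path(args.model_out)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    main()
```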
All right, before we wrap up, let's talk about the challenges and the takeaways. The first category of challenges is around data: privacy and data-sharing compliance; handling encrypted data at rest and in motion; and data freshness, for which we had to build a hydration pipeline to make sure the data, and as a result the models, are always up to date. Then there's tackling sparse versus dense data: if the data is too sparse, it might not meet the requirements for the model to be trained successfully, so we have a fallback of using a global model that lets the customer get started; that also ties into the cold-start problem I mention here at the bottom. Another key highlight is custom and non-standard fields. That is something very unique to Salesforce: being a very complex platform that lets customers customize their fields, we have to take that into account and make sure the training pipeline can handle all the corner cases around custom fields. Building the ML infrastructure along the way was of course a challenge; we had to invest in it and learn as we went. And of course training/serving skew, a common problem and an easy trap to fall into if the libraries aren't shared, or if there is, for example, a feedback loop from the model back into the training algorithm. It can very easily happen, so it's a good learning to watch out for.

Along the same lines, the takeaways. Start small, ship, and iterate. Prioritize your ML infrastructure from day one, so you can train your models in a consistent fashion and not just as one-offs. Start with a simple, interpretable model: it will help you debug, and time to resolution for customer issues is very critical in the beginning, when you're onboarding pilots. Also keep the size of your data in mind, because that will influence the type of model architecture you choose. Finally, I'll say prioritize your observability and your data privacy; as we discussed earlier, this is very important, so always prioritize data privacy over model quality. And invest in your infrastructure overall.

Perfect. Thank you, everyone.