AWS re:Invent 2020: Intelligent document processing: Extract data & insights from documents


Welcome to this session, AIM215: Intelligent document processing, extract data and insights from documents. I am Mona, and I am an AI/ML Specialist SA here at AWS. I am joined today by Sandeep Mistry, who is a senior machine learning engineer at Assent Compliance. Together we're going to talk about how AWS services and solutions can help you get started with automating your existing documents quickly, without any prior experience in machine learning. This is a 200-level session and not a technical deep dive; we will cover the services and solutions for this use case at a high level, and in case you want a deep dive, we are going to provide resources at the end of this session to help you get started.

Talking about the agenda: we will cover why we need to automate document processing, we'll look at what makes this use case so complicated, and then we will talk about how AWS solutions and services can help you with it. Lastly, Sandeep is going to talk about how Assent Compliance has solved the problem of understanding documents using AWS AI/ML services.

Talking about the current state of documents: we all know that the document is a very important tool for communication, collaboration, and record keeping among organizations. Moreover, documents can come in various formats, languages, and sources: for example, you have emails in French, you have your tax forms, and so on. These documents are in silos and not in one place, so in case you would like to search for something within them, it becomes really challenging to do so. Lastly, they are growing in number every year: according to Gartner research, organizations worldwide have recorded a 25% growth in the use of paper each year.

So we have figured out why documents are important; then the question is, why do we need to automate them? Documents hold a lot of information which impacts your life, my life, and your business's life. If you are able to extract the critical information trapped within these documents, you can do many things, such as search, automate your existing business processes, and implement compliance and control.

Let's talk about some of the ways people are processing these documents today, and the challenges. The first approach is, guess what, a good old human being: we look at a document, we key all the values into an application, and we hit save. That's manual processing, and we all know that extracting text manually is time consuming, error prone, and expensive. The second approach most organizations are taking is using traditional optical character recognition or rule-based systems to extract text. Well, that's awesome, except that we found it extremely limiting, as these rule-based systems are not intelligent enough and they break with format changes. The third approach is using machine learning. There are not a lot of machine learning practitioners today, and even if you have them, the challenge is having a large labeled data set: 80% of the time is spent just on data labeling. And even if you've got your perfect model, the challenge is to deploy that model in production and scale it to process millions of such documents. Lastly, most machine learning applications require a human to review low-confidence predictions to ensure that results are correct, but we all know that building human review systems can be time consuming and expensive, because it involves implementing complex processes and managing a large group of reviewers.

Let's see how AWS can help solve some of these challenges. We broke the problem of understanding documents into three simple steps: extracting text from documents, getting insights from this extracted text, and lastly having human oversight in between these processes (or after them), and AWS AI services can help with each step. Let's quickly understand what AWS AI services are first. If you look at our AI/ML stack, it's broken down into three layers, and in the top layer we have our AWS AI services. These services make it really easy to incorporate AI into your applications. Why do they make it easy? Because they are pre-trained models, without you having to build and train algorithms; you do not need any prior machine learning experience to get started. They are just APIs: you send a request to perform an AI operation, such as convert speech to text or convert documents to text, and you get a response back. It's as simple as that. We are going to focus on two of these AI services today, Amazon Textract and Amazon Comprehend. We will also cover Amazon Augmented AI, which is part of Amazon SageMaker, the middle layer of the AI/ML stack.

Let's talk about how we can help you extract data from documents, which is the first step of document understanding. Amazon Textract is a fully managed machine learning service that automatically extracts text from documents. Some of the key benefits of Amazon Textract: first, it can help you extract data quickly and accurately, because it's built for scale and can help automate document workflows, enabling millions of document pages to be processed in hours rather than weeks. Second, it reduces the manual effort you would have to put into any kind of template development. Why? Because Amazon Textract does not use templates; rather, it uses machine learning under the hood. Lastly, it's cost effective. So those are the three benefits; let's talk about how Amazon Textract works. All you need to do is send the document to the Amazon Textract APIs, and these documents can be either images or PDFs. Amazon Textract will give you a response back in the form of detected text, along with a confidence score.

Talking about some of the key features of Amazon Textract: not only does it extract text, but it's also a smart system, which is very useful when you are working with documents that have multiple formats. For example, in case you are working with a two-column report or a request-for-proposal document which has multiple segments, Amazon Textract will give you the pages and the paragraphs, along with the lines as well as the words. It will also give you the exact positions of these words and paragraphs in the document, which is very important for context. Another key feature of Amazon Textract is that it has the intelligence to segment documents to understand where there are forms or tables. In the case of forms, it will give you the key-value pairs within the forms, and in case you have invoices or receipts, you can use the Textract APIs to detect tables; not only will it give you the table, it will also give you the cell-, row-, and column-level information within it. Moreover, documents can be in different languages, and Amazon Textract supports multiple languages. Another very common challenge customers have faced with document understanding is documents that mix formats, such as handwritten content alongside typed content; these can be, for example, employment applications or medical intake forms. This brings us to another key feature of Amazon Textract, which is handwriting detection: Amazon Textract can help detect handwriting in these documents. Lastly, you can integrate Amazon Textract with Amazon Augmented AI to bring a human into the loop to review extracted data that has to be highly accurate, for example an SSN or an invoice amount, which is critical information. It's like having a first pass of getting text from these AI services, and then using humans to double-check what the AI has predicted for you.

All right, we talked about how Amazon Textract can help you extract data from documents. This data is unstructured, and there is a lot of potential sitting in your unstructured data; the question is how to get it. Amazon Comprehend is going to help. Amazon Comprehend is a natural language processing service which will help you find insights and relationships in any unstructured data. Similar to Amazon Textract, it requires no machine learning expertise to get started with, as these are just APIs, and any developer can use it in their applications. How does Amazon Comprehend work? You provide input text, which can be emails, chats, or the text extracted by Amazon Textract. Amazon Comprehend has pre-trained models which can identify various types of insights from the text. These insights can be entities such as places, people, brands, or events. For cases where you have text in different languages, it determines the dominant language of the text. It extracts key phrases, understands how positive or negative the sentiment of the text is, does topic modeling for you, classifies the text, and will also help detect, redact, and mask personally identifiable information, or PII, in text. Another advantage of using Amazon Comprehend is that you can perform all of this at limitless scale without spinning up any servers. You can also bring your own data and use Comprehend's custom AutoML capabilities to build a custom set of entities with Amazon Comprehend custom entities, or perform document classification with Comprehend custom classification, tailored uniquely to your business. The advantage of using Comprehend custom is that you don't need a very large training data set to get started.

Customers always ask about data security while using these services. Security is our top priority: you can use AWS Key Management Service to encrypt and decrypt the documents, and you can use AWS PrivateLink to access the Amazon Textract and Comprehend APIs securely within your own Amazon Virtual Private Cloud. Lastly, for customers running workloads which require compliance, these services follow a broad range of compliance programs, and for more information you can check our compliance website. Now let's understand how we can have human oversight in document understanding.
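Before moving on to human oversight, here is a minimal sketch of walking a Textract-style response for the text-with-confidence output described above. The `sample` dict is hand-made for illustration (a real response would come from a boto3 `textract.detect_document_text` or `analyze_document` call), though its `BlockType`, `Text`, and `Confidence` fields mirror Textract's documented response shape:

```python
# Sketch: collecting detected lines from a Textract-style response.
# The sample response is fabricated for illustration; a real one comes
# from boto3's Textract client.

def extract_lines(response, min_confidence=90.0):
    """Return (text, confidence) for LINE blocks above a threshold."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE" and block.get("Confidence", 0.0) >= min_confidence:
            lines.append((block["Text"], block["Confidence"]))
    return lines

sample = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice #1042", "Confidence": 99.1},
        {"BlockType": "LINE", "Text": "Total: $540.00", "Confidence": 88.3},
    ]
}

print(extract_lines(sample))  # keeps only the high-confidence line
```

The confidence score on every block is what later makes the human-review routing possible: anything below the threshold can be flagged instead of silently accepted.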
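The PII detect-and-redact capability mentioned for Comprehend can be sketched the same way. The response dict below is fabricated for illustration; a real one would come from boto3's `comprehend.detect_pii_entities`, which returns entities with the `BeginOffset`/`EndOffset` character positions used here:

```python
# Sketch: masking PII spans using character offsets, in the style of a
# Comprehend DetectPiiEntities response (response dict fabricated here).

def redact_pii(text, pii_response, mask="*"):
    """Replace each detected PII span with mask characters of equal length."""
    chars = list(text)
    for entity in pii_response.get("Entities", []):
        for i in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[i] = mask
    return "".join(chars)

text = "Contact Jane Doe at jane@example.com"
response = {"Entities": [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
]}
print(redact_pii(text, response))  # name and email masked out
```

Masking by offset rather than by string match means overlapping or repeated values are handled correctly, which matters when the same name appears in multiple roles in a document.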
Amazon A2I is a service which provides built-in human review workflows for common machine learning use cases, such as content moderation and text extraction from documents. Some of the key benefits of using Amazon A2I: first, it can help you easily implement human review workflows, because we provide you with 70 different UI templates to get started quickly with pre-built workflows for various use cases. Second, you can put your models into production very fast, as A2I reduces the time to market. Why? Because even if your model is not up to the highest standard, you can still put it in production knowing that there is a human backstop to catch the low-confidence results of your machine learning model. Third is the flexibility to use multiple workforces; we provide three choices here. The first option is Amazon Mechanical Turk, which is an on-demand, 24/7, globally distributed workforce. In case you have really sensitive data and want to keep it within your organization, you can use the second option, which is a private workforce. And lastly, you can use various vendors through AWS Marketplace. Another benefit is that you can integrate A2I with any custom machine learning model, and we'll see that in a bit. Lastly, it helps increase worker efficiency.

Talking about how Amazon A2I works: your client application sends input data to your machine learning model. This could be any custom machine learning model, or it could be an Amazon AI service like Amazon Textract or Amazon Comprehend. Once the model makes a prediction, two things can happen. In the first scenario (the third step in this workflow), you can directly use that prediction in your client application. Or, in the case of a low-confidence prediction (the bottom step in this workflow), you can define a threshold to kick off an A2I workflow automatically, which will send the low-confidence predictions for human review. These reviewers can be your private workforce, and once the review is done, the results are consolidated into a single answer in Amazon S3, from where they can be consumed by the client application. Another important thing A2I can help you with is continuous model improvement for your custom models: you can take your A2I results, combine them with your existing training data set, and retrain your model.

Lastly, putting it all together: we covered Amazon Textract, Amazon Comprehend, and Amazon A2I. Now let's see how we can combine these services in an architecture to create a serverless, scalable document understanding pipeline. Think of it like this: you have documents, which can be mortgage documents, legal documents, or financial documents. You use the Amazon Textract APIs to extract text from these documents, in either a real-time or a batch manner. For highly accurate results, and for validating things like stock codes or invoice numbers from Textract, you can set up a human review loop using Amazon Augmented AI. Then you can send this data to Amazon Comprehend to extract insights such as Comprehend custom entities, perform sentiment analysis, or classify these documents and organize them by topic; there are so many things you can do. Moreover, you can set up a second human review loop to review low-confidence predictions from Amazon Comprehend. From here you can do many things. First, you can index all of the text and insights you have received from Textract and Comprehend and send them to Amazon Elasticsearch Service, and in case these documents are your archives, all your archives are searchable now. Second, you can store this text and these insights in any relational database or data warehouse, like Amazon Redshift, and start deriving business insights using SQL queries. And lastly, you can also implement compliance and control.
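The pipeline just described can be sketched as three stages with a confidence gate in the middle. Everything below is illustrative: the function names, the document shape, and the 90% threshold are not part of any AWS API; in a real deployment each stub would be a boto3 call to Textract, A2I, or Comprehend.

```python
# Illustrative end-to-end flow: extract text, route low-confidence
# results to human review, then derive insights. All three stage
# functions are hypothetical stand-ins for the real AWS service calls.

CONFIDENCE_THRESHOLD = 90.0  # illustrative; tune per use case

def extract_text(document):
    # Stand-in for Amazon Textract (detect_document_text / analyze_document).
    return {"text": document["body"], "confidence": document["scan_quality"]}

def human_review(extraction):
    # Stand-in for an Amazon A2I human loop; a reviewer confirms the text.
    return {**extraction, "confidence": 100.0, "reviewed": True}

def get_insights(text):
    # Stand-in for Amazon Comprehend (entities, sentiment, classification):
    # here we just pick out capitalized words as mock "entities".
    return {"entities": [w for w in text.split() if w.istitle()]}

def process(document):
    extraction = extract_text(document)
    if extraction["confidence"] < CONFIDENCE_THRESHOLD:
        extraction = human_review(extraction)
    return get_insights(extraction["text"])

print(process({"body": "Invoice from Acme Corp", "scan_quality": 72.5}))
```

The key design point from the session survives even in this toy version: the threshold check sits between extraction and insight generation, so only low-confidence results pay the cost of a human loop.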
Moreover, you can use AWS Lambda functions and AWS Step Functions to orchestrate all of this workflow in a serverless manner, based on your use case. To see some of these architecture components in action in a demo, I'm handing it over to Sandeep now.

Hi, I'm Sandeep Mistry, a machine learning engineer at Assent Compliance. Assent Compliance helps companies collect and manage the supply chain data they need to conform to changing global requirements. The platform consists of three components: corporate social responsibility, product compliance, and vendor management. Corporate social responsibility allows you to see how your company is doing with regard to operations, for conflict minerals, human rights, anti-bribery and anti-corruption, and sustainability. Next up, product compliance: this allows you to check whether anything is present in your products that should not be, and what to do if such substances are present. Various regulations apply here, including REACH, the EU Medical Device Regulation, RoHS, and the EU Waste Framework. And finally, vendor management: this allows you to measure how your company is doing with regard to trade, and whether you can have a more effective supply chain. We have many different types of customers in different verticals, for example aerospace, electronics, medical, automotive, industrial equipment, retail, and oil and gas.

The first area Assent applied machine learning to was intelligent document analysis. We're currently collecting 20,000 documents a month from suppliers around the world on behalf of our customers. If you're a customer of Assent, you would upload your product part list to our platform, along with your suppliers' contact details; then Assent can campaign on your behalf to ask your suppliers for the relevant documents. We're currently collecting documents on RoHS, the Restriction of Hazardous Substances, a regulation you need to comply with if you're selling products in the EU. Then, again for the EU, there's REACH, the Registration, Evaluation, and Authorisation of Chemicals; this is another regulation that applies to the EU. In the US there's the Dodd-Frank Act Section 1502, regarding conflict minerals, and, more specific to California in the United States, there's Proposition 65. So we're collecting over 20,000 documents a month from suppliers, and we can't have humans read them all, so we wanted to apply machine learning to help aid this process.

Let's go over a sample document now. This one is about REACH, and it's just an example; there's no structured format for REACH documents, so it's free-form text. On the first half of the document we have the company letterhead; we would like to check if one is present or not. Also on the first half there's some company contact information: here you can see an address, a telephone number, and a website link. Since this document is about REACH, we'd like to find a legislative reference to the REACH regulation; you can see it highlighted in orange. If that's not present, then it's not a good document for REACH. REACH also has a new revision every six months, so we'd like to see a date reference showing what version of REACH this document conforms to. Then, on the second half of the document, in this case there's a list of part and product references, so we'd like to extract those. We'd also like to know whether a human signature is present or not, and then there's a date for the document. This is different from the date of reference: this is the date the document was made, and specifically for the REACH regulation this needs to be within 18 months. They're blurred out for privacy reasons, but we'd also like to extract the name, phone number, title, and email address of the company representative.

Now I'll give a demo in our sandbox area, where we show progress and new features to our stakeholders. Our goal is to extract and verify information from a document within 15 seconds. We understand that mistakes can happen, like an incorrect file being uploaded by a supplier; ideally, these mistakes are corrected while we have the supplier's attention. I've just dragged in five documents for our system to analyze. In this graph we can find a summary of the regulations that were found in these five documents, and for more details on a document we can click a row entry. On the left-hand side is the original PDF which I uploaded, and on the right-hand side is a summary of the features that our system found in this document. In this case, no email address was found; however, the other features, like signature, letterhead, part numbers, name, URL, address, organization, and phone number, were found.

Now let's walk through how this was built on top of AWS services. This slide is a high-level architecture view, with no machine learning yet; I'll be going over that shortly. First we have a user with a document, and they're going to use their browser, which has the AWS Amplify stack inside, and we're using Amazon Cognito for authentication and authorization. When someone starts a document analysis, the AWS Amplify SDK communicates with our AWS AppSync endpoint using GraphQL. This request then gets processed by an AWS Lambda function, which queues a message on an SNS topic for processing by a Step Functions workflow; this does the heavy lifting, and we'll go over it shortly. Once the document has been analyzed, a message is sent back to the AWS AppSync endpoint, which then uses GraphQL subscriptions to notify the user's browser of the result.

Now let's go into the details of our AWS Step Functions workflow. On the left-hand side we have the original document that was uploaded by the user, typically a PDF. This is not something machine learning models understand, so we have to transform it into something they do: we take each page and transform it into an image, and then, to get the text from that image, we pass it on to Amazon Textract and get the text back. Humans can only process one page at a time, but we've optimized our Step Functions workflow to process pages in parallel; for example, if a document has three pages, we can process all three at the same time to decrease latency. Now that we have each page as an image and as text, we can apply image processing in one flow, and at the same time do text processing using services like Amazon Comprehend, and get results from both. Once we have the results for all pages, we consolidate them; for example, in our use case, not every page needs to have a signature, and it would be okay if only the last page had one. Once we have the consolidated result, we persist it to a database like Amazon Aurora and then send the result back to AWS AppSync.

Now let's look into how we process the image. We've managed to port the TensorFlow Lite Python runtime to run in a Lambda environment, so we don't have any servers for machine learning inference. This allows us to scale very well, there are no server costs, and we process documents as they come in. From the output of this image classifier we get a score on the likelihood of a letterhead being present on that page, and the likelihood of a signature being present on that page, and then we output the results.

Processing the page as text is a bit more complicated. First we use Amazon Comprehend to detect the dominant language of the page text. Our system currently only supports English, so if it is English, we pass it on to Amazon Comprehend's entity recognition service to extract any company names and part numbers present in the document. At the same time, we use Amazon Comprehend's personally identifiable information (PII) service to extract any names, addresses, phone or fax numbers, email addresses, dates, and links in that text. And then we have a TensorFlow Lite text classifier model, instead of an image classifier, which processes the text and identifies the main legislations in it, and we output the results.

Okay, now I'll walk through how we built our custom classifier models on top of AWS services. We start with some unlabeled data: either a page as an image or a page as text. Then we use Amazon SageMaker Ground Truth to get humans to label the data. Here's an example UI that one of our workers would use: the page is presented as an image, and they can draw a bounding box around any letterheads or signatures present in the image. We currently have 250 workers internally labeling data for us. We can also label text, as you see here, and the worker can select one or more of the appropriate categories that the text falls under. Now that we have labeled data, we can use an Amazon SageMaker training instance to train a machine learning model. We have a custom NVIDIA Docker container with the AutoKeras library, which can build either an image or a text classifier for us. Once we train this model, we have a Keras model, which we then need to convert to TensorFlow Lite format to run in our Lambda environment. In the future, we'd like to integrate with the Amazon A2I service, so we can get humans in the loop when there are uncertainties in our models' predictions, and get better accuracy.

We're very happy with our partnership with AWS. The project started off with two engineers, myself and Corey Peters, for the first six months; then we were able to gain buy-in from our stakeholders and leadership team to grow and expand the project as well as the team. In the future, we'd like to continue to retrain our custom models with newly human-labeled data, and also build new machine learning models based on input from the regulatory team. As I mentioned earlier, we'd also like to integrate with Amazon A2I to get better accuracy in the field when our machine learning models are uncertain in their predictions. We'd also like to combine the positional information we get from Amazon Textract with NLP: for example, the document date typically falls in the first half or the second half of the page, and not in the middle, so we can use the positional information from Textract to identify what type of date we need to use. That's all I had; now back to Mona to close things off.

Thanks, Sandeep. It's pretty innovative what you have accomplished with these services in so little time and with so few engineers. Summarizing what we have covered: we combined these three services to create a document processing workflow which you can use to automate multiple use cases. Moreover, another important thing is that you can use these services independently. In case you just want to extract text, you can use Amazon Textract alone, and in case you just want to extract insights from unstructured data, you can use Amazon Comprehend alone. And you can set up a human review loop with your custom models, or combine it with any of our AWS AI services. Here are some references and resources to help you get started. Thanks for staying with us, and I hope you enjoyed this session today. Thank you.
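As a recap of the per-page fan-out and consolidation step Sandeep described, here is a minimal sketch under stated assumptions: `analyze_page` is a hypothetical stand-in for the real per-page image and text analysis, and the consolidation rule mirrors the signature example from the talk (one signed page is enough).

```python
# Sketch: process pages in parallel, then consolidate per-page results,
# like the Step Functions workflow described in the talk.
from concurrent.futures import ThreadPoolExecutor

def analyze_page(page):
    # Hypothetical stand-in for the real per-page work: image
    # classification for letterhead/signature plus text analysis.
    return {"page": page["number"], "has_signature": page["has_signature"]}

def analyze_document(pages):
    # Fan out over pages in parallel to decrease latency.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(analyze_page, pages))
    # Consolidate: the document counts as signed if any page is signed.
    return {"pages": results,
            "has_signature": any(r["has_signature"] for r in results)}

doc = [{"number": 1, "has_signature": False},
       {"number": 2, "has_signature": False},
       {"number": 3, "has_signature": True}]
print(analyze_document(doc)["has_signature"])  # True
```

In the real system the fan-out is a Step Functions parallel/map state rather than a thread pool, but the shape of the logic, independent per-page analysis followed by a document-level reduce, is the same.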
