Good morning to everyone. On behalf of Microservices at Your Service: Bridging the Gap between NLP Research and Industry, I am very pleased to welcome you to this online workshop, which we have entitled "ELG as leverage for NLP development". First of all, let me introduce myself: my name is Joaquín Lago, I work at Gradiant, which is one of the partners of the Microservices project, and I've been working with natural language processing technologies for a couple of years now. Today I'm just going to be the presenter of this event, and I will try to do my best to give the floor to the right speaker at the right moment and to take care of the questions you, our audience, may have during the event. So please feel free to use the chat of this platform or, even better, the question and answer box, to let us know your doubts, comments and questions. We have booked a slot of more or less ten minutes for answering all of those questions at the end of the event, so don't be shy; we will try to answer all requests if we can.

With this workshop we have two main objectives in mind. The first is to let you know about the European Language Grid: we expect that when you leave this workshop you will have a better idea of what the ELG is, how to benefit from the ELG, and how to interact with the European Language Grid. The second objective is to do a bit of dissemination of our project, the Microservices at Your Service project. To do so, we have split our workshop into three parts, led by three different speakers.

First of all, Mr Sebastian Andersson will tell us about the Microservices project itself. Sebastian works at a Finnish company called Lingsoft, he is an expert in speech and language processing and natural language processing technologies, and he has been the leader of the Microservices project for around a year, so he is the person we need to talk to to learn everything we need to know about the Microservices project. Thank you, Sebastian, for joining us today. Then we are glad to have here with us Mr Ian Roberts, from the Department of Computer Science of the University of Sheffield, as our expert in ELG. Mr Roberts has participated in the development of the European Language Grid, the ELG platform, and he is co-author of many papers about the European Language Grid. Thank you, Ian, for joining us today. Then my colleague Pedro Alonso will show us how to interact with the ELG platform. Pedro, in the scope of the Microservices project, has been working with the ELG platform for around a year, so he can give us some tips and clues about how to use the platform and how to get the most benefit from it. Thank you, Pedro, for joining us. Then we expect to have a little break, just to let you think and recap about what we are going to tell you today, and after the break we will have that little time slot for questions and answers, and I will pass you through the chat a link to a very short feedback survey that we will use to send you back certificates of attendance for this event. So thank you all for being here, and please, Sebastian, I give you the floor to let us know about the Microservices.

Thank you. Let's see if I can share here... there was an annoying little button there. All right, let's move you here. All right, so I'll present the Microservices at Your Service project, then.
This project consists of a consortium with Gradiant from Spain, Lingsoft from Finland, Reykjavik University from Iceland and the University of Tartu from Estonia, and the whole project is co-financed by the Connecting Europe Facility of the European Union. We looked at the agenda a little bit before already: I'll start by introducing the Microservices project, then Ian Roberts will introduce the ELG platform, and then Pedro will take a deep dive into how to do things with ELG. Then we'll take a short break, five to ten minutes depending on how our schedule holds, and finally we'll have a Q&A where we can hopefully answer some of your questions.

Our project is concerned with European speech and language technology, and I'm sure you're familiar with it: there are many variants of these technologies, like speech recognition, machine translation and so on and so forth. We have large international organizations that provide these types of services, often in a software-as-a-service manner, for many languages, and sometimes they even offer a platform as a service where you can add your own tools; as a little parenthesis, you can also rent your infrastructure there if you have your own platform, and then it's called infrastructure as a service. So then all is good, right, if you want Estonian MT, Finnish ASR, Icelandic TTS and so on? Our impression is that there are gaps in the services provided by the large players, especially for languages with not so many speakers; I think it's fair to say that we have failed to be the first, second or third priority of the large corporations very often. But we have a strong speech and language technology research community, in particular in the areas that we have looked into, and there is a strong tradition of sharing the tools as open source code. And then we also have the European Language Grid, which aims to establish itself as the primary platform for language technology in Europe, in competition with these organizations from Asia and America, and it aims to do this for both commercial and non-commercial language technologies; Ian will tell you more details about that.

So, back to the open source tools, then: their benefits and challenges. The benefits are that they are a good way to get hold of the latest and greatest in research in language and speech technology, there are many established and trusted tools in this flora and fauna of tools, and of course they are free to use and free to adapt; as an example, both Lingsoft's speech recognition and machine translation are based on open source toolkits. The challenges: they can be difficult to find unless you're involved in that particular line of research and know the research groups, and sometimes when you find them they can be quite poorly documented, especially when the documentation is lagging behind the latest update, so you don't get the support you need to install them. And when you come across a tool lying around, it's not always easy to see whether it is worth the trouble in the end: sure, they advertised it with some academic references, but does it really work? And finally, if you do manage to install it, it's not so easy to share it with the colleagues that need to use it, or to reinstall it.

So this is where our project comes in: we have funding and a plan for digging into these research groups and their GitHubs, finding and testing the tools that we find interesting, and then packaging them; I'm dropping some terminology here already: dockerized solutions plus APIs. We package the tools, dockerize them and put an API on them, and then they are easily shared with others, and we are sharing them via ELG. We have our own uses for these tools, but the idea is that others will also find them useful, for the benefit of researchers and for the benefit of integrators.

Taking a little bit of a step back, then: what is it that we package? I mentioned already the software-as-a-service paradigm, where you basically access tools via the internet, often via some API, in our case a REST API. So we have here, in the right-hand box, my application, where I have a speech recognition tool, for example; I add a REST API to it, put it on a web server and expose it to the internet, and then other integrators can call my speech recognition, send audio over the internet and receive back a transcription. These SaaS services you can package with Docker, and what you end up with when you package a service with Docker is a virtual machine of that service, and that virtual machine you can share easily, with an email or, more commonly, a download page. Those packages are called images, and the benefit is that this makes it very easy to install tools with different dependencies on the same host machine. For example, I share one GPU machine with some colleagues of mine, and we all have our different sets of tools in Docker containers, which makes it very easy for us to share the same GPU but with completely different tools. And if you look at the research community: if you participate in a machine translation challenge, for example, you can package your tool into a Docker image, and the next year someone else can easily try your tool with a different set of data and compare how their new version does against your old version, and so on.
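As a rough illustration of that sharing workflow, a consumer of such an image might do something like the following; the image name, port and endpoint path here are hypothetical, just to make the idea concrete:

```sh
# Pull a (hypothetical) dockerized ASR service image from a registry
docker pull example/finnish-asr:1.0

# Run it locally, exposing its REST API on port 8000
docker run -d -p 8000:8000 example/finnish-asr:1.0

# Send audio over HTTP and receive a transcription back
curl -X POST http://localhost:8000/process \
     -H "Content-Type: audio/x-wav" \
     --data-binary @recording.wav
```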
Docker also makes it easy to combine these different images into a microservice architecture. In a microservice architecture, you basically package your NLP components, your speech and language technology, into these Docker containers, and then you have a platform that orchestrates the containers; Kubernetes is a common option. This type of architecture makes it very easy to integrate open source components and third-party components, you can easily add new languages and functionality, provided you find a tool that does what you like, and you can also easily replace outdated components when you find a better one. And on top of that Kubernetes layer you have a REST or API layer that clients and customers can then use to integrate your tools. This is a simplified picture of how ELG works, it's a simplified picture of how Lingsoft's platform works, and to my understanding it's a fair representation of how the big organizations' clouds work as well.
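For concreteness, here is a minimal sketch of what hosting one such containerized tool on Kubernetes could look like; all names and the image reference are invented for illustration:

```yaml
# A Deployment runs the dockerized NLP tool; a Service exposes it inside the
# cluster so the REST/API layer can route requests to it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: finnish-asr
spec:
  replicas: 1
  selector:
    matchLabels:
      app: finnish-asr
  template:
    metadata:
      labels:
        app: finnish-asr
    spec:
      containers:
        - name: finnish-asr
          image: example/finnish-asr:1.0   # hypothetical image
          ports:
            - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: finnish-asr
spec:
  selector:
    app: finnish-asr
  ports:
    - port: 80
      targetPort: 8000
```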
If you look at the progress so far, what we have found: for example, we have found HeLI-OTS, a language identification tool provided by the University of Helsinki that we have found very useful in our own text pre-processing pipeline, for dividing up "is this Swedish, is this Finnish". We have dockerized it, we have put it on Docker Hub, and it remains to put it on ELG, but that's work in progress. And we have some other examples down here, like BERT models from Sweden's and Norway's national libraries, for example.

If I summarize: this is a three-year project, we are pretty much exactly halfway, so we expect to finish next year, and the objective is to make at least 40 tools available as these easily integratable microservices, for a total of 11 languages. We have a project web page where we continuously update this list of tools, and we also have links and recordings to previous workshops, for example one on how to make these Docker images and add an API yourself, and I would say it's surprisingly easy. And I believe with that I'll hand over to Ian, or Joaquín.

Thank you, Sebastian. Now we will let Mr Ian Roberts tell us about the ELG. Please, Ian.

So yes, as Sebastian said, my name's Ian Roberts, and I'm one of the developers behind the European Language Grid platform and the EU project. I work at the University of Sheffield, but I've obviously been working very closely with colleagues from the other partner organizations in the ELG project since the beginning of 2019.

A little bit about the background first. It goes without saying that multilingualism is at the heart of the whole European idea and the ideals of the European Union. There are 24 official EU languages that all have the same legal status within the EU, as well as dozens of regional and minority languages, like some of those that Sebastian was mentioning, and then languages of immigrant and trade partner countries; that includes the European Economic Area, so Norway and Iceland and so on, but also languages of important trade partners outside of Europe as well. So if we want this idea of a digital single market within Europe, then it needs to be multilingual, and there are many economic, social and technical challenges presenting themselves in trying to reach this point. Obviously, one of the big challenges is that within Europe the language technology landscape is very much fragmented. Sebastian referred to the big players, the likes of the Googles and the Microsofts and Amazons, but particularly for the smaller languages, the EU and European language technology marketplace is generally made up of many small companies, small and medium enterprises, which specialize in a very small number of languages or in a particular domain within language technology, whether that's speech, whether it's text, or whether it's even more fine-grained than that: there will be individual companies that specialize in, you know, processing medical data in Hungarian. And part of the challenge is to bring all of these small players together in one place, so that the services they offer can be more easily findable by potential customers.

I don't expect you to be able to read everything that's written on this, but this is a kind of timeline that sets the European Language Grid in its historical context, and it all really dates back to the foundation of the META-NET network of excellence in 2010. There has been a whole series of other initiatives and projects since then, which eventually led to the foundation of the European Language Grid in 2019. There are a couple I particularly want to draw your attention to here.
In the early days of META-NET, in about 2012, we put together a set of what we call white papers, detailing the state of support for language technology in a number of the European languages in the digital age: some official EU languages, some of the minority languages we were talking about before, and some of the partner languages like Icelandic. Unsurprisingly, and we're talking 2012 here, and again you can't necessarily read this slide in detail, the take-home message is that the only language that had quite good support across the board was English. There were some other languages, French, Spanish, German and Italian in particular, that had reasonable support across the different areas of language technology, but for the majority of languages support was somewhere between fragmentary and non-existent. And on this graph you can see that English had good support, there were those few languages that had reasonable support, and then a long tail of languages for which the support was very much fragmentary. That was 2012; we're ten years later now, and the baseline has moved up, but the overall picture is still the same: English is kind of approaching excellent, maybe, and some languages like French, Spanish, Dutch and German are getting towards good support, but it's still the case that there's a massive inequality between support for the major languages and support for the smaller languages.

The other thing worth mentioning is that in 2017 the European Parliament commissioned a study into language equality in the digital age, and that ultimately ended up in a European Parliament resolution that was passed by a landslide in 2018, which made a number of recommendations, like 45 different recommendations on various related topics, and several of those recommendations pointed towards the need to establish a Europe-wide LT platform for the sharing of services and resources, as a kind of one-stop shop for language technology in Europe. And this led to the foundation of the European Language Grid. There was also a study done for the Connecting Europe Facility that was essentially predicting that the language technology market in Europe would reach something like a billion euros in 2020, and another study that said there would be maybe a 29.5 to 30 billion dollar market worldwide by 2025, which were obviously incredible numbers at the time, but they're already obsolete: thanks to the massive rise in popularity of artificial intelligence and deep learning, these numbers are way out of date now. But basically the take-home message is that there's a big market, it's very fragmentary, and there are hundreds of small enterprises addressing very specific niches. So what we need is a single platform, a kind of hub, where all the different producers and consumers of language technology will know that if you want to make something available, you put it there, and if you want something to use for a particular task, you look there. And that was the genesis of the European Language Grid.

So the European Language Grid, as a Horizon 2020 project, began at the beginning of 2019, and these are a selection of the project objectives, which are pretty much what I said.
We're establishing a language technology platform and marketplace in Europe to tackle this fragmentation. We want it to be a platform for both commercial and non-commercial language technologies: both the kind of research-type services that Sebastian has been talking about, and commercial providers with their own commercial offerings, as a way to connect customers and providers; and also a way to store not just services but data sets, to deploy services and connect them, and to make the resources and services available for use by others, to enable businesses to grow and benefit from being able to scale up like that.

ELG was originally conceived as a three-year project, so it would have run from the start of 2019 to the end of 2021; it's been extended by six months because of the effects of the pandemic, so we're now officially in, I think, month 39 of 42. Within the last month or so we completed what we're referring to as the final release. I mean, it's not intentionally final: we had three milestone releases through the life of the project, but we've been continuously adding features on an incremental basis as we go along. So we had the official, in quotes, "final" release last month, but we're still adding more to it now, and we've been starting to get more contributions from outside the project.

So this is the kind of big picture: you've got the European Language Grid as the platform in the middle, and then you have organizations, both public organizations and private businesses, who have developed language technologies, either data resources or tools and services, and who are able to make their tools available through the ELG platform and have them properly indexed and linked so that they can be easily found. So when potential customers or potential users, which again can be public or private organizations, come along and want to find tools they can use, they can use the European Language Grid catalog to discover services that may be of use to them, test them live in the catalog interface, and then, when they find a service that works for them on their data, they can actually integrate it using the APIs in their own tools and make use of it in products or whatever else. And we provide things like a Python SDK, which Pedro will say more about later, which allows for easy integration of the ELG services into other software.

Aside from explicit contributions by individual providers, the ELG also harvests metadata records from other relevant repositories on a regular basis, from places like Zenodo, ELRC and Hugging Face; some of these you may have heard of yourselves, and Hugging Face is the kind of model library for machine learning models. The idea is that the ELG regularly pulls metadata, not the actual data but metadata, from these other repositories and creates links there, so that again you can come to the ELG as your starting point, but if the resource you want is hosted in another repository, we can redirect you there. So the idea is that ELG becomes this kind of metadata catalog that will cover everything.
It would become a joint technology platform for the whole European LT community, a joint tool, data and resource sharing platform, and a marketplace as well. What I haven't mentioned yet is that, as well as being able to provide these kinds of free-of-charge services, the vision is for ELG to be able to support commercial providers who make their services available for payment and subscription. This isn't quite in place yet, for legal-structure reasons, but it's what we're actively working on at the moment, and we hope it will go live within the next couple of months. And so the ELG becomes the kind of yellow pages, if you like, of the European LT community: a one-stop shop where both users and producers of language technology can meet.

The current state of play for the ELG: these numbers are actually slightly out of date already, because the slides I'm using were originally based on a set from about January or February; I made some updates earlier this week, and since then I've integrated another batch of extra services, so we're up to something like 770, I think, functional services directly integrated with ELG at the moment. The catalog has something like 12,000 total metadata records, which include 1,700 or so organizations; maybe half the resources in the catalog at the moment are corpora, data sets of one kind or another, and, like I said, there are 700-something records representing functional services that are directly integrated within the ELG, and that's the kind of thing this Microservices project is targeting.

As I mentioned, this is supposed to be a long-term, sustainable initiative, so while the Horizon project is nearing its end, the idea is that the ELG platform will continue to run beyond the end of the project, under the ownership of a new not-for-profit company, a new legal entity, and it's that legal entity that we're in the process of establishing at the moment. Once that is established, we will be able to start offering billing and payment support for commercial services. It's a bit difficult to do that while it's still a Horizon project, because the project doesn't really have a legal existence of its own, which is why it takes a little longer to get these things sorted out, but this is definitely on our radar, and we've worked out a number of approaches for covering the ongoing costs on a long-term basis. So "watch this space" really is the message there.

The other thing I want to mention is that part of the ELG Horizon project was what they call financial support for third parties. This was essentially two open calls for proposals for what we referred to as pilot projects, which were essentially offers of funding to external organizations to either integrate their services into ELG and broaden the ELG portfolio, or to demonstrate how they would use the services provided by the ELG to enhance their own products, as a way to demonstrate the usefulness of ELG to consumers. And one of the organizations funded in the first open call was Lingsoft, who integrated a number of their tools for speech and language into the ELG as microservices, and I think that was the germ of the idea for the Microservices at Your Service project.
I worked quite closely with Lingsoft back in 2020 to try and get some of their microservices integrated into the ELG. So, I'll show you a few quick graphs just to finish off. This is the number of resources over time, just showing you how things have grown through the lifetime of the ELG project; this particular graph runs from February 2020 through to November 2021, and you can see a gradual increase over time. This one graph doesn't really give a lot of information on its own; this is probably a more useful one, which is broken down by resource type. You can see that a big chunk of these, the red line at the top, is the corpora, the data sets; the yellow line in the middle is the records for organizations, and that's something else I'm going to mention in a little bit: the records in the ELG catalog don't just represent the services and the data sets, they also represent the organizations and projects that are the providers of these data sets, so it's all linked together in that way. The pink line is at the bottom just because the integrated services are somewhat swamped in numbers by the corpora that we've harvested from other metadata repositories, but you can see that we had grown to around 500 integrated services at the point when this graph was produced. Like I say, since then we've added another 250-plus services coming from a variety of different places, most notably another EU-funded project called NTEU, Neural Translation for the European Union, who are producing something like 500-plus machine translation models for all possible combinations of the 24 official EU languages; I think we've integrated about a third of their models so far, and we've got the other two-thirds in the pipeline. We also started very recently, obviously in light of recent events, working on setting up machine translation systems for Ukrainian, into and out of a number of EU languages, as part of a platform to assist refugees from the recent conflict. That's an ongoing effort, but we've certainly integrated probably a good 40 or 50 different services involving the Ukrainian language over the last couple of weeks, and then obviously we're beginning to get a decent number coming in from Microservices at Your Service as well. So this is just to give you a flavor of how things are moving on; you can't really show these numbers on a static graph, everything's growing all the time. This is a quick snapshot of the number of registered users: as you can see, we're up to something like 400 or so consumers and 200 to 250 or so provider users registered as of January this year.

So, I'll give you a quick look at the ELG platform itself. On the data consumer side, consumers can search and browse the ELG catalog for all different types of language processing services, data, related projects and organizations. There's a combination of a free-text search and a faceted search system, which allows you to filter by language, by service function, by license conditions of use and so on for the various resources, and to view the detailed information and the statistics about each resource.
If you're interested in services, you can actually try out and test language processing services live on the ELG site, and you can call the services from the command line or through the Python-based API; this is all stuff that Pedro will go into in more detail later on. As a provider, you can contribute by providing formal descriptions of your resources, your language technologies, and also of the organizations and projects, and these all get linked together, so you can follow the links from one to the next to the next.

I'm just going to give you a quick demonstration of that before I hand over to Pedro. So this is what you see when you go to the European Language Grid catalog. Like I say, there's a free-text search box across the top, but the main interesting part is this faceted search down the left-hand side, and you can see all the different facets you can make use of. In particular, say I'm interested in particular types of services: I'm only interested here in ELG-compatible services, which means the ones that have been fully integrated, that I can call from the ELG interface, and the function I'm looking for is, say, spell checking. You can see I'm already down to something like eight services here, which are various ones produced by Lingsoft. If I wanted to narrow it further, I could narrow down by language, and you can see these facets change as you search: the numbers filter down to just the ones that apply to what you've found so far. So let's have a look at the English proofing tools. You'll see, looking at these, that there's general metadata information about the service itself, what its input is, what it outputs, and there are keywords and descriptive metadata. Download and run: for open source services we actually provide a link that allows you to download the Docker image and run it locally on your own machine; this particular service is from a commercial provider, Lingsoft, so it's not one that you can download and run yourself. And then the other two tabs are to do with how you would use the service. Of course, I'm not logged in; yes, this is the thing, you have to be signed in to ELG in order to test these services out. So I've now logged in, and I can just try out the service here: this is an example of misspelled text, I submit it for testing, and you get back a response saying, OK, this is a word which has not been recognized, and it's suggesting you might want to replace it with "example". This is just to demonstrate that we have these try-out user interfaces, as we call them, for all the different types of services: we have ones that do text annotation like this, there are machine translation services, and we have an example for audio services as well, such as speech recognition, where you can either just speak into your microphone and record live, or you can upload a file of speech that you already have, and it will do the speech recognition and show you the text. Code samples: there's an example here of how you can actually make use of this service from your own code, either with a very simple curl, where you just have to paste in the actual access token, or with the Python example here, which uses the Python SDK that handles all the authentication details for you; these are the same things that Pedro will talk about later on.
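To give a flavor of what those code samples look like, here is a hedged sketch of the curl variant; the endpoint URL and token are placeholders, and the real values are shown on each service's "Code samples" tab:

```sh
# POST an ELG text request to a service's execution endpoint.
# <service-endpoint> and <access-token> come from the Code samples tab.
curl -X POST '<service-endpoint>' \
     -H 'Authorization: Bearer <access-token>' \
     -H 'Content-Type: application/json' \
     -d '{"type": "text", "content": "This is an exmaple of misspelled text."}'
```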
The other interesting part, though, going back to this overview tab: as I was saying, the catalog includes not just the services and the data sets themselves, but also information about the providers, the organizations and the projects. So here, for example, I can see "resource provider: Lingsoft", and this link goes to their metadata record representing Lingsoft as an organization. Again, it's much the same: for organizations you have keywords, you have the language technology areas that this company is active in, and that's a search facet as well, so you can search for companies that work in particular areas of language technology, in the particular language I'm interested in, and then there are links here for the related technologies and projects. So here, for example, Lingsoft have integrated a number of these different services, machine translation, speech recognition, proofing tools, text analysis, and they're also part of these projects. Going back to the proofing tools, you'll see this is funded by LSDISCO: this is the record about the ELG pilot project that we funded Lingsoft to carry out, and again you can see that that project is what provided all these services. So you can bounce back and forth between the two: from a resource you can find the project that funded it, then you can find other resources that were funded by the same project, and so on. So this is what I was saying about the organization records. I think that's probably enough talking from me now, and I will hand back either to Joaquín or straight over to Pedro, whatever's easiest.

Thank you for your presentation, Ian. I think now all of us have a better idea about what the ELG is and how to interact with it. Now it's the turn of my colleague Pedro Alonso; we are going to change into the how-to section of this workshop, and he is going to share with us how to interact with the ELG platform. Please, Pedro.

Hello, good morning, my name is Pedro Alonso, from Gradiant. I'm going to share my screen. OK, so I am going to talk about how to use... (Pedro, please change to full-screen mode, please.) Yes, thank you. OK, I'm going to talk to you about how to contribute to the ELG platform and how to benefit from and use the tools and resources that are available in the ELG platform.

First of all, we want to show you how you can upload tools or resources to the platform. For this we must be registered as providers, so we must sign up as a user in the ELG platform and request to become a provider; we make this request and the ELG must accept us, so we have to wait for the request to be approved. (It should happen within 24 hours, usually; it's not a long wait, it's really just a spam filter more than anything. Yes, it's usually a quick approval.) Once we are providers, we have to add the organization itself, so we go to the "My grid" section and to the organization subsection, and here we have to fill in some fields and data about our organization, like the name, a description, our website, and some contact details,
plus keywords to make it easy to find the organization in the browser, and the language technology areas the organization works in; also, if there is a parent organization, we should put it here.

Once we have the organization created in the platform, we want to upload the tools there. So, what kinds of tools can be uploaded? First of all, these are the types of tools supported by the platform: information extraction, text classification, machine translation, automatic speech recognition, text-to-speech generation and image analysis. It is important to say that these are not strict types. What I mean by this: for example, machine translation supports any type of text-to-text tool, so a summarization tool, whose input is a text and whose output is also a text with the summary of the input, could fit here as a machine-translation-type tool, and the same goes for other kinds of tools that may fit other categories by the types of their input and output. And then we also have corpora: we can upload the corpora we may want to contribute to the platform; they must be in a zip file, and in the metadata it's important that we state whether they contain personal or sensitive information, and in that case, whether this information is anonymized or not.

Now we go into the details of how to upload a tool and the process we must follow. First of all, we have to select, of course, the tool we want to contribute, and implement an API to access and use this tool. This is a standardized API specified by the ELG, so that all the tools are accessed in the same way, which makes it easy to standardize the use of all the tools in the platform. The API must go inside a Docker container, so we must dockerize the tool and the API and make the image publicly available; then we have to upload it and write some metadata in the platform. In the middle, we will also talk about how to test and check that our implementation of the API and of the dockerization is correct according to the specifications of the ELG.

We are going to go deeper into each of these steps using as an example a tool that we have been working with during this Microservices project. It is a question answering tool, a model for Spanish. I'm going to show you the GitHub here: there is all the information to install and use it, but the important part is here, in the script that implements the tool, namely the evaluate method, which takes the question and the context and responds with the result, the answer to the question the user posed. OK, this is the tool; now we are going to show how we developed the API.

So, talking about the API: here is the link to the specifications. It is an API that accepts certain types of requests, namely text requests, structured text, audio and images, and it has certain types of responses defined: the failure response, and the annotations, classification, texts and audio responses. It's important for developers to know that the ELG provides helper libraries for Java and Python which make it very easy to develop the API, because they help to parse the requests and also to build the responses according to the specifications of the API, without requiring developers to be really very careful with every field of the requests and responses.
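As a sketch of what a service built with the Python helper library might look like: the class and method names here are as I remember them from the elg SDK documentation, so treat them as assumptions to check against the current docs, and qa_tool is a hypothetical wrapper around our model.

```python
from elg import FlaskService
from elg.model import AnnotationsResponse

import qa_tool  # hypothetical module exposing evaluate(question, context)


class SpanishQA(FlaskService):
    """Wraps the QA model behind the standardized ELG API."""

    def process_structured_text(self, request):
        # By convention in this sketch: first text is the context, second the question
        context, question = (t.content for t in request.texts)
        result = qa_tool.evaluate(question, context)
        # Build an ELG annotations response with the answer span and its score
        return AnnotationsResponse(annotations={
            "answer": [{
                "start": result["start"],
                "end": result["end"],
                "features": {"answer": result["answer"], "score": result["score"]},
            }]
        })


flask_service = SpanishQA("spanish-qa")
app = flask_service.app  # a Flask app speaking the ELG request/response protocol
```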
We are going to show some examples of the structure of the requests. This is a text request: as you can see, it has some fields like the params, the content and the mimeType, and it can have features or annotations, but most of these are optional fields; in practice, the majority of the tools are going to use this simple format, which is only the type and the content with the text. The audio and image requests also have an extra format field to indicate the format of the audio or the image you are going to use, and in that case it's even more recommended to use the ELG helpers to handle these formats, which have a more complex structure than the text data.

Now we are going to talk about the responses of the API. The successful response has the type, which is always there, and it may have some warnings or other properties, depending on the case and the tool. The failure responses must carry an array of status messages with this structure: a code that may say, for example, that there is no translation for this text, or that there was some problem during the execution. There is a standard set of failure messages, standardized by the ELG, and they are also translated into many languages; here, for example, "the request size is too large", "invalid message", "invalid response", "image format not supported".

This is an example of a response of type texts: you can see it has an array of texts, which we will explain now a bit more deeply. In the simplest case, the text response may be just one text with the answer of the tool, but in a machine translation tool that answers with several translations for a text it may be an array of texts like this, and in other cases it may be a two-level list, where, for example, the first level is the sentence and the second level is an array of the words in the sentence. And this is an example of the audio response, where the most important difference from the text response is that the ELG offers a microservice to store image or audio content, so it's important as developers to know about it and benefit from it: we can put in the content location field the URL where the output is stored, so clients know where to find it.

Now, from these general structures, we go to the case of the question answering tool that we are using as an example. We can send the tool a request of type structured text, where we say that the capital of Spain is Madrid, and we ask which is the capital of Spain; this is a translation, of course, because I think most of you will understand it better in English, but the original example is in Spanish. We expect to get that the answer is Madrid, and the answer, obviously, is Madrid; it also says that "Madrid" is at positions 24 to 30, counting characters, and that is true. You can count it if you want; I'm not going to count it now.
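On the wire, that exchange might look roughly like this; a hedged reconstruction based on the structures just described, where field names should be checked against the ELG API specification and the score value is invented:

```json
{
  "type": "structuredText",
  "texts": [
    {"content": "The capital of Spain is Madrid."},
    {"content": "Which is the capital of Spain?"}
  ]
}
```

and the annotations response locating "Madrid" at characters 24 to 30 of the context:

```json
{
  "response": {
    "type": "annotations",
    "annotations": {
      "answer": [
        {"start": 24, "end": 30, "features": {"answer": "Madrid", "score": 0.95}}
      ]
    }
  }
}
```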
Now we are going to see how we implemented this in the code. This is the script where the API is served, and you can see that we import the tool, the Spanish question answering tool we've seen before. In this endpoint we receive the requests, we get the context and the question, and we call the evaluate method of the tool to get the result; and with the result, we have to generate the structure we have just seen, with the start and end offsets, the answer, and the score it gets. As you can see, this is the same structure we have just seen.

Now, the third step: what happens when we already have the API implemented? We have to dockerize it, and for this we have three options for the later deployment in the ELG platform. In the majority of cases, and this is also the case in the example we are following, there is a standalone container, which is just a dockerization of the tool in one container; we upload it to the platform and it is available there. But in many cases it may happen that we want to contribute a tool that already has an API, but not the ELG API, so we also need to implement the ELG API. What we can do, like in this middle example, is make an adapter, which is an implementation of the ELG API that forwards the requests, that is, makes another request, to the existing API of the tool we are using; they both run in the same container, so they can refer to each other as localhost. And the last case is similar: when there is a tool with an API that is already deployed in the cloud, on the internet, we still have to have the API implementation in the ELG platform, and when you call this API it forwards the call to the external API and gets the response. But for the user, all three cases are the same: the user calls the tools through the ELG platform, so in fact users don't care, and they are not going to be aware of how this works internally.

I'm going to show you first of all how to implement this dockerization, and it's really simple: as you can see here, we copy the files into the container, and we just call the serve script, which is the one that implements the API, and we expose it on this port.
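A minimal sketch of such a Dockerfile for the standalone-container option; the file names, base image and port are illustrative, not taken from the actual repository:

```dockerfile
FROM python:3.9-slim
WORKDIR /app

# Install the tool's dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the model/tool files and the script that implements the ELG API
COPY . .

# The port exposed here is the one later declared in the ELG metadata
EXPOSE 8866
CMD ["python", "serve.py"]
```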
Once we have the tool dockerized, before uploading it to the platform we want to run some checks to verify that it doesn't have any problem with the ELG platform and with the ELG API; we will talk about this at more length in a further section, but first I'm going to finish with the process of uploading a tool.

The next step is to make the image public. In this case we use Docker Hub, but you can use any other platform, like GitLab, for example. So here our image is available, we can do a docker pull to get it, and here is the readme with some information about it. Then, finally, we have our tool prepared to be uploaded. What do we have to do? We go to the ELG website, go again to the "My grid" section, and add a service or tool, and here is where we have to fill in a lot of fields with information about the tool. In the first tab, identity, we put the name, the description of the tool, and the version, which is also a required field, and then who the tool creator is, the tool provider, and so on. In the categories tab we indicate which type of application it is and the keywords, again to make it easier to find in the browser. Then we have several other tabs, like the contact tab, where we must put details for support, whom to contact if there is any need. The tool and service tab is for more technical information, like the parameters of the requests and the input and output formats; you also have to state whether it is language-dependent and which languages the tool understands, and you may provide some sample input data. This helps both the validators, during the process of checking that everything is OK, and also, as Ian showed before, it helps the users to know how to use the tool from the catalog and to see what the purpose of the tool is. We also have the distribution tab, where we put information like whether the image is private, the location of the Docker image and the endpoint, so that the ELG platform can access the image and call the API, the license of the tool, and any special hardware requirements the tool may have: if it needs a lot of memory, or if it needs a GPU, you should put it here, to see whether these requirements can be met or not, depending on the case.

(I will just jump in and say that at present we aren't able to provide GPU capacity on the live ELG cloud, so anything you're publishing for use on there would need to be CPU rather than GPU at the moment. We're looking into the options of how that might be possible, but it's very difficult to do in a way that's cost-effective, given the scale we're at at the moment. But certainly we're happy to host records that point to GPU software that people can download and use themselves; we just can't promise to be able to host it in our cluster on a GPU. Yes, thank you.)

OK, and now we can say that there is another option for writing the metadata: instead of doing this manual process we have just seen, from the website interface, we can upload the metadata as an XML file. For this, we would recommend that the first time you upload a tool you do the manual process, and then in later cases you can download, as an XML file, the metadata you entered for this first tool and use that file as a template for the others: in the template you change all the fields you need to change and then import it into the ELG, so it's easier and makes the process a bit faster. Finally, once we have filled in all this metadata, validators from the ELG must approve the publication: there is a technical validator that will check the API compliance, for example, and that everything works well in the interaction with the platform, and also a legal validator that will check the license conditions to see that there aren't any legal issues. When this approval is done, the service is published and becomes available in the ELG platform.

OK, now we are going to jump back a bit to the testing I mentioned, and to the reason we require this checking during development time. Probably the common case during development is that the user makes a request directly to the tool and it answers; the tool can be executed as a Python or Java or whatever program, or it can already be dockerized, but it is a direct request. In the ELG platform there will be an intermediate service execution server, which is the one that receives all the requests and forwards each request to the container of the tool the user is going to use, and the same with the response: it acts as a proxy. So we want to have this service execution server in our local deployment, to test that everything goes well; if, for example, you make a request that the server doesn't understand, it will tell you, so you can fix everything that doesn't work as expected before uploading the tool to the platform. For this there is a GitLab repository where this server is available, with a simple service for testing, but also a variant with temporary storage, one with a user interface, and another with both things. We are going to show how to use the simple service. It's a docker-compose that deploys both the REST server, the intermediate server that is the same as the one in the ELG platform, in one container, and, in another container, our tool container; we have to indicate which is our image, and in this environment-variable file we can put the location of the image and information about the port and the endpoint. With this deployment we are going to have this whole test infrastructure running.
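A hedged sketch of what such a docker-compose setup might look like; the real files live in the ELG GitLab repository just mentioned, and the image references and variable names below are placeholders, not the actual ones:

```yaml
version: "3"
services:
  rest-server:
    # The intermediate service execution server (placeholder image reference)
    image: elg/rest-server:latest
    ports:
      - "8080:8080"
    environment:
      # Where to forward incoming ELG requests (tool container and endpoint)
      - SERVICE_URL=http://my-tool:8866/process
  my-tool:
    # Our dockerized tool, reachable from the rest-server by its service name
    image: example/spanish-qa:1.0
    expose:
      - "8866"
```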
OK, this is what we have already seen, in the GitLab interface. And I'm going to mention another option for making this deployment, which is using the ELG SDK; we will talk about this SDK in the last section, and it makes the deployment easier: you just use one instruction, indicating the path of the image and the execution location. In both cases you obtain the same deployed infrastructure, and you can use it, for example, to make a request and check that everything is OK. In the case where there isn't any problem, we make the same request we used in the example before, and we receive the same answer, without any complaint.

But what happens when there are problems? There are some potential errors that we may discover with this testing. The first is the case where the request format does not comply with the ELG specifications. If we make a request like the one here, with a type that is not among the types defined by the ELG API, "non-existing type" for example, it will say that there is an invalid request. What happens here is that the user makes the request, but the server doesn't understand it, so it doesn't forward it to the tool; it just responds with the error shown here, an invalid request message. The second case is when we make a correct request and the tool responds, but the response does not comply with the ELG specifications. In this case the intermediate service says, as you can see here in the logs, "could not resolve type non-existing type", because the types it understands are annotations, audio and so on; so it responds with an internal server error, and in the logs you can check what the problem was. And finally, the case where there is an exception in your tool that is not correctly handled. When an exception happens during the execution of the tool, the tool should answer with a failure message, the failure message we saw before, the one specified in the API; but if the exception is not captured, the tool may respond, for example, with an internal error, and the intermediate server will crash, it will not understand the answer the tool is giving. You can see here the stack trace; in this example it is a ValueError. So we can see that something went wrong, and that the tool should respond with a failure message, and not with the raw exception of the process.
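For reference, a well-behaved tool catching that exception would return something shaped like the standardized failure message below; this is a sketch, and the code and text are my recollection of one of the ELG standard messages, so verify them against the specification:

```json
{
  "failure": {
    "errors": [
      {
        "code": "elg.service.internalError",
        "text": "Internal error during processing: {0}",
        "params": ["ValueError: ..."]
      }
    ]
  }
}
```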
OK, now we go to the second part of the presentation, which is how to use the tools that are available in the ELG platform. First of all, there is an SDK for Python that allows users to use the tools in a really easy way, as you will see now, from Python, but you can also make requests to the endpoints directly over the HTTP protocol, for example.

When you want to use a tool, first of all you have to know that this tool is available in the ELG, so the first thing you may want to do is learn how to look in the catalog. As Ian explained before, there is a catalog on the ELG website, this one, where you can apply some filters to find the tools or resources that best fit your requirements. Imagine I need to use a machine translation tool for Spanish: here I have all the tools for machine translation in Spanish, so I can select the one that best fits my requirements and use it. Here is an example of the same thing I have already shown, so I will skip it. And there is another option, to browse the catalog from the SDK, which is as easy as this: we import the Catalog class and we call the search method with the filters we want to apply, and we receive a list of results which, if you print them, are shown like this, in a very visual way, where you can find some information about them: the license, the languages, a description and, importantly, the ID of the tool.

Once we know which tool we are going to use, we have to authenticate with the ELG, and for this we have some options. The first is direct authentication: when we want to use a service, we instantiate this service, in this case the service with the ID 474, and it will ask us to introduce a code for authentication, which we can find at the link it shows; we log in there, it shows us a code that we just paste here, and from then on we can use the tool, because we are authenticated. The problem with this is that it gives you an expiring token that is valid only for a couple of hours. To avoid this, you can use the offline-access parameter, which gives you a non-expiring token that you can store in a file with this instruction; then, the next time you want to instantiate the service, you indicate the file with the tokens and you are authenticated automatically.

Once you are authenticated, you can use the tools, and this is an example of calling a service for named entity recognition in multiple languages; as you can see, it is as easy as this: just pass in the text and get the response. In this case it's an annotations response, where it says, for example, that Nikola Tesla is a person name, that it is male, and that the first name is Nikola and the surname Tesla. Another way to do the request: instead of using plain text you can use a file. If you have a file, for example with the name example.txt, with a text inside, you can point the service at the file and you will receive the answer; in this case it is a different request from the previous one, but you can see the annotation types, where a given word is marked as a name, for example. And then, finally, the third kind of request is using the TextRequest object, which is also pretty simple: you just build this request object, indicating the text, send the request object, and get the result.
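Putting those pieces together, here is a hedged sketch of the SDK usage just described; the filter keywords and the offline-token handling follow my reading of the elg SDK docs, so double-check the parameter names there, and 474 is the service ID from the slides:

```python
from elg import Catalog, Service

# Browse the catalogue from Python with the same filters as on the website
catalog = Catalog()
results = catalog.search(
    resource="Tool/Service",
    function="Machine Translation",
    languages=["es"],
)
for result in results:
    print(result)  # license, languages, description and, importantly, the ID

# Instantiate a service by its ID; on first use this triggers the interactive
# login described above. With an offline-access token stored in a file, later
# sessions can authenticate automatically (exact parameter names may differ).
service = Service.from_id(474)

# Call the service with plain text and inspect the annotations response
response = service("Nikola Tesla lived in the USA.")
print(response)
```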
Now we are going to talk about an advanced use that ELG provides, which is pipelining tools. What does this mean? Maybe we want to use some tools as a chain: for example, if we want to work with a language that has few speakers, there may not be many resources in the ELG for working with it, so we want to translate the text first into English, use a sentiment analysis tool for English, and then translate back into our language. We can build a pipeline like this; in this case the example is for German, not for English.
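A minimal sketch of such a chain with the SDK, composing two services by hand; the service IDs are invented, and the `.texts[0].content` access assumes the texts-response structure shown earlier. A back-translation step could be appended in the same way whenever the intermediate output is text:

```python
from elg import Service

# Hypothetical service IDs for illustration
de_to_en = Service.from_id(610)   # German -> English machine translation
sentiment = Service.from_id(507)  # English sentiment analysis

def sentiment_for_german(text: str):
    # Step 1: translate the German input into English
    english = de_to_en(text).texts[0].content
    # Step 2: run the English sentiment analyser on the translation
    return sentiment(english)

print(sentiment_for_german("Der Film war überraschend gut."))
```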