ML processing pipelines using CWL: deployment in the context of research teams and OGC-API integration

Hi everyone. My name is Francis Charette Migneault. I'm a research software developer at the Computer Research Institute of Montréal, and today I will be presenting this talk about machine learning processing pipelines using CWL and their deployment in the context of research teams and OGC API integration. For today's agenda, I will be talking about the following subjects and concepts.

So: CRIM's research team development context, the CWL integration with Weaver, the solution that we use for OGC API integration, and then a few examples, as presented below. First, CRIM is a research center that helps organizations gain a competitive edge through the adoption of technological innovation. Because we work with multiple organizations with different goals, challenges and products, our expertise extends over a wide range of domains, such as computer vision and geospatial technologies, speech and natural language understanding, data science, operational research and many more. As a result, CRIM must regularly adapt to fast-moving research by producing prototypes of applied technologies, and those prototypes must also be transferable to the production environments of partner organizations.

The context under which CRIM operates therefore requires technologies that match our realities. First, the multi-domain technologies and the diverse objectives of the prototypes created for different organizations call for flexibility, reusability and simple solutions. Second, the research aspect of the prototypes that we develop and the packaging of those technologies must remain interconnected, because of the need to adapt them to production environments.

Although they cannot be dissociated, these solutions must not limit the exploration of new research possibilities. Because of this, it is very important for us that the tools we employ fulfill some key requirements, and CWL provides them. CWL gives us flexibility over the packaging of core algorithms, whether they run as Python scripts, Docker images or other kinds of applications, and over the types of inputs and outputs, such as images, textual data, audio for speech, etc. CWL also grants us reusability of developed solutions through workflows that help us chain advanced machine learning and AI algorithms with other common tools, which can then provide powerful analyses and result insights. Another very important concern for CRIM is that many of the services we employ have to be used in tandem with already existing solutions, with interoperable standards, and with new technologies being developed constantly.
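For illustration, a minimal CWL CommandLineTool wrapping a hypothetical Python inference script could look like the sketch below; the script, argument names and output file are placeholders rather than one of CRIM's actual packages, and the script is assumed to write its predictions to the working directory.

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Hedged sketch: wraps a hypothetical Python inference script.
# The script, argument names and output file are illustrative placeholders.
baseCommand: python
inputs:
  script:
    type: File
    inputBinding:
      position: 1
  input_image:
    type: File
    inputBinding:
      position: 2
      prefix: --input
  model_weights:
    type: File
    inputBinding:
      position: 3
      prefix: --weights
outputs:
  predictions:
    type: File
    outputBinding:
      glob: predictions.tif
```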

So new algorithms cannot operate in their own sandbox. They must integrate with a wide range of operational architectures and services, such as Web Processing Services (WPS), OGC APIs, climate data platforms, and other self-contained applications packaged with Docker. Furthermore, the services being developed must often provide extended metadata related to the prototype application, such as documentation references or access to execution logs. Since we do not want to dissociate research from production, it is essential for us, even as we extend this metadata, that scripts run the same way both for local execution and for continuous development in production deployments. To reach that goal, our solution was to develop Weaver, available at the provided link to the GitHub repository, which I will present over the next few slides. Weaver allows the deployment and execution of processes of various types over multiple remote locations using a common web interface.

It has two modes of operation. First, the ADES, or Application Deployment and Execution Service, which executes an Application Package defined by CWL at the location where the data resides and can be accessed. Second, the EMS, or Execution Management Service, which executes processing workflows by dispatching child jobs to appropriate ADES instances based on where the data resides, and by linking the processing results between intermediate jobs. As shown in the figure here, one EMS-configured Weaver can register with many different remote servers that are used to process data at different locations.

Weaver also provides other features, such as process management: retrieving process descriptions, version management, catalog search, etc. It also provides job monitoring and sequencing for different users and workflows. And finally, it provides result retrieval, both for job outputs and for error logs in the case of failures. Under the hood, Weaver uses CWL to execute jobs and automatically maps the execution fields across multiple standards.

Below are the standards that Weaver supports. This includes WPS interfaces, ESGF-CWT API interfaces, other remote ADES implementations that differ from Weaver but provide the same API interface, built-in applications, Docker images, and larger workflows that chain any of the previous process definitions. In Weaver, each of these types of processes is defined with a different CWL requirement. Weaver also provides different types of input and output mapping, whether the data is located in AWS buckets, behind HTTP references, or in local files from previous jobs. As displayed in the figure, Weaver provides the communication and translation between the different API connectors and then, with the custom requirements it provides, it can execute the different jobs using the CWL engine. It applies the required interface based on that requirement and then executes the process with the appropriate HTTP request to the remote API or with Docker images.
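For example, a process backed by an existing WPS server can be described with one of Weaver's custom CWL hints; the sketch below is only illustrative (the hint name follows Weaver's documentation, the provider URL and process identifier are placeholders, and such a definition is interpreted by Weaver rather than run by a plain CWL runner).

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Hedged sketch: a process dispatched to a remote WPS provider through one of
# Weaver's custom hints; the provider URL and process identifier are placeholders.
hints:
  WPS1Requirement:
    provider: https://example.com/wps
    process: remote-process-id
inputs:
  netcdf_input:
    type: File
outputs:
  output:
    type: File
```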

Since at its core Weaver uses CWL, the only difference between a "standard" processing workflow and one registered in Weaver is the definition of what needs to be run in the steps. Instead of a local file reference, Weaver will use a process identifier in the `run` field of the step. When executing the workflow, Weaver will simply run each step and call each of the corresponding processes with the appropriate requests, whether that means contacting a remote ADES, a remote WPS, or running a Docker image, all based on the requirement definition of each process. Each step's result is propagated to the following steps, so it is possible to connect any kinds of processes with one another, and Weaver will make sure to communicate the results in each situation by transmitting references, whether they are HTTP links or bucket storage locations.
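Concretely, the only change from a "standard" workflow is the step's `run` entry, as in this hedged sketch where the deployed process reference and its input/output identifiers are placeholders:

```yaml
cwlVersion: v1.0
class: Workflow
# Hedged sketch: the step "run" references a process already deployed on a
# Weaver/ADES instance (placeholder URL) instead of a local CWL file.
inputs:
  input_dataset: File
outputs:
  result:
    type: File
    outputSource: subset/output
steps:
  subset:
    # a "standard" CWL workflow would instead use:  run: subset_tool.cwl
    run: https://example.com/weaver/processes/subset-region
    in:
      dataset: input_dataset
    out: [output]
```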

Weaver also provides a rich OpenAPI specification, as presented in the following image. This provides a standard method to execute processes and retrieve results using a RESTful interface, and allows Weaver to act purely as a web interface service. Because Weaver abstracts the process types with automatic adaptation to the different API interfaces, it facilitates producing results: the specific details about execution methodology and script commands that would otherwise need to be called directly are completely hidden away by CWL. All of these processes are represented with the same schema in the output of the RESTful API, which helps compare different kinds of processes, and users of the API do not have to understand all of the finer details under each process, as it becomes completely transparent. Weaver also offers a centralized search of processes, even though they come from different sources and are of different natures, since it can communicate with remote servers of different types, such as WPS or ADES.
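As an illustration of that RESTful interface, executing a deployed process reduces to a small request body such as the hedged sketch below (shown in YAML for readability, submitted as JSON; the endpoint path and field names follow OGC API - Processes conventions and should be checked against Weaver's OpenAPI document, and all identifiers and URLs are placeholders).

```yaml
# Hedged sketch of an execute request body, e.g. POST /processes/{id}/execution.
# All identifiers, URLs and output names are placeholders.
inputs:
  input_image:
    href: https://example.com/data/scene.tif
outputs:
  predictions:
    transmissionMode: reference
response: document
```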

Finally, since Weaver understands CWL at its core, running an application locally and deploying it to a production environment for remote execution is exactly the same. The deployment of a local application defined through CWL is seamless, as the deployment request is done with exactly the same definition. This helps development teams, and research teams in our case, test functionalities quickly and locally, and when they are satisfied with the results, they can deploy them directly. For the next slides, I will go through four examples that show how CWL and Weaver were employed to deploy Application Packages we developed. The first example is lake/river discrimination: a model that was trained to distinguish lakes from rivers.

It takes as input high-resolution digital elevation model data and masks of water body locations to produce region proposals of lakes. Using the model pre-trained by CRIM's research team, the packaged application and its entire execution environment were defined in a Docker container. Then, using a CWL DockerRequirement, the whole Application Package was deployed into Weaver with the appropriate HTTP requests, and finally we obtained a process that can be executed remotely through Weaver's API to produce new feature proposals over the regions of detected lakes. The images on the right present such results, obtained after calling Weaver's deployed process. Since the whole process is defined with the CWL that is directly deployed in Weaver, further iterations of model training were deployed very quickly, because no other modifications were required to package the whole solution.

The second example is another model, trained for a land-cover mapping task over Sentinel-2 satellite imagery. It takes as input an image with multiple bands, which need to be specified along with the number of patches and normalization parameters, to produce classification results over the seven classes presented on this slide.

Once it goes through the whole image, each pixel is classified with one of the seven classes. The whole sampling and classification procedure was again packaged with Docker so that the entire environment can be replicated across different locations. That process is illustrated with the presented input image and the obtained result on the right.

So very simply, the model takes the image as input, produces a result for each pixel, and we obtain a classified image from the model inference. How does this look in CWL? Basically, the Docker application is referenced with the DockerRequirement, which points to the whole model definition and execution environment, and the input parameters can simply be provided with standard CWL execution. That whole CWL definition is sent directly to the API endpoint where Weaver is located, and Weaver automatically understands that CWL definition and generates the corresponding OGC API process description, with automatic mapping of all the fields, metadata and documentation details on how to execute the process: which are the inputs, which are the outputs, and how to retrieve them.
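A hedged sketch of what such an Application Package could look like is shown below; the Docker image, command and parameter names are placeholders, not CRIM's actual land-cover package.

```yaml
cwlVersion: v1.0
class: CommandLineTool
# Hedged sketch of a Docker-based Application Package: the image name,
# command and parameter names are illustrative placeholders.
requirements:
  DockerRequirement:
    dockerPull: registry.example.com/landcover-model:latest
baseCommand: landcover-inference
inputs:
  image:
    type: File
    inputBinding:
      prefix: --image
  bands:
    type: string
    inputBinding:
      prefix: --bands
  patch_count:
    type: int
    inputBinding:
      prefix: --patches
  normalization:
    type: float
    inputBinding:
      prefix: --normalization
outputs:
  classification:
    type: File
    outputBinding:
      glob: classified.tif
```

The same document, unchanged, is what gets embedded in the deployment request sent to Weaver, which then derives the OGC API process description from it.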

My third example is a CWL workflow that chains two pre-deployed WPS processes. In this case, the servers already exist and already offer some processes, and they simply need to be chained one after the other to produce a single process combining the two intermediate results. The first service produces a subset of a region of interest, and the second process produces ice-days indices indicating the number of ice days in a year. By chaining these two processes into a common workflow with the custom requirements defined by Weaver, we first obtain the region of interest, and that region of interest then gets processed for ice days.
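A hedged sketch of such a chaining workflow is shown below; the process URLs and the input/output identifiers are placeholders, while the structure follows the pattern described earlier, with the subset output feeding directly into the ice-days step.

```yaml
cwlVersion: v1.0
class: Workflow
# Hedged sketch of chaining two pre-deployed processes; the process URLs
# and the input/output identifiers are placeholders.
inputs:
  dataset: File
  region: string
outputs:
  ice_days:
    type: File
    outputSource: compute_ice_days/output
steps:
  subset_region:
    run: https://example.com/weaver/processes/subset-region
    in:
      netcdf: dataset
      area: region
    out: [output]
  compute_ice_days:
    run: https://example.com/weaver/processes/ice-days
    in:
      netcdf: subset_region/output
    out: [output]
```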

Weaver will execute the whole workflow by sending WPS requests to each of the providers over HTTP, obtain the first output of the first step, and chain it directly into the second process to do the ice-days processing. When running the workflow through Weaver, each of the steps is processed directly based on the CWL requirements, and Weaver manages the transmission of intermediate results depending on where the processes are actually running. As I just mentioned, the first step is the region subsetting: it extracts the data from the input NetCDF and produces a NetCDF covering only the subsetted region. This process is highly reusable; we can use it for any indices.

When we combine it with the second step, which computes the ice days, we obtain patch-based processing of a given region. This is very important in our case because, most of the time, the large amounts of data cannot be processed all at once for the massive regions that need to be covered. Through Weaver's API, using the individual processes and using the full workflow are exactly the same.

This makes it very easy for users who need to compute ice days for some region, as they no longer need to chain the processes manually. Finally, the fourth example is a point-cloud classification workflow. The pretrained model that forms the first application predicts classes from an input point cloud and simply produces a classified point cloud.

The second application takes in the classified point cloud and rasterizes the result into an image. The third process packages these two previous processes into a workflow that can be executed directly, and the whole pipeline is then parallelized in a fourth workflow using standard CWL requirements, as sketched after this paragraph. Here I present a very short example of one of the gridded results we can obtain from this whole pipeline, along with some enlarged results to make the small and finer details easier to see.
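A hedged sketch of that parallelization is shown below; the sub-workflow file name and the input/output identifiers are placeholders, and the scatter mechanism is the standard CWL one.

```yaml
cwlVersion: v1.0
class: Workflow
# Hedged sketch: parallelizing the classify-and-rasterize sub-workflow over many
# point-cloud tiles with standard CWL scatter; all names are placeholders.
requirements:
  ScatterFeatureRequirement: {}
  SubworkflowFeatureRequirement: {}
inputs:
  point_cloud_tiles: File[]
outputs:
  rasters:
    type: File[]
    outputSource: classify_and_rasterize/raster
steps:
  classify_and_rasterize:
    run: classify_and_rasterize.cwl  # the workflow chaining the two applications
    scatter: point_cloud
    in:
      point_cloud: point_cloud_tiles
    out: [raster]
```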

In conclusion, CWL is a great tool that helps CRIM easily scale many of its processing chains and real use cases. It provides simple definitions, which lets us keep the focus on the core algorithms and prototypes being developed. In conjunction with Weaver, CWL allows CRIM to rapidly deploy remotely executable web services for many different kinds of application domains. Using the two in combination, we also obtain added value for existing and future algorithms, as packaging time is greatly reduced for server deployment, and visibility and processing capabilities are increased thanks to Weaver's API.

Thank you for your attention, and do not hesitate to reach out to us with any questions regarding Weaver or partnership opportunities with CRIM. Have a good day.
