ML processing pipelines using CWL: deployment in the context of research teams and OGC API integration
Hi everyone. My name is Francis Charette Migneault. I'm a research software developer at the Computer Research Institute of Montréal (CRIM), and today I will be presenting this talk about machine learning processing pipelines using CWL and their deployment in the context of research teams and OGC API integration. For today's agenda, I will be talking about the following subjects and concepts.
So, CRIM's research team development context and the CWL integration with Weaver, the solution that we use for OGC API integration. I will follow with examples as presented below. First, CRIM is an applied research centre that helps organizations acquire a competitive edge through the adoption of technological innovation. Because we work with multiple organizations operating with different goals, challenges and products, our expertise extends over a wide range of domains, such as computer vision and geospatial technologies, speech and natural language understanding, data science, operational research and many more. This means CRIM must regularly adapt to fast-moving research by producing prototypes of applied technologies that must also be transferable to the production environments of partner organizations.
The context in which CRIM operates therefore requires technologies that match our realities. First, the multi-domain technologies and the diverse objectives of the prototypes created for different organizations demand flexibility, reusability and simple solutions. Second, the research aspect of the prototypes we develop and the packaging of those technologies must remain interconnected, due to the need for adaptation to production environments.
Since they cannot be dissociated, our solutions must not limit the exploration of new research possibilities. Because of this, it is very important for us that the tools we employ fulfill some key requirements, which CWL provides. Among everything we require, CWL gives us flexibility in the packaging of core algorithms, whether they run as Python scripts, Docker images or other kinds of applications, and also in the types of inputs and outputs, such as images, textual data, audio for speech, etc. CWL also grants us reusability of developed solutions through workflows, which help us chain advanced machine learning and AI algorithms with other common tools and, afterward, provide powerful analyses and result insights. Another very important concern for CRIM is that many of the services we develop must be used in tandem with already existing solutions, with interoperable standards, and with new technologies being developed constantly.
So new algorithms cannot operate in their own sandbox. They must integrate with a wide range of operational architectures and services, such as Web Processing Service (WPS), OGC APIs, climate data platforms, and other self-contained applications using Docker. Furthermore, the services being developed must often provide extended metadata related to the prototype application, such as documentation references or access to execution logs. Since we do not want to dissociate research from production, it is essential for us, even as we extend metadata, to have scripts that run with the same method both for local execution during continuous development and in production deployment. To reach that goal, our solution was to develop Weaver, available at the provided link to the GitHub repository, and I will present this tool over the next few slides. Weaver allows the deployment and execution of processes of various types over multiple remote locations using a common web interface.
It has two modes of operation. First, the ADES, or Application Deployment and Execution Service, which executes an Application Package defined by CWL at the location where the data resides and can be accessed. And second, the EMS, or Execution Management Service, which allows the execution of processing workflows by dispatching child jobs to the appropriate ADES instances based on where the data resides, and by linking the processing results between intermediate jobs. As shown in the figure here, one EMS-configured Weaver can register with many different remote servers that process data at different locations.
Weaver also provides other features, such as process management through description retrieval, version management, catalog search, etc. It provides job monitoring and sequencing for different users and workflows. And finally, it provides result retrieval, both for job outputs and for error logs in the case of failures. Under the hood, Weaver uses CWL to execute jobs and automatically maps the execution fields across multiple standards.
Below are all the standards that Weaver supports. This includes WPS interfaces, ESGF-CWT API interfaces, other remote ADES implementations that differ from Weaver but provide the same API interface, built-in applications, Docker images, and larger workflows that chain any of the previous process definitions. In Weaver, each of these process types is defined with different CWL requirements. Weaver also provides different types of input and output mapping, whether the data is located in AWS buckets, behind HTTP references or in local files from previous jobs. As displayed in the figure, Weaver provides the communication and translation between different API connectors and then, with the custom requirements it defines, it can execute different jobs using the CWL engine. It applies the required interface based on that requirement and then executes the process with the appropriate HTTP request to the remote API or Docker image.
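To illustrate, a remote WPS process can be wrapped as a CWL tool carrying one of Weaver's custom hints. The hint name and its fields below (`WPS1Requirement`, `provider`, `process`) are assumptions based on my reading of Weaver's conventions, and the endpoint URL is hypothetical, so treat this as a sketch rather than a definitive definition:

```yaml
cwlVersion: v1.0
class: CommandLineTool
hints:
  WPS1Requirement:                      # Weaver-specific hint; name/fields assumed
    provider: https://example.com/wps   # hypothetical remote WPS endpoint
    process: ice_days                   # identifier of the process on that server
inputs:
  dataset:
    type: File
outputs:
  output:
    type: File
```

When such a package is deployed, Weaver recognizes the hint and dispatches executions as WPS requests to the remote server instead of running anything locally.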
Since at its core Weaver uses CWL, the only difference between a "standard" processing workflow and one registered in Weaver is the definition of what needs to be run in the steps. Instead of a local file reference, Weaver uses a process identifier in the run step. When executing the workflow, Weaver will simply execute each step and call each of the corresponding processes with the appropriate requests, whether that means contacting a remote ADES, a remote WPS, or running a Docker image, all based on the requirement definition of each process. Each step result is propagated to the following steps, so it is possible to connect virtually any kind of processes with one another, and Weaver will ensure the results are communicated in each situation by transmitting references, whether they are HTTP links or bucket storage locations. Weaver also provides a rich OpenAPI specification, as presented in the following image.
This provides a standard method to execute processes and retrieve results using a RESTful interface, and allows Weaver to act purely as a web interface service. Because Weaver provides a process-type abstraction with automatic adaptation to the different API interfaces, it facilitates producing results, since the specific details about execution methodology and script commands that would otherwise need to be called directly are completely hidden away behind CWL. All these processes are represented using the same schema at the output of the RESTful API, which makes it easy to compare different kinds of processes, and users of the API do not have to understand the finer details of each process, as they become completely transparent. Weaver also offers a centralized search of processes, even though they come from sources of different types and natures, since it can communicate with remote servers such as WPS or ADES instances.
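As a sketch of what such a RESTful execution looks like, an OGC API - Processes execution request is a POST to `/processes/{processId}/execution`. The body below is shown in YAML for readability (the API exchanges JSON), and the process identifier and input names are hypothetical placeholders:

```yaml
# POST /processes/landcover-classifier/execution   (hypothetical process ID)
inputs:
  image:
    href: https://example.com/data/sentinel2_scene.tif  # remote input passed by reference
  patches: 256
outputs:
  classified:
    transmissionMode: reference   # return a link to the result rather than embedded data
response: document
```

The server replies with a job status location that can be polled until the result references become available, which is how Weaver exposes its job monitoring.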
Finally, since Weaver understands CWL at its core, running a local application and deploying it to a production environment for remote execution are exactly the same. The deployment of a local application defined through CWL is seamless, as the deployment request is done with exactly the same definition. This helps development teams, research teams in our case, to test functionalities quickly on a local machine and, when they are satisfied with the results, deploy them directly. For the next slides, I will go through four examples that present how CWL and Weaver were employed to deploy Application Packages we developed. The first example is lake/river discrimination. It is a model that was trained to distinguish lakes from rivers.
It takes as input high-resolution digital elevation model data and masks of water body locations to produce region proposals of lakes. Using the model pre-trained by CRIM's research team, the packaged application and its whole execution environment were defined in a Docker container. Then, using a CWL DockerRequirement, the whole Application Package was deployed into Weaver with the appropriate HTTP requests, and finally we obtained a process that can be executed remotely through Weaver's API to produce new feature proposals over regions of detected lakes. The images on the right present such results, obtained after calling Weaver's deployed process. Since the whole process is defined with the CWL that is directly deployed in Weaver, further iterations of model training were deployed in Weaver very quickly, because no other modifications were required to package the whole solution. The second example is another model that was trained for a land-cover mapping task over Sentinel-2 satellite imagery. It takes as input an image with multiple bands, which need to be specified along with the number of patches and normalization parameters, to produce classification results over the seven classes presented on this slide.
Once it goes through the whole image, each pixel is classified with one of the seven classes. The whole sampling and classification procedure was again packaged with Docker so that the whole environment can be replicated at different locations. That whole process is illustrated with the presented input image and the obtained result on the right.
Very simply, the model takes the image as input, produces a result for each pixel, and we obtain a classified image from the model inference. So how does this look in CWL? Basically, the Docker application is referenced with the DockerRequirement, which points to the whole model definition and environment for execution, and the input parameters can simply be provided as in a standard CWL execution. That whole CWL definition is sent directly to the API endpoint where Weaver is located, and Weaver automatically understands the CWL definition and generates the corresponding OGC API process description, with automatic mapping of all the fields, along with metadata and documentation details on how to execute the process, what its inputs and outputs are, and how to retrieve them.
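A minimal sketch of such a Docker-packaged tool is shown below. The image name, entry point and parameter names are hypothetical; only the CWL structure (DockerRequirement, typed inputs, globbed outputs) reflects what the deployment actually relies on:

```yaml
cwlVersion: v1.0
class: CommandLineTool
requirements:
  DockerRequirement:
    dockerPull: registry.example.com/crim/landcover:latest  # hypothetical image name
baseCommand: [python, classify.py]                          # hypothetical entry point
inputs:
  image:
    type: File
    inputBinding:
      prefix: --image
  patches:
    type: int
    inputBinding:
      prefix: --patches
  normalization:
    type: float
    inputBinding:
      prefix: --norm
outputs:
  classified:
    type: File
    outputBinding:
      glob: "*.tif"   # the classified raster written by the container
```

This same file can be run locally with a CWL runner or sent as-is in a deployment request to Weaver, which is what makes the local-to-production transition seamless.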
My third example is a CWL workflow that chains two WPS processes that are already deployed. In this case, the servers already exist and already offer some processes, and these simply need to be chained one after the other to produce a single process combining the two intermediate results. The first service produces a subset over a region of interest, and the second process produces ice-days indices, indicating the number of ice days in a year. By chaining these two processes into a common workflow with the custom requirements defined by Weaver, we first obtain the region of interest, and that region of interest then gets processed for ice days.
Weaver will execute the whole workflow by sending the WPS requests to each of the providers through HTTP, will obtain the first output of the first step, and will chain it directly into the second process to do the ice-days processing. When running the workflow through Weaver, each of the steps is processed directly based on the CWL requirements, and Weaver manages the transmission of intermediate results depending on where the processes are actually running. As I just mentioned, the first step is the region subsetting. It extracts the data from the input NetCDF and produces a NetCDF only for the subsetted region. This process is highly reusable; we can use it for any indices.
When we combine it with the second step, which computes the ice days, we obtain patch-based processing of a given region. This is very important in our case because, most of the time, the large amount of data cannot be processed in a single pass over the massive regions involved. Through Weaver's API, using the individual processes and using the full workflow are exactly the same.
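The chaining described above can be sketched as a plain CWL Workflow where each step's `run` entry is a deployed process identifier rather than a local `.cwl` file. The identifiers and input names below are hypothetical; the structure is what matters:

```yaml
cwlVersion: v1.0
class: Workflow
inputs:
  dataset: File        # input NetCDF covering the full region
  region: string       # region-of-interest specification
outputs:
  indices:
    type: File
    outputSource: compute_ice_days/output
steps:
  subset_region:
    run: SubsetRegion          # deployed process identifier, not a local file
    in:
      dataset: dataset
      region: region
    out: [output]
  compute_ice_days:
    run: IceDays               # second deployed process (hypothetical identifiers)
    in:
      dataset: subset_region/output   # intermediate result chained by Weaver
    out: [output]
```

Weaver resolves each `run` identifier to the corresponding remote WPS provider and forwards the subsetted NetCDF reference from the first step into the second.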
This makes it very easy for users who need to compute ice days for some region, as they no longer need to chain the processes manually. Finally, the fourth workflow is a point cloud classification. The pretrained model that forms the first application predicts classes from an input point cloud and simply produces a classified point cloud.
The second application takes in the classified point cloud and rasterizes the result into an image. The third process packages these two previous processes into a workflow that can be executed directly, and the whole pipeline is then parallelized in a fourth workflow using standard CWL requirements. Here I present a very short example of one of the gridded results we can obtain from this whole pipeline, and some enlarged results to ease viewing the smaller and finer details.
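The parallelization mentioned above maps naturally onto CWL's standard scatter mechanism. Below is a sketch under the assumption that the chained classify-then-rasterize sub-workflow was deployed under a hypothetical identifier, with one tile processed per parallel invocation:

```yaml
cwlVersion: v1.0
class: Workflow
requirements:
  ScatterFeatureRequirement: {}   # standard CWL requirement enabling `scatter`
inputs:
  tiles:
    type: File[]                  # one point cloud tile per parallel job
outputs:
  rasters:
    type: File[]
    outputSource: classify_and_rasterize/raster
steps:
  classify_and_rasterize:
    run: PointCloudPipeline       # hypothetical identifier of the chained sub-workflow
    scatter: tile                 # fan out: one job per element of `tiles`
    in:
      tile: tiles
    out: [raster]                 # assumed output id of the sub-workflow
```

Each scattered job runs independently, so the grid of tiles can be processed concurrently and the classified rasters collected back as an array.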
In conclusion, CWL is a great tool that helps CRIM easily scale many of its processing chains for real use cases. It provides simple definitions, which allows us to maintain focus on the core algorithms and prototypes being developed. In conjunction with Weaver, CWL allows CRIM to rapidly deploy remotely executable web services for many different kinds of application domains, and by using the two in combination, we also obtain added value over existing and future algorithms, as the packaging time for server deployment is greatly reduced. The visibility and the powerful processing capabilities are also increased thanks to Weaver's API.
Thank you for your attention, and do not hesitate to reach out to us with any questions regarding Weaver or partnership opportunities with CRIM. Have a good day.