AWS re:Invent 2020: AXA: Rearchitecting with serverless to accelerate innovation
Welcome to this session on how at AXA we use AWS technologies to accelerate our innovation and to ease the maintenance of our connected car pipeline. My name is Vincent Hiribarren. I'm a Technical Leader at AXA Group Operations, and I'm joined by Xavier Loup, Principal Solutions Architect at AWS. Together, we will tell you how we transformed our legacy telematics project into a serverless one, so you can do the same thing. You will learn about the AXA TEX project and, along the way, a little about telematics.
You will get some insights into our first platform iteration, which was infrastructure-as-a-service oriented, and some issues we had with it. That's why we decided to transform it into a serverless project. And I will share with you some lessons we learned while doing this transformation. Before going further, let me set some context.
So I work at AXA. We are a multinational insurance company. We do asset management, life and savings, and also property and casualty insurance.
More specifically, in our case, we do car insurance. We are deployed in more than 50 countries, notably in Europe, where our project takes place. I am part of AXA Group Operations, and, more specifically, of the emerging technology and data division, which is an innovation division. There, we test new technologies and new algorithms in order to create new kinds of insurance products.
Within that division, you find the Telematics team. That's us: six years in business, using data science to transform automotive data. And the project I want to talk about is called TEX.
TEX stands for Telematics Exchange Platform. The way people use cars is changing. We've all heard about autonomous cars. They're not there yet, but they're coming. Car-sharing platforms have also had huge success. So there is a need to create new kinds of insurance products.
Our platform enables pay-as-you-drive and pay-how-you-drive kinds of insurance. With them, your premium can be adjusted depending on how and where you drive. We can also provide some feedback at the end of your trips, so you can drive better. It completely changes the way someone experiences their car insurance.
Our platform currently manages more than 60,000 trips per day for about 40,000 customers. Those figures are steadily growing. How do we do that? First of all, we collect data, data from natively connected cars.
If your car is not connected yet, we can also use boxes that we put on your car battery or, sometimes, on the diagnostic port of your car. Some projects can also use simple smartphones. We also gather contextual information like the weather, cartography, and road traffic. So we get all of that, we process it, and we compute a driving score.
The driving score is then exposed to AXA entities or to partners, which manage new kinds of insurance. The platform processes the data as it comes, so we can provide quick feedback to users. In our first platform iteration, we did that using EC2 instances.
We had more than 50 EC2 instances, split across various environments, and the platform was fully designed so it could be brought back into AXA data centers. That's why we used very few AWS technologies. Apart from EC2 instances, we only used Elastic Load Balancers. We managed monitoring ourselves using tools like Shinken, Grafana, and Kibana. Globally, the platform used lots of open source tools, like Kafka or Elasticsearch.
We used Python as a kind of common language between data scientists and backend developers. Unfortunately, it became quite difficult to manage this architecture. We had to update the operating systems quite frequently.
It became more and more difficult to add new components because the platform was complex to deploy. And we even had some strong security constraints. As an anecdote, we had to deploy IPsec links between all the nodes, so even AWS could not see what was happening on the platform. So we needed to change that.
We are a tiny team. We cannot afford to spend time on operating system issues. So we wanted to do less system engineering. We also wanted to scale more easily. Actually, our platform could scale, but we had to add nodes manually, since we were not using AWS auto-scaling technologies, for instance.
We also needed more security. We manage personally identifiable information, and we have strong security constraints coming from strong legal constraints. We simply wanted an easier way to manage this information securely. And, finally, we wanted to focus more on features, because there came a time when we spent more time managing the platform than adding new features. So, two years ago, we decided to change this platform.
And since AXA's policy had evolved so that more cloud services could be used, it was the perfect moment to try another approach. So we started to discuss with AWS to see what we could do. And I think it's time for Xavier to tell you more about the story. Thank you, Vincent. The AXA TEX project team was thinking they could address the challenges of TEX V1 with a better architecture.
They were interested in serverless, but they were not sure it could be a good fit for their project. So they contacted AWS, and, together, we decided to work on a target architecture. We organized a workshop to understand their challenges and their goals, and to be able to make recommendations. We wanted to see if we could replace the existing components deployed on EC2 with AWS serverless services. The main challenge of TEX V1 was that the project team was spending too much time on undifferentiated infrastructure heavy lifting.
That was slowing their capacity to innovate. AWS helps reduce the amount of infrastructure tasks you have to do. However, depending on the service, AWS will take more or less responsibility. Let's take the example of compute. If a customer chooses to use EC2 instances, they will be responsible for provisioning, scaling, patching, securing, and connecting the instances.
At the other extreme, if a customer uses Lambda, they will only be in charge of their application code. Lambda is fully serverless, and all the infrastructure is abstracted. And between IaaS and Lambda, you have other managed services, like Elastic Beanstalk, where responsibility is more balanced between AWS and the customer. This is the same for messaging services, where you have a choice between deploying a service on IaaS or using an abstracted service.
And you have the same choices again for the databases, relational or NoSQL. TEX V1 was not using any AWS managed service, because at the beginning, AXA wanted to ensure the application could also run on premises. That created a lot of operational burden for the TEX V1 team. But in 2017, AXA built a new move-to-cloud strategy allowing the usage of AWS managed services. So, together with the TEX team, we analyzed how to improve the TEX architecture by replacing existing components with more abstracted ones. Let's start with compute.
The AXA TEX project wanted to use serverless instead of EC2 VMs. However, they were not sure whether they could use Lambda or whether they needed containers. So we looked deeper at the workload requirements. Most of the compute was asynchronous processing of events sent by the connected cars.
For this need, Lambda is a very good fit. It offers great scalability, and cost is optimized with no unused resources. Moreover, Lambda is well integrated with the other AWS services, especially Kinesis for streaming. This was around 90% of the compute tasks.
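To make this integration concrete, here is a minimal sketch of a Lambda handler consuming a batch of Kinesis records. The event shape follows Lambda's documented Kinesis event structure; the payload schema (a JSON object with a "vehicle_id" field) is an invented example, not the actual TEX format.

```python
import base64
import json

def handler(event, context):
    """Decode a batch of Kinesis records and return the parsed car events.

    Kinesis delivers record payloads base64-encoded inside
    event["Records"][i]["kinesis"]["data"]. The JSON schema used here
    is hypothetical, for illustration only.
    """
    car_events = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        car_events.append(json.loads(payload))
    # In a real pipeline these events would go on to enrichment and
    # scoring; here we just report the batch so the flow is visible.
    return {"processed": len(car_events), "events": car_events}
```

Because Lambda hands the function a whole batch per invocation, the per-record decode cost is amortized, which is part of why batch reads from Kinesis are cost-effective.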
However, there were some use cases which were not possible with Lambda, like the communication with some telematics boxes, which required UDP or long-running batch jobs. For these use cases, they decided to use Fargate, a serverless compute engine for containers. With Fargate, they were able to get the flexibility of containers, but without the need to manage the instances.
In the end, the target for compute was Lambda and containers. For messaging and streaming, we also wanted to replace Kafka on EC2 with more abstracted services. So we analyzed the business needs. The service had to manage a minimum of a thousand car events per second. Each event could have multiple consumers. And there was a need to preserve event order.
Taking these requirements into account, Kinesis was clearly the right solution, as it is designed to manage event streams. It provides great scalability, it is able to ingest millions of events per second, and it is possible to batch-read events to optimize performance and costs. For the project, there was also a need for simpler queues to process the driving score. And for this need, we decided to use SQS, which is even simpler to manage than Kinesis, because you do not need to manage shards. So for the messaging and streaming services, we were also able to abstract most of the infrastructure heavy lifting. We did the same exercise for the databases.
At the beginning, the team was not sure if they should use SQL or NoSQL. And, once again, we had to deep-dive into the business requirements. For the current trip data, TEX needed to scale to store a lot of events. The requests were mostly key-value searches, and data did not need to be persisted after the trip had been analyzed. DynamoDB provided many benefits for this use case.
It's fully serverless, with no management tasks and pay-per-use billing. It scales horizontally, offering high throughput and fast ingestion of data. And the time-to-live feature makes it possible to automatically remove old trip data. For the other types of data, the project needed to do more complex requests with table joins. So the team chose to use Aurora PostgreSQL.
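The time-to-live mechanism mentioned above boils down to writing an epoch-seconds attribute on each item; DynamoDB deletes items once that timestamp is in the past. Here is a sketch of building such an item. The attribute names, key schema, and 7-day retention are illustrative assumptions, not the actual TEX table design.

```python
def build_trip_item(trip_id, event_time_epoch, payload, retention_days=7):
    """Build a DynamoDB item for an in-flight trip event.

    "expires_at" is the TTL attribute: an epoch-seconds number after
    which DynamoDB may automatically delete the item. All names and
    the retention period are hypothetical.
    """
    return {
        "trip_id": {"S": trip_id},                   # partition key
        "event_time": {"N": str(event_time_epoch)},  # sort key
        "payload": {"S": payload},
        "expires_at": {"N": str(event_time_epoch + retention_days * 86400)},
    }
```

With TTL enabled on the table (pointing at "expires_at"), trip data cleans itself up after analysis, with no cron job or delete sweep to operate.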
And they decided to use Aurora Serverless to minimize infrastructure tasks and to get automatic scaling. So, once again, they had a pragmatic approach. They used both SQL and NoSQL: the right tool for the right job.
TEX V1 was only using IaaS. At the end of the architecture workshops, we had defined a new target for TEX V2. And the new architecture was leveraging AWS serverless services to help the team focus on features. I will now let Vincent present the full target architecture. And here you can see the result. It shows one part of the platform, but it corresponds to a complete project.
Some parts are shared between projects, and some others are very specific to a project. Let's have a deeper look. First of all, we have the ingestion part. There, we collect automotive data.
Typically, it's a part that really depends on the project. Here, we are in a case where we fetch data from a car manufacturer, but with some other projects, we have data which is pushed to us instead. So here we have a CloudWatch Events rule, which triggers a Lambda function, and this Lambda simply fetches data from a car manufacturer database. The data is stored in an S3 bucket, to be archived and for further processing.
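One small but useful detail in such a scheduled-fetch-to-S3 flow is choosing a date-partitioned object key, so the archived raw data is easy to query later (for instance with Athena). The layout below is a hypothetical sketch, not the actual TEX bucket structure.

```python
from datetime import datetime, timezone

def archive_key(manufacturer, fetched_at, batch_id):
    """Build a date-partitioned S3 object key for a raw data batch.

    Partitioning by manufacturer and day keeps raw payloads cheap to
    scan selectively; names and layout are illustrative assumptions.
    """
    day = fetched_at.strftime("%Y/%m/%d")
    return f"raw/{manufacturer}/{day}/{batch_id}.json"
```

The scheduled Lambda would compute this key and upload the fetched payload with an S3 put; only the key-building logic is shown here so the sketch runs without AWS credentials.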
Then come the processing steps. Here we transform the data, up to the point where we can compute a driving score. We transport the data using Kinesis when it is not stored in the database yet.
But once it is stored in the database, we simply use SQS to carry simple processing orders. I want to focus a little bit on our driving score algorithm, because at the beginning, we feared it might not be possible to use a serverless technology: we use Python to compute our driving score, and some Python libraries take a lot of space, while Lambda has package size limits. Fortunately, AWS created Lambda layers optimized for data science, and we managed to use them for our driving score. We then have the enrichment part. Actually, this step happens before computing the driving score. There, we enrich the data using contextual information, like the weather, road traffic, and cartography.
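The enrichment step can be sketched as a pure function that attaches context to each car event. In production, the lookup would query the cached context sources (weather, traffic, cartography); here it is injected as a callable so the sketch runs anywhere. Field names are invented for illustration, not the actual TEX schema.

```python
def enrich(event, weather_lookup):
    """Attach contextual weather data to a car event.

    `weather_lookup` stands in for the real context services; in a
    deployed Lambda it would call the enrichment microservice or its
    cache. All field names here are hypothetical.
    """
    enriched = dict(event)  # leave the original event untouched
    enriched["weather"] = weather_lookup(event["lat"], event["lon"])
    return enriched
```

Keeping enrichment as a pure function of (event, context) is what makes it easy to expose behind an API Gateway and reuse across projects.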
Since it's a service that can be used by other parts of our platform, we expose it using an API Gateway. So it's a kind of microservice. You can also observe that we use Amazon ElastiCache. That's not a serverless service.
It's a managed one. But, in our case, it proved to be more interesting to use than DynamoDB. So we do not hesitate to mix serverless technologies and managed services. Once we have full trips, and once we have computed a driving score, we store all of that in our persistence layer, which essentially uses an Aurora Serverless database.
And we expose this result to AXA entities, simply through an API Gateway. The figure here is simplified: we also use other technologies, like AWS Certificate Manager to manage our TLS certificates, and Cognito to manage authentication. We also deployed a data lake for data scientists. There, we store all data in S3 buckets.
And we use tools like Athena or EMR to explore the data. And, finally, there is the supervision and monitoring part. In our first platform version, we had to set up all the log collection and all the monitoring, and we had to maintain all of that. It was quite a hassle.
But with serverless technologies, it basically comes out of the box; we have nothing to do. Data automatically flows to CloudWatch Logs, and, currently, CloudWatch dashboards are enough for our monitoring. We also use Secrets Manager and Parameter Store to store configuration and third-party secrets used to connect to third-party services, for instance to get the weather from an external server.
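One practical detail with secrets in Lambda: fetching a secret on every invocation adds latency and API calls, so it is common to cache it for the lifetime of the execution environment (it survives warm starts). A minimal sketch, with the fetcher injected so it runs without AWS credentials; in a real function, `fetch` would wrap a Secrets Manager client call.

```python
# Module-level cache: persists across warm invocations of the same
# Lambda execution environment.
_cache = {}

def get_secret(name, fetch):
    """Return a secret, fetching it at most once per execution environment.

    `fetch` is any callable taking the secret name; in production it
    would wrap the Secrets Manager API, here it is an injected stub.
    """
    if name not in _cache:
        _cache[name] = fetch(name)
    return _cache[name]
```

Note that cached secrets only refresh when the execution environment is recycled, so rotation needs either a short cache TTL or a redeploy; that trade-off is worth stating explicitly in your design.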
And to prove to you that this architecture is well suited to our needs, I have some CloudWatch graphs to show you. At the top, in green, is the incoming traffic from cars. At the bottom is how all of our Lambda functions react to this traffic: lots of traffic in the day, lots of Lambda activity.
Less traffic in the night, less Lambda activity. And if we focus on a specific Lambda, our enricher, on the right, you can observe in blue that we have multiple concurrent executions of the Lambda during the day, so there is no delay in our pipeline. There are two side effects to that. The first one: if we have no traffic, we pay nothing for the business logic.
Of course, we have other costs, like data storage or periodic jobs that we may launch, but it's zero dollars for our business logic when there is no traffic. So if you need to deploy other environments, like for pre-production or UAT, it helps you pay less when they have less traffic than production. We also changed the way we organize our AWS accounts.
All of our accounts are placed in what we call the AXA landing zone. It's a concept that was presented in a previous re:Invent session; you may find the video on YouTube if you want to know more about it. This AXA organization deploys SCPs (service control policies) to all of our accounts and also allows centralized billing. Then, in all accounts, we have a special stack, an AXA foundation stack, which is deployed there so security, audit, and compliance rules can be enforced.
For instance, it configures tools like GuardDuty, CloudTrail, or AWS Config. Then we decided to dedicate separate AWS accounts to each variant of our platform. So we have one account for production, another for pre-production, another for integration, and each developer has their own account.
So they can test things without breaking anything in production. We also have a dedicated analytics account for data scientists. From there, they can explore production data in read-only mode without disturbing our production. And, finally, we have a tools account.
Actually, it's our continuous integration and continuous deployment base. From there, we build our artifacts using Jenkins, we store them using S3, and we deploy them using Terraform. In this account, we can also host shared services, like DNS management or centralized authentication for administrative purposes. So did we succeed in resolving our issues? Yes. We do not think about system engineering anymore.
The remaining area is the database, where we sometimes have to tune things, but that was already a shared responsibility between a system engineer and a software engineer. We also scale more easily, since we only use components that can scale, like Kinesis or Lambda. Some components may still need manual tuning, like Kinesis. We also have a more secure project. As an anecdote, in our latest penetration test, we only had business logic issues, whereas with our first platform, we also had to deal with operating system level and virtual machine level issues. So a whole class of issues simply disappeared, thanks to serverless technologies.
And, finally, we can focus more on features, because we spend less time managing our infrastructure and we are more comfortable deploying new components. As a conclusion, I would like to share with you some lessons we learned, so you can avoid some mistakes we made.
And so you can correctly understand what serverless usage implies. First of all, serverless does not mean zero configuration. As an example, with Aurora Serverless, you still need to set up a VPC and to configure an IP addressing plan. With Kinesis, you also have to tune the number of shards you need.
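The shard-count tuning for Kinesis is simple arithmetic against the published per-shard write limits (1,000 records per second and 1 MB per second for Kinesis Data Streams): take whichever constraint binds. A small sketch:

```python
import math

def shards_needed(records_per_sec, avg_record_kb):
    """Estimate the Kinesis Data Streams shard count for an ingest rate.

    Uses the published per-shard write limits: 1,000 records/s and
    1 MB/s. The binding constraint determines the shard count.
    """
    by_records = math.ceil(records_per_sec / 1000)
    by_bytes = math.ceil(records_per_sec * avg_record_kb / 1024)  # 1 MB/s per shard
    return max(1, by_records, by_bytes)
```

For example, a thousand 1 KB car events per second fits in a single shard, but either larger records or a higher event rate pushes the count up, so the estimate should be revisited as the fleet grows.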
Then, the technology maturity is not the same across components. For instance, with Aurora Serverless, you will have fewer features than with its non-serverless counterpart. As an example, as of today, Performance Insights is not available with Aurora Serverless, and it is more difficult to do exports and imports from S3 buckets. But things can evolve quite quickly with AWS.
So I'm sure there will be some upgrades in the near future to fix all of that. You should also think about having a hybrid architecture. Full serverless is not always possible. In our case, we have a mix of serverless technologies, managed technologies, and also containers in some very specific cases. So you should keep that in mind and take it into account in your infrastructure as code.
And, finally, be aware of automatic scaling. As an anecdote, we had a scaling issue with Lambda and our business logic. Due to a bug, our architecture simply scaled too much, and we paid far more than usual.
So please set up alerts, set up limits, and, currently, you can even set up daily budget alerts, so please configure them and use them. That's it. And before closing the session, I would like to thank all the team that contributed to the project. And if you wonder about our next steps, it's simply to add new features and to tune things while AWS also updates its technologies.
Actually, our main technology work is on our CI/CD pipeline, which uses a more traditional approach than the serverless one. It's a topic in itself, and we still have to work on it. Thanks for watching. Don't hesitate to share. And please take some time to complete the session survey.