AWS re:Invent 2020: Gameloft: A zero downtime data lake migration deep dive

Hi everyone. Thank you for joining this session today. I am Fred Nowak, Gameloft solutions Architect and part of the AWS game tech team.

Gameloft is migrating massively to AWS. Today, I have the pleasure of having Alexandru Voinescu on the virtual stage. Alex is Gameloft's Business Intelligence Technical Manager, and he will explain how his team migrated Gameloft's data lake to AWS with zero downtime.

Hi, thank you, Fred. And thank you for joining this session today. I am Alexandru Voinescu, BI Technical Manager at Gameloft. Today's session will give insights into how Gameloft migrated its data and analytics pipeline to AWS. This is our agenda: we will start with a short presentation about Gameloft, and then present our former infrastructure.

Then, we will go over the key choices we made during this migration and talk about the actual execution of our plan. We will wrap it all up with our plans for the future and the key takeaways. Gameloft is a Vivendi company and is also a leader in the development and publishing of games, and has established itself as one of the top innovators in its field since 2000. We create games for all digital platforms.

We operate our own established franchises such as Asphalt, Dragon Mania Legends, Modern Combat and Dungeon Hunter, but we also partner with major IP holders. We distribute our games in over 100 countries and employ over 4,000 people worldwide. Every month, 80 million unique users play our games, and more than 1.5 million games are downloaded every day. Gameloft is in a global process of migrating to AWS.

With multiple games among the world's top 10 most downloaded games, our data volume is massive. You can imagine the volumes of data that games with over one billion downloads produce. The BI department in Gameloft handles almost everything that is data-related in the company.

We provide infrastructure services, analytics and reporting across the whole company. We focus on providing data and insights needed to improve gameplay experience and create better products for our players. This section will cover our former infrastructure and we will share insights on the reasons we migrated to AWS. This is an overview of our former BI data platform system.

If it seems complicated, it's because it is. But first, a little bit of context. So, remember those 275 games? Well, every Gameloft game sends telemetry data to this pipeline.

This data is analyzed by all of our game-related departments, the goal being to provide insights that improve our players' experience within the game, and thus create a better product for them. Data goes through all of these technologies and processes. It included custom software and a lot of open source technology like Cassandra, Hadoop, Spark, different databases and tools. All of these were deployed in our own data centers in Canada and in China. Data first entered through something we called the client-facing servers.

They handled security, validated the events, and pushed data further down the pipeline. The Cassandra cluster was used as a hot data layer, with data indexed by the second. It provided real-time data and also ensured a seven-day retention of raw data to be used further down the pipeline when needed. Then, a series of processes would store this data in different formats, CSV and Parquet files. This data would be used by most of our transformation systems and clients: BI, data science, monetization, finance, HQ, and many others.

So why move to AWS? Well, this pipeline worked, and it worked well for over seven years. But as time went on, we started to notice the first warning signs that we needed to change. We ran software that was not really supported anymore. The client-facing servers, our custom software, were written more than seven years ago.

Although it was very good software when it was made, development and maintenance were becoming difficult. We risked being locked in with our own on-premises technology. We also needed a better way of delivering data to users; the partitioning [INDISCERNIBLE 00:14:23] cluster was slowing down a bit.

Further, more granular partitioning would pose problems for the cluster because it would multiply the already huge number of files. Another issue was that the data in the HDFS cluster was becoming increasingly hard to access and integrate with new workloads and third-party tools. If we needed to scale very quickly, we simply couldn't. Scaling the client-facing servers, the Cassandra cluster, or the HDFS cluster up (or down, for that matter) would probably take days or weeks and a lot of resources. Our own on-premises hardware was nearing end of life; soon servers would need to be replaced.

We also needed better visibility in regard to cost control, monitoring and compute. We have a lot of internal customers, so we really needed to be able to see who is using what. And finally, we needed to democratize access to data and infrastructure. Clients really wanted to explore raw data at its deepest level, and we really struggled to provide the needed availability and performance for all that data. We did have some big Spark clusters, over 1.5 petabytes,

over 500 cores, but they were really shared, so clients were often waiting for resources to be available. Also, developers needed direct access to infrastructure whenever they had to deploy something, be it a POC, an experiment, or a production-grade workload. There were times when we were held back by a lack of physical resources.

And often, as is the case with most POC or experimental work, those resources were left running for days and weeks without actually being used. We needed a dedicated data pipeline. Could you give us more details on that project, and did that experience help you with this huge migration? Yes, we worked on a similar analytics pipeline for a different project, for a different business division.

At that time, we did not have any experience with AWS. We had to build everything from scratch, and we learned a lot about AWS analytics services and more. This project helped us design the target architecture for this migration quickly.

But most importantly, we became confident in the key choices made for this migration. The first key strategic choice was regarding data ingestion. When we first thought about data ingestion and transport, we thought about Kafka. At that time, AWS did not offer managed Kafka, so we turned our attention to Kinesis. Alongside Kinesis, we considered a fully managed implementation using API Gateway and Lambda.

While this seemed to fit perfectly in theory, our ingestion logic makes use of a lot of metadata from different systems, and it was hard to implement this in API Gateway. Also, we handle over three billion events per day, and at that scale Lambda and API Gateway turned out to be expensive for this workload. So we decided to rewrite our client-facing servers and host them in containers on our own Kubernetes cluster with Rancher, which I will talk about in a minute, because this was a key strategic choice for us as well.

The flow is simple and effective. Data sent by games is processed by our software and passed on to multiple Kinesis streams. Some are configured with seven-day retention, very much like the old system. But the difference from the old system is that Kinesis is a managed service, so we have reduced the maintenance overhead by quite a lot. We could also scale the streams up and down and become more efficient with resources, since our traffic is not linear across days, weeks, or months.
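As an illustration of that flow, here is a minimal sketch of how a client-facing server could forward validated events to a Kinesis stream with boto3; the stream name, region, batching and event fields are assumptions for the example, not the actual production code.

    import json
    import boto3

    # Hypothetical forwarding step inside a client-facing server: validated
    # telemetry events are batched and pushed into a Kinesis stream.
    kinesis = boto3.client("kinesis", region_name="us-east-1")

    def forward_events(events, stream_name="telemetry-main"):
        # PutRecords accepts at most 500 records per call, so send in batches.
        for i in range(0, len(events), 500):
            batch = events[i:i + 500]
            records = [
                {
                    "Data": json.dumps(event).encode("utf-8"),
                    # The partition key decides which shard a record lands on
                    # (see the sharding notes later in the session).
                    "PartitionKey": str(event["event_id"]),
                }
                for event in batch
            ]
            response = kinesis.put_records(StreamName=stream_name, Records=records)
            if response["FailedRecordCount"]:
                # A real service would retry only the failed subset.
                print(response["FailedRecordCount"], "records failed, retry needed")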

For our real-time use cases, we would set up a microservice that would filter data out of the main Kinesis streams and create (and also stop when no longer needed) dedicated customer streams.
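A rough sketch of such a filtering microservice, assuming a single-shard stream polled with boto3 (a production version would rather use the KCL and handle every shard); the stream names and the event-type filter are invented for the example.

    import json
    import time
    import boto3

    # Illustrative relay: read the main telemetry stream and republish only the
    # events a specific real-time consumer cares about into a dedicated stream.
    kinesis = boto3.client("kinesis")

    def relay(source="telemetry-main", target="realtime-example", wanted_type="session_start"):
        shard_id = kinesis.describe_stream(StreamName=source)["StreamDescription"]["Shards"][0]["ShardId"]
        iterator = kinesis.get_shard_iterator(
            StreamName=source, ShardId=shard_id, ShardIteratorType="LATEST"
        )["ShardIterator"]
        while True:
            out = kinesis.get_records(ShardIterator=iterator, Limit=1000)
            iterator = out["NextShardIterator"]
            filtered = [
                {"Data": record["Data"], "PartitionKey": record["PartitionKey"]}
                for record in out["Records"]
                if json.loads(record["Data"]).get("type") == wanted_type
            ]
            if filtered:
                kinesis.put_records(StreamName=target, Records=filtered)
            time.sleep(1)  # simple polling loop, just for the sketch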

We had to consider how we would deploy the applications and workloads required for this migration. We went with Kubernetes with Rancher over anything else, mostly because we had a lot of experience with it, and in the end it really simplifies operations. We deployed it on EC2 instances in order to benefit from the scaling opportunities that EC2 offered us. This was a strategic choice for us because the Kubernetes cluster would allow us to migrate, either through lift and shift or refactoring, almost all of the BI workloads from our ecosystem: ETLs, crons, applications, tools, etc. We would have them under one roof, with a centralized overview of everything.

Also, we could easily test and deploy new tech without committing to new hardware. We could now deploy what we wanted easily and quickly, and for as long as we needed it. Developers would also get direct access to infrastructure; they would be able to deploy and develop easily without having to wait for instances. Our objective was to create an environment that would enable data exploration users to achieve their data goals easily. So for our data storage and analytics layer, we went with the obvious option: S3. In the old system, we had two layers of raw data, CSV and Parquet files in HDFS.

We had different systems reading from the CSV files, but really, those files were not accessible to anyone outside a small team of developers. Using Kinesis Firehose, we took advantage of the perfect integration with Kinesis streams. The data would first land in the raw data layer and then go through a process of cleaning, segregation, partitioning and enrichment. The clean data would then be stored in another S3 bucket, ready to be used by the data explorers.
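To picture that raw-to-clean step, here is a hedged sketch of how such a job could look in PySpark on EMR; the bucket names, column names and the partitioning by game and event date are illustrative assumptions, not the exact production job.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("raw-to-clean").getOrCreate()

    # Read the raw JSON delivered by Firehose (placeholder bucket and prefix).
    raw = spark.read.json("s3://example-raw-telemetry/2020/11/*/*")

    clean = (
        raw.dropDuplicates(["event_id"])                       # basic cleaning
           .filter(F.col("event_type").isNotNull())            # drop malformed events
           .withColumn("event_date", F.to_date("event_time"))  # enrichment used for partitioning
    )

    # Write clean, granularly partitioned Parquet to the clean bucket.
    (clean.write
          .mode("append")
          .partitionBy("game_id", "event_date")
          .parquet("s3://example-clean-telemetry/events/"))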

S3 was a huge improvement over HDFS. It gave us the opportunity to be more granular in terms of partitioning, which translates into faster data processing. Add to that lifecycle and backup policies and the perfect integration with ingestion streams like Firehose, and we got a huge reduction in maintenance overhead.

Most importantly, we no longer had to self-manage the delivery of the data. Within a couple of weeks, the new partitioning scheme offered fast access to both raw and clean data as soon as it reached the S3 bucket. EMR provided the flexibility we needed from the processing layer when doing complex jobs like the transformation from raw data to clean data.

The ability to use EMR's compute power efficiently, only when needed and only as much as needed, meant no more crowded Spark clusters or jobs waiting in line. So we had a plan: the ingestion layer would be replaced by Kinesis, the data processing and tools would be handled by Kubernetes with Rancher, and the data analytics layer would be covered by S3, Athena and EMR. We would also replicate a limited version of this pipeline in China. The pipeline is almost completely automated; the only non-managed component is the client-facing servers.

This would reduce the maintenance overhead and allow our developers and DevOps to focus on higher-value tasks. Changing infrastructure enabled us to better organize our internal types of users. We now have real-time users using Kinesis, data explorers using Athena and EMR, and analytics users using our third-party partners like Snowflake and Looker. The new architecture would also be better suited to handling new use cases and new technologies. We are better positioned to optimize and grow than ever before. Besides being able to test anything from the open source world, we are also in a position to test new AWS services.

For example, we ran a POC with Timestream when it was still in beta, and we are now beginning to work with SageMaker. So we had a plan: use managed services where it made sense and implement simple, maintainable services where needed. We would also rely on S3 for our raw and clean data. This would give us perfect integration with our third-party tools like Snowflake and Looker, and also offer better partitioning of the data.

Alex, you said that your first experience with AWS helped you design this target architecture faster, but at that time you had neither internal customers nor backward compatibility to ensure, and the volume of data was one or two orders of magnitude smaller, right? Yes, we started from scratch, so really we had no constraints. Also, the scaling was pretty reasonable, under one million events per day. We had a working production-grade system, from ingestion to visualization, done in a month and a half with a team of three developers.

But this project was a real punch in the face; the scale was massive. And it was a system that had been running for over seven years, with lots of constraints and clients. Migrating a production environment of this size is not easy. It was a giant: a seven-year-old production system with over 1.2 petabytes of data for 275-plus games, countless clients across the company and over three billion events processed per day. In order to migrate it safely, we needed to ensure zero downtime of the pipeline and the integrity of the data.

We also had to ensure a period of backward compatibility and make sure that the system performed at least on par with the old one. Finally, we wanted to enable more use cases for our data. So, we are not talking about zero downtime only for the system, but also for the clients along the pipeline, and they were many, using different technologies. Zero downtime for them would mean having the time to test the new system, being confident that the data we provide is accurate, and having the right support to migrate, all of this while the system was running.

So how do you achieve real zero downtime for a system of this size? Well, the best scenario is to have both systems working in parallel with the same amount of data. Duplicating all of the traffic would give us the time, and the production-grade data and traffic, to really test the system. And so we did.

We started by gradually duplicating the data from the data center to AWS. We used the NGINX mirror module and started duplicating traffic from groups of games, one by one. This allowed us to evaluate and make adjustments to the AWS system as data and traffic grew and grew.

Then, we needed to switch the primary endpoint to AWS. This is where Kubernetes with Rancher came into play. Behind a Network Load Balancer, we deployed NGINX on the Kubernetes cluster and used the mirror module to duplicate traffic back to our data center.

However, not everything worked as planned; we obviously ran into some issues. One of the first issues we ran into was hairpin requests. These are requests that come from inside the Kubernetes cluster, are sent back to the cluster, and hit the same Kubernetes node from which the request originated. The request is then discarded, as it has the same source IP.

This would happen when two services, like NGINX and the client-facing service, for example, want to communicate with each other via their external DNS name. In order to overcome this problem, we switched the target type to IP. This, however, required that we enable the proxy protocol between the Network Load Balancer and NGINX in order to preserve the original IP address.

While using the IP target type, we hit another limitation: the number of simultaneous connections to a single target is limited to 55,000 per minute. In order to overcome this, we changed the instance type to a smaller one and increased the number of instances. This data duplication really allowed us to have solid redundancy in our pipeline. If something crashed, we were prepared to revert to the on-premises system at any time and without impact. And guess what? It did. Out of the 275 games, one of them crashed; the game would not work anymore after we switched to AWS.

So obviously, we reverted immediately to the old pipeline and started debugging everything. The game team finally found the problem and fixed it. But really, the best thing was that we could revert quickly to the old pipeline; that really saved us. So, data duplication worked like a charm. We avoided serious issues by being able to test the system with production data and traffic. When it comes to accuracy and compatibility, our focus was on the clients.

We had clients that would use the old pipeline either through Cassandra and queues, the CSV files or the HDFS cluster. It is not so easy to migrate them to the new tech: Kinesis, S3, Athena and EMR. Data duplication helped a lot because it enabled them to modify or refactor their workloads without impacting the old ones. Most importantly, they were able to verify data accuracy and data integrity by looking at the complete data.

Probably the biggest change was for the real-time users that needed to switch from Cassandra to Kinesis. But first we needed to do this ourselves, going from Cassandra and queues to Kinesis. By checking the data, we ran into some issues that we could spot only in the production environment, and only with full production traffic. One of the issues we had was about sharding. If the sharding key was not random enough, this would put pressure on some shards, and scaling up Kinesis would help very little. You need to make sure that messages are distributed evenly between shards by having a uniform enough sharding key.
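A small sketch of that advice, assuming boto3 and a hypothetical high-cardinality device_id field: hashing the field keeps the partition key uniform even when the raw IDs are skewed or clustered.

    import hashlib
    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def put_event(event, stream_name="telemetry-main"):
        # Derive the partition key from a high-cardinality field so records
        # spread evenly across shards instead of piling onto a few hot ones.
        partition_key = hashlib.sha1(str(event["device_id"]).encode()).hexdigest()
        kinesis.put_record(
            StreamName=stream_name,
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=partition_key,
        )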

Retries can sometimes be too aggressive, especially if the retried messages go to the same shard, so use a delay between retries. Also, make sure you use enhanced fan-out if you have multiple consumers, because read capacity is limited and independent consumers may compete for the same read capacity.
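As a sketch of the retry advice (assuming boto3 and records already shaped for PutRecords), retry only the failed subset and back off with a bit of jitter so retries do not all hammer the same shard at once.

    import random
    import time

    def put_with_backoff(kinesis, stream_name, records, max_attempts=5):
        attempt = 0
        while records and attempt < max_attempts:
            response = kinesis.put_records(StreamName=stream_name, Records=records)
            # Keep only the records whose per-record result carries an ErrorCode.
            records = [
                record
                for record, result in zip(records, response["Records"])
                if "ErrorCode" in result
            ]
            if records:
                attempt += 1
                time.sleep(min(2 ** attempt, 30) + random.random())
        return records  # anything left here failed after all attempts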

Another thing to watch is the DynamoDB table used by the KCL to track Kinesis shard leases and checkpoints: it needs to have enough capacity, even if you have only one consumer instance. If you don't ensure enough capacity, it might become a bottleneck for consumers even though there is enough Kinesis read capacity. In order to make sure we had the same data in both pipelines, we checked our data constantly and made adjustments when needed. We needed to check all of the 275 games, each with over 100 data points.

This was no small task. We used Spark to check the data on premises and took advantage of Athena and EMR for the data in the cloud. This also allowed us to spot and fix problems in the data.
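On the cloud side, such a check can be as simple as an Athena aggregation whose results are compared with the on-premises Spark counts. A hedged sketch with boto3, using placeholder database, table, bucket and column names:

    import boto3

    athena = boto3.client("athena")

    # Count events per game for one day; the same aggregation runs on premises
    # with Spark and the two result sets are compared.
    query = """
        SELECT game_id, event_date, COUNT(*) AS events
        FROM telemetry.clean_events
        WHERE event_date = DATE '2020-11-01'
        GROUP BY game_id, event_date
    """

    execution = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "telemetry"},
        ResultConfiguration={"OutputLocation": "s3://example-athena-results/checks/"},
    )
    print("Started Athena check:", execution["QueryExecutionId"])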

In terms of performance and optimization, we had huge wins with the new pipeline. We increased the availability of all the telemetry data from daily to within minutes, and we enabled direct SQL access to raw data within minutes of receiving it. This was a huge win, as in the previous system most of the data was only available daily, with a few near-real-time exceptions. With the help of S3, we now have better-partitioned data that users can access faster than ever before. It also enabled us to have an improved data retention policy. With the ability to spin up compute when needed, and only as much as we needed, we really optimized compute usage.

This meant that users were no longer queuing in line for resources. As I've said before, S3 enabled us to repartition our data the way we wanted. Previously, because of the huge number of files, the HDFS cluster forced us to group some of our data, and that had a big impact on the speed of data access. The old HDFS partitioning grouped data so that the number of files was as small as possible.

This was not aligned with the querying model, so users usually read way more data than they actually needed. If they needed data for only one data point, they usually read data for at least two or three, or even more. With the new partitioning in S3, you read exactly what you need, with huge gains in speed and availability. Athena and EMR are now the de facto tools that we use for raw and clean data exploration.
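To see what the new layout buys, here is a tiny sketch (same illustrative bucket and columns as before): because the clean data is stored as .../game_id=.../event_date=..., a filtered read only touches the files for that one game and day instead of a whole grouped dump.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

    events = spark.read.parquet("s3://example-clean-telemetry/events/")

    # Spark prunes partitions, scanning only the matching game_id/event_date prefix.
    one_slice = events.filter(
        (events.game_id == "example_game") & (events.event_date == "2020-11-01")
    )
    print(one_slice.count())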

So that was the journey, with some challenges and even a revert, but you made it in the end, and that's great. Now that you have migrated this data to AWS, what about the future and what are your plans, Alex? This new analytics pipeline and data lake gave us the opportunity to enable our users to do more with game data. We enabled more real-time use cases for game teams and other support teams. Data scientists now have direct SQL access to raw data using fine-grained permissions. The high number of games and data points we handle means that monitoring the data sent by each one of them is a difficult task. We are now evaluating AWS services to detect anomalies in each game's event stream.

This would trigger alarms in almost real time, and the data we are looking at seems very promising. We are also planning on improving our forecasting and data governance systems with AWS managed services. So what have we learned from this migration? Well, first of all, duplicating data was key to success. It enabled an easy transition for clients and allowed us to revert when needed. Also, S3 dramatically increases data usability.

Data can be better partitioned, it integrates easily with practically everything, and it can be accessed and monitored easily. Use as many managed services as you can, as this will reduce overhead by a lot. If you're smart about it, it can also be way cheaper than your current on-premises deployment. You will overprovision at first, but then you can really optimize as the system becomes stable. One example would be our China deployment, where we reduced costs by 40% after the first three months.

Make sure you are redundant, because what can go wrong usually will go wrong. This helped us when we had issues with the one game that stopped functioning: we simply reverted back to the old pipeline and were able to fix the problem with minimal impact. Focus on your clients and gather as much data as you can about their use cases. Make sure everyone knows about the migration.

Last but not least, AWS is full of possibilities when it comes to data. Make sure you check them all out and experiment with them. Thank you for attending the session. Also, a big thank you to the Gameloft BI and data lake teams, especially to Mariane, Marius, Mihail and George.

This migration would not have been possible without them. Thank you all for watching, and don't forget to fill out the session survey at the end.
