Running Shopware at scale - Shopware SaaS | #SCD23


Hello and welcome to Duisburg from my side. I'm Jens Küper, a cloud engineer at Shopware. You might have seen the title in the program, and the first thing to note is that scale can mean two things. It could mean having, I don't know, a shop with a million visitors an hour.

And it could also mean scaling in terms of having thousands of online shops, which is what we do, and that's the direction this talk is heading. Maybe the remote does not work. Ha.

Nice. Um, yeah. About me. I joined Shopware over six years ago.

Back then, Shopware 5 was the biggest thing; sadly, today we are kind of finishing that era, but I think in a good way, by opening it up to the community. So I worked on Shopware 5, and shortly after I joined the company we started thinking about a new version. Back then it didn't have a name; we called it Shopware Next. I also gave a talk five years ago at the Community Day, just over there, under the name Shopware Next, because we didn't want to spoil it. Since then I have been a core developer.

I was there from the start of Shopware Next and Shopware 6, and in 2022 we decided we wanted to offer a SaaS solution. So we formed a new team, the cloud team, I joined as a cloud engineer, and today we are all here. My talk has three parts. First up, a bit about the infrastructure and the decisions we made there. Next up is running at scale and the problems we encountered, or at least a few of them.

Of course, there are more than fit into this talk. And last but not least, observability: having a look inside Shopware, inside your e-commerce solution, to ideally prevent what's happening in the picture, where the data center is on fire. I'm not sure if everybody is aware of what Shopware SaaS is about, so I made a quick slide for that. It's a fully managed version of Shopware. It comes with automatic updates and security patches.

And I think this is kind of special, because we are actually the first ones to receive the security patches. So even before the security plugin is released, even before a new Shopware version is out, the cloud is patched. By design, it's the most secure version out there. It's highly available, which means our team has a 24/7 on-call duty, so it can happen, and it has happened, that in the middle of the night your phone wakes you up and you have to find out why something is not working as expected.

It's running in multiple availability zones, which means it's located in different, physically distributed data centers. So even if one data center catches fire, it will keep on running. We chose AWS as the infrastructure provider for the whole solution. It's a multi-tenant system, which means some of the resources are shared across all customers.

For example, the app service, which has two benefits: for one, it's more economical, and on the other hand, you can use resources from a customer who currently has little load for another one, to even out high-load scenarios. We have autoscaling, including for the database, which I think is kind of special, because scaling MySQL efficiently is still a big job, and there are numerous optimizations on top of what Shopware already brings. We have a global CDN with a full page cache and protection in front, and we do on-the-fly image transformation, so we will compress images to WebP, if the browser supports it, to improve page speed.

Same for the content, with gzip or Brotli. We do all the certificate handling for you, so you can bring your custom domain to us and we will, yeah, make sure your certificate never expires.

We have a CAPTCHA solution with invisible CAPTCHAs, so customers are not disturbed by annoying challenges. And we have also figured out the mailing side, so you do not have to bring a mail server or anything like that. So this is Shopware SaaS in a nutshell, the way you would present it in a marketing deck.

So let's look at it from a better perspective, a tech perspective. If somebody wants to access a cloud shop, they will send a GET request, say to the domain scd.shopware.store. Ideally, in a perfect world, the global CDN has the page in its full page cache, you get an immediate response and you are all set. The reality often looks different. It could also be that, because I just made it up, scd.shopware.store is not a real store, so you will

probably get a 404 status code from our reverse proxy, because the shop simply does not exist. And the third case could be that it's a real shop, we haven't cached it, and you will go through our infrastructure and end up

in one of the application clusters and get served a real response. The custom reverse proxy you saw in the diagram is code we wrote ourselves. It's written in Go and it has a couple of jobs. The first is to add metadata to every request. For example, the SCD shop could be on the Rise plan, the Evolve plan or the Beyond plan, and depending on that, things change in the background in terms of features, licensing, resources and so on.

So this is added to every request. We add routing information, which will come in handy later. And we do rate limiting there, because it's a lot more efficient to do it outside of Shopware: if people run into your rate limit, you do not want to spend any valuable resources just on keeping the rate limit up. So we do that at the proxy as well, along with a few other things, but these are the main jobs the reverse proxy has to do.
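To make that a bit more concrete, here is a minimal Go sketch of the idea, not our actual proxy code: the plan lookup, the header name and the limits are invented purely for illustration.

```go
// Sketch of an edge middleware: look up tenant metadata for the incoming
// host, attach it as a header for the application, and reject rate-limited
// clients before Shopware spends any resources on them.
package proxy

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// shopPlans stands in for the tenant lookup; the real data lives elsewhere.
var shopPlans = map[string]string{
	"scd.shopware.store": "beyond",
}

var (
	mu       sync.Mutex
	limiters = map[string]*rate.Limiter{} // one limiter per client address
)

func limiterFor(client string) *rate.Limiter {
	mu.Lock()
	defer mu.Unlock()
	l, ok := limiters[client]
	if !ok {
		l = rate.NewLimiter(10, 20) // 10 requests/s with a burst of 20
		limiters[client] = l
	}
	return l
}

func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		plan, ok := shopPlans[r.Host]
		if !ok {
			http.NotFound(w, r) // unknown shop: answer 404 at the edge
			return
		}
		if !limiterFor(r.RemoteAddr).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		r.Header.Set("X-Shop-Plan", plan) // metadata for the app behind us
		next.ServeHTTP(w, r)
	})
}
```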

The other feature we built into Shopware Cloud, or Shopware SaaS, is dynamic routing and versions. The idea here is that you of course have different Shopware versions; we release at least once a month, and in the future probably more often.

So we have multiple versions running in the cloud, and the reverse proxy knows which customer is on which version and routes the traffic accordingly. But the really great feature is custom versions. Every one of our developers can go to a merge request in our version control system, add a label and say, well, I want to deploy exactly that code on Shopware Cloud. Then we build a custom version with their changes, and about ten minutes later they can test it out in our SaaS environment. They get a new shop ready to click around in, but they can also migrate an existing shop to that version. And then we have a control plane in the back which calculates the difference between the versions and checks:

Do I have to do a database migration? Do I have to recompile the theme? Then you can start a rollout. For a single shop it's very simple; for all shops in the cloud it's a staged rollout, so we start with a small percentage, see if everything checks out, and go on and on.

Within a few hours, every shop is on the new version. We can also do rollbacks that way; for code-only changes it's super simple. Looking back (we started three years ago, and this has been there since, I think, the very beginning), it was one of the best decisions we made. For one, it allows our developers to try out their changes, because Shopware Cloud and Shopware SaaS is not only Shopware, it's a lot of extensions, configuration, versions and so on.

So being able to test that without affecting anybody is a good thing. And the rollout mechanisms we built have really helped us to, yeah, offer the uptime we have achieved so far.
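As a rough illustration of the routing and staged-rollout idea, here is a small Go sketch; the types and percentages are made up, and the real control plane does a lot more (migrations, theme recompilation, health checks).

```go
// Sketch: the proxy keeps a table of which application version each shop runs
// on and routes accordingly; a staged rollout moves shops over step by step.
package rollout

type Router struct {
	shopVersion map[string]string // shop ID -> Shopware version, e.g. "6.5.1"
	upstreams   map[string]string // version -> application cluster address
}

// UpstreamFor tells the proxy which cluster should serve a given shop.
func (r *Router) UpstreamFor(shopID string) string {
	return r.upstreams[r.shopVersion[shopID]]
}

// StageRollout moves the given fraction of shops to the new version. It is
// called repeatedly (e.g. 1% -> 10% -> 50% -> 100%) while checks stay green.
func (r *Router) StageRollout(shops []string, newVersion string, fraction float64) {
	n := int(float64(len(shops)) * fraction)
	if n > len(shops) {
		n = len(shops)
	}
	for _, shop := range shops[:n] {
		r.shopVersion[shop] = newVersion
	}
}
```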

So now to the problems. One of the problems we ran into very early on is that Shopware comes with quite a few database tables. I looked it up a few days ago and it was 311 in Shopware SaaS: product, ACL and so on, plus of course a few from extensions and a few of our own. Which is not a whole lot for MySQL; 311 tables are not a problem. But now put a thousand shops on a single database cluster.

Maybe you think, Well, that's crazy. It would never withstand that load. Yeah.

A thousand shops of very big customers would not work. But in our cloud environment we also have, like I said, testing instances for our devs, and we had the free plan with people just trying things out, demo instances. So a thousand shops, from a load perspective, is totally doable.

From a table perspective, not so much. You end up with over 300,000 tables, and MySQL has a nice little feature built in: every time you run a database query, say, select everything from product, it will look up the metadata of the product table (there's a column called id, it's of type binary, and so on) and store this information in memory so it doesn't have to look it up again the next time the table is opened. That's about 20 kilobytes per table, at least for our use case. If you do the math, that's about 6.2 GB of memory on the database cluster for nothing, just for knowing that the tables are there.
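Just to make the numbers concrete, a quick back-of-the-envelope calculation (taking 20 KB as 20,000 bytes):

```go
// Back-of-the-envelope math for the table metadata cache from the talk.
package main

import "fmt"

func main() {
	const (
		tablesPerShop = 311
		shops         = 1000
		bytesPerTable = 20_000 // ~20 KB of cached metadata per table
	)
	totalTables := tablesPerShop * shops
	totalBytes := totalTables * bytesPerTable
	fmt.Printf("%d tables, %.1f GB of metadata cache\n",
		totalTables, float64(totalBytes)/1e9) // 311000 tables, 6.2 GB
}
```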

And the quirk we stumbled into is that it's not like you start your MySQL server and the memory is immediately gone, because those tables are accessed over time. So your memory curve slowly goes down, then you restart your server and everything is fine again, because by default the only two ways to fight this are either restarting your MySQL server, not a good idea, or deleting the tables, which only happens when a customer leaves, so also not a good idea.

Yeah. So after we figured it out, it took some time to mitigate, and the biggest optimization was moving to MySQL 8, because at the time we launched SaaS, 5.7 was the only version available and the problem was bigger there. In MySQL 8 they optimized it and introduced the concept of general tablespaces, which you can use. We also actively looked into MySQL alternatives, for example Percona, because there you can configure this behavior and say, well, invalidate this metadata cache, I don't know, after an hour or a day. So there are options; figuring them out just takes some time. The next thing that was a bit tricky was scheduled tasks.

I don't know if everybody knows what they are, but they're kind of the cron jobs of Shopware 5: tasks that run at a certain interval. There are some relaxed tasks that run only once a day, and there's a product export that runs by default every 60 seconds. So for one shop that's 60 runs an hour, which means 60 times 24 runs a day. Now say each run takes one second of compute time.

So it takes your server one second to do the product export, which for a CLI command is realistic, actually quite fast. That's about 1,440 seconds a day for one shop.

For the cart cleanup, it's not a big deal, it's like 24 seconds a day. That's doable.

But for 10,000 shops, the product export alone is about 166 days of compute time every day. So if you have one worker, you would lag 166 days behind after a single day; your queue would be gigantic. That also means that with single-threaded workers you would need 166 of them just to keep up with that single scheduled task. Of course you can increase concurrency and so on.

But just to give you an idea of scale: even those 24 seconds become about two and a half days of compute for 10,000 shops. So what we did, and this is an improvement that is not in Shopware, only in Shopware SaaS, is a sort of smart scheduling. We look at the customers and do not strictly stick to the interval: if a shop isn't opened by any end user and no product has been created, there's no point in doing a product export every 60 seconds. So that is something we did. But of course we also talked about the problem internally and have since improved the scheduled task handling in Shopware itself.
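A very rough sketch of what such smart scheduling can look like; the predicates are invented for illustration and the real logic in Shopware SaaS is different:

```go
// Sketch of "smart scheduling": skip a scheduled task run when nothing
// relevant has happened since the previous run.
package scheduler

import "time"

type ShopActivity struct {
	LastStorefrontVisit time.Time
	LastProductChange   time.Time
}

// ShouldRunProductExport skips the export if no product changed and nobody
// visited the storefront since the last run.
func ShouldRunProductExport(a ShopActivity, lastRun time.Time) bool {
	return a.LastProductChange.After(lastRun) || a.LastStorefrontVisit.After(lastRun)
}
```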

I think that's a general pattern you will see: a lot of the problems we run into are not purely infrastructure related, so we take them back to Shopware as a product, talk with the teams and then improve it. It's the same with doing rollouts, because, and I'm not sure if anybody has noticed it,

we have been doing these rollouts for three years now, and that means we find bugs more quickly. The chance of a totally broken release goes down; we have other mechanisms in place to fight that too, but testing releases in our environment really helps us create a more stable product. Another problem we encountered was bots, because as soon as you open a website, or a lot of websites, to the public, they will rush in and try everything. One problem we had was mailing, because bots like forms and they like to try them out. There is also the possibility in the Shopware administration to test the mailing functionality of your shop, which is a great feature, but not for bots and people who want to abuse it, because then they basically have a free mail server which they can configure in a certain way, of course over an API. In general it was a problem we had to fight, and mailing also comes with reputation, so it's not like, well, then I'm sending out a million mails a day.

I don't care. It will drag down the reputation for all the other customers. So what we did there was, first, validate email addresses, and by that I don't mean a regex check of whether it looks valid, but actually talking via SMTP to the providers: does that inbox exist, is it full, is it a good address, and so on.

There you can do quite a bit. You can monitor the bounces, so if somebody reports an email as spam or the inbox is full, you see that and eventually stop sending those mails, because if you know a mail will never be received, it's better not to send it at all. And for the testing feature in the administration, the fix is simple: rate limiting, because bots are often faster than humans, so you can tell the difference there.
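As an illustration of that kind of SMTP-level check, here is a hedged Go sketch using the standard library's net and net/smtp packages; the sender and HELO names are placeholders, and a production check needs timeouts and handling for greylisting providers:

```go
// Sketch: ask the recipient's MX host whether a mailbox exists, without
// actually sending a mail.
package mailcheck

import (
	"fmt"
	"net"
	"net/smtp"
	"strings"
)

func MailboxExists(address string) (bool, error) {
	parts := strings.SplitN(address, "@", 2)
	if len(parts) != 2 {
		return false, fmt.Errorf("not an email address: %q", address)
	}
	mxs, err := net.LookupMX(parts[1])
	if err != nil || len(mxs) == 0 {
		return false, fmt.Errorf("no MX record for %s", parts[1])
	}
	c, err := smtp.Dial(strings.TrimSuffix(mxs[0].Host, ".") + ":25")
	if err != nil {
		return false, err
	}
	defer c.Quit()

	if err := c.Hello("validator.example.com"); err != nil {
		return false, err
	}
	if err := c.Mail("check@example.com"); err != nil {
		return false, err
	}
	// Many providers reject unknown mailboxes at RCPT TO, so a rejection here
	// is a strong hint that the inbox does not exist (or is full).
	if err := c.Rcpt(address); err != nil {
		return false, nil
	}
	return true, nil
}
```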

The next problem we had was forms. Every shop comes with, I don't know, a newsletter form, a login form, a registration form, and bots like to try those out as well. It's a bit related to mails, because some of these forms will send a mail. What we did there was, I think, quite easy: we introduced a CAPTCHA solution. We stick to invisible CAPTCHAs so as not to disturb the e-commerce experience and conversion rate, and we also added rate limiting, because bots fill out forms faster than humans.

So this helps as well. And last but not least, we had generic attacks. I think everybody has noticed it when looking at the access logs of their server: bots always check for /admin. It's one of the most common probes around, because WordPress is a big product and there are lots of sites with lots of security vulnerabilities.

And you might say, well, why do I care if they browse that page? You do care, because Shopware actually takes quite some time to figure out that /admin is not a valid page. And you might think, well, that's a bug in Shopware, just add a blocklist and block it. It doesn't work that way, because you have SEO URLs, and in theory there could be a landing page or a category with the name admin. For us as a SaaS provider that's a problem: we cannot just say it simply does not exist; we do not know that. But again, we talked internally to the teams.

We improved the 404 handling in Shopware so that we can produce that response faster and also cache it. And you can actually do rate limiting on 4xx status codes, because a 4xx usually means the user did something wrong, bad input or whatever. If a client does that very often, like visiting pages that do not exist or filling out forms with invalid data, it's probably not a real user, because a real user would figure it out eventually, and you can block those kinds of requests as well. So this is all technology running in our SaaS infrastructure to fight these problems.
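Here is a small Go sketch of the idea of rate limiting on client errors; the threshold and the middleware shape are made up for illustration:

```go
// Sketch: count 4xx responses per client and start blocking a client once it
// produces too many of them within a window.
package edge

import (
	"net/http"
	"sync"
	"time"
)

type errorTracker struct {
	mu     sync.Mutex
	counts map[string]int
}

func newErrorTracker(resetEvery time.Duration) *errorTracker {
	t := &errorTracker{counts: map[string]int{}}
	go func() {
		for range time.Tick(resetEvery) { // periodically forget old counts
			t.mu.Lock()
			t.counts = map[string]int{}
			t.mu.Unlock()
		}
	}()
	return t
}

// statusRecorder lets the middleware see which status code the app returned.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func (t *errorTracker) Middleware(next http.Handler) http.Handler {
	const maxClientErrors = 50 // per window, per client address
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		t.mu.Lock()
		blocked := t.counts[req.RemoteAddr] > maxClientErrors
		t.mu.Unlock()
		if blocked {
			http.Error(w, "too many errors", http.StatusTooManyRequests)
			return
		}
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		if rec.status >= 400 && rec.status < 500 {
			t.mu.Lock()
			t.counts[req.RemoteAddr]++
			t.mu.Unlock()
		}
	})
}
```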

And now to the third part, and maybe the most fun part of it all: observability. I think it's a key point and it has really helped us, because if you do not know what is happening inside your infrastructure, inside Shopware, you cannot really optimize it and you cannot find the actual problems. Observability consists of basically three pillars. One is metrics, which are basically time-based statistics. Tracing is actually looking into the execution of the application and figuring out what takes how long. And logging is log messages:

"it worked fine", "something went wrong", these kinds of things, ideally with stack traces and so on. For metrics I have a simple example. The most common one, I think, is CPU utilization, for either your processes, your servers, your containers, whatever.

You can visualize them and keep an eye on them. You can also look at the performance of your application and compare it across different code versions, to see if code changes actually brought improvements in real life and not only in PHPBench or whatever. For tracing, the best thing you can do is distributed tracing. This is an actual trace from, I think, one of our staging shops; it was uploading a media file in the administration. You can see how long the Symfony request took,

how long each Redis call took, how long each database query took. With distributed tracing it goes even further: you see the green one in the top right corner, because if you upload a media element, thumbnails are usually created for it, and that's done in the worker. With distributed tracing you can pass a trace ID along to the worker, and then you can also see the execution of generating the actual thumbnail a bit after the actual request. I've cut it off here; it's even more detailed.

But just to give you an idea: this really helps you understand how your application is performing, seeing maybe how the queues flow, or which database requests are slow, because you not only see that a database request took ten milliseconds, you see the actual query that was performed and can do statistics on them: which queries are the slowest ones, maybe there's a problem, maybe in Shopware, maybe in an extension, and you will figure that out.
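Here is a minimal sketch, using the OpenTelemetry Go API, of how the trace context can travel from the upload request to the queue worker; the Message type and handler names are assumptions, the queue integration is left out, and it presumes a tracer provider and propagator are already configured:

```go
// Sketch: inject the trace context into a queue message on the web side and
// extract it again in the worker, so the thumbnail span joins the same trace.
package tracingdemo

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

type Message struct {
	Body    []byte
	Headers map[string]string
}

// enqueueThumbnailJob runs inside the HTTP request: the current trace context
// is injected into the message headers before it goes onto the queue.
func enqueueThumbnailJob(ctx context.Context, body []byte) Message {
	msg := Message{Body: body, Headers: map[string]string{}}
	otel.GetTextMapPropagator().Inject(ctx, propagation.MapCarrier(msg.Headers))
	return msg
}

// handleThumbnailJob runs in the worker: it restores the trace context, so the
// thumbnail-generation span shows up under the original upload request.
func handleThumbnailJob(msg Message) {
	ctx := otel.GetTextMapPropagator().Extract(context.Background(),
		propagation.MapCarrier(msg.Headers))
	ctx, span := otel.Tracer("worker").Start(ctx, "generate-thumbnail")
	defer span.End()
	_ = ctx // ... generate the actual thumbnail here ...
}
```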

And if you have all these things, you can... sometimes it doesn't like me, sorry. You can do alerting. Ideally, you recognize the smoke before the fire gets going. You do that by taking your metrics (it also works with traces and logs, but usually you take your metrics), setting up a monitor, and then setting up a condition to alert you via phone, via Slack, via email, whatever you like. To give you an example, I think the first thing people like to do is look at CPU.

I don't know why, but it seems to be a simple metric, and you could say, well, if it is above 70%, give me an alert. The problem is that it's not really a good metric, because it does not show how your shop performs. It could be that you are at 70% and everything is fine.

It could be that you are at 10% and your shop is not responding at all. So it's not the best metric in the world, and I wanted to give you some ideas of metrics that actually work.

Um, first up: I think most of you have experienced running out of disk space at some point, and there's a very easy fix for that. You can add a monitor with a forecast and say, well, if it will run out in the next 12 hours, give me an alert. It looks at the rate at which free space is declining, so you have enough time to act, because it usually happens over a longer period: logs are written, not cleaned up, and eventually the disk is full. So that's a good one to do.
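A sketch of what such a forecast boils down to, a simple linear extrapolation; monitoring products do this for you, and the function here is only an illustration:

```go
// Sketch: extrapolate "hours until the disk is full" from two measurements.
package monitor

import "time"

// HoursUntilFull extrapolates linearly from two readings of free bytes taken
// `interval` apart. It returns a negative value when usage is not growing.
func HoursUntilFull(freeThen, freeNow uint64, interval time.Duration) float64 {
	consumed := float64(freeThen) - float64(freeNow) // bytes eaten in interval
	if consumed <= 0 {
		return -1 // not filling up
	}
	ratePerHour := consumed / interval.Hours()
	return float64(freeNow) / ratePerHour
}

// Alert when HoursUntilFull(...) drops below, say, 12.
```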

Then you can check the uptime. For example, if you have containers, you have multiple of them, and it could be that you deploy a new version that starts up, works fine for a while and then crashes. Ideally the uptime curve should rise and rise until the point you redeploy. If you see drops in it, and it's not just new containers from autoscaling, if you see the same pattern over and over again, then probably something is off.

You can watch out for this as well. One thing we noticed: if you have a more complex setup with load balancing, a CDN and so on, you can check the delta of the requests coming in, for example at the CDN, then at your reverse proxy, and then at Shopware. Of course the numbers should ideally get smaller and smaller, because the CDN caches and so on. But if that ratio gets really out of balance, you probably have a problem, because traffic is being lost somewhere, and if there's no good explanation for it, then maybe you misconfigured a firewall or set up your load balancing wrong. Something like that.
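A sketch of what such a delta check could look like; the thresholds are invented and you would tune them to your own traffic:

```go
// Sketch: compare request counts per layer for the same time window and flag
// it when the ratios between neighbouring layers drift far from the usual.
package monitor

// DeltaLooksWrong expects request counts measured at the CDN, at the reverse
// proxy behind it, and at Shopware itself.
func DeltaLooksWrong(cdn, proxy, app float64) bool {
	if cdn == 0 || proxy == 0 {
		return cdn != proxy || proxy != app // traffic vanished somewhere
	}
	proxyRatio := proxy / cdn // share of requests the CDN did not answer itself
	appRatio := app / proxy   // share of requests the full page cache missed
	// A layer seeing more traffic than the one in front of it, or almost none
	// of it, deserves a look.
	return proxyRatio > 1.05 || proxyRatio < 0.05 ||
		appRatio > 1.05 || appRatio < 0.01
}
```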

And last but not least: if you have a well-running online shop, with a lot of customers visiting and a lot of bots visiting, you will notice that traffic, at least for us, is never at zero; at no point in time is our SaaS infrastructure at zero. There's always traffic, no matter the time of day. So it makes sense to have a monitor that looks at a baseline of the traffic you expect.

If you are below that, then probably something is wrong again and you have to take a look, so you can watch out for that as well. Of course, there is also the option of monitors that learn by themselves and look at the traffic of the day before. It can be a bit tricky, because in the end it's a machine and it will not be able to detect every pattern.

Maybe it gets better with AI. Um, yeah. So that's it from my side so far.

Um, I'm not sure if I'm allowed, but in theory we could do a Q&A. What are you using for monitoring? We decided to go with Datadog, simply because it was an easy solution to start with. But at the moment we are switching everything over to the OpenTelemetry standard.

So in the future we will also have the ability to use basically any provider out there, or even host our own solution. And for alerting we use PagerDuty; it's connected to Datadog. There was another question.

Yeah. Um, a more commercial question: how much merchant interest is there in using the cloud solution? Sorry, can you repeat that? How much merchant interest is there in using the cloud solution?...

Merchant interest. Um, it's hard to say. The majority of our customers are still self-hosted or working with an agency, but we see customers joining the SaaS system over time.

Are you using Docker environments for the scaling? Yeah, yeah. Every one of our app servers runs in Docker; it's a cluster of app servers.

We do autoscaling by adding or removing Docker containers from the fleet. We've also experimented with serverless approaches, like running Shopware on Lambda, for example. So yeah, we are looking at all possibilities there. Okay, so are you using ECS, EKS or other Amazon services for it, or do you have your own method? We are using AWS ECS with Fargate.

It's kind of their standard solution for Docker containers, I think. Did you encounter big issues with Fargate, like with storage handling and things like that? What do you mean by storage handling? Like bind mounts, sharing space so plugins can be shared between containers? Okay, maybe we are a bit special there, because our containers do not have network storage or anything like that attached to them. We build a version with all the plugins in it, and it's baked into that version forever, so it will not change.

This also allows us to use read-only file systems and so on for security. And if we want to update a plugin, we simply release a new version in the SaaS environment and reroute the traffic to it, so we don't have these kinds of problems. The storage is directly attached to the container via AWS EBS, and I think there's a caching layer in between. Yeah. Thank you.

I'm just going to answer. All right. I think we are perfectly on time.

The timer is now at zero. So thank you, everyone, for listening, and feel free to hit me up during the day; I will be here all day. Let's talk about infrastructure, SaaS, cloud, whatever you like.
