Serving global customers with serverless and distributed technologies - Yacine Hmito of Fabriq


So, hello and good evening. Sorry for the little mess. Alright, let's go! Today I'm going to talk to you about performance all over the world, thanks to serverless, regionless technologies.

So what does that mean? Earlier today, just before coming to Algolia, I left my hotel room and I saw this on the carpet. It's a bit strange, it says "One World", but it's a good image to illustrate tonight's theme: what do we do when we have to serve a SaaS to people all over the world, and treat that as a single world? Because very often, when you go to a cloud provider, you stay in one region. It can be AWS, Azure, GCP, whatever, it's very localized. Except that we live in a globalized world.

Our customers are everywhere, so the world has to be treated as a unit. So I'll show you how we've done that with fairly recent technologies, with some of the advantages and the disadvantages. First, a little context: what is fabriq? fabriq is a SaaS, Software-as-a-Service, B2B, for businesses and companies in industry. I won't waste too much of your time on the whole domain behind fabriq, but basically it's used directly in factories. If you'd like to picture what it looks like, it's a kind of mix between Trello, which lets you manage cards and tickets; Typeform, because you have things that look like custom forms; Calendar, because those forms have dates attached; and Power BI, because we show pretty curves everywhere and we serve data.

The idea is just to say it's a productivity SaaS like plenty you know, I'm sure. There are no very specific fintech integrations or anything like that, no very data-intensive stuff. It's a SaaS. So fabriq is a SaaS, and here is the stack it runs on.

I call it legacy, but to be completely accurate I should call it current. And calling it legacy isn't very polite either, because this stack serves us very well. This is the very simplified version. So what is it? It's a single-page application in Vue.js. Not much to say about that part; it suits us very well.

What sits behind it when there's an API call? The call passes through Cloudflare, which serves as our firewall. Basically, we keep it simple. Then it goes to AWS. In AWS, we have virtual machines running Django, and therefore Python. It's good old MVC, something we all know very well.

When we have attachments, we put them in S3 buckets, and when we have transactional data, we put it in an SQL database. And when we need search, we use Algolia, because it's fast and it's easy. A lot of SaaS looks like this, which shouldn't surprise you.

Except that when I arrived at fabriq, I realized there were business issues associated with this stack, and that it wasn't necessarily going to be permanent, not necessarily the most relevant for the future of the company. So we embarked on a big project, and before telling you about it, I'm going to introduce the team, because I don't do this by myself. The idea may have come with my arrival at fabriq, but it is above all executed by the whole team. Here you see the faces of everyone on the team. That's me, you can recognize me by my beard, and I'm Head of Technology at fabriq, so responsible for tech choices.

And then, quickly, I added a slide with the CEO because he's sitting right there. So if you have any questions about fabriq, please ask him. So: a new stack. Why? Because we have some problems with the stack I just showed you.

Cost issues: AWS is a given. DX issues: Django works very well when you're getting started. I don't want to spit on Django, it's great for moving fast, but once you have a SaaS that already exists, new patterns emerge and performance concerns appear. It's very easy at the beginning, and then it catches up with you. Beyond the performance issues I'm going to talk about, it ends up affecting the developer experience: when I show the stack to candidates who might join us, many are not particularly excited about doing Django.

Security is also a big topic. I won't talk too much about it, I don't have time, and there's a great presentation on security right after this one. But instead of going into the details of these problems, I prefer to talk about the ideal stack that would solve them. So what is an ideal stack for us? It's a performant stack. Everyone wants a performant stack,

performant all over the world. Why all over the world? Because of fabriq's customers. fabriq is a small company: 50 employees, and the product team is 15 people. Our customers are mainly French, because those are the ones we manage to sign. But in industry, they have factories all over the world.

And when I say everywhere: we were already on five continents, I believe a year and a half after the company started, in 23 countries. Here you have the map of all the manufacturing sites; that's not common for a company this small, and our stack was too simplistic to deal with it. Next, we'd like the stack not to be expensive, because we don't have very deep pockets.

Nobody likes it when it's expensive. Everyone also wants it to be secure, and with as few ops as possible. Why? Again, ops equals money. We know how to build stacks that serve everyone, everywhere: you hire a few DevOps specialists, they build you a distributed architecture on AWS, and it costs you a hell of a lot of money and is complicated to maintain.

Our goal is to be profitable, so that's not really an option. So what do we do? We started looking at serverless, which we thought could be a perfect candidate for the problems we were facing. I'll explain serverless very quickly, because if you're here you likely know what it is: it's the idea of technologies that abstract away server operations, so the end user doesn't have to deal with them. The cloud provider manages them. We send our compute, our code, and it's handled automatically.

No monitoring, no scaling, everything happens by itself. It's magic. So, is it a performant stack? Logically yes: serverless is elastic.

There's the question of cold starts that we often hear about. It can pose a problem, but that's not our focus. In our case, we had to distribute all over the world.

But as I told you before, serverless is not distributed automatically; we'd have to set it up in several regions ourselves. Is it cheap? Serverless is pay-as-you-go, so we only pay for what we use. That's not bad, better than what we have today, I'm fairly sure of that; if you go compare providers, it's not nothing, that's for sure. And with as few ops as possible?

Here it gets complicated. Complicated because, effectively, the promise of serverless is that you don't do ops. But when you deploy AWS Lambda in 40 regions, a lot of things have to be orchestrated. Ask yourself the question: what happens if I have data that I distribute to one place and then to another? It gets very complicated very quickly. So ops come back to the forefront, and for us that's a problem. And a good DX? In serverless, it depends: there are frameworks that might help, and perhaps it's getting better, but it's all still quite recent. So there you have it: serverless looks like a good candidate, but with some problems.

In fact, there's an overlay on serverless, let's call it regionless. I don't know where I got this term, but it speaks to me. Why? Because with serverless, the provider manages the servers and you don't see the servers. With regionless, the provider manages geography and you don't see a region. You go to a cloud provider with regionless technologies, you hand over your code, and it manages it and pushes it to the right place. That's the central idea.

And so, for serverless regionless, we use a whole set of technologies. The first one, for compute, the part that replaces Django on virtual machines, is Cloudflare Workers. So what is it? Cloudflare Workers is a JavaScript runtime made by Cloudflare. And unlike others, the isolation technology it uses is V8 isolates. What is a V8 isolate? Unlike a container that needs to start processes and so on,

a V8 isolate is like a Chrome tab. You see how fast it is to open a Chrome tab? When you start a new Worker and let Cloudflare take care of it, it's that fast. That's why there are no cold starts on Cloudflare Workers: by the time the TLS handshake is done, the Worker is already started. And there are 275 points of presence for Cloudflare Workers. So the question of "everywhere in the world" is settled for us.
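To make this concrete, here is a minimal Worker in TypeScript. This is a sketch of what running code on Workers looks like, not fabriq's actual code; the route and the JSON body are invented for the example.

```typescript
// Minimal Cloudflare Worker (module syntax). Each request is served by a
// V8 isolate at the nearest point of presence, so there is no cold start.
// The /api/health route and its JSON body are invented for this example.
const worker = {
  fetch(request: Request): Response {
    const url = new URL(request.url);
    if (url.pathname === "/api/health") {
      return new Response(JSON.stringify({ ok: true }), {
        headers: { "content-type": "application/json" },
      });
    }
    return new Response("Not found", { status: 404 });
  },
};

export default worker;
```

Deploying it is a single `wrangler deploy`; Cloudflare then pushes it to all its points of presence without any region configuration on your side.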

And then it's regionless, in the sense that I code my Cloudflare Worker and it gets put wherever it's needed. On top of that, it's not expensive. One small caveat: what's expensive is Cloudflare itself, but Cloudflare Workers is not expensive. So that's great! But I'm selling you something here: is it really fast, since the goal is for it to be fast anywhere in the world? So we did a little benchmark. I warn you, it's not scientific at all: I opened VMs in a few places and made a few calls. But it gives an order of magnitude and lets us validate whether there's any point. From a VM in Paris, hitting our current Django stack, we get a response in about 110 milliseconds.

But from California, with our servers hosted in France, it takes 600 milliseconds. Sometimes it goes down to 250, but it's very, very variable. There are plenty of variables: are we on the right network, and so on. And those are precisely things that neither we nor the user control. That's problematic.
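For what it's worth, the kind of ad-hoc measurement described here can be sketched like this. It is my reconstruction of the idea, not the script used for the talk, and the endpoint in the usage comment is a placeholder.

```typescript
// Time an async call several times and report the median round trip,
// which is less sensitive to outliers than the mean.
async function medianLatencyMs(
  fn: () => Promise<unknown>,
  runs = 5,
): Promise<number> {
  const samples: number[] = [];
  for (let i = 0; i < runs; i++) {
    const start = performance.now();
    await fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  return samples[Math.floor(samples.length / 2)];
}

// Example (placeholder URL): run this from VMs in Paris and California
// and compare the numbers.
// const ms = await medianLatencyMs(() => fetch("https://example.com/api/ping"));
```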

We have a customer with factories in Taiwan. Someone working at group level for that customer, based in France, moved to Taiwan, and he contacts us and tells us the application is really slow in Taiwan. The people in Taiwan don't even notice: for them, that's just how fabriq is, and I can tell you, in industry they are not very demanding about their tools.

But now, at a given point, we have someone who can feel the difference. The round trip to France is not free; the data shows it. So, on the new stack: calling from Paris, our reference point, we're roughly in the same order of magnitude.

Not bad. And then California, it's crazy, we're at 120 milliseconds. So we're very happy. There is a small difference, though: 20 milliseconds more. Why? I don't know, but it's five times faster, so it's cool. So: super technology, regionless, not expensive. But there's one big question: does it please the developers? Can we develop on it easily? In conclusion, there are pros and cons.

It's super lean: you put your code there and it's super cool. There are good adjacent services: you have R2 to store files, you have the KV store for caches, you have the entire suite of Cloudflare tools. That's great. On the other hand, it's a specific runtime, so it's not Node-compatible. That's a pain point, because there's no fundamental reason it couldn't be.

The consequence is that there's no ecosystem: you take an npm package and you're not sure it will work. And deployments aren't great, version management and all that isn't the coolest thing, but overall, we're still very satisfied.

So that's it for compute with Cloudflare. It's cool. And for local development, because Cloudflare Workers is something that runs online, they have tools to replicate some things locally; they open-sourced the runtime not long ago, but it's not very practical. So we decided to use Deno directly. I won't spend too much time on it, but Deno is a runtime that's different from Node: its APIs stay very close to the browser's, like Cloudflare Workers, so the distance between the two is short. And Deno gives you, out of the box, a formatter, a test runner, direct TypeScript support. It's a nascent ecosystem.

For us it's a better bet than trying to rely on Node. So that's it for the compute part. Great, but actually, compute is easy: running some code somewhere is not the big problem.

The big problem is data. How do I make sure the data is available all over the world? The big problem is distribution. Here, let me quote a client, two people working at the same company, among fabriq's first customers. They tell us: "The adoption of fabriq was dazzling. In one year, starting four months after COVID, we deployed fabriq in 22 factories in five countries." So a client signs, and immediately they put fabriq everywhere.

And they say: "Cross-site information sharing on safety in fabriq is very useful to communicate quickly on safety issues." Why do I share this? Because very often, when you want to distribute data, you say to yourself: I have clients in Brazil, clients in the United States, clients in Japan, so I'll make a database in Brazil for the Brazilian customers, an American database for the US customers, and so on. Except that at fabriq, a given client wants to share data, and read and write across all their sites, all over the world.

So you immediately have problems with data consistency, and it's not easy. I'll show you the map anyway, as it's telling. Keeping data consistent across this world map is no piece of cake, and that's why we use Fauna. So what is Fauna? It's a serverless database: you call it by API and you have nothing to manage, no instances, nothing. Lucas introduced it to you earlier; its model is both relational and document-based.

So if you're used to MongoDB, it's easy, and if you come from SQL, you'll find some familiarity. There's a database hierarchy system: you can create databases inside databases. In our case, we use that in particular to isolate data between customers.

It's very useful and easy to use. And there's a new consensus algorithm called Calvin. That's a technical term; I'll come back to it right after.

And finally, there are private region groups. What's the idea? Say I have a client with a specific geography; for example, they operate in the United States and in France. I create a region group just for them, and I replicate the data in the United States and in France. On the other hand, if the customer is only in France, I keep the data in Europe.
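The routing decision described here can be sketched as a small function. The region-group labels and the country logic below are invented for illustration; they are not Fauna's actual identifiers.

```typescript
// Decide which (hypothetical) region group a customer's data lives in,
// based on where their factories are. Labels are illustrative only.
type RegionGroup = "EU" | "US" | "US+EU" | "GLOBAL";

const EU_COUNTRIES = ["FR", "DE", "ES", "IT"]; // deliberately incomplete

function pickRegionGroup(siteCountries: string[]): RegionGroup {
  const inUS = siteCountries.includes("US");
  const inEU = siteCountries.some((c) => EU_COUNTRIES.includes(c));
  const elsewhere = siteCountries.some(
    (c) => c !== "US" && !EU_COUNTRIES.includes(c),
  );
  if (elsewhere) return "GLOBAL"; // factories all over the world
  if (inUS && inEU) return "US+EU"; // replicate on both sides of the Atlantic
  if (inUS) return "US";
  return "EU"; // only European sites: keep the data in Europe
}
```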

And if they're all over the world, multi-region, completely global, same story. So, this consensus algorithm: why am I talking about it? Because we have to talk about a bogeyman called the CAP theorem; some of you may have already heard of it. I'll make it very short.

The CAP theorem says that, theoretically, you can't have a consistent view of your data while replicating it everywhere and also ensuring the database is always up; you have to give something up. Either you say: from time to time, my database doesn't work. Or: from time to time, part of the network can't reach the data or doesn't have it. Or: from time to time, the customer in France, sorry, excuse me, the user of a given customer in France will see something different from their colleague in the United States.

That's not really good. But it's a theoretical theorem. So Calvin, what is it?

It's an algorithm that tries to cheat. It can't beat the CAP theorem from a theoretical point of view; it's a CP system. But on availability, it uses a lot of techniques to ensure that, in practice, the database is available. And that's something no product other than Fauna does in the serverless database ecosystem. You have some that try to solve the same problem with older technology, let's say Yugabyte-style, or Cockroach: it's normal SQL that they then try to distribute over existing data, and it's complicated. Or you have AWS, and they're honest, let's say: they tell you that theoretically it's not possible, so we don't do it, it won't be consistent, and it's up to you to manage that complexity in your application layer. I find that brave, and honestly, that option attracted me the most, until the day I found Fauna.

Now here is the test: is it really fast? Earlier was a pure latency check; here I do a latency check that also hits the database. On the legacy side, just reading a row in a table, the simplest thing possible. And on Fauna, reading something from a collection.

On the old stack it takes 150 milliseconds from Paris, and from California it's about 700 milliseconds. That's too much. I also remind you that it's a SPA, a single-page application.

So there are API calls, plenty of them. With the new stack, Fauna behind the Cloudflare Workers, I remind you, we're at about 140 ms from Paris, with a region group that replicates to both the United States and Europe. And from California we're at about 350 ms. So it's twice as fast.

There's still a small difference; I need to talk to Fauna about it. I don't know why it's slower in the US. I have my own little idea: I believe that in the United States they replicate both East Coast and West Coast. But I'll investigate with them. In any case, bet won: it's much faster. And what about the developer experience? It's API-first, and it really is easy to use.

There's a great web console where you create databases; it's painless. They have their own query language called FQL, which is a point of caution: the language is great, but it's not something the ecosystem knows. It's not SQL, and it's not as approachable as something like Firebase, so there's a learning curve. And the problem is that it's too recent for a big ecosystem to exist around it.

So there are few tools: no tools for schema management, and for migration management there are existing tools but not necessarily mature ones. So we had to do it all in-house. It wasn't exactly a pleasure, but we're happy with it. And finally, it's a normal database in the sense that it doesn't support search; if you really want a great search experience, it's not enough. Oh, I forgot to say: a new version of FQL should solve part of these problems. For search, we use Algolia. We already used Algolia, and it's already serverless.

So that's very cool. Search in fabriq looks like this: you type something, it finds all the entities, you have filters on the right, and so on. All of that is managed by Algolia. I'll go over it quickly. What is Search-as-a-Service? You have a SaaS, you push your data into it, and you get an API search endpoint. It's quite simple.

Performance is excellent. You can partition the data by facet, by index, by app; there are many ways to manage customer data and ensure it isn't mixed. And finally, the interesting part: how do we make it perform well all over the world? They have something called DSNs.

The idea of a DSN is that you have an index, and it gets copied to all the datacenters you're interested in. It's a model that looks a bit like Fauna's region groups: when you create an app, you say here are the DSNs in which I want my index deployed. So we can have one app per customer and say: this client is in such-and-such countries, I want them in such-and-such datacenters, and the app takes care of it. So that's great. As for Algolia's DX: the SDK is great, it works very well, the docs are excellent, and there are frontend components if you don't want to talk to the API directly.
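To give an idea of the shape of a call, here is a sketch that builds (but does not send) a search request against a per-customer app. The host and header names follow Algolia's documented REST conventions; the app ID, API key, and index name are invented for the example.

```typescript
// Build (but don't send) an Algolia search request.
// Host and header names follow Algolia's documented REST API;
// the app ID, key, and index name here are invented.
interface SearchRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildSearchRequest(
  appId: string,
  apiKey: string,
  indexName: string,
  query: string,
): SearchRequest {
  return {
    // The "-dsn" host: Algolia routes the query to the nearest replica.
    url: `https://${appId}-dsn.algolia.net/1/indexes/${encodeURIComponent(indexName)}/query`,
    init: {
      method: "POST",
      headers: {
        "X-Algolia-Application-Id": appId,
        "X-Algolia-API-Key": apiKey,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ params: `query=${encodeURIComponent(query)}` }),
    },
  };
}
```

Sending it would then be a plain `fetch(req.url, req.init)`; in practice you would use Algolia's official SDK, which wraps exactly this kind of call.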

The downside of Algolia's DX is that there's no local development, whereas with the other technologies you can work in an equivalent environment locally. So, to sum up, our serverless regionless stack looks like this. Our SPA goes through Cloudflare; there's a lot of stuff in there: the firewall, the Workers, the domain names, and we use R2 for the buckets. I won't dwell on that because it's not very interesting.

Then, for the database, we talked about Fauna, and Algolia for search. So this is our new stack. Where are we today? We deployed our first service on this stack in August 2021, so it's been a year now, and it's used by all customers. Today we have three services on the new stack, for all customers. And since December 2022, we've been taking services that existed on the legacy stack, pieces of the Django monolith, and rewriting them on the new stack.

This goes hand in hand with a new architecture; I won't say much about it, but the idea is to better isolate data between customers. We already have three clients on the new architecture with the new stack, and over the months and years to come we'll migrate more and more, until our architecture diagram looks like this, and not like a mix of legacy and new. So: we'll gradually migrate the architecture for all customers, migrate features from the monolith to the new stack, and any new feature that requires a new service will be built directly on the new stack.

I hope to come back in a few months or years with a more complete report. For now we've just dipped a toe in; a completely serverless, regionless fabriq will look like this. Well, thank you for listening. As a reminder, I'm Yacine, Head of Technology at fabriq, and you can follow me on Twitter.

Thank you very much, Yacine, that was great! We have about ten minutes of questions for Yacine, so... are there any questions from the audience? Raise your hand and I'll pass you the microphone. Thank you very much for the talk, Yacine. At the very end you talked about local development.

Yes. I know it's a debated topic in the serverless space: should the development environment be local, going through emulation, or remote, using a copy of the remote service? On your new stack, are you doing full emulation with the various providers? How do you work? It's absolutely necessary to test against the real thing before going to production, that's for sure. But yes, we emulate. Why do we emulate? Mainly for a tighter feedback loop. The network round trip isn't too problematic when you're developing against the app itself, but when you're running a battery of tests, it's less fun. And I remember, two years ago Cloudflare Workers said: we will never do local development, because you must test against the real thing.

Their argument was that what matters is for latency to be as low as possible. Now they've completely backtracked, because they realized that even if you reduce latency, for day-to-day work it doesn't cut it. So it depends on the workflow: some people get along with that very well. I understand it, but we couldn't.

I have a follow-up question on Cloudflare Workers. You said it's not Node directly, it's a bit of an offshoot. Yes, it's their own runtime, completely. So when you run your battery of tests, you don't run them locally at all? How does it work? Are you obliged to use tools specific to this runtime to write your tests? The way a typical team would do it: they'd use Wrangler, the Cloudflare Workers CLI, and a project called Miniflare, which lets you simulate Cloudflare Workers on Node. But we use Deno, with something called Denoflare, and for our tests we actually just run them on Deno, quite simply. We still run tests on Cloudflare Workers itself afterwards, and sometimes we get caught out by differences between Deno and Cloudflare Workers, but everything local is done on Deno, because the experience is great.
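As an illustration of that workflow, here is what a test under Deno's built-in runner can look like. The function under test is invented, and the guard at the bottom lets the same snippet also run outside Deno.

```typescript
declare const Deno: any; // available when running under Deno

// Invented helper standing in for Worker business logic under test.
function slugify(title: string): string {
  return title.toLowerCase().trim().replace(/\s+/g, "-");
}

function checkSlugify(): void {
  if (slugify("Safety Issue") !== "safety-issue") {
    throw new Error("slugify failed");
  }
}

if (typeof Deno !== "undefined" && typeof Deno.test === "function") {
  // Under Deno: register with the built-in test runner (`deno test`).
  Deno.test("slugify", checkSlugify);
} else {
  // Elsewhere (e.g. Node): just run the assertion inline.
  checkSlugify();
}
```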

OK, thanks. Any other questions? It's true that it's early, but I'd like to know: what about the service level agreement of that whole architecture? Do you rely on Fauna's replication, or do you handle it yourself with your database? Service level agreement, meaning response times or availability? Availability. Availability, then: availability is guaranteed.

Cloudflare Workers, with 275 points of presence, in general there's no problem: when one point of presence has an issue, they route around it to another. That doesn't mean it's perfect, but in general they are very, very good at that. Fauna has replication within each region group: inside a region group there are regions, and within those they replicate again.

So with that they have good availability. And Algolia, it's per their terms; it's also very good, so we have no worries about that.

On the legacy side, likewise: AWS is very good, it replicates across three availability zones, availability is great. Where we have more questions is the SLA on data recovery. For everything backup-related: on Fauna we have daily snapshots; on AWS we're on Aurora, so we have daily snapshots, and on top of that they have point-in-time recovery, which is practically science fiction: you can restore to whatever moment you want. That's great, and we lose that.

Other questions? Do you have any issues provisioning your infrastructure across all these different services and frameworks, working with each provider, since they don't necessarily have the same provisioning solutions? Great question. We provision almost nothing, since it's all serverless, but there's still quite a bit to set up: databases to create, Workers to declare.

Today, the Terraform integrations are not at all satisfactory. And Terraform has its own problems: when you move to serverless, connections between services generally mean API keys being exchanged everywhere, and with a fairly standard setup you end up with your access keys in your state file. So basically you get a big file, usually not well encrypted, sitting around somewhere, containing the keys to the kingdom. Not great. So today we do it by hand, which isn't great either. Fauna told us they have an integration with the Serverless Framework that can provision things automatically, so we'll take a look.

But yeah, I think it's still not very mature on this front. A side question, too: in terms of pricing on Fauna and Cloudflare, how does it go relative to the number of employees? I know there are a lot of SaaS that start out free, but once your team gets larger it starts to get very expensive.

You're not very big yet, but how does it go? It doesn't scale with the number of users. SaaS that punishes you for growing your company annoys me too. Cloudflare Workers is priced, how to say, per call. Actually, it depends: Workers can be priced per call or on execution time. And for Fauna, they have read ops and write ops,

a model quite similar to Dynamo. In the end, it's usage-based. Thanks. I have a question regarding FQL. FQL, is that it? Yes, FQL.

You said you developed your own ORM internally, I think, ultimately. Are you planning to open-source it, or are you in contact with Fauna about how to develop that kind of thing? You said it was ultimately quite painful as an experience. Yes, Fauna's tooling today is not yet mature. That said, we didn't develop our own ORM, but since we're on the subject: ORMs and FQL don't work together at all. Among the well-known ORMs, there are no Fauna backends being developed.

It's very complicated, because ORMs usually presuppose SQL underneath, so it's a tight fit. In our architecture, we chose instead to have repositories to which we pass business objects, and inside the repository we write our FQL directly. As for our tooling,

what we did instead is build a whole system to manage migrations: migrations that create new collections, new indexes, things like that. As most of you know, in backend development you write migration files, and on deployment the ones that haven't run yet get executed. We had to redevelop that on Fauna. There are already tools that do this on Fauna, but they didn't suit us for architectural reasons, including isolation between customers. Last question? No? Then thank you very much, Yacine. Great, thank you, and see you later.
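As an addendum to that last answer: the in-house migration pattern described can be sketched in the abstract like this. This is a reconstruction of the general pattern, not fabriq's code; the store interface stands in for Fauna, and the in-memory version exists only so the sketch is self-contained.

```typescript
// Generic "run pending migrations" pattern, as described in the talk.
// `MigrationStore` abstracts the database (Fauna, in fabriq's case);
// the in-memory version below is only for illustration.
interface Migration {
  name: string;
  up(): Promise<void>; // e.g. create a collection or an index via FQL
}

interface MigrationStore {
  applied(): Promise<Set<string>>;
  markApplied(name: string): Promise<void>;
}

async function runPending(
  store: MigrationStore,
  migrations: Migration[],
): Promise<string[]> {
  const done = await store.applied();
  const ran: string[] = [];
  for (const m of migrations) {
    if (done.has(m.name)) continue; // skip already-applied migrations
    await m.up();
    await store.markApplied(m.name);
    ran.push(m.name);
  }
  return ran;
}

// In-memory store, standing in for a "migrations" collection in Fauna.
class MemoryStore implements MigrationStore {
  private names = new Set<string>();
  async applied(): Promise<Set<string>> {
    return new Set(this.names);
  }
  async markApplied(name: string): Promise<void> {
    this.names.add(name);
  }
}
```

In a per-customer isolation model like the one described, the same list of migrations would simply be run against each customer's child database.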

2023-02-21
