How we Live Migrated Millions of BBM Users & its Infrastructure Across the Pacific (Cloud Next '18)


Hello, good morning ladies and gentlemen. My name is Mohan Krishnan, and I'm the CTO at BBM. Today I want to share with you the story of how we migrated BBM's users, and what was originally known as BlackBerry Messenger, from our Canadian on-prem data center to Google Cloud.

Before starting, though, I'd like to ask you a couple of questions. First: how many of you are considering doing a migration from on-prem to the cloud in maybe the next six to nine months? Show of hands. OK, great, so hopefully this content is relevant for you. Next question: how many of you know what BBM is? OK, keep your hands up. How many of you know that BBM is today available on Android and iOS and continuously being improved? Great, so you know our story. That's awesome, thank you very much. Let's go.

Let's start with a bit of background, because I get lots of questions about what's going on with BBM; this is just to set things up. I work for a company called Emtek. Emtek is Indonesia's largest media conglomerate: we own TV stations, both pay and free-to-air national TV stations, we own production houses, and we've been in the media business for about five to ten years. In about the last two to three years we've also been going online, primarily taking our content from traditional media to online. So today we run Indonesia's largest online news portals, a bit like Condé Nast but for the Internet, and we also run Indonesia's second-largest video platform; we're second only to YouTube. Our content and our services touch over 50 to 60 million monthly active users in Indonesia.

As part of the strategy of growing that side of the business, we decided that besides just being a content producer, we also wanted to be a platform, a platform for people to come and consume our content. To that goal, we found that there is still a large number of BBM users in Indonesia compared to the rest of the world, and we engaged with BlackBerry to license BBM: we acquired a license from BlackBerry for the BBM consumer business. So since 2016, when the agreement was signed, we build, develop, and distribute BBM for the consumer market. A key part of achieving our goals was migrating BBM, and I'll get into the specifics of that in a bit, but first I'd like to share how large this project was from a human standpoint.

It spanned about two years; we officially finished only in June 2018. It involved anywhere between 40 and 60 engineers, spread across twelve hours of time zones and three different countries: Canada, Singapore, and Jakarta.

The next thing I want to share before we get into the specifics is that this is not your father's BBM anymore. Many people still think that BBM is only available on BlackBerry OS and BB10 devices; this is BBM for Android and iOS, modern operating systems. Apart from that, this is also a BBM that is on par with your modern Facebook Messengers and WhatsApps. That includes things like video and voice calling, large groups, and availability on multiple platforms including desktop. More importantly, what we've been focused on, and why we made this transition, is that we are building features for our market. How many of you are familiar with WeChat? OK, so you know that story: it's huge in China, and the reason it's huge is that it's a messenger focused on that market. That's exactly what we're doing; we're trying to make BBM the WeChat of Indonesia. To that goal we've been building features like a wallet: today, if you use BBM in Indonesia, you can send money to each other, you can make transactions, you can buy things like internet top-ups for your phone and pay your phone bills. Apart from that, we've also been integrating BBM into the local phone system. In the next two months we'll be rolling out a feature where you can make and receive PSTN calls via a number that is assigned to your BBM; when you register for BBM in Indonesia you get a local phone number and you can use it. This is extremely relevant for the market we operate in, where there are still many users in rural areas with very poor data connectivity; they can still keep in touch by piggybacking on the PSTN network to contact their friends and family on BBM.

So, in a nutshell, what was the transition? It involved moving the BBM infrastructure footprint from Canada all the way to GCP in the Asian region. And why did we do this? Well, first and foremost, as you probably know, global usage of BBM has reduced, but there is still a sizable user base in Asia, primarily in Indonesia, as well as in Africa and the Middle East. We wanted to bring the BBM services closer to our users; this would improve latency and improve network reliability. That was the first goal. But a second major driving goal was re-platforming BBM on cloud infrastructure that would allow us to build the type of features I was talking about earlier in a fast and effective manner; that was a huge goal. Lastly, it's important to keep in mind that during the entire year and a half of this transition, we still had billions of messages and tens of millions of users on the platform, without any problem. We were migrating them in the background, without them realizing it and without impacting their service, and during those 18 months we continued to push out new and exciting features for the user base.

So what was it that we migrated? I know there are some ex-BlackBerry people in the audience today, so they're probably familiar with it, but BBM was not exactly a simple application to migrate. It consisted of over 20 major components, and each of those components had multiple sub-components. You might look at it and go, "wow, that looks like a microservices architecture"; from our end, I think it was an accidental microservices architecture. It was an amalgamation of different components built over a ten-to-thirteen-year timeline by different teams, using whatever technologies and frameworks were available at the time, and that added a lot of complexity. We had everything from JBoss to different versions of Spring to custom Java frameworks that we had to rationalize and plan how to deploy and run in GCP. Apart from that, it was a sizable infrastructure footprint: over 14,000 VMs and physical machines running in BlackBerry's data centers in Canada, NetApp appliances, Fusion-io and Fibre Channel-based storage, large data-analytics and Hadoop clusters, and an assortment of databases. We had a 600-node Cassandra cluster to migrate, and multiple MySQL and Postgres databases as well. So it was a large piece of infrastructure to move.

The diagram here is mostly for shock factor, so don't look at it too hard, but it's an attempt at mapping out all the different components that made up BBM and their interconnections. As you can see, the summary is that it was not a simple system neatly compartmentalized into logical components.

So how did we do the migration, and what did it involve? There were three major parts, or tasks, that we had to go through. First was the preparation, and I'll talk about that in a bit. Second, we mapped out the different services that were running and how they would end up running in GCP. And last was the actual traffic cutover, and I'll go through these one by one.

What we were trying to answer in preparation was: what is there to move, and where do we move it to? The reason we had the question of what there is to move is that the team that was going to do this migration, for the most part, did not build and did not operate this service; they didn't have any idea what was running. So we had to go through a process of inventorying the service and understanding what was there. From an inventory standpoint there were two major aspects. The first was simply mapping out the individual services:
where is the source code for the service, how do you build it, what are the Jenkins pipelines, what are the dependencies, what are the configurations we need to pull in? Each component had anywhere between four and five hundred different configuration values. Many times there were configurations nobody understood; it had just been that way for years, and we had to do a lot of so-called archaeology to figure things out and put the pieces together.

The second part of the inventory was from an operations standpoint: how did these services run in their current on-prem infrastructure? What type of servers were required to run them? What metrics and alerting were used to make sure the services were healthy? What log output did they produce, and which parts of the logs were relevant operationally? And lastly, even bandwidth utilization, because we were moving to a model where egress bandwidth was going to be metered, as opposed to a typical on-prem setup where you buy the bulk or peak bandwidth throughput that you need. So we had to map that out and figure it out.
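To give a feel for why the bandwidth piece mattered, here's a rough back-of-the-envelope with made-up numbers (the sustained throughput and per-GB price below are assumptions for illustration, not our actual figures or any provider's quote):

```python
# Back-of-the-envelope metered-egress estimate with assumed numbers; the point is that
# sustained throughput, not peak capacity, is what drives cost once egress is metered.
avg_egress_gbps = 2.0                                   # assumed sustained egress, in gigabits/s
seconds_per_month = 30 * 24 * 3600
gb_per_month = avg_egress_gbps / 8 * seconds_per_month  # gigabytes transferred per month
price_per_gb_usd = 0.12                                 # illustrative price, not a real quote
monthly_cost_usd = gb_per_month * price_per_gb_usd
print(f"{gb_per_month:,.0f} GB/month -> ~${monthly_cost_usd:,.0f}/month")
```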

What worked really well for us, from the standpoint of figuring out how each service ran, is that we documented everything in code. We didn't just write it all down; we actually had our developers write Ansible playbooks to set everything up, and they made sure each service at least stood up inside a Vagrant box. That gave us the assurance that this information was really documented. From an operations standpoint, what we did was create giant spreadsheets, as you can see here (again, don't look at it too hard), that essentially mapped out all the different aspects of each service and how it ran in production.

The next step was actually deciding where we were going to go. It wasn't a slam dunk that we were moving to GCP; we actually ran PoCs with three different cloud vendors. I'm sure Tim and some of the folks from Google in the crowd still remember this: we ran two-to-four-week PoCs on-site at BlackBerry, where we took one or two key services, put them into the cloud, and then load-tested them there to see how they worked. This gave us two things: first, it helped us further understand these individual services, because we were still figuring them out; and second, it gave us a lot more clarity as to how these services would run within the individual cloud providers. In the end we did an evaluation against three major criteria: the technical capabilities of the cloud provider, its partner support, and cost. It goes without saying, since I'm here at Next, that GCP was the winner, and we've been happy with the decision since; it's really worked out for us. From a partner standpoint as well, we worked with some really great partners like Pythian and CloudCover, who are also in the room here today, and that really got us over the line. The table on the right shows, in condensed form, the different aspects of the comparison; there were a lot more details, and I can get into specifics if you want. But it was a long process: by the time we were done with this phase of inventorying the system and picking the cloud vendor, we were four months into the project already.

So now we knew what we wanted to migrate and where we were going to migrate it to; the next step was to figure out how these components were going to run in the cloud. The first question, and I've seen this question repeated in conversations with other people who have gone through a migration, and I'm sure some of you who are considering a migration have exactly this question, is: do we just lift and shift the services as-is, or do we use this as an opportunity to re-engineer and re-platform onto something new? Intuitively, lifting and shifting makes sense: you minimize risk, you're already making a major change, why change the application at the same time, why increase the complexity?

But the challenge with services that were originally built for on-prem infrastructure, I believe, is that certain operational aspects of those services rely on very specific on-prem characteristics, and trying to replicate those on-prem characteristics in the cloud ends up adding risk to your project. So I don't think it's a done deal; I don't think anybody making a large enough, complex enough migration is ever going to do a pure lift-and-shift or a pure re-platform/re-engineer. It's going to be a spectrum, a hybrid approach, and for us that's exactly what happened.

There were certain characteristics of components that, when we saw them, we realized we'd have to re-engineer. The first was applications that relied on file storage: applications that wrote to disk, either SAN disks or NetApps. We realized early on that there was a great opportunity to move these to Cloud Storage, which was better suited as object storage. Making this change was a lot of work; we practically rewrote these components. However, what we realized in rewriting them is that a lot of the existing complexity within these applications, for example to handle sharding between multiple file endpoints or to handle failures that only happen with file systems, could be removed, and that actually greatly simplified them. Yes, we had to introduce special code to handle the fact that writes now have much higher latency because they go over HTTP, but in the end we felt the trade-off made sense. The other two areas where re-engineering made complete sense and was a no-brainer were our logging and analytics components and data engineering; I'll go into those specifics in a bit.

So if you look at all those different components, almost 20 components with multiple sub-components each, what I'm going to try to do today is digest it all and talk about them layer by layer. Networking applies to all of the components we're talking about, so let's start there.

On the networking side, what we were running on-prem was a lot of F5 load balancers. How many of you have F5 load balancers in your data centers today? OK. As you probably know, those things are great workhorses: extremely reliable and extremely flexible. You're not going to see that same level of flexibility in the cloud, and that was one of the gaps we had to deal with early on. Within our on-prem data center, the F5 LTMs were used for both external and internal load balancing, and they were also used for NAT.

So how did we migrate this existing functionality to GCP? We essentially ended up using every type of load balancer available in GCP: the network load balancer, the TCP proxy load balancer, the HTTP(S) load balancer, and the internal load balancer, and I'll go through why we ended up needing every single one of them in a bit.

The other aspect was NAT. There was no immediate NAT solution available; managed NAT wasn't available in GCP when we started two years ago (or maybe it was just coming out). So we ended up deploying our own NAT solution, and we made sure it was highly available. The good thing today is that if you Google it you'll find several best-practice references on how to set something like this up; it's a solved problem now. But we had to roll our own back then, and back then we weren't entirely sure how to make it HA, so we had to figure that out.
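For flavour, the core of a DIY NAT gateway comes down to an instance created with IP forwarding enabled, a route that points the private instances at it, and a couple of kernel/iptables settings on the gateway itself. The sketch below expresses that last part in Python purely for illustration (in practice it's a startup script, and the interface name is an assumption); today you would reach for managed Cloud NAT instead.

```python
# What a DIY NAT gateway instance effectively does at boot, sketched in Python.
# The instance must be created with IP forwarding enabled, and a default route for the
# private instances must point at it; HA came from running more than one gateway.
import subprocess

def enable_nat(external_iface: str = "eth0") -> None:
    # Let the kernel forward packets that arrive from other instances.
    subprocess.run(["sysctl", "-w", "net.ipv4.ip_forward=1"], check=True)
    # Masquerade outbound traffic so replies come back through this gateway.
    subprocess.run(
        ["iptables", "-t", "nat", "-A", "POSTROUTING",
         "-o", external_iface, "-j", "MASQUERADE"],
        check=True,
    )

if __name__ == "__main__":
    enable_nat()
```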
Other gotchas stuck out when we were doing the networking piece. The first was multicast. Again, if you're on-prem running your own physical networking, you're not going to have problems with multicast; but multicast is not available on any cloud provider, and we had components that relied on it. In particular, some of our Java applications use a grid-caching technology called Infinispan. The good news is that switching Infinispan from multicast to another type of discovery is pretty straightforward; we ended up using TCPPING, so we just had to make that change.

The other thing to be aware of, especially if you have a large footprint, is that there is a maximum cap on the total number of IPs you can have within a network, and this catches big deployments off guard. When we were doing this, the maximum was seven thousand IPs, and that led to a lot of sleepless nights while we tried to work things out. The good news is that by the time we actually went to production, Google had upgraded its network; today it supports up to 15,000 IPs, and our 6,000-odd-instance footprint fit within it.

The other gotchas were around load balancers, and this is why we ended up with multiple kinds of load balancer. The first gotcha: we had a service serving a public endpoint on port 5061. The TCP proxy load balancer supports a specific list of ports, about 12 of them, and if you have existing clients or services that rely on other ports and can't change them, you're dead in the water; you cannot serve from it. That's why we had to use the network load balancer instead, which doesn't have this limitation; so now we have the NLB. Second, we had services with TLS endpoints where the clients were doing mutual TLS, so we couldn't rely on the HTTP(S) load balancer, where TLS is terminated on the load balancer itself. We solved this by using TCP/network load balancing and doing mutual TLS on the instance itself. Then we had a class of applications with bog-standard HTTPS endpoints, and those we could serve through the HTTP(S) load balancer. So these are some of the limitations. You look at what's out there, you look at Kubernetes and all the cool modern stuff, but you don't have that luxury when you're moving existing infrastructure; finding these types of problems early and designing for them is critical.
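As a concrete illustration of the mutual-TLS-on-the-instance pattern, here is a minimal sketch using Python's standard ssl module; the port matches the example above, but the certificate paths are placeholders, and our real services did this inside their Java stacks rather than in Python.

```python
# Minimal sketch: terminate TLS on the instance itself and require a client certificate,
# since the L4 (network/TCP) load balancer just passes the TCP stream through untouched.
import socket
import ssl

context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
context.load_cert_chain(certfile="server.pem", keyfile="server.key")   # placeholder paths
context.load_verify_locations(cafile="client-ca.pem")                  # CA that signs client certs
context.verify_mode = ssl.CERT_REQUIRED        # reject clients without a valid cert (mutual TLS)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", 5061))
sock.listen(5)

with context.wrap_socket(sock, server_side=True) as ssock:
    conn, addr = ssock.accept()                # handshake fails here for unauthenticated clients
    print("authenticated client:", conn.getpeercert().get("subject"))
    conn.close()
```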
On the storage side, the story was a bit simpler. The upshot was that, for the most part, the performance of Google's GCS and PD-SSD disks surpassed what we had running in our data center, where we had NetApp appliances and SAN arrays built off SSDs. PD-SSD and GCS were a good fit and replacement for what was running there.

Now we get to the database and caching side of things. The first database we had to deal with was a large Oracle deployment. As some of you might know if you've dug into this, Oracle isn't sanctioned to run on GCP, and neither did we want to pay any Larry Ellison tax; we think his yachts are large enough. So not only did we have to plan to migrate the database, we had to plan to transform it off Oracle onto Postgres before actually migrating it to GCP. We planned that in and got it done. When it came to some of the MySQL databases, we just targeted Cloud SQL, as Cloud SQL was reasonably mature by that point and none of those databases were super-critical to the production running of our app.

For Cassandra, a lot of time was spent sizing the instances that were going to run the Cassandra nodes, figuring out the CPU and RAM; we ended up using PD-SSD, which turned out to be fast enough for our use case. We also spent a lot of time working out the operational aspects: building playbooks, understanding how we were going to do rolling updates, dealing with downed nodes. All of that had to be set up, and we formed a special team within our group to handle it. They got most of it set up within two to three months, but it was a big investment.

For Redis and memcache, we started out setting these up on our own, doing something similar to Cassandra, but then we found a partner, Aiven, and they are pretty competent; I would strongly recommend them if you're looking at setting up something similar. At that point Memorystore wasn't available; going forward we hope we can look at Memorystore as a replacement for those third-party services.

Net-net, we ended up with about 400 Cassandra nodes and a host of MySQL, Postgres, and memcache instances.

The next portion was monitoring, and this is where it got interesting. What we had originally on-prem was a large CA Wily deployment capturing hundreds or thousands of JMX metrics from each instance. All of this was then graphed in Grafana, and alerting was done through Nagios. When we first looked at that, our natural inclination was: let's move all of this to Stackdriver; that makes sense, it's provided in Google Cloud, why wouldn't we leverage it? It turns out there was a big mismatch, at least two years ago, with getting JMX metrics, especially the vast number of JMX metrics our services were using, into Stackdriver. It was non-trivial: we would have ended up writing lots of scripts manually listing out the JMX metrics, and it wouldn't scale. The second aspect where Stackdriver suffered was the sheer size of our dashboarding. The Stackdriver dashboards back then (I haven't looked at them lately, so don't get upset with me) just weren't going to cut it: they were underperforming, they frequently hung, and the amount of analytics you could do on them was very poor. So we ended up going with Datadog, and that has worked out quite well for us.

What we have today is a setup that looks like this. Certain metrics don't come from the instances, like the GLB metrics, metrics from managed services, from the networking, from the software-defined networking; those we export from Stackdriver to Datadog. Apart from that, we run the Datadog agent on our instances, and that also pushes metrics to Datadog, which then functions as the system that collects our metrics and on which we build our alerting. These become available as dashboards which our ops team and our developers use, as well as for incident management. It has worked out quite well: today we have over 6,000 hosts running in this manner and over 250,000 metrics in Datadog, so it's quite scalable.
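To make that concrete, application-level metrics on an instance typically go through the locally running Datadog agent's DogStatsD port. A minimal sketch with the Datadog Python client follows; the metric names and tags are made up, and our services actually emit their metrics from Java rather than Python.

```python
# Hedged sketch: emit custom metrics via the local Datadog agent (DogStatsD on 8125).
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Counter: one increment per delivered message, tagged so dashboards can slice by service/zone.
statsd.increment("bbm.messages.delivered",
                 tags=["service:message-router", "zone:asia-southeast1-a"])

# Gauge: current number of active connections on this instance.
statsd.gauge("bbm.connections.active", 1234, tags=["service:edge-gateway"])
```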

There are a couple of gotchas worth considering. The first is that there was a lag between metrics being captured in Stackdriver and arriving in Datadog. I don't know if they were just trying to throw a spanner into that integration (just kidding), but it was delayed, and initially it was unusable: there was a six-minute delay between something happening on the load balancers and it being visible in Datadog. We worked with Google and Datadog to get that solved; today the delay is about a minute or less, but there is still a delay and we continue to work with them on it. If you're looking at a similar integration, do push them along; it'll benefit all of us. The other thing worth considering is that Datadog is not hosted within GCP; at least when we rolled it out, it was hosted on AWS. That means you're going to have a fair bit of egress going from your nodes, especially with a footprint as large as ours, to Datadog's data centers, which I think are now in AWS Europe and US East. The good news is that the latency introduced by that wasn't too damaging; for the most part the dashboards, the alerting, and the metrics are all captured in a reasonable amount of time, and it's working OK for us.

The next aspect was logging. What we had running on-prem was instances running a Kafka client that also doubled up as a syslog agent. Applications would log as though they were logging to syslog; the log lines would get picked up by the Kafka client, which pushed them to a Kafka broker, and the Kafka broker had a bunch of workers picking off the logs and writing them to a Hadoop file system. This mostly worked, but what it meant was that to query these logs, users needed to spin up Hadoop jobs and basically do a MapReduce to find log lines. That was slow and painful, so we wanted to re-engineer it; we didn't want to just bring it along. Secondly, especially if you're re-platforming onto GCP, there are tons of built-in services, like Cloud Pub/Sub and Dataflow, that make sense for solving this problem, and that's what we ended up with.

What we ended up with is the standard fluentd agent running on the instances. The fluentd agent acts as a local syslog daemon, so again applications can just log as though they're writing to syslog. Logs then get pushed to Cloud Pub/Sub, get picked up by Dataflow, and Dataflow writes to two endpoints: it writes to Elasticsearch, which is accessible to our teams through Kibana, and it writes Avro files to GCS, which are accessible to our teams through Zeppelin. The use cases for Kibana and Zeppelin are different. Elasticsearch allows our users, especially the operational team and the developers working on the services, to access these logs quickly and easily through search queries. But from a cost standpoint we try to keep our Elasticsearch cluster at a fixed size; we don't want it to grow continuously, so we age out logs and keep about 14 days of retention in the Elasticsearch cluster. Within GCS we store the logs in perpetuity, but accessing data from there is a bit more tedious; we provide Zeppelin as a front end for it, and it mostly works. It also serves as the input for a lot of our data analytics and data engineering jobs, which is what I'll talk about next.
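Before moving on to the data side, here's a conceptual sketch of the log-shipping leg just described. In production the fluentd agent's Pub/Sub output plugin does this for us; the equivalent with the google-cloud-pubsub client looks roughly like the following (project and topic names are made up).

```python
# Hedged sketch of shipping a structured log record into Cloud Pub/Sub, from where Dataflow
# fans it out to Elasticsearch and to Avro files on GCS.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-gcp-project", "application-logs")   # made-up names

def ship_log(record: dict) -> None:
    data = json.dumps(record).encode("utf-8")
    future = publisher.publish(topic_path, data=data,
                               service=record.get("service", "unknown"))  # string attribute
    future.result(timeout=10)   # blocking per message is fine for a sketch; batch in real life

ship_log({"service": "edge-gateway", "level": "INFO", "msg": "client connected"})
```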

On the big data side, as I mentioned earlier, we had raw data in HDFS, and on-prem we had a large Hive-based data warehouse. We were managing the data engineering jobs with Oozie, and using QlikView for dashboarding. This entire stack was re-platformed. We replaced Oozie with Airflow for job management, and we replaced all our Hive jobs with Spark feeding off GCS: the logs we ingested earlier, plus event streams coming in through GCS, get picked up by ETL jobs running on Apache Spark. The Spark cluster also writes back to GCS where necessary, and for our data science and data engineering work we use Parquet files. Primarily for business analysts and business reporting, we also run ETL jobs that write straight to BigQuery, and then Tableau runs queries against our BigQuery instance. So in cases where we have clear business reports that are well defined and not going to change much, ETL jobs write to BigQuery and make them queryable extremely quickly; in cases where we want to provide more ad hoc analytics, we provide that through Zeppelin running over the Parquet files in GCS. This too has worked quite well for us: we migrated over 100 ETL jobs that today run on over 18,000 vCPUs, we process 20 terabytes of data a day, and these jobs on average run in one to two hours, max. In the past, when we were running on-prem, these jobs would sometimes take up to two days, so this has been a huge improvement for the business.

Finally, at the top of the stack are our application servers. When we looked at what was running initially, we saw a combination of over 15 different CPU/RAM/disk configurations. We initially tried to streamline this by migrating what we had, but what we found in the end, at least for our use cases, is that the configurations that were running on-prem weren't the configurations we ended up running in the cloud. In retrospect it seems obvious: because we tried to autoscale some of these instances, we generally ended up reducing the size of the individual instances so we could scale them up and down easily. My suggestion, the takeaway here, would be: don't waste too much time trying to get this perfect, especially if you're coming from an on-prem environment; just get to the point where you can test it, and figure it out as you go along. How you do that is something I'll touch on a bit later.
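As a small illustration of the Airflow-plus-Spark setup described above, a daily ETL job can be expressed as a DAG like the sketch below. The DAG id, bucket, and spark-submit command are invented for illustration and are not our actual jobs; this uses the Airflow 1.x operator layout that was current at the time.

```python
# Minimal sketch of an Airflow DAG that kicks off a daily Spark ETL job over data in GCS.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
}

dag = DAG(
    dag_id="daily_log_etl",
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

run_etl = BashOperator(
    task_id="spark_etl",
    bash_command=(
        "spark-submit gs://example-bucket/jobs/log_etl.py "
        "--input gs://example-bucket/logs/{{ ds }}/ "
        "--output gs://example-bucket/warehouse/{{ ds }}/"
    ),
    dag=dag,
)
```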
So we've just finished the whirlwind plan of figuring out what we were going to re-engineer, how we were going to re-engineer it, and how we were going to run it. That again took us about two months-plus, sitting in windowless meeting rooms with partners like Pythian and CloudCover and our engineering team, getting it all hashed out and testing some of the assumptions. Now that we had it figured out, the next step was to actually stand the service up: now that we knew where we wanted to go, how do we get it running?

The first thing we did, before actually coming up with a plan (and this was done quite early in our transition planning), was to start with some guiding principles for how we were going to do the traffic cutover. Again, I think this is a takeaway for teams: you need to sit with your business and figure out what is acceptable to them. For our part, because we are a real-time messaging system with tens of millions of users, it was decided that we could not afford any downtime for users. I'll put an asterisk there and come back to it in a bit, but we didn't want any downtime for users; we couldn't afford it. What that meant, though, from a planning and timeline standpoint, is that we buffered things, we stretched things out, and we ensured we had enough time to do the engineering, which is why it took two years in the end.

There is definitely going to be a trade-off between how much downtime you are willing to take and how much time the overall transition will take; you need to figure that out yourselves, but do consider it. The second thing we knew from the get-go is that we wanted to allow for learning through the transition, through the traffic cutover process, because, as I mentioned earlier, this was for the most part a team that had not built or operated this service before. We knew there were lots of unknown unknowns, and the only way to find them was to actually experience them while running in production. We achieved that by ensuring that the traffic cutover could be done in an incremental fashion, and I'll touch on how in a bit. The last thing is that we wanted to be resilient to mistakes. We wanted to make sure that if we had a problem during the traffic cutover, we didn't have to incur downtime until the problem was fixed; we needed a way to send traffic back to our on-prem DC and ensure services could continue to run while we figured out how to actually fix things in Google Cloud. So that meant no big bangs; we couldn't afford those.

The traffic cutover itself can be broken down into four stages: first, setting up the joint networking; second, getting the data replicated; third, the app-server deployment; and lastly, the client traffic cutover itself. Let's go through these.

One thing that's really nice about Google Cloud, and that we leveraged, was the ability to set up a Dedicated Interconnect between our on-prem DC and the data centers in Asia. Because Google has a private backbone (they've talked about it; it's called the premium network tier), we could set up these links between the DCs, and they were low latency and highly reliable. We didn't have any service-impacting outage during that period; every now and then we'd see some spikes, but they were mostly resolved. Most importantly, the latency between our data centers in Asia and Canada was between 200 and 250 milliseconds. This meant we could use certain traffic-cutover strategies that wouldn't have been possible with higher latency; for example, if the latencies were a lot higher, some of the database replication techniques wouldn't have worked. I'll touch on that in a bit.

Because a large portion of our data was stored in Cassandra, we got that replication for free: as some of you probably know, Cassandra has native replication built into it, so that part was pretty straightforward. We spun up Cassandra nodes, let the replication run through, and we had our data in GCP. When it came to Postgres, we set up master-slave replication (Alvaro is here; he worked with us on it), and what we had to do in the end to actually get the users onto the new master was a master promotion. This is the one time we had to take some downtime: when we did the master promotion, we had to repoint the application servers to the new master, so there was a minimal downtime of about 5 to 10 minutes while that process was ongoing. It didn't impact any of our key services, though; messages continued to work, people could chat with each other, and it was therefore seen as acceptable.
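For the Cassandra piece, the mechanics are roughly the standard multi-datacenter pattern: add the GCP nodes as a new datacenter, extend the keyspace replication to cover it, and stream the existing data across. A hedged sketch with the Python driver, using made-up keyspace and datacenter names:

```python
# Sketch of extending a keyspace to a new (GCP) datacenter so Cassandra's native
# replication does the heavy lifting; names are illustrative only.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.10"])        # any reachable node in the existing ring
session = cluster.connect()

session.execute("""
    ALTER KEYSPACE bbm_messages
    WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'onprem_dc': 3,
        'gcp_asia_dc': 3
    }
""")

# After this, `nodetool rebuild -- onprem_dc` is run on each new GCP node to stream
# the existing data over the interconnect.
```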
When it came to caches, we modified our applications to be able to do dual writes: any application that used memcache or Redis gained the ability to define more than one cluster and write to both, one cluster operating in GCP and the other on-prem. When we finally cut over, we both read from and wrote to the new cluster (there's a small sketch of this idea below).

So we had the networking set up and the data migrated; the next step was actually rolling out the app servers. If you remember from the start, there were many, many app servers and services, so the first step was to map out all the dependencies between the application servers and pick the one application server with the fewest dependencies. We would then take the deployment of that application server "to the end". What I mean by "to the end" is that we would have production traffic running on that one application server, that one service, before moving on to the next. This is almost like canarying your traffic cutover: it let us ensure that any mistakes we made with the first service were not repeated for the subsequent services we cut over, and that was very useful for us.
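Going back to the dual-write cache approach mentioned a moment ago, the idea in miniature looks like the sketch below; hosts and keys are made up, and our real implementation lived inside the Java services rather than in Python.

```python
# Hedged sketch: during cutover, cache writes go to both the on-prem and GCP clusters,
# while reads come from whichever side is currently authoritative.
import redis

class DualWriteCache:
    def __init__(self, primary: redis.Redis, secondary: redis.Redis):
        self.primary = primary        # the side we currently read from
        self.secondary = secondary    # the side being warmed up

    def set(self, key: str, value: str, ttl_seconds: int = 3600) -> None:
        self.primary.setex(key, ttl_seconds, value)
        try:
            self.secondary.setex(key, ttl_seconds, value)   # best effort; don't fail the request
        except redis.RedisError:
            pass

    def get(self, key: str):
        return self.primary.get(key)

onprem = redis.Redis(host="10.1.0.5", port=6379)        # made-up addresses
gcp = redis.Redis(host="10.2.0.5", port=6379)
cache = DualWriteCache(primary=onprem, secondary=gcp)   # swap the roles at cutover time
```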

When we moved a service over to GCP, if there were any requests that depended on services that weren't in GCP yet (because we were cutting over one service at a time), we could leverage the Dedicated Interconnect to proxy those requests back to our on-prem data center. If we didn't have that low-latency link, a lot of these migration techniques wouldn't have been possible.

Once we had a single app server running in GCP, we would cut server-to-server traffic over to it first, and that too was done incrementally: we'd have 10% of the fleet make requests to it, and then slowly bump that up as we gained more assurance that the service was running fine.

We then moved to the client traffic cutover. How did we do this incrementally? The first step required instrumenting the client to be able to receive service endpoints from a third party, so that we could push service endpoints to it. To achieve this we obviously had to push out a new version of the client that supported the capability, but we also leveraged Firebase Remote Config as the back-end service to push these values, the service endpoints, to the client. We could have built our own service to do this, but the benefit of using Firebase Remote Config was that it already has client-targeting capabilities built in, which proved to be very useful, and it also allows you to roll out values or key updates incrementally. So we could say "one percent of the entire client population gets this new service endpoint", see how the service performs, and then slowly ramp it up. When we got to 100%, we moved on to the next service. So essentially that's what it was: service by service, piece by piece, we took the learnings and applied them to the next one. Rinse and repeat about 20 times, and you've fully cut over without your users ever knowing.

And that's it; that's how we migrated the entire BBM service to GCP.
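Conceptually, the incremental client cutover behaves like stable percentage bucketing. Firebase Remote Config's percent-of-users condition does this for you on the server side; the sketch below just illustrates the idea with a made-up endpoint pair and rollout percentage.

```python
# Illustration of deterministic percentage rollout: each user lands in a stable 0-99 bucket,
# so raising the rollout percentage only ever moves users forward onto the new endpoint.
import hashlib

OLD_ENDPOINT = "wss://messaging.onprem.example.com"   # made-up endpoints
NEW_ENDPOINT = "wss://messaging.gcp.example.com"

def bucket(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def endpoint_for(user_id: str, rollout_percent: int) -> str:
    return NEW_ENDPOINT if bucket(user_id) < rollout_percent else OLD_ENDPOINT

# Ramp 1% -> 10% -> 50% -> 100%, watching service health between each step.
print(endpoint_for("user-12345", rollout_percent=1))
```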

There are a couple of key learnings we discovered along the way. First, there are definitely going to be gotchas. Unless you're deploying a cloud-native service that was already designed for the cloud, there are gotchas hiding somewhere and you need to dig them out; if you haven't found them yet, it's probably because you haven't looked hard enough. Second, get alignment from the business on what impact they're willing to take as part of the transition; this will really define how much effort, time, and money you're going to need to invest. The third aspect, and this is something we totally missed when we were starting out, is that communication between the different teams involved in the transition is going to be the single toughest non-technical problem; we were not prepared for this. That communication can be between any group: between your partners, yourselves, and the Google Cloud technical team, and in our case there was also the combinatorial impact of having teams spread across Canada, Singapore, and Jakarta. Multiply it all out and you've got a very N-squared problem on your hands. Lastly, and this really worked well for us: bake in the ability to fail. You're going to make mistakes; that's the only way you're going to get something this complex right. If you bake that in from the start, you can recover, learn, and move on.

Thank you very much.

