Nest Cam Services Migration to Google Cloud (Cloud Next '18)
Hi, everybody. My name's Miles Ward. I'm the global head of solutions for Google's cloud platform. What that means is I help people adopt the cloud and move to the cloud. I have an incredible privilege to introduce a friend of mine that I did not help move to cloud. They did it all on their own, and as a result they really know all of the wrinkles and the hard steps that it takes to figure this out, and the results that they've received are absolutely incredible. So I can't wait for Nils Pommerien, who is the senior engineering manager for Nest, to explain to you how they've moved to cloud. Are you excited about that? Sounds good? Awesome. Thank you so much.

Perfect. Thank you, Miles. That was pretty amazing. You guys are comfortable? Perfect. The last time I was this far in front in a movie theater was the premiere of Titanic or something, but that's a long time ago.

Thanks for coming. I'm Nils Pommerien, as Miles said, and I'm leading SRE at Nest, and I'm very excited to share with you our story of how we migrated onto Google Cloud.

Real quick poll: who here has migrated to Google Cloud recently, over the last couple of years? I see one hand; it's really hard to see. A couple of hands, perfect. Who here is planning on moving to Google Cloud? That's even more hands, so perfect. That's great. So let's see how this works. Next.

Just to set the stage, I wanted to share the numbers and the scale of the service that we are operating to support Nest cameras. I'm going to keep this casual: does anybody here have a Nest camera at home? A lot of people, very cool. Thank you, first of all.

So it's really massive, and we were facing a challenge on two axes. Number one is just the sheer size and scale: we're operating on hundreds of thousands of cores, recording tens of petabytes of data, and the data just keeps coming at hundreds of gigabits per second. So that's obviously number one: getting that amount of compute and data over into the cloud, or into Google Cloud. The other one comes with the service requirement and the customer promise that we're making. As you know as customers, we're recording 24/7, and the customer expects zero skips in their timeline so they can go back to any particular event. The other one is that we make a customer promise that at any given time they can open the app and actually check in on their video footage, and all of that at millisecond latency. So that's what we had to move: a pretty big service, certainly the biggest service I've ever worked on.

Let's go back a little bit. Where did we come from? About four years ago, pretty much exactly four years ago, Dropcam was bought by Nest, and Dropcam really has been the foundation of what is known as Nest Cam today. And then six months prior to that, Google bought Nest.

So that kind of brings up the obvious question: did they make you do this? Did they make you move onto Google Cloud? I want to debunk that, because it would actually cut the talk in half, since we talk about our decision process. The short answer is no. Up until recently we were an Alphabet company, what we call "other bets", and as a bet you really make your own decisions. It's also fair to say that there are good parts of Nest that continue to run on AWS, and there are other bets that are not on GCP, that are either self-hosted or on other cloud providers. So the decision process that we're talking through is completely self-made.

Okay, so where were we about four years ago? Nest was already a pretty big place: we had thermostats, smoke detectors, we had cameras. Since then we've added more cameras, more thermostats, a security system, et cetera. Today I really want to focus on the camera side of the business.

The camera service in AWS was built with commodity AWS components such as EC2, S3, and DynamoDB, and then, for analytics, Redshift and Kinesis. I like to use the analogy of a box of Legos: each cloud provider has their own kind of tiles, and some of them are very compatible. If you compare S3 to Google Cloud Storage, they fit very well together. But then there are other services, such as DynamoDB, where you have something like that in the Google Cloud box, but you actually need to adjust your build quite a bit to create the same customer experience. So that's what I'm going to walk you through: how we found replacement technologies for all of these.
The other thing I wanted to say here real quick: when we were acquired, we were a much smaller operation. We were operating on tens of gigabits of recording, which was the equivalent of hundreds of years of video per day; today we record thousands of years of video every day. We were running on hundreds of servers, and they were still scaled fairly manually. They were provisioned automatically through Puppet and cloud-init and so forth, but it was definitely less mature than we are today.

But then the Nest acquisition was a great opportunity, and it really supercharged the demands on our cloud infrastructure. Number one, Nest already had a foothold in Europe, and we needed to follow suit, so very quickly we had to turn our operations into a global operation. Second, Nest allowed customers on newer cameras to turn their sensors from 720p to 1080p, and with all the image recognition and so forth that we're doing, that means you need to operate on about twice the number of pixels. And last but not least, over the last four years we have constantly improved our vision algorithms to send meaningful events to our customers, which also caused a significant increase in compute demand. So we were faced with the decision: what are we going to do?

Let's talk through how we ended up deciding to go to Google Cloud. Without a doubt, probably the biggest reason for us was the global nature of Google's network and its components. For our service, all the video is obviously recorded in the cloud, but the video that is live-viewed is actually reflected off the cloud. As such, it's fairly easy to understand: the closer you get that compute to the customer, the better the experience is going to be and the lower the latency is going to be. So as we enter new markets, it's really important for us to have a presence as close as possible. Google Cloud helps there, and we'll look into that in a second. Building a multi-region story in cloud, whether you do it just for proximity like us, or for regulatory reasons, or quote-unquote just to achieve resiliency: from experience, it's very easy to do with Google Cloud.

So that was one. The other one was savings. If you run a business, you are obviously worried about operating costs and about margins. One cost that we're tracking is the service cost of operating a camera. These are real numbers, without the actual values, but this is how we have seen the cost of operating a camera decline over time. Now, I do want to say it's highly dependent on your use case. In our use case we stripped everything apart (enterprise agreements, RIs), looked at street pricing, compared that to Google Cloud pricing, and with that alone we've seen a 20% decrease in operating cost. In real numbers that means tens of millions of dollars for us, and that's just in one year. Very, very massive. Obviously our finance team is excited about it, but we are too, because what it actually allows us to do is fund new features. We are building additional features, and some of the features that you see on your cameras today have been funded through those kinds of cost-down efforts, which we'll continue to drive. So Google Cloud really helped us there as well.

So, let's take a look at the migration approach. In a nutshell, this is what it looks like. I really was hoping this was a smaller room so I could point at things up here, but I have a laser pointer. Cameras stream into the cloud. From there, video is recorded into S3, and video metadata, along with any sort of event that we have detected, goes into DynamoDB. And then the client, whether you use your browser or your phone, rendezvous on the same server, and you can consume the stream or watch recorded video.

Now, when we started off, we sat down with the Google Cloud folks and we were like, hey, couldn't we just move that S3 thing down into Google Cloud Storage? It's a great idea, but the way public cloud pricing models work, all your inbound bandwidth is free but outbound isn't. So it would have been really cost-prohibitive to do any sort of hairpin turn in AWS just to store it in GCP. What we had to do instead is build up the entire stack in Google Cloud and build the necessary abstractions for a server that is in Google Cloud to know to talk to Google Cloud Storage and to Spanner, respectively, as replacements for S3 and DynamoDB. That took a while, but it was the way to go.

And then the actual migration became trivial. We updated the database record to say "you now belong to Google Cloud", and we could do this one camera at a time, which is really important. We sent the camera a disconnect signal; the camera dials back in, connects to Google Cloud, and starts recording over there. Easy, and we're done. This works extremely well for new cameras. But what about the cameras that have been recording and have data recorded? Assume they are on a 10-day recording plan.
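The per-camera cutover described above (update the ownership record, send a disconnect, let the camera dial back in through the routing lookup) can be sketched roughly as follows. This is a minimal in-memory illustration, not Nest's implementation: the `Camera`, `Fleet`, and `Cloud` names and all method names are hypothetical, and the "database record" is just a field on a dataclass.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Cloud(Enum):
    AWS = "aws"
    GCP = "gcp"


@dataclass
class Camera:
    camera_id: str
    home: Cloud = Cloud.AWS                 # the "you now belong to ..." record
    connected_to: Optional[Cloud] = None    # None while disconnected


@dataclass
class Fleet:
    """Hypothetical stand-in for the routing lookup plus its database."""
    cameras: Dict[str, Camera] = field(default_factory=dict)

    def lookup(self, camera_id: str) -> Cloud:
        """Routing lookup: which cloud owns this camera right now?"""
        return self.cameras[camera_id].home

    def connect(self, camera_id: str) -> Cloud:
        """The camera dials in and is sent to the cloud that owns it."""
        cam = self.cameras[camera_id]
        cam.connected_to = self.lookup(camera_id)
        return cam.connected_to

    def migrate(self, camera_id: str, target: Cloud) -> Cloud:
        """Move exactly one camera; every other camera is untouched."""
        cam = self.cameras[camera_id]
        cam.home = target             # 1. update the ownership record
        cam.connected_to = None       # 2. send the disconnect signal
        return self.connect(camera_id)  # 3. camera dials back in via lookup
```

Because the unit of migration is a single record, moving a camera back is the same call with the other target, which is what makes a cautious, reversible rollout practical.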
What we wanted to avoid is a big copy job, because ultimately it would still have incurred egress charges. Also, if you think about it (many of you are users), you don't actually look at all the video you have ever recorded; it's a small fraction of the video that we ever actually play back. On top of that, the data is self-expiring: after 10 days, we have a customer promise that we're going to delete the data. So what we built is a technology called scatter-gathering. If a client, now connected to Google Cloud, requests video that is still stored in AWS, we just pull that in automatically across clouds, stitch it back together, and present it through the API as if it were all in Google Cloud.

Now, it's very specific to our use case, but what's interesting here is that you can do this with any application. Even if you have, say, S3 log processing and the data is expiring over there, you can easily use public cloud APIs securely over the internet. Bringing back the Lego analogy, there's no reason why you can't intermingle those different services. So it was kind of a breakthrough moment for us to really keep migration costs under control.

All right. Next I want to talk about how we built a global architecture. To do that, I want to talk through real quick how a camera first connects. When you first power it up, it goes through this routing lookup layer, backed by a load balancer, which in turn is backed by Compute Engine instances. We do a quick lookup, figuring out where this camera belongs, and then connect the camera straight to the owning server of that camera.

Now, very quickly you can add a couple of regions. Try to look at both ends of this. On the left side, the load balancer: with a couple of API calls or clicks you can add Compute Engine instances from any region, really, to that load balancer. And if you look at it from the right side as well, whether it's the global Spanner storage or a Google Cloud Storage bucket that is global, or rather spans a geographic region, all these things, and other services that I'm not showing, such as Pub/Sub, are global in nature, making it very easy to build a global infrastructure.

The other thing that's important here, and it's not shown, is that it's really the same VPC network, the same firewalls, and the same public API endpoints that we call to launch these instances. It's really just a matter of specifying the launch region where you launch your instance. And last but not least, all of them are connected through Google's fast backbone, so even if you have cross-region traffic, you don't have to deal with VPC peering or anything like it. It just works.

Next, we can just add an entire continent. Going into Europe, for example, going into Asia next: it's now very cookie-cutter to go into new markets and new regions, and SRE, or operations, is no longer in the way of the business entering new markets, which is really a great enabler. That's how we built a global architecture.

Next, there are three takeaways that I want to talk about. The first one is: focus on reusable tools. Doing a migration like this is a huge investment. In our case we've written thousands of lines of code, and at every step and every turn we've asked ourselves: is this a one-time deal, or is this something we could actually reuse in the future? Remember scatter-gathering, how we brought things over from AWS on demand. Imagine, whether it's a future feature or something we do automatically as part of a DR strategy, that we're moving a camera from the US to Europe. Why wouldn't we want to use the scatter-gathering that we've now built once to pull in data on demand from other regions if we had to, and then stitch it together and present it back? Imagine an owner moving and taking the camera with them. You can take it as far as, in the future, sensing where that camera is best connected. Again, it's just one example of reusability; it's something that we really focused on during this migration.

My next favorite topic is to think outside of the ops box.
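As a rough illustration of the scatter-gathering idea described earlier (ask every store that might hold part of the requested window, then stitch the pieces into one timeline), here is a minimal sketch. It is an assumption-laden toy, not the production design: `Segment`, `SegmentStore`, and `scatter_gather` are hypothetical names, time is a plain integer, and the stores are in-memory lists rather than S3 or Cloud Storage clients.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Segment:
    start: int   # inclusive, in seconds (hypothetical unit)
    end: int     # exclusive
    source: str  # name of the store that served this piece


class SegmentStore:
    """In-memory stand-in for one cloud's or one region's video store."""

    def __init__(self, name: str, ranges: Sequence[Tuple[int, int]]):
        self.name = name
        self._ranges = list(ranges)

    def query(self, start: int, end: int) -> List[Segment]:
        """Return stored pieces overlapping [start, end), clipped to it."""
        return [
            Segment(max(s, start), min(e, end), self.name)
            for s, e in self._ranges
            if s < end and e > start
        ]


def scatter_gather(stores: Sequence[SegmentStore], start: int, end: int) -> List[Segment]:
    """Query every store for the window, then stitch a non-overlapping timeline."""
    # Scatter: each store answers only for what it actually holds.
    found = [seg for store in stores for seg in store.query(start, end)]
    # Gather: order by start time (stable sort, so ties favor earlier stores).
    found.sort(key=lambda seg: seg.start)
    timeline: List[Segment] = []
    cursor = start
    for seg in found:
        if seg.end <= cursor:
            continue  # this stretch is already covered
        timeline.append(Segment(max(seg.start, cursor), seg.end, seg.source))
        cursor = seg.end
    return timeline
```

Nothing in the function is specific to a cloud provider, which is the reusability point: the same call works whether the stores are two clouds during a migration or two regions after one.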
This is probably the most important thing that I learned during this migration. I run SRE, and I was tasked with getting the service over to Google Cloud. But my lesson here is: whether you're SRE, PE, DevOps, or NetOps, make it known within your organization that this is really a cross-functional exercise. If you do it well and you really focus on the customer, you can make the migration extremely seamless, and you may actually end up, and I'll show that, in a better place than you were before.

I want to show two quick examples. Number one is the live-view case. First of all, we try to avoid migrating a camera while it's actively being viewed. But let's assume for a second that you do: you move it over, send it the disconnect signal, and it comes back in. If you do this poorly and a customer is viewing this camera feed, it turns black; the customer has to exit the app, go back in, and so forth. It's frustrating. So we worked very closely with the app developers to say, hey, if we send you a signal that we just changed regions, can you just show a spinner for a second and reestablish? That's effectively what we did, and it actually now improves the resiliency of our communication signaling between apps and cloud. So that's one example.

The other one is around the recording model. This is something where we worked with the embedded team; again, we own the communication end to end. We use something we call backfill. The camera itself has a little two-minute buffer. When we send it the disconnect signal to come back into Google Cloud, it's five seconds. On the other side, in Google Cloud, the camera will negotiate with the service to say, hey, where did you leave off recording over in AWS, or anywhere else for that matter, and then we negotiate and stitch the timeline back together. As a result, if you come back later that day and your camera had been migrated, you see absolutely no seams on your timeline, which is pretty neat. Again, it took a lot of effort to get there, but we were that concerned about the customer experience during this migration.

And then the other thing is: re-architect, and do it often. If we had waited for Google Cloud to provide us with a replica of what we had in Amazon, we would never have moved. You do need to adapt. And if you look at the marketing material, very often what I found is that it only scratches the surface of what the actual performance characteristics of the services are. Even if you look at things like S3 and GCS, they're ever so subtly different.
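The backfill negotiation described earlier in this part of the talk (the camera keeps a short local buffer, asks the service where recording left off, and uploads only what is missing) might look something like the sketch below. Everything here is an assumption for illustration: one buffered entry per second, integer timestamps, and the `CameraBuffer` and `backfill` names are hypothetical. Only the two-minute buffer size comes from the talk.

```python
from collections import deque
from typing import Deque, List

BUFFER_SECONDS = 120  # the "little two-minute buffer" mentioned in the talk


class CameraBuffer:
    """Rolling on-camera buffer; assumes one entry per second of video."""

    def __init__(self, capacity: int = BUFFER_SECONDS):
        self._frames: Deque[int] = deque(maxlen=capacity)

    def record(self, timestamp: int) -> None:
        self._frames.append(timestamp)  # oldest entry drops off when full

    def since(self, last_recorded: int) -> List[int]:
        """Everything still buffered that is newer than last_recorded."""
        return [ts for ts in self._frames if ts > last_recorded]


def backfill(buffer: CameraBuffer, timeline: List[int]) -> List[int]:
    """Negotiate where recording left off and append only the missing part."""
    last_recorded = timeline[-1] if timeline else -1
    missing = buffer.since(last_recorded)
    timeline.extend(missing)
    return missing


def has_gap(timeline: List[int]) -> bool:
    """True if any second is skipped between consecutive entries."""
    return any(b - a != 1 for a, b in zip(timeline, timeline[1:]))
```

In this toy model a five-second disconnect is comfortably inside the two-minute buffer, so the stitched timeline has no gap; a disconnect longer than the buffer would leave one, which is why the buffer size matters.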
I encourage you, as you go down this road, to stay very nimble and really embrace that you need to make changes a lot. Here's a very simple example that you can wrap your head around. The average recording node in AWS looked something like this. We were using c3 instance types, and we really liked them because they come for free with these awesome ephemeral drives: super-high IOPS, 2 times 80 gigs. We strapped them together, and we used them both to spool video before we flush it out to S3 and to keep a cache of recent video, so we don't have to retrieve it from S3 every time a customer requests something.

Now we were like, okay, sweet, we can build the same thing over in Compute Engine. So let's take an n1-standard-8; it comes with 30 gigs of RAM, so really very similar performance characteristics. We plugged in a PD of 180 gigs, oh sorry, 160 gigs, and we were like, this is going to be great. It worked phenomenally well in QA, but once we actually put load on it and load-tested it, it turned out (and it's not the fault of PD) that it just doesn't match the I/O capacity of an ephemeral SSD drive. So that doesn't work.

Next: Google has something similar, and this is what I'm saying, both of them check the check marks, they both have ephemeral drives. Google has a drive that's 375 gigs, local SSD. It's as volatile, it's as fast. The problem, though, is that it's too big: 375 gigs is the smallest unit of local SSD that you can attach. For most customers that's not a problem, because they actually attach ten or so of them to run big Cassandra nodes, but in our case, if you operate thousands of these servers, it would have just been a giant waste. So we ruled that out as well.

And then we started thinking: let's take a look at the 160 gigs and see what we actually use them for. We added instrumentation around cache hit rates, cache miss rates, and so forth, and it turns out we actually didn't need all this storage. So we ended up with a machine (and this was all in a matter of a week or so, so again, staying very nimble) that was an n1-highmem-8, that's what it's called. It just comes with more memory, and we carved out a 25-megabyte, sorry, 25-gigabyte RAM drive. It's as volatile as a local SSD, it has even better I/O performance, and in the process we saved a bunch of money as well, because we didn't have to attach all these additional drives. It's that kind of thing where, once the rubber hits the road, you really need to make adjustments, embrace that, and be open to it.

Great. So, takeaways. This kind of ties back into the hard drive story: don't be afraid to experiment. My number one advice is to invest in good instrumentation. You want to set the baseline of what the customer is experiencing in your old cloud provider, and then you want to be able to say with confidence whether you're moving the needle in the right direction or in the wrong direction.

The other thing I feel fairly strongly about is that forklifts don't work. You don't want to come in on a Monday and change DNS for your entire service. Really try to move either one customer at a time or, I mean, in our case it didn't work due to data constraints, but if you can, one service at a time. Have a failback strategy, and so forth. We did that more than once: in the early phases we moved a couple of cameras over and then moved them back, you know, internal cameras.

And then, as I said before, it's like hitting six right numbers in the lottery: you're not going to pick the right infrastructure from the get-go. So I really encourage you to stay nimble and actually embrace that fact, to make changes often and see if you move the needle one way or the other. That's one.

The other one is focus on reusability. As I said before, it's not an SRE, PE, or DevOps problem alone. Really try to focus on the customer. You do have an opportunity, just as I showed you, to actually come out with a service and a platform that is better and more stable than it was before you went down this journey to migrate to cloud, or to Google Cloud.

And last but not least: save money and have fun with it. It's something that we do; it's in our DNA to constantly create additional headroom for features, and we've just scratched the surface on it. I mean, sustained use discounts you kind of get for free, but then Google has great knobs to really drive your operational cost down. So we are looking very heavily into autoscaling, and we're very excited about preemptible VMs in particular for some of our workloads, to reduce cost even further.
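The instrumentation step from the disk-sizing story earlier in the talk (measure cache hit and miss rates before deciding how much storage a node needs) can be illustrated with a small sketch. This is an assumed setup, not the speaker's tooling: an LRU cache with hit and miss counters, replayed against a recorded request trace at different capacities. `InstrumentedCache` and `replay` are hypothetical names, and sizes are abstract units.

```python
from collections import OrderedDict
from typing import Iterable, Tuple


class InstrumentedCache:
    """LRU cache that only tracks keys and sizes, plus hit/miss counters."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.hits = 0
        self.misses = 0
        self._items: "OrderedDict[str, int]" = OrderedDict()

    def request(self, key: str, size: int) -> bool:
        """Record one lookup; on a miss, admit the item and evict LRU entries."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self._items[key] = size
        self.used += size
        while self.used > self.capacity:
            _, evicted_size = self._items.popitem(last=False)
            self.used -= evicted_size
        return False

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def replay(trace: Iterable[Tuple[str, int]], capacity: int) -> float:
    """Hit rate a cache of this capacity would have had on the given trace."""
    cache = InstrumentedCache(capacity)
    for key, size in trace:
        cache.request(key, size)
    return cache.hit_rate
```

Replaying the same trace at several capacities shows where the hit rate stops improving; beyond that point extra storage buys nothing, which is the kind of evidence that would justify a much smaller, faster cache.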
2018-08-14 06:38