AWS re:Invent 2020: How Venmo Responded to the Demand for Contactless Payments on Amazon Aurora


Hi everyone, welcome to FSI206: How Venmo responded to the demand for contactless payments on AWS and on Amazon Aurora. I am Babali Sen, a senior solutions architect with AWS, and with me I have Nick Ciubotariu joining in for the session today. Nick is the CTO of Venmo, and he will be talking about their architecture on AWS and how they scaled on AWS today.

Welcome everyone, and thanks for attending this session. My name is Nick Ciubotariu, and I am the Chief Technology Officer of Venmo. I'm really grateful for the opportunity to present to you today, and I hope this presentation finds you safe and healthy, especially during these unprecedented times. Today we're going to talk a little bit about Venmo's evolution in the contactless payment space and how we leveraged Aurora and other Amazon solutions to accelerate delivering contactless payments for our customers. I imagine that quite a few of you are already familiar with Venmo, but some of you may not know us, so for those that don't, we'll go through a quick overview of who we are. We'll talk a little bit about the rapid increase in demand for contactless payment solutions, how we responded to market demand during COVID, and how leveraging AWS enabled us to rapidly iterate and bring a solution to market for our customers. We're also going to talk about how we use AWS in general, and why our AWS partnership has proven to be a key enabler and a differentiator in the way that we build and ship products.

Just a little bit about me: as I said, I lead all of Venmo's technical and product organization. Before Venmo, I was the SVP of Engineering at an adaptive robotics company that built artificially intelligent robots for manufacturing. I did that for just over a year, leading their engineering teams worldwide. Prior to that, I worked in a technical leadership capacity at Amazon for five years, and at Microsoft for six years before that. Some of the technologies that you may be familiar with that I helped develop at Amazon were Amazon Search, Cloud Drive, and Alexa. At Alexa, I was actually one of the earliest Amazonians to work on the development of the Echo Show, and I also worked on Amazon's computer vision products and technologies. At Microsoft, I worked on Windows Server, Visual Studio (specifically Application Insights, where I was one of the founding members of that technology), and a few revisions of System Center.

So, for those unfamiliar with us: we actually started as a peer-to-peer payments app in 2009, offering our customers a fast, safe, and social way to pay, and really, our social feed is what makes Venmo unique and unlike any other payment solution out there. We're currently used and loved by 65 million active users, and we've experienced tremendous year-over-year growth. As our user base has grown, so has the sophistication and diversity of our products: we've shipped products such as the Venmo Debit Card, direct deposit, Instant Transfer, and recently our Venmo Credit Card. We've made key partnerships, as you can see, with other well-known companies to extend our presence and market reach, and as Venmo is now gaining traction as a payment method with retailers, we also developed the ability to pay businesses both online and in store with Venmo. All of this has not only led to exponential year-over-year growth in terms of Venmo customers, but a 1,200% increase in total payments volume over the past four years, which is really a tremendous achievement for us.

Let's talk a little bit about AWS and Venmo. Venmo was born and grew up on AWS, so to speak.
As you can see, we not only leverage AWS core services such as S3 and EC2; we use multiple AWS services, from databases to networking solutions to managed services. We use technologies such as Lambda for event-based scalability, Kinesis for real-time data analysis, CloudWatch to monitor our services, and so on. To us, one of the biggest advantages that AWS offers is the breadth of solutions; honestly speaking, it's really unmatched. We want to absorb and leverage as much of their cloud and infrastructure stack out of the box as possible, so to speak, so we can direct the bulk of our engineering focus toward building products and features for our customers. This has been a great approach for us: using AWS and their suite of managed services wherever possible lets us enhance developer productivity, product scalability, and speed to market, and, in general, gives us the ability to move fast.

Certainly, contactless payments technology is not new per se, but during the past year the entire world has seen surging demand for contactless payments due to the health risk posed by COVID-19. As the pandemic began to spread, we immediately began to hear demands from our customers for touch-free payment solutions. The good news was that Venmo already had touch-free payment solutions: in-app payments are of course touchless, then there's our chip-powered debit card, and, as we'll talk about today, our QRC-based payments. As soon as COVID began to spread, the world literally started to change around us at an unprecedented pace. Handling cash just became toxic to many businesses, and in addition to that, simply staying open has been a challenge, especially for small businesses. Personally, I've seen a variety of each of these scenarios, from folks that just needed tipping, to restaurants that are moving more and more toward contactless payments; a great number of them are already using QR codes to either take payment or display menus. A couple of days ago, actually, I was in The Home Depot and was amazed to see how big touchless is becoming: there was a sign with various QR codes that you could scan just to get product information. So this is really important, and we kept listening to our customers, who desperately needed these solutions, and we saw that we really had an opportunity to help everyone from individuals to small businesses stay open and receive much-needed money to continue to operate.

So, in just six weeks, we were able to develop and release a solution that enabled Venmo customers to present and share a way to receive contactless payments via QR codes, as well as pay via QRC at the increasing number of businesses that are using this technology as a touch-free payment method. To go from literally zero lines of code to beta in six weeks is pretty awesome, so I want to extend a huge kudos to all of the extraordinary teams at Venmo and PayPal that made that happen.

Let's talk a little bit about how we did this from an architectural perspective. QR code payments flow from a point of sale, represented here by the cash register, to our app payment platforms, which use Amazon Aurora as our database solution of choice. To ensure that we support as many QRC technologies as possible, we also made the Code 128 standard available to support traditional linear code scanners. So QR codes with higher error correction are rendered in the Venmo app, as you can see, for the cashier to scan.
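As a rough illustration of that rendering step (my sketch, not Venmo's actual implementation), here is how a payment token could be rendered as both a high-error-correction QR code and a Code 128 barcode in Python, using the open-source qrcode and python-barcode libraries. The payload format is a made-up placeholder; the token format Venmo actually encodes is not public.

```python
# pip install "qrcode[pil]" "python-barcode[images]"
import qrcode
from barcode import Code128
from barcode.writer import ImageWriter

# Hypothetical payload; illustrative only.
payload = "venmo://qrc/pay/example-user-token"

# QR code with high ("H") error correction: up to roughly 30% of the
# symbol can be damaged or obscured and it still scans at the register.
qr = qrcode.QRCode(error_correction=qrcode.constants.ERROR_CORRECT_H)
qr.add_data(payload)
qr.make(fit=True)
qr.make_image().save("pay_qr.png")

# Code 128 fallback, so traditional linear (1D) scanners can read the same token.
Code128(payload, writer=ImageWriter()).save("pay_code128")
```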
Once that happens, communication is secured and encrypted over HTTPS between our partner PayPal and Venmo, and we use persistent SSL connections to interact with Aurora and the caching layer. The Venmo application leverages PayPal's QRC infrastructure to coordinate transactions with merchant point-of-sale systems. Venmo QRC payment APIs are deployed across multiple Kubernetes clusters, and we're going to talk about Kubernetes and how we implement it in a later slide. The Venmo QRC app actually has its own Kubernetes namespace, and it auto scales based on CPU utilization, independently of our other platforms and services. The QRC app both writes and reads data from Redis and Aurora. We use Redis as our caching solution and for a feature-flagging framework, and it also acts as a message broker used for asynchronous processing, for example, sending notifications for QRC. Aurora is our primary transactional database, and it's configured with a single writer, which is our primary instance of course, and multiple read replicas that handle all the read-only query traffic.

Diving a little deeper, just to talk about the overall Venmo construct: we also use Aurora for other key components of our solution, for example our identity service and our Braintree integration for large merchants. This solution architecture is very similar to the slide we just looked at. You essentially have our Venmo P2P app deployed across multiple Kubernetes clusters that auto scale based on CPU usage. Our P2P solution also has its own Kubernetes namespace, which deploys and scales independently of our platform services. The P2P application writes and reads data from Redis and Aurora; Redis is used as a caching solution, and Aurora is our primary transactional database for all of our payment data, configured much the same way, with a single writer and multiple read replicas. You can see on the slide that we integrate into Braintree and PayPal: we use PayPal's multi-tenant services to process some card transactions, and Braintree is our card vault repository and is able to process certain types of transactions as well. The integration into both PayPal and Braintree happens via a RESTful API implementation.

While we have begun to diversify as a service, a large bulk of our payments volume is still P2P transactions, so even as we evolve into other financial services offerings for our customers, P2P is still, at this point, our main traffic producer, so to speak. Our systems do get frequently tested by high-velocity events like the Super Bowl, Black Friday, and so on, where we have to withstand traffic spikes of three to five times our normal volume. This is actually really important, because in building our QRC application, we didn't know what the traffic pattern was going to be, and we needed a persistent state solution that checked a lot of boxes: scalability, latency, availability, resiliency, and so on. One of the main reasons we used Aurora vis-a-vis other solutions was its speed and scalability: in literally every benchmark we conducted, Aurora was multiple times faster from a transaction-processing perspective, and its ease of scalability and native integration into AWS just made it a no-brainer for us to use.
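To make the single-writer, multiple-reader shape described above concrete, here is a minimal sketch, my own illustration rather than Venmo's code, that routes writes to an Aurora MySQL cluster's writer endpoint and reads to its reader endpoint over persistent pooled connections (endpoint URLs, credentials, and table names are placeholders; TLS options omitted for brevity).

```python
# pip install sqlalchemy pymysql
from sqlalchemy import create_engine, text

# Placeholder endpoints: an Aurora cluster exposes one writer endpoint and
# one reader endpoint that load-balances across the read replicas.
WRITER_URL = "mysql+pymysql://app:secret@my-cluster.cluster-xyz.us-east-1.rds.amazonaws.com/payments"
READER_URL = "mysql+pymysql://app:secret@my-cluster.cluster-ro-xyz.us-east-1.rds.amazonaws.com/payments"

# Connection pools keep connections persistent across requests;
# pool_recycle avoids reusing connections the server has already closed.
writer = create_engine(WRITER_URL, pool_size=20, pool_recycle=3600)
reader = create_engine(READER_URL, pool_size=50, pool_recycle=3600)

def record_payment(payer_id: int, payee_id: int, amount_cents: int) -> None:
    # All writes go to the single primary (writer) instance.
    with writer.begin() as conn:
        conn.execute(
            text("INSERT INTO payments (payer_id, payee_id, amount_cents) "
                 "VALUES (:payer, :payee, :amount)"),
            {"payer": payer_id, "payee": payee_id, "amount": amount_cents},
        )

def recent_payments(user_id: int):
    # Read-only queries fan out to the read replicas.
    with reader.connect() as conn:
        rows = conn.execute(
            text("SELECT payee_id, amount_cents FROM payments "
                 "WHERE payer_id = :uid ORDER BY id DESC LIMIT 10"),
            {"uid": user_id},
        )
        return rows.fetchall()
```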
So let's dive a little deeper into some of these considerations and see why we picked it. As we talked about, the ability to auto scale as seamlessly as possible was one of our main requirements, and Aurora just won hands down here. We can enable database clusters to handle sudden increases in workload, and when those events expire, auto scaling just removes the now-unnecessary replicas, so you don't pay for unused database instances; we only pay for what we use. Having read replicas available with Aurora was also huge: low-latency reads are one of our core requirements for QRC, and Aurora Replicas work really well for read scaling because they are fully dedicated to read operations on your cluster volume. From an availability perspective, we use multiple Availability Zones. We strive for better than three nines of availability; I know three nines is the industry standard, but our target availability benchmark is four nines, so having redundancy across multiple Availability Zones was also important when we designed our solution. We also needed the capability to retry certain operations quickly and efficiently, and given the nature of QRC, fast retries on reads were a must for us. We needed to be able to process asynchronous workloads very quickly and efficiently to accommodate spikes, and as we designed QRC, we also used feature flags judiciously to account for failure modes and scenarios from our integration partners.

So just how extensive is our use of Aurora? Well, as I said before, it's our primary transactional database for all of Venmo's payment processing. It handles hundreds of thousands of transactions per second, with a billion dollars' worth of daily transaction volume, and growing; so, a very, very big transactional database implementation. And as we talked about before, it's a highly scalable solution that adjusts to the demands of our business, and scalability on the fly is something that we both need and use quite a bit.

Aside from the solution choices that we made, I think it's important to say that, just like with any other solution, when you're building something successful you stand on the shoulders of giants, so to speak. As an example, when we built QRC, we were able to very quickly reuse development patterns from our direct deposit work, which, for instance, enabled folks to use Venmo to receive stimulus checks. And with QRC, when we made the technology choices that we made and looked at getting to market as quickly as possible, pattern reuse, for example from other partnerships and integrations where we had to work with different software, enabled us to work quickly and integrate QRC with key merchants and partners. One being CVS, of course, which was our first partner to roll out, and which now accepts Venmo and PayPal at 8,200 stores; that is actually a huge rollout to start with, and it's been really successful.

So now that we have a good idea of how the solution works, how did we scale it to 65 million customers? Well, let's talk a little bit about how we use Kubernetes. We went all in on Kubernetes: we moved to Kubernetes to improve our deployment consistency across development stages, to empower developers to own more of their own application stack, and to increase development velocity. Developers are essentially responsible for their own deployment manifests, and they can use our custom-built code pipeline to deploy their code and codify runbooks to manage their production workloads.
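As a rough sketch of the CPU-based autoscaling that goes along with those manifests (hypothetical names and values, not Venmo's actual configuration), here is how a 50% CPU HorizontalPodAutoscaler for a QRC-style deployment in its own namespace could be created with the official Kubernetes Python client.

```python
# pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside a pod

# Hypothetical names: a "qrc" namespace with a "qrc-api" Deployment.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="qrc-api-hpa", namespace="qrc"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="qrc-api",
        ),
        min_replicas=10,                       # over-provisioned floor for spikes
        max_replicas=300,                      # headroom for unknown traffic patterns
        target_cpu_utilization_percentage=50,  # scale out at 50% average CPU
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="qrc", body=hpa,
)
```

In practice, a policy like this would usually be expressed as a YAML manifest checked in next to the service; the API objects are the same either way.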
After the initial migration, we've optimized to meet the scaling needs of solutions like QRC. Keep in mind, as I said before, we didn't really know what the traffic pattern for QRC would be, so to safeguard that, as you can see, we logically separated instances of our platforms using namespaces, to quickly and safely iterate on functionality. We have a single Contour ingress and per-namespace route delegation; what this provides us is real simplicity of configuration and flexibility for dealing with service-routing concerns. With this setup, our QRC service deployment was optimized independently of any other platform concerns, which assured us that throughput and latency requirements were met for this very different traffic pattern that we hadn't seen before.

We architected for scale and resiliency by increasing our redundancy, so we run three Kubernetes clusters for API service workloads, distributed across five Availability Zones; for the purposes of this chart I've only presented three here, but I think you get the picture. So what does this mean? It means that we can lose an entire cluster without sacrificing the scale requirements to service our applications. Some failure examples: say we lose a cluster due to an expired TLS cert, or we deploy a bad release (which, we all know, happens), or DNS starts failing in one of the clusters. This particular failover architecture addresses all of these types of failure modes without service interruptions to our customers. To deal with the risk of cluster-wide issues, we built custom tooling to be able to gracefully degrade. Our public Envoy proxies sit behind Application Load Balancers, and the ALBs distribute traffic load across our three clusters. To help mitigate against total loss of service, we have custom traffic management tools that allow us to override our proxy routing configuration at runtime, to manually shift traffic to deal with cluster management and outages. This is manual right now; in the very near future, we plan to fully automate it by introducing health-based routing in our proxy configuration. So, in practice, we can replace a cluster during peak traffic with no additional customer impact. And we're ready to deal with spikes by over-provisioning our service layer and setting CPU scaling thresholds at 50%. With that reserve capacity, we can start hundreds of new service instances in seconds, we can scale new nodes within minutes, and we can flex between 100 and 300 nodes per cluster throughout the day.
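The database tier flexes in a similar way. The Aurora replica auto scaling described earlier, where replicas are added under load and removed when an event subsides, is configured through AWS Application Auto Scaling; a minimal boto3 sketch of that setup, with a placeholder cluster name and capacity bounds rather than Venmo's real settings, might look like this.

```python
import boto3

autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

# Register the Aurora cluster's replica count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="rds",
    ResourceId="cluster:my-aurora-cluster",           # placeholder cluster name
    ScalableDimension="rds:cluster:ReadReplicaCount",
    MinCapacity=2,
    MaxCapacity=15,
)

# Target-tracking policy: add replicas when average reader CPU exceeds the
# target, and remove them again when the spike passes, so you only pay for
# replicas while you actually need them.
autoscaling.put_scaling_policy(
    PolicyName="qrc-reader-cpu-tracking",
    ServiceNamespace="rds",
    ResourceId="cluster:my-aurora-cluster",
    ScalableDimension="rds:cluster:ReadReplicaCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "RDSReaderAverageCPUUtilization",
        },
        "TargetValue": 50.0,
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 300,
    },
)
```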
The main takeaway from this, in measuring performance, is that we of course increased transactions per second due to QRC; as you can see, there was about a 50% increase in TPS from QRC and other service use. What's interesting to note, though, is our ability to serve linearly growing traffic while reducing CPU utilization and actually improving query response times, which is pretty cool if you think about it. We used a multi-part solution to get here. AWS Aurora played a key role on the compute and storage side, just out of the box, and what this allowed us to do was focus on in-database optimization initiatives around performance and scale, so that, as our transactions per second went up, our query response times stayed under a millisecond and CPU utilization went down.

Looking at query response times, for instance: even with the roughly fifty percent increase in TPS, we actually had a higher percentage of our queries complete in less than one millisecond. We went from roughly 88 percent of our queries taking under one millisecond to 93 percent, even as traffic went up. The same goes for CPU utilization: even in the face of the increase in TPS, the CPU efficiencies resulting from Aurora upgrades and the index optimizations we made to our data models resulted in CPU consumption going down. With regards to compute and storage, what you see on this graph is our current Aurora cluster compute and storage footprint. We were able to unlock performance and efficiencies running large database clusters in AWS, and Aurora's storage engine scales to the growing needs of QRC and Venmo's product portfolio quite well, as you can see here. Again, this is just a recap of how we were able to actually improve performance in the face of increasing traffic. I won't drain the slide, as we've covered all this, but these are some of the metrics that we looked at and some of the results that we saw: even with the 33 to 50 percent increase in traffic, we gained both compute and query-response-time efficiencies.

What does this mean for us in practical terms? As you can see here, by leveraging AWS technologies, including Aurora, for both cloud infrastructure and managed services, and by taking advantage of things like push-button compute, auto scaling, and cloud nativity with MySQL, which we use quite a bit, our agility is consistently enabled. In this case, it allowed us to build a solution that went from zero lines of code to a beta release candidate in about six weeks, which went live, again as I said, at CVS on November 16th of this year, and is continuing to go live with many merchants onboarding. Without using AWS as we have, it honestly would have been practically impossible to achieve something like this.

So let's talk a little bit about what we learned from shipping an end-to-end solution like this in six weeks. Obviously, this has been a very successful release for us, but with everything we ship, of course, we always look at lessons learned. First, you have to be available online; if you're not online, your customers obviously can't use your solution. Customers have choices, and we're very cognizant of that; switching costs are low, so we always have to be prepared to scale up. There are surprises, and traffic patterns change; these are things you need to account for, both in choosing a technology solution and in designing how you build software. Again: build upon the things that worked before in order to move faster than before, which we were able to do, and just make good technology choices. Our jobs as developers are to ship delightful features to our customers as quickly as possible, so leverage what you can in terms of managed services out of the box, and empower your developers to build solutions and not have to worry about infrastructure, platform, and managed services. AWS, and Aurora specifically in this case, really enabled us to do that.

So, what's next for us?
There's quite a bit on the slide, but what I'm really excited about in the future is applied science and how artificial intelligence really enhances our customer products. We want to double down on aspects of artificial intelligence like machine learning, natural language processing, and computer vision, to provide a more meaningful and valuable experience for our customers. So we're really excited about our growing use of tech like SageMaker, which has really helped us accelerate our applied science roadmap. We're also really excited about AWS Graviton2, to unlock better price-to-performance ratios with Arm processors, and we're very closely following the roadmaps of RDS Proxy and Aurora Serverless, which just released, so congratulations to the team, on our journey to grow the business and create value for our customers. We have quite a few really compelling features to release in the near as well as the long term, so stay tuned; I think there's a lot of exciting stuff coming.

In closing, I just want to say thank you. I know this is an unprecedented time for all of us. I want to thank you for taking the time to view the presentation; I hope you found the content valuable. I also hope you continue to stay safe and healthy, and I hope you have a great rest of the year. Thanks so much for attending.
