Under the hood at Auror


Rob heads up the engineering team at Auror. He loves talking about code, even more now that he doesn't get to write much of it — these days he's too busy leading Auror's growing team — but he has bounced from managing people back to coding every time he's changed jobs, and he deeply enjoys both. He loves finding ways to work smarter and faster that aren't harder, and he hates deadlines, so I'm sure we can all relate. Without any further ado, let's hand it over to Rob. [Applause]

Hi. So I'm here to talk about what we do at Auror, how our rebuild's gone, and more specifically what technologies we chose when we got to start with a fresh slate. We had an old product, and this is less about the decision to do the rebuild and more about what we picked and what's been working well — and even what hasn't been working well. But I will start with the actual decision to do the rebuild in the first place.

First, a little bit about Auror. Our first customer ever was Countdown New Zealand — owned by Woolworths, the Australian parent of Countdown. We — well, not we, the founders, when they were thinking about starting a company — phoned up the Countdown head office and said, "Hey, can we talk to your head of loss prevention please?" They said sure, got on the phone, and had a big yarn about the state of loss prevention in New Zealand retail and what kinds of software could help with it. Being founded in New Zealand has been really cool for us because of that kind of opportunity to get access to our customers and the leaders in that space; if this was the US and we'd just dialled a head office and asked to speak to someone, we'd have been turned away pretty quickly. We now work with literally the largest retailer in the world, among a lot of others in Australia and New Zealand. The US is our current focus, but we have pretty much all the large retailers in New Zealand, and we're very close to getting there with Australia as well. The BizSpark program — which, in case anyone hasn't heard of it, was a way for startups to get free Azure credit — was a large factor in our decision to work with Microsoft from day one. But when we did the rebuild we had a good think about whether to stick with Azure, and in the end it was a pretty easy decision, and we're really glad we made it.

Auror itself, the product, is about allowing loss prevention staff in retail stores to report on crime and other incidents that happen in their store, then taking that information — that intelligence — and intelligently sharing it with other people, whether that's stores nearby geographically, head office, or law enforcement. By aggregating all the data on the shoplifting going on in a particular city, we can tell retailers to start putting things like honey behind the counter because gangs are starting to go after it. We can tell the police that a particular person is doing a lot of offending, so really look out for him — or that he carries a weapon, so if you see him, just call the police and don't approach him. That kind of stuff. Organised retail crime is actually somewhere I think we make a big difference. It's surprising how much shoplifting is actually in aid of organised operations.
There are warehouses or garages full of baby food that's going to be tampered with before being taken overseas and sold to, you know, mothers in China, for example. Being able to bust those rings is really, really good. We work with our retailers, they work together even when they're competitors, and we all work with law enforcement on gathering and sharing that kind of intelligence. It's really interesting how, on the Auror platform, two retailers who traditionally compete with each other are really happy to share intelligence — that's one thing we enable, because we're handling the data for our customers, which makes it easier for them to actually collaborate. If someone shoplifts in a particular suburb, Auror will share that information with all the other shops nearby. That's something that happened before Auror too — security guards would be on Facebook groups or WhatsApp groups sharing photos of offenders with each other — but having it happen in Auror means all the right people get the information, and there's also that corporate governance around privacy concerns and things like that.

So about three years ago we had to make a decision on whether or not we'd do a rebuild. I'm kind of glossing over this today, but if anyone wants more information I've got an old Codemania talk I can share, and I'm really happy to talk about it as well — in particular if you're having this thought process at your own company. While we were lucky in some ways, I've still learned some lessons that are hopefully really helpful.

Our situation was: we had really happy customers who loved the features we'd built so far. We'd been going for about three years. We had an NPS (Net Promoter Score) over 70, which is really good, and we had no churn — no customer, once they were on the platform, had left. Since then we've had a few tiny, tiny customers churn, so what we say officially is that we have no enterprise churn, which is a really good metric. But we were starting to look at the US, we had our first Australian customers, and we could extrapolate from going New Zealand to Australia to the US how the platform was going to need to evolve in terms of requirements and what it does. So we'd already identified pretty fundamental changes to what the software needed to do.

Then there was also the situation that whenever we wanted to add more features, because of the architecture we had, it took a lot of work — each new feature had kind of an n-squared cost. A really good architecture means that if you want to add a new feature, the cost grows more like n to the power of 1.1 in terms of how much more work the fifteenth feature is, for example. But in the situation we were in, finishing the next feature with a high degree of quality, without breaking anything else, was really, really tough, just from the state of the code base. We had spent about six months attempting to refactor that code base — putting a lot of effort into improving the architecture and the code, moving out of stored procedures into EF code, a lot more unit tests, a lot more documentation — but we still couldn't see the light at the end of the tunnel in terms of how much refactoring it was going to take before we had a platform that was easy to keep adding features to.
Beyond that, it was just really tempting to look at consolidating the different technologies we were using. Because we'd been moving fast and experimenting, and didn't have much time to go back and revisit our experiments, there were, for example, five front-end frameworks on the Auror platform at the time. That made it really hard to maintain, and it had big performance impacts: when you've got a 20-megabyte download of JavaScript before the page can even start doing anything, that's a big problem for customers on remote sites. As part of the rebuild we focused on React as the one front-end framework, and that's been great.

Tom and Phil, our co-CEOs, had already been thinking about "what if we just start over" — they'd been having those discussions before I even started. After I'd had a good look at where we were, they brought it up with me: what if we did a rebuild? We had a quick look at what features we actually needed in the rebuild, and the decision got quite easy at that point, because to build even one more feature we were going to need at least another six months of refactoring, and then building that feature would take a long time because of how hard everything was to do. So we looked at how much work it would be to actually build the product we wanted — which wasn't a one-for-one rebuild of what we had; there were things we could simplify, things we could throw away, and new aspects we needed — compared that with the length of time for the refactoring, and with what features we were going to need for customers in future. That made it pretty clear we wanted to do a rebuild.

Then we looked at the other benefits. Going cloud native, moving off virtual machines — that was a big drawcard. Being able to add more features faster in future — that was a really easy way to sell it to the board. And then there was scalability and performance: picking new technologies and architectures was going to help there. We also took the opportunity to do what's sometimes called super-tenanting (that's what Daniel Larsen told me it was called): you have a multi-tenant architecture, but not a single instance of it. We've gone with one production instance for New Zealand, one for Australia, one for North America, and one for the UK — that's where we have customers at the moment — and if we moved into Africa, we'd stand up a new production instance there. It'd be nice and performant, because the data centre would be in Africa, and in particular the data sovereignty aspects are a big reason this was good for us. New Zealand customers don't care, but our Australian customers demanded that their data stay on Australian shores, and we knew US customers were going to say the same thing. So a super-tenanted architecture is working really well for us from a legal perspective, and from a scalability perspective too, because the usage patterns differ across those geos — when New Zealand needs to scale up is different from when the US needs to scale.

So we said: yep, let's start over. We haven't looked back; it worked really well for us. And it wasn't just the developers who wanted to do it, so it was an easy sell.

We came up with some principles for how we were going to do this. The number one thing about the old platform that we wanted to move away from was the complexity — the number of ways in which each thing affected each other thing.
So anytime a feature could be simple, we said the way we make things flexible and suit different customers is to actually keep it really simple. In the past we'd said the way to make our customers happy was to build exactly what they wanted, and then modify it to be exactly what the next customer wanted as well — and that just got really complex. The simplicity of the rebuild was necessary to get it out the door in time, but it really helped with our existing customers as well.

We also said that anytime we could just buy something off the shelf and use it, we'd do that, at least to start with, because we wanted to get the rebuild done as quickly as sanely possible. For example — and I'll go into the details later — Auth0 for logins: it was a pretty non-controversial decision to say they are the experts in authentication and authorisation, let's just use them. LaunchDarkly, a feature flag platform, was a little more controversial, because it's quite expensive, and when a developer looks at what feature flags are, they think "oh, it's just a JSON file, I could build that". But we still said: if we want to build it, we'll build it after we've released our first version. In truth, having used LaunchDarkly and the suite of features they actually have, I see the value, and I'm not sad about paying full price for it anymore.

We prioritised cloud-native components. Anytime Azure had an offering that was just a new resource in the portal — as opposed to something we'd install on VMs and operate — we went with it. App Services over virtual machines is an easy decision, but Azure Search over Elasticsearch was one we spent a bit of time on. In the end we went with Azure Search because it was going to be the easiest thing to get out the door: at the time there was no hosted Elasticsearch option on Azure, and Elasticsearch had features we really wanted, but we still went for the cloud-native option just to get the rebuild done.

Continuous deployment was a no-brainer. Today we do about seven deploys a day; when the rebuild was first done we were doing about three a week, and as the team's gotten bigger and we've gotten better at building smaller and smaller features and working incrementally, that number has kept going up.

Another aspect: even though we were doing a full rebuild and had an opportunity to think about scalability, you don't want to architect for a hundred or a thousand times your current load, because that takes a lot more work. The advice I see out there is to design for ten times your current scale, but be prepared to rebuild before a hundred times your current scale. What we did do was make sure the architecture means that when we do come to rebuild something, we don't have to rebuild the whole platform again. We're now in a position where the architecture is a lot clearer and less entangled, and when something gets slow we can replace that piece, not the whole car.

Another thing — in my experience some companies don't do this, but I've found it has real, tangible business benefits — is optimising for the developer experience: deliberately making your developers productive. It sounds like a no-brainer, but there are a lot of companies where the architects just steam off in a direction, and okay, it's going to be a cool architecture, but it's a lot of work to even get one bit of code written.
For example, there's no local emulator for the particular platform you're being asked to use. On Azure you've got emulators for Storage and Functions, and there's no problem emulating what App Services does, so you can get stuff working on your dev machine without having to hook up to a dev instance on Azure — that makes things simpler and easier. And there are a lot of other ways the developer experience can be improved: with React you've got hot module reloading — absolutely worthless in production, but as a developer you can get stuff done just so much quicker with that kind of thing. So we always prioritised being able to enable those kinds of experiences.

Here's the big list of technologies we went with on the rebuild. I'm going to mostly talk about the Azure technologies; I'll skip over the frameworks, because you've heard of them, and I'll briefly touch on which third-party services we've gotten a lot of value out of, and why. At this point I'm going to ask for any questions — and for the people online, if you could type out your questions as you think of them, Buddy is going to read them out to me when he sees them. For the rest of this talk it's effectively one slide per technology, so I'll stop at the end of each slide and ask for questions, especially for the online folks, and there'll be questions at the end as well if you've got anything general. But for everything so far — any questions?

Q: The feeling I'm getting is that you were in almost one big clump, and now you're looking to go towards microservices — piecing it off so that when you come to upgrade, change, or add things, you've got that commonality of microservices being able to talk to each other?

What we've actually gone from is a big-ball-of-mud monolith to a more deliberately architected monolith, which relies quite heavily on various platforms, in a way that is — what's the word — more cohesive, more modular. I'm not in any way upset that the platform we replaced was a big ball of mud that was quite hard to refactor: we were trying to prove out our product-market fit, and everyone was working as hard and as fast as they could, adding this feature and that feature, so we never really got a chance to take a breather and step back. The new platform, just because it's a more clearly architected monolith, is what means that in future, when we evolve the architecture, it's going to be easier to do in pieces rather than all at once. I do think microservices are a great idea; for us, the team wasn't big enough to want them yet, but we're ready to move in that direction. We've since built a new product — this is a tangent from my talk — a separate product from our main one, which deals with PCI card data. For that piece of software to be PCI compliant we deliberately made it its own completely separate thing, and it uses plain HTTP API requests to communicate with the main platform. So that's our first, like, "macroservice".

Q: Just one general question before we get into this one — why didn't you go with Azure AD B2C?

Great question — was that Rory?
We did look at Azure AD. There's more out of the box with Auth0, and in particular more user-facing stuff out of the box: we got sign-up emails, forgot-password emails; we got a portal we could give our customer success team to go in and manage our users without having too many permissions; we got that thing where if you sign in from the wrong country it warns you, whatever that's called. There was just more stuff we didn't have to write a single line of code for with Auth0 over Azure AD. Cost-wise, I think in future we may consider moving, because Auth0 is expensive.

Okay — App Services. If you want to run your code on something, App Services is a great place to do it; from our perspective it's the platform-as-a-service way to host your website, though there's other stuff you can do on it too. Our web application is just an API: a React front end and an ASP.NET Core MVC back end that talks to everything else. Deploying a .NET website to App Services is the number one thing they imagine you doing with it, so it was really easy. Moving from virtual machines onto App Services made us really happy: you don't have to patch the machines, restart them, or worry about them being down. Scaling out is something you can do automatically — set up rules so you get more servers when you've got high demand; we're doing that and it's working really well. Scaling up is easy too, so when you suddenly sign the biggest retailer in the world, your US instance can get a lot more powerful without you doing much work at all.

In terms of how we deploy onto App Services, we kept our old model. We'd been using TeamCity for CI, building the packages, and Octopus for deploying onto our virtual machines, and as we did the transition it was nice to have the same CI/CD systems in place — so today we still have Octopus deploying our packages into App Services, which could be the reason we're having one little bit of trouble. Overall it's been awesome, but the one thing that's been a bit bumpy for us is deployment slots. That's the mechanism where you deploy your new code, and once it's warmed up, all your traffic flips over onto the new version — sometimes called blue-green deploys. We found that every now and then developers have to get involved in the actual swap process. Recently we discovered part of it is just that we need to retry the cleanup of those slots: there's a maximum number of slots you can have open, and sometimes it fails to delete one that's no longer needed because it's "still in use", or something like that. Everything else has been working pretty much perfectly. I think it's because we've written our own PowerShell that Octopus executes to actually perform those swaps in the deployment chain; if you were using Azure DevOps to do your CD, I think you probably wouldn't bump into this, because that path is well trodden.
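To make the swap step concrete: we drive it from PowerShell, but underneath it's one ARM REST call. Here's a rough C# sketch of the same call — the resource names, the token acquisition, and the api-version are placeholders I've assumed for illustration, not our actual deployment script:

```csharp
// Hypothetical sketch: swapping a "staging" slot into production via the ARM REST API.
// We actually do this from PowerShell in Octopus; this is just the same call from C#.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class SlotSwap
{
    static async Task Main()
    {
        var sub   = "<subscription-id>";
        var rg    = "<resource-group>";
        var site  = "<app-service-name>";
        var token = "<bearer token>"; // e.g. obtained via Azure.Identity's DefaultAzureCredential

        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token);

        // "slotsswap" on the staging slot swaps it with production (the blue-green flip).
        var url = $"https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}" +
                  $"/providers/Microsoft.Web/sites/{site}/slots/staging/slotsswap?api-version=2022-03-01";
        var body = new StringContent("{\"targetSlot\":\"production\",\"preserveVnet\":true}",
                                     Encoding.UTF8, "application/json");

        var response = await http.PostAsync(url, body);
        response.EnsureSuccessStatusCode(); // 202 Accepted; the swap completes asynchronously
    }
}
```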
Questions? — App Services: Linux or Windows based, or don't care? — It's currently Windows, because we thought sticking with Windows would cause fewer problems, and then we couldn't be bothered moving. However, we're now looking at Linux, just because it's cheaper — and it's all .NET Core, so why not. Any questions in the room on that?

Q: Any performance problems with App Services?

None that weren't just "we did something really stupid in the code" or "we need to scale up because we suddenly have a lot more users at this time of day" — nothing like the weird performance issues, where something is slow and you can't figure out why, that I've heard of other people having on App Services.

Q: Did you ever consider AKS instead of App Services?

Because we're kind of a monolith, we stayed away from Kubernetes for now. We've recently hired someone who knows a bit about Kubernetes, and he's not that keen for us to move either — it's just not quite the right tool for what we're doing right now. I won't rule it out in future.

Cool. I talked about scalability, and the re-architecture being an opportunity to do things a bit more cloud-natively. Azure Functions are one thing we've been able to leverage for those occasional-but-slow tasks. For example, when a user uploads an image we resize it into various sizes, and if a lot of users upload a lot of images at once, we don't want our operational web server to get slow and people to be unable to use the app in other ways — so we moved those slow tasks over into Azure Functions. If you look at AWS Lambda, that's very "serverless", in that you literally shouldn't even think about what server it runs on, whereas Functions is a bit more of a hybrid: you can run your functions inside an App Service plan, where you say "I want this size of virtual machine ready for my functions", or you can go with the consumption plan, where it gets very elastic — when you need 50 servers you've got 50 servers, and when you need zero you're paying for zero. The consumption plan documentation tells you it could take up to 10 minutes to warm up from a cold start, and that you probably don't want to handle your users' API requests on the consumption plan. We've never seen anything like that slow in reality, and I've seen blog articles from people doing performance testing where Azure Functions came out best on cold starts, for what their experiment was. I'd love to see a histogram of what it really looks like, but you just can't rule out that on the consumption plan, every now and then, there won't be a server available for your user, and Azure will have to copy the code onto a new server, warm it up, and start running it — and it could be slow. If that's a problem for you, you can move to the premium plan, where you do have reserved instances, but you're paying for them even when no one's using them.

Another thing about Functions: there are a lot of built-in bindings for how data comes into or goes out of a function. We started using those — in particular, we had functions handling Event Grid events — but they didn't work exactly the way we wanted. My advice is that if a binding isn't working for you, you don't have to use it; they're pretty simple to emulate, and you can write your own code that does exactly what the binding did. We wanted to tweak the Event Grid event deserialization, so we turned the function that was accepting Event Grid events into a function that was a plain webhook, and handled the validation handshake with the Event Grid delivery system ourselves.
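For anyone who hasn't done that handshake by hand: when you create a webhook subscription, Event Grid first POSTs a `Microsoft.EventGrid.SubscriptionValidationEvent` carrying a validation code, and your endpoint has to echo it back. A minimal sketch of the idea, with illustrative names — this isn't our actual function:

```csharp
// Sketch: receiving Event Grid deliveries as a plain webhook (in-process Functions model),
// including the subscription validation handshake, instead of using the built-in binding.
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Newtonsoft.Json.Linq;

public static class EventGridWebhook
{
    [FunctionName("EventGridWebhook")]
    public static async Task<IActionResult> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req)
    {
        // Event Grid always delivers a JSON array of events.
        var events = JArray.Parse(await new StreamReader(req.Body).ReadToEndAsync());

        foreach (var e in events)
        {
            if ((string)e["eventType"] == "Microsoft.EventGrid.SubscriptionValidationEvent")
            {
                // Echo the validation code back so Event Grid activates the subscription.
                var code = (string)e["data"]["validationCode"];
                return new OkObjectResult(new { validationResponse = code });
            }

            // Otherwise deserialize e["data"] however you like — the whole point of
            // skipping the binding is having full control over this step.
        }
        return new OkResult();
    }
}
```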
Another thing I'll mention, because I think it's a good principle in some cases: a lot of the advice around microservices says you don't want your services reaching directly into other services' databases — you want to encapsulate the storage of information behind the public interface of your service, just like object-oriented encapsulation. It's a good idea. We decided our functions couldn't talk directly to the database our app server talks to, and had to go through the app server for any queries or updates, and we have not regretted that decision. At the time we were using an oldish version of .NET Core, and Azure Functions took a while to upgrade — I think to version three — but that isn't going to be a problem for anyone these days. Overall I'd say: once you read the documentation and understand what Functions are for, they work really well, and they do what they're meant to. Any questions?

Q: Did you find minimal downtime while scaling up or scaling out?

Was that on App Services or Functions? We do most of our Functions work on non-time-critical tasks — asynchronously resizing images, or firing off an email that took a long time to compute and render — so we never had a reason to look into scale-out being slow. Certainly no timeouts. Any questions in the room? Cool.

Speaking of looking into slow things: App Insights. This is the thing I was most pleasantly surprised by. Coming from an AWS startup, where you have to stitch together a number of different technologies to see what's going on in your servers — logs, metrics, what caused an error — App Insights has all the features you want in one place, and that brings really good benefits. The other side of that is there are a lot of App Insights features that just work out of the box, as soon as you turn them on, if you're using the technologies Microsoft expects you to be using. I think a lot of the reason Azure has been working so well for us is that we're deliberately staying on the golden path — we're using .NET Core, we're using SQL Server — and all of that means the people building Azure had us in mind when they were building it.

As soon as you turn App Insights on on a .NET website, you're getting really good intel on all the log messages, and the distributed tracing just works, which is really powerful. We can go into App Insights and see: the user clicked this button, which led to your SPA making these two API calls to your web server; inside one of those calls, the timing for the client was this but the timing on the server was that, so there's obviously this much network overhead; and then it led to these two SQL queries, which took this much time. It's really insightful to be able to see what's going on, especially when you're performance tuning.

There's a really good way of querying it — and Dan informs me it's pronounced "Cousteau", after the explorer, which blew our minds, because we'd been calling it "Kusto" for two and a half years. We can see why Cousteau makes sense, because you're exploring the data, but I'm still not happy.

So you get a lot out of the box, but you should also look into the ways you can leverage it further. Metrics are where you can start deliberately recording your own product-level metrics — the user got this far in their journey, this many dollars were paid by this particular user — and then you can explore those metrics inside App Insights as well. I've written down "powerful and cost effective", and I've been reminded that when you say cost effective, you still have to stay on top of it. It's pretty easy to accidentally send way too much information, in which case your bill will go up, and the cost alerts you've set up will warn you that maybe you should turn off debug-level logging going to App Insights. Existing frameworks like Entity Framework produce a lot of info-level log messages, so you can configure things like "I want my info messages, but not Entity Framework's" — that'll save you some cash.

Questions? — Sure. The question was: you have Raygun on your slides, and you're talking about errors in App Insights — what's the overlap? We use Raygun to manage "something is broken, someone needs to look at it". Raygun is good at grouping up errors, and it's really good at notifying you about things going wrong — things that are surprises. When Raygun notifies us about something going wrong, we'll then normally use App Insights to investigate the forensics of what actually happened. Does that answer your question? — Right, and filtering at the source would make more sense, because then those messages don't even get sent.
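As a concrete illustration of the metrics and log-filtering points from a moment ago, here's a minimal sketch — the metric name and class are invented for the example:

```csharp
// Sketch: (1) pushing a custom product-level metric into Application Insights, and
// (2) keeping our own Information logs while silencing EF Core's chatty ones.
using Microsoft.ApplicationInsights;
using Microsoft.Extensions.Logging;

public class IncidentReportingTelemetry
{
    private readonly TelemetryClient _telemetry;

    public IncidentReportingTelemetry(TelemetryClient telemetry) => _telemetry = telemetry;

    public void ReportStepCompleted(int stepNumber)
    {
        // Shows up under custom metrics in App Insights, queryable with Kusto.
        _telemetry.GetMetric("ReportWizardStepCompleted").TrackValue(stepNumber);
    }
}

public static class LoggingSetup
{
    // Keep Information logging for our own code, but only Warning and above
    // from Entity Framework Core — the messages never get sent, saving ingestion cost.
    public static void Configure(ILoggingBuilder logging)
    {
        logging.SetMinimumLevel(LogLevel.Information);
        logging.AddFilter("Microsoft.EntityFrameworkCore", LogLevel.Warning);
    }
}
```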
So — Event Grid. Whenever something notable from a business perspective happens, we're starting to push that information through Event Grid. Event Grid is a pub/sub system: you have topics, you put data into the topics, and you can have multiple subscribers that get those topics delivered to them. We looked at Service Bus as well — an older technology than Event Grid, and very, very powerful. The reason we went with Event Grid is that it just seemed simpler, and it's the thing Azure itself uses in most situations where Azure needs to deliver you some kind of callback. For example, when a blob gets uploaded into Blob Storage, if you register your interest in that happening, Azure uses Event Grid to deliver you the webhook events saying something's been uploaded. So I said: okay, this is the native pub/sub of Azure, and it's probably going to be the simplest thing to configure and use.

That's worked really well for us. We use the storage integration for when a user uploads a file, and we also have our own topics where we put information we want to react to asynchronously. If a user finishes reporting an event on our platform, we'll put that information onto Event Grid, and Event Grid will tell all the things that need to care: an email might get sent to the right people, a search document gets sent off to Azure Search, and we produce an activity item in our activity feed — it's a little bit like Facebook in our app; there's a feed of everything going on that you can filter across. The cool thing about using Event Grid this way is that we can have different configurations in different environments. I mentioned our second, PCI-compliant product — we don't run it in the US at the moment, so we use Event Grid to notify just our New Zealand and Australian instances of that product about events being published. There's pretty simple filtering on there too, so you can say this subscriber cares about these kinds of events and that one cares about those. And when I said it was simple, part of that is just the API: you call its REST API, and it delivers you webhook events, so it's really straightforward to think about.

I've got a note here about it being very scalable — this turned into a problem for us. You can put 5,000 events into Event Grid in one API call, and it will deliver those 5,000 events to all of its subscribers. The thing that tripped us up was we assumed that if our server took a little while processing them, Event Grid would queue them up for us. In fact, if your web server takes a little time to handle each request, Event Grid very quickly scales out and starts delivering the events in parallel, which really, really hammered our web server. We looked for options to throttle that, and there was nothing built into Event Grid. This is somewhere Service Bus has had the feature forever: you can say "don't deliver these in parallel — please don't spin up a hundred connections to make sure all those events get delivered quickly". I'll talk about the throttling mechanism we put in place later, when I get to Azure storage queues, because that's what we used.

Q: I thought there was a setting for that, so you can slow it down? — So you can slow down Event Grid? — I could be wrong, but I'm sure there was something where, past a certain rate, it can speed up or slow down. — We didn't find anything; I tried HTTP-native ways of slowing it down, like returning a 429, but we could be wrong — I think we re-engineered that part less than two years ago. Any other Event Grid questions?
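For a sense of what the publishing side looks like, here's a minimal sketch using the `Azure.Messaging.EventGrid` client against a custom topic. The endpoint, key, and event type name are placeholders, not our real ones:

```csharp
// Sketch: publishing a domain event to a custom Event Grid topic.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Messaging.EventGrid;

public class EventPublisher
{
    private readonly EventGridPublisherClient _client = new EventGridPublisherClient(
        new Uri("https://<topic-name>.<region>-1.eventgrid.azure.net/api/events"),
        new AzureKeyCredential("<topic-access-key>"));

    public async Task PublishEventReportedAsync(Guid eventId)
    {
        var gridEvent = new EventGridEvent(
            $"events/{eventId}",        // subject
            "Example.Event.Reported",   // event type — subscribers filter on this
            "1.0",                      // data version
            new { EventId = eventId }); // payload, serialized to JSON

        // One call can carry a whole batch; each subscriber — email sender,
        // search indexer, activity feed — gets its own delivery of the event.
        await _client.SendEventAsync(gridEvent);
    }
}
```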
Media Services does what it says on the tin, and we didn't really change how we use it from the old platform. Our users upload videos from quite esoteric video management systems — the CCTV cameras they use in stores produce a lot of different file formats. The only one we can't take is the .exe file format; for everything else, Media Services takes it in and gives us back a nice streamable, variable-bit-rate video we can play back to our customers. The only place this has been an issue for us — and it isn't really a technological issue — is that we have a use case where a lot of video gets uploaded but isn't viewed very often. I think when Azure put this together they were imagining broadcast streaming: you pay quite a bit for transcoding (which makes sense from a CPU perspective) and then playback is really cheap. For our use case, with the biggest retailer in the world, this is starting to get really expensive, and we're looking at how to transcode on demand rather than preemptively.

Q: A couple of questions came through just as you started. Were there any security concerns with Event Grid just being public facing?

No VNet. The advice we followed was to make your URLs really hard to guess — you put a key in the URL for delivering data onto the topic. That's what we've done, and we haven't had any problems. If you did require everything to be on a VNet, that would be a harder area for us.

Q: Why didn't you look into competing consumers to deal with the volume of events?

Partly that was down to our decision to keep the database behind our app server, on the encapsulation side of things. Rather than having consumption-plan functions scale out to handle that big load of events — which was still going to hit our database eventually, through our app server — we preferred to find a way to throttle it, because they weren't urgent events for us to handle.

Q: And one on Media Services — how did you get your customers to give you access to their CCTV systems?

What our product actually gives you is a form where you input data, and all we take in terms of CCTV integration is that the user can upload a file, which we recognise to be a video. We do integrate with CCTV systems at fuel sites, where we can integrate with number-plate recognition systems as well, but for our retail sites it's literally the user going to the computer in the server room that's connected to the CCTV system, hitting the export button — and not picking the .exe format.

Q: Can you not set up API keys and the like for the webhooks, to make it more secure?

That's pretty much what we're doing — but the key has to be in the URL. — Not anymore; it's changed. — And we're moving towards AD authentication on that. — There's a reason you're at the front!

Cool. I talked a little about Azure Search being chosen over Elasticsearch. When our users are on our platform, there's a big search box at the top of every screen, and it's used quite frequently — you want to put in someone's name, or some clothing someone was wearing, or the name of a product or a store or a company or a user. Using Azure Search to power that has taken a big load off our database. In particular, we can think about the corpus of searchable things as its own data store, put lots of different kinds of things in there, and have one omni-search box that finds you every type of thing.
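To give a flavour of what that omni-search looks like with the `Azure.Search.Documents` SDK — the index name and field names here are invented for the example, not our real schema:

```csharp
// Sketch: one index holds people, products, stores, users...; each document
// carries a type field so the UI can render the right kind of result card.
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public class OmniSearch
{
    private readonly SearchClient _client = new SearchClient(
        new Uri("https://<service-name>.search.windows.net"),
        "omni-index",
        new AzureKeyCredential("<query-key>"));

    public async Task RunAsync(string userQuery)
    {
        var options = new SearchOptions { Size = 10 };
        options.Select.Add("entityType");
        options.Select.Add("displayName");

        SearchResults<SearchDocument> results =
            (await _client.SearchAsync<SearchDocument>(userQuery, options)).Value;

        await foreach (SearchResult<SearchDocument> hit in results.GetResultsAsync())
        {
            Console.WriteLine($"{hit.Document["entityType"]}: {hit.Document["displayName"]}");
        }
    }
}
```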
That's been really powerful from a user perspective, and it's a brand new thing in the rebuild. In the old system we were just querying SQL to get data back, and that was a big problem, because some queries would slow down everyone. Moving search out into its own system was great for transactional performance.

Azure Search is based on Lucene, just like Elasticsearch, so you get a lot of natural-language features — word stemming, handling misspellings, that kind of thing — which is really helpful, and it's very scalable and performant. The advice I'd heard around Elasticsearch was that it's not a durable data store: if you put data into Elasticsearch, make sure you've got your original copy somewhere else, so that when Elasticsearch explodes you can start over and repopulate a new cluster with the same information — and I know multiple startups in Auckland that have had to do exactly that. The biggest problem is that if you've got a cluster of, say, three nodes and one of them stops working, the other two immediately fall over under the redistributed traffic, so you have to very heavily over-provision if you're operating Elasticsearch yourself. I haven't really tested that on Azure Search — I'm sure it's fine — but either way, we treat our database as the source of truth and project out of it all the data we want in Azure Search. Every time a user publishes a new event, we rebuild the search document for that event and upsert it into Azure Search.

We've been really happy with Azure Search: it does what it needs to do, it's performant, it's easy to use, the documentation is great, and it's got a whole lot of features. These days you can use Cognitive Services to extract information out of PDF documents and index that, and there are a lot of other transforms you can apply while indexing — particular ways to treat words so they're more searchable; if you've got names you can apply Soundex filters, and word stemming works through that mechanism too. The only thing Azure Search doesn't do that we wished it did, even from day one, is aggregations. Elasticsearch lets you do a group-by on your query, and for us that would have been quite helpful, just because of how we arrange our data: sometimes two of our documents in Azure Search actually represent the same thing, and we'd like that to be one search result. We got around it by not putting numbers on our paging — it's just an infinite scroll, so you don't actually notice that you only got nine results when you should have had ten. Elasticsearch would let you do that properly and pull in the right data as you go.

Questions? — Elasticsearch? Not yet — I'll give it a minute. — The word "upsert"? — Yep: insert-slash-update. If it's already there, you update it; if it's not there, you create it.

Q: What do you use to pull the data from the SQL database into Azure Search?

Azure Search comes with a lot of pre-built integrations for pulling data in. We looked at those and wanted more control over what was happening, so we call the Azure Search API ourselves when we know there's more data to put in.
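That upsert call, roughly — the document shape below is invented for illustration; ours is a much chunkier denormalised document:

```csharp
// Sketch: after the domain event fires, rebuild the search document from SQL
// and upsert it. MergeOrUploadDocumentsAsync is the "update it if it's there,
// create it if it's not" operation.
using System.Threading.Tasks;
using Azure.Search.Documents;

public class EventSearchDocument
{
    public string Id { get; set; }          // the index's key field
    public string EntityType { get; set; }
    public string DisplayName { get; set; }
    public string FullText { get; set; }    // denormalised from many SQL tables
}

public class SearchIndexer
{
    private readonly SearchClient _client;
    public SearchIndexer(SearchClient client) => _client = client;

    public Task UpsertAsync(EventSearchDocument doc) =>
        _client.MergeOrUploadDocumentsAsync(new[] { doc });
}
```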
So when the user hits the publish button in our system, that updates all their data in SQL and pops something onto our Event Grid topic. Then one of the subscribers on the topic says "there's this new thing, you need to react to it", and we query SQL again — actually pulling quite a lot of data from a lot of different tables to build out a pretty chunky search document — and push that into Azure Search.

Q: Have you ever tried your disaster recovery?

There have been situations where our indexer was failing for more than a few minutes, so we've had to do partial rebuilds. We haven't had to do a whole rebuild; I don't think it would take a matter of days, but it wouldn't be fast. — Great question. Peter, on that: we essentially did a full rebuild when we migrated onto the new system, didn't we? — Yeah, and the process hasn't really changed since then; that's a good point. The migration from the old system onto the new one was a really fun and exciting exercise, and we got to test out a lot of these kinds of things — but we don't get to do that anymore, unfortunately.

Cool — Blob Storage. This is pretty straightforward: you put files in Azure Blob Storage; that's where they belong. You could put them in SQL — that would be slower. We had been using Blob Storage on the old platform: the user uploads a file, it goes to the web server, and the web server puts it in Blob Storage. As part of making things more scalable and cost effective, we started to leverage SAS tokens more in the rebuild. In the new platform our API hands out URLs to the users which only last about ten minutes, but give the user direct access to download — and even upload — straight from Blob Storage. This has taken a lot of work off our web server, and the performance is better for the users as well; I definitely recommend that kind of approach. AWS S3 has something similar — I can't remember exactly what it's called right now, but it's not "SAS tokens". SAS tokens can also operate in a mode where you can cancel them server-side when you need to, so there's quite a bit of power and security there. Even though handing out storage URLs sounds a bit scary from a security perspective, it's still worth looking at, and we haven't had any problems with the scalability — it's a very scalable product. Any questions on Blob Storage?

Q: Do you employ lifecycle policies to manage data tiering or data retention?

Someone correct me if I'm wrong, but you can tell Azure Storage that a particular piece of data should stay in hot storage for, say, the next half a month, and after that — since we don't think people will look at it — move it down a tier so you pay less for it. We don't do that right now. It's definitely worth looking at, because we have a ton of data that's pretty time-relevant.
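Back on the SAS tokens for a second — the short-lived download link idea, sketched with the `Azure.Storage.Blobs` SDK. Container and blob names are placeholders, and this assumes the client was constructed with a shared key credential:

```csharp
// Sketch: hand the browser a ~10-minute read link so it downloads straight
// from Blob Storage instead of streaming through the web server.
using System;
using Azure.Storage.Blobs;
using Azure.Storage.Sas;

public class EvidenceDownloads
{
    private readonly BlobContainerClient _container;
    public EvidenceDownloads(BlobContainerClient container) => _container = container;

    public Uri GetDownloadUrl(string blobName)
    {
        BlobClient blob = _container.GetBlobClient(blobName);

        // The generated URL embeds a signature; it expires on its own after ten minutes.
        return blob.GenerateSasUri(BlobSasPermissions.Read,
                                   DateTimeOffset.UtcNow.AddMinutes(10));
    }
}
```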
Storage queues. We used storage queues to throttle our Event Grid events: you can very quickly configure an Event Grid subscription to deliver straight into a storage queue, and the way you deal with an Azure storage queue is you pull items off it when you're good and ready, as opposed to having them thrown at you a thousand at a time. This was a really quick and easy way for us to handle those 5,000 Event Grid events at our own pace instead of unthrottled. It was a pretty recent development to build this, and it worked perfectly — we haven't had any problems with it.

Table Storage is the third thing in Azure Storage that we use. It's quite a lot like Blob Storage, except you're not keeping files, you're keeping data that you know sits in particular columns — you do define what columns are in Table Storage — and it matters quite a lot what data you put into the partition key and the row key, in terms of being able to query the data out effectively. You can also think of items in Table Storage as having a primary key, more clearly than files in Blob Storage, so the idea of upserting is more straightforward. We've used Table Storage from day one to store our read-audit log: every time a user looks at a particular piece of data, we put that straight into Table Storage. We've only recently had to test that it works — and it does, so that's cool. We're also starting to look at using it as the backing store for other Azure Search use cases, where we put the data in Table Storage and then project it into Azure Search for querying: because Table Storage is only good for querying on the partition key and row key — all your other columns are incredibly slow to query on — projecting into Azure Search makes querying that data a lot faster. I'm going to go a little bit faster because we're running out of time, but any questions on that one?
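To make the read-audit-log shape concrete, here's a sketch with the `Azure.Data.Tables` SDK. Keying by user in the partition key with a reverse-chronological row key is one common design; our actual schema isn't public, so treat the whole shape as illustrative:

```csharp
// Sketch: append-only read-audit entries in Table Storage.
using System;
using System.Threading.Tasks;
using Azure.Data.Tables;

public class ReadAuditEntry : ITableEntity
{
    public string PartitionKey { get; set; }   // e.g. the user who viewed the record
    public string RowKey { get; set; }         // reverse ticks => newest entries scan first
    public string ViewedEntityId { get; set; }
    public DateTimeOffset? Timestamp { get; set; }
    public Azure.ETag ETag { get; set; }
}

public class ReadAuditLog
{
    private readonly TableClient _table;
    public ReadAuditLog(TableClient table) => _table = table;

    public Task RecordViewAsync(string userId, string entityId) =>
        _table.AddEntityAsync(new ReadAuditEntry
        {
            PartitionKey = userId,
            RowKey = (DateTimeOffset.MaxValue.Ticks - DateTimeOffset.UtcNow.Ticks).ToString("d19"),
            ViewedEntityId = entityId
        });
}
```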
Because of those querying realities in Table Storage, we did look at Cosmos DB as a place to store data when we didn't know up front which columns we'd want to query on. Rather than using SQL — this was a particularly high-volume use case — we used Cosmos DB for one particular thing, and we found it's actually just too expensive for us: the queries still take a matter of minutes when we need them to be fast (Azure Search would have done them very fast), and we're having to scale it up to quite a high level just to get them down that far. So this is something we're actually moving away from. I'm not saying you should never use it, but it's the kind of thing where you'd want to look at what it offers and tick a few of those boxes before you start using it. It has a lot of different APIs for what it can look like — you can treat it like Table Storage, for example — but that's just a convenience thing, not a good-architecture thing. Questions on Table Storage?

Azure SQL. A lot of people are familiar with SQL Server — it's a 32-year-old technology, and people say you can use new technologies on your front end, but picking a boring technology for your back end is a really good idea. I subscribe to that. All of our transactional data goes into Azure SQL, and that works really well: we know how to performance-tune it, we know how to operate it, and the backups are reliable. Using Azure SQL, as opposed to SQL Server running on a virtual machine somewhere, gives us the enterprise-grade features — backups, geo-replication, security — working out of the box. One thing that was interesting was the automated index management. SQL Server itself can tell you what indexes it thinks you're missing, and Azure has turned that into more of a product: it looks at those suggestions, applies them, measures whether they made a difference, and keeps doing that over time, by itself, for you. We turned it on to see what it would do — we were interested, given we have different production environments with different workloads; maybe it made sense for our databases to have different indexes. We've since decided it's really interesting to see what it proposes, but we still want humans deciding whether to implement those indexes. There were cases like "you need an index on columns A and B" plus "you need another index on columns B and A", where a human would go "okay, I can make that one index", but the automated index manager just creates them all. So we recently turned it off, reviewed the indexes it had created, and cut our database down by about 20 percent in storage space by deleting indexes that weren't necessary. For us, Azure SQL has been working really well, but we're very deliberate about not using it for certain things: we don't put our files in there, and when we know we're getting thousands of rows a minute, we put that in something else, because we don't want to slow down the primary data store all of our users are using.

Q: Entity Framework? — Yeah, Entity Framework Core, the .NET ORM. That manages schema changes for you: when a developer decides you need a new table, it produces a SQL script, and our Octopus deployment system is what actually rolls those out.

Q: Any performance issues experienced on Azure SQL PaaS versus SQL on IaaS?

I can't really compare our Azure SQL PaaS with our IaaS SQL, because our IaaS instance had a completely different schema — it had accreted, it was quite baroque, and it was very, very slow. I do know that on Azure you've got micro-outages as a consideration, so you want to retry your connection openings, but that's all built into frameworks like EF for you. We scale our database up and down in response to demand, and that works well.
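That retry behaviour is opt-in in EF Core. A minimal sketch of wiring it up — the context name is invented:

```csharp
// Sketch: enabling EF Core's built-in retrying execution strategy for Azure SQL,
// which absorbs the transient "micro outages" (dropped connections, failovers)
// without hand-written retry loops.
using Microsoft.EntityFrameworkCore;
using Microsoft.Extensions.DependencyInjection;

public static class DatabaseSetup
{
    public static void Configure(IServiceCollection services, string connectionString)
    {
        services.AddDbContext<PlatformDbContext>(options =>
            options.UseSqlServer(connectionString, sql =>
                sql.EnableRetryOnFailure(maxRetryCount: 5)));
    }
}

public class PlatformDbContext : DbContext
{
    public PlatformDbContext(DbContextOptions<PlatformDbContext> options) : base(options) { }
}
```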
Last one: ARM templates. I recommend infrastructure as code as an approach. At the time, ARM templates were the best choice for us; we're looking at Pulumi at the moment, and there's also Bicep — someone should do a user group talk on that. The thing with ARM templates is that writing them by hand is horrible, but you can just export something that's already in the portal, and that's really easy — you then just have to refactor it to put variables and so on into it.

Q: Why is only 90 percent of your infrastructure encoded in ARM templates, and what happens to the rest?

Thanks, Peter — I was rushing because we're out of time. ARM templates get you 90 percent of the way there, but the last mile of configuration sometimes has to be your own PowerShell script. Templates can do what you can do in the portal, but if you want to roll out a brand new production environment, the templates will create the SQL server for you — you still have to run the code that creates the schema and the tables inside that database. It's a similar story across other technologies: when we started, an ARM template could create a storage account but not a storage container. It can do containers now, but back then we had to do it in PowerShell. It's a last-mile situation.

Q: Are there any resources that an ARM template just isn't able to create in Azure?

I'm not aware of any high-level resources that are simply not an option with an ARM template; the problem is when you get to the final, detailed stages of configuration.

Q: On infrastructure as code — do you control your infrastructure scaling via templates and CI?

We don't, but when you don't, you do have to be careful: if you scale up a database manually and then deploy your ARM templates, obviously it scales itself back down again.

Q: Did you look at Terraform?

We did. Terraform works on Azure and on AWS, so if someone already understands Terraform they should be good to go. We just felt that, at the time, the Terraform team weren't staying up to date with what was on Azure; also, the fact that an ARM template can be exported out of the portal, from something you've just created, makes it really easy to use. Great question.

Cool — I'm going to really quickly go over the most important third-party services we paid for instead of building ourselves. Auth0 do authentication for you: if you suddenly decide you need MFA, it's already there; if your MFA needs an app, they've already got an app for you. There's a whole lot they give you out of the box. The biggest problem I've had with Auth0 is that they don't do particularly elastic contracts: when you sign up, you have to say how many users you're going to have — and we don't know, we're a startup.

LaunchDarkly is probably the most well-rounded feature flag product out there. Its architecture is actually really cool, and it's quite resilient; you can make it even more resilient by giving it your own Redis cluster, which we're doing right now. The portal is great too — you can have product people logging in and turning things on and off.

FullStory is a system that lets you literally record the web-page interactions a user is having, so when users have problems you can replay exactly what they were doing. Technologically the way it works is really cool, with the HTML itself being shipped over, and the way we've configured it, we don't actually see any data that's in our system: all the text is redacted, but you can still see what button the user clicked and where that went. You can also look at a recording of the developer console in FullStory, so if any errors occurred while the user was doing something, you can see the errors that were in the browser console even if they never looked in there. We've found it really good for product insights and for tracking down what bugs are and how they happen.

Raygun I touched on earlier: it's good for throwing errors at. It's not a complete logging solution where you put your info and warning logs, but every time there's an actual error, we send it off to Raygun, and it has features around the workflow — "we think we've fixed this, tell us if it recurs", or "we just did a release", so it can tell you which release actually introduced an error or where the error rates climbed. They've branched out into other features that are really useful for developers as well — I recommend taking a look.
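Going back to LaunchDarkly for a second, since developers always ask what a flag check actually looks like — a minimal sketch with their .NET server SDK (v6-era API; newer versions use a Context instead of a User, and the flag key here is made up):

```csharp
// Sketch: a server-side feature flag check with the LaunchDarkly .NET SDK.
using LaunchDarkly.Sdk;
using LaunchDarkly.Sdk.Server;

public class FeatureFlags
{
    // One long-lived client per process; it keeps a streamed connection to LaunchDarkly.
    private static readonly LdClient Client = new LdClient("<server-side-sdk-key>");

    public bool ShowNewActivityFeed(string userId)
    {
        var user = User.WithKey(userId);

        // Product people can flip this in the LaunchDarkly portal; the SDK
        // picks up the change without a redeploy. Third arg is the fallback.
        return Client.BoolVariation("new-activity-feed", user, false);
    }
}
```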
And then Intercom is something we've always used: there's a little widget in the bottom right of the web page, and our users use it to interact with our support team. Intercom is quite a cool company, in the way they run their product and the features they come up with. They added product tours as a feature exactly when we needed it — a thing that lets you give the user a tour around the screen: this button does that, go over here if you need to do this particular thing. We were looking at building that kind of stuff ourselves, but Intercom just ships it out of the box now.

Q: Questions on the overall stuff — do you use FullStory for UI/UX research?

Yep, we use FullStory for UI/UX research.

Q: Are there any user-consent issues with that?

FullStory has features for user data consent: if you've got a public-facing website, you can configure it to ask users whether they're happy to have their interactions recorded, and if they say no, FullStory stays off. For us, because our users are covered by a contract we have with our customers — we've actually put this into the legal contracts that are part of it — we don't enable that feature in FullStory; well, we had to disable it.

Q: Any CDN?

Yep. We used the Azure CDN from the beginning, mostly just for serving our JavaScript, and we've recently switched over to Cloudflare for certificate management — we'll probably look at the CDN features there very soon. The Azure CDN was working well for us all that time.

Q: How long did the rewrite take, and how did you deliver features during it?

We thought it would take three months initially, and it ended up being about fourteen — there was a bit of re-estimating after the three-month guess. During those fourteen months there were two major features we built on the old platform to keep our customers happy, and we were very careful to manage communications with them about what was coming and why we were taking a break from adding features. One thing that really helped was that we got a really, really good product designer in, and the way he envisaged the UX on the new platform made all of our existing customers very happy to move over. That's a big thing.

Q: You mentioned you're using C#. First of all, are you happy with that? And if you were starting today, would you consider another option?

I'd say C# has been working really well for us. The only other thing we considered was Node, and I'm glad we went with C#. Entity Framework is a really good ORM, as ORMs go, and our back-end code is really complicated, so the unit-testing frameworks, the DI, the ORM, and the web API frameworks have all been really great for us. I personally like C# a lot — the features they keep coming out with really help productivity. From a talent perspective as well, there are a lot of really good .NET developers in New Zealand. Our board are mostly Australian-based and they thought I was crazy, but I showed them the options; if I'd been doing this based in Sydney, I think they would have been pushing me towards Ruby or Python.

We'll take the last questions.

Q: What was the migration strategy for existing customers?

It was a big bang for each geography.
We moved our Australian customers first, because they weren't using as many of the features we hadn't rebuilt yet as our New Zealand customers were. It was basically: turn off the platform one day, having warned them; run a whole lot of SQL scripts that transform the data into our new schema in place; then things like recomputing the Azure Search documents; and then flip the DNS over to point to the new system. I think we were down for eleven hours, something like that.

Q: Last one — what was the top design decision that allowed you to decouple your monolith?

I think the fact that we were just starting over meant it was a lot easier to plan out the structure of the code in the monolith. We have much better test coverage, but more than anything it's the modularity — being able to say that even if these two bits of code are running in the same process, they think of themselves as microservices, where the surface area of interaction between the two areas is high-level and public. There's a really good Codemania talk on YouTube — Josh Robb's talk on connections — and I think that way of thinking about code quality has been a lot more useful to us than things like SOLID. But if you've got unit test coverage of your code, that does actually take you a long way towards the code being maintainable, for other reasons as well.

Q: Did you engage any third-party security companies?

Yep. We have pen tests done regularly, and when we did the rebuild they took a really good look at it. They found more in code review than they found in the active testing — but yeah, they're good.

[Applause]

Just to call it out — yes, they are hiring. — Yeah; in particular, having signed that contract with that very large retailer, we are growing the team very deliberately at the moment.

So, thanks — and thank you everyone for showing up; that's been great. We hope to run many more of these, and if you've got anything you'd like to share, it doesn't matter where you are on your journey. This has been fantastic, because it's what it's really like in the real world, instead of the ivory tower some of us get to live in. So thanks again to Rob, and thanks to everyone for supporting us.
