Building A Petabyte Scale Warehouse in BigQuery (Cloud Next '18)


Hi there, my name is Tino Tereshko and I'm a product manager on Google BigQuery. I want to thank everyone for joining this session. There's just so much content, so many amazing stories to choose from, so from myself and all the other speakers I'm joined by, I want to thank you for picking this session specifically.

This session is all about scale: lots of data, large volumes of data, complex problems. But it's also about organizations that have a rich history of creating technology, a history of solving large-scale problems. And what better way to demonstrate the complexity and scale of something than to have not one, not two, not three, not four, but five presenters for one session? I think this is a record for Next.

All these organizations have tackled problems at their own, relatively unique scale. If you fast-forward to 2006, we had our own technical problems to solve: we couldn't do interactive analytics at Google scale. That's where Dremel came along. For the past ten-plus years, Dremel has been powering Google's data and analytics infrastructure. And BigQuery, as most people in this room already know, is powered by Dremel. BigQuery is Dremel: it's the same engineering team, the same infrastructure, the same codebase. So when you think about it, when you're running BigQuery, you're running on top of Google's innovation; you're running on top of Dremel.

What gives BigQuery its scale characteristics is its cloud-native architecture, with its unique architectural principles and considerations. It starts with a stateless, multi-tenant compute cluster, which is really what Dremel is, and which is a really unique offering. That's joined with a logical storage system that scales into petabytes and more, and the two are wired together by a network that scales to petabytes. Then we go beyond that: we separate compute and state, keeping intermediate state during processing, which accelerates many of our queries and gives us performance, durability, and scalability characteristics. We also let you share data in place without creating data silos, without moving data around needlessly: you bring analytics to the data and users to the data, not the other way around. And finally, we wrap it all in a fully managed offering that minimizes complexity and manageability overhead for your organizations and makes it easy for you to use this

type of technology at scale.

There's a really, really good quote by someone I really respect: "BigQuery was built for scale." It's true; I believe this person. But you don't have to believe me. Here are a couple of anecdotes I found on Twitter from folks who really appreciate using BigQuery. "Our queries have gone from four hours to 13 seconds. It's awesome." Here's another one: "I can vouch for BigQuery. It's pretty magical." Now, I want to discourage folks from jumping on Twitter right this second and professing your love for BigQuery, especially using foul language; please don't do that. But the next one is really my favorite, perhaps my favorite Twitter quote of all time. It's from a gentleman who works at Spotify. Two years ago, when we started to work with Spotify, this is what he tweeted: "FINALLY I CAN TELL THE WORLD THAT BIGQUERY IS THE BEST THING THAT HAS EVER HAPPENED TO ME." I feel like if it's in caps I have to shout it. He probably doesn't have a lot of other interests, but he's a brilliant engineer, and Spotify is a brilliant organization. Of course it's a delightful product, everyone loves using Spotify, but they're also incredibly creative technologists. I'd like to welcome one of them here on stage with me: Nicole. Nicole is going to share the story of how Spotify has embraced Google Cloud Platform, and BigQuery specifically, for their use cases.

Thanks, Tino, I appreciate the intro. Hey everyone, thanks for having me here today. I'll start with a brief introduction. My name is Nicole and I work at Spotify as a technical product manager in the data infrastructure tribe, which is our name for a department at Spotify. Specifically, I work with a team that builds tools to support BigQuery across Spotify, so we work closely with BigQuery every day. We in DI as a whole build internal tools and infrastructure to support data at Spotify.

Before I get into this talk, I want to apologize briefly if I get a little bit technical. Feel free to find me after the talk and I can explain things in a little more detail if you have any questions.

I thought I'd start with a very short history of Spotify in the cloud, which, now that I think about it, would be a great name for a band; I might have to do that someday. All right, so four years ago we started the transition to the cloud. At that point Spotify was almost completely on-premise and bare metal. At our largest point we had something like 2,500 nodes in our Hadoop cluster and over 10,000 machines in four data centers around the world. What we were finding is that even at that kind of scale, with one of the largest Hadoop clusters in Europe, it still really wasn't enough. We were expanding rapidly in the US, and we were really having trouble keeping up with the growth and being able to do our analytical work. That in particular was a really big pain point for us: running a single query on this cluster in Hive could take us hours. So after a couple of years of small experiments, we finally decided to take the plunge and move to the cloud. Before I move on, I wanted to give a brief plug for my amazing colleagues Josh and Ramon, who will be giving a talk about our migration tomorrow afternoon, so if you have any interest in hearing more about how we did that migration to the cloud, please listen in to their talk.

For us, the move to BigQuery was actually a really big success. It was rapidly adopted by the Spotify analytics and data science community, and to this day it remains one of the most popular GCP tools we use at Spotify. For us in particular, what it really meant was that we were getting fast results. We were able to add capacity a lot more easily, and we found that the integration with the other GCP tools was really helpful as well. One of the biggest wins for us was how it democratized access to data at Spotify. That ease of use meant that more people could be asking more questions and getting responses faster, which meant that overall we were getting a massively shorter time to insight. Using SQL in particular as a query language was really powerful for letting people across the company ask questions of our data in a way that they had never been able to do before.
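To make that concrete, a listening-time question of the kind enabled here reduces to a single SQL aggregate. The sketch below is hedged: the project, dataset, table, and column names (`plays`, `country`, `ms_played`) are hypothetical, not Spotify's real schema, and the helper just shows the arithmetic behind quoting an answer as "hours equivalent to N years."

```python
# Illustrative only: hypothetical table and columns, not Spotify's schema.
def listening_hours_query(country: str, month_start: str) -> str:
    """Build a BigQuery Standard SQL query for total listening hours."""
    return f"""
    SELECT SUM(ms_played) / (1000 * 60 * 60) AS total_hours
    FROM `myproject.analytics.plays`
    WHERE country = '{country}'
      AND DATE_TRUNC(event_date, MONTH) = DATE '{month_start}'
    """

def hours_to_years(total_hours: float) -> float:
    """Express an aggregate hour count as equivalent years."""
    return total_hours / (24 * 365)

print(hours_to_years(8760))  # → 1.0
```

On the real warehouse, the query string is simply handed to a client's query method; the speed difference comes from the engine, not from anything special in the SQL.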

I thought I'd give you a quick example of how using BigQuery impacted our ability to ask questions of our data. This is the kind of question a data scientist might want to ask: how much time did users in Spain spend listening to Spotify in October of 2016? Before, when we were working with Hive, it would take us something like 16 minutes or so to get an answer to this one question. When we moved to BigQuery, the same question only takes us about 33 seconds. So as you can see, a huge sea change in our ability to get those answers quickly. Now, if you multiply that by hundreds of questions asked by hundreds of people at Spotify daily, the kind of impact this speed is going to have is really clear. And in case you're wondering, the answer to that question is a total number of hours equivalent to nearly ten thousand years.

All right, so what do we really mean when we say petabyte scale? What does that actually look like? Internally at Spotify we sometimes refer to this as "how big is data?", and our answer is: really, really big. Right now Spotify has about 170 million active users listening to our catalogue of over 35 million songs in 65 different markets. As you can probably imagine, we generate a lot of data, and being able to ask the right questions in the right way becomes really important to us. That's what we end up using BigQuery for.

We use BigQuery to do that kind of analysis at a scale we've never been able to reach before. Right now about 25 percent of employees at Spotify are using BigQuery on a regular basis, either directly or through custom tooling built on top of BigQuery. We process around 400 petabytes of data monthly and store over a hundred petabytes of data specifically in BigQuery, and that doesn't even count the things we're storing in other parts of the GCP product suite. That also includes loading around 500 terabytes of new data daily into BigQuery. What that means is that in total we end up running around 6 million queries a month. So that's a really big scale.

For us, the move to BigQuery has actually solved a lot of the issues we used to have in our old clusters, but as we've grown, we've really had to develop new techniques to ensure we continue to be successful. We actually see this as a big win: we've moved beyond the lower-level problems of "can we even get the answers we need?" to the higher-level question of how we continue to ensure good performance if our data doubles in size again. So I thought I'd share a few things we've discovered over the past year that have helped us scale effectively in BigQuery. We found that the things we're doing fall into four categories. The first is administration: managing how BigQuery is being used across the company. The second is education: learning how to tune our BigQuery performance to get the best results. The third is integration: leveraging the APIs and other GCP products to get the right tool at the right time. And finally, we work closely with Google in a partnership to improve BigQuery.

Let me dig a little deeper into each of those. First, when we talk about the kinds of things we're doing in administration to be successful, one of the first things we did was move to the capacity model. This has been really helpful because it ensures we have dedicated slots and helps us be predictable with our costs; that's been a big win for us. It does mean we need to think a little ahead to be able to grow our capacity, but we've found it's still significantly better than the problems we had growing capacity in our on-premise Hadoop clusters; we have a significantly faster speed of growth now. The second thing we've done is create sub-reservations, which are smaller pools within our larger reservation that allow us to isolate jobs and business-critical projects. That's been really helpful for us, particularly because we regularly rebalance which projects have access to which pools, and that means we're getting the best performance at the right time. A good example of this is whenever we're going to do a large backfill on our data: if something went wrong or we need to add

some data into a table that already exists, we will move that particular project into its own sub-reservation with an isolated set of resources it can use. That prevents it from impacting the day-to-day work of everybody else, while also ensuring that that particular job has the resources it needs to complete on time. That's been really powerful for us. We also make sure that business-critical projects are in their own sub-reservations, so that no matter what happens, they always have access to the resources they need to complete those business-critical tasks.

The next thing we've been working on has to do with education, specifically learning how to tune our performance. For us, that really means learning how to manage our concurrency, and how to use batch versus interactive mode to figure out when jobs should run. These kinds of things help ensure we're getting the best performance when we're running things on BigQuery. The second thing we've been learning a lot about is the impact of the Dremel architecture: understanding that it's not a relational database, and that using a columnar data store provides significant benefits for certain queries but can behave very differently, and not work as well, for other kinds of queries. So we make sure that people running queries on BigQuery across the company really understand how to optimize for the kind of architecture they're running on. And finally, within Spotify we have a very passionate community of BigQuery users, and we leverage that a lot to help share best practices around the company.

Next, we work on integration. Specifically, we use BigQuery API wrappers that let us build our own set of custom tools on top of BigQuery to meet our own needs. This allows us to do deeper integration with our own tools and environment while still leveraging the power of BigQuery behind the scenes; that's been really helpful for continuing to grow and fine-tuning things for our specific needs. We also leverage the integrations with other Google Cloud data products, specifically the ability to move data between different products easily, which lets us really optimize around using the right tool for the job. Finally, we focus on building a good partnership, specifically with the Google BigQuery team as well as with the strategic cloud engineers who work with us at Spotify. We work with them to keep expanding the features and capabilities of BigQuery.
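As a sketch of that wrapper idea: a thin layer that pins organization-wide defaults (batch priority unless asked otherwise, a team label for attribution) before delegating to a query client. All names and defaults here are assumptions, the fake client only stands in so the sketch runs without GCP, and the real google-cloud-bigquery `Client.query` takes a job config object rather than keyword options like these.

```python
# Hypothetical internal wrapper: owns company defaults so callers don't
# repeat them. The injected client just needs a query(sql, **opts) method.

class WarehouseClient:
    def __init__(self, client, team: str):
        self._client = client
        self._team = team

    def query(self, sql: str, interactive: bool = False):
        """Run sql with org defaults; batch priority unless interactive."""
        return self._client.query(
            sql,
            priority="INTERACTIVE" if interactive else "BATCH",
            labels={"team": self._team},
        )

class FakeClient:
    def query(self, sql, **opts):  # echoes what it was asked to run
        return {"sql": sql, **opts}

job = WarehouseClient(FakeClient(), "data-infra").query("SELECT 1")
print(job["priority"])  # → BATCH
```

Defaulting to batch priority is one way such a wrapper can nudge the whole company toward the concurrency behavior the platform team wants.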

Those are capabilities you really need to be able to operate at this kind of scale. What that means for us on a day-to-day basis is that we're talking with them about feature requests. As we push the limits of what BigQuery does right now, we work closely with them, telling them, hey, these are the new features we need, this is where we'd like to see this grow. A great example of this: not too long ago we were working on becoming ready for GDPR, the General Data Protection Regulation that rolled out in Europe a few months ago. We realized that the technique we were using for encryption and decryption just wasn't going to scale to the extent we needed for GDPR compliance. So we worked closely with the BigQuery team on new tools and features that allow us to do that at the scale we needed, and at the speed we needed as well. That's been really helpful. And finally, we've been working closely with the support team at Google to help us with troubleshooting. Whenever we run into any issues, or users have challenges, or we have performance problems, we're able to reach out, get quick responses, and solve our problems very smoothly. That's it for me; I'm going to pass it over to Tino now to talk about what's next.

Thank you, Nicole, that was fantastic. So Spotify, at one point in time before moving to Google Cloud Platform, had the largest Hadoop cluster in Europe, or right up there, and a highly tuned one; they spent a lot of time thinking about performance and operational complexity. Now they're working with us to solve those problems, which gives them an opportunity to focus on delighting their customers rather than trying to tweak certain HDFS variables or other things that may not be important for their business.

Spotify has been with us for three years now, but I'd like to welcome on stage our next customer, our next speaker: Tuan from Oath. You have probably used a property of Oath every single day of your life. Oath has been with us on stage before, and they've grown since then, so it's really interesting to see the progression of where they're going and where they're going to be next. Tuan, please.

Thank you, Tino. Like he said, my name is Tuan, and I'm a technical director at Oath on the media analytics warehouse team, also known as MAW. For those of you not familiar with Oath, it was basically formed last year after Verizon's acquisition of Yahoo, which

merged it together with AOL. That formed a kind of house of brands, represented here by what we like to show as orbs. It's not an exhaustive list, but we have a lot of well-known properties and products, including Huffington Post, Flurry, Tumblr, TechCrunch, and Engadget, to name a few.

When we talk about media analytics at Oath, what we're really talking about is scale, in terms of volume and velocity of data. We've spent the last year working with all the brands from the merger, from around the world and across their platforms, conforming their data to bring it together and present a uniform business view of our company. We also try to serve a wide range of users and use cases: we provide KPI reporting for executives, product metrics for product managers and engineers, and we support editors by providing real-time information on the performance of their content. To satisfy all these use cases, we strive to provide a single self-service, interactive, and scalable data platform.

So, a little history about how we got to this point, before we had BigQuery. Generally at companies, analytics starts small: you have a single pipeline presenting the data to analysts. Then you bring on more products, you have more requirements, and you start to add another stack, another warehouse. Companies grow, you acquire other businesses that bring in their own warehouses. These warehouses don't interact with each other, and you have to reconcile that information somehow. So you land at a picture that looks like this: a lot of specialized data stores that have evolved into data silos. You have a lot of engineering teams repeating work, which is not very cost-efficient. You load data into multiple places using different logic to process it, sometimes yielding different results, and those differences eventually have to be reconciled. You have diverse tech stacks that evolve on technologies, like relational databases, that past a certain threshold can't keep up with the pace of the services required. Overall, you have a very poor foundation for building an analytics platform.

So what was the solution? Yahoo, obviously being a huge contributor to Hadoop, had the expertise to build out large-scale clusters for storing data and large ETL pipelines; we have clusters in the tens of thousands of nodes. So that was not the issue. What we really needed was a complementary service that could take that data, expose it, make it interactive, scale with the needs of the business, and also be cost-efficient. For us, that solution was BigQuery. We use it as the basis for our new architecture. It's able to scale with the needs brought on by all the brands from our merger. It's flexible in terms of capabilities, to satisfy the different use cases we have for our users. And it's a managed platform, so it brings lower operational overhead and maintains a relatively low cost for us as well.

Initially, when we started our pipeline, it was a very basic Hadoop design. We ingested our log data and event data into our cluster, processed it with Pig, staged it up in Hive, and then served it to our end users through our BI analytics front-end tool, which is Looker. This served its purpose when we started off, and if you remember the original chain, when you start small, everything works fine. As Nicole said, they started with Hive as well, but as demand grew, Hive couldn't keep up with the services we needed to provide.

So, looking at alternatives, we started to look at BigQuery, and actually this diagram was presented at last year's conference by my colleague Nikhil Mishra. As our first step for introducing BigQuery, we wanted to take a very measured approach to adopting it into our platform. This gave us the opportunity to work with the dev team to understand the GCP environment and how it works alongside our Hadoop

environment. It let us understand the capabilities and shortcomings of the tools and figure out the right way to integrate them. We left it in there alongside Hive, still serving the data up through BigQuery to Looker, so that our end users really didn't experience any difference in their service, except maybe some gains in performance and functionality.

That led to our current design, represented here. I think the biggest thing you'll notice is that all the data is now being served up through BigQuery. We've greatly increased the capacity we store in there, because we've brought in all this data and all the brands from our merger. We've added a real-time pipeline to our system, which lets us take advantage of BigQuery's streaming ingestion capabilities. We've also started to take advantage of additional GCP services like Pub/Sub, Dataflow, and Dataprep, because of how well they work together inside that ecosystem. Really, this presents a simpler design for the overall architecture we're trying to achieve.

Like I said, that architecture afforded us the ability to simplify the design: we've gotten rid of all those specialized data stores and data silos. We're able to unify the streaming and batch pipelines so that all our logic is centralized in a single location. We have flexibility in how we serve the data: we've added our own API on top of the BigQuery API so we can provide programmatic access to the users that require it. And as the needs require, we can scale up both storage and compute to satisfy their needs as they grow. Here, the numbers can speak for themselves, in terms of the amount of storage we have, the number of users, and the query load those users put on our services.
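The unified batch-plus-streaming serving described above can be sketched as one logical view over two sources: batch rows are authoritative where they exist, and streaming rows fill in the most recent window. This is a toy stand-in with hypothetical row shapes and an `event_id` key; the real system expresses the same overlay in SQL inside BigQuery rather than in application code.

```python
# Toy overlay of a batch table and a streaming buffer behind one "view",
# keyed on a hypothetical event_id: batch wins on overlap.

def unified_view(batch_rows, streaming_rows):
    """Return batch rows plus streaming rows not yet covered by batch."""
    covered = {r["event_id"] for r in batch_rows}
    return list(batch_rows) + [
        r for r in streaming_rows if r["event_id"] not in covered
    ]

batch = [{"event_id": 1, "views": 10}]   # settled, reprocessed data
stream = [{"event_id": 1, "views": 9},   # superseded by batch
          {"event_id": 2, "views": 4}]   # only in the last few minutes
rows = unified_view(batch, stream)
print([r["views"] for r in rows])  # → [10, 4]
```

The payoff of this shape is exactly what the talk describes: one table, one query, whether you ask about the last 10 minutes or the last 10 weeks.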
The numbers really speak to the significance of our system, to the importance of it providing value for our company and for the day-to-day decisions our colleagues need to make. I think the numbers also speak to the validity of our choice of BigQuery as the backbone of our infrastructure, because it has been able to grow to match the needs and demands of our services.

A couple of learnings we've had along the way. As mentioned before, there's GDPR; I'm sure a lot of us had to face that new regulation that took effect earlier this year, which was so enjoyable. For us it presented a very daunting proposition: having to restate petabytes of data within our warehouse at a fairly frequent rate, which would have been very time-consuming, very resource-consuming, and obviously very costly. So we had to find alternatives, and looking at those, we settled on using BigQuery DML. What that allowed us to do is apply massive-scale updates and deletes on our data. It didn't require any additional work to reprocess or restage our data, and the operations were performant enough to meet our expectations. We did work cooperatively with the dev team to get enhancements along the way, like being able to apply DML to streaming tables, which was one of our requirements, and we continue to work with the dev team to improve the process by looking at things such as clustering, which we expect will yield additional performance gains. In terms of streaming, obviously being able to ingest event-level data into your warehouse is a huge advantage.
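The in-place restatement described a moment ago maps onto plain BigQuery DML: a single UPDATE or DELETE over the affected rows instead of rewriting and restaging whole partitions. Below is a hedged sketch of what such a statement could look like; the table and column names are hypothetical, and a production pipeline would bind the IDs as query parameters rather than interpolating strings.

```python
# Illustrative GDPR-style erasure via BigQuery DML. Hypothetical schema;
# real code should pass user ids as query parameters, not f-strings.

def erasure_statement(table: str, user_ids) -> str:
    """Build a DELETE removing the given users' rows in place."""
    ids = ", ".join(str(i) for i in sorted(user_ids))
    return f"DELETE FROM `{table}` WHERE user_id IN ({ids})"

stmt = erasure_statement("myproject.warehouse.events", [102, 101])
print(stmt)
# → DELETE FROM `myproject.warehouse.events` WHERE user_id IN (101, 102)
```

The point is less the SQL itself than what it replaces: no reprocessing job, no restaged copy of the table, just one statement the engine applies at scale.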

For us, we applied a kind of lambda architecture: we overlay the real-time data with batch data, so whether you're querying the last 10 minutes or the last 10 weeks of data, you're going to the same table using essentially the same query. But it makes you think in different terms: you have to think about quotas and inputs, the number of events and the size of those events. Of course, nothing comes for free. What we experienced is that tables under heavy streaming ingestion are going to see some performance impact on queries, so we had to work to either overcome or live with those situations. In some cases we use on-prem hardware to offset some of that; in other cases we're looking at alternatives like micro-batched ingestion, which we hope will improve performance in those situations.

I think the biggest discovery for us is the independence of the quotas for a lot of the things you can configure. Streaming ingestion quotas and batch ingestion quotas don't impact each other. Query quotas and ingestion quotas don't impact each other. What that really gives you is a lot of on-demand flexibility to configure your setup, especially in highly dynamic environments, so you can handle any event that comes at you.

In the media business, we're accustomed to high-profile, high-volume events: there's late-breaking news, there are live sporting events. We went into an arrangement with the NFL to stream their content, and it was decided that we wanted to provide real-time metrics around the performance of that streaming. For us, that meant experiencing peaks in volume, in terms of number of events and size of data, that we don't typically encounter. Looking at alternatives to satisfy those requirements: going the hardware route, we really didn't have the runway to procure and set up the hardware, not to mention the question of what you do with the hardware after the event is over. So we looked to BigQuery: we could dynamically increase our quotas, meet the expectations of the event, and turn the quotas back down at the end. And the best thing of all is that you only pay for what you use during that time. We were able to handle that engagement with no issues in a very short amount of time, so overall it was a very successful engagement.

So what are we looking to do from here? Obviously we want to continue to unlock the value of the data we expose in our platform. We're going to increase the volume and look at additional data sets that could provide value for other parts of the company. We're looking at additional GCP services, like machine learning, to do things like user segmentation and clustering on the platform. And we're working with the BigQuery team at Google on pre-releases and alpha and beta versions of additional features, which I won't mention here; I don't want to steal Tino's thunder, but I think they'll be announcing those shortly, tomorrow. And with that, I'm done.

Thank you so much, Tuan. So the impact of that NFL event shouldn't be understated. With little warning, they came to us and said, hey guys, we've got this NFL game going on and we're going to stream four gigabytes per second into your streaming system. And, you know, we freaked out a little bit at first, but they gave us enough lead time and we delivered. The important thing to note is that all the complexity was on our side; all Tuan's team was looking at is an API surface. We went out there and made sure everything would work, our SREs were on call, everyone was very happy with the outcome, and then they paid for only what they consumed. So it's a great story. The other color I would add is that as you get to these scales, the anti-pattern space expands dramatically. There are so many unknowns that you encounter for the first time and have to fix. So folks like Spotify and folks like Oath, the problems they face and work with us to resolve straighten out our system, so that people who aren't necessarily at that scale get a more robust, more performant system.

Now I'm going to change gears a little bit. Spotify and Oath have been with us for a while, and they're operating at huge scale, but we have a newcomer onto our platform.

In April, Twitter announced that they chose Google Cloud Platform as their vendor of choice, their cloud vendor of choice, to host their next-generation data analytics platform. So it's early for them, but they're already getting some results. I'd like to welcome a couple of engineers from Twitter who are going to talk about how they're using BigQuery today. Please welcome Pawan and Roman.

Thank you, Tino. Hey there, I'm Pawan, and I'm Roman. We're from the revenue platform team at Twitter, and today we're going to talk about some of the compelling technical challenges we faced recently, how BigQuery helped us solve them, and where we go next.

At Twitter, we have a track record of victories, solving some problems that had been impossible to solve before. And we really want to tell the story of how we started using BigQuery and what drove us here. Ten billion advertising events processed daily: this is the scale we have to deal with. And what happens at this scale when something goes wrong? It's really hard to tell. We had a need to write arbitrary queries against our data sets to answer those kinds of questions, and for a lot of the data sources we own, that's either impossible or extremely time-consuming. So with that in mind, I want to share one or two real-life cases.

Two quarters ago, we had a bug. Someone reported that we showed ads to the wrong user, something we should never do. So, first step: identify whether any other campaigns were affected. Fortunately, we have tools to do that. It's called business audit records: any change made to a campaign gets logged into this table in MySQL, so all you have to do is query it. It turns out that's really hard to do at our scale. I own one of Twitter's largest MySQL clusters, and this is one of its largest tables. You can literally hear MySQL scream in pain when you try to query a 3.2-billion-row table.

So how long did it take us to finally answer the question, were there any other campaigns affected? It took us two weeks. Two weeks of engineers, the DBA, and a whole lot of pain to finally answer the question. How could we do better? The data platform team had been experimenting with BigQuery for a while, and we thought it was the perfect technology to help solve our problem. With the power of BQ Blaster, which Roman is going to talk about next, our data was ingested in no time, and now things are as simple as writing SQL, but without the timeouts: this takes less than a minute to execute. And this was just one use case.

We looked at this use case and a couple of other test cases, and we really saw the potential to answer questions in more depth if we moved all our data sources into BigQuery. And here's where it becomes a really hard engineering problem: how do we generalize the ingestion, regardless of the source, at that scale? It was never an easy transition for us. BigQuery alone was not enough, and we really had to build a set of tools to enable those experiments, to see what queries we could run and what new questions we could get answered. BQ Blaster, as we call it for short, is one of the first tools we built for this, so let's see what it does. We started with introspection on our own systems, and with so many data sources and database tables, we really wanted to build a tool that would abstract away the original data source. The first stage is a Scalding job that picks up the data from any original source we have and lands it in Google Cloud Storage. The very last step is a BigQuery import job that picks up the data from Google Cloud Storage and loads it straight into a BigQuery table.
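The two hops described above can be sketched end to end. This is a toy stand-in: a dict plays the Cloud Storage bucket and a list plays the BigQuery table, and newline-delimited JSON is used because it is one of the formats BigQuery load jobs accept from GCS. In the real tool the export hop is a Scalding job, not Python, and the object and table names here are made up.

```python
import json

# Toy BQ-Blaster-shaped pipeline: serialize source rows to
# newline-delimited JSON, "land" them in a bucket, then "load" them
# into a table. Stand-ins are illustrative, not Twitter's code.

def to_ndjson(rows) -> str:
    """One JSON object per line: the layout a load job would read."""
    return "\n".join(json.dumps(dict(r), sort_keys=True) for r in rows)

def stage_and_load(rows, bucket: dict, table: list) -> None:
    bucket["export.ndjson"] = to_ndjson(rows)           # land in "GCS"
    for line in bucket["export.ndjson"].splitlines():   # "import job"
        table.append(json.loads(line))

bucket, bq_table = {}, []
stage_and_load([{"campaign_id": 1, "action": "UPDATE"}], bucket, bq_table)
print(bq_table[0]["campaign_id"])  # → 1
```

Because every source funnels through the same intermediate format, adding a new source means writing only the export half, which is what makes the ingestion generalizable.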

So that's one of the tools we've built. We also want to talk about where we go next. We took the learnings from this initiative and are applying them to our next challenge: getting the data organization right. We consume more than a hundred terabytes of revenue logs per day. Moving all of this into BigQuery is our next challenge, and we know how to tackle it now. Thank you.

Thank you, guys. So, you've heard some of the stories of folks who are doing interesting things at scale, who have lots of data, but I don't want to give you the impression that this is the only thing BigQuery is good at. BigQuery has a wide, wide range of customers of all scales, all sizes, and all types of complexity. So much so that the vast majority of our customers are actually using our perpetual free tier: you get 10 gigabytes of storage and one terabyte of data processed every single month for free. So if you have that much data, you never have to pay us a penny. We would encourage you to try it out if you haven't yet, because it might be a great tool for your use case, no matter how big or small. And of course, you can sign up for the free trial to get your $300 in credits, which you can use on BigQuery or any other product you may like. And you can take a look at some of the other sessions we have; we have lots of customers talking about how they're using our infrastructure and our services, across a wide variety of use cases, so you can experience a variety of what people are talking about.

2018-07-27 18:29
