Build an open format analytics and AI lakehouse
[Music] Well, hello and welcome, everyone. Let's get things started for day two of Next. Let's start with a quick straw poll: raise your hand if you have ever had to build an ETL pipeline that was largely copying data so that you could enable a business use case. Quite a few. Keep your hands up if you felt that, after creating that pipeline, you had increased the complexity of managing that data, or its security and governance. Quite a few as well. That is really the crux of the session today: we will be talking about how you can build an architecture predicated on a single-copy lakehouse, one that allows you to make sense of all of your data with all analytics and AI engines, and do so in a secure, governed, and performant manner.

My name is Gaurav Saxena. I'm a group product manager here at Google Cloud, focused on lakehouse AI and analytics, and I have the pleasure of being joined today by Rohit Diwan, chief technology officer for Rakuten's tech platform, who will be sharing insights and learnings from their lakehouse journey.

First of all, I do want to start off by saying that this is a super exciting time. We are at the dawn of generative AI, and there are lots of reasons you might have heard for why people are excited about this, but I want to point out two specific things that are very relevant to the lakehouse topic. First, if you look at a typical enterprise today, we are only scratching the surface in terms of the percentage of data that is really turning into value. A large part of the reason is that businesses generate a lot of data: every time a support interaction happens, every time a customer leaves feedback on an online platform or creates a video, that is a lot of unstructured or multimodal data being created. The problem has been that the tools to unlock it were very expensive; generative AI now has the promise to solve that problem and really make sense of all your data. Second, assistive technologies will help every data practitioner, each one of us, do more, and that is why this is super exciting, but it needs to be grounded in the context of your enterprise data.

While a lot of the focus has been on AI models, the heart of AI is data, and this is why many of you who have already been thinking about or trying out these use cases have been asking: how do I connect data to use cases easily, do so in a secure and governed manner, and do it without writing a lot of custom infrastructure? Let's talk about how a GCP lakehouse helps you achieve these things.

The biggest problem when it comes to connecting data to use cases is the siloed nature of the data lake, the data warehouse, and the AI stack. Each layer in the stack has its own way to access data and its own way to manage metadata. What this really means is that if you are starting an AI project or use case and the data is sitting in a warehouse, you will now have to write an ETL job to pull the data out, and, as the straw poll showed, that really becomes challenging. The complexity only increases if the data is in different formats or different clouds, on top of the security and governance challenges. The physics of moving the data gets in the way, and while this is extremely painful for us data engineers, ultimately it slows down value generation, and we are here today to look at how we can help with these kinds of use cases.
Last year, to address some of these problems, at this very conference we announced BigLake. BigLake is a storage engine that unifies your data lake, warehouse, and AI workloads. It does so by providing a single fabric of data over which multiple engines can interact, in a secure and governed manner. It abstracts and decouples some of the common shared infrastructure, such as metadata and security, from the query engines into a runtime of its own. This is what allows query engines to interface with all the data underneath, spanning formats and cloud boundaries, and to access that data with built-in fine-grained access control and strong price performance. Since the beginning of this year we have seen a lot of growth, with many of you building these lakehouses; in fact, processing over BigLake has increased 27x since the beginning of the year.

The first question many of you ask is: what table formats does BigLake support? Table formats have really taken off in the last three years because of the promise of rich data management capabilities over open-format data; Apache Iceberg, Delta Lake, and Apache Hudi are the popular names. We have seen particularly strong interest in Apache Iceberg, and this is why I'm excited to share that support for Apache Iceberg is now generally available in BigLake. You can create or modify an Iceberg table from an open source engine such as Spark, and that table automatically becomes available for your BigQuery users to query and consume. This is made possible through a shared metastore architecture: BigLake Metastore is a shared metastore between open source engines and BigQuery, making it easy for you to access your open source tables directly in BigQuery and query them. When it comes to querying, BigQuery natively understands Iceberg metadata, which is what lets BigQuery deliver not only a performant query experience but one with the full fine-grained access control model. BigLake enforces fine-grained access control over your Iceberg data, both when you query from BigQuery and when you query from open source engines through the BigLake interface. This ensures that you are not just running open source Iceberg, but running it in a secure and performant manner, and integrating it with all of the data that was already present in BigQuery or elsewhere.

While Iceberg has dramatically accelerated the adoption of open lakehouses on Google Cloud, some of you have already been using Delta Lake or Hudi for specific use cases, and this is why I'm also really excited to announce today that Delta Lake and Hudi are now supported in BigQuery through manifests. You can create a BigLake table by pointing to manifests generated from Apache Hudi or from Delta Lake, and that table can then be queried from BigQuery. The manifest support also handles partitioning, which means we scan less data every time you run queries against these open formats. Lastly, the same security model we described for Iceberg, fine-grained access control, also works on Delta and on Hudi, which means you can use either BigQuery or open source engines to query these tables with row and column access control or with dynamic data masking.
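To make the table-format story concrete, here is a rough sketch, not taken from the talk, of registering an Iceberg table and a manifest-based Delta or Hudi table as BigLake tables from the Python BigQuery client. The project, dataset, connection, and bucket names are placeholders, and the DDL option names follow the public BigQuery documentation of that period, so check the current docs before relying on them.

```python
# Illustrative sketch only; all names are placeholders, not from the talk.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# 1) An Iceberg table whose metadata was written by an open source engine
#    such as Spark; BigQuery reads the Iceberg metadata directly.
client.query("""
CREATE OR REPLACE EXTERNAL TABLE demo_lake.orders_iceberg
WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://my-lake/orders/metadata/v3.metadata.json']
)
""").result()

# 2) A Delta Lake or Hudi table exposed through a generated manifest of
#    Parquet data files (option names as documented around this period).
client.query("""
CREATE OR REPLACE EXTERNAL TABLE demo_lake.clicks_delta
WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (
  format = 'PARQUET',
  file_set_spec_type = 'NEW_LINE_DELIMITED_MANIFEST',
  uris = ['gs://my-lake/clicks/_symlink_format_manifest/*/manifest']
)
""").result()

# Both tables can now be queried like any other BigQuery table, with
# BigLake row- and column-level security applied at query time.
for row in client.query(
    "SELECT COUNT(*) AS n FROM demo_lake.orders_iceberg"
).result():
    print(row.n)
```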
Besides the support for table formats, many of you also ask about query performance; it is really one of the chief characteristics when choosing a technology stack to run on, and performance has been an important area of focus for us. What we have been doing is expanding the underlying infrastructure we already use for BigQuery, called Big Metadata, which captures rich metadata and data profiles, and extending it to open formats so that we can capture rich Parquet statistics and metadata and use them effectively for query planning. The result is that on the TPC-DS benchmark we are now 4x faster than we were before with BigQuery external tables, and because we leverage that metadata we also scan less data, reducing the cost of running these queries, a 75% reduction compared to BigQuery external tables. So if you are running BigQuery external tables today, it is really a no-brainer to upgrade them to BigLake tables and get better performance, lower cost, and fine-grained access control for your workloads. Put together, these tools will really help you connect more of your data, in more formats and across clouds, to more use cases, and help advance the development of analytics and AI.
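The performance and cost gains described above rely on BigLake's cached metadata. As an illustrative sketch, with placeholder names, this is roughly what upgrading a plain external table to a BigLake table with metadata caching looks like; the option names follow the public documentation rather than anything shown in the session.

```python
# Illustrative sketch; project, dataset, connection and bucket are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

client.query("""
CREATE OR REPLACE EXTERNAL TABLE demo_lake.events
WITH CONNECTION `my-project.us.biglake-conn`   -- the connection is what makes this a BigLake table
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-lake/events/*.parquet'],
  -- cached file and column metadata is used for pruning during query planning
  metadata_cache_mode = 'AUTOMATIC',
  max_staleness = INTERVAL 4 HOUR
)
""").result()
```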
Now, another big area that keeps coming up all the time: many of you have said you really want to pursue an open-format strategy, for reasons ranging from long-term vendor neutrality to compatibility and interoperability with multicloud or hybrid architectures. A single format can make that easy, but managing the data is painful: you have to write a lot of background jobs to constantly optimize file sizes, do garbage collection, and implement the features that optimize overall query performance and affect the TCO of your workload. The problem gets even more complex with use cases like high-throughput streaming and change data capture, which lead to the dreaded small-file problem; that ends up hurting query performance and requires a lot more infrastructure maintenance and management just to keep the workload running performantly. And with requirements like GDPR, which often ask you to perform deletion operations at a granular level, those operations become really expensive, because the open formats end up modifying multiple files and the unit of modification can be a whole batch of files. These are some of the challenges many of you have raised; they are solvable, but they require a lot of work and a lot of infrastructure.

This is why I'm super thrilled to announce something special that we have been working on over the course of the last year: BigLake managed tables, which provide a fully managed lakehouse experience for your open-format data. They provide DML and transactions on your Parquet workloads, for data residing in your Cloud Storage buckets. They provide a fully managed experience with automatic background storage optimizations, such as file compaction and garbage collection, to make sure you run at optimal performance and cost. And they provide an interface for extremely high-throughput streaming ingestion that scales to tens of gigabytes per second for the most demanding workloads, paving the way for CDC and other advanced analytics use cases you can now run on open-format data. The key underlying principle for us has been openness: the data stays in your buckets and is managed in open formats. All of this is enabled in a fully open architecture by building in full compatibility with Iceberg; BigLake managed tables provide Iceberg metadata so that your open source engines can query them just as if they were querying Iceberg tables. BigLake managed tables bring the best of Google's managed storage infrastructure, which has already been tested at massive scale with BigQuery, and innovate on it to make it work on your open-format data. We are very excited about this.

There are three key ways in which you can use BigLake managed tables. First, you can use BigQuery to perform DML and ingestion into these tables, just as you would on a BigQuery native table, and BigLake will perform all of the underlying data management operations, such as file compaction, automatic clustering and re-clustering of data, and garbage collection, so that you don't have to write those infrastructure tasks and you still get a very performant experience at the right price point. Second, if you are building streaming pipelines, you can now use the BigQuery Write API to write data to these tables, using connectors from popular open source engines like Spark, or build your own client-library-based connectors to write from your own custom engines; this scales to several gigabytes per second. Part of the reason this becomes possible is that, historically, when you build high-throughput streaming pipelines you are constrained by the write throughput of object stores or by the transaction rate of the open formats. BigLake managed tables take an approach where ingestion is first written into a write-optimized storage that is massively parallelized and then converted into columnar Parquet. This lets ingestion scale to extremely high limits for the most demanding workloads while still giving you zero staleness: your queries merge the data between the Parquet files and the write-optimized store, so there is no delay in querying the results, and the small files are then automatically compacted in an optimal manner to ensure the best performance. Third, you can also use open source engines to query these tables via the Iceberg snapshots, and BigLake's zero-trust model of fine-grained access control applies in these use cases as well. These are just some of the ways we are already seeing customers build capabilities around this.
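As a sketch of the second usage mode, streaming through the BigQuery Write API from Spark, here is roughly what the connector-based path can look like in PySpark. The table, bucket, and connector version are placeholders, and the 'direct' write method is the connector option that routes writes through the Storage Write API; whether a given BigLake managed table accepts it depends on the feature's availability in your project.

```python
# Illustrative sketch; table, bucket and connector version are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("biglake-streaming-ingest")
    # Connector package version is illustrative; pin whatever is current.
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2")
    .getOrCreate()
)

# Placeholder source: newly landed JSON events in a staging bucket.
events = spark.read.json("gs://my-landing-zone/events/*.json")

(events.write
    .format("bigquery")
    .option("writeMethod", "direct")  # routes the write through the BigQuery Storage Write API
    .option("table", "my-project.demo_lake.events_managed")  # target table (placeholder)
    .mode("append")
    .save())
```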
And with that, let me invite Rohit from Rakuten to share a little bit more about their journey to a lakehouse and the key learnings from it.

Thank you, Gaurav, and thank you for sharing the exciting features of BigLake. My name is Rohit Diwan; I am the CTO of the platforms group at Rakuten. While Gaurav talked about the features of BigLake, I wanted to get more into our motivations for moving toward the lakehouse architecture and some of the considerations we had in choosing BigLake. Rakuten is a pretty fascinating company: we were founded back in 1997 as the first e-commerce marketplace in Japan, and starting in the early 2000s we began to diversify into credit cards, travel, and other fintech businesses. Fast forward to today, and we are a conglomerate of more than 70 businesses that connect together in an ecosystem through a common membership and a common rewards program known as Super Points. If you're in Japan you pretty much can't miss us; almost 3% of Japan's GDP goes through Rakuten in some way, and last year we had about 34 trillion Japanese yen in transaction volume.

As you can imagine, we have incredible breadth and depth of data. You can see our mission statement up there; I imagine many of you in your data platform organizations have something similar, but this simple mission hides a lot of complexity behind it. One part is the sheer number of users: we have more than 2,000 data analysts, hundreds of data engineers, and hundreds of data scientists. As a global ecosystem company we also have dozens of people working on governance, audit, and security, making sure we are compliant with GDPR, CCPA, and the variety of data protection regimes around the world. Our many acquisitions also mean that we are de facto multicloud, so we use all the major hyperscalers, and we are global as well. At our level of complexity, one of the key considerations is that any core data platform process needs to be automated, since any manual step simply would not scale at our level.

Starting in about 2000 we set off on our data platform modernization journey, and back then we were a purely on-premise system. As a very data-driven organization, the pressure on that system was growing very rapidly; at some point we were looking at one to two percent growth per week, which, compounded, is absolutely massive. So in order to achieve more elasticity and scale we decided to move toward a hybrid data platform, and as of October of last year we are now fully hybrid: we have our data both on premise and on GCP. Of course, this comes with its own set of challenges. We moved from having just object storage and Hadoop on-prem to also having data replicas in GCS and copying that data into BigQuery storage as well. From a governance perspective this increased our load, because we needed to make sure that permissions were aligned across all these different systems. At the same time, from a data engineer's or data scientist's perspective, there was a long time to get value from the data: once data was produced, we needed to run pretty heavy ETL jobs to copy it into, say, BigQuery, where the bulk of our users were going to consume it, so there might be several hours or more of delay before time to value. Additionally, as we made copies through these replicas, we needed to monitor quality to make sure the copies were available and consistent across each replica, which is much more complicated than it sounds. This was the real motivation to consider a lakehouse architecture.

Fast forward to today: we were one of the early launch partners for BigLake, and we have been able to cut our number of data replicas in half. On-prem we are moving to just object storage, and on GCP we are able to keep all our data within GCS and, for the most part, not have to copy it into BigQuery storage. This of course reduces the time to data. It also simplifies our governance and access, because our governance teams can set policies essentially in one place and have them apply to our data whether it is on-prem, on GCP, in AWS, or in Azure. In addition, all consumption can go through BigLake as well; there are BigLake connectors for all the popular query interfaces, at least the ones that we use, like BigQuery and Trino.
I should mention we also have very significant cross-use of data, which makes our governance even more complicated, but moving to BigLake has definitely helped us simplify that significantly. As we made this journey to a lakehouse, these were, I would say, the five primary considerations we had. One was performance: we were definitely concerned that not having the data in BigQuery storage would lead to a significant reduction in performance. However, in our testing we found that with metadata caching turned on, performance came within roughly 98 to 99 percent of the native performance we would get if that data were in BigQuery storage. We wanted fewer copies of data, and that is definitely something we were able to achieve with BigLake, along with much faster, real-time insights. We are able to set policies in one place, so our CISO is having fewer sleepless nights, as are we. Iceberg support was also important to us; as a company we believe that is going to be the standard format going forward, so in a future phase, probably later this year, we will also take advantage of the Iceberg support that is part of the BigLake ecosystem. Net-net, I think BigLake meets all of these considerations and has helped us achieve the next step in our lakehouse journey. I wish you all the best in your next step; so many of you showed up for an 8 a.m. session, which tells me a lot of you are thinking about this topic as well. I won't be present during the Q&A, but I will hang around for some minutes after the session to answer any questions you might have. Thank you so much for your time. [Applause] Let me pass it back to Gaurav.

Thank you, Rohit, for the partnership. Rohit talked about how getting to a lakehouse simplified their old data architecture and really enabled many personas to come together and consume a single copy of data. Now, as many of you have been talking about lakehouses, one of the big challenges we keep hearing about is: how do we solve the data problem of AI? Let me explain what I mean by that. There is a lot of value that many of our customers are exploring in foundation models, but to make those models relevant and really work for their enterprise use cases, they need to be grounded in their data. And not only grounded in their data, but in a manner that is secure, governed, and private, and that does not blow up price performance. So how do you do AI in a scalable manner by bringing your own data, with the security and governance that you have already built and curated? That is where the big lakehouse opportunity is: to simplify those challenges for you. As we mentioned, the analytics and AI stacks have long been siloed: analytics has largely focused on structured data, while AI, and especially generative AI, which is now unlocking all this value from unstructured data, has had almost a separate path to the public cloud object stores where that data really lives. So the challenge we put to the team was: how do we help customers get access to their data using the tools they are already familiar with and the governance frameworks they are already using?
And this is why I'm really excited to announce the general availability of object tables. Object tables are a layer on top of public cloud object stores: they represent the metadata of the underlying files stored in cloud buckets, and they are a special kind of table because they have the ability to retrieve the underlying object and serve it to the processing engine to run AI computation on top of it. Object tables also really simplify pipeline development: when you add new objects, say new images, to a bucket, the table automatically updates, which simplifies end-to-end development of AI pipelines. And finally, object tables enforce BigLake's fine-grained security model, which means that if you secure a specific row in the object table, you are actually securing the underlying image and its retrieval, which happens through a signed URL. These capabilities make it really easy to build these use cases; let me give you some examples of what we are seeing today.

First up, a lot of companies today, as they build their own versions of enterprise LLMs or fine-tune foundation models, first need to do pre-training or curate their training data. This is where you take the data you have collected from various sources into an object store like GCS, but still need to transform it, parse it, or filter it for things like opt-outs or training license rights, information that is largely stored in structured tables or extracted from the data through BigQuery. With object tables, you can now join object tables with the rest of your BigQuery tables and write simple, expressive SQL statements that help you curate your training corpus for that fine-tuning. That makes really simple something that would otherwise take a lot of coding to develop.

The other thing we are already seeing: once you actually get to a model, how do you know it is really safe for your use cases, that it is high quality and giving good responses? This is where you can create object tables and call that model directly from BigQuery, whether it is hosted in Vertex AI or managed by you on your own custom infrastructure, through a few lines of SQL. We are already seeing examples where customers create an object table over images and then use BigQuery SQL to call the Vision API to evaluate the safety attributes of the images and take action on them. One of those actions is securing specific PII objects: if you detect certain metadata, you can cut off access for specific groups or user personas, to make sure the AI development you are doing is consistent with the governance policies you have defined for your own organization.
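Here is a hedged sketch of the image-screening pattern just described: an object table over an image bucket plus a remote Vision model called from SQL. All project, dataset, and connection names are placeholders, and the function and option names follow the BigQuery ML documentation of that period rather than the talk itself.

```python
# Illustrative sketch; names are placeholders and not from the talk.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Object table: one metadata row (uri, size, updated time, ...) per object in the bucket.
client.query("""
CREATE OR REPLACE EXTERNAL TABLE demo_lake.product_images
WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://my-lake/images/*']
)
""").result()

# Remote model backed by the Cloud Vision API.
client.query("""
CREATE OR REPLACE MODEL demo_lake.vision_model
REMOTE WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (REMOTE_SERVICE_TYPE = 'CLOUD_AI_VISION_V1')
""").result()

# Screen images straight from SQL; the object table hands the underlying
# bytes to the service, subject to BigLake row-level access control.
rows = client.query("""
SELECT uri, ml_annotate_image_result
FROM ML.ANNOTATE_IMAGE(
  MODEL demo_lake.vision_model,
  TABLE demo_lake.product_images,
  STRUCT(['SAFE_SEARCH_DETECTION'] AS vision_features))
""").result()

for row in rows:
    print(row.uri)
```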
We are now taking these capabilities one step further. If you think about document use cases, a lot of data, whether PDF files or text files, resides in object stores. Now you can use Document AI Workbench, and with just a few clicks you can train a custom LLM-based extractor to pull specific fields from your documents: you can ask a question to define the fields and schema you want for your extraction, and that generates an API endpoint that gets deployed. That API can then be called using BigQuery remote models. You simply register the remote model in BigQuery with SQL, and then write SQL on object tables over your PDF files to invoke that API and extract text from the documents. Fine-grained access control, as I said, is already baked into the framework, so you can control access to specific documents that contain sensitive or PII information. You can then build document analytics in BigQuery by combining the extracted text with the rest of your structured data, or do many other interesting things with it. We are seeing opportunities to take that extracted text in BigQuery and, with the embedding generation capabilities we announced yesterday, generate embeddings on those text columns. That embedding index can then be synced to Vertex AI Matching Engine, where you can leverage vector search to implement enterprise LLM chatbots that are grounded in your data and use techniques like retrieval-augmented generation. All of these capabilities make it really simple, with a few lines of SQL, to build custom-tuned LLM applications on your data, and to do it while preserving security and privacy in a governed manner.
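As a sketch of the embedding step described above, again with placeholder names and with the endpoint and function names taken from the BigQuery ML documentation of that era (newer releases expose a successor function), this is roughly how embeddings can be generated over the extracted document text before syncing them to a vector index such as Vertex AI Matching Engine.

```python
# Illustrative sketch; names are placeholders, endpoint/function names are era-specific.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Remote model pointing at a Vertex AI text embedding endpoint.
client.query("""
CREATE OR REPLACE MODEL demo_lake.text_embedder
REMOTE WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (ENDPOINT = 'textembedding-gecko')
""").result()

# Embed the text previously extracted from the PDF object table
# (demo_lake.doc_texts is a placeholder for that extracted-text table),
# keeping the document URI alongside each vector for traceability.
client.query("""
CREATE OR REPLACE TABLE demo_lake.doc_embeddings AS
SELECT uri, content, text_embedding
FROM ML.GENERATE_TEXT_EMBEDDING(
  MODEL demo_lake.text_embedder,
  (SELECT uri, extracted_text AS content FROM demo_lake.doc_texts))
""").result()
```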
So I'm really excited about all the capabilities we shared with you today, and about the experience Rohit and his team shared to help you build these lakehouses, and we can't wait to see what you build on your own journey. [Music]