Preparing and Designing the Data Lakehouse for Artificial Intelligence — Vini Jaiswal, Databricks


Drew Paulin: Hi everyone, welcome to the first presentation in this year's Databricks lecture series. I'm really glad that you could all join us. I'm Drew Paulin, the academic director of MIDS here at the I School. I'm very pleased to introduce Vini Jaiswal as well, who will lead the presentation and discussion today, focused on preparing and designing data lakehouses for AI. Vini is a developer advocate at Databricks and brings over nine years of data and cloud experience, working with Unicorns, Digital Natives, and Fortune 500 companies.

She helps data practitioners to be successful in building on Apache Spark, Delta Lake, Databricks, MLflow, and other open source technologies. Previously, Vini was Citi's VP Engineering Lead for Data Science, where she drove engineering efforts, including one where she led the deployment of a highly scalable data science and machine learning architecture on the global cloud. She also interned as a data analyst at Southwest Airlines and holds an MS in Information Technology and Management from the University of Texas at Dallas. Currently, she is co-authoring the book Delta Lake: The Definitive Guide by O'Reilly. So, welcome Vini. Thank you so much for joining us today. Before we get started,

I want everybody in the audience to know that we're going to set aside 10 minutes or so at the end of the presentation for questions, so please do feel free to post questions either in the Q&A panel at the bottom or in the chat, and we will definitely return to them at the end of the session. Vini, do you want to get started? Thanks. Vini Jaiswal: Thank you Drew for an amazing introduction. Hello everyone, thanks for making time to attend this session, and let me present my screen. All right, Drew, can you see my screen? Drew Paulin: Yes, absolutely. Vini Jaiswal: Awesome. Yeah, so I love seeing people join.

In today's lecture, we will learn how to prepare and design the data lakehouse for artificial intelligence. I'm assuming you are taking a data science course and you're interested in learning about artificial intelligence, so I'm here to talk about the latest innovations as well as the architectures I have seen working with customers. As Drew mentioned, I am Vini Jaiswal, developer advocate at Databricks, where I help practitioners to be successful building on Databricks and open source technologies like Apache Spark, Delta Lake, and MLflow. I usually don't talk about my career in detail, but I was asked by Tia and our amazing panel to go into a little bit of detail to explain my career journey, since it's a lecture for academia. Working in airlines, flying planes, was my childhood dream, but oh well, I switched gears, and being passionate about math throughout my schooling, I selected electronics and communication engineering as my undergrad.

Moreover, I chose this field because I was fascinated by the digital communication between satellites, how cell phones work, the evolution from heavy cell phones to very smart, lighter phones, and how robotics use cases were becoming more realistic. To pursue higher education, I moved to Texas to do a master's at UT Dallas. During the start of my semester, I came across a demo of market basket analysis and how Amazon uses it to recommend products based on customers' purchase history.

Upon doing my research, I found the world of BI super fascinating and interesting, along with the advancements in the digital world, so I shifted my focus to do a master's in information technology and management with a major in data analytics. To gain further practical knowledge, I started a data engineering internship at an audit firm where we got the data from our retail clients in the form of magnetic tapes. Yes, that was a thing back then. We had a mini infrastructure, or data center lab, where we would host all the data on the server. I would do the ETL from those tapes into SQL Server and create MS Access reports for the auditors, so you see how my data journey started. And earlier, I did express my wish to work in the airline industry, so when I got a chance, I did a second internship as a maintenance quality control data analyst at Southwest Airlines, where I built a reporting tool using Southwest's in-house technology and its own proprietary data warehouse. I would pull data from legacy systems and various sources including Access, Excel, Oracle, SQL Server, and several Southwest enterprise data warehouses. Then, I would work with quality inspectors

who would prepare the maintenance reports, and finally, after I had gathered all the data, I would produce the quality report for the decision makers, available in the form of dashboards. It was a very fun and interesting experience to work on the data for maintenance of the flights and to provide intelligence around improving quality, so that was interesting. Soon after my graduation, I had a variety of experiences at Citi. It was a long tenure at Citi. I started working on the infrastructure side, where I would work on data center projects, building labs, and building a lot of desktop solutions. I was tasked with managing applications which would run on around 400,000 Citi devices. I then worked on a project to move the Citi infrastructure from on-prem to cloud. That was a cool thing back then, cloud adoption was increasing, and after a good success, I transitioned to working on the data science platform for Citi's internal businesses.

While I totally enjoyed it, I wanted to expand my experience from the banking industry to a lot more verticals, and I was fascinated by the use cases that people are using AI and data science for. As a result, I started working at Databricks, where I worked on at least 100+ data and AI architectures for Databricks customers in the Digital Native, Unicorn, commercial, and enterprise segments, and across many industry verticals. And now I'm doing developer advocacy for lots of data and AI practitioners. I want to help them leverage the AI technologies, leverage the data, and make intelligence out of it, so that was my career journey. That is, in a nutshell, what I showed you. Hope I didn't bore you and hope you're still with me. Please do paste your questions in the Q&A and later, as Drew said, we will have time to answer a few questions. Great, so after walking through my journey of data and AI, let's start with the topic that we are here for: AI. I would like to start with a famous quote from Marc Andreessen, "Software is eating the world." Just as software has transformed many businesses and created new ones, AI will do the same.

By definition, AI has been able to achieve tasks that previously wouldn't have been possible by manually writing software. However, we all know that most of the time that we spend in developing any AI application is focused on data and that without good quality data, AI just doesn't work. Data is the bigger beast that is eating AI and while that may just sound like a catchphrase, there's actually a lot of truth to it.

So I would like to show you some statistics. Data is growing rapidly; you must have heard from Gartner reports or any other data reports that data is growing massively. It might grow to 177 zettabytes by 2025, and only 10 to 20% of that data is learned from, while 80 to 90% of data still has not been learned from. So give yourselves a pat on the back for being here, because you being here, being AI enthusiasts, already solves some of the initial problems. We can multiply forces to bring intelligence from this data. So, let's talk about the problems that people commonly encounter and why we haven't been able to leverage the full potential of our data to build artificial intelligence products.

So, let's take a look at why the focus really is on data. I did mention how data is eating AI, and the main reason is that we have become really good at dealing with code and data separately, but we are terrible at combining the two. In the software engineering world, the main goal is functional correctness, and in most cases you can write good tests to ensure that the software is working well. AI, on the other hand, tries to optimize a metric, which can be a moving target with changing data. So one mistake I see a lot is that when a feature gets into the hands of customers, teams don't think of it as the beginning of the process. They think about it as the end, and I see many teams make this mistake all the time: people build some data-driven feature, they

release it, they find the performance to be mediocre, and then they either roll it back or abandon the project itself, and sometimes these organizations even come away with the mistaken impression that machine learning is therefore not actually very effective for the problem that they are trying to solve with their data, when in fact they kind of quit at the beginning. So that's one of the things we see. Another factor is the outcome: the outcome of software is deterministic, whereas in AI, the outcome of training a model can change significantly based on changes in the underlying data input variables. All of these factors combined make it painfully clear that AI is hard because it depends neither on code nor on data alone, but on a combination of both. As a result, many different people need to get involved. Software is mainly built by software engineers, but to train a machine learning model and deploy it in production, you usually need to involve a combination of software engineers who build the platform, data scientists who will build the models that make the predictions and do all sorts of intelligence around them, and data engineers who will build your data pipelines and curate your data. So what is lacking? There is no standard way for domain experts, data engineers, machine learning engineers, and IT operations to engage with each other.

Lack of collaboration leads to project delays, low productivity, deployment difficulties, and poor real-time performance, so getting all of these people involved and coordinating among them is a major challenge. Now this is not uncommon, of course; some of the most meaningful problems can only be solved by bringing many people together. However, it becomes almost impossible if those people cannot agree on the tools that they want to use or should use. In summary, we are faced with three challenges. AI is hard because of the interdependencies of code and data, because many people need to get involved, and because a massive number of components need to be integrated. So what are the attributes of the solution? Let's look at it. To do AI right, companies need an AI platform, but before we begin, how can an organization solve this problem? Let's take a look at the need for each of the buckets that we categorized on the previous slide. To tackle the challenge of data sprawl, the machine learning platform needs to be data-native and open. What that means is a platform that allows discovery, access, exploration, preparation, and reuse of data.

Regardless of what cloud you use or what data format you use, it should possess inherent quality checks, versioning, compliance, and governance capabilities, especially as the data evolves. It should provide all collaborating teams seamless access to the most recent data using the tools of their choice, and it should integrate easily with existing infrastructure. Second, to make ML projects more productive, the machine learning platform needs to foster collaboration. We need to remove silos between data, we need to provide teams with more access, and we need them to collaborate together. It should allow participants to use the tools, languages, and infrastructure of their choice, bringing together all the teams involved (for example, the data engineers, data scientists, developers, and software engineers we saw) in one place to share, learn, and collaborate with each other. It should also securely facilitate real-time interactions between people and teams so that they can accelerate

the projects and the needs of their infrastructure. And finally, the machine learning platform should support the full ML lifecycle rather than just a piecemeal approach to different parts of machine learning. As you must have seen, there are a lot of tools out there, so focus on the problems at hand rather than focusing on tools, have trust in your data engineers, and give them the love (TLC) they need to build the tools they need. One more thing: start small and iterate. Building the right way and planning for iteration is what brings successful AI projects. You should assume that you're going to start out with something that works okay, and that you will need to measure performance, collect more data, and iterate until it's good enough. You have to just keep on iterating, and that's partly organizational, like

having the political will to invest that time in iteration, but it's also about how you develop. You need to put in place the instrumentation that will inform the iteration process, for example, making the cost of constructing each iteration of the model as low as possible. The infrastructure to support rapid experimentation and iteration is really crucial, so when you see organizations sort of floundering with machine learning, it's often because they don't have enough resources invested in the infrastructure to make iteration easy. So, let's look at what a typical end-to-end data architecture looks like. On my screen, you can see that I have source data, and it is getting distributed either as a batch or as a stream. What I mean by batch is that you allow the data to be inserted maybe one batch at a time, while stream processing needs to be continuous. So data engineers write the pipelines to make data available to downstream users, and downstream users might be data scientists who are trying to implement machine learning or AI projects.

Another downstream user can be the data analyst, who would want to write SQL queries or generate insightful dashboards to make important business decisions. To facilitate all the data engineering processes, you will need a data platform where you get the required compute and storage infrastructure: you have the data, you need to process it, so you need a storage solution and a compute platform. Have the data platform team worry about that, and implement proper security controls and governance around your data and infrastructure. If you are a data provider, alternatively, you might have to integrate it with a client-facing application or maybe with some other in-house systems. This is where, as I was saying, integration becomes super important. What if all of this were simplified as well as unified, so that you don't need to dedicate your resources to maintaining the infrastructure and can start realizing the value of your data? The reason I say this is that a lot of teams actually invest a lot of their time in developing the infrastructure, capacity planning, and deciding what tools to use, and they spend very little time on the end application that they are trying to build.
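To make the batch-versus-streaming distinction concrete, here is a minimal PySpark sketch of the two ingestion modes just described. The paths, schema, and table locations are hypothetical, and the streaming source is assumed to be a directory of arriving JSON files; this is an illustration, not the speaker's actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema for incoming loan records
schema = StructType([
    StructField("addr_state", StringType()),
    StructField("loan_amnt", DoubleType()),
])

# Batch: read whatever files exist right now, one batch at a time
batch_df = spark.read.schema(schema).json("/data/loans/incoming/")
batch_df.write.format("delta").mode("append").save("/data/loans/delta/")

# Streaming: continuously pick up new files as they arrive
stream_df = spark.readStream.schema(schema).json("/data/loans/incoming/")
(stream_df.writeStream.format("delta")
    .option("checkpointLocation", "/data/loans/_checkpoints/")
    .start("/data/loans/delta/"))
```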

So let me talk about how the data landscape has evolved from warehouses to lakehouses. I'm pretty sure those of you in the audience are familiar with how the data landscape has evolved from data warehousing to data lakes. Before, organizations were doing business intelligence, but they soon realized that they could do much more with the data. They want to not only do BI, but also machine learning and AI on their data, so here are two different evolutions: the data lake and the data warehouse. I'm going to compare them and then give a unified solution. Data lakes and data warehouses have complementary but different benefits that have required both to exist in most enterprise environments. Data lakes do a great job supporting machine learning. They have open formats and a big ecosystem, but they have poor performance for business intelligence because they suffer from complex data quality problems. With a data lake, you can literally ingest all your data in whichever format you want, but it's not managed well. Data warehouses, on the other hand, are great for business intelligence applications

because they have great manageability, but they have limited support for machine learning workloads and they are proprietary systems with only a SQL interface. Unifying both of these systems can be transformational in how we think about data. What if we could borrow the data management capability of the data warehouse, which brings reliability, integrity, security, and quality, while still supporting all of your workloads? My use cases are now not just limited to BI; I want to do AI and machine learning on my data too. If there's a way to bring the best of both worlds, it's the lakehouse. This isn't a dream anymore, we have a way forward, and it's the lakehouse platform. Working at Databricks, I have seen this evolution; Databricks has fully embraced and created an identity for data lakehouses. The lakehouse is effectively saying: I can bring all the data that you have, provide the core data management capabilities of a warehouse, and offer a foundational compute service to support all the primary data lake use cases.

So, with the lakehouse, much more of your data can remain in your data lake rather than having to be copied into other systems. I say that because in data warehousing, if you want to leverage your existing data or use it in some other tool, you need to copy a lot of data, so there are a lot of storage costs and transfer costs involved, but with lakehouses you can have data in one place so that you don't have to copy it over and over. You no longer need separate data architectures to support your workloads across analytics, data engineering, streaming, and data science, and what this brings you is the removal of data silos. Teams can actually come together and work from a single platform and get value from the single available data repository. So this is why companies have started to use the lakehouse to provide one platform to unify all of their data, analytics, and AI. Now I will talk about the lakehouse architectures that I have used at Databricks. We start with data prep, so I have three different tiers here: you can see data prep designed for ML, then ML frameworks, and then deploy anywhere at scale. Your data can be in different formats. Ten years ago or so, we started working on structured and semi-structured data, but now we are talking more about

how we can bring analytics or artificial intelligence to natural data like images, video, or audio. Next comes featurization. Featurization is the act of optimizing the inputs of a machine learning model, and one of the tools I know is the Databricks Machine Learning Feature Store, which has been co-designed with Delta Lake and MLflow. So not only does it benefit from the pristine data in Delta, but it also natively stores features with the model itself in the MLflow Model Registry. I'm not really sure if you know about Delta Lake, but we have a lot of talks; if you follow me on LinkedIn, I have a few talks on it if you want to learn more. Basically, it has changed data management capabilities in the big data world. It provides ACID transactions,

which give users confidence in the reliability of their data, and I will talk a little more about that on an upcoming slide. So once your data is in shape, you can train models using any ML library you want, whether you are on the Databricks platform or elsewhere. If you're using a favorite tool of your choice, there are a lot of libraries available, and there are a lot of open source contributors who keep bringing these amazing libraries to make intelligence pretty easy for your projects. The Databricks Runtime for Machine Learning provides an optimized runtime and already installs a lot of packages on its own, so it's a very powerful tool to check out. All you need to do is focus on your project, and all the infrastructure is taken care of by Databricks. One more thing: for model deployment, it supports any and all deployment needs, from batch to online scoring, on the platform of your choice, and of course, if you have a favorite cloud, whether Amazon, Azure, or Google Cloud, Databricks supports all of them, so you can use any of the clouds with Databricks. Then, once you have featurization and model deployment, you want to productionize and scale your ML pipeline. Users can also benefit from the ability to run the entire machine learning lifecycle in an automated fashion, wherein AutoML allows users to automatically generate the best models from the data. So if you are new to AI/ML, we have resources for you, stay tuned; I will upload some resources if you want to check them out. ML operations is really a combination of DevOps,

ModelOps, and DataOps. So that was the full ML lifecycle; I hope it is a little clearer to you. If not, we will have a demo at the end to show you what it looks like. All right, and then I want to say that Delta Lake has actually accelerated a lot of the foundation that we just looked at, so let me show you how. If you are working as a data scientist, you might have

your full modeling process sorted and potentially have deployed a machine learning model into production using MLflow. But what if you could reduce the time you spend on data exploration? What if you could see the exact version of the data used for development? What if the performance of a training job isn't quite what you had hoped for, or you are experiencing out-of-memory errors? You must have seen out-of-memory errors in the past if you leverage Spark. All of these are valid thoughts and are likely to emerge throughout your machine learning development process. A key enabler behind this lakehouse innovation is Delta Lake. When data practitioners train on Databricks Machine Learning, they not only benefit from the optimized writes and reads to and from Delta, but, because Delta provides ACID transaction guarantees, they also benefit greatly from the quality and consistency of their data pipelines.
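As a rough illustration of "seeing the exact version of the data used for development", here is a minimal sketch that records a Delta table's current version in an MLflow run. The table path, feature columns, and model are hypothetical (and the features are assumed to already be numeric); spark is the SparkSession a Databricks notebook provides.

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from delta.tables import DeltaTable

delta_path = "/data/loans/delta/"  # hypothetical Delta table location

# Latest version of the Delta table, taken from its transaction history
current_version = DeltaTable.forPath(spark, delta_path).history(1).collect()[0]["version"]

# Load the training data and track the run with MLflow
pdf = (spark.read.format("delta").load(delta_path)
       .select("revol_util", "int_rate", "bad_loan")   # hypothetical, already-numeric features
       .toPandas())

with mlflow.start_run():
    mlflow.log_param("delta_path", delta_path)
    mlflow.log_param("delta_version", current_version)  # ties this model to a data snapshot
    model = LogisticRegression().fit(pdf[["revol_util", "int_rate"]], pdf["bad_loan"])
    mlflow.sklearn.log_model(model, "model")
```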

Another common issue is waiting for a query to run, and it only gets worse as the data volume gets increasingly large. Luckily, Delta provides optimization techniques that you can leverage, such as data skipping: as your data grows and new records are inserted into Databricks Delta, file-level min/max statistics are collected for all columns of the supported types. And finally, if you don't yet know the most common predicates for the table because you are in the exploration phase, you can just optimize the table by coalescing small files into larger files to gain some performance efficiency.
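A small sketch of that file-compaction step as it might be run from a Databricks notebook (Spark SQL via Python); the table and column names here are hypothetical.

```python
# Coalesce small files into larger ones and cluster by a commonly filtered column,
# so data skipping (file-level min/max statistics) can prune files at read time.
spark.sql("OPTIMIZE loan_stats ZORDER BY (addr_state)")

# Queries that filter on the Z-ordered column now scan fewer files.
ca_loans = spark.sql("SELECT * FROM loan_stats WHERE addr_state = 'CA'")
```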

Delta also provides built-in data versioning through its time travel feature, and through an integration with MLflow, customers can automatically track exactly which version of the data was used to train a specific model. This is a great capability: you use DESCRIBE HISTORY and you see exactly the full data lineage, what has happened to the data over time, so it allows you to do governance, full tracking, and auditing. Also, whenever you are deploying software, if something goes wrong, you always love to have the capability to roll back when those incidents happen, and Delta allows you to do that. And so, an ML platform does need ML-specific tools for data and features, but you really want these tools to use the same data management and security model used by data engineering and the rest of your data systems. If you don't have that, you end up having to manage two different security models, you have to spend a lot of resources managing different complex architectures, and you waste time and money copying data between your data engineering tools and your machine learning tools. That's the gist of it, so you see how the lakehouse platform with Databricks provides collaborative and reliable functionality for your AI applications. All right, looks like that was a lot, so let's switch over to the demo and I'm gonna switch my share. All right.

So if you can see my screen, I am presenting a Databricks notebook. I'm using Databricks for the demo, which offers an interactive working environment. To process our queries, we do need compute, so I'm using a Spark cluster, and as I mentioned, the Databricks Machine Learning Runtime comes with a lot of libraries already installed, so I don't need to worry about installation, I don't need to manage any libraries. It's all available for me. I'm just going to use that machine learning

cluster and I am good to go. To start with, I am using data from LendingClub, so I'm going to show you how we design a lakehouse for classifying bad loans for a lender, and because this lecture is about how you prepare and design a lakehouse architecture for artificial intelligence, I will focus mainly on how you design the pipeline. Alright, so this data is from LendingClub. We have gathered a sample dataset of funded loans from 2012 to 2017, and each loan includes information provided by the applicant as well as the current loan status (whether it's fully paid, current, or late) and the latest payment information. So I'm going to first do some environment setup. Here is where I have the Parquet files; my data is in Parquet files, and it comes from the Databricks datasets. So, as you can see, we have Parquet files, if you know the Parquet file format. Using those files, I'm going to create a DataFrame so that I can use Apache Spark APIs to read my data. This is the file path that I need to provide, and then I'm using raw_df here as a DataFrame, reading the data from my file path and loading it into the DataFrame. I'm going to select the columns here, which are the loan status, the interest rate, the revolving utilization per person on their credit, and a few other details which will help me identify what types of borrowers I have, whether I can analyze the trend, and whether I should approve a loan request in the future or not. So I have all of that data, and I'm going to display my raw_df. I already have this running in the interest of time, so I'm going to explain the results and the query.
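A minimal sketch of this read-and-select step, assuming the LendingClub sample Parquet path in the Databricks datasets and standard LendingClub column names (the exact notebook code may differ); spark and display come from the Databricks notebook environment.

```python
# Read the LendingClub sample Parquet files into a Spark DataFrame
file_path = "/databricks-datasets/samples/lending_club/parquet/"
raw_df = spark.read.parquet(file_path)

# Keep only the columns relevant to the bad-loan analysis
raw_df = raw_df.select(
    "loan_status", "int_rate", "revol_util", "issue_d", "earliest_cr_line",
    "emp_length", "verification_status", "total_pymnt", "loan_amnt",
    "grade", "annual_inc", "addr_state", "purpose",
)

display(raw_df)  # interactive table view in the Databricks notebook
```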

So as you can see, because we have the display function in Databricks, I'm able to look at the table view of my data. We have the loan status, as I mentioned; it's either fully paid or current, and there are other statuses as well, and then the interest rate. As a data analyst or data scientist, I am interested in looking at what columns I have and which columns I can select for my featurization, or maybe as predictive input variables, etc. Taking a quick glance at it, I have interest rate and revolving utilization, which is a good column to use for understanding how people are spending their credit. Now I notice that the issue date of the loans and the credit line are in a month-and-year format, and for me to make a better predictive model, it would be helpful to separate out the year, so that's one of the things I can do in my data transformation. Then I see there is employment length, which is a good factor to consider, and then there are measures of the credit score, verification status, and payment. All right, then I have the loan amount. So I picked some of the variables for my analysis, and I know what transformations I need to do for my data, for my machine learning model. Also, this loan status can be a

featurized vector for my machine learning models, so what I'm going to do is change it to a binary digit. Sorry, change it to a value where I have either a bad-loan or good-loan categorization, and I'm going to do that in my transformation. So after looking at the data, now I'm going to apply some transformations to my data. This is the step after you have your data, so I'm going to use SQL functions; PySpark provides a lot of functions that are available for data transformation, and that's why I'm importing that library. Here I'm using a cleaned DataFrame because I want to transform my data so it's more meaningful when I'm doing analysis in my next step. Out of the raw data, what I'm doing is filtering out some of the columns. As I mentioned, I'm going to use the loan status and create a column saying whether it's a bad loan or a good loan, and the way I can say whether it's a bad loan or a good loan is that I keep loans that are either paid off, defaulted, or charged off, and whatever was not fully paid, for example delinquent or unpaid loans, is categorized as a bad loan, so that's what I'm creating here. Then, for some of the columns, as I mentioned, there were months and dates; this is what I'm doing here, applying some transformations to clean up those values and change them into a date or year format. And again, I'm applying some transformations to other values which are useful for understanding credit scores and approving credit for the loan requesters, so I'm going to trim down some of the data values. As you know, whenever you have data, you need to do data cleaning as well, so this is the step for data cleaning.
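A rough sketch of these transformations, using hypothetical DataFrame and column names based on the LendingClub schema. The bad-loan rule shown (keep terminal statuses, label anything not fully paid as bad) is one common way to encode it and may not match the original notebook exactly.

```python
from pyspark.sql.functions import col, regexp_extract, regexp_replace, substring, trim

clean_df = (
    raw_df
    # Keep loans with a terminal status, then flag anything not fully paid as a bad loan
    .filter(col("loan_status").isin("Fully Paid", "Default", "Charged Off"))
    .withColumn("bad_loan", (col("loan_status") != "Fully Paid").cast("string"))
    # Pull the year out of "Dec-2015"-style month-year strings
    .withColumn("issue_year", substring(col("issue_d"), 5, 4).cast("double"))
    .withColumn("earliest_year", substring(col("earliest_cr_line"), 5, 4).cast("double"))
    # Strip text like "10+ years" and "%" signs down to plain numbers
    .withColumn("emp_length", trim(regexp_extract(col("emp_length"), r"(\d+)", 1)).cast("float"))
    .withColumn("revol_util", regexp_replace(col("revol_util"), "%", "").cast("float"))
    .withColumn("int_rate", regexp_replace(col("int_rate"), "%", "").cast("float"))
)
clean_df = clean_df.withColumn("credit_length_in_years", col("issue_year") - col("earliest_year"))
```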

So once I have that, I am actually going to create a view so that I can use it throughout my notebook to query the results as well as to create more tables. After I run this, I'm going to see that I have issue year and earliest year, as well as credit length in years. What I did is subtract the earliest year from the issue year to get this value, so as you can see, after I apply the transformations, I have new columns created here: the bad loan flag, either true or false, issue year, earliest year, and credit length in years. So I applied my transformations and I have the data ready for my next step. Now, what I want my data scientists to get from the data is transactionality and data lineage. As I mentioned earlier, Delta Lake is a key enabler for improving the reliability of your data, and that's why I want to make sure my data scientists can work on reliable data. So what I'm doing here is just some basic

utility steps in case I re-run the demo, so I can clean up existing files; that's why I have them here. But mainly, what I'm doing is creating a database for the Delta Lake tables, and this is the location I'm providing so it can write to this location. Then, I'm going to create a Delta Lake table using the Delta format. Within Databricks, you can also create tables using CSV or using Parquet if you have files in other formats, and then convert them into Delta. Another cool functionality is DESCRIBE DETAIL; this function allows me to see some of the metadata of my files. After I run this, I want to show you how the metadata information is useful for data engineers or data analysts, for example.
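A sketch of this setup step, with hypothetical database, location, and table names carried over from the sketches above:

```python
# Create a database at a known location and persist the cleaned data as a Delta table
spark.sql("CREATE DATABASE IF NOT EXISTS delta_lake_db LOCATION '/tmp/delta_lake_db'")
clean_df.write.format("delta").mode("overwrite").saveAsTable("delta_lake_db.loan_stats")

# DESCRIBE DETAIL surfaces table metadata: format, location, size in bytes, timestamps
display(spark.sql("DESCRIBE DETAIL delta_lake_db.loan_stats"))
```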

They might be interested in seeing what format I have for my data, so Delta is my format. It also allows you to see which database has my data tables, and then the location, created-at, and last-modified timestamps. All of these statistics are made possible by Delta, and you can also see the size in bytes. Alright, so I'm going to do exploratory data analysis first so that I can understand some of my data values as well as what types of trends and what types of analysis I can do with the data. I'm going to first get a count of the loans data because I want to know how many rows I'm dealing with, so a count is the first thing I do for exploratory data analysis, and it looks like I have around 650K rows. Then I want to view my Delta table. I want to see the number of loans across the U.S., how many loans have been taken, and it looks like California has

a lot of loans, then Texas, and then there are somewhat darker areas in New York and Florida, so these are some of the pointers I could see right from the get-go of data exploration. Now, to drill down further into my data, I want to see what types of loans these are, so if I use the purpose column, I will be able to see what types of loans people are taking from my bank. Debt consolidation seems to be the highest, and this is why people are taking loans; the second highest is credit card, so people might have credit card debts. So this allows me to see some trends in why people take loans. I'm going to skip this next one; it's just showing you the location of the Delta files.
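A minimal sketch of these exploration queries against the hypothetical table name used above:

```python
# How many rows are we dealing with?
spark.sql("SELECT COUNT(*) AS loan_count FROM delta_lake_db.loan_stats").show()

# Loan volume by state (the map view in the demo), then by loan purpose
display(spark.sql("""
    SELECT addr_state, COUNT(*) AS loan_count
    FROM delta_lake_db.loan_stats
    GROUP BY addr_state
    ORDER BY loan_count DESC
"""))
display(spark.sql("""
    SELECT purpose, COUNT(*) AS loan_count
    FROM delta_lake_db.loan_stats
    GROUP BY purpose
    ORDER BY loan_count DESC
"""))
```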

Oh, one important thing I would like to call out is the Delta log. This is where Delta is able to provide you the lineage information; this is where it stores the historical context, all the checkpoints of your data, data versioning, etc. Data versioning is very important because you will be able to see at what point in time your data changed and what changed it, so it's a very powerful utility for your data pipelines. Next, I'm going to show ACID transactions as well as how you can integrate batch and streaming together. In the next few cells, what I'm showing is concurrent streaming and batch queries; this notebook will run an insert command every two seconds. This is where I am creating a loop for my two-second data intervals, and then we will run two queries concurrently against the data, and you will see how it works together. Alright, so now I'm going to create a table called loans_aggregation_delta, and then I'm going to select the address state and the count of loans from my DataFrame and assign it a different table name. Once I do that, I'm going to see the aggregated part, and here I want to use my address state and the loan count, so here you can see that, when we try to run this, I will just run it to show you what it looks like.
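A sketch of the concurrent batch-and-streaming setup described here; the aggregate table name mirrors the demo, everything else is an assumption.

```python
# Batch step: materialize the per-state aggregation as a Delta table
spark.sql("""
    CREATE TABLE IF NOT EXISTS loans_aggregation_delta
    USING delta AS
    SELECT addr_state, COUNT(*) AS loan_count
    FROM delta_lake_db.loan_stats
    GROUP BY addr_state
""")

# Streaming step: continuously read the same Delta table and aggregate by state;
# display() renders a live-updating dashboard in a Databricks notebook.
stream_df = (spark.readStream.table("loans_aggregation_delta")
             .groupBy("addr_state").sum("loan_count"))
display(stream_df)
```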

So, because I'm not displaying anything, that's why I'm getting this result, and then we'll get the data aggregation part. Then I'm going to start the read stream and create a temporary view. Once I have that, I will group by address state, and this is where the stream gets initialized. You must have heard about actions and transformations; I am creating an action here on my data by starting a read stream. As you see, as soon as I have this read stream running, it is going to give me a cool dashboard. This is another functionality in Databricks: whenever you have a streaming application, you can see real-time input rates, how the streaming is happening, so as I let it run for a little bit, the input results will change here because, as I mentioned, we are running this every two seconds. Alright, so once we do that, I am going to do that in a loop. What I'm doing here is creating a loop that runs every two seconds, so that our data changes a little bit and we're able to see a noticeable difference. Another thing I'm doing is inserting a value for Iowa, because maybe I forgot to add that value; I don't see Iowa in my data yet, so I'm going to insert it, and once this runs, it will allow this operation to happen, and then I want to review the results. Then here I'm updating a value: for example, in your data project, what happens if you all of a sudden realize that, oh, this row contains a wrong value? Delta allows you to update the columns, so you can update a specific column and set a specific value for the row you are trying to alter. Here in my loans aggregation Delta table, I am actually setting my loan count to 226K for the state of Washington. As you can see after running this, Washington has changed; to compare it to my earlier value, Washington didn't have a significant value, it was 14,000, and here my Washington value has changed from 14,000 to 226,000. That was a significant change.
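And a sketch of the two-second insert loop, the Iowa insert, and the Washington update (Spark SQL from Python; the values mirror the narration, the table name as above):

```python
import time

# Insert a row every two seconds so the streaming dashboard above visibly updates
for _ in range(10):
    spark.sql("INSERT INTO loans_aggregation_delta VALUES ('IA', 5)")  # hypothetical rows
    time.sleep(2)

# Fix a wrong value in place: Delta supports UPDATE on specific rows and columns
spark.sql("""
    UPDATE loans_aggregation_delta
    SET loan_count = 226000
    WHERE addr_state = 'WA'
""")
```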

Another cool thing that you can do in Delta Lake is merge. What happens if there are missing values, if all of a sudden you realize that some of the rows are missing and those are critical for your project? Delta allows you to do MERGE INTO operations, so that you can predict with the model better or maybe have a better recommendation or something like that. So here I'm showing you how you can merge different rows. Here I have items, which are IA and zero; these are the values for address state and loan count, and I'm going to show you how, after running this merge command, I am able to see this in my result. So here I see that Washington still has a 126K value, and okay, I changed California to 25,000. So yes, my California value changed to 25,000. If you remember from the earlier graph, our value for California was somewhere in six digits, yes, it was a hundred thousand, so it drastically reduced. That's why I'm able to see the change, so you see how powerful it is to run updates and merges on the fly. And if you want to delete: for example, delete is very powerful when you maybe have made a mistake in your data, or maybe if there is a GDPR or CCPA law which requires you to delete some PII data. Those are really extreme use cases, but here, just to perform a sample delete, I'm going to delete from the loans aggregation table where the address state equals Washington, so that all my data for Washington gets removed. You can see that there are now no loans from Washington; all Washington loans look good, they are all paid off. Cool.
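A sketch of the MERGE and DELETE operations just described (hypothetical table name, values taken from the narration):

```python
# Upsert rows: update states that already exist, insert the ones that are missing
spark.sql("""
    MERGE INTO loans_aggregation_delta AS t
    USING (SELECT 'IA' AS addr_state, 0 AS loan_count
           UNION ALL
           SELECT 'CA' AS addr_state, 25000 AS loan_count) AS s
    ON t.addr_state = s.addr_state
    WHEN MATCHED THEN UPDATE SET t.loan_count = s.loan_count
    WHEN NOT MATCHED THEN INSERT (addr_state, loan_count) VALUES (s.addr_state, s.loan_count)
""")

# Remove rows entirely, e.g. for corrections or GDPR/CCPA deletion requests
spark.sql("DELETE FROM loans_aggregation_delta WHERE addr_state = 'WA'")
```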

And then Delta also supports schema evolution. If you look at a plain data lake, it doesn't handle schema evolution, but in Delta you can do it, so what I'm doing here is selecting the columns, adding what format I want these columns to be in, and then I'm going to update one of the columns. Earlier, I only had address state and loan count here; I didn't have the loan amount, and I want to add a loan amount column, so I evolve the schema of the table and add the amount column here. I'm going to show you how different states have varied total loan amounts; it looks like Texas has the highest amount of loans taken from our bank. Then another cool feature is traveling back in time. As I mentioned earlier, because Delta provides transactionality and versioning, you will be able to travel back in time, either just for auditing or maybe to do rollbacks. Here, DESCRIBE HISTORY allows you to see exactly which versions of my table have been modified; I can see that I have versions zero to 14. It also shows you the timestamp, which is pretty cool because you can see when your data was changed. You can also see the user who changed your data and what kind of operation they performed, so that you can catch any bad records or do quality control on your data, and then the operation parameters; these are usually very helpful when you are doing performance benchmarking, and they can be useful in a variety of use cases. It also shows you the write mode of the change, the cluster ID that performed it, etc., so it's a pretty cool utility. Then, to restore a version, what I can do is select star from my table and specify what version I want to retrieve. Because I made a lot of modifications after my version six, I will retrieve my version six here, so you can see that California is six digits now; this is the initial, original number of loans that California had, and Washington still has 14,000, so I rolled back all of the changes that I performed on my table. That was pretty cool.
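Finally, a sketch of the schema evolution and time travel steps; the version number to read back would come from your own table's history, and the names are again hypothetical.

```python
# Schema evolution: add a total loan amount column by rewriting with mergeSchema enabled
agg_with_amount = spark.sql("""
    SELECT addr_state, COUNT(*) AS loan_count, SUM(loan_amnt) AS amount
    FROM delta_lake_db.loan_stats
    GROUP BY addr_state
""")
(agg_with_amount.write.format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")   # let the new 'amount' column through
    .saveAsTable("loans_aggregation_delta"))

# Time travel: inspect the table's full change history...
display(spark.sql("DESCRIBE HISTORY loans_aggregation_delta"))

# ...and read an earlier snapshot, e.g. version 6, to audit or roll back changes
display(spark.sql("SELECT * FROM loans_aggregation_delta VERSION AS OF 6"))
```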

Yeah, that's all. This demo was basically to show you how you can perform different functionalities and operations on your data as well as how you can prepare your data lakehouse. After you prepare your data pipeline, it is now ready for downstream applications; for example, data scientists can run a classifier model or do featurization. Yeah, so that's about it. Drew Paulin: Thank you so much Vini. That's great. We want to spend our last few minutes of the presentation addressing any questions that people in the audience have, so if you do have a question, you can either post it in the Q&A panel tool below or you can use the raise hand tool and I will enable your mic so you can ask your question directly to Vini. The one question that is there is from anonymous: What's the best way to get familiar with Databricks? The last handful of times I looked into it, it wasn't cheap to learn the platform. Any advice, Vini?

Vini Jaiswal: Yeah, very good question. As a developer advocate at Databricks, I do want people to learn and leverage whatever free resources we can offer, so this is one of our ways to extend our support to the community, by providing guest lectures or tech talks. Please do check out our tech talks; I have posted them on my LinkedIn. There is also the Databricks YouTube channel, so if you want to learn anything in Databricks, we have a lot of demos about the platform as well as how you can start with the compute resources, and there are also educational materials around that. Because you mentioned it wasn't cheap to learn the platform, I want to mention that there are a lot of ways: we have notebooks there, we also have a GitHub repo which you can just fork and start running on the Databricks platform, and then the Community Edition is the best way, or you can use our trial version with AWS, so stay tuned. We do bring a lot of educational content to the community, and if you have a request to learn something, just let me know; I will do a tech talk or something, because we do have those running all the time at Databricks, yeah. Drew Paulin: That's great, thanks Vini. I also posted something in the chat, just a getting-started docs link that I believe was shared last year in one of the presentations, but that's great advice. There's another question in the Q&A: Do you have any advice for anyone who wants to get into AI or data science in general? So it's a big question, any advice for them?

Vini Jaiswal: Yeah, so I have a lot of advice from what I have seen data people do. First of all, you are attending this class, which means you are already in the program, I believe; this is a great way. The second piece of advice I would give is that even when I was on my academic journey, I would do Coursera courses, some YouTube content, or a boot camp. I have seen some data professionals who don't have a data degree, but they do the boot camps and get started with it. Also, I personally found Kaggle very useful. Just look at some open source project and start iterating; as I mentioned in my lecture, start small and iterate on it. You don't have to think about the big picture. You have a lot of data access available to you now, so just pick any dataset and try to iterate on it to build your profile, and you can update it on LinkedIn. Yeah, I hope that advice is helpful. Drew Paulin: That's great advice. Thanks Vini. Jimmy, who is a course lead for our machine learning at scale course, has a question. Jimmy, I just

let you use your mic. James Shanahan: Great, thanks very much. Thanks Vini for a great talk. As Drew mentioned, I developed a course on large-scale machine learning, and we actually use Databricks as one of the platforms for looking at data at scale, graph data, and also building big data machine learning pipelines, so that's a plug for other people who want to get into Databricks and large-scale machine learning. But I had a question, so, great talk, thank you for sharing all this. So MLflow, how is MLflow doing these days? Is it mature enough as a product? Is there lots of traction out there, and case studies, maybe, that you could share?

Vini Jaiswal: Yeah, that's a very good question, and I absolutely want to thank you for doing the courses on Databricks, that's awesome. MLflow is being widely used. I don't have slides with me, but I can share that it has evolved over time; we started very small. I don't have the numbers handy, but it's pretty much used in production machine learning pipelines for Databricks customers, and a lot of people are using open source MLflow for tracking, experimentation, and projects. Then, in the last year, we evolved to MLflow Models, and soon we will be releasing the deployment piece as well, so we can only make this progress when the community has adopted it. I will share the numbers once I have them handy, but I have seen MLflow being used in production applications, yeah.

James Shanahan: That would be great, Vini, and I have one other question, and that is around deep learning. Maybe you could comment on deep learning and how that works with Databricks and Spark in general; I just wanted to get a sense of where things are today. Vini Jaiswal: Yeah, good question. Recently, I have seen a lot of deep learning use cases around vehicle data and foot traffic data, as well as in the IoT sensor industry, and some of our customers are using it for identifying cancers. If you are using Databricks, there is a GPU version of the ML runtime; that's what is being used for neural networks and deep learning, and we also integrate a lot of open source libraries into our product. As I said, we try to provide a managed platform for any of these tools, so please do check that out, yeah. James Shanahan: Great, thank you so much. Vini Jaiswal: Yes, James. Thanks for your questions. Drew Paulin: Thanks Jimmy, thanks Vini. Any other questions for Vini? Okay, well, I just want to say thank you so much, Vini, for sharing your thoughts and your experience with us, and for walking through the great demo of lakehouse and Delta Lake based applications.

I also want to say a big thanks to Databricks for hosting this lecture series. Please do keep an eye out for the next one in the series, which is planned for later this term. I want to say thank you to Tia Foss and Rob Reid for organizing these events, as well as Gary Morphy-Lum for the Zoom support, and finally, I want to thank everyone in the audience for joining us today. The recording of the presentation will be posted on the event page of the I School website in the next week or two, and I'll post an announcement on Slack once it's up. Thanks so much everybody. Take care. Vini Jaiswal: Yeah, thank you Drew, and I loved the amazing audience interaction. Please feel free to follow me on LinkedIn, and if you connect with me, I can post the resources that you have asked for, yeah. Thank you all.

2022-01-20
