Moving to Databricks & Delta
- Okay. Welcome to Moving to Databricks. Before I dive into the topic, let me first give you some context about my company and what we actually need Spark for. Wetter.com is one of the biggest German weather portals and mobile apps for weather forecasting.
As the name tells you, it is all about weather. We also do a lot of media productions with weather content for many of the major German TV stations. And last year we decided to build up a business around analytical data products that we want to sell to our B2B customers. Let me give you two examples of this. The first, the so-called 'Weather Needs', is where we combine weather data with historical sales data for different product categories.
We train our models to learn the weather impact on the product category, and then combine this with our weather forecast. So we can tell you the weather impact on your product group, like ice cream or even other products like electronic goods, and our customers can use this information to improve their programmatic ads, for instance. The second example I would like to give is related to geo data.
We have a lot of location data from our mobile apps. Here you see a front end we built with one of our partners, GfK. It shows the visits to the city centers of the bigger German cities for this year. You can see a clear drop in spring when COVID-19 started and far fewer people went to the city centers. You can use this for data analysis on retail locations or for other industries like tourism.
Yeah, my name is Carson. At Wetter.com I am responsible for the IT side of all our data products. So I have to make sure that all this works, and yes, we use a lot of Spark, as you might guess. But before we look into our journey to Databricks, let's have a brief look at how our work looked before we used Databricks.
So this is a typical architecture picture from a 10,000-foot view. You might already have seen something like it a couple of times. We get streaming data through our APIs. We get batch data deliveries via S3. For data processing, we use Spark on EMR, Elastic MapReduce, an AWS service.
Our data was stored in S3 buckets in Parquet format. For internal reporting we use Tableau with Athena, a serverless SQL service offered by AWS. For our customers, we don't have a real front end. We have a REST API where they can download the data, or we deliver the data using S3 as a means of transportation.
Yeah. So this worked pretty well. So why did we consider moving to Databricks? Well, the first and main motivation was Delta and what Databricks offers with it. We had the requirement to do GDPR deletes of single records, and this is quite painful if you have big data sets on S3. The same holds for other data corrections: if we did something wrong in our pipelines or had to do some minor enhancements, we always had to do a full recalculation of all our derived data, which took a lot of time and also meant quite some downtime.
And then of course the platform itself. We were looking for a notebook environment that improved our usability and productivity. We will look into this in detail. And with EMR we found out that 80% of our EMR costs came from dedicated development and analytics clusters and not from automated jobs, because everyone was using a cluster of their own, which was idle most of the day while someone was just trying to figure something out.
So that's something we hoped to improve as well. And finally, MLflow was interesting for us because in our existing projects we had some hiccups regarding the handover between data scientists and data engineers when it comes to improving the model, and we had started to build a custom solution. In this talk, I will not go into detail on this, because we don't have that much experience with MLflow yet.
So, let's migrate to Databricks and Delta. Let's first look at the infrastructure side. This was the architecture picture I showed you two minutes ago. What do we do to migrate to Delta and Databricks? Instead of storing Parquet files on S3, we use Delta tables; instead of EMR, we use Databricks clusters. Some PowerPoint magic and the migration is done. Of course, this didn't happen in an instant. There was some work involved.
You have to make some decisions on the way. And for the rest of the presentation, I would like to share the experience we gained during this journey. The first thing you have to consider when you start using Databricks is the Databricks workspace, and here you have different options. You can use a single classic workspace, which means one workspace for all your stages and projects. We have our projects and stages separated on AWS, and you need that separation in Databricks as well. You can do this logically within one workspace or physically by having different workspaces.
The downside of these classic workspaces is that they must be created manually by Databricks, so there is no automation on your side. Of course you can have multiple of these workspaces, but they all have to be set up and configured manually.
And if you don't have a single sign-on solution that you can integrate with Databricks, this also means user management per workspace. Then there is a new feature, the so-called Account API, which was released by Databricks quite recently. With this new API you can create these workspaces on your own. This wasn't available when we moved to Databricks, so we decided to go for a single workspace, and we haven't hit any limitations yet.
We have all our data in one AWS account anyway, so a single workspace is the best solution for us. If you need multiple workspaces, I think it makes sense to take a deeper look at the Account API.
To set up the workspace, Databricks requires a few things from us. First, an S3 bucket where all the Databricks workspace data is stored, like the notebooks and other things. Databricks also needs an AWS IAM role to be able to deploy from their account into our account. And of course you need some instance roles for the EC2 instances that make up a cluster, similar to the EMR instance roles. Then, during workspace creation, Databricks creates a VPC and other related AWS resources in your account. The downside here is that you cannot reuse an existing VPC, so we had to redo our VPC peerings.
You can provide a specific IP range for this VPC, but this is something you have to do manually before Databricks creates the workspace. With the new API, I think it's also possible to deploy into existing VPCs, which might have simplified things for us, but it wasn't too much work anyway.
Looking into security and combining Databricks and AWS: we now have two logins, as we don't have single sign-on, one for AWS and one for Databricks. On the Databricks side, each user must have a valid email address. This also applies to technical users, which is a little bit of a hiccup. For users, you can create tokens, and you use these tokens to access the API for running automated jobs.
Of course you can restrict access to the different clusters based on users or groups you've defined in your Databricks workspace. And what is really nice, and what you can see on the screenshot: you register the available EC2 instance roles in Databricks, and you can restrict access to these instance roles as well. This has the benefit that even if users are allowed to spin up and configure their own clusters, they are only allowed to use the instance roles of their projects. So they cannot take another instance role to look into some other project's data. This is something we use, and it works really well.
There is also the possibility of configuring IAM passthrough. That means the instances of your cluster don't run with a specific role but use the IAM role of the AWS user you logged in with. We cannot use this, as we don't have single sign-on with AWS, which is a prerequisite here. I think it would make sense for sharing clusters even among projects, something we cannot do for now. When it comes to cluster configuration, nothing was really special. We reused the init script we had from EMR.
That's mainly pip installs of a few additional packages. Something different here is that for the worker instances you have to decide on one instance type. So you cannot use instance fleets as with EMR and have the cluster pick from a set of instance types, but you can mix on-demand and spot instances.
What we usually use for our daily clusters is one on-demand instance for the driver, followed by spot instances. And the autoscaling works quite nicely. For our development clusters, we usually have a minimum of two workers and a maximum of twelve or twenty workers, depending on the workload we are running.
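As a rough sketch, the cluster setup described here (one worker instance type, an on-demand driver followed by spot workers, autoscaling from two to twelve workers, and a project-specific instance role) could be expressed as a request body for the Databricks Clusters API. The field names follow the Clusters API 2.0; the cluster name, node type, runtime version and ARN are placeholders, not the values used in the talk.

```python
import json


def dev_cluster_spec(name, node_type, instance_profile_arn, spark_version,
                     min_workers=2, max_workers=12):
    """Build a body for POST /api/2.0/clusters/create (illustrative values)."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        # Exactly one worker instance type; there is no instance fleet here.
        "node_type_id": node_type,
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # The first node (the driver) is on-demand, the rest are spot.
            "first_on_demand": 1,
            "availability": "SPOT",
            # Instance role registered in the workspace for this project.
            "instance_profile_arn": instance_profile_arn,
        },
    }


if __name__ == "__main__":
    spec = dev_cluster_spec(
        name="dev-project-a",                      # placeholder
        node_type="i3.xlarge",                     # placeholder
        instance_profile_arn="arn:aws:iam::123456789012:instance-profile/project-a",
        spark_version="6.4.x-scala2.11",           # placeholder runtime key
    )
    print(json.dumps(spec, indent=2))
```

Raising `max_workers` to 20 for heavier workloads only changes the `autoscale` block; the driver stays on-demand either way.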
If nobody uses the resources, the cluster scales down automatically, and it scales up when more than one query, or larger queries or jobs, are executed on the cluster. This really helps in reducing the development costs, which were the major portion of our cloud costs. Yeah, so much for the infrastructure. Let's have a look at our Spark pipelines and how we migrated our data and applications. The simplest thing was to convert the Parquet files we already had to Delta tables.
For this, you just use the CONVERT command. It requires some compute resources for analyzing the data, but we haven't experienced any issues here. Delta also comes with a few other new features and SQL commands. The first is OPTIMIZE. What does it do? OPTIMIZE takes a lot of small files and combines them into larger ones. In our jobs, we didn't write small files.
We had that configured quite well from our time with EMR. But when we did some updates or deletes on a table, we really generated thousands of small files. And this is something you will have to mitigate using OPTIMIZE.
You can do this periodically as a housekeeping job, or you can configure it as a table property so that it's done as part of the DML, but it's something you really should do. There's also a Z-order feature, which improves selective queries.
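The Delta commands mentioned so far could look roughly like the statements built below. This is a minimal sketch: the helper names, table names and S3 paths are made up for illustration, and it assumes that the table property the speaker refers to is Databricks auto optimize (`delta.autoOptimize.*`). On a cluster, each string would be passed to `spark.sql(...)`.

```python
def convert_to_delta_sql(parquet_path, partition_schema=None):
    """CONVERT TO DELTA for an existing Parquet directory."""
    sql = f"CONVERT TO DELTA parquet.`{parquet_path}`"
    if partition_schema:
        # Partitioned data needs the partition columns and their types.
        sql += f" PARTITIONED BY ({partition_schema})"
    return sql


def optimize_sql(table, zorder_by=None):
    """OPTIMIZE compacts small files; ZORDER BY is optional and costs more."""
    sql = f"OPTIMIZE {table}"
    if zorder_by:
        sql += " ZORDER BY (" + ", ".join(zorder_by) + ")"
    return sql


def auto_optimize_sql(table):
    """Enable compaction as part of writes (assumed: Databricks auto optimize)."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "delta.autoOptimize.optimizeWrite = true, "
        "delta.autoOptimize.autoCompact = true)"
    )


if __name__ == "__main__":
    # Hypothetical path and table names.
    print(convert_to_delta_sql("s3://example-bucket/visits", "dt STRING"))
    print(optimize_sql("analytics.visits"))
    print(optimize_sql("analytics.visits", ["city_id"]))
    print(auto_optimize_sql("analytics.visits"))
```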
This cannot be configured as a table property because it's more expensive. We already had something similar, not that well implemented but more of a rough version, using partitioning and sorting within partitions.
We just kept that, and Z-ordering on these tables really didn't change anything, so it's something we're not using yet. The last command, VACUUM, is really important. If you start doing deletes on your Delta table, this will generate new Parquet files, and the old files will stay on S3, which is great for the Delta time travel feature. But after the retention period, these files are not needed anymore and they just fill up your S3 bucket.
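A periodic cleanup of old files could be built along these lines. This is only a sketch: the helper name and table name are illustrative, and the seven-day floor mirrors Delta's default retention safety check rather than anything stated in the talk.

```python
def vacuum_sql(table, retain_days=7):
    """Build a VACUUM statement that removes files older than the retention."""
    if retain_days < 7:
        # Delta refuses shorter retention unless its safety check is disabled;
        # shorter values also cut into time travel. Assumed policy, not the talk's.
        raise ValueError("retention below 7 days is not allowed in this sketch")
    return f"VACUUM {table} RETAIN {retain_days * 24} HOURS"


if __name__ == "__main__":
    # On a cluster this would run as spark.sql(vacuum_sql("analytics.visits")).
    print(vacuum_sql("analytics.visits"))
```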
So you need some periodic housekeeping. Let's have a look at where to store the metadata of our tables, and first at the table metadata itself, because with Delta, for every table we had in Parquet, we now have two tables.
First, the Delta table, which I call the native Delta table, which we got by converting our existing Parquet data or by creating new tables. But unfortunately this table cannot be accessed by Athena. And as we used Athena for Tableau, since we don't have a cluster that is running all the time, we wanted to continue with Athena.
So we have to create a second kind of table. I don't know if that's the official name for it; I call them symlink tables. It's another external table that allows Athena, as a tool using JDBC, to access the Delta tables. Of course, you don't have features like time travel and DML there; you can just read the data for reporting.
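Assuming the symlink tables are the manifest-based integration described in the Delta Lake documentation for Presto and Athena, the two pieces could be generated as follows. The first statement runs in Spark and writes a manifest next to the Delta table; the second is Athena DDL for an external table that reads it. Helper names, table names, columns and paths are hypothetical.

```python
def generate_manifest_sql(delta_path):
    """Spark SQL: write the symlink manifest for a Delta table."""
    return f"GENERATE symlink_format_manifest FOR TABLE delta.`{delta_path}`"


def athena_symlink_table_ddl(table, columns, delta_path, partition_cols=None):
    """Athena DDL for a read-only external table over the manifest."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE EXTERNAL TABLE {table} ({cols})\n"
    if partition_cols:
        parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_cols)
        ddl += f"PARTITIONED BY ({parts})\n"
    base = delta_path.rstrip("/")
    ddl += (
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'\n"
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'\n"
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'\n"
        f"LOCATION '{base}/_symlink_format_manifest/'"
    )
    return ddl


if __name__ == "__main__":
    path = "s3://example-bucket/delta/visits"  # placeholder
    print(generate_manifest_sql(path))
    print(athena_symlink_table_ddl(
        "reporting.visits",
        [("city", "string"), ("visits", "bigint")],
        path,
        partition_cols=[("dt", "string")],
    ))
```

For a partitioned table, Athena additionally needs the partitions registered (for example with `MSCK REPAIR TABLE`), and the manifest has to be regenerated after the Delta table changes.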
The next question is where to store the metadata. We had all our Hive metadata, so the information about our tables, in the AWS Glue Data Catalog. With the Databricks metastore, you have a second option. We continued with the Glue Data Catalog because it's a place that can be accessed from tools like Athena. It's also a place you can access from different Databricks workspaces you might have in the future. The only downside is that it's not officially supported with Databricks Connect.
We will come to this later. But let me tell you, for us it works. We haven't had any big issues here, or any issues at all. So we stick with the Glue Data Catalog for our metadata. Surprisingly, the biggest effort we had was with our workflows. We used AWS Step Functions, which is a serverless workflow engine.
Here you can see a typical workflow for us. We have a lot of similar steps running sequentially: first the cluster is created, and then all the steps are executed. For each step, and that is why this picture is so big, Step Functions starts the job and waits until it's complete. We use Lambda functions for adding the steps to the workflow. Access control is implemented with IAM. And yeah, we had a nice overview. With the introduction of Databricks, this changed a little bit.
So this is the same workflow you see on the slide, and now it's just a tiny workflow. This is not because the complexity is gone; it's still there, but it's hidden. Why is that? With Databricks, you have two types of clusters: all-purpose clusters and job clusters. Job clusters are much cheaper in terms of Databricks units, so it makes sense to use job clusters for periodic jobs. The downside is that you can submit just one Spark job, which gets executed, and when the job has finished, the cluster terminates. So we could have used our detailed workflow, but then a new cluster would have to be created for every step.
This of course really delays the workflow. For the production workload, maybe that's still okay, but for a test workflow, spinning up all the clusters takes more time than running the complete workflow on a test dataset.
So we did some refactoring here and created our own workflow in Python to launch all the steps in sequence, with the disadvantage that we don't have this graphical overview anymore. And I think this applies to any workflow tool you use, because this is not specific to Step Functions but to these job clusters.
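The talk does not show the team's own workflow code, so the following is only a hypothetical sketch of the idea: a single Spark job, submitted once to a job cluster, that runs all pipeline steps in sequence, logs progress in place of the graphical overview, and stops at the first failure.

```python
import time


def run_steps(steps, log=print):
    """Run (name, callable) pairs in order and stop at the first failure.

    Returns the names of the completed steps.
    """
    completed = []
    for name, step in steps:
        log(f"starting {name}")
        started = time.monotonic()
        try:
            step()
        except Exception as exc:
            log(f"{name} failed: {exc}")
            raise RuntimeError(f"workflow stopped at step {name!r}") from exc
        log(f"finished {name} in {time.monotonic() - started:.1f}s")
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder steps; real ones would call the Spark pipeline functions.
    run_steps([
        ("load_raw", lambda: None),
        ("aggregate_visits", lambda: None),
        ("export_reporting", lambda: None),
    ])
```

Because every step shares one cluster, a test run pays the cluster start-up time only once.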
And finally, let's have a look at what we had to do with our applications, with our pipelines. There are some steps where we didn't do the processing with Spark but with Athena. We started with that last year.
But Athena cannot write to Delta, so we had to migrate these steps to Spark. It's not a big effort to put a SQL step into a Spark program, but it has to be done. The next thing we had to change is how we delete data.
We usually process the data day by day. So in order to handle late-arriving data, we deleted the last two or three days and just recalculated the complete data. We did this by really deleting objects on S3. This is something you must not do with Delta tables, because the Delta table would then still reference the deleted files. Instead, you use the DELETE command. And if you run the delete on a complete partition, it just returns immediately. So it works even faster than the S3 delete.
For writing data, you can either use the INSERT command or, if you use the write method of the data frame, you just swap the parquet format for the delta format, and there you are. The nice thing is that if you do this, there's no explicit adding of partitions afterwards; this is handled by the Delta tables themselves. Okay.
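The recalculation pattern described here might be sketched as below: delete the most recent days through Delta's DELETE command and append the recomputed rows with the data frame writer in `delta` format. Table and column names are placeholders, and `spark` and `df` stand for an existing session and the recomputed data frame.

```python
from datetime import date, timedelta


def delete_recent_days_sql(table, partition_col, run_date, days_back=3):
    """DELETE statement covering run_date and the days before it."""
    first_day = run_date - timedelta(days=days_back - 1)
    return f"DELETE FROM {table} WHERE {partition_col} >= '{first_day.isoformat()}'"


def recalculate(spark, df, table, partition_col, run_date, days_back=3):
    """Replace the last days of a Delta table (needs a Spark session)."""
    # Never remove the objects on S3 directly; let Delta record the delete.
    spark.sql(delete_recent_days_sql(table, partition_col, run_date, days_back))
    # Same writer as before, only the format changes from "parquet" to "delta".
    # No explicit partition registration is needed afterwards.
    df.write.format("delta").mode("append").saveAsTable(table)


if __name__ == "__main__":
    print(delete_recent_days_sql("analytics.visits", "dt", date(2020, 11, 18)))
```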
So much for migrating our applications. Let's have a look at how analyzing data in our data lake, and developing and working with the data platform, has changed. The first nice thing we observed is that the Python packages we installed on the cluster are available in the notebooks as well. The notebook runs on a cluster node, I think on the driver.
So you really have the same binaries. In the past, with EMR clusters, we had some issues with visualization packages like Folium.
We couldn't import it because the binaries were on the cluster but not in the Docker container where the notebook was running. Now it works seamlessly. Regarding accessibility, it's nice that immediately after you log into the Databricks workspace, you can access your notebooks even without a running cluster. So if you just want to look at some code or at the last results, which are stored in the notebooks, or just want to look something up while the cluster is still spinning up, the notebooks are already accessible. Also, attaching the notebook to another cluster is just a click.
There's no need to restart any components, as we had to with EMR notebooks. Usability improved as well, which was also one of our goals. The code completion works really nicely in the Databricks notebooks. And I myself use a lot of SQL commands directly.
If I just want to do some analytics and look into the data, the Databricks notebook lets you do a quick visualization and plot some graphs or bar charts to get a more meaningful insight. The table representation is also much better, and the data is much more readable than with EMR.
The last point is Databricks Connect, which I already mentioned. If you don't like or don't want to work with notebooks, then Databricks Connect is definitely the tool for you. Once you install it, it allows you to submit a Spark job from your laptop, with the programs you just changed locally, to an already running all-purpose cluster in the cloud. So you can use the same environment you will use for the production workload, and you directly see the driver output on your console. That really worked great. But one thing you have to be aware of is time zones. I'm located in Germany, and our AWS account is in Ireland, so those are different time zones. That gave me some funny side effects, and it took me some time to work this out.
Then, of course, very important: how does collaboration work, and how does Databricks integrate with Git and Bitbucket? We use Bitbucket. You have some built-in collaboration, like co-editing in Office 365.
You can have two people editing the same notebook at the same time. You see that another user is working on your notebook, and you see the cell he or she is working on. That's a really nice and fun feature, but more important, of course, is the Git integration. There is a direct Git integration within the Databricks UI.
So if you have a notebook, you can link it to a repository and a branch, but you have to link every file manually. Of course, we already had a lot of notebooks from our time developing on EMR, and you couldn't do a bulk upload and have them all linked. But luckily there's another way, and that's the approach we ended up using.
If you install the Databricks CLI, something you want to do anyway, I guess, there are commands for workspace export and import. So I check out all the code from Bitbucket on my local laptop and do a Databricks workspace import, and I have the latest versions in my Databricks workspace.
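The round trip between Git and the workspace could be scripted roughly as follows. This sketch assumes the legacy `databricks-cli` package and its directory variants of the commands (`workspace import_dir` and `workspace export_dir`, each with `--overwrite`); the talk only mentions "workspace import" and "export", and the local and workspace paths here are placeholders.

```python
import subprocess


def import_cmd(local_dir, workspace_dir):
    """CLI call that uploads a local checkout into the workspace."""
    return ["databricks", "workspace", "import_dir", "--overwrite",
            local_dir, workspace_dir]


def export_cmd(workspace_dir, local_dir):
    """CLI call that downloads workspace notebooks into the local checkout."""
    return ["databricks", "workspace", "export_dir", "--overwrite",
            workspace_dir, local_dir]


def run(cmd):
    """Execute a CLI call; requires an installed and configured CLI."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands; call run(...) on them once the CLI is configured.
    print(" ".join(import_cmd("./notebooks", "/Users/someone@example.com/project")))
    print(" ".join(export_cmd("/Users/someone@example.com/project", "./notebooks")))
```

After the export, the usual `git add`, `git commit` and `git push` on the local checkout bring the changes back to Bitbucket.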
I can work with them and modify them. Once I'm ready to push this back to Git, I do the export to my local machine and then a git push. This also works well if you change more than one file. We tried both ways, but everyone in our team ended up with the second.
Okay. Let's have a brief summary of what we've seen today. One point I haven't really mentioned is performance, because performance wasn't a driver for us. But of course, before we migrated, we did a two-week POC with Databricks, which really helped us a lot. And of course, we looked into performance there. We took, for example, two dedicated steps of our workflow and roughly observed a factor of two. We kept the Spark version the same, so we are still on Spark 2.4.5, not on Spark 3 yet, and used the same instance types from AWS and the same cluster size. Without any other changes, we got a factor of two. In total, this is hard to compare for us, because our pipelines and their functionality keep growing, so you really cannot compare to previous months. Anyway, that wasn't that important for us.
More important, of course, is the topic of costs. This is not that straightforward to compare, because the costs for Databricks plus EC2 are higher than for EMR plus EC2. But we save resources by sharing these autoscaling clusters.
We did some scenario-based calculations here beforehand. If you also add the productivity improvements and the DML capabilities, the business case is really positive for us. And that's of course what's important to convince my management that we can use Databricks.
The migration effort, as you've seen in this presentation, was mainly for changing the workflows. What really helped us here is that we had automated integration tests in place. So after all the migration steps, we just ran our integration tests and checked every single Spark job. And there was hardly any work for the notebooks: you just import them, and they are converted to Databricks notebooks.
And there you go. Yeah, that's all. If you have any questions, I will be available in the chat for some more time. If you have questions later or would like to learn more about our data and our meteonomics products, you can contact me; you have my contact details. And finally, of course, I'd ask you: don't forget to rate and review the sessions.
And enjoy the rest of the Data and AI Summit.