Data engineering is one of the fastest-growing fields. As more and more data is generated, companies need people who can manage and process it at scale. Data engineers are often paid as much as, or more than, software engineers, and a lot of people want to get into this field. However,
it is confusing. Just take a look at this big data landscape: there are so many different tools on the market that it is hard to know where to get started. Even if you do get started, you might get lost in the learning process. This video is your ultimate guide to becoming a data engineer. I will give you a clear and concise roadmap. On top of this, I will also provide you with eight
different end-to-end data engineering projects. So, this is not just a theoretical roadmap; you will gain practical experience if you follow the steps that I give you in this video. I have been working in this field for the last five years. I started out in a regular job,
then I became a full-time freelancer. I work with a lot of startup companies and also with big companies like Ware, so I understand how data engineering practices are being performed in smaller companies as well as in large companies. Therefore, I will give you a clear understanding of how these different tech stacks are being used in all these different companies, so that you will have a better understanding of what to focus on and what not to. Before we talk about different tools and the skill set to become a data engineer, I want to mainly talk about having the right mindset. It's not just about data engineering, but even if you decide to
learn anything, you need to have a positive mindset that you can become something — you can become a data engineer, data scientist, or whatever your goal is. If you're watching this, that means you want to become a data engineer. So, the first thing that I suggest is to have a positive mindset that you can become a data engineer. Do not let negative thoughts, such as you not being capable enough or you not being smart enough, stop you. Have a positive mindset. The second thing is you need to be fully focused while you execute this roadmap. We live in a world of distractions — you have social media, your phone, gaming, and all other things. So, find
your distraction and make sure for the next six to eight months, you remove all of this distraction and just focus on executing this roadmap. Believe me, if you stay fully focused and remove all distractions, then no one can stop you from becoming a data engineer in the next six months. To help you in this process, I will give you a challenge at the end of this video, where I will provide you with a quick guide on how you can stay consistent, learn in public, and also grow your network. So once you acquire all the skill sets, you will also have different opportunities available to you. I highly suggest you to watch this video from start to end. Get a pen and paper and start taking notes so that you understand and remember all these things. And before we start,
I would appreciate it if you could hit the like button on this video. That helps this channel to grow and also keeps me motivated to make more videos. And if you are new here, then don't forget to hit the subscribe button. Let's get started.
So here's the situation: you might be at different stages of your journey. You are completely new, or you know a few tools but want to know where to go next. The first thing I always suggest to people who are just getting started and do not have a technical background is to clear their computer science fundamentals. This is the core — the bread and butter of everything we do on the Internet. Understanding and having strong computer science fundamentals will help you in the long run.
I'm not telling you to go to college and get a degree. All I'm saying is that you should understand the basics of computer science: how code is compiled and executed, the basics of data structures and algorithms, and the building blocks of programming languages such as loops, conditional statements, and variables. For this, one of the best resources is available on YouTube completely free: CS50, provided by Harvard University. If you go on YouTube, you will find this playlist. It has everything you need to clear your basic computer science fundamentals. All I recommend is that you watch the first five videos. This will
give you enough understanding of computer science fundamentals. If you spend just two to three hours daily on this, you can finish these five videos within a week. Once you clear your basic computer science fundamentals, you need to take one step forward and work on your foundational data engineering skill set. There are two skills that you need to focus on, and you might already know them: the first is a programming language, and the second is SQL, Structured Query Language. Now, people used to say you can learn Java, Scala, or Python, but these days companies mainly prefer Python for data engineers. So you can just learn Python and get started with your
data engineering career, so that you don't get confused between multiple languages. The reason to learn a programming language is basically you will be automating some of the workflows, you will be writing some transformation jobs, you will be deploying some of the data pipelines. So, you need to have a basic understanding of how to do all of these things programmatically. And the same with SQL, Structured Query Language. Most of the data that gets stored is in databases. Now, SQL is the way we communicate with all of these databases. So, if you want to insert,
retrieve data, delete some records, or update some records, you can easily do that using SQL. SQL has become the universal data language, so with almost any relational database or data warehouse, you will be writing SQL. Learning Python and SQL is non-negotiable. Now here's the good news: when I was
learning about all these different things, I had to refer to multiple blogs, videos, and courses to understand how these things are performed at the data engineering level. Now, just to solve this problem, and if you have been following me, you will know that I have like a dedicated niche course for Python and SQL for data engineering. These courses are specially tailored for data engineers. So, if you are a data engineer who wants to understand how Python is used from the data engineering point of view or how SQL is used for data engineers, then I have created dedicated courses for that only. These courses are taken by more
than 5,000 people, and they all love it. The way I break down the complex topics and make you understand all of these different things in a simple manner using my real-world examples will make you fall in love with the process of learning data engineering. These courses are completely hands-on, and there are some amazing projects like Spotify Data Pipeline and many other projects. So, the only resource that I will suggest you
for the core foundation of data engineering is my Python and SQL for Data Engineering courses. I will also add some free resources, so if you don't want to take this course, then you can also go from the free resources. Again, I spent around two to three months preparing for all of these different courses, so I will encourage you to at least check them out — Python and SQL for Data Engineering. And I'm also running a special discount on all of these different courses, so you can find all of this information in the description. Doing this much will give you a strong foundation to start your journey as a data engineer. Now,
what you need to do is focus on highly demanded tools and skills in the market. There are hundreds of tools available in the market; we just want to focus on the highly demanded tools that can give us opportunities to get a job. So now, I'm going to suggest a different skill set that you need to acquire to build your core data engineering skill set. Here's a different approach that I will suggest: one part is just learning about the tools, but you also need to understand the core foundations of data engineering. Why do we perform data engineering? For the learning approach here, it's a little bit different — just pay attention right now, so that you understand the entire process clearly.
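To make the earlier SQL point concrete, here is a minimal sketch of the four operations mentioned above (insert, retrieve, update, delete), run from Python. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the customers table, its columns, and the rows are made up for this example, and a real project would connect to whatever database your company uses.

```python
import sqlite3


def run_crud_demo():
    """Insert, update, delete, and then read rows from a throwaway database."""
    # ":memory:" creates a temporary database that disappears when we close it.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Hypothetical table, used only for this illustration.
    cur.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
    )

    # INSERT: add three records (the ? placeholders keep values out of the SQL text).
    cur.executemany(
        "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
        [(1, "Asha", "Pune"), (2, "Ben", "Leeds"), (3, "Chen", "Taipei")],
    )

    # UPDATE: change one record.
    cur.execute("UPDATE customers SET city = ? WHERE id = ?", ("London", 2))

    # DELETE: remove one record.
    cur.execute("DELETE FROM customers WHERE id = ?", (3,))
    conn.commit()

    # SELECT: retrieve what is left, in a predictable order.
    rows = cur.execute(
        "SELECT id, name, city FROM customers ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_crud_demo():
        print(row)
```

The same four statements carry over to most relational databases and warehouses; mostly the connection code changes.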
Here's the thing: when you start watching videos, you will get bored after an hour or two, and once you get bored, you might jump to something else: you might watch some random YouTube videos or go on Instagram and scroll through reels. To avoid the boredom, what we really need to do is replace those activities with other learning material. This is what I'm going to suggest: we will be doing two things in parallel. One, I will recommend a book that you can read in the background, and two, you can do a course to learn your core data engineering skill set. Let's say you decide to spend 2 hours daily learning data engineering. You can spend around 1 to 1.5 hours learning from courses or video material, and the remaining time reading the book. That way, you learn the highly demanded tools in the market by watching courses, and at the same time you gain theoretical knowledge of the core foundations of data engineering.
So the book that I recommend you read is "Fundamentals of Data Engineering" by Joe Reis and Matt Housley. I have the hard copy; you can buy the hard copy, or you can get the ebook online. This is one of the best books available on data engineering foundations, and I highly recommend reading it while you are on this journey of becoming a data engineer, because most people learn the tools and the technologies, but their fundamentals are very weak. Fundamentals don't change for 10 to 15 years;
that's why this book will give you strong foundations in data engineering. I know it might sound a little bit confusing, but the way it works is that you read this book in the background whenever you are free or whenever you have the time, because you can't watch videos all the time, but you can read a book. You can read a paragraph, one page, or an entire chapter whenever you get time. So this way,
you will stay focused, and you will enjoy the process of learning. Now, while you are reading this book, I will also recommend you to do a course to learn about the highly demanded tool in the market. The next core data engineering skill set that I recommend you is to learn about the data warehouse. Everything you do as a data engineer will eventually get stored inside the data warehouse. This is where businesses generally start extracting value from the data. So, if we want to find out, let's say, what was the last five years of revenue,
or how many products we sold this year compared to last year, you can find answers to all of these questions in the data warehouse, because it is built for analytical queries. So again, learning about data warehouses has two parts: one is learning the foundations, and the second is learning a highly demanded tool in the market. The foundations of data warehousing include understanding OLAP and OLTP systems, extract, transform, load (ETL), ER modeling, and dimensional modeling, which means understanding fact and dimension tables. There is also a book on data warehouses called "The Data Warehouse Toolkit" by Kimball, but you don't really have to read it, and I will tell you why in a while. After clearing your data warehouse fundamentals, you can learn one tool where you can practice all of this foundational knowledge. There are many data warehouse tools available, such as Snowflake, BigQuery, and Amazon Redshift. As per my recommendation,
you should definitely learn Snowflake, because many companies are moving from traditional data warehouses to Snowflake. Snowflake is a modern cloud data warehouse, so I highly recommend adding it to your skill set. Now, where to learn all of this? Again, I have created a detailed course, one of the most in-depth courses you will find in the market, designed especially for data engineers. The
book that I told you about, "The Data Warehouse Toolkit", I have already referred to that book to create this course, so you don't have to read this book by yourself. You can just complete this course; you will get an understanding of the data warehouse core fundamentals, understand how to build a modern data warehouse using the Snowflake database. It took me 2 to 3 months to prepare this course. I've referred to multiple blogs, courses, books to create and bring everything at one place, so I will highly encourage you to at least check these courses out. I have put all of my hard work behind all of these courses.
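As a small illustration of the fact and dimension tables mentioned above, here is a sketch of a tiny star schema and the kind of analytical question a warehouse answers (units and revenue per year). It runs on Python's built-in sqlite3 module only so that it is self-contained; the table names (fact_sales, dim_date, dim_product) and all the numbers are invented for this example, and a real warehouse such as Snowflake would hold far more data.

```python
import sqlite3


def revenue_by_year():
    """Build a tiny star schema and return (year, units, revenue) per year."""
    conn = sqlite3.connect(":memory:")

    # Dimension tables describe the "who/what/when"; the fact table holds
    # the measurable events (sales) and points at the dimensions by key.
    conn.executescript(
        """
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_sales (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity INTEGER,
            revenue REAL
        );

        INSERT INTO dim_date VALUES (20220115, 2022), (20230310, 2023), (20230722, 2023);
        INSERT INTO dim_product VALUES (1, 'Keyboard'), (2, 'Monitor');
        INSERT INTO fact_sales VALUES
            (20220115, 1, 2, 100.0),
            (20230310, 1, 1, 50.0),
            (20230722, 2, 3, 600.0);
        """
    )

    # Analytical query: join the fact table to a dimension and aggregate.
    rows = conn.execute(
        """
        SELECT d.year, SUM(f.quantity) AS units, SUM(f.revenue) AS revenue
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY d.year
        ORDER BY d.year
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for year, units, revenue in revenue_by_year():
        print(year, units, revenue)
```

The pattern to notice is that questions like "how did this year compare to last year" become a join from the fact table to a dimension followed by a GROUP BY.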
Imagine this situation: you are reading a book in the background, "The Fundamentals of Data Engineering", so whenever you are free or whenever you find time, you can read a page or a paragraph, and you are also focusing on learning the core data engineering toolset from different courses. This will start feeling like magic as you go forward in this process. Once you finish learning about the data warehouse, the next thing you need to focus on is data processing. Data processing is the core of data engineering, because we get data from multiple places (RDBMS, web analytics, sensors), in multiple formats and at different frequencies. You need to write logic to bring all of this data to one place in a structured format, and you need a proper tool to process it. Data generally gets processed in two different ways: one is batch processing, where you take a chunk of data and process it daily or weekly as per the requirement, and the second is real-time data streaming, as you see on Google Maps or Amazon. Right? You get notifications in a
real-time manner. So as soon as the data comes in, you need to process it and pass it forward. One tool for batch, and even real-time, data processing is Apache Spark, one of the most highly demanded tools in the market. It is used by big organizations like Google, Microsoft, and many more. You can learn it the same way: first, you learn the foundations of Apache Spark, such as the core architecture, the higher-level APIs, and the different functions available, and then you learn a platform that runs Apache Spark for you. There are many such platforms in the market, such as Databricks, AWS Glue, Dataproc, and many more. My recommendation is to learn Apache Spark with
Databricks, and you will be writing PySpark, the Python API for Apache Spark. I'm working on my Apache Spark course, so it is not launched yet at the time of creating this video, but you can follow me if you want to get updates on this. I will add some good courses and resources where you can learn Apache Spark and Databricks in the final document that you will get at the end of this video. Now, for batch processing, you can learn Apache Spark, but for real-time data streaming, you can learn another highly demanded tool called Apache Kafka. Apache Kafka is a distributed event store and stream processing platform, so you can process your data in real time.
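The difference between batch and streaming described above can be sketched in a few lines of plain Python. This is only a conceptual illustration and does not use the Spark or Kafka APIs; the EVENTS list and both function names are hypothetical. The batch function waits for the whole chunk and produces one result, while the streaming function updates its running totals and emits a result as soon as each event arrives.

```python
from collections import defaultdict

# Hypothetical events, standing in for rows from a database or sensor readings.
EVENTS = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]


def batch_totals(events):
    """Batch: the whole chunk is available, so compute one result at the end."""
    totals = defaultdict(int)
    for event in events:
        totals[event["user"]] += event["amount"]
    return dict(totals)


def streaming_totals(event_source):
    """Streaming: update state per event and emit the new total immediately."""
    totals = defaultdict(int)
    for event in event_source:
        totals[event["user"]] += event["amount"]
        # A downstream consumer would see this value right away.
        yield event["user"], totals[event["user"]]


if __name__ == "__main__":
    print("batch:", batch_totals(EVENTS))
    for user, running_total in streaming_totals(iter(EVENTS)):
        print("stream:", user, running_total)
```

Both functions reach the same final totals; what changes is when results become available, which is the trade-off you choose between when picking batch or streaming.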
Now, understand this: when you process all of this data, you execute all of these different tasks in a sequential manner, such as extract data from multiple sources, then do some aggregation, do some transformation, and maybe load this data to some target location. All these operations or tasks need to happen in a sequence; you cannot have the third task executed first or the first task executed last; it will not work. And for that, we also need an orchestration tool or workflow management tool. So, the next skill set that we will be focusing on is learning about the workflow management tool, and one of the most important tools and highly demanded tools available in the market is called Apache Airflow. It was developed by Airbnb, and then it was open-sourced so that
everyone can use it, and many companies use Apache Airflow to build their data pipelines. So again, at this point, you are reading "The Fundamentals of Data Engineering" and also learning all of these different tools. At some point, everything will start making sense. These are like individual dots, and as you move forward, the dots will start connecting, and you will start understanding why these tools exist in the first place and what problems they are trying to solve. As data engineers, we process big data, meaning huge volumes of data, and you cannot store and process all of it on your local computer. For that, we have the cloud platforms: we have three
main cloud providers — AWS (Amazon Web Services), Microsoft Azure, and Google Cloud Platform. So, if you are learning Python, SQL, or data warehousing from my courses, then you'll already know that we are using AWS to understand all of these concepts, so the confusion for choosing the right cloud platform will disappear. Just by doing the foundation courses, you will already get introduced to the cloud computing platform then and there. But if you're going from the self-learning path, then I will highly recommend you to either start with AWS or Microsoft Azure. Now, this is the case if you don't have any cloud experience. But if you already know one cloud, do
not jump from one cloud to something else. If you already know one cloud computing platform, whether GCP, Azure, or AWS, then forget about the other cloud providers. Just focus on that one and start learning the data engineering side of it. And if you don't know any cloud computing platform, then I highly recommend you start with AWS. I just want to give you a clear answer, because a lot of people get confused between AWS, GCP, and Azure: AWS has a large market share, so you can always rely on AWS. But if you have any other preference in mind, then you can go with Azure too,
because it is growing at a rapid pace. So learning either AWS or Azure will keep you in a safe place. Now again, I do have plans to launch courses in the future on all of these different topics, so you can follow me if you want to get updated. Now at this point, you can call yourself a data engineer because you've learned all of these different things. If you've reached this part of this video, then you can write it in the comments, "I'm going to become a data engineer."
Just send some positive vibes in the comment section so that everyone can feel motivated. Now, these are the core data engineering skill sets. Now I want to talk about some of the advanced levels or the things that have come up in the last few years, so you also need to pay attention to all of those. One of the trending topics in the market is the open table format. While reading "The Fundamentals of Data Engineering," you will learn about the data lake. A data lake is basically a centralized repository where you can store all of these data,
and as per the requirement, you can query this data and select a chunk of it. The problem with a plain data lake is that it lacks many features that databases provide; for example, it doesn't have ACID transactions. To solve these problems, we have a newer concept called the open table format, which adds a lot of features on top of the data lake. There are several options available, such as Iceberg and Delta Lake. This is something that has been trending in the last year,
so I will highly encourage you to keep your eye on it and add the skill set to your portfolio. Then we also have the data observability tools. You might have hundreds of data pipelines running in your company. Now, how do you monitor them? How do you keep track of the errors,
and how do you debug them? Tools such as Datadog can help you with that. So these are the modern data engineering tools that help you with tasks like this. You can also learn about the modern data stack. Here's the list of them. All I recommend you to do is just explore these tools;
do not get attached to them, because these tools come and go in the market. But as you read "The Fundamentals of Data Engineering," your core concepts will be clear, so you will understand why each tool in the modern data stack exists and where it actually fits. And as you go forward in your career, you also need to know about the DevOps and DataOps side, which is basically deploying and automating the entire workflow. For that, you can learn about Docker and Kubernetes. At this stage, you need to think like a principal engineer. You should not remain just a data engineer, because you are growing in your career, and you need to be in a position where you can make decisions so that companies can solve their problems using technology.
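Going back to the orchestration point from the core skills (extract first, then transform, then load, never out of order), here is a minimal sketch of what a workflow tool does at its heart: it takes tasks plus their dependencies and runs them in a valid order. This is not the Apache Airflow API; it uses graphlib from the Python standard library (Python 3.9 or newer), and the task functions, the TASKS and DEPENDENCIES names, and the sample data are all hypothetical.

```python
from graphlib import TopologicalSorter


# Hypothetical tasks; each one reads from and writes to a shared context dict.
def extract(ctx):
    ctx["raw"] = [3, 1, 2]


def transform(ctx):
    ctx["clean"] = sorted(ctx["raw"])


def load(ctx):
    ctx["loaded"] = list(ctx["clean"])


TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it can start.
DEPENDENCIES = {"transform": {"extract"}, "load": {"transform"}}


def run_pipeline():
    """Run every task in an order that respects the declared dependencies."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


if __name__ == "__main__":
    order, ctx = run_pipeline()
    print(order)
    print(ctx["loaded"])
```

A real orchestrator adds scheduling, retries, logging, and a UI on top of this idea, but the dependency graph is the part that guarantees the third task never runs first.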
You can also start reading blogs from different top companies like Netflix, Zerodha, AWS, GCP, so you will get an understanding of how data engineering is performed in the real world. So this was everything that you need to really focus on to become a data engineer. Now, the thing that I told you at the start of this video, that I will give you the complete guide on how to stay consistent and learn in public: so if you are really serious, and if you have been watching this video from start to end till here, that means you are serious about learning and becoming a data engineer. Because most people already gave up in between. But I assume if you've stuck to this video till the end, that means you are serious about learning data engineering. So, you can comment below, "I watched this video till the end," so that I will know that you are really serious about learning data engineering. Now, if you want to stay consistent and focused, what you need to be is accountable to someone else. Now, the best way you can be accountable
is on social media. So you can go to platforms like LinkedIn, Instagram, or Twitter. What you have to do is you have to announce it to everyone that from today or tomorrow, I'm going to start my data engineering journey by following this roadmap. So, you can also link this video or the document that I will be giving you, so you can tell people that I will be focusing and executing this complete roadmap for the next 6 months, and I will be sharing my daily learnings here. All you have to do is every day, whenever you learn something from, let's say, my Python course, or SQL course, or from a book that you are reading, all you have to do is just summarize that entire thing in your own language. Do not copy-paste. You don't have to be a content creator; all you have to do is just share your learning with people. If you do this,
you will gain confidence, you will stay focused, and if any recruiter sees that you have the knowledge and are trying to get into this field, they might contact you in the future once you have all of these skill sets. A lot of people do this: they learn something and share it with the world. All I ask you to do is create a post, announce it in public, and tag me, so that I can repost it and you will feel accountable to complete this entire journey. That way, you stay consistent, build your portfolio, and open doors to new opportunities.
Now, this is the complete document that I was talking about. Everything that we talked about in this video, I have added here. With that, I've also added some extra resources, such as projects from my YouTube channel and some technologies that you can learn from. So, you can go through this entire roadmap and start your journey in data engineering. So, I upload quality content on data engineering on this channel. So, if you are new here, then don't forget to hit the subscribe button. If you found this video helpful, then please,
please, please hit the like button. Thank you for watching; I'll see you in the next video.
2023-12-27