God Tier Data Engineering Roadmap 2024

Data engineering is one of the fastest-growing fields. As more and more data is generated, companies need people who can manage and process it at scale. Data engineers are often paid more than software engineers, and everyone wants to get into this field. However, it is confusing: there are so many different tools available in the market. Just take a look at this big data landscape; with so many tools, it is hard to know where to get started. Even if you do get started, you might get lost in the learning process.

This video is your ultimate guide to becoming a data engineer. I will give you a clear and concise roadmap. On top of this, I will also provide you with eight

different end-to-end data engineering projects. So, this is not just a theoretical roadmap; you will gain practical experience if you follow the steps that I give you in this video.

I have been working in this field for the last five years. I started with a job, then I became a full-time freelancer. I work with a lot of startups and also with big companies like Ware, so I understand how data engineering is practiced in smaller companies as well as in large ones. Therefore, I will give you a clear understanding of how these different tech stacks are used across these companies, so that you know what to focus on and what not to.

Before we talk about the tools and skills you need to become a data engineer, I want to talk about having the right mindset. This is not just about data engineering; even if you decide to

learn anything, you need to have a positive  mindset that you can become something — you   can become a data engineer, data scientist, or  whatever your goal is. If you're watching this,   that means you want to become a data engineer.  So, the first thing that I suggest is to have   a positive mindset that you can become a data  engineer. Do not let negative thoughts, such   as you not being capable enough or you not being  smart enough, stop you. Have a positive mindset. The second thing is you need to be fully focused  while you execute this roadmap. We live in a world   of distractions — you have social media, your  phone, gaming, and all other things. So, find  

your distraction and make sure for the next six to  eight months, you remove all of this distraction   and just focus on executing this roadmap.  Believe me, if you stay fully focused and remove   all distractions, then no one can stop you from  becoming a data engineer in the next six months. To help you in this process, I will give  you a challenge at the end of this video,   where I will provide you with a quick  guide on how you can stay consistent,   learn in public, and also grow your network.  So once you acquire all the skill sets,   you will also have different  opportunities available to you. I highly suggest you to watch this video  from start to end. Get a pen and paper and   start taking notes so that you understand and  remember all these things. And before we start,  

I would appreciate it if you could hit the  like button on this video. That helps this   channel to grow and also keeps me motivated  to make more videos. And if you are new here,   then don't forget to hit the  subscribe button. Let's get started.

So here's the situation: you might be at different  stages of your journey. You are completely new, or   you know a few tools but want to know where to go  next. The first thing I always suggest to people   who are just getting started and do not have a  technical background is to clear their computer   science fundamentals. This is the core — the bread  and butter of everything we do on the Internet.   Understanding and having strong computer science  fundamentals will help you in the long run.

I'm not telling you to go to college and get a degree. All I'm saying is that you should understand the basics of computer science: how code is compiled, how code is executed, the basics of data structures and algorithms, and the building blocks of programming languages such as loops, conditional statements, and variables. For this, one of the best resources is available on YouTube completely free: CS50, provided by Harvard University. If you go on YouTube, you will find this playlist. It has everything you need to get your basic computer science fundamentals clear. All I recommend is that you watch just the first five videos. This will

give you enough understanding of computer science fundamentals. If you spend just two to three hours daily on this, you can finish these five videos within a week.

Once your basic computer science fundamentals are clear, you need to take one step forward and work on your foundational data engineering skill set. There are two skills you need to focus on, and you might already know them: the first is a programming language, and the second is SQL (Structured Query Language).

Now, people used to say you can learn Java, Scala, or Python, but these days companies mainly prefer Python for data engineers. So you can just learn Python and get started with your data engineering career, so that you don't get confused between multiple languages. The reason to learn a programming language is that you will be automating workflows, writing transformation jobs, and deploying data pipelines, so you need a basic understanding of how to do all of these things programmatically.

The same goes for SQL. Most data is stored in databases, and SQL is the way we communicate with them. If you want to insert or retrieve data, delete some records, or update some records, you can do that using SQL. SQL has become the universal data language, so no matter which database you use, you will most likely be writing SQL there.

Learning Python and SQL is non-negotiable. Now here's the good news: when I was

learning about all these different things,  I had to refer to multiple blogs, videos,   and courses to understand how these things are  performed at the data engineering level. Now,   just to solve this problem, and if you have  been following me, you will know that I have   like a dedicated niche course for Python and SQL  for data engineering. These courses are specially   tailored for data engineers. So, if you are  a data engineer who wants to understand how   Python is used from the data engineering point  of view or how SQL is used for data engineers,   then I have created dedicated courses for  that only. These courses are taken by more  

than 5,000 people, and they all love it.  The way I break down the complex topics   and make you understand all of these different  things in a simple manner using my real-world   examples will make you fall in love with  the process of learning data engineering. These courses are completely hands-on, and  there are some amazing projects like Spotify   Data Pipeline and many other projects. So,  the only resource that I will suggest you  

for the core foundation of data engineering is  my Python and SQL for Data Engineering courses.   I will also add some free resources, so  if you don't want to take this course,   then you can also go from the free resources.  Again, I spent around two to three months   preparing for all of these different courses,  so I will encourage you to at least check them   out — Python and SQL for Data Engineering.  And I'm also running a special discount on all   of these different courses, so you can find  all of this information in the description. Doing this much will give you a strong foundation  to start your journey as a data engineer. Now,  

what you need to do is focus on  highly demanded tools and skills   in the market. There are hundreds  of tools available in the market;   we just want to focus on the highly demanded  tools that can give us opportunities to get a job. So now, I'm going to suggest a different skill  set that you need to acquire to build your core   data engineering skill set. Here's a different  approach that I will suggest: one part is just   learning about the tools, but you also need  to understand the core foundations of data   engineering. Why do we perform data engineering?  For the learning approach here, it's a little   bit different — just pay attention right now, so  that you understand the entire process clearly.

Here's the thing: when you start watching videos, you will get bored after an hour or two, and once you get bored, you might jump to something else, like watching random YouTube videos or scrolling through reels on Instagram. To avoid the boredom, we need to replace those activities with another kind of learning material. This is what I suggest: we will do two things in parallel. One, I will recommend a book that you can read in the background, and two, you can take a course to learn your core data engineering skill set.

Let's say you decide to spend 2 hours daily learning data engineering. You can spend around 1 to 1.5 hours learning from courses or video material, and then up to one hour reading a book. This way, you learn the highly demanded tools in the market by watching courses, and at the same time you gain theoretical knowledge of the core foundations of data engineering.

So the book that I recommend is "The Fundamentals of Data Engineering". I have the hard copy; you can buy the hard copy, or you can get the ebook online. This is one of the best books available on the foundations of data engineering, and I highly recommend that you read it while you are on this journey, because most people learn tools and technologies but their fundamentals are very weak. Fundamentals don't change for 10 to 15 years; that's why this book will give you a strong foundation in data engineering.

I know it might be a little bit confusing, but the way it works is that you read this book in the background whenever you are free or whenever you have the time, because you can't watch videos all the time, but you can read a book. You can read a paragraph, one page, or an entire chapter whenever you get time. So this way,

you will stay focused, and you  will enjoy the process of learning. Now, while you are reading this book, I will also  recommend you to do a course to learn about the   highly demanded tool in the market. The next core  data engineering skill set that I recommend you is   to learn about the data warehouse. Everything  you do as a data engineer will eventually get   stored inside the data warehouse. This is where  businesses generally start extracting value from   the data. So, if we want to find out, let's  say, what was the last five years of revenue,  

or how many products we sold this year compared to last year, you can find answers to all of these questions in the data warehouse, because data warehouses are built for analytical queries.

So again, learning about data warehouses has two parts: one is learning the foundations, and the second is learning a highly demanded tool in the market. The foundations include understanding OLAP and OLTP systems, extract, transform, load (ETL), ER modeling, and dimensional modeling, such as understanding fact and dimension tables.

There's one more book available on data warehouses called "The Data Warehouse Toolkit" by Kimball, but you don't really have to read this book, and I will tell you why in a while. After getting your data warehouse fundamentals clear, you can learn one tool where you can practice all of this foundational knowledge. There are many data warehouse tools available, such as Snowflake, BigQuery, and Amazon Redshift. As per my recommendation,

you should definitely learn Snowflake, because many companies are moving from traditional data warehouses to Snowflake. Snowflake is a modern cloud data warehouse, so I highly recommend that you add Snowflake to your skill set.

Now, where to learn all of this? Again, I have created a detailed course, one of the most in-depth courses you will find in the market, designed especially for data engineers. The book that I told you about, "The Data Warehouse Toolkit", I have already referred to while creating this course, so you don't have to read it yourself. You can just complete this course; you will get an understanding of the core data warehouse fundamentals and learn how to build a modern data warehouse using Snowflake. It took me 2 to 3 months to prepare this course. I referred to multiple blogs, courses, and books to bring everything together in one place, so I highly encourage you to at least check these courses out. I have put all of my hard work into them.
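To make the fact and dimension tables mentioned above a bit more concrete, here is a minimal sketch of a star schema and the kind of analytical question a data warehouse is built to answer (revenue per year). It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not Snowflake, and the table names, column names, and sample rows are all made up for illustration.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory star schema (hypothetical tables and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension tables describe the context of each sale.
        CREATE TABLE dim_product (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL
        );
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY,
            year    INTEGER NOT NULL
        );
        -- The fact table stores measurable events and points at the dimensions.
        CREATE TABLE fact_sales (
            sale_id    INTEGER PRIMARY KEY,
            product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
            date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
            quantity   INTEGER NOT NULL,
            revenue    REAL NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)",
        [(1, "keyboard"), (2, "mouse")],
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?)",
        [(1, 2022), (2, 2023)],
    )
    conn.executemany(
        "INSERT INTO fact_sales (product_id, date_id, quantity, revenue) "
        "VALUES (?, ?, ?, ?)",
        [(1, 1, 2, 100.0), (2, 1, 1, 20.0), (1, 2, 3, 150.0), (2, 2, 4, 80.0)],
    )
    conn.commit()
    return conn


def revenue_by_year(conn: sqlite3.Connection) -> list:
    """Analytical query: join the fact table to a dimension and aggregate."""
    return conn.execute(
        """
        SELECT d.year, SUM(f.revenue) AS total_revenue, SUM(f.quantity) AS units_sold
        FROM fact_sales AS f
        JOIN dim_date AS d ON d.date_id = f.date_id
        GROUP BY d.year
        ORDER BY d.year
        """
    ).fetchall()


if __name__ == "__main__":
    warehouse = build_demo_warehouse()
    for year, total, units in revenue_by_year(warehouse):
        print(year, total, units)
```

The same pattern (a central fact table joined to small dimension tables, then grouped and summed) is what you would practice on a real warehouse tool; only the connection and the SQL dialect details would differ.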

Imagine this situation, right? You're reading a book in the background, "The Fundamentals of Data Engineering", so whenever you are free or whenever you find time, you can read a page or a paragraph, and at the same time you are focusing on learning the core data engineering toolset from different courses. This will start feeling like magic as you go forward in this process.

Once you finish learning about the data warehouse, the next thing you need to focus on is data processing. The core of data engineering is data processing, because we get data from multiple places (RDBMS, web analytics, sensors), in multiple formats and at different frequencies. What you really need to do is write some logic to bring all of this data to one place in a structured format, so you need a proper tool to process it.

Data generally gets processed in two different ways: one is batch processing, where you take a chunk of data and process it daily or weekly as per the requirement, and the second is real-time data streaming, as you see on Google Maps or Amazon, right? You get notifications in a

real-time manner. So as soon as the data comes in, you need to process it and pass it forward.

One tool for batch, or even real-time, data processing is Apache Spark, one of the most highly demanded tools in the market. It is used by big organizations like Google, Microsoft, and many more. You can learn it the same way: first, you learn the foundations of Apache Spark, such as the core architecture, the higher-level APIs, and the different functions available, and then you learn a platform that runs Apache Spark. There are many such platforms, such as Databricks, AWS Glue, Dataproc, and many more. My recommendation is to learn Apache Spark with Databricks, and the interface you will be using is PySpark, the Python API for Spark.

I'm working on my Apache Spark course, so it is not launched yet at the time of creating this video, but you can follow me if you want to get updates on this. I will add some good courses and resources where you can learn Apache Spark and Databricks in the final document that you will get at the end of this video.

Now, for batch processing you can learn Apache Spark, but for real-time data streaming you can learn another highly demanded tool called Apache Kafka. Apache Kafka is a distributed event store and stream-processing platform, so you can process your data in real time.
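The difference between batch and real-time processing described above can be sketched in plain Python. This is only an illustration of the two processing styles, not Spark or Kafka code: the sample events, the function names, and the in-memory list standing in for an event source are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical events: (user, amount). In practice these would come from
# files, a database, or a message broker rather than a hard-coded list.
Event = Tuple[str, float]
EVENTS: List[Event] = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("bob", 7.5)]


def process_batch(events: List[Event]) -> Dict[str, float]:
    """Batch style: the whole chunk is available, so aggregate it in one pass
    and return a single final result."""
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
    return dict(totals)


def process_stream(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Streaming style: handle each event as it arrives and emit an updated
    running total immediately, without waiting for the rest of the data."""
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
        yield user, totals[user]


if __name__ == "__main__":
    print("batch result:", process_batch(EVENTS))
    for user, running_total in process_stream(iter(EVENTS)):
        print("stream update:", user, running_total)
```

Both functions arrive at the same totals; the difference is when results become available. The batch function answers once after seeing everything, while the streaming function produces an update after every single event, which is the behavior you want for real-time notifications.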

Now, understand this: when you process all of this data, you execute a series of tasks in order, such as extracting data from multiple sources, then doing some aggregation and transformation, and maybe loading the data to some target location. All these tasks need to happen in sequence; you cannot have the third task executed first or the first task executed last, because it will not work. For that, we need an orchestration tool, also called a workflow management tool.

So, the next skill set to focus on is workflow management, and one of the most important and highly demanded tools in the market is Apache Airflow. It was developed at Airbnb and then open-sourced so that everyone can use it, and many companies use Apache Airflow to build their data pipelines.

So again, at this point, you are reading "The Fundamentals of Data Engineering" and also learning all of these different tools. At some point, everything will start making sense. These are like individual dots, and as you move forward, the dots will start connecting, and you will start understanding why these tools exist in the first place and what problems they are trying to solve.

As data engineers, we process big data, a huge volume of data, and you cannot store and process all of it on your local computer. For that, we have the cloud platforms: we have three

main cloud providers — AWS (Amazon Web Services),  Microsoft Azure, and Google Cloud Platform. So,   if you are learning Python, SQL, or data  warehousing from my courses, then you'll already   know that we are using AWS to understand all of  these concepts, so the confusion for choosing   the right cloud platform will disappear. Just by  doing the foundation courses, you will already   get introduced to the cloud computing platform  then and there. But if you're going from the   self-learning path, then I will highly recommend  you to either start with AWS or Microsoft Azure. Now, this is the case if you don't have any cloud  experience. But if you already know one cloud, do  

not jump from one cloud to something else. If you already know one cloud computing platform (GCP, Azure, or AWS), then forget about the other cloud providers. Just focus on that one and start learning the data engineering side of it. And if you don't know any cloud computing platform, then I highly recommend you start with AWS. I just want to give you a clear answer, because a lot of people get confused between AWS, GCP, and Azure. So here it is: AWS has a large share of the market, so you can always rely on AWS. But if you have any other preference in mind, you can go with Azure too,

because it is growing at a rapid pace. So learning  either AWS or Azure will keep you in a safe place. Now again, I do have plans to launch courses  in the future on all of these different topics,   so you can follow me if you want to get updated.  Now at this point, you can call yourself a data   engineer because you've learned all of these  different things. If you've reached this   part of this video, then you can write it in the  comments, "I'm going to become a data engineer."  

Just send some positive vibes in the comment section so that everyone can feel motivated.

Now, these are the core data engineering skill sets. Next, I want to talk about some of the more advanced topics that have come up in the last few years, so you also need to pay attention to those. One of the trending topics in the market is the open table format. While reading "The Fundamentals of Data Engineering," you will learn about the data lake. A data lake is basically a centralized repository where you can store all of this data, and as per the requirement, you can query it and select a chunk of it. The problem with a plain data lake is that it lacks a lot of functionality; for example, it doesn't have ACID transactions. To solve these problems, we have a newer concept called the open table format, which adds a lot of features on top of the data lake. There are several options available, such as Iceberg and Delta Lake. This is something that has been trending in the last year,

so I will highly encourage you to keep your eye  on it and add the skill set to your portfolio. Then we also have the data observability tools.  You might have hundreds of data pipelines running   in your company. Now, how do you monitor  them? How do you keep track of the errors,  

and how do you debug them? Tools such as  Datadog can help you with that. So these   are the modern data engineering tools  that help you with tasks like this. You can also learn about the  modern data stack. Here's the   list of them. All I recommend you  to do is just explore these tools;  

do not get attached to them because these tools  come and go in the market. But as you read "The   Fundamentals of Data Engineering," your core  concepts are clear, so you will understand why   each and every tool from the modern data  stack exists and where they actually fit. And as you go forward in your career, you  also need to know about the DevOps and the   data ops side, which is basically deploying and  automating the entire workflow. And for that,   you can also learn about Docker  and Kubernetes. At this stage,   you need to think like a principal engineer.  You don't have to be just a data engineer   because you are growing in your career,  and you need to be in a position where   you can take decisions so that companies  can solve their problems using technology.

You can also start reading blogs from different  top companies like Netflix, Zerodha, AWS, GCP,   so you will get an understanding of how  data engineering is performed in the real   world. So this was everything that you need  to really focus on to become a data engineer. Now, the thing that I told you  at the start of this video,   that I will give you the complete guide on  how to stay consistent and learn in public:   so if you are really serious, and if you have been  watching this video from start to end till here,   that means you are serious about learning and  becoming a data engineer. Because most people   already gave up in between. But I assume  if you've stuck to this video till the end,   that means you are serious  about learning data engineering. So, you can comment below, "I watched this video  till the end," so that I will know that you are   really serious about learning data engineering.  Now, if you want to stay consistent and focused,   what you need to be is accountable to someone  else. Now, the best way you can be accountable  

is on social media. So you can go to platforms  like LinkedIn, Instagram, or Twitter. What you   have to do is you have to announce it to everyone  that from today or tomorrow, I'm going to start   my data engineering journey by following this  roadmap. So, you can also link this video or   the document that I will be giving you, so you can  tell people that I will be focusing and executing   this complete roadmap for the next 6 months,  and I will be sharing my daily learnings here. All you have to do is every day, whenever you  learn something from, let's say, my Python course,   or SQL course, or from a book that you  are reading, all you have to do is just   summarize that entire thing in your own language.  Do not copy-paste. You don't have to be a content   creator; all you have to do is just share  your learning with people. If you do this,  

you will gain confidence, you will be focused, and  if any recruiter sees that you have the knowledge   and you are trying to learn and get into this  field, they might contact you in the future. Once you have all of these skill sets, a  lot of people do this: they learn something,   share it with the world. All I ask you to do is  create a post, announce it in public, and tag me,   so that I will repost it, so that you will  feel accountable to complete this entire   journey. That way, you stay consistent, build your  portfolio, and open doors for new opportunities.

Now, this is the complete document that I  was talking about. Everything that we talked   about in this video, I have added here. With  that, I've also added some extra resources,   such as projects from my YouTube channel and  some technologies that you can learn from. So,   you can go through this entire roadmap and  start your journey in data engineering. So, I upload quality content on data engineering  on this channel. So, if you are new here,   then don't forget to hit the subscribe button.  If you found this video helpful, then please,  

please, please hit the like button. Thank you  for watching; I'll see you in the next video.

2023-12-27
