Hadoop Tutorial For Beginners | Apache Hadoop Tutorial | Hadoop Training | Edureka
Hello everyone, and welcome to this Edureka video on Hadoop Tutorial for Beginners. In this tutorial we will learn the fundamentals of Hadoop and Spark, and we will also implement two interesting big data use cases using Hadoop and Spark. What better way to learn a technology than hands-on training, right? So here's the agenda, and this is what you will be learning in today's training. As I already told you, we have two big data use cases to study: the first is the U.S. primary election, and the second is the Instant Cabs startup, which is much like Uber. We will start with the problem statements of both use cases and then proceed to learn the big data technologies and concepts needed to solve them. Since we are going to use Hadoop and Spark, we will start with a brief introduction to both. Next, we will understand the components of Hadoop in detail, which are HDFS and YARN. After understanding Hadoop, we will move on to Spark and learn how Spark works and what its different components are. Then we will go ahead to understand k-means and Zeppelin: k-means is a machine learning algorithm, and Zeppelin is the tool we are going to use to visualize our data. Finally, we will proceed to the solutions of the use cases with a demo. So we are all good to go.
Let us start with the U.S. primary election use case first. In this use case we will be discussing the 2016 primary elections. In the primary elections, the contenders from each party compete against each other to represent their own political party in the final election. There are two major political parties in the U.S., the Democrats and the Republicans. From the Democrats, the contenders were Hillary Clinton and Bernie Sanders, and out of them Hillary Clinton won the primary elections, and from the Republicans
the contenders were Donald Trump, Ted Cruz and a few others. As you already know, Donald Trump was the winner from the Republicans. So now let us assume that you are an analyst and you have been hired by Donald Trump, and he tells you: I want to know the different reasons because of which Hillary Clinton won, and I want to carry out my upcoming campaigns
based on that, so I can win the favor of the people who voted for her. That was the entire agenda, and this is the task that has been given to you as a data analyst. So what is the first thing you will need to do? The first thing you'll do is ask for data, and you have got two data sets with you. Let us take a look at what these data sets contain. This is our first data set, which is the U.S. primary election data set, and these are the different fields present in it. The first field is the state, so we've got the state of Alabama, and the state abbreviation for Alabama is AL. We've got the different counties in Alabama, like Autauga, Baldwin, Barbour, Bibb, Blount, Bullock, Butler, etc. Then we've got FIPS, which stands for Federal Information Processing Standards code; this is basically a numeric code that identifies each county, a bit like a zip code. Then we've got the party. We will be analyzing the Democrats only, because we want to know the reason for Hillary Clinton's win. Then we've got the candidate, and since I told you there were two candidates, Bernie Sanders and Hillary Clinton, we've got the name of the candidate here and the number of votes each candidate got. So Bernie Sanders got 544 votes in Autauga County and Hillary Clinton got 2,387. This field over here represents the fraction of the votes, so if you add these two together you will get 1. It basically represents the share of the vote each of the candidates got.
Let's take a look at our second data set now. This is the U.S. county demographic features data set. First we again have FIPS, then the area name, with Autauga County, Baldwin and other counties in Alabama and in other states as well, and then the state abbreviation.
Here it is only showing Alabama. The fields that you see here are actually the different features. You won't know what exactly each one contains, because the column names are written in a coded form, but let me give you an example of what this data set contains. Let me also tell you that I'm just showing you a few rows; this is not the entire data set. It contains fields like the population in 2014 and in 2010, and the sex ratio, meaning how many females and males. Then, based on ethnicity, how many Asian, how many Hispanic, how many Black American and how many Black African people. Then there are fields based on age groups: how many infants, how many senior citizens, how many adults. So there are a lot of fields in our data set, and this will help us to analyze and actually find out what led to Hillary Clinton's win.
Now that you have seen the data sets, you have to understand them. You have to figure out which features, or which columns, you are going to use, and you have to think of a strategy for carrying out this analysis. So this is the entire solution strategy. The first thing you need is a data set, and you've got two data sets with you. The second thing you need to do is store that data in HDFS, the Hadoop Distributed File System. The next step is to process that data using Spark components; we will be using Spark SQL, Spark MLlib, etc. So the next task is to transform the data using Spark SQL. Transforming here means filtering the data down to the rows and columns that you need in order to process it. The next step is clustering this data using Spark MLlib, and for clustering our data we will be using k-means. The final step is to visualize the result using Zeppelin. Now, visualizing
this data is also very important, because without the visualization you won't be able to identify the major reasons, and you won't be able to gain proper insights from your data. Now don't be scared if you're not familiar with terms like Spark SQL, Spark MLlib or k-means clustering; you will be learning all of these in today's session. So this is our entire strategy, this is what we're going to do today, and this is how we're going to implement this use case and find out why Hillary Clinton won.
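To make the "transform" step of the strategy concrete, here is a minimal sketch of what filtering rows and columns means. It uses plain Python and the standard csv module rather than Spark SQL, which is what the video uses. The column names are assumptions modelled on the fields described above, and the Republican row is invented purely so that the filter has something to drop.

```python
import csv
import io

# Hypothetical sample; header names are assumed, not taken from the real file.
# The two Democrat vote counts are the ones quoted in the video; the
# Republican row is invented for illustration.
SAMPLE_CSV = """state,state_abbreviation,county,fips,party,candidate,votes
Alabama,AL,Autauga,1001,Democrat,Bernie Sanders,544
Alabama,AL,Autauga,1001,Democrat,Hillary Clinton,2387
Alabama,AL,Autauga,1001,Republican,Some Candidate,1000
"""


def democrat_rows(csv_text, keep=("county", "candidate", "votes")):
    """Keep only Democrat rows and only the columns named in `keep`."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["party"] != "Democrat":
            continue  # row filter
        slim = {name: row[name] for name in keep}  # column filter
        if "votes" in slim:
            slim["votes"] = int(slim["votes"])
        rows.append(slim)
    return rows


def vote_shares(rows):
    """Fraction of the filtered votes that each candidate received."""
    total = sum(r["votes"] for r in rows)
    return {r["candidate"]: r["votes"] / total for r in rows}


rows = democrat_rows(SAMPLE_CSV)
shares = vote_shares(rows)
for candidate, share in shares.items():
    print(f"{candidate}: {share:.3f}")
```

With only these two candidates in the filtered rows, the shares add up to 1, which matches the way the fraction field is described in the video.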
Now let me give you a visualization of the results. I'll just show you the analysis that I have performed and how it looks. This is Zeppelin, which is running on the master node in my Hadoop cluster, and this is where we're going to visualize our data. There's a lot of code, but don't be scared: this is just Scala code with Spark SQL, and by the end you will have learned how to write this code. I'm just jumping to the visualization part. This is the first visualization that we have got, and we've analyzed it according to different ethnicities of people. For example, on our x-axis we have foreign-born persons, and on the y-axis we're seeing, among the foreign-born people, the popularity of Hillary Clinton among Asians. The circles represent the values: the bigger the circle, the bigger the count. We have made a few more visualizations. Now we've got a line graph that compares the votes of Hillary Clinton and Bernie Sanders. We have also got an area graph that compares Bernie Sanders' and Hillary Clinton's votes. And we have a lot more visualizations: we have got bar charts, and finally we also have the state-wise and county-wise distribution of votes. These visualizations will help you derive a conclusion, or whatever answer Donald Trump wants from you. And don't worry, you'll be learning how to do that; I'll explain each and every detail of how I've made these visualizations.
Now let us take a look at our second use case, which is the Instant Cabs use case. There is a cab startup in the U.S. called Instant Cabs, and since you did a great job analyzing the U.S. elections, they have hired you to solve their problem. Basically, this company wants to know what the demand for cabs is, at which locations and during which peak hours. What they
want to do is maximize their profit by finding the beehive points where they can get a lot of pickups, and by getting their cabs there during peak hours. So this is your second task. Again, the first thing you need is a data set. This is the Uber data set that has been given to you to analyze and find out what the peak hours are and how many cabs are expected at those locations during the peak hours. This is a date/time stamp that represents the pickup date and pickup time for a particular Uber ride; this one is 8 January 2014, around midnight. Then you've got a latitude and a longitude, which represent the location of the pickup. And then this is Base, which is the TLC base code; this is like a license number for the base the driver belongs to. Now again we have to make a strategy for how we're going to analyze this data. First, you've got your data set in CSV format. The next step is to store the data in HDFS, like we did the first time. Again we have to transform the data, because this data set is really long; it contains a lot of rows and columns, and maybe you don't want to analyze all of them at once, so you will again filter out some of the rows. After it's transformed, we start clustering, again with k-means. Don't worry, I'll explain k-means from the start and how to do it. By clustering, we will find the center point of each cluster, and each center will represent one of the pickup points, or beehive points. That is why we are performing clustering: to find the cluster centers, which represent the beehive points where we will be expecting the maximum number of pickups during peak hours. So this is your entire strategy. Now let me show you the visualization of this one as well, like I did for the U.S. elections.
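The idea of cluster centers standing in for beehive points can be sketched in a few lines. The following is a small pure-Python version of k-means (Lloyd's algorithm), not the Spark MLlib implementation used in the video. The pickup coordinates are made up, the starting centers are chosen by hand, and plain Euclidean distance on latitude and longitude is only an approximation that is acceptable over a small area.

```python
import math


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm on 2-D points; returns (centers, labels)."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centers, labels


# Made-up (latitude, longitude) pickups forming two obvious groups.
PICKUPS = [
    (40.75, -73.99), (40.76, -73.98), (40.74, -74.00),
    (40.65, -73.78), (40.64, -73.79), (40.66, -73.77),
]

centers, labels = kmeans(PICKUPS, centers=[PICKUPS[0], PICKUPS[3]])
for i, (lat, lon) in enumerate(centers):
    print(f"beehive point {i}: lat={lat:.2f}, lon={lon:.2f}, pickups={labels.count(i)}")
```

Each returned center is the average position of the pickups assigned to it, which is why a center can be read as the spot where cabs should wait.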
Here's our code again, which is Scala and Spark SQL code. Let me just jump directly to the visualization part. So again, on our x-axis we have the count of the number of pickups, and on the y-axis we have the hours, that is, the time of day, and we have grouped it according to size, where the size is the count, the number of pickups. You can see that the largest circles here are these ones, and they are found in the fourth cluster at the seventeenth hour, which is around 5:00 p.m. This is what we found after analyzing and visualizing our Uber data set. Again, I will be talking about how many clusters are to be made, how to make the different clusters and how to find everything out. This is again a visualization of the Uber data set, and this one represents the location, in order to identify the beehive points. Here we have classified the number of pickups according to the different hours of the day, so we've got 24 slices over here, and you can see that the biggest slices are the seventeenth hour and the sixteenth hour, which is around 4:00 to 5:00 p.m. This
is the visualization of the data set. Now let's go back to our presentation. We have seen what we have to do, but now let's understand what it takes to perform all of this: what are the things that you need to be aware of, or that you need to learn? Here's what you need to learn in order to perform the analysis of the two use cases. We will start with the introduction to Hadoop and Spark and understand what Hadoop is and what Spark is. Then we will take a deep dive into Hadoop to understand its different components, for example the storage unit of Hadoop, which is known as HDFS, and YARN, which is the processing unit of Hadoop. Then we will be using tools like Apache Spark, which can be easily integrated with Hadoop in order to perform better analysis. Then we will understand k-means and Zeppelin, because we have used k-means clustering to cluster our data and Zeppelin to visualize it. And then we will finally move on to the solution of the use cases, where we will be implementing it directly. So this is what you need to learn.
Let's get started with an introduction to Hadoop and Spark. So let's take a look at what Hadoop is and what Spark is. Hadoop is a framework where you can store large data sets in a distributed manner and then process them in parallel. Hadoop has got two components. For storage it has HDFS, which stands for Hadoop Distributed File System. It allows you to dump any kind of data across the Hadoop cluster, and it will be stored in a distributed manner on commodity hardware. For processing you've got YARN, which stands for Yet Another Resource Negotiator. This is the processing unit of Hadoop, which allows parallel processing of the data distributed across your Hadoop cluster in HDFS.
Then we've got Spark. Apache Spark is one of the most popular projects by Apache, and it is an open-source cluster computing framework for real-time processing. Whereas Hadoop is used for batch processing, Spark is used for real-time processing, because with Spark the processing happens in memory. It provides you with an interface for programming entire clusters with implicit data parallelism and fault tolerance. So what is data parallelism? Data parallelism is a form of parallelization across multiple processors in parallel computing environments. A lot of parallel words in that sentence, so let me tell you simply: it basically means distributing your data across nodes, which operate on the data in parallel. Spark works with fault-tolerant storage systems like HDFS and S3, and it can run on top of YARN, because with YARN you can combine different tools like Apache Spark for better processing of your data.
If you look at the topology of Hadoop and Spark, both of them have the same topology, which is a master/slave topology. In Hadoop, in terms of HDFS, the master node is known as the name node and the worker or slave nodes are known as data nodes. In Spark, the master is known as the master and the slaves are known as workers. These are basically daemons. So this is a brief introduction to Hadoop and Spark.
Now let's take a look at Spark complementing Hadoop. There has always been a debate about what to choose, Hadoop or Spark. But let me tell you that there is a stubborn misconception that Apache Spark is an alternative to Hadoop and that it is likely to bring an end to the era of Hadoop. It is very difficult to say Hadoop versus Spark, because the two frameworks are not mutually exclusive; they are better when they are paired with each other. So let's see the different challenges that we address when we use Spark and Hadoop together.
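The definition of data parallelism given above (split the data, let every node run the same operation on its own piece, then combine) can be illustrated with a small sketch. This is not Spark code: threads from Python's standard library stand in for cluster nodes, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, n_partitions):
    """Cut a list into at most n_partitions contiguous pieces."""
    size = max(1, -(-len(data) // n_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def count_words(partition):
    """The same operation that every worker runs on its own partition."""
    counts = {}
    for line in partition:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def parallel_word_count(lines, n_workers=3):
    partitions = split_into_partitions(lines, n_workers)
    # Each partition is processed independently, as a node would do.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    # Combine the partial results into one answer.
    total = {}
    for partial in partial_counts:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total


LINES = ["hadoop spark", "spark yarn", "hadoop hdfs", "spark"]
print(parallel_word_count(LINES))
```

The important design point is that `count_words` never needs to see another worker's partition, which is what lets the work spread across machines in a real cluster.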
You can see the first point: Spark can process data up to a hundred times faster than MapReduce, so it gives us results faster and performs faster analytics. The next point is that Spark applications can run on YARN, leveraging the Hadoop cluster, and you know that a Hadoop cluster is usually set up on commodity hardware. So we are getting better processing while using very low-cost hardware, and this will help us cut our costs a lot; hence the cost optimization. The third point is that Apache Spark can use HDFS as storage, so you don't need a different storage space for Apache Spark. It can operate on HDFS itself, so you don't have to copy the same file again if you want to process it with Spark, and hence you can avoid duplication of files. So Hadoop forms a very strong foundation for any future big data initiative, and Spark is one of those initiatives. It's got enhanced features like in-memory processing and machine learning capabilities, and you can use it with Hadoop, and Hadoop uses commodity hardware, which can give you better processing at minimum cost. These are the benefits that you get when you combine Spark and Hadoop to analyze big data.
Let's see some of the big data use cases. The first is web and e-tailing: recommendation engines. Whenever you go to Amazon or any other online shopping site to buy something, you will see some recommended items popping up below or to the side of your screen, and that is all generated using big data analytics. Then there is ad targeting: if you go to Facebook you see a lot of different items asking you to buy them. And you've got search quality, abuse and click-fraud detection. You can use big data analytics in telecommunications also, for customer churn prevention, network performance optimization, and analyzing
the network to predict failures, so you can prevent a loss before the fault actually occurs. It's also widely used by governments for fraud detection and cyber security, for introducing different welfare schemes, and in justice. It has been widely used in healthcare and life sciences for health information exchange, gene sequencing, serialization, healthcare service quality improvements and drug safety. Let me tell you that big data analytics has made it much easier to diagnose a particular disease and to find the cure as well. Here are some more big data use cases. It is also used in banks and financial services for modeling true risk, fraud detection, credit card scoring analysis and many more. It can be used in retail, transportation services, hotels and food delivery services, and actually in every field you name. No matter what business you have, if you're able to use big data efficiently, your company will grow; you will gain different insights by using big data analytics and hence improve your business even more. Nowadays everyone is using big data. You've seen different fields, and they are all different from each other, but everyone is using big data analytics, and big data analysis can be done with tools like Hadoop, Spark, etc. This is why big data analytics is very much in demand today, and why it is very important for you to learn how to perform big data analytics with tools like these.
Now let's take a look at a big data solution architecture as a whole. You're dealing with big data now.
The first thing that you need to do is dump all that data into HDFS and store it in a distributed way. The next thing is to process that data so that you can gain insights, and we'll be using YARN, because YARN allows us to integrate different tools that help us process big data. These are the tools that you can integrate with YARN: you can choose Apache Hive, Apache Spark, MapReduce or Apache Kafka to analyze big data, and Apache Spark is one of the most popular and most widely used tools with YARN for processing big data. So this is the entire solution as a whole. Now that we have been introduced to Hadoop and Spark, let's take a deeper look at HDFS,
which is the storage unit of Hadoop. HDFS stands for Hadoop Distributed File System, and here is the architecture of HDFS. I already told you that it is a master/slave architecture: the master node is known as the name node and the slave nodes are known as data nodes. Then we've got another node here, which is known as the secondary name node. Now don't get confused into thinking that the secondary name node is just a replacement for the name node, because it is not. I'll tell you what a secondary name node does, but first let's go back and understand the entire architecture properly. You can see the little icons over all of these nodes. The name node is the master daemon, and you can think of it as the king. It's got a helper daemon, the secondary name node, which has the icon of the minister, and the pawns represent the slave daemons, the data nodes. The data nodes contain the actual data, so whenever you dump a file into HDFS and it gets distributed, your data is stored in the data nodes. The best thing about HDFS is that it creates an abstraction layer over the distributed storage resources. When you dump a file into HDFS, it gets distributed across different machines, but you can see the entire HDFS as a single unit, because it is laid out in such a structure.
Now let us view each of these components one by one. First we will take a look at the name node. The name node is the master daemon, and it maintains and manages the data nodes. What the name node does is preserve the metadata, and metadata means data about data.
Whatever file or whatever data is stored in the data nodes, the name node maintains a proper record of which data is stored in which data node. It serves requests from the clients as well, and since it acts as the master node, it also receives heartbeats. The little hearts that you saw popping up in the earlier slide are the heartbeats that the data nodes send to the name node; a heartbeat simply tells the name node that the data node is alive and functioning properly.
Now comes the secondary name node. The secondary name node does a very important task, and that task is known as checkpointing. Checkpointing is the process of combining the edit logs with the FS image. Now let me tell you what an edit log is and what an FS image is. Let's say that I set up my Hadoop cluster 20 days back. Every transaction that has happened since then, every new data block stored in my HDFS and every data block deleted, is recorded in a file known as the FS image, and the FS image resides on your disk. There's one more similar file, which is known as the edit log. The edit log does not keep the record of transactions from 20 days back, but only from a few hours back; let's say it keeps the details of the transactions that happened in the past four hours. Checkpointing is the task of combining the edit log with the FS image. It allows faster failover, as we have a backup of the metadata. So in a situation where the name node goes down and the entire metadata is lost, we don't have to worry: we can set up a new name node and get the same transactional files and the metadata from the secondary name node, because it has been keeping an updated copy. Checkpointing happens every hour by default, but you can also configure it. So let's understand the process of checkpointing in detail. Here are the FS image and the edit log.
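The merge at the heart of checkpointing, as defined above, can be sketched as a toy model. This is not Hadoop's on-disk format or API: the FS image is modelled as a plain dictionary from file path to block ids, the edit log as a list of tuples, and all names and paths are hypothetical.

```python
def apply_edit(namespace, edit):
    """Replay one logged transaction against a namespace dictionary."""
    operation, path, *details = edit
    if operation == "add":
        namespace[path] = list(details[0])  # block ids of the new file
    elif operation == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {operation}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the FS image.

    Returns the updated image and a fresh, empty edit log, mirroring the
    idea that a new edit log starts once the checkpoint has been taken.
    """
    merged = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []


old_image = {"/data/a.csv": ["blk_1"]}
recent_edits = [
    ("add", "/data/b.csv", ["blk_2", "blk_3"]),
    ("delete", "/data/a.csv"),
]

new_image, new_log = checkpoint(old_image, recent_edits)
print(new_image)  # namespace after replaying the recent edits
print(new_log)    # the edit log starts again from empty
```

Because `checkpoint` works on a copy, the old image stays intact until the new one is ready, which is roughly why the merge can be done by a helper rather than by the busy name node itself.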
The FS image resides on disk, and the edit log holds the most recent changes. What the secondary name node does is first copy the FS image and the edit log and merge them together in order to get an updated FS image. This FS image is then copied back to the name node, so the name node now has an updated FS image. In the meantime, a new edit log is created while the checkpointing is happening. This process keeps going on, and hence it helps the name node keep an updated copy of the FS image every hour.
Now let's talk about the data nodes. These are the slave daemons, and this is where your actual data is stored. Whenever a client gives a read or write request, the data node serves it, because the data is actually stored in the data nodes. So this is all about the components of HDFS.
Now let's understand the entire HDFS architecture in detail. We've got different data nodes here, and we can set up data nodes in racks. In rack one we've got three data nodes, in rack two we've got two data nodes, and each of the data nodes contains different data blocks, because in data nodes the data is stored in blocks; we'll learn about that in the coming slides. The client can request either a read or a write. Let's say that the client requests to read a particular file. It will first go to the name node, and since the name node contains the metadata, it knows exactly where the file is. It will give the client the IP addresses of the data nodes where the different data blocks of that particular file are, and it will tell the client: you can go to these data nodes and read the file. The client in turn goes to the different data nodes where the data blocks are present.
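The read path just described, where the client asks the name node for locations and then fetches the blocks from the data nodes, can be sketched as a toy model. The classes below are hypothetical stand-ins, not Hadoop's client API, and the IP addresses and block ids are invented.

```python
class DataNode:
    """Toy data node: holds block contents keyed by block id."""

    def __init__(self, address):
        self.address = address
        self.blocks = {}


class NameNode:
    """Toy name node: holds metadata only, never file contents."""

    def __init__(self):
        self.metadata = {}  # path -> [(block_id, [data node addresses])]

    def register_file(self, path, block_locations):
        self.metadata[path] = block_locations

    def locate(self, path):
        return self.metadata[path]


def read_file(name_node, data_nodes, path):
    """Ask the name node where the blocks are, then read them in order."""
    pieces = []
    for block_id, addresses in name_node.locate(path):
        for address in addresses:
            node = data_nodes.get(address)
            if node is not None and block_id in node.blocks:
                pieces.append(node.blocks[block_id])
                break  # one live replica is enough
        else:
            raise IOError(f"no reachable replica for {block_id}")
    return "".join(pieces)


dn1, dn2 = DataNode("10.0.0.1"), DataNode("10.0.0.2")
dn1.blocks["blk_1"] = "hello "
dn2.blocks["blk_1"] = "hello "
dn2.blocks["blk_2"] = "hdfs"

nn = NameNode()
nn.register_file(
    "/demo.txt",
    [("blk_1", ["10.0.0.1", "10.0.0.2"]), ("blk_2", ["10.0.0.2"])],
)

cluster = {dn1.address: dn1, dn2.address: dn2}
print(read_file(nn, cluster, "/demo.txt"))
```

Note that the name node in this sketch never touches the file contents; it only answers the question of where each block lives.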
The read request is then served. Now let's say the client wants to write a file. Again, it will contact the name node, and the name node will look at the metadata and check where space is available. Then it will give the IP addresses of the data nodes where the client can write the file, and the writing happens in a similar way. So this is how the entire read/write request is served by the data nodes.
Now let's talk about HDFS block replication. I've been telling you that HDFS is a fault-tolerant system, so let's see how. Each file is stored in HDFS as blocks. Whenever you dump a file into HDFS, it is broken down into blocks that are distributed across your Hadoop cluster, and the default size of each block is 128 megabytes. Let's say I have got a file of 380 megabytes. It will be divided into three blocks: the first block will be 128 megabytes, the second will be 128 megabytes, and the third will occupy whatever remains of the file, which happens to be 124 megabytes. Now let's say that we have a file size of 500 megabytes. How many blocks will it create? All right, AJ says it's 4, Rohit says it's 4, and of course you guys are right, it's 4 blocks. The first three blocks will be 128 megabytes each, and the last block will just occupy the remaining file size, which is 116 megabytes.
So now let's discuss block replication. Whenever you dump your file into HDFS, first it is divided into blocks, and then each of the blocks is copied two times, so now you have the original block and two more copies of the same block. A replication factor equal to 3 means that there are three copies of each block in your Hadoop cluster. You can see that I have a file of 248 megabytes, split into 128 megabytes and 120
megabytes, so my block 1 is there three times and block 2 is also there three times, on three different data nodes. We use this replication factor so that if any of the data nodes goes down, we can retrieve the data block from the other two data nodes. So this is how data blocks are replicated in HDFS. Now, in order to do the replication properly, there's an algorithm known as rack awareness, and it provides us with fault tolerance. The rack awareness algorithm
says that the first replica of a block will be stored in the local rack and the next two replicas will be stored in a different rack. We store a data block in the local rack, rack 1, so that our latency is decreased.
Now, these are the commands that you will use to start your Hadoop daemons, like your name node, your secondary name node, and your data nodes on the slave machines. This one is used to start all the Hadoop daemons in HDFS and YARN. I have not explained YARN yet, but YARN is the processing unit of Hadoop, so it will start all the YARN daemons as well, like the resource manager and the node manager. Then this is the command to stop all the Hadoop daemons. With jps you can check which daemons are currently running on your machine. So let me just show it to you. The first thing I need to do is change the directory to my Hadoop directory, so I'll just do cd hadoop, and now I can run all the commands. You can remember the first command was ./sbin/start-all.sh. It will ask for a password. Let me tell you that you can also configure it to be a passwordless process, so that you don't have to enter passwords when it starts certain daemons. So now let's use jps. Here are all the daemons that are running on my master: I've got my node manager, my secondary name node, Jps itself, which shows up because it is a Java process too, then data node, resource manager and name node. I'll tell you about the resource manager and node manager in the coming slides, don't worry about that. So these are just the daemons that are running on my master machine. Let me show you which daemons are running on my slave machine. This
is the terminal of my slave machine. I'm just going to run jps here, and these are the processes, or the daemons, that are running on my slave machine. Node manager and data node are both slave daemons, and they are running on my slave machine. If you want to stop all the daemons, you can run the same command with stop instead of start. Since I'm going to use my HDFS, I'm not going to stop it and show you, but the process is the same.
These are a few commands that you can use to write or delete a file in Hadoop. If you want to copy a file from your local file system to HDFS, you use this command: hadoop fs -put, followed by the name of your file. You have to type the proper path of the file so that it can be copied to HDFS, and you can also mention the destination folder in HDFS where you want to copy it. If you leave that blank, the file is copied to the default directory in HDFS. If you want to list all the HDFS files, you can do that using this command, and if you want to remove the same file again, you can use this command: hadoop fs -rm, which is used for removing. This is also the first step that you need to do when you're starting to analyze something: you copy your data sets into HDFS first and then analyze them.
So we've seen HDFS. Now let's take a look at YARN, which is the processing unit of Hadoop. So what is YARN? It is nothing but MapReduce version 2. When Hadoop came up with its new version, Hadoop 2.0, it introduced YARN as the new framework. It stands for Yet Another Resource Negotiator, and it provides the ability to run non-MapReduce applications. Because of YARN we are able to integrate different tools like Apache Spark, Hive, Pig, etc., and it provides us with a paradigm for parallel processing over Hadoop.
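Before going further into YARN, the HDFS block arithmetic and the rack awareness rule from the HDFS discussion can be checked with a short sketch. It assumes the 128 MB default block size quoted in the video and the rack layout from the slide (three data nodes in rack one, two in rack two). The node names are hypothetical, and the placement function is a simplification of the stated rule, not Hadoop's actual placement policy.

```python
BLOCK_SIZE_MB = 128  # default block size quoted in the video


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Sizes of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left
    return sizes


def place_replicas(racks, local_rack):
    """One replica on the local rack, two on one other rack (simplified)."""
    remote_rack = next(
        name for name, nodes in racks.items()
        if name != local_rack and len(nodes) >= 2
    )
    return [racks[local_rack][0]] + racks[remote_rack][:2]


RACKS = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}

for size in (380, 500, 248):
    print(size, "MB ->", split_into_blocks(size))
print("replicas of one block:", place_replicas(RACKS, "rack1"))
```

Keeping one replica local and two on another rack is a trade-off: the local copy keeps latency low, while the remote copies survive the loss of the whole local rack.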
When you dump all your data into HDFS, it gets distributed, and all this distributed data is processed in parallel, and that is done with the help of YARN. You can see the architecture of YARN over here; it is again a master/slave topology. The master daemon here is known as the resource manager, and the slave daemons are known as node managers. Let's take a look at these components one by one. First up is the resource manager. This is the master daemon, and it receives the processing requests. Whenever a client comes up with a request, it goes to the resource manager first, because the resource manager manages all the slave nodes, or node managers. So whenever a client wants to process some data, the resource manager takes that request and passes it on to the corresponding node managers.
Now let's see what a node manager is. Node managers are the slave daemons, and they are installed on every data node. You know that our data is divided into blocks that are stored in the data nodes, and they are processed on the same machine. So on the same machine where a data node is set up, a node manager is also present to process the data present in that data node. It is responsible for the execution of the tasks on every single data node, so this is where the actual processing of data takes place.
Let's see the entire architecture in detail. The client comes up with a request to the resource manager in order to process the data, and then the resource manager passes on the request to the node manager. There are important components that I'm going to talk about here, and you should pay attention to them. In the node manager we've got a container and an app master. An app master is launched for every specific application code, that is, for every job or processing task that the client comes up with. The application master, or app master, is responsible for taking care of all the resources required to execute that code. If there is a requirement for any resource, it is the app master that asks the resource manager for it. The resource manager then provides the app master with the resources, and the app master asks the node managers to start a container. The container is the place where the actual execution happens.

Now let's see the YARN workflow in order to understand things better. Here is the client, and this client wants to run a job. In this example I'm considering a MapReduce job. The MapReduce code is shown first as a MapReduce job, and the client wants to run this particular job. It submits the job to the resource manager and asks the resource manager to execute it. The resource manager gets back to the client with an application ID for the job. Then the resource manager starts the container where the app master is launched; the app master is itself launched in a particular container. The app master then gathers all the resource requirements for running that job and asks the resource manager to allocate those resources. After that, when all the resources are provided, the node manager launches and starts the container, and this is where the job executes.
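The job submission flow just described can be traced with a small toy model in plain Python. This is only a sketch for following the order of events: the class, the function, the ID format and the job name are all made up for illustration and are not part of any YARN API.

```python
class ToyResourceManager:
    """Hypothetical stand-in for the resource manager (not a YARN API)."""

    def __init__(self):
        self.log = []        # ordered record of what happened
        self._next_id = 1

    def submit(self, job_name):
        # The client submits the job and gets an application ID back.
        app_id = f"application_{self._next_id:04d}"
        self._next_id += 1
        self.log.append(f"client submits {job_name}, receives {app_id}")
        # The resource manager starts a container that hosts the app master.
        self.log.append(f"container started for app master of {app_id}")
        return app_id

    def allocate(self, app_id, wanted):
        # The app master asks for resources and the resource manager grants them.
        self.log.append(f"app master of {app_id} requests {wanted} container(s)")
        self.log.append(f"resource manager allocates {wanted} container(s)")
        return [f"{app_id}_container_{i}" for i in range(1, wanted + 1)]


def run_job(rm, job_name, wanted=2):
    """Walk one job through the toy workflow and return its application ID."""
    app_id = rm.submit(job_name)
    for container in rm.allocate(app_id, wanted):
        # The node manager launches each container; the job code runs inside it.
        rm.log.append(f"node manager launches {container}, job code runs")
    return app_id


rm = ToyResourceManager()
app_id = run_job(rm, "example-mapreduce-job")
for step in rm.log:
    print(step)
```

The log comes out in the same order as the talk: submission, application ID, app master container, resource request, allocation, and finally the containers where the job runs.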
Let's take a look at the entire YARN application workflow step by step. The first step is that the client submits an application to the resource manager. Then the resource manager allocates a container to start the app master. Then the app master registers with the resource manager and tells it that an app master has been created and is ready to oversee the execution of the code. Then the app master asks the resource manager for containers. The app master also notifies the node manager to launch the containers, and after the containers are launched, the application code of that particular client is executed in the container. Then the client contacts the resource manager to ask for the application status, whether it has executed successfully or not. Finally, the app master unregisters with the resource manager. So this is the entire workflow.

Now let's take a look at the entire Hadoop cluster architecture, HDFS with YARN. Here you can see that both HDFS and YARN follow a master/slave topology. The master in HDFS is the name node, and the master in YARN is the resource manager. The slave daemons in HDFS are the data nodes, and this is where all your data is stored. In YARN it is the node manager; this is where the data is processed in a container, and the app master takes care of all the resources that are necessary to execute your program. There is one important thing that you should know, and you must have noticed it: my data node and my node manager lie on the same machine. So this data node and this node manager will be on the same machine, and the same goes for this data node and this node manager. But it is not necessary that the name node and the resource manager are on the same machine. They could be, but it's not necessary. The name node can be on one machine and the resource manager can be on another machine, all right.
Don't get confused and assume that these two must also be on the same machine; that is not the case. Now let me tell you about the Hadoop cluster hardware specification, some of the hardware specifics that you should keep in mind if you want to set up a Hadoop cluster. For the name node you need 64 GB of RAM, and your hard disk should be a minimum of 1 TB. The processor should be an 8-core Xeon, and the Ethernet should be 3 x 10 GB. The operating system should be 64-bit CentOS or Linux, and the power should be a redundant power supply, because you don't want the name node to go down. Why? Because if the name node goes down, your entire HDFS
will go down. For the data node you need 16 GB of RAM. The hard disk should be 6 x 2 TB, because this is where you will be storing all the data, so it needs a lot of storage. The processor should be a Xeon with 2 cores, Ethernet 3 x 10 GB, and the OS should again be 64-bit CentOS or Linux. For the secondary name node, RAM should be 32 GB, the hard disk should be 1 TB, the processor a Xeon with 4 cores, Ethernet 3 x 10 GB, the OS 64-bit CentOS or Linux, and the power again should be a redundant power supply. You might pause your screen and take a screenshot of this image. Don't worry, this presentation and this recording will be there in your LMS as well. So this is what you should keep in mind if you're setting up a Hadoop cluster; these are the hardware specifications required to do that.

Now let me tell you about some real Hadoop cluster deployments. Let us consider our favorite example, which is Facebook. Facebook has got 21 petabytes of storage in a single HDFS cluster, and one petabyte is equal to 10 raised to the power of 15 bytes. They have got 2,000 machines per cluster and 32 GB of RAM per machine. Each of these machines runs 15 MapReduce tasks. 1,200 machines have 8 cores each and 800 machines have 16 cores each, and there are 12 terabytes of data per machine. So it is a total of 21 petabytes of configured storage capacity. It is larger than Yahoo's cluster, which was previously known to be the largest Hadoop cluster at 14 petabytes, so Facebook has beaten Yahoo with 21 petabytes.

Now let's talk about another use case, which is Spotify. How many of you listen to music on Spotify? All right, it looks like some of you do. Even Spotify uses Hadoop for generating music recommendations.
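The Facebook figures quoted above can be cross-checked with a little arithmetic. The sketch below is plain Python and uses only the numbers given in the talk, with decimal units assumed (1 TB = 10^12 bytes, 1 PB = 10^15 bytes). Note that 12 TB on each of 2,000 machines comes to 24 PB of raw disk, which is more than the 21 PB of configured capacity; the talk does not explain the difference.

```python
TB = 10**12   # bytes in a terabyte (decimal units assumed)
PB = 10**15   # bytes in a petabyte
GB = 10**9    # bytes in a gigabyte

machines_8_core = 1200
machines_16_core = 800
total_machines = machines_8_core + machines_16_core

# Cores across the whole cluster.
total_cores = machines_8_core * 8 + machines_16_core * 16

# Raw disk if every machine holds 12 TB, versus the configured 21 PB.
raw_disk_bytes = total_machines * 12 * TB
configured_bytes = 21 * PB

# 15 MapReduce tasks per machine, 32 GB of RAM per machine.
concurrent_tasks = total_machines * 15
total_ram_bytes = total_machines * 32 * GB

print("machines:", total_machines)
print("cores:", total_cores)
print("raw disk (PB):", raw_disk_bytes / PB)
print("configured (PB):", configured_bytes / PB)
print("concurrent MapReduce tasks:", concurrent_tasks)
print("total RAM (TB):", total_ram_bytes / TB)
```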
When you listen to music, you see that some new songs are recommended to you, and they belong to the same genre that you have been listening to, right? That is done by big data analysis with Hadoop. Spotify has got 1,650 nodes, and they have approximately 65 petabytes of storage. Spotify has 70 terabytes of RAM, they run more than 25,000 Hadoop jobs daily, and the cluster has got 43,000 virtualized cores. So it is an even larger cluster than Facebook's, right? These were two use cases where Hadoop clusters are used to process and store big data.

Now that you have learned about Hadoop, both HDFS and YARN, the storage and the processing components of Hadoop, let's take a look at Apache Spark. Apache Spark is an open-source cluster computing framework for real-time processing. It has a thriving open-source community and is the most active Apache project at this moment. Spark's components are what make Apache Spark fast and reliable, and a lot of them were built to resolve the issues that cropped up while using Hadoop MapReduce.

Apache Spark has got the following components. It has got the Spark Core engine. The core engine is the base of the entire Spark framework; every component is based on it and is placed on top of it. First we've got Spark SQL. Spark SQL is a Spark module for structured data processing, and you can run unmodified Hive queries on existing Hadoop deployments. Then we've got Spark Streaming. Spark Streaming is the component of Spark that is used to process real-time streaming data. It is a useful addition to the core Spark API because it enables high-throughput, fault-tolerant stream
processing of live data streams. Then we've got Spark MLlib. This is the machine learning library for Spark, and we'll be using Spark MLlib to implement machine learning in our use cases too. Then we've got GraphX, which is the graph computation engine; this is the Spark API for graphs and graph-parallel computation. It has got a set of fundamental operators like subgraph, joinVertices, etc. Then you've got SparkR. This is the package for the R language that enables R users to leverage the power of Spark from the R shell. People who have already been working with R are comfortable with it, so they can keep using the R shell directly and at the same time use Spark through this particular component, SparkR. You can write all your code in the R shell and Spark will process it for you.

Now let's take a deeper look at all these important components. We've got Spark Core, and Spark Core is the basic engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala and Python APIs offer a platform for distributed ETL development. Additional libraries built on top of the core allow for diverse streaming, SQL and machine learning workloads. The core is also responsible for scheduling, distributing and monitoring jobs in a cluster, and for interacting with storage systems.

Let's take a look at the Spark architecture. Apache Spark has a well-defined, layered architecture where all the Spark components and layers are loosely coupled and integrated with various extensions and libraries. First, let's talk about the driver program. This is the Spark driver, which contains the driver program and the Spark context. This is the central point and the entry point of the Spark shell. The driver program runs the main function of the application, and this is the place where the Spark context
is created. Well, what is the Spark context? The Spark context represents the connection to the entire Spark cluster, and it can be used to create resilient distributed datasets, accumulators and broadcast variables on that cluster. You should know that only one Spark context may be active per Java virtual machine, and you must stop any active Spark context before creating a new one. Let's talk about the driver program. It runs on the master node of the Spark cluster; it schedules the job execution and negotiates with the cluster manager. This is the cluster manager over here. The cluster manager is an external service that is responsible for acquiring resources on the Spark cluster and allocating them to a Spark job. Then, in the worker node, we have got the executors. An executor is a distributed agent that is responsible for the execution of tasks, and every Spark application has its own executor processes. Executors usually run for the entire lifetime of the Spark application, and this is known as static allocation of executors. But you can also opt for dynamic allocation of executors, where you can add or remove Spark executors dynamically to match the overall workload.

Now let me tell you what actually happens when a Spark job is submitted. When a client submits Spark user application code, the driver implicitly converts the code, which contains transformations and actions, into a logical directed acyclic graph, or DAG. At this stage the driver program also performs certain kinds of optimizations, like pipelining transformations, and then converts the logical DAG into a physical execution plan with a set of stages. After creating the physical execution plan, it creates small physical execution units, referred to as tasks, under each stage. Then these tasks are bundled to be sent to the Spark cluster.
The driver program then talks to the cluster manager and negotiates for resources, and the cluster manager then launches the executors on the worker nodes on behalf of the driver. At this point the driver sends tasks to the executors based on data placement. Before the executors begin execution,
they first register themselves with the driver program, so that the driver has a holistic view of all the executors. The executors then execute the various tasks assigned to them by the driver program. At any point in time while the Spark application is running, the driver program monitors the set of executors that are running the application code. The driver program also schedules future tasks based on data placement, by tracking the location of the cached data. So I hope you have understood the architecture part. Any doubts? All right, no doubts.

Now let's take a look at Spark SQL and its architecture. Spark SQL is a newer module in Spark. It integrates relational processing with Spark's functional programming API, and it supports querying data either via SQL or via the Hive Query Language. For those of you who are familiar with RDBMS, Spark SQL will be a very easy transition from your earlier tools, because you can extend the boundaries of traditional relational data processing with it. It also provides support for various data sources and makes it possible to weave SQL queries with code transformations, and that is why Spark SQL has become a very powerful tool.

This is the architecture of Spark SQL. Let's talk about each of these components one by one. First we have got the Data Source API. This is the universal API for loading and storing structured data, and it has built-in support for Hive, Avro, JSON, JDBC, CSV, Parquet, etc. It also supports third-party integration through Spark packages. Then you've got the DataFrame API. A DataFrame is a distributed collection of data that is organized into named columns, and it is similar to a relational table in SQL, which is used for storing data in tables. It is a domain-specific language, or DSL, applicable
to structured and semi-structured data. It processes data from kilobytes to petabytes, on anything from a single-node cluster to a multi-node cluster, and it provides different APIs for Python, Java, Scala and R programming. So I hope you have understood the architecture of Spark SQL. We will be using Spark SQL in order to solve our use cases.

These are the different commands to start the Spark daemons. They are very similar to the Hadoop commands for starting the HDFS daemons. You can see the command to start all the Spark daemons; the Spark daemons are the master and the worker. To check whether all the daemons are running on your machine, you can use jps, like in Hadoop. Then, in order to start the Spark shell, you can use this command. You can go ahead and try this out. It is very similar to the Hadoop part that I showed you earlier, so I'm not going to do it again.

So we've seen Apache Spark as well. Now let's take a look at k-means and Zeppelin. K-means is the clustering method, and Zeppelin is what we're going to use to visualize our data. Let's talk about k-means clustering. K-means is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a data set into a certain number of clusters, which is fixed prior to performing the clustering. The main idea is to define k centroids, one for each cluster. The centroids should be placed in a cunning way, because different locations cause different results. Let's take an example. Let's say that we want to cluster the total population of a certain location, and we want to cluster them into four different clusters,
namely groups one, two, three and four. The main thing to keep in mind is that the objects in group one should be as similar as possible, while there should be as much difference as possible between an object in group one and an object in group two. It means that the points lying in the same group should have similar characteristics, and they should be different from the points lying in a different cluster. The attributes of the objects determine which objects should be grouped together. For example, let us take the same US county data that we're using, so let's consider the second data set. There are a lot of features there that I already told you about: there are age groups, and the people are also categorized by profession and by ethnicity. These are the attributes that will allow us to cluster our data. So this is k-means clustering.

Here is one more example. Let us consider a comparison of income and balance. On my x-axis I've got the gross monthly income, and on the y-axis I have the balance. I want
to cluster my data according to these two attributes. If you look here, this is my first cluster and this is my second cluster. This cluster indicates the people who have a high income and a low balance in the account; they spend a lot. And this cluster comprises the people who have got a low income but a high balance; they save. You can see that all the points lying here have similar characteristics: they have a low income and a high balance. And here are the people who share the other set of characteristics: they have a low balance and a high income. There are a few outliers here and there, but they don't form a cluster. So this is an example of k-means clustering, and we'll be using it to solve our problems. Does anybody have any questions?

Here is one more example, and one more problem for you, so you guys will tell me now. The problem is that I want to set up schools in my city, and these points indicate where each student lives. My question to you is: where should I build my schools if I have students living around the city in these particular locations? To find that out, we will do k-means clustering and find the center points. If you cluster all these locations into groups and set up a school at the center point of each cluster, that would be optimal, wouldn't it? That way the students have to travel less, because the school will be close to everyone's house. And there it is. We have formed three clusters: the brown dots are one cluster, the blue dots are one cluster, and the red dots are one cluster, and we have set up schools at the center point of each cluster. Here is one, here is one, and here is yet another one. So this is where I need to set my schools up so that my students do not have to travel that much. That was all about k-means, and now let's talk about Apache
Zeppelin. This is a web-based notebook which brings data ingestion, data exploration, visualization, sharing and collaboration features to Hadoop and Spark. Remember when I showed you my Zeppelin notebook: you could see that we have written code there, we have even run SQL code there, and we have made visualizations by executing code there. That is how interactive Zeppelin is. It is a very powerful visualization tool that goes very well with Linux systems, and it supports a lot of language interpreters, including R, Python and many others.

Now let's move on to the solution of the use case, which is what you've been waiting for. First we will solve our US county use case. The first thing we will do is store the data in HDFS. Then we will analyze the data by using Scala, Spark SQL and Spark MLlib, and finally we'll find out the results and visualize them using Zeppelin. This was the entire US election solution strategy that I told you about, and I don't think I should repeat it again, but if you want me to, I can. Should I repeat it? All right, most of the people are saying no, so I won't go through this one again. Let me just go to my VM and execute this for you.

This is my Zeppelin, and I have opened my notebooks here. Let us go to my US election notebook, and this is the code. First of all, I am importing certain packages, because I'll be using certain functions that are in those packages. I've imported the Spark SQL packages, and I have also imported the Spark MLlib packages, because I'll be using k-means clustering.
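Before the Spark code, the k-means idea from the school placement example can be sketched in plain Python. This is not the Spark MLlib implementation the speaker uses. It is a minimal version of the standard two-step loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), and the home coordinates and the starting centroids are made up for illustration.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Basic k-means for 2-D points with caller-supplied starting centroids."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its member points.
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, assignment


# Made-up student home locations forming three neighbourhoods.
homes = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
start = [homes[0], homes[3], homes[6]]  # one starting centroid per cluster

schools, groups = kmeans(homes, start)
for school in schools:
    print("build a school near", school)
```

Each returned centroid plays the role of a school site: it is the average position of the homes in its cluster. As the talk points out, different starting centroids can lead to different results, which is why the start list is passed in explicitly here.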
Over here I have the VectorAssembler package, which gives me certain machine learning functions that I am going to use. I've also imported the KMeans package, because I'll be using k-means clustering. Then the first thing that you need to do is start the SQL context, so I have started my Spark SQL context here. The next thing that you need to do is define a schema, because when we want to load our data set, it should be in a particular format, and we have to tell Spark which format it should be in. So we're defining a schema here. Let me take you through the
code. I'm storing the schema in a variable called schema, and we have to define the schema in a proper structure, so we're going to start with StructType. Since you know that our data set has different fields as columns, we're going to define this as an array of fields, so this is Array and then StructField. We are defining the different fields now. We start with the first field by defining it as a StructField, and inside the parentheses you mention the name of that particular field, which I have named state. It should be of string type, and then true, which in a StructField means the field is allowed to be null. Next we've got fips, which is of string type. Now, I know that FIPS is a number, but since we are not going to do any kind of numeric operation on FIPS, we're going to let it stay as a string. Then we've got party as string type, candidate as string type, and then votes as integer type, because we're going to count the number of votes and there are going to be certain numeric operations that will help us analyze our data. Then we've got fraction votes, which you know is a decimal, so we have to keep it as double type.

The next thing is that Spark needs to read the data set from HDFS. For that you use the command spark.read with option header true. Header true means that you have told Spark that the data set already contains column headers; state, state_abbr and so on are nothing but column headers. So you don't have to define the column headers explicitly, and Spark will not pick some random row as the column header; it will use only the column headers that your data set has. Then you have to mention the schema that you have defined. I have defined it in my variable schema, so that is what I have mentioned here. My file should be in CSV format, and then I have mentioned the path of the file in my HDFS.
This is the path, and I store this entire data set in my variable DF. Now what I am going to do is split certain rows out of my data set, because you know that my data set contains both the Republican and the Democrat data, and I just want the Democrat data, right? We are going to analyze the Hillary Clinton and Bernie Sanders part. So this is how you divide your data set. The first thing we have done is create one more variable called DFR, where we have applied a filter where party is equal to Republican, and then we are storing the Democrat Party data in DF_D. We're going to use DF_D from here onwards, and DFR, the Republican data, is going to be your assignment for the next class. I am going to analyze the Democrat data now, and after this class is over, I want you guys to take the Republican data. The data set is already available in your LMS, and you've got the VMs with everything installed as well. So when you are at home and have free time, please analyze the Republican data and tell me what the reasons were that Donald Trump won. I want you to do all that analysis and bring it to the next class, and we'll discuss whatever results and conclusions you have reached after analyzing the Republican data. That way you'll learn even more, and it will also be practice for you after today's class.

All right, so we are going to take DF_D now. The first thing we will do is create a table view, and I'm going to name the table view election. Let me just show you what it looks like and what it has. This is the command that I have run in Zeppelin; it is SQL code. You can see that I have got state and state_abbr, and I have only got the Democrat data. All right, now let's go back. All right.
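The schema and the party filter described above can be mimicked without Spark. The sketch below is plain Python, not the speaker's Scala code: it parses a CSV with a header row, casts each column to the type named in the talk (votes as integer, fraction votes as double, everything else as string), and splits the rows by party. The column spellings are assumptions based on the spoken names, and the rows are placeholder data invented for illustration, not real election results.

```python
import csv
import io

# Column name -> type, following the types named in the talk.
# The exact column spellings are assumed.
SCHEMA = {
    "state": str,
    "state_abbr": str,
    "county": str,
    "fips": str,            # kept as text: no arithmetic is done on it
    "party": str,
    "candidate": str,
    "votes": int,
    "fraction_votes": float,
}

# Placeholder rows, invented for illustration only.
SAMPLE_CSV = """state,state_abbr,county,fips,party,candidate,votes,fraction_votes
Example State,EX,County A,00001,Democrat,Candidate X,60,0.6
Example State,EX,County A,00001,Democrat,Candidate Y,40,0.4
Example State,EX,County A,00001,Republican,Candidate Z,75,0.75
"""


def load(text):
    """Read CSV text whose first row is the header and apply SCHEMA types."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: cast(row[name]) for name, cast in SCHEMA.items()} for row in reader]


df = load(SAMPLE_CSV)

# Split by party, as the talk does with its Republican and Democrat variables.
republican_rows = [row for row in df if row["party"] == "Republican"]
democrat_rows = [row for row in df if row["party"] == "Democrat"]

print(len(democrat_rows), "Democrat rows,", len(republican_rows), "Republican rows")
```

Keeping fips as text matches the talk's choice, and in this sketch it has the side effect of preserving the leading zeros in the placeholder code.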
After creating the table view, all of the Democrat data is in my election table. Now I am creating a temporary variable, and I'm running Spark SQL code. Let me tell you what I'm actually doing by writing this code. The motive of writing this SQL query is that I want to refine my data even more. What I'm trying to analyze here is how a particular candidate actually won, so I don't need the losing data. You know that each FIPS contains one row for the losing candidate and one row for the winning candidate, because my data set contains the data of both Bernie Sanders and Hillary Clinton. In some counties Bernie Sanders won, and in some counties Hillary Clinton won. So I just want to find out
who the winners are in each particular county, okay? I'm going to refine the data, and for that I'm using this query. I'm going to select all from election, and then I'm going to perform an inner join with another query, so this is one more query inside this query. Let me tell you what I'm actually doing. First of all, we have selected fips as b. You know that you have got two entries for each FIPS, so each FIPS actually appears twice in the data set; I named it b. Then we are taking the maximum of fraction votes. In each FIPS there is a maximum fraction vote, and we can find the winner by seeing who has got the maximum fraction votes. We have named the maximum fraction votes column a, and we are grouping by fips. So for each FIPS, the row with the maximum fraction votes will be selected. Say I have two rows for a FIPS, 1001 and 1001; only the row that has the maximum fraction votes will be selected. Now I'll have the winner data. I've named this entire table inside the query groupTT, and then I'm matching against it: election.fips, the fips of the main table view, should be equal to the b column that we created in the groupTT table, and election.fraction_votes should be equal to groupTT.a. Any doubts on this query, about how I have written it?

So now, whatever data we've got here, I'm storing in election1. Let me just show you what is in election1. This is my election table only, and you can see that I've got each FIPS twice, so 1067, 1067. Now let me show you election1. There, now I can see that I don't have repetition of FIPS. I have
only one entry for each FIPS, and that is the row which tells me who won in that county, or in the FIPS associated with that county. You can see that for Bullock it was Hillary Clinton, for the next one it was Hillary Clinton, Cherokee also Hillary Clinton, and then for State House District 19 it is Bernie Sanders; so Alaska is mainly Bernie Sanders. So this is what we've done now. You can see that we have also got additional columns, b and a. a tells you the maximum fraction votes and b tells you the FIPS, so the data in fips and the data in b are the same, and the data in fraction_votes and the data in a are the same, right? Since my columns are repeating with the same values, I don't want a and b any more. So I'm going to filter out the columns I don't need, in this case b and a. I'm going to make a temporary variable again; I'm using the temporary variable to store some data temporarily. I'm writing Spark SQL code to select only the columns that I want: state, state abbreviation, county, fips, party, candidate, votes and fraction votes from election1. I'm storing everything in d_winner: I've created this new variable, and whatever there was in temp I'm assigning to d_winner. And now I've got
only the winner data. I have got all the counties, and I've got who won in each particular county and by what fraction of the votes. What I'm doing till now is just refining our data set so that it will be easy for us to draw some conclusions or gain some insights from the data, right? Also, let me tell you that it's not necessary to refine your data set in the exact way that I'm doing it. If you have something in mind after you've seen and understood your data and figured out what you actually need to do, you can carry out different steps. This is just one way of doing it, and this is my way, so I'm just telling you. Then we are creating a table view for d_winner, and we are going to name it democrat. Let me go back and show you what the democrat table view looks like. You can press Shift+Enter. So there you have it. We no longer have the columns a and b that we had in election1, and I've just got the winner data.

Now let us go back. What I want to find out next is which of the candidates won each state. Whatever result I get will be stored in the temporary variable, and then I'm assigning everything in the temporary variable to a new variable called d_state. Similarly, I'm going to create a table view for d_state, which is state. Let me show you what my state table view actually contains. There it is. I've got state Connecticut, Hillary Clinton won 55 counties; Florida, Hillary Clinton won 58 counties. So this is what we've come up with for our first data set. So now, let's