Components of Apache Spark Runtime Architecture


Okay, it has been a while, so let me pull up the documentation. Can you see my screen, Sankit? Yes? Good. I will not go deeper into why big data exists and all its advantages and disadvantages; we have already discussed that overview. For big data engineering we need a lot of data processing, and for that we need to handle a huge volume of data. So let's focus on some shortcomings of the Hadoop system.

Big data is a problem statement that relies on processing huge volumes of data, and Hadoop was the solution to it. Hadoop is nothing but a framework with a definite set of tools, and each tool addresses a specific need. Whenever we talk about data, there are always two steps: first, store the data somewhere; second, have an engine, a processing engine, to process that data.

For storage there are a number of systems: a normal Unix server, your computer's hard disk, an Amazon S3 bucket, some cloud storage. In Hadoop the storage platform is HDFS. HDFS stands for Hadoop Distributed File System: it is a file system, and it is distributed, so you can store data at any scale inside it. In GCP the equivalent is GCS, Google Cloud Storage. Whatever you use, on-prem storage, S3, Azure Blob Storage, GCS, they all follow the same fundamental idea as HDFS: a distributed file system.

So one problem statement is solved: where to store the data. Store it inside HDFS. The second problem statement is: if I have a huge volume of data, how am I going to process it? For that, Hadoop suggested one mechanism, one tool, known as MapReduce. MapReduce is a processing engine that reads the data from HDFS, processes it using Java code at the back end, and gives you the output.

But there are drawbacks, and that is why in the upcoming sessions we will discuss PySpark: because of the drawbacks of MapReduce we move to another framework, another solution, which is Spark. So what are the drawbacks of MapReduce? MapReduce is not a real-time data processing engine. Data processing is broadly of two types: batch processing and real-time data processing. Real-time data processing is nothing but live streaming. For example, say a test match is going on between India and Australia.
What time gap do you expect? The moment the batsman hits a shot, we see it live on Hotstar or wherever we are watching; that is real-time data processing. But think about an Amazon delivery or a grocery delivery, where you order some product. (One second, it looks like my screen share stopped; it was there earlier. Let me share it again. Can you see it now? Okay, thanks.)

So, batch processing: say you order some products on Amazon or Flipkart. Do they deliver the product right after the moment you order it? No. They collect all the orders throughout the day, and at midnight or at 11 p.m. they run some jobs: this product should be shipped here, that product should be shipped there. Or take bill generation in telecom: you have a postpaid SIM or a landline. The moment you make a call, do you get the bill? No, they accumulate all the calls until the end of the month and then generate the bill. So for your internet bill the delta, the time gap, is 30 days; for food delivery there is some other delta. That is batch processing. MapReduce is the engine that best fits batch processing, not real-time data processing.

So the speed is very low, and the second drawback is the language barrier: MapReduce jobs can only be written in Java. Even for batch processing the speed is low, and there is no in-memory processing the way there is inside Spark (we will come to that later with PySpark). To overcome all of these drawbacks, the big data industry, the researchers in big data, came up with another solution known as Spark.

So what is Spark? Spark is nothing but another processing engine. It is only a processing engine, guys, just remember that. Does MapReduce store data? No. The tool meant for storing the data is HDFS; HDFS is the platform where I store the data. The storage of the data remains as-is whether it is a Hadoop environment or a Spark environment; only the processing part changes. Previously the processing engine was MapReduce; now I read the data from HDFS and use the Spark processing engine instead, to avoid those drawbacks.

The first advantage is that Spark handles both batch processing and streaming processing. Take Uber rides or your WhatsApp text messages: are those batch processing or live processing? They are live data processing, and Spark is capable enough to process data in batch as well as in streaming. Similarly, the language constraint goes away: I can write my code in Python, Scala, Java or SQL, so there are four language options. And there is another concept, in-memory data processing. If you remember, I gave this example: whenever I ask you for each country and its capital, the first time you might need one hour; the second time you Google it and store it in an Excel sheet, so the second attempt takes maybe 10 minutes and the third maybe five minutes. But the moment you keep all the countries and their capitals in your brain, and the brain is nothing but memory in Spark terms, you need only a fraction of a second to process the data. Spark is the processing engine that is capable of leveraging this in-memory data processing.
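To make the country-capital analogy concrete, here is a minimal sketch of how caching keeps a dataset in executor memory so that repeated lookups do not go back to the source. The dataset, column names and values are made up for illustration, not taken from the session.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# A tiny illustrative dataset of countries and capitals.
capitals = spark.createDataFrame(
    [("India", "New Delhi"), ("Australia", "Canberra"), ("Japan", "Tokyo")],
    ["country", "capital"],
)

# First access: the data is computed/read and kept in memory on the executors.
capitals.cache()
capitals.count()          # an action that materializes the cache

# Subsequent lookups are served from memory instead of being recomputed.
capitals.filter(capitals.country == "India").show()
```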
So that is why Spark. And if Spark is a framework and I use this framework through a programming language called Python, you combine the terms: Python plus Spark gives you PySpark. Any doubt so far? That is the overview we discussed. Sankit, you are good? Okay.

Now let's move on; give me a second, guys, let me find the documentation. We will jump into the Spark architecture. First, the features of Spark, the advantages: speed, powerful deployment, real-time processing and scalability. Scalable means you can scale up: say today I am processing 10 GB of data and tomorrow an additional 100 GB arrives; my framework should be capable enough to handle that growth. That is scalability. I already mentioned Scala, Python, Java and SQL; those are the languages you can use.

Now let's focus on the modules. Spark has three main modules: Spark Core, Spark SQL and Spark Streaming. Spark Core is the base processing engine, which you drive through a programming language; here we are using Python. If you want to write your Spark logic using SQL, that is Spark SQL. And Spark Streaming is nothing but processing data through a streaming pipeline. From a data engineering perspective we need these three. There are another two modules, Spark GraphX and Spark MLlib: MLlib is for machine learning people and GraphX is for graph processing, the x-coordinate, y-coordinate kind of work, more relevant to data science. As data engineers we are least bothered about Spark GraphX and Spark MLlib.

Okay, now for the more elaborate and more important part, the Spark architecture. If you remember, inside Hadoop we have the concept of a master-slave architecture. There is a master; the master inside Hadoop is the name node. And there are n number of workers, called data nodes or worker nodes. Remember, that is the master-slave architecture: there is continuous communication, continuous I/O, between the master and the workers. That is how processing works inside Hadoop.

Inside Hadoop there is also a concept called the resource manager. What does the resource manager do? The resource manager is the tool responsible for allocating resources.
If you remember, in HDFS we discussed block size: splitting the data into smaller blocks and placing them across nodes. Who takes care of all of that? The resource manager. The resource manager is the tool that knows which block is free and where each piece of data can be placed.

The Spark architecture follows a similar approach; it is also a master-slave architecture. In the Hadoop picture I drew three worker nodes; here I have drawn two: this is worker one, this is worker two. How many name nodes are there? One. But in Spark programming we do not say name node and worker node. Here the name node is nothing but the driver, the Spark driver, the one who drives the engine and the processing, and the worker is nothing but the executor. Where does the executor reside? Inside the worker node. Where does the driver reside? On the name node, the master node.

So who coordinates between the driver and the workers? There is a component called the cluster manager. Think about storage first: in Hadoop I store the data, meaning I dump it. Say I receive a 100 GB file. To store this 100 GB file I apply the HDFS concept: split the file into smaller chunks and place them on the individual worker nodes. Once it is placed, who keeps all the metadata, all the placement information? That information sits with the cluster manager. Say you placed 50 GB on worker one, 30 GB on worker two and the remaining 20 GB on worker three: all of that information is with the cluster manager. It manages the whole cluster.

There are three common types of cluster managers. The first is YARN; if you remember, YARN stands for Yet Another Resource Negotiator. In today's world there is also Kubernetes; in Google you will find GKE, Google Kubernetes Engine. And there is another cluster manager, Mesos, but basically nobody uses it now; it is more or less abandoned. People also still run the older, primitive Spark setups they have been using till now. So what does the cluster manager do? It allocates the jobs and takes care of memory, CPU overhead and all of those concerns. That is what the Spark architecture picture looks like.
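As a rough sketch of where the cluster manager enters the picture, the `master` setting (or the `--master` flag of `spark-submit`) tells the driver which cluster manager to talk to. The URLs and host names below are illustrative placeholders, not a real cluster, and running against YARN or Kubernetes would also need the corresponding cluster configuration in place.

```python
from pyspark.sql import SparkSession

# Local mode: no external cluster manager, everything runs in one process.
spark = (
    SparkSession.builder
    .appName("cluster-manager-demo")
    .master("local[*]")   # swap for "yarn" on a YARN cluster,
                          # or "k8s://https://<api-server>:6443" on Kubernetes
    .getOrCreate()
)

# The same choice can be made at submit time (illustrative commands, shown as comments):
#   spark-submit --master yarn a.py
#   spark-submit --master k8s://https://<api-server>:6443 a.py
print(spark.sparkContext.master)
```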
Now we will go to the flow diagram. What is the driver program? Say I have written a small piece of code: whenever there is some data, divide each number by two; if the number is odd the remainder will be one, and if it is even the remainder will be zero. That piece of code I have written in a file, a.py.

Now I go and submit this job to my name node. Who takes care of it from there? If my code is inside a.py, the driver initiates the execution of the job. What does that mean? Inside your programming, if you remember, there are some functions, and at the end of the day we write the if __name__ == "__main__" main block; the main function is nothing but the entry point to the whole program. Similarly, the driver is nothing but the entry point to the Spark program, its "main function". So the first thing: whenever you submit a piece of code, a.py or b.py, to the Spark cluster, a driver is initiated. What does the driver do? It executes the program, and once it starts executing it launches a SparkContext (we have already discussed what a SparkContext is). The SparkContext then contacts the cluster manager, it communicates with the cluster manager, to find out in which workers the relevant data is present: where does my data reside, so that I can go ahead and process it? Maybe the answer is that some 300 GB of the data is present in worker node one and the remaining 40 GB is present in worker node two.

Now look at the layout of a worker. Each worker node is comprised of executors, and each executor is comprised of two things: tasks and a cache. Simple. And each worker may have multiple executors; it is not that a worker can have only one executor just because this diagram shows one. Say I have another worker node, worker node three; this worker node three can have multiple executors: executor one, executor two, executor three. So even though the diagram shows one worker with one executor, a worker can have multiple executors.

So what is a task? A task is the lowest granularity of work that gets executed. And what is the cache? The cache is the same concept as caching and persisting; even in your mobile phone the memory has a cache. The cache here is the same idea, and the task is the smallest chunk of work. Any doubt so far, Sankit and the others? And the cluster manager, YARN or whichever one, communicates with the executors to start the data processing.
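To tie together the flow described so far, here is a minimal sketch of what an a.py along these lines could look like: the __main__ block is the entry point, the SparkSession/SparkContext is created on the driver, and the odd/even remainder check runs on the executors. The function name and the input numbers are made up for illustration.

```python
from pyspark.sql import SparkSession

def remainders(sc, numbers):
    # Distribute the numbers and compute each one's remainder when divided by 2:
    # odd numbers give 1, even numbers give 0.
    return sc.parallelize(numbers).map(lambda n: (n, n % 2))

if __name__ == "__main__":                   # entry point, like the "main function"
    spark = SparkSession.builder.appName("odd-even").getOrCreate()
    sc = spark.sparkContext                  # the SparkContext lives on the driver

    result = remainders(sc, [1, 2, 3, 4, 5]).collect()   # action: results come back to the driver
    print(result)                            # [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1)]
    spark.stop()
```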

Now, one question: you said this a.py has code, but code for what exactly? I mean, the data is coming from somewhere, and this is just the code to process that data?

Yes, this is the code to process the data.

So when you are configuring, right, in the code you have to configure where the data is coming from and all that stuff?

No, that is taken care of by the cluster manager.

Let me give you a use case. I have one S3 bucket where I have a lot of data, say some millions of records. Now I want to process that data; it is unorganized, unprocessed, raw data. How do I use Spark, with this diagram? Can you explain it as simply as that?

You have maybe 100 GB of data, perfect. Let's take it as structured data; the same thing applies to unstructured data. What is that 100 GB of data? Say you have a file called, maybe, ABCD. This ABCD file contains information about products: there are two fields, one is product ID and another is product name. Now first tell me: the use case is to find the product names that start with Britannia, Britannia biscuits, Britannia chocolates, maybe Britannia sandwiches; there are a thousand products whose names start with Britannia. So the use case is: I have 100 GB of product data here; Sankit, go ahead, write code and give me the output saying these are the product IDs whose names start with Britannia. Is that your use case?

Yeah, perfect.

Okay, and that code is that a.py?

Yes. So what do you have to do? You have to write some SQL or Python code, correct? Say you write some Python code: you define a function, maybe call it df_britannia, something like that. Inside it, what are you doing? You are checking the product name: there is a function startswith, correct, and you check whether the product name starts with Britannia. This is not the exact code, but it will be something like that; you change the extension to .py and store the file. That is your a.py.
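Cleaned up into runnable form, the a.py being sketched on screen might look roughly like this. The "Britannia" prefix and the two columns come from the example; the file name, function-free structure and everything else are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.appName("britannia-products").getOrCreate()

    # Hypothetical product file with product_id and product_name columns.
    products = spark.read.csv("abcd_products.csv", header=True, inferSchema=True)

    # Keep only the products whose name starts with "Britannia".
    britannia = products.filter(F.col("product_name").startswith("Britannia"))

    britannia.select("product_id", "product_name").show()
    print("Matching products:", britannia.count())
    spark.stop()
```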
Now what is going to happen with this a.py? First tell me: whenever you have 100 GB of data, where does that 100 GB sit? This a.py is only going to process that 100 GB, so where is the data itself? In Databricks? In S3? Wherever is fastest for you to store the data. At the moment you store the data, Spark does not come into the picture yet. The moment you have 100 GB of data, you might have stored it in your S3 bucket: say on worker node one you stored 10 GB, on worker node two you stored 70 GB, and on worker node three you already stored the remaining 20 GB. This part is not Spark's job; it is the job of HDFS. Got it? Yeah.

So now all of this placement information is with your name node, because I have already recorded it there, and if that information is with the name node, it is available to the cluster manager. The cluster manager knows: 10 GB of data here, 70 GB here and the remaining 20 GB here. Got it?
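The point that storage is separate from Spark shows up directly in the code: the same read call works whether the data sits in HDFS, S3 or GCS, as long as the matching connector is available on the cluster. The bucket names and paths below are placeholders, not real locations.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-demo").getOrCreate()

# Same API, different storage systems (illustrative paths only):
df_hdfs = spark.read.csv("hdfs:///data/products/", header=True)
df_s3   = spark.read.csv("s3a://my-bucket/products/", header=True)
df_gcs  = spark.read.csv("gs://my-bucket/products/", header=True)
```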

The moment you write the code in a.py and submit it to the driver program, who provides this data-location information? The cluster manager, correct. Now what happens? The a.py goes to this worker, this worker and maybe this worker node to execute the code. Out of the 100 GB, maybe some 10 GB is product data. And don't think the whole product table sits inside one worker: one chunk of the product data is inside one worker, another chunk is inside another worker, and another chunk is inside the third. Whenever you look at your table you see it as a single table, am I correct, whether it is Oracle, MySQL or something like BigQuery; I am not sure what it is in AWS, I think Redshift, and in GCP we have BigQuery. Whatever it is, when you write SELECT * FROM product you see it as a single page, but at the back end it is not stored like that; it is split. This split goes to data node one, that split goes to data node two. So each copy of the a.py code goes to each worker and performs that set of operations on the chunk inside its own worker node. You got it?
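You can see this "one logical table, many physical splits" idea from inside Spark itself; the sketch below just inspects how many partitions a DataFrame was split into. The path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

products = spark.read.csv("hdfs:///data/products/", header=True)

# One logical table from the user's point of view...
products.show(5)

# ...but physically split into partitions, each processed by its own task.
print("Number of partitions:", products.rdd.getNumPartitions())
```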

So worker node one runs that same a.py code and searches for whatever product information is present inside worker node one. And now you can ask: that is the worker node, but then what is a task? Again, do you think this 100 GB sits in a single place? No, that is not practically possible. And coming down from 100 GB to the 10 GB on worker node one, do you think that 10 GB sits in a single place on worker node one? No, that is also not possible; there are small chunks. If you remember, there is the block concept, blocks of around 128 MB. So there will be chunks like that, and inside Spark the unit is not tied directly to the 128 MB block; there is a concept of partitions, which we will get into in maybe two more sessions. Each partition is driven by one task.

Say you are working as a project manager. Whenever you get a project, do you directly assign the whole project to one resource? No, you divide the work into stories, and the stories get split into Jira tasks, am I correct? One Jira ID is one task, two or three Jira IDs make up a story, and two or three stories roll up into a summary inside Jira. The same concept applies here: the task is the lowest granularity of executing work.

Once all of this runs, say out of the 200 MB here it reports that there are three products whose names start with Britannia; from here you get maybe 30 products, and from here maybe 756 products whose names start with Britannia. What happens next? All of this output comes back to your driver program: in worker one there are three products, in worker two there are 30 products and in worker three there are 756, so in total there are 789 products whose names start with Britannia. Who takes care of this summation of 3 + 30 + 756? One worker returns 3, another returns 30 and another returns 756, and the driver is the point of contact that aggregates all of it and returns 789 as the final output.

Do you understand the role of the driver now? The driver is just like a project manager: you assign one Jira to this person, one Jira to that person, one Jira to another person; the project manager collects all the deliverables and, as the delivery manager, submits them to the client. Do the individual people submit directly to the client? No, they report to the delivery manager, and the delivery manager goes and has the discussion with the client: this is my final output. Here the delivery manager is nothing but the driver program: we collect the individual outputs from each worker node, or rather from each executor, each executor-level task output gets added up and returned to the driver, and the driver says this is the final output of my code. Perfect.
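The 3 + 30 + 756 = 789 aggregation above is essentially what an action like count() does: each task counts the rows in its own partition on its executor, and the driver adds the partial counts together. A minimal sketch with made-up data follows; in this toy example the answer is 2 rather than 789.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("count-demo").getOrCreate()

products = spark.createDataFrame(
    [(1, "Britannia Biscuit"), (2, "Parle Biscuit"), (3, "Britannia Chocolate")],
    ["product_id", "product_name"],
)

britannia = products.filter(F.col("product_name").startswith("Britannia"))

# Each executor counts the matches in its own partitions (its tasks),
# then the driver sums the partial counts into the final answer.
print("Total Britannia products:", britannia.count())   # 2 in this toy example
```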
We have understood all of this in a theoretical manner; now let's see how these concepts actually come into play. Inside Spark there are two concepts: actions and transformations. What is an action? Say I give you something called collect, and separately I say edit. The moment I say collect, or print, it summarizes and returns output, agreed? If I say collect, it gives me some output. The moment I say edit, does it immediately reflect some output? No, it is an intermediate state. So collect is an operation that returns output to the driver. Collect, print, take, head: the moment you call such a function it gives you the output, and that is an action. A transformation is the intermediate output, or rather the intermediate stage: say you are editing something, deriving something, that gives you a transformation.

On top of this there is the concept of the DAG, the directed acyclic graph. Say this is my input, this is my intermediate output one, call it T1, this is T2, and this is my output. Output means you only get output when you call an action like take, print or collect. The intermediate T1 and T2 are transformations; T1 and T2 do not return any output to the user or the developer. Unless and until you trigger an action, these steps do not get executed. This mechanism is known as lazy evaluation: unless and until an action is triggered, none of the transformations actually run.

I think we have already discussed this much, and if you remember, we covered actions like take, and I think we discussed these transformations: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, union, groupByKey, sortByKey. These are transformations; transformations are basically used to derive the in-between RDDs. And the actions are things like reduce, collect, count, take, first, saveAsTextFile: the moment you write take it gives some output, first returns the first element, count counts the number of elements present in the dataset, and saveAsTextFile writes the output. These are actions.
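A small sketch of the lazy-evaluation behaviour just described: the transformations only build up the DAG, and nothing executes until an action is called. The numbers are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))

# Transformations: these only record steps in the DAG, nothing runs yet.
evens   = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# Actions: only now does Spark execute the DAG and return results to the driver.
print(squares.collect())   # [4, 16, 36, 64, 100]
print(squares.count())     # 5
print(squares.take(2))     # [4, 16]
print(squares.first())     # 4
```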
So what I would encourage, Sankit and the others: please go through our previous recording on these things. Tomorrow I will start with actions and transformations and the concept of RDD in more depth; I think it is already recorded for you, so just have a look, and in tomorrow's session we will start from there. Perfect. Which sessions do you want me to go through? Actions and transformations in Spark, and the concept of RDD. Okay, let me note it down: there are three things, so just go through those. I will have to locate that recording, or just ask the coordination team to share it: actions, transformations and the concept of RDD. These three things, and then we will go one by one. Okay. Amit, are you there? Sorry, no, I was just asking whether he is there. Okay, no worries. Okay, thank you; so tomorrow, same time, 9 to 10. Thanks, thank you, thank you.
