Exploring the tourism dataset using the power of data | EDA project tutorial 3

Hey there, welcome back to another video. This is Roshan Cyriac Mathew, and in this video we are going to perform exploratory data analysis on the Airbnb dataset that is available on Kaggle. This dataset belongs to the tourism domain, and if you are someone looking to gain experience in performing EDA related to the tourism domain, or EDA in general, then this video is for you. If you want to know more about the different visualization techniques used in this video, I have done separate videos on this topic and I'll share the links in the description down below. Make sure to check them out before moving forward. If you're new to this channel and want to see more videos like this, don't forget to smash that subscribe button and turn on bell notifications to stay updated each time I upload a new video. Stay tuned.

So first, let's import the required libraries. We're importing pandas and NumPy for basic data operations, then we are importing matplotlib and seaborn for creating visualizations. Then we are importing style to set a style for the plots, and here we are setting the style as ggplot.

Now let's load the dataset. For this dataset we'll have to load it in a special way: we are setting low_memory to False to import large datasets that have mixed types of data, and this ensures that all the data types are accurately loaded into the data frame. Now let's view the data using the head function. Now let's find the size of the dataset that we're dealing with. So this gives us the size of the dataset. To get more information about the dataset, we can use the info method. This shows the entries, the data types, and the number of rows and columns in our data. To get statistical information about the dataset, we can use the describe method. This gives us the count, the min, the max, the mean, and the standard deviation of the numerical entries in the data. Now, to get the count of null values explicitly, we can use the isnull function. So this gives
us a count of the different null values in the dataset. Now, if you want to get the count of all the unique entries in each column, we can use the nunique method. nunique basically gives you the number of unique entries for each column.

Now let's remove the columns that are not very useful for our analysis. Before proceeding to process or analyze any of the data, let's check for duplicates in the dataset and remove them. Let's see the size of the dataset once more. As we can see, the size of the dataset has reduced. We can use the info method once more to get more detailed information about the dataset. As we can see, we have reduced the data from 26 columns to 20 columns, and we have also reduced the number of rows in the data.

Next, let's calculate the percentage of missing values in our dataset. Let's create a data frame to put the results in. When we look at these values, we can see that the majority of the attributes have less than one percent null values, and only two attributes have more than 15 percent null values in them. Now, instead of getting the null values like this, you can also use a visualization technique to see the null values in the dataset. Here on the right axis you can see the count of the total number of entries, and on the individual columns you can see the count of entries per column.

Next, let's print the column names. Now let's replace the spaces in the column names with a special character. Let's see the modified column names. As you can see, we have removed all the spaces and replaced them with underscores.

Now let's get a count of the different categorical and numerical entries in the dataset. So that gives us the count of the categorical and numerical entries in the dataset. You can also use the head function to see the categorical and numerical entries separately. Let's start with the categorical columns. This gives us an idea about the categorical data. We can see that the price and the service fee are being shown with the categorical data.
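As a rough sketch of the steps so far (removing duplicates, computing the percentage of missing values, renaming columns, and splitting them by type), something like the following would work. The toy data frame and its column names are illustrative assumptions, not the actual Kaggle file, which would be loaded with `pd.read_csv(..., low_memory=False)`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Airbnb table (assumed columns, not the real file).
df = pd.DataFrame({
    "host id": [1, 2, 2, 3, 4],
    "room type": ["Private room", "Entire home/apt", "Entire home/apt",
                  None, "Private room"],
    "price": ["$100", "$250", "$250", "$80", None],
    "minimum nights": [1.0, 3.0, 3.0, np.nan, 2.0],
})

# Drop exact duplicate rows, as done before any further processing.
df = df.drop_duplicates().reset_index(drop=True)

# Percentage of missing values per column, collected in a small data frame.
missing = pd.DataFrame({
    "null_count": df.isnull().sum(),
    "null_percent": (df.isnull().mean() * 100).round(2),
})

# Replace the spaces in the column names with underscores.
df.columns = df.columns.str.replace(" ", "_")

# Split the columns into numerical and everything else (treated as categorical).
numerical_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(missing)
print("categorical:", categorical_cols)
print("numerical:", numerical_cols)
```

Note that `price` ends up in the categorical list here because its values are strings with a dollar sign, which mirrors what shows up in the video.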
I'll explain this to you in a bit. Now let's visualize the different numerical data.

Now let's start analyzing the different categorical entries before moving on to the numerical entries. Let's first get a count of all the null values in the categorical data. Now let's start analyzing these columns individually.

Let's start with the host identity column. Let's plot a count plot to see the distribution of the data. As you can see, this gives us a nearly equal proportion. You can also plot this as a pie chart to get the percentage values. So that gives us a percentage distribution of the different values. Now, if you just want the values of the different entries, you can use the value_counts method. That gives us the counts of the different entries. Let's now replace the null values using the most common entry, which in this case is unconfirmed. Let's now check if the null values have been replaced. As we can see, the null values have been replaced.

Let's now proceed to analyze the next column, neighborhood group. Neighborhood group has seven unique values and 29 null values. Let's first visualize this data. Now let's get the value counts. If we look at the data, we can see that the most commonly occurring entries are Manhattan and Brooklyn, and if you take a closer look, we can see that the same labels have been entered in different ways. So let's clean this up. That reduces the unique values in this column to 5. Let's check it out. Now let's replace the null values using the most frequently occurring value, Manhattan. That should take care of the null values.

Now let's look at the next column, neighborhood. There are many unique entries for this column, so let's take the top 15 values in the neighborhood column and visualize them. To do this, let's first get the count of the different entries and assign it to a different variable. Then let's take the top 15 entries. Now let's plot a bar plot to see this. So that gives you the top 15 entries in the data.

Let's now look at the next column, country. As you can see, this column has only one unique entry. Now let's see the
value counts. So this field has only one unique value and over 500 null values in it, so let's replace these null values. Now let's look at the next column, country code. This is similar to the country column, so let's apply the same steps as before.

Now let's look at the next column, instant bookable. Let's visualize this data using a pie chart. This gives the counts of the different entries in the data along with the count of null values. Now let's replace the null values with False. But before that: here, setting df temp equal to 1 is used to add a new column to the data frame df, called temp, with a value of 1 for each row. This is done to create a uniform count for each value in the instant bookable column, which will be used to create the pie chart.

Now let's analyze the next column, cancellation policy. We know that there are three unique values for this data, so now let's plot a count plot. Let's get the counts for the different entries. Since the majority of the values belong to the moderate category, let's replace the null values with moderate.

Now let's analyze the next column, room type. Let's first visualize the data using a count plot. The most common room types here are entire homes or apartments and private rooms, so let's replace the null values in the dataset using entire home/apartment. Let's also get the value counts.

The next columns that we need to analyze are the price column and the service fee column. Now, we know that these are numerical types, but they are being considered as string types because of the dollar sign in them. So first let's remove the dollar sign and convert them to integer type. Let's define a function to do this for us. Now let's replace the dollar sign for both the columns. Let's print one of the columns to see if the change has been reflected. As we can see, the dollar sign has been removed and the data type of the values has also changed. Let's now visualize the relationship between the price and the service fee. To do this, we can use a scatter plot.
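The two recurring cleaning steps for the categorical columns, filling nulls with the most common entry and stripping the dollar sign from the price columns, might look roughly like this. The helper names `fill_with_mode` and `remove_dollar_sign` are hypothetical, the sample rows are made up, and the sketch converts to float rather than integer so that missing prices can stay as NaN.

```python
import pandas as pd

# Made-up sample rows; column names follow the underscore renaming step.
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", None, "Entire home/apt"],
    "price": ["$966", "$1,200", "$142", None],
    "service_fee": ["$193", "$240", "$28", None],
})


def fill_with_mode(frame, column):
    """Replace nulls in `column` with its most frequent value and return it."""
    most_common = frame[column].mode()[0]
    frame[column] = frame[column].fillna(most_common)
    return most_common


def remove_dollar_sign(value):
    """Turn a string like '$1,200' into 1200.0; nulls become NaN."""
    if pd.isna(value):
        return float("nan")
    return float(str(value).replace("$", "").replace(",", "").strip())


fill_with_mode(df, "room_type")

# Apply the same conversion to both money columns.
for col in ["price", "service_fee"]:
    df[col] = df[col].apply(remove_dollar_sign).astype(float)

print(df.dtypes)
# Pearson correlation as a numeric companion to the scatter plot.
print(df["price"].corr(df["service_fee"]))
```

Once the columns are numeric, a scatter plot of price against service fee can be drawn directly from the data frame.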
From this, we can see a linear relationship between the price and the service fee.

Now let's look at the last review column. We can see that this data is also stored as text. Let's extract the year from this data and keep only that. Let's define a function to do that. As we can see in this data, the year has been extracted. Now let's visualize this data using a count plot. As we can see, the majority of the reviews are in the year 2019. Now, for this data, let's replace the missing data with the median value. Let's first find the median value, and now let's replace the null values. Let's check if the null values have been replaced.

Now that we have analyzed all the columns in the categorical data, let's check for null values once more. We can see that the majority of the null values in the categorical data have been replaced.

Now let's analyze the numerical data. But before that, let's get the column names from the dataset. Let's start with the construction year and visualize this data. From this, we can see that the majority of the constructions happened between the years 2014 and 2015.
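Going back to the last review column, the year extraction and median fill could be sketched as below. This assumes the dates are stored as month/day/year strings such as 10/19/2021, which is an assumption about the file rather than something stated in the video; `extract_year` and `review_year` are hypothetical names.

```python
import pandas as pd

# Made-up sample values in an assumed month/day/year text format.
df = pd.DataFrame({
    "last_review": ["10/19/2021", "5/21/2019", None, "7/5/2019", "11/19/2018"],
})


def extract_year(series):
    """Parse text dates and return only the year (NaN where parsing fails)."""
    dates = pd.to_datetime(series, format="%m/%d/%Y", errors="coerce")
    return dates.dt.year


df["review_year"] = extract_year(df["last_review"])

# Fill the missing years with the median, then store the years as integers.
median_year = df["review_year"].median()
df["review_year"] = df["review_year"].fillna(median_year).astype(int)

print(median_year)
print(df["review_year"].value_counts())
```

The `value_counts` output is the same information the count plot shows, just as numbers.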
So let's now check for null values. Let's now replace the null values using the mode. Similarly, let's plot a histogram for the availability of the rooms. We can see that the majority of the rooms are available for zero to four days. Similarly, I am going to fill all the remaining null values using the mode, except for the latitude and the longitude. Now let's see if the null values have been replaced in the numerical entries. Similarly, let's check the null values for the entire data. We can see that we have replaced the majority of the null values in the dataset. You can also view the data using the head function.

Now that we have reduced the null values in the dataset, let's drop the rest of the entries with null values, and then let's check for null values once again. We can see that all the null values have been removed.

Now let's check for correlation between the different columns. Let's drop the temp column. Now let's visualize the correlation using seaborn's heatmap. We can see that the price is highly correlated with the service fee, followed by the number of reviews and the reviews per month.

That brings us to the end of this video. Make sure to follow me on my social media handles to stay updated for more interesting content. Hope you got an idea of performing data analysis on the Airbnb dataset. Please do leave a like and subscribe to this channel if you found this video useful. Thank you for watching, and see you in the next video.
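For reference, the final numerical steps (mode fill for every numerical column except latitude and longitude, dropping the remaining null rows, dropping the temp column, and computing the correlation) could be sketched like this. The column names and values are illustrative assumptions, not taken from the actual data.

```python
import numpy as np
import pandas as pd

# Made-up numerical sample; `temp` mimics the helper column used for the pie chart.
df = pd.DataFrame({
    "lat": [40.64, np.nan, 40.80, 40.68, 40.79],
    "long": [-73.97, -73.98, -73.94, np.nan, -73.95],
    "price": [100.0, 200.0, 300.0, 400.0, 500.0],
    "service_fee": [20.0, 40.0, 60.0, 80.0, 100.0],
    "availability_365": [0.0, 0.0, np.nan, 120.0, 365.0],
    "temp": [1, 1, 1, 1, 1],
})

# Fill nulls with the mode, leaving latitude and longitude untouched.
skip = {"lat", "long"}
for col in df.select_dtypes(include="number").columns:
    if col not in skip:
        df[col] = df[col].fillna(df[col].mode()[0])

# Drop whatever rows still contain nulls, then the helper column.
df = df.dropna().drop(columns=["temp"])

# Pairwise Pearson correlation between the remaining numerical columns.
corr = df.corr()
print(corr.round(2))
```

The `corr` table is what would be passed to seaborn's heatmap, for example `sns.heatmap(corr, annot=True)`.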
2023-02-14 09:06