ESSnet Win Webinar Methods of Processing and Analysis of Web Scraped Tourism Data


Yes, there you go. Okay, thank you very much, Sarah. Hello everyone, my name is Lukasz Zadorozny, and together with my friend Piotr Szlachta we have the pleasure of conducting today's meeting. In the first part, which I will be presenting, I am going to tell you about the use of big data in tourism statistics. In the second, Piotr is going to tell you about techniques and methods of data acquisition. I have divided my presentation into two parts. The first contains some information on the use of external data sources on the demand side of tourism. This is the part that deals with the study of trips by tourists and one-day visitors and their expenditure on products and services related to tourism, such as accommodation, transport and meals. The second part will discuss the use of non-statistical data on the supply side of tourism, that is, the side related to providing accommodation for tourists, such as hotels, bed and breakfasts, apartments and many others.

Okay, what is the best way to start? Of course, from identifying information needs as well as the non-statistical data sources that can be used in a given country. I know that what I tell you may seem obvious, but each country has its own specifics. In one you can conduct web scraping; in another this method may be illegal. An interesting example is South Korea. A few days ago I spoke with my colleagues from this country about big data sources, and they told me: yes, we use many big data sources, for example card transaction data, data from mobile network operators and GPS data. Then I asked them about web scraping data and they told me, no, no, Lukasz, it is illegal. They do not use web scraping, and they use data from the web portals. Therefore, at the stage of identifying national data sources, we must formally confirm the legality of their acquisition and use. At this point I especially recommend the document describing good practices in web scraping. This document was developed by our friends from work package C in the ESSnet Big Data II project. I think this is an important document that raises quite sensitive issues related to web scraping and suggests how to use these methods in a way that is non-invasive for the portals. I have put a link to this document on this slide.

Once we have identified the data and data sources and created a catalogue, we need to sort them by type and check their frequency; this determines whether we will have to use data disaggregation or aggregation methods. In the last stage we divide the sources into those useful for the demand side of tourism and those useful for the supply side of tourism. This is the way we do it in our country.

Okay. Among the internet data sources, portals related to rental accommodation can be used to estimate tourism data. These can be international portals as well as small local sites operating in only one country. The information contained there may be helpful, for example, for estimating expenses related to accommodation, and it also provides information on accommodation facilities such as their type, address and number of rooms or bed places. Portals such as Trip Advisor can be an interesting source, useful for estimating tourist expenses, for example on food. There is also data from portals such as SeatGuru, Skyscanner or Expedia, from which it is possible to obtain air traffic data that helps us estimate the number of trips to each country; of course, this is connected with the expenditure of tourists. Other interesting sources are portals that give us information about the prices of goods and services in each country.

And now I will talk about the demand side of tourism. There is no denying that most of the data sources mentioned concern the demand side of tourism more than the supply side, at least in Poland. We use data scraped from internet portals to estimate tourist trips and expenditure. Some of the data from internet portals we use directly in the trip survey, for the imputation of missing records, while other data are used to validate the obtained results, for example the expenditure on accommodation in a given type of accommodation facility. On the slide you can see the model we use in Poland and the individual components of this model for estimating trips and expenses. Of course, in each country the sample survey of households and a survey of foreign trips are the main source of data. But what if surveys are not possible? We had a situation like that. Maybe I am going off topic a bit now, but in 2019 we received access to data from an automatic number plate recognition system.

It took my colleagues some time to get to know this data, because there was a lot of it and we had some trouble analysing it. However, we did not expect that we would be able to use it in statistical production so soon. Why? Because we had some problems with our surveys during the COVID pandemic. In 2020, due to the pandemic and the restrictions on the movement of people, we had to give up the research consisting of counting vehicles at borders and surveying people in their vicinity. So how did we keep the studies going? We used this data, of course. Please treat it as a curiosity, but also as a good example of how useful non-statistical data sources can be.

Okay, I have a Slido, and I have one question for you: do you use web scraping in your organization? Do you see the Slido or not? No, we just see your screen. Do you have a link to the Slido here? No, it's not showing, let me just go on to it, hold on a second. Okay, I have the first results. Just about there, if you want to start sharing your screen again. Okay. We can't see your screen at the moment, Lukasz, we can just see you. Okay, stop. Now the Slido doesn't work. I have some problems with Slido. I'll do your Slido screens for you when you're ready, but can you pop your presentation back up? Okay.

All right, I can tell you that 55% of people answered yes, 19% answered no, 25% answered "no, but we are planning to", and 2% answered "I don't know". Okay, coming back to the topic. The data from portals related to air traffic can be used in the trip survey. We use this data to estimate some information about trips, because...

Lukasz, we can't see your slides at the moment. Wait a minute. Okay.

Okay. We scraped three types of portals. The first provides information on the flight number, number of stops, type of aircraft, flight prices and flight duration. The flight number is the main, most interesting piece of information about the flight. Another has information about the number of seats in the plane. Okay, and it looks like this. The first is the booking platform. As you can see, it shows the time of our flight from Warsaw, the price, the flight number and the type of airplane. In the second one, Skyscanner, we have information on whether it is a direct flight or one with stops, for example in other cities, and in SeatGuru we have information about the number of seats in the airplane.
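The way flight records and seat information could be combined can be sketched as below. This is only an illustration, not the model used in Poland: the field names, the flight numbers and the seat counts are all made up, and summing seats per destination is an assumed simplification of how scheduled capacity might feed a trip estimate.

```python
from collections import defaultdict

# Hypothetical scraped flight records (all values invented for illustration).
flights = [
    {"flight_number": "XX101", "destination": "GRU", "aircraft": "TYPE-A", "stops": 1},
    {"flight_number": "XX102", "destination": "GRU", "aircraft": "TYPE-B", "stops": 0},
    {"flight_number": "XX201", "destination": "EZE", "aircraft": "TYPE-A", "stops": 1},
    {"flight_number": "XX301", "destination": "LIM", "aircraft": "TYPE-C", "stops": 2},
]

# Hypothetical seat table keyed by aircraft type (invented numbers).
seats_by_aircraft = {"TYPE-A": 250, "TYPE-B": 180}


def seat_capacity_by_destination(flights, seats_by_aircraft):
    """Sum the seats of all scraped flights per destination.

    Flights whose aircraft type is missing from the seat table are skipped,
    so gaps in one source do not silently become zeros.
    """
    totals = defaultdict(int)
    for flight in flights:
        seats = seats_by_aircraft.get(flight["aircraft"])
        if seats is None:
            continue  # no seat information for this aircraft type
        totals[flight["destination"]] += seats
    return dict(totals)


capacity = seat_capacity_by_destination(flights, seats_by_aircraft)
```

In this sketch the destination with an unknown aircraft type simply does not appear in the result, which makes missing seat data visible instead of hiding it.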

Based on our experience, we created a set of mandatory and optional variables for these portals. Of course, their selection depends on what the data is to be used for: whether to validate the results obtained or to estimate trips. If the aim is to validate results, it is enough to web scrape only the mandatory variables, but if we intend to estimate trips, we should also download the optional variables. Based on our model we obtained trip and expenditure estimates for 20 new countries in comparison with the sample survey. It is worth noting that, for example, in 2019 the total expenditure of Poles travelling to South America increased after modelling by almost 18%. So why do we estimate trips to South America this way? Because it is not a popular enough destination for Poles for us to obtain enough information from the household survey.

And what about the supply side of tourism? Every day I prepare data on the supply side of tourism. Of course, in each country belonging to the European Union a survey of tourist accommodation establishments is conducted. We survey objects according to their type and location, as well as objects that are not typical tourist facilities but are used by tourists. We have three main groups of accommodation: 55.1 Hotels and similar accommodation, 55.2 Holiday and other short-stay accommodation, and 55.3 Camping grounds, recreational vehicle parks and trailer parks. Of course, the survey of tourist accommodation facilities may differ slightly from country to country. In some countries there is a threshold for the number of bed places; in others there is no threshold and all facilities are surveyed in one study. In Poland we have a threshold: we have a survey on the occupancy of tourist accommodation establishments with 10 or more bed places, and the second is a survey of tourist accommodation establishments with nine or fewer bed places. The first is a census survey and the second a sample survey; the first is a monthly survey, and the second, for the small establishments, we conduct once a year. We collect typical data for the accommodation establishment survey, such as the type of facility, the nominal number of bed places, the number of Polish and foreign tourists, the number of overnight stays provided to Polish and foreign tourists, and the number of days of operation per month.

As of now, many countries that conduct research on tourist accommodation receive or build their frames for this research from administrative data sources. In Poland it looks a bit different. Of course, we receive information on accommodation facilities from the Ministry of Sport and Tourism, but we often have to look for new facilities ourselves and add them to our frame. Why? Because the list from our ministry is not always updated on an ongoing basis. This is especially true for objects belonging to the 55.2 group. Of course, sometimes we also find establishments belonging to the 55.1 group. For many years the second method of completing the survey frame was the manual method: we searched internet platforms and information portals. Unfortunately, this method was very time-consuming and gave us poor results. From 2020, we replaced this method with searching for properties using web scraping of booking portals. Thanks to this method we get a ready set of accommodation facilities. Of course, connecting this data with our frame is not an easy thing to do.

We tested several connection methods, and it was only through trial and error that we managed to come up with satisfactory results. Piotr will tell you more about it. Of course, we scrape not only big portals like Airbnb and Trip Advisor, but also small local and regional portals in Poland, because small portals have many small objects that do not advertise on the big portals. Okay, what kind of information do we need from booking portals to complete our survey frame for the accommodation establishment survey? You can see a booking portal and the information that would be useful for us. We should definitely save the name of the object. At this point we also often receive information about the type of object; in this case we have a Leonardo hotel, a hotel with four stars. But beware, the name will not always correspond with the type. Sometimes with hotels it is okay, but sometimes we see, for example, Hotel Jose Lucas, and it is not a hotel, it is an apartment, because someone called it a hotel on the Booking platform.

A very important variable is the address, and we have it on the booking platform. This address is sometimes not the same as the real one; you will see this later. There is information about the price per night for one person, two persons or for children. Of course, we have information about the price, number of reviews and rating of the object. But we do not have information about the number of beds. In our case this is very important information, because we have a threshold in our survey, so this is a problem for us. Another website is Airbnb. Please note that on this portal we have information about the number of beds and the number of rooms; we do not have this information on the booking platform. We also have the name of the object, the price, the number of reviews and, of course, information about guests. Sometimes Airbnb does not have the exact address, which makes it very difficult to combine data from this portal with data from our survey frame. Sometimes Airbnb also does not have the exact name of the facility. The third portal is the one we started our web scraping with. This portal contains everything we needed: the name of the establishment, number of stars, address, number of reviews, price per night and, what is more interesting, information about rooms, so we can use this information in our survey.

Of course, this is not the only information that can be downloaded from these portals; we can download much other information. I have listed the most important ones. Each portal has its own characteristics and different types of variables. For the supply side of tourism the most valuable information is the name of the facility and the address. Information about the number of rooms in the establishment and the number of beds is also very important; unfortunately not all portals provide such data, and that is why I include these variables in the mandatory variables. The booking portals also differ in their offers of accommodation facilities. Unfortunately the hotels portal, which has information about bed places, has a small number of establishments. Airbnb has more, but there we have a problem with the name and address of the establishment. Based on the collected data, we have supplemented the sample survey with new objects since 2020.

Yes, this bar chart shows, by region, the new accommodation establishments that we added to our survey frame thanks to web scraping. We have approximately 15,000 accommodation establishments per month collected from hotels and booking, and, for example, in 2020 we added to our survey frame 239 new accommodation establishments with 10 or more bed places. Okay, that is all from me. Piotr will talk to you about the techniques and methods of web scraping and connecting data with our survey frame. Thank you very much.

Lukasz, just before you go, we've got a few questions on Slido for you. I'm happy to read them out for you if you like? Okay. This first question is about Selenium and the libraries. I can answer that question already, even though it is about my session. So, when it comes to Selenium and BeautifulSoup, we also use them; that will be in my presentation. It is hard to choose the best; it is about what is best for you. When it comes to optimized code, it is just trial and error for us. We never go under about two seconds, and in addition to that we randomize our time intervals, so it is harder for the server to block us. That is all I can say about this right now.

Thank you. Flight prices change frequently and can potentially change multiple times a day depending on the number of purchases; how did you account for this? Of course, we conduct web scraping at a daily frequency, but we aggregate the scraped data, for example to monthly data, and thanks to this we can integrate the data from web scraping with our data from the household survey, which is quarterly. I can also add something: we are doing two scrapings for the same day. We do one scraping a couple of days before the flight actually happens and one a month before, so we have, as you mentioned, a couple of prices. But right now we are not using the prices in any of our models, except for estimating the number of passengers on any given flight. So I hope this answers your question too.

Any good data sources to estimate the number of visitors for an area within the UK? I don't know. Well, we can mention our last meeting with Korea, which you mentioned already: they are using data from card transactions and from mobile network operators. Those are the obvious sources, hard to get, but I think that in the UK you also have smart cities, where you can maybe get data from foot crossings in cities or from parking. Maybe car traffic, some sensors like ANPR systems, yes. It won't tell you exactly about the visitors; I mean, you get visitors, but you don't know if they are coming for work or for tourist purposes. But it helps a bit, I think.

Will any good sources for web scraping tourist data be shared, please? We use only Airbnb, and it is our main source of data for our surveys. Several organizations do scraping; do you also compare data from this kind of organization with your own scraping? To answer that: no, we have not compared that data so far. And I also see that Pierre has a second question. I think there are no statistics using web scraping or big data that are not experimental right now. Even in Korea, where they started this big data system in 2015, it is still experimental data, so I don't think it will change soon; we will stay experimental for a long time. All right. Yes, the methodology for collecting data is public. Did you compare the accommodation prices from websites with the prices from the field and, if so, is there any difference between the two prices? Yes, we compare this data. Sometimes we have differences and we talk with our staff, and sometimes we change the data because there are some mistakes in our surveys.

Why not use the API for flight data from Skyscanner? I'm not sure, but I think we tried it and it had some big limitations or was only a trial, so I can't really answer the question of why we are not using it now. I know we tried, but something was off with it. Of course, we scrape data from many portals. Sometimes we use this data in our surveys, and sometimes we only create experimental data; this is regarding the methodology for collecting our data. Yes, we can send the presentation.

When it comes to this first question, is there any tool for NLP analysis to get the sentiment analysis of comments or tweets? We have actually just started a project where we want to do this; get back to us in one year and we will have an answer for you. Yes, we have some contact with South Korea. The Koreans scraped tweets and scraped Facebook. We have started web scraping in my organization and we have data sets with prices and locations; I would like to ask, how can we get information about the demand side? You can get information about the demand side by connecting data from many sources using mathematical methods. I am not a mathematician, but my friends from the mathematical department use methods to connect data from flight portals, and they help us in estimating trips to, for example, South America. In web scraping we must combine input from the department that conducts tourism surveys, from the department of mathematics and from IT support. Piotr is our IT support, I have knowledge about tourism and statistics, and our other friends are mathematicians who help us aggregate and disaggregate data and compare this information with many sources.

An API, if it doesn't have many limitations and if it has good documentation on how to use it, is a really good solution. But often it is only something on the site that doesn't really give you what you want, and the data is really limited. But yes, if there is a good API, it is a better solution than web scraping, for sure. Did you categorize accommodation by categories, especially from Airbnb? How do you do this? Yes, we tried this in ESSnet Big Data II; we used some information from our friends from the mathematics centre. Piotr, what kind of method did they use? Well, actually I can answer that question, so I will take over from here if I can. For Airbnb, no, actually we didn't, but we did it for booking, because the categories on booking are not actually the categories that we are interested in. So what we used was a model built after we linked the data: once we were sure of the links, we used those linked records to build models using machine learning, I think a decision tree, to categorize the rest of the objects. I'm not sure you can use it for Airbnb, and of course it might not be a decision tree, but I hope this answers your question.

We started to scrape Airbnb a few months ago, yes, so we are preparing this portal for another project of ours; we will see what we get. We don't like data from Airbnb, because the object name can be anything, and you can't get the real address unless you book the place, so it is really hard to do record linkage on these records. So all we do is statistical matching right now. So for us it's not really a useful source right now. We still try, but... Yes, we'll see in the future.

You can get different prices from different websites for the same hotel; which one can you trust? Yes, this is a problem, but we aggregate this data, so we have one price from two web portals. For now we prefer the booking portal, because booking has many more accommodation establishments than hotels; so we prefer data from booking, and sometimes we have one price and sometimes we aggregate the data and have one price from two portals.
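Collapsing the prices of one establishment from two portals into a single value could look like the sketch below. The speakers do not say which aggregation they use, so the arithmetic mean here is an assumption, as are the facility identifiers, the portal labels and the prices; the sketch also assumes the records have already been linked to a common facility identifier.

```python
from statistics import mean


def combine_prices(offers):
    """Return one price per facility from (facility_id, portal, price) tuples.

    A facility seen on a single portal keeps that price; a facility seen on
    several portals gets the mean of its prices (an assumed choice).
    """
    prices_by_facility = {}
    for facility_id, _portal, price in offers:
        prices_by_facility.setdefault(facility_id, []).append(price)
    return {
        facility_id: round(mean(prices), 2)
        for facility_id, prices in prices_by_facility.items()
    }


# Hypothetical linked offers: facility "h1" appears on two portals.
offers = [
    ("h1", "portal_a", 100.0),
    ("h1", "portal_b", 120.0),
    ("h2", "portal_a", 80.0),
]
combined = combine_prices(offers)
```

A median or a portal-preference rule would fit the same function shape if the mean turns out to be too sensitive to one outlying offer.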

And my colleagues from the demand side use data about prices to validate their data, and there are no big differences between our data and the data from web scraping. Okay, Piotr, shall we start the next part of the presentation then? Yeah, I think it's a good idea. So, I will try to share my screen right now; I hope it will go smoother than for Lukasz. Okay. So, I hope you can see my presentation. Yes, we can. Let's start, let's dive in right now.

So, I would like to break down my presentation into four topics. The first will be a recap of the basics of portal structure, which will help us understand the extraction of data from portals when web scraping; that part is for people who do not web scrape and are interested in how it is done. Then we will discuss data cleaning, data transformation and modelling processes a bit, and finally I will show how we combined our web scraped data and linked it with our registers. So, let's start with the basics. Each of us uses a web browser on a daily basis: we just type in the address of a page or a keyword and after a few seconds we view the loaded content. From a technical perspective, the process is split into two steps. First, we enter the address or phrase and it is sent as an HTTP request to the server. Then, after a while, the response is sent to our computer and we can view the page. But what actually happens is that several types of files are loaded. We can split them into four categories: the HTML, which holds the page content; the CSS files that define the visual appearance of the portal; the images; and of course the JavaScript that adds interactivity to the website. From the web scraping point of view, we are most interested in the first type of file, the HTML. It holds content that should have a defined structure, and often does, such as language definitions, metadata and information for other computers and programs, and the actual content of the page.
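The four kinds of files described above can be seen in a small sketch that reads an HTML document and lists the CSS, JavaScript and image files it refers to. It uses only Python's standard `html.parser` module; the page content and file names are invented for illustration, and no network request is made.

```python
from html.parser import HTMLParser

# A made-up page: the HTML holds the content and points to the other files.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Example hotel</title>
  <link rel="stylesheet" href="style.css">
  <script src="app.js"></script>
</head>
<body>
  <h1>Example hotel</h1>
  <img src="photo.jpg" alt="Front view">
</body>
</html>"""


class ResourceLister(HTMLParser):
    """Collect the page language and the files referenced by the HTML."""

    def __init__(self):
        super().__init__()
        self.language = None
        self.resources = {"css": [], "javascript": [], "images": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.language = attrs.get("lang")  # language definition
        elif tag == "link" and attrs.get("rel") == "stylesheet":
            self.resources["css"].append(attrs.get("href"))
        elif tag == "script" and "src" in attrs:
            self.resources["javascript"].append(attrs["src"])
        elif tag == "img" and "src" in attrs:
            self.resources["images"].append(attrs["src"])


def list_resources(html):
    """Parse an HTML string and return (language, referenced resources)."""
    lister = ResourceLister()
    lister.feed(html)
    lister.close()
    return lister.language, lister.resources


language, resources = list_resources(PAGE)
```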

So, all the elements mentioned before are defined in so-called HTML tags, which are like keywords that enclose the defined information. In addition, we can distinguish between HTML elements, which hold the content, and HTML attributes, which describe the characteristics of an HTML element. For example, take the a tag. This tag is used mostly for links, and we use the href attribute with it, where we put the actual address we want it to redirect us to.
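The difference between an element's content and its attributes can be illustrated with a minimal sketch that pulls the `href` attribute and the link text out of every `a` tag. It uses the standard `html.parser` module; the example markup and URLs are invented.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")  # the attribute
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)  # the element content

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


sample = '<p>See <a href="https://example.org/offers">all offers</a> and <a href="/contact">contact</a>.</p>'
links = extract_links(sample)
```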

The last thing from the basics worth mentioning about HTML is the special types of attributes called classes and identifiers. Both types are optional and do not occur on every page, but if they do, it is worth keeping in mind a few rules. Namely, one identifier can be used only once per page, and each element can have only one identifier. So, in terms of web scraping, when we want to scrape a single piece of information, we can reach it through this identifier. When it comes to classes, it is totally different. An element can have many classes, and a single class can be used multiple times on a page. From a developer's perspective, classes are mainly used to provide information about the visualization. So, where is the element? How does it look? What form does it have? What size? From the web scraping perspective, this allows us to quickly retrieve recurring elements from a portal, for example the list of offers on booking portals. So that is all the information we actually need to prepare web scraping. But I want to mention one thing before we go to web scraping. I want to talk about web crawling, which is mostly not used in tourism, though we use it to some degree.

Web crawling is used to collect information from pages with unknown structure. All the information is then indexed and stored in a database. Then the process is repeated for the links available on the portal. This is mostly how search engines like Google or Bing work. In statistics it is mainly used to collect information about companies. In the case of the portals mentioned by Lukasz, we know the structure, so we can use web scraping.
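When the structure of a page is known, the identifier and class rules described earlier translate directly into extraction logic: one lookup by identifier for a single value, one lookup by class for a repeated list. The sketch below uses the standard `html.parser` module and assumes, for simplicity, that the target elements contain plain text only; the markup, the `offer` class and the `result-count` identifier are hypothetical.

```python
from html.parser import HTMLParser


class ClassAndIdScraper(HTMLParser):
    """Collect text of the element with a given id and of all elements with a given class."""

    def __init__(self, target_class, target_id):
        super().__init__()
        self.target_class = target_class
        self.target_id = target_id
        self.class_texts = []   # a class can occur many times per page
        self.id_text = None     # an identifier occurs once per page
        self._capture = None    # "id", "class" or None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()  # an element can have many classes
        if attrs.get("id") == self.target_id:
            self._capture, self._buffer = "id", []
        elif self.target_class in classes:
            self._capture, self._buffer = "class", []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture is None:
            return
        text = "".join(self._buffer).strip()
        if self._capture == "id":
            self.id_text = text
        else:
            self.class_texts.append(text)
        self._capture = None


# Hypothetical listing page with a known structure.
LISTING = """
<div id="result-count">2 offers</div>
<ul>
  <li class="offer featured">Hotel A</li>
  <li class="offer">Apartment B</li>
</ul>
"""

scraper = ClassAndIdScraper(target_class="offer", target_id="result-count")
scraper.feed(LISTING)
scraper.close()
```

On real portals the offers usually contain nested tags, which is where a library such as BeautifulSoup, mentioned later in the talk, becomes more convenient than a hand-written parser.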

How we can describe the process of web scraping? It's actually really easy we just define the information we want to retrieve from looking at the portal it's like Lukasz showed you before. So, for example the name of the object, the Price, the flight number, anything you want. Then we analyse the structure of the page and check which tags and attributes can be used to get this data. From that point all we need to do is build a scraper in actually

any language right now, because there are libraries I think for every language to get to help you really quickly write the code and to get in the stop page and when the extraction is complete we just need to put it in the the file so we can process it later. So, with web scraping, it's worth mentioned couple things since in many countries’ regularity of web scraping remains unregulated, we try to follow a few rules in order not to get blocked by portal owners. So, firstly our scrapers introduce themselves, we are using user agents and send the string information to server about the purpose of our web scraping. So,

far it worked, we didn't get blocked on any, we got I think from one portal blocked only for a couple of hours or something like that, if we are too extensive with our web scraping. So, another thing we do is run web scraping during off-peak hours, preferably at night and finally, we set our delays between queries so, we don't affect the controlling of the button that was actually in the question before we answer already. Another question was about libraries so like I mentioned we use we like the request beautifulsoup and selenium. I request beautifulsoup combo is used for pages that do not have JavaScript, we request, we extract the whole content of the page and then use a beautifulsoup to extract the actual information from the tags we wanted and for the portals that are using javascripts or require a lot of interactivity with the portal great solution is selenium Library, which can emulate user behavior. We can also use it to handle a send forms

with data or execute any other interaction with the portal. So, like one last thing I want to mention about this tourist pattern, that regardless of the selected libraries, tourist portals are prepared for tourists so, they have a modern visual design and use up-to-date Solutions in the structure, that raises a number of issues and problems that we need to address, there are only few portals left that do not use JavaScript and more and more even basic information such as the exact address of a facility or specific flight number, requires additional interaction so easy and fast Solutions are no longer valid on the couple of the small the general portals can be web scraped like in couple of seconds we can get whole content but the rest you need to prepare this step-by-step solutions. Another thing is generation of Content, lately we noticed that large portals such as booking or Airbnb dynamically defined classes in ID, that is big problem, because we can't actually say that okay go to this class and it will give me price No, you need to change and adapt to this. For now the best solution we found is XPath,

but it brings a lot of new products, also lately and so, like since last year or so we noticed a new trend among the accommodation portals in Poland, they are completely changing the structure of the website depending of the season so, there is a different site for winter and different site for a summer holiday, this often requires rewriting the entire code for a web scraping so, therefore everyone has to decide for themselves if they want to use web scraping linking in their organization with Lukasz actually ask this question before and 60 percent asked answered that they do I really that's why I want to check if there is some questions about this topic and we can you tell me how do you web scrape and how long you change how often you change your accounts maybe, oh there is no questions Oh there we are there, okay. Okay Currently has built-in sleep duty to avoid too many records with something like random range 310 be better work? Yeah, I think 310 is with enough, like I what we actually do in our office is the first scraping is always prepared on some private computer outside of the organization of outside of the server that will do the scraping and we just there but yeah, we don't have any delays more than I think seven seconds so, 30 is a bit too much I think. When you inspect an element do you see HTML CSL your those are just like Gmail when we expect an element you see only HTML uh whether it was loaded with JavaScript you don't really know because some elements you for example just scroll the page and the element has loaded but if you inspect that element it will show the HTML it doesn't mean you can actually access with web scraping these elements straightforward, you need to put this code to you to do the scrolling first. So, I hope this did this answers

your question. How do you link the demand side, like expenditures or visitors, to the supply side? Oh, that's the question for Lukasz. I hope Lukasz is still with us. Okay, maybe not, so I will answer this question first: Hello, what programming languages do you use to obtain the information? Like I mentioned, we mostly use Java and Python, and the libraries are Selenium, Requests and BeautifulSoup.

I understand the basics of HTML but sometimes struggle to find what I need on a page with the browser inspector tool; is there a way to find what you need, or any pointers on solving this? Well, I don't think I can help with that. It's just experience, I think, or the thing I mentioned before, the interactivity of the portal. So maybe if you try to download the content with a request, it looks different than on the page. Even though you are looking at the page, you don't know that you scrolled a bit or did something, or even that JavaScript is loading the content, so you can't get that content from the request alone. I hope this helps a bit.

We got another question about what programming languages we use. So, like I mentioned, it's Python and Java. We don't use R; R is for mathematicians, and we are not mathematicians.
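The randomized pause from the sleep-time question can be sketched in a few lines of Python. This is only an illustration, not the code used in the speakers' office: the function names `polite_pause` and `fetch_all` are hypothetical, the 3 to 10 second range is the one suggested in the question, and `fetch` stands for whatever download function (Requests, Selenium) is actually used.

```python
import random
import time


def polite_pause(min_seconds=3.0, max_seconds=10.0, sleep=time.sleep):
    """Wait a random time between min_seconds and max_seconds.

    Returns the chosen delay so it can be logged. The `sleep` argument is
    injectable only so the sketch can be tried without really waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


def fetch_all(urls, fetch, pause=polite_pause):
    """Call fetch(url) for every URL, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            pause()
        pages.append(fetch(url))
    return pages
```

Passing the download function in as `fetch` keeps the waiting logic separate from the choice of library, so the same loop can be tested locally before it runs on the production server.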

Is there any way of dealing with websites whose structure changes every time? Well, actually no. You need to be prepared that you will have gaps in your data. The only thing you can do is put in a log to inform you that the web scraping failed, so you can quickly adjust and prepare new code for the new structure, but I think that's the only way. So, how do you link the demand side, like expenditure or visitors, to the supply side? Well, I will show some linking, but it only concerns the supply side, so I think only Lukasz could answer that, but I am not sure he's still present with us.
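The idea of a log that tells you when scraping failed could look roughly like the sketch below. It is an assumption about one possible shape, not the speakers' implementation: `extract_record`, the `extractors` dictionary and the logger name are hypothetical, and each extractor is simply a function that returns a value or `None` for a downloaded page.

```python
import logging

logger = logging.getLogger("scraper")


def extract_record(url, page, extractors):
    """Apply each extractor to the page and log every field that comes back empty.

    `extractors` maps a field name to a function taking the page. A failing or
    empty extractor usually means the portal changed its structure.
    """
    record = {}
    missing = []
    for field, extractor in extractors.items():
        try:
            value = extractor(page)
        except Exception:
            # Keep scraping the other fields, but leave a trace with the traceback.
            logger.exception("Extractor for %r crashed on %s", field, url)
            value = None
        if value is None:
            missing.append(field)
        record[field] = value
    if missing:
        logger.error("Missing fields %s on %s; page structure may have changed",
                     missing, url)
    return record, missing
```

The record is still returned with the gaps in it, which matches the point above: the gaps cannot be avoided, but the log makes sure they are noticed quickly.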

So, I will leave that question for now, okay. Let's get back to my presentation then. I hope I'm helpful with these answers.

Okay, so we have decided to web scrape. Now what? We need to clean the data, but not without knowledge of the missing values. We need to transform the data, but not without knowing the types of the variables. We can start modelling the data, but not without knowing the outliers and the distribution of the variables. That's why I would like to present eight steps to help us understand what data we are dealing with. Of course, this is not by any means a complete list; it's just a list of basic steps we need to take to make sure we can turn our data into information. So, the first step is really easy, it is often forgotten

but it's simply: look at the data. This will help us decide what software or hardware to use, depending on what we need this data for. For example, if we only need to deduplicate data, we don't need a very good machine, but for imputation or machine learning the hardware has to be a lot better, and also the software, because Python is good if you want to try something or test something, but for production it's not really very good. We can also, like I mentioned, estimate the time consumption of the process and understand the structure of the collection better. So, I will show how we do it with code in Python, because like I said, for the first step Python is awesome. A great way to check the data is using a Python library called pandas. It will load our data as a data frame, and to look at the data we just need to use things like shape or head, with the number of rows we want to see.

The next step is in-depth knowledge of variable types. This step will help us verify the types in the collection and fix any structural and type conversion errors we might see in our data, and for that all we need to do is use dtypes from pandas.

Okay, next let's move to the data summary for all numerical variables. It includes minimum, maximum, mean, median, quartiles and standard deviation. Thanks to that summary we know which variables we need to pay attention to in further analysis. So, like you can see, we

use, for example, the describe method in Python, and as you can see there is a big, drastically increasing value between the third quartile and the maximum in a couple of the columns we web scraped.

And now we have one of the most important steps in our checking of the data, which is checking the missing data. That will help us to better choose the appropriate algorithm, as some algorithms are sensitive to missing data. Also, if a variable has too many missing values, it might be best to just not use it in our model. So, again in Python there is a method called isnull, but we like to add something to it, and the additional code determines the number and the percentage of empty values. Like you can see, we have this one column that only tells us whether there are empty values, but we added these two columns that tell us how many rows have missing values and what percentage of the total data it is.

So, once we have carried out the analysis of the missing data, we need to decide what to do with it. We have three solutions. First, we can delete variables with empty values. Secondly, we can impute the missing data, mostly using the mean or median, but it can also be some machine learning method, so we don't have to use only the mean or median. The

last option is to add flags for rows with empty values, which will allow us to check the quality of the modelling later and check what the real impact of the missing values is.

So, the next step: if we choose to impute the missing values, then this step will be useful for us. It is checking the skewness and the distribution of the variable, from which we draw conclusions on how to impute the missing data. It is very helpful here to create, for example, a histogram to check the type of the distribution. To check the skewness, the skew method of the pandas library is sufficient, and for the distribution we can use the matplotlib library in Python to visualize our data.

So, let's focus now on the element I mentioned before, namely the identification of outliers. Checking if and how many outliers are present in our sets will allow us to better choose an appropriate algorithm. Also, we can see whether our outliers are clearly caused by incorrect data or whether they contain legitimate information. Also, some statistical methods, like Pearson correlation, are really sensitive to outliers. A standard way to identify outliers is to use the outlier formula: the outliers are the points lying beyond the upper boundary and the lower boundary. We have an example here for price. Excuse me for

a second. So, as you can see, we have six elements that we could remove from our set below the lower boundary and over 7,000 above the upper one. It is still less than 10% of the set, but with price we actually want to get the whole range, so we wouldn't use these outliers to remove data.

The next step: very often we have a lot of categorical variables in our collection. We need to know the number and the coverage of the categories, because some methods need balanced data in order to be used correctly, and some models perform very poorly if the categories are unbalanced. I will show it in an example. As you can see, we have a very dominant category called hotels from the portals, and the bottom ones are actually really low, so maybe it is worth combining the bottom ones into one called, for example, "other" or something, and it will help us reduce the time we need for modelling.

And the last step I wanted to mention is checking the correlation between the variables, both numerical and categorical. This will allow us to discover dependencies between variables that we can use when transforming the data. Also, if we want, for example, to use machine learning models, this step will allow us to select the variables that have the highest correlation. Once again, the ideal library for correlation analysis is pandas in Python. We can also visualize the process and quickly check every correlation.

So, I have a question for you all right now. Okay, I hope you can see it. The question is: what is, in your opinion, the most time-consuming process in working with web scraped data? Yeah, I can see data cleaning is winning. Of course, I didn't mention the simple data cleaning techniques, like removing noise from the data, the unnecessary characters, and actually, I'm really surprised.
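The eight checks described above map onto a handful of pandas calls. The sketch below uses a tiny invented data frame in place of real web scraped offers, and it assumes the common 1.5 × IQR rule for the lower and upper outlier boundaries, since the talk does not say which boundary formula was used; the column names are hypothetical.

```python
import pandas as pd

# Invented sample standing in for web scraped accommodation offers.
df = pd.DataFrame({
    "type": ["hotel", "hotel", "hotel", "apartment", "hostel", "hotel"],
    "price": [200.0, 220.0, 210.0, None, 90.0, 5000.0],
    "rating": [8.1, 8.4, None, 7.9, 7.0, 9.5],
})

# Steps 1-3: look at the data, check the types, summarise numeric columns.
print(df.shape)
print(df.head(3))
print(df.dtypes)
print(df.describe())

# Step 4: missing values, with the count and percentage added to isnull().
missing = pd.DataFrame({
    "has_missing": df.isnull().any(),
    "n_missing": df.isnull().sum(),
    "pct_missing": df.isnull().mean() * 100,
})
print(missing)

# Step 5: skewness of a variable we might want to impute.
skewness = df["price"].skew()

# Step 6: outlier boundaries (assumption: the 1.5 * IQR rule).
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["price"] < lower) | (df["price"] > upper)]

# Step 7: share of each category, to spot a dominant one.
shares = df["type"].value_counts(normalize=True)

# Step 8: correlation between the numeric variables.
corr = df[["price", "rating"]].corr()
```

In this invented sample both the cheap hostel and the very expensive hotel fall outside the boundaries, which illustrates the earlier remark that outliers in price may be legitimate values rather than errors.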

Thank you. 36 persons answered. Okay, let's wait a couple of seconds more for the last people that want to answer. Okay, okay. So, I'm actually really surprised that 65% said data cleaning, because yes, data cleaning is a time-consuming process, but for us it is always data analysis. Data cleaning is mechanical: there are tasks that you need to do to clean the data. But for data analysis you need a complete understanding of the variables and what they do, not only to clean the data but also for the modelling. So I'm really surprised that data analysis is at only three percent, because for us it's the most time-consuming process. Thank you for your answers, that was interesting.

Okay, a follow-up question re sleep times: how does a site defend against requests, by checking that requests aren't identical in terms of waits, or by a specific time threshold? Well, mostly they check how many requests there are from one IP. Not many of them check the time intervals, except that if there are no time intervals at all then they will of course block you instantly. But, like I said, it is checking how many requests there are from one IP. So I hope this answers your question. And how do you link the demand side and visitors: okay, that's the question that was left before, so I will get to that at the end. Maybe something I will show next will help with this.

So let's get back to the final part. Like I mentioned, these eight simple steps do not cover the whole subject of analysis, but they are the minimum we should keep in mind when we are presented with a new data set. We will often come back to the data to understand it better; that's why I was surprised that data analysis was at the bottom. However, let's assume that we already know everything, so let's try to combine the data from web scraping with our registers. I will give you an example of how we did it on the supply

side. From a supply side perspective, the first idea that immediately comes to mind when we want to link accommodation facilities with our register is to use the name and/or the address of the facility. So, let's do that. We are sure that our linkage is 100% accurate, but when we started to check how many objects were connected, it was only about 10% of all web scraped objects. That is definitely not enough. So, why? The first obvious reason is the free-form notation of the object's name on the portals; they can put anything there. But even if they use the appropriate name, for example for a hotel, they still change these little things. For example, they don't write "and" but use an ampersand or something like that, or they make mistakes or typos. Fortunately, you can use various methods that allow us to determine the similarity between objects in the register and in the web scraped data. These are formulas such as, for example, the Levenshtein, the Jaro-Winkler or the Jaccard formula, which is known as fuzzy matching. In a nutshell

this can be described as the number of operations that need to be performed on one text to get to a second text, and it is also a great tool for dealing with typos and missing characters. Okay, so we did it and the number of linked objects increased, but the results were still not satisfactory.

So, what about geographical coordinates? Maybe we have the address and we also have geographical coordinates. Great, we can calculate the distance between two objects using well-known formulas, for example the haversine or Vincenty. But what will be the difference between linking by coordinates and by address? Actually a big one. This is why I mentioned the analysis of the data, because we need to come back to it and check one by one what we have there. So here you have a visualized example of an object from my city, Rzeszów, where the Bristol Hotel is so large that it has several entrances and, what's worse, it actually uses different addresses on different portals. When we checked the portals, the problem occurred much more frequently than we expected. So, we developed a very simple algorithm: we

calculate the distance between all the objects, then we check the distance between the objects whose names are the most similar, and we merge those that are within the threshold. Of course, we need to check the quality of our merging. We can use several quality indicators for this, such as accuracy, sensitivity or precision. As you can see in this example, the highest accuracy and sensitivity is for 200 meters.

So, in the beginning, combining data from web scraping with a statistical register looked like a fairly straightforward process. However, it ends up being a process that often requires going back to the data to better understand it, and doing a couple of different things using a couple of different methods, algorithms and so on, and we still didn't finish the process. There was one question about NLP and semantics, and yes, that's another thing we want to use. We want to check how to deduplicate the objects using pictures, because we think that if an object is on one portal and calls itself "funny little house on the lake" but uses its appropriate name on a second portal, and both use the same pictures, we can deduplicate that and finally connect them. So it's not a finished process.

So today Lukasz and I wanted to present how we get an idea about new data, and the steps and the process we followed to get results. Lukasz also showed the update of the register and the new indicators. So, thank you for your attention. I will check if there are more questions, and if not I will give the floor to Sarah. So, we are left with that one question. I hope this answers it a bit. I know it's not the answer you want, but maybe it helped a little. From my side I actually can't add anything more.
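The linking idea described above, name similarity combined with a distance threshold, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the algorithm actually used: the function names, the record layout and the `min_similarity` cut-off of 0.6 are hypothetical, the coordinates are invented, Levenshtein is used as the name measure and haversine as the distance, and only the 200 meter threshold comes from the talk.

```python
import math


def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn text a into text b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1.0 means identical after lower-casing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(h))


def link(register, scraped, max_distance_m=200.0, min_similarity=0.6):
    """For each scraped object, pick the most similarly named register entry
    among those lying within max_distance_m."""
    links = {}
    for s in scraped:
        best = None
        for r in register:
            if haversine_m(s["lat"], s["lon"], r["lat"], r["lon"]) > max_distance_m:
                continue
            similarity = name_similarity(s["name"], r["name"])
            if similarity >= min_similarity and (best is None or similarity > best[1]):
                best = (r["id"], similarity)
        if best is not None:
            links[s["id"]] = best[0]
    return links


# Invented records: S1 has a typo and sits about 56 m from R1; S2 is over 1 km away.
register = [
    {"id": "R1", "name": "Hotel Bristol", "lat": 50.0370, "lon": 22.0040},
    {"id": "R2", "name": "Villa Anna", "lat": 50.0400, "lon": 22.0100},
]
scraped = [
    {"id": "S1", "name": "Hotel Bristoll", "lat": 50.0375, "lon": 22.0040},
    {"id": "S2", "name": "Hotel Bristol", "lat": 50.0500, "lon": 22.0040},
]
links = link(register, scraped)
```

Comparing every scraped object with every register entry is quadratic, which is fine for a sketch; a real run over whole portals would probably need some blocking by region first. The quality indicators mentioned above (accuracy, sensitivity, precision) would then be computed by comparing `links` with a manually verified sample for several thresholds.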

2023-03-05 07:35
