Since its launch in 2008, Airbnb has grown into a platform with more than 4 million hosts who have hosted more than 800 million guests globally. Boston, one of the top travel destinations in the US, welcomes about 22 million visitors from around the world each year. In this post, I’m going to use the “Boston Airbnb Open Data” to take a sneak peek at Airbnb activity in Boston.
The price of a listing is one of the most important factors in a guest’s decision when choosing a place to stay, and it also determines the revenue a property generates for its host. The listing price of a property is determined by several factors such as location, property type, host qualifications, etc. The goal of this article is to explore the data and determine how these factors affect the listing price.
I started with some exploratory analysis of the ‘listings’ dataset. The dataset consists of 585 rows and 94 columns, some of which are irrelevant to what we are trying to study here or contain only null values. So the first step is to clean the data by dropping irrelevant columns (including all-null columns), converting the price-related columns into numeric format, and filling N/A values. Here we show histograms of the listing prices of the properties in Boston. The majority of prices fall in the $20 to $500 range, so we only use the data within this price range for the analysis.
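The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up toy rows, not the exact code behind the analysis: the function name `clean_listings` and the column names `price` and `cleaning_fee` are assumptions, and the step of filling N/A values is left out because the post does not say how each column was filled.

```python
import pandas as pd

def clean_listings(df, price_cols=("price",), low=20, high=500):
    """Drop all-null columns, parse price strings, keep prices in [low, high]."""
    # Remove columns in which every value is missing.
    out = df.dropna(axis=1, how="all").copy()
    for col in price_cols:
        if col in out.columns:
            # Strip "$" and thousands separators, then convert to numbers;
            # anything that cannot be parsed becomes NaN.
            stripped = out[col].astype(str).str.replace(r"[$,]", "", regex=True)
            out[col] = pd.to_numeric(stripped, errors="coerce")
    # Keep only listings inside the chosen price range (inclusive).
    return out[out["price"].between(low, high)].reset_index(drop=True)

# Toy data standing in for the real 'listings' file (values are made up).
toy = pd.DataFrame({
    "price": ["$85.00", "$1,250.00", "$10.00", "$300.00"],
    "cleaning_fee": ["$20.00", None, "$5.00", "$50.00"],
    "empty_col": [None, None, None, None],
})
cleaned = clean_listings(toy, price_cols=("price", "cleaning_fee"))
print(cleaned)
```

On the real data, calling `cleaned["price"].hist()` (which needs matplotlib) would give a histogram of the filtered prices.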
The first question we are trying to answer is how geographic location affects the listing price. So we group the listing prices by neighborhood and show the data in the following box plot. If we also take a look at the Boston neighborhood map, it is clear that most of the higher-priced listings are concentrated in the downtown area, where more tourists visit.
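As a rough sketch of the grouping step, the snippet below ranks neighborhoods by median price on a few made-up rows. The column name `neighbourhood`, the helper `median_price_by_group`, and all values are assumptions for illustration only.

```python
import pandas as pd

def median_price_by_group(df, group_col, price_col="price"):
    """Median price per group, highest first."""
    return df.groupby(group_col)[price_col].median().sort_values(ascending=False)

# Toy rows; the neighborhoods are real places, the prices are invented.
toy = pd.DataFrame({
    "neighbourhood": ["Downtown", "Downtown", "Roxbury", "Roxbury", "Back Bay"],
    "price": [250.0, 300.0, 80.0, 100.0, 220.0],
})
ranking = median_price_by_group(toy, "neighbourhood")
print(ranking)
```

With matplotlib installed, `toy.boxplot(column="price", by="neighbourhood", rot=90)` draws one box per neighborhood, which is the kind of chart described above.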
The second question we are trying to answer is how seasonality affects the price of Airbnb listings in Boston. This time we explore the ‘calendar’ dataset to find some insights. Again, the first step is to clean the data. There are a lot of null values in the price column because on many days a listing is not available for hosting guests, so we can drop these null values instead of filling them. Here we also show a box plot of listing prices for each month of the year from 2016 to 2017. Not surprisingly, September and October are the months with the highest prices. This is likely because the fall semester begins and many students and parents travel to Boston, which drives up demand in the hotel and short-term rental market. (Note: Boston is a town with many universities and colleges.) It is also worth pointing out that in the winter season (e.g. January and February) the listing price is lower than average. This is most likely due to the winter weather, as Boston is not a winter destination for most visitors.
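A minimal sketch of the calendar cleaning and monthly grouping might look like this. The column names `date` and `price`, the helper `monthly_median_price`, and the toy values are assumptions; the real file may be laid out differently.

```python
import pandas as pd

def monthly_median_price(cal):
    """Drop unavailable days (null price) and return the median price per month."""
    # Null prices mark days when the listing is not available, so drop them.
    out = cal.dropna(subset=["price"]).copy()
    # Convert strings such as "$1,200.00" to floats.
    out["price"] = out["price"].str.replace(r"[$,]", "", regex=True).astype(float)
    # Extract the calendar month (1-12) from the date.
    out["month"] = pd.to_datetime(out["date"]).dt.month
    return out.groupby("month")["price"].median()

# Toy calendar rows (values are made up).
toy_calendar = pd.DataFrame({
    "date": ["2016-09-10", "2016-09-11", "2017-01-15", "2017-01-16", "2017-01-17"],
    "price": ["$200.00", None, "$120.00", "$100.00", None],
})
monthly = monthly_median_price(toy_calendar)
print(monthly)
```

Grouping by month alone pools the same month across both years; grouping by year and month instead would keep 2016 and 2017 apart.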
The last question we are trying to answer is whether we can create a model to predict the Airbnb listing price in Boston. Both categorical and numerical variables are important in creating such a model, as we saw in the price vs. neighborhood analysis. However, we can start with the numerical variables and see how well that works before we take categorical variables into account. I first cleaned the data by dropping columns that are irrelevant to the price of a listing, such as the URL of a listing and the host ID. After a correlation analysis, variables including number of beds, guests included, number of reviews, cleaning fee, security deposit, etc. were chosen to build the model for price prediction. Here I applied a linear regression model, and the final R² score is 0.325, which is not so great.
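To make the modeling step concrete, here is a small sketch of ordinary least squares and the R² score written in plain NumPy. It is not necessarily the library or code used for the analysis, and the synthetic features (standing in for something like beds and guests included) are made up, so the score it prints is unrelated to the 0.325 reported above.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns [intercept, coef_1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply fitted coefficients to a feature matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: two numeric features and a noisy linear price.
rng = np.random.default_rng(0)
X = rng.uniform(1, 6, size=(200, 2))
y = 50 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(0, 40, size=200)

# Hold out the last 50 rows for evaluation.
train, test = slice(0, 150), slice(150, 200)
coef = fit_linear(X[train], y[train])
score = r2_score(y[test], predict(coef, X[test]))
print(f"held-out R^2: {score:.3f}")
```

Scoring on held-out rows rather than the training rows gives a fairer picture of how well the model would predict prices for unseen listings.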
Therefore, some categorical variables (e.g. cancellation policy, room type, etc.) were also added to build the linear regression model. I first created dummy columns for these categorical variables, and then filled the missing values with the column means. The final R² score I obtained is 0.58, a large improvement over the previous model, as seen in the following plots.
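The dummy-column and mean-fill steps can be sketched as below. The helper `encode_and_fill`, the column names, and the toy values are assumptions for illustration.

```python
import pandas as pd

def encode_and_fill(df, cat_cols):
    """One-hot encode the categorical columns, then fill NaNs with column means."""
    # Each category value becomes its own 0/1 column.
    out = pd.get_dummies(df, columns=cat_cols, dtype=float)
    # Replace remaining missing numeric values with the mean of their column.
    return out.fillna(out.mean())

# Toy rows (values are made up).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "cancellation_policy": ["strict", "flexible", "strict"],
    "beds": [2.0, None, 1.0],
})
features = encode_and_fill(toy, ["room_type", "cancellation_policy"])
print(features)
```

The resulting all-numeric table can be passed to a linear regression in the same way as the purely numerical features.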
To sum up, the price of an Airbnb listing is strongly influenced by its geographical location and by seasonality. We can also build a linear regression model with both numerical and categorical variables to predict listing prices.