Data Analysis Project
QUESTION
An essay about obtaining Airbnb data for the NYC area from Inside Airbnb
Answer
In this project, the analysis involves obtaining Airbnb data for the NYC area from Inside Airbnb, in addition to arrest records from NYPD Arrest Open Data. These two datasets are integrated and then analyzed to ascertain whether any relationship exists between them. Histograms and maps are used to present the results. In addition, the analysis entails identifying the most suitable machine learning model for predicting nightly rental prices in New York City and assessing the relative safety of its neighborhoods.
Both datasets included many variables, some of which were not relevant to this analysis. Both were therefore cleaned to retain only the relevant variables in a suitable format. For the Airbnb data, the following variables were retained for the analysis:
id
name
host_id
host_name
neighbourhood_group
neighbourhood
latitude
longitude
room_type
price
minimum_nights
number_of_reviews
last_review
reviews_per_month
calculated_host_listings_count
availability_365
For the crime data, the main variables retained in the dataframe were the arrest date, latitude, longitude, and the coordinates, which are needed to calculate the distance between each reported arrest and the location of each rental listing.
Given the large number of locations in the Airbnb dataset (about 40,000) and in the crime dataset (about 16,000), the analysis computed the distance from each Airbnb location to every arrest coordinate in chunks for ease of computation. Modelling the relationship between a response variable and predictor variables also depends on the correlation among the predictors. Figure 1 depicts these correlations.
Figure 1 Correlation Matrix
This research also analyzed price differences across the boroughs to gain insight into what informs pricing in this market. As shown in Figure 2 below, Manhattan has the highest average Airbnb price, followed by Brooklyn, Queens, Staten Island, and the Bronx.
Figure 2 Pricing by borough
Moreover, the review scores indicated that, for all boroughs in New York, the higher the review score, the higher the Airbnb price. Among the five boroughs, Manhattan has the highest number of reviews, followed by Queens, Brooklyn, the Bronx, and Staten Island, which may also be related to the number of Airbnb listings available. Figure 3 below shows the relationship between review scores and prices.
Figure 3 Review Scores and Price
Lastly, before selecting a suitable model for the prices, this research examined the number of criminal arrests per borough to build a clear picture of the situation. These statistics are presented in Figure 4 below.
Figure 4 Criminal arrest count by Borough
In summary, Manhattan has the most expensive Airbnb pricing among the five NYC boroughs. The higher the review scores of a listing, the higher its price. On the other hand, the more arrests recorded within 0.5 mile of an Airbnb listing, the lower its price.
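The chunked distance computation and the 0.5-mile arrest count described above can be sketched as follows. This is a minimal illustration rather than the code used in the project: the haversine formula, the function names, the chunk size, and the sample coordinates are all assumptions introduced here, and only the Python standard library is used.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def arrests_within_radius(listings, arrests, radius_miles=0.5, chunk_size=1000):
    """Count arrests within radius_miles of each listing, one chunk of listings at a time."""
    counts = []
    for start in range(0, len(listings), chunk_size):
        chunk = listings[start:start + chunk_size]
        for lat, lon in chunk:
            counts.append(sum(
                1 for a_lat, a_lon in arrests
                if haversine_miles(lat, lon, a_lat, a_lon) <= radius_miles
            ))
    return counts


# Illustrative coordinates only (not taken from either dataset).
listings = [(40.7580, -73.9855), (40.6782, -73.9442)]
arrests = [(40.7590, -73.9845), (40.7306, -73.9352), (40.6785, -73.9440)]
print(arrests_within_radius(listings, arrests))
```

With tens of thousands of listings and arrests, a vectorized or spatially indexed approach would likely be faster; the loop form is kept here only for readability.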
In selecting the best machine learning model, the main variables in the dataset, such as price, latitude, and longitude, among others, had to be transformed into natural logs because they were not normally distributed. A summary of these variables is shown below.
A significant number of these variables were not normally distributed, including price, which is the main dependent variable.
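The effect of the log transformation on a right-skewed variable can be illustrated with a short sketch. The prices below are hypothetical and the skewness helper is an assumption introduced for illustration. Because the natural log is defined only for positive values, the sketch applies it to price alone.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return sum((v - mean) ** 3 for v in values) / n / var ** 1.5


# Hypothetical nightly prices with a long right tail (not from the dataset).
prices = [45, 60, 75, 80, 99, 120, 150, 200, 350, 1200]
log_prices = [math.log(p) for p in prices]

# Skewness before and after the transform; a value closer to 0 is more symmetric.
print(round(skewness(prices), 2), round(skewness(log_prices), 2))
```

A lower skewness after the transform suggests the logged variable is closer to symmetric, although it does not by itself establish normality.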
Various machine learning models were trained and tested on the data. The Random Forest model yielded the lowest RMSE, followed by the Neural Network, XGBoost, and Decision Tree models. The predictive performance of these models, in terms of the root mean squared error (RMSE), was as follows:
Decision Tree | Random Forest | XGBoost | Neural Network |
---|---|---|---|
0.531896 | 0.378528 | 0.40669 | 0.396163 |
As a result, the Random Forest model was selected and then optimized to improve the accuracy of its price predictions.
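The model comparison above rests on the RMSE, which can be computed as in the sketch below. The target values, the predictions, and the model names are hypothetical. It is also an assumption here that the RMSE was measured on the log-transformed price.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical log-prices and predictions from two placeholder models.
y_true = [4.0, 4.5, 5.0, 5.5]
preds = {
    "model_a": [4.1, 4.4, 5.2, 5.3],
    "model_b": [4.5, 4.0, 5.6, 5.0],
}

scores = {name: rmse(y_true, p) for name, p in preds.items()}
best = min(scores, key=scores.get)  # the model with the lowest RMSE
print(best, scores)
```

The same selection rule, lowest RMSE on held-out data, is what places Random Forest first in the table.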
The fitted Random Forest model also showed reasonable accuracy in predicting prices. The figure below depicts the fit of this model.
Initially, the model had relatively low predictive ability. After optimization, its R² is 0.69, implying that the variables in the model explain 69% of the variation in nightly rental prices.
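Assuming the reported figure of 0.69 is the coefficient of determination (R²), the share of variation explained can be computed as in this sketch. The values are hypothetical and the function name is introduced for illustration only.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical log-prices and predictions (not from the dataset).
y_true = [4.0, 4.5, 5.0, 5.5]
y_pred = [4.1, 4.4, 5.2, 5.3]
print(round(r_squared(y_true, y_pred), 2))
```

An R² of 0.69 would mean the residual sum of squares is 31% of the total sum of squares around the mean price.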
The features that appear most influential in determining the price of a listing are:
- listing type (if it is a home/apartment)
- location, which is intuitive given that in real estate location is often a deciding factor for price
- availability and review related factors
- certain listing descriptor words indicating the character or location of a listing
- security (no cases of arrests within the region)
The figure below illustrates the actual and predicted prices for the test dataset, in order of increasing price.
The following points are worth noting:
- The private room dataset yielded the best model accuracy, followed by the home and shared datasets.
- The lower accuracy for the shared dataset most likely results from the smaller number of data points available to train the shared model.
- The model predicts with better accuracy for private listings than for home/apartment listings. This is likely driven by the larger spread of prices within home/apartment listings relative to private listings, as reflected in the standard deviations of the two samples.
- Neighborhoods with high crime rates have the lowest listing prices.
The model's accuracy also varies by borough, as shown below:
Borough | RMSE |
---|---|
Manhattan | 0.384 |
Brooklyn | 0.369 |
Queens | 0.37 |
Bronx | 0.419 |
Staten Island | 0.458 |
The distribution of listings per borough is shown below:
Borough | Listings |
---|---|
Manhattan | 21192 |
Brooklyn | 19801 |
Queens | 5592 |
Bronx | 1071 |
Staten Island | 370 |
In summary, the Random Forest regression model provided the best accuracy for predicting listing prices from the variables generated from the initial data, despite its tendency to underpredict expensive listings. It also underpredicts listings priced relatively low. The model's feature importances can be used to further understand what drives the price of an Airbnb listing in NYC.