Data Analysis Project | academiascholars.com

Data Analysis Project

QUESTION

An essay about obtaining Airbnb data for the NYC area from Inside Airbnb

Subject	Essay Writing	Pages	8	Style	APA

Answer

In this project, the analysis involves obtaining Airbnb data for the NYC area from Inside Airbnb, together with arrest records from NYPD Arrest Open Data. These two datasets will be integrated and then analyzed to ascertain whether any relationship exists between them. We will use histograms and maps to present our results. In addition, the analysis entails identifying the most suitable machine learning model for predicting nightly rental prices in New York City and presenting an analysis of the relative safety of its neighborhoods.

Both datasets included many variables, some of which were not relevant to this analysis. As such, both have been cleaned to retain only the relevant variables in their most usable format. For the Airbnb data, the following variables were retained for use in the analysis.

name

host_id

host_name

neighbourhood_group

neighbourhood

latitude

longitude

room_type

price

minimum_nights

number_of_reviews

last_review

reviews_per_month

calculated_host_listings_count

availability_365

As for the crime data, the main variables retained in the dataframe were the arrest date, latitude, longitude, and coordinates (important for calculating how far reported arrests were from the rental listings).

Given the large number of locations in the Airbnb dataset (about 40000) and in the crime dataset (about 16000), the analysis computed the distance from each Airbnb location to every arrest coordinate in batches for ease of computation. Modelling the relationship between a response variable and predictor variables also depends on the correlation among the predictors. Figure 1 depicts these correlations.
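The batched distance computation described above can be sketched as follows. This is a minimal illustration, not the code used in the analysis: the haversine formula, the function names, the chunk size, and the sample coordinates are all assumptions. The 0.5-mile radius is the one used later in this report.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def arrests_within_radius(listings, arrests, radius_miles=0.5, chunk_size=1000):
    """Count arrests within radius_miles of each listing, one chunk at a time."""
    counts = []
    for start in range(0, len(listings), chunk_size):
        chunk = listings[start:start + chunk_size]  # one batch of listings
        for lat, lon in chunk:
            counts.append(sum(
                1 for a_lat, a_lon in arrests
                if haversine_miles(lat, lon, a_lat, a_lon) <= radius_miles
            ))
    return counts


# Hypothetical coordinates, for illustration only (not taken from the datasets).
sample_listings = [(40.7580, -73.9855), (40.6782, -73.9442)]
sample_arrests = [
    (40.7590, -73.9845),
    (40.7575, -73.9860),
    (40.7484, -73.9857),
    (40.6790, -73.9450),
]
print(arrests_within_radius(sample_listings, sample_arrests))
```

Processing listings in chunks does not change the result; it only bounds how much work is done per batch, which is useful when the counts are written out or checkpointed between batches.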

Figure 1 Correlation Matrix
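A correlation matrix of this kind can be computed as in the sketch below. It assumes Pearson correlation, which the report does not state explicitly, and the column values are invented purely for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values; the variable names match the retained Airbnb columns.
sample_columns = {
    "price": [50, 80, 120, 200, 150],
    "number_of_reviews": [40, 25, 30, 5, 10],
    "availability_365": [100, 200, 150, 365, 300],
}
matrix = correlation_matrix(sample_columns)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```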

This research also analyzed how prices differ by borough to gain insight into what informs pricing in this market. As shown in Figure 2 below, Manhattan has the highest average Airbnb price, followed by Brooklyn, Queens, Staten Island, and the Bronx.

Figure 2 Pricing by borough
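The averages behind a chart like Figure 2 amount to a group-by on neighbourhood_group. The sketch below shows the idea with a hypothetical helper and made-up prices; it does not reproduce the actual figures.

```python
from collections import defaultdict


def mean_price_by_group(rows, group_key="neighbourhood_group", value_key="price"):
    """Average value_key per group, sorted from highest to lowest mean."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        totals[row[group_key]][0] += row[value_key]
        totals[row[group_key]][1] += 1
    means = {group: total / count for group, (total, count) in totals.items()}
    return dict(sorted(means.items(), key=lambda item: item[1], reverse=True))


# Invented rows for illustration only.
sample_rows = [
    {"neighbourhood_group": "Manhattan", "price": 200},
    {"neighbourhood_group": "Manhattan", "price": 100},
    {"neighbourhood_group": "Brooklyn", "price": 120},
    {"neighbourhood_group": "Queens", "price": 90},
]
print(mean_price_by_group(sample_rows))
```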

Moreover, the review scores indicated that, for all boroughs in New York, the higher the review scores, the higher the Airbnb price. Among the five boroughs, Manhattan has the highest number of reviews, followed by Queens, Brooklyn, the Bronx, and Staten Island. This may also be related to the number of Airbnb listings available in each borough. Figure 3 below shows the relationship between review scores and prices.

Figure 3 Review Scores and Price

Lastly, before selecting a suitable model for the prices, this research examined the number of criminal arrests per borough to build a picture of relative safety. These statistics are presented in Figure 4 below.

Figure 4 Criminal arrest count by Borough

Overall, Manhattan has the most expensive Airbnb pricing among the five NYC boroughs. The higher the review scores of an Airbnb listing, the higher its price. On the other hand, the more arrests recorded within 0.5 miles of an Airbnb listing, the lower its price.

Before selecting the best machine learning model, the main variables in the dataset, such as price, latitude, and longitude, among others, had to be transformed into natural logs because they did not follow a normal distribution. A summary of these variables is shown below.

A significant number of these variables were not normally distributed, including price, which is the main dependent variable.
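The effect of a natural-log transform on a right-skewed variable such as price can be illustrated as follows. The prices are hypothetical, and sample skewness is used here as one simple check of asymmetry; the report does not say which normality check was applied.

```python
import math


def skewness(values):
    """Sample skewness: third central moment over the second moment to the power 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical nightly prices with a long right tail.
prices = [45, 60, 75, 90, 120, 150, 200, 350, 800]

# math.log requires strictly positive inputs, so zero prices would need
# to be dropped or handled separately before this step.
log_prices = [math.log(p) for p in prices]

print(f"skewness before: {skewness(prices):.2f}")
print(f"skewness after:  {skewness(log_prices):.2f}")
```

A skewness closer to zero after the transform indicates a more symmetric distribution, which is the motivation for modelling log price rather than raw price.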

Various machine learning models were trained and tested on the data. The Random Forest model yielded the lowest RMSE, followed by Neural Network, XGBoost, and Decision Tree. The predictive performance of these models, in terms of Root Mean Squared Error (RMSE), was as follows:

Decision Tree	Random Forest	XGBoost	Neural Network
0.531896	0.378528	0.40669	0.396163
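For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below defines it and ranks the four scores reported in the table above; the function name and the toy inputs are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: errors of 3 and 4 give sqrt((9 + 16) / 2).
print(round(rmse([0, 0], [3, 4]), 4))

# Scores as reported in the table above; a lower RMSE is better.
reported_rmse = {
    "Decision Tree": 0.531896,
    "Random Forest": 0.378528,
    "XGBoost": 0.40669,
    "Neural Network": 0.396163,
}
ranking = sorted(reported_rmse, key=reported_rmse.get)
print(ranking)
```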

As a result, the Random Forest model was selected and optimized to improve its price predictions.

The fitted Random Forest model also showed reasonable accuracy in predicting prices. The figure below depicts the fit of this regression model.

Initially, this model had relatively low predictive ability. After optimization, its predictive power (R²) is 0.69, implying that the variables in the model explain 69% of the variation in nightly rental prices.
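Assuming the reported 0.69 is the coefficient of determination (R²), it measures the share of variance in the actual values that the predictions account for. A minimal sketch with toy values, not the model's actual output:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values for illustration only.
actual = [4.0, 4.5, 5.0, 5.5, 6.0]
predicted = [4.2, 4.4, 5.1, 5.3, 5.9]
print(round(r_squared(actual, predicted), 3))
```

A value of 1 means perfect predictions, while a value of 0 means the model does no better than always predicting the mean.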

Evidently, the features that should be considered pivotal in influencing the price of a listing are:

listing type (if it is a home/apartment)
location, which is very intuitive considering that in real estate location is often a deciding factor for price
availability and review-related factors
certain listing descriptor words indicating the character or location of a listing
security (no cases of arrests within the region)

The figure below illustrates the actual and predicted prices for the test dataset, in order of increasing price.

The following points are worth noting:

The private room dataset yielded the best model accuracy, followed by the home and shared datasets.

The lower accuracy of the shared dataset relative to the other two most likely stems from the fewer data points available to train the shared model.

It is clear that the model predicts more accurately for private listings than for home/apartment listings. This is likely driven by the larger spread of prices within home/apartment listings relative to private listings. Compare the standard deviations for the two samples below:

Neighborhoods with high crime rates fetch the lowest listing prices.

The model’s accuracy varies noticeably by borough, from an RMSE of 0.369 in Brooklyn to 0.458 in Staten Island, as shown below:

Borough	RMSE
Manhattan	0.384
Brooklyn	0.369
Queens	0.37
Bronx	0.419
Staten Island	0.458

The distribution of listings per borough is shown below:

Borough	Listings
Manhattan	21192
Brooklyn	19801
Queens	5592
Bronx	1071
Staten Island	370
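Placing the listing counts next to the per-borough RMSE values makes the pattern easier to see: the two boroughs with the fewest listings also have the highest RMSE. The short sketch below uses only the figures from the two tables above; the layout of the printed output is an arbitrary choice.

```python
# Figures copied from the two tables above (listing counts and per-borough RMSE).
listings = {
    "Manhattan": 21192,
    "Brooklyn": 19801,
    "Queens": 5592,
    "Bronx": 1071,
    "Staten Island": 370,
}
rmse_by_borough = {
    "Manhattan": 0.384,
    "Brooklyn": 0.369,
    "Queens": 0.37,
    "Bronx": 0.419,
    "Staten Island": 0.458,
}

total = sum(listings.values())
shares = {borough: count / total for borough, count in listings.items()}

# One row per borough: name, listing count, share of all listings, RMSE.
for borough in sorted(listings, key=listings.get, reverse=True):
    print(f"{borough:<14}{listings[borough]:>7}"
          f"{shares[borough]:>8.1%}{rmse_by_borough[borough]:>8.3f}")
```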

In summary, the Random Forest regression model provided the best accuracy for predicting listing price from the variables generated from the initial data, despite its tendency to underpredict expensive listings. It also underpredicts listings priced relatively low. The model’s feature importances can be used to further understand what drives the price of an Airbnb listing in NYC.
