  1.  Data Analysis Project   

    QUESTION

    An essay about obtaining Airbnb data for the NYC area from Inside Airbnb

 

Subject Essay Writing Pages 8 Style APA

Answer

In this project, we obtain Airbnb data for the NYC area from Inside Airbnb, together with arrest records from NYPD Arrest Open Data. The two datasets are integrated and then analyzed to determine whether any relationship exists between them. We use histograms and maps to present our results. The analysis also identifies the most suitable machine learning model for predicting nightly rental prices in New York City and presents an analysis of the relative safety of its neighborhoods.

Both datasets included many variables, some of which were not relevant to this analysis. Both were therefore cleaned to retain only the relevant variables in a suitable format. For the Airbnb data, the following variables were retained for the analysis:

id                                 

name                              

host_id                            

host_name                         

neighbourhood_group               

neighbourhood                     

latitude                         

longitude                        

room_type                         

price                              

minimum_nights                     

number_of_reviews                  

last_review                        

reviews_per_month                

calculated_host_listings_count     

availability_365

For the crime data, the main variables retained in the dataframe were arrest date, latitude, longitude, and the coordinates, which are needed to calculate the distance between each reported arrest and the rental listings.

Given the large number of locations in the Airbnb dataset (about 40000) and in the crime dataset (about 16000), the analysis computed the distance from each Airbnb location to all arrest coordinates in chunks for ease of computation. Modelling the relationship between a response variable and predictor variables also depends on the correlation among those predictors. Figure 1 depicts these correlations.

 

Figure 1 Correlation Matrix
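
The chunked distance computation described above can be sketched as follows. This is a minimal illustration rather than the code used in the project: the haversine formula, the function names, the chunk size, and the sample coordinates are all assumptions, and the 0.5 mile radius is taken from the findings reported later in the essay.

```python
import math
from itertools import islice

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def chunks(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def arrests_within_radius(listings, arrests, radius_miles=0.5, chunk_size=1000):
    """Count arrests within `radius_miles` of each listing, one chunk of listings at a time."""
    counts = {}
    for chunk in chunks(listings, chunk_size):
        for listing_id, lat, lon in chunk:
            counts[listing_id] = sum(
                1
                for a_lat, a_lon in arrests
                if haversine_miles(lat, lon, a_lat, a_lon) <= radius_miles
            )
    return counts


# Hypothetical coordinates, for illustration only: (id, latitude, longitude).
listings = [(1, 40.7580, -73.9855), (2, 40.6782, -73.9442)]
arrests = [(40.7590, -73.9845), (40.7306, -73.9352)]
print(arrests_within_radius(listings, arrests))
```

Processing listings in chunks keeps each batch of pairwise comparisons small, which matters when tens of thousands of listings are compared against thousands of arrest locations.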

This research also analyzed price differences across the boroughs to gain insight into what drives pricing in this market. As shown in Figure 2 below, Manhattan has the highest average Airbnb rental price, followed by Brooklyn, Queens, Staten Island, and Bronx.

Figure 2 Pricing by borough

 

Moreover, the review scores indicated that, for all boroughs in New York, the higher the review score, the higher the Airbnb price. Among the five boroughs, Manhattan has the highest number of reviews, followed by Queens, Brooklyn, Bronx, and Staten Island. This may also be related to the number of Airbnb listings available. Figure 3 below shows the relationship between review scores and prices.

 

Figure 3 Review Scores and Price

Lastly, before selecting a suitable model for the prices, this research examined the statistics on criminal arrests per borough to develop a clear picture of the situation. These statistics are shown in Figure 4 below.

 

Figure 4 Criminal arrest count by Borough

 

Overall, Manhattan has the most expensive Airbnb pricing among the five NYC boroughs. The higher the review scores of an Airbnb listing, the higher its price. On the other hand, the more arrests within 0.5 miles of an Airbnb listing, the lower its price.

In selecting the best machine learning model, the main variables in the dataset, such as price, latitude, and longitude, among others, had to be transformed to natural logs because they were not normally distributed. A summary of these variables is shown below.

A significant number of these variables were not normally distributed, including price, which is the main dependent variable.
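
To illustrate why a natural-log transform helps with a right-skewed variable such as price, the sketch below compares the skewness of a small set of prices before and after the transform. The price values and the `skewness` helper are hypothetical and are not taken from the project data. The natural log is defined only for positive values.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical nightly prices in dollars, for illustration only.
prices = [45, 60, 75, 80, 95, 120, 150, 200, 350, 900]

# Natural-log transform; every price must be strictly positive.
log_prices = [math.log(p) for p in prices]

print("skewness of raw prices:", round(skewness(prices), 2))
print("skewness of log prices:", round(skewness(log_prices), 2))
```

A few very expensive listings pull the raw distribution to the right. Taking logs compresses that tail, so the transformed values sit closer to a symmetric shape.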

Various machine learning models were trained and tested on the data. In this testing, the Random Forest model yielded the lowest RMSE, followed by Neural Network, XGBoost, and Decision Tree. The predictive ability of these models, measured by root mean squared error (RMSE), was as follows:

Model             RMSE
Decision Tree     0.531896
Random Forest     0.378528
XGBoost           0.40669
Neural Network    0.396163
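
For reference, RMSE is the square root of the mean squared difference between actual and predicted values, and the model with the lowest RMSE is preferred. The sketch below shows the calculation and the comparison. The model names and numbers are hypothetical placeholders, and treating the inputs as log-transformed prices is an assumption based on the transformation described earlier.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical log prices and predictions from two placeholder models.
actual = [4.2, 4.6, 5.0, 5.3]
predictions = {
    "Model A": [4.0, 4.9, 4.6, 5.8],
    "Model B": [4.1, 4.7, 5.1, 5.2],
}

scores = {name: rmse(actual, preds) for name, preds in predictions.items()}
best = min(scores, key=scores.get)  # lowest RMSE is best
print(scores, "best:", best)
```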

 

As a result, the Random Forest model was selected and tuned to improve its accuracy in predicting listing prices.

 

 

The estimated Random Forest model also showed reasonable accuracy in predicting prices. The figure below depicts the fit of this model.

Initially, this model had relatively low predictive ability. After optimization, its predictive power (R-squared) is 0.69, implying that the variables in the model explain 69% of the variation in nightly rental prices.
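
The "share of variation explained" reading above corresponds to the coefficient of determination, R-squared. Treating the reported 0.69 as R-squared is an assumption, since the essay does not name the metric. A minimal sketch of the calculation, with hypothetical values, follows.

```python
import statistics


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical log prices and model predictions, for illustration only.
actual = [4.0, 4.5, 5.0, 5.5, 6.0]
predicted = [4.2, 4.4, 5.1, 5.3, 5.8]
print(round(r_squared(actual, predicted), 3))
```

A value of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than always predicting the mean.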

 

The features that appear most influential for the price of a listing are:

  • listing type (whether it is a home/apartment)
  • location, which is intuitive given that in real estate location is often a deciding factor for price
  • availability and review-related factors
  • certain listing descriptor words indicating the character or location of a listing
  • security (no arrests recorded in the area)

The figure below illustrates the actual and predicted prices for the test dataset in order of increasing price.

 

The following points are worth noting:

The private room dataset yielded the best model accuracy, followed by the home/apartment and shared room datasets.

The lower accuracy for the shared room dataset most likely results from the smaller number of data points available to train that model.

The model predicts with better accuracy for private room listings than for home/apartment listings. This is likely driven by the larger spread of prices within home/apartment listings relative to private room listings, as reflected in the standard deviations of the two samples.

Neighborhoods with high crime rates command the lowest listing prices.

 

Model accuracy also varies by borough. The RMSE for each borough is shown below:

 

Borough          RMSE
Manhattan        0.384
Brooklyn         0.369
Queens           0.37
Bronx            0.419
Staten Island    0.458

 

 

The distribution of listings per borough is shown below:

Borough          Listings
Manhattan        21192
Brooklyn         19801
Queens           5592
Bronx            1071
Staten Island    370

 

In summary, the Random Forest regression model provided the best accuracy for predicting listing price from the variables generated from the initial data, although it tends to underpredict expensive listings. It also underpredicts listings priced relatively low. The model's feature importances can be used to further understand what drives the price of an Airbnb listing in NYC.
