In this post we’re going to model the prices of Airbnb apartments in London. In other words, the aim is to build our own price suggestion model. We will be using data from http://insideairbnb.com/ which we collected in April 2018. This work is inspired by the Airbnb price prediction model built by Dino Rodriguez, Chase Davis, and Ayomide Opeyemi. Normally we would be doing this in R, but we thought we’d try our hand at Python for a change.
We present a shortened version here, but the full version is available on our GitHub.
Data Preprocessing
First, we load the listings from the gzipped CSV file.
import pandas as pd

listings_file_path = 'listings.csv.gz'
listings = pd.read_csv(listings_file_path, compression="gzip", low_memory=False)
listings.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')
The data has 95 columns or features. Our first step is to perform feature selection to reduce this number.
Feature selection
Selection on Missing Data
Features that have a high number of missing values aren’t useful for our model so we should remove them.
import matplotlib.pyplot as plt
%matplotlib inline

percentage_missing_data = listings.isnull().sum() / listings.shape[0]
ax = percentage_missing_data.plot(kind = 'bar', color='#E35A5C', figsize = (16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent Empty / NaN')
ax.set_title('Feature Emptiness')
plt.show()
As we can see, the features neighbourhood_group_cleansed, square_feet, has_availability, license and jurisdiction_names mostly have missing values. The features neighbourhood, cleaning_fee and security_deposit are more than 30% empty, which is too much in our opinion. The zipcode feature also has some missing values, but we can either remove the affected rows or impute the values with reasonable accuracy.
useless = ['neighbourhood', 'neighbourhood_group_cleansed', 'square_feet',
           'security_deposit', 'cleaning_fee', 'has_availability',
           'license', 'jurisdiction_names']
listings.drop(useless, axis=1, inplace=True)
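The list above was picked by reading the bar chart. As a rough alternative, the same kind of selection can be sketched with an explicit cutoff. The example below runs on a small made-up table rather than the real listings, and the names toy, MISSING_THRESHOLD, too_empty and reduced are ours, introduced only for illustration. The 0.3 cutoff is an assumption based on the 30% figure mentioned above. The last step shows one of the two zipcode options, dropping the affected rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; all values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "square_feet": [np.nan, np.nan, np.nan, np.nan, 540.0],           # 80% missing
    "cleaning_fee": [np.nan, "$20.00", np.nan, "$35.00", "$15.00"],   # 40% missing
    "zipcode": ["E1 6AN", None, "SW1A 1AA", "N1 9GU", "W2 3XA"],      # 20% missing
})

# Assumed cutoff, chosen to match the 30% mentioned in the text.
MISSING_THRESHOLD = 0.3

# Fraction of missing values per column (between 0 and 1).
missing_fraction = toy.isnull().mean()

# Names of the columns whose missing fraction exceeds the cutoff.
too_empty = missing_fraction[missing_fraction > MISSING_THRESHOLD].index.tolist()

# Drop those columns, returning a new frame instead of modifying in place.
reduced = toy.drop(columns=too_empty)

# Drop the rows that have no zipcode instead of imputing them.
reduced = reduced.dropna(subset=["zipcode"])

print(too_empty)
print(reduced.shape)
```

On the real data a fixed cutoff may flag columns other than the eight we removed by hand, so the resulting list is worth reviewing before dropping anything.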