In this blog post, we’d like to walk you through a use case example of the forester package. We will present a package usage scenario with a real-life story in the background, including code examples, outcome analysis and comments. The package concept is described in our previous blog posts, which introduce the package and describe it in greater detail.
Let’s imagine that we are a young apartment owner moving from Lisbon to Warsaw, and due to a lack of savings, we decide to sell our apartment in order to buy a new one in Poland. Our decision is rushed, because next month we are starting a new job as a researcher at MI².AI, and we don’t know much about the real estate market in Portugal. Luckily, as skilled data scientists, we’ve managed to scrape information about real estate properties and created the lisbon dataset. It isn’t much, but it will have to do for our case. The first observations are presented below.
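A quick way to take that first look ourselves is a plain head() call; a minimal sketch, relying on the lisbon dataset bundled with the forester package, as used throughout this post:

library(forester)  # the package also bundles the example dataset
data(lisbon)       # load the scraped Lisbon real estate data
head(lisbon)       # preview the first observations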
Firstly, we want to discover what is inside our scraped dataset and whether it is good enough for our analysis. Typically we would start writing an exhaustive exploratory data analysis script; however, the forester package offers a function that provides basic information about the dataset. We import the package, load the dataset and create a check_data() report by providing the dataset and the name of our target column, which is called Price. The necessary code and the outcomes are visible below.
library(forester)
data(lisbon)
# run the data check report for the Price target column
check <- check_data(lisbon, 'Price')
From the report above we can learn that the dataset isn’t perfect, and we can detect some issues with the column values.
- Firstly, we identify the static columns (every observation has the same value): Country, District, and Municipality.
- We also find out that AreaNet is highly correlated with AreaGross, and the same happens with the pair PropertyType and PropertySubType. This means that the columns in each pair provide the same information to our future model (we can verify this ourselves, as shown after this list).
- Lastly, we find out that the column named Id might not provide any information, as it is an index column.
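These findings are easy to double-check by hand. A quick base R sanity check (our own addition, not part of the forester report):

# numeric correlation between the two area columns
cor(lisbon$AreaNet, lisbon$AreaGross)
# cross-table of the two property columns; a one-to-one pattern means redundancy
table(lisbon$PropertyType, lisbon$PropertySubType)
# static columns have exactly one unique value
sapply(lisbon[c('Country', 'District', 'Municipality')], function(col) length(unique(col)))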
To address the main problems with our dataset, we decide to drop the aforementioned columns in order to get better results. The data check report also notes duplicated columns, missing values, outliers, and an unusual distribution of the Price column; however, these issues are acceptable in our case, so we will ignore them.
library(dplyr)  # select() comes from dplyr
lisbon <- select(lisbon, -c('Country', 'District', 'Municipality', 'AreaNet', 'PropertyType', 'Id'))
At this point, we already know a bit about our dataset, so it is time to create the first models. To do that, we use the train() function, which wraps the whole AutoML pipeline. Typically we would provide only the two necessary parameters, but as we want fast baseline results and we’ve already run check_data(), we decide to skip some of the modules (we turn off the random search algorithm, Bayesian optimisation and the printing of messages).
output_1 <- train(data = lisbon, y = 'Price', bayes_iter = 0, random_evals = 0, verbose = FALSE)
head(output_1$score_test)
The output of the train() function is complex, but we will focus on the ranked list. In the table below, we can see all trained models with a few metrics calculated on the test subset. The first model scored 0.77 on the R2 metric, which is relatively high. Not only is its R2 the best, but so are its MSE (mean squared error) and MAE (mean absolute error). We could already use that model to predict our house price, but let’s see if we can do even better with different parameters.
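As a refresher, these metrics are simple to compute by hand. A minimal sketch with hypothetical truth and prediction vectors (not taken from the actual test set):

# hypothetical true prices and model predictions, for illustration only
y_true <- c(300000, 250000, 410000, 180000)
y_pred <- c(320000, 240000, 400000, 200000)

mse <- mean((y_true - y_pred)^2)   # mean squared error
mae <- mean(abs(y_true - y_pred))  # mean absolute error
r2  <- 1 - sum((y_true - y_pred)^2) / sum((y_true - mean(y_true))^2)  # R squared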
We want to improve the models by changing their hyperparameters. Doing that manually would require a lot of effort and expertise. Thankfully, the train() function has an option to do it automatically. We set bayes_iter and random_evals to 20, which runs the related tuning methods during the training.
output_2 <- train(data = lisbon, y = 'Price', bayes_iter = 20, random_evals = 20, verbose = FALSE, sort_by = 'mse')
output_2$score_test
With Bayesian optimisation, we improved the R2 metric of the best model from 0.77 to 0.91. There is also a vast improvement in the MSE. The best model is now xgboost trained with Bayesian optimisation. It looks very promising, but to be sure it is reliable, let’s explain how these outcomes were achieved.
Fortunately, the forester package provides an interface for easy integration with the DALEX package, a well-known explainable artificial intelligence (XAI) solution. With just a few steps, we can create an explainer and a feature importance plot that shows which columns were most important for the model.
library('DALEX')
# build a forester explainer around the best model and its test data
ex <- forester::explain(models = output_2$best_models[[1]], test_data = output_2$test_data, y = output_2$y)
# compute permutation-based feature importance
model_parts <- DALEX::model_parts(ex$xgboost_bayes)
plot(model_parts, max_vars = 5)
From the plot above, we can see that the most important factors for the xgboost model were the area of the apartment, the number of bathrooms, the latitude (which roughly translates to the distance from the city centre), the condition of the apartment and its price per square metre. All these factors seem very reasonable to us, so we conclude that the model behaves understandably and we can trust it.
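DALEX can also explain single predictions. As a sketch (assuming the ex$xgboost_bayes explainer created above), a break-down plot decomposes one test observation’s predicted price into per-variable contributions:

# local explanation for the first test observation (illustrative addition)
bd <- DALEX::predict_parts(ex$xgboost_bayes, new_observation = output_2$test_data[1, ])
plot(bd)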
At this point, we have checked our data, trained many models, and explained the best one. But we want all this information in one place! To get it, we can create a report with the report() function. It creates a PDF or HTML file that presents information about the data and models in a formal, clear way. The report will be covered in detail in a future blog post.
report(output_2)
Now that we have a model, we can predict the value of our apartment. We create an observation with all the required information about it. We choose the best model created by the forester package and make a price prediction for our observation, which equals 214 156 euros. Now we can save the model for the future and add the predicted price to our advertisement!
# our apartment, described with the full set of original columns
x <- data.frame(
  'Condition' = 'Used',
  'PropertyType' = 'Homes',
  'PropertySubType' = 'Apartment',
  'Bedrooms' = 3,
  'Bathrooms' = 2,
  'AreaGross' = 320,
  'Parking' = 1,
  'Latitude' = 38.7323,
  'Longitude' = -9.1186,
  'Parish' = 'Estrela',
  'Price.M2' = 4005,
  'Country' = 'Portugal',
  'District' = 'Lisbon',
  'Municipality' = 'Lisbon',
  'AreaNet' = 160,
  'Id' = 111,
  'Price' = 0)
predictions <- predict_new(output_2, data = x)
predictions$xgboost_bayes
save(output_2)
In the next blog post, we’d like to describe in detail the main function of the forester package: train(). We will focus on showing how its particular steps work and how the user can shape the training process.
If you are interested in other posts about explainable, fair and responsible ML, follow #ResponsibleML on Medium.