Tree Troubles — Predicting Sidewalk Damage Resulting From Trees In NYC

[This article was first published on R – NYC Data Science Academy Blog, and kindly contributed to R-bloggers].

Introduction

Tree roots growing under sidewalks often cause cracking or lifting of the pavement once the tree surpasses a certain size. This creates significant tripping hazards for pedestrians, and liability issues for property owners. Furthermore, the cost of repairing such damage is in excess of $100 million per year in the United States. As such, this project seeks to:

Dataset

In 2015, NYC conducted a volunteer-powered campaign to map, count, and care for all of the city’s street trees. The resulting dataset consists of:

  • 432,564 Live Trees
  • 14,099 Dead Trees
  • 40% Sidewalk Damage
  • Average DBH (diameter at breast height) of 11.6 inches
  • 132 Different Species

The final dataset consisted of the following features: “tree_id”, “year”, “tree_dbh”, “health”, “spc_latin”, “spc_common”, “root_stone”, “root_grate”, “root_other”, “trunk_wire”, “address”, “zipcode”, “boro_name”, “longitude”, “latitude”, “block_code”, “sidewalk”. Details on these terms can be found at the dataset link above.

Technology Pipeline

The technology employed is a mixture of Python, R, and Java. Python scripts are used for data cleaning and merging, as well as for web scraping tree species data from leafsnap.com. R scripts are used for numerical and visual EDA and for running machine learning algorithms. Java is used for the desktop analysis application and for generating heatmaps (see below). Java will also be used to develop a mobile application.

Visual Overview Of Data

To quickly check for a relationship between tree diameter and sidewalk damage, a heatmap is generated by sorting the data by increasing diameter (from 3 to 70 inches). The magenta color represents the different species of trees. The red and green pixels represent “damage” and “no damage” to the sidewalk, respectively. One key takeaway is that there isn’t an obvious relationship between sidewalk condition and tree diameter. Another takeaway is that even though there are 132 tree species in the dataset, only a small number make up most of the trees planted (see bar plot below).
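The sort-and-color step behind the heatmap can be sketched as follows. This is a minimal Python illustration, not the Java code used for the post; the sample records, the `damage_strip` helper, and the "Damage"/"NoDamage" labels are assumptions made for the example, and the magenta species encoding is left out.

```python
RED = (255, 0, 0)    # sidewalk damaged
GREEN = (0, 255, 0)  # sidewalk not damaged


def damage_strip(trees):
    """Return one RGB pixel per tree, ordered by increasing trunk diameter."""
    ordered = sorted(trees, key=lambda t: t["tree_dbh"])
    return [RED if t["sidewalk"] == "Damage" else GREEN for t in ordered]


# Hypothetical records; only the two fields used here are shown.
sample_trees = [
    {"tree_dbh": 30, "sidewalk": "Damage"},
    {"tree_dbh": 3, "sidewalk": "NoDamage"},
    {"tree_dbh": 12, "sidewalk": "Damage"},
]
strip = damage_strip(sample_trees)
```

If diameter alone drove damage, such a strip would shift from green to red along its length; a strip with the two colors mixed throughout matches the "no obvious relationship" reading above.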

Variable Association

The association between each predictor variable in the dataset and sidewalk condition is also measured using either a Cramér's V function or the ICC package in R. The strength of association ranges from 0 to 1, with a value of 1 indicating perfect association between two variables.
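For reference, Cramér's V is computed from the chi-squared statistic of the contingency table of two categorical variables: V = sqrt(chi² / (n · (k − 1))), where n is the number of observations and k is the smaller of the two category counts. A minimal Python sketch, assuming no bias correction and using made-up sample values (the post's own R function is not shown), might look like this:

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of category labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count for each (x, y) cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals))
    if k < 2:
        return 0.0  # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


# Hypothetical values: root_stone perfectly tracks sidewalk condition here.
root_stone = ["Yes", "Yes", "No", "No"]
sidewalk = ["Damage", "Damage", "NoDamage", "NoDamage"]
v_perfect = cramers_v(root_stone, sidewalk)

# Hypothetical values: the two variables are independent here.
v_none = cramers_v(["Yes", "Yes", "No", "No"],
                   ["Damage", "NoDamage", "Damage", "NoDamage"])
```

The two sample calls cover the ends of the scale: the first pair of variables is perfectly associated and the second is independent.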

Clustering (Unsupervised Learning)

Clustering was done by first generating a dissimilarity matrix using the “gower” distance, then using the “pam” function to find the best number of clusters. Using sample datasets (1,000 observations) containing all the geolocation-related features (i.e., address, zipcode, boro name, longitude, latitude), the optimal number of clusters found is 6. These clusters more or less correspond to the borough the trees are located in. See image below.

With all geolocation-related features removed except longitude and latitude, the optimal number of clusters is found to be 2, which corresponds to the sidewalk condition of either damaged or not damaged.
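To make the two clustering ingredients concrete, here is a small Python sketch of a Gower dissimilarity matrix followed by a k-medoids search. It is an illustration under stated assumptions, not the R code used for the post: numeric features are scaled by their range, categorical features contribute 0 or 1, all features are weighted equally, and the medoid search uses a naive initialisation plus greedy swaps instead of the full build phase of PAM. The function names and the four sample rows are hypothetical, and choosing the number of clusters is not covered.

```python
def gower_matrix(rows, numeric, categorical):
    """Pairwise Gower dissimilarity for records with mixed feature types."""
    ranges = {}
    for f in numeric:
        values = [r[f] for r in rows]
        ranges[f] = max(values) - min(values)
    n = len(rows)
    n_feats = len(numeric) + len(categorical)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for f in numeric:
                if ranges[f] > 0:  # a constant feature adds no distance
                    total += abs(rows[i][f] - rows[j][f]) / ranges[f]
            for f in categorical:
                total += 0.0 if rows[i][f] == rows[j][f] else 1.0
            dist[i][j] = dist[j][i] = total / n_feats
    return dist


def total_cost(dist, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def k_medoids(dist, k):
    """Greedy swap search: accept any medoid swap that lowers total cost."""
    n = len(dist)
    medoids = list(range(k))  # naive start: the first k points
    best = total_cost(dist, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels


# Hypothetical sample: two small Bronx trees and two large Queens trees.
rows = [
    {"tree_dbh": 3, "boro_name": "Bronx"},
    {"tree_dbh": 5, "boro_name": "Bronx"},
    {"tree_dbh": 30, "boro_name": "Queens"},
    {"tree_dbh": 33, "boro_name": "Queens"},
]
dist = gower_matrix(rows, numeric=["tree_dbh"], categorical=["boro_name"])
medoids, labels = k_medoids(dist, k=2)
```

Each entry of `labels` is the index of the medoid that row is assigned to, so rows sharing a label belong to the same cluster.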

Classification (Supervised Learning)

The R caret package was used to run various machine learning classification algorithms on the full dataset using the typical 80/20 (train/test) split validation method. The accuracy results are outlined below.
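The 80/20 split and the accuracy metric can be sketched in a few lines of Python. This is only an illustration of the validation idea; the post itself used caret in R, and the helper names, the fixed seed, and the sample labels below are assumptions.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical stand-in for the tree records: 100 row ids.
train_rows, test_rows = train_test_split(list(range(100)))

# Hypothetical true and predicted sidewalk labels for four test trees.
score = accuracy(
    ["Damage", "NoDamage", "Damage", "Damage"],
    ["Damage", "NoDamage", "NoDamage", "Damage"],
)
```

A model is fitted on `train_rows` only, and `accuracy` is then computed on predictions for `test_rows`, so the reported figure reflects trees the model has not seen.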

Overall, the accuracy results for these algorithms were fairly close and, given the nature of the problem, the simple logistic regression models were found to be well suited for use in the analysis application described below. As for which features are most important in determining sidewalk damage, both the tree-based and logistic regression models are in overall agreement that having blocks around the trees (root_stone), tree diameter, and location play important roles.
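As a rough picture of what a logistic regression model does with such features, the sketch below fits one by plain batch gradient descent in Python. It is not the caret model from the post: the two features (a 0/1 root_stone flag and a diameter scaled to the 0 to 1 range), the six training rows, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_feats = len(X), len(X[0])
    w = [0.0] * n_feats
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feats
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of sidewalk damage for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical rows: [root_stone flag, scaled diameter]; 1 = damage.
X = [[0, 0.1], [0, 0.2], [0, 0.3], [1, 0.6], [1, 0.8], [1, 0.9]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)

p_high = predict_proba(w, b, [1, 0.9])  # large tree with stones at the root
p_low = predict_proba(w, b, [0, 0.1])   # small tree without stones
```

The fitted weights give a direct reading of feature influence: a positive weight raises the predicted probability of damage as that feature increases, which is the kind of agreement on root_stone and diameter described above.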

Analysis Application “NYC Tree Insights”

To make the models useful to non-technical users, a desktop application has been developed that analyzes the “dead trees” data to predict the potential for sidewalk damage at various points in the future (10, 20, 30, 50, and 75 years).

The application also allows for rapid visual analysis by making use of bar plots and links to Google Maps to view the area and even the dead tree in question.

Conclusion

Using the NYC 2015 Tree Census dataset, a classification model with an accuracy of over 75% in predicting root-induced sidewalk damage was developed. Moreover, a Java-based desktop application was developed around this model to help stakeholders assess the likelihood of sidewalk damage in the future if a certain species of tree is planted at a particular location. The next steps for this project are:

The post Tree Troubles — Predicting Sidewalk Damage Resulting From Trees In NYC appeared first on NYC Data Science Academy Blog.
