Site icon R-bloggers

Understanding the Effect of Subsidies on Agriculture with Neural Networks

[This article was first published on DataGeeek, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Today, subsidies are the most common method of encouraging countries to deal with global warming; but it is debatable how effective they are. For instance, agriculture subsidies cause farmers to violate forest frontier and make them responsible for 14% of global deforestation every year. Not to mention excessive use of fertilizers degrades the soil and water and eventually damages human health.

In this article, we will analyze whether subsidies are effective for crop yields in agriculture. First, we will build our datasets.

library(tidyverse)
library(countrycode)
library(openxlsx)

#Crop yields
df_yields <- 
  openxlsx::read.xlsx("https://github.com/mesdi/blog/raw/main/agg_sub.xlsx", 
                      sheet = "yields") %>% 
  as_tibble() %>% 
  janitor::clean_names() %>%
  group_by(location, time) %>% 
  summarise(yields = sum(value))

#Agricultural support  
df_supp <- 
  openxlsx::read.xlsx("https://github.com/mesdi/blog/raw/main/agg_sub.xlsx", 
                      sheet = "support") %>% 
  as_tibble() %>% 
  janitor::clean_names() %>% 
  rename(supp = value)
 
#Merging and tidying both datasets
df_tidy <- 
  df_yields %>% 
  left_join(df_supp, by = c("location","time")) %>%  
  mutate(region = countrycode(location, 
                              "genc3c",
                              "un.region.name")) %>% 
  drop_na() %>% 
  mutate(region = factor(region),
         id = as.numeric(region))

We will first examine the distribution of agricultural support for producers in millions of US dollars in a box plot and the line chart of crop yields in thousand hectares by continent (region). We will interactively do this by using the ggiraph package.

#theme settings
color_palette <- thematic::okabe_ito(5)
names(color_palette) <- unique(df_tidy$region)
base_size <- 32

#Google  setting
library(systems)
library(showtext)

_add_google("Montserrat Alternates","ma")  
showtext_auto()

base_family <- "ma"

#Interactive plots
library(ggiraph)
library(patchwork)
library(scales)
#Boxplot
box_graph <- 
  df_tidy %>%
  group_by(region, time, id) %>% 
  summarise(mean_supp = mean(supp)) %>%
  ungroup() %>% 
  group_by(region, id) %>% 
  mutate(median_supp = median(mean_supp)) %>% 
  ggplot(aes(x = mean_supp, 
             y = fct_reorder(region, median_supp, .desc = TRUE),# for ranking
             fill = region, 
             data_id = id)) +
  geom_boxplot_interactive(
    aes(
      tooltip = glue::glue("Median: {scales::number(median_supp, 
                                                    scale_cut = cut_short_scale(), 
                                                    accuracy = 0.01)}")
    ),
    position = position_nudge(y = 0.25), 
    width = 0.5
  ) +
  geom_point_interactive(
    aes(col = region),
    position = position_nudge(y = -0.2),
    size = 5,
    shape = '|',
    alpha = 0.75
  ) +
  scale_fill_manual(values = color_palette) +
  scale_color_manual(values = color_palette) +
  scale_x_log10(labels = label_number(scale_cut = cut_short_scale())) +
  labs(
    x = "",
    y = "",
    title = glue::glue('Distribution of Agricultural Support (million US dollars)')
  ) + 
  theme_minimal(base_size = base_size,
                base_family = base_family) +
  theme(
    text = element_text(
      color = 'grey20'
    ),
    legend.position = 'none',
    panel.grid.minor = element_blank(),
    plot.title.position = 'plot',
    plot.title = element_text(size = 19,
                              hjust = 0.5,
                              face = "bold")
  )

#Line chart
line_graph <- 
  df_tidy %>% 
  group_by(region, time,id) %>% 
  summarise(mean_yields = sum(yields)) %>% 
  mutate(line_text = glue::glue('{region}\n{time}\n{scales::number(mean_yields, 
                                                                   scale_cut = cut_short_scale(), 
                                                                   accuracy = 0.01)}')
  ) %>% 
  ggplot(aes(x = time, 
             y = mean_yields, 
             col = region, 
             data_id = id)) +
  geom_line_interactive(linewidth = 2.5) +
  geom_point_interactive(aes(tooltip = line_text),
                         size = 6) +
  scale_y_log10(labels = label_number(scale_cut = cut_short_scale())) +
  theme_minimal(base_size = base_size,
                base_family = base_family) +
  labs(
    x = "",
    y = "",
    title = 'Crop yields over time (thousand hectares)'
  ) +
  theme(
    text = element_text(
      color = 'grey20'
    ),
    legend.position = 'none',
    panel.grid.minor = element_blank(),
    plot.title.position = 'plot'
  ) +
  scale_color_manual(values = color_palette) +
  theme(
    plot.title = element_text(size = 19, 
                              hjust = 0.5,
                              face = "bold")
  )
  

 
#Merging the two plots in an interactive way
girafe(
  ggobj = 
    box_graph + 
    plot_spacer() + 
    line_graph + 
    plot_layout(widths = c(0.45, 0.1, 0.45)),
  options = list(
    opts_hover(css = ''),
    opts_hover_inv(css = "opacity:0.1;"),
    opts_sizing(width = 0.7),
    opts_tooltip(
      offx = 25, 
      offy = 0, 
      use_fill = TRUE,
      css = '-size:larger;padding:5px;-family:Montserrat Alternates;'
    )
  ),
  width_svg = 16,
  height_svg = 9
)

When we look at the box plot, we can say that Europe is left skewed which means the majority of observations are above the mean; on the contrary, we can see that the Americas is right skewed which means the majority of the observations are below the mean.

If we view the crop yields graph, we see that Europe and Asia had a tremendous increase between 1994-1995. Besides that, Asia and America look dominantly the leading continents regarding agricultural support and crop yields.

Let’s get back to the main purpose we mentioned earlier in this article. To achieve this, we will use the bagged neural network algorithm to model our data set. But we have to transform all numeric data to the same scale and also convert categorical data to numeric values so that the model works.

#Modeling
library(tidymodels)
#Splitting the data by 80/20
set.seed(12345)
df_split <- initial_split(df_tidy, 
                          strata = region,
                          prop = 0.8)

df_train <- training(df_split)
df_test <- testing(df_split)

#Preprocessing
df_rec <- 
  recipe(yields ~ region + supp + time, 
         data = df_train) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(region, one_hot = TRUE)


#Bagged neural networks via nnet
library(baguette)
#cross-validation for resamples
set.seed(12345)
df_folds <- vfold_cv(df_train,
                     strata = region)

spec_bag_nn <- 
  bag_mlp() %>% 
  set_engine("nnet") %>% 
  set_mode("regression") 

wflw_bag_nn <- 
  workflow() %>% 
  add_recipe(df_rec) %>% 
  add_model(spec_bag_nn)


set.seed(98745)
nn_fit <- 
  wflw_bag_nn %>% 
  last_fit(df_split)
#Computes final the accuracy metrics 
collect_metrics(nn_fit)
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.613 Preprocessor1_Model1
2 rsq     standard       0.721 Preprocessor1_Model1

The R-Squared value (rsq) looks fine; we can now create an explainer object based on that. We will build two explainer objects. First, we will use the permutation feature importance.

#vip
library(DALEXtra)
#Creating a preprocessed data frame
df_proc <- 
  df_rec %>%
  prep() %>%
  bake(new_data = NULL)

#building the explainer-object
explainer_nn <- 
  explain_tidymodels(
    spec_bag_nn %>% 
      fit(yields ~ ., data = df_proc),
    data = df_proc %>% select(-yields), 
    y = df_train$yields,
    verbose = FALSE
  )

set.seed(12345)
#Calculates the feature-importance over 100 permutations
vip_nn <- 
  model_parts( 
    explainer = explainer_nn, 
    loss_function = loss_root_mean_square, 
    B = 100, #the number of permutations
    type = "difference",
    label =""
  )

As a second, we will build a partial dependence portfolio object.

#Partial dependence portfolio 
pdp_supp <- model_profile(explainer_nn,
                          N = 500,
                          variables = "supp")

Finally, we have the ability to plot both of the explainer plots.

#Plotting ranking by the feature importance scores
p_feature <- 
  plot(vip_nn) +
  ggtitle("Feature Importance over 100 Permutations", "")+
  theme(plot.title = element_text(hjust = 0.5,
                                  face = "bold",
                                  size = 18),
        text = element_text(family = "ma"),
        axis.text.y = element_text(size = 18),
        axis.title.x = element_text(size = 14))


#Plotting partial dependence of yields to subsidies(supp)
p_pdp <- 
  as_tibble(pdp_supp$agr_profiles) %>% 
  ggplot(aes(`_x_`, 
             `_yhat_`)) +
  geom_line(data = as_tibble(pdp_supp$cp_profiles),
            aes(x = supp, 
                group = `_ids_`),
            size = 0.5, 
            alpha = 0.05, 
            color = "gray50") +
  geom_line(color = "midnightblue", 
            size = 1.2, 
            alpha = 0.8) +
  labs(x="",
       y="")+
  theme_minimal(base_family = "ma",
                base_size = 15)+ 
  labs(title= "Partial Dependence of Yields to Subsidies")+
  theme(plot.title = element_text(hjust = 0.5, 
                                  face = "bold"),
        panel.grid.minor = element_blank())

#Merging the plots
p_feature + p_pdp

Based on the feature importance graph, it appears that the variables with the most significance for predicting results are supp, region_Asia, and region_America. The other variables look like nearly no effect on the prediction; in fact, the variables with negative scores may even hurt the model.

We created a partial dependence graph for the feature importance graph’s most effective variable (supp), which is shown on the right-hand graph. In most cases, there’s a positive linear relationship between subsidies and crop yields on the positive side of the graph. However, it tends to flatten out towards the end. So maybe we can conclude that the effect of subsidies is fading in terms of agricultural yields. Don’t forget that the values are normalized for the modeling.

To leave a comment for the author, please follow the link and comment on their blog: DataGeeek.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version