[This article was first published on business-science.io, and kindly contributed to R-bloggers].
Time Series Demand Forecasting of Brazilian Commodities
Demand forecasting is a technique for estimating the probable demand for a product or service. It is based on an analysis of past demand for that product or service under present market conditions. It should be done on a scientific basis, taking into account the facts and events that affect demand.
After gathering information about various aspects of the market and about past demand, it is possible to estimate future demand. This is what we call demand forecasting.
For example, suppose we sold 200, 250, and 300 units of product X in January, February, and March respectively. From that history we can estimate the April demand for product X, Y units; the steady increase of 50 units per month, for instance, suggests roughly 350 units.
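The example above can be worked through in a few lines of base R. This is only a minimal sketch that assumes a straight-line trend through the three months; it is not the method used later in the article, and the names `sales`, `fit` and `april` are illustrative.

```r
# Monthly sales of product X: January = 1, February = 2, March = 3.
sales <- data.frame(month = 1:3, units = c(200, 250, 300))

# Fit a straight line through the three points (assumption: linear trend).
fit <- lm(units ~ month, data = sales)

# Extrapolate one month ahead to April (month 4).
april <- predict(fit, newdata = data.frame(month = 4))
round(unname(april))  # 350
```

With only three points this is an illustration, not a forecast to rely on; real series also carry seasonality and noise, which is why the rest of the project uses time series models.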
Demand forecasting key advantages:
More effective production scheduling
Inventory management and reduction
Cost reduction
Optimized transport logistics
Increased customer satisfaction
Those are just some of the benefits of forecasting demand. Since this task applies to almost any business area, the concepts and approaches used here can be extrapolated to your problem too, with specific adjustments.
This project aims to forecast the next 3 years of exports of the top 3 Brazilian commodities: soybean, corn and sugar. Beyond that, I’ll also cover the following steps of a data science project: exploratory data analysis, data preparation, data cleaning, feature engineering and modeling.
The dataset used here comes from a publicly available source.
Software Requirement
If you want to reproduce the project in your environment, install the following packages first, then load them.
Exploratory Data Analysis
As mentioned before, the dataset came from a public source called COMEX STAT. This website provides free access to Brazilian foreign trade statistics.
As usual, I create a dictionary of the dataset I’m working with, just to keep in mind the meaning of each variable. See the following list:
date: the date on which the export or import transaction occurred (our time series information).
The data contains monthly tracking information on imports and exports of a range of products, by Brazilian state, by route (air, sea, ground, etc.) and by origin or destination country.
At the beginning of the process it is a good idea to take a general overview of the data. For that I love the skimr::skim() function, which is very handy for getting the big picture of your data.
As we can see, our dataset has:
1 date variable
5 categorical variables
2 numerical variables
Luckily, there is no missing data in any of these columns. Two things stand out when we look at the type and route variables.
type: Brazil is a country that exports more than it imports (more than 100k observations are in the export category).
route: The route Brazil uses least (considering exports and imports) is air, and the most used is sea.
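The kind of overview described above (column types, missing values, counts per category) can be sketched in base R. The data frame below is a tiny hypothetical stand-in, and its column names and values are assumptions rather than the real COMEX STAT schema; on the real data, skimr::skim() gives a richer version of the same summary.

```r
# Hypothetical stand-in for the export data (names and values are made up).
exports <- data.frame(
  date    = as.Date(c("2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01")),
  product = c("soybeans", "corn", "sugar", "soybeans"),
  type    = c("Export", "Export", "Import", "Export"),
  route   = c("Sea", "Sea", "Air", "Ground"),
  tons    = c(1200, 800, NA, 950),
  stringsAsFactors = FALSE
)

# One row per column: its type and how many values are missing.
col_types <- vapply(exports, function(col) class(col)[1], character(1))
n_missing <- colSums(is.na(exports))
overview  <- data.frame(type = col_types, n_missing = n_missing)
print(overview)

# Counts per category, as used for the type and route observations.
type_counts  <- table(exports$type)
route_counts <- table(exports$route)
print(type_counts)
print(route_counts)
```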
Production over time
We also saw the range of the date feature: from 1997/01/01 to 2019/01/12. Looking more closely at this feature, we can investigate how Brazil’s exports, across all states and destinations, evolved over time.
On this monthly chart, we see a more pronounced growth trend in exports from March to August.
As expected, Brazilian exports follow a growing trend over the years. Even though 2020 brought a terrible result because of COVID-19, exports are likely to return to the initial trend next year.
Most Important Commodities
We saw before that our data has 6 different commodities: Soybean, Sugar, Soybeans Meal, Corn, Soybean Oil and Wheat. Let’s look at them and see which have been exported the most in the last 5 years.
The plot above shows the top 3 commodities exported by Brazil over the last 5 years: soybean, corn and sugar, with soybean being the most important. Compared with the others, soybean exports are 55.5% higher than the second (corn) and 63.2% higher than the third (sugar), an enormous difference.
Routes
The commodities we have seen so far are exported by different routes: sea, ground, air, river and others. Let’s investigate whether there is a preferred route for each product.
Before building those visualizations, it is a good idea to keep in mind the most used routes, considering all products, to establish the big picture. Look at the table below:
Now let’s see what is happening by product and route:
Although most products are transported by sea (table above), we observe that each route has its own preferred products.
Considering the three major products exported by each route, we have:
Sea: sugar, soybean and soybeans meal.
Ground: soybean oil, sugar and corn.
Air: corn (much more), soybean and sugar.
Other: sugar, soybean oil and corn.
River: soybean, corn and sugar.
And a closer look at soybean gives us the following chart:
Yes, a very high concentration in the sea route.
Trade Partners
Let’s look at our data from another perspective: trade partners. Brazil has a lot of trade partners, the countries it exports to and imports from the most, and it is a good idea to know which countries Brazil has been doing business with.
It’s clear that China is our most important export trade partner. The other positions have rotated between the Netherlands, Spain and Iran.
Now, from the import perspective:
Brazil’s major import trade partners alternate between Argentina, Paraguay, and the USA. Curiously, the participation of the USA seems to have been decreasing over time, as opposed to Argentina’s.
States and Commodities
Brazil is a huge country, the fifth largest in the world, and this gives it different temperature ranges depending on the area you’re looking at. This geographical aspect leads to crops being grown in some regions rather than others.
Let’s see which regions our commodities come from, in terms of exports:
Mato Grosso concentrates most of the exports of soybean oil, soybeans and soybean meal, along with Rio Grande do Sul and Paraná. São Paulo, on the other hand, has a strong share of sugar exports. Very few states export wheat; the most significant values come from Rio Grande do Sul, Paraná and Santa Catarina.
Data Modeling
As we know, there are 3 time series for which we want to predict the next 3 years of demand in tons: soybean, corn and sugar. I’ll model each separately, because this way it is easier to understand the underlying rationale behind the data.
It is important to say that I already tried different feature engineering approaches, and here I’ll show you what performed best. Another point to clarify is that I’ll be using the modeltime framework workflow, integrated with tidymodels principles (quick review below).
As this is the first part of this project, we’ll use just the ARIMA family of models. More advanced topics on modeling and feature engineering will be covered in the second part.
A quick review of modeltime
For those who aren’t familiar with the modeltime framework, it’s an R package that sets up a time series analysis workflow in a very streamlined way. The package works as an extension of tidymodels applied to time series problems.
I’ll summarize here the main steps of the workflow:
Collect data and split into training and test sets
Create & Fit Multiple Models
Add fitted models to a Model Table
Calibrate the models to a testing set
Perform Testing Set Forecast & Accuracy Evaluation
Refit the models to Full Dataset & Forecast Forward
For more details, see the modeltime documentation.
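The six steps above can be sketched end to end in base R, so that the example runs without extra packages. This is not the author's code: it uses the built-in AirPassengers series as stand-in data, a two-year test window, and two hand-picked ARIMA specifications, all of which are assumptions made for illustration. In modeltime, the corresponding verbs are modeltime_table(), modeltime_calibrate(), modeltime_accuracy(), modeltime_forecast() and modeltime_refit().

```r
# Stand-in data: a built-in monthly series, log-transformed.
y <- log(AirPassengers)

# 1. Split into training and test sets (last 2 years held out).
train <- window(y, end = c(1958, 12))
test  <- window(y, start = c(1959, 1))

# 2. Define several candidate models and a helper that fits one of them.
specs <- list(
  airline      = list(order = c(0, 1, 1), seasonal = c(0, 1, 1)),
  ar1_seasonal = list(order = c(1, 1, 0), seasonal = c(0, 1, 0))
)
fit_spec <- function(spec, data) {
  arima(data, order = spec$order,
        seasonal = list(order = spec$seasonal, period = 12))
}

# 3. Keep the fitted models together (the role of a model table).
models <- lapply(specs, fit_spec, data = train)

# 4-5. Forecast the test window and evaluate accuracy (RMSE on the log scale).
rmse <- vapply(models, function(m) {
  pred <- predict(m, n.ahead = length(test))$pred
  sqrt(mean((as.numeric(test) - as.numeric(pred))^2))
}, numeric(1))
best <- names(which.min(rmse))

# 6. Refit the best specification on the full data and forecast 36 months ahead.
final <- fit_spec(specs[[best]], y)
fc <- predict(final, n.ahead = 36)
forecast_tbl <- data.frame(
  value = exp(as.numeric(fc$pred)),                  # back on the original scale
  lower = exp(as.numeric(fc$pred - 1.96 * fc$se)),   # approximate 95% interval
  upper = exp(as.numeric(fc$pred + 1.96 * fc$se))
)
head(forecast_tbl)
```

Storing the specifications separately from the fitted models is what makes step 6 simple: the winning specification is fitted again on all the data, so its coefficients can change.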
Soybean
The first thing to do is set up our tibble with the right timestamp. As we already know, the dataset has a monthly periodicity, equally spaced (a regular time series).
Just by analyzing this visualization, we can see a clear annual seasonality with multiplicative behavior (the seasonal swings grow as the values grow over time). We can verify these assumptions with ACF/PACF charts.
Here we confirm the high correlation at annual lags, and also high partial correlations at lags 9, 10 and 11. It is possible to use those lags as features to improve performance, but here we’ll work with forecast::auto.arima() (the auto_arima engine in modeltime), which searches for suitable lags automatically during training.
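The annual-lag check described above can be reproduced numerically with base R's acf() and pacf(). The sketch below uses the built-in AirPassengers series as stand-in data (an assumption, since the export series is not reproduced here); it is log-transformed and differenced once so that the trend does not dominate the correlations.

```r
# Stand-in monthly series with annual seasonality; log + first difference.
y <- diff(log(AirPassengers))

# Compute the correlations without drawing the plots.
acf_vals  <- acf(y,  lag.max = 24, plot = FALSE)
pacf_vals <- pacf(y, lag.max = 24, plot = FALSE)

# acf() starts at lag 0, so element 13 is lag 12 (one year back).
acf_lag12 <- acf_vals$acf[13]

# pacf() starts at lag 1, so elements 9 to 11 are lags 9, 10 and 11.
pacf_lags_9_to_11 <- pacf_vals$acf[9:11]

acf_lag12
pacf_lags_9_to_11
```

A large value at lag 12 relative to its neighbours is the numeric counterpart of the spike seen at annual lags in an ACF chart.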
Here we see that there is quarterly seasonality: exports increase in every second and third quarter.
Now we have a big picture of what is happening. The second and third quarters of practically all years are darker, indicating a higher amount of exports in those periods.
Modeling soybean time series
The first step is to transform our data with a Box-Cox transformation, a power transformation used to stabilize the variance of a series. Since we’ll be using the ARIMA family, it makes sense to work this way.
We will also keep track of the lambda value, which we need to back-transform our data after the modeling phase.
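To make the role of lambda concrete, here is a minimal base R sketch of the Box-Cox transformation and its inverse. The values in `exports_tons` and the choice `lambda = 0.3` are made up for illustration; in practice lambda is estimated from the data, and timetk provides box_cox_vec() and box_cox_inv_vec() for the same purpose.

```r
# Box-Cox transform: log(x) when lambda is 0, otherwise (x^lambda - 1) / lambda.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform: needs the same lambda that was used going forward.
box_cox_inverse <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

exports_tons <- c(120, 340, 560, 910, 1500)  # hypothetical values
lambda <- 0.3                                # illustrative, not estimated

transformed <- box_cox(exports_tons, lambda)
recovered   <- box_cox_inverse(transformed, lambda)

all.equal(recovered, exports_tons)  # TRUE
```

Without the stored lambda the forecasts could not be converted back into tons, which is why it is kept alongside the model.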
We don’t have much data to work with: our time series has 265 observations. So I split the data into 5 years for assessment and used the rest for training.
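The split sizes follow directly from those numbers: 5 years of monthly data is 60 observations, which leaves 205 for training. A minimal base R sketch is shown below, with placeholder values and an assumed start date of January 1997; timetk's time_series_split() with assess = "5 years" is the tidy equivalent.

```r
# 265 monthly timestamps with placeholder values (start date is an assumption).
dates  <- seq(as.Date("1997-01-01"), by = "month", length.out = 265)
series <- data.frame(date = dates, value = seq_along(dates))

# Hold out the last 5 years (60 months) for assessment.
assess_months <- 5 * 12
cutoff <- nrow(series) - assess_months

training   <- series[seq_len(cutoff), ]
assessment <- series[-seq_len(cutoff), ]

nrow(training)    # 205
nrow(assessment)  # 60
```

The split is chronological on purpose: shuffling rows, as is common for cross-sectional data, would leak future information into the training set.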
Now we can start working with the modeltime workflow shown before.
A brief explanation of the auto-ARIMA implementation: the algorithm searches over the p, d, q and P, D, Q parameters and uses an information criterion such as AIC to choose between candidate models. Unlike R-squared, AIC is a relative score: lower is better, and it penalizes extra parameters, so it is only meaningful for comparing models fitted to the same data.
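The idea of choosing a model by AIC can be illustrated with a hand-written comparison in base R. This is a simplified sketch, not the search that auto-ARIMA actually performs: it tries only three hand-picked candidates on the built-in AirPassengers series (stand-in data), all with the same differencing so that their AIC values are comparable.

```r
# Stand-in data: log of a built-in monthly series.
y <- log(AirPassengers)

# Candidate (p, d, q)(P, D, Q)[12] specifications; same d and D in all of them.
candidates <- list(
  "ARIMA(0,1,0)(0,1,0)[12]" = list(order = c(0, 1, 0), seasonal = c(0, 1, 0)),
  "ARIMA(1,1,0)(0,1,0)[12]" = list(order = c(1, 1, 0), seasonal = c(0, 1, 0)),
  "ARIMA(0,1,1)(0,1,1)[12]" = list(order = c(0, 1, 1), seasonal = c(0, 1, 1))
)

# Fit each candidate and record its AIC.
aic_values <- vapply(candidates, function(spec) {
  fit <- arima(y, order = spec$order,
               seasonal = list(order = spec$seasonal, period = 12))
  AIC(fit)
}, numeric(1))

# Lower AIC is better: the best candidate is the one with the minimum.
sort(aic_values)
best <- names(which.min(aic_values))
best
```

An automated search does the same kind of comparison over many more candidates, which is why its chosen orders appear in the model description.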
You can see the best parameters chosen by the model in the .model_desc column, or in the legend of the following pictures.
We get a good R-squared (0.792), but it is a good idea to visualize how well the model fit:
I really like this fit, and we’ll stick with this model, which seems to capture the correct seasonality and trend. The next step is to refit the model on all the data and see how it works. If needed, the algorithm will update the coefficients to capture the general pattern.
Every time you see “UPDATE” as a prefix of the model description, it means the refit changed the model to better explain the full dataset.
Post-Processing step
We need to back-transform our data because of the Box-Cox transformation applied at the beginning; until we do, the values don’t represent export quantities.
That is our final result for the demand forecast of the next 3 years of soybean exports, with a 95% confidence interval.
Corn
Here we’ll follow the same workflow as in the soybean demand forecast shown before.
We also have annual seasonality with multiplicative behavior. Let’s look at the lag diagnostics.
This confirms our assumption of annual seasonality.
The interesting thing about this chart is that we can see a quarterly seasonality too (similar to the soybean seasonal diagnostics), this time in the third and fourth quarters.
Looking at this heatmap, it is visible that exports are growing through the years, and that the period of the year with the most exports is the 3rd and 4th quarters.
Modeling corn time series
Our formula here will be different: by including these features, our model can better capture the seasonality.
The R-squared here is about 0.643, with a good grasp of the seasonality, but the model could not capture the dips in the time series. We’ll stick with this model for now.
Now let’s refit the model on all the data:
This was our final model.
Post-Processing step
Although our 95% confidence intervals are very wide, the forecast captures a trend and seasonality similar to those of previous years.
Sugar
Let’s investigate the final one.
This time series seems to change behavior after 2012, with a high spike and a significant increase in the quantity of exports.
What do the ACF and PACF tell us?
Here we see a high correlation mostly with the most recent 70 lags, and negative correlation with older lags. In the PACF plot, lags 2 and 9 seem important to our model.
As we confirmed, exports have been higher since 2012. But looking at this plot, we don’t see any seasonality over time.
So let’s filter the data and analyze the seasonality after 2012.
Now we can capture an interesting behavior: the third and first quarters seem to have an increase in exports.
Searching for why this change happened in 2012, I found some events that are probably related to our problem.
2012 was the year Brazil increased its ethanol production.
Producing more ethanol required planting more sugar cane (the base of ethanol production).
Sugar also comes from sugar cane, so with more sugar cane cultivation we saw an increase in sugar production, which was reflected in its exports.
So there is a high probability that our time series really did change its behavior. Another point is that there is a quarterly seasonality that matches the period of sugar production: 90 days in the summer and 100 days in the winter.
With this context in mind, I’ll use just the data after 2012 for now.
Looking at this heatmap, it’s visible that exports have intensified by a huge amount since 2012.
I’ll use just the years after 2012 to model our time series.
Modeling sugar time series
Here I needed to change the amount of data used for assessment to 4 years instead of 5.
Although the fit was a little off from the real values, the model could capture the general seasonality and trend. We’ll stick with this model for now.
Post-Processing step
So, this is our final model to predict the next 3 years of sugar exports, and also the end of this first phase of the project.
Next Steps! (Important)
Throughout the project, we could see that the ARIMA family of models is a very powerful method. But there are so many machine learning and deep learning algorithms available for time series forecasting that this article would be too long if I put them all here.
Now that we have a good understanding of our dataset and a really nice baseline model for all three commodities (soybean, corn and sugar), we can go deeper into modeling and cover advanced topics.
As a spoiler to the next part of this project, check this list:
Many more models (modeltime)
A lot more feature engineering (recipes)
Hyperparameter tuning (tune)
Resampling techniques (modeltime.resample)
Stacking and ensemble models (modeltime.ensemble)
Author: Luciano Oliveira Batista. Luciano is a chemical engineer and data scientist in training. Learn more on his blog at lobdata.com.