Introduction
- Purpose of the tutorial: To demonstrate a quick and straightforward implementation of time series clustering using the widyr package in R.
- What is time series clustering?: Grouping time series into clusters so that series in the same cluster are more similar to each other than to those in other clusters. For example, if we have monthly sales data, time series clustering can help identify stores with similar sales patterns over time.
Load library
library(tidyverse)
library(widyr)
Import data
About the data:
- Fake dataset that can be downloaded from my GitHub.
- Contains 832 rows & 3 columns
- Columns:
  - year (<date>): Date information for each observation.
  - storecode (<chr>): Unique identifier for each store.
  - sales (<dbl>): Sales figures for each store.
Importing data: Using read_csv()
store_list <- read_csv("https://raw.githubusercontent.com/zahiernasrudin/datasets/main/sample_store.csv")
- Glimpse of the dataset:
year | storecode | sales |
---|---|---|
2022-12-01 | A4P1Q1 | 22432 |
2023-01-01 | A4P1Q1 | 22425 |
2023-02-01 | A4P1Q1 | 20710 |
2023-03-01 | A4P1Q1 | 23054 |
2023-04-01 | A4P1Q1 | 23912 |
2023-05-01 | A4P1Q1 | 22782 |
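Before clustering, it can help to confirm that every store has one row per date, since k-means works on a complete store-by-date matrix. The following is a minimal sketch on a small made-up tibble; `toy_sales` and its values are hypothetical, and only the column names mirror `store_list`.

```r
library(dplyr)

# Hypothetical toy data with the same column names as store_list
toy_sales <- tibble::tibble(
  year      = as.Date(rep(c("2023-01-01", "2023-02-01", "2023-03-01"), times = 2)),
  storecode = rep(c("S1", "S2"), each = 3),
  sales     = c(100, 110, 120, 200, 190, 210)
)

# Number of observations per store
obs_per_store <- toy_sales |>
  count(storecode, name = "n_months")

# TRUE when every store has as many rows as there are distinct dates
n_dates <- n_distinct(toy_sales$year)
all_complete <- all(obs_per_store$n_months == n_dates)

obs_per_store
all_complete
```

Running the same two steps on `store_list` would show whether any store is missing dates before it goes into the clustering step.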
Clustering with widyr
Using widely_kmeans for time series clustering:
# Perform k-means clustering using widely_kmeans
cluster_group <- store_list %>%
  widely_kmeans(item = storecode, feature = year, value = sales, k = 3)

# Join the clustering results back to the original data,
# matching on the store identifier
store_list_with_cluster <- left_join(store_list, cluster_group, by = "storecode")
- Define item: Item to cluster. In the context of our dataset, this would be the storecode column.
- Define feature: Feature column (dimension in clustering). In our case, the feature is the time component, which is represented by the year column.
- Define value: Value column. In our dataset, this would be the sales column.
- Define k: Number of clusters. This should be chosen based on the specific requirements of your analysis or determined using evaluation metrics. For the sake of simplicity in this tutorial, we will use 3 clusters.
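To see how the four arguments fit together, here is a minimal sketch of the same idea done by hand: reshape the data to one row per item and one column per feature, then call `stats::kmeans()`. It runs on made-up data (`toy_sales` is hypothetical) and only approximates what `widely_kmeans()` does; it is not the package's actual source.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two low-selling and two high-selling stores
toy_sales <- tibble::tibble(
  year      = as.Date(rep(c("2023-01-01", "2023-02-01", "2023-03-01"), times = 4)),
  storecode = rep(c("S1", "S2", "S3", "S4"), each = 3),
  sales     = c(100, 110, 120,
                105, 108, 125,
                900, 950, 920,
                910, 940, 930)
)

# item -> rows, feature -> columns, value -> cells
wide <- toy_sales |>
  pivot_wider(id_cols = storecode, names_from = year, values_from = sales)

m <- as.matrix(select(wide, -storecode))
rownames(m) <- wide$storecode

# k -> number of centers; k-means starts from random centers, so fix the seed
set.seed(123)
fit <- kmeans(m, centers = 2, nstart = 10)

# One row per store with its cluster label
manual_clusters <- tibble::tibble(
  storecode = rownames(m),
  cluster   = factor(fit$cluster)
)
manual_clusters
```

Because the starting centers are random, the cluster labels themselves (1, 2, ...) can change between runs unless a seed is set; the grouping of stores is what matters.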
Joining Results: The clustering results are joined back to the original dataset.
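After the join, each row carries its store's cluster label, so counting stores per cluster is a cheap sanity check. The sketch below uses made-up data; `toy_sales` and `toy_clusters` are hypothetical stand-ins for `store_list` and `cluster_group`.

```r
library(dplyr)

# Hypothetical sales data: three stores, two dates each
toy_sales <- tibble::tibble(
  year      = as.Date(rep(c("2023-01-01", "2023-02-01"), times = 3)),
  storecode = rep(c("S1", "S2", "S3"), each = 2),
  sales     = c(100, 110, 105, 108, 900, 950)
)

# Hypothetical clustering output: one row per store
toy_clusters <- tibble::tibble(
  storecode = c("S1", "S2", "S3"),
  cluster   = factor(c(1, 1, 2))
)

# Attach the cluster label to every sales row
toy_with_cluster <- left_join(toy_sales, toy_clusters, by = "storecode")

# Count distinct stores per cluster
cluster_sizes <- toy_with_cluster |>
  distinct(storecode, cluster) |>
  count(cluster, name = "n_stores")

cluster_sizes
```

A cluster containing only one or two stores may be worth a closer look, since it can indicate an outlier rather than a genuine group.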
Evaluating Clustering Results
- We can visualize the clustering results using ggplot2.
library(ggthemes)

store_list_with_cluster |>
  ggplot(aes(x = year, y = sales, group = storecode, colour = cluster)) +
  geom_line(show.legend = FALSE) +
  scale_y_continuous(labels = scales::comma) +
  facet_wrap(vars(cluster)) +
  scale_color_solarized()
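Beyond the plot, one common heuristic for choosing `k` is the elbow method: run k-means for several values of `k` and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses base `stats::kmeans()` on a simulated store-by-month matrix. All of the data is made up, and this is a swapped-in technique for illustration, not something the analysis above ran.

```r
set.seed(123)

# Simulated matrix: 30 hypothetical stores x 6 months, three sales levels
levels_by_store <- rep(c(100, 500, 900), each = 10)
m <- matrix(
  rnorm(30 * 6, mean = rep(levels_by_store, times = 6), sd = 20),
  nrow = 30, ncol = 6
)
rownames(m) <- paste0("S", 1:30)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(m, centers = k, nstart = 10)$tot.withinss)

elbow <- data.frame(k = ks, tot_withinss = wss)
elbow
```

Plotting `tot_withinss` against `k` makes the bend easier to spot. For the real data, `m` would be a store-by-date matrix built from `store_list`.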
- There you have it, a simple way to implement time series clustering using the widyr package in R. Of course, there is much more you can explore and refine in your clustering analysis. For comprehensive documentation and further exploration of the widyr package, visit the widyr page itself: widyr Documentation.