The Problem
I have a bunch of data that can be categorized into many small groups. Each small group has a set of values for an ordered set of intervals. Having observed that the values for most groups seem to increase with the order of the interval, I hypothesize that there is a statistically significant, monotonically increasing trend.
An Analogy
To make this abstract problem more relatable, imagine the following scenario.
There are many companies (groups) selling products in an industry that is on the verge of widespread growth. Each company sets a projection (not a goal) for end-of-year sales (value) and adjusts this projection once per quarter (i.e. four times a year) (interval) after analyzing its to-date sales. Having observed that most companies following this scheme tend to revise their projections upward each quarter (because the initial, beginning-of-the-year projection was too low), the market analyst (me) wonders if there is a non-trivial, positive trend across the quarters of the year. 1
To summarize how the “variables” in this analogy relate to my abstract description of the situation,
- groups = companies
- value = sales
- interval = quarter-year
(Actually, there is one more variable in this context, given that I am interested in relative value—the rank of the value, relative to the other values.)
And, to make the example concrete, here is what the data (for four companies) might look like.
| company | q1    | q2   | q3   | q4   |
|---------|-------|------|------|------|
| A       | 3.6   | 8.5  | 9.2  | 9.4  |
| B       | -0.73 | 0.04 | 0.28 | 0.47 |
| C       | 6604  | 4631 | 6604 | 7217 |
| D       | -49   | -9   | 400  | 87   |
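For illustration, the same (hypothetical) data could be written down in R and reshaped into a long format like this; the `sales` name and the `tribble()`/`gather()` approach are my own choices, not from the original post:

```r
library(dplyr)
library(tidyr)

# The hypothetical company data from the table above, reshaped to long format.
sales <- tribble(
  ~company,   ~q1,   ~q2,   ~q3,   ~q4,
  "A",        3.6,   8.5,   9.2,   9.4,
  "B",       -0.73,  0.04,  0.28,  0.47,
  "C",      6604,  4631,  6604,  7217,
  "D",       -49,    -9,   400,    87
) %>%
  gather(quarter, value, q1:q4)
```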
(Bad) Solution: Linear Regression
At first thought, running a univariate linear regression might seem like a good way to attack this problem. However, there are a few basic “gotchas” that make ordinary linear regression a not-so-great idea for this scenario:

- There are not many intervals (i.e. observations of the independent variable) per group. (This condition inherently makes any kind of model, not just a linear regression one, sensitive to the samples. In general, creating models from small sample sizes is problematic.)
- The values across groups might have very different magnitudes. (Thus, trying to create a single, general model that pools all of the data and uses the group labels as a categorical independent variable would likely lead to unreliable results.)
- The values themselves might be very volatile for a single group. (This might be interpreted as a violation of the normality assumption of linear regression. Additionally, if the values are truly monotonic, the assumption of no autocorrelation might also be violated.)

Aside from these caveats, the value for a given interval is not itself relevant; rather, its relationship with all of the other values is, and, more specifically, its relationships with the previous and subsequent values. 2
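To see the first caveat in action, here is a quick sketch (mine, not part of the original post) of what a per-company regression looks like with only four quarterly observations, using company D from the table above:

```r
library(dplyr)
library(broom)

# Only four observations per company: the fit has just 2 residual degrees of
# freedom, so the slope estimate and its p-value are extremely sensitive to
# any single quarter's value.
d <- tibble(quarter = 1:4, sales = c(-49, -9, 400, 87))
fit <- lm(sales ~ quarter, data = d)
broom::tidy(fit)
```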
(Better) Solution: Spearman’s Rho
Given the nature of the data (which one might describe as non-linear) and my intent to quantify ordinality between two variables, it turns out that Spearman's rho, in theory, provides exactly the measurement that I want: it quantifies the association between paired samples using the ranks of the variables (not their values) relative to their samples. Notably, its statistical significance (i.e. a p-value) can also be calculated (traditionally, using a t-test), which should be handy given my intent to identify non-triviality.
Nonetheless, even though this metric seems promising, it will certainly be sensitive to the small sample of each group (assuming that it is calculated for each group). Don't believe me? Check out how the Spearman's rho values change for the numeric columns in the built-in `iris` dataset (which has 150 rows) when they are calculated for just the first 10 rows.
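Here is one way to see that sensitivity for yourself (my own snippet, not from the original post):

```r
# Spearman's rho between the numeric columns of iris, first on all 150 rows
# and then on only the first 10 rows. The pairwise correlations can change
# substantially with the smaller sample.
cor(iris[, 1:4], method = "spearman")      # all 150 rows
cor(iris[1:10, 1:4], method = "spearman")  # first 10 rows only
```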
Another Solution: Custom Heuristic
So, what can be done? Well, even with the promise that Spearman's rho provides for quantification and significance inference, I thought that I would try to create some kind of easily understandable heuristic that I could explain to someone else without having to delve into statistical theory. Nonetheless, I would be remiss not to compare (and validate) the results of my heuristic with those of statistical theory after creating it.
Having this plan in mind, I began to think about how I would define my heuristic, which, in essence, tries to quantify monotonicity. But what exactly constitutes monotonicity? Surprisingly, that's a more complex question than it might seem.
For my purposes, I don't necessarily care if the set of values is strictly increasing or decreasing, but they should be “sufficiently” increasing or decreasing. For example, while it is clear that the sequence 1, 2, 3, 4 is strictly monotonic (increasing) and the sequence 1, 2, 4, 3 is not, I would consider the latter “sufficiently” monotonic. On the other hand, I would consider something like 4, 2, 3, 1 not “sufficiently” monotonic because the 1 and 4 are “badly” misplaced, each sitting in the other's appropriate place at the extreme ends of the sequence. Moreover, if I were intent on identifying increasing monotonicity (as opposed to decreasing monotonicity), I would consider 4, 3, 2, 1 “bad”, even though it is strictly monotonically decreasing. But what about something like 1, 4, 3, 2 (again, assuming that I am evaluating increasing monotonicity)? Even though the 2 and 4 are swapped, I might still consider this sequence “sufficiently” monotonic because the 1 and 3 are placed correctly and the 2 and 4 are “not too far apart”. Anyways, it's easy to see how having some kind of formal definition/calculation/criteria for monotonicity would be handy.
The “Algorithm”
After some thinking, I came up with the following algorithm (if one can even call it that).
(NOTE: I explicitly list the variable names that I use in the code that follows to help the reader understand the relationships between these steps and the implementation.)
1. Given an `n`-length sequence of arbitrary values, assign each value an integer between 1 and `n` to represent its “actual” rank. This rank is assigned based on relative value in the set of real numbers. 3
   - In the machine learning setting, this set of ranks is the dependent variable (i.e. `y`) to predict.
   - In the example situation described before, it equates to the rank that would be assigned to each quarterly interval based on its sales relative to the other quarters.
   - In the simulation that I run below, this is the variable `y0`.
2. Create a matrix of all permutations of actual rank and “assigned” rank. To be clear, this “assigned” rank is independent of the actual rank and value. To make things straightforward, these assigned ranks should be transformed to use only the same set of possible rank values dictated by the actual ranks (i.e. integers between 1 and `n`).
   - In the machine learning setting, this “assigned” rank is the independent variable (i.e. `x`) used as a predictor.
   - In the example, it is the quarterly interval.
   - In the simulation that follows, assigned rank is `x0`, and the matrix (actually, a `tibble`) of combinations is `data_permns`.
3. Calculate the absolute difference between the “actual” and “assigned” ranks for each value in the sequence, and subtract this distance from the maximum rank value in the sequence. The resulting value is what I call the “inverse monotonic distance” (`mntc_distinv` in the following code). (A small worked example follows below.)
4. Repeat the calculation of inverse monotonic distance for all groups (`grp`) of “actual” (`y0`) and “assigned” (`x0`) ranks.
5. Sum up the inverse monotonic distance for each value in the permutation group and take the average of this sum for each group. 4 Re-scale this per-group value to a 0-to-1 basis.
6. Identify any group (`grp`) corresponding to a summed-averaged-re-scaled value (`mntc_distinv`) in the upper 50% quantile of all values (i.e. assigned the value `"(0.5,1]"` for the `mntc_tier2` variable) as “sufficiently” monotonic. (The `mntc_tier2` variable can be interpreted as my heuristic.)
Notably, even though the result set is split at the 50% threshold (which is a subjective choice), this does not mean that 50% of all possible groups are classified as “sufficiently” monotonic. (According to my method, only 33% are for `n = 4`.)
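To make step 3 (and the rank-weighting used in the implementation below) a bit more concrete, here is a small worked example of my own for the pair of sequences x = (1, 2, 3, 4) and y = (1, 2, 4, 3):

```r
# Worked example (my own addition) of the per-element "inverse monotonic distance".
x0 <- c(1L, 2L, 3L, 4L)  # "assigned" ranks
y0 <- c(1L, 2L, 4L, 3L)  # "actual" ranks

dist <- abs(x0 - y0)       # element-wise distance:         0 0 1 1
distinv <- max(x0) - dist  # subtracted from the max rank:  4 4 3 3
x0 * distinv               # weighted by the assigned rank: 4 8 9 12
```

The last line matches the `mntc_distinv` values printed for the group `x1234y1243` further below; the implementation weights each element by its assigned rank `x0` before the per-group values are averaged.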
Implementing the Heuristic
Ok, that is enough discussion. What follows is the implementation.
NOTE: In order to keep the focus on how the code implements the methodology, I recommend reviewing the code but not worrying too much about the details (such as the internal workings of my custom functions). Rather, I'd recommend inspecting in detail only the parts that are printed out (and going back later to understand the complexities, if curious).
```r
library("dplyr")
library("ggplot2")
library("tidyr")
# These packages are used, but their functions are called explicitly.
# library("purrr")
# library("broom")
# library("combinat")
```
The only choice that I need to make to begin is the length of the set of values (i.e. `n`), which should be a “small” integer. I'll choose 4, simply because 3 seems like it is “too small” and because the subsequent visualization(s) become “cluttered” (and interpretation becomes less direct) if a number 5 or greater is chosen.
The following code chunk corresponds to steps 1 and 2 in my methodology, which are basically just set-up steps.
```r
create_permns <- function(n = 1L) {
  n_seq <- seq(1L, n, by = 1L)
  combs <-
    combinat::permn(n_seq) %>%
    purrr::map(~ paste(.x, collapse = "")) %>%
    unlist() %>%
    as.integer()
  data_xy <-
    tibble(grp_x = combs, grp_y = combs) %>%
    expand(grp_x, grp_y) %>%
    mutate(grp = paste0("x", grp_x, "y", grp_y))
  into_seq <- seq(1L, n, by = 1L) %>% as.character()
  sep_seq <- seq(1L, n - 1L, by = 1L)
  wrangle_data_xy <- function(data = NULL, which = NULL) {
    col_grp <- rlang::sym(paste0("grp_", which))
    col_0 <- rlang::sym(paste0(which, "0"))
    data %>%
      separate(!!col_grp, into_seq, sep = sep_seq, remove = FALSE) %>%
      gather(idx, !!col_0, matches("^[0-9]$")) %>%
      mutate_at(vars(idx, !!col_0), funs(as.integer))
  }
  inner_join(
    data_xy %>% wrangle_data_xy("x"),
    data_xy %>% wrangle_data_xy("y")
  ) %>%
    select(-idx) %>%
    arrange(grp_x, grp_y)
}
n <- 4L  # the choice discussed above (defined explicitly so this chunk is self-contained)
data_permns <- create_permns(n = n)
data_permns
```
```
grp_x grp_y grp        x0 y0
 1234  1234 x1234y1234  1  1
 1234  1234 x1234y1234  2  2
 1234  1234 x1234y1234  3  3
 1234  1234 x1234y1234  4  4
 1234  1243 x1234y1243  1  1
 1234  1243 x1234y1243  2  2
 1234  1243 x1234y1243  3  4
 1234  1243 x1234y1243  4  3
 1234  1324 x1234y1324  1  1
 1234  1324 x1234y1324  2  3
# of total rows: 2304
```

Note(s) about the above code chunk:
- Although the “actual” ranks (`y0`) could be any real numbers, I'm using values between 1 and `n` because my methodology dictates that the arbitrary set of values would need to be transformed to this range anyways. When transformed to this range, the values should be interpreted as ranks.
- Like the “actual” `y0` ranks (representing the order of the original, arbitrary values), the `x0` ranks could technically be any real numbers, but they would need to be transformed to the 1-to-`n` range anyways, so I do that directly.
- The number of combinations of “actual” (`y0`) and “assigned” (`x0`) rank pairs is equal to `n! * n!` (i.e. 576). For my implementation, the data.frame `data_permns` actually has `n! * n! * n` (2304) rows (because it is arranged in a “long” format).
- `grp_x` and `grp_y` (and the combination of the two in the `grp` column) identify the `n`-length groups of pairs of `x0` and `y0` ranks. These are primarily useful for human interpretability and are not actually relevant for the computations.
- I use a function here (and elsewhere), although it may seem unnecessary for a single execution, because I'll repeat the methodology for different values of `n` later.
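As a quick sanity check of the row count mentioned in the notes above (my own addition, not part of the original code):

```r
# The "long" data should have n! * n! * n rows (24 * 24 * 4 = 2304 for n = 4).
nrow(data_permns) == factorial(n)^2 * n
```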
Now, I implement the initial calculation of the “inverse monotonic distance” (`mntc_distinv`).
```r
add_mntc_cols <- function(data = NULL) {
  data %>%
    group_by(grp) %>%
    arrange(x0, .by_group = TRUE) %>%
    mutate(mntc = ifelse((y0 == cummax(y0)) | (y0 == cummin(y0)), 1L, 0L)) %>%
    mutate(mntc_distinv = as.integer(x0 * (max(x0) - abs(x0 - y0)))) %>%
    ungroup()
}
data_mntc <- add_mntc_cols(data_permns)
data_mntc
```
```
grp_x grp_y grp        x0 y0 mntc mntc_distinv
 1234  1234 x1234y1234  1  1    1            4
 1234  1234 x1234y1234  2  2    1            8
 1234  1234 x1234y1234  3  3    1           12
 1234  1234 x1234y1234  4  4    1           16
 1234  1243 x1234y1243  1  1    1            4
 1234  1243 x1234y1243  2  2    1            8
 1234  1243 x1234y1243  3  4    1            9
 1234  1243 x1234y1243  4  3    0           12
 1234  1324 x1234y1324  1  1    1            4
 1234  1324 x1234y1324  2  3    1            6
# of total rows: 2304
```

Note(s) about the above code chunk:
- The `mntc` variable is a “running” binary (`1` or `0`) indicating whether or not `y0` is monotonic up through its position in the sequence. (It does not differentiate between increasing and decreasing.)
- I use the `integer` data type (as opposed to simply `numeric`) where possible because it is more memory efficient (although memory efficiency is not a concern with this data).
Next is the calculation of the transformed (i.e. summed-averaged-re-scaled) version of the “inverse monotonic distance” (`mntc_distinv`), as well as the split of `mntc_distinv` into upper and lower 50% quantiles (`mntc_tier2`).
```r
unitize <- function(x = NULL) {
  (x - min(x)) / (max(x) - min(x))
}

summarise_mntc <- function(data = NULL) {
  data %>%
    group_by(grp) %>%
    summarise_at(vars(mntc_distinv), funs(mean)) %>%
    ungroup() %>%
    mutate_at(vars(mntc_distinv), funs(unitize)) %>%
    mutate(mntc_tier2 = cut(mntc_distinv, 2))
}
summ_mntc <- summarise_mntc(data_mntc)
summ_mntc
```
Now, with the “algorithm” fully implemented, I can begin to evaluate the results.
Exactly how many values make up each 50% quantile?
```
mntc_tier2     n n_pct
(-0.001,0.5] 384 66.67
(0.5,1]      192 33.33
```
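For reference, a breakdown like the one above could be reproduced with something along these lines (my own sketch, not the original code):

```r
# Hypothetical reproduction of the quantile breakdown shown above.
summ_mntc %>%
  count(mntc_tier2) %>%
  mutate(n_pct = round(100 * n / sum(n), 2))
```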
What does the distribution of all “inverse monotonic distance” values look like?
The positive identification (in yellow) of combinations along the left-to-right, lower-to-upper diagonal is what I would expect. These are the values where `x0` and `y0` are perfectly matched. Conversely, values along the other diagonal are generally colored in purple, as I would also expect. These combinations consist of sequences of `x0` and `y0` values that are “negatively” symmetric (e.g. (1, 2, 3, 4) and (4, 3, 2, 1)).
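The heat map itself is not reproduced here, but a plot of this kind could be sketched roughly as follows (my own approximation, reusing the objects created above; `scale_fill_viridis_c()` gives a yellow-to-purple palette like the one described):

```r
# Rough sketch (not the original plotting code) of a heat map of the re-scaled
# "inverse monotonic distance" across all pairs of rank permutations.
data_mntc %>%
  distinct(grp, grp_x, grp_y) %>%
  inner_join(summ_mntc, by = "grp") %>%
  ggplot(aes(x = factor(grp_x), y = factor(grp_y), fill = mntc_distinv)) +
  geom_tile() +
  scale_fill_viridis_c() +
  labs(x = "x0 permutation", y = "y0 permutation", fill = "mntc_distinv")
```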
Checking the Heuristic
Ok, my heuristic seems valid, but how can I know for sure that it is reasonable? I mentioned before that Spearman's rho should serve as a good measure, so I'll take a look at it now.
```r
add_cortest_cols <- function(data = NULL) {
  data %>%
    group_by(grp) %>%
    nest() %>%
    mutate(cortest = purrr::map(
      data,
      ~ broom::tidy(cor.test(.$x0, .$y0, method = "spearman"))
    )) %>%
    unnest(cortest, .drop = TRUE) %>%
    select(grp, estimate, p.value)
}

summarise_mntc_wcortest <- function(data = NULL) {
  summ <- summarise_mntc(data)
  data %>%
    add_cortest_cols() %>%
    inner_join(summ, by = "grp")
}
summ_mntc_wcortest <- summarise_mntc_wcortest(data_mntc)
summ_mntc_wcortest
```
```
grp        estimate p.value mntc_distinv mntc_tier2   n_pct
x1234y1234      1.0    0.08         1.00 (0.5,1]        100
x1234y1243      0.8    0.33         0.67 (0.5,1]        100
x1234y1324      0.8    0.33         0.76 (0.5,1]        100
x1234y1342      0.4    0.75         0.38 (-0.001,0.5]   100
x1234y1423      0.4    0.75         0.48 (-0.001,0.5]   100
x1234y1432      0.2    0.92         0.43 (-0.001,0.5]   100
x1234y2134      0.8    0.33         0.86 (0.5,1]        100
x1234y2143      0.6    0.42         0.52 (0.5,1]        100
x1234y2314      0.4    0.75         0.57 (0.5,1]        100
x1234y2341     -0.2    0.92         0.14 (-0.001,0.5]   100
# of total rows: 576
```

What exactly is the distribution of the Spearman's rho estimates and p-values?
```
abs(estimate) p.value   n
          0.4    0.75 192
          0.8    0.33 144
          0.2    0.92  96
          0.0    1.00  48
          0.6    0.42  48
          1.0    0.08  48
```
Note(s) about the above output:
- By taking the absolute value of the estimate, I am essentially treating monotonically increasing and decreasing as equal.
- There are a relatively “small” number of distinct values. 5
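A tabulation like the one above might be produced with something along these lines (again, my own sketch, not the original code):

```r
# Hypothetical reproduction of the estimate/p-value counts (rounding only for display).
summ_mntc_wcortest %>%
  count(abs_estimate = abs(estimate), p.value = round(p.value, 2)) %>%
  arrange(desc(n))
```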
Now, to understand how the Spearman's rho `estimate`s and `p.value`s correspond to my heuristic, I'll simply overlay the combinations that are identified as significant on my previous heat map of rank combinations. Because I'm erring on the side of flexibility in defining “sufficient” monotonicity, I'll say that the pairs corresponding to the bottom two tiers of p-values (corresponding to 0.0833 and 0.33) constitute “sufficient” monotonicity.
![](viz_summ_mntc_wcortest-1.png)
It looks like there is a large amount of overlap between my heuristic classification of “sufficient” monotonicity and that identified by the more statistical approach.
Now I'll repeat the simulation for other values of `n`. (Because computations start to become intensive with `n = 6`, and because the `n = 2` case is relatively trivial, I'll evaluate values of 3, 4, and 5 for `n`. 6)
```r
ns <- tibble(n = 3L:5L)
summ_mntc_byn <-
  ns %>%
  mutate(data = purrr::map(n, ~(create_permns(.x) %>% add_mntc_cols()))) %>%
  mutate(summ = purrr::map(data, summarise_mntc_wcortest)) %>%
  unnest(summ, .drop = TRUE) %>%
  ungroup() %>%
  arrange(n)
```
What is the breakdown of `mntc_tier2` values?
```
n mntc_tier2     nn nn_pct
3 (-0.001,0.5]   24   66.7
3 (0.5,1]        12   33.3
4 (-0.001,0.5]  384   66.7
4 (0.5,1]       192   33.3
5 (-0.001,0.5] 9720   67.5
5 (0.5,1]      4680   32.5
```
What about the distribution of `mntc_distinv` values? And of the `estimate`s and `p.value`s?
```
n abs(estimate) p.value mntc_tier2     nn nn_pct
3           0.5     1.0 (-0.001,0.5]   18   50.0
3           1.0     0.3 (-0.001,0.5]    6   16.7
3           0.5     1.0 (0.5,1]         6   16.7
3           1.0     0.3 (0.5,1]         6   16.7
4           0.0     1.0 (-0.001,0.5]   48    8.3
4           0.2     0.9 (-0.001,0.5]   72   12.5
4           0.4     0.8 (-0.001,0.5]  144   25.0
4           0.6     0.4 (-0.001,0.5]   24    4.2
4           0.8     0.3 (-0.001,0.5]   72   12.5
4           1.0     0.1 (-0.001,0.5]   24    4.2
4           0.2     0.9 (0.5,1]        24    4.2
4           0.4     0.8 (0.5,1]        48    8.3
4           0.6     0.4 (0.5,1]        24    4.2
4           0.8     0.3 (0.5,1]        72   12.5
4           1.0     0.1 (0.5,1]        24    4.2
5           0.0     1.0 (-0.001,0.5]  600    4.2
5           0.1     1.0 (-0.001,0.5] 1800   12.5
5           0.2     0.8 (-0.001,0.5] 1440   10.0
5           0.3     0.7 (-0.001,0.5] 1680   11.7
5           0.4     0.5 (-0.001,0.5]  600    4.2
# of total rows: 36
```
The distributions are sparse due to the relatively small number of unique values for each metric (`mntc_distinv`, `p.value`, etc.). 7 Consequently, it is a bit difficult to extract much meaningful insight about the relationships among the metrics. To really understand how the distributions and relationships scale with larger values of `n`, mathematical theory would need to be applied.
Nonetheless, without jumping more into statistical theory, it seems to me that the identification of rank combinations as significant by my heuristic classification and by Spearman's rho (assuming that one uses the traditional p-value-below-a-threshold approach) would become more dissimilar as the value of `n` increases. This is because my classification simply splits all possible values into two sets for any value of `n`, meaning that the percentage of all possible combinations identified as “sufficiently” monotonic is relatively insensitive to the value of `n`. 8 On the other hand, the Spearman's rho p-values would become more refined with larger values of `n`.
Anyways, I believe that my heuristic serves my purposes well. I only really intended it to be used for small values of `n`. Also, I intended to create a “relaxed” definition of monotonicity, so having only a very small percentage of all possible rank combinations meet the criteria would have actually been undesirable.
Conclusion
In the end, I think I did more work than I really needed to do to answer my original question about quantifying monotonicity and inferring significance, but, all in all, this was a worthwhile exploration.
1. The market analyst does not necessarily hypothesize why annual projections tend to be high (i.e. perhaps due to over-confidence).
2. Additionally, prediction is not the concern here; rather, quantification of trend is. (While regression certainly can help with trend identification, its capability to create predictions is perhaps its better use.)
3. In reality, the rank values could also be any arbitrary value on the real number scale.
4. A sum of sums (instead of an average of sums) could be used here and the subsequent results would not change.
5. Here, there are only n + 1 (i.e. 5) unique `abs(estimate)`s and `p.value`s. This result is not generally true. (For example, when choosing n = 5, there will be more than 6 unique values of each metric.)
6. This presents a good opportunity to implement a version of the “nest-mutate-unnest” idiom that can be very effective for creating many models. The “many models” chapter in the R for Data Science book provides an excellent example of this process.
7. Unfortunately, this is due to the nature of the data and the simulation, so nothing can be done about it.
8. Note that the 33% number found for `n = 4` is not generally true, although this percentage does not seem to change drastically with different values of `n`.