I just came across a really interesting and simple algorithm for estimating the number of distinct elements in a stream of data. The paper (Chakraborty et al. 2023) is available on arXiv; see this Quanta article (Reference 2) for a layman’s explanation.
Problem statement
Let’s state the problem formally. Say we are given a stream of elements a_1, a_2, …, a_m, each drawn from some (possibly very large) universe; elements can repeat. After a single pass over the stream, we want to output the number of distinct elements in it, using as little memory as possible.
A naive solution would be to maintain a buffer of all the distinct elements seen so far: for each incoming element, add it to the buffer if it is not already there, and report the buffer’s size at the end. This gives an exact answer, but the buffer may have to grow as large as the number of distinct elements itself, which is infeasible for very large streams.
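As a sketch (variable names are my own), the naive approach looks like this in R:

```r
# Naive exact counting: keep every distinct element seen so far.
naive_distinct_elements <- function(data_stream) {
  buffer <- c()
  for (x in data_stream) {
    if (!(x %in% buffer)) buffer <- c(buffer, x)
  }
  length(buffer)
}

naive_distinct_elements(c(1, 3, 2, 3, 1, 5))  # 4
```

The buffer here holds one entry per distinct element, which is exactly the memory cost the streaming algorithm below avoids.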
Algorithm
There is a long line of work seeking to do better than the naive solution (see the Bibliographic Remarks in Reference 1); Chakraborty et al. claim theirs is the first state-of-the-art algorithm whose proof of correctness requires just basic probability theory. The algorithm is simple enough to be reproduced here; see Reference 2 for an excellent walkthrough of an example run.
At a high level, the algorithm maintains a buffer of at most thresh elements along with a sampling probability p, which starts at 1. For each element of the stream: first remove it from the buffer if it is already there, then insert it with probability p. Whenever the buffer fills up, discard each buffered element independently with probability 1/2 and halve p; if the buffer is still full after this purge, the algorithm reports failure. At any point, the estimate of the number of distinct elements is the buffer size divided by p.
Here is the theorem for this algorithm:
Theorem. For any data stream of length m and any 0 < ε, δ < 1, the algorithm above outputs an (ε, δ)-approximation of the number of distinct elements, i.e. an estimate within a (1 ± ε) factor of the true count with probability at least 1 − δ. It uses O((1/ε²) · log(1/δ) · log m) space in the worst case.
The proof is not too long (about 3 pages), and while it requires restating the problem via a couple of intermediate algorithms, the most advanced tool it uses is the Chernoff bound.
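For reference, the Chernoff bound in question is the standard multiplicative form (stated here from memory, not quoted from the paper): if X is a sum of independent random variables taking values in [0, 1], with E[X] = μ, then for any 0 < ε < 1,

P(|X − μ| ≥ εμ) ≤ 2 exp(−ε²μ / 3).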
To obtain an (ε, δ)-approximation, the buffer threshold is set to ⌈(12/ε²) · log₂(8m/δ)⌉, which depends only logarithmically on the stream length m.
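As a quick helper, the buffer threshold the paper prescribes, thresh = ⌈(12/ε²) log₂(8m/δ)⌉, can be computed like this (I’ve read the constants off the paper, so treat them as an assumption):

```r
# Buffer threshold for an (eps, delta)-approximation on a stream of length m,
# assuming thresh = ceiling((12 / eps^2) * log2(8 * m / delta)).
cvm_threshold <- function(m, eps, delta) {
  ceiling((12 / eps^2) * log2(8 * m / delta))
}

cvm_threshold(m = 10000, eps = 0.1, delta = 0.01)  # 27518
```

Note that the theoretical threshold is quite conservative; the simulations below get useful estimates with far smaller buffers.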
Simulations
This algorithm is really easy to implement. Here is a quick implementation in R (definitely can be optimized further, just coded it up quickly for prototyping purposes):
```r
# threshold: number of elements buffer can hold
# trajectory: if TRUE, return the estimate after every item in stream (default FALSE)
estimate_distinct_elements <- function(data_stream, threshold, trajectory = FALSE) {
  p <- 1
  m <- length(data_stream)
  buffer <- c()
  hit_fail_event <- FALSE
  estimates <- rep(NA, m)
  for (i in 1:m) {
    buffer <- setdiff(buffer, data_stream[i])
    if (runif(1) < p) buffer <- c(buffer, data_stream[i])
    if (length(buffer) == threshold) {
      buffer <- buffer[runif(threshold) < 0.5]
      p <- p / 2
      if (length(buffer) == threshold) {
        # fail event
        hit_fail_event <- TRUE
      }
    }
    estimates[i] <- ifelse(hit_fail_event, -1, length(buffer) / p)
  }
  if (trajectory) {
    return(estimates)
  } else {
    return(estimates[m])
  }
}
```
Let’s take this for a run with some simulations. All simulation code can be found here. For a start, let’s generate a stream of length 100, where each item is an integer from 1 to 100 (inclusive). For the particular data stream we generated, there were 60 distinct elements, meaning the naive solution would need a buffer of length 60.
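The stream can be generated along these lines (the seed here is arbitrary, so the distinct count will generally differ from the 60 above):

```r
set.seed(1)  # arbitrary seed, not the one used for the post's stream
data_stream <- sample(1:100, size = 100, replace = TRUE)
length(unique(data_stream))  # true number of distinct elements in this stream
```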
First, we run the algorithm 1000 times with buffer length 30, half of what the naive solution needed. The algorithm never hit the “fail” event, and here are some summary statistics of the distinct count estimate:
```
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   34.00   52.00   56.00   60.02   68.00  104.00 
```
The mean and median are pretty close to the true value! Here is a histogram of the estimates:
It’s worth noting that while the estimate is generally pretty good (50% chance of being within 8 of the true number), it can be wildly off as well!
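A run like the one above can be reproduced with something like the following (no seed is fixed here, so the exact numbers will differ; assumes estimate_distinct_elements() and data_stream from earlier):

```r
# 1000 independent runs of the estimator with buffer length 30
estimates <- replicate(1000, estimate_distinct_elements(data_stream, threshold = 30))
sum(estimates == -1)  # how many runs hit the fail event
summary(estimates)    # compare against length(unique(data_stream))
hist(estimates, breaks = 30, main = "Estimates with buffer length 30")
```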
Let’s rerun the simulation on the same data stream but with buffer length 15 (a quarter of the true number of distinct items). Again the algorithm never hit the “fail” event. The summary statistics and histogram are below; as one might expect, the estimates are still centered at the right place, but with more variance. (The histograms may look deceptively similar: note the different scales on the x-axis.)
```
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#    8.00   48.00   56.00   59.81   72.00  144.00 
```
Since this is a streaming algorithm, it provides an estimate of the number of distinct items after inspecting each item. Let’s look at how good the estimate is over time. Here are the plots for 10 estimate trajectories (true count in blue) for buffer length = 30 and buffer length = 15. Notice that the algorithm gives an exact estimate until the buffer is first filled. As expected, larger buffer length tends to give better estimates.
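The trajectory plots can be generated along these lines (plotting details are my own; assumes estimate_distinct_elements() and data_stream from earlier):

```r
true_count <- length(unique(data_stream))
plot(NULL, xlim = c(1, length(data_stream)), ylim = c(0, 2 * true_count),
     xlab = "Number of items seen", ylab = "Estimated distinct count")
for (run in 1:10) {
  traj <- estimate_distinct_elements(data_stream, threshold = 30, trajectory = TRUE)
  lines(traj, col = "grey60")
}
abline(h = true_count, col = "blue")  # true count in blue, as in the post
```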
Let’s do one final simulation: Let’s look at a stream of length 10,000 with each element being a random integer from 1 to 4,000 (inclusive). For our data, the true number of distinct elements is 3,686.
Here is the plot of 10 trajectory estimates with buffer length 500 (about 14% of the true number of uniques) as well as the relative error of the estimate:
That’s not too bad! The estimates are pretty much within 10% of the true values. How about a much smaller buffer length of 50 (1.4% of the true number of uniques):
There’s much wider variation in the estimates now, probably too much to be useful for any real-world application.
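For reference, the relative error plotted above is just the estimate’s deviation from the truth as a fraction of the truth:

```r
# Relative error of an estimate against the true distinct count
rel_error <- function(estimate, truth) abs(estimate - truth) / truth

rel_error(3500, 3686)  # an estimate of 3500 is off by about 5%
```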
References:
- Chakraborty, S., Vinodchandran, N. V., and Meel, K. S. (2023). Distinct Elements in Streams: An Algorithm for the (Text) Book.
- Nadis, S. (2024). Computer Scientists Invent an Efficient New Way to Count. Quanta Magazine.