Site icon R-bloggers

Large Matrix Multiplication: DuckDB vs. SQLite

[This article was first published on factbased, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

On my laptop with 16 GB RAM, I would like to perform a matrix-vector multiplication with a sparse matrix of around 10 million columns and 2500 rows. The matrix has approximately only 2% non-zero entries, but this are still 500 million numbers and the column/row information, too large to work comfortably in-memory.

A while ago, I tried using sqlite for this task. It kind of worked, but was too slow to be useful. This weekend, I revisited the problem and tried using duckdb.

TLDR: duckdb is amazingly faster than sqlite for this problem, around 20 times. The matrix multiplication above takes around 20-30seconds.

First, let’s build a small example and see if the approaches produce the right solution.

set.seed(13)
nvars <- 1000 # columns
ncons <- 500  # rows
n_nonzero <- round(0.2*nvars*ncons) # approximate, there may be actually less values
Amat <- data.frame(
    i=sample.int(ncons, n_nonzero, replace=TRUE),
    j=sample.int(nvars, n_nonzero, replace=TRUE),
    x=runif(n_nonzero)
)
Amat <- Amat[!duplicated(Amat[,c("i", "j")]),]
bvec <- runif(nvars)

# Matrix solution
library(Matrix)
AmatSparse <- Matrix::sparseMatrix(i=Amat[,"i"], j=Amat[,"j"], x=Amat[,"x"])
expected <- as.vector(AmatSparse %*% bvec)
str(expected)
# num [1:500] 45.7 41 43.5 38.1 40.8 ...

For sqlite and duckdb, the matrix multiplication is performed as SQL. First, we create connection objects. The function that performs the SQL is identical:

library(DBI)

# set up SQLite
library(RSQLite)
dbname <- "sparsemat.db"
con1 <- dbConnect(drv=RSQLite::SQLite(), dbname=dbname)

# set up duckdb
library(duckdb)
dbname <- "sparsemat.duckdb"
con2 <- dbConnect(duckdb(), dbname)
dbExecute(con2, "PRAGMA memory_limit='1GB';")

# save to SQLite / duckdb
dbWriteTable(con1, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con1, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)   
dbWriteTable(con2, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con2, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)

mat_mult_sql <- function(con) {
    ans <- dbGetQuery(con, paste(
      "select sum(Amat.x*bvec.x) from Amat",
      "join bvec on Amat.j=bvec.j",
      "group by Amat.i order by Amat.i"
    ))
    return(ans[,1, drop=TRUE])
}

Let’s benchmark the different approaches:

library(microbenchmark)
chk_fun <- function(...) {
    res <- list(...)[[1]]
    for (x in res) {
        if(!all.equal(expected, x)) return(FALSE)
    }
    return(TRUE)
}
microbenchmark(
    sqlite=mat_mult_sql(con1),
    duckdb=mat_mult_sql(con2),
    times=100, check=chk_fun
)
# Unit: milliseconds
#    expr     min       lq      mean   median       uq     max neval
#  sqlite 59.9924 61.77500 62.595894 62.49010 63.17130 75.5612   100
#  duckdb  1.6891  1.77065  1.942087  1.98605  2.04365  2.9142   100

Both approaches produce the correct solution. The duckdb approach is 30 times faster.

But do the approaches work with the big matrix, too? And how big are the files created? Let’s find out. We create 250 small-enough matrices in memory and write them to the disk.

library(data.table)
nvars <- 10000000
ncons <- 10
n_nonzero <- round(0.02*nvars*ncons) # approximate, there may be actually less values
set.seed(13)

# the first table
Amat <- data.frame(
    i=sample.int(ncons, n_nonzero, replace=TRUE),
    j=sample.int(nvars, n_nonzero, replace=TRUE),
    x=runif(n_nonzero)
)
setDT(Amat)
Amat <- unique(Amat, by=c("i", "j"))

# save to SQLite / duckdb
bvec <- runif(nvars)

# write to sqlite
dbWriteTable(con1, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con1, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)

# write to duckdb
dbWriteTable(con2, name="Amat", value=Amat, overwrite=TRUE)
dbWriteTable(con2, name="bvec", value=data.frame(j=seq(1:nvars), x=bvec), overwrite=TRUE)

# the 249 other tables
for (iter in 2:250) {
    Amat <- data.frame(
        i=sample.int(ncons, n_nonzero, replace=TRUE),
        j=sample.int(nvars, n_nonzero, replace=TRUE),
        x=runif(n_nonzero)
    )
    setDT(Amat)
    Amat <- unique(Amat, by=c("i", "j"))
    Amat[,i:=i+(iter-1)*500]
    dbWriteTable(con1, name="Amat", value=Amat, append=TRUE)
    dbWriteTable(con2, name="Amat", value=Amat, append=TRUE)
}

Now the matrix has 500 million non-zero elements. The sqlite database is 12GB, the duckdb is 7GB large.

(I also did run both database backends separately. Building the matrix with sqlite took 350s, with duckdb 230s.)

So, let’s benchmark the matrix multiplication:

microbenchmark(
    sqlite=mat_mult_sql(con1), # 500sec
    duckdb=mat_mult_sql(con2), # 23 sec
    times=1
)

I did not save the detailled timing results, but the sqlite approach took about 500s, while the duckdb approach took only 23s. Again, duckdb is much faster, around 20 times.

So I am deeply impressed by the speed of duckdb. There seem to be huge progressions in OLAP queries. One small part of it is that duckdb can use more than one thread:

dbExecute(con2, "PRAGMA threads=4;")
dbGetQuery(con2, "SELECT current_setting('threads');")
#   current_setting('threads')
# 1                          4

There is also the bigsparser which looks as if it can be even faster than duckdb, however, I could not yet figure out how to create a big matrix on disk with this package.

To leave a comment for the author, please follow the link and comment on their blog: factbased.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version