Lazy load with archivist
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Version 1.1 of the archivist package reached CRAN few days ago.
This package supports operations on disk based repository of R objects. It makes the storing, restoring and searching for an R objects easy (searching with the use of meta information). Want to share your object with article reviewers or collaborators? This package should help.
We’ve created some vignettes to present core use-cases. Here is one of them.
Lazy load with archivist
by Marcin Kosiński
Too big .Rdata
file causing problems? Interested in few objects from a huge .Rdata
file? Regular load()
into Global Environment takes too long or crashes R session? Want to load or copy an object with unknown name? Maintaing environment with thousands of objects became perplexing and troublesome?
library(devtools) if (!require(archivist)){ install_github("archivist", "pbiecek") require(archivist) } library(tools) |
If stacked with any of the above applies, this use case is a must read for you.
The archivist package is a great solution that helps administer, archive and restore your artifacts created in R package.
Combining archivist and lazy load may be miraculous
If your .RData
file is too big and you do not need or do not want to load the whole of it, you can simply convert the .RData
file into a lazy-load database which serializes each entry separately and creates an index. The nice thing is that the loading will be on-demand.
# convert .RData -> .rdb/.rdx lazyLoad = local({load("Huge.RData"); environment()}) tools:::makeLazyLoadDB(lazyLoad, "Huge") |
Loading the database then only loads the index but not the contents. The contents are loaded as they are used.
lazyLoad("Huge") objNames <- ls() #232 objects |
Now you can create your own local archivist-like Repository which will make maintainig artifacts as easy as possible.
DIRectory <- getwd() createEmptyRepo( DIRectory ) |
Then objects from the Huge.RData
file may be archived into Repository created in DIRectory
directory. The attribute tags
(see Tags) specified as realName
is added to every artifact before the saveToRepo()
call, in order to remember its name in the Repository.
lapply( as.list(objNames), function(x){ y <- get( x, envir = lazyLoad ) attr(y, "tags") <- paste0("realName: ", x) saveToRepo( y, repoDir = DIRectory )} ) |
Now if you are interested in a specific artifact but the only thing you remember about it is its class
was data.frame
and its name started with letters ir
then it is possible to search for that artifact using the searchInLocalRepo()
function.
hashes1 <- searchInLocalRepo( "class:data.frame", DIRectory) hashes2 <- searchInLocalRepo( "realName: ir", DIRectory, fixed = FALSE) |
As a result we got md5hashes of artifacts which class was data.frame
(hashes1
) and md5hashes
of artifacts which names (stored in tag
named realName
) starts with ir
. Now we can proceed with an intersection on those two vectors of characters.
(hash <- intersect(hashes1, hashes2)) ## [1] "32ac7871f2b4875c9122a7d9f339f78b" |
After we got one md5hash
corresponding to the artfiact we are interested in, we can load it using the loadFromLocalRepo()
function.
myData <- loadFromLocalRepo( hash, DIRectory, value = TRUE ) summary(myData[,-5]) ## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.30 Min. :2.00 Min. :1.00 Min. :0.1 ## 1st Qu.:5.10 1st Qu.:2.80 1st Qu.:1.60 1st Qu.:0.3 ## Median :5.80 Median :3.00 Median :4.35 Median :1.3 ## Mean :5.84 Mean :3.06 Mean :3.76 Mean :1.2 ## 3rd Qu.:6.40 3rd Qu.:3.30 3rd Qu.:5.10 3rd Qu.:1.8 ## Max. :7.90 Max. :4.40 Max. :6.90 Max. :2.5 |
One can always check the realName
tag of that artifact with the getTagsLocal()
function.
getTagsLocal(hash, DIRectory, "realName") ##[1] "realName: iris" |
If only specific artifacts from previously created Repository in DIRectory
directory are needed in future, they may be copied to a new Repository in new directory. New, smaller Repository will use less memory and may be easier to send to contributors when working in groups. Here is an example of copying artifacts only from selected classes. Because DIRectory2
directory did not exist, the parameter force=TRUE
was needed to force creating empty Repository. Vector hashesR
contains md5hashes
or artifacts that are related to other artifacts, which mean they are datasets used to compute other artifacts. The special parameter fixed = TRUE
specifies to search in tags
that start with letters relation
.
hashes <- unlist(sapply(c("class:coxph", "class:function", "class:data.frame"), searchInLocalRepo, repoDir = DIRectory)) hashesR <- searchInLocalRepo( "relation", repoDir = DIRectory, fixed = FALSE) DIRectory2 <- (paste0( DIRectory, "/ex")) createEmptyRepo( DIRectory2, force = TRUE ) copyLocalRepo( DIRectory, DIRectory2, c( hashes, hashesR ) ) |
You can even tar
your Repository with tarLocalRepo()
function and send it to anybody you want.
tarLocalRepo( DIRectory2 ) |
You can check the summary of Repository using the summaryLocalRepo()
function. As you can see, some of the coxph
artifacts have an addtional class. There are also 8 datasets. Those are artifacts needed to compute other artifacts and archived additionaly in the saveToRepo()
call with default parameter archiveData = TRUE
.
summaryLocalRepo( DIRectory2 ) ## Number of archived artifacts in the Repository: 81 ## Number of archived datasets in the Repository: 11 ## Number of various classes archived in the Repository: ## Number ## data.frame 51 ## function 19 ## coxph 11 ## coxph.null 1 ## coxph.penal 4 ## Saves per day in the Repository: ## Saves ## 2014-09-22 92 |
When Repository is no longer necessary we may simply delete it with deleteRepo()
function.
deleteRepo( DIRectory ) deleteRepo( DIRectory2 ) |
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.