Read Random Rows from A Huge CSV File
[This article was first published on S+/R – Yet Another Blog in Statistical Computing, and kindly contributed to R-bloggers].
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Because R holds data frames in memory, it is sometimes useful to sample and inspect a large CSV file before importing it into a data frame. To the best of my knowledge, there is no off-the-shelf R function that performs such sampling at a relatively low computing cost. Therefore, I drafted two utility functions for this purpose, one with the LaF package and the other with the reticulate package, which leverages the power of Python. The first function is the more efficient of the two and samples 3 records out of 336,776 in about 100 milliseconds; the second is more for fun and serves as a showcase of the reticulate package.
library(LaF)

# Sample n random rows through LaF, which reads only the requested lines
sample1 <- function(file, n) {
  # Detect the column layout, then open the file without loading it
  lf <- laf_open(detect_dm_csv(file, sep = ",", header = TRUE, factor_fraction = -1))
  return(read_lines(lf, sample(1:nrow(lf), n)))
}

sample1("Downloads/nycflights.csv", 3)
#   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
# 1 2013     9  15     1323        -6     1506       -23      MQ  N857MQ   3340
# 2 2013     3  18     1657        -4     2019         9      UA  N35271     80
# 3 2013     6   7     1325        -4     1515       -11      9E  N8477R   3867
#   origin dest air_time distance hour minute
# 1    LGA  DTW       82      502   13     23
# 2    EWR  MIA      157     1085   16     57
# 3    EWR  CVG       91      569   13     25

library(reticulate)

# Sample n random rows through pandas by skipping every row not drawn
sample2 <- function(file, n) {
  # Count the lines in Python and subtract 1 for the header
  rows <- py_eval(paste("sum(1 for line in open('", file, "'))", sep = '')) - 1
  # Line 0 is the header, so data rows are lines 1..rows; skip the unsampled ones
  return(import("pandas")$read_csv(file, skiprows = setdiff(1:rows, sample(1:rows, n))))
}

sample2("Downloads/nycflights.csv", 3)
#   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
# 1 2013    10   9      812        12     1010       -16      9E  N902XJ   3507
# 2 2013     4  30     1218       -10     1407       -30      EV  N18557   4091
# 3 2013     8  25     1111        -4     1238       -27      MQ  N721MQ   3281
#   origin dest air_time distance hour minute
# 1    JFK  MSY      156     1182    8     12
# 2    EWR  IND       92      645   12     18
# 3    LGA  CMH       66      479   11     11
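For readers curious about what the Python side of sample2 amounts to, here is a minimal sketch of the same two-pass idea using only the Python standard library instead of pandas: count the data rows first, draw the row numbers to keep, then stream the file again and retain only those rows. The function name sample_rows and its seed argument are my own illustrative choices, not part of LaF, reticulate, or pandas. The sketch also counts parsed CSV records rather than physical lines, so it behaves slightly differently from sample2 when a quoted field contains a line break.

```python
import csv
import random


def sample_rows(path, n, seed=None):
    """Return (header, rows) with n data rows drawn uniformly without replacement.

    Hypothetical helper for illustration; it mirrors the logic of sample2
    but is not the code that pandas or reticulate run internally.
    """
    rng = random.Random(seed)

    # Pass 1: count the data rows (all records minus the header) without storing them.
    with open(path, newline="") as f:
        total = sum(1 for _ in csv.reader(f)) - 1

    if not 0 <= n <= total:
        raise ValueError(f"cannot sample {n} rows from {total} data rows")

    # Draw the 1-based data row numbers to keep. In sample2 the complement
    # of this set is what gets passed to skiprows.
    keep = set(rng.sample(range(1, total + 1), n))

    # Pass 2: stream the file again and retain only the chosen rows.
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for i, row in enumerate(reader, start=1) if i in keep]

    return header, rows
```

Calling sample_rows("Downloads/nycflights.csv", 3) would return the header and three rows as lists of strings, in file order. Only the sampled rows are ever held in memory, which is the point of the exercise, although both passes still read the whole file.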
To leave a comment for the author, please follow the link and comment on their blog: S+/R – Yet Another Blog in Statistical Computing.