[This article was first published on #untitled, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
If you are mining a text source or doing some language modelling, you need to sample from a text corpus. I was wondering how to write a function that randomly selects chunks of a text file efficiently and does the job fast. I wanted to keep the function as simple as possible, using only base R functions.
Solution
I strongly encourage you to use the scan function, which is part of base R:
    sampleText <- function(filename, total, sampleSize) {
      # Draw sampleSize distinct line numbers from 1..total
      lineNumbers <- sample(total, sampleSize)
      result <- character(0)
      for (line in lineNumbers) {
        # skip = line - 1 so that line number `line` itself is the one read
        text <- scan(filename, what = "character", skip = line - 1, nlines = 1,
                     sep = "\n", strip.white = TRUE, fileEncoding = "UTF-8")
        result <- c(result, text)
      }
      result
    }
And here’s how to use it:
    sample <- sampleText('path_to_file.txt', 2000, 10)
Inspecting the result:
    > class(sample)
    [1] "character"
    > sample
     [1] "It's a cloudy day"
     [2] "But I did want her to have at least PART of my imaginary Paris experience, so I used her pretty Paris stamp to make her birthday card."
    ...
    [10] "Our smartest friend (Zachary)- was nice enough to study the injection and give us some information to share with everyone..."
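The call above hard-codes total as 2000, which assumes the number of lines in the file is already known. If it is not, one way to get it is to read the file in chunks through a connection and add up the chunk lengths, so the whole file is never held in memory at once. This is only a sketch: the helper name countLines, the chunk size and the temporary demo file are assumptions for illustration and are not part of the original function.

```r
# Hypothetical helper (not from the original post): count the lines of a
# text file by reading it in fixed-size chunks through one connection.
countLines <- function(filename, chunkSize = 10000L) {
  con <- file(filename, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    # readLines continues from the current position of the open connection
    chunk <- readLines(con, n = chunkSize, warn = FALSE)
    if (length(chunk) == 0L) break
    total <- total + length(chunk)
  }
  total
}

# Small temporary file so the sketch is self-contained
demoFile <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demoFile)
nLines <- countLines(demoFile)
```

The value returned by countLines can then be passed as the total argument of sampleText.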
NOTE: This function is fast because scan is pointed directly at the file and, with skip and nlines, keeps only the requested line instead of loading the whole file into memory. If you are working with a different encoding, change the fileEncoding value accordingly. Also, if you want sentences back rather than lines, set sep='.'.
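sampleText reopens the file for every sampled line. A possible variant, sketched here under assumptions, opens a single file connection, sorts the sampled line numbers and walks through the file once. The function name sampleTextOnePass and its encoding argument are hypothetical, and the sketch uses readLines rather than scan to advance through the connection. It assumes total is no larger than the real number of lines in the file.

```r
# Hypothetical single-pass variant (not the original function): one open
# connection, sampled line numbers visited in increasing order.
sampleTextOnePass <- function(filename, total, sampleSize, encoding = "UTF-8") {
  lineNumbers <- sort(sample(total, sampleSize))
  con <- file(filename, open = "r", encoding = encoding)
  on.exit(close(con))
  picked <- character(length(lineNumbers))
  position <- 0L  # number of lines consumed from the connection so far
  for (i in seq_along(lineNumbers)) {
    gap <- lineNumbers[i] - position - 1L
    # Discard the lines between the previous target and the current one
    if (gap > 0L) readLines(con, n = gap, warn = FALSE)
    picked[i] <- readLines(con, n = 1L, warn = FALSE)
    position <- lineNumbers[i]
  }
  picked  # returned in file order, not in random order
}

# Small temporary file so the sketch is self-contained
demoLines <- c("alpha", "beta", "gamma", "delta", "epsilon")
demoPath <- tempfile(fileext = ".txt")
writeLines(demoLines, demoPath)
allLines <- sampleTextOnePass(demoPath, total = 5, sampleSize = 5)
someLines <- sampleTextOnePass(demoPath, total = 5, sampleSize = 3)
```

Because the line numbers are sorted before reading, the result comes back in file order; wrap it in sample() if a random order matters.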
Hope you find this function useful!
To leave a comment for the author, please follow the link and comment on their blog: #untitled.