Corpus Linguistics with R, Day 2

[This article was first published on Cornelius Puschmann's Blog » R, and kindly contributed to R-bloggers.]

R Lesson 2


text <- c("This is a first example sentence.", "And this is a second example sentence.")

# gsub() replaces every match of a pattern in each string
# argument order: search pattern, replacement, subject

> gsub("second", "third", text)
[1] "This is a first example sentence."
[2] "And this is a third example sentence."
> gsub("n", "X", text)
[1] "This is a first example seXteXce."
[2] "AXd this is a secoXd example seXteXce."
> # note that "is" is also replaced inside "This" and "this"
> gsub("is", "was", text)
[1] "Thwas was a first example sentence."
[2] "And thwas was a second example sentence."
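
The last call replaces "is" inside "This" as well. A minimal sketch of one way to restrict the replacement to the standalone word, assuming the same two sentences: the `\\b` word-boundary anchor is not part of the lesson notes and is added here only for illustration.

```r
text <- c("This is a first example sentence.",
          "And this is a second example sentence.")

# Without boundaries, "is" is also replaced inside "This" and "this".
naive <- gsub("is", "was", text)

# \\b marks a word boundary, so only the standalone word "is" matches.
whole_word <- gsub("\\bis\\b", "was", text, perl = TRUE)

print(naive)
print(whole_word)
```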

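Perl-style regex

^ beginning of string, e.g. "^x"; as the first character inside [] it means NOT, e.g. "[^x]"
$ end of string, e.g. "x$"
. any single character
\ escape character; inside an R string TWO backslashes ("\\") are needed
[] character classes, e.g. [aeiou] matches any vowel, [a-h] is the same as [abcdefgh]
{MIN,MAX} number of repetitions of the immediately preceding unit (character or group)
+ one or more repetitions of the immediately preceding unit

example
lo+l matches "lol", "lool", "loool", ...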
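
The operators listed above can be tried out on a small vector. This is only a sketch: the strings in `words` are invented for illustration and do not come from the lesson.

```r
# Invented test strings, not taken from the lesson.
words <- c("lol", "loool", "ll", "xylophone", "box", "a.b", "axb")

one_or_more <- grep("lo+l", words, perl = TRUE, value = TRUE)        # one or more "o" between two "l"
starts_x    <- grep("^x", words, perl = TRUE, value = TRUE)          # "x" at the beginning
ends_x      <- grep("x$", words, perl = TRUE, value = TRUE)          # "x" at the end
literal_dot <- grep("a\\.b", words, perl = TRUE, value = TRUE)       # escaped dot: a literal "."
any_char    <- grep("a.b", words, perl = TRUE, value = TRUE)         # unescaped dot: any single character
no_vowels   <- grep("^[^aeiou]+$", words, perl = TRUE, value = TRUE) # ^ inside []: NOT a vowel
three_o     <- grep("o{3}", words, perl = TRUE, value = TRUE)        # exactly three "o" in a row

print(list(one_or_more, starts_x, ends_x, literal_dot, any_char, no_vowels, three_o))
```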

> grep("analy[sz]e", c("analyze", "analyse", "moo"), perl=TRUE, value=TRUE)
[1] "analyze" "analyse"

> grep("(first|second)", text, perl=TRUE, value=TRUE)
[1] "This is a first example sentence."
[2] "And this is a second example sentence."
> grep("(first|lalala)", text, perl=TRUE, value=TRUE)
[1] "This is a first example sentence."

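> # z is not defined in these notes; it is a character vector that contains
> # at least "aabbccdd" and "ababcdcd"
> grep("ab{2}", z, perl=TRUE, value=TRUE)      # {2} applies to "b" only: matches "abb"
[1] "aabbccdd"
> grep("(ab){2}", z, perl=TRUE, value=TRUE)    # {2} applies to the group "ab": matches "abab"
[1] "ababcdcd"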
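
Since the notes do not show how `z` was created, here is a runnable sketch with a hypothetical `z`. The third element is invented to show a string that neither pattern matches.

```r
# Hypothetical vector; the original definition of z is not shown in the notes.
z <- c("aabbccdd", "ababcdcd", "abcabc")

# {2} binds to the single preceding character: "a" followed by "bb".
b_twice <- grep("ab{2}", z, perl = TRUE, value = TRUE)

# Parentheses make "ab" one unit, so {2} requires "abab".
ab_twice <- grep("(ab){2}", z, perl = TRUE, value = TRUE)

print(b_twice)
print(ab_twice)
```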
> gsub("a (first|second)", "another", text, perl=TRUE)
[1] "This is another example sentence."
[2] "And this is another example sentence."

> gsub("[abcdefgh]", "X", text, perl=TRUE)
[1] "TXis is X Xirst XxXmplX sXntXnXX."
[2] "AnX tXis is X sXXonX XxXmplX sXntXnXX."

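> # finds all forms of "forget" (forget, forgets, forgetting, forgot, forgotten) followed by "_v";
> # a.corpus.file is not defined in these notes
> grep("forg[eo]t(s|ting|ten)?_v", a.corpus.file, perl=TRUE, value=TRUE)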
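
To see the pattern at work without the original corpus, here is a sketch on a made-up tagged corpus. The lines in `corpus_lines` and the `word_tag` format are assumptions for illustration only; `regmatches()` is used in addition to `grep()` to pull out just the matched tokens.

```r
# Invented mini-corpus in word_tag format; not the corpus used in the lesson.
corpus_lines <- c("i_pron forgot_v my_det keys_n",
                  "she_pron never_adv forgets_v",
                  "an_det unforgettable_adj day_n",
                  "it_pron is_v forgotten_v",
                  "forget_v it_pron")

pattern <- "forg[eo]t(s|ting|ten)?_v"

# Whole lines that contain a form of "forget" tagged _v.
hits <- grep(pattern, corpus_lines, perl = TRUE, value = TRUE)

# Only the matched tokens themselves.
tokens <- unlist(regmatches(corpus_lines, gregexpr(pattern, corpus_lines, perl = TRUE)))

print(hits)
print(tokens)
```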

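*? lazy matching: match as few characters as possible, e.g.

> gregexpr("s.*?s", text[1], perl=TRUE)
[[1]]
[1] 4 14
attr(,"match.length")
[1] 4 12

# note: characters that have been matched are consumed and cannot be matched again in the same pass.
# The two matches are "s is" (start 4, length 4) and "st example s" (start 14, length 12); the "s" at
# position 7 ends the first match and therefore cannot start another one.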
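
A short sketch contrasting lazy and greedy matching on the first example sentence. `regmatches()` is not used in the notes; it is added here to show the matched substrings instead of their positions.

```r
sentence <- "This is a first example sentence."

# Lazy: each match stops at the first "s" it can reach.
lazy <- regmatches(sentence, gregexpr("s.*?s", sentence, perl = TRUE))[[1]]

# Greedy: the match runs from the first "s" to the last "s" in the string.
greedy <- regmatches(sentence, gregexpr("s.*s", sentence, perl = TRUE))[[1]]

print(lazy)
print(greedy)
```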

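> # text has been redefined at this point; the new definition is not shown, only the result
> gsub("(19|20)[0-9]{2}", "YEAR", text)
[1] "They killed 250 people in YEAR." "No, it was in YEAR."
> # replaces only four-digit sequences of the form 19xx and 20xx; "250" is left alone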
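
One caveat worth illustrating: without word boundaries the year pattern also matches inside longer digit strings. The sentences in `notes` below are invented, and the `\\b` variant is a suggestion rather than part of the lesson.

```r
# Invented sentences for illustration.
notes <- c("The archive covers 1987 to 2004.",
           "Order 120045 shipped in 1850.")

# Pattern from the lesson: also fires on "2004" inside "120045".
plain <- gsub("(19|20)[0-9]{2}", "YEAR", notes)

# With word boundaries only free-standing 19xx/20xx numbers are replaced.
bounded <- gsub("\\b(19|20)[0-9]{2}\\b", "YEAR", notes, perl = TRUE)

print(plain)
print(bounded)
```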

> textfile <- scan(file.choose(), what="char", sep="\n")
Enter file name: corp_gpl_short.txt
Read 9 items
> textfile <- tolower(textfile)
> textfile
[1] "the licenses for most software are designed to take away your"
[2] "freedom to share and change it. by contrast, the gnu general public"
[3] "license is intended to guarantee your freedom to share and change free"
[4] "software--to make sure the software is free for all its users. this"
[5] "general public license applies to most of the free software"
[6] "foundation's software and to any other program whose authors commit to"
[7] "using it. (some other free software foundation software is covered by"
[8] "the gnu library general public license instead.) you can apply it to"
[9] "your programs, too."
> # mistake: "//W" is two literal forward slashes followed by "W", which never occurs in the text,
> # so nothing is split and the nine lines come back exactly as printed above
> unlist(strsplit(textfile, "//W"))
> # correct: "\\W" (an escaped backslash plus W) matches any single non-word character
> text_split <- unlist(strsplit(textfile, "\\W"))
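
The difference between the three split patterns that appear in this session can be checked on a single line. A minimal sketch, using one line adapted from the sample text; the variable names are made up for illustration.

```r
line <- "using it. (some other free software)"

# "//W" is literal text that does not occur, so the line is returned unsplit.
wrong <- strsplit(line, "//W")[[1]]

# "\\W" splits at every single non-word character; ". (" leaves two empty strings.
single <- strsplit(line, "\\W")[[1]]

# "\\W+" treats a run of non-word characters as one delimiter: no empty strings.
runs <- strsplit(line, "\\W+")[[1]]

print(wrong)
print(single)
print(runs)
```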

> text_split <- unlist(strsplit(textfile, "//W+"))
> # same mistake as before: "//W+" matches nothing, so text_split is again the nine unsplit lines

> # with the correct "\\W" split, every single non-word character is a delimiter, so sequences
> # such as ". " leave empty strings behind (the unnamed first column below, 9 occurrences)
> text_split <- unlist(strsplit(textfile, "\\W"))
> sort(table(text_split), decreasing=TRUE)
text_split
                   to   software        the       free        and    general
         9          9          7          5          4          3          3
        is         it    license     public       your         by     change
         3          3          3          3          3          2          2
       for foundation    freedom        gnu       most      other      share
         2          2          2          2          2          2          2
       all        any    applies      apply        are    authors       away
         1          1          1          1          1          1          1
       can     commit   contrast    covered   designed  guarantee    instead
         1          1          1          1          1          1          1
  intended        its    library   licenses       make         of    program
         1          1          1          1          1          1          1
  programs          s       some       sure       take       this        too
         1          1          1          1          1          1          1
     users      using      whose        you
         1          1          1          1

> # text_freqs holds the sorted frequency table without empty strings, apparently built from a split
> # on "\\W+"; its assignment is not shown in the session (presumably
> # text_freqs <- sort(table(text_split), decreasing=TRUE))
> text_freqs
text_split
        to   software        the       free        and    general         is
         9          7          5          4          3          3          3
        it    license     public       your         by     change        for
         3          3          3          3          2          2          2
foundation    freedom        gnu       most      other      share        all
         2          2          2          2          2          2          1
       any    applies      apply        are    authors       away        can
         1          1          1          1          1          1          1
    commit   contrast    covered   designed  guarantee    instead   intended
         1          1          1          1          1          1          1
       its    library   licenses       make         of    program   programs
         1          1          1          1          1          1          1
         s       some       sure       take       this        too      users
         1          1          1          1          1          1          1
     using      whose        you
         1          1          1
> # keep only the words that occur more than once
> text_freqs[text_freqs>1]
text_split
        to   software        the       free        and    general         is
         9          7          5          4          3          3          3
        it    license     public       your         by     change        for
         3          3          3          3          2          2          2
foundation    freedom        gnu       most      other      share
         2          2          2          2          2          2
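
The whole path from raw lines to a filtered frequency list can be sketched in a few lines. The two sentences in `lines` are invented stand-ins for the corpus file, and splitting on `"\\W+"` is assumed.

```r
# Invented stand-in for the lines read from the corpus file.
lines <- c("The cat saw the dog.", "The dog saw a bird, too.")

# Lower-case, then split on runs of non-word characters.
words <- unlist(strsplit(tolower(lines), "\\W+"))

# Count each word and sort by descending frequency.
freqs <- sort(table(words), decreasing = TRUE)

# Logical indexing keeps only the words that occur more than once.
frequent <- freqs[freqs > 1]

print(freqs)
print(frequent)
```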

> # stop_list is not defined in these notes; judging from the output below it contains
> # at least "the", "and" and "of"
> !(text_split %in% stop_list)
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
[25] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
[37] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[49] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE
[61] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[73] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
[85] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> text_stopremoved <- text_split[!(text_split %in% stop_list)]
> text_stopremoved
[1] "licenses" "for" "most" "software" "are"
[6] "designed" "to" "take" "away" "your"
[11] "freedom" "to" "share" "change" "it"
[16] "by" "contrast" "gnu" "general" "public"
[21] "license" "is" "intended" "to" "guarantee"
[26] "your" "freedom" "to" "share" "change"
[31] "free" "software" "to" "make" "sure"
[36] "software" "is" "free" "for" "all"
[41] "its" "users" "this" "general" "public"
[46] "license" "applies" "to" "most" "free"
[51] "software" "foundation" "s" "software" "to"
[56] "any" "other" "program" "whose" "authors"
[61] "commit" "to" "using" "it" "some"
[66] "other" "free" "software" "foundation" "software"
[71] "is" "covered" "by" "gnu" "library"
[76] "general" "public" "license" "instead" "you"
[81] "can" "apply" "it" "to" "your"
[86] "programs" "too"
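
A self-contained sketch of the stop-word filter. Both `stop_list` and `words` below are hypothetical; the stop list used in the lesson is not shown.

```r
# Hypothetical stop list; the one used in the lesson is not shown.
stop_list <- c("the", "and", "of")

# A few tokens in the style of the split text.
words <- c("the", "licenses", "for", "most", "software",
           "and", "freedom", "of", "users")

# %in% returns TRUE for stop words; ! inverts it to mark the words to keep.
keep <- !(words %in% stop_list)

# Logical indexing drops every position where keep is FALSE.
content_words <- words[keep]

print(keep)
print(content_words)
```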

# load (run) an R script file
source("something.r")
