Don Quijote — Word Statistics

Mathematical Poetics

11 years ago

[This article was first published on Mathematical Poetics, and kindly contributed to R-bloggers].

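Using Project Gutenberg’s free text of Don Quijote and the techniques from Unix for Poets, here are the most frequent non-short words in Miguel de Cervantes’ famous work. The filter in the pipeline below keeps tokens of at least five bytes, so with UTF-8 text a few four-letter accented words such as está get through: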

2167 Quijote
2145 Sancho
1331 porque
1053 respondió
1027 había
900 merced
813 vuestra
796 todos
711 cuando
625 donde
614 quien
577 decir
573 caballero
535 hacer
525 aunque
482 aquel
464 aquí
462 estaba
450 sobre
431 está
416 tanto
414 verdad
409 allí
398 tengo
393 mundo
385 tiene
383 alguna
377 hasta
371 dicho
363 parte
361 entre
359 todas
358 buena
353 luego
346 cosas
344 menos
344 lugar
342 tenía
328 manera
328 aquella
327 tiempo
325 Panza
310 ahora
304 puesto
292 caballeros
289 ellos
287 mucho
285 fuera
283 puede
282 antes
281 mejor
281 algún
280 visto
279 Dulcinea
272 tierra
269 otras
258 padre
258 otros
258 hombre
257 hecho
254 haber
253 quiero
252 cielo
250 habían
248 amigo
247 saber
246 historia
245 camino
242 tener
240 escudero
239 parece
239 manos
238 días
234 muchas
231 estas
222 mujer
222 desta
221 será
219 mesmo
219 cuanto
219 cómo
215 quién
214 cabeza
211 punto
211 noche
207 veces
207 replicó
205 cuenta
203 Rocinante
202 parecer
200 razones
199 también
198 fuese
198 duque
198 diciendo
197 andante
196 muchos
196 estos
196 caballo
195 vuesa
195 nuestro
193 podía

CODE: tr -sc '[A-Z][a-z][áéíóú]' '[\012*]' < quijote.textfile | perl -e 'while (<>) { print if length($_)>5; }' | sort | uniq -c | sort -rn > quijote.hist
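
To make the stages of this pipeline easier to follow, here is a minimal sketch of the same idea as a shell function. It is an illustration, not the author's exact command: the function name `word_hist` and its minimum-length argument are hypothetical, the character set is written as plain ranges, and `awk` stands in for the Perl length filter.

```shell
# word_hist [MIN_LEN]: read text on stdin and print "count word" lines,
# most frequent first. MIN_LEN (hypothetical parameter) defaults to 5.
word_hist() {
  min_len="${1:-5}"
  # 1. Replace every run of non-letter characters with a single newline,
  #    which leaves one word per line.
  tr -sc 'A-Za-záéíóú' '\n' |
    # 2. Drop short words (and any empty line).
    awk -v n="$min_len" 'length($0) >= n' |
    # 3. Group identical words, count each group, sort by count descending.
    sort | uniq -c | sort -rn
}
```

For example, `word_hist 5 < quijote.textfile > quijote.hist` should give a table like the one above. Whether `awk` measures length in bytes or in characters depends on the implementation and locale, so four-letter accented words may be handled differently than in the Perl version.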

Here’s the power law distribution of non-short words in Don Quijote:

CODE: tr -sc '[A-Z][a-z][áéíóú]' '[\012*]' < quijote | perl -e 'while (<>) { print if length($_)>5; }' | sort | uniq -c | sort -rn | perl -e 'while (<>) { print $1 if $_ =~ /(\d+)/; print "\n"; }' | uniq -c > quijote.countofcounts.powerlaw.hist
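
The last two stages of this pipeline collapse the word histogram into a count of counts: for each frequency, how many distinct words occur that many times. A small sketch of that step on its own, assuming input lines of the form `count word` (the function name `count_of_counts` is hypothetical, and `awk` replaces the Perl regex):

```shell
# count_of_counts: read "count word" lines on stdin and print
# "number_of_words frequency" lines, highest frequency first.
count_of_counts() {
  # Keep only the leading count of each line, sort the counts so equal
  # values are adjacent, then count how often each value occurs.
  awk '{ print $1 }' | sort -rn | uniq -c
}
```

With the histogram from the first command, this would be run as `count_of_counts < quijote.hist > quijote.countofcounts.powerlaw.hist`. The explicit `sort -rn` makes the function safe on unsorted input; the original relies on the histogram already being sorted by count.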

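> # Assumed loading step, not shown in the original post: uniq -c writes "number-of-words frequency" pairs, so the columns are swapped to put frequency on x
> quijote.countofcounts.powerlaw <- read.table("quijote.countofcounts.powerlaw.hist")[, c(2, 1)]
> par(bg="#fafaff", col="#111177")
> plot(quijote.countofcounts.powerlaw, log="y", type="s", lwd=4, xlab="Number of times a word appears in the text", ylab="Number of words with this frequency", main="Word Frequency in Don Quijote de la Mancha", col="#111177")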

Including the short words retains the power-law distribution:

CODE: tr -sc '[A-Z][a-z][áéíóú]' '[\012*]' < quijote | sort | uniq -c | sort -rn | perl -e 'while (<>) { print $1 if $_ =~ /(\d+)/; print "\n"; }' | uniq -c > quijote.countofcounts.powerlaw.hist.shortwordstambien
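
One rough way to check the power-law claim numerically is to fit a straight line to the count-of-counts data on log-log axes: if the number of words n with frequency f follows n ∝ f^(-α), the slope of log n against log f is about -α. The sketch below is an assumption-laden illustration rather than part of the original analysis. The function name `loglog_slope` is hypothetical, it expects `number_of_words frequency` lines as written by `uniq -c`, and an ordinary least-squares fit is only a crude estimate of the exponent.

```shell
# loglog_slope: read "number_of_words frequency" lines on stdin and print
# the least-squares slope of log(number_of_words) against log(frequency).
loglog_slope() {
  awk '$1 > 0 && $2 > 0 {
         x = log($2); y = log($1)          # natural logs of both columns
         n++; sx += x; sy += y; sxx += x * x; sxy += x * y
       }
       END {
         d = n * sxx - sx * sx
         if (n < 2 || d == 0) exit 1       # not enough distinct points
         printf "%.2f\n", (n * sxy - sx * sy) / d
       }'
}
```

It could be run as `loglog_slope < quijote.countofcounts.powerlaw.hist`. A slope that stays roughly constant across the range of frequencies is consistent with a power law, but it does not prove one.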
