Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc. I started to wonder if I could do something similar with the component books of the Torah (the first five books of the Hebrew Bible).
A very cursory reading of the Documentary Hypothesis indicates that the only books of the Torah supposed to be authored by one person each were Vayikra (Leviticus) and Devarim (Deuteronomy). The remaining three appear to be a confusing hodgepodge compilation from multiple authors. I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.
The theory behind the following analysis (taken from the English textual world, of course) seems to be this: when someone writes a book, they write in a very particular style. To detect that style statistically, it is convenient to use function words. Function words (“and”, “also”, “if”, “with”, etc.) need to be used regardless of content, and therefore should show up throughout the text being analyzed. Each author uses a distinct number/proportion of each of these function words, so authors should be distinguishable by their usage profiles.
With that in mind, I started my journey. The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which Hebrew function words to look for. For the Torah text, I relied on the inimitable Chabad.org. They hired good, rational web developer(s) to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting HTML page numbers from their source code.
Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English. However, I found a good 32 of them, as listed below:
Transliteration | Hebrew Function Word | Rough Translation | Word Count |
al | עַל | Upon | 1262 |
el | אֶל | To | 1380 |
asher | אֲשֶׁר | That | 1908 |
ca_asher | כַּאֲשֶׁר | As | 202 |
et | אֶת | (Direct object marker) | 3214 |
ki | כִּי | For/Because | 1030 |
col | וְכָל + כָּל + לְכָל + בְּכָל + כֹּל | All | 1344 |
ken | כֵּן | Yes/So | 124 |
lachen | לָכֵן | Therefore | 6 |
hayah_and_variants | תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה | Be | 819 |
ach | אַךְ | But | 64 |
byad | בְּיַד | By | 32 |
gam | גַם | Also/Too | 72 |
mehmah | מֶה + מָה | What | 458 |
haloh | הֲלֹא | Was not? | 17 |
rak | רַק | Only | 59 |
b_ad | בְּעַד | For the sake of | 5 |
loh | לֹא | No/Not | 1338 |
im | אִם | If | 332 |
al2 | אַל | Do not | 217 |
ele | אֵלֶּה | These | 264 |
haheehoo | הַהִוא + הַהוּא | That | 121 |
ad | עַד | Until | 324 |
hazehzot | הַזֶּה + הַזֹּאת + זֶה + זֹאת | This | 474 |
min | מִן | From | 274 |
eem | עִם | With | 80 |
mi | מִי | Who | 703 |
oh | אוֹ | Or | 231 |
maduah | מַדּוּעַ | Why | 10 |
etzel | אֵצֶל | Beside | 6 |
heehoo | הִוא + הוּא + הִיא | Him/Her/It | 653 |
az | אָז | Then | 49 |
This list is not exhaustive, but it is definitely not small! My one hesitation about this list concerns the Hebrew word for “and”. “And” takes the form of a single letter attached to the beginning of a word (a “vav” marked with a different vowel sound depending on its context). I was afraid to try to extract it, because I worried that I would mistakenly count other vavs that are a valid part of a word with a different meaning. It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.
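To make the counting step concrete, here is a minimal Python sketch of how whole-word function words could be tallied for one chapter. It is not my original scraping script. The dictionary holds only three rows of the table above, and the names `FUNCTION_WORDS`, `tokenize`, and `count_function_words` are made up for illustration. It assumes the chapter text carries vowel points but no cantillation marks; any extra marks on a word would break the exact match.

```python
import re
import unicodedata

# Illustrative subset of the table above: label -> accepted pointed spellings.
# Variants of one word share a label, as in the "heehoo" row.
FUNCTION_WORDS = {
    "al": ["עַל"],
    "el": ["אֶל"],
    "heehoo": ["הִוא", "הוּא", "הִיא"],
}

# Normalize to NFC so that differently ordered vowel points still compare equal.
FORM_TO_LABEL = {
    unicodedata.normalize("NFC", form): label
    for label, forms in FUNCTION_WORDS.items()
    for form in forms
}

def tokenize(text):
    """Split on whitespace and on the maqaf (U+05BE), which joins Hebrew words."""
    normalized = unicodedata.normalize("NFC", text)
    return [tok for tok in re.split(r"[\s\u05BE]+", normalized) if tok]

def count_function_words(text):
    """Return (raw counts, proportions of all tokens, total token count)."""
    tokens = tokenize(text)
    counts = {label: 0 for label in FUNCTION_WORDS}
    for tok in tokens:
        label = FORM_TO_LABEL.get(tok)
        if label is not None:
            counts[label] += 1
    total = len(tokens)
    props = {label: (n / total if total else 0.0) for label, n in counts.items()}
    return counts, props, total
```

Running `count_function_words` once per chapter would give one row of raw counts and one row of proportions, which are the two kinds of predictors used in the models further down.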
Anyhow, following is the structure of Torah:
‘Chumash’ / Book | Number of Chapters |
‘Bereishit’ / Genesis | 50 |
‘Shemot’ / Exodus | 40 |
‘Vayikra’ / Leviticus | 27 |
‘Bamidbar’ / Numbers | 36 |
‘Devarim’ / Deuteronomy | 34 |
Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:
> m = ggplot(torah, aes(x=wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)
You can see that the books are not entirely different in terms of the word counts of their component chapters, with the possible exception of Vayikra, which seems to tend towards shorter chapters.
After making a Python script to count the above words within each chapter of each book, I loaded it up into R and split it into a training and testing sample:
torah$randu = runif(187, 0,1)
torah.train = torah[torah$randu <= .4,]
torah.test = torah[torah$randu > .4,]
For this analysis, it seemed that using Random Forests made the most sense. However, I wasn’t quite sure if I should use the raw counts, or proportions, so I tried both. After whittling down the variables in both models, here are the final training model definitions:
torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo,
    data=torah.train, ntree=5000, importance=TRUE, mtry=8)
torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1,
    data=torah.train, ntree=5000, importance=TRUE, mtry=8)
As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:
> importance(torah.rf)
Word | MeanDecreaseAccuracy | MeanDecreaseGini |
hayah | 31.05139 | 5.979567 |
heehoo | 20.041149 | 4.805793 |
loh | 18.861843 | 6.244709 |
mah | 18.798385 | 4.316487 |
al | 16.85064 | 5.038302 |
caasher | 15.101464 | 3.256955 |
et | 14.708421 | 6.30228 |
asher | 14.554665 | 5.866929 |
oh | 13.585169 | 2.38928 |
el | 13.010169 | 5.605561 |
gam | 5.770484 | 1.652031 |
ki | 5.489 | 4.005724 |
haheehoo | 2.330776 | 1.375457 |
> importance(torah.rf.props)
Word | MeanDecreaseAccuracy | MeanDecreaseGini |
asher_1 | 37.074235 | 6.791851 |
heehoo_1 | 29.87541 | 5.544782 |
al_1 | 26.18609 | 5.365927 |
el_1 | 17.498034 | 5.003144 |
col_1 | 17.051121 | 4.530049 |
hayah_1 | 16.512206 | 5.220164 |
loh_1 | 15.761723 | 5.157562 |
ele_1 | 14.795885 | 3.492814 |
mi_1 | 12.391427 | 4.380047 |
gam_1 | 12.209273 | 1.671199 |
im_1 | 11.386682 | 2.651689 |
oh_1 | 11.336546 | 1.370932 |
mah_1 | 9.133418 | 3.58483 |
caasher_1 | 5.135583 | 2.059358 |
It’s funny that, from a raw-count perspective, hayah, the Hebrew verb for “to be”, shows up at the top of the list. That’s the same result as in the Shakespeare et al. analysis! Having established that all variables in each model had some kind of effect on the classification, the next task was to test each model on the testing sample, and see how well each chumash/book of the Torah could be classified by that model:
> torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
> torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")
> xtabs(~torah.test$chumash + torah.test$pred.chumash)
                  torah.test$pred.chumash
torah.test$chumash 'Bamidbar' 'Bereishit' 'Devarim' 'Shemot' 'Vayikra'
       'Bamidbar'           4           5         2        8         7
       'Bereishit'          1          14         1       14         2
       'Devarim'            1           2        17        0         1
       'Shemot'             2           4         4        9         2
       'Vayikra'            5           0         4        0         5
> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
                  torah.test$pred.chumash
torah.test$chumash 'Bamidbar' 'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'  0.15384615  0.19230769 0.07692308 0.30769231 0.26923077
       'Bereishit' 0.03125000  0.43750000 0.03125000 0.43750000 0.06250000
       'Devarim'   0.04761905  0.09523810 0.80952381 0.00000000 0.04761905
       'Shemot'    0.09523810  0.19047619 0.19047619 0.42857143 0.09523810
       'Vayikra'   0.35714286  0.00000000 0.28571429 0.00000000 0.35714286
> xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
                  torah.test$pred.chumash.props
torah.test$chumash 'Bamidbar' 'Bereishit' 'Devarim' 'Shemot' 'Vayikra'
       'Bamidbar'           0           5         0       13         8
       'Bereishit'          1          16         0       13         2
       'Devarim'            0           2        11        4         4
       'Shemot'             1           4         2       13         1
       'Vayikra'            3           3         0        0         8
> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
                  torah.test$pred.chumash.props
torah.test$chumash 'Bamidbar' 'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'  0.00000000  0.19230769 0.00000000 0.50000000 0.30769231
       'Bereishit' 0.03125000  0.50000000 0.00000000 0.40625000 0.06250000
       'Devarim'   0.00000000  0.09523810 0.52380952 0.19047619 0.19047619
       'Shemot'    0.04761905  0.19047619 0.09523810 0.61904762 0.04761905
       'Vayikra'   0.21428571  0.21428571 0.00000000 0.00000000 0.57142857
So, from the perspective of the raw number of times each function word was used, Devarim, or Deuteronomy, seemed to be the most internally consistent, with 81% of its chapters in the testing sample correctly classified. Interestingly, from the perspective of the proportion of times each function word was used, we see that Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) each had over 50% of their chapters correctly classified in the testing sample (52%, 62%, and 57%, respectively).
I’m tempted to say here that there is evidence that at least Devarim was written with one stylistic framework in mind, and potentially by one distinct author. From a proportion point of view, it appears that Shemot and Vayikra also show an internal consistency suggestive of close to one stylistic framework, or possibly a distinct author for each book. I’m definitely skeptical of my own analysis, but what do you think?
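The per-book percentages quoted above are just the diagonal of each confusion table divided by its row total. As a small sketch (in Python rather than R, with the counts copied from the two xtabs outputs above and helper names invented for illustration), the arithmetic looks like this:

```python
BOOKS = ["Bamidbar", "Bereishit", "Devarim", "Shemot", "Vayikra"]

# Rows = actual book, columns = predicted book, in the order of BOOKS.
RAW_MODEL = [
    [4, 5, 2, 8, 7],
    [1, 14, 1, 14, 2],
    [1, 2, 17, 0, 1],
    [2, 4, 4, 9, 2],
    [5, 0, 4, 0, 5],
]
PROPS_MODEL = [
    [0, 5, 0, 13, 8],
    [1, 16, 0, 13, 2],
    [0, 2, 11, 4, 4],
    [1, 4, 2, 13, 1],
    [3, 3, 0, 0, 8],
]

def per_book_accuracy(matrix, labels):
    """Share of each book's test chapters that were assigned to that same book."""
    return {label: row[i] / sum(row) for i, (label, row) in enumerate(zip(labels, matrix))}

def overall_accuracy(matrix):
    """Correctly classified chapters (the diagonal) over all test chapters."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

raw_by_book = per_book_accuracy(RAW_MODEL, BOOKS)
props_by_book = per_book_accuracy(PROPS_MODEL, BOOKS)
```

The same tables also give an overall figure, which I did not report above: 49 of the 114 test chapters (about 43%) were correctly classified by the raw-count model, and 48 of 114 (about 42%) by the proportion model.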
The last part of this analysis comes from a suggestion given to me by a friend: once I had modelled the unique profiles of function words within each book of the Torah, I should apply that model to some later biblical texts from the Prophets. Apparently one idea is that the “Deuteronomist Source” was also behind the writing of Joshua, Judges, and Kings. If the same author was behind all four books, then when I apply my model to these three books, their chapters should tend to be classified as Devarim/Deuteronomy more often than as any other book.
As above, below I show the distribution of word count by book, for comparison’s sake:
> m = ggplot(neviim, aes(wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)
Interestingly, chapters in these books of the Prophets seem to be a bit longer, on average, than those in the Torah.
Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the 5 books of the Torah, and tested my random forests Torah model on them. As you can see below, the result was anything but clear on that matter:
> xtabs(~neviim$chumash + neviim$pred.chumash)
              neviim$pred.chumash
neviim$chumash 'Bamidbar' 'Bereishit' 'Devarim' 'Shemot' 'Vayikra'
      'Joshua'          3           7         7        6         1
      'Judges'          2          11         2        6         0
      'Kings'           0           8         3       10         1
> xtabs(~neviim$chumash + neviim$pred.chumash.props)
              neviim$pred.chumash.props
neviim$chumash 'Bamidbar' 'Bereishit' 'Devarim' 'Shemot' 'Vayikra'
      'Joshua'          2           8         6        7         1
      'Judges'          0           9         2        9         1
      'Kings'           0           7         6        7         2
I didn’t even bother to re-express these tables as fractions, because it’s quite clear that none of the books of the Prophets that I analyzed was classified predominantly into any one category. Looking at these tables, I don’t yet see any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these later texts.
Conclusion
I don’t think that this has been a full enough analysis. There are a few things in it that bother me, or make me wonder, that I’d love input on. Let me list those things:
- As mentioned above, I’m missing the inclusion of the Hebrew “and” in this analysis. I’d like to know how to extract counts of the Hebrew “and” without extracting counts of the Hebrew letter “vav” where it doesn’t signify “and”.
- Similar to my exclusion of “and”, there are a few one-letter prepositions that I have not included as individual predictor variables, namely ל, ב, כ, מ, signifying “to”, “in”/“with”, “like”, and “from”. How do I count these without counting the same letters that begin a different word and don’t mean the same thing?
- Is it more valid to consider the results of my analyses that were done on the word frequencies as proportions (individual word count divided by total number of words in the chapter), or are both valid?
- Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis? I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.
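On the first two questions, one crude heuristic I can imagine (purely an assumption on my part, not an established method, and not something I have run on the real data) is to treat a leading letter as a prefix only when the rest of the token also appears as a standalone word somewhere in the corpus. The Python sketch below strips the vowel points first; `strip_points` and `count_probable_prefix` are hypothetical helper names.

```python
import unicodedata

VAV = "\u05D5"  # the letter vav, which carries the meaning "and" as a prefix

def strip_points(word):
    """Drop vowel points and cantillation (Unicode nonspacing marks) from a word."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def count_probable_prefix(tokens, prefix=VAV):
    """Count tokens that start with `prefix` and whose remainder is itself
    a token in the same corpus. A heuristic guess, not a morphological parse."""
    bare = [strip_points(tok) for tok in tokens]
    vocabulary = set(bare)
    return sum(
        1
        for word in bare
        if len(word) > 1 and word.startswith(prefix) and word[1:] in vocabulary
    )
```

This will miscount in both directions: it misses prefixed words whose stem never appears on its own, and it wrongly counts words that merely begin with the letter and happen to leave another valid word behind. A proper answer probably needs a morphologically tagged text, which I have not looked for. The same function could be called with ל, ב, כ, or מ as the `prefix` argument.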
All that being said, I’m intrigued that when you look at the raw counts of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent. If you’d like, you can look at the Python code that I used to scrape the chabad.org website here: python code for scraping, although please forgive the rudimentary coding! You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the data set that I collected for the Prophetic text word counts here: Neviim data set. By all means, do the analysis yourself and tell me how to do it better.