Multiple Classification and Authorship of the Hebrew Bible

Posted on January 1, 2013 by inkhorn82 in R bloggers

[This article was first published on Data and Analysis with R, at Work, and kindly contributed to R-bloggers].

Sitting in my synagogue this past Saturday, I started thinking about the authorship analysis that I did using function word counts from texts authored by Shakespeare, Austen, etc. I started to wonder if I could do something similar with the component books of the Torah (the first five books of the Hebrew Bible).

A very cursory reading of the Documentary Hypothesis indicates that the only books of the Torah supposed to be authored by one person each were Vayikra (Leviticus) and Devarim (Deuteronomy). The remaining three appear to be a confusing hodgepodge compilation from multiple authors. I figured that if I submitted the books of the Torah to a similar analysis, and if the Documentary Hypothesis is spot-on, then the analysis should be able to accurately classify only Vayikra and Devarim.

The theory behind the following analysis (taken from the English textual world, of course) seems to be this: When someone writes a book, they write with a very particular style. If you want to detect that style statistically, it is convenient to do so using function words. Function words (“and”, “also”, “if”, “with”, etc.) need to be used regardless of content, and therefore should show up throughout the text being analyzed. Each author uses a distinct number/proportion of each of these function words, and is therefore distinguishable based on their profile of usage.
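
As a rough illustration of what such a usage profile looks like in practice, here is a minimal sketch in base R. The sample sentence, the word list, and the helper name `function_word_profile` are all made up for this example; it is not the code that was used for the analysis.

```r
# Hypothetical helper: tally how often each function word appears in a text.
function_word_profile <- function(text, function_words) {
  # Lower-case the text and split it on anything that is not a letter
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  # Count each function word, keeping zeros for words that never occur
  counts <- table(factor(tokens, levels = function_words))
  list(counts = counts, proportions = counts / length(tokens))
}

sample_text <- "If it rains, then we stay in, and if it is dry we also go out with the dog."
profile <- function_word_profile(sample_text, c("and", "also", "if", "with"))
profile$counts       # raw counts per function word
profile$proportions  # counts divided by the total number of tokens
```

Two texts by different authors would, under this theory, give visibly different `counts` or `proportions` vectors even when they discuss the same subject.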

With that in mind, I started my journey. The first steps were to find an online source of Torah text that I could easily scrape for word counts, and then to figure out which Hebrew function words to look for. For the Torah text, I relied on the inimitable Chabad.org. They hired good rational web developer(s) to make their website, and so looping through each perek (chapter) of the Torah was a matter of copying and pasting HTML page numbers from their source code.

Several people told me that I’d be wanting for function words in Hebrew, as there are not as many as in English. However, I found a good 32 of them, as listed below:

Transliteration	Hebrew Function Word	Rough Translation	Word Count
al	עַל	Upon	1262
el	אֶל	To	1380
asher	אֲשֶׁר	That	1908
ca_asher	כַּאֲשֶׁר	As	202
et	אֶת	(Direct object marker)	3214
ki	כִּי	For/Because	1030
col	וְכָל + כָּל + לְכָל + בְּכָל + כֹּל	All	1344
ken	כֵּן	Yes/So	124
lachen	לָכֵן	Therefore	6
hayah_and_variants	תִּהְיֶינָה + תִּהְיֶה + וְהָיוּ + הָיוּ + יִהְיֶה + וַתְּהִי + יִּהְיוּ + וַיְהִי + הָיָה	Be	819
ach	אַךְ	But	64
byad	בְּיַד	By	32
gam	גַם	Also/Too	72
mehmah	מֶה + מָה	What	458
haloh	הֲלֹא	Was not?	17
rak	רַק	Only	59
b_ad	בְּעַד	For the sake of	5
loh	לֹא	No/Not	1338
im	אִם	If	332
al2	אַל	Do not	217
ele	אֵלֶּה	These	264
haheehoo	הַהִוא + הַהוּא	That	121
ad	עַד	Until	324
hazehzot	הַזֶּה + הַזֹּאת + זֶה + זֹאת	This	474
min	מִן	From	274
eem	עִם	With	80
mi	מִי	Who	703
oh	אוֹ	Or	231
maduah	מַדּוּעַ	Why	10
etzel	אֵצֶל	Beside	6
heehoo	הִוא + הוּא + הִיא	Him/Her/It	653
az	אָז	Thus	49

This list is not exhaustive, but definitely not small! My one hesitation when coming up with this list surrounds the Hebrew word for “and”. “And” takes the form of a single letter that attaches to the beginning of a word (a “vav” marked with a different vowel sound depending on its context). I was afraid to try to extract it because I worried that I would mistakenly count other vavs that are a valid part of a word with a different meaning. It’s a very frequent word, as you can imagine, and its absence might very well affect my subsequent analyses.
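
To make the worry concrete, here is a small sketch in base R. The three example words are chosen purely for illustration and are written without vowel marks, as Unicode escapes, so that the snippet does not depend on the file encoding. A naive "starts with vav" test flags two of them, although only the first one actually carries the prefix “and”.

```r
vav <- "\u05d5"
words <- c(
  "\u05d5\u05d4\u05d0\u05e8\u05e5",  # "veha'aretz" (and the earth): vav is the prefix "and"
  "\u05d5\u05d5\u05d9",              # "vavei" (hooks of): vav belongs to the word itself
  "\u05d0\u05ea"                     # "et" (direct object marker): no vav at all
)

# Naive rule: every word that begins with vav counts as an "and"
starts_with_vav <- startsWith(words, vav)
naive_and_count <- sum(starts_with_vav)
naive_and_count
```

Separating the two cases would need some knowledge of the vocabulary, which a plain prefix match does not have.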

Anyhow, the following is the structure of the Torah:

‘Chumash’ / Book	Number of Chapters
‘Bereishit’ / Genesis	50
‘Shemot’ / Exodus	40
‘Vayikra’ / Leviticus	27
‘Bamidbar’ / Numbers	36
‘Devarim’ / Deuteronomy	34

Additionally, I’ve included a faceted histogram below showing the distribution of word-counts per chapter by chumash/book of the Torah:

> library(ggplot2)
> m = ggplot(torah, aes(x=wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

You can see that the books are not entirely different in terms of the word counts of their component chapters, with the possible exception of Vayikra, which seems to tend towards shorter chapters.

After making a Python script to count the above words within each chapter of each book, I loaded the resulting counts into R and split them into a training and a testing sample:

# 187 chapters in total; roughly 40% go to training, the rest to testing
torah$randu = runif(187, 0, 1)
torah.train = torah[torah$randu <= .4,]
torah.test = torah[torah$randu > .4,]
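
Because `runif` draws new random numbers on every run, the split above will differ each time unless a seed is fixed. The sketch below shows one way to make it reproducible; the `chapters` data frame is only a stand-in built from the chapter counts in the table above, and the seed value is arbitrary.

```r
# Stand-in for the real data: one row per chapter (187 in total)
chapters <- data.frame(
  chumash = rep(c("Bereishit", "Shemot", "Vayikra", "Bamidbar", "Devarim"),
                times = c(50, 40, 27, 36, 34))
)

set.seed(2013)  # arbitrary seed, used only so that the split can be reproduced
chapters$randu <- runif(nrow(chapters), 0, 1)
train <- chapters[chapters$randu <= 0.4, , drop = FALSE]
test  <- chapters[chapters$randu >  0.4, , drop = FALSE]

# Every chapter lands in exactly one of the two samples
nrow(train) + nrow(test)
# How many chapters of each book ended up in each sample
table(train$chumash)
table(test$chumash)
```

Checking the per-book tables is worthwhile with a data set this small, since a purely random split can leave a short book such as Vayikra with very few training chapters.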

For this analysis, it seemed that using Random Forests made the most sense. However, I wasn’t quite sure if I should use the raw counts, or proportions, so I tried both. After whittling down the variables in both models, here are the final training model definitions:

library(randomForest)
torah.rf = randomForest(chumash ~ al + el + asher + caasher + et + ki + hayah + gam + mah + loh + haheehoo + oh + heehoo, data=torah.train, ntree=5000, importance=TRUE, mtry=8)

torah.rf.props = randomForest(chumash ~ al_1 + el_1 + asher_1 + caasher_1 + col_1 + hayah_1 + gam_1 + mah_1 + loh_1 + im_1 + ele_1 + mi_1 + oh_1 + heehoo_1, data=torah.train, ntree=5000, importance=TRUE, mtry=8)
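
The `_1` variables in the second model are presumably the proportion versions of the raw counts, that is, each count divided by the total number of words in the chapter. A minimal sketch of how such columns could be derived is shown here; the numbers are made up, and the assumption is that a `wordcount` column holds the chapter totals, as in the histogram earlier.

```r
# Made-up counts for three chapters; only the column names echo the real data
toy <- data.frame(
  wordcount = c(400, 250, 500),
  asher     = c(12, 5, 20),
  loh       = c(8, 10, 5)
)

count_cols <- c("asher", "loh")
# Divide each count column by the chapter's total word count
props <- toy[count_cols] / toy$wordcount
names(props) <- paste0(count_cols, "_1")
toy <- cbind(toy, props)
toy
```

Proportions remove the effect of chapter length, so a long chapter no longer looks different from a short one merely because it has more words.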

As you can see, the final models were mostly the same, but with a few differences. Following are the variable importances from each Random Forests model:

> importance(torah.rf)

Word	MeanDecreaseAccuracy	MeanDecreaseGini
hayah	31.05139	5.979567
heehoo	20.041149	4.805793
loh	18.861843	6.244709
mah	18.798385	4.316487
al	16.85064	5.038302
caasher	15.101464	3.256955
et	14.708421	6.30228
asher	14.554665	5.866929
oh	13.585169	2.38928
el	13.010169	5.605561
gam	5.770484	1.652031
ki	5.489	4.005724
haheehoo	2.330776	1.375457

> importance(torah.rf.props)

Word	MeanDecreaseAccuracy	MeanDecreaseGini
asher_1	37.074235	6.791851
heehoo_1	29.87541	5.544782
al_1	26.18609	5.365927
el_1	17.498034	5.003144
col_1	17.051121	4.530049
hayah_1	16.512206	5.220164
loh_1	15.761723	5.157562
ele_1	14.795885	3.492814
mi_1	12.391427	4.380047
gam_1	12.209273	1.671199
im_1	11.386682	2.651689
oh_1	11.336546	1.370932
mah_1	9.133418	3.58483
caasher_1	5.135583	2.059358

It’s funny that the results, from a raw numbers perspective, show hayah, the Hebrew verb for “to be”, at the top of the list. That’s the same result as in the Shakespeare et al. analysis! Having established that all variables in each model had some kind of an effect on the classification, the next task was to test each model on the testing sample, and see how well each chumash/book of the Torah could be classified by that model:

> torah.test$pred.chumash = predict(torah.rf, torah.test, type="response")
> torah.test$pred.chumash.props = predict(torah.rf.props, torah.test, type="response")

> xtabs(~torah.test$chumash + torah.test$pred.chumash)
                  torah.test$pred.chumash
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            4            5          2         8          7
       'Bereishit'           1           14          1        14          2
       'Devarim'             1            2         17         0          1
       'Shemot'              2            4          4         9          2
       'Vayikra'             5            0          4         0          5

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash),1)
                  torah.test$pred.chumash
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.15384615   0.19230769 0.07692308 0.30769231 0.26923077
       'Bereishit'  0.03125000   0.43750000 0.03125000 0.43750000 0.06250000
       'Devarim'    0.04761905   0.09523810 0.80952381 0.00000000 0.04761905
       'Shemot'     0.09523810   0.19047619 0.19047619 0.42857143 0.09523810
       'Vayikra'    0.35714286   0.00000000 0.28571429 0.00000000 0.35714286

> xtabs(~torah.test$chumash + torah.test$pred.chumash.props)
                  torah.test$pred.chumash.props
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
       'Bamidbar'            0            5          0        13          8
       'Bereishit'           1           16          0        13          2
       'Devarim'             0            2         11         4          4
       'Shemot'              1            4          2        13          1
       'Vayikra'             3            3          0         0          8

> prop.table(xtabs(~torah.test$chumash + torah.test$pred.chumash.props),1)
                  torah.test$pred.chumash.props
torah.test$chumash  'Bamidbar'  'Bereishit'  'Devarim'   'Shemot'  'Vayikra'
       'Bamidbar'   0.00000000   0.19230769 0.00000000 0.50000000 0.30769231
       'Bereishit'  0.03125000   0.50000000 0.00000000 0.40625000 0.06250000
       'Devarim'    0.00000000   0.09523810 0.52380952 0.19047619 0.19047619
       'Shemot'     0.04761905   0.19047619 0.09523810 0.61904762 0.04761905
       'Vayikra'    0.21428571   0.21428571 0.00000000 0.00000000 0.57142857

So, from the perspective of the raw number of times each function word was used, Devarim, or Deuteronomy, seemed to be the most internally consistent, with 81% of the chapters in the testing sample correctly classified. Interestingly, from the perspective of the proportion of times each function word was used, we see that Devarim, Shemot, and Vayikra (Deuteronomy, Exodus, and Leviticus) had over 50% of their chapters correctly classified in the testing sample.
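
The per-book figures quoted here are simply the diagonal of the confusion table divided by the row totals. As a sketch, the raw-count table from above can be re-entered by hand and summarised in base R (the object names are illustrative only):

```r
books <- c("Bamidbar", "Bereishit", "Devarim", "Shemot", "Vayikra")
# Rows are the true book, columns the predicted book (raw-count model, test sample)
conf_raw <- matrix(
  c(4,  5,  2,  8, 7,
    1, 14,  1, 14, 2,
    1,  2, 17,  0, 1,
    2,  4,  4,  9, 2,
    5,  0,  4,  0, 5),
  nrow = 5, byrow = TRUE, dimnames = list(actual = books, predicted = books)
)

# Share of each book's test chapters that were assigned to the right book
per_book <- diag(conf_raw) / rowSums(conf_raw)
round(per_book, 2)

# Overall share of test chapters classified correctly
overall <- sum(diag(conf_raw)) / sum(conf_raw)
round(overall, 2)
```

For the raw-count model this gives 17 of 21 Devarim chapters (about 81%) and 49 of 114 test chapters overall (about 43%).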

I’m tempted to say here, at the least, that there is evidence that at least Devarim was written with one stylistic framework in mind, and potentially one distinct author. From a proportion point of view, it appears that Shemot and Vayikra also show an internal consistency suggestive of close to one stylistic framework, or possibly a distinct author for each book. I’m definitely skeptical of my own analysis, but what do you think?

The last part of this analysis comes from a suggestion given to me by a friend, which was that once I modelled the unique profiles of function words within each book of the Torah, I should use that model on some post-Torah texts. Apparently one idea is that the “Deuteronomist Source” was also behind the writing of Joshua, Judges, and Kings. If the same author was behind all four books, then when I apply my model to these three books, their chapters should tend to be classified as Devarim/Deuteronomy, more so than as any other book.

As above, below I show the distribution of word count by book, for comparison’s sake:

> m = ggplot(neviim, aes(wordcount))
> m + geom_histogram() + facet_grid(chumash ~ .)

Interestingly, chapters in these particular post-Torah texts seem to be a bit longer, on average, than those in the Torah.

Next, I gathered counts of the same function words in Joshua, Judges, and Kings as I had for the 5 books of the Torah, and tested my random forests Torah model on them. As you can see below, the result was anything but clear on that matter:

> xtabs(~neviim$chumash + neviim$pred.chumash)
              neviim$pred.chumash
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           3            7          7         6          1
      'Judges'           2           11          2         6          0
      'Kings'            0            8          3        10          1

> xtabs(~neviim$chumash + neviim$pred.chumash.props)
              neviim$pred.chumash.props
neviim$chumash  'Bamidbar'  'Bereishit'  'Devarim'  'Shemot'  'Vayikra'
      'Joshua'           2            8          6         7          1
      'Judges'           0            9          2         9          1
      'Kings'            0            7          6         7          2

I didn’t even bother to re-express these tables as fractions, because it’s quite clear that none of the books of the Prophets that I analyzed was clearly classified in any one category. Looking at these tables, there doesn’t yet seem to me to be any evidence, from this analysis, that whoever authored Devarim/Deuteronomy also authored these post-Torah texts.
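
For readers who would still like to see the fractions, the raw-count table above can be re-entered by hand and converted with `prop.table`, as in this sketch (object names are illustrative only):

```r
torah_books <- c("Bamidbar", "Bereishit", "Devarim", "Shemot", "Vayikra")
# Predictions of the raw-count model for the three books of the Prophets
neviim_raw <- matrix(
  c(3,  7, 7,  6, 1,
    2, 11, 2,  6, 0,
    0,  8, 3, 10, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("Joshua", "Judges", "Kings"), predicted = torah_books)
)

# Row-wise fractions: each row sums to 1
neviim_frac <- prop.table(neviim_raw, margin = 1)
round(neviim_frac, 2)

# Share of each book's chapters that the model assigned to Devarim
round(neviim_frac[, "Devarim"], 2)
```

Under the raw-count model, Devarim receives 7 of Joshua's 24 chapters, 2 of 21 for Judges, and 3 of 22 for Kings, so it is not the majority class for any of the three books.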

Conclusion

I don’t think that this has been a full enough analysis. There are a few things in it that bother me, or make me wonder, that I’d love input on. Let me list those things:

As mentioned above, I’m missing the inclusion of the Hebrew “and” in this analysis. I’d like to know how to extract counts of the Hebrew “and” without extracting counts of the Hebrew letter “vav” where it doesn’t signify “and”.
Similar to my exclusion of “and”, there are a few one-letter prepositions that I have not included as individual predictor variables, namely ל, ב, כ, מ, signifying “to”, “in”/“with”, “like”, and “from”. How do I count these without counting the same letters that begin a different word and don’t mean the same thing?
Is it more valid to consider the results of my analyses that were done on the word frequencies as proportions (individual word count divided by total number of words in the chapter), or are both valid?
Does a list exist somewhere that details, chapter-by-chapter, which source is believed to have written the Torah text, according to the Documentary Hypothesis, or some more modern incarnation of the Hypothesis? I feel that if I were able to categorize the chapters specifically, rather than just attributing them to the whole book (as a proxy of authorship), then the analysis might be a lot more interesting.

All that being said, I’m intrigued that when you look at the raw number of how often the function words were used, Devarim/Deuteronomy seems to be the most internally consistent. If you’d like, you can look at the Python code that I used to scrape the chabad.org website here: python code for scraping, although please forgive the rudimentary coding! You can get the dataset that I collected for the Torah word counts here: Torah Data Set, and the data set that I collected for the Prophetic text word counts here: Neviim data set. By all means, do the analysis yourself and tell me how to do it better!
