OneR in Medical Research: Finding Leading Symptoms, Main Predictors and Cut-Off Points

[This article was first published on R-Bloggers – Learning Machines, and kindly contributed to R-bloggers].


We have already seen many examples that make use of the OneR package (on CRAN); they can be found in the respective Category: OneR.

Here we will give you some concrete examples from research on Type 2 Diabetes Mellitus (DM) to show that the package is especially well suited to medical research, so read on!

One of the big advantages of the package is that the resulting models are often not only highly accurate but also very easy to interpret:

  • the predictors are ordered from best to worst (based on accuracy), the best one is chosen,
  • the model is given in the form of simple if-then rules,
  • the rules contain exact cut-off points.
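To make these three points concrete, here is a minimal toy re-implementation of the OneR idea in base R (this is a sketch for illustration, not the package's actual code): for each categorical predictor, predict the majority class within each level and pick the predictor with the highest overall accuracy.

```r
# Toy sketch of the OneR idea in base R (not the OneR package itself)
one_rule <- function(data, target) {
  y <- data[[target]]
  predictors <- setdiff(names(data), target)
  # accuracy of each predictor: majority class per level, summed hits
  acc <- sapply(predictors, function(p) {
    hits <- tapply(y, data[[p]], function(cl) max(table(cl)))
    sum(hits) / length(y)
  })
  best <- names(which.max(acc))  # predictors ordered by accuracy, best chosen
  list(attribute = best,
       accuracy  = acc[best],
       # simple if-then rules: majority class for each level of the best predictor
       rules     = tapply(y, data[[best]],
                          function(cl) names(which.max(table(cl)))))
}

# hypothetical mini example in the spirit of the diabetes data below
toy <- data.frame(
  Polyuria = c("Yes", "Yes", "No", "No", "Yes", "No"),
  Gender   = c("M",   "F",   "M",  "F",  "M",   "F"),
  class    = c("Positive", "Positive", "Negative",
               "Negative", "Positive", "Positive")
)
one_rule(toy, "class")  # chooses Polyuria (accuracy 5/6) over Gender (4/6)
```

The real package additionally handles numeric attributes, ties, and missing values, but the core logic is exactly this one-attribute rule.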

An additional advantage over other methods is that the included optbin function finds as many cut-off points as are needed to separate all the classes, instead of just one (as with, e.g., a single decision-tree split).
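The built-in iris dataset (three classes) shows why this matters. The following base-R snippet uses two hand-picked cut-off points on Petal.Width (0.8 and 1.75, chosen here purely for illustration; optbin would find suitable points automatically), something a single cut-off could never achieve for three classes:

```r
# Base-R illustration (hand-picked cut-offs, not optbin itself):
# one cut-off point on Petal.Width can separate at most two groups,
# two cut-off points can handle all three species.
data(iris)
bins <- cut(iris$Petal.Width,
            breaks = c(-Inf, 0.8, 1.75, Inf),
            labels = c("low", "mid", "high"))

# cross-table of bins vs. species
table(bins, iris$Species)

# accuracy of the "majority class per bin" rule
sum(apply(table(bins, iris$Species), 1, max)) / nrow(iris)
```

The "low" bin captures all 50 setosa flowers exactly, and the two remaining bins separate versicolor and virginica quite well; with only one cut-off point, one of the three species would necessarily be merged with another.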

For more advantages, a quick introduction, and a real-world example from histology (the study of the microscopic structure of tissues), namely breast cancer detection, have a look at the official vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.

The first example is based on the early-stage diabetes risk prediction dataset from the Queen Mary University of London, which contains sign and symptom data of newly diabetic or would-be diabetic patients (diabetes_data_upload.csv). We use this dataset to find the leading symptoms of diabetes:

library(OneR)

# leading symptoms
data1 <- read.csv("data/diabetes_data_upload.csv") # adjust path accordingly
OneR(data1, verbose = TRUE)
## 
##     Attribute          Accuracy
## 1 * Polyuria           82.31%  
## 2   Polydipsia         80.19%  
## 3   partial.paresis    69.23%  
## 4   sudden.weight.loss 69.04%  
## 5   Gender             68.08%  
## 6   Alopecia           65.96%  
## 7   Polyphagia         65.58%  
## 8   Age                64.42%  
## 9   weakness           63.65%  
## 10  Genital.thrush     61.54%  
## 10  visual.blurring    61.54%  
## 10  Itching            61.54%  
## 10  Irritability       61.54%  
## 10  delayed.healing    61.54%  
## 10  muscle.stiffness   61.54%  
## 10  Obesity            61.54%  
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## Call:
## OneR.data.frame(x = data1, verbose = TRUE)
## 
## Rules:
## If Polyuria = No  then class = Negative
## If Polyuria = Yes then class = Positive
## 
## Accuracy:
## 428 of 520 instances classified correctly (82.31%)

As we can see in the table, the leading symptoms are polyuria (excessive urine production) and polydipsia (excessive thirst), with an accuracy of over 80 percent each. This result is corroborated by the medical literature.

The next dataset is the famous Pima Indians Diabetes Database, which is often used as a benchmark for machine learning methods. It can be found in the mlbench package (on CRAN):

# glucose
library(mlbench)

data("PimaIndiansDiabetes")
data2 <- PimaIndiansDiabetes
OneR(optbin(data2))
## 
## Call:
## OneR.data.frame(x = optbin(data2))
## 
## Rules:
## If glucose = (-0.199,141] then diabetes = neg
## If glucose = (141,199]    then diabetes = pos
## 
## Accuracy:
## 573 of 768 instances classified correctly (74.61%)

Glucose (blood sugar) with a cut-off value of 141 is identified as the main predictor of DM; the “official” cut-off point is 140 mg/dl.

The last dataset is from a National Health and Nutrition Examination Survey (NHANES): nhgh.rda (here you can find more info on the attributes of the dataset).

# HbA1c
load("data/nhgh.rda") # adjust path accordingly
data3 <- nhgh[ , !names(nhgh) %in% c("seqn", "tx")]
OneR(optbin(dx ~., data = data3, method = "infogain"))
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit):
## target is numeric
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit): 1452
## instance(s) removed due to missing values
## 
## Call:
## OneR.data.frame(x = optbin(dx ~ ., data = data3, method = "infogain"))
## 
## Rules:
## If gh = (3.99,6.4] then dx = 0
## If gh = (6.4,15.5] then dx = 1
## 
## Accuracy:
## 4955 of 5343 instances classified correctly (92.74%)

Here HbA1c (glycated hemoglobin, measured primarily to determine the three-month average blood sugar level) with a cut-off value of 6.4 is identified as the main predictor of DM, with an accuracy of nearly 93%; the “official” cut-off point lies at 6.5%.

In fact, several researchers around the world already use the OneR package. To give just one publication: Computational prediction of diagnosis and feature selection on mesothelioma patient health records by D. Chicco and C. Rovelli, PLoS One, 2019.

I myself have a paper on COVID-19 under review, written in cooperation with Dr. med. Anna Laura Herzog and Prof. Dr. med. Patrick Meybohm, both from the renowned University Hospital Würzburg, in which we used the OneR package among other machine learning methods.

I hope you can see that the OneR package is well worth a try (not only) in the field of medical research. If you have a project in mind for which you are looking for a cooperation partner, please leave a note in the comments or contact me directly: About.

