OneR in Medical Research: Finding Leading Symptoms, Main Predictors and Cut-Off Points
We have already published many examples that make use of the OneR package (on CRAN); they can be found in the respective Category: OneR.
Here we will give you some concrete examples in the area of research on Type 2 Diabetes Mellitus (DM) to show that the package is especially well suited in the field of medical research, so read on!
One of the big advantages of the package is that the resulting models are often not only highly accurate but very easy to interpret:
- the predictors are ordered from best to worst (based on accuracy), the best one is chosen,
- the model is given in the form of simple if-then rules,
- the rules contain exact cut-off points.
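The three points above can be sketched in a few lines of base R. This is only a minimal illustration of the idea, not the package's actual implementation: the data frame `toy` and the helper `one_rule` are hypothetical names invented for this example, and the data are made up.

```r
# Hypothetical toy data: two yes/no symptoms and a diagnosis (made up for illustration)
toy <- data.frame(
  polyuria = c("Yes", "Yes", "Yes", "No", "No", "No", "Yes", "No"),
  itching  = c("Yes", "No",  "Yes", "No", "Yes", "No", "No",  "Yes"),
  class    = c("Positive", "Positive", "Positive", "Negative",
               "Negative", "Negative", "Positive", "Positive"),
  stringsAsFactors = TRUE
)

# For one predictor: map each level to its most frequent class and count the hits
one_rule <- function(x, y) {
  tab <- table(x, y)                                # contingency table: level x class
  rule <- colnames(tab)[apply(tab, 1, which.max)]   # majority class per level
  names(rule) <- rownames(tab)
  list(rule = rule, correct = sum(apply(tab, 1, max)))
}

predictors <- setdiff(names(toy), "class")
rules <- lapply(toy[predictors], one_rule, y = toy$class)

# Rank predictors by training accuracy, best first, and pick the winner
accuracy <- sort(sapply(rules, function(r) r$correct) / nrow(toy), decreasing = TRUE)
best <- names(accuracy)[1]

print(round(100 * accuracy, 2))   # accuracy per predictor in percent
print(rules[[best]]$rule)         # if-then rule of the best predictor
```

On this toy data, polyuria classifies 7 of 8 cases correctly and itching 5 of 8, so polyuria is chosen and its rule reads "No -> Negative, Yes -> Positive".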
An additional advantage compared to other methods is that the included optbin function finds as many cut-off points as are needed to separate all the classes, rather than just one (as, e.g., with decision trees).
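To see this on a target with more than two classes, here is a short sketch using the built-in iris data (which is not part of this post's medical examples and serves only as an assumption-free, readily available dataset). With three species, one would expect two cut-off points for a numeric predictor.

```r
library(OneR)

# iris has three classes; the last column (Species) is treated as the target
binned <- optbin(iris)

# Inspect the intervals found for one predictor: three bins mean two cut-off points
levels(binned$Petal.Width)

# Build the one-rule model on the binned data and show rules and accuracy
model <- OneR(binned)
summary(model)
```

The binned data frame can be passed to `OneR` directly, exactly as in the diabetes examples in this post.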
For more advantages, a quick introduction, and a real-world example of breast cancer detection in the area of histology (the study of the microscopic structure of tissues), have a look at the official vignette: OneR – Establishing a New Baseline for Machine Learning Classification Models.
The first example is based on the early-stage diabetes risk prediction dataset from the Queen Mary University of London, which contains the sign and symptom data of newly diabetic or would-be diabetic patients (diabetes_data_upload.csv). We use this dataset to find the leading symptoms of diabetes:
library(OneR)

# leading symptoms
data1 <- read.csv("data/diabetes_data_upload.csv") # adjust path accordingly
OneR(data1, verbose = TRUE)
## 
##     Attribute          Accuracy
## 1 * Polyuria           82.31%
## 2   Polydipsia         80.19%
## 3   partial.paresis    69.23%
## 4   sudden.weight.loss 69.04%
## 5   Gender             68.08%
## 6   Alopecia           65.96%
## 7   Polyphagia         65.58%
## 8   Age                64.42%
## 9   weakness           63.65%
## 10  Genital.thrush     61.54%
## 10  visual.blurring    61.54%
## 10  Itching            61.54%
## 10  Irritability       61.54%
## 10  delayed.healing    61.54%
## 10  muscle.stiffness   61.54%
## 10  Obesity            61.54%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
## 
## Call:
## OneR.data.frame(x = data1, verbose = TRUE)
## 
## Rules:
## If Polyuria = No  then class = Negative
## If Polyuria = Yes then class = Positive
## 
## Accuracy:
## 428 of 520 instances classified correctly (82.31%)
As we can see in the table, the leading symptoms are polyuria (excessive urination volume) and polydipsia (excessive thirst), with an accuracy of over 80 percent each. This result is corroborated by the medical literature.
The next dataset is the quite famous Pima Indians Diabetes Database, which is often used as a benchmark for machine learning methods. It can be found in the mlbench package (on CRAN):
# glucose
library(mlbench)
data("PimaIndiansDiabetes")
data2 <- PimaIndiansDiabetes
OneR(optbin(data2))
## 
## Call:
## OneR.data.frame(x = optbin(data2))
## 
## Rules:
## If glucose = (-0.199,141] then diabetes = neg
## If glucose = (141,199]    then diabetes = pos
## 
## Accuracy:
## 573 of 768 instances classified correctly (74.61%)
Glucose (blood sugar) with a cut-off value of 141 is identified as the main predictor of DM; the “official” cut-off point is 140 mg/dl.
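The intervals in the rules follow the usual R notation: a round bracket excludes the boundary and a square bracket includes it, so a value of exactly 141 still falls into the lower bin. The following base R sketch applies the rule to a handful of hypothetical glucose readings (made up for illustration, not taken from the Pima data). It assumes the printed boundaries are the exact cut-off points, whereas the labels in the model output may be rounded.

```r
# Hypothetical glucose readings for illustration only
glucose <- c(89, 141, 142, 120, 199, 0)

# Same boundaries as printed in the rule; intervals are open on the left, closed on the right
bins <- cut(glucose, breaks = c(-0.199, 141, 199))

# Map each interval to the class given by the rule (lower bin -> neg, upper bin -> pos)
rule <- setNames(c("neg", "pos"), levels(bins))
prediction <- unname(rule[as.character(bins)])

data.frame(glucose, bins, prediction)
```

The lower boundary of -0.199 lies just below the smallest value in the data (0), so that this value is still covered by the first bin.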
The last dataset is from a National Health and Nutrition Examination Survey (NHANES): nhgh.rda (here you can find more info on the attributes of the dataset).
# HbA1c
load("data/nhgh.rda") # adjust path accordingly
data3 <- nhgh[ , !names(nhgh) %in% c("seqn", "tx")]
OneR(optbin(dx ~., data = data3, method = "infogain"))
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit):
## target is numeric
## Warning in optbin.data.frame(x = data, method = method, na.omit = na.omit): 1452
## instance(s) removed due to missing values
## 
## Call:
## OneR.data.frame(x = optbin(dx ~ ., data = data3, method = "infogain"))
## 
## Rules:
## If gh = (3.99,6.4] then dx = 0
## If gh = (6.4,15.5] then dx = 1
## 
## Accuracy:
## 4955 of 5343 instances classified correctly (92.74%)
Here HbA1c (glycated hemoglobin, measured primarily to determine the three-month average blood sugar level) with a cut-off value of 6.4% is identified as the main predictor of DM, with an accuracy of nearly 93%; the “official” cut-off point lies at 6.5%.
In fact, several researchers around the world already use the OneR package. To name just one publication: “Computational prediction of diagnosis and feature selection on mesothelioma patient health records” by D. Chicco and C. Rovelli, PLoS One, 2019.
I myself have a paper on COVID-19 under review which was submitted in cooperation with Dr. med. Anna Laura Herzog and Prof. Dr. med. Patrick Meybohm, both from the renowned University Hospital Würzburg, where we used the OneR package among other machine learning methods.
I hope that you can see that the OneR package is well worth a try (not only) in the field of medical research. If you have a project in mind for which you are looking for a cooperation partner, please leave a note in the comments or contact me directly via the About page.