Site icon R-bloggers

learningmachine v1.0.0: prediction intervals around the probability of the event ‘a tumor being malignant’

[This article was first published on T. Moudiki's Webpage - R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Considering the number of people who read this post, a lot of you are probably using learningmachine v0.2.3. Maybe because of the fancy name. Just so you know, learningmachine is only doing batch learning at the moment. Stay tuned.

Well, today, there are good news and bad news. The good news is learningmachine is back with v1.0.0 (Python port coming next week). The “bad” news is: jumping to v1.0.0 this early means there’s a change in the interface (that won’t change drastically anymore); with a lot of good reasons:

v0.2.3 remains available on a branch.

The new features are:

learningmachine is still experimental, probably with some quirks (because achieving this level of abstraction required some effort), with no beautiful documentation, but you can already tinker it and do advanced analysis, as shown below. You may also like this vignette and this vignette.

utils::install.packages("caret")

utils::install.packages("dfoptim")

utils::install.packages("ggplot2")

utils::install.packages("mlbench")

utils::install.packages("ranger")

utils::install.packages("remotes")

remotes::install_github("Techtonique/learningmachine")

library(learningmachine)

library(ggplot2)
library(mlbench)
library(ranger)

data("BreastCancer")
BreastCancer$Id <- NULL
rownames(BreastCancer) <- NULL 
y <- as.factor(BreastCancer$Class)
X <- BreastCancer[,-10]
X$Bare.nuclei[is.na(X$Bare.nuclei)] <- median(as.numeric(BreastCancer$Bare.nuclei[!is.na(BreastCancer$Bare.nuclei)]))
apply(X, 2, function(x) sum(is.na(x)))

   Cl.thickness       Cell.size      Cell.shape   Marg.adhesion    Epith.c.size 
              0               0               0               0               0 
    Bare.nuclei     Bl.cromatin Normal.nucleoli         Mitoses 
              0               0               0               0 

for (i in seq_len(ncol(X)))
{
  X[,i] <- as.numeric(X[,i])
}


index_train <- caret::createDataPartition(y, p = 0.8)$Resample1
X_train <- X[index_train, ]
y_train <- y[index_train]
X_test <- X[-index_train, ]
y_test <- y[-index_train]
dim(X_train)

[1] 560   9

dim(X_test)

[1] 139   9

obj <- learningmachine::Classifier$new(method = "ranger")
obj$get_type()

[1] "classification"

obj$get_name()

[1] "Classifier"

obj$set_B(10)
obj$set_level(95)

t0 <- proc.time()[3]
obj$fit(X_train, y_train, pi_method="kdesplitconformal") # this will be described in a paper
cat("Elapsed: ", proc.time()[3] - t0, "s \n")

Elapsed:  0.123 s 

probs <- obj$predict_proba(X_test)

obj$summary(X_test, y=y_test, 
            class_name = "malignant",
            show_progress=FALSE)

$Coverage_rate
[1] 95.68345

$ttests
                     estimate         lower       upper      p-value signif
Cl.thickness     0.0056807801  0.0024459156 0.008915645 0.0006893052    ***
Cell.size        0.0039919446  0.0011625077 0.006821382 0.0060221736     **
Cell.shape       0.0023459459  0.0005416303 0.004150262 0.0112039276      *
Marg.adhesion    0.0042356479  0.0018622609 0.006609035 0.0005676013    ***
Epith.c.size    -0.0001036245 -0.0013577745 0.001150525 0.8704619531       
Bare.nuclei      0.0104212402  0.0031755384 0.017666942 0.0051349801     **
Bl.cromatin      0.0051171380 -0.0002930096 0.010527286 0.0635723868      .
Normal.nucleoli  0.0067594459  0.0024786650 0.011040227 0.0021872093     **
Mitoses          0.0007052483 -0.0001171510 0.001527648 0.0922097961      .

$effects
── Data Summary ────────────────────────
                           Values 
Name                       effects
Number of rows             139    
Number of columns          9      
_______________________           
Column type frequency:            
  numeric                  9      
________________________          
Group variables            None   

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable        mean      sd       p0 p25 p50 p75   p100 hist 
1 Cl.thickness     0.00568  0.0193  -0.0178    0   0   0 0.158  ▇▁▁▁▁
2 Cell.size        0.00399  0.0169  -0.0136    0   0   0 0.116  ▇▁▁▁▁
3 Cell.shape       0.00235  0.0108  -0.0209    0   0   0 0.0827 ▁▇▁▁▁
4 Marg.adhesion    0.00424  0.0142  -0.00497   0   0   0 0.116  ▇▁▁▁▁
5 Epith.c.size    -0.000104 0.00748 -0.0371    0   0   0 0.0409 ▁▁▇▁▁
6 Bare.nuclei      0.0104   0.0432   0         0   0   0 0.297  ▇▁▁▁▁
7 Bl.cromatin      0.00512  0.0323  -0.0171    0   0   0 0.366  ▇▁▁▁▁
8 Normal.nucleoli  0.00676  0.0255  -0.00125   0   0   0 0.126  ▇▁▁▁▁
9 Mitoses          0.000705 0.00490  0         0   0   0 0.0507 ▇▁▁▁▁

df <- reshape2::melt(probs$sims$malignant[c(1, 5), ])
df$Var2 <- NULL 
colnames(df) <- c("individual", "prob_malignant")
df$individual <- as.factor(df$individual)
ggplot2::ggplot(df, aes(x=prob_malignant, fill=individual)) + geom_histogram(alpha=.3) +
  theme(
    panel.background = element_rect(fill='transparent'),
    plot.background = element_rect(fill='transparent', color=NA),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    legend.background = element_rect(fill='transparent'),
    legend.box.background = element_rect(fill='transparent')
  )

t.test(subset(df, individual == 1)$prob_malignant)

    One Sample t-test

data:  subset(df, individual == 1)$prob_malignant
t = 323.02, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.6990101 0.7076507
sample estimates:
mean of x 
0.7033304 

t.test(subset(df, individual == 2)$prob_malignant)

    One Sample t-test

data:  subset(df, individual == 2)$prob_malignant
t = 222.29, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 0.5023095 0.5113577
sample estimates:
mean of x 
0.5068336 

t.test(prob_malignant ~ individual, data = df)

    Welch Two Sample t-test

data:  prob_malignant by individual
t = 62.327, df = 197.58, p-value < 2.2e-16
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 0.1902796 0.2027140
sample estimates:
mean in group 1 mean in group 2 
      0.7033304       0.5068336 
To leave a comment for the author, please follow the link and comment on their blog: T. Moudiki's Webpage - R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Exit mobile version