Start with wordcloud

[This article was first published on Joris Muller's blog - Posts about R, and kindly contributed to R-bloggers.]
    Following the good resolution from my previous post to practise data analysis, I started to play with the French drug database.

    After importing the data, I started, classically, with data visualisation. This database contains a lot of text data, and a wordcloud is always a welcome way to visualise it. Wordclouds may not be accurate at all, but from my point of view they are a very good illustration of a text-based dataset.

    To my knowledge, there are two main wordcloud packages in R: wordcloud and wordcloud2.

    Let’s play with them.

    Prepare the data

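    library(dplyr)  # provides %>%, arrange() and desc() used below
    library(knitr)  # provides kable() used below
    # Read the previously imported data
    db <- readRDS("raw_rds/bdpm.rds")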
    

    For example, there is the pharmaceutical form column.

    head(db$forme)
    
    ## [1] "pommade"                                                 
    ## [2] "capsule molle"                                           
    ## [3] "solution injectable"                                     
    ## [4] "solution injectable"                                     
    ## [5] "suspension à diluer pour perfusion"                      
    ## [6] "poudre et pommade et comprimé et granules et solution(s)"
    

    There are a lot of different forms:

    uforme <- unique(db$forme)
    length(uforme)
    
    ## [1] 405
    

    That makes 405 distinct values. But a single line sometimes lists several forms, separated by “et” (French for “and”). Let’s split them to find the truly distinct forms.

    forms <- db$forme %>%
      strsplit(split = " et ") %>%
      unlist() 
      
    length(unique(forms))  
    
    ## [1] 393
    
    head(forms)
    
    ## [1] "pommade"                           
    ## [2] "capsule molle"                     
    ## [3] "solution injectable"               
    ## [4] "solution injectable"               
    ## [5] "suspension à diluer pour perfusion"
    ## [6] "poudre"
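To see the split in isolation, here is a minimal sketch on a toy vector. The values in toy_forms are made up for illustration and are not taken from the database, and fixed = TRUE is my own addition so that the separator is matched literally rather than as a regular expression.

```r
# Hypothetical sample values, not taken from the real database
toy_forms <- c(
  "pommade",
  "poudre et pommade et capsule molle",
  "solution injectable"
)

# strsplit() returns one character vector per input string;
# unlist() flattens them into a single vector of individual forms
split_forms <- unlist(strsplit(toy_forms, split = " et ", fixed = TRUE))

print(split_forms)
print(length(unique(split_forms)))
```

Because the separator includes the surrounding spaces, words that merely contain the letters "et" are left alone.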
    

    Select the 100 most frequent forms:

    head(as.data.frame(table(forms)))
    
    ##                                   forms Freq
    ## 1                              comprimé   17
    ## 2                       comprimé enrobé   19
    ## 3                    comprimé pelliculé   36
    ## 4  comprimé pelliculé buvable pelliculé    1
    ## 5          comprimé pelliculé pelliculé    1
    ## 6                                 crème   17
    
    cent <- forms %>%
      table() %>%
      as.data.frame() %>%
      arrange(desc(Freq)) %>%
      head(100)
    
    kable(head(cent))
    

    name                   Freq
    --------------------  -----
    comprimé pelliculé     2289
    comprimé               1782
    gélule                 1019
    solution injectable     934
    poudre                  886
    comprimé sécable        852
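The same counting step can be sketched in base R with explicit column names, which avoids having to refer to a column called "." afterwards. The toy vector and the names "name" and "Freq" are assumptions chosen for illustration, not the author's exact code.

```r
# Hypothetical vector of forms, made up for illustration
toy_forms <- c("poudre", "pommade", "poudre", "capsule molle",
               "poudre", "pommade")

# Count each distinct value and give the columns explicit names
counts <- as.data.frame(table(toy_forms), stringsAsFactors = FALSE)
names(counts) <- c("name", "Freq")

# Sort from most to least frequent and keep the top 2
top <- head(counts[order(counts$Freq, decreasing = TRUE), ], 2)

print(top)
```

With real data, the 2 would simply become 100.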

    Wordcloud

    Make a wordcloud with the wordcloud package:

    library(wordcloud)
    wordcloud(cent$., freq = cent$Freq)
    

    Not bad. Try something funkier.

    wordcloud(
      words = cent$., 
      freq = cent$Freq, 
      random.color = TRUE, 
      random.order = FALSE, 
      colors = brewer.pal(8, "Dark2")
    )
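Since the layout is partly random, two runs of the same call can give different pictures. Here is a minimal, self-contained sketch of a reproducible call, assuming the wordcloud package is installed; the toy data frame and the seed value are made up for illustration.

```r
library(wordcloud)  # brewer.pal() comes from RColorBrewer, loaded with it

# Hypothetical frequencies, made up for illustration
toy <- data.frame(
  name = c("poudre", "pommade", "capsule molle", "solution injectable"),
  Freq = c(40, 25, 10, 5),
  stringsAsFactors = FALSE
)

# Fix the random seed so the word placement is the same on every run
set.seed(42)

wordcloud(
  words = toy$name,
  freq = toy$Freq,
  min.freq = 1,          # the default of 3 would drop very rare words
  random.order = FALSE,  # plot words in decreasing order of frequency
  colors = brewer.pal(8, "Dark2")
)
```

Lowering min.freq only matters for small datasets like this one; with the 100 most frequent forms, every word is well above the default threshold.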
    

    I find this very informative. Intuitively, it’s possible to see what the most frequent forms are. And it is far more attractive than a table or an unreadable barplot.

    library(ggplot2)
    
    ggplot(cent) +
      aes(x = ., y = Freq) +
      geom_bar(stat = "identity") +
      coord_flip()
    

    OK, it would be possible to make a better plot, but I think you see the point.
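For what a more readable bar plot could look like, here is one possible sketch that orders the bars by frequency. It assumes ggplot2 is installed; the toy data and the axis label are made up for illustration, and with the real data one would probably also keep only the top 10 or 20 forms.

```r
library(ggplot2)

# Hypothetical frequencies, made up for illustration
toy <- data.frame(
  name = c("pommade", "poudre", "capsule molle", "solution injectable"),
  Freq = c(25, 40, 10, 5)
)

# reorder() sorts the factor levels by Freq, so the bars are drawn
# in frequency order instead of alphabetical order
p <- ggplot(toy) +
  aes(x = reorder(name, Freq), y = Freq) +
  geom_col() +        # equivalent to geom_bar(stat = "identity")
  coord_flip() +
  labs(x = NULL, y = "Frequency")

print(p)
```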

    Wordcloud 2

    The wordcloud2 package produces an HTML widget:

    library(wordcloud2)
    wordcloud2(cent)
    

    It’s easier and more fun! Try it, it’s interactive.

    Note: if you want to add this widget to a page, you need to link the proper JavaScript files. In my case, I put this in my markdown file:

    <script src="/assets/2016-12-27-Drug_wordcloud_files/htmlwidgets-0.7/htmlwidgets.js"></script>
    <link href="/assets/2016-12-27-Drug_wordcloud_files/wordcloud2-0.0.1/wordcloud.css" rel="stylesheet">
    <script src="/assets/2016-12-27-Drug_wordcloud_files/wordcloud2-0.0.1/wordcloud2-all.js"></script>
    <script src="/assets/2016-12-27-Drug_wordcloud_files/wordcloud2-0.0.1/hover.js"></script>
    <script src="/assets/2016-12-27-Drug_wordcloud_files/wordcloud2-binding-0.2.0/wordcloud2.js"></script>
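As a possible alternative to linking the files by hand, the htmlwidgets package can write a widget to an HTML file together with its JavaScript and CSS dependencies. This is a minimal sketch, assuming the wordcloud2 and htmlwidgets packages are installed; the toy data and the output path are made up for illustration, and this is not necessarily how the page above was built.

```r
library(wordcloud2)
library(htmlwidgets)

# Hypothetical frequencies; wordcloud2() takes the words from the first
# column and the frequencies from the second
toy <- data.frame(
  word = c("poudre", "pommade", "capsule molle", "solution injectable"),
  freq = c(40, 25, 10, 5)
)

widget <- wordcloud2(toy)

# With selfcontained = FALSE the dependencies are written to a directory
# next to the HTML file instead of being inlined into it
out_file <- file.path(tempdir(), "wordcloud.html")
saveWidget(widget, file = out_file, selfcontained = FALSE)
```

The generated page already contains the script and link tags pointing at the dependency directory, so it can serve as a reference for what needs to be copied into a site.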
    

    Enough with wordclouds. We have learned that “comprimé” (tablet) is the most frequent pharmaceutical form, followed by “gélule” (capsule), “poudre” (powder) and “granule” (small pill). We can also see that some text cleaning would be necessary to make a proper analysis.
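As a rough idea of what such cleaning might involve, here is a minimal base R sketch. The raw values are made up to show typical problems (stray spaces, repeated spaces, inconsistent case); which of these actually occur in the database is an assumption to check against the data.

```r
# Hypothetical raw values, made up for illustration
raw <- c(" poudre", "poudre ", "Poudre", "solution  injectable")

clean <- tolower(raw)              # normalise case
clean <- gsub("\\s+", " ", clean)  # collapse runs of whitespace into one space
clean <- trimws(clean)             # drop leading and trailing whitespace

print(table(clean))
```

After these three steps, the first three values are counted as the same form instead of three different ones.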
