Survey categorical variables with KableExtra

[This article was first published on R on Amit Levinson, and kindly contributed to R-bloggers].
    For my in-progress thesis I decided to analyze my survey results in something other than the SPSS we learned in undergrad, which eventually led me to R. The time came and I started analyzing my pilot survey data from Qualtrics[1]. In this post I’ll show how I used {KableExtra} to nicely print a frequency table of the categorical and ordinal questions in my survey. You can do what I describe below with other packages too, but I enjoy using {KableExtra} for its rich vignette and clearly defined functions.

    The Problem

    In my pilot survey I had ~20 questions that were categorical or ordinal: simple constructs that didn’t require a thorough analysis, only a quick look at the distribution of responses. I wanted to print all of these variables in one formatted table and address any anomalies if needed.

    The issue was, how can I format printing of all categorical variables in their chronological order, along with the original question and the distribution of responses?

    Basically, how can I achieve an output like the collapsed table printed in the KableExtra section further down?

    Explore our data

    First, let’s load the packages we’ll need and look at our data:

    library(tidyverse)
    library(here)
    library(readxl)
    library(knitr)
    library(kableExtra)
    library(janitor)
    library(scales)
    
    df_survey <- read_xlsx(here("content", "post", "printing-survey-table", "data","survey.xlsx"))
    head(df_survey)
    ## # A tibble: 6 x 12
    ##   Q2     Q4     Q6     Q7    Q11_1   Q13_1   Q15   Q26   Q27   Q30   Q32   Q37  
    ##   <chr>  <chr>  <chr>  <chr> <chr>   <chr>   <chr> <chr> <chr> <chr> <chr> <chr>
    ## 1 Female Salar~ Simil~ Cent~ Once a~ At lea~ Never No    <NA>  <NA>  No    Yes  
    ## 2 Female Salar~ Below~ South Severa~ Every ~ Never No    <NA>  <NA>  No    Yes  
    ## 3 Female Stude~ Below~ South Betwee~ I've n~ Never No    <NA>  <NA>  No    Yes  
    ## 4 Female Unemp~ Simil~ Cent~ Once a~ At lea~ Last~ No    <NA>  <NA>  No    Yes  
    ## 5 Male   Salar~ Simil~ Cent~ Once a~ At lea~ Last~ Yes   2-3 ~ Yes   No    No   
    ## 6 Female Salar~ Simil~ Cent~ Severa~ Last t~ Never No    <NA>  <NA>  No    Yes

    So we have a lot of information describing our sample data records. While this data is fabricated, it mirrors a common survey dataset: Each row represents a respondent with answers to various questions. With respect to continuous variables I did a different analysis, so for the purpose of the following post we’ll need only character columns. Let’s start by removing anything other than the relevant columns:

    df_char <- df_survey %>% 
      janitor::clean_names() %>% 
      select(where(is.character))

    We’ll need to change the data to a long form so that we can print it for efficient reading. An easy approach is to use the pivot_longer function, gathering all our columns into one long table, and then count the answers within each question:

    df_long <- df_char %>% 
      pivot_longer(q2:q37, names_to = "question") %>% 
      count(question, value) %>%
      group_by(question) %>% 
      mutate(pct = percent(n/sum(n)))
    
    # Print table
    kbl(df_long) %>% 
      kable_styling() %>% 
      scroll_box(height = "550px")

    | question | value | n | pct |
    | q11_1 | Between once a week to once in a month | 5 | 14.7% |
    | q11_1 | Less than once a month | 4 | 11.8% |
    | q11_1 | Once a day | 6 | 17.6% |
    | q11_1 | Once a week | 4 | 11.8% |
    | q11_1 | Several times a day | 6 | 17.6% |
    | q11_1 | Several times a week | 8 | 23.5% |
    | q11_1 | NA | 1 | 2.9% |
    | q13_1 | About every week | 1 | 2.9% |
    | q13_1 | At least once a month | 13 | 38.2% |
    | q13_1 | Every three months | 5 | 14.7% |
    | q13_1 | Every three to six months | 4 | 11.8% |
    | q13_1 | I’ve never bought | 5 | 14.7% |
    | q13_1 | Last time I bought was over half a year ago | 6 | 17.6% |
    | q15 | Last half a year | 3 | 8.8% |
    | q15 | Last month | 2 | 5.9% |
    | q15 | Last three months | 3 | 8.8% |
    | q15 | Last week | 1 | 2.9% |
    | q15 | Last year | 1 | 2.9% |
    | q15 | Never | 18 | 52.9% |
    | q15 | Over a year ago | 6 | 17.6% |
    | q2 | Female | 15 | 44.1% |
    | q2 | Male | 17 | 50.0% |
    | q2 | Other | 2 | 5.9% |
    | q26 | No | 20 | 59% |
    | q26 | Yes | 13 | 38% |
    | q26 | NA | 1 | 3% |
    | q27 | 2-3 times | 4 | 11.8% |
    | q27 | 5-7 times | 4 | 11.8% |
    | q27 | Over 7 times | 5 | 14.7% |
    | q27 | NA | 21 | 61.8% |
    | q30 | No | 1 | 3% |
    | q30 | Yes | 12 | 35% |
    | q30 | NA | 21 | 62% |
    | q32 | No | 27 | 79.4% |
    | q32 | Yes | 5 | 14.7% |
    | q32 | NA | 2 | 5.9% |
    | q37 | No | 6 | 17.6% |
    | q37 | Yes | 25 | 73.5% |
    | q37 | NA | 3 | 8.8% |
    | q4 | Salaried employee | 18 | 52.9% |
    | q4 | Self employed | 3 | 8.8% |
    | q4 | Student | 8 | 23.5% |
    | q4 | Student,Salaried employee | 3 | 8.8% |
    | q4 | Student,Unemployed | 1 | 2.9% |
    | q4 | Unemployed | 1 | 2.9% |
    | q6 | Above average | 5 | 14.7% |
    | q6 | Below average | 16 | 47.1% |
    | q6 | Similar to average | 13 | 38.2% |
    | q7 | Center | 16 | 47.1% |
    | q7 | South | 18 | 52.9% |
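
    If you’d like to try the pivot-and-count step without my data, here is a minimal sketch on a made-up two-question survey. The toy_survey object and its values are invented for illustration. The accuracy = 0.1 argument is my addition: without it, scales::percent picks a precision separately for each group, which is probably why some questions above print as 59% and others as 44.1%.

```r
library(dplyr)
library(tidyr)
library(scales)

# Toy stand-in for the survey: one row per respondent, one column per question
toy_survey <- tibble(
  q2 = c("Female", "Male", "Female", "Male"),
  q7 = c("Center", "South", "South", NA)
)

toy_long <- toy_survey %>%
  # Stack all question columns into question/value pairs
  pivot_longer(everything(), names_to = "question", values_to = "value") %>%
  # One row per question/answer combination, with its frequency
  count(question, value) %>%
  group_by(question) %>%
  # Share of each answer within its own question, always with one decimal
  mutate(pct = percent(n / sum(n), accuracy = 0.1)) %>%
  ungroup()

toy_long
```

    Note that missing answers are counted as their own NA row, just like in the full table above.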


    Great, this prints nicely, but we’re left with several issues to address[2]:

    1. We’re missing the actual questions – Notice how we only have “q11_1” but no description of what the actual question or variable is. You can add a question label within Qualtrics, but I still wanted to have the question itself presented along with the question number.

    2. Questions were reorganized alphabetically – Once we pivoted and counted, our dataframe ended up sorted alphabetically by the question column (count() sorts its output by the grouping columns), but we might want it ordered according to the survey layout. Of course this is contingent on your data; I wanted to present it aligned to the order of the survey questions.

    3. Some of the responses aren’t ordered – Notice how some of the responses are sorted alphabetically, when ideally we’d want them ordered by hierarchy. For example, question q11_1 has frequency responses that aren’t hierarchically ordered.

    4. Remove duplicated information – Our question column repeats the same value on every row of a question, and so will the question text once we add it. While the value changes within a question, printing the question id and its text on each row is redundant and clutters the table, so the repeated text should be removed.

    So then, let’s address these issues individually.

    Adding information to our questions

    When you download the survey data from Qualtrics you also receive it with the original questions. When I personally analyzed the data I removed it, but here it’s perfect for our display of additional information. There’s also a great function for doing exactly that from the {qualtRics} package, but I was having trouble connecting to the platform’s API through my Qualtrics user.

    Adding the questions was straightforward: just join the current df_long with a dataset containing my questions. We’ll use a copy of the original survey data (fabricated, of course, for the purpose of this post) that only contains the questions:

    df_questions <- read_xlsx(here("content", "post", "printing-survey-table", "data", "questions.xlsx"))
    df_questions[,1:3]
    ## # A tibble: 1 x 3
    ##   Q2     Q4                   Q6                                                
    ##   <chr>  <chr>                <chr>                                             
    ## 1 Gender What's your occupat~ The average income for an individual is X, you're~

    Great! We see each question’s text as a value, with the question ids as the column names. Now let’s render it in a long format so that each row is a question id with the corresponding text as a value, and then we’ll join it with our current dataset of answers:

    df_q_clean <- df_questions %>% 
      clean_names() %>% 
      pivot_longer(cols = q2:q37, names_to = "question", values_to = "text") 
    
    df_long_joined <- left_join(df_long, df_q_clean, by = "question")
    
    head(df_long_joined)
    ## # A tibble: 6 x 5
    ## # Groups:   question [1]
    ##   question value                           n pct   text                         
    ##   <chr>    <chr>                       <int> <chr> <chr>                        
    ## 1 q11_1    Between once a week to onc~     5 14.7% Every how often do you consu~
    ## 2 q11_1    Less than once a month          4 11.8% Every how often do you consu~
    ## 3 q11_1    Once a day                      6 17.6% Every how often do you consu~
    ## 4 q11_1    Once a week                     4 11.8% Every how often do you consu~
    ## 5 q11_1    Several times a day             6 17.6% Every how often do you consu~
    ## 6 q11_1    Several times a week            8 23.5% Every how often do you consu~

    Perfect. However, as you can see, our new text column provides the same information across the same questions, which seems kind of redundant. We’ll keep it for now and address it soon when we turn to print our table.
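
    The same reshape-then-join idea can be sketched on a tiny made-up example. The toy_questions and toy_counts objects below are hypothetical stand-ins for the questions file and the counted answers; naming the join key explicitly with by is optional, but it avoids the message dplyr prints when it has to guess the key.

```r
library(dplyr)
library(tidyr)

# Hypothetical one-row table of question texts, shaped like the questions file
toy_questions <- tibble(q2 = "Gender", q7 = "Where do you live?")

# Hypothetical counted answers in long form
toy_counts <- tibble(
  question = c("q2", "q2", "q7"),
  value    = c("Female", "Male", "South"),
  n        = c(2L, 2L, 3L)
)

# One row per question id, with its text as a value
toy_text <- toy_questions %>%
  pivot_longer(everything(), names_to = "question", values_to = "text")

# Every answer row picks up the text of its question
toy_joined <- left_join(toy_counts, toy_text, by = "question")

toy_joined
```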

    Reordering within and across questions

    The next issue on the list is that we want some of our questions to be organized not by the count frequency or by the alphabet, but by the hierarchy of the answer options. For example, ‘Several times a day’, ‘Once a day’, ‘Several times a week’ and so on, following the hierarchical structure of my ordinal variables.

    Alas, I don’t have a magical automated method and would be grateful to hear about other options you’ve encountered or thought of. I thought of using factors to reorder the levels, but once I pivot my data into a long format the answers are again sorted alphabetically. Instead, I decided to join my current dataframe with an identical one I saved, in which I manually ranked the answers of each relevant ordinal question. Though tedious, for ~20 questions this manual work was quicker for me than automating everything.

    # Save the sorted response file and use that to rank
    # write_csv(df_long_q_sorted, here("content", "post", "printing-survey-table", "data", "answers_hir.csv"))
    
    answer_hir <- read_csv(here("content", "post", "printing-survey-table", "data", "answers_hir.csv"))
    head(answer_hir)
    ## # A tibble: 6 x 3
    ##   question value                      rank
    ##   <chr>    <chr>                     <dbl>
    ## 1 q2       Female                       NA
    ## 2 q2       Male                         NA
    ## 3 q4       Salaried employee            NA
    ## 4 q4       Self employed                NA
    ## 5 q4       Student                      NA
    ## 6 q4       Student,Salaried employee    NA

    We now have our new guide in which we ranked our answers. Notice that the first ranks are NA; that’s because the nominal variables have no intrinsic hierarchy. Now let’s join this dataframe to get a rank column with which to sort our answers:

    df_long_ranked <- left_join(x = df_long_joined, y = answer_hir, by = c("question", "value")) 
    head(df_long_ranked)
    ## # A tibble: 6 x 6
    ## # Groups:   question [1]
    ##   question value                         n pct   text                       rank
    ##   <chr>    <chr>                     <int> <chr> <chr>                     <dbl>
    ## 1 q11_1    Between once a week to o~     5 14.7% Every how often do you c~     5
    ## 2 q11_1    Less than once a month        4 11.8% Every how often do you c~     6
    ## 3 q11_1    Once a day                    6 17.6% Every how often do you c~     2
    ## 4 q11_1    Once a week                   4 11.8% Every how often do you c~     4
    ## 5 q11_1    Several times a day           6 17.6% Every how often do you c~     1
    ## 6 q11_1    Several times a week          8 23.5% Every how often do you c~     3
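
    One possible variation on the ranking file, sketched here on made-up data rather than taken from my actual workflow: keep the answer order inside the script as a small lookup table and join that instead of a CSV. The rank_lookup and toy_answers objects are hypothetical. The ranks still have to be typed by hand; this only changes where they live.

```r
library(dplyr)

# Hypothetical lookup: answers of one ordinal question, typed in the desired order
rank_lookup <- tibble(
  question = "q11_1",
  value    = c("Several times a day", "Once a day",
               "Several times a week", "Once a week")
) %>%
  group_by(question) %>%
  mutate(rank = row_number()) %>%   # position in the typed order becomes the rank
  ungroup()

# Hypothetical answers, in the alphabetical order count() would give
toy_answers <- tibble(
  question = c("q11_1", "q11_1", "q11_1", "q2", "q2"),
  value    = c("Once a day", "Several times a day", "Several times a week",
               "Female", "Male")
)

toy_ranked <- toy_answers %>%
  # Nominal questions have no match in the lookup, so their rank stays NA
  left_join(rank_lookup, by = c("question", "value")) %>%
  arrange(question, rank)

toy_ranked
```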

    We’ll reorder by rank while also solving the next issue, which is the way our questions are ordered. Basically, we want them ordered by question number rather than alphabetically. For example, we’d like q7 to appear before q11_1, as it did in the survey. I’ll apply some regex (regular expression) manipulation to capture only the numbers and sort by them, as follows:

    df_long_q_sorted <- df_long_ranked %>% 
      mutate(q_num = str_remove_all(question, "[a-z]"),
        q_num = str_replace_all(q_num, "_", "."),
        q_num = str_remove(q_num, "\\.$"),
        q_num = as.numeric(q_num)) %>% 
      group_by(question) %>% 
      arrange(q_num, rank) %>% 
      ungroup() %>% 
      select(-c(q_num, rank)) %>% 
      relocate(text, .after = question)
    
    head(df_long_q_sorted)
    ## # A tibble: 6 x 5
    ##   question text                    value                 n pct  
    ##   <chr>    <chr>                   <chr>             <int> <chr>
    ## 1 q2       Gender                  Female               15 44.1%
    ## 2 q2       Gender                  Male                 17 50.0%
    ## 3 q2       Gender                  Other                 2 5.9% 
    ## 4 q4       What's your occupation? Salaried employee    18 52.9%
    ## 5 q4       What's your occupation? Self employed         3 8.8% 
    ## 6 q4       What's your occupation? Student               8 23.5%

    Fantastic!

    I found that first using the regex and then sorting by rank doesn’t properly work, so instead I implemented it along with sorting the questions. Again, if the order of questions and answers doesn’t matter in your data you can just skip past some of the stages.
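
    To see what the regex steps do on their own, here is a small sketch on a plain vector of question ids. The first half repeats the steps from the mutate() above; the second half is my own assumption about a case I didn’t have in this survey: a sub-item numbered 10 or higher (the ids in tricky_ids are hypothetical). Read as a decimal, q11_10 becomes 11.1 and ties with q11_1, so sorting on two separate integers may be safer there.

```r
library(stringr)

question_ids <- c("q11_1", "q2", "q37", "q4", "q13_1")

# Same steps as in the post: strip letters, turn "_" into a decimal point
q_num <- str_remove_all(question_ids, "[a-z]")   # "11_1" "2" "37" "4" "13_1"
q_num <- str_replace_all(q_num, "_", ".")        # "11.1" "2" "37" "4" "13.1"
q_num <- str_remove(q_num, "\\.$")               # drop a trailing dot, if any
q_num <- as.numeric(q_num)

sorted_ids <- question_ids[order(q_num)]
sorted_ids

# Hypothetical ids with a two-digit sub-item: sort on two integer keys instead
tricky_ids <- c("q11_10", "q11_2", "q2")
main_num <- as.integer(str_extract(tricky_ids, "\\d+"))        # number after "q"
sub_num  <- as.integer(str_extract(tricky_ids, "(?<=_)\\d+"))  # number after "_"
sub_num[is.na(sub_num)] <- 0L                                  # no sub-item sorts first

tricky_sorted <- tricky_ids[order(main_num, sub_num)]
tricky_sorted
```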

    Great, now that we have all our data formatted properly, we can turn to the printing!

    KableExtra

    In my initial round of exploring the pilot survey I used {KableExtra} and its powerful features. You might find other packages better to work with when knitting to Word. With that said, it’s possible (and very effective) to knit to Html and copy that into a Word document. Despite the copy + paste requirement, I found it to be the better approach for keeping all the aesthetics and formatting integrated in the original document. Oh, and I also had my questions originally in Hebrew which was easier to knit to Html altogether.

    Removing redundant information

    As we saw earlier, the syntax is pretty straightforward. We can address the redundant information we have (the question and text columns repeating with each answer, our final issue) within the KableExtra object using collapse_rows:

    df_long_q_sorted %>% 
      kbl(col.names = c("Question", "Text", "Answer", "n", "%")) %>% 
      kable_styling(full_width = F) %>% 
      column_spec(1, bold = T) %>% 
      collapse_rows(columns = c(1,2), valign = "top") %>% 
      scroll_box(height = "750px")

    | Question | Text | Answer | n | % |
    | q2 | Gender | Female | 15 | 44.1% |
    | | | Male | 17 | 50.0% |
    | | | Other | 2 | 5.9% |
    | q4 | What’s your occupation? | Salaried employee | 18 | 52.9% |
    | | | Self employed | 3 | 8.8% |
    | | | Student | 8 | 23.5% |
    | | | Student,Salaried employee | 3 | 8.8% |
    | | | Student,Unemployed | 1 | 2.9% |
    | | | Unemployed | 1 | 2.9% |
    | q6 | The average income for an individual is X, you’re income is: | Below average | 16 | 47.1% |
    | | | Similar to average | 13 | 38.2% |
    | | | Above average | 5 | 14.7% |
    | q7 | Where do you live in Israel | Center | 16 | 47.1% |
    | | | South | 18 | 52.9% |
    | q11_1 | Every how often do you consume chocolate? | Several times a day | 6 | 17.6% |
    | | | Once a day | 6 | 17.6% |
    | | | Several times a week | 8 | 23.5% |
    | | | Once a week | 4 | 11.8% |
    | | | Between once a week to once in a month | 5 | 14.7% |
    | | | Less than once a month | 4 | 11.8% |
    | | | NA | 1 | 2.9% |
    | q13_1 | Every how often do you buy chocolate? | About every week | 1 | 2.9% |
    | | | At least once a month | 13 | 38.2% |
    | | | Every three months | 5 | 14.7% |
    | | | Every three to six months | 4 | 11.8% |
    | | | Last time I bought was over half a year ago | 6 | 17.6% |
    | | | I’ve never bought | 5 | 14.7% |
    | q15 | When did you last attend a party? | Last week | 1 | 2.9% |
    | | | Last month | 2 | 5.9% |
    | | | Last three months | 3 | 8.8% |
    | | | Last half a year | 3 | 8.8% |
    | | | Last year | 1 | 2.9% |
    | | | Over a year ago | 6 | 17.6% |
    | | | Never | 18 | 52.9% |
    | q26 | Do you think 2020 was a good year? | No | 20 | 59% |
    | | | Yes | 13 | 38% |
    | | | NA | 1 | 3% |
    | q27 | How many times did you decide to stay home instead of going out this year? | 2-3 times | 4 | 11.8% |
    | | | 5-7 times | 4 | 11.8% |
    | | | Over 7 times | 5 | 14.7% |
    | | | NA | 21 | 61.8% |
    | q30 | Should Amit buy a new computer? | No | 1 | 3% |
    | | | Yes | 12 | 35% |
    | | | NA | 21 | 62% |
    | q32 | Do you really mean that? | No | 27 | 79.4% |
    | | | Yes | 5 | 14.7% |
    | | | NA | 2 | 5.9% |
    | q37 | Would you like to participate in a follow up study? | No | 6 | 17.6% |
    | | | Yes | 25 | 73.5% |
    | | | NA | 3 | 8.8% |


    That easy? Yes!

    The trick here that I love is collapsing the repeated rows of a column with collapse_rows, a feature also found in other packages such as {formattable}, which I look forward to exploring. Collapsing makes printing in R Markdown really easy and efficient, something I found lacking in other platforms I learned such as SPSS. I also added column_spec to bold the first column. Of course, you can drop the scroll_box call at the end, which will print your whole table.
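
    For a self-contained version of the collapsing step, here is a minimal sketch on a four-row made-up table (toy_table and its values are invented for illustration). I pass format = "html" explicitly so it behaves the same outside of a knitted document; that argument is my addition, not something the code above needed.

```r
library(dplyr)
library(kableExtra)

# Made-up long table: question id and text repeat on every answer row
toy_table <- tibble(
  question = c("q2", "q2", "q7", "q7"),
  text     = c("Gender", "Gender", "Where do you live?", "Where do you live?"),
  value    = c("Female", "Male", "Center", "South"),
  n        = c(2L, 2L, 1L, 3L)
)

toy_kbl <- toy_table %>%
  kbl(format = "html", col.names = c("Question", "Text", "Answer", "n")) %>%
  kable_styling(full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  # Merge consecutive identical cells in the first two columns
  collapse_rows(columns = 1:2, valign = "top")

toy_kbl
```

    In the resulting HTML the merged cells are ordinary table cells spanning several rows, which is also why they survive a copy and paste into Word.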

    If you’re looking for a word format, you can just copy & paste your html output (Perfect for working with Hebrew text for example, a little more on that below). You can find an additional example by Hao Zhu, the creator of the package, or here’s my attempt below:

    Figure 1: Just select all and copy it into a word document

    If you want the table to be a little more formally formatted, I recommend exploring theme functions such as kable_classic, kable_minimal and others from the {KableExtra} family. Here’s a short example using kable_minimal with an HTML output:

    df_long_q_sorted %>% 
      head(10) %>% 
      kbl() %>% 
      # kable_minimal applies the theme and passes full_width on to kable_styling
      kable_minimal(full_width = F) %>% 
      column_spec(1, bold = T) %>% 
      collapse_rows(columns = c(1,2), valign = "top")

    Figure 2: Example using kable_minimal as a table theme


    Wait, but what if I want a simple table in word?

    Let’s say you want a simple kable table when knitting to Word. You can just add ‘df_print: kable’ to the YAML of your document or, alternatively, explore other options in the R Markdown book for elegantly printing dataframes. How do we remove the redundant information when printing? Just replace the duplicated values with a single space:

    df_long_q_sorted %>% 
      mutate(across(c(question, text), ~ ifelse(duplicated(.x), " ", .x))) %>% 
      select(`Question` = question, `Text` = text, `Value` = value, n, `%` = pct) %>% 
      head(20)

    Figure 3: Outputting a table using df_print: kable in the YAML section

    Voila!

    I first removed the redundant text by using across together with a conditional. The formula (~ ifelse) reads as: take each of the specified columns and, wherever a value is duplicated, replace it with a space character; otherwise keep the value. Figure 3 shows the result when rendered to Word, and you can continue to format it with or without other packages.
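
    Here is the same blanking step as a minimal sketch on made-up data (toy_table is invented for illustration). One caveat worth knowing, as far as I understand duplicated(): it flags every repeat in the whole column, not only consecutive ones, so this works as intended when the table is already sorted by question and no two questions share the exact same text.

```r
library(dplyr)

# Made-up long table, already sorted by question
toy_table <- tibble(
  question = c("q2", "q2", "q7", "q7"),
  text     = c("Gender", "Gender", "Where do you live?", "Where do you live?"),
  value    = c("Female", "Male", "Center", "South")
)

toy_blanked <- toy_table %>%
  # Keep the first occurrence of each id/text, replace later repeats with a space
  mutate(across(c(question, text), ~ ifelse(duplicated(.x), " ", .x)))

toy_blanked
```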

    Right-to-Left languages?

    I found it difficult knitting Hebrew characters to a Word output but easily done when rendering Html documents. Here’s a short example, without the whole pre-processing, using some Hebrew questions:

    hebrew_example <- read_xlsx(here("content", "post", "printing-survey-table", "data", "hebrew_example.xlsx"))
    
    hebrew_example %>% 
      # Reverse the column order for right-to-left reading
      select(pct, n, value, text, question) %>% 
      mutate(pct = percent(pct)) %>% 
      kbl(col.names = c("%", "שכיחות", "תשובה", "שאלה", "פריט"), align = 'r') %>% 
      kable_styling(full_width = F) %>% 
      column_spec(5, bold = T) %>% 
      collapse_rows(columns = c(4,5), valign = "top") %>% 
      scroll_box(height = "500px")

    | % | שכיחות | תשובה | שאלה | פריט |
    | 47.1% | 16 | נקבה | מגדר | q2 |
    | 52.9% | 18 | זכר | | |
    | 52.9% | 18 | שכיר | סטטוס תעסוקתי | q4 |
    | 8.8% | 3 | עצמאי | | |
    | 23.5% | 8 | סטודנט | | |
    | 8.8% | 3 | סטודנט, שכיר | | |
    | 2.9% | 1 | סטודנט, מובטל | | |
    | 2.9% | 1 | מובטל | | |
    | 17.6% | 6 | לא | האם תסכים להשתתף במחקר המשך? | q37 |
    | 73.5% | 25 | כן | | |
    | 8.8% | 3 | חסר | | |


    Notice how I reversed the column order to make it compatible with right-to-left reading. I also changed the locale setting (not shown) using the Sys.setlocale function. While this still needs some additional work (for example the ‘?’ isn’t aligned), it’s definitely a good start.

    Wrapping up

    So this is how I approached the issue of providing a simple descriptive table of my categorical variables. While it was somewhat tedious and some parts required manual work, you might not require all the stages. I hope you were able to take something from this post. I certainly enjoy working with the {KableExtra} package and look forward to sharing additional things I’ve learned while using it for reporting findings from my thesis.

    Now then, time to get back to writing!


    1. If you use Qualtrics as your survey platform, check out the {qualtRics} package for working with such data.

    2. Of course there are many other issues I won’t address here, such as response options that are missing if no one chose them in my current pilot population.
