Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
India is the world’s largest democracy and, as you would expect, a highly diverse place. This is my attempt to see how widely “Hindi” and other languages are spoken in India.
In this post, we’ll see how to collect the data for this puzzle directly from Wikipedia, and how to visualize it so that the insight stands out.
Data
Wikipedia is a great source for data like this (languages spoken in India). Because Wikipedia publishes these tables as HTML <table> elements, it is easy to use rvest::html_table() to extract the table as a data frame without much hassle.
options(scipen = 999)
library(rvest)     # for web scraping
library(tidyverse) # for data analysis and visualization
# the Wikipedia page URL - thanks to DuckDuckGo search
lang_url <- "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers_in_India"
# extracting the entire content of the page
content <- read_html(lang_url)
# extracting only tables from the downloaded content
tables <- content %>% html_table(fill = TRUE)
# from the page we know it's the first table we want,
# so pick the first element from the list of tables
lang_table <- tables[[1]]
### header cleaning - exclude the first row
lang_table <- lang_table[-1, ]
lang_table
##              First language speakers First language speakers
## 2   Hindi[b]             422,048,642                  41.03%
## 3    English                 226,449                   0.02%
## 4    Bengali              83,369,769                   8.10%
## 5     Telugu              74,002,856                   7.19%
## 6    Marathi              71,936,894                   6.99%
## 7      Tamil              60,793,814                   5.91%
## 8       Urdu              51,536,111                   5.01%
## 9    Kannada              37,924,011                   3.69%
## 10  Gujarati              46,091,617                   4.48%
## 11      Odia              33,017,446                   3.21%
## 12 Malayalam              33,066,392                   3.21%
## 13  Sanskrit                  14,135                  <0.01%
##    Second languagespeakers[11] Third languagespeakers[11] Total speakers
## 2                   98,207,180                 31,160,696    551,416,518
## 3                   86,125,221                 38,993,066    125,344,736
## 4                    6,637,222                  1,108,088     91,115,079
## 5                    9,723,626                  1,266,019     84,992,501
## 6                    9,546,414                  2,701,498     84,184,806
## 7                    4,992,253                    956,335     66,742,402
## 8                    6,535,489                  1,007,912     59,079,512
## 9                   11,455,287                  1,396,428     50,775,726
## 10                   3,476,355                    703,989     50,271,961
## 11                   3,272,151                    319,525     36,609,122
## 12                     499,188                    195,885     33,761,465
## 13                   1,234,931                  3,742,223      4,991,289
##    Total speakers
## 2          53.60%
## 3          12.18%
## 4           8.86%
## 5           8.26%
## 6           8.18%
## 7           6.49%
## 8           5.74%
## 9           4.94%
## 10          4.89%
## 11          3.56%
## 12          3.28%
## 13          0.49%
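If you want to try the extraction step without hitting Wikipedia, here is a minimal sketch on a tiny inline table. The HTML string and the names demo_html, demo_page, demo_tables and demo_table are made up for illustration; only read_html() and html_table() come from the code above. It also hints at why the next step is needed: counts written with thousands separators come back as character strings.

```r
library(rvest)

# Hypothetical two-row table standing in for the Wikipedia page
demo_html <- '<html><body><table>
  <tr><th>Language</th><th>Speakers</th></tr>
  <tr><td>Hindi</td><td>422,048,642</td></tr>
  <tr><td>Bengali</td><td>83,369,769</td></tr>
</table></body></html>'

# read_html() also accepts a literal HTML string instead of a URL
demo_page <- read_html(demo_html)

# html_table() on a whole page returns a list with one element per <table>
demo_tables <- html_table(demo_page)
demo_table <- demo_tables[[1]]

demo_table
# the comma-separated counts are not converted to numbers automatically
class(demo_table$Speakers)
```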
At this point we’ve got the required table, but mind you, the numbers are stored as character strings, and they have to be numeric before we can plot them. We’ll use only the first-language speakers in the following sections, so we’ll convert just that column from character to numeric.
# clean-up the messed up column names
lang_table <- lang_table %>% 
  janitor::clean_names()
# drop the footnote marker "[b]" from the first row's language name
lang_table[1,"x"] <- "Hindi"
lang_table
##            x first_language_speakers first_language_speakers_2
## 2      Hindi             422,048,642                    41.03%
## 3    English                 226,449                     0.02%
## 4    Bengali              83,369,769                     8.10%
## 5     Telugu              74,002,856                     7.19%
## 6    Marathi              71,936,894                     6.99%
## 7      Tamil              60,793,814                     5.91%
## 8       Urdu              51,536,111                     5.01%
## 9    Kannada              37,924,011                     3.69%
## 10  Gujarati              46,091,617                     4.48%
## 11      Odia              33,017,446                     3.21%
## 12 Malayalam              33,066,392                     3.21%
## 13  Sanskrit                  14,135                    <0.01%
##    second_languagespeakers_11 third_languagespeakers_11 total_speakers
## 2                  98,207,180                31,160,696    551,416,518
## 3                  86,125,221                38,993,066    125,344,736
## 4                   6,637,222                 1,108,088     91,115,079
## 5                   9,723,626                 1,266,019     84,992,501
## 6                   9,546,414                 2,701,498     84,184,806
## 7                   4,992,253                   956,335     66,742,402
## 8                   6,535,489                 1,007,912     59,079,512
## 9                  11,455,287                 1,396,428     50,775,726
## 10                  3,476,355                   703,989     50,271,961
## 11                  3,272,151                   319,525     36,609,122
## 12                    499,188                   195,885     33,761,465
## 13                  1,234,931                 3,742,223      4,991,289
##    total_speakers_2
## 2            53.60%
## 3            12.18%
## 4             8.86%
## 5             8.26%
## 6             8.18%
## 7             6.49%
## 8             5.74%
## 9             4.94%
## 10            4.89%
## 11            3.56%
## 12            3.28%
## 13            0.49%
# keep the language name and the first-language counts,
# then turn strings like "422,048,642" into numbers
lang_table_first <- lang_table %>% 
  select(Language = x, first_language_speakers) %>% 
  mutate(first_language_speakers = parse_number(first_language_speakers))
lang_table_first
##     Language first_language_speakers
## 1      Hindi               422048642
## 2    English                  226449
## 3    Bengali                83369769
## 4     Telugu                74002856
## 5    Marathi                71936894
## 6      Tamil                60793814
## 7       Urdu                51536111
## 8    Kannada                37924011
## 9   Gujarati                46091617
## 10      Odia                33017446
## 11 Malayalam                33066392
## 12  Sanskrit                   14135
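To see what the conversion does to individual values, here is a small sketch using three strings copied from the table. The names raw_values and parsed are mine. One thing worth noticing is that a value like "<0.01%" is reduced to the bare number, so the "less than" meaning is lost; that does not matter here because we only parse the counts column.

```r
library(readr)

# Strings copied from the scraped table: a count, a percentage, a "less than" value
raw_values <- c("422,048,642", "41.03%", "<0.01%")

# parse_number() keeps the first number it finds and ignores the grouping
# commas and any surrounding non-numeric characters
parsed <- parse_number(raw_values)
parsed
```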
Visualization
Now that we have one categorical and one numerical variable, it’s time to play with some visualization. As is typical for this kind of data, we’ll use a bar chart.
All Languages
lang_table_first %>% 
  mutate(Language = fct_reorder(Language, -first_language_speakers)) %>% 
  ggplot() + 
  # highlight Hindi; the fill is mapped inside aes() so it follows each row
  geom_col(aes(Language, first_language_speakers, fill = Language == "Hindi"),
           show.legend = FALSE) +
  scale_fill_manual(values = c("TRUE" = "#ffdd00", "FALSE" = "#ff00ff")) +
  theme_minimal() +
  labs(title = "Most Spoken Languages",
       subtitle = "First Language in India",
       caption = "Data Source: Wikipedia - Census 2001")
That’s a long tail with Hindi leading the way.
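The bars come out sorted from most to least spoken because of the fct_reorder() call in the code above. A minimal sketch of what that call does, using three of the languages; the names langs, speakers and ordered_langs are only for illustration.

```r
library(forcats)

langs <- c("English", "Hindi", "Bengali")
speakers <- c(226449, 422048642, 83369769)

# a plain factor orders its levels alphabetically
levels(factor(langs))

# fct_reorder() orders the levels by another variable; negating the counts
# puts the language with the most speakers first
ordered_langs <- fct_reorder(langs, -speakers)
levels(ordered_langs)
```

ggplot2 draws a discrete axis in level order, which is why reordering the factor is enough to sort the bars.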
Hindi & Everyone else
library(viridis)
lang_table_first %>% 
  mutate(Language = ifelse(Language == "Hindi",
                           "Hindi","non_Hindi")) %>% 
  group_by(Language) %>% 
  summarize(first_language_speakers = sum(first_language_speakers)) %>% 
  mutate(percentage = round((first_language_speakers / sum(first_language_speakers))*100,2)) %>% 
  ggplot() + 
  geom_bar(aes(Language, percentage, fill = Language), stat = "identity") +
  scale_fill_viridis_d(option = 'E', direction = -1) +
  scale_y_continuous(limits = c(0,60)) +
  theme_minimal() +
  geom_label(aes(Language,percentage, label= paste0(percentage,"%"))) + 
  labs(title = "Hindi vs Non_Hindi",
       subtitle = "First Spoken Language in India",
       caption = "Data:Wikipedia - Census 2001") 
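Living up to India’s diversity, the assorted group of languages other than Hindi accounts for ~54% of the first-language speakers in this table, while Hindi alone accounts for ~46%. Note that these shares are relative to the 12 languages listed here; the table’s own percentage column puts Hindi at 41.03%.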
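The ~46% / ~54% split can be checked by hand from the counts in the table. Here is a small sketch in base R; the vector speakers and the derived names are mine, and the numbers are copied from the first-language column shown earlier.

```r
# First-language speaker counts copied from the table in this post
speakers <- c(
  Hindi = 422048642, English = 226449, Bengali = 83369769,
  Telugu = 74002856, Marathi = 71936894, Tamil = 60793814,
  Urdu = 51536111, Kannada = 37924011, Gujarati = 46091617,
  Odia = 33017446, Malayalam = 33066392, Sanskrit = 14135
)

# The denominator is the sum over the 12 listed languages only
total_listed <- sum(speakers)

hindi_share <- round(100 * speakers[["Hindi"]] / total_listed, 2)
non_hindi_share <- round(100 * (total_listed - speakers[["Hindi"]]) / total_listed, 2)

hindi_share      # 46.17
non_hindi_share  # 53.83
```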
Summary
Without getting into the politics of this topic: in this post we learnt how to get the data we need using rvest, and analyzed it with the tidyverse to draw some insights about India’s most spoken first languages. If you are interested in learning more about R, you can check out this tutorial.
