Hindi and Other Languages in India based on 2001 census
India is the world’s largest democracy and, as it goes, also a highly diverse place. This is my attempt to see how widely “Hindi” and other languages are spoken in India.
In this post, we’ll see how to collect the data for this puzzle directly from Wikipedia, and how to visualize it so that the insight stands out.
Data
Wikipedia is a great source for data like this (languages spoken in India). Because Wikipedia publishes these tables as HTML <table> elements, it is quite easy to use rvest::html_table() to extract the table as a data frame without much hassle.
options(scipen = 999)
library(rvest)     # for webscraping
library(tidyverse) # for data analysis and visualization

# the wikipedia page URL - thanks to DuckDuckGo search
lang_url <- "https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers_in_India"

# extracting the entire content of the page
content <- read_html(lang_url)

# extracting only tables from the downloaded content
tables <- content %>% html_table(fill = TRUE)

# from the page we know it's the first table we want,
# so pick the first element from the list of tables
lang_table <- tables[[1]]

### header cleaning - exclude the first row
lang_table <- lang_table[-1, ]
lang_table

##              First language speakers First language speakers
## 2   Hindi[b]             422,048,642                  41.03%
## 3    English                 226,449                   0.02%
## 4    Bengali              83,369,769                   8.10%
## 5     Telugu              74,002,856                   7.19%
## 6    Marathi              71,936,894                   6.99%
## 7      Tamil              60,793,814                   5.91%
## 8       Urdu              51,536,111                   5.01%
## 9    Kannada              37,924,011                   3.69%
## 10  Gujarati              46,091,617                   4.48%
## 11      Odia              33,017,446                   3.21%
## 12 Malayalam              33,066,392                   3.21%
## 13  Sanskrit                  14,135                  <0.01%
##    Second languagespeakers[11] Third languagespeakers[11] Total speakers
## 2                   98,207,180                 31,160,696    551,416,518
## 3                   86,125,221                 38,993,066    125,344,736
## 4                    6,637,222                  1,108,088     91,115,079
## 5                    9,723,626                  1,266,019     84,992,501
## 6                    9,546,414                  2,701,498     84,184,806
## 7                    4,992,253                    956,335     66,742,402
## 8                    6,535,489                  1,007,912     59,079,512
## 9                   11,455,287                  1,396,428     50,775,726
## 10                   3,476,355                    703,989     50,271,961
## 11                   3,272,151                    319,525     36,609,122
## 12                     499,188                    195,885     33,761,465
## 13                   1,234,931                  3,742,223      4,991,289
##    Total speakers
## 2          53.60%
## 3          12.18%
## 4           8.86%
## 5           8.26%
## 6           8.18%
## 7           6.49%
## 8           5.74%
## 9           4.94%
## 10          4.89%
## 11          3.56%
## 12          3.28%
## 13          0.49%
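To try the same extraction step without hitting the network, here is a minimal sketch that runs html_table() on a tiny hand-written table. The HTML snippet and the `toy` name are made up for illustration (only the two numbers are copied from the output above), and it assumes a version of rvest that provides minimal_html().

```r
library(rvest)

# Hypothetical miniature page standing in for the Wikipedia article
html <- minimal_html('
  <table>
    <tr><th>Language</th><th>Speakers</th></tr>
    <tr><td>Hindi</td><td>422,048,642</td></tr>
    <tr><td>Bengali</td><td>83,369,769</td></tr>
  </table>
')

# html_table() on a whole document returns a list with one entry per <table>
tables <- html_table(html)
toy <- tables[[1]]

length(tables)  # how many tables were found
names(toy)      # column names taken from the <th> row
toy$Speakers    # still character: the thousands separators block conversion
```

As with the real page, the numbers come back as character strings, so they still need to be converted before plotting.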
At this point we’ve got the required table, but mind you, the numbers are stored as character strings, and to plot them they have to be numeric. We’ll use only the first-language speakers in the following sections, so we’ll convert that column from character to numeric.
# clean-up the messed up column names
lang_table <- lang_table %>% janitor::clean_names()
lang_table[1, "x"] <- "Hindi"
lang_table

##            x first_language_speakers first_language_speakers_2
## 2      Hindi             422,048,642                    41.03%
## 3    English                 226,449                     0.02%
## 4    Bengali              83,369,769                     8.10%
## 5     Telugu              74,002,856                     7.19%
## 6    Marathi              71,936,894                     6.99%
## 7      Tamil              60,793,814                     5.91%
## 8       Urdu              51,536,111                     5.01%
## 9    Kannada              37,924,011                     3.69%
## 10  Gujarati              46,091,617                     4.48%
## 11      Odia              33,017,446                     3.21%
## 12 Malayalam              33,066,392                     3.21%
## 13  Sanskrit                  14,135                    <0.01%
##    second_languagespeakers_11 third_languagespeakers_11 total_speakers
## 2                  98,207,180                31,160,696    551,416,518
## 3                  86,125,221                38,993,066    125,344,736
## 4                   6,637,222                 1,108,088     91,115,079
## 5                   9,723,626                 1,266,019     84,992,501
## 6                   9,546,414                 2,701,498     84,184,806
## 7                   4,992,253                   956,335     66,742,402
## 8                   6,535,489                 1,007,912     59,079,512
## 9                  11,455,287                 1,396,428     50,775,726
## 10                  3,476,355                   703,989     50,271,961
## 11                  3,272,151                   319,525     36,609,122
## 12                    499,188                   195,885     33,761,465
## 13                  1,234,931                 3,742,223      4,991,289
##    total_speakers_2
## 2            53.60%
## 3            12.18%
## 4             8.86%
## 5             8.26%
## 6             8.18%
## 7             6.49%
## 8             5.74%
## 9             4.94%
## 10            4.89%
## 11            3.56%
## 12            3.28%
## 13            0.49%

# keep the language name and first-language count, parsed as numeric
lang_table_first <- lang_table %>%
  select(one_of("x", "first_language_speakers")) %>%
  mutate(first_language_speakers = parse_number(first_language_speakers))

names(lang_table_first) <- c("Language", "first_language_speakers")
lang_table_first

##     Language first_language_speakers
## 1      Hindi               422048642
## 2    English                  226449
## 3    Bengali                83369769
## 4     Telugu                74002856
## 5    Marathi                71936894
## 6      Tamil                60793814
## 7       Urdu                51536111
## 8    Kannada                37924011
## 9   Gujarati                46091617
## 10      Odia                33017446
## 11 Malayalam                33066392
## 12  Sanskrit                   14135
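To see the character-to-numeric conversion in isolation, here is a small sketch that applies readr's parse_number() to a few strings copied from the table. The `raw` and `parsed` names are only for this illustration.

```r
library(readr)

# A few values as they appear in the scraped table
raw <- c("422,048,642", "226,449", "41.03%", "<0.01%")

# parse_number() drops non-numeric characters around the first number
# and treats "," as a grouping mark under the default locale
parsed <- parse_number(raw)
parsed
```

One thing to keep in mind: "<0.01%" is read as 0.01, so the "less than" qualifier is silently lost. That doesn't matter for the counts used here, but it would if the percentage columns were plotted.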
Visualization
Now that we’ve got a categorical and a numerical variable, it’s time to play with some visualization, starting with the typical choice: a bar chart.
All Languages
lang_table_first %>%
  mutate(Language = fct_reorder(Language, -first_language_speakers)) %>%
  ggplot() +
  geom_bar(aes(Language, first_language_speakers),
           stat = "identity",
           fill = ifelse(lang_table_first$Language == 'Hindi', "#ffdd00", "#ff00ff")) +
  theme_minimal() +
  labs(title = "Most Spoken Languages",
       subtitle = "First Language in India",
       caption = "Data Source: Wikipedia - Census 2001")
That’s a long tail with Hindi leading the way.
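The chart above colours the Hindi bar by passing a vector of colours to fill outside aes(), which relies on that vector lining up with the rows of the plotted data. It works here because mutate() does not reorder rows, but a variant that keeps the highlight flag in the data may be easier to maintain. The sketch below is one way to do that, not the approach used in this post; the `langs` data frame (three rows copied from the table) and the `highlight` column are names invented for the example.

```r
library(ggplot2)

# Three rows copied from the table, just to keep the sketch small
langs <- data.frame(
  Language = c("Bengali", "Hindi", "Telugu"),
  first_language_speakers = c(83369769, 422048642, 74002856)
)

# Hypothetical helper column: the highlight flag travels with each row
langs$highlight <- ifelse(langs$Language == "Hindi", "Hindi", "Other")

p <- ggplot(langs) +
  geom_col(aes(reorder(Language, -first_language_speakers),
               first_language_speakers,
               fill = highlight)) +
  # map the flag to the same two colours used in the chart above
  scale_fill_manual(values = c(Hindi = "#ffdd00", Other = "#ff00ff"),
                    guide = "none") +
  labs(x = "Language") +
  theme_minimal()

p
```

Because the colour is mapped through aes(), it stays attached to the right bar even if the rows are later filtered or sorted.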
Hindi & Everyone else
library(viridis)

lang_table_first %>%
  mutate(Language = ifelse(Language == "Hindi", "Hindi", "non_Hindi")) %>%
  group_by(Language) %>%
  summarize(first_language_speakers = sum(first_language_speakers)) %>%
  mutate(percentage = round((first_language_speakers / sum(first_language_speakers)) * 100, 2)) %>%
  ggplot() +
  geom_bar(aes(Language, percentage, fill = Language), stat = "identity") +
  scale_fill_viridis_d(option = 'E', direction = -1) +
  scale_y_continuous(limits = c(0, 60)) +
  theme_minimal() +
  geom_label(aes(Language, percentage, label = paste0(percentage, "%"))) +
  labs(title = "Hindi vs Non_Hindi",
       subtitle = "First Spoken Language in India",
       caption = "Data:Wikipedia - Census 2001")
Living up to the diversity of India, the mixed (assorted) group of languages other than Hindi accounts for ~54% of first-language speakers across the twelve languages in the table, while Hindi alone accounts for ~46%.
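The two percentages can be checked by hand from the first-language counts in the table. The sketch below redoes that arithmetic in base R; the `speakers` vector is typed in from the output shown earlier, and the variable names are only for this illustration.

```r
# First-language speaker counts copied from the table above
speakers <- c(
  Hindi = 422048642, English = 226449, Bengali = 83369769,
  Telugu = 74002856, Marathi = 71936894, Tamil = 60793814,
  Urdu = 51536111, Kannada = 37924011, Gujarati = 46091617,
  Odia = 33017446, Malayalam = 33066392, Sanskrit = 14135
)

# The denominator is the sum over the twelve listed languages only
total <- sum(speakers)

hindi_share     <- round(100 * speakers[["Hindi"]] / total, 2)
non_hindi_share <- round(100 * (total - speakers[["Hindi"]]) / total, 2)

hindi_share      # 46.17
non_hindi_share  # 53.83
```

Note that the denominator here is the total for the twelve languages in the table. The table's own figure for Hindi, 41.03%, is lower, presumably because it is measured against a larger base that includes languages not listed here.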
Summary
Without getting into the politics of the topic, in this post we learnt how to get the data we need using rvest and how to analyse it with tidyverse to generate some insights on India’s most spoken first languages. If you are interested in learning more about R, you can check out this tutorial.