4 R projects to form a core data analyst portfolio

[This article was first published on Articles - The Analyst Code, and kindly contributed to R-bloggers].

Introduction

The job market for data analysts is large and highly competitive. Many companies, including companies not traditionally classified as “tech” or “coding” companies, are looking to hire people with analytical coding experience. Yet the number of applicants seems to be rising even faster. It’s a competitive market, and you want your application to stand out.

Many job descriptions include lines like “Executes and advises on reporting needs and works cross-functionally to analyze data and make actionable recommendations at all levels,” “Utilizes advanced analytical and/or statistical ability to evaluate data and make judgments and recommendations,” or “Experience in at least one computer programming language or analytical programming language (R, Python, SAS, etc.).”

Notice that these job postings share two common themes: (1) experience analyzing data and (2) experience providing recommendations. Your goal as an aspiring analyst is to demonstrate experience in both of these domains. But how can you do this when you are applying for your first job in the field? Easy: spend some time building a core portfolio that shows the types of skills that recruiters want.

This article covers four projects that can form the core of your application portfolio. We recommend completing these prior to applying for jobs so that you can have demonstrable experience to include on your resume and discuss in your interviews. Be sure to create a GitHub repo for each project and link prominently to your GitHub profile on your resume.

I expect that building this portfolio will take at least one month of focused work.

  • Week 1: Exploratory data analysis

  • Week 2: Interactive Shiny dashboard

  • Week 3: Natural Language Processing

  • Week 4: Machine Learning

As you work through the projects, keep in mind that your goal is not just to gain experience analyzing data but also to provide insightful recommendations. Even though this is practice, structure your output to include conclusions like “If I wanted to improve my exercise habits, the data show that…”. Prove to future recruiters that you have the skills they want to see.

Core portfolio projects

Exploratory data analysis

  • Project goal: Load a messy dataset into R, clean the data, create 4-5 charts or tables that show summary stats about your data, and create 4-5 charts or tables that provide analytic insight. Your output should be an HTML R Markdown document.

  • Questions to answer: How big is my dataset? What do the top 10 items of data look like? What is the distribution of my variables of interest (mean/median, skew, outliers)? How does data change over time? What is the relationship between Variable A and Variable B? Why does Variable A behave in such and such a manner?

  • Resume skills practiced: R, data cleaning, data visualization

  • Recommended packages: dplyr, tidyr, ggplot2, kableExtra, others as needed

  • Examples: R for Data Science, Data Science Heroes

  • Data ideas to get you started: spotifyr, NYC Airbnb, college football games, Fitbit data
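
To make the questions above concrete, here is a minimal sketch of the first pass of an exploratory analysis. It uses R's built-in airquality dataset purely as a stand-in (my assumption, not one of the datasets suggested above), and the object names are hypothetical. Your own project would swap in a messier dataset and go much further.

```r
library(dplyr)
library(ggplot2)

# Stand-in data: airquality ships with R and contains missing values
raw <- airquality

# How big is my dataset?
dim(raw)

# What do the top 10 rows look like?
head(raw, 10)

# Clean: drop rows with missing ozone readings and build a real date column
# (the readings were taken in 1973, per the dataset's help page)
clean <- raw %>%
  filter(!is.na(Ozone)) %>%
  mutate(Date = as.Date(paste(1973, Month, Day, sep = "-")))

# What is the distribution of my variable of interest?
ozone_summary <- clean %>%
  summarise(
    n      = n(),
    mean   = mean(Ozone),
    median = median(Ozone),
    max    = max(Ozone)
  )

# How does the data change over time?
monthly <- clean %>%
  group_by(Month) %>%
  summarise(mean_ozone = mean(Ozone), .groups = "drop")

# What is the relationship between Variable A and Variable B?
ozone_vs_temp <- ggplot(clean, aes(x = Temp, y = Ozone)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Ozone vs. temperature", x = "Temperature (F)", y = "Ozone (ppb)")
```

In an R Markdown document, each of these objects would sit in its own chunk, followed by a sentence or two stating what the table or chart tells you.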

Interactive Shiny dashboard

  • Project goal: Similar to exploratory data analysis, load, clean, and analyze a dataset. Focus, however, on visualizing data in a Shiny dashboard. Your output should be either a Shiny dashboard with “server.R” and “ui.R” files, a flexdashboard with “runtime: shiny,” or an HTML R Markdown document with “runtime: shiny.” Host your dashboard on shinyapps.io.

  • Questions to answer: Ask the same basic questions as the Exploratory Data Analysis project, but build a dashboard that allows users to generate the answers themselves. For example, instead of showing a histogram of Variable A, create a histogram chart that allows the user to select which variable to plot.

  • Resume skills practiced: R, Shiny, data visualization

  • Recommended packages: shiny, DT, others as needed

  • Examples: Covid-19 in the US (html document, source), hospital info (Shiny app, source)

  • Data ideas to get you started: Stock prices, World Bank data
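
The "let the user pick the variable" idea can be sketched in a few lines. This is only an illustration under my own assumptions: it uses the built-in mtcars dataset as placeholder data, keeps the UI and server in a single file rather than separate “ui.R” and “server.R” files, and the input IDs ("var", "bins") are made-up names.

```r
library(shiny)

# UI: a dropdown to choose the variable and a slider for the bin count
ui <- fluidPage(
  titlePanel("Histogram explorer"),
  sidebarLayout(
    sidebarPanel(
      selectInput("var", "Variable to plot:", choices = names(mtcars)),
      sliderInput("bins", "Number of bins:", min = 5, max = 30, value = 10)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server: redraw the histogram whenever either input changes
server <- function(input, output, session) {
  output$hist <- renderPlot({
    x <- mtcars[[input$var]]
    # A single number passed to `breaks` is treated as a suggestion by hist()
    hist(x,
         breaks = input$bins,
         main   = paste("Distribution of", input$var),
         xlab   = input$var)
  })
}

# Build the app object; nothing is launched until it is run
app <- shinyApp(ui, server)
```

Call runApp(app) in an interactive session to launch it locally before deploying to shinyapps.io.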

Natural Language Processing (NLP) with R

  • Project goal: Replicate the analysis completed in this comprehensive example in order to draw analytical insight from a large body of unstructured text.

  • Questions to answer: What are the ten most frequent words in my corpus (excluding “stop words”)? What is the sentiment score of my corpus? How does sentiment change in each section in my corpus (e.g., chapter in my book)? What are the most common positive / negative words? What can tf-idf tell us about words unique to a part of my corpus (e.g., What words are most distinct to books A, B, and C)? Can I implement Latent Dirichlet allocation to separate my corpus into various topics?

  • Resume skills practiced: R, NLP, Text Mining, Machine Learning

  • Recommended packages: tidytext, topicmodels

  • Examples: Usenet text, Twitter data

  • Data ideas to get you started: Project Gutenberg library of books (website, gutenbergr package), Twitter API (rtweet package)
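
Several of the questions above map almost directly onto tidytext functions. The sketch below runs on a tiny made-up corpus (two hypothetical "books", A and B) so that it needs no downloads; a real project would use a much larger text, and the sentences here are invented only for illustration. It covers word counts, sentiment with the bing lexicon bundled in tidytext, and tf-idf, but not topic modeling.

```r
library(dplyr)
library(tidytext)

# Toy corpus, invented for illustration only
corpus <- tibble(
  book = c("A", "A", "B", "B"),
  text = c(
    "The voyage was wonderful and the crew was happy.",
    "A terrible storm damaged the ship near the harbor.",
    "The garden was beautiful in the quiet morning.",
    "The gardener was sad when the frost ruined the roses."
  )
)

# One row per word, with common stop words removed
words <- corpus %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# What are the ten most frequent words in my corpus?
top_words <- words %>%
  count(word, sort = TRUE) %>%
  head(10)

# How does sentiment differ by section (here, by book)?
sentiment <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(book, sentiment)

# Which words are most distinctive to each book?
tf_idf <- words %>%
  count(book, word) %>%
  bind_tf_idf(word, book, n) %>%
  arrange(desc(tf_idf))
```

Words that appear in only one book receive the highest tf-idf scores, which is what makes the measure useful for comparing books A, B, and C.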

Machine learning with R

  • Project goal: Load a dataset, train a machine learning algorithm on part of the dataset, and use the rest of the dataset to test it. Create summary stats to evaluate the performance of your model.

  • Questions to answer: The types of questions you ask will vary tremendously based on the type of machine learning algorithm you want to implement. Try implementing one unsupervised and a handful of supervised algorithms. If you implement k-means (unsupervised), run your model with different seed values to find the lowest total within-cluster sum of squares (SS), and plot total within-cluster SS against various numbers of clusters (k) to determine the optimal cluster count. Try to train several different supervised models (random forest, kNN, etc.) on the same dataset, and compare their results to pick the best one. Example here.

  • Resume skills practiced: R, Machine Learning

  • Recommended packages: caret

  • Examples: Machine Learning Mastery, MIT

  • Data ideas to get you started: UCI Machine Learning Repository, Hotel booking demand
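
Both halves of this project can be sketched on the built-in iris dataset, which I use here only as a convenient placeholder. The first part plots total within-cluster SS against k for k-means; it uses the nstart argument to try many random starts per k, which is one way (not the only one) to approximate "running with different seeds." The second part trains two supervised models with caret and compares their accuracy on held-out data. The choice of kNN and a decision tree (rpart), and the 80/20 split, are my assumptions.

```r
library(caret)

set.seed(42)

# --- Unsupervised: k-means and the "elbow" plot ---
features <- scale(iris[, 1:4])   # standardize so no variable dominates
ks <- 1:6
total_wss <- sapply(ks, function(k) {
  # nstart = 20 keeps the best of 20 random initializations
  kmeans(features, centers = k, nstart = 20)$tot.withinss
})
plot(ks, total_wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS")

# --- Supervised: train on part of the data, test on the rest ---
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[train_idx, ]
test_set  <- iris[-train_idx, ]

# 5-fold cross-validation on the training set to tune each model
ctrl <- trainControl(method = "cv", number = 5)

knn_fit <- train(Species ~ ., data = train_set, method = "knn",
                 preProcess = c("center", "scale"), trControl = ctrl)
tree_fit <- train(Species ~ ., data = train_set, method = "rpart",
                  trControl = ctrl)

# Compare accuracy on the held-out test set
results <- c(
  knn  = mean(predict(knn_fit, test_set) == test_set$Species),
  tree = mean(predict(tree_fit, test_set) == test_set$Species)
)
results
```

Look for the k where the curve bends and stops dropping sharply. For the supervised models, report more than accuracy in your write-up (a confusion matrix, for example) and say which model you would recommend and why.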
