
Text Analysis of Job Descriptions for Data Scientists, Data Engineers, Machine Learning Engineers and Data Analysts

[This article was first published on Method Matters Blog, and kindly contributed to R-bloggers.]

Introduction

In the previous post, the intrepid Jesse Blum and I analyzed metadata from over 6,500 job descriptions for data roles in seven European countries. In this post, we’ll apply text analysis to those job postings to better understand the technologies and skills that employers are looking for in data scientists, data engineers, data analysts, and machine learning engineers.

In this post we present results from text analyses that show that:

The results of this analysis complement and extend the results we presented last time, showing that employers have distinct visions of the (mostly technical & software-related) skillsets that data analysts, data scientists, data engineers, and machine learning engineers should possess.

The Data

The data come from a web scraping program developed by Jesse and myself. Every 2 weeks, we scraped job advertisements from a major job portal website, extracting all jobs posted within the previous 2-week period for four job titles (Data Engineer, Data Analyst, Data Scientist and Machine Learning Engineer) in seven countries (the United Kingdom, Ireland, Germany, France, the Netherlands, Belgium and Luxembourg).

We started data collection in mid-August and finished at the end of December 2021, ending up with 6,590 scraped job descriptions. All the data and code used for this analysis are available on GitHub. Feedback welcome!

Results

Our dataset includes job descriptions for data roles across four languages (English, French, Dutch and German). We wanted to see if there were any differences in word usage among the different roles (data scientist, data engineer, machine learning engineer and data analyst), and therefore conducted language-specific analyses to contrast and compare the roles according to the words used to describe the job openings.

Word Clouds

Our first set of analyses uses a great R function to create comparison clouds. This type of analysis allows us to compare the frequency of words across groups of documents, and highlight words that appear more in a given group versus the others.
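The idea behind a comparison cloud can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the figures: the scoring rule (a word's rate within a group minus its mean rate across all groups) is an assumption about how such clouds rank words, and the function names and example texts are hypothetical.

```python
import re
from collections import Counter


def word_rates(texts):
    """Relative frequency of each word across one group's texts."""
    counts = Counter()
    for text in texts:
        # [^\W\d_]+ matches runs of Unicode letters (so accented French words work)
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def comparison_scores(groups):
    """Map each word to (group where its rate is highest, deviation from mean rate).

    Assumed scoring: deviation = rate in that group - mean rate over all groups.
    """
    rates = {name: word_rates(texts) for name, texts in groups.items()}
    vocabulary = set().union(*rates.values())
    scores = {}
    for word in vocabulary:
        per_group = {name: r.get(word, 0.0) for name, r in rates.items()}
        mean_rate = sum(per_group.values()) / len(per_group)
        best = max(per_group, key=per_group.get)
        scores[word] = (best, per_group[best] - mean_rate)
    return scores


# Hypothetical toy data: two roles, two short "job ads" each.
groups = {
    "data analyst": ["reporting dashboards excel reporting",
                     "insights reporting stakeholders"],
    "data engineer": ["spark pipelines cloud etl",
                      "cloud pipelines sql spark"],
}
scores = comparison_scores(groups)
```

In a plotted cloud, each word would be drawn in the region of the group returned for it, sized by its deviation, so words shared evenly across groups shrink while role-specific words stand out.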

Jesse and I are more comfortable in English, French, and Dutch than in German, so we limited our analysis to those three languages. However, there were far fewer Dutch job descriptions than English or French ones, so the resulting Dutch comparison cloud was not particularly informative. Below, we focus on the English and French word clouds and what they reveal about employers’ expectations for the different roles.

The French word cloud looks like this:

The English word cloud looks like this:

Overall, we found that there were clear differences between the roles in the language used in the job advertisements. Furthermore, these differences were largely consistent across the English and French language job ads.

The following table summarizes the comparison:

Data analysts

French job descriptions (N = 1,349):
  • Expected to know about data analysis (analyse), reporting (reporting, tableau de bord), and data visualization (visualisation).
  • Likely work more with stakeholders in the business (métier).
  • In contrast to the English job description texts, data analysts are expected to know more about SQL (in English this word appeared more frequently in data engineering job descriptions).

English job descriptions (N = 3,869):
  • Expected to have skills in reporting, dashboarding, data analysis and office suite.
  • More interaction with other stakeholders throughout the larger organization.
  • More emphasis on identifying insights (which need to be communicated to others in order to inform decision making).

Data scientists

French job descriptions:
  • Relatively few unique skills.
  • Expected to know data science and statistics (statistique), and to build models (modèle) and make predictions (prédiction).

English job descriptions:
  • Relatively few unique skills.
  • Expected to know about data science, statistics, mathematics and making predictions.

Data engineers

French job descriptions:
  • Greater expectation to work with cloud platforms (plateforme, cloud, Azure), big data technologies (Scala and Spark), data pipelines, ETL and data storage (stockage).
  • Somewhat surprisingly, data engineers, compared to the other roles, are expected to work with agile methodology.

English job descriptions:
  • Greater expectation to work with cloud and data platforms, ETL (data transfer & storage) and data pipelines, databases, data architecture and infrastructure, Spark and SQL.
  • Essentially, the technologies and databases that go along with storing and transferring data from one place to another are under the responsibility of the data engineer.

Machine learning engineers

French job descriptions:
  • Greater expectation to use machine learning (apprentissage automatique), artificial intelligence (intelligence artificielle), and tools for deep learning algorithms / neural networks (réseau de neurones artificiels) like TensorFlow and PyTorch.

English job descriptions:
  • Greater expectation to use artificial intelligence and deep learning frameworks such as TensorFlow and PyTorch.
  • Greater expectation to know more about software engineering and computer science.
  • Interestingly, the text of the English job ads reveals that machine learning engineers are being asked to work on computer vision problems.

Some other observations that we found noteworthy:

Using Skills-ML to Extract Skills from Job Ads

The Skills-ML library is a great tool for extracting high-level skills from job descriptions. It uses a dictionary-based word search approach to scan through text and identify skills from the O*NET skill ontology, allowing for the extraction of important high-level skills mapped by labor market experts. This approach is more comprehensive than simply counting words (as we did with the comparison clouds above), and it takes into account the fact that some words are synonyms or represent the same skill or technology (e.g. “database”, “data warehouse”, “data lake”, etc. can be grouped under a higher-level term such as “data storage”). Because the O*NET skills are only available in English, this analysis was conducted only on the English-language job descriptions.
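To make the dictionary-based idea concrete, here is a minimal sketch in plain Python. It does not use the Skills-ML API: the tiny term dictionary, the function names, and the example ads are all hypothetical, and the real skill ontology is far larger. It only illustrates how several surface terms can be mapped to one higher-level skill and counted once per job ad.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: surface term -> higher-level skill.
SKILL_TERMS = {
    "database": "data storage",
    "data warehouse": "data storage",
    "data lake": "data storage",
    "python": "python",
    "tableau": "tableau",
}


def build_pattern(terms):
    """Compile one regex for all terms, allowing a simple plural 's'."""
    # Longest terms first so multi-word terms are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    return re.compile(r"\b(" + alternatives + r")s?\b", re.IGNORECASE)


def extract_skills(text, terms=SKILL_TERMS):
    """Return the set of higher-level skills mentioned in one job ad."""
    pattern = build_pattern(terms)
    return {terms[match.group(1).lower()] for match in pattern.finditer(text)}


def skill_ad_counts(ads, terms=SKILL_TERMS):
    """Count, for each skill, the number of ads that mention it at least once."""
    counts = Counter()
    for ad in ads:
        counts.update(extract_skills(ad, terms))
    return counts


# Hypothetical example ads.
ads = [
    "We need Python and experience with data lakes and databases.",
    "Build Tableau dashboards on top of our data warehouse.",
]
counts = skill_ad_counts(ads)
```

Because `extract_skills` returns a set, an ad that mentions both "data lakes" and "databases" still contributes only one count to "data storage", which is the per-ad counting needed for a "most common skills" ranking.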

Most Common Skills

As the following figure shows, Python was the most common skill represented in the English-language job descriptions. Other top skills include R, programming, mathematics, Tableau, visualization, writing, Git, and physics. However, this analysis pools the skills across all four data roles. We saw in the word cloud analysis above and in the previous analysis of job keywords that the desired skillsets can look quite different between the different data profiles.

Clustering Skills and Roles

In order to get a sense of how the extracted skills differed across the data roles, we made a cluster map using the Python Seaborn library. Specifically, we calculated the percentage of job ads per role that contained each skill, filtering on skills that appeared in more than 50 job ads. These percentages were converted to z-scores, such that higher numbers indicate that a given skill is mentioned more often for a given role compared to the others. This final matrix was then passed to the cluster map algorithm, which performs a simultaneous clustering of both the job roles and of the extracted skills.
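The steps described above can be sketched as follows. This is an illustration under stated assumptions rather than our exact pipeline: the input format (a list of role and skill-set pairs), the function names, the use of the sample standard deviation for the z-scores, and the toy data are all hypothetical. The plotting helper uses Seaborn's `clustermap` and is defined but not called, so the computation runs without any plotting libraries installed.

```python
from statistics import mean, stdev


def skill_zscores(ads, min_ads=50):
    """ads: list of (role, set of skills). Returns {role: {skill: z-score}}.

    Keeps skills found in more than `min_ads` ads, computes the percentage of
    ads per role mentioning each skill, then z-scores each skill across roles.
    """
    roles = sorted({role for role, _ in ads})
    ads_per_role = {r: sum(1 for role, _ in ads if role == r) for r in roles}

    total_counts = {}
    role_counts = {r: {} for r in roles}
    for role, skills in ads:
        for skill in skills:
            total_counts[skill] = total_counts.get(skill, 0) + 1
            role_counts[role][skill] = role_counts[role].get(skill, 0) + 1

    kept = sorted(s for s, n in total_counts.items() if n > min_ads)

    zscores = {r: {} for r in roles}
    for skill in kept:
        pct = [100.0 * role_counts[r].get(skill, 0) / ads_per_role[r] for r in roles]
        mu = mean(pct)
        # Assumption: sample standard deviation; 0.0 if it cannot be computed.
        sd = stdev(pct) if len(pct) > 1 else 0.0
        for r, value in zip(roles, pct):
            zscores[r][skill] = (value - mu) / sd if sd else 0.0
    return zscores


def plot_clustermap(zscores):
    """Cluster roles (rows) and skills (columns); red = higher, blue = lower."""
    # Imported here so skill_zscores works without pandas or seaborn installed.
    import pandas as pd
    import seaborn as sns

    frame = pd.DataFrame.from_dict(zscores, orient="index")
    return sns.clustermap(frame, cmap="coolwarm", center=0)


# Hypothetical toy data; the threshold is lowered to fit the tiny sample.
ads = [
    ("data analyst", {"excel", "tableau"}),
    ("data analyst", {"excel"}),
    ("data engineer", {"spark", "python"}),
    ("data engineer", {"spark"}),
    ("data scientist", {"python"}),
    ("data scientist", {"python", "tableau"}),
]
zscores = skill_zscores(ads, min_ads=1)
```

Calling `plot_clustermap(zscores)` would then draw the heatmap with dendrograms on both axes. Z-scoring per skill, rather than plotting raw percentages, keeps very common skills such as Python from dominating the color scale.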

The results of this analysis showed that there are clear clusters of skillsets required for different types of data-related roles.

In the clustering diagram, shades of red indicate a higher prevalence of a given skill for a given role compared to the others, while shades of blue indicate a lower prevalence of a given skill for a given role compared to the others.

Along the horizontal axis, individual skills are clustered together in logical ways. For instance, at the right side of the chart, Microsoft Office is grouped together with Microsoft Excel and Google Analytics.

On the vertical axis, roles cluster into three separate groups according to their required skills:

The Added Value of Analyzing Job Description Texts

Overall, the above analysis serves as a useful extension of the metadata analysis we described in our previous post. Here, we first presented comparison clouds showing the relative frequency of words that were unique to a given role compared to the others. We made separate word clouds for the texts of the English and French job ads, and found that the main conclusions from these visualizations were the same. Interesting findings from this analysis included:

We also extracted skills from the English-language job descriptions using the O*NET skill classification. As in our previous analysis of skill keywords, Python was the most frequently appearing skill. We then made a clustermap to see how the extracted skills differed across the roles.

In this analysis, the data analyst role had the least in common with the others. Data analysts in particular were more likely to use office tools (Excel, Google Analytics), visualization tools (e.g. Tableau) and business software (e.g. Salesforce), and less likely to use programming tools and languages (e.g. Git and Python).

Data engineers also had their own specialties, being particularly likely to work with a wider variety of data storage, big data, and query technologies (e.g. many flavors of SQL, Apache Spark, etc.). This analysis shows that data analysts and data engineers have very different skillsets, with data analysts being more focused on office and business software, and data engineers being more focused on programming and databases. This highlights the importance of having both roles on a team in order to have a well-rounded skillset, and how unlikely it is that one person will be equally strong in both (the long-sought-after but rarely found “unicorn” profile).

The End

This is the final post that we’ll make on the analysis of these job description data. All of the data and code for these analyses are available on GitHub, and we encourage you to explore them further!

This exercise challenged us across data analysis, data science, and data engineering. Both the metadata analysis presented previously and the current text analysis helped us clarify our thinking about the market for data profiles in Europe, and we hope to have expanded your understanding of the data professions and the skills that unite and differentiate them. The job market is evolving quickly, as are the technologies and tools that data professionals are being asked to master. Our analysis of European job descriptions offers a snapshot of the current job market, and we are excited to see what the future brings as European companies’ and institutions’ data efforts mature and as the market continues to evolve!
