Site icon R-bloggers

Team Collaboration in R and Python Made Easy

[This article was first published on RStudio | Open source & professional software for data science teams on RStudio, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Bilingual teams that want to do serious data science require collaboration, transparency, and reproducibility across R and Python workflows while empowering professionals to work in their preferred language(s). Accomplishing this requires tools built for interoperability at scale and a shared standard between data science languages.

Here are a few recommendations to achieve this:

Let’s dive into each of these recommendations!

1. Adopt a Bilingual IDE

Bilingual teams should adopt a standard IDE that enables work to be done in Python and R, both separately and simultaneously. The RStudio IDE is perfect for this. It allows data professionals to work in either or both languages without sacrificing modern IDE features such as syntax highlighting, debugging tools, workspace management, git integration, markdown support, and more. Adopting a single IDE can lead to a more straightforward setup, better documentation, and smoother troubleshooting than if multiple IDEs are used across the team.

The RStudio IDE can:

If a data scientist receives a Python script, they can open and run it in the RStudio IDE:

Or, they could hand off an output from one language to the other:

2. Interoperable Enterprise Ecosystem

Enterprise tools should promote collaboration and interoperability. RStudio Workbench is a secure, scalable, and centralized environment that enables R and Python code collaboration by providing:

RStudio Workbench empowers bilingual data professionals to collaborate across projects regardless of the language or IDE. Combining Workbench with RStudio Package Manager ensures that everyone has the same software and packages across projects. Bilingual data professionals can focus on generating insights instead of managing software or package dependencies and versions, which can be an arduous task for one language, let alone two.

RStudio Connect allows insights of varying formats to be shared with the push of a button. APIs, dashboards, models, and reports can be automatically shared or provided on demand using Connect.

RStudio Workbench, Package Manager, and Connect combine to form RStudio Team. RStudio Team is a robust end-to-end bilingual ecosystem that creates a seamless, interoperable environment where collaboration is easy and preference is respected. Workbench, Package Manager, and Connect all support R and Python and are available separately or collectively.

3. Use Packages that Enable Bilingual Teams

R and Python workflows can be developed using packages that exist in both languages. Bilingual packages make collaboration more accessible and reduce the potential for error by eliminating the differences and incompatibilities that can arise when using a highly varied package stack. The list of bilingual packages is rapidly growing, so be sure to check the blog frequently for more updates!

An example workflow would be built like this:

Workflow Action R Python
Data Manipulation dplyr siuba
Data Visualization ggplot2 plotnine
Data Object Publishing pins pins
Reporting quarto quarto
Model Deployment vetiver vetiver
Dashboarding Shiny for R Shiny for Python (in Alpha)

Note: Some packages have different names but similar functionality

4. Adopt Syntactical Consistency

The final recommendation is to adopt syntactical consistency between languages. Collaboration, training, and process resiliency are more manageable when a well-established standard operating procedure exists. Code can be written consistently between users and languages by embracing well-defined grammar.

A well-defined grammar provides a consistent structure and opinionated syntax to perform actions enabling standardization across code bases. This yields the following benefits:

R and Python have packages that help users adopt a consistent grammar for data manipulation: dplyr and siuba, and visualization: ggplot2 and plotnine, respectively.

Imagine we have a dataset with three columns of values (a,b,c) and a category column. We want to do the following:

Click through the tabs to compare Pandas, siuba and dplyr syntax for these actions.

The pandas code is relatively clean despite some reference duplication and syntax inconsistency between functions. The example code below is one of several ways in pandas to write the ordered series of steps, many of which would look foreign to an R user (and even some Python users). Select the next tab to see a different variation of pandas code executing the same steps.

# pandas method 1
 (
    example_df
      .assign(a_dbl = example_df.a * 2)
      .assign(b_tri = example_df.b * 3)
      .assign(c_half = example_df.c / 2)
      .assign(b_min_c = example_df.b - example_df.c)
      .query('b_min_c > 3')
      .groupby('category', as_index=False)
      .agg(
          avg_a_dbl = pd.NamedAgg(column='a_dbl', aggfunc=np.mean),
          avg_b_tri = pd.NamedAgg(column='b_tri', aggfunc=np.mean),
          avg_c_half = pd.NamedAgg(column='c_half', aggfunc=np.mean),
          avg_b_min_c = pd.NamedAgg(column='b_min_c', aggfunc=np.mean)
     )
  )

Here we execute the same ordered series of steps as in the previous tab with noticeably different code. While this solution gives the same output as before, the code is more difficult to read and invokes concepts the average R user would be unfamiliar with.

  # pandas method 2
  example_df['a_dbl'] = example_df['a'].mul(2)
  example_df['b_tri'] = example_df['b'] * 3
  example_df['c_half'] = example_df['c'].div(2)
  example_df['b_min_c'] = example_df['b'].sub(example_df['c'], axis = 0)
  example_df = example_df[example_df['b_min_c'] > 3]
  example_df['avg_a_dbl'] = example_df['a_dbl'].groupby(example_df['category']).transform('mean')
  example_df['avg_b_tri'] = example_df['b_tri'].groupby(example_df['category']).transform('mean')
  example_df['avg_c_half'] = example_df['c_half'].groupby(example_df['category']).transform('mean')
  example_df['avg_b_min_c'] = example_df['b_min_c'].groupby(example_df['category']).transform('mean')

With siuba, the syntax remains consistent between functions and actions. Adding columns, filtering, grouping, etc., is always done in the same manner, regardless of the user. The syntax is standard across all Python scripts and immediately readable by R users who work with a nearly identical syntax, as shown in the R code snippet on the next tab. Siuba has added benefits over pandas, like integration with plotnine and the ability to generate SQL.

(
example_df
  >> mutate(
    a_dbl = _.a * 2,
    b_tri = _.b * 3,
    c_half = _.c / 2,
    b_min_c = _.b - _.c )
  >> filter(_.b_min_c > 3)
  >> group_by(_.category)
  >> summarize(
    avg_a_dbl = _.a_dbl.mean(),
    avg_b_tri = _.b_tri.mean(),
    avg_c_half = _.c_half.mean(),
    avg_b_min_c = _.b_min_c.mean() )
  )

The R code is nearly identical to the Siuba code on the previous tab. Siuba makes a few modifications to accommodate Python syntax, such as denoting columns with _.

example_df %>%
  mutate(
    a_dbl = a * 2,
    b_tri = b * 3,
    c_half = c / 2,
    b_min_c = b - c) %>%
  filter(b_min_c > 3) %>%
  group_by(category) %>%
  summarize(
    avg_a_dbl = mean(a_dbl),
    avg_b_tri = mean(b_tri),
    avg_c_half = mean(c_half),
    avg_b_min_c = mean(b_min_c)
  )

Let’s look at another example where we create a grouped bar chart in matplotlib and plotnine. Shortened ggplot2 code is provided for comparison.

The matplotlib code requires additional steps over plotnine. It can also be written in multiple ways, making it hard for non-Python users to follow. As a result, matplotlib can often be in-cohesive and cumbersome. Plotnine addresses this by using the grammar of graphics built into ggplot2. The grammar of graphics takes a layered approach, making creating custom and often complex visuals simple in syntax. Take a look at the next tab to see this point demonstrated.

x = np.arange(len(example_df.labels.unique()))
width = 0.35

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, np.array(example_df['means'][example_df.category== 'cat']), width, label='Cat')
rects2 = ax.bar(x + width/2, np.array(example_df['means'][example_df.category== 'dog']), width, label='Dog')

ax.set_ylabel('Scores')
ax.set_title('Scores by Group and Animal')
ax.set_xticks(x, example_df.labels.unique())
ax.set_xlabel('Groups')
ax.legend()

ax.bar_label(rects1, padding=3)
ax.bar_label(rects2, padding=3)

fig.tight_layout()
plt.show()

Creating the grouped bar chart is more straightforward in plotnine compared to matplotlib. More importantly, the syntax remains consistent no matter how complex a plot we generate.

(
ggplot(example_df, aes(x='labels', y='means', fill='category'))
 + geom_col(stat='identity', position='dodge')
 + scale_fill_manual(values=['#0343DF', '#F97306'])
+ labs(title='Scores by Group and Animal',x='Groups',y='Scores')
+ geom_text(aes(label='means'), position=position_dodge(width=0.9),size=8, va='bottom')
+ theme(
  panel_background = element_blank(),
  panel_border = element_rect(color = "black", size=.5),
  axis_text_y=element_text(color='black'),
  axis_text_x=element_text(color='black'),
  legend_title=element_blank(),
  legend_position=(0.82, .8),
  legend_background = element_blank()
  )
)

plotnine and ggplot2 code is nearly identical. There are some minor differences to accommodate for Python syntax.

ggplot(example_df, aes(x=labels, y=means, fill=category)) +
 geom_bar(stat='identity', position='dodge') +
 scale_fill_manual(values=c('#0343DF', '#F97306')) +
 labs(title='Scores by Group and Animal',x='Groups',y='Scores')

As demonstrated, the siuba and plotnine packages create a familiar syntax between R and Python for data manipulation and visualization. Using siuba and plotnine leads to a significant portion of the code base being written consistently and readable by both R and Python users.

Learn More About How RStudio is Enabling Bilingual Teams

Anyone interested in building a collaborative bilingual team is highly encouraged to explore the tools this blog post covers. All of the recommendations in this blog can be taken individually but are ideally applied concurrently. To learn more, check out the links below.

To leave a comment for the author, please follow the link and comment on their blog: RStudio | Open source & professional software for data science teams on RStudio.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.