Git Tricks for Working with Large Repositories

[This article was first published on rOpenSci - open tools for open science, and kindly contributed to R-bloggers.]

Recently Yanina Bellini Saibene reminded us to update our Slack profile:

Friendly reminder: Let’s increase the value of our rOpenSci Slack community. Please add details to your profile, e.g., your photo, your favorite social media handle, what you do, your pronouns, and how to pronounce your name.

After doing that, I went on to update my profile photos on the rOpenSci website, which ended up teaching me a few Git tricks I would like to share here. Thanks to Maëlle Salmon for the encouragement, and to Steffi LaZerte for reviewing this post.

Cloning as usual

When I tried to clone the source code of rOpenSci’s website I realized the repo was large and it would take me several minutes.

git clone https://github.com/ropensci/roweb3.git

I stopped the process and researched how to pull just the latest version of the specific files I needed.

Pulling the latest version of specific files

First I forked the rOpenSci website repository (roweb3). I used the gh CLI from the terminal, but I could also have forked it manually on GitHub.

# if not using `gh`, fork ropensci/roweb3 from GitHub
gh repo fork ropensci/roweb3

Then I created a local empty roweb3 directory and linked it to the fork.

git init roweb3
cd roweb3
git remote add origin git@github.com:maurolepore/roweb3.git

Now for the tricks! I avoided downloading the whole repository by first finding the specific files I needed with GitHub’s “Go to file” box, then:

  • Trick 1: Configured a sparse checkout so that only the files matching a pattern end up in the working directory.
git config core.sparseCheckout true
echo "themes/ropensci/static/img/team/mauro*" >> .git/info/sparse-checkout
  • Trick 2: Pulled with --depth 1 to get only the latest version of those files.
git pull --depth=1 origin main
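
The two tricks can be tried safely without touching roweb3. The sketch below is only an illustration under a few assumptions: it builds a small stand-in "remote" in a temporary directory (the upstream and sparse directory names, the file names, and the demo identity are all hypothetical), then repeats the same sparse checkout and shallow pull against it. It uses a file:// URL because Git may ignore --depth for plain local paths.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing outside it is touched.
work="$(mktemp -d)"
cd "$work"

# Build a small stand-in for the remote: two commits, two top-level directories.
git init --quiet upstream
cd upstream
git config user.email "demo@example.com"   # hypothetical identity, local to this repo
git config user.name "Demo"
mkdir -p img/team docs
echo "v1" > img/team/mauro-photo.txt
echo "docs" > docs/readme.txt
git add .
git commit --quiet -m "first"
echo "v2" > img/team/mauro-photo.txt
git commit --quiet -am "second"
branch="$(git symbolic-ref --short HEAD)"   # main or master, depending on your Git config
cd ..

# Same steps as in the post, pointed at the stand-in remote.
git init --quiet sparse
cd sparse
git remote add origin "file://$work/upstream"
git config core.sparseCheckout true
echo "img/team/mauro*" >> .git/info/sparse-checkout
git pull --quiet --depth=1 origin "$branch"

# List the files in the working tree (skipping .git) and count the commits we received.
find . -path ./.git -prune -o -type f -print
git rev-list --count HEAD
```

If everything works as intended, the last two commands should list only the file under img/team and report a single commit, even though the stand-in remote has two commits and a docs directory.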

I explored the result with tree and it was just what I needed to modify:

tree
.
└── themes
    └── ropensci
        └── static
            └── img
                └── team
                    ├── mauro-lepore.jpg
                    └── mauro-lepore-mentor.jpg

But how large is it?

While those tricks were useful, I was still curious about the size of the repo, so I did clone it all and explored disk usage with du:

du --human-readable --max-depth=1 .
219M ./themes
164K ./.Rproj.user
56K ./archetypes
628K ./resources
168K ./data
376M ./.git
20K ./static
12K ./.github
40K ./scripts
161M ./content
24K ./layouts
475M ./public
1.3G .

Indeed this is much larger than the source code I typically handle. But now I know a few more Git tricks (and even more about blogging on rOpenSci 🙂 ).
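
The du listing above comes out in whatever order du visits the directories. To make the heaviest ones easier to spot, the same call can be piped through sort, and git count-objects gives Git's own report on its object store. This is a minimal sketch, assuming GNU du and sort (the long options are not available everywhere); it runs in a throwaway repository with hypothetical directory names (small, big) so that it does not depend on having roweb3 cloned.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with directories of different sizes (hypothetical names).
repo="$(mktemp -d)"
cd "$repo"
git init --quiet .
mkdir -p small big
head -c 4096 /dev/zero > small/a.bin
head -c 2097152 /dev/zero > big/b.bin

# Per-directory totals, smallest first, so the heaviest entries end up last.
# --human-numeric-sort understands suffixes such as K, M and G.
du --human-readable --max-depth=1 . | sort --human-numeric-sort

# Git's own report on the size of its object store (loose objects and packs).
git count-objects -vH
```

In a real clone, you would run only the last two commands from the root of the repository.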

Conclusion

If all you have is a hammer, everything looks like a nail. — Abraham Maslow

Sometimes git clone is not the right tool for the job. A sparse checkout and a shallow pull can help you get just what you need.

If you enjoy learning from videos you may search “git” on my YouTube channel or explore the playlists git, git-from-the-terminal, and git-con-la-terminal (in Spanish).

What are your favorite Git tricks? How about blogging about them?
