Git Tricks for Working with Large Repositories
Recently Yanina Bellini Saibene reminded us to update our Slack profile:
Friendly reminder: Let’s increase the value of our rOpenSci Slack community. Please add details to your profile, e.g., your photo, your favorite social media handle, what you do, your pronouns, and how to pronounce your name.
After doing that, I went on to update my profile photos on the rOpenSci website, which ended up teaching me a few Git tricks I would like to share here. Thanks to Maëlle Salmon for the encouragement, and to Steffi LaZerte for reviewing this post.
Cloning as usual
When I tried to clone the source code of rOpenSci’s website I realized the repo was large and it would take me several minutes.
git clone https://github.com/ropensci/roweb3.git
I decided to stop the process and research how to pull just the latest version of the specific files I needed.
Pulling the latest version of specific files
First I forked the rOpenSci website repository (roweb3). I used the gh CLI from the terminal, but I could also have forked it manually on GitHub.
# if not using `gh`, fork ropensci/roweb3 from GitHub
gh repo fork ropensci/roweb3
Then I created an empty local roweb3 directory and linked it to the fork.
git init roweb3
cd roweb3
git remote add origin git@github.com:maurolepore/roweb3.git
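Before pulling anything, it can help to confirm that the empty repository really points at the fork and holds no objects yet. The following is a minimal sketch in a throwaway directory; the URL `git@github.com:your-user/roweb3.git` is a placeholder, not a real fork, and adding a remote does not contact the network.

```shell
set -eu

# Work in a temporary directory so nothing real is touched.
cd "$(mktemp -d)"

git init --quiet roweb3
cd roweb3

# Placeholder URL (assumption): replace your-user with your GitHub handle.
git remote add origin git@github.com:your-user/roweb3.git

# Show where "origin" points.
remote_url="$(git remote get-url origin)"
echo "$remote_url"

# A freshly initialised repository has no loose objects yet.
objects="$(git count-objects | cut -d' ' -f1)"
echo "$objects"
```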
Now for the tricks! I avoided having to download the whole repository by first finding the specific files I needed on GitHub’s “Go to file” box, then:
- Trick 1: Configured a sparse checkout matching just those files.
git config core.sparseCheckout true
echo "themes/ropensci/static/img/team/mauro*" >> .git/info/sparse-checkout
- Trick 2: Pulled with --depth 1 to get only the latest version of those files.
git pull --depth=1 origin main
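To try both tricks without touching a real remote, here is a minimal sketch that builds a small stand-in repository locally and then pulls from it the same way. The file names, the `demo@example.com` identity, and the `file://` remote are all assumptions made for illustration; only the `git config`, `sparse-checkout`, and `git pull --depth=1` steps mirror the post.

```shell
set -eu

work="$(mktemp -d)"

# 1. Build a stand-in "remote" with two commits and some files we do not want.
git init --quiet "$work/remote"
cd "$work/remote"
git config user.email "demo@example.com"   # hypothetical identity
git config user.name "Demo"
mkdir -p img/team content
echo a > img/team/mauro-lepore.jpg
echo b > img/team/someone-else.jpg
echo c > content/post.md
git add .
git commit --quiet -m "first"
echo d >> content/post.md
git commit --quiet -am "second"
branch="$(git symbolic-ref --short HEAD)"

# 2. Create an empty local repository linked to the stand-in remote.
#    A file:// URL makes Git honour --depth for a repository on disk.
git init --quiet "$work/local"
cd "$work/local"
git remote add origin "file://$work/remote"

# Trick 1: sparse checkout matching just the files we need.
git config core.sparseCheckout true
mkdir -p .git/info
echo "img/team/mauro*" >> .git/info/sparse-checkout

# Trick 2: shallow pull, fetching only the latest commit.
git pull --quiet --depth=1 origin "$branch"

# Inspect the result: files in the working tree and commits in history.
files="$(find . -path ./.git -prune -o -type f -print | sort)"
commits="$(git rev-list --count HEAD)"
echo "$files"
echo "$commits"
```

Only the file matching the sparse pattern should appear in the working tree, and the history should contain a single commit. Recent Git versions also ship a `git sparse-checkout` command that manages the same patterns file; the manual approach here follows the post.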
I explored the result with tree and it was just what I needed to modify:
tree
.
└── themes
    └── ropensci
        └── static
            └── img
                └── team
                    ├── mauro-lepore.jpg
                    └── mauro-lepore-mentor.jpg
But how large is it?
While those tricks were useful, I was still curious about the size of the repo, so I did clone it all and explored disk usage with du:
du --human-readable --max-depth=1 .
219M ./themes
164K ./.Rproj.user
56K ./archetypes
628K ./resources
168K ./data
376M ./.git
20K ./static
12K ./.github
40K ./scripts
161M ./content
24K ./layouts
475M ./public
1.3G .
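When a listing like this gets long, sorting it makes the heaviest directories easier to spot. The sketch below assumes GNU `du` and `sort` (the same long options used above) and runs on a throwaway directory whose `big` and `small` folders are made up for illustration.

```shell
set -eu

# Build a throwaway directory with one large and one small subdirectory.
demo="$(mktemp -d)"
mkdir -p "$demo/big" "$demo/small"
head -c 2000000 /dev/urandom > "$demo/big/blob"
head -c 100000 /dev/urandom > "$demo/small/blob"
cd "$demo"

# Same du call as in the post, sorted from largest to smallest.
sorted="$(du --human-readable --max-depth=1 . | sort --human-numeric-sort --reverse)"
echo "$sorted"

# Skip the "." total and report the largest subdirectory.
largest_child="$(printf '%s\n' "$sorted" | awk -F'\t' '$2 != "." { print $2; exit }')"
echo "$largest_child"
```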
Indeed this is much larger than the source code I typically handle. But now I know a few more Git tricks (and even more about blogging on rOpenSci 🙂 ).
Conclusion
If all you have is a hammer, everything looks like a nail. — Abraham Maslow
Sometimes git clone is not the right tool for the job. A sparse checkout and a shallow pull can help you get just what you need.
If you enjoy learning from videos you may search “git” on my YouTube channel or explore the playlists git, git-from-the-terminal, and git-con-la-terminal (in Spanish).
What are your favorite Git tricks? How about blogging about them?