I get around this conundrum by running much of my analytics on either my work server or on an EC2 machine (I’m going to call these collectively “my servers” for the rest of this post). The nagging problem with this has been keeping files in sync. RStudio Server has been a great help to my workflow because it lets me edit files in my browser and run them on my servers. But when a long-running R job blows out files, I want those IMMEDIATELY synced to my laptop. That way I know, when I undock my laptop to run to the train station, that all my files will be there for me to spill Old Style beer on as I ride the Metra North line.
So Spideroak’s the panacea then? Well… um… no. They have two critical flaws: 1) they depend on time stamps on files to determine the most recent file, and 2) syncs are slow, sometimes taking more than 5 minutes for very small files. The time stamp issue is an engineering failure, plain and simple. I’ve talked to their tech support and been assured that in the future they will index using server time, not local system time. But as of April 6, 2011, Spideroak uses local system time. For most users this is no big deal. For my use case this is painful. My server and my laptop differed by 6 seconds, and that difference was enough to confuse Spideroak about which files were the freshest. This is a big deal when syncing two file systems with fast-changing files. The other issue, slow sync, was actually more painful, but probably the result of their attempt to be nice with CPU time and also encryption. When jobs on my server finished, I expected those files to start syncing within seconds, and the only delay I expected was bandwidth constraints. With Spideroak, syncs might take 5 minutes to start, and then it would go out for coffee, come back jittery, and finally complete. Even if Spideroak fixed the time stamp issue (or I forced my laptop to set its time based on my server), it still would not work for my sync because of the huge lags.
So looking at Dropbox and Spideroak, I realized that I liked everything about Spideroak except its sync. It’s a great cloud backup tool that seems to properly do encryption; it’s multiplatform (Windows, Linux, Mac); it has an iPad/iPhone app for viewing/sending files; and it’s smart about backups and won’t upload the same file twice (even if the file is on two different computers). For my business use, I just can’t use Dropbox. The lack of “trust no one” encryption is a deal killer. So what I really need is a sync solution to use alongside Spideroak.
There are some neat projects out there for sync. Projects like Sparkleshare look really promising, but they’re trying to do all sorts of things, not just sync. I’ve already settled on letting Spideroak do backup and version tracking, so I don’t really need all those features… OK, OK, I can hear you muttering, “just use rsync and be done with it already.” Yeah, that’s a good idea. But rsync is one-directional, and while it does a lot of things well, it can also be a bit of an asshole if you don’t set all the flags right and rub its belly the right way. To see what I mean, here’s a typical one-way push (just a sketch; the paths and hostname are made up):
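```sh
# One-way push: make the server's copy mirror the laptop's (hypothetical
# paths/host). -a preserves permissions and times, -z compresses over the
# wire, and --delete removes server files that no longer exist locally --
# point this in the wrong direction and it will cheerfully delete your
# freshest work.
rsync -az --delete ~/analysis/ me@myserver:~/analysis/
```
If you google for “bidirectional sync” you’re going to see this problem has plagued a lot of folks. This blog post has already gone on long enough, so I’ll cut to the chase. Here’s the stack of tools I settled on for cobbling together my own secure, real-time, bidirectional sync between two Ubuntu boxes (one of which changes IP address and is often behind a NAT router):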
1) Unison – Fast sync using rsync-esque algos and really fast caching/scanning (a one-liner sketch follows this list)
2) lsyncd – Live (real-time) sync daemon
3) autossh – a nifty wrapper around the ssh client that keeps the connection alive and respawns it if it gets dropped
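To give a flavor of Unison before the walkthrough: a one-shot bidirectional sync is a single command. This is a sketch with made-up paths and hostname; -batch suppresses the interactive prompts and -auto accepts all non-conflicting changes.

```sh
# Reconcile a local directory with the same directory on the server,
# propagating changes in both directions (hypothetical host/paths).
# -batch: don't ask questions; -auto: accept non-conflicting changes.
unison ~/analysis ssh://myserver//home/me/analysis -batch -auto
```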
I’ll do another post with the nitty-gritty of how I set this up, but the short version is that I installed Unison and lsyncd on both the laptop and the server. Single-direction sync from my laptop to the server is pretty straightforward: lsyncd watches the file system, and when a file changes it calls Unison, which syncs the files with the server. The tricky bit was getting my server to sync with my laptop, which is often behind a NAT router. The solution was to open an ssh connection from my laptop to my server using autossh and reverse forward port 5555 on the server back to my laptop’s port 22. That way an lsyncd process on the server can monitor the file system and, when it sees a change, kick off a Unison job that syncs the server to ssh://localhost:5555//some/path, which is forwarded to my laptop! autossh makes sure that connection stays up and respawns it if it does get dropped. So with a little shell scripting to start the lsyncd daemon on both machines, some config of lsyncd, and a local shell script to fire off the autossh connection, I’ve got real-time bidirectional sync! The two key pieces are sketched below.
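Here’s roughly what those two pieces look like. The user, hostname, and paths are all made up, and since the real lsyncd config is coming in the follow-up post, I’m standing in for it here with a bare inotifywait loop that shows the same watch-then-sync pattern:

```sh
# --- On the laptop: keep a reverse tunnel up with autossh ---------------
# -M 0 disables autossh's monitor port in favor of ssh keepalives;
# -f backgrounds it; -N opens no remote shell; -R makes the server's
# port 5555 forward back to the laptop's sshd on port 22.
# (Hypothetical user/host.)
autossh -M 0 -f -N \
    -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" \
    -R 5555:localhost:22 me@myserver

# --- On the server: watch files and sync back through the tunnel --------
# A stand-in for lsyncd: inotifywait blocks until something under the
# directory changes, then Unison reconciles the server copy with the
# laptop via the forwarded port. (Hypothetical path.)
while inotifywait -r -e modify,create,delete,move /some/path; do
    unison /some/path ssh://localhost:5555//some/path -batch -auto
done
```

The laptop-to-server direction is the mirror image, minus the tunnel: lsyncd watches locally and calls Unison straight against the server.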
In a follow-up post I’ll put up the details of this configuration. Stay tuned. (EDIT: Update posted!)
If you’ve solved sync a different way and you like your solution, please comment. I haven’t settled on this as my long-term solution. It’s just a solution that works. Which is more than I had yesterday.