One of the great conveniences of performing a data science style analysis using Jupyter is that Jupyter notebooks are literate containers that combine code, text, results, and graphs. This is also one of the pain points in working with Jupyter notebooks with partners or with source control. That is: Jupyter notebooks are JSON (which rapidly becomes not human readable, and not easily diff-able) and many notebook viewing tools alter the notebook even on opening.
There are tools for dealing with this, such as git hooks that strip output data, but they have not met my needs in the past.
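To make the idea concrete, here is a minimal sketch of what an output-stripping step does, using only Python's standard `json` module. The function name `strip_outputs` and the tiny hand-written notebook are illustrative assumptions, not part of any tool mentioned in this post; the field names (`cells`, `cell_type`, `outputs`, `execution_count`) are those of the version 4 notebook format.

```python
import json


def strip_outputs(notebook_json: str) -> str:
    """Return notebook JSON text with code-cell outputs and counters removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop stored results and plots
            cell["execution_count"] = None  # drop the execution counter
    # Stable key order and indentation make the text friendlier to diff.
    return json.dumps(nb, indent=1, sort_keys=True)


# A tiny hand-written notebook (illustrative only, not plot.ipynb).
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [{
                "output_type": "execute_result",
                "execution_count": 3,
                "data": {"text/plain": ["2"]},
                "metadata": {},
            }],
        },
    ],
})

stripped = json.loads(strip_outputs(example))
```

Even after stripping, what remains is still JSON with per-cell metadata, which is part of why this approach can feel awkward to review and edit by hand.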
The above differs from the knitr/rmarkdown strategy pursued by RStudio. In that scheme “.Rmd” files are purely code and text (user produced inputs), and are processed to produce outputs (typically markdown, HTML, PDF, and others).
As I switch back and forth between R and Python projects for various clients and partners, I got to thinking: “is there an easy way to separate code from presentations in Jupyter notebooks?”
The answer turns out to be yes. Jupyter itself exposes a rich application programming interface in Python. So it is very easy to organize Jupyter’s power into tools that give me a great data science and analysis workflow in Python.
All of the steps I am going to demonstrate can be found here.
What we do is start with an ‘.ipynb’ Jupyter worksheet or notebook: plot.ipynb. I can edit and execute this worksheet using JupyterLab, Visual Studio Code, PyCharm, or many other interactive tools. As usual, the sheet’s input cells are a mixture of code cells and markdown cells.
Obviously Jupyter itself can export the notebook to Python:
jupyter nbconvert --to script --stdout plot.ipynb > plot_nbconvert.py
However, this is pretty much one way. There isn’t a quick way to convert plot_nbconvert.py back to an ‘.ipynb’ notebook, or to execute the Python in such a way that we also get the markdown formatting and the implicit printing and plotting that notebooks provide for the last value seen in each cell. The converted “.py” file doesn’t preserve enough of our expressed intent.
Suppose, instead, we export our notebook with the following command (supplied by the wvpy package):
python -m wvpy.pysheet plot.ipynb
This creates the file plot.py. This export uses the convention that free text is taken to be Python code, and markdown is in special quote-blocks. When there are neighboring code blocks, there is an annotation marking the boundaries so we don’t lose the block structure.
Once we have this file we can do one of two things:
- Render it to HTML by executing it, saving the results, and stripping out code inputs.
The command to do this is:
python -m wvpy.render_workbook --strip_input plot.py
And this produces the ‘.html’ file plot.html. The html looks like the following.
Notice the input cells and output numberings are stripped to make this essentially a user controlled report. The above command works the same on a notebook input such as plot.ipynb.
- Convert it back to ‘.ipynb’ for further interactive work.
The command to do this is:
cp plot.py plot_copy.py
python -m wvpy.pysheet plot_copy.py
And we can see the recovered file plot_copy.ipynb has the structure of the original.
One can share, edit, and diff the ‘.py’ file. All one has to do is mark markdown in line-initial “''' begin text” and “''' # end text” blocks. Multiple code blocks are separated by “'''end code'''” lines.
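To illustrate the convention, here is a minimal sketch of a round trip between a list of cells and text in this style. This is my own toy illustration, not the wvpy implementation: the names `cells_to_py` and `py_to_cells` are hypothetical, cells are simplified to (kind, text) pairs, and the exact spacing wvpy emits may differ. It also assumes the markdown itself contains no triple quotes.

```python
from typing import List, Optional, Tuple

Cell = Tuple[str, str]  # (kind, text); kind is "markdown" or "code"

# Marker lines, copied from the convention described above.
BEGIN_TEXT = "''' begin text"
END_TEXT = "''' # end text"
END_CODE = "'''end code'''"


def cells_to_py(cells: List[Cell]) -> str:
    """Write cells as one Python source text (hypothetical helper)."""
    lines: List[str] = []
    prev_kind: Optional[str] = None
    for kind, text in cells:
        if kind == "markdown":
            lines.append(BEGIN_TEXT)
            lines.extend(text.split("\n"))
            lines.append(END_TEXT)
        else:
            if prev_kind == "code":
                # Only neighboring code cells need an explicit boundary.
                lines.append(END_CODE)
            lines.extend(text.split("\n"))
        prev_kind = kind
    return "\n".join(lines) + "\n"


def py_to_cells(py_text: str) -> List[Cell]:
    """Recover the cells from text written by cells_to_py."""
    cells: List[Cell] = []
    buffer: List[str] = []
    in_text = False

    def flush(kind: str) -> None:
        # Emit the buffered lines as one cell, if there are any.
        if buffer:
            cells.append((kind, "\n".join(buffer)))
            buffer.clear()

    for line in py_text.splitlines():
        if not in_text and line == BEGIN_TEXT:
            flush("code")
            in_text = True
        elif in_text and line == END_TEXT:
            flush("markdown")
            in_text = False
        elif not in_text and line == END_CODE:
            flush("code")
        else:
            buffer.append(line)
    flush("markdown" if in_text else "code")
    return cells


cells = [
    ("markdown", "# A plot"),
    ("code", "import math"),
    ("code", "math.sqrt(16.0)"),
]
py_text = cells_to_py(cells)
recovered = py_to_cells(py_text)
```

A nice property of this style is that the markers are ordinary Python string literals and comments, so the exported text is still a valid Python file that editors, linters, and diff tools can handle.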
The design is: work however you want (definitely prototyping in ‘.ipynb’ files, using whatever tools you like), but save only converted ‘.py’ files in source control. Automate re-running sheets (even multiple runs taking external parameters) to reproduce results at will.
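As a rough sketch of the automation step, one could wrap the render command shown earlier in a small Python driver. The helper names `render_command` and `render_all` are hypothetical, and the sketch only builds (and optionally runs) the command line given in this post; it does not cover passing external parameters, since that mechanism is not described here.

```python
import subprocess
import sys
from typing import List, Sequence


def render_command(sheet: str, strip_input: bool = True) -> List[str]:
    """Build the render command for one sheet (hypothetical helper)."""
    cmd = [sys.executable, "-m", "wvpy.render_workbook"]
    if strip_input:
        cmd.append("--strip_input")
    cmd.append(sheet)
    return cmd


def render_all(sheets: Sequence[str], dry_run: bool = True) -> List[List[str]]:
    """Plan, and optionally run, the render command for each sheet."""
    commands = [render_command(sheet) for sheet in sheets]
    if not dry_run:
        for cmd in commands:
            # Requires the wvpy package to be installed; stops on first failure.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: only shows what would be executed.
planned = render_all(["plot.py"])
```

Calling `render_all(["plot.py"], dry_run=False)` would actually execute the sheets, which is the step one would schedule or hook into a build.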
This is a workflow we intend to use and teach a lot.