
Distributing data science products

[This article was first published on Category R on Roel's R-tefacts, and kindly contributed to R-bloggers.]

Where or what is production? What does it mean when someone says to bring some data science product ‘in production’? What does it mean for data science products to be in production? Is your product already in production? Is it a magical place?

I think two questions are of importance:

If the answer to these questions is yes, then your ‘thing’ is in production. The devil is, as always, in the details: how do you integrate your data science product into your company infrastructure?

Distributing data science products

I see 3 or 5 different end-products that I would call ‘production’.

Data scientists produce results with a statistical model.

How we go from there is one of three ways

Options for distributing a trained model

  1. Distribute the parameters of the model alone. For instance, if you build a linear model, you can extract the parameters and turn them into an advanced SQL query with, for instance, {tidypredict} or {modeldb}. I don’t know of any Python packages that can do this, but you could program it yourself. If your model is sufficiently simple, you can even print out the decision rules for practitioners, for instance with {FFTrees}.
  2. Return the trained model artefact: save your pickled Python model or .rds R model in a central location with some metadata and pull it in where necessary. TensorFlow models are distributed like this.
  3. Wrap your model and environment into a Docker container, supply it with an API and distribute that container. The entire model is hidden away behind an interface that anyone can work with from any programming language. The big cloud vendors do it in a similar way (they call it AI for some reason).
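
To make the first option more concrete, here is a minimal Python sketch of "programming it yourself": it turns the intercept and coefficients of a linear model into a SQL query. It is only an illustration under stated assumptions, not how {tidypredict} or {modeldb} work internally. The function name `linear_model_to_sql`, the table `customers`, its columns, and the coefficient values are all hypothetical.

```python
def linear_model_to_sql(intercept, coefficients, table, output="prediction"):
    """Build a SQL query that scores every row with a fitted linear model.

    `coefficients` maps column names to fitted weights. Column and table
    names are pasted into the query as-is, so they must come from trusted
    code and never from user input.
    """
    terms = [repr(float(intercept))]
    for column, weight in coefficients.items():
        terms.append(f"({float(weight)!r} * {column})")
    expression = " + ".join(terms)
    return f"SELECT *, {expression} AS {output} FROM {table}"


# Hypothetical values, standing in for parameters copied from a fitted model
query = linear_model_to_sql(
    intercept=1.5,
    coefficients={"age": 0.25, "income": -0.001},
    table="customers",
)
print(query)
```

The resulting query can be handed to whoever owns the database, so the prediction runs where the data lives and no model file has to be shipped. This only works for models whose prediction can be written as a plain arithmetic expression.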

So what should I choose?

Talk to your stakeholders from start to finish. Plan for production from the start of your proof of concept. You already know which of the options is required: is your end goal an explanation or a prediction? If your end product is a prediction, will you predict in batches, or create an image that can be called through a standard API? Figure these things out as early as possible, so that your project has the best chance of becoming a successful product and not one of the many failed proofs of concept.
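
If the answer turns out to be batch prediction, one possible shape is sketched below: a pickled model is stored next to a small metadata file, and a batch job later loads both and scores many rows in one go. This is a sketch under assumptions, not a recommended layout. `MeanModel` is a toy stand-in for a real trained model, and the file names `model.pkl` and `metadata.json`, the `version` field, and the helper functions are hypothetical. A temporary directory plays the role of the central location.

```python
import json
import pickle
import tempfile
from pathlib import Path


class MeanModel:
    """Toy stand-in for a trained model: always predicts the training mean."""

    def fit(self, values):
        self.mean_ = sum(values) / len(values)
        return self

    def predict(self, rows):
        return [self.mean_ for _ in rows]


def save_model(model, directory, metadata):
    """Write the pickled model and a JSON metadata file side by side."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "model.pkl").write_bytes(pickle.dumps(model))
    (directory / "metadata.json").write_text(json.dumps(metadata))


def batch_predict(directory, rows):
    """Load the stored artefact and score a whole batch of rows at once."""
    directory = Path(directory)
    metadata = json.loads((directory / "metadata.json").read_text())
    # Only unpickle files you trust: loading a pickle can execute code.
    model = pickle.loads((directory / "model.pkl").read_bytes())
    return metadata["version"], model.predict(rows)


with tempfile.TemporaryDirectory() as store:
    save_model(MeanModel().fit([1.0, 2.0, 3.0]), store, {"version": "0.1.0"})
    version, predictions = batch_predict(store, rows=[{"x": 1}, {"x": 2}])
    print(version, predictions)
```

Keeping the metadata next to the artefact lets the batch job record which model version produced which predictions. The API route would wrap the same `predict` call behind a web endpoint instead.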

Good luck!
