Articles by John Mount

Comparative examples using replyr::let

December 22, 2016 | John Mount

Consider the problem of “parametric programming” in R. That is: simply writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier. ...
[Read more...]

help(let, package='replyr')

December 17, 2016 | John Mount

A bit more on our replyr R package.

library("replyr")
help(let, package='replyr')

let {replyr} R Documentation: Prepare expr for execution with name substitutions specified in alias. Description: replyr::let implements a mapping from desired names (names used directly in the expr code) to names used in the data. ... [Read more...]

magrittr’s Doppelgänger

December 13, 2016 | John Mount

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice. If you read my last article on assignment carefully you may have noticed I ...
[Read more...]

The Case For Using -> In R

December 12, 2016 | John Mount

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>” which have different semantics). The R-style guides routinely insist on “<-”; this note makes the case for “->” when using magrittr pipelines. ...
[Read more...]

The case for index-free data manipulation

December 10, 2016 | John Mount

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) ...
[Read more...]

Parametric variable names and dplyr

December 3, 2016 | John Mount

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R ...
[Read more...]

Be careful evaluating model predictions

December 2, 2016 | John Mount

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For ...
[Read more...]

MySql in a container

November 19, 2016 | John Mount

I have previously written on using containerized PostgreSQL with R. This shows the steps for using containerized MySQL with R. As a consulting data scientist I often have to debug and rehearse work away from the client's actual infrastructure. Because of this it is useful to be able to spin ... [Read more...]

Teaching Practical Data Science with R

November 16, 2016 | John Mount

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for some additional comments on the intent of ... [Read more...]

You should re-encode high cardinality categorical variables

November 11, 2016 | John Mount

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and ... [Read more...]

Some vtreat design principles

November 1, 2016 | John Mount

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles ... [Read more...]

A quick look at RStudio’s R notebooks

October 22, 2016 | John Mount

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). (see http://rmarkdown.rstudio.com/r_notebooks.html and https://www.rstudio.com/products/rstudio/download/preview/ ) [Read more...]

Data science for executives and managers

October 21, 2016 | John Mount

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is ... [Read more...]

On calculating AUC

October 7, 2016 | John Mount

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is ... [Read more...]

Proofing statistics in papers

October 2, 2016 | John Mount

Recently saw a really fun article making the rounds: The prevalence of statistical reporting errors in psychology (1985–2013) Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al. Behav Res (2015). doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for ... [Read more...]