Articles by John Mount

Comparative examples using replyr::let

December 22, 2016 | John Mount

Consider the problem of “parametric programming” in R: writing correct code before knowing some details, such as the names of the columns your procedure will have to be applied to in the future. Our latest version of replyr::let makes such programming easier. ...

[Read more...]
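The idea of writing code against placeholder names can be sketched in base R. The helper `let_sketch` below is hypothetical and is not replyr's implementation; it only illustrates swapping a stand-in symbol (`COL`, also an assumption) for a column name that is supplied later.

```r
# Hypothetical helper, not part of replyr: replace placeholder symbols
# in an expression with the real names, then evaluate the result.
let_sketch <- function(alias, expr) {
  caller <- parent.frame()
  code <- substitute(expr)                  # capture the code unevaluated
  mapping <- lapply(alias, as.name)         # each string becomes a symbol
  code <- do.call(substitute, list(code, mapping))
  eval(code, caller)                        # run the rewritten code in the caller
}

d <- data.frame(x = c(1, 2, 3))
column_to_use <- "x"                        # known only at run time
total <- let_sketch(list(COL = column_to_use), sum(d$COL))
```

The code inside the call is written once against `COL`; only the mapping changes when the data uses a different column name.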

help(let, package='replyr')

December 17, 2016 | John Mount

A bit more on our replyr R package.

library("replyr")
help(let, package='replyr')

let {replyr} R Documentation

Prepare expr for execution with name substitutions specified in alias.

Description

replyr::let implements a mapping from desired names (names used directly in the expr code) to names used in the data. ... [Read more...]

Organize your data manipulation in terms of “grouped ordered apply”

December 15, 2016 | John Mount

Consider the following common problem: compute per-group ranks for a data set (say the infamous iris example data set). Suppose we want the rank of iris Sepal.Lengths on a per-Species basis. Frankly this is an “ugh” problem for many analysts: it involves, all at the same time, grouping, ordering, ...

[Read more...]
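For reference, one way to get these per-group ranks in base R is `ave()`. This is a minimal sketch and not necessarily the approach the article takes; the column name `rank_in_species` and the choice of `ties.method = "first"` are assumptions made for illustration.

```r
# Rank Sepal.Length within each Species using base R only.
ranked <- iris
ranked$rank_in_species <- ave(
  iris$Sepal.Length,
  iris$Species,
  FUN = function(v) rank(v, ties.method = "first")  # break ties by row order
)

# Look at the smallest-ranked rows of one group.
head(ranked[ranked$Species == "setosa" & ranked$rank_in_species <= 3, ])
```

`ave()` handles the grouping and puts each result back in the original row position, so no explicit split, sort, and re-join is needed.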

magrittr’s Doppelgänger

December 13, 2016 | John Mount

R picked up a nifty way to organize sequential calculations in May of 2014: magrittr by Stefan Milton Bache and Hadley Wickham. magrittr is now quite popular and also has become the backbone of current dplyr practice. If you read my last article on assignment carefully you may have noticed I ...

[Read more...]
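The core idea of piping, passing the value on the left into the step on the right, can be sketched in a line of base R. The operator `%then%` below is hypothetical and far simpler than magrittr's `%>%`: it only accepts a bare function on the right-hand side.

```r
# Hypothetical minimal pipe: apply the function on the right to the value on the left.
`%then%` <- function(lhs, f) f(lhs)

# Nested form: read inside-out.
nested <- sum(sqrt(c(1, 4, 9)))

# Piped form: read left to right, one step at a time.
piped <- c(1, 4, 9) %then% sqrt %then% sum
```

User-defined `%op%` operators are left-associative in R, which is what lets the steps chain in reading order.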

The Case For Using -> In R

December 12, 2016 | John Mount

R has a number of assignment operators (at least “<-”, “=”, and “->”; plus “<<-” and “->>”, which have different semantics). The R style guides routinely insist on “<-”; this article makes the case for “->” when using magrittr pipelines. ...

[Read more...]
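A short base-R illustration of the operators in question. The variable and function names are made up for the example; it shows that `->` assigns the same value as `<-` with the target written last, and that `<<-` behaves differently because it assigns into an enclosing environment.

```r
# Left and right assignment store the same value.
total_left <- sum(sqrt(c(1, 4, 9)))
sum(sqrt(c(1, 4, 9))) -> total_right

# `<<-` has different semantics: it updates a variable in an enclosing scope.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # modifies `count` in make_counter's environment
    count
  }
}
counter <- make_counter()
first <- counter()
second <- counter()
```

With `->`, the computation reads left to right and the name it lands in comes last, which is the same direction a pipeline flows.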

The case for index-free data manipulation

December 10, 2016 | John Mount

Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) ...

[Read more...]

Parametric variable names and dplyr

December 3, 2016 | John Mount

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R ...

[Read more...]
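In base R, the simplest way to treat a column name as a parameter is to index with `[[ ]]` using a string. The function `center_column` below is a hypothetical example, not code from the article.

```r
# Hypothetical reusable function: the column to work on arrives as a string.
center_column <- function(d, col_name) {
  d[[col_name]] <- d[[col_name]] - mean(d[[col_name]])
  d
}

# The caller decides the column at run time.
centered <- center_column(iris, "Sepal.Length")
```

The difficulty the article points at arises with interfaces that capture column names as unevaluated code, where a string held in a variable cannot be dropped in this directly.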

Be careful evaluating model predictions

December 2, 2016 | John Mount

One thing I teach is: when evaluating the performance of regression models you should not use correlation as your score. This is because correlation tells you if a re-scaling of your result is useful, but you want to know if the result in your hand is in fact useful. For ...

[Read more...]
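A small made-up example shows the gap between correlation and usefulness. The numbers below are invented for illustration: the predictions are a perfect linear re-scaling of the truth, so correlation is as high as it can be, yet every prediction is far from the actual value.

```r
# Invented data: predictions are a shifted, scaled copy of the outcome.
actual <- c(1, 2, 3, 4, 5)
predicted <- 10 * actual + 100

correlation <- cor(actual, predicted)           # blind to the re-scaling
rmse <- sqrt(mean((actual - predicted)^2))      # measures the error as delivered
```

Correlation rewards the model for what it could predict after a re-fit; root mean squared error scores the predictions you actually have.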

vtreat data cleaning and preparation article now available on arXiv

November 30, 2016 | John Mount

Nina Zumel and I are happy to announce a formal article discussing data preparation and cleaning using the vtreat methodology is now available from arXiv.org as citation arXiv:1611.09477 [stat.AP]. vtreat is an R data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound ... [Read more...]

New R package: replyr (get a grip on remote dplyr data services)

November 22, 2016 | John Mount

It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more ...

[Read more...]

MySql in a container

November 19, 2016 | John Mount

I have previously written on using containerized PostgreSQL with R. This shows the steps for using containerized MySQL with R. As a consulting data scientist I often have to debug and rehearse work away from the client's actual infrastructure. Because of this it is useful to be able to spin ... [Read more...]

Teaching Practical Data Science with R

November 16, 2016 | John Mount

Practical Data Science with R, Zumel, Mount; Manning 2014 is a book Nina Zumel and I are very proud of. I have written before how I think this book stands out and why you should consider studying from it. Please read on for some additional comments on the intent of ... [Read more...]

You should re-encode high cardinality categorical variables

November 11, 2016 | John Mount

Nina Zumel and I have been doing a lot of writing on the (important) details of re-encoding high cardinality categorical variables for predictive modeling. These are variables that essentially take on string-values (also called levels or factors) and vary through many such levels. Typical examples include zip-codes, vendor IDs, and ... [Read more...]
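One common re-encoding is impact (effects) coding: replace each level with how far the outcome's mean for that level sits from the overall mean. The sketch below uses invented data and column names, and it omits the smoothing and out-of-sample safeguards a careful implementation needs.

```r
# Invented data: a categorical variable and a numeric outcome.
d <- data.frame(
  zip = c("a", "a", "b", "b", "c"),
  y = c(1, 3, 10, 12, 5),
  stringsAsFactors = FALSE
)

grand_mean <- mean(d$y)
level_means <- tapply(d$y, d$zip, mean)          # one mean per level

# Naive impact code: level mean minus grand mean, looked up by level name.
d$zip_impact <- as.numeric(level_means[d$zip]) - grand_mean
```

The many-level string column becomes a single numeric column, whatever the number of levels.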

Laplace noising versus simulated out of sample methods (cross frames)

November 9, 2016 | John Mount

Nina Zumel recently mentioned the use of Laplace noise in “count codes” by Misha Bilenko (see here and here) as a known method to break the overfit bias that comes from using the same data to design impact codes and fit a next level model. It is a fascinating method ... [Read more...]
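To make the noising idea concrete, here is a minimal base-R sketch that adds Laplace noise to per-level counts. The helper `rlaplace`, the example counts, and the noise scale are all assumptions for illustration; this is not the exact procedure from the cited work.

```r
# Hypothetical helper: the difference of two independent exponentials with
# rate 1/scale follows a Laplace distribution with mean 0 and that scale.
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

set.seed(2016)
counts <- c(a = 2, b = 2, c = 1)                 # invented per-level counts
noisy_counts <- counts + rlaplace(length(counts), scale = 1)
```

The noise blurs how much any single training row contributes to a code, which is the mechanism for reducing the overfit bias described above.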

Some vtreat design principles

November 1, 2016 | John Mount

We have already written quite a few times about our vtreat open source variable treatment package for R (which implements effects/impact coding, missing value replacement, and novel value replacement; among other important data preparation steps), but we thought we would take some time to describe some of the principles ... [Read more...]

A quick look at RStudio’s R notebooks

October 22, 2016 | John Mount

A quick demo of RStudio’s R Notebooks shown by John Mount (of Win-Vector LLC, a statistics, data science, and algorithms consulting and training firm). (see http://rmarkdown.rstudio.com/r_notebooks.html and https://www.rstudio.com/products/rstudio/download/preview/ ) [Read more...]

Data science for executives and managers

October 21, 2016 | John Mount

Nina Zumel recently announced upcoming speaking appearances. I want to promote the upcoming sessions at ODSC West 2016 (11:15am-1:00pm on Friday November 4th, or 3:00pm-4:30pm on Saturday November 5th) and invite executives, managers, and other data science consumers to attend. We assume most of the Win-Vector blog audience is ... [Read more...]

On calculating AUC

October 7, 2016 | John Mount

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC). R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is ... [Read more...]
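AUC can also be computed without any package, using the rank-sum (Mann-Whitney) identity: it is the probability that a randomly chosen positive example scores above a randomly chosen negative one, with ties counted as half. The function `auc_rank` and the data are a sketch for illustration, not the article's code.

```r
# AUC via the rank-sum identity; `label` is a logical vector (TRUE = positive).
auc_rank <- function(score, label) {
  r <- rank(score)                     # ties receive average ranks
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(FALSE, FALSE, TRUE, TRUE)
auc <- auc_rank(scores, labels)        # 3 of the 4 positive/negative pairs are ordered correctly
```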

Adding polished significance summaries to papers using R

October 4, 2016 | John Mount

When we teach “R for statistics” to groups of scientists (who tend to be quite well informed in statistics, and just need a bit of help with R) we take the time to re-work some tests of model quality with the appropriate significance tests. We organize the lesson in terms ... [Read more...]
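As a rough illustration of turning a fitted model into a reportable significance summary, the base-R sketch below pulls the F-test out of `summary(lm(...))` and formats it as one sentence fragment. The data set (`cars`) and the wording of the report are assumptions, not the lesson's actual material.

```r
# Fit a simple linear model on a built-in data set.
model <- lm(dist ~ speed, data = cars)
s <- summary(model)

# The overall F-test: statistic plus numerator and denominator degrees of freedom.
f <- s$fstatistic
p_value <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Format the pieces a paper would typically quote.
report <- sprintf(
  "R-squared = %.2f, F(%d, %d) = %.1f, p = %.2g",
  s$r.squared, as.integer(f[["numdf"]]), as.integer(f[["dendf"]]),
  f[["value"]], p_value
)
```

Generating the string from the model object avoids copying numbers by hand, which is where many reporting errors start.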

Proofing statistics in papers

October 2, 2016 | John Mount

Recently saw a really fun article making the rounds: The prevalence of statistical reporting errors in psychology (1985–2013) Nuijten, M.B., Hartgerink, C.H.J., van Assen, M.A.L.M. et al. Behav Res (2015). doi:10.3758/s13428-015-0664-2. The authors built an R package to check psychology papers for ... [Read more...]
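The basic check behind this kind of proofing is easy to sketch: recompute the p-value from the reported test statistic and degrees of freedom, then compare it with the p-value printed in the paper. The function names, the example numbers, and the tolerance below are assumptions for illustration and are not taken from the package described in the article.

```r
# Recompute a two-sided p-value from a reported t statistic.
recompute_p <- function(t_value, df) {
  2 * pt(-abs(t_value), df)
}

# Flag a reported p-value that disagrees with the recomputed one.
is_consistent <- function(reported_p, t_value, df, tol = 0.005) {
  abs(recompute_p(t_value, df) - reported_p) <= tol
}

# Invented example: a paper reports t(28) = 2.20.
p <- recompute_p(2.20, 28)
matches_itself <- is_consistent(p, 2.20, 28)
flags_mismatch <- !is_consistent(0.5, 2.20, 28)
```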


R-bloggers

R news and tutorials contributed by hundreds of R bloggers
