Articles by John Mount

How to Avoid the dplyr Dependency Driven Result Corruption

December 6, 2017 | John Mount

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases. To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any ... [Read more...]
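
A minimal sketch of what "dependency-free" can look like, assuming the dplyr package is installed. The data frame `d` and the columns `x`, `y`, and `z` are hypothetical names chosen for illustration; they are not taken from the article. The sketch runs on an in-memory data frame, where both forms agree, so it shows only the shape of the refactoring and not the database behavior the article describes.

```r
# Minimal sketch, assuming the dplyr package is installed; the data frame `d`
# and the columns x, y, z are hypothetical illustration names.
library(dplyr)

d <- data.frame(x = c(1, 2, 3))

# A single mutate() where z depends on y, a value assigned in the same call.
one_step <- d %>%
  mutate(y = x + 1, z = y * 2)

# The same calculation split so no mutate() uses a value it also assigns.
split_steps <- d %>%
  mutate(y = x + 1) %>%
  mutate(z = y * 2)

# On an in-memory data frame the two forms agree.
stopifnot(isTRUE(all.equal(one_step, split_steps)))
```

Splitting the steps makes each mutate() read only columns that already exist before that step, which is the property the excerpt above recommends.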

Please inspect your dplyr+database code

December 2, 2017 | John Mount

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements. If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not ... [Read more...]

Win-Vector LLC announces new “big data in R” tools

November 29, 2017 | John Mount

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN): partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or ...
[Read more...]

Vectorized Block ifelse in R

November 27, 2017 | John Mount

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R. From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure. When porting code from one language ... [Read more...]

Arbitrary Data Transforms Using cdata

November 22, 2017 | John Mount

We have been writing a lot on higher-order data transforms lately: "Coordinatized Data: A Fluid Data Specification", "Data Wrangling at Scale", "Fluid Data", and "Big Data Transforms". What I want to do now is "write a bit more, so I finally feel I have been concise." The cdata R package supplies ...
[Read more...]

RStudio Keyboard Shortcuts for Pipes

November 18, 2017 | John Mount

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R. You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).
[Read more...]

Update on coordinatized or fluid data

November 12, 2017 | John Mount

We have just released a major update of the cdata R package to CRAN. If you work with R and data, now is the time to check out the cdata package. Among the changes in the 0.5.* version of the cdata package: All coordinatized data or fluid data operations are now in ...
[Read more...]

Let X=X in R

November 3, 2017 | John Mount

Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points: (1) Sometimes you have to write parameterized or re-usable code. (2) The methods for doing this should be easy and legible. The first point feels abstract, until you find yourself wanting to re-use code on new projects. ...
[Read more...]

Big Data Transforms

October 29, 2017 | John Mount

As part of our consulting practice Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational databases such as PostgreSQL, or big data systems such as Spark). Often we come to a point where we ...
[Read more...]

Some Announcements

October 24, 2017 | John Mount

Some Announcements: Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”, Sunday, October 29, 2017 10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area). ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and ... [Read more...]

Upcoming data preparation and modeling article series

September 23, 2017 | John Mount

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN. vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. vtreat handles, ...
[Read more...]

My advice on dplyr::mutate()

September 22, 2017 | John Mount

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; ...
[Read more...]

It is Needlessly Difficult to Count Rows Using dplyr

September 3, 2017 | John Mount

Question: how hard is it to count rows using the R package dplyr? Answer: surprisingly difficult. When trying to count rows using dplyr or dplyr-controlled data structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task is to avoid dplyr corner-cases and ...
[Read more...]

Permutation Theory In Action

September 2, 2017 | John Mount

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things ... [Read more...]

Why to use the replyr R package

August 31, 2017 | John Mount

Recently I noticed that the R package sparklyr had the following odd behavior: suppressPackageStartupMessages(library("dplyr")) library("sparklyr") packageVersion("dplyr") #> [1] '0.7.2.9000' packageVersion("sparklyr") #> [1] '0.6.2' packageVersion("dbplyr") #> [1] '1.1.0.9000' sc * Using Spark: 2.1.0 d [1] NA ncol(d) #> [1] NA nrow(d) #> [1] NA …
[Read more...]

Neat New seplyr Feature: String Interpolation

August 28, 2017 | John Mount

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of ...
[Read more...]

wrapr: R Code Sweeteners

August 25, 2017 | John Mount

wrapr is an R package that supplies powerful tools for writing and debugging R code. Primary wrapr services include: let(), %.>% (dot arrow pipe), := (named map builder), λ() (anonymous function builder), and DebugFnW(). let(): let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for ...
[Read more...]

Some Neat New R Notations

August 22, 2017 | John Mount

The R package seplyr supplies a few neat new coding notations. The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the ...
[Read more...]
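
For reference, the base R behavior that the excerpt compares the “named map builder” to can be sketched as follows. This uses only stats::setNames() and does not show the seplyr operator itself; the vectors `keys` and `values` are hypothetical names introduced for illustration.

```r
# Minimal sketch using only base R; `keys` and `values` are hypothetical names.
keys <- c("a", "b")
values <- c("x", "y")

# stats::setNames() attaches the names in `keys` to the elements of `values`.
m <- stats::setNames(values, keys)

# Look up a value by its name.
m[["a"]]
```

The result is an ordinary named character vector, so it can be indexed by name like any other.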

Is dplyr Easily Comprehensible?

August 19, 2017 | John Mount

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible? dplyr makes sense to those of us who use it a lot. And we can teach part time R users a lot of the common good use patterns. But, ...
[Read more...]