Articles by John Mount

How to Avoid the dplyr Dependency Driven Result Corruption

December 6, 2017 | John Mount

In our last article we pointed out a dangerous silent result corruption we have seen when using the R dplyr package with databases. To systematically avoid this result corruption we suggest breaking up your dplyr::mutate() statements to be dependency-free (not assigning the same value twice, and not using any ...

Please inspect your dplyr+database code

December 2, 2017 | John Mount

A note to dplyr with database users: you may benefit from inspecting/re-factoring your code to eliminate value re-use inside dplyr::mutate() statements. If you are using the R dplyr package with a database or with Apache Spark: I respectfully advise you inspect your code to ensure you are not ...

Win-Vector LLC announces new “big data in R” tools

November 29, 2017 | John Mount

Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN): partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or ...

Vectorized Block ifelse in R

November 27, 2017 | John Mount

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R. From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure. When porting code from one language ...

Arbitrary Data Transforms Using cdata

November 22, 2017 | John Mount

We have been writing a lot on higher-order data transforms lately: "Coordinatized Data: A Fluid Data Specification"; "Data Wrangling at Scale"; "Fluid Data"; "Big Data Transforms". What I want to do now is "write a bit more, so I finally feel I have been concise." The cdata R package supplies ...

RStudio Keyboard Shortcuts for Pipes

November 18, 2017 | John Mount

I have just released some simple RStudio add-ins that are great for creating keyboard shortcuts when working with pipes in R. You can install the add-ins from here (which also includes both installation instructions and use instructions/examples).

Data Wrangling at Scale

November 15, 2017 | John Mount

Just wrote a new R article: “Data Wrangling at Scale” (using Dirk Eddelbuettel’s tint template). Please check it out.

Update on coordinatized or fluid data

November 12, 2017 | John Mount

We have just released a major update of the cdata R package to CRAN. If you work with R and data, now is the time to check out the cdata package. Among the changes in the 0.5.* version of the cdata package: All coordinatized data or fluid data operations are now in ...

Let X=X in R

November 3, 2017 | John Mount

Our article "Let’s Have Some Sympathy For The Part-time R User" includes two points: Sometimes you have to write parameterized or re-usable code. The methods for doing this should be easy and legible. The first point feels abstract, until you find yourself wanting to re-use code on new projects. ...

Big Data Transforms

October 29, 2017 | John Mount

As part of our consulting practice Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational databases such as PostgreSQL, or big data systems such as Spark). Often we come to a point where we ...

Some Announcements

October 24, 2017 | John Mount

Some Announcements: Dr. Nina Zumel will be presenting “Myths of Data Science: Things you Should and Should Not Believe”, Sunday, October 29, 2017 10:00 AM to 12:30 PM at the She Talks Data Meetup (Bay Area). ODSC West 2017 is soon. It is our favorite conference and we will be giving both a workshop and … ...

Upcoming data preparation and modeling article series

September 23, 2017 | John Mount

I am pleased to announce that vtreat version 0.6.0 is now available to R users on CRAN. vtreat is an excellent way to prepare data for machine learning, statistical inference, and predictive analytic projects. If you are an R user we strongly suggest you incorporate vtreat into your projects. vtreat handles, ...

My advice on dplyr::mutate()

September 22, 2017 | John Mount

There are substantial differences between ad-hoc analyses (be they: machine learning research, data science contests, or other demonstrations) and production worthy systems. Roughly: ad-hoc analyses have to be correct only at the moment they are run (and often once they are correct, that is the last time they are run; ...

It is Needlessly Difficult to Count Rows Using dplyr

September 3, 2017 | John Mount

Question: how hard is it to count rows using the R package dplyr? Answer: surprisingly difficult. When trying to count rows using dplyr or dplyr controlled data-structures (remote tbls such as Sparklyr or dbplyr structures) one is sailing between Scylla and Charybdis. The task being to avoid dplyr corner-cases and ...

Permutation Theory In Action

September 2, 2017 | John Mount

While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things ...

Why to use the replyr R package

August 31, 2017 | John Mount

Recently I noticed that the R package sparklyr had the following odd behavior: suppressPackageStartupMessages(library("dplyr")) library("sparklyr") packageVersion("dplyr") #__ [1] '0.7.2.9000' packageVersion("sparklyr") #__ [1] '0.6.2' packageVersion("dbplyr") #__ [1] '1.1.0.9000' sc * Using Spark: 2.1.0 d [1] NA ncol(d) #__ [1] NA nrow(d) #__ [1] NA …

Neat New seplyr Feature: String Interpolation

August 28, 2017 | John Mount

The R package seplyr has a neat new feature: the function seplyr::expand_expr() which implements what we call “the string algebra” or string expression interpolation. The function takes an expression of mixed terms, including: variables referring to names, quoted strings, and general expression terms. It then “de-quotes” all of ...

wrapr: R Code Sweeteners

August 25, 2017 | John Mount

wrapr is an R package that supplies powerful tools for writing and debugging R code. Primary wrapr services include: let(), %.>% (dot arrow pipe), := (named map builder), λ() (anonymous function builder), and DebugFnW(). let(): let() allows execution of arbitrary code with substituted variable names (note this is subtly different than binding values for ...

Some Neat New R Notations

August 22, 2017 | John Mount

The R package seplyr supplies a few neat new coding notations. [Image: an abacus, which gives us the term “calculus.”] The first notation is an operator called the “named map builder”. This is a cute notation that essentially does the job of stats::setNames(). It allows for code such as the ...

Is dplyr Easily Comprehensible?

August 19, 2017 | John Mount

dplyr is one of the most popular R packages. It is powerful and important. But is it in fact easily comprehensible? dplyr makes sense to those of us who use it a lot. And we can teach part time R users a lot of the common good use patterns. But, ...
