All the fake data that’s fit to print

Scott Chamberlain

5 years ago

[This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

charlatan makes fake data.

Excited to annonunce a new package called charlatan. While perusing packages from other programming languages, I saw a neat Python library called faker.

charlatan is inspired from and ports many things from Python's https://github.com/joke2k/faker library. In turn, faker was inspired from PHP's faker, Perl's Faker, and Ruby's faker. It appears that the PHP library was the original – nice work PHP.

Use cases

What could you do with this package? Here's some use cases:

Students in a classroom setting learning any task that needs a dataset.
People doing simulations/modeling that need some fake data
Generate fake dataset of users for a database before actual users exist
Complete missing spots in a dataset
Generate fake data to replace sensitive real data with before public release
Create a random set of colors for visualization
Generate random coordinates for a map
Get a set of randomly generated DOIs (Digial Object Identifiers) to assign to fake scholarly artifacts
Generate fake taxonomic names for a biological dataset
Get a set of fake sequences to use to test code/software that uses sequence data

Features

Language support: A huge selling point of charlatan is language support. Of course for some data types (numbers), languages don't come into play, but for many they do. That means you can create fake datasets specific to a language, or a dataset with a mix of languages, etc. For the variables in this package, we have not yet ported over all languages for those variables that Python's faker has.
Lite weight: We've tried to make this package as lite as possible so that it's just generally easy to install, but also can be used in other packages or workflows while bringing along as little baggage as possible.
Reviewed: it's been reviewed! See reviews by Brooke Anderson and Tristan Mahr, and handling editor Noam Ross
R specific features such as methods to create data.frame's (so the user doesn’t have to do the extra step of putting vectors together)

Status

We have not ported every variable, or every language yet in those variables. We have added some variables to charlatan that are not in faker (e.g., taxonomy, gene sequences). Check out the issues to follow progress.

Package API

ch_generate: generate a data.frame with fake data
fraudster: single interface to all fake data methods
High level interfaces: There are high level functions prefixed with ch_ that wrap low level interfaces, and are meant to be easier to use and provide easy way to make many instances of a thing.
Low level interfaces: All of these are R6 objects that a user can initialize and then call methods on the them.

Other R work in this space:

Vignette

Check out the package vignette to get started.

setup

Install charlatan

install.packages("charlatan")

Or get the development version:

devtools::install_github("ropensci/charlatan")

library(charlatan)

Examples

high level interface

fraudster is an interface for all fake data variables (and locales):

x <- fraudster()
x$job()
#> [1] "Textile designer"
x$name()
#> [1] "Cris Johnston-Tremblay"
x$job()
#> [1] "Database administrator"
x$color_name()
#> [1] "SaddleBrown"

If you want to set locale, do so like fraudster(locale = "{locale}")

locale support

The locales that are supported vary by data variable. We're adding more locales through time, so do check in from time to time – or even better, send a pull request adding support for the locale you want for the variable(s) you want.

As an example, you can set locale for job data to any number of supported locales.

For jobs:

ch_job(locale = "en_US", n = 3)
#> [1] "Charity officer"   "Financial adviser" "Buyer, industrial"
ch_job(locale = "fr_FR", n = 3)
#> [1] "Illustrateur"                 "Guichetier"
#> [3] "Responsable d'ordonnancement"
ch_job(locale = "hr_HR", n = 3)
#> [1] "Pomoćnik strojovođe"
#> [2] "Pećar"
#> [3] "Konzervator – restaurator savjetnik"
ch_job(locale = "uk_UA", n = 3)
#> [1] "Фрілансер"  "Астрофізик" "Доцент"
ch_job(locale = "zh_TW", n = 3)
#> [1] "包裝設計"         "空調冷凍技術人員" "鍋爐操作技術人員"

For colors:

ch_color_name(locale = "en_US", n = 3)
#> [1] "DarkMagenta" "Navy"        "LightGray"
ch_color_name(locale = "uk_UA", n = 3)
#> [1] "Синій ВПС"          "Темно-зелений хакі" "Берлінська лазур"

charlatan will tell you when a locale is not supported

ch_job(locale = "cv_MN")
#> Error: cv_MN not in set of available locales

generate a dataset

ch_generate() helps you create data.frame's with whatever variables you want that charlatan supports. Then you're ready to use the data.frame immediately in whatever your application is.

By default, you get back a certain set of variables. Right now, that is: name, job, and phone_number.

ch_generate()
#> # A tibble: 10 x 3
#>                          name                       job
#>                         <chr>                     <chr>
#>  1                  Coy Davis     Geneticist, molecular
#>  2               Artis Senger                 Press sub
#>  3                 Tal Rogahn              Town planner
#>  4             Nikolas Carter         Barrister's clerk
#>  5            Sharlene Kemmer Insurance account manager
#>  6            Babyboy Volkman           Quality manager
#>  7 Dr. Josephus Marquardt DVM                  Best boy
#>  8                Vernal Dare            Engineer, site
#>  9              Emilia Hessel       Administrator, arts
#> 10              Urijah Beatty     Editor, commissioning
#> # ... with 1 more variables: phone_number <chr>

You can select just the variables you want:

ch_generate('job', 'phone_number', n = 30)
#> # A tibble: 30 x 2
#>                           job         phone_number
#>                         <chr>                <chr>
#>  1        Call centre manager  1-670-715-3079x9104
#>  2 Nurse, learning disability 1-502-781-3386x33524
#>  3           Network engineer       1-692-089-3060
#>  4           Industrial buyer       1-517-855-8517
#>  5     Database administrator  (999)474-9975x89650
#>  6       Operations geologist          06150655769
#>  7             Engineer, land     360-043-3630x592
#>  8     Pension scheme manager        (374)429-6821
#>  9          Personnel officer   1-189-574-3348x338
#> 10         Editor, film/video       1-698-135-1664
#> # ... with 20 more rows

Data types

A sampling of the data types available in charlatan:

person name

ch_name()
#> [1] "Jefferey West-O'Connell"

ch_name(10)
#>  [1] "Dylon Hintz"          "Dr. Billy Willms DDS" "Captain Bednar III"
#>  [4] "Carli Torp"           "Price Strosin III"    "Grady Mayert"
#>  [7] "Nat Herman-Kuvalis"   "Noelle Funk"          "Dr. Jaycie Herzog MD"
#> [10] "Ms. Andrea Zemlak"

phone number

ch_phone_number()
#> [1] "643.993.1958"

ch_phone_number(10)
#>  [1] "+06(6)6080789632"    "05108334280"         "447-126-9775"
#>  [4] "+96(7)2112213020"    "495-425-1506"        "1-210-372-3188x514"
#>  [7] "(300)951-5115"       "680.567.5321"        "1-947-805-4758x8167"
#> [10] "888-998-5511x554"

job

ch_job()
#> [1] "Scientist, water quality"

ch_job(10)
#>  [1] "Engineer, production"
#>  [2] "Architect"
#>  [3] "Exhibitions officer, museum/gallery"
#>  [4] "Patent attorney"
#>  [5] "Surveyor, minerals"
#>  [6] "Electronics engineer"
#>  [7] "Secondary school teacher"
#>  [8] "Intelligence analyst"
#>  [9] "Nutritional therapist"
#> [10] "Information officer"

Messy data

Real data is messy! charlatan makes it easy to create messy data. This is still in the early stages so is not available across most data types and languages, but we're working on it.

For example, create messy names:

ch_name(50, messy = TRUE)
#>  [1] "Mr. Vernell Hoppe Jr."     "Annika Considine d.d.s."
#>  [3] "Dr. Jose Kunde DDS"        "Karol Leuschke-Runte"
#>  [5] "Kayleen Kutch-Hintz"       "Jahir Green"
#>  [7] "Stuart Emmerich"           "Hillard Schaden"
#>  [9] "Mr. Caden Braun"           "Willie Ebert"
#> [11] "Meg Abbott PhD"            "Dr Rahn Huel"
#> [13] "Kristina Crooks d.d.s."    "Lizbeth Hansen"
#> [15] "Mrs. Peyton Kuhn"          "Hayley Bernier"
#> [17] "Dr. Lavon Schimmel d.d.s." "Iridian Murray"
#> [19] "Cary Romaguera"            "Tristan Windler"
#> [21] "Marlana Schroeder md"      "Mr. Treyton Nitzsche"
#> [23] "Hilmer Nitzsche-Glover"    "Marius Dietrich md"
#> [25] "Len Mertz"                 "Mrs Adyson Wunsch DVM"
#> [27] "Dr. Clytie Feest DDS"      "Mr. Wong Lebsack I"
#> [29] "Arland Kessler"            "Mrs Billy O'Connell m.d."
#> [31] "Stephen Gerlach"           "Jolette Lueilwitz"
#> [33] "Mrs Torie Green d.d.s."    "Mona Denesik"
#> [35] "Mitchell Auer"             "Miss. Fae Price m.d."
#> [37] "Todd Lehner"               "Elva Lesch"
#> [39] "Miss. Gustie Rempel DVM"   "Lexie Parisian-Stark"
#> [41] "Beaulah Cremin-Rice"       "Parrish Schinner"
#> [43] "Latrell Beier"             "Garry Wolff Sr"
#> [45] "Bernhard Vandervort"       "Stevie Johnston"
#> [47] "Dawson Gaylord"            "Ivie Labadie"
#> [49] "Ronal Parker"              "Mr Willy O'Conner Sr."

Right now only suffixes and prefixes for names in en_US locale are supported. Notice above some variation in prefixes and suffixes.

TO DO

We have lots ot do still. Some of those things include:

Locales: For existing data variables in the package, we need to fill in locales for which Python's faker has the data, but we need to port it over still.
Data variables: there's more we can port over from Python's faker. In addition, we may find inspiration from faker libraries in other programming languages.
Messy data: we want to make messy data support more available throughout the package. Watch issue #41.
If you have ideas for potential data variables, issue #11 is a good place for those. Or open a new issue, either way.
One reviewer brought up whether data should be within bounds of reality ( see issue #40). The first question for me is should we do this – if the answer is yes or at least sometimes, then we can explore how. It's not yet clear if it's the right thing to do.

To leave a comment for the author, please follow the link and comment on their blog: rOpenSci Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.