
All the fake data that’s fit to print

[This article was first published on rOpenSci Blog, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

charlatan makes fake data.

I'm excited to announce a new package called charlatan. While perusing packages from other programming languages, I saw a neat Python library called faker.

charlatan is inspired by, and ports many things from, Python's https://github.com/joke2k/faker library. In turn, faker was inspired by PHP's faker, Perl's Faker, and Ruby's faker. It appears that the PHP library was the original, so nice work PHP.

Use cases

What could you do with this package? Here are some use cases:

Features

Status

We have not yet ported every variable from faker, nor every locale for the variables we have ported. We have also added some variables to charlatan that are not in faker (e.g., taxonomy, gene sequences). Check out the issues to follow progress.

Package API

Other R work in this space:

Vignette

Check out the package vignette to get started.

setup

Install charlatan

install.packages("charlatan")

Or get the development version:

devtools::install_github("ropensci/charlatan")

library(charlatan)

Examples

high level interface

fraudster is an interface for all fake data variables (and locales):

x <- fraudster()
x$job()
#> [1] "Textile designer"
x$name()
#> [1] "Cris Johnston-Tremblay"
x$job()
#> [1] "Database administrator"
x$color_name()
#> [1] "SaddleBrown"

If you want to set locale, do so like fraudster(locale = "{locale}")
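As a small illustration of the locale argument, the sketch below creates a generator for the fr_FR locale (one of the job locales shown later in this post) and asks it for a job title. The variable name fr is just illustrative, and the output is random, so none is shown.

```r
library(charlatan)

# Create a generator bound to the French (France) locale.
fr <- fraudster(locale = "fr_FR")

# Job titles are now drawn from the fr_FR job list.
fr$job()
```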

locale support

The locales that are supported vary by data variable. We're adding more locales over time, so do check back occasionally, or even better, send a pull request adding support for the locale you want for the variable(s) you want.

As an example, you can set the locale for job data to any of the supported locales.

For jobs:

ch_job(locale = "en_US", n = 3)
#> [1] "Charity officer"   "Financial adviser" "Buyer, industrial"
ch_job(locale = "fr_FR", n = 3)
#> [1] "Illustrateur"                 "Guichetier"
#> [3] "Responsable d'ordonnancement"
ch_job(locale = "hr_HR", n = 3)
#> [1] "Pomoćnik strojovođe"
#> [2] "Pećar"
#> [3] "Konzervator – restaurator savjetnik"
ch_job(locale = "uk_UA", n = 3)
#> [1] "Фрілансер"  "Астрофізик" "Доцент"
ch_job(locale = "zh_TW", n = 3)
#> [1] "包裝設計"         "空調冷凍技術人員" "鍋爐操作技術人員"

For colors:

ch_color_name(locale = "en_US", n = 3)
#> [1] "DarkMagenta" "Navy"        "LightGray"
ch_color_name(locale = "uk_UA", n = 3)
#> [1] "Синій ВПС"          "Темно-зелений хакі" "Берлінська лазур"

charlatan will tell you when a locale is not supported:

ch_job(locale = "cv_MN")
#> Error: cv_MN not in set of available locales
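Because an unsupported locale raises an ordinary R error, you can catch it and fall back to a locale you know is available. The following is a minimal sketch: job_with_fallback() is a hypothetical helper written for this example, not a function in charlatan, and the choice of en_US as the fallback is an assumption.

```r
library(charlatan)

# Hypothetical helper (not part of charlatan): try the requested locale,
# and fall back to a default locale if charlatan signals an error.
job_with_fallback <- function(locale, n = 1, fallback = "en_US") {
  tryCatch(
    ch_job(n = n, locale = locale),
    error = function(e) {
      message("Falling back to ", fallback, ": ", conditionMessage(e))
      ch_job(n = n, locale = fallback)
    }
  )
}

# "cv_MN" is not supported, so this returns three en_US job titles.
job_with_fallback("cv_MN", n = 3)
```

Note that this catches any error raised by ch_job(), not only the unsupported-locale one, which is acceptable for a quick script but may be too broad for production code.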

generate a dataset

ch_generate() helps you create data frames with whichever of the variables charlatan supports. You can then use the data frame immediately in your application.

By default, you get back a certain set of variables. Right now, that is: name, job, and phone_number.

ch_generate()
#> # A tibble: 10 x 3
#>                          name                       job
#>                         <chr>                     <chr>
#>  1                  Coy Davis     Geneticist, molecular
#>  2               Artis Senger                 Press sub
#>  3                 Tal Rogahn              Town planner
#>  4             Nikolas Carter         Barrister's clerk
#>  5            Sharlene Kemmer Insurance account manager
#>  6            Babyboy Volkman           Quality manager
#>  7 Dr. Josephus Marquardt DVM                  Best boy
#>  8                Vernal Dare            Engineer, site
#>  9              Emilia Hessel       Administrator, arts
#> 10              Urijah Beatty     Editor, commissioning
#> # ... with 1 more variables: phone_number <chr>

You can select just the variables you want:

ch_generate('job', 'phone_number', n = 30)
#> # A tibble: 30 x 2
#>                           job         phone_number
#>                         <chr>                <chr>
#>  1        Call centre manager  1-670-715-3079x9104
#>  2 Nurse, learning disability 1-502-781-3386x33524
#>  3           Network engineer       1-692-089-3060
#>  4           Industrial buyer       1-517-855-8517
#>  5     Database administrator  (999)474-9975x89650
#>  6       Operations geologist          06150655769
#>  7             Engineer, land     360-043-3630x592
#>  8     Pension scheme manager        (374)429-6821
#>  9          Personnel officer   1-189-574-3348x338
#> 10         Editor, film/video       1-698-135-1664
#> # ... with 20 more rows
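As a sketch of using the result right away, the example below generates a small table and writes it to a temporary CSV file, as you might do when building a test fixture. The object names (people, path, back) are illustrative, and only base R functions are used alongside ch_generate().

```r
library(charlatan)

# Generate five fake people with just the two columns we need.
people <- ch_generate("name", "job", n = 5)

# Inspect the structure; both columns are character vectors.
str(people)

# Write the fake data to a temporary CSV file, e.g. for a test fixture.
path <- tempfile(fileext = ".csv")
write.csv(people, path, row.names = FALSE)

# Read it back to confirm the round trip.
back <- read.csv(path, stringsAsFactors = FALSE)
nrow(back)
```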

Data types

A sampling of the data types available in charlatan:

person name

ch_name()
#> [1] "Jefferey West-O'Connell"

ch_name(10)
#>  [1] "Dylon Hintz"          "Dr. Billy Willms DDS" "Captain Bednar III"
#>  [4] "Carli Torp"           "Price Strosin III"    "Grady Mayert"
#>  [7] "Nat Herman-Kuvalis"   "Noelle Funk"          "Dr. Jaycie Herzog MD"
#> [10] "Ms. Andrea Zemlak"

phone number

ch_phone_number()
#> [1] "643.993.1958"

ch_phone_number(10)
#>  [1] "+06(6)6080789632"    "05108334280"         "447-126-9775"
#>  [4] "+96(7)2112213020"    "495-425-1506"        "1-210-372-3188x514"
#>  [7] "(300)951-5115"       "680.567.5321"        "1-947-805-4758x8167"
#> [10] "888-998-5511x554"

job

ch_job()
#> [1] "Scientist, water quality"

ch_job(10)
#>  [1] "Engineer, production"
#>  [2] "Architect"
#>  [3] "Exhibitions officer, museum/gallery"
#>  [4] "Patent attorney"
#>  [5] "Surveyor, minerals"
#>  [6] "Electronics engineer"
#>  [7] "Secondary school teacher"
#>  [8] "Intelligence analyst"
#>  [9] "Nutritional therapist"
#> [10] "Information officer"

Messy data

Real data is messy! charlatan makes it easy to create messy data. This feature is still in the early stages, so it is not yet available for most data types and locales, but we're working on it.

For example, create messy names:

ch_name(50, messy = TRUE)
#>  [1] "Mr. Vernell Hoppe Jr."     "Annika Considine d.d.s."
#>  [3] "Dr. Jose Kunde DDS"        "Karol Leuschke-Runte"
#>  [5] "Kayleen Kutch-Hintz"       "Jahir Green"
#>  [7] "Stuart Emmerich"           "Hillard Schaden"
#>  [9] "Mr. Caden Braun"           "Willie Ebert"
#> [11] "Meg Abbott PhD"            "Dr Rahn Huel"
#> [13] "Kristina Crooks d.d.s."    "Lizbeth Hansen"
#> [15] "Mrs. Peyton Kuhn"          "Hayley Bernier"
#> [17] "Dr. Lavon Schimmel d.d.s." "Iridian Murray"
#> [19] "Cary Romaguera"            "Tristan Windler"
#> [21] "Marlana Schroeder md"      "Mr. Treyton Nitzsche"
#> [23] "Hilmer Nitzsche-Glover"    "Marius Dietrich md"
#> [25] "Len Mertz"                 "Mrs Adyson Wunsch DVM"
#> [27] "Dr. Clytie Feest DDS"      "Mr. Wong Lebsack I"
#> [29] "Arland Kessler"            "Mrs Billy O'Connell m.d."
#> [31] "Stephen Gerlach"           "Jolette Lueilwitz"
#> [33] "Mrs Torie Green d.d.s."    "Mona Denesik"
#> [35] "Mitchell Auer"             "Miss. Fae Price m.d."
#> [37] "Todd Lehner"               "Elva Lesch"
#> [39] "Miss. Gustie Rempel DVM"   "Lexie Parisian-Stark"
#> [41] "Beaulah Cremin-Rice"       "Parrish Schinner"
#> [43] "Latrell Beier"             "Garry Wolff Sr"
#> [45] "Bernhard Vandervort"       "Stevie Johnston"
#> [47] "Dawson Gaylord"            "Ivie Labadie"
#> [49] "Ronal Parker"              "Mr Willy O'Conner Sr."

Right now, messy output is only supported for name prefixes and suffixes in the en_US locale. Notice the variation in prefixes and suffixes above (e.g., "d.d.s." vs. "DDS", "Mr." vs. "Mr").
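One use for messy names is exercising your own data cleaning code. The sketch below is only an illustration: strip_title() is a hypothetical helper written for this example (not part of charlatan), and its list of titles is an assumption based on the prefixes visible in the output above.

```r
library(charlatan)

# Generate some messy names to clean.
messy <- ch_name(20, messy = TRUE)

# Hypothetical cleaner: drop a leading title such as "Mr", "Mrs.",
# "Miss." or "Dr", with or without a trailing period.
strip_title <- function(x) {
  sub("^(Mr|Mrs|Ms|Miss|Dr)\\.?[[:space:]]+", "", x)
}

# Compare the raw and cleaned values side by side.
data.frame(messy = messy, cleaned = strip_title(messy))
```

Suffixes such as "DDS" or "m.d." are left untouched here; handling those would need a second pattern.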

TO DO

We still have lots to do. Some of those things include:
