Confidentialise Your Data with the randomNames Package

Posted on July 23, 2018 by Daniel Oehm in R bloggers | 0 Comments

[This article was first published on R – Daniel Oehm | Gradient Descending, and kindly contributed to R-bloggers].

Sensitive data has its restrictions for good reason. Personal data such as names and other identifiable information should be protected, and policies are in place to prevent accidental data breaches by governments and businesses. This can be a hurdle for data projects, particularly when socialising your work. A common technique is to strip the individual’s name and replace it with a random number. This is fine and does the job, but the story is much better told when you can refer to a person. Another method is to shuffle the names already in the list, giving each individual a random first and last name that is present in your data. This often leaves you uneasy, because what if, by chance, you randomly assign someone their own name? It could happen to a John Smith.

A better way is to use the randomNames package. It’s simple to use, so this important step can be done without too much thought. Simply call the function below.

# Random names
library(randomNames)  # install.packages("randomNames") if not yet installed
randomNames(10)

##  [1] "Shibles, Suzanna"  "Foehl, Meghan"     "Marino, Jebediah" 
##  [4] "May, Cheyenne"     "Lockhart, Isaiah"  "Vera, Ian"        
##  [7] "al-Othman, Ilhaam" "Sanchez, Garrett"  "Aguilar, Madison" 
## [10] "Johnson, Nico"
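
The worry about a fake name colliding with a real one can be handled with a lookup table. The following is a minimal sketch, not part of the package: `make_pseudonyms` is a hypothetical helper that assumes names are stored in the package's default "Last, First" format. It gives each distinct person one fake name, reuses it wherever that person appears, and redraws any fake name that matches a real one.

```r
library(randomNames)

# Hypothetical helper: map each distinct real name to one unique fake name
make_pseudonyms <- function(real_names, seed = 42) {
  set.seed(seed)  # makes the mapping repeatable (note: resets the global RNG state)
  originals <- unique(real_names)
  fake <- character(0)
  # Keep drawing until there are enough fake names that are unique
  # and do not coincide with any real name in the data
  while (length(fake) < length(originals)) {
    draw <- randomNames(length(originals) - length(fake))
    fake <- unique(c(fake, draw))
    fake <- setdiff(fake, originals)
  }
  lookup <- setNames(fake, originals)
  unname(lookup[real_names])
}

# Made-up example data; "Smith, John" appears twice on purpose
real <- c("Smith, John", "Citizen, Jane", "Smith, John", "Bloggs, Joe")
pseudo <- make_pseudonyms(real)
pseudo
```

Because the same real name always maps to the same fake name, repeated records for one person stay linked after the swap.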

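Other parameters control the gender and ethnicity of the names. Both take integer codes: gender is 0 for male and 1 for female, and ethnicity runs from 1 to 6, where 5 is the package’s code for White (not Hispanic).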

randomNames(10, gender = 1, ethnicity = 5)

##  [1] "Cramer, Kylie"      "Baldocchi, Melissa" "Ray, Alexis"       
##  [4] "Hoffman, Jennifer"  "Ellerbrock, Nikki"  "Sholdt, Kimberly"  
##  [7] "Lewis, Emily"       "Riddle, Laura"      "Davison, Rachel"   
## [10] "Mounts, Kristin"
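
If your real data already carries coded gender and ethnicity, you may want the fake names to match them row by row. The following is a sketch under that assumption: the `people` data frame and its values are made up for illustration, and the vectors are passed straight to the `gender` and `ethnicity` arguments.

```r
library(randomNames)

# Hypothetical records: only the coded attributes are kept, names are replaced
people <- data.frame(
  id        = 1:6,
  gender    = c(0, 1, 1, 0, 1, 0),   # 0 = male, 1 = female
  ethnicity = c(5, 2, 4, 5, 6, 3)    # integer codes 1-6 as used by the package
)

# One fake name per row, drawn to match that row's gender and ethnicity
people$fake_name <- randomNames(
  n         = nrow(people),
  gender    = people$gender,
  ethnicity = people$ethnicity
)
people
```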

Perhaps you only need the first or last name to be confidentialised.

randomNames(10, which.names = "first")

##  [1] "David"     "Jeong Min" "Daniel"    "Jake"      "Nyamekye" 
##  [6] "Lamontee"  "Juana"     "Connor"    "Mariah"    "Deante"

randomNames(10, which.names = "last")

##  [1] "Williams"  "Pham"      "el-Hares"  "Farrell"   "Hayes"    
##  [6] "al-Ullah"  "Cantu"     "Burnett"   "Lightfoot" "Nguyen"
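
The layout of the full name can be changed as well. As a small sketch, the `name.order` and `name.sep` arguments switch from the default "Last, First" to "First Last". The seed value is arbitrary and is only there so the draw can be repeated.

```r
library(randomNames)

set.seed(2018)  # arbitrary seed, only so the draw is repeatable
# "First Last" instead of the default "Last, First"
display_names <- randomNames(5, name.order = "first.last", name.sep = " ")
display_names
```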

You can also return the complete feature set if needed, with gender and ethnicity randomised along with the names.

randomNames(10, return.complete.data = TRUE)

##     gender ethnicity first_name last_name
##  1:      1         2  Kathleena   Bellino
##  2:      0         5    Michael Farabaugh
##  3:      1         2       Erin   Sterett
##  4:      1         3        Eva    Waldon
##  5:      1         5     Fatima     Mills
##  6:      0         1     Joseph Cournoyer
##  7:      1         5      Sasha    Pawlak
##  8:      0         2      Mario      Pham
##  9:      0         2      Grant    Nguyen
## 10:      1         1   Danielle     Felix
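
The complete data can then be reshaped into whatever a report needs. A rough sketch, assuming the column names shown in the output above; `gender_label` and `display_name` are names made up here for illustration.

```r
library(randomNames)

# Convert to a plain data frame so base R column assignment works as usual
fake <- as.data.frame(randomNames(5, return.complete.data = TRUE))

# Decode gender (0 = male, 1 = female) and build a display name
fake$gender_label <- ifelse(fake$gender == 1, "female", "male")
fake$display_name <- paste(fake$first_name, fake$last_name)

fake[, c("display_name", "gender_label", "ethnicity")]
```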

Addresses are also sensitive since they can be identifying, whether of the property, the business or the people who reside there. Here’s a freebie: a simple function that uses the randomNames package to generate fake addresses.

# random address for an Australian household
# n = number of addresses, punit = probability that an address is a unit
library(randomNames)
randomAddress <- function(n, punit = 0.2){

  # Weight numbers 1-30 equally, with decreasing probability for larger numbers
  numbers <- 1:999
  p <- c(rep(log(1/30 + 1), 30), log(1/(31:999) + 1))

  # select number of units
  nunits <- rbinom(1, n, punit)

  # unit numbers ("unit/street") and plain street numbers
  # seq_len() guards the nunits = 0 case, where paste0() would otherwise return "/"
  unit_number <- paste0(sample(numbers, nunits, prob = p, replace = TRUE), "/", sample(numbers, nunits, prob = p, replace = TRUE))[seq_len(nunits)]
  street_number <- as.character(sample(numbers, n - nunits, prob = p, replace = TRUE))

  # combine and shuffle so units are not always listed first
  number <- c(unit_number, street_number)[sample(n)]

  # select street name
  address <- randomNames(n, which.names = "last", ethnicity = 5)

  # street suffix
  suffix <- sample(c("Road", "Street", "Avenue", "Way", "Drive", "Grove", "Place", "Lane", "Crescent", "Close"), n, replace = TRUE)
  address <- paste(address, suffix)

  # select suburb/town name
  city <- toupper(randomNames(n, which.names = "last", ethnicity = 5))

  # state and postcode (sample() rescales the state weights to sum to 1)
  pstate <- c(0.32 + 0.017, 0.258, 0.2, 0.07, 0.104, 0.021, 0.01, 0.017)
  statec <- c("NSW", "Vic", "Qld", "SA", "WA", "Tas", "NT", "ACT")
  first_postcode_number <- c(2, 3, 4, 5, 6, 7, 0, 2)
  id <- sample(1:8, n, replace = TRUE, prob = pstate)
  state <- statec[id]
  # zero-pad so NT postcodes keep their leading 0
  postcode <- sprintf("%04d", as.integer(first_postcode_number[id]*1000 + sample(1:999, n, replace = TRUE)))

  # return data frame
  return(data.frame(number, address, city, state, postcode))
}
randomAddress(10)

##    number         address      city state postcode
## 1  545/24      Burt Close  CHAUSSEE    NT     0821
## 2     552  Goossen Avenue    BISHOP   Qld     4820
## 3       2   Powell Avenue   GREINER   Qld     4200
## 4      87 Dalley Crescent  PALMBERG    WA     6409
## 5     996     Albin Place   NIEMIEC   ACT     2691
## 6  104/46    Rodgers Lane THIBEDEAU   Qld     4629
## 7     107    Thomas Place   DEGEARE   Qld     4686
## 8     323 Mccleary Street  MARCHAND   Vic     3026
## 9      12   George Street     CHASE   NSW     2719
## 10    221     Burgess Way    FOSTER   NSW     2809

Not the most accurate recreation of Australian addresses, but it’s not bad. The state proportions are consistent with the current Australian population figures. This looks better, and it is easier to tell a story with these addresses than labelling every address as “1 Aardvark Ave”… which I have regretfully done before. I intend to build a much better function that offers more flexibility when randomising addresses: one that simulates the Australian population and lets you restrict the addresses to a user-defined set of states or postcodes, uses a more realistic set of unit numbers, etc. But for now this does the job.
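
To see what the sampling weights in the function actually imply, they can be normalised and inspected directly. This is only an exploratory sketch using the same weight vectors as the function. The variable names `share_1_to_30`, `ratio_30_vs_300` and `state_share` are made up here.

```r
# Same street-number weights as in randomAddress(): flat for 1-30, then decaying
numbers <- 1:999
p <- c(rep(log(1/30 + 1), 30), log(1/(31:999) + 1))
prob <- p / sum(p)  # sample() rescales its prob argument in the same way

share_1_to_30   <- sum(prob[1:30])        # chance a number is 30 or below
ratio_30_vs_300 <- prob[30] / prob[300]   # how much likelier 30 is than 300
round(c(share_1_to_30 = share_1_to_30, ratio_30_vs_300 = ratio_30_vs_300), 3)

# State weights from the function, rescaled so they sum to 1
pstate <- c(0.32 + 0.017, 0.258, 0.2, 0.07, 0.104, 0.021, 0.01, 0.017)
statec <- c("NSW", "Vic", "Qld", "SA", "WA", "Tas", "NT", "ACT")
state_share <- setNames(pstate / sum(pstate), statec)
round(state_share, 3)
```

The raw state weights do not sum to exactly 1, which is harmless because `sample()` rescales them, but it is worth knowing when comparing them with published population shares.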

The post Confidentialise Your Data with the randomNames Package appeared first on Daniel Oehm | Gradient Descending.
