Sensitive data has its restrictions for good reason. Personal data such as names and other identifiable information should be protected, and policies are in place to prevent accidental data breaches by governments and businesses. This can be a hurdle for data projects, particularly when socialising your work. A common technique is to strip each individual's name and replace it with a random number. This does the job, but the story is much better told when you can refer to a person. Another method is to shuffle the names in the list, giving each individual a random first and last name that is already present in your data. This often leaves you uneasy: what if, by chance, you assign someone their own name, or the name of another real person in the data? It could happen to a John Smith.
A better way is to use the randomNames package. It is simple to use, so an important step can be done without too much thought. Simply call the function below.
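To make the two naive approaches above concrete, here is a minimal base R sketch. The `people` data frame, its column names and the ID range are made up for illustration; the last few lines check how many shuffled rows accidentally reproduce a real person's full name, which is the risk described above.

```r
# Hypothetical example data; the column names are assumptions for illustration
people <- data.frame(
  first = c("John", "Mary", "Wei", "Aisha"),
  last  = c("Smith", "Jones", "Chen", "Khan"),
  stringsAsFactors = FALSE
)

set.seed(1)

# Approach 1: replace each name with a random ID (sampled without replacement)
people$id <- sample(1000:9999, nrow(people))

# Approach 2: shuffle first and last names independently
shuffled <- data.frame(
  first = sample(people$first),
  last  = sample(people$last),
  stringsAsFactors = FALSE
)

# Risk check: which shuffled rows reproduce a real person's full name?
real_names     <- paste(people$first, people$last)
shuffled_names <- paste(shuffled$first, shuffled$last)
collisions     <- shuffled_names %in% real_names
sum(collisions)
```

Any non-zero count means at least one "anonymised" row still carries a real full name from the data.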
library(randomNames)
# Random names
randomNames(10)
##  [1] "Shibles, Suzanna"  "Foehl, Meghan"     "Marino, Jebediah"
##  [4] "May, Cheyenne"     "Lockhart, Isaiah"  "Vera, Ian"
##  [7] "al-Othman, Ilhaam" "Sanchez, Garrett"  "Aguilar, Madison"
## [10] "Johnson, Nico"
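In practice you usually want the same person to receive the same fake name everywhere they appear. The sketch below shows one way to do that with a lookup table. `pseudonymise()`, its `generator` and `max_tries` arguments, and the `customers` data are all hypothetical names introduced here, not part of the randomNames package. The helper also redraws if a generated name is duplicated or matches a real name, since random sampling alone does not rule that out.

```r
# Hypothetical helper: map each distinct real name to one fake name.
# `generator` is any function that takes n and returns n character strings;
# by default it is randomNames::randomNames (only needed if no generator is supplied).
pseudonymise <- function(x, generator = randomNames::randomNames, max_tries = 20) {
  x    <- as.character(x)
  real <- unique(x)
  for (i in seq_len(max_tries)) {
    fake <- generator(length(real))
    # accept the draw only if fake names are distinct and none matches a real name
    if (!anyDuplicated(fake) && !any(fake %in% real)) {
      lookup <- setNames(fake, real)
      return(unname(lookup[x]))
    }
  }
  stop("Could not draw a collision-free set of names; try increasing max_tries")
}

# Made-up example data
customers <- data.frame(
  name  = c("Smith, John", "Nguyen, Anh", "Smith, John"),
  spend = c(120, 80, 45),
  stringsAsFactors = FALSE
)

# Only run the real generator when the package is installed
if (requireNamespace("randomNames", quietly = TRUE)) {
  set.seed(42)
  customers$name <- pseudonymise(customers$name)
  print(customers)
}
```

Rows 1 and 3 refer to the same customer, so they end up with the same fake name. Keep the lookup private (or discard it) if the mapping must not be reversible.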
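The gender and ethnicity arguments control the sex and ethnicity of the generated names. In the example below, gender = 1 requests female names and ethnicity = 5 selects one of the package's ethnicity groups (see ?randomNames for the full coding).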
randomNames(10, gender = 1, ethnicity = 5)
##  [1] "Cramer, Kylie"      "Baldocchi, Melissa" "Ray, Alexis"
##  [4] "Hoffman, Jennifer"  "Ellerbrock, Nikki"  "Sholdt, Kimberly"
##  [7] "Lewis, Emily"       "Riddle, Laura"      "Davison, Rachel"
## [10] "Mounts, Kristin"
Perhaps you only need the first or last name to be confidentialised.
randomNames(10, which.names = "first")
##  [1] "David"     "Jeong Min" "Daniel"    "Jake"      "Nyamekye"
##  [6] "Lamontee"  "Juana"     "Connor"    "Mariah"    "Deante"
randomNames(10, which.names = "last")
##  [1] "Williams"  "Pham"      "el-Hares"  "Farrell"   "Hayes"
##  [6] "al-Ullah"  "Cantu"     "Burnett"   "Lightfoot" "Nguyen"
You can also return the complete data set, which includes the randomly sampled gender and ethnicity alongside each name.
randomNames(10, return.complete.data = TRUE)
##     gender ethnicity first_name last_name
##  1:      1         2  Kathleena   Bellino
##  2:      0         5    Michael Farabaugh
##  3:      1         2       Erin   Sterett
##  4:      1         3        Eva    Waldon
##  5:      1         5     Fatima     Mills
##  6:      0         1     Joseph Cournoyer
##  7:      1         5      Sasha    Pawlak
##  8:      0         2      Mario      Pham
##  9:      0         2      Grant    Nguyen
## 10:      1         1   Danielle     Felix
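The gender and ethnicity columns come back as numeric codes, which are not very readable in a report. The sketch below decodes them and builds a full name, using a few rows typed in by hand from the output above so that it runs without the package. The code-to-label mapping is my reading of the package documentation and should be treated as an assumption; check ?randomNames before relying on it.

```r
# Code-to-label lookups; ASSUMED from the package documentation, verify with ?randomNames
gender_labels <- c("0" = "Male", "1" = "Female")
ethnicity_labels <- c(
  "1" = "American Indian or Native Alaskan",
  "2" = "Asian or Pacific Islander",
  "3" = "Black (not Hispanic)",
  "4" = "Hispanic",
  "5" = "White (not Hispanic)",
  "6" = "Middle-Eastern, Arabic"
)

# A few rows copied by hand from the output above
fake_people <- data.frame(
  gender     = c(1, 0, 1),
  ethnicity  = c(2, 5, 3),
  first_name = c("Kathleena", "Michael", "Eva"),
  last_name  = c("Bellino", "Farabaugh", "Waldon"),
  stringsAsFactors = FALSE
)

# Look up labels by the character form of each code
fake_people$gender_label    <- unname(gender_labels[as.character(fake_people$gender)])
fake_people$ethnicity_label <- unname(ethnicity_labels[as.character(fake_people$ethnicity)])

# Combine first and last name into a single display name
fake_people$full_name <- paste(fake_people$first_name, fake_people$last_name)
fake_people
```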
Addresses are also sensitive since they can identify a property, a business or the people who reside there. Here’s a freebie: a simple function that uses the randomNames package to generate fake addresses.
# Random address for an Australian household
randomAddress <- function(n, punit = 0.2){

  # Weight numbers 1-30 as equally likely and numbers beyond 30 with decreasing probability
  numbers <- 1:999
  p <- c(rep(log(1/30 + 1), 30), log(1/(31:999) + 1))

  # select number of units
  nunits <- rbinom(1, n, punit)

  # unit numbers ("unit/street"); left empty when no units are drawn
  unit_number <- character(0)
  if(nunits > 0){
    unit_number <- paste0(sample(numbers, nunits, prob = p, replace = TRUE), "/",
                          sample(numbers, nunits, prob = p, replace = TRUE))
  }

  # plain street numbers for the remaining addresses (empty when every address is a unit)
  street_number <- as.character(sample(numbers, n - nunits, prob = p, replace = TRUE))

  # combine and shuffle so units are not always listed first
  number <- c(unit_number, street_number)[sample.int(n)]

  # select address name
  address <- randomNames(n, which.names = "last", ethnicity = 5)

  # street suffix
  suffix <- sample(c("Road", "Street", "Avenue", "Way", "Drive", "Grove", "Place", "Lane", "Crescent", "Close"),
                   n, replace = TRUE)
  address <- paste(address, suffix)

  # select suburb/town name
  city <- toupper(randomNames(n, which.names = "last", ethnicity = 5))

  # state and postcode
  pstate <- c(0.32 + 0.017, 0.258, 0.2, 0.07, 0.104, 0.021, 0.01, 0.017)
  statec <- c("NSW", "Vic", "Qld", "SA", "WA", "Tas", "NT", "ACT")
  first_postcode_number <- c(2, 3, 4, 5, 6, 7, 0, 2)
  id <- sample(1:8, n, replace = TRUE, prob = pstate)
  state <- statec[id]
  postcode <- first_postcode_number[id]*1000 + sample(1:999, n, replace = TRUE)

  # return data frame
  return(data.frame(number, address, city, state, postcode))
}

randomAddress(10)
##    number         address      city state postcode
## 1  545/24      Burt Close  CHAUSSEE    NT      821
## 2     552  Goossen Avenue    BISHOP   Qld     4820
## 3       2   Powell Avenue   GREINER   Qld     4200
## 4      87 Dalley Crescent  PALMBERG    WA     6409
## 5     996     Albin Place   NIEMIEC   ACT     2691
## 6  104/46    Rodgers Lane THIBEDEAU   Qld     4629
## 7     107    Thomas Place   DEGEARE   Qld     4686
## 8     323 Mccleary Street  MARCHAND   Vic     3026
## 9      12   George Street     CHASE   NSW     2719
## 10    221     Burgess Way    FOSTER   NSW     2809
It is not the most accurate recreation of Australian addresses, but it’s not bad. The state proportions are broadly consistent with current Australian population figures. These addresses look better and make it easier to tell a story than labelling every address as “1 Aardvark Ave”… which I have regretfully done before. I intend to build a much better function that offers more flexibility when randomising addresses: one that simulates the Australian population and lets you restrict the addresses to a user-defined set of states or postcodes, generates a more realistic set of unit numbers, and so on. But for now this does the job.
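As a rough idea of what restricting addresses to a user-defined set of states could look like, here is a small base R sketch. `randomStatePostcode()` and its `states` argument are hypothetical names introduced here, not the planned function. The state weights and leading postcode digits are copied from randomAddress() above, and the postcode is returned as a zero-padded string so that Northern Territory postcodes keep their leading zero.

```r
# Hypothetical helper, NOT the author's planned function: sample states and
# postcodes, optionally restricted to a user-defined set of states.
randomStatePostcode <- function(n, states = NULL) {
  # State weights and leading postcode digits copied from randomAddress()
  lookup <- data.frame(
    state  = c("NSW", "Vic", "Qld", "SA", "WA", "Tas", "NT", "ACT"),
    weight = c(0.32 + 0.017, 0.258, 0.2, 0.07, 0.104, 0.021, 0.01, 0.017),
    digit  = c(2, 3, 4, 5, 6, 7, 0, 2),
    stringsAsFactors = FALSE
  )

  if (!is.null(states)) {
    unknown <- setdiff(states, lookup$state)
    if (length(unknown) > 0) {
      stop("Unknown state(s): ", paste(unknown, collapse = ", "))
    }
    lookup <- lookup[lookup$state %in% states, ]
  }

  # sample.int() avoids sample()'s special case when only one state remains;
  # the weights do not need to sum to 1 because they are normalised internally
  id <- sample.int(nrow(lookup), n, replace = TRUE, prob = lookup$weight)
  postcode <- lookup$digit[id] * 1000 + sample(1:999, n, replace = TRUE)

  data.frame(
    state    = lookup$state[id],
    # zero-pad to four digits so NT postcodes keep their leading zero
    postcode = sprintf("%04d", as.integer(postcode)),
    stringsAsFactors = FALSE
  )
}

set.seed(2024)
randomStatePostcode(5, states = c("NT", "Tas"))
```

Storing the postcode as character is a design choice: it keeps the leading zero, at the cost of no longer being able to do arithmetic on the column directly.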
The post Confidentialise Your Data with the randomNames Package appeared first on Daniel Oehm | Gradient Descending.