Removing Records by Duplicate Values
[This article was first published on Yet Another Blog in Statistical Computing » S+/R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Removing records from a data table based on duplicate values in one or more columns is a commonly used but important data cleaning technique. Below shows an example about how to accomplish this task by SAS, R, and Python respectively.
SAS Example
data _data_; input label $ value; datalines; A 4 B 3 C 6 B 3 B 1 A 2 A 4 A 4 ; run; proc sort data = _last_; by label value; run; data _data_; set _last_; by label; if first.label then output; run; proc print data = _last_ noobs; run; /* OUTPUT: label value A 2 B 1 C 6 */
R Example
> # INPUT DATA INT THE CONSOLE > df <- read.table(header = T, text = ' + label value + A 4 + B 3 + C 6 + B 3 + B 1 + A 2 + A 4 + A 4 + ') > # SORT DATA FRAME BY COLUMNS > df2 <- df[order(df$label, df$value), ] > print(df2) label value 6 A 2 1 A 4 7 A 4 8 A 4 5 B 1 2 B 3 4 B 3 3 C 6 > # DEDUP RECORDS > df3 <- df2[!duplicated(df2$label), ] > print(df3) label value 6 A 2 5 B 1 3 C 6
Python Example
In [1]: import pandas as pd In [2]: # INPUT DATA INTO DATAFRAME In [3]: df = pd.DataFrame({'label': ['A', 'B', 'C'] + ['B'] * 2 + ['A'] * 3, 'value': [4, 3, 6, 3, 1, 2, 4, 4]}) In [4]: # SORT DATA BY COLUMNS In [5]: df2 = df.sort(['label', 'value']) In [6]: print(df2) label value 5 A 2 0 A 4 6 A 4 7 A 4 4 B 1 1 B 3 3 B 3 2 C 6 In [7]: # DEDUP RECORDS In [8]: df3 = df2.drop_duplicates(['label']) In [9]: print(df3) label value 5 A 2 4 B 1 2 C 6
To leave a comment for the author, please follow the link and comment on their blog: Yet Another Blog in Statistical Computing » S+/R.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.