Delimited file where delimiter clashes with data values
A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contains a comma, then that cell is typically wrapped in quotes. However, what if a data cell contains both a comma and a quotation mark? To avoid such scenarios, it is usually wise to use a delimiter that has a low chance of showing up in your data, such as the pipe ("|") or caret ("^") character. However, there are cases where the data is a long string with all sorts of characters, including the pipe and caret. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII control characters such as the unit/field separator (hex: 1F) could help, as they probably won't be in your data (there is no keyboard key that corresponds to it!).
Currently, my rule of thumb is to use the pipe as my default delimiter. If the data contains complicated strings, then I'll fall back to the field separator character. In Python, one could refer to the field separator as '\x1f'. In R, one could refer to it as '\x1F'. In SAS, it could be specified as '1F'x. In bash, the character could be specified on the command line (e.g., when using the cut command, the csvlook command, etc.) by passing $'\x1f' as the delimiter character.
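As a rough illustration of the idea in Python, the sketch below joins and splits a row on the field separator. The helper names `encode_row` and `decode_row` are made up for this example, and it assumes one record per line with no quoting layer at all, so a cell that already contains the separator is rejected rather than escaped.

```python
FIELD_SEP = "\x1f"  # ASCII unit/field separator (hex 1F)


def encode_row(cells):
    """Join cells into one line using the field separator (hypothetical helper)."""
    for cell in cells:
        # No quoting or escaping is applied, so the separator must not occur in the data.
        if FIELD_SEP in cell:
            raise ValueError("cell contains the field separator character")
    return FIELD_SEP.join(cells)


def decode_row(line):
    """Split one line back into its cells (hypothetical helper)."""
    return line.split(FIELD_SEP)


# Commas, pipes, carets and quotation marks are all ordinary data here.
row = ["42", 'He said "a|b^c, d"', "x,y"]
line = encode_row(row)
print(decode_row(line) == row)
```

Because nothing is quoted, the file stays trivial to split in other tools, at the cost of failing if the separator ever does appear in a cell.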
If a data cell contains the newline character (\n), then the record separator character (hex: 1E) could be used to mark the end of each record instead of a newline.
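A minimal Python sketch of that combination, with the record separator between records and the field separator between cells, might look like the following. The function names `dumps` and `loads` are invented for illustration, and the sketch assumes neither control character occurs in the data.

```python
FIELD_SEP = "\x1f"   # ASCII unit/field separator (hex 1F)
RECORD_SEP = "\x1e"  # ASCII record separator (hex 1E)


def dumps(rows):
    """Serialize rows of string cells; assumes no cell contains either separator."""
    return RECORD_SEP.join(FIELD_SEP.join(cells) for cells in rows)


def loads(text):
    """Parse text produced by dumps back into a list of rows."""
    if not text:
        return []
    return [record.split(FIELD_SEP) for record in text.split(RECORD_SEP)]


# The second cell of the first row spans two lines, which would break
# a reader that treats every newline as the end of a record.
rows = [
    ["1", "first line\nsecond line"],
    ["2", "plain, with a comma"],
]
print(loads(dumps(rows)) == rows)
```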