Parsing Dates and Times

[This article was first published on Rcpp Gallery, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Motivation

R has excellent for dates and times via the built-in Date and POSIXt classes. Their usage, however, is not always as straightforward as one would want. Certain conversions are more cumbersome than we would like: while as.Date("2015-03-22"), would it not be nice if as.Date("20150322") (a format often used in logfiles) also worked, or for that matter as.Date(20150322L) using an integer variable, or even as.Date("2015-Mar-22") and as.Date("2015Mar22")?

Similarly, many date and time formats suitable for POSIXct (the short form) and POSIXlt (the long form with accessible components) often require rather too much formatting, and/or defaults. Why for example does as.POSIXct(as.numeric(Sys.time()), origin="1970-01-01") require the origin argument on the conversion back (from fractional seconds since the epoch) into datetime—when it is not required when creating the double-precision floating point representation of time since the epoch?

But thanks to Boost and its excellent Boost Date_Time library—which we already mentioned in this post about the BH package— we can address parsing of dates and times. It permitted us to write a new function toPOSIXct() which now part of the RcppBDT package (albeit right now just the GitHub version but we expect this to migrate to CRAN “soon” as well).

Implementation

We will now discuss the outline of this implementation. For full details, see the source file.

Headers and Constants

#include <boost/date_time.hpp>
#include <boost/lexical_cast.hpp>
#include <Rcpp.h>

// [[Rcpp::depends(BH)]]

namespace bt = boost::posix_time;

const std::locale formats[] = {    // this shows a subset only, see the source file for full list
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d %H:%M:%S%f")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%Y/%m/%d %H:%M:%S%f")),

    std::locale(std::locale::classic(), new bt::time_input_facet("%Y-%m-%d")),
    std::locale(std::locale::classic(), new bt::time_input_facet("%b/%d/%Y")),
};
const size_t nformats = sizeof(formats)/sizeof(formats[0]);

Note that we show only two datetime formats along with two date formats. The actual implementation has many more.

Core Converter

The actual conversion from string to a double (the underlying format in POSIXct) is done by the following function. It loops over all given formats, and returns the computed value after the first match. In case of failure, a floating point NA is returned.

double stringToTime(const std::string s) {

    bt::ptime pt, ptbase;

    // loop over formats and try them til one fits
    for (size_t i=0; pt == ptbase && i < nformats; ++i) {
        std::istringstream is(s);
        is.imbue(formats[i]);
        is >> pt;
    }
    
    if (pt == ptbase) {
        return NAN;
    } else { 
        const bt::ptime timet_start(boost::gregorian::date(1970,1,1));
        bt::time_duration diff = pt - timet_start;

        // Define BOOST_DATE_TIME_POSIX_TIME_STD_CONFIG to use nanoseconds
        // (and then use diff.total_nanoseconds()/1.0e9;  instead)
        return diff.total_microseconds()/1.0e6;
    }
}

Convenience Wrappers

We want to be able to convert from numeric as well as string formats. For this, we write a templated (and vectorised) function which invokes the actual conversion function for each argument. It also deals (somewhat heuristically) with two corner cases: we want 20150322 be converted from either integer or numeric, but need in the latter case distinguish this value and its rangue from the (much larger) value for seconds since the epoch. That creates a minir ambiguity: we will not be able to convert back for inputs from seconds since the epoch for the first few years since January 1, 1970. But as these are rare in the timestamp form we can accept the trade-off.

template <int RTYPE>
Rcpp::DatetimeVector toPOSIXct_impl(const Rcpp::Vector<RTYPE>& sv) {

    int n = sv.size();
    Rcpp::DatetimeVector pv(n);
    
    for (int i=0; i<n; i++) {
        std::string s = boost::lexical_cast<std::string>(sv[i]);
        //Rcpp::Rcout << sv[i] << " -- " << s << std::endl;

        // Boost Date_Time gets the 'YYYYMMDD' format wrong, even
        // when given as an explicit argument. So we need to test here.
        // While we are at it, may as well test for obviously wrong data.
        int l = s.size();
        if ((l < 8) ||          // impossibly short
            (l == 9)) {         // 8 or 10 works, 9 cannot
            Rcpp::stop("Inadmissable input: %s", s);
        } else if (l == 8) {    // turn YYYYMMDD into YYYY/MM/DD
            s = s.substr(0, 4) + "/" + s.substr(4, 2) + "/" + s.substr(6,2);
        }
        pv[i] = stringToTime(s);
    }
    return pv;
}

User-facing Function

Finally, we can look at the user-facing function. It accepts input in either integer, numeric or character vector form, and then dispatches accordingly to the templated internal function we just discussed. Other inputs are unsuitable and trigger an error.

// [[Rcpp::export]]
Rcpp::DatetimeVector toPOSIXct(SEXP x) {
    if (Rcpp::is<Rcpp::CharacterVector>(x)) {
        return toPOSIXct_impl<STRSXP>(x);
    } else if (Rcpp::is<Rcpp::IntegerVector>(x)) {
        return toPOSIXct_impl<INTSXP>(x); 
    } else if (Rcpp::is<Rcpp::NumericVector>(x)) {
        // here we have two cases: either we are an int like
        // 200150315 'mistakenly' cast to numeric, or we actually
        // are a proper large numeric (ie as.numeric(Sys.time())
        Rcpp::NumericVector v(x);
        if (v[0] < 21990101) {  // somewhat arbitrary cuttoff
            // actual integer date notation: convert to string and parse
            return toPOSIXct_impl<REALSXP>(x);
        } else {
            // we think it is a numeric time, so treat it as one
            return Rcpp::DatetimeVector(x);
        }
    } else {
        Rcpp::stop("Unsupported Type");
        return R_NilValue;//not reached
    }
}

Illustration

A simply illustration follows. A fuller demonstration is part of the RcppBDT package. This already shows support for subsecond granularity and a variety of date formats.

## parsing character
s <- c("2004-03-21 12:45:33.123456",    # ISO
       "2004/03/21 12:45:33.123456",    # variant
       "20040321",                      # just dates work fine as well
       "Mar/21/2004",                   # US format, also support month abbreviation or full
       "rapunzel")                      # will produce a NA

p <- toPOSIXct(s)

options("digits.secs"=6)                # make sure we see microseconds in output
print(format(p, tz="UTC"))              # format UTC times as UTC (helps for Date types too)

[1] "2004-03-21 12:45:33.123456" "2004-03-21 12:45:33.123456"
[3] "2004-03-21 00:00:00.000000" "2004-03-21 00:00:00.000000"
[5] NA                          

We can also illustrate integer and numeric inputs:

## parsing integer types
s <- c(20150315L, 20010101L, 20141231L)
p <- toPOSIXct(s)
print(format(p, tz="UTC"))

[1] "2015-03-15" "2001-01-01" "2014-12-31"

## parsing numeric types
s <- c(20150315, 20010101, 20141231)
print(format(p, tz="UTC"))

[1] "2015-03-15" "2001-01-01" "2014-12-31"

Note that we always forced display using UTC rather local time, the R default.

To leave a comment for the author, please follow the link and comment on their blog: Rcpp Gallery.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)