Site icon R-bloggers

RcppSimdJson 0.0.4: Even Faster Upstream!

[This article was first published on Thinking inside the box , and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

A new (upstream) simdjson release was announced by Daniel Lemire earlier this week, and my Twitter mentions have been running red-hot ever since as he was kind enough to tag me. Do look at that blog post, there is some impressive work in there. We wrapped up the (still very simple) rcppsimdjson around it last night and shipped it this morning.

RcppSimdJson wraps the fantastic and genuinely impressive simdjson library by Daniel Lemire. Via some very clever algorithmic engineering to obtain largely branch-free code, coupled with modern C++ and newer compiler instructions, it results in parsing gigabytes of JSON parsed per second which is quite mindboggling. For illustration, I highly recommend the video of the recent talk by Daniel Lemire at QCon (which was also voted best talk). The best-case performance is ‘faster than CPU speed’ as use of parallel SIMD instructions and careful branch avoidance can lead to less than one cpu cycle use per byte parsed.

This release brings upstream 0.3 (and 0.3.1) plus a minor tweak (also shipped back upstream). Our full NEWS entry follows.

Changes in version 0.0.4 (2020-04-03)

  • Upgraded to new upstream releases 0.3 and 0.3.1 (Dirk in #9 closing #8)

  • Updated example validateJSON to API changes.

But because Daniel is such a fantastic upstream developer to collaborate with, he even filed a full feature-request ‘maybe you can consider upgrading’ as issue #8 at our repo containing the fully detailed list of changes. As it is so impressive I will simple quote the upper half of just the major changes:

Highlights

  • Multi-Document Parsing: Read a bundle of JSON documents (ndjson) 2-4x faster than doing it individually. API docs / Design Details
  • Simplified API: The API has been completely revamped for ease of use, including a new JSON navigation API and fluent support for error code and exception styles of error handling with a single API. Docs
  • Exact Float Parsing: Now simdjson parses floats flawlessly without any performance loss (https://github.com/simdjson/simdjson/pull/558). Blog Post
  • Even Faster: The fastest parser got faster! With a shiny new UTF-8 validator and meticulously refactored SIMD core, simdjson 0.3 is 15% faster than before, running at 2.5 GB/s (where 0.2 ran at 2.2 GB/s).

For questions, suggestions, or issues please use the issue tracker at the GitHub repo.

Courtesy of CRANberries, there is also a diffstat report for this release.

If you like this or other open-source work I do, you can now sponsor me at GitHub. For the first year, GitHub will match your contributions.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

To leave a comment for the author, please follow the link and comment on their blog: Thinking inside the box .

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.