
The race for speed at the data layer

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

The competition among database vendors to create the fastest, most powerful "data layer" — the hardware and software that provides storage for Big Data along with high-performance data processing — is clearly heating up. The Netezza appliance has been so successful that IBM has been racing to keep up with demand. SAP is also seeing success with its HANA in-memory database. Meanwhile, Oracle has been scrambling to appease angry customers by releasing new products to compete with SAP.

All of these database vendors are putting the focus on "high-performance analytics": the capability to quickly extract data tables, perform aggregations (GROUP BY averages and other statistical summaries), and refine complex data for structured analysis. There are usually some black-box machine learning algorithms (e.g. clustering tools) available as well. Basically, "analytics" at the data layer means pretty much everything you can do in a SQL query, plus access to a few stored procedures.
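To make that concrete, here is a minimal sketch of what that kind of in-database analytics looks like when driven from R: an aggregation pushed down to the data layer as SQL via the DBI package, so only the summarized result travels back to the R session. (The connection details, table, and column names are hypothetical.)

    library(DBI)

    # Hypothetical connection; any DBI-compliant driver (RPostgres, odbc, etc.)
    # would work the same way
    con <- dbConnect(RPostgres::Postgres(), dbname = "warehouse")

    # The GROUP BY aggregation runs inside the database; only the
    # per-region summary comes back to R
    sales_summary <- dbGetQuery(con, "
      SELECT region,
             COUNT(*)    AS n_orders,
             AVG(amount) AS avg_amount
      FROM   sales
      GROUP  BY region
    ")

    head(sales_summary)
    dbDisconnect(con)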

But when it comes to building powerful predictive models from these massive data sets, enterprises need more than just basic analytics at the data layer. To really unlock the information and insight potential of Big Data, the modern enterprise needs:

  • A scalable data layer that can accommodate broad, heterogeneous datasets and run on commodity servers;
  • Real-time ingestion of multiple data streams combined with access to historical data warehouses;
  • And the expertise of data scientists who have the flexibility and agility to explore this wealth of data, and answer critical questions about what it means — and propose questions that haven't been asked yet.

But to be successful, a data scientist needs more than just SQL queries and some black-box algorithms. Statistical modeling is a process, not an atomic action: it requires exploring the data, "data hacking" to transform and combine predictive variables, refining multiple candidate models, and often combining several models into an even more powerful predictive engine.
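As a toy illustration of that process (and nothing specific to any vendor's platform), here is roughly what those steps look like in plain R, using the built-in mtcars data in place of a real Big Data extract:

    # Exploration: look at distributions and relationships
    summary(mtcars)
    pairs(mtcars[, c("mpg", "wt", "hp")])

    # "Data hacking": transform and combine predictive variables
    mtcars$log_hp     <- log(mtcars$hp)
    mtcars$wt_per_cyl <- mtcars$wt / mtcars$cyl

    # Refine several candidate models
    fit1 <- lm(mpg ~ wt + log_hp, data = mtcars)
    fit2 <- lm(mpg ~ wt_per_cyl + log_hp, data = mtcars)

    # Combine the models into a simple ensemble: here, an average of predictions
    preds <- rowMeans(cbind(predict(fit1, mtcars), predict(fit2, mtcars)))
    head(preds)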

That's why, in each case, these database vendors have turned to the R language to provide advanced analytics capability for their customers. By combining a language designed for the process of statistical modeling with the high-performance data-processing capabilities of the data layer, data scientists now have ready access to the wealth of Big Data and the tools necessary to build powerful predictive models from it. Plus, by tapping into the expertise of a community of more than two million R users and developers, organizations with a data platform built on SAP, Oracle or IBM Netezza now also have access to cutting-edge data analysis and modeling techniques to apply to novel streams of data in social networking, manufacturing, supply chain, and many other domains.

SAP, incidentally, hasn't formally announced integration with R yet, but a slide presentation has leaked detailing how the R engine can be integrated into the HANA in-memory database: the architecture looks similar to the way the IBM Netezza appliance integrates with Revolution R Enterprise. SAP is planning a HANA coming-out party and press conference this Tuesday, and I wouldn't be surprised to see the R integration formally announced then.
