Site icon R-bloggers

SQL vs. NoSQL for Data Science

[This article was first published on R – TomazTsql, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Data come in variety of form, at different pace, and at different volume. And if all three criteria define the difference between SQL and NoSQL and there, all three are still irrelevant for data science.

My theorem is, that no matter what shape, size, frequeny, value and trustworthiness, SQL type of presenting the data is still the number one player.

Before you all jump, and start writing comments, hold on and continue reading.

Technology

SQL Databases as we have known them and as we know them nowadays have improved over the years in terms of velocity, performance, volume, frequency, and up until this day, keep the relational model, data ACID, isolations, and so on. All this – if you would like to address them – fix, rigid, complex, … tendencies have not only kept the databased number one, but provided the business, digitalization wave and cloud computing with structure and data visibility.

NoSQL database technologies are, on the other hand as described by many, flexible, can provide horizontal scaling, give better query performances due to data normalization and can do parallel computations.

I cannot argue with parallel computation and horizontal scaling. But with rest, I have big issues.

Offerings in Microsoft Azure

Let’s take a look into the different offerings in comparison between SQL and NoSQL:

FeatureSQLNoSQL
StoragePredefined tablesKey-value pair storage, Document Storage, Columnstore storage, Graph based storage
Data ModelTables of rows and columns with relations between other tables, related data stored separately, joined to form complex queriesStores data depending on database type: Key-value, documents, columnar type and graph databases
ExampleSQL Server, Azure SQL Server, Azure DatabaseAzure Cosmos DB ( Table API, Cassandra API, Graph API), HBase in HDInsight, Azure Cache for Redis, Azure Table Storage, Azure Blob Store, Azure Data Lake store, Azure File Store, Azure Time Series Insights
Business PurposeGeneral purpose systems, CRM, Accounting, Finance, Human resources, Planning, Inventory management, Transactional management SystemMobile Apps, IoT Apps, Real-time data Stream and Analytics, Content management
ScaleVertically by increasing server loadHorizontally by sharding across multiple
AnalyticsSQL, SQL with R, Python, and Java, Analysis Services with Data Mining, Integration Services with data profiling, Azure Machine LearningSpark, Scala, Python, R, Hive, SQL, .NET Core, Azure Machine Learning for Python, Stream Analytics

Analytical hiccups and trade-offs with NoSQL

Horizontal scaling and parallel computations are great. They both provide elasticity of resources and faster analytical results. When delivering data to analytical department, the ends must meet.

Strong consistency

You have probably heard the phrase “from eventual to strong consistency”. This simply means that the schema is defined on read and when a document, key-value pair binary file, columnar file is copied between different transformation zones it will eventually become consistent with schema, ACID rules, integrity constraints. But with eventual consistent data, you are gaining fast response (analytics) with the cost of potentials errors or stale data. And delivering this type of data to data science department will always result in back and forth communication full of nagging questions and clearing many data inconsistencies.

Data transformation

Keeping data in original format is great; Transforming it to a readable dataset is a big trade-off for data architect, data engineers and a huge must for data science.

Faster query performance can be achieved with in NoSQL without complex query joins. Normalized data would mean, that all transactional data, along with all dimensional data (with names, explanations) are included. Making datasets inadvertently large, but with cheap storage and scalability, this is not a problem. Problem lies with data consistency, updated, well, ACID. And it’s orchestration. Making shards and copies of data every time, something is changed, updated or deleted, can be tedious, invisible and result of many storage issues. But all orchestration issues are also solvable with right software and some coding.

Data usage and delivery

Data scientists love Python Pandas, Numpy, R’s Dplyr, data.table, Spark’s dataframe, datasets, Julia’s dataframe.jl. You get the picture. All are column and row based. In other words, all NoSQL data are delivered as SQL typed data or columnar typed. Even graph data (with edges and vertices) are transformed in this format.

Flexibility

I tend to calculate flexibility of NoSQL data through “time to market” KPI. How fast can add a new type of data (image, alter schema, new KPI) or a new change on the data and deliver it to the data science department is key to flexibility. There are for sure also other key factors regarding flexibility, for example through deployment and operations, replication or even availability.

Final thoughts

I am happy that NoSQL concepts, technologies and statistical approached have penetrated world of data science. Not only it helped developed new ways of calculation, improved and developed new algorithms, but also opened new ways of analysing formats, brought it to community faster that we have envisioned. But there are caveats to these concepts. Data still need to be cleaned, harmonised, consolidated. In order to bring the consistency, accuracy through transformation, wrangling and orchestration. And these processes must (!) not be neglected, overlooked nor underestimated. If they are, the complete NoSQL paradigm will go down the drain. And this will cost company precious resources, energy and unhappiness.

To leave a comment for the author, please follow the link and comment on their blog: R – TomazTsql.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.