Next Station in Data Science path: DB NoSQL
Posted: 6 June, 2015
In my Data Scientist roadmap, one important issue is where to store the information. You can store the logs in Hadoop, but that is a first step that I will explain in another post. Beyond that, you can imagine storing the information in a database.
Put the concepts of database and Big Data together and the answer you will most likely arrive at is NoSQL. But what is a NoSQL database?
First, take a look at Wikipedia: “A NoSQL (often interpreted as Not only SQL) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability. The data structures used by NoSQL databases (e.g. key-value, graph, or document) differ from those used in relational databases, making some operations faster in NoSQL and others faster in relational databases. The particular suitability of a given NoSQL database depends on the problem it must solve.”
Relational databases (RDBMS), queried with SQL, have been the primary model for database management during the past few decades. Today, however, non-relational “NoSQL” databases are gaining prominence as an alternative model. Let’s discuss why this evolution in database management is happening.
Nowadays we handle huge amounts of data, and at extremely large volumes the rigidly structured approach of a relational database becomes a problem: performance tends to degrade as the data grows, and a traditional RDBMS is hard to scale out to meet the needs of Big Data.
So NoSQL was conceived as a different family of databases that allows high-performance, agile processing of information at a much bigger scale. It is a kind of database well adapted to the high demands of Big Data.
Benefits of NoSQL:
- Elastic scaling: NoSQL databases are designed to expand transparently and horizontally to take advantage of new nodes, and they’re usually designed with low-cost commodity hardware in mind.
- Greater data-handling capacity.
- NoSQL databases are cheaper to maintain.
- Lower server costs.
- No schema or fixed data model.
- Integrated caching: to increase throughput and performance, advanced NoSQL systems cache data in system memory.
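To make the elastic scaling point a bit more concrete, here is a minimal Python sketch of how a store might spread keys across nodes by hashing them. It is only an illustration under my own assumptions: the node names, the `user:N` keys and the helper functions `node_for` and `distribute` are all hypothetical, and it uses plain modulo hashing, whereas real NoSQL systems typically use more refined schemes such as consistent hashing.

```python
import hashlib


def node_for(key, nodes):
    """Pick the node responsible for a key by hashing the key (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


def distribute(keys, nodes):
    """Return a dict mapping each node name to the list of keys it would store."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for(key, nodes)].append(key)
    return placement


# Illustrative data: 1000 made-up keys, first on three nodes, then on four.
keys = ["user:%d" % i for i in range(1000)]
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for name, stored in sorted(four_nodes.items()):
    print(name, len(stored))
```

Adding `node-d` simply gives the same keys one more place to live, which is the idea behind scaling horizontally on commodity hardware. With plain modulo hashing most keys change node when the cluster grows, which is one reason production systems prefer consistent hashing.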
SQL and NoSQL have both been great inventions in the area of data management, and both have been used to keep data storage and retrieval optimized and smooth. It is still hard to dismiss one and go completely with the other. Each technology is best at what it does, and it is up to you to put them to good use depending on the business situation and needs.
Types of NoSQL databases:
- Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
- Graph stores are used to store information about networks, such as social connections. Graph stores include Neo4J and HyperGraphDB.
- Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or “key”), together with its value. Examples of key-value stores are Riak and Voldemort. Some key-value stores, such as Redis, allow each value to have a type, such as “integer”, which adds functionality.
- Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets, and store columns of data together, instead of rows.
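The four models above can be imitated with plain Python data structures. This is just a rough sketch to show how the shapes differ, not how any of the products mentioned actually store data; every name and value in it (`key_value`, `documents`, `columns`, `edges`, the people and their ages) is made up for illustration.

```python
# Key-value store: one value per key, nothing else is known about it.
key_value = {"session:42": "logged-in"}

# Document store: each key is paired with a nested document
# (key-value pairs, arrays and even nested documents).
documents = {
    "user:1": {
        "name": "Ada",
        "languages": ["R", "Python"],
        "address": {"city": "Madrid"},
    }
}

# Wide-column style: the values of one column are kept together,
# so a query over a single column does not need to read the others.
columns = {
    "name": {"user:1": "Ada", "user:2": "Linus"},
    "age": {"user:1": 36, "user:2": 45},
}

# Graph store: nodes plus the connections between them.
edges = {"Ada": {"Linus"}, "Linus": {"Ada", "Grace"}, "Grace": {"Linus"}}


def average_age(cols):
    """Aggregate over a single column without touching the rest of the data."""
    ages = cols["age"].values()
    return sum(ages) / len(ages)


def friends_of_friends(graph, person):
    """People two hops away from `person` who are not already direct friends."""
    direct = graph[person]
    reachable = set()
    for friend in direct:
        reachable |= graph[friend]
    return reachable - direct - {person}


print(documents["user:1"]["address"]["city"])
print(average_age(columns))
print(friends_of_friends(edges, "Ada"))
```

The point of the sketch is that each model makes a different question cheap: a lookup by key, a read of a whole nested record, an aggregate over one column, or a walk along connections.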
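As a first example, take a look at Vertica, one of the best column-oriented databases (note that it is queried with SQL, so strictly speaking it is a columnar analytic database rather than a NoSQL wide-column store). You can use it free of charge for non-business purposes, with a limit of 1 TB of data. You can download the database and find all the information and support on these web sites: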
http://www.vertica.com/customer-experience/vertica-101/ (Video Tutorials)
There you will also find the Distributed R application, for applying all kinds of algorithms with MAP.
You can download an appliance there, but I recommend installing the database on CentOS 6 to make sure that Distributed R installs correctly.
Finally, we will talk about one of the best document databases, MongoDB. It is an open-source database; here you can find instructions for downloading and installing it:
To test this database, here you’ll find some nice tutorials:
If you want to use a virtual machine, I recommend installing Ubuntu 14.04.
Finally, as an R follower, I will just point you to an interesting post about testing R with MongoDB:
There you’ll find examples of two packages for working with MongoDB, RMongo and rmongodb. You can investigate these two packages in depth and experience the power of this database with R. Please let me know your opinion.
This evolution of databases aims to solve problems that you cannot solve with standard databases, as Einstein reminds us: