What Is Big Data?
Big data is pushing us to find ways to harness the enormous amount of unstructured data generated every day, so it is no surprise that many new Big Data technologies have been introduced.
Open source technologies are now available to handle the ever-growing volume of data created daily. Companies that use these technologies to process large amounts of data every day include Craigslist, Facebook, Twitter, eBay, and wordpress.com.
We learned OO programming and were told that relational databases are what you use, so we tore objects apart to make them fit the relational model.
Then object databases came along, and you no longer had to tear objects apart to fit them into the database. OO databases failed because of corporate control over data and the difficulty of evolving the schema.
What do you do when the data won't fit onto one server?
Distributed databases
Sharding: with a relational database, you might have to split tables into multiple partitions across servers to handle the volume of data.
The usual approach is to denormalize, since relationships would otherwise be spread across many servers. But then what is the point of using a relational database?
Partitioning: split data based on keys (e.g., use the name to split keys among partitions). However, the resulting distribution is often unbalanced. The solution is “consistent hashing”, which spreads the data far more evenly across nodes.
High availability: why are applications like Facebook and Twitter always available? They keep replicas of the data on several nodes in case a server node goes down. Replication means sharing the data among nodes: every time we update a value, a copy is made on another node.
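The consistent hashing idea mentioned above can be sketched as a ring of hash positions. This is a minimal illustration, not how any particular database implements it: the class name `ConsistentHashRing`, the node names, the number of virtual points per node, and the choice of MD5 are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a consistent-hash ring.
class ConsistentHashRing {
    // Ring position -> node name, kept sorted so we can walk "clockwise".
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // Place several virtual points per node to smooth out the distribution.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    // Removing a node only removes its own points from the ring.
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i));
        }
    }

    // A key belongs to the first node point at or after its hash, wrapping around.
    String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of the MD5 digest as a long (an assumed hash choice).
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addNode("node-a");
        ring.addNode("node-b");
        ring.addNode("node-c");
        System.out.println("user42 -> " + ring.nodeFor("user42"));
    }
}
```

The useful property is that when a node is removed, only the keys that lived on that node move; every other key keeps its owner, which is what makes adding and removing servers cheap.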
Hadoop – a Big Data technology
Back in the day, Google wanted to crawl the entire internet and perform calculations on the data. At the time, they could not afford a single machine able to handle this, so they used cheap machines and linked the computers together.
Google wrote a paper on what they were doing, known as MapReduce. Nutch (a search engine), led by computer scientist Doug Cutting, was doing similar work. He read Google’s paper and created a prototype. Yahoo then hired Cutting. He created the open source Hadoop platform and named the technology after his son’s yellow stuffed elephant toy, which went on to become the platform’s logo.
Yahoo eventually spun off this side of the company into Hortonworks, which would bring Hadoop to Microsoft Windows.
NoSQL databases offer horizontal scalability, fault tolerance, high availability, and a design-friendly lack of a fixed schema.
NoSQL is the wrong name, but it is catchy. What it really means is no rigid schema, which makes the data more flexible to work with.
Polyglot persistence means you can use multiple database types to build one application.
Enterprise data needs consistency. Is strong consistency important? For a bank, yes; for a blog, no.
For enterprise data, NoSQL is NOT an option. NoSQL is appropriate for application data.
With NoSQL, ACID guarantees are out of the equation; the trade-offs are framed by the CAP theorem instead.
The CAP theorem was proposed by Professor Eric Brewer, co-founder and chief scientist of Inktomi.
The theorem states that a distributed system can offer at most two of three desirable properties:
Consistency - Is the data I’m looking at now the same if I look at it somewhere else? If someone writes a value to the database, other users can immediately read that same value back.
Availability - What happens if my database goes down? If a number of nodes in your cluster fail, the distributed system remains operational.
Partition Tolerance - What if my data is on different networks? If a network failure divides the nodes in your cluster into two groups that can no longer communicate, the system still remains operational.
Bigtable
The kind of processing that Google does required a database that was high-performance and reliable, even if weak on consistency.
At the time no database like this existed, so Google created its own and called it Bigtable. Products such as web indexing, Orkut, Blogger, Google Earth, and part of Google App Engine use it.
Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
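The "sorted map indexed by row key, column key, and timestamp" description can be pictured with nested sorted maps. The sketch below is only an in-memory analogy and says nothing about how Google stores the data; the class name `SparseSortedMap`, its methods, and the sample row and column names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical in-memory analogy for a (row, column, timestamp) -> bytes map.
class SparseSortedMap {
    // row key -> column key -> timestamp (newest first) -> uninterpreted bytes
    private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> rows =
            new TreeMap<>();

    void put(String row, String column, long timestamp, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(column,
                    c -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Returns the newest version of a cell, or null if it was never written.
    // Missing cells cost nothing to store, which is what "sparse" means here.
    byte[] getLatest(String row, String column) {
        TreeMap<String, TreeMap<Long, byte[]>> columns = rows.get(row);
        if (columns == null) {
            return null;
        }
        TreeMap<Long, byte[]> versions = columns.get(column);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    public static void main(String[] args) {
        SparseSortedMap map = new SparseSortedMap();
        map.put("com.example/index", "contents", 1L,
                "old page".getBytes(StandardCharsets.UTF_8));
        map.put("com.example/index", "contents", 2L,
                "new page".getBytes(StandardCharsets.UTF_8));
        byte[] latest = map.getLatest("com.example/index", "contents");
        System.out.println(new String(latest, StandardCharsets.UTF_8));
    }
}
```

Keeping the timestamps in descending order means the first entry of a cell is always its most recent version, while older versions remain available.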
Several technologies were created based on, or alongside, the concepts in the Bigtable paper. Examples include Dynamo from Amazon, Google App Engine (which is built on top of the lower-level Bigtable with extra capabilities), and many others such as Cassandra.
For example, Amazon uses Dynamo for many of its products, such as the shopping cart. There is a readily available paper written by Amazon on this technology.
Facebook created Cassandra, open sourced it, and used it in its inbox search tool. HBase is a near-clone of Google’s Bigtable, whereas Cassandra is a “Bigtable/Dynamo hybrid”.
Cassandra is a column family database; a column family is a grouping of columns. It is similar to, but not the same as, a relational table. For example, a customer may have data such as name, address, and phone number; some customers might not have music data, and some won’t supply their age. This is known as sparse data, where each row may not have the same columns.
The Cassandra data model is a 4- or 5-dimensional hash, described as follows:
Column – a name/value/timestamp tuple
{
  name: "twitter_handle",
  value: "RandysPizzaRtp",
  timestamp: "2013-03-20 11:30:00"
}
Super Column – a name/map tuple whose value consists of an unlimited number of columns. Super columns carry no timestamp of their own.
See also:
- http://arin.me/post/40054651676/wtf-is-a-supercolumn-cassandra-data-model
- http://www.datastax.com/docs/1.0/ddl/column_family#about-super-columns
colFamily1 = {
  name: "twitter_account",
  value: {
    twitter_handle: {
      name: "twitter_handle",
      value: "DaveSportsfan",
      timestamp: "2013-03-20 11:32:00"
    },
    twitter_email: {
      name: "twitter_email",
      value: "dbloom@nc.rr.com",
      timestamp: "2013-03-20 11:32:00"
    },
    twitter_language: {
      name: "twitter_language",
      value: "English",
      timestamp: "2013-03-20 11:32"
    }
  }
}
Column Family – a structure that groups both Columns and Super Columns. Each row in it is the slice of data corresponding to a particular key. It is like a table in the relational model.
allTweets = {
  tweet1: {
    handle: "davidmbloom",
    tweet: "Gators won today!"
  },
  tweet2: {
    handle: "spurrier",
    tweet: "@davidmbloom That's Great!",
    replytohandle: "davidmbloom"
  }
}
Keyspace - the outer grouping of the data, like a schema in the relational model. All the Column Families go inside the Keyspace.
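The nesting of keyspace, column family, row, and column can be sketched with plain Java maps. This is only a mental model under assumed names (`DataModelSketch`, `Keyspace`, `ColumnFamily`, `Column`, and their methods are hypothetical); it is not Cassandra's storage engine or its client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of keyspace -> column family -> row -> column.
class DataModelSketch {
    // A column is a name / value / timestamp tuple.
    static final class Column {
        final String name;
        final String value;
        final long timestamp;

        Column(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // A column family maps each row key to that row's columns, sorted by name.
    static final class ColumnFamily {
        private final Map<String, TreeMap<String, Column>> rows = new HashMap<>();

        void insert(String rowKey, String name, String value, long timestamp) {
            rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
                .put(name, new Column(name, value, timestamp));
        }

        // Rows are sparse: each one reports only the columns it actually has.
        List<String> columnNames(String rowKey) {
            List<String> names = new ArrayList<>();
            TreeMap<String, Column> row = rows.get(rowKey);
            if (row != null) {
                names.addAll(row.keySet());
            }
            return names;
        }
    }

    // A keyspace is the outer grouping that holds the column families.
    static final class Keyspace {
        private final Map<String, ColumnFamily> columnFamilies = new HashMap<>();

        ColumnFamily columnFamily(String name) {
            return columnFamilies.computeIfAbsent(name, k -> new ColumnFamily());
        }
    }

    public static void main(String[] args) {
        Keyspace keyspace = new Keyspace();
        ColumnFamily allTweets = keyspace.columnFamily("allTweets");
        allTweets.insert("tweet1", "handle", "davidmbloom", 1L);
        allTweets.insert("tweet1", "tweet", "Gators won today!", 1L);
        allTweets.insert("tweet2", "handle", "spurrier", 2L);
        allTweets.insert("tweet2", "replytohandle", "davidmbloom", 2L);
        System.out.println(allTweets.columnNames("tweet1"));
        System.out.println(allTweets.columnNames("tweet2"));
    }
}
```

Here the two rows of `allTweets` end up with different sets of columns, which is the sparseness described earlier.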
Node - a single server instance within a group of Nodes. In most cases, this is a single physical computer or a single virtual machine instance.
Cluster - a group of Nodes that distributes your work among them.
Partitioner - responsible for distributing rows (by key) across the Nodes in the cluster.
Replication Factor - how many Nodes in the cluster you want a copy of the data to be on.
Eventual Consistency - consistency is weak by default, but configurable.
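To make "configurable" concrete: a widely cited rule of thumb, added here as background rather than taken from this article, is that with replication factor N, a read that waits for R replicas sees the latest write acknowledged by W replicas whenever R + W > N. The arithmetic is sketched below; the class and method names are hypothetical and this is not driver code.

```java
// Hypothetical helper illustrating the R + W > N rule of thumb.
class TunableConsistency {
    // A quorum is a strict majority of the replicas.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every read set must overlap every write set in at least one replica.
    static boolean readsSeeLatestWrite(int replicationFactor, int writeReplicas, int readReplicas) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int n = 3;
        int q = quorum(n);
        System.out.println("quorum of " + n + " = " + q);
        System.out.println("quorum writes + quorum reads: " + readsSeeLatestWrite(n, q, q));
        System.out.println("one write + one read: " + readsSeeLatestWrite(n, 1, 1));
    }
}
```

With a replication factor of 3, quorum reads and writes (2 replicas each) overlap in at least one replica, while reading and writing a single replica leaves room for stale reads, which is the eventual-consistency case.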
The data type for a column name is called a comparator.
Within a row, columns are always stored in sorted order by their column name. The comparator specifies the data type for the column name, as well as the sort order in which columns are stored within a row.
The comparator may not be changed after the column family is defined.
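The effect of the comparator on column order can be imitated with a sorted map in Java. The sketch below is an analogy only: `ComparatorSketch` and its two orderings are hypothetical stand-ins for a text-typed and a numeric-typed column name, not Cassandra classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical analogy: the comparator decides how column names sort within a row.
class ComparatorSketch {
    // Column names compared as text.
    static final Comparator<String> TEXT_ORDER = Comparator.naturalOrder();
    // Column names compared as whole numbers.
    static final Comparator<String> NUMERIC_ORDER = Comparator.comparingLong(Long::parseLong);

    // Stores the columns of one row in a map sorted by the given comparator
    // and returns the column names in their stored order.
    static List<String> sortedColumnNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "value-of-" + name);
        }
        return new ArrayList<>(row.keySet());
    }

    public static void main(String[] args) {
        System.out.println(sortedColumnNames(TEXT_ORDER, "9", "10", "2"));    // [10, 2, 9]
        System.out.println(sortedColumnNames(NUMERIC_ORDER, "9", "10", "2")); // [2, 9, 10]
    }
}
```

The same three column names come back in a different order depending on the comparator, which hints at why the comparator cannot be changed once data has been stored in that order.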
The data type for a column (or row key) value is called a validator.
For static column families, you should define each column and its associated type when you define the column family using the column_metadata property.
For dynamic column families (where column names are not known ahead of time), you should specify a default_validation_class (default validator for columns not in the column_metadata) instead of defining the per-column data types.
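The difference between per-column validators and a default validator can be sketched as a lookup with a fallback. This is a simplified illustration under assumed names (`ValidatorSketch`, `defineColumn`, and the two predicates are hypothetical); it only mirrors the idea behind column_metadata and default_validation_class and is not Cassandra's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: per-column validators with a default for unknown columns.
class ValidatorSketch {
    // Plays the role of column_metadata: column name -> validator for its value.
    private final Map<String, Predicate<String>> columnMetadata = new HashMap<>();
    // Plays the role of default_validation_class.
    private final Predicate<String> defaultValidator;

    ValidatorSketch(Predicate<String> defaultValidator) {
        this.defaultValidator = defaultValidator;
    }

    void defineColumn(String name, Predicate<String> validator) {
        columnMetadata.put(name, validator);
    }

    // Columns without their own metadata fall back to the default validator.
    boolean isValid(String column, String value) {
        return columnMetadata.getOrDefault(column, defaultValidator).test(value);
    }

    // Example validator: accepts only optionally signed whole numbers.
    static boolean isInteger(String value) {
        return value != null && value.matches("-?\\d+");
    }

    public static void main(String[] args) {
        ValidatorSketch family = new ValidatorSketch(value -> value != null);
        family.defineColumn("age", ValidatorSketch::isInteger);
        System.out.println(family.isValid("age", "42"));
        System.out.println(family.isValid("age", "forty-two"));
        System.out.println(family.isValid("nickname", "forty-two"));
    }
}
```

A static column family would define every column up front, while a dynamic one would rely mostly on the default.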
The Cassandra framework has evolved since it was first open sourced.
A few good links on the topic:
- http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
- http://pkghosh.wordpress.com/2013/07/14/storing-nested-objects-in-cassandra-composite_columns/
How to Get Cassandra
http://cassandra.apache.org/download/
CASSANDRA_HOME = C:\Java\apache-cassandra-1.2.8
C:\Java\apache-cassandra-1.2.8\bin>cassandra-cli.bat
Third-party install of Cassandra: http://planetcassandra.org/Download/DataStaxCommunityEdition
Select your version of Windows and the MSI installer.
After the install, you should have the services listed on this page: http://www.datastax.com/documentation/gettingstarted/index.html?pagename=docs&version=quick_start&file=quickstart#getting_started/../getting_started/gettingStartedWindowsTrblShooting_c.html
You will find C:\Program Files\DataStax Community\python\python.exe
From the command line: > python cqlsh
git clone https://github.com/datastax/
package com.example.cassandra;

/*
 * Java driver documentation:
 * http://www.datastax.com/documentation/developer/java-driver/1.0/webhelp/index.html
 */
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SimpleClientQuery {
    private Cluster cluster;
    private Session session;
    private final String ksName = "cardinal02";

    // Connect through one contact point and print the cluster topology.
    public void connect(String node) {
        cluster = Cluster.builder()
                .addContactPoint(node).build();
        Metadata metadata = cluster.getMetadata();
        System.out.printf("Connected to cluster: %s\n",
                metadata.getClusterName());
        for (Host host : metadata.getAllHosts()) {
            System.out.printf("Datacenter: %s; Host: %s; Rack: %s\n",
                    host.getDatacenter(), host.getAddress(), host.getRack());
        }
        session = cluster.connect();
    }

    public void close() {
        cluster.shutdown();
    }

    public static void main(String[] args) {
        SimpleClientQuery client = new SimpleClientQuery();
        client.connect("127.0.0.1");
        client.createSchema();
        client.loadData();
        client.close();
    }

    // Create the keyspace and the two tables used by loadData().
    public void createSchema() {
        session.execute("CREATE KEYSPACE " + ksName + " WITH replication " +
                "= {'class':'SimpleStrategy', 'replication_factor':3};");
        session.execute(
                "CREATE TABLE " + ksName + ".songs (" +
                "id uuid PRIMARY KEY," +
                "title text," +
                "album text," +
                "artist text," +
                "tags set<text>," +
                "data blob" +
                ");");
        session.execute(
                "CREATE TABLE " + ksName + ".playlists (" +
                "id uuid," +
                "title text," +
                "album text, " +
                "artist text," +
                "song_id uuid," +
                "PRIMARY KEY (id, title, album, artist)" +
                ");");
    }

    // Insert one song and one playlist entry, then read the playlist back.
    public void loadData() {
        session.execute(
                "INSERT INTO " + ksName + ".songs (id, title, album, artist, tags) " +
                "VALUES (" +
                "756716f7-2e54-4715-9f00-91dcbea6cf50," +
                "'Way Cool Jr.'," +
                "'Reach For Sky'," +
                "'Ratt'," +
                "{'jazz', '2013'})" +
                ";");
        session.execute(
                "INSERT INTO " + ksName + ".playlists (id, song_id, title, album, artist) " +
                "VALUES (" +
                "2cc9ccb7-6221-4ccb-8387-f22b6a1b354d," +
                "756716f7-2e54-4715-9f00-91dcbea6cf50," +
                "'Way Cool Jr.'," +
                "'Reach For Sky'," +
                "'Ratt'" +
                ");");
        ResultSet results = session.execute("SELECT * FROM " + ksName + ".playlists " +
                "WHERE id = 2cc9ccb7-6221-4ccb-8387-f22b6a1b354d;");
        System.out.println(String.format("%-30s\t%-20s\t%-20s\n%s", "title", "album", "artist",
                "-------------------------------+-----------------------+--------------------"));
        for (Row row : results) {
            System.out.println(String.format("%-30s\t%-20s\t%-20s", row.getString("title"),
                    row.getString("album"), row.getString("artist")));
        }
        System.out.println();
    }
}
- Cassandra usage at Twitter: http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king