AnySizeData: September 2013

Friday, September 27, 2013

Walkthrough in learning Cassandra and Big Data

How much data is created in minute?

What Is a Big Data?

Big data is making us think of ways to harness the excessive amount of unstructured data that is generated on a daily basis. Moreover, it is no surprise we have seen the introduction of many new Big Data technologies

There are now open source technologies now available to handle the more and more data created every day. Examples of companies who make use of these technologies to process large amounts of data on daily basis: Craigslist, Facebook, Twitter, eBay, wordpress.com, etc.

We learned OO programming, and told relational databases is what you use! Tear apart objects so they fit into relational!

Object databases came along. You didn’t have to tear apart objects so they fit into database. The failure of OO databases related to corporate controlling of data and the challenges evolving the schema

What do you do when data wont fit onto one server?

Distributed databases

Sharding : with a relational db might have to split tables into multiple partitions among servers to handle excessive amount of data .

Approach is to de-normalize since relationships spread across many servers, but what is the point of using relational database then?

Partitioning: Split data based on keys (i.e. use name to split keys among partition. However, distribution unbalanced. Solution is “Consistent Hashing” which equally distributes the data Highly available.

For example, why is it applications like Facebook and Twitter are always available? Handle this by making replicas among data in case a server node goes down. Replicating is sharing . Every time we update, make copy over another node.

Hadoop – a Big Data technology

Back in the day, Google wanted to crawl the entire internet and perform calculations on the data. At the time, they did not have enough money for a machine to handle this – They used cheap machines and linked the computers together.

Google wrote a paper on what they were doing known as Map Reduce. Nutch (a search engine) was doing similar work lead by computer scientist Doug Cutting. He read Google’s paper and created a prototype. Yahoo hired Cutting. He created the open source Hadoop platform and named the technology after his son’s yellow stuffed-elephant toy, which went on to become the platform’s logo.

Yahoo eventully spun off this side of the company into Hortonworks which would put Hadoop on Microsoft OS.

The NoSQL databases provide infinite scalability, fault tolerance, high availibilty, design-friendly lack of schema.

NoSql is the wrong name, but catchy. Really, it means not a rigid schema, more flexible to work with

Polyglot persistence – means can use multiple databases types to build an application

Need consistency with enterprise data. Eventual Consistency important? Bank yes, blog no .

With Enterprise data, NoSQL is NOT an option. NoSQL appropriate for application data.

ACID is out of the equation, replaced by CAP.

CAP Theorem was developed by Professor Eric Brewer, Co-founder and Chief Scientist of Inktomi.

The theorem states, that a distributed system design, can offer at most two out of three desirable properties:

Consistency - Is the data I’m looking at now the same if I look at it somewhere else? if someone writes a value to a database, there after other users will immediately be able to read the same value back,

Availability – What happens if my database goes down? If a number of nodes fail in your cluster the distributed system can remain operational

Partition Tolerance - What if my data is on different networks? means that if the nodes in your cluster are divided into two groups that can no longer communicate by a network failure, again the system remains operational.

Big Table

The kind of processing that Google does required a high performance and reliable, but weak on consistency.

At time no database like this existed. Google created their own calling it Big Table . Products such as web indexing, Orkut, blogger, Google earth , and part of Goggle App Engine use this.

Big table is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterrupted array of bytes.

Several technologies were created based on the Big Table paper concepts. Examples include Dynamo from Amazon, the Google App Engine (which is built on top of the lower level Big Table with extra capabilities), and many others such as Cassandra.

For example, Amazon uses Dynamo for many of their products such as their shopping cart. Their is a paper written by Amazon on this technology that is readily available.

Facebook created Cassandra which they open sourced and had used it in their email search tool. HBase is a near-clone of Google’s BigTable, whereas Cassandra is a “BigTable/Dynamo hybrid”.

Cassandra is a column family database : grouping of columns . It is similar but not the same as a relational. For example, a customer may have data such as name address phone number, some might not have music data. Or some wont supply age. Known as sparse where each row may not have same columns.

The Cassandra data model is a 4 or 5 dimensional hash described as follows:

Column – a name /value / time stamp tuple

{ name: “twitter_handle”
value: “RandysPizzaRtp”
timestamp: “2013-03-20 11:30:00” }

Super Column
http://arin.me/post/40054651676/wtf-is-a-supercolumn-cassandra-data-model http://www.datastax.com/docs/1.0/ddl/column_family#about-super-columns

Super Column – a name /map tuple where value consists of an unlimited number of columns. No timestamp on these.

colFamily1 = {
name: “twitter_account”
value: {
{ name:” twitter_handle”,
value:”DaveSportsfan”, “2013-03-20 11:32:00”
}
{name:” twitter_email”,
value:”dbloom@nc.rr.com”, “2013-03-20 11:32:00”
}
{name:” twitter_language”,
value:”English”,
“2013-03-20 11:32
}
}
}

Column Family– a structure to group both the Columns and Super Columns. In other words, a slice of data corresponding to a particular key. Like a table in relational

allTweets = {
tweet1: {
handle: “davidmbloom", tweet: “Gators won today!"
},
tweet2 : {
handle: “spurrier",
tweet: “@davidmbloom That’s Great!",
replytohandle: “davidmbloom"
}
}

Keyspace - the outer grouping of the data. Like a schema in relational model. All the Column Families go inside the Keyspace.

Node - is a single server instance within a group of Nodes. In most cases, this is a single physical computer or a single virtual machine instance.

Cluster - is a group of Nodes that distributes your work amongst them. Partitioner - responsible for distributing rows (by key) across nodes in the cluster.

Replication Factor - how many Nodes in the cluster you want a copy of the data to be on. Eventual Consistency - weak consistency by default . But, configurable.

The data type for a column name is called a comparator.

Within a row, columns are always stored in sorted order by their column name. The comparator specifies the data type for the column name, as well as the sort order in which columns are stored within a row.

The comparator may not be changed after the column family is defined.

The data type for a column (or row key) value is called a validator.

For static column families, you should define each column and its associated type when you define the column family using the column_metadata property.

For dynamic column families (where column names are not known ahead of time), you should specify a default_validation_class (default validator for columns not in the column_metadata) instead of defining the per-column data types.

The Cassandra framework has evolved since it first was open sourced.

A few good links on the topic:

How to Get Cassandra
http://cassandra.apache.org/download/

CASSANDRA_HOME = C:\Java\apache-cassandra-1.2.8

C:\Java\apache-cassandra-1.2.8\bin>cassandra-cli.bat

Third party install of Cassandra: http://planetcassandra.org/Download/DataStaxCommunityEdition

select your version of windows and msi installer

After the install. you should have the services listed on this page: http://www.datastax.com/documentation/gettingstarted/index.html?pagename=docs&version=quick_start&file=quickstart#getting_started/../getting_started/gettingStartedWindowsTrblShooting_c.html

You will find C:\Program Files\DataStax Community\python\python.exe

From the command line: > python cqlsh

git clone https://github.com/datastax/java-driver

package com.example.cassandra;

/*
http://www.datastax.com/documentation/developer/java-driver/1.0/webhelp/index.html

*/

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SimpleClientQuery {
private Cluster cluster;
private Session session;

private final String ksName = "cardinal02";

public void connect(String node) {
cluster = Cluster.builder()
.addContactPoint(node).build();
Metadata metadata = cluster.getMetadata();
System.out.printf("Connected to cluster: %s\n",
metadata.getClusterName());
for ( Host host : metadata.getAllHosts() ) {
System.out.printf("Datatacenter: %s; Host: %s; Rack: %s\n",
host.getDatacenter(), host.getAddress(), host.getRack());
}
session = cluster.connect();

}

public void close() {
cluster.shutdown();
}

public static void main(String[] args) {
SimpleClientQuery client = new SimpleClientQuery();
client.connect("127.0.0.1");
client.createSchema();
client.loadData();
client.close();
}

public void createSchema() {
session.execute("CREATE KEYSPACE " + ksName + " WITH replication " +
"= {'class':'SimpleStrategy', 'replication_factor':3};");

session.execute(
"CREATE TABLE " + ksName + ".songs (" +
"id uuid PRIMARY KEY," +
"title text," +
"album text," +
"artist text," +
"tags set<text>," +
"data blob" +
");");
session.execute(
"CREATE TABLE " + ksName + ".playlists (" +
"id uuid," +
"title text," +
"album text, " +
"artist text," +
"song_id uuid," +
"PRIMARY KEY (id, title, album, artist)" +
");");

}

public void loadData() {
session.execute(
"INSERT INTO " + ksName + ".songs (id, title, album, artist, tags) " +
"VALUES (" +
"756716f7-2e54-4715-9f00-91dcbea6cf50," +
"'Way Cool Jr.'," +
"'Reach For Sky'," +
"'Ratt'," +
"{'jazz', '2013'})" +
";");
session.execute(
"INSERT INTO " + ksName + ".playlists (id, song_id, title, album, artist) " +
"VALUES (" +
"2cc9ccb7-6221-4ccb-8387-f22b6a1b354d," +
"756716f7-2e54-4715-9f00-91dcbea6cf50," +
"'Way Cool Jr.'," +
"'Reach For Sky'," +
"'Ratt'" +
");");
ResultSet results = session.execute("SELECT * FROM " + ksName + ".playlists " +
"WHERE id = 2cc9ccb7-6221-4ccb-8387-f22b6a1b354d;");

System.out.println(String.format("%-30s\t%-20s\t%-20s\n%s", "title", "album", "artist",
"-------------------------------+-----------------------+--------------------"));
for (Row row : results) {
System.out.println(String.format("%-30s\t%-20s\t%-20s", row.getString("title"),
row.getString("album"), row.getString("artist")));
}
System.out.println();
}
}

•Cassandra usage at Twitter - http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king

•Cassandra Paper - http://odbms.org/download/cassandra.pdf

•Using the CLI , - http://www.datastax.com/docs/1.0/dml/using_cli

•CQL reference - http://www.datastax.com/docs/1.0/references/cql/index

•Cassandra Reference Card - http://refcardz.dzone.com/refcardz/apache-cassandra

•Cassandra Cheat Sheet - http://crlog.info/category/distributed-systems/nosql/cassandra/

•Cassandra Mythology - http://www.infoq.com/articles/cassandra-mythology

•WTF is a supercolumn - http://arin.me/post/40054651676/wtf-is-a-supercolumn-cassandra-data-model

•Cassandra Java Driver API - https://datastax-oss.atlassian.net/browse/JAVA

•Playing with Cassandra - http://blog.octo.com/en/nosql-lets-play-with-cassandra-part-13/

•Cassandra API list - http://brianoneill.blogspot.com/2012/08/cassandra-apis-laundry-list.html

Tuesday, September 17, 2013

Tweaking Consistency with Cassandra

In the new age of the Internet, using a relational databases to handle large amounts can lead to trouble :

"Load-balancing and failover are sometimes difficult to achieve even on a smaller scale with relational databases, especially if you don’t have the option to license a commercial database clustering option. And even if you can, there are limits to scalability. People tend to workaround these problems with master-slave database configurations, but they can be difficult to set up and manage"

Cassandra is one of the viable NoSQL options that accounts for scalability. Moreover, for the developers that are used to working with relational databases it offers a familiar query language :

With the introduction of the Cassandra Query Language (CQL) Cassandra now supports declaring schema and typing for your data model. For the application developer, this feature brings the Cassandra data model somewhat closer to the relational (relations and tuples) model.

In relational, scalability is hindered by the amount of effort that its put towards achieving consistency. As a result, many NoSQL databases allows for a lower consistency level (i.e. eventual consistency) , with the benefit of added scalability. Cassandra even allows you to customize the consistency level:

Promoting availability over consistency became a key design tenet for many NoSQL databases. Other common design goals for NoSQL databases include high performance, horizontal scalability, simplicity and schema flexibility. These design goals were also shared by Cassandra founders, but it was also designed to be CAP-aware, meaning the developer is allowed to tune the tradeoff between consistency and latency

As far as an API to use with Cassandra for the developer, there are two good options out there:

Because Cassandra user community is still growing, it’s good to choose a client with an active user community and existing documentation. Astyanax client, developed by Netflix, currently seems to be the most widely used, production-ready and feature complete Java client for Cassandra. This client supports both Thrift and CQL 3 based APIs for accessing the data. DataStax, a company that provides commercial Cassandra offering and support, is also developing their own CQL 3 based Java driver, which recently came out of beta phase.

Sunday, September 15, 2013

Cassandra modeling and objects

The Cassandra framework has evolved since it first was open sourced.

I am reading a few good links on the topic.

http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/

http://pkghosh.wordpress.com/2013/07/14/storing-nested-objects-in-cassandra-composite_columns/

Saturday, September 14, 2013

Cassandra evolution over 5 years

This post goes over the original paper of Cassandra and comparing it to where we are today with 2.0 release.

A few good points from the article:

"Cassandra has also consistently led the NoSQL market in performance."

On Eventual Consistency : " Perhaps the best treatment of this subject is Christos Kalantzis’s from this year’s Cassandra Summit."

2008 Cassandra’s data model is unrecognizable today.Supercolumns are gone. “Column families” have been replaced by tables, which offer a mostly-conventional schema

remember that when it says “columns” it means what today we call “cells” and by “rows” it means “partitions.”

Sunday, September 8, 2013

NoSQL at Research Triangle Software Symposium

Notes from Talk by Venkat Subramanium on NoSQL.

With key value, you put garbage in and you get garbage out. A Document database - XML , given key can store XML or a JSON document or a blob. The benefit is can get specific element using query. With colunar, a key and associated column of data, several keys and columns. Its often clustered. High availability and scalability data access in efficient way. You run nodes in grid, some in each node, request comes in find out which location its in. The benefit is its very scalable and support a lot of users. Something goes down on node, app runs, seems seamless.

Replicate is sharing , every time we update, make copy over another node. Which nodes do you makes updates to? When is it available right away, but if its down than we would check later and update it then. Joe has address, Kim has an address, both update fine, but update both across different nodes.

Relationship are graph databases example marketing, twitter, Facebook , etc do data mining,
Data graph, patterns

Key value store (i.e. what you like) access based on key, Hash code algorithm on machine to get data
Dynamo db is an example. Here you have App data/Session data , shopping data , user profiles, preference.

No schema how access data? Typically in Java with relational you have properties file and you update your hibernate file and also provide a new setter and getter. But, we prefer that we only have to update in one place.

Consistency on server, sharing it won't happen instantly on another node. Consistency important? Bank yes, blog no! Nosql - eventually consistent, some time in future. Amazon claims their eventually consistent takes 1 sec for consistency.

Column family db - Groups of columns similar not same as relational
i.e. Customer data : name address phone number, music data
One or many music interests , age, some don't store age, With relational very inflexible. No schema flexible. Each row (i.e. key) can have a different set off columns. examples are Cassandra & Hbase

Graph data bass relationships - Not clustered, Inserts expensive, Nodes and edges, 2 way relationships, examples are Neo4j flockdb

Nosql wrong name, but catchy. Doesn't mean no SQL but not a rigid schema, more flexible to work with. No schema to evolve

Enterprise data no SQL is not a solution.

Polyglot persistence - multiple databases to build applicactions just like multiple languages

NoSQL works for application data, also saves programming effort with nosql

Multiple classifications of nosql databases

Aggregate oriented, Aggregate ignorant - relationship oriented

Data hierarchy or relationship between, Parent child vs traversing, Key value, Document, Column family

Where we are with databeses ? Use them when we have to store out data.
Only use db when you have a need for it based on requirements
Use it if we have to support concurrency, Multi user concurrency at same time, multiple different applications concurrency at same time

When app turns into enterprise data, people come for your data, Minute becomes enterprise rules change

With relational, Rows and columns relationships tuples, Robust , standard , familiar, popular
Consistency very well , easy access using SQL , adhoc queries

Impedance mismatch . Good 30 years ago with c and fortran

OO is defacto standard now in programming. Objects tear them apart and form them into shape to fit database. OO database don't have to dismantle, no database work at all, Object db solved problem well for problem at hand, but corporate data didn't integrate we'll. Object relational database died

Came along ORM but structure rigid - stability where things do not change fast and takes effort to evolve schema. Centralized data and hard to scale for availability

Thus, what is Different now with these NoSQL databases? As, Relational not affected by oo databases.

What has changed is modem and baud rate gone, now everyone carries devices

Frequency of data is enormous so applications need distributed access to data.

Data we sending to applications more data exchanges moving away from corporate data.

Clustering

Still have Batch processing overnight. Clustering with greater availability . Distribute data into many locations. i.e. Hadoop