
Page 1

Sentimental Analysis using Hadoop, Phase 2: Week 2

MARKET / INDUSTRY, FUTURE SCOPE

BY

ANKUR UPRIT

Page 2

A key-value store basically uses a hash table in which there exists a unique key and a pointer to a particular item of data.

The key can be synthetic or auto-generated, while the value can be a string, JSON, a BLOB (binary large object), etc.

A key-value database allows clients to read and write values using a key as follows:

1. Get(key): returns the value associated with the provided key.

2. Put(key, value): associates the value with the key.

3. Multi-get(key1, key2, .., keyN): returns the list of values associated with the list of keys.

4. Delete(key): removes the entry for the key from the data store.
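As a minimal sketch, the four operations map directly onto an in-memory hash table; the KVStore class below is purely illustrative (real stores add persistence, replication, and concurrency control on top of this idea):

    # Purely illustrative in-memory key-value store.
    class KVStore:
        def __init__(self):
            self._table = {}  # hash table: unique key -> value

        def get(self, key):
            # Returns the value associated with the provided key.
            return self._table[key]

        def put(self, key, value):
            # Associates the value with the key.
            self._table[key] = value

        def multi_get(self, *keys):
            # Returns the list of values associated with the list of keys.
            return [self._table[k] for k in keys]

        def delete(self, key):
            # Removes the entry for the key from the data store.
            del self._table[key]

    store = KVStore()
    store.put("US", {"3975 Fair Ridge Drive, Suite 200 South, Fairfax, VA 22033"})
    print(store.get("US"))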

Page 3

While a key-value database seems helpful in some cases, it has some weaknesses as well. One is that the model will not provide any kind of traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously). Such capabilities must be provided by the application itself.

Secondly, as the volume of data increases, maintaining unique values as keys may become more difficult; addressing this issue requires the introduction of some complexity in generating character strings that will remain unique among an extremely large set of keys.

Key         Value
"India"     {"B-25, Sector-58, Noida, India – 201301"}
"Romania"   {"IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606", "City Business Center, Coriolan Brediceanu No. 10, Building B, Timisoara, 300011"}
"US"        {"3975 Fair Ridge Drive, Suite 200 South, Fairfax, VA 22033"}

Reference: http://www.3pillarglobal.com/insights/exploring-the-different-types-of-nosql-databases

Page 4

MongoDB
MongoDB is a document-oriented database that natively supports JSON format. It is extremely easy to use and operate, so it is very popular with developers, and it doesn't require a database administrator (DBA) to bootstrap.
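A sketch with the PyMongo client hints at this (a local server is assumed; the database and collection names are made up):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed local server
    db = client["offices"]                             # hypothetical database

    # Documents are JSON-like dicts; no schema or DBA setup is required.
    db.addresses.insert_one({"country": "India", "city": "Noida"})
    print(db.addresses.find_one({"country": "India"}))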

Redis
Redis is one of the fastest data stores available today. An open source, in-memory NoSQL database known for its speed and performance, Redis has become popular with developers and has a growing and vibrant community. It features several data types that make implementing various functionalities and flows extremely simple.
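A short sketch with the redis-py client shows a few of those data types (a local server is assumed; the key names are made up):

    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed local server

    r.set("page:views", 0)              # plain string value
    r.incr("page:views")                # atomic counter on that string
    r.lpush("recent:users", "alice")    # list
    r.hset("user:1", "name", "Ankur")   # hash (field -> value map)
    print(r.get("page:views"))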

Page 5

Cassandra
Created at Facebook, Cassandra has emerged as a useful hybrid of a column-oriented database with a key-value store. Grouping column families gives the familiar feel of tables and provides good replication and consistency for easy linear scaling. Cassandra is most effective when used for managing really big volumes of data (the kind that doesn't fit in a single server), such as Web/click analytics and measurements from the Internet of Things; writing to Cassandra is extremely fast.
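A rough write-path sketch using the DataStax Python driver (the keyspace and table here are hypothetical and would need to exist first):

    from datetime import datetime
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])   # assumed single local node
    session = cluster.connect()

    # Fast writes are Cassandra's strength; this records one click event.
    session.execute(
        "INSERT INTO analytics.clicks (page, ts, username) VALUES (%s, %s, %s)",
        ("/home", datetime.utcnow(), "alice"),
    )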

CouchDB
CouchDB gets accessed in JSON format over HTTP. This makes it simple to use for Web applications. Perhaps not surprisingly, CouchDB is best suited for the Web, with some interesting applications for offline mobile apps.
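Because the interface is plain HTTP and JSON, an ordinary HTTP client is enough. A sketch using the requests library (the server URL and database name are assumptions):

    import requests

    base = "http://localhost:5984"        # assumed local CouchDB server
    requests.put(base + "/reviews")       # create a database (hypothetical name)

    # Store and fetch a document as plain JSON over HTTP.
    resp = requests.post(base + "/reviews",
                         json={"user": "alice", "sentiment": "positive"})
    doc_id = resp.json()["id"]
    print(requests.get(base + "/reviews/" + doc_id).json())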

Page 6

HBase
Deep down in Hadoop is the powerful database HBase, which spreads data out among nodes using HDFS. It is perhaps most appropriate for managing huge tables consisting of billions of rows. Being a part of Hadoop, it allows using MapReduce processing on the data for complicated computational jobs, but it also provides real-time data processing capabilities.

Both HBase and Cassandra follow the BigTable model. As such, HBase can be scaled linearly simply by adding more nodes to the setup. HBase is best suited for real-time querying of Big Data.
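From Python, one common way in is the HappyBase client over HBase's Thrift gateway; a sketch (the table and column family names are made up, and a running Thrift server is assumed):

    import happybase

    conn = happybase.Connection("localhost")   # assumes HBase Thrift server
    table = conn.table("clicks")               # hypothetical table

    # Rows are addressed by key; columns live inside column families.
    table.put(b"page123", {b"stats:views": b"42"})
    print(table.row(b"page123"))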

Reference: http://www.itbusinessedge.com/slideshows/top-five-nosql-databases-and-when-to-use-them.html

Page 7

Most scripting languages, such as PHP, Python, Perl, Ruby, and Bash, are fine. Any language able to read from stdin, write to stdout, and parse tab and newline characters will work: Hadoop Streaming just pipes the string representations of key-value pairs, concatenated with a tab, to a program that must be executable on each task tracker node.
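For example, a minimal word-count mapper and reducer honoring this contract might look like the following sketch (the file names and the word-count task itself are just illustrative):

    #!/usr/bin/env python
    # mapper.py -- Hadoop Streaming pipes each input line to stdin;
    # we emit "word<TAB>1" per word on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- Streaming delivers mapper output sorted by key,
    # so equal words arrive as consecutive lines.
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print("%s\t%d" % (current_word, count))
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print("%s\t%d" % (current_word, count))

Such scripts are typically handed to the Hadoop Streaming jar via its -mapper and -reducer options; the exact jar path depends on the distribution.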

On most Linux distros used to set up Hadoop clusters, Python, Bash, Ruby, Perl, etc. are already installed, but nothing prevents you from rolling your own execution environment for your favorite scripting or compiled programming language.

The difference between Java and scripting languages, though, is that the heartbeat of child nodes will not be sent to the parent nodes when we are using scripting languages.

Page 8

References:
1. http://stackoverflow.com/questions/8572339/which-language-to-use-for-hadoop-map-reduce-programs-java-or-php?answertab=active#tab-top
2. http://www.slideshare.net/corleycloud/big-data-just-an-introduction-to-hadoop-and-scripting-languages

Page 9

Clover, Atlassian's code coverage tool, provides the metrics you need to better balance the effort between writing code that does stuff and code that tests stuff.

Clover runs in your IDE or your continuous integration system, and includes test optimization to make your tests run faster and fail more quickly.

Reference: https://www.atlassian.com/software/clover/overview

Page 10

JUnit is a simple framework to write repeatable tests. It is an instance of the xUnit architecture for unit testing frameworks.

Reference: http://blog.cloudera.com/blog/2008/12/testing-hadoop/

How to create a test: https://github.com/junit-team/junit/wiki/Getting-started
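JUnit itself is Java, but the xUnit pattern it embodies (a test class of independent, assert-based test methods run by the framework) is easy to see in Python's built-in unittest, shown here only to illustrate the architecture; the word_count function is a made-up target:

    import unittest

    def word_count(line):
        # Hypothetical function under test.
        return len(line.split())

    class WordCountTest(unittest.TestCase):
        # Each test_* method is an independent, repeatable test case,
        # analogous to an @Test method in JUnit.
        def test_counts_words(self):
            self.assertEqual(word_count("hadoop hive pig"), 3)

        def test_empty_line(self):
            self.assertEqual(word_count(""), 0)

    if __name__ == "__main__":
        unittest.main()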

Page 11

Hadoop Admin: A Hadoop admin should be able to work independently and have excellent communication skills, plus good knowledge of Linux (security, configuration, tuning, troubleshooting, and monitoring). They should be able to deploy a Hadoop cluster, add and remove nodes, troubleshoot failed jobs, configure and tune the cluster, monitor critical parts of the cluster, etc.

Hadoop Developer: A Hadoop developer has many responsibilities, and they depend on your domain/sector; some of the following would be applicable and some might not.

Page 12

Hadoop development and implementation.

Loading from disparate data sets.

Pre-processing using Hive and Pig.

Designing, building, installing, configuring, and supporting Hadoop.

Translating complex functional and technical requirements into detailed designs.

Performing analysis of vast data stores to uncover insights.

Maintaining security and data privacy.

Creating scalable and high-performance web services for data tracking.

High-speed querying.

Managing and deploying HBase.

Being part of a POC effort to help build new Hadoop clusters.

Testing prototypes and overseeing handover to operational teams.

Proposing best practices/standards.

References:
http://www.edureka.co/blog/hadoop-developer-job-responsibilities-skills/
https://www.dice.com/jobs/detail/Hadoop-Solution-Architect-%26%2347-Information-and-BI-Architect-%26%2347-Big-Data-Architect-%26%2347-Hadoop-Admin-IDC-Technologies-Sunnyvale-CA-94085/10114879/727691

Page 13

Reference: http://hortonworks.com/training/certification/

Page 14

Hive and HBase are not directly comparable, though the comparison commonly happens because of the SQL-like layer on Hive. HBase is a database, but Hive is not a database.

Hive is a MapReduce-based analysis/summarization tool running on top of Hadoop. Hive depends on MapReduce (batch processing) plus HDFS.

HBase is a NoSQL database, used to store and retrieve data. To query (scan) HBase, MapReduce is not required, so HBase depends only on HDFS, not on MapReduce. That makes HBase an online processing system.
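The contrast also shows up in client code. A rough sketch using the third-party PyHive and HappyBase clients (the hosts, ports, and the tweets table are assumptions):

    from pyhive import hive   # SQL-like queries, executed as batch MapReduce jobs
    import happybase          # direct row access over HDFS, no MapReduce

    # Hive: a summarization query; expect batch latency.
    conn = hive.connect(host="localhost", port=10000)
    cursor = conn.cursor()
    cursor.execute("SELECT sentiment, COUNT(*) FROM tweets GROUP BY sentiment")
    print(cursor.fetchall())

    # HBase: a single-row online read; returns immediately.
    hbase = happybase.Connection("localhost")
    print(hbase.table("tweets").row(b"tweet-001"))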

Reference: http://www.quora.com/Hive-vs-HBase-Which-one-wins-the-battle-Which-is-used-in-which-scenario

Page 15

Depending on where you work, you may need to simply use whatever standards your company has established.

For example, Hive is commonly used at Facebook for analytical purposes. Facebook promotes the Hive language, and their employees frequently speak about Hive at Big Data and Hadoop conferences.

However, Yahoo! is a big advocate for Pig Latin. Yahoo! has one of the biggest Hadoop clusters in the world. Their data engineers use Pig for data processing on their Hadoop clusters.

Alternatively, you may have a choice of Pig or Hive at your organization, especially if no standards have yet been established, or perhaps multiple standards have been set up.

However, compared to Hive, Pig needs some mental adjustment for SQL users to learn. Pig Latin has many of the usual data processing concepts that SQL has, such as filtering, selecting, grouping, and ordering, but the syntax is a little different from SQL (particularly the group by and flatten statements!). Pig requires more verbose coding, although it's still a fraction of what straight Java MapReduce programs require. Pig also gives you more control and optimization over the flow of the data than Hive does.

If you are a data engineer, then you'll likely feel that you have better control over the dataflow (ETL) processes when you use Pig Latin, especially if you come from a procedural language background. If you are a data analyst, however, you will likely find that you can ramp up on Hadoop faster by using Hive, especially if your previous experience was more with SQL than with a procedural programming language. If you really want to become a Hadoop expert, then you should learn both Pig Latin and Hive for the ultimate flexibility.