Extracting Twitter Data using Apache Flume

By Bharat Khanna

Talend ETL Developer

What you need

• Hortonworks Hadoop cluster: HDP 1.3
• Oracle VirtualBox
• PuTTY
• WinSCP
• Maven (for creating flume-snapshot.jar)

What is Flume?

• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.

Network Settings in Oracle VirtualBox

[Screenshots: VirtualBox network settings for the VM]

Getting Started

• Run your Hadoop cluster in VirtualBox. Once it has started, make sure you can reach the cluster's HDFS file browser (Hue) from your Windows host machine at an address such as http://192.168.56.101:8000.

• You can find this IP address by running the ifconfig command inside the Hadoop cluster VM once it has started.

File Browser using HUE

• Viewed from the host machine, your HDFS interface should appear as the Hue file browser (screenshot on the original slide).

Setting your bash_profile in Putty

• It is important to set the environment variables by editing .bash_profile in your home directory, using the command “vi .bash_profile”. (The leading dot is required; it is what makes the file hidden by default.) Leave out MAVEN_HOME for now; it is added once Maven is installed.
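As a rough sketch, the entries in .bash_profile might look like the lines below. The Hadoop and Flume paths are the ones used elsewhere in these slides; the JAVA_HOME path is a hypothetical value, and the variable names HADOOP_HOME and FLUME_HOME are assumptions, so adjust all of them to your VM.

```shell
# Sketch of ~/.bash_profile entries (MAVEN_HOME is intentionally left out for now).
# JAVA_HOME is a hypothetical path; check where the JDK lives on your VM.
export JAVA_HOME=/usr/jdk/jdk1.6.0_31

# Paths used throughout these slides
export HADOOP_HOME=/usr/lib/hadoop
export FLUME_HOME=/etc/flume/apache-flume-1.6.0-bin

# Make the java and flume-ng commands resolvable from any directory
export PATH="$PATH:$JAVA_HOME/bin:$FLUME_HOME/bin"
```

After saving the file, a new login session (or “source ~/.bash_profile”) picks up the values.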

Creating Flume Snapshot.jar

• This jar contains the libraries Flume needs for the Twitter source. You can either download a prebuilt copy or build it yourself; building it yourself is preferable.

• You need Maven for this. If your Java version is 1.6, as it is in Hortonworks HDP 1.3, download the archived Maven 3.0.5 from http://archive.apache.org/dist/maven/maven-3/; otherwise use the latest version.

Creating Flume Snapshot.jar Contd..

• Once downloaded, unzip the folder in Windows and transfer it to your Hortonworks cluster using WinSCP.

• Create a link to the folder with the command “ln -s apache-maven-3.0.5 maven” in your home directory.

• Set the path of this link in your bash_profile as shown in slide 8.

• After saving your bash_profile, log off and log in again to your Unix session so that the changes take effect. Run the command “mvn -version” to check that it is working.
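A minimal sketch of the Maven lines for bash_profile, assuming the maven link was created in your home directory as described above. The variable name MAVEN_HOME follows the earlier slide; appending to PATH rather than prepending is a choice made here.

```shell
# Sketch of the Maven entries for ~/.bash_profile.
# Assumes ~/maven is the link to the unpacked apache-maven-3.0.5 folder.
export MAVEN_HOME="$HOME/maven"

# Put the mvn launcher on the PATH
export PATH="$PATH:$MAVEN_HOME/bin"
```

Once you have logged in again, “mvn -version” should report Maven 3.0.5 together with the Java version it found.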

Creating Flume Snapshot.jar Contd..

• Download Cloudera’s Twitter Code zip file from https://github.com/cloudera/cdh-twitter-example.

• Unzip it and transfer it to your home directory in Hortonworks cluster using Winscp.

• Go to the flume-sources folder under cdh-twitter-example-master and run the command “mvn package” to build flume-sources-1.0-SNAPSHOT.jar (the Flume snapshot jar). The file is written to the target folder in the same directory.
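The build step can be wrapped in a small helper, sketched below. The function name build_flume_sources is hypothetical, and the default directory assumes the zip was unpacked directly in your home directory.

```shell
# build_flume_sources [DIR]: run "mvn package" in DIR and list the jars it produced.
# The default DIR is an assumption about where the example was unpacked.
build_flume_sources() {
  (
    # The subshell keeps the caller's working directory unchanged
    cd "${1:-$HOME/cdh-twitter-example-master/flume-sources}" || exit 1
    mvn package || exit 1
    ls -l target/*.jar
  )
}
```

Calling build_flume_sources with no argument builds in the default location; the function returns a non-zero status if the directory is missing or the build fails.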

Configuring Flume

• Transfer flume-sources-1.0-SNAPSHOT.jar to Flume's lib directory, which is /etc/flume/apache-flume-1.6.0-bin/lib on the Hortonworks 1.3 VM.

• Flume’s configuration directory is /etc/flume/apache-flume-1.6.0-bin/conf.

• Open the flume-env.sh.template file in the vi editor. Set JAVA_HOME to the path defined in your bash_profile, and set FLUME_CLASSPATH to the path of the Flume snapshot jar, in double quotes.

• Rename flume-env.sh.template to flume-env.sh using the mv command.
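The two settings in flume-env.sh could look like the sketch below. The JAVA_HOME value is a hypothetical path (use the one from your bash_profile), and the jar path assumes the snapshot jar was copied into Flume's lib directory as described above.

```shell
# flume-env.sh (sketch): read by the flume-ng launcher when the agent starts.
# JAVA_HOME is a hypothetical path; reuse the value from your bash_profile.
export JAVA_HOME=/usr/jdk/jdk1.6.0_31

# Path of the snapshot jar, in double quotes
FLUME_CLASSPATH="/etc/flume/apache-flume-1.6.0-bin/lib/flume-sources-1.0-SNAPSHOT.jar"
```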

Configuring Flume contd..

• You also need to transfer the following jar files to the Flume lib folder.

Jar                                    From directory
hadoop-core.jar                        HADOOP_HOME, i.e. /usr/lib/hadoop
hadoop-client-1.2.0.1.3.0.0-107.jar    HADOOP_HOME, i.e. /usr/lib/hadoop
jets3t-0.6.1.jar                       /usr/lib/hadoop/lib
commons-httpclient-3.0.1.jar           /usr/lib/hadoop/lib
commons-configuration-1.6.jar          /usr/lib/hadoop/lib
commons-codec-1.4.jar                  /usr/lib/hadoop/lib
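One way to do the transfer is a short loop, sketched here. It copies rather than moves the jars, which is an assumption made so that Hadoop's own directories stay untouched; the variable names FLUME_LIB and HADOOP_JARS are introduced only for this example.

```shell
# Copy the Hadoop jars listed above into Flume's lib folder.
# FLUME_LIB can be overridden from the environment; the default is the path from these slides.
FLUME_LIB="${FLUME_LIB:-/etc/flume/apache-flume-1.6.0-bin/lib}"

HADOOP_JARS="
/usr/lib/hadoop/hadoop-core.jar
/usr/lib/hadoop/hadoop-client-1.2.0.1.3.0.0-107.jar
/usr/lib/hadoop/lib/jets3t-0.6.1.jar
/usr/lib/hadoop/lib/commons-httpclient-3.0.1.jar
/usr/lib/hadoop/lib/commons-configuration-1.6.jar
/usr/lib/hadoop/lib/commons-codec-1.4.jar
"

for jar in $HADOOP_JARS; do
  if [ -f "$jar" ]; then
    cp "$jar" "$FLUME_LIB/"
  else
    # Version numbers can differ between installations, so report instead of failing
    echo "not found, skipping: $jar" >&2
  fi
done
```

Any jar reported as not found probably has a different version suffix on your VM; look for the closest match in the same directory.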

Creating Twitter App

• Go to dev.twitter.com and click on create a new app.
• Give your app a name and a description; the website can be something like http://yourdomain.com.
• After creating the app, go to Keys and Access Tokens and create your consumer key, consumer secret, access token and access token secret.
• Make a note of them, as you need them in subsequent steps.

Creating conf file

• Go to the folder /etc/flume/apache-flume-1.6.0-bin/conf and open a new file named twitter.conf. (File names are case-sensitive; this is the name used later when starting the agent.)
• A sample image of it is shown on the next slide. Insert the consumer key, consumer secret, access token and access token secret that you got in the previous step.
• Then enter the keywords for which you want to analyze the data.
• Finally, give your HDFS path, which you can get from fs.default.name in the core-site.xml file under HADOOP_HOME/conf, i.e. /usr/lib/hadoop/conf.
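As a rough sketch of what such a file can contain, the snippet below writes a twitter.conf modelled on the sample configuration in Cloudera's cdh-twitter-example. The agent name TwitterAgent matches the name used when starting Flume. The component names (Twitter, MemChannel, HDFS), the keywords, the tweets directory and the batch, roll and capacity values are assumptions, and everything in angle brackets is a placeholder to replace with your own value.

```shell
# Writes ./twitter.conf; run this from Flume's conf directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the file body.
cat > twitter.conf <<'EOF'
# Agent "TwitterAgent": one source, one channel, one sink
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# Twitter source class provided by flume-sources-1.0-SNAPSHOT.jar
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data, analytics

# HDFS sink; the path prefix is the fs.default.name value from core-site.xml
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = <fs.default.name>/user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000

# In-memory channel; transactionCapacity is kept at least as large as the sink's batchSize
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
EOF
```

With fileType set to DataStream and writeFormat set to Text, the tweets are written to HDFS as plain text rather than as sequence files.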

Checks before running Flume: Setting the Timezone

• Make sure the time shown in your VM matches the time on your local machine. If it does not, reset it as shown below. You can check the time in your VM with the “date” command.
• If your timezone already matches, you can skip the next 2 steps.
• The timezone is controlled by the /etc/localtime file. The available timezones are listed under the /usr/share/zoneinfo/ directory.
• cd /etc
• ln -sf /usr/share/zoneinfo/US/Eastern localtime (the -f option replaces the existing localtime file; substitute your own zone for US/Eastern)

Checks before running Flume: Setting Oracle VirtualBox Properties

• You need to make sure that the time you set in the VM in the previous step stays in effect. For that, set the following properties in VirtualBox.

• In Windows, open a command prompt in the VirtualBox installation folder: go to C:\Program Files\Oracle, select the VirtualBox folder, hold the left Shift key, right-click, and choose "Open command window here".

Checks before running Flume: Setting Oracle VirtualBox Properties Contd..

• Run the following commands in the command prompt, replacing ${VMNAME} with the name of your VM.
• VBoxManage setextradata ${VMNAME} "VBoxInternal/Devices/VMMDev/0/Config/GetHostTimeDisabled" 1
• VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-interval" 10000
• VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-min-adjust" 100
• VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-on-restore" 1
• VBoxManage guestproperty set ${VMNAME} "/VirtualBox/GuestAdd/VBoxService/--timesync-set-threshold" 1000

Running Flume

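• Go to the Flume bin directory and run the Flume agent using the following command. The name given with -n must match the agent name used in the conf file, and the file name after -f must match the conf file you created, including case.

• flume-ng agent -n TwitterAgent -c conf -f /etc/flume/apache-flume-1.6.0-bin/conf/twitter.conf

• After some time, files should start to appear under the HDFS directory specified in the conf file.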

Error Catalog

• You may face the following frequently occurring errors while running Flume.

• Apache Flume error: java.lang.NoSuchMethodError: twitter4j.FilterQuery.setIncludeEntities(Z)Ltwitter4j/FilterQuery;

Fix: This happens because FilterQuery.class occurs in two different jars (one of which is the Flume snapshot jar). Search for the clashing jars by running the command “find . -name "*.jar" | xargs grep FilterQuery.class” in Flume's lib directory, then rename the other jar by adding the suffix .org to its name so that it no longer ends in .jar.
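A variant of the search above that prints only the names of the matching jars is sketched below. The function name find_class_jars is hypothetical. It relies on the fact that a jar stores its entry names uncompressed, so a plain byte search for the class path finds them.

```shell
# find_class_jars CLASS_PATH [DIR]: print every jar under DIR (default: current
# directory) whose contents mention CLASS_PATH.
find_class_jars() {
  # grep -l prints only the names of matching files, not the binary matches
  find "${2:-.}" -name '*.jar' -exec grep -l -- "$1" {} +
}
```

For example, find_class_jars 'twitter4j/FilterQuery.class' /etc/flume/apache-flume-1.6.0-bin/lib should list the snapshot jar plus whichever other jar carries the clashing class.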

Error Catalog Contd..

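• Apache Flume error: java.io.IOException: Callable timed out after 10000 ms on file:

Fix: This happens because of too many connections to Twitter from your account. Just wait for some time and try again. The message itself is raised by Flume's HDFS sink when an HDFS operation takes longer than hdfs.callTimeout (10000 ms by default), so if waiting does not help, check that HDFS is reachable or raise that value in the conf file.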
