Literature Review


2.1 Text Mining

Text mining refers to the process of deriving high-quality information from text. It describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. Text mining is a variation of data mining, which tries to find interesting patterns in databases. As most information is currently stored as text, text mining is believed to have high commercial potential value. Text analysis processes typically include:
a. Information retrieval or identification of a corpus. This step includes collecting or identifying a set of textual materials on the web, file system, database, or content management system;
b. Applying natural language processing, such as part-of-speech tagging, syntactic parsing, and other types of linguistic analysis;
c. Named entity recognition to identify named text features: people, organizations, place names, and so on, using statistical techniques;
d. Recognition of pattern-identified entities. Features such as telephone numbers, email addresses, and quantities can be discerned with regular expressions;
e. Coreference resolution, the identification of noun phrases and other terms that refer to the same object;
f. Identification of associations among entities and other information in text;
g. Sentiment analysis, which involves discerning subjective material and extracting various forms of attitudinal information, such as opinion, mood, and emotion;
h. Quantitative text analysis, a set of techniques stemming from the social sciences for finding the meaning or stylistic patterns of a casual personal text.
Text mining is now broadly applied in various fields, including security, biomedicine, software applications, sentiment analysis, marketing, and academia.
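As an illustration of steps (a) and (b) above, the following is a minimal sketch using the tm package (introduced in Section 2.5); the two sample documents are hypothetical stand-ins for a collected corpus:

    library(tm)

    # Hypothetical sample documents standing in for a collected corpus
    docs <- c("The new phone is great, I love it!",
              "Terrible battery life. Very disappointed.")

    # Step (a): build a corpus from the textual materials
    corpus <- VCorpus(VectorSource(docs))

    # Basic linguistic preprocessing before further analysis
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Structure the content as a term-document matrix
    tdm <- TermDocumentMatrix(corpus)
    inspect(tdm)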

2.2 Sentiment Analysis

Sentiment analysis refers to the use of natural language processing, text analysis, and computational linguistics to identify and extract subjective information in source materials. The basic task in sentiment analysis is classifying the polarity of a text, i.e., whether it is positive, negative, or neutral. Sentiment analysis can be split into two separate categories: manual or human sentiment analysis and automated sentiment analysis [22]. The differences lie in the efficiency of the system and the accuracy of the analysis. A human analysis component is required in sentiment analysis, as automated systems are not able to analyze historical tendencies of the individual commenter or the platform, and often classify the expressed sentiment incorrectly.
There are two main techniques for sentiment analysis: machine learning based and lexicon based [20]. In machine learning based techniques, two sets of documents are needed: a training set and a test set. The training set is used by an automatic classifier to learn the different characteristics of documents, while the test set is used to check how well the classifier performs. Machine learning starts with collecting a training dataset. The next step is training a classifier on the training data. Once the technique is selected, an important decision to make is feature selection, which determines how the documents are represented.
In the lexicon based technique, the classification is done by comparing the features of a given text against sentiment lexicons whose sentiment values are determined prior to use. A sentiment lexicon contains lists of words and expressions used to express people's subjective feelings and opinions. For example, with positive and negative lexicons, the document is analyzed to determine which sentiment it expresses: if the document contains more positive lexicon words, it is classified as positive, and vice versa. The lexicon based technique is unsupervised learning because it does not require prior training in order to classify data.
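To make the lexicon based technique concrete, the following is a minimal sketch in base R; the miniature word lists are hypothetical stand-ins for real sentiment lexicons, which contain thousands of entries:

    # Hypothetical miniature lexicons
    positive <- c("good", "great", "love", "excellent")
    negative <- c("bad", "terrible", "hate", "disappointed")

    # Classify a document by comparing its words against the lexicons
    classify_polarity <- function(text) {
      words <- unlist(strsplit(tolower(text), "[^a-z]+"))
      score <- sum(words %in% positive) - sum(words %in% negative)
      if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
    }

    classify_polarity("I love this great phone")     # "positive"
    classify_polarity("Terrible battery, I hate it") # "negative"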

2.3 Related Works

The lexicon-based method has been implemented in sentiment analysis before. Taboada [16] used the lexicon-based method to develop the Semantic Orientation Calculator (SO-CAL), which was applied to sentiment analysis of blog postings and video game reviews. The conclusion is that the lexicon-based method for sentiment analysis is robust, results in good cross-domain performance, and can be easily enhanced with multiple sources of knowledge. Palanisamy, Yadav, and Elchuri [14] discovered sentiments on Twitter using a lexicon built from the Serendio taxonomy, which consists of positive, negative, negation, stop words and phrases. The system yields an F-score of 0.8004 on the test dataset.
Implementation of sentiment analysis in various languages has been done by Cui and Garcia. Cui [7] applied the lexicon-based method to sentiment analysis of a Chinese microblog named Weibo, and met some difficulties because Weibo messages usually have imbalanced sentiment polarities. Garcia, Gaines, and Linaza [10] applied the lexicon-based method to sentiment analysis of online reviews in Spanish. Preliminary evaluation of the proposed approach was conducted on the basis of two real datasets of Spanish reviews related to accommodation and food and beverage from TripAdvisor.com. Among the preliminary conclusions, it can be mentioned that there seems to be some relation between the length of a review and its subjectivity. A further conclusion is that negative sentiments are harder to detect than positive ones. Usually, negative sentiments are expressed using indirect language and irony, or by explaining the whole negative experience as a story, which may or may not contain explicit negative words.

2.4 Twitter

Twitter (www.twitter.com) is an online social networking and microblogging service that enables users to send and read "tweets", which are text messages limited to 140 characters, via the Twitter website, mobile devices, or instant messaging. Twitter Inc. is based in San Francisco and has offices in New York City, Boston, San Antonio, and Detroit.
Twitter was created in March 2006 by Jack Dorsey, Evan Williams, Biz Stone, and Noah Glass. The site was launched in July 2006. The service rapidly gained worldwide popularity, with 500 million registered users in 2012 posting 400 million tweets per day [24]. The service also handled 1.6 billion search queries per day. This high popularity has led Twitter to be used for various purposes, such as political campaigns, learning media, and advertisement, while it also faces various issues and controversies regarding security, user privacy, lawsuits, and censorship [17].

2.4.2 Tweets

Tweets are text messages sent by users, limited to 140 characters. Users may subscribe to other users' tweets; this is known as following, and the subscribers are known as followers. Users can group posts together by topic or type by using hashtags, words or phrases prefixed with a "#" sign. Similarly, the "@" sign followed by a username is used for mentioning or replying to other users. To repost a message from another user and share the message with one's own followers, the retweet function is symbolized by "RT" before the message.
A word, phrase, or topic that is tagged at a greater rate than other tags is said to be a trending topic. Trending topics become popular either through a concerted effort by users or because of an event that prompts people to talk about one specific topic. These topics help Twitter and its users to understand what is happening in the world.
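As a brief illustration, these conventions can be recognized with regular expressions in base R; the sample tweet is hypothetical:

    # Hypothetical tweet text illustrating the conventions above
    tweet <- "RT @alice: Loving the #sentiment results, thanks @bob! #rstats"

    # Extract hashtags and mentions, and detect the retweet marker
    hashtags <- regmatches(tweet, gregexpr("#\\w+", tweet))[[1]]
    mentions <- regmatches(tweet, gregexpr("@\\w+", tweet))[[1]]
    is_retweet <- grepl("^RT\\b", tweet)

    hashtags   # "#sentiment" "#rstats"
    mentions   # "@alice" "@bob"
    is_retweet # TRUE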

2.5 R

R is a free software programming language and software environment for statistical computing and graphics, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. Polls and surveys of data miners show that R's popularity has increased substantially in recent years. R has also been chosen as one of the most powerful open-source analysis tools for sentiment analysis [12], with the tm package providing a comprehensive text mining framework for R.
R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. S was created by John Chambers while at Bell Labs. R was created by Ross Ihaka and Robert Gentleman [11] at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of its first two authors and partly as a play on the name of S [13]. R is a GNU project. The source code for the R software environment is written primarily in C, Fortran, and R. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. R uses a command line interface; however, several graphical user interfaces are available for use with R.
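As a brief example of the statistical computing R is designed for, using only built-in functions and the bundled mtcars dataset:

    # Linear modeling: fuel efficiency as a function of vehicle weight
    fit <- lm(mpg ~ wt, data = mtcars)
    summary(fit)

    # A classical statistical test: two-sample t-test by transmission type
    t.test(mpg ~ am, data = mtcars)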

2.5.1 Versions of R

The versions of R from the oldest to the newest are listed below:
a. Version 0.16, the last alpha version developed primarily by Ihaka and Gentleman. The mailing lists commenced on April 1, 1997;
b. Version 0.49 (April 23, 1997), the oldest available source release, which compiles on a limited number of Unix-like platforms. CRAN (Comprehensive R Archive Network) was started on this date, with 3 mirrors that initially hosted 12 packages. Alpha versions of R for Microsoft Windows and Mac OS were made available shortly after this version;
c. Version 0.60 (December 5, 1997), R became an official part of the GNU Project. The code is hosted and maintained on CVS;
d. Version 1.0.0 (February 29, 2000), considered by its developers stable enough for production use;
e. Version 1.4.0, S4 methods were introduced, and the first version for Mac OS X was made available soon after;
f. Version 2.0.0 (October 4, 2004), introduced lazy loading, which enables fast loading of data with minimal expense of system memory;
g. Version 2.1.0, support for UTF-8 encoding, and the beginnings of internationalization and localization for different languages;
h. Version 2.11.0 (April 22, 2010), support for Windows 64-bit systems;
i. Version 2.13.0 (April 14, 2011), added a new compiler function that allows speeding up functions by converting them to byte code (a minimal sketch follows after the list of GUIs below);
j. Version 2.14.0 (October 31, 2011), added mandatory namespaces for packages and a new parallel package;
k. Version 2.15.0 (March 30, 2012), new load balancing functions and improved serialization speed for long vectors;
l. Version 3.0.0 (April 3, 2013), support for numeric index values of 2^31 and larger on 64-bit systems.

2.5.2 R Graphical User Interface

Many statisticians use R from the command line. However, the command line can be quite daunting to a beginner of R. Fortunately, there are many different graphical user interfaces available for R which help to flatten the learning curve:
a. RGUI, comes with the pre-compiled version of R for Microsoft Windows;
b. Tinn-R, an open source, highly capable integrated development environment featuring syntax highlighting similar to that of MATLAB. Only available for Windows;
c. Java GUI for R (also known as JGR), a cross-platform stand-alone R terminal and editor based on Java;
d. Deducer, a GUI for menu-driven data analysis (similar to SPSS/JMP/Minitab);
e. Rattle GUI, a cross-platform GUI based on RGtk2 and specifically designed for data mining;
f. R Commander, a cross-platform menu-driven GUI based on tcltk (several plug-ins to Rcmdr are also available);
g. RExcel, for using R and Rcmdr from within Microsoft Excel;
h. RapidMiner;
i. RKWard, an extensible GUI and IDE for R;
j. RStudio, a cross-platform open source IDE (which can also be run on a remote Linux server);
k. Weka, allows for the use of the data mining capabilities of Weka and statistical analysis in R.
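As a minimal sketch of the byte code compiler introduced in version 2.13.0 (item i above), using the bundled compiler package:

    library(compiler)

    # A deliberately loop-heavy function that benefits from byte compilation
    sum_to <- function(n) {
      s <- 0
      for (i in seq_len(n)) s <- s + i
      s
    }

    # cmpfun() returns a byte-compiled version of the function
    sum_to_c <- cmpfun(sum_to)

    system.time(sum_to(1e7))
    system.time(sum_to_c(1e7))  # typically faster on R versions without JIT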

2.5.3 R Add-on Packages

The capabilities of R are extended through user-created packages, which allow specialized statistical techniques, graphical devices, import and export capabilities, reporting tools, etc. These packages are developed primarily in R, and sometimes in Java, C, and Fortran. A core set of packages is included with the installation of R, with 5300 additional packages (as of April 2012) available at the Comprehensive R Archive Network (CRAN), Bioconductor, and other repositories.

2.5.3.1 Add-on Packages in R

The R distribution comes with the following packages:
a. base, base R functions (and datasets before R 2.0.0);
b. compiler, R byte code compiler (added in R 2.13.0);
c. datasets, base R datasets (added in R 2.0.0);
d. grDevices, graphics devices for base and grid graphics (added in R 2.0.0);
e. graphics, R functions for base graphics;
f. grid, a rewrite of the graphics layout capabilities, plus some support for interaction;
g. methods, formally defined methods and classes for R objects;
h. parallel, support for parallel computation, including by forking and by sockets, and random-number generation (added in R 2.14.0);
i. splines, regression spline functions and classes;
j. stats, R statistical functions;
k. stats4, statistical functions using S4 classes;
l. tcltk, interface and language bindings to Tcl/Tk GUI elements;
m. tools, tools for package development and administration;
n. utils, R utility functions.
These base packages were substantially reorganized in R 1.9.0. The former base was split into four packages: base, graphics, stats, and utils. Packages ctest, eda, modreg, mva, nls, stepfun, and ts were merged into stats, and package mle moved to stats4.

2.5.3.2 Add-on Packages from CRAN

The Comprehensive R Archive Network (CRAN) is a collection of sites which carry identical material, consisting of the R distributions, the contributed extensions, documentation for R, and binaries. The CRAN src/contrib area contains a wealth of add-on packages, including the following recommended packages which are included in all binary distributions of R:
a. KernSmooth, functions for kernel smoothing and density estimation corresponding to Wand and Jones [21];
b. MASS, functions and datasets from the main package of Venables and Ripley [19], for R versions prior to 2.10.0;
c. Matrix, a matrix package, recommended for R 2.9.0 or later;
d. boot, functions and datasets for bootstrapping from Davison and Hinkley [8];
e. class, functions for classification (k-nearest neighbor and LVQ), for R versions prior to 2.10.0;
f. cluster, functions for cluster analysis;
g. codetools, code analysis tools, recommended for R 2.5.0 or later;
h. foreign, functions for reading and writing data stored by statistical software like Minitab, S, SAS, SPSS, Stata, Systat, etc.;
i. lattice, for lattice graphics;
j. mgcv, routines for GAMs and other generalized ridge regression problems with multiple smoothing parameter selection by GCV or UBRE;
k. nlme, fit and compare Gaussian linear and nonlinear mixed-effects models;
l. nnet, software for single hidden layer perceptrons (feed-forward neural networks) and for multinomial log-linear models, for R versions prior to 2.10.0;
m. rpart, recursive partitioning and regression trees;
n. spatial, functions for kriging and point pattern analysis from Venables and Ripley [19], for R versions prior to 2.10.0;
o. survival, functions for survival analysis, including penalized likelihood.

2.5.3.3 Add-on Packages from Bioconductor

Bioconductor is an open source and open development software project for the analysis and comprehension of genomic data. Most Bioconductor components are distributed as R add-on packages. Initially, most of the Bioconductor software packages focused primarily on DNA microarray data analysis. As the project has matured, the functional scope of the software packages has broadened to include the analysis of all types of genomic data, such as SAGE, sequence, or SNP data. In addition, there are metadata (annotation, CDF and probe) and experiment data packages. The packages from Bioconductor are available at http://www.bioconductor.org/.

2.5.3.4 Add-on Packages from Omegahat

The Omega Project for Statistical Computing provides a variety of open-source software for statistical applications, with special emphasis on web-based software, Java, the Java virtual machine, and distributed computing. R packages from the Omega project are available at http://www.omegahat.org/.

2.5.4 RStudio

RStudio is a free and open source integrated development environment (IDE) for R. It is available in two editions: RStudio Desktop, where the program runs locally as a regular desktop application; and RStudio Server, which allows accessing RStudio through a web browser while it runs on a remote Linux server. Prepackaged distributions of RStudio Desktop are available for Microsoft Windows, Mac OS X, and Linux. RStudio is written in the C++ programming language and uses the Qt framework for its graphical user interface. The user interface of RStudio can be seen in Figure 2.1.

Figure 2.1: RStudio

The RStudio team contributes code to many R packages and projects. Here are a few of the prominent ones:
a. ggplot2, a plotting system for R based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts;
b. knitr, designed to be a transparent engine for dynamic report generation with R, solving some long-standing problems in Sweave and combining features of other add-on packages into one package;
c. plyr, a set of tools for a common set of problems: splitting a big data structure into homogeneous pieces, applying a function to each piece, and then combining all the results back together;
d. RPubs, a free publishing service for R Markdown, which weaves together the writing and output of code;
e. devtools, a developer tool for building R packages that removes the pains and bottlenecks of package development;
f. packrat, a dependency management tool to make projects more isolated, portable, and reproducible.
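As a small illustration of these add-on packages, the following is a minimal ggplot2 sketch using the bundled mtcars dataset (assuming the package has been installed, e.g. with install.packages("ggplot2")):

    library(ggplot2)

    # Grammar-of-graphics plot: map weight and mpg to axes, add a point layer
    ggplot(mtcars, aes(x = wt, y = mpg)) +
      geom_point() +
      labs(title = "Fuel efficiency vs. weight",
           x = "Weight (1000 lbs)", y = "Miles per gallon")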