
International Journal of Research in Advent Technology, Vol.2, No.6, June 2014 E-ISSN: 2321-9637


Big Data: Text Analytics

Mrs. Balshetwar S.V.1, Prof. (Dr.) Tugnayat R.M.2

1HOD, Department of Information Technology, Satara College of Engineering and Management, Limb, Satara, Shivaji University, Kolhapur (Maharashtra) Email: [email protected]

2Principal, Shri Shankarprasad Agnihotri College of Engineering, Wardha, Sant Gadge Baba Amravati University, Amravati (Maharashtra)

Abstract - Every part of today's technological world is flooded with big data. Almost 80% of this big data is unstructured because it comes from a variety of new sources such as device logs, server logs, Twitter feeds, chat data, blogs, web pages, emails, and social media content. Together these form a huge collection of text that people create to express themselves to others, which makes it an important source of data that may contain valuable information. Text analytics techniques can extract information from these sources for use in customer management, sentiment analysis, and collaborative analysis. This paper discusses some basic techniques that identify useful patterns in the text found in big data.

Index Terms - Big data, Text analytics, LDA.

1. INTRODUCTION

Text is a piece of information through which humans communicate with each other. A broad range of applications and devices is available for text communication and for sharing data intentionally, so text is being collected at an unprecedented scale. Decisions that were previously based on guesswork or on models of reality are now made on the basis of this data. Big data analysis now drives every aspect of modern applications, gadgets, industry, and society.

Text within big data, such as data from newspapers, magazines, web pages, emails, blogs, and tweets, is particularly important because these sources hold information that is valuable to people. Using this large amount of text requires techniques for processing it.

Those techniques must have the following characteristics: (1) They must be fast and accurate in processing. (2) They must be able to find relationships with other information. (3) They must remove 100% of the ambiguity from the data. (4) They must handle heterogeneous data efficiently.

Steps in big data analysis

The flow chart in Fig. 1 shows the steps for retrieving or extracting important and valuable information from big data.

The analysis steps shown in Fig. 1 for querying and mining big data are very different from traditional analysis methods, which were worked out on small amounts of data. Big data is often noisy, dynamic, interrelated, and untrustworthy. Nevertheless, even noisy big data can be more valuable than small samples, because statistics obtained from frequent patterns and correlation analysis usually reveal hidden patterns and knowledge more reliably.

The analysis step can extract meaningful and related information by processing natural-language text, but such text is not easy to analyze with simple regression models or decision trees. However, the group of techniques called text analytics can draw deeper information from these sources by translating complex textual information into useful signals that support deeper analysis.

Fig. 1. Steps in big data analysis (in order): various sources of data, recording data, cleaning data, integrating, analyzing, interpreting.

2. LITERATURE REVIEW

Based on a survey of over 4,000 information technology (IT) professionals from 93 countries and 25 industries, the IBM Tech Trends Report (2011) identified business analytics as one of the four major technology trends in the 2010s. In a survey of the state of business analytics by Bloomberg Businessweek (2011), 97 percent of companies with revenues exceeding $100 million were found to use some form of business analytics. A report by the McKinsey Global Institute (Manyika et al. 2011) predicted that by 2018, the United States alone will face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as a shortfall of 1.5 million data-savvy managers with the know-how to analyze big data to make effective decisions [1].

Emerging analytics research opportunities can be classified into five critical technical areas: (big) data analytics, text analytics, web analytics, network analytics, and mobile analytics. All of them can contribute to business intelligence and analytics (BI&A) [1].

In its “Hype Cycle for Big Data” report (July 2013), Gartner positions text analytics as delivering great business benefits, with adoption projected within the next two to five years [2].

Text analytics has its academic roots in information retrieval and computational linguistics. In information retrieval, document representation and query processing are the foundations for the vector-space model, the Boolean retrieval model, and the probabilistic retrieval model, which in turn became the basis for modern digital libraries, search engines, and enterprise search systems [3]. In computational linguistics, statistical natural language processing (NLP) techniques for lexical acquisition, word sense disambiguation, part-of-speech tagging (POST), and probabilistic context-free grammars have also become important for representing text [4].

[Figure: Academic roots of text analytics. Information retrieval (document representation, query processing) is applied in digital libraries and search engines; computational linguistics (NLP, CFG, POST) is applied to representing text.]

3. ALGORITHMS FOR TEXT ANALYTICS

Text analytics refers to the process of deriving high-quality information from text. The information is derived by finding or learning patterns and trends in the text using statistical methods.

Many algorithms can extract specific types of information from text data, but which one to apply depends on the type of data analysis project at hand. Some projects have clear objectives, while others simply inspect the data in depth to pull valuable information out of a mass of text; once an outcome is known, it can be used for further analysis.

Depending on the project at hand, statistical methods based on a frequency matrix (which counts the words appearing in various text sources) or a term-document matrix (which lists all the unique terms in the text being examined) often yield a useful new feature once a suitable statistical technique is applied. However, this approach gives intermediate results that serve as a foundation for further analysis. In this type of method, the documents are examined and a scorecard is prepared showing a score for every term according to the number of times it appears in the document. A threshold is then applied, and only the terms scoring above it are collected and used to construct a larger concept.
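The scorecard-and-threshold idea can be illustrated with a minimal Python sketch. The function names, the sample documents, and the threshold value are assumptions made for illustration only; the paper does not prescribe a particular implementation or scoring formula, so raw term counts are used here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def terms_above_threshold(vocab, matrix, threshold):
    """Scorecard step: keep terms whose total count exceeds the threshold."""
    scores = {term: sum(row) for term, row in zip(vocab, matrix)}
    return {term: s for term, s in scores.items() if s > threshold}


# Hypothetical sample documents.
docs = [
    "The delivery was late and the support was slow",
    "Late delivery again, very slow support",
    "Great product and fast delivery",
]
vocab, matrix = term_document_matrix(docs)
frequent = terms_above_threshold(vocab, matrix, threshold=1)
print(frequent)
```

Common words such as "the" and "and" also pass the threshold in this sketch. A practical pipeline would normally remove such stop words or weight terms before thresholding.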

Other text analytics techniques make use of Named Entity Extraction (NEE), a method that identifies the smallest elements of a text and classifies them into predefined entity types such as person, place, product, and date.

Using NEE, a probability can be assigned that a particular document refers to a named entity.

NEE is based on natural language processing (NLP). After analyzing the structure of the text, NEE generates a score for every entity identified in it. By applying a threshold to these scores, the retained entities can be used to create structured features, which can then be used in prediction models.

NEE has been successfully developed for news analysis and biomedical applications [1].
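The score-and-threshold flow of NEE can be sketched as follows. This is a toy illustration, not a real NEE system: production tools analyze sentence structure with NLP models, whereas this sketch substitutes a small lookup table (a gazetteer) and a date pattern. The gazetteer entries, the scoring rule (an entity's share of all mentions), and the threshold are all assumptions.

```python
import re
from collections import Counter

# Hypothetical gazetteer: surface forms mapped to entity types.
GAZETTEER = {
    "raj": "PERSON",
    "jyoti": "PERSON",
    "satara": "PLACE",
    "kolhapur": "PLACE",
    "laptop": "PRODUCT",
}
# Assumed date format for this sketch: dd/mm/yyyy.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_entities(text):
    """Return a Counter of (entity, type) mentions found in the text."""
    mentions = Counter()
    for token in re.findall(r"[A-Za-z]+", text):
        etype = GAZETTEER.get(token.lower())
        if etype:
            mentions[(token.lower(), etype)] += 1
    for date in DATE_PATTERN.findall(text):
        mentions[(date, "DATE")] += 1
    return mentions


def score_entities(mentions, threshold):
    """Score each entity by its share of all mentions; keep scores >= threshold."""
    total = sum(mentions.values())
    if total == 0:
        return {}
    scores = {entity: count / total for entity, count in mentions.items()}
    return {entity: s for entity, s in scores.items() if s >= threshold}


text = ("Raj ordered a laptop on 12/05/2014 from Satara. "
        "Raj later complained that the laptop arrived damaged.")
mentions = extract_entities(text)
features = score_entities(mentions, threshold=0.2)
print(features)
```

The retained entities could then be turned into structured features, for example one column per entity, for use in a prediction model.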

Another text analytics technique, widely used in emerging areas such as topic modeling, is LDA (Latent Dirichlet Allocation). LDA is mainly used to find the main topics or themes that run through a large unstructured collection of documents, and it is also useful for detecting changes in customer behavior.

LDA is an unsupervised method that is applied to unstructured data. Unlike NEE, it is not NLP based; rather, it looks for patterns in the text.

LDA can be applied to structured, unstructured, or semi-structured data from any number of sources to identify patterns in text.
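To make the idea of unsupervised topic discovery concrete, the following is a minimal sketch of LDA fitted with collapsed Gibbs sampling, one common inference method for this model. The function name, the hyperparameter values (alpha, beta, number of iterations), and the toy documents are assumptions chosen for illustration; the paper does not specify an inference procedure.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab), where
    doc_topic[d][k] is the estimated share of topic k in document d and
    topic_word[k][w] is the estimated probability of vocab[w] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    n_dk = [[0] * num_topics for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                       # total tokens per topic
    assignments = []

    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wid] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wid] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wid] += 1
                n_k[k] += 1

    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
        for k in range(num_topics)
    ]
    return doc_topic, topic_word, vocab


# Hypothetical toy corpus with billing-related and device-related documents.
docs = [
    "price refund billing invoice price".split(),
    "refund invoice billing payment".split(),
    "battery screen camera battery".split(),
    "camera screen battery charger".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
for row in doc_topic:
    print([round(p, 2) for p in row])
```

Each row of doc_topic is a probability distribution over topics for one document. No labels are supplied, which reflects the unsupervised nature of the method; on real collections a tested library implementation would normally be preferred over a hand-written sampler.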

Text analytics is widely used in emerging areas such as information extraction, topic models, and opinion/sentiment analysis.

When working with text in big data, where exactly is a text analytics technique applied? Text data are typically held as notes, documents, and various forms of electronic correspondence (emails, for example). Structured data, on the other hand, are usually contained in databases with fixed structures. Many data mining techniques have been developed to extract useful patterns from structured data, and this process is often enhanced by the addition of variables (called features) that add new ‘dimensions’, providing information that is not implicitly contained in existing features. The appropriate processing of text data can allow such new features to be added, improving the effectiveness of predictive models or providing new insights [5].

Let us consider an example involving customer information. Structured data about a customer is stored in a database using fields such as name, salary, and order value, and there may also be unstructured text about the customer. Both the structured data and the unstructured text lack a customer ID. By applying a text analytics technique such as NEE or LDA, one can infer the nature, strength, or absence of relationships among individuals and yield a new feature such as sentiment. Mining the data then produces predictive patterns that can be used in applications to create valuable information.

Fig. 2. Where to apply text analytics?
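The customer example can be sketched in Python as follows. The structured records use the values shown in Fig. 2; the free-text notes, the tiny sentiment lexicon, and the function names are invented for illustration. A simple word-count lexicon stands in for the text analytics step, which in practice would be a technique such as NEE or LDA.

```python
import re

# Hypothetical sentiment lexicon; a real system would use a much larger one.
POSITIVE = {"good", "great", "happy", "fast", "excellent"}
NEGATIVE = {"bad", "late", "damaged", "slow", "unhappy"}


def sentiment(text):
    """Label text Positive, Negative, or Neutral by counting lexicon words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


def add_sentiment_feature(records, notes):
    """Return copies of the structured records with a Sentiment field added.

    Records and notes are matched on the customer name because no
    customer ID is available in either source.
    """
    enriched = []
    for record in records:
        text = notes.get(record["Name"], "")
        enriched.append({**record, "Sentiment": sentiment(text)})
    return enriched


# Structured data as shown in Fig. 2.
records = [
    {"Name": "Raj", "Salary": 45000, "Order": 3400},
    {"Name": "Jyoti", "Salary": 65000, "Order": 4500},
]
# Unstructured notes, invented for this sketch.
notes = {
    "Raj": "Order arrived late and the box was damaged.",
    "Jyoti": "Great service, very happy with the fast delivery.",
}
enriched = add_sentiment_feature(records, notes)
for row in enriched:
    print(row)
```

The enriched records now carry a Sentiment column alongside the original fields, so a data mining step can look for patterns that involve both the structured values and the text-derived feature.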

4. OBSTACLES IN TEXT ANALYTICS

Although the field is new, text analytics is reaching a level of development that makes widespread use possible.

The following are the obstacles to the adoption of text analytics:

1. How to deploy the results?
2. How to handle heterogeneous data?
3. Lack of methods to determine what exactly is in the text.

Recent technological developments have overcome these problems.

5. CONCLUSION

The real strength of big data lies in its utilization. Big data contains a huge amount of text, which can be used to extract meaningful information. Using big data to train a classifier and applying NLP along with text analytics techniques can yield valuable information from raw text data. This paper has discussed how NEE, LDA, and term matrices can be used to extract information from large unstructured text.

REFERENCES

[1] Hsinchun Chen, Roger H. L. Chiang, and Veda C. Storey. 2012. Business intelligence and analytics: From big data to big impact. MIS Quarterly, Vol. 36, No. 4, pp. 1165-1188, December.

[2] https://www.gartner.com/doc/2574616

[3] Salton, G. 1989. Automatic Text Processing, Reading, MA: Addison Wesley.

[4] Manning, C. D., and Schütze, H. 1999. Foundations of Statistical Natural Language Processing, Cambridge, MA: The MIT Press. March, S. T., and Storey, V. C. 2008. “Design Science in the Information Systems Discipline,” MIS Quarterly (32:4), pp. 725-730.

[5] http://butleranalytics.com/unstructured-meets-structured-data/

[6] http://www.informs.org/

[7] http://www.twitter.com. Twitter, Inc.

[8] http://www.google.com. Google, Inc.

[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, March 2003.

[10] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi. Short and Tweet: Experiments on Recommending Content from Information Streams. CHI 2010.

[11] W. Dou, X. Wang, R. Chang, and W. Ribarsky. ParallelTopics: A Probabilistic Approach to Exploring Document Collections. In Visual Analytics Science and Technology (VAST), 2011 IEEE Conference on, 2011.

[12] Salton, G., and Buckley, C. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24 (5):513-523.

[13] X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’06, pages 178–185, New York, NY, USA, 2006. ACM.

Content of Fig. 2 (recovered from the figure): structured data and unstructured data pass through text analytics, which adds a new feature; data mining then finds patterns.

Structured data before text analytics:
Name    Salary   Order
Raj     45,000   3400
Jyoti   65,000   4500

After text analytics (new feature: Sentiment):
Name    Salary   Order   Sentiment
Raj     45,000   3400    Negative
Jyoti   65,000   4500    Positive