© 2013 IBM Corporation
Big Data, NLP and Industry Applications L Venkata Subramaniam
IBM Research India
Content
§ Introduction
– What is Big Data?
– Map Reduce
– Noise in Text – ways of handling it
§ Application in Commerce
– Processing internal and external data
Big Data – Big Analytics
Traditional / Non-traditional data sources
Analytics delivery
Powerful Analytics
Algo Trading
Telco churn predict
Smart Grid
Cyber Security
Government / Law enforcement
ICU Monitoring
Environment Monitoring
Velocity: Insights in microseconds
Volume: Terabytes per second, petabytes per day
Variety: All kinds of data, all kinds of analytics
Veracity: Data in doubt; inconsistent, ambiguous
Think Big
What is Hadoop?
An open-source, batch (offline) oriented, data- and I/O-intensive general-purpose framework for creating distributed applications that process huge amounts of data.
HUGE
- Few thousand machines
- Peta-bytes of data
- Processing thousands of jobs each week
What’s so special about Hadoop?
Storage: Distributed, Reliable, Commodity gear
MapReduce: Parallel Programming, Fault Tolerant
Scalable: New nodes can be added on the fly
Affordable: Massively parallel computing on commodity servers
Flexible: Hadoop is schema-less – can absorb any type of data
Fault Tolerant: Through MapReduce software framework
Map-Reduce
§ Compute Avg of B for each distinct value of A
A B C
R1 1 10 12
R2 2 20 34
R3 1 10 22
R4 1 30 56
R5 3 40 17
R6 2 10 49
R7 1 20 44
MAP 1 (R1-R3): (1, 10) (2, 20) (1, 10)
MAP 2 (R4-R7): (1, 30) (3, 40) (2, 10) (1, 20)
Reducer 1: (1, [10, 10, 30, 20]) -> (1, 17.5)
Reducer 2: (2, [20, 10]) (3, [40]) -> (2, 15) (3, 40)
Task Map (break task into small parts)
Reduce (many results to a single result set)
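The averaging example above can be sketched in plain Python. This is a minimal single-process illustration under stated assumptions, not Hadoop API code: the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and the split of rows between the two mappers simply follows the slide.

```python
from collections import defaultdict

# Rows from the slide: (row id, A, B, C)
ROWS = [
    ("R1", 1, 10, 12), ("R2", 2, 20, 34), ("R3", 1, 10, 22),
    ("R4", 1, 30, 56), ("R5", 3, 40, 17), ("R6", 2, 10, 49),
    ("R7", 1, 20, 44),
]

def map_phase(rows):
    """Emit one (A, B) pair per input row."""
    return [(a, b) for _, a, b, _ in rows]

def shuffle(mapped_outputs):
    """Group values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Average the grouped B values for each distinct A."""
    return {key: sum(values) / len(values) for key, values in groups.items()}

map1 = map_phase(ROWS[:3])   # MAP 1 handles R1-R3
map2 = map_phase(ROWS[3:])   # MAP 2 handles R4-R7
averages = reduce_phase(shuffle([map1, map2]))
print(averages)
```

On a real cluster an averaging job would usually pass (sum, count) pairs between stages, because partial averages cannot be combined directly.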
What is Noise
Big Data, Fast Data, Noisy Data
30% of the world population is on the internet, and increasing fast
There are more social networking accounts than people in the world
Social Networking overtakes Search: Facebook becomes the most visited website ahead of Google
I’ll see ya tomo RIP Jackson
I’m lookie out 4 a car 2 burn rubber on the streets of LA
What should I buy?? A mini laptop with Windows OR a Apple MacBook!??!
Social Media Communication is meant for Friends
Noisy, Informal, Implicit and Contextual Conversations
Big Data: More video content was uploaded onto YouTube in the past two months than all the new content ABC, CBS and NBC have been airing 24/7 since 1948.
Type of Text WER
SMS (texting) 50%
Tweets 35%
ASR 30%
Web queries 15%
OCR 5%
Newswire Text (WSJ, Reuters, NYT) 0.005%
Up to 10,000 times more noisy
55 million Tweets per day
Lead Generation, Disaster Tracking
Large Dimensional, uncertain, unverified
SMS
§ 0 there – there § 1 aint – are not § 2 no – no § 3 doubt – doubt § 4 there – there § 5 hon – honey § 6 im – I am § 7 gonna – going § 8 be – be § 9 takin – taking § 10 it – it § 11 4 – for § 12 life – life § 13 u – You § 14 wont – wont § 15 b – be § 16 rida – rid of § 17 me – me § 18 lol – laugh out loud § 19 Ray – (NAME)
Texting Language: Over 50% of the words are written in non-standard ways
Spontaneous Language: Use of slang, ungrammatical, no punctuation, no case information
Mixing of Languages: Many SMS contain text in a mix of two or more languages
Type of Noise %
Deletion of Characters 48%
Phonetic Substitution 33%
Abbreviations 5%
Dialectical Usage 4%
Deletion of Words 1.2%
Across 101 SMSes, 52% of words were non-standard (Contractor et al., 2010)
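The token-by-token mapping shown on the SMS slide can be approximated with a lookup table. The sketch below is only an illustration: the table holds just the non-trivial pairs from the slide, and the helper name `normalize` is hypothetical. A plain lookup like this cannot resolve ambiguous tokens, which is why the later slides turn to learned mappings.

```python
# Noisy token -> standard form, copied from the slide's example message.
NORMALIZATION = {
    "aint": "are not",
    "hon": "honey",
    "im": "I am",
    "gonna": "going",
    "takin": "taking",
    "4": "for",
    "u": "You",
    "b": "be",
    "rida": "rid of",
    "lol": "laugh out loud",
}

def normalize(message):
    """Replace each whitespace-separated token found in the table."""
    tokens = message.split()
    return " ".join(NORMALIZATION.get(token.lower(), token) for token in tokens)

print(normalize("there aint no doubt there hon im gonna be takin it 4 life"))
print(normalize("u wont b rida me lol"))
```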
What is Noisy Text?
§ Any kind of difference in the surface form of an electronic text from the intended, correct or original text (Knoblock et al., 2007)
§ Noise can be at the lexical level {b4, before, befour}
• Resulting in substitution, insertion, deletion, transposition, run-on, and split.
§ Noise can be at the morphological, syntactic or discourse level {I can hear u, I can hear you, I can here you}
• Resulting in substitution, insertion, deletion, transposition of words and the introduction of out-of-vocabulary words.
Classifying Noise
Lexical Errors (Subramaniam et al., 2009)
§ Missing characters {before > bef}
§ Extra characters {raster > raaster}
§ Phonetic substitution {before > b4, late > l8}
§ Abbreviations {laugh out loud > lol, United Nations > UN}
Syntactical Errors (Kukich, 1992; Foster et al., 2007)
§ Missing Word {What are the subjects? > What the subjects?}
§ Extra word {Was that in the summer? >Was that in the summer it?}
§ Real word spelling errors {She could not comprehend. > She could no comprehend.}
§ Agreement {She steered Melissa round a corner. > She steered Melissa round a corners.}
§ Dialectical usage {I’m going to be there > I’m gonna be there}
Techniques for Automatically Detecting Lexical Errors (Kukich, 1992)
§ Efficient methods to detect strings that do not appear in a given word list, dictionary or lexicon
– Nonword error detection
§ Two approaches
– N-gram
• Look up each n-gram in an input string in a precompiled table to ascertain either its existence or its frequency. Nonexistent or infrequent n-grams (shj, iqn) are identified as possible misspellings.
• Good for identifying errors made by OCR devices
• But unusual/foreign language valid words will be marked and nice-looking mistakes will be marked valid
– Dictionary based
• Input string appears in a dictionary? If not, the string is flagged as a misspelled word.
• But nearly two-thirds of the words in a dictionary did not appear in an eight million word corpus of New York Times text, and, conversely, two-thirds of the words in the text were not in the dictionary (1986 study)
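Both nonword detection approaches can be sketched in a few lines. The example below is a toy illustration, assuming a tiny hand-picked lexicon and character trigrams with boundary markers; the names `build_ngram_table`, `flag_by_ngram` and `flag_by_dictionary` are hypothetical, and a real system would precompile its table from a large corpus.

```python
from collections import Counter

# Toy lexicon (assumption); a real system would load a full word list.
LEXICON = {"the", "customer", "scheme", "before", "tomorrow", "recharge"}

def char_ngrams(word, n=3):
    """Return character n-grams of the word with boundary markers."""
    padded = f"^{word}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def build_ngram_table(words, n=3):
    """Precompile n-gram frequencies from known-good words."""
    table = Counter()
    for word in words:
        table.update(char_ngrams(word, n))
    return table

def flag_by_ngram(word, table, n=3):
    """Flag a word if any of its n-grams never occurs in the table."""
    return any(table[gram] == 0 for gram in char_ngrams(word, n))

def flag_by_dictionary(word, lexicon):
    """Flag a word if it is absent from the lexicon."""
    return word.lower() not in lexicon

TABLE = build_ngram_table(LEXICON)
for token in ["customer", "kustamar", "shjeme"]:
    print(token, flag_by_ngram(token, TABLE), flag_by_dictionary(token, LEXICON))
```

With such a small table the n-gram check also flags many valid words, which mirrors the weakness noted on the slide.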
Techniques for Automatically Detecting Incorrect Grammar (Syntax) (Foster et al., 2007)
§ Efficient methods to detect word sequences that do not form a grammatical sentence
§ Three approaches
– N-gram
• Classifies a sentence as ungrammatical if it contains an unusual part-of-speech sequence
– Precision-grammar
• Classifies a sentence using a parser and a broad-coverage hand-written grammar
– Probabilistic-parsing
• Finds sentences with parsing errors
Spelling Correction (Kukich, 1992)
§ Isolated Word Correction
– Minimum edit distance techniques
– Similarity key techniques
– Probabilistic techniques
– N-gram-based techniques
– Rule-based techniques
– Will not catch typos resulting in correctly spelled words {form, from}
– Estimates put real word errors at 30% of all word errors
§ Context-Dependent Word Correction
– Parsing
– Language models
– Can errors be ignored and a meaningful interpretation still be made? {I am coming with you, I comes with you}
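As an illustration of the minimum edit distance technique for isolated word correction, here is a short sketch using Levenshtein distance. The lexicon and the `correct` helper are assumptions made for the example; a production corrector would also weigh word frequency and keyboard or phonetic confusions.

```python
def edit_distance(source, target):
    """Levenshtein distance: insert, delete and substitute each cost 1."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character into source
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def correct(word, lexicon):
    """Return the lexicon entry closest to `word` (ties broken alphabetically)."""
    return min(sorted(lexicon), key=lambda candidate: edit_distance(word, candidate))

LEXICON = ["tomorrow", "customer", "scheme", "before"]  # toy lexicon (assumption)
for noisy in ["tomorow", "tomoro", "sceam", "costmer"]:
    print(noisy, "->", correct(noisy, LEXICON))
```

Note that `form` and `from` are both valid words, so a corrector of this kind leaves a real-word error untouched; that case needs the context-dependent methods listed above.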
Tomorrow never dies!!!
§ 2moro (9) § tomoz (25) § tomoro (12) § tomrw (5) § tom (2) § tomra (2) § tomorrow (24) § tomora (4)
§ tomm (1) § tomo (3) § tomorow (3) § 2mro (2) § morrow (1) § tomor (2) § tmorro (1) § moro (1)
dis is n eg 4 txtin lang
This is an example for Texting language
§ Extreme corruption of words and sentences
§ Models for SMS language are lacking
SMS Text Normalization
Finding Canonical Sets (Acharyya, 2009)
§ Learn mappings
§ How can we do it in an unsupervised way?
§ Find some invariant that does not change in spite of corruptions
§ Buckets of context seem invariant!
– <..Back Bucket...> sceam <..Front Bucket...>
– sceam : sms(2) new(5) recharge(4) tel-provider(2) about(3)
– <..Back Bucket...> scheme <..Front Bucket...>
– scheme : sms(4) new(2) activate(3) tel-provider(2) about(1) recharge(1)
costmer, castumar, kustamar, coustomber -> customer
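One way to test whether two surface forms share a context bucket is to compare their context-word counts. The sketch below uses cosine similarity, which is an assumption for illustration; the slide does not say which measure the cited work uses. The counts are the ones shown for `sceam` and `scheme`.

```python
import math

# Context-word counts copied from the slide.
CONTEXTS = {
    "sceam": {"sms": 2, "new": 5, "recharge": 4, "tel-provider": 2, "about": 3},
    "scheme": {"sms": 4, "new": 2, "activate": 3, "tel-provider": 2,
               "about": 1, "recharge": 1},
}

def cosine(bucket_a, bucket_b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * bucket_b.get(word, 0) for word, count in bucket_a.items())
    norm_a = math.sqrt(sum(count * count for count in bucket_a.values()))
    norm_b = math.sqrt(sum(count * count for count in bucket_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

similarity = cosine(CONTEXTS["sceam"], CONTEXTS["scheme"])
print(round(similarity, 3))
```

Word pairs whose similarity exceeds a chosen threshold could then be grouped into one canonical set; the threshold itself would have to be tuned on data.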
To Whom Does It Matter?
§ Research Community :)
§ Business Community - New tools, new capabilities, new infrastructure, new business models, etc.
Financial Services..
Customer 360
§ Understand Customer Needs: Target individual customers better by understanding their spend patterns, spend locations, intent, sentiment, life events and propensity
§ Combine internal and external data for a 360 view of the customer
§ For each customer determine their product propensity
§ Predict where, when and what spend is likely for a customer
§ Roll out offers based on customer propensities
• By combining internal and external data from enterprise, social and mobile, determine where, when and what her next spend will be
Food
books
Entertainment
For each customer determine his/her behavior or signature in terms of movement patterns, hangouts, intent and spends
Arti Mehra
Entity (people, products, events) Insights The problem Solution
What are the key product interests of person A?
Over time learn about the person’s product interests from her social media postings
What is the location and trajectory of person B?
Gives the current location and locations in the past
What life events happened in person A’s life in the past x months?
List significant events like marriage, birth of a child, relocation, etc.
What are the events of interest happening in a given location?
Lists the top events in a given geography
What is the sentiment on a given product?
Gives the sentiment on a product
Key Sustained Value Factor:
Understand customers wants and needs better
IBM Confidential
Analyze social data in the context of enterprise data to build entity and event profiles and establish linkages between them for online and offline analysis
What does 360 context mean?
§ Builds an entity’s complete profile by aggregating data about the entity from social and enterprise data sources. Here an entity refers to people, products, brands and events.
360 Context
intent to purchase for
customers
Application Domains
Social Data
Enterprise Databases
User Domains
propensities/ sentiment/intent • event Detection • entity Linkages • sentiment
core customer view/transactions • event Profiles • entity Profiles
Smarter Commerce
Smarter Cities
real-time public safety events
Enterprise and Social Media Integration
Personal Attributes • Identifiers: name, address, age, gender, occupation… • Interests: beaches, cuisines, historical architects etc • Life Cycle Status: marital, parental
Relationships • Personal relationships: family, friends and roommates… • Business relationships: co-workers and work/interest network…
Travel Interests • Personal preferences of class, travel interests, social influence, cuisines, hotels • Travel history
Social Media
Facebook Twitter
Life Events • Life-changing events: relocation, having a baby, getting married, getting divorced, buying a house…
Profile: Name, Address, Designation, Contact Number, Membership Status
Transaction History: Purchase Date, Time, Store, Item, Amount
Timely Insights • Intent to travel various destinations, airline preference, travel class, business • Last travel
CRM: Request Id, Complaint, sentiments
360 View Creation, Querying, Analysis and Reporting
Internal Profile
@Ram Sarin: I need a smart phone that’s smarter than my old phone…suggestions?
Store: Pit Stop, Delhi Airport Amount: Rs 1,650 Date: 17:35, 3rd June 2014
Assemble an entity view of the customer, aggregate data from thousands of different documents spread across internal and external, structured and unstructured, text and non-text sources
Movement Patterns • Hangouts (weekend, weekday, evening, etc.) • Spatio temporal signatures
Name: Ram Sarin Shukla Address: Bhogal, New Delhi Corporate: IBM
Social Media Analytics
Spatio-Temporal Hangout Analysis
360 View from Multiple Documents Spend Analytics
Influencing Online Search using Social Analytics
Signed Customer Guest
Enter Search Term
Customer-Specific Search Results
Trending Products
Social Influenced Search:
§ Customers open the commerce site
§ Customers are shown a list of trending products based on the social feed
§ Once they are logged in, they see more specific products and brands associated with themselves and their network.
Benefits:
§ Allows customers to see what is hot not just within a merchant’s site, but in a more social context
Offer Generation using Social Analytics
Customer
Trigger: Life event, Current Spend, Spend Behaviour
Customer-Specific Offers
Social Influenced Offers:
§ Live offers are ranked using customer propensity based on life events, spend behaviour, hangout analysis
§ Customers are sent offers with the highest likelihood of acceptance
§ Right-time offers, like when the customer is in the mall shopping
Benefits:
§ Allows customers to be presented with precisely relevant offers based on specific needs
Demo: http://malhar.irl.in.ibm.com:8080/MDM_MovieAnalysis/
Data Analytics Pipeline
Data Ingest and Prep
Extract Buzz, Intent, Sentiment
Entity Analytics: Profile Resolution
Real time analytics. Pre-defined views and charts
Dashboard
Stream Computing and Analytics
BigInsights System and Analytics
Online flow: Data-in-motion analysis
Offline flow: Data-at-rest analysis
Pre-defined Workbooks and Dashboards
Social Media Data + Enterprise Data
Extract Buzz, Intent, Sentiment and Consumer Profiles
Entity Analytics and Integration
Comprehensive Customer Profiles
Social Media
Enterprise
CRM
Tx
Spatio-Temporal Analytics and Integration
Extract Spatial Location & Temporal Presence
Advanced Text Analytics and Entity Resolution
§ Input
– For each user, his/her social media data collected over time that represents his/her online content in terms of text message postings, product mentions, check-ins
§ Output
– Social Media Profile of the User
• Sentiment on different products, likes, dislikes
• Intent to purchase, purchases and product mentions
• Location details in terms of where certain activities were carried out
• Extracted Life events
– Matching of social media profile to Enterprise profiles
§ Novelty
– Works on BigInsights Map-Reduce framework with high Scalability
– Identification and categorization of key concepts from noisy social media data
– Entity Resolution across Enterprise customer data with Social Media profiles using sparsely populated attributes. Uses Name, Context clues like customer clues, product purchases, different granularities of location clues gathered from Social Media
Example Analysis: Extraction from Twitter messages
Remove Spam
Extract intent, interests, life events and micro segmentation attributes
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR
@silliesraghu good!!! U shouldnt! Think about the important stuff, like ur birthday ;) btw happy birthday Raghu ;)
@rakonturdelhi im moving to delhi in 3 months. i look foward to the new lifestyle
I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!
Monetizable Intent
Relocation
Location
Name, Birth Day
Subtle Spam, Advertising
Sarcasm, Wishful Thinking
While accounting for less relevant messages
I think that @justinbieber deserves his 2 AMAZING songs in top ten!!! Buy them on itunes
http://Cell-Pones.com Looking to buy a phone? WiFi Cell Phones, Windows Mobile
@purplepleather Gotta do more research my Versace term paper 2day. Before I die, I want a versace purple diamond tiara. Im just sayin>lol
had so much fun today! I want to buy a million dollar house with a wrap around porch ... ... wading river on the long island sound, ha i wish!
Prior Business Transactions
Customer ready to buy a DSLR camera today, possibly at a nearby mall
Buying a DSLR today !
Thrza gr8 deal on ZX-550 @ the mall
Go for the best, DP-2000
Michael’s online friends offer lots of advice
Wifey’s birthday tomorrow, looking for a killer dslr
Sarcasm, Wishful Thinking
Maybe I should buy her that purple roadster, while I’m at it. ;-) lol
Text Analytics used to extract intent from Social Media Married, Male, Spouse Birthdate, Gift Type, Intent to Purchase, Timeframe
Intent to Purchase, Gift Type?
Potential Locations and
Activity
In NYC area this w/e, any good malls nearby?
Region & City Location, Timeframe, Intent to Shop
§ Resultant fact base contains billions of facts, and is incrementally updated § Fact segmentation or clustering is rapid enough to drive a business decision
More data: Customer intent extracted from social media provides context
Entity Extraction, Fact Discovery, Intent & Sentiment
Social Data
450M+ tweets/day Millions of tweets yield one company-specific fact
Influencers
Buying DSLR today!
Intent
Example Analysis: Linking Social Profiles with Customer Database
§ Identify candidate matches based on name and location
§ Advanced text analysis provides strong supporting evidence that is leveraged by the matching algorithm:
– Explicit customer interaction through social media customer support
– Mention of company product use
§ Customer Identity confirmed through additional disambiguation techniques
– E.g., uniqueness of customer information (name, products owned) in specific geography
Name: Tarun Rana Screen name: @tarunrana Extracted Location: Delhi
Messages This stupid <bankname> app never works for me !!!!
@BankSupport well how long is my online deposit gonna take it has been like three days … i might as well have gone to the atm
Cust.# 12345
Name Tarun Rana
Address DX 40, Kendriya Vihar
City Gurgaon
State Haryana
Customer DB
All names and identifiers are fictitious
Semantic Name Variations Tarun Rana vs. Rana, Tarun Kumar K. Singh vs. Karan (Happy) Singh
Geo Proximity Gurgaon, Haryana vs. New Delhi UP vs. Delhi
Job Role Disambiguation
“Software sales manager at IBM…” vs. “Managing SPSS Sales for Canada…”
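The candidate matching step (name variations plus geographic proximity) can be sketched as a weighted score. Everything below is an assumption made for illustration: the token-overlap name score, the hand-written `SAME_REGION` table, the 0.7 name weight and the function names are hypothetical and are not the matching algorithm described on the slide.

```python
import re

def name_tokens(name):
    """Lowercase alphabetic tokens, ignoring order and punctuation."""
    return set(re.findall(r"[a-z]+", name.lower()))

def name_score(social_name, customer_name):
    """Jaccard overlap of name tokens, so 'Rana, Tarun' matches 'Tarun Rana'."""
    a, b = name_tokens(social_name), name_tokens(customer_name)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical proximity table: city pairs treated as one metro region.
SAME_REGION = {
    frozenset({"delhi", "gurgaon"}),
    frozenset({"delhi", "new delhi"}),
}

def location_score(social_location, customer_city):
    """1.0 for the same city, 0.5 for the same region, else 0.0."""
    a, b = social_location.lower(), customer_city.lower()
    if a == b:
        return 1.0
    return 0.5 if frozenset({a, b}) in SAME_REGION else 0.0

def match_score(profile, record, name_weight=0.7):
    """Weighted combination of name and location evidence."""
    return (name_weight * name_score(profile["name"], record["name"])
            + (1 - name_weight) * location_score(profile["location"], record["city"]))

profile = {"name": "Tarun Rana", "location": "Delhi"}   # social media profile
record = {"name": "Tarun Rana", "city": "Gurgaon"}      # customer DB row
print(match_score(profile, record))
```

Supporting evidence such as product mentions or customer-support interactions would then be used to confirm or reject candidates above a score threshold.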
Geo-located documents
Textual clues
Example Analysis: Inferring Location from Multiple Clues Social Media Profile Screen name : @tarunrana
Location: Delhi Name: Tarun Rana
Name: Tarun Rana Screen name: @tarunrana Location: Delhi Description: just a Nor-Cal gal trying to fall in love with Florida
Messages
I'm at Barista, SDA (New Delhi) http://4sq.com/SZ3yjj
I'm at S.o.G (New Delhi) http://4sq.com/UCweM5
Gotta love wc football #wc #brazil http://instagr.am/p/QOHPqabdYt/
I'm at Eats American Grill (New Delhi) http://4sq.com/O2a1Jm
Check out my blog about #food in #ChandniChowk http://www.citybythebay.com
Who's watching the #brazil tonight? (from 27.27989014,-82.34825406)
Fusion libraries: • Confidence: place mentions vs. geo-codes • Analysis of location time-series
Check-ins
Metadata
Fusion libraries: • Confidence: metadata vs. content
Disambiguation, fusion of partial information
Temporary location
Permanent location
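A rough sketch of how place mentions and geo-codes might be pulled from the example messages before fusion. The regular expressions and the simple majority vote are assumptions for illustration; the slide's fusion libraries additionally weigh confidence and the location time series, which this sketch does not attempt.

```python
import re
from collections import Counter

# Messages copied from the slide.
MESSAGES = [
    "I'm at Barista, SDA (New Delhi) http://4sq.com/SZ3yjj",
    "I'm at S.o.G (New Delhi) http://4sq.com/UCweM5",
    "Gotta love wc football #wc #brazil http://instagr.am/p/QOHPqabdYt/",
    "I'm at Eats American Grill (New Delhi) http://4sq.com/O2a1Jm",
    "Who's watching the #brazil tonight? (from 27.27989014,-82.34825406)",
]

CHECKIN = re.compile(r"^I'm at .+? \(([^()]+)\)")          # check-in style place mention
GEOCODE = re.compile(r"\(from (-?\d+\.\d+),(-?\d+\.\d+)\)")  # embedded latitude/longitude

def place_votes(messages):
    """Count the place named in each check-in style message."""
    votes = Counter()
    for message in messages:
        match = CHECKIN.search(message)
        if match:
            votes[match.group(1)] += 1
    return votes

def geocodes(messages):
    """Collect explicit (latitude, longitude) pairs found in messages."""
    return [(float(lat), float(lon))
            for message in messages
            for lat, lon in GEOCODE.findall(message)]

votes = place_votes(MESSAGES)
inferred = votes.most_common(1)[0][0] if votes else None
print(inferred, geocodes(MESSAGES))
```

Here the repeated check-ins suggest the usual location, while the single geo-code far from it is a candidate for a temporary location.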
All names and identifiers are fictitious
Advanced spatio-temporal analytics
§ Input
– For each user, his/her spatio-temporal data collected over time that represents his/her behavior or signature in terms of movement, hangouts, spend patterns (from card swipes, mobile location data)
§ Output
– Given a specific time in future, the possible location(s) at which the user could be present along with possible spends
– Given specific location(s), the possible time in future at which the user could visit this location(s) and likely spends at the location(s)
– Given a specific location and/or time, the likely spend
§ Novelty
– Spend Prediction using Periodicity from User’s Behavior: We use the concept of periodicity to predict the location and related spend of a user. E.g., “When you are in the mall what are you likely to purchase?”
– Spatio-Temporal Signature: We use the concepts of time-series mining for predicting the probability of future temporal occurrence of a user at a given set of spatial location/s. E.g., “When will user U next be in the vicinity of Red Fort?”
– Offer Propensity: We statistically model the customer to predict his spend behaviour based on location, time, past spends, etc.
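The periodicity idea can be illustrated with a deliberately simple estimator: predict the next visit to a location as the last visit plus the mean gap between past visits. This stands in for the time-series mining mentioned above and is not the authors' method; the visit timestamps and the name `predict_next_visit` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean

def predict_next_visit(visits):
    """Predict the next visit as the last visit plus the mean gap between visits."""
    ordered = sorted(visits)
    if len(ordered) < 2:
        raise ValueError("need at least two visits to estimate a period")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return ordered[-1] + timedelta(seconds=mean(gaps))

# Hypothetical card swipes at one location, roughly weekly.
visits = [
    datetime(2014, 6, 1, 18, 0),
    datetime(2014, 6, 8, 18, 30),
    datetime(2014, 6, 15, 17, 30),
]
print(predict_next_visit(visits))
```

A fuller model would report a probability over time windows rather than a single timestamp, and would condition on day of week and past spends.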
The Concept of an ST Card Swipe Signature
Day 1: Loc 2 8:30-8:40 | Loc 3 8:40-9:00 | Loc 4 9:00-18:00 | Loc 5 18:00-18:30 | Loc 1 20:30-8:30
Day 2: Loc 2 8:30-8:40 | Loc 3 8:40-9:00 | Loc 4 9:00-18:00 | Loc 3 18:00-18:20 | Loc 2 18:20-18:30 | Loc 1 18:30-8:30
Day 3: Loc 5 8:30-9:00 | Loc 4 9:00-18:00 | Loc 5 18:00-18:05 | Loc 1 18:05-8:30
Day 4: Loc 2 8:30-8:40 | Loc 3 8:40-9:00 | Loc 4 9:00-18:00 | Loc 5 18:00-18:30 | Loc 1 18:30-8:30
Day 5: Loc 5 8:30-9:00 | Loc 4 9:00-18:00 | Loc 3 18:00-18:20 | Loc 2 18:20-18:30 | Loc 1 18:30-8:30
Day 6: Loc 6 12:30-12:40 | Loc 7 12:40-14:00 | Loc 6 14:00-14:10 | Loc 1 18:30-12:30
Day 7: Loc 6 12:30-12:40 | Loc 7 12:40-20:00 | Loc 6 20:00-20:10 | Loc 1 14:20-12:30
…
[Figure: the card owner’s swipes across Loc 1 to Loc 7, together with MetaData (location type, merchant type, amount, etc.), form the ST-Signature]
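One simple way to turn the daily swipe sequences into a signature is to count transitions between locations and read off the most frequent successor. This first-order transition count is an assumption for illustration, not the signature definition used in the talk; the sequences below list only the location numbers for Day 1 to Day 7, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Location sequences per day, read off the slide (location numbers only).
DAYS = [
    [2, 3, 4, 5, 1],
    [2, 3, 4, 3, 2, 1],
    [5, 4, 5, 1],
    [2, 3, 4, 5, 1],
    [5, 4, 3, 2, 1],
    [6, 7, 6, 1],
    [6, 7, 6, 1],
]

def build_transitions(days):
    """Count how often each location is followed by each other location."""
    transitions = defaultdict(Counter)
    for sequence in days:
        for current, following in zip(sequence, sequence[1:]):
            transitions[current][following] += 1
    return transitions

def most_likely_next(transitions, location):
    """Return the most frequent successor, or None if the location has none."""
    counts = transitions.get(location)
    return counts.most_common(1)[0][0] if counts else None

TRANSITIONS = build_transitions(DAYS)
print(most_likely_next(TRANSITIONS, 4))
print(dict(TRANSITIONS[4]))
```

Adding time-of-day and the metadata fields (merchant type, amount) to each transition would move this closer to a spatio-temporal signature.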
Example Analysis: Predicting Next Purchase Location/Time
Signature Generator
Spatial Prediction Query: When will user swipe his card at location X again?
Temporal Prediction Query: Where will user swipe his card next?
Spatio-Temporal Prediction
Predicted Temporal Values
Predicted Spatial Locations
Example Analysis: Identifying Similar Users For Recommendations
Signature Generator
Signature Generator
Similarity Framework
Pruning Similarity Computation
Similarity Visualization
Example Analysis: Categorization of Spending Patterns from Spatio-temporal Behavior
Signature Generator
Customer DB
Spending Patterns
Permanent Address
Current Address
Vacation Location
Social Media
Furnishings
Clothing & Apparel
Entertainment
Advanced Temporal Analytics
▪ Input
– For each user, his/her spending patterns (card swipes, online payments, etc.) over time
▪ Output
– For different times in future, different product recommendations based on spending patterns
▪ Novelty
– Captures contextual parameters for a better prediction
– Allows personalized recommendations of temporal nature
– Makes use of domain ontologies to generalize the product purchases required for better regression model
– Allows predictions for non-periodically bought products
– Identifying interesting package deals (tie-in sales)
Harry Potter 1
Harry Potter 2
Reebok Shoes
iPhone 3 iPhone 4
Pantene Shampoo
Dove Conditioner
Pantene Shampoo
Dove Conditioner
Head & Shoulders Shampoo
Dove Conditioner
Books -> Fiction, Medium price, Latest release, India
Electronics, High Price, Latest release, US
Books -> Fiction, Medium price, Latest release, India
Electronics, High Price, Latest release, US
Hair Care product, low price, available for long, India
?
TODAY 1 MONTH later
?
Candidate: Harry Potter 3
Pepsodent Toothbrush
Pepsodent toothpaste
Colgate toothpaste
Pepsodent toothpaste
Oral B toothbrush
Shoes, iPhone 5
Reebok Shoes
Example Analysis: Predicting Next Purchase, Location, Time
Offline Component
Modeling purchase patterns
Parameter Extractor User
History
For every customer, for every product, extract quantity, location, high/low end status, release date (latest/old), and so on
SystemML models can be used here. Regress for each product as well as general category of each product.
Regression Analysis
Online Component
Current Product Catalog
Recommender
Alerts, Reminders and updates
For predicting the recommendations, use the categorical periodicity and other parameters for best matches
The temporal product catalog to be used
Prediction Timestamp
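The offline step (generalise products to a category, then model when the category is bought again) can be sketched as follows. This is a simplified stand-in: the mean gap between purchases replaces the regression model, and the ontology, purchase history and function name are all hypothetical.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical ontology mapping products to a general category.
ONTOLOGY = {
    "Pantene Shampoo": "Hair Care",
    "Head & Shoulders Shampoo": "Hair Care",
    "Dove Conditioner": "Hair Care",
    "Harry Potter 1": "Books/Fiction",
    "Harry Potter 2": "Books/Fiction",
}

# Hypothetical purchase history: (purchase date, product).
HISTORY = [
    (date(2014, 1, 5), "Pantene Shampoo"),
    (date(2014, 2, 4), "Dove Conditioner"),
    (date(2014, 3, 6), "Head & Shoulders Shampoo"),
    (date(2014, 1, 20), "Harry Potter 1"),
    (date(2014, 4, 20), "Harry Potter 2"),
]

def next_purchase_by_category(history, ontology):
    """Forecast the next purchase date per category from the mean purchase gap."""
    by_category = {}
    for when, product in history:
        by_category.setdefault(ontology[product], []).append(when)
    forecast = {}
    for category, dates in by_category.items():
        dates.sort()
        if len(dates) < 2:
            continue  # a single purchase gives no period estimate
        gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
        forecast[category] = dates[-1] + timedelta(days=round(mean(gaps)))
    return forecast

forecast = next_purchase_by_category(HISTORY, ONTOLOGY)
print(forecast)
```

Working at category level is what lets a history of different shampoo and conditioner brands still yield a usable period, in line with the ontology point made above.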
Conclusion
§ Aggregate information from social media, enterprise and transaction data – a lot of information in a timely manner from everywhere about customers
– Social Media is used extensively by people to express intent, sentiment, product preferences, life events, etc.
– Spatio-Temporal information can be extracted from multiple sources like social media check-ins, mobility data, etc.
– 360 Customer view enables determining when, where and what the next spend will be
Thank You :)