My talk on Patent Visualization at The 3rd IEEE Workshop on Interactive Visual Text Analytics. Primary focus is to introduce the Scalable Visual Analytics research that my team is working on. Workshop paper can be found at: http://vialab.science.uoit.ca/textvis2013/papers/Ankam-TextVis2013.pdf
Visually Exploring Patent Collections for Events and Patterns
Derek X. Wang
Associate Director of the Charlotte Visualization Center
Together with: Wenwen Dou, Wlodek Zadrozny, Suraj Ankam, Debbie Strumsky, Terry Rabinowitz
Value
Businesses
• 800 patents
  • $1 billion worth of patents from AOL to Microsoft
• 1,100 patents from Kodak
  • $525 million to group license
• 17,000 patents
  • $12.5 billion, Motorola Mobility to Google
Technology
[Figure: time axis 2006 to 2010]
Cyan topic: variable, uncertainty, trend, correlation, linear, multivariate, sensitivity
Blue topic: dimension, quality, cluster, measure, lda, attribute, reduction, projection
Dataset: 123 publications from VAST proceedings, 2006-2010.
FODAVA
**X. Wang et al., ParallelTopics: A probabilistic approach to exploring document collections, IEEE VAST 2011
Goal
• Can we spot an emerging new technology?
  • Text mining and visualization
• Can we spot novelty within a patent?
  • How much do claims differ from class descriptions?
  • How much do claims differ from claims in other similar patents?
• Can we list “all” patents relevant for some technology? (And what does that mean?)
A Robust and Scalable Patent Analysis Infrastructure Is Needed
Visual Analytics Will Play a Key Role
Human + Computer = Balanced Analytics Technology
Challenge
• Unstructured or semi-structured
• Highly heterogeneous
• Leading to highly heterogeneous models
• Incomplete or with holes
• With intrinsic uncertainty (and in some cases deception)
• Inside and outside the enterprise
• Containing detailed time and space information
Research
Structuring the Unstructured: Topic Modeling
• Latent Dirichlet Allocation (LDA)
• Reveals latent topics from a large textual corpus
• Coherent sets of most likely words describe the topics
• Topics are defined by keyword groups
• Topics in text collections can effectively be inferred
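To make the idea of topics as keyword groups concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling in plain Python. It is an illustration only, not the implementation behind the talk; the toy documents, the function name `lda_gibbs`, and the hyperparameter values are all assumptions.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (toy scale)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                        # tokens assigned to topic k
    assign = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Each document becomes a mixture of topics; each topic a keyword group.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        [w for w, c in tw.most_common(3) if c > 0] for tw in topic_word
    ]
    return theta, top_words

# Hypothetical toy corpus, not data from the talk.
docs = [
    "antenna signal radio antenna".split(),
    "radio signal antenna receiver".split(),
    "storage software memory storage".split(),
    "software memory storage file".split(),
]
theta, top_words = lda_gibbs(docs, n_topics=2)
```

Here `theta` holds one topic mixture per document and `top_words` the most frequent words per topic. On a corpus this small the topics can vary with the seed, so treat the output as illustrative.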
Structuring the Unstructured: Investigative Element Extraction
• Recognition of entities, including people, locations, buildings, and organizations
• Recognition of times and dates
• Construction of a near-real-time analysis pipeline for entity association
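The three steps above can be sketched with a small dictionary lookup for entities, a regular expression for dates, and co-occurrence counts for association. This is a simplified stand-in, assuming a hand-written gazetteer in place of a trained recognizer; the entity names and sentences are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical gazetteer; a real pipeline would use a trained recognizer.
GAZETTEER = {
    "Acme Corp": "organization",
    "Globex": "organization",
    "Charlotte": "location",
}

# Matches dates such as "March 3, 2012" or "2013-10-14".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b|\b\d{4}-\d{2}-\d{2}\b"
)

def extract(sentence):
    """Return (entities, dates) found in one sentence."""
    entities = sorted(name for name in GAZETTEER if name in sentence)
    dates = DATE_RE.findall(sentence)
    return entities, dates

def associate(sentences):
    """Count how often two entities are mentioned in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        entities, _ = extract(sentence)
        pairs.update(combinations(entities, 2))
    return pairs

# Illustrative sentences, not taken from the talk.
sentences = [
    "Acme Corp licensed patents to Globex on March 3, 2012.",
    "Globex opened an office in Charlotte on 2013-10-14.",
    "Acme Corp and Globex met in Charlotte.",
]
pairs = associate(sentences)
```

Because `associate` handles one sentence at a time, it could be fed from a stream, which is one simple way to approximate a near-real-time pipeline.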
Structuring the Unstructured: Event Structuring
Events: meaningful occurrences in space and time
[Figure labels: Motivating Event; Particular Topic Stream]
Narrative: a series of clustered (event-based) stories, temporally linked based on content similarity.
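One way to read the narrative definition above is as a chain: each story is linked to the most similar earlier story, provided the similarity is high enough. The sketch below assumes bag-of-words cosine similarity and an arbitrary threshold of 0.3; neither choice comes from the talk, and the stories are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def link_narrative(stories, threshold=0.3):
    """Link each story to its most similar earlier story.

    stories: list of (timestamp, text) pairs.
    Returns the time-sorted stories and (earlier, later) index pairs.
    """
    ordered = sorted(stories)
    bags = [Counter(text.lower().split()) for _, text in ordered]
    links = []
    for j in range(1, len(ordered)):
        best = max(range(j), key=lambda i: cosine(bags[i], bags[j]))
        if cosine(bags[best], bags[j]) >= threshold:
            links.append((best, j))
    return ordered, links

# Hypothetical stories with integer timestamps.
stories = [
    (1, "storm hits coast"),
    (2, "coast storm damage grows"),
    (3, "election results announced"),
    (4, "storm damage cleanup begins"),
]
ordered, links = link_narrative(stories)
```

In this toy run the three storm stories form one chain and the election story stays unlinked, which is the behavior the definition suggests for an unrelated event.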
Results
Can we spot an emerging new technology?
Data: 50,000 telecommunication patents from the past 10 years (abstract text and patent meta-information); 1.5 GB of raw patent documents
Methods: Topic modeling and visualization
Results: We can see a significant change in the topic of “software and storage” in communication around 2007 (corresponding to the Apple iPhone?)
Can we spot an emerging new technology?
Model:
• 100 topics
• Each topic is a distribution over words
• Each abstract is a combination of topics
Note: The width of the graph is proportional to the number of patents and the number of words from a particular topic (topic signal strength). The number of class 455 patents grew from 2,234 in 2005 to 7,647 in 2012.
**W. Dou et al., HierarchicalTopics: Visually Exploring Large Text Collections Using Topic Hierarchies, IEEE VAST 2013
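The note on topic signal strength can be approximated as follows: for each year, sum the expected number of words each abstract contributes to each topic. The weighting (word count times topic proportion) is an assumption made for illustration, and so are the function name and the sample values.

```python
from collections import defaultdict

def topic_strength_by_year(patents, n_topics):
    """Sum expected per-topic word counts for each year.

    patents: iterable of (year, n_words, theta), where theta is the
    abstract's topic mixture (n_topics proportions summing to 1).
    """
    strength = defaultdict(lambda: [0.0] * n_topics)
    for year, n_words, theta in patents:
        for k in range(n_topics):
            # More patents and longer on-topic text both widen the stream.
            strength[year][k] += n_words * theta[k]
    return dict(strength)

# Hypothetical values: (year, words in abstract, topic mixture).
patents = [
    (2005, 100, [0.9, 0.1]),
    (2007, 100, [0.4, 0.6]),
    (2007, 50, [0.2, 0.8]),
]
strength = topic_strength_by_year(patents, n_topics=2)
```

Plotting `strength[year][k]` against the year for each topic `k` would give stream widths in the spirit of the figure described in the note.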
Typical keyword: “transistor”
Emergent: “storage, software, …”
Can we spot novelty within an existing patent?
Data
Initially: A random sample of 40 patents in several classes, with a focus on 455 (telecom).
Recently: Confirmed through automated analysis of several subclasses of 455.
Method: Compare words in claims with words in the class plus subclass definition.
Results: Large symmetric differences
Words(Claims) ÷ Words(Definition)
Words(Abstracts) ÷ Words(Definition)
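The comparison above can be sketched with Python sets, reading ÷ as the symmetric difference that the Results line mentions. The tokenizer is deliberately crude and is an assumption; the two short texts are excerpts of the claim and class definition quoted later in the talk.

```python
import re

def words(text):
    """Lowercased set of alphabetic tokens in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def symmetric_difference(claims, definition):
    """Split the symmetric difference of the two word sets into halves."""
    c, d = words(claims), words(definition)
    return {"only_in_claims": c - d, "only_in_definition": d - c}

claims = "A process for rearing bumblebee queens"
definition = (
    "This class includes the methods of and structures for "
    "propagating, raising and caring for bees"
)
diff = symmetric_difference(claims, definition)
```

Keeping the two halves apart shows which words are new in the claims and which definition words the claims never use.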
Example
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=2&f=G&l=50&d=pall&s1=449%2F8.CCLS.&OS=CCL/449/8&RS=CCL/449/8
Patent Title: Process for rearing bumblebee queens and process for rearing bumblebees
Main Classification: 449/1; 449/2; 449/8
Class 449 – Bee Culture / Subclass 1
Class 449 – Bee Culture / Subclass 8
We claim:
1. A process for rearing bumblebee queens (genus Bombus) comprising generating a colony with workers in the presence of fertilized eggs and/or larvae from at least one colony, in a room with a controlled climate provided with food, and allowing the colony to grow until bumblebee queens are produced, wherein subadult and/or adult workers that originate from at least one different colony are brought together with said fertilized eggs and/or larvae.
2. The process according to claim 1, wherein the workers that originate from said at least one different colony are brought together with a young colony in the eusocial phase, consisting of a fertilized queen, brood and the first born workers.
3. The process according to claim 1, wherein more than 100 workers are brought together.
4. The process according to claim 1, wherein rearing is carried out using a workers: fertilized eggs ratio of 0.5-4.
5. The process according to claim 1, wherein the workers originating from said at least one different colony are first kept in a room without any queen and without brood for one day.
6. The process according to claim 1, wherein brood and workers from different bumblebee species are brought together.
7. A process for rearing bumblebees (genus Bombus), comprising rearing bumblebee queens by generating a colony with workers in the presence of fertilized eggs and/or larvae from at least one colony, in a room with a controlled climate provided with food, and allowing the colony to grow, wherein subadult and/or adult workers that originate from at least one different colony are brought together with said fertilized eggs and/or larvae, and using said bumblebee queens for rearing bumblebees.
Subclass Nesting
Class 449
1 -> Class Definition
8 -> 7 -> 3 -> Class Definition
Class Name: Bee Culture
Class Definition: This class includes the methods of and structures for propagating, raising and caring for bees; as well as certain ancillary methods and structures.
Class 449 Subclass 1
Subclass Name: Method
Subclass Definition: This subclass is indented under the class definition. Process.
Class 449 Subclass 8
Subclass Name: Queen Raising
Subclass Definition: This subclass is indented under subclass 7. Structure with provision to encourage and care for the production of a bee larvae into a queen bee.
Words in class / subclass definitions found in patent claim
Word         Count
method       0
process      7
queen        6
raise        0
encourage    0
care         0
larvae       4
production   1
bee          7
multi        0
swarm        0
capture      0
house        0
hive         0
structure    0
colony       11
culture      0
propagate    0
Words in claim that were not in definitions
Word         Count
rearing      5
worker       10
egg          5
fertilize    6
climate      2
food         2
different    5
control      2
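Tables like the two above could be produced by counting normalized claim words and splitting them by whether they occur in the definitions. The sketch below assumes a tiny hand-written lemma map and stopword list, both hypothetical. It will not reproduce the exact counts on the slides, because the normalization used for the talk is not stated.

```python
import re
from collections import Counter

# Hypothetical lemma map and stopword list, for illustration only.
LEMMAS = {
    "queens": "queen", "workers": "worker", "eggs": "egg",
    "fertilized": "fertilize", "bees": "bee", "methods": "method",
}
STOPWORDS = {"a", "for", "with", "in", "the", "of"}

def lemma_counts(text):
    """Count normalized, non-stopword tokens in a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS)

def split_by_definition(claim, definition_words):
    """Return (definition-word counts, counts of words outside the definitions)."""
    counts = lemma_counts(claim)
    found = {w: counts[w] for w in definition_words}  # 0 when absent
    novel = {w: c for w, c in counts.items() if w not in definition_words}
    return found, novel

# Shortened excerpt of claim 1 and a subset of the definition words.
claim = (
    "A process for rearing bumblebee queens comprising generating a colony "
    "with workers in the presence of fertilized eggs"
)
definition_words = {"method", "process", "queen", "bee", "colony", "hive"}
found, novel = split_by_definition(claim, definition_words)
```

Note that this exact-match version does not count "bumblebee" as "bee". Whether it should be counted is one of the normalization choices that changes the numbers.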
Can we spot novelty within an existing patent?
Observations
• Novelty is in words/relations that are not part of the definition (but appear in the patent claims or abstract)
• Some things can be left unsaid. Is there a boundary?
• Happens in all patents (but the degree varies)
Next
• Opportunity to text mine these differences
  – Are they random on a time scale?
  – Would descriptions of emerging technologies emerge from these patterns?
  – Do combination patents have more of these?
Can we list “all” patents relevant for some technology?
– Data: Patents, Wikipedia
– Potential data: Cell phone manuals or other descriptions
– Method: Text mining of patents in certain classes, text mining of filings by certain market/technology players, and text mining of other patents, using Wikipedia and manuals as guidance on what to look for.
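One simple reading of "using Wikipedia and manuals as guidance" is to build a vocabulary from a reference description and rank patents by how much of it they cover. The sketch below assumes that reading; the coverage score, the threshold, the guidance sentence, and the patent identifiers are all hypothetical.

```python
import re

STOPWORDS = {"a", "an", "and", "the", "of", "for", "to", "in", "with", "is", "that"}

def terms(text):
    """Set of lowercased, non-stopword alphabetic tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def rank_by_guidance(patents, guidance_text, min_coverage=0.15):
    """Rank patents by the share of guidance terms their text covers."""
    guide = terms(guidance_text)
    if not guide:
        return []
    scored = []
    for patent_id, text in patents.items():
        coverage = len(guide & terms(text)) / len(guide)
        if coverage >= min_coverage:
            scored.append((coverage, patent_id))
    return [patent_id for _, patent_id in sorted(scored, reverse=True)]

# Hypothetical guidance text and patent abstracts.
guidance = "A mobile phone is a portable telephone that can make calls over a radio link"
patents = {
    "P1": "Portable telephone with radio link antenna",
    "P2": "Process for rearing bumblebee queens",
    "P3": "Mobile phone storage software",
}
ranked = rank_by_guidance(patents, guidance)
```

Lowering `min_coverage` trades precision for recall, which is one concrete way to face the question of what "all" relevant patents should mean.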
Scale
Scalable Computing Architecture for Extracting Latent Topics and Events*
Distributed Data Storage and Pre-Processing Environment
• MapReduce procedures for data cleaning and pre-processing
• A distributed storage solution (MongoDB) is used for data storage, analysis and retrieval
• MapReduce-based social media crawlers for Twitter, blogs and news articles
  – Unstructured contents: textual information, images, comments
  – Structured contents: user graph, geo-tags, hashtags
Parallel Data Analytics Cluster
• MPI-based Parallel-LDA implementation for topic modeling with memory sharing optimization
• OpenNLP-based parallel implementation for entity extraction
• Customized PBS to schedule jobs for the parallel computing environment
**X. Wang et al., I-SI: Scalable Visual Analytics Architecture for Analyzing Latent Topical-Level Information From Social Media Data, Journal of Computer Graphics Forum, 2012
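The MapReduce cleaning and pre-processing step can be illustrated on a single machine with Python's built-in `map` and `functools.reduce`. This is a sketch of the programming pattern only, not the architecture described in the cited paper; the function names and sample documents are assumptions.

```python
import re
from collections import Counter
from functools import reduce

def map_clean(doc):
    """Map step: clean one raw document into a term Counter."""
    text = re.sub(r"<[^>]+>", " ", doc)           # drop markup remnants
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens
    return Counter(tokens)

def reduce_merge(left, right):
    """Reduce step: merge two partial term counts."""
    return left + right

def run(docs):
    """Apply the map step to every document, then fold the results."""
    return reduce(reduce_merge, map(map_clean, docs), Counter())

# Hypothetical raw documents.
docs = ["<p>Patent storage</p>", "Storage software"]
totals = run(docs)
```

Because `map_clean` touches one document at a time and `reduce_merge` is associative, a cluster framework could spread the same two functions across many workers.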
News Briefing App
Resources we’d be happy to share
• Complete US patents and applications (until 1Q 2013) with a search engine (Lucene) interface
• Patent classes
• Other text resources (Wikipedia, Wiktionary, etc.)
We’d be happy to prepare specialized extracts or combinations for those who need them.
Thank you!
Derek Xiaoyu Wang [email protected]
News Briefing App @News_Briefing
Now FREE at App Store