Basis Technology – Human Language Technology Conference 2012 1 Big Data Triage with Text Analytics Steve Kearns Director of Product Management Basis Technology

Big Data Triage with Rosette Human Language Technology Conference

DESCRIPTION

This talk discusses how Rosette (entity extraction, entity searching, document clustering, near-duplicate detection, and fact, relationship, and event extraction) can be combined with a powerful search engine to facilitate information discovery and thematic analysis across a variety of sources and languages. The term “Big Data” has many possible meanings (large volume, fast-moving, many sources), but the issues it creates are clear. Analysts have significantly more data available, but the tools to exploit this data haven’t kept pace. Many legacy approaches to analytic systems (databases and the custom applications around them) are not flexible enough to pull in data from new sources at a moment’s notice, cannot import and share new data quickly enough to provide actionable intelligence, and cannot scale up to hold the massive amounts of data being produced. And even if today’s systems could handle all of the available data, a question remains: when presented with massive volumes of semi-structured, multilingual data from many sources, how effectively could an analyst discover the relevant data and efficiently move it into the analytical process? View more slides from the Human Language Technology Conference 2012 here: http://info.basistech.com/hlt-2012-slides

Page 1: Big Data Triage with Rosette Human Language Technology Conference

Basis Technology – Human Language Technology Conference 2012 1

Big Data Triage with Text Analytics

Steve Kearns

Director of Product Management

Basis Technology

Page 2: Big Data Triage with Rosette Human Language Technology Conference

Agenda

•  What is Big Data?
•  Challenges of Big Data
•  Text Analytics Technology
•  Text Analytics for Big Data Triage

Page 3: Big Data Triage with Rosette Human Language Technology Conference

What is Big Data?

Page 4: Big Data Triage with Rosette Human Language Technology Conference

Big Data

•  Volume

•  Velocity

•  Variety

Page 5: Big Data Triage with Rosette Human Language Technology Conference

Volume

Page 6: Big Data Triage with Rosette Human Language Technology Conference

http://mashable.com/2012/06/22/data-created-every-minute/

Volume

Page 7: Big Data Triage with Rosette Human Language Technology Conference

Velocity

•  High-Throughput Sources:
   –  Digital Forensics
      •  Rapid Site Exploitation
      •  Many Hard Drives
•  Rapidly Changing Sources:
   –  OSINT
      •  News
      •  Social Media
•  High Throughput Storage, Analysis, Alerting

Page 8: Big Data Triage with Rosette Human Language Technology Conference

Variety

•  Data Types
   –  DOMEX/DOCEX/MEDEX/OSINT
   –  Finished Intel
   –  Cables
   –  Intellipedia
   –  Harmony
   –  Biometrics
   –  Watch Lists
   –  Hard Drive -> File(s) -> Unstructured and Structured Content
   –  Sensor Data
•  Structured / Unstructured
•  Textual / Visual / Numeric

Page 9: Big Data Triage with Rosette Human Language Technology Conference

The Challenge: Finding Value

http://learn-how-to-be-happy.com/wp-content/uploads/2011/08/happy_face.jpg

Page 10: Big Data Triage with Rosette Human Language Technology Conference

Big Data Problems - Volume

•  Where/How do you store it?
   –  Single database -> database cluster -> Hadoop/HDFS?
•  Data quality?
   –  Manual review or annotation?
   –  People don’t scale
•  Query
   –  If you can, how fast, how complex and on what can you query?
   –  User Interface? SQL? Programming?
   –  How do you view results?
   –  Can you filter the results to refine your query?
   –  Thematic exploration, where the results of one query inform the next
   –  Security?

Page 11: Big Data Triage with Rosette Human Language Technology Conference

Big Data Problems - Velocity

•  Time sensitive
   –  Value of information decreases over time
   –  How long from “publish” to “discoverable”?
•  Rapid changes/updates
   –  Which updates are important?
   –  Which sources/users are important? Which may become important?
   –  Individual pieces of data may be meaningless, but what about in aggregate?
   –  Quality/Verification?
   –  Manual Review?

Page 12: Big Data Triage with Rosette Human Language Technology Conference

Big Data Problems - Variety

•  Many Sources
   –  Often stored, formatted, and accessed differently
   –  Access, security?
   –  Many languages
   –  How reliable is each source?
•  Few, if any, links
   –  Between sources
   –  Between documents
   –  Between information within documents

Page 13: Big Data Triage with Rosette Human Language Technology Conference

General Problems

2 + 2

Scale

Human Language

•  Computers are great at some things
•  Humans are great at others

Page 14: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics

Page 15: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics

Automated analytical methods operating on the written word to surface insights about the data.

Its purpose is to assist the human in finding things of relevance and interest.

Page 16: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics techniques

Page 17: Big Data Triage with Rosette Human Language Technology Conference

Triage Example

Baghdad military command spokesman Colonel Dhia al-Wakeel said the attacks bore the hallmarks of al-Qaeda. Thursday was the deadliest day in Iraq since March 20, when shootings and bombings claimed by an al-Qaeda affiliated group killed 50 people and wounded 255 nationwide.

Al-Qaeda has the following direct franchises:

•  Al-Qaeda in the Arabian Peninsula, which comprises
   –  Al Qaeda in Saudi Arabia, and
   –  Islamic Jihad of Yemen
•  Al-Qaeda in Iraq
•  Al-Qaeda Organization in the Islamic Maghreb
•  Al-Shabaab in Somalia
•  Egyptian Islamic Jihad
•  Libyan Islamic Fighting Group
•  East Turkestan Islamic Movement in Xinjiang, China

Query: Al Qaeda

al-Qaeda   0.99
القاعدة (al-Qa'idah)   0.99
Al -Qaeda   0.99
القاعدة (al-Qa'idah)   0.99
al-Qada   0.91
al-Qaida   0.91
Al-Qa'ida   0.91
Al-Qaïda   0.91
al-Qaida Africa   0.78
Al-Qaeda Sanctions List   0.74
Al-Qaïda Libyenne   0.74
وتنظيم القاعدة   0.74
al-Qaeda in Islamic Maghreb   0.7
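The slides do not say how the match scores in this example are computed. As a rough illustration of the idea only, the sketch below normalizes spelling variants (case, diacritics, hyphens, apostrophes) and scores them with Python's standard `difflib.SequenceMatcher`. The function names are hypothetical, the scores will not reproduce the numbers on the slide, and cross-script matches such as the Arabic forms would need transliteration, which is not attempted here.

```python
import difflib
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip diacritics, drop apostrophes, and treat other punctuation as spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = []
    for ch in no_marks:
        if ch.isalnum():
            kept.append(ch.lower())
        elif ch not in "'\u2019":
            kept.append(" ")
    # Collapse runs of whitespace so "Al -Qaeda" and "Al Qaeda" compare equal.
    return " ".join("".join(kept).split())


def similarity(query: str, candidate: str) -> float:
    """Return a 0..1 similarity between the normalized forms of two names."""
    return difflib.SequenceMatcher(None, normalize(query), normalize(candidate)).ratio()


def rank(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Sort candidate names by similarity to the query, best match first."""
    scored = [(c, similarity(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank("Al Qaeda", ["al-Qaida Africa", "al-Qaeda", "Al-Qaïda"]):
        print(f"{name}\t{score:.2f}")
```

A character-level ratio like this is only a stand-in: it has no notion of name structure, transliteration conventions, or which tokens are distinctive.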

Page 18: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics : Language ID

La Grande-Bretagne a de son côté jugé que l'accord de Luxembourg constituait un véritable changement dans la stratégie agricole de l'Europe, tandis que l'Irlande y a vu un gage de stabilité et et de sécurité pour les agriculteurs.

Le président nigérian Olusegun Obasanjo a salué cette l'engagement du G8, déclarant que "la condition majeure au développement est l'absence de conflit". La porte-parole de la présidence française, Catherine Colonna, a pour sa part qualifié la réunion d'"exceptionnelle".

Американская софтверная компания становится пользующимся спросом у спецслужб США экспертом в области лингвистики (в частности, изучения и обработки информации на арабском языке) после терактов 11 сентября 2001 г.

В данный момент правительство США, обвиняющее радикальную мусульманскую группировку "Аль Каида" в терактах 2 года назад, активизирует свое внимание к арабскому языку и программам его обработки. Грамматика языков данной группы

「端末側で行単位に(あるいは一画面分)編集しておいて、送信キーによりまとめて送信する」という方式と、「端末には知能はなく、一字一字すべてがその都度送られ処理される」という方式は、究極的に前者は半二重通信、後者は全二重通信とフィットします。 後者では、入力のエコーもコンピュータ側で制御されます。

つまり、入力した字の表示はキー入力がコンピュータに送られ、それが送り返されて表示されます。

FNPがコンピュータと端末の間にあって、実際の端末とのやりとりを制御するのです。そして、コンピュータとFNPの間の通信は、少量の転送には不向きで、大量の一括転送に向いていました。FNPによるコンピュータへの割り込み要求は高価なものだったからです。Multicsでのプロセスのwake upも高価だということもありました。

私ごとになりますが、ちょうどこのころ大学院生でしたが、ACOS-6用のある言語処理系の開発を請け負って作っていました。ACOS-6はMulticsの概念に非常に近いものを持っていました、あるいは持とうとしていました。 また、ハードウェアも大変似ていました。シールをはがすと、 その下から別のアメリカの会社の名前が出てくるマシンでテストしたこともありました。1年間ほとんど休みなしにマシンルームにこもっていて、ここでの議論と疑問を自分のテーマとしても 扱ったことがあるのです。それで、よーくわかるのです。

Après avoir rencontré les présidents de quatre des cinq pays africains (Afrique du Sud, Algérie, Sénégal, Nigeria) membres du comité de pilotage du Nouveau partenariat pour le développement économique de l'Afrique

Программное обеспечение Basis Technology позволяет осуществлять поиск слов с близкими значениями, а также транслитерировать арабские и фарси-буквы в латинские. Продукт был разработан по специальному заказу правительства США с целью оптимизации процесса анализа арабских текстов.

French

La Grande-Bretagne a de son côté jugé que l'accord de Luxembourg constituait un véritable changement dans la stratégie

Après avoir rencontré les présidents de quatre des cinq pays africains (Afrique du Sud, Algérie, Sénégal, Nigeria) membres du comité de pilotage du

Le président nigérian Olusegun Obasanjo a salué cette l'engagement du G8, déclarant que "la condition majeure au développement est

Russian

Программное обеспечение Basis Technology позволяет осуществлять поиск слов с близкими значениями, а также транслитерировать

Американская софтверная компания становится пользующимся спросом у спецслужб США экспертом в области

В данный момент правительство США, обвиняющее радикальную мусульманскую группировку "Аль Каида" в терактах 2

Japanese

「端末側で行単位に(あるいは一画面分)編集しておいて、送信キーによりまとめて送信する」という方式と、「端末には知能はなく、一字一字すべてがその都度送られ処理される」

FNPがコンピュータと端末の間にあって、実際の端末とのやりとりを制御するのです。そして、コンピュータとFNPの間の通信は、少量の転送には不向きで、大量の一括転送に向いていました。FNPによるコンピュータへの割り

「端末側で行単位に(あるいは一画面分)編集しておいて、送信キーによりまとめて送信する」という方式と、「端末には知能はなく、一字一字すべてがその都度送られ処理される」
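This slide sorts mixed-language snippets into French, Russian, and Japanese. The slides do not describe the method used, so the following is only a minimal first-pass sketch that guesses the writing system from Unicode character names using the standard `unicodedata` module. The function names and labels are assumptions made for illustration. Script detection alone cannot separate languages that share a script (French from English, Russian from Ukrainian); that step typically needs a statistical model, for example over character n-grams, which is not shown.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script, based on Unicode character names."""
    counts: Counter = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name.startswith(("HIRAGANA", "KATAKANA")):
            counts["kana"] += 1
        elif name.startswith("CJK"):
            counts["han"] += 1
        elif name:
            # e.g. "CYRILLIC SMALL LETTER A" -> "cyrillic"
            counts[name.split(" ")[0].lower()] += 1
    return counts


def guess_script(text: str) -> str:
    """Return a coarse label: 'japanese', 'cyrillic', 'latin', 'han', ... or 'unknown'."""
    counts = script_counts(text)
    if not counts:
        return "unknown"
    # Japanese mixes kana with Han ideographs, so any kana is treated as decisive.
    if counts["kana"] > 0:
        return "japanese"
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in (
        "Le président nigérian Olusegun Obasanjo",
        "Американская софтверная компания",
        "送信キーによりまとめて送信する",
    ):
        print(guess_script(sample), "<-", sample)
```

Treating any kana as decisive is a simplification; a document that merely quotes a Japanese word would be mislabeled, so a real system would look at proportions and at regions within the text.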

Page 19: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics: Lemmatization

Search: flying

Results:

fly      132 hits
flown    61 hits
flew     78 hits
flying   97 hits
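In this example a search for "flying" also returns documents containing "fly", "flown", and "flew". One way to get that behaviour is to index and query by lemma rather than by surface form. The sketch below assumes a tiny hand-written `LEMMAS` table as a stand-in for a real morphological analyzer; the table, function names, and sample documents are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"fly": "fly", "flies": "fly", "flying": "fly", "flew": "fly", "flown": "fly"}

PUNCTUATION = ".,;:!?\"'()"


def lemma(token: str) -> str:
    """Map a surface form to its lemma; unknown words map to themselves."""
    word = token.strip(PUNCTUATION).lower()
    return LEMMAS.get(word, word)


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build an inverted index keyed by lemma instead of by surface form."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemma(token)].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Lemmatize the query the same way as the documents, then look it up."""
    return set(index.get(lemma(query), set()))


if __name__ == "__main__":
    docs = {
        "a": "The plane flew north.",
        "b": "They had flown before.",
        "c": "Trains run daily.",
    }
    print(sorted(search(build_index(docs), "flying")))
```

Applying the same normalization at index time and at query time is the key design point: either side alone would miss the irregular forms.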

Page 20: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics: Lemmatization (Arabic)

Search: فجر (Detonated)

Results:

وتفجيرها   132 hits
متفجرات   77 hits
تفجيرات   32 hits
فجرها   22 hits
تفجرت   2 hits

Page 21: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics: Entity Extraction

Page 22: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics: Relationship Extraction

Page 23: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics: Entity Search

Page 24: Big Data Triage with Rosette Human Language Technology Conference

Text Analytics: Document Clustering

Page 25: Big Data Triage with Rosette Human Language Technology Conference

Big Data Triage

Text Analytics

Page 26: Big Data Triage with Rosette Human Language Technology Conference

Big Data Processing

Collect

•  Identify data sources
•  Data cleansing
•  Move data into analysis repository

Analyze

•  Identify Entities, Facts, Relationships
•  Link between Documents
•  Link fact/entity between documents

Index

•  Keyword search + metadata filters
•  Thematic exploration using metadata
•  Cross-document links

Page 27: Big Data Triage with Rosette Human Language Technology Conference

Big Data Processing - Technology

Collect

•  Source: News, Twitter, Database, file system, digital forensics, etc.
•  Storage: HDFS, MongoDB, SQL, etc.

Analyze

•  Platform: Hadoop, UIMA, Odyssey, Custom
•  Analysis type: Language ID, Entity Extraction, Relationship Extraction, Document Clustering, Entity Linking

Index

•  Fulltext Search: Solr, Accumulo, Lucene
•  Structured Data: RDF, SQL, OrientDB, Neo4j, Cassandra, HDFS, etc.

Page 28: Big Data Triage with Rosette Human Language Technology Conference

Big Data Triage Requirements

•  View results while still processing
   –  Incremental collection/analysis/indexing
•  User Interface that allows exploration
   –  Dashboard
   –  Keyword Search
   –  Geo Search
   –  Entity Search
•  Enables thematic exploration
   –  Metadata produced by Analysis makes this easier
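Two of these requirements, viewing results while processing continues and exploring by metadata, can be illustrated with a small in-memory sketch. The class and function names, the metadata fields (`language`, `source`), and the batch size are all assumptions for illustration; a production system would delegate this to a search engine with incremental commits and faceting.

```python
from collections import Counter


class IncrementalIndex:
    """Toy index that is searchable at any point during ingestion."""

    def __init__(self) -> None:
        self.docs: list[tuple[str, dict]] = []

    def add(self, text: str, metadata: dict) -> None:
        self.docs.append((text.lower(), metadata))

    def search(self, keyword: str, **filters) -> list[dict]:
        """Return metadata of documents containing the keyword and matching every filter."""
        needle = keyword.lower()
        return [
            meta
            for text, meta in self.docs
            if needle in text and all(meta.get(k) == v for k, v in filters.items())
        ]

    def facets(self, facet_field: str, keyword: str, **filters) -> Counter:
        """Count hits per metadata value, to suggest the next filter to apply."""
        return Counter(meta.get(facet_field) for meta in self.search(keyword, **filters))


def ingest(index: IncrementalIndex, stream, batch_size: int = 2):
    """Add documents in batches, yielding the running total after each batch."""
    pending = 0
    for text, metadata in stream:
        index.add(text, metadata)
        pending += 1
        if pending == batch_size:
            pending = 0
            yield len(index.docs)
    if pending:
        yield len(index.docs)


if __name__ == "__main__":
    stream = [
        ("Attack reported in Baghdad", {"language": "eng", "source": "news"}),
        ("Attaque signalée", {"language": "fra", "source": "news"}),
        ("Attack claimed online", {"language": "eng", "source": "social"}),
    ]
    index = IncrementalIndex()
    for total in ingest(index, stream):
        # Queries run between batches, before ingestion has finished.
        print(total, "docs indexed;", len(index.search("attack")), "hits for 'attack'")
    print(index.facets("source", "attack"))
```

The facet counts are what make thematic exploration practical: each result set shows which metadata values are present, and choosing one becomes the filter for the next query.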

Page 29: Big Data Triage with Rosette Human Language Technology Conference

Dashboard

Page 30: Big Data Triage with Rosette Human Language Technology Conference

Search and Filter

Page 31: Big Data Triage with Rosette Human Language Technology Conference

Foreign Language Search

Page 32: Big Data Triage with Rosette Human Language Technology Conference

Detailed Document View  

Page 33: Big Data Triage with Rosette Human Language Technology Conference

Entity Search – Cross Language

Page 34: Big Data Triage with Rosette Human Language Technology Conference

Search/Filter/Explore

http://www.silobreaker.com/FlashNetwork.aspx?DrillDownItems=11_237360

Page 35: Big Data Triage with Rosette Human Language Technology Conference

Summary

Text Analytics enables Big Data Triage

Page 36: Big Data Triage with Rosette Human Language Technology Conference

Thank You!

For more information: Visit www.basistech.com

Write to [email protected]

Call 617-386-2090 or 800-697-2062