SEASR – Software Environment for the Advancement of Scholarly Research Overview University of Illinois June 2007 Michael Welge, Loretta Auvil, John Unsworth Data Intensive Technologies and Applications/IGB Automated Learning Group, and GSLIS University of Illinois, Urbana-Champaign


SEASR – Software Environment for the Advancement of Scholarly Research

Overview
University of Illinois

June 2007

Michael Welge, Loretta Auvil, John Unsworth
Data Intensive Technologies and Applications/IGB
Automated Learning Group, and GSLIS
University of Illinois, Urbana-Champaign


Structured Data “Rush”


D2K – Framework for Data Analysis

• Provides scalable environment from the Desktop to Web Services to Grid Services
• Employs a visual programming system for data/work flow paradigm
• Provides capability to build custom applications
• Provides capability to access data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Integrated environment for models and visualization
• Supports an extensible interface for creating one’s own algorithms
• Provides access to distributed computing capabilities


D2K Components

• D2K Infrastructure
– Itinerary execution engine
• D2K-Driven Applications
– Applications that make use of the D2K Infrastructure
– Toolkit is a D2K-Driven app
• D2K Server
– Special kind of D2K-Driven app
– Wraps the infrastructure to provide remote itinerary and module execution
– Used by the Toolkit to distribute module execution
• D2K Web Service
– Provides a generic programmatic interface for executing itineraries
– Communicates with D2K Servers over socket connections using D2K-specific protocols
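
The component list above centers on an engine that executes itineraries, that is, modules wired into a data flow. The following is a minimal sketch of that idea in Python, not D2K's actual API: the names `Module`, `Itinerary`, `connect`, and `run` are hypothetical, and the engine shown here simply runs each module once in dependency order on a single machine.

```python
from collections import deque


class Module:
    """A hypothetical processing step: a name plus a function to call."""

    def __init__(self, name, func):
        self.name = name
        self.func = func


class Itinerary:
    """A hypothetical data-flow graph of modules (not the D2K API)."""

    def __init__(self):
        self.modules = {}
        self.inputs = {}  # module name -> list of upstream module names

    def add(self, module):
        self.modules[module.name] = module
        self.inputs[module.name] = []
        return self

    def connect(self, source, target):
        # The output of `source` becomes the next positional input of `target`.
        self.inputs[target].append(source)
        return self

    def run(self):
        """Execute every module once, in dependency (topological) order."""
        pending = {name: len(ups) for name, ups in self.inputs.items()}
        ready = deque(name for name, count in pending.items() if count == 0)
        results = {}
        while ready:
            name = ready.popleft()
            args = [results[up] for up in self.inputs[name]]
            results[name] = self.modules[name].func(*args)
            # Release downstream modules whose inputs are now all available.
            for other, ups in self.inputs.items():
                if name in ups:
                    pending[other] -= ups.count(name)
                    if pending[other] == 0:
                        ready.append(other)
        if len(results) != len(self.modules):
            raise ValueError("itinerary contains a cycle")
        return results


def build_demo():
    """Wire three toy modules: load -> sort -> largest."""
    itinerary = Itinerary()
    itinerary.add(Module("load", lambda: [3, 1, 2]))
    itinerary.add(Module("sort", sorted))
    itinerary.add(Module("largest", lambda values: values[-1]))
    itinerary.connect("load", "sort").connect("sort", "largest")
    return itinerary


if __name__ == "__main__":
    print(build_demo().run())
```

In this sketch a remote server or web service would wrap `run` behind a network interface; that part is omitted because the slides do not describe the protocol.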


Creating Customer Value

• Prediction: Industrial manufacturer
– Computed customer buying propensities
– Achieved 25% conquest customer sales lift by executing directed cross/upsell, resulting in $65 million in incremental revenue
• Discovery: Automotive manufacturer
– Identified patterns of inappropriate warranty work in dealer channel
– Targeted $200M+ of potentially unnecessary annual expense
• Monitoring: Department store retailer
– Watched POS transaction flow for unusual variations
– Deterred inappropriate behavior and fraudulent transactions
– Resulted in savings of over $125 million


Application Examples

Comparative Genomics
Harris A. Lewin explains that Evolution Highway allows one to look “. . . at the whole genome at once - multiple chromosomes across multiple species. The insights wouldn't have come so quickly if we couldn't throw the data at this framework from NCSA.”
Science, Vol. 309, Issue 5734, Pages 613-617, 22 July 2005

Astronomy
Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, and David Tcheng, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees, The Astrophysical Journal, Vol. 650, Part 1, Pages 497–509, 2006

Music Analysis
J. Stephen Downie, The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future, Computer Music Journal, Vol. 28, No. 2, Pages 12-23, Summer 2004


Research, Development, & Technology Transfer Model


SEASR: The Data Problem
Structured vs. Unstructured

[Figure: timeline of information milestones against data volume (axis labels: gigabytes, 1999). Milestones: cave paintings and bone tools, 40,000 BCE; writing, 3500 BCE; 0 C.E.; paper, 105; printing, 1450; electricity and telephone, 1870; transistor, 1947; computing, 1950; Internet (DARPA), late 1960s; the Web, 1993]

• 20% Structured Data, 80% Unstructured Data
• Today, 80% of business is conducted on unstructured information – Gartner Group
• 80% of the information needed is in the Open Source – NIA
• Workers spend 80% of the time gathering information – STIC, EMF

Source: www.fastsearch.com


Unstructured Data “Rush”

[Figure: growth of information from 15 years ago to today. Labels: Database; The Internet; Semi-structured Information (Doc Mgt / XML); Unstructured Information (email / Word / HTML / PDF / etc.)]

• Today, 80% of business is conducted on unstructured information – Gartner Group
• 80% of the information needed is in the Open Source – NIA
• Workers spend 80% of the time gathering information – STIC, EMF


The issue is getting worse...

[Figure: now, other forms of unstructured information (video, voice, multiple devices) are affecting every industry sector]


Hail SEASR!

Software Environment for the Advancement of Scholarly Research (SEASR)

– addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world.

– aims to make collections more useful by integrating two well-known research and development frameworks, NCSA’s Data-To-Knowledge (D2K) and IBM’s Unstructured Information Management Architecture (UIMA), into an easily usable environment that researchers in any discipline can learn and adapt for their own unstructured data analysis.


UIMA Lineage

• Developed over 5 years

– Funded by DARPA (GALE)

– Companies: BBN, MITRE, SAIC

– Universities: Carnegie Mellon, Columbia, UMass/Amherst

– 100 Developers from IBM World Wide

• UIMA Enables …
– Part of Speech Detectors
– Document Structure Detectors
– Tokenizers, Parsers, Translators
– Named-Entity Detectors
– Sentiment Detectors
– Relationship Detectors
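
The detectors listed above share one pattern: each analysis component reads a document and adds typed annotations over spans of its text, and later components build on earlier ones. The sketch below illustrates that pattern in Python. It is not the UIMA API (UIMA is a Java framework); `Annotation`, `Document`, `tokenizer`, `capitalized_entity_detector`, and `run_pipeline` are hypothetical names, and the entity detector is a toy heuristic that merges adjacent capitalized tokens.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Token" or "Entity"
    begin: int  # start offset in the document text
    end: int    # end offset (exclusive)


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def covered(self, annotation):
        """Return the text span an annotation points at."""
        return self.text[annotation.begin:annotation.end]

    def select(self, kind):
        """Return all annotations of one kind, in insertion order."""
        return [a for a in self.annotations if a.kind == kind]


def tokenizer(doc):
    """Add one Token annotation per run of word characters."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))


def capitalized_entity_detector(doc):
    """Toy stand-in for a named-entity detector; relies on Token annotations."""
    run = []
    for token in doc.select("Token") + [None]:  # None flushes the final run
        if token is not None and doc.covered(token)[0].isupper():
            # Start a new entity if anything but whitespace separates the tokens.
            if run and doc.text[run[-1].end:token.begin].strip():
                doc.annotations.append(Annotation("Entity", run[0].begin, run[-1].end))
                run = []
            run.append(token)
        else:
            if run:
                doc.annotations.append(Annotation("Entity", run[0].begin, run[-1].end))
            run = []


def run_pipeline(text, annotators):
    """Apply each annotator in order to a fresh document."""
    doc = Document(text)
    for annotate in annotators:
        annotate(doc)
    return doc


if __name__ == "__main__":
    doc = run_pipeline(
        "Abraham Lincoln wrote from Springfield, Illinois.",
        [tokenizer, capitalized_entity_detector],
    )
    print([doc.covered(a) for a in doc.select("Entity")])
```

Keeping annotations as offsets into the unchanged text, rather than rewriting the text, is what lets independently written components be chained in this way.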


SEASR: Architecture

SEASR’s advanced informatics tools will expand the technical capabilities of what is now available in the field by:

• connecting data sources that are currently incompatible, whether due to different formats or protocols

• offering all project components as open source, to enable users to modify and add to tools

• allowing users to write analytic engines in their programming language of choice

• installing on all hardware footprints, so that the tools can be brought to data sets where they are housed

• creating a repository for components that will support sharing and publishing among users

• enabling scalability so that components may run on a large variety of hardware footprints, including shared memory processors and clusters


SEASR: Research, Development, & Technology Transfer Model


Research Areas

• Focused Data Retrieval and Data Integration: Given a target topic (Iran’s nuclear program) or an entity (University of Tehran), how do we locate, retrieve, and integrate all relevant data, both structured (databases) and observational (sensory data, textual data, image data)?

• Semantic Data Enrichment: How to handle the overwhelming array of different data formats, how to understand the layout of data and infer metadata for a variety of text sources and images, and how to infer semantic markup and construct/augment knowledge bases.

• Entity and Relationship Discovery: How can we match ambiguous mentions of entities across both structured data and text? How do we discover relationships among entities? How do we relate newly collected data to existing knowledge bases?

IACAT, Dan Roth and Jiawei Han, CS, UIUC


Research Areas

• Knowledge Discovery and Hypotheses Generation: How to exploit the rich semantic structure generated by identifying entities and relationships among them to promote knowledge discovery and to generate hypotheses that emerge from “surprising” correlations or structural events?

• Intelligent Human-Computer Interactions for Information Access: How to devise effective interaction models and interfaces for accessing multimodal data, interactive annotation and discovery models, and support hypotheses suggestion and verification?

• Mathematical and Computational Foundations: The research described above builds on our team members’ work on key mathematical and algorithmic questions underlying progress in the Data Sciences, and serves to motivate further theoretical questions.

IACAT, Dan Roth and Jiawei Han, CS, UIUC


Getting the “Band” Together

• June 2007 – Band formation
– Project start date
– More use ideas and framework discussions
• December – First “gig”
– Framework and data app demonstration
• Vocals - Research Technology
– John Unsworth, Stephen Downie, Tim Wentling
– Dan Roth, Jiawei Han, Kevin Chang, Cheng Xiang Zhai
• Percussions & Bass - SEASR Development
– Loretta Auvil, Tara Bazler, Duane Searsmith, Andrew Shirk, Students
• Lead – Designers/Developer/Applications Areas
– Humanities – M2K, Nora/Monk and Others (we heard about yesterday/today)
• Need Groupies! (Advisors, Researchers, Developers, and Application Drivers) – Loretta Auvil


SEASR: How can I participate?

• Collaborate on application development or ontology creation

• Contribute to component development for analytics or data access

• Participate in visualization and UI design

• Serve as an advisor

Contact Loretta Auvil ([email protected])


SEASR
Engineering Knowledge for the Humanities

Thank You


Lincoln Papers Project

• A Model for Digital Humanities Scholarship
• Collaboration between I-CHASS and the Lincoln Presidential Museum and Library in Springfield, Illinois
– UIUC permanent home of digital archive of all Lincoln materials held by Lincoln Library
• Opportunities for Discovery from the Lincoln Papers
– Provides ability to explore many technologies of interest to humanities scholars, including:
    • Digitization and OCR
    • Information Extraction and Analysis of text- and image-based information
    • Social Networking Tools
    • Geo-spatial Analysis
– Solutions can be transferred to other digital collections, such as Founding Fathers’ Papers

Vernon Burton, History Department, UIUC


Research Project and Consequence of Digital Analysis

• Scholar interested in development of Lincoln’s concept of “Liberty”
– Data extraction tools identify all instances of “liberty” and related concepts
– Social networking tools trace with whom, when, and how frequently Lincoln corresponded on the subject
– Geo-spatial analysis can reveal regional differences in support for emancipation
• Scholar is able to
– Easily identify and retrieve all key materials from a collection numbering hundreds of thousands, or even millions, of documents
– Gain insight into development and strength of Lincoln’s commitment to emancipation
– Identify key correspondents, some of whom might have previously been overlooked, who helped shape Lincoln’s public policy

Vernon Burton, History Department, UIUC
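
The first two steps of this scenario, finding mentions of a concept and tracing with whom and when it was discussed, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`recipient`, `year`, `text`), the list of related terms, and the sample letters are invented placeholders, not material from the Lincoln Papers or from any SEASR component.

```python
import re
from collections import Counter

# Hypothetical list of "related concepts"; a real study would curate this.
CONCEPT_TERMS = ("liberty", "freedom", "emancipation")
CONCEPT_PATTERN = re.compile(r"\b(" + "|".join(CONCEPT_TERMS) + r")\b", re.IGNORECASE)


def find_mentions(letters):
    """Return (letter, matched terms) for each letter that mentions the concept."""
    hits = []
    for letter in letters:
        terms = [m.lower() for m in CONCEPT_PATTERN.findall(letter["text"])]
        if terms:
            hits.append((letter, terms))
    return hits


def correspondence_summary(hits):
    """Count matching letters per recipient and per (recipient, year)."""
    by_recipient = Counter(letter["recipient"] for letter, _ in hits)
    by_recipient_year = Counter(
        (letter["recipient"], letter["year"]) for letter, _ in hits
    )
    return by_recipient, by_recipient_year


# Invented placeholder records, only to show the assumed data shape.
SAMPLE_LETTERS = [
    {"recipient": "Correspondent A", "year": 1858, "text": "On liberty and the Union."},
    {"recipient": "Correspondent B", "year": 1860, "text": "Regarding the rail schedule."},
    {"recipient": "Correspondent A", "year": 1862, "text": "Freedom and emancipation are near."},
]

if __name__ == "__main__":
    hits = find_mentions(SAMPLE_LETTERS)
    by_recipient, by_recipient_year = correspondence_summary(hits)
    print(by_recipient.most_common())
    print(sorted(by_recipient_year.items()))
```

Keyword matching of this kind finds only literal mentions; identifying related concepts that are expressed in other words would need the information extraction tools the slide refers to.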


Other Example Research

• Voice mining (DH 2006 Poster)
– Scholar is interested in development of models that can analyze characters’ utterances in plays.
– Scholar is able to construct analytical models that can successfully identify the socio-economic class or status of the character who uttered a given line of play text.
• Criticism mining (DH 2006)
– Scholar is interested in development of tools that can automatically analyze critical reviews on humanities objects.
– Scholar is able to easily construct text categorization models that predict positive and negative reviews, predict the genre of the work being reviewed, and differentiate fiction and non-fiction book reviews.
• Differentiating Editorial and Customer Critiques of Cultural Objects Using Text Mining (DH 2007)
– Scholar is interested in development of tools that can automatically differentiate critiques written by scholars and professional editors versus ordinary readers.
– Scholar is able to use text mining tools to differentiate these two kinds of critiques as well as to see what features make them different.

J. Stephen Downie, GSLIS, UIUC
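
As an illustration of what a text categorization model for positive and negative reviews can look like, here is a minimal multinomial naive Bayes classifier with add-one smoothing in Python. The slides do not say which classifier the cited studies used, so naive Bayes is an assumption chosen for brevity, and the training reviews are invented toy examples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior for each label."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy reviews, only to exercise the code.
TRAIN_TEXTS = [
    "a moving and brilliant novel",
    "brilliant prose and a wonderful story",
    "a dull and tedious plot",
    "tedious characters and dull writing",
]
TRAIN_LABELS = ["positive", "positive", "negative", "negative"]

if __name__ == "__main__":
    model = NaiveBayesClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
    print(model.predict("a brilliant and wonderful story"))
```

The same class could be trained on genre labels or on fiction versus non-fiction labels, since nothing in it is specific to sentiment; the per-word counts it stores are also one simple way to inspect which features separate two kinds of critiques.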


Conceptual Analytical Architecture


SEASR Architecture


Structured Data for Analysis

• Low Volume Data

– Wire services

– Call Detail Records

– Phone directories

– Badge access tracking

– Customer lists

– Account histories

– Supplier network data

– Biometric access data

• High Volume Data
– Stock transactions
– Web pages
– News Wire feeds
– Audits
– CRM databases
– Web access logs
– Net logs
– Mutual Fund validation
– Credit/Debit transactions
– RFID tracking logs


Unstructured Data for Analysis

• Low Volume Data
– Email, Chat and IM
– Internal documents
– Call Center data logs
– Pager data
– External reports and data
– Publicly accessible records
– Calendars
– RF monitoring
– Print stream monitoring

• High Volume Data
– VOIP phone calls
– Broadcast media
– Web cam data
– Deep web crawl data
– Surveillance cameras
– Videoconferences
– Voice mail
– Satellite data


Current SEASR Team

• PI: Michael Welge, NCSA
• Co-PI: John Unsworth, GSLIS; Loretta Auvil, NCSA
• Technical Lead: Duane Searsmith, NCSA
• Use Cases and Communities Involvement: Loretta Auvil, NCSA
• Usability Evaluator: Tara Bazler, Indiana University
• Software and Application Developers
– Bernie Acs, NCSA
– Vered Goren, NCSA
– Amit Kumar, NCSA
– Xavier Llora, NCSA
– Mary Pietrowicz, NCSA
– Andrew Shirk, NCSA
– David Tcheng, NCSA
• Humanities Domain and Communications Consultant
– Kelly Searsmith, NCSA
• Community Advisors
– Tim Cole, Mathematics Librarian, UIUC