
Automation and Text Mining

Ben O'Steen, 25 May 2010

Hi

Ben O'Steen

(Software Engineer, Oxford University Libraries)


Freelance Enthusiast

Text-mining (and related techniques):

Processing natural language to gain direct and contextual information, often with a means to quantify this information's accuracy.

Automation (in this context):

Making decisions, providing additional options and increasing the amount of information understood by a system without the need for human effort.

BI Search and Text Analytics[1] says:

structured data in first place at 47%, trailed by unstructured (31%) and semi-structured data (22%).

That is 53% unstructured or semi-structured data, and this was amongst data management professionals!

[1] BI Search and Text Analytics, 2006 http://tdwi.org/research/2007/07/tdwi-best-practices-reports.aspx

Natural Language Processing

Taxing; it is a difficult process, often requiring multiple analyses and a lot of compute power to get reasonable response times.

Developing. Every year, new and better solutions are found.

Multi-disciplinary. The skills required by the team are broad: Machine-learning, Linguistics, Statistics, Logic and so on.

Natural Language Processing

'Natural Language Processing': 982,000 occurrences (Google)

+Multidisciplinary - 25,800

+Multi-disciplinary - 10,400

~27% of the time, 'NLP' occurs in a page with the term 'multidisciplinary'.

Crude, but an interesting result.

'Text-mining': 557,000 occurrences (Google)

(I'll come back to why it is important to note that NLP is heavily multidisciplinary shortly)

'Real-world' NLP

Machine-learning; 'MoreLikeThis' (Amazon)

Search-engines (from Google's to Apache Solr; a 'MoreLikeThis' query sketch follows this list)

Google's Adwords: Advertising/Marketing

And just about any business that trades solely over the web.
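
For a flavour of what 'MoreLikeThis' looks like in practice, here is a minimal sketch of a query against a local Apache Solr instance; the core name ('articles'), the field ('text') and the document id are hypothetical placeholders, not a definitive recipe.

```python
# Minimal sketch: ask a local Apache Solr instance for documents similar to
# one it already holds, via the MoreLikeThis component. Core name, field name
# and document id are hypothetical placeholders.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    'q': 'id:some-document-id',  # the seed document (hypothetical id)
    'mlt': 'true',               # switch on the MoreLikeThis component
    'mlt.fl': 'text',            # field(s) to mine for 'interesting' terms
    'mlt.count': 5,              # how many similar documents to return
    'wt': 'json',                # ask for a JSON response
})

with urlopen('http://localhost:8983/solr/articles/select?' + params) as resp:
    results = json.loads(resp.read().decode('utf-8'))

# Similar documents come back in a 'moreLikeThis' section keyed by the seed id.
print(results.get('moreLikeThis', {}))
```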

Predictive analytics (game theory)

Business/stock-market predictive models

Customer 'churn'

Credit ratings

Search and Indexing

Bing - the 'Information Overload' campaign

TrueKnowledge - NLP + a semantic knowledge base

'Smart' results: not WYTIWYG (What You Typed Is What You Get) but WYMIWYG (What You Meant Is What You Get)

Specific information added to results, formatted based on the meaning of your search query.

Term-extraction

Identification of non-trivial phrases which have meaning outside of the context of the text, or which come from a pre-selected ontology.

Characterisation (Yahoo): Yahoo has offered an API for a term extraction service since 2007 - http://developer.yahoo.com/search/content/V1/termExtraction.html

Thomson-Reuters' OpenCalais service
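
Neither of those services is needed to get a feel for the idea. Below is a toy, purely local approximation of term extraction that pulls out recurring capitalised phrases; real services such as the Yahoo extractor or OpenCalais use far richer linguistic and statistical models, so treat this only as an illustration with invented sample text.

```python
# Toy approximation of term extraction (not the Yahoo or OpenCalais services):
# pull out capitalised multi-word phrases and rank them by frequency.
import re
from collections import Counter

def candidate_terms(text):
    # Runs of two to four capitalised words, e.g. "Natural Language Processing"
    return re.findall(r'\b(?:[A-Z][a-z]+\s+){1,3}[A-Z][a-z]+\b', text)

sample = (
    "Natural Language Processing is multidisciplinary. "
    "Oxford University Libraries hold many texts that Natural Language "
    "Processing could help to describe."
)

for term, count in Counter(candidate_terms(sample)).most_common(5):
    print(count, term)
```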

Academic NLP

Very much a trending topic in the Humanities!

The 'Digital Humanities' term is becoming popular, and a large part of it is NLP and statistics.

A number of specialist conferences, barcamps and workshops have begun to establish themselves, like THATCamp.

Many understand that they could do more with their sources but lack the programmatic skill to do so.

NLP in libraries.

Under-represented and under-used.

Library and repository software are built with the assumption that metadata will be understood and structured before it even enters the system.

The side-effect for repositories is that they rapidly hit a dead-end for voluntary uptake.

This assumption needs to be smashed to pieces and never made again.

A perfect area in which to investigate using text-mining on the 'messy metadata' in situ?

Why are people scared of NLP?

NLP uptake issues

Research-heavy background: many of the useful, recent developments are (somewhat ironically) 'hidden' inside PDFs and blocks of text in peer-reviewed articles.

No easy entry-point: high-level overviews are straightforward to find and read.

Access to in-depth material is reasonably good if your library has academic subscriptions.

But there is little middle ground between the two.

Cannot be blindly applied for good results (yet): the widely used and established tools (GATE, UIMA workflows) assume you know, or can plan, the technical steps you need to achieve your aim.

It is hard to find information on the opposite, but more common, approach: 'I want to do X; tell me the technical steps I should take.'

Google Prediction API?

A long learning curve: you are often required to know what the various algorithms are and do.

It requires an understanding of language syntax!

Being multidisciplinary makes it very easy for documentation to 'not state the obvious'.

Not quite reached the personal-network tipping point yet: applied computer science is heavily peer-learnt.

Few have someone they know personally who can explain NLP and text-mining analyses.

Multidisciplinary problems

Tough psychological issue (IMO): it is very difficult - for those who understand it - to realise how much information is required.

If the barrier for new people to join in is too high, then they give up and go elsewhere....


which means that those writing documentation never get good feedback from those they trust or know 'socially'...


which tends to reinforce an RTFM mentality. A negative feedback loop; self-ostracising.

Technical Issues

None of the well-established toolkits are attractive to work with.

U-compare is Java-focussed and has a SOAP API. (The NAMES project announced at the 2007 CRIG DRY meeting that their service was going to have a SOAP API; the developer who made that announcement was loudly booed by the majority of attendees.)

GATE: I showed several colleagues the GATE developer UI. 'Urgh. Doesn't this just have a simple service I can use?' summed up the general sentiment.

SOAP

Web application developers are avoiding SOAP in droves - avoiding both implementing it and using it.

Google dropped their exemplary SOAP search interface in 2006

Amazon and eBay have both SOAP and REST APIs; they routinely see orders of magnitude more calls using the REST API than the SOAP one.

You can't make a SOAP call easily from JavaScript.

Java

People are searching for it less.

From my experience, very few developers are starting with Java web apps anymore.

Simple API?

Can we take inspiration from Google?

Is it possible to offer some purposeful tools or UIMA workflows in a way that can be accessed by any browser*?

(*without a Java applet)
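
As a thought experiment, here is a minimal sketch of the kind of 'simple service' colleagues keep asking for: POST plain text at an HTTP endpoint, get JSON back, callable from any browser or script. Flask is just one assumed choice of micro-framework, and the extraction step is the same toy heuristic as earlier, standing in for a real tool (GATE, UIMA, OpenCalais...); the point is the shape of the interface, not the analysis behind it.

```python
# Minimal sketch of a browser-friendly NLP endpoint. Flask is one assumed
# choice of micro-framework; the toy capitalised-phrase heuristic stands in
# for a real term extractor behind the same simple interface.
import re
from collections import Counter
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/terms', methods=['POST'])
def terms():
    text = request.get_data(as_text=True)
    found = re.findall(r'\b(?:[A-Z][a-z]+\s+){1,3}[A-Z][a-z]+\b', text)
    # Return the ten most frequent candidate terms as JSON.
    return jsonify({'terms': Counter(found).most_common(10)})

if __name__ == '__main__':
    app.run(port=5000)
```

It could then be called with something like curl --data-binary @article.txt http://localhost:5000/terms, or from in-page JavaScript - no applet required.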

What 'I' want

aka: services that I, or people I know, have either been asked for or have seen the desperate need for.

What 'I' want

The ability to select text in any webpage, right-click and overlay analytical results in the same page. E.g. domain-specific term-extraction.

'Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article'*, Shotton et al, 2009 - marked up an academic article, by hand.

Enhanced article: http://dx.doi.org/10.1371/journal.pntd.0000228.x001

*Cited article: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000361

What 'I' want

Citation 'chasing': perform a check to see if a cited document really has the outcomes suggested by the citing document.

Sentiment analysis guided by domain term-identification?

Textual similarity: offer the reader a selection of similar passages from the cited text? (A minimal sketch of this follows the list.)

Meta-analysis tools to verify the robustness and acceptability of cited research.

Trace back to find the 'source' of a particular idea.
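
A minimal sketch of the textual-similarity idea, assuming scikit-learn is available: score candidate passages from the cited text against the citing sentence using TF-IDF vectors and cosine similarity. The sentences here are invented for illustration; real passages would be segmented out of the cited document.

```python
# Minimal sketch of 'textual similarity' for citation chasing: score passages
# of a cited text against the sentence that cites it, using TF-IDF vectors
# and cosine similarity (scikit-learn assumed). Example sentences are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

citing_sentence = "Smith et al. report a large reduction in processing time."
cited_passages = [
    "We observed a reduction in processing time of roughly forty per cent.",
    "The corpus was drawn from nineteenth-century newspaper archives.",
    "Funding was provided by the research council.",
]

vectors = TfidfVectorizer().fit_transform([citing_sentence] + cited_passages)
scores = cosine_similarity(vectors[0:1], vectors[1:]).flatten()

# Offer the reader the best-matching passages from the cited text first.
for score, passage in sorted(zip(scores, cited_passages), reverse=True):
    print(round(float(score), 2), passage)
```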

What 'I' want

'But that information is already in the PDF! I don't want to type it out again.'

Automatic metadata suggestion for repositories of texts (a toy sketch follows this list).

Could be applied retroactively.

Common types of 'extractable' information:

Article/monograph metadata (who, when, where, etc.)

Subject and cross-domain classification to aid searching/browsing

Semantic expression of textual citations
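
A toy sketch of the metadata-suggestion idea, assuming the PDF has already been converted to plain text (for example with pdftotext). The heuristics are deliberately naive and the sample text is invented; anything found would only ever be offered as a suggestion for a human to confirm.

```python
# Toy sketch of automatic metadata suggestion from the text of a deposited
# article. Assumes the PDF has already been converted to plain text (e.g.
# with pdftotext). Deliberately naive heuristics; suggestions only.
import re

def suggest_metadata(text):
    suggestions = {}

    doi = re.search(r'\b10\.\d{4,9}/\S+\b', text)
    if doi:
        suggestions['doi'] = doi.group(0)

    year = re.search(r'\b(19|20)\d{2}\b', text)
    if year:
        suggestions['year'] = year.group(0)

    # Crude title guess: the first non-empty line of the document.
    for line in text.splitlines():
        if line.strip():
            suggestions['title'] = line.strip()
            break

    return suggestions

# Invented sample text, standing in for the output of pdftotext.
sample = "A Study of Messy Metadata\nJ. Bloggs, 2009\ndoi:10.1234/abcd.5678"
print(suggest_metadata(sample))
```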

Summary

Use of NLP and related techniques is needed, now more than ever.

We need more people to understand how to perform certain tasks.

We need to make systems more capable of holding heuristically obtained information.

NLP needs simple entry points and more 'Hello World'-type examples.

We need NLP web apps that can be accessed from the browser.