The National Centre for Text Mining

21/03/05 National Centre for Text Mining

Anne E Trefethen

Deputy Director,

e-Science Core Programme

…and its ramifications for e-Science and the other way round


A Definition of e-Science

‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’

John Taylor

Director General of Research Councils

Office of Science and Technology


Licklider’s Vision for the Internet

“Lick had this concept – all of the stuff linked together throughout the world, that you can use a remote computer, get data from a remote computer, or use lots of computers in your job.”

Larry Roberts – Principal Architect of the ARPANET


UK e-Science Programme

Collaborative projects

Director’s Management Role

Director’s Awareness and Co-ordination Role

Generic Challenges: EPSRC (£15m) £16.2m, DTI (£15m)

Industrial Collaboration

Pilot Application Programme

• PPARC (£26m) £31.6m
• BBSRC (£8m) £10.0m
• MRC (£8m) £13.1m
• NERC (£7m) £8.0m
• ESRC (£3m) £10.6m
• EPSRC (£17m) £18.0m
• CLRC (£5m) £5.0m

Research Councils (£74m) £96.3m, DTI (£5m)

Technical Advisory Group

£250m government investment over 5 years


Powering the Virtual Universe

http://www.astrogrid.ac.uk (Edinburgh, Belfast, Cambridge, Leicester, London, Manchester, RAL)

Multi-wavelength showing the jet in M87: from top to bottom – Chandra X-ray, HST optical, Gemini mid-IR, VLA radio. AstroGrid will provide advanced, Grid based, federation and data mining tools to facilitate better and faster scientific output.

Picture credits: “NASA / Chandra X-ray Observatory / Herman Marshall (MIT)”, “NASA/HST/Eric Perlman (UMBC)”, “Gemini Observatory/OSCIR”, “VLA/NSF/Eric Perlman (UMBC)/Fang Zhou, Biretta (STScI)/F Owen (NRA)”

AstroGrid slides courtesy of Nick Walton, Cambridge


Gamma Ray Bursts

Image from ESO

Image + IRIS data

D. Ducros, ESA

Reprocessing of ionospheric STP data: change coords from earth to celestial

Collate data from multiple telescopes over months - meta data issues

Localise GRB alert in minutes – as they fade rapidly.

SWIFT satellite observes gamma ray burst

Compare against SN light curves – bump shows evidence for a SN in the GRB (Price et al, 2002)

Interaction with observatory pipe-lines

Cross reference multi-λ data – ID pre-cursor and/or environment

Large computational photometric redshift calcs on multi-λ > gives distance


Dark Matter + Large Scale Structure

X-ray cluster: Chandra X-ray (Mullis) overlaid on a deep BRI image (Clowe & Luppino).

Image from ESO

Multiple large image sources: registration & association

Source ID from multiplexed spectral data

Multi-TB ΛCDM models, e.g. Millennium Sim

Generate Shear Maps c.f. CDM models > DM distribution with redshift

Remove stars

correlate gals with z

Colour-Colour relationships classification in multi-phase space

Automatic cluster finding techniques


Some facts on Astronomy data

• Virtual observatories
– Many national virtual observatories containing data at different wavelengths. Estimated:
  • US NVO project alone will store 500 Terabytes/year
  • Laser Interferometer Gravitational-Wave Observatory (LIGO) generates 250 Terabytes/year
  • VISTA, the Visible and Infrared Survey Telescope, estimated to generate 250 Gigabytes of raw data/night – 10 terabytes of stored data/year.
• Together with data analysis, there is a need to combine the data with previously published knowledge on the same astronomical time/space events


The eDiaMoND Project

Hardware, Software and People Skills

People Skills

University Relations

Life Sciences

Worldwide Grid

Breast Screening Programmes

Engineering and Physical Sciences Research Council

Medical Research Council

eDiaMoND

eDiaMoND slides courtesy of David Gavaghan, Oxford


UK Breast Screening – Today

Began in 1988

Women 50-64 screened every 3 years, 1 view/breast

~100 Breast Screening Programmes
- Scotland
- Wales
- Northern Ireland
- England

1,300,000 - Screened in 2001-02
65,000 - Recalled for Assessment
8,545 - Cancers detected
300 - Lives per year Saved

230 - Radiologists (Double Reading)

Film

Paper

Statistics from NHS Cancer Screening web site


UK Breast Screening – Challenges

230 - Radiologists (Double Reading)
50% - Workload Increase

2,000,000 - Screened every Year
120,000 - Recalled for Assessment
10,000 - Cancers
1,250 - Lives Saved

Began in 1988

Women 50-70 screened every 3 years, 2 views/breast + demographic increase

~100 Breast Screening Programmes
- Scotland
- Wales
- Northern Ireland
- England

Digital

Digital


UK Breast Screening – Workflow

Call: 1000

Screening: All Clear 960, Recall 40

Assessment: All Clear 34, Cancer 6

Missed: 1 (Interval Cancers)

Training

Epidemiology

~100 Breast Screening Programmes

Previous

Current
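The workflow figures can be read as a simple funnel. The sketch below assumes the slide's numbers are per 1,000 women called; it derives recall and detection rates from them and sets them beside the 2001-02 national figures quoted on the Today slide. The function and variable names are illustrative only.

```python
def rate(part, whole):
    """Return part/whole as a percentage, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Figures from the workflow slide (assumed to be per 1,000 women called).
called, all_clear_screening, recalled = 1000, 960, 40
all_clear_assessment, cancers = 34, 6

# Sanity checks: each stage's outcomes add up to the stage's input.
assert all_clear_screening + recalled == called
assert all_clear_assessment + cancers == recalled

workflow_recall = rate(recalled, called)
workflow_detection = rate(cancers, called)

# National figures for 2001-02 from the Today slide.
national_recall = rate(65_000, 1_300_000)
national_detection = rate(8_545, 1_300_000)

print("recall %:", workflow_recall, "vs", national_recall)
print("detection %:", workflow_detection, "vs", national_detection)
```

The two sets of rates are of the same order, which suggests the per-1,000 diagram is a rounded picture of the national statistics rather than a separate data set.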


eDiaMoND – Scope


Teaching

Diagnosis

Screening

Data: 32 MB / Image, 256 TB / Year
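The 256 TB / Year figure is consistent with the numbers on the Challenges slide if every woman screened contributes two views of each breast and decimal units are used (1 TB = 1,000,000 MB). The following is a back-of-envelope sketch under those assumptions, not the project's own calculation; all names are illustrative.

```python
# Back-of-envelope check of the yearly image volume (assumptions labelled).
WOMEN_SCREENED_PER_YEAR = 2_000_000  # from the Challenges slide
VIEWS_PER_BREAST = 2                 # from the Challenges slide
BREASTS = 2                          # assumption: both breasts are imaged
MB_PER_IMAGE = 32                    # from this slide
MB_PER_TB = 1_000_000                # assumption: decimal units


def yearly_volume_tb(women, views_per_breast, breasts, mb_per_image):
    """Total image volume per year in terabytes."""
    images = women * views_per_breast * breasts
    return images * mb_per_image / MB_PER_TB


total_tb = yearly_volume_tb(
    WOMEN_SCREENED_PER_YEAR, VIEWS_PER_BREAST, BREASTS, MB_PER_IMAGE
)
print(f"{total_tb:.0f} TB / year")
```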


Training

Screening

Workstation Grid

Previous

Current

Compute

Standard Mammo Format

Data Mining

CADe, CADi

~4 Breast Screening Programmes


eDiaMoND – Compute

Mammograms have different appearances, depending on image settings and acquisition systems

Standard Mammo Format

Temporal mammography

Computer Aided Detection

3D View


eDiaMoND – Data

Data Images

Logical View is One Resource

Patient   Age   …   Image
107258    55    …   1.dcm
236008    62    …   2.dcm
700266    59    …   3.dcm
895301    58    …   4.dcm
…         …     …   …

Data: DICOM (×4)

Grid

Compute

Standard Mammo Format

Data Mining

CADe, CADi
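“Logical View is One Resource” suggests that records held in separate DICOM stores are presented to the user as a single table. A minimal sketch of that idea follows. The site names, the split of rows across sites and the in-memory lists standing in for each store are assumptions made for illustration; this is not eDiaMoND's actual data layer.

```python
from dataclasses import dataclass
from itertools import chain
from typing import Dict, Iterator, List


@dataclass(frozen=True)
class ImageRecord:
    patient: int
    age: int
    image: str  # DICOM file name


class LogicalView:
    """Presents several site-local stores as one queryable resource."""

    def __init__(self, sites: Dict[str, List[ImageRecord]]):
        self._sites = sites

    def records(self) -> Iterator[ImageRecord]:
        # Walk every site's store as if it were a single table.
        return chain.from_iterable(self._sites.values())

    def query(self, min_age: int, max_age: int) -> List[ImageRecord]:
        return [r for r in self.records() if min_age <= r.age <= max_age]


# Hypothetical split of the slide's four rows across two sites.
view = LogicalView({
    "site_a": [ImageRecord(107258, 55, "1.dcm"), ImageRecord(236008, 62, "2.dcm")],
    "site_b": [ImageRecord(700266, 59, "3.dcm"), ImageRecord(895301, 58, "4.dcm")],
})
matches = view.query(55, 59)
print([r.image for r in matches])
```

The caller never names a site: the query is written against the logical table, and the view decides which stores to consult.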


myGrid: Directly Supporting the e-Scientist

Partners: Manchester, EBI, Southampton, Nottingham, Newcastle, Sheffield, AstraZeneca, GlaxoSmithKline, Merck KGaA, Epistemics Ltd, GeneticXchange, Network Inference, IBM, SUN Microsystems

http://mygrid.man.ac.uk

myGrid slides courtesy of Carole Goble


myGrid Project

• Imminent ‘deluge’ of genomics data

• Highly heterogeneous

• Highly complex and inter-related

• Convergence of data and literature archives

(courtesy of Carole Goble, Manchester)

http://www.mrc.ac.uk/PDFs/dem_gen.pdf


Information Weaving

• Large amounts of different kinds of data & many applications.

• Highly heterogeneous.
– Different types, algorithms, forms, implementations, communities, service providers

• High autonomy.

• Highly complex and inter-related, & volatile.



An in silico experiment = a web of interconnected information and components

Provenance record of workflow runs

Provenance of the workflow template. Related workflows.

People

Ontologies describing workflows

Services used

Notes

Data in and out

Literature
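The web of information around an in silico experiment can be pictured as a record that links each workflow run to its template, people, services, notes, data and literature. The sketch below is illustrative only: the class, field and service names are hypothetical and do not reflect myGrid's actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkflowRun:
    """One run of a workflow, with links to everything it touched."""
    template: str                      # workflow template the run came from
    person: str                        # who ran it
    services_used: List[str] = field(default_factory=list)
    data_in: List[str] = field(default_factory=list)
    data_out: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)
    literature: List[str] = field(default_factory=list)


def runs_using(runs: List[WorkflowRun], service: str) -> List[WorkflowRun]:
    """Find runs that depended on a given service, e.g. after it changes."""
    return [r for r in runs if service in r.services_used]


# Hypothetical provenance log with two runs.
runs = [
    WorkflowRun("template-1", "alice", ["service-x", "service-y"],
                ["in.dat"], ["out.dat"]),
    WorkflowRun("template-2", "bob", ["service-y"]),
]
affected = runs_using(runs, "service-x")
print([r.template for r in affected])
```

Keeping the links explicit is what lets a question such as "which results depended on this service?" be answered from the provenance record alone.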



The eBank Project

• Building links between e-research data, from the CombeChem project, with scholarly communication and other on-line sources

• Investigating the role of aggregator services in linking data-sets from Grid enabled projects to open data archives contained in digital repositories through to peer-reviewed articles as resources in portals

• JISC-funded project led by UKOLN in partnership with the Universities of Southampton and Manchester


Grid

E-Scientists

Entire E-Science Cycle

Encompassing experimentation, analysis, publication, research, learning

Institutional Archive

Local Web Publisher

Holdings

Digital Library

E-Scientists Graduate Students

Undergraduate Students

Virtual Learning Environment

E-Experimentation

E-Scientists

Technical Reports

Reprints

Peer-Reviewed Journal & Conference Papers

Preprints & Metadata

Certified Experimental Results & Analyses

Data, Metadata & Ontologies


Generic Issues

• In the next 5 years e-Science projects will produce more scientific data than has been collected in the whole of human history

NSF “Atkins” report on Cyberinfrastructure

• ‘the primary access to the latest findings in a growing number of fields is through the Web, then through classic preprints and conferences, and lastly through refereed archival papers’.

• ‘archives containing hundreds or thousands of terabytes of data will be affordable and necessary for archiving scientific and engineering information’.


Generic Issues cont

• The data deluge from e-Science projects requires grid technologies to facilitate discovery, analysis and curation of data

• The sheer volume of published text and newly appearing results is impossible for researchers to read and correlate

• Effective automated processing is required to research, locate, gather and make use of knowledge encoded electronically in the available literature


Bioscience and biomedicine

• Bioscience and biomedicine have resulted in a huge volume of domain literature

• Open Access publishers such as BioMed Central have a growing number of full-text articles

• Integration of literature and data analysis is of increasing importance - linking factual biodatabases to literature, using publishers to check, complete or complement contents of such databases
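Linking a factual database to the literature can be illustrated, in its simplest form, as a dictionary lookup of known names in article text. The sketch below is a toy under stated assumptions: the identifiers `P1` and `P2`, the name lists and the sample sentence are hypothetical, and real systems need far more than exact matching.

```python
import re

# Hypothetical database entries: identifier -> names used in the literature.
database = {
    "P1": ["BRCA1", "breast cancer 1"],
    "P2": ["TP53", "p53"],
}


def link_mentions(text, entries):
    """Return {identifier: [matched names]} for names found in the text."""
    links = {}
    for ident, names in entries.items():
        found = [
            name for name in names
            if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE)
        ]
        if found:
            links[ident] = found
    return links


# Hypothetical sentence standing in for an article abstract.
abstract = "Mutations in BRCA1 and p53 were observed."
links = link_mentions(abstract, database)
print(links)
```

Once a mention is tied to an identifier, the article can be used to check or complement the corresponding database entry.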


• NaCTeM is establishing high-quality service provision in text mining for the academic community – focus initially on biological and biomedical science

• Enabling e-Science applications!


Grid Technologies enabling Text mining

• Text mining process involves many steps
• Potentially many tools
• Large amounts of text and data to be analysed
• Requiring temporary storage of intermediate results
• Access large resources, ontologies, document collections etc
• Compute-intensive algorithms
• Portal access to data and compute resources
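The points above describe a multi-step process whose intermediate results need temporary storage. A toy sketch of such a pipeline follows. The three steps (sentence splitting, tokenising, term counting) and the temporary-directory store are illustrative assumptions, not NaCTeM's tools or architecture; on a Grid each step could be a separate service and the store a shared resource.

```python
import json
import re
import tempfile
from collections import Counter
from pathlib import Path


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenise(sentences):
    """Lower-case each sentence and split it into alphanumeric tokens."""
    return [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]


def count_terms(token_lists):
    """Count how often each token occurs across all sentences."""
    return dict(Counter(tok for toks in token_lists for tok in toks))


def run_pipeline(text, steps, workdir):
    """Run steps in order, writing each intermediate result to workdir."""
    data = text
    for i, step in enumerate(steps):
        data = step(data)
        out = Path(workdir) / f"{i:02d}_{step.__name__}.json"
        out.write_text(json.dumps(data))
    return data


with tempfile.TemporaryDirectory() as workdir:
    counts = run_pipeline(
        "Text mining finds terms. Grid services run text mining tools.",
        [split_sentences, tokenise, count_terms],
        workdir,
    )
    stored = sorted(p.name for p in Path(workdir).iterdir())

print(counts["mining"], stored)
```

Writing every intermediate result to disk means a later step can be re-run or swapped for another tool without repeating the earlier ones.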


Conclusions

• The vast amount of electronic literature and digital text demands automatic capabilities for effective analysis

• Combining this capability with data analysis is of growing importance for some research areas

• The future services provided by NaCTeM will form a significant piece of the toolset for e-Science applications


Acknowledgements

Thanks to

• Carole Goble and the myGrid team

• Liz Lyon and the eBank team

• Dave Gavaghan and the eDiaMoND project

• Nick Walton and AstroGrid
