September 2005 NVO Summer School 1 Object Classification in the Virtual Observatory: A VO Status Report Tom McGlynn NASA/GSFC THE US NATIONAL VIRTUAL OBSERVATORY


Object Classification in the Virtual Observatory: A VO Status Report

Tom McGlynn, NASA/GSFC

THE US NATIONAL VIRTUAL OBSERVATORY


How do we know what we want in the VO?

Pretend the VO exists.

What is the science we are doing with it?

Now try to do that science and see what gets in the way.


Can we classify ROSAT X-ray sources?

All RASS Sources (124,730)

Classified RASS Sources ~7,000

Total RASS Sources ~130,000


What do we want to do?

• Find counterparts to ROSAT X-ray sources in optical, IR, radio.

• Train a classifier to use multiwavelength information to determine type of objects.

• Classify all of the objects seen by ROSAT.


What is classification?

• Translation from observables to distinct physical processes.

• Each element classified independently of others
• Is classification different from measurement?
• Classification versus cataloging
• Usually classify ‘objects’ but also…
  – Events: GRBs, solar flares, …
  – Simulated data
  – Pixels/regions in an image: Earth and planetary studies, shocked regions, …


It’s not just us.

http://aria.arizona.edu/courses/tutorials/class/html/class.html

A typical plot of objects to be classified?

There is lots of information and discussion of classification outside astronomy.


Examples

• Moving versus fixed stars
• Classes of stellar spectra (ordered by strength of Balmer lines).
  – Substitute for a measurement
  – Cf. Dwarf versus giant

• Osterbrock diagram: AGN versus star-forming emission line galaxies.

• Bautz-Morgan types of clusters of galaxies
  – Dominance of cluster by central galaxy.

• Types of x-ray sources: AGN, SNR, pulsars, XRBs, …


Galaxy Classification


Why do we classify?

• Understand a given field.
• Generate statistical samples.
• Compare different regions/observations.
• Find rare objects.
• Remove unwanted backgrounds.
• Plan subsequent observations.
• …


Do we know what we are looking for?

• Yes: We have a good idea of the kinds of objects that are in the field.
  – Supervised classification
  – Find out which regions of observable ‘phase space’ belong to which classes and use that knowledge to classify new sources.
• No: We don’t really know what we’re looking at.
  – Unsupervised classification
  – Is there any structure in the phase space distribution?


Supervised versus unsupervised classification

Supervised and Unsupervised Land Use Classification, Chris Banman

http://www.emporia.edu/earthsci/student/banman5/perry3.html


Supervised classification

• Often has a ‘training’ phase where a priori knowledge is used to tune the classifier algorithm. Training takes most of the time.
  – But Osterbrock diagram based on theoretical modeling.
• We specify a list of output classes.
• May give a list of probabilities of membership in more than one class.
• Algorithms: Neural networks, nearest neighbor, decision trees
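
As an illustration of the nearest-neighbor option and of returning membership probabilities rather than a single label, here is a minimal sketch in Python. It is not the ClassX classifier (which used oblique decision trees); the feature names and training values are invented for the example.

```python
import math
from collections import Counter

def knn_probabilities(train, query, k=3):
    """Return {class: fraction of the k nearest training points in that class}.

    train is a list of (feature_vector, label) pairs; query is a feature vector.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}

# Hypothetical training set: (X-ray hardness ratio, optical colour) -> class.
# These numbers are made up purely to exercise the function.
training = [
    ((0.10, 0.30), "star"),
    ((0.20, 0.40), "star"),
    ((0.15, 0.35), "star"),
    ((0.80, 1.20), "AGN"),
    ((0.90, 1.10), "AGN"),
    ((0.85, 1.30), "AGN"),
]

probs = knn_probabilities(training, (0.7, 1.0), k=3)
best = max(probs, key=probs.get)  # most probable class for the query source
```

The "training" here is trivial (the classifier just stores the labelled points); for neural networks or decision trees the training phase is where most of the time goes.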


Supervised classifier training

Neural Networks

Oblique Decision Trees


Unsupervised classification

• Tries to find natural groupings of data.
• User often specifies number of classes to find.
• Classes found are anonymous: it is up to the user to define physical meaning.
• Self-organizing maps, K-means, C-means, hierarchical clustering, Gaussian mixtures
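
To make the K-means idea concrete, here is a minimal pure-Python sketch. The user supplies the number of classes `k`, and the output labels are anonymous integers. Seeding the centroids with the first `k` points is a simplifying assumption for this example, and the sample points are invented.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-means. Returns (centroids, labels).

    Initial centroids are simply the first k points (an assumption made
    to keep the example deterministic; real tools usually seed randomly).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels

# Two invented clumps in a two-dimensional observable space.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
```

The labels 0 and 1 carry no physical meaning by themselves; deciding that one clump is, say, stars and the other galaxies is left to the user.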


Self-organizing maps

Catalogs in VizieR

K-means

Fuzzy C-means


Some key questions.

1. (S) What output classes are we interested in, and what degree of resolution do we want?
   • Star versus galaxy or A0V versus SBa
   (U) How many classes might we expect?
2. What input data sets are we going to use?
3. How are we going to get them?
4. How do we combine them?
5. What observables are available? Which are useful?
6. (S) What training sets are available?
   (U) How do we understand the output classes?
7. What algorithm are we going to use in classification?
8. How can we test the results so that we believe them?


Specification/Count of Output Classes

We weren’t sure how detailed our classifications could be and had to play with the classifiers to see what might be feasible.

Does the VO help?
Not directly. This will often be implicit in the problem. By making other aspects of classification easier, the VO makes playing around with this choice easier.


What input data sets are we going to use?

We knew which datasets we were going to use but we added one along the way.

Does the VO help?
Maybe. VO registries can help find resources, but these will often be implicit in the problem.


How are we going to get the data?

We used custom interfaces to get data from different resources, but VOTables were developed early enough for us to use (Perl VOTable parser from the ClassX effort). This took a fair bit of work.

Does the VO help?
A lot. Just a few standard ways to get the data and nice standard ways of defining them. Limits on some services are still annoying. New libraries can make this part really easy. Large XML files are cumbersome to process in many tools.
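
As a rough sketch of what reading a VOTable involves, the Python below parses a tiny hand-written table with the standard library. The table contents and column names are invented, and the document omits the XML namespace that real service responses usually declare, so this is an illustration of the FIELD/TABLEDATA layout rather than a general-purpose parser.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified VOTable (no namespace; values are made up).
VOTABLE_XML = """<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="sources">
      <FIELD name="ra" datatype="double" unit="deg"/>
      <FIELD name="dec" datatype="double" unit="deg"/>
      <FIELD name="count_rate" datatype="float" unit="ct/s"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>10.5</TD><TD>-3.2</TD><TD>0.12</TD></TR>
          <TR><TD>185.0</TD><TD>12.7</TD><TD>0.05</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

def read_votable_rows(xml_text):
    """Return the table rows as a list of {column name: float value} dicts.

    Assumes a single table whose columns are all numeric.
    """
    root = ET.fromstring(xml_text)
    names = [field.get("name") for field in root.iter("FIELD")]
    rows = []
    for tr in root.iter("TR"):
        values = [float(td.text) for td in tr.iter("TD")]
        rows.append(dict(zip(names, values)))
    return rows

rows = read_votable_rows(VOTABLE_XML)
```

Loading the whole document into memory like this is also why large XML files become cumbersome; a streaming parser would be the usual workaround.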


How do we combine them?

We used custom software. This took a lot of work, but we had to deal with the issue of multiple counterparts for each X-ray source.

Does the VO help?
A lot. XMatch does a lot of what we want, though not everything. Note that the spatial matching capabilities in TOPCAT allow merging of data from ConeSearch too.
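
The multiple-counterpart issue can be illustrated with a brute-force positional match in Python. This is only a sketch of the idea, not how XMatch or TOPCAT work internally; the source names, coordinates, and 30-arcsecond radius are invented for the example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(xray_sources, candidates, radius_deg):
    """Map each X-ray source name to ALL candidates within radius, nearest first.

    Both inputs are lists of (name, ra_deg, dec_deg). This is an O(N*M) loop,
    fine for a demonstration but too slow for full-sky catalogs.
    """
    matches = {}
    for name, ra, dec in xray_sources:
        hits = []
        for cand_name, cand_ra, cand_dec in candidates:
            sep = angular_separation_deg(ra, dec, cand_ra, cand_dec)
            if sep <= radius_deg:
                hits.append((sep, cand_name))
        matches[name] = [cand for _, cand in sorted(hits)]
    return matches

# Invented positions: X1 has two optical candidates, X2 has none.
xray = [("X1", 10.0, 20.0), ("X2", 150.0, -5.0)]
optical = [("O1", 10.002, 20.001), ("O2", 10.005, 19.998), ("O3", 150.2, -5.0)]
matches = cross_match(xray, optical, radius_deg=30 / 3600)
```

Keeping every candidate within the radius, rather than only the nearest, leaves the choice among multiple counterparts to a later step.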


What observables are available? Which are useful?

This took a lot of work. Understanding what variables were available and getting full descriptions was difficult.

Does the VO help?
A little. Visualization tools like Mirage are nice for getting a feel for the data, but non-VO tools (e.g., IDL itself) may do this just as well. Documentation in the VO is probably not better than before, but a common framework for getting information to users is available if providers ever get around to providing adequate documentation.


Classification needs right information, not all information.

Hughes Effect

Classification of Multi-Spectral Data by Joint Supervised-Unsupervised Learning

(Shahshahani & Landgrebe)


Training set/ground truth data

We knew most of the training data in advance.

Does the VO help?
The VO registry may point out some possibilities, but training or truth data may be implicit in the problem.


What algorithm are we going to use in classification?

We had experience with oblique decision trees.

Does the VO help?
A little. VOStat provides a few capabilities for unsupervised classification, but the Web interface is a little flaky. Web service interfaces to a few standard classifiers might be nice. VO could do a lot more here.


VOStat

• See www.vostat.org
• Statistics routines on-line with VO interface.
• Downloadable library
• Fairly minimal Web interface
• Includes K-means and hierarchical clustering tools.


How can we test the results so that we believe them?

We found a number of independently classified sets of objects and checked for consistency.

Does the VO help?
Yes. This is probably where we can most effectively use VO resources we discover in the registry. However, a couple of the samples we used were not yet published.
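
A consistency check against an independently classified sample can be as simple as tabulating how often the two sets of labels agree. The Python sketch below shows one way to do that; the labels are invented and are not results from ClassX.

```python
from collections import Counter

def confusion_counts(predicted, reference):
    """Count (reference label, predicted label) pairs."""
    return Counter(zip(reference, predicted))

def agreement_fraction(predicted, reference):
    """Fraction of sources where the classifier matches the reference label."""
    same = sum(1 for p, r in zip(predicted, reference) if p == r)
    return same / len(reference)

# Invented labels for five sources: classifier output versus an
# independently classified sample of the same sources.
predicted = ["AGN", "star", "AGN", "XRB", "star"]
reference = ["AGN", "star", "star", "XRB", "star"]

confusion = confusion_counts(predicted, reference)
agreement = agreement_fraction(predicted, reference)
```

The off-diagonal entries of `confusion` (here one reference star labelled as an AGN) show which classes the classifier tends to mix up, which is usually more informative than the overall agreement alone.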


Testing the results

Classify independently classified datasets.

Check faint sources?


Overall…

A lot of progress since we started ClassX but plenty of issues still remain.


A ClassX phase space slice



Science

• Probabilistic classifications of all ROSAT X-ray sources: McGlynn et al., 2004ApJ...616.1284M

• New HMXRBs: Suchkov and Hanisch, 2004ApJ...612..437S