
September 2005, NVO Summer School

Object Classification in the Virtual Observatory: A VO Status Report

Tom McGlynn, NASA/GSFC

THE US NATIONAL VIRTUAL OBSERVATORY


How do we know what we want in the VO?

Pretend the VO exists.

What is the science we are doing with it?

Now try to do that science and see what gets in the way.


Can we classify ROSAT X-ray sources?

All RASS Sources (124,730)

Classified RASS Sources ~7,000

Total RASS Sources ~130,000


What do we want to do?

• Find counterparts to ROSAT X-ray sources in optical, IR, radio.

• Train a classifier to use multiwavelength information to determine type of objects.

• Classify all of the objects seen by ROSAT.


What is classification?

• Translation from observables to distinct physical processes.

• Each element classified independently of others

• Is classification different from measurement?

• Classification versus cataloging

• Usually classify ‘objects’ but also…
  – Events: GRBs, solar flares, …
  – Simulated data
  – Pixels/regions in an image: Earth and planetary studies, shocked regions, …


It’s not just us.

http://aria.arizona.edu/courses/tutorials/class/html/class.html

A typical plot of objects to be classified?

There is a lot of information about, and discussion of, classification outside astronomy.


Examples

• Moving versus fixed stars

• Classes of stellar spectra (ordered by strength of Balmer lines).
  – Substitute for a measurement
  – Cf. Dwarf versus giant

• Osterbrock diagram: AGN versus star-forming emission line galaxies.

• Bautz-Morgan types of clusters of galaxies
  – Dominance of cluster by central galaxy.

• Types of X-ray sources: AGN, SNR, pulsars, XRBs, …


Galaxy Classification


Why do we classify?

• Understand a given field.

• Generate statistical samples.

• Compare different regions/observations.

• Find rare objects.

• Remove unwanted backgrounds.

• Plan subsequent observations.

• …


Do we know what we are looking for?

• Yes: We have a good idea of the kinds of objects that are in the field.
  – Supervised classification
  – Find out which regions of observable ‘phase space’ belong to which classes and use that knowledge to classify new sources.

• No: We don’t really know what we’re looking at.
  – Unsupervised classification
  – Is there any structure in the phase space distribution?


Supervised versus unsupervised classification

Supervised and Unsupervised Land Use Classification, Chris Banman

http://www.emporia.edu/earthsci/student/banman5/perry3.html


Supervised classification

• Often has a ‘training’ phase where a priori knowledge is used to tune the classifier algorithm. Training takes most of the time.
  – But Osterbrock diagram based on theoretical modeling.

• We specify a list of output classes.

• May give a list of probabilities of membership in more than one class.

• Algorithms: Neural networks, nearest neighbor, decision trees
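
As an illustration of a supervised classifier that returns membership probabilities for more than one class, here is a minimal nearest-neighbor sketch. It is not the ClassX method; the feature names, training values, and the function `knn_probabilities` are all hypothetical, chosen only to show the idea.

```python
import math
from collections import Counter

def knn_probabilities(train, query, k=3):
    """Return {class: fraction of the k nearest training points in that class}."""
    # train: list of (feature_vector, label) pairs with known classes.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / k for label, count in votes.items()}

# Hypothetical training set: two made-up observables per source, plus a known class.
training = [
    ((0.1, 0.2), "star"),
    ((0.2, 0.1), "star"),
    ((0.3, 0.3), "star"),
    ((2.0, 1.9), "AGN"),
    ((2.2, 2.1), "AGN"),
    ((1.8, 2.3), "AGN"),
]

# A new source lying between the two groups gets a split vote.
probs = knn_probabilities(training, (1.1, 1.1), k=3)
best = max(probs, key=probs.get)
print(probs, best)
```

The fractions are only a crude stand-in for probabilities; with a small `k` they take just a few discrete values.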


Supervised classifier training

Neural Networks Oblique Decision Trees


Unsupervised classification

• Tries to find natural groupings of data.

• User often specifies number of classes to find.

• Classes found are anonymous – it is up to the user to define physical meaning.

• Self-organizing maps, K-means, C-means, hierarchical clustering, Gaussian mixtures
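
A minimal K-means sketch shows both points above: the user supplies the number of classes (here through the initial centers), and the output labels are just anonymous integers. The data values are made up, and taking the starting centers from the data is an assumption for the sake of a deterministic example.

```python
import math

def kmeans(points, centers, n_iter=20):
    """Plain K-means: alternate an assignment step and a centroid update."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # leave an empty cluster's center in place
        if new_centers == centers:
            break  # converged: assignments can no longer change
        centers = new_centers
    return labels, centers

# Hypothetical two-dimensional observables forming two obvious groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (3.0, 3.1), (3.2, 2.9), (2.9, 3.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

Nothing in the output says what cluster 0 or cluster 1 is physically; that interpretation is left to the user.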


Self-organizing maps

Catalogs in VizieR K-means

Fuzzy C-means


Some key questions.

1. (S) What output classes are we interested in, and what degree of resolution do we want?
   • Star versus galaxy or A0V versus SBa
   (U) How many classes might we expect?

2. What input data sets are we going to use?

3. How are we going to get them?

4. How do we combine them?

5. What observables are available? Which are useful?

6. (S) What training sets are available?
   (U) How do we understand the output classes?

7. What algorithm are we going to use in classification?

8. How can we test the results so that we believe them?


Specification/Count of Output Classes

We weren’t sure how detailed our classifications could be and had to experiment with the classifiers to see what was feasible.

Does the VO help?

Not directly. This will often be implicit in the problem. By making other aspects of classification easier, the VO makes it easier to experiment with this choice.


What input data sets are we going to use?

We knew which datasets we were going to use but we added one along the way.

Does the VO help?

Maybe. VO registries can help find resources but these will often be implicit in the problem.


How are we going to get the data?

We used custom interfaces to get data from different resources, but VOTables were developed early enough for us to use (Perl VOTable parser from the ClassX effort). This took a fair bit of work.

Does the VO help?

A lot. There are just a few standard ways to get the data and nice standard ways of defining them. Limits on some services are still annoying. New libraries can make this part really easy. Large XML files are cumbersome to process in many tools.
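
To give a feel for what a VOTable looks like to a program, here is a small sketch that reads the FIELD definitions and TABLEDATA rows of a tiny hand-written table using only the Python standard library. The table contents and the helper `read_votable` are hypothetical, and the sketch ignores most of the VOTable standard (binary encodings, arrays, null values, multiple tables).

```python
import xml.etree.ElementTree as ET

# A tiny hand-written VOTable; the source names and positions are invented.
SAMPLE = """<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="hypothetical_sources">
      <FIELD name="name" datatype="char" arraysize="*"/>
      <FIELD name="ra" datatype="double" unit="deg"/>
      <FIELD name="dec" datatype="double" unit="deg"/>
      <DATA><TABLEDATA>
        <TR><TD>SRC A</TD><TD>10.5</TD><TD>-3.25</TD></TR>
        <TR><TD>SRC B</TD><TD>200.0</TD><TD>45.0</TD></TR>
      </TABLEDATA></DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
"""

def local(tag):
    """Strip an XML namespace prefix such as '{http://...}' from a tag name."""
    return tag.rsplit("}", 1)[-1]

def read_votable(text):
    """Return the table rows as a list of {field name: value} dictionaries."""
    root = ET.fromstring(text)
    fields = [(el.get("name"), el.get("datatype"))
              for el in root.iter() if local(el.tag) == "FIELD"]
    rows = []
    for tr in root.iter():
        if local(tr.tag) != "TR":
            continue
        cells = [td.text for td in tr if local(td.tag) == "TD"]
        row = {}
        for (name, dtype), cell in zip(fields, cells):
            if cell is None or not cell.strip():
                row[name] = None  # empty cell
            elif dtype in ("double", "float"):
                row[name] = float(cell)
            else:
                row[name] = cell
        rows.append(row)
    return rows

rows = read_votable(SAMPLE)
print(rows)
```

This loads the whole document into memory, which is exactly what becomes cumbersome for large XML files; `xml.etree.ElementTree.iterparse` is one way to stream rows instead.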


How do we combine them?

We used custom software. This took a lot of work, but we had to deal with the issue of multiple counterparts to each X-ray source.

Does the VO help?

A lot. XMatch does a lot of what we want, though not everything. Note that the spatial matching capabilities in TOPCAT allow merging of data from ConeSearch too.
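
The multiple-counterpart issue can be seen in a brute-force positional match. The sketch below is not the ClassX software or XMatch; the catalog entries and the 30 arcsecond match radius are illustrative assumptions. It keeps every counterpart inside the radius, nearest first, rather than only the closest one.

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(xray, optical, radius_arcsec):
    """Map each X-ray source name to all optical names within the radius."""
    matches = {}
    for xname, xra, xdec in xray:
        found = []
        for oname, ora, odec in optical:
            sep = angular_sep_deg(xra, xdec, ora, odec) * 3600.0
            if sep <= radius_arcsec:
                found.append((sep, oname))
        matches[xname] = [name for _, name in sorted(found)]  # nearest first
    return matches

# Hypothetical catalogs: (name, RA in degrees, Dec in degrees).
xray = [("X1", 150.0, 2.0), ("X2", 10.0, -30.0)]
optical = [("O1", 150.002, 2.001), ("O2", 150.005, 2.0), ("O3", 50.0, 50.0)]

matches = cross_match(xray, optical, radius_arcsec=30.0)
print(matches)
```

Here X1 picks up two candidate counterparts and X2 none, so whatever comes next has to handle both cases. The double loop scales as the product of the catalog sizes and would need a spatial index for catalogs of realistic size.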


What observables are available? Which are useful?

This took a lot of work. Understanding what variables were available and getting full descriptions was difficult.

Does the VO help?

A little. Visualization tools like Mirage are nice for getting a feel for the data, but non-VO tools (e.g., IDL itself) may do this just as well. Documentation in the VO is probably not better than before but a common framework for getting information to users is available if providers ever get around to providing adequate documentation.


Classification needs the right information, not all information.

Hughes Effect

Classification of Multi-Spectral Data by Joint Supervised-Unsupervised Learning

(Shahshahani & Landgrebe)
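
One simple way to ask which observables carry the right information is to score each one by how well it separates two known classes. The sketch below uses a Fisher-style score (squared difference of class means over the summed class variances). This is a swapped-in illustration, not the method of the cited report or of ClassX, and the observable names and values are invented.

```python
from statistics import mean, pvariance

def fisher_score(values_a, values_b):
    """Squared difference of class means divided by the summed class variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    if spread == 0:
        return float("inf")
    return (mean(values_a) - mean(values_b)) ** 2 / spread

# Hypothetical training values for two classes and two observables.
stars = {"hardness": [0.1, 0.2, 0.15, 0.25], "sky_noise": [5.0, 1.0, 3.0, 2.0]}
agn = {"hardness": [0.8, 0.9, 0.85, 0.75], "sky_noise": [4.0, 2.0, 1.0, 5.0]}

scores = {name: fisher_score(stars[name], agn[name]) for name in stars}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

An observable with a score near zero, like the made-up `sky_noise` here, adds dimensions to the problem without adding separation, which is the situation the slide warns about.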


Training set/ground truth data

We knew most of the training data in advance.

Does the VO help?

The VO registry may point out some possibilities, but training or truth data may be implicit in the problem.


What algorithm are we going to use in classification?

We had experience with oblique decision trees.

Does the VO help?

A little. VOStat provides a few capabilities for unsupervised classification, but the Web interface is a little flaky. Web service interfaces to a few standard classifiers might be nice. The VO could do a lot more here.
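
For readers unfamiliar with oblique decision trees: an ordinary decision tree tests one observable against a threshold at each node, while an oblique tree tests a linear combination of observables. The sketch below shows only how such a tree classifies a source once it exists; the weights, threshold, and class labels are hypothetical, and training the tree is not shown.

```python
from dataclasses import dataclass
from typing import Sequence, Union

@dataclass
class Leaf:
    label: str

@dataclass
class ObliqueNode:
    weights: Sequence[float]  # one coefficient per observable
    threshold: float
    below: Union["ObliqueNode", Leaf]  # branch taken when the score <= threshold
    above: Union["ObliqueNode", Leaf]  # branch taken when the score > threshold

def classify(node, features):
    """Walk down the tree, testing a weighted sum of the features at each node."""
    while isinstance(node, ObliqueNode):
        score = sum(w * x for w, x in zip(node.weights, features))
        node = node.above if score > node.threshold else node.below
    return node.label

# A one-node tree splitting on the difference of two made-up observables.
tree = ObliqueNode(weights=(1.0, -1.0), threshold=0.0,
                   below=Leaf("star"), above=Leaf("AGN"))

first = classify(tree, (2.0, 0.5))
second = classify(tree, (0.5, 2.0))
print(first, second)
```

An axis-parallel tree would need several nodes to approximate the diagonal boundary that this single oblique node expresses directly.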


VOStat

• See www.vostat.org

• Statistics routines on-line with VO interface.

• Downloadable library

• Fairly minimal Web interface

• Includes K-means and hierarchical clustering tools.


How can we test the results so that we believe them?

We found a number of independently classified sets of objects and checked for consistency.

Does the VO help?

Yes. This is probably where we can most effectively use VO resources we discover in the registry. However, a couple of the samples we used were not yet published.


Testing the results

Classify independently classified datasets.

Check faint sources?
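
The consistency check against independently classified objects amounts to tabulating where the two sets of classes agree and disagree for sources they share. A minimal sketch, with hypothetical source names and classes:

```python
from collections import Counter

def confusion_counts(predicted, reference):
    """Count (reference class, predicted class) pairs for sources in both sets."""
    common = predicted.keys() & reference.keys()
    return Counter((reference[name], predicted[name]) for name in common)

def agreement_fraction(counts):
    """Fraction of shared sources where the two classifications agree."""
    total = sum(counts.values())
    agree = sum(n for (ref, pred), n in counts.items() if ref == pred)
    return agree / total if total else float("nan")

# Hypothetical classifier output and an independent classification.
predicted = {"S1": "AGN", "S2": "star", "S3": "AGN", "S4": "galaxy", "S5": "star"}
reference = {"S1": "AGN", "S2": "star", "S3": "star", "S4": "galaxy", "S6": "AGN"}

counts = confusion_counts(predicted, reference)
fraction = agreement_fraction(counts)
print(counts, fraction)
```

The off-diagonal entries, such as the source called a star by the reference but an AGN by the classifier, are usually more informative than the single agreement fraction.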


Overall…

A lot of progress since we started ClassX but plenty of issues still remain.


A ClassX phase space slice



Science

• Probabilistic classifications of all ROSAT X-ray sources: McGlynn et al., 2004ApJ...616.1284M

• New HMXRBs: Suchkov and Hanisch, 2004ApJ...612..437S

http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2004ApJ...616.1284M&db_key=AST&link_type=ABSTRACT&high=4324fe0c1e23605

http://adsabs.harvard.edu/cgi-bin/nph-data_query?bibcode=2004ApJ...612..437S&db_key=AST&link_type=ABSTRACT&high=4324fe0c1e23605
