A joint newsletter of the Statistical Computing & Statistical Graphics Sections of the American Statistical Association.
Winter 2001, Vol. 12 No. 1
A WORD FROM OUR CHAIRS
Statistical Graphics
Deborah F. Swayne is the 2001 Chair of the Statistical Graphics Section. Here is the news that she has from the section.
State of the section
We had a lively discussion on the section officers' mailing list this winter about the health of the section, which doesn't seem to be as active or innovative as it might be. We need to continue to think about how to engage the membership and how to productively focus our efforts. Contributions to this discussion from other section members would be most welcome.
We have some notable successes, such as the continuing popularity of our invited sessions and short courses, but we also have some notable weaknesses. I'll consider some of them here and describe the responses we're working on, and then tell you about some new ideas that have come up this year.
Student paper competition: a merger
We have not been very successful at attracting enough papers for a meaningful competition – we don't know how much this is due to our own inefficiency and how much to the fact that the number of papers on graphics produced each year is quite small, whether by students or not.
CONTINUED ON PAGE 4
Statistical Computing
Mark Hansen is the 2001 Chair of the Statistical Computing Section. Here are some of his news items.
Winter weather has finally settled over the northeast, bringing with it a kind of normalcy that I've been longing for. I have nothing to do on this chilly afternoon except revisit some of the initiatives I've been able to accomplish as Chair of the Section. Perhaps a simple list is the best format:
Section Web site
The domain www.statcomputing.org is now the official home of the Computing Section. On the site you will find a detailed history of the Section, news items, awards updates, conference announcements, bulletin boards and links to the Newsletter. Have a look and let us know what you think!
Section CD-ROM
Tony Rossini has recently compiled a collection of essential statistical software. We plan to distribute this collection in CD form to all active members of the Section. We are hoping that this will become an annual affair, offering a regular snapshot of the latest Computing tools.
CONTINUED ON PAGE 3
SPECIAL FEATURE ARTICLE
Report from the Workshop on Modern Statistical Computing and Graphics in Academia
The Statistical Computing and Graphics sections jointly sponsored a workshop prior to Interface 2001 to brainstorm about how to recognize and promote statistical computing and graphics in academia. Participants discussed some of the problems facing young academics in our field, and sought actions that individuals and organizations might take to respond to these problems.
The workshop was attended by about two dozen people, most of them young academics and the rest senior academics and researchers in industry and the military.
The program for the workshop broke the day into fourparts. Each of the first three sessions began with a shortpresentation and continued with discussion involvingthe whole group.
• Rewarding research in statistical computing (withDi Cook and Thomas Lumley as presenters anddiscussion leaders)
• Collaboration with other disciplines (Chris Gen-ovese, Robert Gentleman)
• Curriculum content and development (RossIhaka, David Madigan, Michael Minnotte)
In the final session, workshop participants summarized the day's discussions and came up with suggestions for action. These are some of the more concrete proposals that emerged:
• Problem: Statistics Departments have a hard time evaluating work in statistical computing when it appears in tenure and promotion packages.
Response: Strengthen the outlets for publishing work in statistical computing, especially software. We agreed that we would begin this process by adding our support to JSS rather than trying to spawn any parallel effort.
Actions: Formulate a document providing guidelines for evaluating software that might be used by Statistics Departments in promotion and tenure evaluations, and for review of journal articles on software.
• Problem: Statistical computing faculty have a hard time convincing their departments to adopt innovative practices.
Response: Build a web site which gathers data on policies, programs, and practices that are friendly to statistical computing.
Actions: We can begin with a survey of section members, and supplement what we learn with results of other ongoing investigations of statistics departments.
• Problem: Computational literacy is low in the Statistics profession.
Response: Offer introductory classes at Statistics meetings. Topics that would be appropriate include databases, scripting languages, and an introduction to R. Appropriate meetings include ENAR, WNAR, and the JSM. We would offer the classes as cheaply as possible, and offer them free to section members.
Actions: The sections should work with their Continuing Education representatives and Program Chairs.
• Problem: Statistics Department curricula offer few courses in statistical computing.
Response: Many departments are now experimenting, but there's little sharing of experience.
Actions: Collect syllabi, organize and make them available on the web.
• Problem: Junior faculty have to learn a lot in a short time.
Response: Offer experience and advice in some tricky areas, such as grant applications and preparing tenure and promotion packages.
Action: Approach statisticians who have had success writing grant proposals, and those who have had experience in evaluating them, and ask them to write an article sharing their expertise.
If you wish to contribute to any of these activities, please talk to Debby Swayne or Mark Hansen.
The workshop organizers are grateful to Pfizer, Inc. and Stata, whose generous contributions supplemented – and greatly exceeded – the seed money provided by the two sections.
Deborah F. Swayne, Di Cook, Mark Hansen, Guy Nason, Duncan Temple Lang and Adalbert Wilhelm
http://www.statcomputing.org/workshops/scga/
Editorial
We have finally managed to get an issue out in 2001. Maybe we can think of it as the El Niño issue, since it's coming out just in time for Christmas and New Year's Eve. This is our first all-electronic issue; we hope you enjoy it, and especially remember to use it as a navigation guide. In particular, all live links are surrounded by blue frames, so clicking on them will either wake your browser and take you to the site or, if your browser is already running, simply open the relevant page in it.
In conjunction with Interface '01 the sections held a workshop on Modern Statistical Computing and Graphics in Academia. There is a report from this workshop in this issue, and the aim is to have several more substantial reports from the workshop in 2002 issues of the newsletter.
Perhaps as a consequence of the workshop, this newsletter issue has a healthy selection of articles on software. Three commercial software companies show their wares: Spotfire, Visual Insights, and S-Plus. Antony Unwin describes the software research activity at the University of Augsburg, and Lee Wilkinson describes a new concept, "streaming graphics," underlying his software Dancer. The next issue in 2002 will have several more articles describing software.
We have dedicated our two statistical computing topics to the SVD (singular value decomposition), which seems to be more and more popular as complex multivariate data drown our computers!
We have also included an account of an experimental class for teaching statistical computing; in the future, we would like to hear about our members' experiences with other software in the classroom.
Finally, we wish to encourage submissions to the newsletter. This year has been a sloooow year for getting the newsletter out, but we promise to be more diligent in the new year.
Dianne Cook, Susan Holmes
Editors, Statistical Graphics and Computing Sections
FROM OUR CHAIRS (Cont.). . .
Statistical Computing
CONTINUED FROM PAGE 1
Refereed Papers.
This year, the Section on Statistical Computing is introducing an experimental refereeing process for contributed papers submitted to the Joint Statistical Meetings. We see this service as helping our members in two ways. First, by introducing some form of refereeing we can highlight outstanding contributed papers on statistical computing. Secondly, we hope to improve the overall quality of contributed papers and presentations at the JSM. Details of the process can be found on the Section Web site or in a recent Amstat News article.
The Joint Statistical Meetings.
Doug Nychka put together an outstanding program for the JSM in Atlanta this year. The 2002 meetings will be in New York City, and Tim Hesterberg already has an impressive slate of sessions prepared. To welcome you all to New York, the Section will be raffling off a pair of tickets to the Broadway show The Producers. They are excellent seats for the Tuesday night performance during the JSM.
Elections.
Please join me in congratulating our new Executive Committee members: Leland Wilkinson (Chair-Elect), Mary Lindstrom (Program Chair-Elect), Charles Kooperberg (Secretary/Treasurer) and Robert McCulloch (Council of Sections Representative). I also want to thank our outgoing Secretary/Treasurer, Merlise Clyde. She has done a great job managing our finances for the last three years, and will be greatly missed.
Section Awards.
Lionel Galway has organized another very successful set of Section-sponsored competitions. Both the Student Paper Competition and the J. M. Chambers Award continue to draw outstanding entries. The winners for 2001 can be found on the Section Web site. The 2002 competition for these prizes is just starting, so please encourage eligible students to enter. (For 2002, the Computing and Graphics Sections have merged resources and will be offering a single, joint Student Paper Competition.) Lionel also arranges our Best Contributed Paper Presentation Award, which this year went to Jay Servidea, a graduate student at the University of Chicago.
Workshop on Modern Statistical Computing and Graphics in Academia.
Just before the Interface meeting this year, our Sections sponsored a workshop that brought together some of the most influential members of the Computing community. In this edition of the Newsletter you will find a report on the steps that we will be taking to help encourage research and education in Statistical Computing and Graphics.
We've accomplished a lot this year, and I expect that our incoming Chair, Susan Holmes, will see that our Section continues to be one of the most active in the ASA. Finally, I want to thank our "new" Newsletter editors, Di Cook and Susan Holmes. Producing this volume is a time-consuming process. Extracting contributions from busy authors (and Section Chairs!) is a thankless job, but the end result is an important resource for our members. Please join me in congratulating them on another outstanding issue.
Peace.
Mark Hansen
Chair, Statistical Computing Section
FROM OUR CHAIRS (Cont.). . .
Statistical Graphics
CONTINUED FROM PAGE 1
As a result, the two sections will merge their student paper competitions into one, to be referred to as the "Statistical Computing and Statistical Graphics student paper competition." The graphics section is grateful to the computing section for supporting this idea, because their annual competition runs like a well-oiled machine and regularly attracts very good papers. The graphics section will help to disseminate publicity, will contribute judges who go to work every January, and will of course contribute prize money.
Data visualization exposition: planned for 2003
We haven't held our popular data visualization exposition or contest in several years, and we were a little bit too late to sponsor one at the 2002 JSM. That means we have plenty of time to plan one for 2003, and David James (chair-elect in 2002) has volunteered to lead the effort. Our target is to have web sites and flyers prepared to announce the contest prominently at the 2002 JSM, and there's a lot of work to be done before then. If you'd like to help plan it, or you have interesting data to suggest, please talk to David ([email protected]).
New initiatives
The graphics section's major new activity this year was its cosponsorship of the workshop on the future of statistical computing and graphics in academia, held in conjunction with Interface 2001. You can read a report on the workshop elsewhere in the newsletter. This was a very interesting experiment for our section, and I hope we continue to participate in the discussion reflected there, and that we contribute to implementing some of the suggestions that emerged.
Two other ideas have been put forth, and again, I'd love to hear from section members who find them interesting:
• Form an awards committee to recognize influential and innovative work in statistical graphics. This could involve honoring past work or new work.
• Design a set of posters that could adorn stat department hallways, to increase awareness of our field.
These two ideas might even have some synergy, sincewe could design posters based on the work of honorees.
Section communication
Mark Hansen and I began to use email this year to communicate with section members, and we hope you approve. We think both sections will continue to use email (sparingly!) to tell you what's going on and to ask for your reactions and ideas. If you would like to receive section email in the future, make sure that your email address in the ASA's records is up to date.
Deborah Swayne
Chair, Statistical Graphics Section
Figure 1: Spotfire DecisionSite: clustering and profiling genes.
SOFTWARE PACKAGES
The Next Stage For Analytics: Bringing Statistics to the Masses
Dr. Bill Ladd, Director of Bioinformatics, Spotfire, Inc.
Applying Algorithms Across the Enterprise
Analytical applications are in high demand these days. For most organizations, that means better managing and analyzing the data generated by processes throughout the enterprise value chain. The volumes of data produced by today's high-throughput processes mean that researchers are increasingly turning to statisticians, computer scientists, and applied mathematicians for assistance with interpreting and analyzing research data. Statisticians need to find ways of making end-users comfortable with the basic methodologies and tools so they can run their own analyses and not have to rely on overworked statistical specialists.
Often, the development of a "correct" mathematical approach to a specific problem is the first in a series of steps that will make a new algorithm usable to end-user scientists. Once developed, the algorithm must be integrated into existing analysis processes. It must be delivered within an analytics framework that encourages data exploration and interpretation. And this framework must be deployed across broad, dispersed user communities – the globally distributed project teams that drive research decision-making.
Once an algorithm has been developed, the issue of deploying it to other researchers and other areas of the enterprise is critical. Algorithm deployments can be deemed inadequate when they fail to mirror the analytical process preferred by researchers. A framework supported by a robust, adaptable platform and able to be configured to the established processes is needed to share such algorithms and "best practices" throughout an enterprise.
Analyzing the data and communicating the results to the appropriate places in the organization to facilitate decision-making is, ultimately, the reason why researchers generate data in the first place. And the analysis tools – be they the latest killer algorithm, a desktop analysis application, or a slide rule – are all means to this important end. Because the questions researchers ask often refer to the data currently moving through their research process, their needs can seem isolated and specific. Analytical specialists, in turn, deliver tools suited to these specific needs – solutions that maximize the use of the data at a specific point in the larger data flow that begins with laboratory information management systems or basic research experiments and ends with a finished product. But to be useful over time and add value to the data produced during research projects, analytics should focus on providing tools that can be added to or replaced within an ongoing and ever-changing analysis process.
Delivering an Analytic Framework: Spotfire DecisionSite
The Spotfire DecisionSite framework addresses these issues and provides a usable eAnalytic application with a number of key elements that help statisticians distribute algorithms and analysis tools.
Data Access
Crucial decisions, those that drive research projects, result when data can be compared across experiments, labs, and fields. Yet too often simply acquiring data from the various sources where it resides is difficult. A successful analytics framework must support truly integrated data access – data from any source, in any format, anywhere, at any time, without any end-user programming.
Data Visualization
Modern computing tools can do more than simply display results – they can provide interactive, fast, and flexible data visualizations that facilitate and even amplify human thought processes. An analytic framework must include data visualization tools that can present complex datasets in unique ways that engage our innate perceptive abilities. These visualizations must be made interactive, encouraging researchers to explore unexpected behavior in their data. Interactive visualizations let machines focus on what they do best – data processing – while humans focus on their talent – making decisions.
Usability
An analytics framework may connect to all the right data sources and incorporate impressive visualizations and analysis options. But if it is not easy to use, it will not get used. A modern analytics framework must cater first to the needs of a wide variety of scientists, offering more sophisticated toolsets as an option for those scientists who want greater control.
"Guides" have emerged as a popular way of simplifying complex tasks: they step users through common tasks, rather than requiring users to search through menus for the option needed to complete a task. Guides are a useful usability tool because they can serve both novice users and power users. And when applied to research, guides can greatly speed many aspects of data analysis, from entering experimental information to setting the parameters of a complex clustering method, statistical analysis, or data reduction algorithm.
Extensibility
Software solutions supporting analytics must be adaptable and readily configurable to support access to new, emerging data sources, new analytical tools, and new analysis processes. The architecture of the platform should be able to incorporate methods developed in many different architectures and runtime environments.
Many analytical applications claim extensibility, but place such strong restrictions on the approaches used to add functionality that, in the end, they are quite limited in scope. When an analysis framework is well planned, such as Spotfire DecisionSite, the application infrastructure itself becomes an important part of the application's functionality.
Deployability
Global project teams encompassing researchers from a variety of disciplines must be able to share data and come to consensus on research directions. Web infrastructures are ideal for these purposes and are the logical choice to underpin an analytics framework. Users gain access to the most up-to-date statistical and analysis methods that apply to their research. With one change to one part of the system, IT staff can deliver new features and functionality to researchers anywhere in the world.
Consistency
Well-planned analytics frameworks do more than meet the access, analysis and usability demands of the individual researcher – they provide consistency across the broader organization. Analytic tools and statistical methods geared towards one application area can be adapted by researchers in another application area for their own purposes. As different user communities configure the environment for access to data and analysis methods relevant to their specific research, new classes of scientific investigation will be enabled that bring these previously disparate classes of data together.
Looking forward
If analytical specialists such as statisticians expect their methods to be adopted and used productively throughout the enterprise, they need to embrace the analysis processes and frameworks used by their customers – the end-user researchers and decision makers. Research is not slowing down, and the generation of experimental data stands only to increase. In evolutionary terms, the fittest research organizations in the future will likely be those that can efficiently analyze data to come to decisions about what to do next. Statistics and analytics must themselves evolve. Industrial-quality research requires industrial-quality software that lets analytical researchers focus on their specialty with access to the latest statistical methods, rather than on data management, reporting or storage.
For more information on the Spotfire DecisionSite eAnalytic application, please visit http://www.spotfire.com or send a request to [email protected].
Visualization for Information Consumers
Stephen G. Eick, Visual Insights
http://www.visualinsights.com
Abstract
Within an organization there are different types of information consumers with varying needs and levels of sophistication. Our focus within the statistical graphics community traditionally has been on building sophisticated tools for highly-skilled analysts. By focusing broadly on information needs and understanding different user tasks, we can create visual tools that appeal to all information consumers within an organization and extend our influence outside of our core area of statistical graphics.
Information Consumers
Within the statistical graphics community we delight in producing sophisticated tools for visual data analysis. Our tools are powerful and incorporate the latest thinking on statistical analysis and data mining. The guiding philosophy, it seems to me, is that we build the tools that we need to perform our analyses. The result is wonderful demonstrations at the JSM and other leading scientific conferences. Sometimes, however, it appears that some magical combination of keys, selections, and mousing was needed to produce the result.
This lack of usability may have to do with our background. In the development of tools we wear multiple hats: frequently we are the creator, developer, user, and evaluator simultaneously. By combining these roles, we can rapidly iterate through various ideas and techniques. There is, however, a negative aspect of combined roles. Our perspective generally is that of a highly trained, sophisticated data analyst, and our tools embody this perspective.
In an organization there are three different types of information consumers. At the highest level, the Executive wants quick summary information so that problems are apparent at a glance and there is enough detail for intuitive decision-making. The Manager, who reports to an executive, has time to explore data and look for trends. He needs the ability to customize reports without being overwhelmed by complexity. The Analyst or Doer wants full access to all available information at a fine-grained, detailed level. He or she has the time and interest to perform detailed studies and deep analyses, and needs powerful tools.
To meet the varying needs of information consumers I have defined four different types of visualizations. For concreteness, I will use examples from website analytics. Website analytics is the study of how users are accessing websites. It involves the collection and analysis of clickstream data, understanding website effectiveness, investigating visitor demographics, studying on-line transactions, and influencing behavior through on-line promotions.
Four Types of Visualizations
This section highlights four types of visualizations, organized in increasing order of complexity.
Digital Dashboards (Figure 1) provide an "at a glance" presentation for quick, intuitive decision-making. The key intellectual property is the choice of statistics, appropriate time scales, pleasing presentation, and simple graphics. Users may manipulate the time ranges and select other tabs in the display for more detailed information. The interesting research issues involve the graphical presentation, workflow, and implementation.
Active Reports (Figure 2) are customizable graphicalinterfaces that are used for flexible reporting, trending
Figure 1: Browser-based Executive Dashboard showing website activity. Green and red triangles highlight changes (positive and negative) from the previous time period.
and publishing. In contrast to browser-based digital dashboards, an active report is more complicated and more flexible. Our implementation of active reports is a client-side Windows application where users may manipulate data, change thresholds, publish HTML snapshots, reaggregate, and export result sets. Active reports are frequently used by business analysts. The key issues involved with active reports include the flexibility of data access, ease of creating reports, and ability to integrate with other systems in an organization.
Figure 2: Active 3D Report. The transparent cutting plane helps users to visually compare bar heights.
Visual Discovery and Analysis (VDA) tools, such as XGobi, are powerful environments that combine state-of-the-art statistical techniques with visual data analysis. They combine both statistical and graphical tools into a common framework.
Task-specific visualizations (Figure 3) capture domain knowledge to address domain-specific issues. For website analytics, a critical issue involves understanding the paths that browsers take to navigate the site – see Figure 3. The intellectual property includes novel visual metaphors and interaction techniques. Task-specific visualizations will be widely used by all information consumers in an organization.
Figure 3: Path visualization for www.visualinsights.com.
Summary and Conclusion
Within an organization there are different types of information consumers that have different needs. This essay highlights three different types of users:
• Executive,
• Manager, and
• Analyst
and four types of visualizations:
• Digital Dashboards,
• Active Reports,
• Visual Discovery and Analysis tools, and
• Task-specific visualizations.
Digital Dashboards, while engineered for an Executive, are useful for all members of an organization. Active Reports and VDA tools address Managers' and Analysts' needs. Task-specific visualizations are role-neutral and may be used by anybody with domain knowledge.
The important take-away message for us in the statistical graphics community is that our background, training, and inclinations encourage us to build VDA tools for analysts. These tools, while highly sophisticated, do not necessarily meet the needs of the other information consumers. To meet these broader needs, we must design tools that target tasks beyond our traditional focus on statistical graphics.
S-PLUS Graphlets
Charles Roosen, PhD, Senior Statistician, Insightful Corporation
S-PLUS 6 for UNIX and S-PLUS 6 for Windows introduce an exciting new technology for displaying graphics over the web. The Graphlet is a Java applet that supports displaying and interacting with S-PLUS graphics that have been saved in a file. The Graphlet can be embedded within a Web page, and viewed by a web browser.
Creating an S-PLUS Graphlet file is as easy as creating a graph in S-PLUS. Use the award-winning S programming language to create your graphic, and then simply save it as a Graphlet. Because the S language was designed explicitly for graphical display and statistical analysis, hundreds of high-level functions are at your disposal for creating an informative and attractive graph with just a few lines of code.
Graphlets allow you to add interactivity to your graphs. Graphlets are live objects, and the viewer can interact with the Graphlet by moving and clicking the mouse within the Web browser. With a simple click on a data point or label the viewer can be provided with further information through on-screen information, another Graphlet, or another page anywhere on the World Wide Web. The applications of this interactivity are endless. Here are just a few examples:
• Click on a financial time series chart to show allthe news stories relevant to the date where themouse was clicked.
• Make the bars in a bar chart active, so that clicking on a bar will drill down to display another chart containing only the data in that bar.
• Link cities on a map to further detailed charts about that city.
• Add interactive labels to a scatter plot to provide more information about your data.
• Provide multiple graphs as labeled pages in a single Graphlet, allowing the viewer to browse among the pages at will.
• Display latitude and longitude coordinates of the mouse position within an image map, and allow the viewer to zoom in and display a specific region.
The Graphlet window provides further tools for the user to zoom in on selected regions of your graphic to view details. And because Graphlets – unlike GIF and JPEG graphics – are rendered on the fly using vector graphics, there is no loss of resolution or pixelation when looking at your graphic "up close and personal".
Since Graphlets are implemented using standard Java applet technology, they can be embedded into any Web page using standard Web authoring tools such as Microsoft FrontPage. Graphlet files are very small (typically 20–60 KB), so they download quickly across the Web. S-PLUS Graphlets can be viewed on any Java-capable browser, including Internet Explorer 5 and Netscape 4. The S-PLUS Graphlet applet is freely redistributable, so you can give S-PLUS Graphlets to friends and colleagues for use on their Web pages, even if they don't have S-PLUS.
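Because a Graphlet is a standard Java applet, embedding one in a page follows the usual applet pattern. The fragment below is only an illustrative sketch: the class, archive, and parameter names are invented for this example, not Insightful's actual ones, so consult the Graphlet documentation for the real values.

```html
<!-- Hypothetical Graphlet embedding; the code, archive, and param
     names here are illustrative, not the actual S-PLUS Graphlet ones. -->
<applet code="GraphletViewer.class" archive="graphlet.jar"
        width="500" height="400">
  <!-- The graph file previously saved from S-PLUS -->
  <param name="graph" value="sales.graphlet">
  Your browser does not support Java applets.
</applet>
```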
For more information on Graphlets and a variety of live examples, see the Graphlet web page at http://www.insightful.com/graphlets.
COSADA Software Projects in Augsburg
Antony R. Unwin, Department of Computer-Oriented Statistics and Data Analysis, Augsburg University
Introduction
MANET, TURNER, CASSATT, KLIMT, MARC — there has been a succession of statistical software developments at the Department of Computer-Oriented Statistics and Data Analysis (COSADA) at Augsburg University in Germany. (Martin Theus's MONDRIAN might also be counted, but he wrote it after he left Augsburg and was at AT&T.) All the software is for interactive analysis. Users should be able to work directly with objects on screen and not have to type in commands. The software is research software, in that the main aim is to
develop and test new interactive ideas, but great emphasis is also put on achieving an intuitive, fast and flexible interface. Ideas can best be tested if they are implemented in an effective way. The programmes have primarily been written by students. While all members of the department have an input into the design and testing of the software, the programming is always the responsibility of just one person.
Each programme is named after an impressionist painter, because software can only help the user to get an impression of their data. MANET and TURNER were developed for Macs, but more recent programmes have been written in Java, so that they can potentially run on any computer system.
CASSATT
CASSATT implements interactive parallel coordinates. Consider, for instance, the 1999 reprise of the famous California-Bordeaux wine tasting data set (www.nerc.com/liquida/report20.html). There were 46 wines which were all scored by 32 judges. Did the judges agree on their rankings? Did at least some of the judges agree? Were the French and the Californian wines ranked differently? Were some wines always ranked higher than others? Which judge(s) were closest to the average ranking of all judges? Can patterns of rankings be determined? There are many interesting questions that may be pursued, and a combination of analytic and graphical methods would be best. The figure shows the rankings by wine in a parallel coordinate plot. There is an axis for each wine and each broken line represents the views of a single judge. The axes have all been scaled equally (a menu command), as the default is to scale each axis individually (a natural default when, unlike here, different variables have non-comparable scales). It is immediately apparent that the judges were not in complete accord! To assist interpretation the wines have been sorted by average ranking (one of many sorting possibilities, again a menu command) with the best to the left. You can see that even the wine with the best average ranking was given a poor rank by one judge and even the wine with the worst average ranking was given a good rank by one judge. The former judge has been selected to see whether his other judgements were also unusual. To improve visibility the lines for the judges who are not selected have been lightened (this is done using a floating dialog box).
No brief description can do justice to a sophisticated piece of software. CASSATT is a fully linked multi-window system providing not just parallel coordinate plots but also standard plots for continuous variables (dotplots, boxplots, scatterplots). It offers the
10 Statistical Computing & Statistical Graphics Newsletter Vol.12 No.1
Figure 3: Rankings by wine in a parallel coordinate plot.
basic interactive tools (including querying, selection, linking, sorting and colouring) but also some specially relevant to parallel coordinates (inverting axes, and selection methods for selecting not only points, but also lines and even angles). Selection sequences, which were introduced in MANET to provide an editable, graphical interface for selections, have been extended in CASSATT to include the selection history in graphic form. The software has been used successfully (i.e. its interactive features are still fast enough) with datasets with over 300 variables and with a dataset with over 20000 cases.
Future directions
CASSATT assumes, like most statistical software, that the data set is supplied as a data matrix. It would be useful to be able to work with more complex structures. For instance, it is currently not possible in the tasting dataset to include which wines are French and which Californian, what vintages they were, and what combinations of grapes they contained. If we had this information, we could interactively sort the wines by country of origin or by vintage.
Parallel coordinates are excellent for continuous vari-ables but not so good for categorical variables. Toomany lines simply overlap and it is hard to see struc-ture. Missing values are another related problem. Newapproaches are needed. We intend to investigate thelinking of mosaic plots and parallel coordinates in thesame package.
And, of course, we would like to interweave CASSATT with R, the way we are doing with KLIMT.
CASSATT was written in Java by Sylvia Winkler, who shared first prize in the inaugural John Chambers software competition for her work.
Check CASSATT's homepage for further information: www1.math.uni-augsburg.de/Cassatt/
Streaming Graphics
Andrew A. Norton, Matthew A. Rubin, Leland Wilkinson, SPSS, Inc.
Abstract
Streaming graphics, a cross between streaming video and dynamic graphics, is a new approach to visualizing data. The goal is to integrate and synchronize multiple streams of data in real time and display them at rates of up to 20 frames per second. This article describes the design of a system, called Dancer, that implements these ideas in Java.
Introduction
The term streaming graphics evokes the terms streaming media (e.g., Feamster, 2001) and dynamic graphics (e.g., Cleveland and McGill, 1988). While similar to both in its outward appearance, streaming graphics is basically different from both. Streaming media systems are generally concerned with delivering sound and video information in real time. Dynamic graphics are concerned with using motion to reveal structure in static data. Streaming graphics systems are concerned with displaying analyses (summaries) of streaming data in real time. Applications of streaming graphics involve
many different environments, including real-time monitoring of manufacturing processes, health indicators, financial statistics, and Web data.
Data
A streaming data source is fundamental to streaming graphics. In the simplest streaming data model, data arrive in a fixed-length buffer (window) at equally spaced time points. Upon the arrival of a new data packet, we lose one of the packets already in the buffer (usually the last). When a new data packet arrives, we refresh our display. We can also have a fixed-length buffer that receives data packets at irregular time intervals. In this environment, the arrival of new data can trigger a display refresh. There are many other prevalent streaming data environments, however, particularly ones involving multiple, asynchronous data events occurring simultaneously within multiple processing threads.
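A minimal sketch of this simplest buffer model, in Python. The class and callback names are our own illustration, not part of Dancer: a fixed-length window evicts its oldest packet on each arrival and triggers a display refresh.

```python
from collections import deque

class StreamBuffer:
    """Fixed-length window over a stream: the arrival of a new packet
    evicts the oldest packet and triggers a display refresh."""

    def __init__(self, length, on_refresh):
        self.window = deque(maxlen=length)   # oldest packet dropped automatically
        self.on_refresh = on_refresh         # display callback

    def arrive(self, packet):
        self.window.append(packet)           # evicts the oldest when full
        self.on_refresh(list(self.window))   # refresh with the current window

# usage: keep a 3-packet window and record each refresh
frames = []
buf = StreamBuffer(3, frames.append)
for x in [1, 2, 3, 4]:
    buf.arrive(x)
# after packet 4 arrives, the window holds only the 3 newest packets
```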
Streaming vs. Static Data
A popular metaphor for the difference between streaming data sources and static data warehouses is a stream versus a pool. In the massive data-mining environment, we often hear this metaphor upgraded to a river versus an ocean. The significance of this distinction, whatever the scale, is that our graphical and statistical algorithms for mining a stream (to stretch the metaphor) must adapt to its temporal nature. We must pay attention to speed of calculation, lest we be swamped with new data packets before we can compute a displayable result. We must allow calculations to persist wherever possible so that we do not waste time redoing subsets of previous calculations. We must learn to sample our data stream, or know how to bail out of calculations when the results might be out of date. We must synchronize our results, lest we mistakenly display summaries of disparate events in the same time frame.
Many of these concerns do not apply to the traditional, static data mining environment. The premier consideration in the static environment is scalability with regard to the size of a dataset. We seek algorithms that scale comparatively well, in the sense of computing time being a constant, linear, or logarithmic function of the size of a dataset. In the streaming data environment, we need to worry about scalability with regard to the momentum of a stream. The momentum of a stream is its mass (the average size of data packets arriving in a fixed interval) times velocity (the average number of packets arriving in a fixed interval). This Newtonian physics metaphor is an over-simplification, but it helps us to understand that attending to scalability with regard to the total bulk of the data we encounter will not be sufficient for building
a streaming display system. Even a constant cpu-timecalculation may still be too slow to handle the real-timedata stream we must display.
Multiple vs. Single Data Source
In the static data environment, we merge different data sources into a single table before computing graphical and statistical summaries. We cannot employ this simplification in the streaming environment. Instead, Dancer is designed to handle multiple data sources. For example, we might want to attach a single display to a live feed from a stock exchange and another live feed from a commodities exchange. There is no time to index and merge these two feeds into a temporal database.
Multiple feeds imply a multi-threaded computing environment with event notification through broadcasters and listeners. The data feeds run simultaneously, and each broadcasts the arrival of new data it receives. The statistical and graphical listeners attend to those messages and react as they choose. In our financial example, the result is a display that shows a moving time series of a particular stock superimposed on a moving time series of a related commodity.
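The broadcaster/listener arrangement can be sketched as follows. The class names and the tuple packets are purely illustrative, not Dancer's actual interfaces.

```python
import threading

class Broadcaster:
    """A data feed that announces each new packet to its listeners."""

    def __init__(self):
        self.listeners = []
        self.lock = threading.Lock()    # feeds may run in separate threads

    def subscribe(self, listener):
        self.listeners.append(listener)

    def broadcast(self, packet):
        # each feed announces new data; listeners react as they choose
        with self.lock:
            for listener in self.listeners:
                listener(packet)

# two independent feeds, one display listening to both
received = []
stocks, commodities = Broadcaster(), Broadcaster()
for feed in (stocks, commodities):
    feed.subscribe(received.append)

stocks.broadcast(("IBM", 112.5))
commodities.broadcast(("gold", 271.0))
```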
Continuous vs. Discrete Time Scales
A consequence of this functional architecture is that Dancer operates on a continuous rather than discrete time scale. In theory, Dancer displays data at any instant. Its time scale moves (at least perceptually) continuously. This is in contrast to some time series displays, such as those seen in traditional ARIMA (Box and Jenkins, 1976) packages, that contain measurements indexed on equally spaced time points.
Dancer has two time-scale modes. In the static-frame mode, geometric representations of data (points, lines, etc.) move through a motionless frame (the data area). We may clip the elements at the frame boundaries or let them pass beyond.
In the dynamic-frame mode, geometric elements are continuously repositioned within a frame that is continuously scrolling forward or backward in time. This mode requires careful programming to allow tick marks, grid lines, and scale labels to scroll smoothly and continuously, in 2D or 3D as required.
Geometry
Dancer is based on a geometric model of statistical graphics described in Wilkinson (1999). That book contrasts the geometric model with the "chart-centric" model of statistical, business, and scientific graphics packages. In the geometric model, elements such as points, lines, and intervals are embedded in a frame defined by an algebra with three operators. These geometric elements are realized (made viewable, hearable, etc.) through a set of aesthetics such as size, shape, texture, and color. The nViZn system for rendering graphs on the Web is based on the same model (Wilkinson et al., 2000).
Unlike nViZn, Dancer can bind each geometric element to a different data source. The behavior of every element in the frame is synchronized through the functional data model. Special rendering technology is required to enable each element to move up to 20 times a second based on incoming data.
Scene Tree vs. Geometric Primitives
A scene tree is a data structure that contains the information in a geometric scene. Scene trees are frequently used in 3D modeling to organize the arrangement and rendering of a collection of geometric objects. For example, if the root of a tree is a body, then the root's children could be torso, head, arms, and legs. The children of arms could be upper arms, lower arms, and hands. And the children of hands could be fingers and palms. The advantages of this organization are several. First of all, children inherit attributes from parents, including aesthetics (e.g., color, texture) and localized coordinates. Second, adding objects to a scene (another body, for example) requires only appending a new subtree to an appropriate node. Third, a tree is a simple object that can be explored through depth-first or breadth-first search. This makes rendering rapid and efficient. And it makes interactions with elements (editing, brushing, linking) a simple matter of walking the tree to locate a selected element.
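A toy rendition of the scene-tree ideas just listed (attribute inheritance and depth-first traversal), assuming a single color aesthetic for brevity; it is a sketch, not Dancer's or Java3D's actual structure.

```python
class Node:
    """A scene-tree node: children inherit aesthetics from their parent,
    and rendering is a depth-first walk of the tree."""

    def __init__(self, name, color=None):
        self.name, self.color, self.children = name, color, []

    def add(self, child):
        self.children.append(child)     # appending a subtree adds to the scene
        return child

    def render(self, inherited_color=None):
        color = self.color or inherited_color   # inherit the aesthetic if unset
        out = [(self.name, color)]
        for child in self.children:             # depth-first traversal
            out.extend(child.render(color))
        return out

body = Node("body", color="red")
arm = body.add(Node("arm"))
arm.add(Node("hand"))
scene = body.render()
# every descendant inherits "red" unless it overrides it
```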
Wilkinson (1994) employed a graph tree model for constructing and rendering statistical graphics in the SYSTAT package. The hierarchical nodes of the tree were Window, Pane, Frame, Graph, and Element. With the Fisher-Anderson Iris data, for example, SYSTAT plots three scatterplot matrices (three species, four variables) in a window by assigning one Pane node to Window and three Frame nodes to Pane. Each Frame contains sixteen Graph nodes. Each Graph node contains two Element nodes (one for a cloud of points and one for a smoother). The most important consequence of this architecture is that SYSTAT has no scatterplot matrix routine. Instead, SYSTAT has a method for assembling graphical elements in a structure of one or more graphs and frames. Anything it can do inside a single scatterplot it can do simultaneously inside multiple scatterplot matrices. For the same reasons, SYSTAT has no routine for drawing a Trellis (Becker et al., 1996). Instead, SYSTAT constructs a Trellis structure by assembling graphs in a hierarchy determined by values on categorizing variables. Details of look-and-feel can be handled in display modules that render this structure in different styles.
Scene trees and graph trees organize geometric primitives. They are essential for creating efficient real-time streaming graphics systems for a variety of reasons, some of which are evident in the SYSTAT architecture. Most important, perhaps, scene tree architecture enables rapid rendering.
Supervised vs. Immediate Rendering
When we render a scene that is rapidly changing, we may re-render the entire scene every time a change is detected or render only the parts that change. If our goal is to render complex scenes that may change up to 20 times a second, we choose the latter strategy.
In a graphics foundation like Java3D, the re-rendering housekeeping is taken care of by the Java scene tree. When the scene tree is updated, only the parts of the screen that need updating are re-rendered. An added benefit is that foundations like Java3D are designed to accommodate 3D hardware accelerators.
Statistics
Statistical graphics depend on the calculation of statistics – point estimates, interval estimates, smoothers, summaries. In a static data environment we can precompute these statistics before rendering a graph. The most prevalent example of this strategy is in OLAP (On Line Analytic Processing) systems based on ETL (Extracting, Transforming, Loading) technology. These systems aggregate data from various sources, store them in a warehouse, and display graphics such as pie and bar charts on the aggregates. The Temple MVV visualization system of Mihalisin et al. (1995) works similarly.
Update/Downdate
Streaming graphics require a different type of statistical algorithm. In the simplest case, such as accumulation of sums and sums of squares and cross-products, we use an update/downdate strategy. When new data arrive in our computational buffer, we discard the oldest data item, downdate its contribution to the accumulated statistics, and update the statistics from the new contribution. This process can accumulate rounding error, so it is essential to occasionally recalculate all values from the data in a window. This recalculation can be scheduled to occur at convenient times, similar to the way garbage collection is handled in an interpretive system.
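The update/downdate strategy, with the scheduled full recomputation, might look like this sketch; the window length and the recomputation interval are arbitrary choices of ours.

```python
from collections import deque

class WindowStats:
    """Running sum and sum of squares over a sliding window, maintained
    by update/downdate, with an occasional full recomputation to flush
    accumulated rounding error."""

    def __init__(self, length, recompute_every=1000):
        self.window = deque(maxlen=length)
        self.s, self.ss = 0.0, 0.0              # sum and sum of squares
        self.updates = 0
        self.recompute_every = recompute_every

    def push(self, x):
        if len(self.window) == self.window.maxlen:
            old = self.window[0]
            self.s -= old                        # downdate the oldest contribution
            self.ss -= old * old
        self.window.append(x)
        self.s += x                              # update with the new contribution
        self.ss += x * x
        self.updates += 1
        if self.updates % self.recompute_every == 0:
            self.s = sum(self.window)            # scheduled full recalculation
            self.ss = sum(v * v for v in self.window)

    def mean(self):
        return self.s / len(self.window)

w = WindowStats(3)
for x in [1.0, 2.0, 3.0, 4.0]:
    w.push(x)
# the window now holds [2, 3, 4]
```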
Iterative calculations are not as simple. We can update and downdate Hessians and Jacobians, but handling convergence in a real-time environment is problematic. Other statistical algorithms present different problems. Guha et al. (2000), Datar et al. (2002), and others are concentrating on these issues.
Visualization
The simplest display for streaming data reflects the current state of the stream. Geometric elements such as points and lines move as the stream progresses. Real time is only one aspect of what Dancer does with streaming data, however.
Instant replay vs. animation
Whatever the time interval (seconds, minutes, hours, ...) we cannot expect a viewer to observe a display continuously. A real-time system needs to incorporate alerts for out-of-bounds conditions. Visual and audio alerts are common in control applications in military, health care, and manufacturing.
When an alert occurs, it is not always clear what antecedent conditions led to its occurrence. Consequently, Dancer provides an instant replay feature to allow a viewer to rewind a scenario and replay it as an animation. This is an aspect of a more general Dancer feature allowing us to animate any data series. Although designed for real-time applications, Dancer's functional time model can be applied to any indexed (ordered) variable so as to allow animation of other variables. In this respect, Dancer can behave like exploratory systems such as XGobi (Swayne et al., 1998) or Data Desk (Velleman, 1988) that animate over a variable.
This type of archival animation needs to be distinguished from Dancer's normal real-time model, however. Archival animation offers the opportunity to pre-process data that we do not have in real time. In the extreme, animation systems such as Flash (www.macromedia.com) can prepare bitmap frames that are played in a media player. These animations are fundamentally different from the architecture in Dancer.
Transformations of time
When we animate a sequence by buffering and replaying captured data, we may transform time in a variety of ways. Reversing time helps us to untangle sequential dependencies. Differencing time helps us to recognize rates. Double-differencing time reveals acceleration/deceleration. Polarizing time (a cyclical transformation) helps us to compare cycles. Logging time helps us to view order-of-magnitude effects. Geologists and cosmologists often log time scales and, as Graham Wills has pointed out (personal communication), ordinary people often use a similar transformation when they look into the future on a day ... week ... month ... year time-scape.
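Two of these transformations, reversal and differencing, are simple enough to sketch on a buffered series; the helper names are our own, purely illustrative.

```python
def reverse(series):
    # reversing time can untangle sequential dependencies
    return series[::-1]

def difference(series):
    # differencing time turns levels into rates
    return [b - a for a, b in zip(series, series[1:])]

values = [1, 4, 9, 16]
rates = difference(values)      # rates of change
accel = difference(rates)       # double-differencing: acceleration
```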
Conclusion
Streaming data require streaming algorithms. Much of the technology for processing these data is relatively new and much is yet to be done. Because statistical graphics is more involved with processing data than with drawing pictures, there is considerable technology transfer that can occur from related fields in computer science.
An indication of the challenges involved in processing real-time data can be seen in the contrasts between Dancer and its sibling nViZn. Both are based on the graphics grammar model and both are programmed in Java to take advantage of a Web environment. The two programs share not a line of code.
References
Becker, R.A., Cleveland, W.S., and Shyu, M-J. (1996). The design and control of Trellis display. Journal of Computational and Graphical Statistics, 5:123-155.

Box, G.E.P., and Jenkins, G.M. (1976). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Monterey, CA: Wadsworth.

Cleveland, W.S. and McGill, M.E., Eds. (1988). Dynamic Graphics for Statistics. Belmont, CA: Wadsworth Advanced Books.

Datar, M., Gionis, A., Indyk, P., and Motwani, R. (2002). Maintaining stream statistics over sliding windows. Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco.

Feamster, N.G. (2001). Adaptive delivery of real-time streaming video. Master's thesis, Department of Electrical Engineering and Computer Science, MIT. Online at http://nms.lcs.mit.edu/papers/feamster-thesis.pdf

Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. (2000). Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS 2000).

Mihalisin, T., Timlin, J., Gawlinski, E., and Mihalisin, J. (1995). Visual analysis of very large multivariate databases. ASA Proceedings of the Section on Statistical Graphics, 18-27.
Swayne, D.F., Cook, D., and Buja, A. (1998). XGobi: interactive dynamic data visualization in the X Window System. Journal of Computational and Graphical Statistics, 7:113-130.

Velleman, P.F. (1988). Data Desk. Ithaca, NY: Data Description Inc.

Wilkinson, L. (1994). SYSTAT, Version 6. Evanston: SYSTAT, Inc.

Wilkinson, L. (1999). The Grammar of Graphics. New York: Springer-Verlag.

Wilkinson, L., Rope, D.J., Carr, D.B., and Rubin, M.A. (2000). The language of graphics. Journal of Computational and Graphical Statistics, 9:530-543.
TOPICS IN STATISTICAL COMPUTING
Rolling Your Own: Linear Model Hypothesis Testing and Power Calculations via the Singular Value Decomposition
Steve Verrill, Mathematical Statistician, USDA Forest Service, Forest Products Laboratory, [email protected]
Abstract
We outline the steps that would permit a statistician to produce special purpose linear model routines through the use of high quality public domain numerical analysis software.
Introduction
Good commercial linear model packages are readily available. It sometimes happens, however, that one would like linear model code that could be incorporated into a simulation. Verrill (1999) discusses such a simulation in the context of predictor sort sampling. See http://ws13.fpl.fs.fed.us/ttconf.html for a web interface to the simulation program.
Further, a sophisticated user can sometimes becomefrustrated with the inflexibility of a commercial pack-age. This can be particularly true if the user is con-fronted with unbalanced data or complex hypotheses. Inaddition, some commercial linear models packages donot include the ability to perform power calculations.
In such cases the user can make use of public domain computer routines that yield flexible linear model capabilities. In this note we step potential users through the computations needed to perform hypothesis tests and power calculations. We follow the theoretical approach of Scheffe (1959). To do the numerical work we make use of the singular value decomposition (see, for example, Thisted (1988)). There are, of course, other numerical techniques that can be used to perform the necessary calculations (see, for example, Kennedy and Gentle (1980), Gentle (1998)). We focus on the singular value decomposition because it yields an approach that is numerically stable, reasonably efficient, and simple to explain and implement. We also suggest the use
of the DCDFLIB public domain package of distributionroutines.
The relation between the singular value decomposition,least squares, generalized inverses, and estimability hasbeen discussed in Good (1969) and Eubank and Webster(1985).
Hypothesis Testing
The standard linear model is

y = Xβ + ε,

where y is the n × 1 vector of responses, X is the n × p design matrix, β is the p × 1 parameter vector, and ε is the n × 1 vector of random errors. We assume that the ε_i are independent, identically distributed N(0, σ²) random variables.
We want to test a hypothesis of the form

c_1^T β = η_1
c_2^T β = η_2    (1)
. . .
c_q^T β = η_q.

(It is often the case that c_i^T 1 = 0, in which case c_i^T is referred to as a "contrast.")
For example, in a one-way ANOVA we are testing
( 1 −1 0 0 . . . 0 )β = 0( 1 0 −1 0 . . . 0 )β = 0
...
( 1 0 0 . . . 0 −1 )β = 0.
To test hypothesis (1) we proceed in a series of steps:
Is the Hypothesis Overspecified?
We need to determine whether the c_i are linearly independent. (If they are not, the hypothesis is overspecified. The user needs to think more clearly about the hypothesis and arrive at a set of c_i's that are linearly independent.)
Let

C_{p×q} = (c_1 . . . c_q),

where q ≤ p. The singular value decomposition of C is

C = U_{p×q} diag(γ_1, . . . , γ_q) V^T_{q×q} = U D_γ V^T,
where the columns of U are orthonormal, V is an orthogonal matrix, and D_γ is the diagonal matrix with the singular values of C, γ_1 ≥ γ_2 ≥ . . . ≥ γ_q ≥ 0, as its diagonal. Thus the rank of C is just the number of nonzero singular values. Because of the limitations of computer arithmetic, the null γ_i's will not in general be exactly equal to zero. We need to determine a threshold value. If a γ_i lies below that threshold, we take it to be equal to zero and conclude that the hypothesis is overspecified. A threshold value that is suggested in the numerical analysis literature (for example, Golub and Van Loan (1996)) is
‖C‖ × (the machine precision).
Our experience suggests that
threshold = ‖C‖ × (the machine precision) × 10
is a better rule of thumb. Recall that

‖C‖ = √( Σ_{i=1}^p Σ_{j=1}^q c_{ij}^2 ) = √( trace(C C^T) )
    = √( trace(U D V^T V D U^T) )
    = √( trace(U D² U^T) )
    = √( trace(D² U^T U) ) = √( γ_1² + . . . + γ_q² ).
For double precision arithmetic on 32 bit computers one would use

√( γ_1² + . . . + γ_q² ) × 10^{−15}

as the threshold value.
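Assuming numpy's SVD as a stand-in for the LINPACK/LAPACK routines discussed later, the overspecification check might be coded as follows (the function name is ours):

```python
import numpy as np

def hypothesis_rank(C, eps=1e-15):
    # singular values gamma_1 >= ... >= gamma_q of the hypothesis matrix C
    gamma = np.linalg.svd(C, compute_uv=False)
    # ||C|| = sqrt(gamma_1^2 + ... + gamma_q^2); threshold = ||C|| * eps * 10
    threshold = np.sqrt(np.sum(gamma ** 2)) * eps * 10
    return int(np.sum(gamma > threshold))

# q = 2 contrasts on p = 3 parameters; in C_bad the second column
# duplicates the first, so the hypothesis is overspecified (rank 1 < q = 2)
C_bad = np.array([[1.0, 1.0],
                  [-1.0, -1.0],
                  [0.0, 0.0]])
C_ok = np.array([[1.0, 1.0],
                 [-1.0, 0.0],
                 [0.0, -1.0]])
```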
Are the c_i's Estimable?
We need to know whether there exists an a_i that satisfies

E(a_i^T y) = c_i^T β

for all β, where E is the expectation operator. This holds true if and only if there exists an a_i that satisfies

a_i^T X = c_i^T  or  c_i = X^T a_i.    (2)
We determine whether equation (2) holds by first obtaining a singular value decomposition of the design matrix, X:

X_{n×p} = U_{n×r} Λ V^T_{r×p},    (3)

where the λ's are nonzero (r ≤ p is the rank of X), or

X^T = V Λ U^T.

Thus the projection operator onto the space spanned by the columns of X^T, L(X^T), is V V^T, and c_i = X^T a_i for some a_i if and only if c_i ∈ L(X^T), or V V^T c_i = c_i. Using double precision arithmetic on 32 bit machines, one would take them to be equal if

‖V V^T c_i − c_i‖ ≤ ‖c_i‖ × 10^{−15}.
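A numpy sketch of the estimability check; we use a slightly looser tolerance than the 10^{−15} quoted above, as a conservative numerical safeguard, and the function name is our own.

```python
import numpy as np

def is_estimable(X, c, eps=1e-15):
    # SVD of the design matrix; the rows of Vt span the row space L(X^T)
    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(lam > lam[0] * eps * 10))   # numerical rank of X
    V = Vt[:r].T
    # c is estimable iff V V^T c = c, i.e. c lies in L(X^T)
    return np.linalg.norm(V @ V.T @ c - c) <= np.linalg.norm(c) * 1e-12

# overparameterized one-way layout (intercept plus two dummies), rank 2:
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
# the treatment difference alpha_1 - alpha_2 is estimable;
# alpha_1 on its own is not
```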
Find the a_i that Satisfies a_i^T X = c_i^T
Given that c_i is estimable, there is a unique a_i in the linear span of the columns of X, L(X), such that

a_i^T X = c_i^T.

(This is one of Scheffe's lemmas.) Using the singular value decomposition of X, equation (3), we know that the solution of c_i = X^T a_i satisfies

c_i = V Λ U^T a_i

or

U Λ^{−1} V^T c_i = U U^T a_i.    (4)

But because a_i ∈ L(X) and U U^T is the projection operator onto L(X), from equation (4) we have

U Λ^{−1} V^T c_i = U U^T a_i = a_i.

Thus, we can use the left-hand side of the equation above to calculate the a_i that satisfies a_i^T X = c_i^T.
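The formula a_i = U Λ^{−1} V^T c_i translates directly into numpy; this is an illustrative sketch, with the reduced SVD playing the role of equation (3).

```python
import numpy as np

def solve_a(X, c, eps=1e-15):
    # reduced SVD of X, keeping only the r nonzero singular values
    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(lam > lam[0] * eps * 10))
    U, lam, Vt = U[:, :r], lam[:r], Vt[:r]
    # a = U Lambda^{-1} V^T c
    return U @ ((Vt @ c) / lam)

# cell-means design, two groups of sizes 2 and 1; c picks out mu_1 - mu_2
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
c = np.array([1.0, -1.0])
a = solve_a(X, c)
# a satisfies a^T X = c^T and lies in L(X)
```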
Find an Orthonormal Basis for the Linear Span of a_1, . . . , a_q — L(A)
We simply perform a singular value decomposition of A = (a_1 . . . a_q) to obtain

A_{n×q} = U_{n×q} Δ V^T_{q×q},

and the columns of U form an orthonormal basis for L(A).
Find the Hypothesis Sum of Squares
Case 1 — η_1 = . . . = η_q = 0. Let u_{A,1}, . . . , u_{A,q} be an orthonormal basis for L(A), found as described in the preceding subsection. Scheffe's theory tells us that the hypothesis sum of squares is

HSS ≡ Σ_{i=1}^q ( u_{A,i}^T y )²

with q degrees of freedom.
Case 2 — at least one of the η's is nonzero. Scheffe's theory tells us that the hypothesis sum of squares is

HSS ≡ ( A^T y − η )^T ( A^T A )^{−1} ( A^T y − η )

with q degrees of freedom.
We can make use of the singular value decomposition of A given above to obtain

( A^T A )^{−1} = V Δ^{−2} V^T.    (5)

Then taking

w ≡ V^T ( A^T y − η ),

we have

HSS = Σ_{i=1}^q w_i² / δ_i².
Find the Residual Sum of Squares
Let u_{X,1}, . . . , u_{X,r} be the columns of the U matrix of the singular value decomposition, equation (3), of X. Then the projection of y onto the linear span of the columns of X, P_{L(X)}(y), equals

( u_{X,1}^T y ) u_{X,1} + . . . + ( u_{X,r}^T y ) u_{X,r},

and the residual sum of squares equals

RSS ≡ ( y − P_{L(X)}(y) )^T ( y − P_{L(X)}(y) )
    = y^T y − ( u_{X,1}^T y )² − . . . − ( u_{X,r}^T y )²

with n − r degrees of freedom.
Form the F Statistic and Compare It to the Appropriate Critical Values
The F statistic equals

( HSS / q ) / ( RSS / (n − r) ).

This should be compared to an F_{q,n−r}(1 − α) critical value, where α is the significance level.
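Putting the preceding steps together for a small worked example: a two-group one-way ANOVA in cell-means form, where the F statistic should equal the squared two-sample pooled-variance t statistic. This is our own numpy sketch, not the article's FORTRAN/Java code.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
X = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)  # cell-means design
C = np.array([[1.0], [-1.0]])                      # H0: mu_1 - mu_2 = 0
n = X.shape[0]

# SVD of X, equation (3), keeping only the r nonzero singular values
Ux, lam, Vtx = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(lam > lam[0] * 1e-14))
Ux, lam, Vtx = Ux[:, :r], lam[:r], Vtx[:r]

# a_i = U Lambda^{-1} V^T c_i, assembled columnwise into A
A = Ux @ (Vtx @ C / lam[:, None])

# orthonormal basis for L(A) and the hypothesis sum of squares (eta = 0)
Ua, delta, _ = np.linalg.svd(A, full_matrices=False)
q = int(np.sum(delta > delta[0] * 1e-14))
HSS = np.sum((Ua[:, :q].T @ y) ** 2)

# residual sum of squares, n - r degrees of freedom
RSS = y @ y - np.sum((Ux.T @ y) ** 2)
F = (HSS / q) / (RSS / (n - r))
# compare F to an F_{q, n-r}(1 - alpha) critical value
```

For these data the group means are 2 and 6 with pooled variance 1, so F reproduces the squared t statistic.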
Confidence Intervals
Suppose that we are interested in confidence intervals on l estimable combinations of the parameters, c_1^T β, . . . , c_l^T β, and further suppose that the linear span of the c's has rank q ≤ l. This rank can be determined as described in section 2.1. Let s² ≡ RSS/(n − r). Then (see Scheffe (1959)) we know that an individual (1 − α) × 100% confidence interval for c_j^T β is

a_j^T y ± s × t_{n−r}(α/2) × √( a_j^T a_j ),

where t_{n−r}(α/2) is the appropriate t critical value, and a_j satisfies a_j^T X = c_j^T (see section 2.3). Also, joint (1 − α) × 100% confidence intervals for the c_j^T β's, j ∈ {1, . . . , l}, are given by

a_j^T y ± s × √( q × F_{q,n−r}(1 − α) × a_j^T a_j ),

where F_{q,n−r}(1 − α) is the appropriate F critical value.
Power Calculations
The noncentrality parameter is given by

NCP ≡ ( A^T Xβ − η )^T ( A^T A )^{−1} ( A^T Xβ − η ) / σ²
    = ( C^T β − η )^T ( A^T A )^{−1} ( C^T β − η ) / σ².    (6)

Note that ( A^T A )^{−1} can be calculated as in equation (5).
Scheffe's noncentrality parameter equals the square root of our noncentrality parameter. Our version corresponds with what the DCDFLIB library (see below) expects.
In a completely general power calculation program, possible values of β, η, and σ² would be specified by a user in the course of power calculations. In the common case in which η = 0 and the β's represent treatment means, it might be more reasonable to expect a user to specify the components of the β vector as fractions of an overall mean and to specify a range of coefficients of variation. This would yield β/σ values that would enable one to calculate NCP.
Under the null hypothesis, C^T β − η = 0 and the noncentrality parameter is 0, but under the alternative hypothesis C^T β ≠ η and the noncentrality parameter is inflated above zero. If we operate at an α significance level, the power is given by

Power = Prob( F_{q,n−r,NCP} > F_{q,n−r}(1 − α) ),    (7)

the probability that a noncentral F distribution with q numerator degrees of freedom, n − r denominator degrees of freedom, and noncentrality parameter NCP lies above the 100(1 − α)th percentile of a central F distribution with q numerator degrees of freedom and n − r denominator degrees of freedom.
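Equation (7) can be evaluated with any noncentral F routine; here is a sketch using scipy.stats in place of DCDFLIB (the wrapper function is ours).

```python
from scipy import stats

def f_test_power(ncp, q, df2, alpha=0.05):
    # central F critical value at the alpha significance level
    crit = stats.f.ppf(1.0 - alpha, q, df2)
    # Power = Prob(F_{q, df2, NCP} > crit), equation (7)
    return stats.ncf.sf(crit, q, df2, ncp)

# power increases with the noncentrality parameter
lo = f_test_power(1.0, q=1, df2=4)
hi = f_test_power(24.0, q=1, df2=4)
```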
Implementing the theory as code
Public domain FORTRAN or C code to perform the singular value decomposition is found in the LINPACK (Dongarra and others (1979)) or (C)LAPACK (Anderson and others (1995)) linear algebra libraries. These can be obtained over the internet at http://www.netlib.org/liblist.html. Public domain C++ code to perform the singular value decomposition can be found by searching on svd at http://www.netlib.org. A Java translation of the LINPACK singular value decomposition can be found at http://www1.fpl.fs.fed.us/linear_algebra.html. This site also points to Java translations of the LAPACK linear algebra routines.
Public domain FORTRAN or C code to calculate the t distribution and the central and noncentral F distributions and their inverses can be found in the DCDFLIB library. DCDFLIB is a public domain library of "routines for cumulative distribution functions, their inverses, and their parameters." It was produced by Barry Brown, James Lovato, and Kathy Russell of the Department of Biomathematics, M.D. Anderson Cancer Center, The University of Texas. DCDFLIB can be found at http://odin.mdacc.tmc.edu/anonftp/page_2.html.
A relatively raw example of the use of the LINPACK and DCDFLIB routines to create a linear models program can be obtained (or run over the Web) at http://www1.fpl.fs.fed.us/glm.html. This site includes sample input and output based on Table 9.1 in Milliken and Johnson (1992). We note that linear model routines produced by these methods would need extra work to become user friendly. In particular, we have finessed the issue of the generation or input of the design, X, and hypothesis, C^T, matrices. We would expect a sophisticated user interested in simulations or special purpose analyses to be able to generate these matrices by hand. A person new to this approach might want to take a look at some of the examples in Milliken and Johnson (1992).
An algorithm that automatically generates design matrices for balanced factorial experiments is described in MacKenzie and O'Flaherty (1982).
Kennedy and Gentle (1980) (page 388) note:
[User friendly] computer software must include the ability to accept user specification of the model. Most programs in use today allow the user to provide some rather natural algebraic specification. The program then deciphers the specification and translates it into numeric coding for subsequent use. There are no established standards for doing this, but many techniques used in compiler construction can be applied to this problem.
An example of such an algebraic specification is discussed in Wilkinson and Rogers (1973).
References
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., and Sorensen, D. (1995). LAPACK Users' Guide, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia.

Dongarra, J., Moler, C., Bunch, J., and Stewart, G. (1979). LINPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia.

Eubank, R. L. and Webster, J. T. (1985). "The Singular-Value Decomposition as a Tool for Solving Estimability Problems," The American Statistician, 39, 64–66.

Golub, Gene H. and Van Loan, Charles F. (1996). Matrix Computations, Third Edition. The Johns Hopkins University Press, Baltimore and London.

Gentle, J. E. (1998). Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, New York.

Good, I. J. (1969). "Some Applications of the Singular Decomposition of a Matrix," Technometrics, 11, 823–831.

Kennedy, W. J. and Gentle, J. E. (1980). Statistical Computing. Marcel Dekker, New York and Basel.

MacKenzie, G. and O'Flaherty, M. (1982). "Direct Design Matrix Generation for Balanced Factorial Experiments," Applied Statistics, 31, 74–80.

Milliken, G. A. and Johnson, D. E. (1992). Analysis of Messy Data, Volume I: Designed Experiments. Chapman and Hall, London.

Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York.

Thisted, R. A. (1988). Elements of Statistical Computing. Chapman and Hall, London.

Verrill, S. (1999). "When Good Confidence Intervals Go Bad: Predictor Sort Experiments and ANOVA," The American Statistician, 53, 38–42.

Wilkinson, G. N. and Rogers, C. E. (1973). "Symbolic Description of Factorial Models for Analysis of Variance," Applied Statistics, 22, 392–399.
18 Statistical Computing & Statistical Graphics Newsletter Vol.12 No.1
Correspondence Analysis with R
Brigitte Charnomordic, Biométrie, INRA, France
and Susan Holmes, Statistics Department, Stanford
[email protected]
Abstract
We provide here a few didactic examples for understanding correspondence analysis, using R to illustrate the use of matrix decompositions. We show an example of how correspondence analysis can help analyze difficult multidimensional data such as DNA microarray data.
Introduction
In France, students in PhD or Masters programs in applied statistics have to take a course on correspondence analysis. In English-speaking countries this method is only taught in some education departments¹, and there are very few books or articles in English on the subject. Michael Greenacre, who studied in France under Benzécri [1], has written several very readable texts on the subject [5].
With today's explosion of data in the form of contingency tables, both two-way and three-way, this method is very useful in applications such as gene expression microarray data, information retrieval, linguistics and sociology.
Correspondence analysis provides students at the masters level in statistics, applied mathematics, biology and engineering with an opportunity to perform multivariate data analysis without having to use prepackaged programs. Thus they also learn how to use graphical and numerical functions in high-level languages such as R/Splus or Matlab.
Aims and relevant data
We can divide data-mining and multivariate statistical methods into two groups. One series of methods confers a particular status on one variable or set of variables, which are to be predicted or explained: they are the response. These methods include regression, multiple-response regression, discriminant analysis, and analysis of variance, depending on whether the explanatory variables are categorical or continuous. These are not the ones we are going to study here.
Correspondence analysis is useful when all the variables have the same status. This is sometimes called unsupervised learning, and includes clustering and principal components as well. These methods have in common the creation of a new set of variables that simplify the arrays at hand. In the case of clustering the new variable is categorical; in correspondence analysis and principal components the new variables are continuous and enable the construction of useful new graphical representations of the data.
Correspondence analysis is an exploratory method because it does not presuppose any model for the data, as do Goodman's bilinear methods or factor analysis models, for instance. Correspondence analysis and principal components can both be extended to three-way arrays, for instance for the analysis of bootstrap permutation tests or time series of matrices. Such data are often called data cubes.
Correspondence Analysis
Correspondence analysis (CA, also called homogeneity analysis and reciprocal averaging) can be used to analyze several types of multivariate data. All involve some categorical variables. Here are some examples of the types of data that can be decomposed using this method:
• Contingency tables (a cross between two categorical variables).

• Multiple contingency tables (a cross between several categorical variables).

• Binary tables obtained by cutting continuous variables into classes and then recoding both these variables and any extra categorical variables into 0/1 tables, a 1 indicating presence in that class. So, for instance, a continuous variable cut into three classes will provide three new binary variables, of which only one can take the value 1 for any given observation.
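The recoding described in the last item can be sketched in a few lines of Python; the helper name and the break points below are ours, purely illustrative:

```python
# Disjunctive (0/1) coding of a continuous variable cut into classes.
def to_indicators(values, breaks):
    """One binary column per class; each observation gets exactly one 1."""
    ncls = len(breaks) + 1
    rows = []
    for v in values:
        k = sum(v > b for b in breaks)           # class index 0 .. ncls-1
        rows.append([1 if j == k else 0 for j in range(ncls)])
    return rows

tab = to_indicators([1.2, 3.5, 2.0, 4.8], breaks=[2.0, 4.0])
# -> [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```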
To a first approximation, correspondence analysis can be understood as an extension of principal components analysis (PCA) where the variance in PCA is replaced by an inertia proportional to the $\chi^2$ distance of the table from independence. CA decomposes this measure of departure from independence along axes that are orthogonal according to the $\chi^2$ inner product. If we are comparing two categorical variables, the simplest possible model is that of independence, in which case the counts in the table would approximately obey the margin-products identity. For an $m \times p$ contingency table with a total sample size of $n = \sum_{i=1}^m \sum_{j=1}^p n_{ij} = n_{\cdot\cdot}$, independence means
$$ n_{ij} \doteq \frac{n_{i\cdot}\, n_{\cdot j}}{n}. $$
1 Exception made for a few statistics departments, for instance UCLA where Jan de Leeuw teaches multivariate statistics.
This can also be written $N \doteq n\, r c'$, where
$$ r = \frac{1}{n} N 1_p \quad\text{and}\quad c = \frac{1}{n} N' 1_m. $$
The departure from independence is measured by the $\chi^2$ statistic
$$ X^2 = \sum_{i,j} \frac{\left(n_{ij} - \dfrac{n_{i\cdot}\, n_{\cdot j}}{n}\right)^2}{\dfrac{n_{i\cdot}\, n_{\cdot j}}{n}}. $$
Example: Eye color - Hair color
Here is a simple contingency table as our example, from Snee (1974) [8].

eyes     Black  Brunette  Red  Blonde
Brown      68     119     26      7
Blue       20      84     17     94
Hazel      15      54     14     10
Green       5      29     14     16
> chisq.test(eyes)

        Pearson's Chi-squared test

data:  eyes
X-squared = 138.2898, df = 9, p-value < 2.2e-16
This is a very extreme point in the $\chi^2$ distribution. The inertia is the $\chi^2$ statistic divided by the number of observations (sum(eyes)); here the inertia is $138.3/592 = 0.2336$. CA decomposes this inertia into a sum of eigenvalues of a symmetrized, reweighted version of the original table, as will be explained below. The R command resca <- ca.us(eyes) first provides a scree plot of these eigenvalues, in this example:
       Values  Percent
1      0.2088      89.  ******************
2      0.0222      10.  **
3      0.0026       1.  .
Total  0.2336     100.
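These numbers are easy to reproduce outside of R. Here is a stdlib-only Python check of the $\chi^2$ statistic and the total inertia, using the Snee counts above:

```python
# Chi-square statistic and total inertia for the eye color / hair color table.
eyes = [[68, 119, 26, 7],    # Brown
        [20, 84, 17, 94],    # Blue
        [15, 54, 14, 10],    # Hazel
        [5, 29, 14, 16]]     # Green
n = sum(map(sum, eyes))                           # total count, 592
rowsum = [sum(row) for row in eyes]
colsum = [sum(col) for col in zip(*eyes)]

# X^2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum((eyes[i][j] - rowsum[i] * colsum[j] / n) ** 2
           / (rowsum[i] * colsum[j] / n)
           for i in range(4) for j in range(4))
inertia = chi2 / n
print(round(chi2, 2), round(inertia, 4))   # 138.29 0.2336
```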
[Figure: R screenshot on Mac OS X]
Formulation as a generalized singular value decomposition
We are given an $m \times p$ contingency table of counts $N$, with $m$ levels for the row variable and $p$ levels for the column variable. (This is equivalent to a binary matrix $X$ with $n = \sum_{ij} n_{ij} = n_{\cdot\cdot}$ observations on $m + p$ columns, a notion that is useful for the generalization later.)
The first transformation makes the contingency matrix $N$ into a frequency matrix $F = \frac{1}{n} N$. We denote the row sums by $r = F 1_p$ and the column sums by the vector $c = F' 1_m$. These both sum to one:
$$ r' 1_m = c' 1_p = 1. $$
In the case of independence,
$$ F \doteq r c'. $$
All the rows would then be multiples of each other or, as this is sometimes called, homogeneous. So if each row is divided by the weight of that row, these so-called row profiles $D_r^{-1} F$ would all be equal ($D_r^{-1} F = 1_m c'$), where $D_r^{-1}$ denotes the diagonal matrix with the vector $r^{-1}$ on its diagonal.
The average row in the case of homogeneity and independence is obtained by averaging the rows with the relevant weights for each column; the average of the row profiles is $c'$. The departure from independence and homogeneity is measured by some norm of $D_r^{-1} F - 1_m c'$ (or, at the term-by-term level, $\frac{f_{ij}}{f_{i\cdot}} - f_{\cdot j}$). With this notation we remark that
$$ \chi^2 = n \sum_{i,j} \frac{(f_{ij} - r_i c_j)^2}{r_i c_j} = n \sum_{i,j} r_i c_j \left(\frac{f_{ij}}{r_i c_j} - 1\right)^2. $$
Verification in R:

> F <- eyes/sum(eyes)
> r <- apply(F, 1, sum)
> c <- apply(F, 2, sum)
> E <- outer(r, c)
> sum((F - E)^2/E) * 592
[1] 138.2898
To compute the distance between profiles, each column is reweighted by the inverse of its sum; this gives the $\chi^2$ distance between row profiles:
$$ X^2 = n\, \mathrm{trace}\left((F - rc')' D_r^{-1} (F - rc') D_c^{-1}\right) = n\, \mathrm{trace}(A'A), \quad\text{where } A = D_{\sqrt{r}}^{-1} (F - rc') D_{\sqrt{c}}^{-1}. $$

The latter decomposition justifies choosing the matrix $A$ as a natural square root. $W = A'A$ is in a sense the characteristic matrix-operator of the analysis, in the same way that the covariance or correlation matrices are those of principal components analysis. Generalizing principal component analysis to include metrics on the rows and the columns can lead to other multivariate techniques such as discriminant analysis (see Mardia, Kent and Bibby (1979) [7]).
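The trace identity above is easy to check numerically. A stdlib-only Python sketch, using the eye/hair counts from the earlier example:

```python
import math

# A = D_{1/sqrt(r)} (F - r c') D_{1/sqrt(c)}: the standardized residuals.
eyes = [[68, 119, 26, 7], [20, 84, 17, 94], [15, 54, 14, 10], [5, 29, 14, 16]]
n = sum(map(sum, eyes))
F = [[x / n for x in row] for row in eyes]
r = [sum(row) for row in F]                       # row masses
c = [sum(col) for col in zip(*F)]                 # column masses

A = [[(F[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(4)]
     for i in range(4)]
# trace(A'A) is the squared Frobenius norm of A, so X^2 = n * ||A||_F^2
chi2 = n * sum(A[i][j] ** 2 for i in range(4) for j in range(4))
print(round(chi2, 2))   # 138.29
```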
Correspondence analysis decomposes the matrix $W$: its eigenvectors give the axes that account for the largest part of the departure from independence, just as principal components provides the axes accounting for the largest variability. Computationally this is achieved by a generalized singular value decomposition
$$ D_r^{-1} F D_c^{-1} - 1_m 1_p' = U S V', \qquad\text{with } V' D_c V = I_p,\quad U' D_r U = I_m, $$
equivalent to the eigendecomposition of $W = A'A$, or to the singular value decomposition
$$ D_r^{-1/2} F D_c^{-1/2} - \sqrt{r}\,\sqrt{c}\,' = (D_r^{1/2} U)\, S\, (D_c^{1/2} V)', $$
where $(D_c^{1/2} V)'(D_c^{1/2} V) = I_p$ and $(D_r^{1/2} U)'(D_r^{1/2} U) = I_m$. The CA plots can be used to find out whether there is a hidden ordination in the data, such as the chronological seriation studied below.
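As a numerical sanity check that the squared singular values are the inertias of the scree plot, the following stdlib-only Python sketch (a power iteration standing in for a full SVD, purely for illustration) recovers the leading eigenvalue of $W = A'A$ for the eye/hair table, which should match the first scree value, 0.2088:

```python
import math

eyes = [[68, 119, 26, 7], [20, 84, 17, 94], [15, 54, 14, 10], [5, 29, 14, 16]]
n = sum(map(sum, eyes))
F = [[x / n for x in row] for row in eyes]
r = [sum(row) for row in F]
c = [sum(col) for col in zip(*F)]
A = [[(F[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(4)]
     for i in range(4)]
W = [[sum(A[k][i] * A[k][j] for k in range(4)) for j in range(4)]
     for i in range(4)]                            # W = A'A, 4 x 4 symmetric

v = [1.0, 1.0, 1.0, 1.0]                           # power iteration on W
for _ in range(500):
    w = [sum(W[i][j] * v[j] for j in range(4)) for i in range(4)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * W[i][j] * v[j] for i in range(4) for j in range(4))
print(round(lam, 4))   # leading inertia, about 0.2088
```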
Finding an underlying ranking
Example: Cox and Brandwood [3] tried to seriate Plato's works using discriminant analysis on the proportions of sentence endings, in a given book, with a given stress pattern. Here we show how such an analysis can be done with correspondence analysis on the table of frequencies of sentence endings². The first 10 profiles (as percentages) look as follows:

         Rep  Laws  Crit  Phil   Pol  Soph   Tim
UUUUU    1.1   2.4   3.3   2.5   1.7   2.8   2.4
-UUUU    1.6   3.8   2.0   2.8   2.5   3.6   3.9
U-UUU    1.7   1.9   2.0   2.1   3.1   3.4   6.0
UU-UU    1.9   2.6   1.3   2.6   2.6   2.6   1.8
UUU-U    2.1   3.0   6.7   4.0   3.3   2.4   3.4
UUUU-    2.0   3.8   4.0   4.8   2.9   2.5   3.5
--UUU    2.1   2.7   3.3   4.3   3.3   3.3   3.4
-U-UU    2.2   1.8   2.0   1.5   2.3   4.0   3.4
-UU-U    2.8   0.6   1.3   0.7   0.4   2.1   1.7
-UUU-    4.6   8.8   6.0   6.5   4.0   2.3   3.3
... etc. (there are 32 rows in all)
The eigenvalue decomposition (the scree plot) shows that two axes provide a summary of 85% of the departure from independence.

> res.plato <- ca.us(platon)
  Eigenvalue  inertia %  cumulative %
1    0.09170      68.96         68.96
2    0.02120      15.94         84.90
3    0.00911       6.86         91.76
4    0.00603       4.53         96.29
5    0.00276       2.07         98.36
6    0.00217       1.64        100.00
Please examine the eigenvalues...
How many axes do you wish to be retained?
The function ca.us asks this question interactively because only the scree-plot visualisation can protect against the separation of two or more very close eigenvalues. The function then returns, as the resulting data list, the coordinates for plotting; these can be visualized using ggobi if there are more than 3 relevant axes. Here is the correspondence analysis representation for the rows and columns taken separately. We have made the labels' sizes proportional to the quality of the representation in this first plane. These plots are obtained by typing caplot.us(res.plato).
[Figure: Plato sentence endings (rows)]

² A computerized analysis of Plato's work appears in Ledger (1989) [6].
[Figure: Plato's works (columns)]
The plot shows very clearly a seriation of all the works; in fact, except for Critias, the seriation is determined by the first axis. The second axis helps to place Critias between Politicus and Sophist, a choice that has also been validated in Ledger (1989) and Cox and Brandwood (1959).
Decomposition of Inertia
χ² distance between profiles
Here is a reason why such a weighted distance could be useful. Take a very simple contingency table:

        [  6  2 12 ]
    X = [ 15  5 30 ]
        [ 21  7 42 ]

The row profiles are all

     [,1] [,2] [,3]
[1,]  0.3  0.1  0.6
[2,]  0.3  0.1  0.6
[3,]  0.3  0.1  0.6

We can see that the profiles are identical, the multinomials generating the rows are said to be homogeneous, and the contingency table is in fact only of rank 1:

        [ 2 ]
    X = [ 5 ] %*% [ 3 1 6 ]
        [ 7 ]
This is exactly the problem we encounter in principal components analysis: we need the singular value decomposition of X, but instead of centering the data with respect to the mean, the data are centered at independence. Another difference lies in the choice of the metric for computing distances between rows or columns.
Decomposition of the difference from independence
A cloud, or scatter, of weighted points: These are points defined in a Euclidean space, say R^p, so that distances between them are easy to compute. However, we associate to each multidimensional point a weight that changes the inertia of the scatter of points. For instance, if we have two points, the one with the higher weight will 'pull' the center of gravity toward it. The same happens for the 'minimum inertia' line: it is pulled toward highly weighted points.
Distributional equivalence: If we add two rows that have the same profiles, this does not change the axes chosen to represent the data (the column profiles' geometry remains unchanged). Thinking of the points as weighted points in a cloud, two points that sit at the same spot can be merged because we can add their weights.
Barycentric representation: Take the simplest case, row profiles of a 3-column contingency table. The profiles sum to one, so they are all representable in a triangle (the 2-dimensional simplex in R³). The vertices are the extreme profiles (1,0,0), (0,1,0) and (0,0,1). Although the row profiles lie in a three-dimensional space, because they belong to this triangle they can be taken out and looked at in these coordinates, called the barycentric coordinate system. Now an extra scale change brings this representation to the correspondence analysis one: the dimensions are weighted inversely by the relative weights of the columns, called the column mean profiles (which also add to 1). The distances we want to represent between points are the χ² distances relevant here, so the sides of the equilateral triangle are stretched to be inversely proportional to the square roots of their mean values.

The side of the triangle that is stretched the most corresponds to the least frequent column.

This representation is the one chosen by default by the function caplot. There is a delicate issue of choosing the scales in the two dimensions so that simultaneous representations of rows and columns are valid (the program warns the user when such a scaling has not been chosen). Even with the relevant choice of scaling, proximities between row and column points are hard to interpret; it is easier to interpret the directions of the different rows and columns.
Reading the Output
Although the maps provided by both correspondence analysis and principal components analysis look quite simple, there are traps that lead to misinterpretations that must be avoided. Associated with the coordinates in the new spaces are what we call loadings or contributions, which are indicators of how true the proximities in the image space are. To this end, the function returns a list containing the eigenvalues, the coordinates for the rows and for the columns that are used for building the graphical representations, and the absolute contributions and squared cosines that are important diagnostic tools.
> names(resca)
 [1] "recap"     "valp"      "ind.cords" "var.cords"
 [5] "ind.cta"   "var.cta"   "ind.ctr"   "var.ctr"
 [9] "ind"       "var"       "axes"
• When trying to understand the most meaningful rows or columns for a given axis, we look at the absolute contributions of the rows or columns to that axis; this gives the amount of an axis's inertia explained by a single row or column.

• The relative contribution of an axis (or of two axes) to the inertia of a row. This is the squared cosine of the point with the axis, which says how well the point is projected onto the axis.
Contribution³ to the inertia from row i:

The distance from the ith row to the center of the row points is
$$ d^2_{\chi^2}(\mathrm{profile}_i, \mathrm{center}) = \sum_j \frac{1}{f_{\cdot j}}\left(\frac{f_{ij}}{f_{i\cdot}} - f_{\cdot j}\right)^2 = \sum_j f_{\cdot j}\left(\frac{f_{ij}}{f_{i\cdot} f_{\cdot j}} - 1\right)^2 = \sum_k (s_k u_i^k)^2. $$

This row thus participates in the inertia by this amount, weighted by the row's mass $r_i$. The amount can be decomposed over each of the axes separately, giving the contribution of each row to the inertia of each axis: the absolute contribution of row $i$ to axis $k$ is $r_i (s_k u_i^k)^2$, which, rescaled by the axis inertia $s_k^2$, is $r_i (u_i^k)^2$. The contributions of all the rows to a given axis add to one; this translates the fact that $U' D_r U = I_m$. Here are the absolute contributions for the eyes data:
%Output resca$ind.cta
        Axis 1  Axis 2  Axis 3
Brown   0.2225  0.3788  0.2163
Blue    0.0509  0.0232  0.4428
Hazel   0.0096  0.5513  0.3191
Green   0.7170  0.0467  0.0217
Total   1.      1.      1.
Thus we can see that the most important row category for explaining the first axis is Green eyes. The contributions to inertia decomposed according to the column categories give:

%Output resca$var.cta
         Axis 1  Axis 2  Axis 3
Black    0.4312  0.1304  0.0668
Brunette 0.5213  0.1124  0.0031
Red      0.0340  0.1980  0.6109
Blonde   0.0135  0.5591  0.3192
Total    1.      1.      1.
Squared cosine

To interpret the accuracy of a projection, it is important to consult the cosine of the angle between the point and its projection onto the kth axis, or onto the plane or space spanned by the relevant axes:
$$ \cos^2(\mathrm{row}\ i, \mathrm{axis}\ k) = \frac{(s_k u_i^k)^2}{\sum_l (s_l u_i^l)^2}. $$
%Output resca$ind.ctr
       Axis 1  Axis 2  Axis 3  Total
Brown   0.838   0.152   0.010     1.
Blue    0.864   0.042   0.094     1.
Hazel   0.133   0.812   0.055     1.
Green   0.993   0.007   0.000     1.
In this case, of course, taking 3 axes results in a complete reconstruction of the table, so the cos² between the rows and this 3-space is 1.
On the other hand, the columns' cosines can also be provided; here the output gives:

%Output resca$var.ctr
         Axis 1  Axis 2  Axis 3  Total
Black    0.9670  0.031   0.002     1.
Brunette 0.9775  0.022   0.000     1.
Red      0.5424  0.336   0.121     1.
Blonde   0.1759  0.777   0.052     1.
This information can actually be incorporated into the graphic by making the size of each point's label proportional to how close the point is to the plane. This gives a 'perspective' type plot: points that are close to the plane are well represented and have high cosines. Thus the large letters label points that are close to the plane and are therefore "well represented".

Summing these cosines over the two axes of the representation, and using the resulting indices as the font size, enables one to see at a glance which rows and columns are important for the interpretation.
³ Sometimes called absolute contributions, as distinct from the cos², which are sometimes called relative contributions.
Complete Analysis of DNA-microarray data

These data are freely available thanks to Iyer et al. (1999) [9], who analyzed them in their Science article. One of the basic methods they used was hierarchical clustering, as very well explained by Carr et al. (1997) [2]. Here we present the results of the correspondence analysis alternative. Since this work was done, other implementations of correspondence analysis (and homogeneity analysis) have been published (see the software by Wong and Cui [10] and Fellenberg et al. [4]).
[Figure: Eigenvalues (scree plot) for the microarray data]

[Figure: Columns as centroids of the rows]
The 3-dimensional space spanned by the first three axes explains 78% of the departure from independence of the table. Here are two 3-dimensional views.
[Figure: two 3-D views of the correspondence analysis of the microarray data, with the columns (time points) plotted as centroids of the rows (genes). First view: the three first axes, Axis 1 (41%), Axis 2 (24%), Axis 3 (13%). Second view, rotated: the next axes, Axis 2 (24%), Axis 3 (13%) and Axis 4 (5.5%). Gene numbers and time-point labels (15MIN through 24HR, UNSYNC, and the control points such as 0HRC, 1HRC, 2HRC, 4HRC) are plotted at their coordinates.]
References

[1] J. P. BENZECRI, History and prehistory of data analysis. V: The analysis of correspondence, systematic index (in French), Cahiers de l'Analyse des Données, 2 (1977), pp. 9-40.

[2] D. CARR, R. SOMOGYI AND G. MICHAELS, Templates for Looking at Gene Expression Clustering, Statistical Computing and Graphics Newsletter, vol. 8 (1997), pp. 20-29.

[3] D. R. COX AND L. BRANDWOOD, On a discriminatory problem connected with the works of Plato, J. Roy. Statist. Soc. Ser. B, 21 (1959), pp. 195-200.

[4] K. FELLENBERG, N. C. HAUSER, B. BRORS, A. NEUTZNER, J. D. HOHEISEL, AND M. VINGRON, Correspondence analysis applied to microarray data, PNAS (2001), pp. 10781-10786.

[5] M. GREENACRE, Theory and Applications of Correspondence Analysis, Academic Press, (1984).

[6] G. K. LEDGER, Re-counting Plato, Oxford University Press, Oxford, (1989).

[7] K. MARDIA, J. KENT, AND J. BIBBY, Multivariate Analysis, Academic Press, (1979).

[8] R. D. SNEE, Graphical display of two-way contingency tables, The American Statistician, 28 (1974), pp. 9-12.

[9] V. R. IYER, M. EISEN, T. ROSS, T. DOUGLAS, G. SCHULER, T. MOORE, J. LEE, J. TRENT, L. STAUDT, J. HUDSON JR., M. BOGUSKI, D. LASHKARI, D. SHALON, D. BOTSTEIN AND P. BROWN, The Transcriptional Program in the Response of Human Fibroblasts to Serum, Science, 283 (1999), pp. 83-87.

[10] W. WONG AND Y. CUI, Gifi Array Analyzer: analyzing microarray data with Homogeneity Analysis, http://biowww.dfci.harvard.edu/˜ycui/Gifi.html
TEACHING STATISTICAL COMPUTING
Scenarios for Statistics
Susan Holmes, Statistics, Stanford
Statistical Computing for Artists
At Stanford this year Brad Efron and I decided to teach an introductory-level statistics class aimed at aesthetes. The campus daily immediately misnamed the class "Maths for fuzzies". We wanted to draw esthetically inclined students to a class that would enable them to read and understand statistics as it appears in the newspapers, and then illustrate what they had learned in class with an artistic medium.
Of course there was no calculus requirement, nor was any inclination for maths expected. Everything was done with words, but we ran the course curriculum backwards from the usual ordering, starting with the computation of permutation and bootstrap distributions rather than the Normal distribution.
Familiarity with the computer was acquired in stages:
Using the browser: The class notes were made available on a class web page (link to class web page).

Downloading and printing PDF files: Some of our students did not know how to download or view PDF files.

Running applets: There are many existing applets on the web for teaching probability; we used an array of them, from Buffon's needle to Monty Hall.

Using Flash to do some statistical analyses: The first permutation tests were executed using a simple press-button interface on data already entered into the program (interactive Flash example).

Making one simple Flash animation scene: This was explained in a lab run by Carrie Grimes, a valuable Teaching Assistant thanks to her prior experience with Flash. An animation used to explain paired comparisons can be accessed at:

Publishing Flash animations on the web: This involves publishing the html file that contains a pointer to the relevant swf files and uploading it to the web page. This was much more difficult for our students than we had predicted.

Making an interactive Flash animation: This was what the students wanted to learn most of all, the basis for being able to make computer-animated cartoons; this was their favorite lab!

Making a whole Flash movie with several scenes: This involved making complete storyboards containing the concepts and how they were illustrated, as well as joining the scenes together.

Writing Flash ActionScript programs to analyze data: Some more ambitious students actually did quite a bit of programming; it is possible to allow users to enter their own data and analyze it, as the bootstrap Flash animation written by Anita Lillie shows: Bootstrap of a Regression (2000 Elections).
Much of the experimental data was collected by the students themselves, for instance by testing their ESP using Michael Daniels' ESP tester.
This of course led to binomials, but we experimentedwith them using simulations more than formulas.
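The flavor of those simulations, transcribed into stdlib Python rather than Flash (the 25-trial, one-in-five card setup is our assumption about the ESP tester, purely for illustration):

```python
import random

# Simulate guessing 25 cards with a 1/5 chance each, many times over,
# to approximate the chance of scoring 10 or more hits by luck alone.
rng = random.Random(7)
sims = [sum(rng.random() < 0.2 for _ in range(25)) for _ in range(10000)]
p_value = sum(s >= 10 for s in sims) / len(sims)
# p_value is small; the exact binomial tail P(X >= 10) is about 0.017
```

The point of the course was exactly this: the students see the tail probability emerge from repetition, without ever writing the binomial formula.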
The first two-sample testing examples were paired tests done with permutations, and the manipulations that involved the computer were done using scripts in Flash ActionScript.
[Figure: Entering the data]
Instead of using the analytical formula, we generated many samples under the null hypothesis, as we did also for two-sample permutation tests; I called the permutations shuffling of the two samples.
There is a green shuffle button that just does one permutation. Then the difference has to be computed by clicking buttons; once the students understand the step-by-step process, a special automatic button enables them to do blocks of permutations at a time.
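The logic behind those buttons can be sketched in a few lines (stdlib Python here for readability; the function name and sample data are ours, not from the course):

```python
import random

def shuffle_test(x, y, n_shuffles=2000, seed=1):
    """Two-sample permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                     # one press of the shuffle button
        xs, ys = pooled[:len(x)], pooled[len(x):]
        if abs(sum(xs) / len(xs) - sum(ys) / len(ys)) >= abs(observed):
            extreme += 1
    return extreme / n_shuffles                 # two-sided p-value

p = shuffle_test([12, 14, 15, 16], [8, 9, 10, 11])   # small p: groups differ
```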
As homework and final projects, we asked the students to write their own animations to illustrate some of the concepts covered in class:
Anita Lillie’s Histogram
Sara Flores’ Stem and leaf
Michael Areinoff’s Histogram
Meredith Hoy’s regression
Danny Jacobs’ Shakespeare Bootstrap
Interactive Flash on the web
First, the instructor can write very simple interactive programs to run sets of simulations. Our homework 3 consisted of the comparison of two samples of different sizes with a permutation test; the students simply had to run through the screens pushing a few buttons. This only required knowing how to link to a site on the web and updating the browser so that the correct plug-in was available. At the time, Flash 5 had only recently come out, and this was the version that made doing mathematics possible.
Flash Animations
The students spent 3 labs learning the basics of Flash animation. I can't say it was smooth for all of them: there is a lot involved in going from only knowing email to being able to publish an animation, which then has to be uploaded to their own website, a lot of small steps that aren't so easy for an audience of technophobes. However, the motivation of actually learning Flash and seeing their artwork dance across the screen was very strong.
Flash ActionScripts
Here are some of the possibilities that were added with version 5 of Flash and that have made mathematical functions possible:
Math.random, Math.floor, Math.round, Math.sin, Math.cos, Math.abs, etc.
To see an example of what a script actually looks like, see the examples at Sample for building a Pie chart and Sample ActionScript for a 3D rotator.
A summary of the experiment
None of these students wanted to use spreadsheets or statistical software, so Flash was an appealing alternative, especially because no other course on campus catered to this arty crowd. The audience was coaxed into using the computer to program because Flash was written originally for designers, with no prior programming experience necessary at all.
To see for yourselves an overview of the projects and homeworks, please visit our Student Sampler.
And from here?
Changing students' attitudes in an elementary statistics class from passive consumers to active creators can definitely change their experience. Having to make an animation that explains a concept means not only that the concept is understood, but also that students acquire the focus to extract the few important features that need to be shown in an animation. The artistic and creative components kept them interested even when the statistics seemed arduous.
I have noticed, since working in the United States, how education here is focused on personal work. Students thrive on giving their own take on problems and doing personal projects. The course was built to capitalize on this observation.
Maybe others will be interested in using Flash for teaching mathematics or statistics; I would be happy to provide information and advice.
References
[1] S. Bhangal (2000), Foundation Actionscript. Friends of ED, Birmingham, UK.
[2] M. Wooldridge (2001), Teach Yourself Visually Flash 5. Marangraphics Inc./Hungry Minds.
[3] B. Efron and S. Holmes (2000), Scenarios for Statistics: Stat 30. Webpage: http://www-stat.stanford.edu/~susan/courses/s30/
NEWS CLIPPINGS AND SECTION NOTICES
Pictures from the Joint Mixer at JSM 2001

Lots of attendees at the favorite JSM event of many conference goers: the Stat Computing and Graphics mixer!
Thomas Lumley and Debby Swayne being helped to choose prize winners by an unidentified young statistician.
John Chambers awarding the John M. Chambers Statistical Software Award to Dr. Alessandra Brazzale.
SECTION OFFICERS
Statistical Graphics Section - 2001

Deborah F. Swayne, Chair
(973) 360-8423
AT&T Labs - Research
[email protected]

Stephen G. Eick, Chair-Elect
(708) 713-5169
Visual Insights
[email protected]

Edward J. Wegman, Past-Chair
(703) 993-1680
George Mason University
[email protected]

David A. James, Program Chair
(908) 582-3082
Lucent Bell Laboratories
[email protected]

Mario Peruggia, Program Chair-Elect
(614) 292-0963
Ohio State University
[email protected]

Dianne Cook, Newsletter Editor
(515) 294-8865
Iowa State University
[email protected]

Thomas Lumley, Secretary/Treasurer
(206) 543-1044
University of Washington
[email protected]

Graham Wills, Publications/Electronic Communication
(312) 651-3671
SPSS
[email protected]

Bradley A. Jones, Council of Sections Rep. (99-01)
SAS
[email protected]

Charles B. Roosen, Council of Sections Rep. (01-03)
MathSoft, Inc.
[email protected]
Statistical Computing Section - 2001

Mark Hansen, Chair
(908) 582-3869
Bell Laboratories
[email protected]

Susan Holmes, Chair-Elect
Stanford University
[email protected]

Russell D. Wolfinger, Past-Chair
(919) 677-8000
SAS
[email protected]

Douglas W. Nychka, Program Chair
(919) 737-2534
NCAR, Geophysical Statistics Project
[email protected]

Tim Hesterberg, Program Chair-Elect
(206) 283-8802 x319
MathSoft
[email protected]

Susan Holmes, Newsletter Editor (00-02)
(650) 725-1925
Stanford University
[email protected]

Merlise Clyde, Secretary/Treasurer (00-02)
(919) 681-8440
Duke University
[email protected]

John F. Monahan, Publications Liaison Officer
(919) 737-2541
North Carolina State University
[email protected]

Thomas F. Devlin, Electronic Communication Liaison
(973) 655-7244
Montclair State University
[email protected]

Lionel Galway, Awards Officer
(310) 393-0411, ext. 7438
RAND
[email protected]

Ranjan Maitra, Education Liaison
(410) 445-2436
University of Maryland
[email protected]

John J. Miller, Education Liaison
(703) 993-1690
George Mason University
[email protected]

Leland Wilkinson, Council of Sections Rep. (99-01)
(312) 651-3270
SPSS
[email protected]

David M. Allen, Council of Sections Rep. (00-02)
(606) 257-6901
University of Kentucky
[email protected]

Elizabeth Slate, Council of Sections Rep. (00-02)
(607) 255-9148
Cornell University
[email protected]
INSIDE
A WORD FROM OUR CHAIRS . . . 1
    Statistical Computing . . . 1
SPECIAL FEATURE ARTICLE . . . 1
Editorial . . . 2
FROM OUR CHAIRS (Cont.) . . . 3
    Statistical Computing . . . 3
FROM OUR CHAIRS (Cont.) . . . 4
    Statistical Graphics . . . 4
SOFTWARE PACKAGES . . . 5
TOPICS IN STATISTICAL COMPUTING . . . 15
TEACHING STATISTICAL COMPUTING . . . 25
NEWS CLIPPINGS AND SECTION NOTICES . . . 28
SECTION OFFICERS . . . 29
INSIDE . . . 30
The Statistical Computing & Statistical Graphics Newsletter is a publication of the Statistical Computing and Statistical Graphics Sections of the ASA. All communications regarding this publication should be addressed to:
Susan Holmes
Editor, Statistical Computing Section
Stanford University
Stanford, CA 94305
(650) 725-1925 • FAX: (650) 725
[email protected]
http://www-stat.stanford.edu/~susan
Dianne Cook
Editor, Statistical Graphics Section
Department of Statistics
Iowa State University
Ames, IA 50011-1210
(515) 294-8865 • FAX: (515) 294
[email protected]/~dicook
All communications regarding ASA membership andthe Statistical Computing or Statistical Graphics Sec-tions, including change of address, should be sent to:
American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402 USA
(703) 684-1221 • FAX: (703) 684
[email protected]
Nonprofit Organization
U.S. POSTAGE PAID
Permit No. 50
Summit, NJ 07901
American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402 USA
This publication is available in alternative media on request.