A joint newsletter of the Statistical Computing & Statistical Graphics Sections of the American Statistical Association.
Winter 2001, Vol. 12 No. 1
A WORD FROM OUR CHAIRS
Statistical Graphics
Deborah F. Swayne is the 2001 Chair of the Statistical Graphics Section. Here is the news that she has from the section.
State of the section
We had a lively discussion on the section officers' mailing list this winter about the health of the section, which doesn't seem to be as active or innovative as it might be. We need to continue to think about how to engage the membership and how to productively focus our efforts. Contributions to this discussion from other section members would be most welcome.
We have some notable successes, such as the continuing popularity of our invited sessions and short courses, but we also have some notable weaknesses. I'll consider some of them here and describe the responses we're working on, and then tell you about some new ideas that have come up this year.
Student paper competition: a merger
We have not been very successful at attracting enough papers for a meaningful competition – we don't know how much this is due to our own inefficiency and how much to the fact that the number of papers on graphics produced each year is quite small, whether by students or not.
CONTINUED ON PAGE 4
Statistical Computing
Mark Hansen is the 2001 Chair of the Statistical Computing Section. Here are some of his news items.
Winter weather has finally settled over the northeast, bringing with it a kind of normalcy that I've been longing for. I have nothing to do on this chilly afternoon except revisit some of the initiatives I've been able to accomplish as Chair of the Section. Perhaps a simple list is the best format:
Section Web site
The domain www.statcomputing.org is now the official home of the Computing Section. On the site you will find a detailed history of the Section, news items, awards updates, conference announcements, bulletin boards and links to the Newsletter. Have a look and let us know what you think!
Section CD-ROM
Tony Rossini has recently compiled a collection of essential statistical software. We plan to distribute this collection in CD form to all active members of the Section. We are hoping that this will become an annual affair, offering a regular snapshot of the latest Computing tools.
CONTINUED ON PAGE 3
SPECIAL FEATURE ARTICLE
Report from the Workshop on Modern Statistical Computing and Graphics in Academia
The Statistical Computing and Graphics sections jointly sponsored a workshop prior to Interface 2001 to brainstorm about how to recognize and promote statistical computing and graphics in academia. Participants discussed some of the problems facing young academics in our field, and sought actions that individuals and organizations might take to respond to these problems.
The workshop was attended by about two dozen people, most of them young academics and the rest senior academics and researchers in industry and the military.
The program for the workshop broke the day into fourparts. Each of the first three sessions began with a shortpresentation and continued with discussion involvingthe whole group.
• Rewarding research in statistical computing (withDi Cook and Thomas Lumley as presenters anddiscussion leaders)
• Collaboration with other disciplines (Chris Gen-ovese, Robert Gentleman)
• Curriculum content and development (RossIhaka, David Madigan, Michael Minnotte)
In the final session, workshop participants summarized the day's discussions and came up with suggestions for action. These are some of the more concrete proposals that emerged:
• Problem: Statistics Departments have a hard time evaluating work in statistical computing when it appears in tenure and promotion packages.
Response: Strengthen the outlets for publishing work in statistical computing, especially software. We agreed that we would begin this process by adding our support to JSS rather than trying to spawn any parallel effort.
Actions: Formulate a document providing guidelines for evaluating software that might be used by Statistics Departments in promotion and tenure evaluations, and for review of journal articles on software.
• Problem: Statistical computing faculty have a hard time convincing their departments to adopt innovative practices.
Response: Build a web site which gathers data on policies, programs, and practices that are friendly to statistical computing.
Actions: We can begin with a survey of section members, and supplement what we learn with results of other ongoing investigations of statistics departments.
• Problem: Computational literacy is low in the Statistics profession.
Response: Offer introductory classes at Statistics meetings. Topics that would be appropriate include databases, scripting languages, and an introduction to R. Appropriate meetings include ENAR, WNAR, and the JSM. We would offer the classes as cheaply as possible, and offer them free to section members.
Actions: The sections should work with their Continuing Education representatives and Program Chairs.
• Problem: Statistics Department curricula offer few courses in statistical computing.
Response: Many departments are now experimenting, but there's little sharing of experience.
Actions: Collect syllabi, organize and make them available on the web.
• Problem: Junior faculty have to learn a lot in a short time.
Response: Offer experience and advice in some tricky areas, such as grant applications and preparing tenure and promotion packages.
Action: Approach statisticians who have had success writing grant proposals, and those who have had experience in evaluating them, and ask them to write an article sharing their expertise.
If you wish to contribute to any of these activities, please talk to Debby Swayne or Mark Hansen.
The workshop organizers are grateful to Pfizer, Inc. and Stata, whose generous contributions supplemented – and greatly exceeded – the seed money provided by the two sections.
Deborah F. Swayne, Di Cook, Mark Hansen, Guy Nason, Duncan Temple Lang and Adalbert Wilhelm
http://www.statcomputing.org/workshops/scga/
Editorial
We have finally managed to get an issue out in 2001. Maybe we can think of it as the El Niño issue, since it's coming out just in time for Christmas and New Year's Eve. This is our first all-electronic issue; we hope you enjoy it, and especially remember to use it as a navigation guide. In particular, all live links are surrounded by blue frames, so clicking on them will either wake your browser and take you to the site or, if your browser is already running, simply open the relevant page in it.
In conjunction with Interface '01 the sections held a workshop on Modern Statistical Computing and Graphics in Academia. There is a report from this workshop in this issue, and the aim is to have several more substantial reports from the workshop in 2002 issues of the newsletter.
Perhaps as a consequence of the workshop, this newsletter issue has a healthy selection of articles on software. Three commercial software companies show their wares: Spotfire, Visual Insights, and S-Plus. Antony Unwin describes the software research activity at the University of Augsburg, and Lee Wilkinson describes a new concept, "streaming graphics," underlying his software Dancer. The next issue in 2002 will have several more articles describing software.
We have dedicated our two statistical computing topics to the SVD (singular value decomposition), which seems to be more and more popular as complex multivariate data drown our computers!
We have also included an account of an experimental class for teaching statistical computing; in the future, we would like to hear about our members' experiences with other software in the classroom.
Finally, we wish to encourage submissions to the newsletter. This year has been a sloooow year for getting the newsletter out, but we promise to be more diligent in the new year.
Dianne Cook, Susan Holmes
Editors, Statistical Graphics and Computing Sections
FROM OUR CHAIRS (Cont.). . .
Statistical Computing
CONTINUED FROM PAGE 1
Refereed Papers.
This year, the Section on Statistical Computing is introducing an experimental refereeing process for contributed papers submitted to the Joint Statistical Meetings. We see this service as helping our members in two ways. First, by introducing some form of refereeing we can highlight outstanding contributed papers on statistical computing. Secondly, we hope to improve the overall quality of contributed papers and presentations at the JSM. Details of the process can be found on the Section Web site or in a recent Amstat News article.
The Joint Statistical Meetings.
Doug Nychka put together an outstanding program for the JSM in Atlanta this year. The 2002 meetings will be in New York City, and Tim Hesterberg already has an impressive slate of sessions prepared. To welcome you all to New York, the Section will be raffling off a pair of tickets to the Broadway show The Producers. They are excellent seats for the Tuesday night performance during the JSM.
Elections.
Please join me in congratulating our new Executive Committee members: Leland Wilkinson (Chair-Elect), Mary Lindstrom (Program Chair-Elect), Charles Kooperberg (Secretary/Treasurer) and Robert McCulloch (Council of Sections Representative). I also want to thank our outgoing Secretary/Treasurer, Merlise Clyde. She has done a great job managing our finances for the last three years, and will be greatly missed.
Section Awards.
Lionel Galway has organized another very successful set of Section-sponsored competitions. Both the Student Paper Competition and the J. M. Chambers Award continue to draw outstanding entries. The winners for 2001 can be found on the Section Web site. The 2002 competition for these prizes is just starting, so please encourage eligible students to enter. (For 2002, the Computing and Graphics Sections have merged resources and will be offering a single, joint Student Paper Competition.) Lionel also arranges our Best Contributed Paper Presentation Award, which this year went to Jay Servidea, a graduate student at the University of Chicago.
Workshop on Modern Statistical Computing and Graphics in Academia.
Just before the Interface meeting this year, our Sections sponsored a workshop that brought together some of the most influential members of the Computing community. In this edition of the Newsletter you will find a report on the steps that we will be taking to help encourage research and education in Statistical Computing and Graphics.
We've accomplished a lot this year, and I expect that our incoming Chair, Susan Holmes, will see that our Section continues to be one of the most active in the ASA. Finally, I want to thank our "new" Newsletter editors, Di Cook and Susan Holmes. Producing this volume is a time-consuming process. Extracting contributions from busy authors (and Section Chairs!) is a thankless job, but the end result is an important resource for our members. Please join me in congratulating them on another outstanding issue.
Peace.
Mark Hansen
Chair, Statistical Computing Section
FROM OUR CHAIRS (Cont.). . .
Statistical Graphics
CONTINUED FROM PAGE 1
As a result, the two sections will merge their student paper competitions into one, to be referred to as the "Statistical Computing and Statistical Graphics student paper competition." The graphics section is grateful to the computing section for supporting this idea, because their annual competition runs like a well-oiled machine and regularly attracts very good papers. The graphics section will help to disseminate publicity, will contribute judges who go to work every January, and will of course contribute prize money.
Data visualization exposition: planned for 2003
We haven't held our popular data visualization exposition or contest in several years, and we were a little bit too late to sponsor one at the 2002 JSM. That means we have plenty of time to plan one for 2003, and David James (chair-elect in 2002) has volunteered to lead the effort. Our target is to have web sites and flyers prepared to announce the contest prominently at the 2002 JSM, and there's a lot of work to be done before then. If you'd like to help plan it, or you have interesting data to suggest, please talk to David ([email protected]).
New initiatives
The graphics section's major new activity this year was its cosponsorship of the workshop on the future of statistical computing and graphics in academia, held in conjunction with Interface 2001. You can read a report on the workshop elsewhere in the newsletter. This was a very interesting experiment for our section, and I hope we continue to participate in the discussion reflected there, and that we contribute to implementing some of the suggestions that emerged.
Two other ideas have been put forth, and again, I'd love to hear from section members who find them interesting:
• Form an awards committee to recognize influential and innovative work in statistical graphics. This could involve honoring past work or new work.
• Design a set of posters that could adorn stat department hallways, to increase awareness of our field.
These two ideas might even have some synergy, sincewe could design posters based on the work of honorees.
Section communication
Mark Hansen and I began to use email this year to communicate with section members, and we hope you approve. We think both sections will continue to use email (sparingly!) to tell you what's going on and to ask for your reactions and ideas. If you would like to receive section email in the future, make sure that your email address in the ASA's records is up to date.
Deborah Swayne
Chair, Statistical Graphics Section
Figure 1: Spotfire DecisionSite: clustering and profiling genes.
SOFTWARE PACKAGES
The Next Stage For Analytics: Bringing Statistics to the Masses
Dr. Bill Ladd, Director of Bioinformatics, Spotfire, Inc.
Applying Algorithms Across the Enterprise
Analytical applications are in high demand these days. For most organizations, that means better managing and analyzing the data generated by processes throughout the enterprise value chain. The volumes of data produced by today's high-throughput processes mean that researchers are increasingly turning to statisticians, computer scientists, and applied mathematicians for assistance with interpreting and analyzing research data. Statisticians need to find ways of making end-users comfortable with the basic methodologies and tools so they can run their own analyses and not have to rely on overworked statistical specialists.
Often, the development of a "correct" mathematical approach to a specific problem is the first in a series of steps that will make a new algorithm usable to end-user scientists. Once developed, the algorithm must be integrated into existing analysis processes. It must be delivered within an analytics framework that encourages data exploration and interpretation. And this framework must be deployed across broad, dispersed user communities – the globally distributed project teams that drive research decision-making.
Once an algorithm has been developed, the issue of deploying it to other researchers and other areas of the enterprise is critical. Algorithm deployments can be deemed inadequate when they fail to mirror the analytical process preferred by researchers. A framework supported by a robust, adaptable platform and able to be configured to the established processes is needed to share such algorithms and "best practices" throughout an enterprise.
Analyzing the data and communicating the results to the appropriate places in the organization to facilitate decision-making is, ultimately, the reason why researchers generate data in the first place. And the analysis tools – be they the latest killer algorithm, a desktop analysis application, or a slide rule – are all means to this important end. Because the questions researchers ask often refer to the data currently moving through their research process, their needs can seem isolated and specific. Analytical specialists, in turn, deliver tools suited to these specific needs – solutions that maximize the use of the data at a specific point in the larger data flow that begins with laboratory information management systems or basic research experiments and ends with a finished product. But to be useful over time and add value to the data produced during research projects, analytics should focus on providing tools that can be added to or replaced within an ongoing and ever-changing analysis process.
Delivering an Analytic Framework: Spotfire DecisionSite
The Spotfire DecisionSite framework addresses these issues and provides a usable eAnalytic application with a number of key elements that help statisticians distribute algorithms and analysis tools.
Data Access
Crucial decisions, those that drive research projects, result when data can be compared across experiments, labs, and fields. Yet too often simply acquiring data from the various sources where it resides is difficult. A successful analytics framework must support truly integrated data access – data from any source, in any format, anywhere, at any time, without any end-user programming.
Data Visualization
Modern computing tools can do more than simply display results – they can provide interactive, fast, and flexible data visualizations that facilitate and even amplify human thought processes. An analytic framework must include data visualization tools that can present complex datasets in unique ways that engage our innate perceptive abilities. These visualizations must be made interactive, encouraging researchers to explore unexpected behavior in their data. Interactive visualizations let machines focus on what they do best – data processing – while humans focus on their talent – making decisions.
Usability
An analytics framework may connect to all the right data sources and incorporate impressive visualizations and analysis options. But if it is not easy to use, it will not get used. A modern analytics framework must cater first to the needs of a wide variety of scientists, offering more sophisticated toolsets as an option for those scientists who want greater control.
"Guides" have emerged as a popular way of simplifying complex tasks: they step users through common tasks, rather than requiring users to search through menus for the option needed to complete a task. Guides are a useful usability tool because they can serve both novice users and power users. And when applied to research, guides can greatly speed many aspects of data analysis, from entering experimental information to setting the parameters of a complex clustering method, statistical analysis, or data reduction algorithm.
Extensibility
Software solutions supporting analytics must be adaptable and readily configurable to support access to new, emerging data sources, new analytical tools, and new analysis processes. The architecture of the platform should be able to incorporate methods developed in many different architectures and runtime environments.
Many analytical applications claim extensibility, but place such strong restrictions on the approaches used to add functionality that, in the end, they are quite limited in scope. When an analysis framework is well planned, such as Spotfire DecisionSite, the application infrastructure itself becomes an important part of the application's functionality.
Deployability
Global project teams encompassing researchers from a variety of disciplines must be able to share data and come to consensus on research directions. Web infrastructures are ideal for these purposes and are the logical choice to underpin an analytics framework. Users gain access to the most up-to-date statistical and analysis methods that apply to their research. With one change to one part of the system, IT staff can deliver new features and functionality to researchers anywhere in the world.
Consistency
Well-planned analytics frameworks do more than meet the access, analysis and usability demands of the individual researcher – they provide consistency across the broader organization. Analytic tools and statistical methods geared towards one application area can be adapted by researchers in another application area for their own purposes. As different user communities configure the environment for access to data and analysis methods relevant to their specific research, new classes of scientific investigation will be enabled that bring these previously disparate classes of data together.
Looking forward
If analytical specialists such as statisticians expect their methods to be adopted and used productively throughout the enterprise, they need to embrace the analysis processes and frameworks used by their customers – the end-user researchers and decision makers. Research is not slowing down, and the generation of experimental data stands only to increase. In evolutionary terms, the fittest research organizations in the future will likely be those that can efficiently analyze data to come to decisions about what to do next. Statistics and analytics must themselves evolve. Industrial-quality research requires industrial-quality software that lets analytical researchers focus on their specialty with access to the latest statistical methods, rather than on data management, reporting or storage.
For more information on the Spotfire DecisionSite eAnalytic application, please visit http://www.spotfire.com or send a request to [email protected].
Visualization for Information Consumers
Stephen G. Eick, Visual Insights
http://www.visualinsights.com
Abstract
Within an organization there are different types of information consumers with varying needs and levels of sophistication. Our focus within the statistical graphics community traditionally has been on building sophisticated tools for highly-skilled analysts. By focusing broadly on information needs and understanding different user tasks, we can create visual tools that appeal to all information consumers within an organization and extend our influence outside of our core area of statistical graphics.
Information Consumers
Within the statistical graphics community we delight in producing sophisticated tools for visual data analysis. Our tools are powerful and incorporate the latest thinking on statistical analysis and data mining. The guiding philosophy, it seems to me, is that we build the tools that we need to perform our analyses. The result is wonderful demonstrations at the JSM and other leading scientific conferences. Sometimes, however, it appears that some magical combination of keys, selections, and mousing was needed to produce the result.
This lack of usability may have to do with our background. In the development of tools we wear multiple hats: frequently we are the creator, developer, user, and evaluator simultaneously. By combining these roles, we can rapidly iterate through various ideas and techniques. There is, however, a negative aspect of combined roles. Our perspective generally is that of a highly trained, sophisticated data analyst, and our tools embody this perspective.
In an organization there are three different types of information consumers. At the highest level, the Executive wants quick summary information so that problems are apparent at a glance and there is enough detail for intuitive decision-making. The Manager, who reports to an executive, has time to explore data and look for trends. He needs the ability to customize reports without being overwhelmed by complexity. The Analyst or Doer wants full access to all available information at a fine-grained, detailed level. He or she has the time and interest to perform detailed studies and deep analyses, and needs powerful tools.
To meet the varying needs of information consumers I have defined four different types of visualizations. For concreteness, I will use examples from website analytics. Website analytics is the study of how users are accessing websites. It involves the collection and analysis of clickstream data, understanding website effectiveness, investigating visitor demographics, studying on-line transactions, and influencing behavior through on-line promotions.
Four Types of Visualizations
This section highlights four types of visualizations, organized in increasing order of complexity.
Digital Dashboards (Figure 1) provide an "at a glance" presentation for quick, intuitive decision-making. The key intellectual property is the choice of statistics, appropriate time scales, pleasing presentation, and simple graphics. Users may manipulate the time ranges and select other tabs in the display for more detailed information. The interesting research issues involve the graphical presentation, workflow, and implementation.
Active Reports (Figure 2) are customizable graphicalinterfaces that are used for flexible reporting, trending
Figure 1: Browser-based Executive Dashboard showing website activity. Green and red triangles highlight changes (positive and negative) from the previous time period.
and publishing. In contrast to browser-based digital dashboards, an active report is more complicated and more flexible. Our implementation of active reports is a client-side Windows application where users may manipulate data, change thresholds, publish HTML snapshots, reaggregate, and export result sets. Active reports are frequently used by business analysts. The key issues involved with active reports include the flexibility of data access, ease of creating reports, and ability to integrate with other systems in an organization.
Figure 2: Active 3D Report. The transparent cutting plane helps users to visually compare bar heights.
Visual Discovery and Analysis (VDA) tools, such as XGobi, are powerful environments that combine state-of-the-art statistical techniques with visual data analysis. They combine both statistical and graphical tools into a common framework.
Task-specific visualizations (Figure 3) capture domain knowledge to address domain-specific issues. For website analytics, a critical issue involves understanding the paths that browsers take to navigate the site – see Figure 3. The intellectual property includes novel visual metaphors and interaction techniques. Task-specific visualizations will be widely used by all information consumers in an organization.
Figure 3: Path visualization for www.visualinsights.com.
Summary and Conclusion
Within an organization there are different types of information consumers that have different needs. This essay highlights three different types of users:
• Executive,
• Manager, and
• Analyst
and four types of visualizations:
• Digital Dashboards,
• Active Reports,
• Visual Discovery and Analysis tools, and
• Task-specific visualizations.
Digital Dashboards, while engineered for an Executive, are useful for all members of an organization. Active Reports and VDA tools address Managers' and Analysts' needs. Task-specific visualizations are role-neutral and may be used by anybody with domain knowledge.
The important take-away message for us in the statistical graphics community is that our background, training, and inclinations encourage us to build VDA tools for analysts. These tools, while highly sophisticated, do not necessarily meet the needs of the other information consumers. To meet these broader needs, we must design tools that target tasks beyond our traditional focus on statistical graphics.
S-PLUS Graphlets
Charles Roosen, PhD, Senior Statistician, Insightful Corporation
S-PLUS 6 for UNIX and S-PLUS 6 for Windows introduce an exciting new technology for displaying graphics over the web. The Graphlet is a Java applet that supports displaying and interacting with S-PLUS graphics that have been saved in a file. The Graphlet can be embedded within a Web page, and viewed by a web browser.
Creating an S-PLUS Graphlet file is as easy as creating a graph in S-PLUS. Use the award-winning S programming language to create your graphic, and then simply save it as a Graphlet. Because the S language was designed explicitly for graphical display and statistical analysis, hundreds of high-level functions are at your disposal for creating an informative and attractive graph with just a few lines of code.
Graphlets allow you to add interactivity to your graphs. Graphlets are live objects, and the viewer can interact with the Graphlet by moving and clicking the mouse within the Web browser. With a simple click on a data point or label the viewer can be provided with further information through on-screen information, another Graphlet, or another page anywhere on the World Wide Web. The applications of this interactivity are endless. Here are just a few examples:
• Click on a financial time series chart to show allthe news stories relevant to the date where themouse was clicked.
• Make the bars in a bar chart active, so that clicking on a bar will drill down to display another chart containing only the data in that bar.
• Link cities on a map to further detailed charts about that city.
• Add interactive labels to a scatter plot to provide more information about your data.
• Provide multiple graphs as labeled pages in a single Graphlet, allowing the viewer to browse among the pages at will.
• Display latitude and longitude coordinates of the mouse position within an image map, and allow the viewer to zoom in and display a specific region.
The Graphlet window provides further tools for the user to zoom in on selected regions of your graphic to view details. And because Graphlets – unlike GIF and JPEG graphics – are rendered on the fly using vector graphics, there is no loss of resolution or pixelation when looking at your graphic "up close and personal".
Since Graphlets are implemented using standard Java applet technology, they can be embedded into any Web page using standard Web authoring tools such as Microsoft FrontPage. Graphlet files are very small (typically 20–60 KB), so they download quickly across the Web. S-PLUS Graphlets can be viewed on any Java-capable browser, including Internet Explorer 5 and Netscape 4. The S-PLUS Graphlet applet is freely redistributable, so you can give S-PLUS Graphlets to friends and colleagues for use on their Web pages, even if they don't have S-PLUS.
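Because a Graphlet is a standard Java applet, embedding one in a page follows the usual applet pattern. The fragment below is only an illustrative sketch: the class, archive, and parameter names are invented for this example, not Insightful's actual ones, so consult the Graphlet documentation for the real values.

```html
<!-- Hypothetical Graphlet embedding; the code, archive, and param
     names here are illustrative, not the actual S-PLUS Graphlet ones. -->
<applet code="GraphletViewer.class" archive="graphlet.jar"
        width="500" height="400">
  <!-- The graph file previously saved from S-PLUS -->
  <param name="graph" value="sales.graphlet">
  Your browser does not support Java applets.
</applet>
```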
For more information on Graphlets and a variety of live examples, see the Graphlet web page at http://www.insightful.com/graphlets.
COSADA Software Projects in Augsburg
Antony R. Unwin, Department of Computer-Oriented Statistics and Data Analysis, Augsburg University
Introduction
MANET, TURNER, CASSATT, KLIMT, MARC — there has been a succession of statistical software developments at the Department of Computer-Oriented Statistics and Data Analysis (COSADA) at Augsburg University in Germany. (Martin Theus's MONDRIAN might also be counted, but he wrote it after he left Augsburg and was at AT&T.) All the software is for interactive analysis. Users should be able to work directly with objects on screen and not have to type in commands. The software is research software, in that the main aim is to
develop and test new interactive ideas, but great emphasis is also put on achieving an intuitive, fast and flexible interface. Ideas can best be tested if they are implemented in an effective way. The programmes have primarily been written by students. While all members of the department have an input into the design and testing of the software, the programming is always the responsibility of just one person.
Each programme is named after an impressionist painter, because software can only help the user to get an impression of their data. MANET and TURNER were developed for Macs, but more recent programmes have been written in Java, so that they can potentially run on any computer system.
CASSATT
CASSATT implements interactive parallel coordinates. Consider, for instance, the 1999 reprise of the famous California-Bordeaux wine tasting data set (www.nerc.com/liquida/report20.html). There were 46 wines which were all scored by 32 judges. Did the judges agree on their rankings? Did at least some of the judges agree? Were the French and the Californian wines ranked differently? Were some wines always ranked higher than others? Which judge(s) were closest to the average ranking of all judges? Can patterns of rankings be determined? There are many interesting questions that may be pursued, and a combination of analytic and graphical methods would be best. The figure shows the rankings by wine in a parallel coordinate plot. There is an axis for each wine and each broken line represents the views of a single judge. The axes have all been scaled equally (a menu command), as the default is to scale each axis individually (a natural default when, unlike here, different variables have non-comparable scales). It is immediately apparent that the judges were not in complete accord! To assist interpretation the wines have been sorted by average ranking (one of many sorting possibilities, again a menu command) with the best to the left. You can see that even the wine with the best average ranking was given a poor rank by one judge and even the wine with the worst average ranking was given a good rank by one judge. The former judge has been selected to see whether his other judgements were also unusual. To improve visibility the lines for the judges who are not selected have been lightened (this is done using a floating dialog box).
No brief description can do justice to a sophisticated piece of software. CASSATT is a fully linked multi-window system providing not just parallel coordinate plots but also standard plots for continuous variables (dotplots, boxplots, scatterplots). It offers the
10 Statistical Computing & Statistical Graphics Newsletter Vol.12 No.1
Figure 3: Rankings by wine in a parallel coordinate plot.
basic interactive tools (including querying, selection, linking, sorting and colouring) but also some specially relevant to parallel coordinates (inverting axes, and selection methods for selecting not only points, but also lines and even angles). Selection sequences, which were introduced in MANET to provide an editable, graphical interface for selections, have been extended in CASSATT to include the selection history in graphic form. The software has been used successfully (i.e. its interactive features are still fast enough) with datasets with over 300 variables and with a dataset with over 20000 cases.
Future directions
CASSATT assumes, like most statistical software, that the data set is supplied as a data matrix. It would be useful to be able to work with more complex structures. For instance, it is currently not possible in the tasting dataset to include which wines are French and which Californian, what vintages they were, and what combinations of grapes they contained. If we had this information, we could interactively sort the wines by country of origin or by vintage.
Parallel coordinates are excellent for continuous vari-ables but not so good for categorical variables. Toomany lines simply overlap and it is hard to see struc-ture. Missing values are another related problem. Newapproaches are needed. We intend to investigate thelinking of mosaic plots and parallel coordinates in thesame package.
And, of course, we would like to interweave CASSATT with R, the way we are doing with KLIMT.
CASSATT was written in Java by Sylvia Winkler, who shared first prize in the inaugural John Chambers software competition for her work.
Check CASSATT's homepage for further information: www1.math.uni-augsburg.de/Cassatt/
Streaming Graphics
Andrew A. Norton, Matthew A. Rubin, Leland Wilkinson, SPSS, Inc.
Abstract
Streaming graphics, a cross between streaming video and dynamic graphics, is a new approach to visualizing data. The goal is to integrate and synchronize multiple streams of data in real time and display them at rates of up to 20 frames per second. This article describes the design of a system, called Dancer, that implements these ideas in Java.
Introduction
The term streaming graphics evokes the terms streaming media (e.g., Feamster, 2001) and dynamic graphics (e.g., Cleveland and McGill, 1988). While similar to both in its outward appearance, streaming graphics is basically different from both. Streaming media systems are generally concerned with delivering sound and video information in real time. Dynamic graphics are concerned with using motion to reveal structure in static data. Streaming graphics systems are concerned with displaying analyses (summaries) of streaming data in real time. Applications of streaming graphics involve
many different environments, including real-time monitoring of manufacturing processes, health indicators, financial statistics, and Web data.
Data
A streaming data source is fundamental to streaming graphics. In the simplest streaming data model, data arrive in a fixed-length buffer (window) at equally spaced time points. Upon the arrival of a new data packet, we lose one of the packets already in the buffer (usually the last). When a new data packet arrives, we refresh our display. We can also have a fixed-length buffer that receives data packets at irregular time intervals. In this environment, the arrival of new data can trigger a display refresh. There are many other prevalent streaming data environments, however, particularly ones involving multiple, asynchronous data events occurring simultaneously within multiple processing threads.
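A minimal sketch of this simplest buffer model, in Python. The class and callback names are our own illustration, not part of Dancer: a fixed-length window evicts its oldest packet on each arrival and triggers a display refresh.

```python
from collections import deque

class StreamBuffer:
    """Fixed-length window over a stream: the arrival of a new packet
    evicts the oldest packet and triggers a display refresh."""

    def __init__(self, length, on_refresh):
        self.window = deque(maxlen=length)   # oldest packet dropped automatically
        self.on_refresh = on_refresh         # display callback

    def arrive(self, packet):
        self.window.append(packet)           # evicts the oldest when full
        self.on_refresh(list(self.window))   # refresh with the current window

# usage: keep a 3-packet window and record each refresh
frames = []
buf = StreamBuffer(3, frames.append)
for x in [1, 2, 3, 4]:
    buf.arrive(x)
# after packet 4 arrives, the window holds only the 3 newest packets
```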
Streaming vs. Static Data
A popular metaphor for the difference between streaming data sources and static data warehouses is a stream versus a pool. In the massive data-mining environment, we often hear this metaphor upgraded to a river versus an ocean. The significance of this distinction, whatever the scale, is that our graphical and statistical algorithms for mining a stream (to stretch the metaphor) must adapt to its temporal nature. We must pay attention to speed of calculation, lest we be swamped with new data packets before we can compute a displayable result. We must allow calculations to persist wherever possible so that we do not waste time redoing subsets of previous calculations. We must learn to sample our data stream, or know how to bail out of calculations when the results might be out of date. We must synchronize our results, lest we mistakenly display summaries of disparate events in the same time frame.
Many of these concerns do not apply to the traditional, static data mining environment. The premier consideration in the static environment is scalability with regard to the size of a dataset. We seek algorithms that scale comparatively well, in the sense of computing time being a constant, linear, or logarithmic function of the size of a dataset. In the streaming data environment, we need to worry about scalability with regard to the momentum of a stream. The momentum of a stream is its mass (the average size of data packets arriving in a fixed interval) times velocity (the average number of packets arriving in a fixed interval). This Newtonian physics metaphor is an over-simplification, but it helps us to understand that attending to scalability with regard to the total bulk of the data we encounter will not be sufficient for building
a streaming display system. Even a constant cpu-timecalculation may still be too slow to handle the real-timedata stream we must display.
Multiple vs. Single Data Source
In the static data environment, we merge different data sources into a single table before computing graphical and statistical summaries. We cannot employ this simplification in the streaming environment. Instead, Dancer is designed to handle multiple data sources. For example, we might want to attach a single display to a live feed from a stock exchange and another live feed from a commodities exchange. There is no time to index and merge these two feeds into a temporal database.
Multiple feeds imply a multi-threaded computing environment with event notification through broadcasters and listeners. The data feeds run simultaneously, and each broadcasts the arrival of new data it receives. The statistical and graphical listeners attend to those messages and react as they choose. In our financial example, the result is a display that shows a moving time series of a particular stock superimposed on a moving time series of a related commodity.
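The broadcaster/listener arrangement can be sketched as follows. The class names and the tuple packets are purely illustrative, not Dancer's actual interfaces.

```python
import threading

class Broadcaster:
    """A data feed that announces each new packet to its listeners."""

    def __init__(self):
        self.listeners = []
        self.lock = threading.Lock()    # feeds may run in separate threads

    def subscribe(self, listener):
        self.listeners.append(listener)

    def broadcast(self, packet):
        # each feed announces new data; listeners react as they choose
        with self.lock:
            for listener in self.listeners:
                listener(packet)

# two independent feeds, one display listening to both
received = []
stocks, commodities = Broadcaster(), Broadcaster()
for feed in (stocks, commodities):
    feed.subscribe(received.append)

stocks.broadcast(("IBM", 112.5))
commodities.broadcast(("gold", 271.0))
```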
Continuous vs. Discrete Time Scales
A consequence of this functional architecture is that Dancer operates on a continuous rather than discrete time scale. In theory, Dancer displays data at any instant. Its time scale moves (at least perceptually) continuously. This is in contrast to some time series displays, such as those seen in traditional ARIMA (Box and Jenkins, 1976) packages, that contain measurements indexed on equally spaced time points.
Dancer has two time-scale modes. In the static-frame mode, geometric representations of data (points, lines, etc.) move through a motionless frame (the data area). We may clip the elements at the frame boundaries or let them pass beyond.
In the dynamic-frame mode, geometric elements are continuously repositioned within a frame that is continuously scrolling forward or backward in time. This mode requires careful programming to allow tick marks, grid lines, and scale labels to scroll smoothly and continuously, in 2D or 3D as required.
Geometry
Dancer is based on a geometric model of statistical graphics described in Wilkinson (1999). That book contrasts the geometric model with the "chart-centric" model of statistical, business, and scientific graphics packages. In the geometric model, elements such as points, lines, and intervals are embedded in a frame defined by an algebra with three operators. These geometric elements are realized (made viewable, hearable, etc.) through a set of aesthetics such as size, shape, texture, and color. The nViZn system for rendering graphs on the Web is based on the same model (Wilkinson et al., 2000).
Unlike nViZn, Dancer can bind each geometric element to a different data source. The behavior of every element in the frame is synchronized through the functional data model. Special rendering technology is required to enable each element to move up to 20 times a second based on incoming data.
Scene Tree vs. Geometric Primitives
A scene tree is a data structure that contains the information in a geometric scene. Scene trees are frequently used in 3D modeling to organize the arrangement and rendering of a collection of geometric objects. For example, if the root of a tree is a body, then the root's children could be torso, head, arms, and legs. The children of arms could be upper arms, lower arms, and hands. And the children of hands could be fingers and palms. The advantages of this organization are several. First of all, children inherit attributes from parents, including aesthetics (e.g., color, texture) and localized coordinates. Second, adding objects to a scene (another body, for example) requires only appending a new subtree to an appropriate node. Third, a tree is a simple object that can be explored through depth-first or breadth-first search. This makes rendering rapid and efficient. And it makes interactions with elements (editing, brushing, linking) a simple matter of walking the tree to locate a selected element.
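A toy rendition of the scene-tree ideas just listed (attribute inheritance and depth-first traversal), assuming a single color aesthetic for brevity; it is a sketch, not Dancer's or Java3D's actual structure.

```python
class Node:
    """A scene-tree node: children inherit aesthetics from their parent,
    and rendering is a depth-first walk of the tree."""

    def __init__(self, name, color=None):
        self.name, self.color, self.children = name, color, []

    def add(self, child):
        self.children.append(child)     # appending a subtree adds to the scene
        return child

    def render(self, inherited_color=None):
        color = self.color or inherited_color   # inherit the aesthetic if unset
        out = [(self.name, color)]
        for child in self.children:             # depth-first traversal
            out.extend(child.render(color))
        return out

body = Node("body", color="red")
arm = body.add(Node("arm"))
arm.add(Node("hand"))
scene = body.render()
# every descendant inherits "red" unless it overrides it
```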
Wilkinson (1994) employed a graph tree model for constructing and rendering statistical graphics in the SYSTAT package. The hierarchical nodes of the tree were Window, Pane, Frame, Graph, and Element. With the Fisher-Anderson Iris data, for example, SYSTAT plots three scatterplot matrices (three species, four variables) in a window by assigning one Pane node to Window and three Frame nodes to Pane. Each Frame contains sixteen Graph nodes. Each Graph node contains two Element nodes (one for a cloud of points and one for a smoother). The most important consequence of this architecture is that SYSTAT has no scatterplot matrix routine. Instead, SYSTAT has a method for assembling graphical elements in a structure of one or more graphs and frames. Anything it can do inside a single scatterplot it can do simultaneously inside multiple scatterplot matrices. For the same reasons, SYSTAT has no routine for drawing a Trellis (Becker et al., 1996). Instead, SYSTAT constructs a Trellis structure by assembling graphs in a hierarchy determined by values on categorizing variables. Details of look-and-feel can be handled in display modules that render this structure in different styles.
Scene trees and graph trees organize geometric primitives. They are essential for creating efficient real-time streaming graphics systems for a variety of reasons, some of which are evident in the SYSTAT architecture. Most important, perhaps, scene tree architecture enables rapid rendering.
Supervised vs. Immediate Rendering
When we render a scene that is rapidly changing, we may re-render the entire scene every time a change is detected or render only the parts that change. If our goal is to render complex scenes that may change up to 20 times a second, we choose the latter strategy.
In a graphics foundation like Java3D, the re-rendering housekeeping is taken care of by the Java scene tree. When the scene tree is updated, only the parts of the screen that need updating are re-rendered. An added benefit is that foundations like Java3D are designed to accommodate 3D hardware accelerators.
Statistics
Statistical graphics depend on the calculation of statistics – point estimates, interval estimates, smoothers, summaries. In a static data environment we can precompute these statistics before rendering a graph. The most prevalent example of this strategy is in OLAP (On Line Analytic Processing) systems based on ETL (Extracting, Transforming, Loading) technology. These systems aggregate data from various sources, store them in a warehouse, and display graphics such as pie and bar charts on the aggregates. The Temple MVV visualization system of Mihalisin et al. (1995) works similarly.
Update/Downdate
Streaming graphics require a different type of statistical algorithm. In the simplest case, such as accumulation of sums and sums of squares and cross-products, we use an update/downdate strategy. When new data arrive in our computational buffer, we discard the oldest data item, downdate its contribution to the accumulated statistics, and update the statistics from the new contribution. This process can accumulate rounding error, so it is essential to occasionally recalculate all values from the data in a window. This recalculation can be scheduled to occur at convenient times, similar to the way garbage collection is handled in an interpretive system.
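The update/downdate strategy, with the scheduled full recomputation, might look like this sketch; the window length and the recomputation interval are arbitrary choices of ours.

```python
from collections import deque

class WindowStats:
    """Running sum and sum of squares over a sliding window, maintained
    by update/downdate, with an occasional full recomputation to flush
    accumulated rounding error."""

    def __init__(self, length, recompute_every=1000):
        self.window = deque(maxlen=length)
        self.s, self.ss = 0.0, 0.0              # sum and sum of squares
        self.updates = 0
        self.recompute_every = recompute_every

    def push(self, x):
        if len(self.window) == self.window.maxlen:
            old = self.window[0]
            self.s -= old                        # downdate the oldest contribution
            self.ss -= old * old
        self.window.append(x)
        self.s += x                              # update with the new contribution
        self.ss += x * x
        self.updates += 1
        if self.updates % self.recompute_every == 0:
            self.s = sum(self.window)            # scheduled full recalculation
            self.ss = sum(v * v for v in self.window)

    def mean(self):
        return self.s / len(self.window)

w = WindowStats(3)
for x in [1.0, 2.0, 3.0, 4.0]:
    w.push(x)
# the window now holds [2, 3, 4]
```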
Iterative calculations are not as simple. We can update and downdate Hessians and Jacobians, but handling convergence in a real-time environment is problematic. Other statistical algorithms present different problems. Guha et al. (2000), Datar et al. (2002), and others are concentrating on these issues.
Visualization
The simplest display for streaming data reflects the current state of the stream. Geometric elements such as points and lines move as the stream progresses. Real time is only one aspect of what Dancer does with streaming data, however.
Instant replay vs. animation
Whatever the time interval (seconds, minutes, hours, ...) we cannot expect a viewer to observe a display continuously. A real-time system needs to incorporate alerts for out-of-bounds conditions. Visual and audio alerts are common in control applications in military, health care, and manufacturing.
When an alert occurs, it is not always clear what antecedent conditions led to its occurrence. Consequently, Dancer provides an instant replay feature to allow a viewer to rewind a scenario and replay it as an animation. This is an aspect of a more general Dancer feature allowing us to animate any data series. Although designed for real-time applications, Dancer's functional time model can be applied to any indexed (ordered) variable so as to allow animation of other variables. In this respect, Dancer can behave like exploratory systems such as XGobi (Swayne et al., 1998) or Data Desk (Velleman, 1988) that animate over a variable.
This type of archival animation needs to be distinguished from Dancer's normal real-time model, however. Archival animation offers the opportunity to pre-process data that we do not have in real time. In the extreme, animation systems such as Flash (www.macromedia.com) can prepare bitmap frames that are played in a media player. These animations are fundamentally different from the architecture in Dancer.
Transformations of time
When we animate a sequence by buffering and replaying captured data, we may transform time in a variety of ways. Reversing time helps us to untangle sequential dependencies. Differencing time helps us to recognize rates. Double-differencing time reveals acceleration/deceleration. Polarizing time (a cyclical transformation) helps us to compare cycles. Logging time helps us to view order-of-magnitude effects. Geologists and cosmologists often log time scales and, as Graham Wills has pointed out (personal communication), ordinary people often use a similar transformation when they look into the future on a day ... week ... month ... year time-scape.
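Two of these transformations, reversal and differencing, are simple enough to sketch on a buffered series; the helper names are our own, purely illustrative.

```python
def reverse(series):
    # reversing time can untangle sequential dependencies
    return series[::-1]

def difference(series):
    # differencing time turns levels into rates
    return [b - a for a, b in zip(series, series[1:])]

values = [1, 4, 9, 16]
rates = difference(values)      # rates of change
accel = difference(rates)       # double-differencing: acceleration
```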
Conclusion
Streaming data require streaming algorithms. Much of the technology for processing these data is relatively new and much is yet to be done. Because statistical graphics is more involved with processing data than with drawing pictures, there is considerable technology transfer that can occur from related fields in computer science.
An indication of the challenges involved in processing real-time data can be seen in the contrasts between Dancer and its sibling nViZn. Both are based on the graphics grammar model and both are programmed in Java to take advantage of a Web environment. The two programs share not a line of code.
References
Becker, R.A., Cleveland, W.S., and Shyu, M-J. (1996). The design and control of Trellis display. Journal of Computational and Graphical Statistics, 5:123-155.

Box, G.E.P., and Jenkins, G.M. (1976). Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day.

Chambers, J.M., Cleveland, W.S., Kleiner, B., and Tukey, P.A. (1983). Graphical Methods for Data Analysis. Monterey, CA: Wadsworth.

Cleveland, W.S. and McGill, M.E., Eds. (1988). Dynamic Graphics for Statistics. Belmont, CA: Wadsworth Advanced Books.

Datar, M., Gionis, A., Indyk, P., and Motwani, R. (2002). Maintaining stream statistics over sliding windows. Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco.

Feamster, N.G. (2001). Adaptive delivery of real-time streaming video. Master's thesis, Department of Electrical Engineering and Computer Science, MIT. Online at http://nms.lcs.mit.edu/papers/feamster-thesis.pdf

Guha, S., Mishra, N., Motwani, R., and O'Callaghan, L. (2000). Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science (FOCS 2000).

Mihalisin, T., Timlin, J., Gawlinski, E., and Mihalisin, J. (1995). Visual analysis of very large multivariate databases. ASA Proceedings of the Section on Statistical Graphics, 18-27.
Swayne, D.F., Cook, D., and Buja, A. (1998). XGobi: interactive dynamic data visualization in the X Window System. Journal of Computational and Graphical Statistics, 7:113-130.

Velleman, P.F. (1988). Data Desk. Ithaca, NY: Data Description Inc.

Wilkinson, L. (1994). SYSTAT, Version 6. Evanston: SYSTAT, Inc.

Wilkinson, L. (1999). The Grammar of Graphics. New York: Springer-Verlag.

Wilkinson, L., Rope, D.J., Carr, D.B., and Rubin, M.A. (2000). The language of graphics. Journal of Computational and Graphical Statistics, 9:530-543.
TOPICS IN STATISTICAL COMPUTING
Rolling Your Own: Linear Model Hypothesis Testing and Power Calculations via the Singular Value Decomposition
Steve Verrill, Mathematical Statistician, USDA Forest Service, Forest Products Laboratory, [email protected]
Abstract
We outline the steps that would permit a statistician to produce special purpose linear model routines through the use of high quality public domain numerical analysis software.
Introduction
Good commercial linear model packages are readily available. It sometimes happens, however, that one would like linear model code that could be incorporated into a simulation. Verrill (1999) discusses such a simulation in the context of predictor sort sampling. See http://ws13.fpl.fs.fed.us/ttconf.html for a web interface to the simulation program.
Further, a sophisticated user can sometimes becomefrustrated with the inflexibility of a commercial pack-age. This can be particularly true if the user is con-fronted with unbalanced data or complex hypotheses. Inaddition, some commercial linear models packages donot include the ability to perform power calculations.
In such cases the user can make use of public domain computer routines that yield flexible linear model capabilities. In this note we step potential users through the computations needed to perform hypothesis tests and power calculations. We follow the theoretical approach of Scheffe (1959). To do the numerical work we make use of the singular value decomposition (see, for example, Thisted (1988)). There are, of course, other numerical techniques that can be used to perform the necessary calculations (see, for example, Kennedy and Gentle (1980), Gentle (1998)). We focus on the singular value decomposition because it yields an approach that is numerically stable, reasonably efficient, and simple to explain and implement. We also suggest the use
of the DCDFLIB public domain package of distributionroutines.
The relation between the singular value decomposition,least squares, generalized inverses, and estimability hasbeen discussed in Good (1969) and Eubank and Webster(1985).
Hypothesis Testing
The standard linear model is

y = Xβ + ε,

where y is the n × 1 vector of responses, X is the n × p design matrix, β is the p × 1 parameter vector, and ε is the n × 1 vector of random errors. We assume that the ε_i are independent, identically distributed N(0, σ²) random variables.
We want to test a hypothesis of the form

c_1^T β = η_1
c_2^T β = η_2    (1)
. . .
c_q^T β = η_q.

(It is often the case that c_i^T 1 = 0, in which case c_i^T is referred to as a "contrast.")
For example, in a one-way ANOVA we are testing
( 1 −1 0 0 . . . 0 )β = 0( 1 0 −1 0 . . . 0 )β = 0
...
( 1 0 0 . . . 0 −1 )β = 0.
To test hypothesis (1) we proceed in a series of steps:
Is the Hypothesis Overspecified?
We need to determine whether the c_i are linearly independent. (If they are not, the hypothesis is overspecified. The user needs to think more clearly about the hypothesis and arrive at a set of c_i's that are linearly independent.)
Let

C_{p×q} = (c_1 . . . c_q),

where q ≤ p. The singular value decomposition of C is

C = U_{p×q} diag(γ_1, . . . , γ_q) V^T_{q×q} = U D_γ V^T,
where the columns of U are orthonormal, V is an orthogonal matrix, and D_γ is the diagonal matrix with the singular values of C, γ_1 ≥ γ_2 ≥ . . . ≥ γ_q ≥ 0, as its diagonal. Thus the rank of C is just the number of nonzero singular values. Because of the limitations of computer arithmetic, the null γ_i's will not in general be exactly equal to zero. We need to determine a threshold value. If a γ_i lies below that threshold, we take it to be equal to zero and conclude that the hypothesis is overspecified. A threshold value that is suggested in the numerical analysis literature (for example, Golub and Van Loan (1996)) is
‖C‖ × (the machine precision).
Our experience suggests that
threshold = ‖C‖ × (the machine precision) × 10
is a better rule of thumb. Recall that

‖C‖ = √( Σ_{i=1}^p Σ_{j=1}^q c_{ij}^2 ) = √( trace(C C^T) )
    = √( trace(U D V^T V D U^T) )
    = √( trace(U D² U^T) )
    = √( trace(D² U^T U) ) = √( γ_1² + . . . + γ_q² ).
For double precision arithmetic on 32 bit computers one would use

√( γ_1² + . . . + γ_q² ) × 10^{−15}

as the threshold value.
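Assuming numpy's SVD as a stand-in for the LINPACK/LAPACK routines discussed later, the overspecification check might be coded as follows (the function name is ours):

```python
import numpy as np

def hypothesis_rank(C, eps=1e-15):
    # singular values gamma_1 >= ... >= gamma_q of the hypothesis matrix C
    gamma = np.linalg.svd(C, compute_uv=False)
    # ||C|| = sqrt(gamma_1^2 + ... + gamma_q^2); threshold = ||C|| * eps * 10
    threshold = np.sqrt(np.sum(gamma ** 2)) * eps * 10
    return int(np.sum(gamma > threshold))

# q = 2 contrasts on p = 3 parameters; in C_bad the second column
# duplicates the first, so the hypothesis is overspecified (rank 1 < q = 2)
C_bad = np.array([[1.0, 1.0],
                  [-1.0, -1.0],
                  [0.0, 0.0]])
C_ok = np.array([[1.0, 1.0],
                 [-1.0, 0.0],
                 [0.0, -1.0]])
```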
Are the c_i's Estimable?
We need to know whether there exists an a_i that satisfies

E(a_i^T y) = c_i^T β

for all β, where E is the expectation operator. This holds true if and only if there exists an a_i that satisfies

a_i^T X = c_i^T  or  c_i = X^T a_i.    (2)
We determine whether equation (2) holds by first obtaining a singular value decomposition of the design matrix, X:

X_{n×p} = U_{n×r} Λ V^T_{r×p},    (3)

where the λ's are nonzero (r ≤ p is the rank of X), or

X^T = V Λ U^T.

Thus the projection operator onto the space spanned by the columns of X^T, L(X^T), is V V^T, and c_i = X^T a_i for some a_i if and only if c_i ∈ L(X^T), or V V^T c_i = c_i. Using double precision arithmetic on 32 bit machines, one would take them to be equal if

‖V V^T c_i − c_i‖ ≤ ‖c_i‖ × 10^{−15}.
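A numpy sketch of the estimability check; we use a slightly looser tolerance than the 10^{−15} quoted above, as a conservative numerical safeguard, and the function name is our own.

```python
import numpy as np

def is_estimable(X, c, eps=1e-15):
    # SVD of the design matrix; the rows of Vt span the row space L(X^T)
    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(lam > lam[0] * eps * 10))   # numerical rank of X
    V = Vt[:r].T
    # c is estimable iff V V^T c = c, i.e. c lies in L(X^T)
    return np.linalg.norm(V @ V.T @ c - c) <= np.linalg.norm(c) * 1e-12

# overparameterized one-way layout (intercept plus two dummies), rank 2:
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
# the treatment difference alpha_1 - alpha_2 is estimable;
# alpha_1 on its own is not
```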
Find the a_i that Satisfies a_i^T X = c_i^T
Given that c_i is estimable, there is a unique a_i in the linear span of the columns of X, L(X), such that

a_i^T X = c_i^T.

(This is one of Scheffe's lemmas.) Using the singular value decomposition of X, equation (3), we know that the solution of c_i = X^T a_i satisfies

c_i = V Λ U^T a_i

or

U Λ^{−1} V^T c_i = U U^T a_i.    (4)

But because a_i ∈ L(X) and U U^T is the projection operator onto L(X), from equation (4) we have

U Λ^{−1} V^T c_i = U U^T a_i = a_i.

Thus, we can use the left-hand side of the equation above to calculate the a_i that satisfies a_i^T X = c_i^T.
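The formula a_i = U Λ^{−1} V^T c_i translates directly into numpy; this is an illustrative sketch, with the reduced SVD playing the role of equation (3).

```python
import numpy as np

def solve_a(X, c, eps=1e-15):
    # reduced SVD of X, keeping only the r nonzero singular values
    U, lam, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(lam > lam[0] * eps * 10))
    U, lam, Vt = U[:, :r], lam[:r], Vt[:r]
    # a = U Lambda^{-1} V^T c
    return U @ ((Vt @ c) / lam)

# cell-means design, two groups of sizes 2 and 1; c picks out mu_1 - mu_2
X = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
c = np.array([1.0, -1.0])
a = solve_a(X, c)
# a satisfies a^T X = c^T and lies in L(X)
```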
Find an Orthonormal Basis for the Linear Span of a_1, . . . , a_q — L(A)
We simply perform a singular value decomposition of A = (a_1 . . . a_q) to obtain

A_{n×q} = U_{n×q} Δ V^T_{q×q},

and the columns of U form an orthonormal basis for L(A).
Find the Hypothesis Sum of Squares
Case 1 — η_1 = . . . = η_q = 0. Let u_{A,1}, . . . , u_{A,q} be an orthonormal basis for L(A), found as described in the preceding subsection. Scheffe's theory tells us that the hypothesis sum of squares is

HSS ≡ Σ_{i=1}^q ( u_{A,i}^T y )²

with q degrees of freedom.
Case 2 — at least one of the η's is nonzero. Scheffe's theory tells us that the hypothesis sum of squares is

HSS ≡ ( A^T y − η )^T ( A^T A )^{−1} ( A^T y − η )

with q degrees of freedom.
We can make use of the singular value decomposition of A given above to obtain

( A^T A )^{−1} = V Δ^{−2} V^T.    (5)

Then taking

w ≡ V^T ( A^T y − η ),

we have

HSS = Σ_{i=1}^q w_i² / δ_i².
Find the Residual Sum of Squares
Let u_{X,1}, . . . , u_{X,r} be the columns of the U matrix of the singular value decomposition, equation (3), of X. Then the projection of y onto the linear span of the columns of X, P_{L(X)}(y), equals

( u_{X,1}^T y ) u_{X,1} + . . . + ( u_{X,r}^T y ) u_{X,r},

and the residual sum of squares equals

RSS ≡ ( y − P_{L(X)}(y) )^T ( y − P_{L(X)}(y) )
    = y^T y − ( u_{X,1}^T y )² − . . . − ( u_{X,r}^T y )²

with n − r degrees of freedom.
Form the F Statistic and Compare It to the Appropriate Critical Values
The F statistic equals

( HSS / q ) / ( RSS / (n − r) ).

This should be compared to an F_{q,n−r}(1 − α) critical value, where α is the significance level.
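Putting the preceding steps together for a small worked example: a two-group one-way ANOVA in cell-means form, where the F statistic should equal the squared two-sample pooled-variance t statistic. This is our own numpy sketch, not the article's FORTRAN/Java code.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 5.0, 6.0, 7.0])
X = np.array([[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3)  # cell-means design
C = np.array([[1.0], [-1.0]])                      # H0: mu_1 - mu_2 = 0
n = X.shape[0]

# SVD of X, equation (3), keeping only the r nonzero singular values
Ux, lam, Vtx = np.linalg.svd(X, full_matrices=False)
r = int(np.sum(lam > lam[0] * 1e-14))
Ux, lam, Vtx = Ux[:, :r], lam[:r], Vtx[:r]

# a_i = U Lambda^{-1} V^T c_i, assembled columnwise into A
A = Ux @ (Vtx @ C / lam[:, None])

# orthonormal basis for L(A) and the hypothesis sum of squares (eta = 0)
Ua, delta, _ = np.linalg.svd(A, full_matrices=False)
q = int(np.sum(delta > delta[0] * 1e-14))
HSS = np.sum((Ua[:, :q].T @ y) ** 2)

# residual sum of squares, n - r degrees of freedom
RSS = y @ y - np.sum((Ux.T @ y) ** 2)
F = (HSS / q) / (RSS / (n - r))
# compare F to an F_{q, n-r}(1 - alpha) critical value
```

For these data the group means are 2 and 6 with pooled variance 1, so F reproduces the squared t statistic.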
Confidence Intervals
Suppose that we are interested in confidence intervals on l estimable combinations of the parameters, c_1^T β, . . . , c_l^T β, and further suppose that the linear span of the c's has rank q ≤ l. This rank can be determined as described in section 2.1. Let s² ≡ RSS/(n − r). Then (see Scheffe (1959)) we know that an individual (1 − α) × 100% confidence interval for c_j^T β is

a_j^T y ± s × t_{n−r}(α/2) × √( a_j^T a_j ),

where t_{n−r}(α/2) is the appropriate t critical value, and a_j satisfies a_j^T X = c_j^T (see section 2.3). Also, joint (1 − α) × 100% confidence intervals for the c_j^T β's, j ∈ {1, . . . , l}, are given by

a_j^T y ± s × √( q × F_{q,n−r}(1 − α) × a_j^T a_j ),

where F_{q,n−r}(1 − α) is the appropriate F critical value.
Power Calculations
The noncentrality parameter is given by

NCP ≡ ( A^T Xβ − η )^T ( A^T A )^{−1} ( A^T Xβ − η ) / σ²
    = ( C^T β − η )^T ( A^T A )^{−1} ( C^T β − η ) / σ².    (6)

Note that ( A^T A )^{−1} can be calculated as in equation (5).
Scheffe's noncentrality parameter equals the square root of our noncentrality parameter. Our version corresponds with what the DCDFLIB library (see below) expects.
In a completely general power calculation program, possible values of β, η, and σ² would be specified by a user in the course of power calculations. In the common case in which η = 0 and the β's represent treatment means, it might be more reasonable to expect a user to specify the components of the β vector as fractions of an overall mean and to specify a range of coefficients of variation. This would yield β/σ values that would enable one to calculate NCP.
Under the null hypothesis, C^T β − η = 0 and the noncentrality parameter is 0, but under the alternative hypothesis C^T β ≠ η and the noncentrality parameter is inflated above zero. If we operate at an α significance level, the power is given by

Power = Prob( F_{q,n−r,NCP} > F_{q,n−r}(1 − α) ),    (7)

the probability that a noncentral F distribution with q numerator degrees of freedom, n − r denominator degrees of freedom, and noncentrality parameter NCP lies above the 100(1 − α)th percentile of a central F distribution with q numerator degrees of freedom and n − r denominator degrees of freedom.
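Equation (7) can be evaluated with any noncentral F routine; here is a sketch using scipy.stats in place of DCDFLIB (the wrapper function is ours).

```python
from scipy import stats

def f_test_power(ncp, q, df2, alpha=0.05):
    # central F critical value at the alpha significance level
    crit = stats.f.ppf(1.0 - alpha, q, df2)
    # Power = Prob(F_{q, df2, NCP} > crit), equation (7)
    return stats.ncf.sf(crit, q, df2, ncp)

# power increases with the noncentrality parameter
lo = f_test_power(1.0, q=1, df2=4)
hi = f_test_power(24.0, q=1, df2=4)
```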
Implementing the theory as code
Public domain FORTRAN or C code to perform the singular value decomposition is found in the LINPACK (Dongarra and others (1979)) or (C)LAPACK (Anderson and others (1995)) linear algebra libraries. These can be obtained over the internet at http://www.netlib.org/liblist.html. Public domain C++ code to perform the singular value decomposition can be found by searching on svd at http://www.netlib.org. A Java translation of the LINPACK singular value decomposition can be found at http://www1.fpl.fs.fed.us/linear_algebra.html. This site also points to Java translations of the LAPACK linear algebra routines.
Public domain FORTRAN or C code to calculate the t distribution and the central and noncentral F distributions and their inverses can be found in the DCDFLIB library. DCDFLIB is a public domain library of "routines for cumulative distribution functions, their inverses, and their parameters." It was produced by Barry Brown, James Lovato, and Kathy Russell of the Department of Biomathematics, M.D. Anderson Cancer Center, The University of Texas. DCDFLIB can be found at http://odin.mdacc.tmc.edu/anonftp/page_2.html.
A relatively raw example of the use of the LINPACK and DCDFLIB routines to create a linear models program can be obtained (or run over the Web) at http://www1.fpl.fs.fed.us/glm.html. This site includes sample input and output based on Table 9.1 in Milliken and Johnson (1992). We note that linear model routines produced by these methods would need extra work to become user friendly. In particular, we have finessed the issue of the generation or input of the design, X, and hypothesis, C^T, matrices. We would expect a sophisticated user interested in simulations or special purpose analyses to be able to generate these matrices by hand. A person new to this approach might want to take a look at some of the examples in Milliken and Johnson (1992).
An algorithm that automatically generates design matrices for balanced factorial experiments is described in MacKenzie and O'Flaherty (1982).
Kennedy and Gentle (1980) (page 388) note:
[User friendly] computer software must include the ability to accept user specification of the model. Most programs in use today allow the user to provide some rather natural algebraic specification. The program then deciphers the specification and translates it into numeric coding for subsequent use. There are no established standards for doing this, but many techniques used in compiler construction can be applied to this problem.
An example of such an algebraic specification is discussed in Wilkinson and Rogers (1973).
References
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., and Sorensen, D. (1995). LAPACK Users' Guide, Second Edition. Society for Industrial and Applied Mathematics, Philadelphia.

Dongarra, J., Moler, C., Bunch, J., and Stewart, G. (1979). LINPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia.

Eubank, R. L. and Webster, J. T. (1985). "The Singular-Value Decomposition as a Tool for Solving Estimability Problems," The American Statistician, 39, 64–66.

Golub, Gene H. and Van Loan, Charles F. (1996). Matrix Computations, Third Edition. The Johns Hopkins University Press, Baltimore and London.

Gentle, J. E. (1998). Numerical Linear Algebra for Applications in Statistics. Springer-Verlag, New York.

Good, I. J. (1969). "Some Applications of the Singular Decomposition of a Matrix," Technometrics, 11, 823–831.

Kennedy, W. J. and Gentle, J. E. (1980). Statistical Computing. Marcel Dekker, New York and Basel.

MacKenzie, G. and O'Flaherty, M. (1982). "Direct Design Matrix Generation for Balanced Factorial Experiments," Applied Statistics, 31, 74–80.

Milliken, G. A. and Johnson, D. E. (1992). Analysis of Messy Data, Volume I: Designed Experiments. Chapman and Hall, London.

Scheffe, H. (1959). The Analysis of Variance. John Wiley and Sons, New York.

Thisted, R. A. (1988). Elements of Statistical Computing. Chapman and Hall, London.

Verrill, S. (1999). "When Good Confidence Intervals Go Bad: Predictor Sort Experiments and ANOVA," The American Statistician, 53, 38–42.

Wilkinson, G. N. and Rogers, C. E. (1973). "Symbolic Description of Factorial Models for Analysis of Variance," Applied Statistics, 22, 392–399.
18 Statistical Computing & Statistical Graphics Newsletter Vol.12 No.1
Correspondence Analysis with R
Brigitte Charnomordic, Biométrie, INRA, France
and Susan Holmes, Statistics Department, Stanford
[email protected]
Abstract
We provide here a few didactic examples for understanding correspondence analysis, using R to illustrate the use of matrix decompositions. We show an example of how correspondence analysis can help analyze difficult multidimensional data such as DNA microarray data.
Introduction
In France, students in PhD or Masters programs in applied statistics have to take a course on correspondence analysis. In English-speaking countries this method is only taught in some education departments¹, and there are very few books or articles in English on the subject. Michael Greenacre, who studied in France under Benzécri [1], has written several very readable texts on the subject [5].
With today's explosion of data in the form of contingency tables, both two-way and three-way, this method is very useful in applications such as gene expression microarray data, information retrieval, linguistics and sociology.
Correspondence analysis provides students at the masters level in statistics, applied mathematics, biology and engineering with an opportunity to perform multivariate data analysis without having to use prepackaged programs. Thus they also learn how to use graphical and numerical functions in high-level languages such as R/Splus or Matlab.
Aims and relevant data
We can divide data-mining and multivariate statistical methods into two groups. One series of methods confers a particular status on one variable or set of variables, which are to be predicted or explained: they are the response. These methods include regression, multiple-response regression, discriminant analysis, and analysis of variance, depending on whether the explanatory variables are categorical or continuous. These are not the ones we are going to study here.
Correspondence analysis is useful when all the variables have the same status. This is sometimes called unsupervised learning, and includes clustering and principal components as well. These methods have in common the creation of a new set of variables that simplify the arrays at hand. In the case of clustering the new variable is categorical; in correspondence analysis and principal components the new variables are continuous and enable the construction of useful new graphical representations of the data.
Correspondence analysis is an exploratory method because it does not presuppose any model for the data, as do Goodman's bilinear methods or factor analysis models, for instance. Correspondence analysis and principal components can both be extended to three-way arrays, for instance for the analysis of bootstrap permutation tests or time series of matrices. Such data are often called data cubes.
Correspondence Analysis
Correspondence analysis (CA, also called homogeneity analysis and reciprocal averaging) can be used to analyze several types of multivariate data. All involve some categorical variables. Here are some examples of the types of data that can be decomposed using this method:
• Contingency tables (a cross between two categorical variables).

• Multiple contingency tables (a cross between several categorical variables).

• Binary tables obtained by cutting continuous variables into classes and then recoding both these variables and any extra categorical variables into 0/1 tables, a 1 indicating presence in that class. So, for instance, a continuous variable cut into three classes will provide three new binary variables, of which only one can take the value 1 for any given observation.
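The recoding described in the last item can be sketched in a few lines of Python; the helper name and the break points below are ours, purely illustrative:

```python
# Disjunctive (0/1) coding of a continuous variable cut into classes.
def to_indicators(values, breaks):
    """One binary column per class; each observation gets exactly one 1."""
    ncls = len(breaks) + 1
    rows = []
    for v in values:
        k = sum(v > b for b in breaks)           # class index 0 .. ncls-1
        rows.append([1 if j == k else 0 for j in range(ncls)])
    return rows

tab = to_indicators([1.2, 3.5, 2.0, 4.8], breaks=[2.0, 4.0])
# -> [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```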
To a first approximation, correspondence analysis can be understood as an extension of principal components analysis (PCA) where the variance in PCA is replaced by an inertia proportional to the $\chi^2$ distance of the table from independence. CA decomposes this measure of departure from independence along axes that are orthogonal according to the $\chi^2$ inner product. If we are comparing two categorical variables, the simplest possible model is that of independence, in which case the counts in the table would approximately obey the margin-products identity. For an $m \times p$ contingency table with a total sample size of $n = \sum_{i=1}^m \sum_{j=1}^p n_{ij} = n_{\cdot\cdot}$, independence means
$$ n_{ij} \doteq \frac{n_{i\cdot}\, n_{\cdot j}}{n}. $$
1 Exception made for a few statistics departments, for instance UCLA where Jan de Leeuw teaches multivariate statistics.
This can also be written $N \doteq n\, r c'$, where
$$ r = \frac{1}{n} N 1_p \quad\text{and}\quad c = \frac{1}{n} N' 1_m. $$
The departure from independence is measured by the $\chi^2$ statistic
$$ X^2 = \sum_{i,j} \frac{\left(n_{ij} - \dfrac{n_{i\cdot}\, n_{\cdot j}}{n}\right)^2}{\dfrac{n_{i\cdot}\, n_{\cdot j}}{n}}. $$
Example: Eye color - Hair color
Here is a simple contingency table as our example, from Snee (1974) [8].

eyes     Black  Brunette  Red  Blonde
Brown      68     119     26      7
Blue       20      84     17     94
Hazel      15      54     14     10
Green       5      29     14     16
> chisq.test(eyes)

        Pearson's Chi-squared test

data:  eyes
X-squared = 138.2898, df = 9, p-value < 2.2e-16
This is a very extreme point in the $\chi^2$ distribution. The inertia is the $\chi^2$ statistic divided by the number of observations (sum(eyes)); here the inertia is $138.3/592 = 0.2336$. CA decomposes this inertia into a sum of eigenvalues of a symmetrized, reweighted version of the original table, as will be explained below. The R command resca <- ca.us(eyes) first provides a scree plot of these eigenvalues, in this example:
       Values  Percent
1      0.2088      89.  ******************
2      0.0222      10.  **
3      0.0026       1.  .
Total  0.2336     100.
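These numbers are easy to reproduce outside of R. Here is a stdlib-only Python check of the $\chi^2$ statistic and the total inertia, using the Snee counts above:

```python
# Chi-square statistic and total inertia for the eye color / hair color table.
eyes = [[68, 119, 26, 7],    # Brown
        [20, 84, 17, 94],    # Blue
        [15, 54, 14, 10],    # Hazel
        [5, 29, 14, 16]]     # Green
n = sum(map(sum, eyes))                           # total count, 592
rowsum = [sum(row) for row in eyes]
colsum = [sum(col) for col in zip(*eyes)]

# X^2 = sum over cells of (observed - expected)^2 / expected
chi2 = sum((eyes[i][j] - rowsum[i] * colsum[j] / n) ** 2
           / (rowsum[i] * colsum[j] / n)
           for i in range(4) for j in range(4))
inertia = chi2 / n
print(round(chi2, 2), round(inertia, 4))   # 138.29 0.2336
```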
[Figure: R screenshot on Mac OS X]
Formulation as a generalized singular value decomposition
We are given an $m \times p$ contingency table of counts $N$, with $m$ levels for the row variable and $p$ levels for the column variable. (This is equivalent to a binary matrix $X$ with $n = \sum_{ij} n_{ij} = n_{\cdot\cdot}$ observations on $m + p$ columns, a notion that is useful for the generalization later.)
The first transformation makes the contingency matrix $N$ into a frequency matrix $F = \frac{1}{n} N$. We denote the row sums by $r = F 1_p$ and the column sums by the vector $c = F' 1_m$. These both sum to one:
$$ r' 1_m = c' 1_p = 1. $$
In the case of independence,
$$ F \doteq r c'. $$
All the rows would then be multiples of each other or, as this is sometimes called, homogeneous. So if each row is divided by the weight of that row, these so-called row profiles $D_r^{-1} F$ would all be equal ($D_r^{-1} F = 1_m c'$), where $D_r^{-1}$ denotes the diagonal matrix with the vector $r^{-1}$ on its diagonal.
The average row in the case of homogeneity and independence is obtained by averaging the rows with the relevant weights for each column; the average of the row profiles is $c'$. The departure from independence and homogeneity is measured by some norm of $D_r^{-1} F - 1_m c'$ (or, at the term-by-term level, $\frac{f_{ij}}{f_{i\cdot}} - f_{\cdot j}$). With this notation we remark that
$$ \chi^2 = n \sum_{i,j} \frac{(f_{ij} - r_i c_j)^2}{r_i c_j} = n \sum_{i,j} r_i c_j \left(\frac{f_{ij}}{r_i c_j} - 1\right)^2. $$
Verification in R:

> F <- eyes/sum(eyes)
> r <- apply(F, 1, sum)
> c <- apply(F, 2, sum)
> E <- outer(r, c)
> sum((F - E)^2/E) * 592
[1] 138.2898
To compute the distance between profiles, each column is reweighted by the inverse of its sum; this gives the $\chi^2$ distance between row profiles:
$$ X^2 = n\, \mathrm{trace}\left((F - rc')' D_r^{-1} (F - rc') D_c^{-1}\right) = n\, \mathrm{trace}(A'A), \quad\text{where } A = D_{\sqrt{r}}^{-1} (F - rc') D_{\sqrt{c}}^{-1}. $$

The latter decomposition justifies choosing the matrix $A$ as a natural square root. $W = A'A$ is in a sense the characteristic matrix-operator of the analysis, in the same way that the covariance or correlation matrices are those of principal components analysis. Generalizing principal component analysis to include metrics on the rows and the columns can lead to other multivariate techniques such as discriminant analysis (see Mardia, Kent and Bibby (1979) [7]).
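The trace identity above is easy to check numerically. A stdlib-only Python sketch, using the eye/hair counts from the earlier example:

```python
import math

# A = D_{1/sqrt(r)} (F - r c') D_{1/sqrt(c)}: the standardized residuals.
eyes = [[68, 119, 26, 7], [20, 84, 17, 94], [15, 54, 14, 10], [5, 29, 14, 16]]
n = sum(map(sum, eyes))
F = [[x / n for x in row] for row in eyes]
r = [sum(row) for row in F]                       # row masses
c = [sum(col) for col in zip(*F)]                 # column masses

A = [[(F[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(4)]
     for i in range(4)]
# trace(A'A) is the squared Frobenius norm of A, so X^2 = n * ||A||_F^2
chi2 = n * sum(A[i][j] ** 2 for i in range(4) for j in range(4))
print(round(chi2, 2))   # 138.29
```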
Correspondence analysis decomposes the matrix $W$: its eigenvectors give the axes that account for the largest part of the departure from independence, just as principal components provides the axes accounting for the largest variability. Computationally this is achieved by a generalized singular value decomposition
$$ D_r^{-1} F D_c^{-1} - 1_m 1_p' = U S V', \qquad\text{with } V' D_c V = I_p,\quad U' D_r U = I_m, $$
equivalent to the eigendecomposition of $W = A'A$, or to the singular value decomposition
$$ D_r^{-1/2} F D_c^{-1/2} - \sqrt{r}\,\sqrt{c}\,' = (D_r^{1/2} U)\, S\, (D_c^{1/2} V)', $$
where $(D_c^{1/2} V)'(D_c^{1/2} V) = I_p$ and $(D_r^{1/2} U)'(D_r^{1/2} U) = I_m$. The CA plots can be used to find out whether there is a hidden ordination in the data, such as the chronological seriation studied below.
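As a numerical sanity check that the squared singular values are the inertias of the scree plot, the following stdlib-only Python sketch (a power iteration standing in for a full SVD, purely for illustration) recovers the leading eigenvalue of $W = A'A$ for the eye/hair table, which should match the first scree value, 0.2088:

```python
import math

eyes = [[68, 119, 26, 7], [20, 84, 17, 94], [15, 54, 14, 10], [5, 29, 14, 16]]
n = sum(map(sum, eyes))
F = [[x / n for x in row] for row in eyes]
r = [sum(row) for row in F]
c = [sum(col) for col in zip(*F)]
A = [[(F[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j]) for j in range(4)]
     for i in range(4)]
W = [[sum(A[k][i] * A[k][j] for k in range(4)) for j in range(4)]
     for i in range(4)]                            # W = A'A, 4 x 4 symmetric

v = [1.0, 1.0, 1.0, 1.0]                           # power iteration on W
for _ in range(500):
    w = [sum(W[i][j] * v[j] for j in range(4)) for i in range(4)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
lam = sum(v[i] * W[i][j] * v[j] for i in range(4) for j in range(4))
print(round(lam, 4))   # leading inertia, about 0.2088
```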
Finding an underlying ranking
Example: Cox and Brandwood [3] tried to seriate Plato's works using discriminant analysis on the proportions of sentence endings, in a given book, with a given stress pattern. Here we show how such an analysis can be done with correspondence analysis on the table of frequencies of sentence endings². The first 10 profiles (as percentages) look as follows:

         Rep  Laws  Crit  Phil   Pol  Soph   Tim
UUUUU    1.1   2.4   3.3   2.5   1.7   2.8   2.4
-UUUU    1.6   3.8   2.0   2.8   2.5   3.6   3.9
U-UUU    1.7   1.9   2.0   2.1   3.1   3.4   6.0
UU-UU    1.9   2.6   1.3   2.6   2.6   2.6   1.8
UUU-U    2.1   3.0   6.7   4.0   3.3   2.4   3.4
UUUU-    2.0   3.8   4.0   4.8   2.9   2.5   3.5
--UUU    2.1   2.7   3.3   4.3   3.3   3.3   3.4
-U-UU    2.2   1.8   2.0   1.5   2.3   4.0   3.4
-UU-U    2.8   0.6   1.3   0.7   0.4   2.1   1.7
-UUU-    4.6   8.8   6.0   6.5   4.0   2.3   3.3
... etc. (there are 32 rows in all)
The eigenvalue decomposition (the scree plot) shows that two axes provide a summary of 85% of the departure from independence.

> res.plato <- ca.us(platon)
  Eigenvalue  inertia %  cumulative %
1    0.09170      68.96         68.96
2    0.02120      15.94         84.90
3    0.00911       6.86         91.76
4    0.00603       4.53         96.29
5    0.00276       2.07         98.36
6    0.00217       1.64        100.00
Please examine the eigenvalues...
How many axes do you wish to be retained?
The function ca.us asks this question interactively because only the scree-plot visualisation can protect against the separation of two or more very close eigenvalues. The function then returns, as the resulting data list, the coordinates for plotting; these can be visualized using ggobi if there are more than 3 relevant axes. Here is the correspondence analysis representation for the rows and columns taken separately. We have made the labels' sizes proportional to the quality of the representation in this first plane. These plots are obtained by typing caplot.us(res.plato).
[Figure: Plato sentence endings (rows)]

² A computerized analysis of Plato's work appears in Ledger (1989) [6].
[Figure: Plato's works (columns)]
The plot shows very clearly a seriation of all the works; in fact, except for Critias, the seriation is determined by the first axis. The second axis helps to place Critias between Politicus and Sophist, a choice that has also been validated in Ledger (1989) and Cox and Brandwood (1959).
Decomposition of Inertia
χ² distance between profiles
Here is a reason why such a weighted distance could be useful. Take a very simple contingency table:

        [  6  2 12 ]
    X = [ 15  5 30 ]
        [ 21  7 42 ]

The row profiles are all

     [,1] [,2] [,3]
[1,]  0.3  0.1  0.6
[2,]  0.3  0.1  0.6
[3,]  0.3  0.1  0.6

We can see that the profiles are identical, the multinomials generating the rows are said to be homogeneous, and the contingency table is in fact only of rank 1:

        [ 2 ]
    X = [ 5 ] %*% [ 3 1 6 ]
        [ 7 ]
This is exactly the problem we encounter in principal components analysis: we need the singular value decomposition of X, but instead of centering the data with respect to the mean, the data are centered at independence. Another difference lies in the choice of the metric for computing distances between rows or columns.
Decomposition of the difference from independence
A cloud, or scatter, of weighted points: These are points defined in a Euclidean space, say R^p, so that distances between them are easy to compute. However, we associate to each multidimensional point a weight that changes the inertia of the scatter of points. For instance, if we have two points, the one with the higher weight will 'pull' the center of gravity toward it. The same happens for the 'minimum inertia' line: it is pulled toward highly weighted points.
Distributional equivalence: If we add two rows that have the same profiles, this does not change the axes chosen to represent the data (the column profiles' geometry remains unchanged). Thinking of the points as weighted points in a cloud, two points that sit at the same spot can be merged because we can add their weights.
Barycentric representation: Take the simplest case, row profiles of a 3-column contingency table. The profiles sum to one, so they are all representable in a triangle (the 2-dimensional simplex in R³). The vertices are the extreme profiles (1,0,0), (0,1,0) and (0,0,1). Although the row profiles lie in a three-dimensional space, because they belong to this triangle they can be taken out and looked at in these coordinates, called the barycentric coordinate system. Now an extra scale change brings this representation to the correspondence analysis one: the dimensions are weighted inversely by the relative weights of the columns, called the column mean profiles (which also add to 1). The distances we want to represent between points are the χ² distances relevant here, so the sides of the equilateral triangle are stretched to be inversely proportional to the square roots of their mean values.

The side of the triangle that is stretched the most corresponds to the least frequent column.

This representation is the one chosen by default by the function caplot. There is a delicate issue of choosing the scales in the two dimensions so that simultaneous representations of rows and columns are valid (the program warns the user when such a scaling has not been chosen). Even with the relevant choice of scaling, proximities between row and column points are hard to interpret; it is easier to interpret the directions of the different rows and columns.
Reading the Output
Although the maps provided by both correspondence analysis and principal components analysis look quite simple, there are traps that lead to misinterpretations that must be avoided. Associated with the coordinates in the new spaces are what we call loadings or contributions, which are indicators of how true the proximities in the image space are. To this end, the function returns a list containing the eigenvalues, the coordinates for the rows and for the columns that are used for building the graphical representations, and the absolute contributions and squared cosines that are important diagnostic tools.
> names(resca)
 [1] "recap"     "valp"      "ind.cords" "var.cords"
 [5] "ind.cta"   "var.cta"   "ind.ctr"   "var.ctr"
 [9] "ind"       "var"       "axes"
• When trying to understand the most meaningful rows or columns for a given axis, we look at the absolute contributions of the rows or columns to that axis; this gives the amount of an axis's inertia explained by a single row or column.

• The relative contribution of an axis (or of two axes) to the inertia of a row. This is the squared cosine of the point with the axis, which says how well the point is projected onto the axis.
Contribution³ to the inertia from row i:

The distance from the ith row to the center of the row points is
$$ d^2_{\chi^2}(\mathrm{profile}_i, \mathrm{center}) = \sum_j \frac{1}{f_{\cdot j}}\left(\frac{f_{ij}}{f_{i\cdot}} - f_{\cdot j}\right)^2 = \sum_j f_{\cdot j}\left(\frac{f_{ij}}{f_{i\cdot} f_{\cdot j}} - 1\right)^2 = \sum_k (s_k u_i^k)^2. $$

This row thus participates in the inertia by this amount, weighted by the row's mass $r_i$. The amount can be decomposed over each of the axes separately, giving the contribution of each row to the inertia of each axis: the absolute contribution of row $i$ to axis $k$ is $r_i (s_k u_i^k)^2$, which, rescaled by the axis inertia $s_k^2$, is $r_i (u_i^k)^2$. The contributions of all the rows to a given axis add to one; this translates the fact that $U' D_r U = I_m$. Here are the absolute contributions for the eyes data:
%Output resca$ind.cta
        Axis 1  Axis 2  Axis 3
Brown   0.2225  0.3788  0.2163
Blue    0.0509  0.0232  0.4428
Hazel   0.0096  0.5513  0.3191
Green   0.7170  0.0467  0.0217
Total   1.      1.      1.
Thus we can see that the most important row category for explaining the first axis is Green eyes. The contributions to inertia decomposed according to the column categories give:

%Output resca$var.cta
         Axis 1  Axis 2  Axis 3
Black    0.4312  0.1304  0.0668
Brunette 0.5213  0.1124  0.0031
Red      0.0340  0.1980  0.6109
Blonde   0.0135  0.5591  0.3192
Total    1.      1.      1.
Squared cosine

To interpret the accuracy of a projection, it is important to consult the cosine of the angle between the point and its projection onto the kth axis, or onto the plane or space spanned by the relevant axes:
$$ \cos^2(\mathrm{row}\ i, \mathrm{axis}\ k) = \frac{(s_k u_i^k)^2}{\sum_l (s_l u_i^l)^2}. $$
%Output resca$ind.ctr
       Axis 1  Axis 2  Axis 3  Total
Brown   0.838   0.152   0.010     1.
Blue    0.864   0.042   0.094     1.
Hazel   0.133   0.812   0.055     1.
Green   0.993   0.007   0.000     1.
In this case, of course, taking 3 axes results in a complete reconstruction of the table, so the cos² between the rows and this 3-space is 1.
On the other hand, the columns' cosines can also be provided; here the output gives:

%Output resca$var.ctr
         Axis 1  Axis 2  Axis 3  Total
Black    0.9670  0.031   0.002     1.
Brunette 0.9775  0.022   0.000     1.
Red      0.5424  0.336   0.121     1.
Blonde   0.1759  0.777   0.052     1.
This information can actually be incorporated into the graphic by making the size of each point's label proportional to how close the point is to the plane. This gives a 'perspective' type plot: points that are close to the plane are well represented and have high cosines. Thus the large letters label points that are close to the plane and are therefore "well represented".

Summing these cosines over the two axes of the representation, and using the resulting indices as the font size, enables one to see at a glance which rows and columns are important for the interpretation.
³ Sometimes called absolute contributions, as distinct from the cos², which are sometimes called relative contributions.
Complete Analysis of DNA-microarray data

These data are freely available thanks to Iyer et al. (1999) [9], who analyzed them in their Science article. One of the basic methods they used was hierarchical clustering, as very well explained by Carr et al. (1997) [2]. Here we present the results of the correspondence analysis alternative. Since this work was done, other implementations of correspondence analysis (and homogeneity analysis) have been published (see the software by Wong and Cui [10] and Fellenberg et al. [4]).
[Figure: Eigenvalues (scree plot) for the microarray data]

[Figure: Columns as centroids of the rows]
The 3-dimensional space spanned by the first three axes explains 78% of the departure from independence of the table. Here are two 3-dimensional views.
[Figure: two 3-D views of the correspondence analysis of the microarray data, with the columns (time points) plotted as centroids of the rows (genes). First view: the three first axes, Axis 1 (41%), Axis 2 (24%), Axis 3 (13%). Second view, rotated: the next axes, Axis 2 (24%), Axis 3 (13%) and Axis 4 (5.5%). Gene numbers and time-point labels (15MIN through 24HR, UNSYNC, and the control points such as 0HRC, 1HRC, 2HRC, 4HRC) are plotted at their coordinates.]
References

[1] J. P. BENZECRI, History and prehistory of data analysis. V: The analysis of correspondence, systematic index (in French), Cahiers de l'Analyse des Données, 2 (1977), pp. 9-40.

[2] D. CARR, R. SOMOGYI AND G. MICHAELS, Templates for Looking at Gene Expression Clustering, Statistical Computing and Graphics Newsletter, vol. 8 (1997), pp. 20-29.

[3] D. R. COX AND L. BRANDWOOD, On a discriminatory problem connected with the works of Plato, J. Roy. Statist. Soc. Ser. B, 21 (1959), pp. 195-200.

[4] K. FELLENBERG, N. C. HAUSER, B. BRORS, A. NEUTZNER, J. D. HOHEISEL, AND M. VINGRON, Correspondence analysis applied to microarray data, PNAS (2001), pp. 10781-10786.

[5] M. GREENACRE, Theory and Applications of Correspondence Analysis, Academic Press, (1984).

[6] G. K. LEDGER, Re-counting Plato, Oxford University Press, Oxford, (1989).

[7] K. MARDIA, J. KENT, AND J. BIBBY, Multivariate Analysis, Academic Press, (1979).

[8] R. D. SNEE, Graphical display of two-way contingency tables, The American Statistician, 28 (1974), pp. 9-12.

[9] V. R. IYER, M. EISEN, T. ROSS, T. DOUGLAS, G. SCHULER, T. MOORE, J. LEE, J. TRENT, L. STAUDT, J. HUDSON JR., M. BOGUSKI, D. LASHKARI, D. SHALON, D. BOTSTEIN AND P. BROWN, The Transcriptional Program in the Response of Human Fibroblasts to Serum, Science, 283 (1999), pp. 83-87.

[10] W. WONG AND Y. CUI, Gifi Array Analyzer: analyzing microarray data with Homogeneity Analysis, http://biowww.dfci.harvard.edu/˜ycui/Gifi.html
TEACHING STATISTICAL COMPUTING
Scenarios for Statistics
Susan Holmes, Statistics, Stanford
Statistical Computing for Artists
At Stanford this year Brad Efron and I decided to teach an introductory-level statistics class aimed at aesthetes. The campus daily immediately misnamed the class "Maths for fuzzies". We wanted to draw esthetically inclined students to a class that would enable them to read and understand statistics as it appears in the newspapers, and then illustrate what they had learned in class with an artistic medium.
Of course there was no calculus requirement, nor was any inclination for maths expected. Everything was done with words, but we ran the course curriculum backwards from the usual ordering, starting with the computation of permutation and bootstrap distributions rather than the Normal distribution.
Familiarity with the computer was acquired in stages:
Using the browser: The class notes were made available on a class web page (link to class web page).

Downloading and printing PDF files: Some of our students did not know how to download or view PDF files.

Running applets: There are many existing applets on the web for teaching probability; we used an array of them, from Buffon's needle to Monty Hall.

Using Flash to do some statistical analyses: The first permutation tests were executed using a simple press-button interface on data already entered into the program (interactive Flash example).

Making one simple Flash animation scene: This was explained in a lab run by Carrie Grimes, a valuable Teaching Assistant thanks to her prior experience with Flash. An animation used to explain paired comparisons can be accessed at:

Publishing Flash animations on the web: This involves publishing the html file that contains a pointer to the relevant swf files and uploading it to the web page. This was much more difficult for our students than we had predicted.

Making an interactive Flash animation: This was what the students wanted to learn most of all, the basis for being able to make computer-animated cartoons; this was their favorite lab!

Making a whole Flash movie with several scenes: This involved making complete storyboards containing the concepts and how they were illustrated, as well as joining the scenes together.

Writing Flash ActionScript programs to analyze data: Some more ambitious students actually did quite a bit of programming; it is possible to allow users to enter their own data and analyze it, as the bootstrap Flash animation written by Anita Lillie shows: Bootstrap of a Regression (2000 Elections).
Much of the experimental data was collected by the students themselves, for instance by testing their ESP using Michael Daniels' ESP tester.
This of course led to binomials, but we experimentedwith them using simulations more than formulas.
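The flavor of those simulations, transcribed into stdlib Python rather than Flash (the 25-trial, one-in-five card setup is our assumption about the ESP tester, purely for illustration):

```python
import random

# Simulate guessing 25 cards with a 1/5 chance each, many times over,
# to approximate the chance of scoring 10 or more hits by luck alone.
rng = random.Random(7)
sims = [sum(rng.random() < 0.2 for _ in range(25)) for _ in range(10000)]
p_value = sum(s >= 10 for s in sims) / len(sims)
# p_value is small; the exact binomial tail P(X >= 10) is about 0.017
```

The point of the course was exactly this: the students see the tail probability emerge from repetition, without ever writing the binomial formula.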
The first two-sample testing examples were paired tests done with permutations, and the manipulations that involved the computer were done using scripts in Flash ActionScript.
[Figure: Entering the data]
Instead of using the analytical formula, we generated many samples under the null hypothesis, as we did also for two-sample permutation tests; I called the permutations shuffling of the two samples.
There is a green shuffle button that just does one permutation. Then the difference has to be computed by clicking buttons; once the students understand the step-by-step process, a special automatic button enables them to do blocks of permutations at a time.
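The logic behind those buttons can be sketched in a few lines (stdlib Python here for readability; the function name and sample data are ours, not from the course):

```python
import random

def shuffle_test(x, y, n_shuffles=2000, seed=1):
    """Two-sample permutation test on the difference of means."""
    rng = random.Random(seed)
    observed = sum(x) / len(x) - sum(y) / len(y)
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                     # one press of the shuffle button
        xs, ys = pooled[:len(x)], pooled[len(x):]
        if abs(sum(xs) / len(xs) - sum(ys) / len(ys)) >= abs(observed):
            extreme += 1
    return extreme / n_shuffles                 # two-sided p-value

p = shuffle_test([12, 14, 15, 16], [8, 9, 10, 11])   # small p: groups differ
```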
As homework and final projects, we asked the students to write their own animations to illustrate some of the concepts covered in class:
Anita Lillie’s Histogram
Sara Flores’ Stem and leaf
Michael Areinoff’s Histogram
Meredith Hoy’s regression
Danny Jacobs’ Shakespeare Bootstrap
Interactive Flash on the web
First, the instructor can write very simple interactive programs to run sets of simulations. Our homework 3 consisted of the comparison of two samples of different sizes with a permutation test; the students simply had to run through the screens pushing a few buttons. This only required knowing how to link to a site on the web and updating the browser so that the correct plug-in was available. At the time, Flash 5 had only recently come out, and this was the version that made doing mathematics possible.
Flash Animations
The students spent 3 labs learning the basics of Flash animation. I can't say it was smooth for all of them: there is a lot involved in going from only knowing email to being able to publish an animation, which then has to be uploaded to their own website, a lot of small steps that aren't so easy for an audience of technophobes. However, the motivation of actually learning Flash and seeing their artwork dance across the screen was very strong.
Flash ActionScripts
Here are some of the possibilities that were added with version 5 of Flash and that have made mathematical functions possible:
Math.random, Math.floor, Math.round, Math.sin, Math.cos, Math.abs, etc.
To see an example of what a script actually looks like, see the examples at Sample for building a Pie chart and Sample ActionScript for a 3D rotator.
A summary of the experiment
None of these students wanted to use spreadsheets or statistical software, so Flash was an appealing alternative, especially because no other course on campus catered to this arty crowd. The audience was coaxed into using the computer to program because Flash was written originally for designers, with no prior programming experience necessary at all.
To see for yourselves an overview of the projects and homeworks, please visit our Student Sampler.
And from here?
Changing students' attitudes in an elementary statistics class from passive consumers to active creators can definitely change their experience. Having to make an animation that explains a concept means not only that the concept is understood, but also that students acquire the focus to extract the few important features that need to be shown in an animation. The artistic and creative components kept them interested even when the statistics seemed arduous.
I have noticed, since working in the United States, how education here is focused on personal work. Students thrive on giving their own take on problems and doing personal projects. The course was built to capitalize on this observation.
Maybe others will be interested in using Flash for teaching mathematics or statistics; I would be happy to provide information and advice.
References
[1] S. Bhangal (2000), Foundation Actionscript. Friends of ED, Birmingham, UK.
[2] M. Wooldridge (2001), Teach Yourself Visually Flash 5. Marangraphics Inc./Hungry Minds.
[3] B. Efron and S. Holmes (2000), Scenarios for Statistics: Stat 30. Webpage: http://www-stat.stanford.edu/~susan/courses/s30/
NEWS CLIPPINGS AND SECTION NOTICES
Pictures from the Joint Mixer at JSM 2001

Lots of attendees at the favorite JSM event of many conference goers: the Stat Computing and Graphics mixer!
Thomas Lumley and Debby Swayne being helped to choose prize winners by an unidentified young statistician.
John Chambers awarding the John M. Chambers Statistical Software Award to Dr. Alessandra Brazzale.
SECTION OFFICERS
Statistical Graphics Section - 2001

Deborah F. Swayne, Chair
(973) 360-8423
AT&T Labs - Research
[email protected]

Stephen G. Eick, Chair-Elect
(708) 713-5169
Visual Insights
[email protected]

Edward J. Wegman, Past-Chair
(703) 993-1680
George Mason University
[email protected]

David A. James, Program Chair
(908) 582-3082
Lucent Bell Laboratories
[email protected]

Mario Peruggia, Program Chair-Elect
(614) 292-0963
Ohio State University
[email protected]

Dianne Cook, Newsletter Editor
(515) 294-8865
Iowa State University
[email protected]

Thomas Lumley, Secretary/Treasurer
(206) 543-1044
University of Washington
[email protected]

Graham Wills, Publications/Electronic Communication
(312) 651-3671
SPSS
[email protected]

Bradley A. Jones, Council of Sections Rep. (99-01)
SAS
[email protected]

Charles B. Roosen, Council of Sections Rep. (01-03)
MathSoft, Inc.
[email protected]
Statistical Computing Section - 2001

Mark Hansen, Chair
(908) 582-3869
Bell Laboratories
[email protected]

Susan Holmes, Chair-Elect
Stanford University
[email protected]

Russell D. Wolfinger, Past-Chair
(919) 677-8000
SAS
[email protected]

Douglas W. Nychka, Program Chair
(919) 737-2534
NCAR, Geophysical Statistics Project
[email protected]

Tim Hesterberg, Program Chair-Elect
(206) 283-8802 x319
MathSoft
[email protected]

Susan Holmes, Newsletter Editor (00-02)
(650) 725-1925
Stanford University
[email protected]

Merlise Clyde, Secretary/Treasurer (00-02)
(919) 681-8440
Duke University
[email protected]

John F. Monahan, Publications Liaison Officer
(919) 737-2541
North Carolina State University
[email protected]

Thomas F. Devlin, Electronic Communication Liaison
(973) 655-7244
Montclair State University
[email protected]

Lionel Galway, Awards Officer
(310) 393-0411, ext. 7438
RAND
[email protected]

Ranjan Maitra, Education Liaison
(410) 445-2436
University of Maryland
[email protected]

John J. Miller, Education Liaison
(703) 993-1690
George Mason University
[email protected]

Leland Wilkinson, Council of Sections Rep. (99-01)
(312) 651-3270
SPSS
[email protected]

David M. Allen, Council of Sections Rep. (00-02)
(606) 257-6901
University of Kentucky
[email protected]

Elizabeth Slate, Council of Sections Rep. (00-02)
(607) 255-9148
Cornell University
[email protected]
INSIDE
A WORD FROM OUR CHAIRS . . . 1
    Statistical Computing . . . 1
SPECIAL FEATURE ARTICLE . . . 1
Editorial . . . 2
FROM OUR CHAIRS (Cont.) . . . 3
    Statistical Computing . . . 3
FROM OUR CHAIRS (Cont.) . . . 4
    Statistical Graphics . . . 4
SOFTWARE PACKAGES . . . 5
TOPICS IN STATISTICAL COMPUTING . . . 15
TEACHING STATISTICAL COMPUTING . . . 25
NEWS CLIPPINGS AND SECTION NOTICES . . . 28
SECTION OFFICERS . . . 29
INSIDE . . . 30
The Statistical Computing & Statistical Graphics Newsletter is a publication of the Statistical Computing and Statistical Graphics Sections of the ASA. All communications regarding this publication should be addressed to:
Susan Holmes
Editor, Statistical Computing Section
Stanford University
Stanford, CA 94305
(650) 725-1925 • FAX: (650) 725
[email protected]
http://www-stat.stanford.edu/~susan
Dianne Cook
Editor, Statistical Graphics Section
Department of Statistics
Iowa State University
Ames, IA 50011-1210
(515) 294-8865 • FAX: (515) 294
[email protected]/~dicook
All communications regarding ASA membership andthe Statistical Computing or Statistical Graphics Sec-tions, including change of address, should be sent to:
American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402 USA
(703) 684-1221 • FAX: (703) 684
[email protected]
Nonprofit Organization
U.S. POSTAGE PAID
Permit No. 50
Summit, NJ 07901
American Statistical Association
1429 Duke Street
Alexandria, VA 22314-3402 USA
This publication is available in alternative media on request.