Upload
allan-ryan
View
212
Download
0
Embed Size (px)
Citation preview
Spreadsheet structure inspection using low level access and
visualisationDaniel Ballinger, Robert Biddle and James Noble
School of Mathematical and Computing SciencesEmail: {db, robert, kjx}@mcs.vuw.ac.nz
http://www.mcs.vuw.ac.nz/~db/honours.html
Motivation
Spreadsheets are a common form of end-user programming.
Unfamiliar spreadsheets can contain daunting amounts of information in the layout and inter-cell dependencies.
Methods for studying these structures are usually limited to what the application provides.
It is difficult to get a global sense of the structure of an individual formula that may have dependencies spread out all over the spreadsheet table. Users have to track down individual cell dependencies one by one, tacking back and fourth all over the spreadsheet.
- Bonnie Nardi, A Small Matter of Programming (1993)
Our Proposed Solution
Working outside the spreadsheet application allows for greater flexibility in addressing issues.
The flexibility allows for visualisations to aid in end-user understanding of spreadsheets beyond what the application is capable of.
A set of visualisations as interface enhancements that allow the user to progress from an abstract levels towards actual details present.
The greater flexibility is a trade-off with direct interaction.
We focused on Microsoft Excel due to its large market share.
Spreadsheet elements requiring extraction
Artefacts of interest are derived from low-level structures.
The basic unit of interest is any occupied cell. Each occupied cell will have a value, and
optionally a formula. Formula should be in the same format shown to
the user (not RPN). Building our own BIFF reader would be a sizeable
project in itself. Hence we use third party software.
IBM alphaWorks ExcelAccessor
A Java Bean to access and modify to contents of spreadsheets using a Windows DLL.
Requires using Windows native code and Excel to be installed. Limiting portability.
Excellent ability to extract all details correctly.
Prone to irregular crashes on larger corpus tasks.
Andy Khan’s JExcelAPI
Pure Java integrates better with toolkit. Open source allows for easier expansion
and bug fixes. Some element types produce problems with
extraction. E.g. Array functions, intersections, absolute references becoming relative, and earlier BIFF formats.
Spreadsheet Application Toolkit
Find and store spreadsheets from the Internet.Extract low level structures. E.g. Cell values and formulas.Analyse spreadsheet structures. Either individual or corpus.Conveying the findings through visualisation.
Query
URLs
XLS files
Algorithms
Image
Gobbler Google
Fetcher
Analyser
Extractor
Visualisation Tools
Web Servers
Toolkit Files
BIFF Reader
Processed Data
Metrics
Aspects of spreadsheet structure and use
The spreadsheet paradigm has two main characteristics:– The spatial relationships between cells– The logical relationships created by formula
These characteristics are not always disjoint.
User problems are mapped onto a 2D table that shields them from low-level details of programming and allows for more natural expression of many problems.
Spreadsheet layout – Real-estate Utilisation 2D
Understanding layout is an important first step in learning about a new spreadsheet.Actual values and formulas are only shown as occupied cells.The visualisation layout mimics that of Excel, with columns along the top of the x-axis and rows running down the y-axis.Cells with a higher occupancy level are coloured towards the red end of the colour spectrum.
Spreadsheet layout – Real-estate Utilisation 3D
Occupancy data is projected into 3D to create a surface map.Discrete to continuous data transformation helps smooth the effects of spikes.Coloured to give a Topographical terrain effect.Full benefit is seen with user interaction.
Formula Inspection – Data Flow
Visualising formula components that are extracted.Fully trace worksheets in one view.Single Cell, Range, Union, and Intersection.
Basic Referencing Components
Visualising formula in a fairly common summation example
The relative complexity of the bottom-right formula is clear from the larger circle.
Formula in a summation example
Igarashi’s Static Global View
Formula Inspection - Dependency Types
Excel allows for combinations of relative and absolute positioning.Understanding the referencing type is important when replicating formula and identifying regular patterns.
Relative (default)
Row Absolute
Fully Absolute
Column Absolute
Related Work
Takeo Igarashi– Spreadsheets augment “a visible tabular layout with invisible formulas”.– Created visualisations to help reveal the hidden dataflow graphs and
superficial tabular layouts of spreadsheets. Markus Clermont
– Most end-users are not trained programmers.– Many spreadsheets exist beyond being simple scratch pads.
Raymond Panko– Studies of empirical data into spreadsheet errors.– Found error rates can be disturbingly high.– Errors attributed to over confidence and lack of formal checking.
Margaret Burnett– The importance of scalability in visualisations.– Forms/3 and an embedded testing methodology.
Summary and Future Work
We created a Java toolkit to extract artefacts from spreadsheets and then convert the basic information into visualisations.
These visualisations are used to augment the information provided by Excel in helping users understand spreadsheets.
Future work will include detailed user studies and corpus analysis to find larger patterns.
We must also address visualisation scalability for larger, more complex, spreadsheets.
http://www.mcs.vuw.ac.nz/~db/honours.html