1
DiscoverySpace A tool for gene expression analysis and biological discovery Richard Varhol, Neil Robertson, Mehrdad Oveisi-Fordorei, Chris Fjell, Derek Leung, Asim Siddiqui, Marco Marra, Steve Jones DiscoverySpace DiscoveryDB Further analysis Acknowledgments Biological data is complex, dynamic and is generated at a phenomenal rate. To obtain meaningful results from such data sets, familiarity is needed not only with the types of data available, but also with how these data sets are structured and how they are related. Backed by a relational database system, DiscoverySpace is able to query data from over 25 publicly available data sources, as well as internal experimental results. DiscoveryDB currently contains data about genes, proteins, pathways, functional genetic components and biological ontologies, all of which are accessible through DiscoverySpace. The collection of available data sources can easily be expanded. While some resources publish data in relational formats, others use proprietary flat-file formats that require the development of custom parsers. However, once import mechanisms have been developed for a specific data source, the database can be automatically updated based on a schedule specified by the administrator. DiscoverySpace is a graphical software application that eliminates the need for a detailed understanding of the underlying data structures, allowing the researcher to focus better on experimental results and implications. Originally designed to support SAGE gene expression analysis, DiscoverySpace is targeted at the biologist who wishes to utilize high-throughput bioinformatic techniques without the complication of scripting languages and database queries. The image above shows a view of the DiscoverySpace desktop and the various tools available within the application. DiscoverySpace provides an intuitive query builder for defining subsets of the available data, which can then be related to associated datasets in a familiar spreadsheet-like interface (the Explorer). The application provides statistical analysis tools (Scatter Plot, Venn Table, Tag Search), which allow the user to compare large sets of data and elicit similarities and patterns. Depending upon available annotations, datasets can be also interpreted with regards to their associated Gene Ontology Terms using DiscoverySpace’s GO Analysis Tool. The image above represents the publicly available data sources resident within DiscoveryDB and the linkages between them. Most major resources cross-reference one another, facilitating queries across the knowledge space. In this image above one can see that SwissProt is very heavily referenced by other resources. In addition analytical linkages can be generated by performing post-processing of primary data. These large scale computational analyses, for instance BLAST and HMM, are stored in the database to allow for rapid access. DiscoveryDB provides specific support for SAGE (Serial Analysis of Gene Expression) gene mapping operations. As gene sequence data, such as Refseq and MGC, is imported into DiscoveryDB, the sequence data is parsed and “Virtual Tags” are extracted. These Virtual Tags can then quickly be linked to the Tag Sequences from experimental SAGE libraries. DiscoveryDB also houses the CMOST database which compiles the results of extensive data processing to increase the accuracy and reliability of tag to gene mapping. 3. Sub-Section Labels Analysis using DiscoverySpace To begin a SAGE experiment we need to select SAGE Libraries for further analysis. For the purposes of this experiment we will use public data from CGAP (Cancer Genome Anatomy Project), which publishes over 280 SAGE libraries. To narrow the field of available libraries, we search for libraries with the specific qualities we are interested in; Long SAGE libraries from the species Homo Sapiens and from Mammary Gland tissue. This is done by defining a Query (below) to restrict the set of CGAP Libraries to the subset of interest. The figure to the left shows a DiscoverySpace Query. Each green arrow represents the location of a constraint, the details of which can be seen in the lower section of the panel. e.g. The Name property is restricted so that we view only those libraries with names beginning with "LSAGE" (Long SAGE). The figure above shows the results of the Query defined. This query has narrowed the 280+ CGAP libraries to just 14 of interest. Now that we have identified two SAGE libraries of interest, one benign and one cancer, we need to define further queries to capture the tags from those libraries. Once these two sets are defined and loaded they can be compared using DiscoverySpace's inbuilt "Scatter Plot" tool. The Scatter Plot uses the Audic-Claverie algorithm to detect statistical similarities and differences between two sets of Tag Sequences. The Scatter Plot provides an interactive visualization that allows the user to select further subsets. In this example we will isolate only those Tag Sequences which are highly expressed in the cancer library. The Query tool provides an intuitive interface for joining and constraining across multiple data sources. Here we create two Queries, one for each desired tag set. In the Query above, we constrain the CGAP SAGE Library datatype to specify only library "LSAGE_Breast_fibroadenoma_MD". We have moved the endpoint of the Query to capture the Tag Sequences of the Experimental Tags of the specified Library. Now that we have isolated the set of upregulated Tag Sequences we need to map them to counterpart genes. The figure above is a screenshot from the DiscoverySpace Explorer which depicts a SAGE mapping operation. In the main Table View (right) one can see our set of Tag Sequences (light green). The numbers in the left hand gutter indicate the expression value of each of the tags. The Tag Sequences are linked to the Refseq Virtual Tags (darker green), which are themselves linked to Refseq genes (blue). The Tree View (left), which reflects the structure of the Table View, is used to explore the data model and to add and remove columns of interest. The hatched rows in the table mark Tag Sequences for which no mapping can be found. The data from the table can easily be exported to other tools (MS Excel) for additional analysis. The figure above shows an image from the Scatter Plot. Each point represents a particular Tag Sequence and its location indicates the relative expression of the given tag. The points in blue represent those tags that are similarly expressed. The points coloured red are those tags that are upregulated in the cancer library. These Tags have been selected and will now be isolated into a further set for continued analysis. www.bcgsc.ca/discoveryspace The image above is a screenshot from the DiscoverySpace Gene Ontology (GO) Analyzer. In this example we have isolated those Refseq genes which have unambiguously mapped to Tag Sequences from the benign library. It is important to use unambiguous mappings so that we can accurately preserve expression statistics. The Refseq gene records are then viewed in the GO Analyzer. Each gene is mapped to its associated GO Terms and those Terms are then scored with relation to the whole expression set. In this image we are looking at the level 3 GO Terms related to our gene set. One can see, for example, that around 48% of expressed genes are related to the Term "metabolism". In another screenshot from the Explorer (above) our set of Tag Sequences is mapped to only those Refseq genes which are predicted to produce to extracellular proteins. One can see that Refseq NM_004994 is given a 78% chance of producing extracellular proteins. We also view the Gene Ontology (GO) Terms associated with each gene. The one-to-many relationship to GO Terms is represented in the Explorer using "expansion points" which can be collapsed and expanded. Two expansion points in this image are expanded, with the others collapsed. The greyed out sections signify the repeated data that is projected in such an expansion. In this screenshot Refseq NM_004994 is related to seven GO terms including "extracellular space". In the screenshot from the Venn Table (above) our two cancer libraries are compared against other breast cancer libraries from CGAP. The Venn Table allows the user to perform bulk statistical analysis upon multisets of Tag Sequences to detect patterns of expression. Expression can be interpreted using a number of formulae, including percentage, ratio and p-value. In the screenshot we are excluding those Tag Sequences present in the benign library and we are including only those Tag Sequences which are present in all cancer libraries. By applying this constraint and by removing singleton tags, we have reduced our set of interest to only 137 Tag Sequences. funding Genome Canada, Genome BC, BC Cancer Agency thanks Scott Zuyderduyn references • Audic, S. and Claverie, J.-M. (1997) The significance of digital gene expression profiles. Genome Res., 7, 986--995. • Velculescu VE, Zhang L, Vogelstein B, and Kinzler KW (1995). Serial Analysis Of Gene Expression. Science 270, 484-487 The Cancer Genome Anatomy Project: http://cgap.nci.nih.gov/

DiscoverySpace A tool for gene expression analysis and biological discovery Richard Varhol, Neil Robertson, Mehrdad Oveisi-Fordorei, Chris Fjell, Derek

Embed Size (px)

Citation preview

Page 1: DiscoverySpace A tool for gene expression analysis and biological discovery Richard Varhol, Neil Robertson, Mehrdad Oveisi-Fordorei, Chris Fjell, Derek

DiscoverySpaceA tool for gene expression analysis and biological discovery

Richard Varhol, Neil Robertson, Mehrdad Oveisi-Fordorei, Chris Fjell, Derek Leung, Asim Siddiqui, Marco Marra, Steve Jones

DiscoverySpace

DiscoveryDB

Further analysis

Acknowledgments

Biological data is complex, dynamic and is generated at a phenomenal rate. To obtain meaningful results from such data sets, familiarity is needed not only with the types of data available, but also with how these data sets are structured and how they are related.

Backed by a relational database system, DiscoverySpace is able to query data from over 25 publicly available data sources, as well as internal experimental results. DiscoveryDB currently contains data about genes, proteins, pathways, functional genetic components and biological ontologies, all of which are accessible through DiscoverySpace.

The collection of available data sources can easily be expanded. While some resources publish data in relational formats, others use proprietary flat-file formats that require the development of custom parsers. However, once import mechanisms have been developed for a specific data source, the database can be automatically updated based on a schedule specified by the administrator.

DiscoverySpace is a graphical software application that eliminates the need for a detailed understanding of the underlying data structures, allowing the researcher to focus better on experimental results and implications. Originally designed to support SAGE gene expression analysis, DiscoverySpace is targeted at the biologist who wishes to utilize high-throughput bioinformatic techniques without the complication of scripting languages and database queries.

The image above shows a view of the DiscoverySpace desktop and the various tools available within the application. DiscoverySpace provides an intuitive query builder for defining subsets of the available data, which can then be related to associated datasets in a familiar spreadsheet-like interface (the Explorer). The application provides statistical analysis tools (Scatter Plot, Venn Table, Tag Search), which allow the user to compare large sets of data and elicit similarities and patterns. Depending upon available annotations, datasets can be also interpreted with regards to their associated Gene Ontology Terms using DiscoverySpace’s GO Analysis Tool.

The image above represents the publicly available data sources resident within DiscoveryDB and the linkages between them. Most major resources cross-reference one another, facilitating queries across the knowledge space. In this image above one can see that SwissProt is very heavily referenced by other resources. In addition analytical linkages can be generated by performing post-processing of primary data. These large scale computational analyses, for instance BLAST and HMM, are stored in the database to allow for rapid access.

DiscoveryDB provides specific support for SAGE (Serial Analysis of Gene Expression) gene mapping operations. As gene sequence data, such as Refseq and MGC, is imported into DiscoveryDB, the sequence data is parsed and “Virtual Tags” are extracted. These Virtual Tags can then quickly be linked to the Tag Sequences from experimental SAGE libraries. DiscoveryDB also houses the CMOST database which compiles the results of extensive data processing to increase the accuracy and reliability of tag to gene mapping.

3. Sub-Section Labels

Analysis using DiscoverySpace

To begin a SAGE experiment we need to select SAGE Libraries for further analysis. For the purposes of this experiment we will use public data from CGAP (Cancer Genome Anatomy Project), which publishes over 280 SAGE libraries. To narrow the field of available libraries, we search for libraries with the specific qualities we are interested in; Long SAGE libraries from the species Homo Sapiens and from Mammary Gland tissue. This is done by defining a Query (below) to restrict the set of CGAP Libraries to the subset of interest.

The figure to the left shows a DiscoverySpace Query. Each green arrow represents the location of a constraint, the details of which can be seen in the lower section of the panel. e.g. The Name property is restricted so that we view only those libraries with names beginning with "LSAGE" (Long SAGE). The figure above shows the results of the Query defined. This query has narrowed the 280+ CGAP libraries to just 14 of interest.

Now that we have identified two SAGE libraries of interest, one benign and one cancer, we need to define further queries to capture the tags from those libraries. Once these two sets are defined and loaded they can be compared using DiscoverySpace's inbuilt "Scatter Plot" tool. The Scatter Plot uses the Audic-Claverie algorithm to detect statistical similarities and differences between two sets of Tag Sequences. The Scatter Plot provides an interactive visualization that allows the user to select further subsets. In this example we will isolate only those Tag Sequences which are highly expressed in the cancer library.

The Query tool provides an intuitive interface for joining and constraining across multiple data sources. Here we create two Queries, one for each desired tag set. In the Query above, we constrain the CGAP SAGE Library datatype to specify only library "LSAGE_Breast_fibroadenoma_MD". We have moved the endpoint of the Query to capture the Tag Sequences of the Experimental Tags of the specified Library.

Now that we have isolated the set of upregulated Tag Sequences we need to map them to counterpart genes. The figure above is a screenshot from the DiscoverySpace Explorer which depicts a SAGE mapping operation. In the main Table View (right) one can see our set of Tag Sequences (light green). The numbers in the left hand gutter indicate the expression value of each of the tags. The Tag Sequences are linked to the Refseq Virtual Tags (darker green), which are themselves linked to Refseq genes (blue). The Tree View (left), which reflects the structure of the Table View, is used to explore the data model and to add and remove columns of interest. The hatched rows in the table mark Tag Sequences for which no mapping can be found. The data from the table can easily be exported to other tools (MS Excel) for additional analysis.

The figure above shows an image from the Scatter Plot. Each point represents a particular Tag Sequence and its location indicates the relative expression of the given tag. The points in blue represent those tags that are similarly expressed. The points coloured red are those tags that are upregulated in the cancer library. These Tags have been selected and will now be isolated into a further set for continued analysis.

www.bcgsc.ca/discoveryspace

The image above is a screenshot from the DiscoverySpace Gene Ontology (GO) Analyzer. In this example we have isolated those Refseq genes which have unambiguously mapped to Tag Sequences from the benign library. It is important to use unambiguous mappings so that we can accurately preserve expression statistics. The Refseq gene records are then viewed in the GO Analyzer. Each gene is mapped to its associated GO Terms and those Terms are then scored with relation to the whole expression set. In this image we are looking at the level 3 GO Terms related to our gene set. One can see, for example, that around 48% of expressed genes are related to the Term "metabolism".

In another screenshot from the Explorer (above) our set of Tag Sequences is mapped to only those Refseq genes which are predicted to produce to extracellular proteins. One can see that Refseq NM_004994 is given a 78% chance of producing extracellular proteins. We also view the Gene Ontology (GO) Terms associated with each gene. The one-to-many relationship to GO Terms is represented in the Explorer using "expansion points" which can be collapsed and expanded. Two expansion points in this image are expanded, with the others collapsed. The greyed out sections signify the repeated data that is projected in such an expansion. In this screenshot Refseq NM_004994 is related to seven GO terms including "extracellular space".

In the screenshot from the Venn Table (above) our two cancer libraries are compared against other breast cancer libraries from CGAP. The Venn Table allows the user to perform bulk statistical analysis upon multisets of Tag Sequences to detect patterns of expression. Expression can be interpreted using a number of formulae, including percentage, ratio and p-value. In the screenshot we are excluding those Tag Sequences present in the benign library and we are including only those Tag Sequences which are present in all cancer libraries. By applying this constraint and by removing singleton tags, we have reduced our set of interest to only 137 Tag Sequences.

funding Genome Canada, Genome BC, BC Cancer Agency

thanks Scott Zuyderduyn

references • Audic, S. and Claverie, J.-M. (1997) The significance of digital gene expression profiles. Genome Res., 7, 986--995.

• Velculescu VE, Zhang L, Vogelstein B, and Kinzler KW (1995). Serial Analysis Of Gene Expression. Science 270, 484-487

• The Cancer Genome Anatomy Project: http://cgap.nci.nih.gov/