Running head

Visualized Comparative Genomics Analytical Software

Corresponding author

Name: Xiangfeng Wang

Address: School of Plant Science, University of Arizona, Tucson, AZ, 85721, USA.

Tel: 520-626-4184.

E-mail: [email protected]

Research category

Bioinformatics

Keywords

Bioinformatics, Comparative Genomics, Java, Brassicaceae, Karyotype Visualization,

Synteny Visualization

Plant Physiology Preview. Published on July 29, 2013, as DOI:10.1104/pp.113.219444

Copyright 2013 by the American Society of Plant Biologists

https://plantphysiol.org Downloaded on December 17, 2020. - Published by Copyright (c) 2020 American Society of Plant Biologists. All rights reserved.

CrusView: a Java-based visualization platform for comparative genomics analyses

in Brassicaceae

Hao Chen1 and Xiangfeng Wang1*

1School of Plant Sciences, University of Arizona, Tucson, Arizona, 85721, United States

* To whom correspondence should be addressed. Tel: (001) 520-626-4184

Email: [email protected]

FOOTNOTES

*Corresponding author: Xiangfeng Wang, e-mail: [email protected].

ABSTRACT

In plants and animals, chromosomal breakage and fusion events based on conserved

syntenic genomic blocks lead to conserved patterns of karyotype evolution among

species of the same family. However, karyotype information has not been well utilized in

genomic comparison studies. We present CrusView, a Java-based bioinformatic

application utilizing SWT/SWING graphics libraries and a SQLite database for

performing visualized analyses of comparative genomics data in Brassicaceae (Crucifer)

plants. Compared to similar software and databases, one of the unique features of

CrusView is its integration of karyotype information when comparing two genomes. This

feature allows users to perform karyotype-based genome assembly and karyotype-

assisted genome synteny analyses with preset karyotype patterns of the Brassicaceae

genomes. Additionally, CrusView is a local program, which gives its users high flexibility

when analyzing unpublished genomes and allows the users to upload self-defined

genomic information so that they can visually study the associations between genome

structural variations and genetic elements, including chromosomal rearrangements,

genomic macrosynteny, gene families, high-frequency recombination sites, and tandem

and segmental duplications between related species. This tool will greatly facilitate

karyotype, chromosome and genome evolution studies using visualized comparative

genomics approaches in Brassicaceae. CrusView is freely available at

http://www.cmbb.arizona.edu/CrusView/.

INTRODUCTION

The Brassicaceae (crucifer) plant family contains more than 3,700 species, including the

model plant organism Arabidopsis thaliana; economically important crop species, such

as Brassica rapa and Brassica napus; and close relatives of A. thaliana used in abiotic

stress research, such as Eutrema salsugineum and Schrenkiella parvula. Because

Brassicaceae plants have high scientific and economic importance, several whole-

genome sequencing projects of the species in this family have been recently launched

(http://www.brassica.info). Moreover, Brassicaceae is also a good system for population

genomics. The 1001 Arabidopsis Genomes Project (http://www.1001genomes.org/) plans

to generate complete genome sequences for 1001 A. thaliana strains to study the

associations between genetic variation and phenotypic diversity. The VEGI (Value-

directed Evolutionary Genomics Initiative) project aims to understand the genome

evolution of Brassicaceae by sequencing several close relatives of A. thaliana, such as

Arabidopsis lyrata and Capsella rubella. Recent advances in high-throughput sequencing

(HTS) technology have greatly expedited these whole-genome sequencing projects of

versatile non-model organisms. Although increasingly longer reads can now be produced

from HTS experiments, de novo assembler tools can only generate contig and/or scaffold

sequences from HTS reads. These tools cannot generate complete chromosome sequences

without genetic and/or physical maps that typically require years to create. This limitation

makes chromosome-scale structural variation (i.e., translocation, inversion, deletion and

insertion, and segmental and tandem duplication) and genomic macro-synteny analyses

difficult to perform.

In both plants and animals, genomes of species within the same family have

evolved with conserved karyotype patterns due to the rearrangements of large

chromosomal segments. Chromosomal karyotypes can be obtained from comparative

chromosomal painting (CCP) experiments by performing in situ hybridization

experiments on BAC sequences between related species. The genome of each

Brassicaceae member is composed of 24 conserved genomic blocks that have been

considered as the basic units of chromosomal rearrangement during genome evolution

(Lysak et al., 2006). The sizes of these conserved blocks range from several to dozens of

megabases. Karyotypes profiled by CCP experiments are currently available for approximately

twenty Brassicaceae species, including

Arabidopsis thaliana (n=5), Hornungia alpina (n=6), Eutremeae (n=7), Arabidopsis

lyrata (n=8), Brassica rapa (n=10), and Polyctenium fremontii (n=14). By utilizing

the karyotype information in Brassicaceae, we have developed a tool – KGBassembler

(karyotype-based genome assembler for Brassicaceae species) – to finalize the assembly of

chromosomes from scaffolds/contigs without relying on a genetic/physical map (Ma et al.,

2012).

Over the past 2 years, complete whole-genome sequences of several Brassicaceae

species have been released, including the aforementioned A. lyrata, S. parvula, B. rapa,

and E. salsugineum (Dassanayake et al., 2011; Hu et al., 2011; Wang et al., 2011; Wright and

Agren, 2011; Wu et al., 2012; Yang et al., 2013). These genomic resources have opened a new

era of comparative genomics in Brassicaceae, enabling a better understanding of genome evolution

(Cheng et al., 2012). Numerous tools and databases are available for performing comparative

genomics analysis in plants. CoGe is a comparative genomics analysis platform that is

now a part of the iPlant Collaborative Project (Jorgensen et al., 2008). The CoGe database

currently includes nearly 2,000 genome sequences of approximately 1,500 organisms,

allowing users to perform online visual analyses of genome synteny and duplication

events (Tang and Lyons, 2012). PLAZA and Vista are also web-based databases that

provide comparative analysis services on the genomic data deposited in the databases

(Frazer et al., 2004; Van Bel et al., 2012). Other stand-alone bioinformatic applications for

comparative genomic analysis, such as Easyfig and genoPlotR, are commonly used to

generate synteny plots of given genome segments at a scale ranging from one single gene

to one chromosome (Guy et al., 2010; Sullivan et al., 2011).

In this work, we present a Java-based bioinformatic application – CrusView – for

performing visualized analyses of genome synteny and karyotype evolution in

Brassicaceae. CrusView features a user-friendly graphical user interface (GUI)

implemented with SWT/SWING graphics libraries and a SQLite database used to

manage local genomic data. Compared to the most commonly used tools in comparative

genomics, one of the unique features of CrusView is that available karyotype data of a

Brassicaceae species is incorporated to facilitate karyotype-based chromosome assembly

and analyses of chromosomal structural evolution. Compared to web-based tools, the

stand-alone CrusView tool was also designed to give users higher flexibility in analyzing

currently unpublished genome data and integrating self-defined genomic information

based on the users’ interests, such as gene families, gene duplications, chromosomal

breakpoints, gene ontology (GO) terms, and groups of orthologs/paralogs, with the

genomic synteny maps. In addition, CrusView can generate images representing genomic

synteny between two compared genomes in PNG/SVG/PDF high-resolution formats that

are suitable for publication.

RESULTS

To demonstrate the basic functionality of CrusView, we prepared two example genomes

and related datasets from Arabidopsis thaliana (n=5) and Eutrema salsugineum (n=7) to

perform visualized comparative genomics analyses. E. salsugineum (also known as salt

cress and Thellungiella halophila) is a halophytic relative of A. thaliana; it inhabits the

seashore saline soils of eastern China. Because E. salsugineum and A. thaliana share

similar life cycles, morphological characters and genetic composition, E. salsugineum has

been widely used in plant salt-tolerance studies using the genetic systems and molecular

tools previously established in A. thaliana. The E. salsugineum genome (243 Mb)

contains seven chromosomes and approximately 24,000 protein-coding genes (Yang et al.,

2013). The karyotype maps derived from comparative chromosomal painting (CCP)

experiments of both E. salsugineum and A. thaliana are currently available (Lysak et al.,

2006). We used these two genomes to demonstrate the karyotype-based genome assembly

of the E. salsugineum chromosomes and the comparative analyses of E. salsugineum and

A. thaliana with integrated karyotype information.

Overview of the functional panels in CrusView

CrusView can be launched via web-start at http://www.cmbb.arizona.edu/crusview. The

navigation panel includes quick buttons that perform basic operations in CrusView. The

published karyotypes of 20 Brassicaceae species have been integrated into CrusView,

and they are shown in the left “karyotype” panel. We will continue to collect published

karyotypes generated from CCP experiments. Each time CrusView is launched, the

program will automatically query the CrusView server to update the local karyotype

database. Genomic data files from E. salsugineum and A. thaliana can be imported into

the SQLite database to run a demonstration for users who run CrusView for the first time.

The primary visualization window shows the seven chromosomes of the primary E.

salsugineum genome (Figure 1). The protein-coding genes of E. salsugineum are

designated with the corresponding colors based on the conserved genomic blocks in

which they are located. The upper-right panel shows the color schemes and the letter

labels for the 24 genomic blocks (A to X), while the lower-right panel shows the five

chromosomes of the secondary A. thaliana genome (Figure 1). The information window

displays the genomic annotations of the genes in the primary genome recorded in the

BED file, including the gene IDs, chromosomal locations, genomic block IDs,

orthologous group IDs, sequence similarities with the homologs in the secondary genome,

gene functional descriptions and other user-defined information (Figure 1). Users can

switch the primary and secondary genomes, zoom in/out of the chromosome images,

query genes of interest, and invoke a chromosome-level comparison window

using the quick buttons in the navigation panel.
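The annotation fields listed above map naturally onto one table in a local SQLite database. The following Python sketch illustrates the idea only; it is not CrusView's actual schema or file layout. The column order of the extended BED record, the table name `genes`, and the example records (gene IDs, coordinates, identities) are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical column layout for an extended BED record (an assumption,
# not the documented CrusView format).
FIELDS = ("chrom", "start_pos", "end_pos", "gene_id", "block_id",
          "ortholog_group", "identity", "description")


def parse_bed_line(line):
    """Split one tab-delimited record into a dict with typed values."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} columns, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    record["start_pos"] = int(record["start_pos"])
    record["end_pos"] = int(record["end_pos"])
    record["identity"] = float(record["identity"])
    return record


def load_genes(lines):
    """Load records into an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genes (chrom TEXT, start_pos INTEGER, end_pos INTEGER, "
        "gene_id TEXT PRIMARY KEY, block_id TEXT, ortholog_group TEXT, "
        "identity REAL, description TEXT)"
    )
    rows = []
    for line in lines:
        if line.strip():
            record = parse_bed_line(line)
            rows.append(tuple(record[name] for name in FIELDS))
    conn.executemany("INSERT INTO genes VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def genes_in_block(conn, block_id):
    """Return gene IDs assigned to one genomic block, ordered by position."""
    cursor = conn.execute(
        "SELECT gene_id FROM genes WHERE block_id = ? ORDER BY chrom, start_pos",
        (block_id,),
    )
    return [row[0] for row in cursor]


# Made-up records for illustration only.
EXAMPLE = [
    "Chr4\t1000\t2500\tgeneA\tI\tOG_1\t82.5\texample kinase",
    "Chr4\t9000\t9900\tgeneB\tJ\tOG_2\t64.0\texample transporter",
    "Chr4\t4000\t5200\tgeneC\tI\tOG_3\t91.2\texample F-box protein",
]
```

Calling `genes_in_block(load_genes(EXAMPLE), "I")` returns the genes of block I in chromosomal order, which is the kind of lookup the information window performs when a region is selected.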

Visualized karyotype comparison between E. salsugineum and A. thaliana

One of the unique functions of CrusView is that it can generate the digital karyotype of a

genome, allowing users to visually compare the chromosomal karyotypes of the primary

and secondary genomes. The Arabidopsis lyrata (n=8) genome represents an ancestral

karyotype in the Brassicaceae family in which each member’s genome is composed of 24

conserved genomic blocks according to the karyotype analyses of several representative

species in the family using CCP experiments (Lysak et al., 2006). Each conserved genomic

block is a large chromosomal segment that can be represented by a group of A. thaliana

genes in synteny with their orthologs in the genomes of other Brassicaceae species. Thus,

the A. thaliana genes can be used as markers to infer the assignment of the 24 conserved

genomic blocks to another species’ genome in Brassicaceae (Lysak et al., 2006; Yang et al.,

2013). Our previously developed software program KGBassembler includes a pipeline to

assign the genes in a Brassicaceae genome to the 24 conserved genome blocks with a

color scheme and a letter label (A to X) based on the homology with A. thaliana genes

(Ma et al., 2012). Here, we elucidate this procedure using E. salsugineum as a newly

sequenced genome based on three basic steps: first, the A. thaliana amino acid sequences

were mapped to the E. salsugineum scaffold sequences using BLAST, followed by the

selection of the best aligned locations; second, the A. thaliana genes mapped onto the E.

salsugineum scaffolds were used to infer the conserved genomic blocks, followed by the

assignment of the color schemes and letter labels of the 24 blocks to the E. salsugineum

genes; and third, pseudo-chromosome sequences were generated based on the CCP-

derived (n=7) karyotype of E. salsugineum. This pipeline was integrated into CrusView

and can be applied to any newly sequenced Brassicaceae genome to perform karyotype-

based genome assembly and generate digital karyotypes for comparison purposes.
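The first two steps of this procedure can be sketched in Python as follows. This is a simplified illustration, not the KGBassembler algorithm itself: it assumes that BLAST hits are already available as tuples, and it labels each scaffold by a simple majority vote among its mapped marker genes, whereas a real scaffold may span more than one block. The function names, gene IDs and scores are hypothetical.

```python
from collections import Counter, defaultdict


def best_hits(hits):
    """Keep the highest-scoring hit for each query gene.

    `hits` is an iterable of (query_gene, scaffold, position, bit_score).
    """
    best = {}
    for gene, scaffold, position, score in hits:
        if gene not in best or score > best[gene][2]:
            best[gene] = (scaffold, position, score)
    return best


def assign_blocks(best, gene_to_block):
    """Label each scaffold with the block supported by most marker genes."""
    votes = defaultdict(Counter)
    for gene, (scaffold, _position, _score) in best.items():
        block = gene_to_block.get(gene)
        if block is not None:
            votes[scaffold][block] += 1
    return {scaffold: counts.most_common(1)[0][0]
            for scaffold, counts in votes.items()}


# Made-up hits: gene g1 aligns to two scaffolds, and only the better hit is kept.
hits = [
    ("g1", "scaf1", 100, 250.0),
    ("g1", "scaf2", 500, 90.0),
    ("g2", "scaf1", 900, 310.0),
    ("g3", "scaf1", 1500, 120.0),
    ("g4", "scaf2", 200, 400.0),
]
gene_to_block = {"g1": "A", "g2": "A", "g3": "B", "g4": "C"}
scaffold_blocks = assign_blocks(best_hits(hits), gene_to_block)
```

In this toy input, `scaf1` carries two markers from block A and one from block B, so it is labeled A. The third step, ordering the labeled scaffolds according to the CCP-derived karyotype, is not covered by this sketch.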

In CrusView, the digital karyotypes of the primary and secondary genomes will

greatly facilitate visualized genomic comparison and the identification of major

chromosomal rearrangement events causing the genomic evolution of the chromosomal

karyotype in the studied Brassicaceae genome. For example, A. thaliana chromosome 2

(AtChr2) resulted from the merging of E. salsugineum chromosome 4 (EsChr4) and the

long arm (14 Mb to 37 Mb) of EsChr3 (Figure 1). Moreover, when compared with the

ancestral karyotype of the eight A. lyrata chromosomes, users may study the different

evolutionary paths of the karyotype in another species. For example, although AtChr1

resulted from the merging of A. lyrata AlChr1 and AlChr2, the structure of EsChr1

remains unchanged compared with AlChr1 (Figure 1). Furthermore, users can search for

interested gene IDs or ortholog group IDs from the navigation panel and map their

positions on the compared primary and secondary genomic karyotypes.

Visualized fine-adjustment of pseudo-chromosome assembly in CrusView

The automatic generation of pseudo-chromosome sequences based on the KGBassembler

algorithm may miss or misplace certain scaffolds that do not contain sufficient gene

synteny information for inferring the assignment of conserved genomic blocks, which are

either relatively short or contain too many repetitive sequences. Additionally, de novo

scaffold assembly is usually interrupted at the edges of highly repeated centromere

sequences. Thus, manual adjustment of the pseudo-chromosomes may be necessary.

Unlike KGBassembler, in which users need to edit a text file for manual

adjustment, CrusView allows users to perform visualized fine-adjustment of pseudo-

chromosome assembly in the GUI and to consider additional genomic information, such as

positions of genetic markers, centromere-specific CentO tandem repeats, and the density

of protein-coding genes during the adjustment. Users can directly load the project result

produced in KGBassembler for visualized fine adjustment or use the “assembling”

function in CrusView to assemble pseudo-chromosomes from the scaffold sequences.

When the assembling function in CrusView is run for the first time, users must indicate

the working folder containing the required input files described in the Methods section

and an output folder to save the generated chromosome sequences. Users may set up

necessary parameters in the “parameter panel” and save the parameters into an INI

configuration file that can be directly loaded to run the assembling function (Figure 2).

The details of the parameters are explained in the KGBassembler manual, and users

may wish to apply different parameter settings to produce the best assembly,

which is largely dependent on the quality of the scaffold sequences themselves as

generated by de novo assembler tools.
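Saving and reloading parameters through an INI file can be illustrated with Python's standard `configparser` module. The section name and the parameter names below (`working_dir`, `output_dir`, `min_scaffold_length`, `gap_size`) are hypothetical placeholders; the actual parameter names are those documented in the KGBassembler manual.

```python
import configparser
import io


def write_params(params):
    """Serialize parameters to INI text under a single [assembly] section."""
    config = configparser.ConfigParser()
    config["assembly"] = {key: str(value) for key, value in params.items()}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def read_params(ini_text):
    """Parse INI text back into a dict, converting numeric fields to int."""
    config = configparser.ConfigParser()
    config.read_string(ini_text)
    section = config["assembly"]
    return {
        "working_dir": section.get("working_dir"),
        "output_dir": section.get("output_dir"),
        "min_scaffold_length": section.getint("min_scaffold_length"),
        "gap_size": section.getint("gap_size"),
    }


# Hypothetical settings for illustration only.
params = {
    "working_dir": "./input",
    "output_dir": "./output",
    "min_scaffold_length": 1000,
    "gap_size": 100,
}
ini_text = write_params(params)
```

Reading `ini_text` back with `read_params` recovers the same dictionary, so one saved file can be reused to rerun the assembly with identical settings.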

To fine-tune the draft pseudo-chromosome sequences, CrusView allows users to

add files containing genetic markers and CentO tandem repeats. In plants, CentO

sequences are ~170 bp motifs that are tandemly arrayed and specifically located in the

core centromeric regions (Benson, 1999). CentO repeats located at one end of a long

scaffold are generally indicative of the centromeric end of a scaffold (Figure 2).

Moreover, the density of protein-coding genes is typically higher in the euchromatic

regions of short and long arms than in the pericentromeric heterochromatic regions

(Figure 2). Thus, these types of information are very useful in assisting users to further

inspect and adjust the scaffold layouts and orientations on the chromosomes, as well as

the genomic positions of the genetic markers. Users can simply perform drag-and-drop

actions with a mouse to correct potentially misplaced scaffolds or to adjust the orientation

of scaffolds. When a manual adjustment is performed, users can save the pseudo-

chromosome sequences to a FASTA file and simultaneously generate the gene annotation

file. Finally, users can use the “push to main screen” function to directly add the

assembled pseudo-chromosome and perform further visualized comparative analyses.
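Once the order and orientation of the scaffolds are settled, building the pseudo-chromosome sequence amounts to concatenating the scaffolds, reverse-complementing those placed on the minus strand. The Python sketch below shows this under stated assumptions: the layout is a list of (scaffold ID, strand) pairs, and adjacent scaffolds are separated by a fixed run of `N` characters whose length (`gap_size`) is an arbitrary choice here, not a value taken from CrusView.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudochromosome(scaffolds, layout, gap_size=100):
    """Join scaffolds in layout order, flipping '-' scaffolds, with N gaps.

    `scaffolds` maps scaffold ID to sequence; `layout` is a list of
    (scaffold_id, strand) pairs with strand '+' or '-'.
    """
    pieces = []
    for scaffold_id, strand in layout:
        seq = scaffolds[scaffold_id]
        if strand == "-":
            seq = reverse_complement(seq)
        elif strand != "+":
            raise ValueError(f"unknown strand {strand!r} for {scaffold_id}")
        pieces.append(seq)
    return ("N" * gap_size).join(pieces)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


# Toy scaffolds: s2 is placed first and flipped, then s1 follows.
scaffolds = {"s1": "AACG", "s2": "TTGC"}
layout = [("s2", "-"), ("s1", "+")]
chromosome = build_pseudochromosome(scaffolds, layout, gap_size=3)
```

A drag-and-drop correction in the GUI corresponds to reordering `layout` or switching a strand from `+` to `-` before the sequence is rebuilt and saved.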

Visualization of genomic synteny between two genomes

The “compare two genomes” function in CrusView can provide a visualization of

genomic synteny for each pair of homologous chromosomes for the primary and

secondary genomes. Chromosome-scale genomic synteny can be visualized in two

ways: as a chromosomal karyotype with homologous genes linked between the two

chromosomes, or as a dot-plot indicating chromosomal macrosynteny with duplication

events (Figure 3A). For example, a comparison of the karyotypes of EsChr4 and AtChr2

indicated that A. thaliana chromosome 2 resulted from an event in which the entire

chromosome 4 (genomic blocks I and J) merged with the long arm of chromosome 3

(genomic blocks K, G and H) in E. salsugineum (Figure 3A). In addition, the visualized

chromosomal synteny with karyotype information can also allow users to examine the

differences in the chromosome structures between the two genomes. For instance, the 8

Mb-long region from 27 to 35 Mb of the J block on EsChr4 remains highly similar to the

7 Mb-long region from 13 to 20 Mb on AtChr2, whereas the 25 Mb-long I block of

EsChr4 has seemingly dramatically expanded with highly enriched repetitive sequences

and transposable elements compared to the corresponding ~17 Mb I block region on

AtChr2. More interestingly, a small region of EsChr4 between positions 10 and 11 Mb

was found to have resulted from the inverted translocation of a region from AtChr2. The selection

of a genomic region with the mouse can invoke the information window, which contains

the genes located in the regions of interest. By clicking on a gene homologous to the

corresponding A. thaliana gene, users will be redirected to the TAIR database, which

contains detailed gene function information.
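The karyotype comparison in the EsChr4/AtChr2 example reduces to asking which conserved blocks two chromosomes share. A minimal Python sketch is given below. It encodes only the block memberships stated in the text; the order of the blocks within AtChr2 is arbitrary here, and the dictionary keys are illustrative names rather than CrusView identifiers.

```python
def shared_blocks(primary, secondary):
    """Return {(primary_chrom, secondary_chrom): sorted shared block labels}."""
    result = {}
    for p_chrom, p_blocks in primary.items():
        for s_chrom, s_blocks in secondary.items():
            common = sorted(set(p_blocks) & set(s_blocks))
            if common:
                result[(p_chrom, s_chrom)] = common
    return result


# Block memberships as described in the text; order within AtChr2 is arbitrary.
eutrema = {
    "EsChr3_long_arm": ["K", "G", "H"],
    "EsChr4": ["I", "J"],
}
arabidopsis = {"AtChr2": ["I", "J", "K", "G", "H"]}

links = shared_blocks(eutrema, arabidopsis)
```

Here `links` connects both E. salsugineum segments to AtChr2, which mirrors the merging event shown in Figure 3A.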

Chromosome-scale genomic synteny can also be visualized as a dot-plot in

CrusView to facilitate the identification of segmental duplication and tandem duplication

events between the two compared species. From the dot-plot screen, users can select the

regions containing duplication events of interest with the mouse to obtain information

regarding the genes located in the selected regions (Figure 3A). Right-clicking the mouse

will invoke a pull-down list of advanced actions, such as querying selected genes in the

external TAIR database to view detailed functional descriptions, retrieving gene

sequences to a FASTA file, performing exon-level sequence alignment for a single gene,

and aligning multiple genes in a user-defined synteny region using AJaligner. Figure 3B

demonstrates a genomic region between 23.8 and 24.1 Mb on AtChr4 encompassing two

tandem duplication events of the gene members in the calcium-dependent protein kinase

(CDPK) family that may be involved in stress responsive pathways in A. thaliana. While

AtCDPK27 and AtCDPK31 represent a pair of tandemly duplicated genes that

correspond to the single-copy E. salsugineum gene Thhalv10028618m.g, AtCDPK21 and

AtCDPK23 correspond to the single-copy gene Thhalv10028567m.g (Figure 3B). An

exon-level sequence alignment of a pair of orthologous genes of interest can reveal

exon-level structural variations, amino acid variations, insertions and deletions (INDELs),

and single nucleotide polymorphisms (SNPs), which is illustrated by the comparison of

SALT OVERLY SENSITIVE 1 (AtSOS1) in A. thaliana and its E. salsugineum ortholog

(Figure 3C).
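A dot-plot of this kind can be derived from a list of homologous gene pairs and the gene coordinates in each genome. The Python sketch below is an illustration under assumptions, not the CrusView implementation: each dot is placed at the gene midpoints (in Mb), and a primary-genome gene with more than one partner in the secondary genome is reported as a duplication candidate. The gene pairings come from the CDPK example above; the coordinates used for `point` are arbitrary.

```python
from collections import defaultdict


def dotplot_points(pairs, primary_pos, secondary_pos):
    """Return (x, y) gene midpoints in Mb for pairs present in both maps."""
    points = []
    for p_gene, s_gene in pairs:
        if p_gene in primary_pos and s_gene in secondary_pos:
            p_start, p_end = primary_pos[p_gene]
            s_start, s_end = secondary_pos[s_gene]
            points.append(((p_start + p_end) / 2e6, (s_start + s_end) / 2e6))
    return points


def multi_copy_partners(pairs):
    """Map each primary gene to its secondary partners if it has several."""
    partners = defaultdict(set)
    for p_gene, s_gene in pairs:
        partners[p_gene].add(s_gene)
    return {gene: sorted(found) for gene, found in partners.items()
            if len(found) > 1}


# Pairings taken from the CDPK example in the text.
cdpk_pairs = [
    ("Thhalv10028618m.g", "AtCDPK27"),
    ("Thhalv10028618m.g", "AtCDPK31"),
    ("Thhalv10028567m.g", "AtCDPK21"),
    ("Thhalv10028567m.g", "AtCDPK23"),
]
candidates = multi_copy_partners(cdpk_pairs)

# Arbitrary coordinates (bp) to show how one dot is positioned.
point = dotplot_points([("p1", "s1")],
                       {"p1": (1_000_000, 3_000_000)},
                       {"s1": (500_000, 1_500_000)})
```

Each single-copy E. salsugineum gene is reported with its two A. thaliana partners, which is the pattern that appears as paired dots in Figure 3B.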

Visualization of a user-defined list of genes, duplication events and copy number

variations (CNV) in a genomic synteny plot

Using CrusView, users may visualize a group of genes of interest in the two compared

genomes to determine their associations with genomic synteny and possible duplication

events. We demonstrate this utility by analyzing the tandemly duplicated F-box

superfamily that has been found to display great copy number variations between A.

thaliana (505 genes) and E. salsugineum (613 genes). First, the genes in E. salsugineum

were assigned to the orthologous groups annotated in the OrthoMCL database (Li et al.,

2003). Each ortholog group indicated by a unique ID contains the putative orthologous

genes in A. thaliana and E. salsugineum. We found that one of the ortholog groups

(OG5_127192) that showed high variation in copy number contained 148 and 130 F-box

genes in A. thaliana and E. salsugineum, respectively. In plants, F-box genes constitute a

large superfamily encoding components of E3 ubiquitin ligases that are involved in substrate-specific

protein degradation. Next, using the “predict tandem duplication” function in CrusView,

highly homologous genes defined with a cutoff of 40% protein-identity and located

adjacent to each other within a 5 kb window were highlighted in green in the dot-plot of

EsChr3 and AtChr3 (Figure 4). The protein-identity cutoff and window size can both be

adjusted by the user when predicting tandem duplications. Then, using the “keyword

search” function, a group of genes of interest is displayed in the current dot-plot. For

instance, when searching ID “OG5_127192”, F-box genes classified in this ortholog

group by OrthoMCL were highlighted in red in the same dot-plot image (Figure 4). From

the overlapping green dots (tandemly duplicated genes) and red dots (F-box genes in

group OG5_127192), we observed a macro-syntenic block covering a ~5 Mb region on

AtChr3 and a ~15 Mb region on EsChr3 encompassing 59 and 78 tandemly arrayed F-

box genes in A. thaliana and in E. salsugineum, respectively (Figure 4).
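The tandem duplication criterion described above (a protein-identity cutoff combined with a distance window) can be sketched in Python as follows. This is one possible reading of the rule, not the code used in CrusView: two genes are called tandem duplicates when they are neighbors on the same chromosome, the gap between them is at most `window` bp, and their protein identity is at least `min_identity` percent. The identity lookup and the gene records are made up for illustration.

```python
def predict_tandem_duplicates(genes, identity, min_identity=40.0, window=5_000):
    """Return neighboring gene pairs that pass the identity and distance cutoffs.

    `genes` is a list of (gene_id, chrom, start, end); `identity` maps
    frozenset({gene_a, gene_b}) to percent protein identity.
    """
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene[1], []).append(gene)

    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g[2])  # order by start coordinate
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            gap = right[2] - left[3]
            pct = identity.get(frozenset((left[0], right[0])), 0.0)
            if gap <= window and pct >= min_identity:
                pairs.append((left[0], right[0]))
    return pairs


# Made-up genes: g1 and g2 are close and similar; g3 is similar but too far away.
genes = [
    ("g1", "chr1", 1_000, 2_000),
    ("g2", "chr1", 3_000, 4_200),
    ("g3", "chr1", 20_000, 21_000),
    ("g4", "chr2", 500, 900),
]
identity = {
    frozenset(("g1", "g2")): 55.0,
    frozenset(("g2", "g3")): 80.0,
}
tandem = predict_tandem_duplicates(genes, identity)
```

Raising `min_identity` or shrinking `window` makes the prediction stricter, which corresponds to adjusting the two user-defined cutoffs in CrusView.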

Similarly, users can also add additional genomic information to the BED file to

allow searching for self-defined keywords, such as gene ontology (GO) terms, gene

functional descriptions or gene families. CrusView also allows users to filter a list of

genes or genomic positions of interest from the user-defined genomic information file,

which can be displayed on the dot-plot synteny map. Users can define the color schemes

for different gene groups on the plots using the setting function of CrusView. Finally, the

digital karyotype maps, macro-synteny plots based on the 24 color-coded genomic blocks,

and dot-plot synteny map showing duplication events and mapped genes of interest can

be saved as publication-quality PNG/SVG/PDF images.
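Because SVG is a plain-text vector format, a color-coded block map can be written out with very little code. The Python sketch below draws one chromosome as a row of colored rectangles. It is a hypothetical illustration of the export idea, not CrusView's rendering code, and the block lengths and colors are arbitrary.

```python
from xml.sax.saxutils import escape, quoteattr


def karyotype_svg(blocks, scale=10, height=20):
    """Render blocks as adjacent rectangles in a single-row SVG image.

    `blocks` is a list of (label, length_mb, color); `scale` is pixels per Mb.
    """
    x = 0
    rects = []
    for label, length_mb, color in blocks:
        width = length_mb * scale
        rects.append(
            f'<rect x="{x}" y="0" width="{width}" height="{height}" '
            f'fill={quoteattr(color)}><title>{escape(label)}</title></rect>'
        )
        x += width
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{x}" height="{height}">' + "".join(rects) + "</svg>")


# Arbitrary lengths and colors for two blocks.
svg = karyotype_svg([("I", 3, "#ff0000"), ("J", 2, "#0000ff")])
```

Writing `svg` to a file with an `.svg` extension yields an image that scales without loss of resolution, which is the property that makes vector output suitable for publication figures.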

CONCLUSION

In this work, we developed a Java-based bioinformatic application – CrusView – using

the SWT/SWING graphics libraries in Java and a SQLite database; this

application was designed to facilitate research in comparative genomics. We

demonstrated the basic functionality of CrusView by performing a visual comparison of

the A. thaliana and E. salsugineum genomes in the plant Brassicaceae (Crucifer) family.

Compared to other bioinformatic tools that have been developed for similar purposes, one

of CrusView’s unique features is its incorporation of genomic karyotype information

derived from comparative chromosomal painting (CCP) experiments. The karyotype of a

species associated with the genome structure visualized in CrusView can greatly assist

users in identifying chromosomal rearrangements, genomic synteny and major

duplication events among the related species. Thus, this unique CrusView feature may

facilitate the understanding of karyotype, chromosome and genome evolution based on a

comparative genomics approach. Furthermore, by taking advantage of a species’

karyotype, CrusView provides a unique function to infer pseudo-chromosome sequences

from scaffold sequences generated by de novo assemblers based on conserved genomic

blocks. This feature is especially convenient for non-model species that lack a genetic

and/or physical map. However, users should be aware that CrusView does not replace de

novo assembler tools, and its performance in finalizing the assembly of a pseudo-

chromosome sequence depends largely on the quality of the scaffolds and contigs

produced from whole-genome shotgun sequencing projects.

CrusView also includes an array of utilities that can be used to visualize genome

synteny and duplication events and to map a list of genes of interest associated with

syntenic regions between the two analyzed genomes. Compared to database-based

comparative genomics tools, CrusView is much more flexible in the ability to analyze

unpublished genomes; it allows users to integrate self-defined genomic information, such

as gene ontology (GO) classifications, gene families of interest, hot-spots of

chromosomal breakage/fusion points, high-frequency recombination sites, and tandem

duplication to study their correlations with genomic variations and duplication events.

User-defined information and genome synteny plots can be exported as high-resolution,

publication-quality PNG/SVG/PDF images.

Karyotype mapping based on in situ hybridization experiments is a common

genomic technique that is widely used in animals and plants. Conserved patterns of

chromosomal rearrangements based on syntenic genomic blocks as basic units of

chromosomal breakage and fusion events are commonly observed in the animal and plant

kingdoms (Lysak et al., 2006; Ferguson-Smith and Trifonov, 2007). Therefore, although

CrusView was primarily developed and preset based on the karyotype evolution patterns

in the Brassicaceae family (primarily for the convenience of the Brassicaceae community),

this software program may also be used to perform karyotype-based genome assembly or

karyotype-assisted genome synteny analysis in other plant families or in other organisms

for which karyotype data exist. If users wish to use the current version of CrusView for

non-Brassicaceae species, they can access the “setting” function to define the color

schemes and letter labels of the conserved genomic blocks based on the karyotype

evolution patterns of the species of interest. Additionally, to promote the broad use of

CrusView in other organisms, the source code of CrusView has been released through

SourceForge.net to allow academic users to freely download and modify the program.

MATERIALS AND METHODS

Basic input files for CrusView

CrusView utilizes the Java web-start function so that it can be launched through the

CrusView homepage. When it is run for the first time, CrusView creates a “CrusView”

folder on the user’s local computer and automatically installs the programs and basic

dataset in the folder. CrusView simultaneously creates a local Java SQLite database to

manage the genomic data that the user wishes to analyze. The data files, which are

imported into the SQLite database, include a FASTA file containing chromosome or

scaffold/contig sequences and a GFF file containing gene model annotations. The user must also

prepare a BED file in the “bed” folder to provide additional information, such as ortholog

group IDs, genome block IDs, and protein sequence identities between the primary and

secondary genomes. To enable the advanced search function, the BED file may also

include the user’s self-defined genomic information and functional descriptions added in

the last column, such as gene ontology (GO) terms, gene families, recombination

hotspots, and so on. To analyze a specific group of genes of interest, the user can load a

TXT file containing the gene IDs or genomic positions and their further descriptions into

CrusView through the provided functions.
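
As a rough illustration of how free-text annotations in the last BED column could support a keyword search, the following Python sketch parses a tab-delimited, BED-like file and filters genes by keyword. The column layout (chromosome, start, end, gene ID, ortholog group, block ID, protein identity, description) and all function names are assumptions made for this example; the actual layout is defined in the CrusView manual.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical column layout; the real one is specified in the CrusView manual.
COLUMNS = ["chrom", "start", "end", "gene_id",
           "ortholog_group", "block_id", "identity", "description"]


def read_bed(handle: Iterable[str]) -> List[Dict[str, str]]:
    """Read tab-delimited BED-like rows into dictionaries, skipping comments."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue
        # Pad short rows so that a missing description becomes an empty string.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def keyword_search(rows: List[Dict[str, str]], keyword: str) -> List[str]:
    """Return gene IDs whose description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r["gene_id"] for r in rows if needle in r["description"].lower()]


# Made-up records used only to exercise the functions above.
EXAMPLE = (
    "Chr3\t1000\t2500\tgeneA\tOG1\tF\t87.5\tF-box family protein\n"
    "Chr3\t4000\t5200\tgeneB\tOG2\tF\t91.0\tprotein kinase; GO:0004672\n"
    "Chr3\t9000\t9900\tgeneC\tOG1\tG\t65.2\tF-box/LRR-repeat protein\n"
)

rows = read_bed(io.StringIO(EXAMPLE))
fbox_genes = keyword_search(rows, "f-box")
```

Keeping the free-text description in the final column means that any number of user-defined terms can be added without changing the positions of the fixed columns.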

Input files for karyotype-based genome assembly

For species that have only scaffold sequences but an available CCP-derived

karyotype map, a karyotype-based genome assembly of pseudo-chromosomes from

scaffold sequences is recommended. The KGBassembler will be invoked by the

“assembling” function in CrusView. The assembly function requires the following input

files: a KARYOTYPE file containing CCP-based karyotype information obtained from

the CrusView website or prepared by the user following the instructions, a PSL file containing

A. thaliana genes aligned on the scaffolds, and a FASTA file containing scaffold

sequences. The user can either provide a configuration file in INI format or edit the

“Parameter” tab in the CrusView interface to set up necessary parameters for assembly. If

a genetic map with gene marker information is prepared by the user as a GMM file in the

format described in the CrusView manual, CrusView may also incorporate

this information during the manual adjustment of the pseudo-chromosomes. To facilitate

the prediction of scaffold orientations on the pseudo-chromosomes, the user may run the

Tandem Repeats Finder (TRF) software program (Benson, 1999) to identify the scaffolds

containing centromere-specific tandem repeat (CentO) sequences. CentO repeat locations

formatted as a BED file can be loaded into CrusView as an additional track.
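
The conversion of tandem repeat calls into a BED track could be sketched as follows. This Python example assumes that TRF was run with its data-file option and that the resulting .dat text follows the commonly documented layout (a "Sequence:" line per input sequence, followed by whitespace-separated records whose first three fields are start, end and period size). The minimum period filter, the record names and the function name are illustrative assumptions and are not taken from CrusView.

```python
from typing import Iterable, Iterator, Tuple

BedRecord = Tuple[str, int, int, str]  # (scaffold, start0, end, name)


def trf_dat_to_bed(lines: Iterable[str], min_period: int = 100) -> Iterator[BedRecord]:
    """Yield BED records for tandem repeats with a period of at least min_period."""
    scaffold = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Sequence:":
            # The scaffold name is the first token after "Sequence:".
            scaffold = fields[1] if len(fields) > 1 else None
            continue
        # Repeat records have 15 fields and begin with a numeric start coordinate.
        if scaffold is None or len(fields) < 15 or not fields[0].isdigit():
            continue
        start, end, period = int(fields[0]), int(fields[1]), int(fields[2])
        if period >= min_period:
            # TRF coordinates are 1-based inclusive; BED is 0-based half-open.
            yield scaffold, start - 1, end, f"TR_period{period}"


# Made-up .dat excerpt: one long-period repeat and one microsatellite.
EXAMPLE_DAT = """Sequence: scaffold_1

Parameters: 2 7 7 80 10 50 500

1201 3000 178 10.1 178 95 2 3200 30 20 20 30 1.97 ACGT ACGTACGT
5000 5060 3 20.3 3 100 0 120 33 33 0 33 1.58 CAT CATCAT
"""

bed_records = list(trf_dat_to_bed(EXAMPLE_DAT.splitlines()))
bed_text = "\n".join("\t".join(map(str, rec)) for rec in bed_records)
```

Filtering on period size is only one possible way to enrich for satellite-like repeats; whether a given repeat is truly centromere-specific still has to be judged by the user.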

After the KGBassembler has generated the pseudo-chromosome sequences, the

user may use CrusView to perform fine adjustments to the orientations and orders of the

scaffolds on the pseudo-chromosomes based on the additional information provided by

the user, such as the density of protein-coding genes, user-customized genetic markers,

and the locations of CentO centromeric tandem repeats on the scaffolds. CrusView

provides an enhanced GUI that can be used to further adjust the pseudo-chromosome

assembly using drag-and-drop mouse actions. When the user clicks the “save

assembly” button, the pseudo-chromosome sequences and gene annotation information

are saved in a FASTA file and a GFF file, respectively.

Conversion of the user’s unpublished genome sequence and self-defined gene

annotation to input files compatible with CrusView

To help the user analyze unpublished genome sequences, CrusView includes a

function for preparing the necessary input files. The

user must provide the genome/scaffold sequences in FASTA format, the gene annotation

file in GFF or GTF format, and one additional karyotype file if the user wants to perform

karyotype-based assembly of pseudo-chromosome sequences. The user is also prompted

to submit their protein sequences to the OrthoMCL online database (Li et al., 2003) to assign the

genes to the corresponding ortholog groups to facilitate genome comparison, gene

duplication analyses and copy number variation analyses. To assign the 24 conserved

genome block IDs to the genes, the user must provide a BLAST result of the protein

sequences of the analyzed genome against A. thaliana proteins. Additional genomic

information that the user wishes to include will be integrated into the last column of the

BED file to enable the keyword search function in CrusView.
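
One way to picture the block assignment step is a best-hit lookup. The Python sketch below assumes that the BLAST result is in the standard 12-column tabular format (bit score in the last column) and that a table mapping A. thaliana genes to block IDs is available. The table contents, the gene names and the best-hit rule are hypothetical and may differ from what CrusView does internally.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(blast_lines: Iterable[str]) -> Dict[str, str]:
    """Map each query protein to its highest-bit-score subject (tabular BLAST)."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def assign_blocks(hits: Dict[str, str],
                  ath_gene_to_block: Dict[str, str],
                  queries: List[str]) -> Dict[str, str]:
    """Give each query the block of its best A. thaliana hit, or "0" if undetermined."""
    return {q: ath_gene_to_block.get(hits.get(q, ""), "0") for q in queries}


# Made-up tabular BLAST lines: query, subject, identity, ..., e-value, bit score.
BLAST = [
    "es_gene_1\tath_gene_1\t88.0\t300\t36\t0\t1\t300\t1\t300\t1e-150\t200.0",
    "es_gene_1\tath_gene_2\t61.0\t280\t109\t2\t5\t284\t3\t282\t1e-80\t150.0",
    "es_gene_2\tath_gene_3\t75.0\t250\t62\t1\t1\t250\t1\t250\t1e-100\t180.0",
]
# Hypothetical lookup table; ath_gene_3 has no block and falls back to "0".
ATH_BLOCKS = {"ath_gene_1": "A", "ath_gene_2": "B"}

blocks = assign_blocks(best_hits(BLAST), ATH_BLOCKS,
                       ["es_gene_1", "es_gene_2", "es_gene_3"])
```

Genes without a hit, or whose best hit lies outside every block, receive the ID "0", in line with the label that CrusView uses for undetermined regions.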

Inference of genomic macro-synteny based on conserved genomic blocks

The genomes of the Brassicaceae species share 24 conserved genomic blocks (large

chromosomal segments) designated A to X. An additional ID “0” is used by CrusView to

label undetermined regions that are not assigned to any genomic blocks. The

chromosomal locations of the 24 genomic blocks can be inferred from the CCP-derived

karyotype. All genes located within the same conserved genomic block are assigned the

same color code to illustrate the digital karyotype of the studied species. Genes

sharing the same genomic block ID are considered to be in the same

macro-syntenic region. To analyze a genome lacking a CCP-derived karyotype or a

genome from another plant or animal family that has different conserved

genomic blocks, the user can self-define the block IDs with HEX color codes in the BED

file.
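
To make the idea of a digital karyotype concrete, the Python sketch below collapses consecutive genes that carry the same block ID into colored chromosome segments. The HEX colors, the tuple layout and the function name are assumptions for illustration only; they are not the preset CrusView color scheme.

```python
from itertools import groupby
from typing import List, Tuple

Gene = Tuple[str, int, int, str]          # (chromosome, start, end, block ID)
Segment = Tuple[str, int, int, str, str]  # (chromosome, start, end, block ID, color)

# Hypothetical HEX colors; "0" marks regions not assigned to any block.
BLOCK_COLORS = {"A": "#E6B800", "B": "#1F77B4", "0": "#BBBBBB"}


def block_segments(genes: List[Gene]) -> List[Segment]:
    """Merge runs of adjacent genes with the same block ID into segments."""
    ordered = sorted(genes, key=lambda g: (g[0], g[1]))
    segments = []
    for (chrom, block), run in groupby(ordered, key=lambda g: (g[0], g[3])):
        members = list(run)
        start = members[0][1]
        end = max(g[2] for g in members)
        # Unknown block IDs fall back to the color of the undetermined block.
        color = BLOCK_COLORS.get(block, BLOCK_COLORS["0"])
        segments.append((chrom, start, end, block, color))
    return segments


# Made-up gene coordinates used only to exercise the function.
GENES = [
    ("Chr1", 300, 400, "A"),
    ("Chr1", 100, 200, "A"),
    ("Chr1", 500, 600, "B"),
    ("Chr1", 700, 800, "A"),
    ("Chr2", 50, 150, "B"),
]

segments = block_segments(GENES)
```

Because a block interrupted by another block yields two separate segments, rearrangements such as insertions or inversions remain visible in the resulting segment list.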

Visualization of chromosomal karyotype, genomic synteny and gene alignment

The GUI and visualization functions of CrusView were implemented with the Java

SWT/Swing libraries. Visualization of the genomic data of an analyzed

species can be performed at three levels: the genome level, the chromosome level and

the gene level. If the karyotype information has been associated with the studied genome,

all of the chromosomes are displayed with the 24 genomic block IDs in their

corresponding colors. The user can select any two chromosomes of interest in the two

compared species to visualize chromosomal synteny. When comparing the karyotypes of

two chromosomes, the pairs of orthologous genes between the two species are linked to

indicate major chromosomal rearrangement events. CrusView also generates a dot-plot

for each pair of selected chromosomes to visualize tandem and segmental duplication

events. The user may select a group of genes from the dot-plot by dragging a selection

frame with the mouse to trigger gene-level visualization. A multi-gene alignment within a designated

genomic region (less than 1 Mb) between the two genomes and an exon-to-exon

alignment of one pair of orthologous genes with single nucleotide polymorphism (SNP)

information can be visualized.
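
A possible way to flag candidate tandem duplicates for highlighting on a dot-plot is sketched below in Python. The criteria used here (same chromosome, same ortholog group, a gap of at most 5 kb between two genes, and a pairwise protein identity of at least 40%) are assumptions loosely modeled on the thresholds quoted in the Figure 4 legend; the exact procedure used by CrusView is not described in the text, and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple


def tandem_candidates(genes: List[dict],
                      identity: Dict[FrozenSet[str], float],
                      min_identity: float = 40.0,
                      window: int = 5000) -> List[Tuple[str, str]]:
    """Return gene pairs that satisfy the assumed tandem duplication criteria."""
    groups: Dict[Tuple[str, str], List[dict]] = {}
    for gene in genes:
        groups.setdefault((gene["chrom"], gene["group"]), []).append(gene)

    pairs = []
    for members in groups.values():
        members.sort(key=lambda g: g["start"])
        for left, right in combinations(members, 2):
            gap = right["start"] - left["end"]
            # Pairs without a recorded identity are treated as 0% identical.
            pid = identity.get(frozenset((left["id"], right["id"])), 0.0)
            if gap <= window and pid >= min_identity:
                pairs.append((left["id"], right["id"]))
    return pairs


# Made-up genes and pairwise identities used only to exercise the function.
GENES = [
    {"id": "g1", "chrom": "Chr3", "start": 1000, "end": 2000, "group": "OG1"},
    {"id": "g2", "chrom": "Chr3", "start": 4000, "end": 5000, "group": "OG1"},
    {"id": "g3", "chrom": "Chr3", "start": 20000, "end": 21000, "group": "OG1"},
    {"id": "g4", "chrom": "Chr3", "start": 5200, "end": 6000, "group": "OG2"},
    {"id": "g5", "chrom": "Chr3", "start": 5500, "end": 6500, "group": "OG1"},
]
IDENTITY = {
    frozenset(("g1", "g2")): 85.0,
    frozenset(("g1", "g3")): 90.0,
    frozenset(("g2", "g3")): 88.0,
    frozenset(("g2", "g5")): 30.0,
}

candidates = tandem_candidates(GENES, IDENTITY)
```

Both thresholds are exposed as parameters, so a stricter identity cutoff or a wider window can be tested without changing the function.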

Output image files generated from CrusView

CrusView can generate high-resolution images and save them in

PNG, SVG or PDF format for publication. Such images include digital karyotypes,

genome synteny plots, dot-plots of two chromosomes, multi-gene alignments within a

genomic region, exon-to-exon alignment plots, plots of genomic duplication events, and

maps of a list of genes of interest on the genomic synteny plots.

Software availability

CrusView is publicly available online (http://www.cmbb.arizona.edu/crusview) and has

been implemented as a Java Web Start application for 32/64-bit Windows and Linux

systems with options for different memory sizes. Sample datasets from Arabidopsis

thaliana and Eutrema salsugineum are provided to demonstrate the basic functions of

CrusView. The software manual and a series of video tutorials of CrusView are also

provided online (http://www.cmbb.arizona.edu/crusview/video_tutorial).

COMPETING INTERESTS

The author(s) declare that they have no competing interests.

AUTHORS’ CONTRIBUTIONS

W.X. and C.H. conceived the project. W.X. and C.H. developed the software. W.X. and

C.H. prepared the manuscript.

FIGURE LEGENDS

Figure 1. Functional panels in the CrusView main screen. a. Navigation panel; b. List of

available karyotypes in Brassicaceae; c. Main window showing the primary genome (E.

salsugineum); d. Color scheme and letter labels of the 24 conserved genomic blocks; e.

Window showing the secondary genome (A. thaliana); f. Gene annotation panel; g.

Digital ancestral karyotype of A. lyrata; h. Digital karyotype of A. thaliana; and i. Digital

karyotype of E. salsugineum.

Figure 2. Genome assembling function. a. Digital karyotype of E. salsugineum; b.

Unplaced short-scaffold sequences; c. Parameter panel; d. Menu bar; e. Main working

panel for the manual curation of the genome assembly of E. salsugineum; f. Density of

protein-coding genes on scaffolds; g. Centromere-specific tandem repeat; and h. Genetic

marker track.

Figure 3. Visualization of genome synteny and gene alignment. A. Panels for genome

synteny visualization: a. Navigation bar; b. Primary genome; c. Secondary genome; d.

Chromosome synteny; e. Dot-plot; f. Genes in the selected area; g. Action list; h.

Selection of segmental duplication; and i. Genes in the ortholog groups. B. Alignment of

multiple gene members in the CDPK family showing tandem duplication events. C.

Exon-level alignment of the SOS1 genes between A. thaliana and E. salsugineum.

Figure 4. Mapping duplication events and genes of interest onto the dot-plot synteny map.

A dot-plot synteny map of EsChr3 and AtChr3. The blue dots represent homologous gene

pairs in the A. thaliana and E. salsugineum genomes. The blue dots arranged along the

diagonal line indicate a macro-synteny region. The aligned blue dots deviating from the

diagonal line indicate segmental duplications. The green dots represent potential

tandemly duplicated genes selected using a protein identity cutoff of 40% and a window

size of 5 kb. The red dots represent F-box genes selected by a keyword search. The

overlapping red dots and green dots indicate the tandemly duplicated F-box genes on

EsChr3 and AtChr3.

REFERENCES

Benson, G. (1999). Tandem repeats finder: a program to analyze DNA sequences.

Nucleic Acids Res 27, 573-580.

Cheng, F., Wu, J., Fang, L., and Wang, X. (2012). Syntenic gene analysis between Brassica

rapa and other Brassicaceae species. Front Plant Sci 3, 198.

Dassanayake, M., Oh, D.H., Haas, J.S., Hernandez, A., Hong, H., Ali, S., Yun, D.J.,

Bressan, R.A., Zhu, J.K., Bohnert, H.J., and Cheeseman, J.M. (2011). The

genome of the extremophile crucifer Thellungiella parvula. Nat Genet 43, 913-

U137.

Ferguson-Smith, M.A., and Trifonov, V. (2007). Mammalian karyotype evolution.

Nature reviews. Genetics 8, 950-962.

Frazer, K.A., Pachter, L., Poliakov, A., Rubin, E.M., and Dubchak, I. (2004). VISTA:

computational tools for comparative genomics. Nucleic Acids Res 32, W273-279.

Guy, L., Kultima, J.R., and Andersson, S.G. (2010). genoPlotR: comparative gene and

genome visualization in R. Bioinformatics 26, 2334-2335.

Hu, T.T., Pattyn, P., Bakker, E.G., Cao, J., Cheng, J.F., Clark, R.M., Fahlgren, N.,

Fawcett, J.A., Grimwood, J., Gundlach, H., Haberer, G., Hollister, J.D.,

Ossowski, S., Ottilar, R.P., Salamov, A.A., Schneeberger, K., Spannagl, M.,

Wang, X., Yang, L., Nasrallah, M.E., Bergelson, J., Carrington, J.C., Gaut,

B.S., Schmutz, J., Mayer, K.F.X., de Peer, Y.V., Grigoriev, I.V., Nordborg, M.,

Weigel, D., and Guo, Y.L. (2011). The Arabidopsis lyrata genome sequence and

the basis of rapid genome size change. Nat Genet 43, 476-+.

Jorgensen, R.A., Stein, L., Rain, S., Andrews, G., and Chandler, V. (2008). The iPlant

collaborative: A cyberinfrastructure-centered community for a new plant biology.

In Vitro Cell Dev-An 44, S26-S26.

Li, L., Stoeckert, C.J., and Roos, D.S. (2003). OrthoMCL: Identification of ortholog

groups for eukaryotic genomes. Genome Research 13, 2178-2189.

Lysak, M.A., Berr, A., Pecinka, A., Schmidt, R., McBreen, K., and Schubert, I.

(2006). Mechanisms of chromosome number reduction in Arabidopsis thaliana

and related Brassicaceae species. Proc Natl Acad Sci U S A 103, 5224-5229.

Ma, C., Chen, H., Xin, M., Yang, R., and Wang, X. (2012). KGBassembler: a

karyotype-based genome assembler for Brassicaceae species. Bioinformatics 28,

3141-3143.

Sullivan, M.J., Petty, N.K., and Beatson, S.A. (2011). Easyfig: a genome comparison

visualizer. Bioinformatics 27, 1009-1010.

Tang, H., and Lyons, E. (2012). Unleashing the genome of Brassica rapa. Front Plant Sci

3, 172.

Van Bel, M., Proost, S., Wischnitzki, E., Movahedi, S., Scheerlinck, C., Van de Peer,

Y., and Vandepoele, K. (2012). Dissecting plant genomes with the PLAZA

comparative genomics platform. Plant Physiol 158, 590-600.

Wang, X., Wang, H., Wang, J., Sun, R., Wu, J., Liu, S., Bai, Y., Mun, J.H., Bancroft,

I., Cheng, F., Huang, S., Li, X., Hua, W., Freeling, M., Pires, J.C., Paterson,

A.H., Chalhoub, B., Wang, B., Hayward, A., Sharpe, A.G., Park, B.S.,

Weisshaar, B., Liu, B., Li, B., Tong, C., Song, C., Duran, C., Peng, C., Geng,

C., Koh, C., Lin, C., Edwards, D., Mu, D., Shen, D., Soumpourou, E., Li, F.,

Fraser, F., Conant, G., Lassalle, G., King, G.J., Bonnema, G., Tang, H.,

Belcram, H., Zhou, H., Hirakawa, H., Abe, H., Guo, H., Jin, H., Parkin, I.A.,

Batley, J., Kim, J.S., Just, J., Li, J., Xu, J., Deng, J., Kim, J.A., Yu, J., Meng,

J., Min, J., Poulain, J., Hatakeyama, K., Wu, K., Wang, L., Fang, L., Trick,

M., Links, M.G., Zhao, M., Jin, M., Ramchiary, N., Drou, N., Berkman, P.J.,

Cai, Q., Huang, Q., Li, R., Tabata, S., Cheng, S., Zhang, S., Sato, S., Sun, S.,

Kwon, S.J., Choi, S.R., Lee, T.H., Fan, W., Zhao, X., Tan, X., Xu, X., Wang,

Y., Qiu, Y., Yin, Y., Li, Y., Du, Y., Liao, Y., Lim, Y., Narusaka, Y., Wang, Z., Li,

Z., Xiong, Z., and Zhang, Z. (2011). The genome of the mesopolyploid crop

species Brassica rapa. Nat Genet 43, 1035-1039.

Wright, S.I., and Agren, J.A. (2011). The Arabidopsis lyrata genome sequence Sizing

up Arabidopsis genome evolution. Heredity 107, 509-510.

Wu, H.J., Zhang, Z.H., Wang, J.Y., Oh, D.H., Dassanayake, M., Liu, B.H., Huang,

Q.F., Sun, H.X., Xia, R., Wu, Y.R., Wang, Y.N., Yang, Z., Liu, Y., Zhang, W.K.,

Zhang, H.W., Chu, J.F., Yan, C.Y., Fang, S., Zhang, J.S., Wang, Y.Q., Zhang,

F.X., Wang, G.D., Lee, S.Y., Cheeseman, J.M., Yang, B.C., Li, B., Min, J.M.,

Yang, L.F., Wang, J., Chu, C.C., Chen, S.Y., Bohnert, H.J., Zhu, J.K., Wang,

X.J., and Xie, Q. (2012). Insights into salt tolerance from the genome of

Thellungiella salsuginea. Proc Natl Acad Sci U S A 109, 12219-12224.

Yang R, J.D., Chen H, Beilstein M, Grimwood J, Jenkins J, Shu S, Prochnik S, Xin

M, Ma C, Schmutz J, Wing RA, Mitchell-Olds T, Schumaker K and Wang X.

(2013). The reference genome of the halophytic plant Eutrema salsugineum. Front

Plant Sci. 4.
