HIVE Sequence Profiling Tutorial

The purpose of this tutorial is to guide the user through the process of profiling a single alignment using the HIVE Sequence Profiling tool. Variations on profiling follow the same basic process and differ only through modification of inputs and parameters.

TABLE OF CONTENTS

Introduction

1. Selecting Inputs
   1.1 Accessing SNP Profiling from Alignment Results Page
   1.2 Accessing SNP Profiling from Home Directory

2. Input Parameters
   2.1 Most frequently modified parameters
   2.2 Hidden Algorithmic Parameters

3. Job Processing

4. Profiling Results
   4.1 Summary results tabs
   4.2 Visualization results tabs

5. What next?

INTRODUCTION

The HIVE Sequence Profiling tool calculates the frequency of individual bases plotted against either the number of bases in the reference genome or the number of bases in the consensus sequence resulting from a prior alignment. (This tutorial assumes alignment has been pre-computed using HIVE’s dna-hexagon aligner. For more information, please see the document titled “HIVE dna-hexagon Tutorial”.) Profiling can be used for SNP discovery, to study the variable SNP patterns between closely related but divergent samples, or for a number of other applications. Profiling outputs can be exported or can be used as the inputs for other HIVE tools. For example, SNP profiles can be entered into the Clusterization Comparative Analysis tool, which creates a tree with distances represented by differences between SNP profiles.

For easier understanding, all text for HIVE system options, parameters or buttons is displayed in bolded blue. Any text the user should input is displayed in the Courier font. Equations are offset and italicized.

1. SELECTING INPUTS

In HIVE, computational software is developed in a modular way to facilitate stacking of different analytic tools end-to-end in a diverse array of possible configurations. To use the HIVE SNP Profiling tool, the system must have a prior alignment completed. The user may then access the tool either from the dna-hexagon (alignment) results page or from the user’s Home directory.

1.1 Accessing SNP Profiling from Alignment Results Page

On the top right of the Results box you should see text that says what can you do next (see Figure 1). To proceed to further analyses following alignment, hover over Profiling Tools and select the SNP Profiling option. This should redirect you to the inputs page for the SNP Profiling tool.

Figure 1. Access to SNP Profiling Tool

Please note that this page is organized much like the alignment input page (see Figure 2). The top Reference Gene Selection box is where you specify the input data, and the lower Algorithmic Parameters box contains a number of parameters which can be customized to produce a variety of profiles based on the user’s needs. This basic layout is used throughout the HIVE algorithmic tools to keep the interface consistent and intuitive. Furthermore, the Reference Gene Selection box at the top is identical to the Results box from the prior page, with the only exception being that there are now checkboxes next to each reference segment of the reference genome used in the related alignment. To begin, check the box next to the gene or genes in the table you wish to profile. Alternatively, you may click the Analyze All References button below the parameters box to profile all references after parameters are set.

Clicking the help tab will display general help text about the SNP Profiling utility in addition to links to information summarizing the alignment results display.

Figure 2. SNP Profiling Inputs Page Layout

After reference genes are selected, you are now ready to specify parameters.

1.2 Accessing SNP Profiling from Home Directory

Once you are logged into HIVE, the Cloud Processes box (second from the top) on the user home page lists all processes initiated by or shared with the logged-in user. To find alignment results, you can look in either the all processes tab or the algorithmic processes tab, browsing manually or using the search tool. For tutorial purposes, the alignment result in which we are interested has process ID 9366 and is named Demo_Align_1 (see Figure 3). Selecting this file (highlighting by clicking) will display process statistics in the progress tab to the left and allow you to view and/or change some metadata in the details tab. Upon clicking done under the Status column of the appropriate process, you will be redirected to the corresponding dna-hexagon results page.

Figure 3. Accessing Alignment Results in Home Cloud Processes Directory

From here, the process continues exactly as in Section 1.1 above. After reference genes have been selected, the user may now consider parameters.

2. INPUT PARAMETERS

Below the Reference Gene Selection box is the expandable Algorithmic Parameters box, which allows highly customizable profiling. All parameters are populated with default values, so a user is not required to specify any values merely to run the profile; they need only click the Analyze or Analyze All References button below after selecting inputs. To expand or collapse the section, click the icon to the left of the section header.

The parameters shown by default on the portal page are those which are most often user-modified. However, there are many additional parameters that may be accessed by clicking the icon in the top left corner of the Algorithmic Parameters box.

2.1 Most frequently modified parameters

Name: Specify a name for the process. If the user does not enter a name, HIVE will supply one following the convention of concatenated reference names. For example, one selected reference will be given the exact name of that reference, i.e. NS.fa. If you select two or more references, the default name will be a list of the names separated by a space, i.e. NS.fa PB1.fa. If you choose to Analyze All References the default process name will be all references.
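The default naming convention described above can be sketched in a few lines of Python. The function name default_process_name and its arguments are hypothetical and chosen only for illustration; this is not HIVE code.

```python
def default_process_name(selected_refs, analyze_all=False):
    """Return the default process name described in the tutorial (hypothetical helper)."""
    if analyze_all:
        # Analyze All References uses a fixed default name.
        return "all references"
    # One reference keeps its exact name; several are separated by a space.
    return " ".join(selected_refs)


print(default_process_name(["NS.fa"]))
print(default_process_name(["NS.fa", "PB1.fa"]))
print(default_process_name([], analyze_all=True))
```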

Profile type: The user may specify to profile relative to the reference genome or relative to the consensus sequence.

Entropic Cutoff: Entropy calculations are computationally expensive and are therefore disabled by default. However, this parameter gives the user the option to request that entropy calculations be performed and then either ignored or applied at various levels of stringency. The entropy value associated with a certain position is indicative of randomness, such that higher entropy values imply greater confidence in the base-calling procedure at that position. The Review Only option tells the tool to calculate and report entropy values but not to discard any base-calls due to low values. The other four numerical options (0.95, 0.90, 0.75 and 0.67 One Sigma) correspond to a scale on which a value of 0.0 can be considered 100% biased and nonrandom and a value of 1.0 can be considered completely random. Selecting one of these options will filter SNP-calls at positions with entropies lower than the specified value.
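As a rough illustration of the cutoff logic, the following Python sketch assumes the entropy is a Shannon entropy normalized to the range 0.0 to 1.0 over some vector of counts observed at a position. The tutorial does not state which distribution or formula HIVE actually uses, so normalized_entropy and passes_entropic_cutoff are hypothetical names and the formula itself is an assumption.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of a count vector, scaled to 0.0-1.0 (assumed formula)."""
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    # Dividing by log(k) maps a uniform distribution to 1.0.
    return h / math.log(k)


def passes_entropic_cutoff(counts, cutoff):
    """Return True if the call is kept; cutoff=None models the Review Only option."""
    if cutoff is None:
        return True
    return normalized_entropy(counts) >= cutoff


print(passes_entropic_cutoff([10, 10, 10, 10], 0.95))  # evenly spread counts
print(passes_entropic_cutoff([40, 0, 0, 0], 0.67))     # fully biased counts
```

Under this assumption, evenly spread counts score close to 1.0 and are kept at any of the listed cutoffs, while fully biased counts score 0.0 and are filtered.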

Treat filtered positions: A number of filters are built into the SNP Profiling utility such as coverage, frequency and entropy filters. This parameter specifies the way the filtered positions are treated in outputs. Default set to Ignore means to leave the corresponding fields empty whereas Fill with zeros will populate the corresponding fields with zeros. This will affect the rendering of the associated visualizations.

2.2 Hidden Algorithmic Parameters

More customizable parameters can be viewed by clicking the expansion button found on the top left corner of the Algorithmic Parameters box.

Base Quality Filter: This parameter tells the tool to ignore SNP calls at positions with Phred scores below the selected threshold. Default set to 20.

Count of Inserts Invalidating Repeats: Single base repeats are known to cause erroneous insertions of the repeated base due to an inability to appropriately quantify the signal at a position. If an insertion has a repeated region, this parameter allows the user to specify the maximum length of the single base-pair repeat after which the insertion is ignored. Default set to 3 meaning to ignore and not report potentially called insertions resulting after the 3rd base repeat.

Forward/Reverse Disbalance: The user must provide a number that corresponds to the minimum percentage of difference between forward and reverse base-calls. At positions with disbalance above that percentage, the coverage in the direction with larger coverage is counted as equal to that of the direction with smaller coverage. The idea here is that we know why coverage in one direction may be over-represented but do not have good explanations for low coverage. The default value of 0 means coverage is counted as reported.
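The capping rule can be sketched as follows. The tutorial does not give the exact formula for the disbalance percentage, so this sketch assumes it is the absolute forward/reverse difference as a percentage of total coverage; the function name adjusted_coverage is hypothetical.

```python
def adjusted_coverage(forward, reverse, disbalance_pct=0):
    """Return (forward, reverse) coverage after the disbalance rule (illustrative only)."""
    total = forward + reverse
    if disbalance_pct == 0 or total == 0:
        # A setting of 0 means coverage is counted as reported.
        return forward, reverse
    # Assumed definition: difference as a percentage of total coverage.
    observed_pct = 100.0 * abs(forward - reverse) / total
    if observed_pct > disbalance_pct:
        # Cap the larger direction at the level of the smaller one.
        smaller = min(forward, reverse)
        return smaller, smaller
    return forward, reverse


print(adjusted_coverage(900, 100))       # default: unchanged
print(adjusted_coverage(900, 100, 50))   # disbalance above 50%: capped
print(adjusted_coverage(600, 400, 50))   # disbalance below 50%: unchanged
```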

Maximum Percentage of Low-Quality Regions: This is an overall quality filter of aligned sequences as a percentage of original query length. Thus if the average quality over a query read corresponds to a percentage of low-quality regions above the defined cutoff, the entire read will be excluded for counts related to SNP-calling. Default is set to no cutoff implying that all reads will be considered.

Minimal Alignment Length: This allows the user to specify the minimal alignment length (as a percentage of the original query length) below which the alignment will be excluded. Default is set to no cutoff meaning all alignments are considered in profiling computation.

Minimal Coverage Allowed: The user may provide a number to specify the minimum coverage of alignment such that any region with coverage less than the specified threshold will be excluded from the base-calling procedure. Default set to 10 means a region must have a coverage of at least 10 to participate in profiling computations.

Noise: Several factors like partially failed reactions, inappropriate DNA concentrations, decay in fluorescence intensity in later cycles and improper removal of fluorophores can lead to noise in sequencing data. The profiling algorithm contains some parameters to filter noise from data to facilitate greater confidence in base calling.

Noise Filtering Parameters: This parameter allows the user to decide how to treat noise with respect to SNP calling. The default is set to Do not filter noise, which will report all variant positions and their respective frequencies. Automatically filter 95% noise will calculate the threshold under which 95% of variants lie for the associated read-reference comparison. All variants below this threshold will be discarded. Use the noise level of plasmid profiling experiment allows the user to specify a profile run for a clean plasmid of the same reference under the same experimental conditions and use the noise levels from that profile (indicative of experimental error) to define significance for the experimental profiles.
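The Automatically filter 95% noise option can be pictured with the sketch below, which takes a list of (position, frequency) variants, finds the frequency at or below which 95% of them lie, and discards everything under it. The nearest-rank quantile used here is an assumption, as are the names noise_threshold and filter_noise; the tutorial does not specify how HIVE computes the threshold.

```python
def noise_threshold(frequencies, percent=95):
    """Nearest-rank quantile: value at or below which `percent`% of frequencies lie."""
    if not frequencies:
        raise ValueError("no variant frequencies supplied")
    ordered = sorted(frequencies)
    # Integer ceiling of percent * n / 100, kept within 1..n (1-based rank).
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def filter_noise(variants, percent=95):
    """Keep (position, frequency) pairs at or above the noise threshold."""
    threshold = noise_threshold([freq for _, freq in variants], percent)
    return [(pos, freq) for pos, freq in variants if freq >= threshold]


# Nineteen low-frequency variants plus one clear outlier at position 20.
variants = [(i, i / 1000) for i in range(1, 20)] + [(20, 0.30)]
print(filter_noise(variants))
```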

Noise Profile Maximum Cutoff: Define the percentage of variants which are considered significant. Default set to 0.01 indicating the top 1% of variants with respect to their frequency distribution is considered and the bottom 99% discarded.

Noise Profile Resolution: Noise profile resolution defines the size of the bins for distribution graphing. Default set to 0.0001.

Noise Profile Process ID: If using the Noise Filtering Parameter option Use the noise level of plasmid profiling experiment, you must specify the HIVE process ID in this field. The dropdown menu opens your file directory to facilitate easy selection of the appropriate process.

Number of Computational Subjects per Single Thread: This parameter allows the user to specify the maximum number of sequences per thread to be profiled by a single compute node. Default is set to 500,000.

Profile only repeated regions: Checking this box will only profile the regions with repetitions and therefore more than one alignment site. Unchecked by default, the algorithm will profile the entire length of the reference genome unless otherwise specified.

Reference genome serial number: Automatically populated by selected reference genome, if applicable.

SNP Threshold:

Action on SNP Threshold: If the user defines a threshold in the field for the parameter Minimal Frequency of SNP Threshold, this tells the algorithm how to treat the filtered position. Dropdown default set to Ignore Position means all information related to the position will be ignored for all possible calculations whereas Ignore SNP means positional information (counts and coverage) will still be included in calculations but potential SNPs will not be called at this position.

Minimal Frequency of SNP Threshold: Allows user to define a frequency threshold below which potential SNPs will not be called. Default set to 0 means all potential SNPs will be called regardless of frequency.
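The interaction between the two SNP Threshold parameters can be sketched as below. Each record is assumed to describe one potential SNP as a dictionary with a variant_frequency field; the record layout and the function name apply_snp_threshold are hypothetical and only illustrate the two dropdown actions.

```python
def apply_snp_threshold(record, min_frequency=0.0, action="Ignore Position"):
    """Return the record to keep for downstream calculations, or None if dropped."""
    if record["variant_frequency"] >= min_frequency:
        # At or above the threshold: the potential SNP is called.
        return dict(record, snp_called=True)
    if action == "Ignore Position":
        # The position is excluded from all calculations.
        return None
    if action == "Ignore SNP":
        # Counts and coverage are retained, but no SNP is called here.
        return dict(record, snp_called=False)
    raise ValueError(f"unknown action: {action}")


candidate = {"position": 120, "coverage": 500, "variant_frequency": 0.02}
print(apply_snp_threshold(candidate))                      # default threshold 0
print(apply_snp_threshold(candidate, 0.05))                # Ignore Position
print(apply_snp_threshold(candidate, 0.05, "Ignore SNP"))  # Ignore SNP
```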

Safe Entropy Zone: This parameter is a visualization aid that allows the graphical engine to display positions in the safe zone using fewer points (computationally less taxing) while showing positions under the safe threshold at a higher resolution. Default set to 0.0 means to render all positions equally.

Truncate Terminals: This option allows the user to cut the ends of the reference genome by the specified length. This may be useful when dealing with primers of a known length. Default set to do not cut means to profile over the entire length of the specified reference genome.

Once values are set as preferred and inputs have been designated, click the Analyze or Analyze All References button to compute the SNP profile. For tutorial purposes, leave the parameters at default presets but rename the process “Demo_Profile_1”.

3. JOB PROCESSING

Click the Analyze All References button to start the job. The page will be refreshed and two new boxes will be displayed.

The HIVE-hexagon Profiler box tracks the progress of all related processes and nodes. By clicking on the expand node icon found in the top left side of this box, you can view the progress of every subcomponent of this task. As each process advances, its status will change from Waiting to Running to Done. The entire profile is complete when all statuses read Done and the progress bar reads 100% completion.

Preliminary and final results will populate the bottom Next-Gen Profile section as the profiler progresses. The time elapsed clock will stop, but the run-time will continue to be displayed. Click the refresh button on the left side of this box to ensure you have the complete results.

4. PROFILING RESULTS

The Next-Gen Profile results box, like the alignment results box, has two components: the left summary viewer and the right detailed visualization viewer (Figure 4). The results section shows details for the corresponding selected reference gene. If more than one gene was included in the profile, HIVE will automatically select the reference gene with the greatest number of hits. To specify an alternate gene, you must select (highlight) this gene in the top Reference Gene Selection box.

The default view of the SNP profiler is the summary tab, a table of profile information with respect to the selected reference, on the left and the profile tab with graphical representations of coverage and SNP-calling on the right. A detailed description of the Next-Gen Profile results tabs follows.

4.1 Summary results tabs

summary: This table displays quantitative information about the computed SNP profile belonging to three categories.

Figure 4. Layout of SNP-Profiling Results

1) General Information: gives the length of the selected reference genome and the number of reference genes/segments/sequences involved

2) Mapped Regions: contains information on reads mapped to the specified reference

Total Contig Length is the length when all contigs are placed end to end. When there is only one contig (in the case of the reference genome) the contig length will be the length of the entire genome.

Mapped Coverage (% Reference): tells the percent of the reference genome that is hit by the selected sequences

Coverage on Contigs: average number of reads per nucleotide in the contigs

Total Number of Contigs: the number of contigs that make up the selected reference sequence

3) Unmapped Regions: displays information on reads that do not map to the specified reference

Total Length of the Unmapped Regions: gives the number of positions in the reference that do not map to the sequence

Unmapped Regions (% Reference): converts the number of unmapped positions into a percentage of the reference

Coverage on Gaps: is the average number of reads per position that map to gaps

Total Number of Gaps Found: gives the number of positions for which a gap is present in the alignment
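To make the relationship between these summary fields concrete, the sketch below derives several of them from a list of per-position read depths. It assumes a contig is a maximal run of positions with non-zero coverage and an unmapped region is a run with zero coverage; the function name summarize_coverage and the dictionary keys are hypothetical and do not reflect HIVE's internal representation.

```python
from itertools import groupby


def summarize_coverage(coverage):
    """Summarize per-position read depth into contig and unmapped-region statistics."""
    length = len(coverage)
    # Group consecutive positions by whether they are covered at all.
    runs = [(covered, list(depths))
            for covered, depths in groupby(coverage, key=lambda d: d > 0)]
    contigs = [depths for covered, depths in runs if covered]
    unmapped = [depths for covered, depths in runs if not covered]
    contig_length = sum(len(c) for c in contigs)
    unmapped_length = sum(len(u) for u in unmapped)
    return {
        "total_contig_length": contig_length,
        "mapped_coverage_pct": 100.0 * contig_length / length if length else 0.0,
        "coverage_on_contigs": (sum(map(sum, contigs)) / contig_length
                                if contig_length else 0.0),
        "total_number_of_contigs": len(contigs),
        "unmapped_length": unmapped_length,
        "unmapped_pct": 100.0 * unmapped_length / length if length else 0.0,
    }


# A toy 10-position reference with two covered stretches.
print(summarize_coverage([5, 5, 0, 0, 10, 10, 10, 0, 0, 0]))
```

For a reference that is covered end to end, this yields a single contig spanning the full length with no unmapped positions.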

Figure 5 shows an up close view of the summary table for the demo SNP-Profiling output. The selected reference is the segment NS.fa, one of 8 influenza reference segments having a length of 934. The entire NS.fa sequence is covered by hits from the query reads, with an average coverage of 379,013 per position. There are no unmapped regions and no gaps.

Figure 5. Next-Gen Profile summary table

downloads: This tab allows the user to download portions or all of the profile outputs in a number of formats.

gaps and contigs: The table shows count, position and coverage information for all gaps and contigs. Figure 6 shows the corresponding table for our tutorial data. There is only 1 contig (the entire reference segment), which starts at position 1 and ends at position 934 with an average coverage of 379,013.

Figure 6. Gaps and Contigs Table

annot-files: If annotations have been supplied by the user, this tab will list the relevant annotation files.

4.2 Visualization results tabs

profile: The default view shows two graphs stacked on top of each other. The top coverage graph shows forward and reverse coverage as a function of position on the reference genome (or consensus, depending on the user’s input selection), where coverage is the number of reads aligned to a specific position. The bottom SNP graph shows variations from the reference (or consensus) sequence at a given position. Figure 7 shows the default profile representation for the NS.fa reference segment of the demo profile results.

Figure 7. Default profile tab view

Holding the mouse over any graph will display details about that point on the graph. For example, hovering over a peak in the SNP graph will display an information box containing the position in the sequence and the percentage of variation. If you then click on that peak, a zoomed-in view centered on that location (Figure 8) will pop up, providing more information on the nucleotide variation and the percentage of that variation with respect to all reads in the vicinity of that position. The tabs on this zoomed-in viewer also provide alignment and stack views of the associated alignment over the local region in view.

Figure 8. Zoomed in graphical view

Four additional graphs, disbalance, entropy, quality and inDel% may be expanded and viewed by clicking on the corresponding icons found below the SNP% graph, or collapsed by clicking the related icons. Figure 9 shows all six graphs for the tutorial profile.

Figure 9. Expanded profile view

The disbalance graph displays any disbalance between forward and reverse sequence alignments for a given position. It is normal to see lower values at the start and end of the sequence in this graph.

The entropy graph displays entropy associated with each position, scaled to a value between 0.00 and 1.00. Entropy values close to 1.00 imply the occurrence of the variable base at that position is completely random, whereas entropy values closer to 0.00 imply non-randomness of a given SNP. If the user disabled the entropy analysis in the commands section before computing the profile, the entropy graph will display a straight continuous line with slope of 0.

The quality graph displays the quality associated with each position as a Phred score. Phred, a program developed for the Human Genome Project, calculates a number of parameters related to peak values and resolutions for each base in a sequence. The parameter values are then used to look up a score in a corresponding table generated from experiments on known sequences. The Phred quality score, Q, is defined with respect to the base-calling error probability, P, such that:

Q = -10 × log10(P)

Thus, any Phred score over 20 implies a less than 1% chance that the base is called incorrectly. The higher the Phred score, the lower the associated probability of incorrect base-calling.
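The relationship between Q and P can be checked numerically with a short Python sketch. The function names are hypothetical; only the formula itself comes from the definition above.

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Compute the Phred score Q from an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))
```

A score of 20 corresponds to an error probability of 1 in 100, which is why the default Base Quality Filter threshold of 20 excludes calls with more than a 1% chance of being wrong.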

The InDel% graph displays the percentage of reads showing insertions and/or deletions at a given position with respect to the specified reference.

sequencing_noise: The graph in this tab shows the distribution of specific base variations (A-C, A-T, etc.) with respect to their frequencies. There is significantly higher abundance of variants with lower frequency. The corresponding chart below reports the threshold frequency associated with various percentage intervals above which variation is deemed significant for each of the twelve specific base variations.

Figure 10. Noise

consensus: This tab shows the consensus sequence of reads covering the region of the selected reference segment. The consensus sequence calls the base with highest frequency over all reads. Depending on the length of the sequence, it may not appear entirely in the window, in which case you will have to scroll right to view the whole sequence (Figure 11).

Figure 11. Consensus sequence
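The consensus rule (call the base with the highest frequency at each position) can be sketched as follows. The input layout, a list of per-position base-count dictionaries, and the function name consensus_sequence are hypothetical. How HIVE resolves ties is not stated in this tutorial, so the A, C, G, T tie order used here is an assumption.

```python
def consensus_sequence(base_counts):
    """Call the most frequent base at each position (illustrative sketch)."""
    bases = "ACGT"
    called = []
    for counts in base_counts:
        # max() returns the first maximal item, so ties resolve in A, C, G, T order.
        called.append(max(bases, key=lambda b: counts.get(b, 0)))
    return "".join(called)


pileup = [
    {"A": 90, "G": 10},
    {"C": 5, "T": 95},
    {"G": 50, "A": 1},
]
print(consensus_sequence(pileup))
```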

SNP calls: The SNP graph in the profile tab shows all variations at all positions, but this table (Figure 12) summarizes that information and only displays position, quality and count details for those positions with a frequency of variation higher than the specified threshold. If annotation files have been supplied, you can filter results by selecting an option under the Annotations menu in the toolbar. The default threshold is 0.05.

Figure 12. SNP calls view
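The filtering behind the SNP calls table can be approximated with the sketch below, which reports only variants whose frequency exceeds the threshold (0.05 by default). The input layout, the row format and the function name snp_calls are hypothetical; the actual table also carries quality details that are omitted here.

```python
def snp_calls(reference, base_counts, threshold=0.05):
    """List (position, ref_base, variant_base, count, frequency) above the threshold."""
    rows = []
    for pos, (ref_base, counts) in enumerate(zip(reference, base_counts), start=1):
        coverage = sum(counts.values())
        if coverage == 0:
            continue  # nothing aligned here, so nothing to call
        for base, count in sorted(counts.items()):
            if base == ref_base:
                continue
            frequency = count / coverage
            if frequency > threshold:
                rows.append((pos, ref_base, base, count, frequency))
    return rows


reference = "AC"
pileup = [{"A": 90, "G": 10}, {"C": 98, "T": 2}]
print(snp_calls(reference, pileup))
```

In this toy input, the G variant at position 1 has a frequency of 0.10 and is reported, while the T variant at position 2 has a frequency of 0.02 and is not.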

ORF Visualization: If annotations have been supplied, this tab will show two graphs. One will show the position of significant SNPs according to the threshold set in the SNP calls tab plotted against their frequencies, while the other will show the position of all open reading frames annotated to occur along the length of the specified reference. This stacking of visualization facilitates simple discovery of SNPs in biologically relevant positions and quickly links genomic and proteomic disciplines. Graphs are downloadable by clicking on the Download graph as SVG file links.

5. WHAT NEXT?

HIVE is constructed modularly to facilitate end-to-end flows of different analytic tools. After computing an SNP profile, you have the option to go Back to Alignment results or Modify and Resubmit your profile process. Both of these options can be found on the top right side of the Next-Gen Profile box where the text reads what can you do next (See Figure 12).

You will soon be able to continue forward to the Clusterization tool which uses the output from the SNP profiler to create a tree displaying the various reads in terms of distance as a function of similarity.

This concludes the HIVE SNP-Profiler tutorial. Please see the other tutorials or the tutorial videos available on HIVE main pages for further information.
