Nagarajan Kathiresan, Ph.D., Computational Scientist, KAUST Supercomputing Lab, nagarajan.kathiresan@kaust.edu.sa
Agenda
• UNIX tools for Bioinformatics
• A simple job script in a command line!
• What is workflow? How can I build it?
• How to address the job dependencies?
• When to use Job arrays?
• Optimization in workflow design.
Note: Some Bioinformatics tools, such as bwa (Burrows-Wheeler Aligner), Samtools, and Picard/GATK, are used for the explanations.
UNIX tools for Bioinformatics
Data transfer and Search for pattern

Move data between two systems
The rsync utility synchronizes files and directories between two different servers.
• Copying from the local machine to a remote machine:
rsync local_directory remote_server_name:remote_directory
• Copying from a remote machine to the local machine:
rsync remote_server_name:remote_directory local_directory
-a archive mode
-r recursive over subdirectories
-v verbose
-x don't cross filesystem boundaries
-H preserve hard links
-P show progress
-n no-op, or dry-run
$ rsync -arvxHP my_data [email protected]:/ibex/scratch/kathirn/work/my_data/
Search for pattern
• grep, egrep, fgrep
• wc
• | (Pipe character)
• cut
• awk
• sort
• uniq
• …

Working with genome files
Fasta
Indexed Fasta
Compressed Fastq
Compressed VCF
BAM
SAM
Sorted BAM
GTF
Working with fasta file
$ more Aegilops_tauschii.Aet_v4.0.ncrna.fa

Extract the headers from the FASTA file
grep, egrep, fgrep → print lines matching a pattern
-i, --ignore-case → ignore case
-v, --invert-match → invert the match: get the lines that do not match the pattern
-w, --word-regexp → get the lines where the pattern matches a whole word
-o, --only-matching → get only the matching part
egrep = grep -E (--extended-regexp)
fgrep = grep -F (--fixed-strings)
Word count
wc → count the number of lines, words and characters (bytes) in a given file
$ wc Aegilops_tauschii.Aet_v4.0.ncrna.fa
13525 48871 1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa
$ wc -l Aegilops_tauschii.Aet_v4.0.ncrna.fa
13525 Aegilops_tauschii.Aet_v4.0.ncrna.fa
$ wc -c Aegilops_tauschii.Aet_v4.0.ncrna.fa
1247270 Aegilops_tauschii.Aet_v4.0.ncrna.fa
$ wc -w Aegilops_tauschii.Aet_v4.0.ncrna.fa
48871 Aegilops_tauschii.Aet_v4.0.ncrna.fa

Question: How do I count the number of sequences in the above fasta file?
Answer:
$ grep -c ">" Aegilops_tauschii.Aet_v4.0.ncrna.fa
3732
Why?
• Every sequence has exactly one header line starting with ">", so counting the headers counts the sequences.
• A single sequence identification can be followed by many sequence lines, so counting all lines would overestimate.

Sequence identification:
>ENSRNA050031380-T1 ncrna chromosome:Aet_v4.0:2D:126982204:126982306:-1 gene:ENSRNA050031380 gene_biotype:snRNA transcript_biotype:snRNA gene_symbol:U6 description:U6 spliceosomal RNA
Sequence:
ACTATATAAAAAACTTCCAATTTTAGTGGAACTATACAGAGAAGATTAGCATGGCCCCGACGCAAGGATGACACACACGAATTGAGAAATGATCCAAATTTTT
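The reasoning above can be checked on a small file. This is a minimal sketch, assuming an invented two-record FASTA file (toy.fa and its contents are not from the slides). Anchoring the pattern with ^ counts only lines that begin with ">".

```shell
#!/bin/bash
# Toy FASTA with 2 records; the first sequence is wrapped over two lines.
# File name and contents are invented for illustration.
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>seq1 description:first toy record
ACGTACGTAC
GTACGT
>seq2 description:second toy record
TTTTAAAA
EOF

# Count records by counting header lines (lines starting with ">").
n_records=$(grep -c '^>' "$workdir/toy.fa")

# Counting all lines mixes headers with wrapped sequence lines.
n_lines=$(wc -l < "$workdir/toy.fa")

echo "records: $n_records, lines: $n_lines"
```

The line count is larger than the record count as soon as one sequence spans more than one line, which is why the header count is the safer measure.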
Combining the commands
| → Pipe character: the output of the command on the left becomes the input of the command on the right.
Example grep option: -io (ignore-case and only-matching)
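As an illustration of combining commands with pipes, the sketch below counts how often each gene_biotype value occurs in FASTA headers. The input file is an invented toy example; real Ensembl headers carry more fields.

```shell
#!/bin/bash
# Toy FASTA headers (invented for this demo).
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>t1 gene_biotype:snRNA
ACGT
>t2 gene_biotype:tRNA
GGCC
>t3 gene_biotype:snRNA
TTAA
EOF

# grep -io : case-insensitive, print only the matching part of each line
# sort     : bring identical values together (uniq needs sorted input)
# uniq -c  : count each distinct value
biotype_counts=$(grep -io 'gene_biotype:[a-z]*' "$workdir/toy.fa" | sort | uniq -c)
echo "$biotype_counts"
```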
Useful data processing tools!
cut → extracts columns (fields) from a file
Use: cut -f FIELD_LIST FILE_NAME

Useful tableview!
https://github.com/informationsea/tableview/releases/download/v0.4.6/tableview_linux_amd64
$ cat sample.vcf | grep -v "#" | tableview_linux_amd64
grep option -v: invert-match (here it drops every line containing "#", i.e. the VCF header lines)
Column header of the VCF file:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
Cont. … cut command!
Uniq and sort
AWK → scans each line and performs some actions: awk '{action1} …'
Combine commands: awk + pipe + uniq + sort …
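A short sketch of how cut, sort, uniq and awk can be combined. The tab-separated table below is invented for illustration (columns: CHROM, POS, REF, ALT) and is not one of the files from the slides.

```shell
#!/bin/bash
# Toy tab-separated variant table (invented): CHROM, POS, REF, ALT.
workdir=$(mktemp -d)
printf '1\t100\tA\tG\n1\t250\tC\tT\n2\t300\tG\tA\n1\t900\tT\tC\n' > "$workdir/toy.tsv"

# cut -f1  : keep only the first tab-separated column
# sort     : bring identical chromosome names together
# uniq -c  : count records per chromosome
per_chrom=$(cut -f1 "$workdir/toy.tsv" | sort | uniq -c)

# awk: print POS (column 2) for records on chromosome 1 with POS > 200
high_pos=$(awk -F'\t' '$1 == "1" && $2 > 200 { print $2 }' "$workdir/toy.tsv")

echo "$per_chrom"
echo "$high_pos"
```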
A simple job script
Minimum 3 parameters:
1. sbatch: submit a batch script to Slurm
2. time: set a limit on the total run time of the job allocation
--time=days-hours:minutes:seconds
-t days-hours:minutes:seconds
3. wrap: wrap the specified command string in a simple "sh" shell script and submit it to the Slurm controller
Example:
$ sbatch --time=00:10 --wrap="hostname"
Output: slurm-9024853.out
$ cat slurm-9024853.out
cn603-28-l
Caution note (by default SLURM allocation)
• memory = 2GB
• CPU = 1 core
• Node = 1 node

Job script (batch jobs)
$ cat my_job.sh
#!/bin/bash
#SBATCH --time=00:10
hostname
$ sbatch ./my_job.sh
Submitted batch job 7438
$ cat slurm-7438.out
cn512-05-r
Workflow - example
Step #1: Genome mapping/alignment (BWA)
Step #2: Compress the Sequence Alignment Map file, SAM to BAM (Samtools)
Step #3: Sorting the BAM file (Samtools)
Step #4: Index for BAM files (Samtools)
Step #5: Chromosome interval for research interest (Samtools)
Step #6: Mark or remove the duplicates (GATK/Picard)
Step #1: Genome alignment
Burrows-Wheeler Aligner:
• bwa index ref.fa
• bwa mem ref.fa reads.fq > aln-se.sam
• bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
• bwa aln ref.fa short_read.fq > aln_sa.sai
• bwa samse ref.fa aln_sa.sai short_read.fq > aln-se.sam
• bwa sampe ref.fa aln_sa1.sai aln_sa2.sai read1.fq read2.fq > aln-pe.sam
• bwa bwasw ref.fa long_read.fq > aln.sam
http://bio-bwa.sourceforge.net/bwa.shtml
Prerequisites
1. Genome Reference file (GRCh37, HG19 …)
2. Genome Reference - Index files
3. Sample data (Single or Pair-end)
Caution note:
• By default, BWA runs sequentially (1 core).
• It supports multi-threading for parallelization through the option -t.
• The uppercase option -T is different: it sets the minimum alignment score for output.
Reference available: /ibex/reference/KSL/
Example: BWA MEM
bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
Step #1: Check the availability of the software
$ module av bwa
------------- /sw/csi/modulefiles/applications -----------------------
bwa/0.7.17/gnu-6.4.0 bwakit/0.7.15/binary-0.7.15
Step #2: Load the software module
$ module load bwa/0.7.17/gnu-6.4.0
Step #3: Prepare a job submission script
Command:
$ bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam
SLURM Script:
$ sbatch --time=00:10 --wrap="bwa mem /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"
Caution note in resource allocation:
• 2 GB memory
• 1 Core
Cont. . . (Optimized script)
SLURM Script:
$ sbatch \
--time=2:00:00 \
--mem=100GB \
--cpus-per-task=16 \
--wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam"
Submitted batch job 9028055
$ cat slurm-9028055.out
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1600000 sequences (160000000 bp)...
[M::process] read 1600000 sequences (160000000 bp)...
Batch job script
$ cat BWA_MEM_batch.sh
#!/bin/bash
#SBATCH --time=2:00:00
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
## Software
module load bwa/0.7.17/gnu-6.4.0
## Command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SRR396636.sra_1.fastq SRR396636.sra_2.fastq > SRR396636.sam
Job submitted using sbatch
$ sbatch ./BWA_MEM_batch.sh
Submitted batch job 7439
Standard output/error will be written to slurm-JOBID.out
$ cat slurm-7439.out
Loading module for BWA
BWA 0.7.17 is now loaded
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 1600000 sequences (160000000 bp)...
[M::process] read 1600000 sequences (160000000 bp)...

How can I run 100+ genome samples?
$ ls -lrta *_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2125471805 Dec 17 2013 NIST7086_CGTACTAG_L002_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2083510543 Dec 17 2013 NIST7086_CGTACTAG_L002_R1_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2081364133 Dec 17 2013 NIST7086_CGTACTAG_L001_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2037956271 Dec 17 2013 NIST7086_CGTACTAG_L001_R1_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 2001172486 Dec 17 2013 NIST7035_TAAGGCGA_L002_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 1962477139 Dec 17 2013 NIST7035_TAAGGCGA_L002_R1_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 1954935121 Dec 17 2013 NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
-rw-r--r-- 1 kathirn g-kathirn 1914722761 Dec 17 2013 NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
------ DATA PROCESSING ------
Steps for processing more samples
Samples: S1, S2, S3, … Sn
Take the samples one by one (SAMPLE_NAME) and submit the job below for each, until all samples are done:
$ sbatch \
--time=2:00:00 \
--mem=100GB \
--cpus-per-task=16 \
--wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta SAMPLE_NAME_1.fastq SAMPLE_NAME_2.fastq > SAMPLE_NAME.sam"
In UNIX script
1. Get the unique list of samples
$ ls *_R1_001.fastq.gz
NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
NIST7086_CGTACTAG_L001_R1_001.fastq.gz
NIST7035_TAAGGCGA_L002_R1_001.fastq.gz
NIST7086_CGTACTAG_L002_R1_001.fastq.gz
2. Parse sample by sample
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
do
echo $SAMPLE_NAME;
done
Cont.
3. Get the UNIQUE sample name
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
do
echo `basename $SAMPLE_NAME _R1_001.fastq.gz`;
done
Output:
NIST7035_TAAGGCGA_L001
NIST7035_TAAGGCGA_L002
NIST7086_CGTACTAG_L001
NIST7086_CGTACTAG_L002
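The same prefixes can also be obtained without parsing the output of ls. The sketch below uses a shell glob and parameter expansion, which keeps working if a path contains spaces. The empty placeholder files are created only for this demo.

```shell
#!/bin/bash
# Empty placeholder files named like the samples above (created for the demo).
workdir=$(mktemp -d)
touch "$workdir/NIST7035_TAAGGCGA_L001_R1_001.fastq.gz" \
      "$workdir/NIST7035_TAAGGCGA_L001_R2_001.fastq.gz" \
      "$workdir/NIST7086_CGTACTAG_L001_R1_001.fastq.gz" \
      "$workdir/NIST7086_CGTACTAG_L001_R2_001.fastq.gz"

prefixes=()
for r1 in "$workdir"/*_R1_001.fastq.gz; do
    name=${r1##*/}                    # strip the directory part
    prefix=${name%_R1_001.fastq.gz}   # strip the R1 suffix
    prefixes+=("$prefix")
done

printf '%s\n' "${prefixes[@]}"
```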
Cont.
4. Multiple Job submission using FOR LOOP
$ module load bwa/0.7.17/gnu-6.4.0
$ for SAMPLE_NAME in `ls *_R1_001.fastq.gz`;
do
PREFIX=`basename $SAMPLE_NAME _R1_001.fastq.gz`;
sbatch \
--time=2:00:00 \
--mem=100GB \
--cpus-per-task=16 \
--wrap="bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam"
done
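Before submitting many jobs from a loop, it can help to print the commands first and review them. The following is a dry-run sketch under stated assumptions: the placeholder files S1 and S2 are invented, and nothing is submitted to Slurm.

```shell
#!/bin/bash
# Dry-run sketch: build and print one sbatch command per sample.
# The placeholder files below are assumptions made for this demo.
REF=/ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta
workdir=$(mktemp -d)
touch "$workdir/S1_R1_001.fastq.gz" "$workdir/S1_R2_001.fastq.gz" \
      "$workdir/S2_R1_001.fastq.gz" "$workdir/S2_R2_001.fastq.gz"

planned=()
for r1 in "$workdir"/*_R1_001.fastq.gz; do
    prefix=${r1%_R1_001.fastq.gz}
    r2=${prefix}_R2_001.fastq.gz
    # Skip samples whose mate file is missing instead of planning a broken job.
    [ -e "$r2" ] || { echo "missing mate for $r1" >&2; continue; }
    cmd="bwa mem -t 16 $REF $r1 $r2 > $prefix.sam"
    planned+=("sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap=\"$cmd\"")
done

# Only print the commands; nothing is submitted here.
printf '%s\n' "${planned[@]}"
```

Once the printed commands look right, they can be pasted into the shell or the print step can be swapped for the real submission.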
In a batch script (as a Job array)
#!/bin/bash
#SBATCH --job-name=BWA_MEM
#SBATCH --output=BWA_MEM.%A_%a.out
#SBATCH --error=BWA_MEM.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-4
## Software
module load bwa/0.7.17/gnu-6.4.0
## My variables
SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1`
PREFIX=`basename $SAMPLE _R1_001.fastq.gz`
## Job command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz > ${PREFIX}.sam

Prerequisite: Array size = number of samples
$ sbatch ./bwa_mem_array.sh
Submitted batch job 7440
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
7440_1 batch BWA_MEM kathirn R 0:53 1 cn509-23-l
7440_2 batch BWA_MEM kathirn R 0:53 1 cn509-23-l
7440_3 batch BWA_MEM kathirn R 0:53 1 cn512-05-r
7440_4 batch BWA_MEM kathirn R 0:53 1 cn512-05-r
$ ls -lrta *.sam
-rw-r--r-- 1 kathirn g-kathirn 2790260736 Feb 9 16:41 NIST7086_CGTACTAG_L002.sam
-rw-r--r-- 1 kathirn g-kathirn 3783000064 Feb 9 16:41 NIST7086_CGTACTAG_L001.sam
-rw-r--r-- 1 kathirn g-kathirn 3024093184 Feb 9 16:41 NIST7035_TAAGGCGA_L002.sam
-rw-r--r-- 1 kathirn g-kathirn 3978428416 Feb 9 16:41 NIST7035_TAAGGCGA_L001.sam
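Since the array size has to match the number of samples, the range can be derived from the sample list rather than typed by hand. The sketch below assumes three invented placeholder files and only prints the submit command. SLURM_ARRAY_TASK_ID is normally set by Slurm and is faked here to show how a task picks its sample. Options given on the sbatch command line take precedence over the #SBATCH lines in the script.

```shell
#!/bin/bash
# Placeholder R1 files (created for the demo) standing in for real samples.
workdir=$(mktemp -d)
touch "$workdir/A_R1_001.fastq.gz" "$workdir/B_R1_001.fastq.gz" "$workdir/C_R1_001.fastq.gz"

# Write the sample list once, so every array task sees the same order.
ls "$workdir"/*_R1_001.fastq.gz > "$workdir/samples.txt"
n_samples=$(wc -l < "$workdir/samples.txt")

# Command to submit (printed only): the array range follows the sample count.
submit_cmd="sbatch --array=1-$((n_samples)) ./bwa_mem_array.sh"
echo "$submit_cmd"

# Inside the job script, each task picks its own line of the list.
# SLURM_ARRAY_TASK_ID is set by Slurm; it is faked here for illustration.
SLURM_ARRAY_TASK_ID=2
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$workdir/samples.txt")
prefix=$(basename "$sample" _R1_001.fastq.gz)
echo "task $SLURM_ARRAY_TASK_ID -> $prefix"
```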
Step 2: SAM to BAM files
Samtools to convert SAM files to BAM
#!/bin/bash
module load samtools/1.8
for SAMPLE_NAME in `ls *.sam`;
do
PREFIX=`basename $SAMPLE_NAME .sam`;
sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools view --threads 16 -b -S -h -q 30 ${SAMPLE_NAME} > ${PREFIX}.bam"
done
• SAM files are very large
• A BAM file is the compressed binary version of SAM
• Good to use BAM files; it is safe to delete the SAM files once the BAM files are available
• Note that -q 30 keeps only alignments with mapping quality of at least 30, so this BAM is a filtered copy of the SAM
File sizes:
• 1.8G NIST7035_TAAGGCGA_L001_R1_001.fastq.gz
• 1.9G NIST7035_TAAGGCGA_L001_R2_001.fastq.gz
• 13G NIST7035_TAAGGCGA_L001.sam
• 3.4G NIST7035_TAAGGCGA_L001.bam
Step 3: convert bam to sorted bam
Sort the BAM files using samtools
#!/bin/bash
module load samtools/1.8
for SAMPLE_NAME in `ls *.bam`;
do
PREFIX=`basename $SAMPLE_NAME .bam`;
sbatch --time=2:00:00 --mem=100GB --cpus-per-task=16 --wrap="samtools sort --threads 16 -T ${PREFIX} ${SAMPLE_NAME} -o ${PREFIX}.sorted.bam"
done
End-of-Step 3!
What are the files generated?
• sam files (generated from the genome alignment)
• unsorted bam files (generated by samtools, part of data compression)
• sorted bam files (generated by samtools)
Do we need all these intermediate files generated? IF NOT ?!
*.fastq.gz → *.sam → *.bam → *.sorted.bam
$ bwa mem -t 16 $REF ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 -T ${PREFIX} - > ${PREFIX}.sorted.bam
3-in-1 !?
*.fastq.gz → *.sorted.bam (with the pipe, the intermediate *.sam and *.bam files are no longer written to disk)
#!/bin/bash
#SBATCH --job-name=BWA_MEM
#SBATCH --output=BWA_MEM.%A_%a.out
#SBATCH --error=BWA_MEM.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-4
# Software
module load bwa/0.7.17/gnu-6.4.0
module load samtools/1.8
# My variables
SAMPLE=`ls *_R1_001.fastq.gz | head -n $SLURM_ARRAY_TASK_ID | tail -n 1`
PREFIX=`basename $SAMPLE _R1_001.fastq.gz`
# Job command
bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta ${PREFIX}_R1_001.fastq.gz ${PREFIX}_R2_001.fastq.gz | samtools view --threads 16 -b -S -h -q 30 - | samtools sort --threads 16 - > ${PREFIX}.sorted.bam
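One point worth keeping in mind with a piped job command: by default bash reports only the exit status of the last command in a pipeline, so a failure in an early stage can go unnoticed. The sketch below shows the difference made by set -o pipefail, using false and cat as stand-ins for a failing first stage and a normally exiting last stage (they are placeholders, not the real tools).

```shell
#!/bin/bash
# 'false' stands in for a failing first stage (e.g. an aligner that crashes);
# 'cat' stands in for the downstream stages that still exit normally.

# Default behaviour: the pipeline reports the status of the LAST command only.
set +o pipefail
if false | cat > /dev/null; then status_default=0; else status_default=$?; fi

# With pipefail, a failure in any stage makes the whole pipeline fail.
set -o pipefail
if false | cat > /dev/null; then status_pipefail=0; else status_pipefail=$?; fi
set +o pipefail

echo "default: $status_default, pipefail: $status_pipefail"
```

Adding set -o pipefail near the top of a job script is one way to make Slurm mark such a job as failed instead of completed.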
Step 4: index the bam files
Index the BAM files using samtools
#!/bin/bash
module load samtools/1.8
for SAMPLE_NAME in `ls *.sorted.bam`;
do
PREFIX=`basename $SAMPLE_NAME .sorted.bam`;
sbatch --time=30:00 --mem=100GB --cpus-per-task=1 --wrap="samtools index ${SAMPLE_NAME}"
done
Summary: list of files generated
Sorted BAM files:
• -rw-r--r-- 1 kathirn g-kathirn 2.6G Feb 4 12:14 NIST7035_TAAGGCGA_L002.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.5G Feb 4 12:14 NIST7035_TAAGGCGA_L001.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L001.sorted.bam
• -rw-r--r-- 1 kathirn g-kathirn 2.7G Feb 4 12:14 NIST7086_CGTACTAG_L002.sorted.bam
Index of Sorted BAM files:
• -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:25 NIST7086_CGTACTAG_L001.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L001.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.5M Feb 4 12:26 NIST7035_TAAGGCGA_L002.sorted.bam.bai
• -rw-r--r-- 1 kathirn g-kathirn 3.4M Feb 4 12:26 NIST7086_CGTACTAG_L002.sorted.bam.bai
List of chr. in each BAM file
@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:249250621
@SQ SN:2 LN:243199373
@SQ SN:3 LN:198022430
@SQ SN:4 LN:191154276
@SQ SN:5 LN:180915260
@SQ SN:6 LN:171115067
@SQ SN:7 LN:159138663
@SQ SN:8 LN:146364022
@SQ SN:9 LN:141213431
@SQ SN:10 LN:135534747
@SQ SN:11 LN:135006516
@SQ SN:12 LN:133851895
@SQ SN:13 LN:115169878
@SQ SN:14 LN:107349540
@SQ SN:15 LN:102531392
@SQ SN:16 LN:90354753
@SQ SN:17 LN:81195210
@SQ SN:18 LN:78077248
@SQ SN:19 LN:59128983
@SQ SN:20 LN:63025520
@SQ SN:21 LN:48129895
@SQ SN:22 LN:51304566
@SQ SN:X LN:155270560
@SQ SN:Y LN:59373566
@SQ SN:MT LN:16569
@SQ SN:GL000207.1 LN:4262
@SQ SN:GL000226.1 LN:15008
@SQ SN:GL000229.1 LN:19913
@SQ SN:GL000231.1 LN:27386
@SQ SN:GL000210.1 LN:27682
@SQ SN:GL000239.1 LN:33824
….
@SQ SN:NC_007605 LN:171823
@SQ SN:hs37d5 LN:35477943
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 16 /ibex/reference/KSL/human_g1k_v37_decoy/human_g1k_v37_decoy.fasta NIST7035_TAAGGCGA_L002_R1_001.fastq.gz NIST7035_TAAGGCGA_L002_R2_001.fastq.gz
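The chromosome names and lengths can be pulled out of such a header with awk. The sketch below works on a few header lines saved to a toy text file. On a real BAM file the same lines would come from samtools view -H, which is not run here.

```shell
#!/bin/bash
# A few header lines copied from the listing above, saved as a toy file.
# On a real BAM file the same lines come from: samtools view -H FILE.bam
workdir=$(mktemp -d)
printf '@HD\tVN:1.5\tSO:coordinate\n@SQ\tSN:1\tLN:249250621\n@SQ\tSN:2\tLN:243199373\n@SQ\tSN:MT\tLN:16569\n' > "$workdir/header.txt"

# Keep only @SQ lines; strip the SN: and LN: tags; print name and length.
chrom_table=$(awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); sub(/^LN:/, "", $3); print $2 "\t" $3 }' "$workdir/header.txt")
echo "$chrom_table"
```

The names in the first column are the ones that have to be used in region queries such as 1:10000-15000.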
Step 5: Chromosome interval for research interest
Objective: Generate a chunk of the BAM file that covers the interval 10,000-15,000 on Chromosome 1, Chromosome 2, etc.
Solution:
$ samtools view NIST7035_TAAGGCGA_L002.sorted.bam 1:10000-15000 | more
HWI-D00119:50:H7AP8ADXX:2:1214:6356:27283 163 1 10354 60 89M12S = 10354 96 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAACCCTAAGCCCCGGCA 8??DBDBAFF>?FGAFFIIFF9ED8;CCDFDED3?9?@?0?B@?DFF(DHECCC@@HGHHGIIIEECC==BCDFFECECECCCCCCDCDCECC NM:i:0 MD:Z:101 MC:Z:101M AS:i:101 XS:i:71
….
The same query is repeated for Chr1, Chr2, and so on.
Cont. Any Optimal or better Solution!?
#!/bin/bash
#SBATCH --job-name=Region_of_Interest
#SBATCH --output=Region_of_Interest.%A_%a.out
#SBATCH --error=Region_of_Interest.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --cpus-per-task=16
#SBATCH --array=1-2
## My variables
SAMPLE=NIST7035_TAAGGCGA_L002.sorted.bam
PREFIX=NIST7035_TAAGGCGA_L002
REGION="10000-15000"
## Software
module load samtools/1.8
## Job command to get Region of Interest from Chromosome 1 & 2
samtools view --threads 16 -b -o ${PREFIX}.${SLURM_ARRAY_TASK_ID}.${REGION}.sorted.bam ${SAMPLE} ${SLURM_ARRAY_TASK_ID}:${REGION}
Caution note!
• Job array indices must be non-negative integers (no fractions, no characters, no special symbols, no alpha-numeric values, etc.)
• When the chromosome is named "Chr1", the task ID has to be mapped to the chromosome name, for example:
if [ ${SLURM_ARRAY_TASK_ID} -eq 1 ]
then
: # …. do something ….
else
: # …. do something ….
fi
A batch job (job array) is required to get the value of ${SLURM_ARRAY_TASK_ID}.
To view the chromosome region, a genome browser can be used (e.g. IGV).
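When there are more than two chromosomes, an if/else chain becomes long. One alternative is a bash array indexed by the task ID. This is a sketch: the chromosome names below are example values and should be taken from the BAM header of the actual data, and SLURM_ARRAY_TASK_ID is faked because no job is running.

```shell
#!/bin/bash
# Chromosome names as they would appear in the BAM header (example values).
CHROMS=(Chr1 Chr2 ChrX ChrY)

# SLURM_ARRAY_TASK_ID is provided by Slurm (e.g. --array=1-4); faked here.
SLURM_ARRAY_TASK_ID=3

# Bash arrays are 0-based, while the array task IDs here start at 1.
CHROM=${CHROMS[$((SLURM_ARRAY_TASK_ID - 1))]}
REGION="${CHROM}:10000-15000"
echo "task ${SLURM_ARRAY_TASK_ID} -> ${REGION}"
```

The resulting REGION string can then be passed to samtools view in place of the numeric chromosome name.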
Step 6: mark duplicate(s)
Any Optimal or better Solution!?
#!/bin/bash
#SBATCH --job-name=MarkDupe
#SBATCH --output=MarkDupe.%A_%a.out
#SBATCH --error=MarkDupe.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --mem=100GB
#SBATCH --array=1-4
## My variables
SAMPLE=`ls *.sorted.bam | head -n $SLURM_ARRAY_TASK_ID | tail -n 1`
PREFIX=`basename $SAMPLE .sorted.bam`
## Software
module load gatk/4.0.1.1
## Job command
gatk MarkDuplicates --INPUT $SAMPLE --OUTPUT $PREFIX.duped.sorted.bam --METRICS_FILE $PREFIX.txt --REMOVE_DUPLICATES true

Prerequisite: Array size = number of samples
End-of-Step 6
Many job scripts, multiple files, manual steps, etc. → Single job script (with job dependency)
Job dependency
sbatch --dependency=...
• after:jobid[:jobid...] → job can begin after the specified jobs have started
• afterany:jobid[:jobid...] → job can begin after the specified jobs have terminated
• afternotok:jobid[:jobid...] → job can begin after the specified jobs have failed
• afterok:jobid[:jobid...] → job can begin after the specified jobs have run to completion with an exit code of zero
• singleton → job can begin execution after all previously launched jobs with the same name and user have ended. This is useful to collate results of a swarm or to send a notification at the end of a swarm.
Source: https://hpc.nih.gov/docs/job_dependencies.html
Job dependency - Example
$ cat dependent.sh
#!/bin/bash
## Any bugs/issues, please e-mail: [email protected]
echo "Submitting 5 jobs with 4 job dependency condition";
## Submit First job
First_CMD="sleep 40";
First_Job="sbatch --partition=batch --job-name=First_Step --time=30:00 --output=First-%J.out --error=First-%J.err --nodes=1";
First_ID=$(${First_Job} --parsable --wrap="${First_CMD}");
echo "First Job submitted (\" ${First_CMD} is executing \") and this job id is " ${First_ID};
## Execute the Second job only when First job is successful
Second_CMD="hostname";
Second_Job="sbatch --partition=batch --job-name=Second_Step --time=30:00 --output=Second-%J.out --error=Second-%J.err --nodes=1";
Second_ID=$(${Second_Job} --parsable --dependency=afterok:${First_ID} --wrap="${Second_CMD}");
echo " Second Job (\" ${Second_CMD} \") was submitted (Job_ID=${Second_ID}) and it will execute when the First Job_ID=${First_ID} is successful"
echo " The status of running jobs are ..."
echo "-----------------------------------------------------------------------------------------------------------"
squeue -u $USER -l
echo "-----------------------------------------------------------------------------------------------------------"
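The two-step pattern above extends to a longer chain with a loop. The following is a dry-run sketch: submit() is a hypothetical helper written for this example, the three script names are invented, and with DRY_RUN=1 no job is submitted (a fake job ID is generated instead). With DRY_RUN=0 it would call the real sbatch with --parsable and --dependency=afterok, as in the example above.

```shell
#!/bin/bash
# Dry-run sketch of a dependency chain. submit() is a hypothetical helper:
# with DRY_RUN=1 it records the planned command and invents a job ID;
# otherwise it calls the real sbatch. The result is stored in LAST_JOB_ID.
DRY_RUN=1
fake_id=1000
PLAN=()

submit() {
    if [ "$DRY_RUN" -eq 1 ]; then
        fake_id=$((fake_id + 1))
        LAST_JOB_ID=$fake_id
        PLAN+=("job $LAST_JOB_ID: sbatch $*")
    else
        LAST_JOB_ID=$(sbatch --parsable "$@")
        # --parsable may append ";cluster" on multi-cluster systems; drop it.
        LAST_JOB_ID=${LAST_JOB_ID%%;*}
    fi
}

# Hypothetical script names, one per workflow step, in execution order.
steps=("bwa_mem.sh" "sort_bam.sh" "index_bam.sh")
dep=""
for script in "${steps[@]}"; do
    if [ -z "$dep" ]; then
        submit "$script"                                # first step: no dependency
    else
        submit "--dependency=afterok:$dep" "$script"    # wait for the previous step
    fi
    dep=$LAST_JOB_ID
done

printf '%s\n' "${PLAN[@]}"
```

Storing the job ID in a variable instead of echoing it keeps the counter visible to the caller, which a command substitution would not.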
Workflow
Source: Computational and Bioinformatics Frameworks for Next-Generation Whole Exome and Genome Sequencing
Source: Best Practices for Variant Discovery in DNAseq
Is this simple and/or Optimized?
workflow optimization
Workflow for multiple samples: different software/tools for every job step
• TRIMMOMATIC_JAR: Cores = 4
• bwa mem: Cores = 16
• GATK 4.x MarkDuplicates: Cores = 1
• gatk AddOrReplaceReadGroups: Cores = 1
• samtools index: Cores = 1
• GATK 3.x HaplotypeCaller: Cores = 16
• bgzip: Cores = 1
• tabix: Cores = 1
Optimal Heterogeneous resource allocation
• Automated the workflow: 8 scripts = single script
• Heterogeneous resource allocation used: 64 cores = optimal # cores
• Turnaround time minimized: Unpredicted = Predicted
• Optimized resource allocations: Max. cores = Optimal
• Job monitoring and job control: Complex = Simplified
  • Job restart
  • Job statistics/report
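Heterogeneous allocation means each step requests only the cores it can use. As a rough sketch, the per-step request can be kept in a small table and turned into a --cpus-per-task option. The step labels are shorthand invented here, and pairing the tools with the core counts in the order they are listed above is an assumption.

```shell
#!/bin/bash
# Step labels (shorthand) and core counts, paired in listed order (assumption).
STEPS=(trimmomatic bwa_mem mark_duplicates add_read_groups samtools_index haplotype_caller bgzip tabix)
CORES=(4 16 1 1 1 16 1 1)

total=0
for i in "${!STEPS[@]}"; do
    # One sbatch option per step instead of one fixed size for all steps.
    printf '%-18s --cpus-per-task=%s\n' "${STEPS[$i]}" "${CORES[$i]}"
    total=$((total + CORES[i]))
done
echo "sum of per-step cores: $total"
```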
Outcome: every step runs over sample 1, sample 2, … sample N
Step 1: Read trimming
Step 2: Read Mapping
Step 3: Mark Duplicate
Step 4: Add/Replace Read Groups
Step 5: Indexing
Step 6: Haplotype caller
Step 7: Compress the gVCF
Step 8: gVCF Index
Acknowledgements: Elodie Rey (Prof. Mark Tester ) Michael D. Abrouk (Prof. Simon Krattinger)
THANKS!
Time for Questions and your feedback!