Recent concepts on root canal chelation

Unix: Beyond the Basics

1

Login

• Requesting a tak account

http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php

• Windows

PuTTY

Need to setup X-windows for graphical display

• Macs

Access the Terminal: Go Utilities Terminal

2

Login using Secure Shell

ssh –Y user@tak

3

Command Prompt user@tak ~$

PuTTY on Windows

Terminal on Macs

Hot Topics website: http://jura.wi.mit.edu/bio/education/hot_topics/

• Please login to our tak server, and create a directory for the

exercises in your lab share, and use it as your working directory $ mkdir unix2

$ cd unix2

• Download all files into your working directory

• You should have the files below in your working directory: – foo.txt, sample1.txt, exercise.txt , join2filesByFirstColumn.pl , datasets folder

– you can check with ls command

4

Unix Commands Review (Unix Essentials)

$ sort –k2,3 foo.tab

$ cut –f1,5 foo.tab

$ cut –f1-5 foo.tab

$ wc –l foo.txt

How many lines are in this file?

-f: select only these fields -f1,5: select 1st and 5th fields -f1-5: select 1st, 2nd, 3rd, 4th, and 5th fields

5

start end

-n or -g: sorting numbers -n is recommended than –g, except for scientific notation or a leading '+' -r: reverse order

http://jura.wi.mit.edu/bio/education/hot_topics/

What you will learn…

• sed • awk • groupBy (bedtools) • loops • join files together • scripting

6

sed: stream editor for filtering and transforming text

• Print lines 10 - 15:

• Delete 5 header lines at the beginning of a file:

• Remove all version numbers (ex: '.1') from the end of a list of sequence accessions:

$ sed '1,5d' file > fileNoHeader

$ sed -n '10,15p' bigFile > selectedLines

$ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly

7

s: substitute g: global modifier d: delete line p: print the current pattern space -n: only print those lines matching our pattern

awk • Name comes from the original authors:

Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan

• A simple programing language

• Good for short programs, including parsing files

8

awk

9

• Print the 2nd and 1st fields of the file: $ awk ' { print $2"\t"$1 } ' foo.tab • Convert sequences from tab delimited format to fasta format: $ awk ' { print ">"$1"\n"$2 } ' foo.tab > foo.fa $ head -1 foo.tab Seq1 ACTGCATCAC $ head -2 foo.fa >Seq1 ACGCATCAC

#: comment, ignored by awk By default, awk splits each line by spaces

Character Description

\n newline

\r carriage return

\t horizontal tab

awk: field separator

• Issues with default separator: white space

– one field is gene description with multiple words

– consecutive empty cells

• To use tab as the separator:

10

$ awk –F "\t" '{ print NF }' foo.txt or $ awk 'BEGIN {FS="\t"} { print NF }' foo.txt BEGIN: action before read input NF: number of fields in the current record FS: input field separator OFS: output field separator END: action after read input

awk: arithmetic operations

11

Add average values of 4th and 5th fields to the file: $ awk '{ print $0"\t"($4+$5)/2 }' foo.tab $0: all fields

Operator Description

+ Addition

- Subtraction

* Multiplication

/ Division

% Modulo

^ Exponentiation

** Exponentiation

awk: making comparisons

12

Print out records if values in 4th or 5th field are above 4: $ awk '{ if( $4>4 || $5>4 ) print $0 } ' foo.tab

Sequence Description

> Greater than

< Less than

<= Less than or equal to

>= Greater than or equal to

== Equal to

!= Not equal to

~ Matches

!~ Does not match

|| Logical OR

&& Logical AND

awk • Conditional statements:

• Looping:

if (condition1) action1 else if (condition2) action2 else action3

for ( i = 1; i <= NF; i++ ) print $i

13

Display expression levels for the gene NANOG: $ awk '{ if(/NANOG/) print $0 }' foo.txt or $ awk '/NANOG/ { print $0 } ' foo.txt or $ awk '/NANOG/' foo.txt Add line number to the above output: $ awk '/NANOG/ { print NR”\t”$0 }' foo.txt NR: number of the current record

Calculate the average expression (4th, 5th and 6th fields in this case) for each transcript $ awk '{ total= $4 + $5 + $6; avg=total/3; print $0"\t"avg}' foo.txt or $ awk '{ total=0; for (i=4; i<=6; i++) total=total+$i; avg=total/3; print $0"\t”avg }' foo.txt

sub function for substitution: sub( regexp, replacement, target)

Summarize by Columns: groupBy (from bedtools)

-g grpCols column(s) for grouping

-c -opCols column(s) to be summarized

-o Operation(s) applied to opCol: sum, count, min, max, mean, median, stdev, collapse distinct freqdesc (i.e., print desc. list of values:freq) freqasc (i.e., print asc. list of values:freq)

Print the maximum value (3rd column) for each gene (2nd field) in the file $ groupBy -g 2 -c 3 -o max foo.tab

14

Join files together • Unix join

• BaRC scripts: /nfs/BaRC_Public/BaRC_code/Perl/

sorting not required $ join2filesByFirstColumn.pl file1 file2 $ submatrix_selector.pl matrixFile rowIDfile or 0 [columnIDfile]

$ join -1 2 -2 3 FILE1 FILE2 Join files on the 2nd field of FILE1 with the 3rd field of FILE2, only showing the common lines. FILE1 and FILE2 must be sorted on the join fields before running join

15

Shell Flavors

• Syntax (for scripting) depends on which shell echo $SHELL

• Several shells available, bash is the default on tak and commonly used.

• Some of the shells (incomplete listing):

16

Shell Name

sh Bourne

bash Bourne-Again

ksh Korn shell

csh C shell

Shell Script

• Automation: avoid having to retype the same commands many times

• Ease of use and more efficient

• Outline of a script:

#!/bin/bash shebang: interprets how to run the script commands… set of commands used in the script

#comments write comments using “#”

• Commonly used extension for script is .sh (eg. foo.sh), file must have executable permission

17

Bash Shell: for loop

• Process multiple files with one command

• Reduce computational time with many cluster nodes

for myfile in `ls dataDir` do bsub wc –l $myfile done When referring to a variable, $ is needed before the variable name ($myfile), but $ is not needed when defining it (myfile). Note: this syntax is different from awk

18

Shell Script Example #!/bin/sh

# 1. Take two arguments: the first one is a directory with all the datasets, the second one is an output directory

# 2. For each data, calculate average gene expression, and save the results in a file in the output directory

inDir=$1 # 1st argument

outDir=$2 # 2nd argument; outDir must already exist

# Define variables: no spaces on either side of the equal sign

for i in `/bin/ls $inDir ` # refer to variable with $

do

# output file name

name="${i}_gene.txt" # {}: $i_gene not valid; prevent misinterpretation of variable as #characters

# calculate average gene expression

# NM_001039201 Hdhd2 5.0306 5.3309 5.4998

sort -k2,2 ${inDir}/${i} | groupBy -g 2 -c 3,4,5 -o mean,mean,mean >| ${outDir}/${name}

done

19

# You can use graphical editors such as nedit, gedit, xemacs to create shell scripts

Further Reading

• BaRC one liners: – http://iona.wi.mit.edu/bio/bioinfo/scripts/#unix

• Unix Info for Bioinfo: – http://iona.wi.mit.edu/bio/barc_site_map.php

• Online books via MIT: – http://proquest.safaribooksonline.com/

• MIT: Turning Biologists into Bioinformaticists – http://rous.mit.edu/index.php/Teaching

• Bash Guide for Beginners: – http://tldp.org/LDP/Bash-Beginners-Guide/html/

20

http://iona.wi.mit.edu/bio/bioinfo/scripts/#unix



http://iona.wi.mit.edu/bio/barc_site_map.php

http://iona.wi.mit.edu/bio/barc_site_map.php

http://proquest.safaribooksonline.com/



http://rous.mit.edu/index.php/Teaching

http://rous.mit.edu/index.php/Teaching

http://tldp.org/LDP/Bash-Beginners-Guide/html/







Documents

Recent concepts on root canal chelation