Bioinformatics master course, ‘11/’12 Paolo Marcatili Parsing a File with Perl Regexp, substr and oneliners

Page 1: Regexp master 2011

Bioinformatics master course, ‘11/’12

Paolo Marcatili

Parsing a File with Perl

Regexp, substr and oneliners

Page 2: Regexp master 2011

Agenda

Today we will see how to
• Extract information from a file
• Use substr and regexp

We already know how to use:
• Scalar variables $ and arrays @
• If, for, while, open, print, close…

Page 3: Regexp master 2011

Task Today

Page 4: Regexp master 2011

Protein Structures

1st task:
• Open a PDB file
• Apply a symmetry transformation
• Extract data from the file header

Page 5: Regexp master 2011

Zinc Finger

2nd task:
• Open a FASTA file
• Find all occurrences of zinc fingers

(homework?)

Page 6: Regexp master 2011

Parsing

Page 7: Regexp master 2011

Rationale

Biological data -> human readable files

If you can read it, Perl can read it as well

*BUT* it can be tricky

Page 8: Regexp master 2011

Parsing flow-chart

Open the file
For each line {
    look for “grammar” and store data
}
Close file
Use data
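The flow-chart above can be sketched in Perl as follows. This is only an illustration: `collect_records` is a hypothetical helper, the "grammar" is simply "the line starts with a given keyword", and an in-memory string stands in for a real file so that the sketch runs anywhere.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every line that follows a simple "grammar"
# (here: the line starts with a given keyword).
sub collect_records {
    my ($fh, $keyword) = @_;
    my @records;
    while (my $line = <$fh>) {                                  # for each line
        chomp $line;
        if (substr($line, 0, length $keyword) eq $keyword) {    # look for "grammar"
            push @records, $line;                               # store data
        }
    }
    return @records;
}

# An in-memory "file" (made-up content) replaces a real file on disk.
my $text = "HEADER demo\nATOM first\nATOM second\nEND\n";
open(my $in, '<', \$text) or die "cannot open in-memory file: $!";   # open the file
my @atoms = collect_records($in, 'ATOM');
close $in;                                                           # close file
print scalar(@atoms), " ATOM lines found\n";                         # use data
```

With a real file, only the `open` line changes: pass a file name instead of the reference to `$text`.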

Page 9: Regexp master 2011

Substr

Page 10: Regexp master 2011

Substr

substr($data, start, length)

returns a substring of the expression supplied as the first argument.

Page 11: Regexp master 2011

Substr

substr($data, start, length)

$data    your string
start    where to start, counting from 0
length   you can omit this (you will then extract up to the end of the string)

Page 12: Regexp master 2011

Substr

substr($data, start, length)

Examples:

my $data = "il mattino ha l'oro in bocca";
print substr($data, 0) . "\n";      # prints the whole string
print substr($data, 3, 5) . "\n";   # prints matti
print substr($data, 23) . "\n";     # prints bocca
print substr($data, -5) . "\n";     # prints bocca

Page 13: Regexp master 2011

Pdb rotation

Page 14: Regexp master 2011

PDB

ATOM      4  O   ASP L   1      43.716 -12.235  68.502  1.00 70.05           O
ATOM      5  N   ILE L   2      44.679 -10.569  69.673  1.00 48.19           N
…

COLUMNS    DATA TYPE      FIELD      DEFINITION
--------------------------------------------------------------------------
 1 -  6    Record name    "ATOM  "
 7 - 11    Integer        serial     Atom serial number.
13 - 16    Atom           name       Atom name.
17         Character      altLoc     Alternate location indicator.
18 - 20    Residue name   resName    Residue name.
22         Character      chainID    Chain identifier.
23 - 26    Integer        resSeq     Residue sequence number.
27         AChar          iCode      Code for insertion of residues.
31 - 38    Real(8.3)      x          Orthogonal coordinates for X in Angstroms
39 - 46    Real(8.3)      y          Orthogonal coordinates for Y in Angstroms
47 - 54    Real(8.3)      z          Orthogonal coordinates for Z in Angstroms
55 - 80    Bla Bla Bla (not useful for our purposes)
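As a rough sketch of how the column table maps onto substr: the table counts columns from 1, while substr counts offsets from 0, so columns 31 - 38 become `substr($line, 30, 8)`. The helper `parse_atom` is hypothetical, and the example record is rebuilt with `sprintf` so that its fixed-width padding is explicit.

```perl
use strict;
use warnings;

# Hypothetical helper: cut the fields of one ATOM record out by column.
# offset = first column - 1, length = last column - first column + 1
sub parse_atom {
    my ($line) = @_;
    my %atom = (
        serial  => substr($line,  6, 5),   # columns  7 - 11
        name    => substr($line, 12, 4),   # columns 13 - 16
        resName => substr($line, 17, 3),   # columns 18 - 20
        chainID => substr($line, 21, 1),   # column  22
        resSeq  => substr($line, 22, 4),   # columns 23 - 26
        x       => substr($line, 30, 8),   # columns 31 - 38
        y       => substr($line, 38, 8),   # columns 39 - 46
        z       => substr($line, 46, 8),   # columns 47 - 54
    );
    s/^\s+|\s+$//g for values %atom;       # trim the padding spaces
    return %atom;
}

# Rebuild the first example record with its fixed-width padding.
my $line = sprintf("%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f",
                   "ATOM", 4, " O", " ", "ASP", "L", 1, " ",
                   43.716, -12.235, 68.502, 1.00, 70.05);
my %atom = parse_atom($line);
print "$atom{resName} $atom{chainID}$atom{resSeq}: x=$atom{x} y=$atom{y} z=$atom{z}\n";
```

Splitting on whitespace would look simpler, but it breaks as soon as two fields touch (for example a coordinate such as -100.000 that fills all 8 columns), which is why fixed columns are used here.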

Page 15: Regexp master 2011

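Symmetry

X -> Z
Y -> X
Z -> Y

(X is replaced by the old Z, Y by the old X, Z by the old Y)

(Figure: coordinate axes X and Y)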

Page 16: Regexp master 2011

Rotation

#!/usr/bin/perl -w

use strict;

open(IG, "<IG.pdb") || die "cannot open IG.pdb:$!";
open(IGR, ">IG_rotated.pdb") || die "cannot open IG_rotated.pdb:$!";
while (my $line = <IG>) {
    if (substr($line, 0, 4) eq "ATOM") {
        # columns 31-38, 39-46 and 47-54 hold X, Y and Z
        my $X = substr($line, 30, 8);
        my $Y = substr($line, 38, 8);
        my $Z = substr($line, 46, 8);
        # write the coordinates back in the order Z, X, Y
        print IGR substr($line, 0, 30) . $Z . $X . $Y . substr($line, 54);
    }
    else {
        # every other line is copied unchanged
        print IGR $line;
    }
}
close IG;
close IGR;

Page 17: Regexp master 2011

RegExp

Page 19: Regexp master 2011

Regular Expressions

PDB files have a “fixed” structure.

What if we want to do something like “check for a valid email address”…
1. There must be some letters or numbers
2. There must be a @
3. Other letters
4. [email protected] is good

[email protected] is not good

Page 20: Regexp master 2011

Regular Expressions

$line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/

WHAAAT???

This means: check if $line has some letters, digits, dots or underscores at the beginning, then @, then some non-dot characters, then a dot, then at least two letters.

…OK, let’s start from something simpler :)
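To see the pattern at work, here is a small sketch. It assumes the character class is meant to be "lower-case letters, digits, dots and underscores"; `looks_like_email` is a hypothetical helper and the addresses are made up. It is a teaching pattern, not a real email validator.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the email pattern.
sub looks_like_email {
    my ($line) = @_;
    return $line =~ m/^[a-z0-9._]+@[^.]+\.[a-z]{2,}$/ ? 1 : 0;
}

# Made-up test addresses.
for my $candidate ('john.smith@example.org', 'john.smith@example', 'no-at-sign.org') {
    printf "%-25s %s\n", $candidate, looks_like_email($candidate) ? 'good' : 'not good';
}
```

Only the first address passes: the second has no dot after the @, and the third has no @ at all.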

Page 22: Regexp master 2011

Regular Expressions

$line =~ m/^ATOM/
Line starts with ATOM

$line =~ m/^ATOM\s+/
Line starts with ATOM, then there are some spaces

$line =~ m/^ATOM\s+[\-0-9]+/
Line starts with ATOM, then there are some spaces, then there are some digits or -

$line =~ m/^ATOM\s+\-?[0-9]+/
Line starts with ATOM, then there are some spaces, then there can be a minus, then some digits
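The difference between the four patterns is easiest to see by printing what each one actually matches. This is a sketch only: `matched_part` is a hypothetical helper and the test line is made up to show where the last two patterns differ.

```perl
use strict;
use warnings;

my @patterns = (
    qr/^ATOM/,               # starts with ATOM
    qr/^ATOM\s+/,            # ... then some whitespace
    qr/^ATOM\s+[\-0-9]+/,    # ... then any mix of digits and minus signs
    qr/^ATOM\s+\-?[0-9]+/,   # ... then an optional minus and some digits
);

# Hypothetical helper: return the part of the line a pattern matched,
# or undef when it does not match at all.
sub matched_part {
    my ($line, $pattern) = @_;
    return $line =~ /($pattern)/ ? $1 : undef;
}

my $line = "ATOM  -12-3 rest";   # made-up line, not a real PDB record
for my $pattern (@patterns) {
    my $part = matched_part($line, $pattern);
    print defined $part ? "'$part'\n" : "no match\n";
}
```

On this line the third pattern swallows "-12-3", while the fourth stops after "-12", because it allows a single minus only in front of the digits.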

Page 25: Regexp master 2011

PDB Header

We want to find the %id for the L and H chains

$pidL = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_L:([.0-9]+)/);
$pidH = $1 if ($line =~ m/REMARK SUMMARY-ID_GLOB_H:([.0-9]+)/);

ONELINER!!

cat IG.pdb | perl -ne 'print "$1\n" if ($_ =~ m/^REMARK SUMMARY-ID_GLOB_([LH]:[.0-9]+)/);'
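The same extraction can be written as a small script that keeps both values in a hash. This is a sketch under assumptions: the header lines are made up, their layout (no space after the colon) is inferred from the pattern on the slide, and `chain_identities` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: scan header lines and return the %id of each chain.
sub chain_identities {
    my (@lines) = @_;
    my %pid;
    for my $line (@lines) {
        # $1 is the chain letter, $2 the number that follows the colon
        if ($line =~ m/^REMARK SUMMARY-ID_GLOB_([LH]):([.0-9]+)/) {
            $pid{$1} = $2;
        }
    }
    return %pid;
}

# Made-up header lines; the real values in IG.pdb will differ.
my @header = (
    "REMARK SUMMARY-ID_GLOB_L:91.2",
    "REMARK SUMMARY-ID_GLOB_H:87.5",
    "ATOM      4  O   ASP L   1",
);
my %pid = chain_identities(@header);
print "L: $pid{L}\nH: $pid{H}\n";
```

Capturing the chain letter in its own group means that one pattern serves both chains, instead of one pattern per chain.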

Page 26: Regexp master 2011

Zinc Finger

Page 27: Regexp master 2011

Zinc Finger

Zinc fingers are a large superfamily of protein domains that can bind to DNA.

A zinc finger consists of two antiparallel β strands, and an α helix.

The zinc ion is crucial for the stability of this domain type - in the absence of the metal ion the domain unfolds as it is too small to have a hydrophobic core.

The consensus sequence of a single finger is:

C-X{2-4}-C-X{3}-[LIVMFYWC]-X{8}-H-X{3}-H
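As a starting point, the consensus can be translated piece by piece into Perl regexp syntax: X becomes `.` and a range such as {2-4} becomes `{2,4}`. The sketch below assumes the sequence is already a single string without line breaks; `find_zinc_fingers` is a hypothetical helper and the test sequence is made up.

```perl
use strict;
use warnings;

# The consensus in Perl regexp syntax:
#   X{2-4} -> .{2,4}    X{3} -> .{3}    X{8} -> .{8}
my $zf = qr/C.{2,4}C.{3}[LIVMFYWC].{8}H.{3}H/i;   # /i also accepts lower-case sequences

# Hypothetical helper: return every non-overlapping match in a sequence.
sub find_zinc_fingers {
    my ($sequence) = @_;
    my @hits;
    while ($sequence =~ /($zf)/g) {
        push @hits, $1;
    }
    return @hits;
}

# Made-up sequence, not taken from zincfinger.fasta.
my @found = find_zinc_fingers("MKCALCVGNFGLAPGLIFHTYLHQQ");
print "$_\n" for @found;   # prints CALCVGNFGLAPGLIFHTYLH
```

In a FASTA file a sequence usually spans several lines, so the lines of each record need to be joined (and the newlines removed) before the pattern is applied.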

Page 29: Regexp master 2011

Homework

Find all occurrences of the ZF motif in zincfinger.fasta

Put them in the file ZF_motif.fasta

e.g. weofjpihouwefghoicalcvgnfglapglifhtylhyuiui

-> calcvgnfglapglifhtylh