Crash course on Python programming

Presented by: Shailender Nagpal, Al Ritacco
Research Computing, UMASS Medical School

09/17/2012, Information Services

AGENDA

• Python Basics: Lists, Tuples, Expressions, Printing
• Built-in functions, Blocks, Branching, Loops
• Hash arrays, String and Array operations
• File reading and writing
• Writing custom functions
• Regular expressions: Find/replace/count
• Providing input to programs
• Using Python scripts with the LSF cluster

What is Python?

• Python is a high-level, general-purpose, interpreted, interactive programming language

• Provides a simple, iterative, top-down, left-to-right programming environment in which users can create both small and large programs

Features of Python

• Python code is portable between Linux, Mac and Windows
• Easy to use, and lots of resources are available
• Supports procedural and object-oriented programming; dynamically typed (variables are not declared with a type)
• Programming syntax similar to other languages
    – if, if-else, while, for, functions, classes, etc.
• Provides several ways to organize and manipulate data
    – Lists, dictionaries (hashes), objects
• Statements are not terminated by a semicolon

Advantages of Python

• Python is a general-purpose programming language like C, Java, etc., but it is "higher" level, which makes it advantageous in certain applications such as bioinformatics
    – Fewer lines of code than C or Java
    – No compilation necessary. Prototype and run!
    – Run every line of code interactively
    – Vast function library geared towards scientific computing
    – Save coding time and automate computing tasks
    – Intuitive. Code is concise, but human readable

First Python program

• The obligatory "Hello World" program

    #!/usr/bin/python
    # Comment: 1st program: variable, print
    name = "World"
    print "Hello %s" % name

• Save these lines as a text file with the ".py" extension (here hello.py), then run it at the command prompt (Linux):

    python hello.py
    Hello World

Understanding the code

• The first line of a Python script gives the interpreter location, which is the path to the Python executable. It is needed only when the script is run directly (./hello.py); when the script is run as "python hello.py" it is treated as a comment

    #!/path/to/python

• 2nd line: A comment, beginning with "#"
• 3rd line: Declaration of a string variable
• 4th line: Printing some text to the shell with a variable, whose value is substituted for %s
• The quotes are not printed, and %s is replaced by the value of "name", which is "World", in the output

Second program

• Report summary statistics of a DNA sequence

    #!/usr/bin/python
    dna = "ATAGCAGATAGCAGACGACGAGA"
    print "Length of DNA is", len(dna)
    print "Number of A bases is", dna.count('A')
    print "Number of C bases is", dna.count('C')
    print "Number of G bases is", dna.count('G')
    print "Number of T bases is", dna.count('T')
    print "Number of G+C bases is", (dna.count('G') + dna.count('C'))
    print "Number of GC dinucleotides is", dna.count('GC')
    # 100.0 forces floating-point division; integer division would give 0
    print "G+C percent content is", 100.0 * (dna.count('G') + dna.count('C')) / len(dna)

• In about 10 lines of code, we can summarize our data!
• This code can be re-used to find motifs, RE sites, etc.
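
The same summary can be wrapped in a reusable function. This is a minimal sketch: the name base_summary and the dictionary keys are chosen here for illustration, and print is written with parentheses and a single argument so the code runs on both Python 2 and Python 3.

```python
def base_summary(dna):
    """Return a dictionary of simple counts for a DNA string."""
    gc = dna.count("G") + dna.count("C")
    return {
        "length": len(dna),
        "A": dna.count("A"),
        "C": dna.count("C"),
        "G": dna.count("G"),
        "T": dna.count("T"),
        "GC_dinucleotides": dna.count("GC"),
        # 100.0 forces floating-point division on Python 2 as well
        "gc_percent": 100.0 * gc / len(dna),
    }

stats = base_summary("ATAGCAGATAGCAGACGACGAGA")
for key in sorted(stats):
    print("%s: %s" % (key, stats[key]))
```

Returning a dictionary instead of printing inside the function lets other code reuse the numbers, for example to compare several sequences.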

Python Comments

• Use the "#" character to add comments to your code; everything from "#" to the end of the line is ignored
• Comments help you and others to understand your thought process
• Let's say you intend to sum up a list of numbers

    # (sum from 1 to 100 of X)

• The code would look like this:

    total = 0                  # Initialize variable called "total" to 0
    for x in range(1,101):     # Use "for" loop to iterate over 1 to 100 (range stops before 101)
        total = total + x      # Add x to the previous total
    print "The sum of 1..100 is %s" % total    # Report the result

Python Variables

• Variables provide a location to "store" data we are interested in
• Strings, decimals, integers, characters, lists, …
    – What is a character? A single letter or digit
    – What is a string? A sequence of characters
    – What is an integer? A whole number such as 4 (a number with a decimal point, such as 4.7, is a float, sometimes referred to as a real)
• Variables can be assigned or changed easily within a Python script

Variables and built-in keywords

• Variable names should represent or describe the data they contain
    – Do not use meta-characters; stick to letters, digits and underscores. Begin a variable name with a letter
• Python has keywords that cannot be used as variable names. They are reserved for writing the syntax and logical flow of the program
    – Examples include: if, elif, else, for, while, break, continue, def, class, return, import, and, or, not
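
Whether a word is reserved can be checked with the standard keyword module. A minimal sketch; the candidate words are picked here for illustration and include a few that are keywords in other languages but not in Python.

```python
import keyword

candidates = ["if", "for", "class", "foreach", "switch", "dna_length"]
for word in candidates:
    # keyword.iskeyword returns True for words reserved by Python
    status = "reserved" if keyword.iskeyword(word) else "usable as a name"
    print("%-12s %s" % (word, status))
```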

Variables, Lists

• Variables that hold single strings are string variables
• Variables that hold single numbers are numeric variables

    score = 5.3
    dna = "ATAGGATAGCGA"
    name = "Shailender"

• Collections of values are called lists. They hold an array of values, such as a list of students in a class or scores from a test

    students = ["Alan", "Shailender", "Chris"]
    scores = [89.1, 65.9, 92.4]
    binding_pos = [9439984, 114028942]

Printing text

• Strings can be written with single, double or triple quotes
• Escape sequences such as \t (tab) and \n (newline) are processed inside single- and double-quoted strings. Ex:
    – print "This \t is a test\nwith text"
    – Output:
          This     is a test
          with text
• Triple quotes are useful for multi-line prints or prints with quotes included. Escape sequences are still processed unless the string carries the raw prefix r. Ex:
    – print r"""'This \t is a test\nwith text'"""
    – Output: 'This \t is a test\nwith text'
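
The three quoting styles can be compared side by side. A small sketch; the variable names are chosen here for illustration, and print is called with parentheses so the code runs on Python 2 and 3.

```python
plain = "This \t is a test\nwith text"            # \t and \n are processed
triple = """'This \t is a test\nwith text'"""     # still processed; quotes allowed inside
raw = r"""'This \t is a test\nwith text'"""       # raw prefix keeps the backslashes

print(plain)
print(triple)
print(raw)
```

Only the raw string prints the backslash sequences literally; the other two print a tab and a line break.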

Printing variables

• Scalar variables can be printed easily with a format string following a print statement. The placeholders (%s for strings, %d for integers) are replaced by the values of the variables listed after the % operator

    x = 5
    name = "John"
    print "%s has %d dollars\n" % (name,x)

• If you run this as a program, you get this output

    John has 5 dollars

Printing List variables

• List variables can be printed whole with a default delimiter, or their elements can be printed individually (for example in a loop) as scalars

    students = ["Alan", "Shailender", "Chris"]
    print "students\n"         # Does not work! Prints the word "students"
    print "%s\n" % students    # Method 1
    print "%s %s %s" % (students[0],students[1],students[2])    # Method 2

• If you run this as a program, you get this output (after the literal word "students"):

    ['Alan', 'Shailender', 'Chris']    # Method 1
    Alan Shailender Chris              # Method 2

Math Operators and Expressions

• Math operators
    – Eg: 3 + 2
    – + is the operator
    – We read this left to right
    – Basic operators such as + - / * and ** (exponent; note that ^ is bitwise XOR in Python, not a power operator)
    – Variables can be used

    print "Sum of 2 and 3 is", (2+3)
    x = 3
    print "Sum of 2 and x is", (2+x)

• PEMDAS rules are followed to build mathematical expressions. Built-in math functions can be used

Mathematical operations

• x=3
• y=5
• z=y+x
    – Is this the same: z=x+y ?
    – Yes, the value is the same; only the order in which the operands are read (left to right) differs
• x=x*z
• y=y+1
• y+=1 (shorthand for y=y+1; Python has no ++ operator)
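
The assignments above can be traced step by step. A minimal sketch; the extra names (power, xor, pemdas, true_div, floor_div) are introduced here only to show operator behaviour.

```python
x = 3
y = 5
z = y + x                  # 8, same value as x + y
x = x * z                  # 24
y = y + 1                  # 6
y += 1                     # 7, shorthand for y = y + 1

power = 2 ** 10            # exponent: 1024
xor = 2 ^ 10               # bitwise XOR, not a power: 8
pemdas = 2 + 3 * 4 ** 2    # exponent, then multiply, then add: 50
true_div = 7 / 2.0         # 3.5 (a float operand gives a float result)
floor_div = 7 // 2         # 3 (floor division)

print("x=%d y=%d z=%d" % (x, y, z))
```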

Python built-in functions

• The Python language comes with many built-in functions that can be used with variables to produce summary output, e.g. len(), min(), max() and sum()

• Many mathematical functions are also available

Built-in mathematical functions

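• Some built-in functions are:

    min(x)        max(x)        len(x)
    sum(x)        abs(x)        float(x)
    pow(x,y)      range(x)      round(x,n)
    int(x)

• Others live in the math module and are available after "import math" (called as math.sqrt(x), etc.):

    ceil(x)       cos(x)        degrees(x)
    radians(x)    exp(x)        floor(x)
    hypot(x,y)    log(x,base)   sin(x)
    sqrt(x)       tan(x)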
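
A few of these functions in use. This is a sketch with sample values chosen here; the functions that belong to the math module are called with the math. prefix after importing it.

```python
import math

scores = [89.1, 65.9, 92.4]

lowest = min(scores)                   # built-in functions need no import
highest = max(scores)
mean = sum(scores) / len(scores)
rounded = round(mean, 1)

root = math.sqrt(16)                   # functions from the math module
up = math.ceil(4.2)
down = math.floor(4.8)
hyp = math.hypot(3, 4)
log2 = math.log(8, 2)                  # logarithm of 8 in base 2

print("min=%s max=%s mean=%s" % (lowest, highest, rounded))
```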

Lists

• Lists can contain an array of numerical or string values, eg:

    genes = ['ATM1', 'BRCA', 'TNLP2']
    gene_scores = [48.1, 129.7, 73.2]

• List elements can be accessed by their index, beginning with 0, eg:

    genes[0]            # 'ATM1'
    gene_scores[0:2]    # [48.1, 129.7]

List Indexing

• Lists can be indexed by number to retrieve individual elements
• Indexes have range 0 to (n-1), where 0 is the index of the first element and n-1 is the last item's index

    nucleotides = ["adenine", "cytosine", "guanine", "thymine", "uracil"]
    nucleotides[3] is equal to thymine
    nucleotides[4] is equal to what?

• Any element of a list can be re-assigned (lists are mutable)
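
Indexing, slicing and re-assignment can be tried directly. A short sketch; the variable names besides nucleotides are introduced here for illustration.

```python
nucleotides = ["adenine", "cytosine", "guanine", "thymine", "uracil"]

first = nucleotides[0]                        # index 0 is the first element
last = nucleotides[len(nucleotides) - 1]      # index n-1 is the last element
also_last = nucleotides[-1]                   # negative indexes count from the end
middle = nucleotides[1:3]                     # slice: indexes 1 and 2

nucleotides[4] = "URACIL"                     # lists are mutable
print(nucleotides)
```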

List Operations

• append(x) Add an item to the end of the list
• extend(L) Extend the list by appending all the items in the given list
• insert(i, x) Insert an item at a given position. The first argument is the index of the element before which to insert
• remove(x) Remove the first item from the list whose value is x. It is an error if there is no such item

    data = [10, 20]
    data.append(30)            # [10, 20, 30]
    data.extend([40,50,60])    # [10, 20, 30, 40, 50, 60]
    data.insert(0,5)           # [5, 10, 20, 30, 40, 50, 60]
    data.remove(20)            # [5, 10, 30, 40, 50, 60]

List Operations (…contd)

• pop([i]) Remove the item at the given position in the list, and return it. If no index is specified, pop removes and returns the last item in the list

• index(x) Return the index in the list of the first item whose value is x. It is an error if there is no such item

• count(x) Return the number of times x appears in the list
• sort() Sort the items of the list, in place
• reverse() Reverse the elements of the list, in place

List operations

    fruits = []                                   # Empty list
    fruits = ["apples", "bananas", "cherries"]    # Assigned
    fruits.append("dates")                        # Lengthen
    fruits.insert(0, "acorn")                     # Add an item to the front
    nut = fruits.pop(0)                           # Remove from the front
    print "Well, a squirrel would think an %s was a fruit!" % nut
    fruits.append("mango")                        # Add an item to the end
    food = fruits.pop()                           # Remove from the end
    print "My, that was a yummy %s!" % food

Output:

    Well, a squirrel would think an acorn was a fruit!
    My, that was a yummy mango!
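
The remaining list methods (index, count, pop, sort, reverse) can be followed on a small list. A minimal sketch with sample numbers chosen here.

```python
data = [30, 10, 20, 10]

position = data.index(10)    # index of the first 10
repeats = data.count(10)     # how many times 10 appears

last = data.pop()            # removes and returns the last item
first = data.pop(0)          # removes and returns the item at index 0

data.extend([5, 40])         # data is now [10, 20, 5, 40]
data.sort()                  # sorts in place
ascending = list(data)       # copy of the sorted list
data.reverse()               # reverses in place

print(ascending)
print(data)
```

Note that sort() and reverse() change the list itself and return None, which is why the sorted state is copied before reversing.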

Tuples

• A tuple is a fixed sequence of values such as numbers or strings, separated by commas, eg:

    calendar = 'Jan', 'Feb', 'Mar'

• A single-item tuple needs a trailing comma to distinguish it from a string

    calendar = 'Jan',

• Tuple with 0 elements

    calendar = ()

• Tuples may be nested

    u = calendar, 'Apr', 'May'
    u = u, ('Jun', 'Jul')

Tuples (…contd)

• Tuples have many uses. For example: (x, y) coordinate pairs, employee records from a database

• Tuples, like strings, are immutable: it is not possible to assign to the individual items of a tuple

• It is also possible to create tuples which contain mutable objects, such as lists

    t = 2,54,'hello!'    # Tuple "packing"
    x, y, z = t          # "unpacking"

Tuples (…contd)

• Sequence unpacking requires the list of variables on the left to have the same number of elements as the length of the sequence

• Note that multiple assignment is really just a combination of tuple packing and sequence unpacking!

• There is a small bit of asymmetry here: packing multiple values always creates a tuple, and unpacking works for any sequence
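
Packing, unpacking and immutability can be checked in a few lines. A sketch; the names mismatch and immutable are introduced here only to record what happened.

```python
t = 2, 54, "hello!"          # packing always creates a tuple
x, y, z = t                  # unpacking: three names for three items

x, y = y, x                  # multiple assignment swaps the two values

first, second = [10, 20]     # unpacking works for any sequence, e.g. a list
a, b, c = "DNA"              # ...or a string

try:
    p, q = t                 # two names for three items
except ValueError as err:
    mismatch = str(err)

try:
    t[0] = 99                # tuples are immutable
except TypeError:
    immutable = True

print("%s %s %s" % (x, y, z))
```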

Sets

• A set is an unordered collection with no duplicate elements. Sets support mathematical operations like union, intersection, difference, and symmetric difference
• Create a list

    basket = ['apple', 'orange', 'apple', 'pear', 'banana']

• Create a set without duplicates

    fruit = set(basket)

• Test if an item is present in the set

    'orange' in fruit    # True

Sets (…contd)

• Two sets can be created, and various operations can be performed between them
    – The result of a set operation is also a "set", not a list
    – Union (|)
    – Intersection (&)
    – Symmetric difference (^)
    – Difference (-)

Assignment: Working with "Sets"

• Example with two sets of characters (the results are sets, so the order of the letters shown is arbitrary)

    a = set('abracadabra')
    b = set('alacazam')
    a        # unique letters in a: 'a', 'r', 'b', 'c', 'd'
    a - b    # letters in a but not in b: 'r', 'd', 'b'
    a | b    # letters in either a or b: 'a', 'c', 'r', 'd', 'b', 'm', 'z', 'l'
    a & b    # letters in both a and b: 'a', 'c'
    a ^ b    # letters in a or b but not both: 'r', 'd', 'b', 'm', 'z', 'l'
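
The four operators can be verified on the same two sets. A minimal sketch; sorted() is used only to print the letters in a stable order, since sets themselves are unordered.

```python
a = set("abracadabra")
b = set("alacazam")

only_a = a - b        # difference
either = a | b        # union
both = a & b          # intersection
one_side = a ^ b      # symmetric difference

print(sorted(only_a))
print(sorted(either))
print(sorted(both))
print(sorted(one_side))
```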

Dictionary

• A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique

• Example data pairs that are suited to be stored in a hash array (as opposed to storing them in 2 separate arrays)– Words (key) and their meanings (value)– Gene symbols (key) and their full names (value)– Country names (key) and their capitals/ currencies (value)

• A pair of braces creates an empty dictionary: {}

Dictionary (…contd)

• Placing a comma-separated list of key:value pairs within the braces adds initial key:value pairs to the dictionary

    genes = {"Gene1":"Breast Cancer Gene"}

• The main operations on a dictionary are storing a value with some key and extracting the value given the key
• It is also possible to delete a key:value pair with del
• If you store using a key that is already in use, the old value associated with that key is forgotten
• The keys() method of a dictionary object returns a list of all the keys used in the dictionary, in arbitrary order (if you want it sorted, just apply the sort() method to the list of keys)

Dictionary (…contd)

• To check whether a single key is in the dictionary, either use the dictionary's has_key() method (Python 2 only) or the "in" keyword

    tel = {'jack': 4098, 'sape': 4139}
    tel['guido'] = 4127
    tel['irv'] = 4129
    tel['jack']    # 4098

• Deleting a key-value pair

    del tel['sape']

• Get the keys of the dictionary

    tel.keys()

Dictionary (…contd)

• Check if the dictionary has a key

    tel.has_key('guido')

• Check if the dictionary has a key (another method)

    'guido' in tel
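
The dictionary operations from the last few slides, collected in one runnable sketch. It uses the "in" keyword rather than has_key() so that it also runs on Python 3; the get() call and the name extension are additions made here for illustration.

```python
tel = {"jack": 4098, "sape": 4139}
tel["guido"] = 4127                        # store a value under a new key
tel["irv"] = 4129
tel["jack"] = 5000                         # an existing key: the old value is forgotten
del tel["sape"]                            # delete a key:value pair

has_guido = "guido" in tel
has_sape = "sape" in tel
names = sorted(tel.keys())                 # keys come back in arbitrary order; sort them
extension = tel.get("nobody", "unknown")   # get() returns a default instead of raising an error

print(names)
```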

Dictionary, cont.

• Two ways to create a dictionary:

    dessert = {"pie": "apple", "cake": "carrot", "sorbet": "orange"}    # Method 1

    dessert = {}                                                        # Method 2
    dessert["pie"] = "apple"
    dessert["cake"] = "carrot"
    dessert["sorbet"] = "orange"

    print "I would like %s pie." % dessert["pie"]

Output:

    I would like apple pie.

Hash Array iteration (...contd)

    sounds = {"cow": "moooo", "duck": "quack", "horse": "whinny", "sheep": "baa", "hen": "cluck", "pig": "oink"}

    for animal in sounds.keys():
        print "Old MacDonald had a %s." % animal
        print " With a %s! %s! here..." % (sounds[animal], sounds[animal])

Output (first lines; the order of the keys is arbitrary):

    Old MacDonald had a hen.
     With a cluck! cluck! here...
    Old MacDonald had a cow.
     With a moooo! moooo! here...
    Old MacDonald had a sheep.
     With a baa! baa! here...
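
When a predictable order is wanted, the keys can be sorted before looping. A sketch under that assumption; the list verses is introduced here to collect the lines before printing.

```python
sounds = {"cow": "moooo", "duck": "quack", "horse": "whinny",
          "sheep": "baa", "hen": "cluck", "pig": "oink"}

verses = []
for animal in sorted(sounds):              # sorted() walks the keys alphabetically
    noise = sounds[animal]
    # two %s placeholders for the sound need two values in the tuple
    verses.append("Old MacDonald had a %s. With a %s! %s! here..." % (animal, noise, noise))

for verse in verses:
    print(verse)
```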

String Operations: Split

• Python provides excellent features for handling strings contained in variables
• The "split" method breaks a string apart wherever a given delimiter string occurs
• For example, to extract the words in a sentence, we use the space delimiter to capture the words

    x = "This is a sentence"
    words = x.split(" ")

• Now words contains the words

    words[0]    # "This"
    words[1]    # "is"
    words[2]    # "a"
    words[3]    # "sentence"
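
Splitting on different delimiters, and joining the pieces back together, can be sketched as follows. The record string and the other names are made up here for illustration.

```python
x = "This is a sentence"
words = x.split(" ")                 # split on a single space

record = "Smith:Al:18"
fields = record.split(":")           # any delimiter string can be used

rejoined = "-".join(words)           # join is the reverse of split

pieces = "a  b".split(" ")           # two spaces in a row give an empty string
default = "a  b".split()             # no argument: split on any run of whitespace

print(words)
print(fields)
print(rejoined)
```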

String Operations

• Two literal strings placed next to one another with a space will concatenate automatically

• To manually concatenate two strings once, the "+" operator is used

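• To paste multiple copies of a string, multiply (*) it by a number

    word = 'Help' + 'A'       # 'HelpA'
    word*5                    # 'HelpAHelpAHelpAHelpAHelpA'
    '<' + word*5 + '>'        # '<HelpAHelpAHelpAHelpAHelpA>'
    'Help' 'A'                # 'HelpA' (adjacent literals concatenate)
    'Help'.strip() 'A'        # Error: the first item is not a literal string
    'Help'.strip() + 'A'      # Manual concatenation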

String Sub-scripting

• Once a string is created, it can be subscripted using its indices, which begin with 0

    word = "Programming"
    word[0]      # "P"
    word[0:3]    # "Pro"
    word[:3]     # "Pro"
    word[3:]     # "gramming"

• Slices of Python strings cannot be assigned to, eg

    word[:3] = "C"    # This won't work

String subscripting (…contd)

• Subscripted strings can however be used in string expressions

    word[2:4] + 're'    # "ogre"

• If a slice index is out of range, the start or end of the string is used instead

    word[:200]    # "Programming"

• If a negative subscript is used, the index is counted from the last character towards the first

    word[-2:]    # "ng"
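
The slicing rules can be checked on the same word. A minimal sketch; the variable names on the left are introduced here, and the last line shows one way to "change" a string by building a new one.

```python
word = "Programming"

first = word[0]               # "P"
head = word[:3]               # characters 0, 1, 2
tail = word[3:]               # from index 3 to the end
ogre = word[2:4] + "re"       # a slice used in an expression
clipped = word[:200]          # an out-of-range slice end is clipped
ending = word[-2:]            # last two characters

# Strings are immutable, so build a new string instead of assigning to a slice
changed = "C" + word[3:]

print("%s %s %s" % (head, tail, ogre))
```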

String Functions

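• Some functions

    x = 'python'
    x.capitalize()       # 'Python'
    x.upper()            # 'PYTHON'
    x.lower()            # 'python'
    x.find("pattern")    # index of the substring, or -1 if it is absent
    x.center(20)         # pads x with spaces to a width of 20
    x = " "
    x.join("python")     # 'p y t h o n'

• Python strings have no trim() method; use strip() to remove surrounding whitespace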

String Functions (…contd)

• Some functions

    x.strip()
    x.lstrip()
    x.rstrip()
    x.split("\t")
    x.count("p")
    if x.startswith('#'):
    if x.endswith('a'):
    if x.isalpha():
    if x.islower():
    if x.isupper():
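
Several of these methods are typically combined when cleaning up a line of text. A sketch; the sample line and the variable names are made up here.

```python
line = "  # gene\tscore \n"

stripped = line.strip()                      # remove surrounding whitespace
is_comment = stripped.startswith("#")
fields = stripped.lstrip("# ").split("\t")   # drop leading "#" and spaces, split on tab

name = "python"
title = name.capitalize()
shout = name.upper()
p_count = name.count("p")
letters_only = name.isalpha()
lower_case = name.islower()
centered = name.center(10)                   # padded with spaces to width 10
spaced = " ".join(name)                      # join works on any sequence of strings

print(fields)
print(spaced)
```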

Iterating over Lists with "in"

• Ok, so we have these lists, but how do we work with each element automatically?
    – How can we iterate over them and perform the same operation on each element?
• We use looping logic to work with the lists
• We use Python's "for" loop, which behaves like the "foreach" of other languages

    for named_item in array:
        <statements that use named_item>

Python Lists (...contd)

• Example:

    nucleotides = ["adenine", "cytosine", "guanine", "thymine", "uracil"]
    for nt in nucleotides:
        print "Nucleotide is: %s" % nt

Output:

    Nucleotide is: adenine
    Nucleotide is: cytosine
    Nucleotide is: guanine
    Nucleotide is: thymine
    Nucleotide is: uracil

Flow Control: "For" loop

• "For" loops allow users to repeat a set of statements a set number of times

    x = ['a', 'b', 'c']
    for strvar in x:
        print strvar, len(strvar)
    for i in range(0,len(x)):
        print x[i], len(x[i])

Break and Continue Statements

• The "break" statement in a "for" or "while" loop transfers control out of the loop

• The "continue" statement in a "for" or "while" loop transfers control to the next iteration of the loop, skipping the remaining statements after it
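
Both statements can be seen in one loop. A minimal sketch; the list of lines and the "END" marker are made up here for illustration.

```python
lines = ["# header", "ATGC", "", "GGCC", "END", "TTAA"]

sequences = []
for line in lines:
    if line.startswith("#") or line == "":
        continue              # skip comments and blank lines, go to the next item
    if line == "END":
        break                 # leave the loop; "TTAA" is never reached
    sequences.append(line)

print(sequences)
```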

Boolean Operations

• Boolean operators provide Boolean context
• Many types of operators are provided
    – Relational (<, >, <=, >=)
    – Equality (==, !=)
    – Logical (and, or, not)
    – Conditional expression (a if condition else b); Python has no ?: operator
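
Each operator family in a line or two. A sketch; the value of x and the variable names are chosen here for illustration.

```python
x = 5

in_range = x > 1 and x < 10            # relational + logical
outside = not in_range
either = x == 0 or x != 5              # equality + logical
chained = 1 < x < 10                   # comparisons can be chained
label = "big" if x > 3 else "small"    # conditional expression

print("%s %s %s %s %s" % (in_range, outside, either, chained, label))
```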

Command blocks in Python

• A group of statements surrounded by braces {}
    – No! There are no curly braces in Python!
    – Python blocks are all about "indenting"
• Creates a new context for statements and commands
• Ex:

    if(x>1):
        print "Test\n"
        print "x is greater than 1\n"

Iterating over Lists with "while"

• Example:

    nucleotides = ["adenine", "cytosine", "guanine", "thymine", "uracil"]
    i = 0
    while(i<len(nucleotides)):
        print "Nucleotide is: %s" % nucleotides[i]
        i += 1

Output:

    Nucleotide is: adenine
    Nucleotide is: cytosine
    Nucleotide is: guanine
    Nucleotide is: thymine
    Nucleotide is: uracil

Conditional operations with "if-then-else"

• If-then-else syntax allows programmers to introduce logic in their programs

• Blocks of code can be branched to execute only when certain conditions are met

    if(condition1 is true):
        <statements if condition1 is true>
    else:
        <statements if condition1 is false>
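
A concrete branch, extended with Python's elif for a third case. This is a sketch: the function name classify_gc and the 40/60 percent thresholds are hypothetical values picked here, not taken from the slides.

```python
def classify_gc(gc_percent):
    """Label a sequence by its G+C percentage (illustrative thresholds)."""
    if gc_percent < 40:
        return "AT-rich"
    elif gc_percent <= 60:       # checked only when the first condition is false
        return "balanced"
    else:
        return "GC-rich"

for value in [25.0, 47.8, 72.3]:
    print("%.1f%% is %s" % (value, classify_gc(value)))
```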

Python nested blocks

• Blocks within blocks

    if (x>1):
        if (y>2):
            print "y>2\n"
        print "x>1\n"

Python File access

• What is file access?
    – A set of Python commands/syntax to work with data files
• Why do we need it?
    – It makes reading data from files easy; we can also create new data files
• What different types are there?
    – Read, write, append

File I/O

• Python easily reads and writes ASCII/text files

    f = open(filename, mode)
    f.read()         # Reads entire file
    f.read(10)       # Reads 10 bytes
    f.readline()     # Reads one line at a time
    f.readlines()    # Reads all lines into a list
    f.close()        # Closes the file

• To read lines automatically one by one

    for line in f:
        print line

File I/O (…contd)

• Other functions

    f.write('abcd')    # writes to file
    f.tell()           # current position in file
    f.seek(0)          # go to a position in file (here, the beginning)
    f.close()          # close the file

• To write a list of strings to a file, use:

    f.writelines(lines)
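
The three modes (write, append, read) can be exercised on a scratch file. A minimal sketch: the file name demo.txt and the temporary directory are assumptions made here so the example does not touch any real data.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")   # hypothetical scratch file

f = open(path, "w")                 # "w" creates or overwrites the file
f.write("first line\n")
f.writelines(["second line\n", "third line\n"])
f.close()

f = open(path, "a")                 # "a" appends to the end
f.write("fourth line\n")
f.close()

f = open(path, "r")                 # "r" opens for reading
first = f.readline()                # one line, newline included
rest = f.readlines()                # the remaining lines as a list
f.seek(0)                           # go back to the beginning
everything = f.read()               # the whole file as one string
f.close()

print(first.rstrip())
print(len(rest))
```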

Python File access Example

• Input file (mailing_list):

    Last name:First name:Age:Address:Apartment:City:State:ZIP
    Smith:Al:18:123 Apple St.:Apt. #1:Cambridge:MA:02139

• Example

    f = open("mailing_list", "r")
    f.readline()                                 # Skip the header line
    for line in f:
        fields = line.rstrip("\n").split(":")    # Drop the newline, split on ":"
        print "%s %s" % (fields[1], fields[0])
        print "%s, %s" % (fields[3], fields[4])
        print "%s, %s %s" % (fields[5],fields[6],fields[7])
    f.close()

• Output:

    Al Smith
    123 Apple St., Apt. #1
    Cambridge, MA 02139

Python File access writing

• Writing to files
    – print writes to STDOUT by default
    – f.write() writes to a file (in Python 2, "print >>f, text" also works)
    – Be sure that the file is open for writing first
• Check for errors along the way

Python File access writing

• Example writing to a file

    import os
    readfile = "mailing_list"
    if(os.path.isfile(readfile)):
        fr = open(readfile, "r")
        fw = open("labels", "w")
        for line in fr:
            line = line.rstrip()    # rstrip() returns a new string
            fields = line.split(":")
            fw.write("%s %s\n" % (fields[1], fields[0]))
            fw.write("%s %s\n" % (fields[3], fields[4]))
            fw.write("%s %s %s\n" % (fields[5],fields[6],fields[7]))
        fr.close()
        fw.close()

Functions

• What is a function?
    – Groups related statements into a single task
    – Segments code into logical blocks
    – Avoids code and variable name collisions
    – Can be "called" by other segments of code
• Functions return values
    – Explicitly with the return statement
    – A function that finishes without executing return gives back None
• Return values can be any object: a number, a string, a list, a tuple, a dictionary, etc.
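
The three return behaviours can be compared directly. A sketch; the function names count_gc, report and min_max are made up here for illustration.

```python
def count_gc(dna):
    """Explicit return of a single number."""
    return dna.count("G") + dna.count("C")

def report(dna):
    """No return statement: the caller receives None."""
    print("GC bases: %d" % count_gc(dna))

def min_max(values):
    """Several values are returned together as a tuple."""
    return min(values), max(values)

gc = count_gc("ATGC")
nothing = report("ATGC")
low, high = min_max([89.1, 65.9, 92.4])
print("%s %s %s" % (gc, low, high))
```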

Functions

• A function can be written in any Python program. It is identified by the "def" keyword

• Writing a function

    def printstars():
        print "***********************"
    printstars()

• Notice the indenting after declaring the function, and the use of empty parentheses

Functions with Inputs and Outputs

• The "return" statement can be used to return some output from the function

    def fib2(n):
        """Generate the Fibonacci series """
        result = []
        a, b = 0, 1
        while b < n:
            result.append(b)
            a, b = b, a+b
        return result

• The function can then be called

    fib100 = fib2(100)

Functions with Argument lists

    def isEnzyme(enzymePrompt, tries = 4):
        while True:
            enzyme = raw_input(enzymePrompt)
            if enzyme in ["ATGG", "GTAC", "CCAA"]:
                return True
            tries = tries - 1
            if tries<0:
                print "Enough tries already!"
                break

• Can be executed as

    isEnzyme("Enter an enzyme: ")
    isEnzyme("Enter enzyme: ", 3)

Functions with Argument lists (…contd)

• In general, an argument list must have any positional arguments followed by any keyword arguments

• The keywords must be chosen from the formal parameter names

• It's not important whether a formal parameter has a default value or not

• No argument may receive a value more than once -- formal parameter names corresponding to positional arguments cannot be used as keywords in the same calls
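
These calling rules can be tried on a small function with two default values. A sketch; the function describe_site and its parameters are hypothetical names introduced here.

```python
def describe_site(enzyme, site, cuts=1, blunt=False):
    """Format a one-line description; cuts and blunt have default values."""
    kind = "blunt" if blunt else "sticky"
    return "%s cuts %s %d time(s), %s ends" % (enzyme, site, cuts, kind)

a = describe_site("EcoRI", "GAATTC")                   # positional only, defaults used
b = describe_site("EcoRI", "GAATTC", 2)                # third positional fills cuts
c = describe_site("SmaI", site="CCCGGG", blunt=True)   # positional first, then keywords
d = describe_site(site="GAATTC", enzyme="EcoRI")       # keywords in any order

try:
    describe_site("EcoRI", "GAATTC", enzyme="SmaI")    # enzyme would receive two values
except TypeError:
    duplicate_rejected = True

print(a)
print(c)
```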

Python Subroutines

    def fibonacci(n):
        if n <= 2:
            return 1
        return (fibonacci(n-1) + fibonacci(n-2))

    for i in range(1,6):
        fib = fibonacci(i)
        print "fibonacci(%d) is %s" % (i,fib)

• Example Output:

    fibonacci(1) is 1
    fibonacci(2) is 1
    fibonacci(3) is 2
    fibonacci(4) is 3
    fibonacci(5) is 5

Providing input to programs

• It is sometimes convenient not to have to edit a program to change certain data variables

• Python allows you to read data from shell directly into program variables with the "raw_input" command

• Examples:

    x = raw_input("Enter your name: ")

• If "x" is a number

    x = int(x)

Command Line Arguments

• Command line arguments are optional data values that can be passed as input to the Python program as the program is run
    – After the name of the program, place string or numeric values with spaces separating them
    – Access them through the sys.argv variable inside the program (after "import sys")
    – Avoid entering or replacing data by editing the program
• Examples:

    python arguments.py arg1 arg2 10 20
    python arguments2.py 10 20 30 40 50
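
What a script such as arguments2.py might do with its arguments can be sketched as follows. The contents of that script are not shown in the slides, so the function sum_arguments is an assumption made here; it is fed a hand-written list in the same shape as sys.argv so the example runs without a command line.

```python
import sys

def sum_arguments(argv):
    """Add up the numeric values that follow the program name in argv."""
    total = 0
    for value in argv[1:]:       # argv[0] is the name of the script itself
        total += int(value)      # arguments always arrive as strings
    return total

# Same shape as sys.argv for: python arguments2.py 10 20 30 40 50
example_total = sum_arguments(["arguments2.py", "10", "20", "30", "40", "50"])
print("Total is %d" % example_total)

# In a real run, the actual arguments are in sys.argv
print("Arguments received: %d" % (len(sys.argv) - 1))
```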

Using Python programs on the cluster

• Python scripts can easily be submitted as jobs to be run on the MGHPCC infrastructure

• Basic understanding of Linux commands is required, and an account on the cluster

• Lots of useful information, including account registration details, at www.umassrc.org

• Feel free to reach out to Research Computing for [email protected]


What is a computing "Job"?

• A computing "job" is an instruction to the HPC system to execute a command or script
    – Simple Linux commands or Python/R scripts that can be executed within milliseconds would probably not qualify to be submitted as a "job"
    – Any command that is expected to take up a big portion of CPU or memory for more than a few seconds on a node would qualify to be submitted as a "job". Why? (Hint: multi-user environment)

How to submit a "job"

• The basic syntax is:

    bsub <valid linux command>

• bsub: LSF command for submitting a job
• Let's say a user wants to execute a Python script. On a Linux PC, the command is

    python countDNA.py

• To submit a job to do the work, do

    bsub python countDNA.py

Specifying more "job" options

• Jobs can be marked with options for better job tracking and resource management
    – Jobs should be submitted with parameters such as queue name, estimated runtime, job name, memory required, output and error files, etc.
• These can be passed on in the bsub command

    bsub -q short -W 1:00 -R rusage[mem=2048] -J "Myjob" -o hpc.out -e hpc.err python countDNA.py
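
When many jobs are submitted, the command line can be assembled in Python instead of typed by hand. A sketch only: build_bsub_command is a hypothetical helper written here, it uses just the flags shown above, and it prints the command rather than submitting anything.

```python
def build_bsub_command(script, queue="short", walltime="1:00", mem_mb=2048,
                       job_name="Myjob", out_file="hpc.out", err_file="hpc.err"):
    """Return the bsub command as a list of arguments (nothing is executed)."""
    return ["bsub",
            "-q", queue,
            "-W", walltime,
            "-R", "rusage[mem=%d]" % mem_mb,
            "-J", job_name,
            "-o", out_file,
            "-e", err_file,
            "python", script]

command = build_bsub_command("countDNA.py")
print(" ".join(command))
```

On a machine where bsub is installed, a list like this could be handed to subprocess.call(command) to submit the job; that step is left out here so the example runs anywhere.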

Job submission "options"

Option flag    Description
-q             Name of queue to use. On our systems, possible values are "short" (<=4 hrs execution time), "long" and "interactive"
-W             Allocation of node time. Specify hours and minutes as HH:MM
-J             Job name. Eg. "Myjob"
-o             Output file. Eg. "hpc.out"
-e             Error file. Eg. "hpc.err"
-R             Resources requested from assigned node. Eg. "-R rusage[mem=1024]", "-R span[hosts=1]"
-n             Number of cores to use on assigned node. Eg. "-n 8"

Why use the correct queue?

• Match requirements to resources
• Jobs dispatch quicker
• Better for entire cluster
• Help GHPCC staff determine when new resources are needed

Questions?

• How can we help further?
• Please check out books we recommend as well as web references (next 2 slides)

Python Books

• Python books which may be helpful
    – Bioinformatics Programming Using Python: http://shop.oreilly.com/product/9780596154516.do
    – Learning Python: http://shop.oreilly.com/product/0636920028154.do
    – Python Cookbook: http://shop.oreilly.com/product/0636920027072.do
    – Programming Python, 4th Edition: http://shop.oreilly.com/product/9780596158118.do