Bioinformatics Programming

1

EE, NCKU

Tien-Hao Chang (Darby Chang)

2

In the last slide

More Unix features worth mentioning
– job control

– I/O redirection and piping

– text processing (vi, grep, sed, awk, …)

Programming vs. language

3

Programming

4

Before learning advanced data structures and the associated algorithms

5

struct

A building block for constructing advanced data structures in C

6

struct

A struct is similar to an array in that both aggregate a set of objects into a single object (not "object" in the object-oriented sense)
– array: aggregates objects of the same type
– struct: aggregates objects of different types

struct is short for 'structure'

Each entry in a struct declaration is usually called a 'field' or 'member'

7

struct

Declaration

A struct declaration consists of a list of fields, each of which can have any type

    struct mydata {       // declare the structure of mydata
        char name[8];
        char id[10];
        int math;
        int eng;
    };

– this defines a type, referred to as struct mydata

To create a new variable of this type

    // define a variable 'student' of the type 'struct mydata'
    struct mydata student;

8

struct

The Memory Space

[Figure: the variable student in memory, with the fields name, id, math, and eng laid out in that order]

9

struct

Test Memory Space

    #include <stdio.h>

    int main(void) {
        struct data {
            char name[10];
            char sex[2];
            int math;
        };
        struct data student;
        // sizeof yields a size_t, so print it with %zu
        printf("sizeof(student)=%zu\n", sizeof(student));
        return 0;
    }

Result: 16 (10 + 2 + 4 bytes, assuming a 4-byte int)
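The size of a struct can exceed the sum of its field sizes, because a compiler may insert padding to keep fields aligned. The following is a minimal sketch, assuming the struct mydata layout from the Declaration slide; the helper names mydata_math_offset and mydata_padding are hypothetical, and the exact numbers depend on the platform.

```c
#include <stddef.h>   /* size_t, offsetof */

struct mydata {
    char name[8];
    char id[10];
    int math;
    int eng;
};

/* Byte offset of the field math from the start of the struct. */
size_t mydata_math_offset(void) {
    return offsetof(struct mydata, math);
}

/* Bytes of the struct that belong to no field (compiler padding). */
size_t mydata_padding(void) {
    size_t fields = sizeof(char[8]) + sizeof(char[10]) + 2 * sizeof(int);
    return sizeof(struct mydata) - fields;
}
```

On a platform where int is 4 bytes and 4-byte aligned, name and id occupy 18 bytes, so math would be placed at offset 20 and two bytes would be padding. Other platforms may differ.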

10

struct

Access Fields

The dot (.) operator
– struct_variable.field_name

For example
– student.math = 90;
– student.eng = 20;
– printf("%s's Math score is %d\n", student.name, student.math);

A convenient shortcut for initializing the members of a struct is a brace-enclosed list, whose values are assigned to the fields in declaration order
– struct data student = {"Mary Wang", 74};  // assumes struct data declares name first, then math

11

struct

Array of Structures

You may define an array of structures

    struct student {      // declare the structure of student
        char name[8];
        char id[10];
        int math;
        int eng;
    };
    // define an array of 3 variables of the type 'struct student'
    struct student stu[3];

[Figure: stu[0], stu[1], and stu[2] stored consecutively in memory, each with the fields name[0..7], id[0..9], math, and eng]

12

struct

Pointer to Structure

Pointers can be used to refer to a struct by its address

    struct mydata {       // declare the structure of mydata
        char name[8];
        char id[10];
        int math;
        int eng;
    } student;            // define a mydata variable, student

    struct mydata *ptr;   // define a pointer to mydata
    ptr = &student;       // point ptr to the variable student

Access fields through struct pointers
– the arrow (->) operator, which dereferences the pointer and selects a field
– struct_pointer_variable->field_name
– ptr->math = 90;
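Passing a struct by pointer is the usual way to let a function modify it. A minimal sketch, assuming the struct mydata layout used on this slide; the function name set_scores is hypothetical.

```c
struct mydata {
    char name[8];
    char id[10];
    int math;
    int eng;
};

/* Store both scores in the struct that ptr points to. */
void set_scores(struct mydata *ptr, int math, int eng) {
    ptr->math = math;    /* arrow operator on a pointer */
    (*ptr).eng = eng;    /* equivalent: dereference first, then use the dot operator */
}
```

Calling set_scores(&student, 90, 20) changes the caller's variable, because only its address is copied into the function.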

13

struct

Nested Structures

Since a struct declaration constructs a new type, a struct can be used as a field type just like built-in types such as int and double

    #include <stdio.h>

    int main(void) {
        struct date {             // declare date
            int month;
            int day;
        };
        struct student {          // declare nested structure, student
            char name[10];
            int math;
            struct date birthday;
        } s1 = {"David Li", 80, {2, 10}};   // define a student variable, s1
        printf("student name:%s\n", s1.name);
        printf("birthday:%d month, %d day\n", s1.birthday.month, s1.birthday.day);
        printf("math grade:%d\n", s1.math);
        return 0;
    }

14

struct

Self-referential Structure

A field is not allowed to have the same type as the struct in which it is declared

But a field can be a pointer to the same type as the struct in which it is declared

Such a struct, with pointer fields that refer to the same struct type, is called a self-referential structure

    struct PERSON {
        char name[8];
        int age;
        struct PERSON *son;   // self-referential pointer
    };

[Figure: one PERSON record with the fields name, age, and son]
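Self-referential pointers let records be chained together. A minimal sketch using the struct PERSON from this slide; the function count_generations is a hypothetical helper that follows the son pointers until it reaches NULL.

```c
#include <stddef.h>   /* NULL */

struct PERSON {
    char name[8];
    int age;
    struct PERSON *son;   /* NULL when there is no son */
};

/* Count the records reachable from p by following son pointers. */
int count_generations(const struct PERSON *p) {
    int n = 0;
    while (p != NULL) {
        ++n;
        p = p->son;
    }
    return n;
}
```

A chain of three records linked through son would give 3, and a NULL pointer would give 0.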

15

Any Questions?

16

Why

Why are fields not allowed to have the same type as the struct they belong to?

But why can fields be pointers to the same type as the struct they belong to?

Hint: think from the perspective of memory

17

The closeness between C and the actual machine representation is the reason a) why C programs are so fast and b) why C is suitable for teaching

18

Languages Comparison

Since the 1950s, computer scientists have devised thousands of programming languages. Many are obscure, perhaps created for a Ph.D. thesis and never heard of since.

Compiling to machine code
– some languages transform programs directly into machine code, the instructions that a CPU understands directly
– this transformation process is called compilation
– assembly, C, and C++

Interpreted languages
– other languages are either interpreted, such as Basic, Perl, and JavaScript
– or a mixture of both, being compiled to an intermediate language, as with Java and C#

19

Languages Comparison

Compile vs. Interpret

An interpreted language is processed at runtime. Every line is read, analyzed, and executed. Having to reprocess a line every time through a loop is what makes interpreted languages slow.
– because of this overhead, interpreted code is often said to run 5–10 times slower than compiled code
– the advantage is that nothing needs to be recompiled after a change, which is handy when you are learning to program

Because compiled programs almost always run faster than interpreted ones, languages such as C and C++ tend to be the most popular for writing games.

Java and C# both compile to an intermediate language that is interpreted very efficiently. Because the virtual machine that interprets Java and the .NET framework that runs C# are heavily optimized, it is claimed that applications in those languages can be as fast as, if not faster than, compiled C++.

20


Level of Abstraction

How close is a particular language to the hardware?

Machine code is the lowest level, followed by assembly. C++ is higher than C because C++ offers greater abstraction. Java and C# are higher than C++ because they compile to an intermediate language called bytecode.

When computers first became popular in the 1950s, programs were written in machine code. Programmers had to physically flip switches to enter values. This was such a tedious and slow way of creating an application that higher-level computer languages had to be created.

21

Super coder!

http://www.evula.org/dragoon/pics/supercoder.jpg

22

Assembler: Fast to run, slow to write
– The readable version of machine code
  • Mov A,$45
– Because it is tied to a particular CPU, assembly is not very portable.
– Languages like C have reduced the need for assembly except where memory is limited or time-critical code is needed, typically in kernel code or in a driver.

Basic: For beginners
– Basic is an acronym for Beginner's All-purpose Symbolic Instruction Code; it was created in the 1960s to teach programming.
– Microsoft has made the language its own with many different versions, including VBScript for websites and the very successful Visual Basic.
– It is an interpreted language whose main advantage is being easy to learn. But now it is more like a syntax alternative to C because most programmers are lazy.

Pascal: Conscientious programming
– Pascal was devised as a teaching language a few years before C but had limited usage.
– It was not suitable for commercial development until Borland's Turbo Pascal (for DOS) and Delphi (for Windows) appeared.
– However, Borland was up against Microsoft and lost the battle.

23

C: System programming
– C was devised in the early 1970s by Dennis Ritchie. It can be thought of as a general-purpose tool: very useful and powerful, but it very easily lets through bugs that can make systems insecure.
– C has been described as portable assembly.
– The syntax of many scripting languages is based on C.

C++: A classy language
– C++ (originally known as C with Classes) came about ten years after C and successfully introduced object-oriented programming to C, as well as features like exceptions and templates.
– Learning all of C++ is a big task. It is by far the most complicated of the programming languages here, but once you have mastered it, you will have no difficulty with any other language.

C#: Microsoft's big bet
– C# was created by Delphi's architect Anders Hejlsberg after he moved to Microsoft, and Delphi developers will feel at home with features such as Windows Forms.
– C# syntax is very similar to Java, which is not surprising as Hejlsberg also worked on J++ after he moved to Microsoft.
– Learn C# and you are well on the way to knowing Java. Both languages are semi-compiled: instead of compiling to machine code, they compile to bytecode, which is then interpreted.

24

Perl: Websites and utilities
– Very popular in the Linux world, Perl was one of the first web languages and remains very popular today.
– For 'quick and dirty' programming on the web it remains unrivalled and drives many websites.
– It has, though, been somewhat eclipsed by PHP as a web scripting language.

PHP: Website coding
– PHP was designed as a language for web servers and is very popular in the combination of Linux, Apache, MySQL, and PHP, or LAMP for short.
– It is interpreted, but pre-compiled, so code executes reasonably quickly.
– It can be run on desktop computers but is not as widely used for developing desktop applications.
– Based on C syntax, it also includes objects and classes.

JavaScript: Programs in your browser
– JavaScript is nothing like Java; instead, it is a scripting language based on C syntax, with the addition of objects, and is used mainly in browsers.
– JavaScript is interpreted and a lot slower than compiled code, but it works well within a browser.
– Invented by Netscape, it was in the doldrums for years. It became popular again because of AJAX (Asynchronous JavaScript and XML), which allows parts of web pages to update from the server without redrawing the entire page.

25

    Position 2010  Position 2009  Delta in Position  Language        Ratings 2010  Delta 2009  Status
    1              1              =                  Java            17.509%       -2.29%      A
    2              2              =                  C               17.279%       +1.42%      A
    3              4              ↑                  PHP             9.908%        +0.42%      A
    4              3              ↓                  C++             9.610%        -0.75%      A
    5              5              =                  (Visual) Basic  6.574%        -1.71%      A
    6              7              ↑                  C#              4.264%        -0.06%      A
    7              6              ↓                  Python          4.230%        -0.95%      A
    8              9              ↑                  Perl            3.821%        +0.40%      A
    9              10             ↑                  Delphi          2.684%        -0.03%      A
    10             8              ↓↓                 JavaScript      2.651%        -0.96%      A
    11             11             =                  Ruby            2.327%        -0.27%      A
    12             32             ↑↑↑↑↑↑↑↑↑↑         Objective-C     1.970%        +1.79%      A
    13             -              ↑↑↑↑↑↑↑↑↑↑         Go              0.921%        +0.92%      A
    14             15             ↑                  SAS             0.769%        -0.03%      A
    15             13             ↓↓                 PL/SQL          0.737%        -0.31%      A
    16             22             ↑↑↑↑↑↑             MATLAB          0.661%        +0.20%      B
    17             17             =                  ABAP            0.639%        +0.00%      B
    18             16             ↓↓                 Pascal          0.603%        -0.13%      B
    19             19             =                  ActionScript    0.594%        +0.11%      B
    20             27             ↑↑↑↑↑↑↑            Fortran         0.563%        +0.24%      B

26

http://www.simplyhired.com/a/jobtrends/graph/q-Perl%2C+Ruby%2C+Python%2C+Php%2C+Javascript%2C+Flex%2C+Groovy/t-line

27


Summary

Other noteworthy programming languages
– Java, Python, Ruby, Go, …

Popularity arises for many reasons
– history (programmers are lazy), business, and functionality

Lasting wars
– Java vs. .NET (C will, in some form, live forever)
– Perl vs. PHP vs. Ruby (web programming)
– Perl vs. Python (scripting)
– There might be one dominant system language and one dominant scripting language in the future, but the field will more probably converge to coexistence.

Higher Level
» more readable
» faster to develop
» more syntactic sugar
» avoids careless mistakes

Lower Level
» easy to debug
» faster program
» general purpose
» powerful to do evil

28

Any Questions?

29

Algorithm

30

Algorithm

Specification
– a finite set of instructions that accomplishes a particular task
– criteria
  • input: zero or more quantities that are externally supplied
  • output: at least one quantity is produced
  • definiteness: each instruction is clear and unambiguous
  • finiteness: the algorithm terminates after a finite number of steps
  • effectiveness: every instruction is basic enough to be carried out

Representation
– a natural language, like English or Chinese
– a graphic, like flowcharts
– a computer language, like C

31

Algorithm

Selection Sort

From those integers that are currently unsorted, find the smallest and place it next in the sorted list

    i    [0]  [1]  [2]  [3]  [4]
    -    30   10   50   40   20
    0    10   30   50   40   20
    1    10   20   50   40   30
    2    10   20   30   40   50
    3    10   20   30   40   50
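The trace above can be reproduced with a short C function. This is a minimal sketch of selection sort under the slide's description, not necessarily the course's reference code; the name selection_sort is an assumption.

```c
/* Sort list[0..n-1] into ascending order, in place. */
void selection_sort(int list[], int n) {
    for (int i = 0; i < n - 1; ++i) {
        /* find the smallest element among list[i..n-1] */
        int min = i;
        for (int j = i + 1; j < n; ++j) {
            if (list[j] < list[min]) {
                min = j;
            }
        }
        /* place it next in the sorted part by swapping with list[i] */
        int tmp = list[i];
        list[i] = list[min];
        list[min] = tmp;
    }
}
```

Each value of i in the outer loop corresponds to one row of the trace: after the pass for i, the elements list[0..i] are in their final positions.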

32

33

Algorithm

Binary Search

    [0]  [1]  [2]  [3]  [4]  [5]  [6]
     8   14   26   30   43   50   52

Searching for 43

    left  right  middle  list[middle] : target
    0     6      3       30 < 43
    4     6      5       50 > 43
    4     4      4       43 == 43 (found)

Searching for 18

    left  right  middle  list[middle] : target
    0     6      3       30 > 18
    0     2      1       14 < 18
    2     2      2       26 > 18
    2     1      -       (not found)

Searching a sorted list

    while (there are more integers to check) {
        middle = (left + right) / 2;
        if (target < list[middle])
            right = middle - 1;
        else if (target == list[middle])
            return middle;
        else
            left = middle + 1;
    }

34

    /* three-way comparison: -1 if x < y, 0 if x == y, 1 if x > y */
    #define COMPARE(x, y) (((x) < (y)) ? -1 : ((x) == (y)) ? 0 : 1)

    int binsearch(int list[], int target, int left, int right)
    {
        int middle;
        while (left <= right) {
            middle = (left + right) / 2;
            switch (COMPARE(list[middle], target)) {
                case -1: left = middle + 1;
                         break;
                case 0:  return middle;
                case 1:  right = middle - 1;
            }
        }
        return -1;
    }

» Program 1.6: Searching an ordered list

35

Algorithm

Recursive Algorithms

Beginning programmers often view a function as something that is invoked (called) by another function
– it executes its code and then returns control to the calling function

This perspective ignores the fact that functions can call themselves (direct recursion)

They may also call other functions that invoke the calling function again (indirect recursion)
– extremely powerful
– frequently allows us to express an otherwise complex process in very clear terms

We should use a recursive algorithm when the problem itself is defined recursively
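Both kinds of recursion can be illustrated with small functions. The examples below are a sketch chosen for illustration and do not come from the slides: factorial is defined recursively (n! = n × (n-1)!, with 0! = 1), and is_even / is_odd call each other.

```c
/* Direct recursion: factorial calls itself. */
unsigned long factorial(unsigned int n) {
    if (n == 0) {
        return 1UL;                 /* base case stops the recursion */
    }
    return n * factorial(n - 1);    /* overflows for large n */
}

/* Indirect recursion: is_even calls is_odd, which calls is_even again. */
int is_odd(unsigned int n);         /* forward declaration */

int is_even(unsigned int n) {
    return (n == 0) ? 1 : is_odd(n - 1);
}

int is_odd(unsigned int n) {
    return (n == 0) ? 0 : is_even(n - 1);
}
```

Every recursive function needs a base case that is reached after finitely many calls; here it is n == 0 in all three functions.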

36

    /* COMPARE(x, y) is assumed to yield -1, 0, or 1 for x < y, x == y, x > y */
    int binsearch(int list[], int target, int left, int right)
    {
        int middle;
        if (left <= right) {    /* every branch returns, so no loop is needed */
            middle = (left + right) / 2;
            switch (COMPARE(list[middle], target)) {
                case -1: return binsearch(list, target, middle + 1, right);
                case 0:  return middle;
                case 1:  return binsearch(list, target, left, middle - 1);
            }
        }
        return -1;
    }

» Program 1.7: Recursive implementation of binary search

37

Any Questions?

38

Data Abstraction

39

Data Abstraction

Data type
– A data type is a collection of objects and a set of operations that act on those objects
– For example, the data type int consists of the objects {0, +1, -1, +2, -2, …, INT_MAX, INT_MIN} and the operations +, -, *, /, and %

The data types of C
– basic data types: char, int, float, and double
– group data types: array and struct
– pointer data type
– user-defined types

Abstract data type
– An abstract data type (ADT) is a data type that is organized in such a way that the specification of the objects and the operations on the objects is separated from the representation of the objects and the implementation of the operations.
– We know what it does, but not necessarily how it will do it.
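In C, one common way to approximate an ADT is to expose only a type name and function prototypes, and to keep the struct layout with the implementation. The following is a minimal sketch with hypothetical names (IntStack, stack_create, and so on); in a real project the specification part would live in a header file and the implementation in a separate .c file.

```c
#include <stdlib.h>   /* malloc, free */

/* Specification: the operations that callers may rely on. */
typedef struct int_stack IntStack;
IntStack *stack_create(void);               /* NULL if out of memory */
int  stack_push(IntStack *s, int value);    /* 1 on success, 0 if full */
int  stack_pop(IntStack *s, int *value);    /* 1 on success, 0 if empty */
void stack_destroy(IntStack *s);

/* Implementation: one possible representation, a fixed-size array. */
#define STACK_CAPACITY 16

struct int_stack {
    int items[STACK_CAPACITY];
    int top;                                /* number of stored items */
};

IntStack *stack_create(void) {
    IntStack *s = malloc(sizeof *s);
    if (s != NULL) {
        s->top = 0;
    }
    return s;
}

int stack_push(IntStack *s, int value) {
    if (s->top == STACK_CAPACITY) {
        return 0;
    }
    s->items[s->top++] = value;
    return 1;
}

int stack_pop(IntStack *s, int *value) {
    if (s->top == 0) {
        return 0;
    }
    *value = s->items[--s->top];
    return 1;
}

void stack_destroy(IntStack *s) {
    free(s);
}
```

Code that uses only the four operations keeps working if the array is later replaced by, say, a linked list, which is the separation the ADT definition describes.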

40

41

The array as an ADT

42

To evaluate which algorithm is better

43

Algorithm

Performance Analysis

Criteria
– Is it correct?
– Is it readable?
– …

Performance analysis (machine independent)
– space complexity: storage requirement
– time complexity: computing time

Performance measurement (machine dependent)

44

Performance Analysis

Space Complexity

$S(P) = C + S_P(I)$

Fixed space requirements ($C$)
– independent of the inputs and outputs
– instructions, constants, simple variables

Variable space requirements ($S_P(I)$)
– depend on the instance characteristic $I$
– number, size, and values of the inputs and outputs associated with $I$
– recursive stack space, including formal parameters, local variables, and return address

45

Any Questions?

46

Analyze someone's exercise

47

The recursion stack space needed is 6(n+1),

since the depth of recursion is n+1.
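The slide does not show the function being analyzed. Assuming it is a recursive summation over n elements, the sketch below records how deep the recursion goes; the name rsum_depth and its extra parameters are hypothetical, added only to observe the depth. The factor 6 on the slide would be the stack space per call on the machine assumed there and is not reproduced here.

```c
/* Recursively sum list[0..n-1].
 * level is the nesting level of this call (the first call passes 1);
 * *max_depth is raised to the deepest level reached. */
float rsum_depth(const float list[], int n, int level, int *max_depth) {
    if (level > *max_depth) {
        *max_depth = level;
    }
    if (n == 0) {
        return 0.0f;    /* base case: the (n+1)-th nested call */
    }
    return rsum_depth(list, n - 1, level + 1, max_depth) + list[n - 1];
}
```

For n elements the calls are made with n, n-1, …, 0, so n+1 stack frames are alive at the deepest point, and the variable space grows linearly with n.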

48

Performance Analysis

Time Complexity

$T(P) = C + T_P(I)$

The time $T(P)$ taken by a program $P$ is the sum of its compile time $C$ and its run (or execution) time $T_P(I)$

$T_P(I) = c_a \mathrm{ADD}(I) + c_s \mathrm{SUB}(I) + \dots$
– Program step: a syntactically or semantically meaningful program segment whose execution time is independent of the instance characteristics.
– Introduce a new variable, count, into the program
– Tabular method

49

Time Complexity

Iterative Summation

    int count = 0;                 // global step counter

    float sum(float list[], int n) {
        float tmp = 0; ++count;    // for assignment
        int i;
        for (i = 0; i < n; ++i) {
            ++count;               // for the for loop
            tmp += list[i];
            ++count;               // for assignment
        }
        ++count;                   // last execution of for
        ++count;                   // for return
        return tmp;
    }

2n+3 steps

50

Time Complexity

Tabular Method

    Statement                        s/e   Frequency   Total steps
    float sum(float list[], int n)   0     0           0
    {                                0     0           0
        float tmp = 0;               1     1           1
        int i;                       0     0           0
        for (i = 0; i < n; ++i)      1     n+1         n+1
            tmp += list[i];          1     n           n
        return tmp;                  1     1           1
    }                                0     0           0
    Total                                              2n+3

(s/e: steps per execution)
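The 2n+3 total can be checked by actually running the instrumented function. A minimal sketch, assuming the count-based version of sum from the Iterative Summation slide; the wrapper sum_steps is a hypothetical helper that resets the counter and reports the steps of one call.

```c
static long count;   /* step counter shared with sum() */

float sum(const float list[], int n) {
    float tmp = 0; ++count;        /* assignment */
    int i;
    for (i = 0; i < n; ++i) {
        ++count;                   /* one successful loop test */
        tmp += list[i];
        ++count;                   /* assignment */
    }
    ++count;                       /* final, failing loop test */
    ++count;                       /* return */
    return tmp;
}

/* Number of steps executed by one call of sum() on n elements. */
long sum_steps(const float list[], int n) {
    count = 0;
    (void)sum(list, n);
    return count;
}
```

For n = 5 this gives 2·5 + 3 = 13 steps, and for n = 0 it gives 3, in agreement with the table.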

51

Any Questions?

52

Asymptotic notation

53

Asymptotic Notation

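Basic Concepts

There are two programs, one with complexity $c_1 n^2 + c_2 n$ and the other with complexity $c_3 n$
– for sufficiently large values of $n$, the program with complexity $c_3 n$ will be faster than the one with complexity $c_1 n^2 + c_2 n$
– for small values of $n$, either could be faster
  • $c_1 = 1$, $c_2 = 2$, $c_3 = 100$: $c_1 n^2 + c_2 n \le c_3 n$ for $n \le 98$
  • $c_1 = 1$, $c_2 = 2$, $c_3 = 1000$: $c_1 n^2 + c_2 n \le c_3 n$ for $n \le 998$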
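The crossover points in the two examples can be found by brute force. A minimal sketch; the function name last_n_quadratic_wins is hypothetical, and it assumes c1 > 0 so that the quadratic program eventually becomes slower and the loop terminates.

```c
/* Largest n >= 1 for which c1*n*n + c2*n <= c3*n holds,
 * or 0 if it does not even hold for n = 1. Requires c1 > 0. */
long last_n_quadratic_wins(long c1, long c2, long c3) {
    long n = 0;
    /* keep advancing while the quadratic cost at n+1 is still no larger */
    while (c1 * (n + 1) * (n + 1) + c2 * (n + 1) <= c3 * (n + 1)) {
        ++n;
    }
    return n;
}
```

With c1 = 1 and c2 = 2, the result is 98 for c3 = 100 and 998 for c3 = 1000. Dividing the inequality by n shows why: it is equivalent to $c_1 n + c_2 \le c_3$, that is, $n \le (c_3 - c_2)/c_1$.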

54

Asymptotic Notation

$O$, $\Omega$, $\Theta$

$O$ [big "oh"]
– $f(n) = O(g(n))$ iff there exist positive constants $c$ and $n_0$ such that $f(n) \le c\,g(n)$ for all $n \ge n_0$
– upper bound, worst case

$\Omega$ [big omega]
– $f(n) = \Omega(g(n))$ (read as "f of n is big omega of g of n") iff there exist positive constants $c$ and $n_0$ such that $f(n) \ge c\,g(n)$ for all $n \ge n_0$
– lower bound, best case

$\Theta$ [big theta]
– $f(n) = \Theta(g(n))$ iff there exist positive constants $c_1$, $c_2$, and $n_0$ such that $c_1 g(n) \le f(n) \le c_2 g(n)$ for all $n \ge n_0$
– upper and lower bound

Notice the relationship between analyses and notations. For example, sometimes we would analyze the big theta of the worst case of an algorithm.

55

Asymptotic Notation

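Theorems

If $f(n) = a_m n^m + \dots + a_1 n + a_0$, then $f(n) = O(n^m)$

If $f(n) = a_m n^m + \dots + a_1 n + a_0$ and $a_m > 0$, then $f(n) = \Omega(n^m)$

If $f(n) = a_m n^m + \dots + a_1 n + a_0$ and $a_m > 0$, then $f(n) = \Theta(n^m)$

Examples
– $f(n) = 3n + 2$
  $3n + 2 \le 4n$ for all $n \ge 2$, $\therefore 3n + 2 = O(n)$
  $3n + 2 \ge 3n$ for all $n \ge 1$, $\therefore 3n + 2 = \Omega(n)$
  $3n \le 3n + 2 \le 4n$ for all $n \ge 2$, $\therefore 3n + 2 = \Theta(n)$
– $f(n) = 10n^2 + 4n + 2$
  $10n^2 + 4n + 2 \le 11n^2$ for all $n \ge 5$, $\therefore 10n^2 + 4n + 2 = O(n^2)$
  $10n^2 + 4n + 2 \ge n^2$ for all $n \ge 1$, $\therefore 10n^2 + 4n + 2 = \Omega(n^2)$
  $n^2 \le 10n^2 + 4n + 2 \le 11n^2$ for all $n \ge 5$, $\therefore 10n^2 + 4n + 2 = \Theta(n^2)$
– $10n^2 + 4n + 2 = O(n^2)$ // $10n^2 + 4n + 2 \le 11n^2$ for $n \ge 5$
– $6 \cdot 2^n + n^2 = O(2^n)$ // $6 \cdot 2^n + n^2 \le 7 \cdot 2^n$ for $n \ge 4$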

56

Practical Complexity

To get a feel for how the various functions grow with n, study the following three figures

57

58

59

60

Performance Measurement

Although performance analysis gives us a powerful tool for assessing an algorithm's space and time complexity, at some point we must also consider how the algorithm executes on our machine
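One simple way to measure running time in standard C is clock() from <time.h>. A minimal sketch; time_calls and the sample workload busy_sum are hypothetical names, and repeating the call several times is only a rough way to get above the timer's resolution.

```c
#include <time.h>   /* clock, clock_t, CLOCKS_PER_SEC */

/* Sample workload: add up the integers 0..n-1. */
void busy_sum(int n) {
    volatile long total = 0;   /* volatile keeps the loop from being optimized away */
    for (int i = 0; i < n; ++i) {
        total += i;
    }
}

/* CPU time in seconds consumed by `repeats` calls of work(n). */
double time_calls(void (*work)(int), int n, int repeats) {
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r) {
        work(n);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Dividing the result by repeats gives an estimate of the time per call; measuring for several values of n shows how the practical time grows on a particular machine.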

61

Any Questions?

62

Fibonacci

In: n
Out: the n-th Fibonacci number

Requirement
- a recursive version and an iterative version
- report
  - time/space complexity
  - practical time
  - code size (less meaningful in C)
- using C would be the best

Bonus
- an algorithm of O(n) time and O(1) space complexity
- the best time complexity is O(1)
- use a Makefile to automate the report

63

Fibonacci

A Reference

Kenji Mikawa and Ichiro Semba (2005). "An O(1) time algorithm for generating Fibonacci strings." Electronics and Communications in Japan (Part II: Electronics) 88(9): 67-72.

Provided by 陳偉銘
– "However, the majority in this course is male, so…"

64

Deadline

2010/3/23 23:59

Zip your code, a step-by-step README of how to execute the code, and anything worthy of extra credit. Email it to [email protected].

65

Recall that

http://www.dianadepasquale.com/ThinkingMonkey.jpg

66

gcc

Multiple Source Files

If there are multiple source files

    $ gcc file1.c file2.c -o myprog

Or

    $ gcc -c file1.c
    $ gcc -c file2.c
    $ gcc file1.o file2.o -o myprog

The second approach compiles the source files separately. If only file1.c was modified

    $ gcc -c file1.c
    $ gcc file1.o file2.o -o myprog

Notice that file2.c does not need to be recompiled.
– significant time savings when there are numerous source files

This process, though somewhat complicated, is generally handled automatically by a makefile.

67

But how do you know which files should be re-compiled?

http://faculty.northseattle.edu/tfurutani/che140/labbook_files/image005.jpg

68

Don't reinvent the wheel

http://www.morphcoaching.com/mypics/Wheel_invention.jpg

69

Makefile

70

Makefile

A Makefile is the configuration file used by a standard program called "make"

make is like a project manager in a graphical development environment, but includes many extra features

It allows an entire project to be intelligently built with one command on the command line
– make avoids rebuilding targets that are up to date, thus saving a lot of typing and compiling time
– Makefiles are largely similar to the Project and Workspace files you might be used to from Visual C++, JBuilder, Eclipse, etc.

71

Makefile

Filenames

When you type make, it looks for the default filenames in the current directory. For GNU make these are
– GNUmakefile
– makefile
– Makefile

If more than one of the above exists in the current directory, the first one in the order above is chosen

It is possible to name the Makefile any way you want and then tell make to interpret it
– $ make -f <your-filename>

72

Makefile

Dependencies

Sometimes one file depends on another file
– e.g. a C file depends on its header files

If a header file changes, the C files that #include that header file should be recompiled to take into account the changes to the header

[Figure: dependency graph. main.c and interface.h produce main.o; interface.c and interface.h produce interface.o; main.o and interface.o are linked into the final executable file (my_project)]

73

Makefile

A Simple Makefile "Rule"

    hello: hello.c
    <tab>gcc hello.c -o hello

Save this text as a file named "Makefile" in the same directory as the source code

To build the project, type "make"

The result is an executable named hello

If the hello file exists and its modification time is newer than that of hello.c, what should "make" do?
– nothing

74

Makefile

Generic Form of a Rule

    target1 target2 ...: prerequisite1 prerequisite2 ...
    <tab>command1
    <tab>command2

The target is the output file

Prerequisites are the files that are needed by the target (and that cause the target to be rebuilt if they change)

A command (or action) is the actual command that turns the prerequisites into the target

Characters after "#" are regarded as comments

Line oriented
– If the dependencies or commands are too long and you would like to span them across several lines for clarity and convenience, escape the end of each line with "\".
– Make sure NOT to use tabs for such lines.

75

Makefile

Target

make performs the actions that correspond to the specified targets

A target can be a filename that you want to generate or a phony target; the latter is especially useful for automating actions

Suggested phony targets from GNU
– all: default action (build/compile the executable)
– install: install the previously built executable
– clean: remove temporary files generated during the build process, usually the .o or .obj files

The first target listed in the file is used if no target is specified on the command line

76

Makefile

Multiple Targets

    MyProject: main.o interface.o
    <tab>gcc main.o interface.o -o MyProject
    main.o: main.c interface.h
    <tab>gcc -c main.c -o main.o
    interface.o: interface.c interface.h
    <tab>gcc -c interface.c -o interface.o

Build MyProject
– $ make
– $ make MyProject
– make will figure out the appropriate order from the prerequisites

Compile a single non-default target
– $ make main.o

[Figure: dependency graph. main.c and interface.h produce main.o; interface.c and interface.h produce interface.o; main.o and interface.o are linked into the final executable file (my_project)]

77

Makefile

Command

A list of actions needed to generate the rule's target

May be empty (the rule then just indicates dependencies)

Every action is usually a typical shell command that you would normally type to do the same thing

You can hide the echo of a command with a preceding '@' symbol

Every command MUST be preceded with a tab!
– This is how make identifies actions as opposed to variable assignments and targets. Do not indent actions with spaces!

Each action line invokes a sub-shell to execute the commands
– The sub-shell ends after that line
– Some changes (such as cd to another directory or setting shell variables) do not carry over to the next line
– Use the ';' symbol to execute multiple commands on one line

78

Makefile

Variables

In a large Makefile, it is a good idea to use variables to make later changes easy

For example, rather than typing 'gcc' in the command part of every rule, create a variable at the top of the Makefile
– CC = gcc

Commands can then be
– ${CC} source_file.c -o executable_file

Variable names are case sensitive

Use only letters, digits, and '_'

Both $(VAR) and ${VAR} are okay

79

Makefile

Other Features

Implicit rules
– GNU make provides implicit rules for common practices, such as building the object file foo.o from foo.c. For example, the following rule is unnecessary
  • foo.o: foo.c
    <tab>gcc -c -o foo.o foo.c

Phony target
– The target is always considered out of date, and thus its actions are always performed
– e.g. '.PHONY: clean'

Automatic variables (internal macros)
– $@ the filename of the target of the rule
– $< the name of the first prerequisite
– $? the names of all the prerequisites that are newer than the target
– $^ the names of all the prerequisites
– $* the stem of the target filename (the part matched by % in a pattern rule)

Flow control
– ifeq, ifneq, ifdef, ifndef, for, if-then-else, …
