Converting Large NCBI Databases into SAS

Rosa SJ Lin

Division of Statistical Genomics Washington University in Saint Louis

June 30, 2008

NCBI (http://www.ncbi.nlm.nih.gov)

Contains a large number of databases. The most important are:

- GenBank
- PubMed
- RefSeq
- Online Mendelian Inheritance in Man (OMIM)
- dbSNP

dbSNP Database

NCBI dbSNP

Contains information about SNPs

Submitted data is given an ss number (e.g. ss52079780)

If the data meets the criteria, a reference SNP is created, which has an rs number (e.g. rs530)

dbSNP Data (1): each record spans a varying number of lines, and each line has a varying length

dbSNP Data (2)

dbSNP Data (3)

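Various uses of the SCAN and INDEX functions to assist in reading data (1)

    /* din is a fileref; it is assumed to be assigned to the dbSNP flat file beforehand */
    data ncbisnp;
        length rs $12;
        infile din firstobs=1 missover pad;
        input snpline $132.;   /* read each line whole, then parse it */
        /* a line containing "updated" holds the rs number in its first |-delimited field */
        if index(snpline, "updated") > 0 then do;
            rs = compress(scan(snpline, 1, "|"));
            output;
        end;
    run;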

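Various uses of the SCAN and INDEX functions to assist in reading data (2)

    /* fragments: these tests belong inside a DATA step that has read snpline */
    if index(snpline, "alleles=") > 0 then do;
        /* "alleles=" is 8 characters long, so the value starts at position 9 */
        alleles = substr(compress(scan(snpline, 2, "|")), 9);
        output;
    end;

    if index(snpline, "assembly=reference") > 0 then do;
        /* third field: drop the first 4 characters and read the rest as a number */
        chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
        posc = compress(scan(snpline, 4, "|"));
        output;
    end;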
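To see what each function returns, the same expressions can be tried on literal strings. This is only a sketch: the three sample lines are hypothetical, shaped like pipe-delimited records, and are not taken from a real dbSNP file.

```sas
/* Hypothetical sample lines; values are made up for illustration only. */
data _null_;
    length snpline $132 rs $12 alleles $20 posc $20;

    snpline = "rs1 | human | updated 2008-01-01";
    /* INDEX returns the position of the substring, or 0 when it is absent */
    if index(snpline, "updated") > 0 then
        rs = compress(scan(snpline, 1, "|"));   /* first field, blanks removed */
    put rs=;

    snpline = "SNP | alleles='A/G' | het=0.5";
    /* second field is alleles='A/G'; skip the 8-character prefix "alleles=" */
    alleles = substr(compress(scan(snpline, 2, "|")), 9);
    put alleles=;

    snpline = "CTG | assembly=reference | chr=1 | chr-pos=100";
    /* third field is chr=1; skip "chr=" and convert the rest to a number */
    chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
    /* fourth field is kept as text, prefix included */
    posc = compress(scan(snpline, 4, "|"));
    put chrom= posc=;
run;
```

The PUT statements write the parsed values to the SAS log, which makes it easy to check the field positions before running the step against the full file. Note that a non-numeric chromosome label would give a missing value for chrom with the 8. informat.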

Use the RETAIN statement: it causes a variable to keep its value from one iteration of the DATA step to the next.

retain markname rs alleles;
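A minimal sketch of how RETAIN can combine the lines of one record into a single observation. It assumes hypothetical in-stream sample data, and it differs from the slides in two labeled ways: it writes one row per record (OUTPUT only on the last line of the record) and it resets alleles when a new record starts. The dataset name snps is also made up for the example.

```sas
/* Assumption: each record has an "updated" line, then an "alleles=" line,
   then an "assembly=reference" line. The sample data below is hypothetical. */
data snps;
    length rs $12 alleles $20 snpline $132;
    retain rs alleles;                 /* keep values across input lines */
    infile datalines truncover;
    input snpline $132.;

    if index(snpline, "updated") > 0 then do;
        rs = compress(scan(snpline, 1, "|"));
        alleles = " ";                 /* reset so the previous record cannot leak in */
    end;
    else if index(snpline, "alleles=") > 0 then
        alleles = substr(compress(scan(snpline, 2, "|")), 9);
    else if index(snpline, "assembly=reference") > 0 then do;
        chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
        output;                        /* one observation per record */
    end;

    drop snpline;
    datalines;
rs1 | human | updated 2008-01-01
SNP | alleles='A/G' | het=0.5
CTG | assembly=reference | chr=1 | chr-pos=100
rs2 | human | updated 2008-01-02
SNP | alleles='C/T' | het=0.3
CTG | assembly=reference | chr=2 | chr-pos=200
;
run;

proc print data=snps;
run;
```

Without RETAIN, rs and alleles would be reset to missing at the start of every iteration, so the row written on the third line of each record would have lost the values read from the first two.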

dbSNP Data (4)

Output SAS Dataset

Readings:

Kim L Kolbe etc., SUGI 22: “Advanced Techniques for Reading Difficult and Unusual Flat Files”.

Clinton S Rickards, SUGI 24: “Reading External Files Using SAS® Software”.