Converting Large NCBI Databases into SAS Rosa SJ Lin Division of Statistical Genomics Washington University in Saint Louis June 30, 2008

Converting Large NCBI Databases into SAS

Rosa SJ Lin

Division of Statistical Genomics Washington University in Saint Louis

June 30, 2008


NCBI (http://www.ncbi.nlm.nih.gov)

Contains a large number of databases. The most important are:
- GenBank
- PubMed
- RefSeq
- Online Mendelian Inheritance in Man (OMIM)
- dbSNP


dbSNP Database


NCBI dbSNP

Contains information about SNPs

Submitted data is given an ss number (e.g. ss52079780)

If the data meets the criteria, a reference SNP is created, which has an rs number (e.g. rs530)


dbSNP Data (1): each record spans a varying number of lines, and the lines vary in length


dbSNP Data (2)


dbSNP Data (3)


Various uses of the SCAN and INDEX functions to assist in reading data (1)

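data ncbisnp;
  length rs $12;
  infile din firstobs=1 missover pad;
  input snpline $132.;                       /* read each raw line whole */
  if index(snpline, "updated") > 0 then do;  /* line that carries the rs number */
    rs = compress(scan(snpline, 1, "|"));    /* first |-delimited field, blanks removed */
    output;
  end;
run;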
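To make the roles of INDEX, SCAN and COMPRESS concrete, the following minimal sketch applies them to a single hard-coded string. The rs number is the one quoted earlier in the slides; every other field on the line is invented for illustration and is not taken from a real dbSNP record.

```sas
/* Sketch only: the text after "rs530" is made up to show the functions at work. */
data _null_;
  length snpline $132 rs $12;
  snpline = "rs530 | human | snp | updated 2008-06-30";

  pos    = index(snpline, "updated");  /* position of the substring, 0 if absent */
  field1 = scan(snpline, 1, "|");      /* first pipe-delimited field, trailing blank included */
  rs     = compress(field1);           /* COMPRESS with one argument strips blanks */

  put pos= rs=;                        /* writes the values to the SAS log */
run;
```

INDEX acts as the test for which kind of line was read, SCAN picks a field by position, and COMPRESS removes the blanks that surround the pipe delimiters.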


Various uses of the SCAN and INDEX functions to assist in reading data (2)

if index(snpline, "alleles=") > 0 then do;
  alleles = substr(compress(scan(snpline, 2, "|")), 9);  /* text after "alleles=" */
  output;
end;

if index(snpline, "assembly=reference") > 0 then do;
  chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);  /* numeric chromosome */
  posc = compress(scan(snpline, 4, "|"));
  output;
end;
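The offsets 9 and 5 in the SUBSTR calls are easier to follow on sample input. The sketch below runs the same expressions on two hard-coded lines; both lines are invented and only mimic the "field | key=value | ..." shape that the code above expects.

```sas
/* Sketch only: both sample lines are made up for illustration. */
data _null_;
  length snpline $132 alleles $20 posc $30;

  snpline = "SNP | alleles='C/T' | het=0";
  if index(snpline, "alleles=") > 0 then do;
    /* "alleles=" is 8 characters long, so the value starts at position 9 */
    alleles = substr(compress(scan(snpline, 2, "|")), 9);
    put alleles=;
  end;

  snpline = "CTG | assembly=reference | chr=11 | chr-pos=12345";
  if index(snpline, "assembly=reference") > 0 then do;
    /* "chr=" is 4 characters long, so the chromosome starts at position 5 */
    chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
    /* the whole fourth field is kept, including its "chr-pos=" prefix */
    posc  = compress(scan(snpline, 4, "|"));
    put chrom= posc=;
  end;
run;
```

Two things to watch for: posc keeps the key as well as the value, so a further SUBSTR would be needed to isolate the position, and the 8. informat yields a missing value for non-numeric chromosome labels such as X.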


Use the RETAIN statement: it causes a variable to keep its value from one iteration of the DATA step to the next.

retain markname rs alleles;
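Because each SNP record spans several input lines, RETAIN is what lets values read from an earlier line reach the observation that is finally written. The sketch below is one possible arrangement, not necessarily the author's full program: it reads invented in-stream data, omits markname (which is not defined in the code shown here), and writes one row per SNP when the assembly line is reached. The dataset name ncbisnp_demo and the second rs number are hypothetical.

```sas
/* Sketch only: the in-stream records are invented and heavily simplified. */
data ncbisnp_demo;
  length rs $12 alleles $20 posc $30 snpline $132;
  retain rs alleles;                 /* keep values across DATA step iterations */
  infile datalines truncover;        /* TRUNCOVER copes with short in-stream lines */
  input snpline $132.;

  if index(snpline, "updated") > 0 then do;
    rs = compress(scan(snpline, 1, "|"));
    alleles = " ";                   /* reset so the previous SNP's value cannot leak */
  end;
  else if index(snpline, "alleles=") > 0 then
    alleles = substr(compress(scan(snpline, 2, "|")), 9);
  else if index(snpline, "assembly=reference") > 0 then do;
    chrom = input(substr(compress(scan(snpline, 3, "|")), 5), 8.);
    posc  = compress(scan(snpline, 4, "|"));
    output;                          /* rs and alleles are still available here */
  end;

  drop snpline;
  datalines;
rs530 | human | updated 2008-06-30
SNP | alleles='C/T' | het=0
CTG | assembly=reference | chr=11 | chr-pos=12345

rs531 | human | updated 2008-06-30
SNP | alleles='A/G' | het=0
CTG | assembly=reference | chr=2 | chr-pos=67890
;
run;

proc print data=ncbisnp_demo noobs;
run;
```

Without the RETAIN statement, rs and alleles would be reset to missing at the start of every iteration, so the row written at the assembly line would carry only chrom and posc.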


dbSNP Data (4)


Output SAS Dataset


Readings:

Kim L Kolbe et al., SUGI 22: “Advanced Techniques for Reading Difficult and Unusual Flat Files”.

Clinton S Rickards, SUGI 24: “Reading External Files Using SAS® Software”.