Enabling Discoveries at High Throughput: Small molecule and RNAi HTS at the NCTT. Rajarshi Guha, NIH Center for Translational Therapeutics. May 3, 2011


Page 1: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Enabling Discoveries at High Throughput: Small molecule and RNAi HTS at the NCTT

Rajarshi Guha, NIH Center for Translational Therapeutics

May 3, 2011

Page 2: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Outline

• Informatics for small molecule & RNAi screening
• HCA & automated decision making
  – Pretty pictures can lead to more efficient screens
• Large scale cheminformatics
  – We can do it, but do we need to?

Page 3: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

NIH Chemical Genomics Center

• Founded 2004 as part of NIH Roadmap Molecular Libraries Initiative
  – NCGC staffed with 90+ scientists: biologists, chemists, informaticians, engineers
  – Post-doc program
• Mission
  – MLPCN (screening & chemical synthesis; compound repository; PubChem database; funding for assay, library and technology development)
  – Develop new chemical probes for basic research and leads for therapeutic development, particularly for rare/neglected diseases
  – New paradigms & applications of HTS for chemical biology / chemical genomics
• All NCGC projects are collaborations with a target or disease expert; currently >200 collaborations with investigators worldwide

Page 4: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Project Diversity

[Figure: project diversity by (A) disease areas, (B) target types, (C) detection methods]

Page 5: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Assay formats & detection methods in HTS

Assay formats

• ligand binding
  – competition binding
• enzymatic activity
  – biochemical
  – cellular
• ion or ligand transport
  – ion-sensitive dyes
  – membrane potential dyes
• protein-protein interactions
  – biochemical
  – cellular
• cellular signal transduction
  – reporter gene
  – second messenger
• phenotypic
  – protein redistribution
  – cell viability
  – etc.

Detection modes

• luminescence
  – chemiluminescence
  – bioluminescence
  – BRET
  – ALPHA
• fluorescence
  – FI
  – FRET
  – TRF
  – TR-FRET
  – FP
  – FCS
  – FLT
• absorbance
• radioactivity
  – SPA

Page 6: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Detector Systems: "Reading the assay"

• ViewLux: multimodal CCD-based imager
  – Abs., Luminescence, Fluorescence
• Envision: PMT-based reader
  – ALPHA
• Acumen Explorer: laser scanning imager
  – "static" cell cytometry
• Hamamatsu FDS 7000 Series: rapid kinetics
• INCell1000: subcellular imaging

Page 7: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

qHTS: High Throughput Dose Response

1536-well plates, inter-plate dilution series. Assay volumes 2–5 μL

Assay concentration ranges over 4 logs (high: ~100 μM)

Automated concentration-response data collection, ~1 CRC/sec

Informatics pipeline. Automated curve fitting and classification. 300K samples

[Figure: panels A, B, C]

Page 8: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Informatics Activities

• High throughput curve fitting
• Data integration, automated cherry picking
• SAR algorithms
  – QSAR modeling
  – Fragment based analysis
  – Activity cliffs
• Tools: standardizer, tautomers, fragment activity browser, kinome browser and more
• RNAi hit selection, OTE analysis
• High content analysis

Page 9: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Kinome Navigator

• Browse kinase panel data
• Currently focused on the Abbott dataset
• View
  – Fragments
  – Target pairs
  – Kinome overlay

http://tripod.nih.gov

Page 10: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Fragment Browser

• View activities on a fragment-wise basis
• Compare activity distributions by fragment
• Currently based around ChEMBL assays, but users can browse their own compounds & activities

http://tripod.nih.gov

Page 11: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Structure Activity Landscapes

• Rugged gorges or rolling hills?
  – Small structural changes associated with large activity changes represent steep slopes in the landscape
  – But traditionally, QSAR assumes gentle slopes
  – We can characterize the landscape using SALI

Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535

Page 12: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

What Can We Do With SALI?

• SALI characterizes cliffs & non-cliffs
• For a given molecular representation, SALI gives us an idea of the smoothness of the SAR landscape
• Models try to encode this landscape
• Use the landscape to guide descriptor or model selection

Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
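
The SALI value for a pair of molecules, as defined in the Guha & Van Drie paper cited above, is the absolute activity difference divided by one minus the structural similarity. The following is a minimal Java sketch of that ratio. The class name, method names, example values and the small floor on the denominator (to avoid dividing by zero for identical structures) are illustrative assumptions, and the similarity values would come from whichever fingerprint and metric is in use.

```java
/** Illustrative SALI calculation; class and method names are hypothetical. */
final class SaliSketch {
    /** Floor on (1 - similarity) so identical structures do not divide by zero (an assumption). */
    static final double EPS = 1e-6;

    /** SALI(i,j) = |A_i - A_j| / (1 - sim(i,j)). */
    static double sali(double activityI, double activityJ, double similarity) {
        if (similarity < 0.0 || similarity > 1.0) {
            throw new IllegalArgumentException("similarity must lie in [0, 1]");
        }
        double denominator = Math.max(1.0 - similarity, EPS);
        return Math.abs(activityI - activityJ) / denominator;
    }

    /** Symmetric pairwise SALI matrix; the diagonal is left at zero. */
    static double[][] saliMatrix(double[] activity, double[][] similarity) {
        int n = activity.length;
        double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double value = sali(activity[i], activity[j], similarity[i][j]);
                result[i][j] = value;
                result[j][i] = value;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up activities (pIC50) and similarities, for illustration only.
        double[] pIC50 = {7.0, 5.0, 6.9};
        double[][] similarity = {
            {1.0, 0.9, 0.5},
            {0.9, 1.0, 0.4},
            {0.5, 0.4, 1.0}
        };
        double[][] matrix = saliMatrix(pIC50, similarity);
        System.out.printf("SALI(0,1) = %.2f%n", matrix[0][1]);
        System.out.printf("SALI(0,2) = %.2f%n", matrix[0][2]);
    }
}
```

In this toy data the first pair is very similar but differs by two log units, so it gets a large SALI value (a cliff), while the third molecule is dissimilar and almost equipotent, so its value is small.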

Page 13: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Predicting the Landscape

• Rather than predicting activity directly, we can try to predict the SAR landscape
• Implies that we attempt to directly predict cliffs
  – Observations are now pairs of molecules

Scheiber et al, Statistical Analysis and Data Mining, 2009, 2, 115-122

Original pIC50: RMSE = 0.97

SALI, AbsDiff: RMSE = 1.10

SALI, GeoMean: RMSE = 1.04

Page 14: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Data Integration

• It's nice to simplify data, but we can still be faced with a multitude of data types
• We want to explore these data in a linked fashion
• How we explore and what we explore is generally influenced by the task at hand
• At some point, we want to make inferences over all the data

Page 15: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Data Integration

[Figure: the user's network linked to a network of public data]

Content:
– Drugs
– Compounds
– Scaffolds
– Assays
– Genes
– Targets
– Pathways
– Diseases
– Clinical Trials
– Documents

Links:
– Manually curated
– Derived from algorithms

Page 16: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Record View of an Assay

Page 17: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Access Disease Hierarchy & Network

Page 18: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Articles, Patents, Drug Labels, …

Page 19: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

NPC Browser

http://tripod.nih.gov/npc/

Page 20: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Going Beyond Exploration?

• Simply being able to explore data in an integrated manner is useful as an idea generator
• Can we integrate heterogeneous data types & sources to get a systems level view?
  – Current research problem in genomics and systems biology
  – Some attempts have been made to merge chemical data with other data types

Young, D.W. et al, Nat. Chem. Biol., 2008, 4, 59-68

Page 21: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

RNAi Facility Mission

• Perform collaborative genome-wide RNAi screening-based projects with intramural investigators
• Advance the science of RNAi and miRNA screening and informatics via technology development to improve efficiency and reliability and to reduce costs

Range of Assays

• Simple Phenotypes (viability, cytotoxicity, oxidative stress, etc.)
• Pathway (reporter assays, e.g. luciferase, β-lactamase)
• Complex Phenotypes (high-content imaging, cell cycle, translocation, etc.)

Page 22: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

RNAi Effectors

RNAi effectors provide an excellent way to conduct gene-specific loss of function studies.

Page 23: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Issues Using RNAi Effectors

• RNAi effectors give a knockdown, not a knockout (70% - 80% is considered good). Therefore, they may not silence enough to give a phenotype even if the target is involved in what you are assaying for.
• RNAi effectors induce off-target effects.

Page 24: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Examples of Current Projects

• Protein Quality Control
• DNA Re-replication
• Base Excision Repair
• DNA Damage – ELG1 stabilization
• Antioxidant Response
• Hypoxia
• TNFα Response
• Interferon Response
• iPS to RPE
• Poxvirus
• Respiratory Viruses
• Lysosomal Storage Disorders
• Parkinson's – Mitochondrial Quality Control
• Ewing's Sarcoma
• Drug Modifiers, Pancreatic Cancer
• Drug Modifiers, TOP1 Clinical Agents
• Immunotoxin-Mediated Cell Death

Page 25: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

User Accessible Tools

Page 26: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

RNAi Libraries

• Qiagen Human Druggable Genome Library: >7,000 genes, 4 unique siRNAs per gene
• Ambion Human Genome-Wide Library: 21,585 genes, 3 unique siRNAs per gene
• Ambion Mouse Genome-Wide Library: 17,582 genes, 3 unique siRNAs per gene
• Dharmacon Human Duet Genome-Wide siRNA Libraries: 18,236 genes, siRNA pools
• Kinome Libraries: purchased from a number of vendors
• Human and Mouse miRNA Mimic Libraries & Human miRNA Inhibitor Library

• Smaller libraries (e.g. kinome and miRNA mimics) will enable high-impact screens in systems less amenable to high throughput applications.
• Considerations are being made for additional species and shRNA resources.

Page 27: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Druggable Genome Screening Campaign

• Over 7,000 genes, 4 unique siRNAs per gene (≈36,000 wells).
• 85 genes were selected for follow-up through a variety of threshold-based selection schemes.
• 27 genes were validated as confident hits using siRNAs from multiple vendors.

[Figure: pseudo-colored blue/green ratio (normalized to plate median)]

[Figure: percent reduction in NF-kB signal, average inhibition (%), for TNFα Receptor, IKKα, RELA and NEMO; Qiagen siRNAs vs Ambion siRNAs]

Significant enrichment for core NF-kB components

Page 28: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Druggable Genome Screening Campaign

[Figure: proteasome schematic showing the 20S proteasome (α1-7 and β1-7 rings) and the 19S regulator particles (RPT and RPN subunits). Murata et al, Nature Reviews Mol. Cell Biol.]

[Figure: percent reduction in NF-kB signal, average inhibition (%), by PSM gene (A1–A7, B2–B4, C4, C5, D2, D7, D14), grouped by PSM protein class (α core 20S, β core 20S, RPT 19S, RPN 19S); Qiagen vs Ambion siRNAs]

Significant enrichment for proteins that form the 26S proteasome

An additional 34 genes remain inconclusive but noteworthy hits that require further study. Some of these tie into the core NF-kB pathway.

Page 29: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Seed Sequence Analysis

Other instances of the seeds incorporated within siRNAs targeting PSMA3 do not exhibit significant activity, adding to the likelihood of this being an on-target effect.

Page 30: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Seed Sequence Analysis

Other instances of the seeds within the active siRNAs targeting SLC24A1 tend to downregulate the NF-kB reporter, adding to the likelihood of this being an off-target effect.

Page 31: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

RNAi & Small Molecule Screens

Goal: Develop systems level view of small molecule activity

• Reuse pre-existing MLI data
• Develop new annotated libraries
• Run parallel RNAi screen

TACGGGAACTACCATAATTTA

CAGCATGAGTACTACAGGCCA

What targets mediate activity of siRNA and compound

Pathway elucidation, identification of interactions

Target ID and validation

Link RNAi generated pathway perturbations to small molecule activities. Could provide insight into polypharmacology

Page 32: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Matching Phenotypes

[Figure panels: RNAi; Small Molecule]

Page 33: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Merging Screening Technologies

High throughput screening
• Lead identification
• Single (few) read outs
• High-throughput
• Moderate data volumes

High content screening
• Phenotypic profiling
• Multiple parameters
• Moderate throughput
• Very large data volumes

• We'd like to combine the technologies, to obtain rich high-resolution data at high speed
• Is this feasible? What are the trade-offs?

Page 34: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Merging Screening Technologies

• A simple solution is to run HTS & HCS as separate, primary & secondary screens
• Alternatively, Wells to Cells
  – Integrate HTS & HCS in a single screen using a combined platform for robotics & real time automated HTS analytics
  – Selective imaging of interesting wells

Page 35: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Wells to Cells Workflow

• Sequential qHTS using laser scanning cytometry followed by high-res microscopy
• Unit of work is a plate series
• The same aliquot is analyzed by both techniques
• A message based system
• The key is deciding which wells go through the workflow

Page 36: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Wells to Cells Assays

• Cell cycle, cell translocation, DNA re-replication
• All assays run against LOPAC1280
• Consistency between cytometry & microscopy is measured by the R² between log AC50 values
  – Cell cycle, 0.94 – 0.96
  – Cell translocation, 0.66 – 0.94
  – DNA re-replication, still in progress
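
As a rough illustration of the consistency measure above, the sketch below computes R² between two sets of log AC50 values. It assumes that R² means the squared Pearson correlation (the slide does not say which definition was used); the class name, method name and the example values are made up.

```java
/** Illustrative R-squared between paired measurements; names and data are hypothetical. */
final class RSquaredSketch {
    /** Squared Pearson correlation between x and y (assumed definition of R²). */
    static double rSquared(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 values");
        }
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxx = 0.0;
        double syy = 0.0;
        double sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxx += dx * dx;
            syy += dy * dy;
            sxy += dx * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            throw new IllegalArgumentException("R-squared is undefined for constant input");
        }
        return (sxy * sxy) / (sxx * syy);
    }

    public static void main(String[] args) {
        // Made-up log AC50 values for the same compounds on two instruments.
        double[] cytometry = {-6.1, -5.4, -7.0, -4.8};
        double[] microscopy = {-6.0, -5.6, -6.8, -4.9};
        System.out.printf("R^2 = %.3f%n", rSquared(cytometry, microscopy));
    }
}
```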

Page 37: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Cell Translocation Example Hits

Page 38: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Informatics Platform

• Advanced correction and normalization methods
• Sophisticated curve fitting algorithm
• Good performance, allows parallelization of the entire workflow

[Figure label: InCell Layout File]

Page 39: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Why Messaging?

• A messaging architecture allows for significant flexibility
  – Persistent, can be kept for process tracking, reporting
  – Asynchronous, allows individual components of the workflow to proceed at their own pace
  – Modular, new components can be introduced at any time without redesigning the whole workflow
• We employ Oracle AQ, but any message queue can be employed
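
The asynchronous hand-off described above can be sketched with an in-memory queue. This is not Oracle AQ and is not persistent; it uses `java.util.concurrent.BlockingQueue` as a stand-in, and the message class, the stop sentinel and the well identifiers are all hypothetical. The producer plays the role of the screening stage that flags wells, and the consumer plays the role of the imaging stage that works through them at its own pace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative producer/consumer workflow; an in-memory stand-in for a real message queue. */
final class MessagingSketch {
    /** Hypothetical message naming a well that should be imaged. */
    static final class WellMessage {
        final String wellId;
        WellMessage(String wellId) { this.wellId = wellId; }
    }

    /** Sentinel telling the consumer that no more messages will arrive. */
    static final WellMessage STOP = new WellMessage("STOP");

    /** Queues the flagged wells, lets a separate thread consume them, and returns what it processed. */
    static List<String> run(String... flaggedWells) {
        BlockingQueue<WellMessage> queue = new LinkedBlockingQueue<>();
        List<String> imaged = new ArrayList<>();

        // Consumer: blocks on take() until a message is available.
        Thread imager = new Thread(() -> {
            try {
                for (WellMessage m = queue.take(); m != STOP; m = queue.take()) {
                    imaged.add("imaged " + m.wellId);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        imager.start();

        // Producer: enqueue and move on, without waiting for each well to be imaged.
        try {
            for (String well : flaggedWells) {
                queue.put(new WellMessage(well));
            }
            queue.put(STOP);
            imager.join(); // join() makes the consumer's writes to 'imaged' visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running the workflow", e);
        }
        return imaged;
    }

    public static void main(String[] args) {
        System.out.println(run("P1:A01", "P1:B07", "P2:H12"));
    }
}
```

A new stage (for example, a logger) could be added as another consumer on its own queue without touching the producer, which is the modularity argument made above.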

Page 40: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Handling Multiple Platforms

• Current examples employ InCell hardware
• We also use Molecular Devices hardware
• As a result we have two orthogonal image stores / databases
• Need to integrate them
  – Support seamless data browsing across multiple screens irrespective of imaging platform used
  – Support analytics external to vendor code

Page 41: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

A Unified Interface

• A client sees a single, simple interface to screening image data
• Transparently extract image data via the MetaXpress database or via custom code
• Currently the interface addresses image serving
• Unified metadata interface in the works

http://host/rest/protocol/plate/well/image

Page 42: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Trade-offs & Opportunities

• Automation reduces the ability to handle unforeseen errors
  – Dispense errors and other plate problems
  – Well selection based on curve classes may need to be modified on the fly
• Well selection does not consider SAR
  – Wells are selected independently of each other
  – If we could model SAR on the fly (or from validation screens), we'd select multiple wells, to obtain positive and negative results

Page 43: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Cloud Computing & Cheminformatics

• Cloud computing is a hot topic
• A number of examples of computational chemistry / cheminformatics on the cloud
  – MolPlex, hBar, Numerate, Wingu, Scilligence, Pfizer
• Many examples use the cloud for remote storage or remote (hosted) computations
• But providers such as Amazon allow us to run distributed computing applications on the cloud

Page 44: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Map/Reduce

• Map/Reduce is a programming model for efficient distributed computing
• M/R made "famous" by Google, but the idea has been around for a long time
• It works like a Unix pipeline:
  – cat input | grep | sort | uniq -c | cat > output
  – Input | Map | Shuffle & Sort | Reduce | Output
• Efficiency from
  – Streaming through data, reducing seeks
  – Pipelining

Owen O'Malley, http://bit.ly/ecHPvB

Page 45: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Map/Reduce

Owen O'Malley, http://bit.ly/ecHPvB

Page 46: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Hadoop & Cheminformatics

• Hadoop is an Open Source implementation of the map/reduce paradigm
• Hadoop is a framework for scalable, distributed computing
  – Hadoop, HDFS, Hive, Pig
• Importantly, you can play with all this on your laptop and just copy files to the big cluster when you're ready for production

Page 47: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Why Hadoop?

• Simple way to make use of large clusters without MPI etc.
• AWS supports Hadoop, so it is easy to scale up to 100's or 1000's of cores
• Great for Java code, but non-Java code can also make use of Hadoop
• M/R can be applied to a lot of problems, but one of the simplest is to use it as a "chunker"

Page 48: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Cheminformatics in Parallel

• Many cheminformatics problems are data parallel
  – Chunk the data and apply the same technique over each chunk
• This makes many problems amenable to M/R
  – Substructure / pharmacophore search
  – Descriptor calculations, virtual screening
  – Model development (?)
• In general, each chunk is processed on a distinct node, so the code itself can be non-parallel
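
The "chunk the data, run serial code on each chunk" idea can be sketched without Hadoop. The example below uses a local thread pool in place of cluster nodes, and a deliberately naive string test (does the SMILES contain an uppercase "N"?) in place of a real substructure search. The class, the method names and the matching rule are illustrative assumptions, not the code from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative data-parallel chunking; threads stand in for cluster nodes. */
final class ChunkerSketch {
    /** Splits items into consecutive chunks of at most chunkSize elements. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    /** Serial, non-parallel work on one chunk. Toy rule: count SMILES containing "N". */
    static int countMatchesSerial(List<String> smiles) {
        int count = 0;
        for (String s : smiles) {
            if (s.contains("N")) {
                count++;
            }
        }
        return count;
    }

    /** Runs the serial code on every chunk concurrently and sums the partial counts. */
    static int countMatchesParallel(List<String> smiles, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> partials = new ArrayList<>();
            for (List<String> c : chunk(smiles, chunkSize)) {
                partials.add(pool.submit(() -> countMatchesSerial(c)));
            }
            int total = 0;
            for (Future<Integer> p : partials) {
                total += p.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> smiles = Arrays.asList("CCO", "CCN", "c1ccncc1", "NC(=O)C(=O)N", "CC");
        System.out.println(countMatchesParallel(smiles, 2, 2));
    }
}
```

The per-chunk method knows nothing about threads or chunks, which is the point of the last bullet: existing serial toolkit code can be reused unchanged.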

Page 49: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Cheminformatics in Parallel

See http://blog.rguha.net/?tag=hadoop for examples & code

Page 50: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Substructure Searching

• Substructure searching is a trivial extension of atom counting
• If a structure matches, emit (name,1)
• Otherwise (name,0)
• Reducer simply outputs tuples of the form (name,1)

    public class SubSearch {
        …
        // sp, sqt, one and zero are not shown on the slide; presumably they are
        // declared in the elided part of the class.
        public static class MoleculeMapper extends Mapper<Object, Text, Text, IntWritable> {

            private Text matches = new Text();
            private String pattern;

            // Read the SMARTS pattern from the job configuration.
            public void setup(Context context) {
                pattern = context.getConfiguration().get("net.rguha.dc.data.pattern");
            }

            // One input line is one SMILES; emit (title, 1) on a match, (title, 0) otherwise.
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    IAtomContainer molecule = sp.parseSmiles(value.toString());
                    sqt.setSmarts(pattern);
                    boolean matched = sqt.matches(molecule);
                    matches.set((String) molecule.getProperty(CDKConstants.TITLE));
                    if (matched) context.write(matches, one);
                    else context.write(matches, zero);
                } catch (CDKException e) {
                    e.printStackTrace();
                }
            }
        }

        public static class SMARTSMatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            // Keep only the matching molecules: write (name, 1) and drop (name, 0).
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                for (IntWritable val : values) {
                    if (val.compareTo(one) == 0) {
                        result.set(1);
                        context.write(key, result);
                    }
                }
            }
        }
    }

Page 51: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Running on AWS

• All the code was debugged on my laptop with relatively small files
• To test the scalability, I shifted everything to AWS
  – Pharmacophore search
  – 136K structures, single conformer, 560MB
  – Created a single JAR file with CDK & application code
  – Uploaded data files to S3
• Total cost of experiments was ~$10

Page 52: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

But I Don't Want to Write Programs

• All these examples require us to write full-fledged Java classes
• An easier way is to use Pig & Pig Latin, a platform and query language built on top of Hadoop
• Lets us write SQL-like queries that make use of Hadoop underneath
• Flexible due to user defined functions (UDFs)
  – UDFs encapsulate the cheminformatics

Page 53: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Cheminformatics & Pig

• Identify molecules in medium.smi that match the SMARTS pattern and dump to output.txt
• The complexity is now hidden in the UDF
• Many toolkit functions could be wrapped as UDFs, allowing flexible queries with much simpler code
• See http://blog.rguha.net/?p=748 for the code

    A = load 'medium.smi' as (smiles:chararray);
    B = filter A by net.rguha.dc.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
    store B into 'output.txt';

Page 54: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Latency

• Hadoop is suited for batch processing
• Significant network I/O involved in distributing data to compute nodes
• Not good for
  – Random ad hoc processing of small subsets
  – Small volume data
  – Real time (low latency) work
• But latency issues can be addressed somewhat by HBase, Hive and other technologies

Page 55: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

More than Chunking?

• But all the examples so far could have been done via PBS/Condor or any other job scheduler
  – (With Hadoop we don't have to worry about explicit chunking of the input data)
• But are there cheminformatics algorithms that can be reworked into the M/R paradigm?
  – Predictive modeling?
  – Graph algorithms?

Page 56: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

More than Chunking?

• Both predictive & graph algorithms are increasingly supported in Hadoop
  – Mahout for M/L algorithms on massive datasets
  – Cloud9 for graph algorithms
• A number of bioinformatics applications make use of M/R at the algorithmic level
• They are all big applications
  – Crossbow aligns 3 billion paired/unpaired reads
• Cheminformatics datasets are not very big

Page 57: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Summary

• HTS data is an ample playground for interesting analytics; multiple data types make it more fun
• A major challenge in our informatics infrastructure is dealing with proprietary vendor interfaces
• Hadoop and M/R provide great opportunities for handling large data in a flexible manner
• But can cheminformatics really make use of it?

Page 58: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Acknowledgments

Informatics

• Ajit Jadhav
• Trung Nguyen
• Noel Southall
• Ruili Huang
• Min Shen
• Hongmao Sun
• Xin Hu
• Tongan Zhao

RNAi & Small Molecule

• Scott Martin
• Pinar Tuzmen
• Yu-Chi Chen
• Carleen Klump
• Craig Thomas
• Jim Inglese
• Ron Johnson
• Sam Michael
• Jennifer Wichterman

Page 59: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT
Page 60: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Counting Atoms

• The canonical Hadoop program is to count the frequency of words in a text file
  – Mapper reads a line, outputs a tuple (word, 1)
  – Reducer will receive tuples, keyed on word
• Summing up the 1's gives us the frequency of word
• By default, Hadoop works on a line-by-line basis
• For cheminformatics problems, SMILES files satisfy this requirement: one line, one molecule
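
The word-count flow described above can be sketched in plain Java, with no Hadoop involved, to make the three phases explicit. The class and method names are hypothetical, and the "shuffle" here is just an in-memory grouping by key, standing in for what the framework does across nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative in-memory map / shuffle / reduce word count; not the Hadoop API. */
final class MapReduceSketch {
    /** Map phase: one input line in, a list of (word, 1) tuples out. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    /** Shuffle & sort phase: group the emitted values by key, keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: sum the 1's received for a single key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs the three phases over the given lines and returns word frequencies. */
    static Map<String, Integer> run(String... lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run("a b a", "b c"));
    }
}
```

Replacing the tokenizer in `map` with "parse the SMILES and emit one tuple per atom symbol" gives the atom-counting variant; the shuffle and reduce steps stay the same.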

Page 61: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Counting Atoms

• Uses the CDK to parse SMILES
• For each molecule loop over atoms
  – Emit (symbol,1)
• Reducer simply sums the 1's for each symbol

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.openscience.cdk.DefaultChemObjectBuilder;
    import org.openscience.cdk.exception.InvalidSmilesException;
    import org.openscience.cdk.interfaces.IAtom;
    import org.openscience.cdk.interfaces.IAtomContainer;
    import org.openscience.cdk.smiles.SmilesParser;

    public class HeavyAtomCount {
        static SmilesParser sp = new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            // One input line is one SMILES; emit (element symbol, 1) for every atom.
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                try {
                    IAtomContainer molecule = sp.parseSmiles(value.toString());
                    for (IAtom atom : molecule.atoms()) {
                        word.set(atom.getSymbol());
                        context.write(word, one);
                    }
                } catch (InvalidSmilesException e) {
                    // do nothing for now
                }
            }
        }

        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            // Sum the 1's emitted for each element symbol.
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
        …
    }

Page 62: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Multiline Records

• Lots of cheminformatics applications require 3D, so SMILES won't do. Need to support SDF
• We implement a custom RecordReader to process SD files
• We're now ready to tackle pretty much most cheminformatics tasks
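
The grouping logic that such a reader needs can be sketched independently of Hadoop. In an SD file each record ends with a line containing `$$$$`, so a reader has to accumulate lines until it sees that delimiter. The class below is a plain-Java illustration of that idea only: it does not implement Hadoop's `RecordReader` interface or handle records that straddle input splits, and its names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Illustrative multi-line record splitting for SD files; not a Hadoop RecordReader. */
final class SdfRecordSketch {
    /** Line that terminates each record in an SD file. */
    static final String DELIMITER = "$$$$";

    /** Reads complete records; trailing lines with no terminator are dropped in this sketch. */
    static List<String> readRecords(BufferedReader reader) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                current.append(line).append('\n');
                if (line.trim().equals(DELIMITER)) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    /** Convenience overload for in-memory text. */
    static List<String> readRecords(String text) {
        return readRecords(new BufferedReader(new StringReader(text)));
    }

    public static void main(String[] args) {
        // Two abbreviated, made-up records; real records carry full atom and bond blocks.
        String sdf = "mol1\n\n\nM  END\n$$$$\nmol2\n\n\nM  END\n$$$$\n";
        List<String> records = readRecords(sdf);
        System.out.println(records.size() + " records");
    }
}
```

Each returned string could then be handed to a toolkit's molfile parser, in the same way that each line is handed to the SMILES parser in the line-based examples.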

Page 63: Enabling Discoveries at High Throughput - Small molecule and RNAi HTS at the NCTT

Why Hadoop?

• Java and C++ APIs
  – In Java use Objects, while in C++ bytes
• Each task can process data sets larger than RAM
• Automatic re-execution on failure
  – In a large cluster, some nodes are always slow or flaky
  – Framework re-executes failed tasks
• Locality optimizations
  – M/R queries HDFS for locations of input data
  – Map tasks are scheduled close to the inputs when possible

Owen O'Malley, http://bit.ly/ecHPvB