
Biomedical Computation Review
Winter 2011/2012

INSIDE: Big Data Analytics in Biomedical Research
PLUS: Privacy and Biomedical Research: Building a Trust Infrastructure
And more


Winter 2011/2012 | Volume 8, Issue 1 | ISSN 1557-3192

Executive Editor: Russ Altman, MD, PhD
Advisory Editor: David Paik, PhD
Associate Editor: Joy Ku, PhD
Managing Editor: Katharine Miller
Science Writers: Alexander Gelfand, Louisa Dalton Hudock, Katharine Miller, Kristin Sainani, PhD
Community Contributors: Joy Ku, PhD; Oren Freifeld; Justin Foster; Paul Nuyujukian; Ton van den Bogert, PhD
Layout and Design: Wink Design Studio
Printing: Advanced Printing
Editorial Advisory Board: Russ Altman, MD, PhD; Brian Athey, PhD; Andrea Califano, PhD; Valerie Daggett, PhD; Scott Delp, PhD; Eric Jakobsson, PhD; Ron Kikinis, MD; Isaac Kohane, MD, PhD; Mark Musen, MD, PhD; Tamar Schlick, PhD; Jeanette Schmidt, PhD; Michael Sherman; Arthur Toga, PhD; Shoshana Wodak, PhD; John C. Wooley, PhD

For general inquiries, subscriptions, or letters to the editor, visit our website at www.biomedicalcomputationreview.org

Office: Biomedical Computation Review, Stanford University, 318 Campus Drive, Clark Center Room S231, Stanford, CA 94305-5444

Biomedical Computation Review is published by Simbios, the NIH National Center for Physics-Based Simulation of Biological Structures.

Publication is made possible through the NIH Roadmap for Medical Research Grant U54GM072970. Information on the National Centers for Biomedical Computing can be obtained from http://nihroadmap.nih.gov/bioinformatics. The NIH program and science officers for Simbios are:

Peter Lyster, PhD (NIGMS); Grace Peng, PhD (NIBIB); Jim Gnadt, PhD (NINDS); Peter Highnam, PhD (NCRR); Jennie Larkin, PhD (NHLBI); Jerry Li, MD, PhD (NIGMS); Nancy Shinowara, PhD (NICHD); David Thomassen, PhD (DOE); Janna Wehrle, PhD (NIGMS); Jane Ye, PhD (NLM)

Contents | Winter 2011/2012


14 Big Data Analytics in Biomedical Research BY KATHARINE MILLER

22 Privacy and Biomedical Research: Building a Trust Infrastructure BY ALEXANDER GELFAND

DEPARTMENTS

1 GUEST EDITORIAL | SLAYING VILLAINS OUTSIDE THE IVORY TOWER BY TON VAN DEN BOGERT, PhD

2 SIMTK HIGHLIGHTS | GRAND CHALLENGE COMPETITION PROVIDES RICH DATA SET TO IMPROVE JOINT CONTACT FORCE PREDICTIONS BY JOY P. KU, PhD

3 TAPPING THE BRAIN: DECODING FMRI BY LOUISA DALTON HUDOCK

5 PERSONALIZED CANCER TREATMENT: SEEKING CURES THROUGH COMPUTATION BY KRISTIN SAINANI, PhD

9 FOLLOW THE MONEY: BIG GRANTS IN BIOMEDICAL COMPUTING BY KRISTIN SAINANI, PhD

11 LEVERAGING SOCIAL MEDIA FOR BIOMEDICAL RESEARCH BY KATHARINE MILLER

27 UNDER THE HOOD | CALCULATING STATISTICS OF ARM MOVEMENTS BY OREN FREIFELD, JUSTIN FOSTER AND PAUL NUYUJUKIAN

30 SEEING SCIENCE | BUSTING ASSUMPTIONS ABOUT RAINBOWS AND 3-D IMAGES BY KATHARINE MILLER

Cover Art: Digital profile image © Vectomart | Dreamstime.com. Page 14: Created by Rachel Jones of Wink Design Studio using: digital profile image © Vectomart | Dreamstime.com, and word clouds created on http://www.tagxedo.com. Page 17, 19, 20: word clouds created on http://www.tagxedo.com. Page 22: Created by Rachel Jones of Wink Design Studio using: in-house images and patient image © Wavebreakmedia Ltd | Dreamstime.com.


GUEST EDITORIAL | Slaying Villains Outside the Ivory Tower

BY TON VAN DEN BOGERT, PhD

Just over a year ago, I left academia. I had been in that realm for 25 years, working in musculoskeletal biomechanics and human movement analysis. It was a move that might have surprised anyone who knew me as a child, as well as many who watched me through my career. Yet the transition has been smooth and rewarding, and harbors hints of a lesson for others.

The story begins at age eight with my stated intent to become a professor. What I really meant was the stereotypical absent-minded professor of European comic books: I dreamed of inventing cool gadgets to help defeat the villains.

Eventually I became a real professor. I enjoyed the mix of research and teaching, and didn’t mind the additional reality of “publish or perish.” But as this evolved into “get more grants or perish,” I began to wonder whether input or output was more important to our academic institutions. And when would I get to defeat the villains anyway?

I became more interested in an alternative lifestyle after doing some private consulting. A rather typical scenario would proceed as follows: The client explains a need; I propose a solution and write a two-page proposal with a budget; a contract is signed; work begins; and in the next year, my software is used in a product that disrupts the industry.* This was tremendously refreshing, compared to waiting nine months for a (possibly negative) decision on a grant proposal. And it felt much more like my dream of inventing cool gadgets. But small disruptive projects do not fit well within the current business model of academic research, where we are encouraged to have large teams, large grants with high indirect costs, and graduate students who do the work over five years.

So when I left academia to start a small business in computational biomechanics, the decision was easier than you might expect. Many things wouldn’t change: I could still get NIH funding (as a small business); teach graduate students (as an adjunct); and publish (if I avoided contracts with unreasonable restrictions and set my fees high enough to have time to write). And I figured I could out-compete others in this niche (mostly graduate students and postdocs such as those I used to train) while remaining competitive through continuous improvement of my capabilities.

In biomedical computing, it’s actually possible for an individual or small business to be at the cutting edge of research. To some extent, that’s because they can collaborate with academic investigators, as I’ve continued to do (mostly through the Simtk.org collaborative platform). Sometimes academics have expertise or laboratory facilities that I do not have, or they need my expertise. Funding for such collaborations is available, and I believe they provide an increasingly viable model for biomedical research, resulting in high-quality work and a win-win situation for all parties. Most importantly, the independent “comic book” scientist can enjoy the freedom to quickly and efficiently invent cool gadgets and slay a few villains.

DETAILS

Antonie (Ton) J. van den Bogert is President of Orchard Kinetics, a startup company doing computational musculoskeletal biomechanics R&D, based in Cleveland, Ohio. Previous positions were at the University of Calgary and the Cleveland Clinic Lerner Research Institute. Ton is well-known as the moderator of Biomch-L, a mailing list and social network for biomechanists, which has existed since 1988 and now has more than 10,000 members. He co-founded the Technical Group on Computer Simulation of the International Society of Biomechanics (ISB) in 1989, and is currently the president of the ISB. As a pioneer in musculoskeletal modeling and simulation, he has held a collaborating R01 grant with Simbios, and currently serves on the Scientific Advisory Board of Simbios.

* In 2005, I received a Technical Achievement Award from the Academy of Motion Picture Arts and Sciences for work with Motion Analysis Corp. (Santa Rosa, CA). This was a consulting project that resulted in software to generate realistic 3-D human animations from motion capture data.



SIMTK HIGHLIGHTS | Grand Challenge Competition Provides Rich Data Set to Improve Joint Contact Force Predictions

BY JOY P. KU, PhD, DIRECTOR OF SIMBIOS

There are numerous modeling methods available to make predictions of muscle and joint contact forces. While such predictions can help improve treatments for movement-related disorders such as stroke or osteoarthritis, there’s a problem: choosing which one to use. The research community hasn’t had a way to easily evaluate different approaches. The Grand Challenge Competition to Predict In Vivo Knee Loads—now in its third year—changes that by motivating researchers to critically evaluate their simulations of contact forces in the knee, and by providing a wealth of experimental data to do so.

“Through the competition, I hope that we as a research community will be a little more critical of ourselves—in a good way—so that we can really gain confidence in what we’re predicting and our models can become much more useful clinically,” says B.J. Fregly, PhD, professor of mechanical and aerospace engineering at the University of Florida. He and Darryl D’Lima, MD, PhD, at Scripps Clinic are principal investigators of the grand challenge grant.

Each fall, the competition organizers make available a comprehensive data set for one subject who has received an instrumented, force-measuring knee implant. The data includes pre-surgery MRI and CT data, surface marker trajectory data, electromyography signals, and dynamic x-ray images of knee motion, just to list a few. But the competition’s participants are not given the in vivo contact force measurements. These they must predict and submit to the American Society of Mechanical Engineers (ASME) and the competition organizers. Presentations and discussions of the competition submissions and the announcement of the winner occur at ASME’s Summer Bioengineering Meeting.

While other research areas have held similar competitions, “this grand challenge competition is the only one I’m aware of in biomechanics,” says Grace Peng, PhD, a program director at the NIH’s National Institute of Biomedical Imaging and Bioengineering, which funds the knee grand challenge grant. “In that way, it’s really unique.”

The government has recently been pushing the use of prizes and challenges, especially for science and technology. “Challenges like this help benchmark some of the many techniques available and determine best practices,” says Peng. “It is helpful to discuss the methods openly and publicly and have the technology converge.”

That’s what Stephen Piazza, PhD, associate professor of kinesiology at Pennsylvania State University, and his then-graduate student Michael Hast, PhD, discovered when they participated in (and won) last year’s competition. “When multiple groups are working on the same problem, the same data, there’s a potential for sharing and learning that can’t occur otherwise,” says Piazza.

Hast agrees: “Sometimes it feels like you’re in a cave working on these problems. Being able to see how others approached the same problem was a good experience.”

Piazza and Hast actually waited one year after learning about the competition before entering. “We couldn’t do it the first year. Our simulation wasn’t where we wanted it to be,” says Piazza. The competition motivated them to extend their simulations, which had previously been used to predict forces for a highly constrained mechanical knee simulator, to work with the more variable human data.

For Hast, the competition was the capstone to his PhD. The competition’s data set, which is synchronized and captures subjects performing different activities and using different walking motions, was much more extensive than he had the time or opportunity to collect for his thesis.

“It is really an embarrassment of riches as far as the data is concerned,” says Hast, who plans to use the data in his future research. As for the competition, he says, “It’s cost-effective and helps the greater orthopedic community. It’s a really great cause.”

DETAILS

The publication “Grand Challenge Competition to Predict In Vivo Knee Loads” in the Journal of Orthopaedic Research provides a detailed description of the project and data. To access the data, visit http://simtk.org/home/kneeloads.

Simtk.org is a Simbios website providing open access to high-quality biocomputational tools, accurate models, and the people behind them.


TAPPING THE BRAIN: Decoding fMRI

BY LOUISA DALTON HUDOCK

Revealing the brain’s hidden stash of pictures, thoughts, and plans has, until recently, been the work of parlor magicians. Yet within the last decade, neuroscientists have gained powerful methods for delving into the contents of brain activity, allowing them to predict specific thoughts—including images, memories, and intentions—from brain activity.

“When these pattern recognition techniques came out, it gave the field a big boost. People realized that now we can really get at content,” says John-Dylan Haynes, PhD, at the Bernstein Center for Computational Neuroscience in Berlin.

Since the 1990s, functional magnetic resonance imaging (fMRI) has been used to track the flow of blood and oxygen in the brain, thus showing which spots in the brain are busy. Around 2005, neuroscientists discovered that computational multi-voxel pattern analysis (MVPA)—a technique used in other fields such as fingerprint identification—could help them do more than just pinpoint what part of the brain is active. It could help them read meaning in the patterns of activity. If fMRI takes pictures of the brain’s hidden bar codes, MVPA decodes them. Two studies below show the power of MVPA applied to intentions and memories. The final study below shows how modeling can recreate the images in the mind’s eye.

Decoding Intentions

Some of the early studies using MVPA showed how scientists could use a volunteer’s brain patterns to train a linear classifier to predict whether the volunteer was looking at, say, a face or a chair. Then researchers expanded to predict other mental content: emotions, sounds, and memories, for example.
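In computational terms, this kind of decoding is ordinary supervised classification applied to voxel patterns. The sketch below is purely illustrative and is not the code used in any of the studies described here: it trains a linear classifier on labeled fMRI volumes, such as hypothetical "face" versus "chair" trials, and estimates decoding accuracy by cross-validation on randomly generated stand-in data.

```python
# Minimal MVPA-style decoding sketch (illustrative only, simulated data).
# X: one row per fMRI volume (trial), one column per voxel; y: stimulus labels.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 120, 5000                    # hypothetical sizes
X = rng.standard_normal((n_trials, n_voxels))     # stand-in for preprocessed BOLD patterns
y = rng.integers(0, 2, n_trials)                  # 0 = "face" trial, 1 = "chair" trial

# A linear classifier learns one weight per voxel; the resulting weight map is
# the pattern that best separates the two mental states.
clf = LinearSVC(C=1.0, max_iter=10000)

# Cross-validated accuracy well above chance (0.5 here) is the evidence that
# the voxel pattern carries information about the stimulus category.
scores = cross_val_score(clf, X, y, cv=5)
print("decoding accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

With real, preprocessed data in place of the random arrays, these few lines capture the "train a classifier on brain patterns" step that the studies below build on.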

Intentions, Haynes found, are also predictable. In a study published first in 2008 in Nature Neuroscience and repeated using ultra-high field fMRI in June 2011 in Public Library of Science One, Haynes and his coworkers asked volunteers to enter an MRI machine with one button near their left finger, and one button near their right finger.

Relax, they told the volunteers. Watch this stream of letters in front of your eyes, choose which button you will push, remember the letter in front of you at the moment you choose, and then press the button. The volunteers told the researchers which letter was in front of their eyes when they chose which button to press. Haynes and his coworkers found that not only could the classifier predict which button a volunteer was about to press, it could predict it far earlier than the volunteer was able to—up to seven seconds before the volunteer reported consciously choosing which button to push.

A Brain’s Choice to Make. Before a volunteer clicked on either a left or right button, Haynes and his group collected the volunteer’s brain activity using fMRI. The scientists found regions (in green) that, analyzed using pattern recognition techniques, could predict the button the volunteer would push. Those regions revealed the predicted decision up to seven seconds before the volunteer felt he or she had made a decision. Credit: John-Dylan Haynes and Chun Siong Soon.

With this study, Haynes established that our brains know some things before we do. “Not all of our decisions are made consciously,” he says.

Haynes was also one of the first to use MVPA to explore the brain for patterns. Because Haynes didn’t know exactly where he might find patterns of decision-making in the brain, or how many seconds before the decision he might be able to see those patterns, he used MVPA as an initial tool for exploration. He used a procedure called searchlight decoding to search the whole brain without making any assumptions ahead of time, and pinpoint the areas and times in the brain that forecast decision-making. More recently, Haynes and his group have found they can predict not just button pushing, but abstract thought as well, such as deciding whether to add or subtract two numbers.
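Searchlight decoding repeats that classification analysis inside a small neighborhood centered on every voxel in turn, producing a whole-brain map of where information about the upcoming decision can be found. The toy sketch below conveys the idea using assumed data, a small grid, and a cube-shaped neighborhood; a real searchlight analysis runs on masked brain volumes, typically uses a spherical neighborhood, and is repeated across time points as well.

```python
# Toy searchlight decoding: score a classifier in a small neighborhood around each voxel.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
shape = (6, 6, 6)                                   # hypothetical voxel grid
n_trials, radius = 80, 1                            # trials and neighborhood half-width
data = rng.standard_normal((n_trials,) + shape)     # trial x (x, y, z) voxel grid
labels = rng.integers(0, 2, n_trials)               # e.g., left vs. right button choice

accuracy_map = np.zeros(shape)
for x, y, z in np.ndindex(shape):
    # Cube-shaped "searchlight" around the current voxel, clipped at the edges.
    patch = data[:,
                 max(x - radius, 0):x + radius + 1,
                 max(y - radius, 0):y + radius + 1,
                 max(z - radius, 0):z + radius + 1]
    X = patch.reshape(n_trials, -1)                  # trials x voxels-in-neighborhood
    scores = cross_val_score(LinearSVC(max_iter=5000), X, labels, cv=3)
    accuracy_map[x, y, z] = scores.mean()            # local "information" score

# Voxels whose neighborhoods decode well mark regions that forecast the decision.
print("peak local decoding accuracy:", round(float(accuracy_map.max()), 2))
```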

Decoding Memories

Others, including Frank Tong, PhD, at Vanderbilt University, have grappled with decoding memories. Tong and his group designed an experiment using MRI to test for working memory (published in Nature in April 2009) in the early visual areas of the brain—areas that perceive basic visual features. They showed a volunteer first one set of parallel lines, then another set in a different orientation. They told the volunteer to remember one of the sets. Then after an 11-second pause, they asked the volunteer to recall the orientation they kept in mind.

During that pause, the overall brain activity in the early visual areas of the brain often returned to normal. However, even at that low level of activity (no greater than would be expected when viewing a blank screen), Tong found that pattern classifiers could pick out subtle shifts in brain patterns associated with each orientation, and could still be trained to predict which orientation the volunteer held in memory. “We can see these faint echoes of what they saw before, what they are actively holding in mind,” Tong says. “That would be invisible if we didn’t do multivoxel pattern analysis.”

Tong’s study shows that working memory resides in the early visual areas of the brain, a zone where few expected to find such a higher thought process. And it helps establish that images held only in the mind are robust enough that pattern algorithms can decode and predict them.

Encoding Models

Researchers have also started decoding novel brain patterns—something a linear classifier cannot do because it can only predict patterns it has already seen. In a study published in Current Biology in October, computational neuroscientist Jack Gallant, PhD, and post-doctoral researchers Shinji Nishimoto, PhD, and Thomas Naselaris, PhD, and their group at the University of California, Berkeley used a combination of brain modeling and decoding to reconstruct a primitive version of a movie someone is watching.

After showing a series of movie clips to a volunteer lying inside an MRI machine, the scientists ran the data through a two-step process, Naselaris says. They first created an encoding model: in essence, a virtual brain that generates signals when images are presented to it. In the early visual region that Gallant studies, the brain processes images as pieces of moving edges. Each two-millimeter cube of brain space (also called a voxel) processes the moving edges in a unique way, and Gallant’s group created a model for each cube. Each model maps various features of the movie input—motion, color, spatial frequency, and so on—into an output signal that matches as closely as possible the brain activity seen in each cube of the volunteer’s fMRI brain data. “It is the signature of that one individual voxel,” Naselaris says.
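Structurally, an encoding model of this kind is a regression fit independently for each voxel: the stimulus is turned into a vector of features (motion, color, spatial frequency, and so on), and the model learns weights that reproduce that voxel's measured response. The sketch below shows the shape of that forward step using generic ridge regression on randomly generated stand-in data; it is not Gallant's actual feature space or fitting procedure, and every size and parameter is an assumption.

```python
# Voxel-wise encoding model sketch: stimulus features -> predicted brain response.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
n_timepoints, n_features, n_voxels = 600, 400, 2000                # hypothetical sizes
stim_features = rng.standard_normal((n_timepoints, n_features))    # movie features per time point
bold = rng.standard_normal((n_timepoints, n_voxels))               # measured response per voxel

# One regularized linear model per voxel (fit jointly here, one output column each).
# Each voxel's weight vector is its "signature": how that cube responds to each feature.
encoder = Ridge(alpha=10.0)
encoder.fit(stim_features, bold)

# Given a new clip's features, the fitted model predicts the whole-brain pattern
# that clip should evoke: the forward step that decoding later runs in reverse.
new_clip_features = rng.standard_normal((1, n_features))
predicted_pattern = encoder.predict(new_clip_features)             # shape: (1, n_voxels)
print("predicted pattern shape:", predicted_pattern.shape)
```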

Reconstructing Movies via the Brain. Using brain activity of volunteers watching a movie (frames in top row), Gallant and his coworkers created a model that predicts brain patterns from movie clips. They then used that model and a database of 18 million seconds of random movie clips from the Internet to predict the brain activity that would be expected from each clip. They then averaged together the 100 clips in the random library that were most likely to have produced brain activity similar to what was actually observed (bottom row). See Nishimoto, S et al., Reconstructing Visual Experiences from Brain Activity Evoked by Natural Movies, Current Biology 21:1641-1646 (2011).

Then, using a brand-new set of movie clips, Gallant’s group collected new brain activity data, and set the model going the other direction. They input the brain activity to see if the model could now predict movie scenes from brain data. A novel form of Bayesian decoding helped them match images, taken from a massive database of 1-second movies, to brain activity patterns. Their dark, blurry movie reconstructions are an average of the top 100 short movie clips with a predicted brain pattern that fits best to the actual brain activity patterns.
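A rough way to picture that reconstruction step: run every clip in a large library through the fitted encoding model, score how well each clip's predicted pattern matches the pattern actually observed, and average the frames of the best-scoring clips. The sketch below uses a simple correlation score and random stand-in data; the published work used a Bayesian decoder and an 18-million-second library, so treat this only as an illustration of the matching-and-averaging idea.

```python
# Decoding-by-matching sketch: rank library clips by how well their predicted
# brain patterns match the observed pattern, then average the top 100 frames.
import numpy as np

rng = np.random.default_rng(3)
n_voxels, n_library, frame_pixels = 2000, 5000, 64 * 64             # hypothetical sizes

observed_pattern = rng.standard_normal(n_voxels)                    # measured fMRI pattern
predicted_patterns = rng.standard_normal((n_library, n_voxels))     # encoder output per library clip
library_frames = rng.random((n_library, frame_pixels))              # a representative frame per clip

# Score each candidate clip by correlating its predicted pattern with the observed
# pattern (a stand-in for the Bayesian posterior used in the actual study).
pred = predicted_patterns - predicted_patterns.mean(axis=1, keepdims=True)
pred /= pred.std(axis=1, keepdims=True)
obs = (observed_pattern - observed_pattern.mean()) / observed_pattern.std()
scores = pred @ obs / n_voxels

# Average the frames of the 100 best-matching clips: the dark, blurry reconstruction.
top_100 = np.argsort(scores)[-100:]
reconstruction = library_frames[top_100].mean(axis=0).reshape(64, 64)
print("reconstructed frame:", reconstruction.shape)
```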

Naselaris cautions that they can’t recreate dreams or other mind’s-eye images with this model. Because they focused on an early visual brain region, their model “has everything to do with the light that is hitting your eye,” Naselaris says.

Yet Gallant and his group’s strategy is powerful for a number of reasons. With it, they can make surprisingly accurate predictions about images that the model has not seen before. In addition, their strategy of first creating an encoding model before decoding gives researchers a method for testing theories about how the brain processes information. For other areas of the brain, the strategy will likely have to be modified. “The brain is a pretty complicated place,” Gallant says, “so there is no one grand approach that will solve everything. Instead, there are thousands of neuroscientists using thousands of different approaches to try to move ahead.”

PERSONALIZED CANCER TREATMENT: Seeking Cures Through Computation

BY KRISTIN SAINANI, PhD

Personalized cancer therapy is now a reality. A handful of tumor-classifying tests and targeted drugs are in widespread clinical use; and early attempts are underway to match high-risk cancer patients to experimental drugs based on genetic testing of their tumors.

But progress has been incremental and successes have been measured. Cancer is complex and insidious; knock out one bad player with a drug and the system evolves resistance. Patients may live longer, but still die of their disease. To take personalized cancer medicine to the next level—to achieve cures—computational approaches are needed. “We are at a crossroads where it’s becoming increasingly difficult to do anything of value without a heavy element of computation,” says Andrea Califano, PhD, professor of biomedical informatics at Columbia University and director of the Columbia Initiative in Systems Biology.

Bioinformatics and computing are helping to make advances on several fronts, including: cataloging the full spectrum of genomic defects in cancer; identifying the defects that drive malignancy; efficiently translating these discoveries to patient care; and improving the tools that are already in clinical use.

Mapping the Landscape

Several cancer genome initiatives are cataloging the array of molecular defects that define different cancers and cancer subtypes. The NIH’s The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have already collected multiple layers of data—including sequencing, mutational, copy number alteration, DNA methylation, microRNA, and gene expression data—on an unprecedented number of tumors.

“These different layers will be mapped out, overlaid, and integrated, so we can see the complete genomic picture of the tumor,” says Douglas A. Levine, MD, head of the Gynecology Research Laboratory at the Memorial Sloan-Kettering Cancer Center and a TCGA investigator.

This is the first time we’ve had all these different data types on exactly the same samples, Califano says. “And simply amazing findings are coming out.”

The TCGA reported results for its second complete genome—for high-grade serous ovarian cancer (a common and aggressive form of this cancer)—in Nature in June 2011. The project analyzed data from 489 tumors, including the complete exome sequences of 316, the most ever reported to date for any solid tumor, Levine says.

Among the most exciting findings, about half the patients had defects—inherited or acquired mutations or epigenetic silencing—in the tumor suppressor genes BRCA1 and BRCA2 or in related DNA repair genes. These tumors might respond to PARP inhibitors, which improve survival in women with inherited BRCA1 and BRCA2 mutations, Levine says.

Also, seven percent of tumors had homozygous deletions in the tumor suppressor gene PTEN, a defect not previously reported in this subtype of ovarian cancer. Levine and others have already begun clinical trials to treat this subset of patients with a new class of drugs called PI3 kinase inhibitors, which target the PI3K/AKT/mTOR pathway that PTEN regulates.

“All these things need to be tested. But we now have the landscape and the roadmap laid out,” Levine says. “Molecular medicine is not really being used at all today in ovarian cancer. I hope it can be used in the near future to make better treatment decisions.”

Identifying the Driving Defects

Sequencing cancer genomes is only the first step in understanding this disease; the next step is to sort out which genetic changes are driving the cancer—and thus will make robust biomarkers and drug targets—and which are merely incidental. This is where systems biology comes in handy, Califano says. “The idea is to computationally interrogate regulatory networks of the cancer cell to find out what are the genes that are actually causally related to the presentation of a specific tumor phenotype,” Califano says.

For example, in a 2010 paper in Nature, Califano’s team used network analysis to identify two master regulators of a particularly aggressive subtype of glioma brain cancer. These two transcription factors (C/EBPβ and Stat-3) don’t appear in the gene expression signature for this subtype, as they are about the 500th and 1300th most differentially expressed, Califano says. “But, if you look at them with these network analyses, they stand out as being the most significant genes in terms of their activity in regulating the signature.” Inactivating these genes in mouse xenografts blocked tumor development or reduced malignancy.

In an October 2011 paper in Cell, Califano’s team used a novel algorithm to identify a “hidden” network of mRNAs and microRNAs that together control PTEN expression. They showed that 13 genes from this network are commonly mutated in glioma. Deleting any of these 13 genes—which had never previously been linked to glioma—suppresses PTEN expression even if the PTEN gene is intact. “So this gives you a very strong hint that mTOR or AKT inhibitors, in combination with other drugs, may actually work in patients that have absolutely no detectable genetic alteration of PTEN,” Califano says.

Bringing the Data to Biologists and Physicians

Experimental biologists and clinical researchers are in the best position to translate cancer genome findings into meaningful advances in patient care. But they often lack the expertise needed to access and make sense of the data. A team of scientists at Georgetown University is trying to bridge this gap by creating a user-friendly integrated database called G-DOC (Georgetown Database of Cancer), which they describe in a September 2011 paper in Neoplasia.

“We are unique in the way that we provide the data mining as well as the analytics in one environment,” says Subha Madhavan, MD, director of clinical research informatics at the Lombardi Comprehensive Cancer Center at Georgetown. G-DOC integrates patient data, genomic data, and small molecule data (“for matchmaking molecular targets with drugs,” Madhavan says) with popular tools for analyzing and visualizing these data, including GenePattern, Pathway Studio, and Cytoscape.


Master Regulators. Six transcription factors, including C/EBPβ and Stat-3, control most of the genes in the gene signature of an aggressive subtype of high-grade glioma. Reprinted by permission from Macmillan Publishers Ltd: Carro MS, et al., The transcriptional network for mesenchymal transformation of brain tumours. Nature. 2010;463:318-25.



Hidden Network. Left: Graphic visualization of a complex network of RNA-RNA interactions in glioma. The interactions regulate the expression of oncogenes and tumor suppressor genes, including PTEN. Nodes represent individual RNAs and edges represent RNA-RNA interactions. Nodes near the center are more tightly regulated. Below: Close-up of the densest 564-node subgraph shown in red at the center of the network. Reprinted from Sumazin, et al, An Extensive MicroRNA-Mediated Network of RNA-RNA Interactions Regulates Established Oncogenic Pathways in Glioblastoma, Cell 147:2:370-381 (2011) with permission from Elsevier.


Users (both internal and external to Georgetown) have contributed clinical data on more than 3000 patients with breast or gastrointestinal cancers. “We were just lucky to work with investigators who had de-identified clinical data that we could leverage,” Madhavan says. Madhavan’s team has also imported a wealth of data from public databases and published articles. “We bring in the raw data and standardize them, so there’s a lot of value added to that data,” Madhavan says. They will add TCGA data for breast and colon cancer when they become available, she says.

Researchers can use G-DOC to generate or test hypotheses; run in silico experiments; learn about the newest types of data—including next generation sequencing, metabolomics, DNA copy number abnormalities, and microRNA expression—as well as about systems biology; and speed up the pace of their research. For example, it took one person using G-DOC one month to complete a colon cancer analysis that would otherwise have taken a team of people at least a year to complete, Madhavan says.

Overcoming Computing Barriers

To fully realize the vision of personalized cancer therapy, more labs will need to become computationally savvy, Califano concludes. “Right now there are a few computationally empowered labs, but the vast majority of other labs can only access computational analyses through collaborations,” Califano says. “There has to be some kind of connective tissue.” He points to models such as the National Centers for Biomedical Computing, which have fostered collaborations between computer specialists and biologists and physicians. “I think this is an example of the way things can go forward.”

ADVANCING Gene Expression Signatures

Gene expression signatures that stratify patients into likely and unlikely treatment responders are already in clinical use for certain cancers. But these “first generation” tests have severe limitations, says W. Fraser Symmans, MD, professor of pathology at MD Anderson Cancer Center. He and his colleagues are using state-of-the-art bioinformatics and biostatistics techniques to develop the next generation of gene expression tests for breast cancer.

Symmans and his colleagues discovered a paradox with some first generation tests for breast cancer. The tests accurately separate patients into “good” and “poor” responders to chemotherapy; but the “good responders” have worse survival. (The tests misclassify certain aggressive tumors that initially respond vigorously to chemotherapy but tend to relapse.) His team developed a second-generation test, described in the May 2011 issue of JAMA, that overcomes this issue and accurately predicts survival.

The test comprises a series of gene signatures (from the tumor) that sequentially predict: (1) response to hormonal treatment; (2) resistance to chemotherapy; and (3) sensitivity to chemotherapy. “We realized that one predictor was not going to be enough to capture the complexity,” says Christos Hatzis, PhD, who led the computational aspects of the project; Hatzis is founder and vice president of technology at Nuvera Biosciences Inc., which has commercial rights to the technology. The team used a multivariate approach to identify the key genes that define the signature; univariate approaches yield too many redundancies because genes work in pathways, Hatzis says.

The test accurately identifies patients who will respond to therapy about twice as often as standard methods. “It doesn’t completely solve the problem but it’s a big step forward,” Hatzis says.


FOLLOW THE MONEY: Big Grants in Biomedical Computing

BY KRISTIN SAINANI, PhD

Several biomedical computing projects received multi-million dollar funding in the fall of 2011, including efforts to: simulate the cardiac physiology of the rat; build a state-of-the-art DNA simulation toolkit; build an artificial pancreas; and mine data for clues about psychiatric disease. The initiatives will bring together diverse experts, datasets, or models to accomplish ambitious goals.

Modernizing the Lab Rat

A new type of lab rat—one simulated on a computer—will help scientists tease out the multifactorial causes of cardiovascular disease, thanks to a $13 million grant from the National Institutes of Health. The grant establishes a new National Center for Systems Biology.

“The goal of the Virtual Physiological Rat Project is to understand how high level traits, such as hypertension, arise from multiple inputs at multiple levels, including genetic variation and environmental perturbations,” says principal investigator Daniel A. Beard, PhD, professor of physiology at the Medical College of Wisconsin.

Beard’s team will build detailed computational models of the rat’s heart, vasculature, and kidneys. “In some cases, we’re drilling down to the individual cells and individual transporters and pumps that are involved in the operation of the organ and integrating all the way up to the whole organ,” Beard says. Though sophisticated models already exist for some of the pieces—for example, the heart is the best modeled of all organs—they will have to adapt these models for rats in general and for particular genetic strains of rats.

Once the models are refined, Beard’s team will breed new strains of rats in silico and subject them to different environmental stressors to predict which combinations of genes and environment lead to hypertension, heart failure, kidney failure, and other cardiovascular diseases. Then his team will test the most interesting predictions with experiments in real rats. “The biggest novelty and the biggest challenge is this cycle of making a prediction and then making a rat,” Beard says. “We’re doing this on an unprecedented scale.”

Virtual Rat Heart: A computational model of the rat heart that will be incorporated into the Virtual Physiological Rat Project. Courtesy of: Daniel Beard, Medical College of Wisconsin.

Simulating DNA at All Levels

A new DNA simulation toolkit will be the first to span all levels of resolution. Funded by a $3 million grant from the European Research Council, the toolkit will help scientists gain new insights into how DNA functions, including how genes are regulated and how they interact with the environment (epigenetics).

“The overall goal is to develop a complete theoretical and computational framework to be able to simulate DNA in a multiscale manner, from atomistic to chromatin scale,” says principal investigator Modesto Orozco, PhD, senior professor and head of IRB Barcelona’s Molecular Modeling and Bioinformatics group and director of life sciences at the Barcelona Supercomputing Center. “If we are successful, we will define a unique series of tools covering the entire time and size scale of DNA.”

The toolkit will help researchers answer questions about how nucleic acids work “from the point of view of the first principles of physics”—for example, how the physical properties of chromatin impact DNA regulatory mechanisms. “We will have to overcome major challenges at the frontier between the different simulation levels,” Orozco says. “It is a very ambitious project.”

DNA Unveiled: This computer simulation (which proceeds from left to right and top to bottom) gives insights into the mechanism by which DNA starts to unfold. Courtesy of: A. Pérez, IRB Barcelona.

Advancing the Artificial Pancreas

A device that mimics the pancreas’ job may soon be a reality, thanks in part to a $4.5 million grant from the National Institute of Diabetes and Digestive and Kidney Diseases. The device—which features a state-of-the-art control algorithm—would free patients with type 1 diabetes from their current regimen of manual glucose monitoring and insulin injections.

“We’ve assembled an international dream team,” says principal investigator Frank Doyle, PhD, professor of chemical engineering and of electrical engineering at the University of California, Santa Barbara. His team includes scientists from Italy, the Mayo Clinic, the University of Virginia, and the Sansum Diabetes Institute. Several groups have prototypic artificial pancreas devices in testing. But Doyle’s team aims to bring the first sophisticated device into real-world use. “The exciting part of this grant is the possibility of getting beyond in-clinic prototyping,” Doyle says.

The devices consist of an iPod-sized computer, glucose sensor, and insulin pump (which can be attached to the arms, legs, or stomach). A simple version (made by Medtronic) is already on the market in England: the system monitors glucose and shuts off the pump automatically when blood sugar drops too low. But Doyle’s team is going beyond such a simple feedback loop, using an advanced algorithm called “model predictive control” (which is also used in aerospace controls). “We forecast and anticipate insulin needs,” rather than simply responding to current glucose levels, Doyle says. The algorithm will even adapt to a patient’s individual patterns, such as the timing of exercise and meals, as well as to individual variation in insulin metabolism, Doyle says. “It won’t be a one-size-fits-all algorithm; it will be tailored and customized to the individual patient.”
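In its simplest form, model predictive control repeatedly simulates a patient model forward over a short horizon for a set of candidate doses, picks the dose whose predicted glucose trajectory tracks the target best, applies only that first step, and then re-plans at the next measurement. The sketch below illustrates that receding-horizon idea with a deliberately crude one-compartment model; the target, sensitivity coefficients, dose grid, and meal forecast are all made up, and this is not the team's controller or valid dosing logic.

```python
# Receding-horizon (model predictive) control sketch for insulin dosing.
# Toy model and made-up parameters only; NOT clinical logic.
import numpy as np

TARGET = 110.0                                  # assumed glucose setpoint, mg/dL
CANDIDATE_DOSES = np.linspace(0.0, 2.0, 21)     # hypothetical insulin units per step

def predict_glucose(g0, dose, meal_forecast):
    """Crude toy model: glucose drifts up with predicted meals, down with insulin."""
    g, traj = g0, []
    for meal in meal_forecast:
        g = g + 0.8 * meal - 25.0 * dose        # made-up sensitivity coefficients
        traj.append(g)
    return np.array(traj)

def mpc_step(current_glucose, meal_forecast):
    """Choose the dose whose predicted trajectory best tracks the target."""
    costs = []
    for dose in CANDIDATE_DOSES:
        traj = predict_glucose(current_glucose, dose, meal_forecast)
        tracking = np.sum((traj - TARGET) ** 2)       # penalize deviation from target
        hypo_penalty = 1e4 * np.sum(traj < 70.0)      # strongly avoid predicted lows
        costs.append(tracking + hypo_penalty + 5.0 * dose)
    return CANDIDATE_DOSES[int(np.argmin(costs))]     # apply first move, re-plan next step

# Example: glucose is high and a meal (in grams of carbohydrate) is expected soon.
meal_forecast = [0, 0, 40, 0, 0, 0]
print("suggested dose for this step:", mpc_step(180.0, meal_forecast))
```

Adapting to an individual, in this framing, amounts to re-estimating the model's sensitivity coefficients from that patient's own history rather than using fixed values.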

Synthesizing Data on Mental Illness

A new center at the University of Chicago will explore the origins of psychiatric disease by integrating existing data from diverse disciplines and across multiple sites. The center is the newest Silvio O. Conte Center for Neurosciences Research and the first with a computational focus. It received $14 million in grants from the National Institute of Mental Health and the Chicago Biomedical Consortium.

“We have a lot of datasets from different communities that have never been analyzed within the same model before. It’s an exciting research opportunity,” says principal investigator Andrey Rzhetsky, PhD, professor of medicine and human genetics at the University of Chicago Medical Center.

Rzhetsky will head a consortium of 15 lead investigators from seven schools that will bring together clinical data, genetic linkage data, gene pathway data, functional data on genes and proteins, drug data, and drug-gene interaction data. “The main premise of the center is to get together wonderful specialists in different disciplines; make them talk to each other; design models that span all datasets; and make predictions that can be tested experimentally.”

Rzhetsky’s team will attempt to unearth novel connections between genes, environment, and disease phenotypes, as well as between the disorders themselves. For example, Rzhetsky and colleagues have previously shown that autism, schizophrenia and bipolar disorder have considerable genetic overlap. “You can get a lot more from joint analysis of several phenotypes than from a single phenotype,” Rzhetsky says.

LEVERAGING SOCIAL MEDIA: For Biomedical Research

BY KATHARINE MILLER

It has become commonplace for people to use social media to share their healthcare stories, seek a community of individuals with the same diseases, and learn about treatment options. All this Internet activity also produces data that can be used for research.

“In the networked world, who cures cancer? We all do,” says Paul Wicks, PhD, director of research and development at PatientsLikeMe, a site where people diagnosed with serious life-changing illnesses can record and share information.

For PatientsLikeMe and a number of other sites, doing biomedical research using data gathered online is part of the business plan. With names such as 23andMe, MedHelp, TUDiabetes, myMicrobes.eu, CureTogether, these sites blend community building with information gathering. They then turn to computational approaches, such as data mining and natural language processing (NLP), to analyze the information gathered.

This crowd-sourced research often reaches into realms that otherwise wouldn’t or couldn’t be studied, due to a lack of either appropriate information or financial support. Moreover, with their access to large populations of both cases and controls, these sites are rapidly producing clinical research results. That they function in a landscape of ever-changing and growing data just makes the process that much more interesting.

Doing Research That Others Can’t or Won’t

On social media healthcare sites such as PatientsLikeMe, people record and share information about their diseases. This self-reported data may have some inherent biases, says Wicks, who hopes that those issues will disappear as they get to a large enough scale. But it also has some inherent strengths: Online, people talk about issues they might not raise with a physician, and they can report on and track their conditions more frequently.

To take advantage of this, PatientsLikeMe set out to “do research that’s new and novel… and not just cheaper than a survey by mail,” Wicks says. Moreover, he says, “Our inclination is to do work that reflects the needs of patients.” So, for example, PatientsLikeMe studied the incidence of compulsive gambling among people with Parkinson’s disease (PD) because people on the site were concerned about the phenomenon. In their sample—assembled in the course of just a week—they found that gambling was twice as common among PD patients as would be expected from physician notes—suggesting that patients don’t necessarily share certain embarrassing information with their doctors (although bias in the sample could also be an issue)—and that compulsive gambling was not associated with being on a dopamine-agonist drug (as previous studies had suggested).

PatientsLikeMe has also looked at off-label drug use. “People don’t want to fund research of off-label drugs, especially generics,” Wicks says. “Our platform provides a way to capture data that no one else has the bandwidth to look at.”

Access to Large Populations for Clinical Trials

At PatientsLikeMe, 23andMe and MedHelp, researchers are finding that online communities offer a huge benefit to clinical research: A vast treasure trove of cases and an even vaster population of controls.

Launched in 2005, PatientsLikeMe has 115,000 users and covers about 1300 conditions. For about twenty of those conditions, PatientsLikeMe collects patient data in a structured way, requesting information on specific outcomes—the kinds of things typically used for clinical trials. “We build it so we can prepare for future research studies,” Wicks says.

For example, early on, the site created a community and several surveys for people with ALS (amyotrophic lateral sclerosis). This meant they already had lots of valuable background data when, in 2008, the community clamored for treatment with lithium. A small (16-person) study in Italy had shown that lithium could slow the progress of the disease. But PatientsLikeMe researchers were wary. “Many studies of ALS treatments kill patients faster than placebo,” Wicks says. “You want to be sure it’s not harmful.” So PatientsLikeMe immediately spent a year gathering data on off-label lithium use by 150 eager ALS patients in their community. And they matched cases to controls in a rigorous way—using an algorithm that considered data on both ALS onset and the shape of the disease progression curve, key traits that vary in significant ways among ALS patients. This was possible, Wicks says, because they had lead-in data describing the patients’ status before taking the drug. Preliminary results announced in December 2008 (just nine months after the Italian research was published) showed that lithium was not effective in slowing disease progress. Since then, this result was confirmed in randomized clinical trials. The PatientsLikeMe research was published in Nature Biotechnology in April 2011.

This screenshot of the ALS tracking tool for an individual patient in the PatientsLikeMe lithium study shows how the patients entered their disease characteristics, demographics, blood levels, dosage, ALSFRS-R score (a measure of disease progression), forced vital capacity, and side effects. Reprinted from supplemental figure 1, Wicks, P, et al., Accelerated clinical discovery using self-reported patient data collected online and a patient-matching algorithm, Nature Biotechnology 29, 411–414 (2011).
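The matching step described above can be pictured as a nearest-neighbor search over lead-in trajectories: characterize each treated patient by features such as time since onset and the slope of the progression score, then pair them with the untreated patient whose pre-treatment course looks most similar. The sketch below is a generic matcher on made-up features, not the published PatientsLikeMe algorithm, which matched on the full shape of the ALSFRS-R progression curve.

```python
# Generic lead-in trajectory matching sketch (simulated features, illustrative only).
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical features per patient: [months since ALS onset, progression slope, baseline score]
cases = rng.normal([24, -1.0, 35], [8, 0.4, 5], size=(150, 3))       # lithium takers
controls = rng.normal([24, -1.0, 35], [8, 0.4, 5], size=(3000, 3))   # non-takers

# Standardize features so onset time, slope, and score contribute comparably.
mu, sigma = controls.mean(axis=0), controls.std(axis=0)
cases_z, controls_z = (cases - mu) / sigma, (controls - mu) / sigma

matched = []
available = np.ones(len(controls), dtype=bool)
for cz in cases_z:
    dist = np.linalg.norm(controls_z - cz, axis=1)   # similarity of lead-in trajectories
    dist[~available] = np.inf                        # match without replacement
    best = int(np.argmin(dist))
    matched.append(best)
    available[best] = False

print("matched control indices for first 5 cases:", matched[:5])
# Post-treatment progression would then be compared between each case and its matched control.
```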

The genotyping service 23andMe does research using data they gather from people who provide not only saliva samples but also phenotype information gathered through online surveys. And the company leverages social media such as Twitter and Facebook to recruit communities of individuals with a particular disease. “Recruiting is not done through a clinical center,” says Chuong (Tom) Do, PhD, a research scientist at the company. “It’s done entirely online.”

For many communities, Do says, “we actually have the genotyping process completely sponsored, making the financial barriers to participation in the research as low as possible.” For example, using a private donation from Google founder Sergey Brin, 23andMe was able to sponsor most of the genotyping costs for PD cases in a recent study. But for controls, Do says, 23andMe has the advantage of being able to use data from the population of people who pay for the service. For a less common disease like PD, Do says, the small proportion of misclassified cases mixed in with that population would have a negligible effect on the results of the study. “We just need to be sure to get enough cases,” he says. “Controls come for free. It’s actually a huge help for us.” Indeed in a recent study of PD (published in PLoS Genetics in June of 2011) involving roughly 3400 cases and 29,000 controls, they were able to identify two novel genes contributing to the risk of developing PD. Because of the control group’s size, Do says, “We could wring a lot of statistical power from our dataset.”

23andMe has also launched initiatives to study several rare disorders, namely sarcoma and myeloproliferative neoplasms. While recruitment for these conditions can be difficult and expensive in the setting of a traditional research center, 23andMe’s system allows for aggregation of individuals at low overhead to the company and without regard for geographic barriers, Do says.

With over 12 million users, MedHelp is the largest online health community. The business focuses on helping people track their diseases as well as connecting them with appropriate communities and physicians. In addition, though, they work in partnership with academics, physicians and others to extract useful knowledge from MedHelp’s accumulating data. For example, several physicians examined data on lens implant failures pulled from the eye-care forums on MedHelp (forums that were sponsored by the American Academy of Ophthalmology). The researchers found that multi-focal implants had a much higher failure rate than other types—information that was very valuable to the ophthalmology community.

Rapid Turnaround Time

Compared to clinical research centers, those who leverage social media web sites can conduct clinical research very quickly. The PatientsLikeMe study of lithium use in ALS was completed in just twelve months—before a randomized clinical trial even began recruitment. In another example, when members of the site’s ALS community raised a question about excessive yawning, PatientsLikeMe published research on the problem in just three months.

“The ability to accelerate the pace of research through social media is exciting to me,” Do says. Part of the acceleration comes from the immediate ability to amass and access large cohorts, he says. But it goes beyond that. For example, when 23andMe set out to determine whether its data was reliable enough to replicate published genome-wide association studies (GWAS), they completed the task at lightning speed compared to a typical GWAS. Indeed, it took 23andMe less than one year to replicate and present results from a PD GWAS that had taken the previous researchers almost six years from hypothesis to publication.

Flexibility

If data initially collected online is incomplete or even wrong, it is easily amended by going back to the users with revised surveys. For example, when 23andMe first attempted to replicate a GWAS for celiac disease, they did not find the expected associations. Because their survey had asked “Have you ever been diagnosed with celiac disease,” they believed their study might include some false positives. So they re-worded the question to ask: “Have you ever been diagnosed with celiac disease, as confirmed by a biopsy of the small intestine.” And with the newly (and rapidly) acquired answers, they were able to replicate four of the six expected associations.

Dealing with changes of this kind also means re-running the GWAS. “Many times based on research results, we’ll ask new questions,” Do says. “So we end up with a very fluid dataset and the need for tools that allow us to work with the data as it constantly changes.” They often run the same GWAS studies repeatedly. “We have over 1000 that we run on a regular basis, culled from the 50 plus surveys,” Do says. That is a unique computational aspect of the work: custom-built software to conduct parallelized GWAS on the same dataset.
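At its core, re-running an association scan when the survey answers change just means repeating a per-variant statistical test over the same genotype matrix with an updated phenotype vector, a job that parallelizes naturally across variants. The sketch below runs a simple chi-square allele-count test per SNP on simulated data; it is a generic illustration, not 23andMe's software, which also handles quality control, covariates, and population structure at far larger scale.

```python
# Re-runnable association scan sketch: one chi-square allele-count test per SNP.
# Simulated genotypes and phenotypes; illustrative only.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(5)
n_people, n_snps = 2000, 500                                  # hypothetical cohort and panel sizes
maf = rng.uniform(0.1, 0.4, n_snps)                           # minor allele frequencies
genotypes = rng.binomial(2, maf, size=(n_people, n_snps))     # 0/1/2 minor-allele counts

def run_scan(phenotype):
    """Test every SNP against the current case/control labels; return p-values."""
    cases = genotypes[phenotype == 1]
    controls = genotypes[phenotype == 0]
    pvals = np.empty(n_snps)
    for j in range(n_snps):
        table = [
            [cases[:, j].sum(), 2 * len(cases) - cases[:, j].sum()],          # case allele counts
            [controls[:, j].sum(), 2 * len(controls) - controls[:, j].sum()], # control allele counts
        ]
        pvals[j] = chi2_contingency(table)[1]
        # Each SNP's test is independent, so the loop is trivially parallelizable.
    return pvals

# Original survey wording vs. a revised, stricter wording: swap the phenotype and re-run.
phenotype_v1 = rng.integers(0, 2, n_people)    # "ever diagnosed?" answers (simulated)
phenotype_v2 = rng.integers(0, 2, n_people)    # "biopsy-confirmed?" answers (simulated)
print("smallest p-value, survey v1:", run_scan(phenotype_v1).min())
print("smallest p-value, survey v2:", run_scan(phenotype_v2).min())
```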

Today, 23andMe has more than 120,000 people’s genotypes in its database. “We look forward to the day when we have one million plus,” Do says. “We can only imagine the types of discoveries that will be possible with a database that size.”

23andMe successfully replicated previous GWAS for a number of diseases as shown here in a chart of success rate (versus total power) by disease class. Expected = number of associations they expected to replicate. Attempts = number of associations they attempted to replicate. The blue dot represents the success ratio (number of successful replications divided by number of expected replications). The black line represents the 95 percent prediction interval for the success ratio. Reprinted from Tung, J, et al., Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data, PLoS ONE 6(8) (2011).


Big Data Analytics in Biomedical Research

BY KATHARINE MILLER

“We have recommendations for you,” announces the website Amazon.com each time a customer signs in.

This mega-retailer analyzes billions of customers’ purchases—nearly $40 billion worth in 2011 alone—to predict individuals’ future buying habits. And Amazon’s system is constantly learning: With each click of the “Place your order” button, the company’s databank grows, allowing it to both refine its predictions and conduct research to better understand its market.

These days, this sort of “Big Data Analytics” permeates the worlds of commerce, finance, and government. Credit card companies monitor millions of transactions to distinguish fraudulent activity from legitimate purchases; financial analysts crunch market data to identify good investment opportunities; and the Department of Homeland Security tracks Internet and phone traffic to forecast terrorist activity.

Where is Amazon’s equivalent in healthcare and biomedical research? Do we have a “learning healthcare system” that, like Amazon.com, can glean insights from vast quantities of data and push it into the hands of users, including both patients and healthcare providers? Not even close.

It’s a situation that frustrates and inspires Colin Hill, CEO, president, chairman and cofounder of GNS Healthcare, a healthcare analytics company. “When I go to my doctor for some treatment, he’s kind of guessing as to what drug works,” he says. With the data currently being captured and stored, he says, there’s now an opportunity to take a broader view of the problem. “We need to make this system smarter and use data to better determine what interventions work,” he says.

And there is hope, says Jeff Hammerbacher, who formerly led the data team at Facebook and is now chief scientist at Cloudera, a company that provides businesses with a platform for managing and analyzing big data. “I believe that the methods used by Facebook and others—commodity hardware, open source software, ubiquitous instrumentation—will prove just as revolutionary for healthcare as they have for communications and retail,” he says.

Others agree: “We have to create an infrastructure that allows us to harvest big data in an efficient way,” says Felix Frueh, PhD, president of the Medco Research Institute.

Right now, biomedical infrastructure lags well behind the curve. Our healthcare system is dispersed and disjointed; medical records are a bit of a mess; and we don’t yet have the capacity to store and process the crazy amounts of data coming our way from widespread whole-genome sequencing. And then there are privacy issues (see “Privacy in the Era of Electronic Health Information,” a story also in this issue). Moreover, while Amazon can instantly provide up-to-date recommendations at your fingertips, deploying biomedical advances to the clinic can take years.

Despite these infrastructure challenges, some researchers are plunging into biomedical Big Data now, in hopes of extracting new and actionable knowledge. They are doing clinical trials using vast troves of observational health care data; analyzing pharmacy and insurance claims data together to identify adverse drug events; delving into molecular-level data to discover biomarkers that help classify patients based on their response to existing treatments; and pushing their results out to physicians in novel and creative ways.

Perhaps it’s asking too much to expect that the complexities of biology can be boiled down to Amazon.com-style recommendations. Yet the examples described here suggest possible pathways to the dream of an intelligent healthcare system with big data at its core.

DEFINING BIG DATA IN BIOMEDICINE

Big data in biomedicine is coming from two ends, says Hill: the genomics-driven end (genotyping, gene expression, and now next-generation sequencing data); and the payer-provider end (electronic medical records, pharmacy prescription information, insurance records).

On the genomics end, the data deluge is imminent. With next-generation sequencing—a process that greatly simplifies the sequencing of DNA—it is now possible to generate whole genome sequences for large numbers of people at low cost. It’s a bit of a game-changer.

“Raw data-wise, it’s 4 terabytes of data from one person,” says Eric Schadt, chair of genetics at Mt. Sinai Medical School in New York City. “But now imagine doing this for thousands of people in the course of a month. You’re into petabyte scales of raw data. So how do you manage and organize that scale of information in ways that facilitate downstream analyses?”

For now, as we wait for next-gen sequencing to work itsmagic, genomics data matrices remain long andthin, with typically tens to hundreds of patientsbut millions or at least tens of thousands of vari-ables, Hill notes.

"But on the payer-provider data side," Hill says, "we're dealing now with large longitudinal claims data sets that are both wide and deep." A data matrix might have hundreds of thousands of patients with many characteristics for each—demographics, treatment histories, outcomes and interventions across time—but typically not yet thousands or millions of molecular characteristics.

To a great degree, the two sides of biomedical big data have yet to converge. Some researchers work with the clinical and pharmaceutical data; others work with the biomolecular and genomics data. "The bottom line is," says Eric Perakslis, PhD, chief information officer at the U.S. Food and Drug Administration, "the large body of healthcare data out there has yet to be truly enhanced with molecular pathology. And without that you're really not getting at mechanisms of action or predictive biology." Where there is data, he says, "It's almost this random thing: Molecular data is collected at a few time points but that's it."

Nevertheless, Schadt believes that a world where these biomolecular and clinical datasets come together may arrive soon. "In maybe ten years time," he says, "all newborns and everyone walking through the door will have his or her genome sequenced and other traits collected and that information will all be crunched in the context of their medical history to assess the state of the individual."

THE TOOLS OF BIG DATA ANALYTICS

If you have data sets with millions or tens of millions of patients followed as a function of time, standard statistics aren't sufficient, especially if you are looking for associations among more than two variables, or data layers. "This is not about genome-wide association studies (GWAS)," Hill says. Such studies typically seek to connect genomic signatures with disease conditions—essentially looking at only two layers of data. "When people start doing this from multiple layers of data, that's where it becomes non-trivial," Hill says. "That's where in my mind it gets to big data analytics rather than biostatistics or bioinformatics."

Many of the tools of big data analytics are already being used in other fields, says Schadt. "We're almost latecomers to this game but the same sorts of principles applied by Homeland Security or a credit card fraud division are the kinds of approaches we want to apply in the clinical arena."

The U.S. Department of Homeland Security, for example, examines such things as cell phone and email traffic and credit card purchase history in an attempt to predict the next big national security threat. They want to consider everything together, letting the data speak for itself but looking for patterns in the data that may signify a threat, Schadt says. They achieve this using machine learning, in which computers extract patterns and classifiers from a body of data and use them to interpret and predict new data: They know when a prior threat occurred, so they look for features that would have helped them predict it and apply that looking forward. In a clinical setting, that could mean looking at not only which molecular or sequencing data predicts a drug response but also what nurse was on duty in a particular wing during specific hours when an event occurred. "You just want all this information and then crunch it to figure out what features turn out to be important," Schadt says.

In addition to machine learning, Hill says, there is a need for approaches that scale up to the interpretation of big data. In his opinion, this means using hypothesis-free probabilistic causal approaches, such as Bayesian network analysis, to get at not only correlations, but cause and effect.

He points to strategies developed by Daphne Koller, PhD, professor of computer science at Stanford University, as an example of what can be done. Much of her work involves the use of Bayesian networks—graphical representations of probability distributions—for machine learning. These methods scale well to large, multi-layered data sets, he says.

Hill's company, GNS Healthcare, has developed its own variation, which they call "reverse engineering and forward simulation" (REFS). "We break the dataset into trillions of little pieces, evaluating little relationships," he says. Each fragment then has a Bayesian probabilistic score signaling how likely the candidate relationship is, as well as the probability of a particular directionality (an indication of possible cause and effect). After scoring all of the possible pair-wise and three-way relationships, REFS grabs the most likely network fragments and assembles them into an ensemble of possible networks that are robust and consistent with the data. That's the reverse engineering part. Next comes forward simulation to predict outcomes when parts of each network are altered. This procedure allows researchers to score the probability that players in the ensemble of networks are important and to do so in an unbiased way across a large dataset.
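The fragment-scoring step can be pictured with a toy example. The sketch below is illustrative only: it is not GNS's REFS code, and the synthetic data, variable names, and BIC-style score are invented for the example. It shows the basic move of scoring every candidate directed relationship in a small multi-layered dataset and keeping the strongest fragments.

```python
# Toy illustration of fragment scoring in an ensemble-of-networks approach.
# This is NOT GNS's REFS implementation; the data and the BIC-style score
# are invented for the example.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Synthetic multi-layered data: 200 "patients" x 4 variables
# (a SNP dosage, a gene-expression level, a drug dose, and an outcome).
n = 200
snp = rng.integers(0, 3, n).astype(float)
expr = 0.8 * snp + rng.normal(0, 1, n)        # expression partly driven by the SNP
dose = rng.normal(0, 1, n)                    # independent treatment variable
outcome = 0.5 * expr - 0.7 * dose + rng.normal(0, 1, n)
data = {"snp": snp, "expr": expr, "dose": dose, "outcome": outcome}

def edge_score(x, y):
    """BIC-style score for the fragment 'x -> y': improvement of a linear
    model of y given x over a model of y alone, penalized for the extra
    parameter (higher means a more plausible relationship)."""
    resid_null = y - y.mean()
    beta = np.polyfit(x, y, 1)
    resid_alt = y - np.polyval(beta, x)
    ll_null = -0.5 * n * np.log(np.var(resid_null))
    ll_alt = -0.5 * n * np.log(np.var(resid_alt))
    return (ll_alt - 0.5 * np.log(n)) - ll_null

# Score every candidate directed pair and keep the strongest fragments.
fragments = sorted(
    ((edge_score(data[a], data[b]), f"{a} -> {b}")
     for a, b in itertools.permutations(data, 2)),
    reverse=True,
)
for score, frag in fragments[:4]:
    print(f"{frag:18s} score = {score:7.1f}")
```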

Schadt agrees that such data-driven approaches are essential, and he uses them in his own work. But he says big data analytics covers a vast computational space ranging from bottom-up dynamical systems modeling to top-down probabilistic causal approaches—whatever approach (including hypothesis-driven), he says, "can derive meaningful information to aid us in understanding a disease condition or drug response or whatever the end goal is." Essentially, he says, it's not the approach that defines big data analytics, but the goal of extracting knowledge and ultimately understanding from big data.

Perakslis views the problem somewhat differently. "In order to get translational breakthroughs, you have to start out with an intentional design, which starts with intentional sampling," he says. "And to be honest, I don't think it works well yet hypothesis free." In a GWAS, he says, of course you look at everything because you don't know where to look. But the answers you seek can be lost in the noise.

Perakslis is more interested in broadening the types of data that are brought to bear in focused clinical trials. "Too often, when people make decisions, they are only looking at part of the story," he says. So, for example, if at the end of a Phase III clinical trial a drug doesn't produce the degree of success needed for approval, the database should be rich with information to help figure out why and where to go from there. TranSMART, a clinical informatics database that Perakslis helped assemble when he worked at J&J, does just that: It integrates different types of data into one location.

CLINICAL & PHARMACEUTICAL BIG DATA: ALREADY ABUNDANT

These days, for certain large healthcare organizations, large quantities of data simply accrue as an inevitable part of doing business. This is true of most hospitals, health maintenance organizations (HMOs), and pharmacy benefits managers (also known as PBMs). In these settings, "You really are getting at big data on the scales of LinkedIn, Amazon, Google, eBay and Netflix," says Yael Garten, PhD, a senior data scientist at LinkedIn who received her doctorate in biomedical informatics from Stanford University.

For example, Kaiser Permanente (KP), an HMO, has a 7-terabyte research database culled from electronic medical records, says Joe Terdiman, MD, PhD, director of information technology at Kaiser Permanente Northern California's Division of Research. That doesn't include any imaging data or genomics data. This special research database has been pre-cleaned and standardized using SNOMED CT, an ontology of medical terms useful for research. "By cleaning and standardizing the data and making it easily accessible, we hope to do our research faster and more accurately," Terdiman says.

Medco, a PBM, accumulates longitudinal pharmacy data "because we are who we are and do what we do," Frueh says. As a large PBM that covers about 65 million lives in the United States, Medco manages the pharmaceutical side of the healthcare industry on behalf of payers. Their clients are health plans and large self-insured employers, state and governmental agencies, as well as Medicare. The company has agreements with some of these clients who provide large sets of medical claims data for research purposes. From the claims data, Medco can extract patient indications, treatments, dates of treatment, and outcomes (for example, whether the patient was hospitalized or not). Putting this multi-layered data together, Medco can search for associations between drug use, patient characteristics, and clinical impact (good, bad or indifferent) in order to determine whether a drug works the way it should.

And at Medco, big data analytics has already reaped dividends by uncovering drug-drug interactions. For example, clopidogrel (Plavix™) is a widely used drug that prevents harmful blood clots that may cause heart attacks or strokes. However, researchers were concerned that certain other drugs—proton-pump inhibitors used to reduce gastric acid production—might interfere with its activation by the body. Using their database, Medco looked for differences in two cohorts: those on one drug and those on the two drugs that potentially interact. The study revealed that patients taking both Plavix and a proton-pump inhibitor had a 50 percent higher chance of cardiovascular events (stroke or heart attack).

A similar study showed that antidepressants block the effectiveness of tamoxifen taken to prevent breast cancer recurrence. Patients taking both drugs were twice as likely to experience a recurrence.

"Both of these studies are prototypical of the kinds of questions we can ask in our database where we can correlate pharmacy data with clinical outcome data," Frueh says.
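The underlying comparison has a simple form, even if assembling and adjusting the cohorts does not. The sketch below shows the shape of the calculation with invented counts and no adjustment for confounders; it illustrates the idea rather than reproducing Medco's analysis.

```python
# Two-cohort comparison with invented counts (no confounder adjustment),
# illustrating the kind of claims-based question described above.
def relative_risk(events_exposed, n_exposed, events_control, n_control):
    """Ratio of event rates in the exposed vs. control cohorts."""
    return (events_exposed / n_exposed) / (events_control / n_control)

# Cohort A: clopidogrel plus a proton-pump inhibitor; Cohort B: clopidogrel alone.
rr = relative_risk(events_exposed=450, n_exposed=10_000,
                   events_control=300, n_control=10_000)
print(f"Relative risk of cardiovascular events: {rr:.2f}")  # 1.50 with these counts
```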

GOING HYPOTHESIS-FREE

Though Medco's outcomes are impressive, they have thus far relied on fairly straightforward statistical and epidemiological methods that were nevertheless quite labor intensive. "The hands-on analytics time to write the SAS code and specify clearly what you need for each hypothesis is very time-consuming," Frueh says. In addition, the work depends on having a hypothesis to begin with—potentially missing other signals that might exist in the data.

To address this limitation, Medco is currently working with Hill's GNS Healthcare to determine whether a hypothesis-free approach could yield new insights. So in the Plavix example, rather than starting with the hypothesis that proton-pump inhibitors might interact with drug activation, Frueh says, "We're letting the technology run wild and seeing what it comes up with."

Because GNS Healthcare's REFS platform automates the process, he says, Medco can take the strongest signals from the data and avoid wasting time on hypotheses that don't lead to anything. Right now they are confirming whether the strongest findings identified by applying the REFS platform to the Plavix database actually hold up to more in-depth analysis.

ADDING GENOMICS TO THE MIX

The REFS platform developed by GNS Healthcare also functions in contexts that include genomic data. For example, in work published in PLoS Computational Biology in March 2011, GNS Healthcare and Biogen identified novel therapeutic intervention points among the one-third of arthritis patients who don't respond to a commonly used anti-inflammatory treatment regimen (TNF-α blockade). The clinical study sampled blood drawn before and after treatment of 77 patients. The multi-layered data included genomic sequence variations; gene expression data; and 28 standard arthritis scoring measures of drug effectiveness (tender or swollen joints, C-reactive protein, pain, etc.). Despite being entirely data driven, the second-highest-rated intervention point they discovered was the actual known target of the drug. The highest-rated intervention point—a new target—is now being studied by Biogen.

"To my knowledge," Hill says, "this is the first time that a data-driven computational approach (rather than a single biomarker approach) has been applied to do this in a comprehensive way." And although the number of patients was relatively small, Hill says, the study suggests that researchers can now interrogate computer models of drug and disease biology to better understand cause and effect relationships from the data itself, without reliance on prior biological knowledge.

"If you ask me why we're doing this," Hill says, "it's because it's going to cure cancer and other diseases and there's no other way to do it than by using big data analytics…. If you do discovery the way it's been done until now, it just doesn't cut it."

Today, rather than deal with the vastness of genomics data, Schadt says, many researchers distill it down to look only at the hundred or so gene variants they think they know something about. But this will be a mistake in the long run, Schadt says. "We need to derive higher level information from all of that data without reducing dimensionality to the most naïve level. And then we need the ability to connect that information to other large data sources such as all the types of data gathered by a large medical center."

The eMERGE Network, an NIH-funded collaboration across seven sites, is taking a running start at doing this by linking electronic medical records data with genomics data. Researchers will be able to study cohorts extracted from this "big data" without having to actively recruit and gather samples from a study population.

To a great extent, though, the eMERGE Network is still building its repository and confirming that it can repeat known results. The analytics are only now getting underway.

Kaiser Permanente, like the eMERGE network, is currently building what will be one of the largest biorepositories anywhere, with genotype data from 100,000 patients. "We hope to reach 500,000," says Terdiman of Kaiser Permanente.

But Kaiser is still sorting through what sort of platform to use for the data. They are looking at Hadoop—an up-and-coming open-source distributed-computing framework for storing and managing big data—as well as other possibilities. "With 100,000 patients genotyped, and each one has 700,000 SNPs, that's a pretty big matrix," Terdiman says. And then when you associate that with phenotypic data from the electronic medical record, he points out, "there's a combinatorial effect of all these variables such that simple or even relatively fast processors might take weeks to do a single analysis." GWAS programs usually run on small samples, and Terdiman doesn't yet know how well they will scale to the full genotyped database. "No one, literally, has had the amount of data to do GWAS studies that we have," he says.

Really, says Frueh, the data deluge from whole genome sequencing is just beginning. Frueh would love to tie Medco's data to genomics biorepositories but there just isn't enough data yet. Frueh notes that he could possibly partner with labs or organizations that have done large GWAS but, he says, unless you're asking the same questions as the GWAS, you won't get a lot of depth in those studies, especially after matching people to the pharmacy database. "You go from large to small numbers very quickly," he says.

Stephen McHale, CEO of Explorys, a big data bioinformatics company based in Cleveland, Ohio, says that traditional relational data-warehousing technology can't efficiently handle the 30 billion clinical elements in their dataset. So Explorys implemented the type of data architecture that supports Google, Yahoo and Facebook. "Our technology is all column store and using MapReduce and those kinds of architectures," he says, referring to approaches that use large numbers of computers to process highly distributable problems across huge datasets. He says that, to his knowledge, it's a first in the medical space. "We needed that sort of architecture to support this much data." And with genomics coming their way, it seems even more essential to use these types of architecture, McHale says. Explorys is now working on some pilot initiatives to integrate genomic data with observational data.
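The MapReduce pattern McHale refers to is straightforward to sketch. The single-process toy below is not Explorys's system, and the record format is invented; in production the map and reduce phases run in parallel across many machines.

```python
# Single-process toy illustration of the MapReduce pattern: map each record
# to key/value pairs, shuffle by key, then reduce each group.
from collections import defaultdict

records = [
    {"patient": "p1", "element": "diagnosis:diabetes"},
    {"patient": "p2", "element": "lab:hba1c"},
    {"patient": "p3", "element": "diagnosis:diabetes"},
    {"patient": "p1", "element": "rx:metformin"},
]

def map_phase(record):
    # Emit one (key, count) pair per clinical element.
    yield record["element"], 1

def reduce_phase(key, values):
    # Combine all counts emitted for one key.
    return key, sum(values)

groups = defaultdict(list)           # the "shuffle": group mapped pairs by key
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = dict(reduce_phase(k, v) for k, v in groups.items())
print(results)   # {'diagnosis:diabetes': 2, 'lab:hba1c': 1, 'rx:metformin': 1}
```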

MAKING BIG DATA ACTIONABLE

Extracting knowledge from big data is a huge challenge, but perhaps a greater one is ensuring that big data infrastructure will form the backbone of an effort to push Amazon.com-style recommendations to practitioners and patients.

Garten notes that implementing an Amazon- or LinkedIn-style recommendation system in biomedicine will be tough. Such systems use machine learning and natural language processing to, in a sense, bucket customers into groups. "But the ability to bucket people together is harder in biomedicine," Garten says. "The slightest variations can matter a lot in terms of how we metabolize drugs or respond to the environment, so the signal is harder to find." The stakes are also higher for getting a false result.

But Medco's experience suggests such bucketing is already possible, at least to some extent. In the Plavix case described above, for example, Medco was in a position to immediately effect a change: "We can pull a switch and say that each and every pharmacist on our list needs to be told about this," Frueh says. After implementing the rule, Medco saw a drop of about one third in co-use of the interacting drugs. "This is one example where the use of big data in this stepwise process has cut down on the time it takes to get changes into clinical practice," Frueh says.

In another example, Medco was able to use its infrastructure to increase uptake of a genotyping test for warfarin dosing. First, however, they had to show payers that the test was cost-effective. In a clinical trial conducted in collaboration with Mayo Clinic, Medco showed that genotyping reduced the rate of hospitalizations among warfarin-dosed patients by 30 percent. Armed with that information, payers became supportive of Medco reaching out to physicians to suggest they use the genotyping test before prescribing warfarin. Because of Medco's big data infrastructure, this outreach could be easily accomplished: Each time a physician prescribed warfarin, a message was routed back through the pharmacy to the physician, suggesting use of the test. The result: an increase in uptake of the test from a rate of 0.5 percent or so in the general physician population up to approximately 20 to 30 percent by physicians in the network.

"This has to do with creating an environment and the operational infrastructure to be proactive," Frueh says. And Frueh suspects that uptake of the test will continue to grow. "We're probably at the beginning of what we hope will be a hockey-stick-shaped uptake of this test." The lesson: Big data, and the connectedness of big data to the real world, provides the opportunity to take advantage of teachable moments at the point of care.

As we go from data generation to knowledge about what it means, to making that knowledge actionable, Schadt says, "It will impact clinical decisions on every level."

PLAYING CATCH UP AND THEN SOME

To some extent, big data analytics in biomedicine lags finance and commerce because it hasn't taken advantage of commercial methods of handling large datasets—like Hadoop and parallelized computing. "These allow data analytics in an industry-level manner," Garten says. "That's something that LinkedIn, Amazon and Facebook have already nailed, and bioinformatics is lagging behind those industries."

Bioinformatics researchers still spend a lot of time structuring and organizing their data, preparing to harvest the insights that are the end goal, says Garten. By contrast, the private sector has completed the phase of structuring and collecting data in an organized fashion and is now investing more and more effort toward producing interesting results and insights. Eventually, Garten says, "the practices and experience from the corporations with large amounts of data (i.e., LinkedIn, Amazon, Google, Yahoo) will propagate back to the academic and research setting, and help accelerate the process of organizing the data."

At the same time, bioinformatics actually has something to offer the broader world, Garten says. She and others with a bioinformatics background who have moved into other arenas bring to the table an ability to handle messy data that is often incomplete. The expertise in integrating various datasets in creative ways to infer insights from this data, as is done in translational bioinformatics, is useful for extracting business insights in other industries.

Hill also sees biomedical approaches filtering outward. "REFS is data-agnostic," he says. It can work on genomic data as easily as clinical data—or, for that matter, financial data. Hill's company recently created a financial spinoff called FINA Technologies. He also spun off Dataspora, which is focused on consumer ecommerce. "We've created a technology that goes all the way from unraveling how cancer drugs work to predicting financial markets," Hill says. "This technology is applicable to how complex systems work in different industries, and there's something profound about that." !!

Privacy and Biomedical Research: Building a Trust Infrastructure

BY ALEXANDER GELFAND

It's the basis of every patient/physician interaction: Shared personal health information is kept confidential and used only for the patient's benefit. It's a tradition that started before the time of Hippocrates, endured through the era of records stored in filing cabinets, and persists today as we move to electronic patient records. And it's codified in the form of HIPAA, the federal Health Insurance Portability and Accountability Act, which ensures the privacy of health records.

But as sensitive personal health data accrue in ever larger databases, concerns over privacy breaches are on the rise. And as researchers perceive the potential usefulness of this vast data trove, they seek strategies to access it without violating HIPAA. In response, data privacy experts are developing ever more sophisticated methods to protect electronic health data from unwanted exposure. And while many of these experts have raised alarms about the vulnerabilities of the privacy-protection schemes currently in place, they have also begun talking about the possibility of implementing far more powerful technologies in the near future.

"This is the start of the golden age of privacy research," says Dan Kifer, PhD, a computer scientist at Pennsylvania State University who has investigated privacy-preserving techniques for applications ranging from biomedical research to the U.S. Census.

Privacy Fears Drive Innovation

The rapid progress taking place in privacy research in the biomedical arena is driven in large part by fear—namely, fear that the vast warehouses of biomedical data now being assembled could be vulnerable to the same kinds of privacy breaches that have in recent years plagued such information aggregators as Google and Facebook.

Granted, the evidence of such breaches in the realm of medical records, clinical data, and genomic information remains slim. The most alarming incidents to date have involved simple failures of security, or of access control, such as the theft or loss of unsecured computers containing electronic medical records, rather than the unintentional leakage of sensitive information from large biomedical databases; and as yet, no one has reportedly been harmed by the unauthorized release of their biomedical information. Instead, the most impressive privacy breaches to date have been perpetrated by academic researchers who were trying to find weaknesses in the systems they were attacking: identifying the medical records of a particular individual in a hospital system, for example, or identifying participants in a genome-wide association study (GWAS) designed to link particular diseases to specific genetic variations.

Yet anxiety over the possibility of more public, and more harmful, privacy breaches continues to build, the principal concern being that insurers and employers might use biomedical data to discriminate against policyholders and employees. Such fears are not unfounded. "Insurers have historically used data to make coverage determinations," says Deven McGraw, director of the health privacy project at the Center for Democracy and Technology, a nonprofit public interest group in Washington, DC. Carl Gunter, PhD, a computer scientist at the University of Illinois who studies the health information exchanges that enable hospitals to share electronic medical records, emphasizes the "dreadful risks" posed by medical identity theft, in which one person assumes the identity of another when seeking medical care, and the medical histories of both victim and thief become dangerously entangled. And there is rising concern over the privacy risks associated with genomic data in particular. As Brad Malin, PhD, director of the Health Information Privacy Lab at Vanderbilt University, points out, genomic data is highly distinguishable, extremely stable, and can in certain situations be used to predict the likelihood that an individual might fall prey to this disease or that one—information that could be used to deny coverage or a job.

Some of these scenarios might seem unlikely at the present time. It's doubtful, for example, that many people outside of a university computer science department would have the technical wherewithal to pick a single individual out of the mass of statistics associated with a GWAS. But technological progress has a way of closing the gap between the improbable and the probable. "There's nothing to say that what's unreasonable now won't be unreasonable in the near future," Malin says. And even the smallest risk of a privacy violation can be enough to scare a patient away from participating in a clinical trial, or persuade an institution to withhold data from researchers due to ethical or legal concerns. According to Gunter, some health information exchanges have already prohibited the sharing of electronic medical records for research purposes. "There is such fear that we need to address it before we can make full use of this data for research purposes," says Lucila Ohno-Machado, PhD, founding chief of the division of biomedical informatics and associate dean for informatics at the University of California at San Diego (UCSD) and principal investigator for iDash (integrating Data for Analysis, anonymization, and SHaring), a National Center for Biomedical Computing.

All of this unease—over the privacy rights of individuals, over potential discrimination, and over the chilling effect that privacy concerns can have on research—has prompted a great deal of innovation amongst the mathematicians, cryptographers, and computer scientists who are working to develop mechanisms that will allow researchers to analyze biomedical data without compromising privacy.

Data-Driven Privacy Measures

Staal Vinterbo, PhD, a computer scientist in the division of biomedical informatics at UCSD, distinguishes between two broad classes of privacy-protecting mechanisms currently under development: "data-driven" mechanisms that define privacy in terms of the data itself, and "process-driven" mechanisms that define privacy in terms of how they access the data. Both are intended to let researchers perform meaningful analyses without disclosing sensitive personal information, but they stem from very different concepts of data privacy, and they often work via very different means. The most common data-driven approaches, for example, modify raw data so that it can be released without revealing sensitive information, while most process-driven approaches leave the underlying data alone and instead build privacy protections into the algorithms they use to extract it. Data-driven mechanisms came first, but process-driven ones may offer better protection—albeit at a price.

De-identification, or anonymization, is the most commonly employed privacy measure, and one that lies very much on the data-driven side of the divide. Rather than freely sharing all of the data in a group of records, de-identification protocols suppress or modify the bits that might allow an attacker to determine precisely to whom a particular record belongs. Some of these protocols, especially the more experimental ones, can be quite sophisticated. Spectral anonymization, which Vinterbo has investigated with Thomas Lasko, MD, PhD, a researcher at Vanderbilt, manipulates data in mathematically complex ways so that useful correlations can be maintained for research purposes even after the information has been randomized to prevent specific individuals from being identified. Synthetic data generation, which Kifer has explored, creates new data that statistically mimics the real stuff, but shields the actual participants in the original data set. But the anonymization schemes that are most often used to de-identify electronic health records in the real world simply delete or truncate specific data fields containing identifiable information like proper names and ZIP codes.

The advantage of de-identification is that it allows analysts to examine the raw data itself, albeit in altered form, rather than running queries against it from behind a privacy-preserving interface. The disadvantage is that it does not always work.

The first and most widely publicized demonstration of the weaknesses of de-identification occurred in 1997, when Latanya Sweeney, PhD, linked the supposedly de-identified health records released by the Massachusetts state insurance commission to the state's voter-registration rolls and re-identified the personal medical records of then-Governor William Weld. (Sweeney, who was a graduate student at MIT at the time, is now director of the Data Privacy Lab at Harvard University.)

Sweeney's successful re-identification attack helped prompt the adoption of the HIPAA Privacy Rule in 2000. The Privacy Rule imposes restrictions on the release of "individually identifiable health information." These federally legislated constraints on disclosure are waived, however, if the data has been de-identified by applying the so-called "safe harbor" method, which involves removing 18 identifiers, including names, dates, and Social Security numbers. Since data that has been de-identified under the safe harbor method is no longer considered to be individually identifiable, it is no longer covered by the Privacy Rule, and can be freely shared.
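A stripped-down version of that field-deletion approach looks something like the sketch below. The identifier list is abbreviated and hypothetical; the actual safe harbor method specifies 18 categories and additional constraints on dates and small geographic areas.

```python
# Toy illustration of field-suppression de-identification in the spirit of
# the "safe harbor" approach described above. The identifier list here is
# abbreviated and illustrative, not the actual 18-item list.
SUPPRESS = {"name", "ssn", "phone", "email", "medical_record_number"}
ZIP_PREFIX = 3  # keep only the first 3 digits of the ZIP code

def deidentify(record):
    clean = {}
    for field, value in record.items():
        if field in SUPPRESS:
            continue                      # drop direct identifiers outright
        if field == "zip":
            clean[field] = value[:ZIP_PREFIX] + "00"   # coarsen the ZIP code
        elif field == "birth_date":
            clean[field] = value[:4]      # keep the year only
        else:
            clean[field] = value
    return clean

record = {"name": "Jane Doe", "ssn": "123-45-6789", "zip": "02139",
          "birth_date": "1971-07-04", "diagnosis": "type 2 diabetes"}
print(deidentify(record))
# {'zip': '02100', 'birth_date': '1971', 'diagnosis': 'type 2 diabetes'}
```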

Yet there is a growing sense among data privacy experts that no form of de-identification will ever be good enough to meet the highest standards of privacy protection, and that the entire approach will only grow less reliable over time. The reason for this is simple: as more and more data is collected and stored about us all—in online databases and on social networking sites, in publicly available government repositories and elsewhere—it becomes easier and easier to launch the kind of linkage attack that allowed Sweeney to re-identify William Weld's medical records. As the computer scientists Arvind Narayanan, PhD, and Vitaly Shmatikov, PhD, noted in a 2010 article in Communications of the ACM (Association for Computing Machinery), "any attribute can be identifying in combination with others." In other words, no matter how many fields are deleted from an individual's record, as long as there is something left for researchers to work with, there will also be enough left for re-identification.

Moreover, because de-identification techniques have for the most part been designed to protect specific kinds of data from specific kinds of attack, they lack the flexibility needed to deal with a rapidly changing data landscape. The task of anonymization will only become harder, for example, as more and more categories of information, from handwritten clinical notes to genetic sequences, are gathered and stored in digital repositories. The ad hoc nature of such data-driven methods also means that they tend to lack a rigorous mathematical basis; and by their very nature they can only access a very limited amount of data. As a result, it can be difficult to quantify just how much privacy protection they truly offer. Scientists can only estimate their efficacy by trying to break them—thereby proving their limitations, but not their strengths.

This is not to say that de-identification and other data-driven approaches do not have their uses. Malin and his colleagues at Vanderbilt, for example, have for several years used de-identification strategies such as k-anonymization to help protect the privacy of electronic medical records used for research purposes. "We have de-identified more than 1.5 million medical records from the Vanderbilt University Medical Center," he says. K-anonymization works by suppressing or modifying enough data to make a certain number of records (k) in a database appear to be identical. If done appropriately, the data can still be used for research, but the individuals who provided it become lost in a crowd of look-alikes. Several data privacy experts have pointed to flaws in k-anonymization—for example, an attacker who possesses sufficient background knowledge about someone can break k-anonymity and re-identify that individual—but the risks of re-identification remain low. And like any de-identification scheme, k-anonymization allows researchers to examine the raw data.
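The core idea of k-anonymization can be shown with a toy table. The sketch below is not Vanderbilt's pipeline, and the records and generalization ladder are invented; it simply coarsens two quasi-identifiers, ZIP code and age, until every combination appears at least k times.

```python
# Toy sketch of k-anonymization: generalize quasi-identifiers until every
# combination of them appears at least k times, so each person hides among
# k look-alikes. Records and the generalization ladder are invented.
from collections import Counter

def generalize(record, zip_digits, age_bin):
    """Coarsen the quasi-identifiers: truncate ZIP, bin age."""
    return (record["zip"][:zip_digits],
            record["age"] // age_bin * age_bin)

def is_k_anonymous(records, k, zip_digits, age_bin):
    counts = Counter(generalize(r, zip_digits, age_bin) for r in records)
    return all(c >= k for c in counts.values())

records = [
    {"zip": "02139", "age": 34}, {"zip": "02139", "age": 36},
    {"zip": "02142", "age": 31}, {"zip": "02144", "age": 38},
    {"zip": "94305", "age": 52}, {"zip": "94306", "age": 57},
]

# Coarsen step by step until the table is 2-anonymous on (zip, age).
for zip_digits, age_bin in [(5, 1), (3, 5), (3, 10), (2, 10)]:
    if is_k_anonymous(records, k=2, zip_digits=zip_digits, age_bin=age_bin):
        print(f"2-anonymous with ZIP prefix {zip_digits}, age bin {age_bin}")
        break
```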

Process-Driven Privacy Measures

Still, many scientists have begun moving away from data-driven approaches and toward more mathematically rigorous process-driven ones. "Eventually, all of us realized that this is just a never-ending cycle: find a way of perturbing the data, find weaknesses, try to fix them, find more weaknesses, try to fix them," says Kifer. While a graduate student at Cornell University, Kifer helped develop a more robust refinement to k-anonymity known as l-diversity—a refinement that was soon shown to have flaws of its own. Recognizing the apparent limitations of de-identification in general, he and his fellow researchers began seeking a different path: one that involves devising ways of querying statistical databases while providing privacy guarantees that can be expressed as mathematical and statistical statements.

The first and still the most promising of these approaches, differential privacy, was proposed in 2006 by Cynthia Dwork, PhD, and colleagues at Microsoft Research, Ben-Gurion University, and the Weizmann Institute of Science. Dwork recognized the cycle described by Kifer from the history of cryptography. She also knew that modern cryptographers only liberated themselves from that cycle when they developed formal, provable definitions of information security that could be quantified. So Dwork and her collaborators did the same in the realm of privacy, formulating a mathematical definition of the concept that amounts to a promise to the data subject that his life will not, in Dwork's words, "change substantially for the better or the worse as a result of a computation on the data." Precisely how that is achieved is more or less up for grabs; any solution that satisfies the basic definition, which in its true form looks more like a mathematical proof than a verbal guarantee, will necessarily be differentially private.

Scientists like Dwork and Kifer are still working out how to implement differential privacy in the real world, and few applications have moved beyond the lab. In general, however, differential privacy is achieved by writing special algorithms that sit between a statistical database and an analyst who wishes to run queries against it. If an algorithm is differentially private, then the results it produces should be essentially the same independent of whether any single person is included in the database or not. One important consequence of this is that no matter what an analyst knows—no matter what background knowledge they might possess—they still cannot learn anything more about a specific individual just because they happen to be in the database. Conversely, even if an analyst were to know everything about each individual represented in the data except for one, they still should not be able to learn much about that one remaining person. No de-identification scheme can make those kinds of guarantees.

In practice, differential privacy is achieved by introducing some random noise into the query responses. For example, if an analyst were to ask how many people in a database were over 5 feet tall, and the true answer was 56, then a differentially private algorithm might grab a random variable from a probability distribution and add it to the true answer, spitting out 57 instead. The noise is the difference between the response (57) and the true answer (56). "Our choice of randomness," Dwork writes in an e-mail, "makes responses close to the truth much more likely than answers that are far from the truth (which is what we want for accuracy)."
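One standard way to draw that noise is the Laplace mechanism. The sketch below assumes a simple counting query, whose sensitivity is 1, and a privacy parameter epsilon chosen by the data steward; it illustrates the idea in Dwork's example rather than any particular production system.

```python
# Minimal sketch of the Laplace mechanism for a counting query. Smaller
# epsilon means more noise and stronger privacy; sensitivity is 1 because
# adding or removing one person changes a count by at most 1.
import numpy as np

rng = np.random.default_rng()

def private_count(true_count, epsilon, sensitivity=1.0):
    """Return the true count plus Laplace noise scaled to sensitivity/epsilon."""
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_answer = 56    # e.g., people in the database over 5 feet tall
for eps in (1.0, 0.1):
    answers = [private_count(true_answer, eps) for _ in range(5)]
    print(f"epsilon={eps}: {[round(a) for a in answers]}")
```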

Nonetheless, attentive readers will have noticed that differential privacy does in fact work by providing slightly inaccurate results; as Dwork says, it uses probability to introduce "a little bit of uncertainty." This has two significant consequences.

First, in a differentially private setting, an analyst can only see the blurry answers provided by the algorithms; he can never examine the raw data itself. Dwork and several colleagues are currently investigating the possibility of allowing trusted individuals to view the underlying data—a situation Dwork describes as "differential privacy with a human in the loop"—but that is still very much under development. At least for now, researchers who need to see the innards of the data sets they are working with must look elsewhere for privacy protection.

Second, the question of how much noise is enough noise, and how much noise is too much noise, is a rather thorny one. The trick is to add just enough randomness to the query answers to protect the privacy of the individuals whose records lie in the database, but not so much that an analyst can no longer learn accurate or meaningful things from a statistical perspective about the sample population they comprise. This is the price of privacy, or the trade-off between privacy and utility; and it may be the most serious challenge facing those who are trying to bring differential privacy out of the lab. "We can always find wildly inaccurate ways of computing something that ensures a given level of privacy," says Dwork. The science lies in finding ways of ensuring privacy that do not destroy utility.

As it turns out, some queries, and some databases, are more "sensitive" than others, meaning that they are more prone to leak information. As a result, they require more noise. According to Vinterbo, more fine-grained information also requires more noise—a situation that may prove challenging in the case of genomic databases, which contain enormous amounts of incredibly detailed data. Since you wouldn't want to add more noise than is absolutely necessary, matching the appropriate amount of noise to the sensitivity of the query and of the database—in effect, figuring out how to balance privacy against utility—is crucial. And as Kamalika Chaudhuri, PhD, an expert on machine learning at UCSD, says, it also turns out to be "fairly technical and complicated."

Which is not to say that it can't be done. A number of researchers are investigating ways of relaxing differential privacy so that it still offers strong privacy protection without requiring excessive amounts of noise, while others are trying to find novel methods of adding noise that won't degrade accuracy.

Chaudhuri, for example, is interested in algorithms called "classifiers" that can trawl through large collections of medical records in order to predict things like whether a particular individual might require hospitalization. Classifiers must be trained on standard data sets, however, and the training process can leak sensitive information about the training samples. A differentially private approach would typically involve adding a bit of noise to the results coming out of the classifier—a technique known as "output perturbation." This protects privacy, but also makes the classifier more error-prone. Chaudhuri has figured out a way to insert the noise earlier in the process, injecting it into the classifier itself—a technique she calls "objective perturbation." The latter still ensures differential privacy, but the results are more accurate.
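In rough outline, output perturbation looks like the sketch below: train a classifier, then add noise to its learned weights before releasing them. The training data is synthetic, and the noise scale is a placeholder; in Chaudhuri-style analyses the scale is derived from the learner's regularization strength, the sample size, and the privacy parameter, and the noise has a specific multivariate form.

```python
# Hedged sketch of "output perturbation": train a classifier, then perturb
# its weights before release. Data is synthetic; the noise scale is a
# placeholder standing in for the true sensitivity analysis.
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: two features per "patient", binary hospitalization label.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(float)

def train_logistic(X, y, lam=0.1, lr=0.1, steps=500):
    """Plain L2-regularized logistic regression via gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        grad = X.T @ (p - y) / len(y) + lam * w
        w -= lr * grad
    return w

lam, epsilon = 0.1, 1.0
w = train_logistic(X, y, lam=lam)

# Output perturbation: release noisy weights instead of the exact ones.
noise_scale = 2.0 / (len(y) * lam * epsilon)   # placeholder, not the exact bound
w_private = w + rng.laplace(0, noise_scale, size=w.shape)

print("exact weights:   ", np.round(w, 3))
print("released weights:", np.round(w_private, 3))
```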

Efforts like these bode well for the adoption of differential privacy in the coming years. But even its supporters agree that differential privacy alone cannot be counted upon to solve the privacy problem once and for all. "There is no single solution that will suit every possible scenario," says Vinterbo. In a recent paper, Kifer pointed toward a few specific weaknesses of differential privacy, most notably some limitations on its ability to protect privacy in social networks and in circumstances where some statistics have already been released into the wild. "Differential privacy works," Kifer says. "But nothing works all the time."

Finding Integrated Solutions

As a result, many experts are beginning to envision a more integrative and contextual approach to biomedical data privacy—one that would offer a menu of technical solutions backed up by policy measures, the precise mixture of which would depend on the nature of the data, the needs of the researchers, and the concerns of the data subjects themselves.

Haixu Tang, PhD, and XiaoFeng Wang, PhD, at the Indiana University Bloomington School of Informatics and Computing, advocate for what they call a "hierarchical method of data release" that would consider the kind of analysis researchers wish to perform on a particular data set, the level of privacy risk involved, and the degree of utility required before deciding on a particular privacy mechanism. (The two recently won the 2011 Award for Outstanding Research in Privacy Enhancing Technologies for their work demonstrating that individuals could be identified in a GWAS even when the precision of the published statistics was low and some of the data were missing. They are currently investigating ways of introducing minuscule amounts of noise in order to guard against such attacks without sacrificing utility.)

Vinterbo, for his part, thinks that a comprehensive solution to the privacy problem will require a "trust infrastructure" that includes not only technical solutions, but also "legal frameworks that efficiently combine technology and law." "The needs that will be met by technical measures alone are a minority," he says.

Similarly, Malin would like to see a holistic, risk-based approach that draws on the pooled expertise of technologists, legal experts, and ethics review boards, all of whom would have a say in determining how best to safeguard privacy in a particular context—whether that meant implementing the most rigorous technical scheme possible, or applying something less formal and backing it up with carefully crafted use agreements and legal sanctions. Only then, he believes, will the biomedical community have the kind of flexible, nuanced tools needed to address the challenges of protecting its data.

"We have developed great technical solutions, and more are coming down the pipeline," says Malin, echoing Kifer's prediction of a golden age. "But we have to keep the bigger picture in mind." !!

Under the Hood: Calculating Statistics of Arm Movements

BY OREN FREIFELD, JUSTIN FOSTER AND PAUL NUYUJUKIAN

Suppose 20 friends live in the same city and want to meet for dinner. They should be able to identify a unique spot that minimizes the squared distance everyone needs to travel by taking the arithmetic mean of each starting location. However, if these 20 friends were spread across the world rather than in one city, the mean of all the starting locations would be in the interior of the Earth! In the latter example, where we cannot approximate the friends' locations in a 2-D plane, it is necessary to impose a geometry constraint: the meeting spot must be on the 2-D surface of Earth, a subset of the 3-D world. Since the arithmetic mean does not incorporate geometric constraints into its calculation, it yields a nonsensical answer.

Geometric constraints influence even very simple calculations and arise in many contexts. In a biomechanics lab, researchers take measurements of limb movements during reaching and walking. In the same way that individuals cannot have arbitrary locations in the 3-D world (lest they be found inside Earth), limbs cannot have arbitrary positions. Limbs have bones that have fixed sizes and joints that can only rotate in certain directions, creating the "space of rotations." Think of this space as a low-dimensional surface embedded in a higher-dimension Euclidean space, similar to how the 2-D surface of Earth is embedded in a 3-D world. Classical statistical computations don't make sense in this situation. However, having a mathematical framework that can calculate quantities similar to the arithmetic mean under these inherent geometric constraints would be very useful.

One solution is to choose a valid meeting spot or arm pose that also minimizes the distance to the calculated arithmetic mean. This point, called the exterior mean, is not necessarily unique, but is within the set of points that satisfy the geometric constraints and are closest to the arithmetic mean. For instance, if the average meeting spot were the exact center of the earth, then Rome, San Francisco, and Tokyo would all be exterior means. In order to calculate the exterior mean, we first can find the mean of all locations and then project it onto the set of all points that meet the geometric constraints. Depending on the details of the geometric constraints, the exterior mean might be found analytically or through iterative methods.
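On the unit sphere, a stand-in for the Earth's surface, the projection step is just normalization, which makes the exterior mean easy to sketch; other constraint surfaces may need an iterative projection.

```python
# Exterior mean on the unit sphere: average the points in ordinary 3-D space,
# then project the result back onto the sphere by normalizing it. (If the
# mean lands exactly at the origin, the projection, and hence the exterior
# mean, is not unique, as in the Rome/San Francisco/Tokyo example.)
import numpy as np

def exterior_mean_sphere(points):
    """points: (n, 3) array of unit vectors; returns a unit vector."""
    m = points.mean(axis=0)            # arithmetic mean, generally inside the sphere
    return m / np.linalg.norm(m)

pts = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(exterior_mean_sphere(pts))       # approximately [0.577, 0.577, 0.577]
```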

One issue with the exterior mean is that it ignores the route taken from the starting points, so it could produce a non-optimal solution. In our earlier example, the distance a friend travels to the exterior mean location may turn out to be shortest only if he walks through the center of Earth, since the calculation doesn't account for the path the friend takes to get there.

Measurements of distances along paths as dictated by the constrained geometry of a surface are called geodesic distances, and a better analog of the arithmetic mean should minimize the sum of the squared geodesic distances. This solution is referred to as the interior mean. For our restaurant example, the interior mean would be the location such that the total squared distance each friend travels along the Earth's surface is minimized.

For the arm movement problem, the interior mean would identify an arm orientation that minimizes the average squared distance of the path (measured in the space of rotations—recall our surface analogue) that all other arm poses must go through to get to that pose.

Finding the geodesic distance between two points can be difficult in many geometries. Fortunately for the study of limb movements, which are geometrically constrained to rotations, there exists a simple and elegant iterative algorithm with rapid convergence towards the interior mean. More complicated statistics can be computed similarly based on geodesic distances. !!
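For the sphere, one such iterative scheme alternates between mapping all points into the tangent plane at the current estimate, averaging there, and stepping back onto the surface. The sketch below is a Karcher-mean iteration on the unit sphere rather than on the space of rotations the authors use, but the update is the same idea.

```python
# Hedged sketch of an iterative interior (Karcher/Riemannian) mean on the
# unit sphere, in the spirit of the algorithm mentioned above.
import numpy as np

def log_map(m, p):
    """Tangent vector at m pointing toward p, with length = geodesic distance."""
    cos_t = np.clip(np.dot(m, p), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros(3)
    u = p - cos_t * m
    return theta * u / np.linalg.norm(u)

def exp_map(m, v):
    """Point reached by walking along the geodesic from m in direction v."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return m
    return np.cos(t) * m + np.sin(t) * v / t

def interior_mean_sphere(points, iters=50):
    m = points.mean(axis=0)
    m /= np.linalg.norm(m)                  # start from the exterior mean
    for _ in range(iters):
        v = np.mean([log_map(m, p) for p in points], axis=0)
        m = exp_map(m, v)                   # step toward the tangent-space average
    return m

pts = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(interior_mean_sphere(pts))  # coincides with the exterior mean for this symmetric case
```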


DETAILS

Oren Freifeld, a graduate student in Michael Black's Vision lab at Brown University, is an exchange scholar working in Krishna Shenoy's Neural Prosthetics Systems Laboratory at Stanford University. Paul Nuyujukian and Justin Foster are graduate students in Shenoy's lab. For additional information see Karcher, H., Riemannian center of mass and mollifier smoothing, Communications on Pure and Applied Mathematics, 30:509–541 (1977).


Seeing Science: Busting Assumptions about Rainbows and 3-D Images

BY KATHARINE MILLER

To diagnose heart disease noninvasively, scientists combine 3-D visualizations of the heart and blood vessels (reconstructed from CT scans) with computer simulations of blood flow. Typically, a palette of rainbow colors is used to help identify areas of low shear stress—trouble spots of low friction or stagnant blood flow that weaken vessel walls—a good indicator of disease progression. But 3-D rainbows aren't as useful as our instincts suggest, says Michelle Borkin, a PhD candidate in applied physics at Harvard University.

After observing and interviewing cardiologists, Borkin realized that interacting with and rotating 3-D images took time and sometimes meant interrupting a procedure. And research into the psychology of visualization suggests that humans do not read rainbow colors in an intuitive way.

Inspired by tools she had used to understand the structures of nebulae in outer space, Borkin developed software called HemoVis that visualizes simulated blood flow in two dimensions, splays or "butterflies" vessels open in a tree diagram, and colors the areas of low shear stress in gradations from red to gray. In a test of the software, the 2-D visualizations (compared with 3-D) led to much more accurate and efficient diagnoses by 21 medical students; the same was true for the red/gray palette when compared to rainbow. "At a single glance," Borkin says, "they get a quick and accurate diagnosis." The work was published in IEEE Transactions on Visualization and Computer Graphics.
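The color choice itself is easy to emulate. The sketch below maps a normalized shear-stress value to a red-to-gray gradient, with red marking the low-stress trouble spots; the specific RGB values are invented and are not HemoVis's actual palette.

```python
# Toy red-to-gray color mapping in the spirit described above (not HemoVis's
# actual palette): low shear stress maps to red, higher stress fades to gray.
def stress_to_rgb(stress, low=0.0, high=1.0):
    """Return an (r, g, b) tuple for a shear-stress value."""
    t = min(max((stress - low) / (high - low), 0.0), 1.0)  # 0 = low stress, 1 = high
    red, gray = (214, 39, 40), (150, 150, 150)
    return tuple(round(r + t * (g - r)) for r, g in zip(red, gray))

for s in (0.05, 0.5, 0.95):
    print(s, stress_to_rgb(s))
```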

"This paper shows that making smart choices about how you display your data in dimensionality and color not only can help doctors see the data better and help them make discoveries," Borkin says, "but might also save lives." !!


Figure: An arterial system that would previously have been reconstructed in 3-D (left) is instead deconstructed into 2-D and shown at right with each branch separated from the main vessel. Branching points and relationships between branches are also displayed. Areas shaded red represent diseased areas as indicated by low shear stress measured in computer simulations of blood flow. Courtesy of Michelle Borkin.