Health Sciences Informatics Research-in-Progress Seminar. The Johns Hopkins Medical School, Meyer Building B-105, 11:00 A.M. March 7, 2003. Jules J. Berman, Ph.D., M.D., Program Director, Pathology Informatics, Cancer Diagnosis Program, DCTD, NCI, NIH, EPN - Room 6028, 6130 Executive Blvd., Rockville, MD 20892. email: [email protected] voice: 301-496-7147

Page 1: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Health Sciences Informatics Research-in-Progress Seminar.

The Johns Hopkins Medical School
Meyer Building B-105

11:00 A.M. March 7, 2003

Jules J. Berman, Ph.D., M.D.
Program Director, Pathology Informatics
Cancer Diagnosis Program, DCTD, NCI, NIH
EPN - Room 6028
6130 Executive Blvd.
Rockville, MD 20892
email: [email protected]
voice: 301-496-7147

Page 2: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

My Background:

15 years as Chief of Surgical Pathology and Cytology at the Baltimore VA Hospital, with occasional adjunct appointments at U of MD and Hopkins (for the Johns Hopkins Autopsy Resource project)

Last 4 years as Program Director for Pathology Informatics in the Cancer Diagnosis Program at NCI

NCI Coordinator for the Shared Pathology Informatics Network

Somewhere in there, learned Perl

Page 3: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

NIH role:

Help move the field of Pathology Informatics forward by developing NCI [funded] initiatives and by identifying and nurturing vital activities in the area.

Page 4: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

1. Acquisition of Data - 49% of my time

2. Organization of Data - 49% of my time

3. Analysis of Data - 2% of my time

Page 5: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

1. Acquisition of Data - Getting people to share, and working within HIPAA and Common Rule Guidelines, meetings

2. Organization of Data - Standards, XML, meta-data, self-describing architectures, more meetings, technical standards committee of API, Tissue Microarray Data Exchange Standard

3. Analysis of Data - ??? - almost irrelevant at this time. People think it’s ok to publish without supporting data.

Page 6: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Data Sharing:

NIH Statement on Data Sharing: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html

National Research Council UPSIDE (Uniform Principle for Sharing Integral Data and Materials Expeditiously): http://books.nap.edu/books/0309088593/html/R1.html

Comment Letter on NIH Data Sharing Proposal: http://www.aamc.org/advocacy/library/research/corres/2002/051102.htm

Page 7: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

UFO Abductees

Lots of them

They often say about the same thing (independent confirmations)

All walks of life

Generally honest

Minority are a little crazy

One problem: no evidence

Page 8: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Researchers who don’t publish their primary data

Lots of them

They often say about the same thing (independent confirmations)

All walks of life

Generally honest

Minority are a little crazy

One problem: no evidence

Page 9: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

After your research data reaches a certain size, the data becomes the publication, and the journal articles become tiny editorials that describe or interpret the data

Think of the relationship between the earth and the sun.

Terra-centrics did not want to think that their planet was not the center of the universe.

But actually, earth is a tiny fraction of the size of the sun, and people eventually switched to a heliocentric vision of reality.

Research papers are mere editorials that revolve around a central large BLOB of data.

The database is the publication. Everything else is peripheral.

Page 10: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

After your research data reaches a certain size, the data becomes the publication, and the journal articles become little editorials interpreting the data

Examples:

Human Genome Project (3 billion bases)

Gene Expression Arrays

Tissue Microarrays (a thousand cores of tissue)

Proteomics

Page 11: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

All those numbers get multiplied when you start thinking about:

Accruing data (annotating databases with the results of experiments)

Merging and linking data

Executing distributed queries over multiple databases

Page 12: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

So what’s stopping us from making incredibly large medical databases?

Human Subject Protection issues

Usually means confidentiality/privacy issues in the context of research using medical records

Human nature

Researcher insecurities

Non-existence of organized data

Page 13: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Two Regulations that tell us how we can use medical records in research:

Common Rule

HIPAA Privacy

Both work on the principle that medical research is good, and it can be conducted without getting patient consent if you can come up with a way to avoid harming patients (no harm, no consent for harm).

Typically, this is done by de-identifying records

Page 14: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

How do you de-identify records?

1. Remove identifiers

2. Ensure non-uniqueness of records

3. Scrub text

Page 15: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Legal Importance of de-identification research

1. Scientific field created in HIPAA

HIPAA asks the community to come up with de-identification standards

2. Civil Rights Office will not be looking for misinterpretation. Will probably only respond to complaints. No pre-screening of methodology by Civil Rights Office.

3. Published research methodology is sure to weigh in if lawsuits ever occur

To a certain extent, what’s de-identified is what scientists promote and accept in published articles (Daubert v. Merrell Dow Pharmaceuticals (1993) interpretation of admissible expert opinion)

Page 16: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

One-way hash method described (currently deprecated under HIPAA)

Techniques I’ve been publishing:

Concept-Match Medical Data Scrubbing (In press, Archives of Pathology)

Threshold Method (published, BMC Methods)

Zero-Check, A Zero-Knowledge HIPAA-compliant Protocol for Reconciling Patient Identities Across Institutions (answer to HIPAA attack on one-way hash methodology)

Page 17: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

One-Way Hash Method for de-identifying

Allows you to get follow-up data on de-identified patients

A one-way hash algorithm computes a fixed-length string from a character string. It is computationally infeasible to determine the original character string by looking at the hash value. The algorithm always gives the same hash value for any given string. Therefore it is typically used as an authenticator [for secret messages].

Joe Smith replaced by one-way hash “ekso583a2ldg”

Page 18: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

One-Way Hash Method for de-identifying

Allows you to get follow-up data on de-identified patients

Joe Smith replaced by one-way hash “ekso583a2ldg”

Joe Smith comes back a year later and his new record is de-identified with the one-way hash string “ekso583a2ldg”

The two de-identified records are merged under the common one-way hash string, “ekso583a2ldg”
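The follow-up step above can be sketched in a few lines of Python. This is only an illustration, not the method from the talk: the choice of SHA-256, the `salt` parameter, the name normalization, and the sample records are all assumptions, and the slide's “ekso583a2ldg” is an illustrative string rather than real digest output. Note that an unsalted hash of a name can be reversed by hashing a list of candidate names, which is one reason a secret salt is assumed here.

```python
import hashlib
from collections import defaultdict

def pseudonym(name: str, salt: str = "site-secret") -> str:
    """Return a fixed-length hex digest that stands in for a patient name.

    The salt value and the SHA-256 algorithm are assumptions for this sketch.
    """
    # Normalize case and spacing so "Joe  smith" and "Joe Smith" hash alike.
    normalized = " ".join(name.lower().split())
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

def merge_by_pseudonym(records):
    """Group (name, finding) pairs under the one-way hash of the name."""
    merged = defaultdict(list)
    for name, finding in records:
        merged[pseudonym(name)].append(finding)
    return dict(merged)

# Hypothetical visits: the same patient returns a year later.
visits = [
    ("Joe Smith", "2002 biopsy: severe dysplasia"),
    ("Ann Jones", "2002 biopsy: simple hyperplasia"),
    ("Joe  smith", "2003 follow-up: no recurrence"),
]

merged = merge_by_pseudonym(visits)
for digest, findings in merged.items():
    print(digest[:12], findings)
```

Both of Joe Smith's records end up under one digest, and the merged table never stores his name.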

Page 19: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Concept-Match algorithm for scrubbing text:

1. Parse all input into sentences.

2. Parse each sentence into words.

3. Each "stop word" (high frequency word) is preserved.

4. Intervening words and phrases are mapped to a standard nomenclature.

5. Each coded term is replaced by an alternate term that maps to the same code.

6. All other words are replaced by a blocking symbol (consisting of three asterisks).
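The six steps above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the published Concept-Match implementation: the stop-word list, the tiny `TERM_TO_CODE` and `CODE_TO_ALTERNATE` tables, the longest-match strategy, and the lowercasing of output are all choices made for this sketch. The codes are the ones that appear in the examples later in this talk; a real run would use a full nomenclature.

```python
import re

# Assumed stop-word list; a real list would be much longer.
STOP_WORDS = {"of", "is", "this", "the", "was", "into", "and", "both", "a"}

# Toy nomenclature: term -> concept code.
TERM_TO_CODE = {
    "diagnosis": "C0348026",
    "severe dysplasia": "C0334048",
    "sickle cell anemia": "C0002895",
    "herrick anemia": "C0002895",
    "patient": "C0030705",
    "today": "C0750526",
}

# Alternate term emitted for each code (step 5).
CODE_TO_ALTERNATE = {
    "C0348026": "diagnosis",
    "C0334048": "severe dysplasia",
    "C0002895": "herrick anemia",
    "C0030705": "patient",
    "C0750526": "today",
}

MAX_TERM_WORDS = max(len(term.split()) for term in TERM_TO_CODE)
BLOCK = "***"

def scrub_sentence(sentence):
    """Scrub one sentence: keep stop words, code known terms, block the rest."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    out = []
    i = 0
    while i < len(words):
        if words[i] in STOP_WORDS:
            out.append(words[i])
            i += 1
            continue
        # Try the longest run of non-stop words that matches a coded term.
        for n in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            run = words[i:i + n]
            if any(w in STOP_WORDS for w in run):
                continue
            code = TERM_TO_CODE.get(" ".join(run))
            if code:
                out.append(f"({CODE_TO_ALTERNATE[code]}={code})")
                i += n
                break
        else:
            # No term matched at this position: block the single word.
            out.append(BLOCK)
            i += 1
    return " ".join(out)

def scrub(text):
    """Split text into sentences (step 1) and scrub each one."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    return [scrub_sentence(s) for s in sentences]

print(scrub("Dr. Atkinson killed his patient today. Is this malpractice?"))
```

Because anything outside the nomenclature and the stop-word list is blocked, names and misspellings are removed by default rather than by recognition.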

Page 20: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Examples from Hopkins Pathology Phrase list:

Diagnosis of severe dysplasia => (Diagnosi=C0348026) of (severe dysplasia=C0334048)

Diagnosis of sickle => (Diagnosi=C0348026) of ***

Diagnosis of sickle cell anemia => (Diagnosi=C0348026) of (herrick anemia=C0002895)

Diagnosis of simple hyperplasia => (Diagnosi=C0348026) of (simple=C0205352) (hypercellularity=C0020507)

Diagnosis of sjogren => (Diagnosi=C0348026) of (sjogren disease=C0037230)

Page 21: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

1. Dr. Atkinson killed his patient today. => *** *** *** *** (patient=C0030705) (today=C0750526)

2. Is this malpractice? => Is this ***

3. Senator garfield was admitted today into the psychiatric unit. => *** *** was *** (today=C0750526) into the (psychiatric behavioral=C0205487) (unit=C0439148).

4. Snetor garfield was admitted today into the psyciatric unit. => *** *** was *** (today=C0750526) into the *** (unit=C0439148)

5. Dr. truelove's diagnosis is both incorrect and incompetent. => *** *** (diagnosi=C0348026) is both *** and ***

6. The patient's social security number is 523845 => The *** *** *** *** is ***

Page 22: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Threshold algorithm

A familiar plot device.

Page 23: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

“they suggested that the manifestations were as severe in the mother as in the sons and that this suggested autosomal dominant inheritance.”

Bob’s Piece 1.

684327ec3b2f020aa3099edb177d3794 => suggested autosomal dominant inheritance
3c188dace2e7977fd6333e4d8010e181 => mother
8c81b4aaf9c2009666d532da3b19d5f8 => manifestations
db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested
e183376eb9cc9a301952c05b5e4e84e3 => sons
22cf107be97ab08b33a62db68b4a390d => severe

Bob’s Piece 2.

they db277da2e82a4cb7e9b37c8b0c7f66f0 that the 8c81b4aaf9c2009666d532da3b19d5f8 were as 22cf107be97ab08b33a62db68b4a390d in the 3c188dace2e7977fd6333e4d8010e181 as in the e183376eb9cc9a301952c05b5e4e84e3 and that this 684327ec3b2f020aa3099edb177d3794.
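A rough Python sketch of how a sentence might be split into these two pieces is given here. Several things are assumptions: the 32-hex-digit values on the slide look like MD5 digests, so `hashlib.md5` is used, but the exact hash function and phrase normalization are not stated in the talk; the stop-word list is only large enough for this sentence; and punctuation is simply dropped.

```python
import hashlib
import re

# Assumed stop words, just enough for the example sentence.
STOP_WORDS = {"they", "that", "the", "were", "as", "in", "and", "this"}

def phrase_hash(phrase):
    """One-way hash of a phrase (MD5 is an assumption of this sketch)."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()

def split_into_pieces(text):
    """Return (piece1, piece2) for the given text.

    piece1 maps each hash to its phrase, sorted by hash so that it
    carries no order or frequency information from the text.
    piece2 is the text with every phrase replaced by its hash.
    """
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    piece1 = {}
    piece2 = []
    phrase = []

    def flush():
        # Close the phrase collected since the last stop word.
        if phrase:
            joined = " ".join(phrase)
            digest = phrase_hash(joined)
            piece1[digest] = joined
            piece2.append(digest)
            phrase.clear()

    for word in words:
        if word in STOP_WORDS:
            flush()
            piece2.append(word)
        else:
            phrase.append(word)
    flush()
    return dict(sorted(piece1.items())), piece2

SENTENCE = ("they suggested that the manifestations were as severe in the "
            "mother as in the sons and that this suggested autosomal "
            "dominant inheritance.")

piece1, piece2 = split_into_pieces(SENTENCE)
for digest, text in piece1.items():
    print(digest, "=>", text)
print(" ".join(piece2))
```

Stop words act as the phrase delimiters, so the sentence yields six phrases for Piece 1, while Piece 2 keeps only stop words and hash values in their original order.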

Page 24: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Properties of Piece 1 (the listing of phrases and their one-way hashes)

1. Contains no information on the frequency of occurrence of the phrases found in the original text (because recurring phrases map to the same hash code and appear as a single entry in Piece 1).

2. Contains no information that Alice can use to connect any patient to any particular patient record. Records do not exist as entities in Piece 1.

3. Contains no information on the order or locations of the phrases found in the original text.

4. Contains all the concepts found in the original text. Stop words are a popular method of parsing text into concepts [4,5].

5. Bob can destroy Piece 1 and re-create it later from the original file, using the same threshold algorithm.

6. Alice can use the phrases in Piece 1 to transform, annotate or search the concepts found in the original file.

7. Alice can transfer Piece 1 to a third party without violating HIPAA privacy rules or Common Rule human subject regulations (in the U.S.). For that matter, Alice can keep Piece 1 and add it to her database of Piece 1 files collected from all of her clients.

8. Piece 1 is not necessarily unique. Different original files may yield the same Piece 1 (if they’re composed of the same phrases). Therefore Piece 1 cannot be used to authenticate the original file used to produce Piece 1.

Page 25: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Properties of Piece 2

1. Contains no information that can be used to connect any patient to any particular patient record.

2. Contains nothing but hash values of phrases and stop words, in their correct order of occurrence in the original text.

3. Anyone obtaining Piece 1 and Piece 2 can reconstruct the original text.

4. The original text can be reconstructed from Piece 2, and any file into which Piece 1 has been merged. There is no necessity to preserve Piece 1 in its original form.

5. Bob can lose or destroy Piece 2, and re-create it later from the original file, using the same threshold algorithm.
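The reconstruction properties can be illustrated with a short sketch. It assumes Piece 1 is held as a mapping from hash value to phrase and Piece 2 as a list of tokens; those data structures, the use of MD5, and the sample phrases in the merged table are assumptions for illustration only.

```python
import hashlib

def phrase_hash(phrase):
    """One-way hash of a phrase (MD5 is an assumption of this sketch)."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()

def reconstruct(piece1, piece2):
    """Rebuild text by swapping each hash in Piece 2 for its phrase.

    Tokens with no entry in piece1 (the stop words) pass through unchanged.
    """
    return " ".join(piece1.get(token, token) for token in piece2)

# Hand-built pieces for a short fragment of the example sentence.
phrases = ["manifestations", "severe", "mother"]
piece1 = {phrase_hash(p): p for p in phrases}
piece2 = ["the", phrase_hash("manifestations"), "were", "as",
          phrase_hash("severe"), "in", "the", phrase_hash("mother")]

print(reconstruct(piece1, piece2))

# Piece 1 merged into a larger table (hypothetical extra phrases)
# still supports reconstruction.
merged = {phrase_hash(p): p for p in ["sons", "sickle cell anemia"]}
merged.update(piece1)
print(reconstruct(merged, piece2))
```

Since the lookup only needs the hash-to-phrase pairs, any larger table that contains them works as well as the original Piece 1.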

Page 26: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

Bob prepares threshold Pieces 1 and 2 and sends Piece 1 to Alice. Alice may require Bob to prove the authenticity of Piece 1, but Bob has no reason to care if Piece 1 is intercepted by an unauthorized party. Alice uses her software (which may be secret, or it may require computational facilities that Bob doesn't have, or it may require large databases that Bob doesn't have), to transform or annotate each phrase from Piece 1. The transformation product for each phrase can be almost anything that Bob considers valuable (e.g., a UMLS code, a genome database link, an image file URL, or a tissue sample location). For each phrase in Piece 1, Alice substitutes the transformed text (or simply appends it), keeping it alongside the original one-way hash value associated with the phrase.
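Alice's annotation step might look like the sketch below. Everything here is hypothetical: the `annotations` lookup stands in for whatever software or database Alice uses, the bracketed append format is one arbitrary choice, and the hash keys are shortened placeholders rather than real digests. The code C0334048 is taken from the phrase examples earlier in the talk.

```python
def annotate_piece1(piece1, annotations):
    """Append Alice's annotation to each phrase, keyed by the same hash."""
    annotated = {}
    for digest, phrase in piece1.items():
        note = annotations.get(phrase)
        # Phrases Alice cannot annotate are returned unchanged.
        annotated[digest] = f"{phrase} [{note}]" if note else phrase
    return annotated

# Placeholder hash keys stand in for real one-way hash values.
piece1 = {"hash-a": "severe dysplasia", "hash-b": "mother"}

# Hypothetical lookup table held only by Alice.
annotations = {"severe dysplasia": "C0334048"}

annotated = annotate_piece1(piece1, annotations)
print(annotated)
```

Because each annotated phrase stays under its original hash, Bob can substitute the annotated entries into Piece 2 exactly as he would the plain ones.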

Page 27: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

The original text has been converted into two pieces, neither of which contains any identifying information. There is sufficient information in Piece 1 for Alice to annotate the text and return it to Bob (annotated Piece 1). Bob can reconstruct his original text, including Alice’s annotations, thus adding value to his original data, without breaching patient confidentiality. Bob can pay Alice for her services. Alice can keep Piece 1 and use it for her own purposes. Alice can make a large database consisting of all the Piece 1 files she receives from all of her customers. Alice’s aggregated Piece 1 database can be used by owners of Piece 2 files to reconstruct their original files (along with Alice’s value-added annotations). Alice can sell Piece 1 to a third party, if she wishes. Alice can continually update or otherwise enhance her annotations on Piece 1 and sell the updated versions to Bob and others.

Page 28: Health Sciences Informatics Research-in-Progress Seminar.  The Johns Hopkins Medical School

end