Health Sciences Informatics Research-in-Progress Seminar. The Johns Hopkins Medical School

The Johns Hopkins Medical School, Meyer Building B-105

11:00 A.M. March 7, 2003

Jules J. Berman, Ph.D., M.D.
Program Director, Pathology Informatics
Cancer Diagnosis Program, DCTD, NCI, NIH
EPN - Room 6028
6130 Executive Blvd.
Rockville, MD 20892
email: [email protected]
301-496-7147

My Background:

15 years as Chief of Surgical Pathology and Cytology at the Baltimore VA Hospital (with occasional adjunct appointments at U of MD and at Hopkins, for the Johns Hopkins Autopsy Resource project)

Last 4 years as Program Director for Pathology Informatics in the Cancer Diagnosis Program at NCI

NCI Coordinator for the Shared Pathology Informatics Network

Somewhere in there, learned Perl

NIH role:

Help move the field of Pathology Informatics forward by developing NCI [funded] initiatives and by identifying and nurturing vital activities in the area.

1. Acquisition of Data - 49% of my time

2. Organization of Data - 49% of my time

3. Analysis of Data - 2% of time

1. Acquisition of Data - Getting people to share, and working within HIPAA and Common Rule Guidelines, meetings

2. Organization of Data - Standards, XML, meta-data, self-describing architectures, more meetings, technical standards committee of API, Tissue Microarray Data Exchange Standard

3. Analysis of Data - ??? - almost irrelevant at this time. People think it’s ok to publish without supporting data.

Data Sharing:

NIH Statement on Data Sharing: http://grants.nih.gov/grants/guide/notice-files/NOT-OD-03-032.html

National Research Council UPSIDE (Uniform Principle for Sharing Integral Data and Materials Expeditiously): http://books.nap.edu/books/0309088593/html/R1.html

Comment Letter on NIH Data Sharing Proposal: http://www.aamc.org/advocacy/library/research/corres/2002/051102.htm

UFO Abductees

Lots of them

They often say about the same thing (independent confirmations)

All walks of life

Generally honest

Minority are a little crazy

One problem: no evidence

Researchers who don’t publish their primary data

Lots of them

They often say about the same thing (independent confirmations)

All walks of life

Generally honest

Minority are a little crazy

One problem: no evidence

After your research data reaches a certain size, the data becomes the publication, and the journal articles become tiny editorials that describe or interpret the data

Think of the relationship between the earth and the sun.

Terra-centrics did not want to think that their planet was not the center of the universe.

But actually, earth is a tiny fraction of the size of the sun, and people eventually switched to a heliocentric vision of reality.

Research papers are mere editorials that revolve around a central large BLOB of data.

The database is the publication. Everything else is peripheral.

After your research data reaches a certain size, the data becomes the publication, and the journal articles become little editorials interpreting the data

Examples:

Human Genome Project (3 billion bases)

Gene Expression Arrays

Tissue Microarrays (a thousand cores of tissue)

Proteomics

All those numbers get multiplied when you start thinking about:

Accruing data (annotating databases with the results of experiments)

Merging and linking data

Executing distributed queries over multiple databases

So what’s stopping us from making incredibly large medical databases?

Human Subject Protection issues

Usually means confidentiality/privacy issues in the context of research using medical records

Human nature

Researcher insecurities

Non-existence of organized data

Two Regulations that tell us how we can use medical records in research:

Common Rule

HIPAA Privacy

Both work on the principle that medical research is good, and it can be conducted without getting patient consent if you can come up with a way to avoid harming patients (no harm, no consent for harm).

Typically, this is done by de-identifying records

How do you de-identify records?

1. Remove identifiers

2. Ensure non-uniqueness of records

3. Scrub text
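
The first two steps above can be sketched in a few lines of Python. This is only an illustration under assumed names: the field names, the choice of which fields count as identifiers, and the minimum group size are all hypothetical and are not taken from HIPAA or from the talk. (Text scrubbing, step 3, is a separate problem.)

```python
from collections import Counter

# Hypothetical field names; a real record layout will differ.
DIRECT_IDENTIFIERS = {"name", "ssn", "address"}
QUASI_IDENTIFIERS = ("age", "zip3", "sex")


def remove_identifiers(record):
    """Step 1: drop the fields that directly identify a patient."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}


def unique_records(records, min_group_size=2):
    """Step 2: return the records whose combination of quasi-identifiers
    is shared by fewer than min_group_size records. These would need to
    be generalized or withheld before release."""
    def key(r):
        return tuple(r.get(f) for f in QUASI_IDENTIFIERS)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < min_group_size]


records = [
    {"name": "A", "ssn": "1", "age": 70, "zip3": "212", "sex": "M", "dx": "adenoma"},
    {"name": "B", "ssn": "2", "age": 70, "zip3": "212", "sex": "M", "dx": "dysplasia"},
    {"name": "C", "ssn": "3", "age": 34, "zip3": "208", "sex": "F", "dx": "hyperplasia"},
]

cleaned = [remove_identifiers(r) for r in records]
flagged = unique_records(cleaned)
print(flagged)  # the one record that is still unique in this toy set
```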

Legal Importance of de-identification research

1. Scientific field created in HIPAA

HIPAA asks the community to come up with de-identification standards

2. Civil Rights Office will not be looking for misinterpretation. Will probably only respond to complaints. No pre-screening of methodology by Civil Rights Office.

3. Published research methodology is sure to weigh in if a lawsuit ever occurs

To a certain extent, what’s de-identified is what scientists promote and accept in published articles (Daubert v. Merrell Dow Pharmaceuticals (1993) interpretation of admissible expert opinion)

One-way hash method described (currently deprecated under HIPAA)

Techniques I’ve been publishing:

Concept-Match Medical Data Scrubbing (In press, Archives of Pathology)

Threshold Method (published, BMC Methods)

Zero-Check, A Zero-Knowledge HIPAA-compliant Protocol for Reconciling Patient Identities Across Institutions (answer to HIPAA attack on one-way hash methodology)

One-Way Hash Method for de-identifying

Allows you to get follow-up data on de-identified patients

A one-way hash algorithm computes a fixed-length string from a character string. It is computationally infeasible to determine the original character string from the hash value. The algorithm always gives the same hash value for any given string. Therefore it is typically used as an authenticator [for secret messages].

Joe Smith replaced by one-way hash “ekso583a2ldg”

Joe Smith comes back a year later and his new record is de-identified with the same one-way hash string, “ekso583a2ldg”

The two de-identified records are merged under the common one-way hash string, “ekso583a2ldg”
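
A minimal Python sketch of this follow-up merge is shown here. It assumes SHA-256 from the standard hashlib module as the one-way hash (the slides do not name a hash function, and the string “ekso583a2ldg” is only illustrative), and the record fields and the patient_key name are hypothetical.

```python
import hashlib
from collections import defaultdict


def one_way_hash(identifier):
    """Return a fixed-length hex digest for an identifying string.
    SHA-256 is an assumption; any one-way hash would serve the sketch."""
    return hashlib.sha256(identifier.encode("utf-8")).hexdigest()


def deidentify(record):
    """Replace the patient name with its one-way hash."""
    scrubbed = {k: v for k, v in record.items() if k != "name"}
    scrubbed["patient_key"] = one_way_hash(record["name"])
    return scrubbed


# Hypothetical visits; Joe Smith returns a year later.
visits = [
    {"name": "Joe Smith", "year": 2002, "finding": "severe dysplasia"},
    {"name": "Joe Smith", "year": 2003, "finding": "simple hyperplasia"},
    {"name": "Ann Jones", "year": 2003, "finding": "sickle cell anemia"},
]

# Records with the same hash string are merged under that string.
merged = defaultdict(list)
for visit in visits:
    clean = deidentify(visit)
    merged[clean["patient_key"]].append(clean)

for key, recs in merged.items():
    print(key[:12], [r["year"] for r in recs])
```

Because the same input always yields the same digest, both of Joe Smith's records land under one key without his name ever being stored.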

Concept-Match algorithm for scrubbing text:

1. Parse all input into sentences.

2. Parse each sentence into words.

3. Each "stop word" (high frequency word) is preserved.

4. Intervening words and phrases are mapped to a standard nomenclature.

5. Each coded term is replaced by an alternate term that maps to the same code.

6. All other words are replaced by a blocking symbol (consisting of three asterisks).
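
The six steps can be sketched in Python as follows. This is not the published implementation: the stop-word list and the four-entry nomenclature are toy assumptions (the codes are copied from the slide examples), phrase matching is a simple longest-match-first lookup, the sentence splitter is naive about abbreviations such as "Dr.", and no word stemming is done.

```python
import re

# Assumed toy stop-word list; a real list would be much longer.
STOP_WORDS = {"of", "the", "is", "this", "was", "into", "both", "and"}

# Toy nomenclature: term -> concept code.
TERM_TO_CODE = {
    "diagnosis": "C0348026",
    "severe dysplasia": "C0334048",
    "sickle cell anemia": "C0002895",
    "herrick anemia": "C0002895",
}


def alternate_term(term):
    """Step 5: pick a different term with the same code, if one exists."""
    code = TERM_TO_CODE[term]
    synonyms = sorted(t for t, c in TERM_TO_CODE.items() if c == code and t != term)
    return synonyms[0] if synonyms else term


def scrub_run(words):
    """Steps 4-6 for one run of words lying between stop words."""
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest phrase first
            phrase = " ".join(words[i:j])
            if phrase in TERM_TO_CODE:
                out.append(f"({alternate_term(phrase)}={TERM_TO_CODE[phrase]})")
                i = j
                break
        else:
            out.append("***")  # step 6: block every unmatched word
            i += 1
    return out


def scrub_sentence(sentence):
    """Steps 2-3: split into words and keep stop words as they are."""
    out, run = [], []
    for word in re.findall(r"[A-Za-z0-9']+", sentence):
        if word.lower() in STOP_WORDS:
            out.extend(scrub_run(run))
            run = []
            out.append(word)
        else:
            run.append(word.lower())
    out.extend(scrub_run(run))
    return " ".join(out)


def scrub(text):
    """Step 1: split the input into sentences, then scrub each one."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    return [scrub_sentence(s) for s in sentences]


for line in scrub("Diagnosis of sickle cell anemia. Diagnosis of sickle. Is this malpractice?"):
    print(line)
```

Anything not found in the nomenclature, including names, numbers and misspellings, falls through to the blocking symbol, which is why the method fails safe.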

Examples from Hopkins Pathology Phrase list:

Diagnosis of severe dysplasia => (Diagnosi=C0348026) of (severe dysplasia=C0334048)

Diagnosis of sickle => (Diagnosi=C0348026) of ***

Diagnosis of sickle cell anemia => (Diagnosi=C0348026) of (herrick anemia=C0002895)

Diagnosis of simple hyperplasia => (Diagnosi=C0348026) of (simple=C0205352) (hypercellularity=C0020507)

Diagnosis of sjogren => (Diagnosi=C0348026) of (sjogren disease=C0037230)

1. Dr. Atkinson killed his patient today. => *** *** *** *** (patient=C0030705) (today=C0750526)

2. Is this malpractice? => Is this ***

3. Senator garfield was admitted today into the psychiatric unit. => *** *** was *** (today=C0750526) into the (psychiatric behavioral=C0205487) (unit=C0439148).

4. Snetor garfield was admitted today into the psyciatric unit. => *** *** was *** (today=C0750526) into the *** (unit=C0439148)

5. Dr. truelove's diagnosis is both incorrect and incompetent. => *** *** (diagnosi=C0348026) is both *** and ***

6. The patient's social security number is 523845 => The *** *** *** *** is ***

Threshold algorithm

A familiar plot device.

“they suggested that the manifestations were as severe in the mother as in the sons and that this suggested autosomal dominant inheritance.”

Bob’s Piece 1.

684327ec3b2f020aa3099edb177d3794 => suggested autosomal dominant inheritance
3c188dace2e7977fd6333e4d8010e181 => mother
8c81b4aaf9c2009666d532da3b19d5f8 => manifestations
db277da2e82a4cb7e9b37c8b0c7f66f0 => suggested
e183376eb9cc9a301952c05b5e4e84e3 => sons
22cf107be97ab08b33a62db68b4a390d => severe

Bob’s Piece 2.

they db277da2e82a4cb7e9b37c8b0c7f66f0 that the 8c81b4aaf9c2009666d532da3b19d5f8 were as 22cf107be97ab08b33a62db68b4a390d in the 3c188dace2e7977fd6333e4d8010e181 as in the e183376eb9cc9a301952c05b5e4e84e3 and that this 684327ec3b2f020aa3099edb177d3794.
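
A short Python sketch of how Bob might produce the two pieces from the sentence above. The stop-word list is an assumption chosen to fit this one sentence, and MD5 is assumed only because the hash values on the slide are 32 hex characters long; the slides do not say which hash function was used, so the digests printed here are not claimed to match them.

```python
import hashlib
import re

# Assumed stop words, chosen to fit the example sentence only.
STOP_WORDS = {"they", "that", "the", "were", "as", "in", "and", "this"}


def phrase_hash(phrase):
    """One-way hash of a phrase (MD5 is an assumption, see above)."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


def threshold_split(text):
    """Split text into Piece 1 (hash/phrase listing) and Piece 2
    (the text with every phrase replaced by its hash)."""
    piece1 = {}   # hash -> phrase; a recurring phrase appears only once
    piece2 = []   # stop words and hashes, in original order
    run = []      # words of the phrase currently being collected

    def flush():
        if run:
            phrase = " ".join(run)
            digest = phrase_hash(phrase)
            piece1[digest] = phrase
            piece2.append(digest)
            run.clear()

    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            flush()
            piece2.append(word)
        else:
            run.append(word)
    flush()

    # Sort Piece 1 by hash so it keeps no trace of phrase order.
    return sorted(piece1.items()), " ".join(piece2)


text = ("they suggested that the manifestations were as severe in the mother "
        "as in the sons and that this suggested autosomal dominant inheritance.")
piece1, piece2 = threshold_split(text)
for digest, phrase in piece1:
    print(digest, "=>", phrase)
print(piece2)
```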

Properties of Piece 1 (the listing of phrases and their one-way hashes)

1. Contains no information on the frequency of occurrence of the phrases found in the original text (because recurring phrases map to the same hash code and appear as a single entry in Piece 1).

2. Contains no information that Alice can use to connect any patient to any particular patient record. Records do not exist as entities in Piece 1.

3. Contains no information on the order or locations of the phrases found in the original text.

4. Contains all the concepts found in the original text. Stop words are a popular method of parsing text into concepts [4,5].

5. Bob can destroy Piece 1 and re-create it later from the original file, using the same threshold algorithm.

6. Alice can use the phrases in Piece 1 to transform, annotate or search the concepts found in the original file.

7. Alice can transfer Piece 1 to a third party without violating HIPAA privacy rules or Common Rule human subject regulations (in the U.S.). For that matter, Alice can keep Piece 1 and add it to her database of Piece 1 files collected from all of her clients.

8. Piece 1 is not necessarily unique. Different original files may yield the same Piece 1 (if they’re composed of the same phrases). Therefore Piece 1 cannot be used to authenticate the original file used to produce Piece 1.

Properties of Piece 2

1. Contains no information that can be used to connect any patient to any particular patient record.

2. Contains nothing but hash values of phrases and stop words, in their correct order of occurrence in the original text.

3. Anyone obtaining Piece 1 and Piece 2 can reconstruct the original text.

4. The original text can be reconstructed from Piece 2, and any file into which Piece 1 has been merged. There is no necessity to preserve Piece 1 in its original form.

5. Bob can lose or destroy Piece 2, and re-create it later from the original file, using the same threshold algorithm.

Bob prepares threshold Pieces 1 and 2 and sends Piece 1 to Alice. Alice may require Bob to prove the authenticity of Piece 1, but Bob has no reason to care if Piece 1 is intercepted by an unauthorized party. Alice uses her software (which may be secret, or it may require computational facilities that Bob doesn't have, or it may require large databases that Bob doesn't have), to transform or annotate each phrase from Piece 1. The transformation product for each phrase can be almost anything that Bob considers valuable (e.g., a UMLS code, a genome database link, an image file URL, or a tissue sample location). Alice substitutes the transformed text (or simply appends the transformed text) for each phrase back into Piece 1, co-locating it with the original one-way hash number associated with the phrase.

The original text has been converted into two pieces, neither of which contains any identifying information. There is sufficient information in Piece 1 for Alice to annotate the text and return it to Bob (annotated Piece 1). Bob can reconstruct his original text, including Alice’s annotations, thus adding value to his original data, without breaching patient confidentiality. Bob can pay Alice for her services. Alice can keep Piece 1 and use it for her own purposes. Alice can make a large database consisting of all the Piece 1 files she receives from all of her customers. Alice’s aggregated Piece 1 database can be used by owners of Piece 2 files to reconstruct their original files (along with Alice’s value-added annotations). Alice can sell Piece 1 to a third party, if she wishes. Alice can continually update or otherwise enhance her annotations on Piece 1 and sell the updated versions to Bob and others.
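
The round trip between Bob and Alice can be sketched in Python as below. The two pieces are built by hand from a fragment of the example sentence to keep the sketch short, MD5 is again an assumed hash function, and Alice's lookup table uses made-up placeholder codes (CODE-0001, CODE-0002) rather than real nomenclature codes.

```python
import hashlib


def phrase_hash(phrase):
    """One-way hash of a phrase (MD5 is an assumption)."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


# Bob's two pieces for the fragment "in the mother as in the sons".
piece1 = {phrase_hash("mother"): "mother", phrase_hash("sons"): "sons"}
piece2 = f"in the {phrase_hash('mother')} as in the {phrase_hash('sons')}"

# Alice's side: a hypothetical annotation table with placeholder codes.
ALICE_CODES = {"mother": "CODE-0001", "sons": "CODE-0002"}


def annotate(piece1, lookup):
    """Alice appends her annotation to each phrase, keeping it
    co-located with the phrase's one-way hash."""
    return {
        digest: f"{phrase} [{lookup[phrase]}]" if phrase in lookup else phrase
        for digest, phrase in piece1.items()
    }


def reconstruct(piece2, listing):
    """Bob swaps each hash in Piece 2 for its entry in the listing.
    Stop words are not in the listing and pass through unchanged."""
    return " ".join(listing.get(token, token) for token in piece2.split())


annotated = annotate(piece1, ALICE_CODES)

# Alice's aggregated database: Bob's entries merged with another client's.
aggregated = {**annotated, phrase_hash("father"): "father [CODE-0003]"}

print(reconstruct(piece2, piece1))      # original text
print(reconstruct(piece2, aggregated))  # original text plus annotations
```

Reconstruction only needs a lookup from hash to phrase, which is why Piece 1 can be merged into a larger file without losing its usefulness to the holder of Piece 2.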

end
