Anatomic Pathology Data Mining Jules J. Berman, Ph.D., M.D. Program Director, Pathology Informatics Cancer Diagnosis Program National Cancer Institute *All opinions herein are Dr. Berman’s and do not represent those of any federal agency.


Page 1: Anatomic Pathology  Data Mining

Anatomic Pathology Data Mining

• Jules J. Berman, Ph.D., M.D., Program Director, Pathology Informatics, Cancer Diagnosis Program, National Cancer Institute

*All opinions herein are Dr. Berman’s and do not represent those of any federal agency.

Page 2: Anatomic Pathology  Data Mining

Expertise Domain of the Anatomic Pathology Data Miner

• Confidentiality/Privacy Issues

• Data sharing issues, which include data standardization

• Data Analysis

Page 3: Anatomic Pathology  Data Mining

Data Domain of Pathology Data Miner

• Pathology Data linked to tissue samples

• Any medical record data that can be linked to pathology data (including cancer registry data)

• Any other relevant data in existence that can be sensibly linked to pathology records (this usually means the internet)

Page 4: Anatomic Pathology  Data Mining

Confidentiality/privacy

• Anyone interested in using confidential information (essentially any data generated in a hospital that is attached to a patient) needs to understand confidentiality and privacy issues.

• The fact that you might be using only your department’s data and that you treat the data confidentially will almost never exempt you from existing regulations.

• The consequences to you and your institution of ignoring regulations can be profound.

Page 5: Anatomic Pathology  Data Mining

UNCONSENTED RECORDS VERSUS CONSENTED RECORDS

• How can a researcher get a waiver from patient consent requirements?

• By minimizing the risk of the study.

• In most studies, this means reducing confidentiality and privacy risks to near-zero

Page 6: Anatomic Pathology  Data Mining

Standards issues related to data sharing

• Nomenclatures and free-text mapping

• Common Data Elements

• Standard Report Formats

• Internet Protocols

Page 7: Anatomic Pathology  Data Mining

CDE for Date of Birth

• |birthdate| September 15, 1970

• |birthday| September 15, 1970

• |D.O.B.| September 15, 1970

• |d.o.b.| September 15, 1970

• |date of birth| September 15, 1970

• |date of birth| September 15, 1970

• |date-of-birth| September 15, 1970

• |date_of_birth| September 15, 1970

• |dob| September 15, 1970

• |DOB| September 15, 1970
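
The tag variants above all name the same data element. As a minimal sketch of how they might be collapsed to one Common Data Element name, assuming `date_of_birth` as the canonical tag and a hand-built synonym set (neither is prescribed by the slides):

```python
import re

# Hypothetical canonical name; the slides do not prescribe one.
CANONICAL = "date_of_birth"

# Synonyms after normalization (an assumption drawn from the variants listed above).
DOB_SYNONYMS = {"birthdate", "birthday", "dob", "dateofbirth"}


def normalize_tag(tag: str) -> str:
    """Lowercase the tag, drop everything except letters, then map known synonyms."""
    key = re.sub(r"[^a-z]", "", tag.lower())
    return CANONICAL if key in DOB_SYNONYMS else tag


variants = ["birthdate", "birthday", "D.O.B.", "d.o.b.", "date of birth",
            "date-of-birth", "date_of_birth", "dob", "DOB"]
print({normalize_tag(v) for v in variants})
```

Tags that are not in the synonym set are returned unchanged, so unknown fields are never silently merged.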

Page 8: Anatomic Pathology  Data Mining

Representation of CDE

• |date_of_birth| September 15, 1970

• |date_of_birth| 15, September, 1970

• |date_of_birth| 9/15/70

• |date_of_birth| 15/9/70

• |date_of_birth| 15/09/70

• |date_of_birth| 9/15/1970

• |date_of_birth| 9.15.70

• |date_of_birth| 9,15,70

• |date_of_birth| some delta time
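
Even with a single tag name, the value can be written many ways. The sketch below tries a fixed list of formats and returns an ISO 8601 date. The format list and the month-first preference are assumptions; a value such as 3/4/70 is genuinely ambiguous and would need a declared convention. The "delta time" representation is not handled.

```python
from datetime import datetime

# Formats tried in order; month-first is tried before day-first (an assumption).
FORMATS = [
    "%B %d, %Y",   # September 15, 1970
    "%d, %B, %Y",  # 15, September, 1970
    "%m/%d/%Y",    # 9/15/1970
    "%m/%d/%y",    # 9/15/70
    "%d/%m/%y",    # 15/9/70 and 15/09/70
    "%m.%d.%y",    # 9.15.70
    "%m,%d,%y",    # 9,15,70
]


def to_iso(text: str) -> str:
    """Return the date as YYYY-MM-DD, or raise ValueError if no format fits."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date representation: {text!r}")


samples = ["September 15, 1970", "15, September, 1970", "9/15/70", "15/9/70",
           "15/09/70", "9/15/1970", "9.15.70", "9,15,70"]
print({to_iso(s) for s in samples})
```

Two-digit years follow the `strptime` convention (69 to 99 map to 1969 to 1999), which may be wrong for older patients.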

Page 9: Anatomic Pathology  Data Mining

Annotation/Curation of the CDE

• Unique identifier

• Creator name

• Date of creation

• Date of modifications

• Exact definition

• Hierarchy (if applicable)

• List of users or CDE-specific browsers
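
The annotation fields above could be held in a small record type. This is only a sketch: the field names and the sample values are hypothetical, chosen to mirror the bullet list.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class CommonDataElement:
    unique_id: str                      # unique identifier
    name: str                           # tag name, e.g. date_of_birth
    creator: str                        # creator name
    created: date                       # date of creation
    definition: str                     # exact definition
    modified: List[date] = field(default_factory=list)  # dates of modifications
    parent_id: Optional[str] = None     # hierarchy (if applicable)
    users: List[str] = field(default_factory=list)      # users or CDE-specific browsers


# Hypothetical example entry.
dob = CommonDataElement(
    unique_id="CDE-0001",
    name="date_of_birth",
    creator="example curator",
    created=date(2000, 10, 17),
    definition="Calendar date on which the patient was born.",
)
print(dob.name, dob.parent_id, dob.modified)
```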

Page 10: Anatomic Pathology  Data Mining

BEST EXAMPLE CDE SITE

• United States Health Information Knowledgebase (USHIK)

• http://hmrha.hirs.osd.mil/registry/index1.html

Page 11: Anatomic Pathology  Data Mining

CDEs become XML tags

• <date_of_birth>10/17/00</date_of_birth>

Page 12: Anatomic Pathology  Data Mining

CDEs become self-attributing XML tags

• <date_of_birth defn="http://www.cde.org">10/17/00</date_of_birth>
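
A short sketch of reading a self-attributing tag with Python's standard library. The element and the `defn` attribute come from the slide; using straight quotes around the attribute value is required for well-formed XML.

```python
import xml.etree.ElementTree as ET

record = '<date_of_birth defn="http://www.cde.org">10/17/00</date_of_birth>'

element = ET.fromstring(record)

# The tag names the CDE, the attribute points to its definition,
# and the text holds the value.
print(element.tag)
print(element.get("defn"))
print(element.text)
```

Because the definition travels with the tag, a receiving system can look up what the element means without a separate agreement between the two sites.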

Page 13: Anatomic Pathology  Data Mining

Shared Pathology Informatics Network

• 5-year project beginning April 2001

• Will develop the tools that will allow about 6 large laboratories to share their data with researchers, using the internet

• Basically, it will allow a researcher to interrogate the pathology records at multiple institutions simultaneously and receive a summary report almost instantaneously.
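
The idea of interrogating several institutions at once and returning only a summary can be sketched as below. Everything here is hypothetical: the in-memory lists stand in for institutional databases, and the function names are illustrative rather than part of the network's design.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for each institution's local diagnosis text.
INSTITUTIONS = {
    "institution_1": ["barrett's esophagus", "actinic keratosis"],
    "institution_2": ["adenomatous polyp", "barrett's esophagus"],
    "institution_3": ["dysplasia"],
}


def count_matches(name, term):
    """Runs at one institution and returns only a count, never the records."""
    return name, sum(term in diagnosis for diagnosis in INSTITUTIONS[name])


def federated_query(term):
    """Send the same query to every institution in parallel and summarize."""
    with ThreadPoolExecutor() as pool:
        per_site = dict(pool.map(lambda name: count_matches(name, term), INSTITUTIONS))
    return {"term": term, "per_site": per_site, "total": sum(per_site.values())}


print(federated_query("barrett"))
```

Returning counts rather than records keeps patient-level data behind each institution's firewall.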

Page 14: Anatomic Pathology  Data Mining

Shared Pathology Informatics Network

[Diagram: institutions 1, 2, 3 ... 9, each behind its own firewall, exchange test data requests and responses with an Internet request server.]

Page 15: Anatomic Pathology  Data Mining

What is so special about anatomic pathology data?

• Every anatomic pathology record is linked to the patient identifier and to the tissue blocks for that record

• One of the important rate-limiting factors in cancer research today is access to tissues

• Access to even a small fraction of the tissues routinely collected by pathology departments (about 40 million each year) would be of enormous research benefit.

Page 16: Anatomic Pathology  Data Mining

Increasing frequency of precancer terms, 1984-2000

[Chart: frequency of precancer terms by year; y-axis 0 to 8000, x-axis 1980 to 2005.]

Page 17: Anatomic Pathology  Data Mining

Example project: Virtual Precancer Archive

• Johns Hopkins Surgical Pathology has cases accrued in electronic form since 1984

• 372,536 cases had been accrued as of about September 2000

• Wouldn’t it be nice to be able to survey the archived precancer cases in a large archive such as the Hopkins Archive?

Page 18: Anatomic Pathology  Data Mining

Step 1. (Drs. Bill Moore and Robert Miller) Build a phrase list from all cases

• The text of the reports can be represented as a collection of phrases that contain all of the concepts included in the reports.

• The 372,536 records were parsed to find the diagnostic field free-text.

• Diagnostic field free-text was parsed into sentences.

• Diagnostic field sentences were parsed into phrases and words.
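
The parsing steps above can be sketched as follows. The slides do not say how sentences or phrases were delimited, so this assumes sentences end at periods, semicolons, or line breaks, and that a phrase is any run of one to four consecutive words within a sentence.

```python
import re


def sentences(text):
    """Split diagnostic free-text into sentences (naive: breaks on . ; and newlines)."""
    return [s.strip() for s in re.split(r"[.;\n]+", text) if s.strip()]


def phrases(sentence, max_words=4):
    """Return every run of 1..max_words consecutive words in the sentence."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return {
        " ".join(words[start:start + size])
        for size in range(1, max_words + 1)
        for start in range(len(words) - size + 1)
    }


def phrase_list(diagnostic_texts):
    """Collect the distinct phrases found across all diagnostic fields."""
    found = set()
    for text in diagnostic_texts:
        for sentence in sentences(text):
            found |= phrases(sentence)
    return sorted(found)


result = phrase_list(["Minimal mononuclear cell infiltrate. No dysplasia."])
print(result)
```

Phrases never cross a sentence boundary, so "infiltrate no" is not produced. The naive sentence split would mishandle abbreviations and decimals such as "0.5 cm".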

Page 19: Anatomic Pathology  Data Mining

418,159 phrases represent all the textual concepts in the JHH surgical pathology records and lie outside the realm of the Common Rule

• minimal mononuclear cell infiltrate

• minimal mononuclear cell infiltration

• minimal mononuclear cell interstitial

• minimal mononuclear infiltrate

• minimal mononuclear inflammation

• minimal mononuclear interstitial infitrates

• minimal mononuclear meningeal

• minimal morphologic abnormalities

Page 20: Anatomic Pathology  Data Mining

Step 2. Create a precancer terminology

• Started with the National Library of Medicine’s UMLS (Unified Medical Language System)

• We use the concept list file, which is 113,699,627 bytes and contains 1,598,176 terms.

• As an example, rcc (renal cell carcinoma) has about 80 synonymous terms in UMLS

Page 21: Anatomic Pathology  Data Mining

UMLS CUI C0007134: Renal cell carcinoma

• carcinoma, renal cell

• carcinomas, renal cell

• renal cell carcinoma

• hypernephroid carcinoma

• grawitz tumor

• hypernephroma

• renal cell adenocarcinoma

• rcc
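
One way to use such a synonym set is a lookup table from each term to its concept identifier. The sketch below uses only the eight terms and the CUI shown on this slide; the table layout and function name are assumptions, not the UMLS file format.

```python
RCC_CUI = "C0007134"

RCC_SYNONYMS = [
    "carcinoma, renal cell",
    "carcinomas, renal cell",
    "renal cell carcinoma",
    "hypernephroid carcinoma",
    "grawitz tumor",
    "hypernephroma",
    "renal cell adenocarcinoma",
    "rcc",
]

# Every synonym points at the same concept identifier.
TERM_TO_CUI = {term: RCC_CUI for term in RCC_SYNONYMS}


def lookup(term):
    """Return the CUI for a term, ignoring case and surrounding spaces, or None."""
    return TERM_TO_CUI.get(term.strip().lower())


print(lookup("Grawitz tumor"), lookup("RCC"), lookup("wilms tumor"))
```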

Page 22: Anatomic Pathology  Data Mining

Why do we need to disambiguate common terms?

• Google search engine query 09/19/00

• "rcc" => 132,000 hits

• "renal cell carcinoma" => 11,600 hits

• "grawitz tumor" => 79 hits

Page 23: Anatomic Pathology  Data Mining

The UMLS precancer terms

• 2,984 terms

• Includes 221 terms that I added and assigned private J-codes

Page 24: Anatomic Pathology  Data Mining

Step 3. Map the Hopkins phrases to the precancer terms

• Start with 418,159 phrases

• One by one, try to find a matching term in the list of 2,984 precancer terms

• Prepare a file of all the matching terms

• This step takes 33 seconds to complete with a Perl script running on a 450 MHz desktop computer, i.e., it’s scalable
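
The matching step can be sketched in Python (the original used a Perl script that is not shown). This version assumes a term matches when it appears as a run of whole words inside the phrase, and that the longest such term wins; both rules are assumptions. The small term table uses codes that appear in the results of this deck.

```python
# A few precancer terms and their codes, as they appear in the match results.
TERMS = {
    "actinic keratosis": "0022602",
    "adenomatous polyp": "0206677",
    "borderline": "0205189",
    "dysplasia": "0334044",
    "dysplastic": "0334045",
    "metaplasia": "0025568",
}


def best_match(phrase, terms):
    """Return the longest term found as a run of whole words in the phrase, or None."""
    words = phrase.lower().split()
    for size in range(len(words), 0, -1):          # longest candidates first
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in terms:                  # constant-time dictionary lookup
                return candidate
    return None


def match_phrases(phrases, terms):
    """Produce one 'phrase|term|code' line for every phrase that contains a term."""
    lines = []
    for phrase in phrases:
        term = best_match(phrase, terms)
        if term is not None:
            lines.append(f"{phrase}|{term}|{terms[term]}")
    return lines


sample = ["early actinic keratosis", "minimal mononuclear infiltrate",
          "early dysplastic change"]
print("\n".join(match_phrases(sample, TERMS)))
```

Because each candidate is checked with a dictionary lookup, the cost grows with the number of phrases rather than with phrases times terms, which is consistent with the short run time reported above.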

Page 25: Anatomic Pathology  Data Mining

The result: 10,310 term matches, from 418,159 phrases: a scalable work in progress

• early actinic keratosis|actinic keratosis|0022602

• early adenomatous polyp|adenomatous polyp|0206677

• early borderline rejection|borderline|0205189

• early dysplasia|dysplasia|0334044

• early dysplastic change|dysplastic|0334045

• early dysplastic process|dysplastic|0334045

• early gastric mucin cell metaplasia|metaplasia|0025568

• early gastric mucous cell metaplasia|metaplasia|0025568

Page 26: Anatomic Pathology  Data Mining

Step 4. Give precancer match list to Drs. Bill Moore and Robert Miller to create a concordance

• 10,310 precancer terms occurred in 54,909 accessioned surgical pathology cases between 1984 and 2000. That is, on average there were a little more than 5 cases for each precancer term.

• The 54,909 cases containing a precancer term represent 54,909/372,536 ≈ 15% of all accrued cases
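
The two ratios can be checked directly from the figures on this slide. The per-term figure is only a rough average, since one case may contain several terms.

```python
total_cases = 372_536       # all accrued cases
precancer_cases = 54_909    # cases containing at least one precancer term
matched_terms = 10_310      # distinct matched precancer terms

cases_per_term = precancer_cases / matched_terms
fraction_of_archive = precancer_cases / total_cases

print(f"{cases_per_term:.2f} cases per matched term")
print(f"{fraction_of_archive:.1%} of all cases")
```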

Page 27: Anatomic Pathology  Data Mining

The concordance looks like this:

• C0001815^367220497667008419098^^

• C0002893^394120765570701149177^^

• C0002893^435120960421908784068^^

• C0002893^436410698795906686356^^

• C0002893^445510623875200588234^^
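
A sketch of reading concordance lines like those above. It assumes the first caret-delimited field is the UMLS concept identifier and the second is an opaque case identifier; the meaning of the two trailing empty fields is not given in the slides, so they are ignored.

```python
from collections import defaultdict

SAMPLE = """\
C0001815^367220497667008419098^^
C0002893^394120765570701149177^^
C0002893^435120960421908784068^^
C0002893^436410698795906686356^^
C0002893^445510623875200588234^^
"""


def cases_by_concept(lines):
    """Map each concept identifier to the set of case identifiers that contain it."""
    index = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        concept, case_id = line.split("^")[:2]
        index[concept].add(case_id)
    return index


index = cases_by_concept(SAMPLE.splitlines())
for concept, cases in sorted(index.items()):
    print(concept, len(cases))
```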

Page 28: Anatomic Pathology  Data Mining

Precancer-related cases by year

• 1984 1175 7%

• 1985 1573 8%

• 1986 2024 10%

• 1987 2195 11%

• 1988 2239 11%

• 1989 2328 11%

• 1990 2721 12%

• 1991 3077 14%

• 1992 3185 14%

• 1993 2878 13%

• 1994 3060 14%

• 1995 2968 13%

• 1996 3475 14%

• 1997 4726 17%

• 1998 4989 18%

• 1999 5996 20%

• 2000 6298 25%

Page 29: Anatomic Pathology  Data Mining

Precancer-related cases by year

[Chart: precancer-related cases by year; y-axis 0 to 7000.]

Page 30: Anatomic Pathology  Data Mining

Cases per year of Barrett’s esophagus

• C0004763 1984 30

• C0004763 1985 35

• C0004763 1986 82

• C0004763 1987 97

• C0004763 1988 106

• C0004763 1989 84

• C0004763 1990 97

• C0004763 1991 100

• C0004763 1992 132

• C0004763 1993 126

• C0004763 1994 144

• C0004763 1995 162

• C0004763 1996 221

• C0004763 1997 307

• C0004763 1998 341

• C0004763 1999 401
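
As a small worked example, the yearly counts above can be summarized in a few lines. These are raw counts, so part of the rise may simply reflect a growing caseload.

```python
# Barrett's esophagus (C0004763) cases per year, copied from the list above.
BARRETTS = {
    1984: 30, 1985: 35, 1986: 82, 1987: 97, 1988: 106, 1989: 84,
    1990: 97, 1991: 100, 1992: 132, 1993: 126, 1994: 144, 1995: 162,
    1996: 221, 1997: 307, 1998: 341, 1999: 401,
}

total = sum(BARRETTS.values())
fold_increase = BARRETTS[1999] / BARRETTS[1984]

print(f"total cases 1984-1999: {total}")
print(f"1999 count is {fold_increase:.1f} times the 1984 count")
```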

Page 31: Anatomic Pathology  Data Mining

Conclusion:

• With these techniques, laboratories with good informatics infrastructure can create a virtual omni-archive (at very low cost) that operates within current human subject protection guidelines for minimal-risk de-identified retrospective studies.

Page 32: Anatomic Pathology  Data Mining

Epilog

• There are other protocols for conducting confidential anatomic pathology research

• These include anonymization, deidentification, brokered double encryption, sanitization through nomenclature mapping

• An example of the latter two methods is www.netautopsy.org