DATA MINING THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY
By
Johnathan P. Durbin
B.S., University of Louisville, 1995
A Thesis
Submitted to the Faculty of the
Graduate School of the University of Louisville
in Partial Fulfillment of the Requirements
for the Degree of
Master of Arts
Department of Mathematics
University of Louisville
Louisville, Kentucky
August 2001
A PRACTICE IN DATA MINING USING
THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY
By
Johnathan P. Durbin
B.S., University of Louisville, 1995
A thesis Approved on
_________July 12, 2001________
by the following Reading Committee:
__________________________________Thesis Director
__________________________________
__________________________________
ABSTRACT
Data mining is a technique with a number of methods used to explore large
datasets from a variety of angles with a wide spectrum of analytical tools. There are
techniques for finding data, cleaning data, and validating results. For years new data have
been collected by educational, research, commercial, and governmental entities for future
analysis. The 1997 National Ambulatory Medical Care Survey dataset (NAMCS) is such
a dataset available in the public domain at the C.D.C. [Centers for Disease Control and
Prevention] for public consumption. Once this dataset was found, imported, and cleaned,
it was analyzed. Although statistical packages have become extremely sophisticated,
commercial statistical packages do not do everything needed for data mining. For this
reason, a program (DFEPP) was written to analyze the data and to display the
significant results in an easily digested yet informative manner using visualization
techniques.
TABLE OF CONTENTS
ABSTRACT
CHAPTER
I. Introduction
II. Acquiring and Importing Data
    2.1 Acquisition of Data
    2.2 Importing Data
III. Data Visualization using the Difference From Expected Percentage Plot (DFEPP): Program Design and Use
    3.1 The Graph Design
    3.2 The Design of the DFEPP Program
    3.3 The Use of the DFEPP Program
IV. Data Mining of the 1997 National Ambulatory Medical Care Survey (NAMCS) Dataset
    4.1 Analysis of Payee Type by Practice Type
        4.1.1 Workers Compensation
        4.1.2 Medicare
        4.1.3 Medicaid
        4.1.4 Self-Pay
        4.1.5 Privately Insured
        4.1.6 All Other
    4.2 HMOs
    4.3 Modeling
        4.3.1 Age Group Models
        4.3.2 Modeling Classification of Pregnant
V. Conclusions
REFERENCES
APPENDIX – A (Variable list)
VITA
LIST OF IMAGES
IMAGE 1. A sample run of a web search tool (Copernic2000)
IMAGE 2. A working example of the output for data visualization
IMAGE 3. An output example
IMAGE 4. A working example of the input for data visualization
IMAGE 5. A working example of the output for data visualization
IMAGE 6. Clementine Code
IMAGE 7. Neural Network for Age Group Model Output
IMAGE 8. Refined Neural Network for Age Group Model Output
IMAGE 9. Neural Network for Age Group Model
IMAGE 10. C5 Model for Pregnant Output
IMAGE 11. Refined C5 Model for Pregnant Output
IMAGE 12. Refined Rule Set for Pregnant Model Rule Set
IMAGE 13. Refined C5 Model (2) for Pregnant Output
IMAGE 14. Refined Rule Set (2) for Pregnant Model Rule Set
LIST OF PLOTS
PLOT 1. Workers compensation by Physician Specialty
PLOT 2. Adjusted plot after removal
PLOT 3. Medicare by Physician Specialty
PLOT 4. Medicaid by Physician Specialty
PLOT 5. Distribution of Medicaid population
PLOT 6. Age of Pediatric Patients
PLOT 7. Modified Medicaid by Physician Specialty
PLOT 8. Age of Dermatology Patients
PLOT 9. Self Pay by Physician Specialty
PLOT 10. Privately Insured by Physician Specialty
PLOT 11. All Other Payees by Physician Specialty
PLOT 12. HMO Membership Percent by Age
PLOT 13. HMO Membership by Age Group
PLOT 14. HMO Membership by Payee Type
PLOT 15. HMO Membership by Physician Specialty
PLOT 16. HMO Membership by Race
PLOT 17. Distribution of Asian/Pacific Islander Age
LIST OF TABLES
TABLE 1. Payee Types
TABLE 2. Workers Compensation ICD-9 Grouped Codes for Orthopedic Visits
TABLE 3. Workers Compensation ICD-9 Codes for Orthopedic Visits
TABLE 4. Workers Compensation ICD-9 Grouped Codes for Neurology Visits
TABLE 5. Workers Compensation ICD-9 Codes for Neurology Visits
TABLE 6. New Proportions after WC Orthopedic Surgeon Visits are removed
TABLE 7. Workers’ Comp “Other” Physician Visits
TABLE 8. Age Statistics by Physician Type
TABLE 9. ICD-9 Codes tabled by Medicare Use
TABLE 10. Age Statistics by Payee Type
TABLE 11. HMO Membership by Physician Specialty
TABLE 12. Has Insurance by Physician Specialty
TABLE 13. Has Insurance by Physician Specialty
TABLE 14. Has Insurance by ICD-9 Codes/Pediatric
TABLE 15. All Pay Methods by Insurance/Pediatrics
TABLE 16. Has Insurance by ICD-9 Codes/Neurology
TABLE 17. All Pay Methods by Insurance/Neurology
TABLE 18. HMO Membership
TABLE 19. Adjusted HMO Membership
TABLE 20. Age Statistics by Race
CHAPTER I
INTRODUCTION
The purpose of this paper is to describe the process of data mining through an
example. The primary purpose of data mining is to generate hypotheses to be examined
for validity either with fresh data or by withholding a portion of the initial dataset for
investigation. Data mining is a technique with a number of methods used to explore large
datasets from a variety of angles with a wide spectrum of analytical tools. There are
techniques for finding data, cleaning data, and validating results. For many years new
data have been collected by educational, research, commercial, and governmental entities
so that data mining can be used to find trends and patterns. Many of these data
have been stored in data warehouses (collections of datasets) or put away by an
organization, possibly to be examined in the future. The 1997 National Ambulatory
Medical Care Survey dataset (NAMCS) analyzed in this paper is available in the public
domain at the C.D.C. [Centers for Disease Control and Prevention] for public
consumption along with several other medical datasets. Chapter II covers how to find
medical datasets and import them into various statistical packages. Although statistical
packages have become extremely sophisticated, commercial statistical packages do not
do everything needed for data mining. For this reason, a program was written
(Chapter III) to analyze the data and to display the significant results in an easily
digested yet informative manner using visualization techniques.
The NAMCS dataset is analyzed in Chapter IV using various statistical packages and the
program developed in Chapter III. The NAMCS dataset consists of 24,715 patient visit
records each containing 224 variables. The data cover personal physical attributes,
the physician’s practice and location, reasons and diagnoses for visits, medication given,
insurance types, tests given, types of medical personnel seen, and other visit data (see
Appendix – A for a full variable list). These data can be analyzed in a variety of ways:
differences between patient types in common practices or pay methods; examining
whether certain practice types favor using staff over physicians; what practices or pay
methods favor using screenings or tests; or simple analyses of various physical attributes
of the different patient types. In this thesis, the ways in which different payee types
visited the different practices are analyzed. Different payee types disproportionately
visited certain practice types. Some of these disproportions are expected and others are
less explainable. HMO membership and its distribution through age groups, practice
types, pay methods, and races are also analyzed. Older patients and the practices that
serve them had a lower rate of HMO membership, but privately insured patients, “All Other”
payees, and Asian/Pacific Islanders all had higher rates of membership. Further analyses
used modeling techniques, one to determine the patient’s age group and another to
determine whether the patient was pregnant. A model was found with ~90% accuracy in
determining whether someone was pregnant using age, reason for visit (non-illness care),
and sex of patient. A model to determine which age group the patient was in was much
less accurate (~50%).
In this thesis techniques to find, import, clean, and analyze data are discussed.
Some of the techniques are used with the NAMCS dataset while other techniques are
only discussed. A program was also written by the author of this thesis, following the
visualization guidelines discussed in Chapter III, to analyze the NAMCS dataset.
CHAPTER II
ACQUIRING AND IMPORTING DATA
The first step in a data mining process is to collect the data. A collection
mechanism can be set up to obtain data or the data may already exist in a dataset from an
outside source. After the necessary data have been acquired, they must be put into a
format that can be imported into any statistical packages that will be used to analyze
them. Once the data have been imported, they need to be cleaned for analysis.
2.1 Acquisition of Data
The first step in the data mining process is to acquire the data. Depending on
what is studied, a collection mechanism for data may have to be set up or the data can
come from an outside source. Collecting data can be very expensive and time consuming,
but may be necessary. When researchers collect the data themselves during the study, the
validity of the collection mechanism and of the data is known. Alternatively, the necessary
data may already exist. Studies on
many topics have been done over time and the data for these studies may still be
available. With the invention and now widespread use of the computer, much of the data
for these studies are on magnetic media, easily copied, and transferable for fellow
researchers to use. Governmental agencies, such as the Census Bureau (www.census.gov)
and C.D.C. (www.cdc.gov), have collected data for years and have large datasets in the
public domain online for downloading. The Freedom of Information Act (FOIA,
http://www.usdoj.gov/foia/) gives access to governmental data with some restrictions.
These data may or may not be in an easily usable format and the restrictions may not
allow all of the desired data to be made available due to privacy or security issues. Data
from other countries may be less restricted and are available in a variety of formats. Data can
also be bought from outside sources. Some companies can be contracted to collect data or the
data may have already been collected and are available for sale to researchers. When the
data come from an outside source, the validity of the data should be considered. There are
pros and cons to both ways of acquiring data but it is up to the researcher to find the data
and to discuss their validity. For the purpose of this data mining project, a public domain
dataset closely related to an aspect of health care was used.
Much of the public domain data are already available on the Internet and the
various search engines make it easy to find relevant datasets. Many of the search engines
will point to Internet sites that give or sell data. The Lycos search engine
(www.lycos.com) was developed at Carnegie Mellon University and tends to point to
more research-oriented web sites than other search engines. Other search engines
providing pointers to data include Excite (www.excite.com), AltaVista
(www.altavista.com), and MSN (www.msn.com).
A new generation of web tools has been developed to make searching easier and
more thorough. Copernic2000 is one of these web tools; it searches many different search
engine databases for whatever topic is being queried. These web search tools are highly
configurable and can be modified to the individual preferences of users. Web users tend
to prefer certain search engines and web search tools allow the user to focus on the search
engines of their choice. The level of search in the databases can also be defined by
choosing how many hits from each search engine database are allowed. These web search
tools can also search other types of Internet sites such as news groups, email databases,
online businesses, news, and many other focused sites.
A sample run of a web search tool (Copernic2000) (Image 1)
Whether a web search engine or web search tool is used, there are certain
guidelines that should be followed. First, use a keyword such as “dataset” and avoid
words such as “data” or “database”. Keywords “data” and “database” will point to
results, database programs, or databases of articles but the keyword “dataset” will focus
on collections of data. Use the option of searching for all words in a query and if that
does not work, use a search on any words in a query. When a URL is found, consider the
source of the site and its possible biases. There is no optimal way to find data on the
Internet but with the development and refinement of web search tools, locating data is
becoming an easier task.
2.2 Importing Data
Once a dataset has been found, the dataset needs to be imported into statistical
programs for analysis. The data mining process used to investigate the data relies on
standard statistical packages such as SAS 8® (SAS Institute Inc.), SPSS 10®, and SPSS
Clementine 5.2® (SPSS Inc.). In order to make the investigations, the statistical packages
must be able to read the data. Data are not always in a format that the different statistical
packages can automatically import. Many sites, such as the C.D.C., put their public data
in an ASCII (text) format with rules of how to import the file correctly. Otherwise, the
data are released in a database format or another standard type file. The dataset analyzed
in this paper was in a self-extracting ZIP file that contained 12 ASCII files, one being a
file that explained how the data file was arranged.
There are many different file formats used to save data and to import data. There
are pros and cons to each type of file format. ASCII files are generally either character
delimited files or columnar fixed width files. Character delimited files use a special
character such as a comma or tab to separate variable columns. When importing these
types of files, errors can occur when the delimiter character is included in a text field, or the
spacing may be shifted enough to confuse tab-delimited imports. Fixed width columnar
ASCII files are not as easy to import, but the import allows the user to work with each
variable and to define variable names, labels, and text related to each variable. The user
can format and label the data to individual preference. The user should become very
familiar with the data variables in the dataset.
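As a rough illustration of the two ASCII layouts described above, the following Python sketch reads a comma-delimited record whose text field contains the delimiter and slices a fixed-width record by column position. The sample records, the variable names (AGE, SEX, PAYTYPE), and the column positions are hypothetical and are not taken from the NAMCS file layout.

```python
import csv
import io

# Hypothetical comma-delimited data: the text field contains a comma.
delimited = 'id,reason,age\n1,"cough, fever",34\n'

# A naive split on the delimiter shifts the columns...
naive = delimited.splitlines()[1].split(",")
# ...while a CSV reader honors the quoting and keeps the field whole.
parsed = list(csv.reader(io.StringIO(delimited)))[1]

# Hypothetical fixed-width layout: (variable name, start, end), zero-based
# with the end exclusive. A real layout comes from the documentation file
# released with the data.
LAYOUT = [("AGE", 0, 3), ("SEX", 3, 4), ("PAYTYPE", 4, 6)]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one fixed-width record into a dict of named string fields."""
    return {name: line[start:end].strip() for name, start, end in layout}


record = parse_fixed_width("034203")

print(naive)
print(parsed)
print(record)
```

The fixed-width route takes more setup, since every variable must be declared, but that declaration is also where names and labels can be attached to each variable.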
There are many standard file types that can be imported, including spreadsheet,
database, and portable files. Spreadsheets are the easiest to import but they sometimes
have record number limitations. The variable names can be included in the first row for
ease of importation. In this study, the dataset used was imported into SPSS 10 from a
columnar ASCII file. An attempt to write the 24,610 records to an Excel spreadsheet file
failed, writing only 16,383 of the records. This count is consistent with the 16,384-row
limit of older Excel file formats (one row of variable names plus 16,383 records), so the
failure may stem from SPSS 10 writing to such a format. Database files are another type of file that can be
imported. Flat file databases (all data contained in one table) and well designed relational
databases (multiple tables related by keys) are not a problem to import but some
relational databases are not always structured well and create importing problems.
Different tables within the same relational database may contain identical
variable names that are not meant to be linked, but the import features in some
statistical programs try to link them anyway. Other table links may need to be defined in
a certain way such as one to one, one to many, or many to many and these links do not
always import the data correctly. Outside of having the data in the statistical packages’
file format, portable files are the best choice for importing data. The data are stored
with their variable names in this file type for ease of import; its only failing
is that it does not include variable labels or text related to
nominal data. Usually researchers do not have much say in the format in which the data are
found, but if possible, they should request data in a portable format or the native format
of their statistical package.
Once the data are imported, they may need to be cleaned. Unless the data were
formatted during the import, the variable labels and text related to nominal data have not
been defined. It is not necessary to define them but the labels and nominal data text make
the analysis easier to comprehend. Some data records may contain missing or invalid
information and the records need to be either corrected or removed. Some variables may
not be necessary and can also be removed. The dataset used in this paper initially
contained 224 variables that were reduced to 33 variables as the analysis was refined. For
a full list of variables, see Appendix A. Many variables contained information about the
“marked” status of another variable and could be removed. Some removed variables were
lengthy text entries that were rarely used. Other removed variables contained medicine
codes. Many of the variables were removed after initial analyses showed little promise
for them. Some categorical data can also be refined to be a more manageable size. One of
the variables in the dataset contained more than 300 different categories that could have
easily been refined to a more manageable 9 categories. Some data may also need
editing to fix errors such as misplaced decimal points, text in numeric fields that forces
numeric variables to import as text, and variables stored as the wrong data type.
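The cleaning steps described above (dropping uncorrectable records, repairing numeric fields, and collapsing categories) can be sketched in Python as follows. This is only an illustrative sketch: the category codes, group names, and sample records are made up and do not come from the NAMCS dataset, and the thesis itself performed these steps in SPSS.

```python
# Hypothetical mapping from detailed category codes to broader groups;
# the codes and group names are made up for illustration.
RECODE = {"101": "Respiratory", "102": "Respiratory", "205": "Injury"}


def recode(code, mapping=RECODE, default="Other"):
    """Collapse a detailed category code into a broader group."""
    return mapping.get(code.strip(), default)


def to_number(text):
    """Convert a text field to a float, or None when it is missing or invalid."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Made-up raw records as they might look right after an import.
raw = [
    {"age": "34", "code": "101"},
    {"age": "n/a", "code": "205"},   # text in a numeric field
    {"age": "7", "code": " 999"},    # unmapped code with stray spacing
]

cleaned = []
for row in raw:
    age = to_number(row["age"])
    if age is None:
        continue  # drop records that cannot be corrected
    cleaned.append({"age": age, "group": recode(row["code"])})

print(cleaned)
```

Whether an invalid record is dropped or corrected is a judgment call for the analyst; the sketch drops it only to keep the example short.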
Data mining tools examine data from a variety of angles with a number of
different statistical methods. Not all of these statistical tools or programs can read or
write to common file types without loss of some formatting. Therefore trading data
between programs can sometimes become a problem. SAS programs cannot read native
SPSS 10 SAV files and SPSS programs cannot read native SAS files. Both programs can
read and write to common file types but the difficulties described previously can still
occur. Saving data in an ASCII file from one program, then importing the data into
another program can give delimiting problems, or if the columnar format is used, the
variables have to be redefined. Transferring data from one statistical program to another
using spreadsheet format will work better but the constraint on sheet size may limit the
number of records transferred. Portable and database files are the best options currently
available but these formats do not save the variable labels or the text related to nominal
data. An ideal situation would be a format that all statistical packages could export to and
import from without the loss of variable labels and text related to nominal data.
Unfortunately, this ideal currently does not exist.
CHAPTER III
DATA VISUALIZATION USING THE DIFFERENCE FROM EXPECTED PERCENTAGE PLOT (DFEPP):
PROGRAM DESIGN AND USE
One very important aspect of data mining is visualization, usually in graphical
form. There are many different statistical programs that analyze data and have a number
of graphical formats but these programs may not analyze the data in the desired way or
present results in the best manner. Presenting information in a useful and digestible form
is very important in the data mining process. Most papers are written for audiences with
varying degrees of statistical knowledge and should be written to accommodate most, if
not all, of the audience. Visual representation of information is the simplest way to digest
results for the general population and technical detail can be added to validate
information for those with greater statistical knowledge. The statistical packages used
give effective analyses and reporting but they do not always present significant results in
a manner desired by the investigator. For this reason, a program was designed and
written by the author of this thesis in Visual Basic 6, following some guidelines in
visualization. This chapter covers the design and use of the Difference From Expected
Percentage Plot (DFEPP) program.
3.1 The Graph Design
Presenting results from a data analysis in a format that is easily read is a necessity
when analyzing and reporting on data. Analysis results should be presented in layers of
detail from the most general to the most in-depth. Graphs and plots are easily understood
and are used for a quick, less detailed, analysis of data. Tables and associated numeric
information can also be used in the presentation of data for greater detail but are
generally less easy to understand. A mix of the two types of presentations is an ideal way
to present data analysis results to a general audience with varying degrees of statistical
knowledge.
There have been very few publications on data presentation and graphic design, but the
few that exist provide some basic guidelines (Tufte, 1997; White, 1984).
According to Tufte’s “The Visual Display of Quantitative Information” (Tufte, 1997):
Excellence in statistical graphics consists of complex ideas communicated with
clarity, precision, and efficiency. Graphical displays should:
• Show the data.
• Induce the viewer to think about the substance rather than about
methodology, graphic design, the technology of graphic production, or
something else.
• Avoid distorting what the data have to say.
• Present many numbers in a small space.
• Make large data sets coherent.
• Encourage the eye to compare different pieces of data.
• Reveal the data at several levels of detail, from a broad overview to the fine
structure.
• Serve a reasonably clear purpose: description, exploration, tabulation, or
decoration.
• Be closely integrated with the statistical and verbal descriptions of a data set.
Jan V. White’s “Using Charts and Graphs” (White, 1984) suggested some other concepts
to include:
• Sort from most to least significant.
• Make sure plot segments are connected to associated text.
• Make significances stand out.
The following graph (a plot from the program discussed later in this chapter) uses many of the
concepts included in the books by White and Tufte.
A working example of the output for data visualization: (Image 2)
The above graph shows the data and is simple enough that the design of the graph is not a
distraction from the data presentation. It reduces a large dataset to a simple plot with
stratum information, counts, chi square values, and associated p-values, providing much
information in a small space and making a large data set coherent. It encourages the eye to compare
different pieces of data through the use of color and by listing the categories by
significance. It serves a reasonably clear purpose (description, exploration, and tabulation)
and it is closely integrated with the statistical and verbal descriptions of a data set. For
ease of readability, the text for each category (actual %, category name, count, chi square
value, and associated p-value) is connected by a line to the associated bar in the plot.
3.2 The Design of the DFEPP Program
There are a variety of factors to consider when writing any program: who will use
the program, what operating systems will be used, what type of data will be used, the
intended purpose, and the intended output. Some specialized programs can be written
cryptically but they are usually for a very limited audience that is generally familiar with
their use. Graphical user interface (GUI) programs are much less cryptic and are the easiest
type of program for a novice to use. Older programs were developed so that the user
interacted via a command line interface, which would intimidate some users, but most
GUI-based programs use standardized graphic and menu controls familiar to most computer
users. A GUI makes programs extremely easy to use. Any program that might be used
by the general public should be GUI based.
Many different programming languages were considered in the development of
this DFEPP program. ANSI (American National Standards Institute) C and C++ are very
powerful programming languages and can be compiled to run on many different
operating systems, but they lack some of the features needed to copy a generated graph
into a clipboard for pasting into other applications. The Visual C and C++ packages have
a better user interface with the ability to copy graphs onto a clipboard but these languages
are not ANSI compliant and will only run on a few types of operating systems. Java,
developed by Sun Microsystems (www.sun.com), was another language considered for
its portability but it is fairly limited with respect to pasting results into the Windows
clipboard. Microsoft Visual Basic 6® (VB6) was used to write the DFEPP program.
Programs are extremely easy to write in VB6, and their Windows-based
controls and interface make them easy to use. Anyone who is somewhat familiar with
Windows can use a VB6-coded program. In this program, the interface is familiar and the cutting and pasting of
the generated graph into another program is a simple matter due to the tools included in
VB6. The only downside of VB6 is that it only works on a limited number of operating
systems (MS Windows based), but those few operating systems are on more than 90% of all PCs.
Creating visualization programs requires consideration of how the data are entered,
processed, and used. A majority of programming languages can read and write to a
variety of file types and structured files. Input from sources such as keyboards and
scanners, and output to devices such as monitors and printers can be easily accomplished
by most programming languages. The DFEPP program merely required some simple
input into text boxes and a mouse click to plot the graph. VB6 gives an easy input method
for the user as well as easy access to clipboard controls. The graphical output of the
DFEPP program needed to be pasted into other Windows-based programs (such as Word
and Excel) and the tools in the VB6 programming environment allowed for easy copying and
pasting of a graph. Other languages would also do all of the necessary processing but the
input and output needed would not be as user-friendly.
3.3 The Use of the DFEPP Program
The DFEPP program was written to show significant differences between
expected and actual values of one stratum of a categorical variable across all strata of
another categorical variable. The dataset to be analyzed in Chapter IV was reduced to 33
categorical variables containing data on patient demographics, types of physician
practices, payment for services, and other information on ambulatory visits. An example
of the use of this program is to look at how the different payee types disproportionately
go to different practices. For instance, assuming that payee types visit practice types at
the same rate as their overall percent of population, the privately insured should be 51%
of the patients of each type of physician practice.
An output example (Image 3)
The program sorts the categories (practice types (J)) from greatest to least by the
difference between the actual percentage (the percentage of privately insured patients in a
practice type) and the expected percentage (the overall percentage of privately insured,
51% (H)). It then generates a difference from expected percentage plot, using the expected
percentage (51%) as a baseline and the actual percentages to plot the bars. The user
defines the major and minor percentage differences to their preference (L). The program
highlights the major percentage differences in red and minor percentage differences in
blue within the bar plot (I). Chi Square values are also derived using the number of
elements in each stratum (practice type), and the actual and expected
percentages of the isolated stratum (payee type (K)). For example, the privately insured
were 51% of all patients but only 32% of 1418 cardiology patients, giving a Chi Square
value of 204.84.
\[
\chi^2 = \frac{\bigl((51\% - 32\%) \cdot 1418\bigr)^2}{51\% \cdot 1418} + \frac{\bigl((49\% - 68\%) \cdot 1418\bigr)^2}{49\% \cdot 1418} = 204.84
\]
The Chi Square values are also highlighted by color for significance. In this paper an
alpha of 0.01 is considered the cut-off point for major significance and the Chi Square
values greater than or equal to 6.635 are highlighted red. Chi Square values between
3.841 and 6.635 are associated with an alpha of 0.05 and have lesser significance but are
highlighted blue in case the user chooses to point out those significances with the lower
alpha. The p-values associated with the Chi Square values, with one degree of
freedom, are also given and highlighted according to their significance. If there is no
significance then “No Sig” is displayed in the p-value column.
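The Chi Square calculation and the significance cut-offs described above can be checked with a short Python sketch. The DFEPP program itself was written in VB6 and its source is not shown here, so the function names below are hypothetical and the code only mirrors the arithmetic given in the text; the p-value uses the standard identity for one degree of freedom, p = erfc(sqrt(x / 2)).

```python
import math


def dfepp_chi_square(expected_pct, actual_pct, n):
    """Chi square (1 df) comparing a stratum's actual and expected share of n records."""
    p_exp, p_act = expected_pct / 100.0, actual_pct / 100.0
    # Term for records inside the stratum (e.g. privately insured).
    in_stratum = ((p_exp - p_act) * n) ** 2 / (p_exp * n)
    # Term for records outside the stratum (everyone else).
    not_in_stratum = (((1 - p_exp) - (1 - p_act)) * n) ** 2 / ((1 - p_exp) * n)
    return in_stratum + not_in_stratum


def p_value_1df(chi_sq):
    """Upper-tail probability of a chi square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2.0))


def significance(chi_sq):
    """Classify a value using the cut-offs stated in the text."""
    if chi_sq >= 6.635:
        return "major (alpha 0.01)"
    if chi_sq >= 3.841:
        return "minor (alpha 0.05)"
    return "No Sig"


# Privately insured: 51% expected, 32% of 1418 cardiology patients.
chi_sq = dfepp_chi_square(51, 32, 1418)
print(round(chi_sq, 2), significance(chi_sq), p_value_1df(chi_sq))
```

For the cardiology example the sketch reproduces the value of 204.84 given above.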
A working example of the input for data visualization: (Image 4)
Box A is for the title, B gives the baseline percentage, and E is for the
categories. The major and minor percentage differences are entered in C. Column D
contains the actual percentage of A in each of the associated categories in column E.
Column F is the actual count of each of the associated categories in column E. Column G
contains a series of check boxes that select the categories in the associated column E to
be analyzed. Once the user has provided all of the necessary information, the plot option
is chosen in the menu bar to give the following hanging plot:
A working example of the output for data visualization: (Image 5)
H gives the strata analyzed with their expected percentage values. I is the difference
from expected percentage plot using the expected percentage value as a baseline and the
actual percentage values contained in J. J contains each category, the percentage of
strata H in each category, and the number of total members per category. K contains the
Chi Square values and associated p-values for the corresponding categories in J,
highlighting the values with some significance by color. L is a legend for graph I
explaining the major and minor significance lines. If the user is satisfied with the graph
then the copy option may be chosen in the menu bar to copy the graph into the clipboard
to paste into another program. Otherwise the user closes the graph window, modifies the
initial data entry window, adjusts the graph options, and then plots the updated graph.
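The sorting and highlighting behavior described in this section can be sketched in Python as follows. Several points are assumptions made for illustration: the function names are hypothetical, the rows are ordered by the absolute size of the difference (the text could also be read as ordering by signed difference), and apart from the cardiology figure the percentages and the 10 and 5 point thresholds are made up.

```python
def classify(diff, major, minor):
    """Label a difference from the expected percentage as major, minor, or none."""
    size = abs(diff)
    if size >= major:
        return "major"   # drawn in red by the DFEPP program
    if size >= minor:
        return "minor"   # drawn in blue
    return "none"


def dfepp_rows(expected_pct, actual_pcts, major, minor):
    """Return (category, difference, label) rows, largest absolute difference first."""
    rows = [
        (name, pct - expected_pct, classify(pct - expected_pct, major, minor))
        for name, pct in actual_pcts.items()
    ]
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Cardiology (32%) comes from the text; the other two percentages are
# made up for illustration, as are the 10 and 5 point thresholds.
actual = {"Cardiology": 32, "Practice X": 55, "Practice Y": 58}
rows = dfepp_rows(51, actual, major=10, minor=5)

for name, diff, label in rows:
    print(f"{name:12s} {diff:+d} {label}")
```

Keeping the thresholds as parameters mirrors input C of the program, where the user sets the major and minor percentage differences.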
CHAPTER IV
DATA MINING OF THE 1997 NATIONAL AMBULATORY MEDICAL CARE SURVEY (NAMCS) DATASET
The 1997 National Ambulatory Medical Care Survey (NAMCS) is a national
probability sample survey conducted by the Division of Health Care Statistics of the
National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention
(CDC). The survey consists of 24,715 patient records from visits to 1,247 physicians in
the year 1997. Initially, each patient visit record consisted of 224 variables, including
demographic information, diagnoses, drugs prescribed, types of visits, types of medical
professionals seen, medical tests and screenings done, and location of physician office.
During the data cleanup phase of the project, many of these variables were removed as
the focus of the analysis narrowed, leaving 33 variables with information about pay
method, race, age group, practice types, and other categorical information. Other
variables were reduced from 500+ different categories, using the SPSS
Transform/Compute feature, to far fewer categories. Once the data were cleaned, they
were analyzed using a variety of statistics packages and methods. In section 4.1, the
relationship of patient payee types to practice types was analyzed to look for
disproportionate relationships. SAS 8® (SAS Institute Inc.), SPSS 10®, and SPSS
Clementine 5.2® (SPSS Inc.) were all used to analyze the dataset but the DFEPP program
was used to do much of the visualization.
4.1 Analysis of Payee Type by Practice Type
In this section, the ways the different payee types visited the various practice
types were analyzed. Initially there were fourteen practice types (Cardiologists,
Dermatologists, General/Family Practice, General Surgery, Internal Medicine,
Neurology, OB/GYN, Ophthalmology, Orthopedic Surgery, Otolaryngology, Pediatrics,
Psychiatry, Urology, and Other) and nine types of payees (privately insured, Medicare,
Medicaid, workers compensation, self-pay, no charge, other, unknown, and blank). Since
payee types identified as no charge, other, unknown, and blank all had a limited number
of records, they were collected into one “all other” payee type yielding the following
distribution of payee types:
Payee Types (Table 1)
Payee Type           Count   Col %
Private Insurance    12562   51.0%
Medicare              5395   21.9%
Medicaid              1945    7.9%
Worker's Comp.         503    2.0%
Self-Pay              2176    8.8%
All Other             2029    8.2%
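The column percentages in Table 1 follow directly from the counts. The short Python sketch below recomputes them; the variable names are illustrative only and are not taken from the thesis, SPSS, or the DFEPP program.

```python
# Counts per payee type, copied from Table 1.
payee_counts = {
    "Private Insurance": 12562,
    "Medicare": 5395,
    "Medicaid": 1945,
    "Worker's Comp.": 503,
    "Self-Pay": 2176,
    "All Other": 2029,
}

# Total number of visits that carry a payee type.
total = sum(payee_counts.values())

# Column percentage = count / total, expressed in percent.
col_pct = {name: 100.0 * n / total for name, n in payee_counts.items()}

for name, n in payee_counts.items():
    print(f"{name:<18} {n:>6} {col_pct[name]:5.1f}%")
print(f"{'Total':<18} {total:>6}")
```

These percentages are the "expected" values against which each practice type's patient load is compared in the sections that follow.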
Each payee type is given as a percentage of the overall study population. If payee type
and practice type were unrelated, each payee type would be near the same percentage of
each practice type’s patient load, but this is not always the case. Many of the payee types
were associated with particular practice types; some of these preferences are expected
and some are not. Although statistical methods can be used to
investigate specific hypotheses, the primary purpose of data mining is to generate
hypotheses to be examined for validity either with fresh data or by withholding a portion
of the initial dataset for investigation. The first investigation examined all of the cases
and the relationships between payee type and the number of visits to a particular practice
type.
The DFEPP plot below gives an indication of this relationship:
Workers Compensation by Physician Specialty (Plot 1)
4.1.1 Workers Compensation
Workers compensation payees were 2% of all payee types and, if there were no
relationship between payee type and practice type, would be expected to be near 2% of
patient visits to each type of practice. The workers compensation payees go to orthopedic
surgeons at a much higher rate than the expected 2% of visits. They were 18.1% of the 1222
orthopedic surgeons’ patients (see Table 8 for the number of visits per practice), and under
the null hypothesis (the actual share of visits equals the expected 2%) a difference this
large has a probability of less than 0.0005 (χ2=1616.1, p<0.0005), showing that these
payees go to orthopedic surgeons at a significantly higher rate. This is not completely
unexpected since people go to orthopedic surgeons for breaks and bruises and these are
the main types of injuries that occur at work. By using the filter and general
tables/frequency features in SPSS 10, the reasons that workers compensation payees
visited orthopedic surgeons can be determined.
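The test reported above can be sketched in a few lines of Python. The formula used by the DFEPP program is not restated in this chapter, so the sketch assumes a goodness-of-fit test with two cells (this payee type versus all other payee types) and one degree of freedom, applied to the rounded percentages quoted in the text and to the orthopedic surgery visit count of 1222 from Table 8. The function names are hypothetical. Under these assumptions the sketch gives values that agree with the χ2 figures quoted in this section.

```python
import math


def chi_square_vs_expected(actual_pct, expected_pct, n_visits):
    """Two-cell goodness-of-fit statistic with 1 degree of freedom.

    The two cells are "this payee type" and "all other payee types".
    """
    p = actual_pct / 100.0
    p0 = expected_pct / 100.0
    observed = (n_visits * p, n_visits * (1.0 - p))
    expected = (n_visits * p0, n_visits * (1.0 - p0))
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Workers compensation at orthopedic surgery: 18.1% of 1222 visits, 2.0% expected.
chi2_ortho = chi_square_vs_expected(18.1, 2.0, 1222)
# Workers compensation at neurology: 4.1% of 703 visits, 2.0% expected.
chi2_neuro = chi_square_vs_expected(4.1, 2.0, 703)

for label, chi2 in (("Orthopedic surgery", chi2_ortho), ("Neurology", chi2_neuro)):
    print(f"{label}: chi-square = {chi2:.2f}, p < 0.0005: {p_value_1df(chi2) < 0.0005}")
```

Because the statistic has a single degree of freedom, the p-value can be obtained from the complementary error function in the standard library, so no statistics package is needed for this check.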
Workers Compensation ICD-9 Grouped Codes for Orthopedic Visits (Table 2)
Workers Comp. to Orthopedic Surgeons
Count      %   ICD-9 Code Category
    1   0.5%   140-239 Neoplasms
    1   0.5%   240-279 Endocrine, nutritional and metabolic diseases, and immunity disorders
   18   8.1%   320-389 Diseases of the nervous system and sense organs
    3   1.4%   680-709 Diseases of the skin and subcutaneous tissue
   70  31.7%   710-739 Diseases of the musculoskeletal system and connective tissue
    1   0.5%   780-799 Symptoms, signs, and ill-defined conditions
  109  49.3%   800-999 Injury and poisoning
   18   8.1%   V - Supplementary classification of factors influencing health status and contact with health services
The preceding table is somewhat vague. Using a less broad categorical variable for
the ICD-9-CM (International Classification of Diseases, 9th Revision, Clinical Modification)
codes, the following table gives a better understanding of why the workers compensation
payees went to orthopedic surgeons.
Workers Compensation ICD-9 Codes for Orthopedic visits (Table 3)
ICD-9-CM Codes for Workers Comp. to Orthopedic Surgeons
Count      %   Code     Description
    1   0.5%   00       Intestinal infectious diseases
    1   0.5%   215.3    Benign neoplasms, lower limb, including hip
    1   0.5%   278.0    Obesity
    1   0.5%   337.21   Reflex sympathetic dystrophy of the upper limb
   17   7.7%   35x.xx   Carpal tunnel syndrome (13), Lesion of ulnar nerve (2), Lesion of ulnar nerve (1), & Mononeuritis (1)
    2   0.9%   68x.xx   Diseases of the skin and subcutaneous tissue
    1   0.5%   70x.xx
   26  11.8%   71x.xx   Diseases of the musculoskeletal system and connective tissue
   43  19.5%   72x.xx
    1   0.5%   73x.xx
    1   0.5%   79x.xx   Ill-defined and unknown causes of morbidity and mortality
    1   0.5%   80x.xx   Fractures
   13   5.9%   81x.xx
   16   7.2%   82x.xx
   12   5.4%   83x.xx   Dislocations
   50  22.6%   84x.xx   Sprains and strains of joints and adjacent muscles
    1   0.5%   87x.xx   Open wound
    5   2.3%   88x.xx
    1   0.5%   905.9    Late effect of traumatic amputation
    5   2.3%   92x.xx   Contusion with intact skin surface or crushing injury
    4   1.8%   95x.xx   Injury to nerves and spinal cord
    1   0.5%   996.6    Infection and inflammatory reaction due to internal prosthetic device, implant, and graft
    5   2.3%   V1       Persons with potential health hazards related to personal and family history
    6   2.7%   V4       Persons with a condition influencing their health status
    1   0.5%   V5       Persons encountering health services for specific procedures and aftercare
    4   1.8%   V6       Persons encountering health services in other circumstances
    1   0.5%   V9       Missing
Initially this table was created using the first 2 characters in the ICD-9-CM codes but
categories that had just a few visits could be better described by extracting the full
ICD-9-CM code from a complete non-abbreviated table of workers compensation visits
to orthopedic surgeons. The table shows that 73.3% of the visits were for sprains, strains,
breaks, and bruises while 9.5% of the visits were for nerve damage (7.7% carpal tunnel,
1.8% nerve/spinal cord damage), 4.1% were for cuts, and 13.1% for all other.
The workers compensation payees also go to neurologists at a higher rate than the
expected 2% of visits. They comprised 4.1% of the 703 neurology patients, and under the
null hypothesis (actual % = 2% expected) a difference this large has a probability of less
than 0.0005 (χ2=15.82, p<0.0005). The null hypothesis is therefore rejected: workers
compensation patients go to neurologists at a significantly higher rate than expected. By
filtering the data and then tabulating it, the reason the workers compensation payees
visited this practice type can be determined. The following frequency table of workers
compensation payees going to neurology visits shows why they went:
Workers Compensation ICD-9 Grouped Codes for Neurology Visits (Table 4)
Workers Comp. to Neurologists
Count      %   ICD-9 Code Category
    2   6.9%   290-319 Mental disorders
    3  10.3%   320-389 Diseases of the nervous system and sense organs
    9  31.0%   710-739 Diseases of the musculoskeletal system and connective tissue
    5  17.2%   780-799 Symptoms, signs, and ill-defined conditions
   10  34.5%   800-999 Injury and poisoning
Again, for such a small number of cases, a more in-depth analysis can be done by
comparing the complete ICD-9-CM code of each patient’s visit.
Workers Compensation ICD-9 Codes for Neurology Visits (Table 5)
Physician's diagnoses for Workers' Comp. Neurology visits
Count      %   Code    Diagnosis
    2   6.9%   3102-   Concussion
    1   3.4%   3530-   Nerve Dmg
    2   6.9%   3540-   Carpal Tunnel
    1   3.4%   72210   Back Injuries
    1   3.4%   72280
    1   3.4%   7231-
    2   6.9%   7242-
    1   3.4%   7244-
    1   3.4%   7245-
    1   3.4%   7292-   Soft Tissue Dmg
    1   3.4%   7299-
    1   3.4%   7803-   Convulsions
    4  13.8%   7820-   Disturbance of skin sensation
    4  13.8%   8471-   Sprains and strains of joints and adjacent muscles
    3  10.3%   8472-
    1   3.4%   8479-
    1   3.4%   8489-
    1   3.4%   8840-   Upper Limb Wound
The table shows that 55.1% of the visits were for sprains, strains, breaks, and bruises.
24.1% of the visits were for nerve damage (6.9% carpal tunnel, 17.2% nerve/spinal cord
damage), 10.3% were for cuts, and 10.3% for all other. The workers compensation
payees go to orthopedic surgeons for many of the same reasons but at somewhat different
proportions.
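The relationship between the full codes of Table 5 and the grouped categories of Table 4 can be illustrated with a short Python sketch. It assumes that each code is stored as a five-character string whose first three digits give the ICD-9 category (for example "8471-" for 847.1), which is how the codes appear in Table 5. Only the chapter ranges that occur in Tables 2 and 4 are listed, and the function and variable names are hypothetical. The sketch also groups by the first two characters, as described for the first draft of Table 3.

```python
from collections import Counter

# ICD-9 chapter ranges that appear in Tables 2 and 4: (low, high, label).
CHAPTERS = [
    (140, 239, "140-239 Neoplasms"),
    (240, 279, "240-279 Endocrine, nutritional and metabolic diseases"),
    (290, 319, "290-319 Mental disorders"),
    (320, 389, "320-389 Diseases of the nervous system and sense organs"),
    (680, 709, "680-709 Diseases of the skin and subcutaneous tissue"),
    (710, 739, "710-739 Diseases of the musculoskeletal system"),
    (780, 799, "780-799 Symptoms, signs, and ill-defined conditions"),
    (800, 999, "800-999 Injury and poisoning"),
]


def icd9_chapter(code):
    """Map a five-character code such as '8471-' or '72210' to a chapter label."""
    if code.upper().startswith("V"):
        return "V Supplementary classification"
    category = int(code[:3])  # the first three digits identify the category
    for low, high, label in CHAPTERS:
        if low <= category <= high:
            return label
    return "Other"


# Code -> number of visits, copied from Table 5.
table5 = {
    "3102-": 2, "3530-": 1, "3540-": 2, "72210": 1, "72280": 1, "7231-": 1,
    "7242-": 2, "7244-": 1, "7245-": 1, "7292-": 1, "7299-": 1, "7803-": 1,
    "7820-": 4, "8471-": 4, "8472-": 3, "8479-": 1, "8489-": 1, "8840-": 1,
}

by_chapter = Counter()  # grouping used for Table 4
by_prefix = Counter()   # grouping by the first two characters of the code
for code, count in table5.items():
    by_chapter[icd9_chapter(code)] += count
    by_prefix[code[:2]] += count

total = sum(table5.values())
for label, count in sorted(by_chapter.items()):
    print(f"{count:>3} {100.0 * count / total:5.1f}%  {label}")
```

Grouping by chapter gives the broad picture of Table 4, while the two-character prefix keeps related diagnoses together (for example all 84x sprains and strains) without listing every individual code.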
There were many other practice types that had significantly lower percentages of
workers compensation visits but this is not unexpected. Children are not generally
involved with work and therefore would not use workers compensation to pay for
pediatric visits. OB/GYN and urology visits would also rarely be paid for by workers
compensation. Excluding the visits to these practices and to the practices with a
disproportionately higher percentage of visits gives a better representation of how the
other types of practices are visited by workers compensation payees. Visits to the
remaining practice types are distributed by payee type as follows:
New Proportions after WC Orthopedic Surgeon Visits are removed (Table 6)
Payee Type           Count       %
Private Insurance     7962   47.0%
Medicare              4483   26.5%
Medicaid              1098    6.5%
Worker's Comp.         251    1.5%
Self-Pay              1753   10.3%
All Other             1393    8.2%
The workers compensation payees are reduced to 1.5% of the patient population. When
the 1.5% value is used as the expected value to analyze the data with the DFEPP
program, the following plot is generated:
Adjusted plot after removal (Plot 2)
The plot shows that there is not much deviation from the expected percentage but there
are significant differences when the chi square values are considered. Visits to ‘Other’
physicians show the greatest deviation from the expected value with a significantly
higher number than expected visits. ‘Other’ physicians treated workers compensation
payees for a variety of reasons but mainly for the same reasons as the orthopedic surgeon
visits: sprains, strains, breaks, cuts, and bruises (see table 7).
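The adjusted expected percentage of 1.5% can be reproduced from Tables 6 and 8. The excluded practices are not listed explicitly in the text, so the sketch below assumes they are pediatrics, OB/GYN, and urology (rarely paid for by workers compensation) together with orthopedic surgery and neurology (disproportionately high). This assumption is consistent with the total number of visits in Table 6. The variable names are illustrative.

```python
# Visits per practice type (the N column of Table 8).
visits_by_practice = {
    "General and family practice": 3834,
    "Internal medicine": 2358,
    "Pediatrics": 2651,
    "General surgery": 1270,
    "Obstetrics and gynecology": 2022,
    "Orthopedic surgery": 1222,
    "Cardiovascular disease": 1418,
    "Dermatology": 1409,
    "Urology": 1072,
    "Psychiatry": 1461,
    "Neurology": 703,
    "Ophthalmology": 1437,
    "Otolaryngology": 1175,
    "All other": 2578,
}

# Assumed set of excluded practices (see the paragraph above).
excluded = {
    "Pediatrics",
    "Obstetrics and gynecology",
    "Urology",
    "Orthopedic surgery",
    "Neurology",
}

remaining_visits = sum(
    n for practice, n in visits_by_practice.items() if practice not in excluded
)

# Workers compensation visits left in the remaining practices (Table 6).
wc_remaining = 251

expected_pct = 100.0 * wc_remaining / remaining_visits
print(f"Remaining visits: {remaining_visits}")
print(f"Adjusted expected percentage: {expected_pct:.1f}%")
```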
Workers’ Comp “Other” Physician Visits (Table 7)
ICD-9-CM Code  Count  Col %     ICD-9-CM Code  Count  Col %
1119-              1   1.2%     81600              2   2.5%
25000              1   1.2%     8360-              1   1.2%
33720              1   1.2%     8404-              1   1.2%
33722              1   1.2%     8409-              2   2.5%
3540-              2   2.5%     8449-              1   1.2%
37205              1   1.2%     8460-              2   2.5%
49390              1   1.2%     8469-              1   1.2%
515--              1   1.2%     8470-              3   3.7%
55092              1   1.2%     8471-              2   2.5%
71885              1   1.2%     8472-              3   3.7%
71943              1   1.2%     8489-              3   3.7%
7210-              1   1.2%     8793-              1   1.2%
7217-              1   1.2%     8820-              1   1.2%
72210              4   4.9%     8830-              2   2.5%
72252              1   1.2%     8860-              1   1.2%
72280              1   1.2%     9064-              1   1.2%
7234-              1   1.2%     9069-              1   1.2%
72400              1   1.2%     9248-              3   3.7%
7242-              2   2.5%     9300-              1   1.2%
7244-              1   1.2%     9404-              1   1.2%
7245-              4   4.9%     94420              1   1.2%
7246-              1   1.2%     9556-              1   1.2%
7248-              1   1.2%     9594-              1   1.2%
72632              2   2.5%     9595-              1   1.2%
7294-              1   1.2%     V135-              1   1.2%
75612              1   1.2%     V155-              1   1.2%
7804-              1   1.2%     V583-              1   1.2%
7809-              1   1.2%     V6759              1   1.2%
7820-              1   1.2%     V703-              1   1.2%
                                V990-              1   1.2%
Row group labels: Disease Related; Breaks, Bruises, Strains, Sprains, Cuts; Personal History; Follow-up; Blank
Psychiatry and general surgery practices were also visited at a significantly higher
rate than the expected 1.5% visit rate for the workers compensation payees. The main
reason for psychiatric visits was depression (74%). General surgery visits tended to be for
breaks, bruises, strains, and sprains (43%) and for cuts, burns, and other wounds (33.3%).
Notably there are significantly lower numbers of visits to ophthalmology
(0.3% actual vs. 1.5% expected) and otolaryngology (0.1% actual vs. 1.5% expected)
practices. This may show that the OSHA (Occupational Safety & Health Administration,
http://www.osha.gov/) rules guarding against vision and hearing loss work effectively to
reduce such injuries. The dermatology visits were also significantly lower (0.1% actual vs.
1.5% expected) but many burns and other skin problems were treated by general surgery
practices. This would explain, in part, the significantly higher number of general surgery
visits and the correspondingly lower number of dermatology visits.
With the workers compensation payees being 18.1% of the orthopedic surgery
visits and only 2.0% of the total population, the other payee types will tend to show
fewer visits to orthopedic surgeons than expected. A lower than expected number of visits
to orthopedic surgeons by another payee type is therefore less meaningful than the DFEPP
plots suggest. Although there were other significant disproportions, the workers
compensation payee visits were too few to create any major disproportion in other payee
types’ visits to the various practices.
4.1.2 Medicare
Medicare payees were 21.9% of all payee types and, if they showed no preference,
would be expected to be near 21.9% of patient visits to each type of practice. This is not
the case. Medicare patients went to cardiologists, urologists, ophthalmologists, and
internal medicine practices at significantly higher rates.
Medicare by Physician Specialty (Plot 3)
Under the null hypothesis that Medicare payees were the expected 21.9% of the visit load
of each practice, the probability of each of the observed differences is less than 0.0005.
They were 53.4% of 1418 visits to cardiologists (χ2=822.63, p<0.0005), 38.9% of 1072 to
urologists (χ2=181.13, p<0.0005), 38.6% of 1437 to ophthalmologists (χ2=234.31,
p<0.0005), and 33.1% of the 2358 visits for internal medicine (χ2=172.94, p<0.0005). All
but the internal medicine visits are expected. The Medicare population consists of retired
or disabled individuals. The average age of the Medicare population is 71.6 years with a
standard deviation of 13.01 years. These types of practices treat heart problems, eyesight,
and urinary problems, and these are the problems occurring in an older population.
Age statistics by Physician Type (Table 8)
AGE
Physician Specialty            Mean       N   Std. Deviation
General and family practice   42.73    3834   23.33
Internal medicine             55.09    2358   20.06
Pediatrics                     5.34    2651    7.33
General surgery               49.67    1270   20.47
Obstetrics and gynecology     35.82    2022   13.91
Orthopedic surgery            45.55    1222   21.91
Cardiovascular disease        65.41    1418   15.26
Dermatology                   46.38    1409   22.49
Urology                       57.25    1072   20.16
Psychiatry                    43.29    1461   16.96
Neurology                     46.23     703   21.83
Ophthalmology                 58.65    1437   22.51
Otolaryngology                39.84    1175   24.81
All other                     52.26    2578   19.62
Total                         43.89   24610   24.81
The average age of the entire population is 43.89 years with a standard deviation of
24.81 years. The Medicare payees are significantly older; therefore they will
disproportionately visit those practices. The significantly higher number of internal
medicine visits by this population is harder to explain. Table 9 shows why Medicare and
non-Medicare payees visited internal medicine practices:
ICD-9 Codes tabled by Medicare Use (Table 9)
Uses Medicare: False   Uses Medicare: True
Count      %           Count      %          ICD-9 Code Category
   47   3.0%               8   1.0%          Infectious and parasitic diseases
   17   1.1%              10   1.3%          Neoplasms
  159  10.1%              86  11.0%          Endocrine, nutritional and metabolic diseases, and immunity disorders
   12   0.8%               5   0.6%          Diseases of the blood and blood-forming organs
   47   3.0%              13   1.7%          Mental disorders
   68   4.3%              26   3.3%          Diseases of the nervous system and sense organs
  208  13.2%             247  31.7%          Diseases of the circulatory system
  246  15.6%              76   9.7%          Diseases of the respiratory system
   55   3.5%              22   2.8%          Diseases of the digestive system
   62   3.9%              27   3.5%          Diseases of the genitourinary system
    1   0.1%               1   0.1%          Complications of pregnancy, childbirth, and the puerperium
   54   3.4%              12   1.5%          Diseases of the skin and subcutaneous tissue
  142   9.0%              63   8.1%          Diseases of the musculoskeletal system and connective tissue
    1   0.1%                                 Congenital anomalies
  158  10.0%              85  10.9%          Symptoms, signs, and ill-defined conditions
   85   5.4%              21   2.7%          Injury and poisoning
  216  13.7%              78  10.0%          Supplementary classification of factors influencing health status
Medicare payees went to internal medicine practices for diseases of the circulatory
system at a very disproportionate rate. Diseases of the circulatory system accounted for
13.2% of the internal medicine visits by non-Medicare payees but 31.7% of the internal
medicine visits by Medicare payees. The other categorical reasons for the visits to this
practice by Medicare and non-Medicare payees were not that different. When the number of
Medicare visits for diseases of the circulatory system is reduced to the non-Medicare
percentage rate, the Medicare share of internal medicine visits falls to 28.7%. The reduced
visits give a new χ2 value of 59.8 (p<0.0005), showing that there was still a significantly
higher number of visits to this practice type by Medicare payees.
Other practices were visited at significantly lower rates: Medicare payees were 0.8% of
2651 visits to pediatricians (χ2=690.05, p<0.0005) and 4.7% of 2022 visits to OB/GYNs
(χ2=349.74, p<0.0005). The Medicare visits to pediatricians are probably due to recording
errors. The low rate of visits to OB/GYNs for Medicare payees is not an unexpected result.
The average age of OB/GYN patients is 35.82 years with a standard deviation of 13.91 years.
The average age for Medicare patients is 71.6 years, which is over two standard
deviations above the average OB/GYN patients’ age.
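The internal medicine adjustment described above can be traced with the counts in Table 9. The exact procedure is not spelled out in the text, so the following Python sketch rests on one assumed reading: the Medicare circulatory visits are replaced by the non-Medicare circulatory rate applied to the original number of Medicare internal medicine visits, and the χ2 statistic is then computed as a goodness-of-fit test with two cells on the rounded percentage. The variable names are illustrative. Under this reading the adjusted share matches the reported 28.7% and the statistic comes out close to the reported 59.8.

```python
# Internal medicine visits by ICD-9 category, copied from Table 9
# (same row order as the table; index 6 is "Diseases of the circulatory system").
non_medicare = [47, 17, 159, 12, 47, 68, 208, 246, 55, 62, 1, 54, 142, 1, 158, 85, 216]
medicare = [8, 10, 86, 5, 13, 26, 247, 76, 22, 27, 1, 12, 63, 0, 85, 21, 78]
CIRCULATORY = 6

non_medicare_total = sum(non_medicare)
medicare_total = sum(medicare)

# Circulatory share of non-Medicare internal medicine visits (about 13.2%).
non_medicare_rate = non_medicare[CIRCULATORY] / non_medicare_total

# Assumption: scale Medicare circulatory visits down to the non-Medicare rate.
adjusted_circulatory = round(non_medicare_rate * medicare_total)
adjusted_medicare = medicare_total - medicare[CIRCULATORY] + adjusted_circulatory
adjusted_all = non_medicare_total + adjusted_medicare

adjusted_pct = 100.0 * adjusted_medicare / adjusted_all

# Two-cell goodness-of-fit statistic against the expected 21.9%,
# using the rounded percentage as quoted in the text.
p = round(adjusted_pct, 1) / 100.0
p0 = 0.219
chi2 = adjusted_all * (p - p0) ** 2 / (p0 * (1.0 - p0))

print(f"Adjusted Medicare share: {adjusted_pct:.1f}%")
print(f"Chi-square after adjustment: {chi2:.1f}")
```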
With the Medicare payees responsible for 53.4% of the visits to cardiologists and
only 21.9% of the total population, the other payee types will tend to show fewer visits
to cardiologists than expected, and such a shortfall is less meaningful than the DFEPP
plots suggest. Urology and ophthalmology visits by Medicare payees were also
significantly higher than expected, though to a lesser degree, and will also skew downward
the rates of other payee types’ visits to these practices.
4.1.3 Medicaid
Medicaid payees accounted for 7.9% of all payee types and, if they showed no
preference, would be expected to be near 7.9% of patient visits to each type of practice.
The Medicaid payees go to pediatricians at a much higher rate than the expected 7.9% of
visits. They were responsible for 20.0% of the 2651 pediatric visits, and under the null
hypothesis (the actual share of visits equals the expected 7.9%) a difference this large
has a probability of less than 0.0005 (χ2=533.45, p<0.0005), showing that these payees
go to pediatricians at a significantly higher rate.
Medicaid by Physician Specialty (Plot 4)
The higher rate of Medicaid payees to pediatricians is not unexpected. The average age
for Medicaid payees is 27.47 years with a standard deviation of 24.43 years. The
distribution for this population is not normal, as Plot 5 shows.
Distribution of Medicaid population (Plot 5)
[Histogram of AGE for Medicaid payees: Mean = 27.5, Std. Dev = 24.43, N = 1945]
The distribution of Medicaid payees is skewed towards the younger ages and it is the
youngest of all payee types.
Age Statistics by Payee Type (Table 10)
AGE
Payee Type            Mean       N   Std. Deviation
Private Insurance    36.22   12562   21.24
Medicare             71.61    5395   13.01
Medicaid             27.47    1945   24.43
Worker's Comp.       41.23     503   12.91
Self-Pay             36.88    2176   19.34
All Other            41.59    2029   22.02
Total                43.89   24610   24.81
95% of the pediatric visits were by patients 20 years or younger (plot 6) and since the
Medicaid population is the youngest, it would carry a disproportionately higher rate of
visits.
Age of Pediatric Patients (Plot 6)
[Histogram of AGE of pediatric patients: Mean = 5.3, Std. Dev = 7.33, N = 2651]
There were other practices that the Medicaid payees visited at lower than expected rates.
By removing the pediatric visits, it can be determined how the Medicaid population
visited the other practices. A new expected percentage of 6.4% of Medicaid payees is
used to re-evaluate the data with the DFEPP plot.
Modified Medicaid by Physician Specialty (Plot 7)
Urology and orthopedic surgery practices were visited at lower than expected rates but
this is merely a reflection of the disproportionately higher visits by the Medicare and
workers compensation payees to these practice types respectively. The OB/GYN visits
are significantly higher than expected, but this population contains a greater percentage
of women of child bearing age and, with the significantly lower number of Medicare
patients attending this practice, a higher than expected share of OB/GYN visits should
result. Surprisingly the visits to dermatologists by the Medicaid payees are significantly
lower than expected. Many people believe that dermatology patients are mainly children
with acne problems. Plot 8 shows how the dermatology visits are distributed by age:
Age of Dermatology Patients (Plot 8)
[Histogram of AGE of all dermatologists' patients: Mean = 46.4, Std. Dev = 22.49, N = 1409]
The average age of dermatology patients is 46.4 years with a standard deviation of 22.49
years. The population of dermatology patients is far older than Medicaid payees and
would therefore have fewer Medicaid payees.
Although there was a disproportionately higher number of pediatric visits in the
Medicaid population, the lack of visits in the Medicare population will offset the higher
rate in this population giving the remaining payee types the potential to have near their
expected distribution for pediatric visits. The other practices of the Medicaid population
showed preferences that are merely a reflection of other payee types disproportionately
visiting those practices.
4.1.4 Self-Pay
Self-Pay payees were 8.8% of all payee types and if they showed no preference,
would be expected to be near 8.8% of patient visits to each type of practice. The Self-Pay
payees go to psychiatrists at a much higher rate than expected.
Self Pay by Physician Specialty (Plot 9)
They represented 26.0% of the 1461 psychiatric patients, and under the null hypothesis
(the actual share of visits equals the expected 8.8%) a difference this large has a
probability of less than 0.0005 (χ2=538.55, p<0.0005), showing that these payees go to
psychiatrists at a significantly higher rate. This presents two possibilities: that the
uninsured have more problems that require psychiatric visits, or that insurance will not
pay for psychiatric visits. The first possibility is hard to explore but the second can be
explored indirectly. There was no variable to determine if someone was insured but there
was a variable to determine if the patient was a member of an HMO. Overall 25.1% of
patients were HMO members but only 4.4% of self-pay payees were members of an HMO. Among
the self-pay psychiatric visits, 8.4% of the patients were members of an HMO. This shows
that the self-pay patients going to psychiatrists were more apt to be HMO members than the
self-pay population as a whole and therefore were more apt to have insurance. Another
consideration is that people who did not pay for these visits with insurance may not have
marked down whether or not they belonged to an HMO. Regardless, self-pay payees do visit
psychiatrists at a significantly higher rate and at least 8.4% of these visits were by
insured patients.
Self-pay payees also visited dermatologists at a significantly higher rate. They
represented 16.9% of the 1409 dermatology patients, and under the null hypothesis (the
actual share of visits equals the expected 8.8%) a difference this large has a probability
of less than 0.0005 (χ2=115.19, p<0.0005). Only 5.0% of the dermatology self-pay payees
were members of an HMO, which is not significantly higher than the overall 4.4%; however,
there is another way to evaluate the HMO data. There were four different responses to the
HMO membership question: Yes, No, Unknown, and Blank.
HMO Membership by Physician Specialty (Table 11)
Does the patient belong to an HMO? (Row %)
Physician Specialty            Yes       No   Unknown   Blank
General and family practice   2.2%    77.1%    19.6%    1.1%
Internal medicine             7.0%    84.2%     8.8%
Pediatrics                    3.0%    89.1%     6.7%    1.2%
General surgery               1.6%    90.5%     7.9%
Obstetrics and gynecology     1.5%    92.6%     5.9%
Orthopedic surgery            5.0%    82.5%     2.5%   10.0%
Cardiovascular disease                95.7%     4.3%
Dermatology                   5.0%    72.3%    22.3%    0.4%
Urology                               96.4%     3.6%
Psychiatry                    8.4%    54.7%    35.5%    1.3%
Neurology                             89.1%    10.9%
Ophthalmology                 3.9%    76.6%    18.0%    1.6%
Otolaryngology                6.1%    89.0%     3.7%    1.2%
All other                     5.4%    59.5%    35.1%
The “Unknown” responses are more than likely insured patients that do not know if they
have an HMO plan, not uninsured that are unsure if they have an HMO (insurance) plan.
By collecting the “Yes” and “Unknown” responses into an “Insured” response and the
“No” responses into a “Possible but No HMO” response, the patients who are likely insured
and those who are possibly insured can be identified, yielding the following distribution
(Table 12):
Has Insurance by Physician Specialty (Table 12)
Has Insurance (Row %)
Physician Specialty           Insured   Possible but No HMO   Blank
General and family practice    21.8%          77.1%            1.1%
Internal medicine              15.8%          84.2%
Pediatrics                      9.7%          89.1%            1.2%
General surgery                 9.5%          90.5%
Obstetrics and gynecology       7.4%          92.6%
Orthopedic surgery              7.5%          82.5%           10.0%
Cardiovascular disease          4.3%          95.7%
Dermatology                    27.3%          72.3%            0.4%
Urology                         3.6%          96.4%
Psychiatry                     43.9%          54.7%            1.3%
Neurology                      10.9%          89.1%
Ophthalmology                  21.9%          76.6%            1.6%
Otolaryngology                  9.8%          89.0%            1.2%
All other                      40.5%          59.5%
Total                          24.3%          74.8%            0.9%
The modified distribution gives a better idea of which patients went to the various
practices with insurance. On average, at least 24.3% of self-pay payees had insurance but
visits to psychiatrists had a much higher rate of insured self-pay patients (43.9%)
showing possibly that insurance companies tend not to cover psychiatric services and the
patients have to pick up the cost. The dermatology patients also show a
disproportionately higher number of self-pay payees but their insured percentage is not
that different from the rest of the self-payees.
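The recoding that turns the four HMO responses of Table 11 into the three “Has Insurance” categories of Table 12 can be sketched as follows. The mapping is the one described in the text; the function and variable names are hypothetical, and the example row is the psychiatry row of Table 11.

```python
# Mapping from the HMO responses to the "Has Insurance" categories:
# "Yes" and "Unknown" are collected into "Insured", "No" becomes
# "Possible but No HMO", and blank responses stay blank.
RECODE = {
    "yes": "Insured",
    "unknown": "Insured",
    "no": "Possible but No HMO",
    "blank": "Blank",
}


def has_insurance_row(hmo_row_pct):
    """Combine one row of HMO row percentages into the recoded categories."""
    combined = {"Insured": 0.0, "Possible but No HMO": 0.0, "Blank": 0.0}
    for response, pct in hmo_row_pct.items():
        combined[RECODE[response]] += pct
    return combined


# Psychiatry row of Table 11 (row percentages).
psychiatry = {"yes": 8.4, "no": 54.7, "unknown": 35.5, "blank": 1.3}
recoded = has_insurance_row(psychiatry)

for category, pct in recoded.items():
    print(f"{category:<20} {pct:5.1f}%")
```

Applied to the psychiatry row, the "Insured" share is the sum of the "Yes" and "Unknown" percentages, which is the 43.9% discussed above.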
4.1.5 Privately Insured
The privately insured went to many practice types at highly disproportionate rates.
They were the second youngest population within this study, and would be expected to
favor certain practices.
Privately Insured by Physician Specialty (Plot 10)
They were not older and would not tend to see cardiologists for heart disease or
ophthalmologists for failing eyesight, but they were young enough to be of child bearing
age and would tend to see OB/GYNs and pediatricians. The disproportionate visits to
cardiologists, OB/GYNs, ophthalmologists, and pediatricians within the privately insured
population are nearly in inverse proportion to the Medicare population’s visits for these
four practices. Otolaryngology is the only practice with an unexplained disproportionately
higher rate of visits in the privately insured population. The privately insured go to
otolaryngologists at a much higher rate than the expected 51.0% of otolaryngology visits.
They were responsible for 63.3% of the 1175 otolaryngology patients, and under the null
hypothesis (the actual share of visits equals the expected 51.0%) a difference this large
has a probability of less than 0.0005 (χ2=71.13, p<0.0005), showing that these payees go
to otolaryngologists at a significantly higher rate. Other practice types outside of the
five mentioned also showed statistically significant differences, but none deviated from
the expected percentage by more than 10% and they are not analyzed.
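The 10% cutoff mentioned above can be expressed as a simple screening rule. The sketch below assumes that the cutoff refers to the difference, in percentage points, between the actual and the expected percentage; the names and the selection of examples are illustrative, with all percentages taken from the text of this chapter.

```python
# Assumed cutoff: a difference of more than 10 percentage points.
THRESHOLD = 10.0

# (payee type / practice, actual %, expected %), as quoted in this chapter.
examples = [
    ("Privately insured / Otolaryngology", 63.3, 51.0),
    ("Medicare / Cardiology", 53.4, 21.9),
    ("Medicaid / Pediatrics", 20.0, 7.9),
    ("Self-pay / Psychiatry", 26.0, 8.8),
    ("Self-pay / Dermatology", 16.9, 8.8),
    ("Workers compensation / Neurology", 4.1, 2.0),
]

results = {}
for label, actual, expected in examples:
    difference = actual - expected  # difference from expected percentage
    results[label] = abs(difference) > THRESHOLD
    status = "beyond cutoff" if results[label] else "within cutoff"
    print(f"{label:<36} {difference:+6.1f} points  {status}")
```

A difference can be statistically significant by the χ2 test and still fall within the cutoff, as the self-pay dermatology and workers compensation neurology visits show, so the two criteria answer different questions.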
4.1.6 All Other
The last payee type analyzed is the “All Other” payee. The all other payee type
consists of the no charge, other, unknown, and blank payee types. There were too few
visits in each of the subcategories (no charge, other, unknown, and blank payee types) to
effectively analyze but the combined “All Other” payee type had a sufficient number of
visits to analyze. The all other payees were 8.2% of the population and would be
expected to be near 8.2% of patient visits to each practice type.
All Other Payees by Physician Specialty (Plot 11)
The visits by the all other payee types to each of the practice types were all within 10% of
their expected percentage of 8.2%. Only the neurology visits approached the 10%
difference threshold used as a cutoff point. By using the HMO variable, the patients with
insurance can be extracted.
Has Insurance by Physician Specialty (Table 13)
Has Insurance (Row %)
Physician Specialty           Insured   Possible but No HMO   Blank
General and family practice    59.2%          35.0%            5.8%
Internal medicine              61.9%          31.9%            6.2%
Pediatrics                     78.0%          18.1%            4.0%
General surgery                25.3%          50.0%           24.7%
Obstetrics and gynecology      60.4%          37.9%            1.6%
Orthopedic surgery             52.0%          40.0%            8.0%
Cardiovascular disease         73.3%          19.8%            7.0%
Dermatology                    44.7%          48.5%            6.8%
Urology                        51.2%          41.5%            7.3%
Psychiatry                     49.1%          47.3%            3.6%
Neurology                      73.3%          25.0%            1.7%
Ophthalmology                  63.8%          28.9%            7.2%
Otolaryngology                 28.6%          63.3%            8.2%
All other                      58.0%          40.6%            1.3%
Total                          57.8%          35.8%            6.4%
Note that 57.8% of the all other payee type may have had some insurance. The variable
“Has Insurance” was previously defined in section 4.1.4. Similarly, 73.3% of the
neurology patients may have had insurance. This is significantly higher than the overall
57.8% average for this payee type, suggesting that insurance tends not to cover neurology
visits as well as the other practice types. Pediatric visits by this payee type also had a
significantly higher percentage of insured patients. By reviewing the ICD-9 codes, the
reasons patients went to the different practices can be determined.
Has Insurance by ICD-9 Codes/Pediatric (Table 14)
Physician Specialty: Pediatrics
4.5% 1.7%
.6% .6%
.6%
.6% .6%
10.7% 2.8%
16.9% 5.1% .6%
2.8% .6%
.6%
2.8% .6% .6%
.6%
.6%
2.8%
2.8% .6%
32.2% 5.1% 2.3%
Infectious and parasitic diseases
Neoplasms
Endocrine, nutritional and metabolic diseases, and immunity disorders
Diseases of the blood and blood-forming organs
Mental disorders
Diseases of the nervous system and sense organs
Diseases of the circulatory system
Diseases of the respiratory system
Diseases of the digestive system
Diseases of the genitourinary system
Complications of pregnancy, childbirth, and the puerperium
Diseases of the skin and subcutaneous tissue
Diseases of the musculoskeletal system and connective tissue
Congenital anomalies
Symptoms, signs, and ill-defined conditions
Injury and poisoning
Supplementary classification of factors influencing health status
ICD-9 Code Category
Has Insurance (Layer %): Insured, Possible but No HMO, Blank
As shown, 26.6% of the pediatric visits in the insured all other payee type were for
diagnosis V20.2 (routine infant or child health check), a subset of the 32.2% in
“Supplementary classification of factors influencing health”. This group also went for
diseases of the nervous system/sense organs (10.7%) (hearing loss/ear infections) and of
the respiratory system (16.9%) (sore throats/tonsillitis/colds).
All Pay Methods by Insurance/Pediatrics (Table 15)
Physician Specialty: Pediatrics
Has Insurance
Primary expected source     Insured          Possible but No HMO   Blank
of payment for the visit    Count  Row %     Count  Row %          Count  Row %
Private Insurance             924  52.6%       822  46.8%             11   0.6%
Medicare                        6  28.6%        15  71.4%
Medicaid                      137  25.8%       393  74.0%              1   0.2%
Worker's Compensation
Self-pay                       16   9.7%       147  89.1%              2   1.2%
No charge                       1  25.0%         3  75.0%
Other                         122  81.9%        25  16.8%              2   1.3%
Unknown                         8  72.7%         3  27.3%
Blank                           7  53.8%         1   7.7%              5  38.5%
When looking at the expanded list of pay methods, most of the all other pediatric visits
were paid for by “Other” means (149 of 177 visits), and 81.9% of those patients may have
had insurance. This could merely be families using local government funded health clinics
for pediatric visits.
Has Insurance by ICD-9 Codes/Neurology (Table 16)
Physician Specialty: Neurology
.8% .8%
1.7%
2.5% .8%
30.8% 5.0% 1.7%
.8% 1.7%
5.8% 5.0%
.8% .8%
17.5% 3.3%
.8% 3.3%
11.7% 4.2%
Infectious and parasitic diseases
Neoplasms
Endocrine, nutritional and metabolic diseases, and immunity disorders
Diseases of the blood and blood-forming organs
Mental disorders
Diseases of the nervous system and sense organs
Diseases of the circulatory system
Diseases of the respiratory system
Diseases of the digestive system
Diseases of the genitourinary system
Complications of pregnancy, childbirth, and the puerperium
Diseases of the skin and subcutaneous tissue
Diseases of the musculoskeletal system and connective tissue
Congenital anomalies
Symptoms, signs, and ill-defined conditions
Injury and poisoning
Supplementary classification of factors influencing health status
ICD-9 Code Category
Has Insurance (Layer %): Insured, Possible but No HMO, Blank
The 30.8% of the all other visits to neurologists that were for diseases of the nervous
system and sense organs were not covered by insurance even though the patients probably
had insurance; 17.5% went for symptoms, signs, and ill-defined conditions
(apnea/convulsions/nervous system injury) and 11.7% of the visits were for follow-ups
and paperwork.
All Pay Methods by Insurance/Neurology (Table 17)
Physician Specialty: Neurology

                                         Has Insurance
Primary expected source     Insured            Possible but No HMO    Blank
of payment for the visit    Count   Row %      Count   Row %          Count   Row %
Private Insurance             134   42.7%        180   57.3%
Medicare                       16   12.7%        109   86.5%              1     .8%
Medicaid                        3    5.1%         56   94.9%
Worker's Compensation          11   37.9%         18   62.1%
Self-pay                        6   10.9%         49   89.1%
No charge                                          2  100.0%
Other                          16   39.0%         25   61.0%
Unknown                        71   97.3%          2    2.7%
Blank                           1   25.0%          1   25.0%              2   50.0%
For a majority of the neurology visits in this payee type, the method of payment was
unknown. This may show a tendency for insurance not to cover neurological disorders,
leaving patients to pay for these problems themselves.
Although visits to neurologists by the “All Other” payee type do show significance,
this payee type has the least significant difference of all types.
Each payee type showed significant disproportions in the practices its patients visited.
Many were expected, but a few were not easily explained. The workers' compensation payees
went for bumps and bruises; the Medicare population went to practices that treat ailments
of older patients. The Medicaid population is very young and sees practices that serve
children and adults of childbearing age. The privately insured visited many practices
disproportionately, but most of the differences could be attributed to the other payee
types’ disproportions. Self-pay patients paid for psychiatry and dermatology visits at a
significantly higher rate, suggesting that these visits are not covered by insurance as
well as visits to other practices. The “All Other” payees went to neurologists at a
significantly higher rate, and for a majority of those visits the method of payment was
unknown.
4.2 HMOs
What do HMOs pay for? Who are members of HMOs? Are the practices visited
significantly different when compared to the non-HMO population? These are all
questions that can be answered by an analysis of this dataset.
Initially there were four different types of responses to the question of whether or
not the patient was a member of an HMO (yes, no, unknown, and left blank).
HMO Membership (Table 18)
Does the patient belong to an HMO?    Count    Col %
yes                                    6187    25.1%
no                                    15853    64.4%
unknown                                2242     9.1%
blank                                   328     1.3%
Assuming that the unknown and blank responses are distributed proportionately among the
yes and no responses, removing them gives an estimate of the real proportion of HMO
membership.
Adjusted HMO Membership (Table 19)
Does the patient belong to an HMO?    Count    Col %
yes                                    6187    28.1%
no                                    15853    71.9%
By using the adjusted figures, 28.1% of this population is aware that they are members
and 71.9% is aware that they are not.
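The adjustment behind Table 19 is simple arithmetic, and a minimal sketch in Python may make it concrete. The counts are those of Table 18; the function name adjusted_shares and its drop parameter are illustrative only and are not part of the SPSS analysis used in this thesis.

```python
# Counts of responses to the HMO membership question (Table 18).
counts = {"yes": 6187, "no": 15853, "unknown": 2242, "blank": 328}


def adjusted_shares(counts, drop=("unknown", "blank")):
    """Return each kept category's share after removing the dropped ones.

    Dropping a category and renormalizing is equivalent to assuming that
    its responses are spread proportionately over the kept categories.
    """
    kept = {k: v for k, v in counts.items() if k not in drop}
    total = sum(kept.values())
    return {k: v / total for k, v in kept.items()}


shares = adjusted_shares(counts)
for answer, share in shares.items():
    print(f"{answer}: {share:.1%}")
```

Run on the Table 18 counts, the sketch gives the 28.1% and 71.9% figures of Table 19.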
Who are members of HMOs? The younger patients (under 65 years) were more apt to
be members of HMOs than patients in the oldest two age groups (65-74 years and 75+).
HMO Membership Percent by Age (Plot 12)
[Plot: % of HMO membership (vertical axis, 0.00 to 45.00) by age group (Under 15 years, 15-24 years, 25-44 years, 45-64 years, 65-74 years, 75 years and over)]
In all, 28.1% of patients were members of an HMO but three of the age groups
significantly deviated from the expected proportion.
HMO Membership by Age Group (Plot 13)
The membership in the two older age groups is significantly lower than for other groups
(patients 75 years and over: 13.7% actual vs. 28.1% expected, χ²=281.11, p<0.0005;
patients 65-74 years: 17.5% actual vs. 28.1% expected, χ²=167.34, p<0.0005). The
youngest age group had a significantly higher rate of membership in HMOs (38.9%
actual vs. 28.1% expected, χ²=214.07, p<0.0005). A majority of the older two age groups
are eligible for Medicare.
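The comparisons of actual with expected percentages above can be illustrated with a one-degree-of-freedom chi-square goodness-of-fit test. The sketch below is an assumption about the form of the test, not a listing of the DFEPP program: the function name is hypothetical, and the counts in the usage lines are invented for illustration because the age-group sample sizes are not given in this section.

```python
import math


def chi_square_vs_expected(members, total, expected_rate):
    """Chi-square test of an observed member count against an expected rate.

    Two cells are compared (member, non-member), giving one degree of
    freedom. For one degree of freedom the upper-tail probability of the
    chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    expected_yes = total * expected_rate
    expected_no = total * (1.0 - expected_rate)
    observed_no = total - members
    chi2 = ((members - expected_yes) ** 2 / expected_yes
            + (observed_no - expected_no) ** 2 / expected_no)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical group: 137 members among 1,000 visits, 28.1% expected.
chi2, p_value = chi_square_vs_expected(137, 1000, 0.281)
print(f"chi-square = {chi2:.2f}, p < 0.0005: {p_value < 0.0005}")
```

Because the statistic grows with the group size, two groups with the same percentage gap can produce very different chi-square values, which is worth keeping in mind when comparing the values reported for the different groups.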
HMO Membership by Payee Type (Plot 14)
Surprisingly, the Medicare population does not have the lowest rate of HMO membership
(6.7% actual vs. 28.1% expected, χ²=1133.35, p<0.0005). Self-pay patients had the lowest
membership rate (5.5% actual vs. 28.1% expected, χ²=435.58, p<0.0005). This could be for
a variety of reasons: uninsured self-pay patients would not have an HMO membership, and
patients who were members but whose visit was not covered may not have been marked as
HMO members. The significantly lower rate for workers' compensation visits (11.7% actual
vs. 28.1% expected, χ²=44.46, p<0.0005) may be due to workers' compensation paying for
the visit rather than the patients’ private insurance; HMO coverage may therefore not
have been marked even if the patients were members. The Medicaid population is the
youngest population and should follow the younger age groups’ higher level of HMO
membership, but its membership rate is significantly lower than expected (14.6% actual
vs. 28.1% expected, χ²=164.62, p<0.0005). A reason for the low rate could be that
Medicaid programs are state run and only some of the states have HMO options. Also,
these data were collected in 1997, when the concept of Medicaid HMOs was not widely
implemented. The significantly higher rate for the privately insured (39.9% actual vs.
28.1% expected, χ²=7692.22, p<0.0005) is partially explained by the lack of HMO coverage
in the state-run programs, which reduces the expected average. The higher rate does show
that the privately insured are much more likely to have been members of an HMO than any
other insured type. The “All Other” payees show the greatest deviation from the expected
membership rate. They have a significantly higher rate of HMO membership (52.9% actual
vs. 28.1% expected, χ²=469.71, p<0.0005). Many of the reasons why people were in this
group were explored in the previous section (publicly funded family clinics, uncovered
neurology visits).
What do HMOs pay for?
HMO Membership by Physician Specialty (Plot 15)
Plot 15 shows the rates of HMO membership for each physician specialty. Of the visits to
pediatricians, 43.6% are by HMO members, significantly higher than the expected rate
(43.6% actual vs. 28.1% expected, χ²=297.16, p<0.0005). Pediatric patients are young and
would be expected to follow the higher membership rate of the younger age groups, but the
higher rate cannot be completely explained by this. Many of the lower than expected rates
can be attributed to age group preferences: the practice types preferred by the older age
groups (urology, cardiology, and ophthalmology) reflect those groups’ lower rate of
membership. OB/GYN visits are mainly by a younger population, so that rate would be
expected to be higher. There are other differences, but most of them can be correlated to
age group preferences.
Does any race favor HMOs? When each race is compared to the 28.1% baseline, the
Asian/Pacific Islander population has a significantly higher rate of HMO membership.
HMO Membership by Race (Plot 16)
The mean age of the Asian/Pacific Islander patients is not significantly different from
that of the other races, so age cannot explain the higher HMO rate. Another factor such
as location or culture may play a role.
Age Statistics by Race (Table 20)
AGE
RACE                       Mean        N    Std. Deviation
White                     44.42    19186             25.08
Black                     39.58     2154             24.42
Asian/Pacific Islander    41.62      700             23.91
Total                     43.86    22040             25.03
Distribution of Asian/Pacific Islander Age (Plot 17)
[Plot: AGE on the horizontal axis (0.0 to 90.0 in steps of 5.0); vertical axis from 0 to 60]
The age distribution of the Asian/Pacific Islander population is not skewed toward the younger ages.
Membership in an HMO differed greatly across the six age groups and three races. The
older the patient, the less likely they were to be a member of an HMO. Most of this is
due to the Medicare population’s lack of HMO membership. The youngest age group was most
likely to be a member of an HMO. Asian/Pacific Islander patients were more likely to be
HMO members than patients of other races. Privately insured and “All Other” payees had
the highest membership rates, while Medicare, Medicaid, self-pay, and workers'
compensation payees had significantly lower than average membership rates. Different
practices were visited by HMO members at significantly disproportionate rates. Much of
this is due to the type of practice and the patients’ ages. Practices that see
predominantly older patients will have a lower rate of HMO members. Conversely, practices
that see predominantly younger patients will have a higher rate of HMO members.
4.3 Modeling
Data mining can model the data using a variety of techniques: neural networks, genetic
algorithms, decision trees, regression analysis, and factor analysis. Each of these models
works best with particular data types for input and output. Regression analysis and
genetic algorithms use numeric input variables to create a function that optimizes the
prediction of a numeric output variable. Supervised neural networks create a model from a
variety of data types to predict an output variable. Unsupervised neural networks
(Kohonen) have no output variable but generate dynamic data clusters by means of node
competition. Factor analyses generate either variable or data clusters. Decision trees
and rule sets are used to predict categorical data from a variety of input types. When
any of these models is created, it should be built with one subset of the data and
validated with a second subset. Models created with an entire dataset tend to be more
complicated and, without external data, can never be proven valid.
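The build-and-validate practice described above can be sketched as a random holdout split. This is a minimal illustration in Python, not the sampling node used in Clementine; the function name, the 50% training fraction, and the seed are all assumptions chosen for the example.

```python
import random


def split_records(records, train_fraction=0.5, seed=1997):
    """Shuffle the records and split them into two disjoint subsets.

    The first subset is used to build a model and the second to validate
    it. A fixed seed makes the split reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Record identifiers stand in for the survey records here.
build_set, validation_set = split_records(range(1000))
print(len(build_set), len(validation_set))
```

A model whose accuracy drops sharply on the validation subset has probably fit peculiarities of the build subset rather than patterns in the population.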
Each of these models can be used independently or in conjunction with one
another. For instance, supervised neural networks generate a black box that accepts inputs
and generates output but does not give any function or rules to understand how the
processing occurred. However, neural networks do give a sensitivity analysis of the
variables in the network. Variables that had little effect can be removed leaving variables
with greater effect on the outcome. The variables with greater effect can be used as a
refined starting point in other models. Factor analysis yields clusters of associated
variables. Variables that are clustered together are interrelated and would tend to be the
best inputs to predict variables within the same group. In this section a few different
techniques will be used to create models for AGE GROUP (categories of age) and
PREGNANCY (yes or no).
4.3.1 Age Group Models
SPSS Clementine 5.2 is the primary package used in this analysis. It provides all
of the previously mentioned modeling techniques. The dataset initially had 224 variables
but was reduced to 33 variables (31 categorical and 2 numeric). Genetic algorithms and
regression analysis will not work well with categorical data and will not be used.
Supervised neural networks can help with a sensitivity analysis of the variables used in
the network to find which variables have the greatest impact on AGE GROUP. The
process first requires filtering the data. The next steps are to define the input and
output variables, sample from the data, and generate a neural network with a sensitivity
analysis. The variables with the greatest influence on AGE GROUP can then be determined.
Clementine Code (Image 6)
Neural Network for Age Group Model Output (Image 7)
Image 7 is a screen dump of a neural network sensitivity analysis. It shows the
predicted accuracy, some of the structure of the network, and the relative importance of
each variable. The relative importance is a score from 0.0 (low importance) to 1.0 (high
importance). AGE, PHYSICIAN SPECIALTY, PRIMARY PAYMENT, MAJOR REASON, PAYEE TYPE, and
DAY all had a relative importance score of 0.10 or more. Refining the network to use only
these variables should improve the model.
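The refinement step amounts to keeping the variables whose relative importance meets a cutoff. A minimal sketch in Python follows; the variable names and scores are invented placeholders, not the values shown in Image 7, and the function is illustrative rather than part of Clementine.

```python
def select_inputs(importance, threshold=0.10):
    """Return variable names scoring at or above threshold, highest first."""
    kept = [name for name, score in importance.items() if score >= threshold]
    return sorted(kept, key=lambda name: importance[name], reverse=True)


# Placeholder scores for illustration only.
scores = {"VAR_A": 0.62, "VAR_B": 0.10, "VAR_C": 0.04, "VAR_D": 0.27}
print(select_inputs(scores))  # ['VAR_A', 'VAR_D', 'VAR_B']
```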
Refined Neural Network for Age Group Model Output (Image 8)
The model’s accuracy did improve, but this is not unexpected. The AGE GROUP variable
is derived from AGE, so AGE alone determines which age group a patient is in. If all
variables except AGE are removed, the network is 100% accurate. It was not possible to
find an accurate rule using the remaining variables without AGE. As shown in Image 9,
the model was very inaccurate.
Neural Network for Age Group Model Output (Image 9)
When only C5 modeling (a rule set model) and only the higher-level variables in the
structure were used, no model over 52% correct was found. Even when seemingly age-related
variables were included (such as pregnant, payee type, and practice type), no accurate
models were found.
4.3.2 Modeling Classification of Pregnant
A more accurate model to determine if someone is pregnant can be generated by
using a C5 model with SEX, PAY METHOD, REASON FOR VISIT, TIME SPENT
WITH PHYSICIAN, and AGE GROUP.
C5 Model for Pregnant Output (Image 10)
This model was only 89.87% accurate. Other variables could be added to make the model
more accurate but this adds complexity to the model. A balance between complexity and
accuracy is determined by the researcher but adding complexity for limited results may
not be justifiable. The following is a rule set generated by a C5 model to predict
pregnancy.
Rule set for Pregnant:
Default: -> No
if Male: -> Blank
if Female:
  Rules for Unknown:
    if major reason for visit == blank/unknown
    if major reason for visit == Non-illness care and Age Group == Under 15 years
       and time spent with physician =< 2
  Rules for Yes:
    major reason for visit == Non-illness care and:
      if Age Group == 15-24 years and time spent with physician =< 14
         or (time spent with physician > 14 and Primary expected source of payment for the visit == Self-pay)
      if Age Group == 25-44 years and Primary expected source of payment for the visit == Blank
         or Primary expected source of payment for the visit == Medicaid
         or Primary expected source of payment for the visit == [Medicare Worker's Compensation]
         or (Primary expected source of payment for the visit == Other and time spent with physician =< 25)
         or (Primary expected source of payment for the visit == Private Insurance and time spent with physician > 2 and time spent with physician =< 12)
         or (Primary expected source of payment for the visit == Private Insurance and time spent with physician > 50)
If the patient was female, of childbearing age, and visiting for non-illness care, then
she was probably pregnant. The different payee types show up deep in the structure and,
when removed, only slightly lessen the accuracy of the model (Images 11 and 12).
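To show how such a rule set is applied to a single visit, the rules for Pregnant can be written as a small Python function. This is a sketch under stated assumptions, not Clementine output: the function and argument names are hypothetical, the category labels are taken from the rule set, and the grouping of the and/or conditions is one reading of the printed rules.

```python
def predict_pregnant(sex, major_reason, age_group, minutes, payment):
    """Apply one reading of the C5 rule set for the Pregnant variable."""
    if sex == "Male":
        return "Blank"
    if major_reason == "Blank/unknown":
        return "Unknown"
    if major_reason != "Non-illness care":
        return "No"  # default rule
    if age_group == "Under 15 years" and minutes <= 2:
        return "Unknown"
    if age_group == "15-24 years":
        # "<= 14, or > 14 and Self-pay" reduces to this test.
        if minutes <= 14 or payment == "Self-pay":
            return "Yes"
    if age_group == "25-44 years":
        if payment in {"Blank", "Medicaid", "Medicare", "Worker's Compensation"}:
            return "Yes"
        if payment == "Other" and minutes <= 25:
            return "Yes"
        if payment == "Private Insurance" and (2 < minutes <= 12 or minutes > 50):
            return "Yes"
    return "No"  # default rule


print(predict_pregnant("Female", "Non-illness care", "25-44 years", 10, "Medicaid"))
```

Written this way, it is easy to see that payment type and time with the physician only refine a decision that sex, age group, and non-illness care have already largely made.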
Refined C5 Model for Pregnant Output (Image 11)
Refined Rule Set for Pregnant Model Rule Set (Image 12)
The refined model also shows that if the patient was female, of childbearing age, and
visiting for non-illness care, then she was probably pregnant. Time spent with a
physician also had some influence on the model. Removing TIME SPENT WITH PHYSICIAN
slightly degrades the model’s accuracy but simplifies the model (Images 13 and 14).
Refined C5 Model (2) for Pregnant Rule Set (Image 13)
Refined Rule Set (2) for Pregnant Model Rule Set (Image 14)
Variables that add little to the outcomes should be discarded. Additional variables will
always add marginally to the results.
There are a variety of types of modeling techniques. Some techniques require
certain types of data input and others are less constraining. Some techniques will generate
functions or rule sets as output where others create unreadable neural networks. Other
techniques generate variable or data clusters. These techniques can be used in
conjunction with each other to refine models. Only a portion of the data should be used
to create the model, leaving the remaining data to validate it. Using these techniques
to predict whether someone was pregnant or what age group they were in yielded mixed
results. The model to determine if a patient was pregnant had ~90% accuracy but the
model to determine which age group the patient was in, once the AGE variable was
removed, never had an accuracy over 52%. The AGE GROUP variable had six strata that
spread into two or more strata (per age group) in other variables so no other categorical
variable could substantively improve the model. The PREGNANCY variable only had
four strata. The male population always left the answer blank so three strata remained.
Assigning BLANK to male responses automatically gave the model ~50% accuracy.
Pregnancy is considered a non-illness and usually is within a certain patient age range.
By using variables associated with sex, age, and non-illness, the model generated became
highly accurate. If such associated variables are not apparent, a neural network
sensitivity analysis can identify variables with high relative importance scores, or
variable clustering techniques can find associated variables. This may be the best course
of action to find an initial set of inputs for a model.
CHAPTER V
CONCLUSIONS
Data mining is a process that, until recent times, was not feasible. Analyzing large
datasets was too time consuming and too apt to have some human computational error in
the analysis. With the creation of the modern computer and mass storage, the analysis of
large datasets has become a less tedious task with less chance for computational error.
What may have taken months to compute before now only takes a few minutes. Once the
data are imported into a statistical package, a variety of analyses can be done in a limited
amount of time. Researchers can test and refine the focus of their analyses in minutes.
The modern computer also allowed for much data to be stored in digital formats which
are easily distributed. Although there are a lot of data on paper, most modern data are
stored in some digital format. Government institutions have stored data on a variety of
topics for years. Corporations have stored internal and customer data. Educational and
research facilities also have stored data. Most of these data are available to the public,
although corporations and some other entities tend to keep their data to themselves. Much
publicly available data can be found on the Internet at numerous data warehouses.
Data mining is not easily defined. It is a process of acquiring, importing, cleaning,
and analyzing large datasets. Acquiring data can be from a researcher’s own collection
mechanism or data from an outside source. Importing data is a process of transforming
the data into a format accepted by a statistical package. Cleaning data is a process of
removing bad data or correcting existing data. The analyses of datasets depend upon the
types of data. Certain techniques require nominal data, some require numeric data, and
other techniques can use a mixture of data types. These techniques can be used alone or
in conjunction with one another. In chapter 4, a sensitivity analysis from a neural network
was used to refine the variables for a C5 model. A sensitivity analysis from a neural
network could also be used to refine a variable list for a regression analysis. Variable
clusters can be used in a similar manner. SPSS and SAS both have statistical packages
that have many techniques to analyze and present results in an informative manner. Some
packages are more industry specific. There are specialized packages designed for
analyzing web practices and designs, customer patterns, patterns in the stock market, and
other industry specific data. These packages may not always present data in a manner the
researcher desires and a program may need to be written to analyze the data in a different
manner. White and Tufte’s books give guidelines on effective presentation of information
(data visualization) that any statistical program should adhere to. Data visualization gives
statistical information on large datasets with a mixture of graphic and text elements in a
coherent manner. No matter what statistical package or program is used, presenting
results in a manner that is easily digestible is a must.
In this thesis data mining was used in many ways. Various techniques were used
to acquire the dataset. SPSS 10 was used in importing and cleaning the data. SPSS 10,
SPSS Clementine, and SAS 8 were all used in exploring the data. A program was
developed using visualization techniques to further analyze and present results in a
different manner. The developed program used an expected percentage of a stratum
across actual values of strata of another variable to show significant deviations from the
expected values. This program was used to show how different payee types
disproportionately went to different practices and also examined HMO membership.
Other analyses were done using modeling techniques. A model was developed to predict
which age group a patient belonged to with little success. Initially a neural net was used
to find the input variables for a C5 model. The initial variables selected yielded very
good results but when the AGE variable was removed from the equation, the model
degraded. A C5 model was also created to determine pregnancy with better results. The
model for pregnancy was refined by using the results from previous C5 models to
generate an accurate and simple rule set (~90% accurate).
Data mining is a recent concept in data analysis. As more and different types of
data are collected, newer data analysis techniques and associated programs will be
created or refined. As computers on the Internet are used for parallel processing (as in
SETI@home and Folding@home), statistical programs can carry out even more complicated
analyses. As computers and parallel processing develop, data mining will develop
alongside them, becoming a more effective means of analysis in the future.
References:
Gehan, Edmund A., Ph.D., and Lemak, Noreen A., M.D. Statistics in Medical Research, Developments in Clinical Trials. Plenum Publishing Corporation, c1994.
Knowledge Discovery Nuggets. http://www.kdnuggets.com/
Microsoft Corp. One Microsoft Way, Redmond, WA 98052-6399. http://www.microsoft.com/
National Ambulatory Medical Care Survey. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics, Division of Data Services, Hyattsville, MD 20782-2003. (301) 458-4636. http://www.cdc.gov/nchs/about/major/ahcd/ahcd1.htm
SAS Institute Inc. SAS Campus Drive, Cary, NC 27513-2414. http://www.sas.com/
SPSS Inc. 233 S. Wacker Drive, 11th floor, Chicago, Illinois 60606. http://www.spss.com/
Sun Microsystems, Inc. 901 San Antonio Road, Palo Alto, CA 94303 USA. http://www.sun.com/
Tufte, Edward R., 1942-. Visual Explanations: Images and Quantities, Evidence and Narrative. Cheshire, Conn.: Graphics Press, c1997.
Westphal, Christopher, and Blaxton, Theresa. Data Mining Solutions: Methods and Tools for Solving Real-World Problems. John Wiley & Sons, Inc., c1998.
White, Jan V., 1928-. Using Charts and Graphs. R. R. Bowker Company, c1984.
Appendix – A (Variable List from NAMCS Files)
This section consists of a detailed breakdown of each data record. For each item on the record, the user is provided with a sequential item number, field length, file location, and brief description of the item, along with valid codes. Unless otherwise stated in the "item description" column, the data are derived from the Patient Record form. The American Medical Association (AMA), the American Osteopathic Association (AOA) and the induction interview (reference 3) are alternate sources of data, while the computer generates other items by recoding selected data items.
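Because each item is described by a field length and a file location, a record can be read by slicing a fixed-width line. The following Python sketch is illustrative only: the column positions are taken from the layout in this appendix, while the short field names, the subset of fields chosen, and the sample record are assumptions made for the example.

```python
# Field name -> (first column, last column), 1-based and inclusive,
# as given in the FILE LOCATION column of the layout.
FIELDS = {
    "month": (1, 2),
    "year": (3, 6),
    "day_of_week": (7, 7),
    "age": (8, 10),
    "sex": (11, 11),
    "pregnant": (12, 12),
    "race": (13, 13),
    "payment": (18, 18),
    "hmo": (19, 19),
}


def parse_record(line):
    """Slice one fixed-width record into a dict of raw code strings."""
    return {name: line[start - 1:end] for name, (start, end) in FIELDS.items()}


# Invented 19-character record: June 1997, Tuesday, age 34, female,
# not pregnant, white, private insurance, HMO member.
sample = "0619973034121222111"
print(parse_record(sample))
```

The values are left as strings so that codes with leading zeros, such as the month, are preserved exactly as they appear in the file.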
ITEM   FIELD   FILE
NO.    LENGTH  LOCATION  ITEM DESCRIPTION AND CODES
-----------------------------------------------------------------------------
1                        DATE OF VISIT
1.1    2       1-2       MONTH OF VISIT
                         01-12: January-December
1.2    4       3-6       YEAR OF VISIT
                         1996 or 1997*
1.3    1       7         DAY OF WEEK OF VISIT
                         1=Sunday
                         2=Monday
                         3=Tuesday
                         4=Wednesday
                         5=Thursday
                         6=Friday
                         7=Saturday
2      3       8-10      PATIENT AGE (IN YEARS; DERIVED FROM DATE OF BIRTH)
                         000-999
                         100 = 100 years and over
3      1       11        SEX
                         1 = Female
                         2 = Male
4      1       12        IS PATIENT PREGNANT?
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Blank/Not applicable

* Survey dates for the 1997 NAMCS were Dec. 30, 1996 through Dec. 28, 1997.
5      1       13        RACE
                         1 = White
                         2 = Black
                         3 = Asian/Pacific Islander
                         4 = American Indian/Eskimo/Aleut
6      1       14        ETHNICITY
                         1 = Hispanic origin
                         2 = Not Hispanic
                         3 = Blank
7      1       15        WAS PATIENT REFERRED BY ANOTHER PHYSICIAN?
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Blank
8      1       16        WAS AUTHORIZATION REQUIRED FOR CARE?
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Blank
9      1       17        ARE YOU THE PATIENT'S PRIMARY CARE PHYSICIAN?
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Blank
10     1       18        PRIMARY EXPECTED SOURCE OF PAYMENT FOR THIS VISIT
                         1 = Private Insurance
                         2 = Medicare
                         3 = Medicaid
                         4 = Worker's Compensation
                         5 = Self-pay
                         6 = No charge
                         7 = Other
                         8 = Unknown
                         9 = Blank
11     1       19        DOES THIS PATIENT BELONG TO AN HMO?
                         (Health Maintenance Organization)
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Blank
12     1       20        IS THIS A CAPITATED VISIT?
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Blank
13     1       21        HAVE YOU OR ANYONE IN YOUR PRACTICE/DEPARTMENT
                         SEEN PATIENT BEFORE?
                         1 = Yes, established patient
                         2 = No, new patient
                         3 = Blank
14                       PATIENT'S REASON(S) FOR VISIT
                         (See page 9 in "Description of the NAMCS" and
                         Reason for Visit Classification)
14.1   5       22-26     REASON #1
                         10050-89990 = 1005.0-8999.0
                         90000 = Blank
14.2   5       27-31     REASON #2
                         10050-89990 = 1005.0-8999.0
                         90000 = Blank
14.3   5       32-36     REASON #3
                         10050-89990 = 1005.0-8999.0
                         90000 = Blank
15     1       37        MAJOR REASON FOR THIS VISIT
                         1 = Acute problem
                         2 = Chronic problem, routine
                         3 = Chronic problem, flareup
                         4 = Pre- or post surgery/injury follow up
                         5 = Non-illness care (e.g. routine prenatal,
                             general exam., well baby)
                         6 = Blank or unknown
16     1       38        IS THIS VISIT RELATED TO INJURY OR POISONING?
                         0 = No
                         1 = Yes
17     1       39        PLACE OF OCCURRENCE OF INJURY
                         1 = Residence
                         2 = Recreation/Sports Area
                         3 = Street/Highway
                         4 = School
                         5 = Other public building
                         6 = Industrial places
                         7 = Other *
                         8 = Unknown
                         9 = Not applicable (not an injury visit)
18     1       40        IS THIS INJURY INTENTIONAL?
                         1 = Yes (self-inflicted)
                         2 = Yes (assault)
                         3 = No, unintentional
                         4 = Unknown
                         5 = Not applicable (not an injury visit)
19     1       41        IS THIS INJURY WORK RELATED?
                         1 = Yes
                         2 = No
                         3 = Unknown
                         4 = Not applicable (not an injury visit)
20                       CAUSE OF INJURY
                         (See p. 9 in "Description of the National
                         Ambulatory Medical Care Survey" for explanation
                         of codes.)
20.1   4       42-45     CAUSE #1 (ICD-9-CM, E-Codes)
                         There is an implied decimal between the third and
                         fourth digits; for inapplicable fourth digits, a
                         dash is inserted. A prefix 'E' is implied.
                         8000-999[-] = E800.0-E999
                         0000 = Not applicable/Blank
20.2   4       46-49     CAUSE #2 (ICD-9-CM, E-Codes)
                         There is an implied decimal between the third and
                         fourth digits; for inapplicable fourth digits, a
                         dash is inserted. A prefix 'E' is implied.
                         8000-999[-] = E800.0-E999
                         0000 = Not applicable/Blank

* Due to a data processing problem, responses of "other" place of occurrence of injury were changed to "unknown" for the 1997 NAMCS.
20.3   4       50-53     CAUSE #3 (ICD-9-CM, E-Codes)
                         There is an implied decimal between the third and
                         fourth digits; for inapplicable fourth digits, a
                         dash is inserted. A prefix 'E' is implied.
                         8000-999[-] = E800.0-E999
                         0000 = Not applicable/Blank
21     100     54-153    CAUSE OF INJURY - VERBATIM TEXT
                         Description of events that preceded the injury.
                         Some entries contain the acronym 'MVA.'
                         MVA=motor vehicle accident.
NOTES ON USING THE CAUSE OF INJURY VERBATIM TEXT DATA
In previous survey years, the cause of injury was converted to an external cause of injury code (E-code) by NCHS medical coders. In 1997, the actual verbatim text has been included on the public use file in addition to the E-code. The inclusion of the verbatim text is meant to assist data users in two major ways. First, the verbatim text can be used by researchers to assign records to injury classification schemes other than the "Supplementary Classification of External Causes of Injury and Poisoning" found in the ICD-9-CM, if so desired. Second, users can search for key text words (for example, swimming pool) to identify diverse causes of injury. It should be noted that, in an effort to preserve confidentiality, all geographic names, personal names, commercial names, and specific dates of injury have been stripped from the verbatim text.
It is important to remember, however, that because of their very specific nature, exact verbatim text strings will not translate into national estimates and should not be used as such. In general, we consider any estimate based on fewer than 30 occurrences in the data to be unreliable. Therefore, a single record showing the specific cause of injury of "tripped over a student's backpack in her classroom and fell on left knee" should not be weighted to produce a national estimate. If, however, a researcher is able to identify 30 or more records where the verbatim text involves a "backpack"-related injury, it might then be possible to sum the patient visit weights for these records to generate a national estimate related to a broader group of visits for backpack-related injuries. The reliability of such an estimate would still depend upon the associated relative standard error.
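The keyword search and the 30-record rule described in these notes can be sketched in Python as follows. The record structure (a dict with cause_text and weight keys), the function name, and the sample records are hypothetical; only the threshold of 30 occurrences comes from the notes above.

```python
MIN_RECORDS = 30  # estimates based on fewer records are considered unreliable


def weighted_keyword_estimate(records, keyword):
    """Sum the visit weights of records whose verbatim text mentions keyword.

    Returns None when fewer than MIN_RECORDS records match, because such an
    estimate should not be reported.
    """
    needle = keyword.lower()
    matches = [r for r in records if needle in r["cause_text"].lower()]
    if len(matches) < MIN_RECORDS:
        return None
    return sum(r["weight"] for r in matches)


# Invented records for illustration only.
records = [{"cause_text": "TRIPPED OVER BACKPACK", "weight": 1500.0}] * 30
records += [{"cause_text": "FELL FROM LADDER", "weight": 2000.0}] * 5
print(weighted_keyword_estimate(records, "backpack"))  # 45000.0
print(weighted_keyword_estimate(records, "ladder"))    # None
```

Passing the 30-record check is necessary but not sufficient: as the notes state, the reliability of the estimate still depends on its relative standard error, which this sketch does not compute.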
22                       PHYSICIAN'S DIAGNOSES
                         (See page 9 in "Description of the National
                         Ambulatory Medical Care Survey" for explanation
                         of coding.)
22.1   5       154-158   DIAGNOSIS #1 (ICD-9-CM)
                         There is an implied decimal between the third and
                         fourth digits; for inapplicable fourth or fifth
                         digits, a dash is inserted.
                         0010[-] - V829[-] = 001.0[0] - V82.9[0]
                         V990- = Noncodable, insufficient information for
                                 coding, illegible
                         V991- = Left before being seen; patient walked out;
                                 not seen by doctor; left against medical
                                 advice
                         V992- = Transferred to another facility; sent to
                                 see a specialist
                         V997- = Entry of "none," "no diagnosis,"
                                 "no disease," or "healthy"
                         00000 = Blank
22.2   5       159-163   DIAGNOSIS #2 (ICD-9-CM)
                         There is an implied decimal between the third and
                         fourth digits; for inapplicable fourth or fifth
                         digits, a dash is inserted.
                         0010[-] - V829[-] = 001.0[0] - V82.9[0]
                         V990- = Noncodable, insufficient information for
                                 coding, illegible
                         V991- = Left before being seen; patient walked out;
                                 not seen by doctor; left against medical
                                 advice
                         V992- = Transferred to another facility; sent to
                                 see specialist
                         V997- = Entry of "none," "no diagnosis,"
                                 "no disease," or "healthy"
                         00000 = Blank
lxxxviii
ITEM FIELD FILE NO. LENGTH LOCATION ITEM DESCRIPTION AND CODES -----------------------------------------------------------------------------
22.3 5 164-168 DIAGNOSIS #3 (ICD-9-CM) There is an implied decimal between the third and and fourth digits; for inapplicable fourth and fifth digits, a dash is inserted. 0010[-] - V829[-] = 001.0[0] - V82.9[0] V990- = Noncodable, insufficient information for coding, illegible V991- = Left before being seen; patient walked out; not seen by doctor; left against medical advice V992- = Transferred to another facility; sent to see specialist V997- = Entry of "none," "no diagnosis," "no disease," or "healthy" 00000 = Blank 23 PROBABLE, QUESTIONABLE, AND RULEOUT DIAGNOSES 23.1 1 169 IS DIAGNOSIS #1 PROBABLE, QUESTIONABLE, OR RULE OUT? 0 = No 1 = Yes 2 = Not applicable
23.2 1 170 IS DIAGNOSIS #2 PROBABLE, QUESTIONABLE, OR RULE OUT? 0 = No 1 = Yes 2 = Not applicable
23.3 1 171 IS DIAGNOSIS #3 PROBABLE, QUESTIONABLE, OR RULE OUT? 0 = No 1 = Yes 2 = Not applicable
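The coding rule for items 22.1 through 22.3 (implied decimal after the third character, dashes for inapplicable digits) can be illustrated with a short Python sketch. The function name `format_diagnosis` and the dictionary of special values are illustrative assumptions, not part of the NAMCS documentation or of the DFEPP program; the descriptions of V991- and V992- are abbreviated here.

```python
# Special values listed for items 22.1-22.3 (descriptions abbreviated).
SPECIAL_DIAGNOSIS_CODES = {
    "V990-": "Noncodable, insufficient information for coding, illegible",
    "V991-": "Left before being seen",
    "V992-": "Transferred to another facility",
    "V997-": 'Entry of "none," "no diagnosis," "no disease," or "healthy"',
    "00000": "Blank",
}


def format_diagnosis(field: str) -> str:
    """Turn a 5-character diagnosis field into a readable ICD-9-CM code."""
    if len(field) != 5:
        raise ValueError(f"expected a 5-character field, got {field!r}")
    if field in SPECIAL_DIAGNOSIS_CODES:
        return SPECIAL_DIAGNOSIS_CODES[field]
    code = field.rstrip("-")          # drop dashes marking inapplicable digits
    if len(code) <= 3:                # no fourth digit, so no decimal part
        return code
    return code[:3] + "." + code[3:]  # implied decimal after the third character


print(format_diagnosis("0010-"))  # 001.0
print(format_diagnosis("V829-"))  # V82.9
```

The two printed values correspond to the endpoints of the range given in the layout (0010[-] = 001.0 and V829[-] = V82.9).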
24 DIAGNOSTIC/SCREENING SERVICES
24.1 1 172 Were any diagnostic/screening services ordered or provided at this visit? 0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box)
EXAMINATIONS
0 = No, 1 = Yes
24.2 1 173 Breast 24.3 1 174 Pelvic 24.4 1 175 Rectal 24.5 1 176 Skin 24.6 1 177 Visual acuity 24.7 1 178 Glaucoma 24.8 1 179 Hearing
TESTS
0 = No, 1 = Yes
24.9 1 180 Blood pressure 24.10 1 181 Strep test 24.11 1 182 Pap test 24.12 1 183 Urinalysis 24.13 1 184 Pregnancy test 24.14 1 185 PSA 24.15 1 186 Blood lead level 24.16 1 187 Cholesterol measure 24.17 1 188 HIV serology 24.18 1 189 Other STD test 24.19 1 190 Hematocrit/hemoglobin 24.20 1 191 Other blood test 24.21 1 192 EKG
IMAGING
0 = No, 1 = Yes
24.22 1 193 X-Ray 24.23 1 194 CT Scan/MRI 24.24 1 195 Mammography 24.25 1 196 Ultrasound
24.26 1 197 ALL OTHER DIAGNOSTIC/SCREENING SERVICES 0 = No, 1 = Yes
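Because every FILE LOCATION in this layout is a 1-based, inclusive column position, reading the checkbox items reduces to slicing a record string. The following Python sketch shows this for the examination checkboxes (items 24.2 through 24.8). The helper names `field` and `exams_at_visit` are hypothetical, and the sample record is synthetic (608 characters of "0" with two boxes set), not real survey data.

```python
from typing import List, Optional

# File locations of the examination checkboxes, items 24.2-24.8.
EXAM_LOCATIONS = {
    "Breast": 173, "Pelvic": 174, "Rectal": 175, "Skin": 176,
    "Visual acuity": 177, "Glaucoma": 178, "Hearing": 179,
}


def field(record: str, start: int, end: Optional[int] = None) -> str:
    """Slice a 1-based, inclusive file location out of one record."""
    last = start if end is None else end
    return record[start - 1:last]


def exams_at_visit(record: str) -> List[str]:
    """List the examinations whose checkbox is coded 1 (Yes)."""
    return [name for name, loc in EXAM_LOCATIONS.items()
            if field(record, loc) == "1"]


# Synthetic record: every column "0" except Breast (173) and Skin (176).
chars = ["0"] * 608
chars[173 - 1] = "1"
chars[176 - 1] = "1"
sample_record = "".join(chars)
print(exams_at_visit(sample_record))  # ['Breast', 'Skin']
```

The same `field` helper applies to any multi-column item, for example `field(record, 206, 207)` for item 24.29.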
24.27 4 198-201 OTHER DIAGNOSTIC/SCREENING SERVICE #1 (ICD-9-CM, Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.
0101-9999 = 01.01-99.99 0010 = Item 17, box 26 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank
24.28 4 202-205 OTHER DIAGNOSTIC/SCREENING SERVICE #2 (ICD-9-CM, Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.
0101-9999 = 01.01-99.99 0010 = Item 17, box 26 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank
24.29 2 206-207 TOTAL NUMBER OF CHECKBOX AND WRITE-IN DIAGNOSTIC/SCREENING SERVICES ORDERED OR PROVIDED 00-26 99 = All boxes blank, including 'None.'
25 THERAPEUTIC AND PREVENTIVE SERVICES
25.1 1 208 Were therapeutic or preventive services ordered or provided? 0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box)
COUNSELING/EDUCATION
0 = No, 1 = Yes
25.2 1 209 Diet/nutrition 25.3 1 210 Exercise 25.4 1 211 HIV/STD transmission 25.5 1 212 Family planning/contraception 25.6 1 213 Prenatal instructions 25.7 1 214 Breast self-exam 25.8 1 215 Tobacco use/exposure 25.9 1 216 Growth/development 25.10 1 217 Mental Health 25.11 1 218 Stress management 25.12 1 219 Skin cancer prevention 25.13 1 220 Injury prevention
OTHER THERAPY
0 = No, 1 = Yes
25.14 1 221 Psychotherapy 25.15 1 222 Psychopharmacotherapy 25.16 1 223 Physiotherapy 25.17 1 224 All other therapeutic and preventive services
25.18 4 225-228 OTHER THERAPEUTIC/PREVENTIVE SERVICE #1 (ICD-9-CM Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.
0101-9999 = 01.01-99.99 0010 = Item 18, box 17 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank
25.19 4 229-232 OTHER THERAPEUTIC/PREVENTIVE SERVICE #2 (ICD-9-CM Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.
0101-9999 = 01.01-99.99 0010 = Item 18, box 17 on Patient Record form was checked, no entry was made in write-in field 0000 = Not applicable/Blank
25.20 2 233-234 Total number of checkbox and write-in therapeutic or preventive services ordered or provided 00-17 99 = All boxes blank, including 'None.'
26 AMBULATORY SURGICAL PROCEDURES
26.1 1 235 Were any ambulatory surgical procedures performed at this visit?
0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box.)
NOTE: Because some survey respondents reported ambulatory surgical procedures in the open-ended response categories of the diagnostic and screening services item (item 17) and the therapeutic and preventive services item (item 18) (and vice versa), it is recommended that any analysis of procedures take into account all of the open-ended response categories from all of these items.
26.2 4 236-239 AMBULATORY SURGICAL PROCEDURE #1 (ICD-9-CM, Vol. 3, Procedures, see page 10 in "Description of the National Ambulatory Medical Care Survey" for explanation of codes.)
A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.
0101-9999 = 01.01-99.99 0000 = Not applicable/Blank
26.3 4 240-243 AMBULATORY SURGICAL PROCEDURE #2 (ICD-9-CM, Vol. 3, Procedures) A left-justified alphanumeric code with an implied decimal after the first two digits; inapplicable fourth digits have a dash inserted.
0101-9999 = 01.01-99.99 0000 = Not applicable/Blank
26.4 1 244 TOTAL NUMBER OF AMBULATORY SURGICAL PROCEDURES PERFORMED AT THIS VISIT 0-2 9 = No procedures recorded and 'None' box blank
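The NOTE under item 26.1 recommends that any analysis of procedures consider all of the open-ended procedure fields together. A minimal Python sketch of that pooling follows, assuming each record is available as a single string. The function name `all_procedures` is hypothetical, and treating 0010 (box checked with no write-in) as "no procedure" in every field is a simplifying assumption.

```python
from typing import List

# File locations of the six open-ended procedure fields:
# items 24.27, 24.28, 25.18, 25.19, 26.2, and 26.3.
PROCEDURE_LOCATIONS = [
    (198, 201), (202, 205),   # other diagnostic/screening services
    (225, 228), (229, 232),   # other therapeutic/preventive services
    (236, 239), (240, 243),   # ambulatory surgical procedures
]
NOT_A_PROCEDURE = {"0000", "0010"}  # blank, or box checked with no write-in


def all_procedures(record: str) -> List[str]:
    """Collect every write-in ICD-9-CM procedure code on one record."""
    codes = []
    for start, end in PROCEDURE_LOCATIONS:
        raw = record[start - 1:end]
        if raw in NOT_A_PROCEDURE:
            continue
        code = raw.rstrip("- ")   # drop dashes (or padding) for inapplicable digits
        if not code:
            continue
        # Implied decimal after the first two digits.
        codes.append(code if len(code) <= 2 else code[:2] + "." + code[2:])
    return codes
```

For a record whose field at 198-201 holds 8952 and whose field at 236-239 holds 213-, the function returns the codes 89.52 and 21.3.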
27 MEDICATIONS (See page 12 in "Description of the National Ambulatory Medical Care Survey" for more information.)
27.1 1 245 WERE MEDICATIONS ORDERED OR PROVIDED AT THIS VISIT? 0 = No 1 = Yes 2 = No answer (all checkboxes and write-in fields blank, including 'None' box)
27.2 5 246-250 MEDICATION #1 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown Entry; Other 99999 = Illegible Entry
27.3 5 251-255 MEDICATION #2 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry
27.4 5 256-260 MEDICATION #3 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry
27.5 5 261-265 MEDICATION #4 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry
27.6 5 266-270 MEDICATION #5 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry
27.7 5 271-275 MEDICATION #6 00005-97181 = 00005-97181 90000 = Blank 99980 = Unknown entry 99999 = Illegible entry
27.8 1 276 NUMBER OF MEDICATIONS CODED 0-6
28 FORMULARY LIST
28.1 1 277 WERE ANY DRUGS FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)
28.2 1 278 WAS DRUG #1 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)
28.3 1 279 WAS DRUG #2 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)
28.4 1 280 WAS DRUG #3 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)
28.5 1 281 WAS DRUG #4 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at this visit)
28.6 1 282 WAS DRUG #5 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at this visit)
28.7 1 283 WAS DRUG #6 FROM FORMULARY LIST? 0 = No 1 = Yes 2 = Unknown 3 = Not applicable (no drugs mentioned at visit)
28.8 1 284 NUMBER OF DRUGS FROM FORMULARY LIST 0-6 = 0 - 6 drugs 7 = Not applicable 8 = Unknown
29 PROVIDERS SEEN AT THIS VISIT
0 = No, 1 = Yes
29.1 1 285 No answer (all categories blank) 29.2 1 286 Physician 29.3 1 287 Physician assistant 29.4 1 288 Nurse practitioner 29.5 1 289 Nurse midwife 29.6 1 290 R.N. 29.7 1 291 L.P.N. 29.8 1 292 Medical/nursing assistant 29.9 1 293 Other
30 3 294-296 TIME SPENT WITH PHYSICIAN (in minutes) 000-240
31 6 297-302 PATIENT VISIT WEIGHT A right-justified integer developed by the NAMCS staff for the purpose of producing national estimates from sample data.
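Since the patient visit weight exists to produce national estimates from the sample, an estimate for any subgroup is the sum of the weights of the sampled visits in that subgroup. The Python sketch below illustrates the arithmetic. The function names are hypothetical, and the example filter uses the geographic region code at location 303 (3 = South). It assumes the weight field parses as an integer once any padding is ignored.

```python
from typing import Callable, Iterable

WEIGHT_LOCATION = (297, 302)   # item 31, right-justified integer
REGION_LOCATION = 303          # item 32, geographic region


def visit_weight(record: str) -> int:
    """Read the patient visit weight from one record."""
    start, end = WEIGHT_LOCATION
    return int(record[start - 1:end])   # int() ignores leading blanks and zeros


def estimated_visits(records: Iterable[str],
                     include: Callable[[str], bool] = lambda r: True) -> int:
    """Sum the weights of the records selected by `include`."""
    return sum(visit_weight(r) for r in records if include(r))


def in_south(record: str) -> bool:
    """True when the physician's practice is in the South (code 3)."""
    return record[REGION_LOCATION - 1] == "3"
```

A call such as `estimated_visits(records, in_south)` then gives the estimated number of visits in the South, while `estimated_visits(records)` gives the estimate for all visits.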
32 1 303 GEOGRAPHIC REGION (Based on actual location of physician's practice.) 1 = Northeast 2 = Midwest 3 = South 4 = West
33 1 304 METROPOLITAN/NON METROPOLITAN (Based on actual location in conjunction with the definition of the Bureau of the Census and the U.S. Office of Management and Budget.)
1 = MSA (Metropolitan Statistical Area) 2 = Non-MSA
34 3 305-307 PHYSICIAN SPECIALTY COLLECTED FROM INDUCTION INTERVIEW (REFERENCE 3) (See "Physician Specialty List.")
35 1 308 TYPE OF DOCTOR 1 = M.D.- Doctor of Medicine 2 = D.O.- Doctor of Osteopathy
36 4 309-312 PHYSICIAN CODE - A unique code assigned to all records from a particular physician
37 3 313-315 PATIENT CODE- A number assigned to identify each individual record from a particular physician
****THE FOLLOWING FIELDS SHOW WHETHER DATA WERE IMPUTED TO REPLACE BLANKS****
38 IMPUTED FIELDS 0 = Not Imputed 1 = Imputed
38.1 1 316 Visit date 38.2 1 317 Birth year 38.3 1 318 Sex 38.4 1 319 Race 38.5 1 320 Time spent with physician
******************* END OF IMPUTED DATA FIELDS ********************
39 DRUG-RELATED INFO FOR MEDICATION #1
39.1 5 321-325 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined
39.2 1 326 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined
39.3 1 327 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined
39.4 1 328 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined
39.5 4 329-332 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class
39.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)
39.6a 5 333-337 INGREDIENT #1 CODE - 50001-92503, or 50000 39.6b 5 338-342 INGREDIENT #2 CODE - 50001-92503, or 50000 39.6c 5 343-347 INGREDIENT #3 CODE - 50001-92503, or 50000 39.6d 5 348-352 INGREDIENT #4 CODE - 50001-92503, or 50000 39.6e 5 353-357 INGREDIENT #5 CODE - 50001-92503, or 50000
40 DRUG-RELATED INFO FOR MEDICATION #2
40.1 5 358-362 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined
40.2 1 363 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined
40.3 1 364 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined
40.4 1 365 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined
40.5 4 366-369 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class
40.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)
40.6a 5 370-374 INGREDIENT #1 CODE - 50001-92503, or 50000 40.6b 5 375-379 INGREDIENT #2 CODE - 50001-92503, or 50000 40.6c 5 380-384 INGREDIENT #3 CODE - 50001-92503, or 50000 40.6d 5 385-389 INGREDIENT #4 CODE - 50001-92503, or 50000 40.6e 5 390-394 INGREDIENT #5 CODE - 50001-92503, or 50000
41 DRUG-RELATED INFO FOR MEDICATION #3
41.1 5 395-399 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined
41.2 1 400 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined
41.3 1 401 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined
41.4 1 402 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined
41.5 4 403-406 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class
41.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)
41.6a 5 407-411 INGREDIENT #1 CODE - 50001-92503, or 50000 41.6b 5 412-416 INGREDIENT #2 CODE - 50001-92503, or 50000 41.6c 5 417-421 INGREDIENT #3 CODE - 50001-92503, or 50000 41.6d 5 422-426 INGREDIENT #4 CODE - 50001-92503, or 50000 41.6e 5 427-431 INGREDIENT #5 CODE - 50001-92503, or 50000
42 DRUG-RELATED INFO FOR MEDICATION #4
42.1 5 432-436 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined
42.2 1 437 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined
42.3 1 438 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined
42.4 1 439 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined
42.5 4 440-443 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class
42.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)
42.6a 5 444-448 INGREDIENT #1 CODE - 50001-92503, or 50000 42.6b 5 449-453 INGREDIENT #2 CODE - 50001-92503, or 50000 42.6c 5 454-458 INGREDIENT #3 CODE - 50001-92503, or 50000 42.6d 5 459-463 INGREDIENT #4 CODE - 50001-92503, or 50000 42.6e 5 464-468 INGREDIENT #5 CODE - 50001-92503, or 50000
43 DRUG-RELATED INFO FOR MEDICATION #5
43.1 5 469-473 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined
43.2 1 474 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined
43.3 1 475 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined
43.4 1 476 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined
43.5 4 477-480 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100 - 2100 = NDC Drug Class
43.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)
43.6a 5 481-485 INGREDIENT #1 CODE - 50001-92503, or 50000 43.6b 5 486-490 INGREDIENT #2 CODE - 50001-92503, or 50000 43.6c 5 491-495 INGREDIENT #3 CODE - 50001-92503, or 50000 43.6d 5 496-500 INGREDIENT #4 CODE - 50001-92503, or 50000 43.6e 5 501-505 INGREDIENT #5 CODE - 50001-92503, or 50000
44 DRUG-RELATED INFO FOR MEDICATION #6
44.1 5 506-510 GENERIC NAME CODE 50001-51379, 51383-92503 = Specific Generic code 51380 = Combination Product 51381 = Fixed Combination 51382 = Multi-vitamin/Multi-mineral 50000 = Generic Name Undetermined
44.2 1 511 PRESCRIPTION STATUS CODE 1 = Prescription Drug 2 = Nonprescription Drug 3 = Undetermined
44.3 1 512 CONTROLLED SUBSTANCE STATUS CODE 1 = Schedule I (Research Only) 2 = Schedule II 3 = Schedule III 4 = Schedule IV 5 = Schedule V 6 = No Control 7 = Undetermined
44.4 1 513 COMPOSITION STATUS CODE 1 = Single Entity Drug 2 = Combination Drug 3 = Undetermined
44.5 4 514-517 NAT'L DRUG CODE DIRECTORY DRUG CLASS 0100-2100 = NDC Drug Class
44.6 INGREDIENT CODE (Ingredients of Combination Drugs; Maximum of 5 Generic Name Codes)
44.6a 5 518-522 INGREDIENT #1 CODE - 50001-92503, or 50000 44.6b 5 523-527 INGREDIENT #2 CODE - 50001-92503, or 50000 44.6c 5 528-532 INGREDIENT #3 CODE - 50001-92503, or 50000 44.6d 5 533-537 INGREDIENT #4 CODE - 50001-92503, or 50000 44.6e 5 538-542 INGREDIENT #5 CODE - 50001-92503, or 50000
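Items 39 through 44 repeat the same 37-column block once per medication, beginning at location 321 (321-357, 358-394, and so on through 506-542). A reader can therefore compute the position of any drug-related field from the medication number instead of listing all six blocks. The Python sketch below shows the offset arithmetic; the function name `drug_info` and the dictionary keys are illustrative assumptions.

```python
from typing import Dict, List, Union

FIRST_BLOCK_START = 321   # location of item 39.1
BLOCK_WIDTH = 37          # columns per medication block (321-357, 358-394, ...)


def drug_info(record: str, medication: int) -> Dict[str, Union[str, List[str]]]:
    """Extract the drug-related fields for medication 1 through 6."""
    if not 1 <= medication <= 6:
        raise ValueError("medication must be 1 through 6")
    # 0-based offset of the block for this medication.
    base = FIRST_BLOCK_START - 1 + BLOCK_WIDTH * (medication - 1)
    block = record[base:base + BLOCK_WIDTH]
    # Five 5-character ingredient codes follow the first 12 columns.
    ingredients = [block[12 + 5 * i:17 + 5 * i] for i in range(5)]
    return {
        "generic_name": block[0:5],
        "prescription_status": block[5],
        "controlled_substance": block[6],
        "composition": block[7],
        "ndc_class": block[8:12],
        "ingredients": ingredients,
    }
```

For medication #6 the block starts at 321 + 37 × 5 = 506, which agrees with item 44.1 above.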
The following items (45-49) appear for the first time on the NAMCS public use file. They were collected using the Physician Induction Interview form at the start of the survey process. All of them pertain to aspects of the physician's practice.
45 1 543 TYPE OF OFFICE SETTING FOR THIS VISIT
1 = Free standing private, solo, or group office 2 = Free standing clinic/urgicenter (not part of hospital emergency department or outpatient department) 3 = Neighborhood health or mental health center 4 = Family planning clinic 5 = Privately operated clinic 6 = Local government clinic (state, county, city) 7 = Health maintenance organization (HMO) or other prepaid practice 8 = Other/unknown
46 1 544 IS THIS A SOLO PRACTICE? 1 = Yes 2 = No
47 1 545 EMPLOYMENT STATUS OF PHYSICIAN 1 = Owner 2 = Employee 3 = Contractor 4 = Blank
48 1 546 WHO OWNS THIS OFFICE? 1 = Hospital 2 = Physician or physician group 3 = Other health care corporation 4 = Health maintenance organization (HMO) 5 = Other 6 = Blank
49 1 547 IS LAB TESTING PERFORMED AT THIS OFFICE? 0 = No 1 = Yes 2 = Blank
*** THE FOLLOWING ITEM WAS ADDED TO FACILITATE CALCULATION OF VISIT RATES BY RACE ***
50 1 548 RACE RECODE 1 = White 2 = Black 3 = Other
*** THE FOLLOWING ITEM WAS ADDED TO ENABLE USERS TO CREATE TABLES USING THE PHYSICIAN SPECIALTY GROUPS IN NAMCS SUMMARY REPORTS. ***
51 2 549-550 PHYSICIAN SPECIALTY RECODE
01 = General and family practice 03 = Internal medicine 04 = Pediatrics 05 = General surgery 06 = Obstetrics and gynecology 07 = Orthopedic surgery 08 = Cardiovascular diseases 09 = Dermatology 10 = Urology 11 = Psychiatry 12 = Neurology 13 = Ophthalmology 14 = Otolaryngology 15 = All other
(Note: For this variable, doctors of osteopathy (stratum 02) have been aggregated with doctors of medicine according to their self-designated practice specialty, and therefore are not differentiated in the variable range. To isolate doctors of osteopathy from medical doctors using the Physician Specialty Recode, it is necessary to crosstabulate it with Type of Doctor located in position 308.)
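The crosstabulation described in the note can be sketched in a few lines of Python. The function name `specialty_by_doctor_type` is hypothetical, and the counts it produces are unweighted sample counts; national estimates would instead sum the patient visit weight at locations 297-302.

```python
from collections import Counter
from typing import Iterable

SPECIALTY_RECODE = (549, 550)   # item 51
TYPE_OF_DOCTOR = 308            # item 35: 1 = M.D., 2 = D.O.


def specialty_by_doctor_type(records: Iterable[str]) -> Counter:
    """Count records for each (specialty recode, type of doctor) pair."""
    table: Counter = Counter()
    for record in records:
        specialty = record[SPECIALTY_RECODE[0] - 1:SPECIALTY_RECODE[1]]
        doctor_type = record[TYPE_OF_DOCTOR - 1]
        table[(specialty, doctor_type)] += 1
    return table
```

In the resulting table, the entry for `("01", "2")` holds the number of sampled general and family practice visits to doctors of osteopathy.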
*** THE FOLLOWING ITEM WAS ADDED TO ENABLE USERS TO CREATE SUBSETS OF VISITS BY PATIENTS UNDER ONE YEAR OF AGE ***
52 3 551-553 AGE IN DAYS 001-365 = 001-365 days 999 = More than 365 days
53 1 554 AGE RECODE 1 = Under 15 years 2 = 15-24 years 3 = 25-44 years 4 = 45-64 years 5 = 65-74 years 6 = 75 years and over
NUMERIC CODES FOR CAUSE OF INJURY, DIAGNOSIS, AND PROCEDURES
*** The following items were included on the public use file to facilitate analysis of visits using the ICD-9-CM codes. Prior to the 1995 public use file, all ICD-9-CM diagnosis codes on the NAMCS micro-data file were converted from alphanumeric to numeric fields according to the following coding conventions: A prefix of "1" was added to ICD-9-CM codes in the range of 001.0[-] through 999.9[-]. A prefix of "20" was substituted for the letter "V" for codes in the range of V01.0[-] through V82.9[-]. Inapplicable fourth and fifth digits were zerofilled. This conversion was done to facilitate analysis of ICD-9-CM data using the Ambulatory Care Statistics software systems. Similar conversions were made for ICD-9-CM procedure codes and external cause of injury codes. Specific coding conventions are discussed in the public use documentation for each data year.
In 1995, however, the decision was made to use the actual ICD-9-CM codes on the public use data file. Codes were not prefixed, and a dash was inserted for inapplicable fourth and fifth digits. For specific details pertaining to each type of code (diagnosis, procedure, cause of injury), refer to the documentation for the survey year of interest. This had the advantage of preserving actual codes and avoiding possible confusion over the creation of some artificial codes due to zerofilling.
It has come to our attention that some users of NAMCS data find it preferable to use the numeric field recodes rather than the alphanumeric fields in certain data applications. Therefore, for 1997, we have included numeric recodes for cause of injury, diagnosis, and procedure (ambulatory surgical procedure as well as "other" diagnostic/screening service and "other" therapeutic/preventive service) as listed below. These are in addition to the actual codes for these variables which appear earlier on the public use file. Users can make their own choice about which format best suits their needs.
We are interested in hearing from data users as to which format they prefer so that a decision can be made about whether to include both formats in future years. Please contact Susan Schappert, Ambulatory Care Statistics Branch, at 301-436-7132, ext. 172 with any comments or suggestions. ***
54 CAUSE OF INJURY RECODE
54.1 4 555-558 CAUSE OF INJURY #1 (Recode to Numeric Field) 8000-9999 = E800.0 - E999.[9] 0000 = Blank
54.2 4 559-562 CAUSE OF INJURY #2 (Recode to Numeric Field) 8000-9999 = E800.0 - E999.[9] 0000 = Blank
54.3 4 563-566 CAUSE OF INJURY #3 (Recode to Numeric Field) 8000-9999 = E800.0 - E999.[9] 0000 = Blank
55 DIAGNOSIS RECODE
55.1 6 567-572 DIAGNOSIS #1 (Recode to Numeric Field) 100100-208290 = 001.0[0] - V82.9[0] 209900 = Noncodable, insufficient information for coding, illegible 209970 = Diagnosis of "none" 900000 = Blank
55.2 6 573-578 DIAGNOSIS #2 (Recode to Numeric Field) 100100-208290 = 001.0[0] - V82.9[0] 209900 = Noncodable, insufficient information for coding, illegible 209970 = Diagnosis of "none" 900000 = Blank
55.3 6 579-584 DIAGNOSIS #3 (Recode to Numeric Field) 100100-208290 = 001.0[0] - V82.9[0] 209900 = Noncodable, insufficient information for coding, illegible 209970 = Diagnosis of "none" 900000 = Blank
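The diagnosis recode follows the conversion described in the introductory paragraphs of this section: a prefix of "1" for codes 001.0 through 999.9, "20" in place of the letter "V", and zeros for inapplicable digits. A minimal Python sketch of that conversion, starting from the 5-character alphanumeric field of items 22.1 through 22.3, is given below. The function name `diagnosis_recode` is an assumption, and the sketch is checked only against the values listed for items 55.1 through 55.3.

```python
def diagnosis_recode(field: str) -> str:
    """Convert a 5-character alphanumeric diagnosis field to its 6-digit recode."""
    if len(field) != 5:
        raise ValueError(f"expected a 5-character field, got {field!r}")
    if field == "00000":                 # blank has its own recode value
        return "900000"
    filled = field.replace("-", "0")     # zero-fill inapplicable digits
    if filled[0] == "V":
        return "20" + filled[1:]         # "20" replaces the letter V
    return "1" + filled                  # prefix "1" for 001.0 through 999.9


print(diagnosis_recode("0010-"))  # 100100
print(diagnosis_recode("V829-"))  # 208290
print(diagnosis_recode("V990-"))  # 209900
```

The printed values match the range endpoints 100100 and 208290 and the noncodable value 209900 shown in items 55.1 through 55.3.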
56 OTHER DIAGNOSTIC/SCREENING SERVICES RECODE
56.1 4 585-588 OTHER DIAGNOSTIC/SCREENING SERVICE #1 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 17, box 25 on Patient Record form was checked, but no entry was made in write-in response field
56.2 4 589-592 OTHER DIAGNOSTIC/SCREENING SERVICE #2 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 17, box 25 on Patient Record form was checked, but no entry was made in write-in response field
57 OTHER THERAPEUTIC/PREVENTIVE SERVICES RECODE
57.1 4 593-596 OTHER THERAPEUTIC/PREVENTIVE SERVICE #1 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 18, box 17 on Patient Record form was checked, but no entry was made in write-in response field
57.2 4 597-600 OTHER THERAPEUTIC/PREVENTIVE SERVICE #2 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank 0010 = Item 18, box 17 on Patient Record form was checked, but no entry was made in write-in response field
58 AMBULATORY SURGICAL PROCEDURE RECODE
58.1 4 601-604 AMBULATORY SURGICAL PROCEDURE #1 (Recode to numeric field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank
58.2 4 605-608 AMBULATORY SURGICAL PROCEDURE #2 (Recode to Numeric Field) 0101-9999 = 01.0[-] - 99.9[-] 0000 = Blank
VITA
The author, Johnathan Paul Durbin, is the son of Norman Paul Durbin and Carol
(Nachtsheim) Durbin. He was born June 16, 1964, in Santa Cruz, California.
His elementary education was obtained in various public schools in California and
Kentucky. His secondary education was obtained at Western High School, Louisville,
Kentucky, where he graduated in 1982.
In September, 1982, he entered the University of Kentucky and worked on a
Bachelor of Science in Computer Science but became disabled and was unable to finish
it. In 1992 he entered the University of Louisville, and in December, 1995, received a
Bachelor of Science with a major in mathematics with a computer focus.