
Exploring a Breast Cancer Register by Data Mining Techniques to Support Prediction of Recurrence

Amir R Razavi
Department of Biomedical Engineering, Division of Medical Informatics
Linköpings universitet, Linköping, Sweden

Dept of Biomedical Engineering, Medical Informatics


Linköpings universitet, Linköping, Sweden
http://www.imt.liu.se
Outline

• Introduction
• From Data to Knowledge
– Data pre-processing (Paper I)
– Data mining (Paper II)
– Validating the predictive model (Paper III)
• Discussion
• Future Works

Introduction

• Data in medicine are stored in many different ways:
– Hospital Information Systems (HIS)
– Electronic Medical Records (EMR)
– Medical registers
– Output from devices, for example imaging devices
– …
• And the storing of patient data continues…
– “Patientjournal 08” project in Östergötland: computerization of all patient data, completed Dec 2008
– …

Introduction
• Medical Registers
– Traditionally established by public health
authorities to monitor trends in the incidence of
conditions such as infectious diseases and
cancer.
– Monitoring outcomes after the implementation
of disease-prevention and treatment programs.
– Assessing the safety of new drugs and procedures, identifying best clinical practice and comparing healthcare systems.
– There are many quality registers in Sweden.
Introduction

• Data in medicine are unique and need special attention.
– Heterogeneity of medical data
• Complexity of medical data; images, signals, …
• Physician's interpretation
• Poor mathematical characterization
• Degrees of relationships between variables
– Ethical/legal/social issues
• Data ownership
• Confidentiality of human data

Introduction

• What can be extracted from these large repositories of medical data?
• How can the hidden knowledge be
extracted from medical registers?

Introduction

• Knowledge Discovery in Databases (KDD)
– Process of semi-automatically analyzing large databases to find patterns that are:
• Valid: true for new data with some certainty
• Novel: non-obvious
• Useful: it should be possible to act upon the pattern
• Understandable: humans should be able to interpret the pattern

Introduction

• Data Mining
– “…the process of discovering meaningful new
correlations, patterns, and trends by sifting through
large amounts of data…” (Gartner Group)
– “…the analysis of observational data sets to find
unsuspected relationships and to summarize data in
novel ways…” (Hand et al.)
– “…is an interdisciplinary field bringing together
techniques from machine learning, pattern recognition,
statistics, databases, and visualization…” (Cabena et al.)
– …

Introduction

• Supervised vs. unsupervised learning
– Supervised learning (classification) is seen as
learning from examples.
• Supervision: The data (observations, measurements,
etc.) are labeled with pre-defined classes.
– Unsupervised learning (clustering)
• Class labels of the data are unknown.
• Given a set of data, the task is to establish the
existence of classes or clusters in the data.

Introduction

• Supervised learning (classification)
– Decision Tree Induction (DTI)
– Artificial Neural Networks (ANN)
– Support Vector Machines (SVM)
– Multiple Regression Analysis (MRA)
– …

Introduction

• Decision Tree Induction (DTI)
– Decision tree learning is widely used
• Its classification accuracy is competitive with other methods,
• Representation as if-then rules is easy to interpret,
• Works well on noisy data.

From Data to Knowledge

Data Pre-processing
Paper I

Exploring Cancer Register Data to Find Risk Factors for Recurrence of Breast Cancer - Application of Canonical Correlation Analysis
Razavi AR, Gill H, Stål O, Sundquist M, Thorstenson S, Åhlfeldt H,
Shahsavar N, the South-East Swedish Breast Cancer Study Group

BMC Medical Informatics and Decision Making. 2005 Aug 22;5:29

Data Pre-processing

• Data pre-processing: any type of processing performed on raw data to prepare it for other procedures in KDD.
– Cleaning
• Handling missing values, identifying or removing outliers
– Transformation
• Normalization, discretization or aggregation
– Reduction
• Obtaining a representation that is reduced in volume but produces the same or similar analytical results
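The cleaning and transformation steps listed above can be illustrated with a minimal sketch. It is not the pre-processing method used in the papers; the function names, the mean-imputation choice, the cut point, and the tumor-size values are all hypothetical and chosen only to make the three operations concrete.

```python
from statistics import mean

def impute_mean(values):
    """Cleaning: replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, cut_points, labels):
    """Transformation: map each value to the label of the interval it falls in."""
    result = []
    for v in values:
        idx = sum(v >= c for c in cut_points)  # number of cut points at or below v
        result.append(labels[idx])
    return result

# Hypothetical tumor sizes in mm; None marks a missing value.
tumor_size_mm = [12.0, None, 30.0, 18.0]
cleaned = impute_mean(tumor_size_mm)
scaled = min_max_normalize(cleaned)
binned = discretize(cleaned, cut_points=[20.0], labels=["small", "large"])
```

Mean imputation is only one of several ways to handle missing values; which one is appropriate depends on the variable and on why the value is missing.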

Data Pre-processing

• Why is data pre-processing needed?
– If the data are not of good quality, the analysis results will not be good either.
– Decisions must be based on high-quality data.
– Duplicate or missing data may cause incorrect
or even misleading results.

Data Pre-processing

• Data reduction:
– Obtain a reduced representation of the dataset that is much smaller in volume yet produces the same or almost the same analytical results.
• Why do it?
– The dataset may be gigantic in volume
– Processing time

Data Pre-processing

• Dimension reduction
– Removes unimportant attributes: Canonical
Correlation Analysis (CCA)
• Data Compression
• Reducing the number of instances
• Discretization and concept hierarchy
generation

Data Pre-processing

• Canonical correlation analysis (CCA)
– CCA seeks to identify and quantify the associations between two sets of variables (e.g., predictors and outcomes of a disease).
– It focuses on the correlation between a linear
combination of the variables in one set
(independents/predictors) and a linear
combination of the variables in another set
(dependents/outcomes).

Data Pre-processing

• It creates a number of canonical solutions, each consisting of a linear combination of one set of variables:
$U_i = a_1 X_1 + a_2 X_2 + \dots + a_m X_m$
and a linear combination of the other set of variables:
$V_i = b_1 Y_1 + b_2 Y_2 + \dots + b_n Y_n$
• The goal is to determine the coefficients (a's and b's) that maximize the correlation between the canonical variates $U_i$ and $V_i$ (a canonical variate is a linear combination of a set of original variables).
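Written out, the maximization described above takes the following standard form. The covariance symbols $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY}$ (within-set and between-set covariance matrices) and the vector notation $a = (a_1, \dots, a_m)^{\top}$, $b = (b_1, \dots, b_n)^{\top}$ are assumptions introduced here for illustration; they do not appear on the slides.

```latex
\rho_i \;=\; \max_{a,\,b}\ \operatorname{corr}(U_i, V_i)
       \;=\; \max_{a,\,b}\
       \frac{a^{\top}\,\Sigma_{XY}\,b}
            {\sqrt{a^{\top}\,\Sigma_{XX}\,a}\;\sqrt{b^{\top}\,\Sigma_{YY}\,b}}
```

In the usual formulation, each later canonical solution is found the same way under the extra constraint that its variates are uncorrelated with those of the earlier solutions.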
Data Pre-processing

• Examining canonical solutions to determine the relative importance of each of the original variables in the canonical variate
– Canonical Loadings
• Represent the simple linear correlation between an original observed variable and its canonical variate.
• Show how each original variable contributes towards each canonical variate.
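In symbols, a canonical loading is just a correlation coefficient between one original variable and one canonical variate. The letter $\ell$ and its index layout are notation assumed here, not taken from the slides.

```latex
\ell^{X}_{j,i} \;=\; \operatorname{corr}(X_j,\ U_i), \qquad
\ell^{Y}_{k,i} \;=\; \operatorname{corr}(Y_k,\ V_i)
```

A loading close to $\pm 1$ indicates that the variable is strongly represented in that canonical variate, while a loading near 0 indicates little contribution.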

Data Pre-processing
Predictor Set              Outcome Set ‡
Age                        DM, first two years
Tumor location             DM, 2-4 years
Side                       DM, more than 4 years
Tumor size *               LRR, first two years
LN involvement *           LRR, 2-4 years
LN involvement †           LRR, more than 4 years
Periglandular growth *
NHG
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy

Abbreviations: LN: lymph node, NHG: Nottingham Histologic Grade, DM: Distant Metastasis, LRR: Loco-regional Recurrence
* from pathology report, † N0: Not palpable LN metastasis, ‡ all periods are time after diagnosis.

Data Pre-processing

• CCA is suggested as an appropriate method when there are many variables in the input set and more than one variable in the output set.
• The results successfully detected well-known predictors for breast cancer recurrence.
• CCA can therefore be regarded as the dimension reduction step in the process of knowledge discovery in databases.

From Data to Knowledge

Data Mining
Paper II

A Data Pre-processing Method to Increase Efficiency and Accuracy in Data Mining

A. R. Razavi, H. Gill, H. Åhlfeldt, and N. Shahsavar

Lecture Notes in Computer Science, Artificial Intelligence in Medicine, J. H. S. Miksch, E. Keravnou, Ed.: Springer-Verlag GmbH, 2005, pp. 434-443

Data Mining

• DTI is a greedy divide-and-conquer algorithm
• Tree is constructed in a top-down recursive manner
• At start, all the training examples are at the root
• Examples are partitioned recursively based on
selected attributes
• Attributes are selected on the basis of an impurity
function (e.g., information gain)
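The attribute-selection step can be sketched as follows, using information gain as the impurity-based criterion. This is an illustrative sketch only, not the J48 implementation used in the study; the attribute names and the four example patients are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by partitioning the examples on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical training examples at the root of the tree.
rows = [
    {"ln_involved": "yes", "side": "left"},
    {"ln_involved": "yes", "side": "right"},
    {"ln_involved": "no", "side": "left"},
    {"ln_involved": "no", "side": "right"},
]
labels = ["recurrence", "recurrence", "none", "none"]
chosen = best_attribute(rows, labels, ["ln_involved", "side"])
```

A full tree builder would apply `best_attribute` at the root, split the examples on the chosen attribute, and recurse on each partition until a stopping rule is met.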

Data Mining

• DTI
– Pros
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features
– Cons
• Cannot handle complicated relationships between features

Data Mining

• Building Predictive Models
– An important function of data mining is the production
of a model. A model can be descriptive or predictive.
– A descriptive model helps in understanding underlying
processes or behavior.
– A predictive model is an equation or set of rules that
makes it possible to predict an unseen or unmeasured
value (the dependent variable or output) from other,
known values (independent variables or input).

Data Mining

                    Without          With replacing    With
                    pre-processing   missing values    pre-processing
Accuracy            54%              57%               67%
Sensitivity         83%              82%               80%
Specificity         41%              46%               63%
Number of Leaves    137              196               14
Tree Size           273              391               27

From Data to Knowledge

Validating the Predictive Model
Paper III
Data Mining for Building a Model to Predict Distant Metastasis in Breast Cancer: Comparing a Decision Tree with Domain Experts

Amir R. Razavi, Hans Gill, Hans Åhlfeldt, and Nosrat Shahsavar

Submitted to the “Journal of Artificial Intelligence in Medicine”

Validating the Predictive Model

• Transparency of the model:
– Choosing data mining methods that produce an understandable predictive model, such as decision tree induction (DTI).
– Providing the model to clinicians so they can inspect all the details and how decisions are made in the model by studying the tree and rules.

Validating the Predictive Model

• Showing that the model's performance is good
– Giving some cases to clinicians without any data pre-processing and asking for their predictions.
– Validating the predictive model by having it examine the same cases as the clinicians.
– Comparing the results to see whether there are statistically significant differences.

Validating the Predictive Model

• Testing the model for accuracy on an independent dataset, one that has not been used to create the model.
• Examining the performance of the model on the
training set is not a good indicator because of
overfitting.
• The prediction of the model for an independent
dataset is compared to the actual outcome.
• An analysis is performed which measures how
well a model is performing.

Validating the Predictive Model

• Validating methods:
– Examining an independent dataset.
– Cross validation:
• Divides the whole dataset by random sampling into n folds (partitions) and performs testing n times.
– In each test, one partition of the data is used as the testing set and the rest as the training set.
• Leave-one-out cross-validation
–…
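The fold construction for cross-validation can be sketched as below. The function name, the fixed seed, and the sample size of 10 are assumptions made for illustration; leave-one-out cross-validation corresponds to setting the number of folds equal to the number of samples.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, test_indices) pairs for n-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)              # random sampling into folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Every fold except the i-th one forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical example: 10 cases split into 5 folds.
splits = list(k_fold_indices(n_samples=10, n_folds=5))
```

Each case appears in exactly one testing set, so every case is used for testing once and for training in the remaining rounds.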

Validating the Predictive Model

• Methods for showing how well a model works:
– Accuracy: refers to the degree of fit between
the model and the data.
– Sensitivity and specificity.
– Confusion matrix: shows the counts of the
actual versus predicted class values.
– ROC curve and area under the curve (AUC).
–…
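The confusion-matrix counts and the measures derived from them can be computed as in the sketch below. Here accuracy is taken, as an assumption, to mean the fraction of correctly classified cases; the labels (1 = event, 0 = no event) and the eight example cases are hypothetical.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary outcome."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def summarize(actual, predicted, positive=1):
    """Accuracy, sensitivity and specificity from the confusion-matrix counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),   # share of actual positives that were detected
        "specificity": tn / (tn + fp),   # share of actual negatives that were detected
    }

# Hypothetical actual outcomes and model predictions for eight cases.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = summarize(actual, predicted)
```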

Validating the Predictive Model

• 3699 patients
• A decision tree was trained with all patients except
for 100 cases and tested with those 100 cases.
• Two domain experts were asked to give their opinion about the probability of a certain recurrence outcome for these 100 patients.
• ROC curves and area under the ROC curves
(AUC) for predictions were computed and
compared.
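One way to compute an AUC from predicted probabilities is the pairwise (Mann-Whitney) formulation sketched below: the AUC equals the fraction of positive/negative case pairs in which the positive case receives the higher score, with ties counted as one half. This is an illustrative sketch, not the software used in the study, and the scores and labels are hypothetical.

```python
def auc(scores, labels, positive=1):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive case ranked above negative case
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes (1 = event) and predicted probabilities for five cases.
labels = [1, 1, 0, 0, 0]
model_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
expert_scores = [0.8, 0.6, 0.5, 0.3, 0.1]
model_auc = auc(model_scores, labels)
expert_auc = auc(expert_scores, labels)
```

Comparing two AUC values for statistical significance requires an additional test, which this sketch does not cover.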

Validating the Predictive Model
[Figure: ROC curves (Sensitivity vs. 100-Specificity) for DTI_J48, Oncologist_1 and Oncologist_2]

        DTI (J48)    Oncologist 1    Oncologist 2
AUC     0.761        0.847           0.810

DTI: decision tree induction

From Data to Knowledge

Discussion

• In some domains, such as finance and banking, KDD has already shown great benefit to the industry, but medicine is far behind.

Discussion

• There is no gold standard method for the pre-processing step or for handling missing values.
• Consulting domain experts in the pre-processing step can save a lot of effort and result in a better subset of variables for the next steps.
• CCA can handle multiple outcomes, which is unique compared to other methods such as MRA and Cox RA.

Discussion

• The predictions of the DTI model do not differ significantly from those made by domain experts.
• Compared to other data mining methods, DTI is more explainable. In contrast, an ANN works as a ”black box”.
• By using an appropriately small subset of predictors together with pruning techniques, DTI results in a smaller tree.

Discussion

• AUC computed on an independent dataset is a good technique for comparing different models.
• A predictive model built on the most relevant and important predictors of an event can have better performance.
• The quality of cancer registers can be improved by adding variables with high predictive ability.

Future Works

• Results from the presented methodology can be used to build a decision support application in the field of oncology.

Future Works

• Why is decision support important?
– Limited resources.
– Need for ways to improve health care processes and their outcomes.
– Improving the decision-making ability of clinicians by allowing more or better decisions within the constraints of their knowledge and time limits.

Future Works

• In medicine, assisting clinicians in their decision making at the right time, in the right place and in a suitable format is valuable.
• Providing reminders, interpretations or advice specific to a given patient at a particular time is advantageous.

Thanks for your attention!

amira@imt.liu.se