Halvtids Master-Main

Exploring a Breast Cancer Register by Data
Mining Techniques to Support Prediction of
Recurrence
Amir R Razavi
Department of Biomedical Engineering, Division of Medical Informatics
Linköpings universitet, Linköping, Sweden
Dept of Biomedical Engineering, Medical Informatics

http://www.imt.liu.se
Outline
• Introduction
• From Data to Knowledge
– Data pre-processing - (Paper I)
– Data mining - (Paper II)
– Validating predictive models - (Paper III)
• Discussion
• Future work

Introduction
• Data in medicine are stored in many different ways:
– Hospital Information Systems (HIS)
– Electronic Medical Records (EMR)
– Medical registers
– Output from devices, for example imaging devices
– …
• And the storing of patient data continues…
– “Patientjournal 08” project in Östergötland: computerization
of all patient data completed Dec 2008
– …

Introduction
• Medical Registers
– Traditionally established by public health
authorities to monitor trends in the incidence of
conditions such as infectious diseases and
cancer.
– Monitoring outcomes after the implementation
of disease-prevention and treatment programs.
– Assessing the safety of new drugs and
procedures, identifying best clinical practice and
comparing healthcare systems.
– There are many quality registers in Sweden.
Introduction
• Data in medicine are unique and need
special attention.
– Heterogeneity of medical data
• Complexity of medical data; images, signals, …
• Physician's interpretation
• Poor mathematical characterization
• Degrees of relationships between variables
– Ethical/legal/social issues
• Data ownership
• Confidentiality of human data

Introduction
• What can be extracted from these large
repositories of medical data?
• How can the hidden knowledge be
extracted from medical registers?

Introduction
• Knowledge Discovery in Databases (KDD)

– Process of semi-automatically analyzing large
databases to find patterns that are:
• Valid: true for new data with some certainty
• Novel: non-obvious
• Useful: it should be possible to act upon the item
• Understandable: humans should be able to interpret
the pattern

Introduction
• Data Mining
– “…the process of discovering meaningful new
correlations, patterns, and trends by sifting through
large amounts of data…” (Gartner Group)
– “…the analysis of observational data sets to find
unsuspected relationships and to summarize data in
novel ways…” (Hand et al.)
– “…is an interdisciplinary field bringing together
techniques from machine learning, pattern recognition,
statistics, databases, and visualization…” (Cabena et
al.)
– …

Introduction
• Supervised vs. unsupervised learning

– Supervised learning (classification) is seen as
learning from examples.
• Supervision: The data (observations, measurements,
etc.) are labeled with pre-defined classes.
– Unsupervised learning (clustering)
• Class labels of the data are unknown.
• Given a set of data, the task is to establish the
existence of classes or clusters in the data.

Introduction
• Supervised learning (classification)

– Decision Tree Induction (DTI)
– Artificial Neural Networks (ANN)
– Support Vector Machines (SVM)
– Multiple Regression Analysis (MRA)
– …

Introduction
• Decision Tree Induction (DTI)

– Decision tree learning is widely used
• Its classification accuracy is competitive with other
methods,
• Representation as If-then rules is easy to interpret,
• Works well on noisy data.
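
The if-then representation mentioned above can be illustrated with a small sketch. The tree below is a toy example: its attribute names, values and class labels are hypothetical placeholders chosen for illustration, not rules derived from the register. The function walks the tree and emits one rule per leaf.

```python
# Toy decision tree as nested dicts. Attribute names, values and labels are
# hypothetical placeholders, not findings from the breast cancer register.
toy_tree = {
    "attribute": "ln_involvement",
    "branches": {
        "no": {"label": "no recurrence"},
        "yes": {
            "attribute": "tumor_size",
            "branches": {
                "small": {"label": "no recurrence"},
                "large": {"label": "recurrence"},
            },
        },
    },
}


def tree_to_rules(node, conditions=()):
    """Return one if-then rule (as a string) per leaf of the tree."""
    if "label" in node:  # leaf: emit the conditions collected on the path
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN {node['label']}"]
    rules = []
    for value, child in node["branches"].items():
        test = f"{node['attribute']} = {value}"
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules


rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

Each root-to-leaf path becomes one rule, which is what makes the model easy for a clinician to read.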

From Data to Knowledge

Data Pre-processing
Paper I
Exploring Cancer Register Data to find Risk
Factors for Recurrence of Breast Cancer -
Application of Canonical Correlation
Analysis
Razavi AR, Gill H, Stål O, Sundquist M, Thorstenson S, Åhlfeldt H,
Shahsavar N, the South-East Swedish Breast Cancer Study Group
BMC Medical Informatics and Decision Making. 2005 Aug 22;5:29

Data Pre-processing
• Data pre-processing: any type of processing
performed on raw data to prepare it for
other procedures in KDD.
– Cleaning
• Handling missing values, identifying or removing outliers
– Transformation
• Normalization, discretization or aggregation
– Reduction
• Obtaining a representation that is reduced in volume but
produces the same or similar analytical results
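
A minimal sketch of the cleaning and transformation steps listed above. The slides do not say which imputation or scaling method was used, so mean imputation, min-max normalization and fixed cut points are assumptions made for illustration, and the sample column is invented.

```python
from statistics import mean


def impute_mean(values):
    """Cleaning: replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, cut_points, labels):
    """Transformation: map each value to the label of the interval it falls in."""
    result = []
    for v in values:
        index = sum(v >= c for c in cut_points)  # number of cut points passed
        result.append(labels[index])
    return result


raw = [10.0, None, 30.0, 20.0]  # invented column with one missing value
cleaned = impute_mean(raw)
scaled = min_max_normalize(cleaned)
binned = discretize(cleaned, cut_points=[15.0, 25.0], labels=["low", "mid", "high"])
print(cleaned, scaled, binned)
```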

Data Pre-processing
• Why is data pre-processing needed?
– If the data are of poor quality, the
analysis results will be poor as well.
– Decisions must be based on high-quality data.
– Duplicate or missing data may cause incorrect
or even misleading results.

Data Pre-processing
• Data reduction:
– Obtain a reduced representation of the dataset
that is much smaller in volume yet produces
the same or almost the same analytical results.
• Why do it?
– The dataset may be gigantic in volume
– A smaller dataset takes less processing time

Data Pre-processing
• Dimension reduction
– Removes unimportant attributes: Canonical
Correlation Analysis (CCA)
• Data Compression
• Reducing the number of instances
• Discretization and concept hierarchy
generation

Data Pre-processing
• Canonical correlation analysis (CCA)
– CCA seeks to identify and quantify the
associations between two sets of variables (e.g.,
predictors and outcomes of a disease).
– It focuses on the correlation between a linear
combination of the variables in one set
(independents/predictors) and a linear
combination of the variables in another set
(dependents/outcomes).

Data Pre-processing
• It creates a number of canonical solutions, each
consisting of a linear combination of one set of
variables:
U_i = a_1 X_1 + a_2 X_2 + … + a_m X_m
and a linear combination of the other set of
variables:
V_i = b_1 Y_1 + b_2 Y_2 + … + b_n Y_n
• The goal is to determine the coefficients (the a's and
b's) that maximize the correlation between the
canonical variates U_i and V_i (each a linear combination
of one set of original variables).
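
The maximization can be written compactly in matrix form. The notation here is an assumption introduced for illustration and does not appear on the slides: a and b are the coefficient vectors of the first pair of canonical variates, and Σ_XX, Σ_YY and Σ_XY are the within-set and between-set covariance matrices of the predictors X and outcomes Y.

```latex
\rho_1 \;=\; \max_{a,\,b}\ \operatorname{corr}(U_1, V_1)
       \;=\; \max_{a,\,b}\
       \frac{a^{\top}\Sigma_{XY}\,b}
            {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
```

Each later pair of variates maximizes the same ratio, subject to being uncorrelated with the pairs already extracted.
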
Data Pre-processing
• Examining canonical solutions to determine
the relative importance of each of the
original variables in the canonical variate
– Canonical Loadings
• A loading is the simple linear correlation between an
original observed variable and its canonical variate.
• It shows how much each original variable contributes
to each canonical variate.
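
A loading, as defined above, is an ordinary Pearson correlation, so it can be sketched in a few lines. The coefficients are taken as given here (in practice they come out of the CCA itself), and the column names, weights and data are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Simple linear (Pearson) correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def canonical_loadings(columns, weights):
    """Correlate each original variable with the variate built from all of them.

    columns: dict mapping variable name -> list of observed values
    weights: dict mapping variable name -> canonical coefficient (assumed given)
    """
    n = len(next(iter(columns.values())))
    variate = [sum(weights[name] * columns[name][i] for name in columns)
               for i in range(n)]
    return {name: pearson(values, variate) for name, values in columns.items()}


# Invented data and hypothetical coefficients, for illustration only.
columns = {"x1": [1.0, 2.0, 3.0, 4.0], "x2": [2.0, 1.0, 4.0, 3.0]}
weights = {"x1": 1.0, "x2": 0.0}
loadings = canonical_loadings(columns, weights)
print(loadings)
```

Note that x2 has a non-zero loading even though its coefficient is zero: loadings reflect correlation with the variate, not the weight itself.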

Data Pre-processing
Predictor Set              Outcome Set ‡
Age                        DM, first two years
Tumor location             DM, 2-4 years
Side                       DM, more than 4 years
Tumor size *               LRR, first two years
LN involvement *           LRR, 2-4 years
LN involvement †           LRR, more than 4 years
Periglandular growth *
NHG
Multiple tumors *
Estrogen receptor
Progesterone receptor
S-phase fraction
DNA index
DNA ploidy
Abbreviations: LN: lymph node, NHG: Nottingham Histologic Grade,
DM: Distant Metastasis, LRR: Loco-regional Recurrence
* from pathology report, † N0: Not palpable LN metastasis,
‡ all periods are time after diagnosis.

Data Pre-processing
• CCA is suggested as an appropriate method
when there are many variables in the input set
and more than one variable in the output set.
• The results successfully identified well-known
predictors of breast cancer recurrence.
• CCA can therefore serve as the dimension
reduction step in the process of knowledge
discovery in databases.


Data Mining
Paper II
A Data Pre-processing Method to Increase
Efficiency and Accuracy in Data Mining
A. R. Razavi, H. Gill, H. Åhlfeldt, and N. Shahsavar
Lecture Notes in Computer Science, Artificial Intelligence in
Medicine, J. H. S. Miksch, E. Keravnou, Ed.: Springer-Verlag
GmbH, 2005, pp. 434-443

Data Mining
• DTI is a greedy divide-and-conquer
algorithm
• The tree is constructed in a top-down recursive manner
• At the start, all the training examples are at the root
• Examples are partitioned recursively based on
selected attributes
• Attributes are selected on the basis of an impurity
function (e.g., information gain)
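
As a sketch of the attribute-selection step, the functions below compute entropy and information gain and pick the attribute with the highest gain. This is a generic illustration, not the exact criterion of any particular DTI implementation, and the four-row dataset is invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by partitioning the rows on one attribute."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Greedy choice: the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Invented toy data: attribute "a" separates the classes perfectly, "b" not at all.
rows = [
    {"a": "yes", "b": "x"},
    {"a": "yes", "b": "y"},
    {"a": "no", "b": "x"},
    {"a": "no", "b": "y"},
]
labels = ["pos", "pos", "neg", "neg"]
print(best_attribute(rows, labels, ["a", "b"]))
```

A full tree builder would apply `best_attribute` at the root, split the examples on it, and recurse on each partition until the partitions are pure or no attributes remain.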

Data Mining
• DTI
– Pros
• Reasonable training time
• Fast application
• Easy to interpret
• Easy to implement
• Can handle a large number of features
– Cons
• Cannot handle complicated relationships between
features

Data Mining
• Building Predictive Models

– An important function of data mining is the production
of a model. A model can be descriptive or predictive.
– A descriptive model helps in understanding underlying
processes or behavior.
– A predictive model is an equation or set of rules that
makes it possible to predict an unseen or unmeasured
value (the dependent variable or output) from other,
known values (independent variables or input).

Data Mining
                    Without          With replacing    With
                    preprocessing    missing values    preprocessing
Accuracy            54%              57%               67%
Sensitivity         83%              82%               80%
Specificity         41%              46%               63%
Number of Leaves    137              196               14
Tree Size           273              391               27



Validating the Predictive Model
Paper III
Data Mining for Building a Model to Predict
Distant Metastasis in Breast Cancer:
Comparing a Decision Tree with Domain
Experts
Amir R. Razavi, Hans Gill, Hans Åhlfeldt, and Nosrat Shahsavar
Submitted to the “Journal of Artificial Intelligence in Medicine”

• Transparency of the model:

– By choosing data mining methods which
produce an understandable predictive model
such as decision tree induction (DTI).
– Providing the model to clinicians to inspect all
the details and how the decisions are made in
the model; studying the tree and rules.

• Showing that the model's performance is good
– Giving some cases to clinicians without any
data pre-processing and asking for their
predictions.
– Validating the predictive model by examining
the same cases as the clinicians.
– Comparing the results to see whether there are
statistically significant differences.

• Testing the model for accuracy on an independent
dataset, one that has not been used to create the
model.
• Examining the performance of the model on the
training set is not a good indicator because of
overfitting.
• The prediction of the model for an independent
dataset is compared to the actual outcome.
• An analysis is performed which measures how
well a model is performing.

• Validating methods:
– Examining an independent dataset.
– Cross validation:
• Divides the whole dataset by random sampling into n
folds (partitions) and performs testing n times.
– In each round, one partition is used as the testing
set and the rest as the training set.
• Leave-one-out cross-validation
–…
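
The fold construction described above can be sketched as follows. The function name, the fixed seed and the round-robin assignment of shuffled indices to folds are illustrative choices, not details taken from the papers.

```python
import random


def k_fold_indices(n_items, n_folds, seed=0):
    """Yield (train, test) index lists for n-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random sampling into folds
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Every fold except the current one goes into the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(n_items=10, n_folds=5))
for train, test in splits:
    print(sorted(test), len(train))
```

Setting `n_folds` equal to `n_items` gives leave-one-out cross-validation: each test set then holds exactly one case.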

• Methods for showing how well a model
works:
– Accuracy: refers to the degree of fit between
the model and the data.
– Sensitivity and specificity.
– Confusion matrix: shows the counts of the
actual versus predicted class values.
– ROC curve and area under the curve (AUC).
–…
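
The first three measures all follow from the four cells of a binary confusion matrix, as this sketch shows. Accuracy is taken here as the fraction of correctly classified cases, which is one common reading of the term. The labels are invented, with 1 standing for the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def summarize(actual, predicted, positive=1):
    """Accuracy, sensitivity and specificity from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # share of actual positives detected
        "specificity": tn / (tn + fp),  # share of actual negatives detected
    }


# Invented labels for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = summarize(actual, predicted)
print(confusion_counts(actual, predicted), metrics)
```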

• 3699 patients
• A decision tree was trained on all patients except
for 100 cases and tested on those 100 cases.
• Two domain experts were asked to give their
opinion on the probability of a certain recurrence
outcome for these 100 patients.
• ROC curves and the areas under them
(AUC) were computed for the predictions and
compared.
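
One way to compute an AUC from predicted probabilities is the pairwise (rank-based) formulation sketched below. The slides do not state how the AUC values were computed or compared statistically, so this is an assumed approach, and the scores and labels are invented.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count one half).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented outcomes (1 = event occurred) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
model_auc = auc(model_scores, labels)
print(round(model_auc, 3))
```

The same function can be applied to each rater's probability estimates for the same cases, giving one AUC per rater for comparison.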



[Figure: ROC curves (Sensitivity vs. 100-Specificity) for DTI_J48, Oncologist_1 and Oncologist_2]

        DTI (J48)   Oncologist 1   Oncologist 2
AUC     0.761       0.847          0.810
DTI: decision tree induction


Discussion
• In some domains, such as finance and banking, KDD
has already shown great benefit to the industry,
but medicine is far behind.

Discussion
• There is no gold-standard method for the
pre-processing step or for handling missing values.
• Consulting domain experts in the pre-processing
step can save a lot of effort and result
in a better subset of variables for the next steps.
• CCA can handle multiple outcomes, which
distinguishes it from methods such as MRA
and Cox RA.

Discussion
• The predictions of the DTI model do not differ
significantly from those made by domain experts.
• Compared to other data mining methods, DTI is
more explainable. In contrast, ANN works as a
“black box”.
• Using an appropriately small subset of predictors
together with pruning techniques, DTI produces a
smaller tree.

Discussion
• AUC computed on an independent dataset is a
good technique for comparing different models.
• A predictive model that is built on the
most relevant and important predictors of an event
can perform better.
• The quality of cancer registers can be improved by
adding variables with high predictive ability.

Future Work
• Results from the presented methodology
can be used to build a decision support
application in the field of oncology.

Future Work
• Why is decision support important?
– Limited resources.
– Need for ways to improve health care processes
and their outcomes.
– Improving the decision-making ability of clinicians
by allowing more or better decisions within the
constraints of their knowledge and time.

Future Work
• In medicine, assisting clinicians in their
decision making at the right time, in the right
place and in a suitable format is valuable.
• Providing reminders, interpretations or
advice specific to a given patient at a
particular time is advantageous.

Thanks for your
attention!
amira@imt.liu.se
