Oracle's In-Database Statistical Functions
[Diagram labels: Data Warehousing, ETL, OLAP, Statistics, Data Mining]
Charlie Berger
Sr. Director Product Management,
Data Mining Technologies
Oracle Corporation
charlie.berger@oracle.com
Copyright 2007 Oracle Corporation
Synopsis
Oracle has delivered on a multi-year strategy to transform the
database from a data repository to an analytical database by
bringing the "analytics" to the data (data mining, text mining, and
statistical functions).
Agenda
Introduction
Oracle's In-Database Statistical Functions
Several Simple Demonstrations
Opportunities for Use Cases
Hands-on Exercises
User Stories
A
B
C
Market Trends
Analytics Provide Competitive Value
Competing on Analytics, by Tom Davenport
Some companies have built their very businesses
on their ability to collect, analyze, and act on data.
Although numerous organizations are embracing analytics, only a
handful have achieved this level of proficiency. But analytics
competitors are the leaders in their varied fields: consumer products,
finance, retail, and travel and entertainment among them.
Organizations are moving beyond query and reporting
- IDC 2006
Market Trends
Analytics Save Lives
Super Crunchers, by Ian Ayres
In December 2004, [Berwick] brazenly announced a plan to save 100,000
lives over the next year and a half. The 100,000 Lives Campaign challenged
hospitals to implement six changes in care to prevent avoidable deaths.
He noticed that thousands of ICU patients die each year from infections
after a central line catheter is placed in their chests. About half of all intensive
care patients have central line catheters, and ICU infections are deadly
(carrying mortality rates of up to 20 percent). He then looked to see if there
was any statistical evidence of ways to reduce the chance of infection. He
found a 2004 article in Critical Care Medicine that showed that systematic
hand-washing (combined with a bundle of improved hygienic procedures such
as cleaning the patient's skin with an antiseptic called chlorhexidine) could
reduce the risk of infection from central-line catheters by more than 90
percent. Berwick estimated that if all hospitals just implemented this one
bundle of procedures, they might be able to save as many as 25,000 lives per
year.
New York Times, August 23, 2007, Attack of the Super Crunchers:
Adventures in Data Mining, By Melissa Lafsky
[Figure: competitive advantage versus degree of intelligence. Access & Reporting levels, lowest first: standard reports ("What happened?"), ad hoc reports, query/drill down, alerts. Analytics levels: statistical analysis, forecasting/extrapolation, predictive modeling, optimization.]
Source: Competing on Analytics, by T. Davenport & J. Harris
Definition: Statistics
Statistics is a mathematical science pertaining to the
collection, analysis, interpretation or explanation, and
presentation of data. It is applicable to a wide variety
of academic disciplines, from the physical and social
sciences to the humanities. Statistics are also used for
making informed decisions and misused for other
reasons in all areas of business and government.
http://en.wikipedia.org/wiki/Statistics
Definitions: Statistics
Statistical methods can be used to summarize or
describe a collection of data; this is called descriptive
statistics. In addition, patterns in the data may be
modeled in a way that accounts for randomness and
uncertainty in the observations, and then used to draw
inferences about the process or population being
studied; this is called inferential statistics. Both
descriptive and inferential statistics comprise applied
statistics.
http://en.wikipedia.org/wiki/Statistics
Statistical Concepts
Descriptive Statistics
Correlations
LAG/LEAD functions
Direct inter-row reference using offsets
Statistical Aggregates
Correlation, linear regression family, covariance
Linear regression
Fitting of an ordinary-least-squares regression line
to a set of number pairs.
Frequently combined with the COVAR_POP,
COVAR_SAMP, and CORR functions.
Cross Tabs
Enhanced with % statistics: chi squared, phi coefficient,
Cramer's V, contingency coefficient, Cohen's kappa
Hypothesis Testing
Student t-test, F-test, Binomial test, Wilcoxon Signed
Ranks test, Chi-square, Mann-Whitney test, Kolmogorov-Smirnov test, One-way ANOVA
Distribution Fitting
Kolmogorov-Smirnov Test, Anderson-Darling Test, Chi-Squared Test, Normal, Uniform, Weibull, Exponential
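As a rough illustration of the statistical aggregates listed above, the query below fits an ordinary-least-squares line and reports the covariance and correlation alongside it. It is a sketch only. It assumes the CBERGER.LYMPHOMA table used in the later examples, and the pairing of SURVIVAL_TIME_MO (dependent) with SIZE_TUMOR_MM (independent) is chosen for illustration and does not come from the presentation.

```sql
-- Sketch: the column pairing is an assumption made for illustration.
-- REGR_* functions take the dependent variable first, then the independent one.
SELECT REGR_SLOPE(SURVIVAL_TIME_MO, SIZE_TUMOR_MM)     AS slope,
       REGR_INTERCEPT(SURVIVAL_TIME_MO, SIZE_TUMOR_MM) AS intercept,
       REGR_R2(SURVIVAL_TIME_MO, SIZE_TUMOR_MM)        AS r_squared,
       COVAR_POP(SURVIVAL_TIME_MO, SIZE_TUMOR_MM)      AS covar_pop,
       COVAR_SAMP(SURVIVAL_TIME_MO, SIZE_TUMOR_MM)     AS covar_samp,
       CORR(SURVIVAL_TIME_MO, SIZE_TUMOR_MM)           AS pearson_r
FROM   CBERGER.LYMPHOMA;
```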
Descriptive Statistics
MEDIAN & MODE
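The SQL for this slide did not survive extraction. A minimal sketch of what such a query might look like, assuming the ADM_PULSE column of the LYMPHOMA table used elsewhere in the deck:

```sql
-- Sketch: median and mode of a numeric column.
-- When several values tie for the mode, STATS_MODE returns only one of them.
SELECT MEDIAN(ADM_PULSE)     AS median_pulse,
       STATS_MODE(ADM_PULSE) AS mode_pulse
FROM   CBERGER.LYMPHOMA;
```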
DBMS_STAT_FUNCS Package
SUMMARY procedure
The SUMMARY procedure is used to summarize a numerical column
(ADM_PULSE); the summary is returned as a record of type summaryType
DECLARE
  v_ownername   varchar2(8);
  v_tablename   varchar2(50);
  v_columnname  varchar2(50);
  v_sigma_value number;
  type n_arr1 is varray(5) of number;
  type num_table1 is table of number;
  s1 dbms_stat_funcs.summaryType;
BEGIN
  v_ownername   := 'cberger';
  v_tablename   := 'LYMPHOMA';
  v_columnname  := 'ADM_PULSE';
  v_sigma_value := 3;
  dbms_stat_funcs.summary(p_ownername => v_ownername, p_tablename => v_tablename,
                          p_columnname => v_columnname, p_sigma_value => v_sigma_value, s => s1);
END;
/
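The block above fills the record but does not display it. One way to inspect the result, sketched here under the assumption that the session runs in SQL*Plus or a similar client where SET SERVEROUTPUT ON is available, is to print a few fields of the summaryType record:

```sql
-- Sketch: print selected fields of the summary record.
-- SET SERVEROUTPUT ON is a client (SQL*Plus) command, not PL/SQL.
SET SERVEROUTPUT ON
DECLARE
  s1 dbms_stat_funcs.summaryType;
BEGIN
  -- Same owner, table, column, and sigma value as the slide's example.
  dbms_stat_funcs.summary(p_ownername   => 'cberger',
                          p_tablename   => 'LYMPHOMA',
                          p_columnname  => 'ADM_PULSE',
                          p_sigma_value => 3,
                          s             => s1);
  dbms_output.put_line('Count:  ' || s1.count);
  dbms_output.put_line('Min:    ' || s1.min);
  dbms_output.put_line('Max:    ' || s1.max);
  dbms_output.put_line('Mean:   ' || round(s1.mean, 2));
  dbms_output.put_line('Median: ' || s1.median);
  dbms_output.put_line('Stddev: ' || round(s1.stddev, 2));
END;
/
```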
A subset of the data that is returned after execution of the PL/SQL package summarizes the use of the different SUMMARY procedures.
Hypothesis Testing
Parametric Tests
Parametric tests make assumptions about the data, typically that the
data is normally distributed, among other assumptions.
T-Test
T-tests are used to measure the significance of
a difference of means.
T-tests include the following:
One-sample T-test
Paired-samples T-test
Independent-samples T-test (pooled variances)
Independent-samples T-test (unpooled variances)
Basic Example
Compare the difference in blood pressure between people who eat meat
frequently and those who don't.
One-Sample T-Test
STATS_T_TEST_*
The t-test functions are:
STATS_T_TEST_ONE: A one-sample t-test
STATS_T_TEST_PAIRED: A two-sample, paired t-test (also known as
a crossed t-test)
STATS_T_TEST_INDEP: A t-test of two independent groups with the
same variance (pooled variances)
STATS_T_TEST_INDEPU: A t-test of two independent groups with
unequal variance (unpooled variances)
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14200/functions157.htm
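For the two-group variants listed above, a query might look like the following sketch. It assumes the LYMPHOMA table from the other examples, with GENDER coded '0' for men and '1' for women as in the F-test slide. The choice of SIZE_TUMOR_MM as the measured value is an illustration, not a query from the presentation.

```sql
-- Sketch: independent-samples t-test with unpooled variances.
-- The first argument is the grouping column, the second the sample values.
-- The fourth argument ('0') names the group treated as the first sample
-- when computing the signed statistic.
SELECT AVG(DECODE(GENDER, '0', SIZE_TUMOR_MM, NULL)) AS mean_tumor_men,
       AVG(DECODE(GENDER, '1', SIZE_TUMOR_MM, NULL)) AS mean_tumor_women,
       STATS_T_TEST_INDEPU(GENDER, SIZE_TUMOR_MM, 'STATISTIC', '0') AS t_observed,
       STATS_T_TEST_INDEPU(GENDER, SIZE_TUMOR_MM)                  AS two_sided_p_value
FROM   CBERGER.LYMPHOMA;
```

Swapping STATS_T_TEST_INDEPU for STATS_T_TEST_INDEP would give the pooled-variance version.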
One-Sample T-Test
Query compares the mean of SURVIVAL_TIME_MO
to the assumed value of 35:
SELECT avg(SURVIVAL_TIME_MO) group_mean,
stats_t_test_one(SURVIVAL_TIME_MO, 35,
'STATISTIC') t_observed,
stats_t_test_one(SURVIVAL_TIME_MO, 35)
two_sided_p_value
FROM LYMPHOMA;
F-Test
Query compares the variance in SIZE_TUMOR_MM
between males and females:
SELECT variance(decode(GENDER,'0', SIZE_TUMOR_MM, null)) var_tumor_men,
variance(decode(GENDER,'1', SIZE_TUMOR_MM,null)) var_tumor_women,
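The query on the slide is cut off after the two VARIANCE expressions. A complete query along the same lines might look like the sketch below. The STATS_F_TEST columns and the FROM clause are assumptions patterned on the other examples in the deck, not the presenter's original text.

```sql
-- Sketch: F-test comparing tumor-size variance between the two GENDER groups.
-- The fourth argument ('0') names the group whose variance is the numerator.
SELECT VARIANCE(DECODE(GENDER, '0', SIZE_TUMOR_MM, NULL)) AS var_tumor_men,
       VARIANCE(DECODE(GENDER, '1', SIZE_TUMOR_MM, NULL)) AS var_tumor_women,
       STATS_F_TEST(GENDER, SIZE_TUMOR_MM, 'STATISTIC', '0') AS f_statistic,
       STATS_F_TEST(GENDER, SIZE_TUMOR_MM)                   AS two_sided_p_value
FROM   CBERGER.LYMPHOMA;
```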
F-Test
Query compares the mean SIZE_REDUCTION across TREATMENT_PLANs with a
one-way ANOVA F-test, grouped by GENDER:
SELECT GENDER,
stats_one_way_anova(TREATMENT_PLAN,
SIZE_REDUCTION,'F_RATIO') f_ratio,
stats_one_way_anova(TREATMENT_PLAN,
SIZE_REDUCTION,'SIG') p_value, AVG(SIZE_REDUCTION)
FROM CBERGER.LYMPHOMA
GROUP BY GENDER ORDER BY GENDER;
One-Way ANOVA
In statistics, analysis of variance (ANOVA, or
sometimes A.N.O.V.A.) is a collection of statistical
models, and their associated procedures, in which
the observed variance is partitioned into
components due to different explanatory variables.
Example
Group A is given vodka, Group B is given gin, and Group C
is given a placebo. All groups are then tested with a memory
task. A one-way ANOVA can be used to assess the effect of
the various treatments (that is, the vodka, gin, and placebo).
http://en.wikipedia.org/wiki/Statistics
One-Way ANOVA
Query compares the average SIZE_REDUCTION within different
TREATMENT_PLANS Grouped By LYMPH_TYPE:
SELECT LYMPH_TYPE,
stats_one_way_anova(TREATMENT_PLAN,
SIZE_REDUCTION,'F_RATIO') f_ratio,
stats_one_way_anova(TREATMENT_PLAN,
SIZE_REDUCTION,'SIG') p_value
FROM CBERGER.LYMPHOMA
GROUP BY LYMPH_TYPE ORDER BY 1;
Hypothesis Testing
(Nonparametric)
Nonparametric tests are used when certain assumptions
about the data are questionable.
This may include the difference between samples that are
not normally distributed.
All tests involving ordinal scales (in which data is ranked)
are nonparametric.
Nonparametric tests supported in Oracle Database 10g:
Binomial test
Wilcoxon Signed Ranks test
Mann-Whitney test
Kolmogorov-Smirnov test
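As a sketch of how two of these tests might be called, the query below compares the distribution of tumor size between the two GENDER groups. The table and columns are taken from the other examples in the deck. The choice of test variable is an assumption made for illustration.

```sql
-- Sketch: Mann-Whitney and Kolmogorov-Smirnov tests on two independent groups.
-- The first argument is the grouping column, the second the sample values.
-- With no third argument, STATS_MW_TEST returns the two-sided significance
-- and STATS_KS_TEST returns the significance.
SELECT STATS_MW_TEST(GENDER, SIZE_TUMOR_MM, 'STATISTIC') AS mw_z_statistic,
       STATS_MW_TEST(GENDER, SIZE_TUMOR_MM)              AS mw_two_sided_p,
       STATS_KS_TEST(GENDER, SIZE_TUMOR_MM, 'STATISTIC') AS ks_d_statistic,
       STATS_KS_TEST(GENDER, SIZE_TUMOR_MM)              AS ks_p_value
FROM   CBERGER.LYMPHOMA;
```

The binomial and Wilcoxon signed ranks tests are available through STATS_BINOMIAL_TEST and STATS_WSR_TEST respectively.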
Customer Example
"..Our experience suggests that Oracle 10g Statistics and Data Mining
features can reduce development effort of analytical systems by an
order of magnitude."
Sumeet Muju
Senior Member of Professional Staff, SRA International (SRA supports NIH bioinformatics
development projects)
Correlation Functions
as TREATMENT_PLAN
from CBERGER.LYMPHOMA
GROUP BY TREATMENT_PLAN;
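Only the tail of the slide's correlation query survives. A comparable query, sketched here with an assumed pair of columns (SIZE_TUMOR_MM and SURVIVAL_TIME_MO, both of which appear in other examples), could report Pearson, Spearman, and Kendall coefficients per treatment plan:

```sql
-- Sketch: the correlated columns are an assumption; the original pair is not
-- recoverable from the source.
SELECT TREATMENT_PLAN,
       CORR(SIZE_TUMOR_MM, SURVIVAL_TIME_MO)   AS pearson_r,
       CORR_S(SIZE_TUMOR_MM, SURVIVAL_TIME_MO) AS spearman_rho,
       CORR_K(SIZE_TUMOR_MM, SURVIVAL_TIME_MO) AS kendall_tau_b
FROM   CBERGER.LYMPHOMA
GROUP BY TREATMENT_PLAN;
```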
Cross Tabulations
This query analyzes the strength of the association between
TREATMENT_PLAN and GENDER Grouped By LYMPH_TYPE
using a cross tabulation:
SELECT LYMPH_TYPE,
stats_crosstab(GENDER, TREATMENT_PLAN,
'CHISQ_OBS') chi_squared,
stats_crosstab(GENDER, TREATMENT_PLAN,
'CHISQ_SIG') p_value,
stats_crosstab(GENDER, TREATMENT_PLAN,
'PHI_COEFFICIENT') phi_coefficient
FROM CBERGER.LYMPHOMA
GROUP BY LYMPH_TYPE ORDER BY 1;
Cross Tabulations
The STATS_CROSSTAB function takes as arguments two expressions
(the two variables being analyzed) and a value that determines which test to
perform. These values include CHISQ_OBS, CHISQ_SIG, CHISQ_DF,
PHI_COEFFICIENT, CRAMERS_V, CONT_COEFFICIENT, and COHENS_K.
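The slide's table of return values did not survive extraction. As a sketch, the query below requests three values that the earlier cross-tabulation example does not use (CHISQ_DF, CRAMERS_V, and CONT_COEFFICIENT, as named in the Oracle SQL Reference), on the same columns as that example:

```sql
-- Sketch: additional STATS_CROSSTAB return values on the same variables.
SELECT LYMPH_TYPE,
       stats_crosstab(GENDER, TREATMENT_PLAN, 'CHISQ_DF')         AS degrees_of_freedom,
       stats_crosstab(GENDER, TREATMENT_PLAN, 'CRAMERS_V')        AS cramers_v,
       stats_crosstab(GENDER, TREATMENT_PLAN, 'CONT_COEFFICIENT') AS contingency_coefficient
FROM   CBERGER.LYMPHOMA
GROUP BY LYMPH_TYPE ORDER BY 1;
```

COHENS_K is left out here because Cohen's kappa is normally applied when both variables share the same set of categories, which is unlikely for gender and treatment plan.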
Distribution-Fitting Functions
Distribution-fitting functions in Oracle Database 10g
include the following:
NORMAL_DIST_FIT function
UNIFORM_DIST_FIT function
POISSON_DIST_FIT function
WEIBULL_DIST_FIT function
EXPONENTIAL_DIST_FIT function
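These routines are part of the DBMS_STAT_FUNCS package. The sketch below shows how a normality check on ADM_PULSE might be run. The parameter order (owner, table, column, test type, mean, standard deviation, significance) follows the Oracle package reference as best recalled here and should be confirmed against the documentation for the release in use. The choice of column and of the Shapiro-Wilks test are assumptions.

```sql
-- Sketch: test whether ADM_PULSE is consistent with a normal distribution.
-- SET SERVEROUTPUT ON is a client (SQL*Plus) command, not PL/SQL.
SET SERVEROUTPUT ON
DECLARE
  v_mean  NUMBER;
  v_stdev NUMBER;
  v_sig   NUMBER;
BEGIN
  -- Estimate the parameters of the normal distribution to compare against.
  SELECT AVG(ADM_PULSE), STDDEV(ADM_PULSE)
  INTO   v_mean, v_stdev
  FROM   CBERGER.LYMPHOMA;

  dbms_stat_funcs.normal_dist_fit('cberger', 'LYMPHOMA', 'ADM_PULSE',
                                  'SHAPIRO_WILKS', v_mean, v_stdev, v_sig);
  dbms_output.put_line('Significance: ' || v_sig);
END;
/
```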
http://www.oracle.com/technology/products/bi/stats_fns/index.html
In-Database Statistics
Advantages
[Diagram labels: Oracle 10g DB, ETL, OLAP, Statistics, Data Mining]
Industry Analysts
PREDICTIVE ANALYTICS: Extending the Value of Your
Data Warehousing Investment, By Wayne W. Eckerson
According to our survey, most organizations plan to significantly
increase the analytic processing within a data warehouse database in
the next three years, particularly for model building and scoring, which
show 88% climbs. The amount of data preparation done in databases
will only climb 36% in that time, but it will be done by almost two-thirds
of all organizations (60%), double the rate of companies planning to
use the database to create or score analytical models.
It's surprising that about one-third of organizations plan to build
analytical models in databases within three years.
"We leverage the data warehouse database when possible," says one
analytics manager. He says most analysts download a data sample to
their desktop and then upload it to the data warehouse once it's
completed. "Ultimately, however, everything will run in the data
warehouse," the manager says.
http://download.101com.com/pub/tdwi/Files/PA_Report_Q107_F.pdf
Analytics vs.
1. In-Database Analytics Engine
Basic Statistics (Free)
Data Mining
Text Mining
3. IT Platform
SQL (standard)
Java (standard)
3. IT Platform
SAS Code (proprietary)
Oracle 11g DB
Data Warehousing
ETL
OLAP Statistics
Data Mining
Oracle
http://www.oracle.com/corporate/analyst/reports/infrastructure/bi_dw/208699e.pdf
References
Source: Oracle 10gR2 Statistics Functions, OLSUG08 Workshop, Henri B. Tuthill, AstraZeneca & Charlie Berger, Oracle
Hands-on Exercises
Quick Start Statistics
More Information:
Oracle Data Mining 10g
oracle.com/technology/products/bi/odm/index.html
Contact Information:
Email: Charlie.berger@oracle.com
Questions & Answers
This presentation is for informational purposes only and may not be incorporated into a contract or agreement.