Con2803 PDF 2803 0001

The Evolution of Histograms
Jonathan Lewis
jonathanlewis.wordpress.com
www.jlcomp.demon.co.uk
Who am I ?
Independent Consultant
28+ years in IT
24+ using Oracle
Strategy, Design, Review,
Briefings, Educational,
Trouble-shooting
Member of the Oak Table Network
Oracle ACE Director
Oracle author of the year 2006
Select Editors choice 2007
UKOUG Inspiring Presenter 2011
UKOUG Council member 2012
ODTUG 2012 Best Presenter (d/b)
O1 visa for USA
Jonathan Lewis
2011
Title
2 / 30
O-1 Visa
An alien of extraordinary ability

Highlights
Why Histograms
Current mechanisms
Problems and workarounds
New mechanisms
Sample Data (a)

S     COUNT(*)
-  -----------
P       52,352
C    9,416,360
O        3,499
L       86,084

CODE  DESCRIPTION
----  -----------
A     ASSIGNED
B     HANDED BACK
C     CLOSED
L     LOGGED
O     HANDED OVER
P     PENDING

Standard Strategy
Frequency histogram with literals in SQL
Other ideas
Change 'commonest value' to null
Virtual columns / Function-based indexes
List partitions
Problems
Coding to take advantage of histogram

Limit on distinct values
Resources needed for gathering
Accuracy of histogram
Timing of gathering
Limits (a)
select
        specifier, count(*)
from
        messages
group by
        specifier
order by
        count(*) desc
;

Distinct Specifiers = 352
Frequency Limit is 254
Height-balanced less precise
Popular values use lots of buckets

SPECIFIER     COUNT(*)
---------  -----------
BVGFJB       1,851,177
LYYVLH         719,582
MTVMIE         672,823
YETSDP         659,661
DAJYGS         504,641
...
KDCFVJ          75,328
JITCRI          74,104
DNRYKC          70,029
BEWPEQ          68,681
...
JXXXRE               1
OHMNVU               1
YGOBWQ               1
UBBWQH               1
Limits (b)
Interesting arithmetic - for THIS data set
Top N values   % of data
------------   ---------
         140       99.00
         210       99.90
         250       99.98

Each "bucket" represents roughly 40,000 rows (10M / 254)

A value with 40,001 rows MIGHT get captured twice
A value with 79,999 rows MIGHT NOT get captured twice
In this data set there are 25 values that WILL get captured (ct > 80,001)
There are 35 values that might be captured one day, and not the next.
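
The "might / might not" arithmetic above can be checked with a small Python sketch. It is not Oracle's code: it assumes the slide's rounded bucket size of 40,000 rows and simply counts how many bucket endpoints (every 40,000th row of the sorted data) fall inside a run of identical values. The function names are illustrative only.

```python
BUCKET_SIZE = 40_000  # the slide's rounded figure; 10,000,000 / 254 is nearer 39,370


def endpoints_hit(start_row, run_length, bucket_size=BUCKET_SIZE):
    """Count bucket endpoints (rows bucket_size, 2 * bucket_size, ...) that fall
    inside a run of identical values occupying rows
    start_row .. start_row + run_length - 1 of the sorted data (1-based)."""
    end_row = start_row + run_length - 1
    return end_row // bucket_size - (start_row - 1) // bucket_size


def hit_range(run_length, bucket_size=BUCKET_SIZE):
    """Fewest and most endpoints a run can cover, over every alignment of its first row."""
    hits = [endpoints_hit(start, run_length, bucket_size)
            for start in range(1, bucket_size + 1)]
    return min(hits), max(hits)


if __name__ == "__main__":
    for rows in (40_001, 79_999, 80_000):
        print(rows, hit_range(rows))
```

A value is only recognised as popular when it covers at least two endpoints, so a run of 40,001 to 79,999 rows can appear or disappear from one gather to the next as the data around it shifts.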
Limits (c)
12c allows 2,048 buckets
The default is still 254
Don't be in a rush to use the maximum
Don't forget the optstat history tables
There are several new columns
There are some new costs
Precision (a)
select
        status, count(*)
from
        orders
group by
        status
order by
        status
;

S   COUNT(*)
-  ---------
C    529,100
P        300
R        300
S        300
X    500,000

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'orders',
                estimate_percent => dbms_stats.auto_sample_size,
                method_opt       => 'for columns status size 10'
        );
end;
/
Precision (b)
select
        endpoint_number,
        endpoint_number - nvl(prev_endpoint,0)          frequency,
        chr(to_number(substr(hex_val, 2,2),'XX'))       status
from    (
        select
                endpoint_number,
                lag(endpoint_number,1) over(
                        order by endpoint_number
                )                                       prev_endpoint,
                to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val
        from
                user_tab_histograms
        where
                table_name  = 'ORDERS'
        and     column_name = 'STATUS'
        )
order by
        endpoint_number
/
http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/
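
The lag() arithmetic in the query can be sketched in a few lines of Python. This is only an illustration of the subtraction: the function name is made up, and the sample figures are the first set of 11.2.0.3 results shown in this deck.

```python
def frequencies(endpoint_numbers):
    """Turn cumulative endpoint numbers into per-value frequencies."""
    result = []
    previous = 0  # plays the part of nvl(prev_endpoint, 0) for the first row
    for endpoint in sorted(endpoint_numbers):
        result.append(endpoint - previous)
        previous = endpoint
    return result


if __name__ == "__main__":
    print(frequencies([2741, 2742, 2743, 5331]))
```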
Precision (c)
Results 11.2.0.3 - four attempts
ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2741       2741  C
           2742          1  P
           2743          1  R
           5331       2588  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2848       2848  C
           2849          1  P
           5629       2780  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2706       2706  C
           2708          2  P
           5355       2647  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2852       2852  C
           2854          2  P
           2856          2  R
           2859          3  S
           5472       2613  X

Missing values are NOT NICE
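
A rough Python calculation shows why the rare values come and go. It is a sketch under stated assumptions, not a model of Oracle's sampling: it assumes a sample of about 5,500 rows (the endpoint totals above range from 5,331 to 5,629) drawn independently from the 1,030,000 rows, and asks how often a value with only 300 rows is missed completely.

```python
def expected_in_sample(value_rows, table_rows, sample_rows):
    """Average number of sampled rows holding the value."""
    return sample_rows * value_rows / table_rows


def chance_of_missing(value_rows, table_rows, sample_rows):
    """Probability that no sampled row holds the value (independent draws assumed)."""
    return (1 - value_rows / table_rows) ** sample_rows


if __name__ == "__main__":
    print(expected_in_sample(300, 1_030_000, 5_500))
    print(chance_of_missing(300, 1_030_000, 5_500))
```

With fewer than two rows expected in the sample, each of the values P, R and S has a real chance of being left out of any one histogram.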

Basic Cost
select
        substrb(dump(val,16,0,32),1,120) ep, cnt
from    (
        select  /*+ lots of hints */
                "STATUS" val, count(*) cnt
        from
                "TEST_USER"."ORDERS" t          -- Could extract a sample
        where   "STATUS" is not null
        group by
                "STATUS"
        )
order by val

Rows     Row Source Operation
-------  ---------------------------------------------------
      5  SORT GROUP BY {various statistics etc.}
1030000   TABLE ACCESS FULL : {various statistics etc.}
Solution (a)
declare
        srec            dbms_stats.statrec;
        c_array         dbms_stats.chararray;

        m_distcnt       number;
        m_density       number;
        m_nullcnt       number;
        m_avgclen       number;
begin
        m_distcnt       := 5;
        m_density       := 0.00001;
        m_nullcnt       := 0;
        m_avgclen       := 1;

http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/

Solution (b)
        c_array         := dbms_stats.chararray('C', 'P', 'R', 'S', 'X');
        srec.bkvals     := dbms_stats.numarray (5000, 3, 3, 3, 5000);
        srec.epc        := 5;

        dbms_stats.prepare_column_values(srec, c_array);

        dbms_stats.set_column_stats(
                ownname         => user,
                tabname         => 'ORDERS',
                colname         => 'STATUS',
                distcnt         => m_distcnt,
                density         => m_density,
                nullcnt         => m_nullcnt,
                srec            => srec,
                avgclen         => m_avgclen
        );
end;
Precision (12c)
11.2.0.3 (four attempts)

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2741       2741  C
           2742          1  P
           2743          1  R
           5331       2588  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2848       2848  C
           2849          1  P
           5629       2780  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2706       2706  C
           2708          2  P
           5355       2647  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
           2852       2852  C
           2854          2  P
           2856          2  R
           2859          3  S
           5472       2613  X

12.1.0.0

ENDPOINT_NUMBER  FREQUENCY  STATUS
---------------  ---------  ------
         529100     529100  C
         529400        300  P
         529700        300  R
         530000        300  S
        1030000     500000  X

12c has enhanced the code for the calculation of "approximate NDV" so for a small number of distinct values it can produce an accurate frequency histogram at virtually no extra cost
Basic Principle
[Figure: a square grid of hash buckets, corners labelled 0, 15, 240 and 255]
The square is a visual aid only

The number of hash buckets is 2^64 (roughly 1.8 x 10^19)
Minimising cost
[Figure: the same square grid of hash buckets, corners labelled 0, 15, 240 and 255]
We only keep 16,384 items in the hash table for each column.
We discard half the table each time we reach this limit
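
The "keep 16,384, discard half" idea can be sketched in Python as follows. This is an illustration of the general technique and not Oracle's implementation: BLAKE2b stands in for whatever 64-bit hash Oracle uses, "discard half" is modelled by keeping only hashes whose next low-order bit is zero, and all names are invented for the example.

```python
import hashlib

MAX_ENTRIES = 16_384  # per-column limit quoted on the slide


def hash64(value):
    """Map a value to a 64-bit integer (BLAKE2b is a stand-in for Oracle's hash)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def approximate_ndv(values, max_entries=MAX_ENTRIES):
    """Return (ndv_estimate, splits, row_counts_by_hash) for an iterable of values."""
    splits = 0
    counts = {}  # retained hash value -> number of rows seen with that hash
    for value in values:
        h = hash64(value)
        if h & ((1 << splits) - 1):
            continue  # hash lies in a part of the domain that was discarded
        counts[h] = counts.get(h, 0) + 1
        while len(counts) >= max_entries:
            # Limit reached: halve the retained domain and drop hashes outside it.
            splits += 1
            mask = (1 << splits) - 1
            counts = {k: n for k, n in counts.items() if k & mask == 0}
    # Each split halves the sampled domain, so scale the survivors back up.
    return len(counts) * 2 ** splits, splits, counts


if __name__ == "__main__":
    estimate, splits, counts = approximate_ndv(["C"] * 5 + ["P"] * 2 + ["X"] * 3)
    print(estimate, splits, sorted(counts.values()))
```

When no split happens, the per-hash row counts are exact, which is the sense in which a frequency histogram for a small number of distinct values comes almost for free.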
Top-Frequency (12c)
select
        skewed, count(*)
from
        t1
group by
        skewed
order by
        skewed
;

If you wanted 18 buckets for this data (840 rows) you could (easily) fit the four least popular values into 1 bucket - leaving just 16 interesting values

If a small set of values accounts for most of the data, Oracle 12c can produce a frequency histogram for the popular values and use an estimate for the rest.

SKEWED  COUNT(*)
------  --------
     1         4
     2         8
     3        12
     4        16
     5        20
     6        24
     7        28
     8        32
     9        36
    10       116
    11        44
    12        48
    13        52
    14        56
    15        60
    16        64
    17        68
    18        72
    19        76
    20         4
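
Whether a data set like this qualifies for a top-frequency histogram can be checked with a short Python sketch. The threshold used here, that the top N values must cover at least (1 - 1/N) of the rows, is taken from my reading of Oracle's documentation rather than from the slide, so treat it as an assumption. The counts are the ones listed above.

```python
# Value -> row count, as listed on the slide (840 rows in total).
SKEWED_COUNTS = {value: 4 * value for value in range(1, 20)}
SKEWED_COUNTS[10] = 116
SKEWED_COUNTS[20] = 4


def top_n_coverage(counts, buckets):
    """Fraction of all rows held by the `buckets` most frequent values."""
    top = sorted(counts.values(), reverse=True)[:buckets]
    return sum(top) / sum(counts.values())


def qualifies_for_top_frequency(counts, buckets):
    """Assumed rule: the top N values must cover at least (1 - 1/N) of the rows."""
    return top_n_coverage(counts, buckets) >= 1 - 1 / buckets


if __name__ == "__main__":
    print(sum(SKEWED_COUNTS.values()))
    print(top_n_coverage(SKEWED_COUNTS, 18))
    print(qualifies_for_top_frequency(SKEWED_COUNTS, 18))
```

With 18 buckets the two values left out account for only 8 of the 840 rows, which is why this data set suits the new histogram type.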
Top-Frequency (12c)
select
        endpoint_value          epv,
        endpoint_number         epn,
        endpoint_number - lag(endpoint_number,1) over (
                order by endpoint_number
        )                       freq
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'SKEWED'
order by
        endpoint_value
;
(There is still a little flaw)

EPV  EPN  FREQ
---  ---  ----
  1    1
  4   17    16
  5   37    20
  6   61    24
  7   89    28
  8  121    32
  9  157    36
 10  273   116
 11  317    44
 12  365    48
 13  417    52
 14  473    56
 15  533    60
 16  597    64
 17  665    68
 18  737    72
 19  813    76
 20  814     1
Too many values (a)

Building a Height-Balanced Histogram
23 20 40 38 13
23 38 33 33 12
28 27 19 39 31
24 20 34 16 26
29 28 45 35 13
36 29 28 22 35
27 42 42 32 31
13 26 33 38 41
30 19 27 20 44
46 16 38 34 29
43 33 35 18 22
29 26 21 37 30
25 43 35 27 33
20 18 12 29 33
39 19  8 30 43
38 31 59 50 31
20 32 35 33 28
33 35 34 27 32
29 28 31 27 17
35 22 24 15 28

Sort

 8 21 28 32 37
12 22 28 33 38
12 22 28 33 38
13 22 28 33 38
13 23 29 33 38
13 23 29 33 38
15 24 29 33 39
16 24 29 33 39
16 25 29 33 40
17 26 29 34 41
18 26 30 34 42
18 26 30 34 42
19 27 30 35 43
19 27 31 35 43
19 27 31 35 43
20 27 31 35 44
20 27 31 35 45
20 27 31 35 46
20 28 32 35 50
20 28 32 36 59
Too many values (b)

 8 21 28 32 37
12 22 28 33 38
12 22 28 33 38
13 22 28 33 38
13 23 29 33 38
13 23 29 33 38
15 24 29 33 39
16 24 29 33 39
16 25 29 33 40
17 26 29 34 41
18 26 30 34 42
18 26 30 34 42
19 27 30 35 43
19 27 31 35 43
19 27 31 35 43
20 27 31 35 44
20 27 31 35 45
20 27 31 35 46
20 28 32 35 50
20 28 32 36 59

We have 100 items and 37 distinct values.

Assume we are limited to 20 buckets
After sorting the data we record the value of every 5th row. (100/20)
8 13 17 19 20 23 26 27 28 xx 29 31 32 33 34 35 36 38 41 43 59
29 is the only "popular" value with two buckets (i.e. 10 rows).

All other values are assumed to have (100 - 10) / (37 - 1) = 2.5 rows, which rounds to 3.
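
The endpoint selection and the non-popular estimate can be reproduced with a short Python sketch. It follows the steps as described on the slide (sort, take every 5th row, treat a value that closes more than one bucket as popular) and is not a copy of Oracle's code. The value counts are tallied from the sorted grid above, and the function names are illustrative.

```python
from collections import Counter

# value -> number of rows, tallied from the sorted grid (100 rows, 37 values)
FREQUENCIES = {
    8: 1, 12: 2, 13: 3, 15: 1, 16: 2, 17: 1, 18: 2, 19: 3, 20: 5, 21: 1,
    22: 3, 23: 2, 24: 2, 25: 1, 26: 3, 27: 6, 28: 6, 29: 6, 30: 3, 31: 5,
    32: 3, 33: 8, 34: 3, 35: 7, 36: 1, 37: 1, 38: 5, 39: 2, 40: 1, 41: 1,
    42: 2, 43: 3, 44: 1, 45: 1, 46: 1, 50: 1, 59: 1,
}


def height_balanced_endpoints(rows, buckets):
    """Lowest value, then the value found at the end of each equal-sized bucket."""
    ordered = sorted(rows)
    step = len(ordered) // buckets
    return [ordered[0]] + [ordered[i * step - 1] for i in range(1, buckets + 1)]


def popular_values(endpoints):
    """Values that close more than one bucket, with the number of buckets closed."""
    closes = Counter(endpoints[1:])
    return {value: n for value, n in closes.items() if n > 1}


def non_popular_rows_per_value(rows, buckets):
    """(rows - popular rows) / (distinct values - popular values)."""
    step = len(rows) // buckets
    popular = popular_values(height_balanced_endpoints(rows, buckets))
    popular_rows = sum(popular.values()) * step
    return (len(rows) - popular_rows) / (len(set(rows)) - len(popular))


ROWS = [value for value, n in FREQUENCIES.items() for _ in range(n)]

if __name__ == "__main__":
    endpoints = height_balanced_endpoints(ROWS, 20)
    print(endpoints)
    print(popular_values(endpoints))
    print(non_popular_rows_per_value(ROWS, 20))
```

Values such as 33 (8 rows) and 35 (7 rows) are more common than 29 (6 rows) yet close only one bucket each, so the histogram does not see them as popular.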
Lots more popular values
13 17 19 20 23 26 27 28
(10.2.0.4+)
29 31 32 33 34 35 36 38 41 43 59
Solution (8i - 11g)

Fake it with a frequency histogram.
Pick the 254 most popular values.
Include the low and high values
Fake selectivity for remainder
Needs one entry with double the desired cardinality
Could assign this to the low/high value if introduced
Otherwise change the value with the lowest frequency
Too many values (12c)

 8 21 28 32 37
12 22 28 33 38
12 22 28 33 38
13 22 28 33 38
13 23 29 33 38
13 23 29 33 38
15 24 29 33 39
16 24 29 33 39
16 25 29 33 40
17 26 29 34 41
18 26 30 34 42
18 26 30 34 42
19 27 30 35 43
19 27 31 35 43
19 27 31 35 43
20 27 31 35 44
20 27 31 35 45
20 27 31 35 46
20 28 32 35 50
20 28 32 36 59
Hybrid Histogram
select
        endpoint_number,
        endpoint_value,
        endpoint_repeat_count
from
        user_tab_histograms
where
        table_name = 'T1'
;

EPN  EPV  REP
---  ---  ---
  1    8    1
  6   13    3
 12   18    2
 20   20    5
 26   23    2
 32   26    3
 38   27    6
 44   28    6
 50   29    6
 58   31    5
 69   33    8
 79   35    7
 86   38    5    <- 7 rows in the bucket; 38 appears 5 times
 90   41    1
 92   42    2
 95   43    3
 96   44    1
 97   45    1
 98   46    1
100   59    1

This looks like an old frequency histogram, but each bucket has a "repeat count" showing how often the highest value appears in the bucket.
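
The hybrid figures can be read back with a small Python sketch. It only decodes the three columns shown on the slide and does not attempt to reproduce how Oracle chooses the bucket boundaries. The popularity test (repeat count greater than the average bucket size, here 100 / 20 = 5) is my understanding of the rule and should be treated as an assumption.

```python
# (endpoint_number, endpoint_value, endpoint_repeat_count) as shown on the slide
HYBRID = [
    (1, 8, 1), (6, 13, 3), (12, 18, 2), (20, 20, 5), (26, 23, 2),
    (32, 26, 3), (38, 27, 6), (44, 28, 6), (50, 29, 6), (58, 31, 5),
    (69, 33, 8), (79, 35, 7), (86, 38, 5), (90, 41, 1), (92, 42, 2),
    (95, 43, 3), (96, 44, 1), (97, 45, 1), (98, 46, 1), (100, 59, 1),
]


def bucket_rows(histogram):
    """Rows in each bucket, keyed by the bucket's highest value."""
    sizes = {}
    previous = 0
    for endpoint_number, endpoint_value, _ in histogram:
        sizes[endpoint_value] = endpoint_number - previous
        previous = endpoint_number
    return sizes


def popular_values(histogram):
    """Endpoint values whose repeat count exceeds the average bucket size (assumed rule)."""
    total_rows = histogram[-1][0]
    average_bucket = total_rows / len(histogram)
    return [value for _, value, repeats in histogram if repeats > average_bucket]


if __name__ == "__main__":
    print(bucket_rows(HYBRID)[38])
    print(popular_values(HYBRID))
```

On that reading the hybrid histogram identifies five popular values in this data, where the height-balanced histogram found only one.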
SQL (top-N pt.1)

SQL behind basic "approximate NDV" (single column table - 11g)
select
        /*+ {lots of hints} */
        to_char(count("VALUE")),
        to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
        to_char(substrb(dump(max("VALUE"),16,0,64),1,240))
from
        "TEST_USER"."T1" t
        /* NDV,NIL,NIL*/

SQL behind creating a histogram with 18 buckets

select
        /*+ {lots of hints} */
        to_char(count("VALUE")),
        to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
        to_char(substrb(dump(max("VALUE"),16,0,64),1,240)),
        count(rowidtochar(rowid))
from
        "TEST_USER"."T1" t
        /* TOPN,NIL,NIL,RWID,U18U*/
SQL (top-N pt.2)

select
        substrb(dump("VALUE",16,0,64),1,240) val,
        rowidtochar(rowid) rwid
from
        "TEST_USER"."T1" t
where rowid in (
        chartorowid('AAAWaHAAFAAAAEEAAB'),chartorowid('AAAWaHAAFAAAAEEAAC'),
        chartorowid('AAAWaHAAFAAAAEEAAD'),chartorowid('AAAWaHAAFAAAAEEAAE'),
        chartorowid('AAAWaHAAFAAAAEEAAF'),chartorowid('AAAWaHAAFAAAAEEAAG'),
        chartorowid('AAAWaHAAFAAAAEEAAH'),chartorowid('AAAWaHAAFAAAAEEAAI'),
        chartorowid('AAAWaHAAFAAAAEEAAJ'),chartorowid('AAAWaHAAFAAAAEEAAK'),
        chartorowid('AAAWaHAAFAAAAEEAAL'),chartorowid('AAAWaHAAFAAAAEEAAM'),
        chartorowid('AAAWaHAAFAAAAEEAAN'),chartorowid('AAAWaHAAFAAAAEEAAO'),
        chartorowid('AAAWaHAAFAAAAEEAAP'),chartorowid('AAAWaHAAFAAAAEEAAQ'),
        chartorowid('AAAWaHAAFAAAAEFAAA'),chartorowid('AAAWaHAAFAAAAEFAAB')
)
order by "VALUE"
SQL (hybrid)
select
        substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
        (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
        substrb(dump(max(val) over(),16,0,64),1,20) maxval,
        substrb(dump(min(val) over(),16,0,64),1,20) minval
from
        (
        select
                val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
                (case when freq > ((sum(freq) over())/15) then 1 else 0 end) pop
        from    (
                select
                        "VALUE" val, count("VALUE") freq
                from
                        "TEST_USER"."T1" t
                where
                        "VALUE" is not null
                group by
                        "VALUE"
                )
        )
order by val
/

With only 15 buckets this dataset got a hybrid histogram
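
The "pop" flag in this query marks a value as popular when its frequency exceeds the row count divided by the number of buckets (15 here). The Python sketch below applies the same test and computes the popcnt and popfreq figures. It assumes, purely for illustration, that the data is the 100-row example used earlier in the deck; the slide does not say which data set T1 held.

```python
# value -> number of rows, tallied from the 100-row example earlier in the deck
FREQUENCIES = {
    8: 1, 12: 2, 13: 3, 15: 1, 16: 2, 17: 1, 18: 2, 19: 3, 20: 5, 21: 1,
    22: 3, 23: 2, 24: 2, 25: 1, 26: 3, 27: 6, 28: 6, 29: 6, 30: 3, 31: 5,
    32: 3, 33: 8, 34: 3, 35: 7, 36: 1, 37: 1, 38: 5, 39: 2, 40: 1, 41: 1,
    42: 2, 43: 3, 44: 1, 45: 1, 46: 1, 50: 1, 59: 1,
}


def popularity_summary(frequencies, buckets):
    """Mirror the query's cdn, ndv, popcnt and popfreq columns."""
    cdn = sum(frequencies.values())   # total non-null rows
    ndv = len(frequencies)            # number of distinct values
    popular = {v: f for v, f in frequencies.items() if f > cdn / buckets}
    return {
        "cdn": cdn,
        "ndv": ndv,
        "popcnt": len(popular),
        "popfreq": sum(popular.values()),
        "popular_values": sorted(popular),
    }


if __name__ == "__main__":
    print(popularity_summary(FREQUENCIES, 15))
```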
SQL (old height-balanced)

select
        min(minbkt), maxbkt,
        substrb(dump(min(val),16,0,32),1,120) minval,
        substrb(dump(max(val),16,0,32),1,120) maxval,
        sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
        sum(case when rep=1 then 1 else 0 end) unqrep
from
        (
        select
                val, min(bkt) minbkt, max(bkt) maxbkt,
                count(val) rep, count(val)*count(val) repsq
        from
                (
                select
                        "LN100" val, ntile(200) over (order by "LN100") bkt
                from
                        sys.ora_temp_1_ds_616 t
                where
                        "LN100" is not null
                )
        group by val
        )
group by maxbkt order by maxbkt
Conclusions for 12c

Use auto_sample_size
2,048 buckets is legal
The default is still 254, and it's likely to be adequate
Frequency / Top N histograms

Fast and accurate
Hybrid
Capture far more popular values, still samples, and costly
Timing is still important

May still want to create some by code