The Evolution of Histograms

Jonathan Lewis
jonathanlewis.wordpress.com
www.jlcomp.demon.co.uk

Who am I ?
Independent Consultant
28+ years in IT
24+ using Oracle
Strategy, Design, Review,
Briefings, Educational,
Trouble-shooting
Member of the Oak Table Network
Oracle ACE Director
Oracle author of the year 2006
Select Editors choice 2007
UKOUG Inspiring Presenter 2011
UKOUG Council member 2012
ODTUG 2012 Best Presenter (d/b)
O1 visa for USA
Jonathan Lewis
2011

Title
2 / 30

O-1 Visa

An alien of extraordinary ability



Highlights

Why Histograms
Current mechanisms
Problems and workarounds
New mechanisms


Sample Data (a)


S    COUNT(*)
P      52,352
C   9,416,360
O       3,499
L      86,084

CODE  DESCRIPTION
A     ASSIGNED
B     HANDED BACK
C     CLOSED
L     LOGGED
O     HANDED OVER
P     PENDING

Standard Strategy
Frequency histogram with literals in SQL
Other ideas
Change 'commonest value' to null
Virtual columns / Function-based indexes
List partitions
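
A minimal sketch of why this data wants a histogram, using only the four row counts on this slide. The "uniform guess" is an assumption about what an optimizer would estimate with no histogram (total rows divided by the number of distinct values); the names are illustrative.

```python
# Row counts per status code, copied from the Sample Data slide.
counts = {"P": 52_352, "C": 9_416_360, "O": 3_499, "L": 86_084}

total_rows = sum(counts.values())
# Assumption: without a histogram every value is treated as equally common.
uniform_estimate = total_rows / len(counts)


def estimate_error(code):
    """Ratio of the uniform guess to the real row count for one code."""
    return uniform_estimate / counts[code]


for code, actual in counts.items():
    print(f"{code}: actual {actual:>9,}  uniform guess {uniform_estimate:>12,.0f}  "
          f"ratio {estimate_error(code):8.2f}")
```

The rare codes are over-estimated by orders of magnitude and the common code is under-estimated, which is the case a frequency histogram with literal values is meant to fix.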


Problems

Coding to take advantage of histogram


Limit on distinct values
Resources needed for gathering
Accuracy of histogram
Timing of gathering


Limits (a)
select
    specifier, count(*)
from
    messages
group by
    specifier
order by
    count(*) desc
;
Distinct Specifiers = 352
Frequency Limit is 254
Height-balanced less precise
Popular values use lots of buckets

SPECIFIER    COUNT(*)
BVGFJB      1,851,177
LYYVLH        719,582
MTVMIE        672,823
YETSDP        659,661
DAJYGS        504,641
...
KDCFVJ         75,328
JITCRI         74,104
DNRYKC         70,029
BEWPEQ         68,681
...
JXXXRE              1
OHMNVU              1
YGOBWQ              1
UBBWQH              1


Limits (b)
Interesting arithmetic - for THIS data set
Top N values   % of data
140            99.00
210            99.90
250            99.98

Each "bucket" represents roughly 40,000 rows (10M / 254)


A value with 40,001 rows MIGHT get captured twice
A value with 79,999 rows MIGHT NOT get captured twice
In this data set there are 25 values that WILL get captured (ct > 80,001)
There are 35 values that might be captured one day, and not the next.
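
The "might / might not" arithmetic above can be sketched as follows. This is an illustration, not Oracle's code: it assumes a height-balanced histogram records the value at every bucket_rows-th row of the sorted data, and that a value is "popular" when it is recorded at least twice. The function name and the rounded bucket size of 40,000 are assumptions.

```python
def endpoint_hits(row_count, bucket_rows):
    """Fewest and most bucket endpoints that a run of `row_count` identical,
    adjacent rows can cover, depending on where the run happens to start."""
    fewest = row_count // bucket_rows          # floor
    most = -(-row_count // bucket_rows)        # ceiling
    return fewest, most


BUCKET_ROWS = 40_000   # the slide's rounding of 10M / 254

for rows in (39_999, 40_001, 79_999, 80_000):
    fewest, most = endpoint_hits(rows, BUCKET_ROWS)
    print(f"{rows:>6} rows: captured between {fewest} and {most} times")
```

Only when the fewest-case count reaches 2 is the value certain to look popular; anything between one and two buckets' worth of rows depends on where the value falls in the sorted data, which can change from one gather to the next.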


Limits (c)
12c allows 2,048 buckets
The default is still 254
Don't be in a rush to use the maximum
Don't forget the optstat history tables
There are several new columns
There are some new costs

Precision (a)
select
    status, count(*)
from
    orders
group by
    status
order by
    status
;

S   COUNT(*)
C    529,100
P        300
R        300
S        300
X    500,000

begin
    dbms_stats.gather_table_stats(
        ownname          => user,      -- required parameter, not visible on the slide
        tabname          => 'orders',
        estimate_percent => dbms_stats.auto_sample_size,
        method_opt       => 'for columns status size 10'
    );
end;
/

Precision (b)
select
    endpoint_number,
    endpoint_number - nvl(prev_endpoint,0)      frequency,
    chr(to_number(substr(hex_val, 2,2),'XX'))   status
from
    (
    select
        endpoint_number,
        lag(endpoint_number,1) over(
            order by endpoint_number
        )                                       prev_endpoint,
        to_char(endpoint_value,'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') hex_val
    from
        user_tab_histograms
    where
        table_name  = 'ORDERS'
    and column_name = 'STATUS'
    )
order by
    endpoint_number
/

http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/


Precision (c)
Results 11.2.0.3 - four attempts
ENDPOINT_NUMBER  FREQUENCY  STATUS
           2741       2741  C
           2742          1  P
           2743          1  R
           5331       2588  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2848       2848  C
           2849          1  P
           5629       2780  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2706       2706  C
           2708          2  P
           5355       2647  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2852       2852  C
           2854          2  P
           2856          2  R
           2859          3  S
           5472       2613  X

Missing values are NOT NICE
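
A rough sketch of how often a rare value should go missing. It assumes the histogram is built from a simple random sample of rows drawn without replacement (an assumption about the sampling, not a description of Oracle's internals) and uses the table size, the 300-row values and the four sample sizes shown above.

```python
import math

TABLE_ROWS = 1_030_000   # total rows in ORDERS
RARE_ROWS = 300          # rows for each of P, R and S


def prob_value_missing(sample_size, value_rows=RARE_ROWS, table_rows=TABLE_ROWS):
    """Chance that a random sample (without replacement) contains no row
    at all for a value that owns `value_rows` rows in the table."""
    log_p = sum(
        math.log((table_rows - value_rows - i) / (table_rows - i))
        for i in range(sample_size)
    )
    return math.exp(log_p)


for n in (5331, 5629, 5355, 5472):       # sample sizes from the four attempts
    expected = n * RARE_ROWS / TABLE_ROWS
    print(f"sample {n}: expect {expected:.2f} rows per rare value, "
          f"P(missing) = {prob_value_missing(n):.3f}")
```

With an expected one or two sampled rows per rare value, each of P, R and S has roughly a one-in-five chance of vanishing from any given histogram, which is consistent with the four different results above.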



Basic Cost
select
    substrb(dump(val,16,0,32),1,120) ep, cnt
from
    (
    select /*+ lots of hints */
        "STATUS" val, count(*) cnt
    from
        "TEST_USER"."ORDERS" t          -- Could extract a sample
    where "STATUS" is not null
    group by
        "STATUS"
    )
order by val

Rows       Row Source Operation
      5    SORT GROUP BY {various statistics etc.}
1030000      TABLE ACCESS FULL : {various statistics etc.}


Solution (a)
declare
    srec        dbms_stats.statrec;
    c_array     dbms_stats.chararray;

    m_distcnt   number;
    m_density   number;
    m_nullcnt   number;
    m_avgclen   number;

begin
    m_distcnt   := 5;
    m_density   := 0.00001;
    m_nullcnt   := 0;
    m_avgclen   := 1;

http://jonathanlewis.wordpress.com/2009/05/28/frequency-histograms/

Solution (b)
    -- values and their relative frequencies for the frequency histogram
    c_array     := dbms_stats.chararray('C', 'P', 'R', 'S', 'X');
    srec.bkvals := dbms_stats.numarray (5000, 3, 3, 3, 5000);
    srec.epc    := 5;

    dbms_stats.prepare_column_values(srec, c_array);
    dbms_stats.set_column_stats(
        ownname => user,
        tabname => 'ORDERS',
        colname => 'STATUS',
        distcnt => m_distcnt,
        density => m_density,
        nullcnt => m_nullcnt,
        srec    => srec,
        avgclen => m_avgclen
    );
end;
/
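
A small sketch of what the bkvals array amounts to. It assumes, as I understand the frequency-histogram format, that the per-value frequencies are stored as cumulative endpoint numbers; the variable names are illustrative only.

```python
from itertools import accumulate

values = ["C", "P", "R", "S", "X"]     # c_array on the slide
bkvals = [5000, 3, 3, 3, 5000]         # srec.bkvals on the slide

# Assumption: a frequency histogram stores the running total of the
# per-value frequencies as its endpoint numbers.
endpoint_numbers = list(accumulate(bkvals))

for value, freq, epn in zip(values, bkvals, endpoint_numbers):
    print(f"{value}: frequency {freq:>5}  endpoint_number {epn:>6}")
```

The ratios matter more than the absolute figures: 5000 : 3 is close to the real 529,100 : 300 and 500,000 : 300 proportions in the ORDERS table.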

Precision (12c)
11.2.0.3 (four attempts)

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2741       2741  C
           2742          1  P
           2743          1  R
           5331       2588  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2848       2848  C
           2849          1  P
           5629       2780  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2706       2706  C
           2708          2  P
           5355       2647  X

ENDPOINT_NUMBER  FREQUENCY  STATUS
           2852       2852  C
           2854          2  P
           2856          2  R
           2859          3  S
           5472       2613  X

12.1.0.0

ENDPOINT_NUMBER  FREQUENCY  STATUS
         529100     529100  C
         529400        300  P
         529700        300  R
         530000        300  S
        1030000     500000  X

12c has enhanced the code for the calculation of "approximate NDV", so for a
small number of distinct values it can produce an accurate frequency
histogram at virtually no extra cost.

Basic Principle
[Figure: a square grid of hash buckets, corner cells labelled 0, 15, 240 and 255]

The square is a visual aid only.
The number of hash buckets is 2^64 (roughly 1.8 x 10^19)


Minimising cost
[Figure: the same square grid of hash buckets, corner cells labelled 0, 15, 240 and 255]

We only keep 16,384 items in the hash table for each column.
We discard half the table each time we reach this limit
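
The two statements above can be sketched as a one-pass distinct-count estimate. This is a sketch of the general "keep a bounded set of hashes and halve the hash domain when full" technique, not Oracle's implementation: the hash function (blake2b), the choice of which half to discard and the function names are all assumptions.

```python
import hashlib


def hash64(value):
    """Deterministic 64-bit hash of a value (stand-in for Oracle's hash)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def approximate_ndv(values, limit=16_384):
    """Estimate the number of distinct values in one pass."""
    kept = set()      # hashes currently retained
    splits = 0        # how many times the hash domain has been halved
    for value in values:
        h = hash64(value)
        if h & ((1 << splits) - 1):
            continue                  # hash falls in a discarded part of the domain
        kept.add(h)
        while len(kept) > limit:      # table full: throw away (about) half of it
            splits += 1
            mask = (1 << splits) - 1
            kept = {x for x in kept if not x & mask}
    return len(kept) * 2 ** splits


print(approximate_ndv(["C", "P", "R", "S", "X"] * 1000))    # small NDV: exact
print(approximate_ndv(range(100_000), limit=1024))          # large NDV: estimate
```

While the number of distinct values stays below the limit nothing is discarded, so the count is exact. That is the case in which a per-value row count can be carried along with each hash to give an accurate frequency histogram almost for free.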


Top-Frequency (12c)
select
    skewed, count(*)
from
    t1
group by
    skewed
order by
    skewed
;

If you wanted 18 buckets for this data (840 rows) you could (easily) fit
the four least popular values into 1 bucket, leaving just 16 interesting
values.

If a small set of values accounts for most of the data, Oracle 12c can
produce a frequency histogram for the popular values and use an estimate
for the rest.

SKEWED  COUNT(*)
     1         4
     2         8
     3        12
     4        16
     5        20
     6        24
     7        28
     8        32
     9        36
    10       116
    11        44
    12        48
    13        52
    14        56
    15        60
    16        64
    17        68
    18        72
    19        76
    20         4
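
A sketch of the "small set of values accounts for most of the data" test, using the row counts on this slide. The threshold (1 - 1/n for n buckets) is how I understand the rule quoted in Oracle's 12c documentation; treat it, and the function names, as assumptions rather than a statement of the internal code.

```python
# Row counts from the slide: value v has 4*v rows, except 10 (116) and 20 (4).
counts = {v: 4 * v for v in range(1, 20)}
counts[10] = 116
counts[20] = 4

total_rows = sum(counts.values())


def top_n_coverage(counts, n):
    """Fraction of all rows owned by the n most popular values."""
    top = sorted(counts.values(), reverse=True)[:n]
    return sum(top) / sum(counts.values())


def top_frequency_threshold(buckets):
    """Assumed qualifying threshold: the top values must cover 1 - 1/n."""
    return 1 - 1 / buckets


coverage_16 = top_n_coverage(counts, 16)      # the 16 "interesting" values
threshold_18 = top_frequency_threshold(18)
print(f"rows = {total_rows}, top 16 cover {coverage_16:.2%}, "
      f"threshold for 18 buckets = {threshold_18:.2%}")
```

The four least popular values hold only 28 of the 840 rows, so the remaining 16 values comfortably clear the threshold.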


Top-Frequency (12c)
select
    endpoint_value      epv,
    endpoint_number     epn,
    endpoint_number - lag(endpoint_number,1) over (
        order by endpoint_number
    )                   freq
from
    user_tab_histograms
where
    table_name  = 'T1'
and column_name = 'SKEWED'
order by
    endpoint_value
;

(There is still a little flaw)

EPV   EPN  FREQ
  1     1
  4    17    16
  5    37    20
  6    61    24
  7    89    28
  8   121    32
  9   157    36
 10   273   116
 11   317    44
 12   365    48
 13   417    52
 14   473    56
 15   533    60
 16   597    64
 17   665    68
 18   737    72
 19   813    76
 20   814     1

Too many values (a)


Building a Height-Balanced Histogram
23 23 28 24 29 36 27 13 30 46 43 29 25 20 39 38 20 33 29 35
20 38 27 20 28 29 42 26 19 16 33 26 43 18 19 31 32 35 28 22
40 33 19 34 45 28 42 33 27 38 35 21 35 12  8 59 35 34 31 24
38 33 39 16 35 22 32 38 20 34 18 37 27 29 30 50 33 27 27 15
13 12 31 26 13 35 31 41 44 29 22 30 33 33 43 31 28 32 17 28

Sort

 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59


Too many values (b)


 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59

We have 100 items and 37 distinct values.
Assume we are limited to 20 buckets.
After sorting the data we record the value of every 5th row (100 / 20):
8 13 17 19 20 23 26 27 28 xx 29 31 32 33 34 35 36 38 41 43 59

29 is the only "popular" value, with two buckets (i.e. 10 rows).
All other values are assumed to have (100 - 10) / (37 - 1) = 2.5, i.e. roughly 3, rows. (10.2.0.4+)

Lots more popular values
13 17 19 20 23 26 27 28 29 31 32 33 34 35 36 38 41 43 59
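
The steps on this slide can be reproduced with a short sketch. It follows the description given here (sort, record every 5th row, call a value popular if it is recorded more than once) rather than Oracle's actual code, and the function names are illustrative.

```python
from collections import Counter

# The 100 sorted values from the slide, one row of the grid per line.
rows = [
     8, 12, 12, 13, 13, 13, 15, 16, 16, 17, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20,
    21, 22, 22, 22, 23, 23, 24, 24, 25, 26, 26, 26, 27, 27, 27, 27, 27, 27, 28, 28,
    28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 31, 31, 31, 32, 32,
    32, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 36,
    37, 38, 38, 38, 38, 38, 39, 39, 40, 41, 42, 42, 43, 43, 43, 44, 45, 46, 50, 59,
]


def height_balanced_endpoints(sorted_rows, buckets):
    """Low value, then the value found at the end of each equal-sized bucket."""
    step = len(sorted_rows) // buckets
    return [sorted_rows[0]] + [sorted_rows[i * step - 1] for i in range(1, buckets + 1)]


buckets = 20
step = len(rows) // buckets
endpoints = height_balanced_endpoints(rows, buckets)

# A value is "popular" if it closes more than one bucket.
popular = {v: n for v, n in Counter(endpoints[1:]).items() if n > 1}
popular_rows = sum(n * step for n in popular.values())

ndv = len(set(rows))
non_popular_rows = (len(rows) - popular_rows) / (ndv - len(popular))

print(endpoints)
print(popular, popular_rows, ndv, non_popular_rows)
```

Comparing the real counts with the estimate shows the weakness: 33 appears 8 times and 35 appears 7 times, yet both are treated as non-popular values with about 3 rows each.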

Solution (8i - 11g)


Fake it with a frequency histogram.
Pick the 254 most popular values.
Include the low and high values
Fake selectivity for remainder
Needs one entry with double the desired cardinality
Could assign this to the low/high value if introduced
Otherwise change the value with the lowest frequency

Too many values (12c)


 8 12 12 13 13 13 15 16 16 17 18 18 19 19 19 20 20 20 20 20
21 22 22 22 23 23 24 24 25 26 26 26 27 27 27 27 27 27 28 28
28 28 28 28 29 29 29 29 29 29 30 30 30 31 31 31 31 31 32 32
32 33 33 33 33 33 33 33 33 34 34 34 35 35 35 35 35 35 35 36
37 38 38 38 38 38 39 39 40 41 42 42 43 43 43 44 45 46 50 59


Hybrid Histogram
select
    endpoint_number,
    endpoint_value,
    endpoint_repeat_count
from
    user_tab_histograms
where
    table_name = 'T1'
;

EPN  EPV  REP
  1    8    1
  6   13    3
 12   18    2
 20   20    5
 26   23    2
 32   26    3
 38   27    6
 44   28    6
 50   29    6
 58   31    5
 69   33    8
 79   35    7
 86   38    5     <- 7 rows in the bucket; 38 appears 5 times
 90   41    1
 92   42    2
 95   43    3
 96   44    1
 97   45    1
 98   46    1
100   59    1

This looks like an old frequency histogram, but each bucket has a
"repeat count" showing how often the highest value appears in the bucket.
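
A sketch of how to read the three columns, checked against the 100-row data set. It takes the endpoint values from the output above as given (it does not try to reproduce how Oracle chooses them) and derives the cumulative endpoint number and the repeat count for each one; the function name is illustrative.

```python
from bisect import bisect_right

# The 100 sorted values from the earlier slides, one row of the grid per line.
rows = [
     8, 12, 12, 13, 13, 13, 15, 16, 16, 17, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20,
    21, 22, 22, 22, 23, 23, 24, 24, 25, 26, 26, 26, 27, 27, 27, 27, 27, 27, 28, 28,
    28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 31, 31, 31, 32, 32,
    32, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 36,
    37, 38, 38, 38, 38, 38, 39, 39, 40, 41, 42, 42, 43, 43, 43, 44, 45, 46, 50, 59,
]

# Endpoint values (EPV) as reported in the histogram output.
endpoint_values = [8, 13, 18, 20, 23, 26, 27, 28, 29, 31,
                   33, 35, 38, 41, 42, 43, 44, 45, 46, 59]


def describe_buckets(sorted_rows, epvs):
    """Return (EPN, EPV, REP, rows_in_bucket) for each endpoint value."""
    result, previous_epn = [], 0
    for epv in epvs:
        epn = bisect_right(sorted_rows, epv)   # rows with value <= epv
        rep = sorted_rows.count(epv)           # rows equal to the endpoint value
        result.append((epn, epv, rep, epn - previous_epn))
        previous_epn = epn
    return result


buckets = describe_buckets(rows, endpoint_values)
for epn, epv, rep, size in buckets:
    print(f"{epn:>3} {epv:>3} {rep:>3}   ({size} rows in the bucket)")
```

The repeat count is what lets the optimizer see values such as 33 (8 rows) and 35 (7 rows), which the old height-balanced histogram treated as non-popular.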


SQL (top-N pt.1)


SQL behind basic "approximate NDV" (single column table - 11g)

select
    /*+ {lots of hints} */
    to_char(count("VALUE")),
    to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
    to_char(substrb(dump(max("VALUE"),16,0,64),1,240))
from
    "TEST_USER"."T1" t          /* NDV,NIL,NIL*/

SQL behind creating a histogram with 18 buckets

select
    /*+ {lots of hints} */
    to_char(count("VALUE")),
    to_char(substrb(dump(min("VALUE"),16,0,64),1,240)),
    to_char(substrb(dump(max("VALUE"),16,0,64),1,240)),
    count(rowidtochar(rowid))
from
    "TEST_USER"."T1" t          /* TOPN,NIL,NIL,RWID,U18U*/

SQL (top-N pt.2)


select /*+ lots of hints */
    substrb(dump("VALUE",16,0,64),1,240)    val,
    rowidtochar(rowid)                      rwid
from
    "TEST_USER"."T1" t
where rowid in (
    chartorowid('AAAWaHAAFAAAAEEAAB'),chartorowid('AAAWaHAAFAAAAEEAAC'),
    chartorowid('AAAWaHAAFAAAAEEAAD'),chartorowid('AAAWaHAAFAAAAEEAAE'),
    chartorowid('AAAWaHAAFAAAAEEAAF'),chartorowid('AAAWaHAAFAAAAEEAAG'),
    chartorowid('AAAWaHAAFAAAAEEAAH'),chartorowid('AAAWaHAAFAAAAEEAAI'),
    chartorowid('AAAWaHAAFAAAAEEAAJ'),chartorowid('AAAWaHAAFAAAAEEAAK'),
    chartorowid('AAAWaHAAFAAAAEEAAL'),chartorowid('AAAWaHAAFAAAAEEAAM'),
    chartorowid('AAAWaHAAFAAAAEEAAN'),chartorowid('AAAWaHAAFAAAAEEAAO'),
    chartorowid('AAAWaHAAFAAAAEEAAP'),chartorowid('AAAWaHAAFAAAAEEAAQ'),
    chartorowid('AAAWaHAAFAAAAEFAAA'),chartorowid('AAAWaHAAFAAAAEFAAB')
)
order by "VALUE"

SQL (hybrid)
select
    substrb(dump(val,16,0,64),1,20) ep, freq, cdn, ndv,
    (sum(pop) over()) popcnt, (sum(pop * freq) over()) popfreq,
    substrb(dump(max(val) over(),16,0,64),1,20) maxval,
    substrb(dump(min(val) over(),16,0,64),1,20) minval
from
    (
    select
        val, freq, (sum(freq) over()) cdn, (count(*) over()) ndv,
        (case when freq > ((sum(freq) over())/15) then 1 else 0 end) pop
    from (
        select /*+ lots of hints */
            "VALUE" val, count("VALUE") freq
        from
            "TEST_USER"."T1" t
        where
            "VALUE" is not null
        group by
            "VALUE"
        )
    )
order by val
/

With only 15 buckets this dataset got a hybrid histogram
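
The "pop" flag in this query marks a value as popular when its frequency exceeds the row count divided by the number of buckets. A sketch of that one test follows. It assumes T1 holds the 100-row data set from the earlier slides (the slide does not say so explicitly), and the variable names simply mirror the column aliases in the query.

```python
from collections import Counter

# Assumed content of T1.VALUE: the 100 sorted values from the earlier slides.
rows = [
     8, 12, 12, 13, 13, 13, 15, 16, 16, 17, 18, 18, 19, 19, 19, 20, 20, 20, 20, 20,
    21, 22, 22, 22, 23, 23, 24, 24, 25, 26, 26, 26, 27, 27, 27, 27, 27, 27, 28, 28,
    28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 30, 30, 31, 31, 31, 31, 31, 32, 32,
    32, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 36,
    37, 38, 38, 38, 38, 38, 39, 39, 40, 41, 42, 42, 43, 43, 43, 44, 45, 46, 50, 59,
]

BUCKETS = 15                      # the literal 15 in the query

freq = Counter(rows)              # val -> freq, as in the innermost query block
cdn = sum(freq.values())          # sum(freq) over ()
ndv = len(freq)                   # count(*) over ()

# pop = 1 when freq > cdn / buckets
popular = {val: f for val, f in freq.items() if f > cdn / BUCKETS}
popcnt = len(popular)             # sum(pop) over ()
popfreq = sum(popular.values())   # sum(pop * freq) over ()

print(cdn, ndv, sorted(popular), popcnt, popfreq)
```

Under that assumption only 33 and 35 pass the test with 15 buckets, since the threshold is 100 / 15, or about 6.7 rows.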

SQL (old height-balanced)


select
    min(minbkt), maxbkt,
    substrb(dump(min(val),16,0,32),1,120) minval,
    substrb(dump(max(val),16,0,32),1,120) maxval,
    sum(rep) sumrep, sum(repsq) sumrepsq, max(rep) maxrep, count(*) bktndv,
    sum(case when rep=1 then 1 else 0 end) unqrep
from
    (
    select
        val, min(bkt) minbkt, max(bkt) maxbkt,
        count(val) rep, count(val)*count(val) repsq
    from
        (
        select /*+ lots of hints */
            "LN100" val, ntile(200) over (order by "LN100") bkt
        from
            sys.ora_temp_1_ds_616 t
        where
            "LN100" is not null
        )
    group by val
    )
group by maxbkt order by maxbkt

Conclusions for 12c


Use auto_sample_size
2,048 buckets is legal
The default is still 254, and it's likely to be adequate

Frequency / Top N histograms


Fast and accurate

Hybrid
Capture far more popular values, still samples, and costly

Timing is still important


May still want to create some by code