Chapter No. 01
Introduction
Question # 1.9:
List and describe the five primitives for specifying a data mining task.
Answer:
The five primitives for specifying a data-mining task are:
1. Task-relevant data: the portion of the database to be investigated, including the relevant attributes or dimensions of interest.
2. The kind of knowledge to be mined: the data mining function to be performed, such as characterization, discrimination, association, classification, clustering, or evolution analysis.
3. Background knowledge: knowledge about the domain under study, such as concept hierarchies, that can be used to guide the discovery process.
4. Interestingness measures: thresholds (for example, support and confidence for association rules) used to separate interesting patterns from uninteresting ones.
5. Presentation and visualization of discovered patterns: the form in which the discovered patterns are to be displayed, such as rules, tables, charts, or cubes.
Page 2 of 12
Question # 1.14:
Describe three challenges to data mining regarding data mining methodology and user interaction issues.
Answer:
Challenges to data mining regarding data mining methodology and user interaction issues include the following: mining different kinds of knowledge in databases, interactive mining of knowledge at multiple levels of abstraction, incorporation of background knowledge, data mining query languages and ad hoc data mining, presentation and visualization of data mining results, handling noisy or incomplete data, and pattern evaluation. Below are descriptions of the first three challenges mentioned:

Mining different kinds of knowledge in databases: Different users are interested in different kinds of knowledge, so data mining should cover a wide spectrum of data analysis and knowledge discovery tasks, such as characterization, discrimination, association, classification, clustering, and trend and deviation analysis. These tasks may use the same database in different ways and require the development of numerous mining techniques.

Interactive mining of knowledge at multiple levels of abstraction: Because it is difficult to know exactly what can be discovered within a database, the data mining process should be interactive, allowing the user to focus the search for patterns, refine data mining requests based on returned results, and view the discovered patterns at multiple levels of abstraction.

Incorporation of background knowledge: Background knowledge, or information regarding the domain under study such as integrity constraints and deduction rules, may be used to guide the discovery process and allow discovered patterns to be expressed in concise terms and at different levels of abstraction. This helps to focus and speed up a data mining process and to judge the interestingness of discovered patterns.
Chapter No. 02
Data Preprocessing
Question # 2.9:
Given the following frequency table for the attribute age, find the approximate median age of the data.

age interval    frequency
1-5             200
5-15            450
15-20           300
20-50           1500
50-80           700
80-110          44

Answer:
For data grouped into intervals, the approximate median is given by

median = L1 + ((N/2 - (sum freq)_l) / freq_median) * width

where L1 is the lower boundary of the median interval, N is the total number of values, (sum freq)_l is the sum of the frequencies of all intervals below the median interval, freq_median is the frequency of the median interval, and width is the width of the median interval.

We have:
L1 = 20
N = 200 + 450 + 300 + 1500 + 700 + 44 = 3194
(sum freq)_l = 200 + 450 + 300 = 950
freq_median = 1500
width = 30

median = 20 + ((3194/2 - 950) / 1500) * 30 = 32.94 years
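The arithmetic can be reproduced with a short helper (a sketch; the parameter names mirror the symbols in the formula):

```python
def grouped_median(L1, N, sum_freq_lower, freq_median, width):
    """Approximate median of data summarized by an interval/frequency table:
    L1 + ((N/2 - sum of frequencies below the median interval)
          / frequency of the median interval) * interval width."""
    return L1 + ((N / 2 - sum_freq_lower) / freq_median) * width

m = grouped_median(L1=20, N=3194, sum_freq_lower=950, freq_median=1500, width=30)
print(round(m, 2))  # 32.94
```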
Question # 2.9:
What other methods can be used for data smoothing?
Answer:
Other methods that can be used for data smoothing include
alternate forms of binning such as smoothing by bin medians or
smoothing by bin boundaries. Alternatively, equal-width bins can be used
to implement any of the forms of binning, where the interval range of
values in each bin is constant. Methods other than binning include using
regression techniques to smooth the data by fitting it to a function such as
through linear or multiple regression. Classification techniques can be
used to implement concept hierarchies that can smooth the data by
rolling-up lower level concepts to higher-level concepts.
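The binning variants mentioned can be illustrated with a short sketch (equal-width partitioning; the sample values here are made up for illustration):

```python
import statistics

def smooth_by_bins(values, n_bins, method="mean"):
    """Equal-width binning: replace each value by a statistic of its bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        i = min(int((v - lo) / width), n_bins - 1)  # clamp the max into the last bin
        bins[i].append(v)
    out = []
    for b in bins:
        if not b:
            continue
        if method == "mean":
            out.extend([statistics.mean(b)] * len(b))
        elif method == "median":
            out.extend([statistics.median(b)] * len(b))
        else:  # "boundary": snap each value to the nearer bin boundary (min or max)
            out.extend(b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b)
    return out

print(smooth_by_bins([4, 8, 15, 21, 21, 24, 25, 28, 34], 3, "mean"))
```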
Chapter No. 03
Data Warehouse and OLAP Technology

1. The generation of a data warehouse (including aggregation)
Answer:
ROLAP: Using a ROLAP server, the generation of a data warehouse can be implemented by a relational or extended-relational DBMS using summary fact tables. The fact tables can store aggregated data and the data at the abstraction levels indicated by the join keys in the schema for the given data cube.
2. Roll-up
Answer:
ROLAP: To roll up on a dimension using the summary fact table, we look for the record in the table that generalizes the desired dimension one level up in its concept hierarchy. For example, to roll up the location dimension from province_or_state to country, select the record in which the province_or_state and city fields both contain the special value all; the measure field of this record holds the subtotal for the desired roll-up.
Page 7 of 12
3. Drill-down
Answer:
ROLAP: To drill down on a dimension using the summary fact table, we look for the record in the table that contains a generalization on the desired dimension. For example, to drill down on the location dimension from country to province_or_state, select the record for which only the next lowest field in the concept hierarchy for location contains the special value all. In this case, the city field should contain the value all. The value of the measure field, rupees_sold, for example, given in this record will contain the subtotal for the desired drill-down.
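A toy version of that lookup (a sketch with a hypothetical row layout of (country, province_or_state, city, rupees_sold); the data values are made up for illustration):

```python
# Summary fact table rows at mixed abstraction levels;
# the special value "all" marks a rolled-up (generalized) field.
fact = [
    ("Canada", "all", "all", 1000.0),
    ("Canada", "British Columbia", "all", 400.0),
    ("Canada", "British Columbia", "Vancouver", 250.0),
]

def drill_down_location(fact, country):
    """Drill down from country to province_or_state: the province field is set,
    but city (the next lowest field in the hierarchy) still holds "all"."""
    return [r for r in fact if r[0] == country and r[1] != "all" and r[2] == "all"]

print(drill_down_location(fact, "Canada"))
```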
4. Incremental updating
Answer:
ROLAP: To perform incremental updating, check whether the
corresponding tuple is in the summary fact table. If not, insert
it into the summary table and propagate the result up.
Otherwise, update the value and propagate the result up.
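The check-then-insert-or-update logic can be sketched with a dictionary standing in for the summary fact table (a hypothetical layout; a real ROLAP server would do this with SQL against the fact table):

```python
def incremental_update(summary, key, delta):
    """Insert the tuple if absent, otherwise update it, then propagate upward."""
    summary[key] = summary.get(key, 0.0) + delta
    # Propagate the delta to the next higher abstraction level by
    # generalizing the rightmost non-"all" field to the special value "all".
    fields = list(key)
    for i in range(len(fields) - 1, -1, -1):
        if fields[i] != "all":
            fields[i] = "all"
            incremental_update(summary, tuple(fields), delta)
            break

# Keys are (country, city); "all" marks a rolled-up level.
summary = {("Canada", "all"): 1000.0}
incremental_update(summary, ("Canada", "Vancouver"), 50.0)
print(summary[("Canada", "all")])  # 1050.0
```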
(c) Which server architecture (ROLAP, MOLAP, or HOLAP) is preferred?
Answer:
HOLAP is often preferred since it integrates the strength of both ROLAP
and MOLAP methods and avoids their shortcomings. If the cube is quite
dense, MOLAP is often preferred. If the data are sparse and the
dimensionality is high, there will be too many cells (due to exponential
growth) and, in this case, it is often desirable to compute iceberg cubes
instead of materializing the complete cubes.
Chapter No. 05
Mining Frequent Patterns, Associations, and Correlations
Question:
A database has five transactions. Let min_sup = 60% and min_conf = 80%.
(a) Find all frequent itemsets using Apriori and FP-growth, respectively.

TID     items_bought
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}
Apriori:

C1:
item    count
m       3
o       3
n       2
k       5
e       4
y       3
d       1
a       1
u       1
c       2
i       1

L1 (count >= 3):
item    count
k       5
e       4
m       3
o       3
y       3

C2:
itemset    count
{m, o}     1
{m, k}     3
{m, e}     2
{m, y}     2
{o, k}     3
{o, e}     3
{o, y}     2
{k, e}     4
{k, y}     3
{e, y}     2

L2:
itemset    count
{m, k}     3
{o, k}     3
{o, e}     3
{k, e}     4
{k, y}     3

C3:
itemset      count
{o, k, e}    3

L3:
itemset      count
{o, k, e}    3
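The Apriori candidate generation shown above can be reproduced with a short sketch (min_sup count 3, i.e., 60% of the five transactions):

```python
from itertools import combinations

# The five transactions as sets of single-letter items.
transactions = [set("monkey"), set("donkey"), set("make"),
                set("mucky"), set("cookie")]

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    frequent = {}
    level = list(dict.fromkeys(  # unique 1-item candidates (C1)
        frozenset([i]) for t in transactions for i in t))
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)  # L_k
        k += 1
        level = []
        for a, b in combinations(survivors, 2):
            cand = a | b
            # Join step: union must have size k; prune step: every
            # (k-1)-subset of the candidate must itself be frequent.
            if len(cand) == k and cand not in level and all(
                    frozenset(s) in survivors for s in combinations(cand, k - 1)):
                level.append(cand)
    return frequent

freq = apriori(transactions, min_count=3)
largest = max(freq, key=len)
print(sorted(largest), freq[largest])  # ['e', 'k', 'o'] 3
```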
FP-growth: See Figure 5.2 for the FP-tree.

item   conditional pattern base                 conditional FP-tree   frequent patterns generated
y      {{k,e,m,o: 1}, {k,e,o: 1}, {k,m: 1}}     k: 3                  {k,y: 3}
o      {{k,e,m: 1}, {k,e: 2}}                   k: 3, e: 3            {k,o: 3}, {e,o: 3}, {k,e,o: 3}
m      {{k,e: 2}, {k: 1}}                       k: 3                  {k,m: 3}
e      {{k: 4}}                                 k: 4                  {k,e: 4}
(b) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers and item_i denotes variables representing items (e.g., "A", "B", etc.):

for all X in transaction, buys(X, item1) ^ buys(X, item2) => buys(X, item3)   [s, c]

Answer:
buys(X, k) ^ buys(X, o) => buys(X, e)   [s = 60%, c = 100%]
buys(X, e) ^ buys(X, o) => buys(X, k)   [s = 60%, c = 100%]
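These two rules can be checked directly against the five transactions (a sketch; the support counts match the Apriori tables earlier in the chapter):

```python
# The five transactions, as sets of single-letter items.
transactions = [set("monkey"), set("donkey"), set("make"),
                set("mucky"), set("cookie")]
N = len(transactions)

def count(items):
    """Number of transactions containing every item in `items`."""
    return sum(1 for t in transactions if set(items) <= t)

def rule(lhs, rhs):
    """(support, confidence) of the rule lhs => rhs."""
    return count(lhs + rhs) / N, count(lhs + rhs) / count(lhs)

print(rule("ko", "e"))  # (0.6, 1.0): strong
print(rule("eo", "k"))  # (0.6, 1.0): strong
print(rule("ke", "o"))  # confidence 0.75 falls below min_conf = 80%
```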
Question # 5.3:
cust_ID   TID    items_bought (in the form of brand-item_category)
01        T100   {King's-Crab, Sunset-Milk, Dairyland-Cheese, Best-Bread}
02        T200   {Best-Cheese, Dairyland-Milk, Goldenfarm-Apple, Tasty-Pie, Wonder-Bread}
01        T300   {Westcoast-Apple, Dairyland-Milk, Wonder-Bread, Tasty-Pie}
03        T400   {Wonder-Bread, Sunset-Milk, Dairyland-Cheese}
(a) At the granularity of item_category (e.g., item_i could be "Milk"), for the following rule template,

buys(X, item1) ^ buys(X, item2) => buys(X, item3)   [s, c]

list the frequent k-itemset for the largest k, and all of the strong association rules (with their support s and confidence c) containing the frequent k-itemset for the largest k.

Answer:
k = 3 and the frequent 3-itemset is {Bread, Milk, Cheese}. The rules are:

Bread ^ Cheese => Milk   [75%, 100%]
Cheese ^ Milk => Bread   [75%, 100%]
Cheese => Milk ^ Bread   [75%, 100%]
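The [75%, 100%] figures can be verified by projecting each transaction onto its item categories (a sketch; the category sets below are read off the table above, with brand names dropped):

```python
# Transactions reduced to item categories.
baskets = [
    {"Crab", "Milk", "Cheese", "Bread"},          # T100
    {"Cheese", "Milk", "Apple", "Pie", "Bread"},  # T200
    {"Apple", "Milk", "Bread", "Pie"},            # T300
    {"Bread", "Milk", "Cheese"},                  # T400
]

def count(items):
    """Number of baskets containing every listed category."""
    return sum(1 for b in baskets if set(items) <= b)

# Rule: Bread ^ Cheese => Milk
support = count({"Bread", "Cheese", "Milk"}) / len(baskets)
confidence = count({"Bread", "Cheese", "Milk"}) / count({"Bread", "Cheese"})
print(support, confidence)  # 0.75 1.0
```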
(b) At the granularity of brand-item_category (e.g., item_i could be "Sunset-Milk"), for the same rule template, list the frequent k-itemset for the largest k.

Answer:
k = 3 and the frequent 3-itemsets are:
{Wonder-Bread, Dairyland-Milk, Tasty-Pie}
{Wonder-Bread, Sunset-Milk, Dairyland-Cheese}