
MACHINE LEARNING (UE14CS403)

UNIT-II: K-Nearest Neighbours


Topics
Basic KNN technique
Euclidean distance and other distance measures
Weighted distances
Inductive bias and other issues


K-NN
A non-parametric, lazy, supervised, instance-based learning technique.
No assumption is made about the pattern of the data.
The training phase is non-existent; all the work happens during the classification phase.
There is no attempt to learn any hypothesis.


Details
As data is presented to the technique, there is no attempt to learn anything; the data is simply stored.
When a new query instance (xq) is presented, the k nearest neighbours of xq are selected.
The distance measure adopted can vary: we can employ the Euclidean distance, the Manhattan distance, the Hamming distance, etc.
Once the nearest neighbours are identified, it is a simple matter of choosing the majority classification among them as the classification for xq.
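A minimal sketch of this procedure in Python, assuming NumPy is available; the names X_train, y_train and x_q are illustrative and not from the slides:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    # Classify x_q by majority vote among its k nearest neighbours (Euclidean distance).
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))  # distance to every stored instance
    nearest = np.argsort(dists)[:k]                      # indices of the k closest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy usage: the query point sits near the two class-'A' instances.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 'A'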



Details
If it is a regression problem, then we can simply calculate the average of the target values of the k nearest neighbours as the value for xq.
Euclidean distance: if xi and xj are two instances and ar(x) denotes the value of the r-th attribute of instance x, then
d(xi, xj) = sqrt( sum over r = 1..n of (ar(xi) - ar(xj))^2 )
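A small sketch of this and the other distance measures mentioned earlier (Manhattan, Hamming), written as plain Python functions over equal-length attribute vectors; the names are illustrative:

import math

def euclidean(xi, xj):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def manhattan(xi, xj):
    # Sum of absolute attribute differences.
    return sum(abs(a - b) for a, b in zip(xi, xj))

def hamming(xi, xj):
    # Number of attribute positions in which the two instances differ.
    return sum(a != b for a, b in zip(xi, xj))

print(euclidean([1, 2], [4, 6]))                  # 5.0
print(manhattan([1, 2], [4, 6]))                  # 7
print(hamming(['r', 'g', 'b'], ['r', 'b', 'b']))  # 1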




KNN for Regression
For regression problems, we calculate the average of the target values of the k nearest neighbouring instances.
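A minimal sketch of plain (unweighted) KNN regression, under the same assumptions as the earlier classification sketch (NumPy arrays with illustrative names):

import numpy as np

def knn_regress(X_train, y_train, x_q, k=3):
    # Predict the target for x_q as the mean target of its k nearest neighbours.
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return float(y_train[nearest].mean())

# Toy usage: the target is roughly the sum of the two attributes.
X_train = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [8.0, 8.0]])
y_train = np.array([2.0, 4.0, 6.0, 16.0])
print(knn_regress(X_train, y_train, np.array([2.1, 2.1]), k=3))  # mean of 2, 4, 6 -> 4.0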



Distance-weighted nearest neighbours
One obvious improvement to the basic KNN technique is to weight the contribution of the k nearest neighbours according to their distance from the query point xq.
By doing this, we penalize instances that are far from the query point relative to instances that are closer.
So our formula for classifying the query instance is slightly modified, as sketched below.
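The slide's formula image is not reproduced here; the common distance-weighted form (for example, Mitchell's) assigns each neighbour xi a weight wi = 1 / d(xq, xi)^2 and predicts the class v that maximises the weighted vote, sum over the k neighbours of wi * delta(v, f(xi)), where delta(a, b) is 1 if a = b and 0 otherwise. A minimal Python sketch under that assumption:

import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_q, k=3):
    # Distance-weighted vote: closer neighbours contribute larger weights (1/d^2).
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if dists[i] == 0.0:              # exact match with a stored instance
            return y_train[i]            # return its class directly
        votes[y_train[i]] += 1.0 / dists[i] ** 2
    return max(votes, key=votes.get)

The 1/d^2 weighting is one common choice; any weight that decreases with distance serves the same purpose.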



Distance-weighted KNN for Regression
For real-valued functions, the formula changes from the plain average on the KNN for Regression slide to a normalised weighted average, as sketched below.
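The slide's formula image is not reproduced here; the usual distance-weighted estimate is f(xq) = (sum over the k neighbours of wi * f(xi)) / (sum of wi), with wi = 1 / d(xq, xi)^2, so that the weights are normalised. A minimal Python sketch under that assumption:

import numpy as np

def weighted_knn_regress(X_train, y_train, x_q, k=3):
    # Weighted average of neighbour targets; weights are 1/d^2, normalised by their sum.
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    if np.any(dists[nearest] == 0.0):                        # exact match: return its target
        return float(y_train[nearest][dists[nearest] == 0.0][0])
    w = 1.0 / dists[nearest] ** 2
    return float(np.sum(w * y_train[nearest]) / np.sum(w))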



Remarks on KNN
A highly effective inductive inference method for many practical problems.
Robust to noise
Very effective when provided with a large
set of training data.
By taking the weights of neighbours into
account, it minimises the effect of isolated
noisy data.



Inductive Bias of KNN
There is an assumption that the classification of an instance x will be most similar to the classification of other instances that are nearby in Euclidean distance.



Issues with k-NN
The value of k must be chosen carefully. A very low value can mean that noisy data will influence the classification; a high value can mean that the boundaries between classes become very smooth.
A brute-force search can be used: start k at a small value, calculate the accuracy of the technique, and gradually increase k until we reach optimum performance (a short sketch follows).
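A minimal sketch of that search, assuming a held-out validation split and the knn_classify function from the earlier sketch (all names are illustrative):

import numpy as np

def choose_k(X_train, y_train, X_val, y_val, k_values=range(1, 16, 2)):
    # Try increasing (odd) values of k and keep the one with the best validation accuracy.
    best_k, best_acc = None, -1.0
    for k in k_values:
        preds = np.array([knn_classify(X_train, y_train, x, k=k) for x in X_val])
        acc = float(np.mean(preds == y_val))
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc

Odd values of k are used here simply to avoid tied votes in two-class problems.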



Issues with k-NN
Since ALL attributes are considered when calculating distances, irrelevant attributes can play a major role in deciding the nearest neighbours.
Two solutions:
A) Weight the attributes when calculating distances.
B) Use a leave-one-out approach: iteratively leave out one of the attributes and test the algorithm. This exercise can then lead us to the best set of attributes (see the sketch after this list).
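A minimal sketch of option B as a greedy backward elimination, assuming a hypothetical evaluate(attributes) helper that trains and scores k-NN using only the listed attribute columns:

def leave_one_out_attributes(all_attrs, evaluate):
    # Greedy backward elimination: repeatedly drop the attribute whose removal
    # does not hurt (or improves) validation accuracy, until no drop helps.
    current = list(all_attrs)
    best_acc = evaluate(current)          # evaluate() is assumed, not from the slides
    improved = True
    while improved and len(current) > 1:
        improved = False
        for a in list(current):
            trial = [attr for attr in current if attr != a]
            acc = evaluate(trial)
            if acc >= best_acc:
                current, best_acc, improved = trial, acc, True
                break
    return current, best_acc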



Issues with k-NN
Efficient memory indexing:
The problem is that significant time is required to process every new query instance, so very effective indexing is required.
The kd-tree is a well-known solution (a short sketch follows).
It is worth mentioning that locally weighted linear regression is also a well-known related technique: it learns a hypothesis over the k nearest neighbours of the query point.
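A minimal sketch of kd-tree neighbour lookup using SciPy's cKDTree (this assumes SciPy is installed; the data is random and purely illustrative):

import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 3)      # 10,000 stored instances with 3 attributes
tree = cKDTree(X_train)                 # build the index once, up front

x_q = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(x_q, k=5)       # distances and indices of the 5 nearest neighbours
print(idx, dists)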

