K-NN
A non-parametric, lazy, supervised, instance-based learning technique.
No assumption is made about the pattern of the data. The training phase is non-existent; all of the work happens during the classification phase. There is no attempt to learn an explicit hypothesis.
Details
As data is presented to the technique, there is no attempt to learn anything; the data is simply stored. When a new query instance x_q is presented, the k nearest neighbours of x_q are selected. The distance measure adopted can vary: we can employ the Euclidean distance, Manhattan distance, Hamming distance, etc. Once the nearest neighbours are identified, it is a simple matter of choosing the majority classification among them as the classification for x_q.
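As an illustration, here is a minimal sketch of this classification phase in Python, assuming a small NumPy dataset with Euclidean distance and a simple majority vote; the names knn_classify, X_train, y_train and x_q are illustrative assumptions, not part of the original slides.

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    # Euclidean distance from the query point to every stored instance
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    # indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # majority vote among their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# usage with a toy dataset
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: "A"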
Details
If it is a regression problem, then we can simply calculate the average of the target values of the k nearest neighbours as the value for x_q.
Euclidean distance: if x_i and x_j are two instances and a_r(x) denotes the value of the r-th attribute of instance x, then
d(x_i, x_j) = sqrt( sum_{r=1..n} ( a_r(x_i) - a_r(x_j) )^2 )
KNN for Regression
For regression problems, we calculate the average of the target values of the k nearest neighbouring instances as the value for x_q.
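A minimal sketch of this averaging step in Python, assuming NumPy arrays; the helper name knn_regress and the toy data are illustrative assumptions.

import numpy as np

def knn_regress(X_train, y_train, x_q, k=3):
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    # the prediction is the mean target value of the k nearest neighbours
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 2.0, 2.9, 10.2])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))  # about 2.0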
Distance-weighted nearest neighbours
One obvious improvement to the basic KNN technique is to weight the contribution of each of the k nearest neighbours according to its distance from the query point x_q. By doing this, we penalise instances that are far from the query point compared to instances that are closer. Our formula for classifying the query instance is then slightly modified to:
f_hat(x_q) = argmax_{v in V} sum_{i=1..k} w_i * delta(v, f(x_i)),   where w_i = 1 / d(x_q, x_i)^2
and delta(a, b) = 1 if a = b, 0 otherwise.
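A hedged sketch of distance-weighted classification under the weighting w_i = 1/d(x_q, x_i)^2 described above; the function and variable names are assumptions, and the exact-match case (distance zero) is handled by returning that instance's label directly.

import numpy as np
from collections import defaultdict

def weighted_knn_classify(X_train, y_train, x_q, k=3):
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        if dists[i] == 0.0:                      # query coincides with a stored instance
            return y_train[i]
        votes[y_train[i]] += 1.0 / dists[i] ** 2  # closer neighbours count more
    return max(votes, key=votes.get)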
Distance-weighted KNN for Regression
For real-valued target functions, the formula changes to a weighted average of the neighbours' target values (using the same weights w_i as above):
f_hat(x_q) = ( sum_{i=1..k} w_i * f(x_i) ) / ( sum_{i=1..k} w_i )
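A minimal sketch of this weighted average in Python, reusing the weighting w_i = 1/d(x_q, x_i)^2; all names here are illustrative assumptions.

import numpy as np

def weighted_knn_regress(X_train, y_train, x_q, k=3):
    dists = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    exact = dists[nearest] == 0.0
    if np.any(exact):                        # query coincides with a stored instance
        return float(y_train[nearest][exact][0])
    w = 1.0 / dists[nearest] ** 2            # w_i = 1 / d(x_q, x_i)^2
    return float((w * y_train[nearest]).sum() / w.sum())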
Remarks on KNN
A highly effective inductive inference method.
Robust to noisy training data, and very effective when provided with a large set of training data.
By taking the distance-based weights of neighbours into account, it minimises the effect of isolated noisy data.
Inductive Bias of KNN
The assumption that the classification of an instance x_q will be most similar to the classification of other instances that are nearby in Euclidean distance.
Issues with k-NN
The value of k must be chosen carefully. A very low value can mean that noisy data will influence the classification; a high value can mean that the boundaries between classes become very smooth. A brute-force search can be used: start with a small k, measure the accuracy of the technique, and gradually increase k until performance stops improving.
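One way to realise this brute-force search, sketched under the assumption of a held-out validation split and reusing the knn_classify sketch from earlier; the names choose_k, X_val and y_val are illustrative.

import numpy as np

def choose_k(X_train, y_train, X_val, y_val, k_values=range(1, 16)):
    best_k, best_acc = None, -1.0
    for k in k_values:
        # classify every validation instance with the current k
        preds = [knn_classify(X_train, y_train, x, k=k) for x in X_val]
        acc = np.mean(np.array(preds) == y_val)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc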
Issues with k-NN
Since ALL attributes are considered when calculating distance, irrelevant attributes can play a major role in deciding the nearest neighbours. Two solutions: (A) weight the attributes when calculating distances; (B) use a leave-one-out approach, iteratively leaving out one of the attributes and testing the algorithm. This exercise can then lead us to the best set of attributes.
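A hedged sketch of the leave-one-attribute-out idea: drop each attribute in turn, measure validation accuracy without it, and compare. It reuses the earlier knn_classify sketch and assumes a validation split; all names are illustrative.

import numpy as np

def attribute_ablation(X_train, y_train, X_val, y_val, k=3):
    scores = {}
    for a in range(X_train.shape[1]):
        cols = [c for c in range(X_train.shape[1]) if c != a]   # leave attribute a out
        preds = [knn_classify(X_train[:, cols], y_train, x[cols], k=k) for x in X_val]
        scores[a] = np.mean(np.array(preds) == y_val)            # accuracy without attribute a
    return scores   # high accuracy without attribute a suggests a contributes little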
Issues with k-NN
Efficient memory indexing: significant computation is required to process every new query instance, so very effective indexing of the stored instances is needed. The kd-tree is a well-known solution. It is also worth mentioning Locally Weighted Linear Regression, a related instance-based technique that learns a local hypothesis over the k nearest neighbours of the query point.
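A minimal sketch of using a k-d tree index to speed up neighbour look-ups, here via scipy.spatial.cKDTree (assuming SciPy is available); the data and query point are illustrative.

import numpy as np
from scipy.spatial import cKDTree

X_train = np.random.rand(10000, 5)       # 10,000 stored instances, 5 attributes
tree = cKDTree(X_train)                  # build the index once, up front
x_q = np.random.rand(5)
dists, idx = tree.query(x_q, k=5)        # k nearest neighbours without a full linear scan
print(idx, dists)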