Introduction
1.1 Preamble
Object detection and tracking in video has significant scope in many computer
vision applications. The advent of powerful computers, the availability of
inexpensive video cameras, and the growing need for automated video analysis
have generated great interest in moving object detection and tracking. Object
detection is the process of detecting instances of objects in a frame, while
object tracking is the process of locating moving objects across consecutive
video frames. Automatic detection and tracking of moving objects has remained
an open research problem for many years; it enables us to understand and
analyze objects in video without relying on human operators to monitor the
footage, which is a time-consuming and tedious job. However, building an
automated detection and tracking system is not simple and involves many
challenges. This research work focuses on the detection and tracking of moving
objects in video.
Video processing, computer vision and pattern recognition empower many major
real-world applications.
A robust, accurate and high-performance approach for object detection and tracking
is still a great challenge today. Much depends on how the object to be detected
and tracked is defined. If a feature such as color is used to detect and track the
object, there are both advantages and disadvantages: it is easy to locate all the
objects of the same color, but problems arise when the foreground and background
share the same color. Illumination change is another major challenge, since a
change in illumination also changes the apparent color of the object, leading to
erroneous detection and tracking. Detecting objects based solely on visual
features such as color is therefore not feasible in all cases. The main challenges
include:
• Illumination Changes: The intensity of light varies during the day in outdoor
environments, for example the changeover from a cloudy to a bright sunny day
and vice versa. This may also happen indoors when a light is suddenly switched
on or off. Sudden changes in illumination degrade the performance of detection
and tracking methods, causing high false positive rates.
• Shadows: Shadows cast by moving objects share the same motion properties as
the objects themselves and are therefore often detected as actual objects. This
complicates detection and tracking, because with the shadow attached it is very
difficult to extract the shape and features of the object.
• Low Contrast: In video frames with low contrast, it is difficult to separate
foreground objects from the background due to the low intensity differences
between them.
Detection of moving objects is the fundamental step in extracting information from
video frames. Various object detection approaches are available in the literature,
generally categorized into feature based, template based, classifier based and
motion based approaches. In feature based object detection, features such as the
shape, size and color of the objects are extracted, and the objects are modeled
in terms of these features. In template based object detection, a template
describing an object is modeled, and the video frames are analyzed by matching
features between the template and the object in a video frame. There are two types
of template matching: fixed template matching and deformable template matching.
Fixed template matching is ideal when the shape of the object does not change
often; two popular methods that use it are image subtraction and correlation.
Deformable template matching is ideal when the object's size and shape vary due
to rigid and non-rigid deformations. Classifier based object detection separates
moving objects from the background by building a set of parameters based on
knowledge of the object to be detected. Motion based approaches rely on detecting
temporal changes in a video frame at the pixel level; the foreground is usually
detected by subtracting each frame from a reference frame, so that any movement
of foreground objects shows up in the difference. A popular motion based method
is background subtraction; optical flow and the Gaussian mixture model also use
motion to separate foreground from background.
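The reference-frame subtraction idea described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the specific method of any work surveyed here; the frame size and threshold are arbitrary choices.

```python
import numpy as np

def detect_motion(frame, background, threshold=25):
    """Label pixels whose absolute difference from the reference
    background exceeds a threshold as foreground (moving)."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold  # boolean foreground mask

# Toy 8x8 grayscale example: a static background with one bright moving blob.
background = np.full((8, 8), 50, dtype=np.uint8)
frame = background.copy()
frame[2:4, 2:4] = 200          # a "moving object" appears here

mask = detect_motion(frame, background)
print(mask.sum())  # 4 foreground pixels
```

In practice the reference background must also be updated over time, which is exactly what the adaptive background modeling methods surveyed later address.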
1.5 Motivation
Many approaches for moving object detection and tracking in video are available
in the literature, yet there is still considerable scope for improvement. In
particular, the increasing need for autonomous video analysis systems and the
availability of inexpensive digital cameras have generated great interest in
moving object detection and tracking systems. The information present in video
is significant, and the information about objects and background changes over
time. Extracting this valuable information for higher-level analysis of a scene
is a challenging and time-consuming task. Object detection and tracking in video
poses many challenges and offers many potential applications in computer vision,
as discussed in the previous section. Hence, this research work addresses the
problem of moving object detection and tracking in video by developing a few
novel approaches based on subspace and unsupervised learning techniques. The
main contributions are:
• An extensive literature review of moving object detection and tracking in
video has been carried out.
• The effectiveness of PCA and the particle filter in the wavelet domain for
object detection and tracking is explored.
• The importance of wavelet features and of model updating in building the
subspace while tracking the object is examined.
• Extensive experiments are carried out, and the results demonstrate the
effectiveness of the proposed approaches.
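To make the subspace idea behind these contributions concrete, the following sketch learns a low-dimensional PCA appearance subspace from vectorized patches and scores new patches by reconstruction error. It is illustrative only: the data are synthetic, the patch length and rank are made up, and the approaches developed in this thesis build the subspace from wavelet features rather than raw intensities.

```python
import numpy as np

def fit_subspace(patches, k=2):
    """Learn a k-dimensional PCA subspace from vectorized image patches
    (rows = samples). Returns the mean and top-k principal directions."""
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_error(patch, mean, basis):
    """Distance from the subspace: small for patches resembling the
    training appearance, large for background or occluded patches."""
    centered = patch - mean
    projected = basis.T @ (basis @ centered)
    return np.linalg.norm(centered - projected)

rng = np.random.default_rng(0)
# Synthetic training patches vary along one dominant direction plus noise.
direction = rng.normal(size=16)
patches = np.outer(rng.normal(size=50), direction) \
          + 0.01 * rng.normal(size=(50, 16))
mean, basis = fit_subspace(patches, k=1)

on_model = reconstruction_error(mean + 3.0 * direction, mean, basis)
off_model = reconstruction_error(mean + rng.normal(size=16), mean, basis)
print(on_model < off_model)  # True: in-subspace patch has lower error
```

During tracking, the candidate region with the lowest reconstruction error is the best match for the target, and the subspace is periodically refit so the model follows appearance changes.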
Chapter 2
Literature Survey
In the previous chapter, the scope of object detection and tracking in video and
its significance in various practical applications were discussed. In this
chapter, a thorough review of the literature is presented.
Satrughan Kumar et al. [2.41] proposed video object extraction and tracking using
background subtraction and an adaptive Kalman filter. The work focuses on
identifying the relevant moving blobs in the foreground by properly initializing
and updating the background model to improve tracking accuracy. The first step
deals with background modeling and object extraction. Initially, the average of
some initial frames is taken as the reference background. Temporal processing
alone creates holes and ignores the spatial correlation among moving pixels;
therefore, an approximate motion field is derived using background subtraction
together with a temporal difference mechanism, which generates an initial motion
field by spatio-temporal filtering of consecutive video frames. The block-wise
entropy is evaluated over a certain range of pixels of the difference image in
order to extract the relevant moving pixels from the initial motion field.
Finally, an adaptive Kalman filter is integrated with the object extraction
module in order to track the object in the foreground. The proposed scheme
effectively eliminates ghosting and aperture distortion and achieves high
tracking accuracy, but it fails to handle occlusion, and the computational cost
of the tracking system is high.
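For reference, a minimal constant-velocity Kalman filter in one dimension, of the general kind adapted in [2.41], can be sketched as follows. This is an illustrative toy with arbitrary noise settings, not the adaptive variant of the paper.

```python
import numpy as np

def kalman_track(measurements, q=1e-3, r=1.0):
    """1-D constant-velocity Kalman filter: state = [position, velocity].
    Returns the filtered position estimate for each noisy measurement."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition (dt = 1)
    H = np.array([[1.0, 0.0]])               # we observe position only
    Q = q * np.eye(2)                        # process noise covariance
    R = np.array([[r]])                      # measurement noise covariance
    x = np.array([[measurements[0]], [0.0]]) # initial state
    P = np.eye(2)                            # initial state covariance
    estimates = []
    for z in measurements:
        # Predict step
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([[z]]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
        estimates.append(float(x[0, 0]))
    return estimates

# Object moving at 1 unit/frame, observed with noise.
rng = np.random.default_rng(1)
true_pos = np.arange(30, dtype=float)
noisy = true_pos + rng.normal(scale=0.5, size=30)
est = kalman_track(noisy)
```

The adaptive variants surveyed here additionally tune Q and R online; the predict/update cycle itself is unchanged.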
Runmin Wang et al. [2.45] presented a novel method to detect license plates in
video sequences automatically. The framework integrates cascade detectors with
the Tracking-Learning-Detection (TLD) algorithm: the cascade detectors are used
to detect license plates, and the TLD algorithm is adopted to track the license
plate regions. The license plates in the first frame are detected by the cascade
detectors to build the original tracking list; the tracking results and the
detection results in the following frames are then compared, and newly appearing
license plate information is added to the tracking list. Meanwhile, tracking
results in the current tracking list are replaced by the corresponding detection
results when the latter have a higher degree of confidence. The proposed method
is tested on various video sequences, including low-resolution ones, and compared
with the methods of Faradji [2.1], Zheng [2.2] and Wu [2.3]. The performance of
the proposed system is measured in terms of recall rate, precision rate and
F-measure, achieving 78.75%, 74.80% and 76.73% respectively. An advantage of this
approach is that it detects license plates in low-resolution video sequences and
can detect multiple license plates at a time. However, some regions of the image
might not be scanned, leaving license plates undetected, and it fails to scan the
number plates of white vehicles, resulting in too many false positives.
Kaveh Ahmadi et al. [2.46] present a novel algorithm for detecting and tracking
small dim targets in infrared (IR) image sequences with low signal-to-noise ratio
(SNR), based on both frequency and spatial domain information. Using a Dual-Tree
Complex Wavelet Transform (DT-CWT), a CFAR detector is applied in the frequency
domain to find potential object positions in a frame. Following this step, a
Support Vector Machine (SVM) classifier is applied to accept or reject each
potential point based on the spatial domain information of the frame. The proposed
system is compared with morphology based, wavelet based, bilateral filter, and
Human Visual System (HVS) based techniques. To assess the role of the SVM as a
refinement step, a version of the proposed method without the SVM is also tested;
the Error Rate per Frame (EPF) and Target to Clutter Ratio (TCR) measures are
calculated for the same 10 datasets in this case. The results show that without
SVM refinement the proposed method achieved an EPF of 3.25 and a correct target
detection rate of 65% on average, whereas with the SVM it achieved an EPF of 2.10
and a target detection rate of 95%. The time taken for tracking (30 frames) is
22.21 seconds. The method is capable of tracking small objects against complex
backgrounds and of detecting and tracking targets in infrared image sequences
with low SNR values (less than 2 dB). Its shortcomings are that it fails to track
objects under occlusion and that multiple targets cannot be tracked
simultaneously.
Faegheh Sardari et al. [2.47] propose an occlusion-free object tracking method
together with a simple adaptive appearance model. The proposed appearance model,
which is updated at the end of each time step, includes three components: the
first is a fixed template of the target object, the second captures rapid changes
in object appearance, and the third maintains slow changes accumulated along the
object's path. The proposed particle filter based tracking method detects and
handles occlusion and is also robust against changes in the object appearance
model. A meta-heuristic approach called the Modified Galaxy based Search
Algorithm (MGbSA) is used to reinforce the search for the optimum state in the
particle filter state space. The particle filter and MGbSA based approach is
tested on different video sequences with challenges including sudden motion,
partial/full occlusion, pose variation and changes of illumination, and the
performance of the proposed method is measured in terms of average coordinate
error (ACE) and average scale error (ASE), with average values of 5.51 and 10.97
respectively. The proposed method is compared with the geometric particle filter
[2.4], Wang [2.5], PF-PSO [2.6], Zhang [2.7], the Kalman filter and many other
existing methods with respect to ACE and ASE. However, the speed of the proposed
method is not high enough, detection in the first frame is done manually, and it
cannot track more than one object at any given instance.
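The bootstrap particle filter at the core of such trackers can be sketched in one dimension as follows. This is a toy illustration with made-up noise parameters and a scalar state, not the MGbSA-enhanced tracker of [2.47].

```python
import numpy as np

def particle_filter_step(particles, measurement,
                         motion_std=1.0, meas_std=1.0, rng=None):
    """One predict/weight/resample cycle of a bootstrap particle filter
    tracking a scalar position."""
    if rng is None:
        rng = np.random.default_rng()
    # Predict: diffuse particles with a random-walk motion model.
    particles = particles + rng.normal(scale=motion_std, size=particles.shape)
    # Weight: Gaussian likelihood of the measurement under each particle.
    weights = np.exp(-0.5 * ((particles - measurement) / meas_std) ** 2)
    weights /= weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

rng = np.random.default_rng(2)
particles = rng.uniform(-10.0, 10.0, size=500)  # uninformative prior
for _ in range(20):
    z = 4.0 + rng.normal(scale=1.0)   # noisy observation of a target at 4.0
    particles = particle_filter_step(particles, z, rng=rng)
estimate = particles.mean()
```

In visual tracking the state is typically a bounding-box position and scale, and the likelihood compares appearance features (e.g., color histograms) rather than a scalar measurement, but the predict/weight/resample cycle is the same.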
Issam Elafi et al. [2.48] introduce a new real-time approach based on the particle
filter and background subtraction. This approach automatically detects and tracks
multiple moving objects without any learning phase or prior knowledge about their
size, nature or initial position. The first step is to compute the subtracted
image to detect moving objects using background subtraction. In the second step,
if an object is detected for the first time, the particle filter seeks the white
pixels in order to estimate the localization of the object; otherwise, the
particle filter seeks the target color distribution calculated in the previous
frame. In the third step, the exact center and dimensions of the detected object
are calculated. In the fourth step, the color distribution of the target (moving
object) is calculated and stored to be used in the next frame as the target color
distribution. An experimental study is performed over several video test sets,
and the approach is compared with seven other tracking algorithms (CT, ORIA, SCM,
MTT, MS, LOT, CXT). The proposed approach outperforms the CT tracker by 49%, the
CXT tracker by 77% and the MS and ORIA trackers by over 90% with respect to
success rate. Positive aspects of this method are that it tracks moving objects
in real time and automatically detects and tracks multiple moving objects without
any learning phase or prior knowledge about the size, nature or initial position
of the objects. However, the approach is vulnerable to frequent full occlusions,
and the tracker drifts away from objects in the presence of shadows.

Ivan Huerta et al.
[2.49] address the detection of both penumbra and umbra shadow regions. First, a
novel bottom-up approach based on gradient and color models is presented, which
successfully discriminates between chromatic moving cast shadow regions and
regions detected as moving objects. In essence, regions corresponding to
potential shadows are detected based on edge partitioning and color statistics.
Subsequently, temporal similarities between textures and spatial similarities
between chrominance angle and brightness distortions are analyzed for each
potential shadow region in order to detect the umbra shadow regions. A tracking
based top-down approach then improves the performance of the bottom-up chromatic
shadow detection algorithm by properly correcting non-detected shadows. The
shadow detection rate and shadow discrimination rate of the proposed method are
compared with Qin et al. [2.8], Martel-Brisson [2.9] and Martel-Brisson [2.10].
The proposed method achieves a shadow detection rate of 83.6% and a shadow
discrimination rate of 91.3%, surpassing the compared methods. However, the
approach is vulnerable to camouflage, which leads to erroneous classification of
shadows, and its computational complexity is very high.
Giyoung Lee et al. [2.50] present a fast and lightweight algorithm, suitable for
an embedded real-time visual surveillance system, to effectively detect and track
multiple moving vehicles whose appearance and/or position changes abruptly at a
low frame rate. For effective tracking at low frame rates, a new matching
criterion is proposed based on greedy data association using appearance and
position similarities between detections and trackers. To manage abrupt
appearance changes, manifold learning is used to calculate appearance similarity.
To manage abrupt changes in motion, the next probable centroid area of the
tracker is predicted using trajectory information, and the position similarity is
then calculated based on the predicted next position and progress direction of
the tracker. Detection performance of the proposed approach is measured in terms
of accuracy and false alarm rate (FAR), with values of 98.05% and 17%
respectively. Tracking performance is measured with respect to mostly tracked
trajectories (MT), mostly lost trajectories (ML), fragmentation (FRMT) and ID
switches (IDS), and the proposed method is compared with Lee [2.11], Wang [2.12]
and Zhang [2.13]. At a frame rate of 2 frames per second the proposed approach
achieves an MT of 40, ML of 1, FRMT of 4 and IDS of 0. The algorithm has some
shortcomings: tracking requires trained parameters, tracking performance is
highly dependent on detection performance, detection performance deteriorates
because of incorrect matching errors, and it cannot handle occlusion.

Luqman et al. [2.51] propose an integrated technique for object detection and
tracking in video surveillance. First, the pixels in each image are modeled with
a Gaussian mixture model, using the K-means algorithm, to separate the foreground
from the background. Then, morphological operations are performed to remove noise
pixels. Objects are formed by spatial evaluation, with color mean and contour
chain code as features. Tracking is performed by temporal evaluation, i.e.,
comparison of inter-frame object features and distances. The technique performs
well in object detection and tracking, with high true positive and low false
negative rates, but still suffers from false positives in dynamic background
scenes. The proposed method achieves a precision of 95.2%, recall of 90.90% and
F-measure of 93% on the PETS 2009 dataset, and a precision of 25.5%, recall of
100% and F-measure of 40.6% on the traffic-snow dataset. In addition, the
computational cost in terms of speed is very high compared to existing methods in
the literature; overall, the method suffers from too many false positives, high
computational cost, and an inability to handle dynamic backgrounds.
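The morphological noise-removal step used by such pipelines can be illustrated with a small NumPy sketch of binary opening (erosion followed by dilation). This is a generic illustration with a 3x3 structuring element, not the exact operation of [2.51].

```python
import numpy as np

def erode(mask):
    """3x3 binary erosion: a pixel survives only if its whole
    neighbourhood is foreground."""
    padded = np.pad(mask, 1)
    out = np.ones_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out &= padded[1 + dy: 1 + dy + mask.shape[0],
                          1 + dx: 1 + dx + mask.shape[1]]
    return out

def dilate(mask):
    """3x3 binary dilation: a pixel fires if any neighbour is foreground."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= padded[1 + dy: 1 + dy + mask.shape[0],
                          1 + dx: 1 + dx + mask.shape[1]]
    return out

def opening(mask):
    """Erosion then dilation removes isolated noise pixels while
    roughly preserving larger blobs."""
    return dilate(erode(mask))

mask = np.zeros((10, 10), dtype=bool)
mask[2:7, 2:7] = True   # a real 5x5 object
mask[9, 9] = True       # a single noisy foreground pixel
cleaned = opening(mask)
```

After opening, the isolated pixel disappears while the 5x5 blob is preserved, which is why opening is the standard cleanup applied to raw foreground masks before blob extraction.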
Olga Zoidi et al. [2.52] propose a visual object tracking framework that employs
an appearance based representation of the target object, based on local steering
kernel descriptors and color histogram information. The framework takes as input
the region of the target object in the previous video frame and a stored instance
of the target object, and tries to localize the object in the current frame by
finding the frame region that best resembles the input. As the object view
changes over time, the object model is updated. The color histogram similarity
between the detected object and the surrounding background is used for background
subtraction. The proposed Color Histogram-Local Steering Kernel (CH-LSK)
tracking scheme is compared with the particle filter [2.14] and with color
histogram and local steering kernel based methods [2.15]; CH-LSK succeeds in
tracking objects under scale and rotation variations and partial occlusion, as
well as in tracking slowly deforming articulated objects. The average tracking
accuracy of CH-LSK is 0.6928 for case study 1 and 0.7745 for case study 7.
However, the approach is vulnerable to full occlusion and fails to track the
object when its direction or speed changes suddenly.
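The histogram-similarity component of such appearance models can be illustrated with the Bhattacharyya coefficient on grayscale histograms. This is a simplified sketch with synthetic regions; CH-LSK itself combines color histograms with local steering kernel descriptors.

```python
import numpy as np

def histogram(pixels, bins=8):
    """Normalized intensity histogram of an image region."""
    h, _ = np.histogram(pixels, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def bhattacharyya(p, q):
    """Bhattacharyya coefficient between two normalized histograms:
    1.0 for identical distributions, near 0 for disjoint ones."""
    return float(np.sum(np.sqrt(p * q)))

rng = np.random.default_rng(3)
target = rng.integers(180, 220, size=400)        # bright object region
candidate_same = rng.integers(180, 220, size=400)  # similar appearance
candidate_bg = rng.integers(0, 60, size=400)     # dark background region

s_same = bhattacharyya(histogram(target), histogram(candidate_same))
s_bg = bhattacharyya(histogram(target), histogram(candidate_bg))
print(s_same > s_bg)  # True: matching appearance scores higher
```

A tracker scans candidate regions in the new frame and keeps the one whose histogram scores highest against the stored target model.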
Jean-Philippe et al. [2.53] present a system that uses background subtraction
algorithms to detect moving objects. In order to build the object tracks, an
object model is built and updated over time inside a state machine using feature
points and spatial information. When an occlusion occurs between multiple
objects, the positions of feature points in previous observations are used to
estimate the positions and sizes of the individual occluded objects. Multiple
Object Tracking Precision (MOTP) and Multiple Object Tracking Accuracy (MOTA)
are used as metrics to evaluate the performance of the Jean-Philippe et al.
Urban Tracker (UR) [13] approach, which is compared with the Traffic
Intelligence (TI) [15] tracker proposed by Saunier et al. The Urban Tracker
achieves a MOTA of 89.28% and MOTP of 10.53 px, whereas Traffic Intelligence
achieves a MOTA of 82.53% and MOTP of 7.42 px on the Sherb video sequence, which
consists of cars as objects. For a sequence with pedestrians as moving objects,
the Urban Tracker achieves 68.09% MOTA and 6.64 px MOTP, whereas Traffic
Intelligence achieves 1.41% MOTA and 11.98 px MOTP.
Sheng Chen et al. [2.54] present a new approach to tracking people in crowded
scenes, where people are subject to long-term (partial) occlusions and may assume
varying postures and articulations. Temporal mid-level features (e.g.,
supervoxels or dense point trajectories) are used as a more coherent
spatio-temporal basis for handling occlusion and pose variations. Tracking is
formulated as labeling mid-level features with object identifiers, an approach
called constrained sequential labeling (CSL). A key feature of this approach is
that it allows flexible cost functions and constraints that capture complex
dependencies which cannot be represented in standard network-flow formulations.
The proposed approach is compared with state-of-the-art detection based
approaches, with results reported on the widely used pedestrian benchmark video
PETS2009-S2L1, where people detectors are effective, and on the Volleyball
dataset, which contains 38 videos of entire collegiate volleyball plays. The
evaluation metrics used are missed detections (MD), false positives (FP), ID
switches (IDS), multi-object tracking accuracy (MOTA), recall and precision. On
the PETS2009 S2L1 dataset, the proposed CSL achieves a recall of 98.28%,
precision of 91.07%, 6 IDS and a MOTA of 89.78%; Henriques et al. [2.16] achieve
a recall of 94.03%, precision of 92.40%, 10 IDS and a MOTA of 84.77%; and Huang
et al. [2.17] achieve a recall of 96.45%, precision of 93.64%, 8 IDS and a MOTA
of 90.30%.
Tianzhu Zhang et al. [2.55] propose a novel tracking algorithm that models and
detects occlusion through structured sparse learning, called Tracking by
Occlusion Detection (TOD). The approach assumes that occlusion detected in
previous frames can be propagated to the current one; this propagated
information determines which pixels will contribute to the sparse representation
of the current track, i.e., pixels that were detected as part of an occlusion in
the previous frame are removed from the target representation process. The
tracker is tested on challenging benchmark sequences, such as sports videos,
which involve heavy occlusion, drastic illumination changes and large pose
variations. TOD is compared with six other recent, state-of-the-art trackers.
Tracking performance is evaluated using the average per-frame distance between
the center of the tracking result and that of the ground truth, as in [2.18,
2.19, 2.20]; this distance should be small. TOD consistently produces a smaller
distance than the other trackers by accurately tracking the target despite
severe occlusions and pose variations. However, the algorithm is vulnerable to
severe illumination changes, tracking is affected by background clutter, and the
tracker sometimes drifts from the target.
A novel approach is introduced by Zhang et al. [2.56] to deal with dynamic
scenes. A combination of a five-image difference algorithm and a background
subtraction algorithm is used to obtain the complete contour of a moving object.
The proposed approach is divided into three consecutive steps: pre-processing,
target identification and rectangular contour modeling. In the first step, the
video is pre-processed with a median filter for noise removal. Secondly, a
five-frame differential technique, an enhanced version of the inter-frame
differential technique, is applied. Thirdly, a background subtraction algorithm
is applied to the actual image sequence, and the output is obtained from the
dissimilarity between the current video frame and the assumed background model,
followed by a binarization operation. In the last stage, a rectangular contour
model is applied to eliminate the cast shadow effect. However, this approach
fails to eliminate the noise of fluttering leaves and cannot detect multiple
moving targets at once.

Wang et al. [2.57] presented a three-step method based on temporal information
for moving object detection. First, a temporal saliency map is generated from
the continuous symmetric difference of adjacent frames. Second, the temporal
saliency map is binarized, and candidate areas are obtained by calculating a
threshold using the maximum entropy sum method. The most salient points are
taken as attention seeds, and based on these seeds a fuzzy approach is performed
on the saliency map to grow the attention seeds until the entire contour of each
moving object is acquired. The effectiveness of the proposed method is tested on
four sequences from dataset2014, and performance is evaluated in terms of recall
(Re), specificity (Sp), false positive rate (FPR), false negative rate (FNR),
percentage of bad classifications (PBC), F-measure and precision. The proposed
approach is also compared with ICSAP [2.21], BBM [2.24], UBA [2.22], SGMM
[2.23], QCH [2.25] and GML [2.26]. The average results of the proposed method on
dataset2014 are a Re of 84.67%, Sp of 99.43%, FPR of 0.57, FNR of 15.33, PBC of
86.42%, precision of 77.57% and F-measure of 80.23. Positive aspects of this
method are that it detects objects under dynamic backgrounds in both indoor and
outdoor environments and requires no parameter or threshold tuning. However, it
fails to handle shadows and fails to detect multiple closely spaced moving
objects.
A novel and effective method to track moving objects against a static background
is proposed by Sandeep et al. [2.60]. The method first performs preprocessing to
remove noise from the video frames. A rectangular window is then drawn to select
the target object region in the first video frame (the reference frame). Next,
the Laplacian operator is applied to the selected target object for sharpening
and edge detection. The algorithm then applies the DCT and selects a few
high-energy coefficients. Subsequently, it computes the perceptual hash of the
selected target object using the mean of all the AC values of the block. The
perceptual hash of a target object is used to find the similar object in
subsequent frames of the video. The proposed method is tested on real indoor and
outdoor video sequences and compared with other perceptual hashing techniques,
namely average hash (aHash) [2.39], perceptive hash (pHash) [2.39], difference
hash (dHash) [2.39] and Laplacian hash (LHash) [2.40]. The average tracking
accuracy of the proposed method is 76.32%. Positive aspects of this method are
that it tracks a moving object with varying object size and a significant amount
of noise, and handles illumination changes and both slow- and fast-moving
objects. However, it suffers from high computational cost and can track only one
target object at a time.

Weina et al. [2.61] present a method to automatically detect small groups of
individuals who are traveling together. These groups are discovered by bottom-up
hierarchical clustering using a generalized, symmetric Hausdorff distance
defined with respect to pairwise proximity and velocity. Weina et al. combine a
pedestrian detector, a particle filter tracker and a multi-object data
association algorithm to extract long-term trajectories of people passing
through the scene. The detector is run frequently (at least once per second);
therefore, in addition to any new individuals entering the scene, people already
being tracked are detected multiple times. For each detection, a particle filter
tracker is instantiated to track that person through the next few seconds of
video, yielding a short-term trajectory, or tracklet. The proposed approach is
tested in both indoor and outdoor environments with different group sizes,
handling occlusion and pose variations, but it fails to track pedestrians under
severe illumination change and is sensitive to noise and jitter.
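The symmetric Hausdorff distance underlying such trajectory grouping can be sketched on two-dimensional point sets as follows. This is a toy example; the generalized distance of [2.61] additionally incorporates velocity.

```python
import numpy as np

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets, e.g.
    two pedestrian trajectories sampled over time: the worst-case
    nearest-neighbour distance in either direction."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

# Two nearby parallel trajectories vs. a distant one.
t = np.linspace(0, 10, 20)
walker_a = np.stack([t, np.zeros_like(t)], axis=1)
walker_b = np.stack([t, np.full_like(t, 0.5)], axis=1)   # walking together
walker_c = np.stack([t, np.full_like(t, 20.0)], axis=1)  # unrelated person

print(hausdorff(walker_a, walker_b))  # 0.5
print(hausdorff(walker_a, walker_c))  # 20.0
```

Hierarchical clustering then merges trajectories whose pairwise distance falls below a threshold, so walkers a and b end up in one group while c remains alone.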
Jon et al. [2.62] present a probabilistic method for vehicle detection and
tracking through the analysis of monocular images obtained from a
vehicle-mounted camera. The method is designed to address the main shortcomings
of traditional particle filtering approaches, namely Bayesian methods based on
importance sampling, in traffic environments: these methods do not scale well as
the dimensionality of the feature space grows, which creates significant
limitations when tracking multiple objects. Instead, the proposed method is
based on a Markov chain Monte Carlo (MCMC) approach, which allows efficient
sampling of the feature space. The method makes important contributions to both
the motion and the observation models of the tracker. Regarding the motion
model, a new interaction treatment based on Markov random fields (MRF) is
defined, which allows the handling of possible inter-dependencies in vehicle
trajectories. For vehicle detection, the method relies on a supervised
classification stage using support vector machines (SVM); a new descriptor based
on the analysis of gradient orientations in concentric rectangles is defined,
which involves a much smaller feature space than traditional descriptors that
are too costly for real-time applications. Positive aspects of this approach are
that it can track vehicles in a wide variety of driving situations and
environmental conditions and that it handles vehicles entering and leaving the
scene well. However, it requires high computation time, and the tracker drifts
away when the number of particles is relatively small.
Chapter 3
In recent years, several object detection methods based on the compressed domain
have been developed and have attracted a wide range of applications. Most of
these techniques exploit the discrete cosine transform (DCT) [3.21]–[3.25] to
detect moving objects in a video frame. Weiqiang et al. [3.1] represent the
background using block-level DCT coefficients, and the DCT coefficients are
updated in order to adapt the background. This method is computationally
efficient in terms of both speed and memory. The Kalman filter based approach
[3.2] offers desirable computational speed and low memory requirements. Kilger
et al. [3.3] and Koller et al. [3.4] describe a modified version of the Kalman
filter based approach, called Selective Running Average, which monitors
real-time traffic. Wren et al. [3.5] model each background pixel in the Pfinder
system using a Gaussian distribution, and Piccardi [3.6] classifies the pixels
using an adaptive threshold based on the standard deviation, an approach named
Running Gaussian Average. Another background modeling technique, used as an
alternative to averaging, is median filtering, as in [3.7]–[3.9]. In order to
evaluate the medoid, Cucchiara et al. [3.10] introduce an approximate
implementation based on median filtering. McFarlane and Schofield [3.11]
describe a median based approach which adaptively varies the background value by
approximately estimating the median.
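The running-average family of background models above shares a simple core update rule: the background estimate is blended toward each new frame. A minimal sketch follows (illustrative only; the learning rate alpha is an arbitrary choice, and methods such as Selective Running Average additionally gate the update per pixel).

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background update: the model slowly adapts toward
    the current frame, absorbing gradual illumination change."""
    return (1 - alpha) * background + alpha * frame

background = np.full((4, 4), 100.0)   # initial background estimate
for _ in range(100):
    frame = np.full((4, 4), 140.0)    # scene has gradually brightened
    background = update_background(background, frame)
```

After many updates the model converges toward the new scene intensity (here, close to 140), which is why slow illumination changes stop registering as foreground while fast-moving objects still do.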
The simple median method described by Cheung and Kamath [3.12] has low
complexity and low computational cost. Multimodal backgrounds are modeled using
methods based on the popular Mixture of Gaussians technique [3.13], [3.14]. An
extended version of [3.13] and [3.14], which demonstrates non-parametric
estimation with a larger number of Gaussian mixtures, is explained by Elgammal
et al. [3.15]; [3.16] and [3.17] describe complex and extended versions of
[3.15]. Toyama et al. [3.16] estimate the background using a Wiener prediction
filter. Density modes are identified using a sequential density approximation
approach in [3.17], and each density mode is assigned a Gaussian component in
order to designate the background.
Scale-invariant object detectors require extracting features at each scale of
an image pyramid, which is computationally infeasible. In [3.43] and [3.44],
a feature approximation technique estimates the feature responses at nearby
scales when the color features and gradient-histogram features are extracted
at only one scale of the image pyramid. As a result, feature pyramids are
constructed at reduced computational cost, which speeds up object detection
over other methods in the literature, namely [3.40] and [3.42], at some cost
in detection accuracy. To avoid extracting features from a tall image
pyramid, a classifier pyramid [3.45] can be used, which also improves
detection speed. However, the method in [3.45] requires large storage and a
long training time because it uses multiple classifiers trained at different
scales.
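The scale approximation in [3.43] and [3.44] can be illustrated with a power-law extrapolation of an aggregated channel response; the exponent `lam` is channel-specific in practice, and the value below is purely illustrative:

```python
import numpy as np

def grad_mag_sum(img):
    """Aggregated gradient-magnitude channel response of an image."""
    gy, gx = np.gradient(img.astype(np.float64))
    return float(np.sqrt(gx ** 2 + gy ** 2).sum())

def approx_at_scale(f_at_scale_1, s, lam=1.0):
    """Power-law extrapolation f(s) ~ f(1) * s**(-lam) of a channel
    response to a nearby scale s, avoiding recomputation on a resampled
    image (lam = 1.0 is purely illustrative; real values are fitted
    per channel)."""
    return f_at_scale_1 * s ** (-lam)
```

Computing `grad_mag_sum` once and extrapolating with `approx_at_scale` replaces feature extraction at every pyramid level.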
In recent years, object detection has used various transform-domain
techniques. For example, to leverage the linearity of the Fourier transform,
2D-HOG [3.46] is combined with the DFT, which speeds up the detection process
compared with [3.41]. However, the performance of this method deteriorates
due to high computational cost, mainly because the feature pyramid is
computed exactly. Feature pyramid approximation [3.32] based on forward and
inverse DFTs offers higher detection accuracy than [3.44], but its detection
speed drops drastically compared to [3.44], mainly due to the complex
multiplications required by the forward and inverse DFTs.
Ravi et al. [3.47] use a background model to detect moving objects in video
based on a modified running-average discrete cosine transform (RA-DCT),
which is robust to illumination variations. To achieve high detection
accuracy, RA-DCT includes a median filter and performs morphological
operations to reduce noise in the background mask. Sagrebin et al. [3.48]
propose a background removal method that models each 4 x 4-pixel patch of an
image by a set of low-frequency coefficient vectors obtained with the
discrete cosine transform, extracting the important information from the
image. The background model is adapted continuously depending on whether a
new coefficient vector matches the background model. This method reduces
computational cost, suppresses noise and is robust to sudden illumination
changes. Kalirajan et al. [3.49] compress the input video frames with the
2D-DCT; key feature points are derived by calculating correlation
coefficients, and matching feature points are classified as foreground or
background using the Bayesian rule. Embedding the maximum-likelihood feature
points over the input frames localizes the foreground feature points.
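The patch-wise DCT modelling of [3.48] can be sketched as follows, assuming an orthonormal DCT-II basis; the 4 x 4 block size matches [3.48], while the number of retained low-frequency coefficients is an illustrative choice:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    d = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    d[0, :] = np.sqrt(1.0 / n)
    return d

def block_low_freq_features(img, block=4, keep=3):
    """Describe each block x block patch by its low-frequency 2-D DCT
    coefficients (the top-left keep x keep corner, an illustrative
    choice), as in DCT-based background modelling."""
    d = dct_matrix(block)
    h, w = img.shape
    feats = {}
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = img[y:y + block, x:x + block].astype(np.float64)
            coeff = d @ patch @ d.T          # separable 2-D DCT
            feats[(y, x)] = coeff[:keep, :keep].ravel()
    return feats
```

For a uniform patch only the DC coefficient is non-zero, so the low-frequency vector compactly separates flat background from textured foreground.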
The 2-D DCT of an N x N block f(x, y) and its inverse (IDCT) are given by:

C(u, v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x, y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]   (3.1)

f(x, y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} \alpha(u)\,\alpha(v)\, C(u, v) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right]   (3.2)

where \alpha(u) = \sqrt{1/N} for u = 0 and \alpha(u) = \sqrt{2/N} for u = 1, ..., N-1 (and similarly for \alpha(v)). The inverse discrete cosine transform reconstructs the block from its discrete cosine transform (DCT) coefficients.
The proposed method uses the 2-D DCT specified in eq. (3.1): 1-D DCT is
applied on the columns of the frame blocks, and then 1-D DCT is applied on
the rows of the resulting frame blocks. Likewise, all the input frames are
processed with the 2-D DCT. The DCT coefficients of each key frame are used
as local features. Usually, the dimensionality of the obtained local features
is high; to overcome the issue of redundancy and to speed up the detection
process, Principal Component Analysis (PCA) can also be used for feature
extraction.
Eq. (3.3) gives the next principal component, which maximizes the remaining
variance subject to being orthogonal to the first one; the remaining
principal components can be derived in a similar way. The coefficients can be
calculated from the eigenvalues and eigenvectors of the covariance matrix and
are ordered according to their eigenvalues. Computational complexity and cost
are reduced because the DCT-processed frames are fed to PCA for
dimensionality reduction. If the input frames were processed directly by PCA,
the computational complexity and cost would increase because of the high
dimensionality of the input frames, which also results in erroneous detection
of foreground objects.
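As a small worked illustration of this step, the principal components can be obtained from the eigen-decomposition of the sample covariance matrix and ordered by eigenvalue (the data below is toy data, not from the experiments):

```python
import numpy as np

# Toy data: rows are samples (e.g. DCT-processed frames, flattened);
# the x-axis carries almost all of the variance.
X = np.array([[2.0, 0.1], [4.0, -0.1], [6.0, 0.2], [8.0, -0.2]])
Xc = X - X.mean(axis=0)                  # subtract the mean
cov = Xc.T @ Xc / (len(X) - 1)           # sample covariance matrix
vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
order = np.argsort(vals)[::-1]           # reorder by decreasing eigenvalue
components = vecs[:, order]              # principal components, best first
```

The first column of `components` points along the dominant (x-axis) direction, so projecting onto the best K components discards the low-variance dimensions.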
The proposed method can be summarized as follows. Select two frames and apply
the two-dimensional DCT to them: 1-D DCT is applied on the columns of the
frame blocks, then 1-D DCT is applied on the rows of the resulting frame
blocks; all input frames are processed with the 2-D DCT in this way. Compute
the average image and subtract the mean from the images. Calculate the
eigenvectors and eigenvalues, and save the best K eigenvectors in a matrix,
ordered by their eigenvalues. The previous and current frames are then
projected onto the eigenspace; finally, the projection is reconstructed from
the eigenvectors and the IDCT is performed.
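The steps above can be sketched on a pair of frames as follows; this is a minimal illustration under simplifying assumptions (each frame treated as a single block, K = 1), not the exact implementation:

```python
import numpy as np

def dct_mat(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    d[0] = np.sqrt(1.0 / n)
    return d

def detect_change(prev, curr, K=1):
    """Sketch of the DCT + PCA steps on two n x n frames: 2-D DCT,
    mean subtraction, best-K eigenvectors, eigenspace projection,
    reconstruction, then inverse 2-D DCT of the difference."""
    n = prev.shape[0]
    D = dct_mat(n)
    X1 = D @ prev @ D.T                  # 2-D DCT: columns, then rows
    X2 = D @ curr @ D.T
    A = np.stack([X1.ravel(), X2.ravel()])   # frames as row vectors
    mean = A.mean(axis=0)
    Ac = A - mean                        # subtract the mean image
    # eigenvectors of the small 2x2 Gram matrix span the eigenspace
    vals, vecs = np.linalg.eigh(Ac @ Ac.T)
    order = np.argsort(vals)[::-1][:K]   # keep the best K eigenvectors
    U = Ac.T @ vecs[:, order]            # back-project to feature space
    U /= np.linalg.norm(U, axis=0) + 1e-12
    proj = Ac @ U                        # project frames onto eigenspace
    recon = proj @ U.T + mean            # reconstruct in the DCT domain
    diff = (recon[1] - recon[0]).reshape(n, n)
    return D.T @ diff @ D                # inverse 2-D DCT
```

With two frames and K = 1, the eigenspace captures the frame difference exactly, so the returned image highlights the changed region.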
Various video sequences are also tested with the proposed method, including
sunny-day sequences, low-contrast sequences, cloudy conditions and
noisy/poor-quality video sequences, in both indoor and outdoor environments.
In contrast to conventional Principal Component Analysis [3.19], the proposed
method obtains fair results. The identified clusters are also close to ideal,
and the false-alarm rate is lower than with the PCA approach. As is evident
from Figure 3.2 and Table 3.1, the false-alarm rate, detection rate, accuracy
and the rates of false positives and false negatives achieved are
satisfactory compared with the conventional PCA approach.
The advantage of the proposed method is that the DCT helps isolate the video
frame into fragments of differing importance, and the properties of the DCT
coefficient blocks make them very good for generating feature spaces that
best describe the object. The generated features are extracted using PCA, and
this step helps preserve the shape of the object. Conventional PCA, by
contrast, fails to guarantee the neighbor relationship for neighboring
samples, which leads to failure in preserving the object's shape. The
detection approach is also invariant to illumination changes and complex
backgrounds. In future, the proposed method can be extended to track the
detected objects, considering occlusion and other challenges.
Figure 3.2: Experimental Results of DCT-PCA; Input frames (Row 1); Results
achieved from proposed method (Row 2); Results obtained from PCA (Row
3).
False Negative Rate = FN / (FN + TP)
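The metrics reported in Table 3.1 follow directly from the confusion counts; a minimal sketch using the standard definitions:

```python
def detection_metrics(tp, fp, fn, tn):
    """Standard detection metrics from confusion counts (a sketch;
    these are the usual textbook definitions)."""
    return {
        "detection_rate": tp / (tp + fn),        # a.k.a. recall
        "false_negative_rate": fn / (fn + tp),   # FN / (FN + TP)
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }
```

Note that the detection rate and the false negative rate sum to one, so reporting both is redundant but conventional.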
In the prediction phase, we apply the same feature extraction process to the new
images and we pass the features to the trained machine learning algorithm to predict
the label.
The main difference between traditional machine learning and deep learning
algorithms is in the feature engineering. In traditional machine learning algorithms,
we need to hand-craft the features. By contrast, in deep learning algorithms feature
engineering is done automatically by the algorithm. Feature engineering is difficult,
time-consuming and requires domain expertise. The promise of deep learning is
more accurate machine learning algorithms compared to traditional machine
learning, with little or no feature engineering.
Fig. 1. Machine learning phase
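The contrast between hand-crafted features and a trained classifier can be illustrated as follows; the intensity-histogram feature and nearest-centroid classifier are illustrative stand-ins for the feature-engineering step, not methods from the text:

```python
import numpy as np

def hist_features(img, bins=8):
    """Hand-crafted feature: a normalized intensity histogram
    (an illustrative choice of feature)."""
    h, _ = np.histogram(img, bins=bins, range=(0, 256))
    return h / h.sum()

def train_nearest_centroid(feats, labels):
    """A minimal 'traditional ML' classifier: one centroid per class."""
    classes = sorted(set(labels))
    return {c: np.mean([f for f, l in zip(feats, labels) if l == c], axis=0)
            for c in classes}

def predict(model, feat):
    """Prediction phase: extract the same features, match to the
    nearest class centroid."""
    return min(model, key=lambda c: np.linalg.norm(feat - model[c]))
```

A deep network would instead learn the feature extractor itself from raw pixels, replacing `hist_features` entirely.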
Fast R-CNN
Fig. 3. Fast R-CNN
The author of R-CNN solved some of its drawbacks to build a faster object
detection algorithm, called Fast R-CNN. The approach is similar to
the R-CNN algorithm. But, instead of feeding the region proposals to the CNN,
we feed the input image to the CNN to generate a convolutional feature map.
From the convolutional feature map, we identify the region of proposals and
warp them into squares and by using a RoI pooling layer we reshape them into
a fixed size so that it can be fed into a fully connected layer. From the RoI
feature vector, we use a softmax layer to predict the class of the proposed
region and also the offset values for the bounding box.
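The RoI pooling step can be sketched as follows; the grid size and the max-pooling choice follow the usual Fast R-CNN formulation, with illustrative dimensions:

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Sketch of RoI max-pooling: divide the RoI on the feature map
    into an out_size x out_size grid and max-pool each cell, producing
    a fixed-size output regardless of the RoI's shape (sizes here are
    illustrative; Fast R-CNN typically uses a 7x7 grid)."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)  # cell boundaries
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out
```

Because the output is always `out_size x out_size`, arbitrarily shaped region proposals can all be fed to the same fully connected layers.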
The reason “Fast R-CNN” is faster than R-CNN is because you don’t have to feed
2000 region proposals to the convolutional neural network every time.
Instead, the convolution operation is done only once per image and a feature
map is generated from it.
From the above graphs, we can infer that Fast R-CNN is significantly faster in
training and testing sessions over R-CNN.
Fig. 5. Faster R-CNN