Abstract
Visual Navigation is key to enabling useful mobile robots to work autonomously in unmapped, dynamic environments. It involves positioning a robot by tracking the world as it moves past a camera. Visual Navigation is not yet widely used in the real world because current implementations have limitations: drift (accumulated errors); poor performance in featureless, self-similar or dynamic environments, or during rapid motion; and the need for significant amounts of processing power. In my PhD I propose to address some of these issues by improving the accuracy and robustness of feature-based relative positioning, and by reducing drift, without damaging reliability, through adapting developing loop closure techniques. I will aim to implement a 3d Visual Navigation system to demonstrate improvements in robustness and accuracy over existing technology.
Contents
1 Introduction
2 Current Research and Technology
  2.1 Accumulated Errors
    2.1.1 Bundle Adjustment
    2.1.2 Sensor Integration
      2.1.2.1 Inertial Navigation Systems
    2.1.3 Loop Closure
      2.1.3.1 Scene Recognition
  2.2 Motion Models, Frame Rates and Robustness
3 PhD Plan
  3.1 Progress to date
    3.1.1 Developing techniques to reduce errors in the initial solution used to start Bundle Adjustment
      3.1.1.1 Translation extraction
      3.1.1.2 Unbiased estimator of stereo point position
      3.1.1.3 Future development work
    3.1.2 Loop Closure
    3.1.3 Demonstrating improved Visual Navigation
  3.2 Future development work
  3.3 Proposed research timeline and targets
4 Bibliography
1 Introduction
Visual Navigation (VN) involves working out the position of, and path taken by, a human, robot or vehicle from a series of photographs taken en route. This is potentially a useful navigation tool in environments without satellite positioning, for example indoors, in rugged terrain or cities, underwater, on other planets or near the poles. It could also be used to augment satellite or inertial navigation sensors where more precision is required, or to provide a backup navigation system. VN is a major component of Visual Simultaneous Localisation and Mapping (V-SLAM). SLAM is the process by which a mobile robot can autonomously map and position itself within an unknown environment. Solving this problem is a major challenge in robotics and is heavily researched. It will enable robots to perform tasks requiring them to navigate unknown environments (for example so they can drive or deliver items, for domestic tasks, in military operations, or in hazardous environments). VN is likely to become a widely used navigation solution once these problems are solved, because cameras (and microprocessors) are already common on robots and portable computers (e.g. phones). Cameras have many additional uses, are relatively cheap, don't interfere with other sensors (they are passive if additional lighting is not needed), are not easily misled, and require no additional infrastructure (such as base stations or satellites). In theory VN will work in any environment where there is enough light and texture that static features can be identified. Humans manage to navigate using predominantly stereo vision whenever there is enough light to see our surroundings (although we also integrate acceleration information from our ears, feedback from our limbs, and understanding of what we are seeing; for example, we know the scale of objects in our environment and can identify those that are moving with respect to others). These applications usually require a near real-time response to what has been seen.
Ideally we need each photo or video frame to be processed rapidly enough that the robot does not have to stop while calculation of its position catches up, and so it can react to its change in position (to stop moving, having reached a goal, or to re-plan its path given new information on its environment). To avoid losing our position the frame rate must be high enough that there is significant overlap between
consecutive frames. Real-time VN systems have been demonstrated in controlled environments (mainly two-dimensional and indoors), which limits their usefulness. The aim of my research will be to improve current VN technology so that it can be used in less constrained environments, with the aim of developing a real-time VN system robust enough for real-world applications.
2 Current Research and Technology

A typical feature-based VN pipeline has five stages: (1) capture a new frame, (2) detect features, (3) match them against features from previous frames, (4) reconstruct their 3d positions, and (5) estimate the camera motion. Stage 2 usually uses Harris corners or similar point features. Stage 3 either tracks features or uses a projective transform-invariant feature descriptor (SIFT [4] or SURF [5]) to find correspondences. Stage 5 involves a point-alignment algorithm (some form of Procrustes alignment, or the 3- or n-point algorithm), or a motion model, to estimate the translation and rotation. RANSAC may be used to remove outliers, and Bundle Adjustment [6] may be used to refine the position estimate. Measurements from an odometer or INS may be incorporated (usually using an Extended Kalman Filter). Repeating this algorithm for every frame gives us our position. An alternative approach, from Structure from Motion research, is to use Optical Flow [7]. This requires a high frame rate and depends on tracking many features across an image. Optical flow has been combined with feature tracking to make use of distant features by [8], and used for mobile robot navigation [9, 10], but it is generally chosen because it is fast to compute rather than for its accuracy; however, where stereo depth estimates are unreliable it may outperform feature-based algorithms, as depth is not used. [Neural nets???]
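The RANSAC outlier-rejection step mentioned above can be sketched as follows. This is a deliberately minimal illustration, not any published implementation: the "model" here is a simple 1-D translation between matched coordinates rather than a full 6-DOF pose, and all names are mine.

```python
import random

def ransac_translation(src, dst, iters=200, tol=0.5, seed=0):
    """Estimate a 1-D translation mapping src -> dst, rejecting outlier matches.

    A minimal sample (one correspondence) proposes a model; the model with
    the largest consensus set wins, and is then refined on its inliers.
    """
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        i = rng.randrange(len(src))
        t = dst[i] - src[i]                      # model from a minimal sample
        inliers = [j for j in range(len(src))
                   if abs(dst[j] - (src[j] + t)) < tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # refine on the consensus set (least squares = mean residual here)
    t = sum(dst[j] - src[j] for j in best_inliers) / len(best_inliers)
    return t, best_inliers
```

With one gross mismatch among five correspondences, the consensus set excludes it and the refined translation is recovered from the inliers alone.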
2.1 Accumulated Errors
VN implementations suffer badly from small errors accumulated over many steps. There are several ways of reducing this error: we can use Bundle Adjustment to refine our position estimate, we can integrate a complementary sensor (such as an inertial sensor), and/or we can recognise and use the position of places we've seen before (loop closure).
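The effect of accumulated errors can be illustrated with a toy simulation (the parameters are invented for illustration): integrating noisy relative position estimates makes the global error grow with path length, even though each individual step is accurate.

```python
import random

def simulated_drift(steps, sigma, seed=1):
    """Integrate noisy 1-D relative position estimates; return |final error|.

    Each step's true displacement is 1.0; the estimator sees it corrupted
    by zero-mean Gaussian noise, and the errors accumulate over the path.
    """
    rng = random.Random(seed)
    true_pos, est_pos = 0.0, 0.0
    for _ in range(steps):
        true_pos += 1.0
        est_pos += 1.0 + rng.gauss(0.0, sigma)
    return abs(est_pos - true_pos)
```

Averaged over many runs, the drift grows roughly with the square root of the number of steps (a random walk), which is why long trajectories need Bundle Adjustment, sensor fusion or loop closure to stay accurate.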
Kalman Filters are widely used to integrate complementary sensors and to cope with position estimate failures (e.g. when we can't track enough points between two images, or we lose our GNSS signal). Particle filters [15] are another, less popular, option that can deal better with erroneous input, or with data that is not well approximated by a Normal distribution, although it appears to be hard to make real-time implementations that provide the accuracy of Kalman Filter SLAM [16]. Odometry (for example on the Mars Rover [17]), GNSS (for example to navigate missiles), laser scanners (to identify profiles as recognisable descriptors [18]) and inertial navigation sensors have all been integrated with VN.
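The core of Kalman-style sensor integration can be shown in one dimension (the numbers below are hypothetical): two independent Gaussian position estimates, e.g. one from VN and one from an inertial sensor, are combined by inverse-variance weighting, which is exactly the scalar Kalman update.

```python
def fuse(x1, var1, x2, var2):
    """Fuse two independent Gaussian position estimates (scalar Kalman update).

    The gain weights each estimate by the other's variance; the fused
    variance is always smaller than either input variance.
    """
    k = var1 / (var1 + var2)          # Kalman gain
    x = x1 + k * (x2 - x1)            # fused mean
    var = (1.0 - k) * var1            # fused variance
    return x, var
```

For example, fusing an estimate of 10.0 (variance 4.0) with one of 12.0 (variance 1.0) gives a mean pulled towards the more certain sensor, with lower variance than either input.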
Loop closure can also cause us to lose our position if we recognise a scene incorrectly. It is therefore important that what we see is distinctive and discriminative, so we don't incorrectly recognise scenes that are common in our environment. We would normally do this by clustering descriptors; descriptors in a large cluster are less distinctive than those in a small cluster [23]. The Bag of Words algorithm has been applied to SIFT descriptors to identify discriminative combinations of descriptors [24, 25]. For example, when navigating indoors, window corners are common and so are not good features with which to identify scenes. Features found on posters or signs are much better, although even these may be repeated elsewhere. Scenes with very little detail are more likely to be falsely recognised than those with more detail (which are less likely to happen to have the same property that we are recognising). [26] demonstrates real-time loop detection using a hand-held mono camera, using SIFT features and histograms (of intensity and hue) combined using a Bag of Words approach. [27] also demonstrated real-time loop closure outdoors using SIFT features and laser scan profiles. Much work to remove visually ambiguous scenes was needed, and more complex profiles were preferred as they provide more discriminative features. The geometry of scenes has rarely been used to recognise them, even though it is likely that people use scene geometry. This is probably because of a lack of geometric properties that are invariant under perspective transformations. In a 2d projection, ratios of distances, angles, and ordering along lines are not preserved. The epipolar constraint does define an invariant feature, but this is defined by seven feature matches (up to a small number of possibilities), so eight or more points are needed for it to validate or invalidate a possible correspondence.
Simple geometric constraints have been used to eliminate triples of bad correspondences [28], but the key assumption here is that points lie on a convex surface, i.e. there is no occlusion, which is not a good assumption for real world navigation applications. If suitable descriptors can be found then geometric constraints would be very useful for identifying distinctive scenes in an environment made up of different arrangements of similar components.
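The Bag of Words idea above can be sketched as follows (the vocabulary and scenes are invented): each scene becomes a histogram of quantised descriptor "words", and words common across the whole corpus, like window corners indoors, are down-weighted tf-idf style so that distinctive features dominate scene matching. This is a minimal illustration, not the cited implementations.

```python
import math
from collections import Counter

def bow_similarity(scene_a, scene_b, corpus):
    """Compare two scenes described as lists of visual-word ids.

    Words occurring in many scenes of the corpus get a low
    inverse-document-frequency weight; rare, discriminative words dominate.
    """
    n = len(corpus)
    idf = {w: math.log(n / sum(1 for s in corpus if w in s))
           for s in corpus for w in s}
    a, b = Counter(scene_a), Counter(scene_b)
    return sum(a[w] * b[w] * idf.get(w, 0.0) ** 2 for w in a if w in b)
```

A shared word that appears in every scene contributes nothing (its idf is zero), while a shared rare word produces a positive score, so two scenes matching only on ubiquitous features are not falsely recognised as the same place.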
2.2 Motion Models, Frame Rates and Robustness
A high frame rate is desirable for VN so that there is always overlap between consecutive frames (and preferably sequences of frames), so that feature correspondences can be found between frames. For example, a camera held by a human may be swung through 180 degrees in approximately a second; if the frame is 45 degrees wide then this requires a frame rate higher than four frames per second for there to be any overlap between frames at all, and higher than eight frames per second if any features are to be tracked across more than two images. Omni-directional cameras may partially solve this, although they provide a wider but less detailed view of the scene, and are more expensive and less generic. Often the camera motion is modelled and used as an initial estimate of the translation between frames. This works well in many applications (such as UAVs) where the acceleration in the interval between frames is small; however, for cameras attached to humans (or even fast robots, or robots travelling through difficult terrain) jerky movement and rapid rotation are likely, and the algorithm must be able to cope reliably with unpredictable motion (in other words, it should be able to estimate motion from what it sees alone). Unpredictable motion is also when errors accumulate most rapidly; a high frame rate is less critical during smooth motion, which is exactly when a motion model is most effective at speeding up the algorithm.
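This frame-rate requirement can be captured in a small helper (names mine). The condition used here is the strict one: between the first and last frame that share a feature, the camera must turn through less than one field of view.

```python
def min_frame_rate(angular_velocity_deg, fov_deg, shared_frames=2):
    """Minimum frame rate (Hz) for a feature to appear in `shared_frames`
    consecutive frames while the camera pans at `angular_velocity_deg`/s.

    Between the first and last of those frames the camera turns through
    (shared_frames - 1) inter-frame intervals, whose total must be < fov.
    """
    return angular_velocity_deg * (shared_frames - 1) / fov_deg
```

For the hand-held example (180 degrees per second, 45-degree field of view), any overlap between consecutive frames needs more than 4 fps, and tracking a feature across three frames needs more than 8 fps.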
3 PhD Plan:
I will develop ways of speeding up VN that uses BA to refine position estimates, by identifying existing algorithms and developing new algorithms that lead to more accurate initial solutions and robust outlier rejection (reducing the number of iterations needed and the probability of finding false minima). I will compare different approaches using simulated data to determine the best approach. I will investigate ways of improving the reliability of fast loop closure algorithms. I would like to incorporate image geometry into algorithms that match feature points, either as part of a descriptor or to validate matches. I will also investigate conditioning. I will implement VN software that I hope will provide real-world verification of the algorithms I have developed. Real-time navigation systems and real-time loop closure have been demonstrated before, but always with severe restrictions, e.g. on the number of points tracked or the maximum angular velocity, or restricted to 2d or relying on a level ground plane.
3.1 Progress to date
3.1.1 Developing techniques to reduce errors in initial solution used to start Bundle Adjustment.
It is advantageous to start Bundle Adjustment from a good approximation to the actual motion, both so that false minima are more likely to be avoided and to reduce the number of iterations needed to reach a good enough solution. To determine relative motion from a set of correspondences between 3d points, we first recover the rotation between the point sets, then the translation; this gives us the camera motion. This process is known as Procrustes Alignment and is related to Point-Pattern Matching. Sometimes the initial solution used is either the solution obtained by Bundle Adjustment on previous frames, or the assumption that the motion between frames is close to zero. This assumes that the frame rate is high enough that motion or acceleration between frames is small, which is a bad approximation when there is substantial motion between frames. That is precisely when we want the most accurate relative motion estimate, as smaller relative errors in estimating smaller movements contribute proportionally less to the global error. Various methods have been proposed for extracting the rotation:
1. [12] derived expressions for the translation, rotation and scale between a set of point correspondences that minimise the squared error.
2. [29] takes the Singular Value Decomposition of a matrix formed from products of the points and discards the diagonal factor to give an orthogonal rotation matrix.
An alternative approach would be to use the SVD to obtain the matrix that is the best least-squares (not necessarily rigid) transformation between the two point sets. Then we can either use the method in (2) (taking a second SVD of this 3x3 matrix), or use the algorithm of [30] to find the closest rotation to this matrix. After getting identical simulated results when comparing these methods, I have proved that they are equivalent. I have implemented a MATLAB program to simulate noisy stereo image data (projecting points onto images, adding Gaussian measurement noise, calculating 3d structure), with which I can compare these approaches against each other. Initial results show that the first method is slightly better (mean error approximately 5% lower) than the second given 5-15 points with a mean reprojection error of 0.01 radians (approximately 2 pixels for a typical camera), but the second is significantly better given more exact data (mean reprojection error 0.002 radians).
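Method (2), extracting the rotation from an SVD and discarding the diagonal factor, can be sketched as follows. The proposal's simulations are in MATLAB; this is a NumPy sketch for illustration only, using the standard cross-covariance construction, with variable names mine.

```python
import numpy as np

def rotation_from_correspondences(p, q):
    """Least-squares rotation R with R @ p[i] ~ q[i] for 3d point sets.

    Both sets are centred, the 3x3 cross-covariance is formed, and its SVD's
    diagonal factor is discarded; the determinant correction guards against
    returning a reflection when the data are noisy or degenerate.
    """
    p = p - p.mean(axis=0)
    q = q - q.mean(axis=0)
    h = q.T @ p                       # 3x3 cross-covariance of the point sets
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt
```

Applied to a point cloud and an exactly rotated copy of it, this recovers the rotation to machine precision; with measurement noise it returns the least-squares rotation that the text compares against method (1).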
Various methods have been proposed for extracting the translation:
1. The 3-point algorithm [31]: three non-collinear point correspondences determine a small finite set of possible new camera positions. We can solve this exactly, then disambiguate between solutions using a fourth correspondence.
2. The N-point algorithm [11]: the generalisation of (1) to n > 3 points, making use of the fact that the problem is overdetermined. These algorithms can position a camera relative to known 3d structure given a single 2d image; the only information used about the second point set is the angles between points. If we have a stereo pair we can use angles between 3d points (effectively an average of the angles from the two images, so we would expect a slightly reduced error), and also improve the distance estimates between point pairs (again by taking some form of average).
3. Procrustes Alignment (the difference of the centroids of the point sets) after applying the rotation. Unlike (1) and (2), this method is directly affected by the accuracy of the rotation estimate.
My simulations can currently use (1) or (3) and I am working on incorporating (2).
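Method (3), recovering the translation as a difference of centroids once the rotation is known, is essentially one line; a NumPy sketch (names mine, Python for illustration rather than the MATLAB used in my simulations):

```python
import numpy as np

def translation_from_correspondences(p, q, r):
    """Translation t with q[i] ~ r @ p[i] + t, given the rotation r.

    For the least-squares fit, t is the difference between the centroid of q
    and the rotated centroid of p; any error in r propagates directly into t.
    """
    return q.mean(axis=0) - r @ p.mean(axis=0)
```

The direct dependence on r is visible in the formula: this is why, unlike the 3- and n-point algorithms, this method is directly affected by the accuracy of the rotation estimate.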
3.1.1.2 Unbiased estimator of stereo point position

The expected true depth d of a point, given its measured depth, is:

    d = integral from e = -b/(2*d_m) to infinity of [ b*d_m / (b + 2*e*d_m) ] phi(e) de

where
    b is the baseline length,
    d_m is the measured depth,
    phi(e) is the density of e, the error in measuring one pixel.

The lower limit -b/(2*d_m) is high enough that we are not considering points that we couldn't possibly see because they are behind the camera. [TODO: should that be root 2? Sum of 2 NDs] This can be pre-computed and stored in a lookup table, so is relatively fast. Several small-angle assumptions are made, so possibly simulated data would give better estimates; I will try this approach.
3d points can be shifted by this amount in the direction of the stereo rig, so that the distribution of their true position is centred on the adjusted point. Preliminary results suggest the improvement in accuracy is small and insignificant (about 2.5%) until very noisy data is used, or until the point depth exceeds about ten times the stereo baseline, when it gives a significant improvement in accuracy. More simulations are necessary to establish when this is useful.
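The estimator above can be evaluated numerically for the lookup table. This sketch assumes phi is a zero-mean Gaussian pixel-error density with standard deviation `sigma`; all parameter values and names are invented for illustration, and plain trapezoidal integration is used, stopping just short of the integrable singularity at the lower limit.

```python
import math

def expected_depth(d_m, b, sigma, n=20001):
    """Numerically integrate d = int b*d_m/(b + 2*e*d_m) * phi(e) de.

    phi is a zero-mean Gaussian density with std dev sigma; the lower limit
    -b/(2*d_m) excludes impossible (behind-camera) depths. The grid stops
    slightly above the lower limit, where the integrand blows up.
    """
    lo = max(-b / (2.0 * d_m) * 0.999, -8.0 * sigma)   # stay off the pole
    hi = 8.0 * sigma
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        e = lo + i * h
        phi = math.exp(-0.5 * (e / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        f = b * d_m / (b + 2.0 * e * d_m) * phi
        total += f * (0.5 if i in (0, n - 1) else 1.0)   # trapezoid weights
    return total * h
```

With very small pixel noise the result collapses to the measured depth, as expected; with larger noise the expected true depth moves away from the measured value, which is the bias the adjustment above corrects.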
3.2 Future development work
I will debug and extend my navigation software to give motion estimates using the navigation algorithm identified by simulation. I will aim to show that this gives a good enough initial estimate to allow BA to refine the position estimate in real time. Fast BA code is available in the sba library for this purpose. Other potential areas to research include the trade-off between a high frame rate (features are tracked over many frames, but little time is available to process each frame) and spending longer refining estimates from less frequent, possibly higher resolution, frames. Generic relative-positioning techniques across multiple frames will enable me to incorporate loop closure into this software, by positioning relative to frames from the same position in the past. The most accurate navigation algorithms at the moment do not incorporate BA; hopefully, by identifying and refining the most appropriate algorithms, it will be possible to run BA in real time.
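The quantity BA minimises, the total squared reprojection error, can be sketched for a single camera. This is my own simplified pinhole model and naming, not the sba library's API; sba (and any BA solver) adjusts the pose and points to minimise the sum of squares of exactly these residuals.

```python
import numpy as np

def reprojection_residuals(points_3d, observations, r, t, focal=1.0):
    """Residuals between observed 2d features and reprojected 3d points.

    Each 3d point is moved into the camera frame (rotation r, translation t),
    projected by a simple pinhole model, and compared with its observed image
    position; BA minimises the sum of squares of these residuals.
    """
    cam = points_3d @ r.T + t                  # world -> camera frame
    proj = focal * cam[:, :2] / cam[:, 2:3]    # pinhole projection
    return (proj - observations).ravel()
```

Starting BA from a good initial pose means these residuals start small, so fewer Gauss-Newton or Levenberg-Marquardt iterations are needed and false minima are less likely, which is the motivation for the initial-solution work above.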
Learning
I will need to learn more Bayesian statistics and statistical geometry. This will allow me to understand and select existing methods, and to develop new solutions for the Loop Closure problem. I will do this primarily through reading. I will attend Mathematics undergraduate lecture courses on optimisation (MATH412-07S1), geometry (MATH407-07S1) and calculus (MATH264-07S1) to extend my pure mathematical knowledge to more applied fields. This will help me understand the concepts underlying BA and other geometric algorithms, and to develop and adapt them. For example, I am currently investigating whether it is beneficial for VN to adapt BA to minimise reconstruction errors rather than reprojection errors, which requires an understanding of robust optimisation. As the field of VN is moving rapidly and significant advances are likely, I will continue my literature review, paying particular attention to forthcoming conference proceedings and the activities of groups working on VN, V-SLAM and image recognition, including the following.

Key computer vision conferences:
- International Conference on Robotics and Automation 2008
- European Conference on Computer Vision 2008
- International Conference on Computer Vision 2008
- Computer Vision and Pattern Recognition 2008
- SLAM Summer School 2008 (unconfirmed at the moment)
Leading research groups in the field:
- ROBOTVIS: Computer Vision and Robotics, INRIA, Grenoble (localisation, VN, SLAM)
- Robotics Research Group, Oxford University (loop closure, SLAM, image registration)
- Center for Visualization & Virtual Environments, University of Kentucky (VN, photogrammetry applied to CV)
- The Australian Centre for Field Robotics, University of Sydney (SLAM and UAVs)
3.3 Proposed research timeline and targets
2008
- January: Complete investigation of approximate transformation extraction techniques from point sets.
- February: Learn sufficient Bayesian statistics to be able to adapt categorisation and discrimination (conditioning) techniques to the problems of VN.
- March: Complete analysis of, and publish results from, the transformation extraction experiments.
- April: Develop VN software to a stage where it can infer reasonable motion estimates from point sets. Aim for a real-time implementation.
- June: Complete incorporation of BA into the VN software to refine position. Aim for a near real-time implementation, identifying bottlenecks to help guide future work.
- August: Complete preliminary research into suitable registration algorithms and map formats for loop closure.
- October: Decide whether to extend the research to monocular vision or to stay with stereo.
- November: Complete addition of mapping to the VN software (either a database-of-descriptors approach or a traditional SLAM landmark map).

2009
- January: Decide whether to focus research efforts on sensor integration involving VN, or on localisation.
- March: Exhibit a VN system for 3d indoor or outdoor positioning.
- April: Publish details of any improvements of my VN implementation over existing technology.
- June: Complete research into registration/recognition in loop closure.
- November: Start writing thesis.

2010
- April: Complete experimental work.
- July: Submit PhD thesis. Write papers based on the thesis.
4 Bibliography
1. Yang, Q., et al., Stereo Matching with Color-Weighted Correlation, Hierarchical Belief Propagation and Occlusion Handling. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
2. Clemente, L.A., Davison, A.J., Reid, I., Neira, J. and Tardós, J.D., Mapping Large Loops with a Single Hand-Held Camera. In Robotics: Science and Systems (RSS), 2007.
3. Castle, R.O., Gawley, D.J., Klein, G. and Murray, D.W., Towards simultaneous recognition, localization and mapping for hand-held and wearable cameras. In International Conference on Robotics and Automation (ICRA), Rome, 2007.
4. Lowe, D.G., Object recognition from local scale-invariant features. In Seventh IEEE International Conference on Computer Vision (ICCV), 1999.
5. Tuytelaars, T. and Van Gool, L., Matching Widely Separated Views Based on Affine Invariant Regions. International Journal of Computer Vision, 2004. 59(1): p. 61-85.
6. Triggs, B., McLauchlan, P.F., Hartley, R.I. and Fitzgibbon, A.W., Bundle Adjustment -- A Modern Synthesis. In Vision Algorithms: Theory and Practice (International Workshop on Vision Algorithms, Corfu, Greece, September 1999), LNCS 1883, 1999.
7. Tomasi, C. and Kanade, T., Shape and Motion from Image Streams under Orthography: a Factorization Method. International Journal of Computer Vision, 1992. 9(2): p. 137-154.
8. Agrawal, M., Konolige, K. and Bolles, R.C., Localization and Mapping for Autonomous Navigation in Outdoor Terrains: A Stereo Vision Approach. In IEEE Workshop on Applications of Computer Vision (WACV), 2007.
9. Lee, S.Y. and Song, J.B., Mobile robot localization using optical flow sensors. International Journal of Control, Automation and Systems, 2004. 2(4): p. 485-493.
10. Davison, A.J., Real-time simultaneous localisation and mapping with a single camera. In Ninth IEEE International Conference on Computer Vision (ICCV), 2003.
11. Quan, L. and Lan, Z.-D., Linear N-Point Camera Pose Determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1999.
12. Horn, B.K.P., Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 1987. 4(4): p. 629.
13. Sünderhauf, N. and Protzel, P., Towards Using Sparse Bundle Adjustment for Robust Stereo Odometry in Outdoor Terrain. In Towards Autonomous Robotic Systems (TAROS), 2006.
14. Zhang, Z. and Shan, Y., Incremental motion estimation through modified bundle adjustment. In International Conference on Image Processing (ICIP), 2003.
15. Kwok, N.M. and Rad, A.B., A Modified Particle Filter for Simultaneous Localization and Mapping. Journal of Intelligent and Robotic Systems, 2006. 46(4): p. 365-382.
16. Dailey, M.N. and Parnichkun, M., Simultaneous Localization and Mapping with Stereo Vision. In 9th International Conference on Control, Automation, Robotics and Vision (ICARCV), 2006.
17. Maimone, M., Cheng, Y. and Matthies, L., Two years of Visual Odometry on the Mars Exploration Rovers. Journal of Field Robotics, 2007. 24(3): p. 169-186.
18. Ho, K. and Newman, P., Combining Visual and Spatial Appearance for Loop Closure Detection in SLAM. In 2nd European Conference on Mobile Robots, Ancona, Italy, 2005.
19. Veth, M.J. and Raquet, J., Fusion of Low-Cost Inertial Systems for Precision Navigation. In Proceedings of the ION GNSS, 2006.
20. Segvic, S., et al., Large scale vision-based navigation without an accurate global reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2007.
21. Felzenszwalb, P.F. and Huttenlocher, D.P., Pictorial structures for object recognition. International Journal of Computer Vision, 2005. 61(1): p. 55-79.
22. Taylor, B. and Vincent, L., Database assisted OCR for street scenes and other images. US Patent Office, 2007.
23. Neira, J. and Tardós, J.D., Data association in stochastic mapping using the joint compatibility test. IEEE Transactions on Robotics and Automation, 2001. 17(6): p. 890-897.
24. Fei-Fei, L. and Perona, P., A Bayesian Hierarchical Model for Learning Natural Scene Categories. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Volume 2, 2005.
25. Csurka, G., Dance, C., Fan, L., Willamowski, J. and Bray, C., Visual categorization with bags of keypoints. In ECCV 2004 Workshop on Statistical Learning in Computer Vision, 2004.
26. Filliat, D., A visual bag of words method for interactive qualitative localization and mapping. In International Conference on Robotics and Automation (ICRA), 2007.
27. Ho, K.L. and Newman, P., Detecting loop closure with scene sequences. International Journal of Computer Vision, 2007. 74(3): p. 261-286.
28. Hu, X. and Ahuja, N., Matching point features with ordered geometric, rigidity, and disparity constraints. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994. 16(10): p. 1041-1049.
29. Umeyama, S., Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991. 13(4): p. 376-380.
30. Bar-Itzhack, I.Y., New Method for Extracting the Quaternion from a Rotation Matrix. Journal of Guidance, Control, and Dynamics, 2000. 23(6).
31. Haralick, R.M., et al., Review and Analysis of Solutions of the Three-Point Perspective Pose Estimation Problem. International Journal of Computer Vision, 1994. 13(3): p. 331-356.
32. Hartley, R. and Zisserman, A., Multiple View Geometry in Computer Vision. 2nd ed. Cambridge University Press, 2004.