
What is GIS?

A geographic information system (GIS) integrates hardware, software, and data for capturing, managing, analyzing, and displaying all forms of geographically referenced information. GIS is a system of hardware and software used for storage, retrieval, mapping, and analysis of geographic data. Practitioners also regard the total GIS as including the operating personnel and the data that go into the system. Spatial features are stored in a coordinate system (latitude/longitude, state plane, UTM, etc.), which references a particular place on the earth. Descriptive attributes in tabular form are associated with spatial features. Spatial data and associated attributes in the same coordinate system can then be layered together for mapping and analysis. GIS can be used for scientific investigations, resource management, and development planning. GIS differs from CAD and other graphical computer applications in that all spatial data is geographically referenced to a map projection in an earth coordinate system. For the most part, spatial data can be "re-projected" from one coordinate system into another, thus data from various sources can be brought together into a common database and integrated using GIS software. Boundaries of spatial features should "register" or align properly when re-projected into the same coordinate system. Another property of a GIS database is that it has "topology," which defines the spatial relationships between features. The fundamental components of spatial data in a GIS are points, lines (arcs), and polygons. When topological relationships exist, you can perform analyses, such as modeling the flow through connecting lines in a network, combining adjacent polygons that have similar characteristics, and overlaying geographic features.
History of GIS

Milestones for computer-based GIS, by decade:

1960s

- Canada Geographic Information System (CGIS) developed: national land inventory that pioneered many aspects of GIS
- Harvard Lab for Computer Graphics and Spatial Analysis: pioneered software for spatial data handling
- US Bureau of Census developed the DIME data format
- ESRI founded

1970s

- CGIS fully operational (and still operational today)
- First Landsat satellite launched (USA)
- CARIS founded
- USGS begins Geographical Information Retrieval and Analysis System (GIRAS) to manage and analyze large land resources databases, and the Digital Line Graph (DLG) data format
- ERDAS founded
- ODYSSEY GIS launched (first vector GIS)

1980s

- ESRI launches ARC/INFO (vector GIS)
- GPS became operational
- US Army Corps of Engineers develops GRASS (raster GIS)
- MapInfo founded
- First SPOT satellite launched (Europe)
- IDRISI Project started (GIS program)
- SPANS GIS produced
- National Center for Geographic Information and Analysis (NCGIA) established in USA
- TIGER digital data

1990s

- MapInfo for Windows, Intergraph, Autodesk, others
- ESRI produces ArcView and ARCGIS
- $7+ billion industry

GIS components

- Spatial data
- Computer hardware / software tools
- Specific applications / decision-making objectives



The benefits of GIS include:

- Better information management
- Higher quality analysis
- Ability to carry out "what if?" scenarios
- Improved project efficiency

GIS Applications

- Facilities management
- Marketing and retailing
- Environmental
- Transport/vehicle routing
- Health
- Insurance

Geographic Data

1. Attribute data
   - Says what a feature is
   - E.g. statistics, text, images, sound, etc.
2. Spatial data
   - The spatial attribute is explicitly stated and linked to the thematic attribute for each data item
   - Says where the feature is
   - Co-ordinate based
   - Vector data (discrete features): points, lines, polygons (zones or areas)
   - Raster data: a continuous surface

Geo-referencing data

Capturing data

- Scanning: all of the map is converted into raster data
- Digitising: individual features are selected from the map as points, lines or polygons

Geo-referencing

- Initial scanning or digitising gives co-ordinates in inches from the bottom left corner of the digitiser/scanner
- Real-world co-ordinates are found for four registration points on the captured data
- These are used to convert the entire map onto a real-world co-ordinate system

Advantages of GIS

- Exploring both geographical and thematic components of data in a holistic way
- Stresses geographical aspects of a research question
- Allows handling and exploration of large volumes of data
- Allows integration of data from widely disparate sources
- Allows analysis of data to explicitly incorporate location
- Allows a wide variety of forms of visualisation

Limitations of GIS

- Data are expensive
- Learning curve on GIS software can be long
- Shows spatial relationships but does not provide absolute solutions
- Origins in the Earth sciences and computer science; solutions may not be appropriate for humanities research

Data Abstraction

- To use GIS, the real world must be abstracted into points, lines, polygons, raster cells, and attribute values
- Class examples may use common objects that most people will understand. If you understand how to abstract common objects, you will be able to apply the same method to objects in your field

What is Vector Data

Vector data uses points and their (x,y) coordinates to represent spatial features: points, lines and polygons.

Points

- A point is a 0-dimensional object and has only the property of location (x,y)
- Points can be used to model features such as a well, building, power pole, sample location, etc.
- Other names for a point are vertex, node, 0-cell

Lines

- A line is a one-dimensional object that has the property of length
- Lines can be used to represent roads, streams, faults, dikes, marker beds, boundaries, contacts, etc.
- Lines are also called an edge, link, chain, arc, 1-cell
- In an ArcInfo coverage an arc starts with a node, has zero or more vertices, and ends with a node

Polygons

- A polygon is a two-dimensional object with properties of area and perimeter
- A polygon can represent a city, geologic formation, dike, lake, river, etc.
- Other names for a polygon are face, zone, 2-cell
- Scale matters

Topology

- A set of rules on how objects relate to each other
- Major difference in file formats
- Higher level objects have special topology rules
- The science and mathematics of relationships used to validate the geometry of vector entities, and for operations such as network tracing and tests of polygon adjacency
- The study of geometric properties that do not change when the forms are bent, stretched or undergo similar geometric transformations
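The three vector primitives described above can be sketched with plain coordinate lists. This is a minimal illustration, not how any particular GIS stores its data; the feature names and coordinates are made up, and the polygon is assumed to be a simple closed ring whose first and last vertices coincide.

```python
import math

def line_length(coords):
    """Length of a polyline given as a list of (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(coords, coords[1:]))

def polygon_area(ring):
    """Area of a closed ring (first vertex == last vertex), shoelace formula."""
    twice_area = sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice_area) / 2.0

# Hypothetical features, one per primitive type.
well = (3.0, 4.0)                                  # point: location only
road = [(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]       # line: has length
lake = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0),
        (0.0, 3.0), (0.0, 0.0)]                    # polygon: area and perimeter

print(line_length(road))    # 11.0
print(polygon_area(lake))   # 12.0
print(line_length(lake))    # 14.0 (perimeter of the ring)
```

A point carries no measure of its own, a line adds length, and a polygon adds area and perimeter, which matches the 0-, 1- and 2-dimensional descriptions in the notes.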

Why Topology Matters

Error Detection

- Open polygons
- Unlabeled polygons
- Slivers
- Polygons that cannot exist next to each other

Network Modeling

- Show Placitas

Arc Node Topology

- Cover#
- Lpoly# and Rpoly#
- Tnode, fnode
- Label errors

Higher Level Objects

- Regions
- Networks
- TIN (triangulated irregular network)
- Dynamic Segmentation

Regions

- Overlapping areas with different attributes (e.g. fire history)
- Disconnected areas with the same attributes (e.g. Hawaii)

Networks

- Road systems, power grids, water supply, sewerage systems, drainage networks
- Continuous connected networks
- Rules for displacement in a network
- Attribute value accumulations due to displacements

TIN

- Vector surface model: triangulated irregular network
- A set of nonoverlapping triangles, each with a constant gradient
- A TIN can honor original input elevations

Dynamic Segmentation

- Combines a line coverage with a linear reference system
- Has event tables for point events and linear events

Shape Files

- Nontopological
- Advantages: no overhead to process topology
- Disadvantages: polygons are double digitized, no topologic data checking
- 3 files: .shp, .shx, .dbf

Coverages

- Original ArcInfo format
- Directory with several files
- Database files are stored in the Info directory
- Uses arc-node topology
- Planar enforcement
- Connectivity
- Adjacency

GeoDatabase

- New GIS format at ArcGIS 8.0
- Three types:
  - Personal Geodatabase: Microsoft Access 2000 database
  - File Geodatabase: XML based file
  - SDE GeoDatabase: multi-user; can connect to many RDBMS (Oracle, SQL Server, Informix); files are stored in the format native to the RDBMS
- Shapes are similar to shape files
- Object-oriented model, not geo-relational
- There are 26 topology rules that can be used to relate different layers

Raster Data Model

Grid Properties

- Each grid cell holds one value, even if it is empty
- A cell can hold an index standing for an attribute
- Cell resolution is given as its size on the ground
- Points and lines move to the center of the cell
- Minimum line width is one cell
- Rasters are easy to read and write, and easy to draw on the screen
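The way a point "moves to the center of the cell" can be sketched as follows. This is a minimal illustration assuming a north-up grid with its origin at the upper-left corner and square cells; the function names and the example coordinates are hypothetical.

```python
def point_to_cell(x, y, x_min, y_max, cell_size):
    """Return (row, col) of the cell containing (x, y).

    Assumes the grid origin is the upper-left corner (x_min, y_max),
    rows increase downward and columns increase to the right.
    """
    col = int((x - x_min) // cell_size)
    row = int((y_max - y) // cell_size)
    return row, col

def cell_center(row, col, x_min, y_max, cell_size):
    """Return the ground coordinates of the center of cell (row, col)."""
    x = x_min + (col + 0.5) * cell_size
    y = y_max - (row + 0.5) * cell_size
    return x, y

# Hypothetical 30 m grid; the point snaps to the center of its cell.
row, col = point_to_cell(500047.0, 3999920.0, 500000.0, 4000000.0, 30.0)
print(row, col)                                             # 2 1
print(cell_center(row, col, 500000.0, 4000000.0, 30.0))     # (500045.0, 3999925.0)
```

The snapped position differs from the original by up to half a cell in each direction, which is why positional precision in a raster is set by the cell size.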

Raster Pyramids

- Without pyramids, the entire raster must be read for each screen draw
- Pyramids store reduced-resolution dataset files (.rrd) to increase the speed of screen draws
- When you add a raster to ArcMap, if pyramids do not exist you can create them
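The idea behind pyramids can be sketched by repeatedly halving a grid's resolution. This is only an illustration of reduced-resolution levels: the 2x2 block averaging used here is an assumption, not a description of the .rrd format or of the resampling ArcMap actually applies, and the function names are hypothetical.

```python
def build_pyramid_level(grid):
    """Halve the resolution by averaging each 2x2 block (even dimensions assumed)."""
    rows, cols = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

def build_pyramids(grid):
    """Return successively coarser levels until a dimension is odd or below 2."""
    levels = []
    while (len(grid) >= 2 and len(grid[0]) >= 2
           and len(grid) % 2 == 0 and len(grid[0]) % 2 == 0):
        grid = build_pyramid_level(grid)
        levels.append(grid)
    return levels

full_resolution = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
for level in build_pyramids(full_resolution):
    print(level)
# [[1.0, 2.0], [3.0, 4.0]]
# [[2.5]]
```

When the display is zoomed out, a viewer can read one of the small coarse levels instead of every cell of the full raster, which is where the speed-up comes from.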

Raster Resampling

- Nearest Neighbor: closest cell; continuous and discrete data
- Bilinear interpolation: distance-weighted average of nearest 4 cells; continuous data only
- Cubic Convolution: distance-weighted average of nearest 16 cells; continuous data only

Quad Tree Compression

- May be used to get variable resolution for imagery in the National Map

What are Terrains?

- New dataset for ArcGIS 9.2
- They are a multi-resolution, TIN-based surface
- Comprised of measurements stored as features in a geodatabase
- Terrains live inside feature datasets, in a geodatabase
- Two main characteristics of terrains:
  - Feature classes participate in a terrain
  - Rules are established to generate TIN pyramids on-the-fly
- They are designed to handle mass volumes of point data in a logical and efficient storage mechanism

Raster Advantages

- Simple data structure
- Compatible with remotely sensed or scanned data
- Simpler spatial analysis procedures

Raster Disadvantages

- Requires greater storage space on computer
- Depending on pixel size, graphical output may be less pleasing
- Projection transformations are more difficult (and can be time consuming)
- More difficult to represent topological relationships
- Positional precision set by cell size

Vector Advantages

- Requires less disk storage space
- Topological relationships are readily maintained
- Graphical output more closely resembles hand-drawn maps
- Preferred for network analysis

Vector Disadvantages

- More complex data structure
- Not as compatible with remotely sensed data
- Software and hardware are often more expensive
- Some spatial analysis procedures may be more difficult
- Overlaying multiple vector maps is often time consuming

How do we describe geographical features?

By recognizing two types of data:

- Spatial data, which describes location (where)
- Attribute data, which specifies characteristics at that location (what, how much, and when)

How do we represent these digitally in a GIS?

- By grouping into layers based on similar characteristics (e.g. hydrography, elevation, water lines, sewer lines, grocery sales) and using either:
  - vector data model (coverage in ARC/INFO, shapefile in ArcView)
  - raster data model (GRID or Image in ARC/INFO & ArcView)
- By selecting appropriate data properties for each layer with respect to projection, scale, accuracy, and resolution

How do we incorporate into a computer application system?

- By using a relational Data Base Management System (DBMS)

Outline

- Spatial data types and attribute data types
- Relational database management systems (RDBMS): basic concepts
  - DBMS and tables
  - Relational DBMS
- Raster data structures: represent geography via grid cells
  - tessellations
  - run length compression
  - quad tree representation
  - BSQ/BIP/BIL
  - DBMS representation
  - File formats
- Vector data structures: represent geography via coordinates
  - whole polygon
  - point and polygon
  - node/arc/polygon
  - TINs
  - File formats

Spatial Data Types

- continuous: elevation, rainfall, ocean salinity
- areas:
  - unbounded: landuse, market areas, soils, rock type
  - bounded: city/county/state boundaries, ownership parcels, zoning
  - moving: air masses, animal herds, schools of fish
- networks: roads, transmission lines, streams
- points:
  - fixed: wells, street lamps, addresses
  - moving: cars, fish, deer

Attribute data types

- Categorical (name):
  - nominal: no inherent ordering (land use types, county names)
  - ordinal: inherent order (road class; stream class)
  - often coded to numbers (e.g. SSN), but you can't do arithmetic on them
- Numerical (known difference between values):
  - interval: no natural zero, so you can't say "twice as much" (temperature in Celsius or Fahrenheit)
  - ratio: natural zero, so ratios make sense, e.g. "twice as much" (income, age, rainfall)
  - may be expressed as integer [whole number] or floating point [decimal fraction]

Attribute data tables can contain locational information, such as addresses or a list of X,Y coordinates. ArcView refers to these as event tables. However, these must be converted to true spatial data (shape file), for example by geocoding, before they can be displayed as a map.
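Converting a table of X,Y coordinates into point features can be sketched as below. This is a minimal illustration, not ArcView's event-table mechanism: the table contents, the field names and the feature dictionary layout are all hypothetical, and it covers only X,Y columns (address geocoding needs a reference street dataset and is not shown).

```python
import csv
import io

# Hypothetical event table: attribute rows that happen to carry X,Y columns.
EVENT_TABLE = """site_id,x,y,depth_m
W1,350120.5,3874210.0,42
W2,351004.0,3873998.5,57
"""

def event_table_to_points(text, x_field="x", y_field="y"):
    """Turn each row into a point feature; remaining columns stay as attributes."""
    features = []
    for row in csv.DictReader(io.StringIO(text)):
        x = float(row.pop(x_field))
        y = float(row.pop(y_field))
        features.append({"geometry": ("Point", (x, y)), "attributes": row})
    return features

for feature in event_table_to_points(EVENT_TABLE):
    print(feature["geometry"], feature["attributes"])
```

Until this step is done the coordinates are just numbers in a table; afterwards each row has a geometry that a mapping layer can draw.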

GIS Data Models: Raster v. Vector

"raster is faster but vector is corrector" (Joseph Berry)

Raster data model

- location is referenced by a grid cell in a rectangular array (matrix)
- attribute is represented as a single value for that cell
- much data comes in this form:
  - images from remote sensing (LANDSAT, SPOT)
  - scanned maps
  - elevation data from USGS
- best for continuous features:
  - elevation
  - temperature
  - soil type
  - land use

Vector data model

- location referenced by x,y coordinates, which can be linked to form lines and polygons
- attributes referenced through unique ID number to tables
- much data comes in this form:
  - DIME and TIGER files from US Census
  - DLG from USGS for streams, roads, etc.
  - census data (tabular)
- best for features with discrete boundaries:
  - property lines
  - political boundaries
  - transportation
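The vector model's link between geometry and attributes through a unique ID can be sketched with two lookup tables. The parcel IDs, owners and zoning values are made up for illustration, and real systems keep these tables in files or a DBMS rather than in dictionaries.

```python
# Hypothetical parcel layer: geometry and attributes live in separate tables,
# joined only by a shared feature ID.
geometries = {
    101: [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
    102: [(10, 0), (20, 0), (20, 10), (10, 10), (10, 0)],
}
attributes = {
    101: {"owner": "Smith", "zoning": "residential"},
    102: {"owner": "Jones", "zoning": "commercial"},
}

def select_by_attribute(table, field, value):
    """Return the IDs of all records whose field equals value."""
    return sorted(fid for fid, record in table.items() if record.get(field) == value)

def get_feature(fid):
    """Join geometry and attributes for one feature through its ID."""
    return {"id": fid, "geometry": geometries[fid], "attributes": attributes[fid]}

commercial_ids = select_by_attribute(attributes, "zoning", "commercial")
print(commercial_ids)                     # [102]
print(get_feature(102)["geometry"][0])    # (10, 0)
```

A query runs against the attribute table alone, and the matching IDs are then used to fetch the shapes to draw.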

Concept of Vector and Raster

[Figure: a real-world scene shown in a raster representation (a grid with rows and columns numbered 0-9 and cells coded R, T and H) and in a vector representation (point, line, polygon).]

2/18/2003 Ron Briggs, UTDallas

POEC 5319 Introduction to GIS

Representing Data using Raster Model

- area is covered by a grid with (usually) equal-sized cells
- location of each cell is calculated from the origin of the grid: "two down, three over"
- cells are often called pixels (picture elements); raster data is often called image data
- attributes are recorded by assigning each cell a single value based on the majority feature (attribute) in the cell, such as land use type
- easy to do overlays/analyses, just by combining corresponding cell values: "yield = rainfall + fertilizer" (why raster is faster, at least for some things)
- simple data structure:
  - directly store each layer as a single table (basically, each is analogous to a "spreadsheet")
  - computer data base management system not required (although many raster GIS systems incorporate them)
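The "yield = rainfall + fertilizer" overlay can be sketched as a cell-by-cell sum of two layers. This is a minimal illustration with made-up values; it assumes both layers share the same origin, cell size and dimensions, and the function name is hypothetical.

```python
def overlay_add(layer_a, layer_b):
    """Combine two aligned raster layers by adding corresponding cell values."""
    same_shape = len(layer_a) == len(layer_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(layer_a, layer_b)
    )
    if not same_shape:
        raise ValueError("layers must have the same number of rows and columns")
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(layer_a, layer_b)
    ]

# Hypothetical layers, each stored like a small spreadsheet.
rainfall = [
    [3, 3, 2],
    [2, 1, 1],
]
fertilizer = [
    [1, 0, 0],
    [2, 2, 1],
]
crop_yield = overlay_add(rainfall, fertilizer)
print(crop_yield)          # [[4, 3, 2], [4, 3, 2]]
print(crop_yield[1][2])    # 2 (second row, third column: "two down, three over")
```

No geometry has to be intersected, because cells at the same row and column already cover the same ground; that is the sense in which raster overlay is fast.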

Raster Data Structures: Concepts

- grid often has its origin in the upper left, but note:
  - State Plane and UTM: lower left
  - lat/long & cartesian: center
- single values associated with each cell: typically 8 bits assigned to values, therefore 256 possible values (0-255)
- rules needed to assign value to cell if object does not cover entire cell:
  - majority of the area (for continuous coverage feature)
  - value at cell center
  - touches cell (for linear feature such as road)
  - weighting to ensure rare features represented
- choose raster cell size 1/2 the length (1/4 the area) of smallest feature to map (smallest feature called minimum mapping unit or resel, for resolution element)
- raster orientation: angle between true north and direction defined by raster columns
- class: set of cells with same value (e.g. type=sandy soil)
- zone: set of contiguous cells with same value
- neighborhood: set of cells adjacent to a target cell in some systematic manner

Raster Data Structures: Tessellations

(Geometrical arrangements that completely cover a surface.)

- Square grid: equal length sides
  - conceptually simplest
  - cells can be recursively divided into cells of same shape
  - 4-connected neighborhood (above, below, left, right) (rook's case): all neighboring cells are equidistant
  - 8-connected neighborhood (also include diagonals) (queen's case): all neighboring cells not equidistant; center of cells on diagonal is 1.41 units away (square root of 2)
- rectangular: commonly occurs for lat/long when projected; data collected at 1 degree by 1 degree will be varying sized rectangles
- triangular (3-sided) and hexagonal (6-sided): all adjacent cells and points are equidistant
- triangulated irregular network (TIN): vector model used to represent continuous surfaces (elevation); more later under vector

Vector Data Model

Representing Data using the Vector Model: formal application

- point (node): 0-dimension
  - single x,y coordinate pair
  - zero area
  - tree, oil well, label location
- line (arc): 1-dimension
  - two (or more) connected x,y coordinates
  - road, stream
- polygon: 2-dimensions
  - four or more ordered and connected x,y coordinates
  - first and last x,y pairs are the same
  - encloses an area
  - census tracts, county, lake

Whole Polygon (boundary structure)

Polygons are described by listing the coordinates of points in order as you "walk around" the outside boundary of the polygon.

- all data stored in one file; could also store (inefficiently) attribute data for polygon in same file
- coordinates/borders for adjacent polygons stored twice; may not be same, resulting in slivers (gaps), or overlap
- how to assure that both are updated?
- all lines are "double" (except for those on the outside periphery)
- no topological information about polygons: which are adjacent and have a common boundary?
- how to relate different geographies, e.g. zip codes and tracts?
- used by the first computer mapping program, SYMAP, in late 60s
- adopted by SAS/GRAPH and many business thematic mapping programs

Triangulated Irregular Network

A set of adjacent, non-overlapping triangles computed from irregularly spaced points, with x, y horizontal coordinates and z vertical elevations.

- Advantages
  - Can capture significant slope features (ridges, etc.)
  - Efficient since few triangles are required in flat areas
  - Easy for certain analyses: slope, aspect, volume
- Disadvantages
  - Analysis involving comparison with other layers difficult

TIN Strengths

- Automated basin delineation with parameter calculations
- Adaptive resolution
- You can use most any elevation data source
- Urban areas where small variations in flow can be significant
- It was in WMS first
- Reservoir definition, storage capacity curves, time area curves, flood-plain delineation

TIN Weaknesses

- Lack of available data (with the conceptual model approach this is not such a big factor anymore)
- Extra steps
- Local editing

Digital Elevation Model

A sampled array of elevations (z) that are at regularly spaced intervals in the x and y directions. There are two approaches for determining the surface z value of a location between sample points:

- In a lattice, each mesh point represents a value on the surface only at the center of the grid cell. The z-value is approximated by interpolation between adjacent sample points; it does not imply an area of constant value.
- A surface grid considers each sample as a square cell with a constant surface value.

- Advantages
  - Simple conceptual model
  - Data cheap to obtain
  - Easy to relate to other raster data
  - Irregularly spaced set of points can be converted to regular spacing by interpolation
- Disadvantages
  - Does not conform to variability of the terrain
  - Linear features not well represented

What is Cartography?

- Art/science/technology of making maps
- Beauty vs. usefulness
- Cartographic design is a complex task
- Unlimited options (16 million colours, many kinds of lines and symbols)
- A good map makes it easy for a reader to acquire your intended information by:
  - Depicting data effectively
  - Reflecting the relative importance of features
  - Reducing distraction

Cartographic Specifications

- Perception threshold = legibility of smallest detail
  - Line thickness should not be less than 0.1 mm
  - Points: 0.5 mm
- Separation threshold = distinction between adjacent features > 0.2 mm, e.g. road and railroad line
- Differentiation threshold = smallest difference between the nearest same size symbols, e.g. proportional symbols; can do this by artificially making symbols larger to increase the contrast

Map Projections

- Mathematical method for systematically transforming a 3-D earth into a 2-D map
- Three traditional types:
  - cylindrical
  - conical
  - planar (azimuthal-zenithal)
- Newer mathematical projections: Robinson

- All maps introduce distortion:
  - shape (conformance)
  - size (equivalence)
  - direction
  - distance
- Maps can be either equivalent or conformal, but cannot emphasize both characteristics

General Types of Maps

General Purpose and Topographic

- Depict the form and relief of the surface and/or general features, such as roads, buildings, and political boundaries

Thematic

- These maps represent the spatial dimensions of a particular phenomenon (theme)
- Types:
  - Isopleth maps: isolines connect points of equal magnitude
  - Choropleth maps: tonal shadings are graduated to represent areal variations in number or density within a region, usually a formal region

Map Scale

Map scale relates distance on the map to distance on the earth; thus a smaller scale represents a larger area.

- Small scale shows a large area: 1:10,000,000 would represent about 1/2 of the U.S. on a single page of paper
- Large scale shows a small area: 1:63,360 would represent a small town on a single page of paper
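The two scales quoted above can be checked with simple arithmetic: a scale of 1:N means one unit on the map equals N of the same units on the ground. The sketch below is illustrative only; the 11-inch page width and the function name are assumptions.

```python
INCHES_PER_MILE = 63360  # 5280 feet x 12 inches

def ground_distance(map_distance, scale_denominator):
    """Ground distance, in the same units as map_distance, at scale 1:scale_denominator."""
    return map_distance * scale_denominator

# At 1:63,360 one inch on the map is exactly one mile on the ground.
print(ground_distance(1.0, 63360) / INCHES_PER_MILE)            # 1.0

# Assumed 11-inch-wide page at each scale, in miles of ground covered.
print(ground_distance(11.0, 63360) / INCHES_PER_MILE)           # 11.0
print(round(ground_distance(11.0, 10_000_000) / INCHES_PER_MILE, 1))   # 1736.1
```

The larger the denominator, the more ground each inch of paper covers, which is why 1:10,000,000 counts as the "smaller" scale.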

Mapping Process

1. Planning Stage
   - Needs assessment
   - Projection specifications are established
2. Data Acquisition Stage
   - Primary/secondary data collection
3. Cartographic Production Stage
   - Design
   - Drafting
   - Proofing
   - Printing
4. Quality Assurance/Quality Control
5. Product Delivery Stage

GIS Data Sources

- Spatial sources: maps and plans, digital data, remote sensing, photogrammetry, field surveys, paper files
- Non-spatial sources: paper files, digital data, interviews, field surveys
- Most suitable data format in GIS: GeoTIFF, because all colour grades can be saved in GeoTIFF
- Changing to digital formats by scanning, digitizing, and keyboard entry for coordinates and projection

DATA ACCURACY AND QUALITY


The quality of data sources for GIS processing is becoming an ever increasing concern among GIS application specialists. With the influx of GIS software on the commercial market and the accelerating application of GIS technology to problem solving and decision making roles, the quality and reliability of GIS products is coming under closer scrutiny. Much concern has been raised as to the relative error that may be inherent in GIS processing methodologies. While research is ongoing, and no finite standards have yet been adopted in the commercial GIS marketplace, several practical recommendations have been identified which help to locate possible error sources, and define the quality of data.
Quality

Quality can simply be defined as the fitness for use for a specific data set. Data that is appropriate for use with one application may not be fit for use with another. It is fully dependent on the scale, accuracy, and extent of the data set, as well as the quality of other data sets to be used. The recent U.S. Spatial Data Transfer Standard (SDTS) identifies five components to data quality definitions. These are:

- Lineage
- Positional Accuracy
- Attribute Accuracy
- Logical Consistency
- Completeness


Lineage

The lineage of data is concerned with historical and compilation aspects of the data such as the:

- source of the data;
- content of the data;
- data capture specifications;
- geographic coverage of the data;
- compilation method of the data, e.g. digitizing versus scanned;
- transformation methods applied to the data; and
- the use of any pertinent algorithms during compilation, e.g. linear simplification, feature generalization.

Positional Accuracy

The identification of positional accuracy is important. This includes consideration of inherent error (source error) and operational error (introduced error). A more detailed review is provided in the next section.
Attribute Accuracy

Consideration of the accuracy of attributes also helps to define the quality of the data. This quality component concerns the identification of the reliability, or level of purity (homogeneity), in a data set.
Logical Consistency

This component is concerned with determining the faithfulness of the data structure for a data set. This typically involves spatial data inconsistencies such as incorrect line intersections, duplicate lines or boundaries, or gaps in lines. These are referred to as spatial or topological errors.
Completeness

The final quality component involves a statement about the completeness of the data set. This includes consideration of holes in the data, unclassified areas, and any compilation procedures that may have caused data to be eliminated. The ease with which geographic data in a GIS can be used at any scale highlights the importance of detailed data quality information. Although a data set may not have a specific scale once it is loaded into the GIS database, it was produced with levels of accuracy and resolution that make it appropriate for use only at certain scales, and in combination with data of similar scales.
Error

Two sources of error, inherent and operational, contribute to the reduction in quality of the products that are generated by geographic information systems. Inherent error is the error present in source documents and data. Operational error is the amount of error produced through the data capture and manipulation functions of a GIS. Possible sources of operational errors include:

- mis-labelling of areas on thematic maps;
- misplacement of horizontal (positional) boundaries;
- human error in digitizing;
- classification error;
- GIS algorithm inaccuracies; and
- human bias.

While error will always exist in any scientific process, the aim within GIS processing should be to identify existing error in data sources and minimize the amount of error added during processing. Because of cost constraints it is often more appropriate to manage error than to attempt to eliminate it. There is a trade-off between reducing the level of error in a database and the cost to create and maintain the database. An awareness of the error status of different data sets will allow users to make a subjective statement on the quality and reliability of a product derived from GIS processing. The validity of any decisions based on a GIS product is directly related to the quality and reliability rating of the product. Depending upon the level of error inherent in the source data, and the error operationally produced through data capture and manipulation, GIS products may possess significant amounts of error.
One of the major problems currently existing within GIS is the aura of accuracy surrounding digital geographic data. Often hardcopy map sources include a map reliability rating or confidence rating in the map legend. This rating helps the user in determining the fitness for use for the map. However, rarely is this information encoded in the digital conversion process.

Often because GIS data is in digital form and can be represented with a high precision it is considered to be totally accurate. In reality, a buffer exists around each feature which represents the actual positional location of the feature. For example, data captured at the 1:20,000 scale commonly has a positional accuracy of +/- 20 metres. This means the actual location of features may vary 20 metres in either direction from the identified position of the feature on the map. Considering that the use of GIS commonly involves the integration of several data sets, usually at different scales and quality, one can easily see how errors can be propagated during processing.
Map Accuracy Assessment

The purpose of accuracy assessment is to allow a potential user to determine the map's "fitness for use" for their application.

- Spatial Accuracy
- Thematic Accuracy
- Topological Accuracy
- Temporal Accuracy

What Kinds of Map Accuracy?

Don't be surprised to have an experienced geospatial analyst give you a puzzled look when you say "the data is very accurate". The reason for the puzzled look is that there are many different categories of map accuracy. The different categories are:

- Spatial Accuracy: refers to the positional/coordinate accuracies within geospatial data. Maps created at different scales will have different levels of generalization, and subsequently different positional accuracies.
- Thematic Accuracy: refers to the accuracies of the attributes that describe a geographic feature. Depending upon how information was collected, there can be misinterpretation of particular geographic objects, or errors in entering the data in the computer.
- Topological Accuracy: refers to the geometric connectivity of the data. Poorly digitized data may include gaps, or unconnected line segments.
- Temporal Accuracy: refers to how accurate the information is over a given period of time. A map is only a snapshot of reality for the time in which the data was collected. Therefore, some assessment of how the geographic objects may change over time is important.

What is Cohen's Kappa?

- A measure of agreement that compares the observed agreement to the agreement expected by chance if the observer ratings were independent
- Expresses the proportionate reduction in error generated by a classification process, compared with the error of a completely random classification
- For perfect agreement, kappa = 1
- A value of .82 would imply that the classification process was avoiding 82% of the errors that a completely random classification would generate
- Kappa is 1 for perfectly accurate data (all N cases on the diagonal) and zero for accuracy no better than chance
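Kappa can be computed from a confusion matrix with the standard formula kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement (the diagonal) and p_e is the agreement expected by chance from the row and column totals. The sketch below uses made-up two-class matrices; which axis holds map classes and which holds reference classes is an assumption that does not affect the result.

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix of counts.

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when every case falls in a single class.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 2-class confusion matrices, 100 cases each.
print(cohens_kappa([[50, 0], [0, 50]]))              # 1.0  (all cases on the diagonal)
print(cohens_kappa([[25, 25], [25, 25]]))            # 0.0  (no better than chance)
print(round(cohens_kappa([[45, 5], [5, 45]]), 2))    # 0.8
```

In the last case 90% of the cases agree, but half of that agreement would be expected by chance, so kappa reports 0.8 rather than 0.9.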

Arthur J. Lembo, Jr. Cornell University

Fuzzy Accuracy Assessment

There is a fundamental problem with the confusion matrix: the ground data may not be just "correct" but "somewhat correct". This is a problem of classification, with each answer rated on a scale:

- (1) absolutely wrong
- (2) understandable but wrong
- (3) reasonable, acceptable, but there are better answers
- (4) good answer
- (5) absolutely right

The confusion matrix is expanded to answer two more precise questions:

- How frequently is the map category the best possible choice?
- How frequently is the map category acceptable?
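The two questions can be sketched as two counts over rated sample sites. This is a minimal illustration under stated assumptions: each site carries a 1-5 rating for every candidate category, "best possible choice" is taken to mean that the map category's rating equals the highest rating at the site, and "acceptable" is taken to mean a rating of 3 or higher. The sites, categories and function name are made up.

```python
def fuzzy_accuracy(sites, acceptable_level=3):
    """Return (fraction where map label is best, fraction where it is acceptable).

    Each site is (map_label, ratings), where ratings maps every candidate
    category to a 1-5 score assigned from ground data.
    """
    n = len(sites)
    best = sum(1 for label, ratings in sites
               if ratings[label] == max(ratings.values()))
    acceptable = sum(1 for label, ratings in sites
                     if ratings[label] >= acceptable_level)
    return best / n, acceptable / n

# Hypothetical sample sites.
sites = [
    ("forest", {"forest": 5, "shrub": 3, "water": 1}),
    ("shrub",  {"forest": 4, "shrub": 3, "water": 1}),
    ("water",  {"forest": 2, "shrub": 1, "water": 1}),
    ("forest", {"forest": 4, "shrub": 4, "water": 1}),
]
print(fuzzy_accuracy(sites))   # (0.5, 0.75)
```

The second site shows the difference between the two measures: "shrub" is not the best label there, yet it is still rated acceptable.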
