7. Strategies for strong data
Accuracy
Timeliness
Properly structured
Properly documented
8. Data accuracy
Data should accurately reflect reality
In GIS there are two types of accuracy to be
concerned with:
Spatial accuracy
Items located correctly
Attribute accuracy
Attributes are correct and properly linked to
geography
13. Timeliness
Is the data for the time period of interest?
Boundaries change
New features created
Features change
14. Data Structure
Proper data structure is necessary in order to
effectively use data
Software must know how to read the data and
query it.
The structure of the data is also known as data
schema
15. Data Schema
For most programs, data will need to be stored in
a row and column format
GIS programs expect well formed data in the
following schema:
One record per geographic unit
Geographic units don’t repeat in records
Variables are stored in columns
No blank cells unless data is missing
16. Data Schema
Population                China        India        United States   Indonesia
Total                     1339724852   1210193422   312417000       237556363
Percent of World's Pop.   19.23%       17.37%       4.48%           3.41%
Population Density        140/km2      368/km2      32/km2          121/km2
Poor data schema
•Columns are geographic units
•Variables are rows
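For illustration, the wide table above can be reshaped into the one-record-per-geographic-unit schema GIS programs expect. A minimal sketch using pandas (the DataFrame contents simply restate the table above):

```python
import pandas as pd

# Poor schema: columns are geographic units, variables are rows
poor = pd.DataFrame(
    {"China": [1339724852, 19.23, 140],
     "India": [1210193422, 17.37, 368],
     "United States": [312417000, 4.48, 32],
     "Indonesia": [237556363, 3.41, 121]},
    index=["Total", "Pct of World Pop", "Density per km2"],
)

# Transpose so each row is one geographic unit and each column a variable
proper = poor.T.reset_index().rename(columns={"index": "Country"})
print(proper)
```

After the transpose, each country is one record and each variable is one column, matching the schema rules listed above.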
19. Metadata
Data about data
Provides information on:
Source of data
Who created it
When it was created
Coordinate system and datum
Usage and sharing restrictions
20. Metadata
Metadata is especially important with spatial data
because of issues of:
Spatial accuracy
Coordinate systems and datums
Confidentiality
Timeliness
21. Metadata formats
International standard
ISO 19115
Mandatory elements
Schema for metadata
Countries may have their own national standards
that are compatible with the ISO standard but
provide extra elements
23. Data Types
Text
Numeric
Coordinates
Programs assign variables to be a specific type
which can affect the way the program handles
data
24. Data Types
Text
Arithmetic cannot be conducted on values in text
fields
Numeric
Arithmetic permitted
May require user to declare number of decimal
places before entering data
This can be important when storing coordinates
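As an illustrative sketch of the decimal-places point (the ~111 km-per-degree figure is an approximation, and the coordinate reuses the Delhi example from this deck):

```python
# Why declared decimal places matter for coordinates: storing a longitude
# with only 2 decimal places loses precision on the order of hundreds of
# metres, since one degree of longitude is roughly 111 km at the equator.
lon = 77.21211                      # example longitude from this deck

stored_2dp = round(lon, 2)          # field declared with 2 decimal places
stored_5dp = round(lon, 5)          # field declared with 5 decimal places

error_km_2dp = abs(lon - stored_2dp) * 111   # rough km per degree
print(f"2 decimals stores {stored_2dp}, error on the order of {error_km_2dp:.2f} km")
print(f"5 decimals stores {stored_5dp}, full precision kept")
```

If the field had been declared with only two decimal places before data entry, the truncated coordinate could not be recovered later.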
25. Linking data
Key field
The field that contains information common
between tables
Tables are linked using the key field
Can’t link using key fields of two different
types
26. District is the key field

District   Population   Male Pop   Female Pop
North      24015        14409      9606
West       31154        16202      14952
South      62442        29972      32470

District   Area (sq km)
North      243
West       310
South      602

Joined on District:

District   Population   Male Pop   Female Pop   Area (sq km)
North      24015        14409      9606         243
West       31154        16202      14952        310
South      62442        29972      32470        602
28. The two tables have different spellings for the district North Kinley

District       Population   Male Pop   Female Pop
North Kinley   24015        14409      9606
West           31154        16202      14952
South          62442        29972      32470

District    Area (sq km)
N. Kinley   243
West        310
South       602

Joined on District (the mismatched record is dropped):

District   Population   Male Pop   Female Pop   Area (sq km)
West       31154        16202      14952        310
South      62442        29972      32470        602
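The failed join above can be sketched with pandas (made-up data matching the tables in this deck). An inner join on the text key field silently drops the mismatched record, which is why checking the record count after the join is good practice:

```python
import pandas as pd

pop = pd.DataFrame({"District": ["North Kinley", "West", "South"],
                    "Population": [24015, 31154, 62442]})
area = pd.DataFrame({"District": ["N. Kinley", "West", "South"],
                     "Area_sq_km": [243, 310, 602]})

# Inner join on the text key field: "North Kinley" != "N. Kinley",
# so that record is silently dropped
joined = pop.merge(area, on="District")
print(len(joined))   # 2 records, not 3

# Good practice: check the record count after the join
if len(joined) != len(pop):
    print("Warning: records were lost in the join")
```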
29. Linking data
Linking using numeric fields is often more reliable
and less vulnerable to variations and other
issues
Countries often use numeric codes for
administrative units to get around problems with
spelling variations
If standardized national codes exist, it is a good
idea to include them in data
National Bureau of Statistics or Census often
manage such codes
30. Dist code is the key field

District       Dist code   Population   Male Pop   Female Pop
North Kinley   100         24015        14409      9606
West           200         31154        16202      14952
South          300         62442        29972      32470

District    Dist code   Area (sq km)
N. Kinley   100         243
West        200         310
South       300         602

Joined on Dist code:

District       Dist code   Population   Male Pop   Female Pop   Area (sq km)
North Kinley   100         24015        14409      9606         243
West           200         31154        16202      14952        310
South          300         62442        29972      32470        602
31. Advantage of numeric codes
Can manage hierarchy effectively

North District (code 100) contains three provinces:

Province         Code
North Coast      101
North Mountain   103
North Savanna    105
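A sketch of how such codes carry hierarchy, assuming the convention above (district codes assigned in the hundreds, with province codes filling in the 101-199 range):

```python
# Hierarchical numeric codes: integer division recovers the parent
# district code from any province code under this convention.
provinces = {"North Coast": 101, "North Mountain": 103, "North Savanna": 105}

for name, code in provinces.items():
    district_code = (code // 100) * 100   # 101 -> 100, 103 -> 100, ...
    print(f"{name} ({code}) belongs to district {district_code}")
```

This is why hierarchical codes make it easy to aggregate province-level data up to districts without any lookup table.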
32. Linking data key points
Key fields must be of the same type
Text fields can be problematic due to spelling
variations
Numeric fields are often a more reliable key field
Unique geography codes, if available in a country,
are often the best option for making linkages
33. Data and confidentiality issues
Important issue when working with spatial data
Discuss issues of confidentiality and spatial tools
Present strategies for protecting confidentiality
35. Overt disclosure
The act of explicitly
making data available
that breaches
confidentiality
commitments.
36. Deductive Disclosure
45 year old female
Has 5 children
Works for General Electric in Delhi
28.67171, 77.21211
38. Geoprivacy
“[an] individual’s right to
prevent disclosure of the
location of one’s home,
workplace, daily activities
or trips.”
Protection of geoprivacy and accuracy
of Spatial Information: How Effective are
Geographical Masks?
Kwan, Casas, Schmitz
Cartographica, Vol 39, #2
39. Four Principles
Protection of
Confidentiality
Social-Spatial Linkage
Data Sharing
Data Preservation
Confidentiality and spatially explicit data:
Concerns and challenges
VanWey, Rindfuss, Gutmann, Entwisle,
Balk PNAS, vol. 102, no. 43
40. 1. Protection of Confidentiality
Fundamental to ethical research
Information that might lead to physical,
emotional, financial or other harm
Protection of information that discloses identity
41. 2. Social-Spatial Linkage
All human activity takes place on earth
Understanding that adds context and perspective
Key to advancement of science
Essential for understanding the diffusion of
behaviors
42. 3. Data Sharing
Essential on both scientific and financial grounds
Provide access to data for other researchers
Condition of funders
43. 4. Data Preservation
Data available in the future
How long should data be deemed “sensitive”?
When, if ever, can it be released?
45. Random Perturbations
Random shifting of
point locations
Pros: Easy
(relatively) to do
Cons: Lose original
location, introduces
error
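A minimal sketch of random perturbation, not any specific plugin's method; the 500 m cap and the metres-to-degrees conversion are illustrative assumptions:

```python
import math
import random

def perturb(lat, lon, max_m=500):
    """Shift a point by a random distance (up to max_m metres) in a
    random direction. The original location is lost and error is
    introduced, which is exactly the trade-off noted above."""
    dist = random.uniform(0, max_m)
    angle = random.uniform(0, 2 * math.pi)
    dlat = (dist * math.cos(angle)) / 111_000   # metres -> degrees latitude
    dlon = (dist * math.sin(angle)) / (111_000 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

# Example: mask the Delhi coordinate used earlier in this deck
print(perturb(28.67171, 77.21211))
```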
46. Affine Transformation
Change scale
Rotate
Shift a set distance
Combination
Pros: Easy to do
Cons: Easy to undo,
can impact some
types of analysis
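A sketch of an affine transformation and of why it is easy to undo once the parameters are known (the scale, rotation, and shift values here are arbitrary examples):

```python
import math

def affine(x, y, scale=1.1, angle_deg=15.0, dx=0.5, dy=-0.3):
    """Mask a point by applying scale, rotation, then translation."""
    a = math.radians(angle_deg)
    xr = scale * (x * math.cos(a) - y * math.sin(a)) + dx
    yr = scale * (x * math.sin(a) + y * math.cos(a)) + dy
    return xr, yr

def affine_inverse(xr, yr, scale=1.1, angle_deg=15.0, dx=0.5, dy=-0.3):
    """Undo the masking -- trivial once the parameters are known."""
    a = math.radians(angle_deg)
    u, v = (xr - dx) / scale, (yr - dy) / scale
    return u * math.cos(a) + v * math.sin(a), -u * math.sin(a) + v * math.cos(a)

masked = affine(77.21211, 28.67171)
print(affine_inverse(*masked))   # recovers the original point
```

Because every point is transformed the same way, recovering (or guessing) one set of parameters unmasks the entire dataset.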
47. Aggregate
Point locations are
aggregated to
higher unit of
analysis
Pros: Easy to do
Cons: Requires
sufficient data
points, Finer data
variations will be lost
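A sketch of aggregation with pandas (made-up patient records): only district totals are released, so no individual point location leaves the dataset, but within-district variation is lost.

```python
import pandas as pd

# Individual records, each tied to a district (point coordinates dropped)
patients = pd.DataFrame({
    "district": ["North", "North", "West", "West", "West", "South"],
    "cases":    [1, 1, 1, 1, 1, 1],
})

# Release only the district totals, not the individual records
released = patients.groupby("district", as_index=False)["cases"].sum()
print(released)
```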
48. Despatialize
Remove Coordinate
System
Use Euclidean space
Pros: Simple, keeps
relative position and
placement
Cons: Loses
contextual data
49. Nothing
Do not collect or
release data
Cold room or on-site
analysis only
Pros: Maintains all of
the original spatial data
Cons: Complicated,
limits data sharing,
limits social-spatial link
50. Spatial Integrity vs. Disclosure Risk
Maximum spatial integrity / maximum disclosure risk
Minimum spatial integrity / minimum disclosure risk
A spectrum: preserving one trades off the other
51. “Ignoring is unacceptable”
Confidentiality issues can get lost in the excitement about GIS
Those who collect data must think about the
confidentiality issues
Data users must also think about how their
analysis may increase the risk of deductive
disclosure.
52. Key points
Confidentiality issues arise when spatial context
is included in data.
It’s important to protect confidentiality. People
have an expectation that their identities are
protected.
There are strategies that can preserve
confidentiality, but there is no “one-size-fits-all”
solution.
Editor's Notes
For this presentation we will talk about the role of data in the effective use of GIS. We will also cover the proper data structures and schemas for use in GIS as well as review the notion of metadata. Lastly we’ll review some important issues concerning linking data as well as discuss issues of confidentiality.
To review, you will remember that GIS combines software, hardware, procedures, people and data. Each element is important, but use of GIS is easier when the data is well formed and ready to go into GIS.
There is a rule of thumb with GIS work known as the 90% rule. It states that for any GIS activity, 90% of the cost will be devoted to data preparation, and 10% to actually producing maps.
This means that before any map can be produced, many tasks will need to be completed. For instance, it is necessary to collect, clean, validate, and format the data to make sure it is accurate. Then the data may need to be linked with other data before it can be used, which may mean additional work to make this possible. For mapping itself, there is indeed work to be done, but comparatively speaking, much less.
As you can see, data is important in GIS. In fact, GIS analysis is only as strong as the data used.
Data, whether in a GIS or not, should of course be accurate. This means that it reflects reality as much as possible. In GIS there are two types of accuracy to be concerned with: spatial accuracy which refers to whether items are located correctly and attribute accuracy, which refers to the attributes. Here this means that the attributes are correct and are properly linked to geography.
Here is a representation of spatial accuracy. Let’s say you found online a file with latitude and longitude coordinates of hotels in India. You decide you want to create a shapefile with these coordinates. When you then overlay them on images in Google Earth, you see that the points aren’t accurate. Here’s the scene in Google Earth <CLICK TO DISPLAY FIRST ANIMATED ELEMENT> And here is the location of the Hotel Suryaa <CLICK TO DISPLAY NEXT ANIMATED ELEMENT>. This location is inaccurate because the real location of the hotel Suryaa is here. <CLICK TO DISPLAY NEXT ANIMATED ELEMENT>. The point is off by 50 meters or more.
Spatial accuracy can be affected by scale. For instance, here is the same point when viewed at a different scale. <CLICK TO DISPLAY ANIMATED ELEMENT> At this scale the point location is still inaccurate, in that it isn’t the exact latitude and longitude for the hotel; however, because our scale exceeds the error of the point, the effect is less obvious. In fact, if the location was derived using a map at one scale, the accuracy can be assessed by using a map at a larger scale (a map that has “zoomed in”).
To illustrate, here is a screen shot from Google Maps. Even though it isn’t a GIS, it does rely on a spatial database in that it has locational information and attributes about the locations. If you zoom into the location of the hotel, you see that <CLICK TO DISPLAY FIRST ANIMATED ELEMENT> instead of saying the building is the Hotel Suryaa, it has the building listed as “Hotel Crowne Plaza”
The example from Google Maps illustrates another consideration for strong data, timeliness. Their database is old and doesn’t reflect that this hotel is now the Suryaa and no longer the Crowne Plaza. The world changes, that means that spatial databases, or any data set can quickly become out of date, so it is important to be aware of the timeliness of the data. The data doesn’t necessarily have to be the most recent, sometimes there may be value in having older files, for instance if you want to track changes over time. However, you as the data user needs to be aware of the time frame of the data you use and include information about the time frame of the data you create.
Software, whether it’s a GIS program or not, must know how to read and interpret data files. This means that the file needs to store the data in a standard way that the software expects. The way the data is stored is known as its structure or, more commonly, its schema.
A standard schema has evolved over the years for data, and it is considered best practice to use this schema generally, whether the data will be used in a GIS or not. This standard schema is as follows: one record per geographic unit, variables stored in columns, and no blank cells unless data is missing.
Here’s an example of poor data schema. The variables are listed as rows and the columns are the geographic units. It is still a valid way to display data for a table in a publication or presentation, but you would not want to store data using this schema if you wanted to use it in a GIS.
Here is another example of poor data schema. There are several things wrong with this table. First, there are blank cells that don’t represent missing data; the blank cells are supposed to indicate that the value of the last cell is to be repeated. <CLICK TO ADVANCE ANIMATION> Second, there are duplicate district names. In this made-up country, there are districts with the same name in different provinces. <CLICK TO ADVANCE ANIMATION> We’ll come back to this problem in a little bit.
Here is a proper data schema for a GIS program. As you can see there is one record per geographic unit. In this case Region. Regions don’t duplicate. Columns contain variables. Each cell contains well formed data.
As I mentioned, proper documentation is a key component of strong data. Including metadata is the best way to document data. Simply put Metadata is data about data. It provides the data user with information about the data such as: <READ SLIDE>
Metadata is especially important with spatial data because of issues of: Spatial accuracy: it’s important for data users to know how the data was collected and whether there are scale issues to consider. Coordinate systems and datums: sometimes it is clear what coordinate system was used, but other times it isn’t; without metadata the user may not know what coordinate system/datum the data is in, which may make the data difficult to use. Confidentiality: spatial data can raise issues concerning confidentiality and privacy; the metadata can make sure data users are aware of these issues and of any restrictions on sharing the data or even presenting maps. Timeliness: this one should be obvious: when the data was collected.
Because metadata is so important, the International Organization for Standardization (ISO) has produced an international standard for geographic metadata. The ISO 19115 standard mandates that certain elements be included in the metadata. It also defines the schema, or structure, for metadata. For more information about the ISO 19115 standard, you can visit the ISO web site. It’s important to note that many countries have developed their own national standards for spatial metadata; these national standards should be compatible with the ISO standard. It is important to research any national metadata standards you may want to conform to.
Here is an excerpt from metadata for a file obtained from the UN’s Second Administrative Level Boundary (SALB) site. The actual metadata file is much longer and contains many more elements, but this will give you an example of the type of information that is contained in a metadata file.
Most data programs differentiate between data types and will assign variables to be one type or another. The way the field is assigned can affect the way the program handles data.
For fields that are defined as text, arithmetic operations such as addition and subtraction are not allowed. For fields that are defined as numeric, arithmetic is permitted. One issue, however, is that many programs may require the user to declare the number of decimal places before entering data. This is an important consideration when storing coordinates in a field, since if an inadequate number of decimal places is declared, the full coordinate cannot be stored, which can have an impact on accuracy.
One of the key tasks that a GIS needs to be able to do is linking tables. GIS uses a key field to make the link between tables. A key field is the field that contains information common between the tables. It is important to remember that it is not possible to link tables using key fields that are two different types. In the next slides I’ll illustrate this.
Here are two tables. <ASK GROUP> What is the field that will be the key field? <ANSWER: DISTRICT> <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> It is possible to link these two tables using the common field, District. Just a note to point out that it is no coincidence that a geographic unit is the key field. As we’ve mentioned, geography is the common link between human activity. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> As you can see there is now a link between the two tables.
One important thing to point out is linking using text fields can be problematic because of variations in spelling.
Here are two tables, notice that they each have a different spelling for the district North Kinley. <ADVANCE SLIDE TO DISPLAY ANIMATION> <ASK GROUP> What do you think will happen? Will it be possible to join these tables? <ADVANCE SLIDE TO DISPLAY ANIMATION> The answer depends on the software and the settings you select, for many GIS programs, a link will be made for those records that do match. It’s easy to see that the linked table doesn’t have the complete number of records in this example, but if you had many records, it might be possible to miss this fact. So a good practice is to check the record count after the join to make sure it is correct.
As you can imagine, there are many different ways text fields can be problematic. Linking using numeric fields is often more reliable since they are less vulnerable to variations. For this reason, countries often use numeric codes to identify administrative units. Often the national bureau of statistics or census bureau manage such codes. If there are standardized national codes, it is a good idea to include them in databases.
So here are two tables with a field for district code which were assigned by the national bureau of statistics. If District code is used as the key field <ADVANCE SLIDE TO DISPLAY ANIMATION> then spelling variation in the district field doesn’t matter and the table can be joined successfully. <ADVANCE SLIDE TO DISPLAY ANIMATION>
Another advantage of numeric codes associated with geography is they can manage geographic hierarchy effectively. So let’s say this is North District. North District is divided into three provinces. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> Coast province, mountain province and savanna province. North district has a code of 100. Most countries set up their national codes so that hierarchy is included. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> As you can see from the table, all of the provinces are numbered in the 100’s since they are in North District.
To review, here are the key points from the discussion on linking data <READ SLIDE>
Now to switch topics slightly. Confidentiality is an important consideration when working with spatial data. During this part of the lecture, we’ll discuss issues of confidentiality and spatial tools as well as present strategies for protecting confidentiality.
So let’s start by talking about confidentiality and what I’m referring to. Put simply, confidentiality is the idea that it is important to protect the identity of individuals. This is a requirement of many informed consent agreements that people sign when we collect data. It’s also a pillar of ethical research.
There are two threats to confidentiality; one is overt disclosure. Overt disclosure is the act of explicitly making data available that breaches confidentiality commitments, such as releasing data files that contain an individual’s name and/or data.
The second way that confidentiality can be breached is through deductive disclosure. That is the process of piecing together multiple pieces of the puzzle until a picture emerges. So for instance, let’s say there was a survey conducted. If you knew that a person was a 45 year old female, that narrows down the list somewhat. [ADVANCE SLIDE] If you knew that she has 5 children, that narrows it down even more. [ADVANCE SLIDE] If you know that she works for General Electric in Delhi, that makes it easier to potentially identify the person. [ADVANCE SLIDE] If you then add a geographic coordinate of where she lives, [ADVANCE SLIDE] it’s almost the same as listing a name.
When you add a spatial component to data it can be an overt disclosure of identifying information. At the very least it makes deductive disclosure easier. So what’s the answer? Should the spatial element be dropped?
There is an emerging recognition that there is a need to explicitly define issues of geoprivacy. Geoprivacy is a term coined to refer to “an individual’s right to prevent disclosure of the location of one’s home, workplace, daily activities or trips”
As people have thought about this issue of geoprivacy, there are 4 principles that have been laid out to guide people: [READ SLIDE] I’ll talk about each of them
The first principle is the basic protection of confidentiality. This protection is fundamental to ethical research. Information that might lead to physical, emotional, financial or other harm. It’s important to protect information that discloses identity
The second key principle that informs the discussion on confidentiality is the importance of preserving the social-spatial linkage. As we’ve mentioned, all human activity takes place on earth. Understanding that adds context and perspective. It’s also key to the advancement of science and essential for understanding the diffusion of behaviors.
The third principle is the notion of data sharing. Data sharing means sharing data with other researchers or other important stakeholders. It’s essential on both scientific and financial grounds. It allows the data to have maximal use by letting other researchers use the data. Lastly, there’s a growing trend among funders of data collection efforts that the data be shared either publicly or within the research community.
The last principle is the notion that data should be preserved and be available for future use. This raises the questions: how long should the data be deemed “sensitive”? When, if ever, can it be released? These are things that should be considered at the beginning of any data collection effort or the establishment of a data system. They should be spelled out in advance to respondents or individuals who are providing information/data.
What are the strategies that can be employed to protect data?
The first strategy we’ll talk about is simply randomly shifting the locations. The advantage of this is that it is relatively easy to do; there are plugins for QGIS that will do this. The disadvantage is that you lose the original location and it introduces error.
The second strategy is what’s known as an affine transformation. This is a systematic change to the data: changing the scale, rotating, or shifting a set distance. This is easy to do, but it’s also easy to undo if people know the parameters of the transformation. In some cases, even if the exact parameters aren’t known, it’s still possible to deduce the type of transformation done if a set of points doesn’t match the geography on the ground (say, points end up in the ocean or a lake because of the transformation).
Another strategy is to just aggregate the data. So say, for instance, you have individual patient data: you can aggregate it to mask individual records. This too is easy to do, but it does require a sufficient number of data points, and finer data variations will be lost.
Another strategy is to despatialize the data. Simply remove the coordinate system that ties the data to the earth. It uses euclidean space instead of geographic space. This is simple, it keeps relative position and placement. On the downside though, you lose contextual data, so it won’t be possible to bring other data that might be helpful to look at (such as road networks or the surrounding landscape).
Lastly, there’s always “do nothing”. You could make the decision to not collect or release the data. Another option would be to set up a cold room or allow on-site analysis only. This maintains all of the original spatial data. The disadvantage is that schemes like cold-room or on-site analysis reduce accessibility to the data, which limits data sharing and the social-spatial link, and they can be complicated to implement.
There is no magic answer. It’s a matter of finding the technique that best suits your needs and the commitments made to respondents. It’s possible to think about the issue in terms of spatial integrity and disclosure risks. Making the decision on what approach to take is dependent on where on this spectrum you want to land. You can preserve spatial integrity or you can minimize risk of confidentiality breaches, but you can’t have both.
The article by Van Wey that I mentioned earlier has a quote that ignoring the issue is unacceptable. Something that often gets lost in the excitement over GIS is the issues around confidentiality. However, those who collect data must think about the confidentiality issues and make sure their informed consent agreements adequately describe the way data will and won’t be used. Data users also have a responsibility to ensure that any extra contextual analysis they do doesn’t increase the risk of deductive disclosure.