7. Strategies for strong data
Accuracy
Timeliness
Properly structured
Properly documented
8. Data accuracy
Data should accurately reflect reality
In GIS there are two types of accuracy to be
concerned with:
Spatial accuracy
Items located correctly
Attribute accuracy
Attributes are correct and properly linked to
geography
13. Timeliness
Is the data for the time period of interest?
Boundaries change
New features created
Features change
14. Data Structure
Proper data structure is necessary in order to
effectively use data
Software must know how to read the data and
query it.
The structure of the data is also known as data
schema
15. Data Schema
For most programs, data will need to be stored in
a row and column format
GIS programs expect well formed data in the
following schema:
One record per geographic unit
Geographic units don’t repeat in records
Variables are stored in columns
No blank cells unless data is missing
16. Data Schema
Population                China        India        United States   Indonesia
Total                     1339724852   1210193422   312417000       237556363
Percent of World's Pop.   19.23%       17.37%       4.48%           3.41%
Population Density        140/km2      368/km2      32/km2          121/km2
Poor data schema
•Columns are geographic units
•Variables are rows
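For illustration, the wide table above can be reshaped into the one-record-per-geographic-unit schema GIS programs expect. A minimal sketch using pandas (the DataFrame contents simply restate the table above):

```python
import pandas as pd

# Poor schema: columns are geographic units, variables are rows
poor = pd.DataFrame(
    {"China": [1339724852, 19.23, 140],
     "India": [1210193422, 17.37, 368],
     "United States": [312417000, 4.48, 32],
     "Indonesia": [237556363, 3.41, 121]},
    index=["Total", "Pct of World Pop", "Density per km2"],
)

# Transpose so each row is one geographic unit and each column a variable
proper = poor.T.reset_index().rename(columns={"index": "Country"})
print(proper)
```

After the transpose, each country is one record and each variable is one column, matching the schema rules listed above.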
19. Metadata
Data about data
Provides information on:
Source of data
Who created it
When it was created
Coordinate system and datum
Usage and sharing restrictions
20. Metadata
Metadata is especially important with spatial data
because of issues of:
Spatial accuracy
Coordinate systems and datums
Confidentiality
Timeliness
21. Metadata formats
International standard
ISO 19115
Mandatory elements
Schema for metadata
Countries may have their own national standards
that are compatible with the ISO standard but
provide extra elements
23. Data Types
Text
Numeric
Coordinates
Programs assign variables to be a specific type
which can affect the way the program handles
data
24. Data Types
Text
Arithmetic cannot be conducted on values in text
fields
Numeric
Arithmetic permitted
May require user to declare number of decimal
places before entering data
This can be important when storing coordinates
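As an illustrative sketch of the decimal-places point (the ~111 km-per-degree figure is an approximation, and the coordinate reuses the Delhi example from this deck):

```python
# Why declared decimal places matter for coordinates: storing a longitude
# with only 2 decimal places loses precision on the order of hundreds of
# metres, since one degree of longitude is roughly 111 km at the equator.
lon = 77.21211                      # example longitude from this deck

stored_2dp = round(lon, 2)          # field declared with 2 decimal places
stored_5dp = round(lon, 5)          # field declared with 5 decimal places

error_km_2dp = abs(lon - stored_2dp) * 111   # rough km per degree
print(f"2 decimals stores {stored_2dp}, error on the order of {error_km_2dp:.2f} km")
print(f"5 decimals stores {stored_5dp}, full precision kept")
```

If the field had been declared with only two decimal places before data entry, the truncated coordinate could not be recovered later.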
25. Linking data
Key field
The field that contains information common
between tables
Tables are linked using the key field
Can’t link using key fields of two different
types
26. District is the key field

District   Population   Male Pop   Female Pop
North      24015        14409      9606
West       31154        16202      14952
South      62442        29972      32470

District   Area (sq km)
North      243
West       310
South      602

Joined on District:

District   Population   Male Pop   Female Pop   Area (sq km)
North      24015        14409      9606         243
West       31154        16202      14952        310
South      62442        29972      32470        602
28. The two tables have different spellings for the district North Kinley

District       Population   Male Pop   Female Pop
North Kinley   24015        14409      9606
West           31154        16202      14952
South          62442        29972      32470

District    Area (sq km)
N. Kinley   243
West        310
South       602

Joined on District (the mismatched record is dropped):

District   Population   Male Pop   Female Pop   Area (sq km)
West       31154        16202      14952        310
South      62442        29972      32470        602
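The failed join above can be sketched with pandas (made-up data matching the tables in this deck). An inner join on the text key field silently drops the mismatched record, which is why checking the record count after the join is good practice:

```python
import pandas as pd

pop = pd.DataFrame({"District": ["North Kinley", "West", "South"],
                    "Population": [24015, 31154, 62442]})
area = pd.DataFrame({"District": ["N. Kinley", "West", "South"],
                     "Area_sq_km": [243, 310, 602]})

# Inner join on the text key field: "North Kinley" != "N. Kinley",
# so that record is silently dropped
joined = pop.merge(area, on="District")
print(len(joined))   # 2 records, not 3

# Good practice: check the record count after the join
if len(joined) != len(pop):
    print("Warning: records were lost in the join")
```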
29. Linking data
Linking using numeric fields is often more reliable
and less vulnerable to variations and other
issues
Countries often use numeric codes for
administrative units to get around problems with
spelling variations
If standardized national codes exist, it is a good
idea to include them in data
National Bureau of Statistics or Census often
manage such codes
30. Dist code is the key field

District       Dist code   Population   Male Pop   Female Pop
North Kinley   100         24015        14409      9606
West           200         31154        16202      14952
South          300         62442        29972      32470

District    Dist code   Area (sq km)
N. Kinley   100         243
West        200         310
South       300         602

Joined on Dist code:

District       Dist code   Population   Male Pop   Female Pop   Area (sq km)
North Kinley   100         24015        14409      9606         243
West           200         31154        16202      14952        310
South          300         62442        29972      32470        602
31. Advantage of numeric codes
Can manage hierarchy effectively

North District (code 100) contains three provinces:

Province         Code
North Coast      101
North Mountain   103
North Savanna    105
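A sketch of how such codes carry hierarchy, assuming the convention above (district codes assigned in the hundreds, with province codes filling in the 101-199 range):

```python
# Hierarchical numeric codes: integer division recovers the parent
# district code from any province code under this convention.
provinces = {"North Coast": 101, "North Mountain": 103, "North Savanna": 105}

for name, code in provinces.items():
    district_code = (code // 100) * 100   # 101 -> 100, 103 -> 100, ...
    print(f"{name} ({code}) belongs to district {district_code}")
```

This is why hierarchical codes make it easy to aggregate province-level data up to districts without any lookup table.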
32. Linking data key points
Key fields must be of the same type
Text fields can be problematic due to spelling
variations
Numeric fields are often a more reliable key field
Unique geography codes, if available in a country,
are often the best option for making linkages
33. Data and confidentiality issues
Important issue when working with spatial data
Discuss issues of confidentiality and spatial tools
Present strategies for protecting confidentiality
35. Overt disclosure
The act of explicitly
making data available
that breaches
confidentiality
commitments.
36. Deductive Disclosure
45 year old female
Has 5 children
Works for General Electric in Delhi
28.67171, 77.21211
38. Geoprivacy
“[an] individual’s right to
prevent disclosure of the
location of one’s home,
workplace, daily activities
or trips.”
Protection of geoprivacy and accuracy
of Spatial Information: How Effective are
Geographical Masks?
Kwan, Casas, Schmitz
Cartographica, Vol 39, #2
39. Four Principles
Protection of
Confidentiality
Social-Spatial Linkage
Data Sharing
Data Preservation
Confidentiality and spatially explicit data:
Concerns and challenges
VanWey, Rindfuss, Gutmann, Entwisle,
Balk PNAS, vol. 102, no. 43
40. 1. Protection of Confidentiality
Fundamental to ethical research
Information that might lead to physical,
emotional, financial or other harm
Protection of information that discloses identity
41. 2. Social-Spatial Linkage
All human activity takes place on earth
Understanding that adds context and perspective
Key to advancement of science
Essential for understanding the diffusion of
behaviors
42. 3. Data Sharing
Essential on both scientific and financial grounds
Provide access to data for other researchers
Condition of funders
43. 4. Data Preservation
Data available in the future
How long should data be deemed “sensitive”?
When, if ever, can it be released?
45. Random Perturbations
Random shifting of
point locations
Pros: Easy
(relatively) to do
Cons: Lose original
location, introduces
error
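A minimal sketch of random perturbation, not any specific plugin's method; the 500 m cap and the metres-to-degrees conversion are illustrative assumptions:

```python
import math
import random

def perturb(lat, lon, max_m=500):
    """Shift a point by a random distance (up to max_m metres) in a
    random direction. The original location is lost and error is
    introduced, which is exactly the trade-off noted above."""
    dist = random.uniform(0, max_m)
    angle = random.uniform(0, 2 * math.pi)
    dlat = (dist * math.cos(angle)) / 111_000   # metres -> degrees latitude
    dlon = (dist * math.sin(angle)) / (111_000 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon

# Example: mask the Delhi coordinate used earlier in this deck
print(perturb(28.67171, 77.21211))
```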
46. Affine Transformation
Change scale
Rotate
Shift a set distance
Combination
Pros: Easy to do
Cons: Easy to undo,
can impact some
types of analysis
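A sketch of an affine transformation and of why it is easy to undo once the parameters are known (the scale, rotation, and shift values here are arbitrary examples):

```python
import math

def affine(x, y, scale=1.1, angle_deg=15.0, dx=0.5, dy=-0.3):
    """Mask a point by applying scale, rotation, then translation."""
    a = math.radians(angle_deg)
    xr = scale * (x * math.cos(a) - y * math.sin(a)) + dx
    yr = scale * (x * math.sin(a) + y * math.cos(a)) + dy
    return xr, yr

def affine_inverse(xr, yr, scale=1.1, angle_deg=15.0, dx=0.5, dy=-0.3):
    """Undo the masking -- trivial once the parameters are known."""
    a = math.radians(angle_deg)
    u, v = (xr - dx) / scale, (yr - dy) / scale
    return u * math.cos(a) + v * math.sin(a), -u * math.sin(a) + v * math.cos(a)

masked = affine(77.21211, 28.67171)
print(affine_inverse(*masked))   # recovers the original point
```

Because every point is transformed the same way, recovering (or guessing) one set of parameters unmasks the entire dataset.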
47. Aggregate
Point locations are
aggregated to
higher unit of
analysis
Pros: Easy to do
Cons: Requires
sufficient data
points, Finer data
variations will be lost
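A sketch of aggregation with pandas (made-up patient records): only district totals are released, so no individual point location leaves the dataset, but within-district variation is lost.

```python
import pandas as pd

# Individual records, each tied to a district (point coordinates dropped)
patients = pd.DataFrame({
    "district": ["North", "North", "West", "West", "West", "South"],
    "cases":    [1, 1, 1, 1, 1, 1],
})

# Release only the district totals, not the individual records
released = patients.groupby("district", as_index=False)["cases"].sum()
print(released)
```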
48. Despatialize
Remove Coordinate
System
Use Euclidean space
Pros: Simple, keeps
relative position and
placement
Cons: Loses
contextual data
49. Nothing
Do not collect or
release data
Cold room or on-site
analysis only
Pros: Maintains all of
the original spatial data
Cons: Complicated,
limits data sharing,
limits social-spatial link
50. Spatial Integrity vs. Disclosure Risk
Maximum spatial integrity / maximum disclosure risk
Minimum spatial integrity / minimum disclosure risk
A spectrum: preserving one trades off the other
51. “Ignoring is unacceptable”
Confidentiality issues can get lost in the excitement about GIS
Those who collect data must think about the
confidentiality issues
Data users must also think about how their
analysis may increase the risk of deductive
disclosure.
52. Key points
Confidentiality issues arise when spatial context
is included in data.
It’s important to protect confidentiality. People
have an expectation that their identities are
protected.
There are strategies that can preserve
confidentiality, but there is no “one-size-fits-all”
solution.
Editor's Notes
For this presentation we will talk about the role of data in the effective use of GIS. We will also cover the proper data structures and schemas for use in GIS as well as review the notion of metadata. Lastly we’ll review some important issues concerning linking data as well as discuss issues of confidentiality.
To review, you will remember that GIS combines software, hardware, procedures, people and data. Each element is important, but use of GIS is easier when the data is well formed and ready to go into GIS.
There is a rule of thumb with GIS work known as the 90% rule. It states that for any GIS activity, 90% of the cost will be devoted to data preparation, and 10% to actually producing maps.
This means that before any map can be produced, many tasks will need to be completed. For instance, it is necessary to collect, clean, validate, and format the data to make sure it is accurate. Then the data may need to be linked with other data before it can be used, which may mean additional work to make this possible. For mapping itself, there is indeed work to be done, but comparatively speaking, much less.
As you can see, data is important in GIS. In fact, GIS analysis is only as strong as the data used.
Data, whether in a GIS or not, should of course be accurate. This means that it reflects reality as much as possible. In GIS there are two types of accuracy to be concerned with: spatial accuracy which refers to whether items are located correctly and attribute accuracy, which refers to the attributes. Here this means that the attributes are correct and are properly linked to geography.
Here is a representation of spatial accuracy. Let’s say you found online a file with latitude and longitude coordinates of hotels in India. You decide you want to create a shapefile with these coordinates. When you then overlay them on images in Google Earth, you see that the points aren’t accurate. Here’s the scene in Google Earth <CLICK TO DISPLAY FIRST ANIMATED ELEMENT> And here is the location of the Hotel Suryaa <CLICK TO DISPLAY NEXT ANIMATED ELEMENT>. This location is inaccurate because the real location of the hotel Suryaa is here. <CLICK TO DISPLAY NEXT ANIMATED ELEMENT>. The point is off by 50 meters or more.
Spatial accuracy can be affected by scale. For instance, here is the same point when viewed at a different scale. <CLICK TO DISPLAY ANIMATED ELEMENT> At this scale the point location is still inaccurate, in that it isn’t the exact latitude and longitude for the hotel; however, because our scale exceeds the error of the point, the effect is less obvious. In fact, if the location was derived using a map at one scale, the accuracy can be assessed by using a map at a larger scale (a map that has “zoomed in”).
To illustrate, here is a screen shot from Google Maps. Even though it isn’t a GIS, it does rely on a spatial database in that it has locational information and attributes about the locations. If you zoom into the location of the hotel, you see that <CLICK TO DISPLAY FIRST ANIMATED ELEMENT> instead of saying the building is the Hotel Suryaa, it has the building listed as “Hotel Crowne Plaza”
The example from Google Maps illustrates another consideration for strong data, timeliness. Their database is old and doesn’t reflect that this hotel is now the Suryaa and no longer the Crowne Plaza. The world changes, that means that spatial databases, or any data set can quickly become out of date, so it is important to be aware of the timeliness of the data. The data doesn’t necessarily have to be the most recent, sometimes there may be value in having older files, for instance if you want to track changes over time. However, you as the data user needs to be aware of the time frame of the data you use and include information about the time frame of the data you create.
Software, whether it’s a GIS program or not, must know how to read and interpret data files. This means that the file needs to store the data in a standard way that the software expects. The way the data is stored is known as its structure or, more commonly, its schema.
A standard schema has evolved over the years for data, and it is considered best practice to use this schema generally, whether the data will be used in a GIS or not. This standard schema is as follows: one record per geographic unit, variables stored in columns, and no blank cells unless data is missing.
Here’s an example of poor data schema. The variables are listed as rows and the columns are the geographic units. It is still a valid way to display data for a table in a publication or presentation, but you would not want to store data using this schema if you wanted to use it in a GIS.
Here is another example of poor data schema. There are several things wrong with this table. First, there are blank cells that don’t represent missing data; the blank cells are supposed to indicate that the value of the last cell is to be repeated. <CLICK TO ADVANCE ANIMATION> Second, there are duplicate district names. In this made-up country, there are districts with the same name in different provinces. <CLICK TO ADVANCE ANIMATION> We’ll come back to this problem in a little bit.
Here is a proper data schema for a GIS program. As you can see there is one record per geographic unit. In this case Region. Regions don’t duplicate. Columns contain variables. Each cell contains well formed data.
As I mentioned, proper documentation is a key component of strong data. Including metadata is the best way to document data. Simply put Metadata is data about data. It provides the data user with information about the data such as: <READ SLIDE>
Metadata is especially important with spatial data because of issues of: Spatial accuracy: it’s important for data users to know how the data was collected and whether there are scale issues to consider. Coordinate systems and datums: sometimes it is clear what coordinate system was used, but other times it isn’t; without metadata the user may not know what coordinate system/datum the data is in, which may make the data difficult to use. Confidentiality: spatial data can raise issues concerning confidentiality and privacy; the metadata can make sure data users are aware of these issues and of any restrictions on sharing the data or even presenting maps. Timeliness: this one should be obvious: when the data was collected.
Because metadata is so important, the International Organization for Standardization (ISO) has produced an international standard for geographic metadata. The ISO 19115 standard mandates that certain elements be included in the metadata. It also defines the schema, or structure, for metadata. For more information about the ISO 19115 standard, you can visit the ISO web site. It’s important to note that many countries have developed their own national standards for spatial metadata; these national standards should be compatible with the ISO standard. It is important to research any national metadata standards you may want to conform to.
Here is an excerpt from metadata for a file obtained from the UN’s Second Administrative Level Boundary (SALB) site. The actual metadata file is much longer and contains many more elements, but this will give you an example of the type of information that is contained in a metadata file.
Most data programs differentiate between data types and will assign variables to be one type or another. The way the field is assigned can affect the way the program handles data.
For fields that are defined as text, arithmetic operations such as addition and subtraction are not allowed. For fields that are defined as numeric, arithmetic is permitted. One issue, however, is that many programs may require the user to declare the number of decimal places before entering data. This is an important consideration when storing coordinates in a field, since if an inadequate number of decimal places is declared, the full coordinate cannot be stored, which can have an impact on accuracy.
One of the key tasks that a GIS needs to be able to do is linking tables. GIS uses a key field to make the link between tables. A key field is the field that contains information common between the tables. It is important to remember that it is not possible to link tables using key fields that are two different types. In the next slides I’ll illustrate this.
Here are two tables. <ASK GROUP> What is the field that will be the key field? <ANSWER: DISTRICT> <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> It is possible to link these two tables using the common field, District. Just a note to point out that it is no coincidence that a geographic unit is the key field. As we’ve mentioned, geography is the common link between human activity. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> As you can see there is now a link between the two tables.
One important thing to point out is linking using text fields can be problematic because of variations in spelling.
Here are two tables, notice that they each have a different spelling for the district North Kinley. <ADVANCE SLIDE TO DISPLAY ANIMATION> <ASK GROUP> What do you think will happen? Will it be possible to join these tables? <ADVANCE SLIDE TO DISPLAY ANIMATION> The answer depends on the software and the settings you select, for many GIS programs, a link will be made for those records that do match. It’s easy to see that the linked table doesn’t have the complete number of records in this example, but if you had many records, it might be possible to miss this fact. So a good practice is to check the record count after the join to make sure it is correct.
As you can imagine, there are many different ways text fields can be problematic. Linking using numeric fields is often more reliable since they are less vulnerable to variations. For this reason, countries often use numeric codes to identify administrative units. Often the national bureau of statistics or census bureau manage such codes. If there are standardized national codes, it is a good idea to include them in databases.
So here are two tables with a field for district code which were assigned by the national bureau of statistics. If District code is used as the key field <ADVANCE SLIDE TO DISPLAY ANIMATION> then spelling variation in the district field doesn’t matter and the table can be joined successfully. <ADVANCE SLIDE TO DISPLAY ANIMATION>
Another advantage of numeric codes associated with geography is they can manage geographic hierarchy effectively. So let’s say this is North District. North District is divided into three provinces. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> Coast province, mountain province and savanna province. North district has a code of 100. Most countries set up their national codes so that hierarchy is included. <ADVANCE SLIDE TO DISPLAY ANIMATED ELEMENT> As you can see from the table, all of the provinces are numbered in the 100’s since they are in North District.
To review, here are the key points from the discussion on linking data <READ SLIDE>
Now to switch topics slightly. Confidentiality is an important consideration when working with spatial data. During this part of the lecture, we’ll discuss issues of confidentiality and spatial tools as well as present strategies for protecting confidentiality.
So let’s start by talking about confidentiality and what I’m referring to. Put simply, confidentiality is the idea that it is important to protect the identity of individuals. This is a requirement of many informed consent agreements that people sign when we collect data. It’s also a pillar of ethical research.
There are two threats to confidentiality; one is overt disclosure. Overt disclosure is the act of explicitly making data available that breaches confidentiality commitments, such as releasing data files that contain an individual’s name and/or data.
The second way that confidentiality can be breached is through deductive disclosure. That is the process of piecing together multiple pieces of the puzzle until a picture emerges. So for instance, let’s say there was a survey conducted. If you knew that a person was a 45 year old female, that narrows down the list somewhat. [ADVANCE SLIDE] If you knew that she has 5 children, that narrows it down even more. [ADVANCE SLIDE] If you know that she works for General Electric in Delhi, that makes it easier to potentially identify the person. [ADVANCE SLIDE] If you then add a geographic coordinate of where she lives, [ADVANCE SLIDE] it’s almost the same as listing a name.
When you add a spatial component to data it can be an overt disclosure of identifying information. At the very least it makes deductive disclosure easier. So what’s the answer? Should the spatial element be dropped?
There is an emerging recognition that there is a need to explicitly define issues of geoprivacy. Geoprivacy is a term coined to refer to “an individual’s right to prevent disclosure of the location of one’s home, workplace, daily activities or trips”
As people have thought about this issue of geoprivacy, there are 4 principles that have been laid out to guide people: [READ SLIDE] I’ll talk about each of them
The first principle is the basic protection of confidentiality. This protection is fundamental to ethical research. Information that might lead to physical, emotional, financial or other harm. It’s important to protect information that discloses identity
The second key principle that informs the discussion on confidentiality is the importance of preserving the social-spatial linkage. As we’ve mentioned, all human activity takes place on earth. Understanding that adds context and perspective. It’s also key to the advancement of science and essential for understanding the diffusion of behaviors.
The third principle is the notion of data sharing. Data sharing means sharing data with other researchers or other important stakeholders. It’s essential on both scientific and financial grounds. It allows the data to have maximal use by letting other researchers use the data. Lastly, there’s a growing trend among funders of data collection efforts that the data be shared either publicly or within the research community.
The last principle is the notion that data should be preserved and be available for future use. This raises the questions: how long should the data be deemed “sensitive”? When, if ever, can it be released? These are things that should be considered at the beginning of any data collection effort or the establishment of a data system. They should be spelled out in advance to respondents or individuals who are providing information/data.
What are the strategies that can be employed to protect data?
The first strategy we’ll talk about is simply randomly shifting the locations. The advantage of this is that it is relatively easy to do; there are plugins for QGIS that will do this. The disadvantage is that you lose the original location and it introduces error.
The second strategy is what’s known as an affine transformation. This is a systematic change to the data: changing the scale, rotating, or shifting a set distance. This is easy to do, but it’s also easy to undo if people know the parameters of the transformation. In some cases, even if the exact parameters aren’t known, it’s still possible to deduce the type of transformation done if a set of points doesn’t match the geography on the ground (say, points end up in the ocean or a lake because of the transformation).
Another strategy is to just aggregate the data. So say, for instance, you have individual patient data: you can aggregate it to mask individual records. This too is easy to do, but it does require a sufficient number of data points, and finer data variations will be lost.
Another strategy is to despatialize the data. Simply remove the coordinate system that ties the data to the earth. It uses euclidean space instead of geographic space. This is simple, it keeps relative position and placement. On the downside though, you lose contextual data, so it won’t be possible to bring other data that might be helpful to look at (such as road networks or the surrounding landscape).
Lastly, there’s always “do nothing”. You could make the decision to not collect or release the data. Another option would be to set up a cold room or allow on-site analysis only. This maintains all of the original spatial data. The disadvantage is that schemes like cold-room or on-site analysis reduce accessibility to the data, which limits data sharing and the social-spatial link, and they can be complicated to implement.
There is no magic answer. It’s a matter of finding the technique that best suits your needs and the commitments made to respondents. It’s possible to think about the issue in terms of spatial integrity and disclosure risks. Making the decision on what approach to take is dependent on where on this spectrum you want to land. You can preserve spatial integrity or you can minimize risk of confidentiality breaches, but you can’t have both.
The article by Van Wey that I mentioned earlier has a quote that ignoring the issue is unacceptable. Something that often gets lost in the excitement over GIS is the issues around confidentiality. However, those who collect data must think about the confidentiality issues and make sure their informed consent agreements adequately describe the way data will and won’t be used. Data users also have a responsibility to ensure that any extra contextual analysis they do doesn’t increase the risk of deductive disclosure.