Good afternoon, thanks for joining us especially with so many good session choices right now AND of course this close to happy hour.
Becky Yoose is our Library Applications and Systems Manager and has recently started work at SPL and is now overseeing development of our data warehouse initiative.
I am SPL’s Director of Marketing and Online Services and I’ve been with the system almost 3 years.
Our data warehouse program started a couple of years ago.
A few people have worked on this presentation and the data warehouse who have since left (and come back).
I stepped in to assist Becky with this presentation. My last day is now tomorrow.
Becky and I now affectionately refer to the curse of this presentation, which I believe as I was stuck in the elevator earlier today briefly!
When I started 3 years ago and I was building the marketing department and program, I discovered that the Library was regularly purging a treasure trove of useful data that could be used for insights into patron behavior.
I chatted with our Director of IT to see what we could do about this and the data warehouse initiative was born.
So, this session is about what SPL has done in the past couple of years to de-identify data to gain better insights into our patrons’ behavior without sacrificing their privacy.
I want to note that the same principals and privacy goals are in place regardless of whether you are using one of the many CRM’s now available OR if you are compiling trended data via Excel or scribbling it on a piece of paper and using an abacus
[Stephen] At the national level, libraries are guided by the Library Bill of Rights and the interpretations thereof by ALA.
On the matter of personally identifiable data, ALA advises libraries to collect only what is necessary to fulfill the mission of the library.
We of course, argue that deriving patrons’ needs and wants from the data trails they make in our systems IS a necessary part of fulfilling our mission
ALA further advises that we should not share personally identifiable information (or PII) with 3rd parties unless we are authorized to do so and/or we have agreement in place with 3rd parties to protect that data.
Finally, ALA recommends that we comply with law enforcement requests for data, but only if they are valid.
[Stephen] On the state level in Washington, there’s nothing that explicitly protects the privacy of library patrons or mandates the confidentiality of borrower records.
What we do have that’s useful in this context is an exemption from state records retention requirements for library records and a definition of “personally identifiable information” which we will talk more about in a minute.
[Becky] At SPL, our policy is probably similar to yours. We treat borrower records as confidential, meaning we won’t disclose them except under certain circumstances.
[Becky] In practice, we take advantage of our exemption from public records retention and delete records as soon as they cease to be useful to us. To be more specific, we sever the connection between who you are and what you borrowed as long as you return it to us. We also sever the connection between who you are and what you did on a public computer. We do not capture or store information about what movie you’re streaming right now.
Again, this is similar to how most libraries handle patron data. After these transactions are completed, we still know who you are and where you live – we keep a record of your account, for example. And we still know that “The Painter” by Peter Heller was checked out 127 times last week; nonetheless, we don’t know that you were the one who checked it out.
[Becky] Let’s come back to Personally Identifiable Information. Earlier we saw how Washington state defined PII: name, SSN, DOB, account number, and other details about your that can help someone uniquely identify that you are you.
That approach only looks at one facet of PII, however. The National Institute of Standards and Technology by way of the Government Accounting Office define PII using essentially 2 facets – stuff about you, and stuff that you do that is linked to you. NIST specifically lists medical, educational, financial, and employment information. This kind of data, in sufficient enough quantities, can be used in certain circumstances to reverse engineer an identity. I might not know your name, but if I know all the classes you’re taking at the university and I get access to course rosters, I can, by process of elimination, narrow down or even determine precisely who you are.
[Becky] So we have two important facets of PII to consider whenever we approach a problem such as the one we’re here to talk about today.
PII-1 can be seen as data referring to a person, and PII-2 is the data about, in the case of libraries, someone’s intellectual pursuits.
In our context, PII-2 includes the books that patrons read, search queries they execute, and the web sites they visit. This data is extremely valuable and is collected all the time by companies - it helps determines what Amazon recommends what should you read or, more importantly, what to buy and what Facebook suggests that you should “Like” so that they can market more products at you. Libraries should be sanctuaries from this kind of detailed data gathering that has become commonplace in the information age… yet we also need to gain insights and intelligence into the rapidly changing and evolving needs and wants of our patrons so that we can better serve them and continue to be a vital resource to our communities.
Again, from the perspective of Marketing and Public Services, we want to use this data to learn:
How to effectively allocate resources (both financial and human) in the Marketing of our programs and services to ALL audiences (including non-users, disengaged and underserved audiences) Tracking effectiveness of marketing efforts Development of our programs and services to what our patrons desire and expect Tracking success and satisfaction of our patron’s engagement with our programs and services
ALL while balancing library strategic needs with patron privacy
[Becky] At the beginning of the project -
We have a lot of data about what’s in our collection and the “behavior” of our collection independent of who is using it. For example, we know that a certain title has circed 231 times, or that items in the 910 call number range are popular. We have some data about our users (patron account records) but nothing about their completed activities. We make decisions about collections, services, facilities, etc. based on general demographics of a neighborhood, but not based on actual library use by residents. We do not know if services are used a lot by a few individuals or used somewhat by a lot of individuals.
In short, we have data, but we could not use the data in a way that would both respect user privacy while at the same time use that data to inform where library resources should go in terms of programs and services.
This leads into unanswerable questions that are essential for us to answer if we want to allocate resources effectively. This includes identifying and serving underrepresented patrons and our non-users in general.
And it means keeping data for more than a year.
A couple of questions to illustrate the above point:
People in their 20s (aka Millennials) are generally not active users of the public library, We see a lot of teens and children, and then people come back when they become parents themselves. (I call this the church effect) For the patrons who *are* active in their 20s, had they been active users in their teens?
We also have a high circ of materials in certain languages in certain branches, for example, Chinese-language materials at our International District branch. What is the zip code of patrons who check out those materials—is IDC their home branch or are they traveling from elsewhere in the city to get those materials?
In Seattle, we are growing incredibly fast, there are shifting demographics from neighborhood to neighborhood that make understanding the neighborhood information even more important
[Becky] When thinking of ways to solve this problem, we had to consider the following:
[Becky] If you’re trying to design a solution, you have to take into account any perceived threats. The one we talk about most is law enforcement or government. Libraries may receive requests or subpoenas for information on specific individuals or locations. In the analysis of risk, we balance the likelihood of something happening versus the harm that would result. These requests from law enforcement may not be very likely, but we perceive that they would inflict considerable harm on privacy and intellectual freedom, and we take them most seriously.
We often hear about organizations who have lost control of their data to hackers; it seems like every week we have a news story about a major data breach. We don’t hold credit card numbers, social security numbers, or other data likely to have substantial monetary value to hackers. A lot of the data we hold, for example, addresses or phone numbers, can be found in a phone book. On the other hand, some patrons may have protection orders, unlisted phone numbers, sensitive jobs, or other reasons that this information must remain confidential. We should always practice good security for our systems, but this is not a special threat for libraries.
Another risky scenario is an accidental leak of data or even an intentional release of data that is *believed* to be scrubbed of personally identifiable information, but which still can be reconstructed.
[Becky] For example - In 2006, AOL released millions of search queries to researchers. AOL believed that they had removed PII from the data they released, but the search logs contained so many specific queries (for addresses, institutions, etc.) that researchers were able to reconstruct specific identities from various search terms that they identify as belonging to distinct persons.
This data may not have been used for the research AOL originally intended but it proved an important point about privacy.
[Becky] The New York Times profiled one of the cases. Some of the terms that belonged to user 4417749 were fairly generic and may only give you hints about the user’s age, marital status, and that he or she was probably the owner of a dog – an incontinent dog. There was also a flurry of search terms that included the same last name. Researchers knew from previous research that people tend to search a lot for themselves and relatives and so they inferred that since so many search terms were for the last name “Arnold,” user 4417749 might be an Arnold. Finally, they found search terms that provided more clues about 4417749’s whereabouts, which happened to be a very small town in Georgia. By process of elimination, then, they were able to pinpoint … Thelma Arnold, a 60-something widowed dog-owning Georgian who owned up to all the search terms that AOL has labeled as belonging to 4417749. The article does not explain if Thelma ever found a treatment for her dog’s frequent urination problem.
[Stephen] For a marketer, this story of Target is a great one, as it relates to targeted (no pun intended) marketing: From Business Insider: Target broke through to a new level of customer tracking with the help of statistical genius Andrew Pole, according to a New York Times Magazine cover story by Charles Duhigg. Pole identified 25 products that when purchased together indicate a women is likely pregnant. The value of this information was that Target could send coupons to the pregnant woman at an expensive period of her life. Plugged into Target's customer tracking technology, Pole's formula was a beast. Once it even exposed a teen girl's pregnancy: [A] man walked into a Target outside Minneapolis and demanded to see the manager. He was clutching coupons that had been sent to his daughter, and he was angry, according to an employee who participated in the conversation. “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.” It doesn’t get much more targeted, and some will say, intrusive, than this. But this is a great example of how specific data analysis can get
According to PEW:
One of the foundational principles of librarians is supporting the privacy of patrons. Librarians have long resisted keeping or sharing records of the book-borrowing or computer-using activities of their patrons. However, in the age of book-recommendation practices on all kinds of websites, many patrons are comfortable with the idea of getting recommendations from librarians based on their previous book-reading habits.
In a 2012 survey, 64% of respondents said they would be interested in personalized online accounts that provide customized recommendations for books based on their past library activity.
Some 29% said they would be “very likely” to use a service if it were made available by their library.
Indeed, our own statistics show that service is chosen by 34% of the new borrowers over the last 30 days
[Becky] If AOL had recognized, as we’ve discussed, that PII is actually a two-part concept, they would not have been surprised that even if you strip out PII-1, PII-2 may still be used to reconstruct an identity.
[Becky] How can we get to the sweet spot where we know enough about what patrons are doing with our stuff but not so much that we lose their trust or risk violating the agreements we have with them? We now turn to the concept of data de-identification. This is a very tricky thing to get right. A NIST white paper outlines definitions, approaches, and considerations for a data de-identification approach. The concept of de-identifying data also exists in the privacy juggernaut that is HIPAA – in healthcare, analyzing patient data is critical for developing new treatments and new approaches; nevertheless personal identity is sacrosanct and must be protected at all times. In the article “Using Lessons From Health Care to Protect the Privacy of Library Users: Guidelines for the De-Identification of Library Data Based on HIPAA,” the authors enumerate the specific data elements that are considered to constitute PII according to HIPAA, which can be used a general guideline for looking at library borrower data. While the original project leads were not aware of the article at the time, both the project and the paper share similar approaches to PII in libraries.
Here is an example of one of those paper and Excel and abacus methods…and referring to those largely in their 20s This was a grant funded project by the Allen Foundation in 2013 and 2014 Goal was to re-engage the Millennial population But if we didn’t in some way collect patron data, in this case identifiable by age and borrowing statistics, we could not have known whether we were meeting the goals of the project and fulfilling the data reporting requirement of the grant
In this case the oal was to increase usage of Millennials by 15% over the summer of 2014 through targeted marketing and programs One way to provide better targeting is through segmentation We engaged an agency to perform the research 5 subsegments were created from the Millennial demographic Programs were then created based on the market research Metrics were tracked on activity for all those cardholders that fell into the Millennial segment defined by those ages 18-30 at time of grant We developed our successful pop-up library program as well a successful cooking program based on this market research and we actually met our goals for the grant
This is an example of one of the personas that was created as a result of this market segmentation De-identified data gets collected and was then mapped to personas that were created during the project Aruna Kumar is a persona representing this cluster of users, in this case, digital only
Horizon database had exports of usage of a few key items The database had flags based on age New Borrowers Boopsie Logins Checkouts Freegal Downloads Overdrive Downloads We were able to track usage with this method We’d get weekly snapshots and then we would trend the data in Excel This though was the lead up to the better method of de-identification that Becky is about to walk you through
[Becky] The Millennial Project, in hindsight, was the first iteration of the Data Warehouse that we are now currently building. We knew that this data was valuable and we proved that by using this data we could achieve real world impact. Nonetheless, we needed to evolve the Warehouse to balance patron privacy and data analysis. This next section will discuss how we are dealing with both PII-1 and PII-2 in our data de-identification scheme.
[Becky] Let’s start with PII-1 information. Using the age vs. date of birth is a simple example of how you can keep data that is just as useful for us that significantly reduces the risk of identifying a patron.
Say we have the information that a non-fiction DVD was checked out by a patron. We can record that the patron was 41 years old on April 1, 2016, or we can record that their date of birth was March 15, 1975. Which is better? The age is. A person who is 41 years old on 4/1/16 could have been born on any date from 4/2/1974 to 4/1/1975.
There are a lot more people whose birthdays fall in that yearlong span than there are people whose birthdays are specifically on 4/1/75, so using the age is a lot “blurrier” and less identifiable. But if we’re interested in knowing what kinds of materials appeal to different ages, the age is just as good as the DOB.
[Becky] We’re also interested in hitting the right granularity of information about our materials. Using the full call number is obviously far too specific. There are many items in our collection that would have a unique call number, which would be just as bad as us saving the title of something you checked out.
The format goes too far in the opposite direction—it’s so vague that it gives us hardly any useful information about what materials are popular. For example, recent popular movies, foreign-language TV shows, documentaries, and instructional DVDs could all appear to have the same format. If all these are lumped together, we’d probably just see that DVDs are popular among all groups, which is not very useful.
The collection code seems promising, but as we have them set up, it is also too specific. Some collections, like adult fiction or even adult sci-fi/fantasy are quite large. But other collections are as specific as “beginning ESL.” This is too much “intellectually identifiable info” as we think of it, about what the patron is interested in.
Truncating the call numbers appears to be the best match for what we want. This gives us separate buckets for adult fiction or YA fiction, and Dewey decades or centuries, but wouldn’t boil down to individual items.
[Becky] Our everyday systems keep track of the exact time and date of an action. For example, a patron used a computer from 11:02am to 11:46am on April 11. We have no use for this level of detail in our analysis of patron behavior. We’d be interested to know that someone used a computer for 44 minutes on a Monday, though. By keeping the session length and the date, we can achieve that while not maintaining a record of someone’s whereabouts at a specific time and place.
You might think we’d be interested in knowing how many sessions start between 10 and 11am or 11am and 12pm, and so forth. That’s true, but we don’t need to know about the patrons who are doing those sessions at the same time, so that kind of data doesn’t belong in *this* database, our database for longitudinal analysis.
[Becky] So, what is the database we are building? In creating a data warehouse, you do something called ETL, which is extract, transform, and load. You take the data from the operational database where it was being collected, you do whatever you need to change it into the kind of data you want to store for analysis long term, and then you load it into where you’re going to store it.
Here you can see a sample of the data that is stored in the ILS, data about a patron and an item that has circulated.
[x] This is stored in separate tables.
[x] Here you can see the data that is not personally or intellectually identifiable that we want to move. In the case of the call number, we have to transform it as we move it.
[x] Here is it, transformed and pulled together into records about each transaction.
[Becky] The most secure thing for us to do would be to record all of the transactions with the age and home branch of the patron, and then you would have no connection to individual patrons. The problem with this is that it prevents us from answering some of our most important questions, like “do people who check out ebooks still use print?” We need to be able to match up those transactions—but we just need to know that person A is person A and person B is person B, not that John Smith is person A and Jane Doe is person B. That is essential for longitudinal analysis.
The best way we’ve found to do this is to record the borrower ID of the person who did the transaction, but to obfuscate it using some hashing algorithms, with a pinch of salt… while we are not making corned beef hash, it is a good representation of what we are doing. A hash is a way to ensure that borrower number 12345 is always turned into the same disguised ID, but there is no way to calculate it in the reverse. Currently we are working with SHA-1 due to system restraints but are currently evaluating SHA-2’s compatibility within the warehouse. This isn’t bulletproof, because if you had access to our ILS database, you could repeat this procedure, to learn the disguised version of that patron’s ID.
However, this attacker would have to gain access to multiple, completely separate databases to pull that off.
[Becky] What this adds up to is a “belt and suspenders” approach, or an approach that has redundant failsafes baked into it.
If you wanted to use our system to gain access about a specific patron, you’d first need to breach our ILS to find that patron’s identity and ID. Then you’d need to somehow recreate the hash algorithm we’d used to find out how that patron is represented in our data warehouse, and then breach the data warehouse, and then look them up there. Not to get computer science-y on you, but assuming you could even manage the part with the hashing, you’d still need to break into two databases, and this is our lower risk scenario!
Okay, what about our government opponent who can force us to give them access and identify the patrons to them? Even then, remember that we have no details about titles, specific transactions, websites anyone visited, exact times or places that things happened. We only have a record of the fact that certain kinds of transactions occurred.
And what about leaks? We do not plan to wholesale share this information with anyone without first going through a rigorous vetting process, and staff already have strict, clear policies about how, when, and why they can access patron data—these policies are already in place for our ILS and would apply equally to this data.
[Becky] We are still in the early stages of the Data Warehouse; we have a few datasets in there, including door counts, computer session usage, branch information, and some preliminary ILS circulation and patron demographic data. Nonetheless the data we do have in the Warehouse has already proved valuable.
[Becky] Our libraries have both “Public Internet” workstations, which can be reserved for up to 90 minutes per day, and “Express” workstations, which can be used for 15 minutes without prior reservation. Patrons are permitted up to 90 minutes of computer use per day regardless of the type of workstation they use.
Library staff began to report that patrons were “frequently” logging on to the 15-minute Express workstations multiple times in succession, to the detriment (and annoyance) of patrons who were waiting patiently for a computer to open up. It seemed that some patrons had figured out that they could chain together up to six 15-minute sessions at the Express computers. This was obviously not the intended purpose of the Express stations. However, all we had to rely upon was anecdotal information from a few locations. With our traditional data collection practices, we had no way of telling if any given array of six sessions on an Express workstation were done by the same patron or by six different patrons.
Before De-identification, we only had the total number of sessions and minutes that Express workstations were used per day per branch The de-identified data, however, gave us our answer. Since we now had public computer session data recorded along with a patron’s “DeID,” we could easily tell that, indeed, many, many patrons at all library locations were logging into Express workstations 6 times per day for 15 minutes per session. We couldn’t tell specifically who the patrons were, but we could tell when it was the same patron logging on 6 times versus when it was 6 distinct patrons each logging on for 15 minutes.
The new way of collecting data allowed us to determine that this problem was, indeed, rampant. Over 25% of patrons used Express workstations more than 15 minutes per day. We made the decision to change the way our system handled public computer sessions. Now, patrons get both 90 minutes per day on reservable public workstations but only 15 minutes per day on Express workstations. This essentially means that patrons qualify for a total of 105 minutes per day on our computers, but it was decided that this was a fairer and more equitable way to allocate time given the abuses we were seeing of the Express workstations.
[Becky] In summary, we are still in the early days of the Data Warehouse; however we are certain that we are going in the right direction in terms of its architecture and our methodology to de-identify patron information. Storing different types of information in different locations while avoiding keeping any PII-1 or 2 data helps mitigate the risks of storing such data in house. At the same time, we have enough level of detail that the data we have can be used to impact real world operations. Finally, we have the control over what we provide third party customer relationship management systems or other external systems.
[Becky] We hope to have shown here that there is not necessarily a clear-cut conflict between patron privacy and being able to make data-driven decisions. If we are clever about it, we can maintain the promises we’ve made to patrons about their privacy while keeping enough “blurry” data to get some value out of it. It doesn’t need to be a data-driven approach VERSUS patron privacy, it can be data-driven PLUS patron privacy.
[Author Comment - we moved this slide out of the deck in case we will not have room to talk about this historical issue; nonetheless, for those of you playing along at home, this is a reminder that we also have physical security vulnerabilities to address as well.]
In discussing this issue, we’ve also discovered that our historic practices are far from perfect.
In discussions with several security researchers in academic computer science, they quickly spotted serious flaws in our systems. One is example is our holds system. A patron places a hold on an item. Usually we have a record of their intellectual pursuit for the three-week period of their checkout but sometimes you wait for something on hold much longer than that. We’re holding a record of their intellectual inquiry that is vulnerable to law enforcement or government demands, sometimes for months at a time. Then we send them a plaintext email about that request and put the item on the shelf with an tag that only shows four letters of their last name and three of your first name. Of course, if you’re my colleague KAY AOKI, that’s your whole name. The item is on an open holds shelf that anyone can wander by.
One security researcher had an off-the-cuff idea to improve this process. He said we should give each request a QR code that tracks it and is required to pick it up. Then the request is associated with that code and NOT with a patron record. This would be very secure! The idea has a few flaws though. First, there would be no way to contact the patron to tell them that their item is ready because we don’t know who they are. They’d just have to check for it periodically. Second, the patron would have to NOT LOSE their QR code in order to claim the item. Obviously, both of these assumptions are unrealistic from a customer service point of view. Third, it’s a QR code.
Maybe this idea could be refined and become more practical, but we liked this story as an example of how real-world customer service and operations are sometimes incompatible with ideal solutions.
De-identifying Patron Data for Analytics and Intelligence
Patron Data to
The Seattle Public Library
• Overview of Data Management Principles, Policies,
• Balancing Library Strategic Needs with Patron
• SPL Methods of Data De-identification
ALA Data Management
• Collection of personally identifiable information
• only when necessary to fulfill the mission of the library
• Should not share personally identifiable user
information with third parties, unless
• the library has obtained user permission
• has entered into a legal agreement with the vendor
• Make records available [to law enforcement
agencies and officers] only in response to properly
“An interpretation of the Library Bill of Rights.”
State Law: Revised Code of WA
• RCW 42.56.310: Library Records
• Any library record, the primary purpose of which is to
maintain control of library materials, or to gain access to
information, that discloses or could be used to disclose
the identity of a library user is exempt from disclosure
under this chapter.
• RCW 19.255.010: Disclosure, notice — Definitions
— Rights, remedies.
• First & last name combined with SSN, DL #, credit/debit
card number, authentication credentials, “account
SPL Confidentiality of Patron Data
• It is the policy of The Seattle Public Library to
protect the confidentiality of borrower records as
part of its commitment to intellectual freedom.
• The Library will keep patron records confidential
and will not disclose this information except
• as necessary for the proper operation of the Library
• upon consent of the user
• pursuant to subpoena or court order
• as otherwise required by law.
The Seattle Public Library. “Confidentiality of Borrower Records.”
SPL Data Management Practices
• All records connecting a patron to an item that has
been held or borrowed, or to an information
resource that has been accessed, are deleted upon
the successful fulfillment of the transaction.
• Circulation records
• Public computer reservations
• Workstation use data (log files, caches, histories)
• Network logs
NIST: Two-part Definition of PII
1. Any information that can be used to distinguish
or trace an individual‘s identity, such as name,
social security number, date and place of birth,
mother‘s maiden name, or biometric records
2. Any other information that is linked or linkable to
an individual, such as medical, educational,
financial, and employment information
a) Libraries extend the second point by including
borrowing and information seeking activity
National Institute of Standards and Technology via the Government Accounting Office
expression of an amalgam of the definitions of PII from Office of Management and Budget
Memorandums 07-16 and 06-19. May 2008, http://www.gao.gov/new.items/d08536.pdf
Two-part Definition of PII
PII-1: Individual PII-2: Intellectual Pursuits
Strategic Planning Needs
with Patron Privacy
• Longitudinal questions (e.g. trended analysis)
• Long-term (multi-year) rather than snapshot
• Trends, correlations, changes in behavior – not
necessarily individual activity
• Questions only about the type of behavioral use of
Library programs and services:
•“Do teen patrons remain active in their 20s?”
•“Do people use their neighborhood branch or
use the branch where relevant materials are?”
(e.g. Chinese language collection)
• Passionate commitment to intellectual freedom
• Recognition that some patrons have no alternatives
• Intellectual content of transactions should always
• Avoid keeping records that show person’s
Threats to patron privacy
• Law enforcement
• Seeking intellectual pursuit data
• Seeking patron whereabouts
• Library and ILS data
• Data leak
• Reconstruction of identity via data
• Embarrassment/loss of trust
• Notification costs
AOL data release (2006)
There was no personally
identifiable data provided by
AOL with those records, but
search queries themselves
can sometimes include such
This was a
TechCrunch. “AOL: ‘This was a screw up’.” August 2006. http://techcrunch.com/2006/08/07/aol-this-
AOL Example: User 4417749
60 single men
dog that urinates on everything
john arnold georgia
homes sold in shadow lake
subdivision gwinnet county georgia
landscapers in Lilburn, GaThelma Arnold
The Incredible Story Of How Target Exposed A Teen Girl's Pregnancy
Data for Good
In a 2012 Pew Research survey,
64% of respondents said they would be
interested in personalized online
accounts. 29% would be very likely to
use the services.
Last 30 days for SPL BiblioCommons opt-
in for borrowing history is 34%.
“7 Surprises About Libraries in Our Surveys”, Pew Research, 2012
Two-part Definition of PII
PII-1: People PII-2: Intellectual Pursuits
• Designed to protect individual privacy while preserving
some of the dataset’s utility for other purposes.
• Make it hard or impossible to learn if an individual’s
data is in a data set, or determine any attributes about
an individual known to be in the data set.
• HIPAA: Data that does not identify an individual and
with respect to which there is no reasonable basis to
believe that the information can be used to identify an
Garfinkel, Simson L. “De-Identification of Personally Identifiable Information.” April 2015.
Scott Nicholson and Catherine Arnott Smith, "Using Lessons From Health Care to Protect The
Privacy of Library Users: Guidelines For The De‐identification of Library Data Based on HIPAA,"
Proceedings of the American Society for Information Science and Technology 42, no. 1 (2005):
SPL Methods of Data De-identification
Timestamps vs. dates
Mon, 11 Apr 2016 11:02:43 +0000
Email Address Yes
Phone Number Yes
Date of Birth Yes
Zip Code No
Registration Year No
Call Number Yes – truncate it
Item Type No
Age Gender Zip Reg
Item Type Dewey
45 Male 98117 2004 CD 700 CEN 4/1/15
45 Male 98117 2004 Book FIC BAL 4/3/15
Belt & suspenders
• To identify a specific patron’s transaction, you’d
• Breach ILS
• Recreate hash algorithm
• Breach data warehouse
• Look up patron
• Even then
• No intellectual content or whereabouts
• Only the fact of types of transactions
• Strict, clear policies for staff
Express Computer Policy
Are patrons abusing 15-minute “Express”
• 90 minutes per day for “Internet” workstation
• 15 minutes per day for “Express” workstation
• Total of 90 minutes per day for any workstation
• Allowed for “serial” use of Express workstations (6x per
• Staff noticed (anecdotally) that “a lot” of patrons were
chaining Express sessions together
Data Warehouse Summary
• Store identifiable and non-identifiable information
in separate locations
• Avoid storing any intellectually significant or
• Build internal data stores that helps segment
anonymized patrons via behavior and
• Use this Library owned and managed data store as
the source of information for external CRMs