What Do We Do with All This Big Data?
Fostering Insight and Trust in the Digital Age

A Market Definition Report
January 21, 2015
By Susan Etlinger
Edited by Rebecca Lieb
Introduction

Every day, we hear new stories about data: how much there is, how fast it moves, how it's used for good or ill. Data ubiquity affects our businesses, our educational and legal systems, our society, and, increasingly, our dinner-table conversation.

I had the opportunity to speak at TED@IBM in San Francisco on September 23, 2014, about the implications of a data-rich world, and what we can do as businesspeople, citizens, and consumers to use it to our best advantage.1 That talk, as well as this document, examines two themes that underlie many conversations about data and technology, themes that correspond to fears that George Orwell and Aldous Huxley chronicled in their novels 1984 and Brave New World. As the culture critic Neil Postman put it in his 1985 book, Amusing Ourselves to Death:

"What Orwell feared were those who would ban books. What Huxley feared was that there would be no reason to ban a book, for there would be no one who wanted to read one. Orwell feared those who would deprive us of information. Huxley feared those who would give us so much that we would be reduced to passivity and egotism. Orwell feared that the truth would be concealed from us. Huxley feared the truth would be drowned in a sea of irrelevance. Orwell feared we would become a captive culture. Huxley feared we would become a trivial culture."2

These two themes, irrelevance and narcissism on one hand (Huxley) and surveillance and power on the other (Orwell), anticipate modern fears about the explosion of data in our personal and professional lives. As individuals, we crave insight and convenience, yet we simultaneously fear loss of control over our privacy and our digital identities.
Photo: Daniel K. Davis/TED. Susan Etlinger speaking at TED@IBM at SFJAZZ, San Francisco, California, September 23, 2014.
Table of Contents

What's So Hard About Big Data? ... 5
  With Big Data, Size Isn't Everything ... 6
  Unstructured Data Demands New Analytical Approaches ... 8
  Traditional Methodologies Must Adapt ... 10
From Data to Insight ... 13
  Big Data Requires Linguistic Expertise ... 14
  Big Data Requires Expertise in Data Science and Critical Thinking ... 14
Legal and Ethical Issues of Big Data ... 17
Planning for Data Ubiquity ... 21
Conclusion ... 23

Executive Summary

This document proposes an approach to better understand and address:
• How we extract insight from data
• How we use data in such a way as to earn and protect trust: the trust of customers, constituents, patients, and partners

To be clear, these twin challenges of insight and trust will occupy data scientists, engineers, analysts, ethicists, linguists, lawyers, social scientists, journalists, and, of course, the public for many years to come. To derive insight from data while protecting and sustaining trust with communities, organizations must think deeply about how they source and analyze data, and must clarify and communicate their roles as stewards of increasingly revealing information. This is only a first step, but it's a critical one if we are to derive sustainable advantage from data, big and small.
What's So Hard About Big Data?
WITH BIG DATA, SIZE ISN'T EVERYTHING

The idea of big data isn't new; it was defined in the late '90s by analysts at META Group (now part of Gartner). According to META/Gartner, big data has three main attributes, known as the Three Vs:
• Volume (the amount of data)
• Velocity (the speed at which the data moves)
• Variety (the many types of data)3

Now nearly two decades old, this construct has become increasingly pertinent. As IBM has famously said, "90% of all the data in the world was created in the past two years."4 To understand why, we need to compare the business conditions that existed when big data was originally defined with today's. In the early 2000s, technologists were grappling with a burgeoning variety of data types, spurred in large part by the rise of electronic commerce. Today, social media is a major catalyst of data proliferation. Consider that:
• 100 hours of video are uploaded to YouTube every minute.5
• On WordPress alone, users produce about 64.8 million new blog posts and 60.4 million new comments each month.6
• 500 million tweets are sent per day.7

Much of this data is unstructured. It is, as Gartner defines it, "content that does not conform to a specific, pre-defined data model. It tends to be the human-generated and people-oriented content that does not fit neatly into database tables."8 As a result, the primary challenge of what we think of as big data isn't actually the size; it's the variety. For this reason, the term "big data" can sometimes be misleading.
If this seems counterintuitive, consider this example: the New York Stock Exchange (NYSE) recorded approximately 9.3 billion shares traded on December 16, 2014, more than 18 times the average number of tweets (approximately 500 million) created per day.9 Even though the number of trades is much larger than the number of tweets (volume), and the speed of the market may change from hour to hour and day to day (velocity), the basic attributes of a trade (price, trade time, change from previous trade, previous close, price/earnings ratio, and so on) are the same every time. A trade is a trade. It is homogeneous and predictable from a data perspective (variety).

In contrast, social data is far more complex and variable. While a tweet contains some structured data (metadata about the time it was posted, the user who posted it, whether it includes hashtags or media such as photography, and other attributes), it can express anything that fits into 140 characters. It is a mix of structured metadata and unstructured text and images that can be expressed with variable lengths, languages, meanings, and formats. It can contain a news headline, a haiku, a sales message, or a random thought. For this reason, a much smaller number of tweets can be far more complex to analyze from a data standpoint. Size isn't everything.
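The trade-versus-tweet contrast above can be made concrete in code. In this hypothetical sketch, a trade record always fits one fixed schema and is trivial to validate, while a tweet mixes a little structured metadata with free-form text. Field names are invented for illustration; they are not an actual exchange or Twitter schema.

```python
# Structured: every trade has the same fields, so validation is mechanical.
TRADE_FIELDS = ("symbol", "price", "trade_time", "change", "prev_close")

def is_valid_trade(record):
    """A trade is predictable: same fields, numeric prices, every time."""
    return (set(record) == set(TRADE_FIELDS)
            and all(isinstance(record[f], (int, float))
                    for f in ("price", "change", "prev_close")))

trade = {"symbol": "XYZ", "price": 42.10, "trade_time": "16:00:00",
         "change": -0.40, "prev_close": 42.50}

# Unstructured: a tweet's metadata is structured, but its text can be
# anything that fits in 140 characters: a headline, a haiku, a complaint.
tweet = {"created_at": "2014-12-16T09:31:00Z",   # structured metadata
         "user": "example_user",
         "hashtags": ["NYSE"],
         "text": "markets moving fast today... or is it just me?"}
```

The point of the sketch: the trade validates against one schema forever, while nothing about `tweet["text"]` can be validated in advance; it must be interpreted.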
The nature of human language demands rigorous and repeatable processes to extract meaning from it in a transparent and defensible way.
UNSTRUCTURED DATA DEMANDS NEW ANALYTICAL APPROACHES

The human-generated and people-oriented nature of unstructured data is both an unprecedented asset and a disruptive force. Data's value lies in its ability to capture the desires, hopes, dreams, preferences, buying habits, likes, and dislikes of everyday people, whether individually or in aggregate. The disruptive nature of this data stems from two attributes:
• It's raw material. It requires processing to translate it into a format that machines, and therefore people, can understand and act upon at scale.
• It offers a window into human behavior and attitudes. When enriched with demographic and location information, data can introduce an unprecedented level of insight and, potentially, privacy concerns.

Unstructured data requires a number of processes and technologies to:
• Identify the appropriate sources
• Crawl and extract it
• Detect and interpret the language being used
• Filter it for spam
• Categorize it for relevance (e.g., "Gap store" versus "trade gap")
• Analyze the content for context (sentiment, tone, intensity, keywords, location, demographic information)
• Classify it so the business can act on it (a customer service issue, a request for a product enhancement, a question, etc.)

Each of these steps is rife with nuances that require both sophisticated technologies and processes to address (see Figure 1). Together, these challenges add up to a host of risks: missed signals, inaccurate conclusions, bad decisions, high total cost of data and tool ownership, and an inability to scale, among others. Even a small misstep, such as a missing source, a disparity in filtering algorithms, or a lack of language support, can have a significant detrimental effect on the trustworthiness of the results. A recent story in Foreign Policy magazine provides a timely example.
"Why Big Data Missed the Early Warning Signs of Ebola" highlights the importance of an early media report published by Xinhua's French-language newswire covering a press conference about an outbreak of an unidentified hemorrhagic fever in the Macenta prefecture in Guinea.10 The Foreign Policy article debunks some of the hyperbole about the role of big data in identifying Ebola, not because the technology wasn't available (it was) or because the indications weren't there (they were), but because, as author Kalev Leetaru writes, "part of the problem is that the majority of media in Guinea is not published in English, while most monitoring systems today emphasize English-language material."
FIGURE 1: CHALLENGES OF UNSTRUCTURED DATA

1. Identify data sources. Challenge: Not all data sources provide reliable APIs or consistent access.
2. Crawl and extract data. Challenge: Different tools use different crawlers, which can return different samples.
3. Detect and interpret language. Challenge: Not all tools support multiple languages, or support them equally well.
4. Filter for spam. Challenge: Different spam-filtering algorithms can return different samples and accuracy levels.
5. Categorize for relevance. Challenge: Inconsistent levels of accuracy and different approaches.
6. Analyze for sentiment and keywords/themes. Challenge: Sentiment analysis is highly subjective and subject to interpretation or error. Even with human coding (which reduces scalability) and machine learning, no tool is perfect.
7. Classify for action. Challenge: Requires both organizational and technology resources to tag data so that it is appropriately classified and shared with the right people.
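As a rough illustration of three of these steps (filtering for spam, categorizing for relevance, and classifying for action), here is a minimal keyword-based sketch. Real tools use far more sophisticated models; the patterns below are invented solely to dramatize the "Gap store" versus "trade gap" ambiguity.

```python
import re

# Hypothetical mini-pipeline: spam filter -> relevance filter -> action class.
SPAM_PATTERNS = [re.compile(p, re.I) for p in (r"click here", r"free followers")]

# "Gap store" is relevant to the brand; "trade gap" is not.
RELEVANT = re.compile(r"\bgap\s+(store|jeans|outlet)\b", re.I)

ACTIONS = {
    "complaint": re.compile(r"\b(broken|refund|terrible)\b", re.I),
    "question": re.compile(r"\?"),
}

def triage(post):
    """Route one social post to spam / irrelevant / an actionable bucket."""
    if any(p.search(post) for p in SPAM_PATTERNS):
        return "spam"
    if not RELEVANT.search(post):
        return "irrelevant"
    for action, pattern in ACTIONS.items():
        if pattern.search(post):
            return action
    return "mention"
```

Even this toy version shows why results diverge across vendors: change one regex (one filtering choice) and the same post lands in a different bucket.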
TRADITIONAL METHODOLOGIES MUST ADAPT

Even in the unlikely event that all relevant data is in English or another single language, there's no guarantee that it will be easy to interpret or that the path to doing so will be clear. For this reason, researchers in both industry and academia are grappling with the many challenges that large, unstructured human data poses as a tool for conducting scientific or business research. The following case study provides an example of how one organization is addressing these significant methodological issues.

Case Study: Health Media Collaboratory
Applying Methodological Rigor to Big Data

The Health Media Collaboratory (HMC) at the University of Illinois at Chicago's Institute for Health Research and Policy is focused on understanding social data, most of which is unstructured, to "positively impact the health behavior of individuals and communities," according to its website. In the broadest sense, HMC's mission is to develop and propagate a new paradigm for health media research, using innovative strategies to apply methodological rigor to the analysis of big data.11

The focus of a recent project was to look at how people talk about quitting smoking on Twitter so that HMC and the Centers for Disease Control and Prevention (CDC) could learn how they might promote behavior change. HMC turned to Twitter to explore two questions about the impact, if any, of social data on smoking cessation. The initial research questions were:
• How much electronic-cigarette promotion is there on Twitter?
• How much organic conversation about electronic cigarettes exists on Twitter?

In another project, HMC looked at whether Twitter could be used as a tool to evaluate the efficacy of health-oriented media campaigns. In particular, the CDC wanted to assess the impact of several provocative and graphic television commercials, one of which featured a woman with a hole in her throat.
The questions HMC sought to answer were:
• Did the commercials work?
• How can we prove it?

This type of research, as well as the data it presents, is vastly different from fielding a conventional multiple-choice survey in which the questions and answers are predefined and results tabulate the percentage of answers in each column. HMC instead had to determine, with an appropriate level of confidence, how people talk about smoking on Twitter and whether this data could serve as a useful indicator of public opinion and even of likely behavior.
To do this, the team needed to understand how much of the Twitter conversation about smoking was spam, how much was off topic ("smoking marijuana," "smoking ribs," "smoking hot women"), and how much was relevant ("I've really got to quit smoking cigarettes"). For the first project, it also meant understanding how people talk about electronic cigarettes in particular. Figure 2 is a recreation of the search string HMC used in its research, illustrating why this effort isn't as simple as it might seem.

The methodology that HMC used to collect, clean, and analyze the Twitter conversation related to smoking closely mirrors the big data challenges outlined in Figure 1. While it adheres to the scientific method, it's important to know that this was a methodology that HMC itself devised to account for the nuances and challenges of unstructured data.

1. Data collection. Determine the appropriate source and sample size of the data to be collected.
2. Keyword selection. Generate the most comprehensive possible list of keywords, encompassing nonstandard English usages, slang terms, and misspellings.
3. Metadata. Collect metadata related to the tweets, including:
   a. A tweet ID (a unique numerical identifier assigned to each tweet)
   b. The username and biographical profile of the account used to post the tweet
   c. Geolocation (if enabled by the user)
   d. Number of followers of the posting account
   e. The number of accounts the posting account follows
   f. The posting account's Klout score
   g. Hashtags
   h. URL links
   i. Media content attached to the tweet
4. Filtering for engagement. Because engagement with the campaign was the determining factor for relevance, the team filtered tweets that described televised commercials, later de-duplicating them to ensure that tweets with multiple keywords would not be counted twice.
5. Human coding. Throughout the process, human coders reviewed the data to assess relevance and code message content.
FIGURE 2: HOW PEOPLE TALK ABOUT E-CIGARETTES
A recreation of the search string HMC used, combining dozens of keyword variants and spellings, including: "e cigarettes," "e-cigs," "ecigarette," "blu cig," "njoy cigarette," "green smoke," "south beach smoke," "cartomizer," "(atomizer OR atomizers) -perfume," "ehookah OR e-hookah," "ejuice OR e-juice," "eliquid OR e-liquid," "e-smoke OR esmoke," "lavatube," "logicecig," "smartsmoker," "smokestik," "v2 cig OR v2cigs," "vaper OR vapers OR vaping," and "zerocig."
Source: University of Illinois at Chicago's Institute for Health Research and Policy
6. Precision and relevance. The team used a combination of human and machine coding to assess relevance and eliminate false positives, using three teams of trained coders and a process to assess intercoder reliability using a Kappa score, a statistic "used to assess inter-rater reliability when observing or otherwise coding qualitative/categorical variables."12 According to HMC, "the human-coded tweets were then used to train a naïve Bayes classifier to automatically classify the larger dataset of Tips engagement tweets for relevance. Precision was calculated as the percent of Tips-relevant tweets yielded by the keyword filters."13
7. Recall. To assess whether the tweet sample was representative of, and could be generalized to, all potentially relevant Twitter content, the team compared its sample to a larger sample of unretrieved tweets, again using trained coders and a Kappa score to assess how well the filtered tweet sample represented the larger data set.14
8. Content coding. Finally, the team coded the content to better understand "fear appeals," that is, whether the user accepted, rejected, or disregarded the message.

So, did the CDC's graphic and disturbing anti-smoking ads, and the Twitter conversation surrounding them, actually lead people to quit? HMC didn't overstate its data; rather, it concluded that approximately 87% of the tweets about the TV commercials expressed fear and that the ads had "the desired result of jolting the audience into a thought process that might have some impact on future behavior."15 HMC's case study illustrates that unstructured data requires significant adaptations to analytics methodology to extract meaning.
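The statistics in steps 6 and 7 (precision, recall, and a Kappa score for intercoder reliability) can be sketched in a few lines. This is a simplified illustration, not HMC's actual code; Cohen's kappa, used here, is one common form of the statistic, and the coder labels and counts are invented.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for chance agreement."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

def precision_recall(true_positives, retrieved, relevant):
    """Precision: how much of what we retrieved was relevant.
    Recall: how much of what was relevant we retrieved."""
    return true_positives / retrieved, true_positives / relevant

# Two coders label the same five tweets as relevant (R) or not (N).
kappa = cohens_kappa("RRNNR", "RNNNR")

# Hypothetical filter results: 100 tweets retrieved, 80 truly relevant,
# out of 160 relevant tweets in the wider stream.
p, r = precision_recall(true_positives=80, retrieved=100, relevant=160)
```

The kappa correction matters precisely because, as HMC's process acknowledges, two coders agreeing 80% of the time is less impressive when chance alone would produce substantial agreement.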
Certainly it would have been a lot simpler for the CDC to host a focus group or field a survey to collect impressions about its anti-smoking campaign, but that data, as comparatively simple as it would have been to analyze, would lack the spontaneity and rich variety of expression available on Twitter or other social networks, had the teams extended the research to other sources. The nature of human language demands rigorous and repeatable processes to extract meaning in a transparent and defensible way. As a result, analytics methodology is undergoing an explosive period of change.
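HMC's keyword-selection and de-duplication steps (steps 2 and 4 above) can likewise be sketched briefly. The keyword variants here are illustrative stand-ins for HMC's much longer list in Figure 2, and the tweet texts are invented.

```python
import re

# A tiny stand-in for a comprehensive keyword list, including variant
# spellings ("ecig", "e-cig", "e-cigarette") and slang ("vaping").
KEYWORDS = [r"e-?cig(arette)?s?", r"vap(e|ing|er)s?", r"quit(ting)?\s+smoking"]
PATTERN = re.compile("|".join(f"(?:{k})" for k in KEYWORDS), re.I)

def collect(tweets):
    """tweets: iterable of (tweet_id, text).
    Returns keyword-matched tweets, de-duplicated by tweet ID so that a
    tweet matching multiple keywords (or fetched twice) counts once."""
    seen, matched = set(), []
    for tweet_id, text in tweets:
        if tweet_id in seen:
            continue
        if PATTERN.search(text):
            seen.add(tweet_id)
            matched.append((tweet_id, text))
    return matched
```

Note that "smoking ribs tonight" falls through: the off-topic senses of "smoking" that HMC had to filter out are exactly what a naive keyword match would sweep in.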
From Data to Insight
BIG DATA REQUIRES LINGUISTIC EXPERTISE

As counterintuitive as it might seem, an influx of unstructured data demands not only new and more sophisticated technologies to process and store it but a renewed emphasis on the humanistic disciplines as well. This is because, as Gartner has said, big data "tends to be the human-generated and people-oriented content" rather than highly structured data that fits neatly into databases. Naturally, "human-generated and people-oriented content" includes language, which is rife with contractions, sarcasm, slang, and metaphors expressed in multiple written forms, in hundreds of languages, 24 hours a day, seven days a week.

Furthermore, language changes constantly, a fact Oxford Dictionaries marks each November by publishing a word of the year that encapsulates that year's zeitgeist. 2014's word was "vape," salient in light of HMC's research. Five years ago, "vape" would have been impossible to interpret, because it, and its cultural context, didn't exist yet.

A recent article in MIT Technology Review illustrates just how quickly language and meaning can evolve, both in obvious and subtle ways.16 Vivek Kulkarni, a PhD student in the Data Science Lab at Stony Brook University, along with several of his colleagues, used linguistic mapping to illustrate the speed at which word meanings change, gathering inputs from sources such as Google Books, Amazon, and Twitter. "Mouse" acquired an entirely new meaning following the introduction of the computer mouse in the early 1970s, and "sandy" changed literally overnight with Hurricane Sandy in 2012. Today we see a constant stream of examples both of redefined words and of new ones ("vaping," "selfie") that require both technological and humanistic expertise to map, place in context, and understand.
BIG DATA REQUIRES EXPERTISE IN DATA SCIENCE AND CRITICAL THINKING

The speed, size, and variety of data around us, and the availability of platforms used to visualize and analyze it, have democratized the function of analytics within organizations. At the same time, fundamental analytics education has lagged, creating a situation in which organizations are at risk of misinterpreting data of all kinds. Says Philip B. Stark, professor and chair of statistics at the University of California, Berkeley, "the type of data (structured, text, etc.) isn't the point at all. The way of thinking matters."17 Stark emphasizes that good data science requires having subject matter expertise, access to the appropriate computational tools, and, most importantly, critical thinking and statistics skills. Figure 3 lays out the consequences of overlooking any of these three foundational elements.

FIGURE 3: FUNDAMENTALS OF DATA SCIENCE
Insights emerge only where subject matter expertise, access to tools, and critical thinking/applied statistics overlap. Lacking any one of the three yields, respectively, irrelevant conclusions, an inability to execute, or incorrect conclusions.
1. Irrelevant conclusions. If tools and critical thinking are present but subject matter expertise is absent, the organization risks asking the wrong questions, which can result in irrelevant conclusions and valueless answers. In addition, the organization will lack the context necessary to design experiments that will yield the answers it needs. It will be unable to understand the intrinsic limitations of the data, says Stark: noise, sampling issues, response bias, measurement bias, and so on. This creates a domino effect that can squander resources and lead to ineffectual (or worse, harmful) decisions.

2. Inability to execute. If subject matter expertise and critical thinking are present, but tools are absent, the organization will be unable to extract insights at scale and must resort to time-consuming manual methods. As a result, the organization risks burning out and eventually losing top analysts, who now must focus on brute-force methods of processing and analyzing data, rather than using their skills for more sophisticated and rewarding applications.

3. Incorrect conclusions. If subject matter expertise and tools are present, but critical thinking and a knowledge of applied statistics are absent, the organization risks drawing the wrong conclusions from good data, making poor decisions that may ignore other critical business signals. Like a lack of subject matter expertise, this can have harmful consequences for decision making and, therefore, business results.
Given the spread of data throughout organizations and the impracticality of hiring legions of trained analysts to keep pace with its growth, the next step is to evolve from analytics that simply describe a situation to analytics that predict what may happen next, and then to analytics that prescribe a course of action.18 But even assuming access to the most sophisticated algorithms that incorporate the most detailed business knowledge, widespread access to data necessitates that more people, irrespective of role, grasp the basics of logic and statistics to understand that data. This doesn't mandate universal PhDs in applied statistics, but it does require an awareness of basic principles of logic.

The good news is that, while the big data industry is still in its infancy, many of the most valuable tools for analysis are widely available, and more than two thousand years old to boot. As early as 350 BCE, Aristotle described 13 logical fallacies, which logicians and philosophers have built upon during the last 2,400 years.19 Ignoring these fallacies leaves organizations vulnerable to a host of risks, which can harm competitive position, financial success, customer sentiment and trust, and other critical objectives.

One common example is mistaking correlation for causation, in which organizations erroneously attribute one outcome (for example, increased revenue) to a corresponding data point (for example, reach of a marketing campaign). The increasing use of technologies that present complex data visually can exacerbate the problem. Harvard law student Tyler Vigen succinctly (and sometimes hilariously) presents this phenomenon on his Spurious Correlations blog.20
FIGURE 4: MISTAKING CORRELATION FOR CAUSATION
Divorce rate in Maine correlates with per capita consumption of margarine (US). Correlation: 0.992558.
Years 2000-2009. Divorces per 1,000 people (Maine): 5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1. Pounds of margarine consumed per capita (US): 8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7.
Source: Tyler Vigen

In Figure 4, Vigen's calculations show that there is a 99% correlation between the divorce rate in Maine and per-capita margarine consumption. Does the Maine divorce rate somehow cause US residents to eat margarine? Does US margarine consumption somehow lead to divorce in Maine? While these questions are absurd, charts such as this visually suggest a link.

The correlation/causation fallacy is just one of many logical fallacies that have been documented and described over the years, including formal fallacies (fallacies of logic) and informal fallacies (fallacies of evidence or relevance).21 As more tools become available to visualize data sets quickly and easily, organizations must invest as much in critical thinking and data science expertise as they do in tools to visualize data. Otherwise, they risk succumbing to logical fallacies.
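Vigen's headline number is easy to reproduce from the ten yearly values in Figure 4; a plain Pearson correlation recovers the ≈0.9926 figure, which is precisely the point: a correlation this strong can still be meaningless.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The 2000-2009 series from Figure 4 (Source: Tyler Vigen).
divorce_rate_maine = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_lbs_per_capita = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]

r = pearson_r(divorce_rate_maine, margarine_lbs_per_capita)  # ≈ 0.9926
```

Nothing in the formula knows, or cares, whether the two series have any causal relationship; that judgment is exactly the critical-thinking skill the section argues for.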
Legal and Ethical Issues of Big Data
BIG DATA RAISES MULTIPLE LEGAL AND ETHICAL ISSUES

The good news, and the bad news, about big data is that it can provide unprecedented insight into people, both as individuals and in aggregate. While surveys can, arguably, reveal human attitudes, Christian Rudder, CEO of dating site OkCupid, points out in his 2014 book, Dataclysm: Who We Are (When We Think No One's Looking), that "we can pinpoint the speaker, the words, the moment, even the latitude and longitude of human communication."22

Many people know the story of how Target discovered that a young girl was pregnant before her father did; such stories have become mainstream.23 But much of the challenge with recent discussions of ethics and privacy stems from the extremely broad nature of these terms, the spectrum of personal preferences, and individuals' beliefs about the media environment we live in today. Consider these recent examples:

• Seeking to prevent suicides, Samaritans Radar raises privacy concerns. In October 2014, the BBC reported that the Samaritans had launched an app that would monitor words and phrases such as "hate myself" and "depressed" on Twitter and would notify users if any of the people they follow appeared to be suicidal.24 While the app was developed to help people reach out to those in need, privacy advocates expressed concern that the information could be used to target and profile individuals without their consent. According to a petition, the Samaritans app was monitoring approximately 900,000 Twitter accounts as of late October.25 By November 7, the app had been suspended based on public feedback.26

• Facebook's "emotional contagion" experiment provokes outrage about its methodology.
In June 2014, Facebook's Adam Kramer published a study in Proceedings of the National Academy of Sciences revealing that "emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness."27 In other words, seeing negative stories on Facebook can make you sad. The experiment provoked outrage about the perceived lack of informed consent, the ethical repercussions of such a study, the concern over appropriate peer review, the privacy implications, and the precedent such a study might set for research using digital data.

• Uber knows when and where (and possibly with whom) you've spent the night. In March 2012, Uber posted, and later deleted, a blog post entitled "Rides of Glory," which revealed patterns, by city, of Uber rides after "brief overnight weekend stays," also known as the passenger version of the walk of shame.28 Uber was later criticized for allegedly revealing its "God View" at an industry event, showing attendees the precise location of a particular journalist without his knowledge, while a December 1, 2014, post on Talking Points Memo disclosed the story of a job applicant who was allegedly shown individuals' live travel information during an interview.29, 30
• A teenager becomes an Internet celebrity, and a target, in one day. Alex Lee, a 16-year-old Target clerk, became a meme (#AlexFromTarget) and a celebrity within hours, based on a photo taken of him unawares at work. He was invited to appear on The Ellen Show and was reported to have received death threats on social media.31

These stories illustrate several attributes of the data environment we live in now and the attendant ethical issues they represent:

• Data collection. The Samaritans example illustrates the law of unintended consequences: what may happen when an app collects data that may, albeit unintentionally, compromise privacy or put people in harm's way.
• Methodology and usage. The Facebook example demonstrates what happens when a company uses its vast reservoir of data to run technically legal but ethically ambiguous experiments on its users, raising questions about the nature of informed consent and ethical data use in the digital age.
• Aggregation, storage, and stewardship. The Uber posts illustrate, albeit with aggregated data, the intensely intimate nature of the data users entrust to companies, raising questions of stewardship, ethics (is aggregating such data ethical?), and privacy (what happens if data is intentionally or accidentally disclosed?).
• Communication. All of the above examples illustrate the gray areas between law and ethics, or, from an organizational point of view, between risk management and customer experience. As data becomes even more valuable and ubiquitous, the way organizations communicate about collection, analysis, intent, and usage will affect not only their legal risk profile but also their ability to attract and retain the trust and loyalty of their communities.
Finally, there is what former secretary of defense Donald Rumsfeld so famously called "the unknown unknown." The #AlexFromTarget story demonstrates not only how an everyday 16-year-old (by definition, a minor) can become an instant Internet celebrity but also how a company can unwittingly and suddenly find itself at the center of a crisis not of its own creation, one that raises issues (compounded because of Lee's age) of employee privacy and even safety. Figure 5 lays out these issues at a high level.

In the past, many of these ethical issues related to data were cloaked behind proprietary systems and siloed data stores. As data becomes ubiquitous, more integrated, and more portable, however, the number and type of ethical gray areas will multiply, along with the need to distinguish the organization's legal responsibilities, such as what it discloses in its terms of service, from its ethical ones: the actions it takes that promote or erode the trust of its community.
FIGURE 5: ETHICAL ISSUES RELATED TO DATA

Data Collection
• Data sources
• Data types
• Sample size
• How the data may have been filtered, enriched, or otherwise modified with demographic, location, or other metadata

Methodology
• Keyword selection
• Human or algorithmic coding
• Process for assessing precision, relevance, and recall

Usage & Action
• How the organization may change the experience based on data
• Whether the organization plans to sell the data in any form to a third party

Aggregation
• How data is combined and its impact on personally identifiable information (PII) or user experience in general

Storage & Stewardship
• What data is collected
• How and for how long data is stored
• Who owns the data
• Who has the right to delete data (posts or entire profiles)
• Process for deleting data (posts or entire profiles)
• Who has the right to view/modify/share data (administration)
• Whether and how the data can be extracted

Communication
• The extent to which the organization proactively and transparently informs users/customers about what and how it collects, analyzes, stores, aggregates, and uses their data
Planning for Data Ubiquity
If we—individually and collectively—are to make the best use of data and extract relevant insight from it in a trustworthy manner, we must approach data strategy thoughtfully. Following are some basic tenets of a strategic data plan.

1. Define data strategy and operating model

If data is to be considered a business-critical asset, it must be treated as such by leaders who drive and instill strategy across the organization. In 2015, leaders must define which critical data streams are needed to drive business goals, how they will source them, and what operating model is needed to process, interpret, and act on them at the right time. The challenge is that an organization's departments (and therefore its data) tend to be siloed, which can result in blind spots, organizational politics, and spiraling costs. Organizations must balance their need for insight and competitive advantage on the one hand with privacy and a rational cost of ownership on the other. All too frequently, these dual imperatives conflict, sometimes unnecessarily, because the organization lacks a clear strategy for what data will be used and stored, what data will be used but not stored, and what data is simply unnecessary.

2. Update analytics methodology to reflect new data realities

Analyzing unstructured data will never yield the same confidence levels as a simple binary choice; it will always require interpretation. The key is to make that interpretation transparent, rigorous, and repeatable, so that others can rerun the analysis and reach the same or substantially similar results. This is one area in which there is a tremendous difference between private and public institutions: in private institutions, work process, product, and data tend to be proprietary, while in public institutions, such as universities, research is subject to the highest levels of scrutiny from academic publications and journals. It's also important to engineer the method of measurement into initiatives, to reduce ambiguity and provide a greater ability to trace impact. The broader the topic, the more hashtags can help confirm the provenance and relevance of social conversation. Tracking codes and multivariate testing are also useful, if imperfect, solutions.

3. Seek out critical thinking and diverse skill sets

Unquestionably, engineering and analytical skills, not to mention skills in applied statistics and data science, will continue to gain value as organizations become ever more dependent on multiple data types. At the same time, analyzing unstructured data also requires skill in interpreting the context of language and behavior, a challenge humans have faced since we developed language. After all, even the cleanest, most reliable data can be misinterpreted, whether intentionally or unintentionally. Minimizing misinterpretation means valuing not only math and engineering but also the social sciences and humanities. These disciplines—sociology, psychology, anthropology, linguistics, ethics, philosophy, and rhetoric—provide context and help us become better critical thinkers. Without a balance of critical thinking, business knowledge, and smart analytics tools, we're in danger of making the wrong decision far more efficiently, more quickly, and with far greater impact than in the past.
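To make the methodology tenet concrete: two simple, repeatable checks can make human or algorithmic coding of social data auditable. The sketch below (plain Python, with entirely hypothetical post labels) computes precision and recall against a hand-labeled sample, and Cohen's kappa, the chance-corrected intercoder agreement statistic cited in the endnotes. It is an illustrative sketch, not a method drawn from this report or the studies it cites.

```python
def precision_recall(predicted, actual):
    """Compare a classifier's positive labels (a set of post IDs)
    against a gold-standard, hand-labeled set."""
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders beyond chance, for binary labels.
    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    p_a = sum(coder_a) / n          # proportion coder A labeled positive
    p_b = sum(coder_b) / n          # proportion coder B labeled positive
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Hypothetical example: which of posts 1..8 did a keyword filter mark "relevant"?
pred = {1, 2, 3, 5}
gold = {1, 2, 4, 5, 6}
p, r = precision_recall(pred, gold)   # p = 0.75, r = 0.6

# Two human coders labeling the same eight posts (1 = relevant, 0 = not)
coder_a = [1, 1, 0, 0, 1, 0, 1, 0]
coder_b = [1, 1, 0, 1, 1, 0, 0, 0]
kappa = cohens_kappa(coder_a, coder_b)  # 0.5: moderate agreement
```

Publishing numbers like these alongside an analysis is one lightweight way to make interpretation transparent and repeatable, as the tenet above recommends.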
4. Insist on ethical data use and transparent disclosure

Earl Warren, former chief justice of the United States, once said, "In civilized life, law floats in a sea of ethics."33 This is especially true of the digital age, in which few of the implications of digital transformation have found their way into case law and, as a result, into organizational policy. As organizations become more data-centric, for their own benefit as well as their customers', they must also look closely at the affirmative and passive decisions they make about where they get their data; their analytics methodology; how they store, steward, aggregate, and use the data; and how transparently they disclose these actions.

5. Reward and reinforce humility and learning

It is nearly impossible to calculate the impact that data will have on our lives in the next decade. Technologies such as IBM's Watson and Ayasdi are illustrating the many applications of big data, whether in healthcare, consumer products, financial services, energy, or elsewhere. Meanwhile, the Internet of Things introduces data feeds from sensors, which can be combined with other data streams to deliver specific, relevant, and even predictive insights that will only compound volume, velocity, and variety challenges. Yet the world is just starting to come to terms with the impact of data ubiquity from the technology, business, research, cultural, and ethical perspectives. The most important and perhaps most difficult impact of data ubiquity is that it radically undermines traditional methods of analysis and laughs at our desire for certainty. The only strategy to combat the fear of uncertainty is to accept and work within the limits of the data, and to approach the science of challenging data sets with an appetite for continuous learning, whether the goal is to sell a pair of shoes or to help prevent cancer.

CONCLUSION

The hype over "big data" has partially obscured the fact that our ability to collect, analyze, and act on data—and to some extent predict outcomes based upon it—is a potentially transformative force for business and humanity alike. While Aldous Huxley couldn't have anticipated the impact of a Kim Kardashian magazine cover or the challenges inherent in understanding how people talk about smoking, he was prescient to call out the ever-increasing difficulty of identifying relevance in a "sea of irrelevance."32 It seems likely that the privacy and ethical implications of data ubiquity, not to mention recent disclosures about government access to and use of personal data, would have confirmed many of Orwell's worst fears. At the same time, however, we need not blindly accept the dystopian nightmare he envisioned as our only future. We have an opportunity—and an obligation—to examine not only the legal but also the ethical implications of ubiquitous data, and to use this understanding to decide how we will use it, sustainably and responsibly, for years to come.
Image courtesy of Gary Schroeder
ENDNOTES

1 You can view the talk at etlinger_what_do_we_do_with_all_this_big_data.
2 Neil Postman, Amusing Ourselves to Death: Public Discourse in the Age of Show Business (New York: Penguin Books, 1985), vii.
3 For a more detailed view, a good starting point is "3D Data Management: Controlling Data Volume, Velocity and Variety," published by META Group on February 6, 2001, http://blogs. Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
4 "What Is Big Data?" IBM, accessed January 6, 2015, http://www-
5 "Statistics," YouTube, accessed January 6, 2015, https://www.
6 "Stats," WordPress, cached on November 2, 2014, http://
7 "About," Twitter, accessed January 6, 2015,
8 Darin Stewart, "Big Content: The Unstructured Side of Big Data," Gartner Group, May 1, 2013, stewart/2013/05/01/big-content-the-unstructured-side-of-big-data/.
9 Zacks Equity Research, "Stock Market News for December 17, 2014 - Market News," Yahoo! Finance, December 17, 2014, stock-market-news-december-17-151003130.html;_ylt=AwrBJSCwLpNUWlIAatyTmYlQ.
10 Kalev Leetaru, "Why Big Data Missed the Early Warning Signs of Ebola," Foreign Policy, September 26, 2014,
11 See also: Sherry L. Emery, Glen Szczypka, Eulàlia P. Abril, Yoonsang Kim, and Lisa Vera, "Are You Scared Yet? Evaluating Fear Appeal Messages in Tweets About the Tips Campaign," Journal of Communication 64 (2014): 278–295, doi: 10.1111/jcom.12083.
12 "Cohen's Kappa," University of Nebraska–Lincoln, accessed January 6, 2015, hckappa.PDF.
13 Sherry L. Emery et al., "Are You Scared Yet?"
14 Ibid.
15 Ibid.
16 "Linguistic Mapping Reveals How Word Meanings Sometimes Change Overnight," MIT Technology Review, November 23, 2014, mapping-reveals-how-word-meanings-sometimes-change-overnight/.
17 Philip Stark, Twitter comment, November 24, 2014, https://
18 For a quick primer on descriptive, predictive, and prescriptive analytics, see this interview with data scientist Michael Wu of Lithium by Jeff Bertolucci in InformationWeek: http://www. data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279.
19 To download the text, go to sophist_refut.html.
20 Vigen maintains a running list of spurious correlations at his blog, Spurious Correlations.
21 For an excellent tutorial on logical fallacies, see chapter 2 of "SticiGui," an online statistics textbook by Philip B. Stark, professor and chair of the department of statistics, University of California, Berkeley: reasoning.htm.
22 Rudder, Dataclysm, 146.
23 Kashmir Hill, "How Target Figured Out a Teen Girl Was Pregnant Before Her Father Did," Forbes, February 16, 2012, http://www. out-a-teen-girl-was-pregnant-before-her-father-did/.
24 Zoe Kleinman, "Samaritans App Monitors Twitter Feeds for Suicide Warnings," BBC News, October 28, 2014, com/news/technology-29801214.
25 Adrian Short, "Shut Down Samaritans Radar,", accessed January 6, 2015, inc-shut-down-samaritans-radar.
26 "Samaritans Radar announcement - Friday 7 November," Samaritans, November 7, 2014, news/samaritans-radar-announcement-friday-7-november.
27 Adam D.I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock, "Experimental Evidence of Massive-Scale Emotional Contagion Through Social Networks," Proceedings of the National Academy of Sciences of the United States of America 111 (24), DOI: 10.1073/pnas.1320040111.
28 Voytek, "Rides of Glory," Uber, cached March 26, 2012, https:// ridesofglory.
29 Kashmir Hill, "'God View': Uber Allegedly Stalked Users for Party-Goers' Viewing Pleasure (Updated)," Forbes, October 3, 2014, god-view-uber-allegedly-stalked-users-for-party-goers-viewing-pleasure/.
30 Caitlin MacNeal, "Report: Uber Let Job Applicant Access Controversial 'God View' Mode," Talking Points Memo, December 1, 2014, applicant-ride-logs.
31 Nick Bilton, "Alex from Target: The Other Side of Fame," The New York Times, November 12, 2014,
32 Aldous Huxley, Brave New World Revisited (New York: HarperCollins Publishers, 1958), 36.
33 Earl Warren, speech at the Louis Marshall Award Dinner of the Jewish Theological Seminary (Americana Hotel, New York City, November 11, 1962).

SOURCES AND ACKNOWLEDGMENTS

This document was developed as a companion piece to a talk given at TED@IBM in San Francisco, California, on September 23, 2014. As such, it was built on online and in-person conversations with market influencers, technology vendors, brands, academics, and others on the effective and ethical use of big data, as well as secondary research, including relevant and timely books, articles, and news stories.
My deepest gratitude to the following:

• The team at the Health Media Collaboratory at the University of Illinois at Chicago, specifically Sherry Emery, Eman Aly, and Glen Szczypka, for sharing their research and methodology and educating me about the nuances of interpreting big data for medical research.

• My fellow board members at the Big Boulder Initiative for their insights and perspective on the effective and ethical use of social data: Pernille Bruun-Jensen, CMO, NetBase; Damon Cortesi, Founder and CTO, Simply Measured; Jason Gowans, Director, Data Lab, Nordstrom; Will McInnes, CMO, Brandwatch; Chris Moody, Vice President, Data Strategy, Twitter (Chair); Stuart Shulman, Founder and CEO, Texifter; Carmen Sutter, Product Manager, Social, Adobe; and Tom Watson, Head of Sales, Hanweck Associates, LLC.

• The team at TED who helped me hone and focus the talk and provided invaluable feedback throughout: Juliet Blake and Anna Bechtol.

• The team at IBM Social Business for planning, executing, and marketing a superb event: Michela Stribling, Beth McElroy, Jacqueline Saenz, and Michelle Killebrew.

• My fellow TED@IBM speakers: Gianluca Ambrosetti, Kare Anderson, Brad Bird, Monika Blaumueller, Erick Brethenoux, Lisa Seacat DeLuca, Jon Iwata, Bryan Kramer, Tan Le, Charlene Li, Florian Pinel, Inhi Cho Suh, Marie Wallace, and Kareem Yusuf.

• Philip Stark, professor and chair of statistics, University of California, Berkeley, for an extremely insightful perspective on the methodological and organizational requirements of big data, as well as access to his superb course materials.
• The organizers and speakers at the International Symposium on Digital Ethics at Loyola University in November 2014, with whom I had some incredibly insightful conversations: Don Heider, dean, School of Communication, Loyola University Chicago; Thorsten Busch, senior research fellow, Institute for Business Ethics, University of St. Gallen; Michael Koliska, PhD candidate at the University of Maryland; and Caitlin Ring, assistant professor of strategic communication at Seattle University.

• Farida Vis, research fellow in the Social Sciences in the Information School at the University of Sheffield.

• The teams at DataSift (Nick Halstead, Tim Barker, Jason Rose, Seth Catalli); Lithium Technologies (Katy Keim and Nicol Addison); and Oracle (Tara Roberts and Christine Wan) for valuable insights along the way.

• Tyler Vigen for his Spurious Correlations blog, which makes a complex topic simple and fun to explain; Gary Schroeder for his wonderful visual storytelling of my TED talk; Daniel K. Davis for his superb photography at TED@IBM; Vladimir Mirkovic for graphic design; and Erin Brenner for copyediting.

• My talented teammates at Altimeter Group: Rebecca Lieb, who edited this report; Cheryl Graves; Jessica Groopman; Jaimy Szymanski; Christine Tran; and, of course, Charlene Li.

Input into this document does not represent a complete endorsement of the report by the individuals or organizations listed above. Finally, any errors are mine alone.

OPEN RESEARCH

This independent research report was 100% funded by Altimeter Group. This report is published under the principle of Open Research and is intended to advance the industry at no cost. This report is intended for you to read, utilize, and share with others; if you do so, please provide attribution to Altimeter Group.
PERMISSIONS

The Creative Commons License is Attribution-NonCommercial-ShareAlike 3.0 United States, which can be found at https://

DISCLAIMER

ALTHOUGH THE INFORMATION AND DATA USED IN THIS REPORT HAVE BEEN PRODUCED AND PROCESSED FROM SOURCES BELIEVED TO BE RELIABLE, NO WARRANTY EXPRESSED OR IMPLIED IS MADE REGARDING THE COMPLETENESS, ACCURACY, ADEQUACY, OR USE OF THE INFORMATION. THE AUTHORS AND CONTRIBUTORS OF THE INFORMATION AND DATA SHALL HAVE NO LIABILITY FOR ERRORS OR OMISSIONS CONTAINED HEREIN OR FOR INTERPRETATIONS THEREOF. REFERENCE HEREIN TO ANY SPECIFIC PRODUCT OR VENDOR BY TRADE NAME, TRADEMARK, OR OTHERWISE DOES NOT CONSTITUTE OR IMPLY ITS ENDORSEMENT, RECOMMENDATION, OR FAVORING BY THE AUTHORS OR CONTRIBUTORS AND SHALL NOT BE USED FOR ADVERTISING OR PRODUCT ENDORSEMENT PURPOSES. THE OPINIONS EXPRESSED HEREIN ARE SUBJECT TO CHANGE WITHOUT NOTICE.
About Us

Altimeter is a research and consulting firm that helps companies understand and act on technology disruption. We give business leaders the insight and confidence to help their companies thrive in the face of disruption. In addition to publishing research, Altimeter Group analysts speak and provide strategy consulting on trends in leadership, digital transformation, social business, data disruption, and content marketing strategy.

How to Work with Us

Altimeter Group research is applied and brought to life in our client engagements. We help organizations understand and take advantage of digital disruption. There are several ways Altimeter can help you with your business initiatives:

• Strategy Consulting. Altimeter creates strategies and plans to help companies act on disruptive business and technology trends. Our team of analysts and consultants works with senior executives, strategists, and marketers on needs assessment, strategy roadmaps, and pragmatic recommendations across disruptive trends.

• Education and Workshops. Engage an Altimeter speaker to help make the business case to executives or arm practitioners with new knowledge and skills.

• Advisory. Retain Altimeter for ongoing research-based advisory: conduct an ad hoc session to address an immediate challenge, or gain deeper access to research and strategy counsel.

To learn more about Altimeter's offerings, contact

Altimeter Group
1875 S Grant St #680
San Mateo, CA 94402
@altimetergroup

Susan Etlinger, Industry Analyst

Susan Etlinger is an industry analyst at Altimeter Group, where she works with global organizations to develop data and analytics strategies that support their business objectives. Susan has a diverse background in marketing and strategic planning within both corporations and agencies. She's a frequent speaker on social data and analytics and has been extensively quoted in outlets including Fast Company, the BBC, The New York Times, and The Wall Street Journal.
Find her on Twitter at @setlinger and at her blog, Thought Experiments, at

Rebecca Lieb, Industry Analyst

Rebecca Lieb (@lieblink) covers digital advertising and media, encompassing brands, publishers, agencies, and technology vendors. In addition to her background as a marketing executive, she was VP and editor-in-chief of the ClickZ Network for over seven years. She's written two books on digital marketing: The Truth About Search Engine Optimization (2009) and Content Marketing (2011). Rebecca blogs at