Big data is hyped, but isn't hype. There are definite technical, process and business differences between the big data market and the BI and data warehousing market, but they are often poorly understood or poorly explained. BI isn't big data, and big data isn't BI. By distilling the technical and process realities of big data systems and projects we can separate fact from fiction. This session examines the underlying assumptions and abstractions we use in the BI and DW world, the abstractions that evolved in the big data world, and how they differ. Armed with this knowledge, you will be better able to make design and architecture decisions. The session alternates between conceptual discussion and detailed technical exploration of data, processing and technology, and promises to be entertaining at either level.
Yes, it’s about the data normally called “big”, but it’s not Hadoop for the database crowd, despite the prominent role Hadoop plays. The session will be technical, but in a technology preview/overview fashion. I won’t be teaching you to write MapReduce jobs or anything of the sort.
The first part will be an overview of the types, formats and structures of data that aren’t normally in the data warehouse realm. The second part will cover some of the basic technology components, vendors and architecture.
The goal is to provide an overview of the extent of data available and some of the nuances or challenges in processing it, coupled with some examples of tools or vendors that may be a starting point if you are building in a particular area.
43. We collect large volumes of text, a rare practice ten years ago. Today we can turn text into data.
Categories, taxonomies
Topics, genres, relationships, abstracts
Sentiment, tone, opinion
Words & counts, keywords, tags
Entities: people, places, things, events, IDs
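As a rough illustration of the simplest item above (words & counts, keywords), here is a minimal Python sketch. It is not from the slides: the stop-word list, the function names `word_counts` and `top_keywords`, and the sample sentence are all assumptions made for the example, and real keyword extraction usually needs far more than frequency counting.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "in", "but"}

def word_counts(text):
    """Lowercase the text, split it into word tokens, and count the non-stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def top_keywords(text, n=3):
    """Return the n most frequent non-stop words as candidate keywords."""
    return [word for word, _ in word_counts(text).most_common(n)]

# Illustrative input, not real collected text.
sample = "Big data is hyped, but big data is not hype. Data is data."
counts = word_counts(sample)
print(counts["data"], counts["big"])   # 4 2
print(top_keywords(sample, 2))         # ['data', 'big']
```

The output is an ordinary table of (word, count) pairs, which is the point of the slide: once counted, the text can be stored and queried like any other data.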
Copyright Third Nature, Inc.
45. Example data: Twitter Message API Payload
Looks like:
This is really just a record format, much like a DB row: datetime, userID, name, location, description, message, message metadata, etc. But it's in JSON or XML.
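To make the "record format" point concrete, the sketch below flattens a nested JSON message into a single row of columns. The payload is invented for illustration: its field names and values are assumptions and do not reproduce the actual Twitter API schema, and `flatten` is a hypothetical helper, not a library function.

```python
import json

# Hypothetical payload: field names and values are illustrative only,
# not the actual Twitter API schema.
payload = """
{
  "created_at": "2013-02-26T18:30:00Z",
  "text": "Example message",
  "user": {"id": 12345, "name": "Example User",
           "location": "Example City", "description": "Example bio"},
  "metadata": {"lang": "en", "retweet_count": 2}
}
"""

def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict with underscore-joined column names."""
    row = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            row.update(flatten(value, prefix=column + "_"))
        else:
            row[column] = value
    return row

row = flatten(json.loads(payload))
for column, value in row.items():
    print(f"{column}: {value}")
```

Each key in `row` (for example `user_id` or `metadata_lang`) corresponds to a column you could load into a table. Arrays and optional fields are what make real payloads harder than this sketch suggests.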
51. Unstructured is Not Really Unstructured
Unstructured data isn’t really unstructured: language has structure. Text can contain traditional structured data elements. The problem is that the content is unmodeled.
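One way to see that text can contain traditional structured data elements is to pull them out with patterns. The Python sketch below is only an illustration under stated assumptions: the three regular expressions, the `extract_elements` helper and the sample note are all invented for this example, and the patterns are deliberately simplistic (ISO-style dates, basic email addresses, dollar amounts).

```python
import re

# Illustrative patterns only; each one is an assumption about how the
# element is written and will miss many real-world variants.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract_elements(text):
    """Return every match of each pattern, keyed by element type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

# Invented free-text note for the example.
note = "Order shipped 2013-02-26 to pat@example.com; refund of $19.99 is pending."
found = extract_elements(note)
print(found)
# {'date': ['2013-02-26'], 'email': ['pat@example.com'], 'amount': ['$19.99']}
```

The values were in the text all along; what was missing was a model saying which pieces are a date, a contact and an amount. Supplying that model, here as three patterns, is the modeling work the slide refers to.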
73. About the Presenter
Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institution. He is an international speaker, a contributor to Forbes Online and a member of the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net