Big Data is simply about recognizing that there are valuable sources of information in the enterprise that, for various reasons, do not fit nicely into the traditional data warehousing and BI technology stack. The dimensions of Big Data we’ve all heard so much about are really just a way to classify the reasons a particular source of data does not fit. For example, the Volume dimension usually refers to sources of data with low value per record, which makes it cost-prohibitive to load those sources into the warehouse. Web click logs and system logs are typical: this type of data needs to be stored cheaply and reduced to meaningful data for end-user consumption. Technologies like Hadoop do a great job of this. Similarly, high-Velocity data often requires technology like complex event processing (CEP) to take a large, noisy stream and glean meaningful insight from it. On the Variety and Complexity side, we are looking at data sources such as text and documents that either live in hard-to-access repositories or require advanced text analytics to produce meaningful data points.
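To make the "store cheaply, then reduce" idea concrete, here is a minimal Python sketch that collapses raw click records into a per-page summary. The log format, the sample records, and the function name `page_views` are all assumptions made for illustration; at real volumes this kind of reduction would run as a distributed job (on Hadoop, for instance) rather than in a single process.

```python
from collections import Counter

# Hypothetical click-log lines in the form "timestamp user_id url".
# Both the format and the records are invented for this example.
RAW_CLICKS = [
    "2013-01-07T10:00:01 u1 /products/widget",
    "2013-01-07T10:00:02 u2 /products/widget",
    "2013-01-07T10:00:02 u1 /cart",
    "2013-01-07T10:00:05 u3 /products/gadget",
    "2013-01-07T10:00:09 u2 /cart",
]


def page_views(lines):
    """Reduce low-value raw click records to a views-per-page summary."""
    counts = Counter()
    for line in lines:
        _timestamp, _user, url = line.split()
        counts[url] += 1
    return dict(counts)


summary = page_views(RAW_CLICKS)
print(summary)
```

Five raw records become three summary rows here; the same shape of reduction is what turns petabytes of logs into something an end user can consume.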
This is the beginning of the great divide in businesses today.
Taking a taxonomical look at where Big Content comes from, it’s easy to see it as a sibling of true Big (Unstructured) Data, living within the same family tree. While the Unstructured Data world describes mostly log files, sensor data, and other machine-generated outputs, Unstructured Content describes human-generated information in the form of communication, documents, etc.
Let’s look at Big Content through the lens of the 3 V’s. Should you tackle Big Content the same way as Big Data? In short, no. They tend to be inverses of one another in their extremes. Volumetrically, Big Data is about reducing petabytes to meaningful gigabytes. Big Content tends to deal with taking gigabytes or terabytes of human-generated information and often creating a bit MORE data than was originally being managed. For instance, a document may contain hundreds of entities, entity relationships, sentiment scores, key phrases, etc. Big Content is about expanding that dense, single document into many data points.
With respect to Velocity, the key difference lies in the way information is created in Big Data vs Big Content. While human behavior will drive some Big Data generation, ultimately it’s happening at machine speeds, recording vast amounts of small pieces of information on what often seems like a single human interaction. Big Content is generated directly by humans in the form of documents, emails, reports, etc. (Though this is one area where, dimensionally speaking, with enough humans your Big Content can approach a Big Data problem: e.g. the Twitter fire hose.)
Lastly, the Variety dimension sheds some light on often-overlooked differences. Big Data comes in a multitude of forms, but it’s typically stored in flat-file-compatible formats. Big Content systems need to deal with HUNDREDS of file types combined with DOZENS of languages. Complexity aside, when you layer in text analytics routines as part of best practices, the result is often extremely jagged data forms. While Big Data is designed to support variety, in Big Content variety is a definite constant even with the same input types.
Take email as an example. A single communication with a customer often reveals lots of information on products, services, and more.
What does a single email tell us? Quite often, a single email can be quite rich!
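As a rough illustration of how one email expands into several data points, consider the Python sketch below. The sample email, the word lists, and the function name `extract_data_points` are hypothetical, and simple keyword matching stands in for a real text analytics engine, which would also handle entities, relationships, and key phrases.

```python
import re

# Hypothetical lookup lists, invented for this example.
PRODUCTS = {"router", "modem"}
POSITIVE = {"great", "love", "thanks"}
NEGATIVE = {"broken", "slow", "cancel"}

EMAIL = (
    "Hi, my router has been slow since Monday and the modem looks broken. "
    "Reply to jane@example.com please. Thanks!"
)


def extract_data_points(email_text):
    """Turn one email into a list of (kind, value) data points."""
    points = []
    # Contact addresses mentioned in the body.
    for address in re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", email_text):
        points.append(("contact", address))
    words = re.findall(r"[a-z]+", email_text.lower())
    # Product mentions, in the order they appear.
    for word in words:
        if word in PRODUCTS:
            points.append(("product", word))
    # A crude sentiment score: positive words minus negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    points.append(("sentiment", score))
    return points


points = extract_data_points(EMAIL)
for kind, value in points:
    print(kind, value)
```

Even this toy version yields four data points from two sentences: a contact address, two product mentions, and a sentiment score.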
If we get an even LARGER pile of content, truly Big Content, it starts to look just like the data you’re probably used to working with today.
How do you start to connect the dots? Tried-and-true text analytics is our friend here, along with some established metadata practices from the content world, such as taxonomies and ontologies. They may already be in use by your organization! In the end, it’s all about using routines to derive the data points from the content and providing a way to connect those dots back to the business.
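One way to picture connecting the dots is to map extracted terms onto a taxonomy the business already maintains. The sketch below is only that, a sketch: the taxonomy, its categories, and the function name `connect_to_business` are assumptions made for the example, not a reference to any particular product.

```python
# Hypothetical taxonomy: term -> path from broad category to narrow concept.
TAXONOMY = {
    "router": ("Products", "Networking", "Router"),
    "modem": ("Products", "Networking", "Modem"),
    "invoice": ("Services", "Billing", "Invoice"),
}


def connect_to_business(terms):
    """Group extracted terms under their top-level taxonomy category."""
    grouped = {}
    for term in terms:
        path = TAXONOMY.get(term.lower())
        if path is None:
            continue  # the term is not in the taxonomy, so skip it
        grouped.setdefault(path[0], []).append(" > ".join(path))
    return grouped


linked = connect_to_business(["Router", "invoice", "weather"])
print(linked)
```

Terms that text analytics pulls out of content ("Router", "invoice") land under categories the business already reports on, while anything outside the taxonomy ("weather") is dropped.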
It’s easy to look around and see Big Content in any organization. Which systems, even within IT, do you rely on in your role? SharePoint? Documentum? File systems?