Read and subscribe to my blogs at siddhithakkar.com. Details of the slides below:
This document is aimed at explaining the following:
1) How big is big data?
2) Types of big data (with examples)
3) When does your data become big data? (objective parameters)
Remember to look into the notes section for more hints.
2. Key points to be discussed:
• How big is big data?
• Types of big data
• When does your data become big data?
3. How Big is Big Data?
• It isn’t just size!!
• 4 V’s determine if data is really big
4. Types of Big Data
• Structured data
  • Follows a fixed format
  • Based on relational DBs
• Unstructured data
  • Doesn’t follow a fixed format
  • Based on character and binary data
• Semi-structured data
  • Hybrid case
  • Based on XML/JSON
5. When does your data become Big Data?
Parameter         Traditional Data          Big Data
Generation rate   Per hour/day              Rapid
Structure         Structured                Semi-structured and unstructured
Data source       Centralized               Fully distributed
Data store        RDBMS                     HDFS, NoSQL
Scenarios         Repeated read and write   Write once, repeated read
Integration       Easy                      Difficult
6. Big Data Analytics
1) Descriptive analysis
• “What happened?”
2) Diagnostic analysis
• “Why did it happen?”
3) Predictive analysis
• “What will happen next?”
4) Prescriptive analysis
• “What should I do?”
7. What is a data warehouse?
• Often a target environment for ETL tools
• Electronic storage of data
• Designed for query and analysis
• Not meant for transactional processing
• Makes data mining possible
• Core of BI systems
8. Popular Data terms
• Data science is the umbrella term
• Data analytics is examining data to deliver business insights
• Data mining is looking for patterns in data
9. What is data science?
• Umbrella term
• Includes:
• Data munging
• Data exploration
• Data representation
Most of us are aware that big data refers to a huge amount of data, so huge that our traditional data management tools can’t store or process it.
But it is very important to understand from the beginning that the complexity of big data isn’t just about its size. Several other factors make big data the thing it is today.
These factors or characteristics are called the 4 V’s of Big Data.
First is the volume of data, and volume isn’t just about the current size of your data but also the rate at which it is growing.
Second is the variety of data, i.e. how many heterogeneous sources or distributed systems you are collecting this data from.
Third is velocity, the rate at which data is coming in. As an example, billions of transactional records could be generated in a single second.
And lastly, veracity, which refers to the trustworthiness of data. If your data comes from a source you don’t trust, the business insights delivered from it won’t be of much value either.
So, if you want to term your data as big, make sure that it scores fairly well on these four characteristics.
Next are the types of big data, which can be classified into three basic categories:
Structured data is data that follows a pre-defined schema and fits perfectly into our relational databases. That kind of data is the simplest to manage and, of course, the easiest to analyze. You could think of an Excel sheet about the employees of a company as an example of structured data.
Unstructured data, as the name implies, doesn’t follow a fixed format, nor does it fit into our mainstream relational databases. As an example, all the character and binary data generated by our social media tools (text, pictures, Word and PDF files) is extremely unstructured. Managing and analyzing such data is a big challenge.
Semi-structured data is a mix of both types. It carries some organizational structure but doesn’t conform to a rigid relational schema, so it is still harder to analyze than completely structured data. You could think of XML files generated by some software solutions as an example of semi-structured data.
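The distinction between structured and semi-structured data can be sketched with a few lines of Python. The records and field names below are hypothetical, made up purely for illustration: structured rows all share one schema, while semi-structured JSON records share only some fields.

```python
import json

# Structured: every record follows the same fixed schema, like a relational table.
employees_structured = [
    {"id": 1, "name": "Asha", "dept": "HR"},
    {"id": 2, "name": "Ravi", "dept": "IT"},
]

# Semi-structured: JSON records overlap in structure, but fields vary per record.
semi_structured = """
[
  {"id": 1, "name": "Asha", "skills": ["recruiting"]},
  {"id": 2, "name": "Ravi", "manager": "Asha", "projects": {"alpha": "lead"}}
]
"""

records = json.loads(semi_structured)
for rec in records:
    # Fields beyond id/name may or may not exist, so code must handle them defensively.
    extra_fields = sorted(set(rec) - {"id", "name"})
    print(rec["name"], "->", extra_fields)
```

Note how an analysis over the semi-structured records cannot assume a column exists, which is exactly why such data is harder to query than a relational table.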
In this slide, we are going to talk about some objective parameters that can help you identify big data.
If your data is generated at an extremely rapid rate, say billions of transactions per second, this is already a hint that you are moving in the direction of big data. Traditional data, in contrast, grows on a per-day or per-hour basis.
Secondly, big data is mostly highly unstructured, or at best semi-structured, and it is extremely difficult to analyze. Traditional data is structured, and hence our conventional data tools are able to manage it.
Such data mostly comes from complex, heterogeneous, distributed systems, whereas traditional data can be picked up from a single source.
Since big data is not in a form that fits a structured tabular format, there is a huge hue and cry about non-conventional NoSQL and Hadoop systems, whereas structured data can live inside relational databases.
- Traditional data can change at any time, for example bank account information or phone numbers, so you need multiple repeated updates there. Big data, in contrast, mostly represents events, for example a purchase in a store or a web page view, and event data by nature doesn’t change. So there is no need for multiple write statements there.
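The "repeated write" versus "write once, repeated read" contrast above can be sketched in a few lines of Python. The record and event names are hypothetical, chosen only to mirror the bank-account and page-view examples:

```python
# Traditional, mutable record: updated in place whenever the real-world value changes.
account = {"owner": "Asha", "phone": "555-0100"}
account["phone"] = "555-0199"  # repeated writes to the same record

# Big-data style event log: each event is immutable and only ever appended.
events = []

def record_event(event_type, detail):
    """Append an immutable event; existing entries are never rewritten."""
    events.append({"type": event_type, "detail": detail})

record_event("page_view", "/home")
record_event("purchase", "order-42")

# Analysis re-reads the whole log as often as needed: write once, repeated read.
purchases = [e for e in events if e["type"] == "purchase"]
```

Systems like HDFS are built around exactly this append-and-scan access pattern rather than in-place updates.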
- Because data is being collected from several heterogeneous sources, data integration is a difficult task. This is the reason so many ETL tools are doing the rounds in the market, whereas with traditional data you don’t need any such tools.
Analytics, as we know, is the process of examining data sets to discover hidden patterns or unknown correlations. The ultimate stakeholders of such a process are the business owners, because such analytics are supposed to help businesses take informed decisions. There are four different types of big data analytics.
Descriptive analytics describes what happened in a particular situation. It creates reports or simple visualizations to help you understand what happened at one particular time or over a period of time. It is the least complex type and doesn’t involve any AI or ML techniques. (Helps you understand the past.) For example: summarizing the success of a marketing campaign on social media.
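A descriptive-analytics step like the campaign summary above amounts to computing simple aggregates over past records. Here is a minimal sketch in plain Python; the engagement numbers and field names are hypothetical:

```python
from collections import Counter

# Hypothetical social-media engagement records for one marketing campaign.
posts = [
    {"platform": "twitter", "likes": 120, "shares": 30},
    {"platform": "twitter", "likes": 80, "shares": 10},
    {"platform": "facebook", "likes": 200, "shares": 55},
]

# Descriptive analytics answers "what happened?" with plain aggregates.
total_likes = sum(p["likes"] for p in posts)
posts_per_platform = Counter(p["platform"] for p in posts)
avg_shares = sum(p["shares"] for p in posts) / len(posts)

print(f"total likes: {total_likes}")          # total likes: 400
print(f"posts: {dict(posts_per_platform)}")
print(f"avg shares per post: {avg_shares:.1f}")
```

No model is trained and nothing is predicted: the output only summarizes what already happened, which is exactly the boundary between descriptive analytics and the later types.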
Diagnostic analytics answers the question of why something happened at all. It allows analysts to dive deep into data and really understand the root cause of a problem or event. Such analysis is moderately complex and can involve AI and ML techniques. (Helps you understand the past.) For example: why did a marketing campaign succeed? Was it the number of posts made, the number of followers, the number of mentions, and so on?
Predictive analytics helps you answer what can happen in the future. Such analytics is complex because it involves the use of highly advanced algorithms. (Helps you understand the future)
Prescriptive analytics helps you answer what you should do in order to achieve a desired result. Such analysis requires highly complex machine learning techniques that very few tools are able to offer. Example: your GPS device makes use of geospatial data to suggest the route you should take. (Helps you understand the future.)
A data warehouse is the electronic storage of a large amount of information by a business, designed for query and analysis rather than transaction processing. Warehousing is a process of transforming data into information and making it available to users in a timely manner so it can make a difference.
data mining: finding patterns in data
https://www.guru99.com/data-warehousing.html#13
Data science is an umbrella term that includes several data processes. You may think of data science as the process of preparing a meal.
The first step of preparing food is collecting the raw materials, then cleaning and cutting them. Similarly, in data science the first process is munging, i.e. collecting data from several sources using ETL tools and cleansing it.
Second is cooking or processing the food. Similarly, in data science you explore the data to find hidden patterns and unknown correlations.
Thirdly, you serve the food in a presentable way. Similarly, data science involves generating reports or dashboards so that business owners are able to make sense of the data.
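The three stages of the meal analogy, munging, exploration, and representation, can be sketched as one tiny pipeline. Everything here is hypothetical toy data invented for illustration: two "sources" with inconsistent formats, a cleaning step, a simple exploratory statistic, and a readable report.

```python
# Hypothetical raw inputs from two heterogeneous sources with messy formats.
source_a = ["  Alice,34 ", "Bob,29"]                 # CSV-like strings
source_b = [{"name": "carol", "age": "41"}]          # dicts with string ages

# 1) Munging: collect, clean, and normalize everything into one shape.
def munge():
    rows = []
    for line in source_a:
        name, age = line.strip().split(",")
        rows.append({"name": name.strip().title(), "age": int(age)})
    for rec in source_b:
        rows.append({"name": rec["name"].title(), "age": int(rec["age"])})
    return rows

# 2) Exploration: look for a simple pattern (here, just the average age).
def explore(rows):
    return sum(r["age"] for r in rows) / len(rows)

# 3) Representation: report the finding in a form a business owner can read.
rows = munge()
print(f"{len(rows)} people, average age {explore(rows):.1f}")
```

In a real project each stage would use heavier machinery (ETL tools for munging, statistical or ML methods for exploration, dashboards for representation), but the shape of the pipeline is the same.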