SlideShare a Scribd company logo
1 of 57
Chapter 2
Data science
1
Introduction
•In this chapter, we gonna learn more about
•data science
•Data Vs Information
•Data types and representation
•Data value chain and
•Basic concept of Big data
After completing this chapter, the students will be able to:
• Describe what data science is and the role of data
scientists.
• Differentiate data and information.
• Describe data processing life cycle
• Understand different data types from diverse perspectives
• Describe data value chain in emerging era of big data.
• Understand the basics of Big Data.
• Describe the purpose of the Hadoop ecosystem components.
2.1. An Overview of Data
Science
Self test Exercise
•What is data science?
•Can you describe the role of data in
emerging technology?
• What are data and information?
•What is big data?
Con’t…
• Data science is a multi-disciplinary field that uses scientific
methods, processes, algorithms, and systems to extract
knowledge and insights from
• structured data (relational database): is a data whose
elements are addressable for effective analysis.
• It has been organized into a formatted repository that is typically
a database.
• They have relational key and can easily mapped into pre-designed
fields, Example: Relational data.
• semi-structured data: is information that does not reside in a rational
database but that have some organizational properties that make it easier
to analyze.
• Example: XML data.
Con’t…
• Unstructured data: is a data that is which is not organized in a
pre-defined manner or does not have a pre-defined data model.
• it is not a good fit for a mainstream relational database.
• it is increasingly prevalent in IT systems and is used by
organizations in a variety of business intelligence and analytics
applications. Example: Word, PDF, Text, Media logs.
Self test Exercise
What’s The Difference Between Structured, Semi-Structured
And Unstructured Data? Discuss in detail
Con’t…
• Data science is much more than simply analyzing data
E.g Supermarket data- On Thursday nights people who buy milk also tend
to buy beer .
• catalog order of items,
• Know the profile of your best customer,
• Predict more wanted item for the next 2 month,
• The probability of customer who buy computer also buy
pen drive.
Self-test Exercise
• Describe in some detail the main disciplines that
contribute to data science.
2.1.1 What are data and information?
• Data: a representation of facts, concepts, or instructions in a
formalized manner.
• suitable for communication, interpretation, or processing, by
human or electronic machines.
• represented with the help of characters such as alphabets,
digits or special characters.
• Information :is the processed data on which decisions and
actions are based.
• It is data that has been processed into a form that is
meaningful to the recipient
Con’t…
Data
 raw facts
 no context
 just numbers and text
Information
 data with context
 processed data
 value-added to data
 summarized
 organized
 analyzed
Con’t…
 Data: 180413
 Information:
 18/04/13 The date of your final exam.
 18,0413 Birr The average starting salary of an accounting
major.
 180413 Zip code of Dessie.
2.1.2. Data Processing Cycle
• Data processing is the re-structuring or re-ordering of data by
people or machines to increase their usefulness
• three steps constitute the data processing cycle.
• Input − in this step, the input data is prepared in some
convenient form for processing. The form will depend on the
processing machine.
• E.g when electronic computers are used, the input data can be
recorded on any one of the several types of storage medium,
such as hard disk, CD, flash disk and so on
Con’t…
• Processing − in this step, the input data is changed to produce
data in a more useful form.
• For example, interest can be calculated on deposit to a bank, or
a summary of sales for the month can be calculated from the
sales orders.
• Output − at this stage, the result of the proceeding
processing step is collected.
• The particular form of the output data depends on the use
of the data. For example, output data may be payroll for
employees.
Self Test Exercise
• Discuss the main differences between data and information
with examples
• Can we process data manually using a pencil and paper?
• Discuss the differences with data processing using the
computer with manually.
2.3 Data types and their representation
• Data types can be described from diverse perspectives.
• In CS and CP, for instance, a data type is simply an attribute
of data that tells the compiler or interpreter how the
programmer intends to use the data.
Int x=5;
String name=“Melaku”;
Float final pi=3.14;
2.3.1. Data types from Computer programming
perspective
• What is Programming language?
• Almost all programming languages explicitly include the notion of
data type, though different languages may use different
terminology. Common data types include:
• Integers(int)- is used to store whole numbers, mathematically
known as integers
• If integer is not sufficient use long int
• Booleans(bool)- is used to represent restricted to one of two
values: true or false, E.g Boolean is_she_come=True;
• Characters(char)- is used to store a single character, E.g Char
x=‘M’;
• If char is not sufficient use string
• Float decimal(float)- is used to store a number with decimals, E.g
float x=24.5
• If float is not sufficient use double
Con’t
• Explain types of Programming language.
• A data type makes the values that expression, such as a
variable or a function, might take.
• This data type defines the operations that can be done on
the data, the meaning of the data, and the way values of
that type can be stored.
2.3.2. Data types from Data Analytics
perspective
• From a data analytics point of view, it is important to
understand that there are three common types of data types
or structures:
• Structured
• Semi-structured, and
• Unstructured data type
• Structured
Con’t…
Structured Data: data that adheres to a pre-defined data
model and is therefore straightforward to analyze.
• conforms to a tabular format with a relationship between the
different rows and columns.
• Common examples of structured data are Excel files or SQL
databases. Each of these has structured rows and columns that
can be sorted
Con’t…
Semi-structured Data: is a form of structured data that
does not conform with the formal structure of data models
associated with relational databases or other forms of data
tables.
• contains tags or other markers to separate semantic
elements and enforce hierarchies of records and fields
within the data.
• Therefore, it is also known as a self-describing structure.
• Examples of semi-structured data include JSON
and XML are forms of semi-structured data.
Con’t…
Unstructured data: is information that either does not have a
predefined data model or is not organized in a pre-defined
manner.
• Unstructured information is typically text-heavy but may
contain data such as dates, numbers, and facts as well.
• results in irregularities and ambiguities that make it difficult
to understand using traditional programs as compared to data
stored in structured databases
• Common examples of unstructured data include audio, video
files or NoSQL databases, social media data(fb,
twitter,instagram…).
Con’t…
Metadata – Data about Data: this is not a separate data
structure, but it is one of the most important elements for Big
Data analysis and big data solutions.
• Metadata is data about data. It provides additional information
about a specific set of data
• In a set of photographs, for example, metadata could
describe when and where the photos were taken.
• The metadata then provides fields for dates and locations
which, by themselves, can be considered structured data.
Because of this reason, metadata is frequently used by Big
Data solutions for initial analysis.
Con’t…
Self Test Exercise
o Discuss data types from programming and analytics
perspectives.
o Compare metadata with structured, unstructured and
semi-structured data
o Given at least one example of structured, unstructured
and semi-structured data types
2.4. Data value Chain
• Is introduced to describe the information flow within a big
data system as a series of steps needed to generate value and
useful insights from data.
• The Big Data Value Chain identifies the following key high-level
activities:
2.4.1. Data Acquisition
• It is the process of gathering, filtering, and cleaning data
before it is put in a data warehouse .
• Data warehouse is a repository of information collected
from multiple sources, stored under a unified schema, and
which usually resides at a single site.
• A data warehouse is a decision support
database that is maintained separately from
the organization’s operational database.
Con’t…
• Data acquisition is one of the major big data
challenges in terms of infrastructure requirements.
• The infrastructure required to support the
acquisition of big data must deliver low, predictable
latency in both capturing data and in executing
queries.
• be able to handle very high transaction volumes,
often in a distributed environment; and support
flexible and dynamic data structures.
2.4.2. Data Analysis
• It is concerned with making the raw data acquired amenable to
use in decision-making as well as domain-specific usage.
• Data analysis involves exploring, transforming, and modeling
data with the goal of highlighting relevant data, synthesizing
and extracting useful hidden information with high potential
from a business point of view.
• Related areas include data mining, business intelligence, and
machine learning
2.4.3. Data Curation
• It is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for
its effective usage.
• Data curation processes can be categorized into different
activities such as content creation, selection, classification,
transformation, validation, and preservation.
• Data curators (scientific curators or data annotators) hold
the responsibility of ensuring that data are trustworthy,
discoverable, accessible, reusable and fit their purpose.
2.4.4. Data Storage
• It is the persistence and management of data in a
scalable way that satisfies the needs of applications that
require fast access to the data.
• RDBMS have been the main, and almost unique, a solution
to the storage paradigm for nearly 40 years.
• However, the ACID (Atomicity, Consistency, Isolation,
and Durability) properties that guarantee database
transactions lack flexibility with regard to schema
changes
Con’t…
• the performance and fault tolerance when data
volumes and complexity grow, making them
unsuitable for big data scenarios.
• What is NoSql Database? Compare it with Sql.
• NoSQL technologies have been designed with the
scalability goal in mind and present a wide range of
solutions based on alternative data models.
2.4.5. Data Usage
• It covers the data-driven business activities that need
access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
• Data usage in business decision-making can enhance
competitiveness through the reduction of costs, increased
added value, or any other parameter that can be measured
against existing performance criteria.
Self Test Exercise
• Which information flow step in the data value chain you think is
labor-intensive? Why?
• What are the different data types and their value chain?
2.5. Basic concepts of big data
Big Data Every Where!
•Lots of data is being collected
and warehoused
• Web data, e-commerce
• purchases at department/
grocery stores
• Bank/Credit Card
transactions
• Social Network
Con’t…
• Big data is a blanket term for the non-traditional strategies
and technologies needed to gather, organize, process, and
gather insights from large datasets.
• While the problem of working with data that exceeds the
computing power or storage of a single computer is not new,
the pervasiveness, scale, and value of this type of computing
have greatly expanded in recent years.
2.5.1. What Is Big Data?
• is the term for a collection of data sets so large and
complex that it becomes difficult to process using on-
hand database management tools or traditional data
processing applications.
• a “large dataset” means a dataset too large to reasonably
process or store with traditional tooling or on a single
computer.
• the common scale of big datasets is constantly shifting
and may vary significantly from organization to
organization
Bigdata: 4/5 V’s
• Big data is characterized by 4V and more:
• Volume: Machine generated data is produced in larger
quantities than non traditional data.
• large amounts of data Zeta bytes/Massive datasets
• Velocity: the speed of data processing.
• Data is live streaming or in motion
• Variety: large variety of input data which in turn generates
large amount of data as output.
• data comes in many different forms from diverse sources
• Veracity: can we trust the data? How accurate is it? etc.
• Value: for what purpose we are going to use data
Con’t…
2.5.2. Clustered Computing and Hadoop
Ecosystem
2.5.2.1.Clustered Computing
• To better address the high storage and computational
needs of big data, computer clusters are a better fit.
• Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits:
• Resource Pooling: Combining the available storage space to
hold data is a clear benefit, but CPU and memory pooling are
also extremely important.
Con’t…
• High Availability: Clusters can provide varying levels of
fault tolerance and availability guarantees to prevent
hardware or software failures
• Easy Scalability: Clusters make it easy to scale horizontally
by adding additional machines to the group.
• clusters requires a solution for managing cluster
membership, coordinating resource sharing, and scheduling
actual work on individual nodes.
Con’t…
• Cluster membership and resource allocation can be handled
by software like Hadoop’s YARN (which stands for Yet
Another Resource Negotiator).
• assembled computing cluster often acts as a foundation that
other software interfaces with to process the data.
•
Self Test Exercise
• List and discuss the characteristics of big data ➢ Describe the big
data life cycle. Which step you think most useful and why?
• List and describe each technology or tool used in the big data life
cycle.
• Discuss the three methods of computing over a large dataset.
How much data?
•Google processes more than 20 PB a day (2018)
•Wayback Machine has 3 PB + 100 TB/month
(3/2009)
•Facebook process 500+TB/day (2019)
•eBay has 6.5 PB of user data + 50 TB/day (5/2009)
•CERN’s Large Hydron Collider (LHC) generates 15
PB a year
Con’t…
Who is collecting what?
2/25/2023 43
 Credit Card Companies  What data are they
getting?
Restaurant check
Grocery Bill
Airline ticket
Hotel Bill
How Can You Avoid Big Data?
• Pay cash for everything!
• Never go online!
• Don’t use a telephone! How????
• Don’t fill any prescriptions!
• Never leave your house!
Examples of big data…..
• Walmart handles more than 1 million customer
transactions every hour, which is imported into
databases estimated to contain more than 2.5
petabytes * of data — the equivalent of 167 times
the information contained in all the books in the US
Library of Congress.
• FICO Credit Card Fraud Detection System protects
2.1 billion active accounts world-wide.
Examples of big data…..
• NASA Climate Simulation
32 petabytes
• The Large Hadron Collider
25 petabytes annually, 200 petabytes after
replication
• Wall mart
2.5 petabytes per hour
Who’s Generating Big Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the
time)
Sensor technology and
networks
(measuring all kinds of data)
2.5.2.2.Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make
interaction with big data easier.
• It is a framework that allows for the distributed processing
of large datasets across clusters of computers using simple
programming models.
• Designed to answer the question: “How to process big data
with reasonable cost and time?”
Con’t…
• inspired by a technical document published by Google. The four
key characteristics of Hadoop are:
• Economical: highly economical as ordinary computers can be used
for data processing.
• Reliable: stores copies of the data on different machines and is
resistant to hardware failure.
• Scalable: It is easily scalable both, horizontally and vertically.
• Flexible: can store as much structured and unstructured data .
Con’t…
• Hadoop has an ecosystem that has evolved from its four
core components:
• data management,
• access,
• processing, and
• storage.
• It is continuously growing to meet the needs of Big Data.
Component's of Big Data(Hadoop Ecosystem)
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm
libraries
• Solar, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
Assignment
• discuss the purpose of each Hadoop Ecosystem components?
2.5.3. Big Data Life Cycle with Hadoop
2.5.3.1. Ingesting data into the system
• The first stage of Big Data processing is Ingest.
• The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
• Sqoop transfers data from RDBMS to HDFS, whereas Flume
transfers event data.
2.5.3.2. Processing the data in storage
• The second stage is Processing. In this stage, the data is
stored and processed.
• The data is stored in the distributed file system, HDFS,
and the NoSQL distributed data, HBase.
• Spark and MapReduce perform data processing.
2.5.3.3. Computing and analyzing data
• The third stage is to Analyze. Here, the data is analyzed by
processing frameworks such as Pig, Hive, and Impala.
• Pig converts the data using a map and reduce and then
analyzes it.
• Hive is also based on the map and reduce programming and
is most suitable for structured data.
2.5.3.4. Visualizing the results
• The fourth stage is Access, which is performed by tools
such as Hue and Cloudera Search.
• In this stage, the analyzed data can be accessed by users.
Questions?

More Related Content

Similar to Data Science Chapter on Data Types and Value Chain

Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basicNivaTripathy2
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxJethroDignadice2
 
ISBB_Chapter4.pptx
ISBB_Chapter4.pptxISBB_Chapter4.pptx
ISBB_Chapter4.pptxpurnecole
 
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEMM. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEMDr.Florence Dayana
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptxinfinix8
 
Data Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student helpData Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student helpSaurabhJaiswal790114
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptxAnusuya123
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining Jeremiah Fadugba
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And DesignLijo Stalin
 

Similar to Data Science Chapter on Data Types and Value Chain (20)

Digital data
Digital dataDigital data
Digital data
 
Digital Types
Digital TypesDigital Types
Digital Types
 
Data mining concept and methods for basic
Data mining concept and methods for basicData mining concept and methods for basic
Data mining concept and methods for basic
 
Chapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptxChapter 2 - Intro to Data Sciences[2].pptx
Chapter 2 - Intro to Data Sciences[2].pptx
 
ISBB_Chapter4.pptx
ISBB_Chapter4.pptxISBB_Chapter4.pptx
ISBB_Chapter4.pptx
 
Data Mining-2023 (2).ppt
Data Mining-2023 (2).pptData Mining-2023 (2).ppt
Data Mining-2023 (2).ppt
 
Unit 2 DATABASE ESSENTIALS.pptx
Unit 2 DATABASE ESSENTIALS.pptxUnit 2 DATABASE ESSENTIALS.pptx
Unit 2 DATABASE ESSENTIALS.pptx
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
ITFT- Dbms
ITFT- DbmsITFT- Dbms
ITFT- Dbms
 
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEMM. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
M. FLORENCE DAYANA/DATABASE MANAGEMENT SYSYTEM
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
TOPIC.pptx
TOPIC.pptxTOPIC.pptx
TOPIC.pptx
 
RowanDay4.pptx
RowanDay4.pptxRowanDay4.pptx
RowanDay4.pptx
 
Data Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student helpData Analytics-Unit 1 , this Is ppt for student help
Data Analytics-Unit 1 , this Is ppt for student help
 
Introduction to Data Science.pptx
Introduction to Data Science.pptxIntroduction to Data Science.pptx
Introduction to Data Science.pptx
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining
 
Data science unit1
Data science unit1Data science unit1
Data science unit1
 
What is Data?
What is Data?What is Data?
What is Data?
 
Data concepts
Data conceptsData concepts
Data concepts
 
System Analysis And Design
System Analysis And DesignSystem Analysis And Design
System Analysis And Design
 

More from Wollo UNiversity

More from Wollo UNiversity (6)

grade 11-ict_fetena_net_c954.pdf
grade 11-ict_fetena_net_c954.pdfgrade 11-ict_fetena_net_c954.pdf
grade 11-ict_fetena_net_c954.pdf
 
Chapter 3.pptx
Chapter 3.pptxChapter 3.pptx
Chapter 3.pptx
 
Chapter 5.pptx
Chapter 5.pptxChapter 5.pptx
Chapter 5.pptx
 
Chapter 2.pptx
Chapter 2.pptxChapter 2.pptx
Chapter 2.pptx
 
Chapter 4.pptx
Chapter 4.pptxChapter 4.pptx
Chapter 4.pptx
 
Chapter 1.pptx
Chapter 1.pptxChapter 1.pptx
Chapter 1.pptx
 

Recently uploaded

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 

Recently uploaded (20)

PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Decoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in ActionDecoding Loan Approval: Predictive Modeling in Action
Decoding Loan Approval: Predictive Modeling in Action
 

Data Science Chapter on Data Types and Value Chain

  • 2. Introduction •In this chapter, we gonna learn more about •data science •Data Vs Information •Data types and representation •Data value chain and •Basic concept of Big data
  • 3. After completing this chapter, the students will be able to: • Describe what data science is and the role of data scientists. • Differentiate data and information. • Describe data processing life cycle • Understand different data types from diverse perspectives • Describe data value chain in emerging era of big data. • Understand the basics of Big Data. • Describe the purpose of the Hadoop ecosystem components.
  • 4. 2.1. An Overview of Data Science Self test Exercise •What is data science? •Can you describe the role of data in emerging technology? • What are data and information? •What is big data?
  • 5. Con’t… • Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from • structured data (relational database): is a data whose elements are addressable for effective analysis. • It has been organized into a formatted repository that is typically a database. • They have relational key and can easily mapped into pre-designed fields, Example: Relational data. • semi-structured data: is information that does not reside in a rational database but that have some organizational properties that make it easier to analyze. • Example: XML data.
  • 6. Con’t… • Unstructured data: is a data that is which is not organized in a pre-defined manner or does not have a pre-defined data model. • it is not a good fit for a mainstream relational database. • it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, Text, Media logs. Self test Exercise What’s The Difference Between Structured, Semi-Structured And Unstructured Data? Discuss in detail
  • 7. Con’t… • Data science is much more than simply analyzing data E.g Supermarket data- On Thursday nights people who buy milk also tend to buy beer . • catalog order of items, • Know the profile of your best customer, • Predict more wanted item for the next 2 month, • The probability of customer who buy computer also buy pen drive. Self-test Exercise • Describe in some detail the main disciplines that contribute to data science.
  • 8. 2.1.1 What are data and information? • Data: a representation of facts, concepts, or instructions in a formalized manner. • suitable for communication, interpretation, or processing, by human or electronic machines. • represented with the help of characters such as alphabets, digits or special characters. • Information :is the processed data on which decisions and actions are based. • It is data that has been processed into a form that is meaningful to the recipient
  • 9. Con’t… Data  raw facts  no context  just numbers and text Information  data with context  processed data  value-added to data  summarized  organized  analyzed
  • 10. Con’t…  Data: 180413  Information:  18/04/13 The date of your final exam.  18,0413 Birr The average starting salary of an accounting major.  180413 Zip code of Dessie.
  • 11. 2.1.2. Data Processing Cycle • Data processing is the re-structuring or re-ordering of data by people or machines to increase their usefulness • three steps constitute the data processing cycle. • Input − in this step, the input data is prepared in some convenient form for processing. The form will depend on the processing machine. • E.g when electronic computers are used, the input data can be recorded on any one of the several types of storage medium, such as hard disk, CD, flash disk and so on
  • 12. Con’t… • Processing − in this step, the input data is changed to produce data in a more useful form. • For example, interest can be calculated on deposit to a bank, or a summary of sales for the month can be calculated from the sales orders. • Output − at this stage, the result of the proceeding processing step is collected. • The particular form of the output data depends on the use of the data. For example, output data may be payroll for employees.
  • 13. Self Test Exercise • Discuss the main differences between data and information with examples • Can we process data manually using a pencil and paper? • Discuss the differences with data processing using the computer with manually.
  • 14. 2.3 Data types and their representation • Data types can be described from diverse perspectives. • In CS and CP, for instance, a data type is simply an attribute of data that tells the compiler or interpreter how the programmer intends to use the data. Int x=5; String name=“Melaku”; Float final pi=3.14;
  • 15. 2.3.1. Data types from Computer programming perspective • What is Programming language? • Almost all programming languages explicitly include the notion of data type, though different languages may use different terminology. Common data types include: • Integers(int)- is used to store whole numbers, mathematically known as integers • If integer is not sufficient use long int • Booleans(bool)- is used to represent restricted to one of two values: true or false, E.g Boolean is_she_come=True; • Characters(char)- is used to store a single character, E.g Char x=‘M’; • If char is not sufficient use string • Float decimal(float)- is used to store a number with decimals, E.g float x=24.5 • If float is not sufficient use double
  • 16. Con’t • Explain types of Programming language. • A data type makes the values that expression, such as a variable or a function, might take. • This data type defines the operations that can be done on the data, the meaning of the data, and the way values of that type can be stored.
  • 17. 2.3.2. Data types from Data Analytics perspective • From a data analytics point of view, it is important to understand that there are three common types of data types or structures: • Structured • Semi-structured, and • Unstructured data type • Structured
  • 18. Con’t… Structured Data: data that adheres to a pre-defined data model and is therefore straightforward to analyze. • conforms to a tabular format with a relationship between the different rows and columns. • Common examples of structured data are Excel files or SQL databases. Each of these has structured rows and columns that can be sorted
  • 19. Con’t… Semi-structured Data: is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables. • contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. • Therefore, it is also known as a self-describing structure. • Examples of semi-structured data include JSON and XML are forms of semi-structured data.
  • 20. Con’t… Unstructured data: is information that either does not have a predefined data model or is not organized in a pre-defined manner. • Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well. • results in irregularities and ambiguities that make it difficult to understand using traditional programs as compared to data stored in structured databases • Common examples of unstructured data include audio, video files or NoSQL databases, social media data(fb, twitter,instagram…).
  • 21. Con’t… Metadata – Data about Data: this is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions. • Metadata is data about data. It provides additional information about a specific set of data • In a set of photographs, for example, metadata could describe when and where the photos were taken. • The metadata then provides fields for dates and locations which, by themselves, can be considered structured data. Because of this reason, metadata is frequently used by Big Data solutions for initial analysis.
  • 22. Con’t… Self Test Exercise o Discuss data types from programming and analytics perspectives. o Compare metadata with structured, unstructured and semi-structured data o Given at least one example of structured, unstructured and semi-structured data types
  • 23. 2.4. Data value Chain • Is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data. • The Big Data Value Chain identifies the following key high-level activities:
  • 24. 2.4.1. Data Acquisition • It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse . • Data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and which usually resides at a single site. • A data warehouse is a decision support database that is maintained separately from the organization’s operational database.
  • 25. Con’t… • Data acquisition is one of the major big data challenges in terms of infrastructure requirements. • The infrastructure required to support the acquisition of big data must deliver low, predictable latency in both capturing data and in executing queries. • be able to handle very high transaction volumes, often in a distributed environment; and support flexible and dynamic data structures.
  • 26. 2.4.2. Data Analysis • It is concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage. • Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data, synthesizing and extracting useful hidden information with high potential from a business point of view. • Related areas include data mining, business intelligence, and machine learning
  • 27. 2.4.3. Data Curation • It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage. • Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation. • Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable and fit their purpose.
  • 28. 2.4.4. Data Storage • It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data. • RDBMS have been the main, and almost unique, a solution to the storage paradigm for nearly 40 years. • However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions lack flexibility with regard to schema changes
  • 29. Con’t… • the performance and fault tolerance when data volumes and complexity grow, making them unsuitable for big data scenarios. • What is NoSql Database? Compare it with Sql. • NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
  • 30. 2.4.5. Data Usage • It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity. • Data usage in business decision-making can enhance competitiveness through the reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
  • 31. Self Test Exercise • Which information flow step in the data value chain you think is labor-intensive? Why? • What are the different data types and their value chain?
  • 32. 2.5. Basic concepts of big data Big Data Every Where! •Lots of data is being collected and warehoused • Web data, e-commerce • purchases at department/ grocery stores • Bank/Credit Card transactions • Social Network
  • 33. Con’t… • Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gather insights from large datasets. • While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years.
  • 34. 2.5.1. What Is Big Data? • is the term for a collection of data sets so large and complex that it becomes difficult to process using on- hand database management tools or traditional data processing applications. • a “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer. • the common scale of big datasets is constantly shifting and may vary significantly from organization to organization
  • 35. Bigdata: 4/5 V’s • Big data is characterized by 4V and more: • Volume: Machine generated data is produced in larger quantities than non traditional data. • large amounts of data Zeta bytes/Massive datasets • Velocity: the speed of data processing. • Data is live streaming or in motion • Variety: large variety of input data which in turn generates large amount of data as output. • data comes in many different forms from diverse sources • Veracity: can we trust the data? How accurate is it? etc. • Value: for what purpose we are going to use data
  • 37. 2.5.2. Clustered Computing and Hadoop Ecosystem 2.5.2.1.Clustered Computing • To better address the high storage and computational needs of big data, computer clusters are a better fit. • Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits: • Resource Pooling: Combining the available storage space to hold data is a clear benefit, but CPU and memory pooling are also extremely important.
  • 38. Con’t… • High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware or software failures • Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group. • clusters requires a solution for managing cluster membership, coordinating resource sharing, and scheduling actual work on individual nodes.
  • 39. Con’t… • Cluster membership and resource allocation can be handled by software like Hadoop’s YARN (which stands for Yet Another Resource Negotiator). • assembled computing cluster often acts as a foundation that other software interfaces with to process the data. •
  • 40. Self Test Exercise • List and discuss the characteristics of big data ➢ Describe the big data life cycle. Which step you think most useful and why? • List and describe each technology or tool used in the big data life cycle. • Discuss the three methods of computing over a large dataset.
  • 41. How much data? •Google processes more than 20 PB a day (2018) •Wayback Machine has 3 PB + 100 TB/month (3/2009) •Facebook process 500+TB/day (2019) •eBay has 6.5 PB of user data + 50 TB/day (5/2009) •CERN’s Large Hydron Collider (LHC) generates 15 PB a year
  • 43. Who is collecting what? 2/25/2023 43  Credit Card Companies  What data are they getting? Restaurant check Grocery Bill Airline ticket Hotel Bill
  • 44. How Can You Avoid Big Data? • Pay cash for everything! • Never go online! • Don’t use a telephone! How???? • Don’t fill any prescriptions! • Never leave your house!
  • 45. Examples of big data….. • Walmart handles more than 1 million customer transactions every hour, which is imported into databases estimated to contain more than 2.5 petabytes * of data — the equivalent of 167 times the information contained in all the books in the US Library of Congress. • FICO Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide.
  • 46. Examples of big data….. • NASA Climate Simulation 32 petabytes • The Large Hadron Collider 25 petabytes annually, 200 petabytes after replication • Wall mart 2.5 petabytes per hour
  • 47. Who’s Generating Big Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data)
  • 48. 2.5.2.2.Hadoop and its Ecosystem • Hadoop is an open-source framework intended to make interaction with big data easier. • It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple programming models. • Designed to answer the question: “How to process big data with reasonable cost and time?”
  • 49. Con’t… • inspired by a technical document published by Google. The four key characteristics of Hadoop are: • Economical: highly economical as ordinary computers can be used for data processing. • Reliable: stores copies of the data on different machines and is resistant to hardware failure. • Scalable: It is easily scalable both, horizontally and vertically. • Flexible: can store as much structured and unstructured data .
  • 50. Con’t… • Hadoop has an ecosystem that has evolved from its four core components: • data management, • access, • processing, and • storage. • It is continuously growing to meet the needs of Big Data.
  • 51. Component's of Big Data(Hadoop Ecosystem) • HDFS: Hadoop Distributed File System • YARN: Yet Another Resource Negotiator • MapReduce: Programming based Data Processing • Spark: In-Memory data processing • PIG, HIVE: Query-based processing of data services • HBase: NoSQL Database • Mahout, Spark MLLib: Machine Learning algorithm libraries • Solar, Lucene: Searching and Indexing • Zookeeper: Managing cluster • Oozie: Job Scheduling
  • 52. Assignment • discuss the purpose of each Hadoop Ecosystem components?
  • 53. 2.5.3. Big Data Life Cycle with Hadoop 2.5.3.1. Ingesting data into the system • The first stage of Big Data processing is Ingest. • The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files. • Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
  • 54. 2.5.3.2. Processing the data in storage • The second stage is Processing. In this stage, the data is stored and processed. • The data is stored in the distributed file system, HDFS, and the NoSQL distributed data, HBase. • Spark and MapReduce perform data processing.
  • 55. 2.5.3.3. Computing and analyzing data • The third stage is to Analyze. Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala. • Pig converts the data using a map and reduce and then analyzes it. • Hive is also based on the map and reduce programming and is most suitable for structured data.
  • 56. 2.5.3.4. Visualizing the results • The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search. • In this stage, the analyzed data can be accessed by users.