Introduction to Data Science
What is Data Science?
Buzzwords in Data Science:
Data Analysis, Data Mining, Statistical Analysis, Big Data, Machine Learning
Data Science is:
● An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract
knowledge and insights from data in various forms, both structured and unstructured.
● An area that manages, manipulates, extracts and interprets knowledge from tremendous amounts
of data.
The Data Science Lifecycle
Types of Data
● Data is a set of raw facts and values, such as observations and descriptions, that
must be analyzed and processed to make them more meaningful.
● Data comes from a variety of sources and in a variety of formats.
● Data can be classified into the following types:
➢ Structured data
➢ Semi-structured data
➢ Unstructured data
Structured Data
Structured data is well organized, highly specific, and stored in a predefined format.
Structured data is information that has been formatted and transformed into a well-defined data model.
The data types and formats in structured data are clearly defined.
It is easy to read, organize, query, manage and store structured data using programming
languages and tools.
Characteristics of Structured data
● Data conforms to a data model and has an easily identifiable structure.
● Data is stored in the form of rows and columns. Example: a database table.
● Data is well organized, so the definition, format and meaning of the data are explicitly known.
● Data resides in fixed fields within a record or file.
● Similar entities are grouped together to form relations or classes.
● Entities in the same group have the same attributes.
● Data is easy to access and query, so it can be easily used by other programs.
● Data elements are addressable, so they are efficient to analyze and process.
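As an illustration of these characteristics, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names and sample rows are hypothetical and chosen only for the example; they are not taken from the slides.

```python
import sqlite3

# In-memory database; the table and its columns are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "acc_no INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
rows = [
    (1001, "Asha", "Pune"),
    (1002, "Ravi", "Mumbai"),
    (1003, "Meera", "Pune"),
]
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)", rows)

# Because every row has the same named fields, a query can address them directly.
pune_names = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY acc_no", ("Pune",)
    )
]
print(pune_names)
conn.close()
```

Every row has the same attributes with declared types, which is what lets the query select by column name.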
Sources of structured data
Structured data is generated by both humans and machines.
Examples: names, addresses, dates and account numbers are human-generated structured
data.
Machine-generated data refers to data that is created by a machine.
Examples: data generated by sensors, user and activity logs, and barcodes.
Advantages of Structured data
Highly organized
Universally understood
Easily operated upon
More tools available
Less storage
Disadvantages of Structured data
Limited usage
Limited storage options
Difficult to change the format
Expensive
Semi-structured Data
● Semi-structured data is a type of data that is not purely structured, but also not completely unstructured.
It contains some level of organization or structure, but does not conform to a rigid schema or data model, and
may contain elements that are not easily categorized or classified.
● Semi-structured data is typically characterized by the use of metadata or tags that provide additional information
about the data elements. For example, an XML document might contain tags that indicate the structure of the
document, and may also contain additional tags that provide metadata about the content, such as author, date, or
keywords.
● Semi-structured data is more complex than structured data but less complex than unstructured data.
● Semi-structured data lacks a fixed or rigid schema. It does not reside in a relational database, but it has some
organizational properties that make it easier to analyze. With some processing, it can be stored in a
relational database.
Characteristics of semi-structured Data:
● Data does not conform to a rigid data model but has some structure.
● Data is not stored directly in the form of rows and columns as in databases.
● Semi-structured data contains tags and elements (metadata) that are used to group data and describe
how the data is stored.
● Similar entities are grouped together and organized in a hierarchy.
● Entities in the same group may or may not have the same attributes or properties.
● The metadata may not be sufficient, which makes automation and management of the data difficult.
● The size and type of the same attribute may differ within a group.
● Due to the lack of a well-defined structure, it cannot be used by computer programs as easily as structured data.
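To make the idea of tagged, hierarchical data with differing attributes concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The XML document, its tag names and the chosen columns are hypothetical. The sketch also shows one possible way to flatten such data into rows, which is only one of several approaches.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: both <book> elements share a tag,
# but only the first carries a <keywords> child.
xml_text = """
<library>
  <book id="1">
    <title>Data Science Basics</title>
    <author>A. Author</author>
    <keywords>statistics, python</keywords>
  </book>
  <book id="2">
    <title>Databases</title>
    <author>B. Writer</author>
  </book>
</library>
"""

root = ET.fromstring(xml_text)

# Flatten each <book> into a fixed set of columns; missing fields become None.
columns = ["title", "author", "keywords"]
rows = []
for book in root.findall("book"):
    row = {"id": book.get("id")}
    for col in columns:
        row[col] = book.findtext(col)  # None when the tag is absent
    rows.append(row)

for row in rows:
    print(row)
```

The second book has no keywords, so its row holds None in that column once the data is forced into a fixed set of fields.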
Advantages of semi structured data
● The data is not constrained by a fixed schema.
● It is flexible, i.e. the schema can be easily changed.
● Data is portable.
● It is possible to view structured data as semi-structured data.
● It supports users who cannot express their needs in SQL.
● It can deal easily with the heterogeneity of sources.
Disadvantages of semi structured data
● The lack of a fixed, rigid schema makes the data difficult to store.
● Interpreting the relationships between data is difficult, as there is no separation of the schema and the
data.
● Queries are less efficient than on structured data.
Unstructured Data
Unstructured data is data that does not conform to a data model and has no easily identifiable structure,
so it cannot be used easily by a computer program.
Unstructured data is not organized in a predefined manner and does not have a predefined data model, so it is
not a good fit for a mainstream relational database.
Examples: audio recordings of speech, CCTV video, social media comments.
Characteristics of Unstructured Data:
● Data neither conforms to a data model nor has any structure.
● Data cannot be stored in the form of rows and columns as in databases.
● Data does not follow any semantics or rules.
● Data lacks any particular format or sequence.
● Data has no easily identifiable structure.
● Due to the lack of an identifiable structure, it cannot be used easily by computer programs.
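A short sketch of what this means in practice: before a program can analyze free text, it has to impose some structure of its own. The comment below is a made-up example, and counting words is only one simple way of doing this.

```python
import re
from collections import Counter

# Hypothetical free-text comment, standing in for unstructured data.
comment = "Great food, great service! The food arrived late, though."

# The text has no fields to query, so the program imposes its own structure:
# lower-case the text, split it into words, and count them.
words = re.findall(r"[a-z']+", comment.lower())
counts = Counter(words)
print(counts.most_common(2))
```

Unlike a database column, nothing in the text says which words matter. That interpretation is left to the program and the analyst.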
Advantages of unstructured data
1. Flexible: the data is not constrained by a fixed schema.
2. More applications: since there is no predefined model or
schema, unstructured data can be used for more than one intended purpose.
3. More formatting options
4. Easy storage
5. Heterogeneity
Disadvantages of unstructured data
Requires expertise
Requires specific data tools
Difficult to process
Comparison of Structured, Semi-structured and Unstructured Data
| Structured | Semi-structured | Unstructured |
|---|---|---|
| Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data |
| Easy to process | Can be processed after converting to a structured format | Difficult to process |
| Well organized | Not in a rigid format, but contains tags or metadata | Not organized |
| Mature transaction management and various concurrency techniques | Transaction management is adapted from DBMS but is not mature | No transaction management and no concurrency |
| Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | More flexible; there is no schema |
| Very difficult to scale the database schema | Scaling is simpler than for structured data | More scalable |
| Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible |
Data Sources
Any data science application needs data.
This data can be produced in various ways and from different sources.
● Open data
● Social media data
● Multimodal data
● Standard datasets
Open Data
● Open data may come from any source.
● The data should be publicly available for anyone to use, without restrictions such as copyright or patents.
● Local and federal governments, non-governmental organizations and academic
communities all lead open data initiatives.
Principles
● Public: the data must be open to access as permitted by law and subject to
privacy, confidentiality and security restrictions.
● Accessible
● Described
● Reusable
● Complete
● Timely
● Managed Post-Release
Social Media Data
● Social media is a rich source of data.
● Social media data is the information collected across social media
networks such as Facebook, Instagram, Twitter, LinkedIn and YouTube.
● This data gives valuable insights from people's likes, shares, comments, clicks and
more.
● Collecting and analyzing social media data can help businesses improve their
marketing efforts, identify emerging trends and give a better experience to their
customers.
● This data can be analyzed for various purposes, such as demographic analysis and providing
targeted, personalized content.
● To access this data, researchers and developers use the Application Programming
Interfaces (APIs) that social media companies provide.
● An API is a set of methods for fetching and sending data.
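The general shape of fetching data through a web API can be sketched as follows. Everything specific here is an assumption made for illustration: the endpoint URL, the query parameters, the token and the response body are hypothetical and do not describe any real platform's API. The request is prepared but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and token; real platforms define their own URLs,
# parameters, and authentication rules.
BASE_URL = "https://api.example.com/v1/posts"
ACCESS_TOKEN = "YOUR_TOKEN_HERE"


def build_request(query: str, limit: int) -> Request:
    """Prepare (but do not send) a GET request for posts matching a query."""
    params = urlencode({"q": query, "limit": limit})
    return Request(
        f"{BASE_URL}?{params}",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )


request = build_request("data science", 10)
print(request.get_method(), request.full_url)

# A made-up response body, shaped like the JSON many web APIs return.
sample_response = '{"posts": [{"id": 1, "likes": 12}, {"id": 2, "likes": 30}]}'
posts = json.loads(sample_response)["posts"]
total_likes = sum(post["likes"] for post in posts)
print(total_likes)
```

With a real service, the prepared request would be sent (for example with urllib.request.urlopen) and the returned text parsed in the same way, subject to that platform's documentation and terms of use.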
Social media platforms that have an API
Facebook API
Twitter API
Instagram API
YouTube API
Google API
Multimodal Data
● Internet of Things (IoT) technology enables us to connect more and more devices
to the Internet.
● These devices generate and use a lot of data.
● Some of this data is structured, while the rest is unstructured.
● A multimodal dataset stores data from different sources in different formats.
● Storing and processing multimodal data poses additional challenges and
requires specialized tools and operations.
Standard Dataset
● Collecting high-quality data is a fundamental prerequisite for starting any data
science project.
● A dataset is a collection of data in which the data is arranged in some order.
● Collecting and preparing the dataset is one of the most crucial parts of the
project.
● However, it is not possible for every programmer to collect a large amount of data to work
on.
● Many organizations and individuals share their datasets free of charge for
anyone to download and use.
● Each dataset is summarized in a consistent way.
Aspects to know about a dataset
Name
Problem Type
Features
Sample
Examples of datasets
| Name | Description | URL |
|---|---|---|
| Iris Flower Dataset | This dataset involves predicting the flower species given measurements of iris flowers. | https://www.kaggle.com/datasets/arshid/iris-flower-dataset |
| The Zomato Restaurant Dataset | A comprehensive collection of restaurant data sourced from the popular online food delivery platform, Zomato. | https://www.kaggle.com/datasets/abhijitdahatonde/zomato-restaurants-dataset |
Dataset Repositories
Dataset repositories maintain multiple datasets as a service to the data science community.
They contain a large number of real-life datasets of all shapes and sizes.
Some popular repositories are:
| Repository | Domains | URL |
|---|---|---|
| Kaggle Datasets | Various domains: finance, sports, COVID, social media | http://www.kaggle.com/datasets |
| Amazon datasets | Public transport, satellite images | http://registry.opendata.aws/ |
Data Formats
● Numeric data: numeric data types include integers and floats.
Integer:
An integer represents numeric information in the form of whole numbers.
Integers can be signed or unsigned.
Float:
A floating-point number represents a number with a fractional part.
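The difference between signed and unsigned integers, and the storage of a float, can be illustrated with Python's standard struct module. The byte value and the number 3.25 are arbitrary choices for this sketch.

```python
import struct

# The same 8 bits read as an unsigned or a signed integer.
raw = bytes([0xFF])
unsigned_value = struct.unpack("B", raw)[0]  # unsigned 8-bit
signed_value = struct.unpack("b", raw)[0]    # signed 8-bit (two's complement)
print(unsigned_value, signed_value)

# A float keeps a fractional part; here it is packed as a 64-bit IEEE 754 double.
packed = struct.pack("<d", 3.25)
float_value = struct.unpack("<d", packed)[0]
print(len(packed), float_value)
```

The bit pattern is identical in both integer readings. Only the interpretation (signed or unsigned) changes the value.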
Text Data
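● A sequence of bytes interpreted as characters
● ASCII: a 7-bit code (128 characters), usually stored as one character per 8-bit byte
● Unicode: a much larger character set, stored using encodings such as UTF-8 (1 to 4 bytes per character) or UTF-16 (2 or 4 bytes per character)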
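A brief sketch of how the same text occupies a different number of bytes under different encodings, using only Python's built-in str.encode. The sample string is an arbitrary choice for the example.

```python
# "Data", a space, e-acute (U+00E9), a space, and a CJK character (U+6570).
text = "Data \u00e9 \u6570"

# ASCII covers only 128 characters, so the accented and CJK characters fail.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 uses 1 to 4 bytes per character; UTF-16 uses 2 or 4.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
print(ascii_ok, len(text), len(utf8_bytes), len(utf16_bytes))
```

The string has 8 characters, but its size in bytes depends on the encoding chosen when it is written to a file.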
Files
Text files
● All bytes of information are interpreted as ASCII or Unicode characters.
● They are readable by a human being.
● They work best for data with a relatively simple format.
Dense Numerical Arrays
● Scientific applications deal with numeric information: integers and floats.
● It is more efficient to store a large array of numbers in the native format that
the computer uses for processing.
● An image file or sound file consists of a dense array of numbers.
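The storage difference between a native binary array and the same numbers written as text can be sketched with Python's standard array module. The sample values are generated purely for illustration and do not come from a real signal.

```python
from array import array

# 1,000 floating-point samples, standing in for a short stretch of a signal.
samples = array("d", [i / 7 for i in range(1000)])

# Native binary form: a fixed number of bytes per value.
binary_size = len(samples.tobytes())

# Text form: each number written out as decimal digits, one per line.
text_size = len("\n".join(repr(x) for x in samples).encode("ascii"))

print(samples.itemsize, binary_size, text_size)
```

In binary form every value takes exactly 8 bytes, while the decimal text form of most of these values is longer and must also be parsed back into numbers before use.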
Compressed or Archived Data
● Many data files take up much more space than the information they hold actually requires.
● For example, a file may store long runs of repeated characters.
● Compression reduces the file size so that the file takes less space to store
and less time to transmit.
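The effect of compression on repetitive data can be seen with Python's standard zlib module. This is a minimal sketch with an artificial input; real files compress less dramatically.

```python
import zlib

# Highly repetitive data: the same character stored 10,000 times.
original = b"A" * 10_000

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))
print(restored == original)
```

Decompression recovers the original bytes exactly, so nothing is lost by storing or transmitting the smaller form.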
CSV files
CSV files are similar to simple ASCII-based text files, but the fields are separated by
commas.
Data is stored in the form of rows and columns.
These files are more compact but less human-readable.
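Reading comma-separated rows and columns can be sketched with Python's standard csv module. The column names and values below are hypothetical.

```python
import csv
import io

# Hypothetical CSV content; the first line names the columns.
csv_text = "name,city,rating\nCafe One,Pune,4.5\nSpice Hub,Mumbai,3.9\n"

# DictReader maps each row to the column names from the header line.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

# Every field is read as text, so numbers must be converted explicitly.
ratings = [float(r["rating"]) for r in records]
print(records[0]["name"], max(ratings))
```

For a file on disk, the same reader would be given an open file object instead of the in-memory string used here.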
JSON Files
JavaScript Object Notation (JSON) is a standard text-based format for representing structured data
based on JavaScript object syntax.
It is commonly used for transmitting data in web applications.
It is a lightweight data-interchange format.
It is easy for humans to read and write.
It stores data as simple text, as a set of objects.
An object is enclosed in { }
Example
{
  "ID": "2783",
  "Name": {
    "First": "AA",
    "Last": "BB"
  }
}
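A nested JSON record like the one in the example can be read with Python's standard json module. This sketch uses a syntactically valid form of that record, written on one line.

```python
import json

json_text = '{"ID": "2783", "Name": {"First": "AA", "Last": "BB"}}'

# Parse the text into nested Python dictionaries.
record = json.loads(json_text)
full_name = record["Name"]["First"] + " " + record["Name"]["Last"]
print(full_name)

# Serialize back to indented text, e.g. for writing to a file.
round_trip = json.dumps(record, indent=2)
print(round_trip)
```

Note that JSON requires straight double quotes, a colon between each key and value, and commas between members. Curly "smart" quotes are not valid.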
XML Files
XML supports information exchange between computer systems such as websites, databases,
and third-party applications.
Predefined rules make it easy to transmit data as XML files over any network, because the
recipient can use those rules to read the data accurately and efficiently.
The data can be stored and transported in a standard way between systems that use different data
formats.
Data is stored in an XML file as text.
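The exchange described here can be sketched with Python's standard xml.etree.ElementTree module: one side builds and serializes a record, and the other parses the text back. The tag names are illustrative assumptions, not a standard schema.

```python
import xml.etree.ElementTree as ET

# Build a small record; the tag names are illustrative, not a standard schema.
person = ET.Element("person", attrib={"id": "2783"})
name = ET.SubElement(person, "name")
ET.SubElement(name, "first").text = "AA"
ET.SubElement(name, "last").text = "BB"

# Serialize to text, as it would be stored in a file or sent over a network.
xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)

# A receiver parses the same text back using the shared XML rules.
received = ET.fromstring(xml_text)
first = received.findtext("name/first")
print(received.get("id"), first)
```

Because both sides follow the same XML syntax rules, the receiver does not need to know how the sender stored the data internally.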
Image files

More Related Content

Similar to Introductio to Data Science and types of data

Database Systems
Database SystemsDatabase Systems
Database SystemsUsman Tariq
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huwekineheshete
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big DataUmair Shafique
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsEmbarcadero Technologies
 
Ch-11 Relational Databases.pptx
Ch-11 Relational Databases.pptxCh-11 Relational Databases.pptx
Ch-11 Relational Databases.pptxShadowDawg
 
EContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdf
EContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdfEContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdf
EContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdfsitework231
 
Data-Ed Webinar: Design & Manage Data Structures
Data-Ed Webinar: Design & Manage Data Structures Data-Ed Webinar: Design & Manage Data Structures
Data-Ed Webinar: Design & Manage Data Structures DATAVERSITY
 
Data-Ed: Design and Manage Data Structures
Data-Ed: Design and Manage Data Structures Data-Ed: Design and Manage Data Structures
Data-Ed: Design and Manage Data Structures Data Blueprint
 
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfJerichoGerance
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDatavalley.ai
 
Introduction to data interoperability across the data value chain.pdf
Introduction to data interoperability across the data value chain.pdfIntroduction to data interoperability across the data value chain.pdf
Introduction to data interoperability across the data value chain.pdfAhmedHany Sayed
 
Database.pdf
Database.pdfDatabase.pdf
Database.pdfl235546
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3varshakumar21
 
Practical steps to GDPR compliance
Practical steps to GDPR compliance Practical steps to GDPR compliance
Practical steps to GDPR compliance Jean-Michel Franco
 
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...IRJET Journal
 

Similar to Introductio to Data Science and types of data (20)

Database Systems
Database SystemsDatabase Systems
Database Systems
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 
Introducition to Data scinece compiled by hu
Introducition to Data scinece compiled by huIntroducition to Data scinece compiled by hu
Introducition to Data scinece compiled by hu
 
Handling and Processing Big Data
Handling and Processing Big DataHandling and Processing Big Data
Handling and Processing Big Data
 
U - 2 Emerging.pptx
U - 2 Emerging.pptxU - 2 Emerging.pptx
U - 2 Emerging.pptx
 
Driving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data AssetsDriving Business Value Through Agile Data Assets
Driving Business Value Through Agile Data Assets
 
Ch-11 Relational Databases.pptx
Ch-11 Relational Databases.pptxCh-11 Relational Databases.pptx
Ch-11 Relational Databases.pptx
 
EContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdf
EContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdfEContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdf
EContent_11_2024_01_23_18_48_10_DatamodelsUnitIVpptx__2023_11_10_16_13_01.pdf
 
Data-Ed Webinar: Design & Manage Data Structures
Data-Ed Webinar: Design & Manage Data Structures Data-Ed Webinar: Design & Manage Data Structures
Data-Ed Webinar: Design & Manage Data Structures
 
Data-Ed: Design and Manage Data Structures
Data-Ed: Design and Manage Data Structures Data-Ed: Design and Manage Data Structures
Data-Ed: Design and Manage Data Structures
 
DATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEMDATABASE MANAGEMENT SYSTEM
DATABASE MANAGEMENT SYSTEM
 
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdfACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
ACCOUNTING-IT-APP-MIdterm Topic-Bigdata.pdf
 
Decoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdfDecoding the Role of a Data Engineer.pdf
Decoding the Role of a Data Engineer.pdf
 
Introduction to data interoperability across the data value chain.pdf
Introduction to data interoperability across the data value chain.pdfIntroduction to data interoperability across the data value chain.pdf
Introduction to data interoperability across the data value chain.pdf
 
DSA
DSADSA
DSA
 
Database.pdf
Database.pdfDatabase.pdf
Database.pdf
 
Data science.chapter-1,2,3
Data science.chapter-1,2,3Data science.chapter-1,2,3
Data science.chapter-1,2,3
 
computer.pdf
computer.pdfcomputer.pdf
computer.pdf
 
Practical steps to GDPR compliance
Practical steps to GDPR compliance Practical steps to GDPR compliance
Practical steps to GDPR compliance
 
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...IRJET-  	  Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
IRJET- Comparative Study of Efficacy of Big Data Analysis and Deep Learni...
 

Recently uploaded

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiSuhani Kapoor
 

Recently uploaded (20)

代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 

Introductio to Data Science and types of data

  • 2. What is Data Science ? Buzzwords in Data Science : Data Analysis,DataMining,Statistical Analysis,Big Data,Machine learning Data Science is : .. An interdisciplinary field that uses scientific methods ,processes,algorithms and systems to extract knowledge and insights from data in various forms ,both structured and unstructured. ..An area that manages,manipulates,extracts and interprets knowledge from tremendous amount of data .. 2
  • 3. The Data Science Lifecycle 3
  • 4. Types of Data ● Data is set of raw facts and values such as observations and descriptions that must be analyzed and processed to make it more meaningful. ● Data comes from variety of sources and in a variety of formats. ● Data can be classified into following types ➢ Structured data ➢ Semi structured data ➢ Unstructured data 4
  • 5. Structured Data Structured data well organized ,highly specific and is stored in predefined format. Structured data is information that is formatted and transformed into a well defined data model The data types and formats in structured data are clearly defined. It is easy to read ,organize ,query,manage and store structured data using programming languages and tools 5
  • 6. Characteristics of Structured data ● Data conforms to a data model and has easily identifiable structure ● Data is stored in the form of rows and columns Example : Database ● Data is well organised so, Definition, Format and Meaning of data is explicitly known ● Data resides in fixed fields within a record or file ● Similar entities are grouped together to form relations or classes ● Entities in the same group have same attributes ● Easy to access and query, So data can be easily used by other programs ● Data elements are addressable, so efficient to analyse and process 6
  • 7. Sources of structured data Structured data is generated by both humans and machines. Examples : names ,addresses,date,accno are human generated structured data Machine generated data refers to data that is created by a machine Examples Data generated by sensor,user and activity log,barcode 7
  • 8. Advantages of Structured data Highly organized Universally understood Easily operated upon More tools available Less storage 8
  • 9. Disadvantages of Structured data Limited usage Limited storage options Difficult to change the format Expensive 9
  • 10. Semi structured Data ● Semi-structured data is a type of data that is not purely structured, but also not completely unstructured. It contains some level of organization or structure, but does not conform to a rigid schema or data model, and may contain elements that are not easily categorized or classified. ● Examples Semi-structured data is typically characterized by the use of metadata or tags that provide additional information about the data elements. For example, an XML document might contain tags that indicate the structure of the document, but may also contain additional tags that provide metadata about the content, such as author, date, or keywords.Semi structured data is more complex than structured data but less coplex than unstructured data ● Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It is the data that does not reside in a rational database but that have some organizational properties that make it easier to analyze. With some processes, we can store them in the relational database. 10
  • 11. Characteristics of semi-structured Data: ● Data does not conform to a data model but has some structure. ● Data can not be stored in the form of rows and columns as in Databases ● Semi-structured data contains tags and elements (Metadata) which is used to group data and describe how the data is stored ● Similar entities are grouped together and organized in a hierarchy ● Entities in the same group may or may not have the same attributes or properties ● Does not contain sufficient metadata which makes automation and management of data difficult ● Size and type of the same attributes in a group may differ ● Due to lack of a well-defined structure, it can not used by computer programs easily 11
  • 12. Advantages of semi structured data ● The data is not constrained by a fixed schema. ● Flexible, i.e., the schema can be easily changed. ● Data is portable. ● It is possible to view structured data as semi-structured data. ● It supports users who cannot express their needs in SQL. ● It can deal easily with the heterogeneity of sources.
  • 13. Disadvantages of semi structured data ● The lack of a fixed, rigid schema makes the data difficult to store. ● Interpreting the relationships between data is difficult because there is no separation of the schema and the data. ● Queries are less efficient than on structured data.
  • 14. Unstructured Data Unstructured data is data that does not conform to a data model and has no easily identifiable structure, so it cannot be used easily by a computer program. It is not organised in a pre-defined manner and has no pre-defined data model, so it is not a good fit for a mainstream relational database. Examples: audio of speech, CCTV video, social media, comments
  • 15. Characteristics of Unstructured Data: ● Data neither conforms to a data model nor has any structure. ● Data cannot be stored in rows and columns as in databases. ● Data does not follow any semantics or rules. ● Data lacks any particular format or sequence. ● Data has no easily identifiable structure. ● Due to the lack of identifiable structure, it cannot be used easily by computer programs.
  • 16. Advantages of unstructured data 1. Flexible: The data is not constrained by a fixed schema 2. More applications: Since there is no predefined model or schema, unstructured data can be used for more than one intended purpose 3. More formatting options 4. Easy storage 5. Heterogeneity
  • 17. Disadvantages of unstructured data ● Requires expertise ● Requires specific data tools ● Difficult to process
  • 19. Comparison of Structured, Semi Structured and Unstructured data
    | Structured | Semi Structured | Unstructured |
    | --- | --- | --- |
    | Based on relational database tables | Based on XML/RDF (Resource Description Framework) | Based on character and binary data |
    | Easy to process | Can be processed after converting to a structured format | Difficult to process |
    | Well organized | Not in a rigid format but contains tags or metadata | Not organized |
    | Mature transaction management and various concurrency techniques | Transaction management is adapted from DBMS and is not mature | No transaction management and no concurrency |
    | Schema dependent and less flexible | More flexible than structured data but less flexible than unstructured data | Most flexible; there is no schema |
    | Very difficult to scale the DB schema | Scaling is simpler than for structured data | More scalable |
    | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible |
  • 20. Data Sources Any data science application needs data. This data can be produced in various ways and from different sources. ● Open data ● Social Media Data ● Multimodal Data ● Standard Datasets
  • 21. Open Data ● Open data may come from any source. ● The data should be available in the public domain so that anyone can use it. ● It should be free of restrictions from copyright and patents. ● Local and federal governments, non-governmental organizations, and academic communities all lead open data initiatives. Principles ● Public: The data must be open to access as permitted by law and subject to privacy, confidentiality, and security ● Accessible ● Described ● Reusable ● Complete ● Timely ● Managed Post-Release
  • 22. Social Media Data ● Social media is a rich source of data. ● Social media data is the information collected across different social media networks such as Facebook, Instagram, Twitter, LinkedIn, and YouTube. ● This data gives valuable insights from people's likes, shares, comments, clicks, and more. ● Collecting and analyzing social media data can help businesses improve their marketing efforts, identify emerging trends, and give a better experience to their customers. ● This data can be analyzed for various purposes, such as demographic analysis and providing targeted and personalized content. ● To access this data, researchers and developers use the Application Programming Interface (API) that social media companies provide. ● An API is a set of methods for fetching and sending data.
  • 23. Social media platforms that have an API ● Facebook API ● Twitter API ● Instagram API ● Youtube API ● Google API
  • 24. Multimodal Data ● IoT technology enables us to connect more and more devices to the Internet. ● These devices generate and use a lot of data. ● Some of this data is structured, while the rest is unstructured. ● A multimodal dataset stores data from different sources in different formats. ● Storing and processing multimodal data poses additional challenges and requires specialized tools and operations.
  • 25. Standard Dataset ● Collecting high-quality data is a fundamental prerequisite for starting any data science project. ● A dataset is a collection of data arranged in some order. ● Collecting and preparing the dataset is one of the most crucial parts of the project. ● However, it is not possible for every programmer to collect a lot of data to work on. ● Many organizations and individuals share their datasets free of charge for anyone to download and use. ● Each dataset is summarized in a consistent way.
  • 26. Aspects to know about a dataset ● Name ● Problem type ● Features ● Sample Examples of datasets ● Iris Flower Data set: involves predicting the flower species given measurements of iris flowers. https://www.kaggle.com/datasets/arshid/iris-flower-dataset ● The Zomato Restaurant Dataset: a comprehensive collection of restaurant data sourced from the popular online food delivery platform Zomato. https://www.kaggle.com/datasets/abhijitdahatonde/zomato-restaurants-dataset
  • 27. Dataset Repositories Dataset repositories maintain multiple datasets as a service to the data science community. They contain a large number of real-life datasets of all shapes and sizes. Some popular repositories are ● Kaggle datasets: various domains such as finance, sports, COVID, and social media. http://www.kaggle.com/datasets ● Amazon datasets: public transport, satellite images. http://registry.opendata.aws/
  • 28. Data Formats ● Numeric data: Numeric data types include integers and floats. Integer: An integer represents numeric information in the form of whole numbers. Integers can be signed or unsigned. Float: A floating-point number represents a number with a fractional part.
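The difference between signed and unsigned integers, and between integers and floats, can be sketched with Python's standard `struct` module. The values 200 and 3.5 are arbitrary choices for illustration. The sketch reads the same single byte first as an unsigned and then as a signed integer, and then packs a float into its 8-byte binary form.

```python
import struct

# Store 200 in one byte as an unsigned integer (format "B": range 0..255).
raw = struct.pack("B", 200)

# Read the same byte back as unsigned and as signed (format "b": range -128..127).
as_unsigned = struct.unpack("B", raw)[0]
as_signed = struct.unpack("b", raw)[0]
print(as_unsigned, as_signed)

# An integer holds only whole numbers; a float keeps the fractional part.
whole = 7 // 2
fractional = 7 / 2
print(whole, fractional)

# A double-precision float occupies 8 bytes in its standard binary form.
float_bytes = struct.pack("<d", fractional)
print(len(float_bytes))
```

The byte pattern is identical in both reads; only the interpretation chosen by the program changes the value.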
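  • 29. Text Data ● A sequence of bytes ● ASCII code (a 7-bit code, usually stored as one character per 8-bit byte) ● Unicode (code points stored using an encoding such as UTF-8 or UTF-16, so a character may take more than 16 bits)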
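How text becomes a sequence of bytes depends on the encoding. The following is a small sketch using Python's built-in `str.encode`; the sample characters are arbitrary choices for illustration. It prints how many bytes each character needs in UTF-8 and UTF-16.

```python
# Three sample characters: a plain ASCII letter, an accented letter,
# and a character outside the first 65,536 Unicode code points.
samples = ["A", "\u00e9", "\U0001F600"]

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # "-le" avoids the 2-byte byte-order mark
    sizes[ch] = (len(utf8), len(utf16))
    print(hex(ord(ch)), "UTF-8 bytes:", len(utf8), "UTF-16 bytes:", len(utf16))

# Pure ASCII text maps each character to one byte holding its code.
ascii_bytes = "Data".encode("ascii")
print(list(ascii_bytes))
```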
  • 30. Files Text files ● All bytes of information are interpreted as ASCII or Unicode characters. ● A text file is readable by a human being. ● It works best for data with a relatively simple format.
  • 31. Dense Numerical Arrays ● Scientific applications deal with numeric information: integers and floats. ● It is more efficient to store a large array of numbers in the native format that the computer uses for processing. ● An image file or sound file consists of a dense array of numbers.
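The efficiency claim for native numeric storage can be checked with a rough sketch. The array contents are made up, and the fixed 4-byte little-endian integer layout is an assumption chosen for the example. The sketch stores the same numbers once as text and once in binary form and compares the sizes.

```python
import struct

# Hypothetical data: 1000 integers, each small enough for 4 bytes.
values = [i * 1000 for i in range(1000)]

# Text form: decimal digits separated by spaces.
as_text = " ".join(str(v) for v in values).encode("ascii")

# Binary form: each value as a 4-byte little-endian signed integer.
as_binary = struct.pack("<1000i", *values)

print("text bytes:  ", len(as_text))
print("binary bytes:", len(as_binary))

# The binary form converts back to the original numbers without parsing digits.
restored = list(struct.unpack("<1000i", as_binary))
```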
  • 32. Compressed or Archived Data ● Many data files take up far more space than the information in them actually requires. ● For example, a file may store long runs of repeated characters. ● Compression reduces the file size so that the file takes up less space to store and less time to transmit.
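The effect of compression on repetitive data can be seen in a minimal sketch with Python's standard `zlib` module. The input, 1000 copies of one character, is a deliberately extreme example chosen for illustration.

```python
import zlib

# Highly repetitive data: the same character 1000 times.
original = b"A" * 1000

# Compress, then decompress to confirm nothing was lost.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print("original bytes:  ", len(original))
print("compressed bytes:", len(compressed))
print("lossless:", restored == original)
```

Data with little repetition compresses far less well, so the saving depends on the content.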
  • 33. CSV files CSV files are simple ASCII-based text files in which the field separator is a comma. Data is stored in the form of rows and columns. These files are more compact but less human readable.
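Reading comma-separated rows and columns can be sketched with Python's standard `csv` module. The column names and values below are hypothetical, and the text is read from a string rather than a file to keep the example self-contained.

```python
import csv
import io

# Hypothetical CSV content: a header row followed by two data rows.
csv_text = "id,name,city\n1,Asha,Pune\n2,Ravi,Mumbai\n"

# DictReader uses the first row as column names.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

for row in rows:
    print(row["id"], row["name"], row["city"])
```

Every value comes back as a string, so numeric columns need an explicit conversion such as `int(row["id"])`.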
  • 34. JSON Files JavaScript Object Notation (JSON) is a standard text-based format for representing structured data, based on JavaScript object syntax. It is commonly used for transmitting data in web applications. It is a lightweight data-interchange format that is easy for humans to read and write. It stores data as simple text, as a set of objects. An object is enclosed in { }, with a colon between each name and its value and commas between members. Example { "ID": "2783", "Name": { "First": "AA", "Last": "BB" } }
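A record like the one on this slide can be parsed with Python's standard `json` module. The sketch below uses a well-formed version of that record, with straight quotes, colons, and commas, and reads a nested value from it.

```python
import json

# A well-formed JSON object with one nested object.
json_text = '{"ID": "2783", "Name": {"First": "AA", "Last": "BB"}}'

# Parse the text into Python dictionaries.
record = json.loads(json_text)
first_name = record["Name"]["First"]
print(record["ID"], first_name)

# Serialize back to text for storage or transmission.
round_trip = json.dumps(record)
```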
  • 35. XML Files XML supports information exchange between computer systems such as websites, databases, and third-party applications. Predefined rules make it easy to transmit data as XML files over any network, because the recipient can use those rules to read the data accurately and efficiently. The data can be stored and transported in a standard way between systems that use different data formats. Data is stored in an XML file as text.
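Reading data from XML text can be sketched with Python's standard `xml.etree.ElementTree` module. The element and attribute names below are hypothetical and not part of any standard vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document: tags describe the structure of the record.
xml_text = """
<record id="2783">
  <name>
    <first>AA</first>
    <last>BB</last>
  </name>
</record>
"""

# Parse the text and read an attribute and two nested elements.
root = ET.fromstring(xml_text)
record_id = root.get("id")
first = root.findtext("name/first")
last = root.findtext("name/last")
print(record_id, first, last)
```

Because the tags travel with the data, the receiving system can locate each value by name without knowing how the sender stored it.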