2. What is Data Science?
Buzzwords in Data Science:
Data Analysis, Data Mining, Statistical Analysis, Big Data, Machine Learning
Data Science is:
● An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract
knowledge and insights from data in various forms, both structured and unstructured.
● An area that manages, manipulates, extracts and interprets knowledge from tremendous amounts
of data.
4. Types of Data
● Data is a set of raw facts and values, such as observations and descriptions, that
must be analyzed and processed to become meaningful.
● Data comes from a variety of sources and in a variety of formats.
● Data can be classified into the following types:
➢ Structured data
➢ Semi-structured data
➢ Unstructured data
5. Structured Data
Structured data is well organized, highly specific, and stored in a predefined format.
Structured data is information that is formatted and transformed into a well-defined data model.
The data types and formats in structured data are clearly defined.
It is easy to read, organize, query, manage and store structured data using programming
languages and tools.
6. Characteristics of Structured data
● Data conforms to a data model and has an easily identifiable structure
● Data is stored in the form of rows and columns. Example: a database table
● Data is well organised, so the definition, format and meaning of the data are explicitly known
● Data resides in fixed fields within a record or file
● Similar entities are grouped together to form relations or classes
● Entities in the same group have the same attributes
● Data is easy to access and query, so it can be easily used by other programs
● Data elements are addressable, so they are efficient to analyse and process
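The characteristics above can be illustrated with a small sketch using Python's built-in sqlite3 module. The table name, column names and rows are made up for illustration; the point is only that fixed fields in rows and columns can be queried directly.

```python
import sqlite3

# In-memory database; the "customer" table and its rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    "acc_no INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)

# Every row has the same attributes, each stored in a fixed field.
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(101, "Asha", "Pune"), (102, "Ravi", "Mumbai"), (103, "Meera", "Pune")],
)

# Because the structure is known, a query can address fields by name.
pune_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customer WHERE city = ? ORDER BY acc_no", ("Pune",)
    )
]
conn.close()
print(pune_names)
```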
7. Sources of structured data
Structured data is generated by both humans and machines.
Human-generated structured data is data that people enter.
Examples: names, addresses, dates, account numbers
Machine-generated structured data is data that is created by a machine.
Examples: sensor data, user and activity logs, barcode data
8. Advantages of Structured data
Highly organized
Universally understood
Easily operated upon
More tools available
Less storage
9. Disadvantages of Structured data
Limited usage
Limited storage options
Difficult to change the format
Expensive
10. Semi-structured Data
● Semi-structured data is a type of data that is not purely structured, but also not completely unstructured.
It contains some level of organization or structure, but does not conform to a rigid schema or data model, and
may contain elements that are not easily categorized or classified.
● Semi-structured data is typically characterized by the use of metadata or tags that provide additional information
about the data elements. For example, an XML document might contain tags that indicate the structure of the
document, but may also contain additional tags that provide metadata about the content, such as author, date, or
keywords.
● Semi-structured data is more complex than structured data but less complex than unstructured data.
● Semi-structured data does not conform to a data model but has some structure. It lacks a
fixed or rigid schema. It does not reside in a relational database, but it has some
organizational properties that make it easier to analyze. With some processing, it can be stored in a
relational database.
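As a rough sketch of how semi-structured data can be stored in a relational database after some processing, the example below loads JSON records whose fields differ from record to record and flattens them into a table with a fixed set of columns. The records, the column list and the table name are assumptions chosen for illustration; missing fields simply become NULL.

```python
import json
import sqlite3

# Two illustrative records: both have "name", but the other fields differ.
raw = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100"}
]"""
records = json.loads(raw)

# The processing step: decide on a fixed set of columns (an assumption here).
COLUMNS = ("name", "email", "phone")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (name TEXT, email TEXT, phone TEXT)")

# dict.get returns None for a missing field, which sqlite3 stores as NULL.
conn.executemany(
    "INSERT INTO contact VALUES (?, ?, ?)",
    [tuple(record.get(column) for column in COLUMNS) for record in records],
)
rows = conn.execute(
    "SELECT name, email, phone FROM contact ORDER BY name"
).fetchall()
conn.close()
print(rows)
```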
11. Characteristics of semi-structured Data:
● Data does not conform to a data model but has some structure.
● Data cannot be stored in the form of rows and columns as in databases
● Semi-structured data contains tags and elements (metadata) which are used to group data and describe
how the data is stored
● Similar entities are grouped together and organized in a hierarchy
● Entities in the same group may or may not have the same attributes or properties
● It does not contain sufficient metadata, which makes automation and management of the data difficult
● The size and type of the same attributes in a group may differ
● Due to the lack of a well-defined structure, it cannot be used by computer programs easily
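The following sketch shows some of these characteristics with a tiny XML document parsed by Python's standard xml.etree.ElementTree module. The document content and tag names are invented for illustration: both entities are "book" elements grouped under one parent, yet they do not carry the same attributes or child elements.

```python
import xml.etree.ElementTree as ET

# Illustrative document: the second book has no "year" attribute
# but has an extra <keywords> element.
doc = """<library>
  <book author="A. Author" year="2020"><title>First</title></book>
  <book author="B. Writer"><title>Second</title><keywords>data, science</keywords></book>
</library>"""

root = ET.fromstring(doc)

# For each entity, collect which attributes and child tags it actually has.
fields = [
    (sorted(book.attrib), [child.tag for child in book])
    for book in root.findall("book")
]
for attributes, children in fields:
    print(attributes, children)
```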
12. Advantages of semi-structured data
● The data is not constrained by a fixed schema
● Flexible, i.e. the schema can be easily changed
● Data is portable
● It is possible to view structured data as semi-structured data
● It supports users who cannot express their needs in SQL
● It can deal easily with the heterogeneity of sources
13. Disadvantages of semi-structured data
● The lack of a fixed, rigid schema makes the data difficult to store
● Interpreting the relationships between data is difficult, as there is no separation of the schema and the
data
● Queries are less efficient than with structured data
14. Unstructured Data
Unstructured data is data that does not conform to a data model and has no easily identifiable structure,
so it cannot be used by a computer program easily.
Unstructured data is not organised in a pre-defined manner and does not have a pre-defined data model, so it is
not a good fit for a mainstream relational database.
Examples: audio speech, CCTV video, social media comments
15. Characteristics of Unstructured Data:
● Data neither conforms to a data model nor has any structure.
● Data cannot be stored in the form of rows and columns as in databases
● Data does not follow any semantics or rules
● Data lacks any particular format or sequence
● Data has no easily identifiable structure
● Due to the lack of an identifiable structure, it cannot be used by computer programs easily
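To make the last point concrete, here is a small sketch of what a program has to do with a free-text comment. The comment text and the patterns are invented for illustration. Since there are no fields to address, the program falls back on text search and pattern matching, which only works when the text happens to follow the guessed pattern.

```python
import re

# An illustrative social media comment: no fields, no schema.
comment = (
    "Loved the pasta at Cafe Roma last night!! "
    "Service was slow though, 3/5. #foodie #pune"
)

# Heuristic 1: hashtags are words that follow a '#'.
hashtags = re.findall(r"#(\w+)", comment)

# Heuristic 2: guess that a rating looks like "<digit>/5".
rating_match = re.search(r"(\d)\s*/\s*5", comment)
rating = int(rating_match.group(1)) if rating_match else None

# Heuristic 3: plain keyword search.
mentions_slow = "slow" in comment.lower()

print(hashtags, rating, mentions_slow)
```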
16. Advantages of unstructured data
1. Flexible: The data is not constrained by a fixed schema
2. More applications: Since there is no predefined model or
schema, unstructured data can be used for more than one intended purpose
3. More formatting options
4. Easy storage
5. Heterogeneity
19. Comparison of Structured, Semi-structured and Unstructured Data
| Structured | Semi-structured | Unstructured |
|---|---|---|
| It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data |
| Easy to process | Can be processed after converting to a structured format | Difficult to process |
| Well organized | Not in a rigid format, but contains tags or metadata | Not organized |
| Mature transaction management and various concurrency techniques | Transaction management is adapted from DBMS and is not mature | No transaction management and no concurrency |
| It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is no schema |
| It is very difficult to scale the DB schema | Its scaling is simpler than that of structured data | It is more scalable |
| Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible |
20. Data Sources
Any data science application needs data.
This data can be produced in various ways and from different sources.
● Open data
● Social Media Data
● Multimodal Data
● Standard Datasets
21. Open Data
● Open data may come from any source
● The data should be available in the public domain so that it can be used by anyone,
without restrictions from copyright or patents
● Local and federal governments, non-government organizations and academic
communities all lead open data initiatives
Principles
● Public: The data must be open to access as permitted by law and subject to
privacy, confidentiality and security restrictions
● Accessible
● Described
● Reusable
● Complete
● Timely
● Managed Post-Release
22. Social Media Data
● Social media is a rich source of data
● Social media data is the information that is collected across different social media
networks such as Facebook, Instagram, Twitter, LinkedIn and YouTube
● This data gives valuable insights from people’s likes, shares, comments, clicks and
more
● Collecting and analyzing social media data can help businesses improve their
marketing efforts, identify emerging trends and give a better experience to their
customers
● This data can be analyzed for various purposes, such as demographic analysis and providing
targeted and personalized content
● To access this data, researchers and developers use the Application Programming
Interface (API) that social media companies provide
● An API is a set of methods for fetching and sending data
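The idea of fetching data through an API can be sketched as follows. Everything specific here is hypothetical: the endpoint URL, the token, the query parameters and the shape of the response are placeholders, not the API of any real social media company. The request is only built, not sent, and the response is a hard-coded sample.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and token: placeholders, not a real service.
BASE_URL = "https://api.example.com/v1/posts"
ACCESS_TOKEN = "YOUR_TOKEN_HERE"


def build_request(hashtag, limit):
    """Build (but do not send) a GET request for posts with a hashtag."""
    query = urlencode({"hashtag": hashtag, "limit": limit})
    return Request(
        f"{BASE_URL}?{query}",
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    )


request = build_request("datascience", 2)

# A made-up response body in the shape such an API might return.
sample_response = (
    '{"posts": [{"id": 1, "likes": 10, "shares": 2},'
    ' {"id": 2, "likes": 5, "shares": 0}]}'
)
posts = json.loads(sample_response)["posts"]
total_likes = sum(post["likes"] for post in posts)
print(request.full_url, total_likes)
```

With a real provider, the request would be sent (for example with urllib.request.urlopen) and the provider's documentation would define the actual endpoints, authentication and rate limits.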
24. Multimodal Data
● IoT technology enables us to connect more and more devices to the
Internet
● These devices generate and use a lot of data
● Some of this data is structured, while the rest is unstructured
● A multimodal dataset stores data from different sources in different formats
● Storing and processing multimodal data poses additional challenges and
requires specialized tools and operations
25. Standard Dataset
● Collecting high-quality data is a fundamental prerequisite for starting any data
science project
● A dataset is a collection of data in which the data is arranged in some order
● Collecting and preparing a dataset is one of the most crucial parts of the
project
● However, it is not possible for each programmer to collect a lot of data to work
on
● Many organizations and individuals share their datasets free of charge for
anyone to download and use
● Each dataset is summarized in a consistent way
26. Aspects to Know About a Dataset
● Name
● Problem Type
● Features
● Sample
Examples of Datasets
| Dataset | Description | Link |
|---|---|---|
| Iris Flower Data set | This flower dataset involves predicting the flower species given measurements of iris flowers | https://www.kaggle.com/datasets/arshid/iris-flower-dataset |
| The Zomato Restaurant Dataset | The Zomato Restaurant Dataset is a comprehensive collection of restaurant data sourced from the popular online food delivery platform, Zomato. | https://www.kaggle.com/datasets/abhijitdahatonde/zomato-restaurants-dataset |
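The four aspects can be read off a dataset with a few lines of code. The sketch below embeds three rows in the style of the Iris dataset instead of downloading the file; the column names are an assumption about how the CSV is laid out and may differ in the actual download.

```python
import csv
import io
from collections import Counter

# A few embedded rows for illustration; column names are assumed.
SAMPLE = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

summary = {
    "name": "Iris Flower Data set",
    # Predicting a species label from measurements is a classification task.
    "problem_type": "classification",
    # Every column except the label is a feature.
    "features": [column for column in rows[0] if column != "species"],
    "sample": rows[0],
}
class_counts = Counter(row["species"] for row in rows)
print(summary["features"], dict(class_counts))
```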
27. Dataset Repositories
Dataset repositories maintain multiple datasets as a service to the data science community.
They contain a large number of real-life datasets of all shapes and sizes.
Some popular repositories are:
| Repository | Domains | Link |
|---|---|---|
| Kaggle Datasets | Various domains: finance, sports, covid, social media | http://www.kaggle.com/datasets |
| Amazon dataset | Public transport, satellite images | http://registry.opendata.aws/ |
28. Data Formats
● Numeric data: Numeric data types include integers and floats
Integer:
An integer represents numeric information in the form of whole numbers.
Integers can be signed or unsigned.
Float:
A floating point number represents a number with a fractional part.
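A short sketch with Python's standard struct module shows the difference between signed and unsigned integers and the approximate nature of floats. The chosen byte value and the 32-bit float format are only examples.

```python
import struct

# The same single byte, interpreted two ways.
raw = bytes([0xFF])
unsigned_value = struct.unpack("B", raw)[0]  # unsigned 8-bit integer
signed_value = struct.unpack("b", raw)[0]    # signed 8-bit integer

# A 32-bit float can hold a fractional part, but usually only approximately.
as_float32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]

print(unsigned_value, signed_value, as_float32)
```

The byte 0xFF reads as 255 when unsigned and as -1 when signed, and 0.1 does not survive a round trip through a 32-bit float exactly.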
29. Text Data
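● A sequence of bytes
● ASCII code (a 7-bit code, usually stored as one 8-bit byte per character)
● Unicode (stored using an encoding such as UTF-8 or UTF-16; UTF-16 uses 16-bit code units)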
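How text becomes a sequence of bytes depends on the encoding. The sketch below encodes the same short string in ASCII and UTF-16, and one non-ASCII character in UTF-8; the strings are arbitrary examples.

```python
text = "Data"

# ASCII: one byte per character, every byte value below 128.
ascii_bytes = text.encode("ascii")

# UTF-16 (little-endian, no byte-order mark): two bytes per character here.
utf16_bytes = text.encode("utf-16-le")

# UTF-8 uses a variable number of bytes; the euro sign needs three.
euro_utf8 = "€".encode("utf-8")

print(len(ascii_bytes), len(utf16_bytes), len(euro_utf8))
```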
30. Files
Text files
● All bytes of information are interpreted as ASCII or Unicode characters
● It is readable by a human being
● It works best for data with a relatively simple format
31. Dense Numerical Arrays
● Scientific applications deal with numeric information: integers and floats
● It is more efficient to store a large array of numbers in the native format that
computers use for processing
● An image file or sound file consists of a dense array of numbers
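The efficiency argument can be sketched with Python's standard array module, which stores numbers in the machine's native binary format. The values used (square roots of 1 to 1000) are an arbitrary choice that produces floats with many digits.

```python
import math
from array import array

values = [math.sqrt(i) for i in range(1, 1001)]

# Native format: a fixed number of bytes per value (typecode "d" is a C double).
packed = array("d", values)
binary = packed.tobytes()

# Text format: one decimal string per line.
as_text = "\n".join(repr(v) for v in values).encode("ascii")

# The binary form can be loaded back without any parsing of digits.
restored = array("d")
restored.frombytes(binary)

print(len(binary), len(as_text))
```

For these values the text form is considerably larger than the binary form, and reading it back would require parsing every number.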
32. Compressed or Archived Data
● Many data files take up a lot of space compared to the actual requirement
● For example, a file may store the same characters repeated many times
● Compression reduces the file size so that the file takes up less space to store
and less time to transmit
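A minimal sketch with Python's standard zlib module shows how well repeated characters compress. The input is deliberately artificial: two long runs of a single character.

```python
import zlib

# Highly repetitive data: 1000 'A' bytes followed by 1000 'B' bytes.
data = b"A" * 1000 + b"B" * 1000

compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

print(len(data), len(compressed))
```

The compression is lossless: decompressing returns exactly the original bytes.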
33. CSV files
CSV (comma-separated values) files are simple ASCII-based text files in which the field separator is a
comma.
Data is stored in the form of rows and columns.
These files are more compact but less human-readable than self-describing formats such as JSON or XML.
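As a small sketch, Python's standard csv module can write and read such a file. The rows are invented; one value contains a comma on purpose, to show that the writer quotes it so the comma is not mistaken for a field separator.

```python
import csv
import io

# Write rows to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "name", "city"])
writer.writerow([1, "Asha", "Pune"])
writer.writerow([2, "Rao, Ravi", "Mumbai"])  # value contains a comma
csv_text = buffer.getvalue()

# Read the same text back; every field comes back as a string.
rows = list(csv.reader(io.StringIO(csv_text)))
print(csv_text)
print(rows)
```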
34. JSON Files
JavaScript Object Notation (JSON) is a standard text-based format for representing structured data
based on JavaScript object syntax.
It is commonly used for transmitting data in web applications.
It is a lightweight data-interchange format.
It is easy for humans to read and write.
It stores data as simple text, as a set of objects.
An object is enclosed in { }, and its name/value pairs are separated by commas.
Example
{
  "ID": "2783",
  "Name":
  {
    "First": "AA",
    "Last": "BB"
  }
}
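A record like the one in the example can be read with Python's standard json module. This is a minimal sketch; the variable names are arbitrary.

```python
import json

text = '{"ID": "2783", "Name": {"First": "AA", "Last": "BB"}}'

# Parse the JSON text into nested Python dictionaries.
record = json.loads(text)
first_name = record["Name"]["First"]

# Convert the Python object back into JSON text for storage or transmission.
round_trip = json.dumps(record)
print(first_name, round_trip)
```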
35. XML Files
XML supports information exchange between computer systems such as websites, databases,
and third-party applications.
Predefined rules make it easy to transmit data as XML files over any network because the
recipient can use those rules to read the data accurately and efficiently.
The data can be stored and transported in a standard way between systems that use different data
formats.
Data is stored in an XML file as text.
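The exchange described above can be sketched with Python's standard xml.etree.ElementTree module: one side builds a record and serializes it to text, the other side parses that text. The element names (record, ID, Name, First, Last) are chosen for illustration and are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Sender: build a small record as a tree of elements.
record = ET.Element("record")
ET.SubElement(record, "ID").text = "2783"
name = ET.SubElement(record, "Name")
ET.SubElement(name, "First").text = "AA"
ET.SubElement(name, "Last").text = "BB"

# Serialize to plain text, which is what gets stored or transmitted.
xml_text = ET.tostring(record, encoding="unicode")

# Receiver: parse the text and read values by following the tags.
received = ET.fromstring(xml_text)
record_id = received.findtext("ID")
last_name = received.findtext("Name/Last")
print(xml_text)
```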