10. OBJECTIVES
The objective of this course is to impart the necessary knowledge of the
mathematical foundations needed for data science and to develop the
programming skills required to build data science applications.
Duration – 60 Hours (40L + 20C)
Mr. Dhruv Saxena, Assistant Professor (TEQIP-NPIU) 10
11. LEARNING OUTCOMES
At the end of this course, the students will be able to:
● Demonstrate understanding of the mathematical foundations
needed for data science.
● Collect, explore, clean, munge and manipulate data.
● Implement models such as k-nearest neighbors, Naïve Bayes,
linear and logistic regression, decision trees, neural networks and
clustering.
● Build data science applications using Python-based toolkits.
12. Data, Big Data and Challenges
Data Science
◦ Introduction
◦ Why Data Science
Data Scientists
◦ What do they do?
Major/Concentration in Data Science
◦ What courses to take.
13. Data All Around
Lots of data is being collected and warehoused
◦ Web data, e-commerce
◦ Financial transactions, bank/credit transactions
◦ Online trading and purchasing
◦ Social networks
14. How Much Data Do We Have?
Google processes 20 PB a day (2008)
Facebook has 60 TB of daily logs
eBay has 6.5 PB of user data + 50 TB/day (5/2009)
1000 genomes project: 200 TB
Cost of 1 TB of disk: $35
Time to read a 1 TB disk: about 3 hrs (at 100 MB/s)
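The read-time figure can be checked with a little arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a constant sequential throughput; the function name is illustrative.

```python
def read_time_hours(size_bytes: float, throughput_bytes_per_s: float) -> float:
    """Hours needed to read size_bytes sequentially at a fixed throughput."""
    return size_bytes / throughput_bytes_per_s / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

hours = read_time_hours(1 * TB, 100 * MB)
print(f"{hours:.2f} hours")  # roughly 2.78 hours, which the slide rounds to 3
```

Storage is cheap, but a single disk is slow to scan, which is one reason large data sets are spread across many machines.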
15. Big Data
Big Data is any data that is expensive to manage and hard to extract value
from
◦ Volume
  ◦ The size of the data
◦ Velocity
  ◦ The latency of data processing relative to the growing demand for interactivity
◦ Variety and Complexity
  ◦ The diversity of sources, formats, quality, and structures
17. What is Data Science?
Dealing with unstructured and structured data, Data Science is a
field that comprises everything related to data cleansing,
preparation, and analysis.
Data Science is the combination of statistics, mathematics,
programming, problem-solving, capturing data in ingenious ways,
the ability to look at things differently, and the activity of cleansing,
preparing, and aligning the data.
In simple terms, it is the umbrella of techniques used when trying
to extract insights and information from data.
18. What is Big Data?
Big Data refers to humongous volumes of data that cannot be processed effectively with
the traditional applications that exist. The processing of Big Data begins with the raw data
that isn’t aggregated and is most often impossible to store in the memory of a single
computer.
A buzzword used to describe immense volumes of data, both unstructured and
structured, Big Data inundates a business on a day-to-day basis. Big Data can be
analyzed for insights that lead to better decisions and strategic business
moves.
The definition of Big Data, given by Gartner, is, “Big data is high-volume, high-velocity
and/or high-variety information assets that demand cost-effective, innovative forms of
information processing that enable enhanced insight, decision making, and process
automation.”
21. What is Data Analytics?
Data Analytics is the science of examining raw data to draw conclusions about
that information.
Data Analytics involves applying an algorithmic or mechanical process to
derive insights, for example, running through several data sets to look for
meaningful correlations between them.
It is used in several industries to allow organizations and companies to
make better decisions as well as to verify or disprove existing theories or
models. The focus of Data Analytics lies in inference, which is the process of
deriving conclusions based solely on what the researcher already
knows.
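Looking for a correlation between two data sets can be sketched as follows. This is an illustrative example only: it uses the Pearson correlation coefficient (one common measure, not one the slides prescribe), and the `ad_spend` and `sales` figures are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: sales rise in a straight line with ad spend.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

r = pearson(ad_spend, sales)
print(round(r, 3))
```

A value near +1 or -1 suggests a strong linear relationship, and a value near 0 suggests none. Correlation alone does not show that one quantity causes the other.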
22. Types of Data We Have
Relational Data (Tables/Transaction/Legacy Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
◦ Social Network, Semantic Web (RDF), …
Streaming Data
◦ You can afford to scan the data only once
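When data can be scanned only once, statistics have to be updated as each value arrives instead of being computed from a stored copy. A minimal sketch, using Welford's online algorithm for the running mean and variance; the class name and the sample readings are illustrative.

```python
class RunningStats:
    """Mean and variance of a stream, updated one value at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        # Welford's update: constant memory, no need to keep past values.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):  # stand-in for a stream
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

Each reading is seen exactly once and then discarded, so memory use stays constant however long the stream runs.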
23. What To Do With These Data?
Aggregation and Statistics
◦ Data warehousing and OLAP
Indexing, Searching, and Querying
◦ Keyword based search
◦ Pattern matching (XML/RDF)
Knowledge discovery
◦ Data Mining
◦ Statistical Modeling
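As a small illustration of the aggregation item above, the sketch below totals a measure along one dimension, in the spirit of an OLAP roll-up. The transaction records and the `roll_up` helper are hypothetical; a real data warehouse would run this kind of query in a database engine.

```python
from collections import defaultdict

# Hypothetical transaction records.
transactions = [
    {"region": "North", "product": "laptop", "amount": 1200.0},
    {"region": "North", "product": "phone", "amount": 800.0},
    {"region": "South", "product": "laptop", "amount": 1100.0},
    {"region": "South", "product": "laptop", "amount": 900.0},
]


def roll_up(rows, key, measure):
    """Sum `measure` over all rows that share the same value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[measure]
    return dict(totals)


by_region = roll_up(transactions, "region", "amount")
by_product = roll_up(transactions, "product", "amount")
print(by_region)
print(by_product)
```

The same records answer different questions depending on which dimension is chosen for the roll-up.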
24. Big Data and Data Science
“… the sexy job in the next 10 years will be statisticians,” Hal Varian, Google Chief
Economist
The U.S. will need 140,000-190,000 predictive analysts and 1.5 million managers/analysts
by 2018 (McKinsey Global Institute, June 2011).
India will need 160,000 or more data scientists by 2020, and worldwide demand is
predicted to be around 2.7 million by 2020.
New Data Science institutes being created or repurposed – NYU, Columbia, Washington,
UCB,...
New degree programs, courses, boot-camps:
◦ e.g., at Berkeley: Stats, I-School, CS, Astronomy…
◦ One proposal (elsewhere) for an MS in “Big Data Science”
25. What is Data Science?
An area that manages, manipulates, extracts, and interprets knowledge from
tremendous amounts of data.
Data science (DS) is a multidisciplinary field of study with the goal of addressing the
challenges in big data.
Data science principles apply to all data – big and small.
Simply put: extraction of knowledge from large volumes of data that are structured or
unstructured.
26. What is Data Science?
Theories and techniques from many fields and disciplines are used to
investigate and analyze a large amount of data to help decision makers in
many industries such as science, engineering, economics, politics, finance,
and education.
◦ Computer Science
  ◦ Pattern recognition, visualization, data warehousing, high-performance computing,
    databases, AI
◦ Mathematics
  ◦ Mathematical modeling
◦ Statistics
  ◦ Statistical and stochastic modeling, probability
27. Why is it sexy?
Gartner’s 2014 Hype Cycle
30. Real Life Examples
Companies learn your secrets, shopping patterns, and preferences
◦ For example, can we know if a woman is pregnant, even if she doesn’t want us to know?
Target case study
Data Science and election (2008, 2012)
◦ 1 million people installed the Obama Facebook app that gave access to info on “friends”
31. Applications of Data Science
Internet Search
Search engines use data science algorithms to deliver the best results for search queries
in a fraction of a second.
Digital Advertisements
The entire digital marketing spectrum uses data science algorithms, from display banners to
digital billboards. This is the main reason digital ads get a higher click-through rate (CTR)
than traditional advertisements.
Recommender Systems
Recommender systems not only make it easy to find relevant products among the billions
available but also add a lot to the user experience. Many companies use such systems to
promote their products and suggestions in accordance with the user’s demands and the relevance
of information. The recommendations are based on the user’s previous search results.
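One simple way to base recommendations on past behavior is to suggest items that appear in the histories of users with overlapping interests. The sketch below is a toy co-occurrence approach, not how any particular company's recommender works; the user histories and the `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical search/purchase histories.
histories = {
    "user_a": {"tent", "sleeping bag", "stove"},
    "user_b": {"tent", "sleeping bag", "lantern"},
    "user_c": {"tent", "stove"},
    "user_d": {"novel", "bookmark"},
}


def recommend(user, histories, top_n=2):
    """Suggest unseen items, weighted by how much other users overlap with `user`."""
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)
        if overlap == 0:
            continue  # no shared interests, so this user's items carry no weight
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


suggestions = recommend("user_c", histories)
print(suggestions)
```

Production systems use far richer signals and models, but the idea is the same: past behavior of similar users predicts what a user may want next.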
32. Big Data for Retail
Whether a brick-and-mortar store or an online e-tailer, the answer to
staying in the game and being competitive is understanding customers
better in order to serve them. This requires the ability to analyze all
the disparate data sources that companies deal with every day, including
weblogs, customer transaction data, social media, store-branded
credit card data, and loyalty program data.
33. Applications of Big Data
Big Data for Financial Services
Credit card companies, retail banks, private wealth management
advisories, insurance firms, venture funds, and institutional investment
banks use big data for their financial services. The common problem
among them all is the massive amount of multi-structured data living
in multiple disparate systems, a problem that big data can solve. Big
data is therefore used in several ways, such as:
Customer analytics
Compliance analytics
Fraud analytics
Operational analytics
34. Big Data in Communications
Gaining new subscribers, retaining customers, and
expanding within current subscriber bases are top
priorities for telecommunication service providers. The
solutions to these challenges lie in the ability to combine
and analyze the masses of customer-generated data and
machine-generated data that is being created every day.
35. Applications of Data Analytics
Healthcare
The main challenge for hospitals, as cost pressures tighten, is to treat as many patients
as they can efficiently while improving the quality of care. Instrument
and machine data are increasingly used to track and optimize patient flow,
treatment, and equipment use in hospitals. It is estimated that a 1%
efficiency gain could yield more than $63 billion in global healthcare savings.
Travel
Data analytics can optimize the buying experience through analysis of mobile, weblog,
and social media data. Travel sites can gain insights into customers’ desires and
preferences. Products can be up-sold by correlating current sales with subsequent
browsing, increasing browse-to-buy conversions via customized packages and offers.
Personalized travel recommendations can also be delivered by data analytics based on
social media data.
36. Gaming
Data analytics helps in collecting data to optimize spending within as well as
across games. Game companies gain insight into users’ likes, dislikes, and
relationships.
Energy Management
Most firms are using data analytics for energy management, including smart-
grid management, energy optimization, energy distribution, and building
automation in utility companies. The application here is centered on
controlling and monitoring network devices and dispatch crews, and on managing
service outages. Utilities can integrate millions of data points on network
performance, letting engineers use the analytics to monitor the network.
37. Data Scientists
Data Scientist
◦ The Sexiest Job of the 21st Century
“They find stories, extract knowledge. They are not reporters.”
38. Data Scientists
Data scientists are the key to realizing the opportunities presented by big data. They bring
structure to it, find compelling patterns in it, and advise executives on the implications for
products, processes, and decisions.
39. What do Data Scientists do?
National Security
Cyber Security
Business Analytics
Engineering
Healthcare
And more ….
40. Concentration in Data Science
Mathematics and Applied Mathematics
Applied Statistics/Data Analysis
Solid Programming Skills (R, Python, Julia, SQL)
Data Mining
Database Storage and Management
Machine Learning and discovery
42. What is Machine Learning?
Machine learning (ML) is the study of computer algorithms
that improve automatically through experience.
It is seen as a subset of artificial intelligence.
Machine learning algorithms build a mathematical model
based on sample data, known as "training data", in order to
make predictions or decisions without being explicitly
programmed to do so.
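To make "a model built from training data" concrete, here is a minimal sketch of k-nearest neighbors, one of the models listed in the learning outcomes. The training points, labels, and function name are made up for illustration; no classification rule is written by hand, and the prediction comes entirely from the labeled examples.

```python
from collections import Counter
from math import dist

# Hypothetical training data: (feature vector, label) pairs.
training_data = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.0), "small"),
    ((6.0, 6.0), "large"),
    ((7.0, 5.5), "large"),
    ((6.5, 7.0), "large"),
]


def knn_predict(point, training_data, k=3):
    """Label `point` by majority vote among its k closest training examples."""
    neighbors = sorted(training_data, key=lambda example: dist(point, example[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


prediction = knn_predict((1.2, 1.4), training_data)
print(prediction)
```

Changing the training data changes the predictions without touching the code, which is the sense in which the program is not explicitly programmed for the task.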
43. What is Machine Learning?
Machine learning algorithms are used in a wide variety of
applications, such as email filtering and computer vision,
where it is difficult or infeasible to develop conventional
algorithms to perform the needed tasks.
Machine learning is closely related to computational
statistics, which focuses on making predictions using
computers.
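Email filtering is a classic case where hand-written rules are hard to maintain. A minimal sketch of a Naïve Bayes text classifier with add-one (Laplace) smoothing follows; the four training emails and the function names are invented for illustration, and a real filter would need far more data.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical labeled emails.
training_emails = [
    ("win cash prize now", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lecture notes for monday class", "ham"),
]


def train(examples):
    """Count words per label and collect the overall vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability under Naive Bayes."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_label in label_counts.items():
        score = log(n_label / total_examples)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the score.
            score += log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(training_emails)
print(classify("free cash prize", model))
print(classify("monday meeting notes", model))
```

The word probabilities are estimated from examples, which ties back to computational statistics: the classifier is a statistical model evaluated by a computer.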
50. NASSCOM Formative Assessments (Mid-training)
Formative assessment of students shall be conducted for 100 marks and the test duration shall be
between 45-60 min.
Post training assessment and certification shall be conducted after the successful completion of
training.
Only those students who are registered for and attending training on Future Skills shall be
eligible for mid-training and post-training assessment.
All assessments shall be conducted online and auto-proctored through NASSCOM SSC.
The assessment results shall be shared within 3 working days with the SPOC of the institute.
Formative Assessment scores are independent and shall not be counted in the final assessment
scores for certification.
Tentative Date – 16th August 2020
51. NASSCOM Formative Assessment
Syllabus for Data Sci. & Analytics
Module                       | No. of Questions | Type of Questions | Indicative Time/Module | Marks
Introduction to Data Science | 2                | MCQ & DC          | 2 min                  | 6
Mathematical Foundations     | 18               | MCQ, DC & ScB     | 20 min                 | 44
52. Multiple Choice Questions (MCQ)
In this type of question, the candidate is asked to choose one or more
responses from a limited list of choices. It also includes True/False (T/F)
questions, depending on the level of difficulty.
Scenario Based (ScB)
This question asks the candidate to describe how they might respond
to a hypothetical situation.
Direct Concept (DC)
This type of question revolves around the concept that a particular subject
deals with. The candidate is asked a direct question pertaining
to the concept of that particular subject. This can be an MCQ, Fill in
the Blank, or Multiple Response question.