The Analytics and Data Science Landscape
1. The Analytics & Data Science
Landscape
Philip E. Bourne
peb6a@virginia.edu
Analytics Challenges in Modern Tax Administration
November 16, 2020
2. Disclaimer
• I pay taxes but typically do not get
it right
• My PhD is actually in physical
chemistry
• I did work for the NIH for 3 years
3. On a Positive Note
• I have been working with “big” data for many years
• As Dean I am very interested in mapping the
capabilities of our students to the needs of the
workplace
• As a researcher I am concerned that the research our
school undertakes is for societal benefit
7. Everything is Digital Data to be Analyzed…
Play the data science game – pick an
object/subject and you will immediately see a
reason why data science is important …
9. If I were a tie maker I would be
undertaking a data science analysis right
now…
Large collection of
random images with
metadata before and
during the pandemic
Who is still wearing
ties?
• Age
• Profession
• Ethnicity
• Socioeconomic
status
• Location
• …..
Causality –
Does the pandemic
represent a shift in tie
wearing? If so by how
much?
Prediction –
What will be the market
post COVID?
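A minimal sketch of the "by how much?" question above, assuming the images have already been labeled as tie / no tie. The function name and all counts are hypothetical, chosen only for illustration. A difference in proportions measures a shift between the two periods; on its own it does not establish causality.

```python
import math


def two_proportion_shift(wearers_before, n_before, wearers_during, n_during):
    """Return (change in tie-wearing share, z statistic) for two image samples."""
    p_before = wearers_before / n_before
    p_during = wearers_during / n_during
    # Pooled share under the null hypothesis of no change between periods.
    pooled = (wearers_before + wearers_during) / (n_before + n_during)
    if pooled in (0.0, 1.0):
        raise ValueError("No variation in the data; z statistic is undefined.")
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_during))
    diff = p_during - p_before
    return diff, diff / std_err


# Hypothetical counts, not real data: 300 of 1000 images show a tie before
# the pandemic, 150 of 1000 during it.
diff, z = two_proportion_shift(300, 1000, 150, 1000)
print(f"change in share: {diff:+.2f}, z = {z:.2f}")
```

Breaking the same comparison down by age, profession, or location would show where the shift is concentrated.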
11. A Paradigm Shift Reflected in the Workforce
Increased Demand over the Past Five Years
74%
Artificial Intelligence specialists
Top industries hiring this talent: Computer software, internet,
information technology and services, higher education, consumer
electronics
37%
Data Scientist
Top industries hiring this talent: Information technology and
services, computer software, internet, financial services, higher
education
33%
Data Engineer
Top industries hiring this talent: Information technology and
services, internet, computer software, financial services, hospital
and healthcare
15. The Rising Demand for Data Scientists
UVA School of Data Science Graduate Job Placement*
2019: 100%
2018: 100%
2017: 100%
2016: 98%
2015: 97%
*for graduates seeking employment
Roles
Machine Learning Engineer, Director of Data
Science, Deep Learning Research Scientist,
Senior Data Analyst, Data Science Developer,
Consultant, Product Data Analyst, Financial
Engineer, Engagement Manager & more
Industries
● Finance
● Government
● Healthcare & Medicine
● Professional Sports
● Commerce
● Media
● Higher Ed
● Technology
17. A New School for a New Century
A School Without Walls
Mission
To be a national and international leader in responsible data science
emphasizing interdisciplinary collaboration which results in furthering
discovery, sharing knowledge, and societal benefit
18. Our Working Definition
• Use of the ever increasing amount of open, complex, diverse
digital data frequently in ubiquitous cloud environments
• Finding ways to ask and then answer relevant questions by
combining such diverse data sets
• Arriving at statistically significant conclusions not otherwise
obtainable
• Sharing such findings in a useful way
• Translating such findings into actions that improve the human
condition
19. Use Case – Data Integration
Researcher and Assistant Professor of Medicine
Dr. Thomas Hartka, also a current online Master's
in Data Science student, is combining two
disparate data sets—electronic health records and
DMV crash data—to save lives after motor vehicle
crashes.
“I enrolled in the MSDS program to
expand my research on automotive
safety. I have already used
techniques from classes in my work.
I hope to expand my research to
real-time analytics to improve
emergency room care.”
— Dr. Thomas Hartka, UVA School
of Medicine
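A minimal sketch of combining two disparate data sets, as in the use case above. All field names (`person_key`, `speed_mph`, `injury_severity`) and records are hypothetical. Real crash and health records rarely share a clean key, so actual linkage would likely need probabilistic matching and privacy safeguards that are not shown here.

```python
def link_records(crash_reports, er_visits, key="person_key"):
    """Inner-join crash reports to emergency room visits on a shared key."""
    visits_by_key = {}
    for visit in er_visits:
        visits_by_key.setdefault(visit[key], []).append(visit)

    linked = []
    for crash in crash_reports:
        # A crash with no matching visit is dropped (inner join).
        for visit in visits_by_key.get(crash[key], []):
            linked.append({**crash, **visit})
    return linked


# Hypothetical records for illustration only.
crashes = [
    {"person_key": "A1", "speed_mph": 45},
    {"person_key": "B2", "speed_mph": 20},
]
visits = [
    {"person_key": "A1", "injury_severity": 3},
]
linked = link_records(crashes, visits)
print(linked)
```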
20. Guiding Principles
• Excellence
• Integrity
• Diversity
• Openness and transparency
• FAIR data - the ability to Find, Access, Interoperate and Reuse data
• Innovation
• For the social good
21. Why FAIR?
[Adapted from Carole Goble]
• Data/code as first class citizen – Part of promotion and tenure
http://www.ncbi.nlm.nih.gov/pubmed/26207759
• Only 12% of data from research is preserved
22. Infrastructure
Commons - Platform Stack
https://datascience.nih.gov/commons
Compute Platform: Cloud or HPC
Services: APIs, Containers, Indexing,
Software: Services & Tools
scientific analysis tools/workflows
Data
“Reference” Data Sets
User defined data
Digital Object Compliance
App store/User Interface
Why not more like AirBnB?
https://doi.org/10.1371/journal.pbio.2001818
24. The 4+1 Model
The model is based
on the core insight
that all definitions of
data science assume
a pipeline and that
this pipeline forms a
parallel process
[From Raf Alvarado]
25. Our Representation of Data Science
The 4+1 Model
• Value – assuring
societal benefit
• Design -
Communication of the
value of data
• Systems – the means
to communicate and
convey benefit
• Analytics – models
and methods
• Practice – where
everything happens
[From Raf Alvarado]
26. The 4+1 Model Interplay
[From Raf Alvarado]
• Value + Design = Openness,
responsibility
• Value + Analytics = Human
centered AI, algorithmic bias
• Value + Systems =
sustainability, access,
environmental impact
• Design + Analytics = literate
programming, visualization
• Design + Systems =
dashboards, engineering
design
• Analytics + Systems = ML
engineering
27. The 4+1 Model
Integration: Practice of DS, Capstones
Analytics: Linear Models, Data Mining, Bayesian ML, Deep Learning, Text Analytics, Foundations of CS
Systems: Programming and Systems, Big Data Analytics
Value: Ethics of Big Data
Design: Practice of DS, Visualization
We strive to build a curriculum that aligns with our model
28. Distinctive Features
Foundational topics in analytics from linear models to data
mining and machine learning
An integrated curriculum developed in consultation with
practicing data scientists that incorporates a challenging
capstone experience
Applications and data drawn from different disciplines, e.g.,
science, business, and health
Instruction in the best practices in the management and
conduct of data science projects
Computational methods built on the latest techniques in R and
Python
Required course in data ethics
Emphasis on team science — data science is a team sport
29. Analytics and Machine Learning
STAT 6021 Linear Models - Multiple linear regression, logistic
regression. (R)
SYS 6018 Data Mining - Tree-based methods, kernel methods,
unsupervised learning. Uses An Introduction to Statistical Learning
by James, Witten, Hastie and Tibshirani. (R)
SYS 6014 Bayesian Machine Learning - Methods to handle
uncertainty and to apply per variable weight distributions (as
opposed to a single optimal value). (Python)
SYS 6016 Machine Learning - Focuses on neural networks,
including deep learning, convolutional neural networks, recurrent
neural networks, and autoencoders. (Python)
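A minimal sketch of the contrast drawn under SYS 6014 between a weight distribution and a single optimal value. It is not taken from the course material. It assumes a one-weight linear model y = w·x + noise with known noise variance and a zero-mean Gaussian prior on w; the data points are hypothetical.

```python
def least_squares_weight(xs, ys):
    """Single optimal value: the least-squares estimate of w in y = w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def posterior_weight(xs, ys, prior_var=1.0, noise_var=1.0):
    """Gaussian posterior (mean, variance) over w, assuming prior N(0, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision


# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_point = least_squares_weight(xs, ys)
w_mean, w_var = posterior_weight(xs, ys)
print(f"point estimate: {w_point:.3f}")
print(f"posterior: mean {w_mean:.3f}, variance {w_var:.3f}")
```

The posterior mean is pulled toward the prior mean of zero, and the variance states how uncertain the weight remains after seeing the data.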
30. Computer Science
CS 5010 Programming and Systems for Data Science -- Python,
Pandas, data analysis at scale and in context, some development
practices.
CS 5021 Foundations of CS -- Data structures, algorithms,
complexity, relational and noSQL databases; "CS in a box."
31. Data Ethics
Virtuous cycle between {computer science,
statistics, applied mathematics} and the
humanities
Exemplified with use cases
It’s a culture not a tick of the box
Computer Science
Statistics
Applied Mathematics
Humanities
32. Practice and Application of Data
Science
DS 6001 and 6003 focus on data design
Flow of data between Human and Machine domains
H → M: Establishing data so that it can be analyzed
M → H: Presenting results of analysis to the world
6001: Data engineering pipeline -- acquiring, cleaning, exploring
6003: Data product development -- presenting, visualizing, app
dev
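A minimal sketch of the acquiring, cleaning, exploring pipeline named under 6001, using only the Python standard library. The function names and the inline CSV are hypothetical and are not taken from the course.

```python
import csv
import io
import statistics

# Hypothetical raw input; the second record is missing its value.
RAW_CSV = "id,value\n1,10\n2,\n3,30\n"


def acquire(text):
    """Acquire: parse raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: drop rows with a missing value and convert types."""
    return [
        {"id": int(row["id"]), "value": float(row["value"])}
        for row in rows
        if row["value"].strip()
    ]


def explore(rows):
    """Explore: summarize the cleaned values."""
    values = [row["value"] for row in rows]
    return {"n": len(values), "mean": statistics.mean(values)}


summary = explore(clean(acquire(RAW_CSV)))
print(summary)
```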
33. Electives
CS 6160 Theory of Computation
CS 6444 Parallel Computing
CS 6501 Text Mining
CS 6750 Database Systems
DS 5001 Exploratory Text Analytics
DS 6559 Biomedical Cloud Computing Seminar
SARC 5400 Data Visualization
STAT 6250 Longitudinal Data Analysis
STAT 6260 Categorical Data Analysis
SYS 6023 Cognitive Systems Engineering
SYS 6050 Risk Analysis
SYS 6582 Reinforcement Learning
SYS 7001 System and Decision Sciences
34. Capstones
A parallel and culminating experience that focuses on a real world
data science problem
Emphasizes problem definition and scoping
Employs project management
Involves developing, evaluating, and creating a data product for a
client
Requires presentations, a proposal and a published paper (IEEE)
Teams of students work on separate projects under guidance of an
advisor
35. Furthering Discovery to Build a Better World
RESEARCH
Cybersecurity
Detecting broad-spectrum cyber
threats almost immediately after
they are launched through a $7.6
million Defense Advanced
Research Projects Agency
(DARPA) grant.
Environment
Using NASA data collected aboard the
International Space Station to examine climate
change in the Shenandoah National Forest
and beyond, and find solutions
Health & Medicine
Securing high-performance computing
equipment and personnel to allow
collaboration across the university on brain
science research like Autism, Alzheimer’s,
mental health disorders, traumatic brain
injuries and more.
Business
Discovering what makes a job
interview successful for the
candidate and the recruiter, and
how to mitigate bias in the
recruiting process
Democracy
Investigating how terrorist groups recruit
women through propaganda and examining
risk and threat assessment for extremist
violence perpetrated by women.
Education
Helping economically disadvantaged,
underrepresented populations pursue tailored
educational workforce pathways that have a
higher probability of leading them to success.
36. Applying Data Science Across Industries
“To tackle challenges in science and medicine.”
— Elizabeth Driskell, MSDS ‘20
“To inform public policy and government.”
— Bradley Katcher, MSDS ‘20
“I want to use data science to find a new way of
thinking.” — Alex Gromadzki, MSDS ‘21
“I want to use data science to solve complex business
problems.” — Ruslan Askerov, MSDS ‘21
“To address poverty and income inequality.”
— Arti Patel, MSDS ‘20
37. Growing the School
M.S. IN DATA SCIENCE
Residential & Online
2020
2020-2023
UNDERGRADUATE
COURSES
increase to 18
courses per AY
2021
PH.D. PROGRAM
2023
UNDERGRADUATE
MAJOR
Building occupied
Team Size (FTEs)
5
40
60
80
120
Exec. Ed.
38. SDS and IRS - Actions
• Workforce pipeline - awareness
• Continuing Ed opportunities
• Provision of synthetic data
• Funded and collaborative research
• Faculty, Capstone, Presidential Fellowship, PhD Internships
• IRS Admits – MSDS, PhD
• Join the corporate commons
• ….
40. SDS Faculty Research
Data Science faculty member or affiliated faculty | Website | Research interests
Nada Basit | https://engineering.virginia.edu/faculty/nada-basit | Machine Learning, Bioinformatics, Data Mining, Pattern Recognition
Phil Bourne | https://engineering.virginia.edu/faculty/philip-e-bourne | Multiscale Modeling Using Data Science Techniques; Early Stage Drug Discovery and Drug Repurposing; Early Stage Drug Methods and Tools for Macromolecular
Don Brown | https://engineering.virginia.edu/faculty/donald-e-brown-phd | Data Fusion, Knowledge Discovery, and Simulation Optimization
Sallie Keller | https://biocomplexity.virginia.edu/sallie-keller | Social and decision informatics, statistical underpinnings of data science, and data access and confidentiality
Daniel Mietchen | https://tools.wmflabs.org/scholia/author/Q20895785 | Computational Biology, Biodiversity integrating research workflows with the World Wide Web through open licensing, open standards, and open collaboration via
Rafael Alvarado | http://transducer.ontoligent.com/ | Cultural Analytics and Machine Learning, Digital Humanities, Text Analysis
Heman Shakeri | https://www.hemanshakeri.com/ | Structure and function of interconnected networks, often expressed via graphs that comprise a set of nodes and a set of connections between them
Jonathan Kropko | https://facultydirectory.virginia.edu/faculty/jk8sd | Methods to examine historical data, to test theories of voting in U.S. presidential elections, and to handle nonresponse in surveys
Michael Porter | https://engineering.virginia.edu/faculty/michael-d-porter | Event prediction, pattern and anomaly detection, and data linkage; applications for Criminology, Transportation, Terrorism, Defense, Security, Forensics, Business
Mohammad Fallahi-Sichani | new hire | Designing and building new experimental and computational tools to enable the analysis, interpretation and rational modulation of multi-scale processes that
Jack Van Horn | https://scholar.google.com/citations?user=i9bGqbgAAAAJ&hl=en | Psychology and Data Science, Cognitive Neuroscience
Pete Alonzi | https://github.com/alonzi |
Vicente Ordonez | https://engineering.virginia.edu/faculty/vicente-ordonez-roman | Computer Vision, Natural Language Processing and Machine Learning
Tim Clark | https://scholar.google.com/citations?user=k-iwlCUAAAAJ&hl=en | Next generation approaches for biomedical communications and data integration, including semantically integrated data repositories, claims and
Gerard Learmonth | https://www.researchgate.net/profile/Gerard_Learmonth | Generation and testing of pseudorandom number generators; Abstract database design; Strategic applications of information systems and technology
Hongning Wang | http://www.cs.virginia.edu/~hw5x/ | Data mining, machine learning, and information retrieval, with a special emphasis on computational user behavior modeling
Stephen Adams | http://www.nsfcvdi.org/wordpress/cvdi_personnel/steven-adams-ph-d/ | Adaptive Decision Systems Lab at UVA and his research is applied to several domains including activity recognition, prognostics and health management for manufacturing
Aidong Zhang | https://engineering.virginia.edu/faculty/aidong-zhang | ML, Data mining, bioinformatics
Jundong Li | http://people.virginia.edu/~jl6qk/ | Data Mining, Machine Learning, Social Computing, and Deep Learning
Brian Wright | https://www.linkedin.com/in/brian-wright-ph-d-90063027/ |