Data literacy connects the knowledge of deep concepts of statistics, computer science, visualization, and domain science with the practical understanding of the data---and the stories it tells---that we encounter in our lives. As data becomes more pervasive, we see teaching data literacy to students as part of a broad education as an imperative for 21st century. Learning how to arm the next generation with the tools to make the most of data, and avoid the common pitfalls of data, will be a major thrust of our efforts with the Moore/Sloan initiative at Berkeley, through the Berkeley Institute for Data Science.
Computational Training for Domain Scientists & Data Literacy
1. Computational Training for Domain
Scientists & Data Literacy
Joshua
Bloom
Moore
Founda1on
Data-‐Driven
Inves1gator
UC
Berkeley,
Astronomy
@pro%sb
AAAS,
San
Jose
Sunday
Feb
15,
2015
2. “knowledge and understanding of scientific concepts and processes required
for personal decision making, participation in civic and cultural affairs,
and economic productivity"
Scientific Literacy
National Science Education Standards Report, National
Academies of Science 1996
Data Literacy, as a broad new educational imperative
Data Literacy
“knowledge and understanding of analytical concepts and processes required
to draw fair conclusions from data that impact personal, professional, and
civic lives”
now @ AAAS
4. Our
ML
framework
found
the
Nearest
Supernova
in
3
Decades
..‣ Built
&
Deployed
robust,
real-‐?me
machine
learning
framework,
discovering
>10,000
events
in
>
10
TB
of
imaging
→
50+
journal
ar?cles
‣ Built
Probabilis?c
Event
classifica?on
catalogs
with
innova?ve
ac?ve
learning
hPp://?medomain.org hPps://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
5. What is the toolbox
of the modern
(data-driven) scientist?
domain
training
statistics
advanced
computing
database
GUI
parallel
visualization
Bayesian
machine learning
Physics
laboratory techniques
MCMC
MapReduce
6. And...How do we teach this with what
little time the students have?
What is the toolbox
of the modern
(data-driven) scientist?
7. Data-Centric Coursework, Bootcamps, Seminars, & Lecture
Series
BDAS: Berkeley Data Analytics
Stack
[Spark, Shark, ...]
parallel
programming
bootcamp
...and entire degree programs
Taught by CS/Stats
Aimed at Engineers & Programmers
Heading Toward Industry
10. ‣ 3 days of live/archive streamed lectures
‣ all open material in GitHub
‣ widely disseminated (e.g., @ NASA)
http://pythonbootcamp.info
11. Part of the
Designated
Emphasis in
Computational
Science &
Engineering at
Berkeley
visualization
machine learning
database interaction
user interface & web frameworks
timeseries & numerical computing
interfacing to other languages
Bayesian inference & MCMC
hardware control
parallelism
14. Time domain preprocessing
- Start with raw photometry!
- Gaussian process detrending!
- Calibration!
- Petigura & Marcy 2012!
!
Transit search
- Matched filter!
- Similar to BLS algorithm (Kovcas+ 2002)!
- Leverages Fast-Folding Algorithm
O(N^2) → O(N log N) (Staelin+ 1968)!
!
Data validation
- Significant peaks in periodogram, but
inconsistent with exoplanet transit
TERRA – optimized for small planets
Detrended/calibrated photometry
TERRA
RawFlux(ppt)CalibratedFlux
Erik Petigura
Berkeley Astro
Grad Student
Prevalence of Earth-size planets orbiting Sun-like stars
Erik A. Petiguraa,b,1
, Andrew W. Howardb
, and Geoffrey W. Marcya
a
Astronomy Department, University of California, Berkeley, CA 94720; and b
Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822
Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)
Determining whether Earth-like planets are common or rare looms
as a touchstone in the question of life in the universe. We searched
for Earth-size planets that cross in front of their host stars by
examining the brightness measurements of 42,000 stars from
National Aeronautics and Space Administration’s Kepler mission.
We found 603 planets, including 10 that are Earth size (1 − 2 R⊕)
and receive comparable levels of stellar energy to that of Earth
(0:25 − 4 F⊕). We account for Kepler’s imperfect detectability of
such planets by injecting synthetic planet–caused dimmings into
the Kepler brightness measurements and recording the fraction
detected. We find that 11 ± 4% of Sun-like stars harbor an Earth-
size planet receiving between one and four times the stellar inten-
sity as Earth. We also find that the occurrence of Earth-size planets is
constant with increasing orbital period (P), within equal intervals of
logP up to ∼200 d. Extrapolating, one finds 5:7+1:7
−2:2 % of Sun-like stars
harbor an Earth-size planet with orbital periods of 200–400 d.
extrasolar planets | astrobiology
The National Aeronautics and Space Administration’s (NASA’s)
Kepler mission was launched in 2009 to search for planets
that transit (cross in front of) their host stars (1–4). The resulting
dimming of the host stars is detectable by measuring their bright-
ness, and Kepler monitored the brightness of 150,000 stars every
30 min for 4 y. To date, this exoplanet survey has detected more
We searched for transiting planets in Kepler brightness mea-
surements using our custom-built TERRA software package
described in previous works (6, 9) and in SI Appendix. In brief,
TERRA conditions Kepler photometry in the time domain, re-
moving outliers, long timescale variability (>10 d), and systematic
errors common to a large number of stars. TERRA then searches
for transit signals by evaluating the signal-to-noise ratio (SNR) of
prospective transits over a finely spaced 3D grid of orbital period,
P, time of transit, t0, and transit duration, ΔT. This grid-based
search extends over the orbital period range of 0.5–400 d.
TERRA produced a list of “threshold crossing events” (TCEs)
that meet the key criterion of a photometric dimming SNR ratio
SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this cri-
terion, many of which are inconsistent with the periodic dimming
profile from a true transiting planet. Further vetting was performed
by automatically assessing which light curves were consistent with
theoretical models of transiting planets (10). We also visually
inspected each TCE light curve, retaining only those exhibiting a
consistent, periodic, box-shaped dimming, and rejecting those
caused by single epoch outliers, correlated noise, and other data
anomalies. The vetting process was applied homogeneously to all
TCEs and is described in further detail in SI Appendix.
To assess our vetting accuracy, we evaluated the 235 Kepler
objects of interest (KOIs) among Best42k stars having P > 50 d,
which had been found by the Kepler Project and identified as planet
candidates in the official Exoplanet Archive (exoplanetarchive.
ONOMY
Bootcamp/Seminar Alum
Python
DOE/NERSC computation
PNAS [2014]
15. Dangers
of
a
Cursory
Training/Teaching
Curriculum
16. 34 Fitting a straight line to data
0 50 100 150 200 250 300
x
0
100
200
300
400
500
600
700
y
y = 1.33 x + 164
0 50 100 150 200 250 300
x
0
100
200
300
400
500
600
700
y
0.0
1.0
1.0
1.0
0.0
0.0
0.0
0.00.0
0.0
0.0
0.0
0.00.0
0.0
0.0
0.0
0.0
0.0
0.0
Data analysis recipes:
Fitting a model to data⇤
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut f¨ur Astronomie, Heidelberg
Jo Bovy
Center for Cosmology and Particle Physics, Department of Physics, New York University
Dustin Lang
Department of Computer Science, University of Toronto
Princeton University Observatory
Abstract
We go through the many considerations involved in fitting a model
to data, using as an example the fit of a straight line to a set of points
in a two-dimensional plane. Standard weighted least-squares fitting
is only appropriate when there is a dimension along which the data
points have negligible uncertainties, and another along which all the
uncertainties can be described by Gaussians of known variance; these
D
F
Da
Ce
Ma
Jo
Ce
Du
De
Pr
th
di
lin
an
arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010
Sta7s7cal
Inference
Data
Literacy
in
the
Undergraduate
&
Graduate
Training
Mission
17. Versioning
&
Reproducibility
“Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed
preclinical studies are not reproducible.” McNutt, 2014
http://www.sciencemag.org/content/343/6168/229.summary
-‐
Git
has
emerged
as
the
de
facto
versioning
tool
-‐
Berkeley
Common
Environment
(BCE)
So]ware
Stack
-‐
“Reproducible
and
Collabora?ve
Sta?s?cal
Data
Science”
(Sta?s?cs
157:
P.
Stark)
Data
Literacy
in
the
Undergraduate
&
Graduate
Training
Mission
18. Undergraduate Curriculum
• Data
Science
Task
Force
report
generated
at
Chancellor/Provost
behest
• Recommenda?ons:
– A
founda?onal
lower-‐division
course
accessible
to
everyone
(sta?s?cal
&
computa?onal
thinking),
with
broad
applicability
– A
suite
of
courses
that
advance
the
use
of
data
sciences
within
the
broad
swath
of
disciplines
available
to
Berkeley
undergraduates
– A
high-‐quality
minor
for
students
who
wish
to
couple
data
sciences
with
their
major
discipline
– A
data
sciences
major
for
those
students
who
want
this
to
be
their
primary
area
of
study
20. Grand
challenges
at
the
intersec7on
of:
• Open
science
and
reproducible
data
analysis
• Responses
of
coupled
human-‐natural
systems
to
rapid
environmental
change
(water
resources,
land
use
planning,
agriculture,
etc.)
• Development
of
evidence-‐based
solu?ons
in
public
policy,
resource
management
and
environmental
design
• Five
schools
and
colleges:
LeFers
&
Science,
Natural
Resources,
Engineering,
Public
Policy
and
Environmental
Design
• Two
year
graduate
training
program;
Master’s
and
PhD
students
• Four
cohorts
of
up
to
20
students
each,
first
cohort
in
fall
2015
Data Sciences for the 21st Century (DS421): Environment and Society
via Prof. David Ackerly (Integrative Biology)
21. Data Sciences for the 21st Century (DS421): Environment and Society
via Prof. David Ackerly (Integrative Biology)
22. http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Founded
in
December
2013
as
a
result
of
a
year+
long
na?onal
selec?on
process
$37.8M
over
5
years,
along
with
University
of
Washington
&
NYU
‣ An
accelerator
for
data-‐driven
discovery
‣ An
agent
of
change
in
the
modern
university
as
Data
Science
takes
hold
‣ An
incubator
for
the
next
genera?on
of
Data
Science
technology
and
prac?ce
23. Leadership
from
across
the
spectrum
Joshua
Bloom,
Professor,
Astronomy;
Director,
Center
for
Time
Domain
Informa?cs
Henry
Brady,
Dean,
Goldman
School
of
Public
Policy
Cathryn
Carson,
Associate
Dean,
Social
Sciences;
Ac?ng
Director
of
Social
Sciences
Data
Laboratory
"D-‐Lab”
David
Culler,
Chair,
EECS
Michael
Franklin,
Professor;
EECS,
Co-‐Director,
AMP
Lab
Erik
Mitchell,
Associate
University
Librarian
Faculty
Lead/PI:
Saul
PerlmuFer,
Physics,
Berkeley
Center
for
Cosmological
Physics
Fernando
Perez,
Researcher,
Henry
H.
Wheeler
Jr.
Brain
Imaging
Center
Jasjeet
Sekhon,
Professor,
Poli?cal
Science
and
Sta?s?cs;
Center
for
Causal
Inference
and
Program
Evalua?on
Jamie
Sethian,
Professor,
Mathema?cs
Kimmen
Sjölander,
Professor,
Bioengineering,
Plant
and
Microbial
Biology
Philip
Stark,
Chair,
Sta?s?cs
Ion
Stoica,
Professor,
EECS;
Co-‐Director,
AMP
Lab
24. “I
love
working
with
astronomers,
since
their
data
is
worthless.”
-‐
Jim
Gray,
Microso'
25. Created by Natalia Bilenko.
Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/
26. Established
CS/Stats/Math
in
Service
of
novelty
in
domain
science
vs.
Novelty
in
domain
science
driving
&
informing
novelty
in
CS/Stats/Math
“novelty2
problem”
an
extra
Burden
for
Forefront
Scien?sts
hPps://medium.com/tech-‐talk/dd88857f662
27. ỉπ vs.
21st Century Educational Mission: Produce Gamma-shaped people
deep domain skill/knowledge/training
deep methodological knowledge/skill
deep domain or methodological skill/knowledge/training
strong methodological or domain knowledge/skill
Goal: empower teams of gamma’s to excel
28. Summary
Data
science
at
Berkeley
is
thriving
and
BIDS
is
its
intellectual
home
-‐
incuba?ng
novel
science
&
methodologies
-‐
teaching
&
training
-‐
innovate
environments,
interac?ons,
&
networks
Teaching
impera7ves
emerging
around
data
literacy
-‐
new
campus-‐wide
undergraduate
curriculum
(star?ng
with
a
founda?on
for
all
1st
years)
-‐
new
mul?-‐year,
mul?-‐disciplinary
graduate
program