Data Science Highlights

Data Scientist
Square - San Francisco Bay Area
Job Description
Square is hiring a Data Scientist on our Risk team. The Risk team at Square is responsible for enabling growth while mitigating financial
loss associated with transactions. We work closely with our Product and Growth teams to craft a fantastic experience for our buyers and
sellers.
Desired Skills & Experience
As a Data Scientist on our Risk team, you will use machine learning and data mining techniques to assess and mitigate the risk of every
entity and event in our network. You will sift through a growing stream of payments, settlements, and customer activities to identify
suspicious behavior with high precision and recall. You will explore and understand our customer base deeply, become an expert in
Risk, and contribute to a world-class underwriting system that helps Square provide delightful service to both buyers and sellers. 
 
To accomplish this, you are comfortable writing production code in Java and conducting exploratory data analysis in R and Python. You
can take statistical and engineering ideas from prototype to production. You excel in a small team setting and you apply expert
knowledge in engineering and statistics. 
 
Responsibilities
1. Investigate, prototype and productionize features and machine learning models to identify good and bad behavior.
2. Design, build, and maintain robust production machine learning systems.
3. Create visualizations that enable rapid detection of suspicious activity in our user base.
4. Become a domain expert in Risk.
5. Participate in the engineering life-cycle.
6. Work closely with analysts and engineers.
Requirements
1. Ability to find a needle in the haystack. With data.
2. Extensive programming experience in Java and Python or R.
3. Knowledge of one or more of the following: classification techniques in machine learning, data mining, applied statistics, data
visualization.
4. Concise verbal and written articulation of complex ideas.
Even Better
1. Contagious passion for Square’s mission.
2. Data mining or machine learning competition experience.
Company Description
Square is a revolutionary service that enables anyone to accept credit cards anywhere. Square offers an easy to use, free credit card
reader that plugs into a phone or iPad. It's simple to sign up. There is no extra equipment, complicated contracts, monthly fees or
merchant account required. 
 
Co-founded by Jim McKelvey and Jack Dorsey in 2009, the company is headquartered in San Francisco.

Sense Maker Segment
Sense makers need to create and/or employ insights to accomplish
their business goals and satisfy their responsibilities.

These insights emerge from independent and collaborative discovery
efforts that involve direct interaction with discovery applications, and
participation in discovery environments.
Insight Consumer
Analyst
Casual Analyst
Data Scientist
Analytics Manager
Problem Solver

Data Scientist
Data Scientist / Senior Research Scientist
Data Scientists work with other members of the Data science team, using emerging methods and tools to engage with ‘Big
Data’ from a variety of external and internal sources. Data Scientists aim to generate actionable insights that transform the
organization; enhance existing products, services and operations; and identify, define and prototype new data-driven
products, services, and offerings.
They have advanced analytical skills and/or a specialized educational background, and rely on open-source and custom-
created tools, to address the ad-hoc and open-horizon questions the Data Science team takes on. Data Scientists collaborate
with Insight Consumers, evolving and publishing insights and prototypes of new offerings.
Business Goals & Work Setting
• Create new data-driven products, services, business opportunities
• Transform the business with insights derived from Big Data
• Create effective tools and infrastructure for the data science group
and other analytical groups within the organization
• Develop prototypes based on proprietary or open source tools
• Prototype new ways to visualize and understand data relationships
• May work within a business unit, providing analytical capability to
that unit only, or a centralized Data Science group
Discovery Needs
• Solves complex, critical problems and significant, unique issues
• Has numerous, dynamic, ill-formed questions with
unpredictable needs for data, visualization, discovery capabilities
Discovery Tools
• Open source tools and platforms for big data, ETL, visualization,
analysis, statistics: Hadoop, Cassandra, Kafka, Voldemort
• Open source analytical languages: R, Hive, Pig
• Custom-developed analytical tools
Engagement w/ Discovery Applications
• Creates custom discovery applications to suit their own needs
• Application lifecycle involvement: rolls their own from
scratch, iterates and then publishes to wider audiences /
productizes
• Original author of all discovery solution elements: data / data
sets, information models, discovery applications and
workspaces
• Shares / publishes insights to decision-making groups &
social forums in the business
Collaboration
• Works with Engineers and Software Architects to create prototypes
and products
• Collaborates with Data Scientists on ill-formed questions
Skills & Expertise
• Data management, analytics modeling and business analysis
• Prototyping / software engineering
• Discovery: advanced statistics, quantitative and qualitative
analysis, machine learning, data mining, natural language
processing, computational linguistics, broad knowledge of applied
mathematics, statistical methods and algorithms

Profiles & Discovery Problem Spectrum
[Figure: Data Scientist, Analyst (all), Casual Analyst, and Problem Solver profiles placed along a spectrum of discovery problems from ill-formed to well-formed]

http://upload.wikimedia.org/wikipedia/commons/4/44/DataScienceDisciplines.png

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

What sort of animal?
They seem different than analysts:
• problem set
• relationship to discovery tools
• skills and professional profile
• discovery / analytical methods
• perspective
• workflow and collaboration
Are they? How?

Areas of Investigation
• Workflow
• Environment
• Organizational model
• Pain points
• Tools
• Data landscape
• Analytical practices
• Project structure
• Unmet needs

Discussion Guide
Can you please walk me through a recent or current project?
a. How was the project initiated?
b. How defined was the business problem in the beginning? Did the problem change?
c. Where, or from whom, did you obtain data sets? How did you make the decision?
d. Describe the data you used: What did the data sets look like? How big were they? Were they
structured or unstructured?
e. What tools or techniques did you use to do the analyses? Did they map to the specific steps you
mentioned just now?
f. How did you decide these were the tools/techniques to use? To what extent were these decisions
made by yourself and to what extent were they standardized by your group/team?
g. How did you present the results of your analyses? What tools did you use? What do you like and
dislike about your current tool set?
h. Which stage of this project was the most challenging? To what extent did the tools satisfy what you
intended to do? What features were lacking?
i. How much collaboration was there during each stage of the project?
i. Background and role of collaborators
ii. Collaboration modes
iii. Types of information shared

Thinking about the projects you have worked on, is there a common approach you take to address these
problems?
How did you decide on this approach/tools?

Business Analytics (future) = Data Science (now)

Dana Data Scientist
Creates data-driven insights, offerings, and resources to transform the organization

“I’ll do whatever it takes - wrangle, extract, manipulate, analyze, experiment, prototype - to use data
to drive value & innovate”

Work Experience: 10 Years
Education: Ph.D. Statistics, MS Bio-Informatics
Job Title: Senior Data Scientist
Company: LinkedIn

Background
Dana is a Senior Data Scientist who has worked at LinkedIn for 5 years. Dana’s education includes a
Ph.D. in Statistics and an MS in Bio Informatics. Dana’s previous work includes positions in academic
research groups as a doctoral candidate and post-doc, as well as software engineering roles in the
Internet & technology industries.

Work Context
• Dana works with several other data scientists and her Analytics Manager on a centralized team
• Dana and her colleagues aim to create data driven insights, features, resources, and offerings that
deliver strategic value to LinkedIn
• Dana works with Analysts on other teams to define and create discovery tools, data sets, and methods
for use by their groups at LinkedIn.
• Dana & team are visible & well established within LinkedIn, and have a voice in product strategy and
operational context; they have a high degree of autonomy in defining data science projects
• Dana works with Insight Consumers to suggest and determine potential new data driven offerings to
prototype and evaluate.

Typical Discovery Scenarios & Problems
• How can we leverage data to increase online engagement with LinkedIn?
• How should we measure engagement & what factors drive it?
• What aspects of a personal profile are most likely to encourage / discourage new connections
between people?
• How can we increase people’s activity and contributions to topical discussion groups?
• What factors drive the effectiveness of our marketing campaigns?
• Why did one of our marketing campaigns work exceptionally well?
• How can we leverage data to help recruiters identify and communicate effectively with qualified and
potentially available candidates?

Activities
• Mines, analyzes, & experiments with data to identify patterns, trends, outliers, causal factors,
predictive models, & opportunities
• Defines and explains newly devised measurements, predictive models, & insights
• Compares effectiveness of operations at achieving company goals for engagement, growth, data quality
• Produces & explores new data sets
• Collaborates with other data scientists to capture new data streams
• Prototypes new data driven site features/offerings
• Runs data based experiments to test/evaluate models, hypotheses & prototypes
• Communicates & explains analyses to colleagues & Insight Consumers

Key Goals
• Leverage data to support the org mission
• Enhance products & services with data-driven insights and features
• Use data to identify new opportunities and prototype/drive new customer offerings
• Create useful data sets/streams, measures, & resources (e.g., data models, algorithms, etc.)

Tools
• Open source data manipulation, mining & analysis tools including R, Pig, Hadoop, Python, etc.
• Statistical packages such as SAS, SPSS, etc.
• Custom analytical tools built using open source components and languages

Pain Points
• Defining and capturing useful measures of online attention
• Getting all the data analytic tools to work together properly
• No current workflow support or tools for data wrangling, analysis, experimentation, and prototyping

Wish List
• Effective tools to help experiment with and evaluate value / utility of features and activities for users
• Ability to rapidly prototype data-driven features w/out risk of online service disruptions

Sample Workflow
• Analyze & Identify causal/predictive factors: Who are the best candidates to contact for a job based
on recruiter needs and profile content?
• Prototype & Experiment with data driven feature: How can we prototype/evaluate this w/out
disrupting the site?
• Gather Data & Analyze Results: Use descriptive, inferential, and predictive statistics to evaluate
results
• Summarize & Communicate: Review findings with colleagues; summarize, visualize, and communicate
key findings to Insight Consumers/decision makers

Nature of sense making activity
Business Analytics    Data Science
Intuitive             Empirical
Manual                Augmented
Gradual               Accelerated
Individual            Cooperative*

The Essence
• Empirical perspective
• Business imperatives drive activities
• Analytical approach
• Recipe is always the same
• Engineering always present
• Data challenges are paramount: they consume 60% - 80% of time and effort
• Data volumes range from huge to moderate (PB > MB)
• Domain often drives analysis
• Data scientists already have self-service
• Some new problems, many the same
• Use ‘advanced’ analytics, not conventional BA
• Innovate by applying known analyses to new data
• Current workflow fragmented across tools and data stores
• Success can be a model, product, insight, infrastructure, tool

State of the Discipline
A small set of formally constituted Data Science teams at major Internet and
technology companies (Facebook, Google, Microsoft, Yahoo, Twitter,
LinkedIn, eBay, Amazon) lead the field in most identifiable respects:
• maturity of practice - sophistication of methods, quality of infrastructure
• history and tenure as formal function / group
• business integration and impact
• internal and public visibility
• pace of innovation in methods, tools, architecture
• quality and rate of contributions to open source and other tools /
infrastructure
• role in the industry and public discourse on data science: visibility in
community, publication of experiments and findings, etc.

Tooling & Infrastructure
Leading shops have their own comprehensive and often home-built / heavily
customized data science environments, tools, infrastructure.

This infrastructure is aligned to the particulars of their domain and business.
Their data science environments are sometimes considerably more 'mature'
than those of other shops.

The large majority of existing data science teams and practices are 'followers'
of these leaders, in the sense that while they have idiosyncratic problems and
varying domains to address, they rely on innovation from the DS leaders to
guide the evolution of their data science practices.

Their environments reflect a mix of some purpose-built data science
components, and infrastructure extended / adapted from business analytic
needs such as BI.

Tooling & Infrastructure
Many organizations are establishing new data science capabilities. A minority
of these create new data science teams / practices from scratch without
building out other conventional analytical capabilities such as BI. They will
need new environments to support data science activities, and may leapfrog
older generations of analytic environment, following leaders by directly
creating new 'stacks' oriented more specifically for data science.

The majority of organizations are creating new data science capabilities by
building on existing analytical groups and functions. In terms of environments
and infrastructure, these organizations have existing analytical environments
aligned to BI and other business analytic functions, not specifically adapted to
data science needs. Cumulative investment in these environments can be
very high.

New teams will need new tools. Existing teams will need new tools to support
new discovery activities.

Berkeley Data Analytics Stack is the most visible open source 'platform' at the
moment. No interview participants mentioned it.

Organizational Model
Data science capability is provisioned via standard org models (ranging across
in house, external, centralized, embedded, etc.).

The ways data science teams and practice groups are managed, and their
relationship to the orgs they are part of, seem to be conventional / familiar.

We can summarize the landscape of organizational models for providing data
science capability by plotting the size of data science team / pool of resources
vs. the 'distance' from the problem / need.

The landscape reflects common patterns for specialized expertise.

This could shift over time as discovery maturity increases overall, first within
the analytics industry, then within the general business realm.

Discovery Problems
Discovery efforts are set in motion by Insight Consumers, not Data Scientists.
The success of efforts is gauged by Insight Consumers. Insights are used by
the originating Insight Consumers, not other analysts, and rarely other Insight
Consumers.

Multiple hypotheses are often explored in parallel, supported by multiple data
sets / interim data products.

Usefully reconstructing analytical workflows requires a linear history of all steps
/ activities.

Discovery Problems
Data science resources - individuals, projects, and teams - are always aligned
to business areas or strategic goals: e.g. the Content Insights team at
LinkedIn supports analytical goals related to LinkedIn's major push to enhance
its media presence and role in media.

At large scales of group, this inverts - for example within a company,
communities of practice are aligned to a discipline, and will include members
whose activities span the needs of all the business units.

No analytical efforts begin completely open-ended, with no idea of the nature
or import of resulting insights.

There is almost always a hypothesis, or more than one. (Even in more
academic / research oriented settings, there is no basic research - all
investigations are purposive and grounded in defined business intent.)

PROBLEM NATURE
• Well-defined
• Explicit form: Why, What, and How questions
• Implicit form: which question
• Hypotheses are driven by domain knowledge or work experience
• Not very different from the problems business analysts address

Businesses address the same problems they have been working on, which are
determined at the very beginning, before resources are allocated. Data scientists
do not necessarily contribute to initiating new problems.

Outcomes
[Figure: outcomes of Data Science and of Analysts; labels: Insight, Model, Data Product, Product]

Skills Portfolio
Data scientists use three kinds of languages: analysis (R, Matlab), scripting
(Python, Perl), data processing (SQL, Pig).

Analytical environments should allow integration of these languages and the capabilities they
offer.

Every analyst has a preferred language / method and defaults to it for
analytical efforts. This holds true within centralized analytical teams.
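
As a rough illustration of what integrating the three kinds of languages can look like, the sketch below uses SQL for data processing, Python for scripting, and a statistics library for analysis in one place. The payments table, its columns, and the sample rows are hypothetical and not taken from any interview.

```python
import sqlite3
import statistics


def load_payments(rows):
    """Data processing step: load rows into an in-memory SQL table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE payments (seller TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    return conn


def totals_by_seller(conn):
    """SQL does the set-oriented aggregation."""
    query = "SELECT seller, SUM(amount) FROM payments GROUP BY seller ORDER BY seller"
    return dict(conn.execute(query).fetchall())


def summarize(totals):
    """Analysis step: descriptive statistics on the aggregated values."""
    values = list(totals.values())
    return {"mean": statistics.mean(values), "median": statistics.median(values)}


# Scripting glue: made-up sample rows pass through both steps.
rows = [("a", 10.0), ("a", 30.0), ("b", 5.0), ("c", 15.0)]
totals = totals_by_seller(load_payments(rows))
summary = summarize(totals)
```

Keeping each step behind a small function is one way to let analysts with different preferred languages swap a single stage without rewriting the others.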

Discovery Maturity
• Discovery is poorly understood and little recognized as a capability. It is rarely
mentioned by any of the Data Science / Analytics professionals spoken with.
When mentioned, it is seen as a small-scale activity and / or a desired
outcome of particular projects, not something the organization needs to be
able to do in an ongoing / comprehensive / large-scale fashion such as
understanding customers.
• Data scientists understand their own challenges in terms of what stages /
aspects of a data-centric workflow require greatest time, effort, or present most
complexity or potential for introducing uncertainty / ambiguity into the efforts.
Broader framings are the need for or desire to work on data-driven products,
or transform and improve business through offering data-centered insights.
• Product-centric data scientists (aim directly at making data-driven offerings)
are a small minority of the active community. Many more are engineers with
strong data skills, and many more analysts trying to acquire data science skills
/ perspective.

Supporting Factors
• Regardless of particulars, the core ingredients remain the same: analytical
skills and perspective, domain knowledge, engineering / tooling skills and
perspective
• In data science practices, analysis is always enabled by engineering - either
localized to the data science team, or centrally provided via IT.
• In BI practices, analysis is always enabled by IT and systems consultants /
integrators (in house or external).
• Leading DS groups rely on a number of hybrid approaches to support data
cleansing and the evaluation of models, insights, and results - e.g. crowd
source prep of data and checking of results for prototypes and experiments.
• Data scientists rarely productionize code, analytical workflows, analytical tools.
Engineers / IT convert 'prototype' artifacts created by data scientists into
production code / tools.

Perspective
Analytical
The analytical perspective is the center of definition for all analytical roles.
Contrast with engineers, who "make stuff". Analytical roles figure things out
for some purpose, whether to build a model that informs a product prototype or to provide
insight.

Empirical
The empirical perspective is distinct from the analytical perspective, and
marks 'true' data scientists. It revolves around framing and testing
hypotheses formally and informally, often requires validation and interrogation
of experimental methods and results by others, and expects a significant degree of
transparency at (all) stages of the analytical effort.

Cooperation and Collaboration
• Discovery efforts are structured as individual efforts - insights come from
individual analytical engagement with data sets.
• Collaboration between analysts is asynchronous.
• Diversity of analytical tools / languages in practice = barrier to cooperation and
collaboration.
• There is little re-use of analytical insights by analysts to further other efforts.
• When tools and/or problem domains are stable / known, analysts create
individual and group assets for reuse - e.g. R script libraries, code snippets for
SAS, templates for data set file formats and structures
• Intermediate work products created during analytical work (data sets / subsets,
code, analytical scripts, algorithms, interim results, hypotheses) are perceived as
often irrelevant or throwaway, if not outright wrong. Little investment is made
to annotate / preserve intermediate work products for individual or group re-use,
sharing, review.

THE MANY SHADES OF COLLABORATION
Independent: Have-it-all type data scientist (I know, I design & I implement)
Linear: Complementary (Analysts know, data scientists design, engineers implement)
Project-based: The missing piece (Data scientists lead or support engineers)
Consultancy: From abstract to concrete (Some data scientists know & design, some other
data scientists implement)

Data Landscape
• The physical location of data - where stored / what environment - is a
significant cost factor for almost all aspects of analytical work.
• Distributed data (managed / located in multiple stores) increases costs for
many individual steps in analytical workflows.
• Distributed data costs often = barrier to conducting insightful analysis using
multiple techniques / steps. Default to basic / simple analysis to avoid high
effort / low probability of success.
• For analysts with low levels of db / data wrangling skill, even marginal
distributed data costs = preventative barrier for engaging with data.
• Most analysts reported having to migrate all of the data sets into the same
data processing framework to begin analysis. [If all the data were in one
place...]

DATA NATURE
• Messy: various forms (Web logs, web pages, genome data, sales revenues, ...)
• Scattered: Data scientists have to search in the wild (outside of enterprise
databases)
• Started “Big”, ended “Lean”: Meaningful data units are small in size
• Standardization is key to all data science work: why engineers become data scientists

Data scientists are “data foragers” and “data format equalizers”. They have the ability
to manipulate large data sets and gradually narrow the data sets down to the exact
units needed for analysis.
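
A minimal sketch of the "format equalizer" and "narrowing" ideas, assuming two hypothetical sources (a CSV export and a JSON-lines log) that describe the same kind of record with different field names. The schema, field names, and threshold are illustrative assumptions only.

```python
import csv
import io
import json


def from_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def from_json_lines(text):
    """Parse one JSON object per line into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def equalize(record):
    """Map a source-specific record onto one shared schema: user and amount."""
    user = record.get("user") or record.get("user_id")
    amount = float(record.get("amount", 0))
    return {"user": str(user), "amount": amount}


def narrow(records, min_amount):
    """Keep only the units needed for the analysis at hand."""
    return [r for r in records if r["amount"] >= min_amount]


# Made-up inputs in two different formats.
csv_text = "user,amount\nalice,12.5\nbob,3\n"
jsonl_text = '{"user_id": 7, "amount": 40}\n{"user_id": 8, "amount": 1.5}\n'

combined = [equalize(r) for r in from_csv(csv_text) + from_json_lines(jsonl_text)]
working_set = narrow(combined, min_amount=10)
```

The data starts "big" (every record from every source) and ends "lean" (only the records that matter for the question).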

Algorithms and Analytical Tools
• Well-known algorithms and methods are used to plan and structure
experiments, discover insights, drive the creation of new models, evaluate the
effectiveness of new models & products.
• The algorithm and method are often determined by domain, such as TF-IDF
for IR and Smith-Waterman for bioinformatics.
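
To make the TF-IDF reference concrete, here is a small sketch of one common variant (term frequency as count divided by document length, inverse document frequency as the log of N over document frequency). Other weightings exist; the whitespace tokenizer and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["risk model for payments", "risk model for search", "payments fraud"]
scores = tf_idf(docs)
```

Terms that appear in few documents (here "fraud" and "search") receive higher weights than terms shared across documents.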

PROCESS NATURE
• Wicked: Solutions can rarely be defined in advance
• Iterative three-step cycle: Data collection, data cleansing, & data analysis
• Trial-and-error: Hypothesis revision, hypothesis validation, & data re-collection
• Ad-hoc analysis and chance encounters

Data scientists provide new perspectives to address old problems. The path to the
solution is usually exploratory, but the goal has always been clear and pre-defined.

Data Science Workflows
http://strata.oreilly.com/2013/09/data-analysis-just-one-component-of-the-data-science-workflow.html

Data Science Workflow
• Frame problem / goal of effort
• Identify and extract data to be used in effort from whole corpus / totality of
available data
• Exploratory identification and selection of working data for use in
experiments
• Define experiment(s): hypothesis / null hypothesis, methods, success criteria
• Derive insight(s)
• Wrangle, process, visualize, interpret
• Codify / create new model reflecting insights outcomes from experiments
• Validate new model(s)
• Provision training data
• Train new model
• Validation and outcome of training model
• Hand-off for implementation on production systems / as production code
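
The "define experiment" step above (hypothesis / null hypothesis, method, success criteria) can be sketched as follows. This is one possible method, a permutation test on the difference in means; the sample values, the significance level, and the function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(control, treatment, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Null hypothesis: group labels are exchangeable (the treatment has no effect).
    """
    rng = random.Random(seed)
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Success criterion, fixed before looking at the data (assumed level).
ALPHA = 0.05
control = [10, 11, 9, 10, 12, 10, 11, 9]
treatment = [15, 16, 14, 15, 17, 16, 15, 14]
p_value = permutation_test(control, treatment)
reject_null = p_value < ALPHA
```

Fixing the success criterion before running the experiment is what separates an experiment from an after-the-fact reading of the data.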

Analysis Workflow & Activities
• Empirical analysis of subsets of data
• Understand topology of data, boundaries (sets / subsets, complete corpus,
totality of data)
• Outlier identification and profiling
• How significant are outliers to overall topology
• Comparative exclusion and profiling of resulting data subsets to understand their role,
discover principal components
• Find and analyze patterns, areas of interestingness / deserving attention
• Find and analyze central actors / factors (in existing model that produced
source data, in topology of working data, in patterns, etc.)
• ID and understand their impact on local and global data topology and primary metrics if in
several ways / more than one axis / at the same time
• Discover and analyze relationships amongst central actors
• Understand cycles, trends, changes (dynamic characteristics) for core
actors, topology, patterns and structure
• Understand causal factors
• Codify / create new model reflecting insights & outcomes from experiments

• dynamic working data sets & subsets
• iterative
• experimental frame
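
The outlier identification and comparative exclusion activities listed above might look like the sketch below. It assumes Tukey's interquartile-range fences as the outlier rule, which is only one of many possible choices, and uses made-up values.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using Tukey's IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


def outlier_impact(values):
    """Compare the mean with and without outliers (comparative exclusion)."""
    inliers, outliers = iqr_outliers(values)
    return {
        "outliers": outliers,
        "mean_all": statistics.mean(values),
        "mean_without_outliers": statistics.mean(inliers),
    }


amounts = [12, 14, 15, 15, 16, 17, 18, 19, 20, 250]
report = outlier_impact(amounts)
```

Comparing the two means gives a quick read on how significant the outliers are to the overall shape of the data.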

Key Workflows
Insight Consumer <> Data Scientist
originate, define, address discovery effort

Data Scientist > Data Engineer
create & evolve apps to address new & in-progress efforts

Analyst <> Analyst
define & address in-progress discovery efforts

Data Scientist > internal networks
create & curate archive & community

Needs
What are the most common and useful statistical techniques you use during
discovery and analysis efforts?

What statistical capabilities or functions would be very useful if provided within
discovery applications, and where would they be useful?
“(1) The most commonly used statistical techniques used to date (in our strategic planning work) are:
dimensionality reduction (partition clustering, multiple correspondence analysis), factor analysis,
partition clustering (k-means, k-medoids, fuzzy clustering), cluster validation techniques
(silhouette, Dunn’s index, connectivity), multivariate outlier detection, linear regression, and
logistic regression.”

“(2) Techniques that would assist with identifying outliers or invalid data. Much of this work seems to
be done by hand. I believe that we are also getting to the point where we could start using linear
regression and splines (for showing trends).”
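
As an illustration of two of the techniques named in this response, the sketch below pairs a basic k-means (Lloyd's algorithm) with a mean silhouette width for cluster validation. The initialization (first k points), the Euclidean distance, and the sample points are simplifying assumptions; the respondent's actual tooling is not known.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with the first k points as initial centroids (an assumption)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette width; assumes at least two clusters and distinct points."""
    clusters = sorted(set(labels))
    widths = []
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        own = mean_dist.get(labels[i])
        if own is None:  # singleton cluster: width is defined as 0
            widths.append(0.0)
            continue
        nearest_other = min(d for c, d in mean_dist.items() if c != labels[i])
        widths.append((nearest_other - own) / max(own, nearest_other))
    return sum(widths) / len(widths)


points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
labels, centroids = kmeans(points, k=2)
score = silhouette(points, labels)
```

A silhouette close to 1 suggests compact, well separated clusters; values near 0 suggest overlapping ones.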

Needs
For example, would system-generated descriptive statistical visualizations be
useful for whole data sets - or for smaller user-selected groups of attributes?

Would it be useful for the application to analyze and suggest possible
distribution models it sees in the data; for the values of individual attributes,
and/or for larger sets of data?
“With regards to your last question on visualization, we have put in significant effort to use
visualization in our Endeca installation. We have built visualizations such as tree maps, flow
diagrams, sun burst diagrams, scatter plots showing clusters, and hierarchical edge bundling diagrams
to explore our data sets.

Our data tends to be qualitative rather than quantitative so this drives much of our visualizations.

So yes, interactive descriptive statistical visualization would be helpful - on the complete data set
and individual attributes.”

Needs
1. What are the most common statistical techniques you use at work -
descriptive, inferential, or otherwise? What are the most valuable?

2. What are the most common visualizations you use to present findings or
share insights? What are the most valuable?
“(1) We do a lot of chi-square tests, permutation tests, false discovery rate correction, Bonferroni
correction, 2x2 Fisher exact test, logistic regression.

I also use SVM, Artificial Neural Networks (ANN), Naive-Bayes Classifiers (NBC), parts of speech
taggers.

(2) ROC curves, tables with p-values or odds ratios or hazard ratio
(http://en.wikipedia.org/wiki/Hazard_ratio)

Things p-value
XYZ1 0.001
XYZ2 ...
etc.”
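
Two of the techniques this respondent lists, the 2x2 Fisher exact test and the Bonferroni correction, are small enough to sketch directly. The two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is a common convention, and the input counts and p-values are made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one
    # (small tolerance guards against floating point ties).
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_observed * (1 + 1e-9))


def bonferroni(p_values, alpha=0.05):
    """Flag tests that stay significant after dividing alpha by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


p = fisher_exact_two_sided(3, 1, 1, 3)
significant = bonferroni([0.001, 0.02, 0.04])
```

With three tests, the Bonferroni threshold drops from 0.05 to about 0.0167, so only the smallest p-value survives.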

Needs
1. What are the most common statistical techniques you use at work -
descriptive, inferential, or otherwise? What are the most valuable?

2. What are the most common visualizations you use to present findings or
share insights? What are the most valuable?

“Logistic Regression, Decision Trees, Markov Models, Area Under Curve”
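
For readers unfamiliar with "Area Under Curve", the sketch below computes the area under the ROC curve through its rank interpretation: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The labels and scores are made-up values, and the pairwise loop is chosen for clarity, not speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of positives above negatives.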

Sense Makers: Information Management Ability
[Figure: Casual Analyst, Analytical Manager, Analyst, Problem Solver, and Data Scientist plotted by Data Skills Level (Low / none to High) against Composition Capability (Low / Use to High / Make); capability labels: Use Models, Customize Models, Create New Models, Create Complex Models]

Materials
• http://www.datasciencecentral.com/
• Ben Lorica’s blog: http://strata.oreilly.com/ben
• https://blog.twitter.com/tags/twitter-data
• http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853

Algorithms (ex: computational complexity, CS theory)
Back-End Programming (ex: JAVA/Rails/Objective C)
Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS)
Big and Distributed Data (ex: Hadoop, Map/Reduce)
Business (ex: management, business development, budgeting)
Classical Statistics (ex: general linear model, ANOVA)
Data Manipulation (ex: regexes, R, SAS, web scraping)
Front-End Programming (ex: JavaScript, HTML, CSS)
Graphical Models (ex: social networks, Bayes networks)
Machine Learning (ex: decision trees, neural nets, SVM, clustering)
Math (ex: linear algebra, real analysis, calculus)
Optimization (ex: linear, integer, convex, global)
Product Development (ex: design, project management)
Science (ex: experimental design, technical writing/publishing)
Simulation (ex: discrete, agent-based, continuous)
Spatial Statistics (ex: geographic covariates, GIS)
Structured Data (ex: SQL, JSON, XML)
Surveys and Marketing (ex: multinomial modeling)
Systems Administration (ex: *nix, DBA, cloud tech.)
Temporal Statistics (ex: forecasting, time-series analysis)
Unstructured Data (ex: noSQL, text mining)
Visualization (ex: statistical graphics, mapping, web-based dataviz)

Figure 3-3. There were interesting partial correlations among each
respondent’s primary Skills Group (rows) and primary Self-ID Group
(columns). The mosaic plot illustrates the proportions of respondents
who fell into each combination of groups. For example, there were few
Data Researchers whose top Skill Group was Programming.
Skills
