This document provides an overview of ProKarma Inc., a technology company offering data science and machine learning services. It covers ProKarma's strategy, technology and industry verticals, platform solutions, global presence (21 offices and 11 delivery centers), customer base of 150+ customers including 40 Fortune 1000 companies, and global staff of 2,000+ employees.
2. OVERVIEW
✓ Strategy
✓ Technology Verticals
✓ Industry Verticals
✓ Platform Solutions
SERVICES
✓ 21 Offices including 11 Global Delivery Centers in US, India, Argentina & Peru
PRESENCE
✓ 150+ Customers
✓ 40 Fortune 1000 Companies
CUSTOMERS
✓ Doing business since 2000
✓ Global Staff of 2000+ employees
PEOPLE
PK
3. problem formulation -> tool chain construction
data science with notebooks
the data set
Demo: Spark + GraphLab Create
RDD -> SFrame
ML on PySpark
introduction to boosted trees in GraphLab Create
Demo: Spark + GraphLab Create
PySpark ML with GraphLab Create
What should we talk about?
4. Moving down the data pipeline from raw to results
How best to quickly move through the pipeline to:
1. Show value of work
2. Communicate results
3. Move models into the production pipeline
[Pipeline diagram: use lots of different data sources → ETL → clean up your data → run sophisticated analytics → discover insights! → integrate results into pipelines → communicate results and value]
…NOTEBOOKS
Based on Diagram Source: Paco Nathan
5.
“Models are not right or wrong; they're always wrong. They're always approximations. The question you have to ask is whether a model tells you more information than you would have had otherwise. If it does, it's skillful.”
– Gavin Schmidt’s excellent TED Talk
Data science with notebooks allows data science teams to quickly move from exploration > transformation > modeling > visualization > export to pipeline.
Many startups have been created to fill the need for a single product solution with collaboration and containerization.
In the meantime:
PySpark + IPython + GraphLab Create
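The notebook flow above (explore > transform > model > visualize > export) can be sketched with nothing but the standard library. This is illustrative only: the talk's actual demo uses PySpark + IPython + GraphLab Create, and the yearly fatality figures below are toy numbers, not real FARS data.

```python
# Minimal stand-in for the notebook flow: explore -> transform -> model -> export.
from statistics import mean

# explore: a few raw records (year, fatality count), as strings from a source file
raw = [("2010", "32999"), ("2011", "32479"), ("2012", "33782")]

# transform: cast strings to numbers
records = [(int(y), int(n)) for y, n in raw]

# model: the simplest possible "model" -- a mean baseline over the series
baseline = mean(n for _, n in records)

# export: a result ready to hand to the next pipeline stage
result = {"baseline_fatalities": baseline, "years": [y for y, _ in records]}
print(result)
```

In a real notebook each stage would be its own cell, so the team can show intermediate value at every step rather than only at the end.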
6. Fatality Analysis Reporting System
National Highway Traffic Safety Administration
● Publicly available
● Historical time series (currently available: 1975–2013)
● Raw, Rich, Relevant
● Time series, geolocation
● Human-recorded events transcribed into annual databases (DBF, SQL, SAS)
● Measurable outcomes for modeling (updated on a yearly basis)
example data set: transportation safety
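Because the FARS files carry measurable outcomes per year, even a plain-Python pass can surface them. A hedged sketch, assuming a simplified CSV extract: the real annual files ship as DBF/SQL/SAS, and the two-column layout below (YEAR, FATALS) is an assumption made for illustration, not the actual schema.

```python
# Aggregate fatalities per year from a FARS-style extract (simplified layout).
import csv
import io
from collections import defaultdict

# Stand-in for an exported annual file; rows are invented for the example.
sample = io.StringIO(
    "YEAR,FATALS\n"
    "2012,1\n"
    "2012,2\n"
    "2013,1\n"
)

deaths_per_year = defaultdict(int)
for row in csv.DictReader(sample):
    deaths_per_year[int(row["YEAR"])] += int(row["FATALS"])

print(dict(deaths_per_year))  # one measurable outcome per year
```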
7. ✓ Proof-of-concept demonstration for a customer concerned with Transportation Safety
✓ Application of the scientific method to diverse data sets
✓ Evolving real-world data sets for advanced analytics workshops and training sessions
✓ Visual and conceptual presentation of a scientific approach to computational analysis
data science showcase
DATA SCIENCE – DATA PROTOTYPING
[Map figure: “Fatalities in Fatal Accidents from 1975−2012”, plotted by latitude/longitude, color-scaled by deaths (0–160,000)]
✓ Scientific Method
✓ Predictive Modeling
✓ Hidden Insights
8. Hardware:
• MacBook Pro (late 2012)
• ~36 GB free disk space
• 8 GB RAM
• 2 cores
• Not exactly a blazingly fast, top-of-the-line machine…
Software:
• Spark 1.1.0 for Hadoop 2.4
• GraphLab Create 1.2.1
• Hadoop 2.4
• Scala 2.10.4
• Python 2.7.9 on Anaconda 1.9.1
notes about my setup
9. One more step…
gl.get_spark_integration_jar_path()
$SPARK_HOME/bin/spark-submit --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client my_awesome_code.py
Or…
$SPARK_HOME/bin/pyspark --driver-class-path /path/to/graphlab-create-spark-integration.jar --master yarn-client
Works with --master local OR --master yarn-client
Then… it works ☺
working with glc in pyspark
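Once pyspark is launched with the integration jar, the RDD → SFrame hand-off from the agenda looks roughly like this. A hedged sketch: `SFrame.from_rdd` is the GraphLab Create 1.x integration helper, but its exact signature varied between GLC versions, and the demo rows here are invented stand-ins, not the talk's actual data.

```python
# Sketch of converting a PySpark RDD of dicts into a GraphLab Create SFrame.

def demo_rows():
    """Tiny invented stand-in for real accident records."""
    return [{"year": 2012, "fatals": 1}, {"year": 2013, "fatals": 2}]

def rdd_to_sframe(sc):
    """Hand a PySpark RDD to GraphLab Create as an SFrame.

    Requires the graphlab-create-spark-integration.jar on the driver
    classpath (see the spark-submit/pyspark invocations above).
    """
    import graphlab as gl            # deferred: only available in the GLC environment
    rdd = sc.parallelize(demo_rows())
    return gl.SFrame.from_rdd(rdd)   # SFrame columns come from the dict keys
```

Inside the pyspark shell, `sc` is already defined, so `sf = rdd_to_sframe(sc)` would be the single line that crosses from Spark into GLC.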