Discover why Python is a strong choice for data science: it covers the whole data analysis workflow. The slides show tools for each task, including workflow management, data analysis, data visualization, integration with the Hadoop ecosystem, and communication.
7. PYTHON CAN BE USED IN THE WHOLE DATA SCIENCE WORKFLOW
https://speakerdeck.com/chdoig/the-state-of-python-for-data-science-pyss-2015?slide=22
8. ONE DAY AT FACEBOOK’S DATA SCIENCE TEAM, A MEMBER COULD…
AUTHOR A MULTISTAGE PROCESSING PIPELINE IN PYTHON, DESIGN A HYPOTHESIS TEST, PERFORM A REGRESSION ANALYSIS OVER DATA SAMPLES WITH R, DESIGN AND IMPLEMENT AN ALGORITHM FOR SOME DATA-INTENSIVE PRODUCT OR SERVICE IN HADOOP, OR COMMUNICATE THE RESULTS OF OUR ANALYSES
Jeff Hammerbacher
http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/
9. OPTIONS FOR PROCESSING PIPELINE
Airflow: https://github.com/airbnb/airflow
Luigi: https://github.com/spotify/luigi
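Both tools model a pipeline as tasks with dependencies and run them in dependency order. The sketch below illustrates that idea with the standard library only (Python 3.9+ for `graphlib`); it does not use the Airflow or Luigi APIs, and the task names `extract`, `transform`, `load` and the helper `run_pipeline` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def extract():
    # Stage 1: produce raw data (hard-coded here for illustration)
    return [3, 1, 2]


def transform(data):
    # Stage 2: clean / reorder the raw data
    return sorted(data)


def load(data):
    # Stage 3: summarize the transformed data
    return {"rows": len(data), "max": data[-1]}


# Each task name maps to (function, names of upstream tasks)
tasks = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_pipeline(tasks):
    """Run every task after its upstream tasks, passing their results along."""
    graph = {name: set(deps) for name, (_, deps) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = tasks[name]
        results[name] = func(*(results[d] for d in deps))
    return results


results = run_pipeline(tasks)
```

A real scheduler adds what this sketch leaves out: retries, persistence of intermediate outputs, scheduling, and monitoring.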
11. REGRESSION ANALYSIS IN PYTHON: EASY
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/ols.html
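The linked notebook fits ordinary least squares with statsmodels. As a dependency-free illustration of what such a fit computes, here is a minimal sketch of simple linear regression using only the standard library; the helper `fit_line` and the sample data are hypothetical and are not part of statsmodels.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Illustrative data lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
```

A statistics package additionally reports standard errors, confidence intervals, and goodness-of-fit measures, which this sketch omits.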
12.
13. PYTHON <3 BIG DATA
mrjob: MapReduce in Python (https://pythonhosted.org/mrjob)
snakebite: pure Python HDFS client (https://github.com/spotify/snakebite)
Spark: fast and general engine for large-scale data processing (http://spark.apache.org)
…
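To show the programming model these libraries build on, here is a minimal in-memory sketch of a MapReduce word count. It runs in a single process and does not use the mrjob, snakebite, or Spark APIs; `mapper`, `reducer`, and `run_job` are hypothetical names.

```python
from collections import defaultdict


def mapper(line):
    # Map step: emit a (word, 1) pair for every word in the line
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    # Reduce step: combine all values emitted for one key
    return word, sum(counts)


def run_job(lines):
    # Shuffle step: group the mapped values by key before reducing
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


counts = run_job(["Python loves big data", "big data loves Python"])
```

On a cluster, the map and reduce steps run in parallel on different machines and the framework performs the shuffle; the two functions the user writes keep the same shape.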
14. OH, BUT SCALA/JAVA IS FASTER. PYTHON IS 2 * FASTER: [WRITING, RUNNING]
DataFrame operations are optimized and compiled into JVM bytecode
https://databricks.com/blog/2015/04/24/recent-performance-improvements-in-apache-spark-sql-python-dataframes-and-more.html
23. SCIENCE IS GETTING MORE AND MORE IMPORTANT FOR THE PYTHON COMMUNITY
#    module        imports    imports/numpy
1    sys           2437939    5.85
2    os            2009086    4.82
3    re            1303009    3.12
4    numpy          416981    1.00
5    warnings       371345    0.89
6    subprocess     344934    0.83
7    django         282097    0.68
8    math           281987    0.68
11   matplotlib     146913    0.35
13   pylab           77817    0.19
14   scipy           69092    0.17
22   pandas          18928    0.05
24   theano           5482    0.01
6/25 MOST POPULAR LIBRARIES ARE FOR DATA SCIENCE
https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
24. SCIENCE IS IMPORTANT FOR PYTHON: MATRIX MULTIPLICATION
https://www.python.org/dev/peps/pep-0465/#but-isn-t-matrix-multiplication-a-pretty-niche-requirement
import numpy as np
from numpy.linalg import inv, solve

# Using dot function:
S = np.dot((np.dot(H, beta) - r).T,
           np.dot(inv(np.dot(np.dot(H, V), H.T)),
                  np.dot(H, beta) - r))

# With the @ operator
S = (H @ beta - r).T @ inv(H @ V @ H.T) @ (H @ beta - r)

$S = (H\beta - r)^T (H V H^T)^{-1} (H\beta - r)$
PEP 0465: PROPOSED FEB/14. SINCE PY 3.5 (SEP/15)
2013: 7 INTERNATIONAL CONFERENCES ON NUMERICAL PYTHON
AT PYCON 2014, ~20% OF THE TUTORIALS INVOLVED THE USE OF MATRICES
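PEP 465 defines `@` as a regular operator that dispatches to the `__matmul__` special method, so any class can support it, not only NumPy arrays. The sketch below shows this with a tiny pure-Python class (Python 3.5+); `Matrix` is a hypothetical helper written for illustration, not a NumPy type, and it omits the matrix inverse used in the slide's formula.

```python
class Matrix:
    """Tiny dense matrix stored as a list of rows (illustration only)."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    @property
    def T(self):
        # Transpose: columns become rows
        return Matrix(zip(*self.rows))

    def __matmul__(self, other):
        # PEP 465: the expression `a @ b` calls a.__matmul__(b)
        cols = list(zip(*other.rows))
        if len(self.rows[0]) != len(cols[0]):
            raise ValueError("shape mismatch")
        return Matrix(
            [sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in self.rows
        )

    def __sub__(self, other):
        # Element-wise subtraction of two matrices of the same shape
        return Matrix(
            [x - y for x, y in zip(ra, rb)]
            for ra, rb in zip(self.rows, other.rows)
        )


H = Matrix([[1, 2], [3, 4]])
beta = Matrix([[1], [1]])
r = Matrix([[1], [2]])

# `@` has the same precedence as `*`, so this parses as (H @ beta) - r
d = H @ beta - r
# Quadratic form (H beta - r)^T (H beta - r), a 1x1 matrix
quad = d.T @ d
```

Because `@` binds like `*`, chained expressions read left to right in the same order as the written formula, which is the readability gain the PEP argues for.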
25. SCIENCE STACK IS GETTING BETTER EACH DAY
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=8
26. SCIENCE STACK IS ALWAYS EVOLVING…
https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote?slide=29