@Galvanize with Pandas !!
Pandas, Data Wrangling & Data Science
August 12, 2016
@ksankar // doubleclix.wordpress.com
San Francisco 2016
o Intro & Setup [10:45-10:55) (10)
• Goals/non-goals
o Data Wrangling & Data Science Pipeline [10:55-11:05) (10)
o Pandas – APIs & Namespaces [11:05-11:15) (10)
o Pandas – Basic Maneuvers [11:15-11:30) (15)
o Pandas – Data Wrangling – Transformations, Aggregations & Join [11:30-12:15) (45)
• Hands-on : Titanic Dataset
• Hands-on : NW Dataset, State Of The Union Speeches, Recsys-2015 Data
o Q & A [12:15-Inf) (10)
o Not covering – Panels, Time Series
Agenda : Pandas, Data Wrangling & Data Science
http://pydata.org/sfo2016/schedule/presentation/67/
Goals & non-goals
Goals
¤ Understand Data Wrangling with Pandas
¤ Focus on APIs and usage
¤ Give you a focused time to work thru examples
§ Work with me. I will wait if you want to catch-up
¤ Less theory, more usage - let us see if this works
¤ As straightforward as possible
§ The programs can be optimized
¤ Foundation for the next 2 tutorials
§ Python Visualization for Exploration of Data by Stephen F. Elston, Ronald Lopez
§ Applied Time Series Econometrics in Python (and R) by Jeffrey Yau
Non-goals
¡ Not “expert” Pandas
• We don’t have sufficient time. The topic can easily be a 1 day tutorial !
¡ Time to do hands-on
• Only 90 minutes
¡ Python vs. R
• I’ve come to discuss Pandas, not to praise R !
¡ A passive talk
• Nope. Interactive & hands-on
1. Brandon Rhodes - Pandas From The Ground Up - PyCon 2015 https://www.youtube.com/watch?v=5JnMutdy6Fw
2. A Visual Guide To Pandas by Jason Wirth https://www.youtube.com/watch?v=9d5-Ti6onew
3. 2012 PyData Workshop: Data Analysis in Python with Pandas by Wes McKinney https://www.youtube.com/watch?v=MxRMXhjXZos
4. http://nbviewer.jupyter.org/github/jbochi/recsyschallenge2015/blob/master/visualization.ipynb
5. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/
Thanks to the Giants whose work
helped to prepare this tutorial
About Me
o AI/Data Scientist
• Autonomous Vehicles [https://goo.gl/BgicSY][https://goo.gl/LZ3fY9]
• Building Autonomous Drone-Jarvis / Working towards FAA Drone Pilot Certification
• What would you want AI to do, if it could do whatever you want it to do ?[https://goo.gl/eqWUEn]
• Decision Data Science & Product Data Science
• Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L]
• Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3]
o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] …
o Have done lots of things:
• Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA
• Written Books (Web 2.0, Wireless, Java, …), Standards, some work in AI
• Guest Lecturer at Naval PG School,…
o Volunteer as Robotics Judge at First Lego league World Competitions
o @ksankar, doubleclix.wordpress.com
Pandas & Notebook Installation
o Pandas – Best with Anaconda
o Notebook - Install Jupyter/iPython
Tutorial Materials
o Github : https://github.com/xsankar/cautious-octo-waffle
• Clone or download zip
o Open terminal
• cd ~/cautious-octo-waffle
• jupyter notebook
o Click on ipython dashboard
• Run 000-PreFlightCheck.ipynb
• Now you are ready for the workshop !
o One More Thing !!
• The RecSys-2015 data is ~2GB, so please download the data to the data/recsys-2015 directory
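A pre-flight check can be as small as the sketch below. This is not the content of 000-PreFlightCheck.ipynb (which is not reproduced here); the version report and the smoke test are assumptions about what such a check would reasonably do.

```python
import sys

import numpy as np
import pandas as pd

# Hypothetical pre-flight check: report interpreter and library versions
# so that mismatches show up before the hands-on sessions start.
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
}
for name, version in versions.items():
    print(f"{name:>8}: {version}")

# Smoke test: build a tiny DataFrame and run one basic aggregation.
smoke = pd.DataFrame({"x": [1, 2, 3]})
smoke_total = int(smoke["x"].sum())
print("smoke test total:", smoke_total)
```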
Data Wrangling & Data Science Pipeline
10:55
Pipelines …
“[Collect-Store-Transform]-[Reason-Model]-[Deploy]-[Visualize-Recommend-Predict]-[Explore]”
Data Science - Context
o Scalable Model Deployment
o Big Data automation & purpose built appliances (soft/hard)
o Manage SLAs & response times
o Volume
o Velocity
o Streaming Data
o Canonical form
o Data catalog
o Data Fabric across the organization
o Access to multiple sources of data
o Think Hybrid – Big Data Apps, Appliances & Infrastructure
Collect Store Transform
o Metadata
o Monitor counters & Metrics
o Structured vs. Multi-structured
o Flexible & Selectable
§ Data Subsets
§ Attribute sets
o Refine model with
§ Extended Data subsets
§ Engineered Attribute sets
o Validation run across a larger data set
Reason Model Deploy
Data Management
Data Science
o Dynamic Data Sets
o 2 way key-value tagging of datasets
o Extended attribute sets
o Advanced Analytics
Explore Visualize Recommend Predict
o Performance
o Scalability
o Refresh Latency
o In-memory Analytics
o Advanced Visualization
o Interactive Dashboards
o Map Overlay
o Infographics
¤ Bytes to Business a.k.a. Build the full stack
¤ Find Relevant Data For Business
¤ Connect the Dots
Data Science :
The art of building a model with known knowns, which when let loose, works with unknown
unknowns!
Donald Rumsfeld is an armchair Data Scientist !
http://smartorg.com/2013/07/valuepoint19/
The
World
Knowns
Unknowns
You
UnKnown Known
o Others know, you don’t
o What we do
o Facts, outcomes or scenarios we have not encountered, nor considered
o “Black swans”, outliers, long tails of probability distributions
o Lack of experience, imagination
o Potential facts, outcomes we are aware, but not with certainty
o Stochastic processes, Probabilities
o Known Knowns
o There are things we know that we know
o Known Unknowns
o That is to say, there are things that we
now know we don't know
o But there are also Unknown Unknowns
o There are things we do not know we
don't know
The curious case of the Data Scientist
o Data Scientist is multi-faceted & Contextual
o Data Scientist should be building Data Products
o Data Scientist should tell a story
http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/
Large is hard; Infinite is much easier !
– Titus Brown
Pandas – APIs & Namespaces
11:05
Pandas Data Model
o Layer over numPy
o Data Model
• 1D Series (numPy Array w/labels)
• Data frame - 2D labelled sheet
• Column operations similar to vector operations
o Pay attention to the index
• Indexed rows, Indexed Columns & info at the center
o Pay attention to the objects
• DataFrame vs Series vs numpy array
• Eg. size() vs size
o “Answer all questions about a dataset” - Wes
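The data model above can be sketched in a few lines. The column names and values are made up for illustration, and the last two statements show one reading of the "size() vs size" note: `size` is an attribute on a DataFrame, while `size()` is a method on a GroupBy.

```python
import pandas as pd

# A Series is a 1D NumPy array with labels (the index).
ages = pd.Series([22.0, 38.0, 26.0], index=["a", "b", "c"], name="age")

# A DataFrame is a 2D labelled sheet; each column is a Series.
df = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]},
    index=["a", "b", "c"],
)

# Column operations behave like vector operations (no explicit loop).
df["fare_per_year"] = df["fare"] / df["age"]

# Pay attention to the objects: DataFrame vs Series vs NumPy array.
kinds = (
    type(df).__name__,
    type(df["age"]).__name__,
    type(df["age"].values).__name__,
)

n_cells = df.size                        # attribute: rows * columns
group_sizes = df.groupby("age").size()   # method on a GroupBy: rows per group
```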
pandas namespaces
objects
o pandas.Series
o pandas.DataFrame
o pandas.Panel
o pandas.Panel4D
o pandas.Index
I/O
o read_(csv, table,
excel, json, gbq,…)
o to_(csv, table,
excel, json, gbq,…)
o pandas.read_
o df.to_
Computations, operations, …
o +,-,
o pow,
o corr, …
DateTime
o .dt
NaN,
Missing
o isnull()
o fillna()
o dropna()
o skipna
o interpolate
String
o .str
Plotting
o .plot
Notes :
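[1] df[“date”].dt - only a Series has the .dt accessor ! df.dt won’t work
[2] sort_values() sorts both a DataFrame and a Series (it replaces the older .sort for a DataFrame and .order for a Series, both deprecated since pandas 0.17)
[3] to_frame() converts a Series to a DataFrame. Most of the time DataFrame is the preferred object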
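A small sketch of the three notes, using a made-up frame with a `date` and a `sales` column. It uses `sort_values`, the spelling current pandas releases support for both objects, and shows `to_frame()` turning a Series into a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2016-08-12", "2016-01-03", "2016-03-21"]),
        "sales": [30, 10, 20],
    }
)

# [1] The .dt accessor lives on a datetime Series, not on the DataFrame.
months = df["date"].dt.month.tolist()
frame_has_dt = hasattr(df, "dt")

# [2] sort_values works on both a DataFrame and a Series.
by_sales = df.sort_values("sales")
sorted_sales = df["sales"].sort_values().tolist()

# [3] to_frame turns a Series into a one-column DataFrame.
sales_frame = df["sales"].to_frame()
```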
Data Science “folk knowledge” (1 of A)
o "If you torture the data long enough, it will confess to anything." – Hal Varian,
Computer Mediated Transactions
o Learning = Representation + Evaluation + Optimization
o It’s Generalization that counts
• The fundamental goal of machine learning is to generalize beyond the
examples in the training set
o Data alone is not enough
• Induction not deduction - Every learner should embody some knowledge
or assumptions beyond the data it is given in order to generalize beyond it
o Machine Learning is not magic – one cannot get something from nothing
• In order to infer, one needs the knobs & the dials
• One also needs a rich expressive dataset
A few useful things to know about machine learning - by Pedro Domingos
http://dl.acm.org/citation.cfm?id=2347755
Pandas- Basic Maneuvers
11:15
1-Getting Data in/out
o pd.read_csv(…)
o read_table(...) <- arbitrary delimited file
o read_{clipboard,json,excel,sas,sql,gbq,...}(...)
o df.to_csv(...)
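A minimal round trip through the I/O functions listed above. The CSV text is a made-up stand-in for a file on disk; `io.StringIO` is used only so the sketch runs without any files.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "name,age,fare\nAllen,29,211.34\nBraund,22,7.25\n"

# read_csv accepts a path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# sep handles arbitrary delimiters.
pipe_text = csv_text.replace(",", "|")
df_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")

# to_csv with no path returns the CSV as a string.
round_trip = df.to_csv(index=False)
df_again = pd.read_csv(io.StringIO(round_trip))
```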
2-Basic Operations
o head()
o tail()
o count()
o describe()
o dtypes
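The basic operations above, applied to a small made-up frame. Note that `dtypes` is an attribute while the others are method calls.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Allen", "Braund", "Cumings", "Dooley"],
        "age": [29.0, 22.0, np.nan, 32.0],
        "survived": [1, 0, 1, 0],
    }
)

first_two = df.head(2)     # first n rows (default 5)
last_one = df.tail(1)      # last n rows
non_null = df.count()      # non-NaN values per column
summary = df.describe()    # count, mean, std, min, quartiles, max (numeric columns)
column_types = df.dtypes   # attribute, not a method
```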
3-Labelled Indexing
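Labelled indexing in pandas is usually done with `.loc` (by label) and `.iloc` (by position). The frame below is made up for illustration; the slide itself does not say which accessors were demonstrated.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [29.0, 22.0, 38.0], "fare": [211.34, 7.25, 71.28]},
    index=["Allen", "Braund", "Cumings"],
)

by_label = df.loc["Braund", "fare"]       # label-based: row label, column label
by_position = df.iloc[1, 1]               # position-based: the same cell
label_slice = df.loc["Allen":"Braund"]    # label slices include both endpoints
older = df.loc[df["age"] > 25, ["age"]]   # boolean mask plus column selection
```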
4. na-Missing Data
o One of the tenets of big data and data science is that data is never fully clean. While we can handle types, formats et al., missing values are always challenging
o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values.
o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success – it is better than dropping the rows.
• We can replace null with 0
• A better solution is to replace missing numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy
• We could use mode or median instead of mean
• Another good strategy is to infer the missing value from other attributes, i.e. “Evidence from multiple fields”.
• For example, the Titanic data has a name field, and for imputing the missing age field we could use the Mr., Master., Mrs., Miss. designation from the name and then fill in the average age for the corresponding designation. So a row with a missing age and “Master.” in the name would get the average age of all records with “Master.”
• There is also the field for number of siblings and spouses. We could average the age based on the value of that field.
• We could even average the ages from different strategies.
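The designation-based imputation described above can be sketched as follows. The rows are made up in the shape of the Titanic data, the column names `Name`, `Age`, `Title` and `AgeFilled` are assumptions, and the regular expression assumes names follow the "Surname, Title. Given" pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the shape of the Titanic data.
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen",
            "Allen, Mr. William",
            "Moran, Mr. James",
            "Palsson, Master. Gosta",
            "Rice, Master. Eugene",
            "Heikkinen, Miss. Laina",
        ],
        "Age": [22.0, 35.0, np.nan, 2.0, np.nan, 26.0],
    }
)

# Designation = the text between the comma and the first period.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Fill each missing age with the mean age of rows sharing the designation.
title_mean = df.groupby("Title")["Age"].transform("mean")
df["AgeFilled"] = df["Age"].fillna(title_mean)

# Fall back to the overall median where a designation has no known ages.
df["AgeFilled"] = df["AgeFilled"].fillna(df["Age"].median())
```

Keeping the filled values in a separate column leaves the original `Age` intact, which makes it easy to compare strategies or average them later.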
4. na-Missing Data
o NaN better than 0 - says I don’t know
• Comes in handy in recommendations, stock data on a Saturday,…
o skipna
o fillna
• forward fill/backward fill method !
o interpolate
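The options above behave as follows on a short made-up price series with a two-day gap. The forward and backward fills are written with `ffill()` and `bfill()`.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices with a gap (e.g. a weekend).
prices = pd.Series([10.0, np.nan, np.nan, 16.0])

mean_skip = prices.mean()                 # skipna=True by default
mean_strict = prices.mean(skipna=False)   # NaN propagates into the result

zero_filled = prices.fillna(0)    # claims the price was 0, which is misleading
forward = prices.ffill()          # carry the last known value forward
backward = prices.bfill()         # pull the next known value backward
linear = prices.interpolate()     # linear interpolation between neighbours
```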
5-Statistics
o Min
o Max
o Quantile
o Mean,SD,variance,…
o Correlation
• Pearson
• Spearman
o Covariance
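A sketch of the statistics listed above on made-up data where `y` is the square of `x`. The relationship is monotonic but not linear, so the Spearman (rank) correlation is exactly 1 while the Pearson correlation is slightly lower.

```python
import pandas as pd

df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [1.0, 4.0, 9.0, 16.0, 25.0]}
)

x_min, x_max = df["x"].min(), df["x"].max()
x_median = df["x"].quantile(0.5)
x_mean = df["x"].mean()
x_std = df["x"].std()   # sample standard deviation (ddof=1)
x_var = df["x"].var()   # sample variance (ddof=1)

pearson = df.corr(method="pearson").loc["x", "y"]     # linear association
spearman = df.corr(method="spearman").loc["x", "y"]   # rank (monotonic) association
covariance = df.cov().loc["x", "y"]
```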
6-Aggregation/Groupby
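A minimal groupby sketch on made-up passenger rows: split by class, apply an aggregate to each group, and combine the results. The column names are assumptions, and the named-aggregation form needs a reasonably recent pandas (0.25 or later).

```python
import pandas as pd

df = pd.DataFrame(
    {
        "pclass": [1, 1, 2, 3, 3, 3],
        "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
        "survived": [1, 1, 0, 0, 1, 0],
    }
)

# Split by class, apply the mean to each group, combine into one Series.
mean_fare = df.groupby("pclass")["fare"].mean()

# Several aggregates at once, one output column per (column, function) pair.
summary = df.groupby("pclass").agg(
    passengers=("survived", "size"),
    survival_rate=("survived", "mean"),
)
```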
Pandas – Data Wrangling –
Transformations, Aggregations & Join
11:30
Merge,Join and friends
o merge
• Use Merge
• join is a set of common merge patterns with defaults
o groupby
• Think in terms of split-apply-combine
o stack/unstack
• unstack operation to compare unlike things - parameter to unstack
different columns
• Too much stack-unstack results in a series !
• Be ready to handle NaN
o Powerful Techniques
• groupby + merge
• groupby + unstack
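The techniques above can be sketched on two small made-up tables. The table and column names are hypothetical and are not taken from the workshop notebooks.

```python
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "customer_id": ["c1", "c2", "c1", "c3"],
        "year": [2015, 2015, 2016, 2016],
        "amount": [100.0, 50.0, 25.0, 75.0],
    }
)
customers = pd.DataFrame(
    {"customer_id": ["c1", "c2", "c3"], "region": ["West", "East", "West"]}
)

# merge: explicit key and join type.
merged = orders.merge(customers, on="customer_id", how="left")

# join: the same pattern expressed against the index of the right-hand frame.
joined = orders.join(customers.set_index("customer_id"), on="customer_id")

# groupby + merge: attach a per-customer total to every order row.
customer_total = (
    orders.groupby("customer_id", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "customer_total"})
)
with_total = orders.merge(customer_total, on="customer_id").sort_values("order_id")

# groupby + unstack: totals per (region, year), then pivot year into columns.
totals = merged.groupby(["region", "year"])["amount"].sum()
wide = totals.unstack("year")   # missing combinations become NaN
```

The East region has no 2016 orders, so `wide` holds a NaN in that cell, which is the "be ready to handle NaN" case.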
Hands-On : Pandas@Kaggle
o 020-Titanic.ipynb
o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/020-Titanic.ipynb
• Let us analyze the Titanic Dataset for a Kaggle Competition
Hands-On : Orders Data
o 030-Orders.ipynb
o Github : https://github.com/xsankar/cautious-octo-waffle/blob/master/030-Orders.ipynb
• Data wrangling with the Orders dataset
Data Science “folk knowledge” (Wisdom of Kaggle)
Jeremy’s Axioms
o Iteratively explore data
o Tools
• Excel Format, Perl, Perl Book, Pandas !
o Get your head around data
• Pivot Table
o Don’t over-complicate
o If people give you data, don’t assume that you
need to use all of it
o Look at pictures !
o History of your submissions – keep a tab
o Don’t be afraid to submit simple solutions
• We will do this during this workshop
Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
Hands-On : Clicks & Buys
o 050-RecSys-2015.ipynb
o Github : https://github.com/xsankar/cautious-octo-waffle/blob/master/050-RecSys-2015.ipynb
• Data wrangling with the RecSys-2015 dataset
Questions ?
12:15
Essential Reading List
o A few useful things to know about machine learning - by Pedro Domingos
• http://dl.acm.org/citation.cfm?id=2347755
o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert
• http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf
o http://www.no-free-lunch.org/
o Controlling the false discovery rate: a practical and powerful approach to multiple testing Benjamini, Y. and Hochberg, Y. C
• http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf
o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe
• http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/
o Avoid these three mistakes, James Faghmo
• https://medium.com/about-data/73258b3848a4
o Leakage in Data Mining: Formulation, Detection, and Avoidance
• http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
For your reading & viewing pleasure … An ordered List
① An Introduction to Statistical Learning
• http://www-bcf.usc.edu/~gareth/ISL/
② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning
• http://online.stanford.edu/course/statistical-learning-winter-2014
③ Prof. Pedro Domingos
• https://class.coursera.org/machlearning-001/lecture/preview
④ Prof. Andrew Ng
• https://class.coursera.org/ml-003/lecture/preview
⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data
• https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120
⑥ Mathematicalmonk @ YouTube
• https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA
⑦ The Elements Of Statistical Learning
• http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
The Beginning As The End
How did we do ?
4:45
Pandas, Data Wrangling & Data Science

More Related Content

What's hot

The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012OSCON Byrum
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionKrist Wongsuphasawat
 
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017MLconf
 
Test trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsTest trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsHugh McCamphill
 
The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015Seattle DAML meetup
 
The Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationThe Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationFrank van Harmelen
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with RStephen Withington
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoFrank van Harmelen
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflowsSSSW
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceTrey Grainger
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologiesProf. Wim Van Criekinge
 
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...Oscar Corcho
 
Modular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxologyModular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxologyFrank van Harmelen
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Eli White
 

What's hot (20)

The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012The Art of Social Media Analysis with Twitter & Python-OSCON 2012
The Art of Social Media Analysis with Twitter & Python-OSCON 2012
 
Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4) Analyzing social media with Python and other tools (4/4)
Analyzing social media with Python and other tools (4/4)
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Information Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An IntroductionInformation Visualization for Knowledge Discovery: An Introduction
Information Visualization for Knowledge Discovery: An Introduction
 
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
Luna Dong, Principal Scientist, Amazon at MLconf Seattle 2017
 
Test trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely testsTest trend analysis: Towards robust reliable and timely tests
Test trend analysis: Towards robust reliable and timely tests
 
The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015The Road to Data Science - Joel Grus, June 2015
The Road to Data Science - Joel Grus, June 2015
 
The Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge RepresentationThe Empirical Turn in Knowledge Representation
The Empirical Turn in Knowledge Representation
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Semantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years agoSemantic Web questions we couldn't ask 10 years ago
Semantic Web questions we couldn't ask 10 years ago
 
Data Structure in Elixir
Data Structure in ElixirData Structure in Elixir
Data Structure in Elixir
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Measuring Relevance in the Negative Space
Measuring Relevance in the Negative SpaceMeasuring Relevance in the Negative Space
Measuring Relevance in the Negative Space
 
2017 biological databases_part1_vupload
2017 biological databases_part1_vupload2017 biological databases_part1_vupload
2017 biological databases_part1_vupload
 
Bio ontologies and semantic technologies
Bio ontologies and semantic technologiesBio ontologies and semantic technologies
Bio ontologies and semantic technologies
 
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...Slow-cooked data and APIs in the world of Big Data: the view from a city per...
Slow-cooked data and APIs in the world of Big Data: the view from a city per...
 
My Spark Journey
My Spark JourneyMy Spark Journey
My Spark Journey
 
Modular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxologyModular design patterns for systems that learn and reason: a boxology
Modular design patterns for systems that learn and reason: a boxology
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 

Viewers also liked

NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkMartin Goodson
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaSpark Summit
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesKrishna Sankar
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Spark Summit
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark Summit
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalSpark Summit
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to GreenplumDave Cramer
 
Programmer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp versionProgrammer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp versionIgor Kleiner
 
Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Michelle Casbon
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAiougVizagChapter
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesAlfredo Abate
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesNAYATech
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Bryan Yang
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingKristian Alexander
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllDataWorks Summit
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
 

Viewers also liked (20)

NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache SparkNLP on a Billion Documents: Scalable Machine Learning with Apache Spark
NLP on a Billion Documents: Scalable Machine Learning with Apache Spark
 
NLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey StellaNLP Structured Data Investigation on Non-Text by Casey Stella
NLP Structured Data Investigation on Non-Text by Casey Stella
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
Use of Spark MLib for Predicting the Offlining of Digital Media-(Christopher ...
 
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
Spark and Spark Streaming at Netfix-(Kedar Sedekar and Monal Daxini, Netflix)
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Dictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit PalDictionary Based Annotation at Scale with Spark by Sujit Pal
Dictionary Based Annotation at Scale with Spark by Sujit Pal
 
Introduction to Greenplum
Introduction to GreenplumIntroduction to Greenplum
Introduction to Greenplum
 
Programmer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp versionProgrammer interview exposed - lection 5 temp version
Programmer interview exposed - lection 5 temp version
 
Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016Advanced Spark Meetup - Jan 12, 2016
Advanced Spark Meetup - Jan 12, 2016
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
COUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_FeaturesCOUG_AAbate_Oracle_Database_12c_New_Features
COUG_AAbate_Oracle_Database_12c_New_Features
 
Oracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA TechnologiesOracle12 - The Top12 Features by NAYA Technologies
Oracle12 - The Top12 Features by NAYA Technologies
 
Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0 Introduce to Spark sql 1.3.0
Introduce to Spark sql 1.3.0
 
Spark etl
Spark etlSpark etl
Spark etl
 
SPARQL and Linked Data Benchmarking
SPARQL and Linked Data BenchmarkingSPARQL and Linked Data Benchmarking
SPARQL and Linked Data Benchmarking
 
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
AMIS Oracle OpenWorld 2015 Review – part 3- PaaS Database, Integration, Ident...
 
From Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for AllFrom Beginners to Experts, Data Wrangling for All
From Beginners to Experts, Data Wrangling for All
 
Data Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at BitlyData Science at Scale: Using Apache Spark for Data Science at Bitly
Data Science at Scale: Using Apache Spark for Data Science at Bitly
 

Similar to Pandas, Data Wrangling & Data Science

Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Statistics vs machine learning
Statistics vs machine learningStatistics vs machine learning
Statistics vs machine learningTom Dierickx
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkBas Geerdink
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2Roger Barga
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientistryanorban
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceAnnie Flippo
 
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...Bridging the Gap Between Data Science & Engineer: Building High-Performance T...
Bridging the Gap Between Data Science & Engineer: Building High-Performance T...ryanorban
 
Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Sql saturday el salvador 2016 - Me, A Data Scientist?
Sql saturday el salvador 2016 - Me, A Data Scientist?Fabricio Quintanilla
 
Data Science 101
Data Science 101Data Science 101
Data Science 101odsc
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Clare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science OnlineClare Corthell: Learning Data Science Online
Clare Corthell: Learning Data Science Onlinesfdatascience
 
Barga Data Science lecture 1
Barga Data Science lecture 1Barga Data Science lecture 1
Barga Data Science lecture 1Roger Barga
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectbodaceacat
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSara-Jayne Terp
 

Pandas, Data Wrangling & Data Science

  • 1. @Galvanize with Pandas !! Pandas, Data Wrangling & Data Science August 12, 2016 @ksankar // doubleclix.wordpress.com San Francisco 2016
  • 2. o Intro & Setup [10:45-10:55) (10) • Goals/non-goals o Data Wrangling & Data Science Pipeline [10:55 – 11:05) (10) o Pandas – APIs & Namespaces [11:05-11:15) (10) o Pandas – Basic Maneuvers [11:15-11:30) (15) o Pandas – Data Wrangling – Transformations, Aggregations & Join [11:30-12:15) (45) • Hands-on : Titanic Dataset • Hands-on : NW Dataset, State Of The Union Speeches, Recsys-2015 Data o Q & A [12:15-Inf) (10) o Not covering – Panels, Time Series Agenda : Pandas, Data Wrangling & Data Science http://pydata.org/sfo2016/schedule/presentation/67/
  • 3. Goals & non-goals Goals ¤Understand Data Wrangling with Pandas ¤Focus on APIs and usage ¤Give you a focused time to work through examples § Work with me. I will wait if you want to catch up ¤Less theory, more usage - let us see if this works ¤As straightforward as possible § The programs can be optimized ¤Foundation for the next 2 tutorials § Python Visualization for Exploration of Data by Stephen F. Elston, Ronald Lopez § Applied Time Series Econometrics in Python (and R) by Jeffrey Yau Non-goals ¡ Not “expert” Pandas • We don’t have sufficient time. The topic could easily fill a 1-day tutorial ! ¡ Time to do hands-on • Only 90 minutes ¡ Python vs. R • I’ve come to discuss Pandas, not to praise R ! ¡ A passive talk • Nope. Interactive & hands-on
  • 4. 1. Brandon Rhodes - Pandas From The Ground Up - PyCon 2015 https://www.youtube.com/watch?v=5JnMutdy6Fw 2. A Visual Guide To Pandas by Jason Wirth https://www.youtube.com/watch?v=9d5-Ti6onew 3. 2012 PyData Workshop: Data Analysis in Python with Pandas by Wes McKinney https://www.youtube.com/watch?v=MxRMXhjXZos 4. http://nbviewer.jupyter.org/github/jbochi/recsyschallenge2015/blob/master/visualization.ipynb 5. https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-learn-data-science-python-scratch-2/ Thanks to the Giants whose work helped to prepare this tutorial
  • 5. About Me o AI/Data Scientist • Autonomous Vehicles [https://goo.gl/BgicSY][https://goo.gl/LZ3fY9] • Building Autonomous Drone-Jarvis / Working towards FAA Drone Pilot Certification • What would you want AI to do, if it could do whatever you want it to do ?[https://goo.gl/eqWUEn] • Decision Data Science & Product Data Science • Insights = Intelligence + Inference + Interface [https://goo.gl/s2KB6L] • Predicting NFL with Elo like Nate Silver & 538 [NFL : http://goo.gl/Q2OgeJ, NBA’15 : https://goo.gl/aUhdo3] o Have been speaking at OSCON [http://goo.gl/1MJLu], PyCon, Pydata [http://vimeo.com/63270513, http://www.slideshare.net/ksankar/pydata-19] … o Have done lots of things: • Big Data (Retail, Bioinformatics, Financial, AdTech), Starting MS-CFRM, University of WA • Written Books (Web 2.0, Wireless, Java,…))Standards, some work in AI, • Guest Lecturer at Naval PG School,… o Volunteer as Robotics Judge at First Lego league World Competitions o @ksankar, doubleclix.wordpress.com
  • 6. Pandas & Notebook Installation o Pandas – Best with Anaconda o Notebook - Install Jupyter/IPython
  • 7. Tutorial Materials o GitHub : https://github.com/xsankar/cautious-octo-waffle • Clone or download zip o Open terminal • cd ~/cautious-octo-waffle • jupyter notebook o Click on IPython dashboard • Run 000-PreFlightCheck.ipynb • Now you are ready for the workshop ! o One More Thing !! • The RecSys-2015 data is ~2GB, so please download the data to the data/recsys-2015 directory
  • 8. Data Wrangling & Data Science Pipeline 10:55 Pipelines … “[Collect-Store-Transform]-[Reason-Model]-[Deploy]-[Visualize-Recommend-Predict]-[Explore]”
  • 9. Data Science - Context o Scalable Model Deployment o Big Data automation & purpose built appliances (soft/hard) o Manage SLAs & response times o Volume o Velocity o Streaming Data o Canonical form o Data catalog o Data Fabric across the organization o Access to multiple sources of data o Think Hybrid – Big Data Apps, Appliances & Infrastructure Collect Store Transform o Metadata o Monitor counters & Metrics o Structured vs. Multi- structured o Flexible & Selectable § Data Subsets § Attribute sets o Refine model with § Extended Data subsets § Engineered Attribute sets o Validation run across a larger data set Reason Model Deploy Data Management Data Science o Dynamic Data Sets o 2 way key-value tagging of datasets o Extended attribute sets o Advanced Analytics ExploreVisualize Recommend Predict o Performance o Scalability o Refresh Latency o In-memory Analytics o AdvancedVisualization o Interactive Dashboards o Map Overlay o Infographics ¤ Bytes to Business a.k.a. Build the full stack ¤ Find Relevant Data For Business ¤ Connect the Dots
  • 10. Data Science : The art of building a model with known knowns, which when let loose, works with unknown unknowns! Donald Rumsfeld is an armchair Data Scientist ! http://smartorg.com/2013/07/valuepoint19/ The World Knowns Unknowns You UnKnown Known o Others know, you don’t o What we do o Facts, outcomes or scenarios we have not encountered, nor considered o “Black swans”, outliers, long tails of probability distributions o Lack of experience, imagination o Potential facts, outcomes we are aware, but not with certainty o Stochastic processes, Probabilities o Known Knowns o There are things we know that we know o Known Unknowns o That is to say, there are things that we now know we don't know o But there are also Unknown Unknowns o There are things we do not know we don't know
  • 11. The curious case of the Data Scientist o Data Scientist is multi-faceted & Contextual o Data Scientist should be building Data Products o Data Scientist should tell a story http://doubleclix.wordpress.com/2014/01/25/the-curious-case-of-the-data-scientist-profession/ Large is hard; Infinite is much easier ! – Titus Brown
  • 12. Pandas – APIs & Namespaces 11:05
  • 13. Pandas Data Model o Layer over NumPy o Data Model • 1D Series (NumPy array with labels) • DataFrame - 2D labelled sheet • Column operations similar to vector operations o Pay attention to the index • Indexed rows, indexed columns & info at the center o Pay attention to the objects • DataFrame vs Series vs NumPy array • e.g. size() vs size o “Answer all questions about a dataset” - Wes
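To make the data model on this slide concrete, here is a minimal sketch. The column names and values are made up for illustration and are not from the tutorial notebooks; the point is only to show the three object types and the `size` vs `size()` distinction.

```python
import numpy as np
import pandas as pd

# A Series is a 1-D NumPy array with labels (the index).
ages = pd.Series([22.0, 38.0, 26.0], index=["a", "b", "c"], name="age")

# A DataFrame is a 2-D labelled sheet; each column is a Series.
# (Hypothetical columns, chosen only for this sketch.)
df = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0], "fare": [7.25, 71.28, 7.92]},
    index=["a", "b", "c"],
)

# Column operations behave like vector operations, aligned on the index.
fare_per_year = df["fare"] / df["age"]

# Pay attention to which object you are holding.
print(type(df))                 # DataFrame
print(type(df["age"]))          # Series
print(type(df["age"].values))   # NumPy array

# size is an attribute (number of elements);
# size() is a GroupBy method (number of rows per group).
print(df.size)
print(df.groupby("age").size())
```

Many pandas surprises come from calling a DataFrame method on a Series (or the reverse), so checking `type(...)` early is a cheap habit.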
  • 14. pandas namespaces objects o pandas.Series o pandas.DataFrame o pandas.Panel o pandas.Panel4D o pandas.Index I/O o read_{csv, table, excel, json, gbq, …} o to_{csv, excel, json, gbq, …} o pandas.read_ o df.to_ Computations, operations, … o +, -, o pow, o corr, … DateTime o .dt NaN, Missing o isnull() o fillna() o dropna() o skipna o interpolate() String o .str Plotting o .plot Notes : [1] df["date"].dt - only a Series has the datetime accessor ! df.dt won’t work [2] .sort a DataFrame, but .order a Series (pandas 0.17 and later use sort_values() for both) [3] to_frame() converts a Series to a DataFrame. Most of the time DataFrame is the preferred object
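The accessor notes on this slide can be checked with a small sketch. The frame below is invented for illustration; `.dt`, `.str`, `to_frame()` and `sort_values()` are the pandas calls being demonstrated.

```python
import pandas as pd

# Hypothetical two-row frame, only to exercise the accessors.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-08-13", "2016-08-12"]),
    "name": ["Heikkinen, Miss. Laina", "Braund, Mr. Owen"],
})

# .dt and .str live on a Series, not on the DataFrame.
years = df["date"].dt.year
upper = df["name"].str.upper()
print(hasattr(df, "dt"))   # the DataFrame itself has no .dt accessor

# to_frame() turns a Series into a one-column DataFrame.
years_df = years.to_frame(name="year")

# sort_values() works on both a DataFrame and a Series.
df_sorted = df.sort_values("date")
names_sorted = df["name"].sort_values()
```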
  • 15. Data Science “folk knowledge” (1 of A) o "If you torture the data long enough, it will confess to anything." – Hal Varian, Computer Mediated Transactions o Learning = Representation + Evaluation + Optimization o It’s Generalization that counts • The fundamental goal of machine learning is to generalize beyond the examples in the training set o Data alone is not enough • Induction not deduction - Every learner should embody some knowledge or assumptions beyond the data it is given in order to generalize beyond it o Machine Learning is not magic – one cannot get something from nothing • In order to infer, one needs the knobs & the dials • One also needs a rich expressive dataset A few useful things to know about machine learning - by Pedro Domingos http://dl.acm.org/citation.cfm?id=2347755
  • 17. 1-Getting Data in/out o pd.read_csv(…) o read_table(...) <- arbitrary delimited file o read_{clipboard, json, excel, sas, sql, gbq, ...} o df.to_csv(...)
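A minimal round trip through the I/O functions listed above. To keep the sketch self-contained it reads from in-memory strings rather than the tutorial's data files; the three-row sample is made up.

```python
import io
import pandas as pd

# Made-up sample; the third row has a missing Age.
csv_text = "PassengerId,Name,Age\n1,Braund,22\n2,Cumings,38\n3,Heikkinen,\n"

# read_csv accepts a path, URL, or file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# read_table covers other delimiters (tab here) via sep=.
tsv_text = csv_text.replace(",", "\t")
df_tab = pd.read_table(io.StringIO(tsv_text), sep="\t")

# to_csv writes back out; index=False skips the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(df)
```

The empty Age cell is read as NaN, which is why that column comes back as floating point.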
  • 18. 2-Basic Operations o head() o tail() o count() o describe() o dtypes
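The basic operations above, run on a tiny made-up frame (the column names echo the Titanic data but the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "female"],
})

print(df.head(2))      # first rows
print(df.tail(2))      # last rows
print(df.count())      # non-null values per column
print(df.describe())   # summary statistics for the numeric columns
print(df.dtypes)       # attribute, not a method: no parentheses
```

Comparing `count()` with `len(df)` is a quick way to spot columns with missing values.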
  • 20. 4. na-Missing Data o One of the tenets of big data and data science is that data is never fully clean; while we can handle types, formats et al, missing values are always challenging o One easy solution is to drop the rows that have missing values, but then we would lose valuable data in the columns that do have values. o A better solution is to impute data based on some criteria. It is true that data cannot be created out of thin air, but data can be inferred with some success – it is better than dropping the rows. • We can replace null with 0 • A better solution is to replace numerical values with the average of the rest of the valid values; for categorical values, replacing with the most common value is a good strategy • We could use mode or median instead of mean • Another good strategy is to infer the missing value from other attributes, i.e. “Evidence from multiple fields”. • For example, the Titanic data has a name field; for imputing the missing age field, we could use the Mr., Master., Mrs., Miss. designation from the name and then fill in the average age of the corresponding designation. So a row with a missing age and “Master.” in the name would get the average age of all records with “Master.” • There is also the field for the number of siblings and spouses. We could average the age based on the value of that field. • We could even average the ages from different strategies.
  • 21. 4. na-Missing Data o NaN better than 0 - says I don’t know • Comes in handy in recommendations, stock data on a Saturday,… o skipna o fillna • forward fill/backward fill method ! o interpolate
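The title-based imputation described above can be sketched as follows. The rows are a made-up miniature in the style of the Titanic data, and the `Title` and `AgeImputed` column names are assumptions of this sketch, not necessarily what the 020-Titanic notebook uses.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Palsson, Master. Gosta Leonard",
        "Heikkinen, Miss. Laina",
        "Rice, Master. Eugene",
        "Allen, Mr. William Henry",
        "Sandstrom, Miss. Marguerite",
    ],
    "Age": [22.0, 2.0, 26.0, None, None, 4.0],
})

# Baseline: fill every missing age with the overall mean.
age_mean_filled = df["Age"].fillna(df["Age"].mean())

# Evidence from another field: pull the designation that sits
# between the comma and the period in the name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Mean age per title, broadcast back to every row of that title.
title_mean = df.groupby("Title")["Age"].transform("mean")

# Fill only the missing ages; known ages are left untouched.
df["AgeImputed"] = df["Age"].fillna(title_mean)

print(df[["Title", "Age", "AgeImputed"]])
```

Here the "Master." row with a missing age gets the mean age of the other "Master." rows rather than the overall mean, which is the point of the strategy.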
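A short sketch of the NaN tools named on this slide, using an invented price series with a weekend gap (Friday to Monday) to echo the "stock data on a Saturday" remark:

```python
import numpy as np
import pandas as pd

# Made-up prices: Friday and Monday are known, the weekend is NaN.
s = pd.Series(
    [10.0, np.nan, np.nan, 16.0],
    index=pd.date_range("2016-08-12", periods=4),
)

# skipna: reductions ignore NaN by default.
mean_skip = s.mean()               # uses only the two known values
mean_all = s.mean(skipna=False)    # NaN: "I don't know"

# Forward fill / backward fill.
filled_fwd = s.ffill()
filled_bwd = s.bfill()

# Linear interpolation between the known points.
interp = s.interpolate()

# Or drop the unknowns altogether.
dropped = s.dropna()

print(interp)
```

Replacing the gaps with 0 instead would drag the mean down to 6.5, which is why NaN is the safer placeholder.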
  • 22. 5-Statistics o Min o Max o Quantile o Mean,SD,variance,… o Correlation • Pearson • Spearman o Covariance
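The statistics listed above, on a small made-up frame where `y` is a monotone but non-linear function of `x`, so that Pearson and Spearman correlation give visibly different answers:

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.0, 4.0, 9.0, 16.0, 25.0],   # y = x squared
})

print(df["x"].min(), df["x"].max())
print(df["x"].quantile(0.5))              # median
print(df["x"].mean(), df["x"].std(), df["x"].var())   # sample std/var

# Pearson measures linear association; Spearman works on ranks.
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

# Sample covariance between the two columns.
covariance = df["x"].cov(df["y"])

print(pearson, spearman, covariance)
```

Spearman comes out at 1 because the ranks agree perfectly, while Pearson falls slightly short of 1 because the relationship is not a straight line.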
  • 24. Pandas – Data Wrangling – Transformations, Aggregations & Join 11:30
  • 25. Merge, Join and friends o merge • Use merge • join is a set of common merge patterns with defaults o groupby • Think in terms of split-apply-combine o stack/unstack • unstack operation to compare unlike things - parameter to unstack different columns • Too much stack-unstack results in a Series ! • Be ready to handle NaN o Powerful Techniques • groupby + merge • groupby + unstack
  • 26. Hands-On : Pandas@Kaggle o 020-Titanic.ipynb o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/020-Titanic.ipynb • Let us analyze the Titanic Dataset for a Kaggle Competition
  • 27. Hands-On : Orders Data o 030-Orders.ipynb o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/030-Orders.ipynb • Data wrangling with the Orders dataset
  • 28. Data Science “folk knowledge” (Wisdom of Kaggle) Jeremy’s Axioms o Iteratively explore data o Tools • Excel Format, Perl, Perl Book, Pandas ! o Get your head around data • Pivot Table o Don’t over-complicate o If people give you data, don’t assume that you need to use all of it o Look at pictures ! o History of your submissions – keep a tab o Don’t be afraid to submit simple solutions • We will do this during this workshop Ref: http://blog.kaggle.com/2011/03/23/getting-in-shape-for-the-sport-of-data-sciencetalk-by-jeremy-howard/
  • 29. Hands-On : Clicks & Buys o 050-RecSys-2015.ipynb o GitHub : https://github.com/xsankar/cautious-octo-waffle/blob/master/050-RecSys-2015.ipynb • Data wrangling with the RecSys-2015 dataset
  • 31. Essential Reading List o A few useful things to know about machine learning - by Pedro Domingos • http://dl.acm.org/citation.cfm?id=2347755 o The Lack of A Priori Distinctions Between Learning Algorithms by David H. Wolpert • http://mpdc.mae.cornell.edu/Courses/MAE714/Papers/lack_of_a_priori_distinctions_wolpert.pdf o http://www.no-free-lunch.org/ o Controlling the false discovery rate: a practical and powerful approach to multiple testing, Benjamini, Y. and Hochberg, Y. • http://www.stat.purdue.edu/~doerge/BIOINFORM.D/FALL06/Benjamini%20and%20Y%20FDR.pdf o A Glimpse of Google, NASA, Peter Norvig + The Restaurant at the End of the Universe • http://doubleclix.wordpress.com/2014/03/07/a-glimpse-of-google-nasa-peter-norvig/ o Avoid these three mistakes, James Faghmo • https://medium.com/about-data/73258b3848a4 o Leakage in Data Mining: Formulation, Detection, and Avoidance • http://www.cs.umb.edu/~ding/history/470_670_fall_2011/papers/cs670_Tran_PreferredPaper_LeakingInDataMining.pdf
  • 32. For your reading & viewing pleasure … An ordered List ① An Introduction to Statistical Learning • http://www-bcf.usc.edu/~gareth/ISL/ ② ISL Class Stanford/Hastie/Tibshirani at their best - Statistical Learning • http://online.stanford.edu/course/statistical-learning-winter-2014 ③ Prof. Pedro Domingos • https://class.coursera.org/machlearning-001/lecture/preview ④ Prof. Andrew Ng • https://class.coursera.org/ml-003/lecture/preview ⑤ Prof. Abu-Mostafa, CaltechX: CS1156x: Learning From Data • https://www.edx.org/course/caltechx/caltechx-cs1156x-learning-data-1120 ⑥ Mathematicalmonk @ YouTube • https://www.youtube.com/playlist?list=PLD0F06AA0D2E8FFBA ⑦ The Elements Of Statistical Learning • http://statweb.stanford.edu/~tibs/ElemStatLearn/ http://www.quora.com/Machine-Learning/Whats-the-easiest-way-to-learn-machine-learning/
  • 33. The Beginning As The End How did we do ? 4:45