20161215 python pandas-spark miscellaneous talk


Ryuji Tamagawa

Slides from the talk I gave on 2016/12/15 at Insight Technology's 三木会 (Sanmoku-kai) meetup. Python, Pandas, Spark, and so on.

Technology

• Python 2000 (**)
• db tech showcase MongoDB
• FB: Ryuji Tamagawa
• Twitter : tamagawa_ryuji

Python
Pandas Python
Jupyter Notebook
Jenkins
Spark 2.0

• Spark API RDD ~1.3 DataFrame / DataSet 1.4~
• DataFrame API RDD API Python Spark

DataFrame
• RDB /
• R Pandas Spark
Spark
R / Pandas
Spark
+

CSV
zip
RDB
Parquet
Excel
CSV
Feather
Spark
Pandas / Spark

•
• CPU
•
• Pandas read_csv zip CSV
Pandas
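
A minimal sketch of the read_csv point above: Pandas can read a zip-compressed CSV directly, provided the archive holds a single file. The file name sample.zip and its contents are hypothetical, made up for this illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "sample.zip")

    # Build a small archive containing exactly one CSV file (hypothetical data).
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("sample.csv", "id,name\n1,abc\n2,def\n")

    # read_csv decompresses the archive itself; no manual unzip step is needed.
    df = pd.read_csv(zip_path, compression="zip")

print(df)
```

Decompression and CSV parsing both happen inside this one call, so the cost lands on the CPU rather than on a separate unzip step.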

2
• CSV CPU
Pandas zip CSV
CPU …
• Parquet !
•

: Parquet
I/O
•
• Spark Parquet
• Python Parquet

•
• I/O Pandas
• Spark
• DataFrame Pandas → Spark
Spark → Pandas Pandas → Spark
• Apache Arrow
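
The Pandas → Spark and Spark → Pandas conversions mentioned above can be sketched as follows. This is only an illustration: it assumes PySpark and a JVM are available, and the function name round_trip, the app name, and the sample data are hypothetical. The PySpark import is placed inside the function so the definition loads even where Spark is not installed.

```python
import pandas as pd


def round_trip(pdf: pd.DataFrame) -> pd.DataFrame:
    """Send a Pandas DataFrame to Spark, filter it there, and bring it back."""
    from pyspark.sql import SparkSession  # lazy import: needs PySpark and a JVM

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-spark-demo")  # hypothetical app name
        .getOrCreate()
    )
    try:
        sdf = spark.createDataFrame(pdf)          # Pandas -> Spark
        filtered = sdf.filter(sdf["value"] > 1)   # work done on the Spark side
        return filtered.toPandas()                # Spark -> Pandas
    finally:
        spark.stop()


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
```

On a machine with Spark, round_trip(sample) returns the filtered rows as a Pandas DataFrame. Both conversions copy the data between the Python process and the JVM, so they are best kept to small inputs and small results.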

Apache Spark 2.0
• 1.x
• 2.0
1.x
• DataFrame API Python
• databricks http://go.databricks.com/mastering-apache-spark-2.0
•

Spark 2.0
• CPU
• CPU
• SQL DataFrame
• + SSD
• CSV zip
Pandas read_csv

Python + Spark
• Python serialize
• DataFrame API UDF
UDF Scala/Java
• http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api
[Figure: an Executor in which the JVM holds a cached DataFrame, transfers it to a Python process that applies lambda items: items[0] == 'abc', and receives the result DataFrame back; the Driver is also shown]
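
The transfer cost in the figure can be illustrated with a small comparison. This is a sketch under stated assumptions, not the speaker's code: the function names, column names, and sample rows are hypothetical, and running count_matches needs PySpark and a JVM. The predicate is the same test as the lambda in the figure.

```python
def starts_with_abc(items):
    """Row-level predicate, the same test as the lambda in the figure."""
    return items[0] == "abc"


def count_matches(spark):
    """Count matching rows two ways; needs PySpark and a JVM to run."""
    from pyspark.sql import functions as F  # lazy import: loads without Spark

    # Hypothetical sample rows, for illustration only.
    df = spark.createDataFrame(
        [("abc", 1), ("def", 2), ("abc", 3)], ["name", "value"]
    )

    # RDD API with a Python function: rows are serialized, shipped to a Python
    # worker process, evaluated there, and the results shipped back to the JVM.
    via_python = df.rdd.filter(starts_with_abc).count()

    # DataFrame API with a built-in column expression: evaluated inside the JVM,
    # so no rows cross the JVM/Python boundary.
    via_dataframe = df.filter(F.col("name") == "abc").count()

    return via_python, via_dataframe
```

Given a SparkSession named spark, count_matches(spark) should return the same count from both paths; only the execution path differs.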

20161215 python pandas-spark miscellaneous talk

1. Python, Pandas, Spark 2.0 Sky

3. • • Python 2000 (**) • db tech showcase MongoDB • • FB: Ryuji Tamagawa • Twitter : tamagawa_ryuji

5. 2017

6. • Python Spark •

7. • • Python / Pandas • Spark 2.0

8. Part 1 :

9. • • • csv

10. Python Pandas Python Jupyter Notebook Jenkins Spark 2.0

11. • Spark API RDD ~1.3 DataFrame / DataSet 1.4~ • DataFrame API RDD API Python Spark

12. DataFrame • RDB / • R Pandas Spark Spark R / Pandas Spark +

13. Part 2 :

14. CSV zip RDB Parquet Excel CSV Feather Spark Pandas / Spark

15. • • CPU • • Pandas read_csv zip CSV Pandas

16. 2 • CSV CPU Pandas zip CSV CPU … • Parquet ! •

17. : Parquet I/O • • Spark Parquet • Python Parquet

18. HDFS / S3 Parquet Parquet

19. SSD Parquet Parquet

20. Parquet No No Yes HDD

21. • • I/O Pandas • Spark • DataFrame Pandas → Spark Spark → Pandas Pandas → Spark • Apache Arrow

22. CPU ~2010 2010~ SSD CPU  

23. Apache Spark 2.0 • 1.x • 2.0 1.x • DataFrame API Python • databricks http://go.databricks.com/mastering-apache-spark-2.0 •

24. Spark 2.0 • CPU • CPU • SQL DataFrame • + SSD • CSV zip Pandas read_csv

25. Python + Spark • Python serialize • DataFrame API UDF UDF Scala/Java • http://www.slideshare.net/dragan10/performant-data-processing-with-pyspark-sparkr-and-dataframe-api Executor JVM DataFrame, Cached Python lambda items: items[0] == 'abc' transfer DataFrame, result transfer Driver
