4. January 2020: pandas 1.0
● 26th major release after 10 years of development
● ~2000 unique contributors
Thanks, Indeed!
5. Dec 2009 - pandas 0.1
● First open source release after ~18 months of proprietary use
● Still on PyPI!
6. Funding pandas development
● pandas received its first formal grant in 2019 from the Chan Zuckerberg Initiative
● Core devs are primarily volunteers, self-funded, or company-funded (Anaconda, others)
7. The early pandas gang (2011 - 2012)
Wes McKinney, Chang She, Adam Klein
8. pandas’s amazing Core Dev Team
Core Dev Meetup, 2019
Partial cast of characters: Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
11. "We believe that in the coming years there will be
great opportunity to attract users in need of
statistical data analysis tools to Python who might
have previously chosen R, MATLAB, or another
research environment. By designing robust, easy-to-use
data structures that cohere with the rest of the
scientific Python stack, we can make Python a
compelling choice for data analysis applications. In
our opinion, pandas provides a solid foundation upon
which a very powerful data analysis ecosystem can
be established."
Me, Proceedings of SciPy 2011
17. Contributing factors
● Massive need for data wranglers + scientists
● “Perfect storm” of necessary packages
● New data science education
● Successful early adopters
● Packaging improvements
21. Common concerns
● Large codebase concerns
● Long-term software lifecycle
● Interpreted languages
○ ... unsafe?
○ ... slow?
● Open source… trustworthy?
22. May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
23. May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
"... but also good for interactive research..."
24. May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable for a production environment..."
"... but also good for interactive research..."
"... and easy / intuitive for non-software engineers to use"
25. May 2011 - “PyData” core dev meetings
* also, we need to fix packaging
26. July 2011 - Concerns
"... the current state of affairs has me rather anxious … these tools [e.g. pandas] have largely not been integrated with any other tools because of the community's collective commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
28. Python for Data Analysis book - 2012
● A primer on data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
29. PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
30. Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some problems:
● File formats
● JVM interop
● Non-array-oriented interfaces
31. Difficulties in pandas (and R) dataframes
● Limited built-in data types
● Performance and memory use issues
● Challenges with larger-than-memory datasets
● Naive execution strategies (no “query optimization”)
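The memory-use point is easy to demonstrate. pandas has historically stored strings as boxed Python objects, so a low-cardinality string column wastes memory unless it is explicitly dictionary-encoded. A minimal sketch (assumes pandas is installed; exact byte counts vary by version):

```python
import pandas as pd

# A low-cardinality string column, repeated 100,000 times
s = pd.Series(["red", "green", "blue"] * 100_000)

# Deep memory usage counts the actual string payloads, not just pointers
plain_bytes = s.memory_usage(deep=True)

# A dictionary-encoded ("category") column stores each distinct string
# once, plus a small integer code per row
cat_bytes = s.astype("category").memory_usage(deep=True)

print(plain_bytes, cat_bytes)  # the category version is typically far smaller
```

The same idea, dictionary encoding, is built into Arrow's type system rather than bolted on.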
38. Other thoughts
● Projects like pandas may be taking responsibility for too many things
● It would be more productive (long-term) to have a reusable computational foundation for data frames
39. Arrow
● New data frame format designed for speed
● Computational foundation for data processing libraries
● Fast cross-language data interchange
[Diagram: Arrow memory at the center, shared by the JVM data ecosystem, database systems, and data science libraries]
42. Important features
● CPU/GPU-friendly columnar memory layout
● Memory map huge datasets
● Relocate data structures without serialization
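A stdlib-only sketch of why a columnar layout matters: scanning one field of row-oriented records dereferences every row object, while a column is a single contiguous, fixed-width buffer. Here Python's array module stands in for an Arrow buffer; this is an illustration of the layout idea, not Arrow's actual API:

```python
from array import array

N = 100_000

# Row-oriented: a list of (id, value) records; reading "value"
# touches every row tuple
rows = [(i, i * 0.5) for i in range(N)]
row_scan = sum(r[1] for r in rows)

# Column-oriented (Arrow-style): each column is one contiguous,
# fixed-width buffer, cache-friendly and memory-mappable
ids = array("q", range(N))                        # int64-like column
values = array("d", (i * 0.5 for i in range(N)))  # float64 column
col_scan = sum(values)

assert row_scan == col_scan
```

Fixed-width contiguous buffers like these are also what make "memory map huge datasets" and "relocate without serialization" possible: the on-disk bytes are already in the in-memory format.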
43. Arrow C++ Platform
[Diagram: core data platform with a multi-core work scheduler, plus a query engine, datasets framework, and Arrow Flight RPC, layered over network and storage]
44. “New Data Frame” projects
● dask.dataframe
● Modin
● NVIDIA RAPIDS
● Vaex
● … and more surely in development
45. Learning from R
● Domain-specific language culture (“same code, different backends”)
● Non-standard evaluation
○ Inspect and manipulate unevaluated code fragments
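Python has no non-standard evaluation, but the same effect, capturing an expression now and executing it later against some backend, can be approximated with operator overloading. This toy Expr class is purely illustrative (not dplyr's or Arrow's API):

```python
class Expr:
    """Records an operation tree instead of computing it eagerly."""

    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __gt__(self, other):
        # Comparing builds a tree node; nothing is evaluated yet
        return Expr(">", self, other)

    def __repr__(self):
        if self.op == "col":
            return self.args[0]
        return f"({self.args[0]} {self.op} {self.args[1]})"

    def evaluate(self, row):
        # A trivial "backend": evaluate the tree against a dict-like row.
        # A real system would instead translate the tree for its engine.
        if self.op == "col":
            return row[self.args[0]]
        left, right = self.args
        lv = left.evaluate(row) if isinstance(left, Expr) else left
        rv = right.evaluate(row) if isinstance(right, Expr) else right
        return lv > rv  # only ">" is implemented in this sketch


def col(name):
    return Expr("col", name)


expr = col("arr_delay") > 30                 # builds a tree, computes nothing
print(expr)                                  # -> (arr_delay > 30)
print(expr.evaluate({"arr_delay": 45}))      # -> True
```

Different `evaluate`-style backends could interpret the same captured tree, which is the "same code, different backends" idea the slide attributes to R.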
46. Arrow’s relationship with dplyr and friends
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
(flights can be a massive Arrow dataset)
47. Arrow’s relationship with dplyr and friends
(same dplyr pipeline as on slide 46)
dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
48. Arrow’s relationship with dplyr and friends
(same dplyr pipeline as on slide 46)
dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime
R expressions can be JIT-compiled with LLVM
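For comparison, the dplyr pipeline on slide 46 translates fairly directly to pandas, though pandas executes each step eagerly rather than as an optimized graph. The flights table here is a tiny made-up stand-in for the real nycflights13 data:

```python
import pandas as pd

# Tiny stand-in for the "flights" table referenced on the slide
flights = pd.DataFrame({
    "year":      [2013, 2013, 2013, 2013],
    "month":     [1, 1, 1, 2],
    "day":       [1, 1, 2, 1],
    "arr_delay": [40.0, 50.0, 5.0, 10.0],
    "dep_delay": [10.0, 20.0, 2.0, 35.0],
})

# pandas translation of the dplyr pipeline: group, aggregate, filter
res = (
    flights
    .groupby(["year", "month", "day"], as_index=False)[["arr_delay", "dep_delay"]]
    .mean()
    .rename(columns={"arr_delay": "arr", "dep_delay": "dep"})
    .query("arr > 30 or dep > 30")
)
print(res)
```

Each method call above materializes an intermediate DataFrame, which is exactly the kind of eager, step-by-step execution the Arrow computation-graph approach is meant to avoid.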