What is Pandas?
Pandas is not just cute and cuddly animal that chomps on bamboo all day. It's also a powerful Python package that can transform how you work with data!
Pandas uses two main data structures: Series and DataFrame. A Series is a one-dimensional array-like object that can hold any data type, such as numbers, strings, or booleans. A DataFrame is a two-dimensional tabular data structure that can hold multiple columns of different data types, such as numbers, strings, booleans, or even other Series or DataFrames. A DataFrame is like a spreadsheet.
Pandas also provides many tools and methods for working with these data structures, such as:
- Easy handling of missing data (represented as NaN, NA, or NaT) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can ignore the labels and let Series, DataFrame, etc., automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust I.O. tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting, and lagging.
Apache Arrow Replaces NumPy Underhood
The major update concerns using Apache Arrow backend for Pandas, not NumPy. This change may have gone unnoticed partly by the smooth transition for the user experience, but its implications are significant. To appreciate this change, it's important to understand how Pandas works.
Loading data into memory is a critical first step in using Pandas for data manipulation and analysis, accomplished through functions like `read_csv`, `read_sql`, and `read_parquet`. Data representation in memory depends on the type of data being loaded, with simple data like integers or floats requiring standard representations, such as arrays with a fixed number of elements. However, alternative representations are needed for more complex data types like strings, dates and times, and categories, as Python data structures like lists, dictionaries, and tuples are too slow. Non-Python and non-standard data representations are used, with implementations often carried out via Python extensions written in C++, Rust, or other languages. NumPy has been the primary extension used for fast array representation and manipulation in Pandas for many years. The new Apache Arrow backend introduces an alternative to NumPy, improving memory management and allowing for faster, more efficient data processing.
Advantages of Apache Arrow
Missing values
Representing missing values in data can be tricky. While in Python it's easy to use the "None" value to indicate missing data, performance is a concern when working with different data types. This is because the internal representation of data types is CPU-dependent, and mixing different data types is impossible. Any bit pattern can't represent a missing value, so sentinel values are used instead. This solution is not ideal.
In the past, Pandas has used the NaN (Not a Number) value as a sentinel for missing values for floating-point numbers. This approach has drawbacks, such as converting integer values to floating-point notation. However, with the recent addition of extension arrays to Pandas, it's now possible to use two arrays to represent missing values. One array contains the actual data, while the other is a boolean array indicating which values are present and which are missing.
The Apache Arrow in-memory data representation also includes a similar missing-value specification. By utilizing Arrow, Pandas can now handle missing values more efficiently and without having to create its own representation for each data type.
Speed
Here are some examples of a performance comparison between NumPy and Arrow, demonstrating the speed efficiencies of Arrow.
Arrow outperforms NumPy, especially for strings.
Interoperability
The benefits of Arrow are two-fold. First, it is easy and standard to share data among different programs. Second, it allows for extremely fast and memory-efficient data sharing because two programs can share the same data without having to make a copy for each program. While it may not be common to use different programs simultaneously on the same loaded data, there are some examples where interoperability can be advantageous, and we will explore one of them.
Data types
Using Apache Arrow as the container for pandas data has an advantage over NumPy regarding data types. Although NumPy provides great support for integer and float values, boolean values backed by 8 bits per value, and datetime values backed by 64 bits, it falls short for other data types. On the other hand, Arrow has better support for dates and times. Arrow also uses a single bit per value for boolean types, consuming one-eighth of memory, and supports other types like decimals, binary data, and complex types.
Arrow defines a type to encode categories. This means that instead of storing the strings for each element in a possibly large column, an index is stored for each value (e.g. 0, 1, and 2) and a small lookup table is used to know that 0 is "red," 1 is "green," and 2 is "blue." This data representation is transparent to users, who will only notice the memory savings.
Pandas 2.0 has integrated Apache Arrow, improving performance, greater simplicity, enhanced compatibility with related libraries, and support for a broader range of data types.
Connect with Genomely!
Substack: Genomely from Jacob L. Steenwyk
Twitter: @Genomely
GitHub: GenomelyBio
Connect with me!
Twitter: @jlsteenwyk
Website: jlsteenwyk.com
Reference
• https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html
• https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i
• https://airbyte.com/blog/pandas-2-0-ecosystem-arrow-polars-duckdb