🚨 The 6th Edition `RIP Pandas - Learn Polars` - Part 1
RIP Pandas as pd
I saw this post last year.
It prompted me to start digging into the topic, and I found a goldmine: I discovered `Polars`. I have not used Pandas in the last six months; Polars is now my tool of choice.
🚀 It's Really Fast
🔸 A powerful data manipulation and analysis library.
🔸 Written in Rust (created by Ritchie Vink).
🔸 Built on Apache Arrow, using the native arrow2 Rust implementation.
🔸 Available in multiple languages: Python, Rust, and Node.js.
🔸 Memory efficiency: the Arrow memory model reduces memory overhead and improves memory utilization.
🔸 Expressive API: chains of expressions build an optimized query plan (see the sketch after this list).
🔸 Interoperability: integrates seamlessly with other data processing frameworks via Apache Arrow, enabling efficient data interchange between different systems.
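To make the expression API concrete, here is a minimal sketch with made-up data. The chained expressions only describe the query; in lazy mode Polars compiles the whole chain into an optimized plan before touching any data. (Note: this post uses the `groupby` spelling; newer Polars releases rename it to `group_by`.)

```python
import polars as pl

# Toy data, purely for illustration
df = pl.DataFrame({"key": ["a", "b", "a"], "x": [1, 2, 3]})

# The chain below only *describes* the query; nothing runs yet
query = (
    df.lazy()
    .filter(pl.col("x") > 1)
    .groupby("key")
    .agg([pl.col("x").sum().alias("x_sum")])
)

# Execution happens here, after the plan has been optimized
print(query.collect())
```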
🔧 Parallelization
We always aim for multithreaded code that runs across multiple cores. But in practice we rarely get the best results out of it: perfect parallelization is a myth.
⭐ Without Parallelization
In Pandas your code runs without parallelization, so execution time is higher and the other cores sit unutilized.
👉 Multi-core Parallelization Done Wrong
That's why we want to use parallel computing to allocate tasks across all the cores. But running code in parallel correctly is complex. Libraries like Dask and Modin offer parallel execution, but we need to make sure we parallelize the work in the right way, not just put sugar on top that hides the drawbacks and makes it work only to some extent.
Now let me cover some basics of parallelization theory in the DataFrame domain.
Scenario: we have data in columns `x` and `y`, plus a key column with three keys `a`, `b` and `c`. We first need to group by the key and then apply a `sum` aggregation.
👉 Basic Embarrassingly Parallel
We can use a basic map-reduce technique here: split the data, run the operation on each chunk, and combine the results (see the sketch after this list).
🔸 Aggregations across different columns can run independently.
🔸 Groupby operations can be parallelized.
🔸 This is the ideal scenario, because the data is stored in key order, so splitting it by key is really easy. Now let's look at some real-world situations where the data is mixed.
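A minimal, hypothetical Python sketch of this ideal case: the data is stored in key order, so we can split it exactly at key boundaries, let each core compute its own groupby-sum, and combine the disjoint results trivially.

```python
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor

# Data stored in key order, so each chunk holds a disjoint set of keys
chunks = [
    [("a", 1), ("a", 2)],  # core 1 gets all of key "a"
    [("b", 3), ("b", 4)],  # core 2 gets all of key "b"
    [("c", 5), ("c", 6)],  # core 3 gets all of key "c"
]

def groupby_sum(chunk):
    """Map step: a groupby-sum over a single chunk."""
    sums = defaultdict(int)
    for key, value in chunk:
        sums[key] += value
    return dict(sums)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(groupby_sum, chunks))
    # Reduce step is a trivial dict union: the key sets never overlap
    combined = {k: v for partial in partials for k, v in partial.items()}
    print(combined)  # {'a': 3, 'b': 7, 'c': 11}
```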
👉 Parallel Hashing
Hashing is at the core of many operations in a DataFrame library: a groupby operation creates a hash table with the group index pointers, and a join operation needs a hash table to find the tuples mapping the rows of the left DataFrame to the right one.
🔸 Expensive Synchronization
In both operations, we cannot simply split the data among the threads: there is no guarantee that all occurrences of the same key end up in the same hash table on the same thread. We would therefore need an extra synchronization phase in which we build a new hash table. The sketch after the list below illustrates this for 2 threads.
🔸 In the real world the data will be mixed.
🔸 The data is split across threads (CPU cores).
🔸 Each thread applies operations independently.
🔸 There is no guarantee that a key doesn't fall into multiple threads.
🔸 An extra synchronization step is necessary, and it is complicated to build.
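A toy sketch of the problem (plain Python, sequential for clarity): naively splitting mixed data means the same key lands in several partial hash tables, which forces the extra merge phase.

```python
from collections import defaultdict

# Mixed data: the same keys appear in both halves
data = [("b", 3), ("a", 1), ("c", 5), ("a", 2), ("b", 4), ("c", 6)]
chunk_1, chunk_2 = data[:3], data[3:]

def groupby_sum(chunk):
    sums = defaultdict(int)
    for key, value in chunk:
        sums[key] += value
    return dict(sums)

# Imagine each call running on its own thread
table_1, table_2 = groupby_sum(chunk_1), groupby_sum(chunk_2)
print(table_1)  # {'b': 3, 'a': 1, 'c': 5}
print(table_2)  # {'a': 2, 'b': 4, 'c': 6}  <- same keys on both "threads"!

# Extra synchronization phase: build a new hash table from the partials
merged = defaultdict(int)
for table in (table_1, table_2):
    for key, value in table.items():
        merged[key] += value
print(dict(merged))  # {'a': 3, 'b': 7, 'c': 11}
```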
👉 Expensive Locking
Another option, which turns out to be too expensive, is hashing the data on separate threads while keeping a single hash table behind a mutex. As you can imagine, thread contention is very high in this algorithm and the parallelism doesn't really pay off (a sketch follows the list below).
🔸 Expensive locking was built to avoid the extra synchronization step.
🔸 Data is split across threads.
🔸 Threads share a single hash table, guarded by a mutex, to prevent duplicates.
🔸 But different threads block each other.
🔸 It may be fine with 2-4 cores, but on 24-32 cores lock contention and inter-thread communication become a serious bottleneck.
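A hypothetical sketch of the pattern in Python (a real DataFrame library would do this in Rust or C++, but the structure is the same): a single shared table behind one mutex, so every insert serializes the threads.

```python
import threading
from collections import defaultdict

# One shared hash table guarded by a single mutex
shared_sums = defaultdict(int)
lock = threading.Lock()
data = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]

def worker(chunk):
    for key, value in chunk:
        # Every insert takes the global lock, so threads constantly block each other
        with lock:
            shared_sums[key] += value

threads = [threading.Thread(target=worker, args=(data[i::2],)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(dict(shared_sums))  # {'a': 4, 'b': 2, 'c': 4}
```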
👉 Lock-Free Hashing
Instead of the aforementioned approaches, Polars uses a lock-free hashing algorithm. This approach does more work than the expensive-locking approach, but the work is done in parallel and no thread is ever forced to wait on another. Every thread computes the hashes of all keys, and the hash value determines whether a key belongs to that thread's hash table: key ownership is simply `hash % number_of_threads`. Thanks to this simple trick, every per-thread hash table holds a unique set of keys, and we can simply combine the pointers of the hash tables on the main thread.
🔸 That's why Polars developed this lock-free hashing technique (sketched after this list).
🔸 All threads read the full data.
🔸 Each thread independently decides which values to operate on via the modulo function.
🔸 Results can be cheaply combined by trivial concatenation.
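A hypothetical sketch of the idea in Python (the GIL prevents real parallelism here, so this illustrates only the structure, not the speed-up): every thread scans all keys but claims only those where `hash(key) % n_threads` equals its own id, so the per-thread tables are disjoint by construction and need no locks.

```python
import threading
from collections import defaultdict

data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
N_THREADS = 2
tables = [defaultdict(int) for _ in range(N_THREADS)]

def worker(thread_id):
    # Every thread reads the full data but claims only its own keys
    for key, value in data:
        if hash(key) % N_THREADS == thread_id:
            tables[thread_id][key] += value

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Combining is trivial concatenation: the key sets are disjoint by construction
combined = {k: v for table in tables for k, v in table.items()}
print(combined)  # {'a': 4, 'b': 7, 'c': 4}
```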
That's All I Want! (see the six verbs combined in the sketch after this list)
🔸 select/slice columns: `select`
🔸 create/transform/assign columns: `with_columns`
🔸 filter/slice/query rows: `filter`
🔸 group DataFrame rows: `groupby`
🔸 aggregation: `agg`
🔸 sort the DataFrame: `sort`
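Here is a minimal sketch (with made-up data) that strings all six verbs together in one chain:

```python
import polars as pl

# Toy data, purely for illustration
df = pl.DataFrame({
    "key": ["a", "b", "a", "c"],
    "x": [1, 2, 3, 4],
    "y": [10.0, 20.0, 30.0, 40.0],
})

out = (
    df.with_columns([(pl.col("x") * 2).alias("x2")])  # create/transform columns
    .filter(pl.col("y") > 10)                         # filter rows
    .groupby("key")                                   # group rows
    .agg([pl.col("x2").sum().alias("x2_sum")])        # aggregate
    .sort("key")                                      # sort
    .select(["key", "x2_sum"])                        # select columns
)
print(out)
```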
Let's Install It
!echo "Creating virtual environment"
!python -m venv polars_env
!echo "Activating the virtual environment, upgrading pip and installing the packages into it"
!source polars_env/bin/activate && pip install --upgrade pip && pip install polars connectorx xlsx2csv pyarrow ipython jupyterlab plotly pandas matplotlib seaborn xlsxwriter RISE
(Each `!` command runs in its own shell in Jupyter, so the activation and the installs must be chained on a single line.)
Restart the kernel and then try the imports:
import polars as pl
import pandas as pd
import numpy as np
import plotly.express as px
Next Editions (Series of 4)
We will learn:
Second Edition
🔸 Lazy mode 1: Introducing lazy mode
🔸 Lazy mode 2: Evaluating queries
🔸 Introduction to data types
🔸 Series and DataFrame
🔸 Conversion to & from Pandas and NumPy
🔸 Filtering rows 1: Indexing with []
🔸 Filtering rows 2: Using filter and the Expression API
🔸 Filtering rows 3: Using filter in lazy mode
Third Edition
🔸 Selecting columns 1: Using []
🔸 Selecting columns 2: Using select and expressions
🔸 Selecting columns 3: Selecting multiple columns
🔸 Selecting columns 4: Transforming and adding a column
🔸 Selecting columns 5: Transforming and adding multiple columns
🔸 Selecting columns 6: Adding a new column based on a mapping or condition
🔸 Sorting
🔸 Missing values
🔸 Replacing missing values
🔸 Replacing missing values with expressions
🔸 Transforming text data
🔸 Value counts
Fourth Edition
🔸 Groupby 1: The GroupBy object
🔸 Groupby 2: Aggregation and expressions
🔸 Groupby 3: Multiple aggregations
🔸 Groupby 4: The LazyGroupBy object
🔸 Concatenation
🔸 Left, inner, outer, cross and fast-track joins
🔸 Joins on string and categorical columns
🔸 Filtering one DataFrame by another DataFrame
🔸 Using an expression in another DataFrame
🔸 Extending, stacking and concatenating
I will publish the next Edition on Sunday.
This is the 6th Edition. If you have any feedback, please don't hesitate to share it with me. And if you love my work, do share it with your colleagues.
Cheers!!
Raahul