Joblib: A Practical Guide to Caching and Parallelization in Python

Joblib is a Python library that provides tools for lightweight pipelining.
It’s particularly useful for caching the results of time-consuming computations, persisting objects to disk, and parallelizing code execution. This guide covers its most common practical use cases.

Official Documentation

For more detailed information, see the official joblib documentation at https://joblib.readthedocs.io/.

What is Joblib?

Joblib is built around three main features:

Transparent disk-caching of functions and lazy re-evaluation
Simple parallel computing
Logging and tracing of execution

Let’s explore its key features with practical examples.

Installation

First, let’s install joblib:

`pip install joblib`

Caching Function Results with `Memory`

One of the most useful features of joblib is its ability to cache function results:

import joblib
import numpy as np
import time

# Create a memory cache
memory = joblib.Memory(location=".cache", verbose=0)

@memory.cache
def slow_function(x):
    """A function that takes some time to execute."""
    print("Computing slow_function...")
    time.sleep(2)  # Simulate a time-consuming computation
    return np.sum(x)

# First call - will be computed and cached
data = np.random.rand(1000)
t0 = time.time()
result1 = slow_function(data)
print(f"First call took {time.time() - t0:.3f} seconds")

# Second call - will use cached result
t0 = time.time()
result2 = slow_function(data)
print(f"Second call took {time.time() - t0:.3f} seconds")
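
The cache is keyed on the function's arguments, so a repeated input is loaded from disk while a new input triggers a fresh computation. The following is a minimal sketch of that behavior. The `square_sum` function, the `calls` list, and the temporary cache directory are illustrative names chosen for this example, not part of joblib.

```python
import tempfile

import joblib
import numpy as np

# Use a throwaway cache directory so the example does not touch ".cache".
cache_dir = tempfile.mkdtemp()
memory = joblib.Memory(location=cache_dir, verbose=0)

calls = []  # Grows by one each time the function body actually runs


@memory.cache
def square_sum(x):
    """Hypothetical stand-in for an expensive computation."""
    calls.append(1)
    return float(np.sum(x ** 2))


a = np.arange(5)
b = np.arange(6)

first = square_sum(a)   # Computed, then written to the cache
second = square_sum(a)  # Same input: loaded from the cache
third = square_sum(b)   # New input: computed again

print(first, second, third, len(calls))  # 30.0 30.0 55.0 2
```

The function body ran only twice for three calls, which is the saving the cache provides.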

Caching DataFrames with Custom Decorators

For pandas DataFrames, we can create a specialized caching decorator:

import os
import time
import hashlib
import pandas as pd
from functools import wraps
import joblib

def df_cache(func):
    """DataFrame caching decorator"""
    # Create cache directory
    cache_dir = ".cache"
    os.makedirs(cache_dir, exist_ok=True)

    memory = joblib.Memory(location=cache_dir, verbose=0)

    @wraps(func)
    def wrapper(df: pd.DataFrame, *args, **kwargs):
        # Reduce the DataFrame to a short digest that serves as the cache key
        df_hash = hashlib.md5(
            pd.util.hash_pandas_object(df).values).hexdigest()

        # Only the hashable arguments form the cache key; df itself is
        # captured by the closure. func_name keeps different decorated
        # functions from sharing cache entries.
        @memory.cache
        def cached_func(func_name, df_hash, *args, **kwargs):
            return func(df, *args, **kwargs)

        return cached_func(func.__qualname__, df_hash, *args, **kwargs)

    return wrapper

# Example usage
@df_cache
def process_dataframe(df, threshold=0.5):
    """Some expensive DataFrame processing."""
    print("Processing DataFrame...")
    time.sleep(2)  # Simulate expensive operation
    return df[df > threshold].dropna()
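
The decorator above depends on the DataFrame digest being stable for equal content and different when the data changes. Here is a small sketch that checks that property in isolation. `dataframe_digest` is a hypothetical helper written for this example; it repeats the hashing step from the decorator with an explicit `tobytes()` call.

```python
import hashlib

import pandas as pd


def dataframe_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: reduce a DataFrame to one hex digest."""
    # hash_pandas_object returns one uint64 hash per row
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    return hashlib.md5(row_hashes.values.tobytes()).hexdigest()


df_a = pd.DataFrame({"x": [0.1, 0.7], "y": [0.9, 0.2]})
df_b = df_a.copy()          # Same content, different object
df_c = df_a.copy()
df_c.loc[0, "x"] = 0.3      # One cell changed

same = dataframe_digest(df_a) == dataframe_digest(df_b)
changed = dataframe_digest(df_a) != dataframe_digest(df_c)

print(same, changed)  # True True
```

The row hashes are built from cell values and the index, so a change that only renames columns may not alter the digest. If column names matter for your function, consider adding them to the cache key.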

Parallel Processing with `Parallel` and `delayed`

Joblib makes it easy to parallelize computations:

from joblib import Parallel, delayed

def process_item(x):
    """Process a single item."""
    return x * x

# Sequential processing
results_seq = [process_item(i) for i in range(10)]

# Parallel processing with 4 workers
results_par = Parallel(n_jobs=4)(delayed(process_item)(i) for i in range(10))

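# With a progress bar (requires tqdm). The bar advances as tasks are
# dispatched to the workers, so it can run ahead of actual completion.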
from tqdm import tqdm
results_par = Parallel(n_jobs=4)(
    delayed(process_item)(i) for i in tqdm(range(10))
)
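
`delayed` also accepts keyword arguments, and `Parallel` returns results in the same order as the inputs regardless of which worker finishes first. A minimal sketch follows; the `scale` function is a made-up task for illustration.

```python
from joblib import Parallel, delayed


def scale(value, factor=1):
    """Hypothetical task: multiply a value by a factor."""
    return value * factor


# delayed(scale)(i, factor=10) records the call without running it;
# Parallel then executes the recorded calls across the workers.
tasks = (delayed(scale)(i, factor=10) for i in range(5))
results = Parallel(n_jobs=2)(tasks)

print(results)  # [0, 10, 20, 30, 40]
```

Passing `n_jobs=-1` instead uses all available CPU cores.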

Persisting Objects to Disk

Joblib provides efficient tools for saving Python objects to disk:

import joblib
import numpy as np

# Create a large array
large_array = np.random.rand(1000, 1000)

# Save to disk
joblib.dump(large_array, 'large_array.joblib')

# Load from disk
loaded_array = joblib.load('large_array.joblib')
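
`joblib.dump` also takes a `compress` argument that trades some CPU time for smaller files. The sketch below saves the same object with and without compression and checks that it round-trips. The file names and the sample dictionary are arbitrary choices for this example.

```python
import os
import tempfile

import joblib
import numpy as np

data = {"weights": np.arange(10_000, dtype=np.float64), "label": "demo"}

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "data.joblib")
    zipped_path = os.path.join(tmp_dir, "data_compressed.joblib")

    joblib.dump(data, raw_path)                 # No compression
    joblib.dump(data, zipped_path, compress=3)  # zlib, level 3

    raw_size = os.path.getsize(raw_path)
    zipped_size = os.path.getsize(zipped_path)

    restored = joblib.load(zipped_path)

round_trip_ok = np.array_equal(restored["weights"], data["weights"])
print(round_trip_ok, zipped_size < raw_size)
```

How much space compression saves depends heavily on the data; random values compress far less than the regular sequence used here.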

Configuring Memory Cache

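You can configure how the cache works:

# Create a memory cache with custom settings
memory = joblib.Memory(
    location=".cache",     # Cache directory
    verbose=1,             # Print a message when results are computed or loaded
    mmap_mode='r',         # Memory-map cached NumPy arrays when loading them
    # compress=True,       # Alternative: compress cached data on disk.
    #                      # Compressed arrays cannot be memory-mapped,
    #                      # so choose one option or the other.
)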

@memory.cache
def my_function(x):
    return x * 2
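
Caching can also be tuned per function. For instance, `memory.cache` accepts an `ignore` list naming arguments that should not affect the cache key, which helps with flags that change logging but not the result. This is a minimal sketch; the `double` function, its `debug` flag, and the `runs` list are illustrative.

```python
import tempfile

import joblib

memory = joblib.Memory(location=tempfile.mkdtemp(), verbose=0)

runs = []  # Records each real execution of the function body


@memory.cache(ignore=["debug"])
def double(x, debug=False):
    """Hypothetical function whose debug flag does not change the result."""
    runs.append(x)
    if debug:
        print(f"doubling {x}")
    return x * 2


r1 = double(21)              # Computed and cached
r2 = double(21, debug=True)  # Cache hit: debug is not part of the key

print(r1, r2, len(runs))  # 42 42 1
```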

Memory Management and Cache Clearing

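Cached results accumulate on disk, so clear them when they become stale or are no longer needed:

# Clear the entire cache
memory.clear()

# Clear all cached results of one function
my_function.clear()

# Show where the cache is stored on disk
print(memory.location)

# Check whether a particular call is already cached, without running it
# (check_call_in_cache is a method of the cached function in recent joblib releases)
my_function.check_call_in_cache(10)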
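
To see what clearing does in practice, the sketch below counts how often a cached function actually runs before and after each kind of clear. The `triple` function and the `runs` list are made up for this example, and `warn=False` only silences the warning joblib prints when flushing.

```python
import tempfile

import joblib

memory = joblib.Memory(location=tempfile.mkdtemp(), verbose=0)

runs = []  # Records each real execution of the function body


@memory.cache
def triple(x):
    """Hypothetical cached function."""
    runs.append(x)
    return x * 3


triple(2)                     # Computed
triple(2)                     # Cache hit
after_hit = len(runs)

triple.clear(warn=False)      # Drop cached results for this function only
triple(2)                     # Recomputed
after_func_clear = len(runs)

memory.clear(warn=False)      # Drop everything stored in this cache
triple(2)                     # Recomputed again
after_full_clear = len(runs)

print(after_hit, after_func_clear, after_full_clear)  # 1 2 3
```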

Using Joblib with scikit-learn

Joblib is heavily used in scikit-learn for model persistence:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import joblib

# Train a model
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)

# Save the model
joblib.dump(clf, 'random_forest.joblib')

# Load the model
clf_loaded = joblib.load('random_forest.joblib')

# Make predictions with the loaded model
predictions = clf_loaded.predict(X[:5])
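
A quick sanity check after loading is to confirm that the restored model predicts exactly what the original did. The sketch below does this with a small forest and a temporary file; the file name and the `random_state` value are arbitrary. Keep in mind that `joblib.load` relies on pickle, so it should only be used on files from a trusted source, and a saved model is best loaded with the same scikit-learn version that produced it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(clf.predict(X), restored.predict(X))
print(same_predictions)  # True
```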

Advanced Parallel Processing

For finer control over how tasks are executed, you can select the backend explicitly:

import numpy as np
from joblib import Parallel, delayed, parallel_backend

def complex_operation(matrix):
    # Some complex operation
    return np.linalg.svd(matrix)

matrices = [np.random.rand(500, 500) for _ in range(10)]

# 'loky' (the default) runs tasks in separate worker processes
with parallel_backend('loky', n_jobs=4):
    results1 = Parallel()(delayed(complex_operation)(m) for m in matrices)

# 'threading' runs tasks in threads within the current process
with parallel_backend('threading', n_jobs=4):
    results2 = Parallel()(delayed(complex_operation)(m) for m in matrices)
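
Instead of naming a backend, you can give `Parallel` a soft hint with `prefer="threads"` or `prefer="processes"`. Threads avoid the cost of copying inputs to worker processes and tend to suit work that releases the GIL, as many NumPy linear algebra routines do. A minimal sketch, assuming small matrices and a made-up `singular_values` helper:

```python
import numpy as np
from joblib import Parallel, delayed


def singular_values(matrix):
    """Hypothetical task: return only the singular values of a matrix."""
    return np.linalg.svd(matrix, compute_uv=False)


rng = np.random.default_rng(0)
matrices = [rng.random((50, 50)) for _ in range(4)]

# prefer="threads" is a hint, not a hard requirement on the backend.
results = Parallel(n_jobs=2, prefer="threads")(
    delayed(singular_values)(m) for m in matrices
)

shapes = [r.shape for r in results]
print(shapes)  # [(50,), (50,), (50,), (50,)]
```

Which option is faster depends on the workload, so it is worth timing both on your own data.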

https://www.hardyhu.cn/2023/05/29/Joblib-A-Practical-Guide-to-Caching-and-Parallelization-in-Python/

Author

John Doe

Posted on

May 29, 2023
