Joblib: A Practical Guide to Caching and Parallelization in Python

Joblib is a Python library that provides tools for lightweight pipelining.
It's particularly useful for caching the results of time-consuming computations, persisting objects to disk, and parallelizing code execution. This guide covers its most common practical use cases.

Official Documentation

For more detailed information, see the official joblib documentation at https://joblib.readthedocs.io/.

What is Joblib?

Joblib is designed to provide lightweight pipelining in Python.
It offers:

  • Transparent disk-caching of functions and lazy re-evaluation
  • Simple parallel computing
  • Logging and tracing of execution

Let’s explore its key features with practical examples.

Installation

First, let’s install joblib:

pip install joblib

Caching Function Results with Memory

One of the most useful features of joblib is its ability to cache function results:

import joblib
import numpy as np
import time

# Create a memory cache
memory = joblib.Memory(location=".cache", verbose=0)

@memory.cache
def slow_function(x):
    """A function that takes some time to execute."""
    print("Computing slow_function...")
    time.sleep(2)  # Simulate a time-consuming computation
    return np.sum(x)

# First call - will be computed and cached
data = np.random.rand(1000)
t0 = time.time()
result1 = slow_function(data)
print(f"First call took {time.time() - t0:.3f} seconds")

# Second call - will use cached result
t0 = time.time()
result2 = slow_function(data)
print(f"Second call took {time.time() - t0:.3f} seconds")
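
The cache key is derived from the function's arguments, so an argument that does not affect the result (a logging flag, for instance) can cause needless recomputation. Below is a minimal sketch of the ignore option of memory.cache, which leaves named arguments out of the key. The function square, its verbose flag, and the calls list are hypothetical names used only for illustration, and the cache lives in a temporary directory.

```python
import shutil
import tempfile

import joblib

# Temporary cache directory so the sketch does not touch ".cache"
cache_dir = tempfile.mkdtemp()
memory = joblib.Memory(location=cache_dir, verbose=0)

calls = []  # records every real execution of the function


# "verbose" is excluded from the cache key (hypothetical example function)
@memory.cache(ignore=["verbose"])
def square(x, verbose=False):
    calls.append(x)
    if verbose:
        print(f"squaring {x}")
    return x * x


first = square(3)                 # computed and stored
second = square(3, verbose=True)  # cache hit: verbose is not part of the key
third = square(4)                 # different x, so it is computed

print(first, second, third, calls)

# Remove the temporary cache
shutil.rmtree(cache_dir, ignore_errors=True)
```

Only two executions are recorded because the second call differs from the first only in the ignored argument.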

Caching DataFrames with Custom Decorators

For pandas DataFrames, we can create a specialized caching decorator:

import hashlib
import os
import time
from functools import wraps

import joblib
import pandas as pd

def df_cache(func):
    """DataFrame caching decorator."""
    # Create the cache directory
    cache_dir = ".cache"
    os.makedirs(cache_dir, exist_ok=True)

    memory = joblib.Memory(location=cache_dir, verbose=0)

    # Identify the wrapped function so that different decorated
    # functions do not share cache entries
    func_name = f"{func.__module__}.{func.__qualname__}"

    @wraps(func)
    def wrapper(df: pd.DataFrame, *args, **kwargs):
        # Calculate a hash of the DataFrame contents
        df_hash = hashlib.md5(
            pd.util.hash_pandas_object(df).values).hexdigest()

        # Inner function keyed on the hash; df itself is taken
        # from the enclosing scope
        @memory.cache
        def cached_func(func_name, df_hash, *args, **kwargs):
            return func(df, *args, **kwargs)

        return cached_func(func_name, df_hash, *args, **kwargs)

    return wrapper

# Example usage
@df_cache
def process_dataframe(df, threshold=0.5):
    """Some expensive DataFrame processing."""
    print("Processing DataFrame...")
    time.sleep(2)  # Simulate an expensive operation
    return df[df > threshold].dropna()
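
The decorator above depends on the DataFrame digest being stable for equal content and different once the data changes. The sketch below isolates that step so it can be checked on its own. The helper frame_digest is a hypothetical name, and mixing the column labels into the digest is an added assumption that goes slightly beyond the decorator shown above.

```python
import hashlib

import pandas as pd


def frame_digest(df: pd.DataFrame) -> str:
    """Return an MD5 digest of a DataFrame's rows, index, and column labels."""
    digest = hashlib.md5()
    # Per-row hashes of the values and the index
    digest.update(pd.util.hash_pandas_object(df, index=True).values.tobytes())
    # Column labels, added explicitly so a rename changes the digest
    digest.update(repr(list(df.columns)).encode("utf-8"))
    return digest.hexdigest()


df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

original = frame_digest(df)
same = frame_digest(df.copy())                         # equal content
changed = frame_digest(df.assign(a=[1.0, 2.5]))        # one value differs
renamed = frame_digest(df.rename(columns={"a": "c"}))  # one label differs

print(original == same, original == changed, original == renamed)
```

A digest like this could replace the df_hash computation inside the decorator if column renames should invalidate cached results.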

Parallel Processing with Parallel and delayed

Joblib makes it easy to parallelize computations:

from joblib import Parallel, delayed

def process_item(x):
    """Process a single item."""
    return x * x

# Sequential processing
results_seq = [process_item(i) for i in range(10)]

# Parallel processing with 4 workers
results_par = Parallel(n_jobs=4)(delayed(process_item)(i) for i in range(10))

# With a progress bar (tqdm wraps the input iterable, so the bar
# advances as tasks are dispatched, not as they complete)
from tqdm import tqdm
results_par = Parallel(n_jobs=4)(
    delayed(process_item)(i) for i in tqdm(range(10))
)
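
Functions with several arguments work the same way: delayed captures the positional and keyword arguments, and Parallel returns the results in the order the tasks were submitted. A minimal sketch follows; the function scale and its parameters are hypothetical, and prefer="threads" is used only to keep the example lightweight.

```python
from joblib import Parallel, delayed


def scale(value, factor=1, offset=0):
    """Hypothetical task: apply a linear transformation to one value."""
    return value * factor + offset


# delayed(...) records the call without running it
tasks = (delayed(scale)(v, factor=10, offset=1) for v in range(5))

# prefer="threads" is a soft hint; drop it for CPU-bound Python code
results = Parallel(n_jobs=2, prefer="threads")(tasks)

print(results)  # [1, 11, 21, 31, 41]
```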

Persisting Objects to Disk

Joblib provides efficient tools for saving Python objects to disk:

import joblib
import numpy as np

# Create a large array
large_array = np.random.rand(1000, 1000)

# Save to disk
joblib.dump(large_array, 'large_array.joblib')

# Load from disk
loaded_array = joblib.load('large_array.joblib')
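
joblib.dump also accepts a compress argument, which trades some CPU time for smaller files. The sketch below compares file sizes for an array of zeros, which compresses very well; real data will usually shrink less. The file names are placeholders and everything is written to a temporary directory.

```python
import os
import tempfile

import joblib
import numpy as np

array = np.zeros((500, 500))

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "raw.joblib")
    packed_path = os.path.join(tmp, "packed.joblib")

    joblib.dump(array, raw_path)                # uncompressed
    joblib.dump(array, packed_path, compress=3)  # compression level 3

    raw_size = os.path.getsize(raw_path)
    packed_size = os.path.getsize(packed_path)

    # Loading works the same way for compressed files
    restored = joblib.load(packed_path)

print(f"raw: {raw_size} bytes, compressed: {packed_size} bytes")
```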

Configuring Memory Cache

You can configure how the cache works:

import joblib

# Create a memory cache with custom settings
memory = joblib.Memory(
    location=".cache",  # Cache directory
    verbose=1,          # Display cache operations
    compress=True,      # Compress cached data
    # mmap_mode="r",    # Alternative: memory-map cached numpy arrays.
    #                     Compressed results cannot be memory-mapped,
    #                     so choose either compress or mmap_mode.
)

@memory.cache
def my_function(x):
    return x * 2
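
As a rough illustration of the mmap_mode setting, the sketch below caches a function that returns a numpy array, with compression left off. With mmap_mode="r", cached arrays are typically handed back as read-only memory maps rather than being loaded fully into memory. The function build_table and the executions list are hypothetical names.

```python
import shutil
import tempfile

import joblib
import numpy as np

cache_dir = tempfile.mkdtemp()
# No compression here, so cached arrays can be memory-mapped
memory = joblib.Memory(location=cache_dir, mmap_mode="r", verbose=0)

executions = []  # records every real execution


@memory.cache
def build_table(n):
    """Hypothetical expensive computation returning an array."""
    executions.append(n)
    return np.arange(n, dtype=np.float64) ** 2


first = build_table(1000)   # computed and written to the cache
second = build_table(1000)  # served from the cache

identical = bool(np.array_equal(first, second))
total = float(second[:4].sum())  # 0 + 1 + 4 + 9

print(identical, total, len(executions))

# Release the arrays before removing the temporary cache
del first, second
shutil.rmtree(cache_dir, ignore_errors=True)
```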

Memory Management and Cache Clearing

Cached results accumulate on disk, so it helps to inspect and clear the cache from time to time:

# Clear the entire cache
memory.clear()

# Clear the cached results of one function
my_function.clear()

# Show where the cache is stored
print(memory.location)

# Check whether a specific call is already cached
my_function.check_call_in_cache(10)
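
To make the lifecycle concrete, here is a small sketch that checks a call before it is cached, after it is cached, and after the function's cache is cleared. It assumes a joblib version in which cached functions provide check_call_in_cache; the function double is a hypothetical example.

```python
import shutil
import tempfile

import joblib

cache_dir = tempfile.mkdtemp()
memory = joblib.Memory(location=cache_dir, verbose=0)


@memory.cache
def double(x):
    """Hypothetical cached function."""
    return x * 2


before = double.check_call_in_cache(10)  # nothing stored yet
value = double(10)                       # computed and stored
after = double.check_call_in_cache(10)   # now present in the cache

double.clear(warn=False)                 # drop this function's results
after_clear = double.check_call_in_cache(10)

print(before, value, after, after_clear)

shutil.rmtree(cache_dir, ignore_errors=True)
```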

Using Joblib with scikit-learn

Joblib is heavily used in scikit-learn for model persistence:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import joblib

# Train a model
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X, y)

# Save the model
joblib.dump(clf, 'random_forest.joblib')

# Load the model
clf_loaded = joblib.load('random_forest.joblib')

# Make predictions with the loaded model
predictions = clf_loaded.predict(X[:5])
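
Persisted models are generally only safe to load with the library versions that created them, so one option is to store the scikit-learn version next to the model. The sketch below does this with a plain dictionary. The bundle layout and its key names are assumptions for illustration, not a joblib or scikit-learn convention.

```python
import os
import tempfile

import joblib
import numpy as np
import sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical bundle layout: the model plus the version that trained it
bundle = {"model": model, "sklearn_version": sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_bundle.joblib")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# Compare versions before trusting the restored model
versions_match = restored["sklearn_version"] == sklearn.__version__
same_predictions = bool(
    np.array_equal(restored["model"].predict(X), model.predict(X))
)

print(versions_match, same_predictions)
```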

Advanced Parallel Processing

For more control over parallelization, the parallel_backend context manager selects which backend runs the workers:

import numpy as np
from joblib import Parallel, delayed, parallel_backend

def complex_operation(matrix):
    # Some complex operation
    return np.linalg.svd(matrix)

matrices = [np.random.rand(500, 500) for _ in range(10)]

# Using different backends
with parallel_backend('loky', n_jobs=4):
    results1 = Parallel()(delayed(complex_operation)(m) for m in matrices)

with parallel_backend('threading', n_jobs=4):
    results2 = Parallel()(delayed(complex_operation)(m) for m in matrices)
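
When each task is very small, scheduling overhead can outweigh the work itself. Joblib already batches tasks automatically, but grouping items into explicit chunks is another option. The sketch below is one way to do that; chunked and sum_of_squares are hypothetical helper names, and the thread preference only keeps the example lightweight.

```python
from joblib import Parallel, delayed


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sum_of_squares(chunk):
    """Process a whole chunk in a single task."""
    return sum(v * v for v in chunk)


values = list(range(100))

# One task per chunk of 25 items instead of one task per item
partials = Parallel(n_jobs=2, prefer="threads")(
    delayed(sum_of_squares)(chunk) for chunk in chunked(values, 25)
)
total = sum(partials)

print(len(partials), total)  # 4 328350
```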

https://www.hardyhu.cn/2023/05/29/Joblib-A-Practical-Guide-to-Caching-and-Parallelization-in-Python/
Author: John Doe
Posted on: May 29, 2023