Scikit-learn is a popular package among data scientists. However, computation can be painfully slow: a common task like GridSearchCV() can run for days on a single machine before the optimal parameters are found. With a cluster of machines, properly set up, computing speed can increase roughly ten-fold.
To share jobs among cluster nodes, the joblib package provides pluggable parallel backends. The distributed backend is not enabled by default, which means a few extra lines of code are needed to register it before jobs will run on the nodes.
# Assuming an environment with a Dask scheduler and worker nodes already
# running (e.g. started with the dask-scheduler and dask-worker tools)
# Register the Dask distributed backend with joblib and scikit-learn
from joblib import _dask, parallel_backend
from sklearn.utils import register_parallel_backend

register_parallel_backend('distributed', _dask.DaskDistributedBackend)

# Send the parallel job to the scheduler; scatter pre-distributes x_train
# to the workers so tasks do not re-send it
...
with parallel_backend('distributed', scheduler_host='127.0.0.1:8786', scatter=[x_train]):
    scaler.fit(x_train)
...
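With the backend registered, the GridSearchCV() case from the introduction can be fanned out across the cluster. The following is a hedged sketch rather than a verbatim recipe: the SVC estimator, the parameter grid, and the x_train/y_train variables are illustrative assumptions.

from joblib import parallel_backend
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Hypothetical estimator and grid, purely for illustration
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)

# Each candidate fit becomes a task dispatched to the worker nodes
with parallel_backend('distributed', scheduler_host='127.0.0.1:8786',
                      scatter=[x_train, y_train]):
    search.fit(x_train, y_train)
print(search.best_params_)

Here n_jobs=-1 leaves it to the backend to decide how many fits run concurrently across the workers.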
In the snippets above, the scatter argument pre-distributes the data in x_train (and y_train) to the worker nodes before any tasks start, so nodes that need it do not re-transfer it on every call; this matters most for large arrays reused across many fits.
The way to register the distributed backend has been evolving across versions of joblib and scikit-learn. The code above reflects how it works at the time of writing and may change in the near future.
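For instance, more recent joblib releases (0.12 and later) bundle a built-in 'dask' backend that becomes available once a dask.distributed Client is connected, so manual registration is no longer needed. A minimal sketch, assuming the same scheduler address as before:

from dask.distributed import Client
from joblib import parallel_backend

client = Client('127.0.0.1:8786')  # connect to the running Dask scheduler
with parallel_backend('dask'):     # the 'dask' backend uses the active client
    scaler.fit(x_train)            # scaler and x_train as in the first snippet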