Using Scikit-learn in a cluster computing environment

Cluster computing combines commodity machines with a high-speed network switch to create a high-performance computing environment. The worker nodes collaborate through a scheduler node. Once set up, commodity servers with varying numbers of CPUs and amounts of memory can be linked together to act as a single, much more powerful computing device. The scheduler receives tasks from the client, distributes them among the workers, then collects the computed results and sends them back to the client.

Scikit-learn is a popular package among data scientists. However, computation can be very slow. A common task such as GridSearchCV() can run for days on a single machine before the optimal parameters are found. With a properly set up cluster of machines, computation can be sped up by as much as tenfold.
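
To see why a grid search gets expensive, and why it parallelises well, it helps to count the model fits involved. The sketch below is only an illustration: the iris dataset, the SVC estimator, the parameter grid and the fold count are all assumptions chosen to keep it small, and are not taken from any real workload.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Toy data and grid, chosen only for illustration
X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
n_folds = 3

# Every parameter combination is fitted once per cross-validation fold
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * n_folds

# n_jobs=2 runs the independent fits in two local worker processes
search = GridSearchCV(SVC(), param_grid, cv=n_folds, n_jobs=2)
search.fit(X, y)

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
print("best parameters:", search.best_params_)
```

Each of these fits is independent of the others, which is what makes it possible to hand them out to many workers. With a realistic grid and dataset the number of fits and the cost of each one grow quickly.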

To share jobs among cluster nodes, the joblib package provides a custom backend. It is not enabled by default, so a few extra lines of code are needed to register the backend before jobs can run on the nodes.

# Assuming an environment with scheduler and worker nodes set up properly
# Register the distributed parallel backend
from joblib import _dask, parallel_backend
from sklearn.utils import register_parallel_backend

register_parallel_backend('distributed', _dask.DaskDistributedBackend)

# Send the parallel job to the scheduler
...
with parallel_backend('distributed', scheduler_host='127.0.0.1:8786', scatter=[x_train]):
    scaler.fit(x_train)
...


The way to register the distributed backend has been changing across versions of joblib and sklearn. The code above reflects how it works at the time of writing and may change in the near future.

In the code above, the data in the variable 'x_train' is sent out to the distributed network in advance (scattered), where it is shared among the nodes whose tasks need it.
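
Note that a backend only helps estimators that parallelise their work through joblib, typically those that take an n_jobs argument. As a rough sketch of how a grid search might be routed through a chosen backend, the example below uses a hypothetical helper, run_search, together with a toy dataset and grid that are assumptions for illustration. It runs locally with joblib's built-in "threading" backend. The commented-out lines show how the same call might look against a cluster, assuming a joblib version that ships the "dask" backend and a dask.distributed scheduler reachable at the address used earlier.

```python
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


def run_search(X, y, backend="threading", **backend_kwargs):
    """Hypothetical helper: run a small grid search through a joblib backend."""
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=20, random_state=0),
        param_grid={"max_depth": [2, 4], "min_samples_leaf": [1, 5]},
        cv=3,
        n_jobs=-1,  # let the active backend decide how many workers to use
    )
    # Every joblib call made inside this block goes to the selected backend
    with parallel_backend(backend, **backend_kwargs):
        search.fit(X, y)
    return search


X, y = load_iris(return_X_y=True)

# Local run with the "threading" backend that ships with joblib
local_search = run_search(X, y)
print("best parameters:", local_search.best_params_)

# On a cluster (assumption: dask.distributed is installed and a scheduler
# is listening on this address):
# from dask.distributed import Client
# client = Client("127.0.0.1:8786")
# cluster_search = run_search(X, y, backend="dask", scatter=[X, y])
```

Keeping the backend name as a parameter means the same search can be tried on one machine first and pointed at the cluster later without touching the modelling code.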

