A sense of A.I. in business: 2019

apt install through corporate proxy

Assuming a proxy service such as CNTLM is up and running on the Ubuntu machine, apt-get can install a package through it when the HTTP proxy information is passed as an option:

$ sudo apt-get -o Acquire::http::proxy="http://user:password@host:port/" install PACKAGE_NAME
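When the command is generated from a script, the proxy URL can be assembled programmatically. The sketch below is only an illustration: the function name `apt_proxy_command`, the package, the credentials and the port are all made-up values, and it assumes apt decodes percent-encoded credentials, which is worth verifying on the target system.

```python
from urllib.parse import quote


def apt_proxy_command(package, user, password, host, port):
    """Build the apt-get argument list for a one-off install through an HTTP proxy.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    # Percent-encode credentials so characters such as '@' or ':' do not
    # break the structure of the proxy URL.
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{credentials}@{host}:{port}/"
    return ["sudo", "apt-get", "-o", f"Acquire::http::proxy={proxy_url}",
            "install", package]


# Example values are assumptions, not taken from a real setup.
cmd = apt_proxy_command("curl", "alice", "p@ss:word", "127.0.0.1", 3128)
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd)` as-is, which avoids shell quoting problems with the password.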

Using Scikit-learn in cluster computing environment

A computing cluster combines commodity machines with a high-speed network switch to create a high-performance computing environment. The worker nodes collaborate through a scheduler node. Once set up, commodity servers with varying numbers of CPUs and amounts of memory can be linked together to act as a single supercomputing device. The scheduler receives tasks from the client, distributes them among the workers, then collects the computed results and sends them back to the client.
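The client, scheduler and worker roles can be imitated on a single machine with the standard library. This is only a local analogy under stated assumptions, not a cluster: a thread pool stands in for the scheduler and its workers, and `task` and `run_on_pool` are hypothetical names chosen for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def task(x):
    """Stand-in for a unit of work that a worker node would compute."""
    return x * x


def run_on_pool(inputs, n_workers=4):
    # The executor plays the scheduler role: it accepts tasks from the
    # caller (the client), hands them to idle workers, and returns the
    # results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(task, inputs))


results = run_on_pool(range(8))
print(results)
```

In a real cluster the workers are separate machines and the data travels over the network, so the cost of moving data matters far more than it does here.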

Scikit-learn is a popular package among data scientists, but computation can be very slow. A common task such as GridSearchCV() can run for days on a single machine before it finds the optimal parameters. On a properly configured cluster of machines, computation can be up to ten times faster.
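To see why a grid search takes so long, it helps to count the model fits it performs: one per parameter combination per cross-validation fold. The sketch below uses only the standard library; the grid, the fold count and the ten minutes per fit are invented numbers, and the estimate assumes the fits are independent and spread evenly over the workers.

```python
from itertools import product

# Hypothetical search space, for illustration only.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-3, 1e-2, 1e-1],
    "kernel": ["rbf", "poly"],
}
cv_folds = 5

# Every combination of parameter values is one candidate model.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds


def wall_clock_hours(total_fits, seconds_per_fit, n_parallel):
    """Idealised estimate that ignores scheduling and data-transfer overhead."""
    batches = -(-total_fits // n_parallel)  # ceiling division
    return batches * seconds_per_fit / 3600


print(len(candidates), "candidates,", total_fits, "fits")
print("1 worker:  ", wall_clock_hours(total_fits, 600, 1), "hours")
print("10 workers:", wall_clock_hours(total_fits, 600, 10), "hours")
```

Because the fits do not depend on each other, the work divides cleanly across workers, which is what makes a tenfold speedup plausible with ten times the parallel capacity.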

To share jobs among cluster nodes, the joblib package provides a custom backend. It is not enabled by default, so a few extra lines of code are required to register the backend before jobs can run on the nodes.

# Assuming an environment with scheduler and worker nodes set up properly
# Register the distributed parallel backend
from joblib import _dask, parallel_backend
from sklearn.utils import register_parallel_backend

register_parallel_backend('distributed', _dask.DaskDistributedBackend)

# Send parallel job to scheduler
...
with parallel_backend('distributed', scheduler_host='127.0.0.1:8786', scatter=[x_train]):
  scaler.fit(x_train)
...

The way to register the distributed backend has changed across versions of joblib and sklearn. The code above reflects the versions available at the time of writing and may change in the near future.

In the code above, the scatter argument sends the data in the variable 'x_train' out to the distributed network ahead of time, so that any node whose task needs the data can read it locally instead of receiving a fresh copy with every job.
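The same context-manager pattern can be tried without a cluster by selecting one of joblib's built-in backends. This is a minimal sketch, assuming joblib is installed: it uses the local 'threading' backend rather than the distributed one, and `square` is a hypothetical stand-in for real work. Newer joblib releases also offer `parallel_config` for the same purpose.

```python
from joblib import Parallel, delayed, parallel_backend


def square(x):
    """Stand-in for a unit of work; a real job would be a model fit."""
    return x * x


# Any Parallel call inside the block uses the backend selected here,
# which is how the backend choice reaches code that calls Parallel internally.
with parallel_backend("threading", n_jobs=2):
    results = Parallel()(delayed(square)(i) for i in range(5))

print(results)
```

Swapping the backend name is the only change needed to move from local threads to another registered backend, which is why the registration step matters.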

Comparison among PyPy, Cython and Numba

CPython is the standard Python implementation, but alternative implementations, extensions and packages are available to speed up Python code. Each comes with trade-offs in exchange for the extra speed.

Here is a summary comparing three popular approaches to making Python code run faster:

Technology	Package or full implementation	Compiler type	Dependency	Packages supported	Python features supported	Coding style	Performance
PyPy	Full implementation in RPython	Just-in-time		Only pure Python packages (notably NOT SciPy, Matplotlib, and scikit-learn)	Full	Pure Python syntax	High, about 10x faster than CPython
Cython	Python package	Ahead-of-time			Partial	Cython syntax	Very high, about 100x faster than CPython
Numba	Python package	Just-in-time	LLVM		Partial	Only a decorator is required ahead of the desired function	Very high, about 100x faster than CPython

Benchmarks are collected from here.
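A rough way to check such figures on one's own machine is to time the same hot loop under each approach. The sketch below is an assumption-laden illustration, not the benchmark referenced above: `sum_of_squares`, `best_time` and the loop size are made up, and the Numba variant is only timed when the numba package happens to be installed. Running the same script with the PyPy interpreter gives the PyPy figure.

```python
import platform
import timeit


def sum_of_squares(n):
    """Pure-Python hot loop, the kind of code JIT compilers speed up."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def best_time(func, n, repeat=3):
    """Best of several single runs, in seconds."""
    return min(timeit.repeat(lambda: func(n), number=1, repeat=repeat))


# The interpreter running this script (CPython, PyPy, ...) is one candidate.
candidates = {platform.python_implementation(): sum_of_squares}

try:
    from numba import njit  # optional dependency
except ImportError:
    njit = None

if njit is not None:
    fast = njit(sum_of_squares)  # same effect as decorating with @njit
    fast(10)                     # first call triggers compilation
    candidates["Numba"] = fast

N = 200_000
for name, func in candidates.items():
    print(f"{name}: {best_time(func, N):.4f} s")
```

Timings from a loop this small vary between machines and runs, so they indicate an order of magnitude at best.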