apt install through corporate proxy

Assuming a proxy service such as CNTLM is up and running on the Ubuntu machine, apt-get can install a package through it by passing the HTTP proxy details with the -o option, as follows:

$ sudo apt-get -o Acquire::http::proxy="http://user:password@host:port/" install PACKAGE_NAME
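
A password that contains reserved characters such as '@' or ':' will break the proxy URL unless it is percent-encoded. Below is a minimal sketch of building the command from Python. The helper name build_apt_proxy_command and all sample values are hypothetical, and it assumes apt decodes percent-encoded credentials in the proxy URL.

```python
import shlex
from urllib.parse import quote


def build_apt_proxy_command(user, password, host, port, package):
    """Return the apt-get argument list with a percent-encoded proxy URL.

    Hypothetical helper for illustration; not part of apt or CNTLM.
    """
    # quote(..., safe="") also encodes '@', ':' and '/', which would
    # otherwise be misread as URL delimiters.
    proxy = "http://{}:{}@{}:{}/".format(
        quote(user, safe=""), quote(password, safe=""), host, port
    )
    return [
        "sudo", "apt-get",
        "-o", "Acquire::http::proxy={}".format(proxy),
        "install", package,
    ]


if __name__ == "__main__":
    # Sample values only; substitute the real proxy details.
    cmd = build_apt_proxy_command("alice", "p@ss:word", "127.0.0.1", 3128, "curl")
    print(" ".join(shlex.quote(part) for part in cmd))
```

The argument list can be passed to subprocess.run() directly, which avoids a second round of shell quoting.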

Using Scikit-learn in cluster computing environment

A computer cluster combines commodity machines with a high-speed network switch to create a high-performance computing environment. Worker nodes collaborate through a scheduler node. Once set up, commodity servers with varying numbers of CPUs and amounts of memory can be linked together to act as a single supercomputing device. The scheduler receives tasks from the client, distributes them among the workers, then collects the computed results and sends them back to the client.

Scikit-learn is a popular package among data scientists, but its computations can be very slow. A common task such as GridSearchCV() can run for days on a single machine before the optimal parameters are found. On a properly configured cluster, computation can be up to ten times faster.
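
Grid search parallelizes well because every parameter combination can be evaluated independently. The following is a minimal standard-library sketch of that scheduler/worker pattern on a single machine. It is not scikit-learn's implementation: the score function and the parameter grid are made up for illustration, and threads stand in for worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def score(params):
    """Toy objective standing in for a cross-validated model score."""
    c, gamma = params["C"], params["gamma"]
    return -((c - 10) ** 2) - (gamma - 0.1) ** 2


def grid_search(param_grid, n_workers=4):
    """Evaluate every combination in param_grid and return the best one."""
    names = sorted(param_grid)
    candidates = [
        dict(zip(names, values))
        for values in product(*(param_grid[name] for name in names))
    ]
    # The executor plays the scheduler: it hands candidates to workers
    # and collects the scores in the original order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(score, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], len(candidates)


if __name__ == "__main__":
    grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
    best, n_candidates = grid_search(grid)
    print(best, n_candidates)
```

The number of candidates is the product of the grid sizes (here 3 x 3 = 9), and a real grid search multiplies that again by the number of cross-validation folds, which is why the work grows so quickly on one machine.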

To share jobs among cluster nodes, the joblib package provides a custom backend. It is not enabled by default, so a few extra lines of code are needed to register the backend before jobs can run on the nodes.

# Assuming an environment with scheduler and worker nodes set up properly
# Register the distributed parallel backend
from joblib import _dask, parallel_backend
from sklearn.utils import register_parallel_backend

register_parallel_backend('distributed', _dask.DaskDistributedBackend)

# Send parallel job to scheduler
...
with parallel_backend('distributed', scheduler_host='127.0.0.1:8786', scatter=[x_train]):
  scaler.fit(x_train)
...


The way to register the distributed backend has been changing across versions of joblib and sklearn. The code above reflects the state at the time of writing and may change in the near future.

In the code above, the data in the variable 'x_train' is passed through the scatter argument, so it is sent out to the distributed network in advance and shared among the nodes that need it for their part of the task.


Comparison among PyPy, Cython and Numba

CPython is the standard Python implementation, but alternative implementations, extensions and packages are available to speed code up. Each of them, however, requires some trade-offs to reach full speed.

Here is a summary comparing three popular approaches to making Python code run faster:


| Technology | Package or full implementation | Compiler type | Dependency | Packages supported | Python features supported | Coding style | Performance |
|---|---|---|---|---|---|---|---|
| PyPy | Full implementation in RPython | Just-in-time | | Only pure Python packages (notably NOT SciPy, Matplotlib, and scikit-learn) | Full | Pure Python syntax | High, 10x faster than CPython |
| Cython | Python package | Ahead-of-time | | | Partial | Cython syntax | Very high, 100x faster than CPython |
| Numba | Python package | Just-in-time | LLVM | | Partial | Only a decorator required ahead of the desired function | Very high, 100x faster than CPython |

Benchmarks are collected from here.
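
To illustrate the decorator-only coding style listed for Numba: the function body stays plain Python, and a single decorator requests compilation. This is a minimal sketch that runs whether or not Numba is installed. The fallback decorator and the function sum_of_squares are illustrative and are not taken from the benchmarks.

```python
try:
    # njit compiles the decorated function just-in-time on first call.
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs as plain Python without Numba.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in range(n) with an explicit loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    print(sum_of_squares(10))
```

Explicit numeric loops like this one are the kind of code where a JIT compiler tends to help most. Any actual speed-up depends on the workload and should be measured.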
