### Latency - Bulk vs Atomic Mode

• In general, doing predictions in bulk (many instances at the same time) is more efficient, for a number of reasons (branch predictability, CPU cache utilization, linear algebra library optimizations, etc.).
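• A rough timing sketch of the difference (SGDClassifier and the data sizes here are illustrative assumptions, not a prescribed benchmark):

```python
import time

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)
y = rng.randint(2, size=1000)
clf = SGDClassifier().fit(X, y)

t0 = time.perf_counter()
for row in X:
    clf.predict(row.reshape(1, -1))  # atomic: one instance per call
atomic = time.perf_counter() - t0

t0 = time.perf_counter()
clf.predict(X)  # bulk: all instances in one call
bulk = time.perf_counter() - t0

print(f"atomic: {atomic:.4f}s  bulk: {bulk:.4f}s")
```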

• Scikit-learn's input validation adds overhead to every call to predict and similar functions.

• In particular: checking that features are finite (not NaN or infinite) involves a full pass over the data. If you know your data is valid, you can suppress the finiteness check by setting the environment variable SKLEARN_ASSUME_FINITE to a non-empty string before importing scikit-learn, or configure it in Python with set_config.
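• A minimal sketch of both global options (the environment variable must be set before scikit-learn is imported):

```python
# Option 1: environment variable, set before importing scikit-learn
import os
os.environ["SKLEARN_ASSUME_FINITE"] = "true"

# Option 2: global configuration from Python
import sklearn
sklearn.set_config(assume_finite=True)
```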

• For fine-tuned control of this feature:

```python
import sklearn

with sklearn.config_context(assume_finite=True):
    ...  # learning/prediction method here, with reduced validation
```

### Latency vs Number of Features

• A matrix of $M$ samples with $N$ features has a memory footprint of $O(NM)$, and the number of basic math operations per prediction grows with the number of features as well (e.g., one multiply-add per feature for a linear model). (See the prediction time vs. number of features graph, below.)
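• A minimal sketch of measuring this effect (Ridge and the sizes below are illustrative assumptions):

```python
import time

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
for n_features in (100, 1000, 5000):
    X = rng.rand(1000, n_features)
    y = rng.rand(1000)
    model = Ridge().fit(X, y)
    t0 = time.perf_counter()
    model.predict(X)
    print(f"{n_features} features: {time.perf_counter() - t0:.4f}s")
```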

### Latency vs Input Datatypes

• SciPy provides sparse matrix structures (CSR, CSC), which are much more memory-efficient when most values are zero. Each non-zero value in a CSR or CSC sparse matrix consumes, on average:

• a value (64-bit float)
• its column (CSR) or row (CSC) index (32-bit integer)
• an amortized share of the row/column pointer array (32-bit integers)
• Math operations on dense data structures, by contrast, can leverage vectorized operations and multithreading in BLAS, and usually result in fewer CPU cache misses.

• As a general rule: if the data sparsity ratio is >90%, consider using a sparse format.
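• A minimal sketch of applying this rule of thumb (the data here is synthetic and illustrative):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.RandomState(0)
X = rng.rand(1000, 500)
X[X < 0.95] = 0.0  # make roughly 95% of the entries zero

sparsity = 1.0 - np.count_nonzero(X) / X.size
if sparsity > 0.9:  # rule of thumb from above
    X = sp.csr_matrix(X)

print(type(X), f"sparsity={sparsity:.2%}")
```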

### Latency vs Feature Extraction

• Most scikit-learn models use compiled Cython extensions or optimized computing libraries. However, the feature extraction process (i.e. turning raw data like database rows or network packets into numpy arrays) dominates the overall prediction time in most real-world applications.

• Example: data preparation for the Reuters text classification task (reading and parsing SGML files, tokenizing the text, and hashing it into a common vector space) takes 100-500x more time than the prediction code itself.
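• A minimal sketch of timing extraction vs prediction (the corpus, HashingVectorizer, and SGDClassifier are illustrative stand-ins for the Reuters pipeline):

```python
import time

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

docs = ["the quick brown fox jumps over the lazy dog"] * 10_000
vec = HashingVectorizer()

t0 = time.perf_counter()
X = vec.transform(docs)  # feature extraction
extract = time.perf_counter() - t0

clf = SGDClassifier().fit(X, [0, 1] * 5_000)
t0 = time.perf_counter()
clf.predict(X)  # prediction proper
predict = time.perf_counter() - t0

print(f"extract: {extract:.3f}s  predict: {predict:.3f}s")
```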

### Example: Model Complexity

• Compare SGDClassifier (stochastic gradient descent) vs NuSVR (Nu-support-vector regression) vs GradientBoostingRegressor (additive model, built iteratively).
• Regression: uses the diabetes toy dataset.
• Classification: uses the 20newsgroups text dataset.
• Benchmark influence: vary a complexity-controlling parameter for each estimator; collect prediction time, predictive performance, and model complexity (see the sketch after this list).
• Complexity is calculated with complexity_computer.
• SGDClassifier: relaxing the L1 penalty reduces the prediction error but leads to an increase in the training time.

• NuSVR: training time increases with the number of support vectors, and there is an optimal number of support vectors that minimizes the prediction error. Too few support vectors lead to an underfitted model, while too many lead to an overfitted one.

• Gradient boosting: the same conclusions hold, although having too many trees in the ensemble is not as detrimental.
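• A minimal sketch of the benchmark loop for one of the estimators (NuSVR on diabetes; the parameter grid is an illustrative assumption, with the number of support vectors playing the role of complexity_computer):

```python
import time

from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import NuSVR

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for nu in (0.1, 0.3, 0.5, 0.9):
    model = NuSVR(nu=nu).fit(X_tr, y_tr)
    t0 = time.perf_counter()
    pred = model.predict(X_te)
    latency = time.perf_counter() - t0
    complexity = model.support_vectors_.shape[0]  # number of support vectors
    print(f"nu={nu}  n_sv={complexity}  latency={latency:.5f}s  "
          f"MSE={mean_squared_error(y_te, pred):.1f}")
```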

### Tips - Linear Algebra

• Make sure your version of NumPy is built using an optimized version of BLAS / LAPACK.

• Not all models benefit from BLAS and LAPACK; for example, randomized decision trees and kernel SVMs do not rely on BLAS calls in their inner loops. However, a linear model implemented with a BLAS DGEMM call (via numpy.dot) benefits hugely.

• Optimized BLAS/LAPACK implementations:

• ATLAS (requires hardware-specific tuning)
• OpenBLAS
• MKL
• Apple Accelerate & vecLib (macOS)
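• To check which BLAS/LAPACK your NumPy build is linked against:

```python
import numpy as np

np.show_config()  # prints the BLAS/LAPACK libraries NumPy was built with
```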

### Tips - Working Memory Limits

• Some vectorized NumPy operations consume large amounts of temporary memory. Consider limiting computation to fixed-size chunks where possible.

• The working memory size can be set via set_config or config_context.

```python
import sklearn

with sklearn.config_context(working_memory=1024):
    ...  # chunked computation here, with working memory capped at ~1024 MiB
```
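• One built-in consumer of this setting is sklearn.metrics.pairwise_distances_chunked; a minimal sketch (the data and the reduction are illustrative):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_chunked

rng = np.random.RandomState(0)
X = rng.rand(5_000, 50)

def reduce_func(D_chunk, start):
    # Per-row minimum over each chunk of the pairwise distance matrix
    # (includes the zero self-distance; kept simple for the sketch).
    return D_chunk.min(axis=1)

# Each chunk is sized to fit in roughly 64 MiB of working memory.
mins = np.concatenate(
    list(pairwise_distances_chunked(X, reduce_func=reduce_func, working_memory=64))
)
```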

### Tips - Model Compression

• In scikit-learn, model compression currently only concerns linear models.
• Whenever possible, combine model sparsity with sparse input data formats, e.g. clf.fit(X_train, y_train).sparsify(), which converts the fitted coefficients to a sparse representation.
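• A minimal end-to-end sketch (the L1-heavy elasticnet penalty and the synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 100)
y_train = rng.randint(2, size=1000)

# An L1-dominated penalty zeroes out many coefficients...
clf = SGDClassifier(penalty="elasticnet", l1_ratio=0.9).fit(X_train, y_train)
clf.sparsify()  # ...so coef_ can be stored as a scipy.sparse matrix

print(clf.predict(X_train[:5]))
```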

### Tips - Model Reshaping

• Model reshaping means selecting only a portion of the available features to fit a model. If a model discards features during the learning phase, we can strip them from the input.

• It reduces memory (and therefore time) overhead of the model itself. It also allows us to discard explicit feature selection components in a pipeline once we know which features to keep from a previous run. Finally, it can reduce processing time and I/O usage upstream in the data access and feature extraction layers by not collecting features that are discarded by the model.

• If the raw data comes from a database, this makes it possible to write simpler and faster queries, or to reduce I/O usage by making the queries return lighter records.

• Currently, reshaping must be performed manually in scikit-learn. In the case of sparse input (particularly in CSR format), it is generally sufficient to not generate the discarded features, leaving their columns empty.
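• A minimal sketch of manual reshaping (Lasso and the synthetic data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(500, 200)
y = X[:, :5] @ rng.rand(5) + 0.01 * rng.randn(500)  # only 5 informative features

model = Lasso(alpha=0.01).fit(X, y)
kept = np.flatnonzero(model.coef_)  # the features the model actually uses

# Downstream, only collect/extract the kept features:
X_small = X[:, kept]
small_model = Lasso(alpha=0.01).fit(X_small, y)
```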