DictVectorizer converts feature arrays (lists of standard Python dict
objects) to the NumPy/SciPy representation used by scikit-learn estimators.
It implements one-of-K (aka "one-hot") coding for categorical features. Categorical features are unordered attribute:value pairs.
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])

>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
>>> movie_entry = [{'category': ['thriller', 'drama'], 'year': 2003},
...                {'category': ['animation', 'family'], 'year': 2011},
...                {'year': 1974}]

>>> vec.fit_transform(movie_entry).toarray()
array([[0.000e+00, 1.000e+00, 0.000e+00, 1.000e+00, 2.003e+03],
       [1.000e+00, 0.000e+00, 1.000e+00, 0.000e+00, 2.011e+03],
       [0.000e+00, 0.000e+00, 0.000e+00, 0.000e+00, 1.974e+03]])

>>> vec.get_feature_names() == ['category=animation', 'category=drama',
...                             'category=family', 'category=thriller', 'year']
True

>>> vec.transform({'category': ['thriller'],
...                'unseen_feature': '3'}).toarray()
array([[0., 0., 0., 1., 0.]])
Suppose we have an algorithm that extracts Part of Speech (PoS) tags to use for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’.
Such a description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier. Extracting this kind of window around each individual word of a corpus of documents yields a very wide matrix (many one-hot features) with mostly zero values, which is why DictVectorizer uses a scipy.sparse matrix by default.
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> print(pos_vectorized)
  (0, 0)    1.0
  (0, 1)    1.0
  (0, 2)    1.0
  (0, 3)    1.0
  (0, 4)    1.0
  (0, 5)    1.0
>>> pos_vectorized.toarray()
array([[1., 1., 1., 1., 1., 1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”.
Instead of building a hash table of features during training, as vectorizers do, instances of FeatureHasher apply a hash function to the features to directly determine their column index in sample matrices.
The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.
Since the hash function can cause collisions between (unrelated) features, a signed hash function is used: the sign of the hash value determines the sign of the value stored in the output matrix for a feature.
This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature's value is zero.
This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table sizes (n_features < 10000). For large hash table sizes it can be disabled, allowing the output to be passed to estimators like MultinomialNB or chi2 feature selectors that expect non-negative inputs.
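To illustrate the idea, here is a toy sketch of signed hashing. This is not FeatureHasher's actual implementation: it uses Python's built-in hash in place of MurmurHash3, purely to show how one hash value can supply both a column index and a sign:

import numpy as np

def hashing_trick(features, n_features=16):
    """Toy signed feature hashing over (feature, value) pairs."""
    x = np.zeros(n_features)
    for feat, value in features:
        # Python's str hash is salted per process, so the column layout
        # is not stable across runs; real code needs a stable hash.
        h = hash(feat)
        index = abs(h) % n_features   # column index from the hash
        sign = 1 if h >= 0 else -1    # sign from the hash's sign bit
        x[index] += sign * value      # colliding features tend to cancel
    return x

x = hashing_trick([('feat1', 1), ('feat2', 2), ('feat3', 0.5)])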
FeatureHasher accepts mappings (like Python's dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)].
If a single feature occurs multiple times in a sample, the associated values are summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is a scipy.sparse matrix in the CSR format.
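As a sketch of the three input types (the feature names and the n_features value here are arbitrary choices):

from sklearn.feature_extraction import FeatureHasher

# input_type='dict' (the default): iterable of mappings
h = FeatureHasher(n_features=16)
X = h.transform([{'dog': 1, 'cat': 2}, {'dog': 2, 'run': 5}])

# input_type='pair': iterable of iterables of (feature, value) pairs
h = FeatureHasher(n_features=16, input_type='pair')
X = h.transform([[('dog', 1), ('cat', 2)], [('dog', 2)]])

# input_type='string': each string is an implicit (feature, 1) pair;
# 'feat2' appearing twice sums to 2 in its hashed column
h = FeatureHasher(n_features=16, input_type='string')
X = h.transform([['feat1', 'feat2', 'feat2'], ['feat3']])  # CSR matrix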
Feature hashing can be used in document classification. Unlike CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see below for a combined tokenizer/hasher.
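A combined tokenizer/hasher might look like the following sketch, where the regular-expression tokenizer is a deliberately simplified stand-in for real preprocessing:

import re
from sklearn.feature_extraction import FeatureHasher

def tokens(doc):
    """Extract lowercased word tokens from a document (toy tokenizer)."""
    return (tok.lower() for tok in re.findall(r"\w+", doc))

raw_documents = ["The cat sat on the mat.", "The dog barked."]

hasher = FeatureHasher(input_type='string')
X = hasher.transform(tokens(doc) for doc in raw_documents)  # CSR matrix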
FeatureHasher uses the signed 32-bit variant of MurmurHash3. As a result, the maximum number of features supported is currently $2^{31} - 1$.
The original formulation of the hashing trick used two separate hash functions $h$ and $\xi$ to determine the column index and the sign of a feature, respectively. The present implementation works under the assumption that the sign bit of MurmurHash3 is independent of its other bits.
Since a simple modulo is used to transform the hash value to a column index, it is advisable to use a power of two as the n_features parameter; otherwise the features will not be mapped evenly to the columns.
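For example (2**18 is an arbitrary illustrative size):

from sklearn.feature_extraction import FeatureHasher

# a power of two means hash(feature) % n_features keeps the low-order
# hash bits, spreading features evenly across the columns
hasher = FeatureHasher(n_features=2**18, input_type='string')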