cythonized_feature_encoder.pyx module

class FeatureEncoder

Bases: object

Encodes JSON encodable objects into float vectors

__init__()

Initialize the feature encoder for this model

Parameters:

feature_names (list) – the list of feature names. Order matters - first feature name should be the first feature in the model
string_tables (dict) – a mapping from feature names to string hash tables
model_seed (int) – model seed to be used during string encoding

Raises:

ValueError – if feature names or tables are corrupt

_encode()

Encodes a JSON serializable object to a float vector. The encoding rules are as follows:

None, JSON null, {}, [], and nan are treated as missing features and ignored.
Numbers and booleans are encoded as-is.
Strings are encoded using a lookup table.

Parameters:

obj (object) – a JSON serializable object to be encoded to a flat key-value structure
path (str) – the path to the current object
into (np.ndarray[double, ndim=1, mode='c']) – an array into which feature values will be encoded
noise_shift (double) – small bias added to the feature value
noise_scale (double) – small multiplier of the feature value
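
The encoding rules above can be illustrated with a minimal pure-Python sketch. It is not the module's Cython implementation: the helper name `flatten_json`, the dot-separated path convention, and the `encode_string` callback standing in for the string lookup table are all assumptions made for illustration, and the sketch returns a dict instead of writing into a NumPy array.

```python
import math


def flatten_json(obj, path="", encode_string=None, out=None):
    """Flatten a JSON-like object into {path: float}, skipping missing values.

    Hypothetical helper: the path format and the encode_string callback are
    illustrative assumptions, not the module's actual behaviour.
    """
    if out is None:
        out = {}
    if obj is None:
        return out  # missing feature, ignored
    # bool is a subclass of int, so booleans are handled here as well.
    if isinstance(obj, (bool, int, float)):
        value = float(obj)
        if not math.isnan(value):  # nan is treated as missing
            out[path] = value
        return out
    if isinstance(obj, str):
        if encode_string is not None:
            out[path] = encode_string(path, obj)
        return out
    if isinstance(obj, dict):
        for key, child in obj.items():  # {} contributes nothing
            child_path = f"{path}.{key}" if path else str(key)
            flatten_json(child, child_path, encode_string, out)
        return out
    if isinstance(obj, (list, tuple)):
        for index, child in enumerate(obj):  # [] contributes nothing
            child_path = f"{path}.{index}" if path else str(index)
            flatten_json(child, child_path, encode_string, out)
        return out
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")
```

Calling `flatten_json({"a": 1, "b": None, "c": {"d": True}})` yields entries only for `a` and `c.d`, since `b` is treated as missing.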

encode_context()

Encodes provided context to input numpy array

Parameters:

context (object) – JSON encodable python object
into (np.ndarray[double, ndim=1, mode='c']) – array storing results of encoding
noise_shift (double) – value to be added to features
noise_scale (double) – multiplier used to scale shifted feature value

encode_feature_vector()

Fully encodes the provided item and context into the np.ndarray passed as the into parameter. into must not be None.

Parameters:

item (object) – a JSON encodable object to be encoded
context (object) – a JSON encodable object to be encoded
into (np.ndarray[double, ndim=1, mode='c']) – an array into which feature values will be added
noise (double) – value in [0, 1) which will be combined with the feature value

encode_item()

Encodes provided item to input numpy array

Parameters:

item (object) – JSON encodable python object
into (np.ndarray[double, ndim=1, mode='c']) – array storing results of encoding
noise_shift (double) – value to be added to features
noise_scale (double) – multiplier used to scale shifted feature value

feature_indexes: A map between feature names and feature indexes. Created by simple iteration with enumeration over feature names

string_tables: List of StringTable objects. The order of elements follows constructor’s string_tables parameter.

class StringTable

Bases: object

A class responsible for target encoding of strings

__init__()

Init StringTable with params

Parameters:

string_table (list) – a list of masked hashed strings for each string feature
model_seed (int) – model seed value

encode()

Encode input string to a target value

Parameters: string (str) – string to be encoded
Returns: encoded value
Return type: double

encode_miss()

Encodes string hash as a miss

Parameters: string_hash (unsigned long) – string hash to be encoded as a miss
Returns: encoded miss value
Return type: double

mask: At most 64 bit int representation of a string hash mask e.g., 000..00111

miss_width: Float value representing snap / width of the ‘miss interval’ - numeric interval into which all missing / unknown values are encoded. It is 0-centered.

model_seed: 32-bit random integer used for string hashing with xxhash

value_table: A mapping from masked string hash to target encoding’s target value for a given feature

encode_candidates_to_matrix()

Encodes list of candidates to 2D np.array for a given context with provided noise

Parameters:

candidates (object) – list or tuple or np.ndarray of JSON encodable candidates / items to encode
context (object) – JSON encodable object
noise (double) – noise to be used for sprinkling of encoded features

Returns:

2D numpy array with encoded candidates

Return type:

np.ndarray[double, ndim=2, mode='c']

get_mask()

Returns an integer representation of binary mask for a given string table

Parameters: string_table (list) – list of hash string values for a given feature
Returns: number of bytes needed to represent string hashes in the table
Return type: unsigned long long

get_noise_shift_scale()

Returns noise shift (small value added to feature value) and noise scale (value by which shifted feature value is multiplied)

Parameters: noise (double) – value in [0, 1) which will be combined with the feature value
Returns: tuple of double: (noise_shift, noise_scale)
Return type: tuple

scale()

Scales input miss value to [-width/2, width/2]. Assumes input is within [0, 1] range.

Parameters:

val (double) – miss value to be scaled
width (double) – miss range width

Returns:

scaled miss value

Return type:

double
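
A linear map is one way to meet the contract described for scale(). The formula below is an assumption that fits the documented input and output ranges, and `scale_sketch` is a hypothetical name, not the module's function.

```python
def scale_sketch(val, width):
    """Map val in [0, 1] linearly onto [-width / 2, width / 2]."""
    if not 0.0 <= val <= 1.0:
        raise ValueError("val must be within [0, 1]")
    # Center the input on 0, then stretch it to the requested width.
    return (val - 0.5) * width
```

Under this assumption the interval ends map to the interval ends: 0 goes to `-width / 2`, 1 goes to `width / 2`, and 0.5 goes to 0.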

sprinkle()

Adds noise shift and scales shifted value

Parameters:

x (double) – value to be sprinkled
noise_shift (double) – small bias added to the feature value
noise_scale (double) – small multiplier of the shifted feature value

Returns:

sprinkled value

Return type:

double
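
Read literally, the description of sprinkle() amounts to a shift followed by a multiplication. A minimal sketch, with `sprinkle_sketch` as a hypothetical name and the input values chosen only for illustration:

```python
def sprinkle_sketch(x, noise_shift, noise_scale):
    """Add the noise shift to x, then scale the shifted value."""
    return (x + noise_shift) * noise_scale
```

The order matters: the shift is applied first, so the scale multiplies the shift as well as the original value.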