feature_encoder.py module

ITEM_FEATURE_KEY = 'item'

Feature name prefix for features derived from candidates / items, e.g.:

item == 1 -> feature name is “item”
item == [1] -> feature name is “item.0”
item == {“a”: 1} -> feature name is “item.a”

CONTEXT_FEATURE_KEY = 'context'

Feature name prefix for features derived from context, e.g.:

context == 1 -> feature name is “context”
context == [1] -> feature name is “context.0”
context == {“a”: 1} -> feature name is “context.a”
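
The naming examples above can be reproduced with a small recursive walk. This is a minimal sketch, not the module's own code: the helper name flatten_names is hypothetical, and the sketch covers only how names are built (missing values, which the encoder ignores, are not filtered here).

```python
def flatten_names(obj, prefix):
    """Return flattened feature names for a JSON-like object (hypothetical helper)."""
    if isinstance(obj, dict):
        names = []
        for key, value in obj.items():
            # Dict keys are appended to the path: {"a": 1} -> "<prefix>.a"
            names.extend(flatten_names(value, f"{prefix}.{key}"))
        return names
    if isinstance(obj, (list, tuple)):
        names = []
        for index, value in enumerate(obj):
            # List positions are appended to the path: [1] -> "<prefix>.0"
            names.extend(flatten_names(value, f"{prefix}.{index}"))
        return names
    # Scalars keep the current path as their feature name.
    return [prefix]


print(flatten_names(1, "item"))
print(flatten_names([1], "item"))
print(flatten_names({"a": 1}, "context"))
```

Nested structures simply extend the path one segment per level, so a value at obj["a"][1]["b"] under the item prefix would be named “item.a.1.b” in this sketch.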

class FeatureEncoder

Bases: object

Encodes JSON encodable objects into float vectors

__init__(feature_names, string_tables, model_seed)

Initialize the feature encoder for this model

Parameters:

feature_names (list) – the list of feature names. Order matters: the first feature name should correspond to the first feature in the model
string_tables (dict) – a mapping from feature names to string hash tables
model_seed (int) – model seed to be used during string encoding

Raises:

ValueError – if feature names or tables are corrupt

property feature_indexes: dict

A mapping from feature names to feature indexes, created by enumerating the feature names in order.

Returns: a mapping from string feature names to feature indexes
Return type: dict
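
As an illustration of the enumeration described for feature_indexes, the mapping can be built with a dict comprehension. The feature names below are made up for the example.

```python
# Hypothetical feature names, in model order.
names = ["item", "item.a", "context.b"]

# Each name maps to its position in the list.
feature_indexes = {name: index for index, name in enumerate(names)}

print(feature_indexes)
```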

property string_tables: list

List of StringTable objects. The order of elements follows the constructor’s string_tables parameter.

Returns: list of StringTable objects
Return type: list

_check_into(into)

Checks that the provided into argument is a numpy array with the np.float64 dtype

Parameters: into (np.ndarray) – array which will store feature values
Raises: ValueError – if into is not a numpy array or is not of float64 dtype

encode_item(item, into, noise_shift=0.0, noise_scale=1.0)

Encodes the provided item into the given numpy array

Parameters:

item (object) – JSON encodable python object
into (np.ndarray) – array storing results of encoding
noise_shift (float) – value to be added to values of features
noise_scale (float) – multiplier used to scale shifted feature values

encode_context(context, into, noise_shift=0.0, noise_scale=1.0)

Encodes the provided context into the given numpy array

Parameters:

context (object) – JSON encodable python object
into (np.ndarray) – array storing results of encoding
noise_shift (float) – value to be added to values of features
noise_scale (float) – multiplier used to scale shifted feature values

encode_feature_vector(item, context, into, noise=0.0)

Fully encodes the provided item and context into the np.ndarray passed as the into parameter. into must not be None.

Parameters:

item (object) – a JSON encodable item to be encoded
context (object) – a JSON encodable context to be encoded
into (np.ndarray) – an array into which feature values will be added
noise (float) – value in [0, 1) which will be combined with the feature value

_encode(obj, path, into, noise_shift=0.0, noise_scale=1.0)

Encodes a JSON serializable object to a float vector. The encoding rules are as follows:

None, json null, {}, [], and nan are treated as missing features and ignored.
numbers and booleans are encoded as-is.
strings are encoded using a lookup table

Parameters:

obj (object) – a JSON serializable object to be encoded to a flat key-value structure
path (str) – the JSON-normalized path to the current object
into (np.ndarray) – an array into which feature values will be encoded
noise_shift (float) – small bias added to the feature value
noise_scale (float) – small multiplier of the feature value
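
The encoding rules listed for _encode can be sketched for a single leaf value as follows. This is an illustration under assumptions, not the module's implementation: is_missing and encode_leaf are hypothetical helpers, lookup is a hypothetical callable standing in for the string lookup table, and noise handling and the recursion over containers are left out.

```python
import math


def is_missing(value):
    """True for values the rules treat as missing: None, {}, [], and NaN."""
    if value is None:
        return True
    if isinstance(value, (dict, list)) and len(value) == 0:
        return True
    return isinstance(value, float) and math.isnan(value)


def encode_leaf(value, lookup):
    """Encode one leaf value to a float, or return None when it is missing.

    `lookup` is a hypothetical callable standing in for the string lookup table.
    Only leaf values (None, numbers, booleans, strings) are expected here.
    """
    if is_missing(value):
        return None
    if isinstance(value, str):
        return float(lookup(value))
    # bool is a subclass of int, so True/False become 1.0/0.0.
    return float(value)


# len is used as a stand-in lookup purely for demonstration.
print(encode_leaf(True, len))
print(encode_leaf(2.5, len))
print(encode_leaf("abc", len))
print(encode_leaf(float("nan"), len))
```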

get_noise_shift_scale(noise)

Returns noise shift (small value added to feature value) and noise scale (value by which shifted feature value is multiplied)

Parameters: noise (float) – value in [0, 1) which will be combined with the feature value
Returns: tuple of floats: (noise_shift, noise_scale)
Return type: tuple

sprinkle(x, noise_shift, noise_scale)

Adds noise shift and scales shifted value

Parameters:

x (float) – value to be sprinkled
noise_shift (float) – small bias added to the feature value
noise_scale (float) – small multiplier of the shifted feature value

Returns:

sprinkled value

Return type:

float
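
Read literally, the description of sprinkle (add the shift, then scale the shifted value) corresponds to the arithmetic below. This is a sketch of that reading, written as a standalone function rather than the method itself.

```python
def sprinkle(x, noise_shift, noise_scale):
    # Add the shift first, then scale the shifted value.
    return (x + noise_shift) * noise_scale


# With the default-like values shift=0.0 and scale=1.0 the input is unchanged.
print(sprinkle(1.0, 0.0, 1.0))
print(sprinkle(2.0, 0.5, 2.0))
```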

class StringTable

Bases: object

A class responsible for target encoding of strings for a given feature

__init__(string_table, model_seed)

Initializes the StringTable with the given parameters

Parameters:

string_table (list) – a list of masked hashed strings for each string feature
model_seed (int) – model seed value

property model_seed: int

32-bit random integer used to hash strings with xxhash

Returns: model seed
Return type: int

property mask: int

Integer representation (at most 64 bits) of the string hash mask, e.g. 000..00111

Returns: mask used to ‘decrease’ the hashed string value
Return type: int
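
To illustrate what a mask such as 000..00111 does to a hash, the sketch below assumes the mask is applied with a bitwise AND, which keeps only the low bits. Both the hash value and the mask are made up for the example; how the module actually applies the mask is not stated here.

```python
mask = 0b111  # hypothetical mask keeping the 3 lowest bits
string_hash = 0b10110110  # hypothetical 8-bit hash, for illustration only

# Assumption: masking is a bitwise AND, which zeroes every bit outside the mask.
masked_hash = string_hash & mask

print(bin(masked_hash))
```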

property miss_width: float

Float value representing snap / width of the ‘miss interval’ - numeric interval into which all missing / unknown values are encoded. It is 0-centered.

Returns: miss width value
Return type: float

property value_table: dict

A mapping from masked string hash to target encoding’s target value for a given feature

Returns: a dict with target value encoding
Return type: dict

encode(string)

Encode input string to a target value

Parameters: string (str) – string to be encoded
Returns: encoded value
Return type: float

encode_miss(string_hash)

Encodes string hash as a miss

Parameters: string_hash (int) – string hash to be encoded as a miss
Returns: encoded miss value
Return type: float

scale(val, width=2)

Scales input miss value to [-width/2, width/2]. Assumes input is within [0, 1] range.

Parameters:

val (float) – miss value to be scaled
width (float) – miss range width

Returns:

scaled miss value

Return type:

float
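
One mapping consistent with the description of scale is a linear one: center the [0, 1] input on zero, then stretch it to the requested width. The sketch below assumes that linear form; the module may compute it differently.

```python
def scale(val, width=2):
    # Center the [0, 1] input on zero, then stretch it to the requested width.
    return (val - 0.5) * width


print(scale(0.0))   # lower end of the interval
print(scale(0.5))   # midpoint
print(scale(1.0))   # upper end of the interval
```

With the default width of 2, inputs 0 and 1 land on the two ends of [-1, 1] and 0.5 lands on the center.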

get_mask(string_table)

Returns an integer representation of a binary mask for a given string table

Parameters: string_table (list) – list of string hash values for a given feature
Returns: integer representation of the binary mask for the string hashes in the table
Return type: int