feature_encoder.py module

ITEM_FEATURE_KEY = 'item'

Feature name prefix for features derived from candidates / items, e.g.:

  • item == 1 -> feature name is “item”

  • item == [1] -> feature name is “item.0”

  • item == {“a”: 1} -> feature name is “item.a”

CONTEXT_FEATURE_KEY = 'context'

Feature name prefix for features derived from context, e.g.:

  • context == 1 -> feature name is “context”

  • context == [1] -> feature name is “context.0”

  • context == {“a”: 1} -> feature name is “context.a”
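The naming rules for both prefixes can be illustrated with a short sketch. The helper name feature_names_sketch is hypothetical and is not part of this module, and the handling of nested containers (for example a list inside a dict) is an assumption extrapolated from the three cases listed above.

```python
ITEM_FEATURE_KEY = "item"
CONTEXT_FEATURE_KEY = "context"


def feature_names_sketch(obj, prefix):
    """Yield one feature name per leaf value of a JSON-like object.

    Hypothetical helper: it only mirrors the documented naming examples.
    """
    if isinstance(obj, dict):
        for key, value in obj.items():
            # Dict keys are appended to the path: {"a": 1} -> "<prefix>.a"
            yield from feature_names_sketch(value, f"{prefix}.{key}")
    elif isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):
            # List positions are appended to the path: [1] -> "<prefix>.0"
            yield from feature_names_sketch(value, f"{prefix}.{index}")
    else:
        # A scalar keeps the path accumulated so far.
        yield prefix


names_scalar = list(feature_names_sketch(1, ITEM_FEATURE_KEY))
names_list = list(feature_names_sketch([1], ITEM_FEATURE_KEY))
names_dict = list(feature_names_sketch({"a": 1}, CONTEXT_FEATURE_KEY))
# Assumption: nested containers simply extend the dotted path.
names_nested = list(feature_names_sketch({"a": [1, 2]}, ITEM_FEATURE_KEY))
```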

class FeatureEncoder

Bases: object

Encodes JSON encodable objects into float vectors

__init__(feature_names, string_tables, model_seed)

Initialize the feature encoder for this model

Parameters:
  • feature_names (list) – the list of feature names. Order matters: the first feature name must correspond to the first feature in the model

  • string_tables (dict) – a mapping from feature names to string hash tables

  • model_seed (int) – model seed to be used during string encoding

Raises:

ValueError if feature names or tables are corrupt

property feature_indexes: dict

A mapping from feature names to feature indexes, built by enumerating the feature names in order.

Returns:

a mapping from string feature names to feature indexes

Return type:

dict
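A minimal sketch of how such a mapping can be built by enumeration. The feature names below are made up for illustration.

```python
# Hypothetical feature names, in model order.
feature_names = ["item.a", "item.b", "context"]

# Position in the list becomes the feature index.
feature_indexes = {name: index for index, name in enumerate(feature_names)}
```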

property string_tables: list

List of StringTable objects. The order of elements follows the constructor’s string_tables parameter.

Returns:

list of StringTable objects

Return type:

list

_check_into(into)

Checks that the provided into argument is a numpy array with the required np.float64 dtype

Parameters:

into (np.ndarray) – array which will store feature values

Raises:

ValueError if into is not a numpy array or its dtype is not float64

encode_item(item, into, noise_shift=0.0, noise_scale=1.0)

Encodes the provided item into the given numpy array

Parameters:
  • item (object) – JSON encodable python object

  • into (np.ndarray) – array storing results of encoding

  • noise_shift (float) – value to be added to values of features

  • noise_scale (float) – multiplier used to scale shifted feature values

encode_context(context, into, noise_shift=0.0, noise_scale=1.0)

Encodes the provided context into the given numpy array

Parameters:
  • context (object) – JSON encodable python object

  • into (np.ndarray) – array storing results of encoding

  • noise_shift (float) – value to be added to values of features

  • noise_scale (float) – multiplier used to scale shifted feature values

encode_feature_vector(item, context, into, noise=0.0)

Fully encodes the provided item and context into the np.ndarray passed as the into parameter. into must not be None.

Parameters:
  • item (object) – a JSON encodable item to be encoded

  • context (object) – a JSON encodable context to be encoded

  • into (np.ndarray) – an array into which feature values will be added

  • noise (float) – value in [0, 1) which will be combined with the feature value

_encode(obj, path, into, noise_shift=0.0, noise_scale=1.0)

Encodes a JSON serializable object to a float vector. The encoding rules are as follows:

  • None, JSON null, {}, [], and nan are treated as missing features and ignored.

  • numbers and booleans are encoded as-is.

  • strings are encoded using a lookup table.

Parameters:
  • obj (object) – a JSON serializable object to be encoded to a flat key-value structure

  • path (str) – the JSON-normalized path to the current object

  • into (np.ndarray) – an array into which feature values will be encoded

  • noise_shift (float) – small bias added to the feature value

  • noise_scale (float) – small multiplier of the feature value
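The encoding rules above can be sketched as follows. This is a simplified illustration, not the module's implementation: it writes into a plain list instead of an np.ndarray, leaves out the noise parameters, uses a plain dict (string_values) in place of the string tables, skips paths and strings it does not know, and assigns values rather than accumulating them. All of those choices, and the name encode_sketch, are assumptions.

```python
import math


def encode_sketch(obj, path, into, feature_indexes, string_values):
    """Write the leaf values of a JSON-like object into `into` (a list)."""
    if obj is None:
        return  # None / JSON null: missing feature, ignored
    if isinstance(obj, dict):
        for key, value in obj.items():  # {} has no entries, so nothing is written
            encode_sketch(value, f"{path}.{key}", into, feature_indexes, string_values)
        return
    if isinstance(obj, (list, tuple)):
        for index, value in enumerate(obj):  # [] is ignored the same way
            encode_sketch(value, f"{path}.{index}", into, feature_indexes, string_values)
        return
    if isinstance(obj, float) and math.isnan(obj):
        return  # nan: missing feature, ignored
    feature_index = feature_indexes.get(path)
    if feature_index is None:
        return  # assumption: paths unknown to the model are skipped
    if isinstance(obj, str):
        # assumption: a dict lookup stands in for the per-feature string table
        value = string_values.get(path, {}).get(obj)
        if value is not None:
            into[feature_index] = value
    else:
        into[feature_index] = float(obj)  # numbers and booleans as-is


feature_indexes = {"item.price": 0, "item.in_stock": 1, "item.color": 2, "item.tags.0": 3}
string_values = {"item.color": {"red": 0.25}}  # made-up lookup values
into = [math.nan] * 4  # unset slots stay nan in this example

item = {"price": 9.5, "in_stock": True, "color": "red", "tags": [], "note": None}
encode_sketch(item, "item", into, feature_indexes, string_values)
```

In this example the first three slots receive 9.5, 1.0 and 0.25, while the slot for “item.tags.0” is left untouched because the empty list is treated as missing.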

get_noise_shift_scale(noise)

Returns noise shift (small value added to feature value) and noise scale (value by which shifted feature value is multiplied)

Parameters:

noise (float) – value in [0, 1) which will be combined with the feature value

Returns:

tuple of floats: (noise_shift, noise_scale)

Return type:

tuple

sprinkle(x, noise_shift, noise_scale)

Adds the noise shift to x and multiplies the shifted value by the noise scale

Parameters:
  • x (float) – value to be sprinkled

  • noise_shift (float) – small bias added to the feature value

  • noise_scale (float) – small multiplier of the shifted feature value

Returns:

sprinkled value

Return type:

float
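Read literally, the description corresponds to (x + noise_shift) * noise_scale. The sketch below uses a hypothetical function name and illustrative values only. With the defaults used elsewhere in this module (noise_shift=0.0, noise_scale=1.0) the value passes through unchanged.

```python
def sprinkle_sketch(x, noise_shift, noise_scale):
    """Add the shift first, then scale the shifted value."""
    return (x + noise_shift) * noise_scale


unchanged = sprinkle_sketch(3.0, 0.0, 1.0)       # shift 0 and scale 1 leave x as-is
shifted_scaled = sprinkle_sketch(3.0, 0.5, 2.0)  # (3.0 + 0.5) * 2.0
```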

class StringTable

Bases: object

A class responsible for target encoding of strings for a given feature

__init__(string_table, model_seed)

Initializes the StringTable with the provided parameters

Parameters:
  • string_table (list) – a list of masked hashed strings for each string feature

  • model_seed (int) – model seed value

property model_seed: int

32-bit random integer used to hash strings with xxhash

Returns:

model seed

Return type:

int

property mask: int

Integer of at most 64 bits representing the string hash mask, e.g., 000..00111

Returns:

mask applied to reduce the hashed string value

Return type:

int

property miss_width: float

Float value representing the snap / width of the ‘miss interval’: the 0-centered numeric interval into which all missing / unknown values are encoded.

Returns:

miss width value

Return type:

float

property value_table: dict

A mapping from masked string hash to target encoding’s target value for a given feature

Returns:

a dict with target value encoding

Return type:

dict

encode(string)

Encodes the input string to a target value

Parameters:

string (str) – string to be encoded

Returns:

encoded value

Return type:

float
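The pieces documented for StringTable (seeded hash, mask, value_table, miss interval) suggest a lookup flow along the following lines. This is only a sketch under several assumptions: hashlib.blake2b stands in for xxhash so that the example needs no third-party package, the way a miss is spread over the interval is a guess, and every name and value below is hypothetical.

```python
import hashlib


def hash64_sketch(string, seed):
    """Stand-in 64-bit seeded hash (the module itself documents xxhash)."""
    digest = hashlib.blake2b(
        string.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def encode_string_sketch(string, value_table, mask, seed, miss_width):
    string_hash = hash64_sketch(string, seed)
    masked_hash = string_hash & mask
    if masked_hash in value_table:
        # Hit: return the target value stored for this masked hash.
        return value_table[masked_hash]
    # Miss (assumption): map the low 32 bits of the hash to [0, 1),
    # then center the result on 0 within an interval of width miss_width.
    fraction = (string_hash & 0xFFFFFFFF) / 2**32
    return fraction * miss_width - miss_width / 2


seed = 7                # hypothetical model seed
mask = (1 << 40) - 1    # hypothetical mask: 40 low bits set
value_table = {hash64_sketch("red", seed) & mask: 0.75}

hit_value = encode_string_sketch("red", value_table, mask, seed, miss_width=1.0)
miss_value = encode_string_sketch("blue", value_table, mask, seed, miss_width=1.0)
```

A known string returns its stored target value, while an unknown string lands somewhere inside the 0-centered miss interval.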

encode_miss(string_hash)

Encodes string hash as a miss

Parameters:

string_hash (int) – string hash to be encoded as a miss

Returns:

encoded miss value

Return type:

float

scale(val, width=2)

Scales the input miss value to [-width/2, width/2]. Assumes the input is within the [0, 1] range.

Parameters:
  • val (float) – miss value to be scaled

  • width (float) – miss range width

Returns:

scaled miss value

Return type:

float
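One way to map [0, 1] onto [-width/2, width/2] is a linear rescale, sketched below. The function name is hypothetical, and the linear form is an assumption consistent with the stated input and output ranges.

```python
def scale_sketch(val, width=2):
    """Linearly map val from [0, 1] to [-width / 2, width / 2]."""
    return val * width - width / 2


low = scale_sketch(0.0)             # lower end of the default interval
mid = scale_sketch(0.5)             # center of the interval
high = scale_sketch(1.0)            # upper end of the default interval
wide = scale_sketch(0.25, width=4)  # 0.25 * 4 - 2
```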

get_mask(string_table)

Returns an integer representation of a binary mask for a given string table

Parameters:

string_table (list) – list of hashed string values for a given feature

Returns:

integer bit mask wide enough to represent the hashed strings in the table

Return type:

int
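A mask of the form 000..00111 can be derived from the largest value in the table. The sketch below is an assumption about how such a mask might be computed, not the module's code, and get_mask_sketch is a hypothetical name.

```python
def get_mask_sketch(string_table):
    """Return a mask with as many low bits set as the largest entry needs."""
    if not string_table:
        return 0  # assumption: an empty table yields an empty mask
    return (1 << max(string_table).bit_length()) - 1


mask_small = get_mask_sketch([1, 5, 3])  # largest entry 5 needs 3 bits
mask_byte = get_mask_sketch([255])       # 8 bits
mask_empty = get_mask_sketch([])

# Applying the mask keeps only the low bits of a hash.
masked_hash = 0x1234 & mask_small
```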