Utilities

Utility module for the search engine

Contains various functions to do common operations on strings and iterables, such as normalization, tokenization, averages, splitting and walking through the elements of the iterable/string.

search.utils.average(values)[source]

Return the arithmetic average of the values.

search.utils.best_partial_ratio(query, string)[source]

Best partial ratio between query and string. The string is walked with the shifter function, and each segment’s length equals the query length.
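The sliding comparison described above can be sketched as follows. This is a minimal illustration, not the library's code: it assumes `difflib.SequenceMatcher` as the similarity measure (the reference does not say which one is used), and the `shifter` defined here is a local stand-in for `search.utils.shifter`.

```python
from difflib import SequenceMatcher


def shifter(string, chunk_size):
    """Stand-in for search.utils.shifter: yield each window of chunk_size."""
    # Assumption: a string shorter than the window is yielded once, whole.
    last_start = max(len(string) - chunk_size, 0)
    for start in range(last_start + 1):
        yield string[start:start + chunk_size]


def best_partial_ratio(query, string):
    """Return the best similarity between query and any query-sized window."""
    if not query or not string:
        return 0.0  # assumed convention for empty input
    return max(
        # SequenceMatcher is an assumed scorer, chosen only for illustration.
        SequenceMatcher(None, query, segment).ratio()
        for segment in shifter(string, len(query))
    )
```

Because every window has the same length as the query, an exact occurrence of the query anywhere in the string scores 1.0 under this sketch.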

search.utils.generate_weights(iterable)[source]

Generate normalized weights for the iterable elements in reversed order.

search.utils.max_distance(sequence, idx)[source]

Given a list and an int index, determine the maximum number of moves available from that index position in the list.

Parameters:
  • sequence (any with __len__) – sequence in which to calculate the maximum moves. Only its length is used, so any object that implements the __len__ method will do.
  • idx (int) – index to count from. The value must be >= 0 but can exceed the length of the list.
Returns:

maximum moves available in any direction.

Return type:

int

Example

>>> utils.max_distance(['a', 'b', 'c', 'd', 'e'], 1)
4  # index 1 -> 'b', so max is 4 moves right
>>> utils.max_distance(['a', 'b', 'c', 'd', 'e'], 6)
5  # can move left 5

search.utils.normalize(iterable)[source]

Normalize an iterable of numbers into a series that sums to 1.

Parameters: iterable (numbers) – an iterable containing numbers (either int or float) to normalize.
Returns: normalized float values, in the same order as the ones provided.
Return type: list

Example

>>> normalize([3, 2, 1])
[0.5, 0.333333, 0.166667]
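A minimal sketch of the normalization, assuming each value is simply divided by the total. The zero-sum check is an assumption added for illustration; the reference does not describe that case.

```python
def normalize(iterable):
    """Scale the numbers so that they sum to 1, keeping their order."""
    values = list(iterable)  # materialize so generators can be read twice
    total = sum(values)
    if total == 0:
        # Assumed behaviour: the reference does not cover a zero total.
        raise ValueError("cannot normalize values that sum to zero")
    return [value / total for value in values]
```

With `[3, 2, 1]` the total is 6, so the results are 3/6, 2/6 and 1/6, which round to the values shown in the example.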

search.utils.ratio(query, string)[source]

Simple ratio between the two whole strings. Values range from 0 to 1. Should be good for lengths == 1.

search.utils.scale_to_one(iterable)[source]

Scale an iterable of numbers proportionally so that the highest number equals 1.

Example

>>> scale_to_one([5, 4, 3, 2, 1])
[1, 0.8, 0.6, 0.4, 0.2]
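A minimal sketch of the scaling, assuming every value is divided by the maximum. Empty input and a maximum of zero are not handled here, since the reference does not say how they should behave.

```python
def scale_to_one(iterable):
    """Divide every number by the largest one so the maximum becomes 1."""
    values = list(iterable)  # materialize so max() and the loop both see it
    highest = max(values)    # raises ValueError on empty input (unhandled)
    return [value / highest for value in values]
```

Unlike normalization, the results here do not sum to 1; only the largest element is pinned to 1 and the others keep their proportion to it.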

search.utils.shifter(string, chunk_size)[source]

Generator function that slides through a string and yields strings of (max) length chunk_size, sliding one character at a time.

Parameters:
  • string (str) – the string to walk
  • chunk_size (int) – the size of each chunk
Yields:

one chunk of string of size chunk_size on each call, such as

>>> shifter('hello, world', 3)
'hel', 'ell', 'llo', ...
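The sliding window can be sketched as a short generator. This is an illustration rather than the library's implementation; in particular, how the real function treats the end of the string (the "(max)" in the description) is not documented, so the sketch stops at the last full-size window and yields a too-short string once, whole.

```python
def shifter(string, chunk_size):
    """Yield overlapping windows of chunk_size, advancing one character."""
    # Assumption: stop at the last full window; short input is yielded whole.
    last_start = max(len(string) - chunk_size, 0)
    for start in range(last_start + 1):
        yield string[start:start + chunk_size]


windows = list(shifter('hello, world', 3))
```

Under these assumptions a string of length n produces n - chunk_size + 1 windows, so the 12-character example yields 10.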

search.utils.sorted_intersect(query_tokens, string_tokens)[source]

Return the sorted intersection and remainders of the two iterables.

Parameters:
  • query_tokens (iterable) – first elements group to intersect
  • string_tokens (iterable) – second elements group to intersect
Returns:

  • sorted intersected list - all values common to both iterables
  • sorted remainders for query_tokens
  • sorted remainders for string_tokens

Return type:

tuple of lists

Example

>>> sorted_intersect([1, 3, 2, 4], [3, 4, 5])
[3, 4], [1, 2], [5]
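One way to produce the three lists is with set operations, sketched below. It assumes duplicates within either input collapse to a single entry, which the reference does not confirm.

```python
def sorted_intersect(query_tokens, string_tokens):
    """Return (common, query-only, string-only) as three sorted lists."""
    # Assumption: inputs are treated as sets, so duplicates are dropped.
    query_set, string_set = set(query_tokens), set(string_tokens)
    common = sorted(query_set & string_set)       # in both iterables
    query_rest = sorted(query_set - string_set)   # only in query_tokens
    string_rest = sorted(string_set - query_set)  # only in string_tokens
    return common, query_rest, string_rest
```

The three lists are disjoint, and together they cover every distinct element of the two inputs.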

search.utils.sorted_unique_tokens(string, regexp='\W+', min_len=3)[source]

Return a sorted list of unique tokens contained in the string

search.utils.splitter(string, chunk_size)[source]

Generator function that returns chunks of string of size chunk_size. If chunk_size is -1 the whole string is returned.

Parameters:
  • string (str) – the string to get the chunks from
  • chunk_size (int) – max length of the chunks (last one can be shorter)
Yields:

one chunk of string of size chunk_size on each call, such as

>>> splitter('hello, world', 3)
'hel', 'lo,', ' wo', 'rld'
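The chunking, including the -1 special case, can be sketched as a generator that steps through the string in strides of chunk_size. This is an illustration under that assumption, not the library's code; other non-positive sizes are left unhandled.

```python
def splitter(string, chunk_size):
    """Yield consecutive, non-overlapping chunks of at most chunk_size."""
    if chunk_size == -1:
        yield string  # documented special case: the whole string at once
        return
    for start in range(0, len(string), chunk_size):
        # The final slice is shorter when the length is not a multiple.
        yield string[start:start + chunk_size]
```

In contrast to shifter, consecutive chunks share no characters, so joining them reproduces the original string.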

search.utils.stringify_tokens(tokens)[source]

Given a list or set of tokens, return them as a single string

search.utils.tokenize(string, regexp='\W+', min_len=3)[source]

Given a string, return a list of its segments, split with config.STR_SPLIT_REGEX, removing every word shorter than config.MIN_WORD_LENGTH.

Parameters:
  • string (str) – string to tokenize
  • regexp (regexp) – regular expression to split the string with
  • min_len (int) – minimum word length
Returns:

list - all the tokens extracted from the string in the same order they were found.

Example

>>> utils.tokenize('hello there, how are you')
['hello', 'there']
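A minimal sketch of the tokenization, assuming `re.split` does the splitting and that "word < min length" means strictly shorter than `min_len`. Both points are assumptions. Under them, the example output above corresponds to a minimum length of 4, because the three-letter words "how", "are" and "you" survive a minimum of 3; which value the library's configuration actually uses is not stated here.

```python
import re


def tokenize(string, regexp=r"\W+", min_len=3):
    """Split string on regexp and keep words of at least min_len characters."""
    # Assumption: words strictly shorter than min_len are dropped.
    # Empty strings produced by re.split are dropped by the same check.
    return [word for word in re.split(regexp, string) if len(word) >= min_len]


kept_three = tokenize('hello there, how are you')
kept_four = tokenize('hello there, how are you', min_len=4)
```

Passing the result to `set(...)` gives the unique-token variant of the same sketch.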

search.utils.tokenize_set(string, regexp='\W+', min_len=3)[source]

Given a string, return a set of unique segments of the string (single words), split with config.STR_SPLIT_REGEX, removing every word shorter than config.MIN_WORD_LENGTH.

The difference from tokenize is that it returns a set instead of a list, so the values are unique.

Parameters:
  • string (str) – string to tokenize
  • regexp (regexp) – regular expression to split the string with
  • min_len (int) – minimum word length
Returns:

set - all the tokens extracted from string in a set

Example

>>> utils.tokenize('hello there, there are some fishes there!')
['hello', 'there']

search.utils.weighted_average(values, weights=None)[source]

Calculate the weighted mean of an iterable of values using an iterable of matching weights. If weights is None, they are generated automatically.

Parameters:
  • values (iterable) – Values to average, either int or float
  • weights (iterable) – Matching weights iterable
Returns:

weighted average

Return type:

float
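The weighted mean can be sketched as below. The default-weight scheme is an assumption: this `generate_weights` stand-in assigns ranks n, n-1, ..., 1 scaled to sum to 1, which is one reading of "normalized weights in reversed order" and may differ from what the library does. The length and emptiness checks are also added for illustration only.

```python
def generate_weights(iterable):
    """Assumed scheme: descending ranks n..1, scaled so they sum to 1."""
    count = len(list(iterable))
    total = count * (count + 1) / 2  # sum of 1..n
    return [rank / total for rank in range(count, 0, -1)]


def weighted_average(values, weights=None):
    """Return sum(value * weight) / sum(weights)."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")  # assumed behaviour
    weights = generate_weights(values) if weights is None else list(weights)
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    weighted_sum = sum(value * weight for value, weight in zip(values, weights))
    return weighted_sum / sum(weights)
```

Dividing by the sum of the weights means explicit weights do not need to be normalized beforehand; `weighted_average([10, 20], [3, 1])` gives (30 + 20) / 4.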