Utilities
Utility module for the search engine
Contains various functions to do common operations on strings and iterables, such as normalization, tokenization, averages, splitting and walking through the elements of the iterable/string.
-
search.utils.best_partial_ratio(query, string)
Best partial ratio between query and string. The string is walked with the shifter function, and each segment's length equals the query length.
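The sliding comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the module's implementation: `difflib.SequenceMatcher` stands in for whatever similarity measure the module uses, only full-length windows are compared, and the name `sketch_best_partial_ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def sketch_best_partial_ratio(query, string):
    """Return the best similarity between query and any query-sized window of string."""
    size = len(query)
    # Slide one character at a time; a string shorter than the query is compared once, whole.
    last_start = max(len(string) - size, 0)
    segments = (string[start:start + size] for start in range(last_start + 1))
    # Assumption: SequenceMatcher.ratio() (0.0 to 1.0) stands in for the module's ratio.
    return max(SequenceMatcher(None, query, segment).ratio() for segment in segments)


print(sketch_best_partial_ratio('world', 'hello, world'))  # 1.0
```

Because every window has the same length as the query, an exact substring match scores 1.0 regardless of how long the searched string is.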
-
search.utils.generate_weights(iterable)
Generate normalized weights for the iterable elements in reversed order.
-
search.utils.max_distance(sequence, idx)
Given a list and an int in range(len(sequence)), determine the maximum number of movements available from that index position in the list.
Parameters: - sequence (any with __len__) – iterable to calculate the maximum moves into. It’s used only to get its len, so any object that implements the __len__ method will do.
- idx (int) – index to count from. The value **must** be >= 0 but **can** be over the length of the list.
Returns: maximum moves available in any direction.
Return type:
Example
>>> utils.max_distance(['a', 'b', 'c', 'd', 'e'], 1)
4  # index 1 -> 'b', so max is 4 moves right
>>> utils.max_distance(['a', 'b', 'c', 'd', 'e'], 6)
5  # can move left 5
-
search.utils.normalize(iterable)
Normalize an iterable of numbers into a series that sums to 1.
Parameters: iterable (numbers) – an iterable containing numbers (either int or float) to normalize.
Returns: Normalized float values, in the same order as the one provided.
Return type: list
Example
>>> normalize([3, 2, 1])
[0.5, 0.333333, 0.166667]
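The normalization amounts to dividing each value by the total. The following is a minimal sketch, not the module's code: the name `sketch_normalize` is hypothetical, and raising `ValueError` for a zero sum is an assumption, since the documentation does not say how that case is handled.

```python
def sketch_normalize(iterable):
    """Return the values divided by their sum, in the original order."""
    values = list(iterable)
    total = sum(values)
    if total == 0:
        # Assumption: a zero sum is treated as an error.
        raise ValueError("cannot normalize values that sum to zero")
    return [value / total for value in values]


weights = sketch_normalize([3, 2, 1])
print([round(weight, 6) for weight in weights])  # [0.5, 0.333333, 0.166667]
```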
-
search.utils.ratio(query, string)
Simple ratio between the whole strings. Values range from 0 to 1. Should be good for lengths == 1.
-
search.utils.scale_to_one(iterable)
Scale an iterable of numbers proportionally so that the highest number equals 1.
Example
>>> scale_to_one([5, 4, 3, 2, 1])
[1, 0.8, 0.6, 0.4, 0.2]
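Scaling differs from normalizing only in the divisor: the maximum rather than the sum. A minimal sketch, with the hypothetical name `sketch_scale_to_one` and the assumption that the highest value is positive:

```python
def sketch_scale_to_one(iterable):
    """Return the values divided by the highest value, in the original order."""
    values = list(iterable)
    highest = max(values)
    if highest <= 0:
        # Assumption: only inputs with a positive maximum are meaningful here.
        raise ValueError("the highest value must be positive")
    return [value / highest for value in values]


print(sketch_scale_to_one([5, 4, 3, 2, 1]))  # [1.0, 0.8, 0.6, 0.4, 0.2]
```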
-
search.utils.shifter(string, chunk_size)
Generator function that slides through a string and returns strings of (max) length chunk_size, sliding one character at a time.
Parameters:
Yields: one chunk of string of size chunk_size on each call, such as
>>> shifter('hello, world', 3)
'hel', 'ell', 'llo', ...
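A sliding window of this kind can be sketched with a generator. This is an illustration only: the name `sketch_shifter` is hypothetical, and stopping at the last full-length window is an assumption, since the documentation does not say whether shorter trailing windows are also yielded.

```python
def sketch_shifter(string, chunk_size):
    """Yield overlapping windows of chunk_size characters, one step at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Assumption: stop at the last full window; a shorter string is yielded once, whole.
    last_start = max(len(string) - chunk_size, 0)
    for start in range(last_start + 1):
        yield string[start:start + chunk_size]


print(list(sketch_shifter('hello', 3)))  # ['hel', 'ell', 'llo']
```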
-
search.utils.sorted_intersect(query_tokens, string_tokens)
Return the sorted intersection and remainders of the two iterables.
Parameters: - query_tokens (iterable) – first elements group to intersect
- string_tokens (iterable) – second elements group to intersect
Returns: - sorted intersected list - all values common to both iterables
- sorted remainders for query_tokens
- sorted remainders for string_tokens
Return type: tuple of lists
Example
>>> sorted_intersect([1, 3, 2, 4], [3, 4, 5])
[3, 4], [1, 2], [5]
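The three returned lists map directly onto set operations. The sketch below is one way to get the documented result, not necessarily how the module does it; the name `sketch_sorted_intersect` is hypothetical, and it assumes the elements are hashable, mutually comparable, and that duplicates may be collapsed.

```python
def sketch_sorted_intersect(query_tokens, string_tokens):
    """Return (common, query-only, string-only) as three sorted lists."""
    query_set = set(query_tokens)
    string_set = set(string_tokens)
    common = sorted(query_set & string_set)           # values present in both
    query_rest = sorted(query_set - string_set)       # values only in query_tokens
    string_rest = sorted(string_set - query_set)      # values only in string_tokens
    return common, query_rest, string_rest


print(sketch_sorted_intersect([1, 3, 2, 4], [3, 4, 5]))  # ([3, 4], [1, 2], [5])
```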
-
search.utils.sorted_unique_tokens(string, regexp='\W+', min_len=3)
Return a sorted list of unique tokens contained in the string.
-
search.utils.splitter(string, chunk_size)
Generator function that returns chunks of string of size chunk_size. If chunk_size is -1, the whole string is returned.
Parameters:
Yields: one chunk of string of size chunk_size on each call, such as
>>> splitter('hello, world', 3)
'hel', 'lo,', ' wo', 'rld'
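Unlike the sliding window of shifter, the chunks here do not overlap. A minimal sketch with the hypothetical name `sketch_splitter`; rejecting other non-positive sizes is an assumption, as the documentation only describes the -1 case.

```python
def sketch_splitter(string, chunk_size):
    """Yield consecutive, non-overlapping chunks of chunk_size characters."""
    if chunk_size == -1:
        # Documented special case: return the whole string as a single chunk.
        yield string
        return
    if chunk_size <= 0:
        # Assumption: any other non-positive size is invalid.
        raise ValueError("chunk_size must be positive or -1")
    for start in range(0, len(string), chunk_size):
        yield string[start:start + chunk_size]


print(list(sketch_splitter('hello, world', 3)))  # ['hel', 'lo,', ' wo', 'rld']
```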
-
search.utils.stringify_tokens(tokens)
Given a list or set of tokens, return them as a single string.
-
search.utils.tokenize(string, regexp='\W+', min_len=3)
Given a string, return a list of segments of the string, split with config.STR_SPLIT_REGEX, removing every word < config.MIN_WORD_LENGTH.
Parameters:
Returns: list - all the tokens extracted from the string in the same order they were found.
Example
>>> utils.tokenize('hello there, how are you')
['hello', 'there']
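Splitting on a regular expression and filtering by length can be sketched with `re.split`. This is an illustration, not the module's code: the name `sketch_tokenize` is hypothetical, the actual values of config.STR_SPLIT_REGEX and config.MIN_WORD_LENGTH are not shown in this page, and the sketch assumes tokens strictly shorter than `min_len` are dropped. The call below passes `min_len=4` explicitly.

```python
import re


def sketch_tokenize(string, regexp=r'\W+', min_len=3):
    """Split string on regexp and keep tokens of at least min_len characters, in order."""
    # Empty strings produced by leading or trailing separators are filtered out
    # by the length check as long as min_len is at least 1.
    return [token for token in re.split(regexp, string) if len(token) >= min_len]


print(sketch_tokenize('hello there, how are you', min_len=4))  # ['hello', 'there']
```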
-
search.utils.tokenize_set(string, regexp='\W+', min_len=3)
Given a string, return a set of unique segments of the string (single words), split with config.STR_SPLIT_REGEX, removing every word < config.MIN_WORD_LENGTH.
The difference from tokenize is that it returns a set instead of a list, so the values are unique.
Parameters:
Returns: set - all the tokens extracted from string in a set
Example
>>> utils.tokenize('hello there, there are some fishes there!')
['hello', 'there']
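The set-returning variant can be sketched the same way with a set comprehension, which discards repeated words and their order. The name `sketch_tokenize_set` is hypothetical, and the split pattern and length rule are assumptions taken from the signature defaults rather than from the config values.

```python
import re


def sketch_tokenize_set(string, regexp=r'\W+', min_len=3):
    """Split string on regexp and return the unique tokens of at least min_len characters."""
    return {token for token in re.split(regexp, string) if len(token) >= min_len}


tokens = sketch_tokenize_set('one fish, two fish, red fish')
print(sorted(tokens))  # ['fish', 'one', 'red', 'two']
```

Sorting the result, as done for printing here, is needed whenever a stable order matters, because sets are unordered.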
-
search.utils.weighted_average(values, weights=None)
Calculate the weighted mean between two iterables of values and matching weights. If weights is None, they will be autogenerated.
Parameters: - values (iterable) – Values to average, either int or float
- weights (iterable) – Matching weights iterable
Returns: weighted average
Return type:
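The weighted mean itself is the sum of value-weight products divided by the sum of the weights. The sketch below illustrates that; the name `sketch_weighted_average` is hypothetical, and the fallback used when weights is None (linearly descending weights, so earlier values count more) is only an assumption about what the autogenerated weights might look like.

```python
def sketch_weighted_average(values, weights=None):
    """Return sum(value * weight) / sum(weights)."""
    values = list(values)
    if weights is None:
        # Assumption: descending weights n, n-1, ..., 1 stand in for the
        # module's autogenerated weights, whose exact scheme is not documented here.
        weights = range(len(values), 0, -1)
    weights = list(weights)
    if len(weights) != len(values):
        raise ValueError("values and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(value * weight for value, weight in zip(values, weights)) / total


print(sketch_weighted_average([10, 20], [3, 1]))  # 12.5
```

Dividing by the sum of the weights means the weights do not need to be normalized beforehand.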