Python comes with a lot of built-in data types to manage sequences. This includes lists, dictionaries, sets, Queues, Deques, and more. The standard comparator operators can be applied to such sequences but trouble arises when comparing sequences of different lengths or dimensions.
In this article, we will take a look at how one can apply a custom function to two sequences of different lengths. This can be used to compare all items in one list to those in another, find the smallest value among multiple sets, or anything else you can imagine.
Getting Started
To get started we are going to define two lists as our sequences to demonstrate things. Then we will compare each item in the first list to determine if it is less than every item in the second list (spoiler: each item is less). Consider the following code:
# Define two equally sized lists l1 = [1, 2, 3, 4, 5] l2 = [6, 7, 8, 9, 10] # Determine if each item in L1 is less than all items in L2 results = [l1[i] < l2[i] for i in range(len(l2))] # Print results print(results) [True, True, True, True, True, True]
This is expected behavior and produces the desired results. Now, let us consider how this might have played out with two lists of different sizes.
Throwing Errors
To change things up, we are going to add an additional item to our second list, the integer value 11, to ensure our lists are of different lengths. We will then attempt the same pairwise comparator evaluation as before. Consider the following example code:
# Define two differently sized lists l1 = [1, 2, 3, 4, 5] l2 = [6, 7, 8, 9, 10] # Determine if each item in L1 is less than all items in L2 results = [l1[i] < l2[i] for i in range(len(l2))] >>> IndexError: list index out of range
Here we see an index error being thrown that results from our use of the len(l2)
argument in our comparison logic. This attempts to iterate over our first list for every item in our second. There is one extra element in our second, which means we will be trying to access an extra element in the first list that simply isn’t there.
Broadcasting
The term broadcasting is used by the NumPy library deals with arrays of different sizes. Keep in mind; lists are just 1-dimensional arrays. The concept of broadcasting among arrays has been leveraged by many of the most powerful machine learning libraries such as Tensorflow, Theano, Octave, and others. The term is defined by Scipy as such:
The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
Skipping over the finer details of broadcasting—it can be used here to resize the smaller of our lists to ensure the IndexError Exception is no longer an issue. We will use the NumPy.take()
method to co-opt this feature from NumPy in the following code:
# Define lists of different size l1 = [1, 2, 3, 4, 5] l2 = [6, 7, 8, 9, 10, 11] # Apply NumPy's take() method to each # Use the max length of each for sizing l1 = np.take(l1, list(range(max([len(l1), len(l2)])))) l2 = np.take(l2, list(range(max([len(l1), len(l2)])))) >>> IndexError: index 5 is out of bounds for axis 0 with size 5
What gives—after all this hullabaloo about broadcasting—we are still getting an IndexError
. We need to realize that broadcasting does not perform any magic on our arrays. Some action must be taken to resize one of our arrays to avoid such an error.
This is handled by the take()
method by providing an additional mode
keyword argument to determine what happens with arrays of different sizes. By default (what was used above) this argument is 'raise'
as in raise an Exception.
The documentation for the NumPy take()
function details the possibility of using 'clip'
or 'wrap'
as alternative values for mode. Let’s see how each will affect our output:
# Specify 'clip' as broadcast mode l1 = np.take(l1, list(range(max([len(l1), len(l2)]))), mode='clip') l2 = np.take(l2, list(range(max([len(l1), len(l2)]))), mode='clip') # View results print(l1.tolist()) print(l2.tolist()) [1, 2, 3, 4, 5, 5] [6, 7, 8, 9, 10, 11]
Here we see our first list having had an additional copy of the last element added. This action is taken as many times as necessary such that our l1 list is resized to match the length of l2
. For example, if we had added 5 additional elements in the l2
list—the 'clip'
mode of broadcasting would ad 5 duplicates of the last element in the l1
list. Let’s see that in action:
# Define new lists l1 = [1, 2, 3, 4, 5] l2 = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] # Broadcast again l1 = np.take(l1, list(range(max([len(l1), len(l2)]))), mode='clip') l2 = np.take(l2, list(range(max([len(l1), len(l2)]))), mode='clip') # Print results print(l1.tolist()) print(l2.tolist()) [1, 2, 3, 4, 5, 5, 5, 5, 5, 5] [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Here we find 5 additional 5’s added to list one—just as expected. The NumPy take()
method offers 'wrap'
as an additional mode of broadcasting. Let’s consider how this handles our data, again using the newly-defined lists:
# Define lists again l1 = [1, 2, 3, 4, 5] l2 = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] # Broadcast with NumPy take, using clip as mode l1 = np.take(l1, list(range(max([len(l1), len(l2)]))), mode='wrap') l2 = np.take(l2, list(range(max([len(l1), len(l2)]))), mode='wrap') # View results print(l1.tolist()) print(l2.tolist()) [1, 2, 3, 4, 5, 1, 2, 3, 4, 5] [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Here we see our first list having been apparently duplicated within itself. This isn’t exactly the case—though the result would appear as such. What happens is the 'wrap'
mode begins wrapping back to the beginning of the l1
list when it runs our index values to match with l2
. The first index value of l2
outside the range of l1
(index value of 5) is then wrapped around to the first value of l1
. Logically, this is equivalent of 6 % 5 -1
.
Emulating with Vanilla Python
There’s not much practical cause for using a vanilla Python implementation of broadcasting. Nonetheless, let us consider how one might approach such as task for the purposes of our collective enlightenment. To implement this in code, we would need to handle the following cases:
- Determine the length of the iterable such that we could know if an index value were going to cause an Exception to be thrown;
- Handle out-of-bounds index values by delegating to logic for either ‘wrap’ or ‘clip’ modes.
- Maintain a container for returning output values taken from the above logic
Consider the following code demonstrating this basic functionality of broadcasting suited for the selection of indices from a 1-dimensional array:
# Define some data iterable = [1, 2, 3, 4, 5] indices = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] mode = 'wrap' # Determine the length of our iterable # to avoid re-checking during each iteration iter_length = len(iterable) # Create an output container output = [] # Consider each index value from indices for i in indices: # Get element for any idex value within range if i < iter_length: output.append(iterable[i]) # If index out-of-range, handle with mode elif mode == 'wrap': output.append(iterable[i % iter_length - 1]) elif mode == 'clip': output.append(iterable[-1]) # View result print(output) [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
Here we see output values identical to those returned via the NumPy.take(mode=’wrap’) approach. The only bit of fancy in this code is the i % iter_length - 1
line. That implements a modulus operation to divide the index by the number of available iterable items, and then subtracts the one to accommodate zero-based indexing. Read this article by Real Python for a in-depth explanation of the modulo operator in Python.
Note: The above code is available as a function via GitHub
Final Thoughts
Dealing with lists of differing lengths can be tricky. The first step in any approach is to carefully consider one’s goals and determine the best way to handle out-of-bounds cases. For example, if comparing all values of a shorter list to those of a larger list duplication of elements may be acceptable. In other cases where an even distribution is required (such as selecting random colors) one might need to make use of an approach similar to the mode=’wrap’ option in the take() method.
For information on basic list
comparisons check out the end of the official Python documentation on sequence types. For more on complex use-cases of the take() method check out this article on broadcasting that is an excerpt from the book Python Data Science Handbook. Broadcasting is an incredibly useful concept for managing arrays of differing lengths. Its role in modern machine learning algorithms is no joke and knowing the conceptual basics can pay off in a big way. Check out this article on the best machine learning books for some ideas on how it is implemented!