Python is arguably the most popular programming language for data science. As one might expect, it comes with a slew of built-in libraries that can handle statistical analysis such as mean, median, and mode calculations.
Depending on your use case, there are several ways to calculate the average value of a set of numbers in Python. Whether you need a weighted average, the harmonic mean, or something more exotic, Python has several average functions that are anything but!
Obligatory Clarification
The terms “average” and “mean” are both ambiguous and may refer to any of several different mathematical calculations. The arithmetic mean is the most common, and it’s the one I’d wager most people mean when they broach the subject.
Other types include the geometric mean, the harmonic mean, and more than a dozen others, such as the moving average. For a deeper dive into exactly how some of these are calculated, check out this article. For our discussion here, I’m going to gloss over the finer details and focus mostly on implementation in Python.
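Since the moving average gets a mention above, here is a minimal sketch of a simple (unweighted) one using only the standard library. The function name simple_moving_average and its window parameter are illustrative choices of mine, not part of any library.

```python
from collections import deque


def simple_moving_average(values, window):
    """Yield the arithmetic mean of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    recent = deque(maxlen=window)  # keeps only the last `window` values
    for value in values:
        recent.append(value)
        if len(recent) == window:
            yield sum(recent) / window


print(list(simple_moving_average([1, 2, 3, 4, 5], window=3)))
# [2.0, 3.0, 4.0]
```

Each output value is just the arithmetic mean of the most recent three inputs, which is why the result has two fewer items than the input.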
Using Python to Get the Average
Below are several approaches to getting the mean in Python, ranging from the simple mean function built into the statistics module to third-party libraries like numpy. For the examples, I’ll be using the following randomly generated set of numbers:

    import random

    # Generate a list of 10 random numbers from 0-99
    numbers = [random.choice(range(100)) for _ in range(10)]
    # One possible result: [59, 97, 94, 98, 54, 40, 96, 37, 11, 17]

These are the numbers that will be used for each example to follow.
Statistics Library
Since Python 3.4, the standard library has included a statistics module that provides several functions for calculating the mean of a set of numbers. Among them are functions for the arithmetic mean, the harmonic mean (added in 3.6), and the geometric mean (added in 3.8), as shown below:

    import statistics

    # Define a list of random numbers
    numbers = [59, 97, 94, 98, 54, 40, 96, 37, 11, 17]

    arithmetic_mean = statistics.mean(numbers)
    # 60.3

    geometric_mean = statistics.geometric_mean(numbers)
    # 48.73877382924253

    harmonic_mean = statistics.harmonic_mean(numbers)
    # 35.868566290602814

Note: Before 3.4, similar functionality was not part of the standard library and came from third-party packages such as stats.
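The median and mode mentioned in the introduction live in the same module. The sketch below reuses the example numbers for the median and, because none of them repeat, a made-up ratings list for the mode.

```python
import statistics

numbers = [59, 97, 94, 98, 54, 40, 96, 37, 11, 17]

# With an even number of values, median() averages the two middle ones
# (54 and 59 here).
median_value = statistics.median(numbers)
print(median_value)  # 56.5

# mode() returns the most common value; this illustrative list repeats 40.
ratings = [40, 17, 40, 98, 17, 40]
most_common = statistics.mode(ratings)
print(most_common)  # 40
```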
NumPy
NumPy is among the most-used numerical processing libraries among data scientists, along with other staples such as pandas, matplotlib, and scikit-learn. All of these are overkill for simple mean calculation but, if they’re already dependencies, they can be convenient.

    import numpy

    # The arithmetic mean
    numpy_mean = numpy.mean(numbers)
    # 60.3

    # The weighted average (without weights)
    numpy_average = numpy.average(numbers)
    # 60.3

Note: numpy.average calculates a weighted average but, in the example above, isn’t given any weights. As such, a non-weighted average is calculated instead.
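To see the weighting actually take effect, pass a sequence to the weights parameter. The weights below are hypothetical, chosen only so the first five numbers count twice as much as the last five, and the by-hand line is there as a sanity check.

```python
import numpy

numbers = [59, 97, 94, 98, 54, 40, 96, 37, 11, 17]

# Hypothetical weights: count the first five values twice as heavily.
weights = [2, 2, 2, 2, 2, 1, 1, 1, 1, 1]

weighted = numpy.average(numbers, weights=weights)

# The same calculation by hand: sum(value * weight) / sum(weights)
by_hand = sum(n * w for n, w in zip(numbers, weights)) / sum(weights)

print(weighted, by_hand)  # both come out to 67.0
```

The weights need to line up one-to-one with the values being averaged.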
SciPy.stats
The scipy.stats module is focused mostly on probability distributions and statistical tests but provides some functions for mean calculation. The arithmetic mean is mostly exposed as part of larger summary results, such as the one returned by stats.describe, and the mode has its own stats.mode function. Overall, the scipy.stats module isn’t typically used for simple mean calculation. If you’re hell-bent on doing so, the geometric and harmonic means are available as follows:

    from scipy import stats

    # The geometric mean
    geometric_mean = stats.gmean(numbers)
    # 48.738773829242575

    # The harmonic mean
    harmonic_mean = stats.hmean(numbers)
    # 35.868566290602814

Manual Calculation
For those who prefer vanilla Python code, the mean/average isn’t exactly a hat trick. The following illustrates calculating the arithmetic mean using nothing but built-in Python functions:

    # Arithmetic Mean
    arithmetic_mean = sum(numbers) / len(numbers)
    # 60.3

The geometric mean and harmonic mean can be done in vanilla Python, but they’re not nearly as straightforward. At the very least, one would want to make use of the math.log and math.pow functions.
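For the curious, here is one way the two might look by hand. This is a sketch that assumes every number is positive, and it uses math.exp in place of math.pow to undo the logarithm.

```python
import math

numbers = [59, 97, 94, 98, 54, 40, 96, 37, 11, 17]

# Geometric mean: the nth root of the product, computed as the exponential
# of the average logarithm so the running product never gets huge.
geometric_mean = math.exp(sum(math.log(n) for n in numbers) / len(numbers))

# Harmonic mean: the count divided by the sum of the reciprocals.
harmonic_mean = len(numbers) / sum(1 / n for n in numbers)

print(geometric_mean, harmonic_mean)
```

The results should agree with the statistics module to within floating-point rounding, at roughly 48.74 and 35.87. Zeros or negative values would need extra handling that this sketch leaves out.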
Final Thoughts
Python makes it super easy to calculate the mean. The toughest part is deciding which method and/or library to use! The three Pythagorean means are available as functions in the standard statistics library. Third-party libraries like numpy can offer the same features but often add unwanted project overhead.