Fitting a line¶
There are many analyses where you might want to try to fit a function to some data, and there are many ways to do this in Python. Here we will give a very brief demonstration of fitting a simple polynomial function, e.g., a line, to data using methods based on standard least squares fitting.
Least squares fitting¶
We will briefly describe least squares fitting for a simple linear equation (often known as linear regression). Suppose we have a set of data points \(\mathbf{y} = \{y_1, y_2, \ldots, y_n\}\) evaluated, or measured, at points \(\mathbf{x} = \{x_1, x_2, \ldots, x_n\}\), and a model that describes a line \(\mathbf{m} = \beta_0 + \beta_1 \mathbf{x}\). To fit the model to the data we are interested in finding the values of \(\vec{\beta} = \{\beta_0, \beta_1\}\) that "best" match the data. There are multiple equivalent ways of framing this (minimising the mean square loss function, minimising the \(\chi^2\), or maximising the likelihood), but they all boil down to the same thing: finding the values of \(\vec{\beta}\) that solve the pair of equations:
\[\frac{\partial L}{\partial \beta_0} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \beta_1} = 0,\]
where
\[L = \sum_{i=1}^n \left(y_i - \beta_0 - \beta_1 x_i\right)^2\]
is the sum of squared differences between the data and the model.
Performing the differentiation and doing a bit of algebra produces the solutions:
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\]
and
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x},\]
where \(\bar{x}\) and \(\bar{y}\) are the means of the \(x_i\) and \(y_i\) values, respectively.
This can be generalised to polynomials of any order and can also include weighting for each data point, i.e., in cases where the noise on measurements is not the same for each measurement.
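To make the solutions above concrete, here is a minimal sketch (with made-up example data, purely for illustration) that implements them directly in NumPy:
import numpy as np

# made-up example data scattered around the line y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.9, 14.1])

# closed-form least squares solutions for a straight line
xbar = x.mean()
ybar = y.mean()
beta1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)  # gradient
beta0 = ybar - beta1 * xbar  # y-intercept

print(beta0, beta1)  # should be close to 2 and 3
The fitting functions shown below perform an equivalent calculation (generalised to other polynomial orders) internally.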
Fitting a line with NumPy¶
In the NumPy library there is more than one function for doing least squares fitting, e.g., polyfit(), which fits a polynomial of a given order to data, and linalg.lstsq(), which solves the general linear least squares problem. We will show an example using polyfit() for a polynomial of order 1, i.e., a straight line. To do this we will generate some fake data consisting of a line with added noise (which in reality might come from uncertainties on the measurements).
import numpy as np
from matplotlib import pyplot as plt
# define our linear model function
def model(x, beta0, beta1):
    """
    A function describing a straight line.

    Parameters
    ----------
    x: array_like
        A set of values at which to evaluate the model
    beta0: float
        The coefficient of the linear model providing the y-intercept
    beta1: float
        The coefficient of the linear model providing the gradient

    Returns
    -------
    y: array_like
        An array containing the model evaluated at the x values
    """

    return beta0 + x * beta1
# create our "fake" data
n = 100 # number of data points
x = np.linspace(-10, 10, n) # set x values at which the measurements are made
noise = np.random.default_rng().normal(0.0, 1.0, n) # set Gaussian noise with mean of 0 and standard deviation of 1
beta = [0.0, 1.0] # set "true" y-intercept and gradient to be 0 and 1
data = model(x, beta[0], beta[1]) + noise
# perform line fitting with polyfit
order = 1 # polynomial of order 1, i.e., a straight line
c = np.polyfit(x, data, order)
# plot the data and the best fit line
fig, ax = plt.subplots(figsize=(3.5, 2.5), dpi=200)
ax.scatter(x, data, label="data") # plot the data
ax.plot(x, model(x, c[1], c[0]), 'k', lw=2, label="Best fit") # plot the "best fit" line
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
ax.text(0, -8, f'Best fit gradient: {c[0]:.2f}', fontsize=14)
ax.text(0, -9.5, f'Best fit y-intercept: {c[1]:.2f}', fontsize=14)
ax.legend()
fig.tight_layout()
plt.show()
Note
The output of polyfit(), given by c in the example above, contains the coefficients ordered from the largest polynomial order to the smallest, i.e., for the straight line fit c[0] is the gradient (the coefficient of the \(x^1\) term) and c[1] is the y-intercept (the coefficient of the \(x^0\) term).
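Because of this ordering, the best fit polynomial can also be evaluated directly from the coefficients using NumPy's polyval() function, and if the measurements have known uncertainties these can be supplied to polyfit() through its w keyword argument. A brief sketch (re-using x, data, n and order from the example above, with made-up uncertainties just for illustration):
# evaluate the best fit line directly from the coefficients
besty = np.polyval(c, x)  # equivalent to model(x, c[1], c[0])

# a weighted fit: for Gaussian uncertainties sigma, the weights are 1/sigma
sigma = np.full(n, 2.0)  # made-up per-point uncertainties, all equal here for illustration
cw = np.polyfit(x, data, order, w=1.0 / sigma)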
Fitting a line with SciPy¶
In the SciPy library there are again multiple functions for doing least squares fitting:
curve_fit() - will fit any user-supplied function, not just polynomials
lsq_linear() - solves the linear least squares problem (see the sketch after this list)
least_squares() - solves the non-linear least squares problem
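As a quick illustration of the linear least squares framing, a straight line fit can be posed as solving \(A \vec{\beta} \approx \mathbf{y}\) for a design matrix \(A\) whose columns are \(\mathbf{x}\) and a column of ones. A minimal sketch (re-using the x and data arrays from the NumPy example above):
from scipy.optimize import lsq_linear

A = np.vander(x, 2)  # design matrix with columns [x, 1]
res = lsq_linear(A, data)
print(res.x)  # best fit [gradient, y-intercept]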
We will show an example of using curve_fit(). This can fit the parameters of an arbitrary user-defined function, although we will again show fitting a straight line. curve_fit() requires the function being fit to take the independent variable, i.e., the \(\mathbf{x}\) values in our example, as its first argument, followed by separate arguments for each of the parameters being fit. In the example, we will again fit a straight line to some fake data and will re-use the model function from the previous example:
import numpy as np
from scipy.optimize import curve_fit
from matplotlib import pyplot as plt
# create our "fake" data
n = 100 # number of data points
x = np.linspace(-10, 10, n) # set x values at which the measurements are made
noise = np.random.default_rng().normal(0.0, 3.0, n) # set Gaussian noise with mean of 0 and standard deviation of 3
beta = [-2.0, 0.5] # set "true" y-intercept and gradient to be -2 and 0.5
data = model(x, beta[0], beta[1]) + noise
# perform line fitting with curve_fit
copt, ccov = curve_fit(model, x, data)
# plot the data and the best fit line
fig, ax = plt.subplots()
ax.scatter(x, data, label="data") # plot the data
ax.plot(x, model(x, copt[0], copt[1]), 'k', lw=2, label="Best fit") # plot the "best fit" line
ax.set_xlabel("$x$")
ax.set_ylabel("$y$")
ax.text(-0.5, -9.5, f'Best fit gradient: {copt[1]:.2f}', fontsize=14)
ax.text(-0.5, -11, f'Best fit y-intercept: {copt[0]:.2f}', fontsize=14)
ax.legend()
fig.tight_layout()
plt.show()
Note
The output of curve_fit(), given by copt in the example above, contains estimates of the parameters in the same order in which they are defined in the function being fit.
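The second output, ccov, contains the estimated covariance matrix of those parameters, and the square roots of its diagonal elements are often quoted as the 1σ uncertainties on the best fit values. A brief sketch (assuming the fit above has been run):
perr = np.sqrt(np.diag(ccov))  # 1 sigma uncertainties on [beta0, beta1]
print(f"y-intercept: {copt[0]:.2f} ± {perr[0]:.2f}")
print(f"gradient: {copt[1]:.2f} ± {perr[1]:.2f}")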