biweight_midcovariance

astropy.stats.biweight_midcovariance(data, c=9.0, M=None, modify_sample_size=False)[source] [edit on github]

Compute the biweight midcovariance between pairs of multiple variables.

The biweight midcovariance is a robust and resistant estimator of the covariance between two variables.

This function computes the biweight midcovariance between all pairs of the input variables (rows) in the input data. The output array will have a shape of (N_variables, N_variables). The diagonal elements will be the biweight midvariances of each input variable (see biweight_midvariance()). The off-diagonal elements will be the biweight midcovariances between each pair of input variables.

For example, if the input array data contains three variables (rows) x, y, and z, the output ndarray midcovariance matrix will be:

\[\begin{split}\begin{pmatrix} \zeta_{xx} & \zeta_{xy} & \zeta_{xz} \\ \zeta_{yx} & \zeta_{yy} & \zeta_{yz} \\ \zeta_{zx} & \zeta_{zy} & \zeta_{zz} \end{pmatrix}\end{split}\]

where \(\zeta_{xx}\), \(\zeta_{yy}\), and \(\zeta_{zz}\) are the biweight midvariances of each variable. The biweight midcovariance between \(x\) and \(y\) is \(\zeta_{xy}\) (\(= \zeta_{yx}\)). The biweight midcovariance between \(x\) and \(z\) is \(\zeta_{xz}\) (\(= \zeta_{zx}\)). The biweight midcovariance between \(y\) and \(z\) is \(\zeta_{yz}\) (\(= \zeta_{zy}\)).

The biweight midcovariance between two variables \(x\) and \(y\) is given by:

\[\zeta_{xy} = n \ \frac{\Sigma_{|u_i| < 1, \ |v_i| < 1} \ (x_i - M_x) (1 - u_i^2)^2 (y_i - M_y) (1 - v_i^2)^2} {(\Sigma_{|u_i| < 1} \ (1 - u_i^2) (1 - 5u_i^2)) (\Sigma_{|v_i| < 1} \ (1 - v_i^2) (1 - 5v_i^2))}\]

where \(M_x\) and \(M_y\) are the medians (or the input locations) of the two variables and \(u_i\) and \(v_i\) are given by:

\[ \begin{align}\begin{aligned}u_{i} = \frac{(x_i - M_x)}{c * MAD_x}\\v_{i} = \frac{(y_i - M_y)}{c * MAD_y}\end{aligned}\end{align} \]

where \(c\) is the biweight tuning constant and \(MAD_x\) and \(MAD_y\) are the median absolute deviation of the \(x\) and \(y\) variables. The biweight midvariance tuning constant c is typically 9.0 (the default).

For the standard definition of biweight midcovariance \(n\) is the total number of observations of each variable. That definition is used if modify_sample_size is False, which is the default.

However, if modify_sample_size = True, then \(n\) is the number of observations for which \(|u_i| < 1\) and \(|v_i| < 1\), i.e.

\[n = \Sigma_{|u_i| < 1, \ |v_i| < 1} \ 1\]

which results in a value closer to the true variance for small sample sizes or for a large number of rejected values.

Parameters:
data : 2D or 1D array-like

Input data either as a 2D or 1D array. For a 2D array, it should have a shape (N_variables, N_observations). A 1D array may be input for observations of a single variable, in which case the biweight midvariance will be calculated (no covariance). Each row of data represents a variable, and each column a single observation of all those variables (same as the numpy.cov convention).

c : float, optional

Tuning constant for the biweight estimator (default = 9.0).

M : float or 1D array-like, optional

The location estimate of each variable, either as a scalar or array. If M is an array, then its must be a 1D array containing the location estimate of each row (i.e. a.ndim elements). If M is a scalar value, then its value will be used for each variable (row). If None (default), then the median of each variable (row) will be used.

modify_sample_size : bool, optional

If False (default), then the sample size used is the total number of observations of each variable, which follows the standard definition of biweight midcovariance. If True, then the sample size is reduced to correct for any rejected values (see formula above), which results in a value closer to the true covariance for small sample sizes or for a large number of rejected values.

Returns:
biweight_midcovariance : ndarray

A 2D array representing the biweight midcovariances between each pair of the variables (rows) in the input array. The output array will have a shape of (N_variables, N_variables). The diagonal elements will be the biweight midvariances of each input variable. The off-diagonal elements will be the biweight midcovariances between each pair of input variables.

References

[1]http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/biwmidc.htm

Examples

Compute the biweight midcovariance between two random variables:

>>> import numpy as np
>>> from astropy.stats import biweight_midcovariance
>>> # Generate two random variables x and y
>>> rng = np.random.RandomState(1)
>>> x = rng.normal(0, 1, 200)
>>> y = rng.normal(0, 3, 200)
>>> # Introduce an obvious outlier
>>> x[0] = 30.0
>>> # Calculate the biweight midcovariances between x and y
>>> bicov = biweight_midcovariance([x, y])
>>> print(bicov)    
[[ 0.82483155 -0.18961219]
 [-0.18961219 9.80265764]]
>>> # Print standard deviation estimates
>>> print(np.sqrt(bicov.diagonal()))    
[ 0.90820237  3.13091961]