M-estimator

In statistics, M-estimators are a broad class of estimators, which are obtained as the minima of sums of functions of the data. Least-squares estimators are a special case of M-estimators. The definition of M-estimators was motivated by robust statistics, which contributed new types of M-estimators. The statistical procedure of evaluating an M-estimator on a data set is called M-estimation.

More generally, an M-estimator may be defined to be a zero of an estimating function.^[1]^[2]^[3]^[4]^[5]^[6] This estimating function is often the derivative of another statistical function. For example, a maximum-likelihood estimate is often defined to be a zero of the derivative of the likelihood function with respect to the parameter; thus, a maximum-likelihood estimator is often a zero of the score function, that is, a critical point of the likelihood.^[7] In many applications, such M-estimators can be thought of as estimating characteristics of the population.

Historical motivation

The method of least squares is a prototypical M-estimator, since the estimator is defined as a minimum of the sum of squares of the residuals.

Another popular M-estimator is the maximum-likelihood estimator. For a family of probability density functions f parameterized by θ, a maximum likelihood estimator of θ is computed for each set of data by maximizing the likelihood function over the parameter space {θ}. When the observations are independent and identically distributed, an ML estimate $\widehat{\theta}$ satisfies

\widehat{\theta} = \arg\max_{\theta} \left( \prod_{i=1}^{n} f(x_i, \theta) \right)

or, equivalently,

\widehat{\theta} = \arg\min_{\theta} \left( -\sum_{i=1}^{n} \log(f(x_i, \theta)) \right).

Maximum-likelihood estimators are often inefficient and biased for finite samples. For many regular problems, maximum-likelihood estimation performs well for "large samples", being an approximation of a posterior mode. If the problem is "regular", then any bias of the MLE (or posterior mode) decreases to zero when the sample-size increases to infinity. The performance of maximum-likelihood (and posterior-mode) estimators drops when the parametric family is mis-specified.
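The negative log-likelihood form above can be checked numerically. The following is a minimal sketch, assuming an exponential model with rate parameter θ and a small made-up sample; the model, the data, and the helper names `neg_log_likelihood` and `golden_section_minimize` are illustrative assumptions, not part of the original text.

```python
import math

def neg_log_likelihood(rate, data):
    """Negative log-likelihood of i.i.d. exponential data with the given rate."""
    # f(x, rate) = rate * exp(-rate * x), so -log f = -log(rate) + rate * x
    return sum(-math.log(rate) + rate * x for x in data)

def golden_section_minimize(func, lo, hi, tol=1e-10):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if func(c) < func(d):
            b = d
        else:
            a = c
    return (a + b) / 2

data = [0.5, 1.2, 0.3, 2.0, 0.9]  # hypothetical observations
rate_hat = golden_section_minimize(lambda r: neg_log_likelihood(r, data), 1e-6, 50.0)
closed_form = len(data) / sum(data)  # known closed-form MLE of the exponential rate
print(rate_hat, closed_form)
```

For this model the numerical minimizer should agree with the closed-form estimate n / Σxᵢ up to the tolerance of the search.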

Definition

In 1964, Peter J. Huber proposed generalizing maximum likelihood estimation to the minimization of

\sum_{i=1}^{n} \rho(x_i, \theta),

where ρ is a function with certain properties (see below). The solutions

\widehat{\theta} = \arg\min_{\theta} \left( \sum_{i=1}^{n} \rho(x_i, \theta) \right)

are called M-estimators ("M" for "maximum likelihood-type" (Huber, 1981, page 43)); other types of robust estimator include L-estimators, R-estimators and S-estimators. Maximum likelihood estimators (MLE) are thus a special case of M-estimators. With suitable rescaling, M-estimators are special cases of extremum estimators (in which more general functions of the observations can be used).

The function ρ, or its derivative, ψ, can be chosen so as to give the estimator desirable properties (in terms of bias and efficiency) when the data are truly from the assumed distribution, and 'not bad' behaviour when the data are generated from a model that is, in some sense, close to the assumed distribution.

Types of M-estimators

M-estimators are solutions, θ, which minimize

\sum_{i=1}^{n} \rho(x_i, \theta).

This minimization can always be done directly. Often it is simpler to differentiate with respect to θ and solve for the root of the derivative. When this differentiation is possible, the M-estimator is said to be of ψ-type. Otherwise, the M-estimator is said to be of ρ-type.

In most practical cases, the M-estimators are of ψ-type.
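The two routes can be compared on one estimator. The sketch below assumes Huber's ρ function applied to the residual x − θ with the conventional tuning constant k = 1.345, and ignores scale estimation for simplicity; the choice of ρ, the constant, the data, and all helper names are assumptions made for illustration. It computes the location estimate once by minimizing the sum of ρ directly and once by solving the ψ equation.

```python
def huber_rho(r, k=1.345):
    """Huber's rho: quadratic near zero, linear in the tails."""
    a = abs(r)
    return 0.5 * r * r if a <= k else k * (a - 0.5 * k)

def huber_psi(r, k=1.345):
    """Derivative of huber_rho with respect to r: r clipped to [-k, k]."""
    return max(-k, min(k, r))

def objective(theta, data):
    return sum(huber_rho(x - theta) for x in data)

def estimating_function(theta, data):
    # Non-increasing in theta; its zero is the psi-type estimate.
    return sum(huber_psi(x - theta) for x in data)

def minimize_convex(func, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def bisect_root(func, lo, hi, iters=200):
    """Bisection for a non-increasing func with func(lo) >= 0 >= func(hi)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

data = [2.1, 1.9, 2.4, 2.0, 1.8, 9.5]  # hypothetical sample with one outlier
theta_rho = minimize_convex(lambda t: objective(t, data), min(data), max(data))
theta_psi = bisect_root(lambda t: estimating_function(t, data), min(data), max(data))
sample_mean = sum(data) / len(data)
print(theta_rho, theta_psi, sample_mean)
```

Both routes should give the same value, and that value lies much closer to the bulk of the data than the sample mean does.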

ρ-type

For a positive integer r, let $(\mathcal{X}, \Sigma)$ and $(\Theta \subset \mathbb{R}^{r}, S)$ be measurable spaces, and let $\theta \in \Theta$ be a vector of parameters. An M-estimator of ρ-type $T$ is defined through a measurable function $\rho : \mathcal{X} \times \Theta \rightarrow \mathbb{R}$. It maps a probability distribution $F$ on $\mathcal{X}$ to the value $T(F) \in \Theta$ (if it exists) that minimizes $\int_{\mathcal{X}} \rho(x, \theta) \, dF(x)$:

T(F) := \arg\min_{\theta \in \Theta} \int_{\mathcal{X}} \rho(x, \theta) \, dF(x)

For example, for the maximum likelihood estimator, $\rho (x,\theta )=-\log(f(x,\theta ))$ , where $f(x,\theta )={\frac {\partial F(x,\theta )}{\partial x}}$ .
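To make the functional definition concrete, one can take F to be the empirical distribution of a sample, so that the integral becomes an average over the observations. The sketch below assumes ρ(x, θ) = |x − θ|, an illustrative choice not discussed above; this ρ is not differentiable at θ = x, and its minimizer is a sample median. The data and function names are hypothetical.

```python
import statistics

def abs_loss_objective(theta, data):
    """Empirical version of the integral of rho(x, theta) = |x - theta|."""
    return sum(abs(x - theta) for x in data) / len(data)

data = [3.0, 1.0, 4.0, 1.5, 9.0]  # hypothetical sample of odd size

# The objective is convex and piecewise linear with kinks at the data
# points, so a minimum is attained at one of the observations.
theta_hat = min(data, key=lambda t: abs_loss_objective(t, data))
print(theta_hat, statistics.median(data))
```

With an odd sample size the minimizer is unique and coincides with the sample median.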

ψ-type

If $\rho$ is differentiable, the computation of ${\widehat {\theta }}$ is usually much easier. An M-estimator of ψ-type T is defined through a measurable function $\psi :{\mathcal {X}}\times \Theta \rightarrow {\mathbb {R}}^{r}$ . It maps a probability distribution F on ${\mathcal {X}}$ to the value $T(F)\in \Theta$ (if it exists) that solves the vector equation:

\int_{\mathcal{X}} \psi(x, \theta) \, dF(x) = 0,

that is,

\int_{\mathcal{X}} \psi(x, T(F)) \, dF(x) = 0.

For example, for the maximum likelihood estimator, $\psi(x, \theta) = \left(\frac{\partial \log(f(x, \theta))}{\partial \theta^{1}}, \dots, \frac{\partial \log(f(x, \theta))}{\partial \theta^{r}}\right)^{\mathrm{T}}$, where $\theta^{1}, \dots, \theta^{r}$ are the components of $\theta$, $u^{\mathrm{T}}$ denotes the transpose of the vector u, and $f(x, \theta) = \frac{\partial F(x, \theta)}{\partial x}$.

Such an estimator is not necessarily an M-estimator of ρ-type. However, if ρ has a continuous first derivative with respect to $\theta$, then a necessary condition for an M-estimator of ψ-type to be an M-estimator of ρ-type is $\psi(x, \theta) = \nabla_{\theta} \rho(x, \theta)$. The previous definitions can easily be extended to finite samples by taking F to be the empirical distribution of the sample.
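As a worked instance of the ψ-type definition, assume an exponential model with rate θ > 0 (an illustrative choice, not taken from the text) and take F to be the empirical distribution of a sample x_1, ..., x_n with mean \bar{x}, so that the integral becomes an average:

```latex
\begin{aligned}
f(x, \theta) &= \theta e^{-\theta x}, \qquad x \ge 0, \\
\psi(x, \theta) &= \frac{\partial \log f(x, \theta)}{\partial \theta} = \frac{1}{\theta} - x, \\
\frac{1}{n} \sum_{i=1}^{n} \psi(x_i, \theta) &= \frac{1}{\theta} - \bar{x} = 0
\quad \Longrightarrow \quad \widehat{\theta} = \frac{1}{\bar{x}}.
\end{aligned}
```

Under these assumptions the estimating equation has a closed-form solution, which is the usual maximum likelihood estimate of the exponential rate.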

If the function ψ decreases to zero as $x\rightarrow \pm \infty$ , the estimator is called redescending. Such estimators have some additional desirable properties, such as complete rejection of gross outliers.
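A common redescending choice is Tukey's biweight, which is not named in the text and is used here only as an illustration. The sketch below treats ψ as a function of the residual r = x − θ and uses the conventional tuning constant c = 4.685; the data and helper names are hypothetical. Because ψ is exactly zero beyond c, a gross outlier contributes nothing to the estimating equation.

```python
def tukey_biweight_psi(r, c=4.685):
    """Tukey's biweight psi as a function of the residual r = x - theta."""
    if abs(r) >= c:
        return 0.0  # residuals beyond c are rejected completely
    u = r / c
    return r * (1.0 - u * u) ** 2

def estimating_function(theta, data, c=4.685):
    return sum(tukey_biweight_psi(x - theta, c) for x in data)

clean = [-0.4, 0.1, 0.3, -0.2, 0.2]   # hypothetical well-behaved sample
with_outlier = clean + [50.0]         # same sample plus one gross outlier

# Evaluated at theta = 0, the outlier's residual (50) exceeds c,
# so both samples give the same value of the estimating function.
print(estimating_function(0.0, clean), estimating_function(0.0, with_outlier))
```

The function rises for small residuals, peaks, and then falls back to zero, which is what "redescending" refers to.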

Computation

For many choices of ρ or ψ, no closed-form solution exists and an iterative approach to computation is required. It is possible to use standard function optimization algorithms, such as Newton–Raphson. However, in most cases an iteratively re-weighted least squares fitting algorithm can be used; this is typically the preferred method.

For some choices of ψ, specifically, redescending functions, the solution may not be unique. The issue is particularly relevant in multivariate and regression problems. Thus, some care is needed to ensure that good starting points are chosen. Robust starting points, such as the median as an estimate of location and the median absolute deviation as a univariate estimate of scale, are common.
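The iteratively re-weighted scheme with robust starting values can be sketched as follows for a univariate location estimate. This is a minimal illustration under several assumptions not stated in the text: Huber's ψ with k = 1.345, a scale fixed at the median absolute deviation multiplied by the conventional normal-consistency factor 1.4826, and hypothetical data and function names.

```python
import statistics

def huber_weight(u, k=1.345):
    """Weight w(u) = psi(u) / u for Huber's psi (1 inside [-k, k])."""
    a = abs(u)
    return 1.0 if a <= k else k / a

def huber_location_irls(data, k=1.345, tol=1e-10, max_iter=100):
    # Robust starting point: the median, with the MAD as a fixed scale.
    theta = statistics.median(data)
    mad = statistics.median([abs(x - theta) for x in data])
    scale = 1.4826 * mad  # assumed consistency factor for normal data
    if scale == 0:
        raise ValueError("MAD is zero; a scale estimate is needed")
    for _ in range(max_iter):
        # Re-weight each observation by its standardized residual,
        # then solve the weighted least-squares problem (a weighted mean).
        weights = [huber_weight((x - theta) / scale, k) for x in data]
        new_theta = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

data = [2.1, 1.9, 2.4, 2.0, 1.8, 9.5]  # hypothetical sample with one outlier
theta_hat = huber_location_irls(data)
print(theta_hat, sum(data) / len(data))
```

Each iteration is a weighted mean, so the outlier is progressively down-weighted rather than discarded. For a redescending ψ the same loop can converge to different solutions from different starting points, which is why the robust start matters.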

Properties

Distribution

It can be shown that M-estimators are asymptotically normally distributed. As such, Wald-type approaches to constructing confidence intervals and hypothesis tests can be used. However, since the theory is asymptotic, it will frequently be sensible to check the distribution, perhaps by examining the permutation or bootstrap distribution.
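A Wald-type interval needs a standard error. The sketch below uses the standard plug-in ("sandwich") form for a scalar ψ-type estimator, in which the asymptotic variance is estimated by the average of ψ² divided by the squared average of ∂ψ/∂θ; this formula is a standard result that is not stated in the text, and the function names and data are hypothetical. It is applied to ψ(x, θ) = θ − x, for which the estimator is the sample mean.

```python
import math

def sandwich_standard_error(data, theta_hat, psi, dpsi_dtheta):
    """Plug-in standard error for a scalar psi-type M-estimator."""
    n = len(data)
    a_hat = sum(dpsi_dtheta(x, theta_hat) for x in data) / n  # mean derivative
    b_hat = sum(psi(x, theta_hat) ** 2 for x in data) / n     # mean squared psi
    return math.sqrt(b_hat / a_hat ** 2 / n)

def wald_interval(theta_hat, se, z=1.959964):
    """Approximate 95% Wald interval based on asymptotic normality."""
    return theta_hat - z * se, theta_hat + z * se

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical sample
theta_hat = sum(data) / len(data)
se = sandwich_standard_error(
    data,
    theta_hat,
    psi=lambda x, t: t - x,
    dpsi_dtheta=lambda x, t: 1.0,
)
low, high = wald_interval(theta_hat, se)
print(theta_hat, se, (low, high))
```

For the mean this reduces to the familiar standard error based on the sample variance with divisor n. With only five observations the normal approximation is rough, which is the situation in which a bootstrap check is sensible.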

Influence function

The influence function of an M-estimator of ψ-type is proportional to its defining ψ function.

Let T be an M-estimator of ψ-type, and G be a probability distribution for which $T(G)$ is defined. Its influence function IF is

\operatorname{IF}(x; T, G) = -\frac{\psi(x, T(G))}{\int \left[\frac{\partial \psi(y, \theta)}{\partial \theta}\right]_{\theta = T(G)} f(y) \, \mathrm{d}y}

assuming that G has a density function $f(y)$. A proof of this property of M-estimators can be found in Huber (1981, Section 3.2).
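As a short worked case, take the mean, for which ψ(x, θ) = θ − x. Substituting into the formula above gives:

```latex
\begin{aligned}
\psi(x, \theta) &= \theta - x, \qquad \frac{\partial \psi(y, \theta)}{\partial \theta} = 1, \\
\operatorname{IF}(x; T, G) &= -\frac{T(G) - x}{\int 1 \cdot f(y) \, \mathrm{d}y} = x - T(G).
\end{aligned}
```

This influence function is unbounded in x, which is the usual way of expressing that a single gross outlier can move the mean arbitrarily far. A bounded ψ gives a bounded influence function.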

Applications

M-estimators can be constructed for location parameters and scale parameters in univariate and multivariate settings, as well as being used in robust regression.

Examples

Mean

Let $(X_1, \dots, X_n)$ be a set of independent, identically distributed random variables with distribution F.

If we define

\rho(x, \theta) = \frac{(x - \theta)^{2}}{2},

then the sum $\sum_{i=1}^{n} \rho(X_i, \theta)$ is minimized when θ is the sample mean of the $X_i$. Thus the mean is an M-estimator of ρ-type, with this ρ function.

As this ρ function is continuously differentiable in θ, the mean is thus also an M-estimator of ψ-type for ψ(x, θ) = θ − x.
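Both characterizations of the mean can be verified numerically. The sketch below uses a small made-up sample; the data and variable names are illustrative assumptions.

```python
def rho(x, theta):
    return (x - theta) ** 2 / 2

def psi(x, theta):
    # Derivative of rho with respect to theta.
    return theta - x

def objective(theta, data):
    return sum(rho(x, theta) for x in data)

data = [2.0, 4.0, 4.5, 7.5]  # hypothetical observations
sample_mean = sum(data) / len(data)

# psi-type check: the estimating equation vanishes at the sample mean.
score_sum = sum(psi(x, sample_mean) for x in data)

# rho-type check: moving away from the mean increases the objective.
at_mean = objective(sample_mean, data)
nearby = [objective(sample_mean + d, data) for d in (-0.1, 0.1)]
print(sample_mean, score_sum, at_mean, nearby)
```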

References

1. V. P. Godambe, editor. Estimating functions, volume 7 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 1991.
2. Christopher C. Heyde. Quasi-likelihood and its application: A general approach to optimal parameter estimation. Springer Series in Statistics. Springer-Verlag, New York, 1997.
3. D. L. McLeish and Christopher G. Small. The theory and applications of statistical inference functions, volume 44 of Lecture Notes in Statistics. Springer-Verlag, New York, 1988.
4. Parimal Mukhopadhyay. An Introduction to Estimating Functions. Alpha Science International, Ltd, 2004.
5. Christopher G. Small and Jinfang Wang. Numerical methods for nonlinear estimating equations, volume 29 of Oxford Statistical Science Series. The Clarendon Press Oxford University Press, New York, 2003.
6. Sara A. van de Geer. Empirical Processes in M-estimation: Applications of empirical process theory, volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.
7. Thomas S. Ferguson. "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association, 77 (380): 831–834, 1982. doi:10.1080/01621459.1982.10477894. JSTOR 2287314.

External links

M-estimators — an introduction to the subject by Zhengyou Zhang

This article is issued from Wikipedia - version of the 11/18/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.
