Independent and identically distributed random variables

In probability theory and statistics, a collection of random variables is independent and identically distributed if each random variable has the same probability distribution as the others and all are mutually independent.^[1] This property is usually abbreviated as i.i.d., iid, or IID. The concept was first defined in statistics and is applied in fields such as data mining and signal processing.

Introduction

Statistics commonly deals with random samples. A random sample can be thought of as a set of objects that are chosen randomly. More formally, it is "a sequence of independent, identically distributed (IID) random data points".

In other words, the terms random sample and IID are synonymous. In statistics, "random sample" is the typical terminology, but in probability, it is more common to say "IID".

Identically distributed means that there are no overall trends—the distribution does not fluctuate and all items in the sample are taken from the same probability distribution.
Independent means that the sample items are all independent events. In other words, they are not connected to each other in any way;^[2] knowledge of the value of one variable gives no information about the value of the other and vice versa.

Application

Independent and identically distributed random variables are often used as an assumption, which tends to simplify the underlying mathematics. In practical applications of statistical modeling, however, the assumption may or may not be realistic.^[3]

The i.i.d. assumption is also used in the central limit theorem, which states that the probability distribution of the suitably normalized sum (or average) of i.i.d. variables with finite variance approaches a normal distribution as the number of variables grows.^[4]
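As a rough numerical illustration of the central limit theorem, the following Python sketch averages i.i.d. exponential variables and standardizes the averages. The exponential distribution is chosen here only because it is clearly non-normal, and the sample size, number of repetitions, and seed are arbitrary assumptions.

```python
import math
import random
import statistics

def standardized_means(n, reps, rng):
    """Return `reps` standardized means of n i.i.d. Exponential(1) draws.

    Exponential(1) has mean 1 and variance 1, so the sample mean has
    mean 1 and standard deviation 1 / sqrt(n).
    """
    out = []
    for _ in range(reps):
        mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append((mean - 1.0) * math.sqrt(n))
    return out

rng = random.Random(0)  # fixed seed, an arbitrary choice for reproducibility
z = standardized_means(n=50, reps=4000, rng=rng)

# A standard normal variable lies within one standard deviation about 68.3% of the time.
within_one_sd = sum(abs(v) <= 1.0 for v in z) / len(z)
print(statistics.fmean(z), statistics.pstdev(z), within_one_sd)
```

The printed mean should be close to 0, the standard deviation close to 1, and the last fraction close to the normal value of about 0.683.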

The i.i.d. assumption frequently arises in the context of sequences of random variables. Then "independent and identically distributed" implies that an element in the sequence is independent of the random variables that came before it. In this way, an i.i.d. sequence is different from a Markov sequence, where the probability distribution for the $n$th random variable is a function of the previous random variable in the sequence (for a first-order Markov sequence). Being i.i.d. does not require that all elements of the sample space or event space be equally probable.^[5] For example, repeated throws of loaded dice will produce a sequence that is i.i.d., despite the outcomes being biased.
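The contrast between an i.i.d. sequence and a first-order Markov sequence can be sketched in Python. The loading of the die and the "sticky" Markov rule below are hypothetical choices made only for illustration. In the i.i.d. sequence the frequency of a six is about the same whether or not the previous roll was a six, while in the Markov sequence it is not.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
LOADED_WEIGHTS = [1, 1, 1, 1, 1, 5]  # hypothetical loading: a six has probability 1/2

def iid_rolls(n, rng):
    """n independent rolls of the same loaded die."""
    return rng.choices(FACES, weights=LOADED_WEIGHTS, k=n)

def sticky_rolls(n, rng, stay=0.5):
    """A first-order Markov sequence: repeat the previous face with
    probability `stay`, otherwise roll the loaded die afresh."""
    seq = [rng.choices(FACES, weights=LOADED_WEIGHTS)[0]]
    for _ in range(n - 1):
        if rng.random() < stay:
            seq.append(seq[-1])
        else:
            seq.append(rng.choices(FACES, weights=LOADED_WEIGHTS)[0])
    return seq

def freq_six_after(seq, prev_is_six):
    """Relative frequency of a six, given whether the previous roll was a six."""
    nxt = [b for a, b in zip(seq, seq[1:]) if (a == 6) == prev_is_six]
    return sum(b == 6 for b in nxt) / len(nxt)

rng = random.Random(1)
iid = iid_rolls(50_000, rng)
markov = sticky_rolls(50_000, rng)

iid_gap = abs(freq_six_after(iid, True) - freq_six_after(iid, False))
markov_gap = abs(freq_six_after(markov, True) - freq_six_after(markov, False))
print(iid_gap, markov_gap)
```

The first gap should be near 0 even though the die is biased; the second should be near 0.5 for this particular Markov rule.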

In signal processing and image processing the notion of transformation to i.i.d. implies two specifications, the "i.d." part and the "i." part:

i.d. – The signal level must be balanced on the time axis.

i. – The signal spectrum must be flattened, i.e. transformed by filtering (such as deconvolution) to a white noise signal (i.e. a signal where all frequencies are equally present).

Definition

Definition for two random variables

Suppose that the random variables $X$ and $Y$ are defined to assume values in $I\subseteq \mathbb {R}$ . Let $F_{X}(x)=\operatorname {P} (X\leq x)$ and $F_{Y}(y)=\operatorname {P} (Y\leq y)$ be the cumulative distribution functions of $X$ and $Y$ , respectively, and denote their joint cumulative distribution function by $F_{X,Y}(x,y)=\operatorname {P} (X\leq x\land Y\leq y)$ .

Two random variables $X$ and $Y$ are identically distributed if and only if^[6] $F_{X}(x)=F_{Y}(x)\,\forall x\in I$ .

Two random variables $X$ and $Y$ are independent if and only if $F_{X,Y}(x,y)=F_{X}(x)\cdot F_{Y}(y)\,\forall x,y\in I$ . (See further Independence (probability theory) § Two random variables.)

Two random variables $X$ and $Y$ are i.i.d. if they are independent and identically distributed, i.e. if and only if

$${\begin{aligned}&F_{X}(x)=F_{Y}(x)\,&\forall x\in I\\&F_{X,Y}(x,y)=F_{X}(x)\cdot F_{Y}(y)\,&\forall x,y\in I\end{aligned}}$$

(Eq.1)
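The two conditions can be checked approximately on simulated data using empirical distribution functions. The following Python sketch is only an illustration under assumed settings (uniform variables, one evaluation point, a fixed seed); it compares an independent pair with the dependent pair $Y=X$, which is identically distributed but not independent.

```python
import random

def ecdf(sample, x):
    """Empirical estimate of P(X <= x)."""
    return sum(v <= x for v in sample) / len(sample)

def joint_ecdf(pairs, x, y):
    """Empirical estimate of P(X <= x and Y <= y)."""
    return sum(a <= x and b <= y for a, b in pairs) / len(pairs)

rng = random.Random(2)
n = 50_000
xs = [rng.random() for _ in range(n)]
ys = [rng.random() for _ in range(n)]  # independent of xs, same Uniform(0, 1) law
indep_pairs = list(zip(xs, ys))
dep_pairs = list(zip(xs, xs))          # Y = X: identically distributed, not independent

x, y = 0.3, 0.6
same_law_gap = abs(ecdf(xs, x) - ecdf(ys, x))
indep_gap = abs(joint_ecdf(indep_pairs, x, y) - ecdf(xs, x) * ecdf(ys, y))
dep_gap = abs(joint_ecdf(dep_pairs, x, y) - ecdf(xs, x) * ecdf(xs, y))
print(same_law_gap, indep_gap, dep_gap)
```

The first two gaps should be close to 0. The third should be close to 0.3 − 0.3 · 0.6 = 0.12, showing that the product condition fails for the dependent pair.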

Definition for more than two random variables

The definition extends naturally to more than two random variables. We say that $n$ random variables $X_{1},\ldots ,X_{n}$ are i.i.d. if they are independent (see further Independence (probability theory) § More than two random variables) and identically distributed, i.e. if and only if

$${\begin{aligned}&F_{X_{1}}(x)=F_{X_{k}}(x)\,&\forall k\in \{1,\ldots ,n\}{\text{ and }}\forall x\in I\\&F_{X_{1},\ldots ,X_{n}}(x_{1},\ldots ,x_{n})=F_{X_{1}}(x_{1})\cdot \ldots \cdot F_{X_{n}}(x_{n})\,&\forall x_{1},\ldots ,x_{n}\in I\end{aligned}}$$

(Eq.2)

where $F_{X_{1},\ldots ,X_{n}}(x_{1},\ldots ,x_{n})=\operatorname {P} (X_{1}\leq x_{1}\land \ldots \land X_{n}\leq x_{n})$ denotes the joint cumulative distribution function of $X_{1},\ldots ,X_{n}$ .
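One consequence of the product form is that, for i.i.d. variables with common distribution function $F$, the probability that all $n$ variables are at most $x$ equals $F(x)^{n}$. The Python sketch below checks this by simulation for uniform variables on $[0,1]$, where $F(x)=x$; the values of $n$, $x$, and the seed are arbitrary assumptions.

```python
import random

def prob_all_below(n, x, reps, rng):
    """Estimate P(X_1 <= x, ..., X_n <= x) for n i.i.d. Uniform(0, 1) variables."""
    hits = 0
    for _ in range(reps):
        if all(rng.random() <= x for _ in range(n)):
            hits += 1
    return hits / reps

rng = random.Random(3)
n, x = 5, 0.8
estimate = prob_all_below(n, x, reps=100_000, rng=rng)
exact = x ** n  # F(x)^n with F(x) = x on [0, 1]
print(estimate, exact)
```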

Definition for independence

In probability theory, two events, $A$ and $B$, are called independent if and only if $P(A\ \mathrm {and} \ B)=P(A)P(B)$. In the following, $P(AB)$ is short for $P(A\ \mathrm {and} \ B)$.

Suppose $A$ and $B$ are two events of an experiment. If $P(A)>0$, the conditional probability $P(B|A)$ is defined. In general, the occurrence of $A$ changes the probability of $B$; only when the occurrence of $A$ has no effect on the occurrence of $B$ does $P(B|A)=P(B)$ hold.

Note: If $P(A)>0$ and $P(B)>0$, then $A$ and $B$ cannot be both independent and mutually exclusive. Independence gives $P(AB)=P(A)P(B)>0$, so independent events must be compatible, whereas mutually exclusive events have $P(AB)=0$ and are therefore dependent.

Suppose $A$, $B$, and $C$ are three events. If $P(AB)=P(A)P(B)$, $P(BC)=P(B)P(C)$, $P(AC)=P(A)P(C)$, and $P(ABC)=P(A)P(B)P(C)$ are satisfied, then the events $A$, $B$, and $C$ are mutually independent.

More generally, $n$ events $A_{1},A_{2},\ldots ,A_{n}$ are mutually independent if, for every subcollection of $2,3,\ldots ,n$ of them, the probability of the intersection equals the product of the probabilities of the individual events.
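The requirement on the triple product is not redundant: events can satisfy all the pairwise conditions without being mutually independent. The Python sketch below checks a standard example exactly, using two tosses of a fair coin; the event names `A`, `B`, and `C` follow the notation above, and the helper functions are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: two tosses of a fair coin, all four outcomes equally likely.
OUTCOMES = list(product("HT", repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))

A = lambda o: o[0] == "H"   # first toss is heads
B = lambda o: o[1] == "H"   # second toss is heads
C = lambda o: o[0] == o[1]  # both tosses agree

def both(e, f):
    """The event that e and f occur together."""
    return lambda o: e(o) and f(o)

pairwise = all(
    prob(both(e, f)) == prob(e) * prob(f)
    for e, f in [(A, B), (B, C), (A, C)]
)
triple = prob(lambda o: A(o) and B(o) and C(o)) == prob(A) * prob(B) * prob(C)
print(pairwise, triple)  # True False
```

Each pair of events is independent, but $P(ABC)=1/4$ while $P(A)P(B)P(C)=1/8$, so the three events are not mutually independent.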

Examples

Example 1

A sequence of outcomes of spins of a fair or unfair roulette wheel is i.i.d. One implication of this is that if the roulette ball lands on "red", for example, 20 times in a row, the next spin is no more or less likely to be "black" than on any other spin (see the gambler's fallacy).
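The absence of any "memory" in the wheel can be illustrated by simulation. The Python sketch below assumes a European wheel layout (18 red, 18 black, and 1 green pocket) and conditions on a streak of only 3 reds, because streaks of 20 are too rare to observe in a short simulation; both choices, and the seed, are assumptions of the sketch.

```python
import random

# European wheel layout (an assumption for this sketch): 18 red, 18 black, 1 green.
POCKETS = ["red"] * 18 + ["black"] * 18 + ["green"]

def black_after_red_streak(spins, streak):
    """Relative frequency of black immediately after `streak` consecutive reds."""
    follow = [
        spins[i]
        for i in range(streak, len(spins))
        if all(s == "red" for s in spins[i - streak:i])
    ]
    return sum(s == "black" for s in follow) / len(follow)

rng = random.Random(4)
spins = [rng.choice(POCKETS) for _ in range(200_000)]

overall_black = sum(s == "black" for s in spins) / len(spins)
after_streak = black_after_red_streak(spins, streak=3)
print(overall_black, after_streak)
```

Both frequencies should be close to 18/37 ≈ 0.486, consistent with the spins being independent.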

Example 2

Toss a coin 10 times and record how many times the coin lands on heads.

Independent – The outcome of one toss does not affect the outcome of any other toss, which means the 10 results are independent of each other.
Identically distributed – Regardless of whether the coin is fair (probability 1/2 of heads) or unfair, as long as the same coin is used for each flip, each flip will have the same probability as each other flip.

Such a sequence of i.i.d. variables, each with two possible outcomes, is also called a Bernoulli process.
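Because the tosses are i.i.d., the number of heads in 10 tosses follows a binomial distribution. The Python sketch below computes it exactly with rational arithmetic; the function name `heads_pmf` and the bias 7/10 of the unfair coin are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def heads_pmf(n, p):
    """Exact distribution of the number of heads in n i.i.d. tosses,
    each landing heads with probability p (the binomial distribution)."""
    return [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

fair = heads_pmf(10, Fraction(1, 2))
biased = heads_pmf(10, Fraction(7, 10))  # hypothetical unfair coin

print(fair[5])      # 63/256
print(sum(biased))  # 1
```

The factor `p**k * (1 - p)**(n - k)` is the probability of any one sequence with `k` heads, which is a product precisely because the tosses are independent and share the same `p`.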

Example 3

Roll a die 10 times and record how many times the result is 1.

Independent – Each outcome of the die roll will not affect the next one, which means the 10 results are independent from each other.
Identically distributed – Regardless of whether the die is fair or weighted, each roll will have the same probability as each other roll. In contrast, rolling 10 different dice, some of which are weighted and some of which are not, would not produce i.i.d. variables.

Example 4

Choose a card from a standard 52-card deck, then place the card back in the deck. Repeat this 52 times and record the number of kings that appear.

Independent – The outcome of one draw does not affect the next one, which means the 52 results are independent of each other. In contrast, if each card that is drawn is kept out of the deck, subsequent draws would be affected by it (drawing one king would make drawing a second king less likely), and the results would not be independent.
Identically distributed – Because each card is returned to the deck before the next draw, the probability of drawing a king is 4/52 every time, which means the distribution is identical for each draw.
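The effect of replacing the card can be made concrete with exact arithmetic. The following Python sketch compares the probability of a king on the second draw, given a king on the first, with and without replacement; the variable names are illustrative.

```python
from fractions import Fraction

KINGS, DECK = 4, 52

# With replacement: the deck is restored before every draw.
p_second_king_given_king_with = Fraction(KINGS, DECK)

# Without replacement: one king and one card are missing on the second draw.
p_second_king_given_king_without = Fraction(KINGS - 1, DECK - 1)

# Expected number of kings in 52 draws with replacement: n * p.
expected_kings = DECK * Fraction(KINGS, DECK)

print(p_second_king_given_king_with)     # 1/13
print(p_second_king_given_king_without)  # 1/17
print(expected_kings)                    # 4
```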

Generalizations

Many results that were first proven under the assumption that the random variables are i.i.d. have been shown to be true even under a weaker distributional assumption.

Exchangeable random variables

The most general notion that shares the main properties of i.i.d. variables is that of exchangeable random variables, introduced by Bruno de Finetti.^{[citation needed]} Exchangeability means that while variables may not be independent, future ones behave like past ones. Formally, any value of a finite sequence is as likely as any permutation of those values: the joint probability distribution is invariant under the symmetric group.

This provides a useful generalization – for example, sampling without replacement is not independent, but is exchangeable.
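The claim about sampling without replacement can be verified exactly on a small case. The Python sketch below uses a hypothetical urn with two red balls and one blue ball, draws two balls without replacement, and checks both exchangeability and independence of the two draws.

```python
from fractions import Fraction
from itertools import permutations

URN = ["red", "red", "blue"]  # hypothetical small urn

# Draw two balls without replacement; every ordered pair of positions is equally likely.
draws = list(permutations(range(len(URN)), 2))

def joint(a, b):
    """Exact P(first draw = a, second draw = b)."""
    hits = sum(1 for i, j in draws if URN[i] == a and URN[j] == b)
    return Fraction(hits, len(draws))

def first(a):
    """Marginal probability that the first draw has colour a."""
    return sum(joint(a, b) for b in ("red", "blue"))

def second(b):
    """Marginal probability that the second draw has colour b."""
    return sum(joint(a, b) for a in ("red", "blue"))

exchangeable = joint("red", "blue") == joint("blue", "red")
independent = joint("red", "red") == first("red") * second("red")
print(exchangeable, independent)  # True False
```

Swapping the order of the two draws leaves the joint probabilities unchanged, yet $P(\text{red},\text{red})=1/3$ differs from the product of the marginals, $4/9$.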

Lévy process

In stochastic calculus, i.i.d. variables are thought of as the increments of a discrete-time Lévy process: each variable gives how much the process changes from one time to the next. For example, a sequence of Bernoulli trials is interpreted as the Bernoulli process. One may generalize this to include continuous-time Lévy processes, and many Lévy processes can be seen as limits of i.i.d. variables; for instance, the Wiener process arises as the scaling limit of a random walk with i.i.d. Bernoulli steps.
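A minimal sketch of this limiting behaviour, under assumed settings, sums i.i.d. ±1 steps and rescales them. With $n$ steps per unit time and a scaling factor of $1/{\sqrt {n}}$, the position at time $t$ has mean 0 and variance $t$, matching the corresponding moments of a Wiener process. The step count, time, and seed below are arbitrary choices.

```python
import random
import statistics

def scaled_walk_endpoint(n, t, rng):
    """Position at time t of a +/-1 random walk with n steps per unit time,
    scaled by 1 / sqrt(n)."""
    steps = int(n * t)
    total = sum(1 if rng.random() < 0.5 else -1 for _ in range(steps))
    return total / n ** 0.5

rng = random.Random(5)
t = 2.0
endpoints = [scaled_walk_endpoint(n=100, t=t, rng=rng) for _ in range(3000)]

# A Wiener process W satisfies E[W_t] = 0 and Var(W_t) = t.
print(statistics.fmean(endpoints), statistics.pvariance(endpoints))
```

The sketch checks only the first two moments at a single time; it does not demonstrate convergence of the whole path.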

In machine learning

Machine learning uses large quantities of acquired data to deliver faster, more accurate results.^[7] The historical data used for training must therefore be representative of the overall population. If the data are not representative, the rules learned from them will be poor or wrong.

Under the i.i.d. hypothesis, the number of individual cases needed in the training sample can be greatly reduced.

This assumption also makes maximum likelihood estimation easy to carry out, because it simplifies the likelihood function in optimization problems. Because of the assumption of independence, the likelihood function can be written as a product:

$$l(\theta )=P(x_{1},x_{2},x_{3},\ldots ,x_{n}|\theta )=P(x_{1}|\theta )P(x_{2}|\theta )P(x_{3}|\theta )\cdots P(x_{n}|\theta ).$$

To maximize the probability of the observed event, take the logarithm of the likelihood and maximize it over the parameter $\theta$. That is, compute

$$\mathop {\rm {argmax}} \limits _{\theta }\log(l(\theta )),$$

where

$$\log(l(\theta ))=\log(P(x_{1}|\theta ))+\log(P(x_{2}|\theta ))+\log(P(x_{3}|\theta ))+\cdots +\log(P(x_{n}|\theta )).$$
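As a small worked instance, the Python sketch below evaluates the log-likelihood of i.i.d. Bernoulli observations as a sum of per-observation terms and locates its maximum by a crude grid search. The data, the grid resolution, and the function name `log_likelihood` are hypothetical; in practice the maximum would usually be found analytically or with an optimizer.

```python
import math

def log_likelihood(theta, data):
    """Log-likelihood of i.i.d. Bernoulli(theta) observations (each 0 or 1)."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # hypothetical sample: 7 ones out of 10

# Crude grid search over theta in (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_hat = max(grid, key=lambda th: log_likelihood(th, data))
print(theta_hat)  # 0.7
```

The maximizer coincides with the sample proportion of ones, which is the known closed-form maximum likelihood estimate for the Bernoulli model.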

Working with a sum rather than a product is convenient both analytically and numerically: the derivative of a sum is the sum of the derivatives, and a sum of logarithms avoids the floating-point underflow that can occur when many small probabilities are multiplied together. The log transformation also helps the maximization itself, because it turns the exponential functions that appear in many likelihoods into linear functions of their exponents.
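One practical reason for preferring the log-likelihood on a computer is numerical range. The Python sketch below, using hypothetical per-observation likelihoods of 0.5, shows that the raw product of many probabilities underflows to zero in double-precision arithmetic, while the sum of logarithms remains representable.

```python
import math

probs = [0.5] * 2000  # hypothetical per-observation likelihoods

product = math.prod(probs)                       # 0.5**2000 is below the smallest double
log_sum = math.fsum(math.log(p) for p in probs)  # stays comfortably in range

print(product)  # 0.0
print(log_sum)  # about -1386.29
```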

This hypothesis is convenient in practical applications for two reasons.

First, even if the sample comes from a complex non-Gaussian distribution, it can still be approximated well, because the central limit theorem allows the distribution of sums or averages to be treated as Gaussian: for a large number of observable samples, "the sum of many random variables will have an approximately normal distribution".
Second, the accuracy of the model depends on the simplicity and representative power of the model unit, as well as on the data quality. Simple units are easy to interpret and scale, and combining many units with strong representative power improves model accuracy. In a deep neural network, for example, each neuron is very simple but has strong representative power, and successive layers represent increasingly complex features to improve model accuracy.

References

^ Clauset, Aaron (2011). "A brief primer on probability distributions" (PDF). Santa Fe Institute. Archived from the original (PDF) on 2012-01-20. Retrieved 2011-11-29.
^ Stephanie (2016-05-11). "IID Statistics: Independent and Identically Distributed Definition and Examples". Statistics How To. Retrieved 2021-12-09.
^ Hampel, Frank (1998), "Is statistics too difficult?", Canadian Journal of Statistics, 26 (3): 497–513, doi:10.2307/3315772, hdl:20.500.11850/145503, JSTOR 3315772, S2CID 53117661 (§8).
^ Blum, J. R.; Chernoff, H.; Rosenblatt, M.; Teicher, H. (1958). "Central Limit Theorems for Interchangeable Processes". Canadian Journal of Mathematics. 10: 222–229. doi:10.4153/CJM-1958-026-0. S2CID 124843240.
^ Cover, T. M.; Thomas, J. A. (2006). Elements Of Information Theory. Wiley-Interscience. pp. 57–58. ISBN 978-0-471-24195-9.
^ Casella & Berger 2002, Theorem 1.5.10
^ "What is Machine Learning? A Definition". Expert.ai. 2020-05-05. Retrieved 2021-12-16.
