# 11.7: The Beta-Bernoulli Process

- Page ID
- 10239

An interesting thing to do in almost any parametric probability model is to randomize

one or more of the parameters. Done in a clever way, this often leads to interesting new models and unexpected connections between models. In this section we will randomize the success parameter in the Bernoulli trials process. This leads to interesting and surprising connections with Pólya's urn process.

## Basic Theory

### Definitions

First, recall that the beta distribution with left parameter \(a \in (0, \infty)\) and right parameter \(b \in (0, \infty)\) is a continuous distribution on the interval \( (0, 1) \) with probability density function \(g\) given by \[ g(p) = \frac{1}{B(a, b)} p^{a-1} (1 - p)^{b-1}, \quad p \in (0, 1) \] where \(B\) is the beta function. So \(B(a, b)\) is simply the normalizing constant for the function \(p \mapsto p^{a-1} (1 - p)^{b-1}\) on the interval \((0, 1)\). Here is our main definition:

Suppose that \(P\) has the beta distribution with left parameter \( a \in (0, \infty) \) and right parameter \( b \in (0, \infty) \). Next suppose that \(\bs{X} = (X_1, X_2, \ldots)\) is a sequence of indicator random variables with the property that given \( P = p \in (0, 1) \), \(\bs{X}\) is a conditionally independent sequence with \[ \P(X_i = 1 \mid P = p) = p, \quad i \in \N_+ \] Then \(\bs{X}\) is the beta-Bernoulli process with parameters \(a\) and \(b\).

In short, given \(P = p\), the sequence \(\bs{X}\) is a Bernoulli trials sequence with success parameter \(p\). In the usual language of reliability, \(X_i\) is the outcome of trial \(i\), where 1 denotes success and 0 denotes failure. For a specific application, suppose that we select a random probability of heads according to the beta distribution with with parameters \( a \) and \( b \), and then toss a coin with this probability of heads repeatedly.

### Outcome Variables

What's our first step? Well, of course we need to compute the finite dimensional distributions of \(\bs{X}\). Recall that for \( r \in \R \) and \( j \in \N \), \( r^{[j]} \) denotes the ascending power \( r (r + 1) \cdots [r + (j - 1)] \). By convention, a product over an empty index set is 1, so \( r^{[0]} = 1 \).

Suppose that \( n \in \N_+ \) and \((x_1, x_2, \ldots, x_n) \in \{0, 1\}^n\). Let \(k = x_1 + x_2 + \cdots + x_n\). Then \[ \P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} \]

## Proof

First, note that \(\P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid P = p) = p^k (1 - p)^{n-k}\) by the conditional independence. Thus, conditioning on \(P\) gives \begin{align} \P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) & = \E[\P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid P)] \\ & = \int_0^1 p^k (1 - p)^{n-k} \frac{1}{B(a, b)} p^{a-1} (1 - p)^{b-1} \, dp \\ & = \frac{B[a + k, b + (n -k)]}{B(a, b)} = \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} \end{align} The last step uses a property of the beta function.

From this result, it follows that Pólya's urn process with parameters \(a, \, b, \, c \in \N_+\) is equivalent to the beta-Bernoulli process with parameters \( a / c \) and \( b / c \), quite an interesting result. Note that since the joint distribution above depends only on \( x_1 + x_2 + \cdots + x_n \), the sequence \( \bs{X} \) is exchangeable. Finally, it's interesting to note that the beta-Bernoulli process with parameters \( a \) and \( b \) could simply be defined as the sequence with the finite-dimensional distributions above, without reference to the beta distribution! It turns out that *every* exchangeable sequence of indicator random variables can be obtained by randomizing the success parameter in a sequence of Bernoulli trials. This is de Finetti's theorem, named for Bruno de Finetti, which is studied in the section on backwards martingales.

For each \(i \in \N_+\)

- \(\E(X_i) = \frac{a}{a + b}\)
- \(\var(X_i) = \frac{a}{a + b} \frac{b}{a + b}\)

## Proof

Since the sequence is exchangeable, \(X_i\) has the same distribution as \(X_1\), so \(\P(X_i = 1) = \frac{a}{a + b}\). The mean and variance now follow from standard results for indicator variables.

Thus \(\bs{X}\) is a sequence of identically distributed variables, quite surprising at first but of course inevitable for any exchangeable sequence. Compare the joint distribution with the marginal distributions. Clearly the variables are dependent, so let's compute the covariance and correlation of a pair of outcome variables.

Suppose that \(i, \; j \in \N_+\) are distinct. Then

- \(\cov(X_i, X_j) = \frac{a \, b}{(a + b)^2 (a + b + 1)}\)
- \(\cor(X_i, X_j) = \frac{1}{a + b + 1}\)

## Proof

Since the variables are exchangeable, \(\P(X_i = 1, X_j = 1) = \P(X_1 = 1, X_2 = 1) = \frac{a}{a + b} \frac{a + 1}{a + b + 1}\). The results now follow from standard formulas for covariance and correlation.

Thus, the variables are positively correlated. It turns out that in any *infinite* sequence of exchangeable variables, the the variables must be nonnegatively correlated. Here is another result that explores how the variables are related.

Suppose that \( n \in \N_+ \) and \( (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \). Let \( k = \sum_{i=1}^n x_i \). Then \[ \P(X_{n+1} = 1 \mid X_1 = x_1, X_2 = x_2, \ldots X_n = x_n) = \frac{a + k}{a + b + n} \]

## Proof

Using the joint distribution, \begin{align*} \P(X_{n+1} = 1 \mid X_1 = x_1, X_2 = x_2, \ldots X_n = x_n) & = \frac{\P( X_1 = x_1, X_2 = x_2, \ldots X_n = x_n, X_{n+1} = 1)}{\P(X_1 = x_1, X_2 = x_2, \ldots X_n = x_n)} \\ & = \frac{a^{[k+1]} b^{[n-k]}}{(a + b)^{[n + 1]}} \frac{(a + b)^{[n]}}{a^{[k]} b^{[n-k]}} = \frac{a + k}{a + b + n} \end{align*}

The beta-Bernoulli model starts with the conditional distribution of \( \bs X \) given \( P \). Let's find the conditional distribution in the other direction.

Suppose that \( n \in \N_+ \) and \( (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \). Let \( k = \sum_{i=1}^n x_i \). Then the conditional distribution of \(P\) given \((X_1 = x_1, X_2, = x_2, \ldots, X_n = x_n)\) is beta with left parameter \(a + k\) and right parameter \(b + (n - k)\). Hence \[ \E(P \mid X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = \frac{a + k}{a + b + k} \]

## Proof

This follows from Bayes' theorem. The conditional PDF \( g(\cdot \mid x_1, x_2, \ldots, x_n) \) is given by \[ g(p \mid X_1, x_2, \ldots, x_n) = \frac{g(p) \P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \mid P = p)}{\int_0^1 g(t) \P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid P = t) dt}, \quad p \in (0, 1) \] The numerator is \[ \frac{1}{B(a, b)} p^{a-1} (1 - p)^{b-1} p^k (1 - p)^{n-k} = \frac{1}{B(a, b)} p^{a + k - 1} (1 - p)^{b + n - k - 1} \] The denominator is simply the normalizing constant for the expression, as a function of \( p \) and is \( B(a + k, b + n - k) / B(a, b) \). Hence \[ g(p \mid k) = \frac{1}{B(a + k, b + n - k)} p^{a + k - 1} (1 - p)^{b + n - k - 1}, \quad p \in (0, 1) \] The last result follows since the mean of the beta distribution is the left parameter divided by the sum of the parameters.

Thus, the left parameter increases by the number of successes while the right parameter increases by the number of failures. In the language of Bayesian statistics, the original distribution of \(P\) is the prior distribution, and the conditional distribution of \(P\) given the data \( (x_1, x_2, \ldots, x_n) \) is the posterior distribution. The fact that the posterior distribution is beta whenever the prior distribution is beta means that the beta distributions is conjugate to the Bernoulli distribution. The conditional expected value in the last theorem is the Bayesian estimate of \( p \) when \( p \) is modeled by the random variable \( P \). These concepts are studied in more generality in the section on Bayes Estimators in the chapter on Point Estimation. It's also interesting to note that the expected values in the last two theorems are the same: If \( n \in \N \), \( (x_1, x_2, \ldots, x_n) \in \{0, 1\}^n \) and \( k = \sum_{i=1}^n x_i \) then \[ \E(X_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = \E(P \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{a + k}{a + b + n} \]

Run the simulation of the beta coin experiment for various values of the parameter. Note how the posterior probability density function changes from the prior probability density function, given the number of heads.

### The Number of Successes

It's already clear that the number of successes in a given number of trials plays an important role, so let's study these variables. For \( n \in \N_+ \), let \[ Y_n = \sum_{i=1}^n X_i \] denote the number of successes in the first \(n\) trials. Of course, \( \bs{Y} = (Y_0, Y_1, \ldots) \) is the partial sum process associated with \( \bs{X} = (X_1, X_2, \ldots) \).

\(Y_n\) has probability density function given by \[ \P(Y_n = k) = \binom{n}{k} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}}, \quad k \in \{0, 1, \ldots, n\} \]

## Proof

Every bit string of length \( n \) with 1 occurring exactly \( k \) times has the probability given in the joint distribution above. There are \( \binom{n}{k} \) such bit strings.

The distribution of \(Y_n\) is known as the beta-binomial distribution with parameters \(n\), \(a\), and \(b\).

In the simulation of the beta-binomial experiment, vary the parameters and note how the shape of the probability density function of \( Y_n \) (discrete) parallels the shape of the probability density function of \( P \) (continuous). For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.

The case where the parameters are both 1 is interesting.

If \( a = b = 1 \), so that \( P \) is uniformly distributed on \( (0, 1) \), then \( Y_n \) is uniformly distributed on \( \{0, 1, \ldots, n\} \).

## Proof

Note that \( 1^{[j]} = j! \) and \( 2^{[j]} = (j + 1)! \) for \( j \in \N \). Hence, from the general PDF \( Y_n \) above \[ \P(Y_n = k) = \frac{n!}{k! (n - k)!} \frac{k! (n - k)!}{(n + 1)!} = \frac{1}{n + 1}, \quad k \in \{0, 1, \ldots, n\} \]

Next, let's compute the mean and variance of \( Y_n \).

The mean and variance of \( Y_n \) are

- \( \E(Y_n) = n \frac{a}{a + b} \)
- \( \var(Y_n) = n \frac{a b}{(a + b)^2} \left[1 + (n - 1) \frac{1}{a + b + 1} \right] \)

## Proof

These results follow from the mean and covariance results given above: \begin{align} \E(Y_n) & = \sum_{i=a}^n \E(X_i) = n \frac{a}{a + b} \\ \var(Y_n) & = \sum_{i=1}^n \sum_{j=1}^n \cov(X_i, X_j) = n \frac{a \, b}{(a + b)^2} + n (n - 1) \frac{a \, b}{(a + b)^2 (a + b + 1)} \end{align}

In the simulation of the beta-binomial experiment, vary the parameters and note the location and size of the mean-standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the true moments.

We can restate the conditional distributions in the last subsection more elegantly in terms of \( Y_n \).

Let \( n \in \N \).

- The conditional distribution of \( X_{n+1} \) given \( Y_n \) is \[\P(X_{n+1} = 1 \mid Y_n) = \E(X_{n+1} \mid Y_n) = \frac{a + Y_n}{a + b + n} \]
- The conditional distribution of \(P\) given \(Y_n\) is beta with left parameter \(a + Y_n\) and right parameter \(b + (n - Y_n)\). In particular \[ \E(P \mid Y_n) = \frac{a + Y_n}{a + b + n} \]

## Proof

The proof is easy using the nesting property of conditional expected value and the fact that the conditional distributions given \( (X_1, X_2, \ldots, X_n) \) depend only on \( Y_n = \sum_{i=1}^n X_i \).

- Note that \begin{align} \E(X_{n+1} \mid Y_n) & = \E[\E(X_{n+1} \mid Y_n) \mid X_1, X_2, \ldots, X_n] \\ & = \E[E(X_{n+1} \mid X_1, X_2, \ldots, X_n) \mid Y_n] = \E\left(\frac{a + Y_n}{a + b + n} \biggm| Y_n\right) = \frac{a + Y_n}{a + b + n} \end{align}
- Similarly, if \( A \subseteq (0, 1) \) is measurable then \( \P(P \in A \mid X_1, X_2, \ldots, X_n) \) depends only on \( Y_n \) and so \begin{align} \P(P \in A \mid Y_n) & = \E[\P(P \in A \mid Y_n) \mid X_1, X_2, \ldots, X_n] \\ & = \E[\P(P \in A \mid X_1, X_2, \ldots, X_n) \mid Y_n] = \P(P \in A \mid Y_n) \end{align}

Once again, the conditional expected value \( \E(P \mid Y_n) \) is the Bayesian estimator of \( p \). In particular, if \(a = b = 1\), so that \( P \) has the uniform distribution on \( (0, 1) \), then \(\P(X_{n+1} = 1 \mid Y_n = n) = \frac{n + 1}{n + 2}\). This is Laplace's rule of succession, another interesting connection. The rule is named for Pierre Simon Laplace, and is studied from a different point of view in the section on Independence.

### The Proportion of Successes

For \( n \in \N_+ \), let \[ M_n = \frac{Y_n}{n} = \frac{1}{n} \sum_{i=1}^n X_i\] so that \( M_n \) is the sample mean of \( (X_1, X_2, \ldots, X_n) \), or equivalently the proportion of successes in the first \( n \) trials. Properties of \( M_n \) follow easily from the corresponding properties of \( Y_n \). In particular, \( \P(M_n = k / n) = \P(Y_n = k) \) for \( k \in \{0, 1, \ldots, n\} \) as given above, so let's move on to the mean and variance.

For \( n \in \N_+ \), the mean and variance of \( M_n \) are

- \( \E(M_n) = \frac{a}{a + b} \)
- \( \var(M_n) = \frac{1}{n} \frac{a b}{(a + b)^2} + \frac{n-1}{n} \frac{a b}{(a + b)^2 (a + b + 1)} \)

## Proof

These results follow from the mean and variance of \( Y_n \) above and properties of expected value and variance:

- \( \E(M_n) = \frac{1}{n} \E(Y_n) \)
- \( \var(M_n) = \frac{1}{n^2} \var(Y_n) \)

So \( \E(M_n) \) is constant in \( n \in \N_+ \) while \( \var(M_n) \to a b / (a + b)^2 (a + b + 1) \) as \( n \to \infty \). These results suggest that perhaps \( M_n \) has a limit, in some sense, as \( n \to \infty \). For an ordinary sequence of Bernoulli trials with success parameter \( p \in (0, 1) \), we know from the law of large numbers that \( M_n \to p \) as \( n \to \infty \) with probability 1 and in mean (and hence also in distribution). What happens here when the success probability \( P \) has been randomized with the beta distribution? The answer is what we might hope.

\( M_n \to P \) as \( n \to \infty \) with probability 1 and in mean square, and hence also in in distribution.

## Proof

Let \( g \) denote the PDF of \( P \). For convergence with probability 1, we condition on \( P \) \begin{align} \P(M_n \to P \text{ as } n \to \infty) & = \E[\P(M_n \to P \text{ as } n \to \infty) \mid P] \\ & = \int_0^1 \P(M_n \to p \text{ as } n \to \infty \mid P = p) g(p) dp = \int_0^1 g(p) dp = 1 \end{align} For convergence in mean square, once again we condition on \( P \). Note that \[ \E[(M_n - P)^2 \mid P = p] = \E[(M_n - p)^2 \mid P = p] = \frac{p (1 - p)}{n} \to 0\ \text{ as } n \to \infty \] Hence by the dominated convergence theorem, \[ \E[(M_n - P)^2] = \int_0^1 \frac{p (1 - p)}{n} g(p) dp \to 0 \text{ as } n \to \infty \]

## Proof of convergence in distribution

Convergence with probability 1 implies convergence in distribution, but it's interesting to gove a direct proof. For \( x \in (0, 1) \), note that \[ \P(M_n \le x) = \P(Y_n \le n x) = \sum_{k=0}^{\lfloor nx \rfloor} \binom{n}{k} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} \] where \( \lfloor \cdot \rfloor \) is the floor function. But recall that \[ \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} = \frac{B(a + k, b + n - k)}{B(a, b)} = \frac{1}{B(a, b)} \int_0^1 p^{a + k - 1} (1 - p)^{b + n - k - 1} dp \] Substituting and doing some algebra we get \[ \P(M_n \le x) = \frac{1}{B(a, b)} \int_0^1 \left[\sum_{k=0}^{\lfloor nx \rfloor} \binom{n}{k} p^k (1 - p)^{n-k}\right] p^{a - 1}(1 - p)^{b - 1} dp \] The sum in the square brackets is \( \P(W_n \le n x) = \P(W_n / n \le x) \) where \( W_n \) has the ordinary binomial distribution with parameters \( n \) and \( p \). But \( W_n / n \) converges (in every sense) to \( p \) as \( n \to \infty \) so \( \P(W_n / n \le x) \to \bs{1}(p \le x) \) as \( n \to \infty \). So by the dominated convergence theorem, \[ \P(M_n \le x) \to \frac{1}{B(a, b)} \int_0^x p^{a - 1} (1 - p)^{b - 1} dp = \P(P \le x)\]

Recall again that the Bayesian estimator of \( p \) based on \( (X_1, X_2, \ldots, X_n) \) is \[ \E(P \mid Y_n) = \frac{a + Y_n}{a + b + n} = \frac{a / n + M_n}{a / n + b / n + 1} \] It follows from the last theorem that \( \E(P \mid Y_n) \to P \) with probability 1, in mean square, and in distribution. The stochastic process \( \bs Z = \{Z_n = (a + Y_n) / (a + b + n): n \in \N\} \) that we have seen several times now is of fundamental importance, and turns out to be a martingale. The theory of martingales provides powerful tools for studying convergence in the beta-Bernoulli process.

### The Trial Number of a Success

For \( k \in \N_+ \), let \( V_k \) denote the trial number of the \( k \)th success. As we have seen before in similar circumstances, the process \( \bs{V} = (V_1, V_2, \ldots) \) can be defined in terms of the process \( \bs{Y} \): \[ V_k = \min\{n \in \N_+: Y_n = k\}, \quad k \in \N_+ \] Note that \(V_k\) takes values in \(\{k, k + 1, \ldots\}\). The random processes \(\bs{V} = (V_1, V_2, \ldots)\) and \(\bs{Y} = (Y_0, Y_1, \ldots)\) are inverses of each other in a sense.

For \(k \in \N \) and \(n \in \N_+\) with \(k \le n\),

- \(V_k \le n\) if and only if \(Y_n \ge k\)
- \(V_k = n\) if and only if \(Y_{n-1} = k - 1\) and \(X_n = 1\)

The probability denisty function of \(V_k\) is given by \[ \P(V_k = n) = \binom{n - 1}{k - 1} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}}, \quad n \in \{k, k + 1, \ldots\} \]

## Proof 1

As usual, we can condition on \( P \) and use known results for ordinary Bernoulli trials. Given \( P = p \), random variable \( V_k \) has the negative binomial distribution with parameters \( k \) and \( p \). Hence \begin{align*} \P(V_k = n) & = \int_0^1 \P(V_k = n \mid P = p) g(p) dp = \int_0^1 \binom{n - 1}{k - 1} p^k (1 - p)^{n-k} \frac{1}{B(a, b)} p^{a-1} (1 - p)^{b - 1} dp \\ & = \binom{n - 1}{k - 1} \frac{1}{B(a, b)} \int_0^1 p^{a + k - 1} (1 - p)^{b + n - k - 1} dp \\ & = \binom{n - 1}{k - 1} \frac{B(a + k, b + n - k)}{B(a, b)} = \binom{n - 1}{k - 1} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} \end{align*}

## Proof 2

In this proof, we condition on \( Y_{n-1} \). Using the PDF of \( Y_{n-1} \) and the result above, \begin{align*} \P(V_k = n) & = \P(Y_{n-1} = k - 1, X_n = 1) = \P(Y_{n-1} = k - 1) \P(X_n = 1 \mid Y_{n-1} = k - 1) \\ & = \binom{n - 1}{k - 1} \frac{a^{[k - 1]} b^{[(n - 1) - (k - 1)]}}{(a + b)^{[n-1]}} \frac{a + k - 1}{a + b + (n - 1)} = \binom{n - 1}{k - 1} \frac{a^{[k]} b^{[n-k]}}{(a + b)^{[n]}} \end{align*}

The distribution of \( V_k \) is known as the beta-negative binomial distribution with parameters \( k \), \( a \), and \( b \).

If \(a = b = 1\) so that \( P \) is uniformly distributed on \( (0, 1) \), then \[ \P(V_k = n) = \frac{k}{n (n + 1)}, \quad n \in \{k, k + 1, k + 2, \ldots\} \]

## Proof

Recall again that \( 1^{[j]} = j! \) and \( 2^{[j]} = (j + 1)! \) for \( j \in \N \). Hence from the previous result, \[ \P(V_k = n) = \frac{(n - 1)!}{(k - 1)! (n - k)!} \frac{k! (n - k)!}{(n + 1)!} = \frac{k}{n (n + 1)}, \quad n \in \{k, k + 1, \ldots\} \]

In the simulation of the beta-negative binomial experiment, vary the parameters and note the shape of the probability density function. For various values of the parameters, run the simulation 1000 times and compare the empirical density function to the probability density function.

The mean and variance of \( V_k \) are

- \( \E(V_k) = k \frac{a + b - 1}{a - 1} \) if \( a \gt 1 \).
- \( \var(V_k) = k \frac{a + b - 1}{(a - 1)(a - 2)} [b + k (a + b - 2)] - k^2 \left(\frac{a + b - 1}{a - 1}\right)^2 \)

## Proof

From our work with the negative binomial distribution we know that \( \E(V_k | P = p) = k \frac{1}{p} \) and \( \E(V_k^2 | P = p) = k \frac{1 - p}{p^2} + \frac{k^2}{p^2} \). Thus, conditioning on \( P \) we have \[ \E(V_k) = \E[\E(V_k | P)] = \int_0^1 \frac{k}{p} \frac{p^{a-1} (1 - p)^{b - 1}}{B(a, b)} = k \frac{B(a - 1, b)}{B(a, b)} = k \frac{a + b - 1}{a - 1}\] which gives part (a). Similarly \begin{align} \E(V_k^2) & = \E[\E(V_k^2 | P)] = \int_0^1 \left(k \frac{1 - p}{p^2} + \frac{k^2}{p^2}\right) \frac{p^{a - 1} (1 - p)^{b - 1}}{B(a, b)}\\ & = k \frac{B(a - 2, b + 1)}{B(a, b)} + k^2 \frac{B(a - 2, b)}{B(a, b)} = k \frac{b (a + b - 2)}{(a - 1)(a - 2)} + k^2 \frac{(a + b - 1)(a + b - 2)}{(a - 1)(a - 2)} \end{align} Simplifying and using part (a) gives part (b).

In the simulation of the beta-negative binomial experiment, vary the parameters and note the location and size of the mean\(\pm\)standard deviation bar. For various values of the parameters, run the simulation 1000 times and compare the empirical moments to the true moments.