1.4 - Sampling Schemes

What are some ways of generating these one-way tables of counts?

Why do you think we care about the random mechanism that generated the data?

Any data analysis requires some assumptions about the data generation process. For continuous data and linear regression, for example, we assume that the response variable has been randomly generated from a normal distribution. For categorical data, we will often assume that data have been generated from a Poisson, binomial, or multinomial distribution. Statistical analysis depends on the data generation mechanism, although depending on the objective, we may be able to ignore that mechanism and simplify our analysis.

The following sampling methods correspond to the distributions considered:

Poisson Sampling

Poisson sampling assumes that the random mechanism to generate the data can be described by a Poisson distribution. It is useful for modeling counts or events that occur randomly over a fixed period of time or in a fixed space. It can also be used as an approximation to the binomial distribution when the success probability of a trial is very small, but the number of trials is very large. For example, consider the number of emails you receive between 4 p.m. and 5 p.m. on a Friday.

Or, let \(X\) be the number of goals scored in a professional soccer game. We may model this as \(X\sim\) Poisson(\(\lambda\)):

\(P(X=x)=\dfrac{\lambda^x e^{-\lambda}}{x!},\qquad x=0,1,2,\ldots\)

The parameter \(\lambda\) represents the expected number of goals in the game or the long-run average among all possible such games. The expression \(x!\) stands for \(x\) factorial, i.e., \(x!=1\cdot 2\cdot 3\cdots x\). \(P(X=x)\), or \(P(x)\), is the probability that \(X\) (the random variable representing the unknown number of goals in the game) will take on the particular value \(x\). That is, \(X\) is random, but \(x\) is not.
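As a minimal sketch, the Poisson probabilities \(P(X=x)=\lambda^x e^{-\lambda}/x!\) can be computed directly in Python. The value \(\lambda=2.5\) goals per game and the binomial settings \(n=1000\), \(\pi=0.0025\) are illustrative assumptions, not estimates from data, and the function names are hypothetical. The second part checks the approximation mentioned earlier: with many trials and a small success probability, binomial probabilities are close to Poisson probabilities with \(\lambda=n\pi\).

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return lam**x * math.exp(-lam) / math.factorial(x)


def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Binomial(n, pi)."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)


# Assumed rate for illustration only: 2.5 goals per game on average.
lam = 2.5
for x in range(6):
    print(f"P(X = {x}) = {poisson_pmf(x, lam):.4f}")

# Poisson approximation to the binomial: n large, pi small, lam = n * pi.
n, pi = 1000, 0.0025  # assumed values, chosen so that n * pi = 2.5
max_diff = max(
    abs(binomial_pmf(x, n, pi) - poisson_pmf(x, n * pi)) for x in range(21)
)
print(f"largest pointwise difference for x = 0..20: {max_diff:.6f}")
```

Changing `n` and `pi` while holding `n * pi` fixed shows how the agreement improves as the success probability shrinks.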

The Poisson Model (distribution) Assumptions

  1. Independence: Events must be independent (e.g., the number of goals scored by one team should not make the number of goals scored by another team more or less likely).
  2. Homogeneity: The mean number of goals scored is assumed to be the same for all teams.
  3. Fixed interval: The time period (or space) must be fixed.

Recall that the mean and variance of a Poisson distribution are the same; i.e., \(E(X) = Var(X) = \lambda\). In practice, however, the observed variance is often larger than the theoretical variance, which for the Poisson means larger than the mean. This is known as overdispersion, an important concept that arises with discrete data. We assumed that each team has the same probability of scoring goals in each match of the first round, but it is more realistic to assume that these probabilities vary with the team's skill, the day the match was played (because of the weather), maybe even the order of the matches, etc. We may then observe more variation in the scoring than the Poisson model predicts. Analyses assuming binomial, Poisson, or multinomial distributions are sometimes invalid because of overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.
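A small simulation can illustrate how a violation of homogeneity produces overdispersion. This is a sketch under assumed values, not an analysis of real match data: the rates 1.5, 0.5, and 2.5, the number of simulated games, and the helper `sample_poisson` (Knuth's multiplication method, used here only to avoid external libraries) are all illustrative choices. In the heterogeneous case each game's rate is either 0.5 or 2.5 with equal probability, so the average rate is still 1.5, yet the counts vary more than a single Poisson distribution allows.

```python
import math
import random
import statistics


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)  # fixed seed so the run is reproducible
n_games = 20000         # assumed number of simulated games

# Homogeneous case: every game has the same rate.
homogeneous = [sample_poisson(1.5, rng) for _ in range(n_games)]

# Heterogeneous case: the rate differs from game to game (mean rate still 1.5).
heterogeneous = [
    sample_poisson(rng.choice([0.5, 2.5]), rng) for _ in range(n_games)
]

for label, counts in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    mean = statistics.mean(counts)
    var = statistics.variance(counts)
    print(f"{label:>13}: mean = {mean:.3f}, variance = {var:.3f}, ratio = {var / mean:.3f}")
```

The variance-to-mean ratio should be close to 1 in the homogeneous case and clearly above 1 in the heterogeneous case, which is the pattern described as overdispersion.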

Binomial Sampling

When data are collected on a pre-determined number of units and each unit is then classified into one of two levels of a categorical variable, we have binomial sampling. Consider the sample of 20 smartphone users, where each individual either uses Android or not. In this study, there was a fixed number of trials (i.e., a fixed number of smartphone users surveyed, \(n=20\)), and the researcher counted the number \(X\) of "successes". We can then use the binomial probability distribution (i.e., the binomial model) to describe \(X\).

Binomial distributions are characterized by two parameters: \(n\), which is fixed (this could be the number of trials or the total sample size if we think in terms of sampling), and \(\pi\), which usually denotes a probability of "success". In our example, this would be the probability that a smartphone user uses Android. Please note that some textbooks will use \(\pi\) to denote the population parameter and \(p\) to denote the sample estimate, whereas some may use \(p\) for the population parameter as well. The context should make it clear whether we're referring to a population or sample value. Once we know \(n\) and \(\pi\), the probability of success, we know everything about that binomial distribution, including its mean and variance.
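To make the last point concrete, the sketch below builds the full binomial distribution for the smartphone example from \(n\) and \(\pi\) alone and checks the standard results \(E(X)=n\pi\) and \(Var(X)=n\pi(1-\pi)\). The sample size \(n=20\) comes from the example; the value \(\pi=0.4\) is a hypothetical choice for illustration, not an estimate of actual Android usage.

```python
import math


def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Binomial(n, pi)."""
    return math.comb(n, x) * pi**x * (1 - pi) ** (n - x)


n = 20    # number of smartphone users surveyed (from the example)
pi = 0.4  # assumed probability that a user uses Android (hypothetical)

# The whole distribution follows from n and pi.
pmf = [binomial_pmf(x, n, pi) for x in range(n + 1)]

# Mean and variance computed from the distribution itself.
mean = sum(x * p for x, p in enumerate(pmf))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(pmf))

print(f"mean     = {mean:.4f}  (n * pi            = {n * pi:.4f})")
print(f"variance = {variance:.4f}  (n * pi * (1 - pi) = {n * pi * (1 - pi):.4f})")
```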

Binomial Model (distribution) Assumptions