What are some ways of generating these one-way tables of counts?
Why do you think we care about the random mechanism that generated the data?
Any data analysis requires some assumptions about the data generation process. For continuous data and linear regression, for example, we assume that the response variable has been randomly generated from a normal distribution. For categorical data, we will often assume that data have been generated from a Poisson, binomial, or multinomial distribution. Statistical analysis depends on the data generation mechanism, although depending on the objective, we may be able to ignore that mechanism and simplify our analysis.
The following sampling methods correspond to the distributions considered:
Poisson sampling assumes that the random mechanism to generate the data can be described by a Poisson distribution. It is useful for modeling counts or events that occur randomly over a fixed period of time or in a fixed space. It can also be used as an approximation to the binomial distribution when the success probability of a trial is very small, but the number of trials is very large. For example, consider the number of emails you receive between 4 p.m. and 5 p.m. on a Friday.
Or, let \(X\) be the number of goals scored in a professional soccer game. We may model this as \(X \sim \text{Poisson}(\lambda)\):

\[P(X = x) = \dfrac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, 2, \dots\]
The parameter \(\lambda\) represents the expected number of goals in the game, or the long-run average among all possible such games. The expression \(x!\) stands for \(x\) factorial, i.e., \(x! = 1 \times 2 \times 3 \times \dots \times x\). \(P(X = x)\), or \(P(x)\), is the probability that \(X\) (the random variable representing the unknown number of goals in the game) will take on the particular value \(x\). That is, \(X\) is random, but \(x\) is not.
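To make the model concrete, here is a minimal sketch in Python using SciPy; the rate \(\lambda = 2.6\) goals per game is a hypothetical value chosen only for illustration, not an estimate from real match data.

```python
# Poisson model for the number of goals in a game (sketch).
# lambda = 2.6 is an assumed, purely illustrative long-run average.
from scipy.stats import poisson

lam = 2.6  # assumed expected number of goals per game

# P(X = x) = exp(-lambda) * lambda^x / x! for a few values of x
for x in range(6):
    print(f"P(X = {x}) = {poisson.pmf(x, lam):.4f}")

# The mean and variance of a Poisson random variable are both lambda
print("mean:", poisson.mean(lam), "variance:", poisson.var(lam))
```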
Recall that the mean and variance of a Poisson distribution are the same, i.e., \(E(X) = Var(X) = \lambda\). In practice, however, the observed variance is often larger than the theoretical variance, which for the Poisson means larger than its mean. This is known as overdispersion, an important concept that occurs with discrete data. We assumed that each team has the same probability of scoring goals in each match of the first round, but it is more realistic to assume that these probabilities vary with the teams' skill, the day the matches were played (because of the weather), maybe even the order of the matches, etc. We may then observe more variation in the scoring than the Poisson model predicts. Analyses assuming binomial, Poisson, or multinomial distributions are sometimes invalidated by overdispersion. We will see more on this later when we study logistic regression and Poisson regression models.
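The effect is easy to see in a short simulation. As a sketch, suppose (purely for illustration) that the scoring rate itself varies from game to game according to a gamma distribution with the same overall mean; the resulting counts then have a variance noticeably larger than their mean, unlike counts from a single fixed-rate Poisson model.

```python
# Simulating overdispersion: rates that vary by game (an illustrative
# assumption) inflate the variance of the counts above their mean.
import numpy as np

rng = np.random.default_rng(1)
n_games = 100_000

# Plain Poisson: every game has the same fixed rate lambda = 2.6
fixed_counts = rng.poisson(lam=2.6, size=n_games)

# Rates drawn from a Gamma(shape=2, scale=1.3) distribution (mean 2.6),
# then each game's count is Poisson given its own rate
rates = rng.gamma(shape=2.0, scale=1.3, size=n_games)
mixed_counts = rng.poisson(lam=rates)

print(f"fixed rate:    mean {fixed_counts.mean():.2f}, variance {fixed_counts.var():.2f}")
print(f"varying rates: mean {mixed_counts.mean():.2f}, variance {mixed_counts.var():.2f}")
```

In the second case the sample variance is well above the sample mean, which is exactly the kind of extra variation the plain Poisson model cannot accommodate.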
When data are collected on a pre-determined number of units and are then classified according to two levels of a categorical variable, binomial sampling emerges. Consider a sample of 20 smartphone users, where each individual either uses Android or does not. In this study, there was a fixed number of trials (here, the fixed number of smartphone users surveyed, \(n=20\)), and the researcher counted the number \(X\) of "successes". We can then use the binomial probability distribution (i.e., the binomial model) to describe \(X\).
Binomial distributions are characterized by two parameters: \(n\), which is fixed---this could be the number of trials or the total sample size if we think in terms of sampling---and \(\pi\), which usually denotes a probability of "success". In our example, this would be the probability that a smartphone user uses Android. Please note that some textbooks will use \(\pi\) to denote the population parameter and \(p\) to denote the sample estimate, whereas some may use \(p\) for the population parameters as well. The context should make it clear whether we're referring to a population or sample value. Once we know \(n\) and \(\pi\), the probability of success, we know everything about that binomial distribution, including its mean and variance.
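As a small sketch of these calculations in Python, take \(n = 20\) from the example and an assumed success probability \(\pi = 0.4\) (a hypothetical value used only for illustration):

```python
# Binomial model for the smartphone example (sketch).
# n = 20 comes from the example; pi = 0.4 is an assumed, illustrative value.
from scipy.stats import binom

n, pi = 20, 0.4

# Probability of observing exactly 8 Android users among the 20 surveyed
print("P(X = 8) =", binom.pmf(8, n, pi))

# Once n and pi are known, the mean n*pi and variance n*pi*(1 - pi) follow
print("mean:", binom.mean(n, pi), "variance:", binom.var(n, pi))
```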
Multinomial sampling may be considered a generalization of binomial sampling: data are collected on a pre-determined number of individuals or trials, and each is classified into one of \(k\) mutually exclusive categories. The model assumes that (a) the probability of falling into each category is the same for every trial, and (b) the trials are independent.
The most common violation of these assumptions occurs when clustering is present in the data. Clustering means that some of the trials occur in groups or clusters, and that trials within a cluster tend to have outcomes that are more similar than trials from different clusters. Clustering can be thought of as a violation of either (a) or (b).
In this example, eye color was recorded for \(n = 96\) persons.

| Eye color | Count |
|---|---|
| Brown | 46 |
| Blue | 22 |
| Green | 26 |
| Other | 2 |
| Total | 96 |
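As a sketch of how the multinomial model attaches to a table like this, the code below estimates the category probabilities by the sample proportions \(x_j/n\) and then evaluates the multinomial probability of the observed table at those estimates; the calculation is purely illustrative and not part of the original example.

```python
# Multinomial sketch for the eye color table above.
import numpy as np
from scipy.stats import multinomial

counts = np.array([46, 22, 26, 2])   # Brown, Blue, Green, Other
n = counts.sum()                     # 96 persons in total

pi_hat = counts / n                  # sample proportions estimate pi_j
print("estimated probabilities:", pi_hat.round(3))

# Probability of observing exactly these counts if the true category
# probabilities equal the estimated ones
print("P(observed table) =", multinomial.pmf(counts, n=n, p=pi_hat))
```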
Suppose that the sample included members from the same family as well as unrelated individuals. Persons from the same family are more likely to have similar eye color than unrelated persons, so the assumptions of the multinomial model would be violated. If both parents have brown eyes, it is very likely that their offspring will also have brown eyes. Whereas the eye color of family members related by marriage will not violate the multinomial assumptions, the eye color of blood relatives will.
Now suppose that the sample consisted of "unrelated" persons randomly selected within Pennsylvania. In other words, persons are randomly selected from a list of Pennsylvania residents. If two members of the same family happen to be selected into the sample purely by chance, that's okay; the important thing is that each person on the list has an equal chance of being selected, regardless of who else is selected.
Based on what we've seen so far with the multinomial distribution and multinomial sampling, can you answer the following questions?