Kurzius Math Notes

1.3 Data Collection and Experimental Design

Vocabulary words: 16

Design of a Statistical Study

Any conclusion based on a statistical study is only as good as the process used to design the study. You’ll need to be familiar with how to design a statistical study so you can judge whether its results should be trusted.

Designing a Statistical Study

Identify the variables of interest (or the focus) and the population of the study.
Develop a detailed plan for collecting data. If you use a sample, make sure the sample is representative of the population.
Collect the data.
Describe the data, using descriptive statistics techniques.
Interpret the data and make decisions about the population using inferential statistics.
Identify any possible errors.

Statistical studies are typically either an observational study or an experiment.

In an observational study, a researcher does not influence the responses. The researchers do their absolute best to not interfere with the study, for fear of contaminating the results.

In an experiment, a researcher deliberately applies a treatment before observing the responses. Since experiments involve interaction with the subjects, there are two groups set up: a treatment group and a control group. The treatment group receives whatever is being studied, say a drug, while the control group does not. The responses from both groups are then recorded and studied.

Sometimes, to deal with the human element, a placebo (a fake, harmless treatment made to look like the real one) is given to the control group. The reason is the placebo effect, which comes up again under Control below.

Data Collection

For collecting data, there are two ways we’re going to focus on here. One is a simulation, where you use a model to reproduce the conditions of a situation or process. Simulations are typically run on a computer, and they are used for situations that are either impractical or dangerous to create in real life. One example is an automobile company simulating car crashes to observe the effect on the people in the car.
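To make the idea of a simulation concrete, here is a minimal sketch that uses a computer model in place of a real process. The situation (rolling two fair dice and checking for a sum of 7) and the function name simulate_two_dice are made up for illustration and are not from the notes; a real study would model whatever process it cares about.

```python
import random

def simulate_two_dice(trials, seed=None):
    """Estimate the chance that two fair six-sided dice sum to 7.

    Hypothetical example: the dice stand in for whatever real process
    a study would model.
    """
    rng = random.Random(seed)  # a seed makes the run repeatable
    hits = 0
    for _ in range(trials):
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        if roll == 7:
            hits += 1
    return hits / trials

estimate = simulate_two_dice(100_000, seed=1)
print(f"Estimated chance of rolling a 7: {estimate:.3f} (exact value is 1/6)")
```

The more trials the simulation runs, the closer the estimate tends to land to the exact value.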

The other is a survey, which is an investigation of one or more characteristics of a population. Surveys are generally people asking other people questions. The wording of survey questions is incredibly important, because a poorly worded question can push people toward a particular answer. So is the decision about whether to give respondents a set of options to choose from.

Experimental Design

Three key elements of a well-designed experiment are control, randomization, and replication.

Control

Controlling the variables at play is important to a well-designed experiment. One thing that can work against this is a confounding variable. Confounding occurs when an experimenter cannot tell the difference between the effects of different factors on the variable being studied. If a coffee shop owner remodels to attract more customers, but at the same time a nearby mall has a grand opening, it will be tough to tell which of the two is responsible for any change in the number of customers.

Another issue is the placebo effect, where individuals react as if they received a treatment, when in fact they were given a placebo. Making sure they don’t know if they got the placebo or the treatment is helpful in reducing this strange effect. Double-blind experiments take this even further. In a double-blind experiment, neither the experimenter nor the subjects know if the subjects are receiving a treatment or a placebo.

Randomization

Randomization is a process of randomly assigning subjects to different treatment groups. Randomizing participants helps control the effect of outside variables on the experiment. Giving all the men the placebo and women the treatment would lead to questionable results.
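As a rough sketch of random assignment, the snippet below shuffles a list of subjects and splits it in half. The function name assign_groups and the subject labels are hypothetical, chosen only for illustration.

```python
import random

def assign_groups(subjects, seed=None):
    """Randomly split subjects into a treatment group and a control group.

    With an odd number of subjects, the control group gets the extra one.
    """
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels.
subjects = [f"subject_{i}" for i in range(1, 11)]
treatment, control = assign_groups(subjects, seed=42)
print("Treatment:", treatment)
print("Control:  ", control)
```

Because the shuffle ignores everything about the subjects, traits like sex or age should end up spread across both groups by chance rather than by the experimenter's choice.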

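Randomization can take on other forms as well. Sometimes subjects are first grouped by a shared characteristic, say age, and then the members of each group are randomly assigned to either the control or the treatment group. This is called a randomized block design. Subjects can also be paired, so that for each pair of really similar people (think age, sex, location, etc.), one is placed in the control group and the other gets the treatment. This is called a matched-pairs design.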
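Both of these variations can be sketched in a few lines. This is only an illustration under some assumptions: the names randomized_block, matched_pairs, and age_block, the people, and the age cutoff of 40 are all invented here.

```python
import random
from collections import defaultdict

def randomized_block(subjects, block_of, seed=None):
    """Group subjects into blocks, then randomize within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)
    treatment, control = [], []
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        treatment.extend(members[:half])
        control.extend(members[half:])
    return treatment, control

def matched_pairs(pairs, seed=None):
    """For each pair of similar subjects, randomly pick who gets the treatment."""
    rng = random.Random(seed)
    treatment, control = [], []
    for first, second in pairs:
        if rng.random() < 0.5:  # coin flip to decide the order
            first, second = second, first
        treatment.append(first)
        control.append(second)
    return treatment, control

def age_block(person):
    """Hypothetical blocking rule: split at age 40."""
    return "under 40" if person[1] < 40 else "40 and over"

# Made-up (name, age) subjects.
people = [("Ana", 23), ("Ben", 31), ("Cy", 28), ("Dee", 35),
          ("Eli", 52), ("Fay", 61), ("Gus", 47), ("Hal", 58)]
block_treatment, block_control = randomized_block(people, age_block, seed=5)
print("Block design treatment group:", block_treatment)

pairs = [("Ana", "Cy"), ("Eli", "Hal")]  # made-up pairs of similar people
pair_treatment, pair_control = matched_pairs(pairs, seed=5)
print("Matched-pairs treatment group:", pair_treatment)
```

In the block version each age group contributes equally to both sides, so age cannot line up with the treatment by accident.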

Replication

Replication is the repetition of an experiment under the same or similar conditions, and it is necessary to improve the validity of an experiment’s results. Flukes happen, so being able to do the experiment again with a group of similar make-up strengthens the results. Being of similar make-up is key here: an experiment involving only people over the age of 50, followed by a second with only 20-year-olds, wouldn’t be as convincing.

Sampling Techniques

A census is a count or measure of an entire population. By contrast, a sampling is a count or measure of part of a population. Keep in mind that efforts must be made so that the sample represents the population as well as possible. Despite those efforts, differences still arise. A sampling error is the difference between the results of a sample and those of the population.
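Here is a small sketch of sampling error, assuming the result being compared is the mean. The function name sampling_error and the population of ages are made up for illustration.

```python
import random
from statistics import mean

def sampling_error(sample, population):
    """Difference between the sample mean and the population mean."""
    return mean(sample) - mean(population)

rng = random.Random(7)
# Made-up population: the ages of 1,000 people.
population = [rng.randint(18, 90) for _ in range(1000)]
sample = rng.sample(population, 50)

print("Population mean:", round(mean(population), 2))
print("Sample mean:    ", round(mean(sample), 2))
print("Sampling error: ", round(sampling_error(sample, population), 2))
```

A census has no sampling error, since the sample and the population are the same list.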

Random samples are one way to help ensure a representative slice of the population is surveyed. In a simple random sample, every possible sample of the same size has the same chance of being selected, which means everyone in the population has an equal chance of being chosen. The most straightforward way to do that would be to assign a number to everyone in the population, and then choose numbers at random.

Humans are pretty bad at choosing random numbers off the top of our heads (even with just single digits, you might lean towards even numbers, or forget one of the digits completely), so tools have been developed to produce sufficiently random numbers. An easy source of random numbers is the website RANDOM.ORG. If you want a technology-free source, dice are good, preferably a 10-sided die, since it has one face for each digit.
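If you happen to have Python handy, its built-in random module is another option. The sketch below assumes a population of 500 people numbered 1 through 500; the function name simple_random_sample and those sizes are made up for illustration. Note that the module produces pseudorandom numbers, which is generally fine for classroom sampling.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick sample_size distinct numbers from 1..population_size.

    Every possible set of that size is equally likely to be chosen.
    """
    rng = random.Random(seed)
    return rng.sample(range(1, population_size + 1), sample_size)

chosen = simple_random_sample(population_size=500, sample_size=10, seed=3)
print("Survey the people numbered:", sorted(chosen))
```

Leaving out the seed gives a different sample on each run; passing one makes the selection repeatable, which is handy if someone wants to check your work.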

Here are some other random sampling methods:

A stratified sample is used when it’s important to have members from each segment of the population. Subjects are grouped based on some characteristic, say income, and then random people are selected from each group.
A cluster sample is when a population is divided into groups, and then the entirety of one or more groups is surveyed. It’s important to note that each group should be fairly representative of the population, with the only thing its members share being that they are close to each other and therefore easier to survey. If a whale watching company occasionally surveyed everyone in a group they took out, that would be cluster sampling.
A systematic sample is similar to a simple random sample. It starts with everyone in a population being assigned a number, but instead of choosing every number at random, a starting number is chosen at random and then every $n$th member after it is chosen.
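The three methods above can be sketched side by side. This is only an illustration: the function names, the income labels, and the trip names are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Randomly pick per_stratum members from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly pick whole clusters and survey everyone in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [member for label in chosen for member in clusters[label]]

def systematic_sample(population, step, seed=None):
    """Pick a random starting point, then take every step-th member."""
    rng = random.Random(seed)
    start = rng.randrange(step)
    return population[start::step]

# Made-up population: (id, income level) for 90 people.
levels = ["low", "middle", "high"]
people = [(i, levels[i % 3]) for i in range(1, 91)]
by_income = stratified_sample(people, lambda p: p[1], per_stratum=4, seed=2)
print("Stratified:", by_income)

# Made-up clusters: passengers on three whale watching trips.
trips = {"trip_A": [1, 2, 3], "trip_B": [4, 5, 6], "trip_C": [7, 8, 9]}
print("Cluster:", cluster_sample(trips, num_clusters=1, seed=2))

numbered = list(range(1, 101))
every_tenth = systematic_sample(numbered, step=10, seed=2)
print("Systematic:", every_tenth)
```

The stratified version guarantees every income level appears, the cluster version surveys one whole trip, and the systematic version spaces its picks evenly through the numbered list.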

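One method that should be avoided is a convenience sample. That’s just a sample made up of whichever members of the population are easiest to reach. Because the easy-to-reach members are often not typical, a convenience sample is prone to bias and may not represent the population.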