Consider the example of a marketing campaign. The uncertainty it illustrates is called sampling error, something you must contend with in any test that does not include the entire population of interest. Redman notes that there are two main contributors to sampling error: the size of the sample and the variation in the underlying population. Sample size may be intuitive enough: think about flipping a coin five times versus flipping it many more times. Of course, showing the campaign to more people costs more, so you have to balance the need for a larger sample size against your budget.
Variation is a little trickier to understand, but Redman insists that developing a sense for it is critical for all managers who use data. Consider the images below: each expresses a different possible distribution of customer purchases under Campaign A. In the chart on the left, with less variation, most people spend roughly the same amount of money.
Compare that to the chart on the right, with more variation: here, people vary more widely in how much they spend. The average is still the same, but quite a few people spend considerably more or less than that average. If you pick a customer at random, chances are higher that they are pretty far from the average.
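To make the two contributors concrete, here is a purely illustrative R simulation (the spending figures are invented): it draws repeated samples of different sizes from two hypothetical customer populations with the same mean spend but different variation, and reports how far the sample mean typically lands from the true mean.

set.seed(1)

# Two hypothetical customer populations with the same mean spend ($100)
# but different variation: sd = 10 (low) versus sd = 50 (high).
typical_error <- function(n, sd, reps = 10000) {
  # Average distance between the sample mean and the true mean of 100
  sample_means <- replicate(reps, mean(rnorm(n, mean = 100, sd = sd)))
  mean(abs(sample_means - 100))
}

for (sd in c(10, 50)) {
  for (n in c(5, 50, 500)) {
    cat("sd =", sd, " n =", n, " typical sampling error =",
        round(typical_error(n, sd), 2), "\n")
  }
}

The output shows both effects at once: the typical error shrinks as the sample grows and increases with the variation in the population.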
To summarize, the greater the variation in the underlying population, the larger the sampling error. In the U.S., election years bring a steady stream of polls in the months leading up to the election, announcing which candidates are up and which are down in the horse race of popular opinion.
If you have ever wondered what makes these polls accurate, and how each poll decides how many voters to talk to, then you have thought like a researcher who wants to know how many participants are needed to obtain statistically significant survey results.
Statistically significant results are those in which the researchers can be confident that their findings are not due to chance. Calculating sample sizes can be difficult even for expert researchers. Here, we show how to calculate sample size for a variety of research designs. Before jumping into the details, it is worth noting that formal sample size calculations are often based on the premise that researchers are conducting a representative survey with probability-based sampling techniques.
Probability-based sampling ensures that every member of the population being studied has an equal chance of participating and that respondents are selected at random. For a variety of reasons, probability sampling is not feasible for most behavioral studies conducted in industry and academia. As a result, we outline the steps required to calculate sample sizes for probability-based surveys and then extend the discussion to non-probability surveys.
Determining how many people you need to sample in a survey study can be difficult. How difficult? The formulas that sample size calculations are based on are intimidating enough that no one wants to work through them by hand just to know how many people to sample. Fortunately, several online sample size calculators simplify the job. Even if you use a sample size calculator, however, you still need to know some important details about your study.
Specifically, you need to know your population size, your margin of error, and your confidence level. Population size is the total number of people in the group you are trying to study. If, for example, you were conducting a poll of U.S. voters, your population would be all eligible voters. For many studies, however, the population is harder to pin down: if you are studying potential customers of a digital marketing product, everyone who is currently engaged in digital marketing may be a potential customer. In situations like these, you can often use industry data or other information to arrive at a reasonable estimate of your population size.
Margin of error is a percentage that tells you how much the results from your sample may deviate from the views of the overall population.
The smaller your margin of error, the closer your data reflect the opinion of the population at a given confidence level.
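Most online calculators implement some version of the textbook formula for estimating a proportion, n = z^2 p (1 - p) / e^2, with a correction for finite populations. The R sketch below is an illustrative version under those assumptions (p = 0.5 as the most conservative choice); a particular calculator may differ in the details.

# Sample size for estimating a proportion with a given margin of error.
# Textbook formula plus finite-population correction; illustrative only.
sample_size <- function(margin_error, confidence = 0.95,
                        population = Inf, p = 0.5) {
  z  <- qnorm(1 - (1 - confidence) / 2)       # e.g., 1.96 for 95% confidence
  n0 <- z^2 * p * (1 - p) / margin_error^2    # infinite-population sample size
  if (is.finite(population)) {
    n0 <- n0 / (1 + (n0 - 1) / population)    # finite-population correction
  }
  ceiling(n0)
}

sample_size(margin_error = 0.05)                       # about 385 respondents
sample_size(margin_error = 0.03, population = 10000)   # about 965 respondents

Tightening the margin of error or raising the confidence level drives the required sample up quickly; a smaller population only lowers it modestly through the correction term.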
Garcia-Marques et al. make this point more generally. As a rule of thumb, the interaction will not be smaller than both main effects when the lines in the interaction plot touch or cross each other at some point.
The numbers of participants required are very similar to those for the situation depicted in the middle panel of Figure 3, because the interaction is the same; the remaining small difference is due to the extra requirements of the pairwise post hoc tests. When the performance of two groups is compared, researchers often use a so-called split-plot design, with one between-groups variable and one repeated-measures factor.
Indeed, researchers often wonder whether such a design is not more powerful than a simple between-groups comparison. Suppose you want to examine whether students with dyslexia are disadvantaged in naming pictures of objects. What is to be preferred? A simple one-way design in which you compare students with dyslexia and controls on picture naming, or a split-plot design in which the group difference shows up as an interaction with a repeated-measures factor? For a long time, the author was convinced that the latter option was to be preferred (because of what power calculators suggested), but is this confirmed in simulations?
Before we start with the interactions, it is good to have a look at the main effect of the repeated-measures factor. In a first scenario, the between-groups variable is not expected to have a main effect or to interact with the repeated-measures factor.
It just increases the complexity of the design. In a second scenario, the Latin-square group interacts with the repeated-measures variable.
One stimulus set is easier than the other, and this introduces an effect of equal size. How many participants do we need in such a scenario to find the main effect of the repeated-measures variable with adequate power? The required number turns out to be barely affected. This is interesting news, because it tells us that we can add extra between-groups control factors to our design without having much impact on the power to detect the main effect of the repeated-measures variable, as was indeed argued by Pollatsek and Well. We can also look at the power of the between-groups variable.
Is it the same as for the between-groups t-test, or does the fact that we have two observations per participant make a difference? And does the outcome depend on the correlation between the levels of the repeated-measures variable? Here are the results: the lower the correlation between the levels of the repeated-measures variable, the smaller the number of participants required.
This can be understood: highly correlated data do not add much new information and do little to reduce the noise in the data, whereas uncorrelated data do. When the interaction is the focus of attention, we have to make a distinction between the three types of interactions illustrated in Figure 3.
The fully crossed interaction is most likely to be found with control variables (e.g., Latin-square groups). The other two interactions are more likely to be of theoretical interest. If we only look at the significance of the interaction, then two groups of 27 participants each are enough for an F-test.
Half of the time, however, the interaction will not be accompanied by the right pattern of post hoc effects within the groups. For the complete pattern to be present, we need two groups of 67 participants for the F-test, and still larger groups for the Bayesian analysis. So, a split-plot design is not more powerful than a between-subjects design in terms of the number of participants required.
It does give more information, though, because it adds information about a possible main effect of the between-groups variable and about the group dependency of the repeated-measures effect. Notice how different this outcome is from the conviction mentioned in the introduction that you can find small effect sizes in a split-plot design with 15 participants per group.
When seeing this type of output, it is good to keep in mind that you need 50 participants for a typical effect in a t-test with related samples. This not only sounds too good to be true; it is also too good to be true.
Some authors have recommended using an analysis of covariance for situations in which the pretest and posttest scores of two groups of people are compared. To check whether this analysis is more powerful, we ran simulations to determine the minimum number of participants required. The power is indeed somewhat higher, but not to such an extent that a researcher can observe typical effect sizes with groups of 25 participants, as erroneously assumed by the researchers mentioned in the introduction.
Indeed, samples of 20 to 24 participants were for a long time the norm in experimental psychology. There are two reasons for this. The first is the illusion of sufficient power based on significant p-values, as explained in the introduction (unreplicable studies are a problem in experimental psychology too). The second is that repeated-measures designs can yield larger standardized effect sizes: by increasing the correlation between the two levels of the repeated measure, you can increase d_z relative to d_av.
The effect sizes reported in meta-analyses often are d_av or a mixture of d_av and d_z. d_av is preferred for meta-analysis because it allows researchers to compare results from between-groups designs and repeated-measures designs.
However, it cannot always be calculated, because the authors of the original studies do not provide enough information in their articles. So, in all likelihood, d_z values are often included in meta-analyses as well. Still, d_z is the value that matters for power analysis. Given the standard relation d_z = d_av / sqrt(2(1 - r)) (assuming equal variances in the two conditions), d_z will be larger than d_av whenever the correlation r between the conditions exceeds .5.
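To see what this relation implies for sample sizes, here is a small R sketch; the value d_av = 0.4 and the power level of .90 are arbitrary illustration values, and power.t.test with type = "paired" treats sd as the standard deviation of the difference scores.

# d_z = d_av / sqrt(2 * (1 - r)) under the equal-variance assumption.
dz_from_dav <- function(d_av, r) d_av / sqrt(2 * (1 - r))

# Participants needed for a paired t-test at different correlations
# (alpha = .05, power = .90, assumed d_av = 0.4).
for (r in c(0.25, 0.50, 0.75, 0.90)) {
  dz <- dz_from_dav(0.4, r)
  n  <- ceiling(power.t.test(delta = dz, sd = 1, power = 0.90, type = "paired")$n)
  cat("r =", r, " d_z =", round(dz, 2), " n =", n, "participants\n")
}

The loop makes the point of the preceding paragraph concrete: as the correlation between the two conditions rises, d_z grows and the required number of participants drops sharply.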
The correlation between two variables depends on the reliability of the variables: noisy variables with low reliabilities do not correlate much with each other, because they do not even correlate much with themselves. So, by increasing the reliability of the measurements, we can increase d_z in a repeated-measures design. Most cognitive researchers have an intuitive understanding of the requirement for reliable measurements, because they rarely rely on a single observation per participant per condition.
A perception psychologist investigating the relationship between stimulus intensity and response speed is unlikely to have each participant respond to each stimulus intensity only once. Instead, they will ask the participant to respond, say, 40 times to every stimulus intensity and take the average reaction time. Similarly, a psycholinguist studying the impact of a word variable (say, concreteness) on word recognition is unlikely to present a single concrete and a single abstract word to each participant.
Instead, they will present some 40 concrete and 40 abstract words, also because they want to generalize the findings across stimuli. Zwaan et al. similarly used multiple observations per condition in their replication studies. Decreasing the noise by averaging over multiple observations per participant and per condition has a second advantage: it decreases the variance within the conditions. This is also true for between-groups studies: averaging over multiple observations per participant is likely to decrease the interindividual differences within the groups.
In the split-plot design, too, fewer participants were needed than in the corresponding t-test, because there was more than one observation per person, at least as long as the levels of the repeated-measures variable did not correlate too much, so that there was room for noise reduction.
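One way to see the role of the correlation is analytic rather than by simulation: if each participant's score is the mean of two measures with standard deviation 1 and correlation r, the variance of that mean is (1 + r)/2. The sketch below feeds this into power.t.test; the effect size of 0.4 and the power of .90 are arbitrary illustration values, not the settings of the simulations reported here.

# Between-groups comparison on the mean of two repeated measures.
# Var(mean of two measures with sd = 1 and correlation r) = (1 + r) / 2,
# so lower correlations leave less noise around each participant's mean.
n_per_group <- function(r, d = 0.4, power = 0.90) {
  ceiling(power.t.test(delta = d, sd = sqrt((1 + r) / 2),
                       power = power, sig.level = 0.05)$n)
}

sapply(c(0.2, 0.5, 0.8), n_per_group)   # required n per group rises with r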
All in all, averaging across multiple observations per condition per participant can be expected to increase the power of an experiment when the individual observations contain substantial noise that the averaging can remove. This condition clearly holds for reaction times, where there are huge differences in response speed when the same participant responds several times to the same stimulus. Noise may be much more limited, however, when participants are asked to rate stimuli.
Participants may give more or less the same ratings to the same items, so that there is little point in repeating the items many times. The easiest way to find out whether there is anything to be gained by increasing the number of responses is to look at the reliability of the dependent variable.
Reliability is a cornerstone of correlational psychology, and it is a shame that its importance has been lost in experimental psychology (Cronbach). In the next sections, we will see how the reliability of the dependent variable can be used to optimize the effect size and thereby reduce the number of participants that must be tested. Multiple observations can be realized by presenting the same stimulus more than once or by presenting several equivalent stimuli.
Table 4 gives an example in which 6 participants responded 4 times to the same stimulus. There are several ways to calculate the reliability of the data in Table 4. For instance, we could calculate the correlation between S1 and S2. We could also calculate the correlation between the average of the first two presentations and the average of the last two presentations.
This correlation is known as the split-half correlation, and it can be calculated directly for the data in Table 4. Shrout and Fleiss showed how to calculate two summary measures of reliability, which they called intraclass correlations. The first measure corresponds to the average correlation between the repetitions. The second corresponds to the expected correlation between the mean scores of the four repetitions and the mean scores that would be obtained if another four repetitions were run.
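For readers who want to try this, here is a small sketch with invented wide-format data in the spirit of Table 4 (6 participants, 4 presentations called S1 to S4; the column names and numbers are hypothetical):

set.seed(42)
# Each participant has a stable level plus independent noise on every presentation.
true_level <- rnorm(6, mean = 500, sd = 50)
tab4 <- data.frame(S1 = true_level + rnorm(6, sd = 40),
                   S2 = true_level + rnorm(6, sd = 40),
                   S3 = true_level + rnorm(6, sd = 40),
                   S4 = true_level + rnorm(6, sd = 40))

cor(tab4$S1, tab4$S2)                        # correlation between two single presentations

first_half  <- rowMeans(tab4[, c("S1", "S2")])
second_half <- rowMeans(tab4[, c("S3", "S4")])
cor(first_half, second_half)                 # split-half correlation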
Brown and Spearman already showed at the beginning of the 20th century that averages of n scores correlate more highly with each other than the individual scores do, according to the equation r_n = n r_1 / (1 + (n - 1) r_1), where r_1 is the reliability of a single score. All we have to do is turn Table 4 into so-called long notation and use a published algorithm.
Table 5 shows the first lines of the long notation of Table 4; lines with missing values are simply left out. The R package psychometric by Fletcher contains commands to calculate the two intraclass correlations of Shrout and Fleiss. All you have to do is import the full Table 5 (long notation) into R and use commands along the lines sketched below.
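The commands themselves are not reproduced in this excerpt. A sketch along the following lines should work, assuming the long-format data frame is called tab5 and has columns named participant and score (both names are hypothetical); the functions ICC1.lme and ICC2.lme are the ones documented for the psychometric package, and psych::ICC() on the wide-format Table 4 is an alternative if they behave differently than expected.

# install.packages("psychometric")    # if not installed yet
library(psychometric)

# tab5: long-format data with one row per observation,
# a 'participant' identifier and a 'score' column (hypothetical names).
ICC1.lme(score, participant, data = tab5)   # reliability of a single observation
ICC2.lme(score, participant, data = tab5)   # reliability of the mean across repetitions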
You can see that the value of ICC2 is the one predicted on the basis of the Spearman-Brown equation applied to the single-observation reliability. This means that we can also use the Spearman-Brown equation to estimate how many more stimuli we must present to obtain a higher reliability for the dependent variable; a small calculation sketch is given below.
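As an illustration, the sketch below applies the Spearman-Brown equation in both directions: stepping the single-observation reliability up to the reliability of the mean of n observations, and solving for the number of observations needed to reach a target reliability. The starting reliability of .30 and the target of .80 are arbitrary illustration values, not the values of Table 5.

# Spearman-Brown in both directions (r1 = reliability of a single observation).
sb_up <- function(r1, n) n * r1 / (1 + (n - 1) * r1)
sb_n  <- function(r1, r_target) r_target * (1 - r1) / (r1 * (1 - r_target))

r1 <- 0.30                          # illustrative single-observation reliability (ICC1)
sb_up(r1, n = 4)                    # expected reliability of the mean of 4 observations (~.63)
ceiling(sb_n(r1, r_target = 0.80))  # observations needed to reach a reliability of .80 (~10)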
For this we can use the reversed Spearman-Brown equation, n = r_target (1 - r_1) / (r_1 (1 - r_target)), as in the sketch above, where r_1 is the reliability of a single observation and r_target is the value of ICC2 we should aim for; reporting this value should be part of every data analysis, also in experimental psychology. Applying the equation to the data in Table 5 tells us how many observations we would have to add. A last issue we must address is how to deal with the fact that experiments consist of several conditions. There are two options: (1) we calculate the ICCs for each condition in the design, or (2) we calculate the ICCs across the complete dataset.
In general, both calculations are likely to agree pretty well. However, there will be exceptions. One is illustrated by the valence rating and false memory studies from Table 3, which had a negative correlation between conditions.
Obviously, this will hurt the ICC if we calculate it across the entire dataset. Another exception occurs when there is a big difference between the groups and at the same time not much systematic variance within the groups, as shown in Figure 5. In such a situation you will find strong ICCs across the entire dataset (because of the large group difference) together with weak ICCs within the conditions (because of the restricted range).
Finally, design-wide calculations will be worse than condition-specific calculations when there is an interaction between the independent variables, in particular when there is a cross-over interaction. Figure 5. Illustration of how you can find a high correlation in the entire dataset because of the group difference and a low correlation within each group because of the range restriction within groups.
Because both calculations give us useful information and are easy to run with contemporary software, it is advisable to run and report both, in order to get a better feel for the dataset.
To better appreciate the reliabilities typically observed in psychology, we calculated them for the replication studies reported in Camerer et al. The results are given in Table 5 (repeated-measures designs) and Table 6 (between-groups designs).
The tables also include the reported effect sizes and the effect sizes when the analysis was run on a single observation in each condition.
For the repeated-measures experiments these were single, randomly chosen stimuli per condition; for the between-groups experiments it was the average based on the stimuli used. The two effect sizes help us to understand the increase in effect size obtained by averaging over multiple observations per condition. Table 5. Intraclass correlations for designs with one repeated-measures factor (two levels).
The experiments of Zwaan et al. are among the repeated-measures designs included. Table 6. Intraclass correlations for between-groups designs. As predicted, increasing the number of observations per condition made a big difference, in particular for the repeated-measures designs (Table 5), although for some dependent variables there was less room for improvement.
The situation is less compelling for the between-groups designs reported in Camerer et al. However, in Table 6 too we see that reliable, stable dependent variables improve the interpretation. The most convincing instance is the replication of Pyc and Rawson, who examined the impact of testing on the retention of Swahili-English word translations.
Table 6 also illustrates that some dependent variables can be quite reliable even when based on a few items. Wilson et al., for instance, used only three rating questions, and the intercorrelations between the three ratings were already high. At the same time, notice that reliability can only be assessed when there is more than one measurement, and that even in this study averaging the three questions still increased the reliability further. This can easily be verified and reported by using ICC2. The higher values of d require dependent variables with high reliability.
Therefore, authors using these estimates must present evidence about the reliability of their variables. At the same time, we can think of situations in which failing to find a true effect has important costs as well. Think of a scientific theory that critically depends on a particular effect: how costly is it to give up the theory because we failed to put in enough effort to test it properly? Reducing the chance of such misses, unfortunately, comes at a further cost in terms of the number of participants that must be tested.
These numbers can be compared to those of Tables 7 and 8; the larger samples decrease the chances of missing an effect that is present in the population. Articles and talks about power in psychology research have a predictable course. At first the audience is engaged and enthusiastic about the need for properly powered studies. This suddenly changes, however, when the required numbers of participants are tabulated. Then howls of disbelief and protest arise. Surely the numbers must be wrong!
For some classic, robust effects in designs with repeated measures and many observations per level, this is indeed true. Still, the tabulated numbers are not impossible to reach, as shown by recent replication projects, which often exceed the minimum sample sizes. This leads to the following recommendations.
Particularly worrying for cognitive psychology is the large number of observations needed to properly test the interaction between a repeated-measures variable and a between-groups variable in the split-plot design. It looks like this effect needs the same number of participants as a between-groups comparison, something which has not been documented before.
An analysis of replication studies suggests that between-subjects manipulations in particular are difficult to replicate (Schimmack), raising the possibility that the same may be true for interactions between repeated-measures and between-groups variables. The main reason why underpowered studies keep being published is that the current reward system favors such studies. One element that could help nudge researchers towards properly powered studies may be the inclusion of power in the badge system used by some journals to promote good research practices (Chambers; Lindsay; Kidwell et al.).
The evaluation of graduate students will also have to change. Rather than being impressed by a series of small-scale studies, supervisors and examiners should start endorsing PhD theses with two to four properly run studies. Finally, well-powered studies become less daunting when they are done in collaboration with other research groups.
It is probably not a coincidence that many replication studies are run in labs with active collaboration policies. This is another element that can be taken into account when evaluating research.
A comparison of Tables 7, 8, and 9 is likely to bias readers against the Bayesian analysis, given researchers' fondness for the smallest possible sample sizes. This is not justified: the Bayesian analyses were held to a more demanding criterion of evidence, hence the larger numbers of participants required. Achieving a comparable level of evidence within the frequentist framework requires sample sizes closer to those of Table 8 than to those of Table 7.
As a result, some researchers have recommended using the criterion flexibly and justifying the alpha level used (Bishop; Lakens et al.). The Bayesian approach is oriented more towards estimating model parameters and reducing the accompanying uncertainty through continuous fine-tuning than towards binary null hypothesis testing based on non-informative priors.
Still, it is important that users of software packages know the power requirements when they use Bayes factors to argue for a null hypothesis or an alternative hypothesis. Observing a large or small BF is as uninformative as a small p-value when it is based on a hopelessly underpowered study (see the introduction).
So, the BF value obtained in a specific simulated study is uninformative about the methodological soundness of the study. Having more participants is good; having fewer is bad.
There are four main reasons why the numbers in Tables 7 and 8 should be treated as lower bounds rather than as upper bounds. First, the data in Tables 7 and 8 have been obtained under ideal simulation conditions. In everyday life the data are likely to be messier and to violate some requirement of the statistical test (e.g., its distributional assumptions).
When based on enough data, small violations are unlikely to invalidate the conclusions much unless a strong confound has been overlooked. However, they will introduce extra noise and so require a few more observations to reach the same level of power as under ideal circumstances.
Second, the tables are based on typical effect sizes; going after smaller effects can easily be defended, but it comes at a cost. It is unlikely that the required sample sizes will be rewarded within the current academic system (although this hopefully will change once power issues are given due consideration). So, one needs a good theoretical motivation to go after such effects, in which case one is likely to know the direction of the effect, so that one-tailed tests can be used (these require fewer participants). Third, Tables 7 and 8 strive for one particular level of power.
As Table 9 shows, this need not be the case: nothing prevents you from going for a higher power. Fourth, we can look at how precisely the numbers of Tables 7 and 8 estimate the effects. To some extent, the issue of statistical significance is of secondary importance in science. When we look at how precisely the effect sizes are estimated in Tables 7 and 8, the outcome is rather sobering. For the repeated-measures design with 52 participants, the confidence interval is even wider.
This means that, in terms of precision, the experiment says little more than that the effect can range from nearly non-existent (d slightly above 0) to considerably larger than the assumed value. For a repeated-measures t-test, the interval around the observed effect is similarly wide. These numbers are better than the confidence intervals of Table 7 (because of the larger numbers of participants), but they still tell us that, in terms of precision, the participant numbers in Tables 7 and 8 are at the low end rather than at the high end.
Psychology really could do with more high-powered, high-precision studies of important phenomena, like the top studies of Figure 2. At the same time, it is important to comprehend that the numbers of Tables 7, 8, and 9 must not be used as fixed standards against which to judge each and every study without further thought.
A better analogy is that of reference numbers. If, however, you have reasons to believe that the effect size under investigation is smaller but of theoretical interest, you will require higher numbers.
Alternatively, if you have good evidence that the expected effect size is larger, you can justify smaller numbers. Importantly, such justification must be done before the study is run, not after seeing the data and deciding whether they agree with your expectations. An aspect often overlooked in power analyses is that noise can be reduced by having more observations per participant.
This is particularly effective for repeated-measures designs, because the power of such designs depends on the correlation between the conditions in addition to the difference in means. Brysbaert and Stevens recommended 40 participants and 40 stimuli per condition as good practice in reaction time studies with two repeated-measures conditions. Unfortunately, Brysbaert and Stevens further suggested that these numbers would also suffice for more complex designs.
A look at Table 7 makes this unlikely. Further simulations will have to indicate whether these designs also require more stimuli per condition. The correlation between two within-participant conditions can be calculated and depends on the reliability of the measures. Therefore, it is good practice to measure and optimize reliability. Reliability can be measured with the intraclass correlation; it can be optimized by increasing the number of observations.
The latter is particularly required for noisy dependent variables, such as reaction times; dependent variables with less variability (e.g., ratings) require fewer observations. Reliability can only be calculated when there are at least two observations per participant per condition. Researchers are therefore advised to make sure this is the case, either by repeating the stimuli or by including equivalent items. It would be good practice if researchers always included the effect sizes d_z and d_av when they report a pairwise test of repeated-measures conditions; a small sketch of how to compute both is given below.
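As a hedged illustration of how both effect sizes can be obtained from raw paired data (the numbers below are invented, and d_av is computed here as the mean difference divided by the average of the two condition SDs, which is one common definition):

set.seed(7)
# Hypothetical paired data: 40 participants measured in two conditions.
cond1 <- rnorm(40, mean = 510, sd = 50)
cond2 <- cond1 + rnorm(40, mean = 15, sd = 30)    # condition effect plus trial noise

d_av <- mean(cond2 - cond1) / mean(c(sd(cond1), sd(cond2)))
d_z  <- mean(cond2 - cond1) / sd(cond2 - cond1)

# With (approximately) equal variances, r can be recovered from the two effect sizes.
r_implied <- 1 - (d_av / d_z)^2 / 2
round(c(d_av = d_av, d_z = d_z, r_implied = r_implied, r_observed = cor(cond1, cond2)), 2)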
Reporting both allows readers to calculate the correlation between the conditions and gives meta-analysts all the information they need to compare between-groups studies with repeated-measures studies. Table 10 shows the values for the paradigms investigated by Zwaan et al. A comparison of d_z with d_av allows readers to deduce the correlations between the conditions (Table 3).
Table 10. Comparison of d_z and d_av for repeated-measures designs in Zwaan et al. Reporting d_av in addition to d_z for pairwise comparisons allows readers to compare the effect sizes of repeated-measures studies with those of between-groups studies. The curse of underpowered studies is unlikely to stop as long as reviewers and editors value the potential interest of the findings more than methodological soundness.
The main problem with evaluating findings after the experiment has been run is that most significant findings can be given an interesting post hoc interpretation. Effects in line with the expectations pass without question and are interpreted as adding credibility to the methodological choices made in the study (in line with the post hoc ergo propter hoc fallacy).
Even unexpected effects may point to exciting new insights. Non-significant findings cause more interpretation problems and, therefore, are less likely to get published, leading to the file drawer problem and a failure to expose previously published false positives. Logistically, it is extremely simple to include power considerations in the editorial decision. If editors and reviewers collectively decided no longer to publish underpowered studies, research practices would change overnight.
That this has not happened yet is arguably due to two factors: (1) the underestimation of the power issue, and (2) the lack of clear guidelines about the sample sizes needed for properly powered studies. The present article is an attempt to address the second factor. It is hoped that it will kick off the discussion and lead to a consensus paper with a wider remit than a single-authored publication. Researchers who want to deviate from the recommended sample sizes should give good reasons for doing so. These reasons must not refer to existing research traditions or to data from previous small-scale studies, but they can be based on data from large-scale studies or on data indicating that the reliability of the measures is high enough to make d_z sufficiently larger than d_av.
The numbers in Tables 7, 8, and 9 also provide textbook writers with useful guidelines about whether to include a study in their book and how best to describe the finding: as a well-established fact or as an interesting hypothesis that still requires a proper test.
The numbers in Tables 7, 8, and 9 seem to exclude research on all topics involving small populations or research techniques with a hefty cost. Does this mean that psychology can only investigate issues lending themselves to large-scale, cheap internet testing? To answer this question, we must return to the consequences of underpowered studies.
Most of the time, these studies will not detect a true effect (Table 1). If they do detect it, this is because the effect size in the small sample happens to be considerably larger than the true effect size. The chances of non-replicable, spurious effects increase the more complex the design is, certainly when no correction for multiple testing is made (Maxwell). As a result, underpowered studies are unlikely to add much insight.
They may hit on a new true finding if the sampling error happens to enlarge the effect in the small sample studied, but most of the time they will just churn out negative findings and false positives. So, rather than continuing to excuse underpowered studies, we must use the information in Tables 7, 8, and 9 to determine what type of research is meaningful with small samples.
The tables are very clear in this respect: only a main effect of a reliably measured within-subjects variable with two levels is within reach. Further reassuring is that this main effect is not compromised by the presence of between-groups control variables (e.g., Latin-square groups).
In sum, researchers confronted with a small number of participants must not search for excuses to keep doing bad research, but must ask questions that remain informative with the sample sizes available. For them, even more than for others, it is imperative to measure each participant as thoroughly as possible, so that stable results are obtained per participant.
A challenge for these studies is to make sure that they are not invalidated by demand characteristics. Alternatively, we must start budgeting for appropriate numbers of participants. If little good research can be done with small samples, we must say so in grant applications and argue why the money is needed. Better to fund useful research than to pay for studies that burden the field with more ambiguity after they are run than before.
A frustration with tables is that they often do not cover the exact situation you are interested in. This is the attraction of power calculators, which give you explicit numbers for whatever combination of independent variables and power requirements you can think of, even though you may not understand the outcome in the way the authors assumed you would. In this respect, it may be interesting to know that the numbers in Tables 7, 8, and 9 are not so difficult to obtain.
Basically, what you need is a way to generate datasets that follow the structure you expect and a script that analyzes each dataset the way you plan to analyze the real data. All of this is rather straightforward in computer languages with statistical libraries, such as R. The scripts can easily be adapted for different parameters and, with a bit of tinkering, for other designs.
What you have to do is make your predictions explicit in terms of standardized values for the various conditions and the ways in which you are going to analyze your data (exactly the things you do when you preregister a study). By then generating and analyzing multiple datasets following these restrictions, you can estimate how often the statistical analysis will confirm the structure you used to generate the numbers and which you think represents reality; a minimal sketch is given below.
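As a minimal sketch of this logic for the simplest case, a two-condition repeated-measures comparison: the assumed effect size (d_z = 0.4), the sample sizes, and the number of simulations are arbitrary illustration values, not the settings used for the tables in this article.

set.seed(123)

# Generate many datasets that follow the structure you believe to be true,
# analyze each one the way you plan to analyze the real data,
# and count how often the analysis confirms that structure.
simulate_once <- function(n, dz) {
  diffs <- rnorm(n, mean = dz, sd = 1)   # standardized difference scores
  t.test(diffs)$p.value < 0.05           # the planned analysis
}

estimate_power <- function(n, dz = 0.4, nsim = 5000) {
  mean(replicate(nsim, simulate_once(n, dz)))
}

sapply(c(20, 50, 70), estimate_power)    # proportion of 'successful' simulations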
In that way, you immediately get a feeling for the usefulness of your design. It may even be an idea to drop the automatic loop and to generate, analyze, and report the data of the simulations one after the other, just to get an idea of how often you obtain results that do not agree with the structure you imposed on the number generator. Finding and reporting, say, 90 simulations that fail to confirm a structure you know to be there is arguably the best remedy against underpowered studies.
The hope is that, by making the guidelines explicit, the research community can zoom in on them and fine-tune them if needed. Reviewers do not always reward large studies: Callens, Tops, and Brysbaert, for instance, was rejected by two American journals because the reviewers felt unsure about the testing language (Dutch); this concern outweighed the size of the study, which involved two years of solid data collection and analysis. A good theory also reduces the need for data collection: with a theory you can figure out whether an imaginary bridge will stand or collapse under imaginary conditions, whereas without a theory, if you want to know what would happen if you did X, you actually have to do X, which is more labor-intensive and less insightful.
Luckily, computers nowadays are so powerful and fast that we can easily run the simulations needed for the power analyses in the present article. An advantage of simulations is that they are less prone to overlooking important details. So, the present paper is a combination of findings based on theoretical analysis for the simple cases and simulation for the more complex cases. The larger N is, the closer the approximation given by the equation.
It is also possible to make the equation more complex, so that it always gives the correct value. By multiplying the codes of Factor A and Factor B, you get the interaction. Then you can run a multiple regression with three continuous predictors: Factor A, Factor B, and the interaction. The regression weights give you the sizes of the three effects, which can be compared directly.
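A sketch of this coding approach with made-up data for a 2 x 2 between-groups design; the effect sizes used to generate the data (0.5, 0.3, and 0.4) are arbitrary illustration values.

set.seed(99)
# Contrast codes of -0.5 / +0.5 for the two factors; 50 observations per cell.
design <- expand.grid(A = c(-0.5, 0.5), B = c(-0.5, 0.5))
design <- design[rep(1:4, each = 50), ]
design$AB <- design$A * design$B          # interaction = product of the two codes

# Data generated with main effects of 0.5 and 0.3 and an interaction of 0.4.
design$y <- 0.5 * design$A + 0.3 * design$B + 0.4 * design$AB + rnorm(nrow(design))

summary(lm(y ~ A + B + AB, data = design))  # the three weights are directly comparable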
This requires the data to be coded in long notation (see later in this article). This alternative approach is not included in the present article because of the dismal record psychologists have when given the opportunity to peek at the data.
We really must learn to accept that the only way to get properly powered experiments with good effect size estimates is to test more than a few participants. It is not clear either how well the sequential technique works for the more complicated designs with multiple pairwise comparisons. Because there was no collection of empirical data, no consent of participants was needed.
This paper was written because the author was getting too many questions about the power of experiments and too much criticism about the answers he gave. I am grateful to Daniel Lakens and Kim Quinn for helpful suggestions on a previous draft. Daniel even kindly checked a few of the numbers with a new package he is developing.
Albers, C. When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology, 74.
Anderson, S. Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11).
Baayen, R. Analyzing Linguistic Data: A practical introduction to statistics using R.
Bakker, M. Psychological Science, 27(8).