Multiple linear regression: minimum sample size required when only a subset of independent variables are of interest

I am planning a multiple linear regression model with 1 continuous dependent variable and 6 independent variables. Of these 6 independent variables, I am only interested in the contribution of 2 (let's call them A and B) to this inferential model; I only intend to adjust for the other variables (let's call them x1 to x4). So the model is:

y = A + B + x1 + x2 + x3 + x4

All of the values are recorded. My initial question is: what is the minimum sample size required for this linear regression model?

I already looked up available references and found this paper by Faul et al. (2009). In the subsection "Deviation of a subset of linear regression coefficients from zero (F test, fixed model)", they suggest using an F test in the G*Power application and choosing the "Linear multiple regression: Fixed model, R2 increase" option. Following their instructions, I chose f2 = 0.15 (a moderate effect), alpha = 0.05, power = 0.8, tested predictors = 2 and total predictors = 6. These input parameters give a minimum required sample size of 68 for my specified power.

Is my suggested solution correct? Is this the correct method of calculating the minimum required sample size for my study?

Bonus question: Is the F test described above, when tested predictor(s) = 1, mathematically equal to the t test of a single regression coefficient in the G*Power app?

Edit: Should I be mindful of any new assumptions in my study because of this method of calculating the sample size?

Any help will be greatly appreciated.
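For reference, here is a small Python (SciPy) sketch of what I understand G*Power to be computing for the fixed model, i.e. an F test for the R2 increase with noncentrality lambda = f2 * N; the helper function and the search loop are my own, not from the paper:

```python
from scipy import stats

def power_r2_increase(n, f2, n_tested, n_total, alpha=0.05):
    """Power of the fixed-model F test for the R^2 increase due to the
    tested predictors; assumes noncentrality lambda = f2 * n (my reading
    of G*Power's fixed-predictors model)."""
    df1 = n_tested            # numerator df: number of tested predictors
    df2 = n - n_total - 1     # denominator df: N - total predictors - 1
    if df2 < 1:
        return 0.0
    crit = stats.f.isf(alpha, df1, df2)            # critical F under H0
    return stats.ncf.sf(crit, df1, df2, f2 * n)    # P(F > crit) under H1

# Smallest N reaching 80% power for f2 = 0.15, alpha = 0.05, 2 of 6 predictors tested
n = 8
while power_r2_increase(n, 0.15, 2, 6) < 0.80:
    n += 1
print(n, power_r2_increase(n, 0.15, 2, 6))  # G*Power gives me 68; this should land nearby
```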

asked Jan 6, 2023 at 13:02 by The_old_man

$\begingroup$ Will you be setting the values of x1 to x4 yourself, or are you going to record whatever values you find and then "adjust for" them in the regression? Please add that information by editing the question, as comments are easy to overlook and can be deleted. $\endgroup$

Commented Jan 10, 2023 at 19:14

$\begingroup$ @EdM Thank you for your comment. I have edited the question: all of the variables, including x1 to x4, are recorded from real values; I don't set anything. $\endgroup$

Commented Jan 10, 2023 at 19:51

1 Answer

$\begingroup$

Section 4 of the cited paper says:

In the fixed-predictors model underlying previous versions of G*Power, the predictors X are assumed to be fixed and known. In the random-predictors model, by contrast, they are assumed to be random variables with values sampled from an underlying multivariate normal distribution. Whereas the fixed-predictors model is often more appropriate in experimental research (where known predictor values are typically assigned to participants by an experimenter), the random-predictors model more closely resembles the design of observational studies (where participants and their associated predictor values are sampled from an underlying population). The test procedures and the maximum likelihood estimates of the regression weights are identical for both models. However, the models differ with respect to statistical power.

What you have is an observational study, so your use of the "fixed model" power estimates isn't correct. As Section 4.4 says:

sample sizes required for the random-predictors model are always slightly larger than those required for the corresponding fixed-predictors model.

Even if you are only interested in 2 of the 6 predictors, you still have to fit all 6. If your sample size is too small, you will overfit the model and the results will be unreliable. A useful rule of thumb for typical biomedical or social-science studies is that you need at least 15 observations for each predictor you will fit; see Frank Harrell's online notes. That would suggest a minimum sample size of 90 to handle your 6 predictors.

You might be able to reduce that number somewhat by using penalized regression to handle the 4 predictors that you don't care about; that would lower the effective number of predictors by reducing the magnitudes of those 4 coefficients. Harrell's notes go into penalization, which is illustrated in this paper for the type of situation you describe.
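As an illustration only (not the specific method in Harrell's notes or in the linked paper), here is a minimal sketch with simulated data that applies a ridge-type penalty just to the four nuisance covariates, using statsmodels' per-coefficient penalty weights; the data, coefficients, and penalty value are made up:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 90
X = rng.normal(size=(n, 6))                      # columns: A, B, x1..x4 (simulated)
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

X_design = sm.add_constant(X)                    # intercept + 6 predictors

# Penalty weight per coefficient: none on the intercept, A, and B;
# ridge-type shrinkage (L1_wt=0) on the four nuisance covariates x1..x4.
alpha = np.array([0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.1])
fit = sm.OLS(y, X_design).fit_regularized(method='elastic_net',
                                          alpha=alpha, L1_wt=0.0)
print(fit.params)   # coefficients for x1..x4 are shrunk toward zero
```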

If you want to do a more formal power analysis, you could try the "random-predictor" tools in G*Power. Ideally, that requires "specifying a vector $u$ of correlations between criterion $Y$ and predictors $X_i$ and the $(m \times m)$ matrix $B$ of correlations among the predictors," from which the effect size $\rho^2 = u^\top B^{-1} u$ is computed. That's because the power of a study depends on the associations of the predictors with the outcome and on the individual and joint distributions of the predictor values. I don't think that you can separate out the associations of any specific predictors in this case, however; the analysis is just for the overall model.
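For example, if you were willing to posit values for those correlations, you could compute the implied effect size directly; the numbers below are purely hypothetical placeholders, not recommendations:

```python
import numpy as np

# Hypothetical correlations, for illustration only: u[i] = cor(Y, X_i),
# B[i, j] = cor(X_i, X_j) for the 6 predictors (A, B, x1..x4).
u = np.array([0.30, 0.25, 0.20, 0.15, 0.10, 0.10])
B = np.eye(6)              # assume uncorrelated predictors for simplicity
B[0, 1] = B[1, 0] = 0.2    # e.g. a modest correlation between A and B

rho2 = u @ np.linalg.solve(B, u)   # squared multiple correlation rho^2 = u' B^{-1} u
f2 = rho2 / (1 - rho2)             # Cohen's f^2 for the overall model
print(rho2, f2)
```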