PLS-DA/OPLS-DA 2D plot: What does it mean when R²X (close to 0.9) differs greatly from R²Y and Q² (around 0.1), and how should this be addressed?
Review of indicator meanings:
R²X: The proportion of variance in the independent variables (X, i.e., the metabolite features) explained by the model. The higher the value, the better the model fits the input data.
R²Y: The proportion of variance in the dependent variable (Y, usually the group labels, such as healthy vs. diseased) explained by the model. The higher the value, the better the model separates the groups in the training data.
Q²: The predictive ability estimated by cross-validation, reflecting how well the model predicts new samples. As rules of thumb:
- Q² > 0.5: Moderate predictive ability
- Q² > 0.9: Very good
- Q² ≈ 0.1: Almost no predictive ability
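As a rough illustration of how these three numbers are obtained, the sketch below fits a single-response PLS model with NIPALS in NumPy and computes R²X, R²Y and a k-fold Q² on simulated data. Everything in it is an assumption made for illustration: the function names (pls1_fit, pls1_predict, r2x_r2y_q2), mean-centering without unit-variance scaling, 0/1 coding of the two groups, 7 folds, and the simulated matrices. Software packages may define Q² slightly differently.

```python
import numpy as np

def pls1_fit(X, y, n_comp=1):
    """Fit PLS1 by NIPALS on mean-centered data (no unit-variance scaling)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # residuals, deflated per component
    ss_x = (E ** 2).sum()
    W, P, c = [], [], []
    for _ in range(n_comp):
        w = E.T @ f
        w /= np.linalg.norm(w)               # X weights
        t = E @ w                            # scores
        p = E.T @ t / (t @ t)                # X loadings
        c_a = f @ t / (t @ t)                # Y loading
        E = E - np.outer(t, p)
        f = f - c_a * t
        W.append(w); P.append(p); c.append(c_a)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    beta = W @ np.linalg.solve(P.T @ W, c)   # coefficients for centered X
    r2x = 1.0 - (E ** 2).sum() / ss_x        # share of X variance explained
    return x_mean, y_mean, beta, r2x

def pls1_predict(model, X):
    x_mean, y_mean, beta, _ = model
    return (X - x_mean) @ beta + y_mean

def r2x_r2y_q2(X, y, n_comp=1, k=7, seed=0):
    ss_y = ((y - y.mean()) ** 2).sum()
    model = pls1_fit(X, y, n_comp)
    r2y = 1.0 - ((y - pls1_predict(model, X)) ** 2).sum() / ss_y
    # Q2 = 1 - PRESS / SS_Y, with PRESS accumulated over k held-out folds
    idx = np.random.default_rng(seed).permutation(len(y))
    press = 0.0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        m = pls1_fit(X[train], y[train], n_comp)
        press += ((y[fold] - pls1_predict(m, X[fold])) ** 2).sum()
    return model[3], r2y, 1.0 - press / ss_y

rng = np.random.default_rng(1)
n, p = 40, 100
y = np.repeat([0.0, 1.0], n // 2)

# Case 1: X is dominated by a latent factor that is independent of the labels
z = rng.normal(size=n)
X_null = 3.0 * np.outer(z, rng.normal(size=p)) + rng.normal(size=(n, p))

# Case 2: the first 5 features are shifted between the two groups
X_sig = rng.normal(size=(n, p))
X_sig[y == 1, :5] += 3.0

null_metrics = r2x_r2y_q2(X_null, y)
sig_metrics = r2x_r2y_q2(X_sig, y)
print("unrelated structure: R2X=%.2f R2Y=%.2f Q2=%.2f" % null_metrics)
print("group shift:         R2X=%.2f R2Y=%.2f Q2=%.2f" % sig_metrics)
```

In the first simulated case R²X is typically high while R²Y and Q² stay near zero or below; in the second the ordering is typically reversed. This suggests that R²X on its own says little about how well the groups are separated.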
Problem interpretation:
R²X is very high (≈0.9): The model explains the structure of X (the variance among the features) well, i.e., it fits X well.
But R²Y and Q² are very low (≈0.1): The model can hardly explain the differences between groups and has almost no predictive ability.
This usually means that the model fits X well but has not learned information that distinguishes Y, i.e.:
The components mainly describe variation in X that is unrelated to the grouping. The model reproduces the internal structure of the original data well but cannot differentiate the groups (it has not captured the truly differential metabolites).
Possible reasons and suggestions:
1. Group differences are not obvious
Reason: The metabolic feature differences between groups are not significant.
Solution:
(1) Return to the raw data and run PCA to check whether the samples cluster naturally by group.
(2) Try other classification methods (such as Random Forest) to confirm whether the groups can be discriminated at all.
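The PCA check in (1) can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not a prescribed workflow: the helper names (pca_scores, group_separation), the two-group 0/1 labels, the standardized mean difference used as a separation measure, and the simulated data are all hypothetical choices.

```python
import numpy as np

def pca_scores(X, n_comp=2):
    """PCA via SVD of the mean-centered matrix; returns scores and explained-variance ratios."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_comp] * s[:n_comp]
    ratio = s[:n_comp] ** 2 / (s ** 2).sum()
    return scores, ratio

def group_separation(scores, labels):
    """Absolute standardized mean difference between two groups on each PC."""
    a, b = scores[labels == 0], scores[labels == 1]
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2.0)
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled_sd

# Simulated example: 40 samples, 100 features, 10 features shifted in group 1
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 100))
X[labels == 1, :10] += 3.0

scores, ratio = pca_scores(X)
sep = group_separation(scores, labels)
print("explained variance ratio:", np.round(ratio, 3))
print("group separation per PC:", np.round(sep, 2))
```

If the separation on the first few components is close to zero, the main variation in X is probably not driven by the grouping. Because PCA is unsupervised, a small group effect may still sit in later components, so this check is a screening step rather than a proof.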
2. Sample size is too small
Reason: With few samples, the PLS-DA model, and Q² in particular, easily becomes distorted or unstable.
Solution:
(1) Increase the sample size
(2) Use a more robust cross-validation scheme (such as 7-fold or 10-fold)
3. Too many variables
Reason: This is common in metabolomics, for example thousands of metabolites measured on only dozens of samples, which leads to overfitting.
Solution:
(1) Perform feature selection first (for example, filter by VIP > 1 and p-value < 0.05)
(2) Or reduce the dimensionality first (for example, with PCA) and then fit PLS-DA
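For the VIP filter in (1), the sketch below computes VIP scores from a NIPALS PLS1 fit using the usual definition, in which each variable's squared weights are summed over components and weighted by the Y variance each component explains. The function name pls1_vip, the mean-centering without scaling, the 0/1 group coding and the simulated data are assumptions for illustration. One caveat worth keeping in mind: if variables are selected on the full data set and Q² is then cross-validated on the same samples, Q² tends to be optimistic, so the selection is better repeated inside each training fold.

```python
import numpy as np

def pls1_vip(X, y, n_comp=2):
    """VIP scores from a NIPALS PLS1 fit on mean-centered data."""
    E, f = X - X.mean(axis=0), y - y.mean()
    n_feat = X.shape[1]
    W, ss = [], []
    for _ in range(n_comp):
        w = E.T @ f
        w /= np.linalg.norm(w)           # unit-length weight vector
        t = E @ w                        # scores
        tt = t @ t
        p = E.T @ t / tt                 # X loadings
        c = f @ t / tt                   # Y loading
        W.append(w)
        ss.append(c ** 2 * tt)           # Y sum of squares explained by this component
        E = E - np.outer(t, p)
        f = f - c * t
    W, ss = np.array(W).T, np.array(ss)
    return np.sqrt(n_feat * ((W ** 2) @ ss) / ss.sum())

# Simulated example: only the first 5 of 100 features differ between groups
rng = np.random.default_rng(3)
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 100))
X[y == 1, :5] += 3.0

vip = pls1_vip(X, y)
selected = np.flatnonzero(vip > 1.0)     # conventional VIP > 1 filter
top5 = np.argsort(vip)[-5:]
print("features with VIP > 1:", selected)
print("five highest VIP features:", np.sort(top5))
```

With unit-length weight vectors the mean of the squared VIP values equals 1, which is where the conventional VIP > 1 cutoff comes from.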
4. Unreasonable model parameter settings
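Reason: For example, an unsuitable cross-validation scheme, such as leave-one-out cross-validation, in which every test set contains a single sample and the resulting Q² can be unstable.
Solution: Reset the cross-validation parameters (e.g., use k-fold instead of leave-one-out)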
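5. The model has not been validated with a permutation test
Reason: Without a permutation test, it is unclear whether the model's results could have arisen by chance.
Solution: Run a permutation test with 200 permutations of the group labels; if the permuted models often reach a Q² as high as or higher than the original Q², the model is unreliable.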
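A permutation test of this kind can be sketched as follows: the group labels are shuffled, Q² is recomputed on the same folds, and the observed Q² is compared with the permuted distribution. The function names (q2_pls1, permutation_test), the one-component PLS1 model, the 7 folds, the empirical p-value formula and the simulated data are assumptions for illustration, not the procedure of any particular software package.

```python
import numpy as np

def q2_pls1(X, y, folds, n_comp=1):
    """Cross-validated Q2 = 1 - PRESS / SS_Y for a NIPALS PLS1 model."""
    press = 0.0
    all_idx = np.arange(len(y))
    for test in folds:
        train = np.setdiff1d(all_idx, test)
        x_mean, y_mean = X[train].mean(axis=0), y[train].mean()
        E, f = X[train] - x_mean, y[train] - y_mean
        W, P, c = [], [], []
        for _ in range(n_comp):
            w = E.T @ f
            w /= np.linalg.norm(w)
            t = E @ w
            p = E.T @ t / (t @ t)
            c_a = f @ t / (t @ t)
            E, f = E - np.outer(t, p), f - c_a * t
            W.append(w); P.append(p); c.append(c_a)
        W, P = np.array(W).T, np.array(P).T
        beta = W @ np.linalg.solve(P.T @ W, np.array(c))
        pred = (X[test] - x_mean) @ beta + y_mean
        press += ((y[test] - pred) ** 2).sum()
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

def permutation_test(X, y, n_perm=200, k=7, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # same folds for every run
    observed = q2_pls1(X, y, folds)
    permuted = np.array([q2_pls1(X, rng.permutation(y), folds)
                         for _ in range(n_perm)])
    # Empirical p-value: share of permuted Q2 values at least as large as observed
    p_value = (1 + np.sum(permuted >= observed)) / (1 + n_perm)
    return observed, permuted, p_value

# Simulated example with a real group difference in the first 5 features
rng = np.random.default_rng(2)
y = np.repeat([0.0, 1.0], 20)
X = rng.normal(size=(40, 100))
X[y == 1, :5] += 3.0

observed, permuted, p_value = permutation_test(X, y)
print("observed Q2 = %.2f, max permuted Q2 = %.2f, p = %.3f"
      % (observed, permuted.max(), p_value))
```

When the observed Q² sits inside the range of the permuted values (a large p-value), the apparent performance is compatible with chance.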
Suggested steps:
- Use PCA to check the data structure and see whether there is any separation between groups.
- Run a permutation test to verify that the model is valid.
- Select variables using VIP and p-values, then refit the model.
- Consider using Random Forest, SVM, and other models to cross-check whether the results are consistent.
- Increase sample size if possible.
BioTech Park Biotech -- Characterization of biological products, high-quality service provider for multi-omics biological mass spectrometry testing