Getting Started with olr: Optimal Linear Regression

📊 Load Example Dataset

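# Load the package and the bundled example data
library(olr)
crudeoildata <- read.csv(system.file("extdata", "crudeoildata.csv", package = "olr"))

# Drop the first column (dates) so only the numeric series remain
dataset <- crudeoildata[, -1]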

# Define variables
responseName <- 'CrudeOil'
predictorNames <- c('RigCount', 'API', 'FieldProduction', 'RefinerNetInput',
                    'OperableCapacity', 'Imports', 'StocksExcludingSPR',
                    'NonCommercialLong', 'NonCommercialShort',
                    'CommercialLong', 'CommercialShort', 'OpenInterest')

🔎 Run OLR Models

# Full model using R-squared
model_r2 <- olr(dataset, responseName, predictorNames, adjr2 = FALSE)

## Returning model with max R-squared.
## 
## Call:
## lm(formula = CrudeOil ~ RigCount + API + FieldProduction + RefinerNetInput + 
##     OperableCapacity + Imports + StocksExcludingSPR + NonCommercialLong + 
##     NonCommercialShort + CommercialLong + CommercialShort + OpenInterest, 
##     data = dataset)
## 
## Coefficients:
##        (Intercept)           RigCount                API    FieldProduction 
##       0.0068578950      -0.3551354134       0.0004393875       0.2670366950 
##    RefinerNetInput   OperableCapacity            Imports StocksExcludingSPR 
##       0.3535677365       0.0030449534      -0.1034192549       0.7417144521 
##  NonCommercialLong NonCommercialShort     CommercialLong    CommercialShort 
##      -0.5643353759       0.0207113857      -1.3007001952       1.8508558043 
##       OpenInterest 
##      -0.0409690597

# Adjusted R-squared model
model_adjr2 <- olr(dataset, responseName, predictorNames, adjr2 = TRUE)

## Returning model with max adjusted R-squared.
## 
## Call:
## lm(formula = CrudeOil ~ RigCount + RefinerNetInput + Imports + 
##     StocksExcludingSPR + NonCommercialLong + CommercialLong + 
##     CommercialShort, data = dataset)
## 
## Coefficients:
##        (Intercept)           RigCount    RefinerNetInput            Imports 
##        0.008256759       -0.380836990        0.322995592       -0.102405212 
## StocksExcludingSPR  NonCommercialLong     CommercialLong    CommercialShort 
##        0.694028117       -0.528991035       -1.219766893        1.676484528
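
Both calls return ordinary lm objects (the printed Call: lines show lm()), so the usual summary() accessors apply to them. The following is a minimal sketch of how fit statistics could be collected for a side-by-side comparison. The helper name model_metrics and the 0.05 significance cutoff are assumptions made for illustration, and the built-in mtcars data stands in for the crude oil data so the snippet runs without olr installed.

```r
# Hypothetical helper (not part of olr): collect fit statistics from an lm object.
model_metrics <- function(fit, alpha = 0.05) {
  s <- summary(fit)
  f <- s$fstatistic                      # named vector: value, numdf, dendf
  p_values <- coef(s)[-1, "Pr(>|t|)"]    # drop the intercept row
  c(
    adj_r2       = s$adj.r.squared,
    r2           = s$r.squared,
    resid_se     = s$sigma,
    f_stat       = unname(f["value"]),
    f_p_value    = unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)),
    n_predictors = length(coef(fit)) - 1,  # assumes the model has an intercept
    n_signif     = sum(p_values < alpha)   # alpha = 0.05 is an assumed cutoff
  )
}

# Stand-in models on mtcars; with olr loaded, pass model_r2 and model_adjr2 instead.
full_fit   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
subset_fit <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- cbind(full = model_metrics(full_fit), subset = model_metrics(subset_fit))
print(signif(comparison, 4))
```

Each column of the printed matrix holds one model's statistics, which makes it easy to see where a smaller model gives up raw R² but gains on the adjusted measures.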

📈 Visual Comparison of Model Fits

library(ggplot2)

# Actual and fitted values
actual <- dataset[[responseName]]
fitted_r2 <- model_r2$fitted.values
fitted_adjr2 <- model_adjr2$fitted.values

# Data frame for ggplot
plot_data <- data.frame(
  Index = seq_along(actual),
  Actual = actual,
  R2_Fitted = fitted_r2,
  AdjR2_Fitted = fitted_adjr2
)

# Plot each fit against the actual values
# (linewidth replaces size for lines as of ggplot2 3.4.0)
ggplot(plot_data, aes(x = Index)) +
  geom_line(aes(y = Actual), color = "black", linewidth = 1, linetype = "dashed") +
  geom_line(aes(y = R2_Fitted), color = "steelblue", linewidth = 1) +
  labs(
    title = "Full Model (R-squared): Actual vs Fitted Values",
    subtitle = "Observation Index used in place of dates (parsed from original dataset)",
    x = "Observation Index",
    y = "CrudeOil % Change"
  ) +
  theme_minimal()

ggplot(plot_data, aes(x = Index)) +
  geom_line(aes(y = Actual), color = "black", linewidth = 1, linetype = "dashed") +
  geom_line(aes(y = AdjR2_Fitted), color = "limegreen", linewidth = 1.1) +
  labs(
    title = "Optimal Model (Adjusted R-squared): Actual vs Fitted Values",
    subtitle = "Observation Index used in place of dates (parsed from original dataset)",
    x = "Observation Index",
    y = "CrudeOil % Change"
  ) +
  theme_minimal() +
  theme(plot.background = element_rect(color = "limegreen", linewidth = 2))

📊 Model Comparison Summary Table

Metric	adjr2 = FALSE (All 12 Predictors)	adjr2 = TRUE (Best Subset of 7 Predictors)
Adjusted R-squared	0.6145	0.6531 ✅ (higher is better)
Multiple R-squared	0.7018	0.699
Residual Std. Error	0.02388	0.02265 ✅ (lower is better)
F-statistic (p-value)	8.042 (1.88e-07)	15.26 (3.99e-10) ✅ (stronger model)
Model Complexity	12 predictors	7 predictors ✅ (simpler, more robust)
Significant Coefficients	4	6 ✅ (more of the retained terms are significant)
R² Difference (full minus subset)	n/a	~0.003 ❗ (negligible)
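
The gap between the two R² rows comes from the adjustment itself: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. As a rough check, the sketch below recomputes the table's adjusted values from its raw R² values. The sample size n = 54 is an assumption, inferred from the reported figures rather than stated in the text, and the helper name adjusted_r2 is illustrative.

```r
# Illustrative helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (intercept not counted).
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 54  # assumed sample size, inferred from the reported statistics

full_adj   <- adjusted_r2(0.7018, n, p = 12)
subset_adj <- adjusted_r2(0.699,  n, p = 7)

# Both land close to the table's 0.6145 and 0.6531; small gaps come from
# the rounded R-squared inputs.
print(round(c(full = full_adj, subset = subset_adj), 3))
```

Under that assumed n, the penalty factor (n - 1)/(n - p - 1) is 53/41 with 12 predictors against 53/46 with 7, which is why nearly identical R² values produce a visibly higher adjusted R² for the smaller model.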

✅ Best Practice Tips

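The olr() function automates model selection by fitting a linear model for every combination of the supplied predictors and returning the one with the highest R² or adjusted R².
Use adjr2 = TRUE to prioritize models that balance fit and parsimony. Because R² never decreases when a predictor is added, adjr2 = FALSE will in general return the full model, as it does here.
A small drop in raw R² is acceptable when adjusted R² rises: the dropped predictors contributed less explanatory power than the degrees of freedom they cost.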
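
The idea of testing every predictor combination can be sketched in a few lines of base R. This is an illustration of exhaustive subset search in general, not olr's actual source code: the function name best_subset_lm is hypothetical, and the built-in mtcars data is used so the snippet runs on its own.

```r
# Illustrative exhaustive subset search (not olr's implementation):
# fit lm() on every non-empty predictor subset and keep the best-scoring fit.
best_subset_lm <- function(data, response, predictors, adjr2 = TRUE) {
  metric <- if (adjr2) "adj.r.squared" else "r.squared"
  best_fit <- NULL
  best_score <- -Inf
  for (k in seq_along(predictors)) {
    # All subsets of size k, returned as a list of character vectors
    for (vars in combn(predictors, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = response), data = data)
      score <- summary(fit)[[metric]]
      if (score > best_score) {
        best_score <- score
        best_fit <- fit
      }
    }
  }
  best_fit
}

candidates <- c("wt", "hp", "qsec", "drat")
best_adj <- best_subset_lm(mtcars, "mpg", candidates, adjr2 = TRUE)
best_raw <- best_subset_lm(mtcars, "mpg", candidates, adjr2 = FALSE)

print(formula(best_adj))
print(names(coef(best_raw)))
```

A search of this kind fits 2^p - 1 models for p predictors, so the 12 predictors used above would mean 4095 fits, and the count doubles with each predictor added.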

📌 Summary

The adjusted R² model outperformed the full model on:

- Adjusted R²
- F-statistic
- Residual standard error
- Model simplicity
- Number of significant coefficients

👉 In practice, prefer adjusted R² (adjr2 = TRUE) to reduce the risk of overfitting and keep the model interpretable.

Created by Mathew Fok • Author of the olr package

Contact: quiksilver67213@yahoo.com