Vishal Singh
Topic 2: Quantifying Metrics (Regression + controls)

Regression Exercise: Did Southwest lower airfares?

Use route-level airline data to move from a simple comparison to a controlled regression. The goal is not just to run a model; it is to explain what the coefficient means in a business decision.

Objective

Estimate the marginal effect of Southwest presence on fares

The headline question is simple: are fares lower on routes where Southwest operates? The statistical challenge is that fares also depend on route distance and competition. This exercise shows how regression helps compare routes that are more similar on those dimensions.

Language discipline: this dataset is observational. The regression estimates a controlled association. To call it a causal effect, we would need an identification argument explaining why Southwest presence is as-good-as-random after controls.
Data

What one row means

Each row is a route, identified by an origin and destination airport. The outcome is the average fare. The key variable is whether Southwest serves the route.

Variable | Meaning | Type
Origin, Dest | Airport pair for the route | Categorical
SouthWest | 1 if Southwest operates on the route, 0 otherwise | Binary / treatment indicator
DISTANCE | Route distance in miles | Numeric control
Airlines | Number of airlines serving the route | Numeric control / competition
Fare | Average airfare in dollars | Numeric outcome
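
To make the row structure concrete, here is a minimal sketch of reading the route data with Python's standard library. It assumes the CSV headers match the variable names in the table exactly (Origin, Dest, SouthWest, DISTANCE, Airlines, Fare); the `Route` class, the `read_routes` helper, and the two sample rows are hypothetical and only illustrate the shape of one row.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Route:
    origin: str
    dest: str
    southwest: int   # 1 if Southwest operates on the route, 0 otherwise
    distance: float  # route distance in miles
    airlines: int    # number of airlines serving the route
    fare: float      # average airfare in dollars


def read_routes(handle):
    """Parse route rows from an open CSV file; the header names are assumed."""
    routes = []
    for row in csv.DictReader(handle):
        southwest = int(row["SouthWest"])
        if southwest not in (0, 1):
            raise ValueError(f"SouthWest must be 0 or 1, got {southwest}")
        routes.append(Route(
            origin=row["Origin"],
            dest=row["Dest"],
            southwest=southwest,
            distance=float(row["DISTANCE"]),
            airlines=int(row["Airlines"]),
            fare=float(row["Fare"]),
        ))
    return routes


# Two made-up rows for illustration only; they are not from the real dataset.
SAMPLE_CSV = """Origin,Dest,SouthWest,DISTANCE,Airlines,Fare
AAA,BBB,1,500,3,120.0
CCC,DDD,0,1500,2,310.0
"""

sample_routes = read_routes(io.StringIO(SAMPLE_CSV))
print(sample_routes[0])
```

With the real file, the same helper could be called as `read_routes(open("southwest_regression_data.csv", newline=""))`, provided the headers match the assumption above.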

Decision question template

Action / treatment: Southwest operates on a route.

Outcome: average fare.

Unit: airport route.

Comparison: routes with and without Southwest, adjusted for distance and competition.

Exploration

Look before modeling

Fare distribution

A few long, expensive routes create a right tail. That is one reason a log-fare model can be useful later.

Distance and fare

Legend: No Southwest, Southwest
Step 1

Start with the raw difference

What this comparison says

On average, fares on routes with Southwest are about $142 lower than on routes without Southwest, before any adjustment for route characteristics.

Why it can mislead: Southwest routes in this dataset are not identical to non-Southwest routes. They differ in distance, competition, route mix, and possibly unmeasured factors such as airport demand and seasonality.

Fare = β₀ + β₁ × SouthWest

In a simple regression with only the Southwest indicator, β₁ equals the raw difference in group means.
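
The equivalence between β₁ and the difference in group means can be checked numerically. The sketch below uses six made-up fares (not the exercise data) and a hypothetical `simple_ols` helper that applies the textbook closed-form formulas for a one-regressor least-squares fit.

```python
def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Return (intercept, slope) for the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up fares for six routes; these are not the exercise data.
southwest = [0, 0, 0, 1, 1, 1]
fare = [300.0, 260.0, 340.0, 150.0, 170.0, 130.0]

b0, b1 = simple_ols(southwest, fare)
mean_without = mean([f for s, f in zip(southwest, fare) if s == 0])
mean_with = mean([f for s, f in zip(southwest, fare) if s == 1])
raw_gap = mean_with - mean_without

print("intercept b0:", b0, "| mean fare without Southwest:", mean_without)
print("slope b1:", b1, "| raw difference in means:", raw_gap)
```

In this toy example the intercept matches the mean fare without Southwest, and the slope matches the gap between the two group means.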

Step 2

Control for distance and competition

Multiple regression asks: among routes with similar distance and similar number of airlines, how much lower or higher are fares when Southwest is present?

Fare = β₀ + β₁ × SouthWest + β₂ × DISTANCE + β₃ × Airlines
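
As a rough illustration of what adding controls does, the sketch below fits this model by solving the normal equations in plain Python. The seven routes are made up: their fares are generated without noise from Fare = 100 − 40 × SouthWest + 0.1 × DISTANCE − 10 × Airlines, and the Southwest routes are deliberately shorter. The helpers `solve_linear_system` and `fit_ols` are hypothetical; in practice a statistics library would do this step and also report standard errors.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up routes: (SouthWest, DISTANCE, Airlines). Not the exercise data.
rows = [(1, 400, 3), (1, 600, 4), (1, 500, 2),
        (0, 1200, 2), (0, 1500, 3), (0, 900, 1), (0, 2000, 4)]
fare = [100 - 40 * sw + 0.1 * dist - 10 * n for sw, dist, n in rows]

with_sw = [f for (sw, _, _), f in zip(rows, fare) if sw == 1]
without_sw = [f for (sw, _, _), f in zip(rows, fare) if sw == 0]
raw_gap = sum(with_sw) / len(with_sw) - sum(without_sw) / len(without_sw)

b0, b_southwest, b_distance, b_airlines = fit_ols(rows, fare)
print("raw gap in mean fares:", round(raw_gap, 2))
print("controlled Southwest coefficient:", round(b_southwest, 2))
```

In this constructed example the raw gap is −135 while the controlled coefficient recovers the −40 built into the data, because much of the raw gap reflects the shorter distances of the Southwest routes.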

Coefficient table

Main interpretation: after controlling for distance and number of airlines, Southwest presence is associated with fares that are about $49 lower on average.
Interactive

Least squares intuition lab

Regression finds the coefficient values that minimize the sum of squared prediction errors. Move the sliders and watch the error change. Then compare your values with the OLS solution.

The lab reports four numbers: the current sum of squared errors, the OLS SSE, the current RMSE, and the OLS RMSE.
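
The slider comparison can be mimicked in a few lines. This sketch uses made-up fares and the one-regressor model Fare = b0 + b1 × SouthWest; the `sse`, `rmse`, and `ols_line` helpers are hypothetical, and RMSE is computed by dividing by n, which is an assumption about the lab's convention.

```python
import math


def sse(b0, b1, x, y):
    """Sum of squared prediction errors for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def rmse(b0, b1, x, y):
    """Root mean squared error, dividing by n (an assumed convention)."""
    return math.sqrt(sse(b0, b1, x, y) / len(y))


def ols_line(x, y):
    """Closed-form least-squares intercept and slope for one regressor."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up fares for six routes; these are not the exercise data.
southwest = [0, 0, 0, 1, 1, 1]
fare = [300.0, 260.0, 340.0, 150.0, 170.0, 130.0]

guess = (280.0, -120.0)            # a pair of "slider" values chosen by hand
ols = ols_line(southwest, fare)    # the least-squares solution

guess_sse = sse(*guess, southwest, fare)
ols_sse = sse(*ols, southwest, fare)
print("guess SSE:", guess_sse, "RMSE:", round(rmse(*guess, southwest, fare), 2))
print("OLS SSE:", ols_sse, "RMSE:", round(rmse(*ols, southwest, fare), 2))
```

Any hand-picked pair of coefficients gives an SSE at least as large as the OLS value, which is the point the sliders are meant to convey.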
Decision tool

Prediction lab: compare two otherwise identical routes

Use the multiple regression model to estimate a fare for a hypothetical route. Then toggle Southwest to see the model-implied difference for the same distance and competition level.

Predicted fare: from the multiple regression model.

Same route, no Southwest: counterfactual-style comparison from the model.

Same route, with Southwest: only the Southwest indicator changes.

Model-implied difference: with Southwest minus without Southwest.
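
The toggle logic can be sketched as follows. The `FareModel` class is hypothetical; only the −49 on the Southwest indicator echoes the "about $49 lower" figure quoted in the exercises, while the intercept, distance, and airlines coefficients are placeholders rather than estimates from the data.

```python
from dataclasses import dataclass


@dataclass
class FareModel:
    intercept: float
    southwest: float   # dollars added when Southwest is present
    distance: float    # dollars per mile
    airlines: float    # dollars per additional airline

    def predict(self, southwest, distance, airlines):
        """Model-implied fare for one hypothetical route."""
        return (self.intercept + self.southwest * southwest
                + self.distance * distance + self.airlines * airlines)


# Placeholder coefficients, except the -49 taken from the exercise text.
model = FareModel(intercept=100.0, southwest=-49.0, distance=0.1, airlines=-10.0)

without_sw = model.predict(southwest=0, distance=800, airlines=3)
with_sw = model.predict(southwest=1, distance=800, airlines=3)
difference = with_sw - without_sw

print("same route, no Southwest:", without_sw)
print("same route, with Southwest:", with_sw)
print("model-implied difference:", difference)
```

Because the model is linear with no interaction terms, the difference equals the Southwest coefficient at every distance and competition level; that is a property of the specification, not a finding about routes.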
Extension

Log model: percentage interpretation

The log model is useful when you want to talk in percentages rather than dollars, especially when distance and fares have long right tails.

log(Fare) = β₀ + β₁ × SouthWest + β₂ × log(DISTANCE) + β₃ × Airlines

Log-dummy interpretation: for a dummy variable in a log-outcome model, convert the coefficient β to a percentage change using exp(β) − 1. Here, applying that conversion to the Southwest coefficient gives the approximate percentage by which fares are lower on Southwest routes, holding the included controls constant.
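
A small sketch of the exp(β) − 1 conversion, using a hypothetical coefficient of −0.10 that is not the estimate from this dataset. The helper name `dummy_pct_effect` is illustrative.

```python
import math


def dummy_pct_effect(beta):
    """Percentage change in the outcome implied by a dummy coefficient
    in a log-outcome model: 100 * (exp(beta) - 1)."""
    return 100.0 * math.expm1(beta)


beta_southwest = -0.10  # hypothetical value for illustration only
exact_pct = dummy_pct_effect(beta_southwest)
naive_pct = 100.0 * beta_southwest  # reading the coefficient directly

print("exact percentage change:", round(exact_pct, 2))
print("naive reading of the coefficient:", round(naive_pct, 2))
```

For small coefficients the two readings are close; the gap widens as the coefficient grows in magnitude, which is why the conversion matters.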
Practice

Mini exercises

Exercise 1: Interpret the Southwest coefficient

The multiple regression coefficient on Southwest is shown in the coefficient table. Write a one-sentence business interpretation that includes the phrase “holding distance and competition constant.”

Routes served by Southwest have fares that are about $49 lower on average than otherwise similar routes, holding route distance and the number of airlines constant.
Exercise 2: Why did the Southwest estimate shrink?

The raw comparison is about $142 lower, while the controlled regression estimate is about $49 lower. What does that tell you?

Part of the raw difference was due to route differences, not Southwest alone. Once we account for distance and competition, the remaining association is smaller.
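
The size of the shrinkage can be made explicit with simple arithmetic on the two rounded figures quoted above. This is a descriptive split of the raw gap, not a causal decomposition, and the variable names are illustrative.

```python
raw_gap = -142.0         # raw difference in average fares, from the exercise text
controlled_gap = -49.0   # Southwest coefficient after controls, from the exercise text

# Portion of the raw gap that disappears once distance and competition are included.
accounted_for_by_controls = raw_gap - controlled_gap
share_accounted_for = accounted_for_by_controls / raw_gap

print("dollars accounted for by controls:", accounted_for_by_controls)
print("share of raw gap:", round(share_accounted_for, 2))
```

On these rounded numbers, roughly two thirds of the raw gap is associated with differences in distance and competition rather than with Southwest presence itself.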
Exercise 3: Causal caution

Can we say Southwest caused fares to fall by this amount? What additional evidence would strengthen the claim?

Not from this regression alone. Stronger evidence would come from random or quasi-random entry into routes, a difference-in-differences design around Southwest entry, or another identification strategy that addresses unobserved demand and route selection.
Exercise 4: Write a short managerial memo

Write three sentences: finding, interpretation, and caution. Use the template below or your own wording.

Optional

Copyable AI tutor prompt

This prompt is adapted for students who want a step-by-step tutor. Use the embedded CSV download from this page, upload it to ChatGPT, and paste the prompt.

I've uploaded a file called southwest_regression_data.csv with airline route data. Please guide me step-by-step through a regression tutorial using this data.

Learning objectives:
- Explore the data and visualize fare, distance, and Southwest presence.
- Compare average fares with and without Southwest.
- Explain why a raw difference can be misleading when routes differ in distance and competition.
- Run a simple regression: Fare = b0 + b1(SouthWest).
- Run a multiple regression: Fare = b0 + b1(SouthWest) + b2(DISTANCE) + b3(Airlines).
- Interpret the coefficients in plain business language.
- Explain correlation vs causation and what identification evidence would be needed for a causal claim.
- Run a log model: log(Fare) = b0 + b1(SouthWest) + b2(log(DISTANCE)) + b3(Airlines), and interpret percentage effects.

Please pause after each major step and wait for me to type "next" before continuing. Always use the real numbers from the uploaded data.