Regression Exercise: Did Southwest lower airfares?
Use route-level airline data to move from a simple comparison to a controlled regression. The goal is not just to run a model; it is to explain what the coefficient means in a business decision.
Estimate the marginal effect of Southwest presence on fares
The headline question is simple: are fares lower on routes where Southwest operates? The statistical challenge is that fares also depend on route distance and competition. This exercise shows how regression helps compare routes that are more similar on those dimensions.
What one row means
Each row is a route, identified by an origin and destination airport. The outcome is the average fare. The key variable is whether Southwest serves the route.
| Variable | Meaning | Type |
|---|---|---|
| Origin, Dest | Airport pair for the route | Categorical |
| SouthWest | 1 if Southwest operates on the route, 0 otherwise | Binary / treatment indicator |
| DISTANCE | Route distance in miles | Numeric control |
| Airlines | Number of airlines serving the route | Numeric control / competition |
| Fare | Average airfare in dollars | Numeric outcome |
Decision question template
Action / treatment: Southwest operates on a route.
Outcome: average fare.
Unit: airport route.
Comparison: routes with and without Southwest, adjusted for distance and competition.
Look before modeling
Fare distribution
A few long, expensive routes create a right tail. That is one reason a log-fare model can be useful later.
Distance and fare
Start with the raw difference
What this comparison says
Fare = β₀ + β₁ × SouthWest
In a simple regression with only the Southwest indicator, β₁ equals the raw difference in mean fares between routes with and without Southwest, and β₀ equals the mean fare on routes without Southwest.
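A minimal sketch of that equivalence, using six made-up routes rather than the course dataset. The fare values and array names here are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-in data (NOT the course dataset): six routes.
southwest = np.array([0, 0, 0, 1, 1, 1], dtype=float)
fare = np.array([300.0, 260.0, 340.0, 180.0, 150.0, 210.0])

# Raw difference in group means: Southwest routes minus other routes.
mean_diff = fare[southwest == 1].mean() - fare[southwest == 0].mean()

# Simple regression Fare = b0 + b1 * SouthWest, solved by least squares.
X = np.column_stack([np.ones_like(southwest), southwest])
(b0, b1), *_ = np.linalg.lstsq(X, fare, rcond=None)

print(f"difference in means: {mean_diff:.2f}")
print(f"regression slope b1: {b1:.2f}")
print(f"intercept b0:        {b0:.2f}")
```

The slope and the difference in means agree up to floating-point error, which is why the simple regression adds no adjustment beyond the raw comparison.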
Control for distance and competition
Multiple regression asks: among routes with similar distance and similar number of airlines, how much lower or higher are fares when Southwest is present?
Fare = β₀ + β₁ × SouthWest + β₂ × DISTANCE + β₃ × Airlines
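The sketch below fits both models on simulated routes to show how adding controls can change the Southwest coefficient. Everything about the simulation is an assumption: Southwest is made more likely on short routes, and the "true" effect is set to $50 lower. The helper `ols` is a hypothetical convenience function, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated routes (assumption): Southwest is more common on short routes.
distance = rng.uniform(200, 2500, n)
airlines = rng.integers(1, 7, n).astype(float)
p_sw = np.where(distance < 1000, 0.6, 0.15)
southwest = (rng.random(n) < p_sw).astype(float)

# Assumed data-generating model: Southwest lowers fares by $50
# at a fixed distance and number of airlines.
fare = (120 - 50 * southwest + 0.10 * distance
        - 8 * airlines + rng.normal(0, 20, n))

def ols(y, *columns):
    """Least-squares coefficients with an intercept in position 0."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

raw = ols(fare, southwest)                        # Fare ~ SouthWest
controlled = ols(fare, southwest, distance, airlines)

print(f"raw Southwest coefficient:        {raw[1]:.1f}")
print(f"controlled Southwest coefficient: {controlled[1]:.1f}")
```

In this simulation the raw coefficient is more negative than the controlled one because short routes are both cheaper and more likely to have Southwest. Whether the same mechanism explains the real data is something to check against the actual route distances.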
Coefficient table
Least squares intuition lab
Regression finds the coefficient values that minimize the sum of squared prediction errors. Move the sliders and watch the error change. Then compare your values with the OLS solution.
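The same idea can be checked by hand: pick a few candidate coefficient pairs, compute the sum of squared errors for each, and compare with the least-squares solution. The six routes and the guesses below are made up for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the course dataset).
southwest = np.array([0, 0, 0, 1, 1, 1], dtype=float)
fare = np.array([300.0, 260.0, 340.0, 180.0, 150.0, 210.0])

def sse(b0, b1):
    """Sum of squared prediction errors for Fare = b0 + b1 * SouthWest."""
    residuals = fare - (b0 + b1 * southwest)
    return float(np.sum(residuals ** 2))

# Least-squares solution for comparison.
X = np.column_stack([np.ones_like(southwest), southwest])
(b0_ols, b1_ols), *_ = np.linalg.lstsq(X, fare, rcond=None)

# Hand-picked guesses, like moving the sliders.
guesses = [(280.0, -80.0), (300.0, -100.0), (310.0, -130.0)]
for b0, b1 in guesses:
    print(f"b0={b0:6.1f}, b1={b1:7.1f} -> SSE={sse(b0, b1):9.1f}")
print(f"OLS: b0={b0_ols:.1f}, b1={b1_ols:.1f} -> SSE={sse(b0_ols, b1_ols):.1f}")
```

No guess can produce a smaller sum of squared errors than the OLS pair; that is the defining property of the least-squares fit.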
Prediction lab: compare two otherwise identical routes
Use the multiple regression model to estimate a fare for a hypothetical route. Then toggle Southwest to see the model-implied difference for the same distance and competition level.
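A small sketch of the toggle, assuming placeholder coefficients. Only the Southwest value echoes the "about $49 lower" figure quoted in Exercise 2; the intercept, distance, and airlines values are invented, and `predict_fare` is a hypothetical helper. Substitute the estimates from your own fitted model.

```python
# Placeholder coefficients (assumptions, not estimates from the course data).
coef = {
    "intercept": 150.0,
    "SouthWest": -49.0,   # echoes the approximate figure in Exercise 2
    "DISTANCE": 0.08,     # dollars per mile (invented)
    "Airlines": -10.0,    # dollars per additional airline (invented)
}

def predict_fare(southwest, distance, airlines):
    """Model-implied fare for one hypothetical route."""
    return (coef["intercept"]
            + coef["SouthWest"] * southwest
            + coef["DISTANCE"] * distance
            + coef["Airlines"] * airlines)

# Same distance and competition level, Southwest toggled off and on.
without_sw = predict_fare(southwest=0, distance=800, airlines=3)
with_sw = predict_fare(southwest=1, distance=800, airlines=3)

print(f"without Southwest: {without_sw:.2f}")
print(f"with Southwest:    {with_sw:.2f}")
print(f"difference:        {with_sw - without_sw:.2f}")
```

Because the model is linear with no interaction terms, the difference equals the Southwest coefficient for any distance and any number of airlines.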
Log model: percentage interpretation
The log model is useful when you want to talk in percentages rather than dollars, especially when distance and fares have long right tails.
log(Fare) = β₀ + β₁ × SouthWest + β₂ × log(DISTANCE) + β₃ × Airlines
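The sketch below fits this log specification on simulated routes and converts the coefficients to percentage statements. The simulated effect sizes are assumptions. For the binary Southwest indicator, the exact percentage change is 100 × (exp(β₁) − 1), and 100 × β₁ is a close approximation only when β₁ is small. The coefficient on log(DISTANCE) reads as an elasticity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated routes (assumption, NOT the course dataset).
distance = rng.uniform(200, 2500, n)
airlines = rng.integers(1, 7, n).astype(float)
southwest = (rng.random(n) < 0.3).astype(float)

# Assumed data-generating model on the log scale.
log_fare = (3.0 - 0.20 * southwest + 0.40 * np.log(distance)
            - 0.03 * airlines + rng.normal(0, 0.10, n))

# Fit log(Fare) = b0 + b1*SouthWest + b2*log(DISTANCE) + b3*Airlines.
X = np.column_stack([np.ones(n), southwest, np.log(distance), airlines])
coef, *_ = np.linalg.lstsq(X, log_fare, rcond=None)
b1, b2 = coef[1], coef[2]

# Exact percentage change in fare when SouthWest goes from 0 to 1.
pct_southwest = (np.exp(b1) - 1) * 100

print(f"b1 = {b1:.3f} -> fares about {pct_southwest:.1f}% different with Southwest")
print(f"b2 = {b2:.3f} -> a 1% longer route goes with about {b2:.2f}% higher fare")
```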
Mini exercises
Exercise 1: Interpret the Southwest coefficient
The multiple regression coefficient on Southwest is shown in the coefficient table. Write a one-sentence business interpretation that includes the phrase “holding distance and competition constant.”
Exercise 2: Why did the Southwest estimate shrink?
The raw comparison shows fares about $142 lower on routes with Southwest, while the controlled regression estimate is about $49 lower. What does the gap between the two estimates tell you?
Exercise 3: Causal caution
Can we say Southwest caused fares to fall by this amount? What additional evidence would strengthen the claim?
Exercise 4: Write a short managerial memo
Write three sentences: finding, interpretation, and caution. Use the template below or your own wording.
Copyable AI tutor prompt
This prompt is adapted for students who want a step-by-step tutor. Download the CSV embedded on this page, upload it to ChatGPT, and paste the prompt.
I've uploaded a file called southwest_regression_data.csv with airline route data. Please guide me step-by-step through a regression tutorial using this data.

Learning objectives:
- Explore the data and visualize fare, distance, and Southwest presence.
- Compare average fares with and without Southwest.
- Explain why a raw difference can be misleading when routes differ in distance and competition.
- Run a simple regression: Fare = b0 + b1(SouthWest).
- Run a multiple regression: Fare = b0 + b1(SouthWest) + b2(DISTANCE) + b3(Airlines).
- Interpret the coefficients in plain business language.
- Explain correlation vs causation and what identification evidence would be needed for a causal claim.
- Run a log model: log(Fare) = b0 + b1(SouthWest) + b2(log(DISTANCE)) + b3(Airlines), and interpret percentage effects.

Please pause after each major step and wait for me to type "next" before continuing. Always use the real numbers from the uploaded data.