## Sports Franchise Values

##### A 9Ware Workshop Regression Analysis Project
###### Do attendance and/or social media follower counts explain the huge variation in franchise values across Major League Baseball and the National Football League? We thought we'd find out.

Given our continuing fascination with data related to sports, business, and sports business, we decided to run some regression analysis to see whether fanbase counts can predict franchise values in MLB and the NFL. And if so, we wanted to know which part of the fanbase does a better job of it -- attendance or social media followers. We gathered the data from various sources* and started with a visual assessment -- scatterplots with linear regression lines and their 95% confidence intervals (shaded)...

[Figure: scatterplots of franchise value (in \$millions) vs. attendance and vs. social media followers, with separate MLB and NFL panels]

A few useful observations stand out here. First, MLB attendance appears to be a good indicator of franchise value. However, most of the NFL teams are clustered in a nearly circular pattern showing little linear connection with franchise value. This is not too surprising, because media rights are such a huge part of the NFL revenue model, especially compared to game day revenue from only eight or nine home games a season.

Now let's compare the social media follower charts. The MLB chart is hard to read as shown here because the Yankees' 15 million(!) followers push most of the other teams into the lower left corner, and even that 'bunching' appears more random than linear. On the other hand, the NFL followers chart does show promise for a meaningful linear relationship to franchise values.

Ok, let's get into some regression analysis. For each league we ran three regressions: (value ~ attendance), (value ~ followers), and (value ~ attendance & followers). Here are the results:

##### MLB Regression Results
```
----------
Regression: data\$value ~ data\$attendance
Residual standard error: 744.6 on 28 degrees of freedom
Multiple R-squared:  0.4371,	Adjusted R-squared:  0.417
F-statistic: 21.74 on 1 and 28 DF,  p-value: 6.968e-05

----------
Regression: data\$value ~ data\$all_followers
Residual standard error: 509.8 on 28 degrees of freedom
Multiple R-squared:  0.7362,	Adjusted R-squared:  0.7268
F-statistic: 78.14 on 1 and 28 DF,  p-value: 1.361e-09

----------
Regression: data\$value ~ data\$attendance + data\$all_followers
Residual standard error: 439.3 on 27 degrees of freedom
Multiple R-squared:  0.8111,	Adjusted R-squared:  0.7971
F-statistic: 57.96 on 2 and 27 DF,  p-value: 1.698e-10

--------------------
AIC Model Comparison:
              K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol       4 456.69       0.00   0.98   0.98 -223.54
followers     3 464.03       7.34   0.02   1.00 -228.55
attendance    3 486.76      30.08   0.00   1.00 -239.92
```
```
----------

Call:
lm(formula = data\$value ~ data\$attendance)

Residuals:
Min       1Q   Median       3Q      Max
-1182.13  -444.26   -41.18   251.85  2554.89

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     -42.72074  439.26954  -0.097    0.923
data\$attendance   0.08293    0.01779   4.663 6.97e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 744.6 on 28 degrees of freedom
Multiple R-squared:  0.4371,	Adjusted R-squared:  0.417
F-statistic: 21.74 on 1 and 28 DF,  p-value: 6.968e-05

----------

Call:
lm(formula = data\$value ~ data\$all_followers)

Residuals:
Min      1Q  Median      3Q     Max
-874.29 -216.36  -44.89  182.82 1734.62

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        7.181e+02  1.634e+02   4.396 0.000144 ***
data\$all_followers 2.997e-04  3.390e-05   8.840 1.36e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 509.8 on 28 degrees of freedom
Multiple R-squared:  0.7362,	Adjusted R-squared:  0.7268
F-statistic: 78.14 on 1 and 28 DF,  p-value: 1.361e-09

----------

Call:
lm(formula = data\$value ~ data\$attendance + data\$all_followers)

Residuals:
Min      1Q  Median      3Q     Max
-571.91 -338.04    3.01  170.80 1508.72

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.027e+00  2.592e+02   0.023  0.98162
data\$attendance    3.946e-02  1.206e-02   3.271  0.00293 **
data\$all_followers 2.455e-04  3.358e-05   7.311 7.29e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 439.3 on 27 degrees of freedom
Multiple R-squared:  0.8111,	Adjusted R-squared:  0.7971
F-statistic: 57.96 on 2 and 27 DF,  p-value: 1.698e-10

sma
744.6172  - Residual Standard Error
15524733  - Sum Squared Error
517491.1  - Mean Squared Error
0.4371058  - Multiple R-squared

sms
509.7522  - Residual Standard Error
7275725  - Sum Squared Error
242524.2  - Mean Squared Error
0.7361975  - Multiple R-squared

smas
439.2982  - Residual Standard Error
5210538  - Sum Squared Error
173684.6  - Mean Squared Error
0.8110768  - Multiple R-squared

Model selection based on AICc:
              K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol       4 456.69       0.00   0.98   0.98 -223.54
followers     3 464.03       7.34   0.02   1.00 -228.55
attendance    3 486.76      30.08   0.00   1.00 -239.92
```
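
The extra lines labelled `sma`, `sms` and `smas` at the end of the output above can be reproduced from each model's sum of squared errors. Here is a minimal Python sketch of that arithmetic (the output itself comes from R; the function names are our own, and we assume the printed "Mean Squared Error" is simply SSE divided by the number of teams, which is what the numbers are consistent with):

```python
import math

def residual_standard_error(sse, n, p):
    """RSE = sqrt(SSE / residual degrees of freedom), with df = n - p - 1."""
    return math.sqrt(sse / (n - p - 1))

def mean_squared_error(sse, n):
    """SSE averaged over all n teams (assumed: no degrees-of-freedom correction)."""
    return sse / n

def f_statistic(r2, n, p):
    """Overall F-statistic of a model with p predictors, computed from R-squared."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

N_MLB = 30  # teams; consistent with 28 residual df for the one-predictor models

# (label, SSE, multiple R-squared, number of predictors), copied from the output
mlb_models = [
    ("sma", 15524733, 0.4371058, 1),   # value ~ attendance
    ("sms", 7275725, 0.7361975, 1),    # value ~ followers
    ("smas", 5210538, 0.8110768, 2),   # value ~ attendance + followers
]

for label, sse, r2, p in mlb_models:
    rse = residual_standard_error(sse, N_MLB, p)
    mse = mean_squared_error(sse, N_MLB)
    f = f_statistic(r2, N_MLB, p)
    print(f"{label}: RSE={rse:.1f}  MSE={mse:.1f}  F={f:.2f}")
```

The same three functions apply to the NFL output below with 32 teams instead of 30.
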
##### NFL Regression Results
```
----------
Regression: data\$value ~ data\$attendance
Residual standard error: 752.3 on 30 degrees of freedom
Multiple R-squared:  0.3347,	Adjusted R-squared:  0.3125
F-statistic: 15.09 on 1 and 30 DF,  p-value: 0.0005236

----------
Regression: data\$value ~ data\$all_followers
Residual standard error: 681.4 on 30 degrees of freedom
Multiple R-squared:  0.4543,	Adjusted R-squared:  0.4361
F-statistic: 24.97 on 1 and 30 DF,  p-value: 2.349e-05

----------
Regression: data\$value ~ data\$attendance + data\$all_followers
Residual standard error: 640.4 on 29 degrees of freedom
Multiple R-squared:  0.5339,	Adjusted R-squared:  0.5018
F-statistic: 16.61 on 2 and 29 DF,  p-value: 1.557e-05

--------------------
AIC Model Comparison:
              K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol       4 510.72       0.00   0.76   0.76 -250.62
followers     3 513.15       2.43   0.23   0.99 -253.14
attendance    3 519.49       8.77   0.01   1.00 -256.31
```
```
----------

Call:
lm(formula = data\$value ~ data\$attendance)

Residuals:
Min       1Q   Median       3Q      Max
-1307.85  -473.77   -54.69   497.78  1585.54

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     -866.26473 1127.68150  -0.768 0.448384
data\$attendance    0.06504    0.01674   3.885 0.000524 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 752.3 on 30 degrees of freedom
Multiple R-squared:  0.3347,	Adjusted R-squared:  0.3125
F-statistic: 15.09 on 1 and 30 DF,  p-value: 0.0005236

----------

Call:
lm(formula = data\$value ~ data\$all_followers)

Residuals:
Min      1Q  Median      3Q     Max
-1119.2  -465.1  -128.4   237.6  1873.6

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.489e+03  2.327e+02  10.695 9.35e-12 ***
data\$all_followers 1.722e-04  3.447e-05   4.997 2.35e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 681.4 on 30 degrees of freedom
Multiple R-squared:  0.4543,	Adjusted R-squared:  0.4361
F-statistic: 24.97 on 1 and 30 DF,  p-value: 2.349e-05

----------

Call:
lm(formula = data\$value ~ data\$attendance + data\$all_followers)

Residuals:
Min      1Q  Median      3Q     Max
-906.17 -422.11  -20.93  308.58 1575.46

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        2.841e+02  1.014e+03   0.280  0.78132
data\$attendance    3.651e-02  1.640e-02   2.227  0.03389 *
data\$all_followers 1.312e-04  3.727e-05   3.521  0.00144 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 640.4 on 29 degrees of freedom
Multiple R-squared:  0.5339,	Adjusted R-squared:  0.5018
F-statistic: 16.61 on 2 and 29 DF,  p-value: 1.557e-05

sma
752.3046  - Residual Standard Error
16978867  - Sum Squared Error
530589.6  - Mean Squared Error
0.334681  - Multiple R-squared

sms
681.3525  - Residual Standard Error
13927236  - Sum Squared Error
435226.1  - Mean Squared Error
0.4542595  - Multiple R-squared

smas
640.4129  - Residual Standard Error
11893731  - Sum Squared Error
371679.1  - Mean Squared Error
0.5339426  - Multiple R-squared

Model selection based on AICc:
              K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol       4 510.72       0.00   0.76   0.76 -250.62
followers     3 513.15       2.43   0.23   0.99 -253.14
attendance    3 519.49       8.77   0.01   1.00 -256.31
```

There are many ways to assess and compare linear regression models. One of the clearest metrics for our purposes is R-Squared, the proportion of the variation in the dependent variable (franchise value) that is explained by the predictor variables (attendance, followers, or attendance & followers). Actually, we'll focus on Adjusted R-Squared, which penalizes R-Squared for the number of predictor variables used. Adjusted R-Squared is always less than or equal to R-Squared, and it drops when an added variable doesn't contribute enough to the fit.

| Regression | MLB | NFL |
|---|---|---|
| Value~Attendance | 0.4170 | 0.3125 |
| Value~Followers | 0.7268 | 0.4361 |
| Value~Att&Fol | 0.7971 | 0.5018 |
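
The adjustment is easy to check by hand. Below is a small Python sketch (our own helper, not part of the original R workflow) that applies the standard formula, Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), to the Multiple R-squared values printed in the output above. The team counts of 30 and 32 are inferred from the residual degrees of freedom.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# League sizes, consistent with the residual degrees of freedom in the output
N = {"MLB": 30, "NFL": 32}

# (league, model, multiple R-squared, number of predictors) from the output above
models = [
    ("MLB", "Value~Attendance", 0.4371058, 1),
    ("MLB", "Value~Followers", 0.7361975, 1),
    ("MLB", "Value~Att&Fol", 0.8110768, 2),
    ("NFL", "Value~Attendance", 0.334681, 1),
    ("NFL", "Value~Followers", 0.4542595, 1),
    ("NFL", "Value~Att&Fol", 0.5339426, 2),
]

for league, name, r2, p in models:
    print(f"{league} {name}: {adjusted_r2(r2, N[league], p):.4f}")
```

Note how the penalty grows with the number of predictors: the two-variable models give up a little more of their raw R-Squared than the one-variable models do.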

The green dashed line above marks the point where 70% of the variation in franchise value is explained by the model. For MLB teams, followers alone (0.7268) and the combination of attendance and followers (0.7971) both get us there. Most of the regression fit comes from followers, but adding the weaker attendance variable still meaningfully improves the model. We confirmed that with the Akaike information criterion (AIC) model comparison at the bottom of the output above, which reports the small-sample corrected version, AICc. AIC rewards models that fit the data well while penalizing each additional parameter, so the lowest score marks the best trade-off between fit and complexity.
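
For the curious, the AICc scores in the MLB comparison can be rebuilt from the sums of squared errors alone. The Python sketch below is our own reconstruction, not the code that produced the output. It assumes normally distributed errors and counts K as the number of regression coefficients plus one for the error variance, which matches the K column in the table.

```python
import math

def gaussian_log_likelihood(sse, n):
    """Maximized log-likelihood of a least-squares fit with normal errors."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(sse / n) + 1.0)

def aicc(sse, n, k):
    """Small-sample corrected AIC; k = coefficients + 1 for the error variance."""
    aic = 2 * k - 2 * gaussian_log_likelihood(sse, n)
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Relative support for each model, given a list of AICc scores."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]

n = 30  # MLB teams

# model name -> (SSE, K), with SSE copied from the MLB output above
mlb = {
    "att+fol": (5210538, 4),
    "followers": (7275725, 3),
    "attendance": (15524733, 3),
}

scores = {name: aicc(sse, n, k) for name, (sse, k) in mlb.items()}
weights = akaike_weights(list(scores.values()))

for (name, score), w in zip(scores.items(), weights):
    print(f"{name:10s} AICc={score:.2f}  weight={w:.2f}")
```

The weights are what the AICcWt column reports: given these three candidates, nearly all of the support goes to the attendance-plus-followers model.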

So, we're satisfied with our results on the MLB side of things -- that attendance and follower counts work together to sufficiently describe variations in franchise values (to the 70% benchmark we chose to target).

We have a different situation on the NFL side, where our best Adjusted R-Squared value indicates that we are barely explaining 50% of the variation in franchise values, even with the combination of attendance and follower counts. Taking another look at the initial visualizations, we noticed that the teams below the regression lines in both charts are generally in small markets, while the teams above the lines are generally in large markets. So we obtained metro area population data for NFL teams from www.census.gov to see if adding this as a third NFL predictor variable might improve things. Starting with a scatterplot of franchise values vs metro area populations...

Metro area population alone doesn't look too promising as a predictor variable, but adding it to the regression analysis could tell a different story, given the clear separation of small vs large markets shown in the earlier scatterplots.

##### NFL Regression Results with Metro Populations Added
```
----------
Regression: data\$value ~ data\$population
Residual standard error: 796.1 on 30 degrees of freedom
Multiple R-squared:  0.2549,	Adjusted R-squared:  0.2301
F-statistic: 10.26 on 1 and 30 DF,  p-value: 0.003208

----------
Regression: data\$value ~ data\$attendance + data\$population
Residual standard error: 654.4 on 29 degrees of freedom
Multiple R-squared:  0.5134,	Adjusted R-squared:  0.4798
F-statistic:  15.3 on 2 and 29 DF,  p-value: 2.913e-05

----------
Regression: data\$value ~ data\$all_followers + data\$population
Residual standard error: 504 on 29 degrees of freedom
Multiple R-squared:  0.7113,	Adjusted R-squared:  0.6914
F-statistic: 35.73 on 2 and 29 DF,  p-value: 1.499e-08

----------
Regression: data\$value ~ data\$attendance + data\$all_followers + data\$population
Residual standard error: 477.3 on 28 degrees of freedom
Multiple R-squared:   0.75,	Adjusted R-squared:  0.7232
F-statistic:    28 on 3 and 28 DF,  p-value: 1.412e-08

--------------------
AIC Model Comparison:
            K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol+pop 5 493.61       0.00   0.71   0.71 -240.65
fol+pop     4 495.39       1.78   0.29   1.00 -242.95
att+fol     4 510.72      17.11   0.00   1.00 -250.62
att+pop     4 512.10      18.49   0.00   1.00 -251.31
followers   3 513.15      19.53   0.00   1.00 -253.14
attendance  3 519.49      25.87   0.00   1.00 -256.31
population  3 523.11      29.50   0.00   1.00 -258.13
```
```
----------

Call:
lm(formula = data\$value ~ data\$population)

Residuals:
Min       1Q   Median       3Q      Max
-1296.16  -387.92    21.43   233.74  2815.99

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     2.986e+03  2.096e+02  14.250 6.79e-15 ***
data\$population 9.252e-05  2.888e-05   3.204  0.00321 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 796.1 on 30 degrees of freedom
Multiple R-squared:  0.2549,	Adjusted R-squared:  0.2301
F-statistic: 10.26 on 1 and 30 DF,  p-value: 0.003208

----------

Call:
lm(formula = data\$value ~ data\$attendance + data\$population)

Residuals:
Min       1Q   Median       3Q      Max
-1053.09  -466.26   -73.81   349.40  1617.12

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)     -8.042e+02  9.811e+02  -0.820 0.419064
data\$attendance  5.781e-02  1.473e-02   3.925 0.000491 ***
data\$population  7.835e-05  2.401e-05   3.263 0.002821 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 654.4 on 29 degrees of freedom
Multiple R-squared:  0.5134,	Adjusted R-squared:  0.4798
F-statistic:  15.3 on 2 and 29 DF,  p-value: 2.913e-05

----------

Call:
lm(formula = data\$value ~ data\$all_followers + data\$population)

Residuals:
Min     1Q Median     3Q    Max
-838.2 -321.0  -29.2  285.7 1139.6

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        1.987e+03  1.985e+02  10.012 6.43e-11 ***
data\$all_followers 1.726e-04  2.550e-05   6.771 1.97e-07 ***
data\$population    9.291e-05  1.828e-05   5.082 2.02e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 504 on 29 degrees of freedom
Multiple R-squared:  0.7113,	Adjusted R-squared:  0.6914
F-statistic: 35.73 on 2 and 29 DF,  p-value: 1.499e-08

----------

Call:
lm(formula = data\$value ~ data\$attendance + data\$all_followers +
data\$population)

Residuals:
Min      1Q  Median      3Q     Max
-623.22 -344.98  -96.32  270.66 1113.25

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.610e+02  7.566e+02   0.609   0.5472
data\$attendance    2.584e-02  1.241e-02   2.082   0.0466 *
data\$all_followers 1.436e-04  2.789e-05   5.148 1.85e-05 ***
data\$population    8.651e-05  1.759e-05   4.920 3.45e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 477.3 on 28 degrees of freedom
Multiple R-squared:   0.75,	Adjusted R-squared:  0.7232
F-statistic:    28 on 3 and 28 DF,  p-value: 1.412e-08

mp
796.1272  - Residual Standard Error
19014554  - Sum Squared Error
594204.8  - Mean Squared Error
0.2549123  - Multiple R-squared

msp
504.017  - Residual Standard Error
7366961  - Sum Squared Error
230217.5  - Mean Squared Error
0.7113247  - Multiple R-squared

map
654.3906  - Residual Standard Error
12418583  - Sum Squared Error
388080.7  - Mean Squared Error
0.5133763  - Multiple R-squared

masp
477.3202  - Residual Standard Error
6379368  - Sum Squared Error
199355.3  - Mean Squared Error
0.7500236  - Multiple R-squared

Model selection based on AICc:
            K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol+pop 5 493.61       0.00   0.71   0.71 -240.65
fol+pop     4 495.39       1.78   0.29   1.00 -242.95
att+fol     4 510.72      17.11   0.00   1.00 -250.62
att+pop     4 512.10      18.49   0.00   1.00 -251.31
followers   3 513.15      19.53   0.00   1.00 -253.14
attendance  3 519.49      25.87   0.00   1.00 -256.31
population  3 523.11      29.50   0.00   1.00 -258.13
```

Here are the Adjusted R-Squared values again, including the four new NFL models that consider metro area populations...

| Regression | MLB | NFL |
|---|---|---|
| Value~Attendance | 0.4170 | 0.3125 |
| Value~Followers | 0.7268 | 0.4361 |
| Value~Att&Fol | 0.7971 | 0.5018 |
| Value~Pop | | 0.2301 |
| Value~Att&Pop | | 0.4798 |
| Value~Fol&Pop | | 0.6914 |
| Value~Att&Fol&Pop | | 0.7232 |

It's not surprising to see here that metro area population alone is indeed a lousy metric -- in fact, it turns out to be the worst predictor of all, explaining a mere 23% of the variance in NFL franchise values! However, it has clearly contributed to the predictive capability of the other NFL variables, significantly improving all of the prior models when added to them. The NFL model that combines attendance, followers and population has the highest Adjusted R-Squared value, and also proves to be the best based on AIC model analysis, as shown at the bottom of the data output above.
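
One way to put a number on how much population adds to each earlier NFL model is a partial F-test on the nested models. This is our own cross-check in Python, not something from the original output. It uses only the sums of squared errors printed above, and when a single variable is added, the square root of F should equal that variable's t value in the larger model.

```python
import math

def partial_f(sse_reduced, sse_full, n, p_full, q=1):
    """F-statistic for adding q predictors to a nested least-squares model."""
    return ((sse_reduced - sse_full) / q) / (sse_full / (n - p_full - 1))

N_NFL = 32  # teams

# base model -> (SSE without population, SSE with population,
#                predictors in the larger model,
#                reported t value of population in the larger model)
comparisons = {
    "attendance": (16978867, 12418583, 2, 3.263),
    "followers": (13927236, 7366961, 2, 5.082),
    "att+fol": (11893731, 6379368, 3, 4.920),
}

for name, (sse_r, sse_f, p, t_reported) in comparisons.items():
    f = partial_f(sse_r, sse_f, N_NFL, p)
    print(f"{name} + population: F={f:.2f}  sqrt(F)={math.sqrt(f):.3f}  "
          f"reported t={t_reported}")
```

We stop short of p-values here because the F distribution is not in Python's standard library; the reported `Pr(>|t|)` values for population in the output above already cover that.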

Thus, by adding metro area populations to the NFL model, we have now surpassed our 70% target for both leagues: average attendance and social media follower counts explain about 80% of the variation in MLB franchise values, and adding metro area population gets the NFL model to about 72%. Here are the two models we've chosen to best explain the variance in MLB and NFL franchise values, respectively:

##### Chosen MLB Regression
###### 79%+ of the variation in franchise values explained by the model
```
----------
Regression: data\$value ~ data\$attendance + data\$all_followers
Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.027e+00  2.592e+02   0.023  0.98162
data\$attendance    3.946e-02  1.206e-02   3.271  0.00293 **
data\$all_followers 2.455e-04  3.358e-05   7.311 7.29e-08 ***

---
Residual standard error: 439.3 on 27 degrees of freedom
Multiple R-squared:  0.8111,	Adjusted R-squared:  0.7971
F-statistic: 57.96 on 2 and 27 DF,  p-value: 1.698e-10

--------------------
AIC Model Comparison:
              K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol       4 456.69       0.00   0.98   0.98 -223.54
followers     3 464.03       7.34   0.02   1.00 -228.55
attendance    3 486.76      30.08   0.00   1.00 -239.92

```
##### Chosen NFL Regression
###### 72%+ of the variation in franchise values explained by the model
```
----------
Regression: data\$value ~ data\$attendance + data\$all_followers + data\$population
Estimate Std. Error t value Pr(>|t|)
(Intercept)        4.610e+02  7.566e+02   0.609   0.5472
data\$attendance    2.584e-02  1.241e-02   2.082   0.0466 *
data\$all_followers 1.436e-04  2.789e-05   5.148 1.85e-05 ***
data\$population    8.651e-05  1.759e-05   4.920 3.45e-05 ***
---
Residual standard error: 477.3 on 28 degrees of freedom
Multiple R-squared:   0.75,	Adjusted R-squared:  0.7232
F-statistic:    28 on 3 and 28 DF,  p-value: 1.412e-08

--------------------
AIC Model Comparison:
            K   AICc Delta_AICc AICcWt Cum.Wt      LL
att+fol+pop 5 493.61       0.00   0.71   0.71 -240.65
fol+pop     4 495.39       1.78   0.29   1.00 -242.95
att+fol     4 510.72      17.11   0.00   1.00 -250.62
att+pop     4 512.10      18.49   0.00   1.00 -251.31
followers   3 513.15      19.53   0.00   1.00 -253.14
attendance  3 519.49      25.87   0.00   1.00 -256.31
population  3 523.11      29.50   0.00   1.00 -258.13
```

While we could obviously continue the regression process by gathering more data and trying out additional predictor variables to get to a more respectable Adjusted R-Squared level (say, beyond 90%), we'll stop here and examine the current results...

##### Interpreting The Results

###### MLB
For MLB, the coefficients of 3.946e-02 (attendance) and 2.455e-04 (followers) mean that, on average and with the other predictor held constant, franchise value rises \$1M for every
• 1/3.946e-02 = 25.3 additional fans in average attendance, and
• 1/2.455e-04 = 4,073 additional social media followers
###### NFL
For the NFL, the coefficients of 2.584e-02 (attendance), 1.436e-04 (followers) and 8.651e-05 (population) mean that, on average and with the other predictors held constant, franchise value rises \$1M for every
• 1/2.584e-02 = 38.7 additional fans in average attendance,
• 1/1.436e-04 = 6,964 additional social media followers, and
• 1/8.651e-05 = 11,559 additional metro area residents
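
The arithmetic behind these bullets, plus a fitted-value calculation, fits in a few lines of Python. This is a sketch with our own helper names. The coefficients are copied from the chosen models above, while the inputs for the example team are made up purely for illustration and do not describe any franchise in the data.

```python
# Coefficients of the chosen models, copied from the output above
# (franchise value in $ millions per one-unit change in each predictor).
MLB = {"intercept": 6.027, "attendance": 3.946e-02, "followers": 2.455e-04}
NFL = {"intercept": 461.0, "attendance": 2.584e-02, "followers": 1.436e-04,
       "population": 8.651e-05}

def units_per_million(coef):
    """Change in a predictor associated with +$1M of value, others held fixed."""
    return 1.0 / coef

def predict(model, **inputs):
    """Fitted franchise value ($ millions) for the given predictor values."""
    return model["intercept"] + sum(model[name] * value
                                    for name, value in inputs.items())

for league, model in (("MLB", MLB), ("NFL", NFL)):
    for name, coef in model.items():
        if name != "intercept":
            print(f"{league} {name}: {units_per_million(coef):,.1f} per $1M")

# Hypothetical MLB team (made-up inputs, not a team from the data set)
example = predict(MLB, attendance=30_000, followers=5_000_000)
print(f"Hypothetical MLB team: ${example:,.0f}M")
```

Keep in mind that a fitted value like this is only the model's average expectation; the residual standard errors above (roughly \$440M to \$480M) show how far individual teams can sit from the line.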

Taken together, these models explain roughly 70-80% of the variation in franchise values. If we were to continue the analysis, we would test more factors such as number of titles won, number of playoff appearances, years of existence, and of course MLB metro area population sizes, in order to keep improving the model and explain more of the variation while taking care not to overfit the data.

##### Wrap Up

This exercise is an example of linear regression analysis. We started with a logical premise -- in our case that attendance and social media follower counts should have some level of impact on franchise value variations, hopefully explaining at least 70% of these variations across teams. We first took a visual look at correlations on scatterplots, then we ran some regressions, reviewed the results, and re-inspected the scatterplots to see whether we had missed something relevant -- in our case that metro area populations might improve the model. Then we ran new regressions based on this re-inspection, reached reasonable 'final' models, and interpreted those results.

Perhaps most importantly, we didn't overfit the data or overthink the results. In most real-world cases, including this one, countless factors are involved in determining something like the value of a franchise, business, product, service, etc. Regression analysis can provide an important and useful partial perspective on the overall situation for your business.

If you made it this far, well, you must really like statistics. Thanks for reading!

*We used MLB and NFL attendance figures from ESPN, averaging per-game attendance across 2019 and 2021 (skipping 2020 because of the pandemic). For social media follower counts, we used statista.com. And finally, Forbes provided our franchise values.