I am interested in isolating which NHL players shoot the puck well, and which NHL goaltenders
do a good job at preventing shots from becoming goals. To that end I have fit a regression
model which replicates some of the simple features of shooting and saving. Throughout this
article, when I say "shot" I will mean "unblocked shot", that is, goals, saves, and misses
(including shots that hit the post or the crossbar). Furthermore, when I talk of shooting
talent, I mean the ability to score *more than one would expect given the shot location*,
so a player may well take a lot of shots from great scoring locations and still be "a bad
shooter" in some sense. Generating many such shots is obviously desirable and surely can be
done more often by talented players, but I do not consider any such talents to be part of
*shooting* talent, which is (half of) the subject of this article.

In contrast to last year's model, which used only 5v5 shots, I now also use 5v4 shots. Although the two situations are not the same, the fundamental mechanics of shooting and saving are similar in both.

Given that a shot is taken by a given player from a certain spot against a specific goaltender, I estimate the probability that the shot will be a goal. This process is modelled with a linear model and fit with a generalized ridge logistic regression. For a detailed exposition of how such models can be fit, please see Section 5. Briefly: I use a design matrix \(X\) in which every row is a shot, with the following columns:

- An indicator for the shooter;
- An indicator for the goaltender;
- A set of indicators for shot type, where wrist and snap shots are (together, undistinguished) taken as the "base" shot type, and indicator variables are set to 1 for slap shots, backhands, wraparounds, and tips (including deflections);
- An indicator for "rush shots", that is, shots for which the previous recorded play-by-play event is in a different zone and no more than four seconds prior;
- An indicator for "rebound shots", that is, shots for which the previous recorded play-by-play event is another shot taken by the same team no more than three seconds prior;
- The distance from the shot location to the net, divided by 89 ft, making the intersection of the red line and the split line distance "1" and shots from immediately in front of the net close to zero;
- The "visible net", that is, the width of the net projected onto the plane which is square to the shooter, divided by six feet. For shots from the split line, the visible net has value 1, and for shots very close to the goal line, the visible net is close to 0;
- An indicator for teams which are leading and another for teams which are trailing, to be interpreted as representing the change in configurations surrounding shots compared to when teams are tied;
- An indicator for 5v4 shots, to be interpreted as the change compared to a similar shot taken at 5v5; and
- An intercept.
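The rush and rebound definitions in the list above can be sketched in a few lines of Python. The `Event` record and its field names are hypothetical, not taken from any particular play-by-play feed, and the sketch assumes that zones are recorded in a common frame so that "a different zone" is a simple inequality.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical play-by-play record; field names are illustrative only."""
    seconds: float           # elapsed game time of the event
    zone: str                # zone label in a common frame (assumption)
    team: str                # team credited with the event
    is_unblocked_shot: bool  # goal, save, or miss


def is_rush_shot(shot: Event, previous: Event) -> bool:
    # Previous event in a different zone, no more than four seconds prior.
    gap = shot.seconds - previous.seconds
    return previous.zone != shot.zone and 0 <= gap <= 4


def is_rebound_shot(shot: Event, previous: Event) -> bool:
    # Previous event is a shot by the same team, no more than three seconds prior.
    gap = shot.seconds - previous.seconds
    return (
        previous.is_unblocked_shot
        and previous.team == shot.team
        and 0 <= gap <= 3
    )
```

Under these definitions a shot is classified using only the single event immediately before it, which matches the wording of the list but is a simplification of whatever the real data pipeline does.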

I make a slightly unusual modification to shot distances; namely, shots which are recorded as coming from closer than ten feet are assigned a distance of 10ft. This is to stop small variations in shot location from having outsize effects on the regression, and also because it is close to the threshold of minimum human reaction time for goaltenders given typical NHL wrist-shot speeds.
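As a rough illustration of the two geometric columns, here is a minimal sketch. It assumes a coordinate system of my own choosing (the centre of the goal line as the origin, `x_ft` measured out from the goal line and `y_ft` measured sideways from the split line) and it assumes the projection is taken square to the line from the shooter to the centre of the net; the function and constant names are illustrative.

```python
import math

RED_LINE_DISTANCE_FT = 89.0  # goal line to the red line along the split line
MIN_DISTANCE_FT = 10.0       # shots recorded closer than this are moved out


def shot_features(x_ft: float, y_ft: float) -> tuple[float, float]:
    """Return (scaled distance, visible net) for a shot location.

    x_ft: distance out from the goal line (assumed coordinate convention).
    y_ft: sideways offset from the split line (assumed coordinate convention).
    """
    raw_distance = math.hypot(x_ft, y_ft)
    # Clamp very close shots to ten feet, then scale so the red line is 1.
    distance = max(raw_distance, MIN_DISTANCE_FT) / RED_LINE_DISTANCE_FT
    # A six-foot net seen at an angle has projected width 6 * |x| / distance;
    # dividing by six feet leaves |x| / distance.
    if raw_distance == 0:
        visible_net = 0.0  # arbitrary choice for a shot recorded at the origin
    else:
        visible_net = abs(x_ft) / raw_distance
    return distance, visible_net
```

Here the visible net is computed from the recorded location rather than the clamped one; whether the clamp should also affect the angle is a design choice the text does not settle.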

The observation vector \(Y\) is 1 for goals and 0 for saves or misses. The model itself is the usual linear one: $$ Y = X\beta $$ where \(\beta\) is the vector of covariate values.

Simple formulas for the \(\beta\) which maximizes the likelihood of this model do not seem to exist, but
we can still find it by iteratively
computing: $$ \beta_{n+1} = ( X^TX + \Lambda )^{-1} X^T ( X \beta_n +
Y - f(X,\beta_n) ) $$ where \(f(X,\beta)\) is the vector function whose entry at position \(i\) is
\((1 + \exp(-X_i\beta))^{-1}\), with \(X_i\) the \(i\)th row of \(X\) (this choice of \(f\) is what
makes the regression *logistic*). By starting with
\(\beta_0\) as the zero vector and iterating until convergence, I obtain estimates of shooter ability
and goaltending ability, with suitable modifications for shot location and type, as well as the score
and the skater strength.
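The iteration above is short enough to sketch directly with NumPy. This is a minimal illustration on dense arrays, not the code used to produce the results in this article; the function names, the tolerance, and the iteration cap are my own assumptions, and a real design matrix with one column per player would call for sparse matrices.

```python
import numpy as np


def logistic(z):
    """Entrywise logistic function, playing the role of f(X, beta)."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_ridge_logistic(X, Y, Lam, tol=1e-10, max_iter=10_000):
    """Iterate beta <- (X^T X + Lam)^{-1} X^T (X beta + Y - f(X, beta)).

    X: (shots, columns) design matrix; Y: 0/1 goal vector;
    Lam: (columns, columns) positive definite penalty matrix.
    tol and max_iter are illustrative stopping choices.
    """
    A = X.T @ X + Lam               # fixed across iterations
    beta = np.zeros(X.shape[1])     # beta_0 is the zero vector
    for _ in range(max_iter):
        rhs = X.T @ (X @ beta + Y - logistic(X @ beta))
        new_beta = np.linalg.solve(A, rhs)  # solve rather than invert
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


# Tiny synthetic example: an intercept plus two made-up covariates.
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-1.0, 0.5, -0.5])
Y_demo = (rng.random(200) < logistic(X_demo @ true_beta)).astype(float)
beta_demo = fit_ridge_logistic(X_demo, Y_demo, np.diag([0.001, 1.0, 1.0]))
```

At a fixed point of the iteration, \(\Lambda\beta = X^T(Y - f(X,\beta))\), which is a convenient check that the loop has actually converged.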

This model is zero-biased, which is to say that we consider deviations from average ability to be unlikely on their face and bias our results towards average. Another way of saying the same thing is that we begin with an assumption (held with a certain strength) that all players are of league average ability and then let the observed data slowly update our knowledge, instead of beginning with an assumption that we know nothing about the shooters and goaltenders at all. The bias is controlled by the matrix \(\Lambda\), which must be positive definite for the above formula to be well-defined and to yield the \(\beta\) which minimizes the total error. Similarly to my 5v5 shot rate model, I use a diagonal matrix, where the entries corresponding to goaltenders and shooters are \(\lambda = 100\) and those corresponding to all other columns are 0.001, that is, very close to zero. As for that model, the non-trivial \(\lambda\) values were chosen by varying \(\lambda\) and choosing a value where player estimates have stabilized.
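A diagonal \(\Lambda\) of this shape is simple to construct. The sketch below assumes, purely for illustration, that the columns are ordered shooters first, then goaltenders, then everything else; the function name and arguments are hypothetical.

```python
import numpy as np


def build_penalty(n_shooters, n_goalies, n_other,
                  player_lambda=100.0, other_lambda=0.001):
    """Diagonal penalty: heavy on player columns, nearly zero elsewhere.

    Assumes columns are ordered shooters, goaltenders, other covariates.
    """
    diagonal = np.concatenate([
        np.full(n_shooters, player_lambda),
        np.full(n_goalies, player_lambda),
        np.full(n_other, other_lambda),
    ])
    return np.diag(diagonal)


# Example: 3 shooters, 2 goaltenders, 12 non-player columns.
Lam = build_penalty(3, 2, 12)
```

Because every diagonal entry is strictly positive, the resulting matrix is positive definite, as the formula requires.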

(Incidentally, I do not yet see a way to convert this model, as the above 5v5 shot rate model has been, into a "chained" model where each year's results can be used as the prior for estimates formed after the following year. I would dearly like to do so, though.)

In the future, I will publish results for all seasons, but for now, I record the results of fitting this model on all of the 5v5 and 5v4 shots in the 2017-2019 regular seasons. First, the non-player covariates are:

Covariate | Value |
---|---|
Constant | `-2.59` |
Slapshot | `-0.0971` |
Tip/Deflection | `+0.185` |
Backhand | `+0.0915` |
Wraparound | `-0.284` |
Rush | `+0.0969` |
Rebound | `+0.944` |
Distance | `-3.34` |
Visible Net | `+0.606` |
Leading | `+0.166` |
Trailing | `+0.0106` |
5v4 | `+0.413` |

Logistic regression coefficient values can be difficult to interpret, but negative values always mean "less likely to become a goal" and positive values mean "more likely to become a goal". To compute the probability that a shot with a given description will become a goal, add up the values of all the model covariates that apply to the shot to obtain a number, and then apply the logistic function to it, that is, $$ x \mapsto \frac{1}{1 + \exp(-x)}$$ This function (after which the regression type is named) is very convenient for modelling probabilities, since it is monotonic and takes the midpoint of the number line (that is, zero) to 50%, while taking very large negative numbers to positive numbers close to zero and very large positive numbers to positive numbers close to one.

Thus, for instance, we might want to compute the goal probability of a 5v5 wrist shot from 30 feet out (just below the tops of the circles), in a tied game, on the split line, neither on the rush nor a rebound, taking the shooter and goaltender to be league average so that their terms are zero. To do this, begin with the constant value -2.59. We have encoded distance by dividing by 89, so we multiply 30/89 by the distance coefficient of -3.34 to obtain -1.13. From the split line, the visible net is 1, so we add +0.606. Wrist and snap shots are taken as the base category, so no shot type term needs to be added. Since the shot is neither a rush shot nor a rebound, and is taken while tied at 5v5, we have all the terms we need; adding them together gives -3.114. Applying the logistic function gives about 4.3%, somewhat below the historical percentage of six to eight percent from this area, as we'd expect, since most of the "special" things that could have described our shot would have increased its chance of becoming a goal.
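The same arithmetic can be written as a short Python sketch. The coefficients are the rounded values from the table above, so the result will only land near the figure quoted in the text; the dictionary keys and the function name are illustrative, and leaving out the shooter and goaltender terms amounts to assuming both are league average.

```python
import math

# Rounded non-player coefficients, copied from the table above.
COEFFICIENTS = {
    "constant": -2.59,
    "slapshot": -0.0971,
    "tip": 0.185,
    "backhand": 0.0915,
    "wraparound": -0.284,
    "rush": 0.0969,
    "rebound": 0.944,
    "distance": -3.34,
    "visible_net": 0.606,
    "leading": 0.166,
    "trailing": 0.0106,
    "5v4": 0.413,
}


def goal_probability(distance_ft, visible_net, flags=()):
    """Goal probability for an average shooter against an average goaltender.

    flags: names of the indicator covariates that apply, e.g. ("rush", "5v4").
    """
    scaled_distance = max(distance_ft, 10.0) / 89.0  # clamp, then scale
    total = (
        COEFFICIENTS["constant"]
        + COEFFICIENTS["distance"] * scaled_distance
        + COEFFICIENTS["visible_net"] * visible_net
        + sum(COEFFICIENTS[flag] for flag in flags)
    )
    return 1.0 / (1.0 + math.exp(-total))


# The worked example: 30 ft out on the split line, tied, 5v5, wrist shot.
p = goal_probability(30.0, 1.0)
print(f"{100 * p:.1f}%")
```

Passing `flags=("rebound",)` or `flags=("5v4",)` to the same call shows how much each circumstance moves the probability for an otherwise identical shot.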

The overall features of the model are more or less as expected: shots from farther away are less
likely to go in, seeing more of the net is good, rush shots are good, rebound shots are even
better. Power-play shots are substantially more likely to result in goals than equivalently-described
5v5 shots. The shot type terms are somewhat surprising to me, especially the negative term for
slapshots even after accounting for distance. Also mildly surprising is that *both* leading
and trailing increase the chance of a shot becoming a goal, suggesting that games do "open up"
when one team takes a lead, rather than the "losing teams dominate" pattern of score effects that
is familiar from work on shot rates. Also interesting is that leading improves
the goal odds of shots much more than trailing does, suggesting perhaps that teams with the
lead hold on to the puck a little more, preferring not to give up the puck unless they feel their
chances of scoring are higher.
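One way to compare these effects is through odds: in a logistic model, a coefficient \(c\) multiplies the odds of a goal by \(\exp(c)\). A small sketch, using the rounded table values and a helper name of my own choosing:

```python
import math


def odds_multiplier(coefficient):
    """Factor by which a logistic coefficient multiplies the odds of a goal."""
    return math.exp(coefficient)


# Rounded coefficients copied from the table above.
for name, value in [("Rebound", 0.944), ("5v4", 0.413),
                    ("Leading", 0.166), ("Trailing", 0.0106)]:
    print(f"{name}: odds x {odds_multiplier(value):.2f}")
```

Note that these are multipliers on odds, not on probabilities; for shots with low goal probability the two are close, but they are not the same thing.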

As the above example shows, the model can already be used without specifying shooters or goaltenders. However, this is perhaps a little boring. Below are the values for all the goaltenders who faced at least one shot in the 2017-2019 regular seasons. I've inverted the scale so that the better performances are at the top.

The scale is in the same units as for the non-player covariates above, so even the best or worst performances are smaller than the effect of a shot being a rush shot, for instance. This is consistent with goaltending performances being broadly similar across the league.

The results for forwards and defenders are presented in the same way; I've put them on separate pages for performance reasons.
