I am interested in isolating which NHL players shoot the puck well, and which NHL goaltenders
do a good job at preventing shots from becoming goals. To that end I have fit a regression
model which replicates some of the simple features of shooting and saving. Throughout this
article, when I say "shot" I will mean "unblocked shot", that is, goals, saves, and misses
(including shots that hit the post or the crossbar). Furthermore, when I talk of shooting
talent, I mean the ability to score *more than one would expect given the shot location*,
so a player may well take a lot of shots from great scoring locations and still be "a bad
shooter" in some sense. Generating many such shots is obviously desirable and surely can be
done more often by talented players, but I do not consider any such talents to be part of
*shooting* talent, which is (half of) the subject of this article.

Throughout, I'll be using only 5v5 shots, since I think the hockey assumptions underlying the model are only valid for a single strength state. However, one could presumably fit such a model (perhaps with slightly different tuning parameters) for 5v4 and even for 5v3, and then obtain aggregate estimates for players by combining their estimates from the different models.

Once a shot is being taken by a given player from a certain spot against a specific goaltender, I estimate the probability that such a shot will be a goal. This process is modelled with a generalized ridge logistic regression; for a detailed exposition please see Section 3. Briefly: I use a design matrix in which every row is a shot, with the following columns:

- An indicator for the shooter;
- An indicator for the goaltender;
- A set of indicators for shot type, where wrist and snap shots are (together, undistinguished) taken as the "base" shot type, and dummy variables are set to 1 for slap shots, backhands, wraparounds, and tips (including deflections);
- An indicator for "rush shots", that is, shots for which the previous recorded play-by-play event is in a different zone and no more than four seconds prior;
- An indicator for "rebound shots", that is, shots for which the previous recorded play-by-play event is another shot taken by the same team no more than three seconds prior;
- The distance from the shot location to the net, divided by 89 ft, so that the intersection of the blue line and the split line has distance "1";
- The "visible net", that is, the width of the net projected onto the plane which is square to the shooter, divided by six feet. For shots from the split line, the visible net has value 1, and for shots very close to the goal line, the visible net is close to 0; and
- An intercept.

I make a slightly unusual modification to shot distances: shots which are recorded as coming from closer than ten feet are assigned a distance of 10 ft. This stops small variations in shot location from having outsize effects on the regression; ten feet is also close to the threshold of minimum human reaction time for goaltenders given typical NHL wrist shot speeds.
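The two location covariates can be sketched in a few lines. This is only an illustration under stated assumptions: the coordinate convention (`depth_ft` measured straight out from the goal line, `lateral_ft` measured sideways from the split line), the function name, and the choice to project the net about its centre are mine, not taken from the model's actual code.

```python
import math

DISTANCE_SCALE_FT = 89.0   # divisor for shot distance (from the text)
MIN_DISTANCE_FT = 10.0     # shots recorded closer than this are treated as 10 ft


def shot_location_features(depth_ft, lateral_ft):
    """Return (scaled distance, visible net) for a shot location.

    Assumed convention: the centre of the net is the origin, depth_ft is
    the distance out from the goal line and lateral_ft is the offset from
    the split line.
    """
    raw_distance = math.hypot(depth_ft, lateral_ft)

    # Clamp very close shots to 10 ft before scaling.
    distance = max(raw_distance, MIN_DISTANCE_FT) / DISTANCE_SCALE_FT

    # A 6 ft net seen at an angle projects to 6 * depth / distance feet on
    # the plane square to the shooter; dividing by 6 ft leaves depth / distance.
    if raw_distance == 0.0:
        visible_net = 1.0  # arbitrary choice for a degenerate location
    else:
        visible_net = max(depth_ft, 0.0) / raw_distance

    return distance, visible_net


on_split_line = shot_location_features(30.0, 0.0)
in_tight = shot_location_features(5.0, 0.0)
on_goal_line = shot_location_features(0.0, 20.0)
```

A shot from the split line gets visible net 1, a shot from the goal line gets 0, and the five-foot shot is treated as a ten-foot shot for the distance column only.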

The observation is 1 for goals and 0 for saves or misses.
The model is fit by maximizing its likelihood, that is, for a given vector of model
coefficients \(\beta\), form the product of the predicted probabilities for all of the
events that *did* happen (90% chance of a save *here* times 15% chance of that goal
*there*, etc.). Large
products are awkward, so we solve the mathematically equivalent problem of maximizing the
*logarithm* of the likelihood, and before we do so we add a term of the form
\(-\beta^T\Lambda\beta\), where we use \(\Lambda\) to encode our prior knowledge, as
described below.
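As a rough illustration of the quantity being maximized, the sketch below computes the log-likelihood with the penalty term subtracted. It assumes a diagonal \(\Lambda\) passed as a plain list, and the function name is hypothetical; it is written for clarity rather than numerical robustness.

```python
import math


def penalized_log_likelihood(X, y, beta, lam_diag):
    """Log-likelihood of the observed outcomes minus beta^T Lambda beta.

    X is a list of rows, y a list of 0/1 outcomes, beta the coefficient
    list, and lam_diag the diagonal of Lambda (assumed diagonal here).
    """
    total = 0.0
    for row, outcome in zip(X, y):
        z = sum(x * b for x, b in zip(row, beta))
        p_goal = 1.0 / (1.0 + math.exp(-z))
        # Use the predicted probability of whatever actually happened.
        total += math.log(p_goal if outcome == 1 else 1.0 - p_goal)
    penalty = sum(lam * b * b for lam, b in zip(lam_diag, beta))
    return total - penalty


# The log turns the product of probabilities into a sum:
product_form = math.log(0.90 * 0.15)
sum_form = math.log(0.90) + math.log(0.15)

# With beta = 0 every shot is a coin flip and the penalty vanishes.
toy_value = penalized_log_likelihood([[1.0, 0.0], [1.0, 1.0]], [0, 1], [0.0, 0.0], [0.001, 100.0])
```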

Simple formulas for the \(\beta\) which maximizes this likelihood do not seem to exist, but
we can still find it by iteratively computing:
$$ \beta_{n+1} = ( X^TX + \Lambda )^{-1} X^T ( X \beta_n + Y - f(X,\beta_n) ) $$
where \(X\) is the design matrix, \(Y\) is the vector of observations, and \(f(X,\beta)\) is
the vector function whose entry at position \(i\) is
\((1 + \exp(-X_i\beta))^{-1}\), where \(X_i\) is the \(i\)'th row of \(X\) (this choice of \(f\) is what
makes the regression *logistic*). By starting with
\(\beta_0\) as the zero vector and iterating until convergence, I obtain estimates of shooter ability
and goaltending ability, with suitable modifications for shot location and type.
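The iteration above can be sketched with NumPy as follows. This is a minimal illustration, not the code used for the published results: the function names, the stopping tolerance, and the six-row toy data set are all assumptions, and a real fit over thousands of indicator columns would want sparse matrices.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_ridge_logistic(X, y, lam_diag, tol=1e-10, max_iter=10_000):
    """Iterate beta <- (X'X + Lambda)^-1 X'(X beta + y - f(X, beta)) from beta = 0."""
    A = X.T @ X + np.diag(lam_diag)       # fixed across iterations
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        rhs = X.T @ (X @ beta + y - logistic(X @ beta))
        new_beta = np.linalg.solve(A, rhs)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta


# Toy data (made up): an intercept column and one 0/1 covariate.
X = np.array([[1, 0], [1, 0], [1, 1], [1, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 1, 0, 1, 1], dtype=float)
lam_diag = np.array([0.001, 1.0])

beta_hat = fit_ridge_logistic(X, y, lam_diag)

# At a fixed point of the iteration, Lambda beta = X'(y - f(X, beta)).
residual = X.T @ (y - logistic(X @ beta_hat)) - lam_diag * beta_hat
```

Because \(X^TX + \Lambda\) does not change between steps, a larger implementation could factor it once and reuse the factorization at every iteration.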

This model is zero-biased, which is to say that we consider deviations from average ability to be on-their-face unlikely and bias our results towards average. Another way of saying the same thing is that we begin with an assumption (of a certain strength) that all players are of league-average ability and then let the observed data slowly update our knowledge, instead of beginning with an assumption that we know nothing about the shooters and goaltenders at all. The bias is controlled by the matrix \(\Lambda\), which must be positive definite for the above formula to give a well-defined solution, that is, for the resulting \(\beta\) to be the one which minimizes the total error. As in my 5v5 shot rate model, I use a diagonal matrix, where the entries corresponding to goaltenders and shooters are \(\lambda = 100\) and those corresponding to all other columns are 0.001, that is, very close to zero. As for that model, the non-trivial \(\lambda\) values were chosen by varying \(\lambda\) and choosing a value at which the player estimates have stabilized.
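For concreteness, the diagonal of \(\Lambda\) might be assembled like this. The column labels and the helper name are hypothetical; only the two penalty values come from the text.

```python
PLAYER_PENALTY = 100.0   # lambda for shooter and goaltender columns (from the text)
OTHER_PENALTY = 0.001    # near-zero penalty for every other column (from the text)


def penalty_diagonal(column_kinds):
    """Return the diagonal of Lambda, one entry per design-matrix column."""
    return [
        PLAYER_PENALTY if kind in ("shooter", "goaltender") else OTHER_PENALTY
        for kind in column_kinds
    ]


# Hypothetical layout: two shooters, one goaltender, three non-player columns.
kinds = ["shooter", "shooter", "goaltender", "rush", "distance", "intercept"]
lam_diag = penalty_diagonal(kinds)
```

Every entry is strictly positive, so the resulting diagonal matrix is positive definite, as the formula requires.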

In the future, I will publish results for all seasons, but for now, I record the results of fitting this model on all of the 5v5 shots in the 2016-2018 regular seasons. First, the non-player covariates are:

Covariate | Value |
---|---|
Constant | `-2.55` |
Slapshot | `+0.0836` |
Tip/Deflection | `-0.222` |
Backhand | `-0.175` |
Wraparound | `-0.300` |
Rush | `+0.228` |
Rebound | `+0.754` |
Distance | `-2.86` |
Visible Net | `+1.15` |

Logistic regression coefficient values can be difficult to interpret, but negative values always mean "less likely to become a goal" and positive values mean "more likely to become a goal". To compute the probability that a shot with a given description will become a goal, add up the coefficient values for all of the terms that apply to the shot (multiplying the distance and visible net coefficients by their covariate values) to obtain a number, and then apply the logistic function to it, that is, $$ x \mapsto \frac{1}{1 + \exp(-x)}$$ This function (after which the regression type is named) is very convenient for modelling probabilities, since it is monotone, takes the midpoint of the number line (that is, zero) to 50%, and takes very large negative numbers to positive numbers close to zero and very large positive numbers to positive numbers close to one.

Thus, for instance, we might want to compute the goal probability of a wrist shot from 30 feet out (just below the tops of the circles), on the split line, neither on the rush nor a rebound. To do this, begin with the constant value -2.55. We have encoded distance by dividing by 89, so we multiply 30/89 by the distance coefficient of -2.86 to obtain -0.964. From the split line, the visible net is 1, so we add +1.15. Wrist and snap shots are taken as the base category, so no shot type term needs to be added. Since the shot is neither a rush shot nor a rebound, we have all the terms we need; adding them together gives -2.364. Applying the logistic function gives 8.6%, close to the historical percentage of six to eight percent from this area.
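The same arithmetic can be written as a small function. This sketch uses only the non-player coefficients from the table above, so it implicitly assumes a league-average shooter and goaltender (player terms of zero); the function and dictionary names are mine.

```python
import math

# Non-player coefficients copied from the table above.
CONSTANT = -2.55
DISTANCE = -2.86
VISIBLE_NET = 1.15
RUSH = 0.228
REBOUND = 0.754
SHOT_TYPE = {
    "wrist": 0.0,        # base category (wrist and snap shots)
    "snap": 0.0,
    "slapshot": 0.0836,
    "tip": -0.222,
    "backhand": -0.175,
    "wraparound": -0.300,
}


def goal_probability(distance_ft, visible_net, shot_type="wrist", rush=False, rebound=False):
    """Goal probability for a shot, ignoring shooter and goaltender terms."""
    z = CONSTANT
    z += DISTANCE * max(distance_ft, 10.0) / 89.0   # clamp at 10 ft, scale by 89 ft
    z += VISIBLE_NET * visible_net
    z += SHOT_TYPE[shot_type]
    z += RUSH if rush else 0.0
    z += REBOUND if rebound else 0.0
    return 1.0 / (1.0 + math.exp(-z))


# The worked example: wrist shot, 30 ft out, on the split line.
p_example = goal_probability(30.0, 1.0)
print(f"{p_example:.1%}")  # 8.6%

# The same shot as a rebound, for comparison.
p_rebound = goal_probability(30.0, 1.0, rebound=True)
```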

The overall features of the model are more or less as expected: shots from farther away are less likely to go in, seeing more of the net is good, rush shots are good, and rebound shots are even better. The (very slight) positive value for slapshots and the negative value for tips and deflections may seem surprising at first; after all, slapshots score only rarely and tips score often. However, slapshots are systematically taken far from the net, and tips and deflections almost always from close to the net. After accounting for shot location there is almost no difference between wrist and slap shots. Tips are, after all, somewhat less precise than wrist shots, and the player tipping the prior shot generally isn't looking at the net.

As the above example shows, the model can already be used without specifying shooters or goaltenders. However, this is perhaps a little boring. Below are the values for all the goaltenders who faced at least one shot in the 2016-2018 regular seasons. I've inverted the scale so that the better performances are at the top.

The scale uses the same units as the non-player covariates above. Even the best or worst performances are smaller than the effect of a shot being a rush shot, for instance, which is consistent with goaltending performances being broadly similar across the league.

Similarly for forward and defender results, which I've put on separate pages for performance reasons.
