Gaussian Smoothing

Many of my charts use a smoothing method of which I am very fond. It is extremely similar to kernel density estimation, with some slight differences for extra control. First, for notation, write b(x,s) for the bell curve (Gaussian) of unit area centred at x with scale s. As s shrinks to 0 the curve becomes an infinitely tall spike at x (a Dirac delta distribution); a very large s gives a broad, low curve which stretches considerably to both sides of x.

Let (X,Y) be a set of data points (x_i, y_i) that I want to graph a smoothed version of. In most examples, the x values will be some sort of time variable (seconds in a game, or game number in a season) and the y values will be some sort of hockey stat: minutes of ice time, goals, or shot rates, say. First, form a function f(x) which is the sum of the curves y_i * b(x_i, s), evaluated at x, over every data point (x_i, y_i) in (X,Y). I choose the scale parameter s differently depending on the application. Large values of s smear each data point substantially backwards and forwards in time; the result is a smoother function but a worse approximation to the raw data.
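A minimal sketch of this step in Python may make the construction concrete. It assumes the bell curve is a Gaussian and that s is its standard deviation, which the text does not state outright; the names bell and smooth and the sample data are invented for illustration.

```python
import math

def bell(t, x, s):
    """Unit-area Gaussian centred at x, evaluated at t.

    Assumption: the scale s is the standard deviation.
    """
    return math.exp(-0.5 * ((t - x) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def smooth(points, s):
    """Return f, where f(t) is the sum of y * bell(t, x, s) over all (x, y)."""
    def f(t):
        return sum(y * bell(t, x, s) for x, y in points)
    return f

# Hypothetical data: (game number, goals scored in that game).
points = [(1, 2), (2, 0), (3, 1), (4, 3), (5, 1)]

f_narrow = smooth(points, 0.5)  # small s: stays close to the raw data
f_wide = smooth(points, 3.0)    # large s: much smoother, much flatter

for game in range(1, 6):
    print(game, round(f_narrow(game), 3), round(f_wide(game), 3))
```

Because each bell curve has unit area, the total area under f equals the sum of the y values whatever s is; changing s only changes how that total is spread out.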


One of the reasons that I do not use moving averages is that they handle initial cases very awkwardly. Strictly speaking, there is no data for a five-game moving average until five games have been played, and plotting the raw data in the meantime wrongly suggests that the variation is high at first and then settles down, which is very misleading. However, the smoothed function f(x) above has non-zero values before the first possible value of x and after the last possible value of x, which is silly. Hence, if we have fixed values for the first and last possible x values (the 0th second and 3600th second of a regulation hockey game, for instance, or the first game and the 82nd game of a full season), then we can use these boundaries to simply "fold back" the data in question. Namely, given a starting value a and a final value b (the boundary, not to be confused with the bell curve b(x,s)), we can define g(x) = f(x) + f(2a-x) + f(2b-x) for x between a and b, where 2a-x and 2b-x are the mirror images of x in the two boundaries. This is the smoothed function of the data that I use. (Strictly speaking, this should be an infinite sum of repeated reflections, but unless a and b are very close together relative to s, the remaining terms are negligible and only these three are needed.)
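The fold-back can be sketched in Python as follows. Reflecting a position x in a boundary c sends it to 2c - x, so one reflection at each end adds two mirrored copies of f. As before, this assumes a Gaussian bell curve with s as its standard deviation; the function names and the sample data (seconds into a 3600-second game) are invented for illustration.

```python
import math

def bell(t, x, s):
    """Unit-area Gaussian centred at x, evaluated at t (s assumed to be the std. dev.)."""
    return math.exp(-0.5 * ((t - x) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def smooth(points, s):
    """Return f, where f(t) is the sum of y * bell(t, x, s) over all (x, y)."""
    def f(t):
        return sum(y * bell(t, x, s) for x, y in points)
    return f

def fold_back(f, a, b):
    """Fold the tails of f back inside [a, b] with one reflection per boundary."""
    def g(t):
        if t < a or t > b:
            return 0.0  # nothing is plotted outside the box
        # 2*a - t and 2*b - t are the mirror images of t in the two walls.
        return f(t) + f(2 * a - t) + f(2 * b - t)
    return g

# Hypothetical events: (second of the game, weight), two of them near the walls.
points = [(30, 1.0), (600, 2.0), (3590, 1.5)]
f = smooth(points, 60.0)
g = fold_back(f, 0.0, 3600.0)

# Midpoint-rule estimate of the area under g between the walls.
mass = sum(g(i + 0.5) for i in range(3600))
print(mass, sum(y for _, y in points))
```

The two printed numbers should agree closely: the weight that f leaks past either wall is returned inside by the reflected terms, so no data is lost at the edges.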

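Physically, you can think of this as making a box with very tall hard sides at a and b, and then each point (x,y) corresponds to a clump of wet sand of weight y being dropped at point x. If the sand is clumpy and flows slowly, it stays piled up near where it landed, which corresponds to a very low s; if the sand is fine and flows easily, it spreads out across the box, which corresponds to a very high s. Sand that reaches a wall cannot pass through it and piles back into the box, which is the folding back described above.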