SiZer for Smoothing Splines
Zhang Jin-Ting

Dept of Statistics and Applied Prob.

Nonparametric smoothing is a useful and time-honored visualization tool for understanding the structure of two-dimensional data sets. It is usually carried out with a smoother together with a smoothing parameter, and the smoothing parameter is usually chosen by some smoothing parameter selector. Very often, however, a smoother with a single smoothing parameter reveals only part of the story about the statistical model underlying the data. Some interesting features (e.g. bumps) that may be present in the data appear only at some "levels of resolution", i.e. for only some choices of the smoothing parameter. Such features can be made to disappear by over-smoothing, while many spurious features appear with under-smoothing. To overcome this problem, a whole family of smoothing parameters can be used via some well-designed device.

SiZer (SIgnificant ZERo crossings) is such a device. It combines statistical inference about which features in a smooth are really there with a novel visualization, the SiZer map, that makes the results quickly accessible, especially to non-experts. The SiZer proposed by Chaudhuri and Marron (1999) is based on the local linear smoother and is called SiZerLL; the one proposed by Marron and Zhang (2002) is based on the smoothing spline smoother and is called SiZerSS.

II An Example

Figure 1 presents a SiZerSS map for family expenditure on food as a function of family income, from a survey in the United Kingdom. The top panel shows the raw data points as green dots. A family of 11 smoothing splines with different smoothing parameters is shown as blue curves; together they represent a wide range of smoothing. The bottom panel is the SiZer map, which is the key to statistical inference. Rows of the SiZer map correspond to levels of smoothing, i.e. to the blue curves in the top panel. Columns of the SiZer map correspond to location (the same as the x-axis in the top panel). At each location and level of smoothing, SiZer uses a color that indicates simple simultaneous inference about the derivative, i.e. the slope, of the corresponding blue curve in the top panel.

When a 95% confidence interval for the slope is entirely above 0, the slope is significantly positive, and the color blue appears in the SiZer map. When the same confidence interval is entirely below 0, the slope is significantly negative, and the color red is used. When the confidence interval contains 0, there is no significant slope, and the intermediate color of purple is used. The final color in the SiZer map is gray. This indicates that there are not enough data points nearby at this level of smoothing, for trustworthy inference.
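The color rule above can be written down as a small function. This is only an illustrative sketch, not the SiZerSS code itself: the function name `sizer_color`, its arguments, and the example intervals are all hypothetical, and the confidence limits are assumed to have been computed elsewhere.

```python
def sizer_color(ci_lower, ci_upper, enough_data=True):
    """Map a confidence interval for the slope to a SiZer map color.

    ci_lower, ci_upper: limits of the confidence interval for the slope
    enough_data: False when too few points lie nearby at this smoothing level
    (all names here are hypothetical, chosen for illustration only)
    """
    if not enough_data:
        return "gray"    # not enough nearby data for trustworthy inference
    if ci_lower > 0:
        return "blue"    # interval entirely above 0: significantly increasing
    if ci_upper < 0:
        return "red"     # interval entirely below 0: significantly decreasing
    return "purple"      # interval contains 0: no significant slope


# Made-up intervals for four locations in one row of a SiZer map
intervals = [(0.2, 0.9, True), (-0.3, 0.4, True), (-0.8, -0.1, True), (0.1, 0.5, False)]
row = [sizer_color(lo, hi, ok) for lo, hi, ok in intervals]
print(row)  # ['blue', 'purple', 'red', 'gray']
```

Applying such a rule at every location and for every smoothing parameter fills in the whole map, one pixel at a time.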

The SiZerSS map in Figure 1 indicates that when the smoothing parameter (shown in log10 scale on the y-axis) is too small, the associated (lower) part of the SiZerSS map is gray, indicating that there are not enough data nearby for trustworthy inference. When the smoothing parameter is slightly larger, the associated (middle-lower) part of the SiZerSS map is purple, indicating that there is no significant slope at these smoothing parameters. When the smoothing parameter increases further, blue appears at the left end, purple in the middle, and red at the right end of the SiZerSS map, indicating that the smoothed curves at these smoothing parameters significantly increase and then significantly decrease. Therefore, the underlying curve cannot be monotone.
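The monotonicity argument amounts to checking whether a single row of the map contains both blue and red. A minimal sketch of that check follows, assuming each row is available as a list of color names; the function name `rules_out_monotone` and the example row are hypothetical.

```python
def rules_out_monotone(row_colors):
    """Return True if one row of a SiZer map shows both a significantly
    increasing (blue) and a significantly decreasing (red) region."""
    colors = set(row_colors)
    return "blue" in colors and "red" in colors


# Made-up row resembling the upper part of the map described above
row = ["blue", "blue", "purple", "purple", "red"]
print(rules_out_monotone(row))  # True
```

A row that is entirely purple or gray gives no evidence either way, so the check returns False for it.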

To see how the smoothed curves change across smoothing parameters, and how they relate to the parts of the SiZerSS map, we can use a movie associated with the SiZerSS map, namely a SiZerSS movie.

Associated with Figure 1 is the SiZerSS movie. As the smoothing parameter (on the y-axis of the SiZerSS map, in log10 scale) runs from the smallest end (representing extreme under-smoothing) to the largest end (representing extreme over-smoothing), the running line crossing the SiZerSS map indicates the current smoothing parameter and the associated slopes, while the running curve in the upper box shows the smoothed curve for that smoothing parameter. Here the SiZerSS movie is used to check whether a monotonicity assumption is satisfied by the underlying curve.

The SiZerSS movie can also be used to test whether a linear model fits the data. One example uses a real data set on the relationship between mileage and displacement for cars. The SiZerSS movie is computed for the residuals after the linear effect is removed from the data. It seems that the relationship is not linear at all.
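Removing the linear effect can be sketched as an ordinary least-squares line fit followed by taking residuals. The function name `linear_residuals` is hypothetical and the data are made up for illustration (they are not the car data); the residuals would then be passed to SiZerSS.

```python
def linear_residuals(x, y):
    """Fit y = intercept + slope * x by least squares and return the residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up curved data: a straight line cannot capture y = x^2
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]
res = linear_residuals(x, y)
print(res)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

If the linear model were adequate, the residuals would show no systematic trend and a SiZer map of them would be expected to show no significant slopes. Here the residuals fall and then rise, the kind of structure a SiZer map would flag as red followed by blue.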

IV More Details and Acknowledgements

More details about SiZerSS can be found in the paper "SiZer for Smoothing Splines", by J. S. Marron and J. T. Zhang (2002). This paper will appear soon in Computational Statistics.