ABSTRACT

This chapter covers exploring a new dataset, summarizing numerical variables, anomalies in numerical data (including outliers and their detection), and visualizing relations between variables. It introduces a number of important anomalies that can occur in numerical data, with examples of some of their consequences. Another variable in the UScereal data frame, potassium, illustrates the potentially catastrophic problem of metadata errors. The chapter discusses various aspects of the problem of missing data and introduces the important ideas of systematic missing data and disguised missing data, the latter representing a particular type of metadata error. Probably the best-known automatic outlier detection procedure is the "three-sigma edit rule," known in the statistics literature by various names, including the extreme Studentized deviation (ESD) identifier. Replacing the mean with the median and the standard deviation with the MADM scale estimator gives rise to the Hampel identifier, which is much less sensitive to masking effects than the three-sigma rule.
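
The contrast between the two outlier detectors can be illustrated with a minimal sketch. It is written in Python rather than the R environment that the UScereal data frame belongs to, and it rests on several assumptions not stated in the abstract: the function names are hypothetical, the threshold t = 3 is applied to both rules, the MADM scale estimate is taken as 1.4826 times the median absolute deviation from the median (the usual normalization for Gaussian data), and the data values are made up purely to show masking.

```python
import statistics


def three_sigma_outliers(x, t=3.0):
    """Flag points more than t standard deviations from the mean."""
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)
    return [abs(v - mean) > t * sd for v in x]


def hampel_outliers(x, t=3.0):
    """Flag points more than t MADM scale units from the median."""
    med = statistics.median(x)
    # 1.4826 makes the MADM comparable to the standard deviation
    # for Gaussian data (assumed normalization).
    madm = 1.4826 * statistics.median([abs(v - med) for v in x])
    return [abs(v - med) > t * madm for v in x]


# Made-up illustrative values: eight points near 1 and two gross outliers.
values = [1.0, 1.1, 0.9, 1.2, 0.8, 1.0, 1.1, 0.9, 50.0, 60.0]

print(three_sigma_outliers(values))  # the outliers inflate the mean and sd
print(hampel_outliers(values))       # median and MADM are barely affected
```

With these values the two large points inflate the standard deviation enough that the three-sigma rule flags nothing, while the Hampel identifier flags both. Note that if more than half the values are identical, the MADM is zero and this sketch flags every point that differs from the median, so real data may need a guard for that case.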