Statistical Applications for Environmental Analysis and Risk Assessment

Statistics in Practice Series

Language: English

Author: Joseph Ofungwu

Statistical Applications for Environmental Analysis and Risk Assessment guides readers through real-world situations and the best statistical methods used to determine the nature and extent of the problem, evaluate the potential human health and ecological risks, and design and implement remedial systems as necessary. Featuring numerous worked examples using actual data and “ready-made” software scripts, Statistical Applications for Environmental Analysis and Risk Assessment also includes:

• Descriptions of basic statistical concepts and principles in an informal style that does not presume prior familiarity with the subject

• Detailed illustrations of statistical applications in the environmental and related water resources fields using real-world data in the contexts that would typically be encountered by practitioners

• Software scripts using the high-powered statistical software system R, supplemented by USEPA’s ProUCL and USDOE’s VSP software packages, all of which are freely available

• Coverage of frequent data sample issues, such as non-detects, outliers, skewness, and sustained and cyclical trends, that habitually plague environmental data samples

• Clear demonstrations of the crucial, but often overlooked, role of statistics in environmental sampling design and subsequent exposure risk assessment.

Preface xvii

Acknowledgments xix

1 Introduction 1

1.1 Introduction and Overview 1

1.2 The Aim of the Book: Get Involved! 2

1.3 The Approach and Style: Clarity, Clarity, Clarity 3

Part I Basic Statistical Measures and Concepts 5

2 Introduction to Software Packages Used in This Book 7

2.1 R 8

2.1.1 Helpful R Tips 9

2.1.2 Disadvantages of R 10

2.2 ProUCL 10

2.2.1 Helpful ProUCL Tips 11

2.2.2 Potential Deficiencies of ProUCL 12

2.3 Visual Sample Plan 12

2.4 DATAPLOT 13

2.4.1 Helpful Tips for Running DATAPLOT in Batch Mode 13

2.5 Kendall–Theil Robust Line 14

2.6 Minitab® 14

2.7 Microsoft Excel 15

3 Laboratory Detection Limits, Nondetects, and Data Analysis 17

3.1 Introduction and Overview 17

3.2 Types of Laboratory Data Detection Limits 18

3.3 Problems with Nondetects in Statistical Data Samples 19

3.4 Options for Addressing Nondetects in Data Analysis 20

3.4.1 Kaplan–Meier Estimation 21

3.4.2 Robust Regression on Order Statistics 22

3.4.3 Maximum Likelihood Estimation 23

4 Data Sample, Data Population, and Data Distribution 25

4.1 Introduction and Overview 25

4.2 Data Sample Versus Data Population or Universe 26

4.3 The Concept of a Distribution 27

4.3.1 The Concept of a Probability Distribution Function 28

4.3.2 Cumulative Probability Distribution and Empirical Cumulative Distribution Functions 31

4.4 Types of Distributions 34

4.4.1 Normal Distribution 34

4.4.1.1 Goodness-of-Fit (GOF) Tests for the Normal Distribution 40

4.4.1.2 Central Limit Theorem 48

4.4.2 Lognormal, Gamma, and Other Continuous Distributions 49

4.4.2.1 Gamma Distribution 51

4.4.2.2 Logistic Distribution 51

4.4.2.3 Other Continuous Distributions 52

4.4.3 Distributions Used in Inferential Statistics (Student’s t, Chi-Square, F) 53

4.4.3.1 Student’s t Distribution 53

4.4.3.2 Chi-Square Distribution 55

4.4.3.3 F Distribution 57

4.4.4 Discrete Distributions 57

4.4.4.1 Binomial Distribution 57

4.4.4.2 Poisson Distribution 61

Exercises 64

5 Graphics for Data Analysis and Presentation 67

5.1 Introduction and Overview 67

5.2 Graphics for Single Univariate Data Samples 68

5.2.1 Box and Whiskers Plot 68

5.2.2 Probability Plots (i.e., Quantile–Quantile Plots for Comparing a Data Sample to a Theoretical Distribution) 72

5.2.3 Quantile Plots 79

5.2.4 Histograms and Kernel Density Plots 82

5.3 Graphics for Two or More Univariate Data Samples 86

5.3.1 Quantile–Quantile Plots for Comparing Two Univariate Data Samples 86

5.3.2 Side-by-Side Box Plots 89

5.4 Graphics for Bivariate and Multivariate Data Samples 91

5.4.1 Graphical Data Analysis for Bivariate Data Samples 91

5.4.2 Graphical Data Analysis for Multivariate Data Samples 95

5.5 Graphics for Data Presentation 98

5.6 Data Smoothing 105

5.6.1 Moving Average and Moving Median Smoothing 105

5.6.2 Locally Weighted Scatterplot Smoothing (LOWESS or LOESS) 108

5.6.2.1 Smoothness Factor and the Degree of the Local Regression 109

5.6.2.2 Basic and Robust LOWESS Weighting Functions 109

5.6.2.3 LOESS Scatterplot Smoothing for Data with Multiple Variables 112

Exercises 113

6 Basic Statistical Measures: Descriptive or Summary Statistics 115

6.1 Introduction and Overview 115

6.2 Arithmetic Mean and Weighted Mean 116

6.3 Median and Other Robust Measures of Central Tendency 117

6.4 Standard Deviation, Variance, and Other Measures of Dispersion or Spread 119

6.4.1 Quantiles (Including Percentiles) 121

6.4.2 Robust Measures of Spread: Interquartile Range and Median Absolute Deviation 124

6.5 Skewness and Other Measures of Shape 124

6.6 Outliers 134

6.6.1 Tests for Outliers 135

6.7 Data Transformations 139

Exercises 141

Part II Statistical Procedures for Mostly Univariate Data 143

7 Statistical Intervals: Confidence, Tolerance, and Prediction Intervals 145

7.1 Introduction and Overview 145

7.2 Confidence Intervals 146

7.2.1 Parametric Confidence Intervals 151

7.2.1.1 Parametric Confidence Interval around the Arithmetic Mean or Median for Normally Distributed Data 151

7.2.1.2 Lognormal and Other Parametric Confidence Intervals 153

7.2.2 Nonparametric Confidence Intervals Around the Mean, Median, and Other Percentiles 154

7.2.3 Parametric Confidence Band Around a Trend Line 164

7.2.4 Nonparametric Confidence Band Around a Trend Line 166

7.3 Tolerance Intervals 168

7.3.1 Parametric Tolerance Intervals 169

7.3.2 Nonparametric Tolerance Intervals 170

7.4 Prediction Intervals 173

7.4.1 Parametric Prediction Intervals for Future Individual Values and Future Means 175

7.4.2 Nonparametric Prediction Intervals for Future Individual Values and Future Medians 176

7.5 Control Charts 178

Exercises 178

8 Tests of Hypothesis and Decision Making 181

8.1 Introduction and Overview 181

8.2 Basic Terminology and Procedures for Tests of Hypothesis 182

8.3 Type I and Type II Decision Errors, Statistical Power, and Interrelationships 190

8.4 The Problem with Multiple Tests or Comparisons: Site-Wide False Positive Error Rates 193

8.5 Tests for Equality of Variance 195

Exercises 199

9 Applications of Hypothesis Tests: Comparing Populations, Analysis of Variance 201

9.1 Introduction and Overview 201

9.2 Single Sample Tests 202

9.2.1 Parametric Single-Sample Tests: One-Sample t-Test and One-Sample Proportion Test 203

9.2.2 Nonparametric Single-Sample Tests: One-Sample Sign Test and One-Sample Wilcoxon Signed Rank Test 205

9.2.2.1 Nonparametric One-Sample Sign Test 206

9.2.2.2 Nonparametric One-Sample Wilcoxon Signed Rank Test 208

9.3 Two-Sample Tests 208

9.3.1 Parametric Two-Sample Tests 210

9.3.1.1 Parametric Two-Sample t-Test for Independent Populations 210

9.3.1.2 Parametric Two-Sample t-Test for Paired Populations 214

9.3.2 Nonparametric Two-Sample Tests 216

9.3.2.1 Nonparametric Wilcoxon Rank Sum Test for Two Independent Populations 216

9.3.2.2 Nonparametric Gehan Test for Two Independent Populations 220

9.3.2.3 Nonparametric Quantile Test for Two Independent Populations 221

9.3.2.4 Nonparametric Two-Sample Paired Sign Test and Paired Wilcoxon Signed Rank Test 222

9.4 Comparing Three or More Populations: Parametric ANOVA and Nonparametric Kruskal–Wallis Tests 227

9.4.1 Parametric One-Way ANOVA 228

9.4.1.1 Computation of Parametric One-Way ANOVA 230

9.4.2 Nonparametric One-Way ANOVA (Kruskal–Wallis Test) 235

9.4.3 Follow-Up or Post Hoc Comparisons After Parametric and Nonparametric One-Way ANOVA 238

9.4.4 Parametric and Nonparametric Two-Way and Multifactor ANOVA 244

Exercises 255

10 Trends, Autocorrelation, and Temporal Dependence 257

10.1 Introduction and Overview 257

10.2 Tests for Autocorrelation and Temporal Effects 258

10.2.1 Test for Autocorrelation Using the Sample Autocorrelation Function 259

10.2.2 Test for Autocorrelation Using the Rank Von Neumann Ratio Method 261

10.2.3 An Example on Site-Wide Temporal Effects 264

10.3 Tests for Trend 265

10.3.1 Parametric Test for Trends—Simple Linear Regression 266

10.3.2 Nonparametric Test for Trends—Mann–Kendall Test and Seasonal Mann–Kendall Test 271

10.3.3 Nonparametric Test for Trends—Theil–Sen Trend Test 273

10.4 Correcting Seasonality and Temporal Effects in the Data 279

10.4.1 Correcting Seasonality for a Single Data Series 280

10.4.2 Simultaneously Correcting Temporal Dependence for Multiple Data Sets 281

10.5 Effects of Exogenous Variables on Trend Tests 282

Exercises 285

Part III Statistical Procedures for Mostly Multivariate Data 287

11 Correlation, Covariance, Geostatistics 289

11.1 Introduction and Overview 289

11.2 Correlation and Covariance 290

11.2.1 Pearson’s Correlation Coefficient 292

11.2.2 Spearman’s and Kendall’s Correlation Coefficients 294

11.3 Introduction to Geostatistics 300

11.3.1 The Variogram or Covariogram 300

11.3.2 Kriging 302

11.3.3 A Note on Data Sample Size and Lag Distance Requirements 311

Exercises 312

12 Simple Linear Regression 315

12.1 Introduction and Overview 315

12.2 The Simple Linear Regression Model 316

12.2.1 The True or Population X–Y Relationship 317

12.2.2 The Estimated X–Y Relationship Based on a Data Sample 320

12.3 Basic Applications of Simple Linear Regression 324

12.3.1 Description and Graphical Review of the Data Sample for Regression 324

12.3.1.1 Computing the Regression 325

12.3.1.2 Interpreting the Regression Results 326

12.4 Verify Compliance with the Assumptions of Conventional Linear Regression 332

12.4.1 Assumptions of Linearity and Homoscedasticity 332

12.4.2 Assumption of Independence 334

12.4.3 Exogeneity Assumption, Normality of the Y Errors, and Absence of Outliers 337

12.5 Check the Regression Diagnostics for the Presence of Influential Data Points 339

12.6 Confidence Intervals for the Predicted Y Values 343

12.7 Regression for Left-Censored Data (Non-detects) 344

Exercises 349

13 Data Transformation Versus Generalized Linear Model 351

13.1 Introduction and Overview 351

13.2 Data Transformation 352

13.2.1 General Approach for Data Transformations 355

13.2.2 The Ladder of Powers 357

13.2.3 The Bulging Rule and Data Transformations for Regression Analysis 359

13.2.4 Facilitating Data Transformations Using Box–Cox Methods 366

13.2.5 Back-Transformation Bias and Other Issues with Data Transformation 367

13.2.5.1 Logarithmic Transformations 369

13.2.5.2 Other Transformations 370

13.2.6 Transformation Bias Correction 371

13.3 The Generalized Linear Model (GLM) and Applications for Regression 374

13.3.1 Components of the Generalized Linear Model and Inherent Limitations 374

13.3.2 Estimation and Hypothesis Tests of Significance for GLM Parameters 376

13.3.3 Deviance, Null Deviance, Residual Deviance, and Goodness of Fit 377

13.3.4 Diagnostics for GLM 379

13.3.5 Procedural Steps for Regression with GLM in R 380

13.4 Extension of Data Transformation and Generalized Linear Model to Multiple Regression 385

13.4.1 Data Transformation for Multiple Regression 385

13.4.2 Generalized Linear Models for Multiple Regression 387

Exercises 387

14 Robust Regression 391

14.1 Introduction and Overview 391

14.2 Kendall–Theil Robust Line 393

14.2.1 Computation of the Kendall–Theil Robust Line Regression 393

14.2.2 Test of Significance for the Kendall–Theil Robust Line 396

14.2.3 Bias Correction for Y Predictions by the Kendall–Theil Robust Line 397

14.3 Weighted Least Squares Regression 398

14.3.1 Procedure for Weighted Least Squares Regression for Known Variances of the Observations 399

14.4 Iteratively Reweighted Least Squares Regression 405

14.4.1 The Iteratively Reweighted Least Squares Procedure 409

14.5 Other Robust Regression Alternatives: Bounded Influence Methods 412

14.5.1 Least Absolute Deviation or Least Absolute Values 412

14.5.2 Quantile Regression 413

14.5.3 Least Median of Squares 413

14.5.4 Least Trimmed Squares 414

14.6 Robust Regression Methods for Multiple-Variable Data 416

Exercises 417

15 Multiple Linear Regression 419

15.1 Introduction and Overview 419

15.2 The Need for Multiple Regression 420

15.3 The Multiple Linear Regression (MLR) Model 421

15.4 The Estimated Multivariable X–Y Relationship Based on a Data Sample 422

15.5 Assumptions of Multiple Linear Regression 430

15.5.1 Linearity of the Relationship Between the Dependent and Explanatory Variables 431

15.5.2 Absence of Multicollinearity Among the Explanatory Variables 433

15.5.2.1 Potential Remedies for Multicollinearity 436

15.5.3 Homoscedasticity or Constancy of Variance of the Y Population Errors 439

15.5.4 Statistical Independence of the Y Population Errors 441

15.5.5 Exogeneity Assumption, Normality of the Y Errors, and Absence of Outliers 445

15.5.6 Absence of Variability or Errors in the Explanatory Variables 446

15.6 Hypothesis Tests for Reliability of the MLR Model 447

15.6.1 ANOVA F Test for Overall Significance of the Regression 447

15.6.1.1 A Note on ANOVA Tables 448

15.6.2 Partial t and Partial F Tests for Individual Regression Coefficients 452

15.6.3 Complete and Reduced Models 452

15.7 Confidence Intervals for the Regression Coefficients and Predicted Y Values 457

15.8 Coefficient of Multiple Correlation (R), Multiple Determination (R²), Adjusted R², and Partial Correlation Coefficients 458

15.8.1 Coefficient of Multiple Correlation (R) 458

15.8.2 Coefficient of Multiple Determination (R²) and Adjusted R² 459

15.8.3 Partial Correlations and Squared Partial Correlations 460

15.9 Regression Diagnostics 462

15.10 Model Interactions and Multiplicative Effects 467

15.10.1 The Multiple Linear Regression Interaction Model 467

15.10.2 Hypothesis Tests of the Interaction Terms for Significance 468

Exercises 474

16 Categorical Data Analysis 477

16.1 Introduction and Overview 477

16.2 Types of Variables and Associated Data 478

16.2.1 Quantitative Variables 479

16.2.2 Qualitative Variables 479

16.3 One-Way Analysis of Variance Regression Model 480

16.3.1 Interpretation of the Regression Results and ANOVA F-Test for Overall Significance of the Regression Model 485

16.4 Two-Way Analysis of Variance Regression Model with No Interactions 486

16.5 Two-Way Analysis of Variance Regression Model with Interactions 490

16.6 Analysis of Covariance Regression Model 491

Exercises 499

17 Model Building: Stepwise Regression and Best Subsets Regression 501

17.1 Introduction and Overview 501

17.2 Consequences of Inappropriate Variable Selection 502

17.3 Stepwise Regression Procedures 505

17.3.1 Advantages and Disadvantages of Stepwise Procedures 512

17.4 Subsets Regression 513

Exercises 522

18 Nonlinear Regression 525

18.1 Introduction and Overview 525

18.2 The Nonlinear Regression Model 526

18.3 Assumptions of Nonlinear Least Squares Regression 528

Exercises 545

Part IV Statistics in Environmental Sampling Design and Risk Assessment 547

19 Data Quality Objectives and Environmental Sampling Design 549

19.1 Introduction and Overview 549

19.2 Sampling Design 550

19.3 Sampling Plans 550

19.3.1 Simple Random Sampling 552

19.3.2 Systematic Sampling 554

19.3.3 Other Sampling Designs 556

19.4 Sample Size Determination 557

19.4.1 Types I and II Decision Errors 558

19.4.2 Variance and Gray Region 559

19.4.3 Width of the Gray Region 560

19.4.4 Computation of the Recommended Minimum Sample Size for Estimating the Population Mean or Median 561

19.4.4.1 Minimum Sample Size for Computing UCL95 on the Mean for Normally Distributed Data 562

19.4.4.2 Minimum Sample Size for Computing UCL95 on the Median for Nonnormally Distributed Data 564

19.4.5 Computation of the Recommended Minimum Sample Size for Comparing a Population Mean or Median with a Fixed Threshold Value 565

19.4.6 Computation of the Recommended Minimum Sample Size for Comparing the Population Means or Medians for Two Populations 568

Exercises 569

20 Determination of Background and Applications in Risk Assessment 571

20.1 Introduction and Overview 571

20.2 When Background Sampling Is Required and When It Is Not 572

20.3 Background Sampling Plans 572

20.4 Graphical and Quantitative Data Analysis for Site Versus Background Data Comparisons 573

20.5 Determination of Exposure Point Concentration and Contaminants of Potential Concern 583

Exercises 585

21 Statistics in Conventional and Probabilistic Risk Assessment 587

21.1 Introduction and Overview 587

21.2 Conventional or Point Risk Estimation 588

21.3 Probabilistic Risk Assessment Using Monte Carlo Simulation 594

Exercises 598

Appendix A: Software Scripts 599

Appendix B: Datasets 603

References 609

Answers for Exercises 613

Index 619

Joseph Ofungwu, PhD, is an environmental professional with over eighteen years of hands-on experience in environmental practice, including contaminant impact analysis, human health and ecological risk assessment, and pollutant fate and transport modeling in ambient air, soil, groundwater, and surface water. Dr. Ofungwu is also a Visiting Assistant Professor with the Urban Environmental Systems Management Program at Pratt Institute and teaches statistics courses for professional engineer license maintenance requirements.