
This function identifies and handles outliers in a dataset using one or more detection methods: Z-Score, Modified Z-Score, and Inter-Quartile Range (IQR). It can also treat the identified outliers using the mean, median, rolling mean by group, or inter-quartile range, and can optionally generate a summary report and a plot.

Usage

handle_outliers(
  data,
  vars = NULL,
  method = NULL,
  zscore_threshold = 3,
  mod_zscore_threshold = 3.5,
  iqr_k_value = 1.5,
  treat_method = "none",
  grouping_vars = NULL,
  report_mode = FALSE
)

Arguments

data

Dataframe containing the variables to be checked for outliers.

vars

Character vector of variable names to check for outliers. Default is NULL, which selects all numeric columns.

method

Character string specifying the outlier detection method. Options are "zscore", "mod_zscore", and "iqr_method". Default is NULL, which applies all methods.

zscore_threshold

Numeric value for Z-Score threshold. Default is 3.

mod_zscore_threshold

Numeric value for Modified Z-Score threshold. Default is 3.5.

iqr_k_value

Numeric value for IQR multiplier. Default is 1.5.

treat_method

Character string specifying how to treat outliers. Options are "none", "remove", "mean", "median", "grouped_mean", and "quantile". Default is "none".

grouping_vars

Character vector of grouping variables for "grouped_mean". Required only if treat_method is "grouped_mean".

report_mode

Logical. If TRUE, the function returns a summary report and a plot instead of the treated data. Default is FALSE.
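
The three thresholds above correspond to three common detection rules. The sketch below shows the textbook form of each rule in base R. It is an illustration only: the helper names (flag_zscore, flag_mod_zscore, flag_iqr) are hypothetical, and the exact formulas used inside handle_outliers are an assumption, since this page does not state them.

```r
# Illustrative base-R versions of the three detection rules.
# These helpers are hypothetical and are not part of epiCleanr.

# Z-Score: distance from the mean in standard deviations.
flag_zscore <- function(x, threshold = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  abs(z) > threshold
}

# Modified Z-Score: distance from the median, scaled by the raw
# median absolute deviation (MAD). constant = 1 disables R's default
# 1.4826 scaling so that the usual 0.6745 factor can be applied.
# Note: if the MAD is 0, the scores become Inf or NaN.
flag_mod_zscore <- function(x, threshold = 3.5) {
  med <- median(x, na.rm = TRUE)
  mad_raw <- mad(x, center = med, constant = 1, na.rm = TRUE)
  abs(0.6745 * (x - med) / mad_raw) > threshold
}

# IQR rule: values beyond k * IQR below Q1 or above Q3.
flag_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Twenty ordinary values plus one extreme value in position 21.
x <- c(rep(c(10, 11, 12, 13), 5), 95)

which(flag_zscore(x))
which(flag_mod_zscore(x))
which(flag_iqr(x))
```

On this toy vector all three rules flag only the last value. The Z-Score rule depends on the mean and standard deviation, which the outlier itself inflates, so the median-based rules tend to be more robust on small or heavily skewed samples.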

Value

If report_mode is TRUE, a list with two elements: report, a summary dataframe, and plot, a ggplot object. Otherwise, a dataframe with outliers treated according to treat_method.
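
To make the idea of treating outliers concrete, the sketch below replaces flagged values with an overall median or mean, or with the mean of their group. It is a minimal illustration under stated assumptions: the helper names (treat_flagged, treat_grouped_mean) are hypothetical, and computing the replacement from the non-flagged values only is an assumption, not a documented behaviour of handle_outliers.

```r
# Illustrative treatments for values already flagged as outliers.
# These helpers are hypothetical and are not part of epiCleanr.

# Replace flagged values with the median or mean of the other values.
treat_flagged <- function(x, flag, how = c("median", "mean")) {
  how <- match.arg(how)
  flag <- !is.na(flag) & flag            # treat NA flags as "not an outlier"
  keep <- x[!flag]
  fill <- if (how == "median") {
    median(keep, na.rm = TRUE)
  } else {
    mean(keep, na.rm = TRUE)
  }
  x[flag] <- fill
  x
}

# Replace flagged values with the mean of the non-flagged values
# in the same group.
treat_grouped_mean <- function(x, flag, group) {
  flag <- !is.na(flag) & flag
  clean <- ifelse(flag, NA, x)           # hide outliers from the group mean
  group_mean <- ave(clean, group, FUN = function(v) mean(v, na.rm = TRUE))
  x[flag] <- group_mean[flag]
  x
}

x     <- c(10, 12, 200, 20, 22, 21)
group <- c("a", "a", "a", "b", "b", "b")
flag  <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

treat_flagged(x, flag, how = "median")
treat_grouped_mean(x, flag, group)
```

Here the overall median of the remaining values is 20, while the mean of the remaining values in group "a" is 11, so the two treatments give different replacements for the same outlier. A grouped replacement is usually the better fit when groups (for example districts or months) have very different baseline levels.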

Examples


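library(epiCleanr)

# Path to the example dataset shipped with epiCleanr
path <- system.file(
  "extdata",
  "fake_epi_df_togo.rds",
  package = "epiCleanr"
)

# Read the example data
fake_epi_df_togo <- import(path)

# Variables to check for outliers
variables <- c(
  "malaria_tests", "malaria_cases",
  "cholera_tests", "cholera_cases"
)

# Detect outliers with the Z-Score method and return a report and a plot
result <- handle_outliers(
  fake_epi_df_togo,
  vars = variables,
  method = "zscore",
  report_mode = TRUE
)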

print(result$report)
#> # A tibble: 4 × 4
#>   variable      test    outliers prop_outliers
#>   <chr>         <chr>   <glue>   <chr>        
#> 1 malaria_tests Z-Score 3/900    <1%          
#> 2 malaria_cases Z-Score 5/900    <1%          
#> 3 cholera_tests Z-Score 1/900    <1%          
#> 4 cholera_cases Z-Score 2/900    <1%          

print(result$plot)