Validating data

Jeff Stevens

2023-02-10

Review

Mental model of importing data

Set up

Import dog breed traits data

(mydf <- readr::read_csv(here::here("data/dog_breed_traits.csv")))
# A tibble: 195 × 17
   Breed Affec…¹ Good …² Good …³ Shedd…⁴ Coat …⁵ Drool…⁶ Coat …⁷ Coat …⁸ Openn…⁹
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>     <dbl>
 1 Retr…       5       5       5       4       2       2 Double  Short         5
 2 Fren…       5       5       4       3       1       3 Smooth  Short         5
 3 Germ…       5       5       3       4       2       2 Double  Medium        3
 4 Retr…       5       5       5       4       2       2 Double  Medium        5
 5 Bull…       4       3       3       3       3       3 Smooth  Short         4
 6 Pood…       5       5       3       1       4       1 Curly   Long          5
 7 Beag…       3       5       5       3       2       1 Smooth  Short         3
 8 Rott…       5       3       3       3       1       3 Smooth  Short         3
 9 Poin…       5       5       4       3       2       2 Smooth  Short         4
10 Dach…       5       3       4       2       2       2 Smooth  Short         4
# … with 185 more rows, 7 more variables: `Playfulness Level` <dbl>,
#   `Watchdog/Protective Nature` <dbl>, `Adaptability Level` <dbl>,
#   `Trainability Level` <dbl>, `Energy Level` <dbl>, `Barking Level` <dbl>,
#   `Mental Stimulation Needs` <dbl>, and abbreviated variable names
#   ¹​`Affectionate With Family`, ²​`Good With Young Children`,
#   ³​`Good With Other Dogs`, ⁴​`Shedding Level`, ⁵​`Coat Grooming Frequency`,
#   ⁶​`Drooling Level`, ⁷​`Coat Type`, ⁸​`Coat Length`, ⁹​`Openness To Strangers`

Set up

Also load {palmerpenguins} for access to penguins

Data validation

Data validation

Check that your imported data are correct/valid/reasonable

  • Dimensions

  • Data types

  • Ranges and constraints

  • Allowed values (code lists)

  • Column dependencies

  • Completeness/uniqueness

  • Missing values

Dimensions

Does the data frame have the correct dimensions?

How do we check dimensions of a data frame?

dim(mydf)
[1] 195  17

Data types

Do the data columns have the correct data types?

How do we check data types of a data frame?

str(mydf)
spc_tbl_ [195 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Breed                     : chr [1:195] "Retrievers (Labrador)" "French Bulldogs" "German Shepherd Dogs" "Retrievers (Golden)" ...
 $ Affectionate With Family  : num [1:195] 5 5 5 5 4 5 3 5 5 5 ...
 $ Good With Young Children  : num [1:195] 5 5 5 5 3 5 5 3 5 3 ...
 $ Good With Other Dogs      : num [1:195] 5 4 3 5 3 3 5 3 4 4 ...
 $ Shedding Level            : num [1:195] 4 3 4 4 3 1 3 3 3 2 ...
 $ Coat Grooming Frequency   : num [1:195] 2 1 2 2 3 4 2 1 2 2 ...
 $ Drooling Level            : num [1:195] 2 3 2 2 3 1 1 3 2 2 ...
 $ Coat Type                 : chr [1:195] "Double" "Smooth" "Double" "Double" ...
 $ Coat Length               : chr [1:195] "Short" "Short" "Medium" "Medium" ...
 $ Openness To Strangers     : num [1:195] 5 5 3 5 4 5 3 3 4 4 ...
 $ Playfulness Level         : num [1:195] 5 5 4 4 4 5 4 4 4 4 ...
 $ Watchdog/Protective Nature: num [1:195] 3 3 5 3 3 5 2 5 4 4 ...
 $ Adaptability Level        : num [1:195] 5 5 5 5 3 4 4 4 4 4 ...
 $ Trainability Level        : num [1:195] 5 4 5 5 4 5 3 5 5 4 ...
 $ Energy Level              : num [1:195] 5 3 5 3 3 4 4 3 5 3 ...
 $ Barking Level             : num [1:195] 3 1 3 1 2 4 4 1 3 5 ...
 $ Mental Stimulation Needs  : num [1:195] 4 3 5 4 3 5 4 5 5 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   Breed = col_character(),
  ..   `Affectionate With Family` = col_double(),
  ..   `Good With Young Children` = col_double(),
  ..   `Good With Other Dogs` = col_double(),
  ..   `Shedding Level` = col_double(),
  ..   `Coat Grooming Frequency` = col_double(),
  ..   `Drooling Level` = col_double(),
  ..   `Coat Type` = col_character(),
  ..   `Coat Length` = col_character(),
  ..   `Openness To Strangers` = col_double(),
  ..   `Playfulness Level` = col_double(),
  ..   `Watchdog/Protective Nature` = col_double(),
  ..   `Adaptability Level` = col_double(),
  ..   `Trainability Level` = col_double(),
  ..   `Energy Level` = col_double(),
  ..   `Barking Level` = col_double(),
  ..   `Mental Stimulation Needs` = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

Data types

Do the data columns have the correct data types?

Test specific data type for individual columns

is.numeric(mydf$`2013 Rank`)
[1] FALSE

Ranges and constraints

Do numerical column values have the correct range and/or other constraints?

Note

Useful to check for Likert scales and measures of age (especially coming out of Qualtrics).

range(mydf$`Drooling Level`)
[1] 1 5

Allowed values (code lists)

Do categorical column values have the correct possible values?

Note

Useful to check for when users can enter text instead of choose options.

(recorded_coats <- unique(mydf$`Coat Type`))
[1] "Double"   "Smooth"   "Curly"    "Silky"    "Wavy"     "Wiry"     "Hairless"
[8] "Rough"    "Corded"  
allowed_coats <- c("Smooth", "Curly", "Silky", "Wavy", "Wiry", "Rough")
recorded_coats %in% allowed_coats
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Column dependencies

Do column dependencies match up?

E.g., if “other” is selected in choice column, does other column have an entry?

df$choice == "other" & !is.na(df$other)

Completeness

Are all expected observations included?

How do we test this?

observed_subjects %in% expected_subjects

Uniqueness

Are there duplicate observations?

duplicated(mydf$Breed)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE
mydf[duplicated(mydf$Breed), ]
# A tibble: 0 × 17
# … with 17 variables: Breed <chr>, Affectionate With Family <dbl>,
#   Good With Young Children <dbl>, Good With Other Dogs <dbl>,
#   Shedding Level <dbl>, Coat Grooming Frequency <dbl>, Drooling Level <dbl>,
#   Coat Type <chr>, Coat Length <chr>, Openness To Strangers <dbl>,
#   Playfulness Level <dbl>, Watchdog/Protective Nature <dbl>,
#   Adaptability Level <dbl>, Trainability Level <dbl>, Energy Level <dbl>,
#   Barking Level <dbl>, Mental Stimulation Needs <dbl>

Missing values

Are there values with missing data?

is.na(penguins$bill_length_mm)
penguins[is.na(penguins$bill_length_mm), ]
# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen             NA            NA          NA      NA <NA>   2007
2 Gentoo  Biscoe                NA            NA          NA      NA <NA>   2009
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Missing values

Are there values with missing data?

penguins[!complete.cases(penguins), ]
# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 2 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
 3 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
 4 Adelie  Torgersen           37.8          17.1        186    3300 <NA>   2007
 5 Adelie  Torgersen           37.8          17.3        180    3700 <NA>   2007
 6 Adelie  Dream               37.5          18.9        179    2975 <NA>   2007
 7 Gentoo  Biscoe              44.5          14.3        216    4100 <NA>   2007
 8 Gentoo  Biscoe              46.2          14.4        214    4650 <NA>   2008
 9 Gentoo  Biscoe              47.3          13.8        216    4725 <NA>   2009
10 Gentoo  Biscoe              44.5          15.7        217    4875 <NA>   2009
11 Gentoo  Biscoe              NA            NA           NA      NA <NA>   2009
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g

Summarizing data

Summarizing data

head(mydf)
# A tibble: 6 × 17
  Breed  Affec…¹ Good …² Good …³ Shedd…⁴ Coat …⁵ Drool…⁶ Coat …⁷ Coat …⁸ Openn…⁹
  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>     <dbl>
1 Retri…       5       5       5       4       2       2 Double  Short         5
2 Frenc…       5       5       4       3       1       3 Smooth  Short         5
3 Germa…       5       5       3       4       2       2 Double  Medium        3
4 Retri…       5       5       5       4       2       2 Double  Medium        5
5 Bulld…       4       3       3       3       3       3 Smooth  Short         4
6 Poodl…       5       5       3       1       4       1 Curly   Long          5
# … with 7 more variables: `Playfulness Level` <dbl>,
#   `Watchdog/Protective Nature` <dbl>, `Adaptability Level` <dbl>,
#   `Trainability Level` <dbl>, `Energy Level` <dbl>, `Barking Level` <dbl>,
#   `Mental Stimulation Needs` <dbl>, and abbreviated variable names
#   ¹​`Affectionate With Family`, ²​`Good With Young Children`,
#   ³​`Good With Other Dogs`, ⁴​`Shedding Level`, ⁵​`Coat Grooming Frequency`,
#   ⁶​`Drooling Level`, ⁷​`Coat Type`, ⁸​`Coat Length`, ⁹​`Openness To Strangers`

Summarizing data

glimpse()

tibble::glimpse(mydf)
Rows: 195
Columns: 17
$ Breed                        <chr> "Retrievers (Labrador)", "French Bulldogs…
$ `Affectionate With Family`   <dbl> 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 3, 5, 4,…
$ `Good With Young Children`   <dbl> 5, 5, 5, 5, 3, 5, 5, 3, 5, 3, 3, 5, 5, 5,…
$ `Good With Other Dogs`       <dbl> 5, 4, 3, 5, 3, 3, 5, 3, 4, 4, 4, 3, 3, 3,…
$ `Shedding Level`             <dbl> 4, 3, 4, 4, 3, 1, 3, 3, 3, 2, 4, 3, 1, 2,…
$ `Coat Grooming Frequency`    <dbl> 2, 1, 2, 2, 3, 4, 2, 1, 2, 2, 2, 2, 5, 2,…
$ `Drooling Level`             <dbl> 2, 3, 2, 2, 3, 1, 1, 3, 2, 2, 1, 1, 1, 3,…
$ `Coat Type`                  <chr> "Double", "Smooth", "Double", "Double", "…
$ `Coat Length`                <chr> "Short", "Short", "Medium", "Medium", "Sh…
$ `Openness To Strangers`      <dbl> 5, 5, 3, 5, 4, 5, 3, 3, 4, 4, 4, 3, 5, 4,…
$ `Playfulness Level`          <dbl> 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ `Watchdog/Protective Nature` <dbl> 3, 3, 5, 3, 3, 5, 2, 5, 4, 4, 5, 3, 5, 4,…
$ `Adaptability Level`         <dbl> 5, 5, 5, 5, 3, 4, 4, 4, 4, 4, 4, 3, 5, 3,…
$ `Trainability Level`         <dbl> 5, 4, 5, 5, 4, 5, 3, 5, 5, 4, 4, 5, 4, 4,…
$ `Energy Level`               <dbl> 5, 3, 5, 3, 3, 4, 4, 3, 5, 3, 4, 5, 4, 4,…
$ `Barking Level`              <dbl> 3, 1, 3, 1, 2, 4, 4, 1, 3, 5, 4, 3, 4, 3,…
$ `Mental Stimulation Needs`   <dbl> 4, 3, 5, 4, 3, 5, 4, 5, 5, 3, 4, 5, 4, 4,…

Summarizing data

summary()

summary(penguins)
      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2                                 

{skimr}

View info about your data

library(skimr)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

Testing data

{validate}

  • Create rules about dimensions, data types, ranges, code lists, etc.

  • Confront your data with the rules

  • Summarize/visualize validation

{assertr}

  • Assertions: tests of data embedded in functions
library(assertr)
verify(penguins, has_all_names("species", "island", "sex"))
verify(penguins, nrow(penguins) > 100)
verify(penguins, bill_length_mm > 0)
insist(penguins, within_n_sds(4), bill_length_mm)
assert(penguins, in_set(2007, 2008, 2009), year)

Excluding data

{excluder}

Working with Qualtrics data can be … challenging.

{excluder} helps:

  • get rid of initial rows with remove_label_rows()
  • use Qualtrics labels as column names with use_labels()
  • deidentify data with deidentify()
  • view, mark, or exclude data based on: preview status, survey progress, survey completion time, IP address country, geolocation, duplicate IP address, and screen resolution

Codebooks

{dataReporter}

{dataReporter}

{dataReporter}

dataReporter::makeCodebook(mydf3, file = here("06_codebook.Rmd"))

Let’s code!

Validating data [Rmd]