Validating data

Jeff Stevens

2023-02-10

Review

Mental model of importing data

Set up

Import dog breed traits data

(mydf <- readr::read_csv(here::here("data/dog_breed_traits.csv")))

# A tibble: 195 × 17
   Breed Affec…¹ Good …² Good …³ Shedd…⁴ Coat …⁵ Drool…⁶ Coat …⁷ Coat …⁸ Openn…⁹
   <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>     <dbl>
 1 Retr…       5       5       5       4       2       2 Double  Short         5
 2 Fren…       5       5       4       3       1       3 Smooth  Short         5
 3 Germ…       5       5       3       4       2       2 Double  Medium        3
 4 Retr…       5       5       5       4       2       2 Double  Medium        5
 5 Bull…       4       3       3       3       3       3 Smooth  Short         4
 6 Pood…       5       5       3       1       4       1 Curly   Long          5
 7 Beag…       3       5       5       3       2       1 Smooth  Short         3
 8 Rott…       5       3       3       3       1       3 Smooth  Short         3
 9 Poin…       5       5       4       3       2       2 Smooth  Short         4
10 Dach…       5       3       4       2       2       2 Smooth  Short         4
# … with 185 more rows, 7 more variables: `Playfulness Level` <dbl>,
#   `Watchdog/Protective Nature` <dbl>, `Adaptability Level` <dbl>,
#   `Trainability Level` <dbl>, `Energy Level` <dbl>, `Barking Level` <dbl>,
#   `Mental Stimulation Needs` <dbl>, and abbreviated variable names
#   ¹`Affectionate With Family`, ²`Good With Young Children`,
#   ³`Good With Other Dogs`, ⁴`Shedding Level`, ⁵`Coat Grooming Frequency`,
#   ⁶`Drooling Level`, ⁷`Coat Type`, ⁸`Coat Length`, ⁹`Openness To Strangers`

Set up

Also load {palmerpenguins} for access to `penguins`

library(palmerpenguins)

Data validation

Check that your imported data are correct/valid/reasonable

Dimensions
Data types
Ranges and constraints
Allowed values (code lists)
Column dependencies
Completeness/uniqueness
Missing values

Dimensions

Does the data frame have the correct dimensions?

How do we check dimensions of a data frame?

dim(mydf)

[1] 195  17

Data types

Do the data columns have the correct data types?

How do we check data types of a data frame?

str(mydf)

spc_tbl_ [195 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Breed                     : chr [1:195] "Retrievers (Labrador)" "French Bulldogs" "German Shepherd Dogs" "Retrievers (Golden)" ...
 $ Affectionate With Family  : num [1:195] 5 5 5 5 4 5 3 5 5 5 ...
 $ Good With Young Children  : num [1:195] 5 5 5 5 3 5 5 3 5 3 ...
 $ Good With Other Dogs      : num [1:195] 5 4 3 5 3 3 5 3 4 4 ...
 $ Shedding Level            : num [1:195] 4 3 4 4 3 1 3 3 3 2 ...
 $ Coat Grooming Frequency   : num [1:195] 2 1 2 2 3 4 2 1 2 2 ...
 $ Drooling Level            : num [1:195] 2 3 2 2 3 1 1 3 2 2 ...
 $ Coat Type                 : chr [1:195] "Double" "Smooth" "Double" "Double" ...
 $ Coat Length               : chr [1:195] "Short" "Short" "Medium" "Medium" ...
 $ Openness To Strangers     : num [1:195] 5 5 3 5 4 5 3 3 4 4 ...
 $ Playfulness Level         : num [1:195] 5 5 4 4 4 5 4 4 4 4 ...
 $ Watchdog/Protective Nature: num [1:195] 3 3 5 3 3 5 2 5 4 4 ...
 $ Adaptability Level        : num [1:195] 5 5 5 5 3 4 4 4 4 4 ...
 $ Trainability Level        : num [1:195] 5 4 5 5 4 5 3 5 5 4 ...
 $ Energy Level              : num [1:195] 5 3 5 3 3 4 4 3 5 3 ...
 $ Barking Level             : num [1:195] 3 1 3 1 2 4 4 1 3 5 ...
 $ Mental Stimulation Needs  : num [1:195] 4 3 5 4 3 5 4 5 5 3 ...
 - attr(*, "spec")=
  .. cols(
  ..   Breed = col_character(),
  ..   `Affectionate With Family` = col_double(),
  ..   `Good With Young Children` = col_double(),
  ..   `Good With Other Dogs` = col_double(),
  ..   `Shedding Level` = col_double(),
  ..   `Coat Grooming Frequency` = col_double(),
  ..   `Drooling Level` = col_double(),
  ..   `Coat Type` = col_character(),
  ..   `Coat Length` = col_character(),
  ..   `Openness To Strangers` = col_double(),
  ..   `Playfulness Level` = col_double(),
  ..   `Watchdog/Protective Nature` = col_double(),
  ..   `Adaptability Level` = col_double(),
  ..   `Trainability Level` = col_double(),
  ..   `Energy Level` = col_double(),
  ..   `Barking Level` = col_double(),
  ..   `Mental Stimulation Needs` = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Data types

Do the data columns have the correct data types?

Test specific data type for individual columns

is.numeric(mydf$`2013 Rank`)

[1] FALSE

Ranges and constraints

Do numerical column values have the correct range and/or other constraints?

Note

Useful to check for Likert scales and measures of age (especially coming out of Qualtrics).

range(mydf$`Drooling Level`)

[1] 1 5

Allowed values (code lists)

Do categorical column values have the correct possible values?

Note

Useful to check for when users can enter text instead of choose options.

(recorded_coats <- unique(mydf$`Coat Type`))

[1] "Double"   "Smooth"   "Curly"    "Silky"    "Wavy"     "Wiry"     "Hairless"
[8] "Rough"    "Corded"

allowed_coats <- c("Smooth", "Curly", "Silky", "Wavy", "Wiry", "Rough")
recorded_coats %in% allowed_coats

[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Column dependencies

Do column dependencies match up?

E.g., if “other” is selected in choice column, does other column have an entry?

df$choice == "other" & !is.na(df$other)

Completeness

Are all expected observations included?

How do we test this?

observed_subjects %in% expected_subjects

Uniqueness

Are there duplicate observations?

duplicated(mydf$Breed)

  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE

mydf[duplicated(mydf$Breed), ]

# A tibble: 0 × 17
# … with 17 variables: Breed <chr>, Affectionate With Family <dbl>,
#   Good With Young Children <dbl>, Good With Other Dogs <dbl>,
#   Shedding Level <dbl>, Coat Grooming Frequency <dbl>, Drooling Level <dbl>,
#   Coat Type <chr>, Coat Length <chr>, Openness To Strangers <dbl>,
#   Playfulness Level <dbl>, Watchdog/Protective Nature <dbl>,
#   Adaptability Level <dbl>, Trainability Level <dbl>, Energy Level <dbl>,
#   Barking Level <dbl>, Mental Stimulation Needs <dbl>

Missing values

Are there values with missing data?

is.na(penguins$bill_length_mm)

penguins[is.na(penguins$bill_length_mm), ]

# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen             NA            NA          NA      NA <NA>   2007
2 Gentoo  Biscoe                NA            NA          NA      NA <NA>   2009
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g

Missing values

Are there values with missing data?

penguins[!complete.cases(penguins), ]

# A tibble: 11 × 8
   species island    bill_length_mm bill_depth_mm flipper_…¹ body_…² sex    year
   <fct>   <fct>              <dbl>         <dbl>      <int>   <int> <fct> <int>
 1 Adelie  Torgersen           NA            NA           NA      NA <NA>   2007
 2 Adelie  Torgersen           34.1          18.1        193    3475 <NA>   2007
 3 Adelie  Torgersen           42            20.2        190    4250 <NA>   2007
 4 Adelie  Torgersen           37.8          17.1        186    3300 <NA>   2007
 5 Adelie  Torgersen           37.8          17.3        180    3700 <NA>   2007
 6 Adelie  Dream               37.5          18.9        179    2975 <NA>   2007
 7 Gentoo  Biscoe              44.5          14.3        216    4100 <NA>   2007
 8 Gentoo  Biscoe              46.2          14.4        214    4650 <NA>   2008
 9 Gentoo  Biscoe              47.3          13.8        216    4725 <NA>   2009
10 Gentoo  Biscoe              44.5          15.7        217    4875 <NA>   2009
11 Gentoo  Biscoe              NA            NA           NA      NA <NA>   2009
# … with abbreviated variable names ¹flipper_length_mm, ²body_mass_g

Summarizing data

`head()`

head(mydf)

# A tibble: 6 × 17
  Breed  Affec…¹ Good …² Good …³ Shedd…⁴ Coat …⁵ Drool…⁶ Coat …⁷ Coat …⁸ Openn…⁹
  <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>     <dbl>
1 Retri…       5       5       5       4       2       2 Double  Short         5
2 Frenc…       5       5       4       3       1       3 Smooth  Short         5
3 Germa…       5       5       3       4       2       2 Double  Medium        3
4 Retri…       5       5       5       4       2       2 Double  Medium        5
5 Bulld…       4       3       3       3       3       3 Smooth  Short         4
6 Poodl…       5       5       3       1       4       1 Curly   Long          5
# … with 7 more variables: `Playfulness Level` <dbl>,
#   `Watchdog/Protective Nature` <dbl>, `Adaptability Level` <dbl>,
#   `Trainability Level` <dbl>, `Energy Level` <dbl>, `Barking Level` <dbl>,
#   `Mental Stimulation Needs` <dbl>, and abbreviated variable names
#   ¹`Affectionate With Family`, ²`Good With Young Children`,
#   ³`Good With Other Dogs`, ⁴`Shedding Level`, ⁵`Coat Grooming Frequency`,
#   ⁶`Drooling Level`, ⁷`Coat Type`, ⁸`Coat Length`, ⁹`Openness To Strangers`

Summarizing data

`glimpse()`

tibble::glimpse(mydf)

Rows: 195
Columns: 17
$ Breed                        <chr> "Retrievers (Labrador)", "French Bulldogs…
$ `Affectionate With Family`   <dbl> 5, 5, 5, 5, 4, 5, 3, 5, 5, 5, 5, 3, 5, 4,…
$ `Good With Young Children`   <dbl> 5, 5, 5, 5, 3, 5, 5, 3, 5, 3, 3, 5, 5, 5,…
$ `Good With Other Dogs`       <dbl> 5, 4, 3, 5, 3, 3, 5, 3, 4, 4, 4, 3, 3, 3,…
$ `Shedding Level`             <dbl> 4, 3, 4, 4, 3, 1, 3, 3, 3, 2, 4, 3, 1, 2,…
$ `Coat Grooming Frequency`    <dbl> 2, 1, 2, 2, 3, 4, 2, 1, 2, 2, 2, 2, 5, 2,…
$ `Drooling Level`             <dbl> 2, 3, 2, 2, 3, 1, 1, 3, 2, 2, 1, 1, 1, 3,…
$ `Coat Type`                  <chr> "Double", "Smooth", "Double", "Double", "…
$ `Coat Length`                <chr> "Short", "Short", "Medium", "Medium", "Sh…
$ `Openness To Strangers`      <dbl> 5, 5, 3, 5, 4, 5, 3, 3, 4, 4, 4, 3, 5, 4,…
$ `Playfulness Level`          <dbl> 5, 5, 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
$ `Watchdog/Protective Nature` <dbl> 3, 3, 5, 3, 3, 5, 2, 5, 4, 4, 5, 3, 5, 4,…
$ `Adaptability Level`         <dbl> 5, 5, 5, 5, 3, 4, 4, 4, 4, 4, 4, 3, 5, 3,…
$ `Trainability Level`         <dbl> 5, 4, 5, 5, 4, 5, 3, 5, 5, 4, 4, 5, 4, 4,…
$ `Energy Level`               <dbl> 5, 3, 5, 3, 3, 4, 4, 3, 5, 3, 4, 5, 4, 4,…
$ `Barking Level`              <dbl> 3, 1, 3, 1, 2, 4, 4, 1, 3, 5, 4, 3, 4, 3,…
$ `Mental Stimulation Needs`   <dbl> 4, 3, 5, 4, 3, 5, 4, 5, 5, 3, 4, 5, 4, 4,…

Summarizing data

`summary()`

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

{skimr}

View info about your data

library(skimr)
skim(penguins)

Data summary
Name	penguins
Number of rows	344
Number of columns	8
_______________________
Column type frequency:
factor	3
numeric	5
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
species	0	1.00	FALSE	3	Ade: 152, Gen: 124, Chi: 68
island	0	1.00	FALSE	3	Bis: 168, Dre: 124, Tor: 52
sex	11	0.97	FALSE	2	mal: 168, fem: 165

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
bill_length_mm	2	0.99	43.92	5.46	32.1	39.23	44.45	48.5	59.6	▃▇▇▆▁
bill_depth_mm	2	0.99	17.15	1.97	13.1	15.60	17.30	18.7	21.5	▅▅▇▇▂
flipper_length_mm	2	0.99	200.92	14.06	172.0	190.00	197.00	213.0	231.0	▂▇▃▅▂
body_mass_g	2	0.99	4201.75	801.95	2700.0	3550.00	4050.00	4750.0	6300.0	▃▇▆▃▂
year	0	1.00	2008.03	0.82	2007.0	2007.00	2008.00	2009.0	2009.0	▇▁▇▁▇

Testing data

{validate}

Create rules about dimensions, data types, ranges, code lists, etc.
Confront your data with the rules
Summarize/visualize validation

{assertr}

Assertions: tests of data embedded in functions

library(assertr)
verify(penguins, has_all_names("species", "island", "sex"))
verify(penguins, nrow(penguins) > 100)
verify(penguins, bill_length_mm > 0)
insist(penguins, within_n_sds(4), bill_length_mm)
assert(penguins, in_set(2007, 2008, 2009), year)

Excluding data

{excluder}

Working with Qualtrics data can be … challenging.

{excluder} helps:

get rid of initial rows with remove_label_rows()
use Qualtrics labels as column names with use_labels()
deidentify data with deidentify()
view, mark, or exclude data based on: preview status, survey progress, survey completion time, IP address country, geolocation, duplicate IP address, and screen resolution

Codebooks

{dataReporter}

dataReporter::makeCodebook(mydf3, file = here("06_codebook.Rmd"))

Let’s code!

Validating data [Rmd]

Validating data

Review

Mental model of importing data

Set up

Import dog breed traits data

Set up

Also load {palmerpenguins} for access to penguins

Data validation

Data validation

Check that your imported data are correct/valid/reasonable

Dimensions

Does the data frame have the correct dimensions?

How do we check dimensions of a data frame?

Data types

Do the data columns have the correct data types?

How do we check data types of a data frame?

Data types

Do the data columns have the correct data types?

Test specific data type for individual columns

Ranges and constraints

Do numerical column values have the correct range and/or other constraints?

Allowed values (code lists)

Do categorical column values have the correct possible values?

Column dependencies

Do column dependencies match up?

Completeness

Are all expected observations included?

How do we test this?

Uniqueness

Are there duplicate observations?

Missing values

Are there values with missing data?

Missing values

Are there values with missing data?

Summarizing data

Summarizing data

head()

Summarizing data

glimpse()

Summarizing data

summary()

{skimr}

View info about your data

Testing data

{validate}

{assertr}

Excluding data

{excluder}

Working with Qualtrics data can be … challenging.

Codebooks

{dataReporter}

{dataReporter}

{dataReporter}

Let’s code!

Also load {palmerpenguins} for access to `penguins`

`head()`

`glimpse()`

`summary()`