This section describes how to validate Flexible Query data from Viva Insights. Before you apply any functions to your dataset and start looking for new insights, it is recommended that you perform data validation. This best practice applies regardless of whether your aim is to explore the data, establish a baseline, or perform advanced analytics.
Data validation is an essential routine for every analyst: it ensures that the data you are using is accurate, clean, and useful. It confirms that your dataset provides a sound basis for your analyses and acts as a proactive safeguard from the very start.
There are several reasons why you should validate your Workplace Analytics data. In a nutshell, it is good practice to develop a comprehensive understanding of the data context and to check for common biases, errors, and anomalies prior to analysis; otherwise, you risk compromising the quality and reliability of the analysis.
Before you begin with data validation, it’s helpful to know some basic facts about the dataset you are looking at, such as the type of query, the date range it covers, the number of employees it includes, and the HR attributes available. This can all be done with `check_query()`:
```r
library(wpa)
check_query(sq_data)
```
The resulting output would look something like the following and will give you an initial view into your data:
```
The data used is a Person Query.
There are 1034 employees in this dataset.
Date ranges from 2019-11-03 to 2020-01-26.
There are 9 (estimated) HR attributes in the data:
`Domain`, `FunctionType`, `LevelDesignation`, `Region`, `Organization`, `attainment`, `TimeZone`, `IsInternal`, `IsActive`
There are 1034 active employees out of all in the dataset.
Variable name check:
`Collaboration_hours` is used instead of `Collaboration_hrs` in the data.
`Instant_Message_hours` is used instead of `Instant_message_hours` in the data.
```
The functions below are also helpful for exploring your data:

- `names(sq_data)` to view the variable names
- `class(sq_data$Date)` to check the type of the `Date` variable
- `summary(sq_data)` for a summary of each variable
- `length(unique(sq_data$PersonId))` to count the number of distinct employees
- `dplyr::glimpse(sq_data)` or `skimr::skim(sq_data)` for a compact overview of the dataset
- `View(sq_data)` (not recommended for large datasets)

Validating the structure of your data is just as important as validating the data itself. You may wish to check that your data has been imported into R correctly if you observe certain anomalies, such as the `Date` variable showing up as a type that is neither character nor Date. A sketch of how you might fix that particular issue is shown below.
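For example, if the `Date` column has been read in as plain character strings (which can happen when a downloaded query is loaded with `read.csv()`), a minimal sketch for coercing it might look like the following; the `"%m/%d/%Y"` format string is an assumption here and should be matched to how the dates actually appear in your file:

```r
library(dplyr)

# Assumption: `spq_df` is a person query loaded with read.csv(), so `Date`
# arrived as character strings such as "11/03/2019" rather than as a Date
spq_df <- spq_df %>%
  mutate(Date = as.Date(Date, format = "%m/%d/%Y"))

class(spq_df$Date) # should now return "Date"
```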
An easy way to perform data validation with the wpa package is to run the data validation report:

```r
# `spq_df` is your Standard Person Query data
validation_report(spq_df)
```
This function generates an interactive HTML report in your working directory, containing a series of checks against your data.
You can find a demo output of the validation report here. Note that `validation_report()` only runs with a Standard Person Query, but you can supply an optional meeting query to include checks against meeting subject lines. To do so, run:
```r
# Assuming:
# `spq_df` is your Standard Person Query data
# `mt_df` is your Standard Meeting Query data
validation_report(spq_df,
                  meeting_data = mt_df)
```
The data validation report provides a series of recommendations on whether you should adjust certain settings or consider certain assumptions before proceeding with your analysis. After you have made the relevant adjustments, you can run the ‘cleaned’ dataset through `validation_report()` again to confirm that the potential issues have been addressed.
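As a minimal sketch, the re-run might look like the following, where `clean_spq` is an assumed name for the adjusted dataset (a worked example of building such a dataset appears later in this section):

```r
library(wpa)

# Re-run the validation report on the adjusted ('cleaned') dataset to confirm
# that the previously flagged issues no longer appear
validation_report(clean_spq)
```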
Note that `validation_report()` only provides recommendations based on common scenarios observed with Workplace Analytics data. When deciding whether to make an adjustment, you should also consider other factors, such as the quality of your organizational data, the business context, and any known collaboration norms within the organization.
The wpa package provides additional data validation functions that you can use before embarking on a new analysis. These functions make up the majority of the automated checks in `validation_report()`, and you can run them individually to extract more detailed information (for instance, the report may identify certain employees as non-knowledge workers, but does the distribution of these workers across organizations make sense?). The key data validation functions are described below.
`check_query()` performs a check on a query (data frame) and returns a diagnostic message about the data query to the R console, with information such as the date range, the number of employees, and the HR attributes identified:

```r
check_query(sq_data)
```
`hrvar_count()` enables you to create a count of the distinct people by a specified HR attribute:

```r
hrvar_count(sq_data, hrvar = "LevelDesignation")
```
To run a blanket analysis across all the organizational attributes in the dataset, you can run `hrvar_count_all()` instead.
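As a sketch, and assuming the `return = "table"` option behaves as in recent versions of the package, you can return the counts as a table rather than the default plot, and run the blanket version with its defaults:

```r
library(wpa)

# Count of distinct employees by level designation, returned as a table
# rather than the default plot (assumes `return = "table"` is supported)
hrvar_count(sq_data, hrvar = "LevelDesignation", return = "table")

# Blanket count across all organizational attributes in the dataset
hrvar_count_all(sq_data)
```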
Click on the linked functions to find out more about each of them.
There are three common reasons for removing certain employees or weeks from the data: company-wide holiday weeks, non-knowledge workers (or others with very low collaboration signals), and individual inactive weeks. There are three functions in wpa to address each of these respective scenarios:
1. `identify_holidayweeks()` identifies weeks where collaboration hours are low outliers for the entire organization, which are likely to be company holiday weeks.
2. `identify_nkw()` identifies employees with overall low collaboration signals, based on average collaboration hours. In addition to non-knowledge workers, this method would also capture effective part-timers or freelancers whose collaboration hours are significantly low.
3. `identify_inactiveweeks()` identifies individual weeks where an employee’s collaboration hours are low outliers, i.e. weeks in which that person appears to have been inactive.

Functions (1) to (3) all come with options to return either only the ‘clean’ dataset or the original dataset with an appended flag that identifies the anomalous persons or weeks. As noted above, you can click on the linked functions to find out more. A sketch of how you might inspect the distribution of flagged employees follows below.
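For instance, to sanity-check how flagged low-collaboration employees are distributed across organizations (the question raised earlier), you could reproduce the non-knowledge-worker logic manually with dplyr. The 5-hour cut-off below is illustrative and in line with the example that follows:

```r
library(wpa)
library(dplyr)

# Flag employees whose average weekly collaboration hours fall below an
# illustrative threshold of 5, then inspect the flag counts by organization
sq_data %>%
  group_by(PersonId, Organization) %>%
  summarise(mean_collab = mean(Collaboration_hours), .groups = "drop") %>%
  mutate(flag_low_collab = mean_collab < 5) %>%
  count(Organization, flag_low_collab)
```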
Below is an example of how one might create a ‘clean’ dataset using the functions above:
```r
library(wpa)
library(dplyr)

clean_spq <-
  raw_wowa %>%          # Loaded in as a Ways of Working query
  standardise_pq() %>%  # Standardize variable names - not necessary if SPQ
  # Force the use of Teams metrics - optional
  mutate(Collaboration_hours =
           select(.,
                  Email_hours,
                  Meeting_hours,
                  Unscheduled_Call_hours,
                  Instant_Message_hours) %>%
           apply(1, sum)) %>%
  identify_nkw(collab_threshold = 4.99999, return = "data_cleaned") %>%  # >= 5 CH
  identify_inactiveweeks(sd = 2, return = "data_cleaned") %>%            # 2 SD
  filter(Date >= as.Date("08/30/2020", "%m/%d/%Y")) %>%  # Trim start date
  filter(Date <= as.Date("03/07/2021", "%m/%d/%Y")) %>%  # Trim end date
  filter(Date != as.Date("11/22/2020", "%m/%d/%Y"))      # Remove certain weeks
```
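As a quick follow-up sketch, you can compare the raw and cleaned datasets to see how much was removed, and re-run the diagnostic check on the result:

```r
# Number of person-week rows removed by the cleaning steps
nrow(raw_wowa) - nrow(clean_spq)

# Number of employees removed entirely
length(unique(raw_wowa$PersonId)) - length(unique(clean_spq$PersonId))

# Re-run the diagnostic check on the cleaned dataset
check_query(clean_spq)
```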
In addition to the main `validation_report()`, there is an additional report you can run to validate whether you have set your meeting exclusion rules appropriately.
`subject_validate_report()` creates a text mining report in HTML based on a Standard Meeting Query. Meeting subject lines from the `Subject` variable are analyzed, and meetings with subjects that include common exclusion terms are highlighted. The report helps you understand whether your meeting collaboration data may include a significant proportion of non-meetings.
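A minimal sketch of running it, assuming `mt_df` is your Standard Meeting Query loaded as a data frame (as in the earlier example):

```r
library(wpa)

# Generates an HTML text mining report in the working directory, highlighting
# meeting subject lines that match common exclusion terms
subject_validate_report(mt_df)
```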
Let’s go to the Summary Functions section to see how we can analyse different Workplace Analytics metrics in a similar way.