This function processes the Subject column in a Meeting Query by tokenising it with tidytext::unnest_tokens() and removing any custom stopwords supplied through the stopwords argument. It is a sub-function that feeds into tm_freq(), tm_cooc(), and tm_wordcloud(). By default it returns a data frame of tokenised words or ngrams, with one row per token.
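The two steps described above (tokenise, then drop stopwords) can be sketched directly with tidytext and dplyr. This is a minimal illustration, not the package's actual source: the `meetings` data and the `custom_stopwords` list are made up for the example, and removing the built-in `tidytext::stop_words` list as well is an assumption, suggested by the absence of common English words in the example output further down.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a Meeting Query; only the Subject column is used
meetings <- tibble::tibble(
  Subject = c("Planning for the core team", "Agile review with the officer")
)

# Hypothetical custom stopwords in a single-column data frame labelled 'word'
custom_stopwords <- tibble::tibble(word = c("team", "review"))

tokens <- meetings %>%
  mutate(line = row_number()) %>%                  # id for each subject line
  select(line, Subject) %>%
  unnest_tokens(output = word, input = Subject,    # one row per lower-cased word
                token = "words") %>%
  anti_join(tidytext::stop_words, by = "word") %>% # assumed: drop common English words
  anti_join(custom_stopwords, by = "word")         # drop the custom stopwords

tokens
```

Because the stopword filter is an anti-join on single words, it only has an effect when each token is a single word; multi-word ngrams would pass through such a filter unchanged.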

tm_clean(data, token = "words", stopwords = NULL, ...)

Arguments

data

A Meeting Query dataset in the form of a data frame.

token

A character string, either "words" or "ngrams", determining the type of tokenisation to return. Defaults to "words".

stopwords

A character vector or a single-column data frame with the column named 'word', containing custom stopwords to remove. Defaults to NULL, in which case no custom stopwords are removed.

...

Additional parameters to pass to tidytext::unnest_tokens().
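One parameter that is commonly passed this way is `n`, which sets the ngram length when `token = "ngrams"`. The sketch below calls tidytext::unnest_tokens() directly on a made-up subject line to show the effect. A call such as tm_clean(mt_data, token = "ngrams", n = 2) should behave the same way, assuming the extra argument is forwarded unchanged.

```r
library(tidytext)

# Hypothetical single subject line, for illustration only
subjects <- tibble::tibble(line = 1L, text = "weekly planning and budget review")

# n = 2 requests bigrams; without it the ngram tokeniser returns trigrams
bigrams <- unnest_tokens(subjects, output = word, input = text,
                         token = "ngrams", n = 2)

bigrams
```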

Value

A data frame with two columns:

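  • line: an integer identifying the subject line the token came from

  • word: the token itself, as a character string (a single word or an ngram)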

Examples

# words
tm_clean(mt_data)
#> # A tibble: 6,520 × 2
#>     line word       
#>    <int> <chr>      
#>  1     1 planning   
#>  2     1 core       
#>  3     2 agile      
#>  4     2 officer    
#>  5     3 setup      
#>  6     3 performance
#>  7     3 ryan       
#>  8     3 friday     
#>  9     3 consumer   
#> 10     4 volometrix 
#> # … with 6,510 more rows

# ngrams
tm_clean(mt_data, token = "ngrams")
#> # A tibble: 7,688 × 2
#>     line word                 
#>    <int> <chr>                
#>  1     1 planning will and    
#>  2     1 will and core        
#>  3     1 and core r           
#>  4     1 core r d             
#>  5     1 r d from             
#>  6     2 the agile officer    
#>  7     3 setup performance and
#>  8     3 performance and the  
#>  9     3 and the ryan         
#> 10     3 the ryan for         
#> # … with 7,678 more rows