## Stats
### `calculate_entity_coverage_entropy(entity_coverage)`
Use entropy to calculate a single metric for entity coverage.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `entity_coverage` | `List[recon.types.EntityCoverage]` | List of `EntityCoverage` from `get_entity_coverage` | required |
**Returns**

| Type | Description |
|---|---|
| `float` | Entropy for entity coverage counts |
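A minimal usage sketch chaining `get_entity_coverage` (documented below) into this function, assuming `train_data` is a `List[Example]` you have loaded elsewhere (the variable name is illustrative):

```python
from recon.stats import calculate_entity_coverage_entropy, get_entity_coverage

# Assumption: train_data is a List[recon.types.Example] loaded elsewhere.
coverage = get_entity_coverage(train_data)           # List[EntityCoverage]
score = calculate_entity_coverage_entropy(coverage)  # single float metric
print(f"Entity coverage entropy: {score:.3f}")
```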
### `calculate_entity_coverage_similarity(x, y)`
Calculate how well dataset `x` covers the entities in dataset `y`. Use this to measure how well your train set annotations cover the annotations in your dev/test sets.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `List[recon.types.Example]` | Dataset to compare coverage to (usually `corpus.train`) | required |
| `y` | `List[recon.types.Example]` | Dataset to evaluate coverage for (usually `corpus.dev` or `corpus.test`) | required |
**Returns**

| Type | Description |
|---|---|
| `EntityCoverageStats` | Stats with: 1. base entity coverage (does each entity in `y` exist in `x`); 2. count coverage (sum of the `EntityCoverage.count` property for each `EntityCoverage` in `y`, a more holistic coverage scaled by how often entities occur in `x` and `y`) |
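A hedged sketch of the typical call, assuming a `corpus` object with loaded `train` and `dev` datasets:

```python
from recon.stats import calculate_entity_coverage_similarity

# Assumption: corpus.train and corpus.dev are List[recon.types.Example].
stats = calculate_entity_coverage_similarity(corpus.train, corpus.dev)
# stats is an EntityCoverageStats holding the base and count-weighted
# coverage described in the Returns table above.
print(stats)
```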
### `calculate_label_balance_entropy(ner_stats)`
Use entropy to calculate a metric for label balance, based on an `NERStats` object.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `ner_stats` | `NERStats` | `NERStats` for a dataset | required |
**Returns**

| Type | Description |
|---|---|
| `float` | Entropy for the annotation counts of each label |
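A minimal sketch chaining `get_ner_stats` (documented below) into this function; `train_data` is an assumed `List[Example]`:

```python
from recon.stats import calculate_label_balance_entropy, get_ner_stats

# Assumption: train_data is a List[recon.types.Example] loaded elsewhere.
ner_stats = get_ner_stats(train_data)  # NERStats (serialize defaults to False)
balance = calculate_label_balance_entropy(ner_stats)
print(f"Label balance entropy: {balance:.3f}")
```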
### `calculate_label_distribution_similarity(x, y)`
Calculate the similarity of the label distributions of 2 datasets.

For example, this can help you understand how well your train set models your dev and test sets. Empirically, you want a similarity over 0.8 when comparing your train set to each of your dev and test sets.

```python
calculate_label_distribution_similarity(corpus.train, corpus.dev)
# 98.57

calculate_label_distribution_similarity(corpus.train, corpus.test)
# 73.29 - This is bad, let's investigate our test set more
```
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `List[recon.types.Example]` | Dataset | required |
| `y` | `List[recon.types.Example]` | Dataset to compare `x` to | required |
**Returns**

| Type | Description |
|---|---|
| `float` | Similarity of the label distributions |
### `detect_outliers(seq, use_log=False)`
Detect outliers in a numerical sequence.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `seq` | `Sequence[Any]` | Sequence of ints or floats | required |
| `use_log` | `bool` | Use the logarithm of `seq` | `False` |
**Returns**

| Type | Description |
|---|---|
| `Outliers` | Tuple of low and high outlier indices (`Tuple[List[int], List[int]]`) |
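For example, you might flag examples with unusually short or long text. A minimal sketch, assuming `data` is a `List[Example]` loaded elsewhere:

```python
from recon.stats import detect_outliers

# Assumption: data is a List[recon.types.Example] loaded elsewhere.
lengths = [len(example.text) for example in data]
outliers = detect_outliers(lengths, use_log=True)
# Per the Returns table, outliers holds the indices of the low and high
# outliers in the original sequence.
print(outliers)
```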
### `entropy(seq, total=None)`
Calculate Shannon entropy for a sequence of floats or integers. If floats, they are checked to be valid probabilities; if integers, each n in `seq` is divided by `total` before calculating entropy.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `seq` | `Union[List[int], List[float]]` | Sequence to calculate entropy for | required |
| `total` | `int` | Total to divide by for a list of ints | `None` |
**Exceptions**

| Type | Description |
|---|---|
| `ValueError` | If `seq` is not valid |
**Returns**

| Type | Description |
|---|---|
| `float` | Entropy for the sequence |
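A short sketch of both input modes described above:

```python
from recon.stats import entropy

# Floats are treated as probabilities (and validated as such).
entropy([0.5, 0.25, 0.25])

# Integer counts are divided by total before the entropy calculation.
entropy([50, 25, 25], total=100)
```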
### `get_entity_coverage(data, sep='||', use_lower=True, return_examples=False)`
Identify how well your dataset covers an entity type. Get insight into how many times certain text/label span combinations exist across your data so you can focus your annotation efforts better, rather than annotating examples your model already understands well.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `List[recon.types.Example]` | List of examples | required |
| `sep` | `str` | Separator used in the coverage map; only change it if the default separator appears in your span text | `'\|\|'` |
| `use_lower` | `bool` | Use the lowercase form of the span text in `ents_to_label` | `True` |
| `return_examples` | `bool` | Return examples that contain the entity label annotation | `False` |
**Returns**

| Type | Description |
|---|---|
| `List[recon.types.EntityCoverage]` | Sorted list of `EntityCoverage` objects containing the text, label, count, and an optional list of examples where that text/label annotation exists |
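A sketch of inspecting the most common text/label combinations; `train_data` is an assumed `List[Example]`:

```python
from recon.stats import get_entity_coverage

# Assumption: train_data is a List[recon.types.Example] loaded elsewhere.
coverage = get_entity_coverage(train_data, use_lower=True)
for item in coverage[:10]:  # sorted, per the Returns description above
    print(item.text, item.label, item.count)
```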
### `get_ner_stats(data, serialize=False, return_examples=False)`
Compute statistics for NER data.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `List[recon.types.Example]` | Data as a list of examples | required |
| `serialize` | `bool` | Serialize to a JSON string for printing | `False` |
| `return_examples` | `bool` | Whether to return examples per type | `False` |
**Returns**

| Type | Description |
|---|---|
| `Union[recon.types.NERStats, str, None]` | `NERStats` for the dataset, serialized to a JSON string if `serialize` is `True` |
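A minimal sketch showing both the object and serialized forms; `train_data` is an assumed `List[Example]`:

```python
from recon.stats import get_ner_stats

# Assumption: train_data is a List[recon.types.Example] loaded elsewhere.
stats = get_ner_stats(train_data)                  # NERStats object
print(get_ner_stats(train_data, serialize=True))   # JSON string for display
```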
### `get_probs_from_counts(seq)`
Convert a sequence of counts to a sequence of probabilities by dividing each n by the sum of all n in `seq`.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `seq` | `Sequence[int]` | Sequence of counts | required |
**Returns**

| Type | Description |
|---|---|
| `Sequence[float]` | Sequence of probabilities |
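For example, counts of 2, 3, and 5 sum to 10, so:

```python
from recon.stats import get_probs_from_counts

get_probs_from_counts([2, 3, 5])  # -> probabilities 0.2, 0.3, 0.5
```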
### `get_sorted_type_counts(ner_stats)`
Get the list of counts for each type in the `n_annotations_per_type` property of an `NERStats` object, sorted by type name.
**Parameters**

| Name | Type | Description | Default |
|---|---|---|---|
| `ner_stats` | `NERStats` | Dataset stats | required |
**Returns**

| Type | Description |
|---|---|
| `List[int]` | List of counts sorted by type name |
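A sketch comparing two datasets label by label, which works because both count lists align by sorted type name; `corpus` is an assumed loaded corpus:

```python
from recon.stats import get_ner_stats, get_sorted_type_counts

# Assumption: corpus.train and corpus.dev are List[recon.types.Example].
train_counts = get_sorted_type_counts(get_ner_stats(corpus.train))
dev_counts = get_sorted_type_counts(get_ner_stats(corpus.dev))
# Both lists are sorted by type name, so index i refers to the same label.
print(list(zip(train_counts, dev_counts)))
```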