Single-step Evaluation
Usage
syntheseus eval-single-step \
data_dir=[DATA_DIR] \
fold=[TRAIN, VAL or TEST] \
model_class=[MODEL_CLASS] \
model_dir=[MODEL_DIR]
The eval-single-step
command accepts further arguments to customize the evaluation; see BaseEvalConfig
in cli/eval_single_step.py
for the complete list.
The code will scan data_dir
looking for files matching *{train, val, test}.{jsonl, csv, smi}
and select the right data format based on the file extension. An error will be raised in case of ambiguity. Only the fold that was selected for evaluation has to be present.
Data format
The single-step evaluation script supports reaction data in one of three formats.
JSONL
Our internal format based on *.jsonl
files in which each line is a JSON representation of a single reaction, for example:
{"reactants": [{"smiles": "Cc1ccc(Br)cc1"}, {"smiles": "Cc1ccc(B(O)O)cc1"}], "products": [{"smiles": "Cc1ccc(-c2ccc(C)cc2)cc1"}]}
ReactionSample
object, so it can include additional metadata such as template information, while the reactants and products can include other fields accepted by the Molecule
object. The evaluation script will only use reactant and product SMILES to compute the metrics.
Unlike the other formats below, reactants and products in this format are assumed to be already stripped of atom mapping, which leads to slightly faster data loading as that avoids extra calls to rdkit
.
CSV
This format is based on *.csv
files and is commonly used to store raw USPTO data, e.g. as released by Dai et al.:
id,class,reactants>reagents>production
ID,UNK,[cH:1]1[cH:2][c:3]([CH3:4])[cH:5][cH:6][c:7]1Br.B(O)(O)[c:8]1[cH:9][cH:10][c:11]([CH3:12])[cH:13][cH:14]1>>[cH:1]1[cH:2][c:3]([CH3:4])[cH:5][cH:6][c:7]1[c:8]2[cH:14][cH:13][c:11]([CH3:12])[cH:10][cH:9]2
The evaluation script will look for the reactants>reagents>production
column to extract the reaction SMILES, which are stripped of atom mapping and canonicalized before being fed to the model.
SMILES
The most compact format is to list reaction SMILES line-by-line in a *.smi
file:
[cH:1]1[cH:2][c:3]([CH3:4])[cH:5][cH:6][c:7]1Br.B(O)(O)[c:8]1[cH:9][cH:10][c:11]([CH3:12])[cH:13][cH:14]1>>[cH:1]1[cH:2][c:3]([CH3:4])[cH:5][cH:6][c:7]1[c:8]2[cH:14][cH:13][c:11]([CH3:12])[cH:10][cH:9]2
The data will be handled in the same way as for the CSV format, i.e. it will only be fed into model after removing atom mapping and canonicalization.