Our evaluation framework compares various versions of the Imago OCR application with other applications for optical recognition of molecule images. For each image, the testing framework measures the execution time and the similarity score against a reference molecule file in Molfile format. The Indigo toolkit is used to measure molecule similarity. Because different applications produce their output in different ways, the testing framework applies the following rules to standardize molecules:
Hydrogens are folded.
If the output contains multiple molecules in SDF format, all of them are merged into a single molecule with several disconnected fragments.
Both aromatized and dearomatized structures are compared, and the best score is selected.
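The measurement and standardization steps above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the names `Chemistry`, `standardized_score`, `evaluate`, `tanimoto`, and `toy` are hypothetical, and the chemistry operations are passed in as plain callables instead of calling Indigo directly. The `toy` stand-ins treat a molecule as a set of bond labels, which is an assumption made only so the example runs without a chemistry toolkit.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Chemistry:
    """Toolkit operations the harness needs (hypothetical interface)."""
    fold_hydrogens: Callable[[Any], Any]
    merge: Callable[[List[Any]], Any]
    aromatize: Callable[[Any], Any]
    dearomatize: Callable[[Any], Any]
    similarity: Callable[[Any, Any], float]


def standardized_score(records: List[Any], reference: Any, chem: Chemistry) -> float:
    """Apply the three standardization rules and return the better score."""
    # Multiple output records become one molecule with several fragments,
    # then hydrogens are folded on both sides of the comparison.
    candidate = chem.fold_hydrogens(chem.merge(records))
    reference = chem.fold_hydrogens(reference)
    # Compare the aromatized and the dearomatized forms; keep the best score.
    aromatic = chem.similarity(chem.aromatize(candidate), chem.aromatize(reference))
    kekule = chem.similarity(chem.dearomatize(candidate), chem.dearomatize(reference))
    return max(aromatic, kekule)


def evaluate(recognize: Callable[[Any], List[Any]], image: Any,
             reference: Any, chem: Chemistry) -> Tuple[float, float]:
    """Return (execution time in seconds, similarity score) for one image."""
    start = time.perf_counter()
    records = recognize(image)  # only the recognition call is timed
    elapsed = time.perf_counter() - start
    return elapsed, standardized_score(records, reference, chem)


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two label sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy stand-ins (assumption): a molecule is a frozenset of bond labels.
toy = Chemistry(
    fold_hydrogens=lambda m: frozenset(b for b in m if "H" not in b),
    merge=lambda ms: frozenset().union(*ms),
    aromatize=lambda m: m,    # identity: the toy model has no aromaticity
    dearomatize=lambda m: m,
    similarity=tanimoto,
)

# A fake recognizer that returns two records for a single image.
fake_ocr = lambda image: [frozenset({"C1-C2"}), frozenset({"C3-O4", "O4-H5"})]
elapsed, score = evaluate(fake_ocr, "image.png", frozenset({"C1-C2", "C3-O4"}), toy)
print(f"score={score:.2f}, time={elapsed:.6f} s")
```

Passing the operations in as callables keeps the scoring logic independent of any particular toolkit; in a real setup, each field of `Chemistry` would presumably wrap the corresponding toolkit operation.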
Diverse Dataset Report
This report covers 5 datasets from different sources, each containing 500 molecule images:
If you can suggest other test sets or other publicly available solutions, we would be happy to include them in the report.