Our evaluation framework compares various versions of the Imago OCR application with other applications for optical character recognition of molecule images. For each image, the testing framework measures the execution time and the similarity score against a reference molecule file in Molfile format. The Indigo toolkit is used to measure molecule similarity. Because different applications produce output in different ways, the testing framework applies the following rules to standardize molecules:
- Hydrogens are folded.
- If the output contains multiple molecules in SDF format, all of them are merged into a single molecule with several disconnected fragments.
- Both aromatized and dearomatized structures are compared, and the best score is selected.
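The per-image measurement and the three rules above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: the `Toolkit` bundle, `standardized_score`, and `evaluate` are hypothetical names, the chemistry operations are passed in as plain callables (in practice they would wrap the Indigo toolkit), and the sketch assumes that the reference is standardized the same way as the output and that an empty output scores zero. The toy stand-ins at the bottom treat a molecule as a set of atom labels purely to make the example runnable.

```python
import time
from dataclasses import dataclass
from typing import Callable, Generic, List, Tuple, TypeVar

M = TypeVar("M")  # whatever molecule type the chemistry toolkit provides


@dataclass
class Toolkit(Generic[M]):
    """Hypothetical bundle of toolkit operations; each returns a new molecule."""
    merge: Callable[[List[M]], M]        # join fragments into one molecule
    fold_hydrogens: Callable[[M], M]
    aromatize: Callable[[M], M]
    dearomatize: Callable[[M], M]
    similarity: Callable[[M, M], float]  # higher means more similar


def standardized_score(tk: Toolkit, outputs: List[M], reference: M) -> float:
    """Apply the three standardization rules and return the best score."""
    if not outputs:
        return 0.0  # assumption: nothing recognized scores zero
    candidate = tk.fold_hydrogens(tk.merge(outputs))
    ref = tk.fold_hydrogens(reference)  # assumption: reference is folded too
    # Compare in both aromatized and dearomatized form; keep the best score.
    return max(
        tk.similarity(form(candidate), form(ref))
        for form in (tk.aromatize, tk.dearomatize)
    )


def evaluate(tk: Toolkit, recognize: Callable[[str], List[M]],
             image: str, reference: M) -> Tuple[float, float]:
    """Return (elapsed_seconds, score) for one image; only recognition is timed."""
    start = time.perf_counter()
    outputs = recognize(image)
    elapsed = time.perf_counter() - start
    return elapsed, standardized_score(tk, outputs, reference)


# Toy stand-ins (not real chemistry): a molecule is a set of atom labels.
toy = Toolkit(
    merge=lambda mols: frozenset().union(*mols),
    fold_hydrogens=lambda m: frozenset(a for a in m if not a.startswith("H")),
    aromatize=lambda m: m,
    dearomatize=lambda m: m,
    similarity=lambda a, b: len(a & b) / len(a | b) if a | b else 1.0,
)
reference = frozenset({"C1", "C2", "O1", "H1"})
elapsed, score = evaluate(
    toy,
    lambda image: [frozenset({"C1", "C2"}), frozenset({"O1", "H2"})],
    "example.png",
    reference,
)
```

Passing the operations in as callables keeps the scoring logic independent of any particular toolkit, so the same harness can score the output of every application under comparison.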
Diverse Dataset Report
This report covers five datasets from different sources, each containing 500 molecule images:
- Image2Structure (description): A random subset of 500 images from the Image2Structure task at the TREC Chem 2011 conference.
- Mobile Camera (description): Mobile phone photos of 500 PubChem molecules rendered using the Indigo toolkit.
- Rendered (description): 500 PubChem molecules rendered using the Indigo toolkit.
- USPTO (description): A random subset of 500 structures from a validation set available at the OSRA website.
- chem-infty (description): A random subset of 500 structures from the Chem-Infty Dataset.
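For anyone assembling a comparable test set, a fixed-size random subset can be drawn reproducibly along these lines. The report does not say how its subsets were selected, so the function name `draw_subset`, the fixed seed, and the sorting step are all assumptions made for illustration.

```python
import random
from typing import Iterable, List


def draw_subset(names: Iterable[str], size: int = 500, seed: int = 0) -> List[str]:
    """Return `size` distinct names chosen uniformly at random.

    Sorting first makes the result independent of the input order, and a
    private generator with a fixed seed makes it repeatable across runs.
    Raises ValueError if fewer than `size` names are available.
    """
    rng = random.Random(seed)  # does not touch the global random state
    return rng.sample(sorted(names), size)


# Hypothetical file names standing in for a larger source collection.
all_images = [f"image_{i:05d}.png" for i in range(2000)]
subset = draw_subset(all_images)
```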
If you can suggest other test sets or other publicly available solutions, we would be happy to include them in the report.