Peer review - Investigating the reliability and interpretability of machine learning frameworks for chemical retrosynthesis (2024)

Comments from Referee #1:

Comment 1) Page 3: “In 2017, Segler and Waller devised the first deep-learning model to smartly rank templates” -> “smartly” could be replaced by an adverb (it seems to imply previous approaches were not smart)

Response:
- Thank you for pointing out the wording. We have replaced the phrase with “rank templates by probability” accordingly.

Comment 2a) The proposed Diversity metric rewards models that return a distribution of proposed reaction classes that resembles the prior distribution of reactions observed across all precedents. I wonder if this is a reasonable expectation. Wouldn’t it be reasonable to assume that the distribution of proposed reaction classes should in fact be biased, considering it is conditioned on a specific target molecule that might be more easily achieved via certain transformations? In fact, isn’t a bias toward the more suitable reaction classes indicative of learning? If the proposed distribution is the same as the prior one, without knowledge of the molecule that needs to be synthesized, couldn’t this be seen as a failure of the model to learn the chemistry that is needed to achieve the target molecule of interest? All in all, I feel that the rationale behind this metric may need to be revised and/or discussed in more detail.

Response:
- We thank you for this suggestion. The metric was initially chosen to reflect the algorithm’s ability to propose all different reaction types in the dataset. However, we agree that bias can indeed be desirable to indicate the model’s progress in learning about feasible reaction chemistry for a specific target. Additionally, as the metric is not as informative as Div, we decided to remove it from the script to avoid confusion. We left the reaction class distribution plot in the ESI as we believe that it still could be insightful for an interested reader to observe certain trends for a specific model category.

Comment 2b) Also on the diversity metric: rather than measuring similarity based on reaction classes, wouldn’t it be possible to define similarity/diversity using, e.g., reaction fingerprints (e.g. https://pubs.rsc.org/en/content/articlelanding/2022/dd/d1dd00006c)? This may allow one to evaluate the diversity of proposed retrosynthetic steps (e.g. by looking at the distribution and dispersion of pairwise similarities for each model) without having to classify reactions by type or compare to a reference distribution. It may be worth considering or discussing.

Response:
- Thank you for bringing up this interesting idea. We have added a short introduction to this idea within Section 2.2.2 – Diversity. For our work, the reactions are classified to offer greater interpretability/understanding of preferred reaction types for a given model. However, if no reaction class information is available in the dataset, the idea of measuring diversity by pairwise dispersion is indeed a good alternative.
- It now reads:
“However, it should be noted that there are other methods to measure diversity. For example, one could use data-driven reaction fingerprints (e.g. rxnfp or DRFP) to measure average pairwise dispersion between reactions, with a larger dispersion indicating a higher diversity. Nonetheless, this would come with reduced interpretability.”
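The pairwise-dispersion idea quoted above can be sketched in a few lines. This is a minimal illustration only: the fingerprints here are toy bit-index sets standing in for rxnfp/DRFP vectors, and the Tanimoto distance is one plausible choice of dissimilarity, not necessarily the one a given study would use.

```python
from itertools import combinations

def tanimoto_distance(fp_a, fp_b):
    """Tanimoto (Jaccard) distance between two fingerprints, each
    represented as the set of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return 1.0 - len(fp_a & fp_b) / len(fp_a | fp_b)

def pairwise_dispersion(fingerprints):
    """Mean pairwise distance over all proposed reactions for one target;
    a larger value indicates a more diverse set of proposals."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)

# Toy reaction fingerprints: two identical proposals plus one distinct one.
proposals = [{1, 2, 3}, {1, 2, 3}, {7, 8, 9}]
print(round(pairwise_dispersion(proposals), 3))
```

Real reaction fingerprints would be dense vectors (rxnfp) or hashed bit vectors (DRFP); the aggregation step is the same either way.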

Comment 3) Page 12: “This difference is possibly linked to a dissimilar method of calculating the top-k accuracy.” It might be good to add a sentence clarifying how the top-k calculations differ.

Response:
- Thank you for the suggestion. Upon closer examination of the code, we could not find any flaws with the top-k accuracy. Instead, we found the performance difference due to different “optimal” hyperparameters used in the paper compared to the GitHub repo, namely:
Paper: GNN hidden_dim: 512; MPNN depth: 5 (1st model), 7 (2nd model)
GitHub: GNN hidden_dim: 256; MPNN depth: 10 (1st & 2nd model)
- Additionally, for LocalRetro, the authors have updated their top-k accuracy metric on their GitHub. Their updated results for the top-k accuracy are in agreement with our study.
- We have therefore removed the sentence about dissimilar top-k accuracy in Section 3.2.1. for LocalRetro and added a sentence for the different hyperparameter selection of G2Retro.

Comment 4) The “GNNExplainer” seems to be referred to also as the “GraphExplainer” and the “Explainer”. If these three terms do indeed refer to the same method, there should be consistency across the text to avoid possible confusion.

Response:
- We appreciate the reviewer highlighting this inconsistency within our manuscript. To ensure consistency, we have rephrased all occurrences to GNNExplainer.

Comments 5a/b) Page 16: “Case Study 4 - Kinetic Inhibitor” -> “Kinase Inhibitor”; Case Study 5, across all text and figures: “Waferin” -> “Warfarin”
Response:
- Thank you for pointing out the typos. We have updated the naming of the two molecules, accordingly.

Comment 6) “The node (atom) features are only updated once, at the end of the message passing operation”. This does not seem to be right, or phrased correctly, certainly not for all D-MPNN implementations […]. In D-MPNNs with edge-centered updates the atom features are concatenated to the bond ones, but then are still updated via message passing, and aggregated into the final atom embeddings (only once) after message passing. But the data/information in the initial atom features is still used and updated during message passing. I think I see what the authors may have wanted to convey but it might be worth rephrasing the sentence to be more precise.

Response:
- We appreciate the reviewer’s comment, and we agree that our wording does not convey clearly the principled operation of the D-MPNN. According to this suggestion, we have rephrased the sentence in Section 3.3.3, item 3.
- It now reads:
“The D-MPNN differs from the conventional Message Passing Network in one major respect: messages in the graph propagate via directed edges (bonds) rather than via nodes (atoms). This has the advantage of preventing information from being passed back and forth between adjacent nodes. Furthermore, in the case of edge-centered updates, the finalised node embeddings are constructed by aggregating the updated edge embeddings along with the initial node features. Consequently, the atoms (nodes) incorporate a larger proportion of the initial atom features.”

Comments from Referee #2:

Comment 1) Please fix the references in the paper. There are too many uncertain and inappropriately formatted citations. Without fixing those problems, it is not possible to review the manuscript.

Response:
- Thank you for your comment regarding referencing. We have changed references (5,13,27,29,38,43,45,47,58) from their (arXiv) preprint reference to the respective peer-reviewed publications. Regarding the formatting: the style is inherent to the journal Digital Discovery (which omits the publication’s title).


Comments from Referee #3:

Comment 1) Although I do appreciate all the details furnished in the introduction to define the types of models, the validation methods, and associated metrics, I think part 3 could probably come a little earlier, or this manuscript should be advertised as a mini-review.

Response:
- We appreciate this suggestion and have accordingly revised Section 2.1 by cutting it by roughly 45-50%. Section 3 now comes 1 page earlier.

Comment 2) The discussion of the metrics is very valuable, but I am a little less convinced about the use of a selected number of case studies to conclude the chemical interpretability of the models... maybe some of them could go to the SI part of the manuscript to bring the conclusions a little earlier.

Response:
- We appreciate this suggestion. We decided to move the Warfarin case study to the ESI. Along with the reduction in Section 2 and an additional (added) benchmarking simulation for Section 3.2, the conclusion comes 1 page earlier.

Comment 3) There are a few typos in "SELFIES" at the end of the manuscript.
Response:
- We thank you for pointing out the typos. We have corrected them within Section 4 accordingly.

Comments from Referee #4:

Comment 1a) The authors offer an extensive overview of the field of one-step retrosynthesis models in Section 2.1, but at a length of over three pages it is way too extensive and detailed. This kind of overview is better suited for a review, and a recent one is indeed referenced by the authors, and this should suffice for the interested reader. […]. My strong recommendation is to cut Section 2.1 by at least 50%.

Response:
- Thank you for your suggestion. We have reduced the length of Section 2.1.1-2.1.3 by roughly 50% (Figures and Text) to streamline the introduction.

Comment 1b) However, the authors need to update Ref 5 as it is now published in a peer-reviewed journal (https://doi.org/10.1002/wcms.1694).

Response:
- We appreciate the reviewer bringing this to our attention. We have updated Ref. 5 along with references 13,27,29,38,43,45,47,58 from their (arXiv) preprint to the respective peer-reviewed publications.

Comment 2) It is of course highly subjective how to categorize single-step retrosynthesis models, but the authors do deviate from the consensus and Ref 5, with their alternative definition of semi-template methods. Yes, models like MEGAN depend on atom-mapping but they do not rely on the extraction of a template. Equating atom-mapping with template stretches the definition of templates considerably. I recommend that the authors insert 2-3 sentences discussing the different categorization of the models.

Response:
- We thank you for highlighting the difference in our model categorization to Ref. 5. Our categorization follows two principal statements in the following references:
1. Reference 5 (Zhong et al.): “Since the chemical reaction in the datasets is atom-mapped, the transformations of atoms and bonds during the reaction can be automatically identified by comparing the product to their corresponding reactants. The retrosynthesis can be resolved by predicting these transformations.”
Following suit with their definition, MEGAN should be classified as a semi-template model, too, which was not done in their review paper – possibly an oversight by the authors.
2. Reference 27 (Schwaller et al.): “However, correctly mapping the product back to the reactant atoms is still an unsolved problem, and, more disconcertingly, commonly used tools to find the atom mapping (e.g., NameRXN) are themselves based on libraries of expert rules and templates. This creates a vicious circle. Atom-mapping is based on templates and templates are based on atom mapping, and ultimately, seemingly automatic techniques are actually premised on handcrafted and often artisanal chemical rules.”
As USPTO-50k utilizes atom-mapping by NameRXN, there is a strong dependence between templates and atom mapping in the dataset. According to this, only models that do not utilize any information in the form of templates/atom-mapping can therefore be classified as template-free.
- Following the reviewer’s suggestion, we have introduced two additional sentences (and Schwaller et al.’s reference) in Section 2.1 - Background to elucidate the categorization.
- It now reads:
“Utilising atom mapping, one can extract the sequence of atom and bond transformations during a reaction computationally. Thus, semi-template models address the prediction of the transformation sequence5. Since reaction templates are curated from atom mapping (and atom mapping algorithms themselves depend on templates and/or expert rules21), semi-template and template-based models share a certain degree of knowledge, thus giving rise to the naming convention. Within this paper, an algorithm utilising exact atom mapping is categorised as semi-template.”

Comment 3a) The authors base their benchmark on USPTO-50K with the motivation “The USPTO-50k is the preferred dataset for retrosynthesis thanks to its rich data”. This is almost laughably controversial.

Response:
- We appreciate this concern, and we do acknowledge the fact that larger datasets exist for retrosynthesis prediction (such as USPTO-full/PaRoutes). We utilized the USPTO-50k in this work as the literature and research community primarily use this dataset for SOTA comparison and as a performance advertisement within research abstracts. Below, we list several recent (peer-reviewed) publications, up until early 2024, that solely evaluate their model on the USPTO-50k. According to this, the USPTO-50k continues to be the preferred dataset within the community for retrosynthesis predictions. We therefore follow the general opinion in the research community, and our benchmark focuses on the USPTO-50k.
List of References:
2024: G-MATT: Single-step retrosynthesis prediction using molecular grammar tree transformer doi: 10.1002/aic.18244
2024: MARS: a motif-based autoregressive model for retrosynthesis prediction doi: 10.1093/bioinformatics/btae115
2024: RCsearcher: Reaction center identification in retrosynthesis via deep Q-learning doi: 10.1016/j.patcog.2024.110318
2023: Retrosynthesis prediction with local template retrieval (RetroKNN) doi: 10.1609/aaai.v37i4.25664
2023: Enhancing diversity in language based models for single-step retrosynthesis doi: 10.1039/D2DD00110A
2023: Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing (Graph2Edits) doi: 10.1038/s41467-023-38851-5
2023: G2Retro as a two-step graph generative models for retrosynthesis prediction doi: 10.1038/s42004-023-00897-3
2023: SynCluster: Reaction Type Clustering and Recommendation Framework for Synthesis Planning doi: 10.1021/jacsau.3c00607
2022: RetroComposer: Composing Templates for Template-Based Retrosynthesis Prediction doi: 10.3390/biom12091325
2022: Improving the performance of models for one-step retrosynthesis through re-ranking doi: 10.1186/s13321-022-00594-8

Comment 3b) From what I gather from the section, the “rich data” refers to the classification information as well as the improved NextMove atom-mapping, but this hardly covers for the limited number of reaction types available in the dataset and the low number of data points. It has been proven several times (see Ref 48 for instance) that the performance on USPTO-50K is not transferable to higher-volume datasets with more diverse reactions. Hence, any conclusion drawn on USPTO-50K is very limiting. The authors at least need to acknowledge that, and they should preferably repeat some of their evaluation on a larger dataset drawn from the USPTO set (e.g. USPTO-MIT, full USPTO, or PaRoutes USPTO).

Response:
- Thank you for this suggestion. We extended our methodology for the top-performing models from each category to USPTO-PaRoutes (~1M reactions) – the cleaner version of the USPTO-full. We added Section 3.2.3 – Scalability of Benchmarking Results for this. From our findings, we see that the performance on USPTO-50k is reasonably transferable to larger datasets drawn from the USPTO, although the difference (and magnitude) in rt-accuracy becomes smaller for all 3 models. This conclusion is limited, however, as we have only tested 3 of the 12 models on this dataset.
- A similar finding was made previously by Maziarz et al. (Ref. 18) for the proprietary Pistachio dataset, which is a superset of the USPTO-full/PaRoutes: “Surprisingly, model ranking on USPTO-50K transfers to Pistachio quite well, although all results are substantially degraded, e.g. in terms of top-50 accuracy all models still fall below 55% […]”.
- On the other hand, we cannot be certain that our results would be transferable to datasets outside the distribution of the USPTO.

Comment 4) Dropbox is not the best medium to distribute research material. I would recommend the authors to use a service meant for research dissemination that provides long-term, fixed identifiers such as Zenodo or FigShare.

Response:
- Thank you for raising the matter of data distribution. The data is now hosted on FigShare and the links in the paper and GitHub have been updated to reflect this change.

Comment 5a) With regards to the evaluation metrics, I have several remarks. First, I don’t think it is as uncommon to evaluate one-step retrosynthesis model with something else than top-n accuracy as the authors make it out. It has been acknowledged for many years that this is an insufficient metric for one-step performance.

Response:
- Thank you for pointing this out. You are correct that this issue has been known for several years (which we acknowledged in Section 1 – Introduction by citing references 5,17).
- However, researchers (see list of references below) are still heavily reliant on the top-k accuracy to compare their models and to claim “SOTA” performance until now (early 2024). As the references listed below are published in peer-reviewed journals, we suggest that it is still considered common practice to only evaluate the model on the top-k accuracy and publish the findings accordingly.
- To our knowledge, only a select few algorithms like LocalRetro, Retroformer and RetroPrime utilise round-trip accuracies, but with different forward models/ways of calculating the round-trip accuracy. There is no unifying framework for this, which we provide within our paper.
- Other papers utilize the MaxFrag accuracy, which, under the hood, applies the same principle as the top-k accuracy, thereby inheriting the same flaws. The invalidity metric has only been applied to template-free models, as we have acknowledged in the paper; we extend this measure to semi-template models. The other metrics employed in the paper (duplicity/diversity) are generally not reported in single-step papers.
- In Section 2.2.2 (first sentence), we acknowledged the presence of other evaluation metrics. To further clarify that the top-k accuracy is the most popular rather than the only metric, we changed “current” to “most popular” in Section 1 – Introduction.
List of References:
2023: Single-step retrosynthesis prediction by leveraging commonly preserved substructures doi: 10.1038/s41467-023-37969-w
2023: BiG2S: A dual task graph-to-sequence model for the end-to-end template-free reaction prediction doi: 10.1007/s10489-023-05048-8
2023: Retrosynthesis prediction with an interpretable deep-learning framework based on molecular assembly tasks doi: 10.1038/s41467-023-41698-5
- Please also see comment 3a for other references relying heavily on the top-k accuracy for performance comparison (2024 G-Matt, 2023 RetroKNN, 2023 Graph2Edits, 2023 G2Retro, 2023 SynCluster etc.)
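The round-trip accuracy discussed above can be sketched as follows. This is a minimal illustration under our own simplifying assumptions: `forward_model` is a stand-in for any forward reaction-prediction model, and the toy string-joining model exists purely to make the sketch runnable; it is not the forward model used in the paper.

```python
def round_trip_accuracy(products, predictions, forward_model, k=1):
    """Fraction of targets for which at least one of the top-k proposed
    reactant sets is mapped back to the target product by a forward model."""
    hits = 0
    for product, reactant_sets in zip(products, predictions):
        if any(forward_model(r) == product for r in reactant_sets[:k]):
            hits += 1
    return hits / len(products)

# Toy forward "model": concatenates sorted fragments (illustration only).
toy_forward = lambda reactants: "".join(sorted(reactants))

products = ["AB", "CD"]
predictions = [[{"A", "B"}],   # round-trips back to "AB" -> counted
               [{"X", "Y"}]]   # gives "XY", not "CD" -> not counted
print(round_trip_accuracy(products, predictions, toy_forward, k=1))
```

The choice of forward model and of k is exactly where published round-trip numbers diverge, which is the gap a unifying framework closes.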

Comment 5b) Furthermore, I think that all the evaluation metrics should be re-scaled and reported on a common scale from 0 to 1, with 1 being the best. It would aid in for instance the interpretation of Table 1, where it is now rather difficult to judge the different models as you must think about the magnitude and scale of each number.

Response:
- We are very thankful for this great advice. All metrics have been rescaled from 0 to 1, with 1 being the best, in Table 1. This change directly leads to an important change in the methodology: the invalidity metric now becomes a validity metric, and this has been changed throughout the manuscript. The invalidity can be calculated as Inv_k = 1 – Val_k.
- The SCScore metric remains unscaled, but based on the results it is highly unlikely to exceed 1 or fall below 0; the highest SC difference in the paper is 0.47 and the lowest is 0.3.
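The rescaling relation Inv_k = 1 – Val_k can be illustrated with a small sketch. The validity check here is a deliberate placeholder: in practice it would wrap an RDKit SMILES-sanitisation call, which is not reproduced here.

```python
def validity_at_k(predictions, is_valid, k):
    """Val_k: fraction of all top-k predictions (pooled over targets)
    that pass a molecule-validity check."""
    checked = [is_valid(smi) for preds in predictions for smi in preds[:k]]
    return sum(checked) / len(checked)

# Placeholder check for illustration (real code: rdkit Chem.MolFromSmiles).
is_valid = lambda smi: smi != "INVALID"

preds = [["CCO", "INVALID"], ["c1ccccc1", "CC"]]
val_2 = validity_at_k(preds, is_valid, k=2)
inv_2 = 1.0 - val_2  # Inv_k = 1 - Val_k after rescaling
print(val_2, inv_2)
```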

Comment 5c) Also, in Table 2 it is just plain confusing to have the top-k accuracy in percentages and the round-trip accuracy in fractions.

Response:
- Thank you for your suggestion. We have changed the round-trip accuracy in Table 2 to percentages (unlike Table 1, where rt-accuracy remains in fractions). This is because the top-k accuracy is usually reported as a percentage in the literature, and we utilize Table 2 to corroborate our results with the literature.

Comment 6) The authors have chosen two metrics for diversity, but I would recommend picking one and sticking with it. My recommendation is diversity (Div) as it is easier to understand and interpret. It should be acknowledged by the authors that a diverse retrosynthesis is not always possible. For instance, a molecule could have only one bond that can be disconnected in one specific reaction.

Response:
- We appreciate this suggestion. We removed the second metric accordingly and added information in Section 2.2.2 – Diversity to acknowledge the reviewer’s comment on the fact that diverse retrosynthesis is desired but not possible in all cases.
- It now reads:
“Finally, note that while a diverse set of predictions is desired, it might not always be possible e.g. for molecules that only have one feasible disconnection site.”

Comment 7) The authors need to elaborate on why template-based models can generate invalid SMILES, as this might seem counterintuitive to the reader. It should be explained that it has a different origin than the invalid SMILES from generative models.

Response:
- Thank you for the comment. We added a sentence to elaborate on the origin of “invalid” SMILES for template-based models in Section 3.2.2.
- It now reads:
“As template-based models guarantee to return a valid chemical transformation, the invalidity herein refers to the inability to retrieve a relevant template that matches the target, i.e. a template whose subgraph pattern oT matches any subgraph o in the product molecule. As the number of relevant templates to a specific product is limited, the model will fail to return a relevant template after a certain top-k.”

Comment 8a) The authors need to elaborate that some reaction classes will lead to increased complexity, like protections, and that is perfectly alright. Hence, it is not always desirable to have a lowering of the SCScore.

Response:
- We thank you for raising this issue. Indeed, lowering the SCScore (when going from products to reactants) is not always desirable for some molecules. We have added a sentence in Section 2.2.2 – SCScore to clarify this.
- It now reads:
“It should be noted that a positive SC is not desired for all reaction classes, such as protection reactions. As these only make up 1.2% of the dataset, the overall aim remains to maximise SC.”
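The sign convention discussed above can be made concrete with a sketch. This assumes SC is the product's score minus the highest reactant score (one plausible convention; the paper's exact definition may differ), and the numbers are hypothetical, not actual SCScore model outputs.

```python
def delta_sc(product_score, reactant_scores):
    """SC for one retrosynthetic step: positive when the most complex
    reactant is still simpler than the product (complexity decreases).
    Scores would normally come from the SCScore model (scale 1-5);
    here they are made-up illustrative values."""
    return product_score - max(reactant_scores)

# A simplifying disconnection: reactants simpler than the product.
print(round(delta_sc(3.2, [2.9, 2.5]), 2))
# A protection step: the protected reactant is *more* complex, so SC < 0,
# which is acceptable chemistry despite the negative score.
print(round(delta_sc(3.2, [3.6]), 2))
```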

Comment 8b) In a related issue, the authors need to acknowledge the merit of evaluating one-step models in a multi-step fashion, more than they currently do. Yes, it is important that the model provide feasible reactions, but it does not matter much if the proposals always lower the SCScore, produce unique and diverse solutions, etc., if those solutions do not lead to starting materials that are purchasable, i.e., you find a synthetic route for your target molecule. I think this needs to be emphasized further in Section 2.2.

Response:
- Thank you for this comment. We have elaborated on this in Section 2.2, where we added a short description of the shortcomings of single-step benchmarking and how one can synergise between our benchmark and the multistep benchmark by Maziarz et al.
- It now reads:
“As a final note: our pipeline does not guarantee that a single-step model can find synthesis routes towards purchasable building blocks. We suggest that once a promising model is identified through our pipeline, it could be further validated for synthesis planning on the benchmark proposed by Maziarz et al.18.”
- Additionally, we further highlighted the disadvantage of this methodology – the high resource and time requirements needed to benchmark a large selection of models (see Ref. 48, Torren-Peraire et al. – limitations/conclusion).

Comment 9) Lastly, the authors use XAI to compare the internal model reasoning between two GNNs and a transformer model. They use a masking algorithm together with maximizing mutual information to gather node importance from the GNN algorithms, and the attention with the highest values between reactant and product tokens to gather importance for the transformer model. The intention of the evaluation is interesting, but it comes out as a different study altogether, and arguably this subject deserves proper attention in its own paper.

Response:
- Thank you for your suggestion. The XAI could be considered a “separate” study from the benchmarking. However, as our benchmarking mostly concerns reaction feasibility, investigating the model interpretability is integral to understanding why certain models fail to propose mostly feasible reaction chemistry (i.e., semi-template and template-free models). We believe that answering this question is therefore an important addition to this study.

Comment 10a) Moreover, the comparison is inherently biased towards EGAT and DMPNN for the task of reaction center prediction, as this is their primary task, whereas the task of the transformer model is generation.

Response:
- We appreciate this comment, and we agree with the statement that the Transformer’s primary task is generation, whereas the GNNs’ task is reaction centre prediction. While stabilising functional groups often fall on/around the reaction centre, it is reasonable to assume that the comparison could be somewhat biased towards the semi-template (GNN) models. However, it is important to interpret the model architectures as they appear in the template-free and semi-template settings, which are sequence-to-sequence generation and classification, respectively. Our conclusions within Section 3.3.3 state: “1. Transformer sequence-to-sequence models […]”. We do not make any conclusion regarding the Transformer architecture for classification tasks in this regard.
- To address these concerns, we have added further content in Section 2.3.2 – Template-free Interpretability to clarify that the Transformer task is generation compared to classification and acknowledge the potential “bias” towards the classification GNN models in Section 2.3 – Black-Box Interpretability.
- It now reads:
“Furthermore, its main task concerns sequence generation, which is more challenging compared to the semi-template reaction centre classification”
and
“The aim of this study is to uncover whether the other two framework categories capture chemically important functional groups, sterics and charge transfers in the reaction. Note that these important thermodynamic features often appear in and around the reaction centre, potentially favouring the interpretability of ‘reaction-centre aware’ models.”

Comment 10b) Expecting a transformer to give attention to the reaction center by using a random attention map is flawed logic. Transformer attention considers important functional groups and discerning molecular features as well as previous tokens to then generate new molecules token-by-token. This is also the reason why the transformer will fail to identify stabilizing functional groups, as this effect will be hidden by other features important for the model to remember. Using attention in such a way and subsequently drawing conclusions is biased and should be removed.

Response:
- Thank you for highlighting the use of attention for interpretability. We highlight two relevant references below which demonstrate that attention provides human interpretability to the end-user and that (cross-)attention correlates with feature importance.
List of References:
Attention Interpretability Across NLP Tasks doi: 10.48550/arXiv.1909.11218
Attention is not not Explanation doi: 10.18653/v1/D19-1002 (This reference disproves the notion that attention is not explanation)
- Specifically, the first reference highlights the following findings:
1. While attention is largely meaningless for single-sequence tasks, it holds strong token correlation/importance for tasks involving two sequences, such as sequence translation (as in our case)
2. The most important attention weights in the Transformer are within the cross-attention layer (the attention we extracted for our interpretability study)
3. Attention weights correlate strongly with feature importance in translation tasks, i.e. the higher the attention, the more important a token is
4. Attention is, most importantly, human-interpretable
- Accordingly, the most important tokens in the product SMILES should receive the highest attention within the model. The extracted attention is cross-attention, not self-attention; it therefore attends only to the input product tokens and is neither concerned with, nor required to remember, previously generated tokens.
- By doing a summation, we obtain the accumulated importance of a token/feature. For human interpretable retrosynthesis, the most important tokens should be the ones that undergo a change during translation (i.e. in the reaction centre) as this determines the reaction chemistry.
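The summation step described above can be sketched as follows. The weights here are toy numbers; in the actual study they would be the cross-attention maps extracted from the trained Transformer, and the indexing convention (head, decoding step, input token) is an assumption for illustration.

```python
def token_importance(cross_attention):
    """Accumulate cross-attention into one importance score per product
    (input) token by summing over attention heads and decoding steps,
    then normalising so the scores sum to 1.
    Shape assumed: cross_attention[head][output_step][input_token]."""
    n_inputs = len(cross_attention[0][0])
    scores = [0.0] * n_inputs
    for head in cross_attention:
        for step in head:
            for i, weight in enumerate(step):
                scores[i] += weight
    total = sum(scores)
    return [s / total for s in scores]

# Toy example: 1 head, 2 decoding steps, 3 product tokens.
# Token 0 dominates the accumulated attention, so it would be read as the
# most important input token (ideally one at the reaction centre).
attn = [[[0.7, 0.2, 0.1],
         [0.5, 0.3, 0.2]]]
print([round(s, 2) for s in token_importance(attn)])
```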
- We agree that attention is potentially not the most rigorous approach to Transformer XAI (which is a rapidly developing field in research), compared to newer attribution/gradient methods. Accordingly, we have added further content in Section 2.3.2 – Template-free interpretability highlighting different approaches (plus a relevant reference) and acknowledging the fact that attention might not be as rigorous as newer XAI methods.
- It now reads:
“While using attention directly might not be as rigorous as recent attribution/gradient methods61, they have been proven to provide a reliable measure of model interpretability to the end-user62”

Comment 10c) In conclusion, the transformer XAI methods should be improved or removed from the comparisons completely, as the comparisons made currently do not measure identical values between the different models and are false comparisons.

Response:
- While the GNNExplainer and Attention do not return identical values, they do provide a measure of importance for each token/node in the sequence/graph. This importance value can be interpreted for each architecture, independently. In the paper, the main conclusion and discussion in Section 3.3.3. are not presented as a comparison of the GNN versus Transformer architectures saying that one is superior to the other, but rather highlighting the human interpretability of each model for their respective retrosynthesis task.
