
At the risk of taking this analogy too far, there was a recent paper [1] arguing that the ability to generate invalid SMILES is beneficial and allows the model to represent the target distribution more accurately than SELFIES does.

This could be similar to how mutations allow one to explore a wider range of options, although sometimes it can go too far and produce a non-functional individual.

[1]: https://www.nature.com/articles/s42256-024-00821-x




Yeah, that paper does not seem intuitive to me. I'm probably just bitter because I worked so hard to reproduce the original paper's results, only to realize that it didn't work very well and the authors basically swept that under the rug. I could have saved a month if they had just shared their code and weights. The element example I give is a perfect example of where it seems very unlikely that generating invalid representations would be useful.


They do share code and data; the links are at the end [1,2]. The results do seem to more or less reproduce, but I'm also not entirely convinced by the explanation they provide. I was thinking that the difference is attributable to the shenanigans that SELFIES does with valences to make sure that all strings are valid (thus entirely derailing the model after a single mistake, rather than getting the string thrown out), but I couldn't figure out how to prove it.

[1] https://doi.org/10.5281/zenodo.8321735

[2] https://doi.org/10.5281/zenodo.10680855


No, I mean the older paper with the VAE for SMILES (https://arxiv.org/abs/1610.02415), which we ended up implementing here: https://github.com/maxhodak/keras-molecules

SELFIES is just a pip install away; I think the team learned from their previous papers to be more open about the limits, as well as releasing code and weights for reproducibility.

I don't follow this area closely enough to really have any comments about SELFIES other than "well, at least the problems we identified were addressed in later work". Specifically, my goal was to start with two SMILES strings, encode them to vectors, then sample points along the path between the two vectors, and decode them back to valid molecules. Presumably, SELFIES does this far better (see the examples in the repo).
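
To make the interpolation step concrete, here is a minimal sketch in plain Python, assuming a straight-line path in latent space. The encoder and decoder are left out: `z_a` and `z_b` are made-up stand-ins for the vectors a trained VAE would produce from the two SMILES strings, and `interpolate` is an illustrative name, not a function from keras-molecules.

```python
from typing import List, Sequence


def interpolate(
    z_start: Sequence[float], z_end: Sequence[float], steps: int
) -> List[List[float]]:
    """Return `steps` evenly spaced points on the line from z_start to z_end.

    Both endpoints are included, so `steps` must be at least 2.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("need at least 2 steps to include both endpoints")

    points = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        points.append([(1 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return points


# Hypothetical 3-dimensional latent vectors (assumption: a real model
# would use a much larger dimension and get these from its encoder).
z_a = [0.0, 1.0, -2.0]
z_b = [1.0, 3.0, 2.0]

path = interpolate(z_a, z_b, steps=5)
for point in path:
    print(point)
```

Each point on `path` would then go through the decoder. With a SMILES-based model, some of the decoded strings may not parse as valid molecules and would have to be thrown out.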



