This notebook builds a baseline model that predicts molecule-protein binding with a Random Forest classifier. The features combine Extended-Connectivity Fingerprints (ECFPs), generated with RDKit to represent molecular structure, with one-hot encoded protein names. After training on a balanced sample of 60,000 molecules, the model reached a Mean Average Precision (mAP) of 0.96 on a local validation set, a score the author notes is likely inflated by overfitting. Finally, the trained classifier processes the full test set of over 1.6 million molecules in chunks and writes an ecfps_submission.csv file for the competition.
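To make the 0.96 figure concrete, the following is a minimal sketch of how a mean average precision could be computed from validation predictions. It assumes that average precision is calculated separately for each protein and then averaged; the summary does not say how the notebook groups its predictions, so that grouping, the function names, and the toy protein names and scores are all illustrative assumptions rather than the author's code.

```python
from collections import defaultdict


def average_precision(labels, scores):
    """Mean of precision@k over every rank k at which a true binder appears."""
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    # With no positives the score is undefined; return 0.0 by convention here.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(proteins, labels, scores):
    """Average the per-protein AP values (grouping by protein is an assumption)."""
    groups = defaultdict(lambda: ([], []))
    for protein, label, score in zip(proteins, labels, scores):
        groups[protein][0].append(label)
        groups[protein][1].append(score)
    per_protein = {p: average_precision(l, s) for p, (l, s) in groups.items()}
    return sum(per_protein.values()) / len(per_protein), per_protein


# Toy validation rows; protein names and scores are made up for illustration.
proteins = ["protein_A"] * 3 + ["protein_B"] * 3
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]

map_score, per_protein = mean_average_precision(proteins, labels, scores)
print(per_protein)
print(round(map_score, 4))
```

A score near 1.0 on a validation set drawn from the same molecules or building blocks as the training sample says little about unseen chemistry, which is consistent with the author's overfitting caveat.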
This notebook implements a deep learning solution for molecule-protein binding prediction using a custom Transformer architecture named Belka, built in TensorFlow and Keras. Unlike the baseline model, this approach learns representations directly from atom-level tokenized SMILES strings, which pass through custom embedding, positional encoding, and Transformer encoder layers. The framework is modular: it supports several operational modes (e.g., 'clf' for classification) and uses custom components such as a masked, weighted focal loss function (MultiLabelLoss) and a masked AUC metric for training and evaluation. The notebook also describes a data processing pipeline that uses dask and tf.data to handle large datasets, performs a validation split based on molecular building blocks, and generates a submission file from the trained model's predictions.
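As an illustration of what atom-level SMILES tokenization can look like, here is a minimal sketch that uses a commonly seen regular expression to split a SMILES string into atoms, bonds, ring-closure digits, and brackets, and then maps the tokens to padded integer ids with a padding mask. The regex, the vocabulary scheme, the reserved ids, and the function names are assumptions for illustration; the summary does not specify how Belka tokenizes its input.

```python
import re

# One token per bracket atom, two-letter halogen, organic-subset atom,
# bond symbol, branch parenthesis, or ring-closure label.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

PAD_ID, UNK_ID = 0, 1  # reserved ids (an assumption of this sketch)


def tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against characters the pattern silently skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(corpus):
    """Assign an integer id to every distinct token, after the reserved ids."""
    tokens = sorted({tok for smiles in corpus for tok in tokenize(smiles)})
    return {tok: idx for idx, tok in enumerate(tokens, start=2)}


def encode(smiles, vocab, max_len):
    """Return (ids, mask), truncated or right-padded to max_len."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokenize(smiles)][:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [PAD_ID] * (max_len - len(ids))
    return ids, mask


corpus = ["CC(=O)Nc1ccc(Cl)cc1", "C[C@@H](N)C(=O)O"]
vocab = build_vocab(corpus)
tokens = tokenize(corpus[1])
ids, mask = encode(corpus[1], vocab, max_len=16)
print(tokens)
print(ids)
print(mask)
```

The mask marks which positions hold real tokens, which is the kind of information a masked loss or masked metric would need in order to ignore padding.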