From Petri Dish to Data Science:

ECFP RF baseline

This notebook builds a baseline model that predicts molecule-protein binding with a Random Forest classifier. Its features combine Extended-Connectivity Fingerprints (ECFPs), generated with RDKit to represent molecular structure, with one-hot encoded protein names. Trained on a balanced sample of 60,000 molecules, the model reached a Mean Average Precision (mAP) of 0.96 on a local validation set, a result the author notes is likely due to overfitting. Finally, the trained classifier processes the entire test set of over 1.6 million molecules in chunks and writes an ecfps_submission.csv file for the competition.
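The feature layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebook's code: the protein list (BRD4, HSA, sEH), the helper names one_hot and build_features, and the 8-bit toy fingerprint are all assumptions made for the example. In the notebook the bit vector would come from RDKit rather than being written by hand.

```python
# Assumed protein vocabulary for illustration; the summary does not list the names.
PROTEINS = ["BRD4", "HSA", "sEH"]


def one_hot(protein, vocabulary=PROTEINS):
    """Return a 0/1 list with a single 1 at the protein's position."""
    if protein not in vocabulary:
        raise ValueError(f"unknown protein: {protein}")
    return [1 if name == protein else 0 for name in vocabulary]


def build_features(fingerprint_bits, protein):
    """Concatenate fingerprint bits with the one-hot protein vector."""
    return list(fingerprint_bits) + one_hot(protein)


# Toy 8-bit stand-in for an ECFP bit vector (hypothetical values).
toy_fp = [0, 1, 0, 0, 1, 0, 0, 1]
row = build_features(toy_fp, "HSA")
print(row)  # [0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0]
```

Each row is one molecule-protein pair, so the same molecule appears once per protein with a different one-hot suffix. A tree-based classifier can then split on either the fingerprint bits or the protein indicator.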

TensorFlow transformer with 4 attention heads

This notebook implements a deep learning solution for molecule-protein binding prediction using a custom Transformer architecture named Belka, built in TensorFlow and Keras. Unlike the baseline model, this approach learns representations directly from atom-level tokenized SMILES strings, which pass through custom embedding, positional encoding, and Transformer encoder layers. The framework is modular: it supports different operational modes (e.g., 'clf' for classification) and uses custom components such as a masked, weighted focal loss function (MultiLabelLoss) and a masked AUC metric for training and evaluation. The notebook also details a data processing pipeline that uses dask and tf.data to manage large datasets, performs a validation split based on molecular building blocks, and ultimately generates a submission file from the trained model's predictions.
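To make "atom-level tokenized SMILES" concrete, here is a minimal sketch of one common approach: a regular expression that keeps bracket atoms and two-letter halogens together, followed by a mapping to padded integer ids. The pattern, the function names tokenize and encode, and the choice of 0 as the padding id are assumptions for illustration; the notebook's own tokenizer and vocabulary may differ.

```python
import re

# Assumed token pattern: bracket atoms, Br/Cl, two-digit ring closures,
# single-letter atoms, ring digits, then bond and branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#\-+/\\.:~@*$]"
)


def tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Guard against characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles}")
    return tokens


def encode(tokens, vocab, max_len):
    """Map tokens to integer ids and right-pad with 0 up to max_len."""
    ids = [vocab[tok] for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


tokens = tokenize("CC(=O)Nc1ccc(Br)cc1")
# Ids start at 1 so that 0 stays reserved for padding.
vocab = {tok: i + 1 for i, tok in enumerate(sorted(set(tokens)))}
ids = encode(tokens, vocab, max_len=24)
print(tokens)
print(ids)
```

Reserving id 0 for padding is what allows a masked loss or masked metric to ignore the padded positions, since the embedding and attention layers can treat those positions as absent.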

Predict small molecule-protein interactions using the Big Encoded Library for Chemical Assessment
