Discovery of potential degraders (BacPROTACs) for essential tRNA synthetases in Mycobacterium tuberculosis.
This project is part of the GC-ADDA4TB project, led by Prof. Erick Strauss. The project builds upon the BacPROTACs technology, in which small-molecule bifunctional degraders are designed to harness the proteolytic machinery for targeted protein degradation (TPD) of essential proteins in Mycobacterium tuberculosis (Mtb). This involves linking (a) a ligand that binds to a protein of interest (POI) to (b) a molecule that recruits the mycobacterial protease machinery, such as the ClpC:ClpP complex (ClpCP).
In a CRISPR genetic screen, several tRNA synthetases were identified as highly vulnerable in Mtb. Here, targeting these tRNA synthetases with TPD is proposed as a novel therapeutic strategy.
In this project, the main objective is to prioritise a set of purchasable or easily synthesizable compounds with multi-target properties against the tRNA synthetase family. This will be achieved in three consecutive steps:
- Structural annotation and binding pocket comparison across the tRNA synthetase family.
- Large-scale virtual screening for compounds with strong predicted binding scores across multiple tRNA synthetase binding sites.
- Final shortlisting using additional criteria, such as the ligand's amenability to being extended with a linker to dCymM without disrupting the binding pose.
We recommend creating a Conda environment to run this code:
conda create -n adda4tb python=3.10
conda activate adda4tb
pip install -r requirements.txt
In addition, the following tools are required:
- Open-source PyMOL for protein visualization.
- PDB2PQR for protonation at pH 7.0.
- PyRosetta for protein structure relaxation. PyRosetta can be installed with the PyRosetta Installer.
- P2RANK for pocket detection.
To run P2RANK, Java is required:
conda activate adda4tb
conda install -c conda-forge openjdk
This repository is a work in progress, as summarized in the following progress report meetings:
- Check-in meeting 1 (2025/02/04). Selection and structural annotation of tRNA synthetases.
- Check-in meeting 2 (scheduled for mid-March 2025).
Below, we explain the progress made chronologically.
Based on the results of the CRISPR genetic screen (Bosch et al., 2021; Figure 5), we have selected 21 essential tRNA synthetases. The UniProt AC, name, protein sequence and EC number have been obtained from UniProt (Mtb H37Rv proteome).
We have used the following servers or databases to obtain structural data for the selected tRNA synthetases. To ease the query of some resources, we have generated FASTA files for each tRNA synthetase using the scripts/00_generate_fasta_files.py script.
- PDBe: Experimental structures (when available) can be found in this subfolder. Note that these structures are often presented in multimeric form, and do not necessarily have full sequence coverage.
- AlphaFold2 database: Predicted structures with AF2 were downloaded from the AF2-EBI database and stored in this subfolder. All structures in AF2 had 100% sequence coverage and are monomeric. Only one model was downloaded.
- AlphaFold3 server: We predicted structures with the AF3 server and downloaded them into this subfolder. Five models are available per query.
- Chai-1 server: Likewise, we predicted structures with the Chai-1 server, ticking the MSA option. Results are stored in this subfolder. Five models per query were returned.
- AlphaFill: The AlphaFill resource was used to obtain AF2 structures along with ligands. We used the /scripts/01_download_alphafill.py script to programmatically download the structures and store them in this subfolder.
- Swiss-Model: The Swiss-Model server was used to obtain homology models for each sequence. They can be found in this folder. Note that full coverage is not guaranteed, and that we required a minimum of 80%. A variable number of models per query were returned.
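The FASTA generation step mentioned above can be sketched as follows. This is a minimal illustration, not the contents of scripts/00_generate_fasta_files.py: the function names, the header layout (`>AC name`), the 60-character line width and the one-file-per-accession naming are all assumptions.

```python
from pathlib import Path


def format_fasta(uniprot_ac: str, name: str, sequence: str, width: int = 60) -> str:
    """Return a single-record FASTA string (header layout is an assumption)."""
    # Split the sequence into fixed-width lines, as most FASTA writers do.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{uniprot_ac} {name}\n" + "\n".join(lines) + "\n"


def write_fasta(uniprot_ac: str, name: str, sequence: str, out_dir: Path) -> Path:
    """Write one FASTA file per protein, named after its UniProt AC (hypothetical naming)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{uniprot_ac}.fasta"
    path.write_text(format_fasta(uniprot_ac, name, sequence))
    return path
```

Having one file per protein makes it straightforward to submit each sequence separately to the prediction servers listed above.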
The multiple structure files were organized in the processed data subfolder and stored in both .cif and .pdb formats. This was done with the /scripts/02_organize_structures.py script. This script ensures that only one chain is saved per file, and that sequences are not chunked. Note that we omitted the PDBe files in this automated processing. Moreover, to simplify visualization, we aligned all structures using the /scripts/03_align_structures.py script. Based on RMSD, we removed structures that seemed to be far apart from the rest.
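To illustrate the RMSD-based check described above, here is a minimal sketch using the Kabsch algorithm with NumPy. It is not the method implemented in scripts/03_align_structures.py: it assumes each structure is reduced to an array of matched Cα coordinates of identical length, and both the function names and the `max_rmsd` threshold are hypothetical (the text does not state a cut-off).

```python
import numpy as np


def kabsch_rmsd(P, Q) -> float:
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    # Remove translation by centering both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc @ R.T - Qc
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))


def flag_outliers(structures: dict, reference, max_rmsd: float) -> list:
    """Names of structures whose RMSD to the reference exceeds max_rmsd (assumed threshold)."""
    return [name for name, coords in structures.items()
            if kabsch_rmsd(coords, reference) > max_rmsd]
```

In practice the coordinate arrays would first need to be restricted to residues present in both structures, since models from different sources can differ in coverage.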
Then, we prepared these structures for docking by protonating them with PDB2PQR and relaxing them with PyRosetta, using the scripts/04_relax_structures.py script. This procedure is computationally intensive.
Afterwards, structures were aligned again with the scripts/05_align_relaxed_structures.py script, using their unrelaxed counterparts as references for all alignments. At this stage, no structures were removed, even those with high RMSD against the unrelaxed structures.
We downloaded protein family and domain annotations from InterPro. Files can be found here. Sequence annotation data was processed using the scripts/05_sequence_annotation.py script.
As a first step, we fetched data from ChEMBL using the UniProt AC identifiers. This was done with the scripts/06_fetch_from_chembl.py script. We only found data for 3 of the 21 tRNA synthetases.
An aggregated file containing one row per processed structure is available here. This file contains the following information:
Field | Description |
---|---|
file_name | Name of the processed PDB structure file |
uniprot_ac | UniProt AC identifier |
n_residues | Number of residues |
start_resid | First residue number (first residue is 1) of the sequence available in the PDB file, with respect to the UniProt full sequence |
end_resid | Last residue number (first residue is 1) of the sequence available in the PDB file, with respect to the UniProt full sequence |
coverage | Percentage sequence coverage |
sequence_structure | Sequence found in the PDB file |
full_sequence | Sequence found in UniProt |
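As an illustration of how the `n_residues`, `start_resid`, `end_resid` and `coverage` fields relate to the two sequence fields, here is a minimal sketch. It assumes the structure sequence is one contiguous, exact substring of the UniProt sequence (consistent with the note that sequences are not chunked); the function name is hypothetical and this is not the repository's own implementation.

```python
def sequence_coverage(sequence_structure: str, full_sequence: str) -> dict:
    """Locate the structure sequence within the full sequence (1-based residue numbers)."""
    if not sequence_structure:
        raise ValueError("empty structure sequence")
    offset = full_sequence.find(sequence_structure)
    if offset < 0:
        # Assumption violated: the structure sequence is not a contiguous exact match.
        raise ValueError("structure sequence not found in full sequence")
    n = len(sequence_structure)
    return {
        "n_residues": n,
        "start_resid": offset + 1,  # first residue is 1
        "end_resid": offset + n,
        "coverage": 100.0 * n / len(full_sequence),
    }
```

For example, `sequence_coverage("CDEF", "ABCDEFGHIJ")` gives a start of 3, an end of 6 and a coverage of 40%.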
After completing the structural and sequence characterization of the tRNA synthetases, we detected pockets in the AF2, AF3, Chai-1 and Swiss-Model predicted models. Following the authors' recommendations, for each structure we kept the pockets with a probability (K) > 0.2, and in any case at least the top 3 (N) pockets. We then discarded pockets containing at least one residue with pLDDT < 65 (AF2, AF3, Chai-1) or QSQE < 0.65 (Swiss-Model), which removed about 25-30% of the pockets. These cut-offs are arbitrary; the usual recommendations are 70 and 0.7, so we have been slightly less restrictive.
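The two-stage selection described above can be sketched as follows. This is one reading of the rule, not the repository's code: pockets are kept if their probability exceeds K, and the N highest-probability pockets are always kept; afterwards, any pocket with a residue below the confidence cut-off is dropped. The `Pocket` class and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Pocket:
    rank: int                       # pocket number reported by the detector
    probability: float              # pocket probability
    residue_confidences: list = field(default_factory=list)  # pLDDT or QSQE per residue


def select_pockets(pockets, min_probability=0.2, min_count=3):
    """Keep pockets with probability > K, but never fewer than the top N."""
    ranked = sorted(pockets, key=lambda p: p.probability, reverse=True)
    return [p for i, p in enumerate(ranked)
            if i < min_count or p.probability > min_probability]


def filter_confident(pockets, cutoff):
    """Drop pockets that contain at least one residue below the confidence cutoff."""
    return [p for p in pockets
            if all(c >= cutoff for c in p.residue_confidences)]
```

With this layout, the cut-off would be 65 for pLDDT-scored models and 0.65 for Swiss-Model, so the second function is called with the value matching the prediction type.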
A summary file containing one row per detected pocket and structure is available here. This file contains the following information:
Field | Description |
---|---|
Uniprot_AC | UniProt AC identifier |
File name | PDB file in which pockets have been detected |
Prediction type | The method used for protein structure prediction |
Full path | The full directory path where the PDB file is stored |
Pocket number | The identified pocket number within the structure |
Pocket score | The score assigned to the detected pocket |
Pocket probability | The probability value indicating confidence in pocket detection |
Pocket centroid coordinate (x y z) | The (x, y, z) coordinates of the pocket's centroid |
Pocket residues (chain_resn) | List of residues forming the pocket, with chain and residue number |
B-factors | Confidence measures: pLDDT (AF2, AF3, Chai-1) or QSQE (Swiss-Model) |
PyMOL session files (.pse) have been prepared to facilitate the visualization of detected pockets and their corresponding residues. These were generated using the scripts/09_prepare_pymol_visualizations.py script as step 09 in the pipeline.
Each PyMOL session (one per protein) includes the following elements:
Element | Description | Displayed? |
---|---|---|
Reference structure (AF2) | Wheat color, surface + cartoon representation | ✅ Yes |
Pockets detected in reference structure (AF2) | Sky-blue spheres with arbitrary size (pocket detection provides a single 3D coordinate) | ✅ Yes |
Residues defining each pocket in AF2 | Orange color, surface + cartoon representation | ✅ Yes |
Aligned structures (all but AF2) | Gray color, cartoon representation | ❌ No |
Pockets detected in aligned structures | Gray-colored points | ❌ No |
InterPro annotations | Includes conserved sites, domains, families, homologous superfamilies etc (red color, surface representation) | ❌ No |
This repository is developed by the Ersilia Open Source Initiative. Ersilia develops AI/ML tools to support drug discovery research in the Global South. To learn more about us, please visit our GitBook Documentation and our GitHub profile.