Protein Structure Prediction and Analysis Tutorial
A comprehensive workflow for protein structure prediction using AlphaFold3 and downstream analysis, using Odorant Receptors (ORs) as an example.
This workflow includes structure prediction, protein-ligand docking, and binding affinity analysis.
Table of Contents
1. Protein Structure Prediction
Sequence Preprocessing
Input: species.OR.fa
(FASTA file containing OR protein sequences)
Dependencies: Python 3.11.10, AlphaFold 3.0.0
In the /data/000_af3/species_name
directory, run:
mamba activate alphafold3
python ../Species_devide_af3.py
What this does:
The script Species_devide_af3.py
processes each sequence in the FASTA file by:
- Creating a separate folder for each sequence
- Generating a
fold_input.json
file in each folder for AlphaFold3 input - Creating an
OR_list.txt
file in the main directory listing all OR gene IDs
Source: This script is part of the InsectBase3.0 toolkit. The original script can be found in the scripts/Protein Structure
directory of the InsectBase3.0 repository.
Note for beginners: The FASTA file should contain protein sequences in standard format with headers starting with ">". Each sequence will be processed individually. Make sure your working directory has write permissions.
AlphaFold3 Structure Prediction
In the /data/000_af3/species_name
directory, run:
nohup bash multi_genome.sh > af_aa.log 2>&1 &
What this does:
The multi_genome.sh
script:
- Uses the base directory and
OR_list.txt
specified in the script - Processes each gene ID in the list to calculate AlphaFold3 structures
- Saves the best predicted structure (based on ranking score) as
or_model.cif
in thegene_id/or/
folder
Note: The ranking score is calculated as: 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash
Source: The multi_genome.sh
script is part of the InsectBase3.0 toolkit. It orchestrates batch processing of multiple sequences through AlphaFold3.
For beginners: This is a computationally intensive step that requires significant GPU resources. The nohup
command allows the process to continue running even if your terminal session ends. The output is redirected to af_aa.log
.
Converting to PDB Format
In the /data/000_af3/species_name/gene_id/
directory, run:
mamba activate alphafold3
gemmi convert or/or_model.cif or/best_or_model.pdb
What this does:
Converts the highest-ranked prediction (or_model.cif
) to PDB format (best_or_model.pdb
) using the GEMMI tool, making it compatible with downstream analysis tools like DaliLite.
Important: The converted PDB file does not contain pLDDT information. For quality assessment, you should extract PTM/ranking_score/pLDDT metrics from or_summary_confidences.json
and or_model.cif
.
Source: GEMMI is an open-source library for structural biology. More information: GEMMI GitHub
Removing OXT Information
In the /data/000_af3/species_name/gene_id/
directory, run:
python ../../af3_no_OXT.py
What this does:
The af3_no_OXT.py
script:
- Reads
or/best_or_model.pdb
- Removes OXT atom lines (terminal oxygen atoms that can cause issues in downstream analysis)
- Removes CONECT lines related to OXT
- Updates the MASTER record to reflect the removed residue
- Outputs
or/best_or_model_no_OXT.pdb
Why this is necessary: AlphaFold adds OXT atoms to mark the C-terminal end of peptide chains. These can cause errors in tools like DaliLite.
Source: The af3_no_OXT.py
script is part of the InsectBase3.0 toolkit, specifically designed to prepare AlphaFold3 outputs for compatibility with other structural analysis tools.
Confidence Metrics Analysis
In the /data/000_af3/species_name/
directory, run:
python ../get_species_confidences.py
What this does:
The script:
- Processes each gene ID in
OR_list.txt
- Extracts confidence metrics from
or/or_confidences.json
andor/or_summary_confidences.json
- Calculates statistics (mean, median, max, min) for pLDDT values
- Extracts pTM and ranking score for the best model
- Compiles all metrics into
species_confidences.csv
Source: The get_species_confidences.py
script is part of the InsectBase3.0 toolkit, designed to aggregate quality metrics across multiple structure predictions.
For beginners: These confidence metrics help assess the reliability of the predicted structures. Higher values indicate more reliable predictions.
2. Protein-Ligand Interaction Prediction
Obtaining Ligand Information
What to do:
- Visit PubChem
- Search for your compound of interest
- Note the CID (PubChem Compound ID)
- Copy the SMILES string (simplified molecular-input line-entry system)
Note: The CID will be used for naming the protein-ligand complex models, while the SMILES string serves as input for AlphaFold3.
For beginners: SMILES is a notation that represents chemical structures as text strings. PubChem is a comprehensive database of chemical compounds maintained by the National Center for Biotechnology Information (NCBI).
Protein-Ligand Complex Prediction
2.2.1 Sequence Preprocessing for Docking
In the /data/002_af3_docking/species_name
directory, run:
mamba activate alphafold3
python ../Species_devide_af3_docking.py "97697" "C1CN1CCN"
What this does:
The script:
- Processes each sequence in
species.OR.fa
- Creates a folder for each gene ID
- Within each gene folder, creates a subfolder named after the CID
- Generates
fold_input.json
files with protein ID "A" and ligand ID "B" - Names the project as
or_CID
(e.g.,or_97697
) - Creates an
OR_list.txt
file listing all OR gene IDs
Source: The Species_devide_af3_docking.py
script is part of the InsectBase3.0 toolkit, extending the basic preprocessing to include ligand information.
For beginners: Replace "97697" with your compound's CID and "C1CN1CCN" with your compound's SMILES string.
2.2.2 AlphaFold3 Structure Prediction for Complexes
In the /data/002_af3_docking/
directory, run:
nohup bash multi_genome.sh > Ffus_97697.log 2>&1 &
What this does:
The script:
- Uses the base directory and genome list specified in the script
- For each genome, reads the
OR_list.txt
- For each OR gene, calculates the OR-ligand complex structure
- Saves the best predicted structure as
or_CID_model.cif
in thegene_id/CID/or_CID/
folder
Source: The multi_genome.sh
script is adapted for protein-ligand complex prediction in the InsectBase3.0 toolkit.
Confidence Analysis of Predicted Structures
2.3.1 Calculating Confidence Metrics
In the /data/002_af3_docking/species_name
directory, run:
mamba activate alphafold3
python ../get_species_confidences_docking.py
What this does:
The script:
- Uses
OR_list.txt
andVOC_CID_list_top100.txt
to process all OR-VOC predictions - Extracts confidence metrics from JSON files in each prediction folder
- Calculates statistics for pLDDT values
- Extracts pTM, ipTM, chain_pair_pae_min, and ranking score
- Compiles all metrics into
species_confidences.csv
Source: The get_species_confidences_docking.py
script is part of the InsectBase3.0 toolkit, designed to aggregate quality metrics for protein-ligand complexes.
2.3.2 Filtering Prediction Results
What this does:
- Filters results based on quality criteria (pLDDT and pTM )
- Saves filtered results to
species_confidences_ptm0.7_plddt75.csv
- Further filters VOCs based on chemical properties and occurrence frequency
- Saves the CIDs of the top 58 VOCs to
VOC_CID_list_top58_20250320.txt
To filter results based on the top 58 VOCs, run:
python get_TOP58_confidences.py
This saves the filtered results to species_confidences_ptm0.7_plddt75_TOP58.csv
.
Source: The filtering criteria and scripts are part of the InsectBase3.0 analysis pipeline.
For beginners: These filtering steps help focus on the most reliable predictions and most relevant compounds.
3. Binding Affinity Analysis
Preparing Receptor and Ligand Files
In the /data/002_af3_docking/species_name/gene_id/CID/or_CID/
directory, run:
mamba activate vina
# Convert mmCIF to PDB format
gemmi convert or_${CID}_model.cif best_or_model.pdb
# Extract ligand and convert to PDBQT
grep HETATM best_or_model.pdb > ligand.pdb
obabel ligand.pdb -O ligand.pdbqt
# Extract receptor and prepare for docking
grep ATOM best_or_model.pdb > receptor.pdb
/data/home/chenhao/software/ADFRsuite-1.0/ADFRsuite_x86_64Linux_1.0/bin/prepare_receptor -r receptor.pdb -o receptor.pdbqt
What this does:
- Converts the complex structure to PDB format
- Extracts the ligand (HETATM records) and converts it to PDBQT format
- Extracts the protein (ATOM records) and prepares it as a receptor in PDBQT format
Source: This workflow uses standard tools in molecular docking:
- GEMMI for file conversion
- Open Babel (obabel) for ligand preparation
- AutoDock for receptor preparation
For beginners: PDBQT format adds charge and atom type information needed for docking calculations.
Determining Grid Parameters
In the same directory, run:
python ../../../../get_grid_params.py ligand.pdb 5.0
What this does:
Calculates the center coordinates and dimensions of the grid box for docking, based on the ligand position with a 5.0 Å buffer.
Source: The get_grid_params.py
script is part of the InsectBase3.0 toolkit, designed to automate grid box definition for docking.
Calculating Binding Affinity
In the same directory, run (using the grid parameters from the previous step):
vina --receptor receptor.pdbqt \
--ligand ligand.pdbqt \
--score_only \
--center_x -9.873 --center_y 0.827 --center_z -0.985 \
--size_x 12.979 --size_y 12.186 --size_z 13.245
What this does:
Uses AutoDock Vina to calculate the binding affinity (in kcal/mol) between the receptor and ligand.
Source: AutoDock Vina is an open-source molecular docking program. More information: AutoDock Vina
For beginners: Lower (more negative) binding affinity values indicate stronger predicted binding.
Summarizing Binding Affinity Data
To automate the entire process and compile results, in the /data/002_af3_docking/species_name
directory, run:
nohup bash multi_get_vina_score.sh 2>&1 &
What this does:
Automates all steps in section 3.1-3.3 and compiles the results into vina_score.txt
.
Then, to summarize the results for the top compounds:
nohup bash multi_get_score.sh 2>&1 &
This compiles binding affinities from vina_score.txt
based on the OR-VOC pairs in species_confidences_ptm0.7_plddt75_TOP16.csv
and saves the results to vina_score_all_TOP16.csv
.
Source: The multi_get_vina_score.sh
and multi_get_score.sh
scripts are part of the InsectBase3.0 toolkit, designed to automate batch processing of docking calculations.
Visualizing Binding Affinity
To visualize the binding affinity data as a heatmap, use the Convert_vina_score_list_into_matrix.py
script:
python Convert_vina_score_list_into_matrix.py
What this does:
Converts the list-format binding affinity data into a matrix format suitable for heatmap visualization.
Source: The Convert_vina_score_list_into_matrix.py
script is part of the InsectBase3.0 toolkit.
For beginners: The resulting matrix can be visualized using tools like Python's matplotlib, R's ggplot2, or online visualization platforms.
Additional Resources
Citation
If you use this workflow in your research, please cite:
- InsectBase3.0 (Citation information to be added)
- AlphaFold 3.0: Abramson, J., Ahdritz, G., Akdel, M. et al. Accurate structure prediction of biomolecular interactions. Nature (2024). https://doi.org/10.1038/s41586-024-07487-w
- AutoDock Vina: Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31(2):455-461. doi:10.1002/jcc.21334
License
This tutorial and associated scripts are provided under the terms of the license of the InsectBase3.0 project.
This tutorial was created based on the workflow developed for InsectBase3.0. Last updated: June 2024.