Protein Structure Prediction and Analysis Tutorial

A comprehensive workflow for protein structure prediction using AlphaFold3 and downstream analysis, using Odorant Receptors (ORs) as an example.

This workflow includes structure prediction, protein-ligand docking, and binding affinity analysis.

Table of Contents

1. Protein Structure Prediction

1.1

Sequence Preprocessing

Input: species.OR.fa (FASTA file containing OR protein sequences)
Dependencies: Python 3.11.10, AlphaFold 3.0.0

In the /data/000_af3/species_name directory, run:

mamba activate alphafold3
python ../Species_devide_af3.py

What this does:

The script Species_devide_af3.py processes each sequence in the FASTA file by:

  1. Creating a separate folder for each sequence
  2. Generating a fold_input.json file in each folder for AlphaFold3 input
  3. Creating an OR_list.txt file in the main directory listing all OR gene IDs

Source: This script is part of the InsectBase3.0 toolkit. The original script can be found in the scripts/Protein Structure directory of the InsectBase3.0 repository.

Note for beginners: The FASTA file should contain protein sequences in standard format with headers starting with ">". Each sequence will be processed individually. Make sure your working directory has write permissions.

1.2

AlphaFold3 Structure Prediction

In the /data/000_af3/species_name directory, run:

nohup bash multi_genome.sh > af_aa.log 2>&1 &

What this does:

The multi_genome.sh script:

  1. Uses the base directory and OR_list.txt specified in the script
  2. Processes each gene ID in the list to calculate AlphaFold3 structures
  3. Saves the best predicted structure (based on ranking score) as or_model.cif in the gene_id/or/ folder

Note: The ranking score is calculated as: 0.8*ipTM + 0.2*pTM + 0.5*disorder - 100*has_clash

Source: The multi_genome.sh script is part of the InsectBase3.0 toolkit. It orchestrates batch processing of multiple sequences through AlphaFold3.

For beginners: This is a computationally intensive step that requires significant GPU resources. The nohup command allows the process to continue running even if your terminal session ends. The output is redirected to af_aa.log.

1.3

Converting to PDB Format

In the /data/000_af3/species_name/gene_id/ directory, run:

mamba activate alphafold3
gemmi convert or/or_model.cif or/best_or_model.pdb

What this does:

Converts the highest-ranked prediction (or_model.cif) to PDB format (best_or_model.pdb) using the GEMMI tool, making it compatible with downstream analysis tools like DaliLite.

Important: The converted PDB file does not contain pLDDT information. For quality assessment, you should extract PTM/ranking_score/pLDDT metrics from or_summary_confidences.json and or_model.cif.

Source: GEMMI is an open-source library for structural biology. More information: GEMMI GitHub

1.4

Removing OXT Information

In the /data/000_af3/species_name/gene_id/ directory, run:

python ../../af3_no_OXT.py

What this does:

The af3_no_OXT.py script:

  1. Reads or/best_or_model.pdb
  2. Removes OXT atom lines (terminal oxygen atoms that can cause issues in downstream analysis)
  3. Removes CONECT lines related to OXT
  4. Updates the MASTER record to reflect the removed residue
  5. Outputs or/best_or_model_no_OXT.pdb

Why this is necessary: AlphaFold adds OXT atoms to mark the C-terminal end of peptide chains. These can cause errors in tools like DaliLite.

Source: The af3_no_OXT.py script is part of the InsectBase3.0 toolkit, specifically designed to prepare AlphaFold3 outputs for compatibility with other structural analysis tools.

1.5

Confidence Metrics Analysis

In the /data/000_af3/species_name/ directory, run:

python ../get_species_confidences.py

What this does:

The script:

  1. Processes each gene ID in OR_list.txt
  2. Extracts confidence metrics from or/or_confidences.json and or/or_summary_confidences.json
  3. Calculates statistics (mean, median, max, min) for pLDDT values
  4. Extracts pTM and ranking score for the best model
  5. Compiles all metrics into species_confidences.csv

Source: The get_species_confidences.py script is part of the InsectBase3.0 toolkit, designed to aggregate quality metrics across multiple structure predictions.

For beginners: These confidence metrics help assess the reliability of the predicted structures. Higher values indicate more reliable predictions.

2. Protein-Ligand Interaction Prediction

2.1

Obtaining Ligand Information

What to do:

  1. Visit PubChem
  2. Search for your compound of interest
  3. Note the CID (PubChem Compound ID)
  4. Copy the SMILES string (simplified molecular-input line-entry system)

Note: The CID will be used for naming the protein-ligand complex models, while the SMILES string serves as input for AlphaFold3.

For beginners: SMILES is a notation that represents chemical structures as text strings. PubChem is a comprehensive database of chemical compounds maintained by the National Center for Biotechnology Information (NCBI).

2.2

Protein-Ligand Complex Prediction

2.2.1 Sequence Preprocessing for Docking

In the /data/002_af3_docking/species_name directory, run:

mamba activate alphafold3
python ../Species_devide_af3_docking.py "97697" "C1CN1CCN"

What this does:

The script:

  1. Processes each sequence in species.OR.fa
  2. Creates a folder for each gene ID
  3. Within each gene folder, creates a subfolder named after the CID
  4. Generates fold_input.json files with protein ID "A" and ligand ID "B"
  5. Names the project as or_CID (e.g., or_97697)
  6. Creates an OR_list.txt file listing all OR gene IDs

Source: The Species_devide_af3_docking.py script is part of the InsectBase3.0 toolkit, extending the basic preprocessing to include ligand information.

For beginners: Replace "97697" with your compound's CID and "C1CN1CCN" with your compound's SMILES string.

2.2.2 AlphaFold3 Structure Prediction for Complexes

In the /data/002_af3_docking/ directory, run:

nohup bash multi_genome.sh > Ffus_97697.log 2>&1 &

What this does:

The script:

  1. Uses the base directory and genome list specified in the script
  2. For each genome, reads the OR_list.txt
  3. For each OR gene, calculates the OR-ligand complex structure
  4. Saves the best predicted structure as or_CID_model.cif in the gene_id/CID/or_CID/ folder

Source: The multi_genome.sh script is adapted for protein-ligand complex prediction in the InsectBase3.0 toolkit.

2.3

Confidence Analysis of Predicted Structures

2.3.1 Calculating Confidence Metrics

In the /data/002_af3_docking/species_name directory, run:

mamba activate alphafold3
python ../get_species_confidences_docking.py

What this does:

The script:

  1. Uses OR_list.txt and VOC_CID_list_top100.txt to process all OR-VOC predictions
  2. Extracts confidence metrics from JSON files in each prediction folder
  3. Calculates statistics for pLDDT values
  4. Extracts pTM, ipTM, chain_pair_pae_min, and ranking score
  5. Compiles all metrics into species_confidences.csv

Source: The get_species_confidences_docking.py script is part of the InsectBase3.0 toolkit, designed to aggregate quality metrics for protein-ligand complexes.

2.3.2 Filtering Prediction Results

What this does:

  1. Filters results based on quality criteria (pLDDT and pTM )
  2. Saves filtered results to species_confidences_ptm0.7_plddt75.csv
  3. Further filters VOCs based on chemical properties and occurrence frequency
  4. Saves the CIDs of the top 58 VOCs to VOC_CID_list_top58_20250320.txt

To filter results based on the top 58 VOCs, run:

python get_TOP58_confidences.py

This saves the filtered results to species_confidences_ptm0.7_plddt75_TOP58.csv.

Source: The filtering criteria and scripts are part of the InsectBase3.0 analysis pipeline.

For beginners: These filtering steps help focus on the most reliable predictions and most relevant compounds.

3. Binding Affinity Analysis

3.1

Preparing Receptor and Ligand Files

In the /data/002_af3_docking/species_name/gene_id/CID/or_CID/ directory, run:

mamba activate vina

# Convert mmCIF to PDB format
gemmi convert or_${CID}_model.cif best_or_model.pdb

# Extract ligand and convert to PDBQT
grep HETATM best_or_model.pdb > ligand.pdb
obabel ligand.pdb -O ligand.pdbqt

# Extract receptor and prepare for docking
grep ATOM best_or_model.pdb > receptor.pdb
/data/home/chenhao/software/ADFRsuite-1.0/ADFRsuite_x86_64Linux_1.0/bin/prepare_receptor -r receptor.pdb -o receptor.pdbqt

What this does:

  1. Converts the complex structure to PDB format
  2. Extracts the ligand (HETATM records) and converts it to PDBQT format
  3. Extracts the protein (ATOM records) and prepares it as a receptor in PDBQT format

Source: This workflow uses standard tools in molecular docking:

  • GEMMI for file conversion
  • Open Babel (obabel) for ligand preparation
  • AutoDock for receptor preparation

For beginners: PDBQT format adds charge and atom type information needed for docking calculations.

3.2

Determining Grid Parameters

In the same directory, run:

python ../../../../get_grid_params.py ligand.pdb 5.0

What this does:

Calculates the center coordinates and dimensions of the grid box for docking, based on the ligand position with a 5.0 Å buffer.

Source: The get_grid_params.py script is part of the InsectBase3.0 toolkit, designed to automate grid box definition for docking.

3.3

Calculating Binding Affinity

In the same directory, run (using the grid parameters from the previous step):

vina --receptor receptor.pdbqt \
     --ligand ligand.pdbqt \
     --score_only \
     --center_x -9.873 --center_y 0.827 --center_z -0.985 \
     --size_x 12.979 --size_y 12.186 --size_z 13.245

What this does:

Uses AutoDock Vina to calculate the binding affinity (in kcal/mol) between the receptor and ligand.

Source: AutoDock Vina is an open-source molecular docking program. More information: AutoDock Vina

For beginners: Lower (more negative) binding affinity values indicate stronger predicted binding.

3.4

Summarizing Binding Affinity Data

To automate the entire process and compile results, in the /data/002_af3_docking/species_name directory, run:

nohup bash multi_get_vina_score.sh 2>&1 &

What this does:

Automates all steps in section 3.1-3.3 and compiles the results into vina_score.txt.

Then, to summarize the results for the top compounds:

nohup bash multi_get_score.sh 2>&1 &

This compiles binding affinities from vina_score.txt based on the OR-VOC pairs in species_confidences_ptm0.7_plddt75_TOP16.csv and saves the results to vina_score_all_TOP16.csv.

Source: The multi_get_vina_score.sh and multi_get_score.sh scripts are part of the InsectBase3.0 toolkit, designed to automate batch processing of docking calculations.

3.5

Visualizing Binding Affinity

To visualize the binding affinity data as a heatmap, use the Convert_vina_score_list_into_matrix.py script:

python Convert_vina_score_list_into_matrix.py

What this does:

Converts the list-format binding affinity data into a matrix format suitable for heatmap visualization.

Source: The Convert_vina_score_list_into_matrix.py script is part of the InsectBase3.0 toolkit.

For beginners: The resulting matrix can be visualized using tools like Python's matplotlib, R's ggplot2, or online visualization platforms.

Additional Resources

Citation

If you use this workflow in your research, please cite:

  • InsectBase3.0 (Citation information to be added)
  • AlphaFold 3.0: Abramson, J., Ahdritz, G., Akdel, M. et al. Accurate structure prediction of biomolecular interactions. Nature (2024). https://doi.org/10.1038/s41586-024-07487-w
  • AutoDock Vina: Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31(2):455-461. doi:10.1002/jcc.21334

License

This tutorial and associated scripts are provided under the terms of the license of the InsectBase3.0 project.

This tutorial was created based on the workflow developed for InsectBase3.0. Last updated: June 2024.