AlphaFold
AlphaFold 3
AlphaFold 3 Apptainer image file
AlphaFold 3 Apptainer image file:
/hpc/share/containers/apptainer/alphafold/3.0.1/alphafold-3.0.1.sif
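To check that the image is available, load the modules and verify the path (a minimal sketch; the alphafold/3.0.1 module is assumed to set the ALPHAFOLD_CONTAINER variable, as the job scripts below expect):

module load apptainer
module load alphafold/3.0.1

# ALPHAFOLD_CONTAINER should point to the image file above
echo "$ALPHAFOLD_CONTAINER"
test -f "$ALPHAFOLD_CONTAINER" && echo "container image found"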
AlphaFold 3 GPU demo
Copy the example input and the job script into a demo folder, then submit the job:
mkdir -p demo/af_input
cp -p /hpc/share/containers/apptainer/alphafold/3/af_input/fold_input.json demo/af_input
cp -p /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu-a100_40g.sh demo
cd demo
sbatch slurm-alphafold-gpu-a100_40g.sh
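After submission, the demo job can be followed with standard SLURM commands (a minimal sketch; the directory and file names follow the output pattern of the job script below, where <jobid> is the SLURM job ID):

# Show your pending and running jobs
squeue -u $USER

# Results are written under af_output/alphafold.d<jobid>; follow the job log
ls -l af_output/
tail -f af_output/alphafold.d<jobid>/alphafold.o<jobid>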
AlphaFold 3 GPU job
Download the AlphaFold 3 input file fold_input.json and save it in the af_input folder:
- fold_input.json
{ "name": "2PV7", "sequences": [ { "protein": { "id": ["A", "B"], "sequence": "GMRESYANENQFGFKTINSDIHKIVIVGGYGKLGGLFARYLRASGYPISILDREDWAVAESILANADVVIVSVPINLTLETIERLKPYLTENMLLADLTSVKREPLAKMLEVHTGAVLGLHPMFGADIASMAKQVVVRCDGRFPERYEWLLEQIQIWGAKIYQTNATEHDHNMTYIQALRHFSTFANGLHLSKQPINLANLLALSSPIYRLELAMIGRLFAQDAELYADIIMDKSENLAVIETLKQTYDEALTFFENNDRQGFIDAFHKVRDWFGDYSEQFLKESRQLLQQANDLKQG" } } ], "modelSeeds": [1], "dialect": "alphafold3", "version": 1 }
Script slurm-alphafold-gpu-a100_40g.sh to run AlphaFold 3 on 1 node with 1 A100 (40 GB) GPU (1 task per node, 8 CPUs per task):
- slurm-alphafold-gpu-a100_40g.sh
#!/bin/bash --login

#SBATCH --job-name=alphafold
#SBATCH --output=af_output/%x.d%j/%x.o%j
#SBATCH --error=af_output/%x.d%j/%x.e%j
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=0-02:00:00
#SBATCH --mem=10G
#SBATCH --partition=gpu
#SBATCH --qos=gpu
#SBATCH --gres=gpu:a100_40g:1
##SBATCH --account=<account>

shopt -q login_shell || exit 1
test -n "$SLURM_NODELIST" || exit 1
test $SLURM_NNODES -eq 1 || exit 1

module load apptainer
module load alphafold/3.0.1

test -n "$ALPHAFOLD_CONTAINER" || exit 1

set -x

ALPHAFOLD_JSON_INPUT_FILE='fold_input.json'
ALPHAFOLD_INPUT_DIR="$PWD/af_input"
ALPHAFOLD_OUTPUT_DIR="$PWD/af_output/${SLURM_JOB_NAME}.d${SLURM_JOB_ID}"

mkdir -p "$ALPHAFOLD_OUTPUT_DIR"

apptainer exec \
    --nv \
    --bind "$ALPHAFOLD_INPUT_DIR:/root/af_input" \
    --bind "$ALPHAFOLD_OUTPUT_DIR:/root/af_output" \
    "$ALPHAFOLD_CONTAINER" \
    python /app/alphafold/run_alphafold.py \
    --json_path="/root/af_input/$ALPHAFOLD_JSON_INPUT_FILE" \
    --model_dir=/root/models \
    --db_dir=/root/public_databases \
    --db_dir=/root/public_databases_fallback \
    --output_dir=/root/af_output
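To run the same script on your own input, copy it into the directory that contains your af_input folder, adjust ALPHAFOLD_JSON_INPUT_FILE if your JSON file has a different name, and submit it. The account can also be given on the command line instead of uncommenting the #SBATCH --account line (a minimal sketch):

# Copy the script next to af_input/ and submit it, optionally charging a specific account
cp -p /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu-a100_40g.sh .
sbatch --account=<account> slurm-alphafold-gpu-a100_40g.sh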
The results will be saved in the af_output folder.
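For example, to inspect the output of a finished job (a minimal sketch; predicted structures are typically written as mmCIF files):

# One subdirectory per job: af_output/<job-name>.d<job-id>
ls -l af_output/

# List the predicted structures (mmCIF format)
find af_output -name '*.cif'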
Scripts for specific NVIDIA GPU models to run AlphaFold 3 on 1 node with 1 GPU (1 task per node, 8 CPUs per task):
GPU | Path |
---|---|
NVIDIA P100 (12 GB) | /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu-p100.sh |
NVIDIA V100 (32 GB) | /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu_guest-v100_hylab.sh |
NVIDIA A100 (40 GB) | /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu-a100_40g.sh |
NVIDIA A100 (80 GB) | /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu-a100_80g.sh |
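The workflow is the same for each of these scripts: copy the script into the directory that contains af_input and submit it with sbatch (a minimal sketch using the V100 script; partition, QOS and GPU resources are set inside each script):

# Example: run the same input on a V100 GPU
cp -p /hpc/share/containers/apptainer/alphafold/3.0.1/slurm-alphafold-gpu_guest-v100_hylab.sh .
sbatch slurm-alphafold-gpu_guest-v100_hylab.sh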
Documentation
How to get a list of all flags of run_alphafold.py (version 3.0.1):
module load apptainer
module load alphafold/3.0.1

apptainer exec "$ALPHAFOLD_CONTAINER" python /app/alphafold/run_alphafold.py --helpfull
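The full listing is long; to look up a single flag, the output can be filtered (a minimal sketch):

# Show only the help text for --num_recycles (plus two lines of context)
apptainer exec "$ALPHAFOLD_CONTAINER" python /app/alphafold/run_alphafold.py --helpfull 2>&1 | grep -A 2 -- '--num_recycles'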
List of all flags of run_alphafold.py (version 3.0.1):
AlphaFold 3 structure prediction script.

AlphaFold 3 source code is licensed under CC BY-NC-SA 4.0. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/

To request access to the AlphaFold 3 model parameters, follow the process set out at https://github.com/google-deepmind/alphafold3. You may only use these if received directly from Google. Use is subject to terms of use available at https://github.com/google-deepmind/alphafold3/blob/main/WEIGHTS_TERMS_OF_USE.md

flags:

run_alphafold.py:
  --buckets: Strictly increasing order of token sizes for which to cache compilations. For any input with more tokens than the largest bucket size, a new bucket is created for exactly that number of tokens. (default: '256,512,768,1024,1280,1536,2048,2560,3072,3584,4096,4608,5120') (a comma separated list)
  --conformer_max_iterations: Optional override for maximum number of iterations to run for RDKit conformer search. (an integer)
  --db_dir: Path to the directory containing the databases. Can be specified multiple times to search multiple directories in order.; repeat this option to specify a list of values (default: "['/hpc/home/sti_calcolo/public_databases']")
  --flash_attention_implementation: <triton|cudnn|xla>: Flash attention implementation to use. 'triton' and 'cudnn' uses a Triton and cuDNN flash attention implementation, respectively. The Triton kernel is fastest and has been tested more thoroughly. The Triton and cuDNN kernels require Ampere GPUs or later. 'xla' uses an XLA attention implementation (no flash attention) and is portable across GPU devices. (default: 'triton')
  --gpu_device: Optional override for the GPU device to use for inference. Defaults to the 1st GPU on the system. Useful on multi-GPU systems to pin each run to a specific GPU. (default: '0') (an integer)
  --hmmalign_binary_path: Path to the Hmmalign binary. (default: '/hmmer/bin/hmmalign')
  --hmmbuild_binary_path: Path to the Hmmbuild binary. (default: '/hmmer/bin/hmmbuild')
  --hmmsearch_binary_path: Path to the Hmmsearch binary. (default: '/hmmer/bin/hmmsearch')
  --input_dir: Path to the directory containing input JSON files.
  --jackhmmer_binary_path: Path to the Jackhmmer binary. (default: '/hmmer/bin/jackhmmer')
  --jackhmmer_n_cpu: Number of CPUs to use for Jackhmmer. Default to min(cpu_count, 8). Going beyond 8 CPUs provides very little additional speedup. (default: '8') (an integer)
  --jax_compilation_cache_dir: Path to a directory for the JAX compilation cache.
  --json_path: Path to the input JSON file.
  --max_template_date: Maximum template release date to consider. Format: YYYY-MM-DD. All templates released after this date will be ignored. (default: '2021-09-30')
  --mgnify_database_path: Mgnify database path, used for protein MSA search. (default: '${DB_DIR}/mgy_clusters_2022_05.fa')
  --model_dir: Path to the model to use for inference. (default: '/hpc/home/sti_calcolo/models')
  --nhmmer_binary_path: Path to the Nhmmer binary. (default: '/hmmer/bin/nhmmer')
  --nhmmer_n_cpu: Number of CPUs to use for Nhmmer. Default to min(cpu_count, 8). Going beyond 8 CPUs provides very little additional speedup. (default: '8') (an integer)
  --ntrna_database_path: NT-RNA database path, used for RNA MSA search. (default: '${DB_DIR}/nt_rna_2023_02_23_clust_seq_id_90_cov_80_rep_seq.fasta')
  --num_diffusion_samples: Number of diffusion samples to generate. (default: '5') (a positive integer)
  --num_recycles: Number of recycles to use during inference. (default: '10') (a positive integer)
  --num_seeds: Number of seeds to use for inference. If set, only a single seed must be provided in the input JSON. AlphaFold 3 will then generate random seeds in sequence, starting from the single seed specified in the input JSON. The full input JSON produced by AlphaFold 3 will include the generated random seeds. If not set, AlphaFold 3 will use the seeds as provided in the input JSON. (a positive integer)
  --output_dir: Path to a directory where the results will be saved.
  --pdb_database_path: PDB database directory with mmCIF files path, used for template search. (default: '${DB_DIR}/mmcif_files')
  --rfam_database_path: Rfam database path, used for RNA MSA search. (default: '${DB_DIR}/rfam_14_9_clust_seq_id_90_cov_80_rep_seq.fasta')
  --rna_central_database_path: RNAcentral database path, used for RNA MSA search. (default: '${DB_DIR}/rnacentral_active_seq_id_90_cov_80_linclust.fasta')
  --[no]run_data_pipeline: Whether to run the data pipeline on the fold inputs. (default: 'true')
  --[no]run_inference: Whether to run inference on the fold inputs. (default: 'true')
  --[no]save_embeddings: Whether to save the final trunk single and pair embeddings in the output. (default: 'false')
  --seqres_database_path: PDB sequence database path, used for template search. (default: '${DB_DIR}/pdb_seqres_2022_09_28.fasta')
  --small_bfd_database_path: Small BFD database path, used for protein MSA search. (default: '${DB_DIR}/bfd-first_non_consensus_sequences.fasta')
  --uniprot_cluster_annot_database_path: UniProt database path, used for protein paired MSA search. (default: '${DB_DIR}/uniprot_all_2021_04.fa')
  --uniref90_database_path: UniRef90 database path, used for MSA search. The MSA obtained by searching it is used to construct the profile for template search. (default: '${DB_DIR}/uniref90_2022_05.fa')

absl.app:
  -?,--[no]help: show this help (default: 'false')
  --[no]helpfull: show full help (default: 'false')
  --[no]helpshort: show this help (default: 'false')
  --[no]helpxml: like --helpfull, but generates XML output (default: 'false')
  --[no]only_check_args: Set to true to validate args and exit. (default: 'false')
  --[no]pdb: Alias for --pdb_post_mortem. (default: 'false')
  --[no]pdb_post_mortem: Set to true to handle uncaught exceptions with PDB post mortem. (default: 'false')
  --profile_file: Dump profile information to a file (for python -m pstats). Implies --run_with_profiling.
  --[no]run_with_pdb: Set to true for PDB debug mode (default: 'false')
  --[no]run_with_profiling: Set to true for profiling the script. Execution will be slower, and the output format might change over time. (default: 'false')
  --[no]use_cprofile_for_profiling: Use cProfile instead of the profile module for profiling. This has no effect unless --run_with_profiling is set. (default: 'true')

absl.logging:
  --[no]alsologtostderr: also log to stderr? (default: 'false')
  --log_dir: directory to write logfiles into (default: '')
  --logger_levels: Specify log level of loggers. The format is a CSV list of `name:level`. Where `name` is the logger name used with `logging.getLogger()`, and `level` is a level name (INFO, DEBUG, etc). e.g. `myapp.foo:INFO,other.logger:DEBUG` (default: '')
  --[no]logtostderr: Should only log to stderr? (default: 'false')
  --[no]showprefixforinfo: If False, do not prepend prefix to info messages when it's logged to stderr, --verbosity is set to INFO level, and python logging is used. (default: 'true')
  --stderrthreshold: log messages at this level, or more severe, to stderr in addition to the logfile. Possible values are 'debug', 'info', 'warning', 'error', and 'fatal'. Obsoletes --alsologtostderr. Using --alsologtostderr cancels the effect of this flag. Please also note that this flag is subject to --verbosity and requires logfile not be stderr. (default: 'fatal')
  -v,--verbosity: Logging verbosity level. Messages logged at this level or lower will be included. Set to 1 for debug logging. If the flag was not set or supplied, the value will be changed from the default of -1 (warning) to 0 (info) after flags are parsed. (default: '-1') (an integer)

absl.testing.absltest:
  --test_random_seed: Random seed for testing. Some test frameworks may change the default value of this flag between runs, so it is not appropriate for seeding probabilistic tests. (default: '301') (an integer)
  --test_randomize_ordering_seed: If positive, use this as a seed to randomize the execution order for test cases. If "random", pick a random seed to use. If 0 or not set, do not randomize test case execution order. This flag also overrides the TEST_RANDOMIZE_ORDERING_SEED environment variable. (default: '')
  --test_srcdir: Root of directory tree where source files live (default: '')
  --test_tmpdir: Directory for temporary testing files (default: '/tmp/absl_testing')
  --xml_output_file: File to store XML test results (default: '')

chex._src.fake:
  --[no]chex_assert_multiple_cpu_devices: Whether to fail if a number of CPU devices is less than 2. (default: 'false')
  --chex_n_cpu_devices: Number of CPU threads to use as devices in tests. (default: '1') (an integer)

chex._src.variants:
  --[no]chex_skip_pmap_variant_if_single_device: Whether to skip pmap variant if only one device is available. (default: 'true')

absl.flags:
  --flagfile: Insert flag definitions from the given file into the command line. (default: '')
  --undefok: comma-separated list of flag names that it is okay to specify on the command line even if the program does not define a flag with that name. IMPORTANT: flags in this list that have arguments MUST use the --flag=value format. (default: '')
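Any of these flags can be appended to the run_alphafold.py command line in the SLURM scripts above. A minimal sketch, based on the A100 script, that increases the number of diffusion samples and saves the trunk embeddings (the values are example choices, not recommendations):

apptainer exec \
    --nv \
    --bind "$ALPHAFOLD_INPUT_DIR:/root/af_input" \
    --bind "$ALPHAFOLD_OUTPUT_DIR:/root/af_output" \
    "$ALPHAFOLD_CONTAINER" \
    python /app/alphafold/run_alphafold.py \
    --json_path="/root/af_input/$ALPHAFOLD_JSON_INPUT_FILE" \
    --model_dir=/root/models \
    --db_dir=/root/public_databases \
    --db_dir=/root/public_databases_fallback \
    --output_dir=/root/af_output \
    --num_diffusion_samples=10 \
    --save_embeddings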