Protein Structure Prediction

The progression of protein structure prediction from Levinthal's paradox (1969) to AlphaFold2 (Jumper et al., 2021, Nature) is a model case of computational biology meeting deep learning.

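Theoretical Basis of Coevolution Analysis
The statistical foundation of structural contact prediction is Direct Coupling Analysis (DCA; Morcos et al., 2011). Traditional mutual information is confounded by indirect (transitive) correlations: if residue A contacts B and B contacts C, then A and C also appear correlated in the MSA even when they are not in contact themselves. DCA fits a maximum entropy model (an inverse Potts model) to the global statistics of the MSA and estimates the direct coupling strength J_ij between each residue pair. EVcouplings and GREMLIN use the Frobenius norm ||J_ij||_F of each coupling matrix as the contact prediction score.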
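The scoring step can be sketched as follows. This is a minimal illustration, not the EVcouplings or GREMLIN code: it assumes the couplings have already been inferred and stored in a hypothetical array `J` of shape (L, L, q, q), and it adds the average product correction (APC) that is commonly applied on top of the Frobenius norm. The function names and the toy data are made up for the example.

```python
import numpy as np

def frobenius_contact_scores(J, apc=True):
    """Score residue pairs from a coupling tensor J of shape (L, L, q, q).

    Assumes J is already in a suitable gauge (e.g. zero-sum); inference of
    J from the MSA is not shown here.
    """
    L = J.shape[0]
    # Frobenius norm of each q x q coupling block J_ij.
    F = np.sqrt((J ** 2).sum(axis=(2, 3)))
    F = 0.5 * (F + F.T)           # enforce symmetry
    np.fill_diagonal(F, 0.0)      # a residue has no contact with itself
    if not apc:
        return F
    # Average product correction (an addition to the plain norm):
    # subtract the background expected from each position's mean score.
    row_mean = F.sum(axis=1) / (L - 1)
    total_mean = F.sum() / (L * (L - 1))
    S = F - np.outer(row_mean, row_mean) / total_mean
    np.fill_diagonal(S, 0.0)
    return S

def top_contacts(S, k, min_separation=6):
    """Return the k highest-scoring pairs (i, j) with j - i >= min_separation."""
    i, j = np.triu_indices(S.shape[0], k=min_separation)
    order = np.argsort(S[i, j])[::-1][:k]
    return [(int(i[n]), int(j[n])) for n in order]

# Toy data: weak random couplings plus one strong direct coupling at (2, 9).
rng = np.random.default_rng(0)
L, q = 12, 21
J = rng.normal(scale=0.01, size=(L, L, q, q))
J[2, 9] += 0.5
J[9, 2] += 0.5

scores = frobenius_contact_scores(J)
best = top_contacts(scores, k=1)
print(best)
```

The `min_separation` filter reflects the usual practice of ignoring pairs that are close in sequence, since those are trivially close in space.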

Architectural Details of AlphaFold2
AlphaFold2 takes as input an MSA (from JackHMMER and HHBlits searches) and template structures (from the PDB). The Evoformer stack (48 blocks) alternately updates two representations:

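MSA representation (N_seq × N_res × 256): row-wise attention captures patterns within a sequence, and column-wise attention captures the evolutionary signal at each residue position
Pair representation (N_res × N_res × 128): the outer product mean extracts pairwise features from the MSA, while triangle attention and the triangle multiplicative update enforce geometric consistency among residue pairs (if i-j are close and j-k are close, the i-k distance is constrained)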
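The two pair-representation updates named above can be sketched in NumPy to show how the tensor shapes interact. This is a heavily simplified illustration under stated assumptions: layer normalization, gating, and attention are omitted, the channel sizes are toy values, and the weight matrices (`W_a`, `V_a`, and so on) are hypothetical random stand-ins for learned linear layers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_seq, n_res = 5, 7
c_m, c_z, c_hidden = 16, 8, 4  # toy sizes; the text gives 256 and 128 for AlphaFold2

msa = rng.normal(size=(n_seq, n_res, c_m))  # MSA representation
z = rng.normal(size=(n_res, n_res, c_z))    # pair representation

# Hypothetical random weights standing in for learned linear layers.
W_a = rng.normal(size=(c_m, c_hidden))
W_b = rng.normal(size=(c_m, c_hidden))
W_o = rng.normal(size=(c_hidden * c_hidden, c_z))
V_a = rng.normal(size=(c_z, c_hidden))
V_b = rng.normal(size=(c_z, c_hidden))
V_o = rng.normal(size=(c_hidden, c_z))

def outer_product_mean(msa, W_a, W_b, W_o):
    """MSA -> pair update: average over sequences of per-position outer products."""
    a = msa @ W_a  # (n_seq, n_res, c_hidden)
    b = msa @ W_b
    # For each residue pair (i, j), average a[s, i] (outer) b[s, j] over sequences s.
    outer = np.einsum("sic,sjd->ijcd", a, b) / msa.shape[0]
    n = msa.shape[1]
    return outer.reshape(n, n, -1) @ W_o  # (n_res, n_res, c_z)

def triangle_multiply_outgoing(z, V_a, V_b, V_o):
    """Pair -> pair update: edge (i, j) aggregates edges (i, k) and (j, k) over k."""
    a = z @ V_a  # (n_res, n_res, c_hidden)
    b = z @ V_b
    tri = np.einsum("ikc,jkc->ijc", a, b)
    return tri @ V_o  # (n_res, n_res, c_z)

# Residual updates, in the order MSA -> pair, then pair -> pair.
z_after_opm = z + outer_product_mean(msa, W_a, W_b, W_o)
z_after_tri = z_after_opm + triangle_multiply_outgoing(z_after_opm, V_a, V_b, V_o)
print(z_after_tri.shape)
```

The sum over the third index k in `triangle_multiply_outgoing` is what couples the edge i-j to the edges i-k and j-k, which is the mechanism behind the geometric consistency described in the list.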

The Structure Module (8 iterations) uses Invariant Point Attention (IPA), which operates in each residue's local reference frame, to directly predict backbone frames (rotation + translation) and side-chain torsion angles. The loss function combines FAPE (Frame Aligned Point Error), auxiliary heads (pLDDT and predicted aligned error, PAE), and masked MSA prediction.
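To make FAPE concrete, here is a minimal NumPy sketch. It assumes one frame and one point per residue, a 10 Å clamp, and a 10 Å normalization constant; the full loss in AlphaFold2 also covers side-chain frames and atoms, which are left out. The helper names and the toy data are hypothetical.

```python
import numpy as np

def to_local(R, t, x):
    """Express points x (N, 3) in every frame: result[i, j] = R_i^T (x_j - t_i)."""
    diff = x[None, :, :] - t[:, None, :]        # (n_frames, n_points, 3)
    return np.einsum("fba,fnb->fna", R, diff)   # multiply by R_i transposed

def fape(R_pred, t_pred, x_pred, R_true, t_true, x_true,
         clamp=10.0, scale=10.0, eps=1e-8):
    """Frame Aligned Point Error over all (frame, point) combinations."""
    local_pred = to_local(R_pred, t_pred, x_pred)
    local_true = to_local(R_true, t_true, x_true)
    dist = np.sqrt(((local_pred - local_true) ** 2).sum(axis=-1) + eps)
    return np.minimum(dist, clamp).mean() / scale

def random_rotation(rng):
    """Random proper rotation from a QR decomposition (toy data only)."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

rng = np.random.default_rng(0)
n_res = 20
R_true = np.stack([random_rotation(rng) for _ in range(n_res)])
t_true = rng.normal(scale=10.0, size=(n_res, 3))
x_true = t_true.copy()  # use the frame origins as the points

# A rigid motion of the whole prediction leaves every local coordinate unchanged.
R_g, t_g = random_rotation(rng), rng.normal(size=3)
R_moved = R_g @ R_true
t_moved = t_true @ R_g.T + t_g
x_moved = x_true @ R_g.T + t_g
fape_rigid = fape(R_moved, t_moved, x_moved, R_true, t_true, x_true)

# Perturbing the predicted points produces a nonzero error.
x_noisy = x_true + rng.normal(scale=1.0, size=x_true.shape)
fape_noisy = fape(R_true, t_true, x_noisy, R_true, t_true, x_true)
print(fape_rigid, fape_noisy)
```

Because every point is compared in the local frame of every residue, the loss needs no global superposition step, and a rigidly moved copy of the true structure scores essentially zero.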

The Post-AlphaFold Era

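ESMFold (Lin et al., 2023): replaces the MSA input with a protein language model (ESM-2, a model family scaled up to 15B parameters), reaching accuracy close to AlphaFold2 from a single sequence, with inference up to roughly 60 times faster on short sequences
RoseTTAFold (Baek et al., 2021): a three-track architecture (1D sequence, 2D distance map, 3D coordinates) whose tracks are updated simultaneously
AlphaFold3 (Abramson et al., 2024): extends prediction to general complexes of proteins, nucleic acids, small molecules, and ions, and replaces the Structure Module with a diffusion model
OpenFold: an open-source reimplementation of AlphaFold2 that supports custom training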

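Challenges and Limitations
AlphaFold2 assigns low confidence (pLDDT < 50) to intrinsically disordered regions (IDRs), which is the correct behavior, since these regions have no fixed structure. Conformational diversity (multiple conformations of the same protein) is currently the central challenge; the subsampled-MSA strategies used with AlphaFold-Multimer and ColabFold can sample part of this diversity.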
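As a practical illustration of the pLDDT threshold, the sketch below flags candidate disordered segments in a predicted model. It relies on the fact that AlphaFold output files store pLDDT in the B-factor column of the PDB ATOM records; the function names, the `min_length` cutoff, and the toy records are assumptions made for the example.

```python
def plddt_per_residue(pdb_text):
    """Read per-residue pLDDT from the B-factor column of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, residue number 23-26,
        # B-factor 61-66 (1-based), hence the 0-based slices below.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores

def low_confidence_segments(scores, threshold=50.0, min_length=3):
    """Return (first, last) residue numbers of runs with pLDDT below threshold."""
    segments, run = [], []
    for res_num in sorted(scores):
        is_low = scores[res_num] < threshold
        if is_low and run and res_num == run[-1] + 1:
            run.append(res_num)  # extend the current contiguous run
            continue
        if len(run) >= min_length:
            segments.append((run[0], run[-1]))
        run = [res_num] if is_low else []
    if len(run) >= min_length:
        segments.append((run[0], run[-1]))
    return segments

def toy_ca_record(res_num, plddt):
    """Build a fixed-width CA ATOM record with made-up coordinates."""
    return (f"ATOM  {res_num:5d}  CA  GLY A{res_num:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")

toy_plddt = [92, 88, 45, 40, 35, 30, 75, 90, 48, 91]
pdb_text = "\n".join(toy_ca_record(i + 1, p) for i, p in enumerate(toy_plddt))

scores = plddt_per_residue(pdb_text)
segments = low_confidence_segments(scores)
print(segments)
```

A low-pLDDT segment is only a hint of disorder; it can also reflect a shallow MSA or a region that folds only upon binding, so flagged segments are best treated as candidates for further checks.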
