BC-Design: A Biochemistry-Aware Framework for High-Precision Inverse Protein Folding

Mathematical Notation Details

Graph and Set Notation

| Symbol | Description | Comment |
| --- | --- | --- |
| $G(V, E, F_V, F_E)$ | Structure graph with nodes $V$, edges $E$, node features $F_V$, edge features $F_E$ | Should be $\mathcal{G}(\mathcal{V}, \mathcal{E}, \mathcal{F}_{\mathcal{V}}, \mathcal{F}_{\mathcal{E}})$. |
| $V$ | Set of nodes (residues) in the structure graph | Should be $\mathcal{V}$. |
| $E$ | Set of edges in the structure graph | Should be $\mathcal{E}$. |
| $V'$ | Augmented node set including aggregator nodes | Should be $\mathcal{V}'$. |
| $E'$ | Augmented edge set including aggregator edges | Should be $\mathcal{E}'$. |
| $V_c$ | Set of structure aggregator nodes | Should be $\mathcal{V}_c$. |
| $k$ | Number of nearest neighbors in $k$-NN graph ($k=30$) | Good |
| $v_i^L$ | Local structure aggregator node | Should not have both superscript and subscript |
| $v^G$ | Global structure aggregator node | Should not have superscript |
| $Q$ | Local coordinate system for residues | Should be $\mathcal{Q}$ |
  • Comment:
    • The annotation of additional nodes and edges should be clear. Otherwise, it will be hard to read.
      • Should not have both superscript and subscript: $v_i^L$
      • Should not have superscript: $v^G$
      • Does $G$ denote both the global node and the graph $G$?
    • Should be $\mathbf{Q} \in \mathbb{R}^{3 \times 3}$.
      • Quaternion system is $\mathbf{q} \in \mathbb{H}$. Need to clarify.
      • $\mathbf{Q}$ is a rotation matrix.
      • Coordinate system $\mathcal{Q}$ should not be the same as query matrix $Q$ in attention.
    • For the aggregator nodes, should we use $\mathcal{V}_c$ and $\mathcal{E}_c$? What exactly is the $c$?
      • $c$ is for “context” or “center”?
      • We can actually have $\mathcal{V}_g$ for global aggregator nodes, and $\mathcal{V}_l$ for local aggregator nodes.
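
The suggestions above can be collected into one consistent set of definitions. The following is only a sketch of how the graph notation might read if they were all adopted; the symbols $\mathcal{V}_l$, $\mathcal{V}_g$, and $\mathbf{Q}_i$ are proposals from these comments, not notation confirmed by the paper, and $\mathcal{E}_c$ is kept as a placeholder until the meaning of $c$ is settled.

```latex
% Hypothetical consolidated notation (a proposal, not the paper's own)
\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{F}_{\mathcal{V}}, \mathcal{F}_{\mathcal{E}}),
\qquad
\mathcal{V}' = \mathcal{V} \cup \mathcal{V}_l \cup \mathcal{V}_g,
\qquad
\mathcal{E}' = \mathcal{E} \cup \mathcal{E}_c,
\qquad
\mathbf{Q}_i \in \mathbb{R}^{3 \times 3} \ \text{(rotation matrix of residue } i\text{)}
```

With this choice, calligraphic letters are reserved for sets and graphs, bold capitals for matrices, and the plain $Q$ stays free for the attention query matrix.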

Point Cloud and Biochemical Features

| Symbol | Description | Comment |
| --- | --- | --- |
| $b(x, y, z) = \mathbf{b}$ | Continuous mapping of biochemical properties in 3D space | We have $x, y, z$ and $\mathbf{x}$ for coordinates at the same time. |
| $\mathcal{P}_S$ | Surface point cloud | Clear |
| $\mathcal{P}_I$ | Internal point cloud | Clear |
| $\mathcal{P}$ | Combined point cloud | Clear |
| $P_i$ | Point in the point cloud with coordinates and features | Should be $P_i \in \mathcal{P}$ |
| $\mathbf{x}_i$ | Spatial coordinates of point $i$ | Should use $\mathbf{x}_i \in \mathbb{R}^3$ for coordinates. |
| $\mathbf{b}_i = (h_i, c_i)$ | Biochemical features (hydrophobicity, charge) of point $i$ | Would it be better to use $\mathbf{b}_i = b(\mathbf{x}_i) \in \mathbb{R}^2$? |
| $P^G$ | Global biochemical aggregator point | Same problem as $v^G$. |
| $P_i^L$ | Local biochemical aggregator point | Same problem as $v_i^L$. |
| $N_L$ | The number of local biochemical aggregator points | I feel it is not a good idea to introduce another notation $N_L$ for the number of local points. Why don’t we use $\vert V_L \vert$? |
  • Comment:
    • What is $\mathbf{b}$? Is it a constant? If not, why do we use $b(x, y, z) = \mathbf{b}$ as an implicit definition?
    • Would it be better to use $\mathbf{b}_i = b(\mathbf{x}_i) \in \mathbb{R}^2$?
    • The definition of $P^G$ and $P_i^L$ should be clear. Same as $v^G$ and $v_i^L$.
    • I feel it is not a good idea to introduce another notation $N_L$ for the number of local points. Why don’t we use $\vert V_L \vert$?
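
One way the point-cloud definitions might read if the comments above were adopted is sketched below. The explicit signature $b : \mathbb{R}^3 \to \mathbb{R}^2$ is an assumption inferred from the two listed features (hydrophobicity and charge), not a statement taken from the paper.

```latex
% Hypothetical explicit definition (assumes exactly two biochemical features)
b : \mathbb{R}^3 \to \mathbb{R}^2,
\qquad
P_i = (\mathbf{x}_i, \mathbf{b}_i) \in \mathcal{P},
\qquad
\mathbf{x}_i \in \mathbb{R}^3,
\qquad
\mathbf{b}_i = b(\mathbf{x}_i) = (h_i, c_i) \in \mathbb{R}^2
```

This removes the need for separate scalar coordinates $x, y, z$ and makes $\mathbf{b}_i$ a function value rather than an implicitly defined constant.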

Neural Network Components

| Symbol | Description | Comment |
| --- | --- | --- |
| $H_V$ | Node embeddings | Clear |
| $w_E$ | Edge weights | Why $w_E \in \mathbb{R}^{\vert E \vert \times d}$ while we can have $w_E[i,j] \in \mathbb{R}$? |
| $H_{V_c}$ | Structure aggregator node embeddings | Clear |
| $H_{V'}$ | Concatenated node embeddings | Clear |
| $W_E$ | Weight matrix for edges | What is the difference between $w_E$ and $W_E$? |
| $GPE$ | Graph Positional Encodings | Should be $\text{GPE} \in \mathbb{R}^{\vert V \vert \times d}$ |
| $Q, K, V$ | Query, Key, Value matrices in attention | Clear |
| $S$ | Attention score matrix | Clear |
| $S'$ | Modified attention score matrix | Clear |
  • Comment:
    • What is the difference between $w_E$ and $W_E$?
      • It is not clear to me why we need $W_E$ for $w_E$.
      • It is also not clear why $w_E$ is a vector while it is used as a matrix in the equation.
      • Is the following correct? \(W_E[i,j] = \begin{cases} w_E[i,j] & \text{if } e_{ij} \in E \\ 1 & \text{if } e_{ij} \in E_c \\ 0 & \text{otherwise}, \end{cases}\)
    • Should be $\text{GPE} \in \mathbb{R}^{\vert V \vert \times d}$.
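
To make the question about $w_E$ and $W_E$ concrete, here is a minimal NumPy sketch of the piecewise definition proposed above, under the assumption that each edge carries a single scalar weight $w_E[i,j] \in \mathbb{R}$. The function name, the toy graph, and the treatment of edges as directed pairs are all hypothetical; the paper may construct $W_E$ differently.

```python
import numpy as np

def build_edge_weight_matrix(n_nodes, edges, edge_weights, aggregator_edges):
    """Dense W_E: w_E[i, j] on graph edges, 1 on aggregator edges, 0 elsewhere."""
    W = np.zeros((n_nodes, n_nodes), dtype=float)
    for (i, j), w in zip(edges, edge_weights):
        W[i, j] = w        # e_ij in E: use the scalar edge weight
    for i, j in aggregator_edges:
        W[i, j] = 1.0      # e_ij in E_c: aggregator edges pass through unscaled
    return W

# Toy graph: residue nodes 0..2 plus one aggregator node with index 3.
edges = [(0, 1), (1, 2)]
edge_weights = [0.5, 0.25]
aggregator_edges = [(3, 0), (3, 1), (3, 2)]
W_E = build_edge_weight_matrix(4, edges, edge_weights, aggregator_edges)

# Element-wise (Hadamard) product with an attention score matrix S.
S = np.ones((4, 4))
masked = S * W_E
```

Under this reading, $w_E$ is just the list of per-edge scalars and $W_E$ is its dense matrix form, which would explain why both symbols appear.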

Loss Functions

| Symbol | Description | Comment |
| --- | --- | --- |
| $\mathcal{L}_{\text{CE}}$ | Cross-entropy loss | Clear |
| $\mathcal{L}_{\text{GCL}}$ | Global contrastive loss | Clear |
| $\mathcal{L}_{\text{LCL}}$ | Local contrastive loss | Clear |
| $\mathcal{L}$ | Combined loss function | Clear |
| $\lambda_1, \lambda_2$ | Loss weights (both set to 1) | Clear |

Key Equations

| Equation | Description | Comment |
| --- | --- | --- |
| $S' = S \odot W_E + GPE$ | Attention score modification | Clear |
| $\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_1\mathcal{L}_{\text{GCL}} + \lambda_2\mathcal{L}_{\text{LCL}}$ | Combined loss function | Clear |
| $\mathcal{N}_r(P'_i) = \{P'_j \mid \Vert \mathbf{x}'_i - \mathbf{x}'_j \Vert \leq r\}$ | Multi-scale neighborhood definition | Clear |
| $K_{BC} = \max(1, \lfloor 1400/\vert V \vert \rfloor)$ | Dynamic connection parameter for BC-Graph | What is $K_{BC}$? Isn’t it $K_{\mathcal{B}}$? |
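
The last two equations are simple enough to check numerically. The sketch below is an illustration only: the function names and the toy coordinates are hypothetical, and the neighborhood is computed by brute force rather than by whatever spatial index the authors may use.

```python
import numpy as np

def k_bc(num_nodes):
    """K_BC = max(1, floor(1400 / |V|)) for a graph with |V| = num_nodes."""
    return max(1, 1400 // num_nodes)

def radius_neighborhood(coords, i, r):
    """Indices j with ||x_i - x_j|| <= r. Point i itself is included,
    because the definition does not exclude j = i."""
    dists = np.linalg.norm(coords - coords[i], axis=1)
    return np.flatnonzero(dists <= r)

# Toy point cloud with three points on a line.
coords = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0]])
neighbors = radius_neighborhood(coords, 0, 1.5)
```

Note that $K_{BC}$ shrinks as the protein grows and bottoms out at 1 once $\vert V \vert$ exceeds 700, since $\lfloor 1400 / \vert V \vert \rfloor$ is then at most 1.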

Other Mathematical Notation

| Symbol | Description | Comment |
| --- | --- | --- |
| $\odot$ | Hadamard (element-wise) product | Clear |
| $\vert \cdot \vert$ | Absolute value or set cardinality | Clear |
| $\lfloor x \rfloor$ | Floor function | Clear |
| $\in$ | Element of | Clear |
| $\cup$ | Set union | Clear |
| $\leq$ | Less than or equal to | Clear |
| $\times$ | Cross product or multiplication | Clear |