BC-Design: A Biochemistry-Aware Framework for High-Precision Inverse Protein Folding
Mathematical Notation Details
Graph and Set Notation
Symbol | Description | Comment |
---|---|---|
$G(V, E, F_V, F_E)$ | Structure graph with nodes $V$, edges $E$, node features $F_V$, edge features $F_E$ | Should be $\mathcal{G}(\mathcal{V}, \mathcal{E}, \mathcal{F}_\mathcal{V}, \mathcal{F}_\mathcal{E})$. |
$V$ | Set of nodes (residues) in the structure graph | Should be $\mathcal{V}$. |
$E$ | Set of edges in the structure graph | Should be $\mathcal{E}$. |
$V'$ | Augmented node set including aggregator nodes | Should be $\mathcal{V}'$. |
$E'$ | Augmented edge set including aggregator edges | Should be $\mathcal{E}'$. |
$V_c$ | Set of structure aggregator nodes | Should be $\mathcal{V}_c$. |
$k$ | Number of nearest neighbors in the $k$-NN graph ($k=30$) | Good |
$v_i^L$ | Local structure aggregator node | Should not have both superscript and subscript |
$v^G$ | Global structure aggregator node | Should not have superscript |
$Q$ | Local coordinate system for residues | Should be $\mathcal{Q}$ |
- Comment:
- The annotation of additional nodes and edges should be clear. Otherwise, it will be hard to read.
- Should not have both superscript and subscript: $v_i^L$
- Should not have superscript: $v^G$
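- The superscript $G$ marks the global node, but $G$ also denotes the graph. Is the overlap intended?
- If the local coordinate system is a rotation matrix, it should be written $\mathbf{Q} \in \mathbb{R}^{3 \times 3}$.
- A quaternion representation would instead be $\mathbf{q} \in \mathbb{H}$. The text needs to clarify which one is meant.
- As described, $\mathbf{Q}$ is a rotation matrix.
- The coordinate system $\mathcal{Q}$ should not share a symbol with the query matrix $Q$ in attention.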
- For the aggregator nodes, should we use $\mathcal{V}_c$ and $\mathcal{E}_c$? What exactly is the $c$?
- $c$ is for “context” or “center”?
- We can actually have $\mathcal{V}_g$ for global aggregator nodes, and $\mathcal{V}_l$ for local aggregator nodes.
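One way to make the suggestions above concrete is the following notation sketch. It assumes that $\mathcal{V}_c$ is the union of the local and global aggregator sets and that $\mathcal{E}_c$ denotes the aggregator edges; the subscripts $l$ and $g$ and the per-residue index on $\mathbf{Q}_i$ are proposals from this review, not notation taken from the paper.

```latex
% Proposed (hypothetical) notation for the augmented graph.
\begin{align*}
\mathcal{V}_c &= \mathcal{V}_l \cup \mathcal{V}_g, &
\mathcal{V}'  &= \mathcal{V} \cup \mathcal{V}_c, &
\mathcal{E}'  &= \mathcal{E} \cup \mathcal{E}_c, \\
% Local frame of residue i as a rotation matrix (orthogonal, determinant 1).
\mathbf{Q}_i &\in \mathbb{R}^{3 \times 3}, &
\mathbf{Q}_i^\top \mathbf{Q}_i &= \mathbf{I}, &
\det \mathbf{Q}_i &= 1.
\end{align*}
```

With this convention the superscripts $L$ and $G$ disappear, and the graph symbol $\mathcal{G}$ no longer collides with the label of the global node.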
Point Cloud and Biochemical Features
Symbol | Description | Comment |
---|---|---|
$b(x, y, z) = \mathbf{b}$ | Continuous mapping of biochemical properties in 3D space | Coordinates are written both as $x, y, z$ and as $\mathbf{x}$; one form should be used throughout. |
$\mathcal{P}_S$ | Surface point cloud | Clear |
$\mathcal{P}_I$ | Internal point cloud | Clear |
$\mathcal{P}$ | Combined point cloud | Clear |
$P_i$ | Point in the point cloud with coordinates and features | Should be $P_i \in \mathcal{P}$ |
$\mathbf{x}_i$ | Spatial coordinates of point i | Should use $\mathbf{x}_i \in \mathbb{R}^3$ for coordinates. |
$\mathbf{b}_i = (h_i, c_i)$ | Biochemical features (hydrophobicity, charge) of point i | Would it be better to use $\mathbf{b}_i = b(\mathbf{x}_i) \in \mathbb{R}^2$? |
$P^G$ | Global biochemical aggregator point | Same problem as $v^G$. |
$P_i^L$ | Local biochemical aggregator point | Same problem as $v_i^L$. |
$N_L$ | The number of local biochemical aggregator points | I feel it is not a good idea to introduce another notation $N_L$ for the number of local points. Why don’t we use $\vert V_L \vert$? |
- Comment:
- What is $\mathbf{b}$? Is it a constant? If not, why do we use $b(x, y, z) = \mathbf{b}$ as an implicit definition?
- Would it be better to use $\mathbf{b}_i = b(\mathbf{x}_i) \in \mathbb{R}^2$?
- The definition of $P^G$ and $P_i^L$ should be clear. Same as $v^G$ and $v_i^L$.
- I feel it is not a good idea to introduce another notation $N_L$ for the number of local points. Why don’t we use $\vert V_L \vert$?
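The explicit definition suggested above could be written as follows. This is only a sketch: treating $b$ as a map from $\mathbb{R}^3$ to $\mathbb{R}^2$ follows from the two listed features (hydrophobicity and charge), while writing the combined cloud as the union of $\mathcal{P}_S$ and $\mathcal{P}_I$ is an assumption of this review.

```latex
\begin{align*}
b &: \mathbb{R}^3 \to \mathbb{R}^2, \\
\mathbf{b}_i &= b(\mathbf{x}_i) = (h_i, c_i) \in \mathbb{R}^2,
  \qquad \mathbf{x}_i \in \mathbb{R}^3, \\
P_i &= (\mathbf{x}_i, \mathbf{b}_i) \in \mathcal{P},
  \qquad \mathcal{P} = \mathcal{P}_S \cup \mathcal{P}_I
  \quad \text{(assumed)}.
\end{align*}
```

Defining $b$ as a function first and $\mathbf{b}_i$ as its value at $\mathbf{x}_i$ removes the need for the separate scalars $x, y, z$.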
Neural Network Components
Symbol | Description | Comment |
---|---|---|
$H_V$ | Node embeddings | Clear |
$w_E$ | Edge weights | Why is $w_E \in \mathbb{R}^{\vert E \vert \times d}$ when we could have $w_E[i,j] \in \mathbb{R}$? |
$H_{V_c}$ | Structure aggregator node embeddings | Clear |
$H_{V'}$ | Concatenated node embeddings | Clear |
$W_E$ | Weight matrix for edges | What is the difference between $w_E$ and $W_E$? |
$GPE$ | Graph Positional Encodings | Should be $\text{GPE} \in \mathbb{R}^{\vert V \vert \times d}$ |
$Q, K, V$ | Query, Key, Value matrices in attention | Clear |
$S$ | Attention score matrix | Clear |
$S'$ | Modified attention score matrix | Clear |
- Comment:
- What is the difference between $w_E$ and $W_E$?
- It is not clear to me why we need $W_E$ for $w_E$.
- It is also not clear why $w_E$ is a vector while it is used as a matrix in the equation.
- Is the following correct? $W_E[i,j] = \begin{cases} w_E[i,j] & \text{if } e_{ij} \in E \\ 1 & \text{if } e_{ij} \in E_c \\ 0 & \text{otherwise} \end{cases}$
- Should be $\text{GPE} \in \mathbb{R}^{\vert V \vert \times d}$.
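To check whether the piecewise reading of $W_E$ asked about above is workable, here is a minimal Python sketch. It assumes scalar edge weights $w_E[i,j] \in \mathbb{R}$ (the alternative raised in the table) and a dense matrix over the augmented node set. The function names, the toy graph, and the placeholder score matrix are all hypothetical; the GPE term is left out because its shape is still an open question in these notes.

```python
def build_edge_weight_matrix(num_nodes, edge_weights, aggregator_edges):
    """Dense W_E under the piecewise reading proposed in the comments.

    edge_weights: dict mapping (i, j) in E to a scalar weight w_E[i, j].
    aggregator_edges: iterable of (i, j) pairs in E_c.
    """
    # Pairs that are in neither E nor E_c keep the default value 0.
    W = [[0.0] * num_nodes for _ in range(num_nodes)]
    for (i, j), w in edge_weights.items():
        W[i][j] = w        # scalar weight for a structure edge
    for i, j in aggregator_edges:
        W[i][j] = 1.0      # aggregator edges leave the score unchanged
    return W


def hadamard(A, B):
    """Element-wise product of two equally shaped matrices (lists of rows)."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


# Toy graph: residues 0-2 plus one aggregator node with index 3.
edge_weights = {(0, 1): 0.5, (1, 0): 0.5, (1, 2): 0.25, (2, 1): 0.25}
aggregator_edges = [(3, 0), (3, 1), (3, 2), (0, 3), (1, 3), (2, 3)]
W_E = build_edge_weight_matrix(4, edge_weights, aggregator_edges)

S = [[2.0] * 4 for _ in range(4)]   # placeholder attention scores
S_masked = hadamard(S, W_E)         # the S ⊙ W_E part of the score update
```

Under this reading, $W_E$ is just $w_E$ scattered into a $\vert V' \vert \times \vert V' \vert$ matrix, which would explain why the paper uses both symbols. Whether that is the authors' intent still needs confirmation.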
Loss Functions
Symbol | Description | Comment |
---|---|---|
$\mathcal{L}_{\text{CE}}$ | Cross-entropy loss | Clear |
$\mathcal{L}_{\text{GCL}}$ | Global contrastive loss | Clear |
$\mathcal{L}_{\text{LCL}}$ | Local contrastive loss | Clear |
$\mathcal{L}$ | Combined loss function | Clear |
$\lambda_1, \lambda_2$ | Loss weights (both set to 1) | Clear |
Key Equations
Equation | Description | Comment |
---|---|---|
$S' = S \odot W_E + GPE$ | Attention score modification | Clear |
$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda_1\mathcal{L}_{\text{GCL}} + \lambda_2\mathcal{L}_{\text{LCL}}$ | Combined loss function | Clear |
$\mathcal{N}_r(P'_i) = \{ P'_j \mid \Vert \mathbf{x}'_i - \mathbf{x}'_j \Vert \leq r \}$ | Multi-scale neighborhood definition | Clear |
$K_{BC} = \max(1, \lfloor 1400/\vert V \vert \rfloor)$ | Dynamic connection parameter for BC-Graph | What is $K_{BC}$? Isn’t it $K_{\mathcal{B}}$? |
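As a sanity check on the last two equations in the table, the sketch below evaluates them directly in Python. The function names and the `budget` parameter are hypothetical. The code also assumes that the neighborhood includes the point itself (the definition does not exclude $j = i$) and that the norm is Euclidean.

```python
import math


def dynamic_k_bc(num_residues, budget=1400):
    """K_BC = max(1, floor(budget / |V|)), as given in the table."""
    if num_residues <= 0:
        raise ValueError("num_residues must be positive")
    return max(1, budget // num_residues)


def radius_neighborhood(coords, i, r):
    """Indices j with ||x_i - x_j|| <= r, including i itself."""
    return [j for j, x in enumerate(coords) if math.dist(coords[i], x) <= r]


print(dynamic_k_bc(100))    # 14
print(dynamic_k_bc(700))    # 2
print(dynamic_k_bc(2000))   # 1, the floor is 0 so the max(1, .) clamp applies

coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
print(radius_neighborhood(coords, 0, 1.0))   # [0, 1]
print(radius_neighborhood(coords, 0, 3.0))   # [0, 1, 2]
```

The clamp shows that every protein with more than 1400 residues gets $K_{BC} = 1$, which may be worth stating explicitly next to the equation.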
Other Mathematical Notation
Symbol | Description | Comment |
---|---|---|
$\odot$ | Hadamard (element-wise) product | Clear |
$\vert \cdot \vert$ | Absolute value or set cardinality | Clear |
$\lfloor x \rfloor$ | Floor function | Clear |
$\in$ | Element of | Clear |
$\cup$ | Set union | Clear |
$\leq$ | Less than or equal to | Clear |
$\times$ | Cross product or multiplication | Clear |