Introduction to Graph Neural Networks and Diffusion Models
Prepared for Chong’s Lab on June 10th, 2024
Powered by Marp for updating the slides and website together.
Graph Neural Networks
- Unstructured data: $x \mapsto f_\theta(x)$, e.g., $f_\theta(x) = \theta^\top x$
- e.g., image: $x \in \mathbb{R}^{256\times 256 (\times 3)}$; CV data: $x \in \mathbb{R}^{3\times T\times N}$
- Structured data?
Basic Graph Structure
- $G = (\mathcal V, \mathcal E)$
- $\mathcal V$: nodes (vertices), e.g., atoms / users
  - $\mathcal V = \{x_i\}_{i=1}^{N}$: atom / user features
- $\mathcal E$: edges, e.g., bonds, social relationships
  - $\mathcal E = \{e_{ij}\}_{N\times N}$: bond types, etc.
  - $e_{ij}$ vs. $e_{ji}$: undirected / directed
$N$: number of nodes in the graph; $i$: node index (see the sketch below)
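As a concrete (hypothetical) illustration of this notation, a minimal sketch in plain NumPy of how $\mathcal V$ and $\mathcal E$ might be stored as arrays; all numbers are made up:

```python
import numpy as np

# Hypothetical toy molecule: N = 3 nodes, one-hot atom-type features (6 possible types).
x = np.zeros((3, 6))                    # V = {x_i}_{i=1}^N, node feature matrix
x[0, 0] = 1                             # node 1: atom type 0
x[1, 5] = 1                             # node 2: atom type 5
x[2, 0] = 1                             # node 3: atom type 0

# E = {e_ij}: edges stored as (i, j) index pairs plus a matching edge-feature matrix.
edge_index = np.array([[0, 1],          # bond between nodes 1 and 2 (0-based indices)
                       [1, 2]])         # bond between nodes 2 and 3
edge_attr = np.array([[1, 0],           # e_12: one-hot bond type
                      [1, 0]])          # e_23
```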
Graph structure (example 1)
- $\mathcal V$: atom type, $N=6$
  - $x_1 = (1,0,0,0,0,0) \in \mathbb{R}^6$
  - $x_2 = (0,0,0,0,0,1) \in \mathbb{R}^6$
  - $x_3 = (1,0,0,0,0,0) \in \mathbb{R}^6$
- $\mathcal E$: bond type, $|\mathcal E| = 5$
  - $e_{12} = (1,0) \in \mathbb{R}^2$
  - $e_{23} = (1,0) \in \mathbb{R}^2$
  - $e_{25} = (0,1) \in \mathbb{R}^2$
- $e_{ij} = e_{ji}$ (cr. Img.1)
Graph structure (example 2)
- $\mathcal V$: atom type & coordinates, $N=6$
  - $x_1 = (1,0,0,0,0,0,0,0.92,1.23) \in \mathbb{R}^9$
  - $x_2 = (0,0,0,0,0,1,0,0,0.67) \in \mathbb{R}^9$
  - $x_3 = (1,0,0,0,0,0,0,-0.92,1.23) \in \mathbb{R}^9$
- $\mathcal E$: bond type & direction, $|\mathcal E| = 5$
  - $e_{12} = (1,0,0,-0.92,-0.56) \in \mathbb{R}^5$
  - $e_{23} = (1,0,0,-0.92,0.56) \in \mathbb{R}^5$
- $e_{ij} \neq e_{ji}$ (cr. Img.1)
More Graph Representations
Adjacency Matrix
- Undirected graph: $A = A^\top$
- $e_{ii}$: self-loop, may or may not be useful
Degree of a node
- $\mathrm{deg}(x_i)$ = number of neighbors (see the sketch after this slide)
Subgraph (Advanced)
- Some nodes in the graph can themselves be subgraphs
- e.g., functional groups (CH3-, COOH-, …)
cr. Img.1, read more: Laplacian Embedding
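A quick sketch with NetworkX (introduced in the libraries slide below) of the adjacency matrix and node degrees for a small undirected graph; the edge list here is arbitrary:

```python
import networkx as nx

# Small undirected graph, so A = A^T.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (1, 4)])

A = nx.to_numpy_array(G)     # adjacency matrix A (symmetric for an undirected graph)
degrees = dict(G.degree())   # deg(x_i) = number of neighbors of each node
print(A)
print(degrees)               # {0: 1, 1: 3, 2: 1, 4: 1}
```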
Graph Neural Networks
- A function $f(\cdot)$ taking a graph $G$ as input
- The output $f(G)$ can be …
  - A graph of the same structure, $G'$
    - Recommendations for each user
    - Energy for each atom
  - A scalar …
    - Toxicity of a molecule ($y \in \{0,1\}$)
    - Conformation energy ($y \in \mathbb{R}$)
cr. Img.1
Graph Neural Networks: Message Passing & Aggregation
- Message from neighbor $j$ to $i$: $f(x_i, x_j, e_{ij})$
- Aggregate the messages from all neighbors: $x_i^{l+1} = x_i^l + \sum_{j\in\mathcal N(i)} f(x_i^{l}, x_j^{l}, e_{ij}^{l})$ (see the sketch after this slide)
- $f(x_i, x_j, e_{ij})$: a trainable neural network (usually an MLP)
- In adjacency-matrix notation: $x_i^{l+1} = x_i^l + \sum_j A_{ij}\, f(x_i^{l}, x_j^{l}, e_{ij}^{l})$
- $l$: layer index in the NN
- Scalar output: $y = \sum_i x_i^{L}$
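A minimal sketch of one message-passing layer matching the update above (plain NumPy; the toy graph and the stand-in for the trainable $f$ are made up):

```python
import numpy as np

def message_passing_layer(x, edge_index, edge_attr, f):
    """One layer: x_i^{l+1} = x_i^l + sum_{j in N(i)} f(x_i^l, x_j^l, e_ij^l)."""
    x_new = x.copy()
    for (i, j), e_ij in zip(edge_index, edge_attr):
        x_new[i] += f(x[i], x[j], e_ij)    # message from neighbor j to i
        x_new[j] += f(x[j], x[i], e_ij)    # undirected graph: also pass i to j
    return x_new

# Toy graph: 3 nodes with 6-dim features, 2 edges with 2-dim bond features.
x = np.eye(3, 6)
edge_index = [(0, 1), (1, 2)]
edge_attr = np.array([[1.0, 0.0], [1.0, 0.0]])

# Stand-in for a trainable MLP f(x_i, x_j, e_ij): a fixed linear map of the neighbor features.
W = 0.1 * np.random.randn(6, 6)
f = lambda xi, xj, eij: W @ xj

x_next = message_passing_layer(x, edge_index, edge_attr, f)   # x^{l+1}
readout = x_next.sum(axis=0)                                   # graph-level readout: sum over nodes
```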
Example I: Graph Convolutional Networks (GCN (1))
- Message from neighbor $j$ to $i$ ($\sigma$: activation function, $W$: trainable parameters): $f(x_i, x_j, e_{ij}) = \sigma(W x_j)$
- Update the layer from aggregation: $x_i^{l+1} = x_i^l + \frac{1}{\sqrt{\deg(i)}} \sum_{j\in\mathcal N(i)} \sigma(W x_j^{l})$
(1): https://arxiv.org/pdf/1609.02907v4
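For reference, a minimal GCN layer via PyTorch Geometric (a sketch; assumes `torch` and `torch_geometric` are installed, and the toy graph is hypothetical):

```python
import torch
from torch_geometric.nn import GCNConv

# Toy graph: 6 nodes with 6-dim one-hot features; undirected edges listed in both directions.
x = torch.eye(6)
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)

conv = GCNConv(in_channels=6, out_channels=16)   # W x_j with degree-normalized aggregation
h = torch.relu(conv(x, edge_index))              # sigma(.) applied outside; h has shape [6, 16]
```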
Example II: Graph Attention / Transformer Networks (GAT, GTN (2))
- Message from neighbors: attention and values, $\mathrm{attn}(x_i, x_j, e_{ij}) = \langle Q x_i, K x_j\rangle$, $f(x_i, x_j, e_{ij}) = V x_j$
- Aggregate the information, weighted by attention: $x_i^{l+1} = \sum_{j\in\mathcal N(i)} \mathrm{softmax}_j\big(\mathrm{attn}(x_i, x_j, e_{ij})\big)\cdot f(x_i, x_j, e_{ij})$
- Intuition: estimate how much attention to pay to each neighboring node.
(2): https://arxiv.org/abs/1911.06455
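A from-scratch sketch of this dot-product attention aggregation in NumPy (single head, no edge features; all names and sizes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def graph_attention_layer(x, neighbors, Q, K, V):
    """x_i^{l+1} = sum_{j in N(i)} softmax_j( <Q x_i, K x_j> ) * V x_j."""
    out = np.zeros((x.shape[0], V.shape[0]))
    for i, nbrs in neighbors.items():
        attn = np.array([(Q @ x[i]) @ (K @ x[j]) for j in nbrs])    # attention logits
        w = softmax(attn)                                            # weights over neighbors
        out[i] = sum(w_j * (V @ x[j]) for w_j, j in zip(w, nbrs))    # attention-weighted values
    return out

# Toy example: 3 nodes, 6-dim features, 4-dim hidden size; neighbor lists are hypothetical.
d, h = 6, 4
x = np.eye(3, d)
neighbors = {0: [1], 1: [0, 2], 2: [1]}
Q, K, V = (0.1 * np.random.randn(h, d) for _ in range(3))
x_next = graph_attention_layer(x, neighbors, Q, K, V)
```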
Graph Neural Networks Libraries
- Graph neural networks
  - PyTorch Geometric (PyG) (https://pytorch-geometric.readthedocs.io/en/latest/)
  - Deep Graph Library (DGL) (https://www.dgl.ai)
- Handling the graph structure
  - NetworkX (https://networkx.org)
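For example, the toy molecule from Example 1 could be packed into a PyG `Data` object roughly like this (a sketch; only standard `Data` fields are used):

```python
import torch
from torch_geometric.data import Data

x = torch.eye(6)                                    # one-hot atom types for N = 6 nodes
edge_index = torch.tensor([[0, 1, 1, 2],            # undirected edges stored in both directions
                           [1, 0, 2, 1]], dtype=torch.long)
edge_attr = torch.tensor([[1.0, 0.0]] * 4)          # one-hot bond type per directed edge

data = Data(x=x, edge_index=edge_index, edge_attr=edge_attr)
print(data)   # roughly: Data(x=[6, 6], edge_index=[2, 4], edge_attr=[4, 2])
```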
Diffusion Processes
Generative model
- Goal of a generative model: learn and sample from the distribution $\mathbb P(x)$.
  - With labels: $\mathbb P(x \mid y)$
- Prior work: generative adversarial networks (GAN), variational autoencoders (VAE), etc.
Compare with discriminative models
- Goal of a discriminative model: discriminate between different classes of data, $\mathbb P(y \mid x)$.
- Examples: image classification (VGG, ResNet), language classification, etc.
Diffusion Process - one step
Intuition: add noise to the input (e.g., an image), then learn to denoise it (see the sketch after this slide).
- Forward step: generate $x_1 = x_0 + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$
- Reverse step: estimate $\varepsilon \approx \varepsilon_\theta(x_1)$ and generate $x_0 = x_1 - \varepsilon_\theta(x_1)$
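In code the one-step picture is only a few lines (a sketch; `eps_model` stands in for the trained network $\varepsilon_\theta$ and is a perfect oracle here, just to show the algebra):

```python
import numpy as np

sigma = 0.1
x0 = np.random.randn(16)                   # a "clean" data sample (placeholder)

# Forward step: x1 = x0 + eps, eps ~ N(0, sigma^2)
eps = sigma * np.random.randn(*x0.shape)
x1 = x0 + eps

# Reverse step: a trained eps_model would estimate eps from x1; here we cheat with an oracle.
eps_model = lambda x: eps
x0_hat = x1 - eps_model(x1)                # recovers x0 exactly with the oracle
```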
Diffusion Process - Repeating for $T$ steps
- Forward process (Markovian):
  - By Bayes' rule, we do not actually need to sample through the whole chain; instead $x_t \sim \mathbb P_{t\mid 0}(\cdot \mid x_0)$
  - e.g., $x_t \sim N(x_{t-1}, \sigma^2) \Longrightarrow x_t \sim N(x_0, \sigma^2 t)$
- Reverse process (this is not formal, just for intuition!!):
  $x_{T-1} = x_T - \varepsilon_\theta(x_T, T),\quad x_{T-2} = x_{T-1} - \varepsilon_\theta(x_{T-1}, T-1),\quad \cdots,\quad x_0 = x_1 - \varepsilon_\theta(x_1, 1)$
  - We need to generate $x_0$ through this chain!
  - Usually $\sigma$ is small, so the NN is not hard to learn
Forward Process (assuming $\sigma^2(x_0) = 1$ by normalization)
- VE-SDE: $x_t \sim N(x_{t-1}, \sigma^2)$, $x_t \sim N(x_0, \sigma^2 t)$, $x_t \sim N(\mathbb E(x_0), \sigma^2(x_0) + \sigma^2 t)$
  - Variance-Exploding
- VP-SDE: $x_t \sim N(\mu_{t\mid t-1}\, x_{t-1}, \sigma_{t\mid t-1}^2)$, $\sigma^2(x_t) = \sigma_{t\mid t-1}^2 + \mu_{t\mid t-1}^2\, \sigma^2(x_{t-1})$
  - Variance-Preserving: $\mu_{t\mid t-1}^2 + \sigma^2_{t\mid t-1} = 1$, $\mu_t^2 + \sigma^2_t = 1$, $x_t \sim N(\mu_t x_0, \sigma^2_t)$
  - $T\to\infty,\ \mu_t\to 0,\ \sigma_t\to 1,\ x_T\to N(0,1)$ // we start the reverse process from here!
- Training objective (see the sketch after this slide):
  - Predict the noise from noisy data, $\varepsilon_t = x_t - \mu_t x_0$: $\mathcal L = \mathbb E_{t,\, x_0,\, x_t \mid x_0}\, \|\varepsilon_{\theta}(x_t, t) - \varepsilon_t\|_2^2$
  - Reweight for better training: $\mathcal L = \mathbb E_{t,\, x_0,\, x_t \mid x_0}\, \|\varepsilon'_{\theta}(x_t, t) - \varepsilon_t/\sigma_t\|_2^2$ (note $\varepsilon_t/\sigma_t \sim N(0,1)$!)
  - Note that $\varepsilon'_\theta(\cdot, t) \approx \varepsilon_\theta(\cdot, t)/\sigma_t$
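A sketch of the reweighted training objective in PyTorch (VP-style; `eps_model` and the `mu`, `sigma` schedules are placeholders, not a specific paper's parameterization):

```python
import torch

def diffusion_loss(eps_model, x0, mu, sigma):
    """L = E_{t, x0, x_t|x0} || eps'_theta(x_t, t) - eps_t / sigma_t ||_2^2, with eps_t = x_t - mu_t x0.

    x0: [batch, dim]; mu, sigma: length-T schedules with mu_t^2 + sigma_t^2 = 1.
    """
    T = sigma.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))           # random timestep for each sample in the batch
    noise = torch.randn_like(x0)                      # eps_t / sigma_t ~ N(0, 1)
    x_t = mu[t, None] * x0 + sigma[t, None] * noise   # x_t ~ N(mu_t x0, sigma_t^2)
    return ((eps_model(x_t, t) - noise) ** 2).mean()  # predict the standardized noise
```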
Reverse Process
- What we now know: $x_{t-1} \sim N(\mu_{t-1} x_0, \sigma^2_{t-1})$, and $\bar x_0 \approx x_t - \varepsilon_\theta(x_t, t) = x_t - \varepsilon'_\theta(x_t, t)\,\sigma_t$
- Sample $x_{t-1} \sim N(\mu_{t-1}\, \bar x_0, \sigma^2_{t\mid t-1})$ (recall $x_t \mid x_{t-1} \sim N(\mu_{t\mid t-1}\, x_{t-1}, \sigma^2_{t\mid t-1})$); see the sketch after this slide
- More justification: $\mathbb P(x_{t-1}\mid x_t) = \sum_{\bar x_0} \mathbb P(x_{t-1}\mid x_t, \bar x_0)\, \mathbb P(\bar x_0\mid x_t)$, and $\mathbb P(x_{t-1}\mid x_t, x_0) \propto \mathbb P(x_{t-1}\mid x_0)\, \mathbb P(x_t\mid x_{t-1}) = N\!\big(\cdot,\ (\sigma^{-2}_{t-1} + \mu^2_{t\mid t-1}/\sigma^2_{t\mid t-1})^{-1}\big)$
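And a correspondingly rough sampling loop (a sketch only; it plugs the slide's $\bar x_0$ estimate into the posterior mean, with `mu`, `sigma`, `sigma_step` as placeholder schedules of length $T$):

```python
import torch

@torch.no_grad()
def sample(eps_model, mu, sigma, sigma_step, shape):
    """Reverse chain: start from x_T ~ N(0, I), re-estimate x0_bar, then draw x_{t-1}."""
    T = mu.shape[0]
    x = torch.randn(shape)                              # x_T ~ N(0, I)
    for t in range(T - 1, 0, -1):
        x0_bar = x - sigma[t] * eps_model(x, t)         # x0_bar approx x_t - sigma_t * eps'_theta(x_t, t)
        noise = torch.randn(shape) if t > 1 else 0.0    # no noise on the final step
        x = mu[t - 1] * x0_bar + sigma_step[t] * noise  # x_{t-1} ~ N(mu_{t-1} x0_bar, sigma_{t|t-1}^2)
    return x
```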
Why diffusion models work (for science people…)
- Langevin dynamics: $M\ddot X(t) = -\nabla U(X(t)) - \zeta \dot X(t) + \sqrt{2\zeta k T}\, R(t)$
- Overdamped regime ($M \ll 1$): $\dot X(t) = -\zeta^{-1}\nabla U(X(t)) + \sqrt{2kT/\zeta}\, R(t)$
- Equilibrium Boltzmann distribution:
  $$\mathbb P(X) = \frac{\exp(-U(X)/kT)}{\int \exp(-U(X)/kT)\, dX}, \quad \log \mathbb P(X) = -U(X)/kT - \log\!\int \exp(-U(X)/kT)\, dX, \quad \nabla_X \log \mathbb P(X) = -\nabla_X U(X)/kT$$
- If we let $D = kT/\zeta$, the overdamped Langevin equation becomes $\dot X(t) = D\,\nabla_X \log \mathbb P(X) + \sqrt{2D}\, R(t)$ (see the sketch after this slide)
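For intuition, overdamped Langevin sampling with a known score is only a few lines (a sketch; the 1-D double-well potential and all constants are made up):

```python
import numpy as np

def langevin_sample(score, x0, D=0.01, dt=1.0, n_steps=5000):
    """Euler step of  dX = D * grad log P(X) dt + sqrt(2 D) dB  ->  X ~ P(X) at long times."""
    x = float(x0)
    for _ in range(n_steps):
        x = x + D * score(x) * dt + np.sqrt(2 * D * dt) * np.random.randn()
    return x

# Example: Boltzmann distribution of a double-well U(X) = (X^2 - 1)^2 with kT = 1,
# so grad log P(X) = -grad U(X) / kT = -4 X (X^2 - 1).
score = lambda x: -4.0 * x * (x ** 2 - 1.0)
samples = np.array([langevin_sample(score, 0.0) for _ in range(200)])   # clusters near X = +1 and -1
```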
Now the same again with statistics / ML …
$\dot X(t) = D\,\nabla_X \log \mathbb P(X) + \sqrt{2D}\, R(t) \ \Rightarrow\ X \sim \mathbb P(X)$
- $D$: learning rate (think of GD on $f(X) = -\log \mathbb P(X)$: $\dot X(t) = -D\nabla f(X)$)
- $\nabla_X \log \mathbb P(X)$: the score function
- How to learn the score function? (score matching)
  - $\log \mathbb P(X)$ is hard to learn directly (think of learning $U(X)$ vs. the force $F(X) = -\nabla U(X)$)
  - $\nabla \log \mathbb P(X) = \mathbb E_{X_0 \mid X}\, \nabla \log \mathbb P(X \mid X_0)$; make $\nabla \log \mathbb P(X \mid X_0)$ easy to calculate: $$\mathcal L = \mathbb E_{X_0,\, X \mid X_0}\, \|f(X) - \nabla \log \mathbb P(X \mid X_0)\|_2^2$$
  - What if $\mathbb P(X \mid X_0) = N(\mu X_0, \sigma^2) \propto \exp(-0.5\,(X - \mu X_0)^2/\sigma^{2})$?
    - $\nabla \log \mathbb P(X \mid X_0) = -(X - \mu X_0)/\sigma^2 = -\varepsilon/\sigma^2$!!
- Note: $\mathbb P(X)$ here is not the original data distribution; it is the distribution of $X$ given that $X_0$ comes from the data distribution…
From $\mathbb P(X)$ to $\mathbb P(X_0)$
Forward process: $dX = -f_t X\, dt + g_t\, dB$ ($dB$: Brownian motion)
cr. Img.1
Reverse process as an SDE
- Forward: $dX = -f_t X\, dt + g_t\, dB$ ($dB$: Brownian motion, $\approx N(0, dt)$), $t: 0 \to 1$
- Reverse process for recovery: $dX = [-f_t X - g_t^2\, \nabla \log \mathbb P_t(x)]\, dt + g_t\, d\bar B$
  - $d\bar B$: reverse-time Brownian motion
  - Match the score function and solve the SDE ($t: 1 \to 0$); see the sketch after this slide
cr. Img.1
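A rough Euler–Maruyama integration of this reverse SDE (a sketch; `score_model` approximates $\nabla \log \mathbb P_t$, and `f`, `g` are placeholder schedules):

```python
import numpy as np

def reverse_sde_sample(score_model, f, g, shape, n_steps=1000):
    """Integrate dX = [-f_t X - g_t^2 * score(X, t)] dt + g_t dB backwards from t = 1 to t = 0."""
    dt = 1.0 / n_steps
    x = np.random.randn(*shape)                                    # start from the prior at t = 1
    for k in range(n_steps, 0, -1):
        t = k * dt
        drift = -f(t) * x - g(t) ** 2 * score_model(x, t)
        x = x - drift * dt + g(t) * np.sqrt(dt) * np.random.randn(*shape)   # step t -> t - dt
    return x
```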
Conditional generation on information $c$
- Most naive approach: train $\mathbb P(\cdot \mid c)$ separately for each $c$ and hope the NN generalizes
- Classifier guidance: use a discriminative model predicting $\mathbb P(c \mid x, t)$
  - Seek to generate $\mathbb Q(x) \propto \mathbb P(x)\, \mathbb P(c \mid x, t)^{\gamma}$ ($\gamma$: generation strength)
  - At each generation step, let $\nabla \log \mathbb Q(x) = \nabla \log \mathbb P(x) + \gamma\, \nabla \log \mathbb P(c \mid x, t)$
  - Makes $\mathbb Q(x)$ more likely to be classified as $c$ under $\mathbb P(c \mid x, t)$
- Classifier-free guidance (see the sketch after this slide)
  - Approximate $\mathbb P(c \mid x, t) \propto \mathbb P(x \mid c)/\mathbb P(x)$, generating with $$\nabla \log \mathbb Q(x) = (1 - \gamma)\,\nabla \log \mathbb P(x) + \gamma\, \nabla \log \mathbb P(x \mid c)$$
  - Does not require a classifier; suitable for image inputs, language prompts, etc.
- More general guidance: seek $\mathbb Q(x) \propto \mathbb P(x)\exp(-E(x))$
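At sampling time, classifier-free guidance is just a weighted combination of two score estimates (a sketch; `score_uncond` and `score_cond` stand for the unconditional and conditional score networks):

```python
def guided_score(score_uncond, score_cond, x, c, t, gamma=2.0):
    """grad log Q(x) = (1 - gamma) * grad log P(x) + gamma * grad log P(x | c)."""
    return (1.0 - gamma) * score_uncond(x, t) + gamma * score_cond(x, c, t)
```

In practice the two estimates usually come from the same network, with the condition dropped (e.g., masked out) for the unconditional branch.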
Blog / papers
- https://yang-song.net/assets/img/score/sde_schematic.jpg
- https://arxiv.org/abs/2006.11239
- https://arxiv.org/abs/2011.13456
Advanced topics
- Flow matching
- Equivariant generation for 3D structures (1) (2)
- Physics of flow matching / diffusion models, and how to accelerate them