Introduction to Graph Neural Networks and Diffusion Models

Date: June 10th, 2024

Introduction to Graph Neural Networks and Diffusion Models

Prepared for Chong’s Lab on June 10th, 2024

Powered by Marp, so the slides and website are updated together.


Graph Neural Networks

  • Unstructured data: $x \to f_\theta(x)$, e.g., $f_\theta(x) = \theta x$
    • e.g. Image: $x \in \mathbb R^{256 \times 256 (\times 3)}$; CV data: $x \in \mathbb R^{3 \times T \times N}$
  • Structured data?


cr. Img.1, Img.2


Basic Graph Structure


  • $\mathcal G = (\mathcal V, \mathcal E)$
  • $\mathcal V$: nodes (vertices), e.g., atoms / users
    • $\mathcal V = \{x_i\}_{i=1}^{N}$: atom / user features
  • $\mathcal E$: edges, e.g., bonds, social relationships
    • $\mathcal E = \{e_{ij}\}_{N \times N}$: bond types, etc.
    • $e_{ij}$ vs. $e_{ji}$: undirected / directed

      $N$: number of nodes in the graph; $i$: node index


Graph structure (example 1)


  • $\mathcal V$: atom type, $N = 6$
    • $x_1 = (1,0,0,0,0,0) \in \mathbb R^6$
    • $x_2 = (0,0,0,0,0,1) \in \mathbb R^6$
    • $x_3 = (1,0,0,0,0,0) \in \mathbb R^6$
  • $\mathcal E$: bond type, $|\mathcal E| = 5$
    • $e_{12} = (1,0) \in \mathbb R^2$
    • $e_{23} = (1,0) \in \mathbb R^2$
    • $e_{25} = (0,1) \in \mathbb R^2$

      $e_{ij} = e_{ji}$ (undirected). cr. Img.1
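
A minimal NumPy sketch of how these one-hot features could be assembled; the atom and bond vocabularies below are illustrative assumptions, since the slide only fixes the dimensions:

```python
import numpy as np

# Hypothetical vocabularies for illustration; the slide only fixes the feature dimensions.
ATOM_TYPES = ["C", "N", "O", "F", "P", "S"]   # node feature dim = 6
BOND_TYPES = ["single", "double"]             # edge feature dim = 2

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Node features x_i in R^6 (atom-type one-hot), N = 6 nodes
x1 = one_hot(ATOM_TYPES.index("C"), 6)   # (1,0,0,0,0,0)
x2 = one_hot(ATOM_TYPES.index("S"), 6)   # (0,0,0,0,0,1)

# Edge features e_ij in R^2 (bond-type one-hot); undirected, so e_ij == e_ji
e12 = one_hot(BOND_TYPES.index("single"), 2)  # (1,0)
e25 = one_hot(BOND_TYPES.index("double"), 2)  # (0,1)
print(x1, x2, e12, e25)
```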


Graph structure (example 2)


  • $\mathcal V$: atom type & coordinates, $N = 6$
    • $x_1 = (1,0,0,0,0,0,0,0.92,1.23) \in \mathbb R^9$
    • $x_2 = (0,0,0,0,0,1,0,0,0.67) \in \mathbb R^9$
    • $x_3 = (1,0,0,0,0,0,0,0.92,1.23) \in \mathbb R^9$
  • $\mathcal E$: bond type & direction, $|\mathcal E| = 5$
    • $e_{12} = (1,0,0,0.92,0.56) \in \mathbb R^5$
    • $e_{23} = (1,0,0.92,0.56) \in \mathbb R^5$

      $e_{ij} \ne e_{ji}$ (directed). cr. Img.1


More Graph Representations

Adjacency Matrix


  • Undirected graph: $A = A^\top$
  • $e_{ii}$: self-loop; may or may not be useful

Degree of a node

  • $\mathrm{deg}(x_i)$: number of neighbors of node $i$

Subgraph (Advanced)

  • Some nodes in the graph can themselves be subgraphs
    • e.g. functional group (CH3-, COOH-, …)

cr. Img.1, read more: Laplacian Embedding
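
A small sketch with NetworkX and NumPy showing the adjacency matrix, the undirected symmetry $A = A^\top$, and node degrees; the edge list is made up for illustration:

```python
import numpy as np
import networkx as nx

# A small undirected graph (edges chosen for illustration, not taken from the slide's figure).
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (2, 5), (3, 4), (5, 6)])

A = nx.to_numpy_array(G)          # adjacency matrix
assert np.allclose(A, A.T)        # undirected graph: A = A^T
deg = A.sum(axis=1)               # deg(x_i) = number of neighbors
print(A)
print(dict(zip(G.nodes, deg)))
```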

Graph Neural Networks


  • A function $f(\cdot)$ taking a graph $\mathcal G$ as input
  • The output $f(\mathcal G)$ can be …
    • A graph with the same structure as $\mathcal G$
      • Recommendations for each user
      • Energy for each atom
    • A scalar …
      • Toxicity of a molecule ($y \in \{0,1\}$)
      • Conformation energy ($y \in \mathbb R$)

cr. Img.1


Graph Neural Networks: Message Passing & Aggregation

  • Message from neighbor $j$ to $i$: $f(x_i, x_j, e_{ij})$
  • Aggregate the messages from all neighbors: $x_i^{l+1} = x_i^{l} + \sum_{j \in \mathcal N(i)} f(x_i^{l}, x_j^{l}, e_{ij}^{l})$
  • $f(x_i, x_j, e_{ij})$: a trainable neural network (usually an MLP)
  • In adjacency-matrix notation: $x_i^{l+1} = x_i^{l} + \sum_j A_{ij}\, f(x_i^{l}, x_j^{l}, e_{ij}^{l})$
  • $l$: layer index in the NN
  • Scalar output: $y = \sum_i x_i^{L}$ (see the sketch below)
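
A minimal PyTorch sketch of the message-passing update above; the layer sizes, MLP architecture, and toy graph are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """Sketch of x_i^{l+1} = x_i^l + sum_{j in N(i)} f(x_i^l, x_j^l, e_ij)."""
    def __init__(self, node_dim, edge_dim, hidden=64):
        super().__init__()
        # f(x_i, x_j, e_ij): a trainable MLP on the concatenated inputs
        self.f = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, node_dim),
        )

    def forward(self, x, edge_index, edge_attr):
        # x: [N, node_dim]; edge_index: [2, E] (source j -> target i); edge_attr: [E, edge_dim]
        src, dst = edge_index
        msg = self.f(torch.cat([x[dst], x[src], edge_attr], dim=-1))  # message j -> i
        agg = torch.zeros_like(x).index_add_(0, dst, msg)             # sum over neighbors of i
        return x + agg                                                # residual update

# Toy usage with random data (shapes are illustrative assumptions)
x = torch.randn(6, 8)
edge_index = torch.tensor([[0, 1, 1], [1, 2, 4]])   # edges j -> i
edge_attr = torch.randn(3, 2)
layer = MessagePassingLayer(node_dim=8, edge_dim=2)
y = layer(x, edge_index, edge_attr).sum(dim=0)      # sum-pool over nodes for a graph-level readout
print(y.shape)
```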

Example I: Graph Convolutional Networks (GCN (1))

  • Message from neighbor $j$ to $i$ ($\sigma$: activation function, $W$: trainable parameters): $f(x_i, x_j, e_{ij}) = \sigma(W x_j)$
  • Update the layer from the aggregation (a sketch follows the reference below): $x_i^{l+1} = x_i^{l} + \frac{1}{\deg(i)} \sum_{j \in \mathcal N(i)} \sigma(W x_j^{l})$

(1): https://arxiv.org/pdf/1609.02907v4
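
A rough sketch of the update formula above on a dense adjacency matrix. It follows the slide's formula rather than the exact normalized propagation rule of the GCN paper; the toy graph and weights are assumptions:

```python
import torch

def gcn_layer(x, A, W):
    """One GCN-style update, sketching x_i^{l+1} = x_i^l + (1/deg(i)) sum_j sigma(W x_j^l)."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)   # deg(i); clamp avoids division by zero
    msg = torch.relu(x @ W.T)                       # sigma(W x_j) for every node j
    return x + (A @ msg) / deg                      # aggregate over neighbors, normalize by degree

# Toy usage (random graph and weights for illustration)
N, d = 6, 8
A = (torch.rand(N, N) < 0.3).float()
A = ((A + A.T) > 0).float()                         # make it undirected: A = A^T
W = torch.randn(d, d)
x = torch.randn(N, d)
print(gcn_layer(x, A, W).shape)
```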


Example II: Graph Attention Networks / Graph Transformer Networks (GAT, GTN (2))

  • Message from neighbors: attention and values, $\mathrm{attn}(x_i, x_j, e_{ij}) = \langle Q x_i, K x_j \rangle$, $f(x_i, x_j, e_{ij}) = V x_j$
  • Aggregate the information weighted by attention: $x_i^{l+1} = \sum_{j \in \mathcal N(i)} \mathrm{softmax}_j\big(\mathrm{attn}(x_i, x_j, e_{ij})\big)\, f(x_i, x_j, e_{ij})$
  • Intuition: learn how much attention to pay to each neighboring node (see the sketch below).

(2): https://arxiv.org/abs/1911.06455
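
A simplified single-head sketch of the attention aggregation above; edge features are omitted, and the dense adjacency mask and toy graph are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    """Dot-product attention over a node's neighbors (single head, edge features omitted)."""
    def __init__(self, dim):
        super().__init__()
        self.Q = nn.Linear(dim, dim, bias=False)
        self.K = nn.Linear(dim, dim, bias=False)
        self.V = nn.Linear(dim, dim, bias=False)

    def forward(self, x, A):
        # attn(x_i, x_j) = <Q x_i, K x_j>, masked so only neighbors j in N(i) contribute
        scores = self.Q(x) @ self.K(x).T
        scores = scores.masked_fill(A == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)   # softmax over neighbors j
        return weights @ self.V(x)                # sum_j softmax(attn) * V x_j

# Toy usage: a chain graph with self-loops so every row has at least one neighbor
N, d = 6, 8
A = torch.eye(N) + torch.diag(torch.ones(N - 1), 1) + torch.diag(torch.ones(N - 1), -1)
layer = GraphAttentionLayer(d)
print(layer(torch.randn(N, d), A).shape)
```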


Graph Neural Network Libraries

  • Graph Neural Networks
    • PyTorch Geometric (PyG) (https://pytorch-geometric.readthedocs.io/en/latest/)
    • Deep Graph Library (DGL) (https://www.dgl.ai)
  • Handling the graph structure
    • NetworkX (https://networkx.org)
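
If PyG is installed, a minimal usage sketch might look like the following; the feature sizes and toy edge list are assumptions:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# A graph as PyG stores it: node features plus an edge list in COO format.
x = torch.randn(6, 8)                                    # 6 nodes, 8 features each
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]])  # both directions for an undirected graph
data = Data(x=x, edge_index=edge_index)

conv = GCNConv(in_channels=8, out_channels=16)
h = conv(data.x, data.edge_index)                        # one round of message passing
print(h.shape)                                           # torch.Size([6, 16])
```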

Diffusion Processes

Generative model

  • Goal of a generative model: learn and sample from the data distribution $\mathbb P(x)$.
    • With labels: $\mathbb P(x \mid y)$
  • Prior work: generative adversarial networks (GAN), variational autoencoders (VAE), etc.

Comparison with discriminative models

  • Goal of a discriminative model: discriminate between different classes of data, $\mathbb P(y \mid x)$.
  • Examples: image classification (VGG, ResNet), language classification, etc.

Diffusion Process - one step

Intuition: add noise to the input (image), then denoise.

  • Forward step: generate $x_1 = x_0 + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$
  • Reverse step: estimate $\varepsilon \approx \varepsilon_\theta(x_1)$ and generate $x_0 = x_1 - \varepsilon_\theta(x_1)$
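
A toy sketch of this one-step intuition; the placeholder noise predictor stands in for a trained $\varepsilon_\theta$ and is an illustrative assumption:

```python
import torch

# One-step intuition: corrupt x0 with Gaussian noise, then undo it with a (trained) noise predictor.
def forward_step(x0, sigma=0.1):
    eps = sigma * torch.randn_like(x0)      # eps ~ N(0, sigma^2)
    return x0 + eps, eps                    # x1 = x0 + eps

def reverse_step(x1, eps_model):
    return x1 - eps_model(x1)               # x0_hat = x1 - eps_theta(x1)

# Toy usage with a placeholder "model" that simply predicts zero noise.
x0 = torch.randn(4, 3)
x1, eps = forward_step(x0)
x0_hat = reverse_step(x1, eps_model=lambda x: torch.zeros_like(x))
print((x0_hat - x0).abs().mean())
```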


cr. Gif.1, Gif.2


Diffusion Process - Repeating for T steps

  • Forward process (Markovian):
$$x_1 \sim P(\cdot \mid x_0),\quad x_2 \sim P(\cdot \mid x_1),\quad \ldots,\quad x_T \sim P(\cdot \mid x_{T-1})$$

By Bayes' rule we do not actually need to sample through the chain; we can sample $x_t \sim P_{t \mid 0}(\cdot \mid x_0)$ directly (see the sketch after this list).

  • e.g. $x_t \sim N(x_{t-1}, \sigma^2) \Rightarrow x_t \sim N(x_0, \sigma^2 t)$
  • Reverse process (this is not formal, just for intuition!!): $x_{T-1} = x_T - \varepsilon_\theta(x_T, T),\ x_{T-2} = x_{T-1} - \varepsilon_\theta(x_{T-1}, T-1),\ \ldots,\ x_0 = x_1 - \varepsilon_\theta(x_1, 1)$

    We need to generate $x_0$ through this chain!

  • Usually $\sigma$ is small, so the NN is not hard to learn
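
A quick numerical check of the shortcut above: sampling $x_T$ through the chain and jumping directly to $N(x_0, \sigma^2 T)$ give the same spread (the constants here are arbitrary):

```python
import torch

sigma, T = 0.1, 1000
x0 = torch.randn(1000, 3)

# Sampling x_T through the Markov chain x_t ~ N(x_{t-1}, sigma^2) ...
x = x0.clone()
for t in range(T):
    x = x + sigma * torch.randn_like(x)

# ... is equivalent in distribution to jumping straight to x_T ~ N(x_0, sigma^2 * T).
x_direct = x0 + sigma * (T ** 0.5) * torch.randn_like(x0)

print(x.std().item(), x_direct.std().item())   # similar magnitudes
```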

Forward Process (assuming $\sigma^2(x_0) = 1$ by normalization)

  • VE-SDE: $x_t \sim N(x_{t-1}, \sigma^2)$, $x_t \sim N(x_0, \sigma^2 t)$, $x_t \sim N(\mathbb E(x_0), \sigma^2(x_0) + \sigma^2 t)$
    • Variance-Exploding
  • VP-SDE: $x_t \sim N(\mu_{t \mid t-1}\, x_{t-1}, \sigma_{t \mid t-1}^2)$, $\sigma^2(x_t) = \sigma_{t \mid t-1}^2 + \mu_{t \mid t-1}^2\, \sigma^2(x_{t-1})$
    • Variance-Preserving: $\mu_{t \mid t-1}^2 + \sigma^2_{t \mid t-1} = 1 \Rightarrow \mu_t^2 + \sigma_t^2 = 1$, $x_t \sim N(\mu_t x_0, \sigma^2_t)$.
    • $T \to \infty:\ \mu_T \to 0,\ \sigma_T \to 1,\ x_T \sim N(0, 1)$ // we start the reverse process from here!

Training objective:

  • Predict the noise from the noisy data, $\varepsilon_t = x_t - \mu_t x_0$: $\mathcal L = \mathbb E_{t,\, x_0,\, x_t \mid x_0} \|\varepsilon_{\theta}(x_t, t) - \varepsilon_t\|_2^2$
  • Reweight for better training: $\mathcal L = \mathbb E_{t,\, x_0,\, x_t \mid x_0} \|\varepsilon_{\theta}(x_t, t) - \varepsilon_t / \sigma_t\|_2^2$ ($\varepsilon_t / \sigma_t \sim N(0, 1)$!); see the sketch below

    Note that with this reweighting $\varepsilon_\theta(\cdot, t) \approx \varepsilon(\cdot, t) / \sigma_t$
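
A sketch of the reweighted noise-prediction objective; the linear $\sigma_t$ schedule and the placeholder model are illustrative assumptions, not choices from any particular paper:

```python
import torch

def diffusion_loss(eps_model, x0, mu, sigma):
    """Reweighted noise-prediction loss: predict eps_t / sigma_t ~ N(0, 1) from x_t.
    mu, sigma: [T] tensors with the marginal coefficients mu_t, sigma_t (VP: mu_t^2 + sigma_t^2 = 1)."""
    T = mu.shape[0]
    t = torch.randint(0, T, (x0.shape[0],))                           # random timestep per sample
    noise = torch.randn_like(x0)                                      # eps_t / sigma_t ~ N(0, 1)
    xt = mu[t].unsqueeze(1) * x0 + sigma[t].unsqueeze(1) * noise      # x_t ~ N(mu_t x_0, sigma_t^2)
    pred = eps_model(xt, t)                                           # eps_theta(x_t, t)
    return ((pred - noise) ** 2).mean()

# Toy usage: a linear schedule and a placeholder model (both illustrative assumptions).
T = 100
sigma = torch.linspace(0.01, 1.0, T)
mu = torch.sqrt(1 - sigma ** 2)
model = lambda x, t: torch.zeros_like(x)
print(diffusion_loss(model, torch.randn(8, 3), mu, sigma))
```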


Reverse Process

What we now know: $x_{t-1} \sim N(\mu_{t-1} x_0, \sigma^2_{t-1})$, and $\bar x_0 \approx x_t - \varepsilon(x_t, t) = x_t - \varepsilon_\theta(x_t, t)\, \sigma_t$

Sample $x_{t-1} \sim N(\mu_{t-1} \bar x_0, \sigma^2_{t \mid t-1})$ (recall $x_t \mid x_{t-1} \sim N(\mu_{t \mid t-1} x_{t-1}, \sigma^2_{t \mid t-1})$)

More justification: $P(x_{t-1} \mid x_t) = \int_{\bar x_0} P(x_{t-1} \mid x_t, \bar x_0)\, P(\bar x_0 \mid x_t)\, d\bar x_0$, and $P(x_{t-1} \mid x_t, x_0) \propto P(x_{t-1} \mid x_0)\, P(x_t \mid x_{t-1}) = N\big(\cdot,\ (\sigma_{t-1}^{-2} + \mu_{t \mid t-1}^2\, \sigma_{t \mid t-1}^{-2})^{-1}\big)$

Read more: Algorithms 1 and 2 in DDPM; D3PM. A sampling sketch follows below.
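
A rough ancestral-sampling sketch following the recipe above, not the exact DDPM update; the schedules and the placeholder model are illustrative assumptions:

```python
import torch

@torch.no_grad()
def sample(eps_model, mu, sigma, sigma_step, shape):
    """Estimate x0_bar from eps_theta, then draw x_{t-1} ~ N(mu_{t-1} * x0_bar, sigma_{t|t-1}^2)."""
    T = mu.shape[0]
    x = torch.randn(shape)                               # start from x_T ~ N(0, 1)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t)
        # Invert x_t = mu_t x0 + sigma_t eps (the slide's shorthand drops the 1/mu_t factor).
        x0_bar = (x - sigma[t] * eps_model(x, t_batch)) / mu[t]
        if t > 0:
            x = mu[t - 1] * x0_bar + sigma_step[t] * torch.randn_like(x)
        else:
            x = x0_bar
    return x

# Toy usage with placeholder schedules and model (illustrative assumptions).
T = 50
sigma = torch.linspace(0.02, 0.999, T)
mu = torch.sqrt(1 - sigma ** 2)
sigma_step = torch.full((T,), 0.05)                      # stand-in for sigma_{t|t-1}
model = lambda x, t: torch.zeros_like(x)
print(sample(model, mu, sigma, sigma_step, (4, 3)).shape)
```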


Why diffusion models work (for science people…)

  • Langevin dynamics: $M \ddot X(t) = -\nabla U(X(t)) - \zeta \dot X(t) + \sqrt{2 \zeta k T}\, R(t)$
  • Overdamped regime ($M \ll 1$): $\dot X(t) = -\zeta^{-1} \nabla U(X(t)) + \sqrt{2 k T / \zeta}\, R(t)$

  • Equilibrium Boltzmann distribution: $P(X) = \exp(-U(X)/kT) / \int_X \exp(-U(X)/kT)\, dX$, so $\log P(X) = -U(X)/kT - \log \int_X \exp(-U(X)/kT)\, dX$ and $\nabla_X \log P(X) = -\nabla_X U(X)/kT$

If we let $D = kT/\zeta$, the overdamped Langevin equation becomes $\dot X(t) = D\, \nabla_X \log P(X) + \sqrt{2 D}\, R(t)$
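
A tiny overdamped-Langevin sketch for a harmonic well $U(X) = X^2/2$ with $kT = 1$, where the sampler should recover the Boltzmann distribution $N(0, 1)$; the step size and step count are arbitrary choices:

```python
import numpy as np

# Overdamped Langevin sketch: dX = D * grad log P(X) dt + sqrt(2 D dt) * R
# For U(X) = 0.5 * X^2 and kT = 1, grad log P(X) = -X, so samples should approach N(0, 1).
def langevin(n_steps=5000, dt=1e-2, D=1.0):
    x = np.zeros(1000)                        # many independent walkers
    for _ in range(n_steps):
        grad_log_p = -x                       # -grad U(X) / kT
        x = x + D * grad_log_p * dt + np.sqrt(2 * D * dt) * np.random.randn(*x.shape)
    return x

samples = langevin()
print(samples.mean(), samples.var())          # approximately 0 and 1 (Boltzmann of the harmonic well)
```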


Now the same again with statistics / ML …

$$\dot X(t) = D\, \nabla_X \log P(X) + \sqrt{2 D}\, R(t) \;\Longrightarrow\; X \sim P(X)$$

  • $D$: learning rate (think of gradient descent: $\dot X(t) = -D\, \nabla f(X)$)
  • $\nabla_X \log P(X)$: score function
  • How to learn the score function? (score matching)
    • $\log P(X)$ is hard to learn (think of learning $U(X)$ vs. the force $F(X) = -\nabla U(X)$)
    • $\nabla \log \mathbb P(X) = \mathbb E_{X_0 \mid X}\, \nabla \log \mathbb P(X \mid X_0)$; make $\nabla \log \mathbb P(X \mid X_0)$ easy to calculate:
      $$\mathcal L = \mathbb E_{X_0,\, X \mid X_0}\, \|f(X) - \nabla \log \mathbb P(X \mid X_0)\|_2^2$$
    • What if $\mathbb P(X \mid X_0) = N(\mu X_0, \sigma^2) \propto \exp(-0.5 (X - \mu X_0)^2 / \sigma^{2})$?
    • $\nabla \log \mathbb P(X \mid X_0) = -(X - \mu X_0)/\sigma^2 = -\varepsilon/\sigma^2$!! (see the sketch below)

      $\mathbb P(X)$ here is not the original data distribution; it is the distribution of $X$ given that $X_0$ comes from the data distribution.
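
A sketch of denoising score matching with the Gaussian corruption above; the corruption parameters and the placeholder score model are illustrative assumptions:

```python
import torch

def denoising_score_matching_loss(score_model, x0, mu=1.0, sigma=0.5):
    """Denoising score matching with P(X | X0) = N(mu * X0, sigma^2):
    the target is grad log P(X | X0) = -(X - mu X0) / sigma^2 = -eps / sigma^2."""
    eps = sigma * torch.randn_like(x0)
    x = mu * x0 + eps                                   # corrupted sample X
    target = -eps / sigma ** 2                          # analytic conditional score
    return ((score_model(x) - target) ** 2).mean()

# Toy usage with a placeholder linear score model (illustrative assumption).
score_model = torch.nn.Linear(3, 3)
print(denoising_score_matching_loss(score_model, torch.randn(8, 3)))
```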


From $\mathbb P(X)$ to $\mathbb P(X_0)$


Forward process: $dX = f_t X\, dt + g_t\, dB$ ($dB$: Brownian motion)

cr. Img.1


Reverse process in SDE

$dX = f_t X\, dt + g_t\, dB$; $dB$: Brownian motion $\sim N(0, dt)$, $t: 0 \to 1$

Reverse process for recovering the data: $dX = [f_t X - g_t^2\, \nabla \log P_t(x)]\, dt + g_t\, d\bar B$

  • $d\bar B$: reverse Brownian motion
  • Match the score function and solve the SDE ($t: 1 \to 0$); a sketch follows below
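
An Euler-Maruyama sketch of integrating the reverse SDE from $t = 1$ down to $t = 0$; the default $f_t$, $g_t$ (a simple OU, VP-like process) and the analytic score are illustrative assumptions, not a schedule from the papers:

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, f=lambda t: -1.0, g=lambda t: 2.0 ** 0.5,
                       n_steps=500, shape=(1000, 2)):
    """Euler-Maruyama integration of dX = [f_t X - g_t^2 * score(X, t)] dt + g_t dB_bar,
    run backward in time from t = 1 to t = 0."""
    dt = 1.0 / n_steps
    x = torch.randn(shape)                              # start from the prior at t = 1
    for i in reversed(range(n_steps)):
        t = (i + 1) * dt
        drift = f(t) * x - g(t) ** 2 * score_model(x, t)
        x = x - drift * dt + g(t) * (dt ** 0.5) * torch.randn_like(x)  # step backward in time
    return x

# Toy check: with score(x) = -x (the score of N(0, I)), the samples should remain ~ N(0, I).
print(reverse_sde_sample(lambda x, t: -x).std())
```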

cr. Img.1


Conditional generation on information c

  • Most naive approach: train $\mathbb P(\cdot \mid c)$ separately for each $c$ and rely on the NN for generalization
  • Classifier guidance: use a discriminative model predicting $\mathbb P(c \mid x, t)$
    • Seek to generate $\mathbb Q(x) \propto \mathbb P(x)\, \mathbb P(c \mid x, t)^{\gamma}$ ($\gamma$: guidance strength)
    • At each generation step, let $\nabla \log \mathbb Q(x) = \nabla \log \mathbb P(x) + \gamma\, \nabla \log \mathbb P(c \mid x, t)$
    • Makes samples from $\mathbb Q(x)$ more likely to be predicted as class $c$ by $\mathbb P(c \mid x, t)$
  • Classifier-free guidance
    • Approximate $\mathbb P(c \mid x, t) \propto \mathbb P(x \mid c) / \mathbb P(x)$, generating by
      $$\nabla \log \mathbb Q(x) = (1 - \gamma)\, \nabla \log \mathbb P(x) + \gamma\, \nabla \log \mathbb P(x \mid c)$$
    • Does not require a classifier; suitable for image inputs, language prompts, etc. (see the sketch below)
  • More general guidance (1): seek $\mathbb Q(x) \propto \mathbb P(x)\, \exp(-E(x))$
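
A sketch of the classifier-free guidance score combination above, with placeholder analytic scores standing in for trained conditional and unconditional networks:

```python
import torch

def guided_score(score_uncond, score_cond, x, c, gamma=2.0):
    """Classifier-free guidance: combine unconditional and conditional scores as
    grad log Q(x) = (1 - gamma) * grad log P(x) + gamma * grad log P(x | c)."""
    return (1.0 - gamma) * score_uncond(x) + gamma * score_cond(x, c)

# Toy usage with placeholder score functions (illustrative assumptions).
s_uncond = lambda x: -x                       # score of N(0, I)
s_cond = lambda x, c: -(x - c)                # score of N(c, I)
x = torch.randn(4, 2)
c = torch.ones(4, 2)
print(guided_score(s_uncond, s_cond, x, c))   # pushes samples toward the conditional mode
```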

Blog / papers

  • https://yang-song.net/assets/img/score/sde_schematic.jpg
  • https://arxiv.org/abs/2006.11239
  • https://arxiv.org/abs/2011.13456

Advanced topics

  • Flow matching
  • Equivariant generation for 3D structure (1) (2)
  • Physics of flow matching and diffusion models, and how to accelerate them