Sparsely-gated Mixture of Experts (MoE)
In transformer models, the attention block is typically followed by a feed-forward layer (FF), which is a simple fully-connected NN with a hidden layer and nonlinearity. Here's the code for such a block that uses ReLU:
import numpy as np

def feed_forward_relu(x, W1, W2):
    """Feed-forward layer with ReLU activation.

    Args:
        x: Input tensor (B, N, D).
        W1: Weights for the hidden layer (D, DH).
        W2: Weights for the output layer (DH, D).

    Returns:
        Output tensor (B, N, D).
    """
    x = x @ W1            # hidden layer (B, N, DH)
    x = np.maximum(x, 0)  # ReLU nonlinearity (B, N, DH)
    x = x @ W2            # output layer (B, N, D)
    return x
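As a quick sanity check, here is a minimal usage sketch with randomly initialized weights; the batch size, sequence length, and dimensions below are arbitrary values chosen only to illustrate the shapes:

# Shape check with arbitrary sizes (B=2, N=4, D=8, DH=32).
B, N, D, DH = 2, 4, 8, 32
rng = np.random.default_rng(0)

x = rng.standard_normal((B, N, D))
W1 = rng.standard_normal((D, DH))
W2 = rng.standard_normal((DH, D))

out = feed_forward_relu(x, W1, W2)
print(out.shape)  # (2, 4, 8) -- same shape as the input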