I'm not sure how Dyer's thing works, but I can describe the paper I linked to:

A "stack" is a list of (vector, scalar) pairs: each vector comes with its "weight". At each time step, the stack takes three inputs: the vector to push, the weight to push it with, and the weight to pop off the stack. Its output is a blend of the topmost entries whose weights add up to 1.0. The stack's behavior on a time step is divided into three parts:

1. First, the pop weight is removed from the top of the stack. For example, to remove 0.6 from a stack like [(a, 0.4), (b, 0.4)] (top listed first), you'd remove all of a, leaving 0.2 to remove from b, so the final stack would be [(b, 0.2)].

2. The pushed vector is placed on the stack with the given push weight.

3. The top 1.0 of the stack is blended together and returned. For example, a stack like [(a, 0.4), (b, 0.4), (c, 0.8)] would take 0.4 of a, 0.4 of b, and 0.2 of c, returning 0.4a + 0.4b + 0.2c.

This process is differentiable, and you can combine this with the usual recurrent loops to make a recurrent stack machine.
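
To make the three steps concrete, here's a rough sketch in plain Python. It's my own illustration, not code from the paper: the names `pop`, `read`, and `step` are made up, the top of the stack is index 0 (matching the examples above), and vectors are plain lists of floats. It also drops entries whose weight hits zero to keep things readable; a real differentiable version would presumably do the same arithmetic with `min`/`max` on tensors inside an autodiff framework.

```python
def pop(stack, weight):
    """Step 1: remove `weight` in total, starting from the top (index 0)."""
    remaining = weight
    result = []
    for vec, w in stack:
        removed = min(w, remaining)   # take as much as this entry has
        remaining -= removed
        if w - removed > 0:           # keep only entries with weight left
            result.append((vec, w - removed))
    return result


def read(stack, dim, budget=1.0):
    """Step 3: blend the top `budget` worth of weight into one vector."""
    out = [0.0] * dim
    remaining = budget
    for vec, w in stack:
        used = min(w, remaining)      # how much of this entry gets read
        out = [o + used * x for o, x in zip(out, vec)]
        remaining -= used
    return out


def step(stack, push_vec, push_weight, pop_weight):
    """One time step: pop, then push, then read the top 1.0."""
    stack = pop(stack, pop_weight)
    stack = [(push_vec, push_weight)] + stack   # step 2: push on top
    return stack, read(stack, len(push_vec))


# The two examples from the steps above, with one-hot stand-ins for a, b, c.
a, b, c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
print(pop([(a, 0.4), (b, 0.4)], 0.6))            # only b is left, weight about 0.2
print(read([(a, 0.4), (b, 0.4), (c, 0.8)], 3))   # about [0.4, 0.4, 0.2]
```

The printed numbers are only exact up to floating-point rounding. Every operation here is a `min`, a subtraction, or a weighted sum, which is why gradients can flow through the push and pop weights.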