Essentially, the Viterbi algorithm computes the shortest path through a DAG. If you explicitly model each transition/emission event as a single node in a graph, and if you then convert all probabilities p to -log p, then any shortest path algorithm will find the same result as Viterbi: the path of maximum likelihood.
This is easy to prove: for probabilities p_1, p_2, p_3, ..., the path which maximizes p_1 * p_2 * p_3 * ... also maximizes log(p_1 * p_2 * p_3 * ...) = log(p_1) + log(p_2) + log(p_3) + ..., and thus minimizes -(log(p_1) + log(p_2) + log(p_3) + ...) = -log(p_1) - log(p_2) - log(p_3) - .... Because the probabilities all satisfy 0 <= p <= 1, their logs are <= 0, so choosing -log(p) as an edge weight gives non-negative weights (edges with p = 0 can simply be left out of the graph). So you can use any standard shortest path algorithm to minimize -log(p_1) - log(p_2) - log(p_3) - ... and thus maximize p_1 * p_2 * p_3 * ...
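A quick sketch of that equivalence, with made-up toy numbers rather than a real HMM: the path with the largest probability product is exactly the path with the smallest sum of -log p weights.

```python
import math

# Two hypothetical candidate paths, each a list of per-edge probabilities
paths = {
    "A": [0.9, 0.5, 0.8],  # product = 0.36
    "B": [0.7, 0.7, 0.7],  # product = 0.343
}

# Maximize the product of probabilities ...
best_by_product = max(paths, key=lambda k: math.prod(paths[k]))

# ... or equivalently minimize the sum of -log p (non-negative edge weights)
best_by_neglog = min(paths, key=lambda k: sum(-math.log(p) for p in paths[k]))

# both criteria pick the same path
assert best_by_product == best_by_neglog
```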
The Viterbi algorithm uses the same "trick" as the standard algorithm for finding shortest paths through a DAG with better complexity than Dijkstra's algorithm: you can just process the nodes in topological order (which can be found in linear time), instead of processing them based on their position in a priority queue (as in Dijkstra's algorithm). Even better, in an HMM, the topological ordering of the graph is already part of the input: it's just the sequence of observations.
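As a concrete sketch: here is a minimal bottom-up Viterbi decoder that processes the trellis in time order, i.e. in topological order. The weather-style parameters at the bottom are made up purely for illustration.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for `obs`.

    Nodes (s, t) are processed in topological order, which for an
    HMM trellis is just increasing t: the order of the observations.
    """
    # V[t][s] = log-probability of the best path ending in state s at time t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor over all input arcs into node (s, t)
            prev, lp = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1],
            )
            V[t][s] = lp + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Made-up toy HMM: two hidden weather states, three observations
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path = viterbi(["walk", "shop", "clean"], states, start_p, trans_p, emit_p)
```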
There was a pretty funny, to me and my buddy, moment in our Digital Communications course when we were learning this. We were both dual EE/CS students while the rest of the class was straight EE. The professor is drawing out the trellis diagrams and going through this elaborate description of things and most people (including me and my buddy) were completely lost. At some point, though, he and I both had the same lightbulb moment. We turned to each other and simultaneously just whispered “it’s a dynamic programming problem”. And then it was all easy from there.
Doesn't the algorithm work on directed graphs, rather than only DAGs, i.e., also on cyclic ones? That's a difference with Dijkstra, where cycles are excluded from the solution. Plus, it takes the graph and a sequence of transitions.
> Doesn't the algorithm work on directed graphs, rather than only DAGs, i.e., also on cyclic ones? That's a difference with Dijkstra, where cycles are excluded from the solution
With the Viterbi algorithm, you're decoding a sequence of observations, indexed by observation time. At each timestep there's an array of probabilities indexed by the hidden states. State s at time t has some probability of transitioning to state s' at time t+1, defining an arc from the node (s, t) to the node (s', t+1). That defines the "trellis" structure. Since time doesn't loop back on itself, an arc can never point back to an earlier time, so there aren't any cycles in the trellis.
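A tiny illustration of that trellis structure, with a hypothetical 2-state system over 3 timesteps: every arc goes from time t to time t+1, so the graph is acyclic by construction.

```python
# Hypothetical trellis: nodes are (state, time) pairs; arcs connect each
# node at time t to every node at time t+1.
states = ["s0", "s1"]
T = 3
arcs = [((s, t), (s2, t + 1))
        for t in range(T - 1) for s in states for s2 in states]

# every arc strictly increases t, hence no arc can close a cycle
assert all(src[1] < dst[1] for src, dst in arcs)
```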
As lqet explains, it's essentially computing a shortest path through the DAG.
The Viterbi algorithm could be reimplemented as Dijkstra's algorithm. But that wouldn't gain anything, and it'd make it harder to compute efficiently -- with Dijkstra's algorithm there's a sequential dependency: you pop a minimal-length path off the priority queue and expand it before you're able to pop and start processing the next path (as the path you're expanding might push the next minimal-length path onto the queue). The bottom-up Viterbi computation is closer to a BFS, where you could process the whole layer of arcs corresponding to timestep t in parallel -- apart from the argmin over input arcs at each node (s, t+1) in the next layer.
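A small sketch of that layer-parallel structure, with made-up log-probabilities: each state's score in the next layer is an independent max over its input arcs, so nothing stops you from computing every state of layer t+1 at once.

```python
# Hypothetical one-layer Viterbi update in log space.
scores_t = {"s0": -1.0, "s1": -2.0}  # best log-prob per state at time t
trans = {("s0", "s0"): -0.5, ("s0", "s1"): -1.5,   # log P(s -> s')
         ("s1", "s0"): -1.0, ("s1", "s1"): -0.3}

def update(s_next):
    # max over input arcs into node (s_next, t+1); depends only on the
    # previous layer, not on other states of the new layer
    return max(scores_t[s] + trans[(s, s_next)] for s in scores_t)

# each entry is independent, so this dict comprehension could be a
# parallel map over states
scores_t1 = {s: update(s) for s in scores_t}
```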
Another small difference is that the Viterbi algorithm deals with node weights as well as arc weights, corresponding to the probability of emitting the observation y_t at timestep t, supposing the system had been in the hidden state s_i at time t. I think that can be dealt with by absorbing the node weights into the arc weights -- each weight is a log probability, and log probabilities can be added.
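Concretely, with made-up numbers: fold the emission (node) log-probability at (s', t) into every arc that enters that node, and the combined value is still a single non-negative shortest-path weight.

```python
import math

# Hypothetical arc s -> s' and emission at the destination node
trans_logp = math.log(0.7)  # log P(s -> s'), the arc weight
emit_logp = math.log(0.5)   # log P(y_t | s'), the "node weight" at (s', t)

# absorb the node weight into the arc weight; negate for shortest-path form
arc_weight = -(trans_logp + emit_logp)

# same as weighting the arc by the product of the two probabilities
assert math.isclose(arc_weight, -math.log(0.7 * 0.5))
```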