Detecting segments of interest from an input sequence is a challenging problem
which often requires not only good knowledge of individual target segments, but
also contextual understanding of the entire input sequence and the relationships
between the target segments. To address this problem, we propose the Sequence-to-Segments
Network (S2N), a novel end-to-end sequential encoder-decoder architecture.
S2N first encodes the input into a sequence of hidden states that progressively
capture both local and holistic information. It then employs a novel decoding
architecture, called Segment Detection Unit (SDU), that integrates the decoder
state and encoder hidden states to detect segments sequentially. During training,
we formulate the assignment of predicted segments to ground truth as a bipartite
matching problem and use the Earth Mover’s Distance to calculate the localization
errors. Experiments on temporal action proposal and video summarization show
that S2N achieves state-of-the-art performance on both tasks.
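
To make the assignment step concrete, here is a minimal sketch of matching predicted segments to ground-truth segments as a minimum-cost bipartite matching. The cost used here (1 minus temporal IoU), the helper names `iou` and `match_segments`, and the toy segments are assumptions for illustration; the abstract does not specify the cost. The sketch uses exhaustive search to stay dependency-free; the Hungarian algorithm solves the same problem in polynomial time.

```python
from itertools import permutations


def iou(a, b):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def match_segments(predicted, ground_truth):
    """Return (pred_idx, gt_idx) pairs minimising the total (1 - IoU) cost.

    Assumes len(predicted) >= len(ground_truth). Exhaustive search, so it is
    only suitable for small inputs; it is a stand-in for a Hungarian solver.
    """
    best_cost, best_pairs = float("inf"), []
    # Each candidate assigns a distinct prediction to every ground-truth segment.
    for pred_ids in permutations(range(len(predicted)), len(ground_truth)):
        cost = sum(
            1.0 - iou(predicted[p], ground_truth[g])
            for g, p in enumerate(pred_ids)
        )
        if cost < best_cost:
            best_cost = cost
            best_pairs = sorted((p, g) for g, p in enumerate(pred_ids))
    return best_pairs


# Hypothetical toy data: three predictions, two ground-truth segments.
predicted = [(0.0, 4.0), (10.0, 15.0), (5.0, 9.0)]
ground_truth = [(5.0, 10.0), (0.0, 5.0)]
pairs = match_segments(predicted, ground_truth)
```

In this toy case, prediction 1 overlaps no ground-truth segment and is left unmatched, so a training loss could treat it as a negative.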
Contributions
Sequence-to-Segments Network (S2N): an end-to-end network architecture for detecting segments in a sequence.
Hungarian matching: customized for matching multiple predicted segments with the ground truth.
Earth Mover’s Distance: models the segment localization loss.
State-of-the-art performance: on both video summarization and temporal action proposal tasks.
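
As a rough illustration of why the Earth Mover’s Distance suits localization, the sketch below computes it between two distributions over positions on a one-dimensional grid. It assumes unit spacing between positions and equal total mass, in which case the distance equals the sum of absolute differences between the two cumulative distributions. The function name `emd_1d` and the toy distributions are hypothetical; the exact loss used in S2N may differ.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same 1-D grid.

    Assumes unit spacing between bins and equal total mass in p and q.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    # In 1-D, the optimal transport cost is the L1 distance between the CDFs.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Hypothetical example: the true boundary is at position 2 of 5.
target = [0.0, 0.0, 1.0, 0.0, 0.0]
near = [0.0, 0.1, 0.8, 0.1, 0.0]  # mass concentrated around the true boundary
far = [0.8, 0.1, 0.1, 0.0, 0.0]   # mass concentrated two positions away

near_cost = emd_1d(near, target)
far_cost = emd_1d(far, target)
```

Unlike a per-position loss such as cross-entropy, this distance grows with how far the predicted mass lies from the true boundary, so `far_cost` exceeds `near_cost`.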