Segment-CNN (S-CNN) is a segment-based deep learning framework for temporal action localization in untrimmed long videos.
This code has been tested on Ubuntu 14.04 with an NVIDIA GTX 980 GPU (4GB memory) for models based on C3D-v1.0, and with an NVIDIA Titan X GPU (12GB memory) for models based on C3D-v1.1.
The current code suffices to run the demo, reproduce our experimental results, and train your own models. Please use "Issues" to ask questions or report bugs. Thanks. [Mar. 2019: we have stopped maintaining new issues for this repository because many people have successfully reproduced our results and most common questions have already been raised and addressed in the closed issues.]
License
S-CNN is released under the MIT License (refer to the LICENSE file for details).
Citing
If you find S-CNN useful, please consider citing:
@inproceedings{scnn_shou_wang_chang_cvpr16,
author = {Zheng Shou and Dongang Wang and Shih-Fu Chang},
title = {Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs},
year = {2016},
booktitle = {CVPR}
}
@article{tran2017convnet,
title={Convnet architecture search for spatiotemporal feature learning},
author={Tran, Du and Ray, Jamie and Shou, Zheng and Chang, Shih-Fu and Paluri, Manohar},
journal={arXiv preprint arXiv:1708.05038},
year={2017}
}
We build this repo based on C3D and the THUMOS Challenge 2014. Please cite the following papers as well:
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatiotemporal Features with 3D Convolutional Networks, ICCV 2015.
Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, Caffe: Convolutional Architecture for Fast Feature Embedding, arXiv 2014.
A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, Large-scale Video Classification with Convolutional Neural Networks, CVPR 2014.
@misc{THUMOS14,
author = "Jiang, Y.-G. and Liu, J. and Roshan Zamir, A. and Toderici, G. and Laptev, I. and Shah, M. and Sukthankar, R.",
title = "{THUMOS} Challenge: Action Recognition with a Large Number of Classes",
howpublished = "\url{http://crcv.ucf.edu/THUMOS14/}",
year = {2014}
}
./experiments/THUMOS14/network_proposal/result/res_seg_swin.mat: contains the output results of the proposal network. We keep each segment whose confidence score of being an action is >= 0.7 as a candidate segment to feed into the following localization network;
./experiments/THUMOS14/network_localization/result/res_seg_swin.mat: contains the output results of the localization network;
evaluate mAP: run ./experiments/THUMOS14/eval/eval_scnn_thumos14.m; results are stored in ./experiments/THUMOS14/eval/res_scnn_thumos14.mat. We vary the overlap (IoU) threshold used in evaluation from 0.1 to 0.5.
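For quick inspection of these result files, here is a minimal MATLAB sketch of loading res_seg_swin.mat and applying the 0.7 confidence threshold. The variable name seg_swin and the column layout below are assumptions for illustration only; check the actual file for the real layout.

```matlab
% A minimal sketch (not the released code) of inspecting the proposal output.
% ASSUMPTION: the .mat file stores a matrix named `seg_swin` whose last column
% is the confidence of being an action; verify the real variable name and
% column order before relying on this.
data = load('./experiments/THUMOS14/network_proposal/result/res_seg_swin.mat');
seg  = data.seg_swin;                 % hypothetical variable name
conf = seg(:, end);                   % assumed: last column = action confidence
candidates = seg(conf >= 0.7, :);     % keep segments with confidence >= 0.7
fprintf('%d of %d segments kept as candidates\n', size(candidates, 1), size(seg, 1));
```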
Our pre-trained models and pre-computed results of S-CNN (based on C3D-v1.1) on the THUMOS Challenge 2014 action detection task:
Models:
./models/c3d_resnet18_sports1m_r2_iter_2800000.caffemodel: C3D model pre-trained on the Sports1M dataset by Tran et al.;
./experiments/THUMOS14_Res3D/network_proposal/result/res_seg_swin.mat: contains the output results of the proposal network. We keep each segment whose confidence score of being an action is >= 0.7 as a candidate segment to feed into the following localization network;
./experiments/THUMOS14_Res3D/network_localization/result/res_seg_swin.mat: contains the output results of the localization network;
evaluate mAP: run ./experiments/THUMOS14_Res3D/eval/eval_scnn_thumos14.m; results are stored in ./experiments/THUMOS14/eval/res_scnn_thumos14.mat. We vary the overlap (IoU) threshold used in evaluation from 0.3 to 0.7.
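To make the overlap measure concrete, here is an illustrative MATLAB sketch of temporal IoU between one predicted segment and one ground-truth segment; it is not the actual evaluation code in eval_scnn_thumos14.m.

```matlab
% Illustrative only: temporal IoU between one predicted segment and one
% ground-truth segment, each given as [start_frame, end_frame].
function iou = temporal_iou(pred, gt)
    inter_len = max(0, min(pred(2), gt(2)) - max(pred(1), gt(1)));
    union_len = (pred(2) - pred(1)) + (gt(2) - gt(1)) - inter_len;
    iou = inter_len / union_len;
end
% Example: temporal_iou([100 200], [150 250]) returns 0.3333, so such a
% detection counts as correct at IoU threshold 0.3 but not at 0.5 or 0.7.
```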
We provide the parameter settings and the network architecture definitions inside ./experiments/THUMOS14/network_proposal/, ./experiments/THUMOS14/network_classification/, and ./experiments/THUMOS14/network_localization/, respectively.
We also provide a sample input data file to illustrate the input data file list format, which is slightly different from C3D's:
as in C3D, each row still corresponds to one input segment
C3D_sample_rate (used for the proposal and classification networks):
stepsize: used to adjust the window length. It measures the step between two consecutive frames within one segment: the frame index of the current frame + stepsize = the frame index of the next frame. Note that each segment consists of 16 frames in total (see the sketch below).
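To make the stepsize field concrete, here is a small MATLAB sketch that enumerates the 16 frame indices of one segment; start_frame and stepsize are example values standing in for one row of the list file.

```matlab
% Sketch: enumerate the 16 frame indices covered by one segment in the list file.
% With a given stepsize, consecutive sampled frames are `stepsize` apart, so the
% segment spans a temporal window of 16 * stepsize frames.
start_frame = 1;                          % example value from one row of the list
stepsize    = 4;                          % example value; window length = 64 frames
frame_idx   = start_frame + stepsize * (0:15);
disp(frame_idx);                          % 1  5  9 ... 61
```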
We provide the parameter settings and the network architecture definitions inside ./experiments/THUMOS14_Res3D/network_proposal/, ./experiments/THUMOS14_Res3D/network_classification/, and ./experiments/THUMOS14_Res3D/network_localization/, respectively.
We also provide a sample input data file to illustrate the input data file list format, which is slightly different from C3D's:
as in C3D, each row still corresponds to one input segment
C3D_sample_rate (used for the proposal and classification networks):
stepsize: used to adjust the window length. It measures the step between two consecutive frames within one segment: the frame index of the current frame + stepsize = the frame index of the next frame. Note that each segment consists of 16 frames in total.
NOTE: please refer to C3D-v1.1 and Caffe for more general instructions on how to train a 3D CNN model. Res3D uses 8 frames per clip to produce one label. Because S-CNN samples 16 frames out of a multi-scale temporal window that can be up to 512 frames long, we still keep 16 frames per clip in S-CNN (see the sketch below).
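As a concrete illustration of the last point, the sketch below shows how multi-scale window lengths up to 512 frames can all be reduced to 16 sampled frames by choosing stepsize = window_length / 16; the particular set of window lengths is an assumption for illustration.

```matlab
% Sketch: each multi-scale window is subsampled down to 16 frames per clip,
% i.e. stepsize = window_length / 16. The specific window lengths listed here
% are illustrative, not a guaranteed match to the released configuration.
window_lengths = [16 32 64 128 256 512];
stepsizes      = window_lengths / 16;     % 1 2 4 8 16 32
for i = 1:numel(window_lengths)
    fprintf('window of %3d frames -> stepsize %2d (16 sampled frames)\n', ...
            window_lengths(i), stepsizes(i));
end
```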