Improving Video Understanding and Generation with Better Captions

1 University of Science and Technology of China 2 The Chinese University of Hong Kong 3 Peking University 4 Shanghai AI Laboratory

* Equal contribution. Corresponding authors.
§ Work done during an internship in Shanghai AI Laboratory.

🚀 A large-scale, highly descriptive video-text dataset, with 40K captions annotated by GPT4V and 4.8M captions annotated by our ShareCaptioner-Video. The videos total about 300 hours and 3000 hours, respectively!
🚀 A general video captioner for various video durations, resolutions, and aspect ratios, approaching GPT4-Vision's captioning capability and featuring two inference modes targeting quality and efficiency, respectively.
🚀 A superior large multi-modal model, ShareGPT4Video-8B, trained in only 5 hours on 8xA100 GPUs.
🚀 Improved Text-to-Video performance with high-quality video captions generated by our ShareCaptioner-Video.

🔥What's New
  • [2024.07.01] The batch-inference code for ShareCaptioner-Video is available now!
  • [2024.06.12] The Web Demo and Local Demo of ShareCaptioner-Video are available now!
  • [2024.06.11] The Web Demo and Local Demo of ShareGPT4Video-8B are available now!
  • [2024.06.07] Our paper has been featured in 🤗HuggingFace Daily Papers and ranked 1st.
  • [2024.06.07] The Paper is released!
  • [2024.06.06] The ShareCaptioner-Video model is released!
  • [2024.05.27] The ShareGPT4Video-8B model is released!
  • [2024.05.08] Project Page and ShareGPT4Video Dataset are released!

Demo Video


We present the ShareGPT4Video series, aiming to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos with various lengths and sources, developed through carefully designed data filtering and annotation strategies. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, with 4.8M high-quality aesthetic videos annotated by it. 3) ShareGPT4Video-8B, a simple yet superb LVLM that reaches SOTA performance on three advanced video benchmarks.

To achieve this, setting aside costly and non-scalable human annotation, we find that using GPT4V to caption videos with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) Precise understanding of inter-frame temporal changes. 2) Detailed description of intra-frame content. 3) Frame-number scalability for arbitrary-length videos. To this end, we meticulously designed a differential video captioning strategy, which is stable, scalable, and efficient for generating captions for videos with arbitrary resolutions, aspect ratios, and lengths. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories. The resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a superior captioner capable of efficiently generating high-quality captions for arbitrary videos. We annotated 4.8M aesthetically appealing videos with it and verified their effectiveness on a 10-second text-to-video generation task.

For video understanding, we verified the effectiveness of ShareGPT4Video on several current LVLM architectures and present our superb new LVLM, ShareGPT4Video-8B. All the models, strategies, and annotations will be open-sourced (we do not hold the copyright for any video and will provide the link-annotation pairs for research-only usage), and we hope this project can serve as a pivotal resource for advancing both the LVLM and T2VM communities.

ShareGPT4Video Dataset

Details and attributes of the ShareGPT4Video

(a) The proposed ShareGPT4Video dataset contains a large volume of high-quality video-caption pairs collected from diverse sources, with 40K captions from GPT4V and 4.8M captions from our ShareCaptioner-Video. (b) We illustrate in detail the process of harnessing the multi-modal image model GPT4V to generate high-quality captions for videos. (c) Our unique captioning strategy enables re-captioning of sub-clips by reusing their differential captions.
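
The captioning flow in (b) and the sub-clip reuse in (c) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `describe_frame`, `describe_change`, and `summarize` are hypothetical stand-ins for calls to an image-capable model such as GPT4V, the keyframes are assumed to be already extracted, and the way sub-clip pieces are assembled is one plausible reading of "reusing their differential captions".

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical model interfaces (assumptions, not a released API):
#   DescribeFrame(frame)                      -> detailed caption of one keyframe
#   DescribeChange(prev_frame, prev_text, cur) -> what changed between two keyframes
#   Summarize(pieces)                          -> one caption from ordered text pieces
DescribeFrame = Callable[[object], str]
DescribeChange = Callable[[object, str, object], str]
Summarize = Callable[[List[str]], str]


@dataclass
class DifferentialCaptions:
    first_frame: str    # caption of keyframe 0
    changes: List[str]  # changes[i] describes keyframe i -> keyframe i + 1


def caption_differentially(
    keyframes: Sequence[object],
    describe_frame: DescribeFrame,
    describe_change: DescribeChange,
) -> DifferentialCaptions:
    """Caption a video with a sliding window of two keyframes."""
    if not keyframes:
        raise ValueError("need at least one keyframe")
    first = describe_frame(keyframes[0])
    changes: List[str] = []
    prev_text = first
    # Each call sees only two frames plus the previous text, so the per-call
    # input size stays fixed no matter how long the video is.
    for prev, cur in zip(keyframes, keyframes[1:]):
        delta = describe_change(prev, prev_text, cur)
        changes.append(delta)
        prev_text = delta
    return DifferentialCaptions(first, changes)


def recaption_subclip(
    caps: DifferentialCaptions, start: int, end: int, summarize: Summarize
) -> str:
    """Caption keyframes start..end (inclusive) from stored differential captions."""
    if not 0 <= start < end <= len(caps.changes):
        raise ValueError("sub-clip bounds fall outside the captioned keyframes")
    # Reuse the stored text instead of querying the frame-level model again.
    pieces = [caps.first_frame] if start == 0 else []
    pieces.extend(caps.changes[start:end])
    return summarize(pieces)
```

In this sketch, only `summarize` is called when a sub-clip is re-captioned, which is why storing the per-step differential captions makes sub-clip captions cheap to produce.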

Dataset generated by GPT-4V

Data Source   Samples   Total Time (hours)   Avg. Time (sec)   Avg. Length (#words)
Panda-70M     27092     204.4                27.2              291.2
Pexels        8487      52.2                 22.1              254.9
Pixabay       2725      20.3                 26.9              209.3
BDD100K       608       6.6                  39.0              371.3
Mixkit        745       3.6                  17.5              213.9
Ego4D         521       3.9                  27.1              298.9
Total         40178     291                  26.6              273.3
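
As a quick sanity check on the table, the short sketch below recomputes the Total row's sample count and hours from the per-source rows, and derives a mean clip length from hours and samples. The row values are copied from the table; the helper names are illustrative only. Because the hours column is rounded to one decimal, the derived per-source clip lengths are only expected to land close to the reported Avg. Time values, not to match them exactly.

```python
# (source, samples, total_hours) copied from the table above
ROWS = [
    ("Panda-70M", 27092, 204.4),
    ("Pexels", 8487, 52.2),
    ("Pixabay", 2725, 20.3),
    ("BDD100K", 608, 6.6),
    ("Mixkit", 745, 3.6),
    ("Ego4D", 521, 3.9),
]


def totals(rows):
    """Sum samples and hours over all sources (hours rounded to 0.1)."""
    samples = sum(n for _, n, _ in rows)
    hours = sum(h for _, _, h in rows)
    return samples, round(hours, 1)


def mean_clip_seconds(samples, hours):
    """Average clip duration in seconds implied by total hours and sample count."""
    return hours * 3600 / samples


total_samples, total_hours = totals(ROWS)
print(f"samples={total_samples}, hours={total_hours}")
for name, n, h in ROWS:
    print(f"{name}: ~{mean_clip_seconds(n, h):.1f} s per clip")
```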

Comprehensive video-caption dataset: (a) The dataset covers a broad spectrum of content, including wildlife, cooking, sports, scenery, ego-centric human activities, auto-driving scenarios, etc. (b) The dataset includes videos ranging from 2 seconds to 2 minutes in length. (c) The captions primarily range from 200 to 400 words, providing rich temporal information that serves video understanding and generation tasks well.

ShareCaptioner-Video

ShareCaptioner-Video with 4 capabilities

ShareCaptioner-Video is an exceptional four-in-one video captioning model with the following capabilities: Fast Captioning, Sliding Captioning, Clip Summarizing, and Prompt Re-Captioning.

Dataset generated by ShareCaptioner-Video

Data Source   Samples   Total Time (hours)   Avg. Length (#words)
Mixkit        56k       42.0                 104.8
Pixabay       652k      353.3                102.5
Pexels        4104k     2561.9               100.5
Total         4812k     2957.2               102.6

Statistics of the 4.8M high-quality video-caption pairs generated by our ShareCaptioner-Video.

A comparison of caption quality from various sources

Mistakes within the captions are highlighted in red, whereas detailed and accurate parts are emphasized in blue.

Text-to-Video Cases

We have generated a large volume of video-caption pairs with our ShareCaptioner-Video and trained a text-to-video model with the Open-Sora-Plan repository. Here are some interesting cases:

Video Caption Cases

📃 BibTeX

            title={ShareGPT4Video: Improving Video Understanding and Generation with Better Captions},
            author={Chen, Lin and Wei, Xilin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Lin, Bin and Tang, Zhenyu and Yuan, Li and Qiao, Yu and Lin, Dahua and Zhao, Feng and Wang, Jiaqi},
            journal={arXiv preprint arXiv:2406.04325},