ShareGPT4Video

Abstract

We present the ShareGPT4Video series, which aims to facilitate the video understanding of large video-language models (LVLMs) and the video generation of text-to-video models (T2VMs) via dense and precise captions. The series comprises: 1) ShareGPT4Video, 40K GPT4V-annotated dense captions of videos of various lengths and sources, developed through a carefully designed data filtering and annotation strategy. 2) ShareCaptioner-Video, an efficient and capable captioning model for arbitrary videos, which we used to annotate 4.8M high-quality aesthetic videos. 3) ShareGPT4Video-8B, a simple yet strong LVLM that reaches SOTA performance on three advanced video benchmarks. Setting aside costly, non-scalable human annotation, we find that using GPT4V to caption videos with a naive multi-frame or frame-concatenation input strategy leads to less detailed and sometimes temporally confused results. We argue that the challenge of designing a high-quality video captioning strategy lies in three aspects: 1) Precise understanding of inter-frame temporal change. 2) Detailed description of intra-frame content. 3) Frame-number scalability for arbitrary-length videos. To this end, we designed a differential video captioning strategy that is stable, scalable, and efficient for generating captions for videos of arbitrary resolution, aspect ratio, and length. Based on it, we construct ShareGPT4Video, which contains 40K high-quality videos spanning a wide range of categories. The resulting captions encompass rich world knowledge, object attributes, camera movements, and, crucially, detailed and precise temporal descriptions of events. Based on ShareGPT4Video, we further develop ShareCaptioner-Video, a captioner capable of efficiently generating high-quality captions for arbitrary videos. We used it to annotate 4.8M aesthetically appealing videos and verified the effectiveness of these captions on a 10-second text-to-video generation task.
For video understanding, we verified the effectiveness of ShareGPT4Video on several current LVLM architectures and present our new LVLM, ShareGPT4Video-8B. All the models, strategies, and annotations will be open-sourced, and we hope this project can serve as a pivotal resource for advancing both the LVLM and T2VM communities. (We do not hold the copyright for any video and will provide the link-annotation pairs for research-only usage.)

Details and attributes of the ShareGPT4Video

(a) The proposed ShareGPT4Video dataset contains a large volume of high-quality video-caption pairs collected from diverse sources, with 40K captions from GPT4V and 4.8M captions from our ShareCaptioner-Video. (b) We illustrate in detail the process of harnessing the multi-modal image model GPT4V to generate high-quality captions for videos. (c) Our captioning strategy enables the re-captioning of sub-clips by reusing their differential captions.
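The idea behind (b) and (c) can be illustrated with a minimal sketch. It is not the project's implementation: the callables `describe_first`, `describe_change`, and `summarize` are hypothetical stand-ins for calls to a captioning model, and the exact prompts and keyframe selection are not specified here. The sketch only shows the shape of the approach: caption the first keyframe in full, caption each later keyframe as a change relative to the previous one, and build a sub-clip caption from the stored per-frame captions instead of re-running the model on every frame.

```python
from typing import Callable, List, Sequence

Frame = str  # placeholder type; a real pipeline would pass image data


def differential_captions(
    frames: Sequence[Frame],
    describe_first: Callable[[Frame], str],
    describe_change: Callable[[Frame, Frame, str], str],
) -> List[str]:
    """Return one caption per keyframe.

    The first caption describes the first frame in full; every later
    caption describes what changed between two adjacent frames, given
    the previous caption as context. All three callables are assumed.
    """
    if not frames:
        return []
    captions = [describe_first(frames[0])]
    for prev, curr in zip(frames, frames[1:]):
        captions.append(describe_change(prev, curr, captions[-1]))
    return captions


def caption_clip(
    captions: Sequence[str],
    start: int,
    end: int,
    summarize: Callable[[Sequence[str]], str],
) -> str:
    """Caption keyframes [start, end) by reusing stored captions.

    This sketch simply summarizes the stored slice; it makes no extra
    per-frame model calls for the sub-clip.
    """
    if not 0 <= start < end <= len(captions):
        raise ValueError("invalid clip range")
    return summarize(captions[start:end])
```

Under these assumptions the number of per-frame model calls grows linearly with the number of keyframes, and any sub-clip can be captioned with a single summarization call over captions that already exist.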

Dataset generated by GPT-4V

Data Source	Samples	Total Time (hours)	Avg. Time (sec)	Avg. Length (#words)
Panda-70M	27092	204.4	27.2	291.2
Pexels	8487	52.2	22.1	254.9
Pixabay	2725	20.3	26.9	209.3
BDD100K	608	6.6	39.0	371.3
Mixkit	745	3.6	17.5	213.9
Ego4D	521	3.9	27.1	298.9
Total	40178	291	26.6	273.3
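As a quick consistency check on the table above, the following sketch recomputes the Total row from the per-source rows. The numbers are copied from the table; the function name `table_totals` is only illustrative. The two averages in the Total row appear to be unweighted means over the six sources; a sample-weighted mean of clip duration comes out slightly lower, at roughly 26.1 seconds.

```python
from statistics import mean

# (source, samples, total_hours, avg_seconds, avg_words), copied from the table
ROWS = [
    ("Panda-70M", 27092, 204.4, 27.2, 291.2),
    ("Pexels", 8487, 52.2, 22.1, 254.9),
    ("Pixabay", 2725, 20.3, 26.9, 209.3),
    ("BDD100K", 608, 6.6, 39.0, 371.3),
    ("Mixkit", 745, 3.6, 17.5, 213.9),
    ("Ego4D", 521, 3.9, 27.1, 298.9),
]


def table_totals(rows):
    """Recompute summary statistics from the per-source rows."""
    samples = sum(r[1] for r in rows)
    return {
        "samples": samples,
        "total_hours": sum(r[2] for r in rows),
        # Unweighted: every source counts equally, regardless of size.
        "mean_seconds_unweighted": mean(r[3] for r in rows),
        "mean_words_unweighted": mean(r[4] for r in rows),
        # Weighted: every video counts equally.
        "mean_seconds_weighted": sum(r[1] * r[3] for r in rows) / samples,
    }


stats = table_totals(ROWS)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```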

Comprehensive video-caption dataset: (a) The dataset covers a broad spectrum of content, including wildlife, cooking, sports, scenery, ego-centric human activities, auto-driving scenarios, etc. (b) The dataset includes videos ranging from 2 seconds to 2 minutes in length. (c) The captions primarily range from 200 to 400 words, providing rich temporal information that serves video understanding and generation tasks well.

ShareCaptioner-Video with 4 capabilities

ShareCaptioner-Video is a four-in-one video captioning model with the following capabilities: fast captioning, sliding captioning, clip summarizing, and prompt re-captioning.

Dataset generated by ShareCaptioner-Video

Data Source	Samples	Total Time (hours)	Avg. Length (#words)
Mixkit	56k	42.0	104.8
Pixabay	652k	353.3	102.5
Pexels	4104k	2561.9	100.5
Total	4812k	2957.2	102.6

Statistics of the 4.8M high-quality video-caption pairs generated by our ShareCaptioner-Video.

A Comparison of caption quality from various sources

Mistakes within the captions are highlighted in red, whereas detailed and accurate parts are emphasized in blue.

We have generated a large volume of video-caption pairs with our ShareCaptioner-Video and trained a text-to-video model with the Open-Sora-Plan repository. Here are some interesting cases:

The video captures the spectacle of a continuous fireworks show against the backdrop of a starry night sky. It commences with a burst of vibrant reds, greens, purples, and yellows that paint the heavens and cast shimmering reflections upon the water below. As the display progresses, the fireworks evolve, transitioning from the initial array to a focus on radiant oranges, yellows, and fiery reds. These explosions form captivating clusters at the heart of the sky, ascending in breathtaking formations accompanied by trailing plumes of smoke, adding a dramatic flourish to the visual narrative. Throughout the duration, the fireworks maintain their dynamic allure, their patterns and positions evolving to underscore the ongoing spectacle. Meanwhile, the mirrored reflections on the water's surface faithfully echo the colors and shapes above, further enhancing the mesmerizing and ever-changing nature of the display.

A drone camera circles around a beautiful historic church built on a rocky outcropping along the Amalfi Coast, the view showcases historic and magnificent architectural details and tiered pathways and patios, waves are seen crashing against the rocks below as the view overlooks the horizon of the coastal waters and hilly landscapes of the Amalfi Coast Italy, several distant people are seen walking and enjoying vistas on patios of the dramatic ocean views, the warm glow of the afternoon sun creates a magical and romantic feeling to the scene, the view is stunning captured with beautiful photography.

The video presents an abstract composition centered around a hexagonal shape adorned with a starburst pattern of lines, undergoing a series of transformations against a dark backdrop. Initially dominated by shades of blue, particularly within the central hexagon displaying a spiral or tunnel-like motif, the imagery gradually transitions through a spectrum of warm and cool tones. Pink, purple, red, and orange hues are introduced, creating lively contrasts with the blues as the composition evolves. These color shifts generate a dynamic interplay between warm and cool tones, with blues, reds, oranges, and later purples taking turns in prominence, each contributing to distinct visual effects. From cool blues, the palette progresses to warmer tones before returning to a balanced mix of reds and blues, ultimately settling back into cooler blues and purples. Throughout these changes, the central hexagon maintains its spiral or tunnel-like quality, drawing focus towards the center of the frame. The design elements exhibit subtle movements akin to gentle pulsations or breathing, infusing the composition with a sense of dynamism and vitality. Despite these shifts, the geometric and crystalline structure remains intact, occasionally sharpening to enhance clarity. The video concludes with a harmonious blend of blues and purples, offering a serene contrast to the earlier vibrant color combinations while retaining depth through the interplay of light and shadow.

The video segment documented a significant event in Kochi, Kerala, where 2 buildings razed in Kochi. The broadcast began with a split-screen presentation: on one side, thick clouds of dust were seen billowing into the sky, marking the onset of the demolition process, while on the other side, reporter Gopikrishnan provided live coverage, indicated by "BREAKING NEWS" captions and a consistent timestamp of "11:10 AM." The news ticker at the bottom of the screen simultaneously ran other global events, maintaining a flow of information. As the video progresses, the split-screen footage of the razed house turns into a close-up. A notable change in the headline to "KOCHI FLATS RAZED" signaled the demolition's culmination. A brief interlude offered a visual contradiction by showcasing the flats presumably before their demolition, providing a stark before and after comparison. As the video progressed, the left building's collapse initiated a dramatic alteration in the skyline, marked by significant dust plumes. Subsequently, another building was shown partially collapsing amid debris, fully obscured by dust in seconds, with surrounding greenery remaining untouched. This transitioned into a graphic interlude featuring the "India Today" logo, briefly pausing the live footage. Resuming to the aftermath, split imagery displayed the rubble and ongoing smoke. Then, the imagery continued to juxtapose the scenes of destruction against intact high-rise buildings nearby. The narrative was augmented by the revelation that the Supreme Court directed the demolition within a broader national news context. Throughout, the report maintained a real-time approach, threading continuity and urgency across the unfolding event's documentation.

The video begins with an individual seated on a gray couch in a cozy domestic setting, about to unbox a product from a red CCM-branded box placed on a white table in front of them. Initially, the person is seen drinking from a blue can, indicating a casual atmosphere. Soon after, the individual shifts attention from the can to the red box, signifying the start of the unboxing process. The red box, initially closed, gradually becomes the focal point as the person prepares to open it, conveying a build-up of anticipation. As the video progresses, the box is flipped over and then opened, revealing its content still hidden under white tissue paper adorned with prints, adding to the suspense. The individual’s engagement with the box evolves, from initially preparing to open it, to actively delving into its contents. A momentary pause in activity is captured before the anticipation culminates with the individual lifting an object from the box. This object, identifiable by a yellow label, is then examined closely by the person, indicating a thorough inspection or perusal of the product or its packaging. Throughout the video, the surrounding environment remains consistent and undisturbed, with household items like a potted plant and a wall clock maintaining the setting's homely ambiance. The camera’s perspective remains fixed, focusing on the unfolding unboxing event without any movement, thus allowing the viewer to observe the narrative closely. Another partially open brown box is visible beside the main red box, though its role or contents are not elaborated upon. The video encapsulates the anticipation, action, and reveal inherent to unboxing experiences in a home setting.

📃 BibTeX


          @article{chen2024sharegpt4video,
            title={ShareGPT4Video: Improving Video Understanding and Generation with Better Captions},
            author={Chen, Lin and Wei, Xilin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Lin, Bin and Tang, Zhenyu and Yuan, Li and Qiao, Yu and Lin, Dahua and Zhao, Feng and Wang, Jiaqi},
            journal={arXiv preprint arXiv:2406.04325},
            year={2024}
          }
