ByteDance Bids for Video Leadership

ByteDance adds its state-of-the-art Seedance 2.0 video model to CapCut, while OpenAI retreats

Through a rainy window, a pizza worker prepares food beneath menu boards and a red neon "Pizza" sign.

As OpenAI prepares to shut down Sora, ByteDance has made its own video generation model available to hundreds of millions of users.

What’s new: ByteDance added Seedance 2.0, its multimodal video generator, to its popular video-editing app CapCut. Launched earlier this year in China, the model now reaches paying CapCut users in Southeast Asia, Latin America, Africa, the Middle East, parts of Europe, Japan, and the United States.

  • Input/output: Text, images, audio, and video in (up to 3 video clips, 9 images, and 3 audio clips), synchronized video and audio out (4 to 15 seconds at 480 or 720 pixels on the shorter edge in 6 aspect ratios: 21:9, 16:9, 4:3, 1:1, 3:4, and 9:16)
  • Features: Lip-synced dialogue in multiple languages, ambient sound, music, multiple camera shots with cuts in a single clip, camera and lighting controlled by prompts, outputs marked by invisible watermark, blocking of input images that contain real faces or copyrighted characters (via CapCut)
  • Performance: Within the top two on the Arena AI and Artificial Analysis video leaderboards
  • Availability/price: Via CapCut (Jianying in China) paid tier, Dreamina web interface, API via the ByteDance services BytePlus and Volcengine, and third-party providers including Higgsfield.ai for $0.30 per second of output (720 pixels, audio included) or $0.24 per second for faster processing by Seedance 2.0 Fast (see the cost sketch after this list)
  • Undisclosed: Architecture, parameter count, training data, and training methods
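
To put the per-second pricing in perspective, here's a minimal sketch (Python; the rates and clip lengths come straight from the listing above) of what a single clip costs at each tier:

```python
# Cost of a Seedance 2.0 clip at the per-second rates listed above.
STANDARD_RATE = 0.30  # USD per second (720 pixels, audio included)
FAST_RATE = 0.24      # USD per second via Seedance 2.0 Fast

def clip_cost(seconds: int, rate: float) -> float:
    """Return the cost in USD of a clip of the given length."""
    return seconds * rate

for seconds in (4, 15):  # supported clip lengths span 4 to 15 seconds
    print(f"{seconds:>2}s clip: ${clip_cost(seconds, STANDARD_RATE):.2f} standard, "
          f"${clip_cost(seconds, FAST_RATE):.2f} fast")
# Output:
#  4s clip: $1.20 standard, $0.96 fast
# 15s clip: $4.50 standard, $3.60 fast
```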

How it works: Seedance 2.0 extends ByteDance’s earlier work, moving from generating audio and video streams synchronously in parallel to generating both jointly within a unified system. ByteDance’s launch announcement characterizes the architecture as “sparse.”

  • The model accepts video-audio reference input for four tasks: (i) Reference-based generation applies subject, motion, visual effects, and/or style cues to new output. (ii) Editing modifies specified regions, characters, actions, and/or audio within existing video. (iii) Extension produces output that precedes or succeeds existing video. (iv) Combination modes pair these (for example, replacing the subject in an existing video with one from a reference image). A hypothetical request illustrating these modes appears after this list.
  • Audio is generated simultaneously with video, producing stereo dialogue, sound effects, and background audio.
  • The model generates sequential shots and cuts in a single pass rather than generating and assembling separate clips, which helps to maintain character and scene consistency.
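
ByteDance hasn't published the request format for these modes, so the following is purely a hypothetical sketch: the endpoint, field names, and parameter values are invented for illustration, but the limits (3 videos, 9 images, 3 audio clips, 4 to 15 seconds, 480 or 720 pixels) mirror the spec above.

```python
import requests  # real HTTP library; the endpoint and schema below are invented

# Hypothetical editing request that swaps a video's subject for one taken
# from a reference image (the "combination" mode described above).
payload = {
    "mode": "edit",  # hypothetical values: reference | edit | extend
    "prompt": "Replace the main character with the person in the reference "
              "image; keep the original camera motion and ambient sound.",
    "reference_videos": ["https://example.com/source_clip.mp4"],   # up to 3
    "reference_images": ["https://example.com/new_character.png"], # up to 9
    "reference_audio": [],   # up to 3
    "duration_seconds": 8,   # supported range: 4-15
    "resolution": 720,       # shorter edge: 480 or 720
    "aspect_ratio": "16:9",  # one of the 6 supported ratios
}

response = requests.post(
    "https://api.example.com/v1/seedance/generate",  # placeholder URL
    json=payload,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    timeout=300,
)
response.raise_for_status()
print(response.json())  # e.g., a job ID or a URL to the finished clip
```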

Performance: Seedance 2.0 ranks first or second in every category on two independent leaderboards that rank models via blind human-preference votes in head-to-head matchups. Alibaba’s HappyHorse-1.0 is its closest challenger on both leaderboards.

  • On Arena AI, Seedance 2.0 achieved 1,460 Elo on text-to-video performance and 1,454 Elo on image-to-video performance, narrowly leading both categories over HappyHorse-1.0 (1,444 Elo in each); the sketch after this list shows just how narrow those margins are. However, the leaderboard labels the Seedance 2.0 and HappyHorse-1.0 results as preliminary.
  • On Artificial Analysis, Alibaba’s HappyHorse-1.0 leads three of four video categories (image-to-video without audio and text-to-video with and without audio), while Seedance 2.0 ranks second. Seedance 2.0 leads image-to-video performance with synchronized audio, achieving 1,182 Elo, ahead of HappyHorse-1.0 (1,168 Elo) and Skywork AI’s SkyReels V4 (1,091 Elo).
  • ByteDance flags limitations in detail stability, “hyper-realism,” audio distortion, multi-subject consistency, text-rendering accuracy, and “complex” editing effects.
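
How narrow are those leads? Under the standard Elo model, a rating gap of d points implies an expected win rate of 1 / (1 + 10^(-d/400)). The sketch below (scores taken from the leaderboard results above) applies that formula:

```python
# Expected head-to-head win rate implied by an Elo rating gap.
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

matchups = [
    ("Seedance 2.0 vs. HappyHorse-1.0, text-to-video (Arena AI)", 1460, 1444),
    ("Seedance 2.0 vs. HappyHorse-1.0, image-to-video (Arena AI)", 1454, 1444),
    ("Seedance 2.0 vs. HappyHorse-1.0, i2v + audio (Artificial Analysis)", 1182, 1168),
    ("Seedance 2.0 vs. SkyReels V4, i2v + audio (Artificial Analysis)", 1182, 1091),
]
for label, a, b in matchups:
    print(f"{label}: {elo_win_prob(a, b):.1%}")
# A 16-point gap implies only a ~52% expected win rate, so the leads over
# HappyHorse-1.0 are narrow; the ~91-point gap over SkyReels V4 implies ~63%.
```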

Yes, but: Shortly after ByteDance released Seedance 2.0 in China, a generated clip that featured likenesses of actors Tom Cruise and Brad Pitt spurred six top Hollywood studios to demand that the company stop training its models on copyrighted works and block users from generating clips that depict them. The dispute remains unresolved. ByteDance added safeguards to CapCut, but it remains unclear whether they extend to outputs generated via third-party APIs.

Behind the news: The video generation market has reshuffled quickly over the past month. U.S. developers have retreated from the consumer market, and Chinese developers have released new models at an accelerating pace.

  • In March, OpenAI announced it would discontinue the Sora app and API. Reports indicated that the company had shifted compute to coding and business products after Sora’s daily active user count fell from about 1 million at launch to under 500,000, while the service cost an estimated $1 million a day to operate.
  • Alibaba’s HappyHorse-1.0 debuted on independent video leaderboards in early April, while it was still undergoing a closed beta test, and rose to first place across multiple categories.
  • Shortly after, Alibaba unveiled HappyOyster, an AI system that generates 3D environments for developing games and films. Users can build environments from text or images and steer them in real time.
  • Tencent open-sourced an updated version of its Hunyuan 3D model the same day.

Why it matters: While competitors offer either a video generator or an editing app, ByteDance owns both. Moreover, its editor has enormous reach: CapCut reportedly has 736 million monthly active users on mobile, making it the second-largest consumer AI product behind only ChatGPT. Seedance 2.0’s arrival on CapCut shows what one company can do when it controls both the model and the distribution channel.

We’re thinking: OpenAI’s withdrawal of Sora points to a hard truth: Given the current cost of computation, AI-generated video is an expensive consumer product.
