Mask-free video text removal

TextAway removes overlaid text without masks.

An end-to-end text-aware generation framework that restores clean videos directly from corrupted inputs, without OCR, text detection, or external mask generation at inference time.

Yonsei University
Paired benchmark: text-overlaid video (corrupted input) vs. clean source video (clean target).

Core idea

Text removal is not just binary-mask inpainting.

Overlaid text can be semi-transparent, anti-aliased, temporally varying, and blended with complex backgrounds. TextAway treats it as soft-overlay restoration rather than hard-object removal.
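To make the soft-overlay view concrete, here is a minimal sketch of per-pixel alpha compositing, assuming the standard "over" blending model. The function name and sample values are illustrative and are not taken from the TextAway code. A binary mask only says where text is, while a fractional alpha says how much background signal still survives at each pixel.

```python
def composite(background, text_color, alpha):
    """Blend text over background per pixel: out = alpha * text + (1 - alpha) * background."""
    return [a * t + (1.0 - a) * b for b, t, a in zip(background, text_color, alpha)]


# One row of grayscale pixels in [0, 1]; plain lists stand in for image tensors.
background = [0.20, 0.40, 0.60, 0.80]
text_color = [1.00, 1.00, 1.00, 1.00]   # white text
alpha      = [0.00, 0.25, 0.50, 1.00]   # untouched, soft edge, semi-transparent, opaque

corrupted = composite(background, text_color, alpha)

# A hard mask (alpha > 0) flags three pixels as "text", even though two of
# them still carry partial background information that can be recovered.
hard_mask = [1 if a > 0 else 0 for a in alpha]
```

Only the fully opaque pixel has lost its background entirely; the others are mixtures, which is why treating the whole region as a hole to inpaint discards usable signal.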

Mask-free inference

The model takes only the corrupted video and directly predicts a clean output. OCR, text detectors, and external masks are not required at test time.

Simple V2V adaptation

A pretrained text-to-video flow matching backbone is converted into a video-to-video restoration model through latent concatenation and scale-preserving initialization.
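One plausible reading of this adaptation is sketched below. The first projection layer is widened to accept the extra condition channels, the pretrained weights are kept for the noisy-target channels, and the new columns start at zero so the layer initially reproduces the pretrained output. The source does not spell out its scale-preserving initialization, so the zero-init choice and every name here are assumptions.

```python
def expand_input_weights(pretrained, extra_in):
    """Widen an (out x in) weight matrix by `extra_in` zero-initialized input columns."""
    return [row + [0.0] * extra_in for row in pretrained]


def linear(weights, x):
    """Apply y = W x for a single latent vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def concat_channels(noisy_target, corrupted):
    """Channel-wise concatenation of the two latents at one spatial location."""
    return noisy_target + corrupted


# Hypothetical pretrained projection: 2 output features, 3 latent channels.
pretrained = [[0.5, -1.0, 2.0],
              [1.5,  0.0, -0.5]]
adapted = expand_input_weights(pretrained, extra_in=3)

noisy_target = [0.1, 0.2, 0.3]
corrupted    = [9.0, -4.0, 7.0]

before = linear(pretrained, noisy_target)
after  = linear(adapted, concat_channels(noisy_target, corrupted))
# At initialization the condition channels contribute nothing,
# so `after` matches `before` and the activation scale is unchanged.
```

Under this assumption, fine-tuning starts from the pretrained model's behavior and learns to use the corrupted-input channels gradually.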

Text-aware training

Rendered text masks are used only during training for weighted flow matching and an auxiliary text-awareness branch.
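A minimal sketch of region-weighted flow matching follows, assuming a straight-line (rectified-flow style) path between noise and the clean latent and a simple `1 + text_weight * mask` weighting. The path, the weighting form, and the value of `text_weight` are assumptions for illustration; the source states only that rendered masks weight the loss during training.

```python
def interpolate(noise, clean, t):
    """Point on the straight path from noise (t=0) to the clean latent (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def target_velocity(noise, clean):
    """Velocity of the straight path: d x_t / d t = clean - noise."""
    return [c - n for n, c in zip(noise, clean)]


def weighted_fm_loss(pred_velocity, noise, clean, mask, text_weight=4.0):
    """Weighted mean squared error; text pixels (mask=1) count (1 + text_weight) times."""
    target = target_velocity(noise, clean)
    weights = [1.0 + text_weight * m for m in mask]
    errors = [w * (p - v) ** 2 for w, p, v in zip(weights, pred_velocity, target)]
    return sum(errors) / sum(weights)


# Toy 4-element latent; the last two positions lie under rendered text.
noise = [0.0, 0.0, 0.0, 0.0]
clean = [1.0, 1.0, 1.0, 1.0]
mask  = [0.0, 0.0, 1.0, 1.0]

x_t = interpolate(noise, clean, 0.5)       # what the backbone would see at t = 0.5

pred_bg_error   = [0.0, 1.0, 1.0, 1.0]     # one mistake in the background
pred_text_error = [1.0, 1.0, 0.0, 1.0]     # the same mistake under text

loss_bg   = weighted_fm_loss(pred_bg_error, noise, clean, mask)
loss_text = weighted_fm_loss(pred_text_error, noise, clean, mask)
# The identical error costs more when it falls inside the text region.
```

The mask appears only inside the loss, so nothing in this sketch has to be available at inference time.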

Method

From corrupted video to clean restoration.

Training pairs are synthesized by rendering diverse overlays on clean videos. Both the clean target and the corrupted input are encoded into latent space; the noised target latent is then concatenated channel-wise with the corrupted-input latent and processed by the adapted conditional flow matching backbone.

1. Clean video: Start from a clean source video that acts as the reconstruction target.
2. Text rendering: Composite multilingual, moving, transparent, or stylized text overlays.
3. Latent concat: Concatenate noisy target latent and corrupted input latent channel-wise.
4. Text-aware loss: Use training-only masks for region-weighted flow matching and auxiliary supervision.
5. Mask-free output: At inference, decode a clean video from the corrupted input alone.
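The final step can be sketched as Euler integration of a learned velocity field, with the corrupted latent concatenated to the current state at every step. The velocity function below is a toy stand-in rather than the TextAway backbone, and the step count and time convention (noise at t=0, clean at t=1) are assumptions.

```python
def toy_velocity(model_input, t):
    """Stand-in for the backbone: splits the concatenated input and points
    from the current state toward a hypothetical clean estimate."""
    half = len(model_input) // 2
    state, corrupted = model_input[:half], model_input[half:]
    clean_estimate = [0.5 * c for c in corrupted]   # hypothetical "restoration"
    # Velocity that reaches clean_estimate at t = 1 along a straight line.
    return [(e - s) / (1.0 - t) for e, s in zip(clean_estimate, state)]


def sample(corrupted_latent, init_noise, velocity_fn, steps=8):
    """Euler-integrate dx/dt = v([x, corrupted], t) from t=0 to t=1.

    Only the corrupted latent conditions the model; no mask is passed in.
    """
    x = list(init_noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x + corrupted_latent, t)    # list join = channel-wise concat
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


corrupted_latent = [2.0, -4.0]
init_noise = [0.3, -0.7]
restored_latent = sample(corrupted_latent, init_noise, toy_velocity)
```

In a real pipeline the restored latent would then be passed through the video decoder; that stage is omitted here.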
Figure: Overview of the TextAway framework. TextAway uses rendered masks only during training, while inference requires only the corrupted input video.

Quantitative results

Strong text-region and boundary restoration.

Region-specific metrics reveal failures that full-frame metrics can hide. TextAway improves text-region and boundary-band restoration while preserving non-text background regions competitively.

Mask          Method       PSNR ↑  SSIM ↑  LPIPS ↓  VFID ↓  TR-PSNR ↑  TR-SSIM ↑  BG-PSNR ↑  BG-SSIM ↑  BB-PSNR ↑  BB-SSIM ↑
None          Ours         34.56   0.947   0.055    3.87    27.63      0.920      34.80      0.944      32.04      0.943
GoMatching++  ProPainter   26.09   0.900   0.155    37.89   15.28      0.722      35.95      0.965      21.90      0.834
GoMatching++  MiniMax      25.27   0.877   0.168    37.01   15.48      0.710      31.98      0.938      21.44      0.805
GoMatching++  DiffuEraser  25.33   0.877   0.178    39.21   15.26      0.693      32.66      0.944      21.09      0.793
GT mask       ProPainter   30.96   0.914   0.105    14.28   19.66      0.794      38.03      0.968      27.27      0.884
GT mask       MiniMax      29.56   0.897   0.105    9.44    21.07      0.793      32.87      0.939      24.65      0.832
GT mask       DiffuEraser  29.15   0.894   0.121    9.86    20.22      0.768      32.70      0.943      23.19      0.808
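As a rough sketch of how region-restricted scores of this kind can be computed, the snippet below evaluates PSNR only over pixels selected by a mask and derives a boundary band as dilation minus erosion of the text mask. Reading TR, BG, and BB as text region, background, and boundary band follows the paragraph above the table; the band width, the 4-neighbour structuring element, and the peak value of 1.0 are assumptions rather than the benchmark's actual protocol.

```python
import math


def masked_psnr(pred, target, mask, peak=1.0):
    """PSNR computed only over pixels where mask is 1."""
    sq = [(p - t) ** 2
          for pr, tr, mr in zip(pred, target, mask)
          for p, t, m in zip(pr, tr, mr) if m]
    if not sq:
        raise ValueError("mask selects no pixels")
    mse = sum(sq) / len(sq)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def _neighbours(mask, y, x):
    """Values of a pixel and its 4-neighbours (out-of-bounds treated as 0)."""
    h, w = len(mask), len(mask[0])
    pts = [(y, x), (y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return [mask[j][i] if 0 <= j < h and 0 <= i < w else 0 for j, i in pts]


def dilate(mask):
    return [[int(any(_neighbours(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def erode(mask):
    return [[int(all(_neighbours(mask, y, x))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def boundary_band(mask):
    """Pixels within one step of the text edge: dilation minus erosion."""
    d, e = dilate(mask), erode(mask)
    return [[dv - ev for dv, ev in zip(dr, er)] for dr, er in zip(d, e)]


# 5x5 toy frame with a 3x3 text block in the centre.
text_mask = [[1 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)]
             for y in range(5)]
bg_mask = [[1 - m for m in row] for row in text_mask]
band = boundary_band(text_mask)

target = [[0.5] * 5 for _ in range(5)]
# Prediction is perfect in the background and off by 0.1 under the text.
pred = [[0.6 if m else 0.5 for m in row] for row in text_mask]

tr_psnr = masked_psnr(pred, target, text_mask)   # penalises the text-region error
bg_psnr = masked_psnr(pred, target, bg_mask)     # infinite: background untouched
bb_psnr = masked_psnr(pred, target, band)        # mixes edge pixels from both sides
```

Because the text region is usually a small fraction of the frame, an error like the one above barely moves full-frame PSNR while dominating the text-region score, which is the motivation for reporting the regions separately.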
Video results

Comparison carousel.

Each group of samples isolates a different scenario: easy overlays, dynamic temporal effects, large or complex normal overlays, and preservation of non-target scene text.
