Mask-free video text removal

TextAway removes overlaid text without masks.

An end-to-end text-aware generation framework that restores clean videos directly from corrupted inputs, without OCR, text detection, or external mask generation at inference time.

Jibin Song, Mingi Kwon, Sooyeon Go, Youngjung Uh

Yonsei University

Paper, code, and data: coming soon.

Paired benchmark: text-overlaid video (corrupted input) vs. clean source video (clean target).

Core idea

Text removal is not just binary-mask inpainting.

Overlaid text can be semi-transparent, anti-aliased, temporally varying, and blended with complex backgrounds. TextAway treats it as soft-overlay restoration rather than hard-object removal.
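To make the soft-overlay view concrete, here is a minimal NumPy sketch of standard alpha compositing. It is an illustration, not the authors' rendering pipeline; the function name `composite_overlay` and the toy values are assumptions. With fractional alpha, every text pixel is a mixture of overlay and background, which is why a binary mask discards recoverable information.

```python
import numpy as np

def composite_overlay(clean, text_rgb, alpha):
    """Alpha-blend a text layer over a clean frame.

    clean, text_rgb: float arrays in [0, 1], shape (H, W, 3).
    alpha: float array in [0, 1], shape (H, W); fractional values model
    semi-transparent or anti-aliased glyph pixels.
    """
    a = alpha[..., None]                      # broadcast alpha over RGB
    return (1.0 - a) * clean + a * text_rgb

clean = np.full((4, 4, 3), 0.2)               # toy gray background
text = np.ones((4, 4, 3))                     # white text layer
alpha = np.zeros((4, 4))
alpha[1:3, 1:3] = 0.5                         # semi-transparent glyph interior
corrupted = composite_overlay(clean, text, alpha)
```

Inside the glyph, half of each pixel value still comes from the background, so restoration there is closer to un-mixing than to filling a hole.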

Mask-free inference

The model takes only the corrupted video and directly predicts a clean output. OCR, text detectors, and external masks are not required at test time.

Simple V2V adaptation

A pretrained text-to-video flow matching backbone is converted into a video-to-video restoration model through latent concatenation and scale-preserving initialization.

Text-aware training

Rendered text masks are used only during training for weighted flow matching and an auxiliary text-awareness branch.
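The V2V adaptation above can be pictured as widening the backbone's input projection so it accepts the concatenated latents. The page does not spell out what "scale-preserving initialization" means. The sketch below assumes one common choice: keep the pretrained weights for the noisy-latent channels and zero-initialize the new columns, so the adapted layer initially reproduces the pretrained output. The helper `expand_input_projection` and all shapes are hypothetical.

```python
import numpy as np

def expand_input_projection(w_pretrained, extra_in):
    """Widen a linear input projection from C to C + extra_in input channels.

    w_pretrained: (out_dim, C) weights from the pretrained model.
    ASSUMPTION: the new columns are zero, so at initialization the widened
    layer returns exactly what the pretrained layer returned.
    """
    out_dim, _ = w_pretrained.shape
    w_new = np.zeros((out_dim, extra_in), dtype=w_pretrained.dtype)
    return np.concatenate([w_pretrained, w_new], axis=1)

rng = np.random.default_rng(0)
C, out_dim, tokens = 16, 32, 10               # toy sizes
w = rng.normal(size=(out_dim, C))
noisy_latent = rng.normal(size=(tokens, C))
cond_latent = rng.normal(size=(tokens, C))    # corrupted-input latent

w_v2v = expand_input_projection(w, extra_in=C)
x = np.concatenate([noisy_latent, cond_latent], axis=-1)  # channel-wise concat
out_pretrained = noisy_latent @ w.T
out_v2v = x @ w_v2v.T
```

Under this assumption the conditioning channels have no effect at step zero and are learned gradually, which keeps early training close to the pretrained model's behavior.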

Method

From corrupted video to clean restoration.

Training pairs are synthesized by rendering diverse text overlays on clean videos. The clean target and the corrupted input are both encoded into latent space; the noised target latent is concatenated channel-wise with the corrupted-input latent, and the result is processed by the adapted conditional flow matching backbone.
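A single training step can be sketched as follows. This assumes the common linear (rectified-flow style) path with noise at t = 0 and data at t = 1; the page does not state which path or time convention TextAway uses. The function name and the (C, T, H, W) latent layout are also assumptions.

```python
import numpy as np

def make_training_example(z_clean, z_corrupt, t, rng):
    """Build the model input and velocity target for one flow matching step.

    z_clean, z_corrupt: latents of shape (C, T, H, W).
    ASSUMPTION: linear path x_t = (1 - t) * noise + t * z_clean,
    with target velocity z_clean - noise.
    """
    noise = rng.normal(size=z_clean.shape)
    x_t = (1.0 - t) * noise + t * z_clean
    v_target = z_clean - noise
    # Channel-wise concat: noisy target latent first, conditioning latent second.
    model_in = np.concatenate([x_t, z_corrupt], axis=0)
    return model_in, v_target

rng = np.random.default_rng(0)
z_clean = rng.normal(size=(4, 2, 8, 8))
z_corrupt = rng.normal(size=(4, 2, 8, 8))
model_in, v_target = make_training_example(z_clean, z_corrupt, t=0.3, rng=rng)
```

The backbone would then be trained to predict `v_target` from `model_in` and `t`.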

Clean video: Start from a clean source video that acts as the reconstruction target.

Text rendering: Composite multilingual, moving, transparent, or stylized text overlays.

Latent concat: Concatenate the noisy target latent and the corrupted input latent channel-wise.

Text-aware loss: Use training-only masks for region-weighted flow matching and auxiliary supervision.

Mask-free output: At inference, decode a clean video from the corrupted input alone.
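The text-aware loss step can be illustrated with a region-weighted squared error on the predicted velocity. This is a sketch of one plausible weighting scheme, not the authors' formula. The name `weighted_fm_loss`, the `text_weight` value, and the assumption that the mask is already at latent resolution are all hypothetical, and the auxiliary text-awareness branch is not shown.

```python
import numpy as np

def weighted_fm_loss(v_pred, v_target, text_mask, text_weight=5.0):
    """Weighted mean squared error between predicted and target velocity.

    v_pred, v_target: (C, T, H, W).
    text_mask: (T, H, W) in [0, 1], available only during training.
    text_weight: hypothetical up-weighting factor for text pixels.
    """
    # Weight is 1 outside text and text_weight inside; soft masks interpolate.
    weights = 1.0 + (text_weight - 1.0) * text_mask[None]
    sq_err = (v_pred - v_target) ** 2
    # Normalize by total weight so the loss scale does not depend on text area.
    return float((weights * sq_err).sum() / (weights.sum() * v_pred.shape[0]))

mask = np.zeros((1, 4, 4))
mask[0, :2, :] = 1.0                          # top half is text
target = np.zeros((2, 1, 4, 4))
err_in_text = target.copy()
err_in_text[:, 0, :2, :] = 1.0                # unit error only in text
err_in_bg = target.copy()
err_in_bg[:, 0, 2:, :] = 1.0                  # unit error only in background
loss_text = weighted_fm_loss(err_in_text, target, mask)
loss_bg = weighted_fm_loss(err_in_bg, target, mask)
```

The same error costs more inside the text region than outside it. Because the mask enters only the loss and never the model input, inference stays mask-free.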

Full framework overview. TextAway uses rendered masks only during training, while inference requires only the corrupted input video.

Quantitative results

Strong text-region and boundary restoration.

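Region-specific metrics reveal failures that full-frame metrics can hide. In the table below, TR, BG, and BB denote the text region, the non-text background, and the boundary band around the text, respectively. TextAway improves text-region and boundary-band restoration while remaining competitive on non-text background regions.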
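As a rough illustration of how region-specific scores can diverge from full-frame ones, the sketch below computes PSNR restricted to a text region, the background, and a boundary band. The band definition (dilated mask minus eroded mask, radius `r`) and all function names are assumptions; the page does not give the exact metric definitions.

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square, built from shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary_band(mask, r=1):
    """ASSUMED band definition: pixels within r of the text boundary."""
    eroded = ~dilate(~mask, r)
    return dilate(mask, r) & ~eroded

def masked_psnr(pred, target, region, peak=1.0):
    """PSNR over the pixels selected by a boolean (H, W) region."""
    diff = (pred - target)[region]
    mse = float(np.mean(diff ** 2))
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

mask = np.zeros((16, 16), dtype=bool)
mask[6:10, 6:10] = True                       # toy text region
target = np.full((16, 16, 3), 0.5)
pred = target.copy()
pred[mask] += 0.1                             # residual error only under the text

full = masked_psnr(pred, target, np.ones_like(mask))
tr = masked_psnr(pred, target, mask)
bg = masked_psnr(pred, target, ~mask)
bb = masked_psnr(pred, target, boundary_band(mask, r=1))
```

In this toy case the full-frame PSNR is much higher than the text-region PSNR, because the small text area is averaged with a perfect background.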

Mask	Method	PSNR ↑	SSIM ↑	LPIPS ↓	VFID ↓	TR-PSNR ↑	TR-SSIM ↑	BG-PSNR ↑	BG-SSIM ↑	BB-PSNR ↑	BB-SSIM ↑
None	Ours	34.56	0.947	0.055	3.87	27.63	0.920	34.80	0.944	32.04	0.943
GoMatching++	ProPainter	26.09	0.900	0.155	37.89	15.28	0.722	35.95	0.965	21.90	0.834
GoMatching++	MiniMax	25.27	0.877	0.168	37.01	15.48	0.710	31.98	0.938	21.44	0.805
GoMatching++	DiffuEraser	25.33	0.877	0.178	39.21	15.26	0.693	32.66	0.944	21.09	0.793
GT mask	ProPainter	30.96	0.914	0.105	14.28	19.66	0.794	38.03	0.968	27.27	0.884
GT mask	MiniMax	29.56	0.897	0.105	9.44	21.07	0.793	32.87	0.939	24.65	0.832
GT mask	DiffuEraser	29.15	0.894	0.121	9.86	20.22	0.768	32.70	0.943	23.19	0.808
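For a quick reading of the table, the snippet below computes the gap between TextAway and the best ground-truth-mask baseline on each region PSNR. The values are copied from the rows above; only the helper name `gap_to_best_baseline` is introduced here.

```python
ours = {"TR-PSNR": 27.63, "BG-PSNR": 34.80, "BB-PSNR": 32.04}
gt_mask_baselines = {
    "ProPainter": {"TR-PSNR": 19.66, "BG-PSNR": 38.03, "BB-PSNR": 27.27},
    "MiniMax": {"TR-PSNR": 21.07, "BG-PSNR": 32.87, "BB-PSNR": 24.65},
    "DiffuEraser": {"TR-PSNR": 20.22, "BG-PSNR": 32.70, "BB-PSNR": 23.19},
}

def gap_to_best_baseline(metric):
    """Return (best GT-mask baseline, ours minus that baseline) in dB."""
    best = max(gt_mask_baselines, key=lambda name: gt_mask_baselines[name][metric])
    return best, round(ours[metric] - gt_mask_baselines[best][metric], 2)

gaps = {metric: gap_to_best_baseline(metric) for metric in ours}
```

The gaps are positive for the text region and boundary band and negative for the background, where ProPainter with a ground-truth mask scores higher.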

Video results

Qualitative comparisons.

Samples are grouped so that each group isolates a different case: easy overlays, dynamic temporal effects, large or complex normal overlays, and preservation of non-target scene text.
