Mask-free inference
The model takes only the corrupted video and directly predicts a clean output. OCR, text detectors, and external masks are not required at test time.
TextAway is an end-to-end, text-aware generation framework that restores clean videos directly from corrupted inputs, without OCR, text detection, or external mask generation at inference time.
Overlaid text can be semi-transparent, anti-aliased, temporally varying, and blended with complex backgrounds. TextAway therefore treats its removal as soft-overlay restoration rather than hard-object removal.
A pretrained text-to-video flow matching backbone is converted into a video-to-video restoration model through latent concatenation and scale-preserving initialization.
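As a rough illustration of latent concatenation with an initialization that keeps pretrained activations unchanged, the sketch below widens an input projection so it accepts the noisy latent and the corrupted-video latent side by side. The page does not say how scale-preserving initialization is implemented; zero-initializing the new condition channels is one common choice and is an assumption here, as are the function name `expand_input_projection`, the channels-last layout, and the toy sizes.

```python
import numpy as np

def expand_input_projection(weight: np.ndarray, extra_in: int) -> np.ndarray:
    """Widen a (out_dim, in_dim) projection to accept `extra_in` more input channels.

    The new columns are zero (an assumed initialization), so the widened layer
    initially ignores the condition channels and reproduces the pretrained
    activations.
    """
    out_dim, _ = weight.shape
    zeros = np.zeros((out_dim, extra_in), dtype=weight.dtype)
    return np.concatenate([weight, zeros], axis=1)

rng = np.random.default_rng(0)
C, D = 4, 8  # hypothetical latent channels and hidden width
w_pretrained = rng.normal(size=(D, C))

# Latents shaped (frames, height, width, channels); channels last for brevity.
noisy_latent = rng.normal(size=(2, 3, 3, C))
corrupted_latent = rng.normal(size=(2, 3, 3, C))

w_expanded = expand_input_projection(w_pretrained, extra_in=C)

# Latent concatenation: the condition is appended along the channel axis.
model_input = np.concatenate([noisy_latent, corrupted_latent], axis=-1)

before = noisy_latent @ w_pretrained.T  # pretrained text-to-video input path
after = model_input @ w_expanded.T      # adapted video-to-video input path
print(np.allclose(before, after))
```

Under this assumption the adapted model starts from the pretrained backbone's behavior, and the condition channels gain influence only as their weights are trained.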
Rendered text masks are used only during training for weighted flow matching and an auxiliary text-awareness branch.
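A minimal sketch of what a mask-weighted flow matching loss could look like, assuming a linear path `x_t = (1 - t) * noise + t * clean` and a per-element weight of `1 + text_weight * mask`. The weight form, the value of `text_weight`, and the function name are illustrative assumptions, not details given on this page.

```python
import numpy as np

def weighted_flow_matching_loss(v_pred, clean, noise, mask, text_weight=4.0):
    """Mask-weighted MSE between predicted and target velocity.

    v_pred, clean, noise: arrays of identical shape (latent video tensors).
    mask: broadcastable to that shape, 1 inside rendered text, 0 elsewhere.
    text_weight: hypothetical extra weight applied to text regions.
    """
    # Velocity of the assumed linear path x_t = (1 - t) * noise + t * clean.
    target_velocity = clean - noise
    weights = 1.0 + text_weight * mask
    return float(np.mean(weights * (v_pred - target_velocity) ** 2))

rng = np.random.default_rng(0)
shape = (2, 4, 4, 3)  # frames, height, width, channels
clean = rng.normal(size=shape)
noise = rng.normal(size=shape)

text_mask = np.zeros(shape)
text_mask[:, 1:3, 1:3, :] = 1.0   # 24 elements inside the rendered text
bg_region = np.zeros(shape)
bg_region[:, 0, :, :] = 1.0       # 24 background elements, no overlap with text

perfect = clean - noise
loss_perfect = weighted_flow_matching_loss(perfect, clean, noise, text_mask)
loss_text = weighted_flow_matching_loss(perfect + 0.1 * text_mask, clean, noise, text_mask)
loss_bg = weighted_flow_matching_loss(perfect + 0.1 * bg_region, clean, noise, text_mask)
print(loss_perfect, loss_text / loss_bg)
```

The same velocity error costs more when it falls inside the rendered text mask than when it falls on the background. The mask appears only in the loss, so nothing mask-related is needed at inference.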
Training pairs are synthesized by rendering diverse overlays on clean videos. The clean target and corrupted input are encoded into latent space, concatenated, and processed by the adapted conditional flow matching backbone.
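The pair synthesis described above can be sketched with plain alpha compositing. This is an illustrative stand-in, not the actual rendering pipeline: a rectangle replaces rasterized glyphs, and the function name `composite_overlay`, the opacity values, and the binary-mask rule `alpha > 0` are assumptions.

```python
import numpy as np

def composite_overlay(clean, alpha, color):
    """Blend a colored overlay onto a clean video.

    clean: (T, H, W, 3) floats in [0, 1].
    alpha: (T, H, W, 1) per-pixel opacity in [0, 1]; fractional values model
           semi-transparent and anti-aliased text edges.
    color: length-3 RGB overlay color in [0, 1].
    Returns the corrupted video and a binary mask of the touched pixels.
    """
    color = np.asarray(color, dtype=clean.dtype).reshape(1, 1, 1, 3)
    corrupted = (1.0 - alpha) * clean + alpha * color
    mask = (alpha > 0).astype(clean.dtype)
    return corrupted, mask

T, H, W = 2, 4, 6
clean = np.full((T, H, W, 3), 0.2)   # uniform gray stand-in for a clean clip

# A rectangle stands in for rendered glyphs; opacity changes over time to
# mimic a temporally varying overlay (half-transparent, then opaque).
alpha = np.zeros((T, H, W, 1))
for t, opacity in enumerate([0.5, 1.0]):
    alpha[t, 1:3, 1:5, 0] = opacity

corrupted, mask = composite_overlay(clean, alpha, color=(1.0, 1.0, 1.0))
print(corrupted[0, 1, 1], corrupted[1, 1, 1], corrupted[0, 0, 0])
```

The triple `(clean, corrupted, mask)` is what a training step would consume: `corrupted` as the conditioning input, `clean` as the target, and `mask` for training-only supervision.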
Full framework overview. TextAway uses rendered masks only during training, while inference requires only the corrupted input video.
Region-specific metrics reveal failures that full-frame metrics can hide. TextAway improves restoration in the text region (TR) and the boundary band (BB) while remaining competitive on non-text background (BG) regions.
| Mask | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VFID ↓ | TR-PSNR ↑ | TR-SSIM ↑ | BG-PSNR ↑ | BG-SSIM ↑ | BB-PSNR ↑ | BB-SSIM ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| None | Ours | 34.56 | 0.947 | 0.055 | 3.87 | 27.63 | 0.920 | 34.80 | 0.944 | 32.04 | 0.943 |
| GoMatching++ | ProPainter | 26.09 | 0.900 | 0.155 | 37.89 | 15.28 | 0.722 | 35.95 | 0.965 | 21.90 | 0.834 |
| GoMatching++ | MiniMax | 25.27 | 0.877 | 0.168 | 37.01 | 15.48 | 0.710 | 31.98 | 0.938 | 21.44 | 0.805 |
| GoMatching++ | DiffuEraser | 25.33 | 0.877 | 0.178 | 39.21 | 15.26 | 0.693 | 32.66 | 0.944 | 21.09 | 0.793 |
| GT mask | ProPainter | 30.96 | 0.914 | 0.105 | 14.28 | 19.66 | 0.794 | 38.03 | 0.968 | 27.27 | 0.884 |
| GT mask | MiniMax | 29.56 | 0.897 | 0.105 | 9.44 | 21.07 | 0.793 | 32.87 | 0.939 | 24.65 | 0.832 |
| GT mask | DiffuEraser | 29.15 | 0.894 | 0.121 | 9.86 | 20.22 | 0.768 | 32.70 | 0.943 | 23.19 | 0.808 |
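To make the region-specific columns concrete, here is a hedged sketch of PSNR restricted to a region, applied to a single grayscale frame. The exact definitions behind the table are not given on this page; taking the boundary band as "dilated mask minus eroded mask" with a radius of 1 pixel is an assumption, as are the helper names.

```python
import numpy as np

def psnr(pred, target, region=None, max_val=1.0):
    """PSNR over the pixels selected by a boolean `region` (all pixels if None)."""
    sq_err = (pred - target) ** 2
    mse = sq_err[region].mean() if region is not None else sq_err.mean()
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def dilate(mask, radius):
    """Binary dilation with a square window, implemented with array shifts."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def boundary_band(mask, radius=1):
    """Pixels within `radius` of the mask edge: dilation minus erosion (assumed)."""
    eroded = ~dilate(~mask, radius)
    return dilate(mask, radius) & ~eroded

target = np.zeros((8, 8))
text = np.zeros((8, 8), dtype=bool)
text[2:6, 2:6] = True

pred = target.copy()
pred[text] += 0.1                      # residual error only where text used to be

full = psnr(pred, target)              # full-frame PSNR
tr = psnr(pred, target, text)          # text-region PSNR
bg = psnr(pred, target, ~text)         # background PSNR
bb = psnr(pred, target, boundary_band(text))
print(full, tr, bg, bb)
```

In this toy frame the full-frame score is several dB higher than the text-region score because the error-free background dilutes the residual. That dilution is the effect the TR and BB columns are meant to expose.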
The qualitative samples are grouped so that each group isolates a different failure mode: easy overlays, dynamic temporal effects, large or complex overlays, and preservation of non-target scene text.