Introducing PerceptionDLM: the first multimodal diffusion LLM for parallel region perception!
Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩
✨ Highlights
• ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency
• PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs)
• New benchmark: ParaDLC-Bench, which jointly evaluates caption quality AND inference efficiency
• Code, models & benchmark all open-sourced
🤗 Models
MSALab/PerceptionDLM-Base
MSALab/PerceptionDLM
Benchmark
MSALab/ParaDLC-Bench
Paper: PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models (2606.19534)
💻 Code: https://github.com/MSALab-PKU/PerceptionDLM
Diffusion LLMs aren't just for text: they unlock efficient, parallel visual perception. 👁️✨
#multimodal #diffusion #VLM #perception