D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Nobline Yoo1, Olga Russakovsky1, Ye Zhu1,2
1Princeton University, 2LIX, École Polytechnique, IP Paris

Qualitative examples illustrating the count-correction ability of our detector-based critic on a variety of objects, counts 1-10. D2D proposes a new and effective way to leverage detectors for this challenging task. Prompt template: “A realistic photo of a scene with [count] [object class].”

Abstract

Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts.

Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide generation toward the requested counts. Specifically, we design custom activation functions that convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models.

Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., gains of up to 13.7 percentage points on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and minimal computational overhead.

The Problem

Since vanilla T2I models are not explicitly trained to count, existing methods introduce auxiliary counting critics that provide additional supervision during generation via gradient guidance. This restricts current approaches to differentiable, regression-based models such as RCC and CLIP-Count, and inherently excludes high-accuracy detector-based models, which count via bounding-box enumeration. Despite being non-differentiable, detectors (e.g., OWLv2, YOLOv9) often outperform their regression-based counterparts in low-density object scenarios, where discrepancies between generated and requested counts are most noticeable.

Low and high-density examples with the wrong number of generated objects.

Error by ground truth count on TallyQA and FSC147 counting benchmarks.

To resolve this bottleneck, we convert existing object detectors into differentiable critics, thereby allowing T2I diffusion models to benefit from stronger counting models for improved numeracy.

Solution: The D2D Method

D2D builds on two key insights. (1) We design a high-curvature activation function that converts the bounding-box logits output by detectors into soft binary indicators, making them gradient-friendly for count optimization. (2) To leverage our “count-via-summation” gradient, unlike prior approaches that intervene at intermediate states or denoised predictions along the sampling trajectory, we instead optimize the initial noise using a test-time tunable module, the Latent Modifier Network. This backbone-agnostic design enables our method to generalize across diverse diffusion-based T2I architectures, including U-Net and DiT.
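The first insight, counting via summation of soft indicators, can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the steepness `tau`, the threshold, and the squared-error loss are illustrative assumptions.

```python
import torch

def soft_count(box_logits: torch.Tensor, tau: float = 20.0, thresh: float = 0.0) -> torch.Tensor:
    """Differentiable object count from detector box logits.

    A steep (high-curvature) sigmoid maps each box logit to a soft
    binary indicator in (0, 1); summing the indicators approximates
    the hard thresholding-and-enumeration a detector performs, while
    remaining differentiable. `tau` and `thresh` are illustrative
    hyperparameters, not the paper's exact values.
    """
    indicators = torch.sigmoid(tau * (box_logits - thresh))
    return indicators.sum()

def count_loss(box_logits: torch.Tensor, target_count: int) -> torch.Tensor:
    # Penalize deviation of the soft count from the requested count.
    return (soft_count(box_logits) - float(target_count)) ** 2
```

Because the sigmoid is steep, confident detections contribute nearly 1 and confident rejections nearly 0, so the soft count tracks the detector's enumerated count while still providing usable gradients near the decision boundary.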

D2D consists of two main components that work together to improve numeracy: our detector-based count critic guides the Latent Modifier Network (LMN) in transforming the original initial noise into a more optimal one. The count critic uses sigmoid-based activation functions to convert logits into gradient signals, which are backpropagated through the frozen generator to update the weights of the LMN.
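The optimization loop described above can be sketched with toy stand-ins. The real pipeline uses a pretrained T2I generator and a detector-based critic; the class and function names below, the residual LMN architecture, and all hyperparameters are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn

class LatentModifierNetwork(nn.Module):
    """Small test-time-tunable module that reshapes the initial noise."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.net(z)  # residual update of the noise prior

def optimize_noise(z0, generator, critic, target_count, steps=50, lr=1e-2):
    """Tune the LMN so the generated image matches the requested count.

    `generator` is frozen (no parameter updates), but gradients flow
    through it from the critic's soft count back to the LMN weights.
    """
    lmn = LatentModifierNetwork(z0.shape[-1])
    opt = torch.optim.Adam(lmn.parameters(), lr=lr)
    for _ in range(steps):
        z = lmn(z0)
        image = generator(z)        # frozen T2I sampler (toy stand-in here)
        soft_count = critic(image)  # differentiable count from detector logits
        loss = (soft_count - float(target_count)) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return lmn(z0).detach()
```

Optimizing the initial noise through a separate tunable module, rather than editing intermediate latents along the sampling trajectory, is what makes the design backbone-agnostic: the generator is treated as a black-box differentiable function.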

Results

On numeracy, D2D consistently outperforms baselines across low-density, single-object, multi-object, and high-density prompts, and across base models with U-Net and DiT backbones, improving accuracy by up to 13.7 percentage points (from 30% with Make It Count to 43.7% with D2D on D2D-Small). D2D also generalizes across different detector backbones.

Quantitative results. D2D outperforms all baselines across all four benchmarks (CoCoCount and D2D-Small/Multi/Large), and generalizes across detector variants OWLv2 and YOLOv9. D2D with YOLOv9 on base model SDXL-Turbo is in bold italics because, while it outperforms all baselines, it is second to D2D with OWLv2. The higher-performing OWLv2 detector is used in all subsequent experiments on SD-Turbo and Pixart-DMD. Standard deviations indicate the significance of our findings. Base models without post-enhancement are highlighted in gray. Results are averaged over four seeds.

D2D in action!

D2D effectively corrects both over- and under-generation (i.e., when the initial generation contains more or fewer objects than requested). We compare correction rates on base model SDXL-Turbo. D2D has the highest correction rate, fixing 40.13% of over-generations and 41.83% of under-generations, at least 16 percentage points above the baselines, while preserving 72.57% of already-correct generations (also more than the baselines).

D2D effectively corrects over and under-generation.

Given the same initial conditions, D2D is effective at correcting both over- and under-generation.

Inference cost analysis

Numeracy vs. inference cost by method. D2D sits in the top left (i.e., it is both high-numeracy and low-cost). Base model/detector noted in gray.

Additional qualitative results

D2D effectively corrects numeracy mistakes.

Qualitative examples from count correction/alignment methods D2D, ReNO, and TokenOpt on base model SDXL-Turbo, and Make It Count on base SDXL.

BibTeX

@article{yoo2025d2d,
  author    = {Yoo, Nobline and Russakovsky, Olga and Zhu, Ye},
  title     = {D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation},
  journal   = {arXiv preprint arXiv:2510.19278},
  year      = {2025}
}