D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation

Nobline Yoo1, Olga Russakovsky1, Ye Zhu1,2
1Princeton University, 2LIX, École Polytechnique, IP Paris

Qualitative examples illustrating the count-correction ability of our detector-based critic on a variety of objects, counts 1-10. D2D proposes a new and effective way to leverage detectors for this challenging task. Prompt template: “A realistic photo of a scene with [count] [object class].”

Abstract

Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts.

Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide generation toward the requested counts. Specifically, we design custom activation functions that convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models.

Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., gains of up to 13.7 percentage points on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and minimal computational overhead.

The Problem

Since vanilla T2I models are not explicitly trained to count, existing methods introduce auxiliary counting critics that provide additional supervision during generation via gradient guidance. This restricts current approaches to differentiable, regression-based models such as RCC and CLIP-Count, and inherently excludes high-accuracy detector-based models, which count via bounding-box enumeration. Despite being non-differentiable, detectors (e.g., OWLv2, YOLOv9) often outperform their regression-based counterparts in low-density object scenarios, where discrepancies between generated and requested counts are most noticeable.

Low and high-density examples with the wrong number of generated objects.

Error by ground truth count on TallyQA and FSC147 counting benchmarks.

To resolve this bottleneck, we convert existing object detectors into differentiable critics, thereby allowing T2I diffusion models to benefit from stronger counting models for improved numeracy.

Solution: The D2D Method

D2D builds on two key insights. (1) We design a high-curvature activation function that converts the bounding-box logits output by detectors into soft binary indicators, making them gradient-friendly for count optimization. (2) To leverage our “count-via-summation” gradient, unlike prior approaches that intervene at intermediate states or denoised predictions along the sampling trajectory, we instead optimize the initial noise using a test-time tunable module, the Latent Modifier Network. This backbone-agnostic design enables our method to generalize across diverse diffusion-based T2I architectures, including U-Net and DiT.
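The first insight, counting via summation of soft indicators, can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the steepness `tau`, the threshold, and the squared-error loss are illustrative assumptions.

```python
import torch

def soft_count(box_logits: torch.Tensor, tau: float = 20.0, thresh: float = 0.0) -> torch.Tensor:
    """Differentiable object count from detector box logits.

    A steep (high-curvature) sigmoid maps each box logit to a soft
    binary indicator in (0, 1); summing the indicators approximates
    the hard thresholding-and-enumeration a detector performs, while
    remaining differentiable. `tau` and `thresh` are illustrative
    hyperparameters, not the paper's exact values.
    """
    indicators = torch.sigmoid(tau * (box_logits - thresh))
    return indicators.sum()

def count_loss(box_logits: torch.Tensor, target_count: int) -> torch.Tensor:
    # Penalize deviation of the soft count from the requested count.
    return (soft_count(box_logits) - float(target_count)) ** 2
```

Because the sigmoid is steep, confident detections contribute nearly 1 and confident rejections nearly 0, so the soft count tracks the detector's enumerated count while still providing usable gradients near the decision boundary.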

D2D consists of two main components that work together to improve numeracy: our detector-based count critic guides the Latent Modifier Network (LMN) in transforming the original initial noise into a more optimal one. The count critic uses sigmoid-based activation functions to convert logits into gradient signals, which are backpropagated through the frozen generator to update the weights of the LMN.
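The optimization loop described above can be sketched with toy stand-ins. The real pipeline uses a pretrained T2I generator and a detector-based critic; the class and function names below, the residual LMN architecture, and all hyperparameters are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn

class LatentModifierNetwork(nn.Module):
    """Small test-time-tunable module that reshapes the initial noise."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return z + self.net(z)  # residual update of the noise prior

def optimize_noise(z0, generator, critic, target_count, steps=50, lr=1e-2):
    """Tune the LMN so the generated image matches the requested count.

    `generator` is frozen (no parameter updates), but gradients flow
    through it from the critic's soft count back to the LMN weights.
    """
    lmn = LatentModifierNetwork(z0.shape[-1])
    opt = torch.optim.Adam(lmn.parameters(), lr=lr)
    for _ in range(steps):
        z = lmn(z0)
        image = generator(z)        # frozen T2I sampler (toy stand-in here)
        soft_count = critic(image)  # differentiable count from detector logits
        loss = (soft_count - float(target_count)) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return lmn(z0).detach()
```

Optimizing the initial noise through a separate tunable module, rather than editing intermediate latents along the sampling trajectory, is what makes the design backbone-agnostic: the generator is treated as a black-box differentiable function.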

Results

On numeracy, D2D consistently outperforms baselines across low-density, single-object, multi-object, and high-density prompts, and across base models with U-Net and DiT backbones, improving accuracy by up to 13.7 percentage points (from 30% with Make It Count to 43.7% with D2D on D2D-Small). D2D also generalizes across different detector backbones.

Quantitative results. D2D outperforms all baselines across all four benchmarks (CoCoCount and D2D-Small/Multi/Large), and generalizes across detector variants OWLv2 and YOLOv9. D2D with YOLOv9 on base model SDXL-Turbo is in bold italics because, while it outperforms all baselines, it is second to D2D with OWLv2. The higher-performing OWLv2 detector is used in all subsequent experiments on SD-Turbo and Pixart-DMD. Standard deviations indicate the significance of our findings. Base models without post-enhancement are highlighted in gray. Results are averaged over four seeds.

D2D in action!

D2D effectively corrects both over- and under-generation (i.e., when the initial generation contains more or fewer objects than requested). We compare correction rates on base model SDXL-Turbo. D2D has the highest correction rate, fixing 40.13% of over-generations and 41.83% of under-generations, at least 16 percentage points above the baselines, while preserving 72.57% of already-correct generations (also more than the baselines).

D2D effectively corrects over and under-generation.

Given the same initial conditions, D2D is effective at correcting both over- and under-generation.

Inference cost analysis

Numeracy vs. inference cost by method. D2D sits in the top left (i.e., it is both high-numeracy and low-cost). Base model/detector noted in gray.

Additional qualitative results

D2D effectively corrects numeracy mistakes.

Qualitative examples from count correction/alignment methods D2D, ReNO, and TokenOpt on base model SDXL-Turbo, and Make It Count on base SDXL.

BibTeX

@article{yoo2025d2d,
  author    = {Yoo, Nobline and Russakovsky, Olga and Zhu, Ye},
  title     = {D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation},
  journal   = {arXiv preprint arXiv:2510.19278},
  year      = {2025}
}