AlignedGen: Aligning Style Across Generated Images

🔥 NeurIPS 2025

TL;DR: Given a set of prompts, AlignedGen generates images with a consistent style.

Peking University

Showcase

Explore the generated results along with their corresponding prompts.

Abstract

Despite their generative power, diffusion models struggle to maintain style consistency across images conditioned on the same style prompt, hindering their practical deployment in creative workflows. While several training-free methods attempt to solve this, they are constrained to the U-Net architecture, which not only leads to low-quality results and artifacts such as object repetition but also renders them incompatible with the superior Diffusion Transformer (DiT) architecture. To address these issues, we introduce AlignedGen, a novel training-free framework that enhances style consistency across images generated by DiT models. Our work first reveals a critical insight: naive attention sharing fails in DiT due to conflicting positional signals from improper position embeddings. We introduce Shifted Position Embedding (ShiftPE), an effective solution that resolves this conflict by allocating a non-overlapping set of positional indices to each image. Building on this foundation, we develop Advanced Attention Sharing (AAS), a suite of three techniques meticulously designed to fully unleash the potential of attention sharing within DiT. Furthermore, to broaden the applicability of our method, we present an efficient query, key, and value feature extraction algorithm, enabling our method to seamlessly incorporate external images as style references.
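For intuition, here is a minimal, hypothetical sketch of the non-overlapping positional index idea behind ShiftPE (the exact shifting scheme in the paper may differ): each image in the jointly generated set receives its own disjoint range of 2D position indices before RoPE is applied, so shared attention no longer receives conflicting positional signals.

import torch

def shifted_position_ids(num_images: int, height: int, width: int) -> torch.Tensor:
    # Build (row, col) indices for one height x width latent token grid.
    rows = torch.arange(height).repeat_interleave(width)   # (H*W,)
    cols = torch.arange(width).repeat(height)               # (H*W,)
    base = torch.stack([rows, cols], dim=-1)                 # (H*W, 2)

    # Shift each image's indices along one axis so the ranges never overlap.
    # The per-image offset of `i * width` is an illustrative assumption.
    ids = [base + torch.tensor([0, i * width]) for i in range(num_images)]
    return torch.stack(ids, dim=0)                            # (N, H*W, 2)

# Example: with 3 images of a 4x4 token grid, image 0 uses columns 0-3,
# image 1 uses columns 4-7, and image 2 uses columns 8-11.
pos = shifted_position_ids(num_images=3, height=4, width=4)
print(pos[0, :, 1].max().item(), pos[1, :, 1].min().item())  # 3 4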

Method

AlignedGen Method Overview

Overview of the AlignedGen framework. (a) Overall pipeline, where our ShiftPE and AAS modules are integrated into specific layers of the DiT, replacing the MM-Attention. (b, c) Detailed illustrations of ShiftPE and AAS, respectively. (d) Procedure for extracting features from a user-provided style reference image, which serve as input to the AAS module.
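As a rough illustration of the attention-sharing idea that replaces per-image self-attention (this is only a sketch, not the full AAS module, which adds further techniques on top), each image's queries attend over the concatenated keys and values of every image in the set, letting style information flow between them. The tensor shapes and the use of PyTorch's scaled_dot_product_attention are assumptions for illustration.

import torch
import torch.nn.functional as F

def shared_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (N, heads, tokens, dim), one entry per image in the set.
    n, h, t, d = k.shape
    # Concatenate every image's keys/values along the token axis and expose
    # the shared sequence to each image's queries.
    k_shared = k.transpose(0, 1).reshape(1, h, n * t, d).expand(n, -1, -1, -1)
    v_shared = v.transpose(0, 1).reshape(1, h, n * t, d).expand(n, -1, -1, -1)
    return F.scaled_dot_product_attention(q, k_shared, v_shared)  # (N, heads, tokens, dim)

# Tiny usage example with random features.
q = torch.randn(3, 8, 16, 64)
k = torch.randn(3, 8, 16, 64)
v = torch.randn(3, 8, 16, 64)
print(shared_attention(q, k, v).shape)  # torch.Size([3, 8, 16, 64])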

ShiftPE Analysis


ShiftPE decouples attention from spatial location to enable semantic correspondence. We visualize the attention map originating from a query point (red box). (a) RoPE exhibits a strong spatial bias, rigidly constraining attention to the same coordinates in the reference image and causing content leakage. (b) ShiftPE breaks this spatial dependency, allowing attention to focus on broader, semantically relevant regions (e.g., the surrounding snowy landscape). (c) Summing attention weight over L1 distance from the query confirms this quantitatively: ShiftPE distributes attention across a wider area, whereas RoPE's attention is sharply localized at the query's exact coordinates.
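The distance analysis in panel (c) can be reproduced with a short script along these lines (a sketch; the array shapes and binning are assumptions): given one query's attention map over the reference image, sum the attention mass falling at each L1 distance from the query's own coordinates. RoPE concentrates the mass near distance zero, while ShiftPE spreads it over larger distances.

import torch

def attention_mass_by_l1_distance(attn_map: torch.Tensor, query_rc: tuple) -> torch.Tensor:
    # attn_map: (H, W) attention weights of one query point over the reference image.
    # query_rc: (row, col) coordinates of the query point in its own image.
    h, w = attn_map.shape
    rows = torch.arange(h).view(-1, 1).expand(h, w)
    cols = torch.arange(w).view(1, -1).expand(h, w)
    dist = (rows - query_rc[0]).abs() + (cols - query_rc[1]).abs()   # L1 distance per location
    # Sum the attention mass falling into each integer distance bin.
    sums = torch.zeros(int(dist.max()) + 1)
    sums.index_add_(0, dist.flatten(), attn_map.flatten())
    return sums

# Example: a uniform attention map spreads mass over many distances,
# whereas a one-hot map at the query location puts everything at distance 0.
uniform = torch.full((16, 16), 1.0 / 256)
print(attention_mass_by_l1_distance(uniform, (8, 8))[:3])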

Comparison with Other Methods


Support for User-Provided Images as Style References


DreamBooth


Depth Control


BibTeX

@article{zhang2025alignedgen,
  title={AlignedGen: Aligning Style Across Generated Images},
  author={Zhang, Jiexuan and Du, Yiheng and Wang, Qian and Li, Weiqi and Gu, Yu and Zhang, Jian},
  journal={arXiv preprint arXiv:2509.17088},
  year={2025}
}