Freaking Architecture

Project Description

We present an ensemble of neural networks (CLIP -> SD1.5 -> BLIP2 -> SDXL) that uses the "kaleidoscope principle" to cascade random combinations of keywords into an unmoderated stream of captioned architectural postcards, published daily in channels on Mastodon, Tumblr, Bluesky, and Telegram. The paper on this project, "Machine Apophenia: The Kaleidoscopic Generation of Architectural Images", was presented at HuMaIn @ KI 2024.

Fifty handpicked architectural images (mostly brutalist and futuristic buildings and monuments) were passed once through a CLIP-interrogator to obtain characteristic seed keyphrases.
The 408 seed phrases remaining after postprocessing are randomly combined 14-17 at a time and fed to the Stable Diffusion 1.5 model.
The resulting 512x512 images are passed through the BLIP2 model to generate human-readable descriptions.
Finally, the seed phrases and these new descriptions are fed to the img2img input of the SDXL model to generate the final images, which are streamed to the channels in an unsupervised manner.
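
The four steps above can be sketched as a single function. This is a minimal illustration, not the project's actual code: the three model stages are passed in as plain callables (`text2img`, `caption`, `img2img` are hypothetical names), and joining phrases and descriptions with commas is an assumption about how the prompts are assembled.

```python
import random
from typing import Any, Callable, Dict, List, Optional


def sample_keyphrases(seed_phrases: List[str], rng: random.Random,
                      min_k: int = 14, max_k: int = 17) -> List[str]:
    """Pick between min_k and max_k distinct seed phrases at random."""
    k = rng.randint(min_k, max_k)  # inclusive on both ends
    return rng.sample(seed_phrases, k)


def kaleidoscope_step(seed_phrases: List[str],
                      text2img: Callable[[str], Any],
                      caption: Callable[[Any], str],
                      img2img: Callable[[str, Any], Any],
                      rng: Optional[random.Random] = None) -> Dict[str, Any]:
    """Run one pass: keyphrases -> draft image -> caption -> final image."""
    rng = rng or random.Random()
    phrases = sample_keyphrases(seed_phrases, rng)
    prompt = ", ".join(phrases)                # assumed prompt format
    draft = text2img(prompt)                   # stands in for SD1.5
    description = caption(draft)               # stands in for BLIP2
    final_prompt = f"{prompt}, {description}"  # assumed way of combining
    final = img2img(final_prompt, draft)       # stands in for SDXL img2img
    return {"phrases": phrases, "description": description,
            "final_prompt": final_prompt, "final": final}
```

Passing the models in as callables keeps the sketch independent of any particular diffusion or captioning library; real wrappers around the models would be plugged in at the call site.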

Keywords

an old cinema building, a matte painting, 2070, slik design, evil baptism, polished concrete, evil warp energy, naotto hattori, by Bohumil Kubista, by Adam Bruce Thomson, slavic city, huge gargantuan black sun, water to waste, gray wasteland

SD1.5

Caption

...the building is a classic example of the art deco style, which was popular in the 1920s and 1930s; a black and white photo of a theater...

SDXL

a colorized photo, academic art, rays of volumetric light, robot religion, dark university aesthetic, jaeyeon nam, rosalia vila i tobella, maler collieri, cauldrons, curved blades on each hand, rows of windows lit internally, experiment in laboratory

...a large room with many different objects in it; a factory with a lot of machinery; a building...

a soviet town with a spiral staircase in the snow, hardmesh post, unique design, awarded photograph, viennese actionism, quebec, brutalism, heise jinyao, corrugated hose, spire, portal, of augean stables

...a large white building with a tower in the snow...

Evaluation

The study uses both technical and aesthetic metrics to evaluate the generated images. To assess the quality and relevance of the generated architectural images, we use the Image Quality Assessment models developed in the NIMA project. Specifically, we apply the pre-trained "aesthetic" and "technical" models, each of which returns a score from 1 to 10. The first model addresses the aesthetic aspects of the image, and the second evaluates the "clearness" of the picture (in terms of visual artifacts).
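
As background on where a score from 1 to 10 can come from: NIMA-style models are usually described as predicting a distribution over the ten rating buckets, with the reported score being the mean of that distribution. The sketch below assumes that output format; the function name `mean_score` is ours, and it does not call the NIMA models themselves.

```python
from typing import Sequence


def mean_score(distribution: Sequence[float]) -> float:
    """Expected rating for a distribution over the buckets 1..10."""
    if len(distribution) != 10:
        raise ValueError("expected one probability per rating bucket (10)")
    total = sum(distribution)
    if total <= 0:
        raise ValueError("distribution must have positive mass")
    # Weight each rating (1..10) by its probability; dividing by the total
    # normalizes inputs that do not sum exactly to 1.
    weighted = sum(r * p for r, p in enumerate(distribution, start=1))
    return weighted / total
```

A uniform distribution gives the midpoint of the scale, and a distribution concentrated on the top bucket gives 10.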

We used both models to evaluate our approach and to show how each of its steps improves both scores. To isolate the contributions of the system components, we conducted a series of ablation studies comparing different pipeline configurations, averaging the scores over 1000 images produced by each pipeline version. The key observation is that each system component contributes significantly to the final result: the iterative refinement stages, from initial seed generation to final image refinement, yield higher aesthetic and technical scores than more straightforward, single-step approaches.
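
The per-version averaging behind such an ablation comparison can be sketched as follows. The record layout (version label, aesthetic score, technical score) and the function name are assumptions made for illustration, and no actual scores from the study are reproduced here.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def average_scores(
    records: Iterable[Tuple[str, float, float]]
) -> Dict[str, Dict[str, float]]:
    """Average aesthetic and technical scores per pipeline version.

    Each record is (version_label, aesthetic_score, technical_score).
    """
    aesthetic = defaultdict(list)
    technical = defaultdict(list)
    for version, a_score, t_score in records:
        aesthetic[version].append(a_score)
        technical[version].append(t_score)
    return {
        version: {
            "aesthetic": mean(aesthetic[version]),
            "technical": mean(technical[version]),
            "n_images": len(aesthetic[version]),
        }
        for version in aesthetic
    }
```

Reporting `n_images` next to each mean makes it easy to confirm that every pipeline version was scored on the same number of images.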

Freaking architecture: Quality Assessment

Observational Study

To understand the impact of different keyphrases on user engagement, we conducted a factor analysis based on one year of data collected from emoji reactions on our Telegram channel. This analysis aimed to identify which keyphrases significantly influenced user engagement, measured through the conversion rate of emoji feedback. The conversion rate was calculated as the percentage of images that received emoji reactions out of the total number of images containing a specific keyphrase. This metric served as a proxy for user engagement.
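
The conversion rate defined above can be computed as in the following sketch. The input layout (a list of keyphrases plus a flag saying whether the image received any emoji reaction) and the `min_images` cutoff are assumptions, not the project's actual data format.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def conversion_rates(posts: Iterable[Tuple[List[str], bool]],
                     min_images: int = 1) -> Dict[str, float]:
    """Percentage of images with reactions, per keyphrase.

    Each post is (keyphrases_used, got_any_reaction).
    """
    shown = Counter()    # images containing the keyphrase
    reacted = Counter()  # of those, images that received a reaction
    for keyphrases, got_reaction in posts:
        for phrase in set(keyphrases):  # count each phrase once per image
            shown[phrase] += 1
            if got_reaction:
                reacted[phrase] += 1
    return {
        phrase: 100.0 * reacted[phrase] / count
        for phrase, count in shown.items()
        if count >= min_images
    }
```

Raising `min_images` drops keyphrases that appear in too few images for their rate to be meaningful.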

Freaking architecture: Keyphrases and their conversion rates

The images corresponding to queries with high concentrations of high-engaging and low-engaging keyphrases are illustrative: they highlight the differences in user engagement associated with the keyphrases used.

Freaking architecture: Examples of images corresponding to queries with high concentrations of high/low-engaging keyphrases.

Latent Space Exploration

A t-SNE projection of a sample of 731 images, based on their visual features (computed with the DINOv2 model), demonstrates the variety of the generated images:

Freaking architecture: t-SNE map of DINOv2 features

NeRF experiments

Generated images can be used as input for NeRF-like models to create 3D spaces with the help of monocular depth prediction models.

BibTeX

@misc{FreakingArchitecture,
      title={Machine Apophenia: The Kaleidoscopic Generation of Architectural Images}, 
      author={Alexey Tikhonov and Dmitry Sinyavin},
      year={2024},
      eprint={2407.09172},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.09172},
      note={Project page: https://altsoph.github.io/freaking-architecture/},
}

Freaking Architecture: a kaleidoscopic approach to unsupervised architectural content generation