Stable Diffusion, a text-to-image synthesis model, was co-developed by Stability AI, CompVis, and LAION. This ground-breaking model employs diffusion, as explained in a previous article in this series, "From Text To Art: Part 1." An intriguing feature of Stable Diffusion is its open-source nature, allowing the public to freely access and utilize its technology and underlying components.
Open-source champions like Stability AI, CompVis, and LAION emphasize the following principles:
Transparency: With the code open to all, potential issues such as security vulnerabilities or unexpected algorithm outcomes can be identified and reported to the creators or other project developers.
Collaboration: Developers worldwide can enhance the source code or add features using platforms like GitHub or Google Colab. These modifications undergo a thorough review by the community and the original developers before being integrated.
Community: The community, consisting of both users and developers (often the same people), actively exchanges feedback on the project.
Freedom: Users have the liberty to use, distribute, and modify the software as needed. For instance, a plumbing firm could adapt an open-source accounting program to meet their specialized needs, without impacting the original source code, due to its decentralized nature.
Essentially, sharing knowledge aids in creating safer, better products, echoing the belief that human knowledge should not be gatekept but accessible to all.
Yet, the scenario changes when consequential technology enters the open-source realm. Open sourcing software for optimizing websites on mobile is one thing, but disseminating technology that can generate deep fakes or realistic images with harmful intent is another. It's here that the open-source movement becomes a balancing act between its potential benefits and societal costs. Despite these issues, the creators of Stable Diffusion appear committed to the open-source philosophy, thus far.
Being open-source, Stable Diffusion can be downloaded and run on your own computer if you possess the technical skills, though it isn't as easy to operate as DALL-E 2 or Midjourney. However, Stability AI has introduced a user-friendly "playground mode" offering free image generation. I dropped a link to the playground at the end of the post.
The quality of the images is comparable to DALL-E 2, and at times even better. The playground is great for casual use, getting to know text-to-image synthesis, and rapid ideation without much hassle.
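If you'd like to try the do-it-yourself route mentioned above, here's a minimal sketch using Hugging Face's diffusers library, one popular way to run Stable Diffusion locally. The model ID, prompt, and settings here are illustrative assumptions, not an official Stability AI setup:

```python
# Minimal local text-to-image sketch using the open-source diffusers library.
# Assumes a CUDA-capable GPU with enough VRAM and that the checkpoint's
# license has been accepted on the Hugging Face Hub.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # a commonly used public checkpoint
    torch_dtype=torch.float16,         # half precision to fit consumer GPUs
)
pipe = pipe.to("cuda")

prompt = "a watercolor painting of a lighthouse at dawn"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```

A few lines like these replace the playground's text box; everything else (model weights, sampling, decoding) runs on your own hardware.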
If you explore a bit, you'll find Stability AI, the company behind Stable Diffusion, is developing multiple products, some of which are in public beta. Stable Diffusion's capabilities are demonstrated across three of these products. For now, it's important to know that an improved version of Stable Diffusion, version 2.2, also known as Stable Diffusion XL, is available as open source, through limited free tiers, and through subscription plans, but not in the "playground".
Stable Diffusion XL significantly enhances Stable Diffusion, generating lifelike, high-quality images from text. It also includes practical features such as modifying existing images, removing objects or text, eliminating backgrounds, and upscaling images to larger sizes, among others.
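To make "modifying existing images" concrete, here's a hedged sketch of inpainting with the open-source diffusers library. This is not Stability AI's product API, and the checkpoint shown is a standard Stable Diffusion inpainting model rather than SDXL itself; the file names and prompt are illustrative:

```python
# Sketch: removing or replacing an object via inpainting with diffusers.
# White pixels in the mask mark the region to repaint; black pixels are kept.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("room.png").convert("RGB").resize((512, 512))
mask_image = Image.open("room_mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="an empty wall, soft natural light",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("room_edited.png")
```

The same pattern underlies object removal and background replacement: the mask tells the model which pixels to regenerate while the rest of the image is preserved, which is exactly the coherence question discussed next.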
Stable Diffusion and Stability AI are poised to be significant players in text-to-image synthesis, particularly through the paid applications they are developing. They focus on practicality, incorporating Photoshop-like features and ensuring image coherence: maintaining the image's overall composition and depth while making changes. Coherence is essential in a practical workflow.
The most exciting prospect of Stable Diffusion XL at the moment lies in the possibilities it offers developers and power users operating from home.
Why is it worth having Stable Diffusion running on your personal workstation? Besides the perk of it being completely free, it gives you unlimited room to generate whatever you like. Most notably, it allows the diffusion model to be tailored to your unique needs, echoing the earlier analogy of the plumbing firm.
For example, several AI startups specialize in generating headshots. They accomplish this by having users submit a substantial number of photos (15-100+) taken from various angles and in various settings. These photos are then used to fine-tune the Stable Diffusion model. By employing text-to-image synthesis, these companies can create polished headshots that closely resemble the user's submitted images. Developing an AI diffusion model from scratch can be challenging due to cost, time, and data limitations. However, fine-tuning an existing model is comparatively straightforward.
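While the fine-tuning itself is usually run with a separate training script (DreamBooth-style tools built on diffusers are a common choice), the inference side of such a headshot service might look like the sketch below. The local checkpoint path and the "sks person" placeholder token are hypothetical; they stand in for whatever the fine-tune actually produced:

```python
# Sketch: generating headshots from a model fine-tuned on a user's photos.
# "./headshot-model" is a hypothetical checkpoint from a DreamBooth-style
# fine-tune; "sks person" is the rare token such fine-tunes commonly bind
# to the user's likeness.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./headshot-model",
    torch_dtype=torch.float16,
).to("cuda")

prompt = (
    "professional studio headshot of sks person, "
    "neutral background, soft lighting, sharp focus"
)
images = pipe(prompt, num_images_per_prompt=4).images  # a few candidates
for i, img in enumerate(images):
    img.save(f"headshot_{i}.png")
```

Because only the fine-tune is new, the service inherits everything else (architecture, training data, sampling code) from the open-source base model.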
As expected, a significant portion of the images generated by users through the open-source local version of Stable Diffusion are explicit, and they frequently involve celebrities or individuals who haven't consented to the use of their likeness. As the technology continues to advance, there may be a growing demand for censorship measures. However, given how widely the technology has already spread, addressing this issue retroactively poses considerable challenges.