Author(s)

Riya Jadhav, Dr P S Lokhande

  • Manuscript ID: 120031
  • Volume 2, Issue 1, Jan 2026
  • Pages: 54–58

Subject Area: Computer Science

DOI: https://doi.org/10.5281/zenodo.18333916

Abstract

This research introduces an AI image generation platform that uses a dual-model approach to produce high-quality images from both text prompts and user-drawn sketches. The architecture pairs Google's Gemini multimodal API, which supplies the generative capability, with Firebase, which handles user authentication, cloud storage, hosting, and all serverless backend tasks. Because the system relies on API-based inference, it avoids the heavy, GPU-intensive training that conventional AI image generation typically requires. This makes advanced image creation accessible to students, educators, and developers who lack specialized hardware. Users interact with a simple web interface; Firebase Cloud Functions process their inputs and forward them to Gemini. The generated images are then stored securely in Firebase Storage and displayed to the user. With an average latency of a few seconds or less, the system shows that cloud-based multimodal AI can streamline creative workflows and enable near-real-time visualization with minimal infrastructure. Future work aims to improve style control, customization, and cross-platform support.

Keywords
AI image generation, Gemini API, Firebase, multimodal AI, diffusion models, serverless computing, text-to-image, sketch-to-image
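
The request flow summarized in the abstract (web input, serverless handler, generative model call, cloud storage) can be illustrated with a minimal Python sketch. It is not the authors' implementation: `GenerationRequest`, `handle_generation`, the storage path layout, and the example URL are hypothetical names chosen for illustration, and the model call and the storage upload are passed in as plain functions so that no real Gemini or Firebase API signature is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional
import uuid


@dataclass
class GenerationRequest:
    """Hypothetical payload sent by the web interface."""
    user_id: str                        # empty string means "not signed in"
    prompt: str                         # text prompt (may be empty with a sketch)
    sketch_png: Optional[bytes] = None  # set only for sketch-to-image requests


# Hypothetical stand-ins for the model call and the storage upload.
GenerateFn = Callable[[str, Optional[bytes]], bytes]  # (prompt, sketch) -> image
StoreFn = Callable[[str, bytes], str]                 # (path, image) -> URL


def handle_generation(req: GenerationRequest,
                      generate: GenerateFn,
                      store: StoreFn) -> Dict[str, str]:
    """Validate the request, call the model, store the image, return its URL."""
    if not req.user_id:
        raise PermissionError("authentication required")
    if not req.prompt.strip() and req.sketch_png is None:
        raise ValueError("provide a text prompt, a sketch, or both")

    mode = "sketch-to-image" if req.sketch_png is not None else "text-to-image"
    image = generate(req.prompt, req.sketch_png)

    # Assumed layout: one folder per user, random file name per image.
    path = f"users/{req.user_id}/images/{uuid.uuid4().hex}.png"
    url = store(path, image)
    return {"mode": mode, "path": path, "url": url}


# In-memory fakes so the sketch runs without network access or credentials.
fake_bucket: Dict[str, bytes] = {}


def fake_generate(prompt: str, sketch: Optional[bytes]) -> bytes:
    # Returns placeholder bytes, not a real image.
    return b"IMG:" + prompt.encode("utf-8") + (sketch or b"")


def fake_store(path: str, data: bytes) -> str:
    fake_bucket[path] = data
    return "https://storage.example.invalid/" + path


if __name__ == "__main__":
    request = GenerationRequest(user_id="u1", prompt="a red kite over a lake")
    print(handle_generation(request, fake_generate, fake_store)["mode"])
```

Injecting `generate` and `store` keeps the handler testable offline; in a deployment along the lines the abstract describes, these two functions would presumably wrap the Gemini API client and the Firebase Storage upload.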