Deploying Generative AI: GPT-2 Text Wrapper
In this guide, we will deploy a foundational generative AI model. We use GPT-2 as our example because it is small enough to run comfortably on a standard CPU, making it ideal for learning how to deploy text-generation services (such as insurance guides, automated responders, or creative writing assistants).
Generative models differ from standard classifiers because they require specific parameters (like temperature and max_length) to control how creative and how long the generated response should be. We will expose these settings safely through our API.
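To build intuition before we expose these parameters, here is a small illustrative sketch (not part of the deployment code): temperature divides the model's logits before the softmax, so higher values flatten the probability distribution and make sampling more random, while lower values sharpen it toward the most likely token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    max_val = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - max_val) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate next tokens
logits = [2.0, 1.0, 0.1]

cool = softmax_with_temperature(logits, 0.7)  # sharper: favors the top token
warm = softmax_with_temperature(logits, 2.0)  # flatter: more random sampling

print(cool)
print(warm)
```

Running this shows the first (highest-logit) token getting a larger share of the probability mass at temperature 0.7 than at 2.0, which is exactly the "creativity" knob our API will expose.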
Prerequisites
Before starting, ensure you have:
- Docker installed locally.
- Basic familiarity with text generation concepts (prompts, temperature).
The Model Downloader (download_model.py)
Just like with our Hugging Face classification guide, we must not download the model at runtime. We will write a script to fetch GPT-2 so we can bake it into our Docker image.
Create a file named download_model.py:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
SAVE_DIR = "./model_weights"

def download_and_save():
    print(f"Downloading {MODEL_NAME}...")

    # Download tokenizer and causal language model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

    # Save them locally
    os.makedirs(SAVE_DIR, exist_ok=True)
    tokenizer.save_pretrained(SAVE_DIR)
    model.save_pretrained(SAVE_DIR)
    print(f"GPT-2 saved successfully to {SAVE_DIR}/")

if __name__ == "__main__":
    download_and_save()
The Inference API (app.py)
Our FastAPI application will use Pydantic to not only accept a text prompt, but to optionally accept generation parameters with safe default values.
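Before writing the app, it is worth seeing what these constraints do in practice. The sketch below is a hypothetical plain-Python stand-in for the bounds we will declare with Pydantic's `Field`: out-of-range values are rejected outright (Pydantic returns an HTTP 422 for these) rather than silently clamped.

```python
def validate_generation_params(max_length=50, temperature=0.7):
    """Mirror the bounds our Pydantic model will enforce.

    max_length must be in [10, 200]; temperature must be in (0.0, 2.0].
    Raises ValueError on out-of-range input, analogous to Pydantic's
    422 validation error.
    """
    if not (10 <= max_length <= 200):
        raise ValueError("max_length must be between 10 and 200")
    if not (0.0 < temperature <= 2.0):
        raise ValueError("temperature must be greater than 0.0 and at most 2.0")
    return {"max_length": max_length, "temperature": temperature}

print(validate_generation_params())              # defaults pass
try:
    validate_generation_params(temperature=5.0)  # too hot: rejected
except ValueError as err:
    print(f"rejected: {err}")
```

Rejecting rather than clamping keeps the API honest: the caller learns immediately that a parameter was invalid instead of getting output produced with silently altered settings.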
Create a file named app.py:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline

app = FastAPI(title="GPT-2 Text Generation API")

# 1. Define the incoming payload with safe default parameters
class GenerationRequest(BaseModel):
    prompt: str = Field(..., description="The starting text for the model.")
    max_length: int = Field(default=50, ge=10, le=200, description="Maximum total length in tokens (prompt plus generated text).")
    temperature: float = Field(default=0.7, gt=0.0, le=2.0, description="Creativity level. Higher is more random.")

# 2. Load the model globally from the local directory
LOCAL_MODEL_DIR = "./model_weights"
try:
    generator = pipeline(
        "text-generation",
        model=LOCAL_MODEL_DIR,
        tokenizer=LOCAL_MODEL_DIR
    )
except Exception as e:
    print(f"Failed to load GPT-2 from {LOCAL_MODEL_DIR}. Error: {e}")
    generator = None

@app.post("/generate")
def generate_text(request: GenerationRequest):
    if generator is None:
        raise HTTPException(status_code=500, detail="Model is not loaded.")

    if not request.prompt.strip():
        raise HTTPException(status_code=400, detail="Prompt cannot be empty.")

    try:
        # Run the generation pipeline
        result = generator(
            request.prompt,
            max_length=request.max_length,
            temperature=request.temperature,
            do_sample=True,  # Sampling must be enabled, or temperature is ignored
            num_return_sequences=1,
            # Prevent the pipeline from printing a warning about setting pad_token_id
            pad_token_id=generator.tokenizer.eos_token_id
        )

        # Extract the generated text
        generated_text = result[0]['generated_text']
        return {
            "prompt": request.prompt,
            "generated_text": generated_text,
            "parameters_used": {
                "max_length": request.max_length,
                "temperature": request.temperature
            }
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
def health_check():
    return {"status": "healthy", "model_loaded": generator is not None}
Managing Dependencies (requirements.txt)
Create your requirements.txt:
fastapi==0.103.2
uvicorn==0.23.2
pydantic==2.4.2
transformers==4.34.0
The Dockerfile
We will execute our downloader script directly inside the Docker build process.
FROM python:3.10-slim
WORKDIR /app
# 1. Install standard requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# 2. Install CPU-only PyTorch
RUN pip install --no-cache-dir torch==2.1.0+cpu --index-url https://download.pytorch.org/whl/cpu
# 3. Bake in the model weights
COPY download_model.py .
RUN python download_model.py
# 4. Copy the API application
COPY app.py .
EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
The .dockerignore file
__pycache__/
*.pyc
.venv/
venv/
model_weights/
Deployment Steps
Generative models are computationally expensive, so allocate sufficient CPU and memory for the container before deploying.
Build the Docker Image:
docker build -t your-registry/gpt2-api:v1 .
Push to your Container Registry:
docker push your-registry/gpt2-api:v1
Deploy on the Platform:
- Visit Crane Cloud and create a project that uses the image your-registry/gpt2-api:v1
Readiness Probes: Always set the platform's health check to the /health endpoint to account for the model loading into memory when the container starts.
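The probe matters because GPT-2's weights take noticeable time to load, so the first few health checks may fail before the container is ready. The sketch below illustrates the retry behavior such a probe performs, using a stubbed check function in place of a real HTTP call to /health:

```python
import time

def wait_until_healthy(check, retries=5, delay=1.0):
    """Poll a health check until it passes or retries are exhausted.

    Returns the number of attempts it took; raises TimeoutError if the
    service never reports healthy.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        time.sleep(delay)
    raise TimeoutError("service never became healthy")

# Stub: pretend the model finishes loading on the third probe
responses = iter([False, False, True])
attempts = wait_until_healthy(lambda: next(responses), retries=5, delay=0.01)
print(f"healthy after {attempts} attempts")
```

A real platform probe does the same thing over HTTP: it keeps polling /health and only routes traffic to the container once the endpoint responds successfully.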
Testing the Endpoint
Send a JSON payload with your prompt. Notice how we are also passing our custom generation parameters:
curl -X POST "https://gpt2-api.ahumain.cranecloud.io/generate" \
-H "Content-Type: application/json" \
-d '{
"prompt": "The future of cloud native deployment is",
"max_length": 30,
"temperature": 0.8
}'
Expected Response
Because the model samples tokens randomly, your generated text will differ from run to run, but the response will have this shape:
{
"prompt": "The future of cloud native deployment is",
"generated_text": "The future of cloud native deployment is highly automated, scalable, and built directly on top of containers like Docker to ensure rapid, seamless delivery.",
"parameters_used": {
"max_length": 30,
"temperature": 0.8
}
}
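On the client side, this response can be parsed with the standard library. Here is a minimal sketch using a hard-coded sample body; in practice the body would come from an HTTP client such as requests:

```python
import json

# Sample body in the shape our /generate endpoint returns
raw = '''{
  "prompt": "The future of cloud native deployment is",
  "generated_text": "The future of cloud native deployment is highly automated",
  "parameters_used": {"max_length": 30, "temperature": 0.8}
}'''

body = json.loads(raw)

# The newly generated text is everything after the original prompt
continuation = body["generated_text"][len(body["prompt"]):].strip()
print(continuation)
print(body["parameters_used"]["temperature"])
```

Echoing the prompt and the parameters back in the response makes each result self-describing, which is handy when you are comparing outputs across different temperature settings.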