Deploying NLP Models: Hugging Face API Wrapper
In this guide, we will deploy an open-source Natural Language Processing (NLP) model from Hugging Face. We will wrap a pre-trained model (in this example, a sentiment analysis model) into a production-ready API.
The Golden Rule of Hugging Face Deployments: Never configure your application to download a model from the internet when the container starts. Large models take minutes to download, which will cause your deployment to fail our platform's health checks. Instead, we will write a script to download the model during the Docker build process so it is permanently baked into your image.
Prerequisites
Before starting, ensure you have:
- The name of the Hugging Face model you want to use (e.g., distilbert-base-uncased-finetuned-sst-2-english).
- Docker installed locally.
The Model Downloader (download_model.py)
First, we write a standalone Python script whose sole purpose is to fetch the model and tokenizer from Hugging Face and save them to a local folder.
Create a file named download_model.py:
```python
import os

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# The Hugging Face model ID
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

# The local directory where we will save the weights
SAVE_DIR = "./model_weights"


def download_and_save():
    print(f"Downloading {MODEL_NAME}...")

    # Download tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

    # Save them locally
    os.makedirs(SAVE_DIR, exist_ok=True)
    tokenizer.save_pretrained(SAVE_DIR)
    model.save_pretrained(SAVE_DIR)
    print(f"Model saved successfully to {SAVE_DIR}/")


if __name__ == "__main__":
    download_and_save()
```
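Before baking the weights into an image, it can help to sanity-check that the script actually produced a complete model directory. The sketch below is a minimal, hypothetical check: the helper name `verify_saved_model` is our own, and the exact filenames are assumptions based on what `save_pretrained` typically writes (a `config.json` plus a weights file such as `model.safetensors` or `pytorch_model.bin`).

```python
import os
import tempfile

# Files save_pretrained() typically produces (assumed, not exhaustive)
REQUIRED_FILES = {"config.json"}
WEIGHT_FILES = {"model.safetensors", "pytorch_model.bin"}  # one of these


def verify_saved_model(save_dir: str) -> bool:
    """Return True if save_dir looks like a complete saved model."""
    if not os.path.isdir(save_dir):
        return False
    present = set(os.listdir(save_dir))
    has_config = REQUIRED_FILES <= present
    has_weights = bool(WEIGHT_FILES & present)
    return has_config and has_weights


# Quick demonstration against a fake save directory:
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "config.json"), "w").close()
    open(os.path.join(d, "model.safetensors"), "w").close()
    print(verify_saved_model(d))  # True
```

Running a check like this as an extra `RUN` step after the download would make the Docker build fail fast if the download was incomplete, instead of failing at container startup.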
The Inference API (app.py)
Now we build the FastAPI application. Notice how we tell the transformers library to load the model from our local ./model_weights directory instead of fetching it from the internet.
Create a file named app.py:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI(title="Hugging Face NLP API")


# 1. Define the expected incoming JSON payload
class TextPayload(BaseModel):
    text: str


# 2. Load the model from the LOCAL directory, not the internet
LOCAL_MODEL_DIR = "./model_weights"

try:
    # We use the pipeline abstraction for easy inference
    nlp_pipeline = pipeline(
        "sentiment-analysis",
        model=LOCAL_MODEL_DIR,
        tokenizer=LOCAL_MODEL_DIR,
    )
except Exception as e:
    print(f"Failed to load model from {LOCAL_MODEL_DIR}. Error: {e}")
    nlp_pipeline = None


@app.post("/analyze")
def analyze_text(payload: TextPayload):
    if nlp_pipeline is None:
        raise HTTPException(status_code=500, detail="Model is not loaded.")
    if not payload.text.strip():
        raise HTTPException(status_code=400, detail="Text payload cannot be empty.")

    try:
        # Run the text through the Hugging Face pipeline
        result = nlp_pipeline(payload.text)[0]
        return {
            "input_text": payload.text,
            "label": result["label"],
            "score": round(result["score"], 4),
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.get("/health")
def health_check():
    return {"status": "healthy", "model_loaded": nlp_pipeline is not None}
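For reference, a `transformers` sentiment pipeline returns a list of `{"label": ..., "score": ...}` dicts, one per input text; that is why the handler indexes `[0]` and rounds the score. The sketch below isolates just that response-shaping step with a stub standing in for the real pipeline (the stub and the helper name `shape_response` are ours, not part of the app):

```python
# Stub standing in for the real sentiment-analysis pipeline, which
# returns a list of {"label": ..., "score": ...} dicts per input.
def fake_pipeline(text):
    return [{"label": "POSITIVE", "score": 0.998712}]


def shape_response(text: str) -> dict:
    # Mirrors the /analyze handler: take the first result, round the score
    result = fake_pipeline(text)[0]
    return {
        "input_text": text,
        "label": result["label"],
        "score": round(result["score"], 4),
    }


print(shape_response("Great service!"))
# {'input_text': 'Great service!', 'label': 'POSITIVE', 'score': 0.9987}
```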
Managing Dependencies (requirements.txt)
We need FastAPI and the Hugging Face ecosystem. Create your requirements.txt:
```text
fastapi==0.103.2
uvicorn==0.23.2
pydantic==2.4.2
transformers==4.34.0
```
Note on PyTorch: We will handle installing a lightweight, CPU-only version of PyTorch directly inside the Dockerfile to keep the image size manageable.
The Dockerfile
This is the most important step. We will execute download_model.py as a RUN command inside the Dockerfile. This ensures the gigabytes of model weights become a permanent layer in your Docker image.
```dockerfile
FROM python:3.10-slim

WORKDIR /app

# 1. Install dependencies first
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# 2. Install CPU-only PyTorch to save gigabytes of space
RUN pip install --no-cache-dir torch==2.1.0+cpu --index-url https://download.pytorch.org/whl/cpu

# 3. Copy the download script and execute it
# We do this BEFORE copying the rest of the app to leverage Docker caching
COPY download_model.py .
RUN python download_model.py

# 4. Copy the main API code
COPY app.py .

EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```
The .dockerignore file
Ensure you do not accidentally copy local model weights if you ran the download script on your own laptop:
```text
__pycache__/
*.pyc
.venv/
venv/
model_weights/
```
Deployment Steps
NLP models require careful resource allocation.
Build the Docker Image:

```shell
docker build -t your-registry/hf-nlp-api:v1 .
```

Push to your Container Registry:

```shell
docker push your-registry/hf-nlp-api:v1
```
Deploy on the Platform:
- Visit Crane Cloud and create a project to use the image your-registry/hf-nlp-api:v1.
- Memory Limits: Hugging Face models are RAM-heavy; allocate enough memory to hold the loaded weights with comfortable headroom, or the container will be killed at startup.
- Readiness Probes: Set the health check to /health. Because the model weights are already baked into the image, the cold-start time will only be the few seconds it takes to load those weights from disk into memory.
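If you want to verify readiness from a script rather than the platform UI, the /health response can be checked directly. This is a sketch using only the standard library; the hostname is the example from this guide, and `is_ready` / `check_deployment` are hypothetical helper names:

```python
import json
import urllib.request

# Example hostname from this guide; replace with your deployment's URL
HEALTH_URL = "https://hf-nlp-api.ahumain.cranecloud.io/health"


def is_ready(health_payload: dict) -> bool:
    # /health reports liveness plus whether the weights actually loaded
    return (
        health_payload.get("status") == "healthy"
        and health_payload.get("model_loaded") is True
    )


def check_deployment(url: str = HEALTH_URL) -> bool:
    with urllib.request.urlopen(url) as resp:
        return is_ready(json.load(resp))


# check_deployment()  # True once the container has loaded the weights
```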
Testing the Endpoint
Send a JSON payload containing the text you want to analyze:
```shell
curl -X POST "https://hf-nlp-api.ahumain.cranecloud.io/analyze" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Deploying models to this cloud native platform is incredibly smooth and easy!"
  }'
```
Expected Response
```json
{
  "input_text": "Deploying models to this cloud native platform is incredibly smooth and easy!",
  "label": "POSITIVE",
  "score": 0.9987
}
```
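The same request can be made from Python using only the standard library, which is handy for smoke tests in CI. This is a sketch: the URL is the example hostname from above, and `build_request` / `analyze` are hypothetical helper names.

```python
import json
import urllib.request

# Example hostname from this guide; replace with your deployment's URL
API_URL = "https://hf-nlp-api.ahumain.cranecloud.io/analyze"


def build_request(text: str) -> urllib.request.Request:
    """Build the JSON POST request that /analyze expects."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text: str) -> dict:
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(build_request(text)) as resp:
        return json.load(resp)


# analyze("Deploying models to this cloud native platform is incredibly smooth!")
```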