Learn how to serve a local LLM with FastAPI by building a multilingual sentiment analysis API using Aya-23 (8B) and llama-cpp-python on Google Colab. This guide walks you through model setup, API development, and public deployment via ngrok. No GPU or paid API required!
Large Language Models (LLMs) have become so common that almost everyone interacts with them on a daily basis. They are incredibly powerful tools, enabling us to automate complex, intelligent tasks that were considered practically impossible just a few years ago.
In this practical guide, we will put that power to work on a specific challenge: performing sentiment analysis on scraped social media text to determine its underlying emotional polarity (Positive / Negative / Neutral).
Reputation management is a key focus for any large business. To manage it, companies analyze the sentiment of social media posts to gauge their impact on the brand. With that goal in mind, I wanted to collect social media text content about a brand and analyze its sentiment.
In this quest, I explored several approaches:
- Classification Models (Machine Learning)
  - Support Vector Machine (SVM)
- Encoder-only Transformers
  - distilbert-base-uncased
  - cardiffnlp/twitter-xlm-roberta-base-sentiment
- Decoder-only Transformers (GPT style)
  - Aya 23 (a multilingual instruction-tuned LLM developed by Cohere)
Among these approaches, Aya 23 performed the best. Therefore, the application is built using the Aya 23 model for sentiment analysis.
Problem: The Bottleneck of External LLMs
While it might seem easiest to simply plug into a commercial cloud API (like OpenAI or Anthropic) to get this done, we are taking a different approach. Social media text is inherently chaotic, a messy combination of slang, missing punctuation, multiple languages, and emojis. Sending massive streams of this unstructured, potentially sensitive data to third-party servers introduces two major blockers for enterprise teams.
- Total Data Privacy: Sending streams of scraped user data to external servers is often a non-starter for strict compliance and security standards.
- Unpredictable Costs: Paying per token to classify thousands or millions of daily posts quickly drains the engineering budget.
The Solution: Bringing the LLM In House
By self-hosting a local LLM, you gain total control over your environment, keeping your data entirely private while eliminating per-request API costs.
In this guide, we will walk through how our team built a highly efficient, multilingual sentiment analysis API. We bypassed massive GPU requirements by using Aya-23 (8B), an open weights model optimized for multilingual tasks, and served it using FastAPI for maximum performance.
The Architecture: Technical Decisions
To make this independent from local machine hardware, we are building this stack in Google Colab (which provides a free Linux environment).
Here is the breakdown of the engine running under the hood:
- The Engine (llama.cpp): We utilize the llama-cpp-python binding to execute the model. This highly optimized C++ port is engineered to run large models with maximum efficiency on standard CPU hardware.
- The Model (Aya-23 8B GGUF): Aya 23 is an open weights research release of an instruction fine-tuned model with highly advanced multilingual capabilities which perfectly fits within our requirements. We chose the GGUF quantized format to drastically reduce the memory footprint, utilizing 8 threads and a 2048-token context window to maintain manageable resource usage.
- The API (FastAPI): A modern, lightning fast web framework designed to handle asynchronous Python natively for high performance applications.
- The Bridge (nest_asyncio & pyngrok): These essential utilities allow us to run the Uvicorn web server within Colab's existing event loop and expose the service securely to the public web.
Implementation
Before we dive into the code, here is the step-by-step plan we will follow:
- Code Preparation - Installing the necessary Python packages.
- Imports & Global Variables - Setting up our environment and constants.
- Model Download - Pulling the quantized Aya-23 model from Hugging Face.
- FastAPI Models & Logic - Defining Pydantic models and crafting the LLM prompt.
- FastAPI App - Building the API endpoints.
- Public Tunneling - Exposing the local server to the internet using ngrok.
- Using the API - Testing our live endpoint with real JSON payloads.
Step 1 : Code Preparation
Open a notebook in Google Colab and run the pip install command to install all the relevant packages for the project.
!pip install -q llama-cpp-python huggingface_hub fastapi uvicorn nest_asyncio pyngrok
Step 2 : Imports and Global Variables
Import all the necessary Python libraries and define global constants used throughout the notebook, such as model paths and server configuration.
# Import the json module for working with JSON data.
import json
# Import the logging module for logging messages and events.
import logging
# Import the os module for interacting with the operating system (e.g., environment variables).
import os
# Import Path from pathlib for object-oriented filesystem paths.
from pathlib import Path
# Import List and Literal from typing for type hints.
from typing import List, Literal
# Import nest_asyncio to patch asyncio for nested event loops.
import nest_asyncio
# Import uvicorn, an ASGI server for running FastAPI.
import uvicorn
# Import FastAPI and HTTPException for building the web API.
from fastapi import FastAPI, HTTPException
# Import snapshot_download from huggingface_hub for downloading models from Hugging Face.
from huggingface_hub import snapshot_download
# Import Llama from llama_cpp for interacting with local GGUF models.
from llama_cpp import Llama
# Import BaseModel and Field from pydantic for data validation and settings management.
from pydantic import BaseModel, Field
# Import ngrok from pyngrok for creating public URLs.
from pyngrok import ngrok

# Configure basic logging with INFO level and a specific format.
logging.basicConfig(level=logging.INFO, format='%(asctime)s | %(levelname)s | %(message)s')
# Get a logger instance specific to 'aya-sentiment' for custom logging.
logger = logging.getLogger('aya-sentiment')

# Define the Hugging Face repository ID for the model.
MODEL_REPO_ID = 'bartowski/aya-23-8B-GGUF'
# Define the specific model file name within the repository.
MODEL_FILE = 'aya-23-8B-Q5_K_M.gguf'
# Define the local directory where the model will be stored.
MODEL_DIR = Path('/content/aya')
# Construct the full path to the model file.
MODEL_PATH = MODEL_DIR / MODEL_FILE
# Define the batch size for processing multiple texts at once.
BATCH_SIZE = 20
# Define the port number for the FastAPI application.
PORT = 8000
Step 3 : Model Download and Initialization
Multilingual LLMs like Aya are often large files, ranging from 2GB to 8GB. Therefore, we make sure to download the model only when it is not already available locally.
# Create the model directory if it doesn't exist.
# `parents=True` creates any necessary parent directories,
# `exist_ok=True` prevents an error if the directory already exists.
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Check if the model file already exists locally.
if not MODEL_PATH.exists():
    logger.info('Model file not found. Downloading from Hugging Face...')
    # Download the model snapshot from the Hugging Face repository.
    snapshot_download(
        repo_id=MODEL_REPO_ID,      # The ID of the repository on Hugging Face.
        local_dir=str(MODEL_DIR),   # The local directory where the model will be saved.
        allow_patterns=MODEL_FILE,  # Only download files matching this pattern (the specific GGUF model file).
    )

# Initialize the Llama model using llama_cpp.
llm = Llama(
    model_path=str(MODEL_PATH),  # Path to the downloaded GGUF model file.
    n_gpu_layers=0,              # Number of layers to offload to GPU (0 means CPU only in this case).
    n_threads=8,                 # Number of threads to use for inference.
    n_ctx=2048,                  # The context window size for the model (maximum tokens).
)

# Log a message confirming the LLM has been initialized.
logger.info('LLM initialized from %s', MODEL_PATH)

Congratulations, your local LLM is now initialized and ready.
Step 4 : FastAPI Models and Inference Function
This defines the Pydantic models used for validating incoming request data and structuring outgoing response data for the FastAPI application. It also contains the _infer_batch function, which is responsible for taking a batch of text, constructing a prompt for the LLM, making the inference call, and parsing the LLM's JSON output into a structured format.
# Defines a Pydantic model for the request payload of the sentiment analysis endpoint.
class SentimentInput(BaseModel):
    # 'sm_content' is a required list of strings with at least one item.
    sm_content: List[str] = Field(..., min_length=1, description='List of texts')

# Defines a Pydantic model for a single sentiment prediction.
class SentimentItem(BaseModel):
    text: str  # The original text that was analyzed.
    sentiment: Literal['positive', 'negative', 'neutral', 'unknown']  # The predicted sentiment.
    reason: str  # A reason or explanation for the predicted sentiment.

# Defines how each sentiment result is structured in the response.
class SentimentResponseItem(BaseModel):
    sm_content: str  # Original content that was sent for analysis.
    sentiment_prediction: SentimentItem  # The sentiment prediction for that content.

# Defines the overall Pydantic model for the sentiment analysis API response.
class SentimentResponse(BaseModel):
    # A list of SentimentResponseItem objects, containing results for all input texts.
    sentiment_results: List[SentimentResponseItem]

# Helper function to extract and parse the first JSON object from a raw text string.
def _extract_json_object(raw_text: str) -> dict:
    """Extract the first JSON object from model text and parse it."""
    # Locate the outermost braces of the JSON object.
    start = raw_text.find('{')
    end = raw_text.rfind('}')
    # Check that both braces were found and appear in the correct order.
    if start == -1 or end == -1 or end <= start:
        raise ValueError('Model output does not contain a valid JSON object')
    # Extract the JSON substring and parse it into a Python dictionary.
    return json.loads(raw_text[start:end + 1])

# Function to perform sentiment inference for a batch of texts.
def _infer_batch(batch: List[str]) -> List[SentimentItem]:
    # Create a list of dictionaries, each containing an index and the text for processing.
    indexed_payload = [{'index': i, 'text': text} for i, text in enumerate(batch)]
    # Construct the prompt for the LLM, instructing it on the task and desired output format.
    prompt = (
        'You are a sentiment analysis assistant. '  # System instruction for the LLM's role.
        'Classify each input text as positive, negative, or neutral. '  # Task description.
        'Return ONLY valid JSON with this shape: '  # Strict output format requirement.
        '{"results":[{"index":0,"text":"...","sentiment":"positive|negative|neutral","reason":"..."}]}. '  # Example JSON structure.
        'Do not include markdown, tables, or extra text. '  # Further output constraints.
        f'Inputs: {json.dumps(indexed_payload, ensure_ascii=False)}'  # The actual inputs to be analyzed, formatted as JSON.
    )
    # Call the LLM (llama_cpp model) for chat completion.
    response = llm.create_chat_completion(
        messages=[
            {'role': 'system', 'content': 'You must output strict JSON only.'},  # System message enforcing strict JSON output.
            {'role': 'user', 'content': prompt},  # The user's prompt containing the task and inputs.
        ],
        max_tokens=1024,  # Maximum number of tokens the model can generate.
        temperature=0.1,  # Controls randomness; lower values mean more deterministic output.
    )
    raw = response['choices'][0]['message']['content']  # Extract the raw content of the model's response.
    parsed = _extract_json_object(raw)  # Parse the raw response to extract the JSON object.
    raw_results = parsed.get('results', [])
    # Map parsed results by their original position so outputs can be re-ordered reliably.
    by_index = {}
    # Iterate through each item in the raw results from the model.
    for item in raw_results:
        idx = item.get('index')
        if isinstance(idx, int):
            # Get the sentiment and convert it to lowercase.
            sentiment = str(item.get('sentiment', 'unknown')).lower()
            # Validate the sentiment against allowed values; default to 'unknown' if invalid.
            if sentiment not in {'positive', 'negative', 'neutral'}:
                sentiment = 'unknown'
            # Store the parsed sentiment item in the dictionary.
            by_index[idx] = SentimentItem(
                text=str(item.get('text', batch[idx] if 0 <= idx < len(batch) else '')),
                sentiment=sentiment,
                reason=str(item.get('reason', 'No reason provided by model')),
            )
    # Build the final list of sentiment results in the original input order.
    results: List[SentimentItem] = []
    for i, text in enumerate(batch):
        # Append the parsed result for the current index, or a default 'unknown' if not found.
        results.append(
            by_index.get(
                i,
                SentimentItem(text=text, sentiment='unknown', reason='No valid parsed result for this input'),
            )
        )
    return results  # Return the list of sentiment items.

Now you have the Aya model returning a response for the query you made with the prompt and input payload.
Step 5 : FastAPI App
This step initializes the FastAPI application and defines two API endpoints: a /health endpoint for status checks and an /analyze_sentiment endpoint for performing sentiment analysis.
# Initialize a FastAPI application with a title and version.
app = FastAPI(title='Aya 23 Sentiment API', version='2.0.0')

# Define a GET endpoint for health checks.
@app.get('/health')
async def health():
    return {'status': 'ok'}

# Define a POST endpoint for sentiment analysis.
@app.post('/analyze_sentiment', response_model=SentimentResponse)
async def analyze_sentiment(payload: SentimentInput):
    try:
        texts = payload.sm_content
        sentiment_results: List[SentimentResponseItem] = []
        # Process texts in batches to optimize LLM calls.
        for i in range(0, len(texts), BATCH_SIZE):
            batch = texts[i:i + BATCH_SIZE]
            batch_predictions = _infer_batch(batch)  # Get sentiment predictions for the current batch.
            # Pair each original text with its prediction and add it to the results.
            for original_text, prediction in zip(batch, batch_predictions):
                sentiment_results.append(
                    SentimentResponseItem(
                        sm_content=original_text,
                        sentiment_prediction=prediction,
                    )
                )
        # Return the comprehensive sentiment response.
        return SentimentResponse(sentiment_results=sentiment_results)
    except Exception as e:
        logger.exception('Sentiment processing failed')
        raise HTTPException(status_code=500, detail=f'Error processing request: {str(e)}')
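The batching loop in /analyze_sentiment can be exercised in isolation. This minimal sketch (the chunk helper is hypothetical, added here only for illustration) shows how 45 inputs would be split under the same BATCH_SIZE = 20 setting:

```python
BATCH_SIZE = 20  # Same value as the global constant defined in Step 2.

def chunk(texts, size=BATCH_SIZE):
    """Yield successive slices, mirroring `for i in range(0, len(texts), BATCH_SIZE)`."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]

texts = [f'post {n}' for n in range(45)]
batch_sizes = [len(b) for b in chunk(texts)]
print(batch_sizes)  # [20, 20, 5]
```

Batching like this keeps each prompt within the 2048-token context window while still amortizing the fixed prompt overhead across many inputs.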
Step 6 : Serving the API with a Public URL
This step exposes the FastAPI server publicly using ngrok and starts the uvicorn server in a separate thread so it does not block the Colab environment. You will need to create a free ngrok account and obtain its authtoken.
import threading

token = os.getenv('NGROK_AUTHTOKEN')  # Set your secret token as an environment variable.
if token:
    ngrok.set_auth_token(token)
else:
    logger.warning('NGROK_AUTHTOKEN is not set. Public tunnel will fail unless token is configured.')

public_url = ngrok.connect(PORT)
print(f'Public URL: {public_url}')

nest_asyncio.apply()  # Patch the event loop for Google Colab.

def run_uvicorn():
    uvicorn.run(app, host='0.0.0.0', port=PORT)

thread = threading.Thread(target=run_uvicorn)
thread.start()
logger.info('FastAPI server started in a background thread.')

Congratulations, you now have a publicly available REST service that runs a multilingual large language model with sentiment analysis capability.
Step 7 : Using the API
Now you can call the “analyze_sentiment” POST endpoint on the ngrok-tunneled URL to perform sentiment analysis on text content from any app.
Endpoint Information
- URL = api_url/analyze_sentiment
- Method = POST
- Content-Type = application/json
The request body must be a JSON object containing an array of strings.
Example:
Expected request payload:
{
"sm_content": [
"I love this product!",
"This is the worst update ever.",
"It is okay, nothing special."
]
}
Expected response:
{
"sentiment_results": [
{
"sm_content": "I love this product!",
"sentiment_prediction": {
"text": "I love this product!",
"sentiment": "positive",
"reason": "The use of positive words like \"love\" indicates a positive sentiment."
}
},
{
"sm_content": "This is the worst update ever.",
"sentiment_prediction": {
"text": "This is the worst update ever.",
"sentiment": "negative",
"reason": "The phrase \"worst update ever\" conveys a highly negative opinion."
}
},
{
"sm_content": "It is okay, nothing special.",
"sentiment_prediction": {
"text": "It is okay, nothing special.",
"sentiment": "neutral",
"reason": "The sentiment is neutral as the statement provides no positive or negative emotion, only a description."
}
}
]
}
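From any Python client, calling the endpoint is a single POST request. Here is a minimal sketch; the ngrok URL is a placeholder you must replace with your own from Step 6, and the live call (which assumes the third-party requests package) is left commented out so the payload construction can be seen on its own:

```python
import json

def build_payload(texts):
    """Build the JSON body expected by /analyze_sentiment."""
    return {'sm_content': list(texts)}

payload = build_payload([
    'I love this product!',
    'This is the worst update ever.',
])

# Live call (replace the placeholder host with your ngrok URL from Step 6):
# import requests
# resp = requests.post(
#     'https://<your-ngrok-id>.ngrok-free.app/analyze_sentiment',
#     json=payload,
#     timeout=120,
# )
# for item in resp.json()['sentiment_results']:
#     print(item['sm_content'], '->', item['sentiment_prediction']['sentiment'])

print(json.dumps(payload, indent=2))
```

Remember that inference on CPU takes time, so set a generous client timeout when sending large batches.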
Conclusion
In conclusion, this project sets up a FastAPI based sentiment analysis service using a local Aya 23 GGUF large language model.
Here's a breakdown of what we've done:
- Model Hosting - We download and load the Aya 23 GGUF model locally using llama-cpp-python for efficient, offline inference.
- API Development - A FastAPI application is built with Pydantic models for structured input/output, offering a /health endpoint and an /analyze_sentiment endpoint.
- Sentiment Inference - The core logic includes batch processing of text inputs, crafting a specific prompt for the LLM, and parsing the LLM's JSON output to extract sentiment (positive, negative, neutral, unknown) and a reason.
- Public Exposure - The FastAPI service runs on uvicorn in a separate thread in the Colab environment and is exposed publicly using ngrok, providing a convenient URL to interact with the API from outside Colab.
Essentially, you have a functional, local LLM powered sentiment analysis API ready to receive text and return categorized sentiments.



