Extracting Structured Data from Images Using AI: Why GPT-4o Beats Traditional OCR for Real-World Documents

I've built traditional OCR pipelines using LEADTOOLS, Tesseract, and ABBYY. They work until they don't. A slightly rotated scan, a different font weight, a handwritten field in the margin, or a table with merged cells, and accuracy collapses. You end up with brittle regex patterns, endless exception handling, and a maintenance burden that grows with every new document format the client sends.

Then I started using AI vision models for document extraction. The difference is significant enough that I've replaced traditional OCR with AI-based extraction on every new project since. This post explains the approach, shows the C# implementation, and covers when it works best and where to be careful.

Why Traditional OCR Fails on Real-World Documents

Traditional OCR engines convert image pixels to text. They do this well when documents are clean, consistently formatted, and machine-printed. Real-world documents are none of these things:

Invoices come from hundreds of different vendors with different layouts, fonts, and field positions.
Legal documents have handwritten annotations, stamps, and signatures that break character recognition.
Medical forms mix printed and handwritten content in the same field.
Scanned contracts are often skewed, noisy, or have watermarks.

Writing code to handle all of this — column detection, bounding box logic, field mapping, layout parsing — is a significant engineering effort. And every new document type breaks something.

The AI Approach: Describe What You Want, Get It

Modern vision models such as OpenAI's GPT-4o (also available through Azure OpenAI) can read a document in context, much as a human reader would. You send an image and a prompt describing exactly what data you want extracted and in what format. The model reads the document, interprets its structure, and returns the data.

No layout parsing. No bounding boxes. No regex. No column detection. Just a prompt and a JSON response.

Real Example: Extracting Invoice Data

Let's say you receive vendor invoices as scanned PDFs or photos. You need to extract: invoice number, vendor name, invoice date, due date, line items (description, quantity, unit price, total), subtotal, tax, and grand total. Here's the complete implementation in C# using Azure OpenAI.

Step 1: Set Up Azure OpenAI Client

// Install: dotnet add package Azure.AI.OpenAI
// (this brings in the OpenAI package, which provides the OpenAI.Chat types used below)

// appsettings.json (keep the real key in user secrets or Key Vault, not in source control)
{
  "AzureOpenAI": {
    "Endpoint": "https://your-resource.openai.azure.com/",
    "ApiKey":   "your-api-key",
    "Model":    "gpt-4o"
  }
}

// Program.cs (needs: using Azure; using Azure.AI.OpenAI;)
builder.Services.AddSingleton(sp =>
{
    var config   = sp.GetRequiredService<IConfiguration>();
    var endpoint = new Uri(config["AzureOpenAI:Endpoint"]!);
    var cred     = new AzureKeyCredential(config["AzureOpenAI:ApiKey"]!);
    return new AzureOpenAIClient(endpoint, cred);
});

builder.Services.AddScoped<IDocumentExtractionService, DocumentExtractionService>();

Step 2: Define Your Output Model

The key to reliable extraction is telling the AI exactly what JSON structure to return. Define a C# model first, then instruct the AI to match it.

public class InvoiceData
{
    public string?       InvoiceNumber { get; set; }
    public string?       VendorName    { get; set; }
    public string?       VendorAddress { get; set; }
    public DateTime?     InvoiceDate   { get; set; }
    public DateTime?     DueDate       { get; set; }
    public List<LineItem> LineItems    { get; set; } = [];
    public decimal?      Subtotal      { get; set; }
    public decimal?      TaxAmount     { get; set; }
    public decimal?      GrandTotal    { get; set; }
    public string?       Currency      { get; set; }
    public string?       PaymentTerms  { get; set; }
    public float         Confidence    { get; set; }
    public string?       Notes         { get; set; }
}

public class LineItem
{
    public string?  Description { get; set; }
    public decimal? Quantity    { get; set; }
    public decimal? UnitPrice   { get; set; }
    public decimal? Total       { get; set; }
}
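To make the mapping concrete, here is a minimal sketch of how a JSON reply lands in these classes. The sample JSON is made up for illustration (it is not real model output), and the classes are trimmed-down copies of the ones above so the snippet compiles on its own. The prompt asks for camelCase keys while the C# properties are PascalCase, which is why `PropertyNameCaseInsensitive` is set.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Trimmed-down copies of the models above, so the sample compiles on its own.
public class LineItem
{
    public string?  Description { get; set; }
    public decimal? Quantity    { get; set; }
    public decimal? UnitPrice   { get; set; }
    public decimal? Total       { get; set; }
}

public class InvoiceData
{
    public string?        InvoiceNumber { get; set; }
    public DateTime?      InvoiceDate   { get; set; }
    public List<LineItem> LineItems     { get; set; } = new();
    public decimal?       GrandTotal    { get; set; }
    public float          Confidence    { get; set; }
}

public static class InvoiceJsonDemo
{
    // Hypothetical example of a model reply (camelCase keys, ISO date).
    public const string SampleJson = """
        {
          "invoiceNumber": "INV-1001",
          "invoiceDate": "2024-03-15",
          "lineItems": [
            { "description": "Widget", "quantity": 2, "unitPrice": 9.50, "total": 19.00 }
          ],
          "grandTotal": 19.00,
          "confidence": 0.93
        }
        """;

    // Case-insensitive matching maps "invoiceNumber" onto InvoiceNumber, etc.
    public static InvoiceData? Parse(string json) =>
        JsonSerializer.Deserialize<InvoiceData>(json,
            new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
}
```

Calling `InvoiceJsonDemo.Parse(InvoiceJsonDemo.SampleJson)` should give an `InvoiceData` with one line item. Fields that the JSON omits or sets to null stay null, which is why most properties are nullable.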

Step 3: The Extraction Service

// Needs: using System.Text.Json; using Azure.AI.OpenAI; using OpenAI.Chat;
public class DocumentExtractionService : IDocumentExtractionService
{
    private readonly AzureOpenAIClient _client;
    private readonly IConfiguration    _config;

    public DocumentExtractionService(
        AzureOpenAIClient client,
        IConfiguration config)
    {
        _client = client;
        _config = config;
    }

    public async Task<InvoiceData?> ExtractInvoiceAsync(
        byte[] imageBytes,
        string mimeType = "image/jpeg")
    {
        var chatClient = _client.GetChatClient(_config["AzureOpenAI:Model"]);

        // Hand the raw bytes to the SDK, which encodes them for the request.
        // (Wrapping a base64 data URI in new Uri(...) can throw UriFormatException
        // on runtimes that cap URI length at roughly 65K characters.)
        var imageData = BinaryData.FromBytes(imageBytes);

        var prompt = """
            You are a precise document data extraction assistant.
            Extract all invoice data from this image and return it as JSON.

            Return ONLY valid JSON with this exact structure (no markdown, no explanation):
            {
              "invoiceNumber": "string or null",
              "vendorName":    "string or null",
              "vendorAddress": "string or null",
              "invoiceDate":   "YYYY-MM-DD or null",
              "dueDate":       "YYYY-MM-DD or null",
              "lineItems": [
                {
                  "description": "string or null",
                  "quantity":    number or null,
                  "unitPrice":   number or null,
                  "total":       number or null
                }
              ],
              "subtotal":    number or null,
              "taxAmount":   number or null,
              "grandTotal":  number or null,
              "currency":    "ISO 4217 code such as USD, GBP, AED, INR, or null",
              "paymentTerms": "string or null",
              "confidence":   0.0 to 1.0,
              "notes": "any unusual observations about the document"
            }

            Rules:
            - Extract numbers as numeric values, not strings
            - If a field is not visible, return null
            - confidence = your estimate of extraction accuracy
            - Do not invent or guess values you cannot clearly read
            """;

        var messages = new List<ChatMessage>
        {
            new UserChatMessage(
                ChatMessageContentPart.CreateTextPart(prompt),
                ChatMessageContentPart.CreateImagePart(imageData, mimeType))
        };

        var response = await chatClient.CompleteChatAsync(
            messages,
            new ChatCompletionOptions
            {
                MaxOutputTokenCount = 2000,
                Temperature         = 0.1f  // Low temp = more deterministic output
            });

        var json = response.Value.Content[0].Text;

        // Strip markdown fences if the model adds them despite instructions
        json = json
            .Replace("```json", "")
            .Replace("```",     "")
            .Trim();

        return JsonSerializer.Deserialize<InvoiceData>(json,
            new JsonSerializerOptions
            {
                PropertyNameCaseInsensitive = true
            });
    }
}
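The fence stripping above handles the most common failure, but the model sometimes wraps the JSON in a sentence or two as well. As a hedged alternative, here is a small sketch that keeps only the text between the first `{` and the last `}`. `ModelOutputCleaner` and `ExtractJsonObject` are names I made up for this example, and it assumes the reply contains a single top-level JSON object.

```csharp
using System;

public static class ModelOutputCleaner
{
    // Returns the text between the first '{' and the last '}' in the reply,
    // which drops markdown fences and any commentary around the JSON object.
    // Returns null when the reply contains no brace-delimited span.
    public static string? ExtractJsonObject(string? modelReply)
    {
        if (string.IsNullOrWhiteSpace(modelReply))
            return null;

        int start = modelReply.IndexOf('{');
        int end   = modelReply.LastIndexOf('}');

        if (start < 0 || end <= start)
            return null;

        return modelReply.Substring(start, end - start + 1);
    }
}
```

This does not validate the JSON. It only narrows the string, so `JsonSerializer.Deserialize` can still throw `JsonException` and should still be guarded.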

Processing Multiple Document Types with One Service

The real power is how easily the same pattern adapts to different document types: keep the service, change the prompt and the output model. Here's a generic version of the extraction method.

// Generic extraction method — pass any prompt and expected output type
public async Task<T?> ExtractAsync<T>(
    byte[] imageBytes,
    string extractionPrompt,
    string mimeType = "image/jpeg") where T : class
{
    var chatClient = _client.GetChatClient(_config["AzureOpenAI:Model"]);

    // Pass raw bytes instead of a data URI; long data URIs can fail in new Uri(...)
    var imageData  = BinaryData.FromBytes(imageBytes);

    var schema = GenerateJsonSchema<T>(); // Use reflection or hard-coded schema

    var systemPrompt = $"""
        You are a precise document data extraction AI.
        Extract the requested data and return ONLY valid JSON.
        Schema to follow: {schema}
        If a field is not found in the document, set it to null.
        Do not return markdown, explanations, or commentary.
        """;

    var messages = new List<ChatMessage>
    {
        new SystemChatMessage(systemPrompt),
        new UserChatMessage(
            ChatMessageContentPart.CreateTextPart(extractionPrompt),
            ChatMessageContentPart.CreateImagePart(imageData, mimeType))
    };

    var response = await chatClient.CompleteChatAsync(messages,
        new ChatCompletionOptions { Temperature = 0.1f });

    // Strip markdown fences the same way ExtractInvoiceAsync does
    var json = response.Value.Content[0].Text
        .Replace("```json", "")
        .Replace("```",     "")
        .Trim();

    return JsonSerializer.Deserialize<T>(json,
        new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
}
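The `GenerateJsonSchema<T>()` call above is left undefined. One way it could work, as a rough reflection-based sketch: walk the public properties of `T` and emit a JSON skeleton in the same "string or null" style used in the invoice prompt. This is not formal JSON Schema, `SchemaSketch` and `ReceiptSample` are hypothetical names for illustration, and the walker does not guard against self-referencing types. On .NET 9 and later, the built-in `JsonSchemaExporter` in System.Text.Json may be a better fit.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Text.Json;

// Hypothetical output model, used only to demonstrate the helper.
public class ReceiptSample
{
    public string?      StoreName { get; set; }
    public decimal?     Total     { get; set; }
    public List<string> Items     { get; set; } = new();
}

public static class SchemaSketch
{
    // Produces a JSON skeleton describing T, with camelCase property names.
    public static string GenerateJsonSchema<T>() =>
        JsonSerializer.Serialize(Describe(typeof(T)));

    private static object Describe(Type type)
    {
        // Treat int? like int, decimal? like decimal, and so on.
        type = Nullable.GetUnderlyingType(type) ?? type;

        if (type == typeof(string)) return "string or null";
        if (type == typeof(bool))   return "true, false, or null";
        if (type == typeof(DateTime) || type == typeof(DateOnly))
            return "YYYY-MM-DD or null";
        if (type.IsPrimitive || type == typeof(decimal))
            return "number or null";

        // Collections: describe the element type inside a one-element array.
        if (typeof(IEnumerable).IsAssignableFrom(type))
        {
            var element = type.IsArray
                ? type.GetElementType()!
                : type.GetGenericArguments().FirstOrDefault() ?? typeof(string);
            return new[] { Describe(element) };
        }

        // Plain classes: recurse into each public instance property.
        var shape = new Dictionary<string, object>();
        foreach (var prop in type.GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            var name = JsonNamingPolicy.CamelCase.ConvertName(prop.Name);
            shape[name] = Describe(prop.PropertyType);
        }
        return shape;
    }
}
```

For `ReceiptSample`, the result should contain entries like `"storeName":"string or null"` and `"items":["string or null"]`. A hard-coded schema string per document type is a perfectly reasonable alternative when you only have a handful of models.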

// Usage examples:

// Extract business card data
var card = await _extractor.ExtractAsync<BusinessCardData>(
    imageBytes,
    "Extract all contact information from this business card.");

// Extract medical form data
var form = await _extractor.ExtractAsync<MedicalFormData>(
    imageBytes,
    "Extract patient name, DOB, medications listed, and doctor name from this form.");

// Extract receipt data
var receipt = await _extractor.ExtractAsync<ReceiptData>(
    imageBytes,
    "Extract store name, date, all items purchased with prices, and total amount.");

Using It in a Blazor File Upload Flow

Here is how to wire this into a Blazor Server component where users upload an invoice image and see the extracted data instantly.

@page "/invoice-upload"
@inject IDocumentExtractionService Extractor

<h2>Upload Invoice</h2>

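@* ExtractInvoiceAsync expects an image, so PDFs are not accepted here; they need the page-rendering step covered in the PDF section *@
<InputFile OnChange="HandleFileSelected" accept="image/*" />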

@if (_loading)
{
    <p>Extracting data...</p>
}

@if (_invoice is not null)
{
    <div class="invoice-preview">
        <h3>@_invoice.VendorName — @_invoice.InvoiceNumber</h3>
        <p>Date: @_invoice.InvoiceDate?.ToString("d")</p>
        <p>Total: @_invoice.GrandTotal?.ToString("N2") @_invoice.Currency</p>
        <p>Confidence: @_invoice.Confidence.ToString("P0")</p>

        <table>
            @foreach (var item in _invoice.LineItems)
            {
                <tr>
                    <td>@item.Description</td>
                    <td>@item.Quantity</td>
                    @* "N2" instead of "C": the server culture's currency symbol may not match the invoice currency *@
                    <td>@item.UnitPrice?.ToString("N2")</td>
                    <td>@item.Total?.ToString("N2")</td>
                </tr>
            }
        </table>
    </div>
}

@code {
    private bool        _loading = false;
    private InvoiceData? _invoice = null;

    private async Task HandleFileSelected(InputFileChangeEventArgs e)
    {
        _loading = true;
        _invoice = null;

        try
        {
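            var file = e.File;

            // Open the stream first: OpenReadStream throws when the file exceeds
            // maxAllowedSize, so no buffer is allocated for oversized uploads.
            await using var stream = file.OpenReadStream(maxAllowedSize: 10 * 1024 * 1024);
            var buffer = new byte[file.Size];
            await stream.ReadExactlyAsync(buffer);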

            _invoice = await Extractor.ExtractInvoiceAsync(
                buffer, file.ContentType);
        }
        catch (Exception ex)
        {
            Console.Error.WriteLine($"Extraction failed: {ex.Message}");
        }
        finally
        {
            _loading = false;
        }
    }
}

Handling PDFs — Converting Pages to Images First

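The vision API accepts images, not PDFs. For multi-page PDF invoices, convert each page to a JPEG first, then process each page separately. We use PdfiumViewer, a .NET wrapper around the open-source PDFium library, for this. Note that it renders through System.Drawing, which is supported only on Windows from .NET 6 onward.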

// Install: dotnet add package PdfiumViewer.Native.x86_64.v8-xfa
// Install: dotnet add package PdfiumViewer

public async Task<List<InvoiceData?>> ExtractFromPdfAsync(byte[] pdfBytes)
{
    var results = new List<InvoiceData?>();

    using var stream    = new MemoryStream(pdfBytes);
    using var document  = PdfDocument.Load(stream);

    for (int page = 0; page < document.PageCount; page++)
    {
        // Render at 200 DPI for good quality
        using var image  = document.Render(page, 200, 200, false);
        using var ms     = new MemoryStream();
        image.Save(ms, System.Drawing.Imaging.ImageFormat.Jpeg);

        var pageBytes = ms.ToArray();
        var result    = await ExtractInvoiceAsync(pageBytes, "image/jpeg");

        if (result is not null)
            results.Add(result);
    }

    return results;
}

Accuracy, Cost, and When to Use AI vs Traditional OCR

AI extraction is not free. GPT-4o is billed per token, and a full-page scan can add several hundred image tokens to each request before the model produces any output. Here's a practical breakdown to help you decide:

Use AI extraction when:
• Documents come in many formats from different sources (invoices, contracts, forms)
• Accuracy is critical and manual correction is expensive
• Documents contain handwriting, stamps, or non-standard layouts
• You need structured data, not just raw text
• Volume is moderate (hundreds per day, not millions)

Use traditional OCR when:
• Documents are always the same template (your own forms)
• Volume is very high (millions per day) and cost matters at scale
• You only need raw text extraction, not structured fields
• Latency is critical (OCR is faster than an API call)

For typical business document processing (invoices, medical records, legal forms, receipts), AI extraction saves significant development time and, in my experience, delivers higher accuracy than any traditional pipeline I've built. The prompt becomes your main maintenance surface: when a new document layout appears, you usually adjust the prompt, not the code.
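To put a rough number on the image cost, here is a sketch of the tile-counting rule OpenAI has documented for GPT-4o high-detail image inputs: fit the image inside 2048 x 2048, shrink it so the shortest side is 768 pixels, count 512-pixel tiles, then charge 170 tokens per tile plus a base of 85. Treat the constants as assumptions that may change. I am also assuming images are only scaled down, never up. `ImageTokenEstimator` is a made-up helper name, and you should check the current pricing documentation for your model and deployment before budgeting on this.

```csharp
using System;

public static class ImageTokenEstimator
{
    // Estimates input tokens for one high-detail image, using the tile rule
    // described above. The constants (2048, 768, 512, 85, 170) are assumptions
    // taken from published GPT-4o guidance and may not match other models.
    public static int EstimateHighDetailTokens(int width, int height)
    {
        double w = width, h = height;

        // 1. Fit inside a 2048 x 2048 square, preserving aspect ratio.
        double fit = Math.Min(1.0, 2048.0 / Math.Max(w, h));
        w *= fit;
        h *= fit;

        // 2. Shrink so the shortest side is at most 768 pixels.
        double shrink = Math.Min(1.0, 768.0 / Math.Min(w, h));
        w *= shrink;
        h *= shrink;

        // 3. Count the 512-pixel tiles needed to cover the scaled image.
        int tiles = (int)Math.Ceiling(w / 512.0) * (int)Math.Ceiling(h / 512.0);

        // 4. Base cost plus a per-tile cost.
        return 85 + 170 * tiles;
    }
}
```

Under these assumptions a 1024 x 1024 image comes out at 765 tokens and a 2048 x 4096 image at 1105. Multiply by your daily page count and the per-token input price to get a first estimate, then add the prompt and output tokens on top.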

Tips for Production Reliability

1. Always validate the output: check that numeric totals add up, required fields are non-null, and dates are valid before saving to your database.

2. Use a low temperature (0.1 here): lower temperature makes the output more deterministic and consistent. Higher values introduce variation that data extraction does not need.

3. Include a confidence score in the prompt: instruct the model to rate its own confidence, and flag anything below 0.85 for human review rather than auto-processing. The score is self-reported and not calibrated, so treat it as a triage signal alongside the validation in tip 1, not as a measured accuracy.

4. Log the raw response: always log the raw JSON string from the model before parsing. This makes debugging extraction failures much easier.

5. Handle rate limits: Azure OpenAI enforces tokens-per-minute limits. For batch processing, add a retry with exponential backoff using Polly.

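// Polly v8 retry pipeline for rate limiting.
// AzureOpenAIClient does not go through IHttpClientFactory, so a resilience
// handler registered with AddHttpClient would never see its requests.
// Wrap the SDK call itself instead.
// Needs: using Polly; using Polly.Retry; using System.ClientModel;
var retryPipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = 4,
        Delay            = TimeSpan.FromSeconds(2),
        BackoffType      = DelayBackoffType.Exponential,
        UseJitter        = true,
        // The SDK surfaces HTTP 429 as ClientResultException with Status == 429
        ShouldHandle     = new PredicateBuilder()
            .Handle<ClientResultException>(ex => ex.Status == 429)
    })
    .Build();

// Inside the extraction service, around the existing call
// (options is the ChatCompletionOptions instance you already pass):
var response = await retryPipeline.ExecuteAsync(
    async ct => await chatClient.CompleteChatAsync(messages, options, ct));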
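Tips 1 and 3 can be combined into a single gate before anything is saved. The sketch below is one way to do it, not a definitive rule set: `InvoiceValidator` is a hypothetical helper, the classes are trimmed-down copies of the Step 2 models so the snippet compiles alone, and the 0.01 tolerance is an assumption for rounding differences. Invoices with discounts or shipping lines will need extra terms in the totals check.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Trimmed-down copies of the Step 2 models so the sketch compiles alone.
public class LineItem
{
    public decimal? Quantity  { get; set; }
    public decimal? UnitPrice { get; set; }
    public decimal? Total     { get; set; }
}

public class InvoiceData
{
    public string?        InvoiceNumber { get; set; }
    public List<LineItem> LineItems     { get; set; } = new();
    public decimal?       Subtotal      { get; set; }
    public decimal?       TaxAmount     { get; set; }
    public decimal?       GrandTotal    { get; set; }
    public float          Confidence    { get; set; }
}

public static class InvoiceValidator
{
    // Returns a list of problems. An empty list means every check passed
    // and the invoice can skip human review.
    public static List<string> Validate(
        InvoiceData invoice,
        float   minConfidence = 0.85f,  // review threshold from tip 3
        decimal tolerance     = 0.01m)  // allowed rounding difference (assumed)
    {
        var problems = new List<string>();

        if (string.IsNullOrWhiteSpace(invoice.InvoiceNumber))
            problems.Add("Invoice number is missing.");
        if (invoice.GrandTotal is null)
            problems.Add("Grand total is missing.");
        if (invoice.Confidence < minConfidence)
            problems.Add("Model confidence is below the review threshold.");

        // Each line: quantity x unit price should match the line total.
        foreach (var item in invoice.LineItems)
        {
            if (item.Quantity is decimal qty && item.UnitPrice is decimal price &&
                item.Total is decimal lineTotal &&
                Math.Abs(qty * price - lineTotal) > tolerance)
                problems.Add("A line total does not equal quantity x unit price.");
        }

        // Line totals should add up to the subtotal.
        if (invoice.Subtotal is decimal subtotal && invoice.LineItems.Count > 0 &&
            invoice.LineItems.All(i => i.Total is not null))
        {
            var sum = invoice.LineItems.Sum(i => i.Total!.Value);
            if (Math.Abs(sum - subtotal) > tolerance)
                problems.Add("Line totals do not add up to the subtotal.");
        }

        // Subtotal + tax should equal the grand total.
        if (invoice.Subtotal is decimal sub && invoice.TaxAmount is decimal tax &&
            invoice.GrandTotal is decimal grand &&
            Math.Abs(sub + tax - grand) > tolerance)
            problems.Add("Subtotal plus tax does not equal the grand total.");

        return problems;
    }
}
```

Anything that comes back with a non-empty list goes to a review queue together with the logged raw response from tip 4, so a reviewer can see exactly what the model returned.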

Summary

AI-based document extraction using GPT-4o is one of the most practical upgrades you can make to a document processing workflow in .NET right now. The implementation is straightforward: an API call with a well-crafted prompt returns structured JSON. The maintenance burden is a fraction of what traditional OCR pipelines demand, and in my experience accuracy on real-world, inconsistent documents is noticeably better.

If you're processing invoices, contracts, medical records, or any kind of unstructured document in a .NET application and want to talk through the architecture, reach out. This is one of the patterns I've implemented recently, and I can help you avoid the common pitfalls.