Most businesses still receive vital information as PDFs—contracts, invoices, delivery notes. This guide shows a simple, repeatable pattern to turn those messy PDFs into structured JSON with an AI Agent, then push the data anywhere (Salesforce, S3, HTTPS/FTPS, Google Drive, OneDrive). It’s tool-agnostic; I’ll reference Google Drive for the trigger example.
The Pattern (at a glance)
Trigger → Extract → Normalize with AI → Deliver → Backup
- Trigger from where files arrive (S3, HTTPS/FTPS, Google Drive, OneDrive).
- Extract text/tables from the PDF.
- Normalize the raw text to strict JSON via an AI prompt.
- Deliver the JSON to business systems (Salesforce, APIs, data lake).
- Backup the original PDF for audit & reprocessing.
1) Trigger: watch a folder or endpoint
Point a watcher to your inbound location. When a new PDF lands, start the flow.
- Works with S3, HTTPS, FTPS, Google Drive, OneDrive—the channel doesn’t matter.
- In this example, we use a Google Drive folder as the trigger.
2) Extract: get text (or tables) from PDFs
- If your PDFs are well-structured (clear tables), use a built-in “Extract from File” step to get table rows or clean text directly.
- If scanned/OCR, extraction still works—just expect a bit more normalization later.
3) Normalize: let the AI Agent produce strict JSON
Feed raw text to your AI and ask for only JSON in your target schema (e.g., seller, buyer, invoice, products, summary). Keep the prompt explicit and deterministic.
Starter Prompt (trim to your needs):
Role: You are a meticulous extractor. Given raw OCR text that may contain one or many invoices, return a JSON array with this schema:
[
{
"seller": {"name":"string|null","tax_id":"string|null","address":"string|null","phone":"string|null","email":"string|null","bank_account":"string|null","bank_name":"string|null","bank_branch":"string|null"},
"buyer": {"name":"string|null","tax_id":"string|null","address":"string|null","phone":"string|null","buyer_representative":"string|null"},
"invoice": {"type":"string|null","issue_date":"YYYY-MM-DD|null","code":"string|null","symbol":"string|null","number":"string|null","payment_method":"string|null","currency":"string","lookup_url":"string|null","lookup_code":"string|null"},
"products": [{"name":"string","description":"string|null","unit":"string|null","quantity":number,"unit_price":number,"discount_percent":number|null,"discount_amount":number|null,"subtotal_excl_vat":number,"tax_rate":number|null,"tax_amount":number|null,"line_total":number|null}],
"summary": {"total_discount":number|null,"subtotal_excl_vat":number|null,"vat_rate":number|null,"vat_amount":number|null,"grand_total":number|null,"amount_in_words":"string|null"},
"software": {"issuer":"string|null","issuer_tax_id":"string|null","notes":"string|null"},
"parsing": {"warnings":["string"],"assumptions":["string"]}
}
]
Rules:
- Output JSON only (UTF-8). No prose.
- Use integers for VND; strip thousand separators; normalize dates to YYYY-MM-DD.
- If a field is missing, use null. Don’t invent values.
- If multiple invoices appear, return multiple objects.
- Include warnings if totals/line items don’t reconcile exactly.
"payment_method": "TM/CK",
"currency": "VND",
"lookup_url": "https://www.meinvoice.vn/tra-cứu",
"lookup_code": "6546546315"
},
"products":
[
{
"name": "GARMIN LILY 2 ASIA",
"description": "LILAC, SILICONE, A04630",
"unit": "Piece",
"quantity": 1,
"unit_price": 6354545,
"discount_percent": 5,
"discount_amount": 317727,
"subtotal_excl_vat": 6036818,
"tax_rate": null,
"tax_amount": null,
"line_total": null
},
{
"name": "GARMIN LILY 2 CLASSIC KOR/SEA",
"description": "SILVER/SAGE GREY, FABRIC, A04630",
"unit": "Piece",
"quantity": 1,
"unit_price": 7263636,
"discount_percent": 5,
"discount_amount": 363182,
"subtotal_excl_vat": 6900454,
"tax_rate": null,
"tax_amount": null,
"line_total": null
},
{
"name": "GARMIN INSTINCT 2X SOLAR",
"description": "GRAPHITE SEA",
"unit": "Piece",
"quantity": 1,
"unit_price": 10627273,
"discount_percent": 15,
"discount_amount": 1594091,
"subtotal_excl_vat": 9033182,
"tax_rate": null,
"tax_amount": null,
"line_total": null
}
]
4) Deliver: map JSON to your business systems
Once you have structured JSON, push it anywhere:
- Salesforce → Upsert into a Custom Object, launch a Flow, or fire a Platform Event.
- APIs/Destinations → HTTPS/FTPS to downstream services or warehouses.
- Data Lake → S3 for analytics pipelines.
- Docs/Sheets → Google Drive/OneDrive for team visibility.
5) Backup: keep the original PDF
Archive the original file (e.g., OneDrive or S3) for compliance, audits, and reprocessing. Keep both the raw text and final JSON for traceability.
Pro Tips for Reliability
- Schema-first: define the JSON contract upfront; the AI conforms to you, not vice versa.
- Strict output: “JSON only, nulls for missing, normalized dates” avoids brittle parsers.
- Light validation: cross-check totals vs. line items and log warnings.
- Composable: you can swap extractors, models, or destinations without rebuilding everything.
What you’ll get
- Repeatable automation from any inbound channel to business-ready data.
- Fewer manual data entry errors and faster downstream processing.
- A clean audit trail: Original PDF → Extracted Text → Normalized JSON → Delivery logs.
#automation #AI #PDF #DocAI #RPA #Salesforce #n8n #ETL #DataEngineering #NoCode #LowCode