Twenty-seven percent of organizations have banned generative AI over privacy concerns. Another 63 percent limit what data employees can enter. Meanwhile, the tools keep getting better, and professionals who can't use them fall behind. (We looked at the numbers across five industries; this guide covers what to do about it.)

This guide is for the people stuck in the middle: you know AI could help with your work, but the documents on your desk contain names, financials, client details, and regulated data. What are you actually allowed to do? What's genuinely risky? What strategies exist beyond "just don't"?

What's actually at risk

Not all AI usage carries equal risk. The problem with blanket bans is they treat every interaction the same. Asking ChatGPT to draft a marketing email and uploading a client's medical records are different activities with different consequences.

A useful way to think about it:

Low risk. Asking general knowledge questions, summarizing publicly available information, generating boilerplate text. No confidential data involved. This is where blanket bans create the most unnecessary friction.

Medium risk. Analyzing documents with internal business context but no personal identifiers. A product roadmap, a technical specification. The information is proprietary and you'd prefer it stayed private, but a leak wouldn't trigger regulatory consequences.

High risk. Documents containing names, addresses, financial details, client identities, or health records. This is where most professionals work every day, and where most AI bans are aimed.

Off limits. Data you have legal obligations to protect: attorney-client privileged communications, active litigation documents, patient records under HIPAA, classified material. No amount of convenience justifies the regulatory exposure.

Most organizations don't distinguish between these tiers. That's the core problem: a policy that says "don't use AI" is easier to write than one that says "here's how to use it responsibly." But the first policy gets ignored. The second one gets followed.

What the AI providers actually do with your data

"We don't train on your data" is the most common reassurance from AI companies. It's also the one that deserves the most scrutiny, because whether it's true depends entirely on which tier you're using.

OpenAI (ChatGPT). Free and Plus users can have their conversations used for model training. You can opt out by toggling "Improve the Model for Everyone" in settings. If you don't opt out, everything you type might inform future model behavior. In 2025, a US federal court order in the New York Times lawsuit temporarily required OpenAI to retain all user conversations, including deleted ones, for several months. Enterprise and API customers are excluded from training entirely, and can configure zero-data-retention.

Anthropic (Claude). Since August 2025, consumer conversations are used for training by default. Opt-out is available in privacy settings, but if you allow training use, your data may be retained for up to five years. If you opt out, retention drops to 30 days. API and commercial customers are explicitly excluded from training, with 7-day log retention.

Google (Gemini). Consumer Gemini may use conversations for model improvement. Workspace and API tiers offer stronger guarantees, but the tiers and their terms change frequently.

The pattern is consistent: consumer tiers subsidize development partly by using your inputs. Enterprise tiers offer contractual guarantees. But even enterprise contracts are only as strong as the company behind them, and those companies have structural incentives to use the richest possible training data.

The track record is also worth noting. Every major AI provider has, at some point, operated on a "move fast, course-correct later" basis: retroactive changes to training policies, opt-outs buried in settings, retention terms that quietly expanded. The policies documented above are accurate as of early 2026. They were different a year ago and will likely be different a year from now. Any strategy that depends entirely on a provider's current terms is built on ground that shifts quarterly.

What has already gone wrong

In April 2023, Samsung engineers leaked confidential data to ChatGPT three times in 20 days. One engineer submitted proprietary source code seeking bug fixes. Another shared code for optimizing semiconductor equipment defect detection. A third pasted minutes from an internal meeting. Samsung had lifted its ChatGPT ban just weeks earlier.

The data entered those conversations and became part of a system Samsung couldn't control. Samsung re-imposed the ban and began building its own internal AI tools.

This isn't a story about Samsung being careless. It's a story about what happens when productive tools meet confidential data without a clear policy. The engineers weren't acting maliciously. They were trying to do their jobs faster. The problem was structural, not behavioral.

Four strategies for using AI with sensitive documents

There is no single correct approach. Each strategy makes different trade-offs between capability, cost, and risk.

1. Use enterprise tiers

Enterprise contracts from OpenAI, Anthropic, and Google offer the strongest data handling guarantees: no training on your data, configurable retention, SSO, audit trails, and sometimes SOC 2 or ISO 27001 compliance.

When this works: Organization-wide deployment where IT can manage procurement, onboarding, and governance. Teams processing medium-risk documents where the proprietary context matters but individual PII is limited.

When it doesn't: Enterprise tiers are expensive, require procurement cycles, and still send your data to cloud infrastructure you don't control. For regulated professions, a contractual guarantee from an AI vendor may not satisfy regulatory requirements. And for individuals or small teams, the per-seat costs are often out of reach.

2. Use local models

Run a model on your own hardware. Ollama, llama.cpp, or Apple Intelligence, depending on the task. Your data never leaves your machine.

When this works: Tasks simple enough for smaller models. Summarization, translation, Q&A over short documents. Environments where regulations or policy absolutely prohibit cloud processing.

When it doesn't: Local models are significantly less capable than frontier models for complex reasoning, long documents, and nuanced analysis. Running competitive models locally requires substantial hardware. And you take on the maintenance burden: model updates, performance tuning, infrastructure.
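To make the local option concrete, here is a minimal sketch that talks to Ollama's local HTTP API from Python. The endpoint, port, and model name are assumptions (Ollama's defaults, plus a model you've already pulled); treat this as an illustration, not a reference implementation.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "llama3.2") -> dict:
    """Build the JSON payload for a one-shot, non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}

def summarize_locally(text: str, model: str = "llama3.2") -> str:
    """Send a document to the local model and return its summary."""
    payload = build_request(
        f"Summarize the following in three bullet points:\n\n{text}", model
    )
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the request goes to localhost, the document text is never transmitted off the machine. Calling `summarize_locally` does require the Ollama daemon to be running with the named model available.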

3. Redact before sending

Remove or replace sensitive information from the document before submitting it to any AI. The model receives a structurally intact document with placeholders instead of real names, addresses, and identifiers. It reasons about the same relationships and obligations, but never learns who the parties are.

When this works: Document analysis, summarization, contract review, compliance checks. Tasks where the AI needs the structure and logic of the document but not the identity of the people in it. Works with any provider at any tier, because the sensitive data never leaves your machine.

When it doesn't: Tasks that depend on the specific identity of the parties, like background checks or entity verification. Documents where every piece of information is itself the sensitive data, like a list of names with no surrounding context. And cases where the context itself reveals the identity, regardless of what you redact. A whistleblower report about a startup's sole founder doesn't need a name. Neither does a regulatory filing about a country's only central bank. If the surrounding narrative is specific enough, placeholders won't help. Redaction works best when the structure of the document is general even if the data inside it is not.

The advantage of this approach is composability: it works with enterprise tiers, consumer tiers, and any future model or provider. You're not locked into a vendor's privacy policy because the sensitive data was removed before the vendor saw it.
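A minimal sketch of the redact-then-restore pattern, with a few regexes standing in for a real PII detector. The patterns and placeholder format are illustrative assumptions; a production redactor would use proper entity detection rather than regexes.

```python
import re

# Illustrative patterns only; real PII detection is much broader than this.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str):
    """Replace sensitive matches with numbered placeholders, keeping a local mapping."""
    mapping = {}   # placeholder -> original value, never sent anywhere
    counters = {}  # per-kind counter for numbering placeholders

    def substitute(kind):
        def _sub(match):
            value = match.group(0)
            # Reuse the same placeholder if the same value appears again.
            for placeholder, original in mapping.items():
                if original == value:
                    return placeholder
            counters[kind] = counters.get(kind, 0) + 1
            placeholder = f"[{kind}_{counters[kind]}]"
            mapping[placeholder] = value
            return placeholder
        return _sub

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(substitute(kind), text)
    return text, mapping

def restore(text: str, mapping: dict) -> str:
    """Re-insert the original values into the model's response, locally."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text
```

The redacted text goes to the model; the mapping stays on your machine, and `restore` re-personalizes the response after it comes back. The model reasons over "[EMAIL_1]" and "[SSN_1]" without ever seeing the values behind them.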

4. Don't use AI for this task

Some documents shouldn't go through external AI processing. Period.

Active litigation materials. Attorney-client privileged communications. Patient records in jurisdictions with strict disclosure rules. Classified information. Any situation where the regulatory cost of a breach exceeds the productivity gain of AI assistance.

This isn't a failure of technology. It's a recognition that not every task belongs in an AI workflow. Knowing when to stop is part of a mature policy.

A decision framework

When you have a document and an AI tool, run through these questions in order:

Does the document contain personal identifiers or regulated data? If no, you're in the low-to-medium risk tier. Enterprise tiers or even consumer tiers with opt-out settings may be appropriate.

Can you remove the sensitive parts and still get useful analysis? If the AI needs the contract's structure but not the parties' names, redacting before sending gives you the best of both: frontier model capability and local data control.

Is the document subject to specific regulations? HIPAA, attorney-client privilege, GDPR Article 9 (special categories of personal data). If so, the answer is probably not "use a cloud AI," regardless of the provider's terms.

Is the task simple enough for a local model? Straightforward summarization, translation, or short Q&A can often run locally without losing much quality.

Does the risk of exposure outweigh the productivity gain? If a leak would cause regulatory action, client loss, or legal liability, the calculus is clear. Some tasks stay offline.
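The five questions above can be sketched as a single decision function. The inputs and outcomes mirror the framework, with regulated data checked first; the ordering and the exact labels are assumptions made for illustration.

```python
def choose_strategy(
    has_identifiers: bool,  # personal identifiers or client details present?
    regulated: bool,        # HIPAA, privilege, GDPR Art. 9, classified?
    redactable: bool,       # can sensitive parts be removed and the task still work?
    simple_task: bool,      # summarization, translation, short Q&A?
) -> str:
    """Walk the decision questions in order and return a strategy."""
    if regulated:
        # Regulatory cost of a breach outweighs the productivity gain.
        return "keep it offline"
    if not has_identifiers:
        # Low-to-medium risk tier.
        return "enterprise tier, or consumer tier with training opt-out"
    if redactable:
        # Frontier capability plus local data control.
        return "redact, then send to a frontier model"
    if simple_task:
        return "run a local model"
    return "don't use AI for this task"
```

For example, a contract review where the parties' names are incidental lands on redaction, while a patient record under HIPAA short-circuits to staying offline regardless of the other answers.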

What's changing

The tension between AI productivity and data protection isn't static. Several forces are reshaping the landscape:

On-device AI is improving. Apple Intelligence, increasingly capable open-weight models, and better hardware are closing the gap between local and cloud. The tasks you "need" a cloud model for are shrinking.

Regulations are catching up. The EU AI Act introduces obligations around AI system transparency and risk assessment. Organizations using AI on regulated data will need documented policies, not informal guidance.

Provider policies keep shifting. Terms change quarterly. An opt-out setting that exists today might be replaced tomorrow. Any strategy that depends entirely on a provider's current policy is built on sand.

The gap between "banned" and "used anyway" is unsustainable. Organizations that ban AI without offering alternatives will lose talent to organizations that have figured out responsible use. The question isn't whether professionals will use AI on sensitive documents. It's whether they'll do it with guardrails or without.

The choice

The right approach depends on what you're working with, what you're trying to accomplish, and what your regulatory obligations require. There is no universal answer.

But blanket bans are a symptom, not a solution. They exist because crafting a nuanced policy is harder than saying no. The tools to use AI responsibly with confidential documents exist today. The question is whether your organization knows about them, or whether your people are quietly pasting documents into consumer chatbots and hoping for the best.

One of those outcomes is a policy decision. The other is an incident waiting to happen.

Use AI on your documents. Keep the sensitive parts local.

RedMatiq redacts personal information from documents before they reach any AI. You get full analysis. The model never sees who's in the document.