From File Dump to Curated Corpus: PEDCO AuditPro's New Classification Layer

Every retrieval-augmented system lives or dies by the quality of its corpus. You can have the best language model on the planet, the most carefully tuned hybrid search, and the most elaborate reasoning loop — and you will still produce wrong answers if the documents underneath are a mess. We have known this since the very first version of PEDCO AuditPro, and it is the reason we have invested as heavily in ingestion as we have in models. AI-driven document classification is the latest and, in many ways, the most consequential improvement to that ingestion pipeline.

This post explains what PEDCO AuditPro now does with every document that arrives in a repository, why classification is the right place to spend effort, and how the result shows up in cleaner audits, more reliable Copilot answers, and a knowledge graph that you can actually trust.

The problem we are solving

When customers connect their QMS to PEDCO AuditPro, they connect a repository — typically a SharePoint site, a network drive, a Document Management System export, or a curated set of uploads. The natural state of these repositories is messy. They contain SOPs and work instructions, but they also contain meeting minutes, training certificates, supplier letters, expired policies that nobody got around to deleting, vendor brochures, scanned receipts, internal-only drafts that were never approved, and personal files that ended up in the wrong folder.

If you treat all of these documents as equal inputs to the knowledge graph, three things happen, none of them good.

First, the knowledge graph gets polluted. Entities and relationships extracted from a brochure get stored next to entities and relationships extracted from a controlled SOP. When you later ask "what does our supplier qualification process say?", the graph cheerfully blends both sources. The brochure's marketing claims now look like part of your process documentation. The Copilot has no way to know which one to trust, because at the graph level, they are interchangeable.

Second, retrieval gets noisy. Compliance questions tend to be specific, and specific questions are sensitive to the signal-to-noise ratio in the corpus. As a rough illustration, if 30% of a repository is irrelevant and retrieval cannot tell the difference, something like 30% of retrieved chunks will be irrelevant too, and the model has to spend its limited reasoning budget filtering them out before it can answer. The answer that emerges is duller, less confident, and more likely to hedge.

Third, audit results become harder to defend. If an audit rating is supported by a chunk that came from a vendor brochure, the rating is not defensible — even if the brochure happened to say something accurate. The chain of evidence is broken. Compliance is, fundamentally, about defensible chains, and an audit system that cannot trace its conclusions back to authoritative documents is not doing its job.

The traditional answer to this problem is "ask the customer to clean up their repository first". That answer is not realistic. QMSs accumulate cruft because human attention is finite and the cleanup task is dull. Asking a quality manager to spend three weeks classifying documents before they can use the product is the same as telling them not to use the product. We needed a better answer.

What AI document classification does

Every document that enters a PEDCO AuditPro repository now passes through a classification step before it is added to the knowledge graph. The classifier reads the document — both its content and a structured set of metadata signals — and assigns it one or more labels: controlled-procedure, policy, work-instruction, record, form-template, training-material, external-reference, out-of-scope, and several others tuned to the specific needs of QMS work.

These labels are not just tags for the UI. They directly determine how each document is treated downstream:

Documents labelled as controlled procedures, policies, and work instructions become the authoritative core of the knowledge graph. These are the documents whose entities and relationships are extracted at the highest resolution and whose chunks are weighted most heavily during retrieval.
Documents labelled as records are indexed, but their role is to provide evidence rather than to define process. When an audit asks "do we have evidence that this control is operating?", records are exactly the right material. They are not the right material to answer "what does the process require?".
Documents labelled as forms and templates are surfaced as artefacts but not used as a source of process definitions.
Documents labelled as training material and external references are kept in the repository for completeness but excluded from the audit-evidence path. They show up in the Copilot if you ask for them by name; they do not influence audit ratings.
Documents labelled as out-of-scope — brochures, personal files, miscellaneous noise — are excluded from the knowledge graph entirely. They remain visible in the repository so that you can verify the classification, but they do not contribute to any downstream reasoning.
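The label-to-treatment rules above can be sketched as a small lookup table. This is a minimal illustration, not PEDCO AuditPro's actual schema: the `Treatment` fields, the `treatment_for` helper, and the rule for combining multiple labels are all hypothetical, and flags the post does not state explicitly are marked as assumptions in the comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    in_knowledge_graph: bool  # contributes to the knowledge graph at all
    defines_process: bool     # may answer "what does the process require?"
    audit_evidence: bool      # may support an audit rating


# Hypothetical policy table derived from the descriptions above.
TREATMENT_BY_LABEL = {
    "controlled-procedure": Treatment(True, True, True),
    "policy": Treatment(True, True, True),
    "work-instruction": Treatment(True, True, True),
    "record": Treatment(True, False, True),
    # Assumption: forms and templates are surfaced but are not audit evidence.
    "form-template": Treatment(True, False, False),
    # Assumption: still indexed so the Copilot can find them by name.
    "training-material": Treatment(True, False, False),
    "external-reference": Treatment(True, False, False),
    "out-of-scope": Treatment(False, False, False),
}

# Conservative fallback for labels this sketch does not know about.
EXCLUDED = Treatment(False, False, False)


def treatment_for(labels):
    """Combine treatments for a document carrying one or more labels.

    Assumption: a flag is granted if any of the document's labels grants it.
    """
    treatments = [TREATMENT_BY_LABEL.get(label, EXCLUDED) for label in labels]
    return Treatment(
        in_knowledge_graph=any(t.in_knowledge_graph for t in treatments),
        defines_process=any(t.defines_process for t in treatments),
        audit_evidence=any(t.audit_evidence for t in treatments),
    )
```

Keeping the policy in one table means a downstream surface only asks "may I use this document for X?" and never needs to interpret labels itself.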

The classification is recorded as a per-document attribute and is visible in the new repository detail page, where you can review it, filter by it, and override it when needed. Overrides are first-class: if the classifier got something wrong, your override is durable and is preserved across re-ingestions. The classifier also improves from your overrides over time, because we feed the patterns it misses most often back into the model's prompts.

Why "AI-driven" actually matters here

There has been classification of one kind or another inside PEDCO AuditPro for a while. What is new is that the classifier reasons about the document the way a quality manager would, rather than matching keywords or filename patterns. The results are markedly better in exactly the cases where keyword and filename matching fail.

A keyword classifier would see the word "policy" in the filename and assume the document is a policy. The new classifier reads the document, notices that it is a training slide deck that mentions the word "policy" three times, and labels it as training material. A filename classifier would miss the controlled procedure that the document author forgot to give a numbered prefix; the new classifier reads the header, sees the document control box with revision number and approval signature, and labels it correctly. A keyword classifier would treat all PDFs in a "Records" folder as records; the new classifier notices when one of them is actually an SOP that was filed in the wrong place, and labels it for what it is.

The classifier uses the document content, its structural cues (headers, tables, signatures, version control boxes), its metadata (created-by, owning department, document type from the source system), and — importantly — its position in the QMS profile we are building for the repository. A document that is referenced from a known controlled procedure as "see related procedure X" gets a strong prior toward also being a controlled procedure. A document that lives in a folder where 80% of the contents are records gets a prior toward being a record. The classifier composes these signals into a labelled output with a confidence score, and routes uncertain cases for review.
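One way to picture how such signals might compose is a simple weighted product that is then normalised into a confidence. The sketch below is an assumption for illustration only: the function name `classify`, the reduced label set, the 0.5 smoothing offset, the 2.0 reference boost, and the 0.7 review threshold are all invented here, and the real classifier is a model, not a formula.

```python
LABELS = ("controlled-procedure", "record", "training-material", "out-of-scope")


def classify(content_scores, folder_share, referenced_by_controlled=False,
             review_threshold=0.7):
    """Return (label, confidence, needs_review).

    content_scores: label -> score from reading the document (hypothetical).
    folder_share:   label -> share of the folder's documents with that label.
    """
    weights = {}
    for label in LABELS:
        weight = content_scores.get(label, 0.0)
        # Folder prior: the 0.5 offset damps labels that are rare in the
        # folder without ruling them out completely.
        weight *= 0.5 + folder_share.get(label, 0.0)
        # Being referenced from a known controlled procedure is a strong
        # prior toward also being a controlled procedure.
        if referenced_by_controlled and label == "controlled-procedure":
            weight *= 2.0
        weights[label] = weight

    total = sum(weights.values())
    if total == 0:
        return None, 0.0, True  # no usable signal: always route to review

    label = max(weights, key=weights.get)
    confidence = weights[label] / total
    return label, confidence, confidence < review_threshold


# An ambiguous document in a folder that is 80% records.
scores = {"controlled-procedure": 0.45, "record": 0.40,
          "training-material": 0.10, "out-of-scope": 0.05}
folder = {"record": 0.8, "controlled-procedure": 0.1}
label, confidence, needs_review = classify(scores, folder)
```

In this example the folder prior tips an ambiguous document toward "record", but the confidence stays below the threshold, so it is routed for review. Passing `referenced_by_controlled=True` for the same document tips it toward "controlled-procedure" instead.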

This is the part that benefits most from running it as an AI model rather than a rule engine. The signal landscape is heterogeneous: some QMSs have rigorous document control boxes, others rely on filename conventions, others rely on folder structure, others on metadata fields in the DMS. A rule engine has to be configured for each shape. The classifier picks up the shape from the data and adapts.

Cleanup as a continuous, not a one-time, activity

A subtle but important property of the new classification layer is that it runs every time a repository is ingested, not just on the first pass. As you add documents over time — new SOPs, new revisions, new records — each one is classified at the moment it enters. There is no point at which the classification becomes stale, and there is no need to "re-clean" the repository periodically.

The flip side is also true: if a document is reclassified because the model's understanding has improved, the knowledge graph is updated accordingly. A document that was previously labelled as training material and is now correctly recognised as a work instruction will be promoted into the authoritative core of the graph in the next ingestion. This makes the system self-correcting in a way that hand-curated taxonomies are not.
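The interaction between re-ingestion, durable overrides, and promotion into the authoritative core can be sketched as follows. The function names, the dictionary shapes, and the idea of returning explicit promoted and demoted lists are assumptions made for illustration, not a description of the product's internals.

```python
AUTHORITATIVE = {"controlled-procedure", "policy", "work-instruction"}


def effective_label(model_label, override=None):
    """A human override, when present, always wins over the model's label."""
    return override if override is not None else model_label


def reingest(previous, model_labels, overrides):
    """Re-label every document and report movement in and out of the core.

    previous:     doc_id -> effective label after the last ingestion
    model_labels: doc_id -> label the classifier assigns on this ingestion
    overrides:    doc_id -> label set by a reviewer (kept across ingestions)
    """
    current = {
        doc_id: effective_label(label, overrides.get(doc_id))
        for doc_id, label in model_labels.items()
    }
    # Documents entering the authoritative core, including brand-new ones.
    promoted = sorted(
        doc_id for doc_id, label in current.items()
        if label in AUTHORITATIVE and previous.get(doc_id) not in AUTHORITATIVE
    )
    # Documents that were in the core and no longer are.
    demoted = sorted(
        doc_id for doc_id, label in current.items()
        if label not in AUTHORITATIVE and previous.get(doc_id) in AUTHORITATIVE
    )
    return current, promoted, demoted


previous = {"wi-7": "training-material", "sop-1": "controlled-procedure",
            "doc-9": "record"}
model_labels = {"wi-7": "work-instruction", "sop-1": "controlled-procedure",
                "doc-9": "controlled-procedure"}
overrides = {"doc-9": "record"}  # reviewer has already corrected this one

current, promoted, demoted = reingest(previous, model_labels, overrides)
```

Here `wi-7` is promoted because the model now recognises it as a work instruction, while `doc-9` stays a record because the reviewer's override outranks the model's new opinion.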

We also expose the classifier through an explicit feedback loop. Every time you override a label, that override is recorded against the document and used as a positive signal in your tenant-specific evaluation set. Over time, this gives us per-tenant fidelity without requiring per-tenant model training: the overrides feed into the prompt as canonical examples for similar future documents.
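A minimal sketch of how overrides could become prompt examples is shown below. Everything in it is hypothetical: the `Override` record, the word-overlap similarity (a stand-in for whatever similarity measure is actually used), and the prompt wording are assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class Override:
    title: str
    excerpt: str
    label: str  # the label the reviewer chose


def _tokens(text):
    return set(text.lower().split())


def select_examples(overrides, new_doc_text, k=3):
    """Pick the k overrides most similar to the new document.

    Similarity is plain Jaccard overlap on lowercase words, used here only
    as a simple stand-in.
    """
    target = _tokens(new_doc_text)

    def similarity(override):
        tokens = _tokens(override.title + " " + override.excerpt)
        union = tokens | target
        return len(tokens & target) / len(union) if union else 0.0

    return sorted(overrides, key=similarity, reverse=True)[:k]


def build_prompt(overrides, new_doc_text, k=3):
    """Assemble a classification prompt with reviewer-confirmed examples."""
    lines = ["Classify the document. Reviewer-confirmed examples from this tenant:"]
    for example in select_examples(overrides, new_doc_text, k):
        lines.append(f"- {example.title}: {example.label}")
    lines.append("Document:")
    lines.append(new_doc_text)
    return "\n".join(lines)


tenant_overrides = [
    Override("Welding safety training", "slides for welding safety induction",
             "training-material"),
    Override("Supplier audit checklist", "checklist for supplier audit visits",
             "form-template"),
]
prompt = build_prompt(tenant_overrides, "Supplier audit checklist for new vendors", k=1)
```

Because the examples travel in the prompt, no model weights change per tenant, which matches the "per-tenant fidelity without per-tenant training" idea described above.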

The knock-on effects on audits, recommendations, and Copilot

Once the classification layer is in place, every downstream surface gets cleaner without any change to its own logic.

Audits become more defensible because their evidence is drawn only from documents that are appropriate to use as evidence. When an audit cites a key document, you can be confident that the document is an authoritative SOP, policy, or record — not a brochure, not a draft, not someone's training notes. The "rationale" field that we ship per key document explains why the document was selected, and the classification label is part of that story: it is faster to trust a citation when you know its document is a controlled procedure than when it is "some PDF".

Recommendations become more targeted because the Copilot draws structural information about the QMS only from the authoritative documents. When it proposes "update SOP X section 3.2", it can be confident that SOP X is actually an SOP — not a brochure that looks like one. When it proposes capturing additional records, it can name the existing record types that the QMS already uses, rather than inventing new ones.

Copilot answers become more confident because the retrieval no longer surfaces irrelevant material. The model's reasoning budget goes into composing an answer, not into filtering out a vendor brochure that happened to keyword-match.

QMS profiling becomes faster and more accurate because the profile generator no longer has to look at every document; it can focus on the authoritative core and treat records and references as supporting material. Profiles regenerate faster, drift less, and reflect the QMS as it is actually defined rather than as it accidentally appears in a folder full of mixed content.

Two design choices worth flagging

We made two design choices in the classification layer that customers consistently ask about, so it is worth flagging them here.

The first is that out-of-scope documents are kept in the repository, not deleted. We never delete customer files. If the classifier excludes a file from the knowledge graph, the file remains in the repository view, clearly labelled as out-of-scope, with an obvious affordance to override the classification if it is wrong. The reasoning is simple: a classification mistake should never cause data loss, only data inattention.

The second is that classification is transparent and overridable, not magic. Every label is visible. Every label can be overridden. Every override is durable. We considered building a system where classification happened invisibly behind the scenes and customers only saw the cleaned output — and we decided against it. Quality managers need to know what is in their corpus and what is being filtered out, because that knowledge is part of their compliance posture. Hiding the filtering would undermine the trust the feature is supposed to build.

What this means in practice

If you are an existing customer with a repository that has accumulated noise over time, the first ingestion with the new classifier will produce a visible cleanup: documents that used to dilute your audits will no longer feed into them, and the audit ratings should reflect a cleaner evidence base. You will see the classifications in the repository detail page and can correct any that are wrong. The corrections are durable and will influence subsequent ingestions.

If you are a new customer, you will see the classifier in action from the very first sync. You will not need to clean your repository before connecting it. You will not need to maintain a hand-curated list of "documents to ignore". You will not need to wait for a manual review before the Copilot becomes useful. The classifier takes the messy state of a real QMS as input, and routes the parts of it that matter into the parts of the system where they matter.

Why we did this now

The honest answer is that the rest of PEDCO AuditPro has gotten good enough that the document corpus is the bottleneck. The Copilot can compose better answers than it could six months ago. The knowledge graph captures more relationships, more accurately. The audit pipeline produces more nuanced ratings and richer rationales. The thing that was holding the system back from another step-change in quality was the quality of what we were feeding it.

Document classification is the cheapest way to fix that — not by asking customers to do work, but by asking the system to do work that was previously assumed to be the customer's job. The customer's job is the QMS. Our job is to read it intelligently.

If you are seeing classifications that disagree with your own judgement, we would love to hear about them. Every override is a teaching signal, and we use them to make the classifier sharper across the platform. Compliance is hard enough without the system getting basic distinctions wrong, and we intend for those distinctions to keep getting better.

Manuel Jenni
