Forum Discussion

guidopeter
Copper Contributor
Nov 12, 2025

SharePoint Document-ID and Chinese characters breaking search

I am struggling with a strange phenomenon. We use the Document-ID on SharePoint extensively, especially for searching: we search for a specific SharePoint Document-ID, e.g., EMTS-1223334444-123, to find the document.

This works fine with 10 million documents, except for PDF documents with Chinese characters. I can search for phrases in the text content or other properties, and everything is found. Only the search by Document-ID does not work.

2 Replies

  • This typically happens when the PDF iFilter / text extraction pipeline cannot normalize certain CJK characters consistently.

    What you’re seeing is common:

    • Full-text search works (because the content extractor can parse the text).
    • But Document-ID search fails because the metadata indexer treats the extracted metadata as malformed when the original PDF contains mixed encodings.

    A few things you can verify:

    1. Ensure the PDF was OCR-processed using Unicode-compliant text layers. Some older PDF generators embed non-standard glyph maps.
    2. Re-upload the file after re-processing the PDF. This forces SharePoint to rebuild the managed property and can fix indexing inconsistencies.
    3. Check that the Document ID crawled property (typically ows__dlc_DocId) is mapped to a queryable, retrievable managed property (typically DlcDocId). If the property fails ingestion for a specific file, SharePoint has nothing to match the ID against.
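
To check point 3 for a single affected file, you can query the Document ID as a property restriction through the SharePoint Search REST endpoint (/_api/search/query) and see whether the file comes back. The sketch below only builds the request URL. It assumes the managed property is named DlcDocId (verify this in your search schema), and the site URL and function name are hypothetical. Authentication and sending the request are left out.

```python
from urllib.parse import quote


def build_docid_query_url(site_url: str, doc_id: str,
                          managed_property: str = "DlcDocId") -> str:
    """Build a SharePoint Search REST URL that looks up one Document ID.

    Assumption: `managed_property` is the queryable managed property that
    holds the Document ID in your tenant (DlcDocId is a common default).
    """
    # KQL property restriction; the value is phrase-quoted so the hyphens
    # in the ID are kept together.
    kql = f'{managed_property}:"{doc_id}"'
    # The REST API expects string parameters wrapped in single quotes;
    # a single quote inside the value is escaped by doubling it.
    querytext = quote("'" + kql.replace("'", "''") + "'", safe="")
    # Ask for the Document ID property back so you can see what was indexed.
    select = quote("'Title,Path," + managed_property + "'", safe="")
    base = site_url.rstrip("/")
    return (f"{base}/_api/search/query"
            f"?querytext={querytext}&selectproperties={select}")


# Hypothetical site URL; the ID is the example from the question.
url = build_docid_query_url("https://contoso.sharepoint.com/sites/docs",
                            "EMTS-1223334444-123")
print(url)
```

If the same query returns a non-Chinese PDF but not the affected one, the managed property was probably not populated for that file, which points to ingestion rather than to the query itself.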

    If the issue only affects PDFs with Chinese characters, the root cause is usually the PDF encoding rather than SharePoint itself. Regenerating the PDF with a modern Unicode encoder almost always restores Document-ID search.
