Forum Discussion
SharePoint Document-ID and Chinese characters breaking search
I am struggling with a strange phenomenon. We use the Document-ID on SharePoint extensively, especially for searching. So we search specifically for SharePoint Document-IDs, e.g., EMTS-1223334444-123, to find the document.
This works fine with 10 million documents, except for PDF documents with Chinese characters. I can search for phrases in the text content or other properties, and everything is found. Only the search by Document-ID does not work.
2 Replies
This typically happens when the PDF iFilter / text extraction pipeline cannot normalize certain CJK characters consistently.
What you’re seeing is common:
- Full-text search works (because the content extractor can parse the text).
- But Document-ID search fails because the metadata indexer treats the extracted metadata as malformed when the original PDF contains mixed encodings.
A few things you can verify:
- Ensure the PDF was OCR-processed using Unicode-compliant text layers. Some older PDF generators embed non-standard glyph maps.
- Re-upload the file after re-processing the PDF. This forces SharePoint to rebuild the managed property and can fix indexing inconsistencies.
- Check that the Document ID crawled property (ows__dlc_DocId) is mapped to a queryable managed property (DlcDocId by default). If the property fails ingestion for a specific file, SharePoint simply cannot match it.
If the issue only affects PDFs with Chinese characters, the root cause is usually the PDF encoding rather than SharePoint itself. Regenerating the PDF with a modern Unicode encoder almost always restores Document-ID search.
virendrak (Iron Contributor)
When searching by Document ID, please search against the managed property DlcDocId. It works even for PDFs with Chinese characters.
Use the property DlcDocId in your query:
DlcDocId:EMTS-1223334444-123
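If you run this query programmatically rather than typing it into the search box, here is a minimal Python sketch of how the property-restricted query could be built and wrapped into a SharePoint Search REST URL. It only constructs strings and makes no network call. The site URL and both function names are hypothetical, and the selected properties (Title, Path) are just an assumption about what you want back; authentication is left out entirely.

```python
from urllib.parse import quote


def build_docid_kql(doc_id: str) -> str:
    """Build a KQL property restriction on the DlcDocId managed property.

    The value is wrapped in double quotes so the whole ID is matched as
    one phrase; stray double quotes in the input are dropped.
    """
    cleaned = doc_id.strip().replace('"', "")
    return f'DlcDocId:"{cleaned}"'


def build_search_url(site_url: str, doc_id: str) -> str:
    """Wrap the KQL query in a Search REST API URL (hypothetical helper).

    Single quotes inside the query text are doubled because the REST
    endpoint expects querytext as a single-quoted string literal.
    """
    kql = build_docid_kql(doc_id).replace("'", "''")
    query = (
        f"querytext='{quote(kql, safe='')}'"
        "&selectproperties='Title,Path'"
    )
    return f"{site_url.rstrip('/')}/_api/search/query?{query}"


# Hypothetical site URL, used only for illustration.
url = build_search_url(
    "https://contoso.sharepoint.com/sites/docs", "EMTS-1223334444-123"
)
print(url)
```

Restricting the query to DlcDocId keeps the lookup independent of how the PDF body text was extracted, which is the point of the suggestion above.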
Please refer to the articles below for more details:
How to Search by Document ID in SharePoint | Blog about anything related to my learnings
Search doesn't provide results from another language - SharePoint | Microsoft Learn