Online OCR – Convert Images to Word, Text and PDF

Last time, I wrote an article, Free Online OCR Service with Dynamsoft TWAIN SDKs, which showed step by step how to create an online OCR application with Dynamic .NET TWAIN SDK. Since then, I have received some feedback asking how to convert the OCR results to Microsoft Office documents. So today, I’d like to share how to use the Open XML SDK to convert OCR results to a Word document, with a few small changes to the sample code I shared last time.

Integrating Open XML SDK to the Online OCR App

I used the unofficial packaging of Microsoft’s OpenXML SDK 2.5 from NuGet. If you are interested in the official one, please visit Open XML SDK 2.5 for Microsoft Office.

Add the Open XML reference to your project.

Open DoOCR.aspx.cs, and find the line:

byte[] content = OCRMode.OCR(inputBuffer, strLanguage, Convert.ToInt32(strFormat));

The byte array is the result returned by the Dynamic .NET TWAIN OCR interface. We can convert it to a string:

System.Text.Encoding.ASCII.GetString(content)
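
One caveat is worth checking here: Encoding.ASCII replaces every byte above 0x7F with a question mark, so accented or non-Latin characters in the OCR result would be lost. The sketch below is one possible alternative, not part of the original sample. It assumes the OCR engine returns UTF-8 text, which is not confirmed in this article, and the class name OcrTextDecoder is hypothetical. The strict decoder throws instead of silently substituting characters, so a wrong encoding assumption surfaces immediately.

```csharp
using System;
using System.Text;

public static class OcrTextDecoder
{
    // Strict UTF-8 decoder: the second argument (throwOnInvalidBytes) makes
    // GetString throw a DecoderFallbackException on malformed input instead
    // of substituting replacement characters.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Assumption: the byte array returned by the OCR call is UTF-8 text.
    public static string Decode(byte[] content)
    {
        return StrictUtf8.GetString(content);
    }
}
```

For example, for the bytes 0x63 0x61 0x66 0xC3 0xA9 (the UTF-8 encoding of "café"), OcrTextDecoder.Decode returns "café", whereas the ASCII call keeps only "caf" and turns the remaining bytes into question marks.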

Create a new method, SaveToWord, that takes two parameters, the file path and the OCR result:

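// Requires: using DocumentFormat.OpenXml.Packaging;
//           using DocumentFormat.OpenXml.Wordprocessing;
private void SaveToWord(string filepath, string ocrResult)
{
    // Create a new .docx package at the given path
    using (WordprocessingDocument doc = WordprocessingDocument.Create(filepath, DocumentFormat.OpenXml.WordprocessingDocumentType.Document))
    {
        // Build the minimal structure: document > body > paragraph > run
        MainDocumentPart mainPart = doc.AddMainDocumentPart();
        mainPart.Document = new Document();
        Body body = mainPart.Document.AppendChild(new Body());
        Paragraph para = body.AppendChild(new Paragraph());
        Run run = para.AppendChild(new Run());

        // Strip characters that are not valid in XML before writing the text
        string returnValue = FilterInvalidXmlChars(ocrResult);
        run.AppendChild(new Text(returnValue));
    }
}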
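
SaveToWord writes the whole OCR result into a single run. Word generally does not render newline characters inside one text element as line breaks, so multi-line results may appear as one long paragraph. The following is a minimal sketch of a variant that writes one paragraph per line. The names WordExportSketch and SaveToWordByLine are hypothetical, and the sketch assumes the text has already been passed through FilterInvalidXmlChars.

```csharp
using System;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public static class WordExportSketch
{
    // Hypothetical variant of SaveToWord: one paragraph per line of OCR text.
    // Assumption: ocrResult no longer contains invalid XML characters.
    public static void SaveToWordByLine(string filepath, string ocrResult)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Create(filepath, WordprocessingDocumentType.Document))
        {
            MainDocumentPart mainPart = doc.AddMainDocumentPart();
            mainPart.Document = new Document();
            Body body = mainPart.Document.AppendChild(new Body());

            // Split on Windows, Unix and old Mac line endings; keep empty lines
            string[] lines = ocrResult.Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);
            foreach (string line in lines)
            {
                Paragraph para = body.AppendChild(new Paragraph());
                Run run = para.AppendChild(new Run());
                // Preserve leading and trailing spaces within each line
                run.AppendChild(new Text(line) { Space = SpaceProcessingModeValues.Preserve });
            }
        }
    }
}
```

Empty lines become empty paragraphs, which keeps the vertical spacing of the recognized text roughly intact.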

At this point, the Word document containing the OCR result has been generated. Note that FilterInvalidXmlChars strips characters that are not valid in XML. Without it, saving the Word document is likely to throw an exception. The pattern comes from the Stack Overflow question Unicode Regex; Invalid XML characters:

// Requires: using System.Text.RegularExpressions;
public static string FilterInvalidXmlChars(string text)
{
    // answer from http://stackoverflow.com/questions/397250/unicode-regex-invalid-xml-characters/961504#961504
    // Matches unpaired surrogates, control characters other than tab, LF and CR, and a few non-characters
    string re = @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]";
    return Regex.Replace(text, re, "");
}
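
To make the effect of the filter concrete, here is a small self-contained sketch that repeats the same pattern and can be tried on its own. The class name XmlFilterSketch and the sample input in the note below are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XmlFilterSketch
{
    // Same pattern as FilterInvalidXmlChars: unpaired surrogates, control
    // characters \x00-\x1F except tab, LF and CR, the range \x7F-\x9F,
    // and the code points U+FEFF, U+FFFE and U+FFFF.
    private const string InvalidXmlChars =
        @"(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F\uFEFF\uFFFE\uFFFF]";

    // Removes every match of the pattern and leaves all other text untouched.
    public static string Filter(string text)
    {
        return Regex.Replace(text, InvalidXmlChars, "");
    }
}
```

Given the input "OCR\u0000 text\f here\r\n", Filter drops the NUL and form feed characters and returns "OCR text here\r\n". Tabs and line endings are kept because they are valid in XML.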

For more details on using the Open XML SDK, see Word processing (Open XML SDK).

Source Code

https://github.com/DynamsoftRD/online-ocr

git clone https://github.com/DynamsoftRD/online-ocr.git