How to Detect Duplicate Document Images

When digitizing documents, it is easy to scan the same page twice by accident, and finding the duplicates by hand is tedious. In this article, we are going to use JavaScript to detect duplicate document images automatically.

How to Calculate the Similarity between Two Images

We need to compare two images to check whether they share the same content.

There are many ways to calculate the similarity between two images. Most of them fall into two categories:

  1. Compare the pixels directly (e.g. pixelmatch, mean squared error).
  2. Extract features and check whether the features match (e.g. SIFT, a convolutional neural network).
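
As an illustration of the first category, here is a minimal sketch of mean squared error over two RGBA pixel buffers, such as the `data` field of an `ImageData` object. The function name `meanSquaredError` is hypothetical, and the sketch assumes both images have already been resized to the same dimensions.

```typescript
// Mean squared error between two equally sized pixel buffers.
// 0 means the buffers are identical; larger values mean larger differences.
// `meanSquaredError` is an illustrative helper, not part of any library.
function meanSquaredError(a: Uint8ClampedArray, b: Uint8ClampedArray): number {
  if (a.length !== b.length) {
    throw new Error("Both images must have the same dimensions");
  }
  if (a.length === 0) {
    return 0;
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum / a.length;
}
```

A pixel-level measure like this is sensitive to small shifts or rotations between two scans of the same page, which is one reason to compare content instead.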

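In this article, we are going to use the most obvious feature of document images: text. Two scans of the same page rarely match pixel for pixel, but their recognized text should be nearly identical. We are going to use OCR to extract the text of each image and use the Levenshtein distance between the texts to calculate the similarity.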
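
To make the measure concrete, here is a minimal, dependency-free sketch of the Levenshtein distance and of a similarity score derived from it. The names `levenshtein` and `similarityOf` are illustrative only, and returning 0 when both strings are empty is an assumption of this sketch; the article's own code relies on the leven package for the distance.

```typescript
// Dynamic-programming Levenshtein distance that keeps only two rows in memory.
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j); // distance from the empty prefix of `a` to b[0..j)
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(
        prev[j] + 1,        // delete a character from `a`
        curr[j - 1] + 1,    // insert a character into `a`
        prev[j - 1] + cost  // substitute, or keep if the characters match
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Scale the distance by the longer string to get a score between 0 and 1.
// Assumption: two empty strings give no evidence of a match, so return 0.
function similarityOf(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : 1 - levenshtein(a, b) / longest;
}
```

For example, `levenshtein("kitten", "sitting")` is 3, and `similarityOf("abcd", "abcf")` is 0.75 because one of four characters differs.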

JavaScript Implementation

Here are the key code snippets. They are written in TypeScript.

  1. Extract the text using Tesseract.js. Store each recognized line with its bounding box, and filter out lines that are narrow (10 pixels wide or less) or have a confidence of 50 or lower.

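    import { createWorker } from 'tesseract.js';

    // A recognized line of text and its bounding box in pixels.
    interface TextLine {
      x:number;
      y:number;
      width:number;
      height:number;
      text:string;
    }

    async function recognize(imageSource:HTMLImageElement):Promise<TextLine[]> {
      // Create an English worker using the LSTM engine (OEM 1).
      const tess = await createWorker("eng", 1, {
        logger: function(m:any){console.log(m);}
      });
      try {
        const result = await tess.recognize(imageSource);
        const textLines:TextLine[] = [];
        const threshold = 50; // minimum line confidence (0-100)
        const lines = result.data.lines;
        for (let index = 0; index < lines.length; index++) {
          const line = lines[index];
          const width = line.bbox.x1 - line.bbox.x0;
          const height = line.bbox.y1 - line.bbox.y0;
          // Skip low-confidence lines and lines too narrow to be real text.
          if (line.confidence > threshold && width > 10) {
            textLines.push({
              x:line.bbox.x0,
              y:line.bbox.y0,
              width:width,
              height:height,
              text:line.text
            });
          }
        }
        return textLines;
      } finally {
        // Release the worker so that repeated calls do not leak memory.
        await tess.terminate();
      }
    }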
      
    
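  2. Calculate the similarity of the text from two images: join the lines of each image, compute the Levenshtein distance between the two strings, and normalize it by the length of the longer one.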

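    import leven from "leven";

    // Returns a similarity between 0 (completely different) and 1 (identical).
    function textSimilarity(lines1:TextLine[],lines2:TextLine[]):number {
      const text1 = textOfLines(lines1);
      const text2 = textOfLines(lines2);
      const maxLength = Math.max(text1.length,text2.length);
      // No text was recognized in either image, so there is nothing to
      // compare. Return 0 instead of computing 0 / 0, which would be NaN.
      if (maxLength === 0) {
        return 0;
      }
      const distance = leven(text1,text2);
      return 1 - distance / maxLength;
    }

    // Joins the text of all lines into one string, one line per row.
    function textOfLines(lines:TextLine[]):string {
      let content = "";
      for (let index = 0; index < lines.length; index++) {
        const line = lines[index];
        content = content + line.text + "\n";
      }
      return content;
    }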
    
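  3. Run OCR on all the scanned images, then compare each image with the one that follows it. If the similarity of a pair is above 0.7, both images are reported as duplicates.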

    async function find(images:HTMLImageElement[]):Promise<HTMLImageElement[]> {
      // Run OCR on every image first.
      const textLinesOfImages:TextLine[][] = [];
      for (let index = 0; index < images.length; index++) {
        const lines = await recognize(images[index]);
        textLinesOfImages.push(lines);
      }

      // Compare each image with the next one. Only adjacent images are compared.
      const duplicateIndexes = new Set<number>();
      for (let index = 0; index + 1 < textLinesOfImages.length; index++) {
        const textLines1 = textLinesOfImages[index];
        const textLines2 = textLinesOfImages[index+1];
        const similarity = textSimilarity(textLines1,textLines2);
        if (similarity > 0.7) {
          duplicateIndexes.add(index);
          duplicateIndexes.add(index+1);
        }
      }

      // A Set keeps insertion order, so the images are returned in scan order.
      const duplicateImages:HTMLImageElement[] = [];
      duplicateIndexes.forEach(function(index){
        duplicateImages.push(images[index]);
      });
      return duplicateImages;
    }
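
The function above compares each image only with its immediate neighbour, so a duplicate that was scanned several pages later would go unnoticed. If that matters, one possible variation is to compare every pair. The following is a minimal sketch under that assumption: `findDuplicateIndices` and the `Similarity` type are hypothetical names, and the function works on already extracted text so that any similarity function can be plugged in.

```typescript
// Any function that scores two texts between 0 and 1 can be used here.
type Similarity = (a: string, b: string) => number;

// Returns the sorted indices of every text that is similar to at least one
// other text. All pairs are compared, which takes n * (n - 1) / 2 comparisons.
function findDuplicateIndices(
  texts: string[],
  similarity: Similarity,
  threshold: number = 0.7
): number[] {
  const duplicates = new Set<number>();
  for (let i = 0; i < texts.length; i++) {
    for (let j = i + 1; j < texts.length; j++) {
      if (similarity(texts[i], texts[j]) > threshold) {
        duplicates.add(i);
        duplicates.add(j);
      }
    }
  }
  return Array.from(duplicates).sort((x, y) => x - y);
}
```

With an exact-match similarity such as `(a, b) => (a === b ? 1 : 0)`, the input `["a", "b", "a"]` yields `[0, 2]`, a pair that the adjacent-only comparison would miss. The trade-off is cost: the number of comparisons grows quadratically with the number of images.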
    

Online Demo

You can try it out in the online demo. The demo can crop document images using Dynamsoft Document Normalizer before running OCR, which increases the efficiency and accuracy of the recognition.

demo screenshot

Source Code

The source code of the library is available on GitHub:

https://github.com/tony-xlh/duplicate-documet-image-finder