The ability to recognize text is useful in contexts from ecommerce to augmented reality. But existing computer vision systems fare poorly when a single image contains characters of diverse sizes, styles, angles, and backgrounds along with other objects — like, say, loudly designed commercial packaging. Researchers at Walmart’s Bangalore lab devised an algorithm to tackle this problem.
What’s new: A team put together an end-to-end architecture that segments images into regions and feeds them into existing text extraction networks. The system outputs boxes containing the extracted text.
Key insights: The approach is twofold: Segment, then extract. The team, led by Pranay Dugar, found that:
- Segmenting an image into regions before extracting text simplifies the task by minimizing background noise and handling varied text styles separately.
- Using an ensemble of text extraction networks improves the performance of an end-to-end system. Because the networks in the ensemble can run in parallel, the system is also exceptionally fast.
How it works: The system segments images by creating and classifying superpixels: groups of adjacent pixels of similar color and intensity. It feeds the resulting regions into pretrained text extraction networks and merges the outputs.
- To generate superpixels, the authors dilate homogeneous regions by sliding a kernel over the image, much as in convolution: At each position, the pixel at the kernel's center is replaced by the maximum value in the area the kernel covers. Higher pixel values thus spread into neighboring lower-valued pixels, which expands foreground regions (i.e., text) and shrinks background regions (gaps between objects).
- To classify superpixels, the authors create vectors composed of the mean, standard deviation, and energy of the responses of filters applied at various scales and orientations. Superpixels whose vectors are close in Euclidean distance are taken to belong to the same text region. Superpixels are grouped by maximizing the probability that they belong together.
- To extract text, the authors run TextBoxes and TextBoxes++ models pretrained on the ICDAR 2013 data set of photos depicting commercial packaging, business signs, and the like. Where the two models produce redundant boxes for the same text, the authors keep only the box with the highest confidence score.
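Two of the steps above, dilation and pruning of redundant boxes, can be illustrated with a minimal Python sketch. This is not the authors' code. The function names (`dilate`, `iou`, `prune`), the 3x3 kernel, the use of intersection-over-union to decide that two boxes are redundant, and the 0.5 threshold are all assumptions made for illustration; the article says only that the highest-confidence box is kept.

```python
def dilate(image, k=3):
    """Grayscale dilation: replace each pixel with the maximum value
    found in the k x k window centered on it (clipped at the borders).
    `image` is a list of equal-length rows of numbers."""
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                image[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return out


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def prune(detections, iou_threshold=0.5):
    """Keep the highest-confidence box among overlapping ones.
    `detections` is a list of (box, score) pairs pooled from all models.
    The overlap test and threshold are assumptions, not from the paper."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a better box already kept.
        if all(iou(box, other) < iou_threshold for other, _ in kept):
            kept.append((box, score))
    return kept


if __name__ == "__main__":
    # A single bright pixel grows into a 3x3 bright patch.
    img = [[0] * 5 for _ in range(5)]
    img[2][2] = 9
    for row in dilate(img):
        print(row)

    # Two overlapping boxes (e.g., one from each model) and one separate box.
    boxes = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
    print(prune(boxes))
```

In the toy example, the bright pixel expands to fill its neighbors, which is how dilation grows foreground regions, and the lower-confidence box of the overlapping pair is discarded while the non-overlapping box survives.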
Results: The system is competitive with earlier methods on ICDAR 2013. But it excels with so-called high-entropy images that are unusually complex. It improves both precision (the proportion of predictions that are correct) and recall (the proportion of ground-truth text instances that the system finds) by 10 percent on the authors' own Walmart High Entropy Images data set.
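For readers who want the two metrics spelled out, here is a small Python sketch. The function name `precision_recall` and the counts in the example are hypothetical and are not results from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts.

    precision = correct predictions / all predictions
    recall    = correct predictions / all ground-truth items
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 8 text boxes found correctly, 2 spurious boxes,
    # and 4 ground-truth text instances missed.
    p, r = precision_recall(8, 2, 4)
    print(f"precision={p:.2f} recall={r:.2f}")
```

A system can trade one metric for the other (predicting fewer boxes tends to raise precision and lower recall), so an improvement in both is a stronger result than an improvement in either alone.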
Why it matters: Extracting words from images containing multiple text-bearing objects is difficult. The letters may be poorly lit, slanted at an angle, or only partially visible. Jumbled together, they can give a computer vision system fits. Segmenting text regions and then using the ensemble of text-extraction models makes the problem more tractable.
Takeaway: In a world increasingly crowded with signs, symbols, and messages, applications of such technology are numerous. It could mean efficient creation of digital restaurant menus from physical copies, or an agent that lets you know when an ice cream truck passes your house. It gives machines a way to read words in chaotic environments — and possibly new ways to communicate with us.