7 Practical Tess4J Examples for Extracting Text from Images
Optical Character Recognition (OCR) lets software read text from images and scanned documents. For Java developers, Tess4J is a widely used wrapper around the Tesseract OCR engine that makes it straightforward to integrate OCR into Java applications. This article walks through seven practical Tess4J examples, from basic extraction to advanced preprocessing and multilingual support, with code snippets, tips for improving accuracy, and notes on common pitfalls.
What is Tess4J (brief)
Tess4J is a Java JNA wrapper for the Tesseract OCR API. It exposes Tesseract features to Java developers so you can perform OCR without calling external command-line tools. Tess4J supports multiple languages, page segmentation modes, and configuration options inherited from Tesseract.
Prerequisites
- Java 8+ (compatible with newer versions)
- Tess4J library (available via Maven/Gradle or as a jar)
- Native Tesseract binaries and traineddata files (install Tesseract on your system and ensure the tessdata path is set)
- Basic image-processing libraries (javax.imageio, OpenCV or TwelveMonkeys ImageIO for extra formats)
Add Tess4J via Maven (example):
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.4.0</version>
</dependency>
Replace 5.4.0 with the current release when you set up your project.
Example 1 — Simple OCR: Extract text from a PNG/JPEG
This is the minimal, get-started example: load an image and run OCR.
import net.sourceforge.tess4j.*;
import java.io.File;

public class SimpleOcr {
    public static void main(String[] args) {
        File imageFile = new File("sample.png");
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath("C:/tessdata"); // path to tessdata directory
        try {
            String result = tesseract.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}
Tips:
- Specify the correct tessdata path.
- Set the language explicitly, for example tesseract.setLanguage("eng") for English.
Example 2 — OCR with pre-processing (grayscale + thresholding)
Preprocessing often increases accuracy. This example uses simple Java image manipulation to convert to grayscale and apply binary thresholding before OCR.
import net.sourceforge.tess4j.*;
import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;

public class PreprocessOcr {

    // Redraw the image into an 8-bit grayscale buffer.
    public static BufferedImage toGrayscale(BufferedImage img) {
        BufferedImage gray = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        g.drawImage(img, 0, 0, null);
        g.dispose(); // release the graphics context
        return gray;
    }

    // Binarize: pixels brighter than thresh become white, the rest black.
    public static BufferedImage threshold(BufferedImage img, int thresh) {
        BufferedImage bin = new BufferedImage(img.getWidth(), img.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                // Low byte is the blue channel; on a grayscale image all three channels carry the same value.
                int level = img.getRGB(x, y) & 0xFF;
                bin.setRGB(x, y, level > thresh ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return bin;
    }

    public static void main(String[] args) throws Exception {
        BufferedImage img = ImageIO.read(new File("noisy.jpg"));
        BufferedImage gray = toGrayscale(img);
        BufferedImage bin = threshold(gray, 128);
        ImageIO.write(bin, "png", new File("preprocessed.png")); // keep a copy for inspection
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath("C:/tessdata");
        System.out.println(tesseract.doOCR(bin));
    }
}
When to use:
- Low-contrast scans, high noise, or simple monochrome text.
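The fixed threshold of 128 used above is a guess that will not suit every scan. As a minimal sketch, assuming the input is a TYPE_BYTE_GRAY image such as the one toGrayscale produces, Otsu's method can pick the threshold from the image histogram instead. The class name OtsuThreshold is illustrative and not part of Tess4J.

```java
import java.awt.image.BufferedImage;
import java.awt.image.Raster;

/** Sketch: choose a global binarization threshold with Otsu's method. */
final class OtsuThreshold {

    /** Returns t in 0..255; pixels with gray level > t fall in the bright class. */
    static int compute(BufferedImage gray) {
        Raster raster = gray.getRaster();
        int[] hist = new int[256];
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                hist[raster.getSample(x, y, 0)]++; // band 0 holds the gray level
            }
        }
        long total = (long) gray.getWidth() * gray.getHeight();
        long sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (long) i * hist[i];
        }
        long sumDark = 0;   // sum of gray levels in the dark class (<= t)
        long countDark = 0; // number of pixels in the dark class
        double bestVariance = -1;
        int bestT = 0;
        for (int t = 0; t < 256; t++) {
            countDark += hist[t];
            sumDark += (long) t * hist[t];
            if (countDark == 0) continue;
            long countBright = total - countDark;
            if (countBright == 0) break;
            double meanDark = sumDark / (double) countDark;
            double meanBright = (sumAll - sumDark) / (double) countBright;
            double diff = meanDark - meanBright;
            // Between-class variance (up to a constant factor); maximize it.
            double variance = (double) countDark * countBright * diff * diff;
            if (variance > bestVariance) {
                bestVariance = variance;
                bestT = t;
            }
        }
        return bestT;
    }
}
```

The result can be passed straight to the threshold method from Example 2, for example threshold(gray, OtsuThreshold.compute(gray)). A global threshold still struggles with uneven lighting, where an adaptive method is usually a better fit.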
Example 3 — Using OpenCV for advanced preprocessing (deskew, denoise)
OpenCV gives more control: deskewing, morphological operations, and noise reduction. Below is a conceptual snippet; ensure OpenCV Java bindings are set up.
// High-level steps (conceptual, not full code):
// 1. Load image with OpenCV Mat
// 2. Convert to grayscale: Imgproc.cvtColor(src, gray, Imgproc.COLOR_BGR2GRAY)
// 3. Apply Gaussian blur: Imgproc.GaussianBlur(gray, blurred, new Size(3,3), 0)
// 4. Use adaptive threshold or Otsu: Imgproc.threshold(blurred, thresh, 0, 255, Imgproc.THRESH_BINARY + Imgproc.THRESH_OTSU)
// 5. Detect rotation via moments or Hough lines and rotate to deskew
// 6. Convert Mat back to BufferedImage and pass to Tesseract.doOCR()
Why use OpenCV:
- Better handling of skewed scans, complex backgrounds, and layout analysis.
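If OpenCV is not available, the denoise step can be approximated in plain Java. The sketch below is a dependency-free stand-in, not OpenCV code: it applies a 3x3 median filter to a TYPE_BYTE_GRAY image, which removes isolated salt-and-pepper specks. The class name MedianDenoise is illustrative; with OpenCV set up, Imgproc.medianBlur covers the same ground.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.util.Arrays;

/** Sketch: 3x3 median filter for an 8-bit grayscale image. */
final class MedianDenoise {

    static BufferedImage median3x3(BufferedImage gray) {
        int w = gray.getWidth();
        int h = gray.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster src = gray.getRaster();
        WritableRaster dst = out.getRaster();
        int[] window = new int[9];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbors.
                        int yy = Math.min(h - 1, Math.max(0, y + dy));
                        int xx = Math.min(w - 1, Math.max(0, x + dx));
                        window[n++] = src.getSample(xx, yy, 0);
                    }
                }
                Arrays.sort(window);
                dst.setSample(x, y, 0, window[4]); // middle of nine sorted values
            }
        }
        return out;
    }
}
```

A median filter keeps edges sharper than a Gaussian blur of similar size, but it can erase very thin strokes, so it is worth comparing OCR output with and without it.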
Example 4 — Region-based OCR: extract text from specific areas
When only part of an image contains useful text (forms, invoices), crop regions and OCR them individually.
import net.sourceforge.tess4j.*;
import javax.imageio.ImageIO;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;

public class RegionOcr {
    public static void main(String[] args) throws Exception {
        BufferedImage img = ImageIO.read(new File("form.jpg"));
        Rectangle nameField = new Rectangle(100, 200, 400, 60); // x, y, width, height
        // Crop the field; getSubimage throws if the rectangle exceeds the image bounds.
        BufferedImage sub = img.getSubimage(nameField.x, nameField.y, nameField.width, nameField.height);
        ITesseract tess = new Tesseract();
        tess.setDatapath("C:/tessdata");
        // Restrict recognition to letters and spaces.
        // Older Tess4J releases call this method setTessVariable.
        tess.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ");
        String name = tess.doOCR(sub);
        System.out.println("Name: " + name.trim());
    }
}
Use cases:
- OCRing labeled fields, receipts, invoices, ID cards.
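Field coordinates often come from a template, and a scan that is slightly smaller than expected makes BufferedImage.getSubimage throw a RasterFormatException. A minimal sketch of a safer crop, assuming it is acceptable to clip the region to the image bounds, follows. SafeCrop is an illustrative helper name.

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

/** Sketch: crop a region after clipping it to the image bounds. */
final class SafeCrop {

    static BufferedImage crop(BufferedImage img, Rectangle region) {
        Rectangle bounds = new Rectangle(0, 0, img.getWidth(), img.getHeight());
        Rectangle clipped = region.intersection(bounds);
        if (clipped.isEmpty()) {
            // No overlap at all: fail loudly instead of running OCR on nothing.
            throw new IllegalArgumentException("Region lies outside the image: " + region);
        }
        return img.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
    }
}
```

The returned image can be passed to doOCR in the same way as the sub image in the example above.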
Example 5 — Multilingual OCR and language selection
Tesseract supports multiple languages. Install corresponding traineddata files and specify languages joined by ‘+’.
ITesseract tess = new Tesseract();
tess.setDatapath("C:/tessdata");
tess.setLanguage("eng+fra"); // English + French
String text = tess.doOCR(new File("multilang.png"));
Notes:
- More languages may slow OCR and slightly reduce accuracy; restrict to expected languages when possible.
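A missing traineddata file only shows up when OCR starts, so it can help to check the tessdata directory before building the language string. The sketch below assumes the usual layout of one <code>.traineddata file per language directly inside the tessdata directory. LanguageConfig and resolve are illustrative names, not Tess4J API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch: validate language codes against the tessdata directory, then join them. */
final class LanguageConfig {

    /** Returns e.g. "eng+fra", or throws if any traineddata file is missing. */
    static String resolve(Path tessdata, List<String> codes) {
        List<String> missing = new ArrayList<>();
        for (String code : codes) {
            if (!Files.isRegularFile(tessdata.resolve(code + ".traineddata"))) {
                missing.add(code);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalStateException(
                    "Missing traineddata for " + missing + " in " + tessdata);
        }
        return String.join("+", codes);
    }
}
```

The result is meant to be handed to setLanguage, for example tess.setLanguage(LanguageConfig.resolve(Paths.get("C:/tessdata"), Arrays.asList("eng", "fra"))).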
Example 6 — Configuring Page Segmentation Mode (PSM) and OCR Engine Mode (OEM)
Tesseract offers PSM and OEM settings that influence how it segments text and which recognition engine to use.
ITesseract tess = new Tesseract();
tess.setDatapath("C:/tessdata");
tess.setLanguage("eng");
// OEM 1 = LSTM only.
tess.setOcrEngineMode(ITessAPI.TessOcrEngineMode.OEM_LSTM_ONLY);
// PSM 6 = assume a single uniform block of text.
tess.setPageSegMode(ITessAPI.TessPageSegMode.PSM_SINGLE_BLOCK);
String out = tess.doOCR(new File("document.png"));
Common PSM modes:
- PSM_SINGLE_BLOCK (6) for single-column text
- PSM_SINGLE_LINE (7) for single text line
- PSM_AUTO (3) for automatic segmentation
OEM options:
- OEM_TESSERACT_ONLY, OEM_LSTM_ONLY, OEM_TESSERACT_LSTM_COMBINED, OEM_DEFAULT
Example 7 — Post-processing and spell-check for improved results
OCR often returns small errors. Use regex, dictionaries, or spell-check libraries to correct results.
Simple example: normalize common OCR mistakes and run a spell-checker.
String raw = tess.doOCR(file);
String normalized = raw
        .replaceAll("[|]", "I")             // vertical bar misread for a capital I
        .replaceAll("0(?=[A-Za-z])", "O");  // zero directly before a letter is likely the letter O
// Use a spell-check library like Jazzy or LanguageTool for more corrections
System.out.println(normalized);
Tips:
- For structured outputs (dates, amounts), parse with regex to validate and correct formats.
- Train a custom language model or use whitelist/blacklist to constrain results.
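For a structured field such as a monetary amount, correction and validation can go together: map the usual letter-for-digit confusions, then accept the value only if it matches the expected format. This is a minimal sketch under stated assumptions: amounts have exactly two decimals, no thousands separators, and either a decimal point or a decimal comma. AmountField is an illustrative name.

```java
import java.util.Optional;
import java.util.regex.Pattern;

/** Sketch: repair common OCR confusions in an amount field and validate the result. */
final class AmountField {

    // Assumed format: 1 to 9 digits, a decimal point, exactly two decimals.
    private static final Pattern AMOUNT = Pattern.compile("\\d{1,9}\\.\\d{2}");

    /** Returns the cleaned amount, or empty if it still does not look like one. */
    static Optional<String> normalize(String raw) {
        String cleaned = raw
                .replaceAll("\\s+", "")  // drop stray whitespace
                .replace('O', '0')
                .replace('o', '0')
                .replace('l', '1')
                .replace('I', '1')
                .replace('S', '5')
                .replace(',', '.');      // treat a decimal comma as a decimal point
        return AMOUNT.matcher(cleaned).matches()
                ? Optional.of(cleaned)
                : Optional.empty();
    }
}
```

These substitutions are only safe because the field is known to be numeric. Applied to free text they would corrupt ordinary words, so keep them scoped to individual fields.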
Improving accuracy — practical checklist
- Use high-resolution images (300 DPI recommended for printed text).
- Preprocess: grayscale, binarize, denoise, deskew.
- Restrict character set when you know expected characters (tessedit_char_whitelist).
- Choose correct language(s) and PSM.
- Use OpenCV for heavy image-cleaning tasks.
- Post-process with regex, dictionaries, or NLP models.
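The resolution item in the checklist can be handled in code when the source DPI is known. The sketch below upscales an image with bicubic interpolation so that its effective resolution reaches a target such as 300 DPI. It assumes the caller supplies the source DPI (for example from scanner settings), since BufferedImage does not carry it. Upscale and toDpi are illustrative names.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

/** Sketch: upscale an image so its effective resolution reaches a target DPI. */
final class Upscale {

    static BufferedImage toDpi(BufferedImage src, int sourceDpi, int targetDpi) {
        if (sourceDpi <= 0 || targetDpi <= 0) {
            throw new IllegalArgumentException("DPI values must be positive");
        }
        if (sourceDpi >= targetDpi) {
            return src; // already at or above the target; never downscale here
        }
        double factor = (double) targetDpi / sourceDpi;
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, w, h, null); // scale while drawing
        } finally {
            g.dispose();
        }
        return out;
    }
}
```

Upscaling cannot restore detail that the scan never captured, but it often helps with small fonts because Tesseract tends to work best when characters are a reasonable pixel height.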
Performance and deployment tips
- For batch OCR, reuse a single ITesseract instance or pool instances to reduce startup cost.
- Consider running Tesseract as a local service for high-throughput systems.
- Monitor memory and native library loading when using multiple threads; Tess4J and Tesseract load native libraries that may not be fully thread-safe without careful handling.
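One way to combine instance reuse with thread safety is a small fixed-size pool in which each engine is used by only one thread at a time. The sketch below is generic on purpose so that it runs without Tess4J on the classpath. EnginePool and Task are illustrative names, and in real use the factory would be something like Tesseract::new followed by the usual setDatapath and setLanguage calls.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Sketch: fixed-size pool that lends each engine to one thread at a time. */
final class EnginePool<E> {

    /** Work to run against a borrowed engine; may throw checked exceptions. */
    interface Task<E, R> {
        R run(E engine) throws Exception;
    }

    private final BlockingQueue<E> idle;

    EnginePool(int size, Supplier<E> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the startup cost once per engine
        }
    }

    /** Blocks until an engine is free, runs the task, and always returns the engine. */
    <R> R with(Task<E, R> task) throws Exception {
        E engine = idle.take();
        try {
            return task.run(engine);
        } finally {
            idle.add(engine);
        }
    }

    /** Number of engines currently not in use. */
    int available() {
        return idle.size();
    }
}
```

With Tess4J the call site would look like pool.with(t -> t.doOCR(file)). Sizing the pool close to the number of CPU cores is a reasonable starting point, since each OCR call is CPU-bound.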
Common pitfalls
- Missing or wrong tessdata path causes “traineddata” errors.
- Low-quality images produce garbled text.
- Multi-column layouts require segmentation — PSM_AUTO may fail; consider detecting columns manually.
- Whitelists that are too restrictive can remove valid characters.
Conclusion
Tess4J makes Tesseract accessible from Java and can handle many OCR tasks from simple one-off extractions to robust, production-grade pipelines. Use preprocessing (OpenCV), correct configuration (language, PSM, OEM), and post-processing (regex/spell-check) to maximize accuracy. The seven examples above provide a practical foundation to build OCR features into Java applications.