In today’s digital world, extracting text from images has become a common necessity. Whether it’s scanned documents, handwritten notes, or images with embedded text, Optical Character Recognition (OCR) technology can help convert these images into machine-readable text. One of the most popular and effective tools for OCR is Tesseract OCR, and in this post, we’ll guide you through creating a simple PHP script to read text from images using Tesseract.
How the PHP Script Works
The script we’ll create leverages the Tesseract OCR engine to extract text from images. Tesseract is an open-source, command-line OCR engine, originally developed at Hewlett-Packard and later sponsored by Google, known for its high accuracy and language support. We’ll be using the thiagoalessio/tesseract_ocr PHP wrapper, which allows PHP applications to easily interact with Tesseract for our image-to-text conversion.
Here’s a quick overview of how the process works:
- User Uploads an Image: The user submits an image (such as a scanned document) via an HTML form.
- PHP Script Processes the Image: The script then uses Tesseract OCR to process the image and extract the text.
- Display Extracted Text: Finally, the extracted text is displayed back to the user.
Below is sample PHP code that implements this:
Code Example: PHP Script to Extract Text from Image
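<?php
// Include Composer autoload (install the wrapper with: composer require thiagoalessio/tesseract_ocr)
require 'vendor/autoload.php';

// Import the TesseractOCR class
use thiagoalessio\TesseractOCR\TesseractOCR;

// Check if the form was submitted
if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_FILES['image'])) {
    // Specify the directory for uploaded files (it must exist and be writable)
    $upload_dir = 'uploads/';

    // Accept only known image extensions and generate the stored file name
    // ourselves, so a user-supplied name cannot overwrite files or end in .php
    $allowed = ['png', 'jpg', 'jpeg', 'gif', 'bmp', 'tif', 'tiff'];
    $ext = strtolower(pathinfo($_FILES['image']['name'], PATHINFO_EXTENSION));
    $uploaded_file = $upload_dir . bin2hex(random_bytes(8)) . '.' . $ext;

    if (!in_array($ext, $allowed, true)) {
        echo "Unsupported file type.";
    } elseif (move_uploaded_file($_FILES['image']['tmp_name'], $uploaded_file)) {
        try {
            // Use Tesseract OCR to extract text from the image
            $ocr = new TesseractOCR($uploaded_file);
            $text = $ocr->run(); // Extract text

            echo "<h2>Extracted Text from Image:</h2>";
            // Escape the result so text found in the image cannot inject HTML
            echo "<pre>" . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . "</pre>";
        } catch (Exception $e) {
            // run() throws if, for example, the tesseract binary cannot be found
            echo "OCR failed: " . htmlspecialchars($e->getMessage(), ENT_QUOTES, 'UTF-8');
        }
    } else {
        echo "Error uploading the file.";
    }
} else {
    echo '
    <h2>Upload an Image for Text Extraction</h2>
    <form action="" method="post" enctype="multipart/form-data">
        <label for="image">Select image to upload:</label>
        <input type="file" name="image" id="image" required>
        <input type="submit" value="Upload Image" name="submit">
    </form>';
}
?>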
Advantages of Using PHP for OCR
While OCR can be implemented in various programming languages, PHP provides several benefits, especially if you’re already working in a PHP-based web environment.
1. Ease of Integration
PHP is widely used in web development, and if you’re building a web application, adding OCR functionality with PHP is a natural choice. Using the Tesseract OCR PHP wrapper allows you to seamlessly integrate text extraction from images without needing to rely on external services or complex configurations.
2. Cross-Platform Compatibility
Tesseract OCR works across multiple platforms, and with the PHP wrapper, you can deploy your OCR solution to any server (Linux, Windows, macOS) that supports PHP, provided the Tesseract binary is installed on that server and PHP is allowed to run it.
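Before deploying, it can help to confirm that the server actually has Tesseract available. The following is a minimal sketch, assuming the binary is on the PATH under the name tesseract and that shell_exec() is enabled; parseTesseractVersion() is a hypothetical helper, not part of the wrapper.

```php
<?php
// Hypothetical helper: pull the version number out of `tesseract --version` output.
function parseTesseractVersion(string $output): ?string
{
    // The first line typically looks like "tesseract 5.3.0" or "tesseract v5.0.0-alpha"
    if (preg_match('/^tesseract\s+v?(\d+(?:\.\d+)+)/mi', $output, $m)) {
        return $m[1];
    }
    return null; // binary missing, or output in an unexpected format
}

// 2>&1 merges stderr into stdout, in case a build prints the version there
$output = (string) shell_exec('tesseract --version 2>&1');
$version = parseTesseractVersion($output);

echo $version === null
    ? "Tesseract not found on this server\n"
    : "Tesseract $version is available\n";
```

If the check reports that Tesseract is missing, install it with the platform's package manager before using the wrapper.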
3. Open Source and Free
Both PHP and Tesseract OCR are open-source and free to use, making them an ideal solution for budget-conscious projects.
Disadvantages of Using PHP for OCR
While PHP is a great tool for OCR, there are some limitations to consider when using it compared to other programming languages:
1. Performance
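PHP is not the most performance-optimized language when it comes to image processing and OCR. The PHP wrapper runs the tesseract command-line program as a separate process for each image, which adds overhead. Languages like C++ or Python might provide better performance through more direct integrations with the Tesseract library (C++, for example, can call the Tesseract API in-process).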
2. Limited Libraries for Advanced Image Processing
While PHP can handle basic image processing (e.g., resizing, cropping), it lacks the extensive image processing libraries that other languages offer (like Python’s Pillow or OpenCV). Advanced image preprocessing can improve OCR accuracy, and this might be harder to achieve with PHP.
3. Not Ideal for Batch Processing
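For bulk OCR processing or handling large volumes of images, a language like Python, which has libraries like pytesseract and a broad ecosystem for parallel and background processing, might be a better choice. PHP can handle batch processing but may not be as efficient, particularly inside a web request, where execution time limits apply.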
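For modest volumes, a batch job in PHP is still workable when run from the command line rather than inside a web request. The sketch below assumes a hypothetical helper, ocrDirectory(), which takes the OCR step as a callable so that the loop itself does not depend on Tesseract being installed.

```php
<?php
// Hypothetical helper: OCR every image in a directory, one file at a time.
// $ocr is any callable that takes an image path and returns its text.
function ocrDirectory(string $dir, callable $ocr): array
{
    $allowed = ['png', 'jpg', 'jpeg', 'tif', 'tiff', 'bmp'];
    $results = [];
    $paths = glob(rtrim($dir, '/') . '/*') ?: [];

    foreach ($paths as $path) {
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (!is_file($path) || !in_array($ext, $allowed, true)) {
            continue; // skip sub-directories and non-image files
        }
        try {
            $results[basename($path)] = $ocr($path);
        } catch (Exception $e) {
            // Record the failure and keep going so one bad image does not stop the batch
            $results[basename($path)] = null;
        }
    }
    return $results;
}
```

With the wrapper loaded, the callable could be something like fn (string $path): string => (new TesseractOCR($path))->run(). Each image still starts its own tesseract process, so throughput is bounded by that per-file cost.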
Supported Human Languages
One of Tesseract’s strongest features is its multi-language support. With the corresponding language data files installed (many installations ship with only English by default), Tesseract OCR can handle many languages, including:
- English (eng)
- Spanish (spa)
- French (fra)
- German (deu)
- Italian (ita)
- Portuguese (por)
- Chinese (chi_sim, chi_tra)
- Arabic (ara)
- Russian (rus)
Tesseract OCR supports over 100 languages, and you can even add custom languages or train Tesseract to handle specialized fonts or handwriting. The PHP wrapper allows you to specify single or multiple languages using the lang() method:
$ocr->lang('spa'); // For Spanish
Tesseract allows you to specify multiple languages at once by separating the language codes with a plus sign (+). For example, if you want to extract text in English and Spanish, you would do it like this:
$ocr->lang('eng+spa'); // For both English and Spanish
This will make Tesseract use both the English and Spanish language models to extract text, improving accuracy for mixed-language text.
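Requesting a language whose data file is not installed makes Tesseract fail, so it can be worth validating the codes first. Here is a minimal sketch, assuming a hypothetical helper, buildLangArgument(), and an installed-language list that you obtain yourself (for example from the output of tesseract --list-langs).

```php
<?php
// Hypothetical helper: build the value passed to lang() from a list of codes,
// refusing codes whose language data is not installed.
function buildLangArgument(array $wanted, array $installed): string
{
    if ($wanted === []) {
        throw new InvalidArgumentException('No languages requested');
    }
    $missing = array_diff($wanted, $installed);
    if ($missing !== []) {
        throw new InvalidArgumentException(
            'Language data not installed: ' . implode(', ', $missing)
        );
    }
    return implode('+', $wanted);
}

// Example list only; replace it with the languages installed on your server
$installed = ['eng', 'spa', 'osd'];
echo buildLangArgument(['eng', 'spa'], $installed), "\n"; // prints "eng+spa"
```

The returned string can then be passed on as $ocr->lang(buildLangArgument(['eng', 'spa'], $installed)).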
Accuracy of Results
Tesseract OCR can be very accurate at image-to-text conversion on clean, printed text, but the quality of the results depends on several factors:
1. Image Quality
The quality of the input image is crucial. High-resolution images with clear, high-contrast text will yield better results. If the image is blurry or has poor lighting, Tesseract’s accuracy will drop significantly.
2. Text Font and Size
Tesseract works best with standard fonts and clear text. Handwritten or highly stylized fonts might not be recognized well without additional training. Preprocessing the image to enhance text clarity can improve results.
3. Noise and Background
Text on a noisy or cluttered background may reduce OCR accuracy. It’s important to preprocess the image (e.g., converting it to grayscale, enhancing contrast, removing noise) before running it through OCR to achieve better results.
Opportunities to Fine-Tune the Output
Tesseract OCR offers several ways to fine-tune the output for better accuracy:
1. Image Preprocessing
Before running OCR, you can preprocess the image to remove noise, adjust contrast, or convert the image to grayscale. Libraries like ImageMagick or GD (which can be used in PHP) can help with this.
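As a rough illustration of the GD route, the sketch below converts an image to grayscale and raises its contrast before OCR. It assumes the GD extension is enabled; preprocessForOcr() is a hypothetical helper, and the contrast level of -40 is an arbitrary starting point rather than a recommended value.

```php
<?php
// Hypothetical helper: grayscale + contrast boost with GD, saved as a PNG.
function preprocessForOcr(string $source, string $target): bool
{
    $data = @file_get_contents($source);
    if ($data === false) {
        return false; // source file missing or unreadable
    }
    // imagecreatefromstring() detects the format (PNG, JPEG, GIF, ...) from the data
    $image = @imagecreatefromstring($data);
    if ($image === false) {
        return false; // not an image GD can decode
    }
    imagefilter($image, IMG_FILTER_GRAYSCALE);
    // For IMG_FILTER_CONTRAST, negative values increase contrast
    imagefilter($image, IMG_FILTER_CONTRAST, -40);

    return imagepng($image, $target);
}
```

The processed file, not the original upload, is then the path handed to new TesseractOCR(...). Results vary by image, so compare the OCR output with and without preprocessing.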
2. Custom Training
Tesseract allows users to train the engine on specific fonts, handwriting styles, or even custom characters. This can be especially useful if you are working with specialized documents or hard-to-read text.
3. Configuration Parameters
Tesseract provides various configuration parameters that can be passed during OCR execution to fine-tune its behavior. For example, you can adjust the OCR engine mode, specify page segmentation modes, or handle specific image types.
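With the PHP wrapper, these parameters are set through chained method calls. The sketch below uses the wrapper's psm() and oem() methods; tuneForSingleBlock() is a hypothetical helper, and the chosen modes are only an example for an image holding one block of text.

```php
<?php
// Hypothetical helper: apply tuning options to a wrapper instance and return it.
// $ocr is expected to be a thiagoalessio\TesseractOCR\TesseractOCR object.
function tuneForSingleBlock($ocr)
{
    return $ocr
        ->psm(6)  // page segmentation mode 6: assume a single uniform block of text
        ->oem(1); // OCR engine mode 1: LSTM neural network engine only
}

// Usage, assuming Composer's autoloader is loaded and the image exists:
// $text = tuneForSingleBlock(new TesseractOCR('uploads/receipt.png'))->run();
```

Running tesseract --help-psm and tesseract --help-oem lists the modes that the installed version supports.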
Conclusion
In this post, we explored how to use PHP to create a simple script that can read text from images using Tesseract OCR. While PHP may not be the first language that comes to mind for image processing tasks, it’s a great option if you’re working within a PHP-based web environment. By leveraging Tesseract, you gain access to powerful OCR capabilities with support for over 100 languages.
However, PHP’s performance and limited image processing libraries might not make it the best choice for large-scale or high-performance OCR tasks. Despite these limitations, PHP’s ease of integration, open-source nature, and accessibility make it a valuable tool for OCR on web platforms.
By fine-tuning your images and exploring Tesseract’s configuration options, you can achieve highly accurate results, and with the ability to process a variety of languages, this script opens up many opportunities for text extraction projects.