Machine Learning

Amazon Textract

"Extract text, handwriting, and data from documents automatically."

What is Amazon Textract?

Amazon Textract is a machine learning service that automatically extracts printed text, handwriting, and data from scanned documents. It goes beyond simple OCR (Optical Character Recognition) to identify specific fields, forms, and tables.

Key Concepts

1. Beyond Simple OCR

  • Textract identifies Forms (Key-Value pairs like "Name: John") and Tables (rows and columns).
  • It preserves the structure of the document.
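To make the Forms idea concrete: in an AnalyzeDocument response, key-value pairs are expressed as KEY_VALUE_SET blocks linked by relationships. The sketch below walks that structure with plain dictionaries. The helper names (`block_text`, `extract_key_values`) and the `sample_response` are hypothetical; the sample is hand-built and heavily abbreviated, not real service output.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed as CHILD of `block`."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return {key text: value text} for every KEY block in the response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue  # VALUE blocks are reached through their KEY block
        key = block_text(block, blocks_by_id)
        values = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    values.append(block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[key] = " ".join(values)
    return pairs


# Hand-built, abbreviated sample (an assumption for illustration only).
sample_response = {
    "Blocks": [
        {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
         "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                           {"Type": "CHILD", "Ids": ["w1"]}]},
        {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
        {"Id": "w2", "BlockType": "WORD", "Text": "John"},
    ]
}

print(extract_key_values(sample_response))  # {'Name:': 'John'}
```

Real responses carry more fields per block (confidence scores, geometry, checkbox elements), which this sketch ignores.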

2. Document Analysis

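  • AnalyzeDocument: The API for extracting text together with structure such as forms (key-value pairs) and tables.
  • DetectDocumentText: The API for plain text detection. It returns lines and words without form or table structure.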
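As a rough illustration of how the two APIs differ in use, here is a minimal sketch with boto3. It assumes AWS credentials and a region are already configured and that the document is a local image passed as raw bytes. The helper names (`detect_lines`, `analyze_block_counts`, `main`) are hypothetical, and error handling is omitted.

```python
from collections import Counter


def detect_lines(client, document_bytes):
    """Plain text detection: return the text of each LINE block."""
    response = client.detect_document_text(Document={"Bytes": document_bytes})
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def analyze_block_counts(client, document_bytes):
    """Structured analysis: request forms and tables, count block types."""
    response = client.analyze_document(
        Document={"Bytes": document_bytes},
        FeatureTypes=["FORMS", "TABLES"],
    )
    return Counter(b["BlockType"] for b in response["Blocks"])


def main(path):
    # boto3 is imported here so the helpers above load without it installed.
    import boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        data = f.read()
    print(detect_lines(client, data))
    print(analyze_block_counts(client, data))
```

Calling `main("scan.png")` (a placeholder path) would send the same bytes to both operations. Passing the client as a parameter keeps the helpers easy to exercise with a stub instead of the live service.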

3. Use Cases

  • Medical Records: Extracting patient data.
  • Financial Reports: Pulling data from tables in PDF reports.
  • Invoices & Receipts: Automatically processing expenses.
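For the table-oriented use cases above, the rows and columns have to be rebuilt from the response: each TABLE block points to CELL blocks, and each CELL carries a 1-based RowIndex and ColumnIndex. A minimal sketch follows, with hypothetical helper names and a hand-built, abbreviated sample rather than real service output.

```python
def table_to_rows(table_block, blocks_by_id):
    """Rebuild one TABLE block as a list of rows of cell text."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks_by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if blocks_by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Cells missing from the response become empty strings.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def extract_tables(response):
    """Return every table in the response as a list of rows."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    return [table_to_rows(b, blocks_by_id)
            for b in response["Blocks"] if b["BlockType"] == "TABLE"]


def cell(cell_id, row, col, word_id):
    """Build an abbreviated CELL block for the sample below."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row,
            "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


# Hand-built sample of a 2x2 table (an assumption for illustration only).
sample_table_response = {
    "Blocks": [
        {"Id": "t1", "BlockType": "TABLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
        cell("c1", 1, 1, "w1"), cell("c2", 1, 2, "w2"),
        cell("c3", 2, 1, "w3"), cell("c4", 2, 2, "w4"),
        {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Amount"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Rent"},
        {"Id": "w4", "BlockType": "WORD", "Text": "1200"},
    ]
}

print(extract_tables(sample_table_response))
# [[['Item', 'Amount'], ['Rent', '1200']]]
```

Merged cells and multi-page tables need extra handling that this sketch leaves out.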

Exam Tips

[!IMPORTANT] "Extract text/handwriting from documents" or "OCR with structure (tables/forms)": The answer is Amazon Textract.

[!NOTE] Distinguish from Amazon Transcribe (Audio -> Text) and Amazon Translate (Language -> Language). Textract is Image/PDF -> Text.

Common Use Cases

  • Digitizing Archives: Converting paper files to searchable text.
  • Automated Data Entry: Removing manual typing from form processing.