Amazon Textract - AWS Study Guide

What is Amazon Textract?

Amazon Textract is a machine learning service that automatically extracts printed text, handwriting, and data from scanned documents. It goes beyond simple OCR (Optical Character Recognition) to identify specific fields, forms, and tables.

Key Concepts

1. Beyond Simple OCR

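Textract identifies Forms (Key-Value pairs like "Name: John") and Tables (rows and columns).
It preserves the structure of the document: results come back as a list of Block objects (pages, lines, words, key-value sets, table cells) linked by relationships, rather than as one flat string of text.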
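To make the Key-Value idea concrete, here is a minimal sketch of reading form fields out of an AnalyzeDocument-style response. It assumes the response is a plain dictionary with a "Blocks" list, where a KEY block points to its VALUE block through a "VALUE" relationship and to its words through "CHILD" relationships. The helper names and SAMPLE_RESPONSE are hypothetical and hand-written for illustration, not real service output.

```python
def block_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed as CHILD of this block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Return {key text: value text} for every KEY block in the response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(
                        block_text(blocks_by_id[value_id], blocks_by_id))
        pairs[block_text(block, blocks_by_id)] = " ".join(value_parts)
    return pairs


# Hand-written sample (an assumption), trimmed to the fields used above.
SAMPLE_RESPONSE = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "John"},
]}

print(extract_key_values(SAMPLE_RESPONSE))
```

The point to remember for the exam is the shape of the output: keys and values are separate blocks connected by relationships, which is what "OCR with structure" means in practice.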

2. Document Analysis

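AnalyzeDocument: The API used to extract text plus structure. You choose what to analyze with the FeatureTypes parameter (for example TABLES and FORMS).
DetectDocumentText: The API for simple text detection. It returns lines and words only, with no form or table analysis.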
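The choice between the two APIs can be sketched as a small helper. This is an illustration under assumptions, not a reference implementation: read_document and FakeTextractClient are hypothetical names, and the fake client exists only so the sketch runs without AWS credentials. The method names and keyword arguments mirror the boto3 Textract client (analyze_document with Document and FeatureTypes, detect_document_text with Document).

```python
def read_document(client, image_bytes, want_structure):
    """Call AnalyzeDocument when forms/tables are needed, else DetectDocumentText."""
    document = {"Bytes": image_bytes}
    if want_structure:
        # FeatureTypes is required here and selects what gets analyzed.
        return client.analyze_document(
            Document=document, FeatureTypes=["TABLES", "FORMS"])
    return client.detect_document_text(Document=document)


class FakeTextractClient:
    """Hypothetical offline stand-in that records which operation was called."""

    def __init__(self):
        self.calls = []

    def analyze_document(self, Document, FeatureTypes):
        self.calls.append(("analyze_document", FeatureTypes))
        return {"Blocks": []}

    def detect_document_text(self, Document):
        self.calls.append(("detect_document_text", None))
        return {"Blocks": []}


client = FakeTextractClient()
read_document(client, b"fake image bytes", want_structure=True)
read_document(client, b"fake image bytes", want_structure=False)
print(client.calls)
```

Against the real service, the client passed in would be the one returned by boto3.client("textract"), and image_bytes would be the contents of an image or PDF file.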

3. Use Cases

Medical Records: Extracting patient data.
Financial Reports: Pulling data from tables in PDF reports.
Invoices & Receipts: Automatically processing expenses.
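For the table-based use cases, a rough sketch of turning a TABLE block back into rows may help. It assumes the response shape used by AnalyzeDocument with the TABLES feature: a TABLE block lists CELL blocks as children, and each CELL carries 1-based RowIndex and ColumnIndex values. The function names and the sample figures are hypothetical and made up for illustration.

```python
def cell_text(cell, blocks_by_id):
    """Join the WORD children of a CELL block into one string."""
    words = []
    for rel in cell.get("Relationships", []):
        if rel["Type"] == "CHILD":
            words.extend(blocks_by_id[i]["Text"] for i in rel["Ids"]
                         if blocks_by_id[i]["BlockType"] == "WORD")
    return " ".join(words)


def table_to_rows(table_block, blocks_by_id):
    """Rebuild a list of rows from the CELL children of a TABLE block."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = blocks_by_id[cell_id]
            if cell["BlockType"] == "CELL":
                position = (cell["RowIndex"], cell["ColumnIndex"])
                cells[position] = cell_text(cell, blocks_by_id)
    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    # Missing cells become empty strings so every row has the same width.
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


def make_cell(cell_id, row, col, word_id):
    """Build a minimal CELL block for the hand-written sample below."""
    return {"Id": cell_id, "BlockType": "CELL", "RowIndex": row,
            "ColumnIndex": col,
            "Relationships": [{"Type": "CHILD", "Ids": [word_id]}]}


# Hand-written sample (an assumption), not real service output.
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    make_cell("c1", 1, 1, "w1"), make_cell("c2", 1, 2, "w2"),
    make_cell("c3", 2, 1, "w3"), make_cell("c4", 2, 2, "w4"),
    {"Id": "w1", "BlockType": "WORD", "Text": "Quarter"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Revenue"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Q1"},
    {"Id": "w4", "BlockType": "WORD", "Text": "100"},
]

by_id = {b["Id"]: b for b in SAMPLE_BLOCKS}
print(table_to_rows(by_id["t1"], by_id))
```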

Exam Tips

[!IMPORTANT] "Extract text/handwriting from documents" or "OCR with structure (tables/forms)": The answer is Amazon Textract.

[!NOTE] Distinguish from Amazon Transcribe (Audio -> Text) and Amazon Translate (Language -> Language). Textract is Image/PDF -> Text.

More Common Use Cases

Digitizing Archives: Converting paper files to searchable text.
Automated Data Entry: Removing manual typing from form processing.
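As a small illustration of the archive-digitizing case, the sketch below pulls the LINE blocks out of a DetectDocumentText-style response and searches them. It assumes the response is a dictionary with a "Blocks" list in which LINE blocks carry a "Text" field. The function names and the sample text are hypothetical.

```python
def lines_of_text(response):
    """Return the text of every LINE block, in the order the blocks appear."""
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]


def find_lines(response, term):
    """Case-insensitive search over the detected lines."""
    needle = term.lower()
    return [line for line in lines_of_text(response) if needle in line.lower()]


# Hand-written sample (an assumption), trimmed to the fields used above.
SAMPLE_PAGE = {"Blocks": [
    {"Id": "p1", "BlockType": "PAGE"},
    {"Id": "l1", "BlockType": "LINE", "Text": "Lease Agreement"},
    {"Id": "l2", "BlockType": "LINE", "Text": "Signed by both parties"},
    {"Id": "w1", "BlockType": "WORD", "Text": "Lease"},
]}

print("\n".join(lines_of_text(SAMPLE_PAGE)))
print(find_lines(SAMPLE_PAGE, "lease"))
```

Only LINE blocks are used here because WORD blocks repeat the same text one word at a time, so including both would duplicate the content.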