
Commit f3ae99e

Merge #4 Optional self-supervised pre-training
Add code and optional notebook steps to further pre-train the base model on domain-specific documents before fine-tuning for entity recognition. From-scratch pre-training is not yet supported. Pre-training delivers only very marginal improvements on the sample dataset, but will hopefully be useful for users' custom projects. Closes #3
2 parents f3ce05f + 08c911e commit f3ae99e

20 files changed: +2317 −1456 lines

CUSTOMIZATION_GUIDE.md

+31
@@ -107,3 +107,34 @@ Once the features are enabled for your pipeline, you can edit the post-processin
 For example you could loop through the rows and cells of detected `TABLE`s in your document, using the SageMaker entity model results for each `WORD` to find the specific records and columns that you're interested in.
 
 If you need to change the output format for the post-processing Lambda function, note that the A2I human review template will likely also need to be updated.
+
+## Customizing the models
+
+### How much data do I need?
+
+As demonstrated in the credit card agreements example, the answer to this question is strongly affected by how "hard" your particular extraction problem is for the model. Particular factors that can make learning difficult include:
+
+- Relying on important context "hundreds of words away" or across page boundaries - since each model inference analyzes a limited-length subset of the words on an individual page.
+- Noisy or inconsistent annotations - disagreement or inconsistency between annotators is more common than you might initially expect, and this noise makes it difficult for the model to identify the right common patterns to learn.
+- High importance of numbers or other unusual tokens - language models like these aren't well suited to mathematical analysis, so sense checks like "these line items should all add up to the total" or "this number should be bigger than that one" that may make sense to humans will not be obvious to the model.
+
+A practical solution is to start small, and **run experiments with varying hold-out/validation set sizes**: Analyzing how performance changed with dataset size up to what you have today can give an indication of what incremental value you might get by continuing to collect and annotate more.
+
+Depending on your task and document diversity, you may be able to start getting an initial idea of what seems "easy" with as few as 20 annotated pages, but may need hundreds to get a more confident view of what's "hard".
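
For illustration, one way to run such an experiment is to fine-tune on progressively larger subsets of your annotated pages and compare validation scores. A minimal sketch follows - here `annotated_pages`, `validation_pages`, and the `train_and_evaluate()` helper are hypothetical placeholders, not part of this repository:

```python
import random

random.seed(1337)  # Make the subset sampling repeatable
random.shuffle(annotated_pages)  # Assumed: full list of annotated page records

for n_train in (20, 50, 100, 200):
    # Assumed helper: fine-tunes the model on the subset, returns validation F1
    f1 = train_and_evaluate(
        train_pages=annotated_pages[:n_train],
        validation_pages=validation_pages,
    )
    print(f"{n_train} training pages -> validation F1: {f1:.3f}")
```

If the score is still climbing steeply at your largest subset, further annotation is likely worthwhile; a plateau suggests diminishing returns.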
+
+### Should I pre-train to my domain, or just fine-tune?
+
+Because much more un-labelled domain data (Textracted documents) is usually available than labelled data (pages with manual annotations) for a use case, the compute infrastructure and/or time required for pre-training is usually significantly larger than for fine-tuning jobs.
+
+For this reason, it's typically best to **first experiment with fine-tuning the public base models** - before exploring what improvements might be achievable with domain-specific pre-training. It's also useful to understand the relative value of collecting more labelled data (see *How much data do I need?*), to compare against dedicating resources to pre-training.
+
+Unless your domain diverges strongly from the data the public model was trained on (for example transferring to a new language, or an area heavy with unusual grammar/jargon), accuracy improvements from continuation pre-training are likely to be small and incremental rather than revolutionary. For from-scratch pre-training (without a public model base) to deliver good results, you'll typically need a particularly large and diverse corpus - as was used to train BERT from scratch.
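
As a rough illustration of what continuation pre-training involves, a generic Hugging Face masked-language-modelling loop might look like the sketch below. This is not this sample's actual pre-training code: it assumes a plain text-only base model and an in-memory list `domain_texts` of raw page texts.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "bert-base-uncased"  # Placeholder base model name
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Tokenize the (assumed) list of raw domain text strings:
train_features = [tokenizer(t, truncation=True, max_length=512) for t in domain_texts]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-pretrain", num_train_epochs=1),
    train_dataset=train_features,
    # Randomly mask 15% of tokens, as in the original BERT pre-training:
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```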
+
+> ⚠️ **Note:** Some extra code modifications may be required to pre-train from scratch (for example to initialize a new tokenizer from the source data).
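
For example, a fresh tokenizer could be initialized from your source documents along these lines (a sketch using the Hugging Face `transformers` API; `domain_texts` is the same assumed list of raw text strings as above):

```python
from transformers import AutoTokenizer

# Re-use an existing tokenizer's algorithm and configuration...
old_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# ...but learn a fresh vocabulary from the (assumed) domain corpus:
new_tokenizer = old_tokenizer.train_new_from_iterator(
    iter(domain_texts),
    vocab_size=30522,  # Match BERT's original vocabulary size
)
new_tokenizer.save_pretrained("domain-tokenizer")
```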
+
+There may be other factors aside from accuracy that influence your decision to pre-train. Notably:
+
+- **Bias** - Large language models have been shown to learn not just grammar and semantics from datasets, but other patterns too: Meaning practitioners must consider the potential biases in source datasets, and what real-world harms could result if the model brings learned stereotypes to the target use case. For example, see [*"StereoSet: Measuring stereotypical bias in pretrained language models" (2020)*](https://arxiv.org/abs/2004.09456).
+- **Privacy** - In some circumstances (for example [*"Extracting Training Data from Large Language Models" (2020)*](https://arxiv.org/abs/2012.07805)), it has been shown that input data can be reverse-engineered from trained large language models. Protections against this may be required, depending on how your model is exposed to users.

README.md

+7-5
@@ -1,18 +1,20 @@
 # Post-Process OCR Results with Transformer Models on Amazon SageMaker
 
-[Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html) is a service that automatically extracts text, handwriting, and structured data from scanned documents: Going beyond simple optical character recognition (OCR) to identify and extract data from tables (with rows and cells), and forms (as key-value pairs).
+[Amazon Textract](https://docs.aws.amazon.com/textract/latest/dg/what-is.html) is a service that automatically extracts text, handwriting, and some structured data from scanned documents: Going beyond simple optical character recognition (OCR) to identify and extract data from tables (with rows and cells), and forms (as key-value pairs).
 
-In this sample we show how you can automate even highly complex and domain-specific structure extraction tasks by integrating Textract with trainable models on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) - for additional, customizable intelligence.
+In this "Amazon Textract Transformer Pipeline" sample and accompanying [blog post](https://aws.amazon.com/blogs/machine-learning/bring-structure-to-diverse-documents-with-amazon-textract-and-transformer-based-models-on-amazon-sagemaker/), we show how you can also automate more complex and domain-specific extraction tasks by integrating trainable models on [Amazon SageMaker](https://aws.amazon.com/sagemaker/).
 
-We demonstrate **layout-aware entity extraction** on an example use case in finance, for which you could also consider [using Amazon Comprehend](https://aws.amazon.com/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/). However, this pipeline provides a framework you could extend or customize for your own ML models and data.
+We demonstrate **layout-aware entity extraction** on an example use case in finance, for which you could also consider using [Amazon Comprehend's native document analysis feature](https://aws.amazon.com/blogs/machine-learning/extract-custom-entities-from-documents-in-their-native-format-with-amazon-comprehend/).
+
+However, this pipeline provides a framework you could further extend or customize for your own datasets and ML-based, OCR post-processing models.
 
 ## Background
 
 To automate document understanding for business processes, we typically need to extract and standardize specific attributes from input documents: For example vendor and line item details from purchase orders; or specific clauses within contracts.
 
-With Amazon Textract's [structure extraction utilities for forms and tables](https://aws.amazon.com/textract/features/), many of these requirements are trivial out of the box with no custom training required. For example: "pull out the text of the third column, second row of the first table on page 1", or "pull out what the customer wrote in for the `Email address:` section of the form".
+With Amazon Textract's [structure extraction utilities for forms and tables](https://aws.amazon.com/textract/features/), many of these requirements are trivial out of the box with no custom training required. For example: "pull out the text of the third column, second row of the first table on page 1", or "find what the customer wrote in for the `Email address:` section of the form".
 
-We can also use AI services like [Amazon Comprehend](https://aws.amazon.com/comprehend/) to analyze extracted text. For example: picking out `date` entities on a purchase order that may not be explicitly "labelled" in the text - perhaps because sometimes the date just appears by itself in the page header. However, often these services and models treat text as a flat 1D sequence of words.
+We can also use standard text processing models or AI services like [Amazon Comprehend](https://aws.amazon.com/comprehend/) to analyze extracted text. For example: picking out `date` entities on a purchase order that may not be explicitly "labelled" in the text - perhaps because sometimes the date just appears by itself in the page header. However, many standard approaches treat text as a flat 1D sequence of words.
 
 Since Textract also outputs the *positions* of each detected 'block' of text, we can even write advanced templating rules in our solution code. For example: "Find text matching XX/XX/XXXX within the top 5% of the page height".
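
For instance, a rule like that last one could be sketched with the [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser) roughly as follows (assuming `textract_json` holds an already-fetched Textract response dict):

```python
import re

from trp import Document  # amazon-textract-response-parser

DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

doc = Document(textract_json)  # Assumed: parsed Textract response JSON
for page in doc.pages:
    for line in page.lines:
        # Bounding box coordinates are page-relative, in the range 0-1:
        if line.geometry.boundingBox.top <= 0.05 and DATE_PATTERN.search(line.text):
            print("Candidate header date:", line.text)
```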

notebooks/1. Data Preparation.ipynb

+7-4
@@ -39,7 +39,7 @@
 "\n",
 "- [Amazon Textract Response Parser](https://github.com/aws-samples/amazon-textract-response-parser) is a helper for interpreting and navigating Amazon Textract result JSON.\n",
 "- The [SageMaker Studio Image Build CLI](https://github.com/aws-samples/sagemaker-studio-image-build-cli) is a tool for building customized container images and pushing to [Amazon ECR](https://aws.amazon.com/ecr/).\n",
-"- [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) (installed by default in SageMaker already) must be version >=2.49 for the Hugging Face container versions used in notebook 2."
+"- [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/) (installed by default in SageMaker already) must be version >=2.62 for the Hugging Face container versions used in notebook 2."
 ]
 },
 {
@@ -52,7 +52,7 @@
 "source": [
 "!pip install amazon-textract-response-parser \\\n",
 "    sagemaker-studio-image-build \\\n",
-"    \"sagemaker>=2.49,<3\""
+"    \"sagemaker>=2.62,<3\""
 ]
 },
 {
@@ -711,7 +711,7 @@
 "\n",
 "To process just a sample subset of files for speed in our demo, we'll create a **manifest file** listing just the documents we want.\n",
 "\n",
-"> ️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for manifests as used here, and separate information in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html) on the \"augmented\" manifests as used later with SageMaker Ground Truth."
+"> ️ **Note:** 'Non-augmented' manifest files are still JSON-based, but a different format from the other dataset manifests we'll be using through this sample. You can find guidance on the [S3DataSource API doc](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_S3DataSource.html) for manifests as used here, and separate information in the [Ground Truth documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-input-data-input-manifest.html) on the \"augmented\" manifests as used later with SageMaker Ground Truth."
 ]
 },
 {
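
To illustrate the two formats that note refers to (with placeholder S3 URIs and file names): an `S3DataSource` ManifestFile is a single JSON array, optionally starting with a common-prefix object, while a Ground Truth "augmented" manifest is JSON Lines with one object per record. A hedged sketch:

```python
import json

# 'Non-augmented' ManifestFile format (see the S3DataSource API doc):
s3datasource_manifest = [
    {"prefix": "s3://DOC-EXAMPLE-BUCKET/textract-results/"},
    "doc1.json",  # Keys relative to the prefix above
    "doc2.json",
]
with open("data/example.manifest", "w") as f:
    json.dump(s3datasource_manifest, f)

# 'Augmented' manifest format (SageMaker Ground Truth): one JSON object per line:
with open("data/example-augmented.manifest.jsonl", "w") as f:
    f.write(json.dumps({"source-ref": "s3://DOC-EXAMPLE-BUCKET/images/page1.png"}) + "\n")
```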
@@ -886,8 +886,11 @@
 "outputs": [],
 "source": [
 "textract_s3uris = list(filter(lambda s: isinstance(s, str), textract_results))\n",
-"print(f\"{len(textract_s3uris)} of {len(textract_results)} docs textracted successfully\")\n",
+"with open(\"data/textracted-all.manifest.jsonl\", \"w\") as f:\n",
+"    for uri in textract_s3uris:\n",
+"        f.write(json.dumps({\"textract-ref\": uri}) + \"\\n\")\n",
 "\n",
+"print(f\"{len(textract_s3uris)} of {len(textract_results)} docs textracted successfully\")\n",
 "if len(textract_s3uris) < len(textract_results):\n",
 "    raise ValueError(\n",
 "        \"Are you sure you want to continue? Consider re-trying to process the failed docs\"\n",
