 Parse
Parse a document with Tika and extract its content and metadata.
type: "io.kestra.plugin.tika.Parse"
Examples
Extract text from a file.
id: tika_parse
namespace: company.team
inputs:
  - id: file
    type: FILE
tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    extractEmbedded: true
    store: false
Extract text from an image using OCR.
id: tika_parse
namespace: company.team
inputs:
  - id: file
    type: FILE
tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: '{{ inputs.file }}'
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
    store: true
Download and extract image metadata using Apache Tika.
id: parse-image-metadata-using-apache-tika
namespace: company.team
tasks:
  - id: get_image
    type: io.kestra.plugin.core.http.Download
    uri: https://kestra.io/blogs/2023-05-31-beginner-guide-kestra.jpg
  - id: tika
    type: io.kestra.plugin.tika.Parse
    from: "{{ outputs.get_image.uri }}"
    store: false
    contentType: TEXT
    ocrOptions:
      strategy: OCR_AND_TEXT_EXTRACTION
Download a PDF file and extract text from it using Apache Tika.
id: parse-pdf
namespace: company.team
tasks:
  - id: download_pdf
    type: io.kestra.plugin.core.http.Download
    uri: https://huggingface.co/datasets/kestra/datasets/resolve/main/pdf/app_store.pdf
  - id: parse_text
    type: io.kestra.plugin.tika.Parse
    from: "{{ outputs.download_pdf.uri }}"
    contentType: TEXT
    store: false
  - id: log_extracted_text
    type: io.kestra.plugin.core.log.Log
    message: "{{ outputs.parse_text.result.content }}"
Properties
charactersLimit integer | string
Set maximum number of characters to include in the string, or -1 (default) to disable the write limit.
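As a minimal sketch of how this limit might be applied, the flow below caps the extracted text. The flow id and the value 10000 are illustrative assumptions, not recommended settings.

```yaml
id: tika_parse_limited        # hypothetical flow id
namespace: company.team
inputs:
  - id: file
    type: FILE
tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    contentType: TEXT
    charactersLimit: 10000    # arbitrary illustrative cap; -1 disables the write limit
    store: false
```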
contentType string
Default: XHTML
Possible values: TEXT, XHTML, XHTML_NO_HEADER
The content type of the extracted text
extractEmbedded boolean | string
Default: false
Set whether to extract the embedded document.
from string
The file to parse
Must be an internal storage URI.
Pebble expression referencing an Internal Storage URI e.g. {{ outputs.mytask.uri }}.
ocrOptions Non-dynamic Parse-OcrOptions
Default:
{
  "strategy": "NO_OCR"
}
Custom options for OCR processing
You need to install Tesseract to enable OCR processing.
store boolean | string
Default: true
Set whether to store the data from the query result into an Ion serialized data file in Kestra internal storage.
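A minimal sketch of using the stored file downstream, assuming the task's uri output is populated when store is true. The flow id, the log_uri task id, and the log message are hypothetical.

```yaml
id: tika_parse_stored         # hypothetical flow id
namespace: company.team
inputs:
  - id: file
    type: FILE
tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    store: true               # write the parsed data to internal storage
  - id: log_uri               # hypothetical follow-up task
    type: io.kestra.plugin.core.log.Log
    # Assumption: with store set to true, the uri output points to the stored file
    message: "Parsed output stored at {{ outputs.parse.uri }}"
```

With store set to false, the earlier examples read the text inline from outputs.<task id>.result.content instead.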
Outputs
result Parse-Parsed
uri string
Format: uri
Definitions
io.kestra.plugin.tika.Parse-OcrOptions
enableImagePreprocessing boolean | string
Whether to enable image preprocessing.
Apache Tika will run preprocessing of images (rotation detection and image normalizing with ImageMagick) before sending the image to Tesseract if the user has included dependencies (listed below) and if the user opts to include these preprocessing steps.
language string
Language used for OCR.
strategy string
Default: NO_OCR
Possible values: AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
OCR strategy to use for OCR processing.
You need to install Tesseract, along with the Tesseract language pack for the language you use, to enable OCR processing.
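The three OCR options can be combined in one task, as in the sketch below. The flow id is hypothetical, and the language value eng is an assumption based on the three-letter codes Tesseract uses for its language packs; the matching pack must be installed on the worker.

```yaml
id: tika_parse_ocr            # hypothetical flow id
namespace: company.team
inputs:
  - id: file
    type: FILE
tasks:
  - id: parse
    type: io.kestra.plugin.tika.Parse
    from: "{{ inputs.file }}"
    contentType: TEXT
    store: false
    ocrOptions:
      strategy: OCR_ONLY               # one of AUTO, NO_OCR, OCR_ONLY, OCR_AND_TEXT_EXTRACTION
      language: eng                    # assumed Tesseract language code
      enableImagePreprocessing: true   # needs the image preprocessing dependencies installed
```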
