4.3.4.7. PDF In – Babelway

Creating a "Message In" of type PDF

1-In the "Message In", select "PDF" for the "Your message is of type" field, as shown below.

2-For the "PDF Sample" field click on the "Choose File" button and upload your PDF sample file, as shown below.

3-For the "Template name", enter your template name; in this case it will be "Invoice_Template_1", as shown below.

Note: You can choose any template name, but it may contain only upper-case letters (A to Z), lower-case letters (a to z), digits (0 to 9), and the underscore character (_). No other characters are allowed.
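The naming rule in the note above can be checked before uploading a sample. The following is only a minimal sketch of such a check: the function name is hypothetical and not part of Babelway, and rejecting an empty name is an assumption.

```python
import re

# Allowed characters per the note above: A-Z, a-z, 0-9 and underscore.
# Requiring at least one character is an assumption of this sketch.
_TEMPLATE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_template_name(name: str) -> bool:
    """Return True if the name uses only the allowed characters."""
    return _TEMPLATE_NAME.fullmatch(name) is not None


print(is_valid_template_name("Invoice_Template_1"))
print(is_valid_template_name("Invoice Template 1"))
```

The explicit character class is used instead of \w because \w also accepts accented letters, which the note does not list.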

4-If you want to receive a notification email when a message fails with a validation error related to the PDF, enable "Trigger based on message definition validation errors" and enter in "Recipients" the email address that will receive the notification, as shown below.

Note: For "Trigger based on message definition validation errors" you can use one or more email addresses to receive the notification emails. To add more than one email address, click the + icon, as shown below.

After the "Message In" is created, you can update its configuration later from "Properties" when needed.

5-The template is now created in the templates section. The "Message In" structure is empty for now because it is defined later, in the "Extracting Fields" section, as shown below.

Note: You can define multiple templates that match the same message definition. This is very helpful when more than one PDF layout shares the same message definition: by creating a template for each PDF layout, you can process all of them in one channel.

Note For Processing New Templates:

If a PDF is processed but no template is found, you can add it directly as a new PDF template.
If a PDF is processed and a template is found, you can add the PDF as a sample of that PDF template.

PDF Template Settings

The template matching conditions indicate which static elements identify the PDF file. This allows the system to link an input PDF message to its PDF template.

Note: These conditions act as the ID of the document. If an incoming PDF message matches the ID, the message is processed by the corresponding template, based on the set of rules defined in the template matching conditions.

For example, you can use one or more of the following: company logo, company address, document type, etc. Every element used to identify the PDF file must be present in all input PDF messages for the system to be able to identify them. If, for example, you use the company logo together with the document type, both elements must be present in every processed PDF input message.
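The rule that every identifying element must be present can be pictured as a subset test. This is a conceptual sketch only, not Babelway's implementation: the Template class, the find_template function, and the element strings such as "logo:acme" are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name: str
    required_elements: frozenset  # static elements that identify the PDF


def find_template(elements_in_pdf, templates):
    """Return the first template whose required elements are all present."""
    found = set(elements_in_pdf)
    for template in templates:
        # All rules must match: one missing element rejects the template.
        if template.required_elements <= found:
            return template
    return None


templates = [
    Template("Invoice_Template_1", frozenset({"logo:acme", "label:INVOICE"})),
]
match = find_template(["logo:acme", "label:INVOICE", "text:101"], templates)
print(match.name if match else "No template found")
```

A PDF that carries the logo but not the "INVOICE" label returns None in this sketch, which corresponds to the "No template found" situation described later in this page.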

1-At the bottom of the "Message In" click on the "Edit PDF template" button to begin editing the template, as shown below.

2-Click on the gear icon to open the "PDF Template Settings" pop-up page, as shown below.

Note:

- The section below lists all of the template names. In this case, there is only one template, named "Invoice_Template_1", as shown below.

3-In this case, the system has automatically selected the company logo to identify this PDF template, as shown below.

Note: You can remove the automatically created rule if that suits your logic, and then add the rules you need.

4-To add more elements, click the "Add Rule" button, as shown below.

5-Select a unique element in this PDF to identify it. In this case, we will use the "INVOICE" label, as shown below.

6-Click the "Confirm" button to save these changes, as shown below.

How to process a PDF input message that contains multiple documents (multiple invoices in this example)?

1-From the "Message In" click on the "Edit PDF template" button, as shown below.

2- Click on the gear icon to open the "PDF Template Settings" pop-up page, as shown below.

3-In the "PDF Template Settings", for "Can this pdf include several documents of the same template?" click "Yes" to inform the system that this PDF input message contains several documents, as shown below.

Note: The system will split the input message and process each part as a separate PDF file. Every page that satisfies the condition is considered the start of a new PDF part.

4-Click the "Confirm" button to save these changes, as shown below.
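The splitting rule in the note above can be sketched as follows. This is an illustration under assumptions, not the actual algorithm: pages are modelled as plain strings, starts_new_part is a hypothetical stand-in for the template matching condition, and pages that precede the first matching page are assumed to form a leading part of their own.

```python
def split_into_parts(pages, starts_new_part):
    """Group pages into parts; a matching page opens a new part."""
    parts = []
    for page in pages:
        if starts_new_part(page) or not parts:
            parts.append([page])      # this page starts a new PDF part
        else:
            parts[-1].append(page)    # continuation of the current part
    return parts


pages = ["INVOICE 101", "continued", "INVOICE 102", "INVOICE 103", "continued"]
for part in split_into_parts(pages, lambda p: p.startswith("INVOICE")):
    print(part)
```

With the sample above, the five pages are grouped into three parts, one per invoice.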

Extracting Fields

To extract fields from the PDF template, choose one of the following options based on your need. Standalone extraction has two variants, which gives three extraction methods in total.

1-Standalone (fixed or relative to a label/image)

2-Within a table

Below is an explanation of each option:

1-Standalone

(Fixed)

The value always starts at the same position within the PDF template (exact same coordinates each time). The value can then vary in length as long as the start position is consistent.

(Relative to a label/image)

The position of this value can vary, but it is always close to a label (field name) or image. For example, the value 12/08/2017 is always to the right of the label 'Date'.

In fact, we only need to find the exact text of the label; from there we look to the right, left, above, or below, depending on how the rule is set, to find the value, as shown below.

The field which has the value "101" is relative to the label "INVOICE #:".

Note:

As an advanced feature, the label can also be identified by a text condition: Text equals to, Text starts with, Text ends with, Text contains, or Text matches regex.
Babelway uses the standard Java regular expression syntax, the same as everywhere else in the system.
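The five text conditions can be pictured with the sketch below. It is written in Python for illustration, whereas Babelway uses Java regular expressions, so patterns relying on Java-specific syntax may behave differently. The function name and mode identifiers are hypothetical, and treating the regex as a match on the whole label text is an assumption.

```python
import re


def label_matches(mode: str, pattern: str, text: str) -> bool:
    """Check a piece of PDF text against one of the five conditions."""
    if mode == "equals":
        return text == pattern
    if mode == "starts_with":
        return text.startswith(pattern)
    if mode == "ends_with":
        return text.endswith(pattern)
    if mode == "contains":
        return pattern in text
    if mode == "regex":
        # Assumption: the whole text must match the expression.
        return re.fullmatch(pattern, text) is not None
    raise ValueError(f"unknown mode: {mode}")


print(label_matches("starts_with", "INVOICE", "INVOICE #:"))
print(label_matches("regex", r"INVOICE\s*#:", "INVOICE #:"))
```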

2- Within a table

This extracts the line values from a table. The table can either be displayed on a single page or be divided across multiple pages.

The steps below describe how to use the "Fixed" extraction method.

A-To extract one line, click it to select it. To extract more than one line, click and drag the mouse around the text you want to extract, as shown below.

B-From the pop-up window "How to find field value?" click on "Standalone", as shown below.

C-The "Extracted Value" now shows the data extracted from the field, "6, Rue Louis de Geer 1348 L", as shown below.

D-You can rename the field in the bar beside "details of extractor for", as shown below.

Note: When you finish, click the "Confirm" button to save these changes.

E-The "Field name" and the "Field value" are now extracted, as shown below.

The steps below describe how to use the "Relative to a label/image" extraction method.

A-Click the field; the "How to find this field value?" pop-up will be displayed, as shown below.

B-Select "Standalone".

C-Then select the relative label, in this case "INVOICE #:", as shown below.

D-In this case the PDF contains "INVOICE #:" twice, so we need to select a strategy, as shown below.

E-For "Strategy when we find multiple labels or images :" select, in this case, "Pick first", as shown below.

Note:

Pick first: This strategy means we will always pick the first key encountered on the PDF.
Pick last: This strategy means we will always pick the last key encountered on the PDF.
Surrounding field: Choose a label/image that is close to the key you want to pick.
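The three strategies can be sketched as a choice among the positions where the label was found. This is an illustration only: the function and strategy identifiers are hypothetical, occurrences are assumed to be listed in reading order, and interpreting "Surrounding field" as "the occurrence nearest to the chosen label/image" is an assumption.

```python
def pick_label(occurrences, strategy, near=None):
    """occurrences: list of (x, y) label positions in reading order."""
    if not occurrences:
        return None
    if strategy == "pick_first":
        return occurrences[0]
    if strategy == "pick_last":
        return occurrences[-1]
    if strategy == "surrounding_field":
        if near is None:
            raise ValueError("surrounding_field needs a reference position")
        # Assumption: choose the occurrence closest to the reference.
        return min(
            occurrences,
            key=lambda p: (p[0] - near[0]) ** 2 + (p[1] - near[1]) ** 2,
        )
    raise ValueError(f"unknown strategy: {strategy}")


found = [(50, 40), (50, 400)]  # two "INVOICE #:" labels on the page
print(pick_label(found, "pick_first"))
print(pick_label(found, "surrounding_field", near=(60, 390)))
```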

F- The "Extracted fields" now shows the field name, in this case "field2", and the value "101", as shown below.

G- You can rename the field by clicking it and then renaming it, as illustrated in the previous section (Standalone/Fixed).

H- The "Field name" and the "Field value" are now extracted, as shown below.

Some enhancements applied for the PDF (V2)

1- Adapt the multiple-labels strategy when the user changes the relativeTo field:

The editor chooses the 'Fax:' label and automatically adapts the strategy to pickLast, so that it points to the right 'Fax:' and not the first one on the page. Up to this point, everything works as expected:

But then the user changes the label to 'GLN', at the left of the field. The offset is correctly updated, but the strategy stays pickLast. As the GLN label is also present in multiple places on the page, this leads to the wrong one being selected:

If the user manually changes the strategy to pickFirst, everything works, but the strategy should have been adapted automatically.

2- It should not be possible to save invalid field names (e.g. with spaces):

Previously, if you edited a pdfTemplate and changed the name of a field to something containing a space, the name was accepted, but saving did not work.

Now such a name is no longer accepted, and a proper error message is produced, as shown:

When adding a space.

A proper error message appears.

The steps below describe how to use the "Within a table" extraction method.

A-Click the field; the "How to find this field value?" pop-up will be displayed.

B-Select "Within a table", as shown below.

C-Select the start of the table, as shown below.

D-For "START OF LINE ITEMS", choose "Relative to a label/image", then in this case select "Quantity", as shown below.

E-Select the end of the table, as shown below.

F-For "End of LINE Items?", choose "Relative to a label/image", then in this case select "Subtotal", as shown below.

G-Select the start of the table on next pages, as shown below.

H-For "MULTI-PAGES TABLE - START PAGE 2?", choose "Fixed position", as shown below.

I-Select the end of the table on the first page, as shown below.

J-For "MULTI_PAGES TABLE- END PAGE1", choose "Fixed position".

K-Select the item that the system will use to determine the line delimiter, as shown below.

L-For "alignment?", choose "Align Left", as shown below.

M-Select all of the items you want to extract from the table, as shown below.

N-In this example, four fields are created, from "Item1" to "Item4", as shown below.

O-Rename the fields so that they are clearer for the mapping, as shown below.

P-Change the table name, for example to "Lines", as shown below.

Q-When we open the "Message In", we see that the structure has been created, as shown below.

Note: When hovering over an extracted item in a table, all the items extracted from this table are highlighted in the PDF, and the item you are hovering over is highlighted more strongly.

Monitoring message details options available for PDF "Message In"

1-When a PDF input message is processed by a channel that uses the PDF "Message In", no template is found for this PDF, the message is in one of the statuses ERROR, ERROR_CLOSED or WAIT_FOR_HUMAN_INTERACTION, and the error is related to "No template found", the user can click the "Save Template" button to create a new template for this PDF "Message In" from this PDF input message and then start the extraction definition process, as shown below.

Note: The name of the template will be the PDF file name without the ".PDF" extension, provided that the file name is valid, that is, it contains only alphanumeric characters and underscores. If the file name is not valid, the message key is used instead, and the user can change it afterwards using the rename functionality.
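The naming rule in the note above can be sketched as follows. This is a hypothetical helper, not Babelway code: treating the extension case-insensitively and limiting "alphanumeric" to ASCII letters and digits are assumptions, and the message key value is made up for the example.

```python
import re


def default_template_name(file_name: str, message_key: str) -> str:
    """Derive the template name from the file name, else use the key."""
    # Assumption: ".pdf" and ".PDF" are both stripped.
    if file_name.lower().endswith(".pdf"):
        stem = file_name[:-4]
    else:
        stem = file_name
    # Valid names contain only letters, digits and underscores.
    if re.fullmatch(r"[A-Za-z0-9_]+", stem):
        return stem
    return message_key


print(default_template_name("Invoice_2017.PDF", "12345"))
print(default_template_name("Invoice 2017.pdf", "12345"))
```

The first call keeps the file name without its extension; the second falls back to the message key because of the space in the file name.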

2-When a PDF input message is processed by a channel that uses the PDF "Message In" and a template is found for this PDF, the user can save this input PDF message as a sample for this template by using the "Save Template" button, as shown below.

Note: The name of the template will be the PDF file name.

Row limits / Auto-detect lines

The system tries to calculate the limits of the rows automatically. This algorithm is based on detecting the lines drawn between the rows, and will not work if there are no such lines, as shown below.

Fields to extract / Smart column detection

The system will automatically find the zone for the selected column, as shown below.

The extraction mode (positional or column-based) was also evaluated at the end of the process when moving or resizing the positional zone.

To reproduce this, you had to resize the zone until it was small enough to be fully included in the cell.

PDF text extraction settings

The merge characters tolerance defines how much space is tolerated between characters before the text is split. It is useful when the gaps between columns are barely wider than the spaces between words in a single column, as shown below.

The merge characters tolerance is a number from 0.01 to 2.00.
It is expressed relative to the space width and sets the gap needed to consider two characters to be split.
For example, 1.00 means we look at most 1.00 * spaceWidth after character A to find the next character. If we find it, we merge; otherwise, we finish the text extraction here.
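The effect of this setting can be sketched as below. It is a simplified model, not the actual extraction code: characters are represented as hypothetical (text, x_start, x_end) tuples on a single line, sorted left to right, and the gap is assumed to be measured from the end of one character to the start of the next.

```python
def merge_characters(chars, space_width, tolerance):
    """Merge characters into text chunks; split when the gap is too wide."""
    chunks = []
    current = ""
    prev_end = None
    for text, x_start, x_end in chars:
        gap_too_wide = (
            prev_end is not None
            and x_start - prev_end > tolerance * space_width
        )
        if gap_too_wide:
            chunks.append(current)  # finish the current text here
            current = ""
        current += text
        prev_end = x_end
    if current:
        chunks.append(current)
    return chunks


chars = [("A", 0, 5), ("B", 6, 11), ("C", 20, 25)]
print(merge_characters(chars, space_width=5, tolerance=1.0))
print(merge_characters(chars, space_width=5, tolerance=2.0))
```

With a tolerance of 1.0 the gap of 9 units before "C" exceeds 5 units, so "C" becomes its own chunk, for example a separate column. With 2.0 the same gap is within 10 units and all three characters are merged.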

The following properties are available:

Template name	Babelway allows you to define multiple templates that will match the same message definition. Please choose the name of the first template you will create.
Trigger based on message definition validation errors	If you check this, whenever extracted fields cause validation errors, you are notified by e-mail and can fix the message.
Recipients	List of emails where the error notification is sent.
Split texts when double space	This will split texts in two when a double space is detected.
Add missing space characters	Sometimes, PDFs contain texts that are separated by position only and don't contain the actual space ' ' character. This option adds those space characters.
Missing space tolerance	Number from 0.01 to 2.00. This is the proportion of a space width that the difference between two letters must not exceed to generate a space character. For example, 0.7 means: we create a space between A and B if posX(A) + spaceWidth - posX(B) is less than spaceWidth * 0.7. The actual space width can vary a lot from PDF to PDF, so the right value has to be found by experiment.
Merge characters tolerance	Number from 0.01 to 2.00. This is a percentage of space width that is needed to consider two characters to be split. For example, 1.00 means we look for maximum 1.00 * spaceWidth after character A to find the next character. If we find it, we merge, otherwise, we finish the text extraction here.
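The "Missing space tolerance" condition from the table above can be written out directly. The sketch takes the stated inequality literally; what posX refers to exactly (for instance the end of character A and the start of character B) is not specified in this page, so the parameter meanings below are an assumption, and the function name is hypothetical.

```python
def needs_space(pos_x_a, pos_x_b, space_width, tolerance):
    """Apply: posX(A) + spaceWidth - posX(B) < spaceWidth * tolerance."""
    return pos_x_a + space_width - pos_x_b < space_width * tolerance


# With a space width of 5 and a tolerance of 0.7, the threshold is 3.5.
print(needs_space(10, 14, space_width=5, tolerance=0.7))  # 10 + 5 - 14 = 1
print(needs_space(10, 11, space_width=5, tolerance=0.7))  # 10 + 5 - 11 = 4
```

In the first call the left-hand side is 1, which is below the threshold, so a space is generated. In the second call it is 4, so the two letters are kept together.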

Note: The maximum file name length supported for the PDF "Message In" is 255 characters.