Find common text

Released under a Common Sense License.

Description

Let's say you have 10 files (which contain various tables) with hundreds of names of people, and you are interested only in the names that occur in more than one file. To get a report on those names, select all the files that contain the tables and then use this function.

The active file is used as a "pivot file", so its choice is very important. This file is compared with all the other files, so if the name you're looking for is not in this file, it will not be reported. Admittedly, it is rather difficult to choose the pivot file when you don't know what name you're looking for; as a rule, the pivot file should be the file that contains the highest number of names.

The final report will contain all names that are found in more than one of the selected files. The names are sorted and displayed in the order of their rank (names that occur in the most files are displayed at the top of the report).

Each file is broken into paragraphs. From each paragraph, words are extracted according to the options and a multiple filter is created with them. Each filter from the pivot file is then matched against each filter from all the other files.
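
The comparison described above can be sketched roughly as follows. This is only an illustration, not the tool's actual implementation: it assumes one paragraph per line, treats a filter as the list of upper-cased words in a paragraph, and counts only exact word matches. The names build_filters and compare, and the minimum word length of 3, are assumptions made for this example.

```python
import re

MIN_WORD_LEN = 3  # assumption: shorter words are ignored


def build_filters(text):
    """Return one filter (a list of upper-cased words) per non-empty paragraph."""
    filters = []
    for paragraph in text.splitlines():
        words = [w.upper() for w in re.findall(r"[A-Za-z]+", paragraph)
                 if len(w) >= MIN_WORD_LEN]
        if words:
            filters.append(words)
    return filters


def compare(pivot_text, other_text, min_matches=1):
    """Match every pivot filter against every filter of the other file."""
    other_filters = build_filters(other_text)
    results = []
    for pivot_filter in build_filters(pivot_text):
        for other_filter in other_filters:
            # Words of the pivot filter that also occur in the other filter.
            shared = [w for w in pivot_filter if w in other_filter]
            if len(shared) >= min_matches:
                results.append((pivot_filter, other_filter, shared))
    return results
```

Because every pivot filter is tried against every other filter, a name that is missing from the pivot text can never appear in the results, which is why the choice of the pivot file matters.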

Reverse matching

Let's say you have the name "vonBrown" in the pivot file, and "Brown" in another file. Because "vonBrown" is a superset of "Brown", it cannot normally be found in "Brown". Reverse matching ensures that the match is found in this direction as well.

Reverse matching also ensures that the highest rank is used: the maximum of the direct and the reverse ranks.
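
As a rough sketch of the idea, the function below tries the match in both directions and keeps the higher rank. The scoring formula (the percentage of the longer word covered by the shorter one) is purely hypothetical; the document does not say how the tool computes its ranks, and the name match_rank is an assumption as well.

```python
def match_rank(pivot_word, other_word):
    """Return the higher of the direct and reverse ranks (0 means no match)."""

    def rank(needle, haystack):
        # Hypothetical scoring: percentage of the haystack covered by the needle.
        if needle in haystack:
            return 100 * len(needle) // len(haystack)
        return 0

    direct = rank(pivot_word, other_word)    # pivot word found inside the other word
    reverse = rank(other_word, pivot_word)   # other word found inside the pivot word
    return max(direct, reverse)
```

With this sketch, match_rank("VONBROWN", "BROWN") and match_rank("BROWN", "VONBROWN") give the same non-zero result, so the match no longer depends on which of the two files holds the longer form of the name.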

Example

Let's say that the first file (the pivot file), named "A.txt", contains:

----------------------------------------------------------------------------

----------------------------------------------------------------------------

| 001 | Jonh | Michael | Doe | Male | Jupiter | 742-5361 |

----------------------------------------------------------------------------

----------------------------------------------------------------------------

----------------------------------------------------------------------------

| 004 | Helen | - | Yu | Female | Pluto | 810-5361 |

----------------------------------------------------------------------------

Let's say that the second file, named "B.txt", contains:

----------------------------------------------------------------------------

----------------------------------------------------------------------------

| 001 | John | - | Doe | Male | Jupiter | 742-5361 |

----------------------------------------------------------------------------

| 002 | Adrian | - | Gainer | Male | Saturn | 590-5361 |

----------------------------------------------------------------------------

----------------------------------------------------------------------------

| 004 | Helen | - | Yu | Female | Pluto | 810-5361 |

----------------------------------------------------------------------------

Let's say you want to see which names occur in both files. Use the following options:

Check "Use delimiter", and set the delimiter to "|".
Set "Begin extract" to 2. If the tables did not have "|" before the record number, you would have to set this to 1.
Set "End extract" to 5 (or 4, if there is no "|" before the record number).
Set "Whole match" to "Right". This ensures that "Brown" - "vonBrown" are detected as a match.
If you want to detect "John" - "Jonh" as a match, set "Misspelling" to 2. Make sure it's 2, not 0 or 1, because "Jonh" differs from "John" in 2 letters ("n" and "h"); differences are counted by the letters' positions, not by the letters themselves. However, setting this option will generate many "false" matches that are of no interest to you, like "John" - "vonBrown" or "John" - "Doe". Also note that "Helen Yu" can't be used as such, but only as "Helen", since "Yu" is too short to be used.
Set "Minimum matches" to the value you want, normally 1.
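
The delimiter options and the misspelling count can be illustrated with the sketch below. It is a simplified reading of the options above, not the tool's code: "Begin extract" and "End extract" are taken as delimiter positions counted from 1, the minimum word length of 3 is an assumption (the text only says that "Yu" is too short while "Doe" is used), and position_differences handles only words of equal length. Both function names are made up for the example.

```python
import re


def extract_words(line, delimiter="|", begin=2, end=5, min_len=3):
    """Return the upper-cased words found between two delimiter positions."""
    # After splitting, parts[begin:end] is the text that follows the
    # begin-th delimiter and precedes the end-th one (counting from 1).
    fields = line.split(delimiter)[begin:end]
    words = re.findall(r"[A-Za-z]+", " ".join(fields))
    # Assumption: words shorter than min_len are too short to be used.
    return [w.upper() for w in words if len(w) >= min_len]


def position_differences(a, b):
    """Count the positions at which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


row = "| 001 | Jonh | Michael | Doe | Male | Jupiter | 742-5361 |"
print(extract_words(row))                    # ['JONH', 'MICHAEL', 'DOE']
print(position_differences("JOHN", "JONH"))  # 2
```

For a row without the leading "|", the same call would use begin=1 and end=4, which mirrors the advice given for "Begin extract" and "End extract".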

If you set "Misspelling" to 2 and "Minimum matches" to 2, the report would contain:

==================================================

Source file: A.txt

Match count: 3

Elapsed time (seconds): 0

==================================================

Match index: 1

Source filter: "JONH; MICHAEL; DOE"

Match count: 2

Rank: 1090

--------------------------------------------------

Partial match index: 1

Against file: "B.txt"

Against filter: "JOHN; DOE"

Matched filter: "JONH -> JOHN; JONH <- DOE; DOE -> JOHN; DOE"

Rank: 670

--------------------------------------------------

Partial match index: 2

Against file: "B.txt"

Against filter: "MICHELLE; FAITH; VONBROWN"

Matched filter: "JONH -> VONBROWN; MICHAEL -> MICHELLE; DOE -> VONBROWN"

Rank: 420

==================================================

Match index: 2

Source filter: "FIRST; NAME; MIDDLE; LAST"

Match count: 1

Rank: 980

--------------------------------------------------

Partial match index: 1

Against file: "B.txt"

Against filter: "FIRST; NAME; MIDDLE; LAST"

Matched filter: "FIRST; FIRST <- LAST; NAME; MIDDLE; LAST -> FIRST; LAST"

Rank: 980

==================================================

Match index: 3

Source filter: "JOANNA; FAITH; BROWN"

Match count: 2

Rank: 675

--------------------------------------------------

Partial match index: 1

Against file: "B.txt"

Against filter: "JOHN; DOE"

Matched filter: "JOANNA <- JOHN; JOANNA <- DOE; BROWN <- JOHN"

Rank: 375

--------------------------------------------------

Partial match index: 2

Against file: "B.txt"

Against filter: "MICHELLE; FAITH; VONBROWN"

Matched filter: "FAITH; BROWN -> VONBROWN"

Rank: 300

==================================================

As you can see, "John Doe" is detected (along with an uninteresting partial match, "vonBrown"). The second match is not important because it refers to the headers of the tables. The third match detects "Faith Brown". "Helen Yu" is not detected, since "Yu" is too short. So "John Doe" and "Faith Brown" are the names reported as common to both files.

The report would be much easier to read if "Misspelling" were 0 (but "John Doe" would not be detected in that case).

Speed

The processing time for two files, each with one thousand paragraphs, is a few tens of seconds on a 1 GHz processor. Each additional file with one thousand paragraphs adds one more multiple of that base time: the time doubles with one new file, triples with two new files, quadruples with three new files, and so on. The same happens for each additional one thousand names added to any of the files.
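
The scaling described above is what one would expect if the cost is proportional to the number of paragraph pairs compared (every pivot paragraph against every paragraph of every other file). The small model below makes that assumption explicit; it is an estimate of relative cost only, and relative_cost is a name invented for this example.

```python
def relative_cost(pivot_paragraphs, other_files_paragraphs):
    """Number of paragraph pairs compared, under the all-pairs assumption."""
    return pivot_paragraphs * sum(other_files_paragraphs)


# Base case from the text: a pivot file and one other file, 1000 paragraphs each.
base = relative_cost(1000, [1000])

print(relative_cost(1000, [1000, 1000]) / base)        # 2.0 (one new file)
print(relative_cost(1000, [1000, 1000, 1000]) / base)  # 3.0 (two new files)
print(relative_cost(2000, [1000]) / base)              # 2.0 (1000 more names in the pivot)
```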