# Text difference algorithm

I need an algorithm that can compare two text files and highlight their difference and ( even better!) can compute their difference in a meaningful way (like two similar files should have a similarity score higher than two dissimilar files, with the word "similar" defined in the normal terms). It sounds easy to implement, but it's not.

The implementation can be in c# or python.

Thanks.

I can recommend to take a look at Neil Fraser's code and articles:

Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.

Neil Fraser: Diff Strategies - for theory and implementation notes

Answered by: Rafael937 | Posted: 01-03-2022

In Python, there is difflib, as also others have suggested.

`difflib` offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:

``````def text_compare(text1, text2, isjunk=None):
return difflib.SequenceMatcher(isjunk, text1, text2).ratio()
``````

Answered by: Walter267 | Posted: 01-03-2022

Look at difflib. (Python)

That will calculate the diffs in various formats. You could then use the size of the context diff as a measure of how different two documents are?

Answered by: Walter823 | Posted: 01-03-2022

My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers "middle-snake" method with the Hirschberg linear space refinement.

The Myers algorithm is described in:

E. Myers, ``An O(ND) Difference Algorithm and Its Variations,''
Algorithmica 1, 2 (1986), 251-266.

The GNU diff utility uses the Myers algorithm.

The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.

Note that a number of people have cited the Levenshtein distance algorithm but that is, albeit easy to implement, not the optimal solution as it is inefficient (requires the use of a possibly huge n*m matrix) and does not provide the "edit script" which is the sequence of edits that could be used to transform one sequence into the other and vice versa.

For a good Myers / Hirschberg implementation look at:

http://www.ioplex.com/~miallen/libmba/dl/src/diff.c

The particular library that it is contained within is no longer maintained but to my knowledge the diff.c module itself is still correct.

Mike

Answered by: Arthur239 | Posted: 01-03-2022

Bazaar contains an alternative difference algorithm, called patience diff (there's more info in the comments on that page) which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the bazaar distribution is a simple command line front end.

If you need a finer granularity than lines, you can use Levenshtein distance. Levenshtein distance is a straight-forward measure on how to similar two texts are.
You can also use it to extract the edit logs and can a very fine-grained diff, similar to that on the edit history pages of SO. Be warned though that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib,as Douglas Leder suggested, is most likely going to be faster.

Answered by: Maria860 | Posted: 01-03-2022

There are a number of distance metrics, as paradoja mentioned there is the Levenshtein distance, but there is also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first, it implements a bunch of algorithms.

Answered by: John305 | Posted: 01-03-2022

As stated, use difflib. Once you have the diffed output, you may find the Levenshtein distance of the different strings as to give a "value" of how different they are.

Answered by: Julian318 | Posted: 01-03-2022

You could use the solution to the Longest Common Subsequence (LCS) problem. See also the discussion about possible ways to optimize this solution.

Answered by: Kellan980 | Posted: 01-03-2022

One method I've employed for a different functionality, to calculate how much data was new in a modified file, could perhaps work for you as well.

I have a diff/patch implementation C# that allows me to take two files, presumably old and new version of the same file, and calculate the "difference", but not in the usual sense of the word. Basically I calculate a set of operations that I can perform on the old version to update it to have the same contents as the new version.

To use this for the functionality initially described, to see how much data was new, I simple ran through the operations, and for every operation that copied from the old file verbatim, that had a 0-factor, and every operation that inserted new text (distributed as part of the patch, since it didn't occur in the old file) had a 1-factor. All characters was given this factory, which gave me basically a long list of 0's and 1's.

All I then had to do was to tally up the 0's and 1's. In your case, with my implementation, a low number of 1's compared to 0's would mean the files are very similar.

This implementation would also handle cases where the modified file had inserted copies from the old file out of order, or even duplicates (ie. you copy a part from the start of the file and paste it near the bottom), since they would both be copies of the same original part from the old file.

I experimented with weighing copies, so that the first copy counted as 0, and subsequent copies of the same characters had progressively higher factors, in order to give a copy/paste operation some "new-factor", but I never finished it as the project was scrapped.

If you're interested, my diff/patch code is available from my Subversion repository.

Answered by: Aida200 | Posted: 01-03-2022

Take a look at the Fuzzy module. It has fast (written in C) based algorithms for soundex, NYSIIS and double-metaphone.

A good introduction can be found at: http://www.informit.com/articles/article.aspx?p=1848528

Answered by: Lenny465 | Posted: 01-03-2022

# Similar questions

algorithm - Average difference between dates in Python

I have a series of datetime objects and would like to calculate the average delta between them. For example, if the input was (2008-10-01 12:15:00, 2008-10-01 12:25:00, 2008-10-01 12:35:00), then the average delta would be exactly 00:10:00, or 10 minutes. Any suggestions on how to calculate this using Python?

python - Algorithm to calculate percent difference between two blobs of text

Closed. This question does not meet Stack Overflow guid...

Algorithm for getting time difference in Python

I want to ask you about getting the time difference in Python. Suppose my starting time is 11:30am and ending at 1:00pm. The difference would be 1:00pm-11:30am = 90 minutes. I am new to python. Are there existing algorithms to solve these kinds of problems? I need then for my scheduling algorithm problem. Your response would be greatly appreciated. :)

python - Algorithm to find the least difference between lists

I have been trying to understand the algorithm used here to compare two lists, implemented in this commit. The intention, as I understood, is to find the least amount of changes to create

algorithm - Minimum subarray difference in python

Consider I have a non-empty array of integers: A0..An. And consider a parameter P where 0 &lt; P &lt;=n. I need to find a minimum absolute difference between left and right subarray splited by P. For example: A = 3 A = 1 A = 2 A = 4 A = 3 P = 1, difference = |3 − 10| = 7 P = 2, difference = |4 − 9| = 5 P = 3, difference = |6 − 7| = 1 P = 4, difference =...

python - Algorithm to find indices of two numbers in Series with difference within a specified range

I have a list of numbers (sorted in increasing order, no repeats) and I'd like to write an algorithm that will quickly return the indices of two elements in the list whose difference is within a specified range. This could be multiple pairs of indices, but my intention is to adjust make the range small enough to capture only one or two, then do some decisioning based off of the indices. I'm currently usin...

algorithm - python Code running time difference

I was practicing on Leetcode, I want to ask 3 questions about the codes running time. I noticed that on Leetcode, even the same code will have quite different running time if submitted multiple times. And the difference is huge, is that normal? I have seen the difference as first times beats 26%, but second times beats 51%. That really is confusing me, while I am trying to figure out where I am and how goo...

python - Create an algorithm to collect points who difference is less than a value with a given condition

Hi I have list of tuples as given below: [(2031, 0.11078125), (2032, 0.11131274999999999), (1298, 0.11819950000000001), (2033, 0.12396399999999999), (2030, 0.13113425), (1238, 0.13305375), (2886, 0.1332045), (2704, 0.13512125000000003), (1299, 0.13639975), (1308, 0.136436), (957, 0.13736575), (2698, 0.1374555), (2768, 0.137864), (1297, 0.13829075000000002), (1959, 0.13873375), (1958, 0.138735), (2767, 0.138...

python - What is the algorithm for two intervals difference?

I'm making a program in Python where I have 2 intervals as an input and then it has to return 3 operations: - union (A | B) - intersection (A &amp; B) - difference I've tried making sets for intervals A and B and then subtract them with set method (A-B) but it has no sense as it has to return an interval, not a set and I distinguish between open / closed / open_closed etc intervals. input: [3, 10) (5, 16]...

Powerset algorithm in Python: difference between + and append on lists

I’m working through the powerset problem in Python. The powerset P(S) of a set S is the set of all subsets of S. For example, if S = {a, b, c} then P(s) = {{}, {a}, {b}, {c}, {a,b}, {a, c}, {b, c}, {a, b, c}}. This solution works just fine: def powerset(array): powerset = [[]] for num in array: for i in range(len(...

datetime - What is the simplest way to find the difference between 2 times in python?

I have 2 time values which have the type datetime.time. I want to find their difference. The obvious thing to do is t1 - t2, but this doesn't work. It works for objects of type datetime.datetime but not for datetime.time. So what is the best way to do this?

algorithm - Average difference between dates in Python

I have a series of datetime objects and would like to calculate the average delta between them. For example, if the input was (2008-10-01 12:15:00, 2008-10-01 12:25:00, 2008-10-01 12:35:00), then the average delta would be exactly 00:10:00, or 10 minutes. Any suggestions on how to calculate this using Python?

Is there any difference between "string" and 'string' in Python?

What is the difference between Ruby and Python versions of"self"?

I've done some Python but have just now starting to use Ruby I could use a good explanation of the difference between "self" in these two languages. Obvious on first glance: Self is not a keyword in Python, but there is a "self-like" value no matter what you call it. Python methods receive self as an explicit argument, whereas Ruby does not. Ruby sometimes has methods explicitly d...

python - What's the difference between all of the os.popen() methods?

I was looking at the Python documentation and saw that there are 4-5 different versions of popen(), e.g. os.popen(), os.popen2(), etc. Apart from the fact that some include stderr while others don't, what are the differences between them and when would you use each one? The documentation didn't really explain it very wel...

datetime - Python's timedelta: can't I just get in whatever time unit I want the value of the entire difference?

I am trying to have some clever dates since a post has been made on my site ("seconds since, hours since, weeks since, etc..") and I'm using datetime.timedelta difference between utcnow and utc dated stored in the database for a post. Looks like, according to the docs, I have to use the days attribute AND the seconds attribute, to get the fancy date strings I want. Can't I just get in whatever time unit I ...

Difference between the use of double quote and quotes in python

python - How to tell the difference between an iterator and an iterable?

In Python the interface of an iterable is a subset of the iterator interface. This has the advantage that in many cases they can be treated in the same way. However, there is an important semantic difference between the two, since for an iterable __iter__ returns a new iterator object and not just self. How ...

Difference between class (Python) and struct (C)

I'm new to python. I've studied C and I noticed that that the C structure (struct) seemed to have the same task as "class" in python. So what is, conceptually, the difference?

syntax - What's the difference between "2*2" and "2**2" in Python?

What is the difference between the following statements? Statement 1: var=2**2*3 Statement 2: var2=2*2*3 I see no difference. This raises the following question. Why is Statement 1 used if we can use Statement 2?