difflib.SequenceMatcher isjunk parameter: how to ignore whitespace, tabs, and empty lines?

I am trying to use difflib.SequenceMatcher to compute the similarity between two files. The files are almost identical, except that one contains extra whitespace and empty lines and the other doesn't. I am trying to use

s = difflib.SequenceMatcher(isjunk, text1, text2)
ratio = s.ratio()

for this purpose.

So the question is: how should I write the lambda expression for isjunk so that SequenceMatcher discounts all whitespace, empty lines, etc.? I tried passing lambda x: x==" ", but the result isn't great. For two closely similar texts, the ratio is very low, which is highly counterintuitive.

For testing purposes, here are the two strings you can use:

What Motivates jwovu to do your Job Well? OK, this is an entry trying to win $100 worth of software development books despite the fact that I don‘t read

programming books. In order to win the prize you have to write an entry and
what motivatesfggmum to do your job well. Hence this post. First motivation

money. I know, this doesn‘t sound like a great inspiration to many, and saying that money is one of the motivation factors might just blow my chances away.

As if money is a taboo in programming world. I know there are people who can‘t be motivated by money. Mme, on the other hand, am living in a real world,

with house mortgage to pay, myself to feed and bills to cover. So I can‘t really exclude money from my consideration. If I can get a large sum of money for

doing a good job, then definitely boost my morale. I won‘t care whether I am using an old workstation, or forced to share rooms or cubicle with other

people, or have to put up with an annoying boss, or whatever. The fact that at the end of the day I will walk off with a large pile of money itself is enough

for me to overcome all the obstacles, put up with all the hard feelings and hurt egos, tolerate a slow computer and even endure

And here's the other string:

What Motivates You to do your Job Well? OK, this is an entry trying to win $100 worth of software development books, despite the fact that I don't read programming books. In order to win the prize you have to write an entry and describes what motivates you to do your job well. Hence this post.

First motivation, money. I know, this doesn't sound like a great inspiration to many, and saying that money is one of the motivation factors might just blow my chances away. As if money is a taboo in programming world. I know there are people who can't be motivated by money. Kudos to them. Me, on the other hand, am living in a real world, with house mortgage to pay, myself to feed and bills to cover. So I can't really exclude money from my consideration.

If I can get a large sum of money for doing a good job, then thatwill definitely boost my morale. I won't care whether I am using an old workstation, or forced to share rooms or cubicle with other people, or have to put up with an annoying boss, or whatever. The fact that at the end of the day I will walk off with a large pile of money itself is enough for me to overcome all the obstacles, put up with all the hard feelings and hurt egos, tolerate a slow computer and even endure

I ran the above command with isjunk set to lambda x: x==" ", and the ratio is only 0.36.


Asked by: Arthur972 | Posted: 01-10-2021






Answer 1

If you treat all whitespace characters as junk, the similarity is better:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()
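A self-contained version of the call above might look like the sketch below. The helper name ratio_ignoring_whitespace is made up for illustration. One thing worth keeping in mind: ratio() is computed as 2*M/T, where T is the total length of both sequences, so characters marked as junk still count toward T. The tiny "AA" vs "A A" case shows that marking the space as junk does not change the score there.

```python
import difflib

def ratio_ignoring_whitespace(a, b):
    """Similarity ratio with spaces, tabs and newlines in b treated as junk.

    Hypothetical helper name; it only wraps the call shown above.
    """
    # The predicate receives one element (here: one character) at a time.
    is_whitespace = lambda ch: ch in " \t\n"
    return difflib.SequenceMatcher(is_whitespace, a, b).ratio()

# Junk characters are not used as anchors for matches, but they are
# still part of the total length that ratio() divides by.
plain = difflib.SequenceMatcher(None, "AA", "A A").ratio()
junked = ratio_ignoring_whitespace("AA", "A A")
print(plain, junked)  # both are 0.8: 2 * 2 matches / 5 characters
```

So isjunk changes how matching blocks are found; it does not strip the whitespace out of the comparison.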

However, difflib is not ideal for this problem: the two documents are nearly identical, but typos and similar noise produce differences for difflib that a human reader would barely notice.

Try reading up on tf-idf, Bayesian probability, vector space models, and w-shingling.

I have written an implementation of tf-idf that applies it to a vector space and uses the dot product as a distance measure to classify documents.
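Of the techniques listed, w-shingling is probably the quickest to try. The following is only a minimal sketch, not the tf-idf implementation mentioned above: the function names shingles and resemblance, the default w=3, and the choice to split on whitespace are all assumptions made for illustration. Because the text is split into words first, differences in spaces, tabs, and blank lines disappear before the comparison.

```python
def shingles(text, w=3):
    """Return the set of w-word shingles (as tuples) found in text."""
    words = text.split()  # split() with no argument drops all whitespace runs
    if len(words) < w:
        # Short texts: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=3):
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)

print(resemblance("money is one of\n\nthe factors", "money is one of the factors"))
```

A typo only disturbs the few shingles that contain the affected word, so the score degrades gradually rather than collapsing.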

Answered by: Madaline813 | Posted: 02-11-2021



Answer 2

Using your sample strings:

>>> s=difflib.SequenceMatcher(lambda x: x == '\n', s1, s2)
>>> s.ratio()
0.94669848846459825

Interestingly, if ' ' is also included as junk, the ratio drops:

>>> s=difflib.SequenceMatcher(lambda x: x in ' \n', s1, s2)
>>> s.ratio()
0.7653142402545744

It looks like the newlines are having a much greater effect than the spaces.
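To explore this on your own texts, a small comparison loop like the one below might help. It is a sketch: the function name compare_junk_choices and the stand-in strings are invented, so substitute the real s1 and s2. It also toggles autojunk, because SequenceMatcher has a documented heuristic that, for sequences of 200 or more items, automatically treats items accounting for more than 1% of the sequence as junk. Whether that heuristic explains the numbers above is a guess, not something I have verified.

```python
import difflib

def compare_junk_choices(a, b):
    """Return ratios for several junk predicates, with and without autojunk."""
    predicates = {
        "none": None,
        "newline": lambda ch: ch == "\n",
        "space+newline": lambda ch: ch in " \n",
    }
    results = {}
    for name, isjunk in predicates.items():
        for autojunk in (True, False):
            matcher = difflib.SequenceMatcher(isjunk, a, b, autojunk=autojunk)
            results[(name, autojunk)] = matcher.ratio()
    return results

# Hypothetical stand-ins for s1 and s2; long enough to trigger autojunk.
s1 = "First motivation\n\nmoney. I know, this doesn't sound great.\n" * 5
s2 = "First motivation, money. I know, this doesn't sound great. " * 5

for (name, autojunk), ratio in sorted(compare_junk_choices(s1, s2).items()):
    print(f"{name:14} autojunk={autojunk!s:5} ratio={ratio:.3f}")
```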

Answered by: Kellan521 | Posted: 02-11-2021



Answer 3

Given the texts above, the test is indeed as suggested:

difflib.SequenceMatcher(lambda x: x in " \t\n", doc1, doc2).ratio()

However, to speed things up a little, you can take advantage of CPython's method-wrappers:

difflib.SequenceMatcher(" \t\n".__contains__, doc1, doc2).ratio()

This avoids the Python-level function call that a lambda incurs each time the predicate is invoked.
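A quick way to check that the two forms are interchangeable, and to see how often the predicate actually runs, is sketched below. The sample strings and the counting_isjunk wrapper are made up for illustration. On the CPython versions I am aware of, the predicate seems to be evaluated once per distinct element of the second sequence rather than once per character, so the saving is likely modest; treat that as an observation to verify on your interpreter, not a guarantee.

```python
import difflib

WHITESPACE = " \t\n"
doc1 = "First motivation\n\nmoney. I know,\tthis doesn't sound great."
doc2 = "First motivation, money. I know, this doesn't sound great."

# Both predicates answer the same question, so the ratios must agree.
ratio_lambda = difflib.SequenceMatcher(lambda x: x in WHITESPACE, doc1, doc2).ratio()
ratio_method = difflib.SequenceMatcher(WHITESPACE.__contains__, doc1, doc2).ratio()
print(ratio_lambda == ratio_method)

# Record every element the predicate is asked about.
calls = []

def counting_isjunk(ch):
    calls.append(ch)
    return ch in WHITESPACE

difflib.SequenceMatcher(counting_isjunk, doc1, doc2).ratio()
print(len(calls), "predicate calls for", len(doc2), "characters in doc2")
```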

Answered by: Sarah835 | Posted: 02-11-2021



Answer 4

I haven't used difflib.SequenceMatcher, but have you considered pre-processing the files to remove all blank lines and whitespace (perhaps via regular expressions) and then doing the compare?
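A minimal sketch of that pre-processing idea follows. The helper names normalize_whitespace and normalized_ratio are invented, and collapsing each whitespace run to a single space (rather than deleting it outright) is my own choice so that word boundaries survive; replace " " with "" in the substitution if you really want all whitespace gone.

```python
import difflib
import re

def normalize_whitespace(text):
    """Collapse every run of spaces, tabs and newlines into one space."""
    return re.sub(r"\s+", " ", text).strip()

def normalized_ratio(a, b):
    """Compare two texts after whitespace normalization, with no junk predicate."""
    return difflib.SequenceMatcher(
        None, normalize_whitespace(a), normalize_whitespace(b)
    ).ratio()

a = "with house mortgage to pay,\n\n\tmyself to feed"
b = "with house mortgage   to pay, myself to feed"
print(normalized_ratio(a, b))  # 1.0, the texts differ only in whitespace
```

Because the whitespace is removed before matching, it no longer counts toward the total length used by ratio(), which isjunk alone does not achieve.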

Answered by: Miranda278 | Posted: 02-11-2021







