Is there a more Pythonic way to merge two HTML header rows with colspans?

I am using BeautifulSoup in Python to parse some HTML. One of the problems I am dealing with is that I have situations where the colspans are different across header rows. (Header rows are the rows that need to be combined to get the column headings in my jargon) That is one column may span a number of columns above or below it and the words need to be appended or prepended based on the spanning. Below is a routine to do this. I use BeautifulSoup to pull the colspans and to pull the contents of each cell in each row. longHeader is the contents of the header row with the most items, spanLong is a list with the colspans of each item in the row. This works but it is not looking very Pythonic.

Alos-it is not going to work if the diff is <0, I can fix that with the same approach I used to get this to work. But before I do I wonder if anyone can quickly look at this and suggest a more Pythonic approach. I am a long time SAS programmer and so I struggle to break the mold-well I will write code as if I am writing a SAS macro.

longHeader=['','','bananas','','','','','','','','','','trains','','planes','','','','']
shortHeader=['','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
combinedHeader=[]
sumSpanLong=0
sumSpanShort=0
spanDiff=0
longHeaderCount=0

for each in range(len(shortHeader)):
    sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
    sumSpanShort=sumSpanShort+spanShort[each]
    spanDiff=sumSpanShort-sumSpanLong
    if spanDiff==0:
        combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
        longHeaderCount=longHeaderCount+1
        continue
    for i in range(0,spanDiff):
            combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
            longHeaderCount=longHeaderCount+1
            sumSpanLong=sumSpanLong+spanLong[longHeaderCount]
            spanDiff=sumSpanShort-sumSpanLong
            if spanDiff==0:
                combinedHeader.append([longHeader[longHeaderCount]+' '+shortHeader[each]])
                longHeaderCount=longHeaderCount+1
                break

print combinedHeader


Asked by: Carlos360 | Posted: 28-01-2022






Answer 1

Here is a modified version of your algorithm. zip is used to iterate over short lengths and headers and a class object is used to count and iterate the long items, as well as combine the headers. while is more appropriate for the inner loop. (forgive the too short names).

class collector(object):
    def __init__(self, header):
        self.longHeader = header
        self.combinedHeader = []
        self.longHeaderCount = 0
    def combine(self, shortValue):
        self.combinedHeader.append(
            [self.longHeader[self.longHeaderCount]+' '+shortValue] )
        self.longHeaderCount += 1
        return self.longHeaderCount

def main():
    longHeader = [ 
       '','','bananas','','','','','','','','','','trains','','planes','','','','']
    shortHeader = [
    '','','bunches','','cars','','trucks','','freight','','cargo','','all other','','']
    spanShort=[1,1,3,1,3,1,3,1,3,1,3,1,3,1,3]
    spanLong=[1,1,3,1,1,1,1,1,1,1,1,1,3,1,3,1,3,1,3]
    sumSpanLong=0
    sumSpanShort=0

    combiner = collector(longHeader)
    for sLen,sHead in zip(spanShort,shortHeader):
        sumSpanLong += spanLong[combiner.longHeaderCount]
        sumSpanShort += sLen
        while sumSpanShort - sumSpanLong > 0:
            combiner.combine(sHead)
            sumSpanLong += spanLong[combiner.longHeaderCount]
        combiner.combine(sHead)

    return combiner.combinedHeader

Answered by: Hailey613 | Posted: 01-03-2022



Answer 2

You've actually got a lot going on in this example.

  1. You've "over-processed" the Beautiful Soup Tag objects to make lists. Leave them as Tags.

  2. All of these kinds of merge algorithms are hard. It helps to treat the two things being merged symmetrically.

Here's a version that should work directly with the Beautiful Soup Tag objects. Also, this version doesn't assume anything about the lengths of the two rows.

def merge3( row1, row2 ):
    i1= 0
    i2= 0
    result= []
    while i1 != len(row1) or i2 != len(row2):
        if i1 == len(row1):
            result.append( ' '.join(row1[i1].contents) )
            i2 += 1
        elif i2 == len(row2):
            result.append( ' '.join(row2[i2].contents) )
            i1 += 1
        else:
            if row1[i1]['colspan'] < row2[i2]['colspan']:
                # Fill extra cols from row1
                c1= row1[i1]['colspan']
                while c1 != row2[i2]['colspan']:
                    result.append( ' '.join(row2[i2].contents) )
                    c1 += 1
            elif row1[i1]['colspan'] > row2[i2]['colspan']:
                # Fill extra cols from row2
                c2= row2[i2]['colspan']
                while row1[i1]['colspan'] != c2:
                    result.append( ' '.join(row1[i1].contents) )
                    c2 += 1
            else:
                assert row1[i1]['colspan'] == row2[i2]['colspan']
                pass
            txt1= ' '.join(row1[i1].contents)
            txt2= ' '.join(row2[i2].contents)
            result.append( txt1 + " " + txt2 )
            i1 += 1
            i2 += 1
    return result

Answered by: Kellan742 | Posted: 01-03-2022



Answer 3

Maybe look at the zip function for parts of the problem:

>>> execfile('so_ques.py')
[[' '], [' '], ['bananas bunches'], [' '], [' cars'], [' cars'], [' cars'], [' '], [' trucks'], [' trucks'], [' trucks'], [' '], ['trains freight'], [' '], ['planes cargo'], [' '], [' all other'], [' '], [' ']]

>>> zip(long_header, short_header)
[('', ''), ('', ''), ('bananas', 'bunches'), ('', ''), ('', 'cars'), ('', ''), ('', 'trucks'), ('', ''), ('', 'freight'), ('', ''), ('', 'cargo'), ('', ''), ('trains', 'all other'), ('', ''), ('planes', '')]
>>> 

enumerate can help avoid some of the complex indexing with counters:

>>> diff_list = []
>>> for place, header in enumerate(short_header):
    diff_list.append(abs(span_short[place] - span_long[place]))

>>> for place, num in enumerate(diff_list):
    if num:
        new_shortlist.extend(short_header[place] for item in range(num+1))
    else:
        new_shortlist.append(short_header[place])


>>> new_shortlist
['', '', 'bunches', '', 'cars', 'cars', 'cars', '', 'trucks', 'trucks', 'trucks', '',... 
>>> z = zip(new_shortlist, long_header)
>>> z
[('', ''), ('', ''), ('bunches', 'bananas'), ('', ''), ('cars', ''), ('cars', ''), ('cars', '')...

Also more pythonic naming may add clarity:

    for each in range(len(short_header)):
        sum_span_long += span_long[long_header_count]
        sum_span_short += span_short[each]
        span_diff = sum_span_short - sum_span_long
        if not span_diff:
            combined_header.append...

Answered by: Ada587 | Posted: 01-03-2022



Answer 4

I guess I am going to answer my own question but I did receive a lot of help. Thanks for all of the help. I made S.LOTT's answer work after a few small corrections. (They may be so small as to not be visible (inside joke)). So now the question is why is this more Pythonic? I think I see that it is less denser / works with the raw inputs instead of derivations / I cannot judge if it is easier to read ---> though it is easy to read

S.LOTT's Answer Corrected

row1=headerCells[0]
row2=headerCells[1]

i1= 0
i2= 0
result= []
while i1 != len(row1) or i2 != len(row2):
    if i1 == len(row1):
        result.append( ' '.join(row1[i1]) )
        i2 += 1
    elif i2 == len(row2):
        result.append( ' '.join(row2[i2]) )
        i1 += 1
    else:
        if int(row1[i1].get("colspan","1")) < int(row2[i2].get("colspan","1")):
            c1= int(row1[i1].get("colspan","1"))
            while c1 != int(row2[i2].get("colspan","1")): 
                txt1= ' '.join(row1[i1])  # needed to add when working adjust opposing case
                txt2= ' '.join(row2[i2])     # needed to add  when working adjust opposing case
                result.append( txt1 + " " + txt2 )  # needed to add when working adjust opposing case
                print 'stayed in middle', 'i1=',i1,'i2=',i2, ' c1=',c1
                c1 += 1
                i1 += 1    # Is this the problem it
           
        elif int(row1[i1].get("colspan","1"))> int(row2[i2].get("colspan","1")):
                # Fill extra cols from row2  Make same adjustment as above
            c2= int(row2[i2].get("colspan","1"))
            while int(row1[i1].get("colspan","1")) != c2:
                result.append( ' '.join(row1[i1]) )
                c2 += 1
                i2 += 1
        else:
            assert int(row1[i1].get("colspan","1")) == int(row2[i2].get("colspan","1"))
            pass
       
    
        txt1= ' '.join(row1[i1])
        txt2= ' '.join(row2[i2])
        result.append( txt1 + " " + txt2 )
        print 'went to bottom', 'i1=',i1,'i2=',i2
        i1 += 1
        i2 += 1
print result

Answered by: Kellan349 | Posted: 01-03-2022



Answer 5

Well I have an answer now. I was thinking through this and decided that I needed to use parts of every answer. I still need to figure out if I want a class or a function. But I have the algorithm that I think is probably more Pythonic than any of the others. But, it borrows heavily from the answers that some very generous people provided. I appreciate those a lot because I have learned quite a bit.

To save the time of having to make test cases I am going to paste the the complete code I have been banging away with in IDLE and follow that with an HTML sample file. Other than making a decision about class/function (and I need to think about how I am using this code in my program) I would be happy to see any improvements that make the code more Pythonic.

from BeautifulSoup import BeautifulSoup

original=file(r"C:\testheaders.htm").read()

soupOriginal=BeautifulSoup(original)
all_Rows=soupOriginal.findAll('tr')


header_Rows=[]
for each in range(len(all_Rows)):
    header_Rows.append(all_Rows[each])


header_Cells=[]
for each in header_Rows:
    header_Cells.append(each.findAll('td'))

temp_Header_Row=[]
header=[]
for row in range(len(header_Cells)):
    for column in range(len(header_Cells[row])):
        x=int(header_Cells[row][column].get("colspan","1"))
        if x==1:
            temp_Header_Row.append( ' '.join(header_Cells[row][column]) )

        else:
            for item in range(x):

                temp_Header_Row.append( ''.join(header_Cells[row][column]) )

    header.append(temp_Header_Row)
temp_Header_Row=[]
combined_Header=zip(*header)

for each in combined_Header:
    print each

Okay test file contents are below Sorry I tried to attach these but couldn't make it happen:

  <TABLE style="font-size: 10pt" cellspacing="0" border="0" cellpadding="0" width="100%">
  <TR valign="bottom">
  <TD width="40%">&nbsp;</TD>
  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>

  <TD width="5%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="3%">&nbsp;</TD>
  <TD width="1%">&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">FOODS WE LIKE</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">&nbsp;</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="CENTER" colspan="6">SILLY STUFF</TD>

  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">OTHER THAN</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="CENTER" colspan="6">FAVORITE PEOPLE</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">MONTY PYTHON</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">CHERRYPY</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">APPLE PIE</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">MOTHERS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">FATHERS</TD>
  <TD>&nbsp;</TD>
  </TR>
  <TR style="font-size: 10pt" valign="bottom">
  <TD nowrap align="left">Name</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">SHOWS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">PROGRAMS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">BANANAS</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">PERFUME</TD>
  <TD>&nbsp;</TD>
  <TD>&nbsp;</TD>
  <TD nowrap align="right" colspan="2">TOOLS</TD>
  <TD>&nbsp;</TD>
  </TR>
  </TABLE>

Answered by: Clark347 | Posted: 01-03-2022



Similar questions

python - Is it pythonic for a function to return multiple values?

In python, you can have a function return multiple values. Here's a contrived example: def divide(x, y): quotient = x/y remainder = x % y return quotient, remainder (q, r) = divide(22, 7) This seems very useful, but it looks like it can also be abused ("Well..function X already computes what we need as an intermediate value. Let's have X return that value also"). W...


python - A pythonic way to insert a space before capital letters

I've got a file whose format I'm altering via a python script. I have several camel cased strings in this file where I just want to insert a single space before the capital letter - so "WordWordWord" becomes "Word Word Word". My limited regex experience just stalled out on me - can someone think of a decent regex to do this, or (better yet) is there a more pythonic way to do this that I'm missing?


python - What is the pythonic way to share common files in multiple projects?

Lets say I have projects x and y in brother directories: projects/x and projects/y. There are some utility funcs common to both projects in myutils.py and some db stuff in mydbstuff.py, etc. Those are minor common goodies, so I don't want to create a single package for them. Questions arise about the whereabouts of such files, possible changes to PYTHONPATH, proper way to import, etc. What is th...


python - Pythonic ways to use 'else' in a for loop

This question already has answers here:


python - pythonic way to compare compound classes?

I have a class that acts as an item in a tree: class CItem( list ): pass I have two trees, each with CItem as root, each tree item has some dict members (like item._test = 1). Now i need to compare this trees. I can suggest to overload a comparison operator for CItem: class CItem( list ): def __eq__( self, other ): # first compare items as lists if not list.__eq...


python - Pythonic URL Parsing

There are a number of questions about how to parse a URL in Python, this question is about the best or most Pythonic way to do it. In my parsing I need 4 parts: the network location, the first part of the URL, the path and the filename and querystring parts. http://www.somesite.com/base/first/secon...


list - Pythonic way to get some rows of a matrix

I was thinking about a code that I wrote a few years ago in Python, at some point it had to get just some elements, by index, of a list of lists. I remember I did something like this: def getRows(m, row_indices): tmp = [] for i in row_indices: tmp.append(m[i]) return tmp Now that I've learnt a little bit more since then, I'd use a list comprehension like this:


python - What is the Pythonic way to write this loop?

for jr in json_reports: jr['time_created'] = str(jr['time_created'])


python - How do you make this code more pythonic?

Could you guys please tell me how I can make the following code more pythonic? The code is correct. Full disclosure - it's problem 1b in Handout #4 of this machine learning course. I'm supposed to use newton's algorithm on the two data sets for fitting a logistic hypothesis. But they use matlab &amp; I'm using scipy ...


python - Pythonic Swap of 2 lists elements

I found that I have to perform a swap in python and I write something like this: arr[first], arr[second] = arr[second], arr[first] I suppose this is not so pythonic. Does somebody know how to do a swap in python more elegant? EDIT: I think another example will show my doubts: self.memberlist[someindexA], self.memberlist[someindexB] = self.memberlist[som...






Still can't find your answer? Check out these communities...



PySlackers | Full Stack Python | NHS Python | Pythonist Cafe | Hacker Earth | Discord Python



top