Beautiful Soup is Beautiful
15 Dec 2011
Earlier, I wrote a post about having to populate a table with contact information for the network administrators of secure VLANs. I had done some housekeeping on our crusty database schema using SQLAlchemy to accommodate a couple of new tables. Now it was time to populate the new firewall contact information table. A lot of the contact information could be found on an intranet site in a nice, neat, and, most importantly, scrapable table. Previously, I had used either plain regexes or lxml to scrape web pages, but now it was time to play it smart and have a slurp of some Beautiful Soup, a Python HTML parser built for screen-scraping.
I had heard about Beautiful Soup before, but I had never gotten around to actually playing with it. I had been trying out lxml but was put off by its confusing documentation and its focus on XML rather than HTML. Boy, am I glad I tried Beautiful Soup. Its API makes parsing HTML feel as easy as navigating the DOM in JavaScript.
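If you haven’t tried it, here’s a tiny taste of what I mean (the HTML below is made up for illustration):
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x
html = "<table><tr><td><a href='/one'>one</a></td><td>two</td></tr></table>"
soup = BeautifulSoup(html)
# attribute-style traversal, much like chaining firstChild in the DOM
print soup.table.tr.td.a['href']  # /one
# findAll is the workhorse, akin to getElementsByTagName
print [td.findAll(text=True)[0] for td in soup.findAll('td')]  # [u'one', u'two']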
Obviously, I first had to make a request to the intranet site, which required authentication. Then I had to throw that HTML response into the stew so I could serve some delicious soup:
import urllib2
import base64
from BeautifulSoup import BeautifulSoup

opener = urllib2.build_opener()
request = urllib2.Request("http://intranet.net.oregonstate.edu/private/firewall-contexts.php")
# authenticate with a basic auth header
base64string = base64.encodestring('username:password')[:-1]  # strip trailing newline
request.add_header("Authorization", "Basic %s" % base64string)
# request the page and parse the response into a document tree
response = opener.open(request)
html = response.read()
soup = BeautifulSoup(html)
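As an aside, hand-rolling the Authorization header was just the quickest thing at the time; urllib2 can also manage basic auth for you, if you’d rather skip the base64 incantation. Something like this sketch should be equivalent:
# alternative: let urllib2 handle basic auth itself
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "http://intranet.net.oregonstate.edu/private/",
                          "username", "password")
auth_handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(auth_handler)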
Here’s where it gets even easier. I simply needed to extract information from a table, so I grabbed the table, grabbed its rows, and iterated through them. What was funny was that this old page actually had two <html> elements in it.
# get rows from the content table
content = soup.findAll('html')[1]  # page has two html elements
table = content.table
rows = table.findAll('tr')[1:]  # skip the table header row
for row in rows:
    firewall_contact = {}
    tds = row.findAll('td')
    # parse firewall context name
    context = tds[0].findAll(text=True)[0]
    # parse description
    description = tds[1].findAll(text=True)[0]
    # parse administrators
    try:
        administrators = tds[2].findAll(text=True)[0].split(',')
    except IndexError:
        administrators = []  # no administrators listed for this row
What was also funny was that some of the names in the table weren’t given names. I needed the exact name so I could query for their contact information in LDAP, so I wrote a silly hard-coded function that translated names like Andy into Andrew.
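I didn’t keep that function around to show here, but it amounted to a lookup table along these lines (the entries and the alt fallback below are reconstructed from memory, so treat them as hypothetical):
# hypothetical reconstruction: a hard-coded nickname-to-given-name table,
# where alt=True falls back to a second guess
NICKNAMES = {
    'Andy': ('Andrew', 'Anders'),
    'Mike': ('Michael', 'Micah'),
    'Rob': ('Robert', 'Robin'),
}

def full(name, alt=False):
    if name in NICKNAMES:
        return NICKNAMES[name][1] if alt else NICKNAMES[name][0]
    return name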
# look up administrator contact information in LDAP
info = None
for administrator in administrators:
    name = full(administrator.strip()).split(' ')
    if len(name) == 2:
        # try the name as-is, then with nickname translations
        info = ldap_search(first_name=name[0], last_name=name[1])
        if not info:
            info = ldap_search(first_name=full(name[0]), last_name=name[1])
        if not info:
            info = ldap_search(first_name=full(name[0], alt=True), last_name=name[1])
        if not info:
            continue
        break
    else:
        continue
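The ldap_search() helper deserves its own post, but with the python-ldap package it boils down to a filtered subtree search. Roughly, with a placeholder server and base DN:
import ldap

def ldap_search(first_name, last_name):
    # search the directory for an exact first/last name match
    conn = ldap.initialize('ldap://directory.example.edu')  # placeholder host
    conn.simple_bind_s()  # anonymous bind
    filterstr = '(&(givenName=%s)(sn=%s))' % (first_name, last_name)
    results = conn.search_s('dc=example,dc=edu', ldap.SCOPE_SUBTREE, filterstr)
    return results[0][1] if results else None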
Then I just parsed the rest, threw everything into a list of dictionaries, and returned it for SQLAlchemy to have its way with. Nothing like alchemizing some Beautiful Soup in a bubbling cauldron on a cold evening.
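One thing the snippets don’t show is vlan_regex. It was just a compiled pattern with alternate capture groups, which is why the loop below walks the group tuples that findall returns; a hypothetical reconstruction:
import re

# alternate capture groups: either "vlan 123" style text or a bare numeric id
vlan_regex = re.compile(r'vlan\s*(\d{1,4})|\b(\d{1,4})\b', re.IGNORECASE)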
# parse vlan ids out of the fourth column
vlans = []
vlan_texts = tds[3].findAll(text=True)
for vlan_text in vlan_texts:
    try:
        matches = vlan_regex.findall(vlan_text)[0]
        for match in matches:
            if match:
                vlans.append(match)
    except IndexError:
        pass  # no vlan id in this bit of text
# add a firewall contact for each vlan
for vlan in vlans:
    firewall_contact = {}
    firewall_contact['context'] = context
    firewall_contact['description'] = description
    firewall_contact['vlan_id'] = vlan
    get_ldap_info(info, firewall_contact)
    if 'name' not in firewall_contact:
        firewall_contact['name'] = administrators[0]
    firewall_contacts.append(firewall_contact)
return firewall_contacts
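On the SQLAlchemy side, a list of dictionaries is exactly what an executemany-style insert wants. Assuming the engine and the firewall contacts Table object from the schema housekeeping in my earlier post (the names below are my guess), the last mile is one call:
# bulk insert the scraped rows; firewall_contacts_table is the Table object
# set up during the earlier schema work (both names assumed here)
contacts = get_firewall_contacts()
engine.execute(firewall_contacts_table.insert(), contacts)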
If I ever have to screen-scrape flat HTML pages again, Beautiful Soup will be my go-to parser.