An Introduction to BeautifulSoup

written by Rob McBroom on Monday December 22, 2014

Few things are less fun than parsing text, even when that text is supposed to be formatted according to certain rules (like HTML). We all know the web is full of badly written markup, so the effort required to reliably extract data from it is daunting.

Save yourself a few months of work, and just use BeautifulSoup.

For a simple real-world example of its power, let’s say we have a GUI application that should display a list of links, with icons and titles, from the HTML source of any arbitrary page you give it.

First, some setup:

from os import path
from bs4 import BeautifulSoup


# a place to store the links we find
links = []

For this example, we’ll assume you’ve gotten the HTML source from https://www.python.org/ and stuffed it in a variable called page somehow. We start by turning it into beautiful soup.

# name the parser explicitly; 'html.parser' ships with Python
soup = BeautifulSoup(page, 'html.parser')
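
To follow along without fetching anything, here is a minimal sketch that parses a small inline document. The sample markup is made up for illustration (it is not the real python.org source), and 'html.parser' is simply the parser that ships with Python; other parsers may build slightly different trees from broken markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page source.
page = """
<html><body>
  <a href="/about/">About</a>
  <a href="#content">Skip to content</a>
  <a name="top">No href here</a>
  <a href="/downloads/"><img src="/static/img/download.png" alt="Download"></a>
</body></html>
"""

soup = BeautifulSoup(page, 'html.parser')

# soup.a is shorthand for the first <a> tag in the document
first_href = soup.a['href']
anchor_count = len(soup.find_all('a'))
print(first_href, anchor_count)
```

The same `page` string works as a test fixture for the steps that follow.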

With that, we can very easily iterate over all the links on the page. In this case, I’ll define “link” to be any <a> tag that has an href attribute set.

for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
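
As a quick check of what that filter keeps, here is a small sketch run against made-up markup. Passing href=True matches any <a> that has the attribute at all, including an empty one, which is why the explicit test for '' is still needed.

```python
from bs4 import BeautifulSoup

# Made-up markup covering each case the loop has to deal with.
html = ('<a href="/about/">About</a>'
        '<a href="#top">Top</a>'
        '<a href="">Empty</a>'
        '<a name="x">Anchor only</a>')
soup = BeautifulSoup(html, 'html.parser')

matched = [a.get_text() for a in soup.find_all('a', href=True)]

kept = []
for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
    kept.append(link['href'])

print(matched)  # ['About', 'Top', 'Empty']
print(kept)     # ['/about/']
```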

Our results will be a list of tuples with three pieces of information about each link: the URL, the title, and an image. The last two are optional, and might not be present, so we start out with a (mutable) dictionary.

thisLink = {
    'url': link['href'],
    'title': link.string,
    'image': '',
}
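
It helps to know when link.string comes back as None, since that is what triggers the fallbacks later on. The sketch below uses made-up markup to show the three common cases: plain text, mixed content, and an image with no text.

```python
from bs4 import BeautifulSoup

html = ('<a href="/a">Plain text</a>'
        '<a href="/b"><b>Bold</b> and more</a>'
        '<a href="/c"><img src="/logo.png"></a>')
soup = BeautifulSoup(html, 'html.parser')

# .string only returns text when the tag has exactly one child
# that is (or ultimately contains) a single string.
strings = [a.string for a in soup.find_all('a')]
print(strings)  # ['Plain text', None, None]
```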

If the <a> tag surrounds an image, we want to use it as an icon for the link.

img = link.find('img', src=True)
if img:
    thisLink['image'] = img['src']

Further, if the link has no text of its own and the image has a title or alt attribute, use that as the link’s title. If not, fall back to using the file name.

    if thisLink['title'] is None:
        # look for a title here if none exists
        # (has_attr checks attributes; `in` would check child nodes)
        if img.has_attr('title'):
            thisLink['title'] = img['title']
        elif img.has_attr('alt'):
            thisLink['title'] = img['alt']
        else:
            thisLink['title'] = path.basename(img['src'])
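
One thing worth seeing in isolation: on a tag, the `in` operator looks at child nodes, not attributes, so attribute checks need has_attr() or get(). The sketch below wraps the fallback chain in a helper; image_title is a hypothetical name of mine, and unlike the step above it also treats an empty title or alt as missing.

```python
from os import path
from bs4 import BeautifulSoup


def image_title(img):
    """Hypothetical helper: prefer title, then alt, then the file name."""
    return img.get('title') or img.get('alt') or path.basename(img['src'])


soup = BeautifulSoup(
    '<img src="/static/a.png" title="Tee" alt="Alt">'
    '<img src="/static/b.png" alt="Only alt">'
    '<img src="/static/icons/feed.png">',
    'html.parser')
imgs = soup.find_all('img')

titles = [image_title(img) for img in imgs]
print(titles)  # ['Tee', 'Only alt', 'feed.png']

# `in` tests the tag's children, so it is False even though alt is set
print('alt' in imgs[1], imgs[1].has_attr('alt'))  # False True
```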

If there’s still no title (meaning the link had no image, and link.string was None), try to build one from the text inside the link. If we can’t, we’ll skip this link.

if thisLink['title'] is None:
    # check for text inside the link
    if len(link.contents):
        thisLink['title'] = ' '.join(link.stripped_strings)
if not thisLink['title']:
    # if there's *still* no title (empty tag), skip it
    continue
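
Here is a small sketch of what stripped_strings produces, using made-up markup. It yields each piece of text with surrounding whitespace removed and skips whitespace-only pieces, so a link with nested but empty tags joins to an empty string rather than None.

```python
from bs4 import BeautifulSoup

html = ('<a href="/psf/">\n  <span>Python</span>\n'
        '  <em> Software </em> Foundation\n</a>'
        '<a href="/empty/"><span></span></a>')
soup = BeautifulSoup(html, 'html.parser')
full, empty = soup.find_all('a')

full_title = ' '.join(full.stripped_strings)
empty_title = ' '.join(empty.stripped_strings)

print(repr(full_title))   # 'Python Software Foundation'
print(repr(empty_title))  # ''
```

That empty-string case is why the final check is safest as a truthiness test rather than a comparison with None.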

Now, convert what we have to a simpler, immutable tuple and add it to the list.

hashableLink = (thisLink['url'].strip(),
                thisLink['title'].strip(),
                thisLink['image'].strip())
if hashableLink not in links:
    links.append(hashableLink)
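
The reason for converting to a tuple is that tuples of strings are hashable, which opens the door to set-based duplicate checks. Checking `not in links` scans the whole list each time; for long pages, one possible variation is to keep a set alongside the list. This is only a sketch, and add_link and seen are hypothetical names, not part of the script below.

```python
links = []    # results, in the order they were found
seen = set()  # the same tuples, for fast membership tests


def add_link(url, title, image=''):
    """Hypothetical helper: store a link once, report whether it was new."""
    record = (url.strip(), title.strip(), image.strip())
    if record in seen:
        return False
    seen.add(record)
    links.append(record)
    return True


first = add_link(' /about/ ', 'About')
second = add_link('/about/', 'About ')   # same link after stripping
third = add_link('/downloads/', 'Download', '/static/img/download.png')
print(links)
```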

It’s that easy. For more details about the tricks used above, take a look at the official documentation.

A Full Working Example

#!/usr/bin/env python3
"""
output tab separated lines with the following fields:
    0 url
    1 text
    2 imageurl
"""

from os import path
from sys import stdout
from bs4 import BeautifulSoup
import requests


# make sure non-ASCII titles print cleanly (Python 3.7+)
stdout.reconfigure(encoding='utf-8')

# a place to store the links we find
links = []

r = requests.get('https://www.python.org/')
r.raise_for_status()  # stop early on an HTTP error
page = r.text
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('a', href=True):
    # skip useless links
    if link['href'] == '' or link['href'].startswith('#'):
        continue
    # initialize the link
    thisLink = {
        'url': link['href'],
        'title': link.string,
        'image': '',
    }
    # see if the link contains an image
    img = link.find('img', src=True)
    if img:
        thisLink['image'] = img['src']
        if thisLink['title'] is None:
            # look for a title here if none exists
            if img.has_attr('title'):
                thisLink['title'] = img['title']
            elif img.has_attr('alt'):
                thisLink['title'] = img['alt']
            else:
                thisLink['title'] = path.basename(img['src'])

    if thisLink['title'] is None:
        # check for text inside the link
        if len(link.contents):
            thisLink['title'] = ' '.join(link.stripped_strings)
    if not thisLink['title']:
        # if there's *still* no title (empty tag), skip it
        continue
    # convert to something immutable for storage
    hashableLink = (thisLink['url'].strip(),
                    thisLink['title'].strip(),
                    thisLink['image'].strip())
    # store the result
    if hashableLink not in links:
        links.append(hashableLink)

# print the results
for link in links:
    stdout.write('\t'.join(link) + '\n')

 


 