HTML module in Python
Working with HTML in Python 2.7: Parsing, Generating, and Manipulating

Explore how to effectively parse, generate, and manipulate HTML content using various modules available in Python 2.7. This guide covers common libraries and practical examples.
Python 2.7, although an older release, can still parse, generate, and modify HTML. Typical tasks include extracting data from web pages (web scraping), generating dynamic HTML content, and modifying existing HTML structures. This article covers the most common ways to do each of these in a Python 2.7 environment.
Parsing HTML with BeautifulSoup
BeautifulSoup is a popular library for parsing HTML and XML documents. It builds a parse tree from the page source that you can navigate and search to extract data. It handles malformed HTML gracefully, which makes it well suited to real-world web scraping. Two major versions run on Python 2.7: BeautifulSoup 3, imported with from BeautifulSoup import BeautifulSoup, and BeautifulSoup 4, imported with from bs4 import BeautifulSoup. The examples in this article use the BeautifulSoup 3 import; BeautifulSoup 4 keeps most of the same interface, including findAll() as an alias of find_all().
from BeautifulSoup import BeautifulSoup
import urllib2
# Fetch HTML content from a URL
url = "http://www.example.com"
response = urllib2.urlopen(url)
html_doc = response.read()
# Parse the HTML
soup = BeautifulSoup(html_doc)
# Find the title tag
title_tag = soup.title
print "Page Title:", title_tag.string
# Find all paragraph tags
for paragraph in soup.findAll('p'):
    print "Paragraph:", paragraph.text
Basic HTML parsing and element extraction using BeautifulSoup
findAll() returns a list of matching tags, while find() returns the first match, or None when nothing matches. Use .string or .text to get the text content of a tag; .string is None when the tag has several children.
Workflow for parsing HTML with BeautifulSoup: fetch the HTML content, initialize the BeautifulSoup parser, navigate the parse tree (e.g., find tags), extract the data, then process the extracted data.
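Where installing BeautifulSoup is not an option, the same fetch, parse, extract workflow can be approximated with the standard library alone. The sketch below is an illustration under assumptions, not part of the original examples: it uses the HTMLParser module (html.parser on Python 3), and the class name ParagraphCollector, the helper extract_title_and_paragraphs, and the sample markup are all hypothetical. It is event-based, so it builds no parse tree and offers none of BeautifulSoup's navigation helpers.

```python
# Python 2.7 ships the parser as HTMLParser; Python 3 moved it to html.parser.
try:
    from HTMLParser import HTMLParser  # Python 2.7
except ImportError:
    from html.parser import HTMLParser  # Python 3


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the <title> text and every <p> text."""

    def __init__(self):
        # HTMLParser is an old-style class on 2.7, so call __init__ directly.
        HTMLParser.__init__(self)
        self.title = ""
        self.paragraphs = []
        self._current = None  # "title", "p", or None when outside both

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._current = "title"
        elif tag == "p":
            self._current = "p"
            self.paragraphs.append("")  # start a new paragraph buffer

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "p":
            # Text inside nested tags such as <b> is still part of the <p>.
            self.paragraphs[-1] += data


def extract_title_and_paragraphs(html_doc):
    """Feed a complete HTML string and return (title, list_of_paragraphs)."""
    collector = ParagraphCollector()
    collector.feed(html_doc)
    collector.close()
    return collector.title, collector.paragraphs


# Sample markup (an assumption, standing in for a fetched page).
sample = (
    "<html><head><title>Demo</title></head>"
    "<body><p>First</p><p>Second <b>bold</b></p></body></html>"
)
title, paragraphs = extract_title_and_paragraphs(sample)
print("Page Title: %s" % title)
for text in paragraphs:
    print("Paragraph: %s" % text)
```

For anything beyond flat extraction like this, a tree-building parser such as BeautifulSoup remains the more convenient choice.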
Generating HTML Programmatically
Generating HTML directly from Python can be useful for creating dynamic web pages, email templates, or reports. While you can always concatenate strings, using a dedicated templating engine or a library that helps build HTML elements is generally more robust and readable. For Python 2.7, simple string formatting or a basic templating approach is common.
def generate_simple_html(title, content_list):
    html_template = """
<!DOCTYPE html>
<html>
<head>
<title>{title}</title>
</head>
<body>
<h1>{title}</h1>
{content_blocks}
</body>
</html>
"""
    content_html = ""
    for item in content_list:
        content_html += "<p>%s</p>" % item
    return html_template.format(title=title, content_blocks=content_html)
my_title = "My Dynamic Page"
my_content = ["This is the first paragraph.", "Here's another one."]
generated_html = generate_simple_html(my_title, my_content)
print generated_html
Generating basic HTML using Python string formatting
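One caveat with plain string formatting is that values are inserted verbatim, so a title or paragraph containing characters such as < or & produces broken markup. A minimal sketch of one way to guard against that follows, assuming the standard library's escape function (cgi.escape on Python 2.7, html.escape on Python 3); the function name generate_escaped_html and the sample inputs are hypothetical and chosen only for illustration.

```python
# cgi.escape is the Python 2.7 spelling; Python 3 provides html.escape.
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape  # Python 2.7


def generate_escaped_html(title, content_list):
    """Build a small HTML page, escaping every interpolated value."""
    # Escape each paragraph before wrapping it in <p> tags.
    paragraphs = "\n".join(
        "<p>{0}</p>".format(escape(item)) for item in content_list
    )
    template = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head><title>{title}</title></head>\n"
        "<body>\n"
        "<h1>{title}</h1>\n"
        "{paragraphs}\n"
        "</body>\n"
        "</html>\n"
    )
    # The paragraphs are already escaped; only the title still needs it.
    return template.format(title=escape(title), paragraphs=paragraphs)


page = generate_escaped_html("Tips & Tricks", ["Use <p> for paragraphs."])
print(page)
```

Escaping at the point of interpolation keeps the template itself readable while ensuring user-supplied text cannot inject tags. A full templating engine typically does this automatically.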
Manipulating HTML Structures
Beyond just parsing and generating, you might need to modify existing HTML documents. BeautifulSoup allows you to not only navigate but also alter the parse tree. You can add new tags, modify attributes, change text content, or remove elements. This is particularly useful for tasks like cleaning up HTML, adding dynamic content to static templates, or preparing HTML for specific display purposes.
from BeautifulSoup import BeautifulSoup, Tag
html_doc = """
<html>
<head><title>Original Title</title></head>
<body>
<p id="intro">Hello, world!</p>
<div class="container">
<span>Old content</span>
</div>
</body>
</html>
"""
soup = BeautifulSoup(html_doc)
# Change the title
soup.title.string = "New and Improved Title"
# Modify an attribute
intro_p = soup.find('p', id='intro')
if intro_p:
    intro_p['class'] = 'highlight'
# Add a new element
# BeautifulSoup 3 builds tags with the Tag class; in BeautifulSoup 4 use soup.new_tag('div')
new_div = Tag(soup, 'div')
new_div.string = "This is new content."
soup.body.append(new_div)
# Remove an element
old_span = soup.find('span')
if old_span:
    old_span.extract()  # Removes the tag and its contents
print soup.prettify()
Modifying HTML elements and structure using BeautifulSoup