Extracting images and text from an mht file
Extracting Images and Text from MHT Files: A Comprehensive Guide

Learn how to programmatically extract embedded images and text content from MHT (MHTML) files using various methods and programming languages.
MHT (MHTML) files, short for MIME HTML, are single-file archives that bundle an HTML document and its associated resources (like images, CSS, and JavaScript) into one file. This format is often used for archiving web pages. While convenient for storage, extracting individual components, especially images and text, can be challenging without the right tools or programmatic approach. This article will guide you through understanding the MHT structure and provide methods to extract its contents effectively.
Understanding the MHT File Structure
An MHT file is essentially a MIME-encoded archive. It uses the multipart/related MIME type, where the main HTML document is typically the first part, and subsequent parts contain the embedded resources. Each part is separated by a unique boundary string and includes headers specifying its Content-Type, Content-Transfer-Encoding, and Content-Location (or Content-ID). Images are often base64 encoded within these parts.
To extract content, you need to parse this MIME structure, identify the different parts, decode their content based on the Content-Transfer-Encoding, and save them appropriately. The Content-Location header is crucial for mapping embedded resources back to their original filenames or URLs within the HTML.
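To make this layout concrete, the following sketch parses a tiny hand-written MHT document with Python's standard email module and lists the headers of each part. The sample bytes, the boundary string "demo-boundary", and the helper name list_parts are illustrative assumptions, not taken from a real capture.

```python
from email import message_from_bytes

# A tiny hand-written MHT document (illustrative only; real files are much larger).
SAMPLE_MHT = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="demo-boundary"\r\n'
    b"\r\n"
    b"--demo-boundary\r\n"
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"Content-Location: http://example.com/index.html\r\n"
    b"\r\n"
    b'<html><body><img src=3D"logo.png"></body></html>\r\n'
    b"--demo-boundary\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-Transfer-Encoding: base64\r\n"
    b"Content-Location: http://example.com/logo.png\r\n"
    b"\r\n"
    b"iVBORw0KGgo=\r\n"
    b"--demo-boundary--\r\n"
)


def list_parts(raw_bytes):
    """Return (content type, transfer encoding, location) for every leaf part."""
    msg = message_from_bytes(raw_bytes)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # the multipart/related container has no payload of its own
        rows.append((
            part.get_content_type(),
            part.get("Content-Transfer-Encoding"),
            part.get("Content-Location"),
        ))
    return rows


for row in list_parts(SAMPLE_MHT):
    print(row)
```

Each tuple corresponds to one boundary-delimited part, which is the unit an extractor decodes and saves.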
flowchart TD
    A[MHT File] --> B{Parse MIME Structure}
    B --> C{Identify Boundary}
    C --> D{Extract Each Part}
    D --> E{Read Part Headers}
    E --> F{Check "Content-Type"}
    F --> |HTML| G[Save as HTML]
    F --> |Image (e.g., image/jpeg)| H{Decode Base64}
    H --> I[Save as Image File]
    F --> |Other (e.g., text/css)| J[Save as Resource]
    I --> K[Extracted Images]
    G --> L[Extracted Text/HTML]
MHT File Extraction Process Flow
Programmatic Extraction in C#
C# is a reasonable choice for MHT parsing, but the base class library has no public API for loading an existing MIME document. The System.Net.Mail.MailMessage class, with its AlternateViews and Attachments collections, is designed for composing and sending email; it cannot be populated from an MHT file or stream. That leaves two options: use a dedicated MIME parser library (MimeKit is one widely used example), or parse the file manually by locating the MIME boundary, splitting the content into parts, and reading each part's headers.
The example below takes the manual route. It reads the boundary from the top-level Content-Type header, splits the file on that boundary, reads each part's Content-Type, Content-Location, and Content-Transfer-Encoding headers, and saves the part's body to the output directory.
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class MhtExtractor
{
    // ISO-8859-1 maps each byte to exactly one char, so raw MIME text round-trips unchanged.
    private static readonly Encoding RawEncoding = Encoding.GetEncoding("ISO-8859-1");

    public static void ExtractMhtContent(string mhtFilePath, string outputDirectory)
    {
        if (!File.Exists(mhtFilePath))
        {
            Console.WriteLine("MHT file not found.");
            return;
        }
        Directory.CreateDirectory(outputDirectory);
        try
        {
            string mhtContent = File.ReadAllText(mhtFilePath, RawEncoding);
            string boundary = GetBoundaryFromMhtContent(mhtContent);
            if (string.IsNullOrEmpty(boundary))
            {
                Console.WriteLine("Could not find MIME boundary.");
                return;
            }
            string[] parts = mhtContent.Split(new[] { "--" + boundary }, StringSplitOptions.None);
            int fileCount = 0;
            // parts[0] holds the top-level headers; the element after the last part starts with "--".
            for (int i = 1; i < parts.Length; i++)
            {
                if (parts[i].StartsWith("--", StringComparison.Ordinal)) break; // closing delimiter
                string part = parts[i].Replace("\r\n", "\n").TrimStart('\n');
                // A blank line separates a part's headers from its body.
                int headerEnd = part.IndexOf("\n\n", StringComparison.Ordinal);
                if (headerEnd < 0) continue;
                // Unfold continuation lines (a line break followed by a space or tab).
                string headers = Regex.Replace(part.Substring(0, headerEnd), @"\n[ \t]+", " ");
                string body = part.Substring(headerEnd + 2).Trim();
                if (body.Length == 0) continue;

                string contentType = GetHeader(headers, "Content-Type");
                string contentLocation = GetHeader(headers, "Content-Location");
                string transferEncoding = GetHeader(headers, "Content-Transfer-Encoding");

                string fileName = GetSafeFileName(contentLocation);
                if (string.IsNullOrEmpty(fileName)) fileName = $"part_{fileCount++}.bin";
                // Ensure HTML parts get a .html extension
                if (contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase)
                    && !fileName.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
                {
                    fileName = Path.GetFileNameWithoutExtension(fileName) + ".html";
                }
                string outputPath = Path.Combine(outputDirectory, fileName);

                byte[] data;
                if (transferEncoding.Equals("base64", StringComparison.OrdinalIgnoreCase))
                {
                    try
                    {
                        // Convert.FromBase64String ignores the line breaks inside the body
                        data = Convert.FromBase64String(body);
                    }
                    catch (FormatException)
                    {
                        Console.WriteLine($"Warning: Could not decode base64 for {fileName}. Saving raw content.");
                        data = RawEncoding.GetBytes(body);
                    }
                }
                else if (transferEncoding.Equals("quoted-printable", StringComparison.OrdinalIgnoreCase))
                {
                    data = DecodeQuotedPrintable(body);
                }
                else
                {
                    data = RawEncoding.GetBytes(body); // 7bit, 8bit, or binary: no decoding needed
                }
                File.WriteAllBytes(outputPath, data);
                Console.WriteLine($"Extracted: {fileName} ({contentType})");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"An error occurred: {ex.Message}");
        }
    }

    private static string GetBoundaryFromMhtContent(string mhtContent)
    {
        // Search only the top-level header block (everything before the first blank line).
        int headerEnd = mhtContent.IndexOf("\r\n\r\n", StringComparison.Ordinal);
        if (headerEnd < 0) headerEnd = mhtContent.IndexOf("\n\n", StringComparison.Ordinal);
        string headers = headerEnd < 0 ? mhtContent : mhtContent.Substring(0, headerEnd);
        // The boundary parameter may be quoted and may sit on a folded continuation line.
        Match m = Regex.Match(headers, "boundary\\s*=\\s*(?:\"([^\"]+)\"|([^;\\s]+))", RegexOptions.IgnoreCase);
        if (!m.Success) return null;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    private static string GetHeader(string headers, string name)
    {
        Match m = Regex.Match(headers, "^" + Regex.Escape(name) + @":[ \t]*(.*)$",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    private static string GetSafeFileName(string contentLocation)
    {
        // Drop any query string or fragment, keep the last path segment, replace invalid characters.
        string name = contentLocation.Split('?', '#')[0];
        name = name.Substring(name.LastIndexOfAny(new[] { '/', '\\' }) + 1);
        foreach (char c in Path.GetInvalidFileNameChars()) name = name.Replace(c, '_');
        return name;
    }

    private static byte[] DecodeQuotedPrintable(string text)
    {
        // Remove soft line breaks ("=" at end of line), then turn each =XX escape into its byte.
        text = Regex.Replace(text, @"=\r?\n", "");
        using (var ms = new MemoryStream())
        {
            for (int i = 0; i < text.Length; i++)
            {
                if (text[i] == '=' && i + 2 < text.Length
                    && Uri.IsHexDigit(text[i + 1]) && Uri.IsHexDigit(text[i + 2]))
                {
                    ms.WriteByte(Convert.ToByte(text.Substring(i + 1, 2), 16));
                    i += 2;
                }
                else
                {
                    ms.WriteByte((byte)text[i]);
                }
            }
            return ms.ToArray();
        }
    }

    public static void Main(string[] args)
    {
        // Example usage:
        // Create a dummy MHT file for testing
        string dummyMhtPath = "example.mht";
        string outputDir = "extracted_mht_content";
        // The HTML part is declared quoted-printable, so "=" must be written as "=3D"
        string htmlContent = "<html><body><h1>Hello MHT!</h1><img src=3D\"cid:image001.png\"></body></html>";
        string imageContentBase64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="; // 1x1 transparent PNG
        string mhtContent = "MIME-Version: 1.0\r\n";
        mhtContent += "Content-Type: multipart/related;\r\n\tboundary=\"----=_NextPart_000_0000_01D7F2A0.00000001\"\r\n\r\n";
        mhtContent += "------=_NextPart_000_0000_01D7F2A0.00000001\r\n";
        mhtContent += "Content-Type: text/html;\r\n\tcharset=\"utf-8\"\r\n";
        mhtContent += "Content-Transfer-Encoding: quoted-printable\r\n";
        mhtContent += "Content-Location: http://example.com/index.html\r\n\r\n";
        mhtContent += htmlContent + "\r\n\r\n";
        mhtContent += "------=_NextPart_000_0000_01D7F2A0.00000001\r\n";
        mhtContent += "Content-Type: image/png;\r\n\tname=\"image001.png\"\r\n";
        mhtContent += "Content-Transfer-Encoding: base64\r\n";
        mhtContent += "Content-ID: <image001.png>\r\n";
        mhtContent += "Content-Location: image001.png\r\n\r\n";
        mhtContent += imageContentBase64 + "\r\n\r\n";
        mhtContent += "------=_NextPart_000_0000_01D7F2A0.00000001--\r\n";
        File.WriteAllText(dummyMhtPath, mhtContent, RawEncoding);
        Console.WriteLine($"Dummy MHT file '{dummyMhtPath}' created.");
        ExtractMhtContent(dummyMhtPath, outputDir);
        Console.WriteLine("Extraction complete. Press any key to exit.");
        Console.ReadKey();
    }
}
C# code that manually parses an MHT file and extracts its components. Note: MailMessage cannot load existing MIME content, so MHT parsing in C# relies on manual MIME parsing or a dedicated MIME library.
Content-Transfer-Encoding can vary (e.g., quoted-printable, base64, 8bit). Your parser should handle these different encodings to ensure correct content extraction.
Extracting with Python
Python's email module is exceptionally well-suited for parsing MIME-encoded messages, including MHT files. It provides a high-level API to access message parts, headers, and payloads, abstracting away the complexities of boundary detection and decoding.
To use it, you load the MHT file into an email.message.Message object, then recursively iterate through its parts. For each part, you can check its Content-Type and Content-Disposition to determine if it's an image, HTML, or another resource, and then save its payload.
import base64
import os
from email import message_from_binary_file
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def extract_mht_content(mht_file_path, output_directory):
    if not os.path.exists(mht_file_path):
        print(f"MHT file not found: {mht_file_path}")
        return
    os.makedirs(output_directory, exist_ok=True)
    with open(mht_file_path, 'rb') as fp:
        msg = message_from_binary_file(fp)
    if msg.is_multipart():
        for part_num, part in enumerate(msg.walk()):
            if part.is_multipart():
                continue  # Container parts have no payload of their own
            content_type = part.get_content_type()
            content_location = part.get('Content-Location')
            # get_filename() reads the filename from Content-Disposition,
            # falling back to the name parameter of Content-Type
            filename = part.get_filename()
            if not filename and content_location:
                # Try to get filename from Content-Location
                filename = content_location.split('?')[0].split('#')[0]
            # Keep only the last path component so nothing is written outside output_directory
            filename = os.path.basename(filename or '')
            if not filename:
                # Fallback filename based on the MIME subtype (e.g. part_2.png)
                filename = f"part_{part_num}.{part.get_content_subtype()}"
            try:
                # decode=True undoes base64 or quoted-printable and returns bytes
                payload = part.get_payload(decode=True)
                if payload:
                    output_path = os.path.join(output_directory, filename)
                    with open(output_path, 'wb') as out_file:
                        out_file.write(payload)
                    print(f"Extracted: {filename} ({content_type})")
            except Exception as e:
                print(f"Error extracting part {part_num} ({filename}): {e}")
    else:
        # Handle non-multipart MHT (unlikely for typical MHTs with resources)
        print("MHT file is not multipart. Saving as single file.")
        filename = os.path.splitext(os.path.basename(mht_file_path))[0] + '.html'
        output_path = os.path.join(output_directory, filename)
        try:
            payload = msg.get_payload(decode=True)
            if payload:
                with open(output_path, 'wb') as out_file:
                    out_file.write(payload)
                print(f"Extracted main content: {filename}")
        except Exception as e:
            print(f"Error extracting main content: {e}")


# Example usage:
if __name__ == "__main__":
    # Create a dummy MHT file for testing
    dummy_mht_path = "example.mht"
    output_dir = "extracted_mht_content_py"
    html_content = "<html><body><h1>Hello from Python MHT!</h1><img src=\"cid:image002.png\"></body></html>"
    image_content_base64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="  # 1x1 transparent PNG

    # MIMEMultipart sets MIME-Version and Content-Type (including the boundary) itself;
    # extra keyword arguments become Content-Type parameters.
    msg_root = MIMEMultipart('related', type='text/html')

    # HTML part. MIMEText picks the transfer encoding itself (base64 for utf-8),
    # so Content-Transfer-Encoding must not be added a second time by hand.
    msg_html = MIMEText(html_content, 'html', 'utf-8')
    msg_html['Content-Location'] = 'http://example.com/index.html'
    msg_root.attach(msg_html)

    # Image part. MIMEImage base64-encodes the payload and sets the matching header.
    msg_image = MIMEImage(base64.b64decode(image_content_base64), 'png')
    msg_image['Content-ID'] = '<image002.png>'
    msg_image['Content-Location'] = 'image002.png'
    msg_image.add_header('Content-Disposition', 'inline', filename='image002.png')
    msg_root.attach(msg_image)

    with open(dummy_mht_path, 'wb') as f:
        f.write(msg_root.as_bytes())
    print(f"Dummy MHT file '{dummy_mht_path}' created.")
    extract_mht_content(dummy_mht_path, output_dir)
    print("Extraction complete.")
Python script utilizing the email module to parse and extract content from an MHT file.
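The extracted HTML part still contains markup. If plain text is the goal, one possible approach is to run the decoded HTML through the standard library's html.parser module and keep only the text nodes. The sketch below is a minimal illustration under that assumption; the class name VisibleTextExtractor and the helper html_to_text are hypothetical, and real pages may need a more forgiving HTML library.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>h1 {color: red}</style></head>"
    "<body><h1>Hello MHT!</h1><p>Some text.</p></body></html>"
)
print(html_to_text(sample))
```

When the input comes from a MIME part, the payload is bytes and has to be decoded to a string first, for example with payload.decode(part.get_content_charset() or "utf-8", errors="replace").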
Handling Content-Location and Content-ID
The Content-Location and Content-ID headers are vital for correctly reassembling the MHT content. Content-Location typically provides a URL or path that the resource originally had, which can be used as a filename. Content-ID is used for resources embedded directly into the HTML using cid: URLs (e.g., <img src="cid:image001.png">).
When extracting, you should map these Content-ID or Content-Location values to the actual filenames of the extracted resources. After extraction, you might need to modify the main HTML file to point to the locally saved image files instead of the cid: or original http:// paths.
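One way to sketch this mapping in Python is shown below. The helper names build_reference_map and rewrite_references, the sample message, and the rule for deriving local filenames are all assumptions made for illustration; the filename rule must match whatever your extractor actually used when saving the parts. Plain string replacement is also a simplification: it does not account for HTML-escaped URLs such as &amp; in query strings.

```python
import os
from email import message_from_bytes


def build_reference_map(msg):
    """Map every cid: URL and Content-Location value to a local filename."""
    mapping = {}
    for index, part in enumerate(msg.walk()):
        if part.is_multipart() or part.get_content_type() == "text/html":
            continue  # only embedded resources are mapped
        location = part.get("Content-Location", "").strip()
        content_id = part.get("Content-ID", "").strip().strip("<>")
        # Assumed naming rule: last path segment without query string or fragment
        local_name = os.path.basename(location.split("?")[0].split("#")[0])
        if not local_name:
            local_name = f"part_{index}.bin"
        if content_id:
            mapping["cid:" + content_id] = local_name
        if location:
            mapping[location] = local_name
    return mapping


def rewrite_references(html, mapping):
    """Replace each known reference in the HTML with its local filename."""
    # Longest references first, so a short key never rewrites part of a longer one
    for reference in sorted(mapping, key=len, reverse=True):
        html = html.replace(reference, mapping[reference])
    return html


SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/related; boundary="demo"\r\n\r\n'
    b"--demo\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Location: http://example.com/index.html\r\n\r\n"
    b'<img src="cid:logo@example"><img src="http://example.com/img/photo.jpg?v=2">\r\n'
    b"--demo\r\n"
    b"Content-Type: image/png\r\n"
    b"Content-ID: <logo@example>\r\n"
    b"Content-Location: logo.png\r\n\r\n"
    b"AAAA\r\n"
    b"--demo\r\n"
    b"Content-Type: image/jpeg\r\n"
    b"Content-Location: http://example.com/img/photo.jpg?v=2\r\n\r\n"
    b"AAAA\r\n"
    b"--demo--\r\n"
)

msg = message_from_bytes(SAMPLE)
mapping = build_reference_map(msg)
html = msg.get_payload(0).get_payload()
print(rewrite_references(html, mapping))
```

Building the map in a first pass and rewriting in a second keeps the HTML untouched until every resource has a known local name.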
1. Load the MHT File
Open the MHT file in binary read mode. For C#, use FileStream; for Python, use open(..., 'rb').
2. Parse the MIME Structure
Use a MIME parsing library (e.g., Python's email module) or manually parse by identifying MIME boundaries and headers. Extract each part of the multipart message.
3. Identify Content Type and Encoding
For each part, read the Content-Type header (e.g., text/html, image/jpeg) and Content-Transfer-Encoding (e.g., base64, quoted-printable).
4. Decode and Save Content
Decode the payload of each part according to its Content-Transfer-Encoding. Save the decoded content to a file in your specified output directory. Use Content-Location or Content-Disposition to determine the filename.
5. Update HTML References (Optional)
If the extracted HTML references images via cid: or absolute URLs, you may need to parse the HTML and replace these references with relative paths to your newly extracted image files.
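If you parse the parts by hand instead of relying on a MIME library, the decoding in step 4 can be sketched with Python's standard base64 and quopri modules. The function name decode_body is hypothetical, and the sketch assumes you have already isolated a part's raw body bytes and the value of its Content-Transfer-Encoding header.

```python
import base64
import quopri


def decode_body(raw_body: bytes, transfer_encoding: str) -> bytes:
    """Decode a MIME part body according to its Content-Transfer-Encoding."""
    # A missing header means 7bit, the MIME default
    encoding = (transfer_encoding or "7bit").strip().lower()
    if encoding == "base64":
        # b64decode skips the line breaks that wrap long base64 bodies
        return base64.b64decode(raw_body)
    if encoding == "quoted-printable":
        return quopri.decodestring(raw_body)
    if encoding in ("7bit", "8bit", "binary"):
        return raw_body  # identity encodings: the bytes are already the content
    raise ValueError(f"Unsupported Content-Transfer-Encoding: {transfer_encoding!r}")


print(decode_body(b"SGVsbG8s\r\nIE1IVCE=", "base64"))
print(decode_body(b'<img src=3D"logo.png">', "quoted-printable"))
```

Raising on an unknown encoding, rather than silently writing the raw bytes, makes unexpected files visible early; whether that is the right trade-off depends on how tolerant your extractor needs to be.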