Extracting images and text from an mht file

Extracting Images and Text from MHT Files: A Comprehensive Guide

Learn how to programmatically extract embedded images and text content from MHT (MHTML) files using various methods and programming languages.

MHT (MHTML) files, short for MIME HTML, are single-file archives that bundle an HTML document and its associated resources (like images, CSS, and JavaScript) into one file. This format is often used for archiving web pages. While convenient for storage, extracting individual components, especially images and text, can be challenging without the right tools or programmatic approach. This article will guide you through understanding the MHT structure and provide methods to extract its contents effectively.

Understanding the MHT File Structure

An MHT file is essentially a MIME-encoded archive. It uses the multipart/related MIME type, where the main HTML document is typically the first part, and subsequent parts contain the embedded resources. Each part is separated by a unique boundary string and includes headers specifying its Content-Type, Content-Transfer-Encoding, and Content-Location (or Content-ID). Images are often base64 encoded within these parts.

To extract content, you need to parse this MIME structure, identify the different parts, decode their content based on the Content-Transfer-Encoding, and save them appropriately. The Content-Location header is crucial for mapping embedded resources back to their original filenames or URLs within the HTML.
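
Before extracting anything, it can help to list what an MHT file actually contains. The following is a minimal sketch using Python's standard email module. The function name describe_parts and the SAMPLE_MHT data are made up for illustration and are not part of any library.

```python
import email


def describe_parts(mht_bytes):
    """Return (content type, transfer encoding, Content-Location, Content-ID) per leaf part."""
    msg = email.message_from_bytes(mht_bytes)
    rows = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/related container itself
        rows.append((
            part.get_content_type(),
            (part.get('Content-Transfer-Encoding') or '7bit').lower(),
            part.get('Content-Location'),
            part.get('Content-ID'),
        ))
    return rows


# Hypothetical two-part MHT document used only for this demonstration
SAMPLE_MHT = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/related; boundary="sample-boundary"\r\n'
    b'\r\n'
    b'--sample-boundary\r\n'
    b'Content-Type: text/html; charset="utf-8"\r\n'
    b'Content-Transfer-Encoding: quoted-printable\r\n'
    b'Content-Location: http://example.com/index.html\r\n'
    b'\r\n'
    b'<img src=3D"cid:logo">\r\n'
    b'--sample-boundary\r\n'
    b'Content-Type: image/png\r\n'
    b'Content-Transfer-Encoding: base64\r\n'
    b'Content-ID: <logo>\r\n'
    b'Content-Location: logo.png\r\n'
    b'\r\n'
    b'iVBORw0KGgo=\r\n'
    b'--sample-boundary--\r\n'
)

if __name__ == "__main__":
    for row in describe_parts(SAMPLE_MHT):
        print(row)
```

Each tuple shows the headers that drive extraction: the type decides how a part is saved, the transfer encoding decides how it is decoded, and Content-Location or Content-ID ties it back to a reference in the HTML.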

flowchart TD
    A[MHT File] --> B{Parse MIME Structure}
    B --> C{Identify Boundary}
    C --> D{Extract Each Part}
    D --> E{Read Part Headers}
    E --> F{"Check Content-Type"}
    F --> |HTML| G[Save as HTML]
    F --> |"Image (e.g., image/jpeg)"| H{Decode Base64}
    H --> I[Save as Image File]
    F --> |"Other (e.g., text/css)"| J[Save as Resource]
    I --> K[Extracted Images]
    G --> L[Extracted Text/HTML]

MHT File Extraction Process Flow

Programmatic Extraction in C#

C# is a workable choice for MHT parsing, but the standard library offers less help than it might appear. The System.Net.Mail.MailMessage class is designed for composing and sending email and has no public API for loading an existing MIME stream, so it cannot parse an MHT file directly. That leaves two options: use a dedicated MIME parser library (MimeKit is one example), or parse the file manually by locating the MIME boundary and reading each part's headers.

The example below takes the manual route. It reads the file, finds the boundary declared in the top-level Content-Type header, splits the content into parts, reads each part's Content-Type, Content-Location, and Content-Transfer-Encoding headers, and decodes the body where needed before saving it.

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public class MhtExtractor
{
    // ISO-8859-1 maps each byte to exactly one char, so raw bytes survive the trip through a string.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static void ExtractMhtContent(string mhtFilePath, string outputDirectory)
    {
        if (!File.Exists(mhtFilePath))
        {
            Console.WriteLine("MHT file not found.");
            return;
        }

        Directory.CreateDirectory(outputDirectory);

        try
        {
            // Normalize line endings once so headers and boundaries are easy to locate.
            // Note: this would alter parts stored with a "binary" transfer encoding.
            string mhtContent = File.ReadAllText(mhtFilePath, Latin1).Replace("\r\n", "\n");
            string boundary = GetBoundaryFromMhtContent(mhtContent);

            if (string.IsNullOrEmpty(boundary))
            {
                Console.WriteLine("Could not find MIME boundary.");
                return;
            }

            string[] parts = mhtContent.Split(new[] { "--" + boundary }, StringSplitOptions.None);

            int fileCount = 0;
            // parts[0] holds the top-level headers and any preamble, so start at 1.
            for (int i = 1; i < parts.Length; i++)
            {
                string part = parts[i];
                if (part.StartsWith("--", StringComparison.Ordinal)) break; // closing "--boundary--"

                // Drop the line break that ends the delimiter line.
                if (part.StartsWith("\n", StringComparison.Ordinal)) part = part.Substring(1);

                // Headers end at the first blank line.
                int headerEnd = part.IndexOf("\n\n", StringComparison.Ordinal);
                if (headerEnd < 0) continue;

                // Unfold continuation lines (a line break followed by a space or tab).
                string headers = Regex.Replace(part.Substring(0, headerEnd), @"\n[ \t]+", " ");
                string body = part.Substring(headerEnd + 2).TrimEnd('\n');
                if (body.Length == 0) continue;

                string contentType = GetHeader(headers, "Content-Type");
                string contentLocation = GetHeader(headers, "Content-Location");
                string transferEncoding = GetHeader(headers, "Content-Transfer-Encoding");

                byte[] data;
                if (transferEncoding.Equals("base64", StringComparison.OrdinalIgnoreCase))
                {
                    try
                    {
                        data = Convert.FromBase64String(body); // whitespace is ignored
                    }
                    catch (FormatException)
                    {
                        Console.WriteLine("Warning: could not decode a base64 part. Saving it undecoded.");
                        data = Latin1.GetBytes(body);
                    }
                }
                else if (transferEncoding.Equals("quoted-printable", StringComparison.OrdinalIgnoreCase))
                {
                    data = DecodeQuotedPrintable(body);
                }
                else
                {
                    data = Latin1.GetBytes(body); // 7bit or 8bit: stored as-is
                }

                string fileName = GetSafeFileName(contentLocation);
                if (string.IsNullOrEmpty(fileName)) fileName = $"part_{fileCount++}.bin";

                // Ensure HTML parts get an HTML extension
                if (contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase)
                    && !fileName.EndsWith(".html", StringComparison.OrdinalIgnoreCase)
                    && !fileName.EndsWith(".htm", StringComparison.OrdinalIgnoreCase))
                {
                    fileName = Path.GetFileNameWithoutExtension(fileName) + ".html";
                }

                // Bytes are written unchanged, so text keeps the charset declared in its headers.
                File.WriteAllBytes(Path.Combine(outputDirectory, fileName), data);
                Console.WriteLine($"Extracted: {fileName} ({contentType})");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"An error occurred: {ex.Message}");
        }
    }

    private static string GetBoundaryFromMhtContent(string mhtContent)
    {
        // Only the top-level headers (everything before the first blank line) are searched.
        // Searching the whole header block also handles a boundary on a folded line.
        int end = mhtContent.IndexOf("\n\n", StringComparison.Ordinal);
        string headers = end >= 0 ? mhtContent.Substring(0, end) : mhtContent;
        Match m = Regex.Match(headers, "boundary=(?:\"([^\"]+)\"|([^;\\s]+))", RegexOptions.IgnoreCase);
        if (!m.Success) return null;
        return m.Groups[1].Success ? m.Groups[1].Value : m.Groups[2].Value;
    }

    private static string GetHeader(string headers, string name)
    {
        Match m = Regex.Match(headers, "^" + Regex.Escape(name) + @":\s*(.*)$",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return m.Success ? m.Groups[1].Value.Trim() : string.Empty;
    }

    private static string GetSafeFileName(string contentLocation)
    {
        // Drop any query string or fragment, then keep the last path segment.
        string path = contentLocation.Split('?', '#')[0];
        string name = path.Substring(path.LastIndexOfAny(new[] { '/', '\\' }) + 1);
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');
        return name;
    }

    private static byte[] DecodeQuotedPrintable(string input)
    {
        using (var output = new MemoryStream())
        {
            for (int i = 0; i < input.Length; i++)
            {
                char c = input[i];
                if (c == '=' && i + 1 < input.Length && input[i + 1] == '\n')
                {
                    i += 1; // soft line break: "=" at end of line joins the next line
                }
                else if (c == '=' && i + 2 < input.Length
                    && Uri.IsHexDigit(input[i + 1]) && Uri.IsHexDigit(input[i + 2]))
                {
                    output.WriteByte(Convert.ToByte(input.Substring(i + 1, 2), 16));
                    i += 2;
                }
                else
                {
                    output.WriteByte((byte)c);
                }
            }
            return output.ToArray();
        }
    }

    public static void Main(string[] args)
    {
        // Example usage: create a small MHT file for testing
        string dummyMhtPath = "example.mht";
        string outputDir = "extracted_mht_content";
        const string boundary = "----=_NextPart_000_0000_01D7F2A0.00000001";

        string htmlContent = "<html><body><h1>Hello MHT!</h1><img src=\"cid:image001.png\"></body></html>";
        string imageContentBase64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="; // 1x1 transparent PNG

        string mhtContent = "MIME-Version: 1.0\r\n";
        mhtContent += $"Content-Type: multipart/related;\r\n\tboundary=\"{boundary}\"\r\n\r\n";
        mhtContent += $"--{boundary}\r\n";
        mhtContent += "Content-Type: text/html;\r\n\tcharset=\"utf-8\"\r\n";
        mhtContent += "Content-Transfer-Encoding: quoted-printable\r\n";
        mhtContent += "Content-Location: http://example.com/index.html\r\n\r\n";
        // A literal "=" must be written as "=3D" in quoted-printable
        mhtContent += htmlContent.Replace("=", "=3D") + "\r\n\r\n";
        mhtContent += $"--{boundary}\r\n";
        mhtContent += "Content-Type: image/png;\r\n\tname=\"image001.png\"\r\n";
        mhtContent += "Content-Transfer-Encoding: base64\r\n";
        mhtContent += "Content-ID: <image001.png>\r\n";
        mhtContent += "Content-Location: image001.png\r\n\r\n";
        mhtContent += imageContentBase64 + "\r\n\r\n";
        mhtContent += $"--{boundary}--\r\n";

        File.WriteAllText(dummyMhtPath, mhtContent, Encoding.ASCII);
        Console.WriteLine($"Dummy MHT file '{dummyMhtPath}' created.");

        ExtractMhtContent(dummyMhtPath, outputDir);

        Console.WriteLine("Extraction complete. Press any key to exit.");
        Console.ReadKey();
    }
}

C# code for parsing an MHT file by hand and extracting its components. Note: MailMessage cannot load an existing MIME file; for anything beyond simple files, a dedicated MIME parser library is usually more reliable than manual parsing.

Extracting with Python

Python's email module is exceptionally well-suited for parsing MIME-encoded messages, including MHT files. It provides a high-level API to access message parts, headers, and payloads, abstracting away the complexities of boundary detection and decoding.

To use it, you load the MHT file into an email.message.Message object, then recursively iterate through its parts. For each part, you can check its Content-Type and Content-Disposition to determine if it's an image, HTML, or another resource, and then save its payload.

import base64
import email
import mimetypes
import os


def choose_filename(part, part_num):
    """Pick an output filename for a MIME part."""
    # get_filename() reads the filename parameter of Content-Disposition
    # and falls back to the name parameter of Content-Type.
    filename = part.get_filename()

    if not filename:
        content_location = part.get('Content-Location')
        if content_location:
            # Strip any query string or fragment before taking the last path segment
            filename = content_location.split('?')[0].split('#')[0]

    # Keep only the final path component so a part cannot write outside the output directory
    filename = os.path.basename(filename.replace('\\', '/')) if filename else ''

    if not filename:
        # Fallback: derive an extension from the MIME type
        ext = mimetypes.guess_extension(part.get_content_type()) or '.bin'
        filename = f"part_{part_num}{ext}"
    return filename


def extract_mht_content(mht_file_path, output_directory):
    if not os.path.exists(mht_file_path):
        print(f"MHT file not found: {mht_file_path}")
        return

    os.makedirs(output_directory, exist_ok=True)

    with open(mht_file_path, 'rb') as fp:
        msg = email.message_from_binary_file(fp)

    if msg.is_multipart():
        for part_num, part in enumerate(msg.walk()):
            if part.is_multipart():
                continue  # container parts have no payload of their own

            content_type = part.get_content_type()
            filename = choose_filename(part, part_num)

            try:
                # decode=True undoes base64 or quoted-printable encoding
                payload = part.get_payload(decode=True)
                if payload:
                    output_path = os.path.join(output_directory, filename)
                    with open(output_path, 'wb') as out_file:
                        out_file.write(payload)
                    print(f"Extracted: {filename} ({content_type})")
            except Exception as e:
                print(f"Error extracting part {part_num} ({filename}): {e}")
    else:
        # Handle non-multipart MHT (unlikely for typical MHTs with resources)
        print("MHT file is not multipart. Saving as single file.")
        filename = os.path.splitext(os.path.basename(mht_file_path))[0] + '.html'
        output_path = os.path.join(output_directory, filename)
        try:
            payload = msg.get_payload(decode=True)
            if payload:
                with open(output_path, 'wb') as out_file:
                    out_file.write(payload)
                print(f"Extracted main content: {filename}")
        except Exception as e:
            print(f"Error extracting main content: {e}")


# Example usage:
if __name__ == "__main__":
    from email.mime.image import MIMEImage
    from email.mime.multipart import MIMEMultipart
    from email.mime.text import MIMEText

    # Create a dummy MHT file for testing
    dummy_mht_path = "example.mht"
    output_dir = "extracted_mht_content_py"

    html_content = "<html><body><h1>Hello from Python MHT!</h1><img src=\"cid:image002.png\"></body></html>"
    image_content_base64 = "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkYAAAAAYAAjCB0C8AAAAASUVORK5CYII="  # 1x1 transparent PNG

    # MIMEMultipart sets MIME-Version and Content-Type itself; extra keyword
    # arguments become Content-Type parameters.
    msg_root = MIMEMultipart('related', type='text/html')

    # HTML part (MIMEText picks the transfer encoding and sets the header itself)
    msg_html = MIMEText(html_content, 'html', 'utf-8')
    msg_html['Content-Location'] = 'http://example.com/index.html'
    msg_root.attach(msg_html)

    # Image part (MIMEImage base64-encodes the payload and sets the header itself)
    msg_image = MIMEImage(base64.b64decode(image_content_base64), 'png')
    msg_image['Content-ID'] = '<image002.png>'
    msg_image['Content-Location'] = 'image002.png'
    msg_image.add_header('Content-Disposition', 'inline', filename='image002.png')
    msg_root.attach(msg_image)

    with open(dummy_mht_path, 'wb') as f:
        f.write(msg_root.as_bytes())
    print(f"Dummy MHT file '{dummy_mht_path}' created.")

    extract_mht_content(dummy_mht_path, output_dir)
    print("Extraction complete.")
Python script utilizing the email module to parse and extract content from an MHT file.
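
The script above saves the HTML part as markup. If the goal is the readable text, the tags still have to be stripped. The following is a minimal sketch using only the standard library; the names TextExtractor, html_to_text, and mht_to_text are made up for illustration. It assumes the first text/html part is the main document and that its declared charset is one Python recognizes (falling back to UTF-8 when none is declared).

```python
import email
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return '\n'.join(self._chunks)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def mht_to_text(mht_bytes):
    """Return the visible text of the first text/html part, or '' if there is none."""
    msg = email.message_from_bytes(mht_bytes)
    for part in msg.walk():
        if part.get_content_type() == 'text/html':
            charset = part.get_content_charset() or 'utf-8'
            html = part.get_payload(decode=True).decode(charset, errors='replace')
            return html_to_text(html)
    return ''
```

This keeps one line of output per text run and makes no attempt to preserve layout; a dedicated HTML library may be a better fit when structure matters.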

Handling Content-Location and Content-ID

The Content-Location and Content-ID headers are vital for correctly reassembling the MHT content. Content-Location typically provides a URL or path that the resource originally had, which can be used as a filename. Content-ID is used for resources embedded directly into the HTML using cid: URLs (e.g., <img src="cid:image001.png">).

When extracting, you should map these Content-IDs or Content-Locations to the actual filenames of the extracted resources. After extraction, you might need to modify the main HTML file to point to the locally saved image files instead of the cid: or original http:// paths.
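
As a rough illustration of that mapping, the sketch below builds a lookup table from each part's Content-ID and Content-Location to its local filename, then rewrites src and href attributes in the HTML. The function names build_reference_map and rewrite_references are hypothetical, and the regular expression only handles quoted attribute values; references inside CSS url(...) or unquoted attributes are left untouched.

```python
import re


def build_reference_map(parts):
    """parts: iterable of (content_id, content_location, local_filename) tuples."""
    mapping = {}
    for content_id, content_location, local_name in parts:
        if content_id:
            # The header value is wrapped in angle brackets; cid: URLs omit them
            mapping['cid:' + content_id.strip('<>')] = local_name
        if content_location:
            mapping[content_location] = local_name
    return mapping


def rewrite_references(html, mapping):
    """Replace quoted src/href values found in mapping with local filenames."""
    pattern = re.compile(r'''(\b(?:src|href)\s*=\s*)(["'])(.*?)\2''', re.IGNORECASE)

    def swap(match):
        prefix, quote, url = match.groups()
        # Unknown URLs are kept unchanged
        return f'{prefix}{quote}{mapping.get(url, url)}{quote}'

    return pattern.sub(swap, html)


if __name__ == "__main__":
    refs = build_reference_map([
        ('<image001.png>', 'image001.png', 'image001.png'),
        (None, 'http://example.com/style.css', 'style.css'),
    ])
    page = '<link href="http://example.com/style.css"><img src="cid:image001.png">'
    print(rewrite_references(page, refs))
```

The rewritten HTML should be saved next to the extracted resources so that the relative paths resolve.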

1. Load the MHT File

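Read the MHT file as raw bytes rather than as decoded text, so that nothing is altered before MIME parsing. In C#, use FileStream or File.ReadAllBytes (or read it as text with a byte-preserving encoding such as ISO-8859-1); in Python, use open(..., 'rb').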

2. Parse the MIME Structure

Use a MIME parsing library (e.g., Python's email module) or manually parse by identifying MIME boundaries and headers. Extract each part of the multipart message.

3. Identify Content Type and Encoding

For each part, read the Content-Type header (e.g., text/html, image/jpeg) and Content-Transfer-Encoding (e.g., base64, quoted-printable).

4. Decode and Save Content

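Decode the payload of each part according to its Content-Transfer-Encoding. Save the decoded content to a file in your specified output directory. Use Content-Location or Content-Disposition to determine the filename, and sanitize it before writing, since it comes from the file being parsed and may contain a full URL, a query string, or path separators.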

5. Update HTML References (Optional)

If the extracted HTML references images via cid: or absolute URLs, you may need to parse the HTML and replace these references with relative paths to your newly extracted image files.
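
Steps 3 and 4 are handled automatically by get_payload(decode=True) in Python's email module, but the decoding itself is simple enough to sketch by hand. The helper name decode_payload below is made up for illustration; it assumes the part body has already been isolated as bytes and covers only the encodings commonly seen in MHT files.

```python
import base64
import quopri


def decode_payload(raw, transfer_encoding):
    """Decode a part body (bytes) according to its Content-Transfer-Encoding."""
    encoding = (transfer_encoding or '7bit').strip().lower()
    if encoding == 'base64':
        # Line breaks inside the base64 text are ignored by default
        return base64.b64decode(raw)
    if encoding == 'quoted-printable':
        # Handles =XX escapes and soft line breaks ("=" at end of line)
        return quopri.decodestring(raw)
    # 7bit, 8bit and binary bodies are stored as-is
    return raw


if __name__ == "__main__":
    print(decode_payload(b'SGVsbG8s\r\nIE1IVCE=', 'base64'))
    print(decode_payload(b'<img src=3D"a.png">', 'quoted-printable'))
```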