Strip HTML tags from text using plain JavaScript
Learn how to effectively remove HTML tags from a string using various plain JavaScript techniques, ensuring clean text content for display or processing.
In web development, you often encounter situations where you need to extract plain text content from a string that contains HTML tags. This is crucial for displaying user-generated content safely, preventing XSS attacks, or simply for rendering text in environments that don't support HTML. While server-side solutions or external libraries exist, understanding how to achieve this with plain JavaScript is a fundamental skill. This article explores several robust methods to strip HTML tags, ranging from basic DOM manipulation to regular expressions.
Why Strip HTML Tags?
Stripping HTML tags is not just about aesthetics; it's a critical security and data integrity measure. When displaying user-submitted content, allowing raw HTML can lead to Cross-Site Scripting (XSS) vulnerabilities, where malicious scripts can be injected and executed in other users' browsers. Furthermore, for search indexing, analytics, or plain text displays (like email notifications), HTML tags are often undesirable noise that needs to be removed to present clean, readable information.
Figure: The importance of stripping HTML tags for security and display purposes.
Method 1: Using DOMParser (Recommended for Safety)
The DOMParser interface provides a way to parse XML or HTML source code from a string into a DOM Document. This is one of the safest and most robust methods because it leverages the browser's native HTML parsing engine, which handles malformed HTML gracefully and correctly interprets entities. You create a temporary document, parse the HTML string, and then extract the textContent or innerText.
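function stripHtmlTagsDOMParser(htmlString) {
  // Parse into a separate, inert document: scripts in it are not executed.
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');
  // textContent includes the source text of <script> and <style> elements,
  // so remove those elements before reading it.
  doc.querySelectorAll('script, style').forEach((el) => el.remove());
  return doc.body.textContent || '';
}

const html = '<p>Hello, <b>world</b>!</p><script>alert("XSS!")</script>';
console.log(stripHtmlTagsDOMParser(html)); // Expected: "Hello, world!"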
Using DOMParser to safely strip HTML tags.
DOMParser is generally preferred for its security benefits. It prevents script execution and correctly handles HTML entities, making it ideal for untrusted input.
Method 2: Using a Temporary DOM Element
Another common approach is to create a temporary DOM element (e.g., a <div>), assign the HTML string to its innerHTML property, and then read the element's textContent or innerText. This method also relies on the browser's HTML parsing capabilities. Note that <script> elements inserted through innerHTML are not executed, but inline event handlers can still run: an <img> with an onerror attribute, for example, starts loading as soon as it is parsed, even if the element is never appended to the document. Keeping the element off-document therefore does not make this method safe for untrusted input; for that, sanitize the string first or parse it with DOMParser.
function stripHtmlTagsTempElement(htmlString) {
  const div = document.createElement('div');
  div.innerHTML = htmlString;
  return div.textContent || div.innerText || '';
}

const html = '<div><span>Some text</span> with <b>bold</b> content.</div>';
console.log(stripHtmlTagsTempElement(html)); // Expected: "Some text with bold content."
Stripping HTML tags using a temporary DOM element.
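One possible variation on the temporary-element approach, offered here as a sketch rather than as part of the methods described in this article, is to use a <template> element. Its content belongs to a separate inert document, so markup assigned to innerHTML is parsed without loading images or running inline handlers. The function name stripHtmlTagsTemplate is hypothetical.

```javascript
// Hypothetical helper: strips tags using an inert <template> element.
function stripHtmlTagsTemplate(htmlString) {
  const template = document.createElement('template');
  // The markup is parsed into template.content, a DocumentFragment
  // owned by an inert document.
  template.innerHTML = htmlString;
  // Drop elements whose text is code or CSS rather than readable content.
  template.content
    .querySelectorAll('script, style')
    .forEach((el) => el.remove());
  return template.content.textContent || '';
}

// Browser-only usage; the guard lets the snippet load where no DOM exists.
if (typeof document !== 'undefined') {
  const untrusted = '<p>Hello <img src="x" onerror="alert(1)"> world</p>';
  console.log(stripHtmlTagsTemplate(untrusted));
}
```

The trade-off is the same as with DOMParser: the result is plain text only, so any structure you want to keep (line breaks, lists) has to be handled separately.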
Be cautious when using innerHTML with untrusted input, even with a temporary element. While reading textContent is safe, the assignment to innerHTML itself can trigger side effects such as resource loads and inline event handlers, whether or not the element is attached to the live DOM.
Method 3: Using Regular Expressions (Advanced & Potentially Risky)
While regular expressions can be used to strip HTML tags, this method is generally discouraged for complex or untrusted HTML. HTML is not a regular language, and parsing it accurately with regex can be extremely difficult, leading to edge cases, security vulnerabilities, or incorrect stripping. However, for very simple and predictable HTML structures (e.g., guaranteed valid tags without nested complexities or attributes), a basic regex might suffice. You should always prefer DOM-based parsing for robustness and security.
function stripHtmlTagsRegex(htmlString) {
  // This regex is very basic and might not cover all edge cases or malformed HTML.
  // It targets anything that looks like an HTML tag: < followed by characters, then >
  return htmlString.replace(/<[^>]*>/g, '');
}

const html = '<h1>Title</h1><p>This is a paragraph.</p>';
console.log(stripHtmlTagsRegex(html)); // Expected: "TitleThis is a paragraph."

const complexHtml = '<p>Hello <img src="x" onerror="alert(\'XSS\')"> world</p>';
// Expected: "Hello  world" (two spaces remain where the <img> tag was;
// the handler is removed, but the approach is fragile)
console.log(stripHtmlTagsRegex(complexHtml));
A basic regular expression for stripping HTML tags (use with caution).
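The fragility described above is easy to reproduce. The sketch below runs the same basic pattern against a few inputs chosen purely for illustration (the variable names are not from the examples above): an attribute value containing >, a <script> element, already-escaped markup, and plain text that merely contains < and >.

```javascript
// Same basic pattern as above, repeated so this snippet runs on its own.
function stripHtmlTagsRegex(htmlString) {
  return htmlString.replace(/<[^>]*>/g, '');
}

// A ">" inside an attribute value ends the match early and leaks markup.
const attrResult = stripHtmlTagsRegex('<a title="a > b">link</a>');
console.log(JSON.stringify(attrResult)); // " b\">link"

// The tags are removed, but the script's source text is left behind.
const scriptResult = stripHtmlTagsRegex('<script>alert(1)</script>Hi');
console.log(scriptResult); // alert(1)Hi

// Escaped markup passes through untouched; entities are never decoded.
const entityResult = stripHtmlTagsRegex('&lt;b&gt;bold&lt;/b&gt;');
console.log(entityResult); // &lt;b&gt;bold&lt;/b&gt;

// Plain text between "<" and ">" is mistaken for a tag and deleted.
const mathResult = stripHtmlTagsRegex('1 < 2 and 3 > 2');
console.log(JSON.stringify(mathResult)); // "1  2"
```

None of these inputs is exotic, which is why the pattern is only reasonable for markup you generate and control yourself.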
DOMParser or a temporary DOM element is a much safer and more reliable choice than a regular expression.
1. Choose the appropriate method based on your HTML source and security requirements. For untrusted input, DOMParser is highly recommended.
2. Implement the chosen JavaScript function in your application.
3. Test your function thoroughly with various HTML strings, including valid, invalid, and potentially malicious examples, to ensure it behaves as expected.
4. Consider additional sanitization if you need to allow certain tags or attributes while stripping others. This typically involves more advanced libraries or custom parsing logic.
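For the testing step, a small table-driven check can help. The following is a minimal sketch: runStripCases and the sample cases are hypothetical, and the basic regex stripper is used only as a stand-in because it runs outside a browser. Any of the functions from this article can be passed in instead.

```javascript
// Hypothetical harness: runs a strip function over input/expected pairs.
function runStripCases(stripFn, cases) {
  return cases.map(({ input, expected }) => {
    const actual = stripFn(input);
    return { input, expected, actual, pass: actual === expected };
  });
}

// Stand-in stripper (the basic regex from Method 3).
const regexStrip = (htmlString) => htmlString.replace(/<[^>]*>/g, '');

// Sample cases: valid, malformed, hostile, and tag-free input.
const cases = [
  { input: '<p>Hello</p>', expected: 'Hello' },
  { input: '<b>unclosed', expected: 'unclosed' },
  { input: '<img src="x" onerror="alert(1)">', expected: '' },
  { input: 'if a < b and b > c', expected: 'if a < b and b > c' },
];

const results = runStripCases(regexStrip, cases);
const failures = results.filter((r) => !r.pass);

// Report each failing case with its actual output for inspection.
failures.forEach((f) => console.log('FAIL', JSON.stringify(f)));
```

With the regex stand-in, the tag-free comparison text is the one case that fails, because everything between < and > is deleted. Keeping such cases in the table makes regressions visible when you switch methods.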