Strip HTML tags from text using plain JavaScript

Learn how to effectively remove HTML tags from a string using various plain JavaScript techniques, ensuring clean text content for display or processing.

In web development, you often need to extract plain text from a string that contains HTML tags, whether to display user-generated content safely, to reduce exposure to XSS attacks, or to render text in environments that don't support HTML. While server-side solutions and external libraries exist, understanding how to achieve this with plain JavaScript is a fundamental skill. This article covers three methods, from DOM-based parsing to regular expressions, and notes where each one is safe to use.

Why Strip HTML Tags?

Stripping HTML tags is not just about aesthetics; it's also a security and data integrity measure. When displaying user-submitted content, allowing raw HTML can lead to Cross-Site Scripting (XSS) vulnerabilities, where malicious scripts are injected and executed in other users' browsers. Stripping alone does not make a string safe to re-insert as markup, because a decoded entity such as &lt;img&gt; becomes literal tag text, so write the result to the page as text (for example via textContent) rather than as HTML. Furthermore, for search indexing, analytics, or plain text displays (like email notifications), HTML tags are undesirable noise that needs to be removed to present clean, readable information.

Figure: Flowchart in which 'User Input (with HTML)' branches into 'Security (XSS Prevention)' and 'Display/Processing (Clean Text)', then converges on 'Stripped Text Output'. Caption: The importance of stripping HTML tags for security and display purposes.

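Method 1: Using DOMParser

The DOMParser interface parses XML or HTML source code from a string into a DOM Document. This is one of the safest and most robust methods because it uses the browser's native HTML parsing engine, which handles malformed HTML gracefully and correctly interprets entities, and because the resulting document is inert: its scripts are not executed. You parse the HTML string into a temporary document and then read the textContent of its body. Note that textContent also returns the source text inside <script> and <style> elements, so remove those elements first if you do not want that text in the output.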

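function stripHtmlTagsDOMParser(htmlString) {
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');
  // textContent includes the source text of <script> and <style>, so drop them first.
  doc.querySelectorAll('script, style').forEach((el) => el.remove());
  return doc.body.textContent || '';
}

const html = '<p>Hello, <b>world</b>!</p><script>alert("XSS!")</script>';
console.log(stripHtmlTagsDOMParser(html)); // Expected: "Hello, world!"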

Using DOMParser to safely strip HTML tags.
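The claims about entities and malformed markup can be checked with a few inputs. The following is a minimal sketch, assuming a browser-like environment where DOMParser is defined; textFromHtml is a hypothetical name used only for this illustration, and the usage lines are guarded so the snippet does nothing elsewhere.

```javascript
// Hypothetical helper for this illustration: parse into an inert document
// and read the text of its body.
function textFromHtml(htmlString) {
  const doc = new DOMParser().parseFromString(htmlString, 'text/html');
  return doc.body.textContent || '';
}

// Guarded so the snippet is a no-op where DOMParser does not exist.
if (typeof DOMParser !== 'undefined') {
  // Entities are decoded by the parser.
  console.log(textFromHtml('<p>Fish &amp; chips</p>')); // "Fish & chips"
  // Unclosed tags are closed by the parser instead of leaking into the text.
  console.log(textFromHtml('<p>Unclosed <b>bold')); // "Unclosed bold"
}
```

Because the decoded output can contain characters such as < and &, it is best treated as plain text from this point on.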

Method 2: Using a Temporary DOM Element

Another common approach is to create a temporary DOM element (e.g., a <div>), assign the HTML string to its innerHTML property, and then read the textContent or innerText of that element. This method also relies on the browser's HTML parser, but unlike DOMParser it parses the markup in the context of the live document. <script> elements inserted through innerHTML are not executed, but inline event handlers are: markup such as <img src="x" onerror="..."> starts loading the image and can fire its handler even if the element is never appended to the page. Creating the element off-document therefore does not make this method safe, so reserve it for HTML you trust and use DOMParser for untrusted input.

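function stripHtmlTagsTempElement(htmlString) {
  // Trusted input only: inline handlers (e.g. onerror on <img>) can fire
  // as soon as innerHTML is assigned, even though the div is never attached.
  const div = document.createElement('div');
  div.innerHTML = htmlString;
  return div.textContent || div.innerText || '';
}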

const html = '<div><span>Some text</span> with <b>bold</b> content.</div>';
console.log(stripHtmlTagsTempElement(html)); // Expected: "Some text with bold content."

Stripping HTML tags using a temporary DOM element.
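If the temporary-element pattern is preferred, one variation is to use a <template> element instead of a <div>. This is a swapped-in technique rather than the method shown above: markup assigned to a template is parsed into an inert fragment, so images are not fetched and inline handlers should not run. The sketch below assumes a browser environment; stripHtmlTagsTemplate is a hypothetical name.

```javascript
// Hypothetical variant of the temporary-element approach using <template>.
function stripHtmlTagsTemplate(htmlString) {
  const template = document.createElement('template');
  // The parsed nodes live in template.content, an inert DocumentFragment.
  template.innerHTML = htmlString;
  return template.content.textContent || '';
}

// Guarded so the snippet is a no-op outside a browser.
if (typeof document !== 'undefined') {
  const risky = '<img src="x" onerror="alert(1)">Hello';
  console.log(stripHtmlTagsTemplate(risky)); // "Hello"
}
```

As with textContent on any node, the text inside <script> and <style> elements is still returned, so those elements need to be removed separately if their content is unwanted.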

Method 3: Using Regular Expressions (Advanced & Potentially Risky)

While regular expressions can be used to strip HTML tags, this method is generally discouraged for complex or untrusted HTML. HTML is not a regular language, and a simple pattern cannot parse it accurately: a > inside an attribute value or comment ends the match early, entities such as &amp; are left undecoded, and the contents of <script> and <style> elements stay in the output. However, for very simple and predictable HTML (e.g., guaranteed valid tags without attributes, comments, or entities), a basic regex might suffice. You should always prefer DOM-based parsing for robustness and security.

function stripHtmlTagsRegex(htmlString) {
  // This regex is very basic and might not cover all edge cases or malformed HTML.
  // It targets anything that looks like an HTML tag: < followed by characters, then >
  return htmlString.replace(/<[^>]*>/g, '');
}

const html = '<h1>Title</h1><p>This is a paragraph.</p>';
console.log(stripHtmlTagsRegex(html)); // Expected: "TitleThis is a paragraph."

const complexHtml = '<p>Hello <img src="x" onerror="alert(\'XSS\')"> world</p>';
console.log(stripHtmlTagsRegex(complexHtml)); // Expected: "Hello  world" (the <img> tag and its onerror handler are removed, but the approach is fragile)

A basic regular expression for stripping HTML tags (use with caution).
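To make the fragility concrete, the sketch below runs the same pattern against three inputs that a real HTML parser would handle differently. basicStrip is a hypothetical name for this illustration, and the snippet needs no DOM, so it runs in any JavaScript environment.

```javascript
// Same pattern as above, under a hypothetical name for this illustration.
const basicStrip = (htmlString) => htmlString.replace(/<[^>]*>/g, '');

// A ">" inside an attribute value ends the match early and leaks markup.
console.log(basicStrip('<a title="1 > 0">link</a>')); // ' 0">link'

// Script source is text between tags, so it survives the stripping.
console.log(basicStrip('<p>Hi</p><script>alert(1)</script>')); // 'Hialert(1)'

// A ">" inside a comment has the same effect as in an attribute.
console.log(basicStrip('<!-- a > b -->text')); // ' b -->text'
```

None of these outputs is executable on its own, but each contains text that a reader would not expect to see.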

1. Choose the appropriate method based on your HTML source and security requirements. For untrusted input, DOMParser is highly recommended.

2. Implement the chosen JavaScript function in your application.

3. Test your function thoroughly with various HTML strings, including valid, invalid, and potentially malicious examples, to ensure it behaves as expected.

4. Consider additional sanitization if you need to allow certain tags or attributes while stripping others. This typically involves more advanced libraries or custom parsing logic.
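The testing step can be sketched as a small table-driven check. findStripFailures and regexStrip are hypothetical names; the regex stripper is only a stand-in so the snippet runs without a DOM, and in practice you would pass in whichever function you chose.

```javascript
// Hypothetical helper: runs a strip function against [input, expected] pairs
// and returns a record for every mismatch.
function findStripFailures(stripFn, cases) {
  const failures = [];
  for (const [input, expected] of cases) {
    const actual = stripFn(input);
    if (actual !== expected) {
      failures.push({ input, expected, actual });
    }
  }
  return failures;
}

// Stand-in implementation so the sketch runs anywhere; swap in your own.
const regexStrip = (htmlString) => htmlString.replace(/<[^>]*>/g, '');

const cases = [
  ['<h1>Title</h1>', 'Title'],                  // valid markup
  ['Unclosed <b>bold', 'Unclosed bold'],        // invalid: missing </b>
  ['<img src="x" onerror="alert(1)">Hi', 'Hi'], // potentially malicious
  ['<p>Fish &amp; chips</p>', 'Fish & chips'],  // entity that should decode
];

const failures = findStripFailures(regexStrip, cases);
console.log(failures.length); // 1: the regex stand-in leaves "&amp;" undecoded
console.log(failures);
```

Keeping the cases in a plain array makes it easy to add a new input whenever a bug or an unexpected string turns up.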