How to do word counts for a mixture of English and Chinese in JavaScript


Accurate Word Counting for Mixed English and Chinese Text in JavaScript


Learn how to implement robust word counting in JavaScript for documents containing both English and Chinese characters, addressing the unique challenges of each language.

Counting words in text might seem straightforward, but it becomes complex when dealing with a mixture of languages like English and Chinese. English words are typically delimited by spaces, while Chinese characters often form words without explicit separators. This article explores effective JavaScript techniques to accurately count 'words' in such mixed-language content, providing solutions that respect the linguistic nuances of both.

Understanding Word Definition Across Languages

Before diving into code, it's crucial to define what constitutes a 'word' for each language. For English, a word is generally a sequence of letters, often separated by spaces, punctuation, or line breaks. For Chinese, each character typically represents a syllable or a morpheme, and a 'word' can be one or more characters. For the purpose of a simple word count, treating each Chinese character as a 'word' is a common and practical approach, especially when a full-fledged natural language processing (NLP) library for Chinese segmentation is not feasible or desired.

flowchart TD
    A[Input Text] --> B{Is Character English?}
    B -- Yes --> C["Count One Word per English Run"]
    B -- No --> D{Is Character Chinese?}
    D -- Yes --> E["Count as Chinese Character (Word)"]
    D -- No --> F["Ignore (Punctuation/Space)"]
    C --> G[Aggregate Counts]
    E --> G
    F --> G
    G --> H[Total Word Count]

Decision flow for mixed-language word counting logic.

Implementing a Basic Mixed-Language Word Counter

A common strategy is to use regular expressions to differentiate between English words and Chinese characters. English words can be counted by matching runs of ASCII letters, digits, and underscores. Chinese characters can be counted by matching the Unicode range U+4E00 to U+9FFF, the CJK (Chinese, Japanese, Korean) Unified Ideographs block. The total word count is the sum of these two counts.

function countMixedWords(text) {
  let englishWordCount = 0;
  let chineseCharCount = 0;

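  // Regex for English words: runs of ASCII letters, digits, and underscores.
  // Apostrophes and hyphens act as boundaries, so "don't" counts as two words.
  const englishWords = text.match(/\b[a-zA-Z0-9_]+\b/g);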
  if (englishWords) {
    englishWordCount = englishWords.length;
  }

  // Regex for Chinese characters (Unicode range U+4E00 to U+9FFF)
  // This range covers most common CJK Unified Ideographs
  const chineseChars = text.match(/[\u4E00-\u9FFF]/g);
  if (chineseChars) {
    chineseCharCount = chineseChars.length;
  }

  return englishWordCount + chineseCharCount;
}

// Example usage:
const text1 = "Hello world, this is a test. 你好世界，这是一个测试。";
console.log(`'${text1}' has ${countMixedWords(text1)} words.`); // 6 English + 10 Chinese = 16

const text2 = "JavaScript is fun. 编程很有趣。";
console.log(`'${text2}' has ${countMixedWords(text2)} words.`); // 3 English + 5 Chinese = 8

const text3 = "Only English words here.";
console.log(`'${text3}' has ${countMixedWords(text3)} words.`); // 4

const text4 = "只有中文。";
console.log(`'${text4}' has ${countMixedWords(text4)} words.`); // 4

JavaScript function to count words in mixed English and Chinese text.
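
One limitation of the pattern above is that apostrophes and hyphens act as word boundaries, so a contraction such as "don't" is counted as two words. The following is a minimal sketch of one possible adjustment. The helper name countEnglishWords and the decision to join parts on an apostrophe (straight or curly) or a hyphen are assumptions made for illustration, not part of the original function.

```javascript
// Hypothetical helper: counts English words while keeping contractions and
// hyphenated compounds together (an assumed convention; adjust as needed).
function countEnglishWords(text) {
  // A run of word characters, optionally followed by more runs that are
  // each joined by a single apostrophe (straight or curly) or hyphen.
  const pattern = /[A-Za-z0-9_]+(?:['\u2019-][A-Za-z0-9_]+)*/g;
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}

console.log(countEnglishWords("don't stop"));              // 2
console.log(countEnglishWords("state-of-the-art design")); // 2
console.log(countEnglishWords("Hello世界"));                // 1
```

Whether a hyphenated compound should count as one word or several depends on the counting convention you need to match, so treat this as a starting point rather than a rule.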

💡

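The Unicode range [\u4E00-\u9FFF] is a common approximation for Chinese characters. It is the CJK Unified Ideographs block, which covers the vast majority of everyday Chinese characters but is shared with Japanese kanji, so it cannot tell the two languages apart. For more comprehensive coverage, you might need to include additional CJK blocks, such as Extension A (U+3400 to U+4DBF) or the extensions above U+FFFF, which require the regex u flag to match as single characters.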
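
As a sketch of broader coverage, a Unicode script property escape can replace the fixed range. This assumes a runtime that supports ES2018 property escapes with the u flag; the helper name countHanCharacters is hypothetical.

```javascript
// Hypothetical helper: counts Han characters via a Unicode script property
// instead of a fixed code point range. Requires the `u` flag (ES2018+).
function countHanCharacters(text) {
  const matches = text.match(/\p{Script=Han}/gu);
  return matches ? matches.length : 0;
}

const rareChar = "\u{20000}"; // CJK Extension B, outside U+4E00 to U+9FFF
console.log(countHanCharacters("你好世界")); // 4
console.log(countHanCharacters(rareChar));   // 1
console.log((rareChar.match(/[\u4E00-\u9FFF]/g) || []).length); // 0
```

Script=Han may also match Han-script characters that are not ideographs in the narrow sense, such as radicals, and like the fixed range it does not distinguish Chinese from Japanese kanji.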

Refining the Word Counting Logic

The previous example runs two separate regular expressions over the text. An alternative is to walk through the text once, character by character, tracking whether the current position is inside an English word. Because \b in JavaScript treats Chinese characters as non-word characters, both versions return the same counts when English words and Chinese characters are directly adjacent without spaces; the single-pass version simply makes that boundary handling explicit and is easier to extend with additional character classes.

function countMixedWordsRefined(text) {
  // Compile the character tests once instead of on every iteration
  const englishCharPattern = /[a-zA-Z0-9_]/;
  const chineseCharPattern = /[\u4E00-\u9FFF]/; // common CJK range
  let wordCount = 0;
  let inEnglishWord = false;

  for (let i = 0; i < text.length; i++) {
    const char = text[i];

    if (englishCharPattern.test(char)) {
      // Count an English word once, at its first character
      if (!inEnglishWord) {
        wordCount++;
        inEnglishWord = true;
      }
    } else {
      // A Chinese character, space, or punctuation ends any English word
      inEnglishWord = false;
      if (chineseCharPattern.test(char)) {
        wordCount++; // Each Chinese character counts as a word
      }
    }
  }
  return wordCount;
}

// Example usage:
const text5 = "Hello世界, this是a test. 你好吗?";
console.log(`'${text5}' has ${countMixedWordsRefined(text5)} words (refined).`); // 4 English + 6 Chinese = 10

const text6 = "JavaScript编程很有趣。";
console.log(`'${text6}' has ${countMixedWordsRefined(text6)} words (refined).`); // 1 English + 5 Chinese = 6

A refined JavaScript function for mixed-language word counting using character-by-character iteration.
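
If a breakdown per language is useful, the same counting rule can also be expressed as a single regular expression with an alternation. This is a sketch only; the function name countMixedWordsDetailed and the shape of the returned object are assumptions for illustration.

```javascript
// Hypothetical helper: one regex pass with an alternation, returning a
// per-language breakdown instead of a single number.
function countMixedWordsDetailed(text) {
  // Either a run of ASCII word characters or a single CJK ideograph.
  const tokens = text.match(/[a-zA-Z0-9_]+|[\u4E00-\u9FFF]/g) || [];
  let english = 0;
  let chinese = 0;
  for (const token of tokens) {
    if (/^[\u4E00-\u9FFF]$/.test(token)) {
      chinese++;
    } else {
      english++;
    }
  }
  return { english, chinese, total: english + chinese };
}

console.log(countMixedWordsDetailed("Hello世界, this是a test. 你好吗?"));
// { english: 4, chinese: 6, total: 10 }
```

Returning the two counts separately makes it easier to apply different weighting later, for example if a style guide counts Chinese characters and English words under different limits.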

⚠️

The refined method handles adjacent English and Chinese characters better, but it still relies on a simplified definition of a 'word' for Chinese. For true linguistic accuracy in Chinese, a dedicated segmentation library (e.g., jieba-js for Node.js) would be necessary, which is beyond the scope of a simple client-side JavaScript solution.
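
As an alternative to a third-party library, runtimes that ship the built-in Intl.Segmenter API can perform dictionary-based word segmentation without extra dependencies. The sketch below assumes Intl.Segmenter is available with Chinese locale data and returns null otherwise so the caller can fall back to one of the counters above. The function name countWordsWithSegmenter and the default locale "zh" are assumptions for illustration.

```javascript
// Hypothetical helper: counts word-like segments using Intl.Segmenter.
// Returns null when the runtime does not provide the API.
function countWordsWithSegmenter(text, locale = "zh") {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null; // Caller should fall back to a regex-based counter
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const part of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation
    if (part.isWordLike) {
      count++;
    }
  }
  return count;
}

console.log(countWordsWithSegmenter("Hello world, 你好世界"));
```

The exact count for Chinese text depends on the segmentation data bundled with the runtime, so results can differ between browsers and Node.js versions and will usually be lower than a per-character count, because multi-character words are grouped.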