RegEx for Ukrainian letters. How to separate Cyrillic words by capital letter?

Mastering Regular Expressions for Ukrainian Text: Separating Words by Capital Letters

Learn how to use regular expressions in JavaScript to accurately identify and separate Ukrainian words, especially when dealing with proper nouns or sentence beginnings marked by capital letters.

Processing text in languages with non-Latin alphabets, such as Ukrainian, often presents unique challenges for developers. One common task is to accurately segment text into individual words, particularly when a word starts with a capital letter, indicating a proper noun or the beginning of a sentence. This article will guide you through crafting effective regular expressions in JavaScript to achieve this for Ukrainian Cyrillic text.

Understanding Ukrainian Cyrillic in Regular Expressions

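The key to working with Ukrainian text in regular expressions is defining the set of letters correctly. In JavaScript, \w is limited to [A-Za-z0-9_], and adding the u flag does not extend it to Cyrillic, so it never matches a Ukrainian letter. The \b word boundary is defined in terms of \w, so it does not find word boundaries in Cyrillic text either. There are two ways around this: list the Ukrainian letters explicitly in a character class, or use Unicode property escapes such as \p{L}, \p{Lu}, and \p{Ll}, which require the u flag. If you write the class by hand, a plain А-Я range is not enough: Ґ, Є, І, and Ї (and their lowercase forms) sit outside that range in Unicode.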
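The two approaches can be compared side by side. The snippet below is a minimal sketch; the sample string and the variable names are illustrative, and the hand-written class assumes you want only the 33 letters of the modern Ukrainian alphabet.

```javascript
// Illustrative sample containing letters outside the basic А-Я range (Ї, і, ґ).
const sample = "Їжак і ґанок";

// \w is [A-Za-z0-9_] in JavaScript, so it finds nothing in Cyrillic text.
console.log(/\w/u.test(sample)); // false

// Option 1: spell out the Ukrainian alphabet. А-Щ and а-щ are contiguous
// ranges; Ь, Ю, Я, Ґ, Є, І, Ї and their lowercase forms are listed one by one.
const ukrainianLetters = /[А-ЩЬЮЯҐЄІЇа-щьюяґєії]+/g;

// Option 2: Unicode property escapes (need the 'u' flag). \p{L} accepts
// letters from every script, not only Cyrillic.
const anyLetters = /\p{L}+/gu;

console.log(sample.match(ukrainianLetters)); // ["Їжак", "і", "ґанок"]
console.log(sample.match(anyLetters));       // ["Їжак", "і", "ґанок"]
```

The explicit class is stricter (it rejects Latin letters and Cyrillic letters that Ukrainian does not use), while the property escape is shorter and harder to get wrong.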

Identifying Capitalized Ukrainian Words

To extract words that start with a capital letter, we need a regex that matches one uppercase letter followed by one or more lowercase letters. This pattern distinguishes a word like 'Київ' (Kyiv) from 'річка' (river). Because it requires at least one lowercase letter, it skips single-letter capitalized words such as 'Я' and all-caps words such as 'НАТО'.

flowchart TD
    A[Input String] --> B{Find Capital Ukrainian Letter?}
    B -- Yes --> C[Match One or More Lowercase Ukrainian Letters?]
    C -- Yes --> D[Extract Word]
    C -- No --> B
    B -- No --> E[End]

Flowchart for identifying capitalized Ukrainian words using regex.

const text = "Київ - столиця України. Дніпро - велика річка.";

// Regex to match words starting with a capital letter
// \p{Lu} matches an uppercase letter and \p{Ll} a lowercase letter in any script,
// Cyrillic included; the 'u' flag enables Unicode property escapes
const regex = /\p{Lu}\p{Ll}+/gu;

const capitalizedWords = text.match(regex);

console.log(capitalizedWords);
// Expected output: ["Київ", "України", "Дніпро"]

JavaScript regex using Unicode property escapes to find capitalized Ukrainian words.
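Since \p{Lu} and \p{Ll} are not tied to one script, the same pattern also picks up capitalized Latin words in mixed text. One possible way to narrow it, sketched below, is to pair each case check with \p{Script=Cyrillic} through a lookahead. The function name and sample sentence are hypothetical, and the script property covers all Cyrillic letters, not only the Ukrainian ones.

```javascript
// Hypothetical helper: capitalized words written entirely in Cyrillic.
function findCapitalizedCyrillic(text) {
  // (?=\p{Lu}) checks the case, \p{Script=Cyrillic} then consumes the letter
  // only if it belongs to the Cyrillic script; same idea for lowercase.
  const pattern =
    /(?=\p{Lu})\p{Script=Cyrillic}(?:(?=\p{Ll})\p{Script=Cyrillic})+/gu;
  return text.match(pattern) ?? [];
}

console.log(findCapitalizedCyrillic("Київ і London, Львів та Paris"));
// ["Київ", "Львів"]
```

If the text must be limited to Ukrainian letters specifically, an explicit character class for the alphabet is the stricter choice.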

Separating All Words by Capital Letter

If the goal is to split a run-together string into an array of words, where each new word starts with a capital letter, we can use a split operation with a lookahead assertion. The lookahead is zero-width, so the split happens just before each capital letter and the letter stays attached to the word it begins.

const text = "ЦеПрикладТекстуЗКількамаСловами";

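// Regex to split before any capital letter (\p{Lu} covers Ukrainian capitals)
// (?!^) rules out a split at the very beginning of the string; JavaScript's
// split() already ignores an empty match there, so this is only a safeguard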
const splitRegex = /(?=\p{Lu})(?!^)/gu;

const separatedWords = text.split(splitRegex);

console.log(separatedWords);
// Expected output: ["Це", "Приклад", "Тексту", "З", "Кількама", "Словами"]

Splitting a string into words based on preceding capital letters using a lookahead assertion.
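Splitting before every capital letter breaks acronyms into single letters. A possible refinement, shown here as a sketch with a hypothetical helper name and sample string, splits only where the case actually changes. It relies on lookbehind assertions, which assumes an ES2018-capable engine.

```javascript
// Hypothetical helper: split on case boundaries, keeping runs of capitals
// (acronyms) together.
function splitOnCaseBoundaries(text) {
  // First alternative: lowercase followed by uppercase ("пД" in "ВступДо").
  // Second alternative: uppercase followed by an uppercase that starts a
  // normal word ("ОС" in "НАТОСьогодні").
  const boundary = /(?<=\p{Ll})(?=\p{Lu})|(?<=\p{Lu})(?=\p{Lu}\p{Ll})/u;
  return text.split(boundary);
}

console.log(splitOnCaseBoundaries("ВступДоНАТОСьогодні"));
// ["Вступ", "До", "НАТО", "Сьогодні"]

// For comparison, the plain lookahead split separates every capital:
console.log("ВступДоНАТОСьогодні".split(/(?=\p{Lu})(?!^)/u));
// ["Вступ", "До", "Н", "А", "Т", "О", "Сьогодні"]
```

The trade-off is that a single-letter word directly after an acronym stays glued to it, so this variant suits text where acronyms are more common than one-letter words.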

Handling Edge Cases and Variations

While the above examples cover common scenarios, real-world text can be more complex. Ukrainian words can contain an apostrophe ('п'ять') or a hyphen ('Івано-Франківськ'), and text often mixes in numbers, punctuation, acronyms, or Latin words. Decide which of these belong inside a word for your use case, then adjust the regex to include or exclude them.


const textWithPunctuation = "Привіт, Світ! Як справи?";

// To match all words (including those starting with lowercase)
// and leave punctuation out, extract runs of letters first
// and then filter/process.
// \p{L}+ matches any letter sequence in any script, not only Ukrainian
const allUkrainianWordsRegex = /\p{L}+/gu;

const words = textWithPunctuation.match(allUkrainianWordsRegex);
console.log(words);
// Expected output: ["Привіт", "Світ", "Як", "справи"]

const mixedCaseText = "JavaScriptЦеКруто";
const splitMixedCase = mixedCaseText.split(/(?=\p{Lu})(?!^)/gu);
console.log(splitMixedCase);
// Expected output: ["Java", "Script", "Це", "Круто"]

Examples demonstrating regex for all Ukrainian words and mixed-case splitting.
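A plain \p{L}+ pattern cuts Ukrainian words at apostrophes and hyphens. The sketch below extends it to keep those characters when they sit between letters. The helper name is hypothetical, and the set of accepted apostrophes (the ASCII ' and the typographic U+2019) is an assumption about the input.

```javascript
// Hypothetical helper: extract words, keeping apostrophes and hyphens that
// appear between letters.
function extractWords(text) {
  // One run of letters, optionally followed by (apostrophe or hyphen) plus
  // another run of letters, repeated as needed.
  const wordPattern = /\p{L}+(?:['\u2019-]\p{L}+)*/gu;
  return text.match(wordPattern) ?? [];
}

console.log(extractWords("Мар'яна з Івано-Франківська купила п'ять яблук."));
// ["Мар'яна", "з", "Івано-Франківська", "купила", "п'ять", "яблук"]

// A dash surrounded by spaces is still treated as punctuation:
console.log(extractWords("Київ - столиця України."));
// ["Київ", "столиця", "України"]
```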