How to make MySQL handle UTF-8 properly

Mastering UTF-8 in MySQL: A Comprehensive Guide

Learn how to configure MySQL databases, tables, and connections to correctly handle UTF-8 characters, preventing common encoding issues and ensuring data integrity.

UTF-8 is the most common character encoding for the web, supporting a vast range of characters from different languages. However, getting MySQL to handle UTF-8 correctly can sometimes be a challenge, leading to garbled text (mojibake) or data loss. This article provides a comprehensive guide to properly configure your MySQL environment for robust UTF-8 support, from server settings to client connections.

Understanding MySQL Character Sets and Collations

Before diving into configuration, it's crucial to understand the difference between character sets and collations in MySQL:

  • Character Set: Defines the set of characters that can be stored and how they are encoded (e.g., utf8, utf8mb4). In MySQL, utf8 is an alias for utf8mb3, which stores at most 3 bytes per character and therefore cannot hold 4-byte characters such as emojis. utf8mb4 is the recommended character set for full UTF-8 support.
  • Collation: Defines the rules for comparing and sorting characters within a character set (e.g., utf8mb4_unicode_ci, utf8mb4_general_ci). The suffix _ci means case-insensitive, _cs means case-sensitive, and _bin means comparison by binary value.

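MySQL applies character sets and collations at several levels: server, database, table, and column, plus a separate set of settings for each client connection. Each level's default is used only when the level below it does not specify its own, so consistency across these levels is key for proper UTF-8 handling.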
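To make the 3-byte versus 4-byte distinction concrete, the following Python sketch counts how many bytes UTF-8 needs for a few sample characters. The helper names `utf8_byte_length` and `fits_in_utf8mb3` and the sample strings are illustrative assumptions, not part of MySQL. Any character that needs 4 bytes cannot be stored in MySQL's legacy utf8 character set (an alias for utf8mb3, limited to 3 bytes per character).

```python
def utf8_byte_length(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_utf8mb3(text: str) -> bool:
    """True if every character needs at most 3 bytes (the utf8mb3 limit)."""
    return all(utf8_byte_length(ch) <= 3 for ch in text)


# Illustrative samples: ASCII, accented Latin, currency sign, emoji.
samples = ["A", "é", "€", "😀"]
for ch in samples:
    print(f"U+{ord(ch):04X} needs {utf8_byte_length(ch)} byte(s)")

print(fits_in_utf8mb3("naïve café"))    # True
print(fits_in_utf8mb3("thumbs up 👍"))  # False
```

A check like this could be run over existing application data to estimate whether it already contains characters that only utf8mb4 can store.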

flowchart TD
    A[Client Application] --> B[MySQL Server Connection]
    B --> C[MySQL Database]
    C --> D[MySQL Table]
    D --> E[MySQL Column]

    subgraph Character Set & Collation Hierarchy
        E -- "Inherits from" --> D
        D -- "Inherits from" --> C
        C -- "Inherits from" --> F[MySQL Server]
    end

    F["Server Default (my.cnf)"]
    B["Connection (SET NAMES)"]
    C["Database (CREATE DATABASE)"]
    D["Table (CREATE TABLE)"]
    E["Column (column definition)"]

    style F fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#bfb,stroke:#333,stroke-width:2px
    style D fill:#ffb,stroke:#333,stroke-width:2px
    style E fill:#fbb,stroke:#333,stroke-width:2px

MySQL Character Set and Collation Hierarchy

Configuring MySQL for UTF-8 (utf8mb4)

The most robust way to ensure full UTF-8 support is to use the utf8mb4 character set. This involves configuring your MySQL server, databases, tables, and columns. It's also crucial to ensure your client connection is properly set.

1. Server Configuration (my.cnf or my.ini)

Edit your MySQL configuration file (my.cnf on Linux/macOS or my.ini on Windows) to set the server's default character set and collation. These defaults apply only to databases and tables created afterwards without an explicit character set; existing data is not converted.

[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

[client]
default-character-set=utf8mb4

[mysql]
default-character-set=utf8mb4

Example my.cnf configuration for UTF-8

After modifying my.cnf or my.ini, restart your MySQL server for the changes to take effect.

2. Database Configuration

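When creating a new database, explicitly specify utf8mb4 and a suitable collation. For an existing database, ALTER DATABASE changes only the default used by tables created afterwards; it does not convert existing tables, which must be altered individually as described in the next step.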

-- Create a new database with UTF-8 support
CREATE DATABASE my_database
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;

-- Alter an existing database
ALTER DATABASE my_database
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;

SQL commands for database character set configuration

3. Table and Column Configuration

For new tables, specify the character set and collation. For existing tables, you'll need to alter them. Columns inherit the table default unless they declare their own, so setting the character set explicitly on string columns gives the most granular control.

-- Create a new table with UTF-8 columns
CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
    email VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
    bio TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
);

-- Convert an existing table: changes the table default and re-encodes all character columns
ALTER TABLE products CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Or alter specific columns (MODIFY COLUMN requires restating the full column definition)
ALTER TABLE products
    MODIFY COLUMN product_name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
    MODIFY COLUMN description TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

SQL commands for table and column character set configuration
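When many tables need converting, the statements can be generated instead of typed by hand. The sketch below builds one ALTER TABLE ... CONVERT TO statement per table name. The function names `quote_identifier` and `convert_statements` and the table list are hypothetical; the code only produces SQL text and does not connect to a server.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def convert_statements(tables, charset="utf8mb4", collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    return [
        f"ALTER TABLE {quote_identifier(t)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for t in tables
    ]


# Hypothetical table names; in practice they would come from
# information_schema.TABLES for the schema being migrated.
for stmt in convert_statements(["users", "products"]):
    print(stmt)
```

Reviewing the generated statements before running them, and testing on a backup first, is advisable because converting a table rewrites its data.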

4. Client Connection Configuration

Even if your server, database, and tables are perfectly configured, data can still be corrupted if your client application doesn't communicate using the correct character set. This is often the source of 'mojibake'. You need to tell MySQL what character set your client is using.

-- Execute this immediately after connecting to the database
SET NAMES 'utf8mb4';

SQL command to set client connection character set
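The effect of a mismatched connection character set can be reproduced without a database. In this sketch, UTF-8 bytes are deliberately misread as latin1, approximated here with Python's cp1252 codec (an assumption; MySQL's latin1 is closely based on cp1252 but not identical for every byte). The sample word is illustrative.

```python
original = "café"

# The application sends UTF-8 bytes...
wire_bytes = original.encode("utf-8")

# ...but the connection is treated as latin1, so each byte of a
# multi-byte sequence is interpreted as a separate character.
misread = wire_bytes.decode("cp1252")
print(misread)  # cafÃ©

# Reversing the wrong step recovers the text, which is why this kind of
# mojibake is often repairable as long as no bytes were lost.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

Seeing "Ã©" where "é" was expected is therefore a strong hint that the data is UTF-8 but one side of the connection assumed a single-byte character set.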

Most programming languages and ORMs provide a way to set the character set when establishing a connection. Here are examples for common languages:

PHP (PDO)

$dsn = 'mysql:host=localhost;dbname=my_database;charset=utf8mb4';
$pdo = new PDO($dsn, 'username', 'password');

Python (mysql-connector-python)

import mysql.connector

cnx = mysql.connector.connect(
    host='localhost',
    database='my_database',
    user='username',
    password='password',
    charset='utf8mb4'
)

Java (JDBC)

String url = "jdbc:mysql://localhost:3306/my_database?useUnicode=true&characterEncoding=UTF-8&connectionCollation=utf8mb4_unicode_ci";
Connection conn = DriverManager.getConnection(url, "username", "password");

Node.js (mysql2)

const mysql = require('mysql2');

const connection = mysql.createConnection({
  host: 'localhost',
  user: 'username',
  password: 'password',
  database: 'my_database',
  charset: 'utf8mb4'
});

Verifying UTF-8 Configuration

After making changes, it's essential to verify that everything is correctly configured. You can use SQL queries to inspect the character sets and collations at various levels.

-- Check server character set variables
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';

-- Check database character set and collation
SELECT default_character_set_name, default_collation_name
FROM information_schema.SCHEMATA WHERE schema_name = 'my_database';

-- Check table character set and collation
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'my_database';

-- Check column character set and collation
SELECT COLUMN_NAME, CHARACTER_SET_NAME, COLLATION_NAME
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'my_database' AND TABLE_NAME = 'users' AND CHARACTER_SET_NAME IS NOT NULL;

-- Check current connection character set
SHOW VARIABLES LIKE 'character_set_connection';
SHOW VARIABLES LIKE 'collation_connection';
SHOW VARIABLES LIKE 'character_set_results';
SHOW VARIABLES LIKE 'character_set_client';

SQL queries to verify UTF-8 configuration
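The output of SHOW VARIABLES LIKE 'character_set%' can also be checked programmatically. The sketch below takes rows as (name, value) pairs and reports any character set variable that is not utf8mb4. The function name `find_mismatches` and the sample rows are made up for illustration; real rows would come from whatever client library the application uses.

```python
# Variables expected to differ from utf8mb4 even on a correctly configured
# server: character_set_system is MySQL's internal utf8 (utf8mb3), and
# character_set_filesystem defaults to binary.
EXPECTED_EXCEPTIONS = {"character_set_system", "character_set_filesystem"}


def find_mismatches(rows, expected_charset="utf8mb4"):
    """Return {variable: value} for character set variables not using expected_charset."""
    mismatches = {}
    for name, value in rows:
        # Skip unrelated rows such as character_sets_dir (a directory path).
        if not name.startswith("character_set_") or name in EXPECTED_EXCEPTIONS:
            continue
        if value != expected_charset:
            mismatches[name] = value
    return mismatches


# Made-up sample data shaped like SHOW VARIABLES output.
sample_rows = [
    ("character_set_client", "utf8mb4"),
    ("character_set_connection", "latin1"),
    ("character_set_database", "utf8mb4"),
    ("character_set_filesystem", "binary"),
    ("character_set_results", "utf8mb4"),
    ("character_set_server", "utf8mb4"),
    ("character_set_system", "utf8mb3"),
    ("character_sets_dir", "/usr/share/mysql/charsets/"),
]
print(find_mismatches(sample_rows))  # {'character_set_connection': 'latin1'}
```

An empty result suggests the connection and server defaults agree; a non-empty one points at the level that still needs attention.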

By following these steps, you can ensure that your MySQL environment is fully equipped to handle UTF-8 characters, including emojis and a wide range of international text, preventing common encoding pitfalls and maintaining data integrity.