33 – Normalise strings by removing accents

33 – Normalise strings by removing accents#

My name is “Rodrigo Girão Serrão” and the “~” on top of the As are standard in Portuguese… And just like the “~”, there are hundreds of other accents and weird marks used by hundreds of other languages!

If you don’t want any of it, you can write a short Python function that gets rid of those:

import unicodedata

def remove_accents(string):
    return  "".join(
        char for char in unicodedata.normalize("NFD", string)
        if unicodedata.category(char) != "Mn"
    )

This function can be useful when writing a “slugify” function, for example:

def slugify(string):
    return remove_accents(string).lower().replace(" ", "-")

print(slugify("Rodrigo Girão Serrão"))  # rodrigo-girao-serrao

The function remove_accents leverages the built-in module unicodedata, which provides tools to work with the Unicode standard.

(In case you are wondering, the call unicodedata.normalize("NFD", string) separates the accents from the letters:)

print(list(unicodedata.normalize("NFD", "äáàãñ")))
# ['a', '̈', 'a', '́', 'a', '̀', 'a', '̃', 'n', '̃']