243 – Split text into words
When you need to split text into words, the typical solution of calling the string method split leaves punctuation attached to the words:
text = "Hello, there!"
print(text.split())
# ['Hello,', 'there!']
A more robust approach uses the regular expression function re.split and the special regex sequence \W:
import re
print(re.split(r"\W+", text))
# ['Hello', 'there', '']
The sequence \W matches any non-word character, so your final list will only contain word strings: runs of letters, digits, and underscores.
You can tweak the regular expression to match your expectation of what a word must be.
The final empty string '' shows up because the original text ends with a separator.
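Two possible tweaks are sketched below. The first filters out the empty strings that re.split leaves behind. The second uses re.findall with a pattern that describes what a word is, here one that keeps apostrophes inside words. The sample sentence and the variable names are illustrative assumptions, not part of the original tip.

```python
import re

text = "Hello, there!"

# Drop the empty strings that re.split leaves at the edges of the result.
words = [word for word in re.split(r"\W+", text) if word]
print(words)
# ['Hello', 'there']

# Alternative: describe what a word *is* and collect the matches.
# Here a word is a run of word characters, optionally joined by apostrophes.
contraction_text = "Don't stop, it's fine!"  # illustrative sample sentence
contractions = re.findall(r"\w+(?:'\w+)*", contraction_text)
print(contractions)
# ["Don't", 'stop', "it's", 'fine']
```

Matching words with re.findall never produces empty strings, so no filtering step is needed in that variant.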