How we turn our text into useful input that AI can understand: here we cover the hidden steps

Written by
Ala GARBAA
Full Stack AI Developer & Software Engineer
Did you know it's hard, like a real pain in the ass 😂, for computers > AI > models to understand text? Why?
Because the meaning of words can change with context, and there is no fixed link between a word and its meaning that we can rely on.
So to make useful input data for the AI, we convert the normal text into numbers. That way the models can treat those numbers as vectors and run algorithms on them to do the job.
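To make that concrete, here is a minimal sketch of the idea: build a tiny vocabulary, then represent each word as an integer ID that a model can later turn into a vector. The `corpus`, `vocab`, and `encode` names are just illustrative, not any real library's API.

```python
# Toy example: map words to integer IDs so a model can consume them.
corpus = ["the cat sat", "the dog sat"]

# Build a vocabulary: every unique word gets its own integer ID.
vocab = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in vocab:
            vocab[word] = len(vocab)

def encode(text):
    """Turn a sentence into the list of IDs a model actually consumes."""
    return [vocab[word] for word in text.split()]

print(vocab)                   # → {'the': 0, 'cat': 1, 'sat': 2, 'dog': 3}
print(encode("the dog sat"))   # → [0, 3, 2]
```

Real tokenizers work on subwords rather than whole words, but the principle is the same: text in, numbers out.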
For that reason we need clean text, so first things first: we normalize the text, which means we start with:
0- Turn the corpus into plain text.
1- Split the text into words, using spaces as delimiters for segmentation; each part is called a token.
2- Treat punctuation marks like ! ? ; , . as separate tokens.
3- Lowercase everything.
4- Lemmatization: we don't always apply this, but we may. It groups words with the same root, like changing "came" and "comes" to "come."
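The steps above can be sketched in a few lines of Python using only the standard library. The `LEMMAS` table here is a toy stand-in for a real lemmatizer (a real project would use something like NLTK or spaCy):

```python
import re

# Step 4 (optional): a tiny hand-made lemma table, for illustration only.
LEMMAS = {"came": "come", "comes": "come"}

def normalize(text):
    # Steps 1-2: split into words and peel punctuation off as its own tokens.
    tokens = re.findall(r"\w+|[!?;,.]", text)
    # Step 3: lowercase everything.
    tokens = [t.lower() for t in tokens]
    # Step 4: map inflected forms back to their root where we know it.
    return [LEMMAS.get(t, t) for t in tokens]

print(normalize("She came, and now he comes!"))
# → ['she', 'come', ',', 'and', 'now', 'he', 'come', '!']
```

Notice how the comma and the exclamation mark come out as their own tokens, and both "came" and "comes" collapse to "come".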
You can always play with the OpenAI tokenizer, go to: platform.openai.com/tokenizer
Yeah, that's it. Like that, we can turn our human text into something AI can deal with.