Training an LSTM on a text corpus: how to preprocess the text data
I am trying to train an LSTM on some given training text data with the goal of producing new texts given a sequence of words.
I'm not sure about the preprocessing of the given text data (simply a big text file of speeches) and would greatly appreciate some advice. My current approach is to tokenize the text and build numerical sequence vectors by assigning a unique integer value to every token (or word). Before that, there's some data cleanup, such as removing special characters and newlines and converting everything to lowercase. In the end, I have a dataset of numbers (indices) corresponding to the words, which I can feed as input to the LSTM network.
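To make the approach concrete, here is a minimal sketch of what I mean (the function name, regex, and sequence length are just illustrative choices, not fixed requirements):

```python
import re

def preprocess(text, seq_len=3):
    """Turn raw text into integer-encoded (sequence, next-word) pairs."""
    # Clean up: lowercase everything and treat newlines as spaces
    text = text.lower().replace("\n", " ")
    # Tokenize: keep only word-like runs of letters/apostrophes
    tokens = re.findall(r"[a-z']+", text)
    # Assign a unique integer index to every distinct word
    vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
    encoded = [vocab[word] for word in tokens]
    # Slide a window over the corpus to build training pairs:
    # input = seq_len consecutive word indices, target = the next word
    X = [encoded[i:i + seq_len] for i in range(len(encoded) - seq_len)]
    y = [encoded[i + seq_len] for i in range(len(encoded) - seq_len)]
    return X, y, vocab
```

The resulting `X`/`y` pairs would then be fed to the network (with the indices typically going through an embedding layer first).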
Should I tokenize at the character level instead, or are words good enough (for now it doesn't matter if the output is grammatically incorrect)?
Is this even a correct approach for an LSTM?