A paper, out last month and just accepted for this year’s International Conference on Learning Representations (ICLR) in April, describes just how difficult text summarization really is.
A few companies have had a crack at it. Salesforce trained a recurrent neural network with reinforcement learning to take information and retell it in a nutshell, and the results weren’t bad.
However, the computer-generated sentences were simple and short; they lacked the creative flair and rhythm of text written by humans. Google Brain’s latest effort is slightly better: the sentences are longer and seem more natural.
The model works by taking the top ten web pages for a given subject – excluding the Wikipedia entry – or scraping information from the links in the references section of a Wikipedia article. Most of the selected pages are used for training, and a few are kept back to develop and test the system.
The paragraphs from each page are ranked, and the text from all the pages is combined to create one long document. The text is then encoded and shortened by splitting it into 32,000 individual words, which are used as input.
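For readers who think better in code, the extraction step can be roughed out in a few lines of Python. This is a minimal sketch, not the researchers' implementation: the ranking rule (counting how often the subject's words appear in a paragraph), the whitespace-style tokenizer, and the names `tokenize`, `score`, `build_input` and `max_tokens` are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a stand-in for the real system's encoding."""
    return re.findall(r"[a-z0-9']+", text.lower())


def score(paragraph, topic):
    """Assumed ranking rule: how often the topic's words occur in the paragraph."""
    counts = Counter(tokenize(paragraph))
    return sum(counts[word] for word in set(tokenize(topic)))


def build_input(pages, topic, max_tokens):
    """Rank paragraphs from all pages, join them, and truncate to max_tokens."""
    # Each page is a list of paragraph strings (hypothetical data layout).
    paragraphs = [p for page in pages for p in page]
    # sorted() is stable, so equally scored paragraphs keep their page order.
    ranked = sorted(paragraphs, key=lambda p: score(p, topic), reverse=True)
    tokens = [tok for p in ranked for tok in tokenize(p)]
    return tokens[:max_tokens]


if __name__ == "__main__":
    pages = [
        ["Cats are small mammals.", "The weather was nice."],
        ["A cat is a cat, and cats purr."],
    ]
    print(build_input(pages, "cat", max_tokens=5))
```

The truncation on the last line of `build_input` is where the bottleneck shows up: whatever falls outside the token budget never reaches the next stage, however relevant it might be.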
This input is then fed into an abstractive model, where the long sentences are cut shorter. It’s a clever trick used to both create and summarize text. The generated sentences are taken from the earlier extraction phase and aren’t built from scratch, which explains why the structure is pretty repetitive and stiff.
Mohammad Saleh, co-author of the paper and a software engineer in Google AI’s team, told The Register: “The extraction phase is a bottleneck that determines which parts of the input will be fed to the abstraction stage. Ideally, we would like to pass all the input from reference documents.
“Designing models and hardware that can support longer input sequences is currently an active area of research that can alleviate these limitations.”
We are still a very long way off from effective text summarization or generation. And while the Google Brain project is rather interesting, it would probably be unwise to use a system like this to automatically generate Wikipedia entries. For now, anyway.