The Life-Changing Magic of Tidying Text in an R package

Posted on February 14, 2017 by Robin Edgar

As described by Hadley Wickham, tidy data has a specific structure:

each variable is a column
each observation is a row
each type of observational unit is a table

This means we end up with a data set that is in a long, skinny format instead of a wide format. Tidy data sets are easier to work with, and this is no less true when one starts to work with text. Most of the tooling and infrastructure needed for text mining with tidy data frames already exists in packages like dplyr, broom, tidyr, and ggplot2. Our goal in writing the tidytext package is to provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages.
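To make the long, skinny shape concrete, here is a minimal sketch of a one-token-per-row layout in plain Python. It is not the tidytext API; the function name `to_tidy_tokens`, the sample sentences, and the simple regex tokenizer are all assumptions made for illustration.

```python
import re

def to_tidy_tokens(rows):
    """Turn (line_number, text) rows into one (line_number, word) row per token.

    Hypothetical helper for illustration only; not part of tidytext.
    """
    tidy = []
    for line_number, text in rows:
        # Lowercase, then keep runs of letters and apostrophes as tokens.
        for word in re.findall(r"[a-z']+", text.lower()):
            tidy.append((line_number, word))
    return tidy

# Wide-ish input: one row per line of text.
rows = [
    (1, "Tidy data sets are easier to work with"),
    (2, "Each observation is a row"),
]

# Long, skinny output: one row per (line, word) observation.
tidy_rows = to_tidy_tokens(rows)
for row in tidy_rows:
    print(row)
```

Each output row is a single observation (one word on one line), which is the shape that grouping, counting, and joining tools work with most naturally.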

Source: The Life-Changing Magic of Tidying Text

text2vec

Posted on February 14, 2017 by Robin Edgar

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

Goals we aimed to achieve in developing text2vec:

Concise – expose as few functions as possible
Consistent – expose unified interfaces, so there is no need to learn a new interface for each task
Flexible – make it easy to solve complex tasks
Fast – maximize efficiency per single thread, transparently scale to multiple threads on multicore machines
Memory efficient – use streams and iterators, and avoid keeping data in RAM where possible
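The memory-efficiency goal can be illustrated with a small sketch in plain Python. This is not text2vec's API; `iter_documents`, `build_vocabulary`, and the three-document corpus are hypothetical names and data chosen for the example. The idea shown is only the general one: consume documents one at a time from an iterator and keep just the running term counts in memory.

```python
from collections import Counter

def iter_documents(lines):
    """Yield one document at a time instead of building a full list.

    Hypothetical helper; in practice `lines` could be an open file handle.
    """
    for line in lines:
        yield line.strip()

def build_vocabulary(documents):
    """Count terms while consuming the iterator; only the counts stay in memory."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts

# Stand-in for a large corpus that would normally be read lazily from disk.
corpus = ["the cat sat", "the dog sat", "the end"]

vocabulary = build_vocabulary(iter_documents(corpus))
print(vocabulary.most_common(2))  # [('the', 3), ('sat', 2)]
```

Because the generator hands over one document at a time, peak memory depends on the vocabulary size rather than on the size of the corpus.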

Source: text2vec