Characterizing the Difficulty of Natural Language Datasets for Machine Learning