We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10. You can download it here.
We decided to take a closer look at the dataset to see how the top-100K lists changed between Jun09 and Apr10. Our findings are interesting:
- The union of the two word sets is just shy of 110K. Since each list holds 100K words, this means that nearly 10% of the words turned over: just under 10K words entered the top 100K and as many fell out. This is a turnover rate higher than I expected.
- Some words that are newly in the top list are what you'd expect (unigram log10 probability difference shown parenthetically):
  - espnlosangeles (21.88993), an ESPN satellite established during 2009
  - debate2010 (21.53613)
- Some words took a predictable jump:
  - ipad (2.560667), a product introduced mid-year
- Quite a few words newly in the mix are not conversational words:
  - childreplyhtml (22.09848)
  - focaladvid (21.76564)
Curious indeed.
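For readers who want to try a similar comparison, here is a minimal sketch in Python. It assumes each top list has already been loaded into a dictionary mapping a word to its unigram log10 probability; the function name `compare_top_lists` and the toy numbers are made up for illustration and are not the actual dataset values or the tooling we used. The sketch reports a log-probability difference only for words present in both lists and does not guess at a probability for words missing from one of them.

```python
def compare_top_lists(old, new):
    """Compare two top-N word lists.

    old, new: dicts mapping word -> unigram log10 probability
    (a hypothetical in-memory form of the downloadable lists).
    """
    old_words, new_words = set(old), set(new)
    entered = new_words - old_words   # words newly in the top list
    dropped = old_words - new_words   # words that fell out
    shared = old_words & new_words    # words present in both lists
    # Log10 probability difference, only where both values are known.
    jumps = {w: new[w] - old[w] for w in shared}
    return {
        "union_size": len(old_words | new_words),
        "entered": entered,
        "dropped": dropped,
        # Fraction of the new list that was not in the old one.
        "turnover": len(entered) / len(new_words) if new_words else 0.0,
        "jumps": jumps,
    }


# Toy data, invented for illustration only.
jun09 = {"the": -1.5, "ipad": -7.0, "zune": -6.0}
apr10 = {"the": -1.5, "ipad": -4.5, "debate2010": -6.2}

result = compare_top_lists(jun09, apr10)
for word, diff in sorted(result["jumps"].items(), key=lambda kv: -kv[1]):
    print(word, diff)
```

With two lists of 100K words each, a union of about 110K implies an intersection of about 90K, which is where the roughly 10% turnover figure comes from.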
Comments
- Anonymous
September 22, 2011
> childreplyhtml This suggests the data is dirty in some respect, doesn't it? Humans don't use that sort of word; that's obviously some sort of HTML source fragment creeping into the n-grams.