Today's post was delayed slightly, but we have good news: additional language model datasets are now available. As always, the easiest way to get a list is to navigate to https://web-ngram.research.microsoft.com/rest/lookup.svc. The new items are shown below, in URN form:
urn:ngram:bing-title:apr10:1
urn:ngram:bing-title:apr10:2
urn:ngram:bing-title:apr10:3
urn:ngram:bing-title:apr10:4
urn:ngram:bing-title:apr10:5
urn:ngram:bing-anchor:apr10:1
urn:ngram:bing-anchor:apr10:2
urn:ngram:bing-anchor:apr10:3
urn:ngram:bing-anchor:apr10:4
urn:ngram:bing-anchor:apr10:5
urn:ngram:bing-body:apr10:1
urn:ngram:bing-body:apr10:2
urn:ngram:bing-body:apr10:3
urn:ngram:bing-body:apr10:4
urn:ngram:bing-body:apr10:5
Those of you familiar with the naming scheme will notice right away that we now support 5-grams for the three main streams. What the naming scheme does not capture is that, unlike the jun09 dataset for the body stream, the apr10 dataset for the body stream has a cutoff of 10. The title and anchor streams still have a cutoff of 0, as did all of the jun09 streams.
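Since the cutoff isn't visible in the URN itself, client code may want to keep track of it separately. Here is a minimal Python sketch of one way to do that. It assumes the URNs always have the five colon-separated parts seen in the list above, ending in the n-gram order. The names `ModelUrn`, `parse_model_urn`, and `cutoff_for`, along with the field names, are made up for this illustration and are not part of the service.

```python
from typing import NamedTuple


class ModelUrn(NamedTuple):
    """Parts of a model URN; the field names are hypothetical labels."""
    stream: str   # e.g. "bing-body"
    version: str  # e.g. "apr10"
    order: int    # n-gram order, e.g. 5


def parse_model_urn(urn: str) -> ModelUrn:
    """Split a URN such as 'urn:ngram:bing-body:apr10:5' into its parts."""
    parts = urn.strip().split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "ngram":
        raise ValueError(f"not an n-gram model URN: {urn!r}")
    return ModelUrn(stream=parts[2], version=parts[3], order=int(parts[4]))


def cutoff_for(model: ModelUrn) -> int:
    """Return the cutoff described in this post for a given model.

    Only the apr10 body stream uses a cutoff of 10; the apr10 title and
    anchor streams and all jun09 streams use 0. Other versions are not
    covered by the post, so they are rejected rather than guessed at.
    """
    if model.version == "jun09":
        return 0
    if model.version == "apr10":
        return 10 if model.stream == "bing-body" else 0
    raise ValueError(f"no cutoff known for version {model.version!r}")


if __name__ == "__main__":
    for urn in ("urn:ngram:bing-body:apr10:5", "urn:ngram:bing-title:apr10:5"):
        model = parse_model_urn(urn)
        print(urn, "order", model.order, "cutoff", cutoff_for(model))
```

Raising an error for unrecognized versions is a deliberate choice here: a future dataset may well use a different cutoff, and a silent default of 0 would hide that.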