Data Incubator semi-finalist challenge – Question 3: Plot 2

Instead of simply selecting the most frequent word submitted by collective thought stream contributors, frequencies can be adjusted according to the frequencies of synonymous words that were also submitted. This helps ensure that the next word selected for the collective thought stream reflects the general meaning intended by the largest possible number of contributors. I downloaded WordNet, a lexical database of English nouns, verbs, adjectives and adverbs (freely available here), and linked it to an R package of the same name. I then wrote some R code to generate a similarity matrix of all 630 words found to follow “this is a” in the Twitter corpus. The similarity measure I used was the proportion of a word's synonyms (plus the word itself) that were shared with each other word. This measure is asymmetric: a word like “exam” can share 100% of itself plus its synonyms with the synonyms of the word “test”, so that every time someone submits “exam”, they essentially mean “test”; by contrast, the word “test” might share only 33% of itself plus its synonyms with the synonyms of “exam”, so a submission of “test” counts only partially toward the previously submitted word “exam”. Synonym-adjusted relative frequency scores were then calculated for each word as the sum of all submitted word frequencies, each multiplied by its shared synonym proportion with that word.
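The scoring scheme above can be sketched in a few lines. My original code is in R against the full WordNet database; this is a minimal Python stand-in using toy, hand-made synonym sets (the words and synonym lists below are invented for illustration, not taken from WordNet):

```python
# Illustrative sketch of the asymmetric similarity measure and the
# synonym-adjusted scores described above, using toy synonym sets.

def shared_proportion(submitted, target, synonyms):
    """Proportion of `submitted` plus its synonyms that are shared
    with `target` plus its synonyms (asymmetric by construction:
    the denominator is the size of the submitted word's set)."""
    sub_set = {submitted} | synonyms.get(submitted, set())
    tgt_set = {target} | synonyms.get(target, set())
    return len(sub_set & tgt_set) / len(sub_set)

def adjusted_scores(freq, synonyms):
    """Synonym-adjusted score for each word: the sum of every
    submitted word's frequency weighted by its shared proportion
    toward the target word."""
    return {target: sum(f * shared_proportion(word, target, synonyms)
                        for word, f in freq.items())
            for target in freq}

# Toy data: "exam" shares all of itself+synonyms with "test",
# while "test" shares only half of its larger set with "exam".
synonyms = {"exam": {"test", "quiz"},
            "test": {"exam", "quiz", "trial", "run", "check"}}
freq = {"exam": 5, "test": 2}
scores = adjusted_scores(freq, synonyms)
# Every "exam" submission counts fully toward "test", but a "test"
# submission counts only 0.5 of a vote toward "exam".
```

With these toy sets, "test" ends up with a higher adjusted score than "exam" even though "exam" was submitted more often, which is exactly the kind of re-ranking the adjustment is meant to produce.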

See below for a plot of each of the 630 next words that followed “this is a” in the Twitter corpus, ranked by their synonym-adjusted relative frequency scores.

This figure shows that despite the adjustment for synonymous words, the rankings of the two most frequent words (“great” and “good”) did not change from Plot 1. The rankings of many other words, however, changed dramatically. The word “neat” jumped to 3rd, despite following “this is a” only once in the Twitter corpus. This is mainly because “neat” shared 72% of its synonyms with “great”, so every instance of “great” counted as 0.72 of a vote for “neat”. Similar results occurred for “large” (due to synonyms shared with the frequent words “great” and “big”) and for “swell” (due to synonyms shared with the two most frequent words, “great” and “good”). This plot demonstrates that meanings shared across many submitted words can be incorporated into the selection of the next word added to a collective thought stream.
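The jump of “neat” can be checked with simple arithmetic, using the counts reported in these posts (154 occurrences of “great”, one of “neat”, and a 72% shared-synonym proportion):

```python
# One direct occurrence of "neat" plus 0.72 of a vote from each of
# the 154 occurrences of "great" (shared-synonym proportion of 72%).
great_count = 154
neat_count = 1
neat_share_of_great = 0.72

neat_score = neat_count + great_count * neat_share_of_great
# 1 + 154 * 0.72 = 111.88 adjusted votes
```

Roughly 112 adjusted votes is far more than the one raw occurrence of “neat”, which is enough to lift it to 3rd place behind “great” and “good”.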

plot2


Data Incubator semi-finalist challenge – Question 3: Plot 1

As a hypothetical example of a distribution of suggested next words that could be submitted by collective thought stream contributors, I processed a corpus of 2,360,000 US English Twitter messages collected in May 2012 (freely available here) into every four-gram word sequence. I then put together some R code that generates the distribution of next words from any specified trigram. Choosing “this is a” as a focal trigram, I found 1638 instances of four-grams beginning with “this is a” in the Twitter corpus, with a distribution of next-word frequencies resembling a power law. Among these 1638 four-grams there were 630 different next words, 449 (71%) of which occurred only once and 41 (6.5%) of which were not real words. The most frequent next word was “great”, which followed “this is a” in 154 (9.5%) of all instances.
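The core of the corpus processing is just counting the word that follows each match of the focal trigram. My implementation is in R; this is a minimal Python sketch of the same idea, run over a few invented stand-in messages rather than the real corpus:

```python
from collections import Counter

def next_word_counts(texts, trigram):
    """Count the words that immediately follow `trigram` in `texts`
    (naive whitespace tokenization, lowercased)."""
    prefix = tuple(trigram.lower().split())
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - len(prefix)):
            if tuple(words[i:i + len(prefix)]) == prefix:
                counts[words[i + len(prefix)]] += 1
    return counts

# Toy stand-in for the Twitter corpus (hypothetical messages).
tweets = ["this is a great day",
          "wow this is a great game",
          "this is a good idea"]
counts = next_word_counts(tweets, "this is a")
print(counts.most_common())  # [('great', 2), ('good', 1)]
```

Applied to the full corpus, the `most_common()` ranking is what the plot below displays; real tweet tokenization would of course need more care with punctuation and casing than this sketch takes.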

See below for a plot of each of the 630 next words that followed “this is a” in the Twitter corpus, along with their occurrence frequencies.

This figure demonstrates that despite the large number of possible next words that could follow an n-gram, the distribution of next-word frequencies is far from uniform, even when the possible next words are independent of each other in the sense that they share no consistent context (as in the Twitter corpus). In the collective thought stream case, where contributors are prompted to submit a next word and every user sees the same thought stream context, I would expect fewer distinct words to be submitted for the same amount of overall input.

plot1
