Category Archives: Word Frequency and Collocation

Force-Directed Diagram: Memcons and Telcons ‘Textplot’

 

Static ‘Textplot’ of both corpora

This is a network of the 1300 most frequent words in each corpus, linked according to the similarity of their probability distributions across the span 1969-1977. This was accomplished using the fabulous ‘textplot’ software, written by David McClure.

In both cases, the general time axis is left-to-right (the layouts were rotated in Gephi after the GML files were generated, and then those Gephi-generated files were run through the ‘kissinger’ branch of the software found at the humanist GitHub repository).

In the memcons, the ‘tendrils of specificity’ (the long patterns of increasingly specific words emerging like pseudopods in similarly-colored Modularity Classes from each diagram’s center) relate quite distinctly to areas of geopolitical focus, such as the Soviet Union, Japan, China, the Middle East, and Vietnam, among others.

Memcons ‘Textplot’
KT3-stop

In the Telcons textplot, the ‘tendrils’ are most closely related to what appear to be large swaths of the telcons bearing the varying stamps marking the documents’ former classification and declassification statuses:

starr-zoom

There are also some clusters that appear to be based around geopolitical topics (those related to Vietnam, for example). Also noteworthy are (1) a grouping that appears related to a section of the documentation with increased OCR error rates or other improperly converted material (this grouping may also reflect the use of initials in the transcripts, although it’s unclear at this time to what degree), and (2) the placement of the first names of Kissinger’s wife Nancy and son David, distinctly outside the general word networks.

Telcons ‘Textplot’
KA3-stop

It’s important to note that while there are certainly similarities between the nodes comprising the various ‘tendrils of specificity’, textplot’s similarity calculation is based on word frequency across the corpus as a whole, without distinction at the document level. Its results can therefore differ from those of collocation, topic modeling and other analyses that operate at the document or ‘chunk’ level, and the difference can be instructive in some cases.
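To make the ‘corpus as a whole’ point concrete, here is a minimal sketch of this kind of distributional similarity. It is not textplot’s actual code: the function names, the Gaussian smoothing, the bin count and bandwidth, and the choice of Bray-Curtis as the similarity measure are all assumptions made for illustration. Each word gets a smoothed density over its positions in one long token stream, and two words are similar when those densities overlap, regardless of document boundaries.

```python
import math
from collections import defaultdict


def position_density(offsets, length, bins=50, bandwidth=2.0):
    """Smooth a word's token offsets into a normalized density over `bins` slots.

    `bins` and `bandwidth` are illustrative values, not textplot defaults.
    """
    density = [0.0] * bins
    for off in offsets:
        center = off / length * bins  # where this occurrence falls on the time axis
        for b in range(bins):
            # Gaussian bump centred on the occurrence
            density[b] += math.exp(-0.5 * ((b + 0.5 - center) / bandwidth) ** 2)
    total = sum(density)
    return [d / total for d in density]


def word_densities(tokens, bins=50, bandwidth=2.0):
    """Map every word in one long token stream to its position density."""
    offsets = defaultdict(list)
    for i, tok in enumerate(tokens):
        offsets[tok].append(i)
    return {
        word: position_density(offs, len(tokens), bins, bandwidth)
        for word, offs in offsets.items()
    }


def bray_curtis_similarity(p, q):
    """1 minus the Bray-Curtis dissimilarity of two densities (1.0 = identical)."""
    diff = sum(abs(a - b) for a, b in zip(p, q))
    return 1.0 - diff / sum(a + b for a, b in zip(p, q))


# Toy corpus (invented): 'hanoi' and 'bombing' cluster early, 'moscow' late.
tokens = ["hanoi", "bombing"] * 20 + ["filler"] * 60 + ["moscow", "summit"] * 20
densities = word_densities(tokens)
near = bray_curtis_similarity(densities["bombing"], densities["hanoi"])
far = bray_curtis_similarity(densities["bombing"], densities["moscow"])
```

In this toy corpus, `near` comes out higher than `far` because ‘bombing’ and ‘hanoi’ occupy the same stretch of the stream. Nothing in the calculation knows where one document ends and the next begins, which is exactly why it can disagree with document-level or window-level methods.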

For example, ‘bombing’ is among the 1300 most frequent words in the telcons, nestled tightly within the cluster of words related to Vietnam. The word does not appear among the most frequent in the memcons. Given the differences in word composition between the two corpora (familiarity, % of nouns/place names, formality, detail, provenance, redaction, etc.), this is to be expected, but the word’s presence is nevertheless interesting for a few reasons.

Zoomed in on ‘bombing’ within the Vietnam-related cluster of the telcons
Bombing-KATextplotCloseup

Among a number of possible reasons, the finding is interesting because, arriving recently (2014-2015), it is a non-linear ‘post-indication’ of the value of the earlier decision (2012-2013) to do collocation analysis using ‘bombing’ as the target word. This is especially true given that the earlier analysis indicated a potentially significant difference in the collocation MI scores between ‘bombing’ and those words describing Vietnam, versus those that describe the country’s neighbors in Indochina.

As a recent finding that offers possible insight into an earlier, intuitive research decision, this strikes me as a powerful, non-linear example of the value of visualization as an ongoing process, rather than a one-time production that results in a specific finding. Additionally, it makes me ponder cases where the reasons for one’s instincts and biases (in my case, the selection of ‘bombing’ as a target word) may sometimes be seen in the data.

Interactive Textplots

David’s ‘humanist’ software was then used to create an interactive d3-based browser based on the textplot output GML for each corpus. Without Modularity Class coloring of the nodes, this ‘alpha’ interactive version is, for the moment, less able to communicate the groups within the distribution of words than the static diagrams.

Kissinger Interactive Textplot (Telcons)

Kissinger Interactive Textplot (Memcons)

 

 

 

Word Collocation: Target Word: ‘Ellsberg’

ellsberg-KTKA-10LR-PS1

Here, the target word is ‘ellsberg’ (as in Dan Ellsberg). With the exception of simple words like ‘an’, ‘of’, etc., all the words occurring within 10 words to the left or right of the word ‘ellsberg’ are on the visualization, either in yellow (the telcons), in blue (the memcons), or in between (found in both). The larger the word, the more times it is found in collocation. The closer the word is to either the ‘telcons’ or ‘memcons’ node, the higher its Mutual Information (MI) score for that form of correspondence, that is, the more exclusively it is found collocated with the word ‘ellsberg’ there.
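The two measurements described above (raw collocation counts within a 10-word window, and an MI score that rewards exclusivity) can be sketched as follows. This is an illustration only and not AntConc’s code: the function names and the tiny stoplist are hypothetical, and the MI formula is one common formulation (log2 of observed over expected co-occurrence), which I am assuming rather than quoting from the tool.

```python
import math
from collections import Counter

# Hypothetical stoplist standing in for the 'simple words' that were excluded.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def collocates(tokens, target, span=10, stopwords=STOPWORDS):
    """Count words found within `span` tokens to the left or right of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(w for w in window if w not in stopwords and w != target)
    return counts


def mi_score(tokens, target, collocate, span=10):
    """log2(observed / expected) co-occurrences of `collocate` near `target`.

    Expected count assumes words are scattered at random through the corpus;
    this formulation is an assumption, not necessarily the one AntConc uses.
    """
    freq = Counter(tokens)
    observed = collocates(tokens, target, span, stopwords=set())[collocate]
    if observed == 0:
        return float("-inf")
    expected = freq[target] * freq[collocate] * 2 * span / len(tokens)
    return math.log2(observed / expected)


# Invented example text; a span of 2 keeps the windows easy to follow by eye.
text = "the ellsberg papers leak and the ellsberg trial of the papers"
tokens = text.split()
counts = collocates(tokens, "ellsberg", span=2)
mi_trial = mi_score(tokens, "ellsberg", "trial", span=2)
mi_papers = mi_score(tokens, "ellsberg", "papers", span=2)
```

Here ‘trial’ and ‘papers’ each fall inside an ‘ellsberg’ window once, but ‘papers’ also occurs elsewhere in the text, so ‘trial’ receives the higher MI score. That is the sense in which word size (frequency) and closeness to a node (MI) carry different information in the graph.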

Note: Kissinger expected his phone conversations (the telcons) to remain private, until he was compelled by threat of subpoena to release them. The memcons were formally redacted, and selectively declassified.

What (if anything) does this visualization say to you?

Word Collocation: Target Word ‘Bombing’

‘Bombing’ Word Collocation Force-Directed Graph
BombingCorrelation-big

First and perhaps most strikingly, the MI score and the frequency of the words “Cambodia” and “Vietnam” in collocation with the word “bombing” differ greatly between the two channels of communication. When Kissinger and his associates used the word ‘bombing’ in official meetings, it was associated much more with words related to ‘Vietnam’ than in the telephone conversations, in which ‘bombing’ had a higher MI score (collocation) with the names of other countries in Indochina (Laos, Thailand and Cambodia).

It is unsurprising that Kissinger (as National Security Advisor) would use the telephone rather than formal meetings to discuss bombing in Indochina, given the differences in his expectations of privacy in those two fora of conversation. However, more than just a quantitative representation of ‘candor,’ this difference may also suggest an absence of material – ‘Top Secret’ memcons on military aspects of the ‘Cambodia’ topic, for example.

‘Bombing’ Word Correlation Interactive Force-Directed Graph
d3bombing-thumb

This is an interactive ‘d3’ version of the force-directed word collocation analysis of the word ‘Bombing’. Currently, the diagram does not take ‘edge weights’ into account, so the nodes within each cluster are placed inexactly.

Until the ‘edge weight’ code is completed, the static graph above is far more accurate and ‘stable’.
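For readers wondering what taking ‘edge weights’ into account would change, the idea can be sketched in a few lines. This is not the d3 code behind the interactive diagram (which is JavaScript, and whose eventual design I am not describing here); it is a hypothetical Python illustration in which each edge acts as a spring whose resting length shrinks as its weight grows. The function names, step size, and example weights are all invented.

```python
import math


def node_distance(pos, a, b):
    """Euclidean distance between two laid-out nodes."""
    return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])


def weighted_spring_layout(edges, positions, steps=200, rate=0.05, rest=1.0):
    """Relax a layout where each edge (u, v, weight) is a spring.

    A heavier edge gets a shorter resting length (rest / weight), so strongly
    associated words settle closer together. Weights must be positive.
    """
    pos = {node: list(xy) for node, xy in positions.items()}
    for _ in range(steps):
        moves = {node: [0.0, 0.0] for node in pos}
        for u, v, weight in edges:
            dx = pos[v][0] - pos[u][0]
            dy = pos[v][1] - pos[u][1]
            dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            pull = rate * (dist - rest / weight) / dist
            moves[u][0] += pull * dx
            moves[u][1] += pull * dy
            moves[v][0] -= pull * dx
            moves[v][1] -= pull * dy
        for node, (mx, my) in moves.items():
            pos[node][0] += mx
            pos[node][1] += my
    return pos


# Invented weights: a strong 'bombing'-'vietnam' tie, a weak 'bombing'-'laos' tie.
edges = [("bombing", "vietnam", 4.0), ("bombing", "laos", 1.0)]
start = {"bombing": (0.0, 0.0), "vietnam": (2.0, 0.0), "laos": (0.0, 2.0)}
layout = weighted_spring_layout(edges, start)
```

Both neighbours start equally far from ‘bombing’, but after relaxation the heavier edge ends up shorter. Without weights, every edge would pull toward the same length, which is why node placement within each cluster of the interactive version is inexact.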

Word Collocation: Target Word ‘Cambodia’

‘Cambodia’ Force Directed Word Frequency/Collocation Graph
cambodia

Words related to violence (bombing, attack, invade, etc.) were more likely to be seen in high collocation – MI (Mutual Information) score – with the word “Cambodia” in the Telcons than in the Memcons, which displayed a greater MI score between ‘Cambodia’ and words related to laughter.

Word Collocation Analysis performed using ‘AntConc’ by Laurence Anthony.