State Party Platform Project

Daniel Coffey, PhD. This blog is about my text analysis of state political party platforms (and other texts) and is a source for state party platform data for social scientists and academics.

    • Practicing Word Embeddings

      Posted at 2:43 am by Daniel Coffey on March 3, 2023

      One of the things that is frustrating, but really important to understand, about coding political (or any) texts is how word usage varies by context. This is why sentiment dictionaries often perform poorly or have questionable validity. Human coding really is still the gold standard in terms of validity, but NLP has come a long way (ChatGPT!).

      A new paper from Pedro Rodriguez, Arthur Spirling, and Brandon M. Stewart, “Embedding Regression: Models for Context-Specific Description and Inference,” demonstrates a way to use word embeddings to capture how the same words are used differently by different speakers, such as politicians or political parties. The paper is coming out soon in the American Political Science Review. From the abstract:

      Social scientists commonly seek to make statements about how word use varies over circumstances—including time, partisan identity, or some other document-level covariate. For example, researchers might wish to know how Republicans and Democrats diverge in their understanding of the term “immigration.” Building on the success of pretrained language models, we introduce the à la carte on text (conText) embedding regression model for this purpose. This fast and simple method produces valid vector representations of how words are used—and thus what words “mean”—in different contexts. We show that it outperforms slower, more complicated alternatives and works well even with very few documents. The model also allows for hypothesis testing and statements about statistical significance. We demonstrate that it can be used for a broad range of important tasks, including understanding US polarization, historical legislative development, and sentiment detection. We provide open-source software for fitting the model.

      The program, conText, is easy to implement in R. I ran through their tutorial using my own dataset (about 30 state party platforms from 2020). Their example for immigration worked well on my data. I don’t have a lot of experience with word embeddings, so I used their pre-trained embeddings. Again, I am a novice at this, but I think because the pre-trained feature embeddings contained most of the words used in the state party platforms, the results made sense. I like the program and their example. For those who use quanteda, conText is very similar (by design), which made it easier to adjust when I ran into problems. In fact, it took me only about an hour or so to run through the immigration example and then to try another issue, abortion. I am definitely impressed. conText doesn’t need much user input: just a few words are enough for it to comb through the corpus, pick out issue-specific words, and disentangle the partisan differences. For example, using nearest neighbors for “abortion” and asking for the top 20 nearest words produced the following:
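
      First, a minimal sketch of the code, which tracks the package’s quick-start guide closely. The platform_toks object and its “party” docvar are hypothetical names standing in for my own data; cr_glove_subset and cr_transform are the pre-trained embeddings and transformation matrix that ship with conText’s examples (you can swap in the full pre-trained versions):

      library(quanteda)
      library(conText)

      # platform_toks: a quanteda tokens object built from the state party
      # platforms, with a "party" docvar ("a" = Democratic, "b" = Republican).
      # The name is a placeholder for my own data.

      # grab the contexts (a 6-token window) around any mention of abortion
      abortion_toks <- tokens_context(x = platform_toks, pattern = "abortion*",
                                      window = 6L)

      # document-feature matrix of those contexts
      abortion_dfm <- dfm(abortion_toks)

      # embed each context using the pre-trained embeddings plus the
      # a la carte transformation matrix
      abortion_dem <- dem(x = abortion_dfm, pre_trained = cr_glove_subset,
                          transform = TRUE, transform_matrix = cr_transform,
                          verbose = FALSE)

      # average the context embeddings within each party
      abortion_wv_party <- dem_group(abortion_dem,
                                     groups = abortion_dem@docvars$party)

      # top 20 nearest neighbors for each party's "abortion" embedding
      nns(abortion_wv_party, pre_trained = cr_glove_subset, N = 20,
          candidates = abortion_wv_party@features, as_list = TRUE)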

      Democratic words

      # A tibble: 20 x 4
         target feature     rank value
       1 a      legal          1 0.639
       2 a      access         2 0.579
       3 a      services       3 0.536
       4 a      provide        4 0.535
       5 a      including      5 0.531
       6 a      provides       6 0.480
       7 a      ability        7 0.460
       8 a      law            8 0.454
       9 a      federal        9 0.441
      10 a      funding       10 0.440
      11 a      health        11 0.438
      12 a      also          12 0.428
      13 a      women         13 0.425
      14 a      without       14 0.423
      15 a      care          15 0.418
      16 a      decision      16 0.418
      17 a      support       17 0.416
      18 a      rights        18 0.415
      19 a      information   19 0.413
      20 a      individuals   20 0.412

      Republican words

      # A tibble: 20 x 4
         target feature     rank value
       1 b      human          1 0.610
       2 b      support        2 0.556
       3 b      child          3 0.506
       4 b      act            4 0.497
       5 b      legisl         5 0.464
       6 b      urg            6 0.464
       7 b      includ         7 0.458
       8 b      feder          8 0.455
       9 b      law            9 0.455
      10 b      public        10 0.454
      11 b      servic        11 0.454
      12 b      fund          12 0.453
      13 b      life          13 0.439
      14 b      provid        14 0.434
      15 b      also          15 0.431
      16 b      right         16 0.428
      17 b      children      17 0.419
      18 b      effort        18 0.412
      19 b      women         19 0.411
      20 b      author        20 0.409

      (The stemming/lemmatization appears not to have been perfect.) The idea is that “we know a word’s meaning by the company it keeps,” so differences in the nearest (or most frequent) neighbors indicate differences in partisan meaning. That said, the method still largely picks out word differences (or differences in associations) and can’t by itself provide insight into how the same word really differs in meaning. This is a point the authors make in related publications: researchers need to make sense of the output, so expert knowledge and a good theory are necessary. Interpretation is for the researcher, not the machine. We can examine differences in the meaning of shared words (like “health,” “state,” or “government”), but making meaningful inferences will require even more knowledge about the issues and how local (state) parties frame them.
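
      The paper’s headline contribution, the embedding regression itself, also gives you a formal hypothesis test: a normed coefficient with a permutation-based p-value for whether two groups’ context embeddings of a word actually differ. Here is a rough sketch of that call, using the same hypothetical platform_toks object as above:

      # embedding regression: does the context around "abortion" vary by party?
      set.seed(42)
      model <- conText(formula = abortion ~ party,
                       data = platform_toks,
                       pre_trained = cr_glove_subset,
                       transform = TRUE, transform_matrix = cr_transform,
                       bootstrap = TRUE, num_bootstraps = 100,
                       permute = TRUE, num_permutations = 100,
                       window = 6, case_insensitive = TRUE,
                       verbose = FALSE)

      # the normed coefficient on party, with its permutation p-value,
      # tests whether the two parties' "abortion" embeddings differ
      model@normed_coefficients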

      In the figure below, the Republican words associated with the use of “abortion” are on the left (I ran this quickly and didn’t get a chance to adjust the code to relabel the plot, hence the “a” and “b” labels), the Democratic words are on the right, and the shared words (denoted by triangles) are in the middle.
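
      The figure is (I believe) conText’s nearest-neighbors ratio plot: for each candidate word, it takes the ratio of the word’s cosine similarity to one party’s “abortion” embedding over its similarity to the other party’s, so shared words cluster around a ratio of one and party-distinctive words sit at the extremes. A sketch, with the same placeholder objects as above and the function names as given in the package’s quick-start guide:

      # ratio of cosine similarities between the two parties' embeddings
      abortion_ratio <- get_nns_ratio(x = abortion_toks, N = 20,
                                      groups = docvars(abortion_toks, "party"),
                                      numerator = "b",
                                      candidates = abortion_wv_party@features,
                                      pre_trained = cr_glove_subset,
                                      transform = TRUE,
                                      transform_matrix = cr_transform,
                                      bootstrap = TRUE, num_bootstraps = 100,
                                      permute = TRUE, num_permutations = 100,
                                      verbose = FALSE)

      # horizontal plot: one party's distinctive words on each side,
      # shared words in the middle
      plot_nns_ratio(x = abortion_ratio, alpha = 0.01, horizontal = TRUE)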

      I’ve been reading Text as Data (Justin Grimmer, Margaret Roberts, and Brandon Stewart, Princeton University Press, 2022), and I appreciate the clarity of their approach. I also appreciate the sentiment they express repeatedly that the researcher needs theory to make sense of the text; they “emphasize throughout our book that text as data methods should not displace the careful and thoughtful humanist” (2022: 9). This is a point I will be coming back to over the next several posts: with so many NLP options, why bother with human coding anymore? I think this is a question that needs attention, and it is something I will be focusing on during my fall sabbatical (it got approved!!!!) and on this blog.

      Thanks for reading!
