# Contrastive Text Analysis with DraCor and Scattertext

[Scattertext](https://github.com/JasonKessler/scattertext) visualizes the linguistic differences between two groups of texts in a two-dimensional scatter plot. Here we use it to contrast the speech of characters of different genders in a play.

## Requirements

We first install the libraries that are necessary to process the data:

In [None]:
!pip install scattertext spacy spacy-transformers pandas pydracor
!python -m spacy download de_dep_news_trf

## Acquiring the Corpus

We download the text of each character for [Goethe's Faust](https://dracor.org/ger/goethe-faust-eine-tragoedie):

In [None]:
%%time
import pydracor

play = pydracor.Play(play_name = "goethe-faust-eine-tragoedie")
text = play.spoken_text_by_character()

This gives us a list with information about all characters, including their gender and spoken text:

In [None]:
text[23]

The later steps are easier if we transform this into tabular data with the columns *Speaker*, *Gender*, *Text*:

In [None]:
import pandas as pd

table = [(c["label"], c["gender"], " ".join(c["text"])) for c in text] # a list of tuples
df = pd.DataFrame(table, columns=["Speaker", "Gender", "Text"])     # a dataframe
df
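To make the transformation easier to follow without downloading the play, here is a minimal sketch on hand-written records. Only the keys `label`, `gender` and `text` are taken from the cell above; the record values and the names `sample_records` and `sample_df` are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical records that mimic the shape used above: one dict per character,
# with "text" holding a list of spoken passages. The values are illustrative only.
sample_records = [
    {"label": "Faust", "gender": "MALE", "text": ["Habe nun, ach!", "Philosophie,"]},
    {"label": "Margarete", "gender": "FEMALE", "text": ["Meine Ruh ist hin,"]},
    {"label": "Chor", "gender": "UNKNOWN", "text": []},
]

# One tuple per character; the passages are joined into a single string.
rows = [(r["label"], r["gender"], " ".join(r["text"])) for r in sample_records]
sample_df = pd.DataFrame(rows, columns=["Speaker", "Gender", "Text"])
print(sample_df)
```

Note that a character without any passages ends up with an empty string in the *Text* column, which may be worth checking for before building a corpus.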

What's the gender distribution of the speakers?

In [None]:
df.Gender.value_counts()

Scattertext contrasts one category against everything else, so we remove speakers of unknown gender; otherwise their text would be counted on the "not male" side of the plot:

In [None]:
df = df[df.Gender != "UNKNOWN"]
df.Gender.value_counts()
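A slightly more defensive variant of this filter keeps the two wanted categories explicitly instead of excluding one unwanted label. The sketch below uses toy data; the label `"FEMALE"` is an assumption (only `"MALE"` and `"UNKNOWN"` appear in this notebook), and `toy` and `binary` are hypothetical names.

```python
import pandas as pd

# Toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "Speaker": ["A", "B", "C", "D"],
    "Gender": ["MALE", "FEMALE", "UNKNOWN", "MALE"],
    "Text": ["eins", "zwei", "drei", "vier"],
})

# Keep only the two categories to be contrasted; any other label is dropped.
# .copy() makes the result an independent DataFrame rather than a view.
binary = toy[toy["Gender"].isin(["MALE", "FEMALE"])].copy()
counts = binary["Gender"].value_counts()
print(counts)
```

With an allow-list like this, an unexpected extra label in the data would also be removed rather than silently ending up in one of the two groups.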

## Building the Scattertext Page

We largely follow [this tutorial](https://github.com/JasonKessler/scattertext#using-scattertext-as-a-text-analysis-library-finding-characteristic-terms-and-their-associations).

First, we load the trained language model: 

In [None]:
import spacy
nlp = spacy.load("de_dep_news_trf")

Then we create a Scattertext corpus:

In [None]:
import scattertext as st
corpus = st.CorpusFromPandas(df, category_col='Gender', text_col='Text', nlp=nlp).build()

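Next, we print the ten terms that most differentiate the corpus from Scattertext's general background corpus. Note that, as far as we know, the default background frequencies shipped with Scattertext are derived from English text, so for a German play this list should be read with caution: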

In [None]:
list(corpus.get_scaled_f_scores_vs_background().index[:10])

Then we can create an HTML page containing the Scattertext visualization:

In [None]:
html = st.produce_scattertext_explorer(corpus,
          category='MALE',
          category_name='Male',
          not_category_name='Female',
          width_in_pixels=1000,
          metadata=df['Speaker'])
# write the page next to the notebook; the with-block closes the file reliably
with open(play.name + ".html", "wb") as f:
    f.write(html.encode("utf-8"))

Here's the result: [goethe-faust-eine-tragoedie.html](goethe-faust-eine-tragoedie.html)
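
When reading the plot, it can help to sanity-check a few terms by counting them per category yourself. The following is a rough standard-library sketch of such a count on invented sentences; it is not Scattertext's scoring, and `toy_rows`, `tokenize` and `per_thousand` are hypothetical names introduced only for this example.

```python
import re
from collections import Counter

# Toy stand-in for the Gender and Text columns; the sentences are invented.
toy_rows = [
    ("MALE", "der Geist der Erde"),
    ("MALE", "der Pudel"),
    ("FEMALE", "die Ruh ist hin"),
    ("FEMALE", "der Kranz ist hin"),
]

def tokenize(text):
    """Hypothetical helper: lowercase word tokens via a simple regex."""
    return re.findall(r"\w+", text.lower())

# One Counter of token frequencies per gender.
counts = {}
for gender, text in toy_rows:
    counts.setdefault(gender, Counter()).update(tokenize(text))

def per_thousand(term, gender):
    """Occurrences of `term` per 1,000 tokens spoken by `gender`."""
    total = sum(counts[gender].values())
    return 1000 * counts[gender][term] / total

for term in ("der", "hin"):
    print(term, per_thousand(term, "MALE"), per_thousand(term, "FEMALE"))
```

Normalizing by the number of tokens per group matters here because the two genders usually speak very different amounts of text, so raw counts alone would be misleading.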