# Text Classification with DraCor


[DraCor](https://dracor.org/) is a corpus of plays made available through an extensive [API](https://dracor.org/doc/api).
In this notebook we want to test to what extent a play's author can be identified using only the texts they have written.
This is a typical application of [stylometry](https://en.wikipedia.org/wiki/Stylometry).


## Creating the Corpus


The following two functions download a corpus of plays from DraCor:

In [None]:
from urllib import request
import json

dracor_api = "https://dracor.org/api"                # DraCor API-endpoint


def get_dracor(corpus, play=None):
    """Loads either corpus metadata or the play's text."""
    url = dracor_api + "/corpora/" + corpus          # base URL
    if play is not None:                             # play wanted?
        url = url + "/play/" + play + "/spoken-text" # URL for the play's text
    with request.urlopen(url) as req:                # download data
        text = req.read().decode()                   # read and decode the response
        if play is None:                             # no play given: corpus metadata wanted
            return json.loads(text)                  # parse and return JSON of corpus metadata
        return text                                  # return the play's text


def get_data(corpus):
    """Downloads texts and authors of all single-author plays in a corpus."""
    texts = []                                       # texts of the plays
    target = []                                      # authors of the plays
    for drama in get_dracor(corpus)["dramas"]:       # iterate through all plays
        name = drama["name"]                         # play title
        authors = drama["authors"]                   # play's authors
        if len(authors) == 1:                        # keep only plays with exactly one author
            texts.append(get_dracor(corpus, name))   # download text
            target.append(authors[0]["fullname"])    # add author
    return texts, target                             # return texts and authors (result of this function)

texts, target = get_data("ger")                      # download GerDraCor
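
Authors who are represented by a single play are hard to evaluate: if that play ends up in the test set, the classifier has never seen the author during training. One possible preprocessing step, which is not part of the original notebook, is to keep only authors with a minimum number of plays. The helper name `filter_min_plays`, its threshold and the toy data are assumptions made for illustration:

```python
from collections import Counter


def filter_min_plays(texts, target, min_plays=2):
    """Keeps only plays whose author has at least `min_plays` plays (hypothetical helper)."""
    counts = Counter(target)                         # number of plays per author
    kept = [(text, author) for text, author in zip(texts, target)
            if counts[author] >= min_plays]          # drop plays of rare authors
    if not kept:                                     # nothing left: return empty lists
        return [], []
    kept_texts, kept_target = zip(*kept)             # split pairs back into two sequences
    return list(kept_texts), list(kept_target)


# toy data standing in for the downloaded corpus
demo_texts = ["text a1", "text b1", "text a2", "text c1"]
demo_target = ["Author A", "Author B", "Author A", "Author C"]
print(filter_min_plays(demo_texts, demo_target))
```

Applied to the real data, this would be called as `filter_min_plays(texts, target)` before the classification experiments.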

## Text Classification

Most classification methods require numerical input, so we need to transform the texts before we can work with them. The following function transforms the given texts with the vectorizer passed to it, then trains and evaluates a [Naive Bayes classifier for multinomial models](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB). This classifier is typically well suited to text classification.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def texteval(X, Y, vec):
    """Vectorizes the texts, trains a classifier and returns its accuracy on a random test split."""
    X = vec.fit_transform(X)                                  # transform text data
    train_X, test_X, train_Y, test_Y = train_test_split(X, Y) # split into training and test data
    clf = MultinomialNB()                                     # instantiate classifier
    clf.fit(train_X, train_Y)                                 # train model
    return clf.score(test_X, test_Y)                          # evaluate model (mean accuracy)
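
Note that `texteval` fits the vectorizer on all documents before splitting, so the vocabulary (and, for TF-IDF, the document frequencies) also reflects the test documents. A stricter variant, sketched here under the assumption that the test set should be held out completely, wraps vectorizer and classifier in a scikit-learn pipeline that is fitted on the training split only. The name `texteval_holdout`, the `random_state` parameter and the toy data are illustrative and not part of the original notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def texteval_holdout(X, Y, vec, random_state=None):
    """Like texteval, but the vectorizer only ever sees the training texts."""
    train_X, test_X, train_Y, test_Y = train_test_split(X, Y, random_state=random_state)
    model = make_pipeline(vec, MultinomialNB())      # vectorizer + classifier as one estimator
    model.fit(train_X, train_Y)                      # fit both steps on the training texts
    return model.score(test_X, test_Y)               # mean accuracy on the unseen test texts


# toy data: two "authors" with clearly different vocabulary
toy_texts = ["sonne licht tag", "nacht mond dunkel"] * 4
toy_target = ["A", "B"] * 4
score = texteval_holdout(toy_texts, toy_target, CountVectorizer(), random_state=0)
print(score)
```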

We can now study how different kinds of text transformation affect the quality of the classification.
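
Because `train_test_split` draws a new random split on every call, the score varies from run to run, which is why each experiment is repeated several times. A small helper can condense such repeated runs into a mean and a standard deviation. This is only a sketch: `summarize_runs` is a hypothetical name, and the random numbers stand in for real evaluation scores:

```python
import random
from statistics import mean, stdev


def summarize_runs(evaluate, runs=5):
    """Calls `evaluate()` `runs` times; returns mean and standard deviation of the scores."""
    scores = [evaluate() for _ in range(runs)]       # one score per random split
    return mean(scores), stdev(scores)               # stdev needs at least two runs


# stand-in for something like `lambda: texteval(texts, target, CountVectorizer())`
rng = random.Random(0)
avg, spread = summarize_runs(lambda: rng.uniform(0.5, 0.7))
print(f"accuracy: {avg:.3f} +/- {spread:.3f}")
```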


### Word Frequency

Let's begin with the simplest option: every document is represented by a vector whose entries give the frequency, within that document, of each word in the corpus vocabulary. We can do this using the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer):

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

for i in range(5):                                            # five iterations
    print(texteval(texts, target, CountVectorizer()))
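
To make the representation concrete, the following sketch applies a `CountVectorizer` to two invented sentences instead of the plays. With the default settings the text is lowercased and only tokens of at least two characters are counted:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["Der Hund und die Katze", "Die Katze schläft"]
toy_vec = CountVectorizer()
counts = toy_vec.fit_transform(toy_docs)             # sparse document-term matrix

# words ordered by their column index in the matrix
vocabulary = sorted(toy_vec.vocabulary_, key=toy_vec.vocabulary_.get)
print(vocabulary)
print(counts.toarray())                              # one row per document, one column per word
```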

#### Frequent Words

Keep only words that appear in *at least 30%* of the documents:

In [None]:
for i in range(5):
    print(texteval(texts, target, CountVectorizer(min_df=0.3)))

#### Rare Words

Keep only words that appear in *at most 30%* of the documents:

In [None]:
for i in range(5):
    print(texteval(texts, target, CountVectorizer(max_df=0.3)))
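
The effect of `min_df` and `max_df` is easiest to see on a tiny example. A float value is interpreted as a proportion of documents, an integer as an absolute count, and both bounds are inclusive, so a word that sits exactly on the threshold survives both filters. The four toy documents below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["liebe tod", "liebe herz", "liebe herz schmerz", "tod nacht"]

# document frequencies: liebe 3/4, tod 2/4, herz 2/4, schmerz 1/4, nacht 1/4
frequent = CountVectorizer(min_df=0.5).fit(toy_docs) # keep words in at least half of the documents
rare = CountVectorizer(max_df=0.5).fit(toy_docs)     # keep words in at most half of the documents

print(sorted(frequent.vocabulary_))
print(sorted(rare.vocabulary_))
```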

#### Frequent Bigrams

Only words and bigrams that appear in *at least 30%* of documents (with `ngram_range=(1, 2)`, single words are kept alongside the bigrams):

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=0.3)

for i in range(5):
    print(texteval(texts, target, vec))

#### Rare Bigrams

Only words and bigrams that appear in *at most 30%* of documents (with `ngram_range=(1, 2)`, single words are kept alongside the bigrams):

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', max_df=0.3)

for i in range(5):
    print(texteval(texts, target, vec))
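
The `ngram_range` parameter determines which n-gram lengths are extracted: `(1, 2)` yields single words as well as bigrams, whereas `(2, 2)` would yield bigrams only. The following sketch shows the difference on an invented sentence, using the analyzer that a `CountVectorizer` builds internally:

```python
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'\b\w+\b'                                 # same token pattern as in the cells above
uni_and_bi = CountVectorizer(ngram_range=(1, 2), token_pattern=pattern).build_analyzer()
only_bi = CountVectorizer(ngram_range=(2, 2), token_pattern=pattern).build_analyzer()

sentence = "Ich liebe dich"
print(uni_and_bi(sentence))                          # single words and word pairs
print(only_bi(sentence))                             # word pairs only
```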

### TF-IDF

Frequent words are often not very informative for any given document. The word frequency is therefore often put in relation to the number of documents in which the word appears. A commonly used measure for this is [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf):

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

for i in range(5):
    print(texteval(texts, target, TfidfVectorizer(min_df=0.3)))
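
As a rough illustration of the weighting, the following pure-Python sketch computes TF-IDF values for three invented documents. It assumes the smoothed variant that scikit-learn documents as the default for `TfidfVectorizer`, namely `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each document vector. The whitespace tokenization and the function name `tfidf` are simplifications for this example:

```python
import math
from collections import Counter


def tfidf(docs):
    """Returns the IDF of every term and one normalized TF-IDF dict per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]              # naive whitespace tokenization
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}
    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))     # L2 norm of the document
        vectors.append({term: w / norm for term, w in weights.items()})
    return idf, vectors


toy_docs = ["der hund bellt", "der hund schläft", "der mond scheint"]
idf, vectors = tfidf(toy_docs)
print(idf)
print(vectors[2])
```

A word that occurs in every document, such as `der` here, receives the lowest possible weight, while a word that is specific to one document is weighted more heavily.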

### Character Frequency

We can repeat these experiments at the level of individual characters. To do this, we simply pass a different analyzer to the `CountVectorizer`:

In [None]:
for i in range(5):
    print(texteval(texts, target, CountVectorizer(analyzer='char_wb')))
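
The `'char_wb'` analyzer builds character n-grams only within word boundaries and pads every word with a space on each side, so no n-gram spans two words. The sketch below shows this on an invented two-word string:

```python
from sklearn.feature_extraction.text import CountVectorizer

chars = CountVectorizer(analyzer='char_wb').build_analyzer()
char_bigrams = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2)).build_analyzer()

sample = "ab cd"
print(chars(sample))                                 # single characters, including padding spaces
print(char_bigrams(sample))                          # character pairs inside each padded word
```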

#### Frequent Characters

In [None]:
for i in range(5):
    print(texteval(texts, target, CountVectorizer(analyzer='char_wb', min_df=0.3)))

#### Rare Characters

In [None]:
for i in range(5):
    print(texteval(texts, target, CountVectorizer(analyzer='char_wb', max_df=0.3)))

#### Frequent Character Bigrams

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), analyzer='char_wb', min_df=0.3)

for i in range(5):
    print(texteval(texts, target, vec))

#### Rare Character Bigrams

In [None]:
vec = CountVectorizer(ngram_range=(1, 2), analyzer='char_wb', max_df=0.3)

for i in range(5):
    print(texteval(texts, target, vec))