To catch a protagonist in DraCor

by Ingo Börner

The paper To Catch a Protagonist: Quantitative Dominance Relations in German-Language Drama (1730-1930) (Fischer et al. 2018) describes an algorithm that identifies the quantitatively dominant characters of a play based on a set of network-based and count-based measures:

In order to systematically describe the extent of this deviation, we calculate eight values for each character of the 465 dramas of our corpus, three count-based measures (number of scenes a character appears in, number of speech acts, number of spoken words) and five network-related measures (degree, weighted degree, betweenness centrality, closeness centrality, eigenvector centrality). For each measurement a ranking is created. The rankings are then merged into two meta-rankings: one count-based and one network-based. The two meta-rankings are then combined into an overall ranking.

The original algorithm was implemented in the tool Dramavis by Christopher Kittel. Dramavis operates on XML “zwischenformat” files created in the DLINA project.

This notebook adapts the code of the respective Dramavis modules to work with data returned by the DraCor API. The aim is to be able to recreate the *_chars.csv files that were used in the study. The data can be found in the repository on GitHub in the folder allmetrics.

The implementation will be tested with the play Emilia Galotti. The original algorithm operated on the corresponding LINA and produced the file 88_Emilia Galotti_chars.csv as output. In DraCor the play can be accessed here.

Step 1. Get the basic measures

We need to get the following basic measures on characters:

Network measures

betweenness
degree
closeness
~~closeness corrected~~
weighted degree
eigenvector centrality

Count-based measures

frequency/appearances
number of speech acts
number of words

Network and count-based metrics via the DraCor API

The Python package requests and the standard-library module json are used to query the API and parse the response:

# if not installed, uncomment the following line and run the cell:
# !pip install requests

import requests
import json

# set corpus and playname
corpusname = "ger"
playname = "lessing-emilia-galotti"

# base url of the DraCor-API
api_base = "https://dracor.org/api/"

To retrieve the network data and the speech-amount data for individual characters, the API endpoint /corpora/{corpusname}/play/{playname}/cast is used as follows:

# send a request to the endpoint and parse results
request_url = api_base + "corpora/" + corpusname + "/play/" + playname + "/cast"
r = requests.get(request_url)
character_data = json.loads(r.text)
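For reuse across several plays, the request above could be wrapped in a small helper. The following is only a sketch under assumptions: the function names build_cast_url and fetch_cast are hypothetical, the standard-library module urllib is used in place of requests, and the endpoint path is the one shown above.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://dracor.org/api/"  # same base URL as above


def build_cast_url(corpusname, playname, api_base=API_BASE):
    """Build the URL of the cast endpoint used above (hypothetical helper)."""
    return (api_base.rstrip("/") + "/corpora/" + quote(corpusname)
            + "/play/" + quote(playname) + "/cast")


def fetch_cast(corpusname, playname, timeout=30):
    """Request the cast data of one play and parse the JSON response.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, so a
    failed request is reported instead of being parsed as JSON.
    """
    url = build_cast_url(corpusname, playname)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_cast_url("ger", "lessing-emilia-galotti"))
```

With network access, fetch_cast("ger", "lessing-emilia-galotti") should return the same kind of list of dictionaries as character_data.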

The API function returns data on the characters, including the network and count-based metrics:

character_data

[{'id': 'der_prinz',
  'name': 'Der Prinz',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 17,
  'numOfSpeechActs': 157,
  'numOfWords': 4002,
  'degree': 8,
  'weightedDegree': 20,
  'closeness': 0.75,
  'betweenness': 0.46717171717171724,
  'eigenvector': 0.32076106311648156},
 {'id': 'der_kammerdiener',
  'name': 'Der Kammerdiener',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 2,
  'numOfSpeechActs': 6,
  'numOfWords': 33,
  'degree': 1,
  'weightedDegree': 2,
  'closeness': 0.4444444444444444,
  'betweenness': 0,
  'eigenvector': 0.05575792046031641},
 {'id': 'conti',
  'name': 'Conti',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 2,
  'numOfSpeechActs': 24,
  'numOfWords': 604,
  'degree': 1,
  'weightedDegree': 2,
  'closeness': 0.4444444444444444,
  'betweenness': 0,
  'eigenvector': 0.05575792046031641},
 {'id': 'marinelli',
  'name': 'Marinelli',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 19,
  'numOfSpeechActs': 221,
  'numOfWords': 4343,
  'degree': 9,
  'weightedDegree': 30,
  'closeness': 0.8,
  'betweenness': 0.24696969696969698,
  'eigenvector': 0.4489846359321899},
 {'id': 'camillo_rota',
  'name': 'Camillo Rota',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 1,
  'numOfSpeechActs': 6,
  'numOfWords': 78,
  'degree': 1,
  'weightedDegree': 1,
  'closeness': 0.4444444444444444,
  'betweenness': 0,
  'eigenvector': 0.05575792046031641},
 {'id': 'claudia',
  'name': 'Claudia',
  'isGroup': False,
  'gender': 'FEMALE',
  'numOfScenes': 13,
  'numOfSpeechActs': 73,
  'numOfWords': 1581,
  'degree': 7,
  'weightedDegree': 19,
  'closeness': 0.6,
  'betweenness': 0.04545454545454544,
  'eigenvector': 0.38292603187412266},
 {'id': 'pirro',
  'name': 'Pirro',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 4,
  'numOfSpeechActs': 25,
  'numOfWords': 263,
  'degree': 5,
  'weightedDegree': 7,
  'closeness': 0.5454545454545454,
  'betweenness': 0.026515151515151516,
  'eigenvector': 0.2719436343371554},
 {'id': 'odoardo',
  'name': 'Odoardo',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 12,
  'numOfSpeechActs': 108,
  'numOfWords': 2441,
  'degree': 6,
  'weightedDegree': 15,
  'closeness': 0.6666666666666666,
  'betweenness': 0.05505050505050505,
  'eigenvector': 0.3542503929627511},
 {'id': 'angelo',
  'name': 'Angelo',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 2,
  'numOfSpeechActs': 28,
  'numOfWords': 487,
  'degree': 2,
  'weightedDegree': 2,
  'closeness': 0.48,
  'betweenness': 0,
  'eigenvector': 0.1253177208861109},
 {'id': 'emilia',
  'name': 'Emilia',
  'isGroup': False,
  'gender': 'FEMALE',
  'numOfScenes': 7,
  'numOfSpeechActs': 64,
  'numOfWords': 1702,
  'degree': 6,
  'weightedDegree': 13,
  'closeness': 0.6666666666666666,
  'betweenness': 0.05505050505050505,
  'eigenvector': 0.3513647060457318},
 {'id': 'appiani',
  'name': 'Appiani',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 5,
  'numOfSpeechActs': 48,
  'numOfWords': 852,
  'degree': 4,
  'weightedDegree': 8,
  'closeness': 0.5217391304347826,
  'betweenness': 0.003787878787878788,
  'eigenvector': 0.2529584931569895},
 {'id': 'battista',
  'name': 'Battista',
  'isGroup': False,
  'gender': 'MALE',
  'numOfScenes': 4,
  'numOfSpeechActs': 11,
  'numOfWords': 152,
  'degree': 4,
  'weightedDegree': 7,
  'closeness': 0.6,
  'betweenness': 0.012121212121212121,
  'eigenvector': 0.26144507771860326},
 {'id': 'orsina',
  'name': 'Orsina',
  'isGroup': False,
  'gender': 'FEMALE',
  'numOfScenes': 6,
  'numOfSpeechActs': 64,
  'numOfWords': 2111,
  'degree': 4,
  'weightedDegree': 8,
  'closeness': 0.6,
  'betweenness': 0.012121212121212121,
  'eigenvector': 0.2619466686163178}]

The data on each character is stored in a dictionary, for example the one for Marinelli:

{'betweenness': 0.24696969696969698,
  'closeness': 0.8,
  'degree': 9,
  'eigenvector': 0.44898463593218985,
  'gender': 'MALE',
  'id': 'marinelli',
  'isGroup': False,
  'name': 'Marinelli',
  'numOfScenes': 19,
  'numOfSpeechActs': 221,
  'numOfWords': 4343,
  'weightedDegree': 30}
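Since the later steps rely on a fixed set of keys in each of these dictionaries, it may be worth checking that none is missing before building a table. The following is a minimal sketch; the helper missing_keys and the incomplete sample entry are hypothetical and only serve as illustration.

```python
REQUIRED_KEYS = ["id", "betweenness", "degree", "closeness", "weightedDegree",
                 "eigenvector", "numOfScenes", "numOfSpeechActs", "numOfWords"]


def missing_keys(characters, required=REQUIRED_KEYS):
    """Map character id -> list of required keys absent from its dictionary."""
    problems = {}
    for position, character in enumerate(characters):
        absent = [key for key in required if key not in character]
        if absent:
            # fall back to the list position if even the id is missing
            problems[character.get("id", position)] = absent
    return problems


# One entry copied from the response above, one deliberately incomplete.
sample = [
    {"id": "marinelli", "betweenness": 0.24696969696969698, "degree": 9,
     "closeness": 0.8, "weightedDegree": 30, "eigenvector": 0.4489846359321899,
     "numOfScenes": 19, "numOfSpeechActs": 221, "numOfWords": 4343},
    {"id": "incomplete_example", "degree": 1},  # hypothetical entry
]
print(missing_keys(sample))
```

An empty dictionary as result means that every character carries all required values.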

This response does not tell us anything about the network metrics of the play as a whole, though. To retrieve that information, we would have to use the API endpoint /corpora/{corpusname}/play/{playname}/metrics, which also reports, in the dictionary field numConnectedComponents, whether the network consists of several sub-networks. This could be relevant because some network metrics, e.g. closeness, can be calculated in different ways.
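To illustrate what a count of connected components expresses, here is a minimal sketch that counts the sub-networks of an undirected co-occurrence graph with a breadth-first search. The function name is hypothetical and the edge list is invented for illustration; it is not taken from Emilia Galotti or from the API.

```python
from collections import defaultdict, deque


def count_connected_components(nodes, edges):
    """Count the connected components of an undirected graph via BFS."""
    neighbours = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    components = 0
    for start in nodes:
        if start in seen:
            continue
        components += 1  # an unvisited node opens a new component
        seen.add(start)
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in neighbours[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return components


# Invented toy network: a, b, c are linked; d and e form a separate group.
toy_nodes = ["a", "b", "c", "d", "e"]
toy_edges = [("a", "b"), ("b", "c"), ("d", "e")]
print(count_connected_components(toy_nodes, toy_edges))  # prints 2
```

With more than one component, some pairs of characters have no connecting path at all, so distance-based measures such as closeness need a convention for these pairs.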

Preparation: Get the metrics and construct a pandas data frame

In the Dramavis implementation an object of the class DramaAnalyzer is created, which contains the information on characters in a pandas data frame. We will create the same data structure to be able to use the same methods for calculating means and ranking.

The columns of the table are:

name,betweenness,degree,closeness,closeness_corrected,strength,eigenvector_centrality,avg_distance,avg_distance_corrected,frequency,speech_acts,words,lines,chars ...

We will not include all columns, but only the ones that are relevant for the rankings:

name,betweenness,degree,closeness,~~closeness_corrected~~,strength,eigenvector_centrality,~~avg_distance,avg_distance_corrected~~,frequency,speech_acts,words,~~lines,chars~~ …

The following columns are named differently to follow the conventions of the DraCor API output:

name → id; later this will be used to construct URIs
strength → weightedDegree
eigenvector_centrality → eigenvector
frequency → numOfScenes
speech_acts → numOfSpeechActs
words → numOfWords
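The renaming listed above can be written down as a small lookup table, which may help when comparing the new output with the original *_chars.csv files. This is only a sketch; the names DRAMAVIS_TO_DRACOR and to_dracor_names are hypothetical.

```python
# Dramavis column name -> DraCor API key, as listed above
DRAMAVIS_TO_DRACOR = {
    "name": "id",
    "strength": "weightedDegree",
    "eigenvector_centrality": "eigenvector",
    "frequency": "numOfScenes",
    "speech_acts": "numOfSpeechActs",
    "words": "numOfWords",
}


def to_dracor_names(dramavis_columns):
    """Translate Dramavis column names; unmapped names are kept unchanged."""
    return [DRAMAVIS_TO_DRACOR.get(name, name) for name in dramavis_columns]


# The Dramavis columns that are relevant for the rankings
dramavis_header = ["name", "betweenness", "degree", "closeness", "strength",
                   "eigenvector_centrality", "frequency", "speech_acts", "words"]
print(to_dracor_names(dramavis_header))
```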

The package pandas is used to handle the data as a data frame, so we need to import it.

# if not installed, uncomment the following line and run the cell:
# !pip install pandas

import pandas as pd

First, we need to transform the parsed JSON API response into a list of lists, which is then turned into the data frame df.

# columns
cols = ["id","betweenness","degree","closeness","weightedDegree","eigenvector","numOfScenes","numOfSpeechActs","numOfWords"]

# prepare the data for the data frame
df_data = []
for character in character_data:
    row = []
    for key in cols:
        row.append(character[key])
    df_data.append(row)

# construct the data frame
df = pd.DataFrame(df_data, columns = cols)

# turn the column "id" into the index
df = df.set_index('id')
# output
df

	betweenness	degree	closeness	weightedDegree	eigenvector	numOfScenes	numOfSpeechActs	numOfWords
id
der_prinz	0.467172	8	0.750000	20	0.320761	17	157	4002
der_kammerdiener	0.000000	1	0.444444	2	0.055758	2	6	33
conti	0.000000	1	0.444444	2	0.055758	2	24	604
marinelli	0.246970	9	0.800000	30	0.448985	19	221	4343
camillo_rota	0.000000	1	0.444444	1	0.055758	1	6	78
claudia	0.045455	7	0.600000	19	0.382926	13	73	1581
pirro	0.026515	5	0.545455	7	0.271944	4	25	263
odoardo	0.055051	6	0.666667	15	0.354250	12	108	2441
angelo	0.000000	2	0.480000	2	0.125318	2	28	487
emilia	0.055051	6	0.666667	13	0.351365	7	64	1702
appiani	0.003788	4	0.521739	8	0.252958	5	48	852
battista	0.012121	4	0.600000	7	0.261445	4	11	152
orsina	0.012121	4	0.600000	8	0.261947	6	64	2111
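As an aside, pandas can also build a data frame directly from a list of dictionaries, which makes the intermediate list of lists optional. The sketch below uses two entries copied from the API response to stay self-contained; the variable names sample_data and df_alt are hypothetical.

```python
import pandas as pd

cols = ["id", "betweenness", "degree", "closeness", "weightedDegree",
        "eigenvector", "numOfScenes", "numOfSpeechActs", "numOfWords"]

# Two entries copied from the API response, including the unused fields.
sample_data = [
    {"id": "der_prinz", "name": "Der Prinz", "isGroup": False, "gender": "MALE",
     "numOfScenes": 17, "numOfSpeechActs": 157, "numOfWords": 4002,
     "degree": 8, "weightedDegree": 20, "closeness": 0.75,
     "betweenness": 0.46717171717171724, "eigenvector": 0.32076106311648156},
    {"id": "marinelli", "name": "Marinelli", "isGroup": False, "gender": "MALE",
     "numOfScenes": 19, "numOfSpeechActs": 221, "numOfWords": 4343,
     "degree": 9, "weightedDegree": 30, "closeness": 0.8,
     "betweenness": 0.24696969696969698, "eigenvector": 0.4489846359321899},
]

# Selecting `cols` drops the unused fields (name, isGroup, gender) and
# fixes the column order; "id" then becomes the index.
df_alt = pd.DataFrame(sample_data)[cols].set_index("id")
print(df_alt)
```

Applied to the full character_data, this should give the same table as the loop-based construction.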

We can now query the data, e.g. output the values of a single character by requesting a row by its index value, which is the id of the character.

# get the values of a single character
df.loc["marinelli"]

betweenness           0.246970
degree                9.000000
closeness             0.800000
weightedDegree       30.000000
eigenvector           0.448985
numOfScenes          19.000000
numOfSpeechActs     221.000000
numOfWords         4343.000000
Name: marinelli, dtype: float64

Step 2. Calculate the ranks

In Dramavis the function get_character_ranks creates the rankings of the count-based and network-based measures. We will adapt this function to operate on the data frame created above and rename the columns. For every metric the highest value receives rank 1:

metrics_to_rank = ['degree', 'closeness', 'betweenness', 'weightedDegree', 'eigenvector', 'numOfScenes', 'numOfSpeechActs', 'numOfWords']
for metric in metrics_to_rank:
    df[metric + "_rank"] = df[metric].rank(method='min', ascending=False)
df
    

	betweenness	degree	closeness	weightedDegree	eigenvector	numOfScenes	numOfSpeechActs	numOfWords	degree_rank	closeness_rank	betweenness_rank	weightedDegree_rank	eigenvector_rank	numOfScenes_rank	numOfSpeechActs_rank	numOfWords_rank
id
der_prinz	0.467172	8	0.750000	20	0.320761	17	157	4002	2.0	2.0	1.0	2.0	5.0	2.0	2.0	2.0
der_kammerdiener	0.000000	1	0.444444	2	0.055758	2	6	33	11.0	11.0	10.0	10.0	11.0	10.0	12.0	13.0
conti	0.000000	1	0.444444	2	0.055758	2	24	604	11.0	11.0	10.0	10.0	11.0	10.0	10.0	8.0
marinelli	0.246970	9	0.800000	30	0.448985	19	221	4343	1.0	1.0	2.0	1.0	1.0	1.0	1.0	1.0
camillo_rota	0.000000	1	0.444444	1	0.055758	1	6	78	11.0	11.0	10.0	13.0	11.0	13.0	12.0	12.0
claudia	0.045455	7	0.600000	19	0.382926	13	73	1581	3.0	5.0	5.0	3.0	2.0	3.0	4.0	6.0
pirro	0.026515	5	0.545455	7	0.271944	4	25	263	6.0	8.0	6.0	8.0	6.0	8.0	9.0	10.0
odoardo	0.055051	6	0.666667	15	0.354250	12	108	2441	4.0	3.0	3.0	4.0	3.0	4.0	3.0	3.0
angelo	0.000000	2	0.480000	2	0.125318	2	28	487	10.0	10.0	10.0	10.0	10.0	10.0	8.0	9.0
emilia	0.055051	6	0.666667	13	0.351365	7	64	1702	4.0	3.0	3.0	5.0	4.0	5.0	5.0	5.0
appiani	0.003788	4	0.521739	8	0.252958	5	48	852	7.0	9.0	9.0	6.0	9.0	7.0	7.0	7.0
battista	0.012121	4	0.600000	7	0.261445	4	11	152	7.0	5.0	7.0	8.0	8.0	8.0	11.0	11.0
orsina	0.012121	4	0.600000	8	0.261947	6	64	2111	7.0	5.0	7.0	6.0	7.0	6.0	5.0	4.0
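To make the handling of ties explicit: with method='min' and ascending=False, the highest value receives rank 1 and tied values share the lowest rank of their group, so the next rank after a tie is skipped. The following plain-Python sketch reproduces this for the degree column; the function name min_rank_descending is hypothetical.

```python
def min_rank_descending(values):
    """Competition ranking: the highest value gets rank 1 and tied values
    share the lowest rank of their group, as with method='min'."""
    return [1 + sum(1 for other in values if other > value) for value in values]


# Degree values of the thirteen characters, in the row order of the table.
degree = [8, 1, 1, 9, 1, 7, 5, 6, 2, 6, 4, 4, 4]
print(min_rank_descending(degree))
# prints [2, 11, 11, 1, 11, 3, 6, 4, 10, 4, 7, 7, 7]
```

The result agrees with the degree_rank column: odoardo and emilia share rank 4, and the next rank assigned is 6.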

Step 3. Rank on average and standard deviation of the individual rankings

In Dramavis the individual rankings are then used for the calculation of an average ranking and the standard deviation, which are then also ranked. This is done by the function get_centrality_ranks.

The following columns will be added to the data frame:

(1) centrality_rank_avg: The average of all rankings
(2) centrality_rank_std: Standard deviation of the rankings, divided by the number of rankings as in the Dramavis code
(3) centrality_rank_avg_rank: A ranking is created from the average of all rankings (1)
(4) centrality_rank_std_rank: A ranking is created from the standard deviation of all rankings (2)

The following Dramavis code is adapted accordingly to operate on the data frame:

ranks = [c for c in df.columns if c.endswith("rank")]
df['centrality_rank_avg'] = df[ranks].sum(axis=1)/len(ranks)
df['centrality_rank_std'] = df[ranks].std(axis=1)/len(ranks)
for metric in ['centrality_rank_avg', 'centrality_rank_std']:
    df[metric + "_rank"] = df[metric].rank(method='min', ascending=True)
df

	betweenness	degree	closeness	weightedDegree	eigenvector	numOfScenes	numOfSpeechActs	numOfWords	degree_rank	closeness_rank	betweenness_rank	weightedDegree_rank	eigenvector_rank	numOfScenes_rank	numOfSpeechActs_rank	numOfWords_rank	centrality_rank_avg	centrality_rank_std	centrality_rank_avg_rank	centrality_rank_std_rank
id
der_prinz	0.467172	8	0.750000	20	0.320761	17	157	4002	2.0	2.0	1.0	2.0	5.0	2.0	2.0	2.0	2.250	0.145621	2.0	9.0
der_kammerdiener	0.000000	1	0.444444	2	0.055758	2	6	33	11.0	11.0	10.0	10.0	11.0	10.0	12.0	13.0	11.000	0.133631	12.0	7.0
conti	0.000000	1	0.444444	2	0.055758	2	24	604	11.0	11.0	10.0	10.0	11.0	10.0	10.0	8.0	10.125	0.123879	11.0	5.0
marinelli	0.246970	9	0.800000	30	0.448985	19	221	4343	1.0	1.0	2.0	1.0	1.0	1.0	1.0	1.0	1.125	0.044194	1.0	1.0
camillo_rota	0.000000	1	0.444444	1	0.055758	1	6	78	11.0	11.0	10.0	13.0	11.0	13.0	12.0	12.0	11.625	0.132583	13.0	6.0
claudia	0.045455	7	0.600000	19	0.382926	13	73	1581	3.0	5.0	5.0	3.0	2.0	3.0	4.0	6.0	3.875	0.169525	4.0	11.0
pirro	0.026515	5	0.545455	7	0.271944	4	25	263	6.0	8.0	6.0	8.0	6.0	8.0	9.0	10.0	7.625	0.188243	7.0	12.0
odoardo	0.055051	6	0.666667	15	0.354250	12	108	2441	4.0	3.0	3.0	4.0	3.0	4.0	3.0	3.0	3.375	0.064694	3.0	2.0
angelo	0.000000	2	0.480000	2	0.125318	2	28	487	10.0	10.0	10.0	10.0	10.0	10.0	8.0	9.0	9.625	0.093003	10.0	3.0
emilia	0.055051	6	0.666667	13	0.351365	7	64	1702	4.0	3.0	3.0	5.0	4.0	5.0	5.0	5.0	4.250	0.110801	5.0	4.0
appiani	0.003788	4	0.521739	8	0.252958	5	48	852	7.0	9.0	9.0	6.0	9.0	7.0	7.0	7.0	7.625	0.148467	7.0	10.0
battista	0.012121	4	0.600000	7	0.261445	4	11	152	7.0	5.0	7.0	8.0	8.0	8.0	11.0	11.0	8.125	0.253876	9.0	13.0
orsina	0.012121	4	0.600000	8	0.261947	6	64	2111	7.0	5.0	7.0	6.0	7.0	6.0	5.0	4.0	5.875	0.140749	6.0	8.0
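As a worked check of the two aggregated values, the row of der_prinz can be recomputed by hand. The sketch below uses only the standard library; the variable names are hypothetical.

```python
import statistics

# The eight individual ranks of der_prinz, in the column order of the table:
# degree, closeness, betweenness, weightedDegree, eigenvector,
# numOfScenes, numOfSpeechActs, numOfWords
ranks_der_prinz = [2, 2, 1, 2, 5, 2, 2, 2]

rank_avg = sum(ranks_der_prinz) / len(ranks_der_prinz)

# statistics.stdev is the sample standard deviation (n - 1 in the
# denominator), which is also the pandas default for DataFrame.std.
# The division by the number of rankings follows the code above.
rank_std_scaled = statistics.stdev(ranks_der_prinz) / len(ranks_der_prinz)

print(rank_avg)                   # prints 2.25
print(round(rank_std_scaled, 6))  # prints 0.145621
```

Because the division by the number of rankings scales every character's value by the same constant, it does not change centrality_rank_std_rank.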

Based on the calculation of centrality_rank_avg_rank, the “central” characters can already be queried as follows:

df[df["centrality_rank_avg_rank"] == 1].index.tolist()

['marinelli']

Additional step: Create separate rankings of network-based and count-based metrics and combine them

In addition to a ranking that combines all metrics and the rankings derived from them, the function get_structural_ranking_measures treats network-based and count-based values separately and only then aggregates them into a combined overall ranking.

The function adds the following columns to the data frame:

(1) avg_graph_rank: a ranking based on the mean of the rankings of the network values (degree, closeness, betweenness, strength, here called weightedDegree, and eigenvector centrality, here called eigenvector)
(2) avg_content_rank: a ranking based on the mean of the rankings of the count-based values (frequency, here called numOfScenes, speech acts and words)
(3) overall_avg: the two rankings (1+2) are combined by calculating the mean
(4) overall_avg_rank: based on the overall average (3) a ranking is created

The following code is adapted accordingly to operate on the dataframe. The ranking stability measures are not implemented here.

# the columns are renamed here to match the DraCor values:
graph_ranks = ['degree_rank', 'closeness_rank', 'betweenness_rank', 'weightedDegree_rank', 'eigenvector_rank']
content_ranks = ['numOfScenes_rank', 'numOfSpeechActs_rank', 'numOfWords_rank']
avg_graph_rank = df[graph_ranks].mean(axis=1).rank(method='min')
avg_content_rank = df[content_ranks].mean(axis=1).rank(method='min')
df["avg_graph_rank"] = avg_graph_rank
df["avg_content_rank"] = avg_content_rank
df["overall_avg"] = df[["avg_graph_rank", "avg_content_rank"]].mean(axis=1)
df["overall_avg_rank"] = df["overall_avg"].rank(method='min')
df

	betweenness	degree	closeness	weightedDegree	eigenvector	numOfScenes	numOfSpeechActs	numOfWords	degree_rank	closeness_rank	...	numOfSpeechActs_rank	numOfWords_rank	centrality_rank_avg	centrality_rank_std	centrality_rank_avg_rank	centrality_rank_std_rank	avg_graph_rank	avg_content_rank	overall_avg	overall_avg_rank
id
der_prinz	0.467172	8	0.750000	20	0.320761	17	157	4002	2.0	2.0	...	2.0	2.0	2.250	0.145621	2.0	9.0	2.0	2.0	2.0	2.0
der_kammerdiener	0.000000	1	0.444444	2	0.055758	2	6	33	11.0	11.0	...	12.0	13.0	11.000	0.133631	12.0	7.0	11.0	12.0	11.5	12.0
conti	0.000000	1	0.444444	2	0.055758	2	24	604	11.0	11.0	...	10.0	8.0	10.125	0.123879	11.0	5.0	11.0	10.0	10.5	11.0
marinelli	0.246970	9	0.800000	30	0.448985	19	221	4343	1.0	1.0	...	1.0	1.0	1.125	0.044194	1.0	1.0	1.0	1.0	1.0	1.0
camillo_rota	0.000000	1	0.444444	1	0.055758	1	6	78	11.0	11.0	...	12.0	12.0	11.625	0.132583	13.0	6.0	13.0	13.0	13.0	13.0
claudia	0.045455	7	0.600000	19	0.382926	13	73	1581	3.0	5.0	...	4.0	6.0	3.875	0.169525	4.0	11.0	4.0	4.0	4.0	4.0
pirro	0.026515	5	0.545455	7	0.271944	4	25	263	6.0	8.0	...	9.0	10.0	7.625	0.188243	7.0	12.0	7.0	8.0	7.5	7.0
odoardo	0.055051	6	0.666667	15	0.354250	12	108	2441	4.0	3.0	...	3.0	3.0	3.375	0.064694	3.0	2.0	3.0	3.0	3.0	3.0
angelo	0.000000	2	0.480000	2	0.125318	2	28	487	10.0	10.0	...	8.0	9.0	9.625	0.093003	10.0	3.0	10.0	8.0	9.0	9.0
emilia	0.055051	6	0.666667	13	0.351365	7	64	1702	4.0	3.0	...	5.0	5.0	4.250	0.110801	5.0	4.0	5.0	5.0	5.0	5.0
appiani	0.003788	4	0.521739	8	0.252958	5	48	852	7.0	9.0	...	7.0	7.0	7.625	0.148467	7.0	10.0	9.0	7.0	8.0	8.0
battista	0.012121	4	0.600000	7	0.261445	4	11	152	7.0	5.0	...	11.0	11.0	8.125	0.253876	9.0	13.0	8.0	11.0	9.5	10.0
orsina	0.012121	4	0.600000	8	0.261947	6	64	2111	7.0	5.0	...	5.0	4.0	5.875	0.140749	6.0	8.0	6.0	5.0	5.5	6.0

13 rows × 24 columns
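The last two columns can be checked by hand from the two meta-rankings. The following plain-Python sketch copies avg_graph_rank and avg_content_rank from the table and recomputes the overall ranking; the variable names are hypothetical.

```python
# (avg_graph_rank, avg_content_rank) per character, copied from the table
meta_ranks = {
    "der_prinz": (2, 2), "der_kammerdiener": (11, 12), "conti": (11, 10),
    "marinelli": (1, 1), "camillo_rota": (13, 13), "claudia": (4, 4),
    "pirro": (7, 8), "odoardo": (3, 3), "angelo": (10, 8), "emilia": (5, 5),
    "appiani": (9, 7), "battista": (8, 11), "orsina": (6, 5),
}

# overall_avg: mean of the two meta-rankings
overall_avg = {cid: (graph + content) / 2
               for cid, (graph, content) in meta_ranks.items()}

# overall_avg_rank: ascending 'min' ranking, the smallest average gets rank 1
overall_avg_rank = {
    cid: 1 + sum(1 for other in overall_avg.values() if other < value)
    for cid, value in overall_avg.items()
}

print(sorted(overall_avg_rank, key=overall_avg_rank.get)[:3])
# prints ['marinelli', 'der_prinz', 'odoardo']
```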

Based on the calculation of overall_avg_rank, the “central” characters can be queried as follows:

df[df["overall_avg_rank"] == 1].index.tolist()

['marinelli']
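For use on other plays, the ranking steps could be bundled into one function. The following is a sketch under assumptions, not the Dramavis implementation: the function name top_ranked_characters is hypothetical, it expects a list of character dictionaries with the keys used above, and the three sample entries only demonstrate the call, so their ranks are not those of the full play.

```python
import pandas as pd

GRAPH_METRICS = ["degree", "closeness", "betweenness", "weightedDegree", "eigenvector"]
CONTENT_METRICS = ["numOfScenes", "numOfSpeechActs", "numOfWords"]


def top_ranked_characters(characters):
    """Return the ids of all characters sharing overall rank 1 (hypothetical helper)."""
    frame = pd.DataFrame(characters).set_index("id")
    # rank each metric, highest value first
    ranks = frame[GRAPH_METRICS + CONTENT_METRICS].rank(method="min", ascending=False)
    # two meta-rankings, then their mean, then the final ranking
    graph_rank = ranks[GRAPH_METRICS].mean(axis=1).rank(method="min")
    content_rank = ranks[CONTENT_METRICS].mean(axis=1).rank(method="min")
    overall_rank = ((graph_rank + content_rank) / 2).rank(method="min")
    return overall_rank[overall_rank == 1].index.tolist()


# Three entries with values copied from the API response (subset for illustration).
sample_characters = [
    {"id": "der_prinz", "degree": 8, "closeness": 0.75, "betweenness": 0.46717171717171724,
     "weightedDegree": 20, "eigenvector": 0.32076106311648156,
     "numOfScenes": 17, "numOfSpeechActs": 157, "numOfWords": 4002},
    {"id": "conti", "degree": 1, "closeness": 0.4444444444444444, "betweenness": 0,
     "weightedDegree": 2, "eigenvector": 0.05575792046031641,
     "numOfScenes": 2, "numOfSpeechActs": 24, "numOfWords": 604},
    {"id": "marinelli", "degree": 9, "closeness": 0.8, "betweenness": 0.24696969696969698,
     "weightedDegree": 30, "eigenvector": 0.4489846359321899,
     "numOfScenes": 19, "numOfSpeechActs": 221, "numOfWords": 4343},
]
print(top_ranked_characters(sample_characters))
```

Returning a list rather than a single id keeps ties visible: if several characters share the first rank, all of them are reported.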