Gender Representation in German Plays

by Sandra Densch-Glazov, Leonie Wichers, Kyung Yun Choi and Benedikt Schuh

Introduction

This Jupyter Notebook generates visualizations that provide a starting point for analyzing gender relations and gender distribution in a selected play from the German corpus of the DraCor dataset. DraCor is a showcase for the concept of Programmable Corpora. It revolves around an API that provides data extracted from TEI-encoded corpora of plays in (mostly) European languages.

To generate the visualization, execute the following four code cells (Steps 0 to 3) in order.

Step 0: Preparation

This code cell imports the required libraries, defines helper functions and requests the corpus metadata from the API. You only need to execute the cell.

import pandas as pd
import altair as alt
import ipywidgets as widgets
import networkx as nx
import nx_altair as nxa
from pydracor import Corpus, Play  # only these two classes are used below
import requests

def minmaxWords(word_counts):
    """Return the smallest and largest word count as a (min, max) tuple."""
    return (min(word_counts), max(word_counts))

def set_character_name_and_size(graphR, graphO):
    """Copy name, word count, node size and speech share onto the relation graph's nodes."""
    words = nx.get_node_attributes(graphO, 'Number of spoken words')
    _, max_words = minmaxWords(words.values())
    sum_words = sum(words.values())
    for node_id in graphR.nodes:
        node = graphR.nodes[node_id]
        node['Name'] = node['label']
        node['Spoken words'] = graphO.nodes[node_id]['Number of spoken words']
        # scale the node size relative to the character with the most spoken words
        node['Size'] = node['Spoken words'] / max_words * 200 + 25
        node['Speech Percentage'] = round(node['Spoken words'] / sum_words * 100, 2)
    return graphR

def relation_name_mapping():
    relation = pd.DataFrame(
        {'Relation': ['parent_of', 'lover_of', 'related_with', 'associated_with', 'siblings', 'spouses', 'friends']}
    )
    relation_name_mapping = {
        'parent_of': 'Parent-child',
        'lover_of': 'Lovers',
        'related_with': 'Related',
        'associated_with': 'Associated',
        'siblings': 'Siblings',
        'spouses': 'Spouses',
        'friends': 'Friends'
    }
    relation['Relation_Display'] = relation['Relation'].map(relation_name_mapping)
    return relation

def gender_name_mapping():
    gender = pd.DataFrame({'Gender': ['MALE','FEMALE', 'UNKNOWN']})
    gender_name_mapping = {
        'MALE': 'Male',
        'FEMALE': 'Female',
        'UNKNOWN': 'Unknown',
    }
    gender['Gender_Display'] = gender['Gender'].map(gender_name_mapping)
    return gender

def get_words_by_gender(nodes):
    # sum the spoken words per gender; order matches gender_name_mapping()
    female_words = 0
    male_words = 0
    unknown_words = 0
    for node_iterator in nodes:
        node = nodes[node_iterator]
        if node['Gender'] == 'FEMALE':
            female_words += node['Number of spoken words']
        elif node['Gender'] == 'MALE':
            male_words += node['Number of spoken words']
        elif node['Gender'] == 'UNKNOWN':
            unknown_words += node['Number of spoken words']
    return [male_words, female_words, unknown_words]

def chunked_title(title):
    chunked_title = []
    current_chunk = ""
    for word in title.split():
        if len(current_chunk) + len(word) <= 70:
            current_chunk += f"{word} "
        else:
            chunked_title.append(current_chunk.strip())
            current_chunk = f"{word} "
    chunked_title.append(current_chunk.strip())
    return chunked_title

german_corpus = Corpus('ger')
german_metadata = pd.DataFrame(german_corpus.metadata())
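The node sizing and speech share computed in set_character_name_and_size can be illustrated without any graph library. The following is a minimal sketch, assuming invented character names and word counts; the helper names node_size and speech_percentage are hypothetical and not part of the notebook. The guards against division by zero are an addition that the notebook's helper does not have.

```python
def node_size(words, max_words, scale=200, offset=25):
    """Map a word count to a node size, mirroring words / max * 200 + 25."""
    if max_words == 0:  # guard added for this sketch only
        return offset
    return words / max_words * scale + offset


def speech_percentage(words, total_words):
    """Share of all spoken words, rounded to two decimals."""
    if total_words == 0:  # guard added for this sketch only
        return 0.0
    return round(words / total_words * 100, 2)


# Hypothetical word counts, invented for illustration only.
spoken_words = {"Character A": 600, "Character B": 300, "Character C": 100}
max_words = max(spoken_words.values())
total_words = sum(spoken_words.values())

sizes = {name: node_size(w, max_words) for name, w in spoken_words.items()}
shares = {name: speech_percentage(w, total_words) for name, w in spoken_words.items()}

print(sizes)
print(shares)
```

The constant offset keeps characters with very few words visible, while the scale factor bounds the largest node at 225.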

Step 1: Choose a play

  • After the following code is executed, a dropdown menu appears below it, from which any play can be selected.

  • You can also search for a specific play by typing the first letters of its title.

  • If the visualization has already been generated and you would like to select a different play, select it from the dropdown menu and execute Steps 2 and 3 again.

dropdown_items = dict(zip(german_metadata['title'], german_metadata['id']))
dropdown_items = dict(sorted(dropdown_items.items()))

dropdown = widgets.Dropdown(
    options=dropdown_items,
    description='Select play:',
)

dropdown
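One property of the title-to-id mapping above is worth keeping in mind: a dict keyed by title keeps only one entry per title. The following is a minimal sketch with invented titles and ids (not taken from the corpus) that shows the effect and one possible way to keep labels unique. Whether the German corpus actually contains duplicate titles is not checked here.

```python
# Hypothetical metadata rows; titles and ids are invented for illustration.
rows = [
    {"title": "Play B", "id": "id-2"},
    {"title": "Play A", "id": "id-1"},
    {"title": "Play A", "id": "id-3"},  # same title, different play
]

titles = [r["title"] for r in rows]
ids = [r["id"] for r in rows]

# Same construction as in the cell above: later duplicates overwrite earlier ones.
plain = dict(sorted(dict(zip(titles, ids)).items()))

# Possible variant: append the id so that every label stays unique.
unique = dict(sorted((f'{r["title"]} ({r["id"]})', r["id"]) for r in rows))

print(plain)
print(unique)
```

With the plain mapping, only the last play with a given title remains selectable; the variant keeps all three entries at the cost of longer labels.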

Step 2: Request network metrics from API

This cell requests the network data needed to visualize the relationship network from the API. This may take a while.

play_id = dropdown.value
play = Play(play_id)
try:
    relations_graphml = play.relations_graphml()
    # networkX doesn't support mix of directed+undirected Graphs & nx_altair's arrows look broken
    # workaround: make graph undirected
    relations_graphml = relations_graphml.replace('directed="true"', 'directed="false"')
    cooccurence_graphml = play.graphml()
except requests.HTTPError:
    relations_graphml = None
    cooccurence_graphml = None
    print('The API does not contain a relationship network to visualize for this play. Please choose another one.')
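The effect of the string replacement used as a workaround above can be checked on a small example. The following is a minimal sketch, assuming a heavily simplified, hypothetical GraphML document; real API responses carry more attributes and data keys. It uses only the standard library to confirm that every edge ends up marked as undirected.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified GraphML; invented for illustration only.
graphml = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph id="G" edgedefault="undirected">
    <node id="a"/>
    <node id="b"/>
    <node id="c"/>
    <edge source="a" target="b" directed="true"/>
    <edge source="b" target="c" directed="false"/>
  </graph>
</graphml>"""

# Same workaround as in the cell above: flip every per-edge flag to undirected.
patched = graphml.replace('directed="true"', 'directed="false"')

ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(patched)
flags = [edge.get("directed") for edge in root.findall(".//g:edge", ns)]

print(flags)
```

The replacement discards edge direction, so a relation such as parent_of no longer records which character is the parent.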

Step 3: Visualize data

When executed, this code block generates the visualization of gender distribution and relations for the chosen play and displays it below the code block.

Important information for working with the visualization:

  • There are two filters on the left-hand side:

    • Gender filter: select the gender(s) you would like to have displayed.

    • Relation filter: select the relation(s) you would like to have displayed.

  • Additional information in tooltips:

    • Node tooltip: shows the name of the character, the number of spoken words and the percentage of speech

    • Pie chart tooltips: show the number of characters (or spoken words) by gender and the corresponding percentage

  • A circular layout is used for the character-relation network because it results in a better structured network that is easier to read. The arrangement of nodes relies solely on the order in which the characters are listed in the data source and does not encode any structures from the play.
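The idea behind a circular layout can be sketched in a few lines. This is not the networkx implementation, only an illustration of the principle under the assumption that nodes are spaced evenly on a circle in their listed order; the function name circular_positions and the character names are hypothetical.

```python
import math


def circular_positions(names, radius=1.0):
    """Place names evenly on a circle, in the order in which they are listed."""
    n = len(names)
    return {
        name: (
            radius * math.cos(2 * math.pi * i / n),
            radius * math.sin(2 * math.pi * i / n),
        )
        for i, name in enumerate(names)
    }


# Hypothetical character list, invented for illustration only.
positions = circular_positions(["A", "B", "C", "D"])
for name, (x, y) in positions.items():
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

Because the angle depends only on the list index, swapping two characters in the input changes their positions but says nothing about their role in the play.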

if relations_graphml is not None:
    ############################## Network Chart ##############################
    # parse graphs
    relations_graph = nx.parse_graphml(relations_graphml)
    cooccurence_graph = nx.parse_graphml(cooccurence_graphml)
    # add Name attribute to nodes
    relations_graph = set_character_name_and_size(relations_graph, cooccurence_graph)
    # define the graph layout
    layout = nx.circular_layout(relations_graph)

    # draw base graph with nx_altair
    base = nxa.draw_networkx(
        relations_graph,
        pos=layout,
        node_tooltip=['Name','Spoken words', 'Speech Percentage'],
        node_color='lightgray',
        edge_color='Relation',
        node_size ='Size',
        width=4
    )

    # get the edge layer
    edges = base.layer[0]
    # get the node layer
    nodes = base.layer[1]

    # define relation filter
    relation = relation_name_mapping()
    relation_selection = alt.selection_point(fields=['Relation'], toggle="true")
    relation_color = alt.condition(
        relation_selection,
        alt.Color('Relation:N', legend=None),
        alt.value('lightgray')
    )
    relation_filter = alt.Chart(
        relation,
        title=alt.TitleParams('Filter relation', anchor='start')
    ).mark_rect(cursor='pointer').encode(
        y=alt.Y('Relation_Display', title=''),
        color=relation_color
    ).add_params(relation_selection)

    # encode relation as edge color and add relationship filter
    edges = edges.encode(color=relation_color).transform_filter(relation_selection)

    # define gender filter
    gender = gender_name_mapping()
    gender_selection = alt.selection_point(fields=['Gender'], toggle="true")
    gender_color = alt.condition(
        gender_selection,
        alt.Color('Gender:N', legend=None),
        alt.value('lightgray')
    )
    gender_shape = alt.Shape('Gender:N', legend=None)
    gender_filter = alt.Chart(
        gender,
        title=alt.TitleParams('Filter gender', anchor='start')
    ).mark_point(
        size=300,
        cursor='pointer',
        filled=True,
        opacity=1
    ).encode(
        y=alt.Y('Gender_Display', title=''),
        color=gender_color,
        shape=gender_shape
    ).add_params(gender_selection)

    # encode gender as node shape+color and add gender filter
    nodes = nodes.encode(
        color=gender_color,
        fill=gender_color,
        shape=gender_shape
    ).add_params(gender_selection)

    # layer network chart
    network_chart = (edges + nodes).properties(
        width=400,
        height=400
    )
    network_chart_with_filters = ((gender_filter & relation_filter) | network_chart)

    ############################## Pie Charts ##############################

    # count characters by gender
    play_metadata = german_metadata[german_metadata["id"] == play_id].reset_index()
    speakers = play_metadata[['num_of_speakers_male', 'num_of_speakers_female', 'num_of_speakers_unknown']]
    numOfSpeakers = play_metadata.at[0,'num_of_speakers']
    gender['Characters'] = speakers.loc[0,:].values.tolist()
    gender['Percentage of Characters'] = round(gender['Characters'] / numOfSpeakers * 100, 2)
    gender_distribution_pie_chart = alt.Chart(gender, title='Number of characters by gender').mark_arc().encode(
        theta='Characters',
        color=alt.Color('Gender:N', legend=None),
        tooltip=['Characters', 'Percentage of Characters']
    ).properties(
        width=200,
        height=200
    )

    # aggregate spoken words by gender
    gender['Spoken words'] = get_words_by_gender(cooccurence_graph.nodes)
    wordcountStage = play_metadata.at[0,'word_count_sp']
    gender['Percentage of spoken words'] = round(gender['Spoken words']/wordcountStage*100, 2)
    spoken_words_pie_chart = alt.Chart(gender, title='Number of spoken words by gender').mark_arc().encode(
        theta='Spoken words',
        color=alt.Color('Gender:N', legend=None),
        tooltip=['Spoken words' ,'Percentage of spoken words']
    ).properties(
        width=200,
        height=200
    )

    stacked_pie_charts = (gender_distribution_pie_chart & spoken_words_pie_chart)
    
    ############################## Final Chart ##############################

    title = chunked_title(f"Gender distribution and relations in \"{dropdown.label}\"")
    final_chart = (network_chart_with_filters | stacked_pie_charts)
    final_chart = final_chart.configure_view(
            strokeWidth=0 # remove border
    ).configure_axis(
        domainOpacity=0 # remove axis
    ).properties(
        title=alt.TitleParams(
            title,
            anchor='middle',
            fontSize=20
        )
    )
else:
    final_chart = 'no visualization available'

final_chart
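The aggregation behind the second pie chart (spoken words by gender) can also be reproduced on plain data. The following is a minimal sketch with invented characters and word counts. It divides by the sum of the aggregated words, which is an assumption of this sketch; the notebook divides by the word_count_sp value from the corpus metadata instead.

```python
from collections import defaultdict

# Hypothetical characters; names, genders and counts are invented for illustration.
characters = [
    {"name": "Character A", "gender": "MALE", "words": 500},
    {"name": "Character B", "gender": "FEMALE", "words": 300},
    {"name": "Character C", "gender": "FEMALE", "words": 150},
    {"name": "Character D", "gender": "UNKNOWN", "words": 50},
]

# Sum the spoken words per gender.
totals = defaultdict(int)
for character in characters:
    totals[character["gender"]] += character["words"]

# Assumption of this sketch: percentages are relative to the aggregated sum.
all_words = sum(totals.values())
summary = {
    gender: {
        "words": totals[gender],
        "percentage": round(totals[gender] / all_words * 100, 2),
    }
    for gender in ("MALE", "FEMALE", "UNKNOWN")
}

print(summary)
```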

Source: German Drama Corpus provided by the Drama Corpus (DraCor) Project as of 08.03.2024. Licensed under CC0.

Fischer, Frank, et al. (2019). Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama. In Proceedings of DH2019: “Complexities”, Utrecht University, doi:10.5281/zenodo.4284002.