DraCor: Analysis of Network Values by Genre

Reproduction of the analysis presented in: https://dlina.github.io/Network-Values-by-Genre/

by Henny Sluyter-Gäthje

0. Initialisation#

Load libraries#

# if the libraries are not installed, remove the leading hash from the '#! pip install' line below
# to reproduce an analysis, you can pin the version numbers like this:
# requests==2.25.1 pandas==1.2.3 matplotlib==3.3.4
#! pip install requests pandas matplotlib

import math
from datetime import datetime

import requests
import pandas as pd
import matplotlib.pyplot as plt

Get version information for reproducibility#

pip freeze | grep "matplotlib\|pandas\|requests"
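
On systems without `grep`, the same version information can be read from within Python. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for package in packages:
        try:
            versions[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            versions[package] = None
    return versions


versions = installed_versions(["requests", "pandas", "matplotlib"])
for package, version in versions.items():
    print(f"{package}=={version}")
```

A package that is not installed is reported as `None` instead of raising an error.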

Get current date for version information of corpus and API#

print(datetime.now())
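
`datetime.now()` returns local time without time zone information. If the timestamp should be unambiguous across machines, a time zone aware UTC timestamp in ISO 8601 format is one option. This is a small sketch; the variable name `retrieved_at` is an assumption for illustration.

```python
from datetime import datetime, timezone

# Time zone aware timestamp in ISO 8601 format, e.g. to note when the data was retrieved
retrieved_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
print(retrieved_at)
```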

1. Preparation#

Get corpus list from DraCor API (https://dracor.org/doc/api)#

PATH_TO_DRACOR_API = "https://dracor.org/api/corpora/"

corpus_list = requests.get(PATH_TO_DRACOR_API).json()

List available corpora#

corpus_abbreviations = []
print("abbreviation, title")
for corpus_description in corpus_list:
    name = corpus_description["name"]
    print(f'{name}: {corpus_description["title"]}')
    corpus_abbreviations.append(name)
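
The loop above relies on each entry of the response having a `name` and a `title` field. The sketch below shows the same idea on a hypothetical, shortened sample response, so that it runs without network access; the sample entries and the variable names `sample_response` and `corpus_titles` are assumptions for illustration.

```python
import json

# Hypothetical, shortened sample of an API response; the real response has more fields
sample_response = """
[
  {"name": "ger", "title": "German Drama Corpus"},
  {"name": "rus", "title": "Russian Drama Corpus"}
]
"""

sample_corpus_list = json.loads(sample_response)

# Map abbreviation -> title; the dict keys double as the list of valid abbreviations
corpus_titles = {corpus["name"]: corpus["title"] for corpus in sample_corpus_list}
sample_abbreviations = list(corpus_titles)

print(sample_abbreviations)
```

A dictionary keeps abbreviation and title together, which can be convenient when the title is needed again later.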

Select corpus to investigate#

The following analyses compare different genres. At the time of notebook creation (19/11/2021), genre information was available for the following corpora:

fre
ger
rus

while True:
    selected_corpus = str(input("Please choose a corpus from the list above. Enter the abbreviation: "))
    if selected_corpus not in corpus_abbreviations:
        print("The abbreviation you selected is not in the list. Please enter the abbreviation again.")
    else:
        print("Success!")
        break

2. Load data#

Retrieve and read metadata file for selected corpus#

METADATA_EXT = "/metadata"
corpus_metadata_path = PATH_TO_DRACOR_API + selected_corpus + METADATA_EXT
metadata_file = requests.get(corpus_metadata_path, headers={"accept": "text/csv"}, stream=True)
metadata_file.raw.decode_content=True

# read metadata to DataFrame
metadata_df = pd.read_csv(metadata_file.raw, sep=",", encoding="utf-8")
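
`pd.read_csv` accepts any file-like object. As a possible alternative to reading the raw stream, the text of a response can be wrapped in `io.StringIO`. The sketch below uses an invented CSV string in place of the response text, so the rows and the variable names `csv_text` and `toy_df` are assumptions.

```python
import io

import pandas as pd

# Stand-in for the text of the HTTP response; the rows are invented for illustration
csv_text = "name,normalizedGenre,size\nplay-a,Comedy,12\nplay-b,,7\n"

# The text is wrapped in StringIO so that read_csv can treat it like a file
toy_df = pd.read_csv(io.StringIO(csv_text))

# An empty CSV field is read as a missing value
print(toy_df["normalizedGenre"].isnull().tolist())
```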

Check if genre information is available for selected corpus#

genre_key = "normalizedGenre"

if metadata_df[genre_key].isnull().all():
    print("""To execute the following analyses, genre information needs to be available.
    The corpus you selected does NOT include any genre information. To continue, please go back
    to the corpus selection and select another corpus.""")
else:
    print("Genre information is available for the corpus - analyses can be executed!")

Inspect metadata#

# print number of plays in corpus
len(metadata_df)

# print first lines
metadata_df.head()

# print column names
metadata_df.columns

3. Preprocess Data#

Filter plays#

All plays whose value for the selected parameter lies outside the chosen range (minimum to maximum, both inclusive) are excluded from the following analyses. Parameters by which the plays can be filtered:

by size: number of characters
by numOfActs: length of the play in acts
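
As an illustration of this kind of range filter, the sketch below keeps only the rows whose value lies within an inclusive range. The toy data and the bounds are invented; `Series.between` is used here as a compact equivalent of combining `>=` and `<=`.

```python
import pandas as pd

# Toy metadata; the values are invented for illustration only
toy_df = pd.DataFrame({
    "name": ["play-a", "play-b", "play-c", "play-d"],
    "size": [4, 12, 25, 60],
    "numOfActs": [1, 3, 5, 5],
})

# Keep plays with 10 <= size <= 30 (both bounds are inclusive by default)
filtered_df = toy_df[toy_df["size"].between(10, 30)]
print(filtered_df["name"].tolist())
```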

Set filter key#

execute_filter = False
possible_filter_keys = ['size', 'numOfActs']
while True:
    filter_key = input("""Please enter the parameter by which the plays should be filtered 
    (must be a string). If the plays should not be filtered, enter 'exit': """)
    
    if filter_key.lower() == "exit":
        break
    elif filter_key not in possible_filter_keys:
        print("The filter key is not valid. Choose between 'size' or 'numOfActs'")
    else:
        print("Success!")
        execute_filter = True
        break

Set filter threshold range#

if execute_filter:
    while True:
        filter_threshold_min = input("Please enter the *minimum* value by which the plays should be filtered: ")
        filter_threshold_max = input("Please enter the *maximum* value by which the plays should be filtered: ")
        if not filter_threshold_min.isnumeric() or not filter_threshold_max.isnumeric():
            print("Your input is not valid. Please try again and enter a number.")
        else:
            filter_threshold_min = int(filter_threshold_min)
            filter_threshold_max = int(filter_threshold_max)
            print("Success!")
            break
    metadata_df = metadata_df[(metadata_df[filter_key] >= filter_threshold_min) & (metadata_df[filter_key] <= filter_threshold_max)].copy()
    print(f"{len(metadata_df)} plays remain for the analysis")
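
The validation of the two bounds can also be separated from `input()`, which makes it easier to test. The following is a sketch under that assumption; the helper `parse_range` is hypothetical and not part of the original notebook.

```python
def parse_range(min_text, max_text):
    """Return (minimum, maximum) as integers, or None if the input is invalid."""
    # isdecimal() is stricter than isnumeric(), which also accepts characters
    # such as '½' that int() cannot convert
    if not (min_text.isdecimal() and max_text.isdecimal()):
        return None
    minimum, maximum = int(min_text), int(max_text)
    # An empty range would remove every play
    if minimum > maximum:
        return None
    return minimum, maximum


print(parse_range("5", "20"))
print(parse_range("5", "abc"))
print(parse_range("20", "5"))
```

The pair is rejected as soon as one of the two values is invalid, and also when the minimum is larger than the maximum.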

Set genre keys and keys that point to special genres#

title = "name"
other_val = "Other"

# column needs to have boolean values
special_genre = "libretto"

# replace NaN values (no genre information available) with the value stored in the variable other_val
metadata_df[[genre_key]] = metadata_df[[genre_key]].fillna(other_val)

# replace genre information with information of special genre if play belongs to special genre
metadata_df.loc[metadata_df[special_genre] == True, genre_key] = special_genre

# group data by genre and show statistics
metadata_genre_grouped = metadata_df.groupby([genre_key])
metadata_genre_grouped.describe()
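
The two replacement steps above can be traced on a small example. The toy data below is invented for illustration; only the column names `normalizedGenre` and `libretto` are taken from the notebook.

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "name": ["play-a", "play-b", "play-c"],
    "normalizedGenre": ["Comedy", None, "Tragedy"],
    "libretto": [False, False, True],
})

# Missing genre -> "Other"
toy_df["normalizedGenre"] = toy_df["normalizedGenre"].fillna("Other")

# A play flagged as libretto gets "libretto" as its genre, whatever it was before
toy_df.loc[toy_df["libretto"], "normalizedGenre"] = "libretto"

print(toy_df["normalizedGenre"].tolist())
```

Note that the order matters: the special genre overrides both an existing genre and the `Other` placeholder.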

4. Analysis#

Steps:#

1. Inspect the number of plays per genre
2. Set the values to analyse
   * Select values for the broad analysis of overall mean and median values (saved in *values_broad_analysis*)
   * Select values for the detailed analysis of mean and median values by time frame (saved in *values_detailed_analysis*)
3. Perform the broad analysis on all plays for the values selected for the broad analysis
4. Genre-specific analysis
   * Preparation: remove plays for which no genre information is given (genre value saved in *other_val*)
   * Perform the analysis on __genre-specified plays__ for the values selected for the detailed analysis
5. Time-specific analysis
   * Select time frame size and threshold
   * Perform the analysis on genre-specified plays for the values selected for the detailed analysis, per time frame

1. Inspect number of plays per genre#

print(metadata_genre_grouped.size())
metadata_genre_grouped.size().plot(kind="bar")

2. Set values for broad and detailed analysis#

Broad analysis: mean and median are calculated for the values set in the variable values_broad_analysis. See the list of column names above to select different values. Currently set to:

- Number of Characters
- Max Degree
- Average Degree
- Density
- Average Path Length
- Average Clustering Coefficient

Detailed analysis: mean and median are calculated by time frame (to be selected) for the values set in the variable values_detailed_analysis. Currently set to:

- Network Size (number of characters in the play)
- Density
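
For orientation, two of the measures listed above can be computed by hand. For an undirected network without self-loops, with n nodes and m edges, the average degree is 2m / n and the density is 2m / (n(n - 1)). The sketch below assumes that the values in the metadata follow these standard definitions, which is not verified here; the toy network and the function names are invented for illustration.

```python
def average_degree(num_nodes, edges):
    """Average number of connections per node in an undirected network."""
    return 2 * len(edges) / num_nodes


def density(num_nodes, edges):
    """Share of possible undirected edges that are actually present."""
    possible_edges = num_nodes * (num_nodes - 1) / 2
    return len(edges) / possible_edges


# Toy network with four characters; an edge is taken to mean that two characters interact
characters = ["A", "B", "C", "D"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

print(average_degree(len(characters), edges))  # 2.0
print(density(len(characters), edges))
```

Here four of the six possible edges exist, so the density is two thirds.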

# set values for broad analysis
values_broad_analysis = ["numOfSpeakers", "maxDegree", "averageDegree", "density", "averagePathLength",
                          "averageClustering"]

# set values for detailed analysis
values_detailed_analysis = ["size", "density"]

3. Perform Analysis: Investigate mean and median of values selected for broad analysis#

Mean values#

metadata_genre_grouped[values_broad_analysis].mean()

Median values#

metadata_genre_grouped[values_broad_analysis].median()
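
Mean and median are reported side by side because they react differently to outliers. A short sketch with invented numbers, using the standard library's `statistics` module:

```python
from statistics import mean, median

# Toy cast sizes for five plays; one play has an unusually large cast
cast_sizes = [8, 10, 11, 12, 64]

print(mean(cast_sizes))    # 21
print(median(cast_sizes))  # 11
```

A single large play pulls the mean far above the typical value, while the median stays close to it.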

4. Preparation: Exclude plays without genre information#

# drop rows whose genre value is other_val ("Other"); copy so that columns can be added later
metadata_df_genre_specified = metadata_df[metadata_df[genre_key] != other_val].copy()
metadata_genre_specified_grouped = metadata_df_genre_specified.groupby([genre_key])

4. Genre specific analysis for values specified for detailed analysis#

Mean values#

for key in values_detailed_analysis:
    # select the column before aggregating so that non-numeric columns are not averaged
    metadata_genre_specified_grouped[key].mean().plot(kind="bar", subplots=True)
    plt.show()

Median values#

for key in values_detailed_analysis:
    # select the column before aggregating so that non-numeric columns are not included
    metadata_genre_specified_grouped[key].median().plot(kind="bar", subplots=True)
    plt.show()

5. Time specific analysis#

interval size: the number of years one time interval should span, e.g. 30 (must be a number)
threshold: a time interval is excluded for a genre if it contains fewer plays of that genre than the threshold indicates

Get info about earliest and latest play#

year_key = "yearNormalized"
# Series.min() / Series.max() skip missing years
earliest = int(metadata_df_genre_specified[year_key].min())
latest = int(metadata_df_genre_specified[year_key].max())
print(f"Earliest play: {earliest}")
print(f"Latest play: {latest}")

Set time parameters for analysis#

while True:
    interval_size = input("Please enter the size of the time intervals (must be a number): ")
    if not interval_size.isnumeric():
        print("Your input is not valid. Please try again and enter a number.")
    else:
        interval_size = int(interval_size)
        print("Success!")
        break

while True:
    threshold = input("Please enter the threshold (must be a number): ")
    if not threshold.isnumeric():
        print("Your input is not valid. Please try again and enter a number.")
    else:
        threshold = int(threshold)
        print("Success!")
        break

Perform time specific analysis#

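def round_down_to_ten(x):
    # e.g. 1734 -> 1730
    return x - x % 10


def get_time_periods(start, highest_range, period_length):
    # build consecutive half-open intervals [start, end) that cover all years up to highest_range
    time_periods = []
    start = round_down_to_ten(start)
    end = start + period_length
    # "<=" so that a latest year equal to an interval end still falls into a following interval
    while end <= highest_range:
        time_periods.append((start, end))
        start = end
        end = start + period_length
    time_periods.append((start, end))
    return time_periods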

def get_time_period_fit(periods, year):
    for period in periods: 
        if year >= period[0] and year < period[1]:
            return f"{period[0]}-{period[1]}"
    if not math.isnan(year):
        print(f"No period found for year: {year}")
    return float("NaN")

Print time frames#

# create time frames according to user input
time_period_name = "timePeriod"
time_periods = get_time_periods(earliest, latest, interval_size)
time_periods
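
To make the binning concrete, the following self-contained sketch builds half-open intervals (start included, end excluded) and looks up the interval for a given year. The function names `make_periods` and `find_period` are hypothetical and only mirror the idea of the functions above; the years are invented.

```python
def make_periods(earliest, latest, length):
    """Return half-open (start, end) intervals that cover earliest..latest."""
    start = earliest - earliest % 10  # round down to the decade
    periods = []
    while start <= latest:
        periods.append((start, start + length))
        start += length
    return periods


def find_period(periods, year):
    """Return the label of the interval with start <= year < end, or None."""
    for start, end in periods:
        if start <= year < end:
            return f"{start}-{end}"
    return None


periods = make_periods(1731, 1800, 30)
print(periods)
print(find_period(periods, 1760))
```

Because the end of an interval is excluded, the year 1760 belongs to the interval 1760-1790 and not to 1730-1760.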

Split data into timeframes and filter by selected threshold#

# for each play, retrieve corresponding time frame
period_column = metadata_df_genre_specified[year_key].apply(lambda x: get_time_period_fit(time_periods, x))
metadata_df_genre_specified[time_period_name] = period_column

# apply threshold: if the number of plays of a genre in one time frame is below the threshold -> exclude these plays
metadata_df_time_genre_specified_filtered = metadata_df_genre_specified.groupby([time_period_name, genre_key]).filter(
    lambda x: len(x) >= threshold)

# group data by genre and time frame
metadata_df_time_genre_grouped = metadata_df_time_genre_specified_filtered.groupby([time_period_name, genre_key])
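
The effect of `groupby(...).filter(...)` can be seen on a small example. The toy data and the threshold below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "timePeriod": ["1730-1760", "1730-1760", "1730-1760", "1760-1790"],
    "normalizedGenre": ["Comedy", "Comedy", "Tragedy", "Comedy"],
    "density": [0.5, 0.7, 0.4, 0.6],
})

toy_threshold = 2

# filter() calls the function once per group and keeps all rows of the groups
# for which it returns True
kept = toy_df.groupby(["timePeriod", "normalizedGenre"]).filter(
    lambda group: len(group) >= toy_threshold
)
print(kept)
```

Only the two comedies from 1730-1760 remain, because every other combination of time frame and genre contains a single play.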

Display number of plays that remain for each time frame after filtering#

metadata_df_time_genre_grouped.count()["name"]

Plot development of genres#

Median and mean values are calculated by time frame

Median values#

for key in values_detailed_analysis:
    print(key)
    metadata_df_time_genre_grouped[key].median().unstack().plot(figsize=(8,8)).legend(loc='upper left')
    plt.show()

Mean values#

for key in values_detailed_analysis:
    print(key)
    metadata_df_time_genre_grouped[key].mean().unstack().plot(figsize=(8,8)).legend(loc='upper left')
    plt.show()
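
The plots rely on `unstack()` to turn the grouped result into one line per genre. The sketch below shows the reshaping step on invented toy data, without plotting.

```python
import pandas as pd

# Toy data, invented for illustration
toy_df = pd.DataFrame({
    "timePeriod": ["1730-1760", "1730-1760", "1760-1790", "1760-1790"],
    "normalizedGenre": ["Comedy", "Tragedy", "Comedy", "Tragedy"],
    "density": [0.5, 0.4, 0.6, 0.3],
})

# Series indexed by (timePeriod, normalizedGenre)
medians = toy_df.groupby(["timePeriod", "normalizedGenre"])["density"].median()

# unstack() moves the inner index level (genre) into the columns:
# rows = time periods, columns = genres
table = medians.unstack()
print(table)
```

When such a table is plotted, each column becomes one line and the row index becomes the x-axis.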

Display tabular#

Median values#

for key in values_detailed_analysis:
    print(key)
    print(metadata_df_time_genre_grouped[key].median())
    print("\n")

Mean values#

for key in values_detailed_analysis:
    print(key)
    print(metadata_df_time_genre_grouped[key].mean())
    print("\n")
