DraCor: Analysis of Network Values by Genre#
Reproduction of the analysis presented in: https://dlina.github.io/Network-Values-by-Genre/
by Henny Sluyter-Gäthje
0. Initialisation#
Load libraries#
# if libraries are not installed, remove the hash from the line starting with '!'
# if you want to reproduce an analysis you can add the version number like this:
# requests==2.25.1 pandas==1.2.3 matplotlib==3.3.4
#! pip install requests pandas matplotlib
import math
from datetime import datetime
import requests
import pandas as pd
import matplotlib.pyplot as plt
Get version information for reproducibility#
pip freeze | grep "matplotlib\|pandas\|requests"
Get current date for version information of corpus and API#
print(datetime.now())
1. Preparation#
Get corpus list from DraCor API (https://dracor.org/doc/api)#
PATH_TO_DRACOR_API = "https://dracor.org/api/corpora/"
corpus_list = requests.get(PATH_TO_DRACOR_API).json()
List available corpora#
corpus_abbreviations = []
print("abbreviation, title")
for corpus_description in corpus_list:
    name = corpus_description["name"]
    print(f'{name}: {corpus_description["title"]}')
    corpus_abbreviations.append(name)
Select corpus to investigate#
The following analyses compare different genres. At the time of notebook creation (19/11/2021), genre information was available for the following corpora:
fre
ger
rus
while True:
    selected_corpus = str(input("Please choose a corpus from the list above. Enter the abbreviation: "))
    if selected_corpus not in corpus_abbreviations:
        print("The abbreviation you selected is not in the list. Please enter the abbreviation again.")
    else:
        print("Success!")
        break
2. Load data#
Retrieve and read metadata file for selected corpus#
METADAT_EXT = "/metadata"
corpus_metadata_path = PATH_TO_DRACOR_API + selected_corpus + METADAT_EXT
metadata_file = requests.get(corpus_metadata_path, headers={"accept": "text/csv"}, stream=True)
metadata_file.raw.decode_content=True
# read metadata to DataFrame
metadata_df = pd.read_csv(metadata_file.raw, sep=",", encoding="utf-8")
Check if genre information is available for selected corpus#
genre_key = "normalizedGenre"
if metadata_df[genre_key].isnull().all():
    print("""To execute the following analyses, genre information needs to be available.
The corpus you selected does NOT include any genre information. To continue, please go back
to the corpus selection and select another corpus.""")
else:
    print("Genre information is available for the corpus - analyses can be executed!")
Inspect metadata#
# print number of plays in corpus
len(metadata_df)
# print first lines
metadata_df.head()
# print column names
metadata_df.columns
3. Preprocess Data#
Filter plays#
Plays can be filtered by one of the parameters below. All plays whose value lies outside the selected range (below the minimum or above the maximum) are excluded from the following analyses.
size: number of characters
numOfActs: length of the play in acts
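The range filter keeps a play only if its value lies between the minimum and the maximum, both bounds included. A minimal sketch in plain Python, independent of the notebook's DataFrame; the toy records and the helper name `filter_by_range` are hypothetical and only illustrate the rule:

```python
def filter_by_range(plays, key, minimum, maximum):
    """Keep plays whose value for `key` lies in [minimum, maximum] (inclusive)."""
    return [play for play in plays if minimum <= play[key] <= maximum]


# Hypothetical toy records, not taken from any DraCor corpus
toy_plays = [
    {"name": "play-a", "size": 4, "numOfActs": 1},
    {"name": "play-b", "size": 12, "numOfActs": 5},
    {"name": "play-c", "size": 30, "numOfActs": 3},
]

# Keep plays with 5 to 30 characters; a value equal to a bound is kept
kept = filter_by_range(toy_plays, "size", 5, 30)
print([play["name"] for play in kept])
```

Because both bounds are inclusive, a play with exactly 30 characters stays in the sample when the maximum is 30.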
Set filter key#
execute_filter = False
possible_filter_keys = ['size', 'numOfActs']
while True:
    filter_key = input("""Please enter the parameter by which the plays should be filtered
(must be a string). If the plays should not be filtered, enter 'exit': """)
    if filter_key.lower() == "exit":
        break
    elif filter_key not in possible_filter_keys:
        print("The filter key is not valid. Choose between 'size' or 'numOfActs'")
    else:
        print("Success!")
        execute_filter = True
        break
Set filter threshold range#
if execute_filter:
    while True:
        filter_threshold_min = input("Please enter the *minimum* value by which the plays should be filtered: ")
        filter_threshold_max = input("Please enter the *maximum* value by which the plays should be filtered: ")
        # both inputs must be numeric, otherwise the int() conversion below fails
        if not filter_threshold_min.isnumeric() or not filter_threshold_max.isnumeric():
            print("Your input is not valid. Please try again and enter a number.")
        else:
            filter_threshold_min = int(filter_threshold_min)
            filter_threshold_max = int(filter_threshold_max)
            print("Success!")
            break
    # keep only plays whose value lies within the selected range (bounds included)
    metadata_df = metadata_df[(metadata_df[filter_key] >= filter_threshold_min) & (metadata_df[filter_key] <= filter_threshold_max)]
print(f"{len(metadata_df)} plays remain for the analysis")
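The validation of the two threshold inputs can also be separated from the interactive prompt, which makes it easier to check. The following is a sketch under that assumption, not part of the original notebook; the helper name `parse_range` is hypothetical. It additionally rejects a minimum that is larger than the maximum, a case in which the filter would silently remove every play.

```python
def parse_range(min_text, max_text):
    """Return (minimum, maximum) as ints, or None if the input is not usable."""
    try:
        minimum, maximum = int(min_text), int(max_text)
    except ValueError:
        # at least one of the inputs is not a whole number
        return None
    if minimum < 0 or minimum > maximum:
        # negative bounds or an empty range make no sense for counts
        return None
    return minimum, maximum


print(parse_range("2", "10"))
print(parse_range("2", "abc"))
print(parse_range("10", "2"))
```

In the interactive loop, the prompt would simply be repeated for as long as the helper returns `None`.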
Set genre keys and keys that point to special genres#
title = "name"
other_val = "Other"
# column needs to have boolean values
special_genre = "libretto"
# replace NaN values (no genre information available) with the value stored in the variable other_val
metadata_df[[genre_key]] = metadata_df[[genre_key]].fillna(other_val)
# replace genre information with information of special genre if play belongs to special genre
metadata_df.loc[metadata_df[special_genre] == True, genre_key] = special_genre
# group data by genre and show statistics
metadata_genre_grouped = metadata_df.groupby([genre_key])
metadata_genre_grouped.describe()
4. Analysis#
Steps:#
Inspection of the number of plays by genre
Selection of values for the broad analysis of overall mean and median values (values saved in values_broad_analysis)
Selection of values for the detailed analysis of mean and median values by time frame (values saved in values_detailed_analysis)
Broad analysis of all plays for the values selected for broad analysis
Preparation of the analysis of genre-specified plays: deletion of plays for which no genre information is given (value saved in other_val)
Broad analysis of genre-specified plays for the values selected for detailed analysis
Selection of time frames and threshold
Analysis of genre-specified plays per time frame for the values selected for detailed analysis
1. Inspect number of plays per genre#
print(metadata_genre_grouped.size())
metadata_genre_grouped.size().plot(kind="bar")
2. Set values for broad and detailed analysis#
The broad analysis computes the mean and median of the values set in the variable values_broad_analysis. To select different values, consult the list of column names printed above. Currently set to:
Number of Characters
Max Degree
Average Degree
Density
Average Path Length
Average Clustering Coefficient
The detailed analysis computes the mean and median by time frame (the time frames are selected further below) for the values set in the variable values_detailed_analysis. Currently set to:
Network Size (number of characters in the play)
Density
# set values for broad analysis
values_broad_analysis = ["numOfSpeakers", "maxDegree", "averageDegree", "density", "averagePathLength",
"averageClustering"]
# set values for detailed analysis
values_detailed_analysis = ["size", "density"]
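For orientation, some of the selected values can be computed by hand for a very small network. The sketch below assumes an undirected network without self-loops and uses the standard textbook definitions (density = 2m / (n(n-1)), average degree = 2m / n, with n nodes and m edges). Whether DraCor computes its `density` and `averageDegree` columns in exactly this way is an assumption here, and the characters and edges are made up.

```python
# Hypothetical co-occurrence network: each pair is one undirected edge
edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}

nodes = sorted({character for edge in edges for character in edge})
n = len(nodes)  # network size: number of characters
m = len(edges)  # number of edges

# degree of a character = number of edges it takes part in
degree = {node: sum(node in edge for edge in edges) for node in nodes}

density = 2 * m / (n * (n - 1))   # share of possible edges that exist
average_degree = 2 * m / n        # every edge contributes to two degrees
max_degree = max(degree.values())

print(f"size={n}, density={density:.3f}, "
      f"average degree={average_degree}, max degree={max_degree}")
```

With four characters, six edges are possible; four of them exist, so the density is 4/6.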
3. Perform Analysis: Investigate mean and median of values selected for broad analysis#
Mean values#
metadata_genre_grouped[values_broad_analysis].mean()
Median values#
metadata_genre_grouped[values_broad_analysis].median()
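Mean and median are reported side by side because they react differently to outliers: a single unusually large play can pull the mean up considerably while the median barely moves. A small illustration with Python's `statistics` module; the cast sizes are invented and not taken from any corpus:

```python
from statistics import mean, median

# Hypothetical numbers of speakers for five plays of one genre
cast_sizes = [8, 9, 10, 11, 12]
# The same genre with one exceptionally large play added
with_outlier = cast_sizes + [100]

print("without outlier:", mean(cast_sizes), median(cast_sizes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

A large gap between the mean and the median of a genre therefore hints at a skewed distribution rather than at a typical play.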
4. Preparation: Exclude plays without genre information#
# keep only plays with genre information (drop rows whose genre is other_val);
# the boolean mask also works if no play carries the value stored in other_val
metadata_df_genre_specified = metadata_df[metadata_df[genre_key] != other_val].copy()
metadata_genre_specified_grouped = metadata_df_genre_specified.groupby([genre_key])
4. Genre specific analysis for values specified for detailed analysis#
Mean values#
for key in values_detailed_analysis:
    # select the column before aggregating so that non-numeric columns are not averaged
    metadata_genre_specified_grouped[key].mean().plot(kind="bar", subplots=True)
    plt.show()
Median values#
for key in values_detailed_analysis:
    # select the column before aggregating so that non-numeric columns are skipped
    metadata_genre_specified_grouped[key].median().plot(kind="bar", subplots=True)
    plt.show()
5. Time specific analysis#
interval size: the number of years one time interval should span, e.g. 30 (must be a number)
threshold: a time interval is excluded for a genre if it contains fewer plays of that genre than the threshold indicates
Get info about earliest and latest play#
year_key = "yearNormalized"
# Series.min() and Series.max() skip missing years (NaN)
earliest = int(metadata_df_genre_specified[year_key].min())
latest = int(metadata_df_genre_specified[year_key].max())
print(f"Earliest play: {earliest}")
print(f"Latest play: {latest}")
Set time parameters for analysis#
while True:
    interval_size = input("Please enter the size of the time intervals (must be a number): ")
    if not interval_size.isnumeric():
        print("Your input is not valid. Please try again and enter a number.")
    else:
        interval_size = int(interval_size)
        print("Success!")
        break
while True:
    threshold = input("Please enter the threshold (must be a number): ")
    if not threshold.isnumeric():
        print("Your input is not valid. Please try again and enter a number.")
    else:
        threshold = int(threshold)
        print("Success!")
        break
Perform time specific analysis#
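def round_down_to_ten(x):
    offset = x % 10
    return x - offset


def get_time_periods(start, highest_range, period_length):
    """Build consecutive (start, end) periods that cover start..highest_range."""
    time_periods = []
    start = round_down_to_ten(start)
    end = start + period_length
    # "<=" because periods exclude their end year: if the latest play falls
    # exactly on a period boundary, one more period is needed to contain it
    while end <= highest_range:
        time_periods.append((start, end))
        start = end
        end = start + period_length
    time_periods.append((start, end))
    return time_periods


def get_time_period_fit(periods, year):
    """Return the label of the period that contains the year, or NaN if there is none."""
    for period in periods:
        # a period includes its start year and excludes its end year
        if year >= period[0] and year < period[1]:
            return f"{period[0]}-{period[1]}"
    if not math.isnan(year):
        print(f"No period found for year: {year}")
    return float("NaN")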
Print time frames#
# create time frames according to user input
time_period_name = "timePeriod"
time_periods = get_time_periods(earliest, latest, interval_size)
time_periods
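As a cross-check for the period labels, the label of a year can also be computed arithmetically, without building the list of periods first. This is an alternative sketch and not the notebook's implementation; the function names and the example years are hypothetical. It assumes, like the notebook, that the first period starts at the earliest year rounded down to the decade, and that a period includes its start year but not its end year.

```python
def round_down_to_decade(year):
    """Round a year down to the start of its decade, e.g. 1734 -> 1730."""
    return year - year % 10


def period_label(year, first_year, period_length):
    """Return the 'start-end' label of the period that contains `year`."""
    start = round_down_to_decade(first_year)
    if year < start:
        raise ValueError(f"{year} lies before the first period starting in {start}")
    # number of whole periods between the first period start and the year
    index = (year - start) // period_length
    lower = start + index * period_length
    return f"{lower}-{lower + period_length}"


# Hypothetical corpus whose earliest play dates from 1734, with 30-year periods
for year in (1734, 1759, 1760):
    print(year, period_label(year, 1734, 30))
```

The year 1760 falls into the second period because the first period covers 1730 up to, but not including, 1760.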
Split data into timeframes and filter by selected threshold#
# for each play, retrieve corresponding time frame
period_column = metadata_df_genre_specified[year_key].apply(lambda x: get_time_period_fit(time_periods, x))
metadata_df_genre_specified[time_period_name] = period_column
# apply threshold: if a genre has fewer plays in a time frame than the threshold,
# the rows of that (time frame, genre) group are excluded
metadata_df_time_genre_specified_filtered = metadata_df_genre_specified.groupby([time_period_name, genre_key]).filter(
    lambda x: len(x) >= threshold)
# group data by genre and time frame
metadata_df_time_genre_grouped = metadata_df_time_genre_specified_filtered.groupby([time_period_name, genre_key])
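The effect of the threshold can be illustrated without pandas: plays are counted per combination of time frame and genre, and only plays from combinations that reach the threshold are kept. The sketch below uses plain Python and invented data; it mirrors the idea of the group filter, not its implementation.

```python
from collections import Counter

# Hypothetical (time frame, genre) pairs, one entry per play
plays = [
    ("1730-1760", "Comedy"), ("1730-1760", "Comedy"), ("1730-1760", "Comedy"),
    ("1730-1760", "Tragedy"),
    ("1760-1790", "Tragedy"), ("1760-1790", "Tragedy"),
]
threshold = 2

# count the plays in every (time frame, genre) group
group_sizes = Counter(plays)

# keep a play only if its group reaches the threshold
kept = [play for play in plays if group_sizes[play] >= threshold]
remaining_groups = sorted(set(kept))

print(len(kept), "plays remain in", len(remaining_groups), "groups")
print(remaining_groups)
```

The single tragedy from 1730-1760 is dropped, so that group does not appear in the later plots and tables at all.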
Display number of plays that remain for each time frame after filtering#
metadata_df_time_genre_grouped.count()["name"]
Plot development of genres#
Median and mean values are calculated by time frame
Median values#
for key in values_detailed_analysis:
    print(key)
    metadata_df_time_genre_grouped[key].median().unstack().plot(figsize=(8,8)).legend(loc='upper left')
    plt.show()
Mean values#
for key in values_detailed_analysis:
    print(key)
    metadata_df_time_genre_grouped[key].mean().unstack().plot(figsize=(8,8)).legend(loc='upper left')
    plt.show()
Display as table#
Median values#
for key in values_detailed_analysis:
    print(key)
    print(metadata_df_time_genre_grouped[key].median())
    print("\n")
Mean values#
for key in values_detailed_analysis:
    print(key)
    print(metadata_df_time_genre_grouped[key].mean())
    print("\n")