DraCor API Tutorial#
0. Import libraries#
To use the DraCor API we need to send HTTP requests to the API: https://dracor.org/api. In Python, HTTP requests can be sent with the library requests (https://requests.readthedocs.io). We have to import this library:
import requests
To facilitate working with the metadata of the corpora in DraCor, we can use the library pandas (https://pandas.pydata.org/docs/). To be able to plot data with pandas, we also need to import the library matplotlib (https://matplotlib.org/). The libraries are imported below:
import pandas as pd
import matplotlib
If the imports fail, the packages must be installed first. Delete the hash in the cell below to run the pip install command and rerun the cell above.
#! pip install requests pandas matplotlib
1. Basic API calls without selected parameters#
/info: Info about the API#
We can get information about the API and DraCor data by sending GET requests to the API.
For this, we take the base URL, saved in the variable API_URL below:
# save base URL in variable
API_URL = "https://dracor.org/api/v1/"
We can then extend the URL to ask for specific information. If we want to know more about the API itself, we can append the path segment info, saved in the variable INFO_EXTENSION.
This will give us:
name
version
status
the version of the database (“existdb”)
The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/api-info
# to get the info we extend the API URL by the parameter "info"
# save "info" parameter in variable
INFO_EXTENSION = "info"
# add extension to the base URL
api_info_url = API_URL + INFO_EXTENSION
# perform get request
r = requests.get(api_info_url)
r.text
The API returns the information in JSON format, which we have to parse. We can do so by calling .json() on the response object.
# read response as JSON
parsed_response = r.json()
parsed_response
As the parsed response is a dictionary, we can e.g. get the current version of the API by accessing it with the key “version”.
print(f"The current version of the Dracor-API is {parsed_response['version']}.")
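Accessing a key that is not in the dictionary raises a KeyError. The following is a minimal sketch of a more defensive lookup with dict.get. The payload is invented for illustration: only the key names follow the list above, the values and the key "buildDate" are assumptions.

```python
import json

# Hypothetical payload: key names follow the list above, values are invented
sample_text = '{"name": "DraCor API", "version": "1.0.0", "status": "stable", "existdb": "6.2.0"}'

# json.loads turns the JSON text into a Python dictionary
info = json.loads(sample_text)

# dict.get returns a fallback instead of raising KeyError for a missing key
version = info.get("version", "unknown")
build_date = info.get("buildDate", "not reported")  # "buildDate" is a made-up key

print(f"version: {version}, build date: {build_date}")
```

The same pattern can be applied to parsed_response from the cell above.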
corpora/: List available corpora#
With the extension corpora/ saved in CORPORA_EXT_PLAIN we can display the list of corpora available in DraCor.
The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/list-corpora
# save "corpora" parameter in variable
CORPORA_EXT_PLAIN = "corpora"
# add parameter to base URL to get information about the DraCor corpora
api_corpora_url = API_URL + CORPORA_EXT_PLAIN
print(f"URL for getting the list of corpora: {api_corpora_url}\n")
# perform API request
# parse response with .json
corpus_list = requests.get(api_corpora_url).json()
# save corpus abbreviations in a list for later checking
corpus_abbreviations = []
# iterate through corpus list and print information
for corpus_description in corpus_list:
    name = corpus_description["name"]
    print(f'{name}: {corpus_description["title"]}')
    corpus_abbreviations.append(name)
Include corpora metrics#
To get not only the abbreviation and the name of the corpora but also information such as the number of speakers and the word count, we can change our API call so that these metrics are included in the response. We can do so by
adding a ? to indicate that we will pass a key-value pair to the API
adding the key-value pair like this: include=metrics
# save metrics parameter in variable
METRICS_PARAM_EXT = "?include=metrics"
# add parameter to URL to get more information about the corpora
api_corpora_metrics_urls = api_corpora_url + METRICS_PARAM_EXT
print(f"URL for getting the list of corpora with metrics included: {api_corpora_metrics_urls}\n")
# perform API request
corpora_metrics = requests.get(api_corpora_metrics_urls).json()
# iterate through corpus list and print information
# add the number of plays to the print statement which is retrieved from the corpus metrics
print("Abbreviation: Corpus Name (Number of plays)")
for corpus in corpora_metrics:
    abbreviation = corpus['name']
    num_of_plays = corpus['metrics']['plays']
    print(f"{abbreviation}: {corpus['title']} ({num_of_plays})")
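Concatenating the query string by hand works for a single fixed parameter. As a sketch of an alternative, the query string can be built from key-value pairs with urllib.parse.urlencode from the standard library. The helper name build_url is hypothetical and not part of the DraCor API or of this tutorial.

```python
from urllib.parse import urlencode

API_URL = "https://dracor.org/api/v1/"

def build_url(base, path, **params):
    """Join base URL and path; append a query string if parameters are given."""
    url = base.rstrip("/") + "/" + path.strip("/")
    if params:
        # urlencode turns {"include": "metrics"} into "include=metrics"
        url += "?" + urlencode(params)
    return url

print(build_url(API_URL, "corpora"))
print(build_url(API_URL, "corpora", include="metrics"))
```

The requests library can also do this itself: requests.get accepts a params dictionary, e.g. requests.get(api_corpora_url, params={"include": "metrics"}).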
2. API calls with selected parameters#
To get more information than included in the corpus metrics for a specific corpus, we first need to select a corpus from the list above.
1. Choose a corpusname/#
To choose a corpus in the field below, type the abbreviation of the corpus as listed above.
The name you choose is saved in the variable corpusname.
The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/list-corpus-content
for i in range(10):
    # get corpusname with user input
    # save corpusname in variable
    corpusname = str(input("Please choose a corpusname from the list above. Enter the abbreviation: "))
    if corpusname not in corpus_abbreviations:
        print("The abbreviation you selected is not in the list. Please enter the abbreviation again.")
    else:
        print("Success!")
        break
else:
    # fall back to a default corpus after ten invalid attempts
    corpusname = "swe"
# save corpora parameter (with slash) and metadata parameter in variables
CORPORA_EXT = "corpora/"
METADAT_EXT = "/metadata"
# build URL
corpus_metadata_path = API_URL + CORPORA_EXT + corpusname + METADAT_EXT
print(f"URL for getting the metadata of a specific corpus: {corpus_metadata_path}\n")
# perform request
metadata_file = requests.get(corpus_metadata_path, headers={"accept": "text/csv"}, stream=True)
metadata_file.raw.decode_content=True
# read metadata to DataFrame
metadata_df = pd.read_csv(metadata_file.raw, sep=",", encoding="utf-8")
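If streaming the raw response is inconvenient, the CSV text can also be parsed from memory; with pandas that would be pd.read_csv(io.StringIO(metadata_file.text)). The sketch below shows the same idea using only the standard library. The sample data is invented; only the column names "name", "yearNormalized" and "size" are taken from this tutorial.

```python
import csv
import io

# Invented sample in the shape of the metadata CSV
sample_csv = "name,yearNormalized,size\nplay-a,1850,12\nplay-b,1905,7\n"

# io.StringIO wraps the text so it can be read like a file
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# csv yields strings, so numeric columns have to be converted explicitly
for row in rows:
    row["yearNormalized"] = int(row["yearNormalized"])
    row["size"] = int(row["size"])

print(rows[0])
```

pandas infers numeric column types on its own, which is one reason the tutorial reads the metadata into a DataFrame.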
Inspect metadata#
# display first five lines of the retrieved metadata
metadata_df.head()
Look at information available in the metadata
# print column names available in meta data
metadata_df.columns
3. What to do with the metadata - Examples#
The library pandas allows us to plot selected columns against each other. If we want to see whether a value, e.g. the number of characters (example 1 below), develops over time, we can set the x-axis to the normalized year of each play (“yearNormalized”) and the y-axis to the number of characters (“size”) in the play.
The documentation of the plot function can be found here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
1. Plot the number of characters (“size”) in the plays over time (“yearNormalized”)
2. Plot the length of the play (“wordCountText”) over time (“yearNormalized”)
3. Get the five longest plays
   - sort plays by “wordCountText”
   - show first five
4. Get the number of plays published after 1800 and before 1900
   - filter: “yearNormalized”
   - filter-value: 1800 and 1900
   - filter-operation: > and <
5. Plot the development of the length of stage directions
   - column: “wordCountStage”
   - calculate the percentage of stage directions in relation to the word count in a new column
   - plot by time
6. Plot the proportion of female speakers over time
   - columns: “numOfSpeakers”, “numOfSpeakersFemale”
   - calculate the percentage of female speakers
   - plot by time
# 1. Get number of characters of each play and plot the normalized year
metadata_df.plot(x="yearNormalized", y="size", kind="scatter")
# 2. Plot length of play in words by the normalized year
metadata_df.plot(x="yearNormalized", y="wordCountText", kind="scatter")
# 3. Sort plays by wordcount, show first 5 entries
metadata_by_length = metadata_df.sort_values(by="wordCountText", axis=0, ascending=False)
# get the first five entries
metadata_by_length[0:5]
# 4. Get number of plays between 1800 and 1900
num_of_plays = len(metadata_df[(metadata_df["yearNormalized"] > 1800) & (metadata_df["yearNormalized"] < 1900)])
print(f"Number of plays in the selected time period: {num_of_plays}")
# 5. Calculate percentage of tokens in stage directions in relation to all tokens
# save the calculated percentages in a new column
# (multiply the ratio by 100 so that the column really holds percentages)
stage_percentage = metadata_df["wordCountStage"] / metadata_df["wordCountText"] * 100
metadata_df["wordCountStagePercentage"] = stage_percentage
metadata_df.plot(x="yearNormalized", y="wordCountStagePercentage", kind="scatter")
# 6. Display the percentage of female speakers over time
speakers_total = metadata_df["numOfSpeakers"]
# multiply the ratio by 100 so that the column really holds percentages
metadata_df["numOfSpeakersFemalePercentage"] = metadata_df["numOfSpeakersFemale"] / speakers_total * 100
metadata_df.plot(x="yearNormalized", y="numOfSpeakersFemalePercentage", kind="scatter")
3. plays/: Select text#
The API also allows us to load single texts or abstract representations of single texts, such as network data.
For this we need to extend the URL by the path segment plays/, followed by the name of the play as listed in the metadata. This will give us:
metadata of the play
network data to the play
speaker list
division into scenes and the appearing speakers
The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/play-info
# save play parameter in variable
PLAY_EXT = "/plays/"
# save column name in which the play names are stored in a variable
PLAY_KEY = "name"
for i in range(10):
    # get play name with user input
    # save play name in variable
    play_name = str(input("Please choose a text from the corpus you have chosen. Enter the text name: "))
    if play_name not in metadata_df[PLAY_KEY].values:
        print("The name you selected is not in the list. Please enter the name again.")
    else:
        print("Success!")
        break
else:
    # fall back to a default play after ten invalid attempts
    play_name = "strindberg-gillets-hemlighet"
# build URL
play_path = API_URL + CORPORA_EXT + corpusname + PLAY_EXT + play_name
print(f"URL for getting information of a specific play: {play_path}\n")
# perform request
play_info = requests.get(play_path).json()
# extract character names
character_names = [entry["name"] for entry in play_info["characters"]]
print("Character list")
print(character_names)
Exercise#
How else could we get the characters of the play? Is there a more specific API call if we only want that information?
# API call for getting a specific play is saved in the variable `play_path`
# This is what it consists of:
print(API_URL)
print(CORPORA_EXT)
print(corpusname)
print(PLAY_EXT)
print(play_name)
print(f"Combined the URL looks like this: {play_path}")
# We can add something to the URL like this:
# (just replace anything-you-want-to-add with the keyword of your choice)
# add your chosen parameter to the path to the play you selected
character_url = play_path + "/anything-you-want-to-add"
# perform request
character_info = requests.get(character_url)
if character_info.status_code != 200:
    print(f"It looks like your URL is not valid. Status code is: {character_info.status_code}")
else:
    print("Success! Here is the output:")
    print(character_info.json())
Specify single play requests#
We can specify which information of the play we want to retrieve. We do so by extending the URL by an additional parameter. If for example we want to get the spoken text of the characters, we need to extend the URL by spoken-text-by-character.
The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/play-spoken-text-by-character
You could also choose other information to retrieve e.g. stage directions and speakers, spoken text only (without the attribution to the speaker) and so on. Just have a look at the API documentation and see what parameters can be added after {playname}.
# save parameter to get more specific data to the selected play in a variable
PLAY_SPECIFICATION = "/spoken-text-by-character"
# extend play URL
play_spec_path = play_path + PLAY_SPECIFICATION
print(f"URL for getting specified information of a play: {play_spec_path}\n")
# perform request
play_spec = requests.get(play_spec_path).json()
Example#
We can now perform some analyses with the text we retrieved. With just some minor preprocessing (tokenization) we can ask:
Who talks most often about love or guns?
Since the characters are also annotated with gender, we can explore simple gender-related questions, such as:
Do men talk more often about swords, guns, weapons?
Do women talk more often about love, roses, children?
For this we need to import the Natural Language Toolkit nltk (https://www.nltk.org/) or any other NLP library, e.g. spaCy, to tokenize the spoken text. We can then calculate the frequencies by character and sum them up by gender. For counting the selected words, we can use the library collections (https://docs.python.org/3/library/collections.html).
from nltk.tokenize import word_tokenize
from collections import Counter
Uncomment the lines below if the import of nltk fails.
# !pip install nltk
# import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')  # may be needed instead of 'punkt' on newer NLTK releases
# save keyword for a character's text in a variable
TEXT_KEY = "text"
# save new column names in variables
ANNO_KEY = "text annotation"
FRQ_KEY = "frequencies"
# tokenize and count words
# iterate characters
for character_entry in play_spec:
    # tokenize speech acts
    annotation = [word_tokenize(sen) for sen in character_entry[TEXT_KEY]]
    # save tokenized text and word frequencies
    character_entry[ANNO_KEY] = [word for sen in annotation for word in sen]
    character_entry[FRQ_KEY] = Counter(character_entry[ANNO_KEY])
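The counts above are case-sensitive, so “Rose” at the start of a sentence and “rose” are counted separately. The following is a rough sketch of a tokenizer that lowercases the text first, using only the standard library. It is a simplification and not a replacement for nltk's word_tokenize; the function name simple_tokenize and the sample sentences are invented for illustration.

```python
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase the text and return all runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Invented speech acts for illustration
speeches = ["En ros, en ros!", "Var är mitt gevär?"]

# flatten the tokens of all speech acts into one list and count them
tokens = [tok for speech in speeches for tok in simple_tokenize(speech)]
frequencies = Counter(tokens)

print(frequencies.most_common(3))
```

Exact-match lookups still miss inflected forms (e.g. “rosen” for “ros”); covering those would need stemming or lemmatization.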
Create word list#
Create your list of words below. Each word must be enclosed in quotation marks ("word") and separated from the next word by a comma.
word_list = ["rose", "blom", "barn", "vapen", "gevär", "pengar"]
Analyze#
# save character name key in a variable
NAME_KEY = "label"
# get frequencies of the words in the word list by character
# iterate characters
for character_entry in play_spec:
    # get character name
    print(character_entry[NAME_KEY])
    found = False
    # for each word in the word list, look up the frequency in the speech of the current character
    for word in word_list:
        if word in character_entry[FRQ_KEY]:
            print(f"{word}: {character_entry[FRQ_KEY][word]}")
            found = True
    if not found:
        print("None of the words found in the speech of this character.")
    print("-"*50)
# save the gender key for the characters in a variable
GENDER_KEY = "gender"
# create results dictionary
# for each word the frequency by gender is saved
words_by_gender = {word: {"MALE": 0, "FEMALE": 0, "UNKNOWN":0} for word in word_list}
# get frequencies of the words in the word list by character
# add frequency to the gender of the character
# iterate characters
for character_entry in play_spec:
    # retrieve gender
    gender = character_entry[GENDER_KEY]
    # for each word in the word list, look up the frequency in the speech of the current character
    # add frequency to the gender of the character
    for word in word_list:
        if word in character_entry[FRQ_KEY]:
            if gender in words_by_gender[word]:
                words_by_gender[word][gender] += character_entry[FRQ_KEY][word]
# convert results dictionary into a DataFrame
gender_df = pd.DataFrame(words_by_gender)
gender_df.plot(kind="bar", figsize=(12,10))
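Raw counts depend on how much text each group speaks, so a group with more lines will tend to show higher counts for any word. As a hedged sketch, the counts can be put in relation to the total number of tokens per gender. The entries below are invented; they only mimic the "gender" and "frequencies" keys used above, and the helper per_thousand is hypothetical.

```python
from collections import Counter, defaultdict

# Invented entries in the shape used above ("gender" and "frequencies" keys)
entries = [
    {"gender": "MALE", "frequencies": Counter({"vapen": 2, "och": 10})},
    {"gender": "FEMALE", "frequencies": Counter({"barn": 3, "och": 5})},
    {"gender": "MALE", "frequencies": Counter({"barn": 1, "och": 4})},
]

# sum up the word counts of all characters per gender
totals = defaultdict(Counter)
for entry in entries:
    totals[entry["gender"]].update(entry["frequencies"])

def per_thousand(word, gender):
    """Occurrences of word per 1,000 tokens spoken by characters of this gender."""
    token_count = sum(totals[gender].values())
    return 1000 * totals[gender][word] / token_count if token_count else 0.0

for gender in ("MALE", "FEMALE"):
    print(gender, round(per_thousand("barn", gender), 1))
```

With a single play the numbers stay small, so such relative frequencies should be read as exploratory only.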
Generic function to handle the requests and parse the result#
Requesting data from the API in most cases follows a pattern:
construct the request URL, e.g. use https://dracor.org/api/ as a base and attach corpusname, playname, a method (e.g. characters) and in some cases a response format (e.g. csv)
use this constructed URL in a request to the endpoint
retrieve the data and parse it into a format that can then be used in the program
By defining a function, this process can be sped up. Instead of repeating the code, a function can be defined that takes corpusname, playname and method as arguments. In the example we assume that the response will be JSON.
Parsing of JSON is done with the package json (https://docs.python.org/3/library/json.html), which needs to be imported:
import json
The function accepts parameters as keyword arguments, e.g. corpusname="ger". The following arguments are supported:
apibase (default: https://dracor.org/api/v1/)
corpusname
playname
method
parse_json: True or False (default); if True, the response is parsed as JSON
def get(**kwargs):
    # supported keyword arguments:
    #   corpusname, playname, method
    #   apibase (default: "https://dracor.org/api/v1/")
    #   parse_json (default: False)
    # a different apibase could be set, e.g. https://staging.dracor.org/api/
    # [not recommended, please use the production server]
    if "apibase" in kwargs:
        if kwargs["apibase"].endswith("/"):
            apibase = kwargs["apibase"]
        else:
            apibase = kwargs["apibase"] + "/"
    else:
        # use default
        apibase = "https://dracor.org/api/v1/"
    if "corpusname" in kwargs and "playname" in kwargs:
        # used for corpora/{corpusname}/plays/{playname}/
        if "method" in kwargs:
            request_url = apibase + "corpora/" + kwargs["corpusname"] + "/plays/" + kwargs["playname"] + "/" + kwargs["method"]
        else:
            request_url = apibase + "corpora/" + kwargs["corpusname"] + "/plays/" + kwargs["playname"]
    elif "corpusname" in kwargs and "playname" not in kwargs:
        if "method" in kwargs:
            request_url = apibase + "corpora/" + kwargs["corpusname"] + "/" + kwargs["method"]
        else:
            request_url = apibase + "corpora/" + kwargs["corpusname"]
    elif "method" in kwargs and "corpusname" not in kwargs and "playname" not in kwargs:
        request_url = apibase + kwargs["method"]
    else:
        # nothing set: fall back to the info endpoint
        request_url = apibase + "info"
    # send the request
    r = requests.get(request_url)
    if r.status_code == 200:
        # success!
        if kwargs.get("parse_json", False):
            return json.loads(r.text)
        return r.text
    raise Exception("Request was not successful. Server returned status code: " + str(r.status_code))
The function can now be called as follows below. The function call requests the Info about the API /api/info:
get(method="info", parse_json=True)
To request the metrics of a single play (/api/corpora/{corpusname}/plays/{playname}/metrics) use the following function call:
get(corpusname="ger",playname="lessing-emilia-galotti",method="metrics",parse_json=True)
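The URL construction inside get() can also be separated from the request itself, which makes it possible to check the URL without contacting the server. The following is a sketch under that assumption; the function name build_request_url is hypothetical, and unlike get() it raises an error when a playname is passed without a corpusname instead of falling back to another endpoint.

```python
DEFAULT_API_BASE = "https://dracor.org/api/v1/"

def build_request_url(corpusname=None, playname=None, method=None, apibase=DEFAULT_API_BASE):
    """Assemble the request URL from its optional parts."""
    if playname and not corpusname:
        # design choice of this sketch: fail early instead of guessing
        raise ValueError("playname requires corpusname")
    segments = []
    if corpusname:
        segments += ["corpora", corpusname]
    if playname:
        segments += ["plays", playname]
    if method:
        segments.append(method)
    if not segments:
        # same fallback as in get(): nothing set means the info endpoint
        segments.append("info")
    return apibase.rstrip("/") + "/" + "/".join(segments)

print(build_request_url(corpusname="ger", playname="lessing-emilia-galotti", method="metrics"))
```

Explicit keyword parameters with defaults also let Python reject misspelled argument names, which **kwargs silently accepts.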
Example: Gender of Characters#
In the following example we count the characters that are tagged as “MALE” and “FEMALE” in a corpus.
# get all plays in a corpus
if corpusname != "":
    # get data of a single corpus and store only the list of plays in the variable "plays"
    plays = get(corpusname=corpusname, parse_json=True)["plays"]
# set counters for male and female characters in the corpus
overallMale = 0
overallFemale = 0
# check if a corpusname was entered
if corpusname != "":
    # iterate over the plays
    for play in plays:
        # get the characters of a play by using the API endpoint corpora/{corpusname}/plays/{playname}/characters
        characters = get(corpusname=corpusname, playname=play["name"], method="characters", parse_json=True)
        # reset the counters for male and female characters
        cntMale = 0
        cntFemale = 0
        # iterate over the characters and increment the counters
        for character in characters:
            gender = character["gender"]
            if gender == "MALE":
                cntMale = cntMale + 1
            elif gender == "FEMALE":
                cntFemale = cntFemale + 1
        # report the result per play
        print(play["name"] + ": " + "female characters: " + str(cntFemale) + "; male characters: " + str(cntMale))
        # increment the overall counters
        overallMale = overallMale + cntMale
        overallFemale = overallFemale + cntFemale
    # report the results on corpus level
    print("\n\nThere are " + str(overallFemale) + " female and " + str(overallMale) + " male characters in the corpus '" + corpusname + "'")
else:
    raise Exception("Please enter a corpus!")
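The per-play counting can be written more compactly with collections.Counter, which also keeps track of any gender value other than “MALE” and “FEMALE” without extra branches. A minimal sketch follows; the character list is invented and only assumes that each entry has a "gender" key, as in the loop above.

```python
from collections import Counter

# Invented character list; only the "gender" key is assumed from the loop above
characters = [
    {"name": "a", "gender": "MALE"},
    {"name": "b", "gender": "FEMALE"},
    {"name": "c", "gender": "MALE"},
    {"name": "d", "gender": "UNKNOWN"},
]

# count how often each gender value occurs
gender_counts = Counter(character.get("gender") for character in characters)

print(f"female characters: {gender_counts['FEMALE']}; male characters: {gender_counts['MALE']}")
```

Counters of several plays can be added with +, so the corpus-level totals could be accumulated in one Counter as well.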