DraCor API Tutorial

0. Import libraries#

To use the DraCor API, we need to send HTTP requests to it: https://dracor.org/api. In Python, HTTP requests can be sent with the library requests (https://requests.readthedocs.io). We have to import this library:

import requests

To facilitate working with the metadata of the corpora in DraCor, we can use the library pandas (https://pandas.pydata.org/docs/). To be able to plot data with pandas, we also need to import the library matplotlib (https://matplotlib.org/). The libraries are imported below:

import pandas as pd
import matplotlib

If the imports fail, the packages must be installed first. Delete the hash in the cell below to run the pip install command and rerun the cell above.

#! pip install requests pandas matplotlib

1. Basic API calls without selected parameters#

`/info`: Info about the API#

We can get information about the API and DraCor data by sending GET requests to the API.

For this, we take the base URL, saved in the variable API_URL below:

# save base URL in variable  
API_URL = "https://dracor.org/api/v1/"

We can then extend the URL to ask for specific information. If we want to know more about the API itself, we can append the path segment info, saved in the variable INFO_EXTENSION.

This will give us:

name
version
status
the version of the database (“existdb”)

The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/api-info

# to get the info we extend the API URL by the parameter "info"
# save "info" parameter in variable
INFO_EXTENSION = "info"

# add extension to the base URL
api_info_url = API_URL + INFO_EXTENSION

# perform get request
r = requests.get(api_info_url)
r.text
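The concatenation above works because API_URL ends with a slash and INFO_EXTENSION does not start with one. A small helper can make the result independent of how the pieces are written. The following is a minimal sketch; join_url is a hypothetical helper name introduced for this illustration, not part of requests or the DraCor API.

```python
def join_url(base, *segments):
    """Join a base URL and path segments with exactly one slash between them."""
    # strip slashes from the pieces and skip segments that are empty afterwards
    cleaned = [str(segment).strip("/") for segment in segments]
    parts = [base.rstrip("/")] + [segment for segment in cleaned if segment]
    return "/".join(parts)


# both calls produce a URL without doubled or missing slashes
info_url = join_url("https://dracor.org/api/v1/", "info")
metadata_url = join_url("https://dracor.org/api/v1", "corpora/", "swe", "/metadata")
print(info_url)
print(metadata_url)
```

The resulting string can be passed to requests.get() in the same way as api_info_url.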

The API returns the information in JSON format, which we have to parse. To do so, we can call .json() on the response object.

# read response as JSON
parsed_response = r.json()
parsed_response

As the parsed response is a dictionary, we can e.g. get the current version of the API by accessing it with the key “version”.

print(f"The current version of the DraCor API is {parsed_response['version']}.")

`corpora/`: List available corpora#

With the extension corpora, saved in CORPORA_EXT_PLAIN, we can display the list of corpora available in DraCor.

The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/list-corpora

# save "corpora" parameter in variable
CORPORA_EXT_PLAIN = "corpora"
# add parameter to base URL to get information about the DraCor corpora 
api_corpora_url = API_URL + CORPORA_EXT_PLAIN
print(f"URL for getting the list of corpora: {api_corpora_url}\n")

# perform API request
# parse response with .json
corpus_list = requests.get(api_corpora_url).json()

#save corpus abbreviations in a list for later checking 
corpus_abbreviations = []

# iterate through corpus list and print information
for corpus_description in corpus_list:
    name = corpus_description["name"]
    print(f'{name}: {corpus_description["title"]}')
    corpus_abbreviations.append(name)

Include corpora metrics#

To get not only the abbreviation and the name of the corpora but also information such as the number of speakers or the word count, we can change our API call so that these metrics are included in the response. We can do so by

adding a ? to indicate that we will pass a key-value pair to the API
adding the key-value pair like this: include=metrics

# save metrics parameter in variable
METRICS_PARAM_EXT = "?include=metrics"

# add parameter to URL to get more information about the corpora 
api_corpora_metrics_urls = api_corpora_url + METRICS_PARAM_EXT
print(f"URL for getting the list of corpora with metrics included: {api_corpora_metrics_urls}\n")

# perform API request
corpora_metrics = requests.get(api_corpora_metrics_urls).json()

# iterate through corpus list and print information
# add the number of plays to the print statement which is retrieved from the corpus metrics
print("Abbreviation: Corpus Name (Number of plays)")
for corpus in corpora_metrics:
    abbreviation = corpus['name']
    num_of_plays = corpus['metrics']['plays']
    print(f"{abbreviation}: {corpus['title']} ({num_of_plays})")
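The query string above is written by hand. As a sketch of an alternative, the standard library function urllib.parse.urlencode can build it from key-value pairs and also escapes special characters. The helper name with_query is an assumption made for this illustration.

```python
from urllib.parse import urlencode


def with_query(url, **params):
    """Append the given key-value pairs to the URL as a query string."""
    query = urlencode(params)
    # only add the "?" if there is something to append
    return f"{url}?{query}" if query else url


metrics_url = with_query("https://dracor.org/api/v1/corpora", include="metrics")
print(metrics_url)
```

requests accepts such key-value pairs directly as well, through the params argument of requests.get(), e.g. requests.get(api_corpora_url, params={"include": "metrics"}).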

2. API calls with selected parameters#

To get more information than included in the corpus metrics for a specific corpus, we first need to select a corpus from the list above.

1. Choose a `corpusname/`#

To choose a corpus in the field below, type the abbreviation of the corpus as listed above. The name you choose is saved in the variable corpusname.

The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/list-corpus-content

for i in range(10):
    # get corpusname with user input
    # save corpusname in variable
    corpusname = str(input("Please choose a corpusname from the list above. Enter the abbreviation: "))
    if corpusname not in corpus_abbreviations:
        print("The abbreviation you selected is not in the list. Please enter the abbreviation again.")
    else:
        print("Success!")
        break
else:
    corpusname = "swe"

# save corpora parameter (with slash) and metadata parameter in variables
CORPORA_EXT = "corpora/"
METADAT_EXT = "/metadata"

# build URL
corpus_metadata_path = API_URL + CORPORA_EXT + corpusname + METADAT_EXT
print(f"URL for getting the metadata of a specific corpus: {corpus_metadata_path}\n")


# perform request
metadata_file = requests.get(corpus_metadata_path, headers={"accept": "text/csv"}, stream=True)
metadata_file.raw.decode_content=True

# read metadata to DataFrame
metadata_df = pd.read_csv(metadata_file.raw, sep=",", encoding="utf-8")

Inspect metadata#

# display first five lines of the retrieved metadata 
metadata_df.head()

Look at information available in the metadata

# print column names available in meta data 
metadata_df.columns

3. What to do with the metadata - Examples#

The library pandas allows us to plot selected columns against each other. If we want to see whether a value, e.g. the number of characters (as in 1), develops over time, we can set the x-axis to the year the play was created (“yearNormalized”) and the y-axis to the number of characters (“size”) in the play.

The documentation of the plot function can be found here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html

1. Plot the number of characters (“size”) in the plays over time (“yearNormalized”)
2. Plot the length of the play (“wordCountText”) over time (“yearNormalized”)
3. Get the five longest plays
   - sort plays by “wordCountText”
   - show first five
4. Get the number of plays published after 1800 and before 1900
   - filter: “yearNormalized”
   - filter-value: 1800 and 1900
   - filter-operation: > and <
5. Plot the development of the length of stage directions
   - filter: “wordCountStage”
   - calculate the share of stage directions in relation to the word count in a new column
   - plot by time
6. Plot the proportion of female speakers over time
   - filter: “numOfSpeakers”, “numOfSpeakersFemale”
   - calculate the share of female speakers
   - plot by time

# 1. Plot the number of characters of each play against the normalized year
metadata_df.plot(x="yearNormalized", y="size", kind="scatter")

# 2. Plot length of play in words by the normalized year
metadata_df.plot(x="yearNormalized", y="wordCountText", kind="scatter")

# 3. Sort plays by wordcount, show first 5 entries
metadata_by_length = metadata_df.sort_values(by="wordCountText", axis=0, ascending=False)

# get the first five entries 
metadata_by_length[0:5]

# 4. Get number of plays between 1800 and 1900 
num_of_plays = len(metadata_df[(metadata_df["yearNormalized"] > 1800) & (metadata_df["yearNormalized"] < 1900)])
print(f"Number of plays in the selected time period: {num_of_plays}")

# 5. Calculate the share of tokens in stage directions in relation to all tokens 
# save the result in a new column (a fraction between 0 and 1; multiply by 100 for percent)
stage_percentage = metadata_df["wordCountStage"] / metadata_df["wordCountText"]
metadata_df["wordCountStagePercentage"] = stage_percentage
metadata_df.plot(x="yearNormalized", y="wordCountStagePercentage", kind="scatter")

# 6. Display the share of female speakers over time (a fraction between 0 and 1)
speakers_total = metadata_df["numOfSpeakers"]
metadata_df["numOfSpeakersFemalePercentage"] = metadata_df["numOfSpeakersFemale"] / speakers_total
metadata_df.plot(x="yearNormalized", y="numOfSpeakersFemalePercentage", kind="scatter")
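Scatter plots show every play individually. To compare periods, the same columns can also be aggregated. The following is a minimal sketch using groupby on a small, made-up table; the numbers are invented for illustration and only the column names yearNormalized, numOfSpeakers and numOfSpeakersFemale follow the metadata above. The columns femaleShare and century are names chosen for this example.

```python
import pandas as pd

# made-up values for illustration only, not taken from any DraCor corpus
toy_df = pd.DataFrame({
    "yearNormalized": [1750, 1780, 1820, 1890],
    "numOfSpeakers": [10, 20, 8, 16],
    "numOfSpeakersFemale": [2, 5, 4, 4],
})

# share of female speakers per play (fraction between 0 and 1)
toy_df["femaleShare"] = toy_df["numOfSpeakersFemale"] / toy_df["numOfSpeakers"]

# derive the century from the year, e.g. 1750 -> 18
toy_df["century"] = toy_df["yearNormalized"] // 100 + 1

# mean share of female speakers per century
share_by_century = toy_df.groupby("century")["femaleShare"].mean()
print(share_by_century)
```

Applied to metadata_df instead of toy_df, the same steps would give one value per century, which could then be plotted with kind="bar".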

3. `play/`: Select text#

The API also allows us to load single texts or abstract representations of single texts, such as network data. For this, we need to extend the URL by the path segment plays/, followed by the name of the play as listed in the metadata. This will give us:

metadata of the play
network data of the play
speaker list
division into scenes and the appearing speakers

The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/play-info

# save play parameter in variable
PLAY_EXT = "/plays/"

# save column name in which the play names are stored in a variable 
PLAY_KEY = "name"
for i in range(10):
    # get play name with user input
    # save play name in variable
    play_name = str(input("Please choose a text from the corpus you have chosen. Enter the text name: "))
    if play_name not in metadata_df[PLAY_KEY].values:
        print("The name you selected is not in the list. Please enter the name again.")
    else:
        print("Success!")
        break
else:
    play_name = "strindberg-gillets-hemlighet"

# build URL
play_path = API_URL + CORPORA_EXT + corpusname + PLAY_EXT + play_name
print(f"URL for getting information of a specific play: {play_path}\n")

# perform request
play_info = requests.get(play_path).json()

# extract character names
character_names = [entry["name"] for entry in play_info["characters"]]
print("Character list")
print(character_names)

Exercise#

How else could we get the characters of the play? Is there a more specific API call if we only want that information?

# The URL for getting a specific play is saved in the variable `play_path`
# This is what it consists of:
print(API_URL)
print(CORPORA_EXT)
print(corpusname)
print(PLAY_EXT)
print(play_name)
print(f"Combined the URL looks like this: {play_path}")

# We can add something to the URL like this:
# (just replace anything-you-want-to-add with the keyword of your choice)
# add your chosen parameter to the path to the play you selected
character_url = play_path + "/anything-you-want-to-add"

# perform request
character_info = requests.get(character_url)
if character_info.status_code != 200:
    print(f"It looks like your URL is not valid. Status code is: {character_info.status_code}")
else:
    print("Success! Here is the output:")
    print(character_info.json())

Specify single play requests#

We can specify which information of the play we want to retrieve. We do so by extending the URL by an additional parameter. If for example we want to get the spoken text of the characters, we need to extend the URL by spoken-text-by-character.

The documentation of this endpoint can be found here: https://dracor.org/doc/api#/public/play-spoken-text-by-character

You could also choose other information to retrieve e.g. stage directions and speakers, spoken text only (without the attribution to the speaker) and so on. Just have a look at the API documentation and see what parameters can be added after {playname}.

# save parameter to get more specific data to the selected play in a variable 
PLAY_SPECIFICATION = "/spoken-text-by-character"

# extend play URL 
play_spec_path = play_path + PLAY_SPECIFICATION
print(f"URL for getting specified information of a play: {play_spec_path}\n")

# perform request 
play_spec = requests.get(play_spec_path).json()

Example#

We can now perform some analyses with the text we retrieved. With just some minor preprocessing (tokenization) we can ask:

Who talks most often about love or guns?

Since the characters are also annotated with gender, we can explore simple gender-related questions, such as:

Do men talk more often about swords, guns, weapons?
Do women talk more often about love, roses, children?

For this, we need a tokenizer for the spoken text, e.g. from the Natural Language Toolkit nltk (https://www.nltk.org/) or from any other NLP library such as spaCy. We can then calculate the frequencies by character and sum them up by gender. For counting the selected words, we can use Counter from the standard library module collections (https://docs.python.org/3/library/collections.html).

from nltk.tokenize import word_tokenize
from collections import Counter

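Uncomment the lines below if the import of nltk fails or the tokenizer data is missing.

# !pip install nltk
# import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')  # recent NLTK releases look for this resource instead of 'punkt'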

# save keyword for a character's text in a variable
TEXT_KEY = "text"
# save new column names in variables
ANNO_KEY = "text annotation"
FRQ_KEY = "frequencies"

# tokenize and count words
# iterate characters
for character_entry in play_spec:
    # tokenize speech acts
    annotation =  [word_tokenize(sen) for sen in character_entry[TEXT_KEY]]
    # save tokenized text and word frequencies
    character_entry[ANNO_KEY] = [word for sen in annotation for word in sen]
    character_entry[FRQ_KEY] = Counter(character_entry[ANNO_KEY])
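Note that the counts above are case-sensitive, so "Barn" and "barn" are counted separately. As a rough sketch of what tokenizing and counting does, and as a fallback if nltk is not available, the following uses only the standard library. The regular expression is a much simpler tokenizer than word_tokenize, simple_tokenize is a hypothetical helper name, and the speech acts are made up for illustration.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Lowercase the text and return runs of word characters as tokens."""
    return re.findall(r"\w+", text.lower())


# made-up speech acts for illustration only
toy_speeches = ["Barn, barn!", "En rose."]

# flatten the tokens of all speech acts into one list and count them
toy_tokens = [token for speech in toy_speeches for token in simple_tokenize(speech)]
toy_frequencies = Counter(toy_tokens)
print(toy_frequencies)
```

Lowercasing is a design choice: it merges sentence-initial and mid-sentence forms of a word, at the cost of also merging proper names with common nouns.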

Create word list#

Create your list of words below. Each word must be enclosed in quotation marks (“word”) and separated from the next word with a comma.

word_list = ["rose", "blom", "barn", "vapen", "gevär", "pengar"]

Analyze#

# save character name key in a variable
NAME_KEY = "label"

# get frequencies of the words in the word list by character
# iterate characters
for character_entry in play_spec:
    # get character name
    print(character_entry[NAME_KEY])
    found = False
    # for each word in the word list, look up the frequency in the speech of the current character
    for word in word_list:
        if word in character_entry[FRQ_KEY]:
            print(f"{word}: {character_entry[FRQ_KEY][word]}")
            found = True
    if not found:
        print("None of the words found in the speech of this character.")
    print("-"*50)

# save the gender key for the characters in a variable
GENDER_KEY = "gender"
# create results dictionary
# for each word the frequency by gender is saved 
words_by_gender = {word: {"MALE": 0, "FEMALE": 0, "UNKNOWN":0} for word in word_list}

# get frequencies of the words in the word list by character
# add frequency to the gender of the character

# iterate characters
for character_entry in play_spec:
    # retrieve gender
    gender = character_entry[GENDER_KEY]
    # for each word in the word list, look up the frequency in the speech of the current character
    # add frequency to the gender of the character
    for word in word_list:
        if word in character_entry[FRQ_KEY]:
            if gender in words_by_gender[word]:
                words_by_gender[word][gender] += character_entry[FRQ_KEY][word]
# convert results dictionary into a DataFrame
gender_df = pd.DataFrame(words_by_gender)

gender_df.plot(kind="bar", figsize=(12,10))
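The bar chart shows absolute counts, which also depend on how much text each group speaks. One way to make the groups comparable is to divide each count by the total number of tokens of that group. The following is a minimal sketch of that idea with made-up token lists; relative_frequencies is a hypothetical helper name and the data is not taken from a real play.

```python
from collections import Counter

# made-up token lists per gender, for illustration only
tokens_by_gender = {
    "MALE": ["vapen", "och", "pengar", "vapen"],
    "FEMALE": ["barn", "och", "rose", "barn", "barn", "pengar", "och", "rose"],
}
toy_word_list = ["barn", "vapen", "pengar"]


def relative_frequencies(tokens, words):
    """Return the share of each word among all tokens (0.0 if there are no tokens)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (counts[word] / total if total else 0.0) for word in words}


relative_by_gender = {
    gender: relative_frequencies(tokens, toy_word_list)
    for gender, tokens in tokens_by_gender.items()
}
print(relative_by_gender)
```

With the play data, the token lists per gender could be collected from the tokenized text of each character before calling the helper.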

Generic function to handle the requests and parse the result#

Requesting data from the API in most cases follows a pattern:

construct the request URL, e.g. use https://dracor.org/api/v1/ as a base and attach the corpusname, the playname, a method (e.g. characters) and in some cases a response format (e.g. csv)
send a request with this URL to the endpoint
retrieve the data and parse it into a format that can then be used in the program

By defining a function, this process can be sped up. Instead of repeating the code, we can define a function that takes corpusname, playname and method as arguments. In the example we assume that the response will be JSON.

Parsing of JSON is done with the package json (https://docs.python.org/3/library/json.html), which needs to be imported:

import json

The function accepts its parameters as keyword arguments, e.g. corpusname="ger". The following arguments are supported:

apibase (default: https://dracor.org/api/v1/)
corpusname
playname
method
parse_json: True or False (default); if True, the response is parsed as JSON

def get(**kwargs):
    """Request data from the DraCor API and return the response.

    Supported keyword arguments: apibase, corpusname, playname, method, parse_json
    """
    # a different apibase could be set, e.g. https://staging.dracor.org/api/
    # (not recommended, please use the production server)
    apibase = kwargs.get("apibase", "https://dracor.org/api/v1/")
    if not apibase.endswith("/"):
        apibase = apibase + "/"

    has_corpus = "corpusname" in kwargs
    has_play = "playname" in kwargs

    if has_corpus and has_play:
        # used for corpora/{corpusname}/plays/{playname} and, with a method,
        # corpora/{corpusname}/plays/{playname}/{method}
        request_url = apibase + "corpora/" + kwargs["corpusname"] + "/plays/" + kwargs["playname"]
        if "method" in kwargs:
            request_url = request_url + "/" + kwargs["method"]
    elif has_corpus:
        # used for corpora/{corpusname} and corpora/{corpusname}/{method}
        request_url = apibase + "corpora/" + kwargs["corpusname"]
        if "method" in kwargs:
            request_url = request_url + "/" + kwargs["method"]
    elif "method" in kwargs and not has_play:
        # only a method is set, e.g. "info" or "corpora"
        request_url = apibase + kwargs["method"]
    else:
        # nothing usable set: fall back to the info about the API
        request_url = apibase + "info"

    # send the request
    r = requests.get(request_url)
    if r.status_code != 200:
        raise Exception("Request was not successful. Server returned status code: " + str(r.status_code))

    # success: return the parsed JSON if requested, otherwise the raw text
    if kwargs.get("parse_json", False):
        return json.loads(r.text)
    return r.text
       

The function can now be called as shown below. This call requests the info about the API (info):

get(method="info", parse_json=True)

To request the metrics of a single play (corpora/{corpusname}/plays/{playname}/metrics), use the following function call:

get(corpusname="ger",playname="lessing-emilia-galotti",method="metrics",parse_json=True)

Example: Gender of Characters#

In the following example we count the characters that are tagged as “MALE” and “FEMALE” in a corpus.

# Get all plays in a corpus
# check if a corpusname was entered
if corpusname == "":
    raise Exception("Please enter a corpus!")

# get data of a single corpus and store only the list of plays in the variable "plays"
plays = get(corpusname=corpusname, parse_json=True)["plays"]
# set counters for male and female characters in the corpus
overallMale = 0
overallFemale = 0

# iterate over the plays
for play in plays:
    # get the characters of a play by using the API endpoint corpora/{corpusname}/plays/{playname}/characters
    characters = get(corpusname=corpusname, playname=play["name"], method="characters", parse_json=True)
    # reset the counters for male and female characters
    cntMale = 0
    cntFemale = 0
    # iterate over the characters and increment the counters
    for character in characters:
        gender = character["gender"]
        if gender == "MALE":
            cntMale = cntMale + 1
        elif gender == "FEMALE":
            cntFemale = cntFemale + 1
    # report the result per play
    print(play["name"] + ": " + "female characters: " + str(cntFemale) + "; male characters: " + str(cntMale))

    # increment the overall counters
    overallMale = overallMale + cntMale
    overallFemale = overallFemale + cntFemale

# report the results on corpus level
print("\n\nThere are " + str(overallFemale) + " female and " + str(overallMale) + " male characters in the corpus '" + corpusname + "'")
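The per-play counting can also be written more compactly with Counter from the standard library, which additionally keeps track of any other gender value without extra code. The following is a minimal sketch on made-up character entries; only the "gender" key mirrors the response used above, and treating a missing key as "UNKNOWN" is an assumption made for this example.

```python
from collections import Counter

# made-up character entries for illustration only
toy_characters = [
    {"name": "a", "gender": "MALE"},
    {"name": "b", "gender": "FEMALE"},
    {"name": "c", "gender": "MALE"},
    {"name": "d", "gender": "UNKNOWN"},
    {"name": "e"},
]

# count every gender value; entries without the key are counted as "UNKNOWN"
gender_counts = Counter(character.get("gender", "UNKNOWN") for character in toy_characters)
print(gender_counts)
```

Counters can be added to each other, so the per-play results could be summed into a corpus-level Counter instead of maintaining separate overall variables.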