Sigmund

Analysis of depression features in text-transcripts of couple conversations.

We examine transcripts of couple conversations from a current research project at Heidelberg University Hospital to identify depression-related features and quantify the differences between couples in which one partner suffers from major depressive disorder and couples in which neither partner has a history of depression.

Project Members

Julius Daub (3536557) | Applied Informatics (M.Sc.) | [email protected]
Alexander Haas (3503540) | Applied Informatics (M.Sc.) | [email protected]
Ubeydullah Ilhan (3447661) | Applied Informatics (M.Sc.) | [email protected]
Benjamin Sparks (3664690) | Applied Informatics (M.Sc.) | [email protected]

Getting Started

First, install the required dependencies with Pipenv. Then activate your virtual environment and choose any of these entry points to get started:

Feature Summary: Summary of all features in one notebook
Pipeline Demo: Demonstrates usage of the Pipeline API, including classification with a single classifier and voting classification
Streamlit: Simple visual front-end for rendering plots. To run it, check out the streamlit-demo branch and execute streamlit run streamlit_pipeline.py.

Contributions for the Milestone

First milestone (16.12.2020)

Data acquisition: speech-to-text and docx-extraction
Implementation of the pipeline architecture
Implementation of a demo text mining workflow contained in the branch demonstration
Implementation of demo components
Collection of possible metrics from literature
Basic statistics of the data set and first iteration of feature engineering (data received on 15.12.2020)

Second milestone (04.02.2021)

Implementation of features into the pipeline
Implementation of classifiers into the pipeline
Extensive feature testing and summary of the results in feature_summary.ipynb
Summary of the feature insights in a presentation

Utilized Libraries

We use several libraries for the project, including:

NumPy
pandas
scikit-learn
Matplotlib
spaCy + German News Dataset (https://spacy.io/models/de)
NLTK
pyphen
liwc-python
gensim
Streamlit
Seaborn
LIWC library with German dictionary (https://pypi.org/project/liwc/)

All libraries and versions are specified in the Pipfile.

Project State

Project Planning

15.01. Implementation of all features and first results to share with the Institute, to evaluate whether further transcripts can be obtained. If not, we will evaluate on other datasets we discovered (see section Data Sources) ✅
04.02. Second "official" feedback round with supervisor ✅
05.02. Summary of results ✅
25.02. Second milestone: Code and presentation ✅
15.03. Third milestone: Report deadline

High Level Architecture Description

For this project, we implemented a pipeline library, which is contained in src.pipelinelib.

The phase of importing and aggregating data is performed by two different classes:

Parser: Loads the provided transcripts into a DataFrame, alongside the corresponding metadata, such as whether a transcript belongs to the depressive sample set.
Queryable: Provides a type-safe wrapper around queries that can be applied to the loaded DataFrame. Its capabilities include being able to aggregate the data on differing corpus levels, loading only transcript data from the depressed group, etc.
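
The idea of a type-safe query wrapper can be sketched as follows. This is a minimal illustration and not the code in src.pipelinelib: the names Utterance, CorpusLevel and TranscriptQuery, along with their fields and methods, are hypothetical, and a plain list of records stands in for the DataFrame.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class CorpusLevel(Enum):
    """Granularity at which text is aggregated (hypothetical names)."""
    DOCUMENT = "document"  # one text per transcript
    SPEAKER = "speaker"    # one text per speaker within a transcript


@dataclass(frozen=True)
class Utterance:
    document_id: int
    speaker: str
    is_depressed_group: bool  # metadata: transcript belongs to the depressed sample
    text: str


class TranscriptQuery:
    """Chainable, typed wrapper so callers never filter on raw column names."""

    def __init__(self, rows: List[Utterance]) -> None:
        self._rows = list(rows)

    def depressed(self, flag: bool = True) -> "TranscriptQuery":
        """Keep only rows whose transcript is (or is not) in the depressed group."""
        return TranscriptQuery([r for r in self._rows if r.is_depressed_group == flag])

    def aggregate(self, level: CorpusLevel) -> Dict[Tuple, str]:
        """Join utterance texts per document or per (document, speaker)."""
        grouped: Dict[Tuple, List[str]] = {}
        for row in self._rows:
            if level is CorpusLevel.DOCUMENT:
                key = (row.document_id,)
            else:
                key = (row.document_id, row.speaker)
            grouped.setdefault(key, []).append(row.text)
        return {key: " ".join(texts) for key, texts in grouped.items()}


# Invented example rows, not taken from the study data.
rows = [
    Utterance(0, "A", True, "Ich bin fertig."),
    Utterance(0, "B", True, "Das verstehe ich."),
    Utterance(0, "A", True, "Danke."),
    Utterance(1, "A", False, "Wie war dein Tag?"),
]
per_speaker = TranscriptQuery(rows).depressed().aggregate(CorpusLevel.SPEAKER)
```

Because the corpus level is an enum and the filters are methods, a misspelled level or column name fails at the call site instead of silently returning an empty result.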

Due to GDPR, the transcripts are not allowed to be uploaded to this repository.

Classifying data from the loaded documents is implemented by three more classes:

Pipeline: Represents a pipeline for training a classification model.
Component: Every step in a pipeline, be it preprocessing, feature extraction or classification, is implemented as a derivative of this abstract class.
Extension: The results of a Component instance are stored within a lookup structure for later Components to reuse. Each Extension is mapped to one result within said structure.

The aforementioned classes all work in conjunction to deliver the requested results. Particularly, each Component declares in its constructor which Extensions it depends on. This allows a Pipeline instance, prior to execution, to check whether a Component's dependencies are satisfied, or whether they will overwrite other calculated results.
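
The dependency check described above could look roughly like the following sketch. It mirrors the class names Pipeline, Component and Extension, but every signature, attribute (depends_on, creates) and method (add, validate, execute, apply) is an assumption made for illustration; the actual API in src.pipelinelib may differ.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Extension:
    """Names one slot in the shared result store."""

    def __init__(self, name: str) -> None:
        self.name = name


class Component(ABC):
    """A pipeline step that declares what it reads and what it writes."""

    def __init__(self, depends_on: List[Extension], creates: List[Extension]) -> None:
        self.depends_on = depends_on
        self.creates = creates

    @abstractmethod
    def apply(self, storage: Dict[str, Any]) -> Dict[str, Any]:
        """Return a mapping from extension name to the computed result."""


class Pipeline:
    def __init__(self) -> None:
        self.components: List[Component] = []

    def add(self, component: Component) -> "Pipeline":
        self.components.append(component)
        return self

    def validate(self, initial: List[Extension]) -> None:
        """Fail before execution on missing dependencies or overwritten results."""
        available = {ext.name for ext in initial}
        for component in self.components:
            label = type(component).__name__
            for ext in component.depends_on:
                if ext.name not in available:
                    raise ValueError(f"{label} requires missing extension '{ext.name}'")
            for ext in component.creates:
                if ext.name in available:
                    raise ValueError(f"{label} would overwrite extension '{ext.name}'")
                available.add(ext.name)

    def execute(self, storage: Dict[str, Any]) -> Dict[str, Any]:
        """Validate first, then run each component on the accumulated results."""
        self.validate([Extension(name) for name in storage])
        results = dict(storage)
        for component in self.components:
            results.update(component.apply(results))
        return results


TEXT = Extension("text")
TOKENS = Extension("tokens")
TOKEN_COUNT = Extension("token_count")


class Tokenizer(Component):
    def __init__(self) -> None:
        super().__init__(depends_on=[TEXT], creates=[TOKENS])

    def apply(self, storage: Dict[str, Any]) -> Dict[str, Any]:
        return {TOKENS.name: storage[TEXT.name].split()}


class TokenCount(Component):
    def __init__(self) -> None:
        super().__init__(depends_on=[TOKENS], creates=[TOKEN_COUNT])

    def apply(self, storage: Dict[str, Any]) -> Dict[str, Any]:
        return {TOKEN_COUNT.name: len(storage[TOKENS.name])}


results = Pipeline().add(Tokenizer()).add(TokenCount()).execute({"text": "wie geht es dir"})
```

In this sketch, adding TokenCount without a preceding Tokenizer raises a ValueError during validation, before any component has run.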

The pipeline's steps for the Sigmund project are implemented as Component derivatives and can be split into three parts:

Preprocessing: As our features require different representations of the corpus, we provide a modular preprocessing pipeline. For that purpose, different aspects of the text can be queried, ranging from plain tokenizing and syllable extraction, to stemming and lemmatization.
Feature Engineering: Features can be added in a modular fashion as well. Implemented features include Agreement Score, Talk Turn and TFIDF. Their inputs depend on applied preprocessing Components.
Classification: Lastly, we use a classification model in order to categorize the transcripts as depressed or non-depressed. This is performed by combining select feature vectors from the aforementioned section and reporting a loss value.
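
Combining selected feature vectors before classification might be sketched as below. The function merge_features and the nested-dictionary layout are hypothetical and do not describe the project's actual merging component; the numbers are made-up placeholders, not results from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: feature name -> {transcript id -> feature values}.
FeatureTable = Dict[int, List[float]]


def merge_features(
    features: Dict[str, FeatureTable], selected: List[str]
) -> Tuple[List[int], List[List[float]]]:
    """Concatenate the selected feature vectors per transcript, in the given order.

    Assumes `selected` contains at least one feature name.
    """
    # Only keep transcripts for which every selected feature is present.
    common_ids = set.intersection(*(set(features[name]) for name in selected))
    ids = sorted(common_ids)
    matrix = [
        [value for name in selected for value in features[name][doc_id]]
        for doc_id in ids
    ]
    return ids, matrix


# Placeholder values for illustration only.
features = {
    "talk_turn": {0: [0.62], 1: [0.48]},
    "vocabulary_size": {0: [210.0, 180.0], 1: [250.0, 240.0]},
    "agreement_score": {0: [0.1], 1: [0.3]},
}
ids, X = merge_features(features, ["talk_turn", "vocabulary_size"])
```

Each row of X then corresponds to one transcript in ids and can be passed to a classifier together with the depressed or non-depressed labels.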

The structure of the repository is as follows:

├── pipelinelib
│   ├── adapter.py
│   ├── component.py
│   ├── extension.py
│   ├── __init__.py
│   ├── pipeline.py
│   ├── querying.py
│   └── text_body.py
├── sigmund
│   ├── classification
│   │   ├── __init__.py
│   │   ├── linear_discriminant_analysis.py
│   │   ├── logistic_regression.py
│   │   ├── merger.py
│   │   ├── naive_bayes.py
│   │   └── pca.py
│   ├── extensions.py
│   ├── features
│   │   ├── agreement_score.py
│   │   ├── basic_statistics.py
│   │   ├── flesch_reading_ease.py
│   │   ├── __init__.py
│   │   ├── liwc.py
│   │   ├── pos.py
│   │   ├── talk_turn.py
│   │   ├── tfidf.py
│   │   └── vocabulary_size.py
│   ├── __init__.py
│   └── preprocessing
│       ├── __init__.py
│       └── words.py

We furthermore provide a simple front-end for the Institute of Medical Psychology to present the results and provide feature details.

Data Analysis

Data Sources

10 transcripts of conversations between couples as part of the "Enhancing Social Interaction in Depression" (SIDE) study.
The structure of the entire dataset of the SIDE study is described in detail in the project proposal, which can be found in the repository as well.
The format of the transcripts is as follows:
- Docx format
- Sequence of Speakers, separated by paragraph, starting with speaker label
- Annotations of the transcriber are marked with parentheses
- In our case, the depressed partner is always female, although this need not be the case in general
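
The format above could be parsed along the following lines. This is a sketch under stated assumptions: the paragraph texts are taken as already extracted from the Docx file, the speaker label is assumed to look like "A: ...", and the function parse_paragraphs and the example sentences are invented for illustration.

```python
import re
from typing import List, Tuple

# Transcriber notes in parentheses, such as "(lacht)".
ANNOTATION = re.compile(r"\([^)]*\)")
# Assumed "LABEL: text" layout at the start of each paragraph.
SPEAKER_LINE = re.compile(r"^\s*(\w+)\s*:\s*(.*)$", re.DOTALL)


def parse_paragraphs(paragraphs: List[str]) -> List[Tuple[str, str]]:
    """Turn paragraphs into (speaker, utterance) pairs without annotations."""
    utterances = []
    for paragraph in paragraphs:
        match = SPEAKER_LINE.match(paragraph)
        if match is None:
            continue  # skip paragraphs without a speaker label
        speaker, text = match.groups()
        text = ANNOTATION.sub(" ", text)
        text = " ".join(text.split())  # collapse whitespace left by removed notes
        utterances.append((speaker, text))
    return utterances


# Invented example paragraphs, not taken from the study data.
paragraphs = [
    "A: Ich weiss nicht, (seufzt) was ich sagen soll.",
    "B: (lacht) Das ist schon in Ordnung.",
]
utterances = parse_paragraphs(paragraphs)
```

Keeping the speaker label separate from the cleaned text makes it straightforward to compute per-speaker statistics such as word counts or the number of utterances.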

Basic Statistics

As of 16.12.2020, our data consists of:

10 transcripts (10 couples, 20 speakers; 5 pairs with depression, 5 pairs without depression; depressed partner always female)
Word count: ~1000 words per transcript
Word count total: ~13,000
Utterances: ~60 per transcript

More detailed statistics of the transcripts are included in feature_summary.ipynb. This notebook depends on a preliminary data aggregation approach found in src/backwards_compatibility, which has since been superseded in functionality by Parser and Queryable. The classes in this folder are used neither by the Pipeline API nor by the Sigmund codebase.
