Analysis of depression features in text-transcripts of couple conversations.
We examine transcripts of couple conversations from an ongoing research project at Heidelberg University Hospital to identify depression-related features and to quantify the differences between couples in which one partner suffers from major depressive disorder and couples in which neither partner has a history of depression.
- Julius Daub (3536557) | Applied Informatics (M.Sc.)
- Alexander Haas (3503540) | Applied Informatics (M.Sc.)
- Ubeydullah Ilhan (3447661) | Applied Informatics (M.Sc.)
- Benjamin Sparks (3664690) | Applied Informatics (M.Sc.)
As a prerequisite, install the required dependencies with Pipenv. Then activate your virtual environment and choose any of these entry points to get started:
- Feature Summary: Summary of all features in one notebook
- Pipeline Demo: Demonstrates usage of Pipeline API, including utilisation of singular classification and voting classification
- Streamlit: Simple visual front-end for rendering plots. To run it, check out the `streamlit-demo` branch and execute `streamlit run streamlit_pipeline.py`.
- Data acquisition: speech-to-text and docx-extraction
- Implementation of the pipeline architecture
- Implementation of a demo text mining workflow contained in the branch demonstration
- Implementation of demo components
- Collection of possible metrics from literature
- Basic statistics of the data set and first iteration of feature engineering (got the data on 15.12.2020)
- Implementation of features into the pipeline
- Implementation of classifiers into the pipeline
- Extensive feature testing and summary of the results in feature_summary.ipynb
- Summary of the feature insights in a presentation
We use several libraries for the project, including:
- Numpy
- Pandas
- sklearn
- matplotlib
- spaCy + German news model (https://spacy.io/models/de)
- NLTK
- pyphen
- liwc-python
- gensim
- Streamlit
- Seaborn
- LIWC library with German dictionary (https://pypi.org/project/liwc/)
All libraries and versions are specified in the Pipfile.
- 15.01. Implementation of all features and first results to share with the Institute, in order to evaluate whether further transcripts can be provided. If not, we will evaluate on other datasets we discovered (see section data sources) ✅
- 04.02. Second "official" feedback round with supervisor ✅
- 05.02. Summary of results ✅
- 25.02. Second milestone: Code and presentation ✅
- 15.03. Third milestone: Report deadline
For this project, we implemented a pipeline library, which is contained in src.pipelinelib.
The phase of importing and aggregating data is performed by two different classes:
- Parser: Loads the provided transcripts into a DataFrame, alongside the corresponding metadata, such as whether a transcript belongs to the depressive sample set.
- Queryable: Provides a type-safe wrapper around queries that can be applied to the loaded DataFrame. Its capabilities include aggregating the data on differing corpus levels, loading only transcript data from the depressed group, etc.
Due to GDPR, the transcripts are not allowed to be uploaded to this repository.
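Since the real transcripts cannot be shared, the following is only a minimal sketch of the querying idea on synthetic data. The names `Utterance`, `QueryableSketch`, `depressed_only` and `by_couple` are hypothetical and do not reflect the actual Parser or Queryable API, and a plain list of records stands in for the DataFrame.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Utterance:
    """One speaker turn plus the metadata a parser might attach (assumed fields)."""
    couple_id: int
    speaker: str
    is_depressed_group: bool
    text: str


class QueryableSketch:
    """Hypothetical stand-in for Queryable: chainable filters and aggregation."""

    def __init__(self, rows: List[Utterance]):
        self._rows = list(rows)

    def depressed_only(self) -> "QueryableSketch":
        # Keep only utterances from couples in the depressed group.
        return QueryableSketch([r for r in self._rows if r.is_depressed_group])

    def by_couple(self) -> Dict[int, str]:
        # Aggregate on the couple (document) level by joining all utterances.
        grouped: Dict[int, List[str]] = {}
        for row in self._rows:
            grouped.setdefault(row.couple_id, []).append(row.text)
        return {couple: " ".join(texts) for couple, texts in grouped.items()}


# Synthetic rows; these are not taken from the study data.
rows = [
    Utterance(1, "A", True, "Ich bin müde."),
    Utterance(1, "B", True, "Das verstehe ich."),
    Utterance(2, "A", False, "Wie war dein Tag?"),
    Utterance(2, "B", False, "Ganz gut."),
]

documents = QueryableSketch(rows).depressed_only().by_couple()
print(documents)
```

Returning a new wrapper from each filter keeps queries chainable, so a filter on the group can be combined with aggregation on whichever corpus level is needed.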
Classifying data from the loaded documents is implemented by three more classes:
- Pipeline: Represents a pipeline for training a classification model.
- Component: Every step in a pipeline, be it preprocessing, feature extraction or classification, is implemented as a derivative of this abstract class.
- Extension: The results of a Component instance are stored within a lookup structure for later Components to reuse. Each Extension is mapped to one result within said structure.
The aforementioned classes all work in conjunction to deliver the requested results. Particularly, each Component declares in its constructor which Extensions it depends on. This allows a Pipeline instance, prior to execution, to check whether a Component's dependencies are satisfied, or whether they will overwrite other calculated results.
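The dependency check described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in src.pipelinelib: the method names `add`, `check`, `execute` and `apply`, the `requires`/`creates` attributes, and the choice to raise an error on an overwrite are all hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Extension:
    """Key for one result in the shared lookup structure (assumed design)."""
    name: str


class Component(ABC):
    def __init__(self, requires: List[Extension], creates: Extension):
        self.requires = requires  # results this step reads
        self.creates = creates    # result this step writes

    @abstractmethod
    def apply(self, storage: Dict[Extension, Any]) -> Any:
        ...


class Pipeline:
    def __init__(self, initial: Dict[Extension, Any]):
        self._storage = dict(initial)
        self._components: List[Component] = []

    def add(self, component: Component) -> "Pipeline":
        self._components.append(component)
        return self

    def check(self) -> None:
        # Walk the steps in order and track which results would exist by then.
        available = set(self._storage)
        for component in self._components:
            name = type(component).__name__
            missing = [e.name for e in component.requires if e not in available]
            if missing:
                raise ValueError(f"{name} is missing {missing}")
            if component.creates in available:
                raise ValueError(f"{name} would overwrite {component.creates.name}")
            available.add(component.creates)

    def execute(self) -> Dict[Extension, Any]:
        self.check()  # validate before any step runs
        for component in self._components:
            self._storage[component.creates] = component.apply(self._storage)
        return self._storage


TEXT = Extension("text")
TOKENS = Extension("tokens")
TOKEN_COUNT = Extension("token_count")


class Tokenizer(Component):
    def __init__(self):
        super().__init__(requires=[TEXT], creates=TOKENS)

    def apply(self, storage):
        return storage[TEXT].split()


class TokenCounter(Component):
    def __init__(self):
        super().__init__(requires=[TOKENS], creates=TOKEN_COUNT)

    def apply(self, storage):
        return len(storage[TOKENS])


results = Pipeline({TEXT: "ich bin hier"}).add(Tokenizer()).add(TokenCounter()).execute()
print(results[TOKEN_COUNT])
```

Because every Component declares its inputs and output up front, the check needs no knowledge of what a step computes; a misordered pipeline fails before any work is done.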
The pipeline's steps for the Sigmund project are implemented as Component derivatives and can be split into three parts:
- Preprocessing: As our features require different representations of the corpus, we provide a modular preprocessing pipeline. For that purpose, different aspects of the text can be queried, ranging from plain tokenizing and syllable extraction to stemming and lemmatization.
- Feature Engineering: Features can be added in a modular fashion as well. Implemented features include Agreement Score, Talk Turn and TFIDF. Their inputs depend on the applied preprocessing Components.
- Classification: Lastly, we use a classification model to categorize the transcripts as depressed or non-depressed. This is performed by combining selected feature vectors from the aforementioned section and reporting a loss value.
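To make the step from preprocessed tokens to a feature vector concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency). It is not the project's tfidf.py; the function name `tfidf` and the example tokens are assumptions. Library implementations may weight differently: scikit-learn's TfidfVectorizer, for instance, smooths the IDF and normalizes vectors by default, so its values would differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Compute tf * log(N / df) per term for each tokenized document."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    document_frequency: Counter = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))

    vectors = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return vectors


# Synthetic, already tokenized documents (not study data).
docs = [["ich", "bin", "müde"], ["ich", "bin", "froh"]]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that occur in every document, such as "ich" here, receive a weight of zero, so only terms that distinguish transcripts contribute to the vector.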
The structure of the repository is as follows:
├── pipelinelib
│   ├── adapter.py
│   ├── component.py
│   ├── extension.py
│   ├── __init__.py
│   ├── pipeline.py
│   ├── querying.py
│   └── text_body.py
├── sigmund
│   ├── classification
│   │   ├── __init__.py
│   │   ├── linear_discriminant_analysis.py
│   │   ├── logistic_regression.py
│   │   ├── merger.py
│   │   ├── naive_bayes.py
│   │   └── pca.py
│   ├── extensions.py
│   ├── features
│   │   ├── agreement_score.py
│   │   ├── basic_statistics.py
│   │   ├── flesch_reading_ease.py
│   │   ├── __init__.py
│   │   ├── liwc.py
│   │   ├── pos.py
│   │   ├── talk_turn.py
│   │   ├── tfidf.py
│   │   └── vocabulary_size.py
│   ├── __init__.py
│   └── preprocessing
│       ├── __init__.py
│       └── words.py
We furthermore provide a simple front-end for the Institute of Medical Psychology to present the results and provide feature details.
- 10 transcripts of conversations between couples as part of the "Enhancing Social Interaction in Depression" (SIDE) study.
- The structure of the entire dataset of the SIDE study is described in detail in the project proposal, which can be found in the repository as well.
- The format of the transcripts is as follows:
- Docx format
- Sequence of Speakers, separated by paragraph, starting with speaker label
- Annotations by the transcriber are enclosed in parentheses
- In our case, the depressed person is always female, although this is not a requirement
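The format above can be read with a few lines of code once the paragraphs have been extracted from the Docx file. The sketch below is an illustration only: the exact label syntax is not specified here, so it assumes a hypothetical `LABEL: text` layout, and the function name `parse_paragraphs` and the example paragraphs are made up.

```python
import re
from typing import List, Tuple

# Transcriber annotations are assumed to be non-nested parentheses.
ANNOTATION = re.compile(r"\([^()]*\)")
# Assumed label layout: a word, a colon, then the utterance.
SPEAKER_LINE = re.compile(r"^\s*(\w+)\s*:\s*(.*)$")


def parse_paragraphs(paragraphs: List[str]) -> List[Tuple[str, str]]:
    """Return (speaker, text) pairs with annotations removed."""
    utterances = []
    for paragraph in paragraphs:
        match = SPEAKER_LINE.match(paragraph)
        if match is None:
            continue  # no speaker label, e.g. a standalone annotation
        speaker, text = match.groups()
        text = ANNOTATION.sub(" ", text)
        text = " ".join(text.split())  # collapse leftover whitespace
        utterances.append((speaker, text))
    return utterances


# Synthetic paragraphs; these are not taken from the study data.
paragraphs = [
    "(Aufnahme beginnt)",
    "A: (seufzt) Ich bin heute sehr müde.",
    "B: Das tut mir leid. (Pause) Möchtest du reden?",
]

parsed = parse_paragraphs(paragraphs)
print(parsed)
```

Whether annotations should be dropped or kept as separate features (for example pauses or laughter) is a design decision; this sketch simply discards them.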
As of 16.12.2020, our data consists of:
- 10 transcripts (10 couples, 20 speakers; 5 pairs with depression, 5 pairs without depression; depressed partner always female)
- Word count: ~1000 words per transcript
- Word count total: ~13,000
- Utterances: ~60 per transcript
More detailed statistics of the transcripts are included in feature_summary.ipynb. This notebook depends on a preliminary data aggregation approach that can be found in [src/backwards_compatibility](src/backwards_compatibility) and has since been superseded in functionality by Parser and Queryable. The classes in this folder are used neither by the Pipeline API nor by the Sigmund codebase.