Project Aim

The aim of this project is to determine whether it is feasible to generate useful datasets from unsolicited Twitter posts regarding auditory hallucinatory experiences to support psychological investigations.

Data collection

To collect Twitter posts about potential hallucinatory experiences, we defined search queries based on keywords from the literature and on input from researchers with experience of delivering clinical assessments.

Annotating the dataset

Collaborative annotation tool

We designed and developed a bespoke annotation application that aims to minimise the time spent labelling each example. Using this tool, annotators can assign a classification category and highlight the words and phrases that support their decision.

Making predictions

Text classification pipeline

Our goal was to predict whether a post is related to a hallucinatory experience. In other words, we need to classify texts into two categories: “related” and “unrelated”. A typical Twitter post contains a lot of noise, such as slang, acronyms and spelling mistakes. People also use words that are not in the English vocabulary, which makes natural language processing more challenging.
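
As an illustration of how this kind of noise can be reduced before classification, here is a minimal sketch of a tweet normaliser. It is not the preprocessing used in the project: the function name `normalise_tweet`, the placeholder tokens `<url>` and `<user>`, and the specific rules are all assumptions chosen for the example.

```python
import re

# Hypothetical normalisation rules, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")      # web links
MENTION_RE = re.compile(r"@\w+")          # @username mentions
REPEAT_RE = re.compile(r"(.)\1{2,}")      # any character repeated 3+ times


def normalise_tweet(text: str) -> str:
    """Return a lower-cased, lightly cleaned version of a tweet."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)       # replace links with a placeholder
    text = MENTION_RE.sub("<user>", text)  # replace mentions with a placeholder
    text = text.replace("#", "")           # keep the hashtag word, drop the symbol
    text = REPEAT_RE.sub(r"\1\1", text)    # "voiiiices" becomes "voiices"
    return " ".join(text.split())          # collapse runs of whitespace


print(normalise_tweet("@Bob I keep hearing voiiiices!!! http://t.co/x #scared"))
# prints: <user> i keep hearing voiices!! <url> scared
```

Collapsing elongated words to two repeated characters, rather than one, keeps some trace of the emphasis while still reducing the number of distinct spellings.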

Feature extraction

During this stage we try to extract semantics and meaning from the text. In subjective posts, where people describe personal opinions or feelings, it is useful to know which sentiment (positive, negative or neutral) was expressed, so we also extract the sentiment polarity. In our specific case of auditory hallucinations, it is also interesting to investigate the content of the hallucinations (what exactly is heard), so we developed an algorithm that extracts key phrases based on the structure of the sentence.
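
To make the idea of a sentiment polarity feature concrete, the sketch below assigns one of the three labels by counting words from two tiny word lists. This is a toy lexicon approach and not necessarily the method used in the project; the word lists `POSITIVE` and `NEGATIVE` and the function `polarity` are hypothetical.

```python
import re

# Tiny hypothetical lexicons; a real system would use a much larger resource.
POSITIVE = {"calm", "good", "happy", "love", "safe"}
NEGATIVE = {"afraid", "bad", "hate", "scared", "scary"}


def polarity(text: str) -> str:
    """Label a text as 'positive', 'negative' or 'neutral' by lexicon counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("I am scared of the voices"))  # prints: negative
```

The resulting label can be passed to the classifier as one categorical feature alongside the features derived from the text itself.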


Classification performance

We performed 10 runs of 10-fold cross-validation and used the F2-score as the performance metric. The best performance (F2-score = 0.831) was achieved using a Naive Bayes classifier.
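
The evaluation protocol can be sketched as follows. The F2-score is the F-beta measure with beta = 2, which weights recall more heavily than precision. The helper names `fbeta` and `repeated_kfold` are assumptions made for this example, the splits are not stratified, and this is not the project's evaluation code.

```python
import random


def fbeta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion counts; beta=2 gives the F2-score."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def repeated_kfold(n_samples: int, n_splits: int = 10, n_repeats: int = 10, seed: int = 0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                      # new random partition per repeat
        for k in range(n_splits):
            test_idx = indices[k::n_splits]       # every n_splits-th shuffled item
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


splits = list(repeated_kfold(n_samples=50))
print(len(splits))                  # prints: 100
print(round(fbeta(8, 4, 1), 3))     # prints: 0.833
```

With 10 repeats of 10 folds, a classifier is trained and scored 100 times, and the scores can then be averaged to give a more stable estimate than a single 10-fold run.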

Data visualisation

Interactive Dashboard

To help researchers with data analysis, we developed a visualisation application that presents aggregated statistics, such as the distribution of posts by part of the day, the sentiment polarity distribution, the types of named entities and the distribution of semantic classes.

Preliminary data analysis

Sentiment analysis

Negative sentiment was significantly associated with posts that indicated the occurrence of auditory hallucinations.

Posting time

Posts linked to auditory hallucinations were proportionally more frequent between 11pm and 5am.
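
A comparison of this kind can be computed from post timestamps, as in the minimal sketch below. The helpers `is_night` and `night_share` are hypothetical, the window is treated as 23:00 inclusive to 05:00 exclusive, and the timestamps are assumed to already be in the poster's local time.

```python
from datetime import datetime


def is_night(ts: datetime) -> bool:
    """True if the timestamp falls between 11pm (inclusive) and 5am (exclusive)."""
    return ts.hour >= 23 or ts.hour < 5


def night_share(timestamps: list) -> float:
    """Fraction of posts published during the night window."""
    if not timestamps:
        return 0.0
    return sum(is_night(t) for t in timestamps) / len(timestamps)


# Made-up example timestamps, not project data.
posts = [
    datetime(2016, 1, 1, 23, 30),
    datetime(2016, 1, 2, 2, 0),
    datetime(2016, 1, 2, 5, 0),
    datetime(2016, 1, 2, 14, 0),
]
print(night_share(posts))  # prints: 0.5
```

Computing this share separately for posts labelled "related" and "unrelated" gives the two proportions that can then be compared.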


This project was published (paper) and presented (slides) at the LREC 2016 conference as part of the Workshop on Resources and Processing of Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric Impairments (RaPID-2016).