At WebSummit 2019, 46 talks (mainly from the Central Stage) were automatically transcribed in real time using the Otter.ai platform. This repo provides the transcribed text, as well as the code to re-download and preprocess it.
With this dataset, you can do statistical analysis on the text of the transcripts, or even train a neural network model to produce your very own WebSummit speech.
Each speech is an individual .txt file inside the plain-texts folder, e.g. 224STLFR2BIGPLOD.txt. All the speeches are named after their ID on the otter.ai platform. If you have a different naming scheme to propose, I'm all ears! :D
The plain-texts folder also contains a data.json file, which includes all the raw data from otter.ai.
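As a starting point for the statistical analysis mentioned above, here is a minimal sketch of a word-frequency count over one transcript's text. The function names (wordFrequencies, topWords) and the sample sentence are made up for illustration and are not part of this repo.

```typescript
// Count how often each word appears in a piece of transcript text.
// The text is lowercased; runs of letters, digits and apostrophes count as words.
function wordFrequencies(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  for (const word of words) {
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}

// Return the n most frequent words, most frequent first.
function topWords(counts: Map<string, number>, n: number): [string, number][] {
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

// Illustrative input only; in practice this would be the content of one .txt file.
const sampleText = "The web is big. The web is open.";
const mostFrequent = topWords(wordFrequencies(sampleText), 3);
console.log(mostFrequent);
```

To run this on the real data, you could read each .txt file in the plain-texts folder (for example with Node's fs.readFileSync) and pass its content to wordFrequencies.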
npm install
npm run compile
npm run datagen (This downloads the transcripts from otter.ai and produces the data.json file.)
npm run preprocess (This reads the data.json file and creates the various .txt files.)
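To give a rough idea of what a preprocess step like this does, here is a sketch that turns a list of transcript records into one .txt file per speech. It is not the repo's actual code: the TranscriptRecord shape (an id and a text field) is an assumption, and the real data.json from otter.ai may be structured differently.

```typescript
// Hypothetical shape of one entry in data.json; the real otter.ai payload may differ.
interface TranscriptRecord {
  id: string;
  text: string;
}

// Map each record to a "<id>.txt" file name and its trimmed plain-text content.
function toPlainTextFiles(records: TranscriptRecord[]): Map<string, string> {
  const outputFiles = new Map<string, string>();
  for (const record of records) {
    outputFiles.set(record.id + ".txt", record.text.trim() + "\n");
  }
  return outputFiles;
}

// Illustrative record only; the text is made up.
const exampleFiles = toPlainTextFiles([
  { id: "224STLFR2BIGPLOD", text: "  Hello WebSummit.  " },
]);
console.log(Array.from(exampleFiles.keys()));
```

Each entry of the resulting map could then be written to the plain-texts folder, for example with Node's fs.writeFileSync.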
Please fork, copy, share and contribute!