Service to remove duplicate news articles, foreign news articles, and gibberish text from a news dataset for the University of Colombo.
Requirements
Python 3+
Instructions
To copy files from the dataset's subdirectories into a single directory for processing, first follow the instructions in filecopy.py and make the necessary changes, then run:
py filecopy.py
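For orientation, the copy step can be sketched roughly as follows. This is a minimal illustration, not the contents of filecopy.py: the function name flatten_copy, its two arguments, and the "_1", "_2" suffix used for clashing file names are all assumptions made for this example.

```python
import shutil
from pathlib import Path


def flatten_copy(src_root, dest_dir):
    """Copy every file found under src_root (recursively) into dest_dir.

    Hypothetical helper for illustration; the real filecopy.py may differ.
    """
    src_root = Path(src_root).resolve()
    dest_dir = Path(dest_dir).resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    # sorted() materializes the listing first, so files copied during the
    # loop are never picked up again.
    for path in sorted(src_root.rglob("*")):
        if not path.is_file():
            continue
        # Skip anything already inside the destination directory.
        if dest_dir in path.parents:
            continue
        target = dest_dir / path.name
        counter = 1
        # Assumed policy: rename on a name clash instead of overwriting.
        while target.exists():
            target = dest_dir / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```

Calling flatten_copy("dataset", "flat") would place every file from dataset and its subdirectories directly in flat; both directory names here are placeholders.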
To remove duplicate news articles from the dataset, run the following, replacing the argument with the path to the directory that contains the dataset:
py duplicates.py location_of_the_directory_containing_the_dataset
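One common way to detect duplicates is to hash each file's contents and keep only the first file seen for each hash. The sketch below assumes that approach and only catches byte-for-byte identical files; duplicates.py may use a different method (for example, comparing normalized text). The function name remove_duplicates is hypothetical.

```python
import hashlib
from pathlib import Path


def remove_duplicates(directory):
    """Delete files whose contents exactly match an earlier file.

    Hypothetical helper for illustration; only exact duplicates are found.
    """
    seen = {}      # content hash -> first file seen with that content
    removed = []
    # Sorted order makes it predictable which copy is kept.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()  # later copy of content already seen
            removed.append(path)
        else:
            seen[digest] = path
    return removed
```

Because this deletes files in place, it is safer to try it on a copy of the dataset first.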
To remove foreign news articles, first follow the instructions in foreignnews.py and make the necessary changes, then run:
py foreignnews.py