apium-io/corpus-UOC

corpus-UOC

A service that removes duplicate news articles, foreign news articles, and gibberish text from the University of Colombo dataset.

Requirements

Python 3+

Instructions

To copy files from the dataset's subdirectories into a single directory for processing, first follow the instructions inside filecopy.py and make the necessary changes, then run

py filecopy.py
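For readers who want to see what this flattening step involves, here is a minimal sketch. It is not the code in filecopy.py. The function name `flatten_dataset`, its parameters, and the collision-handling rule (appending a counter to the file name) are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def flatten_dataset(source_root, target_dir):
    """Copy every file found under source_root into target_dir.

    Hypothetical helper: returns the number of files copied. If two files
    share a name, a counter is appended to the stem so nothing is overwritten.
    """
    source = Path(source_root)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = 0
    # sorted() materialises the listing before any copying starts.
    for path in sorted(source.rglob("*")):
        if not path.is_file():
            continue
        # Skip files that already live inside the target directory.
        if target.resolve() in path.resolve().parents:
            continue
        destination = target / path.name
        counter = 1
        while destination.exists():
            destination = target / f"{path.stem}_{counter}{path.suffix}"
            counter += 1
        shutil.copy2(path, destination)
        copied += 1
    return copied
```

Calling `flatten_dataset("dataset", "flat")` would, under these assumptions, gather all files below `dataset` into `flat`. The directory names here are placeholders.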

To remove duplicate news articles from the dataset, run

py duplicates.py location_of_the_directory_containing_the_dataset
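The README does not say how duplicates.py decides that two articles are duplicates. The sketch below shows one common approach, assumed here purely for illustration: treating files with byte-for-byte identical content as duplicates by comparing SHA-256 hashes. The function name `remove_exact_duplicates` is hypothetical, and the real script may use a different criterion, such as near-duplicate matching.

```python
import hashlib
from pathlib import Path


def remove_exact_duplicates(directory):
    """Delete files whose content exactly matches an earlier file.

    Hypothetical helper: files are visited in sorted name order, the first
    file with a given content hash is kept, and later copies are deleted.
    Returns the names of the deleted files.
    """
    seen = {}      # content hash -> name of the file that was kept
    removed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            path.unlink()              # duplicate of an earlier file
            removed.append(path.name)
        else:
            seen[digest] = path.name
    return removed
```

Because this deletes files in place, it is safer to try it on a copy of the dataset first.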

To remove foreign news articles, first follow the instructions inside foreignnews.py and make the necessary changes, then run

py foreignnews.py 
