# Classify arXiv preprints by their metadata

arXiv is an online repository of scientific preprints; see https://arxiv.org/help/general.

Each paper comes with metadata provided by one of the authors:
- title
- authors' names and forenames
- abstract
- exactly one primary category, which maps to one of six classes: mathematics, physics, computer science, statistics, quantitative biology, quantitative finance [^1]
- a few (possibly none) secondary categories (with the same mapping as for the primary category)
The main objective of the project is to predict the class associated with the primary category of a paper from its title and abstract (using the authors' names would make the problem uninteresting).
As the measure of success I choose the macro F1 score: I care about both precision and recall, and I like all my classes equally (regardless of imbalances in the data).
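For reference, this is the `average="macro"` option of scikit-learn's `f1_score`; a minimal sketch (the labels are illustrative only):

```python
from sklearn.metrics import f1_score

# Macro F1 averages the per-class F1 scores with equal weights,
# so a rare class counts exactly as much as a frequent one.
y_true = ["math", "physics", "cs", "math", "stat"]
y_pred = ["math", "cs",      "cs", "math", "stat"]

print(f1_score(y_true, y_pred, average="macro"))
```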
| Step | Notebook | Remarks |
|---|---|---|
| Harvest the data using the public API | arXiv_metadata_harvester.ipynb | There are two public APIs |
| Tidy up, take a closer look, strip down to the chosen features | arXiv_cleanup.ipynb | Large class imbalance; mostly single-class papers |
| Build and test pipelines with shallow classifiers | shallow/arXiv_shallow_clf.ipynb | Handle LaTeX expressions with regexes inside a custom feature transformer (sketch below). Grid search over classifiers. Reached 80% macro F1 |
| Preprocess the data for deep learning | keras_preprocessing.ipynb | Text -> fixed-length (zero-padded) sequences of ints (sketch below) |
| Compare loss functions/batch sizes/optimizers and pretrain word embeddings | keras_GlobalAvg_GridSearch.ipynb | A simple net reproduces the 80% on validation. Custom loss functions are helpful; custom metric functions are informative |
| Examine a couple of neural-net architectures | keras_RNN_LSTM.ipynb | Neural nets generally do worse than 80% |
| Get the final score of the neural nets on test data | keras_evaluate.ipynb | The winner climbed to 81% macro F1; it uses GlobalAveragePooling and a custom loss function (sketch below) |
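The remarks above are compressed, so the next three sketches illustrate the main ingredients; none of them is the notebooks' exact code. First, the regex-plus-custom-transformer idea from the shallow pipeline: a minimal sketch that collapses inline math into a single placeholder token (the token name and the pipeline composition are my illustration, not the notebook's implementation).

```python
import re

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class LatexStripper(BaseEstimator, TransformerMixin):
    """Collapse inline LaTeX math ($...$) into one token, so that formulas
    become a single coarse feature instead of tokenizer noise."""

    MATH = re.compile(r"\$[^$]+\$")

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [self.MATH.sub(" _latexmath_ ", text) for text in X]


# One shallow pipeline of the kind that could be grid-searched:
pipe = Pipeline([
    ("latex", LatexStripper()),
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
```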
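Second, the "text -> fixed-length sequence of ints" step matches the standard Keras tokenization workflow; a minimal sketch, with the vocabulary cap and sequence length chosen purely for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["We prove a new bound on ...", "A recurrent net for ..."]  # titles + abstracts

tokenizer = Tokenizer(num_words=20000)  # keep only the 20k most frequent words
tokenizer.fit_on_texts(texts)

sequences = tokenizer.texts_to_sequences(texts)           # lists of word indices
X = pad_sequences(sequences, maxlen=200, padding="post")  # zero-pad to a fixed length
```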
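Finally, the winning family (an embedding followed by GlobalAveragePooling, i.e. a fastText-style averaging net) can be sketched as below. The class-weighted cross-entropy is only one plausible reading of "custom loss function"; the notebooks' actual loss may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES, VOCAB_SIZE = 6, 20000

def weighted_ce(class_weights):
    """Categorical cross-entropy with per-class weights -- one way to fight
    class imbalance inside the loss itself (an assumption, not the notebook's code)."""
    w = tf.constant(class_weights, dtype=tf.float32)
    def loss(y_true, y_pred):
        ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
        return ce * tf.reduce_sum(y_true * w, axis=-1)  # weight of the true class
    return loss

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 100),  # (batch, maxlen) -> (batch, maxlen, 100)
    layers.GlobalAveragePooling1D(),    # average the word vectors over the sequence
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss=weighted_ce([1.0] * NUM_CLASSES),
              metrics=["accuracy"])
```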
Models in the last three notebooks were fitted on Colab (link to the Google Drive directory: ArXiv_Classifier).
The final conclusion of the project is that the data makes it relatively easy to obtain more than ~75% macro F1, and hard to go beyond 80%.
Beyond that, the dataset, being reasonably big and challenging, offers a practically endless TODO list, e.g.:
- examine topics modelled with LDA
- examine more neural-net architectures
- train fancier word embeddings using gensim
- implement my own CBOW and/or skip-gram in Keras and compare
- feed titles and abstracts into separate channels of a neural net
- implement real-life data-streaming, processing and learning (also: use generators for fitting in keras)
- train a network to generate a new abstract given the title ...
[^1]: Since 2017 there are two additional classes of articles. They get harvested, but I remove them from training and testing.