Machine Learning Based Model for Detecting Similarity of Scientific Papers

ABSTRACT
ABSTRACT (in Chinese)
ACKNOWLEDGEMENTS
DEDICATION
LIST OF ABBREVIATIONS
Chapter ONE: INTRODUCTION
1. Background
1.1 Thesis Structure
Chapter TWO: TEXT PREPROCESSING
2. Cleaning and Preparing Text Data
2.1 Removal of Punctuation Marks
2.2 Stop-word Removal
2.3 Stemming: Determining the Base Form of a Word
2.4 Lemmatization: Determining the Base Form of a Word Using a Dictionary
2.5 Tokenization: Extracting Word Tokens
2.6 Tagging: Syntax Highlighting
2.7 Text Chunking: Grouping Words
2.8 Parsing
Chapter THREE: LANGUAGE MODELING
3. Methods of Language Modeling
3.1 Term Frequency-Inverse Document Frequency (TF-IDF)
3.2 N-grams
3.3 Singular Value Decomposition (SVD)
3.4 Neural Network Based Language Modeling
3.5 Convolutional Neural Network Language Models
3.6 Recurrent Neural Network Language Models
3.7 Word2vec: Vector Representation of Words
3.8 GloVe: Global Vectors for Word Representation
Chapter FOUR: CLUSTERING TEXT DATA
4. Methods for Clustering Texts
4.1 K-means Algorithm
4.2 Hierarchical Clustering
4.3 Spectral Clustering
4.4 Clustering Using RNNs
4.5 Convolutional Clustering
Chapter FIVE: IMPLEMENTING THE LANGUAGE MODEL
5. Clustering Scientific Papers Based on Word Vectors
5.1 Data Collection
5.2 Technical Specification
5.3 Cleaning and Preparing the Text Data
5.4 Creating the Language Model
5.5 Capturing Linguistic Similarity Between Papers
5.6 Results
Conclusion
REFERENCES