Flask
Machine Learning
Web Development
Plagiarism Detection
NLP
Text Analysis
Python
KMP Algorithm
Project Overview
This project is a web-based tool that detects plagiarism by comparing submitted text against internet sources and against other files. It is written in Python and uses the KMP (Knuth-Morris-Pratt) algorithm for efficient text pattern matching, combined with machine learning techniques, to flag potentially plagiarized content.
Key Features
- Dual detection approach:
  - Internet source comparison through Google Search API
  - File-to-file comparison using the KMP algorithm
- Interactive web interface built with Flask
- Detailed similarity reports with percentage matches
- Highlighted text sections showing potential plagiarism
- Support for multiple file formats (TXT, DOC, PDF)
- Fast processing with optimized text pattern matching
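The percentage and highlighting features listed above could be computed roughly as follows. This is a minimal sketch rather than the project's actual code: the function names `split_sentences` and `similarity_report`, the sentence-level granularity, and the `[[...]]` highlight markers are all assumptions made for illustration, and Python's built-in `str.find` stands in for the KMP search.

```python
import re


def split_sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity_report(submission, source):
    """Return (percent_matched, highlighted_text) for a submission.

    A sentence counts as matched when it appears verbatim in the source
    after lowercasing and whitespace collapsing (an assumed rule).
    """
    source_norm = " ".join(source.lower().split())
    sentences = split_sentences(submission)
    if not sentences:
        return 0.0, ""

    highlighted = []
    matched = 0
    for sentence in sentences:
        needle = " ".join(sentence.lower().split())
        # str.find is a stand-in here; the project uses KMP for this step.
        if source_norm.find(needle) != -1:
            matched += 1
            highlighted.append(f"[[{sentence}]]")  # hypothetical marker
        else:
            highlighted.append(sentence)

    percent = 100.0 * matched / len(sentences)
    return percent, " ".join(highlighted)


if __name__ == "__main__":
    submission = "The cat sat on the mat. Dogs bark loudly. I wrote this myself."
    source = "Once upon a time the cat sat on the mat. Then dogs bark loudly."
    percent, marked = similarity_report(submission, source)
    print(f"{percent:.1f}% matched")
    print(marked)
```

In a web interface the `[[...]]` markers would typically be replaced with HTML highlight tags before rendering.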
Technical Implementation
The system architecture includes several key components:
- Flask web framework for the frontend and API
- KMP algorithm implementation for efficient string matching
- Google Search API integration for web source comparison
- Text preprocessing with NLTK for tokenization and normalization
- Machine learning models for similarity detection
- Document parser for handling multiple file formats
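The preprocessing component can be illustrated with a small sketch. The project uses NLTK for this step; to keep the example self-contained, the code below substitutes the Python standard library, and the function names `normalize` and `tokenize` are hypothetical. With NLTK installed and its tokenizer data downloaded, `nltk.word_tokenize` could replace the whitespace split.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def tokenize(text):
    """Return the normalized text as a list of word tokens."""
    return normalize(text).split()


if __name__ == "__main__":
    # Two differently formatted versions of the same sentence
    # become identical after normalization.
    a = "The  Quick, brown FOX!"
    b = "the quick brown fox"
    print(normalize(a) == normalize(b))
    print(tokenize("Hello, World!  It's  KMP."))
```

Normalizing both the submission and the comparison text before matching means that differences in case, punctuation, or spacing alone do not hide an otherwise identical passage.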
Algorithm Highlights
The core of the plagiarism detection system uses the KMP algorithm, which:
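- Performs linear-time string matching: O(n + m), where n is the length of the text and m is the length of the pattern
- Uses a preprocessing step to build a partial match table (the longest proper prefix of the pattern that is also a suffix, for each position)
- Identifies matches without backtracking in the text: after a mismatch, the table tells the search how far to shift the pattern, so no text character is re-read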
- Works effectively with large documents and complex text patterns
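The properties above can be seen in a compact version of KMP. This is a textbook-style sketch, not the project's own implementation; the function names `build_partial_match_table` and `kmp_search` are chosen here for illustration.

```python
def build_partial_match_table(pattern):
    """table[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        # On mismatch, fall back to the next shorter candidate prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    table = build_partial_match_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # The text index i only moves forward; only k falls back.
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches


if __name__ == "__main__":
    print(build_partial_match_table("ababaca"))
    print(kmp_search("abababa", "aba"))
```

The text index never moves backward, so each text character is read once; the fallback work in the inner `while` loop is bounded by the number of successful character matches, which gives the O(n + m) total.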
Applications
This plagiarism detection tool is useful for:
- Educational institutions checking student assignments
- Publishers verifying original content
- Researchers validating citation accuracy
- Content creators ensuring originality