Prathamesh's Portfolio

Flask, Machine Learning, Web Development, Plagiarism Detection, NLP, Text Analysis, Python, KMP Algorithm

Project Overview

This project is a web-based tool that detects plagiarism by comparing submitted text against internet sources and against other files. It is written in Python and uses the KMP (Knuth-Morris-Pratt) algorithm for efficient text pattern matching, combined with machine learning techniques to identify potentially plagiarized content with high accuracy.

Key Features

Dual detection approach:
- Internet source comparison through the Google Search API
- File-to-file comparison using the KMP algorithm
Interactive web interface built with Flask
Detailed similarity reports with percentage matches
Highlighted text sections showing potential plagiarism
Support for multiple file formats (TXT, DOC, PDF)
Fast processing with optimized text pattern matching
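
To illustrate how a percentage match and highlighted sections could be derived, here is a minimal sketch. It is not the project's actual code: the function names (split_sentences, similarity_report, highlight), the sentence-level granularity, and the <mark> wrapper are all assumptions made for illustration, and Python's built-in substring test stands in for the KMP matcher.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def similarity_report(suspect, source):
    """Return (percent, flagged), where flagged pairs each sentence with a hit flag.

    A sentence counts as a hit when it appears verbatim (case-insensitively)
    in the source. The `in` operator is a stand-in for a KMP search here.
    """
    source_lower = source.lower()
    flagged = [(s, s.lower() in source_lower) for s in split_sentences(suspect)]
    hits = sum(1 for _, hit in flagged if hit)
    percent = 100.0 * hits / len(flagged) if flagged else 0.0
    return percent, flagged


def highlight(flagged):
    """Rejoin sentences with single spaces, wrapping hits in <mark> tags (hypothetical markup)."""
    return " ".join(f"<mark>{s}</mark>" if hit else s for s, hit in flagged)


suspect = "The cat sat. Dogs bark loudly. Birds fly."
source = "Everyone knows dogs bark loudly. The cat sat."
percent, flagged = similarity_report(suspect, source)
print(f"{percent:.1f}% of sentences matched")
print(highlight(flagged))
```

Sentence-level matching only catches verbatim copies; paraphrased text would need a fuzzier measure, which is presumably where the machine learning component comes in.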

Technical Implementation

The system architecture includes several key components:

Flask web framework for the frontend and API
KMP algorithm implementation for efficient string matching
Google Search API integration for web source comparison
Text preprocessing with NLTK for tokenization and normalization
Machine learning models for similarity detection
Document parser for handling multiple file formats
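
As a rough idea of what the preprocessing stage does before matching, the sketch below normalizes and tokenizes text using only the Python standard library. The project itself uses NLTK for this step; the plain-Python stand-in, the function names (normalize, tokenize), and the tiny stopword list are assumptions chosen so the example runs without downloading any NLTK data.

```python
import re
import string

# Tiny illustrative stopword list (an assumption); NLTK ships a much larger one.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text, drop_stopwords=True):
    """Split normalized text on whitespace, optionally dropping stopwords."""
    tokens = normalize(text).split()
    if drop_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


print(normalize("  The  Quick, brown FOX!  "))
print(tokenize("The cat is in the hat."))
```

Normalizing both documents the same way before comparison means that differences in case, punctuation, or spacing alone do not hide an otherwise identical passage.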

Algorithm Highlights

The core of the plagiarism detection system uses the KMP algorithm, which:

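Performs string matching in linear time: O(n+m), where n is the length of the text and m is the length of the pattern
Uses a preprocessing step to build a partial match table (also called the failure function) from the pattern
Identifies matches without ever backtracking in the text being searched
Scales to large documents and to long or repetitive patterns, since running time grows linearly with input size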
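
The points above can be made concrete with a textbook KMP implementation. This is a generic sketch of the algorithm, not the project's own source; the function names build_partial_match_table and kmp_search are chosen here for illustration.

```python
def build_partial_match_table(pattern):
    """table[i] is the length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to the next shorter candidate prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text,
    including overlapping ones."""
    if not pattern:
        return []
    table = build_partial_match_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # the text index i only ever moves forward
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches


print(build_partial_match_table("ababaca"))
print(kmp_search("abababa", "aba"))
```

On a mismatch, the table tells the matcher how much of the pattern is still known to match, so the position in the text is never rewound. That is what keeps the total work proportional to n+m.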

Applications

This plagiarism detection tool is useful for:

Educational institutions checking student assignments
Publishers verifying original content
Researchers validating citation accuracy
Content creators ensuring originality