Plagiarism Detection using ML

Tags: Flask, Machine Learning, Web Development, Plagiarism Detection, NLP, Text Analysis, Python, KMP Algorithm

Project Overview

This project is a web-based tool that detects plagiarism in two ways: by comparing submitted text against internet sources, and by comparing files against each other. It is written in Python and uses the Knuth-Morris-Pratt (KMP) algorithm for efficient exact pattern matching, combined with machine learning techniques to flag potentially plagiarized content.

Key Features

  • Dual detection approach:
    • Internet source comparison through Google Search API
    • File-to-file comparison using the KMP algorithm
  • Interactive web interface built with Flask
  • Detailed similarity reports with percentage matches
  • Highlighted text sections showing potential plagiarism
  • Support for multiple file formats (TXT, DOC, PDF)
  • Fast processing through linear-time text pattern matching
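
To make the similarity report and highlighting features concrete, here is a minimal sketch of how a percentage match could be computed. It is not the project's actual code: the function names, the regex-based sentence splitter, and the `[[...]]` highlight markers are all assumptions chosen for illustration, and Python's built-in substring test stands in for the KMP matcher.

```python
import re

def normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption, not the project's parser)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def similarity_report(suspect, source):
    """Return the share of suspect sentences that appear verbatim in source."""
    src = normalize(source)
    sentences = split_sentences(suspect)
    # Built-in substring search stands in for KMP here.
    matched = [s for s in sentences if normalize(s) in src]
    percent = 100.0 * len(matched) / len(sentences) if sentences else 0.0
    return {"percent": round(percent, 1), "matched": matched}

def highlight(suspect, matched):
    """Wrap each matched sentence in [[...]] markers (hypothetical markup)."""
    out = suspect
    for s in dict.fromkeys(matched):  # dedupe while keeping order
        out = out.replace(s, "[[" + s + "]]")
    return out

suspect = "The cat sat on the mat. Dogs bark loudly. I wrote this myself."
source = "Intro text. the cat sat on   the mat. Also, dogs bark loudly. End."
report = similarity_report(suspect, source)
print(report["percent"])
print(highlight(suspect, report["matched"]))
```

Scoring at the sentence level keeps the report easy to read, but it only catches verbatim copies; paraphrased text would need a fuzzier measure.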

Technical Implementation

The system architecture includes several key components:

  • Flask web framework for the frontend and API
  • KMP algorithm implementation for efficient string matching
  • Google Search API integration for web source comparison
  • Text preprocessing with NLTK for tokenization and normalization
  • Machine learning models for similarity detection
  • Document parser for handling multiple file formats
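
The page does not say which similarity model the project uses. As one possible baseline, the sketch below scores two texts by cosine similarity over word counts, using only the standard library. The regex tokenizer is a stand-in for NLTK tokenization, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphanumeric tokens (stand-in for NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)

score = cosine_similarity(
    "The quick brown fox jumps over the lazy dog",
    "A quick brown fox leaps over a lazy dog",
)
print(round(score, 3))
```

Unlike exact matching, this measure still gives a high score when a few words are swapped or reordered, which is why a vector-based score is a natural complement to KMP.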

Algorithm Highlights

The core of the plagiarism detection system uses the KMP algorithm, which:

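  • Performs string matching in linear time, O(n + m), where n is the length of the text and m is the length of the pattern
  • Uses a preprocessing step to build a partial match table (also called the failure function) from the pattern
  • Never moves backward in the text: after a mismatch, the table tells the matcher how far to shift the pattern, so no text character is re-examined from scratch
  • Scales to large documents, because the running time stays linear even on highly repetitive text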
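
The properties above can be illustrated with a textbook KMP implementation. This is a generic sketch of the algorithm rather than the project's own code, and the function names are chosen here for illustration.

```python
def build_failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0  # length of the current matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]  # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):  # the text index only moves forward
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]  # keep going to find overlapping matches
    return matches

print(build_failure_table("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
print(kmp_search("abababa", "aba"))    # [0, 2, 4]
```

The second call shows that overlapping occurrences are reported, which matters when a copied phrase repeats within a document.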

Applications

This plagiarism detection tool is useful for:

  • Educational institutions checking student assignments
  • Publishers verifying original content
  • Researchers validating citation accuracy
  • Content creators ensuring originality