Textharvester

A Python package for crawling the web and extracting large, clean text corpora for NLP and downstream machine learning workflows.

Python · BeautifulSoup · Multi-threading · NLP · PyPI

Overview

Textharvester is a lightweight crawler and extraction toolkit built to generate high-quality text datasets from noisy websites.
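To make "noisy" concrete, here is a minimal sketch of boilerplate stripping. It is an illustration under stated assumptions, not Textharvester's own extraction logic: `extract_text`, `TextExtractor`, and the `SKIP_TAGS` list are hypothetical names, and it uses the standard-library `html.parser` rather than BeautifulSoup so it runs without extra dependencies.

```python
from html.parser import HTMLParser

# Tags whose contents are usually page chrome rather than body text.
# This list is an assumption for the sketch, not the package's rule set.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect text nodes that are not nested inside any SKIP_TAGS element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the visible text of `html` with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())
```

Text nodes are joined with a single space, so inline markup such as `<b>` can leave a stray space before punctuation. A production extractor would also need heuristics beyond tag names, such as link density or block length.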

What was built

  • Depth-first crawling with configurable depth and domain constraints.
  • Multithreaded downloading to improve crawl throughput.
  • Boilerplate stripping and content extraction tuned for NLP corpus quality.
  • Packaging and distribution flow for easy reuse in experiments.
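The crawling items above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: `crawl_dfs`, `fetch_many`, `LinkParser`, and the injected `fetch` callable are hypothetical names, and link parsing uses the standard-library `html.parser` instead of BeautifulSoup so the sketch runs without dependencies.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_dfs(start_url, fetch, max_depth=2):
    """Visit same-domain pages depth-first, at most `max_depth` hops from the start.

    `fetch` is a hypothetical callable mapping a URL to its HTML, or None on failure.
    Returns the URLs in the order they were visited.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    visited = []
    stack = [(start_url, 0)]  # explicit stack instead of recursion
    while stack:
        url, depth = stack.pop()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth constraint: do not expand this page's links
        html = fetch(url)
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)
        # Push in reverse so the first link on the page is explored first.
        for href in reversed(parser.links):
            link = urljoin(url, href)
            # Domain constraint: stay on the start URL's host.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                stack.append((link, depth + 1))
    return visited


def fetch_many(urls, fetch, workers=8):
    """Download a known list of URLs concurrently; return {url: html} for successes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch, urls)  # results arrive in input order
        return {url: html for url, html in zip(urls, pages) if html is not None}
```

Passing `fetch` in as a parameter keeps the traversal testable against an in-memory dict of pages. A real crawl would supply a function that performs the HTTP request and returns None on failure. Threads help here because downloading is I/O-bound, so workers spend most of their time waiting on the network.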

Impact

  • Provided a practical data-ingestion base for text classification and sequence modeling projects.
  • Kept the tool simple enough for rapid setup while still handling larger crawl jobs.

Quick Start

# Install from PyPI (shell)
pip install textharvester

# Crawl a site with a maximum depth of 2 (Python)
from textharvester import TextHarvester

harvester = TextHarvester(max_depth=2)
results = harvester.harvest("https://example.com")

Links