Textharvester

A Python package for crawling the web and extracting large, clean text corpora for NLP and downstream machine learning workflows.

Python, BeautifulSoup, Multi-threading, NLP, PyPI

Overview

Textharvester is a lightweight crawler and extraction toolkit built to generate high-quality text datasets from noisy websites.

What was built

Depth-first crawling with configurable depth and domain constraints.
Multithreaded downloading to improve crawl throughput.
Boilerplate stripping and content extraction tuned for NLP corpus quality.
Packaging and distribution through PyPI so the tool can be installed with pip and reused across experiments.
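
To make the first and third items above concrete, here is a minimal sketch of a depth-first crawl with a depth limit and a same-domain constraint, combined with simple boilerplate stripping. It is not Textharvester's actual implementation: the names `PageParser`, `crawl`, `SKIP_TAGS`, and the injected `fetch` callable are hypothetical, the sketch uses only the standard library rather than BeautifulSoup, and it leaves out multithreaded downloading. The demo runs against an in-memory set of pages so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

# Tags whose contents are treated as boilerplate (an assumption for this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class PageParser(HTMLParser):
    """Collects outgoing links and non-boilerplate text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []       # absolute URLs, in document order
        self.chunks = []      # text found outside SKIP_TAGS
        self._skip_depth = 0  # > 0 while inside a boilerplate tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            # Links are kept even inside boilerplate; only text is filtered.
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def crawl(seed, fetch, max_depth=2):
    """Depth-first crawl from `seed`, restricted to the seed's domain.

    `fetch` is any callable mapping a URL to an HTML string, or None on failure.
    Returns a dict of URL -> extracted text, in visit order.
    """
    domain = urlparse(seed).netloc
    seen = {seed}
    results = {}
    stack = [(seed, 0)]  # a LIFO stack gives depth-first order
    while stack:
        url, depth = stack.pop()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser(url)
        parser.feed(html)
        results[url] = " ".join(parser.chunks)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        # Push in reverse so the first link on the page is visited first.
        for link in reversed(parser.links):
            link = urldefrag(link).url  # drop any #fragment
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                stack.append((link, depth + 1))
    return results


# In-memory stand-in for a website, used instead of real HTTP requests.
PAGES = {
    "https://example.com/": '<nav><a href="/a">A</a></nav><p>Home</p>'
                            '<a href="/b">B</a><a href="https://other.org/x">X</a>',
    "https://example.com/a": '<p>Page A</p><a href="/c">C</a><script>var x=1;</script>',
    "https://example.com/b": "<p>Page B</p>",
    "https://example.com/c": '<p>Page C</p><a href="/d">D</a>',
    "https://example.com/d": "<p>Page D</p>",
}

results = crawl("https://example.com/", PAGES.get, max_depth=2)
for url, text in results.items():
    print(url, "->", text)
```

Passing `fetch` in as a parameter keeps the traversal logic separate from downloading, so a real HTTP client (or a thread pool wrapped around one) could be swapped in without touching the crawl loop.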

Impact

Provided a practical data-ingestion base for text classification and sequence modeling projects.
Kept the tool simple enough for rapid setup while still handling larger crawl jobs.

Quick Start

pip install textharvester

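from textharvester import TextHarvester

# Limit how deep the crawler follows links from the starting page.
harvester = TextHarvester(max_depth=2)
# Crawl from the given URL and collect the extracted text.
results = harvester.harvest("https://example.com")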

Links

GitHub
PyPI