
Web Crawler API

The Web Crawler API is a simple API that allows you to crawl websites and store the crawled data in a database. It uses GuzzleHttp to send HTTP requests and parses the returned HTML to extract links from web pages. The API is built with the Laravel framework.

Features

  • Crawls websites and stores the crawled data in the database.
  • Supports setting the depth of the crawling process.
  • Prevents duplicate URLs from being crawled.
  • Retrieves and saves the HTML content of crawled pages.
  • Extracts valid URLs from the crawled pages.
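
Taken together, these features describe a depth-limited crawl with duplicate prevention. The sketch below illustrates that flow in plain PHP with GuzzleHttp and DOMDocument; it is not the project's actual implementation, and the function name and structure are hypothetical:

    <?php
    // Hypothetical sketch of the crawl loop; not the project's actual code.
    use GuzzleHttp\Client;

    function crawl(string $url, int $depth, array &$visited, Client $client): void
    {
        // Duplicate prevention: skip URLs that were already crawled.
        if (isset($visited[$url])) {
            return;
        }
        $visited[$url] = true;

        // Fetch the page HTML with GuzzleHttp.
        $html = (string) $client->get($url)->getBody();
        // ... persist $url and $html here (the project stores them in MongoDB) ...

        // Stop descending once the requested depth is exhausted.
        if ($depth <= 0) {
            return;
        }

        // Parse the HTML and follow only valid absolute URLs.
        $dom = new DOMDocument();
        @$dom->loadHTML($html); // suppress warnings from malformed HTML
        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            if (filter_var($href, FILTER_VALIDATE_URL)) {
                crawl($href, $depth - 1, $visited, $client);
            }
        }
    }

    $visited = [];
    crawl('https://example.com', 1, $visited, new Client(['timeout' => 10]));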

Prerequisites

  • PHP >= 7.4
  • Composer
  • Laravel framework
  • MongoDB
  • Docker
  • Docker Compose
  • GuzzleHttp
  • MongoDB PHP driver (the mongodb.so extension)
  • jenssegers/mongodb package

Getting Started

  1. Clone the repository:

    git clone https://git.dayanhub.com/kfir/rank_exam
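
After cloning, a standard Laravel setup usually follows. These steps assume the repository uses the default Laravel layout; treat them as a guide rather than the project's documented procedure:

    cd rank_exam
    composer install
    cp .env.example .env
    php artisan key:generate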

Services

Server

Run the server - php artisan serve

MongoDB

Run MongoDB - docker-compose up -d
Run migrations - php artisan migrate
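
The repository's docker-compose.yml is expected to define the MongoDB service; a minimal equivalent is sketched below purely for illustration (the image tag, port mapping, and volume name are assumptions):

    services:
      mongodb:
        image: mongo:6
        ports:
          - "27017:27017"
        volumes:
          - mongo_data:/data/db
    volumes:
      mongo_data: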

Configuration

Use the .env file to set up the database connection.
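
With the jenssegers/mongodb package, the connection is typically configured through variables like the following. The database name here is a placeholder, and the exact variable names depend on the project's config/database.php:

    DB_CONNECTION=mongodb
    DB_HOST=127.0.0.1
    DB_PORT=27017
    DB_DATABASE=crawler
    DB_USERNAME=
    DB_PASSWORD=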

API Endpoints

GET /api/crawl:

Crawls a website and stores the crawled data in the database.
Parameters:
- `url` (required): The URL of the website to crawl.
- `depth` (optional): The depth of the crawling process (default: 0).
- `refresh` (optional): If set to 1, the crawler refreshes the stored results for an existing URL (default: false).
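
For example, to crawl example.com two levels deep and force a refresh (illustrative values):

    GET /api/crawl?url=https://example.com&depth=2&refresh=1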

GET /api:

Retrieves all crawled data from the database.

DELETE /api/crawl/{id}:

Deletes a specific crawled data record from the database.

DELETE /api/crawl:

Deletes all crawled data records from the database.