# Web Crawler API

The Web Crawler API is a simple API that lets you crawl websites and store the crawled data in a database. It uses the Guzzle HTTP client to send requests and parses the returned HTML to extract links from the crawled pages. The API is built with the Laravel framework.
## Features
- Crawls websites and stores the crawled data in the database.
- Supports setting the depth of the crawling process.
- Prevents duplicate URLs from being crawled.
- Retrieves and saves the HTML content of crawled pages.
- Extracts valid URLs from the crawled pages.
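The fetch-and-extract step behind these features can be pictured roughly as follows. This is a minimal sketch using Guzzle and PHP's built-in DOMDocument; the function name and structure are illustrative, not taken from this codebase:

```php
<?php
// Illustrative sketch only - not this repository's actual code.
require 'vendor/autoload.php';

use GuzzleHttp\Client;

// Fetch a page and return the unique, valid absolute URLs found in it.
function fetchLinks(Client $client, string $url): array
{
    $html = (string) $client->get($url)->getBody();

    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from malformed real-world HTML

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (filter_var($href, FILTER_VALIDATE_URL)) { // keep only valid URLs
            $links[] = $href;
        }
    }

    return array_values(array_unique($links)); // avoid crawling duplicates
}

$client = new Client(['timeout' => 10]);
$links = fetchLinks($client, 'https://example.com');
```

Depth-limited crawling then amounts to repeating this step on the returned links until the configured depth is reached.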
## Prerequisites
- PHP >= 7.4
- Composer
- Laravel framework
- MongoDB
- Docker
- Docker Compose
- GuzzleHttp
- MongoDB PHP driver (the `mongodb` extension)
- `jenssegers/mongodb` package
## Getting Started

Clone the repository:

`git clone <repository-url>`
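After cloning, a typical Laravel setup looks like the following. These are standard Laravel steps rather than instructions spelled out in this README, so treat them as a sketch:

```bash
composer install          # install PHP dependencies from composer.json
cp .env.example .env      # create a local environment file
php artisan key:generate  # generate the application key Laravel requires
```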
## Services

### Server

Run the application server:

`php artisan serve`

### MongoDB

Start MongoDB via Docker Compose:

`docker-compose up -d`

Then run the migrations:

`php artisan migrate`
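For reference, a MongoDB service definition in a `docker-compose.yaml` typically looks something like the sketch below. This is a generic example with assumed values (image tag, credentials, mounted directory), not the contents of this repository's actual file:

```yaml
services:
  mongodb:
    image: mongo:6          # assumed image tag
    ports:
      - "27017:27017"       # MongoDB's default port
    environment:
      MONGO_INITDB_ROOT_USERNAME: root     # assumed credentials
      MONGO_INITDB_ROOT_PASSWORD: example
    volumes:
      # assumed mount of the repo's init-scripts directory
      - ./init-scripts:/docker-entrypoint-initdb.d
```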
## Configuration

Use the `.env` file to configure the database connection.
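A minimal sketch of the MongoDB-related entries, assuming the conventional Laravel variable names used with the `jenssegers/mongodb` driver (your values will differ):

```dotenv
DB_CONNECTION=mongodb
DB_HOST=127.0.0.1
# 27017 is MongoDB's default port
DB_PORT=27017
# Assumed names and credentials - replace with your own
DB_DATABASE=crawler
DB_USERNAME=root
DB_PASSWORD=example
```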
## API Endpoints

- `GET /api/crawl`: Crawls a website and stores the crawled data in the database. Required query parameter: `url`. Optional query parameter: `depth` (default: 1).
- `GET /api`: Retrieves all crawled data from the database.
- `DELETE /api/crawl/{id}`: Deletes a specific crawled data record from the database.
- `DELETE /api/crawl`: Deletes all crawled data records from the database.
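Example requests with `curl`, assuming the app is served locally by `php artisan serve` on its default port 8000 (replace `<id>` with a real record id):

```bash
# Crawl https://example.com two levels deep
curl "http://localhost:8000/api/crawl?url=https://example.com&depth=2"

# List all crawled records
curl "http://localhost:8000/api"

# Delete one record by id, then delete everything
curl -X DELETE "http://localhost:8000/api/crawl/<id>"
curl -X DELETE "http://localhost:8000/api/crawl"
```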