Abhijeet Kumar, Aryan Mittal, Devansh Verma, Riya Sanket Kashive
This solution automates the extraction and content analysis of all embedded links from a website, regardless of their location. From the extracted content it generates concise, relevant questions and identifies the most relevant links and topics for each question. An automated verification and metric system then assesses the conciseness and relevance of the results. This document describes the repository in detail.
- Data Scraping: Utilizes Selenium to extract all embedded links from the target website.
- Data Storage: JSON files are used to store and organize the extracted data.
- Question Generation: We employ the duckduckgo_search library in conjunction with the Gemini API to generate precise and pertinent questions.
- Link-Question Mapping and Relevance Metric: TF-IDF vectorization is used to map the generated questions to the most relevant links, and it also serves as the relevance metric for evaluating the quality of the mappings.
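
The scraping and storage steps above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`normalize_links`, `scrape_hrefs`, `save_links`), the JSON field names, and the choice of headless Chrome are all assumptions made for the example.

```python
import json
from urllib.parse import urldefrag, urljoin, urlparse


def normalize_links(base_url, hrefs):
    """Resolve hrefs against base_url; drop fragments, non-HTTP schemes, and duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # <a> elements without an href yield None or ""
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar pseudo-links
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


def scrape_hrefs(url):
    """Collect the raw href attribute of every <a> element on the rendered page."""
    # Imported lazily so the helpers in this sketch work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # assumption: a recent Chrome is available
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()


def save_links(path, source_url, links):
    """Write the extracted links to a JSON file (field names are illustrative)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"source": source_url, "links": links}, fh, indent=2)
```

A typical call chain would be `save_links("links.json", url, normalize_links(url, scrape_hrefs(url)))`. Keeping the normalization separate from the browser session means it can be checked without launching a driver.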
For detailed information on the problem statement, please refer to this document.
Our solution achieved an accuracy of 83%.
- Extraction of embedded links from a website.
- Content analysis and question generation.
- Implementation of an automated system for verification and relevance assessment.
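
To illustrate how TF-IDF can both map questions to links and produce a relevance score, here is a small standard-library sketch. It is an assumption-laden illustration rather than the repository's implementation, which may well use a library vectorizer instead: the function names, the smoothed IDF formula, the use of cosine similarity as the score, and fitting IDF over pages and questions together are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Raw term count times a smoothed inverse document frequency.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_link_for_each_question(questions, link_texts):
    """Map each question to (url, score) for the most similar page text.

    link_texts is a hypothetical dict of url -> extracted page text.
    """
    if not link_texts:
        raise ValueError("link_texts must contain at least one page")
    urls = list(link_texts)
    vectors = tfidf_vectors([link_texts[u] for u in urls] + list(questions))
    page_vecs, question_vecs = vectors[:len(urls)], vectors[len(urls):]
    results = {}
    for question, q_vec in zip(questions, question_vecs):
        scores = [cosine(q_vec, p_vec) for p_vec in page_vecs]
        best = max(range(len(urls)), key=scores.__getitem__)
        results[question] = (urls[best], scores[best])
    return results
```

The score for each question lies between 0 and 1, so a threshold on it could act as a simple automated relevance check; any specific threshold would need to be tuned against the project's own data.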