@Rossi-Luciano
Contributor

What does this PR do?

This PR improves the system's documentation and configuration:

  1. Structured documentation via the Wiki:

    • Adds a home page with a system overview and the target audience
    • Creates a visual flowchart of the process with an explanatory legend
    • Organizes the tutorials (Installation, Processing, Validation)
    • Standardizes terminology ("conversion" → "processing")
  2. AI API configuration:

    • Adds Hugging Face support (HF_TOKEN)
    • Enables the Llama integration (LLAMA_ENABLED)
    • Configures the Google Gemini API (GEMINI_API_KEY)
  3. Code standardization:

    • Adjusts PEP8 formatting in xml_manager/models.py
    • Fixes line breaks in ForeignKeys
    • Removes unnecessary whitespace

Where should the review start?

Wiki documentation:

  • Home page
  • Process flowchart (check the Mermaid rendering)
  • Links between tutorial pages
  • Terminology consistency across all pages

Code files:

  • /.envs/.local/.django
    • Confirm whether the API keys should be versioned (consider using secrets)
    • Check the default setting LLAMA_ENABLED=True
  • /xml_manager/models.py
    • Classes XMLDocument, XMLDocumentPDF, XMLDocumentHTML
    • Reformatted create() methods
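
For the two review points on /.envs/.local/.django, here is a minimal sketch of how the settings module might read the three variables from the environment, so that real keys never need to be committed. The helper names `env_bool` and `load_ai_settings` are hypothetical, and defaulting LLAMA_ENABLED to False is an assumption for illustration, not what this PR configures.

```python
import os


def env_bool(name, default=False, environ=None):
    """Parse a boolean environment variable such as LLAMA_ENABLED."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_ai_settings(environ=None):
    """Collect the three AI-related settings named in this PR.

    Secrets default to an empty string so a missing key is visible
    instead of silently falling back to a committed value.
    """
    environ = os.environ if environ is None else environ
    return {
        "HF_TOKEN": environ.get("HF_TOKEN", ""),
        "GEMINI_API_KEY": environ.get("GEMINI_API_KEY", ""),
        # Assumption: opt in explicitly rather than default to True.
        "LLAMA_ENABLED": env_bool("LLAMA_ENABLED", default=False, environ=environ),
    }
```

With this shape, the versioned env file could carry placeholder values only, and each deployment would supply the real keys.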

How can this be tested manually?

Documentation:

  1. Open the repository Wiki
  2. Browse the page structure
  3. Test all internal links
  4. Check that the Mermaid flowchart renders

Code:

  1. Check that the environment variables load:
     python manage.py shell
     >>> from django.conf import settings
     >>> print(settings.HF_TOKEN)
     >>> print(settings.LLAMA_ENABLED)
     >>> print(settings.GEMINI_API_KEY)
  2. Run the existing tests:
     python manage.py test xml_manager
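
The shell check above prints the token and API key in plain text. If that is a concern (for example, in a shared terminal or a recorded session), a reviewer could confirm the values are set without echoing them. The sketch below uses a hypothetical helper, `mask_secret`, that is not part of this PR.

```python
def mask_secret(value, visible=4):
    """Return a display-safe version of a secret, keeping only its tail."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal any part safely.
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]
```

In the Django shell this would be used as `print(mask_secret(settings.HF_TOKEN))`.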

Any background context you want to provide?

Documentation:

  • Written for a target audience with varying technical levels
  • The structure follows the natural flow: Installation → Processing → Validation
  • Terminology standardized to make the material easier to understand

What are the relevant tickets?

N/A

References

eduranm and others added 30 commits September 26, 2025 10:15
… of new apps and increase of the field limit
…s, texts with language, and flexible date handling
…s and expansion of supported types (confproc, full_text, etc.)
…istas for search, utilities, and Wagtail hooks
…ial.py and removal of intermediate migrations
…n of Django and translation of verbose_name to English
Fixes the exception type so that a 404 is returned when the record does not exist.
…nlaces

Reduces log noise and keeps the function focused on its return value.
Improves readability and follows good error-handling practices.
…a references prompt

Adds quotes to text fields and fixes commas/keys to avoid prompt parsing errors.
Allows translation of 'Mixed Citation' and 'Rating from 1 to 10'.
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request introduces a document markup and XML generation system for processing DOCX files and managing references. It adds two new applications (markup_doc and model_ai) with AI-powered metadata extraction, reference parsing, and XML/HTML generation. Key changes include renaming menu identifiers from xml_manager to xml_files, consolidating the xml_manager admin group, adding new dependencies for AI processing (Google Generative AI, python-docx, langid), and implementing a complete workflow for converting DOCX documents to SciELO-compliant XML.

Key Changes

  • Added markup_doc app with DOCX processing, AI-based labeling, XML generation, and SciELO package creation
  • Added model_ai app for managing LLM models (Llama/Gemini) with download capabilities
  • Renamed XML manager menu from xml_manager to xml_files and consolidated menu structure
  • Added new package dependencies: google-generativeai, python-docx, and langid

Reviewed Changes

Copilot reviewed 59 out of 70 changed files in this pull request and generated 91 comments.

| File | Description |
| --- | --- |
| requirements/base.txt | Added AI processing dependencies (google-generativeai, langid, python-docx) |
| xml_manager/wagtail_hooks.py | Renamed menu identifiers and consolidated menu structure for XML management |
| reference/wagtail_hooks.py | Refactored import statements and renamed admin class with menu order adjustment |
| reference/models.py | Added ReferenceStatus enum and replaced estatus with status field |
| reference/marker.py | Updated imports to use new model_ai.llama module |
| reference/data_utils.py | Enhanced error handling and updated to use ReferenceStatus enum |
| model_ai/* | New app for managing AI models with Llama/Gemini integration |
| markup_doc/* | New app for DOCX processing, metadata extraction, and XML generation |
| markuplib/* | New library for DOCX processing and OMML to MathML conversion |

Comments suppressed due to low confidence (1)

markup_doc/sync_api.py:108

  • Except block directly handles BaseException.

'uri': {'type': 'string'},
'access_date': {'type': 'string'},
'version': {'type': 'string'},
"full_text": {"type": "integer"},

Copilot AI Oct 30, 2025

The type for 'full_text' should be 'string', not 'integer'. This field contains textual reference content, not numeric data.
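
A minimal sketch of the corrected fragment, together with a small type check that shows why the `integer` declaration would reject real reference text. Only the four property names come from the diff. The surrounding `"type": "object"` wrapper and the helper `type_errors` are assumptions for illustration.

```python
REFERENCE_SCHEMA_FRAGMENT = {
    "type": "object",  # assumed wrapper; not shown in the diff
    "properties": {
        "uri": {"type": "string"},
        "access_date": {"type": "string"},
        "version": {"type": "string"},
        "full_text": {"type": "string"},  # declared as "integer" in the diff
    },
}

_JSON_TYPES = {"string": str, "integer": int}


def type_errors(payload, schema):
    """Return the names of fields whose value does not match the declared type."""
    errors = []
    for name, spec in schema["properties"].items():
        if name not in payload:
            continue
        expected = _JSON_TYPES[spec["type"]]
        value = payload[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(name)
    return errors
```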

Comment on lines +91 to +92
# FIXME: Hardcoded model name
model = genai.GenerativeModel('models/gemini-2.0-flash')

Copilot AI Oct 30, 2025

The Gemini model name is hardcoded. Consider making this configurable through the LlamaModel database entry or environment variable to support different model versions and avoid requiring code changes for model updates.
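
One way this suggestion could look, as a sketch only: resolve the model name from a database value first, then from an environment variable, and fall back to the name currently hardcoded in the diff. The variable name `GEMINI_MODEL_NAME` and the function `resolve_gemini_model_name` are hypothetical. How the value would be stored on the LlamaModel entry is not shown in this PR, so it is passed in as a plain argument here.

```python
import os

# The value that is hardcoded in the diff today.
DEFAULT_GEMINI_MODEL = "models/gemini-2.0-flash"


def resolve_gemini_model_name(db_value=None, environ=None):
    """Pick the Gemini model name without requiring a code change.

    Precedence: explicit database value, then the (hypothetical)
    GEMINI_MODEL_NAME environment variable, then the current default.
    """
    environ = os.environ if environ is None else environ
    return db_value or environ.get("GEMINI_MODEL_NAME") or DEFAULT_GEMINI_MODEL
```

The call in the diff would then become `genai.GenerativeModel(resolve_gemini_model_name())`.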

Comment on lines +108 to +111
except:
    print('**ERROR url')
    print(url)
    url = None

Copilot AI Oct 30, 2025

Bare except clause catches all exceptions including SystemExit and KeyboardInterrupt. Use except Exception: instead and consider logging the actual exception for debugging.

Contributor

Replace print with logging and add a more descriptive error message.
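
Both suggestions on this block can be combined. The sketch below is a rough illustration rather than the code in markup_doc/sync_api.py: the surrounding logic is not visible in the diff, so the URL-building step is represented by a hypothetical callable `build_url`.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_url(build_url, raw):
    """Return the URL built from `raw`, or None if building it fails."""
    try:
        return build_url(raw)
    except Exception:
        # `except Exception` lets SystemExit and KeyboardInterrupt propagate;
        # logger.exception records the message together with the traceback.
        logger.exception("Could not resolve URL from %r", raw)
        return None
```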

Comment on lines +265 to +267
except Exception:
    # if there is no match, leave it as is
    pass

Copilot AI Oct 30, 2025

Silent exception handling without logging makes debugging difficult. Consider logging the exception to help diagnose image lookup failures.

});

document.addEventListener("DOMContentLoaded", function () {
const journalInput = document.querySelector("#id_journal");

Copilot AI Oct 30, 2025

Unused variable journalInput.

}
stream_data.append(obj.copy())

for i, auth in enumerate(output['authors']):

Copilot AI Oct 30, 2025

Nested for statement uses loop variable 'i' of enclosing for statement.
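
To illustrate why this warning matters, here is a small self-contained sketch. It is not the code from this PR, and both function names are hypothetical. When the inner loop reuses `i`, any read of `i` after the inner loop sees the inner loop's last value rather than the outer index.

```python
def outer_index_shadowed(groups):
    """Record the 'outer' index after each inner loop, with a reused name."""
    seen = []
    for i, group in enumerate(groups):
        for i, item in enumerate(group):  # rebinds the outer i
            pass
        seen.append(i)  # now holds the inner loop's last index
    return seen


def outer_index_distinct(groups):
    """Same traversal with distinct loop variable names."""
    seen = []
    for group_idx, group in enumerate(groups):
        for item_idx, item in enumerate(group):
            pass
        seen.append(group_idx)
    return seen
```

For `[["a", "b", "c"], ["d", "e", "f"]]` the first function returns `[2, 2]` and the second returns `[0, 1]`.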

}
stream_data.append(obj.copy())

for i, aff in enumerate(output['affiliations']):

Copilot AI Oct 30, 2025

Nested for statement uses loop variable 'i' of enclosing for statement.

else:
break

for i, val in enumerate(vals[1:], start=1):

Copilot AI Oct 30, 2025

Nested for statement uses loop variable 'i' of enclosing for statement.

and b.value.get('label') == '<kwd-group>'
]

for i, val in enumerate(vals):

Copilot AI Oct 30, 2025

Nested for statement uses loop variable 'i' of enclosing for statement.

)

# HTTP response
with open(zip_path, "rb") as fp:

Copilot AI Oct 30, 2025

File may not be closed if an exception is raised.
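
The quoted line already opens the file in a `with` block, which closes it even if the body raises, so the warning may refer to code outside this excerpt. As a sketch under that assumption, one pattern that keeps the file handle's lifetime short is to read the payload fully inside the `with` block and build the response afterwards. The helper name `read_payload` is hypothetical.

```python
def read_payload(path):
    """Read a file fully inside a `with` block and return its bytes.

    The handle is closed when the block exits, whether normally or
    through an exception, so nothing downstream can leak it.
    """
    with open(path, "rb") as fp:
        data = fp.read()
    return data
```

In a Django view the returned bytes could then be passed to `HttpResponse(data, content_type="application/zip")`. For large archives, streaming may be preferable to reading everything into memory.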

- Adds scielo_xml_tools.yml with new volume paths
- Moves volumes to the ../markup_data/ structure
- Fixes container names in the Makefile (markapi_local_*)
- Adds .ipython/ to .dockerignore
- Adds huggingface-hub to requirements/local.txt
- Updates .gitignore to ignore backups and temporary files
4 participants