The format and structure of your data matter for using LLMs effectively. Although most models are multimodal and can read different file types, their performance depends heavily on their training data. Sometimes this dependency leads to incorrect analyses.
To address this challenge, consider converting your documents into the most widely understood format: Markdown (.md).
Why Markdown?
Because many models can interpret this format and generate responses in it, which makes it a common ground for their understanding.
Microsoft supports this view with a tool called MarkItDown, which converts documents to Markdown to improve their compatibility with LLMs and the quality of the models' results.
Installation and Usage
You can install MarkItDown using pip:
pip install 'markitdown[all]'
Once installed, the conversion is easy. For example:
markitdown some.pdf > example.md
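The same conversion can also be done from Python, which is handy when you have more than one file. The sketch below assumes the package's documented `MarkItDown().convert(...)` call and its `text_content` result attribute; the helper names `markdown_path_for` and `convert_to_markdown` are my own and are not part of the library.

```python
from pathlib import Path


def markdown_path_for(source: str) -> Path:
    """Return the .md path that sits next to the source file (hypothetical helper)."""
    return Path(source).with_suffix(".md")


def convert_to_markdown(source: str) -> Path:
    """Convert one document to Markdown and write it beside the original."""
    # Imported here so the path helper above works even without the package installed.
    from markitdown import MarkItDown

    # convert() returns a result object; the Markdown text is in text_content.
    result = MarkItDown().convert(source)
    target = markdown_path_for(source)
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

Calling `convert_to_markdown("some.pdf")` should write `some.md` in the same folder, much like the shell command above writes `example.md`.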
Conclusion
After a few hours of testing, I found that .docx, .pptx, .xlsx, and .xls files convert perfectly. PDFs, however, do not always convert smoothly: the output sometimes contains unfamiliar symbols, although the general structure and content are preserved.
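If the stray symbols in a converted PDF get in the way, a small cleanup pass may help. This is only a sketch under my own assumption about what counts as an "unfamiliar symbol": control characters, private-use code points, and the Unicode replacement character. The function name `strip_stray_symbols` is hypothetical, and your PDFs may need a different rule.

```python
import unicodedata


def strip_stray_symbols(text: str) -> str:
    """Drop characters that are usually conversion debris and keep everything else."""
    kept = []
    for ch in text:
        if ch in "\n\r\t":
            kept.append(ch)  # keep line breaks and tabs so the layout survives
            continue
        category = unicodedata.category(ch)
        # Assumption: Cc (control) and Co (private use) are debris, as is U+FFFD.
        if category in {"Cc", "Co"} or ch == "\ufffd":
            continue
        kept.append(ch)
    return "".join(kept)
```

Ordinary non-ASCII text such as accented letters or CJK characters passes through unchanged, so the filter should be safe for documents that are not in English.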
What about the quality of AI answers?
I can say with certainty that this approach is most effective for small models. It may become a common way to improve and optimize the results of models of this kind.