Evaluating GPT-4.5 for Financial Reasoning and Greek Financial Tasks
Introduction
GPT-4.5 is the latest upgrade in OpenAI's GPT family, promising improved reasoning and better general performance across domains. To understand how well this general-purpose powerhouse handles domain-specific financial challenges, we evaluated GPT-4.5 on two of TheFinAI’s open leaderboards:
Financial Reasoning: A benchmark for complex numerical and textual reasoning in finance, with tasks like table-based QA, document analysis, and financial math problems (a simplified example follows this list).
Greek Financial Tasks: A comprehensive benchmark for financial NLP in Greek, spanning entity recognition, question answering, summarization, and classification.
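To make the first benchmark concrete, here is a toy FinQA-style item: a small financial table plus a question whose answer requires a couple of arithmetic steps over the table. The field names, figures, and answer format below are illustrative only and are not taken from the actual dataset.

```python
# Hypothetical, simplified FinQA-style item: answering requires arithmetic
# over table values rather than just retrieving a single cell.
example_item = {
    "table": {
        "revenue ($M)": {"2022": 1250.0, "2023": 1410.0},
    },
    "question": "What was the year-over-year revenue growth in 2023?",
}

revenue = example_item["table"]["revenue ($M)"]
growth = (revenue["2023"] - revenue["2022"]) / revenue["2022"]
print(f"{growth:.2%}")  # 12.80% -- the kind of multi-step numeric answer FinQA expects
```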
The results offer valuable insights into GPT-4.5’s strengths and limitations when applied to the nuanced world of finance, and particularly how it fares against domain-specific models.
Financial Reasoning: Solid QA, but Gaps in Long Documents and Financial Math
Analysis
Solid Financial QA Performance: GPT-4.5 scores 68.94% on FinQA, showing that it handles fact retrieval and reasoning in financial contexts reasonably well. It outperforms the smaller GPT variants (GPT-o1, GPT-o3-mini) but still trails DeepSeek-V3 (73.2%). This points to competence in general-purpose financial QA, but not best-in-class precision.
Struggles with Long-Form Document Reasoning: On DM-Simplong, GPT-4.5 scores 59%, merely matching GPT-o3-mini (59%) and only slightly ahead of GPT-o1 (56%). This reflects difficulty handling long, complex financial documents, especially where domain-specific document structures (like regulatory filings) are involved.
Limited Financial Math Capability: In XBRL-Math, GPT-4.5 reaches 74.44%, which is significantly behind DeepSeek-R1 (86.67%) and Qwen2.5-72B-Instruct-Math (83.33%). This suggests GPT-4.5’s general numerical reasoning is solid, but it lacks awareness of financial-specific math patterns (e.g., handling accounting calculations, structured numeric formats in XBRL reports).
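The "structured numeric formats" point is easier to see with an example. XBRL facts carry scaling metadata, so the figure printed in a filing has to be multiplied by a power of ten before it can enter a calculation, and silently mixing millions with billions wrecks the arithmetic. Below is a minimal sketch of that bookkeeping; the fact model, concept names, and numbers are simplified assumptions, not the benchmark's actual format.

```python
from dataclasses import dataclass

# Simplified stand-in for an inline-XBRL-style numeric fact: the reported
# figure must be multiplied by 10**scale to recover the actual value.
# Concept names and numbers are invented for illustration.
@dataclass
class XbrlFact:
    concept: str     # e.g. a revenue or cost concept (illustrative label)
    reported: float  # value as printed in the report
    scale: int       # power-of-ten exponent applied to the printed value
    unit: str

    def actual_value(self) -> float:
        return self.reported * (10 ** self.scale)

revenue = XbrlFact(concept="Revenues", reported=4510.0, scale=6, unit="USD")  # millions
cost = XbrlFact(concept="CostOfRevenue", reported=2.71, scale=9, unit="USD")  # billions

# The margin only comes out right if both facts are rescaled first;
# mixing millions with billions is exactly the kind of slip XBRL-Math penalizes.
margin = (revenue.actual_value() - cost.actual_value()) / revenue.actual_value()
print(f"{margin:.1%}")  # ~39.9%
```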
What This Tells Us
👉 GPT-4.5’s general reasoning is strong, but it struggles with the specialized reasoning and formatting rules found in financial documents — particularly for long-form documents and structured numeric data.
Greek Financial Tasks: Strong QA, But Outperformed by Specialized Models
Analysis
Exceptional QA Performance: GPT-4.5 achieves the highest score on Greek financial question answering (74.67%), indicating excellent language understanding and fact retrieval abilities.
Struggles with Greek Numerical Entities: GPT-4.5 falls short on FinNum (21.76%), which focuses on recognizing and interpreting numerical values in Greek financial text, a domain task where localized formatting patterns matter more than general reasoning (see the sketch after this list).
Mediocre Summarization & Classification: GPT-4.5 also underperforms on multi-class classification (MultiFin), and its summaries of Greek financial text are similarly middling, indicating that broad multilingual training does not always transfer well to Greek financial terminology.
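Part of what makes the FinNum result unsurprising is locale. Greek financial text writes 1.234,56 where English writes 1,234.56, and routinely scales amounts with abbreviations such as εκατ. (millions) and δισ. (billions). The sketch below shows the kind of normalization involved; the abbreviation list and parsing rules are assumptions for illustration, not the benchmark's actual preprocessing.

```python
import re

# Common Greek scale abbreviations (an assumed set, not from the benchmark):
# εκατ. = εκατομμύρια (millions), δισ. = δισεκατομμύρια (billions)
GREEK_SCALES = {"εκατ.": 1_000_000, "δισ.": 1_000_000_000}

def parse_greek_amount(text: str) -> float:
    """Parse an amount like '1.234,56 εκατ.' into a plain float."""
    scale = 1
    for abbrev, factor in GREEK_SCALES.items():
        if abbrev in text:
            scale = factor
            text = text.replace(abbrev, "")
    # Greek convention: '.' groups thousands, ',' marks decimals.
    digits = re.sub(r"[^\d.,-]", "", text)
    return float(digits.replace(".", "").replace(",", ".")) * scale

print(parse_greek_amount("1.234,56"))       # 1234.56
print(parse_greek_amount("3,2 δισ. ευρώ"))  # 3200000000.0
```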
What This Tells Us
👉 GPT-4.5’s general understanding and factual reasoning are top-tier, but it lacks the fine-grained pattern recognition and financial format awareness needed for high accuracy in Greek financial documents.
👉 Specialized models like plutus-8B, trained specifically on Greek financial texts, retain a substantial edge on numeric entity recognition and overall average performance.