Financial Benchmark for LLMs
FinBen
FinBen is a comprehensive benchmark developed collaboratively at The Fin AI to evaluate the performance of large language models in financial applications. It provides a rigorous framework for assessing LLMs in areas such as information extraction, risk management, forecasting, and decision-making.
It is designed to test how well financial LLMs perform in real-world scenarios, evaluating models across key financial tasks to ensure they are accurate, reliable, and useful for decision-making.
To keep the evaluation fair and meaningful, FinBen uses diverse datasets and tasks, preventing models from being optimized just for the test. This approach helps developers refine their models while giving financial professionals a clear understanding of AI performance.
Each model is assessed and rated based on its strengths and weaknesses, providing actionable insights to build better and more trustworthy financial AI systems.
Benchmark Scope
FinBen provides a comprehensive evaluation framework for financial AI, expanding task coverage, dataset diversity, and assessment strategies. It includes:
24 financial tasks and 42 datasets, significantly increasing the scope of financial AI evaluation.
Coverage of 8 financial domains, including the first benchmark for stock trading evaluation.
Novel evaluation approaches incorporating agent-based and Retrieval-Augmented Generation (RAG) assessments.
2 new open-source datasets focused on financial QA and stock trading tasks.
The first shared task on financial LLMs, hosted at IJCAI 2024, and 3 shared tasks at COLING 2025.
FinBen Living Leaderboard
The FinBen Living Leaderboard continuously tracks and updates performance results across 24 financial tasks and 42 datasets, providing real-time, transparent evaluation of financial AI models.
Supported Tasks
FinBen evaluates financial language models across a diverse set of categories that reflect the complex needs of the finance industry. Each category targets specific capabilities, ensuring a comprehensive assessment of model performance in tasks directly relevant to finance.
Information Extraction
The financial sector often requires structured insights from unstructured documents such as regulatory filings, contracts, and earnings reports. Information extraction tasks include Named Entity Recognition (NER), Relation Extraction, and Causal Classification. These tasks evaluate a model’s ability to identify key financial entities, relationships, and events, which are crucial for downstream applications such as fraud detection or investment strategy.
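To make this concrete, below is a minimal sketch of how entity-level scoring for an NER task might be computed. The entity spans, offsets, and labels are hypothetical illustrations, not FinBen data, and the actual scorer may differ.

```python
# Minimal sketch of entity-level F1 scoring for a financial NER task.
# The entity tuples below are invented examples, not FinBen data.

def entity_f1(gold: set, predicted: set) -> float:
    """Micro F1 over (start, end, label) entity tuples."""
    tp = len(gold & predicted)                      # exact span + label matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 9, "ORG"), (20, 25, "MONEY")}           # e.g. "Acme Corp", "$4.2M"
pred = {(0, 9, "ORG"), (30, 35, "DATE")}
print(f"Entity F1: {entity_f1(gold, pred):.2f}")    # 0.50
```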
Textual Analysis
Financial markets are driven by sentiment, opinions, and the interpretation of financial news and reports. Textual analysis tasks such as Sentiment Analysis, News Classification, and Hawkish-Dovish Classification help assess how well a model can interpret market sentiment and textual data, aiding in tasks like investor sentiment analysis and policy interpretation.
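As an illustration, a sentiment prediction run might be scored with standard classification metrics. The labels below are made up, and FinBen's exact metric choices may vary per dataset.

```python
# Hedged sketch: scoring three-way financial sentiment predictions
# with standard classification metrics. Labels are illustrative.
from sklearn.metrics import accuracy_score, f1_score

gold = ["positive", "negative", "neutral", "negative"]
pred = ["positive", "neutral", "neutral", "negative"]

print("Accuracy:", accuracy_score(gold, pred))             # 0.75
print("Macro F1:", f1_score(gold, pred, average="macro"))  # balances all classes
```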
Question Answering
This category addresses the ability of models to interpret complex financial queries, particularly those that involve numerical reasoning or domain-specific knowledge. The QA tasks, such as those derived from datasets like FinQA and TATQA, evaluate a model’s capability to respond to detailed financial questions, which is critical in areas like risk analysis or financial advisory services.
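Numerical answers typically need tolerant matching rather than exact string comparison. The sketch below shows one plausible normalization scheme; it is an assumption for illustration, not FinBen's actual scorer.

```python
# Sketch of a tolerant numeric match, as often used for FinQA-style answers.
# The normalization rules here are assumptions, not FinBen's exact scorer.

def numeric_match(pred: str, gold: str, rel_tol: float = 1e-2) -> bool:
    """Compare answers as numbers, stripping common financial formatting."""
    def to_float(s: str) -> float:
        return float(s.replace("$", "").replace(",", "").replace("%", "").strip())
    try:
        p, g = to_float(pred), to_float(gold)
    except ValueError:
        # Not parseable as numbers: fall back to a plain string comparison.
        return pred.strip().lower() == gold.strip().lower()
    return abs(p - g) <= rel_tol * max(abs(g), 1e-9)

print(numeric_match("$1,234.5", "1234.50"))  # True
print(numeric_match("12.3%", "12.4"))        # True within 1% relative tolerance
```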
Text Generation
Summarization of complex financial reports and documents is essential for decision-making. Tasks like ECTSum and EDTSum test models on their ability to generate concise and coherent summaries from lengthy financial texts, which is valuable in generating reports or analyst briefings.
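Summarization output of this kind is commonly scored with ROUGE. Here is a minimal sketch using the open-source rouge-score package; the reference and candidate summaries are invented.

```python
# Sketch of ROUGE-based scoring for summarization tasks such as ECTSum/EDTSum,
# using the open-source rouge-score package (pip install rouge-score).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "Q3 revenue rose 8% year over year, driven by asset management fees."
candidate = "Revenue grew 8% in Q3 on higher asset management fees."

scores = scorer.score(reference, candidate)  # score(target, prediction)
for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```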
Forecasting
One of the most critical applications in finance is the ability to forecast market movements. Tasks under this category evaluate a model's ability to predict stock price movements or market trends based on historical data, news, and sentiment. These capabilities are central to applications such as portfolio management and trading strategies.
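Stock movement prediction is reported on accuracy and the Matthews correlation coefficient (MCC), so a minimal scoring sketch (with made-up up/down labels) might look like this:

```python
# Sketch of the two metrics reported for stock movement prediction:
# accuracy and the Matthews correlation coefficient (MCC).
# Labels are illustrative (1 = price up, 0 = price down).
from sklearn.metrics import accuracy_score, matthews_corrcoef

gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy:", accuracy_score(gold, pred))     # 0.75
print("MCC:     ", matthews_corrcoef(gold, pred))  # robust to class imbalance
```

MCC is often preferred over raw accuracy here because up/down labels can be imbalanced, and a model that always predicts the majority class scores near zero on MCC.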
Risk Management
This category focuses on tasks that evaluate a model’s ability to predict and assess financial risks, such as Credit Scoring, Fraud Detection, and Financial Distress Identification. These tasks are fundamental for credit evaluation, risk management, and compliance purposes.
Decision-Making
In finance, making informed decisions based on multiple inputs (e.g., market data, sentiment, and historical trends) is crucial. Decision-making tasks simulate complex financial decisions, such as Mergers & Acquisitions and Stock Trading, testing the model’s ability to handle multimodal inputs and offer actionable insights.
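One plausible way to score a trading agent is by cumulative return and Sharpe ratio over its position decisions. The price series and signals below are invented, and FinBen's exact trading protocol may differ.

```python
# Hedged sketch of scoring an agent's trading decisions with cumulative
# return and an annualized Sharpe ratio. The data here is made up.
import numpy as np

prices = np.array([100.0, 101.5, 100.8, 102.2, 103.0, 101.9])
signals = np.array([1, 1, 0, 1, 0])           # position per day: 1 = long, 0 = flat

daily_returns = np.diff(prices) / prices[:-1]  # simple daily returns
strategy_returns = signals * daily_returns     # earn the return only when holding

cumulative = np.prod(1 + strategy_returns) - 1
sharpe = np.sqrt(252) * strategy_returns.mean() / strategy_returns.std()

print(f"Cumulative return: {cumulative:.2%}")
print(f"Annualized Sharpe: {sharpe:.2f}")
```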
Spanish
This category evaluates models on Spanish-language financial tasks, ensuring accuracy and robustness in diverse financial contexts. It assesses models' multilingual capabilities, particularly in low-resource financial language settings.
Current Best Models and Surprising Results
Throughout the evaluation process on the Open FinLLM Leaderboard, several models have demonstrated exceptional capabilities across various financial tasks.
Best Models: GPT-4 and Llama 3.1 have consistently outperformed other models across many tasks, showing high accuracy and robustness in interpreting financial sentiment.
Surprising Results: The Forecasting (FO) task, focused on stock movement prediction, showed that smaller models, such as Llama-3.1-8b and internlm-7b, often outperformed larger models such as Llama-3.1-70b in terms of accuracy and MCC. This suggests that model size does not necessarily correlate with better performance in financial forecasting, especially in tasks where real-time market data and nuanced sentiment analysis are critical. These results highlight the importance of evaluating models on task-specific performance rather than relying solely on size or general-purpose benchmarks.
Partnerships
We would like to thank our partners, including The Linux Foundation and NVAITC, for their generous support in making FinBen possible.
Benchmark Working Group Contributors
The University of Manchester
University of Florida
Columbia University
Stevens Institute of Technology
Harvard University
Rensselaer Polytechnic Institute
Gustavus Adolphus College
The National Centre for Text Mining, UK
Archimedes RC, Greece
Join Us
The Fin AI Benchmark Working Group brings together researchers, engineers, and financial experts to create fair and transparent evaluations for financial AI. We welcome others to join us in advancing financial AI.