Call for Proposals: Alliance for Language Technologies

Deadline: 26 September 2024

The European Commission (EC) is seeking proposals to preserve linguistic and cultural diversity in Europe while effectively implementing the objectives of the European Common Data Infrastructure and Service MCP in the area of language technologies.

Objectives

The advent of generative AI, exemplified by models like ChatGPT, represents a transformative moment in technological innovation. Yet, most of the advances come from regions outside Europe and do not cover all languages equally.
Building an ecosystem around large language and AI models in Europe will provide greater autonomy in the use and sharing of European data and will reduce the EU’s dependence on technologies from outside Europe.
By federating Member States’ efforts, this action will directly contribute to preserving linguistic and cultural diversity in Europe while effectively implementing the objectives of the European Common Data Infrastructure and Service MCP in the area of language technologies. By providing the necessary data and model adaptation capacities, the action will have a strong impact on the deployment of large language foundation models and their applications, such as generative AI. This federated effort will be organised around two work strands.
This call addresses the first work strand, which will support language data collection and the adaptation of existing large language foundation models to specific languages, domains or industries, so as to support the uptake of the latest language technologies by European actors.

Funding Information

The estimated available call budget is EUR 20,000,000.

Scope

Data:
- Leveraging the Common European Language Data Space and other relevant Data Spaces, this activity will gather the necessary language data (text, audio, image and other modalities) from a broad array of European industrial, academic and institutional actors, in compliance with the applicable legislation (e.g. the Copyright Directive (EU) 2019/790 and the GDPR, Regulation (EU) 2016/679). It will provide data of sufficient quality and quantity to build large language foundation models, ensuring coherent coverage of all the official languages of the Member States as well as the most socially and economically relevant ones. This will also include providing the data required to adapt such large language foundation models to specific languages, domains or industries. The action will also provide a repository of existing European large language foundation models as well as models adapted to specific languages, domains or industries. Once sufficiently advanced, the consortium may consider working on a future copyright infrastructure and related issues to allow efficient use of language and other data, while taking into account the interests of the rights holders.
Fine-tuning:
- This activity will also provide large language models fine-tuned to specific languages, domains or industries by further training large language foundation models on specific language data. This process involves adapting, evaluating and optimizing foundation models for specific languages, domains or industries. It will facilitate the efficient deployment of these models across various industries and requires less task-specific data than building models from scratch, which is particularly advantageous for SMEs. The action will also include support for the ongoing maintenance and enhancement of these models, ensuring their adaptability to evolving tasks and domains over time.
- In addition, this activity will provide dedicated support and services, including through Financial Support to Third Parties and in particular for SMEs, to facilitate the fine-tuning of available models. The support and services will provide third parties with an infrastructure to fine-tune and evaluate existing models for their own purposes.
The EuroHPC Joint Undertaking would provide access to its facilities for the adaptation and fine-tuning of the models when necessary.

Outcomes and Deliverables

Increased accessibility to language data for the development and adaptation of large language foundation models, in consideration of issues linked to data privacy and security, as well as potential risks of disinformation.
A repository of families of existing large language foundation models for public and industrial reuse in the EU.
A repository of families of large language models fine-tuned to specific languages, domains or industries.
Infrastructure and services for model fine-tuning.
KPIs to measure outcomes and deliverables:
- Number of language datasets newly made available within the project, including metrics on diversity and coverage of languages, domains, and industries in the datasets.
- Number of downloads and satisfaction levels from users utilizing the datasets.
- Number of large language foundation models newly made available within the project, including metrics on diversity and coverage of languages, domains, and industries covered by the models.
- Number of downloads and satisfaction levels from users utilizing the large language foundation models.

Eligibility Criteria

Applications will only be considered eligible if their content corresponds wholly (or at least in part) to the topic description for which they are submitted.
Eligible participants (eligible countries)
In order to be eligible, the applicants (beneficiaries and affiliated entities) must:
- Be legal entities (public or private bodies)
- Be established in one of the eligible countries, i.e.:
  - EU Member States (including overseas countries and territories (OCTs))
  - Non-EU countries:
    - EEA countries (Norway, Iceland and Liechtenstein)
    - Other countries associated to the Digital Europe Programme (list of participating countries)
Beneficiaries and affiliated entities must register in the Participant Register before submitting the proposal and will have to be validated by the Central Validation Service (REA Validation). For the validation, they will be requested to upload documents showing legal status and origin.