HumSet

HumSet is a novel, rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. It was curated by humanitarian analysts and covers various disasters around the globe that occurred between 2018 and 2021 across 46 humanitarian response projects. The dataset consists of approximately 17K annotated documents in three languages (English, French, and Spanish), originally taken from publicly available resources. For each document, analysts identified informative snippets (entries) with respect to common humanitarian frameworks and assigned one or more classes to each entry. See our paper for details.

Paper: HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response

@inproceedings{fekih-etal-2022-humset,
    title = "{H}um{S}et: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crises Response",
    author = "Fekih, Selim  and
      Tamagnone, Nicolo{'}  and
      Minixhofer, Benjamin  and
      Shrestha, Ranjan  and
      Contla, Ximena  and
      Oglethorpe, Ewan  and
      Rekabsaz, Navid",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.findings-emnlp.321",
    pages = "4379--4389",
}



Dataset

The main dataset is shared in CSV format (humset_data.csv), where each row is an entry with the following features:

entry_id, lead_id, project_id, sectors, pillars_1d, pillars_2d, subpillars_1d, subpillars_2d, lang, n_tokens, project_title, created_at, document, excerpt
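As a minimal sketch of working with this table, the snippet below builds a tiny invented sample with a few of the columns listed above (the row values are illustrative, not taken from HumSet) and groups entries by their source document; the real file would be loaded the same way, e.g. with `csv.DictReader(open("humset_data.csv"))` or pandas.

```python
import csv
import io

# Hypothetical two-row sample mirroring part of the schema above;
# the values are invented for illustration only.
sample = io.StringIO(
    "entry_id,lead_id,lang,excerpt\n"
    "1,10,en,Flooding displaced thousands in the region.\n"
    "2,10,fr,Des milliers de personnes ont besoin d'aide.\n"
)

entries = list(csv.DictReader(sample))

# Group entry ids by their source document (lead), since several
# entries are typically extracted from the same lead.
by_lead = {}
for e in entries:
    by_lead.setdefault(e["lead_id"], []).append(e["entry_id"])

print(by_lead)  # {'10': ['1', '2']}
```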


In addition to the main dataset, the full texts of the documents (leads) are also provided (documents.tar.gz). Each text source is represented as a JSON-formatted file ({lead_id}.json) with the following structure:

[
  [
    "paragraph 1 - page 1",
    "paragraph 2 - page 1",
    ...
    "paragraph N - page 1"
  ],
  [
    "paragraph 1 - page 2",
    "paragraph 2 - page 2",
    ...
    "paragraph N - page 2"
  ],
  [
    ...
  ],
  ...
]

Each document is a list of lists of strings: each inner list is one page, and each string is one paragraph of that page. This format was chosen because, as indicated in the paper, over 70% of the sources are PDFs, so the original page and paragraph subdivision is preserved. HTML web pages are reported as if they were single-page documents.
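The page structure described above can be flattened back into a single sequence of paragraphs when page boundaries are not needed. A short sketch, using invented stand-in content rather than a real {lead_id}.json file:

```python
import json

# Invented stand-in for the contents of a {lead_id}.json file:
# a list of pages, each page a list of paragraph strings.
raw = '[["Intro paragraph.", "Second paragraph."], ["Page two text."]]'
pages = json.loads(raw)

# Flatten the pages into one ordered list of paragraphs,
# then join them into a single text blob.
paragraphs = [p for page in pages for p in page]
full_text = "\n".join(paragraphs)

print(len(pages), len(paragraphs))  # 2 3
```

For a real file, `json.load(open(f"{lead_id}.json"))` would replace the inline string.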

Additionally, a train/validation/test split of the dataset is shared. The repository contains the code used to process the full dataset, but that code has some random components, so re-running it would produce a slightly different result.
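Because of that randomness, anyone re-creating a split should fix a random seed so the result is reproducible. A minimal sketch of a seeded split (the 80/10/10 ratio and the helper name `split_ids` are illustrative assumptions, not the procedure used for the released splits):

```python
import random

def split_ids(ids, seed=42, val_frac=0.1, test_frac=0.1):
    """Shuffle ids with a fixed seed and cut them into train/val/test."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded -> reproducible order
    n_test = int(len(ids) * test_frac)
    n_val = int(len(ids) * val_frac)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

train, val, test = split_ids(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```

Splitting by lead_id rather than entry_id would additionally keep all entries of one document in the same partition.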

Request access

To gain access to HumSet, please contact us at nlp@thedeep.io. HumSet and HumBert (a humanitarian-focused language model) are available on our HuggingFace profile.

Contact

For any technical questions, please contact Selim Fekih or Nicolò Tamagnone.

Terms and conditions

For a detailed description of the terms and conditions, refer to the DEEP Terms of Use and Privacy Notice.


