Radboud Repository

      Modelling word learning and recognition using visually grounded speech

      Creators
      Merkx, D.G.M.
      Frank, S.L.
      Scharenborg, O.E.
      Ernestus, M.T.C.
      Scholten, S.
      Date of Archiving
      2022
      Archive
      DANS EASY
      DOI
      https://doi.org/10.17026/dans-22n-xh47
      Related publications
      • Modelling word discovery and recognition using visually grounded speech
      • Semantic sentence similarity. Size does not always matter
      • Language learning using Speech to Image retrieval
      Publication type
      Dataset
      Please use this identifier to cite or link to this item: https://hdl.handle.net/2066/250308
      Organization
      Nederlandse Taalkunde
      Taalwetenschap
      Toegepaste Taalwetenschap
      unknown/not applicable
      Audience(s)
      Computer science
      Languages used
      English
      Key words
      Speech2image; multi-modal word learning; speech recognition
      Abstract
      A set of recorded isolated nouns and verbs, together with image annotations, used for testing the word recognition performance of our speech2image model. We trained a word recognition model on a set of images and utterances; the model should learn to recognise words without ever having seen written transcripts. Word recognition performance is measured as the number of retrieved images out of 10 that display the correct visual referent. To test word recognition performance, we took the 50 most common nouns and the 50 most common verbs in the training data and confirmed that at least 10 images in our test image data displayed the corresponding objects and actions. These nouns and verbs were recorded in singular and plural form (nouns) and in root, third person and progressive form (verbs). For this purpose we furthermore collected new ground truth object and action annotations for 1000 images from the Flickr8k test set, each annotated for the presence of the 50 actions and objects corresponding to the test verbs and nouns. These annotations are included in CSV format.
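      The abstract describes a retrieval-based evaluation: for each recorded noun or verb, the speech2image model ranks the test images, and the word recognition score is the number of the 10 highest-ranked images that display the correct visual referent. The sketch below illustrates that scoring step, assuming a hypothetical CSV layout with one row per image and a 0/1 column per test word; the file name, column names and value encoding are illustrative and may not match the dataset's actual files.

      ```python
      import csv
      from collections import defaultdict

      # Hypothetical file and column names; check the dataset's CSV header for the real layout.
      ANNOTATION_FILE = "flickr8k_test_annotations.csv"  # 1000 test images x 50 nouns/verbs

      def load_annotations(path):
          """Map each test word to the set of image ids annotated as showing its referent."""
          referents = defaultdict(set)
          with open(path, newline="", encoding="utf-8") as f:
              for row in csv.DictReader(f):
                  image_id = row["image"]           # assumed id column
                  for word, present in row.items():
                      if word != "image" and present == "1":   # assumed 0/1 presence encoding
                          referents[word].add(image_id)
          return referents

      def recognition_at_10(word, ranked_images, referents):
          """Word recognition score: how many of the model's top-10 retrieved images
          display the spoken word's visual referent."""
          return sum(1 for img in ranked_images[:10] if img in referents[word])
      ```

      Usage would look like `recognition_at_10("dog", model_ranking, load_annotations(ANNOTATION_FILE))`, where `model_ranking` is the image ranking produced by a speech2image model for one recorded word form.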
      This item appears in the following Collection(s)
      • Datasets [1281]
      • Faculty of Arts [23967]
       