|Description (include details on usage, files and paper references)||The Text and Vision (TVGraz) dataset is an annotated multi-modal dataset that currently contains 10 visual object categories and 4030 images with associated text. The visual appearance of the objects is challenging, so the dataset offers a less biased benchmark. The objective of the dataset is to provide a common means of evaluating object categorization research based on text and vision.
The archive "TVGraz_script.tar.gz" contains a Python script named "download_TVGRAZ_dataset.py". When executed, the script downloads the TVGraz dataset images and text from their respective URLs, as specified by the "category_list.txt" file. After downloading, the textual data is stored in raw format, per category and per image.
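The download flow described above can be sketched roughly as follows. This is not the contents of "download_TVGRAZ_dataset.py": the layout of "category_list.txt" (assumed here to hold one category name per line), the output folder structure, the file extensions, and the helper names `read_categories`, `target_path` and `fetch` are all hypothetical and chosen only for illustration.

```python
import os
import urllib.request


def read_categories(path):
    """Return the category names listed in a category file.

    Assumption: one category name per line; blank lines and lines
    starting with '#' are skipped. The real file layout may differ.
    """
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def target_path(root, category, image_id, kind):
    """Build an output path for one item, grouped per category per image.

    kind is 'image' or 'text'; the folder layout and the extensions
    are hypothetical.
    """
    extension = {"image": ".jpg", "text": ".txt"}[kind]
    return os.path.join(root, category, kind, f"{image_id}{extension}")


def fetch(url, destination):
    """Download one URL to destination, creating parent folders first."""
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    urllib.request.urlretrieve(url, destination)
    return destination
```

A caller would loop over `read_categories("category_list.txt")`, look up the image and text URLs for each category, and pass each URL to `fetch` together with the path returned by `target_path`. Keeping the raw text next to its image under the same category folder mirrors the "per category per image" organisation described above.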
Download: TVGraz dataset capturing tool
TVGraz: Multi-Modal Learning of Object Categories by Combining Textual and Visual Features
Inayatullah Khan, Amir Saffari, and Horst Bischof
In Proc. Workshop of the Austrian Association for Pattern Recognition, 2009