TY - GEN
T1 - Spatiotemporal pseudo relevance feedback for large-scale and heterogeneous scientific repositories
AU - Takeuchi, Shin'Ichi
AU - Akahoshi, Yuhei
AU - Ong, Bun Theang
AU - Sugiura, Komei
AU - Zettsu, Koji
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/9/22
Y1 - 2014/9/22
N2 - As larger and larger amounts of data are harvested, finding just the right piece of information out of this noisy and heterogeneous ocean of data remains challenging. Many widely adopted scientific data search engines continue to be mainly based on text semantics. However, it is not uncommon in scientific big data applications to face collected data that do not possess text information. In this scenario, search engines fail to retrieve potentially relevant data. For instance, even though Pangaea, a digital data library and a publisher for earth system science, contains more than 400,000 datasets, more than 98% lack sufficient text information. In this work, we propose a novel pseudo relevance feedback method based on spatiotemporal and text (STT) information for scientific big data: STT-PRF. Although STT-PRF may simultaneously use STT information, we show that missing values in space, time, and/or text are handled efficiently. STT-PRF is especially robust even without text information. We tested our STT-PRF method using the Pangaea repository on our Cross-DB Search Platform, which is a search engine for scientific big data based on various latent correlations. Experimental evaluations on such standard metrics as nDCG and Precision/Recall show that STT-PRF outperforms the standard baseline methods.
AB - As larger and larger amounts of data are harvested, finding just the right piece of information out of this noisy and heterogeneous ocean of data remains challenging. Many widely adopted scientific data search engines continue to be mainly based on text semantics. However, it is not uncommon in scientific big data applications to face collected data that do not possess text information. In this scenario, search engines fail to retrieve potentially relevant data. For instance, even though Pangaea, a digital data library and a publisher for earth system science, contains more than 400,000 datasets, more than 98% lack sufficient text information. In this work, we propose a novel pseudo relevance feedback method based on spatiotemporal and text (STT) information for scientific big data: STT-PRF. Although STT-PRF may simultaneously use STT information, we show that missing values in space, time, and/or text are handled efficiently. STT-PRF is especially robust even without text information. We tested our STT-PRF method using the Pangaea repository on our Cross-DB Search Platform, which is a search engine for scientific big data based on various latent correlations. Experimental evaluations on such standard metrics as nDCG and Precision/Recall show that STT-PRF outperforms the standard baseline methods.
KW - information retrieval
KW - pseudo relevance feedback
KW - query expansion
KW - scientific data
KW - spatiotemporal and text information
UR - http://www.scopus.com/inward/record.url?scp=84923852454&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84923852454&partnerID=8YFLogxK
U2 - 10.1109/BigData.Congress.2014.100
DO - 10.1109/BigData.Congress.2014.100
M3 - Conference contribution
AN - SCOPUS:84923852454
T3 - Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014
SP - 669
EP - 676
BT - Proceedings - 2014 IEEE International Congress on Big Data, BigData Congress 2014
A2 - Chen, Peter
A2 - Jain, Hemant
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 3rd IEEE International Congress on Big Data, BigData Congress 2014
Y2 - 27 June 2014 through 2 July 2014
ER -