In the big data era, the Internet has become indispensable for accessing academic and work-related information. However, because Internet resources are distributed in a decentralized way and lack in-depth description and correlation of their contents and relationships, people must spend a great deal of time looking through all the returned search results and assembling relevant information from different sources. This paper therefore aims to develop a metadata schema for fine-grained aggregate units of Internet resources that reveals in depth and correlates scattered information snippets of various kinds, so as to meet users' complex information needs, improve retrieval effectiveness, and support better knowledge services.
First, three types of free Internet resources in the field of Library and Information Science were collected: OA papers, online encyclopedias, and blogs. Then, a general framework for splitting these resources was developed manually from the perspectives of the logical structure and the formal structure of text. The logical structure analysis was divided into four levels: chapter level, which is the whole document; section level, based on the chapter titles given by authors; sentence group level, comprising macro analysis and micro analysis; and chart level. Macro analysis, based on genre theory, segmented the whole document into its components, and micro analysis then identified the information snippets that reveal rhetorical intentions and semantic functions. The relationships between aggregate units at different levels were analyzed. Moreover, the characteristics and attributes of aggregate units were described and classified into 14 access attribute elements, 3 physical attribute elements, and 2 semantic attribute elements. A metadata schema was developed corresponding to these categories. Lastly, to examine the effectiveness of the metadata schema, a database was designed and developed in Access 2013, and five search tasks covering the genre level, section level, sentence group level, and chart level were set up.
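The four-level hierarchy and the three attribute categories described above can be pictured as a small data structure. The following is only an illustrative sketch, not the schema developed in the paper: the class name `AggregateUnit`, the helper `ancestors`, and every attribute key used in the usage note are hypothetical, since this abstract reports only the number of elements in each category and not their names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class Level(Enum):
    """The four logical-structure levels named in the abstract."""
    CHAPTER = 1         # the whole document
    SECTION = 2         # based on titles given by the author
    SENTENCE_GROUP = 3  # found by macro and micro analysis
    CHART = 4


@dataclass
class AggregateUnit:
    """One fine-grained aggregate unit (hypothetical layout)."""
    unit_id: str
    level: Level
    text: str
    parent_id: Optional[str] = None  # link to the enclosing unit
    # The three attribute categories; the keys are assumptions.
    access: Dict[str, str] = field(default_factory=dict)
    physical: Dict[str, str] = field(default_factory=dict)
    semantic: Dict[str, str] = field(default_factory=dict)


def ancestors(unit: AggregateUnit,
              index: Dict[str, AggregateUnit]) -> List[AggregateUnit]:
    """Return the enclosing units, nearest first, up to the whole document."""
    chain: List[AggregateUnit] = []
    current = unit
    while current.parent_id is not None:
        current = index[current.parent_id]
        chain.append(current)
    return chain
```

For instance, a sentence group tagged with a hypothetical key such as `semantic={"semantic_function": "method"}` and stored with `parent_id` pointing at its section can be traced back through that section to the whole document by calling `ancestors`, which is one simple way to represent the relationships between units at different levels.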
The results show that logical structures, which reflect the author's intention, are similar to some extent across different types of Internet resources on the same topic. It is therefore feasible to apply the logical structures of journal papers to other Internet genres. The DC and LOM metadata frameworks can be reused in the metadata schema for fine-grained aggregate units of Internet resources, although some special characteristics still need to be revealed. More importantly, the search experiments indicate that the aggregated search database based on the metadata framework proposed in this paper is effective in revealing and correlating aggregate units scattered across various sources and at different granularities. Aggregated search supports information aggregation while maintaining the whole context of each piece of information. Users can therefore judge the relevance of search results more quickly and find the required content more effectively.
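How an aggregated search might return fine-grained units from several sources while keeping their context can be sketched with a small relational example. This is a minimal sketch under stated assumptions, not the database built in the study: SQLite stands in for Access 2013, and the table name `unit`, its columns, the functions `build_demo_db` and `search`, and all sample rows are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table of aggregate units (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE unit ("
        " unit_id TEXT PRIMARY KEY,"
        " parent_id TEXT,"
        " level TEXT NOT NULL,"
        " source_type TEXT NOT NULL,"
        " semantic_function TEXT,"
        " content TEXT NOT NULL)"
    )
    rows = [
        ("p1", None, "chapter", "OA paper", None, "Paper on metadata"),
        ("p1-s1", "p1", "section", "OA paper", None, "Methods"),
        ("p1-s1-g1", "p1-s1", "sentence group", "OA paper",
         "definition", "Metadata is data about data."),
        ("b1", None, "chapter", "blog", None, "Blog post on cataloguing"),
        ("b1-g1", "b1", "sentence group", "blog",
         "definition", "A schema defines metadata elements."),
        ("b1-g2", "b1", "sentence group", "blog",
         "example", "A sample record follows."),
    ]
    conn.executemany("INSERT INTO unit VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def search(conn: sqlite3.Connection, level: str, semantic_function: str):
    """Find matching units and attach the whole document each belongs to."""
    sql = """
    WITH RECURSIVE chain(hit_id, hit_text, node_id, parent_id) AS (
        -- start from every unit that matches the query
        SELECT unit_id, content, unit_id, parent_id
        FROM unit
        WHERE level = ? AND semantic_function = ?
        UNION ALL
        -- climb one level towards the enclosing document
        SELECT chain.hit_id, chain.hit_text, u.unit_id, u.parent_id
        FROM chain JOIN unit AS u ON u.unit_id = chain.parent_id
    )
    SELECT chain.hit_id, chain.hit_text, root.source_type, root.content
    FROM chain JOIN unit AS root ON root.unit_id = chain.node_id
    WHERE chain.parent_id IS NULL   -- keep only the top-level document
    ORDER BY chain.hit_id
    """
    return conn.execute(sql, (level, semantic_function)).fetchall()
```

Calling `search(build_demo_db(), "sentence group", "definition")` gathers the two definition snippets, one from the sample paper and one from the sample blog, each paired with the type and title of the document it came from, so a reader could judge relevance without opening either source in full.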
Through a preliminary study of a metadata schema for fine-grained aggregate units, this research is a useful attempt to apply linguistic theories and methods to the organization of Internet resources, and a significant step toward this emerging interdisciplinary research field.
Future research will improve the framework of fine-grained aggregate units and the metadata schema by analyzing other emerging Internet genres. Furthermore, the vocabulary and syntactic features of aggregate units need to be analyzed in order to implement fine-grained aggregated search intelligently and to construct knowledge repositories automatically. 7 figs. 6 tabs. 58 refs.