Named Entity Extraction from Semistructured Data Using Machine Learning Algorithms

Madina Mansurova1, Vladimir Barakhnin2,3, Yerzhan Khibatkhanuly1, and Ilya Pastushkov2,3
1 Al-Farabi Kazakh National University, Almaty 050040, Kazakhstan
mansurova.madina@gmail.com, x.erzhan@gmail.com
2 Institute of Computational Technologies SB RAS, Novosibirsk, Russian Federation
3 Novosibirsk State University, Novosibirsk, Russian Federation
bar@ict.nsc.ru, pas2shkov.ilya@gmail.com

Abstract. Over the last decades, the intensive development of Internet technologies has led to an information explosion, expressed in an exponential growth of data volume, much of it of low quality. This paper describes intelligent tools that support decision making through automatic knowledge extraction. In the first part of the paper, we consider a preprocessing stage that includes morphological analysis of texts. We then model text documents as a hypergraph and apply the random walk method to extract semantically close word pairs, that is, pairs that often appear together. The result of the calculations is a matrix of word affinity coefficients, one for each pair of components of the vocabulary vector. In the second part, we describe the training of a neural network for extracting linguistic constructions that contain possible values of the descriptors of named entities in the text. The neural network retrieves information on one preselected descriptor, for example, location, returning the names of geographical objects as the final result. In the general case, the neural network can retrieve information on several descriptors simultaneously.

Keywords: Entity extraction · Semi-structured data · Machine learning algorithms · Random walk method · Neural networks

1 Introduction

At present, a great amount of data is being accumulated in local and global networks. The major part of this information is represented by natural language texts. Such texts are not structured and, consequently, useful information cannot be extracted and analyzed by conventional computing methods and tools. With the rapidly increasing amount of information, reading texts manually or performing simple substring searches over large arrays of text data is obviously ineffective. Approaches for searching unstructured text arrays and extracting knowledge from them are in growing demand. The scientific field of computational linguistics deals with such tasks [1].

Computational linguistics includes the automatic extraction of structured data from unstructured or semi-structured data. Information extraction systems make it possible to collect data from separate parts of a text and represent the relevant information in the form of selected relationships, knowledge bases, etc. With these systems, one can put information into a semantically accurate form that allows further conclusions to be drawn by corresponding algorithms. One important branch of information retrieval is named entity recognition: the search for named entities and their classification into predetermined categories, such as names of people or organizations, locations, time values, etc.
Solving this subtask makes it possible to determine the tonality of references to a company or product, to build relations between named entities, or to accept and answer requests in natural language. This work continues research on fact extraction from text documents [2]. The aim of this work is the development of a named entity extraction technology for news reports. To achieve this goal, the following tasks were set:
1. Carrying out a morphological analysis of texts.
2. Modeling texts in the form of a hypergraph and applying the random walk method for the extraction of semantically close words.
3. Creating a neural network trained to find the correspondence of concrete key words to descriptors.

2 Review of the Existing Methods for Information Extraction

Information extraction is not a new task in the field of natural language processing. The main methods of information extraction are classified as follows:

Feature-based extraction methods. These methods rely on a fixed set of features and on weighting the features of the text elements being extracted. For each extracted element, a vector of its features is built. The most common methods in this class are Bayesian probability classifiers [3] and hidden Markov models [4–6]. Extraction of an element reduces to recognition of a certain text segment detected in its vicinity. The fact extraction method studied in this research belongs to this class. The peculiarity of the algorithm described in this work is the use of the random walk method for the extraction of key words as well as the use of a neural network for assigning facts to descriptors.

Kernel extraction methods. The essence of this approach is to replace the scalar product of vectors reflecting a characteristic representation of recognized elements with some function called the kernel [7]. This function is determined algorithmically and captures a more complex representation of recognized elements and their contexts describing the structure of the text segment. The disadvantage of this method is the computational complexity of calculating the kernels as well as of determining the segment structure.

Pattern matching methods. They use samples (patterns) and rules for comparing them with text fragments [8, 9]. A sample is understood as a chain of delimiters forming a kind of pattern phrase. This method is similar to the kernel method.

Ontology-based methods. In [10, 11], information is extracted using a semantic web based on an ontology. The authors propose an algorithm built on an ontological knowledge base and carry out a frequency analysis of the syntactic triples produced by text parsing with StanfordNLP [12].

Named entity recognition can be addressed by the following methods:

Rule-Based Methods. This method works as follows. Each token in the text is mapped to a feature set. Typical features are whether a token begins with a capital letter, whether it is punctuation, whether it is a heading, an organization or a geographical location, and to which part of speech it belongs. Then a set of rules of the form "contextual pattern → action" is extracted from the data. A contextual pattern is a combination of conditions on the features of a sequence of tokens. If a sequence of tokens satisfies the pattern, then this sequence is marked as a named entity in accordance with the action. This method is used in [13–15]. A minimal sketch of such a rule is given below.
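As an illustration of the pattern → action scheme, the following sketch applies one hypothetical location rule to a token sequence; the gazetteer, trigger words and feature set are assumptions made for the example, not part of the cited systems.

```python
import re

# Hypothetical gazetteer of location names; in this paper a list of
# localities of the Republic of Kazakhstan plays a similar role.
GAZETTEER = {"almaty", "astana", "shymkent"}

def token_features(token):
    """Build a small feature set for a single token."""
    return {
        "capitalized": token[:1].isupper(),
        "is_punct": bool(re.fullmatch(r"\W+", token)),
        "in_gazetteer": token.lower() in GAZETTEER,
    }

def apply_location_rule(tokens):
    """Contextual pattern -> action: mark a token as LOCATION if it is in the
    gazetteer, or if it is capitalized and follows a trigger word."""
    triggers = {"in", "city"}
    labels = ["O"] * len(tokens)
    for i, tok in enumerate(tokens):
        feats = token_features(tok)
        context_ok = i > 0 and tokens[i - 1].lower() in triggers
        if feats["in_gazetteer"] or (feats["capitalized"] and context_ok):
            labels[i] = "LOCATION"
    return labels

print(apply_location_rule("Emergency services worked in Almaty yesterday".split()))
# -> ['O', 'O', 'O', 'O', 'LOCATION', 'O']
```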
Hidden Markov Models. In this method, the model moves through a sequence of hidden states, and each state produces a token. Successive states depend on each other. Each transition from one state to another generates an observation, which can be a simple symbol or a multidimensional combination of features. Hidden Markov models were used to retrieve named entities in [16, 17].

Maximum Entropy Markov Models. In contrast to hidden Markov models, maximum entropy Markov models directly model the likelihood of a labeling conditioned on the states [18]. The advantage of this method is the ability to use a large set of features [19].

Conditional Random Fields. This method is closely related to maximum entropy Markov models [20]. However, the restriction of the previous method, in which the probability of a certain label depends only on previous labels, is removed: conditional random fields take into account not only previous but also subsequent labels. Since a large range of considered labels increases the cost of training the model, a simplified version considering only one adjacent label on each side can be used. Applications of this method can be found in [21, 22].

3 The Approach to the Task of Information Extraction

To extract information, at the first stage it is necessary to extract a set of semantically close key words. In this work, semantically close words (word combinations) are extracted on the basis of morphological analysis of texts and the random walk method.

3.1 Text Lemmatization

Text lemmatization consists in determining the word form and assigning token characteristics and grammatical descriptions to each form of the word. The morphological information created for each word in the text consists of four "lines", or groups of markup:
1. The word form of a lexeme (shows the "dictionary entry" of the lexeme and determines the part of speech to which it refers).
2. A set of grammatical features of the lexeme, or word-classifying characteristics.
3. A set of grammatical features of the word form, or word-transforming characteristics (for example, case of a noun, aspect of a verb, singular or plural).
4. Information about non-standard grammatical forms and spelling distortions.

3.2 Random Walk Method

Let a text consisting of n documents be given:

$$D = \{d_1, d_2, \ldots, d_n\} \quad (1)$$

In this work, we consider separate sentences as documents. Let us introduce the following notation for the text vocabulary:

$$W = \{w_1, w_2, \ldots, w_n\} \quad (2)$$

Modeling the analyzed text as a hypergraph is one of the convenient and frequently used approaches. Let us represent the text as a hypergraph HG(V, E), where V is the set of vertices and E is the set of hyperedges. Here $\bigcup_{e \in E} e \subseteq V$. In the hypergraph, the vertices $v \in V$ are the words of the text and the hyperedges $e \in E$ are the documents of the text.

Let $H_G \in \mathbb{R}^{|V| \times |E|}$ denote the hypergraph incidence matrix:

$$h(v, e) = \begin{cases} 1, & \text{if } v \in e \\ 0, & \text{if } v \notin e \end{cases} \quad (3)$$

Let HG(V, E, w) be a weighted hypergraph, where $w : E \to \mathbb{R}^{+}$ is the hyperedge weight function. In our case, the degree of a hypergraph vertex and that of a hyperedge are determined by the following expressions:

$$d(v) = \sum_{e \in E} w(e)\, h(v, e) \quad (4)$$

$$d(e) = \sum_{v \in V} h(v, e) \quad (5)$$

In the task under consideration, the weight of the graph vertices, in other words the weight of the text words, can be calculated according to the following formula based on the TF-IDF method:

$$w(v_i) = \mathrm{tf\text{-}idf} = \frac{tf(v_i)}{N_w} \cdot \log\frac{N}{df(v_i)} \quad (6)$$

where:
– $tf(v_i)$ is the frequency of occurrence of word $v_i$ in the document;
– $N_w$ is the total number of words in the document;
– $N$ is the number of documents in the collection D;
– $df(v_i)$ is the number of documents of D in which the word $v_i$ occurs.
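To make formula (6) concrete, the following sketch computes these weights for a small collection of whitespace-tokenized, already lemmatized sentences; it is a minimal illustration only, not the authors' implementation (which computes the TF-IDF matrices with TensorFlow).

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Compute w(v_i) = tf(v_i) / N_w * log(N / df(v_i)) from formula (6)
    for every word of every document. `documents` is a list of token lists."""
    n_docs = len(documents)                  # N
    df = Counter()                           # df(v_i): documents containing the word
    for doc in documents:
        df.update(set(doc))

    weights = []
    for doc in documents:
        tf = Counter(doc)                    # tf(v_i): occurrences in this document
        n_words = len(doc)                   # N_w: total words in the document
        weights.append({
            word: (count / n_words) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return weights

docs = [
    "emergency services worked in almaty".split(),
    "heavy rain flooded streets in almaty".split(),
    "state commission reported new regulations".split(),
]
print(tf_idf_weights(docs)[0]["emergency"])  # weight of a document-specific word
```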
A random walk is a mathematical model of a process of random changes – steps at discrete moments of time. It is assumed that the change at each step does not depend on previous steps or on time. Owing to the simplicity of its analysis, this model is often used in different fields, although it is a significant simplification of a real process. A one-dimensional discrete random walk is a random process with discrete time $\{Y_n\}_{n \geq 0}$ of the form:

$$Y_n = Y_0 + \sum_{i=1}^{n} X_i \quad (7)$$

where:
– $Y_0$ is the initial state;
– $X_i = \begin{cases} 1 & \text{with probability } p_i \\ -1 & \text{with probability } q_i = 1 - p_i \end{cases}$, with $0 < p_i < 1$, $i \in \mathbb{N}$;
– $Y_0, X_i$, $i = 1, 2, \ldots$ are independent random variables.

A one-dimensional discrete random walk is a Markov chain with integer states. Its initial distribution is given by the probability function of the random variable $X_0$, and the matrix of transition probabilities has the form:

$$P = (p_{ij})_{i,j \in \mathbb{Z}} = \begin{pmatrix} \ddots & \ddots & \ddots & & \\ & q_0 & 0 & p_0 & \\ & & q_1 & 0 & p_1 \\ & & & \ddots & \ddots \end{pmatrix} \quad (8)$$

That is, $p_{i,i+1} = P(X_{n+1} = i+1 \mid X_n = i) = p_i$, $p_{i,i-1} = P(X_{n+1} = i-1 \mid X_n = i) = q_i$, $i \in \mathbb{Z}$, and $p_{i,j} = P(X_{n+1} = j \mid X_n = i) = 0$ for $|i - j| \neq 1$.

The sequence of discrete random variables $\{X_n\}_{n \geq 0}$ is called a simple Markov chain (with discrete time) if

$$P(X_{n+1} = i_{n+1} \mid X_n = i_n, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = i_{n+1} \mid X_n = i_n).$$

Thus, in the simplest case, the conditional distribution of the subsequent state of the Markov chain depends only on the current state and does not depend on all previous states. The range of values of the random variables $\{X_n\}$ is called the state space of the chain, and the number n is called the step number. The matrix P(n) is called the matrix of transition probabilities at step n:

$$p_{ij}(n) = P(X_{n+1} = j \mid X_n = i) \quad (9)$$

The vector $p = (p_1, p_2, \ldots, p_n)^T$, where $p_i = P(X_0 = i)$, is the initial distribution of the Markov chain. It is obvious that the matrix of transition probabilities is stochastic, that is,

$$\sum_{j} p_{ij}(n) = 1, \quad \forall n \in \mathbb{N} \quad (10)$$

A Markov chain is called homogeneous if the matrix of transition probabilities does not depend on the step number, that is,

$$p_{ij}(n) = p_{ij}, \quad \forall n \in \mathbb{N} \quad (11)$$

To rank the hypergraph vertices, we generalize the random walk process to a hypergraph. During a random walk on a graph, transitions between vertices take place: at each step of discrete time t, the walk moves from the current vertex to a neighboring one. We can consider the vertices as a set of states $\{s_1, s_2, \ldots, s_n\}$ and the transitions as a finite Markov chain over these states. The transition probability is computed as $P(u, v) = \mathrm{Prob}(s_{t+1} = v \mid s_t = u)$, that is, the probability that the chain is at vertex v at time t+1 given that it was at vertex u at time t. In our case, the Markov chain is homogeneous: the transition probability does not depend on the time t, and for each vertex $\sum_{v} P(u, v) = 1$. The probabilities are calculated for a single transition, so the transition matrix $P \in \mathbb{R}^{|V| \times |V|}$ valid for all steps can be computed. The transition matrix P completely describes the transitions between vertices, i.e. the movement of a random surfer between vertices with the given probabilities.

In an ordinary graph, the random walk process is clear: it is only necessary to choose, with a certain probability, an edge leading to the target vertex. The hypergraph case differs somewhat because of the difference in structure: in a hypergraph, a hyperedge may contain more than two vertices, $d(e) \geq 2$.
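For comparison with the hypergraph case described next, the following sketch builds the transition matrix of a random walk on an ordinary graph from its adjacency matrix and checks the stochasticity property (10); the small graph is a made-up illustration, not data from the paper.

```python
import numpy as np

# Adjacency matrix of a small undirected graph (made-up illustration).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)

# From vertex u, an incident edge is chosen uniformly at random,
# so P[u, v] = A[u, v] / degree(u).
P = A / A.sum(axis=1, keepdims=True)

# Property (10): every row of the stochastic matrix sums to 1.
assert np.allclose(P.sum(axis=1), 1.0)
print(P)
```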
To generalize the random walk process to a hypergraph, we model a transition between two vertices that are incident to the same hyperedge as one step of the walk. As a whole, the random walk is not a one-step but a two-step process: first, a random surfer chooses a hyperedge e incident to the current vertex u; second, within the chosen hyperedge, the surfer chooses a target vertex v satisfying the condition $u, v \in e$. A random walk on a hypergraph is thus a generalization of a random walk on an ordinary graph, which corresponds to the special case where an edge offers only one other vertex to move to, whereas in a hypergraph the surfer chooses from a set of vertices. If we define the random walk process on a hypergraph by means of a Markov chain, the set of vertices forms the set of states, and at every time step t the surfer moves along an incident hyperedge to another vertex.

Let us give a general definition of a random walk on a weighted hypergraph, taking into account the weights of the vertices and hyperedges. In this case, the random walk process is extended using the weights of the vertices and hyperedges. The weight of a vertex is determined with respect to all its incident hyperedges; this is a feature vector:

$$\vec{v}_w = \big(x_m(e_1), x_m(e_2), \ldots, x_m(e_{d(v)})\big) \quad (12)$$

That is, for each hyperedge e containing a vertex u, the vertex weights may differ. The resulting random walk process can be described as follows: starting from vertex u, the surfer chooses a hyperedge e incident to vertex u with probability proportional to the hyperedge weight w(e). After that, the surfer chooses, in the same way, a vertex v with probability proportional to the vertex weight within the chosen hyperedge.

Let us define the incidence matrix $H_w \in \mathbb{R}^{|V| \times |E|}$ of the weighted hypergraph as follows:

$$h_w(v, e) = \begin{cases} w_e(v), & \text{if } v \in e \\ 0, & \text{if } v \notin e \end{cases} \quad (13)$$

Thus, the vertex degree is redefined as

$$d(v) = \sum_{e \in E} h_w(v, e) \quad (14)$$

and the hyperedge degree as

$$d(e) = \sum_{v \in V} h_w(v, e) \quad (15)$$

Using the above formulas, the transition matrix can be calculated as

$$P(u, v) = \sum_{e \in E} w(e)\, \frac{h(u, e)}{\sum_{\hat{e} \in E(u)} w(\hat{e})} \cdot \frac{h_w(v, e)}{\sum_{\hat{v} \in e} h_w(\hat{v}, e)} \quad (16)$$

or, in matrix form,

$$P = D_v^{-1} H W_e D_{ve}^{-1} H_w^{T} \quad (17)$$

Here:
– $D_v$ is the diagonal matrix of weighted vertex degrees;
– $H$ is the incidence matrix of the hypergraph;
– $W_e$ is the diagonal matrix of hyperedge weights;
– $D_{ve}$ is the diagonal matrix of weighted hyperedge degrees;
– $H_w$ is the incidence matrix of the weighted hypergraph.

The transition matrix P is stochastic, and each of its rows sums to 1 [23].

After computing the transition matrix P, we need the stationary distribution $\pi$ of the random walk. The stationary distribution can be calculated starting from a vector $\vec{v}_0 \in \mathbb{R}^{|V| \times 1}$ whose components are the probabilities $1/|V|$, summing to 1. First, the column vector $\vec{v}_0$ is multiplied by the transposed transition matrix, $\vec{v}_1 = P^T \vec{v}_0$, and the iteration continues until the vector $\vec{v}$ stops changing: multiplication of the transposed transition matrix by the probability distribution vector gives the distribution at the next step, $\vec{x} = P^T \vec{v}$. Let $x_i$ be the probability of being at vertex i. Then $x_i = \sum_j p_{ji} v_j$, where $v_j$ is the probability of the surfer currently being at node j and $p_{ji}$ is the probability of transition from j to i. If the random walk is ergodic, the probability distribution vector stops changing after a finite number of steps.
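A minimal sketch of this computation is given below: it assembles the transition matrix of formula (17) for a tiny, made-up hypergraph with unit vertex and hyperedge weights and iterates $\vec{v} \leftarrow P^T \vec{v}$ until the distribution stops changing. It is an illustration of the method described above, not the authors' implementation.

```python
import numpy as np

# Incidence matrix H of a tiny made-up hypergraph: rows are vertices (words),
# columns are hyperedges (sentences); H[v, e] = 1 if word v occurs in sentence e.
H = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)
Hw = H.copy()                     # unit vertex weights, so H_w = H here
w_e = np.array([1.0, 1.0, 1.0])   # unit hyperedge weights

We = np.diag(w_e)
Dv = np.diag(H @ w_e)             # weighted vertex degrees, formula (4)
Dve = np.diag(Hw.sum(axis=0))     # weighted hyperedge degrees, formula (15)

# Formula (17): P = Dv^{-1} H We Dve^{-1} Hw^T
P = np.linalg.inv(Dv) @ H @ We @ np.linalg.inv(Dve) @ Hw.T
assert np.allclose(P.sum(axis=1), 1.0)    # P is row-stochastic

# Stationary distribution: iterate v <- P^T v from the uniform vector.
v = np.full(P.shape[0], 1.0 / P.shape[0])
for _ in range(200):
    v_next = P.T @ v
    if np.allclose(v_next, v, atol=1e-10):
        break
    v = v_next
print(v)   # ranking weights of the vertices (words)
```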
3.3 Creation and Training of a Neural Network

The next task is the creation of a neural network trained to extract linguistic constructions that contain the possible values of the attributes of named entities in the texts being processed. The collected linguistic constructions and sets of semantically close words form the starting point of neural network training, providing descriptor candidates to be used as features. The created neural network extracts information on one pre-selected descriptor, for example, location, presenting the names of geographical objects as the final result. In the general case, the neural network can extract information across multiple descriptors simultaneously.

To train the neural network, a training set consisting of feature vectors was constructed. For one descriptor, a feature vector was built as follows: we took a window of five words before the occurrence of the element of interest in the text of the article and a window of two words after it. For each descriptor, a dictionary is formed which records the presence of a given word in it. All the features of each descriptor are collected in one "bag of words" and a feature vector is built.

The neural network is trained by presenting every input example and subsequently propagating the error. The training algorithm is based on the error backpropagation method, which is popular and widespread in machine learning, so only its general description is given here. "Backwards" means that the calculation of the gradient proceeds backwards through the network, the gradients of the layers being calculated in reverse order. Partial computations of the gradient from one layer are reused in the computation of the gradient for the previous layer. This backwards flow of the error information allows efficient computation of the gradient at each layer compared to the naive approach of calculating the gradient of each layer separately. We use a simple neural network with two hidden layers to simplify the interpretation of the results [24].

4 The Results of Computing Experiments

The data are stored in the NoSQL database MongoDB 3. The data set includes 9723 records collected from Kazakhstani portals and Kazakhstani emergency news portals. The data are marked up as follows: class 1 – a locality is present, 0 – otherwise. An example of a record: "main functional responsibilities maintaining state records of natural and man-made emergencies which took place on the territory of the city Almaty jointly with state and local executive bodies according to the civil protection law of the Republic of Kazakhstan (RK) preparation of analytical materials on areas of activity in the field of civil protection maintenance of information reference cards maintenance and operation of information systems by the state system of civil protection control and supervision in the field of civil protection administrative practice".

First, data pre-processing was performed to avoid markup mistakes in the training and test sets. For markup, a list of localities of the RK was taken, and texts with and without locality names were found using a fuzzy search algorithm (the difflib library) and an exact-match search over normalized words (with the help of the pymorphy2 library [25]). Texts in which the presence or absence of names was confirmed by both algorithms were placed in the training and test sets with the corresponding labels; a small sketch of this markup step is given below. Then, using the TensorFlow library, we calculated the TF-IDF matrices used by the random walk method.
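The sketch below illustrates the described markup step with difflib fuzzy matching and pymorphy2 normalization; the locality list and similarity cutoff are assumptions made for the example, the actual pipeline may differ in detail, and pymorphy2 with its Russian dictionaries is assumed to be installed.

```python
import difflib
import pymorphy2  # morphological analyzer used for word normalization [25]

# Hypothetical (shortened) list of localities of the RK used for markup.
LOCALITIES = ["алматы", "астана", "шымкент", "караганда"]

morph = pymorphy2.MorphAnalyzer()

def mark_text(text, cutoff=0.85):
    """Return 1 if both algorithms confirm a locality name, 0 if both confirm
    its absence, and None if they disagree (such texts would be excluded)."""
    tokens = [t.strip('.,!?;:"()').lower() for t in text.split()]
    normalized = {morph.parse(t)[0].normal_form for t in tokens if t}

    exact_hit = any(loc in normalized for loc in LOCALITIES)
    fuzzy_hit = any(
        difflib.get_close_matches(t, LOCALITIES, n=1, cutoff=cutoff)
        for t in tokens if t
    )
    if exact_hit and fuzzy_hit:
        return 1
    if not exact_hit and not fuzzy_hit:
        return 0
    return None

print(mark_text("На территории города Алматы произошло ЧС"))  # expected: 1
```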
At the output, we obtained a matrix of word proximity coefficients and a dictionary matching each vector component to a word. All the texts were encoded by the values of the obtained matrix, and each was labeled 1 or 0 depending on the presence of a locality name in the text. The experimental data were divided in the ratio 80%–20%, with 20% reserved for the final test. The remaining 80% were further divided 70%–30%: 70% for training and 30% for validation. Validation helps detect overfitting of the model.

All the experiments were carried out on a neural network with two hidden layers. The first hidden layer consists of 32 neurons, the second of 10 neurons. As activation functions for the neurons of the hidden layers we tried different variations of relu and tanh, with a sigmoid output layer. The root mean square error (RMSE) was used as the loss function. A graph of RMSE over the course of training is presented in Fig. 1. After the 500th epoch, the value of the loss function is approximately 0.17, 0.22 and 0.87 for the SGD, RMSPROP and Adam optimizers respectively, which is an acceptable result for such a simple network architecture. We use RMSE because, as the square root of a variance, it can be interpreted as the standard deviation of the unexplained variance and has the useful property of being in the same units as the response variable. Lower values of RMSE indicate a better fit (Fig. 1). RMSE is a good measure of how accurately the model predicts the response and is the most important criterion of fit when the main purpose of the model is prediction. According to these results, we have a neural network that predicts, with fairly high accuracy when trained with the SGD optimizer, whether a word or word sequence is a location name. This means that our algorithm will detect, for example, Almaty in a news report with a probability of about 85%.

Fig. 1. RMSE loss versus learning curve

5 Conclusion

This work resulted in the creation of an algorithm for extracting key words (word combinations) from a text corpus on a single subject, with the aim of further using the extracted key words as possible values of the attributes of entities described in the domain ontology being created for the organization of factual search in an extended corpus of texts of the corresponding domain. As a result, we have a pipeline containing a morphological parser with part-of-speech markup followed by the random walk method for the extraction of semantically close key words (word combinations), so the task can be considered successfully solved. A trained neural network with hidden layers is applied to the set of these word combinations in order to match a concrete word combination to a definite attribute of the entity described in the text. Thus, using a set of semantically close pairs of words, one can build an ontology, formed during the operation of the neural network, for a concrete document.

Acknowledgments. This work was supported in part by a grant of the Ministry of Education and Science of the Republic of Kazakhstan, "Development of a system for knowledge extraction from heterogeneous data sources to improve the quality of decision-making" (2018–2020), and by RFBR (Russian Federation) grant No. 18-07-01457.

References
1. Shokin, Yu.I., Fedotov, A.M., Barakhnin, V.B.: Problems of Information Retrieval. Novosibirsk: Sci., 196 p. (2010). (In Russian)
2. Barakhnin, V.B., Fedotov, A.M.: Building a factual search model. Vestnik NSU.
Series: Information Technology, vol. 11, no. 4, pp. 16–27 (2013)
3. Pedersen, T.: A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. ACM (2000)
4. Borkar, V., Sarawagi, S.: Automatic segmentation of text into structured records. ACM (2001)
5. Agichtein, E., Ganti, V.: Mining reference tables for automatic text segmentation. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, USA (2004)
6. Seymore, K., McCallum, A., Rosenfeld, R.: Learning hidden Markov model structure for information extraction. In: Papers from the AAAI-99 Workshop on Machine Learning for Information Extraction, pp. 37–42 (1999)
7. Zelenko, D., Aone, C.: Kernel methods for relation extraction. J. Mach. Learn. Res. 3, 1083–1106 (2003)
8. Califf, M., Mooney, R.J.: Bottom-up relational learning of matching rules for information extraction. J. Mach. Learn. Res. 4, 177–210 (2003)
9. Dejean, H.: Learning rules and their exceptions. J. Mach. Learn. Res. 2, 669–693 (2002)
10. Aung, A., Thwal, M.P.: Ontology based hotel information extraction from unstructured text. In: International Conference on Advances in Engineering and Technology (ICAET 2014), 29–30 March, Singapore (2014)
11. Anantharangachar, R., Ramani, S., Rajagopalan, S.: Ontology guided information extraction from unstructured text. Int. J. Web Semant. Technol. (IJWesT) 4(1), 19–36 (2013)
12. Stanford CoreNLP: A Suite of Core NLP Tools (2015). http://nlp.stanford.edu/software/corenlp.shtml
13. Atzmueller, M., Kluegl, P.: Rule-based information extraction for structured data acquisition using TextMarker. In: LWA (2008)
14. Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F., Vaithyanathan, S.: SystemT: an algebraic approach to declarative information extraction. In: ACL (2010)
15. Kluegl, P., Atzmueller, M., Puppe, F.: TextMarker: a tool for rule-based information extraction. In: UIMA@GSCL Workshop, pp. 233–240 (2009)
16. Chopra, D., Joshi, N., Mathur, I.: Named entity recognition in Hindi using hidden Markov model. In: 2016 Second International Conference on Computational Intelligence and Communication Technology (CICT), pp. 581–586. IEEE (2016)
17. Malik, M.K., Sarwar, S.M.: Urdu named entity recognition system using hidden Markov model. Pak. J. Eng. & Appl. Sci. 21, 15–22 (2017)
18. McCallum, A., Freitag, D., Pereira, F.C.N.: Maximum entropy Markov models for information extraction and segmentation. In: ICML 2000, vol. 17, pp. 591–598 (2000)
19. Ahmed, I., Sathyaraj, R.: Named entity recognition by using maximum entropy. Int. J. Database Theory Appl. 8(2), 43–50 (2015)
20. Lafferty, J., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data (2001)
21. Lee, C., et al.: Fine-grained named entity recognition using conditional random fields for question answering. In: Ng, H.T., Leong, M.-K., Kan, M.-Y., Ji, D. (eds.) AIRS 2006. LNCS, vol. 4182, pp. 581–587. Springer, Heidelberg (2006). https://doi.org/10.1007/11880592_49
22. Chen, W., Zhang, Y., Isahara, H.: Chinese named entity recognition with conditional random fields. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 118–121 (2006)
23. Bellaachia, A., Al Dhelaan, M.: HGRANK: a hypergraph based keyphrase extraction for short documents in dynamic genre (2014)
24. Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Pearson, London, 936 p. (2009)
25. Korobov, M.: Morphological analyzer and generator for Russian and Ukrainian languages. In: Khachay, M.Y., Konstantinova, N., Panchenko, A., Ignatov, D.I., Labunets, V.G. (eds.) AIST 2015. CCIS, vol. 542, pp. 320–332. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-26123-2_31