Research article

Optical environmental sensing in wireless smart meter network

  • In recent years, the traditional power grid has been undergoing a profound revolution due to the advent and development of the smart grid. Many hard and challenging issues of the traditional grid, such as high maintenance costs, poor scalability, low efficiency, and instability, can be effectively handled and solved in the wireless smart grid (WSG) by utilizing modern wireless sensor technology. In a WSG, data are first collected by sensors and then transmitted to the base station through the wireless network. The control centre is responsible for taking actions based on the received data. Traditional sensors fail to provide accurate and reliable data in a WSG, and optical fiber based sensors are emerging as an obvious choice due to the advancement of optical fiber sensing technology and their accuracy and reliability. This paper presents a WSG platform integrated with optical fiber-based sensors for real-time monitoring. To demonstrate the validity of the concept, fresh-water sensing of refractive index (RI) was first experimented with an optical fiber sensor. The sensing mechanism relies on reflectance at the fiber's interface, where the intensity of the reflected spectrum is registered in response to changes of RI in the ambient environment. The achieved sensitivity of the fabricated fiber sensor is 29.3 dB/RIU within the 1.33–1.46 RI range. An interface between the measured optical spectra and the WSG is proposed and demonstrated, and the acquired data are transmitted through a network of wireless smart meters.

    Citation: Minglong Zhang, Iek Cheong Lam, Arun Kumar, Kin Kee Chow, Peter Han Joo Chong. Optical environmental sensing in wireless smart meter network[J]. AIMS Electronics and Electrical Engineering, 2018, 2(3): 103-116. doi: 10.3934/ElectrEng.2018.3.103



    Most existing search engines support text retrieval but still have problems retrieving mathematical expressions, especially expressions without natural language annotations. While traditional search engines fall short in this respect, recent research on mathematical expression retrieval has achieved relatively rich results [1,2,3,4,5].

    Focusing on mathematical expressions in LaTeX format, Zhong et al. [6] proposed a mathematical formula retrieval algorithm based on the operator tree. By matching multiple disjoint common subtrees with the same structure, the maximum number of sub-formulas is matched, which improves the efficiency of formula matching. Although matching the maximum number of sub-formulas can improve retrieval accuracy, most sub-formulas are rather complicated. Therefore, the response time of real-time retrieval is approximately 20 s, which cannot meet the needs of real-time mathematical formula retrieval. To achieve faster sub-formula retrieval, the team also proposed a strategy based on an inverted index and dynamic pruning [7], which improves the time efficiency of retrieval while ensuring that the retrieval results are still valid.

    Focusing on mathematical expressions in MathML format, Schubotz et al. [8] proposed the VMEXT system, which can realize a visual tree of expressions in MathML format. It can also realize human-computer interaction, which is convenient for users to quickly find and improve the expression tree. In addition, similar or identical elements of two expressions can be visualized to calculate the similarity of expressions.

    Focusing on mathematical expression images, Davila et al. [9] proposed a mathematical formula matching system. The system is mainly aimed at matching handwritten formulas on the teaching whiteboard with the formulas in course notes. First, the entire image was preprocessed, including formula search and structure correction. Then, the largest match in each image was identified by the symbolic consistent spatial alignment and similar relative sizes. Finally, each mathematical formula was divided into multiple symbol pairs. Symbol pairs are two symbols in a formula that are the nearest geometric neighbor of each other, which indicates the logical relations between them. The angle of a symbol pair is the angle between the line connecting the centers of the symbols and a horizontal line, which is helpful for judging the relationship between the two symbols. The images were sorted by the angle of the symbol pair.

    With the development of deep learning, text embedding methods are widely used in natural language processing. Gao et al. [10] tried to apply the same method to formula embedding. They applied neural networks to mathematical information retrieval and proposed the "symbol2vec" model. This model was used to learn the vector representation of mathematical symbols and perform similarity calculations. Similarly, the NTFEM model [11] used an N-ary tree to convert the mathematical formula into a linear sequence. The word embedding model is used to embed the formula, and a weighted average embedding vector is obtained by using a weighting function. In mathematical formula retrieval, the BERT (bidirectional encoder representations from transformer)-based embedding model [12] is proposed to introduce more semantic information when the formula is embedded. The model uses the LaTeX format as the input and the BERT model is used to encode the formula. The index is built according to the embedded formula vector, formula id and post id from which the formula originates, and finally, the cosine similarity is used to obtain the final ranking of the formula.

    In terms of fusion retrieval and ranking of mathematical expressions and scientific documents, Pathak et al. [13,14,15] worked on fusing expressions and related texts for retrieval. First, they proposed the MathIR system composed of three modules: "TS", "MS" and "TMS". This turned scientific document retrieval into a similarity calculation over fused expressions and text rather than a simple expression search. Next, the "context-formula" pair was extracted, and the context of the formula was merged for retrieval. Finally, the modules of the system were optimized, and the formula retrieval was effectively integrated with the text retrieval module. Similarly, Schubotz et al. [16] regarded formulas and natural text as a single information source. Descriptions of mathematical formula symbols were extracted from the text surrounding each formula and used to represent the definitions of mathematical symbols. A namespace was formed as an internal data structure for mathematical information retrieval. This method can eliminate the ambiguity of mathematical symbols and better meet the retrieval needs of users. While retrieving mathematical expressions, Wang et al. [17] integrated other attributes to rank scientific documents, such as document category, the type of journal to which a document belongs, and document citations. The sorting results were optimized by fusing these attributes. To better integrate mathematical expressions and text in scientific document retrieval, a weight parameter was proposed [18], with which the proportion of text and mathematical expressions is manually adjusted on the basis of formula similarity and text similarity.

    In conclusion, current scientific document retrieval and ranking methods can be roughly divided into two types. The first type recalls by mathematical expression similarity and sorts by text similarity, or recalls by text similarity and sorts by expression similarity; whichever similarity is used for the final sorting, the other is weakened. The second type manually adjusts a weight to fuse expression similarity and text similarity, but it is difficult for less experienced users to choose appropriate parameter values. To solve these problems, this study proposes a multi-attribute retrieval and ranking model for scientific documents that combines mathematical expressions and related texts. The model improves on the second type and eliminates the need to manually adjust the weights of expressions and texts.

    The similarity of five attributes is calculated: mathematical expression symbols (MESY), mathematical expression sub-forms (MESF), mathematical expression context (MECT), scientific document keywords (SDKY) and the frequency of mathematical expressions in scientific documents (FOME). A gradient boosting decision tree (GBDT) and logistic regression (LR) are used for feature reorganization and calculation to obtain the final search results, which improves the rationality of the retrieval.

    Figure 1 shows a flow chart of the scientific documents retrieval and ranking system (solid lines denote online query flows and dotted lines denote offline index flows). The whole process consists of four parts: query preprocessing, scientific document preprocessing, multi-attribute similarity measure and scientific document retrieval and ranking.

    Figure 1.  Flow chart of the scientific documents retrieval and ranking system.

    The query preprocessing module processes the input query. The query is a combination of mathematical expressions and text, which needs to be split. The scientific document preprocessing module extracts mathematical expressions and related text, preliminarily decomposes mathematical expression symbols and calculates the weights of the related text. The module then interacts with the database module to store and index the information corresponding to the scientific documents, to facilitate subsequent similarity calculations. The multi-attribute similarity measure module calculates the similarity of the five attributes of scientific documents; a different similarity calculation algorithm is set up for each attribute according to its characteristics, and the module interacts with the database module to store the calculated similarities. The scientific document retrieval and ranking module fuses the similarities of the multiple attributes of the scientific documents. Finally, the similarity between each scientific document and the input query is obtained, and the scientific documents are ranked according to this similarity.

    When retrieving mathematical expressions, the query expression may be entered inaccurately, for example with incorrect mathematical symbols. Retrieving each mathematical symbol one by one improves the fault tolerance of the system.

    Definition 1 MEQ is the query expression, MEDt(t=1,2,...,TME) is the mathematical expression dataset from the scientific documents, and TME is the number of mathematical expressions in the dataset.

    First, FDS [19] is used to normalize the mathematical expressions in various formats into a unified form by decomposing them into multiple mathematical symbols with the corresponding five attribute values called level, flag, count, ratio, and operator.

    The "level" attribute represents the level of a mathematical symbol, based on its position relative to the horizontal baseline. For example, in the mathematical expression b2a, the level values of , a, b and 2 are 0, 1, 1 and 2, respectively. "Flag" represents the spatial flag bit of a symbol. Table 1 shows the value of the flag taking x as an example. "Count" refers to the sequential position of a symbol in the mathematical expression. "Ratio" refers to the frequency of the operator in the mathematical expression. "Operator" refers to whether a mathematical symbol is an operator. If a symbol is an operator, it is marked as 1; otherwise, it is marked as 0.

    Table 1.  Examples of flag values.
    Meaning of flag | Right | Above | Superscript | Subscript | Below | Contains | Left-superscript | Left-subscript
    Example | 2x | x/2 | 2^x | 2_x | 2/x | √x | ^x2 | _x2
    Value of the flag of x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7

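    The decomposition step can be sketched in Python. The function name fds_decompose and the attribute encoding below are hypothetical simplifications: the sketch only handles flat, left-to-right expressions, so level and flag are always 0, and it reads "ratio" as a symbol's relative frequency in the expression; the real FDS also handles scripts, fractions and radicals.

```python
# Hypothetical sketch of FDS-style symbol decomposition (flat expressions only).
OPERATORS = set("+-*/=^")

def fds_decompose(expression):
    """Split an expression string into per-symbol attribute records."""
    symbols = list(expression)
    total = len(symbols)
    records = []
    for position, sym in enumerate(symbols, start=1):
        records.append({
            "term": sym,
            "level": 0,          # flat expressions only, so always baseline
            "flag": 0,           # "Right" spatial relationship only
            "count": position,   # sequential position in the expression
            "ratio": symbols.count(sym) / total,  # one plausible reading of "ratio"
            "operator": 1 if sym in OPERATORS else 0,
        })
    return records

records = fds_decompose("x+x")
```

    For "x+x" this yields three records: two for "x" (operator 0) and one for "+" (operator 1), with counts 1, 2 and 3.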

    In this way, the mathematical expression is converted into a list, which is convenient for subsequent retrieval of expression symbols. Table 2 shows the membership functions of the five attributes [20]. The balance factors in the functions are determined by curve fitting, according to the distribution of each attribute's values over the symbols in the data set. The values of the balance factors are: α = 0.1, μ = 0.2, υ = 0.9, σ = 0.2.

    Table 2.  Description of the five attribute membership functions.
    Attribute | Membership function | Function description
    level | M_lev(T_D, T_Q) = e^(−α·|level_D − level_Q|) | level_D and level_Q respectively represent the levels of the two terms; α is a balance factor
    flag | M_fla(T_D, T_Q) = {(f_o, flag_(D,Q))} | flag_(D,Q) refers to the spatial position relationship of the same term; if flag_D = flag_Q, then M_fla(T_D, T_Q) = 1, otherwise it is 0
    count | M_cou(T_D, T_Q) = 1 / (1 + μ·|count_D − count_Q|^υ) | μ and υ are balance factors
    ratio | M_rat(T_D, T_Q) = e^(−((rat_D − rat_Q)/σ)^2) | σ is a balance factor
    operator | M_ope(T_D, T_Q) = {(s_o, operator_D)} | operator_D refers to whether the current term is an operator; if it is an operator, the value of M_ope(T_D, T_Q) is 1, otherwise the value is 0.5


    After the membership calculation is completed, each symbol corresponds to a five-tuple membership degree vector, denoted by listsym. The structure of listsym is shown in Eq (1).

    listsym = (Mlev_ex_term, Mfla_ex_term, Mcou_ex_term, Mrat_ex_term, Mope_ex_term) (1)

    where the term refers to the current mathematical symbol and ex refers to the expression id corresponding to the current mathematical symbol.

    Take MEQ = "x+x" and MEDt = "x^2+y" as examples. The three mathematical symbols that the two expressions have in common are "x", "+" and "x". Table 3 shows the attribute values and membership degrees after the decomposition of the three symbols.

    Table 3.  Membership values of the five attributes in "x+x" and "x^2+y".
    Term level Mlev flag Mfla count Mcou ratio Mrat operator Mope
    x (0, 0) 1 (0, 0) 1 (1, 1) 1 (0.25, 0.25) 1 (0, 0) 0.5
    + (0, 0) 1 (0, 0) 1 (2, 3) 0.8333 (0.25, 0.25) 1 (1, 1) 1
    x (1, 0) 0.9048 (5, 0) 0 (4, 1) 0.6503 (0.5, 0.25) 0.2096 (0, 0) 0.5


    Next, hesitant fuzzy sets [21,22,23] are used to calculate the membership degree of each mathematical symbol. Hesitant fuzzy sets have advantages in dealing with multi-attribute decision-making problems. The formula for calculating the similarity of expressions using hesitant fuzzy sets is shown in Eq (2).

    Finally, the normalization calculation of the mathematical symbols is performed to obtain the similarity of the expressions. The specific algorithm is shown in Algorithm 1.

    Algorithm 1 Similarity calculation algorithm of mathematical expressions based on mathematical symbols
    INPUT: MEQ, MEDt(t=1,2,...,TME)
    OUTPUT: SimSymbol
    1        termQ = FDS(MEQ)
    2        termD = FDS(MEDt)
    3        Update listD // compute termsym_vec, the five-attribute membership vector of each term shared by termD and termQ; termsym_vec is stored in listD together with the id and term from termD
    4        for id in (1, TME):
    5            for term1 in termQ:
    6                var = listD_id.exists(term1)
    7                if var == TRUE:
    8                    listD_id.add(MAX((Σ termsym_vec^2 / 5)^(1/2))) // if the same term occurs more than once in listD, take the one with the greatest similarity
    9                    listD.delete(MAX((Σ termsym_vec^2 / 5)^(1/2))) // delete the corresponding item from listD
    10            else:
    11                listD_id.add(0, 0, 0, 0, 0) // if the term does not exist, add (0, 0, 0, 0, 0)
    12        SimSymbol = SIM(listD, listQ)
    13        END


    Definition 2 listD refers to a normalized list of mathematical symbols. listD includes the id, the term, and the five attribute membership values termsym_vec. listQ refers to the listD formed by calculating the membership degree of MEQ with itself, so the five-attribute membership degree of each mathematical symbol in listQ is (1, 1, 1, 1, 1).

    Definition 3 The formula [20] for calculating SimSymbol in Algorithm 1 is shown in Eq (2).

    SIM(listD, listQ) = 1 − {(1/5)·[(1/NRE)·Σ_{j=1}^{NRE} |MD − MQ|^λ]}^(1/λ) (2)

    where MD and MQ are the five-tuple vector values with the same term in listD and listQ, respectively; NRE is the number of mathematical symbols of MEQ after FDS decomposition; and λ > 0. When λ = 1, the formula degenerates to the standard Hamming distance. When λ = 2, it degenerates to the standard Euclidean distance. In this study, λ = 2.

    Take the two mathematical expressions in Table 3 as an example: suppose that x+x is the query and x^2+y is the mathematical expression with id = 1 in the data set. Algorithm 1 is used to calculate the similarity of these two expressions. The result of the first update of listD is {[1, x, (1, 1, 1, 1, 0.5)], [1, +, (1, 1, 0.8333, 1, 1)], [1, x, (0.9048, 0, 0.6503, 0.2096, 0.5)]}. Next, listD is updated in the order of the terms in the query, and the result is {[1, x, (1, 1, 1, 1, 0.5)], [1, +, (1, 1, 0.8333, 1, 1)], [1, , (0, 0, 0, 0, 0)], [1, x, (0.9048, 0, 0.6503, 0.2096, 0.5)]}. Finally, formula (2) is used for the similarity calculation with NRE = 4, MD = [(1, 1, 1, 1, 0.5), (1, 1, 0.8333, 1, 1), (0, 0, 0, 0, 0), (0.9048, 0, 0.6503, 0.2096, 0.5)] and MQ = [(1, 1, 1, 1, 1), (1, 1, 1, 1, 1), (1, 1, 1, 1, 1), (1, 1, 1, 1, 1)]. The final calculated SIM = 0.1425.
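    Eq (2) can be sketched as follows. How the averaging over the five attributes and the NRE symbols is grouped is our reading of the garbled formula, so the numeric value for a given pair may differ from the worked example depending on that grouping.

```python
def hesitant_sim(list_d, list_q, lam=2):
    """Eq (2): 1 - {(1/5)[(1/N_RE) * sum_j |M_D - M_Q|^lam]}^(1/lam).
    Each element of list_d / list_q is a five-tuple of membership degrees;
    the per-tuple distance sums component-wise differences (an assumption)."""
    n_re = len(list_q)
    total = sum(sum(abs(d - q) ** lam for d, q in zip(m_d, m_q))
                for m_d, m_q in zip(list_d, list_q))
    return 1.0 - ((total / n_re) / 5.0) ** (1.0 / lam)

# The membership vectors from the worked example above.
list_q = [(1, 1, 1, 1, 1)] * 4
list_d = [(1, 1, 1, 1, 0.5), (1, 1, 0.8333, 1, 1),
          (0, 0, 0, 0, 0), (0.9048, 0, 0.6503, 0.2096, 0.5)]
score = hesitant_sim(list_d, list_q)
```

    Identical lists give a similarity of exactly 1, and any mismatch lowers the score toward 0.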

    The mathematical expression sub-form similarity calculation treats MEQ as a whole object to be retrieved: MEQ is regarded as a part of MEDt(t=1,2,...,TME). The degree of membership of MEQ to three attribute values of MEDt is calculated. The three attributes are length, level and flag. "Length" refers to the ratio of the length of the sub-formula to the length of the expression. The meanings of level and flag are the same as in Section 2.1. Table 4 shows the membership functions corresponding to the three attributes [16].

    Table 4.  Three attribute descriptions and membership functions of MEQ as a sub-form.
    Attribute | Attribute description | Membership function
    length | the length of MEQ relative to MEDt | DM_len(MEQ, MEDt) = length_Q / length_Dt
    level | the level of MEQ relative to MEDt | DM_lev(MEQ, MEDt) = e^(−α·level_Q)
    flag | the flag of MEQ relative to MEDt | DM_fla(MEQ, MEDt) = {(f_o, flag_Q)}


    A mathematical expression may contain multiple identical sub-expressions. After the membership of each attribute is calculated, the hesitant fuzzy set is used to calculate the similarity, which is the final similarity of MEQ as a sub-form of MEDt(t=1,2,...,TME). The specific algorithm is shown in Algorithm 2.

    Algorithm 2 Similarity calculation algorithm based on mathematical expression as sub-form
    INPUT: MEQ, MEDt(t=1,2,...,TME)
    OUTPUT: SimSub
    1    MEMAt = MEDt.Replace(MEQ, "#") // use # to replace the substrings identical to MEQ
    2    Num = MEMAt.Count("#")
    3    TMAt = FDS(MEMAt)
    4    for termma in TMAt:
    5        if termma == "#":
    6            listMAt.add(DMlen, DMlev, DMfla)
    7    simma = 1 − {(1/3)·[(1/Num)·Σ_{j=1}^{Num} |listQ − listMAt|^λ]}^(1/λ)
    8    listMA(id, Esim).add(TMAt.id, simma) // add the mathematical expression id and similarity to listMA in turn
    9    SimSub = listMA.sort() // sort() is a built-in function in Python
    10    RETURN SimSub
    11    END


    Definition 4 DMlen,DMlev,DMfla represent the membership value of the three attributes length, level, and flag, respectively.

    BERT (bidirectional encoder representations from transformers) [22,23,24] is a pre-trained language model that uses unsupervised data for pre-training and is fine-tuned on the task corpus; it performs excellently on natural language understanding tasks. There are two tasks in the model pre-training phase: masked language modeling and next sentence prediction. The joint training of these two tasks makes the trained word vectors more accurate and comprehensive, and can solve the polysemy problem that word2vec cannot.

    This study uses mathematical expression contextual text to fine-tune BERT to achieve the similarity calculation of the contextual text. The specific algorithm is shown in Algorithm 3.

    Algorithm 3 Contextual text similarity calculation algorithm
    INPUT: SEQ, SEDt(t=1,2,...,TSE) // SEQ is the query text; TSE is the number of contextual sentences
    OUTPUT: SimCT
    1    VE = Encode(SEQ)    // sentence embedding of SEQ
    2    WORQ = jieba(SEQ)    // the jieba tool (an open-source word segmentation tool for Python) is used to segment SEQ
    3    for WORr in WORQ:
    4        number = location(WORr)    // locate each keyword
    5        Vw = summed_layers[number]    // find the word vector in VE
    6        simil = simil + MAX(cosine_similarity(Vw, SEDt))
    7    SimCT = simil / len(WORQ)
    8    RETURN SimCT
    9    END

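    The core of Algorithm 3 (lines 3–7) is a max-cosine match per keyword, averaged over the query keywords. A minimal sketch with plain Python lists, leaving out the BERT encoding and jieba segmentation (the function names are ours):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def context_similarity(keyword_vectors, candidate_vectors):
    """For each query keyword vector, take its best cosine match among the
    candidate word vectors, then average over the keywords."""
    best = [max(cosine_similarity(vw, vd) for vd in candidate_vectors)
            for vw in keyword_vectors]
    return sum(best) / len(best)
```

    In the full system the keyword vectors would come from the fine-tuned BERT layers and the candidates from the document's context sentences.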

    The Jaccard coefficient is used to calculate the similarity of two sets (GA,GB). It is expressed as the ratio of the intersection and union of the two sets, and can effectively calculate the degree of overlap between the two sets to obtain the similarity of the sets. Its definition is shown in Eq (3).

    Jaccard(GA, GB) = |GA ∩ GB| / |GA ∪ GB| = |GA ∩ GB| / (|GA| + |GB| − |GA ∩ GB|) (3)
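    A direct implementation of Eq (3):

```python
def jaccard(g_a, g_b):
    """Eq (3): |A ∩ B| / |A ∪ B|, via the inclusion-exclusion form."""
    g_a, g_b = set(g_a), set(g_b)
    inter = len(g_a & g_b)
    return inter / (len(g_a) + len(g_b) - inter)
```

    For example, jaccard({1, 2}, {2, 3}) gives 1/3: one shared element out of three distinct ones.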

    Each scientific document usually focuses on a specific topic. The keywords of the documents are extracted, and matching them against the query words can improve the accuracy of the search results. The contents of the scientific document are divided into words, and by calculating the weight of each word, the 5 words with the highest weights are selected as the keywords of the scientific document. The weight calculation method is shown in Eq (4).

    W_{i,wor} = (p_{i,wor} / Σ_{n=0}^{N} Σ_{k=0}^{K} p_{n,k}) · lg(N / (1 + m_wor)) (4)

    where W_{i,wor} refers to the weight of the keyword wor in scientific document i, p_{i,wor} refers to the total number of times wor appears in i, N refers to the total number of scientific documents, K refers to the number of keywords in the current scientific document, and m_wor refers to the number of documents containing wor.

    Since the difference in text length will affect the calculated keyword similarity, this study improves the Jaccard coefficient and adds the length difference part. The calculation of similarity is shown in Eq (5).

    SimWor = Jaccard(SEQ, WEDT) = |SEQ ∩ WEDT| / (|SEQ| + |WEDT| − |SEQ ∩ WEDT| + φ·|len(SEQ) − len(WEDT)|) (5)

    where WEDT refers to the keyword collection of scientific documents, and ϕ is the balance factor.
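    The length-penalized variant of Eq (5) can be sketched as below; the default value φ = 0.1 is illustrative only, since the paper does not state the fitted value here.

```python
def sim_wor(se_q, we_dt, phi=0.1):
    """Eq (5): Jaccard with a length-difference penalty in the denominator.
    se_q: query word set; we_dt: document keyword set; phi: balance factor
    (illustrative default)."""
    q, d = set(se_q), set(we_dt)
    inter = len(q & d)
    return inter / (len(q) + len(d) - inter + phi * abs(len(q) - len(d)))
```

    With phi = 0 this reduces to the plain Jaccard coefficient; a positive phi lowers the score of keyword sets whose sizes differ from the query's.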

    When retrieving scientific documents, the same mathematical expression appears with different frequencies in different documents, so the importance and retrieval order of the documents also differ.

    The frequency of mathematical expressions in scientific documents is the product of the frequency of mathematical expressions in the document (EF) and the inverse document frequency (EIDF), which is similar to TF-IDF. The difference is that when the text frequency is calculated, the query text must be exactly the same as the text in the document before it can be considered to appear once, whereas in the process of searching for mathematical expressions, partially identical expressions can also be counted. For example, when MEQ is U=IR, the appearance of U=IR or I=U/R in a scientific document can be counted as one occurrence of the mathematical expression.

    The frequency of occurrence of the mathematical expression MEQ is calculated with reference to SimSymbol and SimSub.

    The frequency of expressions is related to the total number of mathematical expressions in the scientific document. The more expressions contained in the scientific document, the smaller the value of EF. The calculation method of EF is shown in Eq (6).

    EF = 1 / (1 + e^(−(ROUND(SimSymbol) + ROUND(SimSub) − NUM(SimSymbol.exp ∩ SimSub.exp)) / Docnum)) (6)

    where Docnum refers to the total number of mathematical expressions in the scientific document. ROUND() refers to the rounding function with 0.5 as the demarcation point: a value greater than 0.5 is recorded as 1; otherwise, it is 0. SimSymbol.exp refers to the expression corresponding to SimSymbol. NUM() calculates the number of identical expressions recalled by both SimSymbol and SimSub.

    The calculation of EIDF requires the number of occurrences of MEQ in the dataset. If MEQ appears multiple times in different scientific documents, its importance will decrease accordingly. The calculation method of the EIDF is shown in Eq (7).

    EIDF = log(N / (INCLUDE(exp) + 1)) (7)

    where N refers to the total number of scientific documents in the data set, INCLUDE(exp) refers to the number of scientific documents containing exp. The specific calculation of INCLUDE(exp) is shown in Eq (8).

    INCLUDE(exp) = { 1, if ROUND(SimSymbol) ∨ ROUND(SimSub) = 1;  0, if ROUND(SimSymbol) ∨ ROUND(SimSub) = 0 } (8)

    Finally, the calculation method of the frequency of mathematical expressions in scientific documents is shown in Eq (9).

    Simfre = EF · EIDF (9)
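    Eqs (6)–(9) can be put together in a short sketch. The grouping inside the exponent of Eq (6) follows our reading of the formula and is an assumption; the function and parameter names are ours.

```python
import math

def round_indicator(sim):
    """ROUND(): 1 when the similarity exceeds 0.5, else 0."""
    return 1 if sim > 0.5 else 0

def ef(sim_symbol, sim_sub, num_overlap, doc_num):
    """Eq (6), as reconstructed here: count expressions recalled by either
    channel (the overlap subtraction gives the union), scale by the
    document's expression total, and squash with a sigmoid."""
    recalled = round_indicator(sim_symbol) + round_indicator(sim_sub) - num_overlap
    return 1.0 / (1.0 + math.exp(-recalled / doc_num))

def eidf(n_docs, include_count):
    """Eq (7): inverse document frequency of the expression."""
    return math.log(n_docs / (include_count + 1))

def sim_fre(ef_value, eidf_value):
    """Eq (9): product of the two factors."""
    return ef_value * eidf_value
```

    As expected, EF grows when both channels recall the expression, and EIDF shrinks as the expression appears in more documents.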

    The LR (logistic regression) model is linear regression plus a (non-linear) sigmoid mapping, as shown in Eq (10).

    hθ(x) = 1 / (1 + e^(−θ^T·x)) (10)

    where θ^T·x is the input of the sigmoid, and θ and x are both matrices: θ is the linear regression parameter, T denotes the matrix transpose, and x is the input feature.
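    Eq (10) as code (a sketch; lr_predict is our name for the scoring function):

```python
import math

def sigmoid(z):
    """The logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def lr_predict(theta, x):
    """Eq (10): h_theta(x) = sigmoid(theta^T x) for 1-D parameter/feature vectors."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```

    With θ = 0 the score is exactly 0.5, and a large positive θ^T·x pushes it toward 1.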

    The LR model has a simple structure and fast running speed, but the learning ability and expression ability of the LR model are very limited. A large amount of feature engineering is required for feature dispersion and feature combination to increase the learning ability of the model. Therefore, an approach is needed for automatically discovering effective features and feature combinations and shortening the LR feature experiment cycle. The GBDT model can automatically discover features and carry out effective feature combinations.

    GBDT (gradient boosting decision tree) [25,26,27,28] is a boosted tree model based on the CART regression tree model. In the process of generating each tree, the residual of the previous tree is calculated. The next tree is fitted on the basis of the residuals so that the residuals obtained on the next tree decrease. It is shown in Eq (11).

    fCTM(x) = Σ_{m=1}^{CTM} T(x; ψm) (11)

    where T(x; ψm) refers to a CART regression tree, ψm refers to the parameters of the tree, and CTM refers to the number of CART trees.

    The GBDT model construction algorithm is shown in Algorithm 4.

    Algorithm 4 GBDT
    INPUT    T = {(x1, y1), (x2, y2), ..., (xN, yN)}  // training dataset
    OUTPUT  fCTM(x)
    1    f0(x) = 0  // initialization
    2    L(y, f(x)) = (y − f(x))^2  // define the loss function
    3    for m in (1, CTM):
    4        rmi = −[∂L(yi, f(xi)) / ∂f(xi)], evaluated at f(x) = f_{m−1}(x)  // the negative gradient of the i-th sample on the m-th tree
    5        Rmj, j = 1, 2, ..., J  // the leaf node areas of the m-th tree
    6        cmj = argmin_c Σ_{xi ∈ Rmj} L(yi, f_{m−1}(xi) + c)  // best fit value for leaf area j
    7        fm(x) = f_{m−1}(x) + Σ_{j=1}^{J} cmj·I(x ∈ Rmj)
    8    RETURN fCTM(x) = Σ_{m=1}^{CTM} Σ_{j=1}^{J} cmj·I(x ∈ Rmj)
    9    END

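    Algorithm 4 with squared loss can be sketched with one-split regression trees (stumps) on 1-D data. This is a toy, self-contained version: for squared loss the negative gradient in line 4 is just the residual y − f(x), and the leaf value in line 6 reduces to the mean residual. The names and the shrinkage rate are ours, and the sketch assumes at least two distinct x values.

```python
def fit_stump(xs, residuals):
    """One-split CART regression tree: pick the threshold that minimizes
    squared error; each leaf value is the mean residual (line 6)."""
    best = None
    for thr in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        c_l, c_r = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - c_l) ** 2 for r in left)
               + sum((r - c_r) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, thr, c_l, c_r)
    return best[1:]

def gbdt_fit(xs, ys, n_trees=100, shrink=0.3):
    """Fit each stump to the running residuals (the negative gradient of
    squared loss, line 4), then add it to the ensemble (line 7)."""
    f = [0.0] * len(xs)
    trees = []
    for _ in range(n_trees):
        residuals = [y - fi for y, fi in zip(ys, f)]
        thr, c_l, c_r = fit_stump(xs, residuals)
        trees.append((thr, c_l, c_r))
        f = [fi + shrink * (c_l if x <= thr else c_r) for fi, x in zip(f, xs)]
    return trees

def gbdt_predict(trees, x, shrink=0.3):
    """Line 8: sum the shrunken leaf values over all trees."""
    return sum(shrink * (c_l if x <= thr else c_r) for thr, c_l, c_r in trees)
```

    On a simple step function the residuals shrink geometrically, so the ensemble converges to the targets after a few dozen rounds.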

    In this study, a GBDT + LR model is used. The five attributes MESY, MESF, MECT, SDKY and FOME are selected. GBDT can automatically combine and discretize features. After the decision tree is established, the path from the root node to each leaf node is a combination of different features. Each leaf node represents a unique combination of features. After the combination is completed, it is transferred to the LR model for secondary training.

    As shown in Figure 2, Tree1 and Tree2 are two GBDT trees. simSymbol, simSub, simCT, simFre and simWor are represented by S1, S2, S3, S4 and S5 in the figure, respectively.

    Figure 2.  GBDT model diagram.

    The sample xT is judged by two tree nodes and belongs to different leaf nodes of the two trees. The leaf nodes of the two trees are coded. The leaf nodes to which sample xT belongs are marked as 1, and the others are marked as 0. The leaf node codes of the two trees are connected in series to form a seven-dimensional sample (1, 0, 0, 1, 0, 0, 0).

    Each xT will go through multiple GBDT trees to recombine features. For GBDT trees, the path from the root node of the tree to the leaf nodes is a combination of different features. Therefore, the leaf node can uniquely represent this path. The leaf node is input into the LR model as a discrete feature for training.

    In the final prediction, an input sample passes through each tree of GBDT to obtain a discrete feature (a set of feature combinations) corresponding to one leaf node per tree. The features are then passed into LR in one-hot form for linearly weighted prediction, yielding the final similarity SIM. Figure 3 shows the specific flow chart. For the LR model, an L2 penalty is used and the inverse of regularization strength is 0.05. For the GBDT model, the metric is "binary_logloss", num_leaves is 32, num_trees is 60 and the learning_rate is 0.005.
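    The hyperparameters quoted above map naturally onto LightGBM-style and scikit-learn-style parameter names. The dictionaries below are an assumed rendering of that configuration, not the authors' exact code:

```python
# Assumed parameter names (LightGBM / scikit-learn conventions)
gbdt_params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 32,
    "num_trees": 60,
    "learning_rate": 0.005,
}
lr_params = {
    "penalty": "l2",  # L2 penalty term
    "C": 0.05,        # inverse of regularization strength
}
```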

    Figure 3.  GBDT + LR system flow chart.

    The dataset used in the experiment is "MathTagArticles" in NTCIR-12_MathIR_Wikipedia_Corpus, which includes 31742 scientific documents. "MathTagArticles" comprises 16 archive files (coded wpmath0000001-wpmath0000016), each containing about 2000 scientific documents. In this study, the hold-out method is used: wpmath0000001-wpmath0000008 are used for training, wpmath0000009-wpmath0000012 for verification, and wpmath0000013-wpmath0000016 for testing. Table 5 shows the experimental environment.

    Table 5.  Experimental environment.
    Experimental environment Configuration
    Processor Intel(R) Xeon(R) Silver 4215 CPU @ 2.50GHz
    RAM 32GB
    Operating system Linux
    Graphics card GeForce RTX 2080
    Video memory 8G
    Python version 3.6
    TensorFlow version 1.14.0


    The evaluators are five mathematics graduate students familiar with mathematical expressions and scientific documents. For each query, the top 10 results are selected for evaluation. The grades are relevant (marked 2), partially relevant (marked 1) and not relevant (marked 0). The results of the same query are marked independently by the five evaluators, but their marks for the same result should not differ too widely: for example, if some evaluators mark a result 2, the others may mark 1 or 2, but not 0. A labeling rule is therefore set: for the same result, the difference between the scores of different evaluators must be less than or equal to 1; if it is greater than 1, the marks are invalid.

    Finally, the scores of the five evaluators are summed and converted to the comprehensive score in Table 6. Following majority rule, a total score greater than 7 (8-10) is considered relevant, a total greater than 2 (3-7) partially relevant, and anything lower (0-2) not relevant. In the subsequent evaluation of results, when a metric distinguishes only relevant and not relevant, partially relevant defaults to relevant.
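    The validity rule and the mapping of Table 6 can be sketched as follows (the function name is illustrative):

```python
def combine_scores(scores):
    """Aggregate five evaluators' marks (each 0, 1 or 2) per the rules above."""
    if max(scores) - min(scores) > 1:
        return None  # marks differ by more than 1: invalid
    total = sum(scores)              # ranges over 0..10
    if total > 7:
        return "relevant"            # combined 8-10
    if total > 2:
        return "partially relevant"  # combined 3-7
    return "not relevant"            # combined 0-2

print(combine_scores([2, 2, 2, 1, 2]))  # relevant (total 9)
print(combine_scores([2, 1, 0, 1, 1]))  # None (spread of 2, invalid)
```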

    Table 6.  Relevance assessment.
    Assessment Individual Combined
    Relevant 2 8−10
    Partially Relevant 1 3−7
    Not Relevant 0 0−2


    This study conducts a large number of experiments on different types of mathematical expressions and related texts. Twenty representative query expressions and texts were selected by the evaluators for the statistical experiments. The queries reflect the actual composition of the mathematics dataset, covering different structures of mathematical expressions and the different fields involved in scientific documents. Table 7 lists the query expressions and texts.

    Table 7.  20 queries in system experiment.
    NO. Expression Text NO. Expression Text
    1 b² − 4ac discriminant 11 (1/2)mv² theorem of kinetic energy
    2 sin²α + cos²α trigonometric function 12 f(x) = Σ_{n=0}^{∞} f^(n)(x₀)/n! · (x − x₀)ⁿ Taylor function
    3 a/b proportion 13 (x − a)² + (y − b)² = r² round
    4 sinθ sine function 14 U = IR Ohm law
    5 y² = 2ρx parabola 15 a² + b² = c² Pythagorean theorem
    6 S = 4πR² surface area 16 P(B_i)P(A|B_i) / Σ_{j=1}^{n} P(B_j)P(A|B_j) Bayes
    7 2q 17 2 Σ_{i=1}^{4} 1/r_i² Descartes theorem
    8 E = mc² mass energy 18 f(b) − f(a) = f′(ε)(b − a) Lagrange
    9 f(x + 2π) = f(x) 19 lim_{n→∞} (1 + 1/n)ⁿ limit theorem
    10 a_n = a_1 + (n − 1)d arithmetic sequence 20 1/(√(2π)σ) · e^{−(x−μ)²/(2σ²)} normal distribution


    Precision is the accuracy rate: the proportion of relevant documents among all documents returned by the query. The calculation formula is shown in Eq (12).

    precision = SR_tp / (SR_tp + SR_fp)  (12)

    where SR_tp refers to the number of query-relevant documents in the query result, and SR_fp refers to the number of documents irrelevant to the query in the query result.

    Reciprocal rank (RR) is the reciprocal of the rank at which the first relevant document appears in the retrieved results. MRR is the average of the reciprocal ranks over multiple queries, calculated as shown in Eq (13).

    MRR = (1/k) Σ_{i=1}^{k} 1/rank(i)  (13)

    where k refers to the number of queries and rank(i) refers to the rank of the first relevant document for the i-th query.
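    Both measures can be computed directly from binary relevance lists of the ranked results (1 = relevant, 0 = not). A minimal sketch, with illustrative function names:

```python
def precision_at_k(rels, k):
    """Fraction of relevant documents among the top-k results (Eq 12 at cutoff k)."""
    top = rels[:k]
    return sum(top) / len(top)

def mean_reciprocal_rank(queries):
    """Eq (13): average over queries of 1/rank of the first relevant result."""
    total = 0.0
    for rels in queries:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(queries)

print(precision_at_k([1, 0, 1, 1, 0], 3))            # 2/3
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75
```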

    Table 8 shows the average values of P@3, P@5, P@10 and MRR over the 20 queries in Table 7, and Figure 4 shows the per-query values of P@3, P@5 and P@10.

    Table 8.  Average of precision and average of MRR.
    Evaluation indicator P@3 P@5 P@10 MRR
    Average 0.81 0.76 0.58 0.87

     | Show Table
    DownLoad: CSV
    Figure 4.  Precision value of different queries.

    Figure 4 shows that P@3 can reach 100% for some queries. However, the precision of other queries is low, because the dataset contains few scientific documents matching them.

    Average precision builds on precision by taking position into account and is therefore more sensitive to the ranking order. The calculation method is shown in Eq (14).

    AP = (1/r) Σ_{i} i / pos(i)  (14)

    where r refers to the total number of relevant documents, and pos(i) refers to the position of the i-th relevant document in the retrieved results.
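    Eq (14) can be implemented over the same binary relevance lists; the i-th relevant document, found at position pos(i), contributes i/pos(i). The function name is illustrative:

```python
def average_precision(rels):
    """Eq (14): AP = (1/r) * sum of i/pos(i) over the relevant documents."""
    r = sum(rels)             # total number of relevant documents
    total, found = 0.0, 0
    for pos, rel in enumerate(rels, start=1):
        if rel:
            found += 1        # this is the i-th relevant document
            total += found / pos
    return total / r

print(average_precision([1, 0, 1]))  # (1/1 + 2/3) / 2 ≈ 0.833
```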

    NDCG is the normalized discounted cumulative gain. The calculation method of DCG (discounted cumulative gain) is shown in Eq (15).

    DCG@k = Σ_{i=1}^{k} rel_i / log₂(i + 1)  (15)

    where reli refers to the relevance of the i-th document. There are three levels of relevance: good, fair and bad. They are assigned scores of 3, 2 and 1.

    In the ideal state, when the documents are ordered from most to least relevant, DCG takes its maximum value, called IDCG.

    IDCG@k = Σ_{i=1}^{|REL|} rel_i / log₂(i + 1)  (16)

    where REL refers to the document list sorted by relevance in the ideal state, truncated to the first k documents.

    NDCG uses IDCG to normalize the evaluation indicators.

    nDCG@k = DCG@k / IDCG@k  (17)
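    Eqs (15)-(17) translate directly into code. A small sketch over graded relevance lists (function names are illustrative); an already-ideal ranking gives nDCG = 1:

```python
import math

def dcg_at_k(rels, k):
    """Eq (15): graded relevance discounted by log2 of the position."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    """Eq (17): DCG normalized by the DCG of the ideal ordering (Eq 16)."""
    ideal = sorted(rels, reverse=True)  # ideal state: relevance from largest to smallest
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

print(ndcg_at_k([3, 2, 3, 1], 4))  # slightly below 1: positions 2 and 3 are swapped
```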

    The similarities of the five attributes of scientific documents (MESY, MESF, MECT, SDKY and FOME) are calculated separately. To verify the contribution of each attribute, an ablation experiment was carried out: one of the five attributes is removed in turn, and the remaining four are input into GBDT and LR for training, yielding five models. These five models are compared with the original model, and the results are shown in Figure 5. In Figure 5, model A represents MESF + MECT + SDKY + FOME, model B represents MESY + MECT + SDKY + FOME, model C represents MESY + MESF + SDKY + FOME, model D represents MESY + MESF + MECT + FOME, model E represents MESY + MESF + MECT + SDKY, and model F represents MESY + MESF + MECT + SDKY + FOME.

    Figure 5.  Results of ablation experiments.

    As shown in Figure 5, the MESY attribute affects the precision of the model: without it, fewer relevant results are retrieved, but those retrieved are ranked relatively higher, so the MAP and nDCG of model A are slightly higher. MESF also affects precision but has little effect on the ranking. The MECT and FOME attributes have little effect on precision but do influence the ranking of results. The SDKY attribute retrieves more relevant results and affects the ordering of the model to some extent.

    Figures 6 and 7 show the comparison of the algorithm in this study with Tangent-CFT [4] and MIaS [3]. MIaS is an open-source system, and the Tangent-CFT model was reproduced experimentally. Table 9 gives the average MAP and nDCG. Tangent-CFT [4] is a mathematical expression embedding model built with word2vec that can achieve precise matching of mathematical expression structure; it locates a scientific document according to a mathematical expression, realizing "mathematical expression - scientific document" retrieval (scientific document pairs corresponding to mathematical expressions). MIaS [3] is an open search engine for mathematical expressions that can likewise retrieve the corresponding scientific documents based on expression similarity. It builds an XML tree from the structure of mathematical expressions to retrieve query expressions and expressions containing the query expression as a sub-expression.

    Figure 6.  MAP comparison of different algorithms.
    Figure 7.  nDCG comparison of different algorithms.
    Table 9.  Comparison of the average MAP and NDCG of different algorithms.
    Algorithms Ours Tangent-CFT MIaS
    MAP 0.8192 0.7805 0.7570
    nDCG 0.8605 0.8300 0.8100

     | Show Table
    DownLoad: CSV

    This study proposes a multi-attribute retrieval and ranking model based on GBDT + LR to address the poor integration of mathematical expressions and their surrounding text in scientific document retrieval. The method combines the five attributes MESY, MESF, MECT, SDKY and FOME: GBDT recombines the features, LR trains on the recombined features, and the resulting similarity scores are used to rank the scientific documents.

    Future research is expected to complete the semantic retrieval of expression symbols based on the context of expressions, and to integrate expressions and text more effectively at the semantic level. When ranking the final scientific documents, attributes of the documents themselves, such as publication year and citation frequency, should also be considered; this can improve the rationality and effectiveness of the final ranking.

    This work is supported by the Natural Science Foundation of Hebei Province of China (Grant No. F2019201329).

    The authors declare no conflict of interest.
