Peer-Reviewed

Extracting Structured Data from Text in Natural Language

Received: 6 August 2021    Accepted: 20 August 2021    Published: 31 August 2021
Abstract

The amount of information on the web is enormous. A large part of it takes the form of articles, descriptions, posts and comments, i.e. free text in natural language, which is hard to use in that format, whereas in structured form it could serve many purposes. This paper proposes an approach for extracting data given as free text in natural language into structured data, for example a table. Structured information is easy to search and analyze. Structured data is quantitative, while unstructured data is qualitative. A tool that converts text into structured data will not only provide an automatic mechanism for data extraction but will also save resources for processing and storing the extracted data. Extracting data from text will also automate the derivation of useful insights from data that is usually processed by people. The efficiency and accuracy of the process will increase, the probability of human error will be reduced, and the amount of processed data will no longer be limited by human resources.
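
To make the idea of turning free text into a table concrete, the following is a minimal sketch only. It is not the method of the paper, which, judging by the keywords, builds on Rasa; instead it assumes a much simpler rule-based technique, a regular expression with named groups. The sentence template, the field names and the functions `extract_rows` and `to_csv` are all hypothetical and chosen purely for illustration.

```python
import csv
import io
import re

# Hypothetical pattern for toy sentences such as
# "Anna bought 3 apples for 4.50 EUR." (an assumption, not from the paper).
PATTERN = re.compile(
    r"(?P<buyer>[A-Z][a-z]+) bought (?P<quantity>\d+) (?P<item>[a-z]+) "
    r"for (?P<price>\d+(?:\.\d+)?) (?P<currency>[A-Z]{3})"
)

# Column order of the resulting table.
FIELDS = ["buyer", "quantity", "item", "price", "currency"]


def extract_rows(text):
    """Return one dict per matching sentence, with numeric fields converted."""
    rows = []
    for match in PATTERN.finditer(text):
        row = match.groupdict()
        row["quantity"] = int(row["quantity"])
        row["price"] = float(row["price"])
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize the extracted rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


text = (
    "Anna bought 3 apples for 4.50 EUR. "
    "Later that day, Boris bought 12 pencils for 6 USD."
)
rows = extract_rows(text)
print(to_csv(rows))
```

A fixed pattern like this only covers sentences that follow the template exactly, which is the limitation that motivates learned entity extraction for real free text.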

Published in International Journal of Intelligent Information Systems (Volume 10, Issue 4)
DOI 10.11648/j.ijiis.20211004.16
Page(s) 74-80
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2024. Published by Science Publishing Group

Keywords

Data Extraction, Structured Data, Unstructured Data, Automation, NLP, RASA

Cite This Article
    APA Style

    Zheni Mincheva, Nikola Vasilev, Ventsislav Nikolov, Anatoliy Antonov. (2021). Extracting Structured Data from Text in Natural Language. International Journal of Intelligent Information Systems, 10(4), 74-80. https://doi.org/10.11648/j.ijiis.20211004.16


    ACS Style

    Zheni Mincheva; Nikola Vasilev; Ventsislav Nikolov; Anatoliy Antonov. Extracting Structured Data from Text in Natural Language. Int. J. Intell. Inf. Syst. 2021, 10(4), 74-80. doi: 10.11648/j.ijiis.20211004.16


    AMA Style

    Zheni Mincheva, Nikola Vasilev, Ventsislav Nikolov, Anatoliy Antonov. Extracting Structured Data from Text in Natural Language. Int J Intell Inf Syst. 2021;10(4):74-80. doi: 10.11648/j.ijiis.20211004.16


    BibTeX

    @article{10.11648/j.ijiis.20211004.16,
      author = {Zheni Mincheva and Nikola Vasilev and Ventsislav Nikolov and Anatoliy Antonov},
      title = {Extracting Structured Data from Text in Natural Language},
      journal = {International Journal of Intelligent Information Systems},
      volume = {10},
      number = {4},
      pages = {74-80},
      doi = {10.11648/j.ijiis.20211004.16},
      url = {https://doi.org/10.11648/j.ijiis.20211004.16},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijiis.20211004.16},
     year = {2021}
    }
    


    RIS

    TY  - JOUR
    T1  - Extracting Structured Data from Text in Natural Language
    AU  - Zheni Mincheva
    AU  - Nikola Vasilev
    AU  - Ventsislav Nikolov
    AU  - Anatoliy Antonov
    Y1  - 2021/08/31
    PY  - 2021
    N1  - https://doi.org/10.11648/j.ijiis.20211004.16
    DO  - 10.11648/j.ijiis.20211004.16
    T2  - International Journal of Intelligent Information Systems
    JF  - International Journal of Intelligent Information Systems
    JO  - International Journal of Intelligent Information Systems
    SP  - 74
    EP  - 80
    PB  - Science Publishing Group
    SN  - 2328-7683
    UR  - https://doi.org/10.11648/j.ijiis.20211004.16
    VL  - 10
    IS  - 4
    ER  - 


Author Information
  • Zheni Mincheva, Eurorisk Systems Ltd, Varna, Bulgaria

  • Nikola Vasilev, Eurorisk Systems Ltd, Varna, Bulgaria

  • Ventsislav Nikolov, Eurorisk Systems Ltd, Varna, Bulgaria

  • Anatoliy Antonov, Eurorisk Systems Ltd, Varna, Bulgaria
