Effective and explainable mechanisms for natural language interface in databases

Limited Access
This item is unavailable until:
2022-04-01
Date
2021-09
Advisor
Ulusoy, Özgür
Publisher
Bilkent University
Language
English
Abstract

Structured Query Language (SQL) is a commonly used tool for extracting and presenting structured data stored in Relational Database Management Systems (RDBMSs), yet the inherent complexity of SQL creates a barrier for naive users, who are able to express their queries as natural language queries (NLQs). To tackle this barrier, we propose two different solutions: a Natural Language Interface to Database (NLIDB) pipeline with an explainable AI interface, and a semantic search strategy. The first solution introduces an NLIDB pipeline that uses SQL translation algorithms along with a keyword mapper to generate SQL queries for given NLQs. The proposed pipeline is presented to the user through an explainable AI interface so that the user can reason about the constructed query. We compared our approach with two state-of-the-art systems, NALIR+ and Pipeline+. Our approach surpasses NALIR+ on the imdb, scholar and yelp datasets, achieving 88.9%, 100% and 60.0% translation accuracy for single-table SELECT-JOIN queries and 68.6%, 87.0% and 83.6% translation accuracy for multiple-table SELECT-JOIN queries, respectively. Our approach outperforms Pipeline+ on the imdb and scholar datasets, but Pipeline+ is slightly better on the yelp dataset. The second solution proposes a semantic search approach that uses Information Retrieval based methods to retrieve the table rows related to a given NLQ. The approach uses a graph representation of the database in which each row and each value is represented by a node, and edges represent the relations between them. Queries and database rows are converted to vector representations using this graph representation and Graph Convolutional Networks (GCNs). Similarity is then computed between these vector representations, using cosine distance as the metric, and database rows are ranked according to their relevance to the query. We tested our approach on the college schema from the Spider dataset collection and achieved 42.8% top-5 accuracy.
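
The ranking and evaluation step of the semantic search approach can be illustrated with a minimal sketch. It is not the thesis implementation: the function names (`cosine_similarity`, `rank_rows`, `top_k_accuracy`), the toy two-dimensional vectors, and the row identifiers are all hypothetical. The vectors stand in for embeddings that the described approach would obtain from the GCN. The sketch also assumes that a query counts as a top-k hit when at least one relevant row appears among the k highest-ranked rows; the abstract does not state the exact definition used.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_rows(query_vec, row_vecs, k=5):
    """Return the k (row_id, score) pairs most similar to the query, best first."""
    scored = [(row_id, cosine_similarity(query_vec, vec))
              for row_id, vec in row_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def top_k_accuracy(queries, row_vecs, k=5):
    """Fraction of queries with at least one relevant row in the top k.

    Assumption: this hit criterion is illustrative, not taken from the thesis.
    `queries` is a list of (query_vector, relevant_row_ids) pairs.
    """
    hits = 0
    for query_vec, relevant_ids in queries:
        top_ids = {row_id for row_id, _ in rank_rows(query_vec, row_vecs, k)}
        if top_ids & set(relevant_ids):
            hits += 1
    return hits / len(queries)


# Hypothetical embeddings standing in for GCN output.
rows = {"row_a": [1.0, 0.0], "row_b": [0.0, 1.0], "row_c": [1.0, 1.0]}
queries = [([1.0, 0.0], ["row_a"]), ([0.0, 1.0], ["row_a"])]

print(rank_rows([1.0, 0.0], rows, k=2))
print(top_k_accuracy(queries, rows, k=1))
```

Ranking by descending cosine similarity is equivalent to ranking by ascending cosine distance, since the distance is one minus the similarity.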
