Effective and explainable mechanisms for natural language interface in databases

Karakayalı, Akifhan

Limited Access

This item is unavailable until:
2022-04-01

Files

10424894.pdf (878.42 KB)

Date

2021-09

Authors

Karakayalı, Akifhan

Advisor

Ulusoy, Özgür

Publisher

Bilkent University

Language

English

Type

Thesis

Abstract

Structured Query Language (SQL) is a commonly used tool to extract and present structured data stored in Relational Database Management Systems (RDBMSs), yet the inherent complexity of SQL creates a barrier for naive users, who are capable of expressing their queries as natural language queries (NLQs). To tackle this barrier, we propose two different solutions: a Natural Language Interface to Database (NLIDB) pipeline with an explainable AI interface, and a semantic search strategy. The first solution introduces an NLIDB pipeline that uses SQL translation algorithms along with a keyword mapper to generate SQL queries for given NLQs. The proposed pipeline is presented to the user through an explainable AI interface so that the user can reason over the constructed query. We compared our approach with two state-of-the-art systems, NALIR+ and Pipeline+. Our approach surpasses NALIR+ on the imdb, scholar and yelp datasets, achieving 88.9%, 100% and 60.0% translation accuracy for single-table SELECT-JOIN queries and 68.6%, 87.0% and 83.6% translation accuracy for multiple-table SELECT-JOIN queries, respectively. Our approach outperforms Pipeline+ on the imdb and scholar datasets, but Pipeline+ is slightly better on the yelp dataset. The second solution proposes a semantic search approach that uses Information Retrieval based methods to retrieve the table rows related to a given NLQ. The proposed approach uses a graph representation of the database in which each row and each value is represented by a node, and edges represent the relations between them. The query and the database rows are converted to vector representations using this graph representation and Graph Convolutional Networks (GCNs). A similarity calculation is then performed on these vector representations, and the database rows are ranked according to their relevance to the query. The cosine distance metric is employed for the similarity calculation. We tested our approach on the college schema from the Spider dataset collection and achieved a top-5 accuracy of 42.8%.
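
The ranking step of the second solution can be illustrated with a minimal sketch. It is not the thesis implementation: the row identifiers, the three-dimensional vectors, the function names (`cosine_similarity`, `rank_rows`, `top_k_hit`) and the hit criterion are all assumptions made for illustration, and the vectors are written by hand where the thesis would obtain them from a GCN. Ranking by descending cosine similarity is equivalent to ranking by ascending cosine distance.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_rows(query_vec, row_vecs, k=5):
    """Return the k (row_id, score) pairs most similar to the query, best first."""
    scored = [(row_id, cosine_similarity(query_vec, vec))
              for row_id, vec in row_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def top_k_hit(ranked, relevant_ids):
    """Assumed hit criterion: at least one relevant row appears in the top-k list."""
    return any(row_id in relevant_ids for row_id, _ in ranked)

# Hypothetical, hand-written embeddings standing in for GCN outputs.
row_vecs = {
    "student:1": [1.0, 0.0, 0.0],
    "student:2": [0.8, 0.6, 0.0],
    "course:7": [0.0, 1.0, 0.0],
    "course:9": [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.1, 0.0]

ranked = rank_rows(query_vec, row_vecs, k=2)
for row_id, score in ranked:
    print(f"{row_id}\t{score:.3f}")
```

Averaging `top_k_hit` over a set of test queries with k=5 would give one possible reading of a top-5 accuracy figure. How the thesis defines that metric is not stated in the abstract.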