
Deep learning for natural language interfaces for databases

Presentation at Bucharest Deep Learning meetup, 2nd April 2018

The quest for a simpler, more intuitive interface to query databases and other sources of structured information is one of the main practical applications of natural language processing. Developing a natural language interface for databases (NLIDB), able to process text queries directly, would be the optimal solution with regards to ease of use, requiring no additional knowledge and allowing non-technical persons to interact with the data directly.

During this talk we will explore the recent advances (datasets and methods / architectures) that allow developing deep neural network solutions for the NLIDB task. We will present two large datasets for this task, WikiSQL [1] and SENLIDB [3], which have been used to build several neural architectures for NLIDBs. The talk will present not only the differences between the proposed solutions, but also the important differences between the two datasets.

Some of the architectures proposed for SQL generation from natural language can also be used for code generation, question answering over linked data and other similar tasks.

References:
[1] Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv preprint arXiv:1709.00103. - https://arxiv.org/pdf/1709.00103.pdf
[2] Xu, X., Liu, C., & Song, D. (2017). SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv preprint arXiv:1711.04436. - https://arxiv.org/pdf/1711.04436.pdf
[3] Brad, F., Iacob, R., Hosu, I., & Rebedea, T. (2017). Dataset for a Neural Natural Language Interface for Databases (NNLIDB). arXiv preprint arXiv:1707.03172. - https://arxiv.org/pdf/1707.03172.pdf
[4] Newer results using dual encoders for the SENLIDB dataset (paper under review)



  1. Deep Learning for Natural Language Interfaces to Databases. Traian Rebedea, Ph.D., Department of Computer Science, UPB. traian.rebedea@cs.pub.ro / trebedea@gmail.com
  2. Overview • Introduction to Natural Language Interfaces to Databases (NLIDBs) • Previous work in NLIDB (before deep learning, brief) • New datasets suitable for data-driven solutions for NLIDB (including neural NLIDBs) • Proposed neural architectures for NLIDB and results • Some conclusions and future work
  3. Natural Language Interfaces to Databases • Natural language interfaces to databases (NLIDBs) have been developed for the last 40+ years • First works in the 60s-70s, focusing on: • Answering questions about rocks from the moon (LUNAR) • NLIDB for information about US Navy ships (LADDER) • Querying a database of aircraft maintenance and flight data (PLANES)
  4. Limitations of early NLIDBs • Very limited vocabulary, ~100-1000 words • Good accuracy, 70%-90% of queries answered correctly • Included dialogue for clarifications or requesting additional input • Some systems supported follow-up queries in a dialogue-style manner (for example, supporting co-references) • Lots of rules and heuristics, syntactic parsing, no machine learning (ML) • Output was some kind of logical form, not SQL • SQL appeared in 1974, ANSI standard in 1986
  5. Query examples from PLANES • Examples from PLANES (Waltz, 1978)
  6. Why NLIDBs? • Codd (1975) suggested that by the mid-90s most large database users would be casual users, not programmers, analysts or data scientists • They require a non-technical DB interface
  7. Traditional approaches to NLIDBs • More recent works extend and improve previous results to larger databases and more complex queries • Traditional approaches use a mix of pattern matching based on rules and heuristics, syntactic parsing, semantic grammar systems, and intermediate representation languages (logical forms) • Traditional datasets for evaluating NLIDB solutions contain hundreds of (input text, SQL statement) pairs • Simple database structure • Limited vocabulary • Non-complex queries • Impossible or very difficult to train an ML component (even for a subphase in solving the problem)
  8. Traditional approaches to NLIDBs - selection • PRECISE (Popescu et al., 2004) • Uses syntactic parse trees of the text query • Semantic interpretation to analyze and change the parse tree → rules and heuristics • Maximum flow used to compute the alignment between text and SQL candidates • NaLIR (Li and Jagadish, 2014) • Generates candidates as query trees, given the dependency tree, the database schema and associated semantic mappings → rules and heuristics • Best query tree determined using a scoring mechanism and interaction with the user to select the best choice (from a list of reformulations of the query tree into natural language) • The score of a query tree ~ number of alterations performed on the dependency tree to generate it, database proximity between nodes adjacent in the query tree, and syntactic correctness of the generated SQL query • Best paper award at VLDB 2015
  9. Traditional approaches to NLIDBs - selection • Sqlizer (Yaghmazadeh et al., 2017) • Generates a query sketch using a semantic parser • The sketch has a simpler grammar than SQL, can be considered an intermediate form, and does not contain any database schema information • The query sketch is completed using a rule-based system, and iteratively refined and repaired using rules and heuristics until the score of the generated SQL query cannot be improved → rules and heuristics • Employs ML and Word2Vec for generating and filling the query sketch
  10. Results of traditional NLIDBs • Best reported results are obtained by Sqlizer
  11. Results of traditional NLIDBs (cont.) • Best reported results are obtained by Sqlizer
  12. Results of traditional NLIDBs • In comparison, NaLIR has been tested only on MAS and obtains worse accuracy than Sqlizer • ~65% without user interaction • ~89% with user interaction
  13. Problems with traditional NLIDBs • They seem very overfit to the evaluation datasets • One can encode as many rules and heuristics as needed to increase the score on the small datasets • No clear separation between train and test • Actually only one dataset, used for both training (e.g. hand-crafted rules and heuristics) and evaluation • Datasets are too small to develop data-driven approaches with more complex ML components
  14. Large datasets for data-driven NLIDBs • In 2017, two large datasets were proposed to train data-driven NLIDB solutions • (and to assess any NLIDB on a larger, more complex evaluation dataset) • WikiSQL (Zhong et al., 2017) – almost 90k pairs containing queries over Wikipedia tables • SENLIDB (Brad et al., 2017) – 20k+ (text, SQL) pairs taken from Stack Exchange Data Explorer
  15. WikiSQL dataset • Dataset generated for queries over 24k+ Wikipedia tables • Schema-wise, the database is simplistic • all tables are independent from each other • no foreign keys or any other relationships between tables • The WikiSQL schema is not specific to a relational database • Union of several different sets of objects (one for each table), each having a set of specific attributes (the columns of each table)
  16. WikiSQL dataset • For each Wikipedia table that passes a quality check, 6 random SQL queries are generated using the following format • SELECT agg_op agg_col FROM table WHERE cond1_col cond1_op cond1 AND cond2_col cond2_op cond2 ... • For each generated query, a text query (question) is also created using a template/rule-based method • The rest of the data is crowdsourced on Amazon Mechanical Turk • One worker proposes a paraphrase for a given question • The proposed paraphrase is voted on by two other workers • Paraphrases very similar to the original question (using character edit distance) or labeled incorrect by the two other workers are discarded
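As a rough sketch, the query template above can be mimicked in a few lines of Python. The operator subsets, the `?` value placeholder, and the function name are illustrative assumptions, not the exact WikiSQL generator:

```python
import random

AGG_OPS = ["", "COUNT", "MIN", "MAX"]   # subset of WikiSQL aggregation operators
COND_OPS = ["=", ">", "<"]              # comparison operators in the WikiSQL dialect

def random_wikisql_query(table, columns, rng=None):
    """Generate one random query following the template:
    SELECT agg_op agg_col FROM table WHERE col op val [AND col op val]."""
    rng = rng or random.Random(0)
    agg = rng.choice(AGG_OPS)
    sel_col = rng.choice(columns)
    sel = f"{agg}({sel_col})" if agg else sel_col
    # pick 1-2 distinct condition columns; values are left as '?' placeholders
    cond_cols = rng.sample(columns, rng.randint(1, min(2, len(columns))))
    conds = [f"{c} {rng.choice(COND_OPS)} ?" for c in cond_cols]
    return f"SELECT {sel} FROM {table} WHERE " + " AND ".join(conds)
```

The generated SQL would then be verbalized by a template and paraphrased by crowd workers, as described above.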
  17. WikiSQL dataset • Entire corpus is available for download: https://github.com/salesforce/wikisql • Dataset is one order of magnitude larger than existing corpora for NLIDB or for similar tasks
  18. SENLIDB dataset • The Stack Exchange NLIDB dataset uses data from the Stack Exchange Data Explorer • Data Explorer contains all user SQL queries to the SE open database • Each user SQL query is annotated with a title and optionally with a longer text description • Similar to crowdsourced datasets • Queries might contain misleading descriptions, noise, grammatical errors • SQL queries are syntactically correct
  19. SENLIDB dataset 1. Crawled all available user queries • Paired the textual description with the associated SQL snippet 2. Discarded queries with SQL snippet longer than 2000 characters (2,000,000 items) 3. Removed duplicate pairs, with identical description and SQL snippet (600,000 items) 4. Removed pairs that contained SQL syntax within the textual description (170,000 items) 5. Filtered pairs with identical textual description and different SQL snippets, kept only English descriptions (25,276 items)
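The filtering steps above can be sketched as follows. The keyword regex used to detect SQL syntax in descriptions and the helper name `filter_pairs` are assumptions for illustration, and the English-language filter of step 5 is omitted:

```python
import re

# illustrative marker words for "SQL syntax inside the description" (step 4)
SQL_MARKERS = re.compile(r"\b(select|insert|update|delete|from|join)\b", re.I)

def filter_pairs(pairs, max_sql_len=2000):
    """pairs: list of (description, sql) tuples; returns the cleaned list."""
    # step 2: drop pairs whose SQL snippet exceeds the length cap
    pairs = [(d, s) for d, s in pairs if len(s) <= max_sql_len]
    # step 3: drop exact duplicates (identical description AND identical SQL)
    pairs = list(dict.fromkeys(pairs))
    # step 4: drop pairs whose textual description itself contains SQL syntax
    pairs = [(d, s) for d, s in pairs if not SQL_MARKERS.search(d)]
    # step 5: drop descriptions that map to more than one distinct SQL snippet
    sqls = {}
    for d, s in pairs:
        sqls.setdefault(d, set()).add(s)
    return [(d, s) for d, s in pairs if len(sqls[d]) == 1]
```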
  20. SENLIDB dataset • Also created a smaller evaluation/test dataset • We randomly extracted SQL queries with a very short textual description from SENLIDB Train (296 samples) • For each query, 1-3 manual annotations added by 2 experts (780 annotations, avg. 2.63 textual annotations / query) • Avg. BLEU score between the descriptions provided by the experts: 57.10
  21. Neural NLIDB architectures • Several neural NLIDB solutions have been proposed • In the simplest case, treat SQL generation from text as a machine translation (MT) task • Use models proposed for neural MT • Simple baseline: SEQ2SEQ architecture (with attention) • Several differences between NLIDB and NMT • Output language (SQL) has a fixed/standard syntax, unlike natural languages • Datasets for NLIDB are still much smaller than the ones for NMT • Code generation from problem descriptions using neural models • Copying/pointing mechanism (Vinyals et al., 2015) improves over the SEQ2SEQ baseline • Syntax-aware decoding (Yin and Neubig, 2017) for artificial languages (e.g. SQL or Python)
  22. SEQ2SEQ baseline • Sequence-to-sequence models (SEQ2SEQ) with attention • Images from https://devblogs.nvidia.com/introduction-neural-machine-translation-gpus-part-3/
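The attention step shown in the figures can be sketched in pure Python: score each encoder state against the current decoder state, softmax the scores, and form a weighted context vector. This is a minimal dot-product variant for illustration; real implementations work on batched tensors and often use learned (additive) scoring:

```python
import math

def dot_attention(query, keys):
    """query: decoder state (list of floats); keys: encoder states.
    Returns (attention weights, context vector)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                          # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]          # softmax over encoder positions
    context = [sum(w * key[i] for w, key in zip(weights, keys))
               for i in range(len(keys[0]))] # weighted sum of encoder states
    return weights, context
```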
  23. SEQ2SEQ with attention – example in NMT
  24. Copying/pointing mechanism for SEQ2SEQ • Sometimes, it is useful to be able to copy a token from the input sequence • Also true for NLIDBs • Image from Gu et al. (2016)
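Copying can be sketched as mixing the decoder's generation distribution with the attention weights over source tokens. The scalar gate `p_gen` and the function name follow the common pointer-generator formulation and are assumptions for illustration, not the exact model of Gu et al. (2016):

```python
def copy_distribution(p_gen, vocab_dist, attn_weights, source_tokens):
    """Final P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass
    on source positions holding w. Allows emitting out-of-vocabulary
    tokens (e.g. table names, constants) seen in the input."""
    final = {w: p_gen * p for w, p in vocab_dist.items()}
    for a, tok in zip(attn_weights, source_tokens):
        final[tok] = final.get(tok, 0.0) + (1.0 - p_gen) * a
    return final
```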
  25. Seq2SQL for WikiSQL (Zhong et al., 2017) • First proposed to use an augmented pointer network instead of a SEQ2SEQ model • Input contains the concatenation of the column names in the current table, the question in natural language, and the keywords and operators for the limited SQL dialect used by WikiSQL • Encoder: two-layer bidirectional LSTM (encoding x) • Decoder: two-layer unidirectional LSTM with attention over the input
  26. Seq2SQL for WikiSQL • The Seq2SQL model consists of three different classifiers • Makes use of the (known, simple) structure of the SQL queries in the dataset 1. Aggregation: 2-layer MLP over aggregated hidden representations of the inputs (weighted using attention), 4 possible outputs: COUNT, MIN, MAX, and NONE
  27. Seq2SQL for WikiSQL 2. Select column: compute a column representation using an LSTM and a question representation similar to 1. Then combine the two representations as input for an MLP. 3. Where clause: use the augmented pointer network + reinforcement learning (positive reward for generating a query that produces the same results, negative reward for different results, strong negative reward for an invalid SQL query)
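The execution-based reward for the WHERE decoder can be sketched as below. The `execute` callable and the concrete reward magnitudes are illustrative assumptions; the paper only specifies positive, negative, and strongly negative rewards:

```python
def where_clause_reward(generated_sql, gold_sql, execute):
    """Reward in the spirit of Seq2SQL's RL training: compare execution
    results of the generated query against the gold query."""
    try:
        result = execute(generated_sql)
    except Exception:
        return -2.0   # strong negative reward: query is not valid SQL
    # positive reward for matching results, negative otherwise
    return 1.0 if result == execute(gold_sql) else -1.0
```

This reward is then fed into a policy-gradient update of the pointer decoder, so that any query producing the correct result (not just the gold string) is reinforced.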
  28. Seq2SQL for WikiSQL • Seq2SQL improves the results over the baseline (SEQ2SEQ) and over the augmented pointer network • Pointing is really important (substantial increase in accuracy)
  29. Seq2SQL for WikiSQL
  30. SQLNet for WikiSQL (Xu et al., 2017) • The most challenging part is to generate the WHERE clause in WikiSQL • Proposes to use a query sketch which aligns to the SQL grammar • SQLNet only fills in the slots in the generated query sketch • Does not use RL (policy gradient) to finetune the results of the model by taking into account that multiple queries can have the same (correct) results • Does not rely on sequence-to-sequence models, as the order is not important (between subclauses in WHERE, for example)
  31. SQLNet for WikiSQL • Introduces a simple query sketch that is expressive enough to represent all queries in WikiSQL • Also adds a dependency between elements in the sketch • Seq2SQL also uses a simple sketch (column and aggregation are separate, however the select clause is generated in a SEQ2SEQ manner)
  32. SQLNet for WikiSQL • The authors propose two new concepts 1. Sequence to set: to determine the most probable columns in a query given the question and the table structure (column names). Can be viewed as an MLP with one layer over the embeddings computed by 2 LSTMs (one for the question, one for the column names) 2. Column attention: compute an attention mechanism between tokens in the question and column names (try to learn that some tokens signal the use of specific columns). Then use a 2-layer MLP to generate the decision over the columns.
  33. SQLNet for WikiSQL • Three different classifiers to form the subclauses in the WHERE clause, some using sequence to set and column attention 1. Column slots: use an MLP over P(col|Q) to decide the number of columns to pick, then choose columns in descending order of P(col|Q) 2. OP slot: use an MLP to pick the most probable operator (=, <, >) 3. VALUE slot: uses a copy/pointer SEQ2SEQ to copy a substring from the input question, stops when it generates the <END> token (which also appears at the end of the input)
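Once the networks output probabilities, slot filling for steps 1 and 2 above reduces to simple argmax-style selections. A sketch with hypothetical helper names (the real model computes these probabilities with sequence-to-set and column attention):

```python
def fill_where_columns(col_probs, num_cols):
    """col_probs: {column_name: P(col|Q)}. Pick the predicted number of
    WHERE columns in descending probability order (column slots)."""
    ranked = sorted(col_probs, key=col_probs.get, reverse=True)
    return ranked[:num_cols]

def fill_op_slot(op_probs):
    """op_probs: {operator: probability}. Pick the most probable
    comparison operator for a column slot (OP slot)."""
    return max(op_probs, key=op_probs.get)
```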
  34. SQLNet for WikiSQL • Predicting the SELECT clause is the last part • Only one column is picked, similar to the prediction of columns in the WHERE clause • Aggregation operator selected using an MLP
  35. SQLNet for WikiSQL • Training uses GloVe embeddings, which are fine-tuned • Word embeddings are shared among the different models, with different LSTM weights • Sequence-to-set training uses weighted cross-entropy • Results better than Seq2SQL
  36. Dual-encoder model for NLIDBs • Recently, Hosu et al. (2018) proposed a two-step model using a dual encoder to generate SQL from text input • Tested both on WikiSQL and SENLIDB • Makes use of a query sketch similar to SQLNet • The sketch is obtained by replacing all non-SQL keywords, reserved chars and operators within each SQL statement with placeholders (e.g. table names with <table>, similarly for column names, aliases, variables, and constants) • The SQL sketch only stores the surface form of a SQL statement • It is syntactically correct • Does not contain any database schema information
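A rough regex-based approximation of this sketch extraction is shown below. The keyword list is partial, and collapsing tables, columns, aliases and variables into a single `<id>` placeholder simplifies the distinct placeholders described above; both are assumptions for illustration:

```python
import re

SQL_KEYWORDS = {"select", "from", "where", "and", "or", "group", "by", "order",
                "limit", "top", "as", "count", "min", "max", "avg", "sum"}

def sql_sketch(sql):
    """Keep only the surface form of a statement: SQL keywords, operators
    and punctuation survive; identifiers and literals become placeholders."""
    # tokenize: string literals | numbers | identifiers | single symbols
    tokens = re.findall(r"'[^']*'|\d+|[A-Za-z_][A-Za-z0-9_]*|[^\sA-Za-z0-9_]", sql)
    out = []
    for tok in tokens:
        if tok.lower() in SQL_KEYWORDS:
            out.append(tok.upper())
        elif tok.startswith("'") or tok[0].isdigit():
            out.append("<const>")
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("<id>")
        else:
            out.append(tok)   # operators and punctuation pass through
    return " ".join(out)
```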
  37. Dual-encoder model for NLIDBs • Two-step neural model 1. Employ a network that learns to generate a SQL sketch given the user’s request 2. Using both the generated sketch and the natural language query, use a dual-encoder model to generate the final SQL statement • SQL sketch generation is a simpler task, whose intermediate result provides important additional insights for determining the correct SQL statement
  38. Dual-encoder model for NLIDBs • As the sketch preserves the SQL syntax of the final statement and removes semantic information (e.g. names of tables, columns, constants), the proposed two-step model effectively splits the SQL generation problem into two distinct subproblems 1. Learn the syntax of SQL and generate the most probable SQL sketch given a query in natural language. This refers to learning the surface structure of a SQL statement: the order of SQL reserved keywords in a (syntactically) correct query, the precedence of different clauses and subclauses in the statement, the usage of different operators. 2. Generate semantically correct SQL statements for a given database schema, taking into account both the user’s description and the most probable SQL sketch. This task mainly involves choosing the correct table and column names to fill in the sketch, generating the appropriate clauses (WHERE, GROUP BY etc.) given these names, but also correcting the sketch based on the additional semantic and schema information.
  39. Dual-encoder model for NLIDBs • Sketch generation using a SEQ2SEQ model with attention
  40. Dual-encoder model for NLIDBs • Final SQL statement is constructed using a dual-encoder SEQ2SEQ • One encoder for the question in natural language • One encoder for the query sketch • Attention can be added to both encoders and combined
  41. Dual-encoder model for NLIDBs • Results are improved both for SENLIDB and WikiSQL compared to a SEQ2SEQ baseline • Adding a pointing mechanism to both encoders improved the accuracy on the WikiSQL test set to 55.1% (compared to 51.6% for Seq2SQL and 61.3% for SQLNet) • Results can be improved further, e.g. the use of pretrained word embeddings has not been investigated
  42. Dual-encoder model for NLIDBs • Sample results on the SENLIDB dataset
  43. Dual-encoder model for NLIDBs • Important insights provided by the two-step model with a dual encoder • For WikiSQL, generating the query sketch is simple (accuracy 82.3%), while it is very difficult for SENLIDB (accuracy 5.1%) • Given the correct sketch as input for the dual encoder, SENLIDB becomes a simpler task (accuracy 25.9% vs 4% with the generated sketch) while WikiSQL remains about the same (accuracy 39% vs 39% with the generated sketch) • In many situations the dual encoder is able to correct/modify an incorrect query sketch received as input • Even if the sketch is slightly incorrect, it provides important input to improve overall SQL generation on both tasks
  44. Conclusions • There is still a long way to go for building a database-independent natural language interface to databases • Traditional solutions using rules and heuristics, syntactic parsing, and alignment between text and database schema offer good results (accuracy ~80%) for small databases • Limited vocabulary and types of queries • Need to change the rules and heuristics when used on a new database • Few or no ML components • Small datasets, probably some overfitting • Neural models have been proposed only recently for larger databases and more complex queries • Larger vocabulary and more varied queries • Large datasets for training and slightly larger sets for evaluation • For simpler queries (WikiSQL), accuracy is pretty high (~60%) • For more complex queries (SENLIDB), accuracy is very low (~5%) • Problems with the quality of the datasets (crowdsourcing) • Future improvements • Syntax-aware decoders • More complex reinforcement learning • Other architectures?
  45. References (selection) [1] Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv preprint arXiv:1709.00103. https://arxiv.org/pdf/1709.00103.pdf [2] Xu, X., Liu, C., & Song, D. (2017). SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv preprint arXiv:1711.04436. https://arxiv.org/pdf/1711.04436.pdf [3] Brad, F., Iacob, R., Hosu, I., & Rebedea, T. (2017). Dataset for a Neural Natural Language Interface for Databases (NNLIDB). arXiv preprint arXiv:1707.03172. https://arxiv.org/pdf/1707.03172.pdf [4] Newer results using dual encoders for the SENLIDB dataset (paper under review, Hosu et al. (2018))
  46. References (all, in order of appearance in slides) • Waltz, D. L. (1978). An English language question answering system for a large relational database. Communications of the ACM, 21(7), 526-539. • Codd, E. F., & Date, C. J. (1975, January). Interactive support for non-programmers: The relational and network approaches. In Proceedings of the 1974 ACM SIGFIDET (now SIGMOD) workshop on Data description, access and control: Data models: Data-structure-set versus relational (pp. 11-41). ACM. • Popescu, A. M., Armanasu, A., Etzioni, O., Ko, D., & Yates, A. (2004, August). Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In Proceedings of the 20th International Conference on Computational Linguistics (p. 141). Association for Computational Linguistics. • Li, F., & Jagadish, H. V. (2014, June). NaLIR: an interactive natural language interface for querying relational databases. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (pp. 709-712). ACM. • Yaghmazadeh, N., Wang, Y., Dillig, I., & Dillig, T. (2017). Sqlizer: Query synthesis from natural language. Proceedings of the ACM on Programming Languages, 1(OOPSLA), 63. • Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. arXiv preprint arXiv:1709.00103. • Brad, F., Iacob, R., Hosu, I., & Rebedea, T. (2017). Dataset for a Neural Natural Language Interface for Databases (NNLIDB). arXiv preprint arXiv:1707.03172. • Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer networks. In Advances in Neural Information Processing Systems (pp. 2692-2700). • Yin, P., & Neubig, G. (2017). A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696. • Gu, J., Lu, Z., Li, H., & Li, V. O. (2016). Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393. • Xu, X., Liu, C., & Song, D. (2017). SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. arXiv preprint arXiv:1711.04436. • Hosu et al. (2018). Under review, to be added later
  47. Thank you! traian.rebedea@cs.pub.ro
