oracle - Primary keys and indexes in Hive query language is poosible or not?

Question

Welcome To Ask or Share your Answers For Others

oracle - Primary keys and indexes in Hive query language is poosible or not?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:42:27+0000

Hive indexing was introduced in Hive 0.7.0 (HIVE-417) and removed in Hive 3.0 (HIVE-18448) Please read comments in this Jira. The feature was completely useless in Hive. These indexes was too expensive for big data, RIP.

As of Hive 2.1.0 (HIVE-13290) Hive includes support for non-validated primary and foreign key constraints. These constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive. These constraints are useful for tools generating ER diagrams and queries. Also such non-validated constraints are useful as self-documenting. You can easily find out what is supposed to be a PK if the table has such constraint.

In Oracle database Unique, PK and FK constraints are backed with indexes, so they can work fast and are really useful. But this is not how Hive works and what it was designed for.

Quite normal scenario is when you loaded very big file with semi-structured data in HDFS. Building an index on it is too expensive and without index to check PK violation is possible only to scan all the data. And normally you cannot enforce constraints in BigData. Upstream process can take care about data integrity and consistency but this does not guarantee you finally will not have PK violation in Hive in some big table loaded from different sources.

Some file storage formats like ORC have internal light weight "indexes" to speed-up filtering and enabling predicate push down (PPD), no PK and FK constraints are implemented using such indexes. This cannot be done because normally you can have many such files belonging to the same table in Hive and files even can have different schemas. Hive created for petabytes and you can process petabytes in single run, data can be semi-structured, files can have different schemas. Hadoop does not support random writes and this adds more complications and cost if you want to rebuild indexes.

Categories

oracle - Primary keys and indexes in Hive query language is poosible or not?

oracle - Primary keys and indexes in Hive query language is poosible or not?

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags