<< 返回文章列表

深入解析:DB2 V10.5新特性列式存储表的优点与缺点

2018年7月18日
李培杨
2016

什么是列组织表

列组织表技术是DB10.5 BLU引入的新技术,按列组织的表,更加方便了复杂的查询,如果将这种表格式与星型模式数据集市配合,那么可以简化设计和调优,从而显著改进存储性能、查询性能和易用性。

列组织表的管理方式

image.png

 

可以看到行组织模式的表是都是存放在一起的,而列组织模式的表是被分开保存的。

列存储适用环境

列存储主要适合 OLAP 类型应用,比如数据仓库系统,数据挖掘系统,决策支持系统等。这些系统有一些共同特征:数据量大,查询语句比重大,复杂的查询多。

如何启用列存储

官方文档上描述的是将 DB2_WORKLOAD 注册表变量设置为 ANALYTICS,更改后重启便会生效。这样做有助于配置内存、表组织、页大小和扩展数据块大小,并且会启用工作负载管理。

4.1 更改环境变量

[db2inst1@enmodb2 ~]$ db2set DB2_WORKLOAD=ANALYTICS

[db2inst1@enmodb2 ~]$ db2set -all

[i] DB2_WORKLOAD=ANALYTICS

[g] DB2_COMPATIBILITY_VECTOR=MYS

[g] DB2SYSTEM=enmodb2

[g] DB2INSTDEF=db2inst1

4.2 创建列组织测试表

[db2inst1@enmodb2 ~]$ db2 "create table  testinfo (empno char(6),lastname varchar(15),hirdate date,salary decimal(9),comm decimal(9)) organize by column"

DB20000I  The SQL command completed successfully.

用户在没有指定organize by的情况下可以将数据库参数dft_table_org 设置为 COLUMN,缺省就是创建列组织表。

4.3 添加数据

有选择的导出employee的数据。

[db2inst1@enmodb2 ~]$ db2 "export to '/home/db2inst1/export/employee.del' of del messages exp_employee.msg select empno,lastname,hiredate,salary,comm from employee"

 

Number of rows exported: 42

 

将数据导入testinfo中。

[db2inst1@enmodb2 ~]$ db2 "load from '/home/db2inst1/export/employee.del' of del replace into testinfo"

SQL3501W  The table space(s) in which the table resides will not be placed in

backup pending state since forward recovery is disabled for the database.

 

SQL3109N  The utility is beginning to load data from file

"/home/db2inst1/export/employee.del".

 

SQL3500W  The utility is beginning the "ANALYZE" phase at time "06/16/2018

02:12:28.214959".

 

SQL3519W  Begin Load Consistency Point. Input record count = "0".

 

SQL3520W  Load Consistency Point was successful.

 

SQL3515W  The utility has finished the "ANALYZE" phase at time "06/16/2018

02:12:29.119846".

 

SQL3500W  The utility is beginning the "LOAD" phase at time "06/16/2018

02:12:29.121149".

 

SQL3110N  The utility has completed processing.  "42" rows were read from the

input file.

 

SQL3519W  Begin Load Consistency Point. Input record count = "42".

 

SQL3520W  Load Consistency Point was successful.

 

SQL3515W  The utility has finished the "LOAD" phase at time "06/16/2018

02:12:29.836141".

 

SQL3500W  The utility is beginning the "BUILD" phase at time "06/16/2018

02:12:29.842413".

 

SQL3213I  The indexing mode is "REBUILD".

 

SQL3515W  The utility has finished the "BUILD" phase at time "06/16/2018

02:12:30.225305".

 

 

Number of rows read         = 42

Number of rows skipped      = 0

Number of rows loaded       = 42

Number of rows rejected     = 0

Number of rows deleted      = 0

Number of rows committed    = 42

4.4 查询数据

然后对数据进行查询

[db2inst1@enmodb2 ~]$ db2 "select empno,lastname,salary,hirdate from testinfo"

以下为testinfo表的部分数据

EMPNO  LASTNAME        SALARY      HIRDATE   

------ --------------- ----------- ----------

000010 HAAS                229125. 01/01/1995

000020 THOMPSON            127237. 10/10/2003

000030 KWAN                127725. 04/05/2005

000050 GEYER                80175. 08/17/1979

000060 STERN                83087. 09/14/2003

000070 PULASKI             120212. 09/30/2005

000090 HENDERSON            89750. 08/15/2000

000100 SPENSER              86150. 06/19/2000

000110 LUCCHESSI            99750. 05/16/1988

000120 O'CONNELL            73875. 12/05/1993

000130 QUINTANA             95940. 07/28/2001

000140 NICHOLLS             88946. 12/15/2006

000150 ADAMSON              63572. 02/12/2002

000160 PIANKA               71587. 10/11/2006

000170 YOSHIMURA            51382. 09/15/1999

000180 SCOUTTEN             59041. 07/07/2003

4.5 行组织表与列组织表的对比

同样的查询在行组织表与列组织表的区别在进行全表扫描或者几个列全扫描的时候列组织表可以发挥出很大的优势,在列式存储中同类型的数据存放在同一个block里面,压缩性能比较好。而且在列式存储中,任何列都可以作为索引。

行组织表

Original Statement:

------------------

select

  empno,

  lastname,

  salary,

  hirdate

from

  testinfo

 

 

Optimized Statement:

-------------------

SELECT

  Q1.EMPNO AS "EMPNO",

  Q1.LASTNAME AS "LASTNAME",

  Q1.SALARY AS "SALARY",

  Q1.HIRDATE AS "HIRDATE"

FROM

  DB2INST1.TESTINFO AS Q1

 

Access Plan:

-----------

        Total Cost:             49.3321

        Query Degree:           1

 

        Rows

       RETURN

       (   1)

        Cost

         I/O

         |

         42

       CTQ   

       (   2)

       49.3321

          7

         |

         42

       TBSCAN

       (   3)

       49.3297

          7

         |

         42

 CO-TABLE: DB2INST1

      TESTINFO

         Q1

列组织表

Original Statement:

------------------

select

  empno,

  lastname,

  salary,

  hiredate

from

  employee

 

 

Optimized Statement:

-------------------

SELECT

  Q1.EMPNO AS "EMPNO",

  Q1.LASTNAME AS "LASTNAME",

  Q1.SALARY AS "SALARY",

  Q1.HIREDATE AS "HIREDATE"

FROM

  DB2INST1.EMPLOYEE AS Q1

 

Access Plan:

-----------

        Total Cost:             82.3543

        Query Degree:           1

 

      Rows

     RETURN

     (   1)

      Cost

       I/O

       |

      1000

     TBSCAN

     (   2)

     82.3543

       12

       |

      1000

 TABLE: DB2INST1

    EMPLOYEE

       Q1

列式存储优点

5.1 自动回收空间

DB2_WORKLOAD设置成ANALYTICS 的时候,对于列组织表默认会开启reorg,这个时候DB2的列组织表就会处于一个自动维护的状态。DBA也不用再对一些表进行定期的reorg操作,而且存储空间支持在线释放,不会影响系统的正常运行。

[db2inst1@enmodb2 ~]$ db2 update db cfg using AUTO_REORG on

DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

[db2inst1@enmodb2 ~]$ db2 update db cfg using auto_maint on

DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

[db2inst1@enmodb2 ~]$ db2 update db cfg using auto_tbl_maint on

DB20000I  The UPDATE DATABASE CONFIGURATION command completed successfully.

[db2inst1@enmodb2 ~]$ db2 restart db testinfo

DB20000I  The RESTART DATABASE command completed successfully.

Automatic maintenance                      (AUTO_MAINT) = ON

Automatic table maintenance          (AUTO_TBL_MAINT) = ON

Automatic reorganization               (AUTO_REORG) = ON

测试自动回收空间

 

 

[db2inst1@enmodb2 ~]$ db2 get db cfg |grep reorg

     Automatic reorganization               (AUTO_REORG) = ON

 

[db2inst1@enmodb2 ~]$ db2 "select count(0) from testinfo"

 

1          

-----------

   16252928

[db2inst1@enmodb2 ~]$ db2 "delete from db2inst1.testinfo where salary<500000"

DB20000I  The SQL command completed successfully.

[db2inst1@enmodb2 ~]$ db2 runstats on table DB2INST1.TESTINFO

DB20000I  The RUNSTATS command completed successfully.

[db2inst1@enmodb2 ~]$ db2 "select RECLAIMABLE_SPACE from

> Table(SYSPROC.ADMIN_GET_TAB_INFO('DB2INST1','TESTINFO'))"

 

RECLAIMABLE_SPACE   

--------------------

                   0

 

5.2 减少IO

在某些环境下对于行存储是按行存放的,在读取的时候需要将整行的数据都读出,这样无形中增加了数据的IO。而列存储的存储方式是按列存储,任何列都可以作为索引,只读出所需访问的列,读取时冗余很少,从而减少了 I/O,提高了性能。

5.3 良好的压缩比

UTIL_HEAP_SZ 足够大,能使得在转换成列存储表之后获取很好的压缩比。对于传统的压缩技术了来讲,可以很大程度的节省存储,某些压缩算法甚至可以不需要解压缩而实现对数据的直接操作,节省了解压缩的开销,从而降低了对 CPU 的消耗。经过优化后的数据库,很多压缩数据会存储在内存中,减少查询时消耗的内存资源。

 

列式存储缺点

6.1 筛选条件限制

如果查询中有筛选条件这样列式存储未必就能发挥相应的优势,而且相对资源消耗可能会比行式存储更高。

行式存储

Original Statement:

------------------

select

  empno,

  lastname,

  salary,

  hiredate

from

  employee

where

  empno >00190

 

 

Optimized Statement:

-------------------

SELECT

  Q1.EMPNO AS "EMPNO",

  Q1.LASTNAME AS "LASTNAME",

  Q1.SALARY AS "SALARY",

  Q1.HIREDATE AS "HIREDATE"

FROM

  DB2INST1.EMPLOYEE AS Q1

WHERE

  (190 < DECFLOAT(Q1.EMPNO, 34, '.'))

 

Access Plan:

-----------

        Total Cost:             44.4357

        Query Degree:           1

 

            Rows

           RETURN

           (   1)

            Cost

             I/O

             |

           333.333

           FETCH

           (   2)

           44.4357

             34

         /---+----\

     333.333       1000

     RIDSCN   TABLE: DB2INST1

     (   3)      EMPLOYEE

     14.685         Q1

        2

       |

     333.333

     SORT  

     (   4)

     14.6845

        2

       |

     333.333

     IXSCAN

     (   5)

     14.5794

        2

       |

      1000

 INDEX: DB2INST1

   PK_EMPLOYEE

       Q1

列式存储

Original Statement:

------------------

select

  empno,

  lastname,

  salary,

  hirdate

from

  testinfo

where

  empno >00190

 

 

Optimized Statement:

-------------------

SELECT

  Q1.EMPNO AS "EMPNO",

  Q1.LASTNAME AS "LASTNAME",

  Q1.SALARY AS "SALARY",

  Q1.HIRDATE AS "HIRDATE"

FROM

  DB2INST1.TESTINFO AS Q1

WHERE

  (190 < DECFLOAT(Q1.EMPNO, 34, '.'))

 

Access Plan:

-----------

        Total Cost:             49.3344

        Query Degree:           1

 

        Rows

       RETURN

       (   1)

        Cost

         I/O

         |

         14

       CTQ   

       (   2)

       49.3344

          7

         |

         14

       TBSCAN

       (   3)

       49.3324

          7

         |

         42

 CO-TABLE: DB2INST1

      TESTINFO

         Q1

6.2 不支持GBK字符集

而且在DB2中列式存储只支持数据库UTF-8并且国家地域必须为cn。因为unicode里面好像不支持GBK,如果是GBK字符集的数据库,创建列组织表的时候直接会报错,提示该数据库的字符集不支持。

[db2inst1@enmodb2 ~]$ db2 connect to coco

 

   Database Connection Information

 

 Database server        = DB2/LINUXX8664 10.5.9

 SQL authorization ID   = DB2INST1

 Local database alias   = COCO

 

[db2inst1@enmodb2 ~]$ db2 get db cfg

 

       Database Configuration for Database

 

 Database configuration release level                    = 0x1000

 Database release level                                  = 0x1000

 

 Database territory                                      = cn

 Database code page                                      = 1386

 Database code set                                       = gbk

 Database country/region code                            = 86

 Database collating sequence                             = UNIQUE

 Alternate collating sequence              (ALT_COLLATE) =

 Number compatibility                                    = OFF

 Varchar2 compatibility                                  = OFF

 Date compatibility                                      = OFF

 Database page size                                      = 32768

 

创建列组织表

[db2inst1@enmodb2 ~]$ db2 "create table  testinfo (empno char(6),lastname varchar(15),hirdate date,salary decimal(9),comm decimal(9)) organize by column"

DB21034E  The command was processed as an SQL statement because it was not a

valid Command Line Processor command.  During SQL processing it returned:

SQL1233N  The statement failed because the statement includes the use of

functionality that is supported only in Unicode databases or Unicode databases

with specific database collations, but the current database is not a Unicode

database or is a Unicode database with unsupported database collations.  

SQLSTATE=560AA

 

6.3 必须存储于自动存储表空间

必须将列存储表存储在自动存储的表空间里面

[db2inst1@enmodb2 ~]$ db2 "create  tablespace mytem02 pagesize 32K managed by database using (file '/home/db2inst1/tbs/mytem02' 10M) bufferpool IBMDEFAULTBP"

DB20000I  The SQL command completed successfully.

[db2inst1@enmodb2 ~]$ db2 "create table  testinfo3 (empno char(6),lastname varchar(15),hirdate date,salary decimal(9),comm decimal(9)) organize by column in mytem02"

DB21034E  The command was processed as an SQL statement because it was not a

valid Command Line Processor command.  During SQL processing it returned:

SQL0284N  Table creation failed because the table space "MYTEM02" that was

specified in the statement after the clause "IN" is not a supported type of

table space for that clause.  Table space type: "MANAGED BY DATABASE".  

SQLSTATE=42838 


总结

相对于列存储,行存储的好处是增加修改数据容易,适合于 OLTP 事务型应用。列存储表在进行查询的时候需要占用大量的内存,同时列存储表中的数据是按照列存储的,这样的组织方式不太适合查询单行或者几行数据,这就决定了列存储表不适合 OLTP 的系统,因此不是所有的表都适合转换成列存储表。但是在某些场景下,比如数据仓库,在对数据进行查询时可以实现数据跳读,从而进一步降低数据库的IO,提升查询的性能。