海量数据迁移之分区并行抽取(r2笔记53天)-白红宇的个人博客

发布日期：2021-06-30 13:29:09 浏览次数：2 分类：技术文章

本文共 6679 字，大约阅读时间需要 22 分钟。

在之前的章节中分享过一些数据迁移中并行抽取的细节，比如一个表T 很大，有500G的数据，如果开启并行抽取，默认数据库中并行的最大值为64，那么生成的dump文件最50多为64个，每个dump文件就是7.8G，还是不小，况且在做数据抽取的时候，资源被极大的消耗，如果资源消耗紧张，可能可用的并行资源还不到64个。那么dump文件可能比7G还要大得多。如果换一步来说，我们尝试调高并行的参数，可以支持100个并行，那么每个dump文件也有5G，也没有太大的改善。所以自己在斟酌后考虑使用分区加并行的思想来做大表的切分。生产中500G的大表肯定是做了分区操作，而且分区数可能还比较多。我们就设定为100个吧。分区表的数据基本都是分散在各个分区的，考虑数据的不均匀分布，那么每个分区的数据可能在5~10G吧。参照这个思想，假设开启并行，比如200M为一个基准点来切分分区表，比如分区表的某个分区含有5G的数据，那么需要开启25个并行即可，文件就会被切分为200M的很多细粒度的dump文件。按照10G来算，最多也是50个并行，比默认提供的并行参数还要低一些。按照这个思想，对比较大的分区表才做分区+并行，如果是普通表就需要最大程度的应用并行，如果分区表比较小，那就可以不用使用分区+并行了。目前我设定的基准为1G，比如一个分区表T，大小在1.5G,那么可以考虑开启分区+并行，如果分区表的大小为500M，那么就可以不用考虑使用分区+并行了，因为在每个分区中的数据可能相对比较少。今天写了如下的脚本来做做分区抽取的判断脚本，如果分区表的大小在1G以上，才抽取分区信息，否则和普通表一样对待。

tablst=`cat ../parfile/tablst|awk '{print "'\''" $1 "'\'',"}'`#for segment within 50M, will use parallel 1, and parallel will calculated with segment_bytes_size_MB/50Mpage=200sqlplus -s $1 <<EOFset feedback offset head offset line 100set pages 0set long 10000set termout offcol segment_name for a40col parallel format 9999spool tab_parall_temp.lstselect segment_name,nvl(partition_name,'x'),ceil(sum(bytes/1024/1024)/$page) parallel from user_segmentswhere segment_name in(select segment_name from user_segmentswhere segment_name in (select table_name from user_tables where table_name in ($tablst'x') and table_name not in (select table_name from user_external_tables))group by segment_name having sum(bytes/1024/1024)>=1000)group by segment_name,nvl(partition_name,'x')unionselect segment_name,'x',ceil(sum(bytes/1024/1024)/$page) parallel from user_segmentswhere segment_name in(select segment_name from user_segmentswhere segment_name in (select table_name from user_tables where table_name in ($tablst'x') and table_name not in (select table_name from user_external_tables))group by segment_name having sum(bytes/1024/1024)<1000) group by segment_name,'x' ;

spool off;EOFsed '/^$/d' tab_parall_temp.lst |sort > ../parfile/tab_partition_parall.lstrm tab_parall_temp.lst

生成的参数文件内容如下，对于普通表和较小的分区表而言，就默认补充了一个字符‘x'，便于稍后的处理时统一管理MEMO P9_A3000_E1 1MEMO P9_A3000_E2 1MEMO P9_A3000_E3 1MEMO P9_A3000_E4 1MEMO P9_A3000_E5 1MEMO PMAXVALUE_AMAXVALUE_EMAXVALUE 1SERVICE x 36SUBSCRIBER_HISTORY x 11SUBSCRIBER x 5对于大表的分区+并行抽取，可以考虑如下的脚本。

#### source owner $1#### tab_name $2#### target owner $3#### dump directory $4#### partition_name $5

function check_conn {

Num=`echo show user | $ORACLE_HOME/bin/sqlplus -s $1| grep -i 'USER ' | wc -l`if [ $Num -gt 0 ] then echo DB details is accessible from $2 schema ... else ## inst is inaccessible echo Instance: DB detailsIs Invalid Or UserName/PassWord Is Wrong echo '***********************************************' exit fi}

check_conn $1 sourceecho .check_conn $3 targetecho .

source_owner=`echo "$1" |awk -F@ '{print $1}'|awk -F/ '{print $1}'|tr '[a-z]' '[A-Z]'`target_owner=`echo "$3" |awk -F@ '{print $1}'|awk -F/ '{print $1}'|tr '[a-z]' '[A-Z]'`tab_name=`echo "$2"|tr '[a-z]' '[A-Z]'`partition_name=$5tablst=`cat ../parfile/tablst|awk '{print "'\''" $1 "'\'',"}'`

tmp_parallel=`grep -w $tab_name ../parfile/tab_partition_parall.lst|grep $partition_name| awk '{print $3}'`echo $tmp_paralltmp_dump_cnt=`ls -l ../DUMP/${tab_name}*.dmp|wc -l`tmp_dump_cnt=`expr $tmp_dump_cnt + 1 `echo $tmp_dump_cnttmp_parallel=`expr $tmp_parallel + $tmp_dump_cnt `

for i in {${tmp_dump_cnt}..${tmp_parallel}};doecho \'${tab_name}_$i.dmp\', >> tmp_${tab_name}_par_dmp.lstdone

sed -e '/^$/d' -e '$s/.$//' tmp_${tab_name}_par_dmp.lst > ../DUMP_LIST/${tab_name}_par_dmp.lstdump_list=`cat ../DUMP_LIST/${tab_name}_par_dmp.lst`#echo $dump_list#echo $tmp_parallelif [[ $partition_name = 'x' ]]thenpartition_name=''elsepartition_name='partition('$partition_name')'fi

sqlplus -s $3 <<EOFset feedback offset serveroutput onDECLAREdatapump_flag number(2);beginselect count(*) into datapump_flag from user_tab_privs where table_name=upper('$4') and grantee=upper('$target_owner') and privilege in ('READ','WRITE');if(datapump_flag=2) thendbms_output.put_line( 'Directory '||'$4'||' has read,write permission ,proceed...');elsedbms_output.put_line( 'WARNING! Directory '||'$4'||' does not have read,write permission to $target_owner ,Please check again...');return;end if;end;/EOF

echo .

sqlplus -s $1 <<EOFset feedback offset serveroutput onDECLARETABLE_FLAG number(2);beginselect count(*)into table_flag from all_synonyms where owner=upper('$source_owner') and table_name=upper('$2') ;if(table_flag>0) thendbms_output.put_line( 'SYNONYM '||'$tab_name'||' exists in CONNECT account,proceed...');elsedbms_output.put_line( 'SYNONYM does not exists in CONNECT account,please check again');return;end if;end;/!echo .

DECLARETABLE_FLAG number(2);beginexecute immediate 'select count(*) from all_external_tables where owner=upper(''$target_owner'') and table_name=upper(''$2'')' into table_flag;if(table_flag>0) thendbms_output.put_line( 'External table exists in mig account,proceed...');--execute immediate 'drop table $3.$2_ext';end if;end;/exec dbms_output.put_line('Get Dump file for $source_owner.$2 $partition_name...');

conn $3set timing oncreate table $target_owner.${tab_name}_ext ORGANIZATION EXTERNAL (TYPE ORACLE_DATAPUMP DEFAULT DIRECTORY $4 ACCESS PARAMETERS( nologfile) LOCATION ( $dump_list ) ) parallel $tmp_parallel as select /*+ parallel(t $tmp_parallel) */ * from $source_owner.$tab_name $partition_name t;set feedback off;set timing offdrop table $target_owner.$2_ext;EOFecho .rm tmp_${tab_name}_par_dmp.lstexit

脚本生成的日志如下：

Get Dump file for APP_TMP.MEMO partition(P5_A2000_E2)...Elapsed: 00:00:00.52DB details is accessible from source schema ...DB details is accessible from target schema ...

447Directory new_test has read,write permission ,proceed...SYNONYM MEMO exists in CONNECT account,proceed...

Get Dump file for APP_TMP.MEMO partition(P5_A2000_E3)...Elapsed: 00:00:00.73DB details is accessible from source schema ...DB details is accessible from target schema ...

449Directory new_test has read,write permission ,proceed...SYNONYM MEMO exists in CONNECT account,proceed...

Get Dump file for APP_TMP.MEMO partition(P5_A2000_E4)...

生成的dump文件如下所示，可以看到生成了几百个相关的dump文件。-rw-r----- 1 prodbuser dba 15826944 Aug 3 18:13 MEMO_460.dmp-rw-r----- 1 prodbuser dba 13254656 Aug 3 18:13 MEMO_461.dmp-rw-r----- 1 prodbuser dba 15044608 Aug 3 18:13 MEMO_462.dmp-rw-r----- 1 prodbuser dba 15015936 Aug 3 18:13 MEMO_463.dmp-rw-r----- 1 prodbuser dba 13135872 Aug 3 18:13 MEMO_464.dmp-rw-r----- 1 prodbuser dba 13266944 Aug 3 18:13 MEMO_465.dmp-rw-r----- 1 prodbuser dba 15003648 Aug 3 18:13 MEMO_466.dmp-rw-r----- 1 prodbuser dba 20480 Aug 3 18:13 MEMO_467.dmp-rw-r----- 1 prodbuser dba 20480 Aug 3 18:13 MEMO_468.dmp-rw-r----- 1 prodbuser dba 20480 Aug 3 18:13 MEMO_469.dmp

转载地址：https://jeanron100.blog.csdn.net/article/details/102506875 如侵犯您的版权，请留言回复原文章的地址，我们会给您删除此文章，给您带来不便请您谅解！

上一篇：VXFS启用异步IO导致的严重问题(r2笔记56天)

下一篇：怎样突破表名30个字符的限制（r2笔记51天)

发表评论

关于作者

喝酒易醉，品茶养心，人生如梦，品茶悟道，何以解忧？唯有杜康！

-- 愿君每日到此一游！

发表评论

最新留言

关于作者

推荐文章