
amazon web services - How to look for updated rows when using AWS Glue?

菜鸟教程小白 posted on 2022-6-1 20:10:13

I'm trying to use Glue for ETL on data I'm moving from RDS to Redshift.

As far as I am aware, Glue bookmarks only look for new rows using the specified primary key and do not track updated rows.

However, the data I am working with has rows that are updated frequently, and I am looking for a possible solution. I'm a bit new to pyspark, so if it is possible to do this in pyspark I'd highly appreciate some guidance or a point in the right direction. If there's a possible solution outside of Spark, I'd love to hear it as well.



Best Answer


You can find the updated records by filtering at the source JDBC database with a query, as in the example below. Here the date is passed in as a job argument, so each run fetches only the latest rows from the MySQL database.

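import sys
from awsglue.utils import getResolvedOptions

# start_date is passed in as a Glue job argument (--start_date)
args = getResolvedOptions(sys.argv, ["start_date"])

# Subquery pushed down to MySQL, so only rows with date1 after start_date are read
query = (
    "(select ab.id, ab.name, ab.date1, bb.tStartDate "
    "from test.test12 ab join test.test34 bb on ab.id = bb.id "
    "where ab.date1 > '" + args["start_date"] + "') as testresult"
)

# spark is the job's SparkSession
datasource0 = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://host.test.us-east-2.rds.amazonaws.com:3306/test")
    .option("driver", "com.mysql.jdbc.Driver")
    .option("dbtable", query)
    .option("user", "test")
    .option("password", "Password1234")
    .load()
)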
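
Since the date is concatenated straight into the SQL text, it may be worth validating it first and working out which date to pass on the next run. The following is only a rough sketch, not part of the original answer: the helper names `build_incremental_query` and `next_start_date` are hypothetical, the `YYYY-MM-DD` argument format is an assumption, and it assumes `date1` is refreshed whenever a row changes.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed format of the start_date job argument


def build_incremental_query(start_date: str) -> str:
    """Return the pushdown subquery for rows with date1 later than start_date."""
    # strptime raises ValueError for anything that is not a plain date,
    # so arbitrary SQL cannot be passed in through the job argument.
    parsed = datetime.strptime(start_date, DATE_FORMAT)
    safe_date = parsed.strftime(DATE_FORMAT)
    return (
        "(select ab.id, ab.name, ab.date1, bb.tStartDate "
        "from test.test12 ab join test.test34 bb on ab.id = bb.id "
        f"where ab.date1 > '{safe_date}') as testresult"
    )


def next_start_date(date1_values, previous_start_date: str) -> str:
    """Return the value to pass as start_date on the next run."""
    # date1_values: date or datetime objects read during this run (hypothetical input)
    newest = max(date1_values, default=None)
    if newest is None:
        return previous_start_date  # no changed rows, keep the old value
    return newest.strftime(DATE_FORMAT)
```

The string returned by `build_incremental_query(args["start_date"])` could be passed to `.option("dbtable", ...)` in place of the concatenated query. If `date1` only has day granularity, rows updated later on the same day as the stored value would be skipped by the `>` comparison, so a timestamp column would probably be a safer choice where one exists.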