Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
167 views
in Technique[技术] by (71.8m points)

excel - Email scraper from a list of urls

What I am trying to build

In Sheet1 column A row2 down to x amount of rows, there will be a list of website URLs. I need the code to go through the urls and find the phone numbers and emails and place them in column B + C next to the urls, if nothing is found place a hyphen in the cell.

I have almost got this working. The code loops through a list of URLS in Sheet1 column A and pulls the phone numbers and emails, places them into column B and C. I just have 3 problems with the current code i wrote, these are stated below Problem 3 might be a simple fix.

THE NEW CODE

Private Sub CommandButton1_Click()
' Run main code
Dim wb As Workbook
Dim wsSheet As Worksheet, links As Variant, IE As Object, link As Variant
Dim rw As Long
Dim html As New HTMLDocument
Dim regxp As New RegExp, post As Object, phone_list As Object, email_list As Object

'SHEET1 as sheet with URL
    Set wb = ThisWorkbook
    Set wsSheet = wb.Sheets("Sheet1")

    'Set IE = InternetExplorer
    Set IE = CreateObject("InternetExplorer.Application")

    rw = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row
    links = wsSheet.Range("A2:A" & rw)

    'IE Open Time per page 4sec and check links on Sheet2 Column A
    With IE
       .Visible = True
       Application.Wait (Now + TimeValue("00:00:04"))

       For Each link In links
           .navigate (link)
           While .Busy Or .readyState <> 4: DoEvents: Wend

Set html = .document

'Application.Wait (Now + TimeValue("00:00:04"))
  With regxp
        .Pattern = "(?:+1)?(?:+[0-9])?(?([0-9]{3}))?[-. ]?([0-9]{3})[-. ]?([0-9]{4})"
        Set phone_list = .Execute(html.body.innerHTML)
        .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+"
        Set email_list = .Execute(html.body.innerHTML)
    End With
   
    Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)
    Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)
 ''''################################ I TRIED TO FIX THE PROBLEM WITH THIS  #########################
 '''' ############################### TO PLACE A HYPHEN IF NOTHING IS FOUND #########################
''''      If regxp Is Nothing Then
''''            wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list()
''''        Else
''''             wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)
''''        End If
''''
''''        If regxp Is Nothing Then
''''            wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list()
''''        Else
''''            wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)
''''        End If
''''################################ I TRIED TO FIX THE PROBLEM WITH THIS  #########################
''''################################################################################################

'navigate links
      Next link

'Close IE Browser
    .Quit
    End With

    Set IE = Nothing
End Sub

Problem 1

If there is no item to extract then the code does not go to the next url, for some reason it just stay on that page, or I get an error message. e.g website has phone number but no email the page will not navigate to the next url. I tried to fix this with an IF statement but could not get it to work.. What it should do If there is nothing to extract go to the next urls in column A

VBA error message

VBA Code line error

Problem 2

If the website has an invalid security certificate or the url is DEAD then the code does not navigate to the next url, it waits for a user input. If I click "NO" to state I do not wish to to navigate to this site the code crashes. If the certificate is invalid or url is DEAD then it should move to the next url, so if site has not loaded in X amount of time move to the next url. Not sure if this could also be used for problem1

I think I need something like this, but can't work it out with my code Mr Excel

Problem 3

This might just be an excel column formating issue unless I have have got the phone number expression wrong in the code. As you can see the phone numbers are not showing correct. I am not sure if excel is clearing an "0" and that is why the numbers are wrong or the phone number expression is wrong.

Phone Numbers

Thanks for having a look Please could sombody help me out on anyone of the three issues. As aways THANKS in advance.

UPDATED TODAY 24/7/2020 AT 12:56 UK TIME

I have added a better Regxp for finding phone numbers, since posting PROBLEM 3, it has improved a bit .Pattern = "(?:+1)?(?:+[0-9])?(?([0-9]{3}))?[-. ]?([0-9]{3})[-. ]?([0-9]{4})" it is still however clipping some digits of the numbers, see image below, numbers in green where found and last digit is missing

Phone missing digit

Also Posted on Mr Excel Mr Excel.

####### Added Today Thursday 30th July 2020 4:00pm Uk time ########

I am trying an If statement so If Nothing is found then place a hyphen, see below

If email_list Is Nothing Then
'On Error Resume Next
wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "c").End(xlUp).Row + 1, "c").Value = "-"
Else
On Error Resume Next
wsSheet.Cells(Sheet1.Cells(Sheet1.Rows.Count, "c").End(xlUp).Row + 1, "c").Value = email_list(0)
End If
End With

However I can Not get it to work, the On Error Resume Next allows me to move to the next url and gets rid of the first error message.

The emails NOW pull off as such, I have colour coded them for easy viewing. As you can see from the colours they are NOT next to the correct urls, This is why I was trying to place an hyphen in the cell at least then that cell would be poplulated an the date would go into the next blank cell, thus keeping everything in line. email image

############## Updated Today FRIDAY 31st JULY 2020 1.26PM UK time

I have fixed the problem with data NOT going into the right place, By uking the IF statement code. So now Problems 1 and 2 seem fine. Only problem 3 remains, which I though would be a simple fix LOL.

The problem was this

If regxp Is Nothing Then

It should have been

If Phone_List (0) Is Nothing Then

And

If Email_List (0) Is Nothing Then

Fixed Data Input Image

########### UPDATED TODAY Monday 3rd August 11.45 Uk time #############

This is my workaround to overcome Problem 3 for phone numbers not pulling of correct.

I have changed the Pattern part of the code, so now it pulls the REGXP pattern from the Sheet, Sheet1.Range D1. This way I can change the regxp pattern in the cell to pull off different phone number types.

''' ########## Phone Numbers Pattern ###########
        .Pattern = ThisWorkbook.Sheets("Sheet1").Range("D1")
        .Global = False
        .IgnoreCase = True
        Set phone_list = .Execute(html.body.innerHtml)

This is the Regxp pattern I am using for now, for uk. It is in placed in Sheet1 CELL D1

(?:+1)?(?:+[0-9])?(?([0-9]{4}))?[-. ]?([0-9]{4})[-. ]?([0-9]{3}?)

If anyone has a better pattern please post.

########## Updated today Tuesday 5th August 2020 1:35 uk time ##########

I have MSXML2.ServerXMLHTTP code which works much faster, but misses a few emails and numbers. Where as the IE version I wrote and the code ANSWER written by SMTH pick up the extra emails and phone numbers. I changed the regxp patters in SMTH ANSWER to mine for better results.

If anyone knows why then please advise, otherwise SMTH code is the answer as it does the same job as mine, but is written much better.

Private Sub CommandButton2_Click()
'''######### NO IE THIS CODE IS FASTER ######
Dim wb As Workbook
Dim wsSheet As Worksheet, links As Variant, link As Variant
Dim rw As Long
Dim regxp As New RegExp, post As Object, phone_list As Object, email_list As Object
Dim Html As New HTMLDocument

''''SHEET1 as sheet with URL
    Set wb = ThisWorkbook
    Set wsSheet = wb.Sheets("Sheet1")
   
    rw = wsSheet.Cells(wsSheet.Rows.Count, "A").End(xlUp).Row
    links = wsSheet.Range("A2:A" & rw)

For Each link In links
    'Set doc = NewHTMLDocument(CStr(link))
      Set Html = NewHTMLDocument(CStr(link))
With regxp
''' ########## Phone Numbers Pattern ###########
        .Pattern = "(?:+1)?(?:+[0-9])?(?([0-9]{4}))?[-. ]?([0-9]{4})[-. ]?([0-9]{3}?)" '"(?:+1)?(?:+[0-9])?(?([0-9]{3}))?[-. ]?([0-9]{3})[-. ]?([0-9]{3}?)"
        .Global = False
        .IgnoreCase = True
        Set phone_list = .Execute(Html.body.innerHtml)
''' ########## Email Pattern ###########
        .Pattern = "([a-zA-Z0-9_-.]+)@(([[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.)|(([a-zA-Z0-9-]+.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(]?)"
        .Global = False
       .IgnoreCase = True
        Set email_list = .Execute(Html.body.innerHtml)
    

'''########## PHONE LIST ############# ADD TO SHEET
On Error Resume Next
    If phone_list(0) Is Nothing Then
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = "-"
    Else
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "B").End(xlUp).Row + 1, "B").Value = phone_list(0)
    End If
'''########## EMAIL LIST ############# ADD TO SHEET
On Error Resume Next
      If email_list(0) Is Nothing Then
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = "-"
    Else
        On Error Resume Next
            Sheet1.Cells(Sheet1.Cells(Sheet1.Rows.Count, "C").End(xlUp).Row + 1, "C").Value = email_list(0)
    End If
E

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

This is how you can loop through all the predefined links you wanna traverse using macro in order to collect email and phone numbers. To figure out whether your patterns could find anything, you wanna use .Count property like I've shown below. You can always replace the patterns I've used below as they are not relevant to your major issues.

Sub GetEmailAndPhone()
    Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
    Dim Rxp As Object: Set Rxp = CreateObject("VBScript.RegExp")
    Dim emailMatch As Object, phoneMatch  As Object, S$, cel As Range
    Dim Html As HTMLDocument

    For Each cel In Sheets("Sheet1").Range("A2:A" & Sheets("Sheet1").Cells(Rows.Count, 1).End(xlUp).Row)
        With IE
            .Visible = False
            .navigate cel
            While .Busy Or .readyState <> 4: DoEvents: Wend
            Set Html = .document
        End With
        
        With Rxp
            .Pattern = "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+"
            Set emailMatch = .Execute(Html.body.innerHTML)
            .Pattern = "(?:+1)?(?:+[0-9])?(?([0-9]{3}))?[-. ]?([0-9]{3})[-. ]?([0-9]{4})"
            Set phoneMatch = .Execute(Html.body.innerHTML)
        End With
        
        If emailMatch.Count >= 1 Then
            cel(1, 2) = emailMatch(0)
        Else:
            cel(1, 2) = "Not Found"
        End If
        
        If phoneMatch.Count >= 1 Then
            cel(1, 3) = phoneMatch(0)
        Else:
            cel(1, 3) = "Not Found"
        End If
    Next cel
End Sub

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...