How to Read and Write Excel Files in Java

Previously, I talked about how to check PageRank in Java. In this article, I will combine PageRank with Excel files. To operate Excel files in Java, I used Apache POI – the Java API for Microsoft Documents. Apparently, you can download the API package, and follow the relevant tutorials to learn how to use POI. When I first downloaded the Java library, I was astonished by how many jar files there were. I had no idea which ones would be useful for my project. Fortunately, I managed to filter out what I want.

Read more

How to Extract Domain Names from URLs in VBA

Backlink is one of the most important factors for search engine ranking. Thus, the analysis of backlinks is significant for website owners. It is known that Google Webmaster tools can record search traffic allowing users to check the links to their sites. We can download the relevant data as figure 1.

webmaster download

Figure 1

However, it is imperfect that the data only consists of links and dates (figure 2), which are not categorized or visualized. In my point of view, the worst part is that there is no data of domain names, which are of a higher value than the backlinks. Therefore, if we want to see the domain names, we have to process the data ourselves. Today, I would like to share my knowledge about how to extract domain names from URLs in VBA.

webmaster links

Figure 2

Why Programmatically Processing the Excel Data?

Some may think that it is not necessary to deal with the data in VBA. That to filter the domain names, the basic built-in functions are enough. It is true that to get the domain names, we just need to type in “=LEFT(text, FIND(“/”,text,[start_num]))”. However, this approach is not perfect. It cannot process sub domain names. For example, after executing excel functions, we can get both “a.test.com” and “b.test.com” whereas the target domain name should be “test.com”. How can we solve this complicated question? The answer is to write a program in VBA.

How to Extract the Domain Names?

To solve the problem, we can consider the following steps:

  • Create a data source of domains. E.g. “.com, .net, .org, .gov …”
    dict = CreateObject("Scripting.Dictionary") ' use dictionary to store key & value
    ' initialize the data source of domains
    dict.Add("com", "com")
    dict.Add("cn", "cn")
    dict.Add("biz", "biz")
    dict.Add("org", "org")
    dict.Add("net", "net")
    dict.Add("edu", "edu")
    dict.Add("gov", "gov")
    dict.Add("co", "co")
    dict.Add("us", "us")
    dict.Add("ca", "ca")
    dict.Add("info", "info")
    dict.Add("eu", "eu")
    dict.Add("de", "de")
  • Split “://” and “/” to get the domain names. E.g. “http://social.msdn.microsoft.com/a/b/c/d.html” -> “social.msdn.microsoft.com”
Dim urlArray() As String
Dim domainArrayOri() As String
Dim sourceRow, sourceCol, destinationRow, destinationCol As Integer
sourceRow = 2 ' index of source row
sourceCol = 1 ' index of source column
destinationRow = 2 ' index of destination row
destinationCol = 7 ' index of destination column

'------------------------------------------------------------------
urlArray = Split(Cells(sourceRow, sourceCol).Value, "://")
domainArrayOri = Split(urlArray(1), "/")
  • Split “.” to store strings in array. E.g. “social.msdn.microsoft.com” -> “social”, “msdn”, “microsoft”, “com”
domainArray = Split(domainArrayOri(0), ".")
  • Traverse the array data to find out the domain that matches the data source of domains. E.g. the domain is “com”
' function for checking domain
Function isDomain(tmp As String, domains() As String) As Boolean
    isDomain = False
    For Each domain In domains
        If tmp = domain Then
            isDomain = True
            Exit For
        End If
    Next domain

End Function
        lastIndex = UBound(domainArray, 1)
        firstIndex = LBound(domainArray, 1)
        count = lastIndex - firstIndex + 1
        If count > 2 Then
            Dim j As Integer
            Dim bIsDomain As Boolean
            bIsDomain = False
            If lastIndex > 5 Then
                lastIndex = 3
            End If

            For j = 2 To lastIndex
                If dict.Exists(domainArray(j)) Then
                    ' TODO:
                Else
                    ' TODO:
                End If
            Next j
        Else
            Cells(destinationRow, destinationCol).Value = domainArrayOri(0)
        End If
  • Compose the domain names with the adjacent string “microsoft”. E.g. the final domain name is “Microsoft.com”
                If dict.Exists(domainArray(j)) Then
                    bIsDomain = True
                    Cells(destinationRow, destinationCol).Value = domainArray(j - 1) & "." & domainArray(j)
                Else
                    Cells(destinationRow, destinationCol).Value = domainArrayOri(0)
                End If

You can feel free to download the sample code.