Microsoft teamed up with BlueBolt Solutions, a Microsoft partner, to implement a critical feature in its search product, BravoSquared.

Our solution leverages Azure Functions and two Cognitive Services Text Analytics APIs—topic detection and key phrase extraction—to define search facets across a collection of articles and automatically assign tags to articles.

The purpose of this engagement was to work with BlueBolt to leverage Microsoft Cognitive Services to reduce the amount of hand work that is required by data subject-matter experts in BlueBolt’s document classification business process. By leveraging Microsoft Cognitive Services to automate the process, data experts no longer need to classify documents by hand, saving hundreds of hours of work.

Core team:

  • Mike Michaels – Software Developer, BlueBolt Solutions
  • James McKee – Software Developer, BlueBolt Solutions
  • Sheila Shahpari – Regional Director, BlueBolt Solutions
  • Jerry Nixon – Senior Technical Evangelist, Microsoft
  • Lauren Tran – Technical Evangelist, Microsoft

Customer profile

BlueBolt Logo

BlueBolt Solutions is an independent software vendor based in Chicago and Denver. Its primary business is dynamics and web development, and it also provides industry-specific products such as BravoSquared, an enterprise search solution that provides filtering and faceting, custom relevance ranking, and a modern search experience for customers with content in multiple portals across various media types.

BravoSquared enables custom faceting, content segmentation, and filtering options for each customer, depending on the nature of the data and content. The differentiator is the ability to provide a customized search solution tuned for each customer’s content and user base. Given Microsoft’s state-of-the-art work in natural language processing and the robust capabilities of the Azure platform, BlueBolt decided that Microsoft was the right partner and one-stop shop for implementing its workloads.

Problem statement

BlueBolt provides a solution to its customers who manage large collections of text documents that they index for search. The data to index includes thousands of documents and articles spanning various domains. Each of BlueBolt’s customers has different types of documents on different subjects. For example, one of BlueBolt’s customers has documents related to hand surgery, whereas another has financial services documents. These documents may be public or private and can date back many years or decades. BlueBolt classifies these documents and creates search facets to make it easier for its customers to find relevant documents during a search.

To generate search facets based on the content of the articles without manually defining them, BlueBolt needed a way of determining the topics discussed across the articles. This scenario is an excellent use case for the Microsoft Cognitive Services topic detection API, which is built to provide topics across a corpus of at least 100 documents. We implemented topic detection using Cognitive Services to extract topics from a collection of more than 5,000 medical articles written by hand surgeons. Azure Functions in conjunction with the Cognitive Services key phrase extraction API allows us to easily auto-tag new documents as they are added to the corpus.

This process is business critical for BlueBolt Solutions as it cannot scale its current solution to multiple customers without removing the manual process that we automated with Azure Functions and Microsoft Cognitive Services.

Solution, steps, and delivery

Microsoft worked remotely with BlueBolt over the course of several weeks and had a two-day envisioning and coding session onsite with BlueBolt. During the weeks leading up to the onsite engagement, Microsoft worked with BlueBolt on machine learning education, data analysis, and determining a scope for the project.

Once we understood BlueBolt’s processes and nature of its customer data, it became clear that Cognitive Services and a serverless architecture using Azure Functions would be the right approach. In this architecture, BlueBolt would not need to manage any infrastructure, would have a scalable solution, and would be able to leverage world-class machine learning capabilities built into Cognitive Services.

The resulting architecture is as follows:

Architecture Diagram

BlueBolt’s documents sit in a file share. These documents are exported into CSV format, which we process with a Python script that ingests the CSV documents and runs them through the Cognitive Services topic detection API. This data is validated by a data subject-matter expert and becomes part of the classification system once validated. The final results are stored in Azure Table storage. When a new document needs to be analyzed, an Azure function kicks off based on a blob trigger. The Azure function accesses the previously stored topics and calls Cognitive Services to get topics for the new document. A comparison is made between the previously stored results and the results for the new document, and matching results are stored in DocumentDB.

Topic detection to generate search facets

To implement our topic detection solution, we wrote a Python script to preprocess the text data and call the topic detection API. The JSON response from Cognitive Services provides us with the extracted topics (keyPhrase) and the number of documents in which the topic appears (score). For example, “tendon injuries” and “ulnar nerve” are two of the topics detected, as shown below.

{
    "id": "c683113a-1443-4121-91fe-57e4d516f7a7",
    "score": 5,
    "keyPhrase": "tendon injuries"
},
{
    "id": "875503c1-6325-4246-8914-20b9526cd610",
    "score": 21,
    "keyPhrase": "ulnar nerve"
}

In addition, the JSON response tells us the topics present in individual documents, and the associated relevance score (distance). For example, the following tells us that the topic “ulnar nerve” is present in a document with ID 13, with a distance score of 0.87.

{
    "documentId": "13",
    "topicId": "875503c1-6325-4246-8914-20b9526cd610",
    "distance": 0.8734
}

The JSON data provides BlueBolt with a collection of potential search facets, which we then filter to discard topics with a score of less than 5. The resulting set of 172 topics is then provided to the domain experts—in this case, hand surgeons—to select the most appropriate subset as the final set of facets. With the facets defined and documents assigned to one or more facets, BlueBolt now has the data to index its documents for search.

Azure Functions and key phrase extraction for tagging documents

To auto-categorize new documents as they are added to Azure Blob storage, we leveraged an Azure function to call the Cognitive Services key phrase extraction API. Azure Functions are ideal in this scenario because they can be easily configured to trigger when a new object is written to Blob storage. Every time a new article is added to the corpus, we kick off the auto-tagging process with Functions.

Azure Functions are the perfect choice for this model, as they allow serverless monitoring of a Blob storage container. BlueBolt’s customers are frequently adding new documents to their text collections that need to be indexed for search, so the solution requires a mechanism to automatically kick off the categorization process. The use of Azure Functions is optimal to scaling the solution by eliminating the manual work as hundreds to thousands of new documents enter collections. The ease of development and serverless architecture are also a great benefit to the development team, making the maintenance of an API and infrastructure unnecessary.

To configure the blob trigger, we specified the following properties in our function.json file:

{
    "name": "myBlob",
    "type": "blobTrigger",
    "direction": "in",
    "path": "input/{blobname}",
    "connection": "AzureWebJobsDashboard"
}

To tag the documents, we start by extracting key phrases present in each document. The Azure function sends the text of each new article to Cognitive Services to get a list of key phrases. In the following sentence, the API returns the bolded words/phrases:

Although this approach can be used for forearm and hand surgery, blockade of the inferior trunk (C8 through T1) is often incomplete and requires supplementation at the ulnar nerve for adequate surgical anesthesia in that distribution.

The full list of extracted key phrases typically contains phrases that are not in our defined set of search facets, so we do a match against our set that is stored in Azure Table storage. Azure Functions easily integrates with Table storage as an input, so when the function executes, we can read in our list of facets to match against the article’s key phrases. In the above example, the resulting facets are “hand surgery” and “ulnar nerve.”

Below is the code to call Cognitive Services and match the key phrases against our topics:

    requestBody = JSON.stringify({
        "documents": [
            {
            "language": "en",
            "id": context.bindingData.blobname,
            "text": myBlob
            }
        ]
    })

    // ping keyPhrase extraction API
    request.post({
        headers: {'content-type' : 'application/json', 'Ocp-Apim-Subscription-Key' : process.env.API_KEY},
        url: 'https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases',
        body: requestBody,
    }, function(error, response, body){
        if(error) {
            context.log(error);
        } else {
            keyPhrases = JSON.parse(body).documents[0].keyPhrases;

            // find matching topics
            var topics = [];
            for (var i = 0; i < data.length; i++) {
                topic = data[i].Topic;
                for(var j = 0; j < keyPhrases.length; j++) {
                    if (topic === keyPhrases[j]) {
                        topics.push(topic)
                    }
                }
            }

            // save topic assignments to DocumentDB
            context.bindings.outputDocument = {
                Topics : topics,
                DocumentName: context.bindingData.blobname,
            } 

            // end function
            context.done();
        }
    });

We use the name property, inputTable, to read the data from the Azure Table:

data = context.bindings.inputTable;

Finally, to store our results, we write the assigned facets and document name to DocumentDB. The output binding is configured as follows:

{
    "type": "documentDB",
    "name": "outputDocument",
    "databaseName": "outDatabase",
    "collectionName": "Items",
    "createIfNotExists": true,
    "connection": "bluebolt_DOCUMENTDB",
    "direction": "out"
}

The binding name, outputDocument, is used to connect to DocumentDB:

context.bindings.outputDocument = {
    Topics : topics,
    DocumentName: context.bindingData.blobname,
}

For the full function code, see index.js.

Conclusion

Prior to using Cognitive Services, it was up to human domain experts to manually define the search facets for each text collection. We have eliminated much of the manual upfront work by automatically generating topics. In the case of large document sets, this could save tens or even hundreds of hours of manual categorization. BlueBolt has implemented this approach to label its hand surgery corpus and will be following the same methodology to define search facets for the remaining text collections that it manages. The ability to quickly extract topics from a large set of documents and to auto-tag new documents is business-critical for BlueBolt, as this is the basis to scale their solution across domains.

Before looking to build a full machine learning solution, developers should look at Cognitive Services because it provides a rich set of machine learning scenarios that are prebuilt and ready to consume. In using the topic detection API, we found that although the minimum amount of documents to send to the service is 100, the service works best on a larger corpus size, giving us the most meaningful results when at least 1,000 documents were used.

“Two days ago I had no idea how this would be relevant; now it’s obvious. Machine learning in general is a very advanced topic that can take months of both research and development to begin to obtain value. With two days of working with Microsoft and their services, the difference is night and day. When pairing our solution with Azure and Cognitive Services, the sky is the limit, potentially saving dozens or hundreds of hours of data processing for our customers.”

— Mike Michaels, Software Developer, BlueBolt Solutions

“You were able to help us see potential areas to discover that I don’t think that we would have otherwise gone down. From document categorization to taxonomy development, you both helped us to feel confident in the value provided by the Azure and Cognitive Services platform. I cannot state how important this is not only to us, but to the customers that we help.”

— James McKee, Software Developer, BlueBolt Solutions

“I was astounded at the level of depth we were able to cover. I thought we would barely scratch the surface in the space of machine learning capabilities. Not only did we learn a TON, but we had a BLAST doing it. As someone who has lived and breathed the Microsoft ecosystem for a majority of my career, this is by far the most exciting time I’ve experienced in regards to building innovative solutions to real-world problems. We at BlueBolt look forward to continuing this journey with you.”

— Sheila Shahpari, Regional Director, BlueBolt Solutions

The team

BlueBolt and Microsoft

Additional resources

The code and results can be found in the following GitHub repository: https://github.com/laurentran/blue-bolt-solutions/

Cognitive Services resources:

Azure Functions input and output bindings:

Author: Lauren Tran, Microsoft Startup Technical Evangelist