Analyzing Office 365 OCR Data using Power BI

I’m so excited to see Optical Character Recognition (OCR) working natively now in Office 365! I got my start in government where we scanned a lot of documents. There was a lot of data locked away in those images but no easy way to mine it. OCR was explored as a possible solution, but it was still in its infancy and not very accurate or scalable.

Fast forward to today and now OCR is simply part of our cloud infrastructure and with the assistance of Machine Learning, very accurate. The dream of finally unlocking that data is here! Or is it? Read on to find out more.

By the way, we’ll be covering this and more in our session at The SharePoint Conference, May 21-23 in Las Vegas! Use discount code GATTE when you register. https://sharepointna.com

Intelligent Search and Unlocking Your Image Data

SharePoint’s native OCR capability was first shown at Ignite 2017. There, Naomi Moneypenny gave a presentation, Personalized intelligent search across Microsoft 365, where you could see Office 365 OCR in action. https://techcommunity.microsoft.com/t5/Microsoft-Search-Blog/Find-what-you-want-discover-what-you-need-with-personalized/ba-p/109590

She uploaded an image of a receipt and was able to search for it, based on the contents of the receipt image. It was a seamless demo of how Office 365 can intelligently mine the data in image files.

Now, let’s see if we can access the OCR data across multiple files and do something with it.

In the following procedure, I’ll show you how to connect to Office 365 and get to the data that the OCR process returns. Then I’ll show you how to process receipt data to get the total amount spent, solely from the OCR data.

Process Overview

There are three steps to start mining your OCR data. First, add image content that contains text to a SharePoint folder. Second, wait for the Media Services OCR process to analyze the images. Third, connect to the folder from Power BI and parse the returned text.

The process of getting OCR data

Finding OCR Data

The OCR process that runs against the files in a SharePoint document folder is called Media Services. All derived data is stored in columns with “MediaService” in their names.

Unfortunately, I’ve discovered that this feature has not been implemented consistently across the Shared Documents folder, custom folders, and OneDrive. The good news is that there’s a less obvious way to get to the data consistently across all three: the Properties collection. As shown below, you can see the normal column names and where they appear. Only the ones in Properties appear consistently across all three locations. We’ll only cover the basic information here, but the Properties collection has a lot more data to consume.

Audit of which media service fields are available where in Office 365

Adding Image Content to a SharePoint Document Folder

When you upload an image to a SharePoint document folder in Office 365, the OCR process kicks off automatically. I’ve had it take up to 15 minutes. The OCR process analyzes the image for text and returns the text in a column called MediaServiceOCR, where present, and always in Properties.vti_mediaserviceocr.

These columns contain any text that was recognized in the graphics file. The current structure of the returned data is a bit different than what is in the source image. Each instance of the discovered text is returned on a separate line, using a Line Feed character as a delimiter. For example, if you had a two-column table of Term and Meaning, it would return the data like this:

Term
Meaning
Term
Meaning

(Figures: the original uploaded image and the data returned by Media Services.)

While it’s great you can get to the data, the current returned format makes it exceptionally complex to reconstitute the context of the data. Also, the more complex your layout, the more “interesting” your transformations may need to be. I’d strongly recommend this post (https://eriksvensen.wordpress.com/2018/03/06/extraction-of-number-or-text-from-a-column-with-both-text-and-number-powerquery-powerbi/) and this post (https://blog.crossjoin.co.uk/2017/12/14/removing-punctuation-from-text-with-the-text-select-m-function-in-power-bi-power-query-excel-gettransform/ ) to give you the basics of text parsing in Power Query M.
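As a starting point, here’s a minimal Power Query M sketch of that first parsing step. The OCRItems query name is a placeholder for a query that already exposes the MediaServiceOCR text column; the sketch splits the OCR text on the Line Feed delimiter into one row per recognized line:

let
  // OCRItems is a placeholder for a query that exposes the MediaServiceOCR text column
  Source = OCRItems,
  // Split the OCR text on the line feed delimiter into a list of recognized lines
  SplitLines = Table.AddColumn(Source, "OCR Lines", each
    if [MediaServiceOCR] = null then {} else Text.Split([MediaServiceOCR], "#(lf)")),
  // Expand the list so each recognized line becomes its own row
  ExpandedLines = Table.ExpandListColumn(SplitLines, "OCR Lines"),
  // Drop any blank or missing lines left over from the split
  NonEmptyLines = Table.SelectRows(ExpandedLines,
    each [OCR Lines] <> null and Text.Trim([OCR Lines]) <> "")
in
  NonEmptyLines

From there, the parsing posts linked above can help you pull the numbers and keywords you need out of each line.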

Accessing the OCR Data in Power BI

The OCR columns are item-level columns. The natural tendency would be to connect to your SharePoint site using the Power BI SharePoint Folder connector, but you’ll be disappointed to find that the Media Services columns aren’t there.

Instead, connect to the document folder using the SharePoint Online List connector. By doing so, you’ll get access to the Media Services columns. Once in the dataset, you can use Power Query M to parse the data and start analyzing.
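As a sketch, the M that this connector produces looks roughly like the following. The site URL, library name, and column names are placeholders, and Power BI may generate the navigation step using the list’s GUID rather than its title:

let
  // Site URL is a placeholder; point this at the site that holds your document library
  Source = SharePoint.Tables("https://yourtenant.sharepoint.com/sites/Receipts", [ApiVersion = 15]),
  // Navigate to the document library that holds the uploaded images
  Documents = Source{[Title = "Documents"]}[Items],
  // Keep the file name plus the OCR column surfaced by Media Services;
  // MissingField.Ignore protects against columns that aren't present in your library
  Selected = Table.SelectColumns(Documents, {"FileLeafRef", "MediaServiceOCR"}, MissingField.Ignore)
in
  Selected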

Demo

Let’s walk through how to access the data and manipulate it using Power BI. In this scenario, I have two receipts that have been uploaded to a document folder, and I’m going to get the total spent on those receipts by analyzing the OCR data.

What about OneDrive for Business?

Yes, it works there too! The Media Service property fields are there as well. In fact, you get more information in an additional column called MediaServicesLocation. Based on my usage, it seems to be populated specifically for image files. If the image contains EXIF data, MediaServicesLocation will contain the Country, State/Province, and City information for where the photo was taken. Within the Properties collection, you can get even more detailed information about the photo, like the type of camera that took it and more.

To connect to OneDrive, you need your OneDrive site URL. I normally right-click on the OneDrive base folder in File Explorer and select View Online, as shown below.

Select View Online to get to the OneDrive url needed for Power BI connection
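Once you have the URL, the connection is the same connector pointed at your personal site. The URL below shows the typical OneDrive for Business pattern; substitute your own tenant and user name:

let
  // OneDrive for Business is a personal SharePoint site; this URL pattern is typical
  Source = SharePoint.Tables(
    "https://yourtenant-my.sharepoint.com/personal/you_yourtenant_com",
    [ApiVersion = 15])
in
  Source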

Potential for GDPR Issues

One aspect to consider if you look to do this in production in Europe is that you will likely encounter information that falls under GDPR regulation. Consider this your prompt to think about how this capability would fit into your overall GDPR strategy.

Want a copy of the Power BI model?

Fill out the form below and it will be emailed to you automatically, by virtue of the magic of Office Forms, Flow, and Exchange!

I hope you liked this post. If you have comments or questions, post them below in the comments.

In Data We Trust? Part Two

Part 2 of this series explores the difficulty in determining whether your business intelligence content is using authentic data. To illustrate the point, let’s examine a recent Seattle Times article about the measles outbreak happening in Washington.

An Example

The article in question, “Are measles a risk at your kid’s school? Explore vaccination-exemption data with our new tool,” presents a story filled with data charts and tables and draws some conclusions about the situation. Many internal reports and dashboards do the same, presenting data and conclusions. Unlike internal reports, newspapers list the source and assumptions in small print at the end of the story. Knowing the data comes from an official source adds authenticity.

The following note is supposed to increase authenticity.

“Note: Schools with fewer than 10 students were excluded. Schools that hadn’t reported their vaccination data to the Department of Health were also excluded.

Source: Washington Department of Health (2017-18)”

But does it really? Documenting exclusions and noting sources is a good practice. However, the note isn’t very prominent, and if you search for this data yourself, you’ll likely find several candidate sources. The article provides no direct link or contact information.

Data authenticity is crucial to making sound decisions. To establish it, some key data questions should be answered.

What data was used?

Many content creators don’t bother to document the source of their information. Viewers would not have the same level of confidence in a new financial dashboard if they knew the data came from a manually manipulated spreadsheet instead of directly from the finance system. How would the reader know anyway? In many cases, they wouldn’t. The Seattle Times provided a hint, but more is needed.

When you buy items like wine, you know what you are buying because the label spells it out. A wine bottle is required to have a label with standard data elements; for example, a US wine label must list the type of grape used to make the wine. Even red blends must list the varietals and percentages so that the customer is clear on what is in the bottle. Having the equivalent type of labeling would improve transparency about data authenticity.

Who owns the data we are consuming?

This is very important, especially if we spot incorrect or missing data. Who do we contact? The Seattle Times lists the Washington Department of Health as the data owner. This is a good starting point but doesn’t completely fill the need. For internal reports, all data sources should include an owning team name and a contact email. The data vintage example below also includes the site URLs and a contact email.

Data Vintage Example

How old is the data?

It’s one thing to know when the data was last pulled from the source, but that’s not the real question. Data age can strongly influence whether it can be used to make a decision. In our Marquee™ products, we include a data freshness indicator that shows proportionally how much of the data has been updated recently. “Recently” becomes a business rule defining what constitutes fresh data. For some companies, an entity must have been updated within the last seven days to be considered fresh.

Data Freshness indicator for time dependent data.
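As an illustration only, here’s a sketch in Power Query M of that seven-day rule. The ProjectData query and Modified column are assumptions standing in for your own source:

let
  // ProjectData is a placeholder for a query with a Modified date/time column
  Source = ProjectData,
  // Flag rows updated within the last seven days as fresh
  WithFreshness = Table.AddColumn(
    Source,
    "Is Fresh",
    each [Modified] >= Date.AddDays(DateTime.LocalNow(), -7),
    type logical)
in
  WithFreshness

The proportion of fresh rows can then drive an indicator like the one shown above.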

How to address?

We took the liberty of creating a Power BI model that analyzes the same immunization data used in the Seattle Times story. We’ll use this model to illustrate a simple technique. The following steps were performed to enable a simple “data vintage” page.

Procedure

  1. Create a Data Vintage page (you may need more than one, depending on how many datasets and sources you have).
  2. Add a back button to the page. We put ours in the upper left corner.
  3. Add the following information to the page, using a consistent format that you’ve decided upon:
    • Name of the dataset
    • Where the data is sourced from and how often it’s refreshed
    • Which team owns the data
    • How to contact the data owner, if available
  4. Create a Data Vintage bookmark for the page so that it can be navigated to via a button.
  5. Go back to the report page that you created from this data.
  6. Add an Information button to the upper right corner of the page.
  7. Select the button and navigate to the Visualization blade.
  8. Turn on Action.
  9. Set Type to Bookmark.
  10. Set the Bookmark to the one you created in Step 4.
  11. Ctrl + Click the Information button to test.
  12. Ctrl + Click the Back button to test.

That’s it. Anytime a user or fellow Power BI author has a question about the underlying model data, the answer is easy to reach. You’ll also improve impressions of data authenticity by implementing this label consistently across all content.

A Working Example

We’ve created a different analysis of the Washington State Immunization exemption data, where we also added a data vintage page. You can try it out below. Click the i Information button in the upper right of the screen to display the data vintage.

In Part 3, we’ll examine the problem of data integrity and how you can be sure the proper business rules for your organization have been applied to your data.

Have a question or comment? Feel free to post a comment below.

In Data We Trust? A multi-part series.

I had just finished the demo of Microsoft Power BI dashboards and reports that we had built for a client. I looked around the room and asked, “What do you think?” This proof of concept was meant to create excitement by showing what was possible with the client’s data. As we went around the table, people were generally thrilled. The last person’s feedback, however, caught me off guard. “It looks great, but how do I know I can trust the data?”

“It looks great, but how do I know I can trust the data?”

That question rattled around my brain for days. In the rush toward a data-centric future, clients weren’t asking if their data was trustworthy. I researched the problem further, finding some whitepapers and such on the topic, but no clear recommendations on how to address this issue.

Three Areas of Data Trust Issues

Data Authenticity

Is the data you are using authentic, in that it comes from a trusted, official data source? How do you know? Imagine your executives making decisions about your project portfolio based on a manually maintained spreadsheet instead of on data retrieved directly from their Project Management software.

A trusted data source is one aspect to consider. You also need to know whether that data source is actively managed. It’s one thing to have data from an official source, but if it came from a one-time extract rather than an ongoing process, the value declines quickly.

Lastly, is this data coming from the official system of record? Imagine getting project cost values from a non-accounting system. Is that system the official system of record for cost data? Many reporting solutions obscure the source of their data, making it impossible for end users to determine whether the source is the correct, official, and authoritative one.

Data Integrity

Data integrity is knowing that the agreed-upon business rules within your organization are consistently applied to your data. Integrity also looks at how directly you are pulling data from the official data source. Is the data being taken as-is from the official data source, or is it being derived? If it is derived, does the derivation follow officially defined internal business rules to achieve the outcome?

For example, say you are using Project Health from your project management system. Is the value following the standard Project Management Office definition for Project Health, or is it using some specialized logic that was perhaps created for a one-time analysis? How do you know? Deriving data is not bad if the process adheres to the established internal rules, but you have to have a way to gauge this.

Data Timeliness

When you go to the grocery store to buy fresh fruit or meat, the ability to assess food freshness is vital to your buying decision. Old fruit isn’t very appealing and can have detrimental health effects. The same could be said about old data.

With data, we should go through a similar assessment of freshness. Is this data representative of recent activity? This is not about when the data was last refreshed into the report, but rather when the data itself was last modified. Data that was modified this morning is likely to be more representative than data that was modified three weeks ago. How do you gauge the freshness of your data?

Next Up

In this series, we’ll explore each of these areas, first examining the challenges end users face in determining whether the data they are using is trustworthy. We’ll then explore a future where these issues are addressed. Lastly, we’ll present techniques to overcome each of these challenges, enabling you to address these trust issues in your own organization.

New SharePoint Modern Power BI Theme

If you are creating Power BI content that you are embedding into a SharePoint Modern page, you know that the default Power BI theme colors don’t match SharePoint’s colors. Not to worry, we’ve got you covered. 

Attached to this post, and also cross-posted to the https://Community.PowerBI.com Theme Gallery, is the Tumble Road SharePoint Modern theme. This theme uses the core Modern experience colors in SharePoint, ensuring your Power BI content will look great when embedded within SharePoint.

Download the zip file here.

Who owns new capabilities in your organization?

I recently participated in SharePoint Saturday – Charlotte, where I talked to many people about Power BI and how it could fit in within their own organizations.

I heard the refrain “No one really owns BI in our organization” many times along the way. I found this concerning. Many organizations have product owners but not organizational capability owners. This makes sense from a budgeting and management perspective, but it prevents the organization from leveraging the true value of its technology investment. I hear the same refrain when I talk about Yammer or Teams. If there’s no internal advocate for a capability, how will an organization ever adopt it successfully?

“No one really owns BI in our organization.”

Heard at SharePoint Saturday – Charlotte

Tool-Centric isn’t the way

A tool-centric management focus can lead to disappointing internal adoption of a tool. Support teams aren’t typically responsible for driving adoption of a tool, but rather for ensuring the tool is working. An internal advocate must be present to understand and drive the organizational change process, which assumes the company has both the appetite and the investment resources to make the behavior change.

I see great sums of money spent on licensing, but short shrift given to funding the necessary work of upgrading the organization. If you take a “build it and they will come” approach, many will never make the trip. It takes work and passion to figure out how to use a tool to make your day-to-day work better, and most people don’t have the time or bandwidth to do that work.

New technologies will require new approaches

As new capabilities like Artificial Intelligence, Business Intelligence, and Collaborative Intelligence come into the mainstream, note that these capabilities usually comprise several tools. They also require effort to weave them into the day-to-day workflow. As such, the old product-centric technology model isn’t going to work in this changing world. It’s time to rethink the approach now.

“If there’s no internal advocate for a capability, how will an organization ever adopt it successfully?”

This is beginning to happen at Microsoft as well. One Microsoft is an overarching message around capabilities. Microsoft’s Inner Loop-Outer Loop model comprises many tools for a few key scenarios. I’m hopeful that this is Microsoft’s first step toward communicating what they can do from a capability rather than a product perspective. For example, as a partner and a customer, I’d rather hear a consolidated message around better decision making than several separate Power BI/Cortana Analytics/Azure product pitches where I must figure out the “happy path” for myself. Let’s hope this trend continues.

Organizations need capability advocates for areas like Business Intelligence, Portfolio Management, Team Work, and many others. This role is necessary for thought leadership on where to invest in new technologies and how best to leverage these technologies to provide new capabilities or streamline existing efforts. Without this advocacy, it will be difficult to realize full value from your technology investment. The days of one tool to one capability are long in the rear-view mirror.

TIP: How to make your Power Query M code self-documenting

What were you thinking?

Do you remember exactly what you were thinking three months ago when you wrote that rather involved Power Query M statement? Me neither. So why suffer, when you can easily avoid the pain of code archaeology in Power BI? Piecing together a forgotten thought process around a particular problem can be a time-consuming and frustrating task.

We’ve all seen this

You’ve seen Power Query M code like the example below. Chances are, you’ve created M code like it as well. You’ll be looking at it in a few months, thinking to yourself: removed what columns?!? And then I change the type. Awesome, what type, and why did I do this?
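Here’s a hypothetical example of what that looks like; the source URL, list, and column names are invented for illustration:

let
  Source = SharePoint.Tables("https://yourtenant.sharepoint.com/sites/HR", [ApiVersion = 15]),
  Items = Source{[Title = "Employees"]}[Items],
  #"Removed Columns" = Table.RemoveColumns(Items, {"Attachments", "ContentTypeId"}),
  #"Renamed Columns" = Table.RenameColumns(#"Removed Columns", {{"custUserID", "User ID"}}),
  #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns", {{"User ID", type text}}),
  #"Renamed Columns 3" = Table.RenameColumns(#"Changed Type", {{"Title", "Employee Name"}})
in
  #"Renamed Columns 3"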

Leverage these features to hold your thoughts

I’ll take you through two features that you can use to document your thought process as part of your Power Query M code in Power BI. Your future self will love you for taking a few minutes to leverage these features. It also makes maintenance a breeze, as it becomes easy to figure out where your data is retrieved from and what each step does.

Discipline is key…

Please note, this can feel tedious to do. It is quite easy to create many transformations in a short amount of time. If you take a few minutes to do this while it is fresh in your mind, you’ll be better positioned if there are issues later or if you need to add functionality. The default names of the steps, like Renamed Columns 3, are not very helpful later.

How do I document my steps?

There are two techniques you can use to make your M code serve as documentation.

First, you can right-click on a transformation step, select Rename, and rename the step to whatever you want to document. For example, Renamed Columns above could be renamed to Renamed custUserID to User ID. This makes it very clear what was done.

Second, if you need more room to fully document a transformation step, right-click on it and select Properties. The Description field can be used for long-form documentation.
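Applied to the hypothetical query from earlier, the renamed steps become the variable names you see in the Advanced Editor, so the M itself reads as documentation (you can also add your own // comments there):

let
  Source = SharePoint.Tables("https://yourtenant.sharepoint.com/sites/HR", [ApiVersion = 15]),
  #"Connect to Employees list" = Source{[Title = "Employees"]}[Items],
  // Attachments and content type metadata aren't needed for reporting
  #"Removed SharePoint metadata columns" = Table.RemoveColumns(#"Connect to Employees list", {"Attachments", "ContentTypeId"}),
  #"Renamed custUserID to User ID" = Table.RenameColumns(#"Removed SharePoint metadata columns", {{"custUserID", "User ID"}}),
  #"Set User ID to text for matching" = Table.TransformColumnTypes(#"Renamed custUserID to User ID", {{"User ID", type text}}),
  #"Renamed Title to Employee Name" = Table.RenameColumns(#"Set User ID to text for matching", {{"Title", "Employee Name"}})
in
  #"Renamed Title to Employee Name"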

Finished Product

We started with this list of transformations on a SharePoint list.

After taking the time to rename each step with what and why, it now looks like this.

I hope you find this post useful! If you have any questions, post them below in the comments.

UPDATE: This technique works in Excel’s Power Query Editor as well.

How to sort your dataset columns alphabetically in Power BI

Power BI Accidental Tip

This tip came about by accident. I was working on a new class exercise for https://sharepointna.com when I came across an easy way to sort your dataset columns alphabetically.

While column search in the Query Editor makes this less necessary, it is very useful when working with DAX, where you are in the data grid scrolling left and right. Yes, you could use the field search or the field picker, but I prefer my dataset to be in a predictable order.

Caveat

This tip works right now with the March 2018 release of Power BI Desktop. There’s no guarantee it’ll work with future releases of Power BI Desktop.

OK, so how do I sort my dataset columns alphabetically?

  • From Power BI Desktop, click Edit Queries
  • On the Home tab in the Query Editor, click Choose Columns
  • Click the AZ button in the upper right and select Name
  • Click OK
  • Click Close and Apply.

Yes, that’s it. Now your dataset columns are sorted in alphabetical order. Normally I do this as my last transformation, after all others have been specified.
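If you prefer to express it in code, the same result is a single M step. PreviousStep below is a placeholder for whatever your query’s prior step is named:

let
  // PreviousStep stands in for the prior step of your query
  Source = PreviousStep,
  // Reorder the columns using an alphabetical sort of their names
  #"Sorted Columns Alphabetically" = Table.ReorderColumns(Source, List.Sort(Table.ColumnNames(Source)))
in
  #"Sorted Columns Alphabetically"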

Creating Power BI Report Backgrounds the easy way!

Are you still putting boxes around your Power BI graphics?

Stop boxing your data! Boxes add unnecessary visual noise and can make it hard to see your graphics. We present a better way to add style to your data story in Power BI.

These techniques enable you to add a graphical background to your Power BI reports, using the power of Microsoft PowerPoint! Nothing to buy or install. Graphical backgrounds provide flexibility in styling while reducing the effort needed to achieve a great-looking Power BI report.

I’ll take you through two separate ways you can implement these backgrounds and point out pros and cons to each approach. I’ll also show you what to consider when you have to support both web and mobile reports.

I hope you enjoy! Please leave your comments below.

If you are in Washington, DC at the end of March, 2018, join us for our Power BI workshop at SharePoint Fest DC!

Integrating CA PPM Data Warehouse security with Power BI RLS

Background

Tumble Road has partnered with CA Technologies over the last year to make their reporting more Power BI friendly. At CA World 17, we helped introduce attendees to the new OData connector for the CA PPM Data Warehouse.

The CA Need

The CA PPM Data Warehouse (DWH) has very rich security support at the user level and at the role level. However, when the OData connector was originally in beta, a need to replicate this security functionality within Power BI was noted. If Power BI could not consume the CA PPM DWH security model, it would be a non-starter for many companies.

The Power BI Solution

Row Level Security within Power BI was the answer to carrying forward the DWH security model. Power BI enables filtering of data based on the current Azure Active Directory (AAD) User ID. Since all Power BI users have to have an AAD User ID, part of the solution was at hand.

The CA PPM DWH has a security table that lists every entity a user can access. Power BI uses this information to secure the data.

However, CA’s solution does not use AAD so a mapping exercise may be needed between the CA User Name and the Microsoft User ID. The video below shows a very simple case where we are able to append the Office 365 domain name to the CA User Name to get the requisite AAD User ID.
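For that simple case, a sketch of the mapping in Power Query M looks like this. The SecurityTable query, USER_NAME column, and domain are placeholders for your own names:

let
  // SecurityTable is a placeholder for the CA PPM DWH security table loaded into Power BI
  Source = SecurityTable,
  // Build the AAD User ID by appending the Office 365 domain to the CA user name
  WithAADUserID = Table.AddColumn(
    Source,
    "AAD User ID",
    each Text.Lower([USER_NAME]) & "@yourtenant.onmicrosoft.com",
    type text)
in
  WithAADUserID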

Your particular situation may require a more extensive mapping exercise. Elaborate mappings can be done in Excel and imported into the model. The imported mapping data can then be combined with the Security table to produce the requisite AAD User IDs.

The Presentation

In the video below, I show you how to bring the Security table from the Data Warehouse into Power BI and use the data as the basis for applying Row Level Security within Power BI. This should restrict the data to only what the person is authorized to see.