
BI meets data science in Microsoft Fabric


The modern enterprise is powered by data, bringing together information from across the organization and using business analysis tools to answer relevant questions. Those tools give access to real-time information and use historical data to predict future trends based on the current state of the business.

What’s essential to delivering that tooling is having a common data layer across the enterprise, bringing in many different sources and providing one place to query that data. A common data layer, or “data fabric,” gives the organization a baseline of truth that can be used to inform both short-term and long-term decision-making, powering both instantaneous dashboard views and the machine learning models that help identify both trends and issues.

Building up from the data lake

It wasn’t surprising to see Microsoft bring many of its data analysis tools together under the Microsoft Fabric brand, with a mix of relational and non-relational data stored in cloud-hosted data lakes and managed with lakehouses. Building on the open-source Delta table format and the Apache Spark engine, Fabric takes big data concepts and makes them accessible to both common programming languages and more specialized analytics tooling, like the visual data explorations and complex query engine provided by Power BI.

The initial preview releases of Microsoft Fabric focused on the data lakehouses and data lakes that are essential for at-scale, data-driven applications. A lot of heavy lifting will be needed to get your data estate into the requisite shape for this scale of project. It’s essential to get that data engineering complete before you start to build more complex applications on top of your data.

Adding data science to data engineering

While the Fabric service remains in preview, Microsoft has continued to add new features and tools. The latest updates address the developer side of the story, adding integration with familiar developer tools and services, features that go beyond the basics of a set of REST APIs. These new tools bring Fabric to data scientists, linking Power BI data sets to Azure’s existing data science platform.

Power Query in Power BI is one of the most important tools in Microsoft’s data analysis platform. Perhaps best thought of as an extension of the pivot table tools in Excel, Power Query is a way of slicing and dicing large amounts of data across multiple sources and extracting relevant data quickly and easily. Power Query’s own transformations are written in the M formula language; the key to Power BI’s analytical capabilities is DAX (Data Analysis Expressions), a formula and query language for data analysis that provides the tools needed to filter and refine data.

Then there is Microsoft Fabric’s new semantic link feature, which provides a bridge between this data-centric world and the data science tools provided by languages like Python, using familiar Pandas and Apache Spark APIs. By adding these new libraries to your Python code, you can use semantic link from inside notebooks to build machine learning models in AI tools like PyTorch. You can then use your Power BI data with any of Python’s many numerical analysis tools, allowing you to apply complex analysis to datasets.

That’s an important development, bringing data science into familiar development tools and frameworks, from both sides. You can use the semantic link to allow both teams to collaborate more effectively. The BI team can use tools like DAX to build their report datasets, which are then linked to the notebooks and models used by the data science team, ensuring that both teams are always working with the same data and the same models.

Using semantic link in Fabric workspaces

The semantic link Python API uses familiar Pandas methods. From those methods you can discover and list the datasets and tables created by Power BI, and read the contents of the tables. If there are associated measures you can write code to evaluate them, and then run DAX from your Python code.
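The discovery flow described above might look something like the following minimal sketch. It assumes the sempy.fabric function names list_datasets, list_tables, read_table, list_measures, and evaluate_measure as they appear in Microsoft's semantic link documentation; check them against the current release. The dataset, table, and measure names you pass in are your own, and column_ref is a hypothetical helper added here for illustration. The import sits inside the function because sempy is normally only available in a Fabric notebook.

```python
def column_ref(table, column):
    """Build a Table[Column] reference (hypothetical helper for group-by columns)."""
    return f"{table}[{column}]"


def explore_dataset(dataset, table, measure, groupby_columns):
    """Discover and read Power BI data; intended to run inside a Fabric notebook."""
    # Imported lazily: sempy is normally only present in a Fabric environment.
    import sempy.fabric as fabric

    datasets = fabric.list_datasets()         # datasets visible in the workspace
    tables = fabric.list_tables(dataset)      # tables in one dataset
    rows = fabric.read_table(dataset, table)  # table contents as a pandas-style frame
    measures = fabric.list_measures(dataset)  # measures defined in the dataset
    # Evaluate one measure, grouped by the given Table[Column] references.
    result = fabric.evaluate_measure(
        dataset, measure, groupby_columns=groupby_columns
    )
    return datasets, tables, rows, measures, result
```

In a Fabric notebook you might call explore_dataset with a dataset name, a table name, a measure name, and a list such as [column_ref("Customer", "State")].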

You can use standard Python tools to install the semantic link library, as it’s available from the Python Package Index (PyPI) and can be installed with pip. Once the library is installed in your Python environment, all you need to do is import sempy.fabric to access your Fabric-hosted data, then use it to extract data for use in your Python code. As you’re working inside the context of your Fabric environment, there’s no need for additional authentication beyond your Azure login. Once you’re in your workspace you can create notebooks and load data.
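As a rough illustration of the setup step, the sketch below checks whether the sempy package is importable before relying on it. The helper name semantic_link_available is hypothetical, and the package name semantic-link is the one I understand the library to be published under; confirm it against the package index before use.

```python
import importlib.util


def semantic_link_available():
    """Return True when the top-level sempy package can be found."""
    return importlib.util.find_spec("sempy") is not None


if semantic_link_available():
    print("sempy found; 'import sempy.fabric as fabric' should work in this session")
else:
    # In a notebook cell the install is usually done with the %pip magic.
    print("sempy not found; in a notebook cell, try: %pip install semantic-link")
```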

The semantic link package is a meta-package, containing several different packages that can be installed individually if you prefer. One useful part of the package is a set of functions that let you use Fabric data as geodata, letting you quickly add geographic information to your Fabric frames and use Power BI’s geographic tools in reports.

A useful feature for anyone working with semantic link in an interactive notebook is the ability to execute DAX code directly, using IPython’s interactive magic syntax. Much like writing Python code, you’ll need to install the library in your environment before loading sempy as an external module. You can then use the %%dax cell magic to run DAX queries and view the output. This approach works well for experimenting with Fabric-hosted data, where data analysts and scientists are working together in the same notebook.

DAX queries can be run directly from Python, with sempy’s evaluate_dax function. To use it, call the function with the name of the dataset and a string containing your query. You can then parse the resulting data object and use it in the rest of your application.
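A minimal sketch of that call pattern follows. The evaluate_dax name and its (dataset, query) argument order come from the description above; top_n_query and run_dax are hypothetical helpers, and the TOPN query is just one simple example of a DAX query string.

```python
def top_n_query(table, n):
    """Build a DAX query string that returns n rows of a table."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # DAX quotes table names with single quotes; an embedded quote is doubled.
    escaped = table.replace("'", "''")
    return f"EVALUATE TOPN({n}, '{escaped}')"


def run_dax(dataset, query):
    """Run a DAX query against a dataset; intended for a Fabric notebook."""
    # Imported lazily: sempy is normally only present in a Fabric environment.
    import sempy.fabric as fabric

    return fabric.evaluate_dax(dataset, query)
```

Inside Fabric, run_dax("My Dataset", top_n_query("Customer", 5)) would hand back a data frame you can process with the usual Pandas methods; the dataset and table names here are placeholders.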

Other tools in the semantic link package help data scientists validate data. For example, you can use a couple of lines of code to quickly visualize the relationships in a dataset. Again, this is a useful tool for collaborative working, as it’s possible to use this output to refine the selections made in Power BI, helping to ensure that the right queries are used to build the dataset we want to use. Other options include the ability to visualize the dependencies between the entities in your data, helping you refine the results of your queries and understand the structures of your datasets.
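The relationship check could be sketched as below. This assumes that sempy.fabric exposes list_relationships, that sempy.relationships exposes plot_relationship_metadata, and that the returned frame uses the column names "From Table", "From Column", "To Table", and "To Column"; all of these are assumptions drawn from my reading of Microsoft's documentation and should be verified. The relationship_edges helper is hypothetical and works on plain dictionaries, so it can be tried outside Fabric.

```python
def relationship_edges(rows):
    """Turn relationship rows into readable 'Table[Column] -> Table[Column]' strings."""
    edges = []
    for row in rows:
        source = f"{row['From Table']}[{row['From Column']}]"
        target = f"{row['To Table']}[{row['To Column']}]"
        edges.append(f"{source} -> {target}")
    return edges


def show_relationships(dataset):
    """List and plot a dataset's relationships; intended for a Fabric notebook."""
    # Imported lazily: sempy is normally only present in a Fabric environment.
    import sempy.fabric as fabric
    from sempy.relationships import plot_relationship_metadata

    relationships = fabric.list_relationships(dataset)
    for edge in relationship_edges(relationships.to_dict("records")):
        print(edge)
    # Returns a graph object that a notebook can render inline.
    return plot_relationship_metadata(relationships)
```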

A foundation for data science at scale

Finally, you’re not limited to Python notebooks. If you want to use big data tooling, you can work with both Power BI data and Spark data in a single query, as Power BI datasets are treated as Spark tables by Fabric. That means you can use PySpark to query across both Power BI data and Spark tables hosted in Fabric. You can even use Spark’s R and SQL tools if you prefer.
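A hedged PySpark sketch of that idea follows. The catalog class name and the spark.sql.catalog.pbi setting are taken from my recollection of Microsoft's semantic link Spark documentation and should be treated as assumptions to verify; pbi_table_sql and query_power_bi are hypothetical helpers, and the spark argument is the session object a Fabric notebook already provides.

```python
# Assumed catalog implementation class; verify against the Fabric documentation.
PBI_CATALOG_CLASS = "com.microsoft.azure.synapse.ml.powerbi.PowerBICatalog"


def pbi_table_sql(dataset, table, limit=10):
    """Build a Spark SQL statement that reads a Power BI table through the pbi catalog."""
    # Backticks allow dataset and table names that contain spaces.
    return f"SELECT * FROM pbi.`{dataset}`.`{table}` LIMIT {int(limit)}"


def query_power_bi(spark, dataset, table, limit=10):
    """Register the Power BI catalog and query a table; intended for a Fabric notebook."""
    spark.conf.set("spark.sql.catalog.pbi", PBI_CATALOG_CLASS)
    return spark.sql(pbi_table_sql(dataset, table, limit))
```

Because the result is an ordinary Spark DataFrame, it can be joined with lakehouse tables in the same session.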

There’s a lot happening in Microsoft Fabric, with new features being added to the service preview on a monthly cadence. It’s clear that the semantic link library is only the start of bridging the divide between data analysis and data science, making it easier for users to build data-driven applications and services. It will be interesting to see what Microsoft does next.

Copyright © 2023 IDG Communications, Inc.


