Recent Discussions
Data Warehousing using Apache Spark on Azure HDInsight
Hi Team, hope all are safe! This is my first project in Azure, and we are looking at developing a DW using Apache Spark on Azure HDInsight. In simple terms, we are currently trying to pick up files from SharePoint, run transformations using PySpark, and then load the data into an Azure SQL DB. Can someone help me with the queries below? 1) Can we connect Apache Spark or PySpark on Azure HDInsight to SharePoint to pick up files? 2) Can we implement the usual SCD1 or SCD2 logic using PySpark? Thanks in advance!
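For reference on question 2, SCD1 logic in PySpark usually amounts to an upsert of the dimension table; below is a minimal sketch only, with hypothetical table and column names (SCD2 would additionally carry effective-date and current-flag columns):

    # SCD1 sketch with hypothetical tables/columns: incoming rows simply overwrite matching keys.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scd1-sketch").getOrCreate()

    dim = spark.table("dw.dim_customer")            # existing dimension (hypothetical)
    stg = spark.table("staging.customer_updates")   # incoming changed rows (hypothetical)

    # Keep dimension rows whose key is absent from staging, then append the new versions.
    unchanged = dim.join(stg.select("customer_id"), on="customer_id", how="left_anti")
    result = unchanged.unionByName(stg.select(*dim.columns))

    # Write to a separate table to avoid reading and overwriting the same table in one job.
    result.write.mode("overwrite").saveAsTable("dw.dim_customer_scd1")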
Proposal for a new data structure that extremely reduces data sizes for data in which two item types have many-to-many relations

I propose a new data structure that reduces data sizes for data in which two item types have many-to-many relations. The proposed data structure introduces container variables that are related to many values of both item types, and these container variables record the many-to-many relations between them. The proposed data structure maintains data normalization and integrity and is independent of the indexing methods conventionally used for relational databases, allowing simultaneous use of both. When one item type has N items, the other item type has M items, and all of N are related to all of M, a conventional RDB requires N×M rows, whereas the proposed data structure requires N+M rows. When N=100,000 and M=10,000,000, the conventional RDB requires 1,000,000,000,000 rows, whereas the proposed data structure requires only 10,100,000 rows. For details, please see the journal article at https://www.iaiai.org/journals/index.php/IEE/article/view/589 or US Patent No. 11294961. In the patent, "upper item or item group index" is used as another name for the container variable.
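The row-count arithmetic above can be sanity-checked in a few lines; this only illustrates the counts under the all-related-to-all assumption, not the patented structure itself:

    # Row-count arithmetic only: all N items of type A relate to all M items of type B.
    N, M = 100_000, 10_000_000

    junction_rows = N * M      # conventional junction table: one row per (a, b) pair
    container_rows = N + M     # container encoding: each item stores one shared container id

    print(f"junction table rows:  {junction_rows:,}")    # 1,000,000,000,000
    print(f"container-style rows: {container_rows:,}")   # 10,100,000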
Accessing data from an outsourced third party service company

Hi, can someone please throw some light on this? We have some services outsourced to a third-party company, but we realised the data that goes into their process is very valuable, and we want to access it for different purposes. We are currently downloading certain reports from their web pages, and going forward they are planning to provide API endpoints to give us the required data in the form of multiple pre-defined reports. But I am wondering whether there is any other secure and feasible method by which we can get the current state of their entire database, filtered for our company, automatically refreshed every few hours? Thank you, KP
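For context, the report-endpoint approach described above could be polled on a schedule roughly as sketched below; the endpoint, token, and report names are placeholders, and error handling and secret management are omitted:

    # Hypothetical scheduled pull of vendor-defined reports (placeholders throughout).
    import time
    import requests

    BASE_URL = "https://vendor.example.com/api/reports"    # placeholder endpoint
    TOKEN = "<api token issued by the vendor>"              # placeholder credential

    def pull_report(name: str) -> bytes:
        resp = requests.get(f"{BASE_URL}/{name}",
                            headers={"Authorization": f"Bearer {TOKEN}"},
                            timeout=60)
        resp.raise_for_status()
        return resp.content

    while True:
        for report in ("orders", "inventory"):              # placeholder report names
            with open(f"{report}.json", "wb") as f:
                f.write(pull_report(report))
        time.sleep(4 * 60 * 60)                              # refresh every few hours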
Azure Internal Load Balancer for Windows Always On WSFC name on Azure VM

Hi Team, while configuring a SQL Always On setup on Azure VMs, we configure an Azure Internal Load Balancer with a probe port for the virtual IP, and we configure and connect to the SQL listener name by following the standard procedure (adding the listener name through Cluster Manager and running a few PowerShell scripts on the cluster to bind the probe port and virtual IP). As we do not use the Windows WSFC name for connecting to databases, do we need to configure an Azure Internal Load Balancer with a probe port for the Windows Always On WSFC virtual IP as well? What kind of issues might we face during failover if we have configured the internal load balancer for the SQL listener virtual IP but not for the WSFC virtual IP? Thanks. Best Regards, Anshul
Clarifying false assertions by Oracle sales about Oracle licensing on Azure constrained VMs

Recently, I received the following question from a customer... How much of a challenge would it be to defend against Oracle's claim that, for a constrained Standard_E96-24ds_v5 VM, we owe them licensing for 96 vCPUs instead of 24 vCPUs? I've been receiving questions of this sort more frequently these days, so I wanted to share advice on dealing with it.

Oracle's own documentation on public cloud licensing (HERE) states... For the purposes of licensing Oracle programs in an Authorized Cloud Environment, customers are required to count the maximum available vCPUs of an instance type as follows: Microsoft Azure: count two vCPUs as equivalent to one Oracle Processor license if multi-threading of processor cores is enabled, and one vCPU as equivalent to one Oracle Processor license if multi-threading of processor cores is not enabled.

Please note the highlighted word available, which means "able to be used or obtained; at someone's disposal" according to the Oxford dictionary.

Azure constrained VMs are explained HERE, including the following description... Azure offers certain VM sizes where you can constrain the VM vCPU count to reduce the cost of software licensing, while maintaining the same memory, storage, and I/O bandwidth. The vCPU count can be constrained to one half or one quarter of the original VM size. These new VM sizes have a suffix that specifies the number of active vCPUs to make them easier for you to identify.

So, constrained VMs in Azure offer the memory, storage limits, and I/O bandwidth associated with the larger number of vCPUs in the name, but the number of vCPUs actually available is the lower number in the name. For example, in the case of the above-mentioned Standard_E96-24ds_v5 VM instance type, the "96" represents the memory, I/O, and network resources normally associated with a 96-vCPU virtual machine, but it does not indicate that 96 vCPUs are available. Only 24 vCPUs are available with this instance type, and that is the count to be used when licensing Oracle. Referring to the Oracle licensing guidance above, these 24 vCPUs, each hyper-threaded by 2, represent 12 CPU cores, so the number of Oracle Processor licenses for this VM is 12.

As an interesting side note, according to the same Oracle documentation on licensing in public clouds (HERE)... When counting Oracle Processor license requirements in Authorized Cloud Environments, the Oracle Processor Core Factor Table is not applicable. Thus, the popular Oracle Processor Core Factor Table discount is available only on-premises and in Oracle Cloud, but not in Azure. This is the basis of another myth spread by Oracle sales teams: that Oracle Database is half as expensive in Oracle Cloud as in Azure. It has nothing to do with technology, performance, or the cost of resources; it is merely a discount that Oracle has reserved for itself.

Of course, for basic technical questions such as counting CPUs, there must be an empirical way to prove it one way or the other. Oracle is welcome to recommend any Linux or Oracle utility they prefer to count the number of vCPUs presented by a VM, but one good suggestion is the Linux lscpu command. Whatever count is returned by such a utility should determine the licensing count, of course.
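As a quick sanity check of that arithmetic, a few lines of Python run on the VM itself report the vCPUs the OS actually sees and apply the two-vCPUs-per-license rule quoted above (expected values for the constrained example are noted in comments; lscpu returns the same vCPU count):

    # Report the vCPUs this VM actually presents and apply Oracle's 2-vCPUs-per-license
    # rule for Azure with hyper-threading enabled.
    import os

    vcpus = os.cpu_count()          # 24 on a Standard_E96-24ds_v5, not 96
    licenses = (vcpus + 1) // 2     # 2 vCPUs = 1 Oracle Processor license
    print(f"vCPUs visible to the OS: {vcpus}")        # expected: 24
    print(f"Oracle Processor licenses: {licenses}")   # expected: 12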
In summary, please beware of Oracle sales personnel attempting to freelance with their own perspectives on licensing. Oracle sales personnel are not the most reliable source of such information, due to the obvious conflict of interest. Oracle's License Management Services (LMS) team provides the authoritative decisions on licensing. When anyone spreads misinformation about Oracle licensing, please click the Contact Oracle LMS button on the LMS home page to get the word from the folks who can provide the real answer.
Microsoft Azure: Routing manufacturing IoT Edge data between on-premises Purdue model levels via MQTT

Microsoft Azure IoT Hub provides out-of-the-box capabilities to send device-to-cloud messages directly into Azure for advanced logging/routing and for generating actions based on events occurring on the edge. However, many customers, for example in the manufacturing domain, adopt the Purdue Enterprise Reference Architecture (PERA) in their plant IoT implementations, and one of the frequent requirements is to allow Azure IoT Hub to send data to their internal MQTT brokers, especially to allow communication between Purdue Level 2 (Control Systems) and Level 4 (Business Planning). This scenario is not limited to the manufacturing domain, though. Although Azure IoT Hub itself supports MQTT endpoints for direct communication, it doesn't provide an out-of-the-box capability to post messages to customer-managed local MQTT brokers. In fact, the Azure IoT product group is working on BYOMB (Bring Your Own MQTT Broker), but it may take some time to fully bake this capability into the out-of-the-box experience.

It is interesting to note that routing IoT device messages to local ecosystems (on-premises), without reaching out to the Azure cloud, is becoming an increasingly popular data architecture pattern in manufacturing and many other industries. Most customers want this capability to generate actions and alerts locally; for example, manufacturing plants want to send an alert to SCADA (Supervisory Control And Data Acquisition) / HMI (Human Machine Interface) systems for immediate action without making a round trip to the Azure cloud. Provisioning MQTT brokers like eclipse-mosquitto is very common for this kind of need, so that a single alert can be fanned out to many subscribed systems, if necessary, to continuously serve an event-driven data architecture with improved decisions and business outcomes.

Recently, one of our manufacturing customers was looking to address this exact gap in Azure IoT data architecture solutions. While designing the solution, the customer wanted to leverage only Azure PaaS (Platform-as-a-Service) offerings available on the edge, which makes a lot of sense. Hence, the solution was developed using the Azure Functions PaaS service, which already supports deployment on the edge, and we chose Python as the language, one of the most widely adopted scripting languages these days. However, Azure Functions on the edge also supports C#/.NET, if you are a .NET shop. The step-by-step instructions and some learnings from the solution we created are documented in this GitHub repository.
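The core republish step can be sketched in a few lines of Python; this is an illustrative sketch only, assuming paho-mqtt 1.x, and the broker host, port, topic, and payload shape are placeholders for a customer-managed Mosquitto instance (the full pattern is in the repository mentioned above):

    # Republish a device message to a customer-managed Mosquitto broker on the plant network.
    import json
    import paho.mqtt.client as mqtt

    LOCAL_BROKER = "mosquitto.plant.local"    # placeholder on-premises broker
    TOPIC = "plant/line1/alerts"              # placeholder topic consumed by SCADA/HMI

    client = mqtt.Client()                    # paho-mqtt 1.x style constructor
    client.connect(LOCAL_BROKER, 1883, keepalive=60)

    def forward(message: dict) -> None:
        # Publish locally so Level 2 systems can react without a round trip to the cloud.
        client.publish(TOPIC, json.dumps(message), qos=1)

    forward({"deviceId": "press-07", "temperature_c": 92.4, "alert": "over_temp"})

From there, any number of plant systems can subscribe to the same topic on the broker to fan the alert out.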
Containerization and Machine Learning Service

One of my favorite open-source tools is Docker. It just makes sense for a lot of the work that I do, whether it's executing CLI commands so I can test experimental features without having to re-install the CLI each and every time, or working through labs in Jupyter notebooks, where I can build and maintain an environment in which I can run experiments against Azure Machine Learning Service (MLS) by simply changing the config.json file in my root folder. So how does all this work? Well, let me show you.

First, let's start out by connecting to a base image that I've built for the labs, using VS Code. I can go into the details of building that image at a later date, but for now I just want to share some of the flexibility of the image itself. Here's the command that I run:

    docker run -it -p 10000:8888 thejamesherring/labs:latest

This gives me output that looks like the following:

    Set username to: jovyan
    usermod: no changes
    Granting jovyan sudo access and appending /opt/conda/bin to sudo PATH
    Executing the command: jupyter lab
    [I 14:31:37.465 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.7/site-packages/jupyterlab
    [I 14:31:37.466 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
    [I 14:31:37.468 LabApp] Serving notebooks from local directory: /home/jovyan
    [I 14:31:37.468 LabApp] The Jupyter Notebook is running at:
    [I 14:31:37.468 LabApp] http://a30c0b11acd3:8888/?token=dc3db6e906dcb0d403bad05640cf492981105fb81ba2eb25
    [I 14:31:37.468 LabApp] or http://127.0.0.1:8888/?token=dc3db6e906dcb0d403bad05640cf492981105fb81ba2eb25
    [I 14:31:37.468 LabApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
    [C 14:31:37.471 LabApp] To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-18-open.html
    Or copy and paste one of these URLs:
        http://a30c0b11acd3:8888/?token=dc3db6e906dcb0d403bad05640cf492981105fb81ba2eb25
    or http://127.0.0.1:8888/?token=dc3db6e906dcb0d403bad05640cf492981105fb81ba2eb25

You'll notice that I re-routed port 8888 to port 10000, so I'll need to use that when connecting to the Jupyter environment, along with the ?token= parameter. So when I connect, it will look something like this:

    http://localhost:10000/?token=dc3db6e906dcb0d403bad05640cf492981105fb81ba2eb25

which gives me access to my notebooks. Now there are a few things I have to be mindful of: am I connected to the correct MLS environment, and are my lab files updated?

To handle the first one, I simply go to the root path of my JupyterLab and modify the config.json file that's located there:

    {
      "subscription_id": "<your azure subscription>",
      "resource_group": "<your resource group>",
      "workspace_name": "<your MLS workspace name>"
    }

After completing this task, I can simply open a terminal, navigate to the path of my lab and instruction files, and issue a git pull, which ensures I have the latest files.

All done. I'm able to start writing experiments against my Azure compute targets from a local containerized Docker image, wherever I happen to be. I hope you found this information useful and are able to expand upon it and share your learnings with others.
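As a closing sketch, attaching to whichever workspace that config.json points at could look like this, assuming the azureml-core SDK is installed in the image; the experiment name here is made up:

    # Attach to the Azure ML workspace described by config.json in the lab root.
    from azureml.core import Workspace, Experiment

    ws = Workspace.from_config()    # walks up from the current directory to find config.json
    print(ws.name, ws.resource_group, ws.location)

    exp = Experiment(workspace=ws, name="container-lab-smoke-test")   # hypothetical name
    run = exp.start_logging()
    run.log("note", "submitted from the local Docker lab image")
    run.complete()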
Best, James H

Automating Geneva MDM & MDS queries

Here in Hybrid Networking, we need to implement data collection and management for network MDM metrics and MDS log data that extend what is available in Geneva. For data science projects, we need to automate scraping selected metrics and logs to create longitudinal datasets suitable for detailed model building. As such, we have some ad hoc connectors that use the available C# APIs for MDM/MDS to periodically scrape data and archive it in ADLS Gen2, making it accessible for processing in Spark. However, a standard solution would be preferable. The ADLS Gen2 + Spark (e.g., Databricks) combination, I believe, has clear advantages over the use of Cosmos (not Cosmos DB, but the legacy batch job system), for which a wider range of connectors is available. Has anyone else implemented automation tools for such a data pipeline?
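For the archive-and-process half of that pipeline, reading the scraped snapshots back out of ADLS Gen2 into Spark might look like the following sketch; the storage account, container, folder layout, and column names are placeholders, and authentication (e.g., a service principal configured on the cluster) is assumed to be handled separately:

    # Load archived metric snapshots from ADLS Gen2 into Spark for longitudinal analysis.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    path = "abfss://metrics@examplestore.dfs.core.windows.net/geneva/mdm/*/*.json"
    df = spark.read.json(path)

    daily = (df.withColumn("day", F.to_date("timestamp"))
               .groupBy("day", "metric_name")
               .agg(F.avg("value").alias("avg_value")))
    daily.show()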
Data Skills Transitions are like Sunsets and Sunrises

I was inspired by BuckWoodyMSFT's post on Monday, but it was not until I was walking the dog (Hudson) this morning that I got the hook I was looking for. My Mom always said, when it came to sunrises and sunsets, "Orange sky in morning, sailors take warning; orange sky at night, sailors delight." We can argue over the color of this sunrise (it is winter in Seattle, so any sun is good), but when it comes to data skills transitions, we data professionals (Data Architects, DBAs, Data Engineers, Data Scientists) need to heed the warnings. Cloud Data, AI, and BI offerings like Azure Synapse Analytics, Azure Machine Learning, Azure Data Factory, Azure Databricks, Azure Data Lake Storage, and Power BI have changed the game.

I started my data warehousing and decision support career back in 1993 and have seen many a product sunset: Metaphor Computer Systems, Sagent (a great ETL tool that is now somewhere in Pitney Bowes data integration), and Brio Technology. I moved to Seattle 23 years ago and used SQL Server 6.5, Sagent, and Brio to enable data analysts at University of Washington Physicians. I quickly migrated my data mart to SQL Server 7 in 1998. SQL Server had a major impact on the BI/DW industry (OLAP Services changed the market) and on my career. It mattered when I joined Business Objects as an OLAP champion back in 2000, and at WaMu, where a team I managed deployed SQL Server and BObj like crazy. I joined Microsoft in 2005, just before SQL Server 2005 launched. In 2005 I also attended my first TechReady (a Microsoft internal readiness event) at the Washington State Convention Center. It was an awesome readiness event all the way through TechReady 24 (that is 12 years, because these happened twice a year). We are at the sunrise of Winter Ready in Seattle this Monday, Feb 3rd. I can't wait to learn and skill up, because I can see the new data skills needed on the horizon.

So why this history lesson? If you are seasoned like me and nearing the sunset of your career, I implore you to batten down the hatches and get ready for the storm by upskilling. Businesses need your energy and experience. Many of you are well on your way. Don't let the newbies get all the opportunities for the new projects out there; the new generation of data pros needs us. For people early in their data careers, IMO you have an awesome future. I think SQL Server 2019 Big Data Clusters and the things it "spawned," like Azure Arc and Azure Synapse Analytics, will provide you with decades of learning, challenges, and employment. I have had a love/hate relationship with certifications but would recommend Microsoft Certified: Azure Data Engineer Associate; IMO, it will make us all better, and the more people that take it, the better the certification will be. But get your hours in using the Azure portal and the Power BI portal. Let's be Ready and Certified. May there be many sunrises and sunsets in your work and personal lives. Thanks for joining the Data Architecture Community!

Darwin Schweitzer | Education Cloud Solution Architect | US Education | darsch@microsoft.com | Twitter @DataSnowman | GitHub DataSnowman
Data and AI resources at https://github.com/Azure/carprice
Recent Blogs
- In most of the Data Science and AI articles, blogs, and papers I read, the focus is on a particular algorithm or math angle to solving a puzzle. And that's awesome - we need LOTS of those. ... (Nov 15, 2024)
- Our customers require daily refreshes of their production database to the non-production environment. The database, approximately 600 GB in size, has Transparent Data Encryption (TDE) enabled in produ... (Nov 13, 2024)