SOLVED

Migrating HTML to Modern Pages - Looks great until page is edited, then content is lost


We are migrating HTML to modern pages using code like:

 

var page = context.Web.AddClientSidePage(pageName, true);
page.PageTitle = "some title";
string htmlContent = "<some html content from another system>";
ClientSideText textWebPart = new ClientSideText() { Text = htmlContent };
page.AddControl(textWebPart, -1);
page.Save(pageName);
page.Publish();

 

This actually works really well.  The HTML comes over just fine until someone edits the page.  All you have to do is edit the page from the SharePoint UI and immediately publish it and some content/markup gets removed.  I have seen entire HTML tables removed and inline styling removed.  Other content is just fine and remains untouched (including <img> tags).

 

I thought that the PnP team was working on ways to migrate things like wiki pages to modern pages. If so, they must be dealing with some of the same issues.  Are we taking the wrong approach?

 

cc: @VesaJuvonen

19 Replies
Haven't tested this, but I suspect that it is down to some html not being supported by the modern editor, or because some content is probably considered unsafe and it's being stripped out.
May be worth trying to add the same content via the classic UI and see if it works in the same way.
By adding it via code, you could potentially be avoiding some client side sanitizing that is applied before saving it?
Perhaps some content cleaning on client side that you avoid when using c#?
Sorry for the double reply; I got a message saying that the first one failed...

Yes, that is certainly the case.

 

I am finding that if I change:

 

<div class="{custom-class}">

<table {stuff1}>

{stuff2}

</table>

</div>

 

to:

 

<div class="canvasRteResponsiveTable">

<div class="tableWrapper">

<table title="Table" {stuff1}>

{stuff2}

</table>

</div>

</div>

 

then my tables are preserved.
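The wrapping described above could be automated along these lines. This is only a minimal sketch, not what the PnP transformation code does: the class name TableWrapper and the method WrapTables are made up for illustration, it uses a regular expression instead of a real HTML parser, and it assumes tables are not nested. It also wraps rather than replaces any existing outer div, and running it twice would wrap the tables twice.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (name is an assumption, not part of any PnP library).
public static class TableWrapper
{
    // Non-greedy match of one <table>...</table> element.
    // Nested tables are NOT handled by this simple pattern.
    private static readonly Regex TablePattern = new Regex(
        @"<table\b(?<attrs>[^>]*)>(?<body>.*?)</table>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string WrapTables(string html)
    {
        return TablePattern.Replace(html, m =>
        {
            string attrs = m.Groups["attrs"].Value;

            // Add title="Table" unless the table already carries a title attribute.
            if (!Regex.IsMatch(attrs, @"\btitle\s*=", RegexOptions.IgnoreCase))
            {
                attrs = " title=\"Table\"" + attrs;
            }

            // Wrap the table in the two divs that kept it alive through an edit.
            return "<div class=\"canvasRteResponsiveTable\"><div class=\"tableWrapper\">"
                 + "<table" + attrs + ">" + m.Groups["body"].Value + "</table>"
                 + "</div></div>";
        });
    }
}
```

You would call TableWrapper.WrapTables(htmlContent) before assigning the result to the Text property of the text web part.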

 

I have also found Bert Jansen has created the following which may help: https://github.com/SharePoint/PnP-Tools/blob/master/Solutions/SharePoint.Modernization/SharePointPnP...

 

I'd really like to know what other edges he has found...

The PnP page transformation engine does handle the transformation from wiki HTML to HTML that is compliant with the modern text editor. As you've noted, you can simply assign HTML and it will show, but for the HTML to survive an edit it must be compliant with what the text editor supports.

 

See https://github.com/SharePoint/PnP-Tools/blob/master/Solutions/SharePoint.Modernization/SharePointPnP... for a starting basis for your own transformator.

Thanks, @BertJansen.  I was unable to @ mention you before for some reason...

 

I had found the HtmlTransformator.cs so it's good to know I was moving along the right path.  Do you have a list of gotchas that the text editor does not handle?  I can infer this from the code, but some of it does not seem to be a problem for me.  For example, I don't appear to need a <P> just before an image.  The biggest one so far has been <table> elements.  That, and the fact that a bunch of inline CSS is just plain stripped, which I can't do much about other than find workarounds using the techniques the current editor uses.

 

You clearly have a lot of knowledge here so I'll take any other information you can provide.

 

Thanks!

Kirk

best response confirmed by Kirk Liemohn (Brass Contributor)
Solution

The easiest way to understand what's valid HTML is to create a piece of text using all the layout and formatting options you need. Once you have that, you can grab the page list item and look at the canvascontent1 field to obtain the generated HTML. You'll see that only a limited number of styles are supported, along with a fixed set of classes for color and size information. Anything else you use can be displayed initially but will be lost during edit.
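Once the generated HTML from that field is in a string, a small inventory of the class names and inline style properties it uses can serve as a starting allow-list for your own transformation. The following is a rough sketch under that assumption: CanvasHtmlInventory and its methods are hypothetical names, the regular expressions only look at double-quoted attributes, and how you read the field value is left to whatever API you already use.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (name is an assumption) that lists what the editor emitted.
public static class CanvasHtmlInventory
{
    // Collects every distinct class name found in class="..." attributes.
    public static SortedSet<string> ClassNames(string html)
    {
        var result = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match m in Regex.Matches(html, "\\bclass\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase))
        {
            string[] names = m.Groups[1].Value.Split(
                new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string name in names)
            {
                result.Add(name);
            }
        }
        return result;
    }

    // Collects every distinct CSS property name found in style="..." attributes.
    public static SortedSet<string> StyleProperties(string html)
    {
        var result = new SortedSet<string>(StringComparer.Ordinal);
        foreach (Match m in Regex.Matches(html, "\\bstyle\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase))
        {
            string[] declarations = m.Groups[1].Value.Split(
                new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string declaration in declarations)
            {
                int colon = declaration.IndexOf(':');
                if (colon > 0)
                {
                    result.Add(declaration.Substring(0, colon).Trim().ToLowerInvariant());
                }
            }
        }
        return result;
    }
}
```

Running both methods over the HTML of a reference page built in the editor, and then over the HTML you plan to import, gives two lists to compare; anything that appears only in the second list is a candidate for being lost on the next edit.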

Thanks, that helps.  This will be tedious.

This is an older thread, but if anyone is still here, there seems to be a bug in SharePoint modern pages when table content is added through the UI.  We added a small table of content to a page, saved it, and published it; when we went back to edit the page, the content disappeared.

 

In other words, it might not just be a valid/invalid HTML thing. There might be a bigger issue here.

If you are using a "<table>" element, then you need to wrap it with additional divs that have the "canvasRteResponsiveTable" and "tableWrapper" classes.  See my comment further above for more details.

 

Thanks for the reply!  We are editing the content using the WYSIWYG editor, not through HTML, so adding an additional div is not relevant in this particular case.  Whatever HTML is being generated by the OOTB editor is not able to be rendered in edit mode.  As a result, switching to edit mode and saving the page with no changes will result in lost content.

This looks like a bug: tables created using the editor should always stay editable by the editor. Can you open a support case for this? Alternatively, describe a repro here and I'll get the right folks to look at it.

I agree that it is a bug.  You clearly said it was added through the UI, but I missed that before.

Ugh, I was not able to reproduce this on a brand new page, so perhaps it's a combination of the web parts that are already on the page or perhaps it's the combination of styles that I chose at the time.  Pure speculation: maybe selecting multiple table cells and choosing a font size somehow caused bad HTML to be generated or something.

 

I copied the rendered content (not the code) from published page using ctrl+c, then used ctrl+v to paste the content back into the text webpart.  I am now able to successfully edit the page.  Weird.

 

Not quite reproduction steps, but additional info:

- the page is an intranet homepage, so it has a powerapp part, a yammer part, some links and text parts, etc.

- the text content I was struggling with was a Heading, a 4x4 HTML table, and some paragraph text.  The 4x4 HTML table had some font formatting all done through the UI, but I remember struggling with the interface to get the colors to stick.  For example, I would select some text, change it to Red, then select some OTHER text and was not able to change the color to Green until I clicked outside the editor box and let the page auto-save.

- I was able to reproduce the "disappearing content" bug consistently by restoring the version of the page

 

Thanks for the input in this thread! I agree I should open a support case.

 

 

I am seeing a similar issue with SharePoint Online: migrated HTML content is lost when editing and saving the page. I can roll back to the previous version and get the content back. I think some security feature is blocking content that is migrated programmatically through a script.

 

It appears to be either a bug in the product or a security feature that sanitizes the content after each edit and save. Interestingly, this is not an issue at first: when the page is created programmatically, SPO allows the page to be saved and published. Very strange.

 

It is like having a ticket that lets me board the train the first time, but once I get off I cannot board the same train again 🙂 Microsoft is really up to something, and there is little or no information available to diagnose the issue.

 

Please do contribute if you are facing a similar issue on Microsoft SharePoint Online and have any suggestions.

 

Thanks in Advance. Happy to chat further. 

 

 

My experience (now over 3 years more experience from my original post) is that HTML copied into a Text web part works but that if you edit/save the web part, styling at a minimum is lost (font family, font color, and more). The Text web part is a moving target as features change (3 years ago you couldn't add images to the Text web part, but now you can).

I'm facing the same issues now, using PnP.PowerShell to import HTML into text components.

 

Transformation Framework currently only supports SharePoint Classic > SharePoint Modern pages.

 

It would be good to know which HTML transformations it performs. Is there a reference?

 

*** Update: it looks like this is the transformation code

https://github.com/pnp/modernization/blob/master/Tools/SharePoint.Modernization/SharePointPnP.Modern...

 

Images

- <img src=""> = is stripped (after import and Edit+Publish)

 

HyperLinks

- <a href=""></a> = is stripped (after import and Edit+Publish)

- [[LINK | URL ]] = is transformed (after import and Edit+Publish)

Here's the latest code (https://github.com/pnp/pnpframework/blob/dev/src/lib/PnP.Framework/Modernization/Transform/HtmlTrans...) which should be fairly similar. When it comes to inline images: these were never fully supported by the Page Transformation tech, during transformation the default behavior was to create the images as separate image web parts and split the html in parts so the image web parts could fit in between. Now that inline images are supported on modern pages Page Transformation should get updated to support those, that's a two step approach. First the underlying pages API implementation needs to be updated and that's done (see https://pnp.github.io/pnpcore/using-the-sdk/pages-webparts.html#using-inline-images-in-text-parts). Next is updating the page transformation bits themselves, but that work is still pending.

@BertJansen 
It might be too late for you, but for anyone who's still facing the issue, this is how I solved it.
For images

// Markup modelled on what the modern Text web part generates for an inline image.
// Replace <Full Image Url> and <alt-text> with your own values.
// imageHtml is just an illustrative variable name; pass the string wherever you build the text content.
string imageHtml = $@"
    <div class=""imagePlugin"" style=""background-color:transparent;position:relative;"" data-alignment=""Center"" 
    data-imageurl=""<Full Image Url>"" data-uploading=""0"" 
    data-widthpercentage=""100"">
        <div style=""display: flex; flex-direction: column; position: relative; margin: 0px auto; width: 100%; align-items: center;"">
            <div class=""""  style=""position: relative;  outline: none;"">
                <div class="""" data-automation-id=""imageRead"">
                    <figure tabindex=""0"" role=""button"" class="""" >
                        <div class="""">
                            <div style="""" class="""">
                                <img style='width:100%'  alt='<alt-text>' data-sp-originalimgsrc=""<Full Image Url>""
                                src='<Full Image Url>' />
                            </div>
                        </div>
                    </figure>
                </div>
            </div>
        </div>
    </div>";

 

For tables

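// using HtmlAgilityPack here (plus System.Collections and System.Collections.Specialized)
// The idea is again the same: wrap each table in the extra HTML nodes that SP itself generates.
// The method name and the html parameter are placeholders; the original snippet was a fragment.
static string WrapTablesInFigure(string html)
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);
    // SelectNodes returns null when the document contains no tables.
    var tables = htmlDoc.DocumentNode.SelectNodes("//table");
    ListDictionary tableNodes = new ListDictionary();
    if (tables != null)
    {
        foreach (var tag in tables)
        {
            HtmlNode figureNode = HtmlNode.CreateNode("<figure class=\"table canvasRteResponsiveTable tableLeftAlign\" title=\"Table\"></figure>");
            var clonedTableNode = tag.Clone();
            clonedTableNode.AddClass("ck-table-resized");
            figureNode.AppendChild(clonedTableNode);
            // Use the indexer rather than Add so two identical tables do not throw on a duplicate key.
            tableNodes[tag.OuterHtml] = figureNode.OuterHtml;
        }
    }
    // Swap each original table for its wrapped version.
    foreach (DictionaryEntry tableNodeDict in tableNodes)
    {
        htmlDoc.DocumentNode.InnerHtml = htmlDoc.DocumentNode.InnerHtml.Replace(tableNodeDict.Key.ToString(), tableNodeDict.Value.ToString());
    }
    return htmlDoc.DocumentNode.InnerHtml;
}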