Extract # and @ from tweet

Copper Contributor

Hello,

I have a large dataset with tweets in them. These are displayed in a single cell as text. I want to extract all hashtags and tags in this tweet and display them in their own cells. For example:

"My bilateral programme will comprise meetings with PM @GakhariaGiorgi and FM @DZalkaliani as well with the chairman of the #Parliament of #Georgia " would give" - @GakhariaGiorgi @DZalkaliani #Parliament #Georgia . Preferably all in their own cell.

I have already tried the following code, but this gave me an error. https://www.mrexcel.com/forum/excel-...lumns-b-c.html

Any help would be very much appreciated.

2 Replies

@Marlonvandijk 

I tried that with Power Query. If we add function which extracts words between "@" or "#" and end of the word

(Source as table, Delimiter as text, ResultColumnName as text) as table =>

let
    EndWordCharacters = {" ",",",".",":",";","!"},

    //Source = Tweets,
    SplitByFirstDelimeter = Table.AddColumn(
        Source,
        "Custom",
        each List.RemoveFirstN( Text.Split([Tweets],Delimiter) ,1) ),
    CutAfterWordEnd = Table.AddColumn(
        SplitByFirstDelimeter,
        ResultColumnName,
        each List.Transform([Custom],
            each Delimiter &
                Text.Start(_,
                    if Text.PositionOfAny(_,EndWordCharacters) =-1
                    then Text.Length(_)
                    else Text.PositionOfAny(_,EndWordCharacters)
                )
        )
    ),
    ExtractValues = Table.TransformColumns(
        CutAfterWordEnd,
        {ResultColumnName,
        each Text.Combine( List.Transform(_, Text.From), ", "),
            type text
        }
    ),
    RemoveColumns = Table.RemoveColumns(
        ExtractValues,
        {"Custom"}
    )
in
    RemoveColumns

result could be received as

let
    Source = Excel.CurrentWorkbook(){[Name="Tweets"]}[Content],
    AddMentions = ExtractTags(Source,"@","Mentions"),
    AddHashtags = ExtractTags(AddMentions,"#","Hashtags")
in
    AddHashtags

in form of

image.png

If each mention/hashtag put into separate cell, when just split columns by comma in final result.

@Sergei Baklan

Hi Sergei,

I have used your code for a school project for the University of Amsterdam and in doing so I discovered a flaw that wasn't caught in your test-case example. I believe (with my limited knowledge) that the special characters defined in your EndWordCharacters variable weren't picked out by the code (I couldn't tell you why). I have replaced the strings in the list with Character.FromNumber() for all the characters I wanted the code to cut off the text at and now it works like a charm.

Thank you for providing the code I could work with.

Kind Regards,
Mees