Forum Discussion

Marlonvandijk's avatar
Marlonvandijk
Copper Contributor
Sep 23, 2019

Extract # and @ from tweet

Hello,

I have a large dataset with tweets in them. These are displayed in a single cell as text. I want to extract all hashtags and tags in this tweet and display them in their own cells. For example:

"My bilateral programme will comprise meetings with PM @GakhariaGiorgi and FM @DZalkaliani as well with the chairman of the https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=Parliament of https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=Georgia " would give" - @GakhariaGiorgi @DZalkaliani https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=Parliament https://www.mrexcel.com/forum/usertag.php?do=list&action=hash&hash=Georgia . Preferably all in their own cell.

I have already tried the following code, but this gave me an error. https://www.mrexcel.com/forum/excel-questions/672138-extracting-mentions-hashtags-column-columns-b-c.html

Any help would be very much appreciated.

2 Replies

  • SergeiBaklan's avatar
    SergeiBaklan
    Diamond Contributor

    Marlonvandijk 

    I tried that with Power Query. If we add function which extracts words between "@" or "#" and end of the word

    (Source as table, Delimiter as text, ResultColumnName as text) as table =>
    
    let
        EndWordCharacters = {" ",",",".",":",";","!"},
    
        //Source = Tweets,
        SplitByFirstDelimeter = Table.AddColumn(
            Source,
            "Custom",
            each List.RemoveFirstN( Text.Split([Tweets],Delimiter) ,1) ),
        CutAfterWordEnd = Table.AddColumn(
            SplitByFirstDelimeter,
            ResultColumnName,
            each List.Transform([Custom],
                each Delimiter &
                    Text.Start(_,
                        if Text.PositionOfAny(_,EndWordCharacters) =-1
                        then Text.Length(_)
                        else Text.PositionOfAny(_,EndWordCharacters)
                    )
            )
        ),
        ExtractValues = Table.TransformColumns(
            CutAfterWordEnd,
            {ResultColumnName,
            each Text.Combine( List.Transform(_, Text.From), ", "),
                type text
            }
        ),
        RemoveColumns = Table.RemoveColumns(
            ExtractValues,
            {"Custom"}
        )
    in
        RemoveColumns

    result could be received as

    let
        Source = Excel.CurrentWorkbook(){[Name="Tweets"]}[Content],
        AddMentions = ExtractTags(Source,"@","Mentions"),
        AddHashtags = ExtractTags(AddMentions,"#","Hashtags")
    in
        AddHashtags

    in form of

    If each mention/hashtag put into separate cell, when just split columns by comma in final result.

    • Mees_L's avatar
      Mees_L
      Copper Contributor

      SergeiBaklan

      Hi Sergei,

      I have used your code for a school project for the University of Amsterdam and in doing so I discovered a flaw that wasn't caught in your test-case example. I believe (with my limited knowledge) that the special characters defined in your EndWordCharacters variable weren't picked out by the code (I couldn't tell you why). I have replaced the strings in the list with Character.FromNumber() for all the characters I wanted the code to cut off the text at and now it works like a charm.

      Thank you for providing the code I could work with.

      Kind Regards,
      Mees

Resources