Forum Discussion

CharlotteHawkes's avatar
CharlotteHawkes
Copper Contributor
May 02, 2023
Solved

Complicated Automation Help - 11.5 million records to process!

Hi,

 

I’ve got a ton of data to process ie. 23 files with 500,000 rows each - the whole public Companies House People with Significant Control database basically. The files are .txt files where all the data is combined on one row. When opened as a csv the first 15 rows for eg.:

{"company_number":"01215438","data":{"address":{"address_line_1":"Whitley Road","address_line_2":"Benton","country":"United Kingdom","locality":"Newcastle Upon Tyne","postal_code":"NE12 9SZ","premises":"North Tyne Industrial Estate"},"etag":"cd98c02fda3e62ec599a8dc084e05eeac5bc0ceb","identification":{"legal_authority":"The Companies Acts1985-1989","legal_form":"Private Limited Company"},"kind":"corporate-entity-person-with-significant-control","links":{"self":"/company/01215438/persons-with-significant-control/corporate-entity/yTI91Al-YqVHU1kVppnCLaueRHg"},"name":"Dnr Limited","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent"],"notified_on":"2016-04-06"}}

{"company_number":"07347280","data":{"address":{"address_line_1":"Edale Road","address_line_2":"Hope","locality":"Hope Valley","postal_code":"S33 6SF","premises":"High Peak Hall","region":"Derbyshire"},"country_of_residence":"England","date_of_birth":{"month":9,"year":1966},"etag":"3606c7b89aeb73af5a875d6ae09a73cd45f2e021","kind":"individual-person-with-significant-control","links":{"self":"/company/07347280/persons-with-significant-control/individual/oEXryozAN9rKJn9qaI-jG7isXho"},"name":"Mr Sanjeev Shamsunder Madan","name_elements":{"forename":"Sanjeev","middle_name":"Shamsunder","surname":"Madan","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2016-04-06"}}

{"company_number":"08205142","data":{"address":{"address_line_1":"Union Way","locality":"Witney","postal_code":"OX28 6HD","premises":"The Old Chapel","region":"Oxfordshire"},"country_of_residence":"United Kingdom","date_of_birth":{"month":8,"year":1963},"etag":"db64d68592a629feb24f10294da13ac66c1298e7","kind":"individual-person-with-significant-control","links":{"self":"/company/08205142/persons-with-significant-control/individual/vf-ywwrx4VpU2E3avKPY64aIIuI"},"name":"Mr Timothy John Davies","name_elements":{"forename":"Timothy","middle_name":"John","surname":"Davies","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}

{"company_number":"08205142","data":{"address":{"address_line_1":"Union Way","locality":"Witney","postal_code":"OX28 6HD","premises":"The Old Chapel","region":"Oxfordshire"},"country_of_residence":"England","date_of_birth":{"month":6,"year":1964},"etag":"bf27df222dcb9a87f4e40d1b28cab1faa3b62a95","kind":"individual-person-with-significant-control","links":{"self":"/company/08205142/persons-with-significant-control/individual/L2kQiLEibxW0UWtgu_-2ms8DE5U"},"name":"Mrs Alison Davies","name_elements":{"forename":"Alison","surname":"Davies","title":"Mrs"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent"],"notified_on":"2016-04-06"}}

{"company_number":"SC272515","data":{"address":{"address_line_1":"2nd Floor","address_line_2":"24 Thistle Street","country":"Scotland","locality":"Aberdeen","postal_code":"AB10 1XD","premises":"Thistle House"},"country_of_residence":"United Kingdom","date_of_birth":{"month":7,"year":1964},"etag":"05a07ea7bc43ebe3a7dafc1717669b9bb3c7f25e","kind":"individual-person-with-significant-control","links":{"self":"/company/SC272515/persons-with-significant-control/individual/adW9tfROL3gqzfspUJbzDEODh_0"},"name":"Mrs Sandra Cooke","name_elements":{"forename":"Sandra","surname":"Cooke","title":"Mrs"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors","significant-influence-or-control"],"notified_on":"2016-04-06"}}

{"company_number":"09743458","data":{"address":{"address_line_1":"Chestall Road","address_line_2":"Cannock Wood","country":"England","locality":"Rugeley","postal_code":"WS15 4RB","premises":"3"},"country_of_residence":"England","date_of_birth":{"month":1,"year":1969},"etag":"f7aec84331e12a106e79e29f154b5b94118583ab","kind":"individual-person-with-significant-control","links":{"self":"/company/09743458/persons-with-significant-control/individual/InYc0pyzRPh8iy1omoT5xojL-lA"},"name":"Mrs Justine Margaret Keight","name_elements":{"forename":"Justine","middle_name":"Margaret","surname":"Keight","title":"Mrs"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent"],"notified_on":"2016-04-06"}}

{"company_number":"10348356","data":{"address":{"country":"England","locality":"Colne","postal_code":"BB8 0LP","premises":"11 Keighley Road"},"country_of_residence":"England","date_of_birth":{"month":10,"year":1977},"etag":"4969945abec902bb5133ea2690bbba81df0e97eb","kind":"individual-person-with-significant-control","links":{"self":"/company/10348356/persons-with-significant-control/individual/pb8bWsROb2pJgovO2wq5c9CFunc"},"name":"Mr Mirza Muhammad Asif","name_elements":{"forename":"Mirza","middle_name":"Muhammad","surname":"Asif","title":"Mr"},"nationality":"Pakistani","natures_of_control":["significant-influence-or-control"],"notified_on":"2016-08-26"}}

{"company_number":"09217153","data":{"address":{"address_line_1":"Church Street","country":"England","locality":"Market Harborough","postal_code":"LE16 7AA","premises":"4a","region":"Leicestershire"},"country_of_residence":"England","date_of_birth":{"month":3,"year":1968},"etag":"250d488d9b77165779f26a9f0a6b776bf44967aa","kind":"individual-person-with-significant-control","links":{"self":"/company/09217153/persons-with-significant-control/individual/kFSQqcm8d8c81QWzyxjmnSepsOw"},"name":"Mr Paul Alan Hull","name_elements":{"forename":"Paul","middle_name":"Alan","surname":"Hull","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","right-to-appoint-and-remove-directors"],"notified_on":"2016-08-26"}}

{"company_number":"10348357","data":{"address":{"address_line_1":"Selwyn Drive","country":"England","locality":"Broadstairs","postal_code":"CT10 2SW","premises":"46","region":"Kent"},"country_of_residence":"England","date_of_birth":{"month":7,"year":1990},"etag":"eb90e0151337fa49705db9054b15ad03d446abc0","kind":"individual-person-with-significant-control","links":{"self":"/company/10348357/persons-with-significant-control/individual/VLMdKg_Id8XMFe2gx56-Cnr4GT8"},"name":"Mrs Sian Stacey Morgan","name_elements":{"forename":"Sian","middle_name":"Stacey","surname":"Morgan","title":"Mrs"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent"],"notified_on":"2016-08-26"}}

{"company_number":"10348357","data":{"address":{"address_line_1":"Selwyn Drive","country":"England","locality":"Broadstairs","postal_code":"CT10 2SW","premises":"46","region":"Kent"},"ceased_on":"2022-11-01","country_of_residence":"England","date_of_birth":{"month":9,"year":1988},"etag":"94b06527f5057a046c524027ca556bc0a18c4204","kind":"individual-person-with-significant-control","links":{"self":"/company/10348357/persons-with-significant-control/individual/nFTnhzoKB2hQk7RAhBq8_D80aec"},"name":"Mr Simon James Morgan","name_elements":{"forename":"Simon","middle_name":"James","surname":"Morgan","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-50-to-75-percent"],"notified_on":"2016-08-26"}}

{"company_number":"SC329414","data":{"address":{"address_line_1":"Riversdale Crescent","country":"United Kingdom","locality":"Edinburgh","postal_code":"EH12 5QT","premises":"14"},"country_of_residence":"Scotland","date_of_birth":{"month":7,"year":1986},"etag":"62ddfc3d8960c483dedecdf20c856e7c7fabce79","kind":"individual-person-with-significant-control","links":{"self":"/company/SC329414/persons-with-significant-control/individual/LKTM0dmYZjuC69mYOC4EpvbS6jw"},"name":"Mr Robert De Rosnay Rennie Mitchell","name_elements":{"forename":"Robert","middle_name":"De Rosnay Rennie","surname":"Mitchell","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-25-to-50-percent","right-to-appoint-and-remove-directors"],"notified_on":"2016-04-16"}}

{"company_number":"06996214","data":{"address":{"address_line_1":"Raleigh Road","country":"England","locality":"Bristol","postal_code":"BS3 1QU","premises":"Gather Round","region":"Somerset"},"country_of_residence":"England","date_of_birth":{"month":11,"year":1986},"etag":"c18a7aafd3ff7a6ae59f79784cf1450d42584938","kind":"individual-person-with-significant-control","links":{"self":"/company/06996214/persons-with-significant-control/individual/QYnT0aINEbmWi6XN0qyJUhHw974"},"name":"Mr Dan Hinton","name_elements":{"forename":"Dan","surname":"Hinton","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent","right-to-appoint-and-remove-directors","right-to-appoint-and-remove-directors-as-trust","right-to-appoint-and-remove-directors-as-firm"],"notified_on":"2016-08-20"}}

{"company_number":"10348358","data":{"address":{"address_line_1":"Old Gloucester Street","country":"United Kingdom","locality":"London","postal_code":"WC1N 3AX","premises":"27"},"country_of_residence":"United Kingdom","date_of_birth":{"month":6,"year":1968},"etag":"86104a299b6ea52a70b94458e67cb473de42c40b","kind":"individual-person-with-significant-control","links":{"self":"/company/10348358/persons-with-significant-control/individual/CEkDp9FLzbZ_PBBjYNWatrgDaoU"},"name":"Mr Isaac Onibudo","name_elements":{"forename":"Isaac","surname":"Onibudo","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent","significant-influence-or-control"],"notified_on":"2016-08-26"}}

{"company_number":"04871790","data":{"address":{"address_line_1":"Conifer Grove","country":"England","locality":"Gosport","postal_code":"PO13 0TP","premises":"19"},"country_of_residence":"England","date_of_birth":{"month":7,"year":1964},"etag":"da61ba09a5c9be3349906095e667262960e39bc8","kind":"individual-person-with-significant-control","links":{"self":"/company/04871790/persons-with-significant-control/individual/Ny01yjawdR0m6HPnJTiIyf96n48"},"name":"Mr Raymond Alan Bruchhausen","name_elements":{"forename":"Raymond","middle_name":"Alan","surname":"Bruchhausen","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-75-to-100-percent","voting-rights-75-to-100-percent-as-trust","significant-influence-or-control"],"notified_on":"2016-08-19"}}

{"company_number":"06332575","data":{"address":{"address_line_1":"Bard Road","country":"England","locality":"London","postal_code":"W10 6TP","premises":"2-3"},"country_of_residence":"England","date_of_birth":{"month":6,"year":1951},"etag":"4724d0bb574811fe35ba9aad0459fda852add897","kind":"individual-person-with-significant-control","links":{"self":"/company/06332575/persons-with-significant-control/individual/fVv02cT9Ecv3aFnvyUiYNvu7rcM"},"name":"Mrs Parvin Warwick-Moore","name_elements":{"forename":"Parvin","surname":"Warwick-Moore","title":"Mrs"},"nationality":"British","natures_of_control":["right-to-appoint-and-remove-directors"],"notified_on":"2016-04-06"}}

 

I can split out the individual rows using text to columns delimited on “ to break down the text, but then each of the data sections fall into different columns if some rows don't have all the information. I need to process the data so the company name and year information is sorted under 2 columns called “Company Name” and “Year” ideally so I can then lookup and place the data in a different spreadsheet.

 

Considering the amount of data I have to sort is there a way to automate this? Either when separating the initial text lines or after splitting the columns?

 

  • CharlotteHawkes 

    Please check if there is confidential data in your post and remove all confidential data.

    =MID(A2,SEARCH("company_number",A2)+17,8)
    =MID(A2,SEARCH("year",A2)+6,4)

    If all the records have the same pattern you can try these formulas which are in cells B2 and C2 in the screenshot. The formulas can be filled down as required.

7 Replies

  • peiyezhu's avatar
    peiyezhu
    Bronze Contributor
    {"company_number":"07857645","data":{"etag":"b4f8ad62cab44de86dda35f25e1f255ca5b2d401","kind":"persons-with-significant-control-statement","links":{"self":"/company/07857645/persons-with-significant-control-statements/7NFOepiXAjxP9TYyVby08mBrdL8"},"notified_on":"2016-11-23","statement":"steps-to-find-psc-not-yet-completed"}} {"company_number":"08830194","data":{"ceased_on":"2017-09-19","etag":"104f8f1fd728ef1298c88b4f2fa1e64fa0f9b4f2","kind":"persons-with-significant-control-statement","links":{"self":"/company/08830194/persons-with-significant-control-statements/Q-AFtiQypQ9iPRE_shKFTRm-RVQ"},"notified_on":"2017-01-02","statement":"psc-exists-but-not-identified"}} {"company_number":"01470777","data":{"etag":"9675c30cef3c1b414f9b481953b01a24981cfc7e","kind":"persons-with-significant-control-statement","links":{"self":"/company/01470777/persons-with-significant-control-statements/KrHX0ZLhI4mXVeT1eUMTCh3FZgo"},"notified_on":"2016-12-31","statement":"no-individual-or-entity-with-signficant-control"}}

    Above is the text from http://download.companieshouse.gov.uk/en_pscdata.html.
    no "date_of_birth" found from psc-snapshot-2023-05-02_23of23.txt which downloaded from http://download.companieshouse.gov.uk/en_pscdata.html.

    can you share a file and show your expected result?
    • CharlotteHawkes's avatar
      CharlotteHawkes
      Copper Contributor
      Thanks for your reply, it's now solved with the first comment formulas.
  • Patrick2788's avatar
    Patrick2788
    Silver Contributor

    This is public data so there's no harm in posting this link:
    http://download.companieshouse.gov.uk/en_pscdata.html

    This may be a way to get cleaner copies of the files (if these are the ones you're looking for). The site says the files are available in JSON format but I downloaded the smallest one and it's a txt file (PowerQuery can handle both formats). It appears the columns are delimited cleanly.

    • CharlotteHawkes's avatar
      CharlotteHawkes
      Copper Contributor
      Hi Patrick, thanks for your response, yes that's where I sourced the data from - unfortunately I can't get the columns to delimit clearly as once delimited some rows have more "fields" of info than others so the columns of data don't line up cleanly. For eg. "year" can be in column P for some rows, column z for other rows etc. I'm ok on excel but not amazing so maybe it's how I'm delimiting? If you can tell me how you did it to format them cleanly that would be amazing.
  • CharlotteHawkes 

    Please check if there is confidential data in your post and remove all confidential data.

    =MID(A2,SEARCH("company_number",A2)+17,8)
    =MID(A2,SEARCH("year",A2)+6,4)

    If all the records have the same pattern you can try these formulas which are in cells B2 and C2 in the screenshot. The formulas can be filled down as required.

    • CharlotteHawkes's avatar
      CharlotteHawkes
      Copper Contributor
      Thanks, as the response below this is publicly available. Thanks for suggesting the formulas - the first one worked but unfortunately the second one didn't - I think because the records don't have all the fields so sometimes "year" can be the 8th field of data and sometimes the 10th for eg.
      • CharlotteHawkes's avatar
        CharlotteHawkes
        Copper Contributor
        Sorry, I have got your second formula to work - it just didn't have a year in the first record which is why I had the error! Thank you so much!

Resources