Forum Discussion
YeuHarng
Sep 18, 2023 | Brass Contributor
encoding issues or character representation problems
Hi guys, I have a problem. I'm scraping a web page, and the details I get back come out as garbled characters instead of readable text.
I have already tried to convert the text, but it still doesn't work. Is there a solution you can suggest? The language of the details on the web page is Thai.
- UTF-8 should be able to convert Thai to 'normal' characters. Could you share a snippet of the webrequest?
- YeuHarng (Brass Contributor)
Harm_Veenstra here is the code. After converting the text, I need to write it to an Excel file.
$url = "https://ww2.kanchanaburi.go.th/personal_board//?page=1&limit=99999"

# Create a web request to fetch the HTML content and specify the character encoding
$headers = @{
    "Accept-Encoding" = "UTF-8" # Specify the correct encoding if needed
}
$response = Invoke-WebRequest -Uri $url -Headers $headers
$htmlContent = $response.ParsedHtml
$personBoxes = $htmlContent.getElementsByClassName("col-lg-12 person-box")

# Loop through each "col-lg-12 person-box" element
foreach ($personBox in $personBoxes) {
    $personName = $personBox.getElementsByClassName("d-flex flex-row row person-detail")[0]
    $personPosition = $personBox.getElementsByClassName("d-flex flex-row row person-detail")[2]

    if ($personName) {
        # Convert the inner text to UTF-8 encoding and print it
        $utf8EncodedText = [System.Text.Encoding]::UTF8.GetBytes($personName.innerText)
        $decodedText = [System.Text.Encoding]::UTF8.GetString($utf8EncodedText)
        Write-Host $decodedText
    }

    if ($personPosition) {
        $utf8EncodedText = [System.Text.Encoding]::UTF8.GetBytes($personPosition.innerText)
        $decodedText = [System.Text.Encoding]::UTF8.GetString($utf8EncodedText)
        Write-Host $decodedText
    }
}
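A note on the conversion step in the code above: calling UTF8.GetBytes and then UTF8.GetString on the same string is a round trip through one encoding, so it returns the text unchanged, garbled or not. Also, the Accept-Encoding header negotiates compression (gzip and so on), not the character set. One possible cause, which can't be confirmed from the thread alone, is that the page's UTF-8 bytes were decoded as ISO-8859-1 somewhere along the way. The sketch below simulates that case offline with a sample Thai word (an assumption, not data from the page) and shows the usual repair: get the original bytes back with the wrong encoding, then decode them as UTF-8.

```powershell
# Sample Thai text built from code points so the script file's own encoding
# doesn't matter. This is a hypothetical stand-in for a scraped value.
$original = -join ([char[]](0x0E2A, 0x0E27, 0x0E31, 0x0E2A, 0x0E14, 0x0E35))

$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('ISO-8859-1')

# Round trip through the same encoding, as in the code above: nothing changes.
$roundTrip = $utf8.GetString($utf8.GetBytes($original))

# Simulate the assumed failure: UTF-8 bytes interpreted as ISO-8859-1.
$garbled = $latin1.GetString($utf8.GetBytes($original))

# Repair: recover the raw bytes via ISO-8859-1, then decode them as UTF-8.
$repaired = $utf8.GetString($latin1.GetBytes($garbled))

Write-Host ("Round trip unchanged: " + ($roundTrip -ceq $original))
Write-Host ("Garbled differs:      " + (-not ($garbled -ceq $original)))
Write-Host ("Repaired matches:     " + ($repaired -ceq $original))
```

If that assumption holds for the real page, the same repair could be applied to innerText, or the raw bytes could be decoded directly with something like $utf8.GetString($response.RawContentStream.ToArray()). If the text was garbled by a different encoding, ISO-8859-1 would need to be swapped for that one.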
YeuHarng You could gather the information in a pscustomobject and write it to an Excel file, something like this:
# Loop through each "col-lg-12 person-box" element
$total = foreach ($personBox in $personBoxes) {
    $personName = $personBox.getElementsByClassName("d-flex flex-row row person-detail")[0]
    $personPosition = $personBox.getElementsByClassName("d-flex flex-row row person-detail")[2]

    if ($personName) {
        # Convert the inner text to UTF-8 encoding and print it
        $utf8EncodedText = [System.Text.Encoding]::UTF8.GetBytes($personName.innerText)
        $decodedText = [System.Text.Encoding]::UTF8.GetString($utf8EncodedText)
        [pscustomobject]@{
            Text = $decodedText
        }
    }

    if ($personPosition) {
        $utf8EncodedText = [System.Text.Encoding]::UTF8.GetBytes($personPosition.innerText)
        $decodedText = [System.Text.Encoding]::UTF8.GetString($utf8EncodedText)
        [pscustomobject]@{
            Text = $decodedText
        }
    }
}

#Check if the ImportExcel module is installed. Install it if not
if (-not (Get-Module -ListAvailable -Name ImportExcel)) {
    Write-Warning ("The ImportExcel module was not found on the system, installing now...")
    try {
        Install-Module -Name ImportExcel -SkipPublisherCheck -Force:$true -Confirm:$false -Scope CurrentUser -ErrorAction Stop
        Import-Module -Name ImportExcel -Scope Local -ErrorAction Stop
        Write-Host ("Successfully installed the ImportExcel module, continuing..") -ForegroundColor Green
    }
    catch {
        Write-Warning ("Could not install the ImportExcel module, exiting...")
        return
    }
}
else {
    try {
        Import-Module -Name ImportExcel -Scope Local -ErrorAction Stop
        Write-Host ("The ImportExcel module was found on the system, continuing...") -ForegroundColor Green
    }
    catch {
        Write-Warning ("Error importing the ImportExcel module, exiting...")
        return
    }
}

$total | Export-Excel -Path c:\temp\output.xlsx -AutoFilter -AutoSize
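The loop above emits the name and the position as separate rows under a single Text column. If one row per person is preferred, a possible variation is to build one object with both values. This is only a sketch: the sample data and the Name and Position property names are assumptions for illustration, and Export-Csv is used here because it ships with PowerShell.

```powershell
# Hypothetical stand-in for the scraped values. In the real script these would
# come from $personName.innerText and $personPosition.innerText.
$scraped = @(
    @{ Name = 'Person A'; Position = 'Position A' },
    @{ Name = 'Person B'; Position = 'Position B' }
)

# One object per person, so each person becomes one spreadsheet row
# with a Name column and a Position column.
$total = foreach ($item in $scraped) {
    [pscustomobject]@{
        Name     = $item.Name
        Position = $item.Position
    }
}

# Write to a temporary CSV file encoded as UTF-8.
$csvPath = Join-Path ([System.IO.Path]::GetTempPath()) 'output.csv'
$total | Export-Csv -Path $csvPath -NoTypeInformation -Encoding UTF8
```

With ImportExcel installed, the last line could instead pipe $total to Export-Excel exactly as in the reply above.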